
Using Machine Learning to Automate and Streamline Data Collection from Scanned Documents


by Dave Westbrook


While digital transformation has automated many processes, automating document management and data entry from scanned documents has been unreliable for a long time. However, the maturation of machine learning (ML) algorithms has enabled organisations and service providers to not only automate data collection but also to apply machine learning in a wider project context, such as processing large volumes of third-party documents to classify their relevance based on specific keywords using natural language processing (NLP).

This article will explore some of our work using machine learning, including the common issues and themes we see when using it to automate data collection.

Machine Learning TL;DR

Machine learning is an application of artificial intelligence (AI) that can be used to automate certain processes or tasks in a completely new way. It focuses on giving computers the ability to learn without being explicitly programmed, allowing them to build their own knowledge from existing information. ML is increasingly used by organisations across many verticals, but how exactly does it help with data collection?

Improving Data Collection with Machine Learning

Machine learning can be especially useful when only limited resources can be committed to a process that requires heavy data processing, such as scanning and classifying documents from multiple sources and entering the resulting data into one system. It is also an excellent substitute for manual data collection and entry when a high volume of documents needs to be scanned and entered. Let's take a look at some of the finer details:


Machine learning helps automate document classification, allowing organisations to classify and index their documents systematically based on keywords and metadata. It has been applied in many sectors, including law firms, medical labs and government agencies, where it sorts documents into categories according to their subject matter. Machine learning also plays a role in applying optical character recognition (OCR) to large volumes of scanned images of paper-based documents for large-scale digitisation projects. Correctly configured, ML can also tell us which parts of a scanned image are text, numbers or graphics, so that we don't have to label each section of a document by hand. We have found ML to be a fast and efficient way of processing large volumes of scanned documents.
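To make keyword-based classification concrete, here is a minimal Python sketch. It is a simple rule-based baseline rather than a trained model, and it is not the system described in this article: the categories, keywords and sample text are all hypothetical, and it assumes the text has already been extracted from the scan by OCR.

```python
import re
from collections import Counter

# Hypothetical categories and keywords; a real project would curate its own.
CATEGORY_KEYWORDS = {
    "legal": {"contract", "clause", "liability"},
    "medical": {"patient", "diagnosis", "dosage"},
    "finance": {"invoice", "revenue", "audit"},
}


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def classify(text):
    """Return the category with the most keyword hits, or None if there are none."""
    counts = Counter(tokenize(text))
    scores = {
        category: sum(counts[word] for word in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Stand-in for text produced by an OCR step.
ocr_text = "The patient was given a revised dosage after diagnosis."
print(classify(ocr_text))  # prints: medical
```

A trained classifier replaces the hand-written keyword sets with weights learned from labelled examples, but the input (extracted text) and output (a category) stay the same.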

Extended Learning

ML functionality can also be extended beyond a (relatively) straightforward automated data collection tool: it can give apps the ability to build their own knowledge based on what they are taught. Machine learning in this context can be performed in three steps. First, an algorithm is trained using seed data containing examples of pre-labelled records with relevant information (supervised learning). Second, the trained algorithm processes new, unlabelled information for classification or analysis, applying rules and logic it has inferred from the seed data rather than rules written by hand. Third, the algorithm is retrained on the newly collected data, building on what it learned in the previous step. Because it combines a small labelled seed set with a larger body of unlabelled data, this approach is usually described as semi-supervised learning (or self-training) rather than unsupervised learning, which works without any labels at all.

Apps will use this newly learned information to come up with rules and logic for classifying or analysing new data, which in turn can be applied to similar cases, creating an iterative feedback loop. Machine learning is not simply an automated data entry tool, but rather, it gives machines the ability to build their own knowledge based on what they are taught.
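The feedback loop described above can be sketched as a simple self-training routine. This is an illustrative sketch under stated assumptions, not our production pipeline: the word-count classifier, the 0.8 confidence threshold, the labels "secr" and "other", and the sample sentences are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = defaultdict(Counter)
    for text, label in examples:
        word_counts[label].update(tokenize(text))
    return word_counts


def predict(model, text):
    """Return (label, confidence); confidence is the winning label's share of word matches."""
    tokens = tokenize(text)
    scores = {label: sum(counts[t] for t in tokens) for label, counts in model.items()}
    total = sum(scores.values())
    if total == 0:
        return None, 0.0  # no known words, so no prediction
    best = max(scores, key=scores.get)
    return best, scores[best] / total


def self_train(seed, unlabelled, threshold=0.8, max_rounds=5):
    """Step 1: train on seed data. Step 2: label new texts. Step 3: retrain on confident labels."""
    labelled = list(seed)
    remaining = list(unlabelled)
    for _ in range(max_rounds):
        model = train(labelled)
        confident, uncertain = [], []
        for text in remaining:
            label, confidence = predict(model, text)
            if label is not None and confidence >= threshold:
                confident.append((text, label))
            else:
                uncertain.append(text)
        if not confident:
            break  # nothing new was learned this round
        labelled.extend(confident)
        remaining = uncertain
    return train(labelled), remaining


seed = [
    ("energy consumption in kwh", "secr"),
    ("greenhouse gas emissions in tonnes", "secr"),
    ("directors remuneration and bonuses", "other"),
    ("dividends paid to shareholders", "other"),
]
unlabelled = [
    "emissions intensity ratio per employee",
    "intensity ratio improved this year",
    "bonuses and dividends declared",
    "weather was pleasant",
]

model, leftover = self_train(seed, unlabelled)
print(predict(model, "intensity ratio"))
print(leftover)
```

In this toy data, the second unlabelled sentence shares no words with the seed set, so it can only be labelled after the first sentence has been absorbed in an earlier round. The same mechanism can also reinforce mistakes, which is why anything below the confidence threshold is left for human review.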

So we know that ML can be used to scan and process large volumes of documents and carry out wide-scale classifications based on keywords and metadata. It can also be expanded into a more complex system that keeps learning from new data without being explicitly programmed for each case, an approach that is already proving to be a powerful tool for numerous applications in many industries and verticals. But what about these applications? What is the use case for this deeper functionality?

Natural Language Processing (NLP)

Natural language processing is a specific application of machine learning which uses algorithms and probability models to classify and analyse human-written text, for example by extracting relationships based on given contexts. Datamango has used NLP to analyse annual financial reports, reading, recognising and classifying relevant sections within scanned documents available from Companies House under the Open Government Licence. NLP is needed in this context because, while the data that must be presented in financial reports is set by legislation, the language of directors' reports and similar disclosures is not.

At a very early stage in the processing workflow, and with only a relatively small amount of seed data, we used NLP to create automated rules for locating the required data where the information is not standardised. We looked to identify the salient aspects of the Directors' Reports for large or quoted companies relating to the Streamlined Energy and Carbon Reporting (SECR) requirements. Since this is relatively recent legislation, there is little consistency in formatting, making it the perfect candidate for NLP.

Side Note: Surely Financial Statements Should Already Be Digitised?

You'd think so, and you'd be (somewhat) correct:

"It is no longer acceptable for most companies to send either accounts or computations on paper or as a PDF."

UK Government XBRL Guide For Businesses

In short, companies should file financial statements digitally with HMRC, using the eXtensible Business Reporting Language (XBRL) data format and Inline XBRL (iXBRL). However, unlike HMRC, Companies House does not require iXBRL accounts, so filers have no legislative requirement to move away from postal submissions. Most significantly, the Big 4 accountancy firms, Deloitte, PricewaterhouseCoopers (PwC), Ernst & Young (EY) and KPMG, are acutely aware of the increased risk profile that making financial data freely available creates for large and quoted companies.

So that brings us neatly back to the need for machine learning...

Desired Outcomes

So how does machine learning work in practice? First, we used NLP to extract the desired data from thousands of scanned annual reports. Then, with this information as a basis for further training and analysis, we used ML algorithms to calculate the probability that a page is relevant to SECR requirements based on its content. Machine learning allowed us to build a streamlined and effective tool that identifies SECR-relevant pages in scanned documents and extracts the data.
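One common way to estimate the probability that a page is relevant from its content is a naive Bayes text classifier. The sketch below is an assumption for illustration only: the article does not state which algorithm was used, and the class name `PageRelevanceModel` and the tiny training pages are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class PageRelevanceModel:
    """Two-class multinomial naive Bayes with add-one (Laplace) smoothing.

    Both training lists must be non-empty.
    """

    def __init__(self, relevant_pages, other_pages):
        self.counts = {
            True: Counter(t for page in relevant_pages for t in tokenize(page)),
            False: Counter(t for page in other_pages for t in tokenize(page)),
        }
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        total_pages = len(relevant_pages) + len(other_pages)
        self.log_prior = {
            True: math.log(len(relevant_pages) / total_pages),
            False: math.log(len(other_pages) / total_pages),
        }

    def _log_joint(self, tokens, label):
        """Log of P(label) times the product of P(token | label) over known tokens."""
        counts = self.counts[label]
        denominator = sum(counts.values()) + len(self.vocab)
        score = self.log_prior[label]
        for token in tokens:
            if token in self.vocab:  # ignore words never seen in training
                score += math.log((counts[token] + 1) / denominator)
        return score

    def probability_relevant(self, page_text):
        """Return the estimated P(relevant | page text) as a value between 0 and 1."""
        tokens = tokenize(page_text)
        log_relevant = self._log_joint(tokens, True)
        log_other = self._log_joint(tokens, False)
        # Normalise in log space to avoid underflow on long pages.
        peak = max(log_relevant, log_other)
        relevant = math.exp(log_relevant - peak)
        other = math.exp(log_other - peak)
        return relevant / (relevant + other)


# Hypothetical training pages standing in for OCR output.
relevant_pages = [
    "energy consumption kwh greenhouse gas emissions",
    "emissions intensity ratio energy efficiency action",
]
other_pages = [
    "directors remuneration report",
    "dividends paid to shareholders",
]

page_model = PageRelevanceModel(relevant_pages, other_pages)
print(page_model.probability_relevant("Total greenhouse gas emissions and energy use"))
print(page_model.probability_relevant("Dividends and remuneration"))
```

Pages scoring above a chosen cut-off can be passed on for data extraction, while borderline scores are a natural place to involve a human reviewer.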

The document library is continually updated with new examples as they become available through the Companies House Free Company Data Product. We also supplemented the training database with information that our data engineers had previously entered manually into the data collection system, resulting in a highly accurate machine learning model that applies specifically to the required SECR data and is fully integrated into the existing project workflow.
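Supplementing a training set with manually entered records can be as simple as merging two sets of labels. The sketch below assumes, purely for illustration, that each document has an identifier and a single label and that manual entries should take precedence when the two sources disagree. The function name and document IDs are hypothetical.

```python
def merge_training_sets(model_labelled, manually_entered):
    """Combine two {document_id: label} mappings; manual labels win on conflict.

    Returns the merged mapping and a sorted list of document IDs where the
    two sources disagreed, so those cases can be reviewed.
    """
    merged = dict(model_labelled)
    conflicts = sorted(
        doc_id
        for doc_id, label in manually_entered.items()
        if doc_id in merged and merged[doc_id] != label
    )
    merged.update(manually_entered)
    return merged, conflicts


# Hypothetical labels from the model and from manual data entry.
model_labelled = {"doc-001": "relevant", "doc-002": "not_relevant"}
manually_entered = {"doc-002": "relevant", "doc-003": "relevant"}

merged, conflicts = merge_training_sets(model_labelled, manually_entered)
print(merged)
print(conflicts)  # prints: ['doc-002']
```

Tracking the conflicts is a deliberate choice here: disagreements between the model and a human are often the most informative examples to add to the next round of training.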

What's Next?

We are now working on automating and streamlining further manual processes through machine learning, such as digitising broader company data, including facility locations, from other publicly available datasets. The lessons learned have opened up several other possibilities for us to apply ML techniques to supplement our existing datasets and improve our service offering to new and existing clients.

As a direct result of changes in regulations around digital transformation, where businesses have little or no opportunity to influence the rules of the game, we have to think more creatively. Machine learning provides the ability to automate manual processes and take immediate action on new requirements without being involved in lengthy and costly consultation processes.

With properly constructed learning models, well-curated seed data, and reliable partners to implement them, there is no need for expensive software or expert, in-house knowledge of complex algorithms: simply feed the data into the system, collect the outputs and implement the decisions.

Could machine learning take you to the next level? Contact us today, or check out our machine learning project roadmap below.
