DocFlair - Data extraction from structured documents

What are semi-structured documents?

It is estimated that 80% of the documents and forms used daily are semi-structured. Their main features are the complexity of the data and a lack of predictability: it is not possible to fully predict what type of information each document will contain.

Semi-structured documents form a large and diverse category, so it is difficult to give a clear definition. The best way to introduce them is to compare them with structured documents, also called standard documents. Structured documents contain pre-set fields that each request a certain type of information. Every copy has the same structure and the same number of fields, always located in the same place on the page. Examples of structured/standard documents:

driver’s logbook;

personal training form regarding health and safety at the workplace;

invoice book;

medical prescription;

inventory book etc.

Structured documents are not problematic in terms of data extraction because the information is well organised, so we know beforehand what type of data each document contains. Consequently, the programs used for extracting data from structured documents are mature enough to ensure high efficiency. There are currently programs that extract data from different types of structured documents and process thousands of documents every day. The automated extraction procedure starts with opening the document in design mode and setting the position of each field. Once the positions are set, the program can extract the information from every structured document of that type, because it finds each field in the same pre-set position every time.

However, the documents used in financial and banking transactions, in legal procedures, in the notarial and administrative fields, as well as research documents, do not follow the same internal pattern of organising information. These are semi-structured documents. It is estimated that 80% of the documents and forms used daily are semi-structured, because the nature of the requested information means it cannot be organised into completely fixed categories. In fact, most of the data we come into contact with is either unstructured (such as web pages) or semi-structured (for example, online shops).
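The fixed-position procedure described above for structured documents can be illustrated with a minimal Python sketch. It assumes that an earlier OCR step has already produced words with page coordinates; the `Word` and `Zone` types, the `TEMPLATE` contents and the coordinates are all hypothetical and chosen only for illustration. This is not how DocFlair or any particular product is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One recognised word with the position of its top-left corner (hypothetical OCR output)."""
    text: str
    x: float
    y: float


@dataclass(frozen=True)
class Zone:
    """A rectangular area of the page that always holds the same field."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, word: Word) -> bool:
        return self.left <= word.x < self.right and self.top <= word.y < self.bottom


# Hypothetical template, set once "in design mode": field name -> fixed zone.
TEMPLATE = {
    "driver_name": Zone(100, 50, 300, 70),
    "date": Zone(400, 50, 500, 70),
}


def extract_fixed_fields(words, template):
    """Collect, for each field, the words that fall inside its pre-set zone."""
    result = {}
    for field, zone in template.items():
        inside = sorted((w for w in words if zone.contains(w)), key=lambda w: (w.y, w.x))
        result[field] = " ".join(w.text for w in inside)
    return result


page = [
    Word("Driver:", 40, 55),   # printed label, outside every zone
    Word("Ana", 110, 55),
    Word("Pop", 150, 55),
    Word("2024-05-01", 410, 55),
]
print(extract_fixed_fields(page, TEMPLATE))
# {'driver_name': 'Ana Pop', 'date': '2024-05-01'}
```

The approach works only as long as every document places its fields in exactly the same zones, which is the property that semi-structured documents lack.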

Unlike standard documents, semi-structured documents can have fields in different positions on the page. Some fields are always filled in (name, CNP, the Romanian personal identification number, etc.), while others are optional or sometimes left blank (for example, membership of a vulnerable group), and the number of rows in a table can vary. Moreover, some semi-structured documents have a variable number of pages. For example, a refinancing application can have a separate page for each credit, so some applications will have two pages and others more. The basic features of semi-structured documents are therefore the complexity of the data and the lack of predictability, as it is not possible to fully predict the type of information that each document contains.

That is why the programs that extract data from standard documents are not sufficient for semi-structured documents. Semi-structured documents require algorithms that first identify the position of each field and only then extract the data. For example, the delivery address on a purchase document may appear at the top of the page, in the middle, on the left or at the bottom. In every case, the program needs to identify it and record it as the delivery address.
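One simple way to locate a field whose position varies is to search for its printed label and read the text that follows it on the same line. The sketch below illustrates the idea for the delivery address example. It assumes OCR words with coordinates are already available; the `Word` type, the function names, the label text and the sample layouts are hypothetical, and real systems typically combine several, more robust techniques.

```python
from typing import NamedTuple, Optional


class Word(NamedTuple):
    """One recognised word with the position of its top-left corner (hypothetical OCR output)."""
    text: str
    x: float
    y: float


def group_into_lines(words, tolerance=5.0):
    """Group words whose vertical positions differ by at most `tolerance` into lines."""
    lines = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        if lines and abs(lines[-1][0].y - word.y) <= tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, restore left-to-right reading order.
    return [sorted(line, key=lambda w: w.x) for line in lines]


def find_value_after_label(words, label, tolerance=5.0) -> Optional[str]:
    """Return the text that follows `label` on its line, wherever that line is on the page."""
    tokens = label.lower().split()
    n = len(tokens)
    for line in group_into_lines(words, tolerance):
        texts = [w.text.lower() for w in line]
        for i in range(len(texts) - n + 1):
            if texts[i:i + n] == tokens:
                return " ".join(w.text for w in line[i + n:]) or None
    return None


# The same field at the top of one document and at the bottom of another.
top_layout = [
    Word("Delivery", 50, 40), Word("address:", 110, 40),
    Word("12", 180, 40), Word("Main", 200, 40), Word("St", 240, 40),
    Word("Total:", 50, 700), Word("100", 110, 700),
]
bottom_layout = [
    Word("Total:", 50, 40), Word("250", 110, 40),
    Word("Delivery", 300, 702), Word("address:", 360, 700),
    Word("7", 430, 701), Word("Oak", 450, 700), Word("Ave", 490, 700),
]

print(find_value_after_label(top_layout, "Delivery address:"))     # 12 Main St
print(find_value_after_label(bottom_layout, "Delivery address:"))  # 7 Oak Ave
```

Both layouts yield the delivery address even though it sits in a different part of the page, which a fixed-position template could not handle without a separate template for each layout.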

Last but not least, another challenge in processing a semi-structured document is that not all the information in it is relevant. The program needs to distinguish between relevant and irrelevant information. Traditional data processing solutions are not flexible and smart enough to extract data from documents with a variable structure without extensive resources being allocated to customising the program. Until recently, there were no solutions for semi-structured documents that were efficient in terms of both the accuracy of the extracted data and cost. DocFlair is a SaaS platform created to extract data from structured and semi-structured documents, and it reduces the time needed for data processing by a factor of up to five. DocFlair also makes data processing more efficient by extracting check marks from single-choice and multiple-choice fields.
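To give an idea of what check mark extraction involves, the sketch below uses a common simple heuristic: measure how much of a check box is covered by dark pixels and treat the box as ticked when that share exceeds a threshold. It is an illustration only and does not describe DocFlair's actual method. The binarised image (1 for a dark pixel, 0 for a light one), the box coordinates, the option names and the threshold of 0.2 are all assumptions.

```python
def fill_ratio(image, left, top, right, bottom):
    """Share of dark pixels (value 1) inside the rectangle [left, right) x [top, bottom)."""
    dark = sum(image[row][col] for row in range(top, bottom) for col in range(left, right))
    area = (bottom - top) * (right - left)
    return dark / area


def selected_options(image, boxes, threshold=0.2):
    """Return the names of the options whose check box is dark enough to count as ticked."""
    return [name for name, box in boxes.items() if fill_ratio(image, *box) >= threshold]


# Tiny binarised scan: a 4x4 box with an "X" on the left, a 4x4 box with one noise speck on the right.
image = [
    [1, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 1, 0, 0],
]
# Hypothetical box positions as (left, top, right, bottom).
boxes = {"yes": (0, 0, 4, 4), "no": (4, 0, 8, 4)}

print(selected_options(image, boxes))  # ['yes']
```

For a single-choice field the result would be expected to contain exactly one option, while a multiple-choice field may return several; the threshold keeps small specks from being read as ticks.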

In conclusion, semi-structured documents are a vast category of documents that share a particular internal organisation of information but vary, to a greater or lesser degree, from one document to the next. Automated data extraction from semi-structured documents is a challenge for software solutions, which need to identify the position, category and relevance of each piece of information before extracting it.