DocFlair - Data extraction from structured documents

Why do we (still) need data entry operators?

When companies need to digitalise structured or semi-structured documents, they usually look for a solution that is cost-efficient and also delivers accurate data. There are three options: manual entry, automation, and a mixed approach in which operators are assisted by a data extraction program. We discuss each option from the point of view of cost efficiency.

Manual data entry (exclusive use of operators)

Data entry operators are involved in the entire digitalisation process, from sorting and scanning the documents to data extraction and quality control. With experience or training, operators can become very efficient at data entry, but they can never guarantee complete accuracy. Even if operators kept the same data entry speed throughout the day, which is quite unrealistic, they would still make typing mistakes, and these tend to increase in the second part of the day because of fatigue. A single mistake can be very costly (for example, in a European funding project), which is why project managers usually have two operators enter the same data independently. This is followed by a random check performed by a third operator. The same data entry therefore requires three operators, so the strategies used to reduce the number of errors can themselves incur high costs.
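The two-operator entry followed by a random check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the sample values, and the helper names `compare_entries` and `sample_for_review` are hypothetical and do not describe any particular product.

```python
import random


def compare_entries(entry_a, entry_b):
    """Return the field names where the two operators' entries disagree."""
    fields = sorted(set(entry_a) | set(entry_b))
    return [f for f in fields if entry_a.get(f) != entry_b.get(f)]


def sample_for_review(document_ids, rate, seed=None):
    """Pick a random subset of documents for the third operator to check."""
    if not document_ids:
        return []
    rng = random.Random(seed)
    k = max(1, round(len(document_ids) * rate))
    return sorted(rng.sample(document_ids, k))


# Hypothetical entries typed by two operators for the same document.
operator_a = {"name": "Ana Pop", "amount": "1250.00", "date": "2015-03-12"}
operator_b = {"name": "Ana Pop", "amount": "1520.00", "date": "2015-03-12"}

# Fields that need to be resolved before the record is accepted.
mismatches = compare_entries(operator_a, operator_b)

# Ten percent of the batch goes to the third operator.
to_review = sample_for_review(list(range(100)), rate=0.10, seed=1)
```

The comparison only catches errors where the two operators disagree, and the random sample only covers a fraction of the documents, which is why this approach is expensive relative to the number of errors it finds.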

Automated data entry (using a data extraction program)

We have written about data extraction programs here. These programs recognise printed characters (OCR, Optical Character Recognition) and handwriting (ICR, Intelligent Character Recognition), and they extract check marks from boxes with single or multiple answers (OMR, Optical Mark Recognition). Some programs can detect whether a signature is present or absent, which is very helpful for contracts. Other programs can use the data from a scanned document to make connections with an external source. For example, they can read an invoice number and search the database for the client's name or the bank draft. But no program offers 100% accuracy: errors will inevitably occur, and an operator then has to intervene. In some cases, complete or almost complete automation is the optimal solution. For example, the United States Postal Service uses programs that read the postal code on letters, because operators could not cope with the huge volume of mail received in a single day. In other cases, when the cost of errors is too high, automation is not a viable solution. Note that even with full automation, only the data extraction itself is automated. Operators continue to deal with the other tasks, such as scanning the documents.
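The invoice example above can be illustrated with a minimal Python sketch. It assumes the external source is a simple in-memory dictionary, and the names `resolve_client` and `client_index` as well as the invoice numbers are hypothetical. A real program would query an actual database. When the extracted number is not found, for instance because a character was misread, the record is flagged for an operator.

```python
def resolve_client(invoice_number, client_index):
    """Look up the client for an extracted invoice number.

    Records that cannot be matched are flagged for an operator.
    """
    key = invoice_number.strip().upper()
    client = client_index.get(key)
    return {
        "invoice_number": key,
        "client": client,
        "needs_operator": client is None,
    }


# Hypothetical stand-in for the external database.
client_index = {"INV-1001": "Acme SRL", "INV-1002": "Beta SA"}

# A cleanly extracted number is matched automatically.
found = resolve_client(" inv-1001 ", client_index)

# A misread character (letter O instead of zero) has no match,
# so the record is sent to an operator.
missing = resolve_client("INV-10O1", client_index)
```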

Mixed option (operators assisted by a data extraction program)

The mixed option may be the most cost-efficient solution when the number of documents is high, the volume of data extracted from each document is large, and/or errors have significant consequences. The data extraction program assists the operators in several ways: it highlights the relevant content, cleans stains from the image, replaces a page that has not been scanned properly, extracts the check marks, and so on. The operators spend less time on each document, so their speed increases dramatically. They can enter more documents in one day, and errors are easier to correct if the program has a function that helps detect inconsistencies between operators. DocFlair enables automated checking of the documents in order to eliminate potential errors, so you no longer have to rely on a random check, which can identify only some of the errors.

In conclusion, a company can choose from three options for data extraction. After analysing the costs and benefits of each option, it can choose the solution that best suits its allocated budget. However, all the options require operators, because data extraction programs were designed to assist operators, not to replace their work. A program has difficulty handling exceptions and ambiguous cases, and it does not offer 100% accuracy. Operators will continue to supervise, correct and sometimes perform data entry.