Data pre-processing to increase the quality of optical text recognition systems

Konstantin Dergachov; Leonid Krasnov; Vladislav Bilozerskyi; Anatoly Zymovin

doi:10.32620/reks.2021.4.15


Abstract

The subject of this study is a modern concept for improving the quality of optical recognition systems by applying a user-selected set of algorithms for preprocessing document images. The research synthesizes algorithms that compensate for negative external influences (unfavorable geometry, poor lighting conditions when photographing, noise, etc.). The methods used follow a defined sequence of preprocessing stages: geometric transformation of the original images, processing with a set of filters, image equalization that increases contrast without raising the noise level, and binarization with adaptive thresholds to eliminate the influence of uneven illumination. The following results were obtained. A package of algorithms for preprocessing photographs of documents has been created; to extend the functionality of data identification, it also includes a face detection algorithm intended for subsequent face recognition. A number of service procedures are provided to make data processing convenient and to protect the information. In particular, interactive procedures for text segmentation with the option of anonymizing individual fragments are proposed, which helps preserve the confidentiality of the processed documents. The structure of the listed algorithms is described, and the stability of their operation under various conditions is investigated. Based on the results of the research, text recognition software was developed using the Tesseract version 4.0 optical character recognition (OCR) engine. The program, "HQ Scanner", is written in Python using the OpenCV library. An original technique for evaluating the effectiveness of the algorithms by the criterion of the maximum probability of correct text recognition has been implemented in software. Numerous examples of system operation and software testing results are provided. Conclusions. The results of the research provide a basis for developing software for cost-effective and easy-to-use OCR systems for commercial use.
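
The abstract does not define how the probability of correct text recognition is computed, so the following Python sketch is only one plausible reading of the criterion, not the authors' technique. It assumes that the share of reference characters matched by a standard-library sequence alignment can stand in for that probability, and that the preprocessing variant with the highest score is selected. The function names `correct_recognition_rate` and `best_variant`, the variant labels, and the sample strings are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, Tuple


def correct_recognition_rate(reference: str, recognized: str) -> float:
    """Estimate the share of reference characters recognized correctly.

    Assumption: characters matched by difflib's alignment, divided by the
    reference length, approximate the probability of correct recognition.
    """
    if not reference:
        # An empty reference is fully matched only by an empty OCR result.
        return 1.0 if not recognized else 0.0
    matcher = SequenceMatcher(None, reference, recognized, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(reference)


def best_variant(reference: str, ocr_outputs: Dict[str, str]) -> Tuple[str, float]:
    """Return the preprocessing variant whose OCR output scores highest."""
    scores = {
        name: correct_recognition_rate(reference, text)
        for name, text in ocr_outputs.items()
    }
    name = max(scores, key=scores.get)
    return name, scores[name]


if __name__ == "__main__":
    # Hypothetical ground truth and OCR results for two preprocessing variants.
    truth = "Invoice No 2021"
    outputs = {
        "raw": "lnvoice N0 2O21",
        "binarized": "Invoice No 2021",
    }
    print(best_variant(truth, outputs))
```

In a real pipeline, the dictionary values would come from running the OCR engine on each preprocessed image; a manually verified transcription of the document would serve as the reference.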

Keywords

optical character recognition (OCR); image original geometry transformation; filter algorithm; picture equalization and binarization; face detection algorithm; probability of correct text recognition; segmentation of texts and anonymization of their indiv




RADIOELECTRONIC AND COMPUTER SYSTEMS
