Methods and algorithms for protecting information in optical text recognition systems

Konstantin Dergachov, Leonid Krasnov, Vladislav Bilozerskyi, Anatoly Zymovin

Abstract


The subject of the study. A concept for improving the performance of OCR systems is proposed. The improvement is achieved through the integrated use of special algorithms for preprocessing document images, an extended set of service functions, and advanced information protection techniques. Study objectives: to develop algorithms that compensate for unfavorable factors that corrupt the photographed text and reduce the efficiency of its recognition, such as imperfect lighting conditions during image capture, geometric deformation of images, and noise. A series of service procedures is also needed to ensure adequate data handling when viewing, converting, and storing the results in standard formats, and to make data exchange over open communication networks possible. Additionally, the information must be protected against unauthorized use at the data processing stage, and its transmission through communication channels must be concealed. Investigation methods and research results: algorithms for image preprocessing were developed and tested, namely geometric transformation of the captured image, noise correction with different filters, and image binarization with adaptive thresholds, which reduces the negative influence of uneven illumination across image regions. The software provides special services that simplify data processing and protect information. In particular, an interactive text segmentation procedure is proposed that allows fragments of the text to be anonymized and thus helps keep the processed documents confidential. The package for processing document photographs includes a face detection algorithm that identifies such information features; it can later be used in face recognition tasks.
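The effect of adaptive thresholds on unevenly lit images can be illustrated with a minimal sketch. It is not the HQ Scanner implementation: the function name `adaptive_threshold`, the window size `block`, and the constant `offset` are assumptions chosen for illustration, and the code uses plain Python lists instead of OpenCV so that it stays self-contained. OpenCV offers `cv2.adaptiveThreshold` for the same kind of local-mean binarization.

```python
def adaptive_threshold(gray, block=5, offset=10):
    """Binarize a grayscale image (list of rows, values 0..255).

    A pixel becomes white (255) if it is brighter than the mean of its
    block x block neighborhood minus `offset`, otherwise black (0).
    Both parameters are illustrative assumptions, not values from the paper.
    """
    h, w = len(gray), len(gray[0])

    # Integral image with a zero border: integral[y][x] holds the sum of
    # all pixels above and to the left of (y, x).
    integral = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += gray[y][x]
            integral[y + 1][x + 1] = integral[y][x + 1] + row_sum

    r = block // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        for x in range(w):
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            total = (integral[y1][x1] - integral[y0][x1]
                     - integral[y1][x0] + integral[y0][x0])
            count = (y1 - y0) * (x1 - x0)
            # Compare pixel * count with (mean - offset) * count to stay in integers.
            if gray[y][x] * count > total - offset * count:
                out[y][x] = 255
    return out


# Synthetic page: dark left half (background 50, stroke 20) and
# bright right half (background 200, stroke 150).
row = [50, 50, 20, 50, 50, 50, 200, 200, 200, 150, 200, 200]
image = [row[:] for _ in range(5)]
binary = adaptive_threshold(image)
```

A single global threshold of 100 would turn the whole dark half black and lose the stroke in the bright half, whereas the local threshold keeps both strokes dark against a white background away from the illumination boundary.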
After the text document is recognized, the resulting data are encrypted by generating a QR code, and steganography methods can keep the transmission of this information private. The structures of the algorithms are described in detail, and the stability of their operation under various conditions is investigated. As a case study, document text recognition software was developed using the Tesseract 4.0 optical character recognition engine. The program, named "HQ Scanner", is written in Python using the OpenCV library. An original technique for evaluating the efficiency of the algorithms by the criterion of the maximum probability of correct text recognition is implemented in the software. Conclusions. The study results can serve as the basis for developing advanced specialized software for easy-to-use commercial OCR systems.
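The abstract does not say which steganography method is used, so the following is only a sketch of one common option, least-significant-bit (LSB) substitution, swapped in here for illustration. The function names `embed_lsb` and `extract_lsb` and the 4-byte length header are assumptions. The cover is treated as a flat sequence of pixel bytes; in practice the payload could be the bytes of a QR code image or the recognized text itself.

```python
def embed_lsb(cover: bytes, payload: bytes) -> bytearray:
    """Hide `payload` in the least significant bit of each cover byte."""
    # A 4-byte big-endian length header lets the receiver know where to stop.
    data = len(payload).to_bytes(4, "big") + payload
    bits = [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]
    if len(bits) > len(cover):
        raise ValueError("cover is too small for this payload")
    stego = bytearray(cover)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & 0xFE) | bit  # overwrite only the lowest bit
    return stego


def extract_lsb(stego: bytes) -> bytes:
    """Recover a payload written by embed_lsb."""
    def read_bytes(start_bit: int, n: int) -> bytes:
        out = bytearray()
        for k in range(n):
            value = 0
            for j in range(8):
                value = (value << 1) | (stego[start_bit + 8 * k + j] & 1)
            out.append(value)
        return bytes(out)

    length = int.from_bytes(read_bytes(0, 4), "big")
    if 32 + 8 * length > len(stego):
        raise ValueError("length header exceeds cover capacity")
    return read_bytes(32, length)


# Stand-in for the pixel bytes of a cover image.
cover = bytes(range(256)) * 2
secret = "OCR text".encode("utf-8")
stego = embed_lsb(cover, secret)
recovered = extract_lsb(stego)
```

Each cover byte changes by at most 1, which is why LSB embedding is visually hard to notice. Note that LSB substitution hides the data but does not encrypt it, and it does not survive lossy recompression of the carrier image.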

Keywords


Optical character recognition; probability of correct text recognition; text segmentation; fragment anonymization; QR code; steganography algorithms


DOI: https://doi.org/10.32620/reks.2022.1.12
