Syntactical method for reconstructing highly fragmented OOXML files

Maksym Boiko, Viacheslav Moskalenko

Abstract


A common task in computer forensics is recovering files that lack file system metadata. When file fragments must be located in unallocated space, file carving is the most widely used method, and it works well for unfragmented data. However, carving methods and the tools built on them are ineffective for recovering highly fragmented OOXML files, because they cannot reliably determine the correct order of the fragments. Techniques that reconstruct documents by analyzing words and phrases are also ineffective for fragmented OOXML documents: OOXML files are ZIP archives, so their data is stored on disk in compressed form. This paper proposes a syntactical method for reconstructing OOXML documents that relies on knowledge of the internal structure of this file type and is independent of document content. The paper describes the implementation of the reconstruction algorithm and the specifics of restoring certain types of local elements of the document. The algorithm was evaluated on the Govdocs1 and NapierOne datasets using 4096-byte data blocks, which correspond to the standard cluster size in several file systems. The experimental results confirmed the method's suitability for practical use: 82.97 % of files were recovered, comprising 34.38 % reconstructed completely, 0.43 % reconstructed except for at most the last 21 bytes, and 48.16 % reconstructed except for embeddings, which require other approaches. In the latter case, it is still possible to obtain a fully working document, although graphic images and other embedded content are not displayed. Because OOXML files store a CRC-32 checksum of the uncompressed data stream of each local element, the correctness and integrity of the recovered data can be confirmed unambiguously. At the same time, the method's effectiveness depends mainly on the data verification methods used when reconstructing local elements that occupy at least three clusters in the file. The method is therefore expected to be improved by developing new mechanisms for verifying XML elements.
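
The CRC-32 check mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' reconstruction algorithm. It only shows how a candidate ordering of 4096-byte clusters for one ZIP local element could be verified, and it falls back to a naive brute-force search over orderings. The helper names (make_element, split_clusters, verify_order, reconstruct) are hypothetical. The sketch assumes that the CRC-32 and the compressed size are stored in the local file header (no data descriptor) and that the cluster holding the header is already known.

```python
import struct
import zlib
from itertools import permutations

CLUSTER = 4096               # block size used in the abstract
LFH_SIG = b"PK\x03\x04"      # ZIP local file header signature
LFH_FMT = "<4sHHHHHIIIHH"    # fixed 30-byte part of the local file header


def make_element(name, data):
    """Build a synthetic local element (header + raw DEFLATE stream) for testing."""
    comp = zlib.compressobj(9, zlib.DEFLATED, -15)  # wbits=-15: raw DEFLATE, as in ZIP
    payload = comp.compress(data) + comp.flush()
    header = struct.pack(LFH_FMT, LFH_SIG, 20, 0, 8, 0, 0,
                         zlib.crc32(data), len(payload), len(data), len(name), 0)
    return header + name + payload


def split_clusters(blob):
    """Zero-pad to a multiple of CLUSTER and split into cluster-sized blocks."""
    blob += b"\x00" * (-len(blob) % CLUSTER)
    return [blob[i:i + CLUSTER] for i in range(0, len(blob), CLUSTER)]


def verify_order(clusters):
    """Return True if the clusters, in this order, inflate to data matching the stored CRC-32."""
    buf = b"".join(clusters)
    (sig, _ver, _flags, method, _time, _date, crc, csize, _usize,
     name_len, extra_len) = struct.unpack_from(LFH_FMT, buf, 0)
    if sig != LFH_SIG:
        return False
    start = 30 + name_len + extra_len
    payload = buf[start:start + csize]
    try:
        if method == 8:      # DEFLATE
            data = zlib.decompressobj(wbits=-15).decompress(payload)
        elif method == 0:    # stored without compression
            data = payload
        else:
            return False
    except zlib.error:
        return False         # a wrong order often breaks the DEFLATE stream itself
    return zlib.crc32(data) == crc


def reconstruct(first, others):
    """Naive search: try every ordering of `others` after the header cluster `first`."""
    for perm in permutations(others):
        candidate = [first, *perm]
        if verify_order(candidate):
            return candidate
    return None


if __name__ == "__main__":
    import random
    rng = random.Random(0)
    data = bytes(rng.getrandbits(8) for _ in range(14000))  # incompressible test data
    clusters = split_clusters(make_element(b"word/document.xml", data))
    scrambled = clusters[1:][::-1]
    print(reconstruct(clusters[0], scrambled) == clusters)
```

The brute-force search grows factorially with the number of clusters, which is one way to see why verification of intermediate data matters for local elements spanning three or more clusters. The CRC-32 only confirms a complete candidate after the whole element has been inflated.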

Keywords


digital forensics; data recovery; file carving; syntactical file carving; fragmentation; file reconstruction; Office Open XML; OOXML; DOCX file; ZIP archive; DEFLATE compression





DOI: https://doi.org/10.32620/reks.2023.1.14
