Analysis of archive formats and program solutions for compression of text files

Artem Perepelitsyn, Alona Chepelevych, Andrii Litvinov

Abstract


The subject of this study is widely used archive formats for file compression, the features of their implementation, text compression rates, and the time required by existing programs on different platforms. The goal of the work is to simplify the choice of an archive format, and of the software for working with it, when compressing text files, taking into account time requirements, compression ratio, and open-source availability. The tasks are: to analyze the existing technologies and tools involved in the data archiving process; to analyze widely used archive formats and their important features; to perform an experimental study of compression parameters for a specific set of formats and software tools; and to propose steps for integrating archives into a system project while ensuring compatibility. In accordance with these tasks, the following results were obtained. The application of data compression in information storage systems is discussed. The archive formats available in Ubuntu are considered. A detailed analysis of widely used archive formats is performed. The features of the zip and rar formats for working with large files are analyzed. An experimental study of compression parameters for large reference text files is performed using ten combinations based on seven formats and four software tools. Compression parameters for text stored in the same archive formats but produced by different software tools are investigated. Recommendations for integrating archives into a system project while ensuring compatibility are proposed. The use of zpaq for compressing text information is proposed.
Conclusions. The scientific novelty of the obtained results is that the analysis and experimental study of existing archive formats simplify the decision on which archive format to use, based on the requirements for archiving time, compression ratio, and the availability of a software implementation for a specific platform. The research results make it possible to propose the open-source archive format zpaq for compressing text, or a set of project versions and documentation, to achieve a compression ratio that is twice as good as that of the rar format and two percent better than that of the 7z and txz formats.
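The two quantities compared in the experimental study, compression ratio and archiving time, can be illustrated with a minimal sketch. It is not the authors' test setup. It uses only the Python standard library, so it covers Deflate (the algorithm commonly used by zip and tar.gz) and LZMA in the xz container (as used by tar.xz), while rar and zpaq are not covered. The helper names `measure` and `compare`, the synthetic sample text, and the definition of compression ratio as original size divided by compressed size are assumptions made for illustration.

```python
import lzma
import time
import zlib


def measure(name, compress, data):
    """Return (name, ratio, seconds) for one compressor on one input.

    The ratio is assumed here to be original size / compressed size.
    """
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    return name, len(data) / len(packed), elapsed


def compare(data):
    """Run every standard-library codec below on the same input."""
    codecs = {
        # Deflate at maximum level, as commonly used by zip and tar.gz.
        "deflate": lambda d: zlib.compress(d, 9),
        # LZMA2 in the xz container at preset 9, as used by tar.xz.
        "xz": lambda d: lzma.compress(d, preset=9),
    }
    return [measure(name, fn, data) for name, fn in codecs.items()]


if __name__ == "__main__":
    # Hypothetical stand-in for a large reference text file.
    sample = b"The quick brown fox jumps over the lazy dog. " * 20000
    for name, ratio, seconds in compare(sample):
        print(f"{name:8s} ratio {ratio:10.1f}  time {seconds:.3f} s")
```

Highly repetitive synthetic text compresses far better than natural-language text, so the numbers printed by this sketch say nothing about the results reported in the article; a real comparison would read the reference files from disk and invoke each archiver under test.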

Keywords


archive formats; rar; zip; 7z; tar.gz; tar.xz; zpaq; archiving of data; text compression

DOI: https://doi.org/10.32620/aktt.2024.6.10