Software analysis of scientific texts: comparative study of distributed computing frameworks

Serik Altynbek, Gabit Shuitenov, Madi Muratbekov, Alibek Barlybayev

Abstract


This study is motivated by the need to analyze scientific texts efficiently as the volume of published information grows. Its aim is to examine popular distributed computing frameworks for scientific text processing. An extensive analysis of the scientific literature was carried out to systematize the key features of the distributed frameworks Apache Flink, Apache Spark, and Apache Hadoop, with an in-depth focus on their application to scientific text analysis. The results describe the architectural features of each framework and highlight their strengths, such as high performance, scalability, and flexibility in data processing. Limitations, such as resource requirements and customization complexity, were also identified. The comparative analysis showed that Apache Flink and Apache Spark achieve high performance and scalability by performing in-memory computation, which increases processing speed and efficiency. Both support batch and streaming data processing and provide “exactly once” processing guarantees. Apache Hadoop, by contrast, has lower performance, as it mainly uses disk-based data processing. Apache Flink and Apache Spark also support several programming languages, such as Java, Scala, and Python, which gives developers flexibility. The results thus give researchers and engineers comprehensive information for choosing the most appropriate framework for the specific needs and objectives of their research. The practical significance of the study lies in providing information on the most suitable tools for analyzing scientific texts, which can contribute to more efficient data processing and accelerate scientific research in various fields.

Keywords


text analysis; Apache Flink; Apache Spark; Apache Hadoop; machine learning; big data

DOI: https://doi.org/10.32620/reks.2025.2.07
