Sentiment analysis of trending tweets using Spark NLP and deep learning: a benchmark study of CNN vs transformer models

Chaimaa Benyamani, Hassan Badi, Imad Badi, Aziz Khamjane

Abstract


This study examines the implementation of several sentiment analysis models, covering their theoretical foundations, evaluation criteria, and main findings, and uses natural language processing methods to preprocess data collected from social media, the main source of information. The goal of this study is to advance sentiment classification by implementing Convolutional Neural Networks and Transformer-based architectures (BERT, DistilBERT, and RoBERTa), with Spark NLP used for preprocessing. The tasks are: to collect a dataset from Twitter/X, focusing on daily trending topics; to preprocess it with Spark NLP, an NLP library built on Apache Spark for scalable, distributed processing of text; to assign a polarity to each tweet with the lexicon-based VADER tool; to pretrain the TextCNN and Transformer-based models on the VADER labels under identical parameters; to fine-tune the models on manually annotated tweets; and to compare their strengths, weaknesses, and overall performance in sentiment classification, so as to guide model selection in resource-constrained settings. The methods used are Spark NLP pipelines, lexicon-based weak labeling, two-phase supervised learning of a Convolutional Neural Network (TextCNN) and Transformer-based models (BERT, DistilBERT, RoBERTa) with early stopping and learning-rate warm-up, and comparative evaluation metrics. The dataset is modest in size: 12,422 tweets, of which 3,475 are manually labeled. Apache Spark nevertheless enables distributed data processing and supports scaling to larger datasets. The results show that the Transformer-based models outperformed the TextCNN model in classification accuracy and robustness. RoBERTa achieved the highest accuracy (85%), followed by BERT and DistilBERT (84%) and TextCNN (72%). DistilBERT offers a good balance between predictive performance and computational efficiency. Conclusions. The scientific novelty of this study lies in integrating Spark NLP preprocessing with a benchmark of CNN and Transformer-based models trained on trending tweets using a hybrid two-phase learning strategy, which combines lexicon-based weak supervision with human annotation under the same experimental conditions, to provide methodological insights and support the selection of an appropriate model for sentiment classification.
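
The weak-labeling and training-control steps named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the ±0.05 cut-off on the VADER compound score is the conventional default for that tool and is assumed here, as are the three-class label set, the linear shape of the warm-up, and all names (`polarity_from_compound`, `warmup_lr`, `EarlyStopping`) and parameter values.

```python
from dataclasses import dataclass


def polarity_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER-style compound score in [-1, 1] to a polarity label.

    The 0.05 threshold and the neutral class are assumptions, not values
    reported in the abstract.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def warmup_lr(step: int, base_lr: float, warmup_steps: int) -> float:
    """Linear warm-up: ramp the learning rate up to base_lr, then hold it."""
    if warmup_steps <= 0:
        return base_lr
    return base_lr * min(1.0, (step + 1) / warmup_steps)


@dataclass
class EarlyStopping:
    """Stop once validation loss has not improved for `patience` epochs."""

    patience: int = 2
    best: float = float("inf")
    bad_epochs: int = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best:
            # New best loss: remember it and reset the counter.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Hypothetical compound scores and validation losses, for illustration only.
weak_labels = [polarity_from_compound(s) for s in (0.6, -0.3, 0.0)]
stopper = EarlyStopping(patience=2)
stop_flags = [stopper.should_stop(v) for v in (0.9, 0.7, 0.8, 0.75)]
```

In a two-phase setup of the kind described, the same warm-up and early-stopping controls would plausibly be applied in both phases, first on the weakly labeled tweets and then on the manually annotated subset, so that all models are compared under the same conditions.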

Keywords


Sentiment Analysis; Apache Spark NLP; social media; TextCNN; Bidirectional Encoder Representations from Transformers (BERT); DistilBERT; RoBERTa; Two-phase learning; Weak supervision

DOI: https://doi.org/10.32620/reks.2026.1.10
