Multimodal detection of urban improvement indicators in public visual-text content using deep learning

Oleksandr Poplavskyi, Yuliia Riabchun

Abstract


The subject matter of the article is the automatic identification of user-generated content about urban infrastructure improvement and construction within very large streams of social media posts and group messages, with a focus on filtering actionable posts out of overwhelming background noise so that experts can see first what needs restoration or investment.

The goal is to design, implement, and validate a multimodal deep learning system that receives any image together with its accompanying text and decides whether the pair is relevant to city improvement topics such as roads, parks, buildings, lighting, cleanliness, accessibility, or playgrounds, thereby providing a practical content filter that reduces manual triage for municipal teams.

The tasks to be addressed are: to base training on a single large-scale public corpus rather than collecting new data; to define a clear relevant-versus-irrelevant decision target; to construct a multimodal neural architecture that learns from both the visual content and the written description; and to evaluate the approach against single-modality baselines under the same conditions.

The methods used rely on the Wikipedia-based Image Text dataset (WIT), which contains more than thirty-seven million image-text examples with entity-rich captions in many languages and is available for download on the Hugging Face platform, and on a fusion of a convolutional or vision-transformer backbone for images with a transformer-based encoder for text, which together form a unified classifier trained with standard supervised learning.

Conclusions. Experiments show that the proposed approach accurately pinpoints posts related to urban improvement and that the multimodal design clearly surpasses the single-modality baselines under the same conditions, which confirms that combining image evidence with textual context is advantageous for this filtering task and supports practical deployment as an automated screening stage.

Scientific novelty lies in applying multimodal deep learning to the specific problem of real-time content filtering in the urban planning domain; in framing the target as the detection of relevant civic information within public image-text communications using a single widely available corpus for training; and in demonstrating that a simple but principled fusion of computer vision and natural language processing can distill actionable information from very large volumes of online messages without the manual collection of new training data.
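
The methods paragraph above describes the classifier only at a high level: an image backbone fused with a text encoder into a single supervised classifier trained on WIT. The following sketch, in Python with PyTorch and the Hugging Face transformers library, shows one plausible reading of that design. The specific checkpoints (ViT-B/16 and multilingual BERT), the Hub dataset ID, the hidden sizes, and the concatenation-plus-MLP fusion head are illustrative assumptions, not the authors' published configuration.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer, ViTImageProcessor, ViTModel

class RelevanceClassifier(nn.Module):
    """Binary relevant/irrelevant classifier over (image, caption) pairs."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # Vision backbone and text encoder; each pools to a single vector.
        self.vision = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        self.text = AutoModel.from_pretrained("bert-base-multilingual-cased")
        fused = self.vision.config.hidden_size + self.text.config.hidden_size
        # Late fusion: concatenate the two embeddings, classify with an MLP.
        self.head = nn.Sequential(
            nn.Linear(fused, 512), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(512, num_classes),
        )

    def forward(self, pixel_values, input_ids, attention_mask):
        img = self.vision(pixel_values=pixel_values).pooler_output       # [B, 768]
        txt = self.text(input_ids=input_ids,
                        attention_mask=attention_mask).pooler_output     # [B, 768]
        return self.head(torch.cat([img, txt], dim=-1))                  # [B, 2] logits

# Scoring one hypothetical (image, caption) pair; in training, pairs and
# binary relevance labels would be drawn from WIT, e.g. streamed from the
# Hugging Face Hub (the dataset ID "wikimedia/wit_base" is an assumption).
from PIL import Image

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
proc = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
model = RelevanceClassifier().eval()

image = Image.new("RGB", (224, 224))  # stand-in for a post photo
text = "Pothole on the main road near the park entrance"
enc = tok(text, return_tensors="pt", truncation=True, padding=True)
pixels = proc(images=image, return_tensors="pt").pixel_values
with torch.no_grad():
    probs = model(pixels, enc.input_ids, enc.attention_mask).softmax(dim=-1)
print(f"P(relevant) = {probs[0, 1]:.3f}")

Concatenation is the simplest fusion consistent with the abstract's wording; the "attention mechanism" keyword below suggests the published model may instead use cross-attention between the two streams, which would replace the torch.cat step.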

Keywords


multimodal learning; urban improvement; social media; image-text classification; deep learning; attention mechanism; vision transformer


References


Srinivasan, K., Raman, K., Chen, J., Bendersky, M., & Najork, M. WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning. arXiv, 2021, no. 2103.01913. DOI: 10.48550/arXiv.2103.01913.

Xu, P., Zhu, X., & Clifton, D.A. Multimodal Learning with Transformers: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, vol. 45, no. 10, pp. 12113-12133. DOI: 10.1109/TPAMI.2023.3275156.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. arXiv, 2020, no. 2010.11929. DOI: 10.48550/arXiv.2010.11929.

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. Learning Transferable Visual Models from Natural Language Supervision. Proceedings of the 38th International Conference on Machine Learning (ICML 2021), 2021, vol. 139, pp. 8748-8763. Available at: https://proceedings.mlr.press/v139/radford21a.html (accessed 06.11.2025).

Gan, Z., Li, L., Li, C., Wang, L., Liu, Z., & Gao, J. Vision-Language Pre-training: Basics, Recent Advances and Future Trends. arXiv, 2022, no. 2210.09263. DOI: 10.48550/arXiv.2210.09263.

Cheung, T.-H., & Lam, K.-M. Crossmodal bipolar attention for multimodal classification on social media. Neurocomputing, 2022, vol. 514, pp. 1-12. DOI: 10.1016/j.neucom.2022.09.140.

Kim, W., Son, B., & Kim, I. ViLT: Vision-and-Language Transformer without Convolution or Region Supervision. Proceedings of the 38th International Conference on Machine Learning (ICML 2021), 2021, vol. 139, pp. 5583-5594. Available at: https://proceedings.mlr.press/v139/kim21k.html (accessed 06.11.2025).

Li, J., Selvaraju, R. R., Gotmare, A. D., Joty, S. R., Xiong, C., & Hoi, S. C. H. Align Before Fuse: Vision and Language Representation Learning with Momentum Distillation. arXiv, 2021, no. 2107.07651. DOI: 10.48550/arXiv.2107.07651.

Mamyrbayev, O., Pavlov, S., Poplavskyi, O., Momynzhanova, K., Saldan, Y., Zhanegiz, A., Zhumagulova, S., & Zhumazhan, N. Hybrid neural architectures combining convolutional and recurrent networks for the early detection of retinal pathologies. Engineering, Technology & Applied Science Research, 2025, vol. 15, no. 4, pp. 25150-25157. DOI: 10.48084/etasr.11521.

Matsiievskyi, O., Mazurenko, R., Netreba, A., & Sapaiev, V. Application of neural networks to optimize distributed computing in cloud and edge environments. IEEE International Conference on Smart Information Systems and Technologies (SIST), 2025, pp. 805-809. DOI: 10.1109/SIST61657.2025.11139218.

Mamyrbayev, O., Wójcik, W., Pavlov, S., Alimhan, K., Poplavskyi, O., Aitkazina, A., Nykyforova, L.E., & Zhumazhan, N. Engineering, Technology & Applied Science Research, 2025, vol. 15, no. 5, pp. 26943-26951. DOI: 10.48084/etasr.12779.

Zhang, J., Huang, J., Jin, S., & Lu, S. Vision-Language Models for Vision Tasks: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, vol. 46, no. 8, pp. 5625-5644. DOI: 10.1109/TPAMI.2024.3369699.

Desai, K., Kaul, G., Aysola, Z., & Johnson, J. RedCaps: Web-curated image-text data created by the people, for the people. NeurIPS 2021 Datasets and Benchmarks Track, 2021, pp. 1-14. Available at: https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/e00da03b685a0dd18fb6a08af0923de0-Paper-round1.pdf (accessed 06.11.2025).

OpenAI. GPT-4 Technical Report. arXiv, 2024, no. 2303.08774. DOI: 10.48550/arXiv.2303.08774.

Liu, H., Yang, B., & Yu, Z. A Multi-View Interactive Approach for Multimodal Sarcasm Detection. Applied Sciences, 2024, vol. 14, no. 5, article no. 2146. DOI: 10.3390/app14052146.

Poplavskyi, O., Pavlov, S., Zhumazhan, N., Zhanegiz, A., Saldan, Y., Momynzhanova, K., & Wójcik, W. High-performance information technology for processing biomedical big data to enhance the accuracy of computer-aided decision support systems. Proceedings of SPIE, 2024, vol. 13400, article no. 134000E. DOI: 10.1117/12.3057444.

Solovei, O., Solovei, B., & Riabchun, Y. An approach to evaluate a classification model to predict a construction object’s state. CEUR Workshop Proceedings, 2024, vol. 3896, pp. 194-200. Available at: https://ceur-ws.org/Vol-3896/short6.pdf (accessed 06.11.2025).

Aftan, S., & Shah, H. A survey on BERT and its applications. Proceedings of the 2023 20th Learning and Technology Conference (L&T), 2023, pp. 161-166. DOI: 10.1109/LT58159.2023.10092289.

Pavlov, S. V., Kozhukhar, A. T., Titkov, S. V., Tretiak, I. V., & Nesterenko, V. A. Electro-optical system for the automated selection of dental implants according to their colour matching. Przegląd Elektrotechniczny – Electrical Review, 2017, vol. 93, no. 3, pp. 121-124. DOI: 10.15199/48.2017.03.28.

Ren, J. Multimodal Sentiment Analysis Based on BERT and ResNet. arXiv, 2024, no. 2412.03625. DOI: 10.48550/arXiv.2412.03625.

Poplavska, A., Vassilenko, V., Poplavskyi, O., & Casal, D. AI-Based Classification Algorithm of Infrared Images of Patients with Spinal Disorders. IFIP Advances in Information and Communication Technology, 2021, vol. 626, pp. 316-323. DOI: 10.1007/978-3-030-78288-7_30.

Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., & Kalenichenko, D. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2704-2713. DOI: 10.1109/CVPR.2018.00286.

Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., & Stoyanov, V. Unsupervised Cross-lingual Representation Learning at Scale. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), 2020, pp. 8440-8451. DOI: 10.18653/v1/2020.acl-main.747.

Kukharchuk, V.V., Kazyv, S.S., Bykovsky, S.A., Wójcik, W., Kotyra, A., Akhmetova, A., Bazarova, M., & Weryńska-Bieniasz, S. Discrete wavelet transformation in spectral analysis of vibration processes at hydropower units. Przegląd Elektrotechniczny – Electrical Review, 2017, vol. 93, no. 3, pp. 65-68. DOI: 10.15199/48.2017.03.16.

Wang, B., Li, W., Bradlow, A., Watt, A., Chan, A.T.Y., & Bazuaye, E. Multi-stage multimodal fusion network with language models and uncertainty evaluation for early risk stratification in rheumatic and musculoskeletal diseases. Information Fusion, 2025, vol. 120, article no. 103068. DOI: 10.1016/j.inffus.2025.103068.

Yu, C., & Wang, Z. Cross-modal evidential fusion network for social media classification. Computer Speech & Language, 2025, vol. 92, article no. 101784. DOI: 10.1016/j.csl.2025.101784.

Dolhopolov, S., Honcharenko, T., Savenko, V., & Liashchenko, T. Construction Site Modeling Objects Using Artificial Intelligence and BIM Technology: A Multi-Stage Approach. IEEE International Conference on Smart Information Systems and Technologies (SIST), 2023, pp. 174-179. DOI: 10.1109/SIST58284.2023.10223543.

Dolhopolov, S., Honcharenko, T., Terentyev, O., Savenko, V., & Liashchenko, T. Multi-Stage Classification of Construction Site Modeling Objects Using Artificial Intelligence Based on BIM Technology. Proceedings of the 35th Conference of Open Innovations Association FRUCT, 2024, pp. 179-185. DOI: 10.23919/FRUCT61870.2024.10516383.

Chernyshev, D., Ryzhakova, G., Honcharenko, T., Petrenko, H., Chupryna, I., & Reznik, N. Digital Administration of the Project Based on the Concept of Smart Construction. Lecture Notes in Networks and Systems, 2023, vol. 495, pp. 1316-1331. DOI: 10.1007/978-3-031-08954-1_114.

Zhou, Q., Zhang, J., & Zhu, Z. Evaluating Urban Visual Attractiveness Perception Using Multimodal Large Language Model and Street View Images. Buildings, 2025, vol. 15, no. 16, article no. 2970. DOI: 10.3390/buildings15162970.

Liu, T., Chen, H., Ren, J., Zhang, L., Chen, H., Hong, R., Li, C., Cui, W., Guo, W., & Wen, C. Urban Functional Zone Classification via Advanced Multi-Modal Data Fusion. Sustainability, 2024, vol. 16, no. 24, article no. 11145. DOI: 10.3390/su162411145.

Cheng, M., Jin, H., Zhao, Q., Wang, Y., Wu, Y., Huang, S., & Yue, W. Deep learning for optimizing urban governance by “sensing-processing-responding” cycle: Recent advances, future prospects and challenges. Sustainable Cities and Society, 2025, vol. 135, article no. 106994. DOI: 10.1016/j.scs.2025.106994.

Sburlan, D.-F., & Bucos, M. A Multimodal Deep Learning Approach for Analyzing Content Preferences on TikTok Across European Technical Universities Using Media Information Processing System. Electronics, 2026, vol. 15, no. 6, article no. 1288. DOI: 10.3390/electronics15061288.

Dufitimana, E., Bizimana, J. P., Uwayezu, E., Gahungu, P., & Mugisha, E. Multimodal Deep Learning Framework for Profiling Socio-Economic Indicators and Public Health Determinants in Urban Environments. Urban Science, 2026, vol. 10, no. 4, article no. 177. DOI: 10.3390/urbansci10040177.




DOI: https://doi.org/10.32620/reks.2026.1.09
