Multimodal detection of urban improvement indicators in public visual-text content using deep learning
References
Srinivasan, K., Raman, K., Chen, J., Bendersky, M., & Najork, M. WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning. arXiv, 2021, no. 2103.01913. DOI: 10.48550/arXiv.2103.01913.
Xu, P., Zhu, X., & Clifton, D.A. Multimodal Learning with Transformers: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, vol. 45, no. 10, pp. 12113-12133. DOI: 10.1109/TPAMI.2023.3275156.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. arXiv, 2020, no. 2010.11929. DOI: 10.48550/arXiv.2010.11929.
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. Learning Transferable Visual Models from Natural Language Supervision. Proceedings of the 38th International Conference on Machine Learning (ICML 2021), 2021, vol. 139, pp. 8748-8763. Available at: https://proceedings.mlr.press/v139/radford21a.html (accessed 06.11.2025).
Gan, Z., Li, L., Li, C., Wang, L., Liu, Z., & Gao, J. Vision-Language Pre-training: Basics, Recent Advances and Future Trends. arXiv, 2022, no. 2210.09263. DOI: 10.48550/arXiv.2210.09263.
Cheung, T.-H., & Lam, K.-M. Crossmodal bipolar attention for multimodal classification on social media. Neurocomputing, 2022, vol. 514, pp. 1-12. DOI: 10.1016/j.neucom.2022.09.140.
Kim, W., Son, B., & Kim, I. ViLT: Vision-and-Language Transformer without Convolution or Region Supervision. Proceedings of the 38th International Conference on Machine Learning (ICML 2021), 2021, vol. 139, pp. 5583-5594. Available at: https://proceedings.mlr.press/v139/kim21k.html (accessed 06.11.2025).
Li, J., Selvaraju, R. R., Gotmare, A. D., Joty, S. R., Xiong, C., & Hoi, S. C. H. Align Before Fuse: Vision and Language Representation Learning with Momentum Distillation. arXiv, 2021, no. 2107.07651. DOI: 10.48550/arXiv.2107.07651.
Mamyrbayev, O., Pavlov, S., Poplavskyi, O., Momynzhanova, K., Saldan, Y., Zhanegiz, A., Zhumagulova, S., & Zhumazhan, N. Hybrid neural architectures combining convolutional and recurrent networks for the early detection of retinal pathologies. Engineering, Technology & Applied Science Research, 2025, vol. 15, no. 4, pp. 25150-25157. DOI: 10.48084/etasr.11521.
Matsiievskyi, O., Mazurenko, R., Netreba, A., & Sapaiev, V. Application of neural networks to optimize distributed computing in cloud and edge environments. IEEE International Conference on Smart Information Systems and Technologies (SIST), 2025, pp. 805-809. DOI: 10.1109/SIST61657.2025.11139218.
Mamyrbayev, O., Wójcik, W., Pavlov, S., Alimhan, K., Poplavskyi, O., Aitkazina, A., Nykyforova, L.E., & Zhumazhan, N. Engineering, Technology & Applied Science Research, 2025, vol. 15, no. 5, pp. 26943-26951. DOI: 10.48084/etasr.12779.
Zhang, J., Huang, J., Jin, S., & Lu, S. Vision-Language Models for Vision Tasks: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, vol. 46, no. 8, pp. 5625-5644. DOI: 10.1109/TPAMI.2024.3369699.
Desai, K., Kaul, G., Aysola, Z., & Johnson, J. RedCaps: Web-curated image-text data created by the people, for the people. NeurIPS 2021 Datasets and Benchmarks Track, 2021, pp. 1-14. Available at: https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/e00da03b685a0dd18fb6a08af0923de0-Paper-round1.pdf (accessed 06.11.2025).
OpenAI. GPT-4 Technical Report. arXiv, 2024, no. 2303.08774. DOI: 10.48550/arXiv.2303.08774.
Liu, H., Yang, B., & Yu, Z. A Multi-View Interactive Approach for Multimodal Sarcasm Detection. Applied Sciences, 2024, vol. 14, no. 5, article no. 2146. DOI: 10.3390/app14052146.
Poplavskyi, O., Pavlov, S., Zhumazhan, N., Zhanegiz, A., Saldan, Y., Momynzhanova, K., & Wójcik, W. High-performance information technology for processing biomedical big data to enhance the accuracy of computer-aided decision support systems. Proceedings of SPIE, 2024, vol. 13400, article no. 134000E. DOI: 10.1117/12.3057444.
Solovei, O., Solovei, B., & Riabchun, Y. An approach to evaluate a classification model to predict a construction object’s state. CEUR Workshop Proceedings, 2024, vol. 3896, pp. 194-200. Available at: https://ceur-ws.org/Vol-3896/short6.pdf (accessed 06.11.2025).
Aftan, S., & Shah, H. A survey on BERT and its applications. Proceedings of the 2023 20th Learning and Technology Conference (L&T), 2023, pp. 161-166. DOI: 10.1109/LT58159.2023.10092289.
Pavlov, S. V., Kozhukhar, A. T., Titkov, S. V., Tretiak, I. V., & Nesterenko, V. A. Electro-optical system for the automated selection of dental implants according to their colour matching. Przegląd Elektrotechniczny – Electrical Review, 2017, vol. 93, no. 3, pp. 121-124. DOI: 10.15199/48.2017.03.28.
Ren, J. Multimodal Sentiment Analysis Based on BERT and ResNet. arXiv, 2024, no. 2412.03625. DOI: 10.48550/arXiv.2412.03625.
Poplavska, A., Vassilenko, V., Poplavskyi, O., & Casal, D. AI-Based Classification Algorithm of Infrared Images of Patients with Spinal Disorders. IFIP Advances in Information and Communication Technology, 2021, vol. 626, pp. 316-323. DOI: 10.1007/978-3-030-78288-7_30.
Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., & Kalenichenko, D. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2704-2713. DOI: 10.1109/CVPR.2018.00286.
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., & Stoyanov, V. Unsupervised Cross-lingual Representation Learning at Scale. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), 2020, pp. 8440-8451. DOI: 10.18653/v1/2020.acl-main.747.
Kukharchuk, V.V., Kazyv, S.S., Bykovsky, S.A., Wójcik, W., Kotyra, A., Akhmetova, A., Bazarova, M., & Weryńska-Bieniasz, S. Discrete wavelet transformation in spectral analysis of vibration processes at hydropower units. Przegląd Elektrotechniczny – Electrical Review, 2017, vol. 93, no. 3, pp. 65-68. DOI: 10.15199/48.2017.03.16.
Wang, B., Li, W., Bradlow, A., Watt, A., Chan, A.T.Y., & Bazuaye, E. Multi-stage multimodal fusion network with language models and uncertainty evaluation for early risk stratification in rheumatic and musculoskeletal diseases. Information Fusion, 2025, vol. 120, article no. 103068. DOI: 10.1016/j.inffus.2025.103068.
Yu, C., & Wang, Z. Cross-modal evidential fusion network for social media classification. Computer Speech & Language, 2025, vol. 92, article no. 101784. DOI: 10.1016/j.csl.2025.101784.
Dolhopolov, S., Honcharenko, T., Savenko, V., & Liashchenko, T. Construction Site Modeling Objects Using Artificial Intelligence and BIM Technology: A Multi-Stage Approach. IEEE International Conference on Smart Information Systems and Technologies (SIST), 2023, pp. 174-179. DOI: 10.1109/SIST58284.2023.10223543.
Dolhopolov, S., Honcharenko, T., Terentyev, O., Savenko, V., & Liashchenko, T. Multi-Stage Classification of Construction Site Modeling Objects Using Artificial Intelligence Based on BIM Technology. Proceedings of the 35th Conference of Open Innovations Association FRUCT, 2024, pp. 179-185. DOI: 10.23919/FRUCT61870.2024.10516383.
Chernyshev, D., Ryzhakova, G., Honcharenko, T., Petrenko, H., Chupryna, I., & Reznik, N. Digital Administration of the Project Based on the Concept of Smart Construction. Lecture Notes in Networks and Systems, 2023, vol. 495, pp. 1316-1331. DOI: 10.1007/978-3-031-08954-1_114.
Zhou, Q., Zhang, J., & Zhu, Z. Evaluating Urban Visual Attractiveness Perception Using Multimodal Large Language Model and Street View Images. Buildings, 2025, vol. 15, no. 16, article no. 2970. DOI: 10.3390/buildings15162970.
Liu, T., Chen, H., Ren, J., Zhang, L., Chen, H., Hong, R., Li, C., Cui, W., Guo, W., & Wen, C. Urban Functional Zone Classification via Advanced Multi-Modal Data Fusion. Sustainability, 2024, vol. 16, no. 24, article no. 11145. DOI: 10.3390/su162411145.
Cheng, M., Jin, H., Zhao, Q., Wang, Y., Wu, Y., Huang, S., & Yue, W. Deep learning for optimizing urban governance by “sensing-processing-responding” cycle: Recent advances, future prospects and challenges. Sustainable Cities and Society, 2025, vol. 135, article no. 106994. DOI: 10.1016/j.scs.2025.106994.
Sburlan, D.-F., & Bucos, M. A Multimodal Deep Learning Approach for Analyzing Content Preferences on TikTok Across European Technical Universities Using Media Information Processing System. Electronics, 2026, vol. 15, no. 6, article no. 1288. DOI: 10.3390/electronics15061288.
Dufitimana, E., Bizimana, J. P., Uwayezu, E., Gahungu, P., & Mugisha, E. Multimodal Deep Learning Framework for Profiling Socio-Economic Indicators and Public Health Determinants in Urban Environments. Urban Science, 2026, vol. 10, no. 4, article no. 177. DOI: 10.3390/urbansci10040177.
DOI: 10.32620/reks.2026.1.09