Real-time Arabic Video Captioning Using CNN and Transformer Networks Based on Parallel Implementation

https://doi.org/10.24237/djes.2024.17108

Authors

  • Adel Jalal Yousif, Department of Computer Engineering, University of Mosul, Mosul, Iraq
  • Mohammed H. Al-Jammas, College of Electronics Engineering, Ninevah University, Mosul, Iraq

Keywords:

Arabic video captioning, Parallel architecture, Deep learning, Video description, Real-time captioning

Abstract

Video captioning techniques have practical applications in fields such as video surveillance and robotic vision, particularly in real-time scenarios. However, most current approaches still exhibit limitations when applied to live video, and research has predominantly focused on English-language captioning. In this paper, we introduce a novel approach to live, real-time Arabic video captioning using deep neural networks with a parallel architecture implementation. The proposed model relies primarily on an encoder-decoder architecture trained end-to-end on Arabic text. A Video Swin Transformer and a deep convolutional network are employed for video understanding, while the standard Transformer architecture is used for both video feature encoding and caption decoding. Experiments conducted on the translated MSVD and MSR-VTT datasets demonstrate that the end-to-end Arabic model outperforms methods that translate generated English captions into Arabic. Our approach shows notable improvements over the compared methods, achieving CIDEr scores of 78.3 and 36.3 on the MSVD and MSR-VTT datasets, respectively. In terms of inference speed, the model achieves a latency of approximately 95 ms on an RTX 3090 GPU for a 16-frame temporal video segment captured online from a camera device.
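To make the pipeline described above concrete, the sketch below assembles a comparable encoder-decoder captioning loop in PyTorch: a Video Swin backbone produces a clip-level feature for a 16-frame segment, and a Transformer decoder greedily generates caption token ids, with the single-clip latency timed in the same spirit as the 95 ms measurement. This is a minimal sketch under stated assumptions, not the authors' released code: torchvision's swin3d_t stands in for the paper's Video Swin Transformer, the random 224x224 clip stands in for frames grabbed from a live camera, and the vocabulary size, model dimensions, and CaptionModel class are illustrative; the paper's additional CNN branch and Arabic tokenizer (e.g., one trained on the translated MSVD/MSR-VTT captions) are omitted.

```python
# Minimal sketch of a real-time captioning pass: Video Swin clip features
# feeding a Transformer decoder that emits caption token ids.
# Backbone choice, dimensions, vocabulary, and the dummy clip are assumptions.
import time
import torch
import torch.nn as nn
from torchvision.models.video import swin3d_t  # requires torchvision >= 0.14


class CaptionModel(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, max_len=30):
        super().__init__()
        # Video Swin backbone; dropping the classification head makes the
        # forward pass return a pooled 768-d clip feature.
        self.backbone = swin3d_t(weights=None)
        self.backbone.head = nn.Identity()
        self.visual_proj = nn.Linear(768, d_model)

        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.lm_head = nn.Linear(d_model, vocab_size)
        self.max_len = max_len

    @torch.no_grad()
    def caption(self, clip, bos_id=1, eos_id=2):
        """Greedy decoding for one clip of shape (1, 3, T, H, W)."""
        memory = self.visual_proj(self.backbone(clip)).unsqueeze(1)  # (1, 1, d_model)
        tokens = torch.tensor([[bos_id]], device=clip.device)
        for _ in range(self.max_len - 1):
            x = self.token_embed(tokens) + self.pos_embed[:, : tokens.size(1)]
            mask = nn.Transformer.generate_square_subsequent_mask(
                tokens.size(1)).to(clip.device)
            out = self.decoder(x, memory, tgt_mask=mask)
            next_id = self.lm_head(out[:, -1]).argmax(-1, keepdim=True)
            tokens = torch.cat([tokens, next_id], dim=1)
            if next_id.item() == eos_id:
                break
        return tokens.squeeze(0).tolist()  # ids; a real system maps these to Arabic words


def main():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = CaptionModel().to(device).eval()

    # Stand-in for a 16-frame temporal segment captured online from a camera
    # (e.g., via OpenCV) and resized to 224x224.
    clip = torch.rand(1, 3, 16, 224, 224, device=device)

    # Warm-up, then time one captioning pass, mirroring the latency test.
    model.caption(clip)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    ids = model.caption(clip)
    if device == "cuda":
        torch.cuda.synchronize()
    print(f"caption ids: {ids}")
    print(f"latency: {(time.perf_counter() - start) * 1000:.1f} ms")


if __name__ == "__main__":
    main()
```

In a live setting, the same `caption` call would be driven by a frame buffer that keeps the most recent 16 camera frames, so each decoded caption corresponds to the latest temporal segment.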


References

A. J. Yousif and M. H. Al-Jammas, “Exploring deep learning approaches for video captioning: A comprehensive review,” e-Prime - Advances in Electrical Engineering, Electronics and Energy, vol. 6, p. 100372, 2023, doi: 10.1016/j.prime.2023.100372.

V. Chundi, J. Bammidi, A. Pegallapati, Y. Parnandi, A. Reddithala and S. K. Moru, "Intelligent Video Surveillance Systems," 2021 International Carnahan Conference on Security Technology (ICCST), Hatfield, United Kingdom, 2021, pp. 1-5, doi: 10.1109/ICCST49569.2021.9717400.

B. Irfan, A. Ramachandran, S. Spaulding, D. F. Glas, I. Leite and K. L. Koay, "Personalization in Long-Term Human-Robot Interaction," 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI), Daegu, Korea (South), 2019, pp. 685-686, doi: 10.1109/HRI.2019.8673076.

J. O. Williams, “Narrow-band analyzer,” Ph.D. dissertation, Dept. Elect. Eng., Harvard Univ., Cambridge, MA, 1993.

A. Khan, A. Khan and M. Waleed, "Wearable Navigation Assistance System for the Blind and Visually Impaired," 2018 International Conference on Innovation and Intelligence for Informatics, Computing, and Technologies (3ICT), Sakhier, Bahrain, 2018, pp. 1-6, doi: 10.1109/3ICT.2018.8855778.

A. J. Yousif, “Convolution Neural Network Based Method for Biometric Recognition,” Central Asian Journal of Theoretical and Applied Sciences, vol. 4, no. 8, pp. 58–68, 2023.

T. A. Tang, L. Mhamdi, D. McLernon, S. A. R. Zaidi and M. Ghogho, "Deep Recurrent Neural Network for Intrusion Detection in SDN-based Networks," 2018 4th IEEE Conference on Network Softwarization and Workshops (NetSoft), Montreal, QC, Canada, 2018, pp. 202-206, doi: 10.1109/NETSOFT.2018.8460090.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.

Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, and H. Hu, “Video Swin Transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

X. Chen, M. Zhao, F. Shi, M. Zhang, Y. He and S. Chen, "Enhancing Ocean Scene Video Captioning with Multimodal Pre-Training and Video-Swin-Transformer," IECON 2023 - 49th Annual Conference of the IEEE Industrial Electronics Society, Singapore, Singapore, 2023, pp. 1-6, doi: 10.1109/IECON51785.2023.10312358.

A. Singh, T. D. Singh, and S. Bandyopadhyay, “Attention based video captioning framework for Hindi,” Multimed. Syst., vol. 28, no. 1, pp. 195–207, 2022, doi: 10.1007/s00530-021-00816-3.

W. Antoun, F. Baly, and H. Hajj, “AraBERT: Transformer-based model for Arabic language understanding,” arXiv preprint arXiv:2003.00104, 2020.

D. Chen and W. Dolan, “Collecting highly parallel data for paraphrase evaluation,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011, pp. 190–200.

J. Xu, T. Mei, T. Yao, and Y. Rui, “Msr-vtt: A large video description dataset for bridging video and language,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 5288–5296.

A. Kojima, T. Tamura, and K. Fukunaga, “Natural language description of human activities from video images based on concept hierarchy of actions,” International Journal of Computer Vision, vol. 50, pp. 171–184, 2002.

P. Hanckmann, K. Schutte, and G. J. Burghouts, “Automated textual descriptions for a wide range of video events with 48 human actions,” in Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2012.

B. Wang, L. Ma, W. Zhang and W. Liu, "Reconstruction Network for Video Captioning," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 7622-7631, doi: 10.1109/CVPR.2018.00795.

H. Ye, G. Li, Y. Qi, S. Wang, Q. Huang and M. -H. Yang, "Hierarchical Modular Network for Video Captioning," 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022, pp. 17918-17927, doi: 10.1109/CVPR52688.2022.01741.

J. Zhang and Y. Peng, “Video Captioning with Object-Aware Spatio-Temporal Correlation and Aggregation,” IEEE Trans. Image Process., vol. 29, no. c, pp. 6209–6222, 2020, doi: 10.1109/TIP.2020.2988435.

M. S. Zaoad, M. M. R. Mannan, A. B. Mandol, M. Rahman, M. A. Islam, and M. M. Rahman, “An attention-based hybrid deep learning approach for bengali video captioning,” J. King Saud Univ. - Comput. Inf. Sci., vol. 35, no. 1, pp. 257–269, 2023, doi: 10.1016/j.jksuci.2022.11.015.

S. Ma, L. Cui, D. Dai, F. Wei, and X. Sun, “LiveBot: Generating live video comments based on visual and textual contexts,” in Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI), 2019, pp. 6810–6817, doi: 10.1609/aaai.v33i01.33016810.

Y. Chen, S. Wang, W. Zhang, and Q. Huang, “Less is more: Picking informative frames for video captioning,” in Proceedings of the European Conference on Computer Vision (ECCV), LNCS vol. 11217, 2018, pp. 367–384, doi: 10.1007/978-3-030-01261-8_22.

K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.

S. Banerjee and A. Lavie, “METEOR: An automatic metric for MT evaluation with improved correlation with human judgments,” in Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, pp. 65–72.

C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Proceedings of the ACL Workshop on Text Summarization Branches Out, 2004.

R. Vedantam, C. L. Zitnick, and D. Parikh, “CIDEr: Consensus-based image description evaluation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 4566–4575.

Published

2024-03-07

How to Cite

[1] A. J. Yousif and M. H. Al-Jammas, “Real-time Arabic Video Captioning Using CNN and Transformer Networks Based on Parallel Implementation”, DJES, vol. 17, no. 1, pp. 84–93, Mar. 2024.