Automated Medical Image Captioning Using the BLIP Model: Enhancing Diagnostic Support with AI-Driven Language Generation

Authors

  • Enas Abbas Abed, Department of Computer Engineering, University of Diyala, Diyala, Iraq
  • Taoufik Aguili, Department of Communications System, University of Tunis El Manar, Tunisia

DOI:

https://doi.org/10.24237/djes.2025.18215

Keywords:

Medical Image Captioning, BLIP Model, Radiology, UMLS, Diagnostic Support, Transformer Models, Deep Learning, Clinical Applications, Natural Language Generation, Healthcare AI

Abstract

Interpretation of medical images is a critical diagnostic activity: imaging volumes are growing continuously while the number of specialist radiologists remains limited worldwide, which often leads to delayed diagnoses and potential clinical errors. This paper investigates the BLIP model as an automatic generator of clinical captions for medical images. To adapt BLIP to this domain, a methodology was designed around more than 81,000 radiology images annotated with Unified Medical Language System (UMLS) identifiers, drawn from the ROCO (Radiology Objects in Context) dataset. A representative subset of 1,000 images was selected to fit within computational constraints: 800 images were used for training, 100 for validation, and 100 for testing, while preserving coverage of the major imaging modalities. The model was trained as a transformer-based encoder-decoder with cross-attention mechanisms. The four key contributions of this work are (1) domain-specific fine-tuning of the model for the radiological setting, (2) the use of standardized medical terminology through UMLS concept unique identifiers, (3) integration of explainable AI via attention heatmaps and post-hoc explanations (SHAP and LIME), and (4) performance evaluation with established NLP metrics. The model achieved strong semantic and clinical agreement, with scores of 0.7300 (BLEU-4), 0.6101 (METEOR), and 0.8405 (ROUGE). These results suggest that AI-based image captioning has considerable potential to support clinical documentation and improve the reliability of radiological assessments.
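To make the workflow described above concrete, the sketch below shows one possible way to fine-tune a BLIP checkpoint on radiology image-caption pairs and to score a generated caption with BLEU-4, using the public Hugging Face transformers API. The checkpoint name (Salesforce/blip-image-captioning-base), learning rate, and single-example training step are illustrative assumptions, not the paper's reported configuration or preprocessing of ROCO.

```python
# Minimal sketch, assuming the public BLIP checkpoint and toy hyperparameters;
# the paper's actual checkpoint, ROCO preprocessing, and training settings may differ.
import torch
from PIL import Image
from nltk.translate.bleu_score import sentence_bleu
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)


def fine_tune_step(image: Image.Image, caption: str) -> float:
    """One supervised step: the reference caption is both decoder input and target."""
    model.train()
    inputs = processor(images=image, text=caption, return_tensors="pt")
    outputs = model(input_ids=inputs["input_ids"],
                    pixel_values=inputs["pixel_values"],
                    labels=inputs["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()


@torch.no_grad()
def caption_and_score(image: Image.Image, reference: str) -> tuple[str, float]:
    """Generate a caption for one image and score it against a reference with BLEU-4."""
    model.eval()
    pixel_values = processor(images=image, return_tensors="pt")["pixel_values"]
    generated_ids = model.generate(pixel_values=pixel_values, max_new_tokens=60)
    hypothesis = processor.decode(generated_ids[0], skip_special_tokens=True)
    bleu4 = sentence_bleu([reference.split()], hypothesis.split(),
                          weights=(0.25, 0.25, 0.25, 0.25))
    return hypothesis, bleu4
```

METEOR and ROUGE, also reported in the abstract, can be computed analogously with packages such as nltk and rouge-score.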




Published

2025-06-01

How to Cite

[1]
“Automated Medical Image Captioning Using the BLIP Model: Enhancing Diagnostic Support with AI-Driven Language Generation”, DJES, vol. 18, no. 2, pp. 228–248, Jun. 2025, doi: 10.24237/djes.2025.18215.
