Abstract
Medical image generation has significant potential for data augmentation, model training, and medical education. However, generating realistic and anatomically accurate medical images remains challenging, particularly from text descriptions that lack detailed anatomical information. In this work, we propose VAP-Diffusion, an approach that leverages multimodal large language models (MLLMs) to enrich such descriptions for enhanced medical image generation.
Our method centers on a description enrichment framework in which MLLMs augment text prompts with detailed anatomical and pathological information. The framework consists of three components: (1) an MLLM-based description enrichment module that adds anatomical and pathological detail to input prompts, (2) a diffusion-based generation module that synthesizes high-quality medical images conditioned on the enriched prompts, and (3) an anatomical consistency module that checks that generated images remain clinically plausible.
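For concreteness, the following is a minimal Python sketch of how these three stages could be chained, with re-sampling when the consistency check fails. All class and method names here (PromptEnricher, DiffusionGenerator, ConsistencyChecker) are hypothetical placeholders standing in for a real MLLM, a text-conditional diffusion sampler, and an anatomical plausibility check; the abstract does not specify the actual interfaces.

class PromptEnricher:
    """Stage 1 (hypothetical): MLLM-based enrichment of the input prompt."""
    def enrich(self, prompt: str) -> str:
        # A real system would query an MLLM here; this stub fakes the reply.
        return prompt + ", with detailed anatomical context of surrounding structures"

class DiffusionGenerator:
    """Stage 2 (hypothetical): diffusion sampling conditioned on the enriched text."""
    def sample(self, condition: str) -> dict:
        # Placeholder for a text-conditional diffusion model's sampling loop.
        return {"condition": condition, "pixels": None}

class ConsistencyChecker:
    """Stage 3 (hypothetical): anatomical-consistency filter on generated images."""
    def is_plausible(self, image: dict) -> bool:
        return True  # a real check would score anatomical plausibility

def generate(prompt: str, max_retries: int = 3) -> dict:
    enricher, generator, checker = PromptEnricher(), DiffusionGenerator(), ConsistencyChecker()
    enriched = enricher.enrich(prompt)
    image = generator.sample(condition=enriched)
    for _ in range(max_retries - 1):  # re-sample if the consistency check fails
        if checker.is_plausible(image):
            break
        image = generator.sample(condition=enriched)
    return image

if __name__ == "__main__":
    print(generate("axial T1-weighted brain MRI, left frontal lesion"))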
We evaluate our approach on multiple medical imaging datasets, including brain MRI and chest X-ray images. Experimental results show that MLLM-enriched prompts significantly improve image quality and anatomical accuracy compared to conventional text-to-image generation methods, while producing diverse and clinically relevant images.
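The abstract does not name the image-quality metric; a standard choice for generative models is the Fréchet Inception Distance (FID), sketched below with torchmetrics as an assumption. The random tensors are stand-ins for batches of real and synthetic images; this illustrates the metric, not the paper's evaluation code.

import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# FID compares Inception-v3 feature statistics of real vs. generated images;
# lower is better. Random uint8 tensors stand in for actual image batches.
fid = FrechetInceptionDistance(feature=2048)
real_images = torch.randint(0, 256, (100, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (100, 3, 299, 299), dtype=torch.uint8)
fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(f"FID: {fid.compute().item():.2f}")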
By providing more accurate and useful synthetic data, the proposed framework could improve both AI model training and clinical education.
BibTeX
@inproceedings{huang2025vap,
  title={VAP-Diffusion: Enriching Descriptions with MLLMs for Enhanced Medical Image Generation},
  author={Huang, Peng and Fu, Junhu and Guo, Bowen and Li, Zeju and Wang, Yuanyuan and Guo, Yi},
  booktitle={International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI 2025)},
  year={2025}
}