Multimodal RAG using text and visual data
Saved in:
| Date: | 2025 |
|---|---|
| Main authors: | , |
| Format: | Article |
| Language: | Ukrainian |
| Published: | Problems in Programming, 2025 |
| Online access: | https://pp.isofts.kiev.ua/index.php/ojs1/article/view/859 |
| Journal title: | Problems in programming |
Abstract:
This paper presents the development and investigation of a multimodal Retrieval-Augmented Generation (RAG) system designed for the analysis and interpretation of medical images. The research focuses on chest X-ray images and their corresponding radiology reports. The primary goal was to create a system capable of performing two key tasks: generating a detailed radiology report for an input image and providing accurate answers to specific questions about it. A secondary goal was to demonstrate that employing a multimodal retrieval-augmented approach significantly improves generation quality compared to using large multimodal models without a retrieval component. The system's implementation utilizes a combination of state-of-the-art deep learning models. The BiomedCLIP model, fine-tuned on the target dataset, was used to generate vector embeddings for both text and visual data. The generator component is based on the large language model LLaVA-Med 1.5, which is adapted to the medical domain and quantized to operate under limited computational resources. The system architecture also includes auxiliary classifiers based on DenseNet121 to determine the image projection and identify clinical findings, thereby enhancing retrieval accuracy. The experimental evaluation involved testing six different configurations of the developed system. The evaluation was conducted using a range of metrics, including accuracy and F1-score for the question-answering task, as well as BLEU, ROUGE, F1-CheXbert, and F1-RadGraph for assessing the quality of the generated reports. The test results demonstrated a significant advantage of all system configurations over the baseline generator model. The best results were achieved by the configuration that utilizes projection and clinical-finding classifiers with an exact-match requirement for the identified pathologies. The study confirmed that integrating a relevant data retrieval mechanism significantly enhances both the structural and semantic quality of the generated textual descriptions for medical images.

Problems in programming 2025; 3: 66-78
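The retrieval step the abstract describes can be sketched in miniature. This is an illustrative, self-contained example, not the authors' implementation: in the actual system the query embedding would come from the fine-tuned BiomedCLIP image encoder and the retrieved reports would be injected into the LLaVA-Med 1.5 prompt; here the toy corpus, 3-dimensional vectors, and the function names `cosine` and `retrieve` are all invented for the sketch.

```python
# Minimal sketch of the retrieval stage of a multimodal RAG pipeline.
# Assumption: embeddings are dense vectors in a shared image/text space
# (BiomedCLIP-style); we hand-craft tiny 3-d vectors to stay self-contained.
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_emb, corpus, k=2):
    # Return the k report texts whose embeddings are closest to the query.
    ranked = sorted(corpus, key=lambda item: cosine(query_emb, item["emb"]),
                    reverse=True)
    return [item["report"] for item in ranked[:k]]

# Toy corpus of (embedding, radiology report) pairs -- illustrative only.
corpus = [
    {"emb": [0.9, 0.1, 0.0], "report": "No acute cardiopulmonary abnormality."},
    {"emb": [0.1, 0.9, 0.0], "report": "Right lower lobe consolidation."},
    {"emb": [0.0, 0.2, 0.9], "report": "Mild cardiomegaly without edema."},
]

# In the real system this would be the BiomedCLIP embedding of the input X-ray.
query = [0.85, 0.15, 0.05]
print(retrieve(query, corpus, k=1))
# prints ['No acute cardiopulmonary abnormality.']
```

In the full pipeline, the retrieved reports (optionally filtered by the DenseNet121 projection and clinical-finding classifiers, as in the paper's best configuration) would be concatenated into the generator's prompt as grounding context.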