The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


MPhil Thesis Defence


Title: "Multimodal Large Language Model for Medical Report Generation"

By

Mr. Zhixuan CHEN


Abstract:

Medical report generation is a crucial task in medical imaging analysis, 
aiming to automatically translate medical imaging data into accurate and 
interpretable textual descriptions to support diagnostic decision-making. 
Among various imaging modalities, computed tomography (CT) poses unique 
challenges due to its high-resolution, volumetric nature and the need to 
interpret complex anatomical structures across multiple slices. Existing
methods primarily consider the global features of the entire volume, which
makes it difficult to focus on specific regions and risks missing
abnormalities.

To address this issue, we propose Reg2RG, the first
region-guided multimodal large language model (LLM) framework for medical 
report generation, which enhances diagnostic performance by focusing on 
anatomical regions within the volume. Specifically, we utilize masks from a 
universal segmentation module to capture local features for each referring 
region. A local feature decoupling (LFD) strategy is proposed to preserve 
the local high-resolution details with little computational overhead. The
local features are then integrated with global features to capture
inter-regional relationships within a cohesive context.
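
As a rough illustration of how such region-guided features could be
assembled, the Python sketch below masks and crops each referring region
before encoding it alongside the full volume. The encoder, projection layer,
and fixed crop size are placeholders of this sketch, not the thesis
implementation; the actual LFD strategy may decouple and compress the local
features differently.

    import torch
    import torch.nn.functional as F

    def extract_region_features(volume, region_masks, encoder, project,
                                crop_size=(64, 128, 128)):
        """Build local tokens for each referring region plus global tokens.

        volume:       (1, 1, D, H, W) CT volume
        region_masks: list of (1, 1, D, H, W) binary masks, one per region,
                      produced by the universal segmentation module
        encoder:      3D vision encoder mapping a volume to (1, N, C) tokens
        project:      layer mapping visual tokens into the LLM embedding space
        """
        global_tokens = project(encoder(volume))  # whole-volume context
        region_tokens = []
        for mask in region_masks:
            # Zero out everything outside the region, then crop to its
            # bounding box so compute is spent on the region itself while
            # its high-resolution detail is preserved.
            nz = mask.nonzero()
            d0, d1 = nz[:, 2].min(), nz[:, 2].max() + 1
            h0, h1 = nz[:, 3].min(), nz[:, 3].max() + 1
            w0, w1 = nz[:, 4].min(), nz[:, 4].max() + 1
            local = (volume * mask)[:, :, d0:d1, h0:h1, w0:w1]
            # Resizing the crop to a fixed encoder input size is an
            # assumption of this sketch, not necessarily the thesis setting.
            local = F.interpolate(local, size=crop_size, mode="trilinear",
                                  align_corners=False)
            region_tokens.append(project(encoder(local)))
        # Local and global tokens are concatenated so the language decoder
        # sees each region within the context of the whole volume.
        return torch.cat(region_tokens + [global_tokens], dim=1)
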
Moreover, we propose a novel region-report alignment (RRA) training
strategy. It leverages the
recognition of referring regions to guide the generation of region-specific 
reports, enhancing the model's referring and grounding capabilities while 
also improving the report's interpretability. An LLM serves as the language
decoder, generating reports from the integrated visual features and
facilitating fine-grained region-level comprehension.
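
The core RRA idea, tying each report segment to an explicitly recognised
region, can be pictured with the toy target construction below. The region
names, template wording, and build_rra_target helper are hypothetical and
not the prompt format used in the thesis.

    def build_rra_target(region_reports):
        """Interleave each region's name with its report section so that
        generating the findings is conditioned on first naming the region.

        region_reports: dict mapping a referring region to its report text.
        """
        parts = []
        for region, findings in region_reports.items():
            # Emitting the region identity before its findings grounds
            # every report segment in a specific anatomical region.
            parts.append(f"{region}: {findings}")
        return "\n".join(parts)

    # Toy example with hypothetical regions and findings:
    print(build_rra_target({
        "lung": "No focal consolidation is observed.",
        "heart": "Cardiac contours are within normal limits.",
    }))
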
Extensive experiments on two large-scale chest CT-report datasets
demonstrate the superiority of our
method, which outperforms several state-of-the-art methods in terms of both 
natural language generation and clinical efficacy metrics while preserving 
promising interpretability. The code is available at 
https://github.com/zhi-xuan-chen/Reg2RG.


Date:                   Friday, 4 July 2025

Time:                   10:30am - 12:30pm

Venue:                  Room 5501
                        Lifts 25/26

Chairman:               Prof. Albert CHUNG

Committee Members:      Dr. Hao CHEN (Supervisor)
                        Dr. Dan XU