Multimodal Large Language Model for Medical Report Generation
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

MPhil Thesis Defence

Title: "Multimodal Large Language Model for Medical Report Generation"

By

Mr. Zhixuan CHEN

Abstract:

Medical report generation is a crucial task in medical imaging analysis, aiming to automatically translate medical imaging data into accurate and interpretable textual descriptions that support diagnostic decision-making. Among the various imaging modalities, computed tomography (CT) poses unique challenges due to its high-resolution, volumetric nature and the need to interpret complex anatomical structures across multiple slices. Existing methods primarily consider only the global features of the entire volume, which makes it difficult for them to focus on specific regions and risks missing abnormalities. To address this issue, we propose Reg2RG, the first region-guided multimodal large language model (LLM) framework for medical report generation, which enhances diagnostic performance by focusing on anatomical regions within the volume. Specifically, we utilize masks from a universal segmentation module to capture local features for each referring region. A local feature decoupling (LFD) strategy is proposed to preserve local high-resolution details with little computational overhead. The local features are then integrated with global features to capture inter-regional relationships within a cohesive context. Moreover, we propose a novel region-report alignment (RRA) training strategy. It leverages the recognition of referring regions to guide the generation of region-specific reports, enhancing the model's referring and grounding capabilities while also improving the reports' interpretability. An LLM serves as the language decoder, generating reports from the integrated visual features and facilitating fine-grained region-level comprehension. Extensive experiments on two large-scale chest CT-report datasets demonstrate the superiority of our method, which outperforms several state-of-the-art methods on both natural language generation and clinical efficacy metrics while preserving promising interpretability. The code is available at https://github.com/zhi-xuan-chen/Reg2RG.

Date: Friday, 4 July 2025
Time: 10:30am - 12:30pm
Venue: Room 5501 (Lifts 25/26)

Chairman: Prof. Albert CHUNG
Committee Members: Dr. Hao CHEN (Supervisor)
                   Dr. Dan XU
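To make the region-guided idea in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of one plausible reading: per-region features are pooled from a 3D feature volume using segmentation masks and fused with a global feature to form visual tokens for the language decoder. All names, dimensions, and the masked-pooling choice are illustrative assumptions rather than the thesis's actual implementation; the linked repository contains the real code.

import torch
import torch.nn as nn

# Hypothetical sketch (not the thesis's code): masked pooling of per-region
# features from a 3D CT feature volume, fused with a global feature and
# projected into a language model's embedding space as visual tokens.
class RegionGuidedTokens(nn.Module):
    def __init__(self, feat_dim=256, llm_dim=1024):
        super().__init__()
        self.local_proj = nn.Linear(feat_dim, llm_dim)   # assumed projection
        self.global_proj = nn.Linear(feat_dim, llm_dim)  # assumed projection

    def forward(self, feats, masks):
        # feats: (C, D, H, W) features from a 3D vision encoder
        # masks: (R, D, H, W) binary masks, one per referring region
        flat = feats.flatten(1)          # (C, V) with V = D*H*W
        m = masks.flatten(1).float()     # (R, V)
        # Masked average pooling: one local feature vector per region
        local = (m @ flat.T) / m.sum(1, keepdim=True).clamp(min=1)  # (R, C)
        global_ = flat.mean(1, keepdim=True).T                      # (1, C)
        # Concatenate the global and local streams as LLM input tokens
        return torch.cat([self.global_proj(global_),
                          self.local_proj(local)], dim=0)           # (1+R, llm_dim)

# Example: 5 regions over an 8x16x16 feature grid -> 6 visual tokens
tokens = RegionGuidedTokens()(torch.randn(256, 8, 16, 16),
                              torch.rand(5, 8, 16, 16) > 0.7)
print(tokens.shape)  # torch.Size([6, 1024])

In this toy fusion, the single global token carries volume-level context while each region token preserves localized evidence, which is one simple way a decoder could attend to specific anatomical regions when generating region-specific report sentences.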