Multi-modal Retrieval-Augmented Generation (RAG): Tackling Hallucination in Multimodal Question Answering

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

Final Year Thesis Oral Defense

Title: "Multi-modal Retrieval-Augmented Generation (RAG): Tackling 
Hallucination in Multimodal Question Answering"

by

LAM Ching Yin

Abstract:

Retrieval-Augmented Generation (RAG) has become the predominant method for 
grounding Large Language Model (LLM) outputs in external evidence, but 
existing text-only pipelines cannot accommodate multimodal information. 
Multimodal RAG systems, built on Multimodal LLMs (MLLMs), can process both 
text and images. However, current systems either flatten visual features 
into text captions or rely on CLIP-based embeddings that prioritize 
semantic alignment over fine-grained visual detail, limiting RAG in 
applications that require visual precision. This thesis presents a 
multimodal RAG pipeline that preserves fine-grained visual features while 
enabling cross-modal alignment. We propose a dual-representation framework 
that maintains separate representation spaces for textual and visual 
content, using state-of-the-art pretrained foundation models for feature 
extraction and cross-modal alignment. This thesis also compares and 
analyzes how different retrieval modes affect grounding in MLLM generation.

Date            : 5 May 2026 (Tuesday)

Time            : 14:00 - 14:40

Venue           : Room 2131B (near Lift 19), HKUST

Advisor         : Dr. XU Dan

2nd Reader      : Dr. CHEN Long