Multi-modal Retrieval-Augmented Generation (RAG): Tackling Hallucination in Multimodal Question Answering
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

Final Year Thesis Oral Defense

Title: "Multi-modal Retrieval-Augmented Generation (RAG): Tackling Hallucination in Multimodal Question Answering"

by

LAM Ching Yin

Abstract:

Retrieval-Augmented Generation (RAG) has become the dominant method for grounding Large Language Model (LLM) outputs in external evidence, but existing text-only pipelines cannot accommodate multimodal information. Multimodal RAG systems, built around Multimodal LLMs (MLLMs), can process both text and images. However, current systems either flatten visual features into text captions or rely on CLIP-based embeddings that prioritize semantic alignment over fine-grained visual detail, limiting RAG in applications that require visual precision. This thesis presents a multimodal RAG pipeline that preserves fine-grained visual features while enabling cross-modal alignment. We propose a dual-representation framework that maintains separate representation spaces, using state-of-the-art pretrained foundation models for feature extraction and cross-modal alignment. The thesis also compares and analyzes how different retrieval modes affect grounding in MLLM generation.

Date: 5 May 2026 (Tuesday)
Time: 14:00 - 14:40
Venue: Room 2131B (near Lift 19), HKUST
Advisor: Dr. XU Dan
2nd Reader: Dr. CHEN Long
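To illustrate the dual-representation idea the abstract describes, here is a minimal, hypothetical sketch: each document keeps two embeddings in separate spaces (a semantic, CLIP-style vector and a fine-grained visual-feature vector), and retrieval fuses a similarity score from each space. All names, the toy vectors, and the fusion weight `alpha` are illustrative assumptions, not details from the thesis.

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors; 0.0 if either is a zero vector.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_sem, query_vis, corpus, alpha=0.5, k=2):
    """Rank documents by a weighted fusion of two similarity scores.

    corpus: list of (doc_id, semantic_vec, visual_vec) triples, where the
    two vectors live in separate representation spaces (hypothetical setup).
    alpha:  weight on the semantic space (1 - alpha goes to the visual space).
    """
    scored = [
        (alpha * cosine(query_sem, s) + (1 - alpha) * cosine(query_vis, v), doc_id)
        for doc_id, s, v in corpus
    ]
    scored.sort(reverse=True)  # highest fused score first
    return [doc_id for _, doc_id in scored[:k]]

# Toy corpus: 2-D embeddings stand in for real CLIP / visual features.
corpus = [
    ("chart.png",   [1.0, 0.0], [0.9, 0.1]),
    ("photo.png",   [0.0, 1.0], [0.1, 0.9]),
    ("diagram.png", [0.7, 0.7], [0.5, 0.5]),
]
top = retrieve([1.0, 0.1], [0.8, 0.2], corpus)
# top is ['chart.png', 'diagram.png']
```

Keeping the two spaces separate (rather than collapsing images into captions) lets the visual score reward fine-grained feature matches even when the semantic score alone would not distinguish the candidates; a real pipeline would use pretrained foundation models to produce both embeddings.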