PhD Thesis Proposal Defence
Title: "Multimodal Understanding for Earth Science: Bridging Multimodal Data
and Human Insight with MLLMs"
by
Mr. Haobo LI
Abstract:
Earth science is increasingly characterized by large-scale, heterogeneous, and
multimodal data, including numerical reanalysis fields, satellite imagery,
textual reports, and human-generated records. However, these modalities are
still commonly processed in isolation, which limits our ability to transform
raw observations and model outputs into usable, interpretable, and actionable
knowledge for forecasting and decision-making. This thesis studies multimodal
understanding for Earth science through the lens of large language models
(LLMs) and multimodal large language models (MLLMs), with the overarching goal
of bridging multimodal data and human insight.
The thesis makes three complementary contributions. First, it introduces the
Weather and Climate Events Forecasting (WCEF) task and presents CLLMate, the
first multimodal benchmark that aligns meteorological raster data with
structured, expert-validated textual event records extracted from
environmental news. By shifting the forecasting target from numerical
variables to event-centric narratives and their cascading consequences,
CLLMate provides a foundation for studying multimodal reasoning in Earth
science.
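For concreteness, a single instance in such a benchmark might pair a gridded
meteorological input with a structured event label. The abstract does not
specify CLLMate's actual schema, so every field name and value in this Python
sketch is hypothetical:

# Hypothetical sketch of one raster-event pair; not CLLMate's real schema.
example_instance = {
    "raster": {  # gridded meteorological input (e.g. reanalysis fields)
        "variables": ["t2m", "tp", "u10", "v10"],
        "region": {"lat_range": [20.0, 30.0], "lon_range": [105.0, 120.0]},
        "time_window": ["2023-07-01T00:00Z", "2023-07-03T00:00Z"],
        "tensor_shape": [4, 17, 41, 61],  # (variables, steps, lat, lon)
    },
    "event": {  # structured, expert-validated label from environmental news
        "type": "flood",
        "location": "Guangdong, China",
        "consequences": ["road closures", "crop damage"],
        "provenance": "environmental news article, expert-validated",
    },
}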
Second, the thesis presents PIPE, a physics-informed positional encoding
method for multimodal forecasting with satellite imagery and time series data.
PIPE injects temporal and geospatial information into the multimodal alignment
process, better preserving physical context in high-stakes forecasting tasks
such as typhoon prediction. Third, the thesis presents Havior, an
LLM-empowered visual analytics system for heat risk management that integrates
numerical climate data with textual news and supports human-in-the-loop
retrieval, semantic exploration, and contextual question answering.
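The abstract does not give PIPE's exact formulation; the minimal Python sketch
below illustrates one way a positional encoding could inject geospatial and
temporal information with physically meaningful periodicities (longitude
treated as a periodic angle, time decomposed into diurnal and annual cycles).
All names, dimensions, and frequency choices here are assumptions for
illustration, not the thesis's method:

import numpy as np

def sinusoidal(value: float, dim: int, max_period: float) -> np.ndarray:
    """Standard sinusoidal embedding of a scalar at dim//2 frequencies."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    angles = value * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def physics_informed_encoding(lat_deg: float, lon_deg: float,
                              hours: float, dim: int = 64) -> np.ndarray:
    """Hypothetical physics-informed positional encoding (not PIPE itself)."""
    lat = np.deg2rad(lat_deg)
    lon = np.deg2rad(lon_deg)
    parts = [
        sinusoidal(lat, dim // 4, np.pi),        # latitude in [-pi/2, pi/2]
        sinusoidal(np.sin(lon), dim // 8, 2.0),  # longitude is periodic, so
        sinusoidal(np.cos(lon), dim // 8, 2.0),  #   encode its sine and cosine
        sinusoidal(hours % 24.0, dim // 4, 24.0),      # diurnal cycle
        sinusoidal(hours % 8766.0, dim // 4, 8766.0),  # annual cycle (~365.25 d)
    ]
    return np.concatenate(parts)  # shape: (dim,)

# e.g. a satellite frame over the western North Pacific at a given time:
vec = physics_informed_encoding(lat_deg=22.3, lon_deg=130.5, hours=12_345.0)

In an MLLM pipeline, a vector like this would typically be added to the patch
or token embeddings of the corresponding satellite frame before cross-modal
alignment, so that the model sees where and when each observation was made.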
Taken together, these contributions advance a unified research agenda
spanning benchmark construction, physically grounded multimodal modeling, and
human-centered insight generation. The findings of this thesis suggest that
effective multimodal understanding for Earth science requires not only
stronger cross-modal learning, but also benchmark resources that connect
physical observations with human-readable event knowledge, modeling methods
that preserve Earth-science-specific physical information, and systems that
expose multimodal evidence in forms that experts can interpret, validate, and
act upon. In this way, the thesis contributes new datasets, methods, and
interactive systems toward more holistic, reliable, and actionable AI for
Earth science.
Date: Thursday, 23 April 2026
Time: 4:00pm - 6:00pm
Venue: Room 2132C (Lift 22)
Committee Members: Prof. Huamin Qu (Supervisor)
Prof. Alexis Kai Hon Lau (Co-supervisor)
Dr. Binhang Yuan (Chairperson)
Dr. Xiaomin Ouyang