PhD Thesis Proposal Defence


Title: "Multimodal Understanding for Earth Science: Bridging Multimodal Data 
and Human Insight with MLLMs"

by

Mr. Haobo LI


Abstract:

Earth science is increasingly characterized by large-scale, heterogeneous, and 
multimodal data, including numerical reanalysis fields, satellite imagery, 
textual reports, and human-generated records. However, these modalities are 
still commonly processed in isolation, which limits our ability to transform 
raw observations and model outputs into usable, interpretable, and actionable 
knowledge for forecasting and decision-making. This thesis studies multimodal 
understanding for Earth science through the lens of large language models 
(LLMs) and multimodal large language models (MLLMs), with the overarching goal 
of bridging multimodal data and human insight.

The thesis makes three complementary contributions. First, it introduces the 
Weather and Climate Events Forecasting (WCEF) task and presents CLLMate, the 
first multimodal benchmark that aligns meteorological raster data with 
structured, expert-validated textual event records extracted from 
environmental news. By moving beyond the prediction of numerical variables 
toward event-centric narratives and their cascading consequences, CLLMate 
provides a foundation for studying multimodal reasoning in Earth science (an 
illustrative raster-to-event pairing is sketched after this paragraph). 
Second, the thesis presents PIPE, a physics-informed positional encoding 
method for multimodal forecasting with satellite imagery and time series data. 
PIPE injects temporal and geospatial information into multimodal alignment, 
improving the preservation of physical context in high-stakes forecasting 
tasks such as typhoon prediction (a minimal encoding sketch also follows this 
paragraph). Third, the thesis presents Havior, an 
LLM-empowered visual analytics system for heat risk management that integrates 
numerical climate data with textual news and supports human-in-the-loop 
retrieval, semantic exploration, and contextual question answering.
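
For readers outside the area, a purely hypothetical sketch of what a 
WCEF-style raster-to-event training pair could look like is given below, in 
Python. Every field name, variable, and value is invented for exposition; 
this is not the actual CLLMate schema.

    # Hypothetical WCEF-style raster-to-event pairing; all names and values
    # below are illustrative assumptions, NOT the actual CLLMate format.
    import numpy as np

    # Gridded meteorological input: (variables, time steps, lat, lon), e.g.
    # 2 m temperature, total precipitation, and 10 m wind over a 24 h window.
    raster = np.random.rand(3, 24, 64, 64).astype(np.float32)

    # Expert-validated event record extracted from environmental news; this
    # event-level text, not a future numerical field, is the training target.
    event_record = {
        "event_type": "heavy_rainfall",
        "region": "Pearl River Delta",
        "date": "2023-09-07",
        "summary": "Record-breaking rainfall triggered urban flooding.",
        "cascading_consequences": ["urban_flooding", "transport_disruption"],
    }

    sample = {"input_raster": raster, "target_event": event_record}

The point of such a pairing is that supervision comes from human-readable 
event knowledge rather than from future values of the numerical fields 
themselves.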
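
Similarly, the core idea behind a physics-informed positional encoding can 
be sketched minimally as follows: a generic sinusoidal construction over 
time, latitude, and longitude, under assumed dimension splits and frequency 
base. It illustrates the idea, not the PIPE implementation itself.

    # Minimal sketch: sinusoidal features of observation time, latitude, and
    # longitude are added to token embeddings so that cross-modal alignment
    # retains temporal and geospatial context.
    import torch

    def sincos_features(values, dim, base=10000.0):
        """Map scalar coordinates (hours, degrees, ...) to `dim` features."""
        freqs = torch.exp(
            -torch.arange(0, dim, 2, dtype=torch.float32)
            * torch.log(torch.tensor(base)) / dim
        )
        angles = values.unsqueeze(-1) * freqs      # (..., dim // 2)
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

    def physics_informed_pe(time_h, lat_deg, lon_deg, d_model):
        """Concatenate temporal and geospatial encodings into one vector."""
        assert d_model % 6 == 0, "keep each coordinate's share of dims even"
        d = d_model // 3
        return torch.cat(
            [
                sincos_features(time_h, d),        # absolute time in hours
                sincos_features(lat_deg, d),       # latitude in degrees
                sincos_features(lon_deg, d),       # longitude in degrees
            ],
            dim=-1,
        )

    # Usage: add the encoding to image-patch embeddings before fusion.
    tokens = torch.randn(8, 196, 768)              # (batch, patches, d_model)
    pe = physics_informed_pe(
        torch.full((8, 196), 12.0),                # observation hour
        torch.full((8, 196), 22.3),                # patch latitude
        torch.full((8, 196), 114.2),               # patch longitude
        d_model=768,
    )
    tokens = tokens + pe

In contrast to an index-based positional encoding, each token here carries 
where and when on Earth it was observed, which is the physical context the 
paragraph above refers to.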

Taken together, these contributions advance a unified research agenda
spanning benchmark construction, physically grounded multimodal modeling, and
human-centered insight generation. The findings of this thesis suggest that
effective multimodal understanding for Earth science requires not only
stronger cross-modal learning, but also benchmark resources that connect
physical observations with human-readable event knowledge, modeling methods
that preserve Earth-science-specific physical information, and systems that
expose multimodal evidence in forms that experts can interpret, validate, and
act upon. In this way, the thesis contributes new datasets, methods, and
interactive systems toward more holistic, reliable, and actionable AI for
Earth science.


Date:                   Thursday, 23 April 2026

Time:                   4:00pm - 6:00pm

Venue:                  Room 2132C
                        Lift 22

Committee Members:      Prof. Huamin Qu (Supervisor)
                        Prof. Alexis Kai Hon Lau (Co-supervisor)
                        Dr. Binhang Yuan (Chairperson)
                        Dr. Xiaomin Ouyang