The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Multimodal Understanding for Earth Science: Bridging Multimodal Data 
and Human Insight with MLLMs"

By

Mr. Haobo LI


Abstract:

Earth science is increasingly characterized by large-scale, heterogeneous, 
and multimodal data, including numerical reanalysis fields, satellite 
imagery, textual reports, and human-generated records. However, these 
modalities are still commonly processed in isolation, which limits our 
ability to transform raw observations and model outputs into usable, 
interpretable, and actionable knowledge for forecasting and decision-making. 
This thesis studies multimodal understanding for earth science through the 
lens of large language models (LLMs) and multimodal large language models 
(MLLMs), with the overarching goal of bridging multimodal data and human 
insight.

The thesis makes three complementary contributions. First, it introduces the 
Weather and Climate Events Forecasting (WCEF) task and presents CLLMate, the 
first multimodal benchmark that aligns meteorological raster data with 
structured, expert-validated textual event records extracted from 
environmental news. By moving beyond the prediction of numerical variables 
toward event-centric narratives and cascading consequences, CLLMate 
provides a foundation for studying multimodal reasoning in earth science. 
Second, the thesis presents PIPE, a physics-informed positional encoding 
method for multimodal forecasting with satellite imagery and time series 
data. PIPE injects temporal and geospatial information into multimodal 
alignment, improving the preservation of physical context in high-stakes 
forecasting tasks such as typhoon prediction. Third, the thesis presents 
Havior, an LLM-empowered visual analytics system for heat risk management 
that integrates numerical climate data with textual news and supports 
human-in-the-loop retrieval, semantic exploration, and contextual question 
answering.

Taken together, these contributions advance a unified research agenda 
spanning benchmark construction, physically grounded multimodal modeling, and 
human-centered insight generation. The findings of this thesis suggest that 
effective multimodal understanding for earth science requires not only 
stronger cross-modal learning, but also benchmark resources that connect 
physical observations with human-readable event knowledge, modeling methods 
that preserve earth-science-specific physical information, and systems that 
expose multimodal evidence in forms that experts can interpret, validate, and 
act upon. In this way, the thesis contributes new datasets, methods, and 
interactive systems toward more holistic, reliable, and actionable AI for 
earth science.


Date:                   Wednesday, 10 June 2026

Time:                   12:00 noon - 2:00 pm

Venue:                  Room 5501
                        Lifts 25/26

Chairman:               Dr. Becki Yi KUANG (CBE)

Committee Members:      Prof. Huamin QU (Supervisor)
                        Prof. Alexis Kai Hon LAU (Co-supervisor, ENVR)
                        Dr. Dan XU
                        Dr. Binhang YUAN
                        Prof. Dan WANG (ENVR)
                        Dr. Weidong YANG (Fudan University)