Multimodal Understanding for Earth Science: Bridging Multimodal Data and Human Insight with MLLMs
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering
PhD Thesis Defence
Title: "Multimodal Understanding for Earth Science: Bridging Multimodal Data
and Human Insight with MLLMs"
By
Mr. Haobo LI
Abstract:
Earth science is increasingly characterized by large-scale, heterogeneous,
and multimodal data, including numerical reanalysis fields, satellite
imagery, textual reports, and human-generated records. However, these
modalities are still commonly processed in isolation, which limits our
ability to transform raw observations and model outputs into usable,
interpretable, and actionable knowledge for forecasting and decision-making.
This thesis studies multimodal understanding for earth science through the
lens of large language models (LLMs) and multimodal large language models
(MLLMs), with the overarching goal of bridging multimodal data and human
insight.

The thesis makes three complementary contributions. First, it introduces the
Weather and Climate Events Forecasting (WCEF) task and presents CLLMate, the
first multimodal benchmark that aligns meteorological raster data with
structured, expert-validated textual event records extracted from
environmental news. By moving beyond the prediction of numerical variables
toward event-centric narratives and cascading consequences, CLLMate
provides a testbed for studying multimodal reasoning in earth science.
Second, the thesis presents PIPE, a physics-informed positional encoding
method for multimodal forecasting with satellite imagery and time series
data. PIPE injects temporal and geospatial information into multimodal
alignment, improving the preservation of physical context in high-stakes
forecasting tasks such as typhoon prediction. Third, the thesis presents
Havior, an LLM-empowered visual analytics system for heat risk management
that integrates numerical climate data with textual news and supports
human-in-the-loop retrieval, semantic exploration, and contextual question
answering.

Taken together, these contributions advance a unified research agenda
spanning benchmark construction, physically grounded multimodal modeling, and
human-centered insight generation. The findings of this thesis suggest that
effective multimodal understanding for earth science requires not only
stronger cross-modal learning, but also benchmark resources that connect
physical observations with human-readable event knowledge, modeling methods
that preserve earth-science-specific physical information, and systems that
expose multimodal evidence in forms that experts can interpret, validate, and
act upon. In this way, the thesis contributes new datasets, methods, and
interactive systems toward more holistic, reliable, and actionable AI for
earth science.
Date: Wednesday, 10 June 2026
Time: 12:00 noon - 2:00 pm
Venue: Room 5501 (Lifts 25/26)
Chairman: Dr. Becki Yi KUANG (CBE)
Committee Members: Prof. Huamin QU (Supervisor)
Prof. Alexis Kai Hon LAU (Co-supervisor, ENVR)
Dr. Dan XU
Dr. Binhang YUAN
Prof. Dan WANG (ENVR)
Dr. Weidong YANG (Fudan University)