Learning Under Distributional Shift

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Learning Under Distributional Shift"

By

Mr. Yong LIN


Abstract:

Machine learning models often assume that the training and testing data are
drawn from the same distribution, but this assumption can be violated in
real-world applications where the testing distribution differs. Enhancing
models' robustness under distributional shift, known as Out-of-Distribution
(OOD) Generalization, has gained significant attention in the machine learning
community. One popular framework, Invariant Risk Minimization (IRM), aims to
learn invariant features that predict labels stably under distributional
shifts while discarding spurious features that are unstable. IRM enjoys
theoretical guarantees for linear models given a sufficient number of training
environments. In this thesis, we first identify some fundamental limitations
of IRM and propose methods to alleviate them.
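
For context, IRM is usually stated as a bi-level objective, with the practical
IRMv1 relaxation replacing the inner constraint by a gradient penalty. The
sketch below follows the standard formulation of Arjovsky et al. (2019); the
notation is illustrative and not taken verbatim from the thesis:

    % Bi-level IRM objective: Phi is the feature extractor, w the classifier,
    % and R^e the risk in training environment e.
    \min_{\Phi, w} \sum_{e \in \mathcal{E}_{tr}} R^e(w \circ \Phi)
    \quad \text{s.t.} \quad
    w \in \arg\min_{\bar{w}} R^e(\bar{w} \circ \Phi)
    \quad \forall e \in \mathcal{E}_{tr}

    % IRMv1 relaxation: a single penalized objective with a dummy
    % scalar classifier w = 1.0.
    \min_{\Phi} \sum_{e \in \mathcal{E}_{tr}}
    \Big[ R^e(\Phi)
    + \lambda \, \big\| \nabla_{w \mid w = 1.0} R^e(w \cdot \Phi) \big\|^2 \Big]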

First, we demonstrate that IRM can be inherently susceptible to overfitting.
Specifically, we show theoretically that IRM degenerates to Empirical Risk
Minimization (ERM) when overfitting occurs, and we show empirically that IRM's
performance degrades as the neural network grows larger. To mitigate this
issue, we propose several methods, such as incorporating Bayesian inference
and sample reweighting.

Second, we show theoretically that learning invariant features is generally
impossible without explicit environment partitions. We then propose utilizing
cheaply available auxiliary information to generate partitions automatically,
and we characterize the conditions under which this framework succeeds.

Despite these efforts to improve IRM, challenges remain, including
non-identifiability for non-linear models under large distributional shifts
and inadequate performance on large-scale real-world datasets. Interestingly,
practitioners have found exceptional success in OOD generalization with
ensemble-based models, such as ensembles of independently trained models. Yet
ensemble models inevitably exploit spurious features, because each member is
trained by ERM and therefore incorporates them. Their success appears to
contradict IRM theory, which predicts that models relying on spurious features
should fail under distributional shift.
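
To make this setup concrete, the following is a minimal sketch of such an
ensemble: several models trained independently by plain ERM (differing only in
random initialization), with softmax outputs averaged at test time. The
architecture, synthetic data, and names (make_model, train_erm,
ensemble_predict) are illustrative assumptions, not code from the thesis:

    import torch
    import torch.nn as nn

    def make_model(in_dim=20, num_classes=2):
        # A small MLP; every ensemble member shares this architecture.
        return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                             nn.Linear(64, num_classes))

    def train_erm(model, x, y, epochs=50, lr=1e-2):
        # Plain ERM: minimize average cross-entropy on the pooled data,
        # with no invariance penalty and no environment labels.
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        return model

    def ensemble_predict(models, x):
        # Average the members' softmax outputs. Each member may latch onto
        # different spurious features; averaging dilutes any single one.
        with torch.no_grad():
            probs = torch.stack([torch.softmax(m(x), dim=-1) for m in models])
        return probs.mean(dim=0)

    # Synthetic stand-in data; independent random initializations yield
    # diverse ensemble members.
    x = torch.randn(256, 20)
    y = (x[:, 0] > 0).long()
    models = [train_erm(make_model(), x, y) for _ in range(5)]
    print(ensemble_predict(models, x[:4]))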

In this thesis, we unravel this mystery surrounding ensemble-based models and
robust OOD generalization. Our research reveals that ensemble-based models
reduce OOD prediction error precisely by leveraging a diverse range of
spurious features. Challenging the prevailing belief that OOD performance
hinges on learning invariant features, our findings indicate that
incorporating many diverse spurious features dilutes the impact of each one
and thereby improves overall OOD generalization. Experiments on the
MultiColorMNIST dataset substantiate the effectiveness of leveraging diverse
spurious features, in line with our theoretical analysis. Building on these
insights, we further develop techniques that achieve state-of-the-art (SOTA)
OOD performance for foundation models, such as CLIP, on large-scale datasets,
including ImageNet variants.


Date:                   Friday, 1 December 2023

Time:                   2:00pm - 4:00pm

Venue:                  Room 4475
                        Lifts 25/26

Chairman:               Prof. Jin QI (IEDA)

Committee Members:      Prof. Tong ZHANG (Supervisor)
                        Prof. Nevin ZHANG
                        Prof. Xiaofang ZHOU
                        Prof. Yuan YAO (MATH)
                        Prof. Mingming GONG (The University of Melbourne)


**** ALL are Welcome ****