Understanding What Deep Vision Models Depend on and Guiding What They Should
PhD Thesis Proposal Defence
Title: "Understanding What Deep Vision Models Depend on and Guiding What They
Should"
by
Mr. Weiyan XIE
Abstract:
Over the past decade, deep vision models have evolved from task-specific
classifiers into large-scale multimodal foundation models. However, their
real-world reliability is often undermined by misaligned visual dependencies:
classifiers can latch onto spurious backgrounds rather than causally relevant
features, while Multimodal Large Language Models (MLLMs) may rely on
misleading cues or language priors instead of pertinent visual evidence,
producing hallucinations. This thesis pursues two complementary objectives:
understanding what deep vision models currently depend on, and proactively
guiding what they ought to depend on. We advance these objectives along three
directions.
First, to understand visual dependencies in classifiers, we develop
diagnostic tools that go beyond surface-level correlations. ViT-CX estimates
the causal effect of semantic patches on Vision Transformer predictions,
while Contrastive Whole-Output Explanation (CWOX) explains a model's top-K
labels by systematically contrasting visually confusable competitors to
surface discriminative evidence.
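
As an illustration only, the idea of estimating a patch's effect on a prediction can be sketched with a generic masking experiment. This is not ViT-CX itself: the function name patch_effects, the toy scoring function, and the exhaustive mask set are all assumptions made for the example. Each patch's effect is taken as the mean model score over masks that keep the patch minus the mean score over masks that remove it.

```python
import itertools
from typing import Callable, List, Sequence


def patch_effects(score: Callable[[Sequence[bool]], float],
                  masks: List[Sequence[bool]]) -> List[float]:
    """Generic masking-based effect estimate (hypothetical helper, not ViT-CX).

    masks[j][i] is True when patch i is kept in mask j.
    The effect of patch i is mean(score | patch kept) - mean(score | patch masked).
    """
    n_patches = len(masks[0])
    effects = []
    for i in range(n_patches):
        kept = [score(m) for m in masks if m[i]]
        dropped = [score(m) for m in masks if not m[i]]
        if not kept or not dropped:
            # The patch is never varied by this mask set, so report no effect.
            effects.append(0.0)
        else:
            effects.append(sum(kept) / len(kept) - sum(dropped) / len(dropped))
    return effects


# Toy stand-in for a classifier: the score depends only on patch 1.
def toy_score(mask: Sequence[bool]) -> float:
    return 1.0 if mask[1] else 0.0


# All 8 keep/mask combinations over 3 patches.
masks = list(itertools.product([False, True], repeat=3))
effects = patch_effects(toy_score, masks)
print(effects)  # [0.0, 1.0, 0.0]
```

In this toy setting only patch 1 receives a non-zero effect, which is what a dependency diagnostic is meant to surface. A real model would replace toy_score with a forward pass on the masked image.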
Second, to guide classifiers toward generalizable features, we move beyond
standard Empirical Risk Minimization (ERM). Logit Attribution Matching (LAM)
anchors decisions to domain-invariant causal features by matching logit
attributions across semantic-sharing pairs with identical core semantics,
while Dual Risk Minimization (DRM) combats robustness vanishing during
foundation model fine-tuning by jointly optimizing ERM with Worst-case Risk
Minimization (WRM), where the WRM objective is estimated from zero-shot CLIP
similarities between images and LLM-generated visual descriptions of class
labels.
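
One possible reading of the joint ERM and WRM objective is a weighted sum of two cross-entropy terms, sketched below. Everything here is an assumption for illustration: the name dual_risk_loss, the weight lam, and especially the use of soft labels derived from zero-shot similarities as a stand-in for the worst-case risk term. The abstract does not specify the actual estimator, so this should not be taken as the DRM formulation.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target: List[float], logits: List[float]) -> float:
    """Cross-entropy between a target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


def dual_risk_loss(logits: List[float], label: int,
                   zero_shot_sims: List[float], lam: float = 1.0) -> float:
    """Hypothetical combined loss: ERM term + lam * proxy for the WRM term.

    zero_shot_sims are assumed to be per-class image-text similarity scores
    from a frozen zero-shot model; they are turned into soft labels here.
    """
    one_hot = [1.0 if k == label else 0.0 for k in range(len(logits))]
    soft_labels = softmax(zero_shot_sims)
    erm = cross_entropy(one_hot, logits)            # standard empirical risk
    wrm_proxy = cross_entropy(soft_labels, logits)  # assumed stand-in, not the thesis's estimator
    return erm + lam * wrm_proxy


loss = dual_risk_loss([2.0, 0.5, -1.0], label=0, zero_shot_sims=[3.0, 1.0, 0.5])
```

Setting lam to 0 recovers plain ERM, which makes the role of the second term easy to isolate in experiments.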
Third, we extend guided dependencies to MLLMs for long-document
understanding. InSight-Doc, an active multi-agent framework, replaces
fixed-resolution, single-pass pipelines with an iterative perception process
that selectively acquires high-resolution crops on demand from a
low-resolution global view, anchoring answers in actively gathered evidence
and advancing the Pareto frontier between accuracy and cost.
Collectively, by providing both diagnostic tools and targeted mechanisms to
rectify misaligned visual dependencies, this thesis aims to bridge the gap
between raw predictive performance and trustworthy deep vision models that
generalize reliably in the real world.
Date: Wednesday, 27 May 2026
Time: 2:00pm - 4:00pm
Venue: Room 2128B (Lift 19)
Committee Members: Prof. Nevin Zhang (Supervisor)
Prof. Dit-Yan Yeung (Chairperson)
Dr. Long Chen