Understanding What Deep Vision Models Depend on and Guiding What They Should

PhD Thesis Proposal Defence


Title: "Understanding What Deep Vision Models Depend on and Guiding What They 
Should"

by

Mr. Weiyan XIE


Abstract:

Over the past decade, deep vision models have evolved from task-specific 
classifiers into large-scale multimodal foundation models. However, their 
real-world reliability is often undermined by misaligned visual dependencies: 
classifiers can latch onto spurious backgrounds rather than causally relevant 
features, while Multimodal Large Language Models (MLLMs) may rely on 
misleading cues or language priors instead of pertinent visual evidence, 
producing hallucinations. This thesis pursues two complementary objectives: 
understanding what deep vision models currently depend on, and proactively 
guiding what they ought to depend on. We advance these objectives along three 
directions.

First, to understand visual dependencies in classifiers, we develop 
diagnostic tools that go beyond surface-level correlations. ViT-CX estimates 
the causal effect of semantic patches on Vision Transformer predictions, 
while Contrastive Whole-Output Explanation (CWOX) explains a model's top-K 
labels by systematically contrasting visually confusable competitors in 
order to surface discriminative evidence.

Second, to guide classifiers toward generalizable features, we move beyond 
standard Empirical Risk Minimization (ERM). Logit Attribution Matching (LAM) 
anchors decisions to domain-invariant causal features by matching logit 
attributions across semantic-sharing pairs, i.e., pairs that share identical 
core semantics. Dual Risk Minimization (DRM) counters the loss of robustness 
during foundation model fine-tuning by jointly optimizing ERM and Worst-case 
Risk Minimization (WRM), with the worst-case risk estimated from zero-shot 
CLIP similarities between images and LLM-generated visual descriptions of 
class labels.
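
As a rough illustration of what jointly optimizing an empirical risk and a 
worst-case risk proxy could look like for a single example, the sketch below 
sums a standard cross-entropy term and a cross-entropy term against soft 
targets derived from per-class similarity scores. The weighted-sum form, the 
weight lam, the temperature, the soft-target construction, and all function 
names are assumptions made for illustration; they are not taken from the 
thesis and may differ from the actual DRM formulation.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_total = math.log(sum(math.exp(s - m) for s in scores))
    return [s - m - log_total for s in scores]


def cross_entropy(target_probs, logits):
    """Cross-entropy between a target distribution and softmax(logits)."""
    return -sum(t * lp for t, lp in zip(target_probs, log_softmax(logits)))


def dual_risk(logits, label, clip_similarities, lam=1.0, temperature=1.0):
    """Hypothetical per-example loss: empirical term + lam * worst-case proxy.

    clip_similarities holds one score per class, standing in for zero-shot
    image-text similarities to class descriptions (assumed to be precomputed).
    """
    one_hot = [1.0 if k == label else 0.0 for k in range(len(logits))]
    empirical = cross_entropy(one_hot, logits)

    # Assumption: similarities are turned into soft targets with a softmax.
    scaled = [s / temperature for s in clip_similarities]
    soft_targets = [math.exp(lp) for lp in log_softmax(scaled)]
    worst_case_proxy = cross_entropy(soft_targets, logits)

    return empirical + lam * worst_case_proxy
```

Setting lam=0 in this sketch recovers the plain cross-entropy (ERM) term, 
which makes the role of the second term easy to isolate.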

Third, we extend guided dependencies to MLLMs for long-document 
understanding. InSight-Doc, an active multi-agent framework, replaces 
fixed-resolution, single-pass pipelines with an iterative perception process 
that selectively acquires high-resolution crops on demand from a 
low-resolution global view, anchoring answers in actively gathered evidence 
and advancing the Pareto frontier between accuracy and cost.

Collectively, by providing both diagnostic tools and targeted mechanisms to 
rectify misaligned visual dependencies, this thesis aims to bridge the gap 
between raw predictive performance and trustworthy deep vision models that 
generalize reliably in the real world.


Date:                   Wednesday, 27 May 2026

Time:                   2:00pm - 4:00pm

Venue:                  Room 2128B
                        Lift 19

Committee Members:      Prof. Nevin Zhang (Supervisor)
                        Prof. Dit-Yan Yeung (Chairperson)
                        Dr. Long Chen