3D and 4D Reconstruction and Generation for World Modeling

PhD Thesis Proposal Defence


Title: "3D and 4D Reconstruction and Generation for World Modeling"

by

Mr. Zhenxing MI


Abstract:

High-fidelity world modeling, which reconstructs accurate geometry from imagery
and synthesizes realistic dynamic scenes, is a core capability for spatial AI,
robotics, and immersive applications. However, practical methods remain constrained by
three challenges: efficiency, scalability to large, heterogeneous real-world
environments, and unified modeling that bridges reconstruction and generation in
dynamic 4D settings.

This thesis addresses these challenges through a series of methods spanning
multi-view reconstruction, large-scale neural scene representations, and
geometry-aware video generation. First, for efficient multi-view stereo (MVS), we
introduce GBi-Net, which reformulates depth estimation as a generalized binary
search problem. By replacing dense depth sampling with a staged, classification-
based search, together with mechanisms to mitigate accumulated errors, GBi-Net
substantially reduces the cost-volume footprint while improving depth accuracy
on standard benchmarks.
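
To make the search concrete, the following sketch (in PyTorch) casts per-pixel
depth estimation as a staged, classification-based interval search in the spirit
of GBi-Net. The function names, the number of hypotheses per stage, and the
classify() interface are illustrative assumptions, not the thesis implementation.

    import torch

    def binary_search_depth(classify, depth_min, depth_max, num_stages=8, k=4):
        """Iteratively narrow a per-pixel depth search interval.

        classify(hypotheses) -> (H, W) long tensor picking the best of the k
        depth hypotheses per pixel, e.g. via a thin cost volume and an argmax.
        depth_min, depth_max: (H, W) tensors holding the current search range.
        """
        lo, hi = depth_min.clone(), depth_max.clone()
        for _ in range(num_stages):
            # Sample only k evenly spaced hypotheses inside the current
            # interval instead of densely sampling the whole depth range.
            t = torch.linspace(0.0, 1.0, k, device=lo.device).view(k, 1, 1)
            hypotheses = lo.unsqueeze(0) + t * (hi - lo).unsqueeze(0)  # (k, H, W)
            best = classify(hypotheses)                                # (H, W)
            # Keep only the sub-interval around the winning hypothesis,
            # shrinking the search range by a factor of (k - 1) per stage.
            step = (hi - lo) / (k - 1)
            center = hypotheses.gather(0, best.unsqueeze(0)).squeeze(0)
            lo, hi = center - 0.5 * step, center + 0.5 * step
        return 0.5 * (lo + hi)

Because each stage evaluates only a handful of hypotheses, the cost volume built
by the classifier stays small regardless of the overall depth range, which is
where the memory savings described above come from.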

Second, to model large-scale scenes with high fidelity and efficiency, we propose
a line of scalable NeRF frameworks. Switch-NeRF learns scene decomposition end-
to-end via a sparsely gated Mixture-of-Experts, routing 3D points to specialized
NeRF experts and learning consistent fusion across partitions. Building on this,
Switch-NeRF++ introduces a Heterogeneous Mixture of Hash Experts, pairing a
hash-based gating network with heterogeneous hash experts to better capture
scene heterogeneity, and adds efficient dispatching implementations that yield
substantial training and rendering speedups on very large urban scenes.
Complementarily, LeC²O-NeRF learns a continuous, compact occupancy predictor in
a self-supervised manner to enable accurate empty-space skipping in large-scale
NeRFs, avoiding the rigidity and overhead of high-resolution occupancy grids.
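
As a rough illustration of the sparsely gated routing described above, the
sketch below (PyTorch) dispatches 3D sample points to small MLP experts through
a learned top-1 gate and scatters the results back. The toy experts, the linear
gate, and the 4-channel output are placeholder assumptions; the actual
frameworks use full NeRF experts, and Switch-NeRF++ replaces both parts with
hash-based gating and heterogeneous hash experts.

    import torch
    import torch.nn as nn

    class PointMoE(nn.Module):
        """Toy top-1 Mixture-of-Experts over 3D points (illustrative only)."""
        def __init__(self, num_experts=8, in_dim=3, hidden=64, out_dim=4):
            super().__init__()
            self.out_dim = out_dim
            self.gate = nn.Linear(in_dim, num_experts)        # learned gating network
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                              nn.Linear(hidden, out_dim))
                for _ in range(num_experts)])

        def forward(self, points):                            # points: (N, 3)
            logits = self.gate(points)                        # (N, num_experts)
            weights, expert_ids = logits.softmax(-1).max(-1)  # top-1 routing
            out = points.new_zeros(points.shape[0], self.out_dim)
            for e, expert in enumerate(self.experts):
                idx = (expert_ids == e).nonzero(as_tuple=True)[0]
                if idx.numel() == 0:
                    continue
                # Dispatch only the points routed to expert e, then scatter
                # the gate-weighted outputs back to their original positions.
                out[idx] = weights[idx, None] * expert(points[idx])
            return out                                        # e.g. density + RGB

    rgb_sigma = PointMoE()(torch.rand(1024, 3))               # (1024, 4)

The per-expert Python loop here is only for clarity; as noted above, the actual
frameworks rely on efficient dispatching implementations to make this routing
fast on very large scenes.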

Finally, to unify 4D reconstruction and generation, we present One4D, a single
framework that outputs synchronized RGB frames and pointmaps across settings
ranging from single-image-to-4D generation to full-video 4D reconstruction.
One4D introduces Decoupled LoRA Control to preserve strong video priors while
achieving accurate RGB-geometry coupling, and Unified Masked Conditioning to
handle varying levels of conditioning sparsity seamlessly.
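
As a loose illustration of how one conditioning interface can span settings from
a single input image to a full input video, the sketch below (PyTorch) builds a
masked conditioning tensor: observed frames are kept, unobserved frames are
zeroed, and a binary mask channel records which is which. The shapes and the
helper name are assumptions for illustration, not One4D's actual formulation.

    import torch

    def build_masked_condition(frames, known):
        """frames: (T, C, H, W) RGB clip; known: (T,) bool, True where a frame
        is provided as conditioning (e.g. only frame 0 for image-to-4D,
        every frame for full-video 4D reconstruction)."""
        T, C, H, W = frames.shape
        mask = known.float().view(T, 1, 1, 1).expand(T, 1, H, W)
        cond = frames * mask                      # zero out unobserved frames
        return torch.cat([cond, mask], dim=1)     # (T, C + 1, H, W)

    clip = torch.rand(16, 3, 64, 64)
    image_to_4d = build_masked_condition(clip, torch.tensor([True] + [False] * 15))
    full_video = build_masked_condition(clip, torch.ones(16, dtype=torch.bool))

Because sparse and dense conditioning share the same tensor layout, a single
network can be trained and run across these settings, which is the role Unified
Masked Conditioning plays in the abstract above.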

Together, these contributions advance efficient and scalable 3D reconstruction,
large-scale neural scene modeling, and unified geometry-aware 4D generation,
toward general-purpose, high-quality world models for real-world perception
and simulation.


Date:                   Thursday, 13 February 2026

Time:                   10:00am - 12:00pm

Venue:                  Room 2132C
                        Lift 22

Committee Members:      Dr. Dan Xu (Supervisor)
                        Dr. Qifeng Chen (Chairperson)
                        Dr. Hao Chen