3D and 4D Reconstruction and Generation for World Modeling

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence



By

Mr. Zhenxing MI


Abstract:

This thesis studies world modeling from the perspective of geometry-grounded
visual modeling, where accurate reconstruction and realistic generation of 3D
and 4D scenes are core capabilities for spatial AI, robotics, and immersive
applications. Practical methods in this setting remain constrained by three
challenges: efficiency; scalability to large, heterogeneous real-world
environments; and unified modeling that bridges reconstruction and generation
in dynamic 4D settings.

This thesis addresses these challenges through a series of methods spanning
multi-view reconstruction, large-scale neural scene representations, and
geometry-aware video generation. First, for efficient multi-view stereo (MVS),
we introduce GBi-Net, which reformulates depth estimation as a generalized
binary search problem. By replacing dense depth sampling with staged,
classification-based search and mechanisms to mitigate accumulated errors,
GBi-Net substantially reduces the cost-volume footprint while improving depth
accuracy on standard benchmarks.
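The staged search idea can be illustrated with a toy sketch (this is not the
authors' implementation; in GBi-Net the bin classifier is a learned
cost-volume network, whereas here an oracle stands in for it):

```python
# Toy sketch of generalized binary search for depth estimation.
# Each stage splits the current depth range into a few bins; a
# classifier picks one bin, and the next stage searches only there,
# so per-stage cost stays constant instead of growing with depth
# resolution as in dense sampling.

def staged_depth_search(classify, d_min, d_max, n_bins=4, n_stages=8):
    """Narrow [d_min, d_max] by repeated bin classification.

    `classify(edges)` returns the index of the bin believed to
    contain the true depth (a learned classifier in practice).
    """
    lo, hi = d_min, d_max
    for _ in range(n_stages):
        width = (hi - lo) / n_bins
        edges = [lo + i * width for i in range(n_bins + 1)]
        k = classify(edges)              # which bin holds the depth?
        lo, hi = edges[k], edges[k + 1]  # recurse into that bin only
    return 0.5 * (lo + hi)               # bin centre as depth estimate

# Simulate a perfect classifier for a ground-truth depth of 3.7 m.
true_d = 3.7

def oracle(edges):
    for i in range(len(edges) - 1):
        if edges[i] <= true_d < edges[i + 1]:
            return i
    return len(edges) - 2

depth = staged_depth_search(oracle, 0.0, 10.0)
```

With 4 bins over 8 stages, the depth resolution is 10 m / 4^8, i.e. well
below a millimetre, while only 4 hypotheses are evaluated per stage.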

Second, to model large-scale scenes with high fidelity and efficiency, we
propose a line of scalable NeRF frameworks. Switch-NeRF learns scene
decomposition end-to-end via a sparsely gated Mixture-of-Experts, routing 3D
points to specialized NeRF experts and learning consistent fusion across
partitions. Building on this, Switch-NeRF++ introduces a Heterogeneous
Mixture of Hash Experts, combining a hash-based gating network with
heterogeneous hash experts to better capture scene heterogeneity, and pairs
it with efficient dispatching implementations for substantial training and
rendering speedups on very large urban scenes.
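A minimal sketch of sparsely gated top-1 routing of 3D points to experts,
in the spirit of Switch-NeRF (illustrative only: the real gating and expert
modules are learned MLP/hash networks, and here both are random linear maps):

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_in, d_out = 4, 3, 8

W_gate = rng.normal(size=(d_in, n_experts))               # gating weights
experts = [rng.normal(size=(d_in, d_out)) for _ in range(n_experts)]

def route(points):
    """Dispatch each 3D point to its top-1 expert and fuse outputs."""
    logits = points @ W_gate                              # (N, n_experts)
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)             # softmax gates
    top1 = probs.argmax(axis=1)                           # hard assignment
    out = np.zeros((len(points), d_out))
    for e in range(n_experts):                            # expert-wise batches
        idx = np.where(top1 == e)[0]
        if idx.size:
            # Scale by the gate probability so the routing decision
            # remains differentiable during training.
            out[idx] = (points[idx] @ experts[e]) * probs[idx, e:e + 1]
    return out, top1

pts = rng.normal(size=(16, 3))
feats, assignment = route(pts)
```

The point of the sparse gate is that each point touches exactly one expert,
so compute per sample stays constant as the number of experts (and hence
total scene capacity) grows.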

Finally, to unify 4D reconstruction and generation, we present One4D, a single
framework that outputs synchronized RGB frames and pointmaps across settings
ranging from single-image-to-4D generation to full-video 4D reconstruction.
One4D introduces Decoupled LoRA Control to preserve strong video priors while
achieving accurate RGB-geometry coupling, and Unified Masked Conditioning to
seamlessly handle varying conditioning sparsities.
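The idea of one interface covering both extremes can be sketched as a
masked-conditioning input (the function name, mask layout, and tensor shapes
here are assumptions for illustration, not One4D's actual interface):

```python
import numpy as np

def masked_condition(frames, observed):
    """Zero out unobserved frames and append the mask as a channel.

    A binary per-frame mask marks which frames are conditioning
    signal: one frame for single-image-to-4D generation, all frames
    for full-video 4D reconstruction, anything in between for
    sparse conditioning.
    """
    T, H, W, C = frames.shape
    mask = np.zeros((T, 1, 1, 1))
    mask[observed] = 1.0
    cond = frames * mask                              # hide unobserved frames
    mask_ch = np.broadcast_to(mask, (T, H, W, 1))
    return np.concatenate([cond, mask_ch], axis=-1)   # (T, H, W, C+1)

video = np.random.rand(8, 4, 4, 3)
gen_input = masked_condition(video, observed=[0])               # image-to-4D
recon_input = masked_condition(video, observed=list(range(8)))  # full video
```

Because only the mask changes between tasks, a single model can be trained
across the whole range of conditioning sparsities with one input format.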

Overall, this thesis advances geometry-grounded visual world modeling in three
key aspects: efficient 3D reconstruction, scalable representation of large and
heterogeneous scenes, and unified geometry-aware modeling for dynamic 4D
reconstruction and generation. Together, these contributions move toward a
more practical and unified foundation for visual world modeling in real-world
settings.


Date:                   Friday, 22 May 2026

Time:                   10:00am - 12:00 noon

Venue:                  Room 2128A
                        Lift 19

Chairman:               

Committee Members:      Dr. Dan XU (Supervisor)
                        Dr. Hao CHEN
                        Dr. Qifeng CHEN
                        Prof. Jun ZHANG (ECE)
                        Dr. Jiaqi YANG (NWPU)