Recovering 3D Structures from 2D Images and Videos

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Recovering 3D Structures from 2D Images and Videos"

By

Mr. Likang WANG


Abstract:

As humans, we exist in a three-dimensional space and perceive the world through
our eyes and sense of touch. However, capturing the three-dimensional world we
observe is far from a straightforward task. Among the available sensors, only
cameras can emulate the human visual system. Yet, conventional cameras typically
provide only two-dimensional images. This limitation means our understanding of
the world is akin to a blind person feeling an elephant. Over the past few
decades, many researchers have strived to recover the three-dimensional
structure of scenes from these two-dimensional images, but even today, reaching
satisfactory quality and efficiency in reconstruction remains a significant
challenge.

To navigate this challenge, we aim first to explore the limits of reconstruction
quality. Specifically, we propose a novel coarse-to-fine strategy for scene
reconstruction. This approach begins with estimating an initial spatial position
for each pixel in the image. Next, we introduce a self-supervised method for
estimating the error distribution between our preliminary predictions and the
ground truth. This estimate allows us to concentrate our efforts on the areas
most likely to be accurate and to inspect them more finely. As a result, our
strategy leads to significant improvements in reconstruction quality under the
same time and space constraints.
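
The coarse-to-fine idea above can be illustrated with a minimal sketch. It is
not the method of the thesis: as a stand-in for the self-supervised error
estimate, it assumes the spread of a per-pixel probability distribution over
coarse depth candidates, and the function name, the parameters num_fine and k,
and the array layout are all hypothetical choices made for illustration.

```python
import numpy as np

def coarse_to_fine_hypotheses(prob, depths, num_fine=8, k=1.5):
    """Turn a coarse per-pixel depth distribution into refined candidates.

    prob:   (D, H, W) probabilities over D coarse depth candidates,
            assumed to sum to 1 along axis 0 (illustrative input layout).
    depths: (D,) coarse depth candidates.
    Returns the coarse estimate, its spread, and (num_fine, H, W) candidates.
    """
    d = depths[:, None, None]
    mean = (prob * d).sum(axis=0)                      # coarse depth per pixel
    std = np.sqrt((prob * (d - mean) ** 2).sum(axis=0))  # assumed error proxy

    # Search only within k standard deviations of the coarse estimate,
    # clipped to the range covered by the coarse candidates.
    lo = np.maximum(mean - k * std, depths.min())
    hi = np.minimum(mean + k * std, depths.max())

    # Spread the fine candidates evenly over each pixel's own interval.
    steps = np.linspace(0.0, 1.0, num_fine)[:, None, None]
    fine = lo + steps * (hi - lo)
    return mean, std, fine
```

Under these assumptions, a pixel with a confident coarse prediction receives a
narrow search interval and an uncertain pixel a wide one, so a fixed number of
fine candidates samples the confident pixel much more densely.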

We then explore how to achieve more satisfactory reconstruction results while
meeting the requirements of real-time inference efficiency. For this purpose, we
propose two innovative solutions.

Firstly, we focus on achieving the highest possible quality in three-dimensional
scene reconstruction while maintaining an inference speed of more than 30 frames
per second. To do this, we propose a feature fusion method capable of
simultaneously extracting and preserving the low-frequency and high-frequency
information between video frames. It delivers substantial improvements on large
planes and fine details without introducing extra computational cost. In
addition, based on the sparsity of the three-dimensional space, we propose an
accurate and efficient loss correction strategy, enabling more comprehensive
scene recovery.

Secondly, we aim for the most accurate detail recovery possible while updating
at least once every 1/30 of a second. We factor in semantic
consistency between frames to facilitate a swift preliminary screening of points
in the three-dimensional space. This is followed by a meticulous evaluation
focused only on a minority of spatial areas. As a result, our method not only
achieves low-latency updates but also provides significantly superior detail
quality.

While the novel methods we propose advance the field of three-dimensional
structure recovery from two-dimensional images and videos, our approach is not
flawless. Therefore, we also discuss the shortcomings of our current work and
propose promising directions for future research.


Date:                   Friday, 19 January 2024

Time:                   1:00pm - 3:00pm

Venue:                  Room 5510
                        Lifts 25/26

Chairman:               Prof. Howard LUONG (ECE)

Committee Members:      Prof. Lei CHEN (Supervisor)
                        Prof. Junxian HE
                        Prof. Qiong LUO
                        Prof. Can YANG (MATH)
                        Prof. Haibo HU (HKPU)


**** ALL are Welcome ****