COMP5421, Spring 2025


COMPUTER VISION, yet another new perspective in the year 2025
 


Professor Long QUAN

quan@cse.ust.hk

phone: 2358-7018

office: 3506
                             

http://www.cs.ust.hk/~quan/comp5421/index.html

 

Teaching Assistants:  

 

Dehao HAO

dhaoab@connect.ust.hk

 

Kuan LI

klibs@cse.ust.hk

 

Lecture room and time: 

Lecture room 2404, Wednesday and Friday from 3:00-4:20pm.


Course description:

In this abruptly changing epoch, this course offers an updated introduction to, and perspective on, the current fundamental developments in computer vision, which lie at the core of recent artificial intelligence developments and achievements. It covers a deterministic geometric approach to visual recognition and reconstruction, and then moves to a probabilistic introduction to, and foundation of, supervised and self-supervised visual learning and generation.

The content is challenging and is reserved for truly motivated postgraduate students and exceptionally mature undergraduate students.

Course outline

1.      Introduction

2.      Visual features

3.      Convolutional networks

4.      Self-supervised generative methods

5.      Vision geometry

6.      3D reconstruction

7.      Perspective

There will be one course project, carried out by a group of two students. The project has two stages, so it is in effect two connected projects: the second stage continues the first at a more advanced level.

There will be a final exam, which will be a long, closed-book, hand-written exam.


Tentative schedules and notes:   

 

Week

Date

Topics/Notes

Remarks and reading materials

1

5 Feb

7 Feb

What is computer vision?

 

Introduction to (classical) computer vision, and a brief historical review of the field.

Start reading the first classics.

 

Read Chapter 1 of the book ‘Vision’ by David Marr, 1982.

https://www.cse.ust.hk/~quan/comp5421/notes/marr.vision.chapter1.pdf

 

Read ‘Chapter 4’ (8 pages), Feature Points, by Long Quan, 2011.

https://www.cse.ust.hk/~quan/comp5421/notes/chapter4-longQuan.pdf

 

This part also reviews the classical, deterministic view of the visual representation of an image as a continuous function f(x), or more exactly f(x,y), or simply a signal x(u), which is the object of study. The mathematical tools are classical signal processing, functional analysis with Fourier transforms, wavelets, sparse sensing, PDEs, and scale-space analysis.

 

In this context, it is important to fully understand why and how we approach the classical low-level vision tasks of filtering, edge detection, and de-noising within the traditional mathematical, engineering, and AI frameworks.
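As a small, hypothetical illustration of treating an image as a sampled function f(x,y) and of low-level filtering/de-noising, the sketch below smooths a noisy synthetic image with a Gaussian filter. It assumes NumPy and SciPy are available; the image and noise level are made up for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Treat the image as samples of a continuous function f(x, y) on a grid.
rng = np.random.default_rng(0)
f = np.zeros((128, 128))
f[32:96, 32:96] = 1.0                              # a simple bright square
noisy = f + 0.2 * rng.standard_normal(f.shape)     # additive Gaussian noise

# Linear low-pass filtering: convolution with a Gaussian of scale sigma.
# A larger sigma removes more noise but also blurs edges (the scale trade-off).
denoised = gaussian_filter(noisy, sigma=2.0)

print("noise std before:", np.std(noisy - f))
print("noise std after :", np.std(denoised - f))
```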

2

12 Feb
 

14 Feb (rescheduled)

 

What are visual features?

 

Edges and the Canny detector.

 

 

The first topical lecture will be on edges:

 

Read ‘A Computational Approach to Edge Detection’ by John Canny, 1986, PAMI.

https://www.cse.ust.hk/~quan/comp5421/notes/canny1986.pdf

 

www.cse.ust.hk/~quan/comp5421/notes/edge.ppt

 

We see how an edge detector, cast in the signal processing framework as a filter, is much like a neuron: linear operators (convolution and differentiation) in function spaces, followed by a nonlinear activation in the form of thresholding. This also paves the way towards hand-designed filter banks and, later, learned filters.
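A minimal sketch of this ‘linear filtering followed by nonlinear thresholding’ view of an edge detector, using derivative-of-Gaussian filters. It is not the full Canny algorithm (which adds non-maximum suppression and hysteresis), and it assumes NumPy/SciPy; the sigma and threshold values are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def edge_map(image, sigma=1.5, threshold=0.1):
    """Derivative-of-Gaussian filtering followed by thresholding.

    Linear part: smooth with a Gaussian and take partial derivatives
    (implemented as derivative-of-Gaussian filters).
    Nonlinear part: threshold the gradient magnitude, like an activation.
    """
    gx = gaussian_filter(image, sigma=sigma, order=(0, 1))  # d/dx
    gy = gaussian_filter(image, sigma=sigma, order=(1, 0))  # d/dy
    magnitude = np.hypot(gx, gy)
    return magnitude > threshold

# Example on a synthetic step edge.
img = np.zeros((64, 64))
img[:, 32:] = 1.0
edges = edge_map(img)
print("edge pixels:", int(edges.sum()))
```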

3

19 Feb

21 Feb

Point Features and SIFT

 

(Local) point matching at the pixel level, and (global) image matching: early stages of visual recognition and understanding.

 

Global features to measure the ‘distance’ between images

The second topical lecture will be on scale-space:

Read ‘Distinctive Image Features from Scale-Invariant Keypoints’ by David Lowe, 2004, IJCV.

https://www.cse.ust.hk/~quan/comp5421/notes/lowe-ijcv2004.pdf

 

https://www.cse.ust.hk/~quan/comp5421/notes/features.ppt

 

We will see three important concepts. The first is the importance of scale, in images and in the sciences in general. The second is the more systematic emergence of descriptors, very much like word embeddings in language, yet still hand-crafted and small at 128 dimensions; this paves the way for more systematic, general visual encoding.

The descriptors are still local, built from local distributions.

Lastly, the point feature is the geometric point of 3D vision.
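A small illustrative fragment of the scale-space idea behind SIFT: blur the image at a sequence of scales and take differences of adjacent levels (Difference-of-Gaussians), whose local extrema give candidate keypoints. This is only a sketch assuming NumPy/SciPy, not Lowe's full detector (no sub-pixel refinement, orientation assignment, or 128-d descriptor).

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_pyramid(image, sigma0=1.6, k=2 ** 0.5, levels=5):
    """Gaussian scale-space and its Difference-of-Gaussians (DoG) levels."""
    blurred = [gaussian_filter(image, sigma0 * k ** i) for i in range(levels)]
    return [blurred[i + 1] - blurred[i] for i in range(levels - 1)]

# Candidate keypoints are the local extrema of the DoG stack across space and
# scale (SIFT then adds refinement, edge rejection, and a 128-d descriptor).
img = np.random.default_rng(1).random((128, 128))
dogs = dog_pyramid(img)
print([float(np.abs(d).max()) for d in dogs])
```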

 

Matching points, key points.

 

More global features: matching whole images, and image retrieval from image databases.

 

Read ‘A Metric for Distributions with Applications to Image Databases’ by Rubner et al., 1998 (the Earth Mover’s Distance, EMD).

https://www.cse.ust.hk/~quan/comp5421/notes/rubnerIccv98.pdf
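As a worked special case (not the general algorithm in the paper): in one dimension, and for histograms of equal total mass on a common grid, the Earth Mover's Distance reduces to the L1 distance between cumulative histograms. A minimal NumPy sketch:

```python
import numpy as np

def emd_1d(p, q, bin_width=1.0):
    """EMD between two 1-D histograms of equal mass on the same bins."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    assert np.isclose(p.sum(), q.sum()), "equal total mass required"
    # Work done moving mass = area between the two cumulative histograms.
    return np.abs(np.cumsum(p) - np.cumsum(q)).sum() * bin_width

print(emd_1d([1, 0, 0], [0, 0, 1]))   # one unit of mass moved two bins -> 2.0
```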

 

4

26 Feb
28  Feb

In search of learned visual features with discriminative approaches

 

https://www.cse.ust.hk/~quan/comp5421/notes/cnn.pdf

 

Features are representations, and representations are everything; these visual features can be learned with a convolutional neural network in a supervised framework.
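A minimal sketch of such a supervised convolutional feature learner, assuming PyTorch; the layer sizes and input shape are illustrative (a miniature LeNet-style stack, not the exact architectures of the papers below).

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Convolutional feature extractor followed by a linear classifier."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(          # learned filter banks
            nn.Conv2d(1, 16, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 7 * 7, num_classes)

    def forward(self, x):                        # x: (batch, 1, 28, 28)
        h = self.features(x)
        return self.classifier(h.flatten(1))

model = TinyCNN()
logits = model(torch.randn(4, 1, 28, 28))
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1, 2, 3]))
loss.backward()                                  # supervised gradient-based learning
print(logits.shape, float(loss))
```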

 

read ‘Gradient-based learning applied to document recognition’ by LeCun et al. 1998

https://www.cse.ust.hk/~quan/comp5421/notes/Lecun98.pdf

 

Read ‘ImageNet Classification with Deep Convolutional Neural Networks’ by Krizhevsky et al., 2012.

https://www.cse.ust.hk/~quan/comp5421/notes/alexnet2012.pdf

 

5

5 March

7 March

Convolutional neural networks

 

Supervised visual classification and recognition

 

 

Visual ‘segmentation’ and U-net

 

Semantic segmentation and object detection

Read ‘Deep Residual Learning for Image Recognition’ by He et al., 2015.

https://www.cse.ust.hk/~quan/comp5421/notes/resnet2015.pdf

 

A few important machine learning and statistical topics and methodologies will be revisited throughout the development of supervised learning and discussed in depth.

 

(Read the paper ‘The Power of Depth for Feedforward Neural Networks’ by Eldan and Shamir, 2016:

https://proceedings.mlr.press/v49/eldan16.pdf)

 

In addition to visual ‘classification’ by CNNs, there is another important visual task: ‘segmentation’. It is per pixel; the important point is that the output is not a label y but another image. This is where the U-net architecture appears, which already has some generative elements, given the encoding and decoding nature of its architecture.

Read the U-net paper:

‘U-Net: Convolutional Networks for Biomedical Image Segmentation’ by Ronneberger et al., 2015.

https://arxiv.org/abs/1505.04597

 

 

The segmentation U-net is a convolutional encoder-decoder that bridges classification and generation. It is not surprising that the first diffusion implementations worked on a U-net architecture.
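A minimal encoder-decoder sketch in the spirit of U-net, with a single down/up level and one skip connection; it is far smaller than the architecture in the paper and assumes PyTorch.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """One-level encoder-decoder with a skip connection: image in, image out."""
    def __init__(self, channels=1):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(channels, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.bottleneck = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, channels, 1))   # per-pixel output

    def forward(self, x):
        e = self.enc(x)                       # encoder features
        b = self.bottleneck(self.down(e))     # coarser representation
        u = self.up(b)                        # decode back to input resolution
        return self.dec(torch.cat([u, e], dim=1))   # skip connection

out = TinyUNet()(torch.randn(2, 1, 64, 64))
print(out.shape)                              # (2, 1, 64, 64): an image, not a label
```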

 

Going to unsupervised visual learning with a probabilistic view

 

https://www.cse.ust.hk/~quan/comp5421/notes/generative.pdf

6

12 March
14 March

Global descriptions, texture synthesis, generative nature of images

From supervised to unsupervised learning, and generative vs. discriminative approaches

 

Unsupervised learning

 

Read these papers for preparation.

Read ‘Auto-Encoding Variational Bayes’ by Kingma and Welling, 2013.

https://www.cse.ust.hk/~quan/comp5421/notes/vae-kingma.pdf

 

read the diffusion paper

‘Deep Unsupervised Learning using Nonequilibrium Thermodynamics’ by Jascha Sohl-Dickstein et al. 2015

https://arxiv.org/pdf/1503.03585.pdf

 

Read ‘Denoising Diffusion Probabilistic Models’ (DDPM) by Ho et al., 2020.

https://arxiv.org/abs/2006.11239

Read Yang Song’s dissertation, ‘Learning to Generate Data by Estimating Gradients of the Data Distribution’.

https://www.cse.ust.hk/~quan/comp5421/notes/song-yang-thesis-submit-augmented.pdf

 

 

Some basic concepts from information theory for understanding the probabilistic nature of generation in high dimensions: probability distributions, entropy, typical sets, sampling, and Monte Carlo methods.
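A tiny illustration of the sampling and Monte Carlo side of these concepts: estimate the differential entropy of a Gaussian by averaging -log p(x) over samples and compare it with the closed form 0.5 log(2*pi*e*sigma^2). NumPy only, with made-up parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0

# Monte Carlo estimate of differential entropy H = E[-log p(x)] by sampling.
x = rng.normal(0.0, sigma, size=100_000)
log_p = -0.5 * (x / sigma) ** 2 - 0.5 * np.log(2 * np.pi * sigma ** 2)
h_mc = -log_p.mean()

# Closed form for a 1-D Gaussian: 0.5 * log(2 * pi * e * sigma^2).
h_exact = 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)
print(h_mc, h_exact)   # the two agree up to Monte Carlo error
```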

7

19 March
21 March

Probabilistic modeling

Discussion of the variational autoencoder (VAE)

 

A systematic way of modeling and approximating the distribution p(x): the classical maximum likelihood approach to estimating the parameters of a parameterized distribution p_theta(x), realized in different ways ranging from energy-based methods to ‘flow’-based methods, via autoregressive models (as in LLMs) and the latent space of the VAE.
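A toy sketch of the maximum likelihood view: fit the parameters theta of a simple density p_theta(x) (here a 1-D Gaussian with learnable mean and log-variance) by minimizing the negative log-likelihood with gradient descent. It assumes PyTorch; the data and learning rate are made up for illustration.

```python
import torch

# Data from an unknown distribution that we want to model with p_theta(x).
x = torch.randn(5000) * 1.5 + 3.0

# Parameters theta of a simple Gaussian model.
mu = torch.zeros((), requires_grad=True)
log_var = torch.zeros((), requires_grad=True)
opt = torch.optim.Adam([mu, log_var], lr=0.05)

for _ in range(300):
    # Negative log-likelihood of the data under p_theta (up to a constant).
    nll = 0.5 * (log_var + (x - mu) ** 2 / log_var.exp()).mean()
    opt.zero_grad()
    nll.backward()
    opt.step()

print(float(mu), float(log_var.exp().sqrt()))   # approximately 3.0 and 1.5
```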

 

Diffusion

8

26 March

28 March (Mid term progress report)

 

Room 2404 from 3:00pm to 4:20pm,

Room 2504 from 4:30pm to 6:00pm

Mid term progress reporting and discussions

9

2 April (off)

4 April (off, Ching Ming Festival)

Mid-break week

 

9 April

11 April

Generative and Sampling

MCMC (Markov chain Monte Carlo), Markov chains and discrete diffusion, continuous Langevin diffusion (see the sketch after this week’s readings)

Transportation

(Some basic concepts: read the diffusion summary slides by De Bortoli, in a continuous formulation:

https://vdeborto.github.io/project/generative_modeling/session_3.pdf)

 

Read ‘Flow Matching for Generative Modeling’ by Lipman et al.

‘Flow Matching Tutorial’, NeurIPS 2024, on flow matching for generative modeling.

 

Flow-based approaches
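A minimal NumPy sketch of the continuous Langevin diffusion mentioned above: unadjusted Langevin dynamics that samples from a known 1-D density using only its score (the gradient of the log-density). The target density, step size, and number of steps are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def score(x, mu=2.0, sigma=1.0):
    """Score function grad_x log p(x) of a Gaussian target density."""
    return -(x - mu) / sigma ** 2

# Unadjusted Langevin dynamics: x <- x + (eps/2) * score(x) + sqrt(eps) * noise.
eps = 0.01
x = rng.standard_normal(10_000)          # arbitrary initialization
for _ in range(2000):
    x = x + 0.5 * eps * score(x) + np.sqrt(eps) * rng.standard_normal(x.shape)

print(x.mean(), x.std())                 # close to the target mean 2.0 and std 1.0
```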

10

16 April

18 April (off, Good Friday)

Back to the deterministic and low-dimensional geometry for 3D reconstruction

 

Basic geometric concepts: projective space

 

Transformations, Similarities and Euclidean geometry

 https://www.cse.ust.hk/~quan/comp5421/notes/geom.pptx

(geom.pptx is from intro.ppt)

Read Chapters 2 and 3 of the ‘lecture notes’ by Long Quan, 2011.

https://www.cse.ust.hk/~quan/comp5421/notes/chap2-3-2015.pdf

11

23 April


25 April (ICLR 2025)

What is a camera,

and where is it?

 

Single view geometry.

https://www.cse.ust.hk/~quan/comp5421/notes/single.ppt
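A minimal sketch of the pinhole camera model underlying single-view geometry: a 3x4 projection matrix P = K [R | t] maps homogeneous 3-D points to homogeneous image points. It assumes NumPy, with made-up intrinsics and pose.

```python
import numpy as np

# Intrinsics K (focal length and principal point) and pose [R | t].
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0,   0.0,   1.0]])
R = np.eye(3)                              # camera aligned with the world axes
t = np.array([0.0, 0.0, 5.0])              # world origin 5 units in front
P = K @ np.hstack([R, t[:, None]])         # 3x4 projection matrix

X = np.array([0.5, -0.2, 0.0, 1.0])        # a 3-D point in homogeneous coordinates
x = P @ X
u, v = x[:2] / x[2]                        # perspective division
print(u, v)                                # pixel coordinates of the projection
```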

12

30 April

2 May

Two-view geometry

https://www.cse.ust.hk/~quan/comp5421/notes/two.ppt

https://www.cse.ust.hk/~quan/comp5421/notes/three.ppt
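A minimal sketch of two-view geometry: the epipolar constraint x2^T F x1 = 0 and a plain (unnormalized) eight-point estimate of the fundamental matrix F from point correspondences, checked on synthetic cameras. It assumes NumPy, and omits the normalization and robust estimation used in practice.

```python
import numpy as np

def fundamental_eight_point(x1, x2):
    """Linear eight-point estimate of F from homogeneous correspondences (3xN)."""
    # Each correspondence gives one row of the constraint x2^T F x1 = 0.
    A = np.column_stack([x2[0] * x1[0], x2[0] * x1[1], x2[0],
                         x2[1] * x1[0], x2[1] * x1[1], x2[1],
                         x1[0], x1[1], np.ones(x1.shape[1])])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    U, S, Vt = np.linalg.svd(F)            # enforce rank 2 (the epipolar constraint)
    return U @ np.diag([S[0], S[1], 0.0]) @ Vt

# Synthetic check: two cameras observing random 3-D points.
rng = np.random.default_rng(0)
X = np.vstack([rng.uniform(-1, 1, (2, 12)), rng.uniform(4, 6, (1, 12)), np.ones((1, 12))])
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[1.0], [0.0], [0.0]])])   # translated camera
x1, x2 = P1 @ X, P2 @ X
F = fundamental_eight_point(x1 / x1[2], x2 / x2[2])
print(np.abs(np.sum((x2 / x2[2]) * (F @ (x1 / x1[2])), axis=0)).max())   # ~0
```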

 

13

7 May

9 May

Robust geometry estimation and 3D reconstruction

 

New perspectives

SfM (structure from motion), dense reconstruction, surface triangulation and refinement

https://www.cse.ust.hk/~quan/comp5421/notes/reconstruction2019.pptx

 

 

 

Final exam: a long, closed-book written exam.

Room 2465 (lift 25/26), 4:30pm-8:30pm

 


 


 

Course projects:

Visual generation of images or 3D objects

https://dhoho2002.github.io/GenVision/comp5421.html

 
There will be one course project, carried out by a group of two students.

The project has two stages, so it is in effect two connected projects: the second stage continues the first at a more advanced level.


 
 
 
Area in which course can be counted:

VG

Background:

Equivalent prerequisites in linear algebra (e.g., the linear algebra knowledge of COMP3211), object-oriented programming (e.g., COMP2012), and algorithm design and analysis (e.g., COMP171, COMP271) are required. Basic knowledge of image processing and machine learning is helpful.

Course outline/content (by major topics):

1. Introduction
2. Visual features and descriptors (low level feature detection and description)
3. Visual recognition and CNN
4. Vision geometry (mid-level geometry, projective geometry, cameras, and 3D reconstruction)
5. Visual recognition (high-level object recognition and image understanding).
6. Perspective.

Reference books: 

* Image-based Modeling, Long Quan, 2010, Springer.
* Three-Dimensional Computer Vision, O. Faugeras, MIT Press, 1993
* The Geometry of Multiple Images, Faugeras, Luong, and Papadopoulo
* Multiple View Geometry in Computer Vision, Hartley and Zisserman
* Robot Vision, B.K.P. Horn, MIT Press, 1986
* Computer Vision, D. Ballard and C. Brown, Prentice-Hall, 1982
* Vision, David Marr, Freeman, 1982
* Computer Vision, A Modern Approach, D. Forsyth and J. Ponce

Grading scheme:

The first-stage report of the project: X%
The second-stage report of the project: Y%

Mid-term exam (written): Z1%
Final exam (written): Z2%