The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

PhD Thesis Defence

"Video Content Analysis and Its Applications for Multimedia Authoring of Lectures"

By

Mr. Feng Wang

Abstract

Video content analysis has attracted increasing attention in recent years due to the growing availability of digital video data. In this work, we focus on extracting three basic features for video content analysis, namely text, gesture and posture, and on applying them to multimedia authoring of lectures.

For text analysis, we address the problem of text recognition in low-resolution videos. A novel text super-resolution algorithm is proposed that reconstructs high-resolution textboxes by integrating multiple frames. Our experiments show that text recognition is significantly improved after super-resolution.

For gesture detection and recognition, we propose algorithms for both offline and real-time applications. In the offline setting, to deal with the lack of salient features in gesture detection, different cues, including frame difference, skin color and the gesture trajectory, are combined to detect candidate gestures. HMM (Hidden Markov Model) based gesture recognition is then employed to refine the detection results. A further advantage is that intentional gestures are extracted and separated from non-gesture movements, a problem that has not been addressed before. For real-time applications, to meet efficiency requirements in addition to accuracy, the HMMs for complete gesture recognition are modified to recognize incomplete gestures, so that a gesture can be identified before its complete trajectory is observed. Speech is combined with the visual cues to further improve the accuracy and responsiveness of gesture detection.

For posture, two different algorithms are proposed. The first is more appropriate for robust head pose estimation in offline applications, employing visual cues and image processing techniques.
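As a rough illustration of the multi-frame integration idea behind the text super-resolution step, the sketch below averages several registered low-resolution textbox crops after simple nearest-neighbour upsampling, so that noise cancels across frames. The function name, the upsampling scheme and the plain averaging are illustrative assumptions for this sketch, not the reconstruction algorithm defended in the thesis.

```python
import numpy as np

def super_resolve_textbox(frames, scale=2):
    """Toy multi-frame integration (assumed scheme, for illustration only):
    upsample each registered low-resolution textbox crop by nearest-neighbour
    replication, then average across frames to suppress per-frame noise."""
    upsampled = [np.kron(f.astype(float), np.ones((scale, scale)))
                 for f in frames]
    # Averaging registered observations raises the signal-to-noise ratio,
    # which is what makes the subsequent OCR step more reliable.
    return np.mean(upsampled, axis=0)

# Usage: two noisy 2x2 crops of the same textbox region.
rng = np.random.default_rng(0)
clean = np.array([[50.0, 200.0], [200.0, 50.0]])
frames = [clean + rng.normal(0, 5, clean.shape) for _ in range(8)]
hr = super_resolve_textbox(frames)          # shape (4, 4)
```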
In the second algorithm, besides visual cues, we focus on effectively exploiting contextual information, i.e. the temporal smoothness of head movement, to refine the pose estimation. This is especially useful for low-resolution images, where direct estimation from a single image is not reliable enough. We propose an adaptive online learning approach to cope with the pace of different presenting styles and presenters. The second algorithm is efficient enough for most real-time applications.

Based on the video content analysis, we employ the extracted features in different applications, including the synchronization of video and external documents based on text analysis, offline video enhancement and editing by integrating gesture, posture and text, and a simulated smartboard that demonstrates the effectiveness and efficiency of the proposed algorithms.

Date: Monday, 28 August 2006
Time: 4:00p.m. - 6:00p.m.
Venue: Room 5501 (Lifts 25-26)

Chairman: Prof. James Westland (ISMT)

Committee Members:
Prof. Ting-Chuen Pong (Supervisor)
Prof. Chong-Wah Ngo (Supervisor, CityU)
Prof. Long Quan
Prof. Chiew-Lan Tai
Prof. Bing Zeng (ECE)
Prof. Qing Li (Comp. Sci., CityU)

**** ALL are Welcome ****