Vision-Based Sign Language Processing: Recognition, Translation, and Generation
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

PhD Thesis Defence

Title: "Vision-Based Sign Language Processing: Recognition, Translation, and Generation"

By

Mr. Ronglai ZUO

Abstract:

Sign languages, also known as signed languages, are the primary means of communication among deaf and hard-of-hearing people, using both manual and non-manual parameters to convey information. These visual languages have their own grammatical rules and vocabulary, which usually differ from those of their spoken-language counterparts, resulting in a two-way communication gap between deaf and hearing people. In this thesis, we describe our work on three areas of sign language processing (SLP): recognition, translation, and generation, with the aim of narrowing this gap.

We first focus on the design of the sign encoder. Most previous sign encoders are single-modality models operating on RGB videos, which suffer from substantial visual redundancy such as background and signer appearance. To assist sign language modeling, we adopt keypoints, which are more robust to such redundancy and highlight critical body parts, e.g., the hands, as an additional modality in our sign encoder. Representing keypoints as a sequence of heatmaps reduces estimation noise and lets the keypoint stream share the same network architecture as the video stream without any ad-hoc design. The resulting sign encoder, the video-keypoint network (VKNet), has a two-stream architecture in which videos and keypoints are processed as two streams of information exchanged through inter-stream connections.

VKNet is first applied to continuous sign language recognition (CSLR), the core task in SLP. Training such a large network is non-trivial because of data scarcity. In addition to the widely adopted connectionist temporal classification as the main objective function, we propose a series of techniques, including sign pyramid networks with auxiliary supervision and self-distillation, to ease training. The overall model is referred to as VKNet-CSLR. Taking a step forward, we further extend it to support sign language translation (SLT) by appending a translation network.

We then move to the conventional task in SLP: isolated sign language recognition (ISLR). To improve model robustness to the large variation in sign durations, we extend VKNet to take video-keypoint pairs with varied temporal receptive fields as inputs. We also identify the existence of visually indistinguishable signs and propose two techniques based on natural language priors, language-aware label smoothing and inter-modality mixup, to assist model training.

In real-world scenarios, a system that can recognize and translate sign videos in real time is more user-friendly, motivating us to develop an online framework for CSLR and SLT. In contrast to previous CSLR works that perform training and inference over entire untrimmed sign videos (offline CSLR), our framework trains an ISLR model on short sign clips and makes predictions in a sliding-window manner. The framework can further be extended to boost offline CSLR performance and to support online SLT with additional lightweight networks.

The recognition and translation tasks convert sign videos into textual representations (glosses or text). As the reverse process, sign language generation (SLG) systems translate spoken languages into sign languages, completing the two-way communication loop. We also present a simple yet effective way to build an SLG baseline with 3D avatars.
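To make the heatmap representation mentioned in the abstract more concrete, the following Python sketch renders one frame's 2D keypoints as per-joint Gaussian heatmaps, so that a T-frame video yields a (T, K, H, W) tensor a standard video backbone can consume. The function name, resolution, sigma, and confidence weighting are illustrative assumptions, not the implementation used in the thesis.

```python
import numpy as np

def keypoints_to_heatmaps(keypoints, height=56, width=56, sigma=2.0):
    """Render one frame's 2D keypoints as a stack of Gaussian heatmaps.

    keypoints: array of shape (K, 3) holding (x, y, confidence), with x and y
    already scaled to the target resolution. Each keypoint becomes one channel.
    """
    K = keypoints.shape[0]
    heatmaps = np.zeros((K, height, width), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]
    for k, (x, y, conf) in enumerate(keypoints):
        if conf <= 0:  # skip joints the pose estimator failed to detect
            continue
        g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
        # weighting by detector confidence softens the effect of estimation noise
        heatmaps[k] = conf * g
    return heatmaps
```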
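The two-stream idea with inter-stream connections can be sketched in PyTorch as below. This is a minimal, hypothetical block (the name TwoStreamBlock and the specific layer choices are illustrative) meant only to show how video and keypoint features could be processed in parallel and exchanged; it is not VKNet's actual architecture.

```python
import torch
import torch.nn as nn

class TwoStreamBlock(nn.Module):
    """One stage of a hypothetical two-stream encoder: video and keypoint
    features are processed in parallel and exchange information through
    lightweight lateral (inter-stream) connections."""

    def __init__(self, channels):
        super().__init__()
        self.video_conv = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.kp_conv = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        # 1x1x1 convolutions carry features across the two streams
        self.video_to_kp = nn.Conv3d(channels, channels, kernel_size=1)
        self.kp_to_video = nn.Conv3d(channels, channels, kernel_size=1)

    def forward(self, video, keypoints):
        v = torch.relu(self.video_conv(video))
        k = torch.relu(self.kp_conv(keypoints))
        # bidirectional exchange: each stream is refined by the other
        v = v + self.kp_to_video(k)
        k = k + self.video_to_kp(v)
        return v, k
```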
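The sliding-window inference used by the online framework can likewise be sketched. Here islr_model, the window length, and the stride are placeholders, and the real system presumably adds post-processing to merge overlapping window predictions into a gloss sequence.

```python
import torch

@torch.no_grad()
def online_recognition(islr_model, frame_stream, window=16, stride=8):
    """Run an isolated-sign classifier over an incoming frame stream in a
    sliding-window fashion, emitting one gloss probability vector per window.
    frame_stream yields (C, H, W) tensors; islr_model maps a (1, C, T, H, W)
    clip to gloss logits."""
    buffer, posteriors = [], []
    for frame in frame_stream:
        buffer.append(frame)
        if len(buffer) == window:
            clip = torch.stack(buffer, dim=1).unsqueeze(0)  # (1, C, T, H, W)
            posteriors.append(islr_model(clip).softmax(dim=-1))
            buffer = buffer[stride:]  # slide the window forward by the stride
    return posteriors
```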
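Finally, a minimal sketch of what language-aware label smoothing might look like, assuming the natural language prior takes the form of a matrix of per-gloss text embeddings: the smoothing mass is distributed according to embedding similarity rather than uniformly. The temperature and weighting scheme are guesses for illustration, not the formulation used in the thesis.

```python
import torch
import torch.nn.functional as F

def language_aware_smoothing(target_idx, gloss_embeddings, epsilon=0.1, tau=0.1):
    """Build a soft label for one sample: spread the smoothing mass epsilon
    over non-target glosses in proportion to their similarity to the target
    gloss embedding, so confusable signs receive more probability.
    gloss_embeddings: (V, D) matrix of per-gloss text embeddings."""
    sims = F.cosine_similarity(
        gloss_embeddings[target_idx].unsqueeze(0), gloss_embeddings, dim=-1)
    sims[target_idx] = float('-inf')  # no smoothing mass back onto the target
    weights = torch.softmax(sims / tau, dim=0)
    soft = epsilon * weights
    soft[target_idx] = 1.0 - epsilon
    return soft
```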
Date: Monday, 5 August 2024
Time: 10:00am - 12:00noon
Venue: Room 5501 (Lifts 25/26)

Chairman: Prof. Wenjing YE (MAE)

Committee Members:
Dr. Brian MAK (Supervisor)
Prof. Raymond WONG
Dr. Dan XU
Prof. Chi Ying TSUI (ECE)
Prof. Tien Tsin WONG (CUHK)