Vision-Based Sign Language Processing: Recognition, Translation, and Generation
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

PhD Thesis Defence

Title: "Vision-Based Sign Language Processing: Recognition, Translation, and Generation"

By

Mr. Ronglai ZUO

Abstract:

Sign languages, also known as signed languages, are the primary means of communication among deaf and hard-of-hearing people, using both manual and non-manual parameters to convey information. These visual languages have their own grammatical rules and vocabulary, which usually differ from those of their spoken-language counterparts, resulting in a two-way communication gap between deaf and hearing people. In this thesis, we describe our work on three areas of sign language processing (SLP): recognition, translation, and generation, with the aim of narrowing this gap.

We first focus on the design of the sign encoder. Most previous sign encoders are single-modality models operating on RGB videos, which suffer from substantial visual redundancy such as background and signer appearance. To assist sign language modeling, we adopt keypoints, which are more robust to such redundancy and highlight critical body parts, e.g., the hands, as an additional modality in our sign encoder. Representing keypoints as a sequence of heatmaps reduces estimation noise and lets the keypoint stream share the same network architecture as the video stream without any ad-hoc design. The resulting sign encoder, the video-keypoint network (VKNet), has a two-stream architecture in which videos and keypoints are processed as two streams of information exchanged through inter-stream connections.

VKNet is first applied to continuous sign language recognition (CSLR), the core task in SLP. Training such a large network is non-trivial because of data scarcity. In addition to the widely adopted connectionist temporal classification as the main objective function, we propose a series of techniques, including sign pyramid networks with auxiliary supervision and self-distillation, to ease training. The overall model is referred to as VKNet-CSLR. Taking a step forward, we further extend it to support sign language translation (SLT) by appending a translation network.

We then move to the conventional task in SLP: isolated sign language recognition (ISLR). To improve model robustness to the large variation in sign durations, we extend VKNet to take video-keypoint pairs with varied temporal receptive fields as inputs. We also identify the existence of visually indistinguishable signs and propose two techniques based on natural language priors, language-aware label smoothing and inter-modality mixup, to assist model training.

In real-world scenarios, a system that can recognize and translate sign videos in real time is more user-friendly, motivating us to develop an online framework for CSLR and SLT. In contrast to previous CSLR works that perform training and inference over entire untrimmed sign videos (offline CSLR), our framework trains an ISLR model on short sign clips and makes predictions in a sliding-window manner. The framework can further be extended to boost offline CSLR performance and to support online SLT with additional lightweight networks.

The recognition and translation tasks convert sign videos into textual representations (glosses or text). As the reverse process, sign language generation (SLG) systems translate spoken languages into sign languages, completing the two-way communication loop. We also present a simple yet effective way to build an SLG baseline with 3D avatars.
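To make the heatmap representation mentioned in the abstract more concrete, the following Python sketch renders one frame's 2D keypoints as per-joint Gaussian heatmaps, so that a T-frame video yields a (T, K, H, W) tensor a standard video backbone can consume. The function name, resolution, sigma, and confidence weighting are illustrative assumptions, not the implementation used in the thesis.

```python
import numpy as np

def keypoints_to_heatmaps(keypoints, height=56, width=56, sigma=2.0):
    """Render one frame's 2D keypoints as a stack of Gaussian heatmaps.

    keypoints: array of shape (K, 3) holding (x, y, confidence), with x and y
    already scaled to the target resolution. Each keypoint becomes one channel.
    """
    K = keypoints.shape[0]
    heatmaps = np.zeros((K, height, width), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]
    for k, (x, y, conf) in enumerate(keypoints):
        if conf <= 0:  # skip joints the pose estimator failed to detect
            continue
        g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
        # weighting by detector confidence softens the effect of estimation noise
        heatmaps[k] = conf * g
    return heatmaps
```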
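The two-stream idea with inter-stream connections can be sketched in PyTorch as below. This is a minimal, hypothetical block (the name TwoStreamBlock and the specific layer choices are illustrative) meant only to show how video and keypoint features could be processed in parallel and exchanged; it is not VKNet's actual architecture.

```python
import torch
import torch.nn as nn

class TwoStreamBlock(nn.Module):
    """One stage of a hypothetical two-stream encoder: video and keypoint
    features are processed in parallel and exchange information through
    lightweight lateral (inter-stream) connections."""

    def __init__(self, channels):
        super().__init__()
        self.video_conv = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.kp_conv = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        # 1x1x1 convolutions carry features across the two streams
        self.video_to_kp = nn.Conv3d(channels, channels, kernel_size=1)
        self.kp_to_video = nn.Conv3d(channels, channels, kernel_size=1)

    def forward(self, video, keypoints):
        v = torch.relu(self.video_conv(video))
        k = torch.relu(self.kp_conv(keypoints))
        # bidirectional exchange: each stream is refined by the other
        v = v + self.kp_to_video(k)
        k = k + self.video_to_kp(v)
        return v, k
```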
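The sliding-window inference used by the online framework can likewise be sketched. Here islr_model, the window length, and the stride are placeholders, and the real system presumably adds post-processing to merge overlapping window predictions into a gloss sequence.

```python
import torch

@torch.no_grad()
def online_recognition(islr_model, frame_stream, window=16, stride=8):
    """Run an isolated-sign classifier over an incoming frame stream in a
    sliding-window fashion, emitting one gloss probability vector per window.
    frame_stream yields (C, H, W) tensors; islr_model maps a (1, C, T, H, W)
    clip to gloss logits."""
    buffer, posteriors = [], []
    for frame in frame_stream:
        buffer.append(frame)
        if len(buffer) == window:
            clip = torch.stack(buffer, dim=1).unsqueeze(0)  # (1, C, T, H, W)
            posteriors.append(islr_model(clip).softmax(dim=-1))
            buffer = buffer[stride:]  # slide the window forward by the stride
    return posteriors
```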
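Finally, a minimal sketch of what language-aware label smoothing might look like, assuming the natural language prior takes the form of a matrix of per-gloss text embeddings: the smoothing mass is distributed according to embedding similarity rather than uniformly. The temperature and weighting scheme are guesses for illustration, not the formulation used in the thesis.

```python
import torch
import torch.nn.functional as F

def language_aware_smoothing(target_idx, gloss_embeddings, epsilon=0.1, tau=0.1):
    """Build a soft label for one sample: spread the smoothing mass epsilon
    over non-target glosses in proportion to their similarity to the target
    gloss embedding, so confusable signs receive more probability.
    gloss_embeddings: (V, D) matrix of per-gloss text embeddings."""
    sims = F.cosine_similarity(
        gloss_embeddings[target_idx].unsqueeze(0), gloss_embeddings, dim=-1)
    sims[target_idx] = float('-inf')  # no smoothing mass back onto the target
    weights = torch.softmax(sims / tau, dim=0)
    soft = epsilon * weights
    soft[target_idx] = 1.0 - epsilon
    return soft
```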
Date: Monday, 5 August 2024
Time: 10:00am - 12:00noon
Venue: Room 5501 (Lifts 25/26)

Chairman: Prof. Wenjing YE (MAE)

Committee Members:
Dr. Brian MAK (Supervisor)
Prof. Raymond WONG
Dr. Dan XU
Prof. Chi Ying TSUI (ECE)
Prof. Tien Tsin WONG (CUHK)