The Hong Kong University of Science and Technology
Department of Computer Science and Engineering
PhD Thesis Defence
Title: "Vision-Based Sign Language Processing: Recognition, Translation,
and Generation"
By
Mr. Ronglai ZUO
Abstract:
Sign languages, also known as signed languages, are the primary means of
communication among deaf and hard-of-hearing people, using both manual and
non-manual parameters to convey information. These visual languages also have
unique grammatical rules and vocabularies that usually differ from those of
their spoken-language counterparts, resulting in a two-way communication gap
between deaf and hearing people. In this thesis, we elaborate on our efforts in
several areas of sign language processing (SLP): recognition, translation, and
generation, with the aim of narrowing this communication gap.
We first focus on the design of the sign encoder. Previous sign encoders are
mostly single-modality, focusing on RGB videos, and thus suffer from
substantial visual redundancy such as background and signer appearance. To
assist sign language modeling, we adopt keypoints, which are more robust to
visual redundancy and can highlight critical body parts, e.g., the hands, as an
additional modality in our sign encoder. By representing keypoints as a
sequence of heatmaps, estimation noise can be reduced, and the network
architecture for keypoint modeling can be kept consistent with that for video
modeling without any ad-hoc design. The resulting sign encoder, named the
video-keypoint network (VKNet), has a two-stream architecture in which videos
and keypoints are processed as two streams of information that exchange
features through inter-stream connections.
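
To make the two-stream design concrete, the following is a minimal,
illustrative sketch (in PyTorch) of a video-keypoint encoder with inter-stream
connections; the module names, shapes, and connection pattern are assumptions
for illustration rather than the actual VKNet implementation.

import torch
import torch.nn as nn

class TwoStreamBlock(nn.Module):
    """One stage: each stream convolves its own features, then exchanges
    information with the other via lightweight lateral connections."""
    def __init__(self, channels):
        super().__init__()
        self.video_conv = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.kpt_conv = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        # Inter-stream connections: 1x1x1 convs projecting one stream into the other.
        self.video_to_kpt = nn.Conv3d(channels, channels, kernel_size=1)
        self.kpt_to_video = nn.Conv3d(channels, channels, kernel_size=1)

    def forward(self, video_feat, kpt_feat):
        v = torch.relu(self.video_conv(video_feat))
        k = torch.relu(self.kpt_conv(kpt_feat))
        v = v + self.kpt_to_video(k)   # keypoint stream informs the video stream
        k = k + self.video_to_kpt(v)   # and vice versa
        return v, k

class VideoKeypointEncoder(nn.Module):
    def __init__(self, channels=32, num_blocks=2, num_keypoints=17):
        super().__init__()
        self.video_stem = nn.Conv3d(3, channels, kernel_size=3, padding=1)
        self.kpt_stem = nn.Conv3d(num_keypoints, channels, kernel_size=3, padding=1)
        self.blocks = nn.ModuleList([TwoStreamBlock(channels) for _ in range(num_blocks)])

    def forward(self, video, heatmaps):
        # video:    (B, 3, T, H, W) RGB frames
        # heatmaps: (B, K, T, H, W) one heatmap channel per keypoint
        v, k = self.video_stem(video), self.kpt_stem(heatmaps)
        for block in self.blocks:
            v, k = block(v, k)
        v = v.mean(dim=(-2, -1))   # spatially pool each stream: (B, C, T)
        k = k.mean(dim=(-2, -1))
        return torch.cat([v, k], dim=1)  # fused frame-level features: (B, 2C, T)

# Toy forward pass with random inputs (8 frames at 56x56, 17 keypoints).
feats = VideoKeypointEncoder()(torch.randn(1, 3, 8, 56, 56),
                               torch.randn(1, 17, 8, 56, 56))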
VKNet is first applied to continuous sign language recognition (CSLR), the core
task in SLP. Training such a large network is non-trivial because of data
scarcity. Besides using the widely adopted connectionist temporal
classification (CTC) loss as the main objective, we propose a series of
techniques, including sign pyramid networks with auxiliary supervision and
self-distillation, to ease training. The overall model is referred to as
VKNet-CSLR. Taking a step forward, we further extend it to support sign
language translation (SLT) by appending a translation network.
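
As a rough illustration of this kind of training objective, the sketch below
combines a main CTC loss with auxiliary CTC losses on intermediate predictions
and a self-distillation term in which the main head acts as teacher; the loss
weights and head arrangement are illustrative assumptions, not the thesis
configuration.

import torch
import torch.nn.functional as F

def cslr_loss(main_logits, aux_logits_list, targets, input_lens, target_lens,
              aux_weight=0.5, distill_weight=1.0):
    # main_logits / aux logits: (T, B, num_gloss_classes); targets: (B, L)
    ctc = torch.nn.CTCLoss(blank=0, zero_infinity=True)
    loss = ctc(main_logits.log_softmax(-1), targets, input_lens, target_lens)
    teacher = main_logits.detach().softmax(-1)  # main head serves as the teacher
    for aux in aux_logits_list:
        aux_log_probs = aux.log_softmax(-1)
        # Auxiliary CTC supervision on the pyramid/intermediate head.
        loss = loss + aux_weight * ctc(aux_log_probs, targets, input_lens, target_lens)
        # Self-distillation: align auxiliary predictions with the main head.
        loss = loss + distill_weight * F.kl_div(aux_log_probs, teacher,
                                                reduction="batchmean")
    return loss

# Toy usage with random tensors (20 frames, batch of 2, 100 gloss classes).
T, B, C, L = 20, 2, 100, 5
main = torch.randn(T, B, C)
aux = [torch.randn(T, B, C) for _ in range(2)]
targets = torch.randint(1, C, (B, L))
loss = cslr_loss(main, aux, targets,
                 input_lens=torch.full((B,), T, dtype=torch.long),
                 target_lens=torch.full((B,), L, dtype=torch.long))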
We then move to the conventional task in SLP: isolated sign language
recognition (ISLR). To improve model robustness to large variations in sign
duration, we extend VKNet to take video-keypoint pairs with varied temporal
receptive fields as inputs. In addition, we identify the existence of visually
indistinguishable signs and propose two techniques based on natural language
priors, language-aware label smoothing and inter-modality mixup, to assist in
model training.
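
The sketch below illustrates one plausible form of language-aware label
smoothing, in which the smoothing mass is distributed according to gloss
word-embedding similarity rather than uniformly, so that confusable signs
receive more of it; the embedding source, smoothing factor, and loss form are
illustrative assumptions.

import torch
import torch.nn.functional as F

def language_aware_smooth_targets(labels, gloss_embeddings, epsilon=0.2):
    # labels: (B,) class indices; gloss_embeddings: (num_classes, D) word vectors
    emb = F.normalize(gloss_embeddings, dim=-1)
    sim = emb @ emb.t()                        # (C, C) cosine similarities
    sim.fill_diagonal_(float("-inf"))          # do not smooth onto the true class
    smooth_dist = sim.softmax(dim=-1)          # similarity-weighted distribution
    one_hot = F.one_hot(labels, num_classes=emb.size(0)).float()
    return (1 - epsilon) * one_hot + epsilon * smooth_dist[labels]

# Usage: cross-entropy against the soft targets produced above.
logits = torch.randn(4, 1000)                  # (B, num_signs) classifier output
labels = torch.randint(0, 1000, (4,))
targets = language_aware_smooth_targets(labels, torch.randn(1000, 300))
loss = -(targets * logits.log_softmax(-1)).sum(-1).mean()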
In real-world scenarios, a system that can recognize and translate sign videos
in real time is more user-friendly, which motivates us to develop an online
framework for CSLR and SLT. In contrast to previous CSLR works that perform
training and inference over entire untrimmed sign videos (offline CSLR), our
framework trains an ISLR model on short sign clips and makes predictions in a
sliding-window manner. The framework can be further extended to boost offline
CSLR performance and to support online SLT with additional lightweight
networks.
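
The following sketch illustrates the sliding-window idea: an ISLR model trained
on short clips is applied to an incoming frame stream window by window, with a
gloss prediction emitted per window; the window size, stride, and prediction
rule are illustrative assumptions rather than the framework's actual settings.

import torch

def online_recognize(islr_model, frame_stream, window=16, stride=8):
    """Slide a fixed-size window over incoming frames and predict a gloss per
    window. frame_stream: (T, C, H, W) tensor standing in for frames arriving
    over time."""
    buffer, predictions = [], []
    with torch.no_grad():
        for frame in frame_stream:                 # frames arrive one by one
            buffer.append(frame)
            if len(buffer) >= window:
                clip = torch.stack(buffer[-window:]).unsqueeze(0)  # (1, T, C, H, W)
                logits = islr_model(clip)
                predictions.append(logits.argmax(dim=-1).item())   # predicted gloss id
                buffer = buffer[stride:]           # slide forward by the stride
    return predictions

# Toy usage with a dummy classifier over 100 sign classes.
dummy_model = lambda clip: torch.randn(clip.size(0), 100)
preds = online_recognize(dummy_model, torch.randn(64, 3, 112, 112))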
The recognition and translation tasks aim to convert sign videos into textual
representations (glosses or text). As the reverse process, sign language
generation (SLG) systems translate spoken languages into sign languages,
completing the two-way communication loop. We also present a simple yet
effective way to build an SLG baseline with 3D avatars.
Date: Monday, 5 August 2024
Time: 10:00am - 12:00 noon
Venue: Room 5501
Lifts 25/26
Chairman: Prof. Wenjing YE (MAE)
Committee Members: Dr. Brian MAK (Supervisor)
Prof. Raymond WONG
Dr. Dan XU
Prof. Chi Ying TSUI (ECE)
Prof. Tien Tsin WONG (CUHK)