The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Towards Automatic Testing and Fault Localization in Natural Language
Processing Systems"

By

Miss Jialun CAO


Abstract

Natural Language Processing (NLP) systems such as machine translation and
chatbots have become pervasive in daily life. Despite their widespread
application, how to test and debug such systems remains an open question
that attracts researchers from both academia and industry. Unlike
conventional software programs, NLP systems, whose core algorithm is a deep
neural network (DNN) model, do not make decisions by explicitly following
the program's control flow. Instead, decisions are made by propagating
inputs through the trained DNN model's tuned parameters. This indirect
correlation between the program source code and the resulting DNN impedes
the adoption of conventional software testing and debugging approaches. In
addition, the shortage of labeled datasets poses challenges to automatic
testing: while it is possible to feed arbitrary sentences as test inputs to
an NLP system, a test oracle that captures their expected outputs is hard
to define automatically.

This thesis proposes automatic approaches to generating test cases together
with test oracles to detect defects in NLP systems, and further proposes a
fault localization framework to aid debugging. It consists of the following
three studies.

The first study proposes a semantic-based testing approach for machine
translation systems. Existing methodologies mostly rely on metamorphic
relations designed at the textual level (e.g., Levenshtein distance) or the
syntactic level (e.g., the distance between grammar structures) to
determine the correctness of translation results. However, these
metamorphic relations do not consider whether the original and the
translated sentences have the same meaning (i.e., semantic similarity). To
this end, this study proposes an automatic testing approach for machine
translation systems based on semantic similarity checking. The key insight
is that the semantics concerning logical relations and quantifiers in
sentences can be captured by regular expressions (or deterministic finite
automata), to which efficient semantic equivalence/similarity checking
algorithms can be applied. Doing so improves the accuracy and F-score of
mistranslation detection.
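
To make the insight concrete, consider the following minimal,
self-contained sketch in Python (an illustration, not the study's actual
implementation): quantifiers such as "more than two" and "at least three"
are encoded as regular languages over a unary alphabet, where each '1'
stands for one entity satisfying the predicate, and language equivalence of
the corresponding DFAs serves as the semantic equivalence check. The DFA
class and the at_least, more_than, and equivalent helpers are illustrative
names.

    from collections import deque

    class DFA:
        """Deterministic finite automaton over the unary alphabet {'1'}."""
        def __init__(self, start, accepting, delta):
            self.start = start          # initial state id
            self.accepting = accepting  # set of accepting state ids
            self.delta = delta          # dict: (state, symbol) -> state

    def at_least(k):
        """DFA for 'at least k entities satisfy the predicate'
        (the language 1^k 1*); state k is an accepting sink."""
        delta = {(i, '1'): min(i + 1, k) for i in range(k + 1)}
        return DFA(0, {k}, delta)

    def more_than(k):
        """DFA for 'more than k entities satisfy the predicate'."""
        return at_least(k + 1)

    def equivalent(a, b, alphabet=('1',)):
        """Decide language equivalence by exploring the product automaton."""
        seen, queue = set(), deque([(a.start, b.start)])
        while queue:
            s, t = queue.popleft()
            if (s, t) in seen:
                continue
            seen.add((s, t))
            # A reachable pair that disagrees on acceptance means some
            # word distinguishes the two languages.
            if (s in a.accepting) != (t in b.accepting):
                return False
            for sym in alphabet:
                queue.append((a.delta[(s, sym)], b.delta[(t, sym)]))
        return True

    # "more than two" and "at least three" denote the same quantifier;
    # "at least two" does not.
    assert equivalent(more_than(2), at_least(3))
    assert not equivalent(more_than(2), at_least(2))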

The second study proposes a metamorphic testing approach for coreference
resolution (CR) systems. CR is the task of resolving different expressions
(e.g., named entities, pronouns) that refer to the same real-world entity
or event. It is a core NLP component that underlies and empowers major
downstream NLP applications such as machine translation, chatbots, and
question answering. Despite its broad impact, the problem of testing CR
systems has rarely been studied. A major difficulty is the shortage of
labeled datasets for testing: while it is possible to feed arbitrary
sentences as test inputs to a CR system, a test oracle that captures their
expected outputs (coreference relations) is hard to define automatically.
To address this problem, this study proposes a methodology to generate
CR-preserving test cases. It is the first study to focus on effectively
generating test inputs that respect coreference, complementing existing
work.
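
The following is a minimal sketch of the underlying metamorphic relation,
assuming a hypothetical resolve(text) API that returns a text's coreference
chains as a set of mention sets; the thesis's generation methodology is
more elaborate than this single hand-written transformation. The idea: a
coreference-preserving transformation should leave the reported chains
unchanged, and a mismatch flags a suspicious CR result.

    def cr_preserving_variant(sentence):
        """Prepend a sentence-level adverbial; the coreference relations
        among the original mentions should be unaffected."""
        return "Yesterday, " + sentence

    def metamorphic_test(resolve, sentence):
        """Oracle: the coreference chains of the seed sentence and its
        CR-preserving variant must coincide."""
        return resolve(sentence) == resolve(cr_preserving_variant(sentence))

    # Example: for the seed "John said he would come.", both the seed and
    # "Yesterday, John said he would come." should yield the single chain
    # {"John", "he"}.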

The third study designs a learning-based fault diagnosis and localization
framework for NLP programs. Although various studies have been conducted to
understand and detect program faults, they identify and repair suspicious
neurons in the trained DNN, which, unfortunately, might be a detour.
Instead, locating the faulty statements in the programs can provide
developers with more useful information for debugging. To this end, this
study proposes a learning-based fault diagnosis and localization framework
that maps the fault localization task to a learning problem. In particular,
it infers suspicious fault types by monitoring runtime features extracted
during DNN model training, and then locates the diagnosed faults in the NLP
programs. The evaluation exhibits the potential of learning-based fault
diagnosis for NLP programs.
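
A minimal sketch of the learning formulation follows (illustrative only,
not the thesis framework itself): per-epoch training signals such as the
loss trajectory and gradient norms are summarized into a feature vector, a
classifier trained on runs with known (e.g., injected) faults predicts a
fault type, and the prediction is mapped back to candidate program
statements. The feature set, fault labels, and keyword-based mapping below
are assumptions made for the example.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def extract_runtime_features(losses, grad_norms, accuracies):
        """Summarize per-epoch training signals into a fixed-length vector."""
        return np.array([
            losses[-1],                       # final training loss
            float(np.mean(np.diff(losses))),  # average loss trend
            float(np.max(grad_norms)),        # peak gradient norm (exploding?)
            float(np.min(grad_norms)),        # smallest norm (vanishing?)
            accuracies[-1] - accuracies[0],   # overall accuracy improvement
        ])

    # Illustrative fault labels and the program component each implicates.
    FAULT_KEYWORDS = {
        "bad_learning_rate": "lr",
        "wrong_loss": "loss",
        "wrong_activation": "activation",
    }

    def train_diagnoser(feature_matrix, fault_labels):
        """Fit a classifier mapping runtime behaviour to fault types, using
        training runs whose faults are known (e.g., injected mutants)."""
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(feature_matrix, fault_labels)
        return clf

    def diagnose_and_localize(clf, features, program_lines):
        """Predict the fault type, then surface program lines mentioning
        the implicated component as localization candidates."""
        fault = clf.predict(features.reshape(1, -1))[0]
        keyword = FAULT_KEYWORDS.get(fault, "")
        return fault, [ln for ln in program_lines if keyword and keyword in ln]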


Date:                   Wednesday, 10 January 2024

Time:                   2:30pm - 4:30pm

Venue:                  Room 3494
                        Lifts 25/26

Chairman:               Prof. Weiping LI (MATH)

Committee Members:      Prof. Shing Chi CHEUNG (Supervisor)
                        Prof. Lionel PARREAUX
                        Prof. Raymond WONG
                        Prof. Allen HUANG (ACCT)
                        Prof. Leonardo MARIANI (University of Milano-Bicocca)


**** ALL are Welcome ****