PhD Thesis Proposal Defence


Title: "Testing the Reliability of Deep Learning Applications"

by

Mr. Yongqiang TIAN


Abstract:

Deep Learning (DL) applications are widely deployed in diverse areas, such as
image classification, natural language processing, and autonomous driving
systems. Although these applications achieve outstanding accuracy, developers
have raised strong concerns about their reliability, since the logic of a DL
application is a black box to humans. Specifically, DL applications learn
their logic during stochastic training and encode it in the high-dimensional
weights of DL models. Unlike the source code of conventional software, such
weights are infeasible for humans to directly interpret, examine, and
validate. As a result, defects in DL applications are difficult to detect
during software development and may cause catastrophic accidents in
safety-critical missions. Therefore, it is critical to adequately test the
reliability of DL applications before they are deployed.

This thesis proposes automated approaches to testing DL applications from the
perspective of reliability. It consists of the following three studies.

The first study proposes object-relevancy, a property that reliable DL-based
image classifiers should comply with: a classification result should be based
on features relevant to the target object in a given image, rather than on
irrelevant features such as the background. This study further proposes a
metamorphic testing approach and two corresponding metamorphic relations to
assess whether this property is violated in image classification. The
evaluation shows that the proposed approach can effectively detect unreliable
inferences that violate the object-relevancy property, with an average
precision of 64.1% and 96.4% for the two relations, respectively. A subsequent
empirical study reveals that such unreliable inferences are prevalent in the
real world and that existing training strategies cannot effectively mitigate
this issue.
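
For illustration only, the following minimal Python sketch shows how a check
for one plausible metamorphic relation in this spirit could look (keeping only
the object region should preserve the predicted label). The classify function,
the object mask, and the relation itself are hypothetical stand-ins, not the
relations studied in the thesis.

import numpy as np

def classify(image):
    """Toy stand-in classifier: buckets the mean pixel intensity into a label."""
    return int(image.mean() // 64)

def violates_object_relevancy(image, object_mask):
    """MR sketch: keeping only the object (zeroing the background) should not
    change the predicted label; a label change signals a violation."""
    original_label = classify(image)
    object_only = image * object_mask      # background pixels set to zero
    return classify(object_only) != original_label

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.integers(0, 256, size=(32, 32)).astype(float)
    object_mask = np.zeros_like(image)
    object_mask[8:24, 8:24] = 1.0          # hypothetical object region
    # The toy classifier depends on the whole-image mean, so the check is
    # expected to flag it as violating object-relevancy.
    print("violation detected:", violates_object_relevancy(image, object_mask))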

The second study concentrates on the reliability issues induced by the model
compression of DL applications. Model compression can significantly reduce the
size of Deep Neural Network (DNN) models and thus facilitates the
dissemination of sophisticated, sizable DNN models. However, the prediction
results of a compressed model may deviate from those of its original model,
resulting in unreliable DL applications in deployment. To help developers
thoroughly understand the impact of model compression, it is essential to test
compressed models and find such deviated behaviors before dissemination. This
study proposes DFLARE, a novel search-based, black-box testing technique. The
evaluation shows that DFLARE consistently outperforms the baseline in both
efficacy and efficiency. More importantly, the triggering inputs found by
DFLARE can be used to repair up to 48.48% of the deviated behaviors.
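
For illustration only, the sketch below shows a naive random-perturbation
search for inputs on which a compressed model deviates from its original
model. It conveys the search-based, black-box idea but is not DFLARE's actual
algorithm, and the two toy models are hypothetical stand-ins.

import numpy as np

def find_deviation(predict_original, predict_compressed, seed_input,
                   max_iters=1000, step=0.05, seed=0):
    """Randomly perturb the seed input until the two models disagree."""
    rng = np.random.default_rng(seed)
    x = np.array(seed_input, dtype=float)
    for _ in range(max_iters):
        if predict_original(x) != predict_compressed(x):
            return x                       # deviation-triggering input found
        x = x + rng.normal(scale=step, size=x.shape)
    return None                            # no deviation found within budget

if __name__ == "__main__":
    # Toy stand-ins: compression has slightly shifted the decision boundary.
    original = lambda x: int(x.sum() > 0.0)
    compressed = lambda x: int(x.sum() > 0.5)
    trigger = find_deviation(original, compressed, seed_input=np.full(4, 0.2))
    print("triggering input:", trigger)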

The third study focuses on the reliability of DL-based vulnerability detection
(DLVD) techniques. DLVD techniques are designed to detect vulnerabilities in
source code. However, these techniques may only capture the syntactic patterns
of vulnerable code while ignoring the semantic information in the source code.
As a result, malicious users can easily fool such techniques by manipulating
the syntactic patterns of vulnerable code, e.g., by renaming variables. This
study proposes a new methodology to evaluate the learning ability of DLVD
techniques, i.e., whether a DLVD technique can capture the semantic
information in vulnerable source code and leverage it in detection.
Specifically, the approach constructs a special dataset in which the
vulnerable functions and the non-vulnerable ones have almost identical
syntactic code patterns but different semantic meanings. If a detection
approach cannot capture the semantic difference between the vulnerable
functions and the non-vulnerable ones, it will perform poorly on the
constructed dataset. Our preliminary results show that two common detection
approaches are ineffective in capturing the semantic information in source
code.
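
For illustration only, the minimal Python sketch below constructs one such
pair: two functions whose code differs only in a single loop bound, so their
syntactic patterns are nearly identical while their semantics differ. The
example pair is hypothetical and not drawn from the thesis dataset.

# Two functions with near-identical syntax but different semantics: only the
# loop bound differs, so a detector relying on syntactic patterns alone cannot
# separate them.
VULNERABLE_FUNC = """
void copy(char *dst, const char *src, int n) {
    for (int i = 0; i <= n; i++)   /* off-by-one: writes n+1 bytes */
        dst[i] = src[i];
}
"""

SAFE_FUNC = """
void copy(char *dst, const char *src, int n) {
    for (int i = 0; i < n; i++)    /* correct bound */
        dst[i] = src[i];
}
"""

def make_pair():
    """Return (code, label) entries for one vulnerable/non-vulnerable pair."""
    return [(VULNERABLE_FUNC, 1), (SAFE_FUNC, 0)]

if __name__ == "__main__":
    for code, label in make_pair():
        print("label:", label)
        print(code)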


Date:			Thursday, 19 January 2023

Time:			10:45am to 12:45pm

Zoom Meeting: 
https://hkust.zoom.us/j/96994112085?pwd=UW1TaytUYjZFQkEvTDlDbWtuTGFQdz09

Committee Members:	Prof. Shing-Chi Cheung (Supervisor)
 			Prof. Fangzhen Lin (Chairperson)
 			Dr. Lionel Parreaux
 			Prof. Raymond Wong


**** ALL are Welcome ****