SOFTWARE DEFECT PREDICTION ON UNLABELED DATASETS

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "SOFTWARE DEFECT PREDICTION ON UNLABELED DATASETS"

By

Mr. Jaechang NAM


Abstract

Defect prediction on new projects or projects lacking historical data 
is one of the interesting problems in defect prediction studies, largely 
because it is difficult to collect the bug information needed to label a 
dataset for training a prediction model. We call this problem defect 
prediction on unlabeled datasets. Cross-project defect prediction (CPDP) 
has tried to solve this problem by reusing prediction models built from 
other projects that have enough historical data. However, CPDP may not 
always build a strong prediction model because the datasets have different 
distributions. Approaches that use only unlabeled datasets have also tried 
to address the problem by adopting unsupervised learning techniques, but 
they share one major limitation: the need for manual effort.

To address these limitations, we propose three techniques that can build 
prediction models on unlabeled datasets. First, we propose TCA+, which 
improves the prediction performance of CPDP by adopting a state-of-the-art 
transfer learning technique, transfer component analysis (TCA). TCA+ 
extends TCA by suggesting the most appropriate normalization technique 
to apply before TCA is used for CPDP. Second, we propose heterogeneous 
defect prediction (HDP), which enables cross-project defect prediction on 
projects with heterogeneous metric sets. HDP derives a common metric set 
for the datasets used in CPDP by matching metrics that have similar 
distributions. Lastly, we propose CLAMI, which enables defect prediction 
by using only unlabeled datasets to build prediction models. The key idea 
of the CLAMI approach is to generate a training dataset by using the 
magnitude of metric values from an unlabeled dataset.
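
As a rough illustration of labeling by metric magnitude (a minimal sketch 
under stated assumptions, not the CLAMI procedure as defined in the 
thesis), the Python snippet below counts, for each instance, how many 
metric values exceed that metric's median and provisionally marks the 
higher-scoring instances as defect-prone. The function name, the use of 
medians as cutoffs, and the sample metric values are all assumptions made 
for this example.

```python
from statistics import median


def label_by_metric_magnitude(rows):
    """Return a provisional label (True = defect-prone) for each row.

    rows: list of equal-length lists of numeric metric values.
    Hypothetical helper; the cutoffs used here are assumptions.
    """
    if not rows:
        return []
    # Median of each metric (column) across all instances.
    medians = [median(col) for col in zip(*rows)]
    # Score = number of metrics whose value exceeds that metric's median.
    scores = [sum(v > m for v, m in zip(row, medians)) for row in rows]
    # Assumed cutoff: a score above the median score means defect-prone.
    cutoff = median(scores)
    return [s > cutoff for s in scores]


# Hypothetical metrics: [lines of code, cyclomatic complexity, fan-out]
unlabeled = [
    [120, 14, 9],
    [30, 2, 1],
    [200, 25, 12],
    [45, 3, 2],
]
labels = label_by_metric_magnitude(unlabeled)
print(labels)
```

The provisional labels could then serve as training labels for an 
ordinary supervised classifier, with no manual labeling step.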

Our proposed techniques, TCA+, HDP, and CLAMI, address limitations of 
defect prediction on unlabeled datasets. However, the three techniques 
still leave challenging issues unresolved, which we discuss as future 
work.


Date:			Thursday, 23 July 2015

Time:			1:30pm - 3:30pm

Venue:			Room 4483
 			Lifts 25/26

Chairman:		Prof. Hoi Sing Kwok (ECE)

Committee Members:	Prof. Sunghun Kim (Supervisor)
 			Prof. Shing Chi Cheung
 			Prof. Raymond Wong
 			Prof. Jing Wang (ISOM)
 			Prof. Martin Pinzger (Univ. of Klagenfurt, Austria)


**** ALL are Welcome ****