Efficient Frequent Pattern Mining over Probabilistic Database

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Efficient Frequent Pattern Mining over Probabilistic Database"

By

Mr. Yongxin TONG


Abstract

With the broad usage of Internet of Things (IoT), pervasive computing, and 
crowdsourcing techniques in the modern society, a growing number of data 
monitoring and collection systems lead to the uncertainty in massive data. 
For example, data integration of multiple data sources causes uncertainty. 
Thus, mining probabilistic data (or called uncertain data in this thesis) 
has become a hot topic in recent years. In particular, mining frequent 
patterns is one of the most fundamental problems in traditional data 
mining researches, thus mining frequent patterns over probabilistic 
databases has attracted much attention in the database and data mining 
communities. In the scenario of probabilistic data, the support of an 
pattern is a discrete random variable rather than a frequency of this 
pattern. Hence, the frequent pattern under uncertain data has different 
definitions due to the variation of probabilistic semantics, which even 
generates inconsistent results in current studies. However, the 
relationship between different definitions and the inconsistent results 
has not yet been thoroughly identified and explored. Furthermore, like its 
counterpart, mining frequent patterns in deterministic data, mining 
frequent patterns over uncertain data cannot avoid an exponential number 
of frequent patterns which cause the mining results less useful. In this 
thesis, we demonstrate how our solution can clarify the problems in 
existing studies and address the challenge of an exponential number of 
mining results. Moreover, our solution is also well applied to 
constructing effective indexes for query processing over other types of 
uncertain data, i.e., querying uncertain graphs. Furthermore, with the 
advent of big data era, we also propose the solution of finding frequent 
patterns in distributed uncertain data. To summarize, our study covers the 
following four aspects:

1) We conduct a comprehensive experimental study of existing 
representative frequent itemset mining algorithms over probabilistic 
databases and clarify several existing inconsistent conclusions;
2) We formalize a novel problem of mining probabilistic frequent closed 
itemsets in an uncertain transaction database and design an efficient 
solution, which includes a series of pruning techniques and an effective 
sampling algorithm.
3) We study the problem of efficient probabilistic supergraph containment 
queryand provide the efficient solution, which integrates probabilistic 
frequent pattern mining technique for constructing the index.
4) We propose the problem of finding frequent items in distributed 
probabilistic data and provide a local thresholds-based deterministic 
algorithm and a sketch-based sampling method to reduce both communication 
and computation costs.

We validate our solutions through extensive experiments and discuss 
several potential research directions of mining frequent pattern over 
probabilistic databases.


Date:			Friday, 6 December 2013

Time:			10:00am – 12:00noon

Venue:			Room 3494
 			Lifts 25/26

Chairman:		Prof. Danny Tsang (ECE)

Committee Members:	Prof. Lei Chen (Supervisor)
 			Prof. Frederick Lochovsky
 			Prof. Ke Yi
 			Prof. Weichuan Yu (ECE)
                        Prof. Jianliang Xu (Comp. Sci., Baptist U.)


**** ALL are Welcome ****