Efficient Frequent Pattern Mining over Probabilistic Databases

PhD Thesis Proposal Defence


Title: "Efficient Frequent Pattern Mining over Probabilistic Databases"

by

Mr. Yongxin TONG


ABSTRACT:

With the broad usage of Internet of Things (IoT) and pervasive computing 
techniques in the modern society, a growing number of real-time data 
monitoring systems lead to the uncertainty in massive collected data. For 
example, data integration of multiply data sources causes uncertainty. 
Thus, mining probabilistic data (or called uncertain data in this thesis) 
has become a hot research topic in recent years. In particular, since 
mining frequent patterns is one of the most fundamental problems in 
traditional data mining researches, mining frequent patterns over 
probabilistic databases has attracted much attention in the database and 
the data mining communities. In the scenario of uncertain data, the 
support of an itemset is a discrete random variable rather than the 
frequency of this itemset. Hence, unlike the corresponding problem in 
deterministic databases where the frequent itemset has a unique 
definition, the frequent itemset under uncertain data has different 
definitions due to variation of probabilistic semantics, which even 
generate inconsistent results in current studies. However, the 
relationship of different definitions and the inconsistent result has not 
yet been thoroughly identified and explored. Furthermore, like its 
counterpart, mining frequent patterns in deterministic data, mining 
frequent patterns over uncertain data cannot avoid an exponential number 
of frequent itemsets which causes the mining results less useful. In this 
thesis, we demonstrate how our solution can clarify the problems in 
existing studies and address the challenge of an exponential number of 
mining results. Moreover, our solution is also well applied to 
constructing effective indexes for query processing over other types of 
uncertain data, i.e., querying uncertain graphs. To summarize, our study 
covers the following three aspects:

1) We conduct a comprehensive experimental study of existing 
representative frequent itemset mining algorithms over probabilistic 
databases and clarify several existing inconsistent conclusions;

2) We propose a novel problem of mining probabilistic frequent closed 
itemsets in uncertain databases and design an efficient solution, which 
includes a series of pruning techniques and an effective sampling 
algorithm.

3) We study the problem of efficient probabilistic supergraph containment 
query and provide the efficient solution, which integrates probabilistic 
frequent pattern mining technique for constructing the index.

We validate our solutions through extensive experiments and discuss 
several potential research directions of mining frequent patterns over 
probabilistic databases.


Date:                   Thursday, 11 July 2013

Time:                   10:00am - 12:00noon

Venue:                  Room 3501
                         lifts 25/26

Committee Members:      Dr. Lei Chen (Supervisor)
                        Prof. Frederick Lochovsky (Chairperson)
 			Dr. Raymond Wong
 			Dr. Ke Yi


**** ALL are Welcome ****