More about HKUST
Efficient Frequent Pattern Mining over Probabilistic Database
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering
PhD Thesis Defence
Title: "Efficient Frequent Pattern Mining over Probabilistic Database"
By
Mr. Yongxin TONG
Abstract
With the broad usage of Internet of Things (IoT), pervasive computing, and
crowdsourcing techniques in the modern society, a growing number of data
monitoring and collection systems lead to the uncertainty in massive data.
For example, data integration of multiple data sources causes uncertainty.
Thus, mining probabilistic data (or called uncertain data in this thesis)
has become a hot topic in recent years. In particular, mining frequent
patterns is one of the most fundamental problems in traditional data
mining researches, thus mining frequent patterns over probabilistic
databases has attracted much attention in the database and data mining
communities. In the scenario of probabilistic data, the support of an
pattern is a discrete random variable rather than a frequency of this
pattern. Hence, the frequent pattern under uncertain data has different
definitions due to the variation of probabilistic semantics, which even
generates inconsistent results in current studies. However, the
relationship between different definitions and the inconsistent results
has not yet been thoroughly identified and explored. Furthermore, like its
counterpart, mining frequent patterns in deterministic data, mining
frequent patterns over uncertain data cannot avoid an exponential number
of frequent patterns which cause the mining results less useful. In this
thesis, we demonstrate how our solution can clarify the problems in
existing studies and address the challenge of an exponential number of
mining results. Moreover, our solution is also well applied to
constructing effective indexes for query processing over other types of
uncertain data, i.e., querying uncertain graphs. Furthermore, with the
advent of big data era, we also propose the solution of finding frequent
patterns in distributed uncertain data. To summarize, our study covers the
following four aspects:
1) We conduct a comprehensive experimental study of existing
representative frequent itemset mining algorithms over probabilistic
databases and clarify several existing inconsistent conclusions;
2) We formalize a novel problem of mining probabilistic frequent closed
itemsets in an uncertain transaction database and design an efficient
solution, which includes a series of pruning techniques and an effective
sampling algorithm.
3) We study the problem of efficient probabilistic supergraph containment
queryand provide the efficient solution, which integrates probabilistic
frequent pattern mining technique for constructing the index.
4) We propose the problem of finding frequent items in distributed
probabilistic data and provide a local thresholds-based deterministic
algorithm and a sketch-based sampling method to reduce both communication
and computation costs.
We validate our solutions through extensive experiments and discuss
several potential research directions of mining frequent pattern over
probabilistic databases.
Date: Friday, 6 December 2013
Time: 10:00am – 12:00noon
Venue: Room 3494
Lifts 25/26
Chairman: Prof. Danny Tsang (ECE)
Committee Members: Prof. Lei Chen (Supervisor)
Prof. Frederick Lochovsky
Prof. Ke Yi
Prof. Weichuan Yu (ECE)
Prof. Jianliang Xu (Comp. Sci., Baptist U.)
**** ALL are Welcome ****