More about HKUST
Efficient Frequent Pattern Mining over Probabilistic Database
The Hong Kong University of Science and Technology Department of Computer Science and Engineering PhD Thesis Defence Title: "Efficient Frequent Pattern Mining over Probabilistic Database" By Mr. Yongxin TONG Abstract With the broad usage of Internet of Things (IoT), pervasive computing, and crowdsourcing techniques in the modern society, a growing number of data monitoring and collection systems lead to the uncertainty in massive data. For example, data integration of multiple data sources causes uncertainty. Thus, mining probabilistic data (or called uncertain data in this thesis) has become a hot topic in recent years. In particular, mining frequent patterns is one of the most fundamental problems in traditional data mining researches, thus mining frequent patterns over probabilistic databases has attracted much attention in the database and data mining communities. In the scenario of probabilistic data, the support of an pattern is a discrete random variable rather than a frequency of this pattern. Hence, the frequent pattern under uncertain data has different definitions due to the variation of probabilistic semantics, which even generates inconsistent results in current studies. However, the relationship between different definitions and the inconsistent results has not yet been thoroughly identified and explored. Furthermore, like its counterpart, mining frequent patterns in deterministic data, mining frequent patterns over uncertain data cannot avoid an exponential number of frequent patterns which cause the mining results less useful. In this thesis, we demonstrate how our solution can clarify the problems in existing studies and address the challenge of an exponential number of mining results. Moreover, our solution is also well applied to constructing effective indexes for query processing over other types of uncertain data, i.e., querying uncertain graphs. Furthermore, with the advent of big data era, we also propose the solution of finding frequent patterns in distributed uncertain data. To summarize, our study covers the following four aspects: 1) We conduct a comprehensive experimental study of existing representative frequent itemset mining algorithms over probabilistic databases and clarify several existing inconsistent conclusions; 2) We formalize a novel problem of mining probabilistic frequent closed itemsets in an uncertain transaction database and design an efficient solution, which includes a series of pruning techniques and an effective sampling algorithm. 3) We study the problem of efficient probabilistic supergraph containment queryand provide the efficient solution, which integrates probabilistic frequent pattern mining technique for constructing the index. 4) We propose the problem of finding frequent items in distributed probabilistic data and provide a local thresholds-based deterministic algorithm and a sketch-based sampling method to reduce both communication and computation costs. We validate our solutions through extensive experiments and discuss several potential research directions of mining frequent pattern over probabilistic databases. Date: Friday, 6 December 2013 Time: 10:00am – 12:00noon Venue: Room 3494 Lifts 25/26 Chairman: Prof. Danny Tsang (ECE) Committee Members: Prof. Lei Chen (Supervisor) Prof. Frederick Lochovsky Prof. Ke Yi Prof. Weichuan Yu (ECE) Prof. Jianliang Xu (Comp. Sci., Baptist U.) **** ALL are Welcome ****