More about HKUST
Accelerated Learning of Latent Tree Models for Topic Detection and Multidimensional Clustering
PhD Thesis Proposal Defence Title: "Accelerated Learning of Latent Tree Models for Topic Detection and Multidimensional Clustering" by Mr. Tengfei LIU Abstract: Latent tree models (LTMs) are tree-structured probabilistic graphical models where the variables at leaf nodes are observed and the variables at internal nodes are latent. They represent complex relationships among observed variables and yet are computationally simple to work with. When used for data analysis, they can identify meaningful groupings of observed variables, where the correlations among the variables in each group can be modeled by a single latent variable. As such, LTMs are effective tools for detecting correlation patterns in data and can be applied to various domains. In this thesis, we focus on developing new algorithms for learning latent tree models and investigating their use in two applications. Firstly, we propose a new method called Bridged-Islands (BI) algorithm for learning latent tree models. Many algorithms have been proposed for the same task. However, previous methods for learning latent tree models have their limitations. For example, the search algorithms can only handle data sets with dozens of attributes which limits their use in processing large data set. The algorithms based on attribute clustering (AC) usually focus on latent tree models with special structures (e.g., binary trees or forests) which may be not appropriate for many real world applications. The algorithms motivated by phylogenetic tree reconstruction (PTR) have restrictions on both observed and latent variables. They normally require that all observed or latent variables have the same number of states and the number should be known beforehand. Compared with AC-based algorithms and PTR-motivated algorithms, the advantage of BI algorithm is model quality. It produces models with better quality and has no restrictions on the cardinalities of the observed variables or latent variables. Compared with search algorithms, BI is more efficient and can be employed in larger data sets. Secondly, we propose a new LTM-based algorithm called hierarchical latent tree analysis (HLTA) for identifying topics from a collection of documents. The predominant method for topic detection are latent Dirichlet allocation (LDA), which assume a generating process for the documents. HLTA provides an alternative view for topic detection. It identifies the topics by simultaneously clustering the documents and word variables. In this new view, a topic is considered as a class of documents that show clear thematic meaning. The new method models the word co-occurrence patterns by using a hierarchical latent tree model. Each latent variable represents a soft partition of the documents based on some word co-occurrence patterns. Each state of the latent variable, which represents a cluster of documents in the partition, is interpreted as a topic. In the hierarchical structure, the variables at a higher level correspond to more general topics, while the variables at a lower level correspond to more specific topics. Compared with alternative methods, HLTA integrates topic correlation modeling, topic structure detection and automatic determination of topic numbers in one framework. Empirical experiments also show that the new algorithm produces more semantically coherent topics and topic hierarchy than LDA-based methods. Thirdly, we explore the use of LTMs for multidimensional clustering. Conventional clustering usually assume that there is only one true clustering within a data set, which does not hold for many real-world data that are multifaceted. Multidimensional clustering aims to identify multiple clusterings from the data. Latent tree models have previously been used for multidimensional clustering, where each latent variable in the model represents one way to cluster the data. In this thesis, we apply the new algorithm to the task and simultaneously deal with much larger data sets than previously. We compared LTM-based methods with several alternative algorithms for multidimensional clustering on both labeled and unlabeled data sets. The empirical experiments indicate that the LTM-based methods can identify a rich set of meaningful clusterings and concurrently achieve good performance. Date: Monday, 1 December 2014 Time: 9:30am - 11:30am Venue: Room 3501 lifts 25/26 Committee Members: Prof. Nevin Zhang (Supervisor) Prof. Dit-Yan Yeung (Chairperson) Dr. Huamin Qu Dr. Raymond Wong **** ALL are Welcome ****