More about HKUST
Accelerated Learning of Latent Tree Models for Topic Detection and Multidimensional Clustering
PhD Thesis Proposal Defence
Title: "Accelerated Learning of Latent Tree Models for Topic Detection
and Multidimensional Clustering"
by
Mr. Tengfei LIU
Abstract:
Latent tree models (LTMs) are tree-structured probabilistic graphical
models where the variables at leaf nodes are observed and the variables at
internal nodes are latent. They represent complex relationships among
observed variables and yet are computationally simple to work with. When
used for data analysis, they can identify meaningful groupings of observed
variables, where the correlations among the variables in each group can be
modeled by a single latent variable. As such, LTMs are effective tools for
detecting correlation patterns in data and can be applied to various
domains. In this thesis, we focus on developing new algorithms for
learning latent tree models and investigating their use in two
applications.
Firstly, we propose a new method called Bridged-Islands (BI) algorithm for
learning latent tree models. Many algorithms have been proposed for the
same task. However, previous methods for learning latent tree models have
their limitations. For example, the search algorithms can only handle data
sets with dozens of attributes which limits their use in processing large
data set. The algorithms based on attribute clustering (AC) usually focus
on latent tree models with special structures (e.g., binary trees or
forests) which may be not appropriate for many real world applications.
The algorithms motivated by phylogenetic tree reconstruction (PTR) have
restrictions on both observed and latent variables. They normally require
that all observed or latent variables have the same number of states and
the number should be known beforehand. Compared with AC-based algorithms
and PTR-motivated algorithms, the advantage of BI algorithm is model
quality. It produces models with better quality and has no restrictions on
the cardinalities of the observed variables or latent variables. Compared
with search algorithms, BI is more efficient and can be employed in larger
data sets.
Secondly, we propose a new LTM-based algorithm called hierarchical latent
tree analysis (HLTA) for identifying topics from a collection of
documents. The predominant method for topic detection are latent Dirichlet
allocation (LDA), which assume a generating process for the documents.
HLTA provides an alternative view for topic detection. It identifies the
topics by simultaneously clustering the documents and word variables. In
this new view, a topic is considered as a class of documents that show
clear thematic meaning. The new method models the word co-occurrence
patterns by using a hierarchical latent tree model. Each latent variable
represents a soft partition of the documents based on some word
co-occurrence patterns. Each state of the latent variable, which
represents a cluster of documents in the partition, is interpreted as a
topic. In the hierarchical structure, the variables at a higher level
correspond to more general topics, while the variables at a lower level
correspond to more specific topics. Compared with alternative methods,
HLTA integrates topic correlation modeling, topic structure detection and
automatic determination of topic numbers in one framework. Empirical
experiments also show that the new algorithm produces more semantically
coherent topics and topic hierarchy than LDA-based methods.
Thirdly, we explore the use of LTMs for multidimensional clustering.
Conventional clustering usually assume that there is only one true
clustering within a data set, which does not hold for many real-world data
that are multifaceted. Multidimensional clustering aims to identify
multiple clusterings from the data. Latent tree models have previously
been used for multidimensional clustering, where each latent variable in
the model represents one way to cluster the data. In this thesis, we apply
the new algorithm to the task and simultaneously deal with much larger
data sets than previously. We compared LTM-based methods with several
alternative algorithms for multidimensional clustering on both labeled and
unlabeled data sets. The empirical experiments indicate that the LTM-based
methods can identify a rich set of meaningful clusterings and concurrently
achieve good performance.
Date: Monday, 1 December 2014
Time: 9:30am - 11:30am
Venue: Room 3501
lifts 25/26
Committee Members: Prof. Nevin Zhang (Supervisor)
Prof. Dit-Yan Yeung (Chairperson)
Dr. Huamin Qu
Dr. Raymond Wong
**** ALL are Welcome ****