PhD Thesis Proposal Defence
Title: "Unidimensional Clustering using Latent Tree Models"
by
Miss Hua LIU
Abstract:
Cluster analysis is an important research topic in machine learning and
data mining. The objective is to partition data into meaningful clusters,
where the number of clusters might be known or unknown. A variety of
approaches have been proposed for the task. Examples include
distance/similarity-based algorithms such as K-means, kernel K-means and
spectral clustering, as well as model-based methods such as Gaussian
mixture models (GMMs). Most previous clustering algorithms are designed
for continuous data, that is, points in a Euclidean space of some
dimension. In this thesis, we focus on discrete data, where each attribute
takes a finite number of possible values.
A commonly used method for clustering discrete data is latent class
analysis (LCA). LCA is based on a class of finite mixture models known as
latent class models (LCMs). An LCM consists of a latent variable and a
number of attributes. It assumes that the attributes are mutually
independent given the latent variable. This assumption, known as local
independence, is often too strong in practice and can lead to spurious
clusters. Another method for clustering discrete data is latent tree
analysis (LTA), which is based on latent tree models (LTMs). LTMs allow
multiple latent variables, and hence the local independence assumption is
relaxed. Each latent variable represents a soft partition of the data. As
such, LTA yields multiple partitions and is a multidimensional clustering
method. While multiple partitions are desirable in some applications,
there are many situations where one wants only a single partition of the
data.
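For readers unfamiliar with LCA, the following minimal sketch fits an LCM
with binary attributes by EM (the function, data, and settings are
illustrative, not the thesis's implementation):

    import numpy as np

    def fit_lcm(X, n_classes, n_iter=100, seed=0):
        # EM for a latent class model with binary attributes: attributes
        # are assumed mutually independent given the latent class, which
        # is exactly the local independence assumption described above.
        rng = np.random.default_rng(seed)
        n, d = X.shape
        pi = np.full(n_classes, 1.0 / n_classes)         # class priors P(Y = k)
        theta = rng.uniform(0.25, 0.75, (n_classes, d))  # P(X_j = 1 | Y = k)
        for _ in range(n_iter):
            # E-step: responsibilities P(Y = k | x_i); the likelihood
            # factorizes over attributes by local independence.
            log_p = (np.log(pi)
                     + X @ np.log(theta).T
                     + (1 - X) @ np.log(1 - theta).T)
            log_p -= log_p.max(axis=1, keepdims=True)
            resp = np.exp(log_p)
            resp /= resp.sum(axis=1, keepdims=True)
            # M-step: re-estimate priors and conditional probabilities.
            nk = resp.sum(axis=0)
            pi = nk / n
            theta = np.clip(resp.T @ X / nk[:, None], 1e-6, 1 - 1e-6)
        return pi, theta, resp

    X = (np.random.default_rng(1).random((200, 6)) > 0.5).astype(float)
    pi, theta, resp = fit_lcm(X, n_classes=3)
    labels = resp.argmax(axis=1)  # hard partition from the soft clustering

When the attributes are in fact correlated given the class, this model can
split or merge true clusters, which is the motivation for relaxing the
assumption below.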
In this thesis, we develop and investigate a novel method for clustering
discrete data that yields a single partition of data and does not make the
local independence assumption. The method learns, from data, a three-level
LTM where (1) the attributes are at the bottom, (2) there is a unique
latent variable, the root, at the top level, and (3) there are multiple
latent variables at the middle level that connect the attributes and the
root. The root represents the data partition to be obtained. It is not
directly connected to the attributes, and hence the local independence
assumption is relaxed. The method is called UC-LTM, which stands for
unidimensional clustering based on latent tree models.
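As a hypothetical illustration of this structure (the attribute grouping,
cardinalities, and parameters below are made up for the sketch; UC-LTM
learns both the structure and the parameters from data), a generative view
of a three-level LTM might look like:

    import numpy as np

    rng = np.random.default_rng(0)

    # Three-level structure: root Y at the top, middle latent variables
    # Z_1..Z_m, and disjoint groups of attributes at the bottom.
    groups = [[0, 1, 2], [3, 4], [5, 6, 7]]  # attributes under each Z (illustrative)
    K, L = 3, 2                              # states of the root / middle variables

    p_root = rng.dirichlet(np.ones(K))                             # P(Y)
    p_mid = [rng.dirichlet(np.ones(L), size=K) for _ in groups]    # P(Z_m | Y)
    p_attr = [rng.uniform(0.1, 0.9, (L, len(g))) for g in groups]  # P(X_j = 1 | Z_m)

    def sample(n):
        # Attributes are independent only given their middle-level parent,
        # not given the root, so local independence w.r.t. Y is relaxed.
        X = np.zeros((n, sum(len(g) for g in groups)))
        y = rng.choice(K, size=n, p=p_root)
        for m, g in enumerate(groups):
            z = np.array([rng.choice(L, p=p_mid[m][yi]) for yi in y])
            X[:, g] = rng.random((n, len(g))) < p_attr[m][z]
        return X, y

    X, y = sample(500)  # y is the single partition represented by the root

Clustering reverses this process: given X, the posterior over the root Y
yields the soft unidimensional partition.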
In addition to the general-purpose algorithm UC-LTM, we also propose and
investigate a special-purpose unidimensional clustering method for solving
the rounding problem of spectral clustering. Spectral clustering starts
with a similarity matrix for a collection of data points, transforms the
matrix to get the so-called Laplacian matrix, finds the eigenvectors of
the Laplacian matrix, and obtains a partition of the data points using the
leading eigenvectors. The last step is called rounding, and it consists of
three sub-problems: (1) decide how many leading eigenvectors to use, (2)
determine the number of clusters, and (3) partition the data points. To
solve the rounding problem, we treat the eigenvectors as features,
discretize them, and then learn a three-level LTM to cluster the data
points. Our method differs from previous rounding methods in three ways.
First, we relax the assumption that the number of clusters equals the
number of eigenvectors used. Second, when deciding how many leading
eigenvectors to use, we not only rely on information contained in the
leading eigenvectors themselves, but also make use of the subsequent
eigenvectors. Third, our method is model-based and solves all three
sub-problems simultaneously.
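A minimal sketch of the data flow up to the rounding step, assuming the
symmetric normalized Laplacian and equal-frequency discretization (the
similarity kernel, bin count, and helper names are illustrative; the final
step, fitting a three-level LTM to the discretized features, is not shown):

    import numpy as np

    def leading_eigenvectors(S, m):
        # S: (n, n) symmetric similarity matrix. Returns the m leading
        # eigenvectors of the normalized Laplacian
        # L = I - D^{-1/2} S D^{-1/2} (smallest eigenvalues first).
        d = S.sum(axis=1)
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
        L = np.eye(len(S)) - D_inv_sqrt @ S @ D_inv_sqrt
        _, vecs = np.linalg.eigh(L)  # eigh sorts eigenvalues ascending
        return vecs[:, :m]

    def discretize(V, n_bins=3):
        # Turn each eigenvector into a discrete attribute by
        # equal-frequency binning (any discretization could be used here).
        out = np.zeros(V.shape, dtype=int)
        for j in range(V.shape[1]):
            cuts = np.quantile(V[:, j], np.linspace(0, 1, n_bins + 1)[1:-1])
            out[:, j] = np.digitize(V[:, j], cuts)
        return out

    pts = np.random.default_rng(0).random((100, 2))
    S = np.exp(-np.square(pts[:, None] - pts[None, :]).sum(-1) / 0.1)  # RBF similarity
    features = discretize(leading_eigenvectors(S, m=4))
    # 'features' would then be handed to the three-level LTM learner, which
    # chooses the number of clusters and the partition jointly.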
Date: Tuesday, 19 May 2015
Time: 3:00pm - 5:00pm
Venue: Room 2132C (lift 19)
Committee Members: Prof. Nevin Zhang (Supervisor)
Prof. Fangzhen Lin (Chairperson)
Dr. Lei Chen
Dr. Wilfred Ng
**** ALL are Welcome ****