Unidimensional Clustering using Latent Tree Models
PhD Thesis Proposal Defence

Title: "Unidimensional Clustering using Latent Tree Models"

by

Miss Hua LIU

Abstract:

Cluster analysis is an important research topic in machine learning and data mining. The objective is to partition data into meaningful clusters, where the number of clusters might be known or unknown. A variety of approaches have been proposed for the task. Examples include distance/similarity-based algorithms such as K-means, kernel K-means and spectral clustering, as well as model-based methods such as Gaussian mixture models (GMMs). Most previous clustering algorithms are designed for continuous data, that is, points in a Euclidean space of some dimension. In this thesis, we focus on discrete data, where each attribute takes a finite number of possible values.

A commonly used method for clustering discrete data is latent class analysis (LCA). LCA is based on a class of finite mixture models known as latent class models (LCMs). An LCM consists of a latent variable and a number of attributes. It assumes that the attributes are mutually independent given the latent variable. Known as local independence, this assumption is often too strong in practice and might lead to spurious clusters.

Another method for clustering discrete data is called latent tree analysis (LTA), and it is based on latent tree models (LTMs). LTMs allow multiple latent variables, and hence the local independence assumption is relaxed. Each latent variable represents a soft partition of the data. As such, LTA yields multiple partitions of the data and is a multidimensional clustering method. While multiple partitions are desirable in some applications, there are many situations where one wants only a single partition of the data.

In this thesis, we develop and investigate a novel method for clustering discrete data that yields a single partition of the data and does not make the local independence assumption. The method learns, from data, a three-level LTM where (1) the attributes are at the bottom, (2) there is a unique latent variable, the root, at the top level, and (3) there are multiple latent variables at the middle level that connect the attributes and the root. The root represents the data partition to be obtained. It is not directly connected to the attributes, and hence the local independence assumption is relaxed. The method is called UC-LTM, which stands for unidimensional clustering based on latent tree models.

In addition to the general-purpose algorithm UC-LTM, we also propose and investigate a special-purpose unidimensional clustering method for solving the rounding problem of spectral clustering. Spectral clustering starts with a similarity matrix for a collection of data points, transforms the matrix to obtain the so-called Laplacian matrix, finds the eigenvectors of the Laplacian matrix, and obtains a partition of the data points using the leading eigenvectors. The last step is called rounding, and it consists of three sub-problems: (1) decide how many leading eigenvectors to use, (2) determine the number of clusters, and (3) partition the data points. To solve the rounding problem, we treat the eigenvectors as features, discretize them, and then learn a three-level LTM to cluster the data points.

Our method differs from previous rounding methods in three ways. First, we relax the assumption that the number of clusters equals the number of eigenvectors used. Second, when deciding how many leading eigenvectors to use, we rely not only on information contained in the leading eigenvectors themselves, but also on the subsequent eigenvectors. Third, our method is model-based and solves all three sub-problems simultaneously.
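To make the local independence assumption concrete, here is a minimal sketch (in Python with NumPy; not code from the thesis, with illustrative cardinalities and randomly initialised parameters) of the joint distribution and soft cluster assignments under an LCM:

    import numpy as np

    rng = np.random.default_rng(0)

    K = 3          # number of latent classes (clusters); illustrative value
    n_attrs = 4    # number of discrete attributes
    n_vals = 2     # each attribute takes two possible values here

    # Randomly initialised parameters, as one would have before running EM.
    pi = rng.dirichlet(np.ones(K))                             # P(Z = z)
    theta = rng.dirichlet(np.ones(n_vals), size=(K, n_attrs))  # P(X_i = v | Z = z)

    def lcm_joint(x):
        """P(x) under the LCM: local independence makes the joint a
        mixture of products of per-attribute conditionals."""
        per_class = pi * np.prod(theta[:, np.arange(n_attrs), x], axis=1)
        return per_class.sum()

    def lcm_posterior(x):
        """Soft cluster assignment P(Z = z | x) used in latent class analysis."""
        per_class = pi * np.prod(theta[:, np.arange(n_attrs), x], axis=1)
        return per_class / per_class.sum()

    x = np.array([0, 1, 1, 0])   # one discrete data point
    print(lcm_joint(x), lcm_posterior(x))

The product inside lcm_joint is exactly where local independence enters: every attribute contributes a factor conditioned on the single latent class, with no direct attribute-to-attribute dependence.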
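By contrast, a rough sketch of the three-level structure described in the abstract, again with placeholder parameters and an assumed attachment of attributes to middle-level latent variables (the thesis learns this structure from data; the code below only evaluates a fixed one):

    import itertools
    import numpy as np

    rng = np.random.default_rng(1)

    Ky, Kz, n_vals = 3, 2, 2    # cardinalities of root, middle latents, attributes
    parent = [0, 0, 1, 1]       # attribute i is attached to middle latent parent[i]
    m = max(parent) + 1         # number of middle-level latent variables

    p_y = rng.dirichlet(np.ones(Ky))                                # P(Y)
    p_z_y = rng.dirichlet(np.ones(Kz), size=(m, Ky))                # P(Z_j | Y)
    p_x_z = rng.dirichlet(np.ones(n_vals), size=(len(parent), Kz))  # P(X_i | Z_parent(i))

    def ltm_joint_x(x):
        """P(x) under the three-level LTM, summing out the root Y and the
        middle latents Z_1..Z_m. Attributes attached to the same Z_j stay
        dependent given Y alone, so local independence with respect to the
        clustering variable Y is relaxed."""
        total = 0.0
        for y in range(Ky):
            for z in itertools.product(range(Kz), repeat=m):
                p = p_y[y]
                for j in range(m):
                    p *= p_z_y[j, y, z[j]]
                for i, xi in enumerate(x):
                    p *= p_x_z[i, z[parent[i]], xi]
                total += p
        return total

    print(ltm_joint_x([0, 1, 1, 0]))

The root Y still induces a single soft partition, but it reaches the attributes only through the middle latents, matching the description above.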
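Finally, a sketch of the spectral pipeline up to the rounding step, using standard NumPy/SciPy/scikit-learn calls. The conventional rounding and a discretisation helper are included for contrast; the three-level-LTM rounding itself is not reproduced here, and the parameter choices (number of eigenvectors, number of bins) are illustrative assumptions:

    import numpy as np
    from scipy.linalg import eigh
    from sklearn.cluster import KMeans

    def leading_eigenvectors(W, m):
        """Eigenvectors of the normalised Laplacian L = I - D^{-1/2} W D^{-1/2}
        for the m smallest eigenvalues (the 'leading' eigenvectors)."""
        d = W.sum(axis=1)
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
        L = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt
        _, vecs = eigh(L)        # eigh returns eigenvalues in ascending order
        return vecs[:, :m]

    def conventional_rounding(V, k):
        """The usual rounding step: k-means on the rows of the eigenvector
        matrix, which hard-wires (number of clusters) = (eigenvectors used)."""
        return KMeans(n_clusters=k, n_init=10).fit_predict(V)

    def discretise(V, n_bins=3):
        """Turn each eigenvector into a discrete attribute via equal-frequency
        bins -- the kind of input a model for discrete data can then cluster."""
        qs = np.linspace(0, 1, n_bins + 1)[1:-1]
        edges = np.quantile(V, qs, axis=0)
        return np.stack([np.digitize(V[:, j], edges[:, j])
                         for j in range(V.shape[1])], axis=1)

Under this framing, the thesis's approach would feed the discretised eigenvectors into a three-level LTM learner in place of conventional_rounding, letting the model itself decide how many eigenvectors to use and how many clusters to form.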
Date: Tuesday, 19 May 2015
Time: 3:00pm - 5:00pm
Venue: Room 2132C (lift 19)

Committee Members:
Prof. Nevin Zhang (Supervisor)
Prof. Fangzhen Lin (Chairperson)
Dr. Lei Chen
Dr. Wilfred Ng

**** ALL are Welcome ****