Latent Tree Analysis for Hierarchical Topic Detection: Scalability and Count Data
PhD Thesis Proposal Defence

Title: "Latent Tree Analysis for Hierarchical Topic Detection: Scalability and Count Data"

by

Miss Peixian CHEN

Abstract:

Detecting topics and topic hierarchies in large archives of documents has been one of the most active research areas of the last decade. The objective of topic detection is to discover the thematic structure underlying a document collection, based on which the collection can be organized and summarized. Recently, hierarchical latent tree analysis (HLTA) has been proposed as a new method for topic detection. It differs fundamentally from the currently predominant approach, latent Dirichlet allocation (LDA), in terms of topic definition, topic-document relationship, and learning method. HLTA uses a class of graphical models called hierarchical latent tree models (HLTMs) to build a topic hierarchy. The variables at the bottom level of an HLTM are binary observed variables that represent the presence or absence of words in a document. The variables at the other levels are binary latent variables; those at the lowest latent level represent word co-occurrence patterns, and those at higher levels represent co-occurrences of the patterns at the level below. Each latent variable gives a soft partition of the documents, and the document clusters in these partitions are interpreted as topics. HLTA has been shown to discover significantly more coherent topics and better topic hierarchies than LDA-based hierarchical topic detection methods on binary data.

However, HLTA has two shortcomings in its current form. First, it does not scale up well: it takes, for instance, 17 hours to process a NIPS dataset that consists of fewer than 2,000 documents over 1,000 distinct words. Second, it operates on binary data and does not take word frequencies into consideration, which leads to significant information loss. In this thesis proposal, we propose and investigate methods for overcoming these shortcomings.

First, we propose a new algorithm to scale up HLTA. The computational bottleneck of the original HLTA lies in its use of the Expectation-Maximization (EM) algorithm for parameter estimation during model structure learning, which produces a large number of intermediate models. We propose progressive EM (PEM) as a replacement for EM. PEM is motivated by a spectral technique used in the method of moments, which relates model parameters to population moments that involve at most 3 observed variables. Similarly, PEM carries out parameter estimation in submodels that involve 3 or 4 observed binary variables. PEM is efficient because, however large a dataset is, it contains only 8 or 16 distinct cases when projected onto 3 or 4 binary variables (see the first sketch after the abstract). The new algorithm is hence named PEM-HLTA. To estimate the parameters of the final model, we use stepwise EM, which operates in a way similar to stochastic gradient descent (see the second sketch). PEM-HLTA finishes processing the aforementioned NIPS data within 4 minutes under the same computing environment and is capable of analyzing much larger corpora.

Second, we propose to incorporate word frequencies into HLTA. At present, HLTA models documents as binary vectors. Binary representations capture word co-occurrences but reflect little about word proportions in a document: two documents that use the same set of words, but in different proportions, might be on completely different topics. We therefore propose an extension, HLTA for bag-of-words data (HLTA-bow). HLTA-bow replaces the binary observed variables in current HLTMs with continuous variables, each of which follows a mixture of Gaussian distributions truncated to the interval [0,1] (see the third sketch). These continuous observed variables represent the relative frequencies of words in a document. HLTA-bow is hence capable of modeling word frequency distributions under different topics, which reflect word usage patterns rather than pure co-occurrences. Preliminary experiments demonstrate that HLTA-bow produces models with much better predictive performance than LDA-based methods on bag-of-words data.
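The counting argument behind PEM can be made concrete. Below is a minimal sketch, not taken from the proposal itself: the function name `projected_counts` and the toy corpus are illustrative, and NumPy stands in for whatever implementation the actual system uses. It shows that a corpus of any size, projected onto 3 binary word variables, collapses into a table of 8 counts, so estimation in the corresponding submodel touches only those 8 sufficient statistics.

```python
import numpy as np

def projected_counts(docs, idx):
    """Tabulate joint counts of the binary word variables in `idx`.

    With k variables there are only 2**k distinct projected cases
    (8 for k = 3, 16 for k = 4), so EM-style updates on a submodel
    cost O(2**k) per iteration rather than O(N), no matter how many
    documents N the corpus contains.
    """
    k = len(idx)
    # Encode each projected document as an integer in [0, 2**k).
    codes = docs[:, idx] @ (1 << np.arange(k))
    return np.bincount(codes, minlength=1 << k)

# Toy corpus: 100,000 documents over 1,000 words collapse to just
# 8 numbers once projected onto 3 word variables.
rng = np.random.default_rng(0)
docs = (rng.random((100_000, 1_000)) < 0.1).astype(np.int64)
counts = projected_counts(docs, [3, 41, 7])
print(counts, counts.sum())  # 8 counts summing to 100,000
```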
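Stepwise EM resembles stochastic gradient descent in that it blends the running expected sufficient statistics toward those of each mini-batch with a decaying step size. The sketch below uses the step-size schedule eta_t = (t + 2)^(-alpha) common in the stepwise-EM literature; the proposal's exact schedule and statistics are not given here, so treat the details as assumptions.

```python
import numpy as np

def stepwise_em_update(stats, batch_stats, t, alpha=0.7):
    """One stepwise-EM step: move the running expected sufficient
    statistics toward those of the current mini-batch, with a step
    size that decays over time, much as stochastic gradient descent
    decays its learning rate."""
    eta = (t + 2) ** (-alpha)          # decaying step size
    return (1.0 - eta) * stats + eta * batch_stats

# After each mini-batch E-step, fold its statistics in, then re-run
# the M-step from the blended statistics.
stats = np.full(8, 1.0 / 8)            # e.g. statistics over 8 cases
for t, batch_stats in enumerate([np.ones(8) / 8] * 3):  # placeholder batches
    stats = stepwise_em_update(stats, batch_stats, t)
```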
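HLTA-bow's observation model is a mixture of Gaussians truncated to [0,1] over a word's relative frequency. The following SciPy sketch shows one such density; the two components and their parameters are purely illustrative (a low-mean component for documents that barely use a word, a higher-mean one for documents where it is topical), not the proposal's fitted values.

```python
import numpy as np
from scipy.stats import truncnorm

def trunc_gauss_mixture_pdf(x, weights, means, sds):
    """Density of a mixture of Gaussian components, each truncated
    to [0, 1] -- the range of a word's relative frequency."""
    pdf = np.zeros_like(x, dtype=float)
    for w, mu, sd in zip(weights, means, sds):
        # truncnorm takes the truncation bounds in standard units:
        # a = (0 - mu) / sd, b = (1 - mu) / sd.
        a, b = (0.0 - mu) / sd, (1.0 - mu) / sd
        pdf += w * truncnorm.pdf(x, a, b, loc=mu, scale=sd)
    return pdf

# Illustrative two-component mixture: most documents barely use the
# word (mass near 0); under one topic it takes a noticeable share.
x = np.linspace(0.0, 0.3, 4)
print(trunc_gauss_mixture_pdf(x, [0.8, 0.2], [0.01, 0.15], [0.02, 0.05]))
```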
Date:  Wednesday, 26 April 2017
Time:  1:30pm - 3:30pm
Venue: Room 2463 (lifts 25/26)

Committee Members: Prof. Nevin Zhang (Supervisor)
                   Dr. Raymond Wong (Chairperson)
                   Prof. Fangzhen Lin
                   Dr. Yangqiu Song

**** ALL are Welcome ****