Latent Tree Analysis for Hierarchical Topic Detection: Scalability and Count Data
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

PhD Thesis Defence

Title: "Latent Tree Analysis for Hierarchical Topic Detection: Scalability and Count Data"

By

Miss Peixian CHEN

Abstract

Detecting topics and topic hierarchies from large archives of documents has been one of the most active research areas in the last decade. The objective of topic detection is to discover the thematic structure underlying document collections, based on which the collections can be organized and summarized.

Recently, hierarchical latent tree analysis (HLTA) was proposed as a new method for topic detection. It uses a class of graphical models called hierarchical latent tree models (HLTMs) to build a topic hierarchy. The variables at the bottom level of an HLTM are binary observed variables that represent the presence or absence of words in a document. The variables at other levels are binary latent variables that represent word co-occurrence patterns at different granularities. Each latent variable gives a soft partition of the documents, and document clusters in the partitions are interpreted as topics. HLTA has been shown to discover significantly better models, and more coherent topics and topic hierarchies, than the state-of-the-art LDA-based hierarchical topic detection methods.

However, HLTA in its current form can hardly be regarded as a practical topic detection tool. First, its computational cost is prohibitive; second, it operates only on binary data. In this thesis, we propose and investigate methods to overcome these shortcomings.

First, we propose a new learning algorithm, PEM-HLTA, to scale up HLTA. HLTA consists of two phases: a model construction phase and a parameter estimation phase. The computational bottleneck of HLTA lies in the use of the EM algorithm for evaluating parameters during the model construction phase, which produces a large number of intermediate models.
Here we propose progressive EM (PEM) as a replacement for EM. PEM carries out parameter evaluation in submodels that involve only 3 or 4 observed binary variables, which yields a large speed-up. Combined with acceleration techniques applied to the parameter estimation phase, PEM-HLTA is capable of analyzing much larger corpora, with hundreds of thousands of documents.

Second, we propose an extension, HLTA-c, that incorporates word counts into PEM-HLTA. The inability to deal with count data has always put HLTA at a disadvantage as a topic detection method. We introduce real-valued continuous variables to replace the observed binary variables in HLTMs. This is done in the parameter estimation phase and allows PEM-HLTA to model word frequency distributions under different topics, which reflect the usage patterns of words rather than pure word co-occurrences.

With these improvements in scalability and model flexibility, HLTA-c is a new state-of-the-art topic detection approach. Empirical results show that HLTA-c achieves efficiency comparable to that of the best LDA-based hierarchical topic detection methods, and excels in model predictive performance, topic coherence and topic hierarchy quality.

Date: Wednesday, 23 August 2017
Time: 2:00pm - 4:00pm
Venue: Room 2612B Lifts 31/32

Chairman: Prof. Jeffrey Chasnov (MATH)

Committee Members:
Prof. Nevin Zhang (Supervisor)
Prof. Lei Chen
Prof. Wilfred Ng
Prof. Weichuan Yu (ECE)
Prof. Wai Lam (Sys Engg & Engg Mgmt, CUHK)

**** ALL are Welcome ****