PhD Thesis Proposal Defence
Title: "Latent Tree Analysis for Hierarchical Topic Detection: Scalability
and Count Data"
by
Miss Peixian CHEN
Abstract:
Detecting topics and topic hierarchies from large archives of documents
has been one of the most active research areas in the last decade. The
objective of topic detection is to discover the thematic structure
underlying document collections, based on which the collections can be
organized and summarized. Recently, hierarchical latent tree analysis
(HLTA) has been proposed as a new method for topic detection. It differs
fundamentally from the currently predominant topic detection approach,
latent Dirichlet allocation (LDA), in terms of topic definition,
topic-document relationship, and learning method. HLTA uses a class of
graphical models called hierarchical latent tree models (HLTMs) to build
a topic hierarchy. The variables at the bottom level of an HLTM are
binary observed variables that represent the presence/absence of words
in a document. The variables at the other levels are binary latent
variables, with those at the lowest latent level representing word
co-occurrence patterns and those at higher levels representing
co-occurrences of patterns at the level below. Each latent variable
gives a soft partition of the documents, and the document clusters in
the partitions are interpreted as topics.
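To make the model structure concrete, below is a minimal toy sketch in
Python of an HLTM as a tree of binary variables. The word and
topic-variable names are hypothetical and chosen only for illustration;
a real HLTM structure is learned from data.

    # Toy HLTM: binary latent variables over binary word variables.
    # All names here are hypothetical; real structures are learned.
    hltm = {
        # Level-2 latent variable over two level-1 latent variables.
        "Z2_machine_learning": ["Z1_neural_nets", "Z1_inference"],
        # Level-1 latent variables capture word co-occurrence patterns.
        "Z1_neural_nets": ["network", "layer", "training"],
        "Z1_inference": ["posterior", "likelihood", "sampling"],
    }

    def words_under(model, node):
        """Observed word variables in the subtree rooted at node."""
        if node not in model:           # a leaf, i.e. a word variable
            return [node]
        return [w for child in model[node]
                for w in words_under(model, child)]

    # Each latent variable soft-partitions the documents; the words in
    # its subtree characterize the topics given by that partition.
    for z in hltm:
        print(z, "->", words_under(hltm, z))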
HLTA has been shown to discover significantly more coherent topics and
better topic hierarchies than LDA-based hierarchical topic detection
methods on binary data. However, it has two shortcomings in its current
form. First, it does not scale up well: it takes, for instance, 17 hours
to process a NIPS dataset that consists of fewer than 2,000 documents
over a vocabulary of 1,000 distinct words. Second, it operates on binary
data and does not take word frequencies into consideration, which leads
to significant information loss. In this thesis proposal, we propose and
investigate methods for overcoming these shortcomings.
First, we propose a new algorithm to scale up HLTA. The computational
bottleneck of the original HLTA lies in its use of the
Expectation-Maximization (EM) algorithm for parameter estimation during
model structure learning, which produces a large number of intermediate
models. We propose progressive EM (PEM) as a replacement for EM. PEM is
motivated by a spectral technique used in the method of moments, which
relates model parameters to population moments that involve at most 3
observed variables. Similarly, PEM carries out parameter estimation in
submodels that involve only 3 or 4 observed binary variables. PEM is
efficient because, however large a dataset is, it has at most 2^3 = 8
or 2^4 = 16 distinct cases when projected onto 3 or 4 binary variables.
The new algorithm is hence named PEM-HLTA. To estimate the parameters of
the final model, we use stepwise EM, which operates in a way similar to
stochastic gradient descent. PEM-HLTA finishes processing the
aforementioned NIPS dataset within 4 minutes in the same computing
environment and is capable of analyzing much larger corpora.
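As a rough illustration of why the projection step is cheap, the
following Python sketch (our own, not from the proposal; the function
and data are made up) tallies the distinct cases of a large binary
dataset projected onto three variables:

    import numpy as np
    from collections import Counter

    def projected_counts(data, columns):
        """Tally distinct cases of a binary dataset projected onto a
        few columns. k binary variables give at most 2**k cases, so
        downstream estimation cost is independent of corpus size."""
        return Counter(map(tuple, data[:, columns]))

    # Hypothetical corpus: 100,000 documents over 1,000 binary word
    # variables, with each word present about 5% of the time.
    rng = np.random.default_rng(0)
    data = (rng.random((100_000, 1_000)) < 0.05).astype(np.int8)

    counts = projected_counts(data, [3, 17, 42])
    print(len(counts))   # at most 2**3 = 8 distinct cases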
Second, we propose to incorporate word frequencies into HLTA. At
present, HLTA models documents as binary vectors. Binary representations
capture word co-occurrences but reflect little about word proportions in
a document: two documents that use the same set of words might be on
completely different topics, with different wording preferences. We
therefore propose an extension, HLTA for bag-of-words data (HLTA-bow).
HLTA-bow replaces the binary observed variables in current HLTMs with
continuous variables, each of which follows a mixture of Gaussian
distributions truncated to the interval [0,1]. These continuous observed
variables represent the relative frequencies of words in a document.
HLTA-bow is hence capable of modeling word frequency distributions under
different topics, reflecting word usage patterns rather than pure
co-occurrences. Preliminary experiments demonstrate that HLTA-bow
produces models with much better predictive performance than LDA-based
methods on bag-of-words data.
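As a sketch of this kind of observation model for a single word, here is
one way (assuming SciPy; the weights and parameters are invented for
illustration) to evaluate a two-component mixture of Gaussians truncated
to [0,1]:

    import numpy as np
    from scipy.stats import truncnorm

    def truncated_normal(mean, std, low=0.0, high=1.0):
        """Gaussian truncated to [low, high]; truncnorm expects the
        bounds in standardized units."""
        a, b = (low - mean) / std, (high - mean) / std
        return truncnorm(a, b, loc=mean, scale=std)

    # Hypothetical mixture for one word's relative frequency: one
    # component for documents where the word is rare, one where it is
    # relatively common. All parameters are made up.
    weights = [0.7, 0.3]
    components = [truncated_normal(0.01, 0.01),
                  truncated_normal(0.10, 0.05)]

    def mixture_pdf(x):
        return sum(w * c.pdf(x) for w, c in zip(weights, components))

    x = np.linspace(0.0, 0.3, 7)
    print(mixture_pdf(x))   # density of the word's relative frequency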
Date: Wednesday, 26 April 2017
Time: 1:30pm - 3:30pm
Venue: Room 2463
(lifts 25/26)
Committee Members: Prof. Nevin Zhang (Supervisor)
Dr. Raymond Wong (Chairperson)
Prof. Fangzhen Lin
Dr. Yangqiu Song
**** ALL are Welcome ****