Topic Modeling of Chinese Language
======================================================================
                            Joint Seminar
======================================================================
The Hong Kong University of Science & Technology
Department of Computer Science and Engineering
Human Language Technology Center
----------------------------------------------------------------------

Speaker:  Dr. Zengchang QIN
          Beihang University

Title:    "Topic Modeling of Chinese Language"

Date:     Friday, 2 September 2011
Time:     11:00am - 12 noon
Venue:    Room 3311 (via lifts 17/18), HKUST

Abstract:

Topic models are hierarchical Bayesian models for language modeling and
document analysis. They have been widely used and have achieved
considerable success in modeling English documents. However, unlike
English and most other alphabetic languages, the basic structural unit
of the Chinese language is the character rather than the word, and
Chinese words are written without spaces between them. Most previous
research on applying topic models to Chinese documents did not take the
Chinese character-word relation into consideration and simply took the
Chinese word as the basic term of documents. In this talk, we will
discuss a novel model that accounts for the character-word relation in
topic modeling by placing an asymmetric prior on the topic-word
distribution of the standard Latent Dirichlet Allocation (LDA) model.
Compared to LDA, this model can improve performance in document
classification, especially when the test data contains a considerable
number of Chinese words that do not appear in the training data.

********************
Biography:

Dr. Zengchang Qin is an associate professor at Beihang University,
Beijing, China. He obtained his MSc and PhD from the University of
Bristol, UK, and did his postdoctoral research with Lotfi Zadeh at UC
Berkeley, US. He has worked (or interned) at HP, BT and Optimor Labs,
and has been a visiting scholar at the University of Oxford and
Carnegie Mellon University. His research interests are agent-based
modeling, machine learning, computational intelligence and multimedia
retrieval.