Conditional Random Field Autoencoders for Feature-Rich, Unsupervised NLP
======================================================================
                             Joint Seminar
======================================================================
The Hong Kong University of Science & Technology
Human Language Technology Center
Department of Computer Science and Engineering
Department of Electronic and Computer Engineering
---------------------------------------------------------------------

Speaker:  Prof. Chris DYER, Carnegie Mellon University

Title:    "Conditional Random Field Autoencoders for Feature-Rich, Unsupervised NLP"

Date:     Friday, 7 November, 2014
Time:     11:00am - 12 noon
Venue:    Lecture Theater G (near lifts 25/26), HKUST

Abstract:

Human language is the result of cognitive processes whose contours are---at best---incompletely understood. Given the incomplete information we have about the processes involved, the frequently disappointing results obtained from attempts to use unsupervised learning to uncover latent linguistic structures (e.g., part-of-speech sequences, syntax trees, or word alignments in parallel data) can be attributed---in large part---to model misspecification. This work introduces a novel framework for unsupervised learning of structured predictors with overlapping, global features. Each input's latent representation is predicted conditional on the observable data using a feature-rich conditional random field. A reconstruction of the input is then generated, conditional on the latent structure, from cheaply-estimated multinomials. The autoencoder structure enables efficient inference without unrealistic independence assumptions, allowing us to incorporate the often conflicting, overlapping theories (in the form of hand-crafted features) about how latent structures relate to observed data in a coherent model. We contrast our approach with traditional joint unsupervised models that are learned to maximize the marginal likelihood of observed data. We show state-of-the-art results with instantiations of the model for two canonical NLP tasks: part-of-speech induction and bitext word alignment, and show that training our model is substantially more efficient than training feature-rich models.

This is joint work with Waleed Ammar and Noah A. Smith.
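As a rough illustration of the model described in the abstract (the notation is illustrative, not taken from the talk: x is the observed input, y the latent structure, x-hat the reconstruction, Lambda the CRF feature weights, and theta the reconstruction multinomials), the objective of such a CRF autoencoder can be sketched as:

    % Encoder: feature-rich CRF over latent structures y given the input x
    p(y \mid x; \Lambda) \propto \exp\{\Lambda^\top f(x, y)\}

    % Decoder: position-wise multinomials reconstructing the input from y
    p(\hat{x} \mid y; \Theta) = \prod_i \theta_{\hat{x}_i \mid y_i}

    % Training: maximize reconstruction likelihood, marginalizing over y
    \mathcal{L}(\Lambda, \Theta) = \sum_{x \in \mathcal{D}} \log \sum_{y} p(y \mid x; \Lambda) \prod_i \theta_{x_i \mid y_i}

In this sketch the latent structure is marginalized out, so learning rewards feature-rich encodings that make a cheap multinomial reconstruction of the input likely.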
********************

Biography:

Dr. Chris DYER is an Assistant Professor of Language Technologies in the School of Computer Science (SCS) at Carnegie Mellon University. He also holds an affiliated appointment in the Machine Learning Department at the same institution. Dr. DYER received his Ph.D. in Linguistics from the University of Maryland at College Park in December 2010, where his research with Prof. Philip Resnik developed tractable statistical models of machine translation that are robust to errors in automatic linguistic analysis components by simultaneously considering billions of alternative analyses during translation and deferring decisions about uncertain inputs until the complete translation pipeline has run. Tractability relied on automata-theoretic insights that previously had their primary application in compiler design. The techniques he developed have been widely adopted in other research labs and in commercial translation applications. The translation and learning software developed during Dr. DYER's thesis work is publicly available and has been used in natural language processing and machine translation courses at several institutions, in three Ph.D. dissertations, and in numerous publications by other authors. In addition to numerous scientific articles and one patent, Dr. DYER co-authored a book with Dr. Jimmy Lin, Data-Intensive Text Processing with MapReduce, published by Morgan & Claypool. Since its publication, the book has been widely used in courses around the world. Following his graduate work, Dr. DYER was a post-doctoral associate at Carnegie Mellon with Dr. Noah Smith. Their work developed probabilistic models of natural language that incorporate structured prior linguistic knowledge to achieve better predictive performance than uninformative priors alone, particularly for low-resource languages.