Toward Learning Corpora Structure for Statistical Machine Translation
======================================================================
                            Joint Seminar
======================================================================

The Hong Kong University of Science & Technology

Human Language Technology Center
Department of Computer Science and Engineering
Department of Electronic and Computer Engineering
---------------------------------------------------------------------

Speaker:  Dr. Marine CARPUAT
          National Research Council Canada

Title:    "Toward Learning Corpora Structure for Statistical
          Machine Translation"

Date:     Monday, 10 October 2011

Time:     4:00pm - 5:00pm

Venue:    Lecture Theater F (near lifts 25 & 26), HKUST

Abstract:

While "there's no data like more data" has been the motto behind
remarkable improvements in Statistical Machine Translation (SMT), we
argue that current SMT architectures ignore precious information by
modeling data as unstructured collections of sentences, rather than
structured collections of documents. First, we show that documents
matter by revisiting the "one sense per discourse" hypothesis for
translation. Second, we show that not all data is equally useful and
that unsupervised learning of corpora structure via document clustering
can improve SMT.

Biography:

Marine CARPUAT is a Research Associate at the National Research Council
of Canada, where she works on natural language processing, statistical
machine translation, and machine learning for multilingual text and
document processing. Dr. Carpuat received a PhD in Computer Science from
HKUST in 2008, an MPhil in Electrical Engineering from HKUST in 2003,
and an undergraduate engineering degree from the French Grande Ecole
Supelec. She was a postdoctoral researcher at Columbia University from
2008 to 2010.