Toward Learning Corpora Structure for Statistical Machine Translation

======================================================================
                Joint Seminar
======================================================================
The Hong Kong University of Science & Technology
Human Language Technology Center
Department of Computer Science and Engineering
Department of Electronic and Computer Engineering
---------------------------------------------------------------------

Speaker:	Dr. Marine CARPUAT
		National Research Council
		Canada

Title:		"Toward Learning Corpora Structure for Statistical
		Machine Translation"

Date:		Monday, 10 October 2011

Time:		4:00pm - 5:00pm

Venue:		Lecture Theater F (near lifts 25 & 26), HKUST

Abstract:

While "there's no data like more data" has been the motto behind remarkable
improvements in Statistical Machine Translation (SMT), we argue that
current SMT architectures ignore precious information by modeling data as
unstructured collections of sentences, rather than structured collections
of documents. First, we show that documents matter by revisiting the "one
sense per discourse" hypothesis for translation.  Second, we show that not
all data is equally useful and that unsupervised learning of corpora
structure via document clustering can improve SMT.


Biography:

Marine CARPUAT is a Research Associate at the National Research Council of
Canada, where she works on natural language processing, statistical
machine translation, and machine learning for multilingual text and
document processing. Dr. Carpuat received a PhD in Computer Science from
HKUST in 2008, an MPhil in Electrical Engineering from HKUST in 2003, and
an undergraduate engineering degree from the French Grande Ecole Supelec.
She was a postdoctoral researcher at Columbia University from 2008 to 2010.