IMPROVING SEMANTIC SMT FOR LOW RESOURCE LANGUAGES

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "IMPROVING SEMANTIC SMT FOR LOW RESOURCE LANGUAGES"

By

Miss Meriem BELOUCIF


Abstract

We consistently improve translation quality for challenging low resource 
languages by injecting semantics-based objective functions into the training 
pipeline at an early (training) stage, rather than the late (tuning) stage used 
in previous attempts. The approaches suggested in this thesis are motivated by 
the fact that including semantics in late-stage tuning of machine translation 
models has already been shown to increase translation quality.

Any shortage of parallel data constitutes a serious obstacle for conventional 
machine translation training techniques, because of their heavy dependency on 
memorization from big data. With low resource languages, for which parallel 
corpora are scarce, it becomes imperative to make learning from small data much 
more efficient by adding additional constraints to create stronger inductive 
biases---especially linguistically well-motivated constraints, such as the 
shallow semantic parses of the training sentences. However, while automatic 
semantic parsing is readily available to produce shallow semantic parses for a 
high resource output language (typically English), no semantic parsers exist 
for low resource input languages, as in the Uyghur and Uzbek translation 
challenges.

We propose the first methods that inject a crosslingual semantics-based 
objective function into the training of translation models for translation 
tasks such as Chinese--English, where semantic parsers exist for both 
languages. We report promising results showing that this way of training the 
machine translation model, in general, biases learning towards semantically 
more correct bilingual constituents. Semantic statistical machine translation 
for low resource languages has remained a difficult challenge, since semantic 
parses are usually available not for low resource input languages but only for 
high resource output languages such as English. We extend our bilingual 
approaches to the low resource setup via new training approaches that require 
only the output language semantic parse.

We then thoroughly analyze the reasons behind the promising results achieved 
for multiple challenging low resource translation tasks, such as Hausa, Uzbek, 
Uyghur, Swahili, Oromo, and Amharic, always translating into English. Our 
methods rely heavily on the quality of the semantic parser, which completely 
fails to parse any sentence containing any form of the verb "to be". Ignoring 
such sentences means discarding a large portion of the training data. Finally, 
we propose a novel way to semantically parse sentences that contain the verb 
"to be", and re-run all previous models on this newly parsed data. This new 
approach yields even further translation improvements for many low resource 
languages.


Date:			Tuesday, 27 March 2018

Time:			10:30am - 12:30pm

Venue:			Room CYTG003
 			CYT Building

Chairman:		Prof. Inchi Hui (ISOM)

Committee Members:	Prof. Dekai Wu (Supervisor)
 			Prof. Fangzhen Lin
 			Prof. Xiaojuan Ma
 			Prof. Min Zhang (HUMA)
 			Prof. Martha Palmer (Univ. of Colorado)


**** ALL are Welcome ****