IMPROVING SEMANTIC SMT FOR LOW RESOURCE LANGUAGES
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

PhD Thesis Defence

Title: "IMPROVING SEMANTIC SMT FOR LOW RESOURCE LANGUAGES"

By

Miss Meriem BELOUCIF

Abstract

We have managed to consistently improve translation quality for challenging low resource languages by injecting semantics-based objective functions into the training pipeline at an early (training) stage, rather than at a late (tuning) stage as in previous attempts. The approaches suggested in this thesis are motivated by the fact that including semantics in late-stage tuning of machine translation models has already been shown to increase translation quality. Any shortage of parallel data constitutes a serious obstacle for conventional machine translation training techniques, because of their heavy dependence on memorization from big data. With low resource languages, for which parallel corpora are scarce, it becomes imperative to make learning from small data far more efficient by adding constraints that create stronger inductive biases, especially linguistically well-motivated constraints such as the shallow semantic parses of the training sentences. However, while automatic semantic parsing is readily available to produce shallow semantic parses for a high resource output language (typically English), there are no semantic parsers for low resource input languages, as in the Uyghur and Uzbek translation challenges. We propose the first methods that inject a crosslingual semantics-based objective function into the training of translation models for translation tasks like Chinese-English, where semantic parsers exist for both languages. We report promising results showing that training the machine translation model in this way, in general, helps bias learning towards semantically more correct bilingual constituents.
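To make the crosslingual semantics-based objective concrete: it can be pictured as a frame-overlap score between the shallow semantic parses of the two sides of a sentence pair, rewarding translations whose predicate-argument structure matches. The following is a minimal, hypothetical sketch only; the frame representation, role labels, function names, and F1-style scoring here are illustrative assumptions, not the thesis's actual formulation.

```python
# Hypothetical sketch of a semantic-frame overlap score.
# A shallow semantic parse is represented as a list of frames,
# each frame being a dict mapping a role label (e.g. "PRED",
# "ARG0") to a set of filler tokens. All data is illustrative.

def role_overlap(frame_a, frame_b):
    """F1-style overlap between the role fillers of two frames."""
    matched = 0
    for role, filler in frame_a.items():
        # A role counts as matched if the other frame has the same
        # role and the filler token sets intersect.
        if role in frame_b and frame_b[role] & filler:
            matched += 1
    precision = matched / len(frame_a) if frame_a else 0.0
    recall = matched / len(frame_b) if frame_b else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def semantic_score(parse_a, parse_b):
    """Average best-match frame overlap between two shallow parses."""
    if not parse_a or not parse_b:
        return 0.0
    total = sum(max(role_overlap(fa, fb) for fb in parse_b)
                for fa in parse_a)
    return total / len(parse_a)

# Toy example: the two parses agree on PRED and ARG0 but not ARG1,
# so 2 of 3 roles match, giving an F1 of 2/3.
hyp = [{"PRED": {"signed"}, "ARG0": {"president"}, "ARG1": {"treaty"}}]
ref = [{"PRED": {"signed"}, "ARG0": {"president"}, "ARG1": {"agreement"}}]
print(round(semantic_score(hyp, ref), 2))  # 0.67
```

A score of this shape could in principle be computed over candidate bilingual constituents during training, biasing the model towards pairs whose semantic frames align.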
Semantic statistical machine translation for low resource languages has been a difficult challenge, since semantic parses are usually available only for high resource output languages such as English, not for low resource input languages. We extend our bilingual approaches to the low resource setup via new training approaches that require only the output language semantic parse. We then thoroughly analyze the reasons behind the promising results we achieved on multiple challenging low resource translation tasks, including Hausa, Uzbek, Uyghur, Swahili, Oromo and Amharic, always translating into English. Our methods rely heavily on the quality of the semantic parser, which completely fails to parse any sentence containing any form of the verb to be; ignoring such sentences means discarding a large portion of the data. Finally, we propose a novel way to semantically parse sentences that contain the verb to be, and re-run all previous models on this newly parsed data. We show further translation improvements through this new approach for many low resource languages.

Date: Tuesday, 27 March 2018
Time: 10:30am - 12:30pm
Venue: Room CYTG003, CYT Building

Chairman: Prof. Inchi Hui (ISOM)

Committee Members:
Prof. Dekai Wu (Supervisor)
Prof. Fangzhen Lin
Prof. Xiaojuan Ma
Prof. Min Zhang (HUMA)
Prof. Martha Palmer (Univ. of Colorado)

**** ALL are Welcome ****