MULTILINGUAL DOCUMENT EMBEDDING WITH SEQUENTIAL NEURAL NETWORK MODELS

MPhil Thesis Defence


Title: "MULTILINGUAL DOCUMENT EMBEDDING WITH SEQUENTIAL NEURAL NETWORK 
MODELS"

By

Mr. Wei LI


Abstract

LASER, one of the current state-of-the-art multilingual document embedding 
models, is based on the bidirectional LSTM (BiLSTM) neural machine 
translation (NMT) model. This paper presents a Transformer-based 
Multilingual Document Embedding model, T-MDE, which makes two significant 
improvements. Firstly, the BiLSTM encoder is replaced by the 
attention-based transformer structure with a novel information bottleneck 
design. The new model structure is more capable of learning sequential 
patterns in longer texts. Moreover, it is faster both in training and 
embedding generation. Secondly, we augment the NMT translation loss 
function with a carefully designed distance constraint loss term, which 
further brings the embeddings of parallel sentences closer together in the 
vector space. We call the T-MDE model trained with the distance constraint 
cT-MDE. Our T-MDE model significantly outperforms the BiLSTM-based LASER on 
cross-lingual document classification tasks.


Date:  			Thursday, 5 May 2022

Time:			10:00am - 12:00 noon

Zoom Meeting:
https://hkust.zoom.us/j/4284493948?pwd=SXp0bWhESVNXc2djSGZLM1loYXFVZz09

Committee Members:	Dr. Brian Mak (Supervisor)
 			Prof. Raymond Wong (Chairperson)
 			Prof. Nevin Zhang


**** ALL are Welcome ****