MULTILINGUAL DOCUMENT EMBEDDING WITH SEQUENTIAL NEURAL NETWORK MODELS
MPhil Thesis Defence

Title: "MULTILINGUAL DOCUMENT EMBEDDING WITH SEQUENTIAL NEURAL NETWORK MODELS"

By

Mr. Wei LI

Abstract

LASER, one of the current state-of-the-art multilingual document embedding models, is based on the bidirectional LSTM (BiLSTM) neural machine translation (NMT) model. This paper presents a Transformer-based Multilingual Document Embedding model, T-MDE, which makes two significant improvements. Firstly, the BiLSTM encoder is replaced by the attention-based Transformer structure with a novel information bottleneck design. The new model structure is more capable of learning sequential patterns in longer texts. Moreover, it is faster in both training and embedding generation. Secondly, we augment the NMT translation loss function with a carefully designed distance constraint loss term, which further brings the embeddings of parallel sentences closer together in the vector space. We call the T-MDE model trained with the distance constraint cT-MDE. Our T-MDE model significantly outperforms the BiLSTM-based LASER on cross-lingual document classification tasks.

Date: Thursday, 5 May 2022

Time: 10:00am - 12:00 noon

Zoom Meeting: https://hkust.zoom.us/j/4284493948?pwd=SXp0bWhESVNXc2djSGZLM1loYXFVZz09

Committee Members:
Dr. Brian Mak (Supervisor)
Prof. Raymond Wong (Chairperson)
Prof. Nevin Zhang

**** ALL are Welcome ****