Structured Sparsity for Pre-Training Distributed Word Representations with Subword Information
MPhil Thesis Defence

Title: "Structured Sparsity for Pre-Training Distributed Word Representations with Subword Information"

By

Mr. Leonard Elias LAUSEN

Abstract

Facilitating computational methods that can “understand” and work with humans requires making the general world knowledge of humans available to such methods in a computationally suitable representation (Bengio, Courville, and Vincent 2013). Semantic memory refers to this human knowledge and the memory system storing it. Computational models thereof have been studied since the advent of computing (McRae and M. Jones 2013), typically based on text data (Yee, M. N. Jones, and McRae 2018) and a distributional hypothesis (Harris 1954; Firth 1957; Miller and Charles 1991), which postulates a relation between the co-occurrence distribution of sense inputs – such as words in language – and their respective semantic meanings.

Beyond their use in validating and exploring psychological theories, word-based computational semantic models have gained popularity in natural language processing (NLP), as word representations obtained from large corpora help improve performance on supervised NLP tasks for which only comparatively little labeled training data can be obtained (Turian, Ratinov, and Bengio 2010). Recently, a series of scalable methods beginning with Word2Vec (Tomas Mikolov, Chen, et al. 2013), commonly referred to as word embedding methods, have enabled the learning of word representations from very large unlabeled text corpora, yielding better representations and representations for more words. Unfortunately, the long-tail nature of human language – implying that most words are infrequent (Zipf 1949; Mandelbrot 1954) – prevents these methods from representing infrequent words well (Lowe 2001; Luong, Socher, and Christopher D. Manning 2013).

Considering that words are typically formed of meaningful parts, the distribution considered in the distributional hypothesis depends not only on atomic word-level information but is largely based on the subword structure (Harris 1954). Taking morphological or subword information into account in computational models was therefore proposed as a remedy (Luong, Socher, and Christopher D. Manning 2013), and recently Bojanowski et al. (2017) proposed a scalable model incorporating subword-level information, termed fastText. fastText learns separate vectorial representations for words and their parts, specifically all character ngrams. The final word representation provided by the model is then the average of the word-level and ngram-level representations.

In this thesis we propose an adaptation of the fastText model, motivated by the insight that estimating the word-level part of the representation, as well as the representations of some character ngrams, may be unreliable when it is based on only a few co-occurrence relations in the text corpus. We thus introduce a group lasso regularization (Yuan and Y. Lin 2006) to select a subset of word- and subword-level parameters for which good representations can be learned. For optimization we introduce a scalable ProxASGD optimizer based on insights into asynchronous proximal optimization by Pedregosa, Leblond, and Lacoste-Julien (2017).
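For illustration, the averaged representation and the group-structured penalty described in the abstract can be sketched as follows; the notation (G_w, u_w, z_g, and the two lambda terms) is an assumption made for this sketch and is not taken from the announcement:

    v_w = \frac{1}{|G_w| + 1} \Big( u_w + \sum_{g \in G_w} z_g \Big),
    \qquad
    \Omega = \lambda_{\mathrm{word}} \sum_{w} \lVert u_w \rVert_2 + \lambda_{\mathrm{sub}} \sum_{g} \lVert z_g \rVert_2,

where G_w is the set of character ngrams of word w, u_w and z_g are the word-level and ngram-level vectors, and \lambda_{\mathrm{word}} and \lambda_{\mathrm{sub}} are separate regularization strengths for the word-level and subword-level groups. Groups whose norm is driven exactly to zero under such a penalty are effectively deselected, which matches the stated goal of keeping only parameters that can be estimated reliably.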
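Likewise, a minimal sketch (not code from the thesis) of the group-wise proximal step that a proximal optimizer of this kind would apply to a single embedding row; all function and parameter names here are assumed for illustration only:

    import numpy as np

    def group_soft_threshold(v, threshold):
        # Proximal operator of threshold * ||v||_2 (block soft-thresholding):
        # shrinks the whole group towards zero and zeroes it out entirely
        # if its Euclidean norm falls below the threshold.
        norm = np.linalg.norm(v)
        if norm <= threshold:
            return np.zeros_like(v)
        return (1.0 - threshold / norm) * v

    def prox_sgd_step(row, grad, lr, lam):
        # One proximal SGD update for a single word or ngram embedding row:
        # a gradient step followed by the group lasso proximal step, with
        # learning rate lr and group regularization strength lam.
        return group_soft_threshold(row - lr * grad, lr * lam)

Rows that are repeatedly shrunk all the way to zero correspond to word or ngram groups that the regularization deselects.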
We evaluate the proposed method on a variety of tasks and find that the regularization enables improved performance for rare words and for morphologically complex languages such as German. By providing separate regularization for subword- and word-level information, the regularization hyperparameters further allow trading off between performance on semantic and syntactic tasks.

Date: Monday, 15 April 2019
Time: 3:00pm - 5:00pm
Venue: Room 4621 (Lifts 31/32)

Committee Members:
Prof. Dit-Yan Yeung (Supervisor)
Prof. Nevin Zhang (Chairperson)
Dr. Yangqiu Song

**** ALL are Welcome ****