Neural Architecture Design: Search Methods and Theoretical Understanding
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

PhD Thesis Defence

Title: "Neural Architecture Design: Search Methods and Theoretical Understanding"

By

Mr. Han SHI

Abstract

Deep learning has emerged as a milestone in the machine learning community due to its remarkable performance on a variety of tasks, such as computer vision and natural language processing. It has been demonstrated that the architecture of a neural network significantly influences its performance, so determining the architecture is an important problem. Methods for neural architecture design typically fall into two categories. The first designs neural architectures by search methods, which aim to discover promising architectures automatically; for example, the NASNet architecture was found in a predefined search space using a reinforcement learning algorithm. The second designs neural architectures manually based on prior knowledge and theoretical understanding; most practical architectures, such as ResNet and the Transformer, were proposed in this way. In this thesis, we provide a comprehensive discussion of neural architecture design from both perspectives.

Firstly, we introduce a neural architecture search algorithm based on Bayesian optimization, named BONAS. In the search phase, a GCN embedding extractor and Bayesian sigmoid regression constitute the surrogate model for Bayesian optimization, and candidate architectures in the search space are selected according to the acquisition function. In the query phase, we merge the candidates into a super network and evaluate each architecture through a weight-sharing mechanism. BONAS can thus discover promising architectures while balancing exploitation and exploration.

Secondly, we focus on the self-attention module of the Transformer and propose a differentiable architecture search method to find important attention patterns. In contrast to prior work, we find that the diagonal elements of the attention map can be dropped without harming performance. To understand this observation, we provide a theoretical proof from the perspective of universal approximation. Furthermore, based on the proposed search method, we obtain a series of attention masks for efficient architecture design.

Thirdly, we attempt to understand the feed-forward module of the Transformer within a unified framework. Specifically, we introduce the concept of memory tokens and establish the relationship between feed-forward and self-attention layers. Moreover, we propose a novel architecture named uni-attention, which contains all four types of attention connection in our framework. Uni-attention achieves better performance than previous baselines given the same number of memory tokens.

Finally, we investigate the over-smoothing phenomenon in the whole Transformer architecture. We provide a theoretical analysis by connecting self-attention to the graph domain. In particular, we find that layer normalization plays an important role in the over-smoothing problem and verify this empirically. To alleviate the issue, we propose hierarchical fusion architectures that make the output more diverse.

Date: Friday, 5 August 2022
Time: 10:00am - 12:00noon
Zoom Meeting: https://hkust.zoom.us/j/5599077828

Chairperson: Prof. Toyotaka ISHIBASHI (LIFS)

Committee Members:
Prof. James KWOK (Supervisor)
Prof. Minhao CHENG
Prof. Yangqiu SONG
Prof. Yuan YAO (MATH)
Prof. Irwin KING (CUHK)

**** ALL are Welcome ****