Large Scale Optimization Methods for Machine Learning
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

PhD Thesis Defence

Title: "Large Scale Optimization Methods for Machine Learning"

By

Mr. Shuai ZHENG

Abstract

Dealing with large-scale datasets and increasingly complex models has been a major challenge for optimization methods in machine learning. The stochastic gradient descent (SGD) method is widely viewed as an ideal approach for large-scale machine learning problems, whereas the conventional batch gradient method typically falters. Despite its flexibility and scalability, the stochastic gradient suffers from high variance, which impedes training. In this thesis, we introduce new optimization methods for tackling large-scale machine learning tasks, with a focus on fundamental challenges such as scalability and computational and communication efficiency.

The first part of this thesis introduces optimization methods for empirical risk minimization (ERM) problems, based on the principle of variance reduction. We first propose a fast and scalable stochastic ADMM method for solving ERM problems with complex nonsmooth regularizers such as the graph lasso and group lasso. Using similar principles, we also introduce a stochastic continuation method for convex problems in which both the loss and the regularizer are nonsmooth. While existing approaches rely crucially on the assumption that the dataset is finite, we develop two SGD-like algorithms for finite sums with infinite data (data augmentation). The proposed algorithms outperform existing methods in terms of both iteration complexity and storage.

The second part of this thesis investigates adaptive gradient methods for solving optimization problems in deep learning. Inspired by recent advances in adaptive gradient methods for training deep neural networks, we present a fast and powerful optimization algorithm called the follow-the-moving-leader (FTML) method. The new algorithm significantly outperforms existing approaches, thereby advancing the state of the art. As coordinate-wise adaptive methods, including FTML, can generalize worse than SGD in some cases, we propose blockwise adaptivity, which balances adaptivity and generalization. Both theoretically and empirically, it improves convergence and generalization over its coordinate-wise counterpart.

The last part of this thesis studies two critical aspects of distributed optimization methods: asynchronicity and communication efficiency. We develop a distributed asynchronous gradient-based method with a fast linear convergence rate for strongly convex ERM problems, improving upon existing distributed machine learning algorithms. On the other hand, the scalability of large-scale distributed training of deep neural networks is often limited by communication overhead. Motivated by recent advances in optimization with compressed gradients, we propose a communication-efficient distributed blockwise momentum SGD with error feedback. The proposed method provably converges to a stationary point at the same asymptotic rate as distributed synchronous momentum SGD, while achieving a nearly 32x reduction in communication cost.

Date: Monday, 8 July 2019
Time: 10:00am - 12:00noon
Venue: Room 5501 (Lifts 25/26)

Chairman: Prof. Howard C LUONG (ECE)

Committee Members: Prof. Tin-Yau Kwok (Supervisor)
                   Prof. Brian Mak
                   Prof. Raymond Wong
                   Prof. Tong Zhang (MATH)
                   Prof. Zhihua Zhang (Peking University)

**** ALL are Welcome ****
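The variance-reduction principle underlying the first part can be illustrated with an SVRG-style update: a full gradient computed at a periodically refreshed snapshot point serves as a control variate for the per-sample stochastic gradient. The sketch below is a generic illustration under assumed names (grad_i, svrg), not the stochastic ADMM or continuation methods proposed in the thesis.

```python
import numpy as np

def svrg(grad_i, w0, n, lr=0.1, epochs=20, inner_steps=None, rng=None):
    """SVRG-style variance-reduced SGD (illustrative sketch).

    grad_i(w, i) returns the gradient of the i-th loss term at w (float array);
    n is the number of terms in the finite sum (the training set size).
    """
    rng = np.random.default_rng() if rng is None else rng
    inner_steps = n if inner_steps is None else inner_steps
    w = w0.astype(float).copy()
    for _ in range(epochs):
        # Snapshot point and its full (batch) gradient, recomputed once per epoch.
        w_snap = w.copy()
        full_grad = np.mean([grad_i(w_snap, i) for i in range(n)], axis=0)
        for _ in range(inner_steps):
            i = rng.integers(n)
            # Control-variate correction: the update stays unbiased,
            # but its variance shrinks as w approaches the snapshot.
            g = grad_i(w, i) - grad_i(w_snap, i) + full_grad
            w -= lr * g
    return w
```

For smooth and strongly convex finite sums, this control-variate correction allows a constant step size and linear convergence, which is the kind of property the variance-reduced ERM methods in the first part build on.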
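The roughly 32x communication saving quoted for the compressed distributed method reflects transmitting about one bit per coordinate instead of a 32-bit float. Below is a minimal single-worker sketch of sign compression with error feedback; momentum and the blockwise partitioning of the proposed method are omitted, and the function and variable names are illustrative assumptions.

```python
import numpy as np

def ef_sign_step(w, grad, residual, lr=0.01):
    """One step of sign compression with error feedback (simplified, single worker).

    Only the scaled sign of the corrected update would be communicated:
    about 1 bit per coordinate instead of a 32-bit float.
    """
    corrected = lr * grad + residual                           # add back previously dropped error
    update = np.mean(np.abs(corrected)) * np.sign(corrected)   # scaled sign compression
    residual = corrected - update                              # keep compression error locally
    return w - update, residual
```

Keeping the residual locally and feeding it back into the next step is what lets such compressed methods retain an SGD-like convergence rate despite the aggressive quantization.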