Towards Efficient Deep Learning Systems with Learning-Based Optimizations
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

PhD Thesis Defence

Title: "Towards Efficient Deep Learning Systems with Learning-Based Optimizations"

By

Mr. Yiding WANG

Abstract

Deep learning has demonstrated advanced performance in various computer vision and natural language processing tasks over the past decade. Deep learning models are now fundamental building blocks for applications including autonomous driving, cloud video analytics, sentiment analysis, and natural language inference. To achieve high accuracy on demanding tasks, deep neural networks grow rapidly in size and computational complexity and require large volumes of high-fidelity data, making training and inference time-consuming and costly. These challenges have become salient and motivate practitioners to focus on building machine learning systems. In recent years, the intersection of traditional computer systems and machine learning has attracted considerable research attention, including applying machine learning techniques or learned policies in system designs (i.e., machine learning for systems) and optimizing systems specifically for machine learning pipelines and workloads (i.e., systems for machine learning). Combining both, research on using machine learning techniques to optimize machine learning systems shows significant efficiency improvements by exploiting the inherent mechanisms of learning tasks.

This dissertation proposes and discusses new directions for optimizing the speed, accuracy, and system overhead of machine learning training and inference systems in different applications using learning-based techniques. We find that aligning system designs with machine learning workloads lets systems prioritize the data, neural network parameters, and computation that machine learning tasks really need, e.g., achieving high-quality edge-cloud video analytics with low bandwidth consumption using optimized video data that preserves the necessary information, reducing models' training computation by focusing on under-trained parameters, and adaptively assigning less model capacity to simpler natural language queries through real-time semantic understanding. With three case studies ranging from training to inference and from computer vision to natural language processing, we show that using learning-based techniques to optimize the design of machine learning systems can directly benefit the efficiency of machine learning applications.

First, we propose and analyze Runespoor, an edge-cloud video analytics system that uses super-resolution to manage the accuracy loss caused by sending compressed data over the network. Emerging deep learning-based video analytics tasks, e.g., object detection and semantic segmentation, demand computation-intensive neural networks and powerful computing resources on the cloud to achieve high inference accuracy. Due to latency requirements and limited network bandwidth, edge-cloud systems adaptively compress the data to strike a balance between overall analytics accuracy and bandwidth consumption. However, the degraded data leads to another issue: poor tail accuracy, i.e., extremely low accuracy on a few semantic classes and video frames. Modern applications like autonomous robotics especially value tail accuracy, but suffer under prior edge-cloud systems. Our analytics-aware super-resolution extends super-resolution, an effective technique that learns a mapping from low-resolution frames to high-resolution frames. On the server, Runespoor reconstructs high-resolution frames from the compressed low-resolution data, augmenting them with the details that matter for the tail accuracy of video analytics tasks. Our evaluation shows that Runespoor improves class-wise tail accuracy by up to 300% and frame-wise 90%/99% tail accuracy by up to 22%/54%, and greatly improves the overall accuracy-bandwidth trade-off.

Next, we explore Egeria, a knowledge-guided deep learning training system that employs semantic knowledge from a reference model and knowledge distillation techniques to accelerate training: it accurately evaluates individual layers' training progress, safely freezes the converged ones, and saves their corresponding backward computation and communication. Training deep neural networks is time-consuming. While most existing efficient training solutions try to overlap or schedule computation and communication, Egeria goes one step further by skipping them through layer freezing. The key insight is that the training progress of internal neural network layers differs significantly, and front layers often become well-trained much earlier than deep layers. To exploit this, we introduce the notion of training plasticity to quantify the training progress of layers. Informed by the latest knowledge distillation research, we use a reference model that is generated on the fly with quantization techniques and runs forward operations asynchronously on available CPUs to minimize overhead. Our experiments with popular vision and language models show that Egeria achieves a 19%-43% training speedup over the state of the art without sacrificing accuracy.
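The freezing mechanism itself is simple in most training frameworks; the part Egeria contributes is deciding when a layer has converged. The following is a minimal PyTorch sketch, not Egeria's actual implementation: it assumes a hypothetical per-module plasticity score has already been computed (e.g., by comparing each module's outputs against the quantized reference model), which is the core of Egeria and is not shown here.

```python
import torch.nn as nn
from typing import Dict, List

def freeze_converged_layers(model: nn.Module,
                            plasticity: Dict[str, float],
                            threshold: float = 0.05) -> List[str]:
    """Freeze modules whose training plasticity has dropped below a
    threshold, so their parameters no longer need gradients.

    `plasticity` and `threshold` are illustrative assumptions: the
    score would come from comparing each module's outputs against a
    reference model, and the threshold would be tuned empirically.
    """
    frozen = []
    for name, module in model.named_children():
        score = plasticity.get(name)
        if score is not None and score < threshold:
            for param in module.parameters():
                # Autograd stops computing gradients for these
                # parameters, skipping their backward computation.
                param.requires_grad = False
            module.eval()  # fix batch-norm statistics and dropout
            frozen.append(name)
    return frozen
```

Invoked periodically during training, a routine like this realizes the savings the abstract describes: since front layers converge first, freezing them lets the backward pass stop early, and in data-parallel training their gradient synchronization disappears as well.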
Finally, we present Tabi, an inference system with a multi-level inference engine optimized for large language models and diverse workloads, exploiting the prediction confidence of neural networks and the Transformer's attention mechanism. Today's trend of building ever larger language models, while pushing the performance of natural language processing, adds significant latency to the inference stage. We observe that, due to the diminishing returns of adding model parameters, a smaller model can make the same prediction as a costly large language model for the majority of queries. Based on this observation, we design Tabi to serve queries using both small models and optional large ones, unlike the traditional one-model-for-all pattern. Tabi is optimized for discriminative models (i.e., not generative LLMs) in a serving framework. Tabi uses calibrated confidence scores to decide whether to return the accurate results of a small model extremely fast or to re-route the query to a large model (a minimal sketch of this routing follows the defence details below). For re-routed queries, it uses attention-based word pruning and weighted ensemble techniques to offset the system overhead and accuracy loss. Tabi achieves a 21%-40% average latency reduction (with comparable tail latency) over the state of the art while meeting high accuracy targets.

Date: Friday, 28 April 2023
Time: 4:30pm - 6:30pm
Venue: Room 5501 (lifts 25/26)

Chairperson: Prof. Bei ZENG (PHYS)

Committee Members:
Prof. Kai CHEN (Supervisor)
Prof. Gary CHAN
Prof. Hao LIU
Prof. Yi YANG (ISOM)
Prof. Heming CUI (HKU)

**** ALL are Welcome ****
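As referenced in the Tabi summary above, here is a minimal PyTorch sketch of confidence-based routing between a small and a large model. The temperature and threshold values are illustrative assumptions (both would be tuned on a validation set), and the sketch omits Tabi's attention-based word pruning and weighted ensembling for re-routed queries.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def route_query(inputs, small_model, large_model,
                temperature: float = 1.5, threshold: float = 0.9):
    """Return the small model's prediction when its calibrated
    confidence is high; otherwise fall back to the large model.

    Temperature scaling calibrates the small model's softmax so the
    confidence score is meaningful for the routing decision.
    """
    logits = small_model(inputs)
    probs = F.softmax(logits / temperature, dim=-1)
    confidence, prediction = probs.max(dim=-1)
    if confidence.item() >= threshold:  # assumes a single query (batch size 1)
        return prediction  # fast path: small model is confident
    # slow path: re-route the query to the large model
    return large_model(inputs).argmax(dim=-1)
```

The fast path returns after a single small-model pass, so only low-confidence queries pay the cost of the large model, which is consistent with the abstract's claim of lower average latency with comparable tail latency.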