Towards Efficient Deep Learning Systems with Learning-Based Optimizations
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

PhD Thesis Defence

Title: "Towards Efficient Deep Learning Systems with Learning-Based Optimizations"

By

Mr. Yiding WANG

Abstract

Deep learning has demonstrated advanced performance in various computer vision and natural language processing tasks over the past decade. Deep learning models are now fundamental building blocks for applications including autonomous driving, cloud video analytics, sentiment analysis, and natural language inference. To achieve high accuracy on demanding tasks, deep neural networks grow rapidly in size and computational complexity and require large volumes of high-fidelity data, making training and inference time-consuming and costly. These challenges have become salient and motivate practitioners to focus on building machine learning systems. In recent years, the intersection of traditional computer systems and machine learning has attracted considerable research attention, including applying machine learning techniques or learned policies in system designs (i.e., machine learning for systems) and optimizing systems specifically for machine learning pipelines and workloads (i.e., systems for machine learning). Combining both, research on using machine learning techniques to optimize machine learning systems shows significant efficiency improvements by exploiting the inherent mechanisms of learning tasks.

This dissertation proposes and discusses new directions for optimizing the speed, accuracy, and system overhead of machine learning training and inference systems in different applications using learning-based techniques. We find that aligning system designs with machine learning workloads lets systems prioritize the data, neural network parameters, and computation that machine learning tasks really need, e.g., achieving high-quality edge-cloud video analytics with low bandwidth consumption using optimized video data that preserves the necessary information, reducing models' training computation by focusing on under-trained parameters, and adaptively assigning less model capacity to simpler natural language queries through real-time semantic understanding. With three case studies ranging from training to inference and from computer vision to natural language processing, we show that using learning-based techniques to optimize the design of machine learning systems can directly benefit the efficiency of machine learning applications.

First, we propose and analyze Runespoor, an edge-cloud video analytics system that uses super-resolution to manage the accuracy loss caused by sending compressed data over the network. Emerging deep learning-based video analytics tasks, e.g., object detection and semantic segmentation, demand computation-intensive neural networks and powerful computing resources on the cloud to achieve high inference accuracy. Due to latency requirements and limited network bandwidth, edge-cloud systems adaptively compress the data to strike a balance between overall analytics accuracy and bandwidth consumption. However, the degraded data leads to another issue: poor tail accuracy, i.e., extremely low accuracy on a few semantic classes and video frames. Modern applications like autonomous robotics especially value tail accuracy, but suffer under prior edge-cloud systems. Our analytics-aware super-resolution extends super-resolution, an effective technique that learns a mapping from low-resolution frames to high-resolution frames. On the server, Runespoor reconstructs high-resolution frames from the compressed low-resolution data, augmenting them with the details that matter for the tail accuracy of video analytics tasks. Our evaluation shows that Runespoor improves class-wise tail accuracy by up to 300% and frame-wise 90%/99% tail accuracy by up to 22%/54%, and greatly improves the overall accuracy-bandwidth trade-off.

Next, we explore Egeria, a knowledge-guided deep learning training system that employs semantic knowledge from a reference model and knowledge distillation techniques to accelerate training: it accurately evaluates individual layers' training progress, safely freezes the converged ones, and saves their corresponding backward computation and communication. Training deep neural networks is time-consuming. While most existing efficient training solutions try to overlap or schedule computation and communication, Egeria goes one step further by skipping them through layer freezing. The key insight is that the training progress of internal neural network layers differs significantly, and front layers often become well-trained much earlier than deep layers. To exploit this, we introduce the notion of training plasticity to quantify the training progress of layers. Informed by the latest knowledge distillation research, we use a reference model that is generated on the fly with quantization techniques and runs forward operations asynchronously on available CPUs to minimize overhead. Our experiments with popular vision and language models show that Egeria achieves a 19%-43% training speedup over the state of the art without sacrificing accuracy.
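The freezing mechanism itself is simple in most training frameworks; the part Egeria contributes is deciding when a layer has converged. The following is a minimal PyTorch sketch, not Egeria's actual implementation: it assumes a hypothetical per-module plasticity score has already been computed (e.g., by comparing each module's outputs against the quantized reference model), which is the core of Egeria and is not shown here.

```python
import torch.nn as nn
from typing import Dict, List

def freeze_converged_layers(model: nn.Module,
                            plasticity: Dict[str, float],
                            threshold: float = 0.05) -> List[str]:
    """Freeze modules whose training plasticity has dropped below a
    threshold, so their parameters no longer need gradients.

    `plasticity` and `threshold` are illustrative assumptions: the
    score would come from comparing each module's outputs against a
    reference model, and the threshold would be tuned empirically.
    """
    frozen = []
    for name, module in model.named_children():
        score = plasticity.get(name)
        if score is not None and score < threshold:
            for param in module.parameters():
                # Autograd stops computing gradients for these
                # parameters, skipping their backward computation.
                param.requires_grad = False
            module.eval()  # fix batch-norm statistics and dropout
            frozen.append(name)
    return frozen
```

Invoked periodically during training, a routine like this realizes the savings the abstract describes: since front layers converge first, freezing them lets the backward pass stop early, and in data-parallel training their gradient synchronization disappears as well.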
Finally, we present Tabi, an inference system with a multi-level inference engine optimized for large language models and diverse workloads, exploiting the prediction confidence of neural networks and the Transformer's attention mechanism. Today's trend of building ever larger language models, while pushing the performance of natural language processing, adds significant latency to the inference stage. We observe that, due to the diminishing returns of adding model parameters, a smaller model can make the same prediction as a costly large language model for the majority of queries. Based on this observation, we design Tabi to serve queries using both small models and optional large ones, unlike the traditional one-model-for-all pattern. Tabi is optimized for discriminative models (i.e., not generative LLMs) in a serving framework. Tabi uses calibrated confidence scores to decide whether to return the accurate results of a small model extremely fast or to re-route the query to a large model (a minimal sketch of this routing follows the defence details below). For re-routed queries, it uses attention-based word pruning and weighted ensemble techniques to offset the system overhead and accuracy loss. Tabi achieves a 21%-40% average latency reduction (with comparable tail latency) over the state of the art while meeting high accuracy targets.

Date: Friday, 28 April 2023
Time: 4:30pm - 6:30pm
Venue: Room 5501 (lifts 25/26)

Chairperson: Prof. Bei ZENG (PHYS)

Committee Members:
Prof. Kai CHEN (Supervisor)
Prof. Gary CHAN
Prof. Hao LIU
Prof. Yi YANG (ISOM)
Prof. Heming CUI (HKU)

**** ALL are Welcome ****
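As referenced in the Tabi summary above, here is a minimal PyTorch sketch of confidence-based routing between a small and a large model. The temperature and threshold values are illustrative assumptions (both would be tuned on a validation set), and the sketch omits Tabi's attention-based word pruning and weighted ensembling for re-routed queries.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def route_query(inputs, small_model, large_model,
                temperature: float = 1.5, threshold: float = 0.9):
    """Return the small model's prediction when its calibrated
    confidence is high; otherwise fall back to the large model.

    Temperature scaling calibrates the small model's softmax so the
    confidence score is meaningful for the routing decision.
    """
    logits = small_model(inputs)
    probs = F.softmax(logits / temperature, dim=-1)
    confidence, prediction = probs.max(dim=-1)
    if confidence.item() >= threshold:  # assumes a single query (batch size 1)
        return prediction  # fast path: small model is confident
    # slow path: re-route the query to the large model
    return large_model(inputs).argmax(dim=-1)
```

The fast path returns after a single small-model pass, so only low-confidence queries pay the cost of the large model, which is consistent with the abstract's claim of lower average latency with comparable tail latency.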