Network Scheduling for Distributed Machine Learning

MPhil Thesis Defence


Title: "Network Scheduling for Distributed Machine Learning"

By

Mr. Jiacheng XIA


Abstract

Distributed machine learning (DML) is of growing importance. Due to the 
growing scale of data and complexity of models, many important machine 
learning problems cannot be solved effectively on a single machine. Existing 
scheduling algorithms are insufficient for DML because of its complex 
computation-communication patterns and approximate nature. In the training 
stage of DML, the network becomes a bottleneck: models trained on different 
machines require frequent synchronization and updates, transmitting 
megabytes to gigabytes of parameters at second to sub-second intervals. In 
this thesis, we focus on network scheduling problems for DML.

Firstly, we propose Chukonu, an intra-job scheduler that allocates 
resources to processes of the same DML job across different servers. We show 
that DML jobs train faster when the computation and communication processes 
are decoupled in the scheduler design. Our prototype achieves up to 1.5x 
speedup across different parameter synchronization schemes. Secondly, we 
propose DeepProphet, a tool that estimates computation and network resource 
requirements offline by analyzing the dataflow graph of the DML application. 
Given a hardware configuration, DeepProphet predicts the iteration 
completion time with below 10% average error. We demonstrate that resource 
requirement estimation for DML can be performed accurately via offline 
analysis, a feature that benefits later inter-job scheduler designs.


Date:			Wednesday, 14 August 2019

Time:			4:00pm - 6:00pm

Venue:			Room 3494
 			Lifts 25/26

Committee Members:	Dr. Kai Chen (Supervisor)
 			Prof. Gary Chan (Chairperson)
 			Dr. Qifeng Chen


**** ALL are Welcome ****