Network Scheduling for Distributed Machine Learning
MPhil Thesis Defence

Title: "Network Scheduling for Distributed Machine Learning"

By

Mr. Jiacheng XIA

Abstract

Distributed machine learning (DML) is of growing importance. Owing to the growing scale of data and the complexity of models, many important machine learning problems cannot be solved effectively on a single machine. Existing scheduling algorithms are insufficient because of the complex computation-communication patterns and the approximate nature of DML. In the training stage of DML, the network becomes a bottleneck: models trained on different machines must be synchronized and updated frequently, transmitting MB- to GB-scale parameters at second to sub-second intervals. In this thesis, we focus on network scheduling problems for DML.

Firstly, we propose Chukonu, an intra-job scheduler that allocates resources to processes of the same DML job running on different servers. We show that DML trains faster when the scheduler design decouples the computation and communication processes. Our prototype achieves a 1.5x speedup across different parameter synchronization schemes.

Secondly, we propose DeepProphet, a tool that analyzes the computation and network resource requirements of a DML application offline by analyzing its dataflow graph. Given a hardware configuration, DeepProphet predicts the iteration completion time with below 10% average error. We demonstrate that resource-requirement analysis for DML can be performed accurately offline, a feature that benefits later inter-job scheduler designs.

Date: Wednesday, 14 August 2019
Time: 4:00pm - 6:00pm
Venue: Room 3494, Lifts 25/26

Committee Members:
Dr. Kai Chen (Supervisor)
Prof. Gary Chan (Chairperson)
Dr. Qifeng Chen

**** ALL are Welcome ****
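The computation-communication decoupling mentioned in the abstract can be illustrated with a toy timing model. This is a hypothetical sketch under simplified assumptions (one network link, per-layer compute and transfer times are invented numbers), not Chukonu's actual scheduler: it only shows why starting each layer's gradient transfer as soon as that layer finishes, instead of after all computation ends, shortens an iteration.

```python
# Hypothetical sketch: compare a sequential DML iteration (compute all
# gradients, then transmit) against one that overlaps per-layer gradient
# transfers with the remaining backward computation.

def sequential_iteration(compute, comm):
    """All backward passes finish before any transfer starts,
    so per-layer times simply add up."""
    return sum(compute) + sum(comm)

def overlapped_iteration(compute, comm):
    """Each layer's transfer starts as soon as both that layer's
    gradients are ready and the single network link is free."""
    t_ready = 0.0  # time when the current layer's gradients are ready
    t_link = 0.0   # time when the network link becomes free
    for c, m in zip(compute, comm):
        t_ready += c                       # layer finishes its backward pass
        t_link = max(t_link, t_ready) + m  # then its gradients are transmitted
    return t_link

# Invented per-layer backward-compute and gradient-transfer times (ms).
compute_ms = [10, 10, 10, 10]
comm_ms = [8, 8, 8, 8]

print(sequential_iteration(compute_ms, comm_ms))  # 72
print(overlapped_iteration(compute_ms, comm_ms))  # 48.0
```

With these made-up numbers the overlapped schedule hides most of the communication behind computation; the real speedup of any such scheduler depends on the actual compute/transfer ratio of the model and network.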