Automatically Debugging AutoML Pipelines Using Maro: ML Automated Remediation Oracle
Speaker: Julian Dolby, IBM
Title: "Automatically Debugging AutoML Pipelines Using Maro: ML Automated Remediation Oracle"
Date: Friday, 18 August 2023
Time: 11:00am - 12 noon
Venue: Room 4475 (via lift 25/26), HKUST

Abstract:

Machine learning in practice often involves complex pipelines for data cleansing, feature engineering, preprocessing, and prediction. These pipelines are composed of operators, which have to be correctly connected and whose hyperparameters must be correctly configured. Unfortunately, it is quite common for certain combinations of datasets, operators, or hyperparameters to cause failures. Diagnosing and fixing those failures is tedious and error-prone and can seriously derail a data scientist's workflow.

We describe an approach for automatically debugging an ML pipeline, explaining the failures, and producing a remediation. We implemented our approach, which builds on a combination of AutoML and SMT, in a tool called Maro. Maro builds a model of hyperparameter values and pipeline outcomes and calculates the values that lead to failure. For instance, some hyperparameter values require numerical data (such as using the average value of a column to fill in missing data) and will fail when non-numeric data is supplied. These constraints are solved to find a set of values that removes all failures. This is, in effect, a form of code generation. Maro works seamlessly with the familiar data science ecosystem, including Python, Jupyter notebooks, scikit-learn, and AutoML tools such as Hyperopt.

We empirically evaluate our tool and find that, in most cases, a single remediation automatically fixes errors, produces no additional faults, and does not significantly impact optimal accuracy or time to convergence. Since our ongoing collaborations with HKUST involve code generation based on data, I shall focus on the solver portion of this work and discuss how it could be generalized.
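As a rough illustration of the idea in the abstract, the sketch below records which hyperparameter values succeed or fail on a dataset and then narrows the search space to values that can succeed. It is not Maro's implementation: the names `SPACE`, `run_pipeline`, `collect_outcomes`, and `remediate` are hypothetical, the pipeline is a toy stand-in rather than scikit-learn, and exhaustive enumeration with a simple per-value filter takes the place of the AutoML search and SMT solving that the talk describes.

```python
from itertools import product

# Hypothetical search space for a toy "imputer + scaler" pipeline.
SPACE = {
    "impute_strategy": ["mean", "median", "most_frequent"],
    "scale": [True, False],
}


def run_pipeline(config, column_is_numeric):
    """Toy stand-in for fitting a pipeline; raises on invalid combinations."""
    if not column_is_numeric and config["impute_strategy"] in ("mean", "median"):
        raise ValueError("cannot average non-numeric data")
    if not column_is_numeric and config["scale"]:
        raise ValueError("cannot scale non-numeric data")
    return "ok"


def collect_outcomes(column_is_numeric):
    """Try every configuration and record whether it succeeded."""
    keys = list(SPACE)
    outcomes = []
    for values in product(*(SPACE[k] for k in keys)):
        config = dict(zip(keys, values))
        try:
            run_pipeline(config, column_is_numeric)
            outcomes.append((config, True))
        except ValueError:
            outcomes.append((config, False))
    return outcomes


def remediate(outcomes):
    """Keep only the values that occur in at least one successful run."""
    remediated = {}
    for key, values in SPACE.items():
        remediated[key] = [
            v for v in values
            if any(ok for config, ok in outcomes if config[key] == v)
        ]
    return remediated


if __name__ == "__main__":
    # Non-numeric column: averaging and scaling are ruled out.
    print(remediate(collect_outcomes(column_is_numeric=False)))
```

For a non-numeric column, this toy filter keeps only `most_frequent` imputation with scaling disabled. A per-value filter like this cannot capture failures that depend on interactions between several hyperparameters, which is one reason a constraint solver is a more natural fit for the general problem.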
****************

Biography:

Julian Dolby has been a Research Staff Member at the IBM Thomas J. Watson Research Center for more than 20 years. He has worked on a wide range of topics, spanning virtual machines, program analysis, model checking, databases, semantics and machine learning. He has published papers on all of these topics, and he has contributed to a range of IBM products over the years, in the WebSphere, Rational, DB2 and AppScan brands. He has also worked on open source projects, notably the WALA program analysis framework, of which he was a co-creator, GraphGen4Code, and Project CodeNet.