Algorithms, Applications, and Verification of Causal Structure Learning
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

PhD Thesis Defence

Title: "Algorithms, Applications, and Verification of Causal Structure Learning"

By

Mr. Pingchuan MA

Abstract:

Understanding causal relations is one of the most fundamental problems in scientific discovery, in settings such as clinical trials and economics. The gold standard for inferring causal relations is to conduct randomized experiments, which, however, are often infeasible due to high costs or ethical concerns. In contrast, causal structure learning (a.k.a. causal discovery) aims to infer causal relations from observational data and learn the probabilistic graphical model underlying the data. Historically, conventional causal structure learning algorithms have relied on carefully crafted criteria to deduce graph structures. For instance, the PC (Peter-Clark) algorithm conducts conditional independence (CI) tests to constrain graphical structures and gradually deduces the whole graph from data. As a result, these algorithms often produce spurious causal relations.

In this thesis, we propose two novel algorithms, namely ML4S and SPOT, which leverage machine learning techniques to predict causal relations from observational data. ML4S is a supervised causal structure learning algorithm that predicts edge adjacencies in the causal skeleton. SPOT first infers the posteriors of causal skeletons using amortized variational inference, and then uses the posteriors to guide the search for the causal graph (in a continuous optimization setting). We show that both algorithms outperform state-of-the-art causal structure learning algorithms in terms of both accuracy and scalability.

Then, we show two applications of causal structure learning in the context of databases. First, we present XINSIGHT, a general framework for XDA (explainable data analysis). XINSIGHT provides data analysis with qualitative and quantitative explanations of causal and non-causal semantics.
In this way, it significantly improves human understanding of, and confidence in, the outcomes of data analysis, facilitating accurate data interpretation and decision making in the real world. XINSIGHT is a three-module, end-to-end pipeline designed to extract causal graphs, translate causal primitives into XDA semantics, and quantify the contribution of each explanation to a data fact. XINSIGHT uses a set of design concepts and optimizations to address the inherent difficulties associated with integrating causality into XDA. Experiments on synthetic and real-world datasets, as well as a user study, demonstrate the highly promising capabilities of XINSIGHT.

Second, we present GUARDRAIL, which advocates a novel focus on discovering integrity constraints with program synthesis techniques from noisy and opaque I/O examples. To support the program synthesis task, we introduce a domain-specific language (DSL) and propose a sketch-based, structure learning-powered synthesis algorithm over the DSL. We demonstrate the effectiveness of our approach on 48 ML-integrated SQL queries using 12 real-world datasets. The evaluation shows that our approach can effectively synthesize integrity constraints from noisy data, and can also solidify queries, with an average reduction of 87% in error rates.

Finally, we propose a runtime verification tool called CICHECK, designed to harden causal structure learning algorithms from both reliability and privacy perspectives. CICHECK employs a sound and decidable encoding scheme that translates CIR into SMT problems. To solve the CIR problem efficiently, CICHECK introduces a four-stage decision procedure with three lightweight optimizations that actively prove or refute consistency, and resorts to costly SMT-based reasoning only when necessary.
Based on the decision procedure for CIR, CICHECK includes two variants, ED-CHECK and P-CHECK, which detect erroneous CI tests (to enhance reliability) and prune excessive CI tests (to enhance privacy), respectively. We evaluate CICHECK on four real-world datasets and 100 CIR instances, showing its effectiveness in detecting erroneous CI tests and reducing excessive CI tests while retaining practical performance.

Date: Monday, 23 September 2024

Time: 9:00am - 11:00am

Venue: Room 5501, Lifts 25/26

Chairman: Prof. Li Min ZHANG (CIVL)

Committee Members:
Dr. Shuai WANG (Supervisor)
Prof. Raymond WONG
Prof. Nevin ZHANG
Dr. Jia LIU (MARK)
Prof. Jun SUN (SMU)