Algorithms, Applications, and Verification of Causal Structure Learning

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Algorithms, Applications, and Verification of Causal Structure 
Learning"

By

Mr. Pingchuan MA


Abstract:

Understanding causal relations is one of the most fundamental problems in 
scientific discovery, in areas such as clinical trials and economics. The gold 
standard for 
inferring causal relations is to conduct randomized experiments, which, 
however, are often infeasible due to high costs or ethical concerns. In 
contrast, causal structure learning (a.k.a., causal discovery) aims to infer 
causal relations from observational data and learn the probabilistic graphical 
model of the underlying data. Conventional causal structure learning 
algorithms generally rely on carefully crafted criteria to deduce graph 
structures. For instance, the PC (Peter-Clark) algorithm conducts conditional 
independence tests to constrain graphical structures and gradually deduce the 
whole graph from data. Since such tests can be unreliable on finite data, these 
algorithms often produce spurious causal relations.

In this thesis, we propose two novel algorithms, namely, ML4S and SPOT, which 
leverage machine learning techniques to predict causal relations from 
observational data. ML4S is a supervised causal structure learning algorithm 
that predicts edge adjacencies in the causal skeleton. SPOT first infers the 
posteriors of causal skeletons using amortized variational inference, and then 
uses the posteriors to guide the search for the causal graph (in a continuous 
optimization setting). We show that both algorithms outperform the 
state-of-the-art causal structure learning algorithms in terms of both accuracy 
and scalability.

Then, we show two applications of causal structure learning in the context of 
databases. First, we present XINSIGHT, a general framework for XDA (explainable 
data analysis). XINSIGHT provides data analysis with qualitative and 
quantitative explanations of causal and non-causal semantics. This way, it will 
significantly improve human understanding and confidence in the outcomes of 
data analysis, facilitating accurate data interpretation and decision making in 
the real world. XINSIGHT is a three-module, end-to-end pipeline designed to 
extract causal graphs, translate causal primitives into XDA semantics, and 
quantify the contribution of each explanation to a data fact. 
XINSIGHT uses a set of design concepts and optimizations to address the 
inherent difficulties associated with integrating causality into XDA. 
Experiments on synthetic and real-world datasets as well as a user study 
demonstrate the highly promising capabilities of XINSIGHT. Second, we present 
GUARDRAIL, which advocates a novel focus on discovering integrity constraints 
with program synthesis techniques from noisy and opaque I/O examples. To 
support the program synthesis task, we introduce a domain-specific language 
(DSL) and propose a sketch-based and structure learning-powered synthesis 
algorithm over the DSL. We demonstrate the effectiveness of our approach on 48 
ML-integrated SQL queries using 12 real-world datasets. Evaluation shows that 
our approach can effectively synthesize integrity constraints from noisy data 
and solidify queries, reducing error rates by 87% on average.

Finally, we propose a runtime verification tool called CICHECK, designed to 
harden causal structure learning algorithms from reliability and privacy 
perspectives. CICHECK employs a sound and decidable encoding scheme that 
translates conditional independence reasoning (CIR) problems into SMT 
problems. To solve the CIR problem efficiently, CICHECK introduces a 
four-stage decision procedure with three lightweight optimizations that 
actively prove or refute consistency, resorting to costly SMT-based reasoning 
only when necessary. Based on the decision procedure for CIR, CICHECK 
includes two variants: ED-CHECK and P-CHECK, which detect erroneous CI tests 
(to enhance reliability) and prune excessive CI tests (to enhance privacy), 
respectively. We evaluate CICHECK on four real-world datasets and 100 CIR 
instances, showing its effectiveness in detecting erroneous CI tests and 
reducing excessive CI tests while retaining practical performance.


Date:                   Monday, 23 September 2024

Time:                   9:00am - 11:00am

Venue:                  Room 5501
                        Lifts 25/26

Chairman:               Prof. Li Min ZHANG (CIVL)

Committee Members:      Dr. Shuai WANG (Supervisor)
                        Prof. Raymond WONG
                        Prof. Nevin ZHANG
                        Dr. Jia LIU (MARK)
                        Prof. Jun SUN (SMU)