Algorithms, Applications, and Verification of Causal Structure Learning
The Hong Kong University of Science and Technology
Department of Computer Science and Engineering
PhD Thesis Defence
Title: "Algorithms, Applications, and Verification of Causal Structure
Learning"
By
Mr. Pingchuan MA
Abstract:
Understanding causal relations is one of the most fundamental problems in
scientific discovery, in areas such as clinical trials and economics. The gold standard for
inferring causal relations is to conduct randomized experiments, which,
however, are often infeasible due to high costs or ethical concerns. In
contrast, causal structure learning (a.k.a., causal discovery) aims to infer
causal relations from observational data and learn the probabilistic graphical
model of the underlying data. Conventional causal structure learning
algorithms generally rely on carefully crafted criteria to deduce graph
structures. For instance, the PC (Peter-Clark) algorithm conducts conditional
independence tests to constrain the graph structure and gradually deduces the
whole graph from data. As a result, such algorithms often produce spurious
causal relations.
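To make the notion of a conditional independence test concrete, the following is a minimal sketch of one common choice, the Fisher z-test on partial correlations. It assumes roughly linear-Gaussian data; the function name, parameters, and the toy chain X -> Y -> Z are illustrative assumptions and are not taken from the thesis.

```python
import math
import numpy as np

def fisher_z_ci_test(data, i, j, cond=(), alpha=0.05):
    """Test X_i independent of X_j given X_cond (hypothetical helper).

    Assumes approximately Gaussian data. Returns (independent, p_value).
    """
    idx = [i, j, *cond]
    corr = np.corrcoef(data[:, idx], rowvar=False)
    # The partial correlation of the first two variables given the rest
    # can be read off the inverse of the correlation matrix.
    prec = np.linalg.pinv(corr)
    r = -prec[0, 1] / math.sqrt(prec[0, 0] * prec[1, 1])
    r = max(min(r, 0.999999), -0.999999)  # keep the log below finite
    n = data.shape[0]
    z = 0.5 * math.log((1 + r) / (1 - r))  # Fisher z-transform
    stat = math.sqrt(n - len(cond) - 3) * abs(z)
    p_value = math.erfc(stat / math.sqrt(2))  # two-sided normal p-value
    return p_value > alpha, p_value

# Toy chain X -> Y -> Z: X and Z are dependent, but independent given Y.
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = 2 * x + rng.normal(size=2000)
z = 2 * y + rng.normal(size=2000)
samples = np.column_stack([x, y, z])
print(fisher_z_ci_test(samples, 0, 2))             # marginal test
print(fisher_z_ci_test(samples, 0, 2, cond=(1,)))  # conditional on Y
```

A constraint-based learner runs many such tests; since each test can err on finite samples, a single wrong outcome may add or remove an edge in the learned graph.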
In this thesis, we propose two novel algorithms, namely, ML4S and SPOT, which
leverage machine learning techniques to predict causal relations from
observational data. ML4S is a supervised causal structure learning algorithm
that predicts edge adjacencies in the causal skeleton. SPOT first infers the
posteriors of causal skeletons using amortized variational inference, and then
uses the posteriors to guide the search for the causal graph (in a continuous
optimization setting). We show that both algorithms outperform the
state-of-the-art causal structure learning algorithms in terms of both accuracy
and scalability.
Then, we show two applications of causal structure learning in the context of
databases. First, we present XINSIGHT, a general framework for XDA (explainable
data analysis). XINSIGHT provides data analysis with qualitative and
quantitative explanations of causal and non-causal semantics. This way, it will
significantly improve human understanding and confidence in the outcomes of
data analysis, facilitating accurate data interpretation and decision making in
the real world. XINSIGHT is a three-module, end-to-end pipeline designed to
extract causal graphs, translate causal primitives into XDA semantics, and
quantify the contribution of each explanation to a data fact.
XINSIGHT uses a set of design concepts and optimizations to address the
inherent difficulties associated with integrating causality into XDA.
Experiments on synthetic and real-world datasets as well as a user study
demonstrate the highly promising capabilities of XINSIGHT. Then, we present
GUARDRAIL, which advocates a novel focus on discovering integrity constraints
with program synthesis techniques from noisy and opaque I/O examples. To
support the program synthesis task, we introduce a domain-specific language
(DSL) and propose a sketch-based and structure learning-powered synthesis
algorithm over the DSL. We demonstrate the effectiveness of our approach on 48
ML-integrated SQL queries using 12 real-world datasets. Evaluation shows that
our approach can effectively synthesize integrity constraints from noisy data,
and also harden queries, reducing error rates by 87% on average.
Finally, we propose a runtime verification tool called CICHECK, designed to
harden causal structure learning algorithms from reliability and privacy
perspectives. CICHECK employs a sound and decidable encoding scheme that
translates conditional independence reasoning (CIR) into SMT problems. To
solve the CIR problem efficiently, CICHECK introduces a four-stage decision
procedure with three lightweight optimizations that actively prove or refute
consistency, resorting to costly SMT-based reasoning only when necessary.
Based on the decision procedure for CIR, CICHECK
includes two variants: ED-CHECK and P-CHECK, which detect erroneous CI tests
(to enhance reliability) and prune excessive CI tests (to enhance privacy),
respectively. We evaluate CICHECK on four real-world datasets and 100 CIR
instances, showing its effectiveness in detecting erroneous CI tests and
reducing excessive CI tests while retaining practical performance.
Date: Monday, 23 September 2024
Time: 9:00am - 11:00am
Venue: Room 5501
Lifts 25/26
Chairman: Prof. Li Min ZHANG (CIVL)
Committee Members: Dr. Shuai WANG (Supervisor)
Prof. Raymond WONG
Prof. Nevin ZHANG
Dr. Jia LIU (MARK)
Prof. Jun SUN (SMU)