Practical Symbolic Static Analysis For Effective Bug Detection

PhD Thesis Proposal Defence


Title: "Practical Symbolic Static Analysis For Effective Bug Detection"

by

Mr. Yiyuan GUO


Abstract:

Static analysis infers the behaviors of programs without executing them and 
is widely used to detect bugs automatically. Many static analysis techniques 
employ symbolic methods where symbolic variables are introduced for unknown 
values and the program state is expressed in symbolic formulas. However, we 
encounter great challenges in making the symbolic method effective for bug 
detection in realistic, million-line codebases. First, when targeting 
real-world software, the inherent limitation of static analysis often leads 
to the introduction of many under-constrained symbolic variables, causing 
significant imprecision. Second, the scalability of the symbolic method is 
greatly affected by its memory model, especially when the checked property 
involves complex interactions between memory and numeric values. This thesis 
presents contributions to improve the precision and scalability of the 
symbolic method for enabling practical static bug detection at an industrial 
scale.

The first work focuses on addressing the imprecision of divide-by-zero 
detection caused by under-constrained symbolic variables. Our key idea is to 
gather extra evidence about the under-constrained variables by inferring the 
programmers' beliefs, which helps the static analyzer to report bugs that are 
more likely to be true. Our evidence-based inference approach greatly 
outperformed existing solutions in precision: 72 divide-by-zero bugs were 
uncovered in large codebases such as the Linux kernel with a low false 
positive rate of 22%.

The second work addresses the scalability problem of the symbolic method for 
handling intricate semantics of memory operations. Specifically, we study the 
problem of static buffer overflow detection, where mutually dependent program 
properties of heap and numeric value need to be precisely tracked at the same 
time. We observe that the key is to handle symbolic heap locations, i.e., 
heap locations with possibly symbolic numerical offsets. By introducing a 
disjointness assumption into the memory model, we enable strong updates 
efficiently and support a summary-based modular analysis algorithm. While 
existing works either cannot finish or produce an excessive number of false 
positives, our approach can handle millions of lines of code in less than 
four hours with a false positive rate of 37%.


Date:                   Monday, 25 November 2024

Time:                   3:30pm - 5:30pm

Venue:                  Room 5501
                        Lifts 25/26

Committee Members:      Prof. Charlesz Zhang (Supervisor)
                        Prof. Qiong Luo (Chairperson)
                        Dr. Dongdong She
                        Dr. Shuai Wang