Approximate Query Processing

In modern large-scale analytical database systems, the keywords are speed, accuracy, and interactivity. These are the criteria that count for data analysts and business users who need to perform analytical queries on a large amount of data with complex conditions and return aggregated results that can enable decisions to be made. For example, a sales director of a company handling millions of transactions per day might be interested to know the total revenue of all transactions for a specific product category in a certain period of time, where the buyer and seller are in specific regions, and the product has parts manufactured in yet another region. Yet, long running times for such analytical queries, even in leading, commercial-level database systems, are still among the major challenges to overcome.

Prof Ke Yi, an expert in database systems and algorithms, has solved a problem that has taxed the community for over 15 years, enabling responses to queries to be given in seconds rather than minutes or hours. Prof Yi and his team's novel algorithm allows the database to return approximate results in a very short time, and continue to improve their accuracy as more time is spent. Working on datasets of 100 gigabytes and larger, their "Wander Join" algorithm achieves more effective sampling through random walks, returning results with the same accuracy (for example, 95% confidence and 1%error) in one-hundredth of the time compared with prior solutions on the same hardware. The algorithm has been integrated into PostgreSQL, Spark, as well as Oracle through PL/SQL, demonstrating its viability in a variety of settings and bringing the sought-after goal of interactive data analysis a step closer.

This work was carried out in collaboration with the University of Utah, and has received the 2016 ACM SIGMOD Best Paper Award.