Detecting Concurrent Programming Errors

Brandon Lucia, Benjamin P. Wood and Luis Ceze
University of Washington
Dept. Computer Science and Engineering

Software failures pose a significant cost to our economy --- NIST estimates around $60B per year. Pervasive parallelism is likely to exacerbate this cost, due to the increased complexity of correctly writing concurrent and parallel programs. Today's software engineering tools do not provide programmers with the support they need to effectively write correct concurrent programs. The goal of this project is to provide new mechanisms that help programmers get their programs right, by identifying and elucidating concurrency errors in their programs. Our approach to the concurrent software correctness problem leverages novel and existing support across the system stack: from the architecture level, all the way up to the programmer. Such support in conjunction with novel abstractions for program behavior and algorithms for manipulating these abstractions, enable us to develop techniques to detect concurrency errors with high precision and low overhead. When possible, we leverage machine learning and statistical inference to further increase the precision of our techniques.

Publications

Bugaboo: Finding Concurrency Bugs with Context-Aware Communication Graphs

MICRO 2009

Incorrect thread synchronization often leads to concurrency bugs that manifest nondeterministically and are difficult to detect and fix. Past work on detecting concurrency bugs has addressed the general problem in an ad-hoc fashion, focusing mostly on data races and atomicity violations.

Using graphs to represent a multithreaded program execution is very natural, nodes represent static instructions and edges represent communication via shared memory. In this paper we make the fundamental observation that such basic context-oblivious graphs do not encode enough information to enable accurate bug detection. We propose context-aware communication graphs, a new kind of communication graph that encodes global ordering information by embedding communication contexts. We then build Bugaboo, a simple and generic framework that accurately detects complex concurrency bugs. Our framework collects communication graphs from multiple executions and uses invariant-based techniques to detect anomalies in the graphs.

We built two versions of Bugaboo: BB-SW, which is fully implemented in software but suffers from significant slowdowns; and BB-HW, which relies on custom architecture support but has negligible performance degradation. BB-HW requires modest extensions to a commodity multicore processor and can be used in deployment settings. We evaluate both versions using applications such as MySQL, Apache, PARSEC, and several others. Our results show that Bugaboo identifies a wide variety of concurrency bugs, including challenging multivariable bugs, with few (often zero) unnecessary code inspections.

Recon: Isolating and Understanding Concurrency Errors Using Reconstructed Execution Fragments

PLDI 2011

In this paper we propose Recon, a new general approach to concurrency debugging. Recon goes beyond just detecting bugs, it also presents to the programmer short fragments of buggy execution schedules that illustrate how and why bugs happened. These fragments, called reconstructions, are inferred from inter-thread communication surrounding the root cause of a bug and significantly simplify the process of understanding bugs.

The key idea in Recon is to monitor executions and build graphs that encode inter-thread communication with enough context information to build reconstructions. Recon leverages reconstructions built from multiple application executions and uses machine learning to identify which ones illustrate the root cause of a bug. Recon's approach is general because it does not rely on heuristics specific to any type of bug, application, or programming model. Therefore, it is able to deal with single- and multiple-variable concurrency bugs regardless of their type (e.g., atomicity violation, ordering, etc). To make graph collection efficient, Recon employs selective monitoring and allows metadata information to be imprecise without compromising accuracy. With these optimizations, Recon's graph collection imposes overheads typically between 5x and 20x for both C/C++ and Java programs, with overheads as low as 13% in our experiments. We evaluate Recon with buggy applications, and show it produces reconstructions that include all code points involved in bugs' causes, and presents them in an accurate order. We include a case study of understanding and fixing a previously unresolved bug to showcase Recon's effectiveness.

Try Recon!