Brandon Lucia, Joseph Devietti, Karin Strauss and Luis Ceze
University of Washington
Dept. Computer Science and Engineering

Concurrent and parallel programming are becoming increasingly important as multi-core architectures find their way to the mainstream. Concurrent programming is more complex than single-threaded programming because of the need to consider the interaction of multiple threads. Due to the extreme difficulty of the concurrent software problem, we believe that detecting concurrency errors is only one part of the solution. Our insight is that we can alleviate the problem of concurrency errors by designing architectures and systems that significantly decrease the probability that software bugs manifest themselves and lead to a failure. We are developing systems that can detect and avoid a broad class of concurrency bugs. Concurrency bugs manifest nondeterministically, i.e., they depend upon the occurrence of a particular interleaving of program operations. As a result, a straightforward program with a single possible outcome may yield multiple valid global interleaving schedules: some will lead to correct behavior, and some may lead to a failure. By allowing only a subset of these interleavings (the correct interleavings), systems avoid concurrency bugs, and retain the original program semantics. We leverage this property and avoid concurrency bugs by monitoring when one is likely to happen and choosing a less dangerous interleaving. Since we don't want bugs to go simply unnoticed, we also use the same mechanisms to reports bugs back to the programmer. Such mechanisms that increase the reliability of software (at deployment time), and the correctness of software (at development time) are useful for the lifetime of a computer system, justifying the cost of their inclusion in computer architectures.

Publications

Atom-Aid: Surviving and Detecting Atomicity Violations

ISCA 2008, IEEE Micro Top Picks 2009

Writing shared-memory parallel programs is error-prone. Among the concurrency errors that programmers often face are atomicity violations, which are especially challenging. They happen when programmers make incorrect assumptions about atomicity and fail to enclose memory accesses that should occur atomically inside the same critical section. If these accesses happen to be interleaved with conflicting accesses from different threads, the program might behave incorrectly.

Recent architectural proposals arbitrarily group consecutive dynamic memory operations into atomic blocks to enforce memory ordering at a coarse grain. This provides what we call implicit atomicity, as the atomic blocks are not derived from explicit program annotations. In this paper, we make the fundamental observation that implicit atomicity probabilistically hides atomicity violations by reducing the number of interleaving opportunities between memory operations. We then propose Atom-Aid, which creates implicit atomic blocks intelligently instead of arbitrarily, dramatically reducing the probability that atomicity violations will manifest themselves. Atom-Aid is also able to report where atomicity violations might exist in the code, providing resilience and debuggability. We evaluate Atom-Aid using buggy code from applications including Apache, MySQL, and XMMS, showing that Atom-Aid virtually eliminates the manifestation of atomicity violations.

ColorSafe: Architectural Support for Debugging and Dynamically Avoiding Multi-variable Atomicity Violations

ISCA 2010

In this paper, we propose ColorSafe, an architecture that detects and dynamically avoids single- and multi-variable atomicity violation bugs. The key idea is to group related data into colors and then monitor access interleavings in the 'color space'. This enables detection of atomicity violations involving any data of the same color. We leverage support for meta-data to maintain color information, and signatures to efficiently keep recent color access histories. ColorSafe dynamically avoids atomicity violations by inserting ephemeral transactions that prevent erroneous interleavings. ColorSafe has two modes of operation: (1) debugging mode makes detection more precise, producing fewer false positives and collecting more information; and, (2) deployment mode provides robust, efficient dynamic bug avoidance with less precise detection. This makes ColorSafe useful throughout the lifetime of programs, not just during development. Our results show that, in deployment mode, ColorSafe is able to successfully avoid the majority of multi-variable atomicity violations in bug kernels, as well as in large applications (Apache and MySQL). In debugging mode, ColorSafe detects bugs with few false positives.