Detecting visibility bugs in concurrent Java

August 22, 2018

The chance of detecting a visibility bug varies widely. In the best case, the visibility bug in the following example can be detected in 90 percent of all cases. In the worst case, the chance of detecting it is lower than one in a million.
But first, what are visibility bugs?

What are visibility bugs?

A visibility bug happens when a thread reads a stale value. In the following example a thread signals another thread to stop the processing of its while loop:

public class Termination {
    private int v;

    public void runTest() throws InterruptedException {
        Thread workerThread = new Thread(() -> {
            while (v == 0) {
                // spin
            }
        });
        workerThread.start();
        v = 1;
        workerThread.join(); // test might hang here
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 1000; i++) {
            new Termination().runTest();
        }
    }
}

The bug is that the worker thread might never see the update of the variable v and therefore runs forever.
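Because a hanging join() blocks the whole test, it can help to wait with a timeout instead. The following is only a minimal sketch of that idea: the class HangDetector and the method terminatesWithin are hypothetical names chosen for this illustration, and the timeout of one second is an arbitrary assumption.

```java
// Hypothetical helper for illustration; not part of the original example or of any library.
class HangDetector {

    // Plain (non-volatile) field, as in the Termination example.
    private int v;

    /**
     * Starts the spinning code in a daemon thread, runs the signal in the
     * calling thread and waits at most timeoutMillis for the spinner to end.
     * Returns true if the spinner terminated within the timeout.
     */
    static boolean terminatesWithin(Runnable spinner, Runnable signal, long timeoutMillis)
            throws InterruptedException {
        Thread worker = new Thread(spinner);
        worker.setDaemon(true);     // a hung worker must not keep the JVM alive
        worker.start();
        signal.run();
        worker.join(timeoutMillis); // returns after the timeout even if the worker still spins
        return !worker.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        HangDetector test = new HangDetector();
        boolean terminated = terminatesWithin(
                () -> { while (test.v == 0) { /* spin */ } },
                () -> test.v = 1,
                1000); // assumed timeout in milliseconds
        System.out.println(terminated ? "TERMINATED" : "STALE");
    }
}
```

Which of the two results is printed depends on the JVM, its flags and the hardware, so a single run proves nothing about the absence of the bug.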

One reason for reading stale values is the caches of the CPU cores. Each core of a modern CPU has its own cache. So if the reading and the writing thread run on different cores, the reading thread may see a cached value and not the value written by the writing thread. The following shows the cores and caches inside an Intel Pentium 4 CPU, from this superuser answer:

Each core of an Intel Pentium 4 CPU has its own level 1 and level 2 cache. All cores share a large level 3 cache. The reason for those caches is performance. The following numbers show the time needed to access the memory, from Computer Architecture, A Quantitative Approach, JL Hennessy, DA Patterson, 5th edition, page 72:

Reading from and writing to a normal field does not invalidate the cache. So if two threads on different cores read and write the same variable, they can see stale values. Let us see if we can reproduce this bug.

How to reproduce a visibility bug

If you have run the above example, chances are high that the test did not hang. The test needs so few CPU cycles that both threads typically run on the same core. And when both threads run on the same core, they read from and write to the same cache. Luckily, the OpenJDK provides a tool, jcstress, which helps with this type of test. jcstress uses multiple tricks to make the threads of a test run on different cores. Here the above example is rewritten as a jcstress test:

import org.openjdk.jcstress.annotations.*;

@JCStressTest(Mode.Termination)
@Outcome(id = "TERMINATED", expect = Expect.ACCEPTABLE, desc = "Gracefully finished.")
@Outcome(id = "STALE", expect = Expect.ACCEPTABLE_INTERESTING, desc = "Test hung up.")
@State
public class APISample_03_Termination {
    int v; // plain field: no visibility guarantee between the two threads

    @Actor
    public void actor1() {
        while (v == 0) {
            // spin
        }
    }

    @Signal
    public void signal() {
        v = 1;
    }
}

This test is from the jcstress examples. By annotating the class with @JCStressTest we tell jcstress that this class is a jcstress test. jcstress runs the methods annotated with @Actor and @Signal each in a separate thread. It first starts the actor thread and then runs the signal thread. If the test exits in a reasonable time, jcstress records the result "TERMINATED", otherwise the result "STALE".

jcstress runs the test case multiple times with different JVM parameters. Here are the results of this test on my development machine, an Intel i5 CPU with 4 cores, using the test mode stress:

JVM options               Observed state   Occurrences
None                      TERMINATED       16
                          STALE            10
-XX:-TieredCompilation    TERMINATED       1
                          STALE            10
-XX:TieredStopAtLevel=1   TERMINATED       8776026
-Xint                     TERMINATED       9058042
As we see, with the JVM parameter -XX:-TieredCompilation the thread hangs in roughly 90 percent of all runs (10 of 11). But with the JVM flags -XX:TieredStopAtLevel=1 and -Xint the thread terminated in all runs.

After confirming that indeed our example contains a bug, how can we fix it?

How to avoid visibility bugs?

Java has specialized instructions which guarantee that a thread always sees the latest written value. One such instruction is the volatile field modifier. When reading a volatile field, a thread is guaranteed to see the last written value. The guarantee applies not only to the value of the field but to all values written by the writing thread before the write to the volatile variable. Adding the field modifier volatile to the field v from the above example makes sure that the while loop always terminates, even if we run it in a test with jcstress.

public class Termination {
   volatile int v;
   // methods omitted
}
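For reference, here is the complete first example with the fix applied, as a minimal runnable sketch. The class name VolatileTermination is chosen only for this illustration; everything else follows the Termination example.

```java
// Illustration only: the Termination example with a volatile field.
class VolatileTermination {
    // volatile: the write in runTest() happens-before the worker's subsequent reads of v
    private volatile int v;

    void runTest() throws InterruptedException {
        Thread workerThread = new Thread(() -> {
            while (v == 0) {
                // spin until the write to v becomes visible
            }
        });
        workerThread.start();
        v = 1;
        workerThread.join(); // returns, because the worker is guaranteed to see v == 1
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 1000; i++) {
            new VolatileTermination().runTest();
        }
        System.out.println("all runs terminated");
    }
}
```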

The volatile field modifier is not the only instruction which gives such visibility guarantees. For example, the synchronized statement and the classes in the package java.util.concurrent give the same guarantees. A good read to learn about techniques to avoid visibility bugs is the book Java Concurrency in Practice by Brian Goetz et al.
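As a sketch of these two alternatives, the stop flag can be guarded by synchronized methods or replaced by an AtomicBoolean from java.util.concurrent.atomic. The class StopFlags and its method names are hypothetical and exist only for this illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustration only: two alternatives to a volatile stop flag.
class StopFlags {

    // Variant 1: every access to the plain field goes through the same lock.
    static class SynchronizedFlag {
        private boolean stopped;

        synchronized void stop() { stopped = true; }

        synchronized boolean isStopped() { return stopped; }
    }

    static boolean runWithSynchronized() throws InterruptedException {
        SynchronizedFlag flag = new SynchronizedFlag();
        Thread worker = new Thread(() -> {
            while (!flag.isStopped()) { /* spin */ }
        });
        worker.start();
        flag.stop();   // releasing the lock makes the write visible to later lock holders
        worker.join();
        return flag.isStopped();
    }

    // Variant 2: AtomicBoolean.get() and set() have the memory effects of volatile access.
    static boolean runWithAtomic() throws InterruptedException {
        AtomicBoolean stopped = new AtomicBoolean(false);
        Thread worker = new Thread(() -> {
            while (!stopped.get()) { /* spin */ }
        });
        worker.start();
        stopped.set(true);
        worker.join();
        return stopped.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("synchronized variant terminated: " + runWithSynchronized());
        System.out.println("atomic variant terminated: " + runWithAtomic());
    }
}
```

The synchronized variant only works if both the reading and the writing method use the same lock; synchronizing just the writer gives no guarantee to the reader.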

After seeing why visibility bugs happen and how to reproduce and avoid them, let us look at how to find them.

How to find visibility bugs?

Chapter 17, "Threads and Locks", of the Java Language Specification formally defines the visibility guarantees of the Java instructions. The specification uses a so-called happens-before relation to define these guarantees:

Two actions can be ordered by a happens-before relationship. If one action happens-before another, then the first is visible to and ordered before the second.

Reading from and writing to a volatile field creates such a happens-before relation:

A write to a volatile field (§8.3.1.4) happens-before every subsequent read of that field.
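Combined with the program order inside each thread, this rule also covers plain fields written before the volatile write. The following minimal sketch illustrates that; the class Publication and the field names data and ready are assumptions made for this example.

```java
// Illustration only: a plain field published through a volatile flag.
class Publication {
    private int data;               // plain field
    private volatile boolean ready; // volatile flag

    int runTest() throws InterruptedException {
        int[] seen = new int[1];
        Thread reader = new Thread(() -> {
            while (!ready) {
                // spin until the volatile write becomes visible
            }
            // The read of ready saw true, so the earlier write to data is visible too.
            seen[0] = data;
        });
        reader.start();
        data = 42;     // plain write, ordered before the volatile write below
        ready = true;  // volatile write, happens-before the read that sees true
        reader.join(); // join makes the reader's write to seen[0] visible here
        return seen[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(new Publication().runTest()); // prints 42
    }
}
```

If the two writes were swapped, so that ready is set before data, the reader could observe the flag and still read the old value of data.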

Using this specification we can check if a program contains visibility bugs, which the specification calls data races:

When a program contains two conflicting accesses (§17.4.1) that are not ordered by a happens-before relationship, it is said to contain a data race. Two accesses to (reads of or writes to) the same variable are said to be conflicting if at least one of the accesses is a write.

Looking at our example we see that there is no happens-before relation between the read and the write to the shared variable v. So this example contains a data race according to the specification.
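For contrast, not every access to a shared plain field is a data race. The same chapter of the specification states that a call to start() on a thread happens-before any action in the started thread, and that all actions in a thread happen-before another thread successfully returns from join() on that thread. The following sketch, with the hypothetical class StartJoinOrdering, uses only these two rules:

```java
// Illustration only: plain fields ordered by Thread.start() and Thread.join().
class StartJoinOrdering {
    private int input;  // plain field
    private int result; // plain field

    int runTest() throws InterruptedException {
        input = 20;    // written before start(), therefore visible to the worker
        Thread worker = new Thread(() -> result = input + 1);
        worker.start();
        worker.join(); // the worker's write to result happens-before join() returns
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(new StartJoinOrdering().runTest()); // prints 21
    }
}
```

In the Termination example the conflicting write to v comes after start() and the conflicting read comes before join() returns, so neither rule orders the two accesses.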

Of course, this reasoning can be automated. The following two tools use these rules to automatically detect visibility bugs:
