Immutable Trees and Threading Evil – Part 2

In part 1 I introduced immutable trees, this time I will talk about threading evil. If you haven’t yet read part 1, go back and do so now.

We will start by recapping the argument showing that there cannot be a loop. This argument went as follows:

The RandomTreeBuilder and the random behavior of NodePool conspire to create an unpredictable mess, but we do know at least this much:

  1. The TreeNodes are immutable. The left and right children are passed in to the constructor, and they can never be modified later (because they are private and have only getters). So if there is no loop when the node is first created, there never will be.
  2. Since the left and right nodes are passed into the constructor, they must have been created before the new parent node. They cannot be created after or at the same time or they wouldn’t be available to pass in to the constructor.
  3. “Created before” is transitive: if all children must be created before their parent, then every node must be created before every ancestor.
  4. No node can be created before itself, since that creation happened just once. But if there were a loop, then a node is an ancestor of itself. Thus, there cannot be any loops.

It sounds very plausible. The fallacy in the argument is in step 2 when it concludes that he child nodes must have been created before the parent. With threads in the mix, it is possible that the child nodes weren’t create before, after OR at the same time: threading allows incomparable events that are none of these. The official Java memory model (other languages behave the same, but I know the Java one in detail so that’s what I’m describing) states that “happens before” is a “partial relation”: certain events “happen before” others but the some are just incomparable. It seems like a very odd thing to permit, so I will start by explaining the reason.

Suppose that you have the following snippet of code:

{
    int sum = a + b;      // Line 1
    int product = a * b;  // Line 2
    setSum(x);            // Line 3
    setProduct(y);        // Line 4
}

If the compiler were rather simpleminded then the low-level code for this could reads something like this:

# -- Line 1 --
1. Load variable "a" into register 1.
2. Load variable "b" into register 2.
3. Add register 1 and register 2; store in register 3.
4. Write register 3 to memory location for "sum".
# -- Line 2 --
5. Load variable "a" into register 1.
6. Load variable "b" into register 2.
7. Multiply register 1 and register 2; store in register 3.
8. Write register 3 to memory location for "product".
# -- Line 3 --
9. Load variable "x" into register 1.
10. Invoke "setSum" with argument from register 1.
# -- Line 4 --
11. Load variable "y" into register 1.
12. Invoke "setProduct" with argument from register 1.

[NOTE: this example is made up—real assembly code doesn't look like this. But the basic idea is accurate.]

If we assume that add and multiply take 1 clock cycle, while memory read or write takes 100 clock cycles then this takes 802 clock cycles plus the time for the subroutines. Now imagine an optimizing compiler that does this instead:

1. Load variable "a" into register 1.
2. Load variable "b" into register 2.
3. Add register 1 and register 2; store in register 3.
4. Invoke "setSum" with argument from register 3.
5. Multiply register 1 and register 2; store in register 3.
6. Invoke "setProduct" with argument from register 3.

That version takes 202 clock cycles (plus the time for the subroutines). It also used less memory (no need to store temporary variables x and y). If a and b had already been in registers, it might have taken only _2_ clock cycles!

There are enormous performance gains if the compiler-writers are allowed to reorder statements to perform efficiently. They take into account registers, levels of caching, and other factors when doing so. In the above example it is not exactly clear whether “line 3″ occurred before or after “line 4″, but the end result is guaranteed to be the same as if it HAD occurred in order.

So let’s imagine that an optimizing compiler is reordering the statements in the run() method of the RandomTreeBuilder class. This time we won’t try to reason out why it might chose one order or the other, we’ll accept that it picks the ordering it thinks will go fastest. Here is the original code:

for (int i=0; i<NUM_NODES_PER_THREAD; i++) {
    TreeNode leftChild = nodePool.takeRandomNode();
    TreeNode rightChild = nodePool.takeRandomNode();
    nodePool.putNode(new TreeNode(leftChild, rightChild));
}

Here is a straightforward same-as-the-text ordering:

1. Invoke nodePool.takeRandomNode(); store in register 1.
2. Invoke nodePool.takeRandomNode(); store in register 2.
3. Allocate memory for new TreeNode(); store address in register 3.
4. Invoke TreeNode.<init>() with argument registers 3, 1 and 2.
5. Invoke nodePool.putNode() with argument from register 3.

And here is another possible ordering:

1. Allocate memory for new TreeNode(); store address in register 3.
2. Invoke nodePool.putNode() with argument from register 3.
3. Invoke nodePool.takeRandomNode(); store in register 1.
4. Invoke nodePool.takeRandomNode(); store in register 2.
5. Invoke TreeNode.<init>() with argument registers 3, 1 and 2.

Perhaps they were swapped because TreeNode.<init>() and nodePool.takeRandomNode() both clobber register 3 so it’s more efficient to call nodePool.putNode() first. It doesn’t really matter what the reason is, just that the JVM is entitled to order it this way if it wants to.

But now imagine that there are two different threads doing this simultaneously. Thread A executes steps 1 and 2, storing its new node in slot 23 of the NodePool. Simultaneously, thread B executes steps 1 and 2 storing its new node in slot 75. Moments later thread A executes step 4 and happens to pick slot 75, while thread B executes step 4 and happens to pick slot 23. We now have created two nodes, each of which is a child of the other.

And THIS is why threads are “evil”. The behavior underneath the covers is complicated in ways that are not intuitive. If you stick to some well-defined practices (eg: if NodePool were synchronized using locks) then they will work just fine, but if you don’t then you can get very strange behavior which is nearly impossible to debug.

Before I wrap up, there’s something I mentioned early in part 1 that I want to come back to. Immutable objects like the TreeNode we designed are extremely handy things to use, particularly in threaded programs because an immutable object is completely threadsafe. The problem I described here comes because the “immutable object” isn’t really immutable: in this case another thread sees the partially-constructed value before all of the fields are set. Well, the Java memory model contains a special case rule to make it easier to use immutable objects. This rules says that (unless the programmer passes the partially-constructed object to another thread from within the constructor) any fields which are declared to be “final” will be seen by other threads as set before the constructor exits. That rule would have prohibited the reordering that caused our loop.

If we declare the “left” and “right” fields to be “final”, then loops in the tree become impossible just as we originally wanted. Yes, true immutability triumphs over evil! err…. I mean over threads.

Comments

4 Responses to “Immutable Trees and Threading Evil – Part 2”

  1. Brian McKeever on 2007-12-19 8:19 pm

    If your explanation were correct, you don’t need multiple threads to encounter the problem, do you? Your “other possible ordering” with just a single thread could result in a node being its own parent (if one of the gets uses the same index as the put), right?

    Supposing you rewrote the code to be a single statement:
    nodePool.putNode(new TreeNode(nodePool.takeRandomNode(), nodePool.takeRandomNode()));

    How would you reconcile the possibility of cyclic trees with the strict order of evaluation that Java applies to its statements?

    Did your code ever print “We Found A Loop!!”?

  2. mcherm on 2007-12-19 9:21 pm

    Brian:

    (1) No, it couldn’t happen with a single thread, because the Java Memory Model (which governs what kind of reordering the compiler can do) specifically states that any re-orderings must not be visible to the SAME thread, but allows them to be visible to OTHER threads.

    (2) Whether it’s one line or multiple lines doesn’t change things at all. In the end, the compiler is optimizing pretty much the same code. (All optimizing compilers are smart enough to realize when a variable can’t be seen outside of the method.)

    (3) I’m not quite sure what you mean by “strict order of evaluation”.

    (4) No, my code never printed that. I have two reasons for that. First, I’m not using both processors (when running it uses only 50% CPU) and these things won’t typically show up when running in time slices on one CPU because the crucial “two things happening at about the same time” isn’t occurring. Secondly, this example isn’t particularly realistic: the optimization I describe is *allowed* but is probably not *sensible* in this case. But similar (and equally puzzling) things DO occur in real programs.

  3. Brian McKeever on 2007-12-20 2:10 am

    I agree that modifying shared resources without locking can lead to trouble. But what I was getting at was that if the instruction reordering you suggested were allowed, then the problem would be visible from within a single thread (if put and take used the same index). Right?

    And as you say, the model doesn’t allow this, so it follows that the instruction reordering is not allowed. That was the contradiction I was trying to point out.

  4. David on 2008-01-24 2:27 pm

    I don’t know know enough about Java to comment on your post from that perspective. However, since the post was inspired from another post about .NET, I thought I would point out that what you described is impossible in .NET. Your scenario relies on the fact that once the memory for an object has been allocated, other objects can reference that object, even if it has not been constructed yet. In .NET this is not possible. There can never be an optimization such that an object is referenced before it is fully constructed. So this would never be a problem, and Eric’s original statement on his blog post holds.

Leave a Reply