Why can't the load part of an atomic RMW instruction pass an earlier store to an unrelated location in the TSO (x86) memory consistency model?


It's known that the x86 architecture doesn't implement the sequential consistency memory model because it uses write buffers, so store->load reordering can take place (later loads can complete while earlier stores still sit in the write buffer, waiting to commit to the L1 cache).

In A Primer on Memory Consistency and Coherence we can read about read-modify-write (RMW) operations in the Total Store Order (TSO) memory consistency model (which is supposed to be very similar to the x86 model):

... we consider the RMW as a load immediately followed by a store. The load part of the RMW cannot pass earlier loads due to TSO’s ordering rules. It might at first appear that the load part of the RMW could pass earlier stores in the write buffer, but this is not legal. If the load part of the RMW passes an earlier store, then the store part of the RMW would also have to pass the earlier store because the RMW is an atomic pair. But because stores are not allowed to pass each other in TSO, the load part of the RMW cannot pass an earlier store either.
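The quoted argument can be checked mechanically. The sketch below is a toy, not the Primer's formalism: it assumes that "atomic pair" means the load and store of the RMW must sit next to each other in the global memory order, whatever addresses are involved. The event names `S1`, `RMW_load` and `RMW_store` are made up for illustration. It enumerates every ordering of the three memory events and keeps those that obey both rules.

```python
from itertools import permutations

# Hypothetical event names: S1 is the earlier plain store; the RMW is split
# into its load part and its store part.
EVENTS = ("S1", "RMW_load", "RMW_store")

def legal_orders():
    """Return the global memory orders allowed by the two assumed rules."""
    legal = []
    for order in permutations(EVENTS):
        pos = {event: i for i, event in enumerate(order)}
        # TSO rule: stores from one core enter the memory order in program order.
        stores_in_order = pos["S1"] < pos["RMW_store"]
        # Assumed atomicity rule: nothing sits between the RMW's load and store.
        rmw_adjacent = pos["RMW_store"] == pos["RMW_load"] + 1
        if stores_in_order and rmw_adjacent:
            legal.append(order)
    return legal

print(legal_orders())
```

Under these two assumptions the only order that survives is S1, RMW_load, RMW_store, so every order in which the load part comes before S1 is rejected.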

OK, an atomic operation must be atomic, i.e. the memory location accessed by the RMW can't be accessed by other threads/cores during the RMW operation. But what if the earlier store that the load part of the atomic operation passes is unrelated to the memory location accessed by the RMW? Assume we have the following pair of instructions (in pseudocode):

store an int32 value to location 0x00000000
atomically increment the int32 value at location 0x10000000

The first store is added to the write buffer and waits for its turn. Meanwhile, the atomic operation loads the value from another location (even in another cache line), passing the first store, and adds its store to the write buffer right after the first one. In the global memory order we'll see the following:

load (part of atomic) -> store (ordinary) -> store (part of atomic)

Yes, maybe it's not the best solution from a performance point of view, since we need to hold the cache line for the atomic operation in read-write state until all preceding stores from the write buffer are committed, but, performance considerations aside, are there any violations of the TSO memory consistency model if we allow the load part of an RMW operation to pass earlier stores to unrelated locations?

asked on Stack Overflow Mar 15, 2017 by undermind • edited Dec 2, 2019 by Peter Cordes

2 Answers


You could ask the same question about any store + load pair to different addresses: the load may be executed internally earlier than the older store due to out-of-order execution. On x86 this would be allowed, because:

Loads may be reordered with older stores to different locations but not with older stores to the same location

(source: Intel 64 Architecture Memory Ordering White Paper)

However, in your example, the lock prefix would prevent that, because (from the same set of rules):

Locked instructions have a total order

This means that the lock prefix would enforce a memory barrier, like an mfence (and indeed some compilers use a locked operation as a fence). This will usually make the CPU hold back the load until the store buffer has drained, forcing the earlier store to commit first.
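To make the effect of draining concrete, here is a minimal sketch of a toy TSO machine, not a description of real hardware. It assumes one FIFO store buffer per core, and it assumes that an RMW only executes once its own core's buffer is empty, at which point it reads and writes memory in one step. The names `tso_outcomes`, `rmw_inc` and the register names are invented for this example. The function explores every interleaving of a small program and collects the final register values.

```python
def freeze(d):
    """Turn a dict into a hashable, order-independent tuple."""
    return tuple(sorted(d.items()))

def tso_outcomes(program):
    """Explore every interleaving of `program` on a toy TSO machine.

    `program` holds one instruction list per core. Instructions are
    ("store", addr, value), ("load", addr, reg) or ("rmw_inc", addr, reg),
    where `reg` receives the value that was read. Memory starts at 0.
    Returns the set of reachable final register states.
    """
    n = len(program)
    start = ((0,) * n, ((),) * n, (), ())  # pcs, store buffers, memory, registers
    seen, stack, finals = {start}, [start], set()
    while stack:
        pcs, bufs, mem_t, regs_t = stack.pop()
        mem, regs = dict(mem_t), dict(regs_t)
        nxt = []
        for t in range(n):
            if bufs[t]:
                # The oldest buffered store of core t commits to memory (FIFO).
                (addr, val), rest = bufs[t][0], bufs[t][1:]
                nxt.append((pcs, bufs[:t] + (rest,) + bufs[t + 1:],
                            freeze({**mem, addr: val}), regs_t))
            if pcs[t] == len(program[t]):
                continue
            op, addr, arg = program[t][pcs[t]]
            new_pcs = pcs[:t] + (pcs[t] + 1,) + pcs[t + 1:]
            if op == "store":
                # Stores go into the core's private buffer, not straight to memory.
                new_buf = bufs[t] + ((addr, arg),)
                nxt.append((new_pcs, bufs[:t] + (new_buf,) + bufs[t + 1:],
                            mem_t, regs_t))
            elif op == "load":
                # Forward from the newest matching buffered store, else read memory.
                pending = [v for a, v in bufs[t] if a == addr]
                val = pending[-1] if pending else mem.get(addr, 0)
                nxt.append((new_pcs, bufs, mem_t, freeze({**regs, arg: val})))
            elif op == "rmw_inc" and not bufs[t]:
                # Assumption: the RMW waits for an empty buffer, then reads and
                # writes memory in one indivisible step.
                old = mem.get(addr, 0)
                nxt.append((new_pcs, bufs, freeze({**mem, addr: old + 1}),
                            freeze({**regs, arg: old})))
        if not nxt:
            finals.add(regs_t)  # every core finished and every buffer is empty
        for state in nxt:
            if state not in seen:
                seen.add(state)
                stack.append(state)
    return finals

# Store-buffering test: each core stores to one location, then reads the other.
plain = [[("store", "x", 1), ("load", "y", "r0")],
         [("store", "y", 1), ("load", "x", "r1")]]
locked = [[("store", "x", 1), ("rmw_inc", "y", "r0")],
          [("store", "y", 1), ("rmw_inc", "x", "r1")]]

both_zero = (("r0", 0), ("r1", 0))
print(both_zero in tso_outcomes(plain))   # can plain loads pass the buffered stores?
print(both_zero in tso_outcomes(locked))  # can the RMW reads do the same?
```

In this model the plain-load version can end with both registers at 0, because each load runs while the other core's store is still buffered. The RMW version cannot, because each core's store has reached memory before its RMW reads.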

answered on Stack Overflow Mar 16, 2017 by Leeor

since we need to hold the cache line for the atomic operation in read-write state until all preceding stores from the write buffer are committed, but, performance considerations aside

If you hold a lock L while performing operations S of the same kind as those L prevents, that is, there exists an S' that can be blocked (delayed) by L while S can be blocked (delayed) by another lock L', then you have a recipe for deadlock, unless you are guaranteed to be the only actor doing so (which would make the whole atomic thing pointless).
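One possible reading of this argument, sketched as a wait-for graph: each core keeps a cache line locked for its pending RMW while its oldest buffered store still needs a line that another core may have locked. The function name `has_deadlock`, the core names and the line names are hypothetical, and the model ignores everything a real coherence protocol does to break such cycles.

```python
def has_deadlock(holds, waits_for):
    """Detect a cycle in the wait-for graph.

    holds:     core -> cache line it keeps locked for its pending RMW
    waits_for: core -> cache line its oldest buffered store needs
    """
    owner = {line: core for core, line in holds.items()}
    for start in holds:
        seen, core = set(), start
        while core is not None and core not in seen:
            seen.add(core)
            # Follow the edge from this core to whoever holds the line it needs.
            blocker = owner.get(waits_for.get(core))
            # A core is not blocked by a line it holds itself.
            core = None if blocker == core else blocker
        if core is not None:
            return True  # the walk returned to a core already on the path
    return False

# Core A: store to X, then RMW on Y. Core B: store to Y, then RMW on X.
# If each RMW load passes the buffered store, each core locks its RMW line first.
print(has_deadlock(holds={"A": "Y", "B": "X"}, waits_for={"A": "X", "B": "Y"}))

# If stores drain before the RMW takes its line, nobody holds a line while waiting.
print(has_deadlock(holds={}, waits_for={"A": "X", "B": "Y"}))
```

The first call reports a cycle (A waits for B, B waits for A). The second does not, because no line is held while a store is still pending.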

answered on Stack Overflow Dec 2, 2019 by curiousguy

User contributions licensed under CC BY-SA 3.0