Centralized fair RW spin lock


In the 1991 paper "Scalable Reader-Writer Synchronization for Shared-Memory Multiprocessors" by John M. Mellor-Crummey and Michael L. Scott, there is a fair RW lock based on the ticket lock algorithm (https://www.cs.rochester.edu/u/scott/papers/1991_PPoPP_read_write.pdf).

In pseudocode it looks like this (http://www.cs.rochester.edu/research/synchronization/pseudocode/rw.html):

type counter = unsigned integer
  // layout of counter
  //  31    ...   16 15    ...     0        
  // +------------------------------+
  // | reader count | writer count  |
  // +------------------------------+

const RC_INCR = 0x10000 // to adjust reader count
const WC_INCR = 0x1     // to adjust writer count
const W_MASK  = 0xffff  // to extract writer count

// mask bit for top of each count
const WC_TOPMSK = 0x8000      
const RC_TOPMSK = 0x80000000 

type lock = record
    requests : counter := 0
    completions : counter := 0

procedure start_write (L : ^lock)
    counter prev_processes := 
        fetch_and_clear_then_add (&L->requests, WC_TOPMSK, WC_INCR)
    repeat until L->completions = prev_processes

procedure start_read (L : ^lock)
    counter prev_writers := 
        fetch_and_clear_then_add (&L->requests, RC_TOPMSK, RC_INCR) & W_MASK
    repeat until (L->completions & W_MASK) = prev_writers

procedure end_write (L: ^lock)
    clear_then_add (&L->completions, WC_TOPMSK, WC_INCR)

procedure end_read (L : ^lock)
    clear_then_add (&L->completions, RC_TOPMSK, RC_INCR)

It uses the fetch_and_clear_then_add and clear_then_add primitives, which are not implemented in modern hardware. One could, of course, emulate them with a software CAS loop, as sketched below.
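For instance, a minimal sketch of such an emulation (the function name and signature are mine, chosen to mirror the pseudocode; clear_then_add would be the same minus the return value):

#include <atomic>
#include <cstdint>

// Atomically clear the bits in clear_mask, add addend, and return the
// previous value. Assumes a 32-bit counter word as in the pseudocode above.
std::uint32_t fetch_and_clear_then_add(std::atomic<std::uint32_t> &word,
                                       std::uint32_t clear_mask,
                                       std::uint32_t addend)
{
    std::uint32_t old = word.load(std::memory_order_relaxed);
    // compare_exchange_weak reloads 'old' on failure, so we just retry.
    while (!word.compare_exchange_weak(old, (old & ~clear_mask) + addend))
        ;
    return old;
}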

However, it occurred to me that the reasoning for using these primitives (preventing overflow) stems from the intuition that the reader and writer counters must be kept separate and precise. As it happens, that intuition might be wrong. I believe the overflow-prevention logic can be scrapped without any harm to the underlying algorithm.

I tried to google for similar results but have found nothing so far. Perhaps I'm missing something.

So let me explain my idea in more detail. I think that the standard fetch_add instruction will work just fine with this algorithm.

  1. The reader count is contained in the upper half of the word. If fetch_add(0x10000) overflows, the extra bit is silently dropped at the word boundary. No problem here.

  2. The writer count is contained in the lower half of the word. If fetch_add(1) overflows, the extra bit spills over into the reader count. Oops. One out of every 2^16 writers offends our rules and corrupts the reader count. But wait, is this really a problem? What do we have after that? Let's see. In the requests word we have the following transition:

    ( 0xYYYY, 0xFFFF ) -> ( 0xYYYY + 1, 0x0000 ) // 0xYYYY stands for an unknown reader count, 0xFFFF is the writer count before the overflow

At the same time the completions word contains a value less than or equal to ( 0xYYYY, 0xFFFF ). The 'offender' thread starts polling the completions word until it becomes exactly equal to the required value. None of the readers or writers that started before it can produce a confusing value in the completions word: as they progress and unlock, they invariably drive it toward the required value. After that the 'offender' thread proceeds.

On the other hand, no thread that arrives later will see a confusing value either. Writers increment the lower half of the word, readers increment the upper half, so they remain ordered against each other and are unable to continue until the 'offender' thread does its work and unlocks.

The unlock done by the 'offender' fully compensates for its offense: it causes a symmetrical overflow of the writer count in the completions word and spills an extra 1 into the reader count there. So the next arriving thread starts exactly on schedule, as it has been expecting exactly that 'miscalculated' value. The whole thing works without a hitch.

So the net result is that once every 2^16 write locks the reader count becomes skewed by 1. Everything else looks fine.
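Here is a tiny worked example of that compensation (the concrete values are hypothetical; requests/completions correspond to tail_/head_ in the implementation further below):

#include <cstdint>
#include <cstdio>

int main()
{
    // 5 readers and 0xFFFF writers have requested; all have completed.
    std::uint32_t requests    = 0x0005FFFFu;
    std::uint32_t completions = 0x0005FFFFu;

    std::uint32_t ticket = requests;  // the 'offender' writer takes its ticket
    requests += 1;                    // fetch_add(1) overflows the writer half:
                                      // requests == 0x00060000, reader count skewed by 1

    std::printf("offender may enter: %d\n", completions == ticket);  // prints 1

    completions += 1;                 // its unlock overflows symmetrically:
                                      // completions == 0x00060000 again
    std::printf("words agree again: %d\n", requests == completions); // prints 1
}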

An easy informal illustration of this is the situation where no read locks are taken at all. In this case the RW lock with standard fetch_add() instructions behaves exactly like an ordinary ticket lock. The reader count steadily grows by one every 2^16 writer locks, but that is a distinction we make only in our minds; at the machine level it is just an increment of a 32-bit word. There is no difference at this point from a plain ticket lock.
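For comparison, here is a plain ticket spin lock (a minimal sketch of the textbook algorithm, not code from the paper); with no readers present, the shared_ticket_lock below performs exactly these operations on its 32-bit words:

#include <atomic>
#include <cstdint>

class ticket_lock
{
public:
    void lock() noexcept
    {
        // Draw the next ticket and wait for it to be served.
        std::uint32_t ticket = tail_.fetch_add(1, std::memory_order_relaxed);
        while (head_.load(std::memory_order_acquire) != ticket)
            ;  // spin
    }

    void unlock() noexcept
    {
        // Only the lock holder writes head_, so a load/store pair suffices.
        head_.store(head_.load(std::memory_order_relaxed) + 1,
                    std::memory_order_release);
    }

private:
    std::atomic<std::uint32_t> head_{0};  // tickets served
    std::atomic<std::uint32_t> tail_{0};  // tickets issued
};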

Now add readers to the picture. I cannot come up with any schedule in which arriving readers, incrementing just the upper half of the word, spoil the picture. From the point of view of a writer, the arrival of a reader looks like the sudden arrival of 2^16 writers at once. From this point of view it still works as an ordinary ticket lock.

From the point of view of a reader things look different. Readers look only at the writer count and do not care about overflows into the upper half of the word at all. When a reader arrives it remembers the last writer that arrived before it and fends off upcoming writers with its massive 2^16 increment. It then waits until the remembered writer finishes and proceeds with its own work. When it is done it makes another massive 2^16 increment, so any waiting writer concludes that a whole batch of 2^16 concurrent writers has now finished.
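A tiny illustration of the reader's masked wait (the values are hypothetical; the mask matches exclusive_mask in the implementation below):

#include <cstdint>
#include <cstdio>

int main()
{
    // The arriving reader's ticket says 7 writers requested before it; it may
    // enter once 7 writers have completed, no matter how the reader half of
    // the completions word has moved in the meantime.
    std::uint32_t tail = 0x00030007u;  // requests snapshot taken by the reader
    std::uint32_t head = 0x00020007u;  // completions word some time later
    std::printf("reader may enter: %d\n",
                (tail & 0xFFFFu) == (head & 0xFFFFu));  // prints 1
}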

Please correct me if I'm wrong with my reasoning.

A C++ implementation of this idea looks as follows:

#include <atomic>
#include <cstdint>

// Minimal stand-in for the non_copyable base (assumed; not shown in the
// original post).
struct non_copyable
{
    non_copyable() = default;
    non_copyable(const non_copyable &) = delete;
    non_copyable &operator=(const non_copyable &) = delete;
};

class shared_ticket_lock : non_copyable
{
public:
    void lock() noexcept
    {
        // Take a writer ticket: bump the writer count in the lower half.
        base_type tail = tail_.fetch_add(exclusive_step, std::memory_order_relaxed);
        // Wait until every earlier reader and writer has completed.
        for (;;) {
            base_type head = head_.load(std::memory_order_acquire);
            if (tail == head)
                break;
        }
    }

    void unlock() noexcept
    {
        // Only the lock holder advances head_ here, so a plain
        // load/store pair suffices.
        base_type head = head_.load(std::memory_order_relaxed);
        head_.store(head + exclusive_step, std::memory_order_release);
    }

    void lock_shared() noexcept
    {
        // Take a reader ticket: bump the reader count in the upper half.
        base_type tail = tail_.fetch_add(shared_step, std::memory_order_relaxed);
        // Wait until every earlier writer has completed; other readers
        // are ignored thanks to the mask.
        for (tail &= exclusive_mask;;) {
            base_type head = head_.load(std::memory_order_acquire);
            if (tail == (head & exclusive_mask))
                break;
        }
    }

    void unlock_shared() noexcept
    {
        // Readers may unlock concurrently, hence the atomic increment.
        head_.fetch_add(shared_step, std::memory_order_release);
    }

private:
    using base_type = std::uint32_t;
    static constexpr base_type shared_step = 1 << 16;     // reader increment
    static constexpr base_type exclusive_mask = shared_step - 1;
    static constexpr base_type exclusive_step = 1;        // writer increment

    std::atomic<base_type> head_ = ATOMIC_VAR_INIT(0);    // completions
    std::atomic<base_type> tail_ = ATOMIC_VAR_INIT(0);    // requests
};
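A minimal usage sketch (this little stress test is mine, not part of the original code): a few writer threads mutate a shared value under lock() while reader threads scan it under lock_shared():

#include <thread>
#include <vector>

shared_ticket_lock lk;
long shared_value = 0;

int main()
{
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
        threads.emplace_back([] {            // writers
            for (int j = 0; j < 100000; ++j) {
                lk.lock();
                ++shared_value;
                lk.unlock();
            }
        });
    for (int i = 0; i < 4; ++i)
        threads.emplace_back([] {            // readers
            volatile long sink = 0;
            for (int j = 0; j < 100000; ++j) {
                lk.lock_shared();
                sink = sink + shared_value;  // read under shared ownership
                lk.unlock_shared();
            }
        });
    for (auto &t : threads)
        t.join();
    // With correct exclusion, shared_value == 4 * 100000 here.
}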

NOTE: the original paper notes that it is possible to implement a fair RW lock with an ordinary fetch_and_add, but that the code would be slightly more complicated. It's not clear what the authors meant. I guess they had something different in mind (like maintaining the reader and writer counters in separate words), because as far as I can see this is absolutely NOT more complicated.

UPDATE: Working code and a simple test can be found here:

Tags: c++, multithreading, concurrency, locking
asked on Stack Overflow May 16, 2019 by Aleksey Demakov • edited May 17, 2019 by Aleksey Demakov



User contributions licensed under CC BY-SA 3.0