__atomic_test_and_set blocks entire program in AMP configuration with shared memory


I have a ZYBO Zynq 7000 (Dual-core ARM Cortex-A9) development board that I use in an Asymmetric MultiProcessing configuration: CPU0 runs Linux, and CPU1 runs a bare-metal application written in C++. I have configured a region of shared memory between the two CPUs as per this application note. I have reconfigured the cache on CPU1, and mmap'ed the shared memory in my Linux program, and this works fine: I can read and write to the shared memory from both applications.

The next thing I'm trying to achieve is to use atomic locks on the shared data. I'm using std::atomic_flag.
The problem I'm facing is that whenever I call test_and_set() on the std::atomic_flag (from either CPU), the entire program on that CPU hangs at that line.

This is the code for the lock I'm using:

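#include <atomic>
#include <cstddef>
#include <iostream>
#include <stdexcept>
#include <unistd.h> // usleep, useconds_t (POSIX)

class ScopedLock {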
  public:
    ScopedLock(volatile std::atomic_flag &lock) : lock{lock} {
        bool locked = true;
        for (size_t i = 0; i < NUM_RETRIES; ++i) {
            locked = lock.test_and_set(std::memory_order_acquire);
            std::cout << "locked = " << locked << std::endl;
            if (locked)
                usleep(WAIT_TIME);
            else
                break;
        }
        if (locked)
            throw std::runtime_error("Timeout: Could not acquire lock");
    }

    ~ScopedLock() {
        lock.clear(std::memory_order_release);
        std::cout << "released" << std::endl;
    }

  private:
    volatile std::atomic_flag &lock;
    constexpr static size_t NUM_RETRIES   = 10;
    constexpr static useconds_t WAIT_TIME = 50;
};

What am I doing wrong here?

I uploaded my test program and its main dependencies to GitHub:

It just increments a shared variable 1000× from each core. The desired result would be 2000, but as expected, the result is less than 2000 if I disable the lock. If I enable the lock, it seems to hang forever, so it never gets to incrementing the variable.

I commented out ScopedLock lock(test_lock); on lines 63 and 72 to disable the lock.

TL;DR of the code on GitHub:

#define atomic_flag32 std::atomic_flag __attribute__((aligned(4)))

struct TestStruct {
    mutable atomic_flag32 test_lock = ATOMIC_FLAG_INIT;
    uint32_t counter                = 0;

    void increment() volatile {
        {
            ScopedLock lock(test_lock);
            uint32_t tmp = counter;
            usleep(40);
            counter = tmp + 1;
        }
        usleep(10);
    }
};

// In bare-metal, I initialize the shared struct instance:
volatile TestStruct *sm = new ((void *) addressInSharedMem) TestStruct();
// On Linux, I use mmap, and I don't initialize the memory, I just use it immediately

for (size_t i = 0; i < 1'000; ++i)
    sm->increment();
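
As a host-side sanity check of the pattern above, here is a minimal sketch that runs the same lock-then-increment sequence on two ordinary threads of a single process, where the usual cache coherency and exclusive monitors are in place. The names HostCounter and run_two_workers are hypothetical and not part of the GitHub code; the sketch spins without a timeout, and it says nothing about the AMP setup itself. It only shows that the acquire/release logic is sound when test_and_set can complete.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Hypothetical stand-in for TestStruct: same lock + read-modify-write pattern,
// but living in ordinary process memory instead of the shared region.
struct HostCounter {
    std::atomic_flag lock = ATOMIC_FLAG_INIT;
    std::uint32_t value   = 0;

    void increment() {
        // Spin until the flag was previously clear (i.e. we acquired it).
        while (lock.test_and_set(std::memory_order_acquire))
            std::this_thread::yield();
        std::uint32_t tmp = value; // deliberately non-atomic read-modify-write
        value = tmp + 1;
        lock.clear(std::memory_order_release);
    }
};

// Runs `iterations` increments on each of two threads and returns the total.
std::uint32_t run_two_workers(std::uint32_t iterations) {
    HostCounter counter;
    auto work = [&counter, iterations] {
        for (std::uint32_t i = 0; i < iterations; ++i)
            counter.increment();
    };
    std::thread first(work);
    std::thread second(work);
    first.join();
    second.join();
    return counter.value;
}
```

Calling run_two_workers(1000) from main, for example inside an assert, should give 2000 on a coherent SMP host. If it does, the remaining suspects are the memory attributes of the shared region rather than the lock logic.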

On Linux, I'm using a GCC 8.3 toolchain that I built with crosstool-NG (arm-cortexa9_neon-linux-gnueabihf-g++). For the bare-metal core, I'm using GCC 8 (gcc-arm-none-eabi-8-2018-q4-major) from the ARM website. Googling this again makes me doubt whether it's the right compiler for the Cortex-A9, but it seems to work fine. The Xilinx Vivado SDK uses this toolchain to generate the boot image.
I don't know which flags or settings could be causing this problem, so if I should post more details about my compiler options etc., please let me know in a comment and I'll add them to my post.

Edit
After some more research, the problem seems much more involved than I thought.

From the Zynq-7000 Technical Reference Manual (p.144):

There are exclusive monitors in APU L1 cache, but not in the L2 level cache. This means the exclusive access address must either terminate in L1 cache or L3 memory, but not in L2.
To use the L1 exclusive monitor, the addressed MMU region must be set to be inner cacheable and inner cache write-back with write-allocate. This allows an address targeted by a particular exclusive access to always be allocated to L1 cache.
To use the L3 exclusive monitor, the access must not terminate at the APU L2 cache. From the ARM CPU perspective, this means the address must be shareable, normal and non-cacheable. Also, the L2 cache controller shared override option (bit 22 in the L2 auxiliary control register) must be set in the auxiliary control register. By default in the APU L2 cache controller, any non-cacheable shared reads are treated as cacheable non-allocatable, while non-cacheable shared writes are treated as cacheable write-through/no write-allocate. The L2 cache controller shared override option in the PL310 auxiliary control register overrides this behavior and prevents allocation into L2 cache.

This is my best attempt at mapping the shared region as shareable, normal, non-cacheable memory, but it still doesn't work.

void eagle_setup_ipc(void) {
    // Original: 0b100110111100010 = 0x04de2

    // Configuration of the Level 1 Page Table
    // =======================================
    //
    // See Figure 3-5 on p.78 of the Zynq-7000 Technical Reference Manual
    // https://www.xilinx.com/support/documentation/user_guides/ug585-Zynq-7000-TRM.pdf
    //
    // [31:20] → base address of section
    // [19]    0    → NS
    // [18]    0    → 1 MiB "sections"
    // [17]    0    → Global
    // [16]    1    → Shareable
    // [15]    0    → Access Permission [2]
    // [14:12] 100  → TEX → Normal memory, non-cacheable
    // [11:10] 11   → Access Permission [1:0] → Full Access
    // [9]     0
    // [8:5]   1111 → Domain
    // [4]     1    → Execute Never
    // [3:2]   00   → CB → non-cacheable
    // [1:0]   10   → 1 MiB "sections"

    eagle_SetTlbAttributes(0xFFFF0000, 0b1'0'100'11'0'1111'1'00'10);
}

void eagle_DCacheFlush(void) {
    Xil_L1DCacheFlush();
    //Xil_L2CacheFlush();
}

void eagle_SetTlbAttributes(u32 addr, u32 attrib) {
    u32 *ptr;
    u32 section;

    mtcp(XREG_CP15_INVAL_UTLB_UNLOCKED, 0);
    dsb();

    mtcp(XREG_CP15_INVAL_BRANCH_ARRAY, 0);
    dsb();
    eagle_DCacheFlush();

    section = addr / 0x100000;
    ptr = &MMUTable + section;
    *ptr = (addr & 0xFFF00000) | attrib;
    dsb();
}
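
To cross-check the attribute word passed to eagle_SetTlbAttributes above, the following sketch rebuilds it field by field from the bit layout listed in the comments, and repeats the section arithmetic done in the function. The helper names (section_attributes, section_index, section_base) are hypothetical and exist only for this check; the field positions are taken from the comments in eagle_setup_ipc, not independently from the TRM.

```cpp
#include <cstdint>

// Hypothetical helper: assembles the low bits of a 1 MiB section descriptor
// using the field positions from the comments in eagle_setup_ipc.
constexpr std::uint32_t section_attributes(bool shareable, std::uint32_t tex,
                                           std::uint32_t ap, std::uint32_t domain,
                                           bool execute_never, std::uint32_t cb) {
    return (static_cast<std::uint32_t>(shareable) << 16) // [16]    S
         | ((tex & 0x7u) << 12)                          // [14:12] TEX
         | ((ap & 0x3u) << 10)                           // [11:10] AP[1:0]
         | ((domain & 0xFu) << 5)                        // [8:5]   Domain
         | (static_cast<std::uint32_t>(execute_never) << 4) // [4]  XN
         | ((cb & 0x3u) << 2)                            // [3:2]   C, B
         | 0x2u;                                         // [1:0]   section
}

// Same arithmetic as in eagle_SetTlbAttributes.
constexpr std::uint32_t section_index(std::uint32_t addr) { return addr / 0x100000u; }
constexpr std::uint32_t section_base(std::uint32_t addr) { return addr & 0xFFF00000u; }

constexpr std::uint32_t kAttrib =
    section_attributes(true, 0b100, 0b11, 0b1111, true, 0b00);

// Matches the binary literal used in eagle_setup_ipc.
static_assert(kAttrib == 0b1'0'100'11'0'1111'1'00'10, "attribute word mismatch");
static_assert(kAttrib == 0x14DF2u, "attribute word mismatch (hex)");
// Compared with the original 0x04de2, only S (bit 16) and XN (bit 4) change.
static_assert(kAttrib == (0x04de2u | (1u << 16) | (1u << 4)), "unexpected delta");
// The entry written for 0xFFFF0000 covers the whole top 1 MiB section.
static_assert(section_index(0xFFFF0000u) == 0xFFFu, "section index");
static_assert(section_base(0xFFFF0000u) == 0xFFF00000u, "section base");
```

One consequence of the section arithmetic is that the descriptor applies to the full range 0xFFF00000-0xFFFFFFFF, not only to the 64 KiB starting at 0xFFFF0000.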

I also set bit 22 of the auxiliary control register in boot.S:

.set L2CCAuxControl,    0x72760000

This follows the suggestion in this thread. The addresses used there are in DDR memory (?) (0x00000000-0x3FFFFFFF), while mine are not (0xFFFF0000-0xFFFFFFFF).
My address range comes from the application note I mentioned earlier, but I don't understand how it's mapped, or why it was mapped this way.
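
As a small compile-time check of the L2CCAuxControl value above, this sketch confirms that bit 22 (the shared override option named in the TRM excerpt) is set in 0x72760000. The names kL2CCAuxControl, kSharedOverrideBit, and bit_is_set are hypothetical. It only validates the constant; it does not show that the register actually holds this value at run time on the target.

```cpp
#include <cstdint>

// Value assigned to L2CCAuxControl in boot.S (from the post above).
constexpr std::uint32_t kL2CCAuxControl = 0x72760000u;
// Bit position of the shared override option, per the quoted TRM passage.
constexpr unsigned kSharedOverrideBit = 22;

// Returns true if the given bit of value is 1.
constexpr bool bit_is_set(std::uint32_t value, unsigned bit) {
    return ((value >> bit) & 1u) != 0;
}

static_assert(bit_is_set(kL2CCAuxControl, kSharedOverrideBit),
              "shared override bit is not set in L2CCAuxControl");
```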

Tags: c++, arm, multiprocessing, shared-memory, atomic

asked on Stack Overflow May 3, 2019 by tttapa • edited May 4, 2019 by tttapa

0 Answers

Nobody has answered this question yet.

User contributions licensed under CC BY-SA 3.0