Latching across the processverse

It turns out that most of C++’s synchronization primitives are only intended for use across threads of the same process.

I tried using std::latch to synchronize multiple processes through a shared memory mapping, and it did not work correctly.

Here’s a simple replacement for std::latch on Linux, using atomics and futexes:

#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#include <atomic>

// A wrapper for the futex system call that only supports FUTEX_WAIT and
// FUTEX_WAKE, without timeouts.
long simple_futex(std::atomic<uint32_t>* uaddr, int futex_op, uint32_t val) {
  return syscall(SYS_futex, reinterpret_cast<uint32_t*>(uaddr), futex_op, val, NULL, NULL, 0);
}

// A class similar to std::latch that works across processes.
class Latch {
 public:
  Latch(size_t expected) : pending_(expected), done_(0) { assert(expected > 0); }

  void CountDown() {
    if (pending_.fetch_sub(1) == 1) {
      done_ = 1;
      long res = simple_futex(&done_, FUTEX_WAKE, INT_MAX);
      assert(res >= 0);
    }
  }

  void Wait() {
    while (done_ == 0) {
      long res = simple_futex(&done_, FUTEX_WAIT, 0);
      assert(res == 0 || (res == -1 && errno == EAGAIN));
    }
    assert(pending_ == 0);
  }

  Latch(const Latch&) = delete;
  Latch& operator=(const Latch&) = delete;

 private:
  std::atomic<size_t> pending_;
  alignas(4) std::atomic<uint32_t> done_;
};

The property we are looking for is address-freedom, which means that atomic operations on the object work regardless of where it is mapped and don’t depend on per-process state. Note that this also covers the case where the same object is mapped multiple times into the same process.

On mainstream platforms, std::atomic for integral types is lock-free (std::atomic&lt;T&gt;::is_always_lock_free is true), and lock-free atomic operations are expected to be address-free as well.

Why doesn’t std::latch work across processes?

There is no guarantee that std::latch works across processes; it depends on how it was implemented.

In libcxx it works, but in a degraded way. libcxx uses futexes with FUTEX_PRIVATE_FLAG, so for starters, the futex wait and wake operations are constrained to a single process.

It also uses a 2-second timeout for its futex waits, which makes the whole thing end up working, but very slowly: the waiters wake up after 2 seconds, re-check the latch counter, and only then learn that it’s OK to stop waiting.

Internally, libcxx tries to keep track of whether the latch is contended. It has a separate table, stored globally, in which it records whether someone is waiting to be woken up.

As this table is stored outside the shared memory region, it isn’t updated when the other process starts waiting, so the first process thinks that nobody is waiting to be notified and never even issues a FUTEX_WAKE operation.

libstdc++’s implementation is similar: its futexes are used without the private flag, but it does a similar trick to decide whether to issue a FUTEX_WAKE call, and it does not use timeouts.

© Marco Vanotti 2024
