folly::detail::distributed_mutex::DistributedMutex< Atomic, TimePublishing > Class Template Reference

#include <DistributedMutex.h>

Classes

class  DistributedMutexStateProxy
 

Public Member Functions

 DistributedMutex ()
 
 DistributedMutex (DistributedMutex &&)=delete
 
 DistributedMutex (const DistributedMutex &)=delete
 
DistributedMutex & operator= (DistributedMutex &&)=delete
 
DistributedMutex & operator= (const DistributedMutex &)=delete
 
DistributedMutexStateProxy lock ()
 
void unlock (DistributedMutexStateProxy)
 
DistributedMutexStateProxy try_lock ()
 
template<typename Rep , typename Period >
DistributedMutexStateProxy try_lock_for (const std::chrono::duration< Rep, Period > &duration)
 
template<typename Clock , typename Duration >
DistributedMutexStateProxy try_lock_until (const std::chrono::time_point< Clock, Duration > &deadline)
 

Private Attributes

Atomic< std::uintptr_t > state_ {0}
 

Detailed Description

template<template< typename > class Atomic = std::atomic, bool TimePublishing = true>
class folly::detail::distributed_mutex::DistributedMutex< Atomic, TimePublishing >

DistributedMutex is a small, exclusive-only mutex that distributes the bookkeeping required for mutual exclusion in the stacks of threads that are contending for it. It tries to come in at a lower space cost than std::mutex while still maintaining the fairness benefits that come from using std::mutex. DistributedMutex provides the entire API included in std::mutex, and more, with slight modifications. DistributedMutex is the same width as a single pointer (8 bytes on most platforms), whereas std::mutex and pthread_mutex_t are both 40 bytes. It is larger than some of the other small locks, but the wide majority of cases using the small locks waste the difference in alignment padding anyway
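As a quick check of the size claim, a static_assert along these lines should hold on typical 64-bit platforms (a minimal sketch; it assumes the public folly::DistributedMutex alias that folly exposes for the default template arguments):

  #include <cstdint>

  #include <folly/synchronization/DistributedMutex.h>

  // the entire centralized state is one pointer-length atomic word, so the
  // mutex itself is one word wide (8 bytes on typical 64-bit platforms)
  static_assert(
      sizeof(folly::DistributedMutex) == sizeof(std::uintptr_t),
      "DistributedMutex is expected to be exactly one word wide");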

Benchmark results are good - at the time of writing, in the common uncontended case it is 30% faster than some of the other small mutexes in folly and as fast as std::mutex, which recently optimized its uncontended path. In the contended case, it is about 4-5x faster than some of the smaller locks in folly, ~2x faster than std::mutex with clang and ~1.8x faster with gcc. DistributedMutex is also resistant to tail-latency pathologies, unlike many of the other small mutexes, which sleep for large time quanta to reduce spin churn; this causes elevated latencies for threads that enter the sleep cycle. The tail latency of lock acquisition is on average up to 10x better with DistributedMutex

DistributedMutex reduces cache line contention by making each thread wait on a thread-local spinlock and futex. This allows threads to keep working only on their own cache lines, without requiring cache coherence operations when a mutex is under heavy contention. This strategy does not require sequentially consistent ordering on the centralized atomic storage for wakeup operations, as each thread is assigned its own wait state

Non-timed mutex acquisitions are scheduled through intrusive LIFO contention chains. Each thread starts by spinning for a short quantum and falls back to two-phase sleeping. Enqueue operations are lock-free and are piggybacked off mutex acquisition attempts. The LIFO behavior of a contention chain is good in the case where the mutex is held for a short amount of time, as the head of the chain is likely to not have slept on futex() after exhausting its spin quantum. This allows us to avoid unnecessary traversal and syscalls in the fast path with a higher probability. Even though the contention chains are LIFO, the mutex itself does not adhere to that scheduling policy globally. During contention, threads that fail to lock the mutex form a LIFO chain on the central mutex state; this chain is broken when a wakeup is scheduled, and future enqueue operations form a new chain. This makes the chains themselves LIFO, but preserves global fairness through a constant factor which is limited to the number of concurrent failed mutex acquisition attempts. This bounds the last-in-first-out behavior by the number of contending threads and helps prevent starvation and latency outliers
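The lock-free enqueue can be illustrated in isolation. The sketch below is hypothetical and heavily simplified (no spinning, sleeping, or wakeup logic; the real wait node carries much more state and is cache line padded); it shows how a single atomic exchange both attempts the acquisition and pushes a stack-resident wait node onto the LIFO chain:

  #include <atomic>
  #include <cstdint>

  // hypothetical, simplified wait node; alignment guarantees the low bit
  // of its address is free to carry the locked flag
  struct WaitNode {
    std::uintptr_t next{0};
  };

  constexpr std::uintptr_t kUnlocked = 0;
  constexpr std::uintptr_t kLocked = 1;

  // a single exchange either acquires the mutex (previous == kUnlocked) or
  // atomically pushes `node` onto the LIFO contention chain
  bool lockOrEnqueue(std::atomic<std::uintptr_t>& state, WaitNode& node) {
    auto address = reinterpret_cast<std::uintptr_t>(&node) | kLocked;
    auto previous = state.exchange(address, std::memory_order_acq_rel);
    node.next = previous;          // previous head of the chain, or kUnlocked
    return previous == kUnlocked;  // true means we now own the mutex
  }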

This strategy of waking up waiters one by one in a queue does not scale well when the number of threads goes past the number of cores, at which point preemption causes elevated lock acquisition latencies. DistributedMutex implements a hardware timestamp publishing heuristic to detect and adapt to preemption.

DistributedMutex does not have the typical mutex API - it does not satisfy the Lockable concept. It requires the user to maintain ephemeral bookkeeping and pass that bookkeeping around to unlock() calls. The API overhead, however, comes for free when you wrap this mutex for usage with folly::Synchronized or std::unique_lock, which is the recommended usage (std::lock_guard, in optimized mode, has no performance benefit over std::unique_lock, so it has been omitted). A benefit of this API is that it disallows the incorrect usage where a thread unlocks a mutex that it does not own, on the assumption that a mutex is functionally identical to a binary semaphore, which, unlike a mutex, is a suitable primitive for that usage
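For example, a thin wrapper along these lines (a hypothetical sketch, not folly's actual adapter) stores the ephemeral proxy internally so the mutex can satisfy BasicLockable and be used with std::unique_lock:

  #include <optional>
  #include <utility>

  #include <folly/synchronization/DistributedMutex.h>

  // hypothetical adapter: keeps the proxy between lock() and unlock() so
  // the standard lock()/unlock() signatures can be exposed
  class LockableDistributedMutex {
   public:
    void lock() {
      proxy_.emplace(mutex_.lock());
    }
    void unlock() {
      mutex_.unlock(std::move(*proxy_));
      proxy_.reset();
    }

   private:
    folly::DistributedMutex mutex_;
    std::optional<folly::DistributedMutex::DistributedMutexStateProxy> proxy_;
  };

std::unique_lock<LockableDistributedMutex> then works as usual; the proxy never escapes the adapter, which preserves the property that only the owning critical section can unlock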

Timed locking through DistributedMutex is implemented through a centralized algorithm - all waiters wait on the central mutex state, by setting and resetting bits within the pointer-length word. Since pointer-length atomic integers are incompatible with futex(FUTEX_WAIT) on most systems, a non-standard implementation of futex() is used where wait queues are managed in user space. See p1135r0 and folly::ParkingLot

Definition at line 101 of file DistributedMutex.h.

Constructor & Destructor Documentation

template<template< typename > class Atomic, bool TimePublishing>
folly::detail::distributed_mutex::DistributedMutex< Atomic, TimePublishing >::DistributedMutex ( )

DistributedMutex is only default constructible; it can neither be moved nor copied

Definition at line 305 of file DistributedMutex-inl.h.

template<template< typename > class Atomic = std::atomic, bool TimePublishing = true>
folly::detail::distributed_mutex::DistributedMutex< Atomic, TimePublishing >::DistributedMutex ( DistributedMutex< Atomic, TimePublishing > && ) = delete
 
template<template< typename > class Atomic = std::atomic, bool TimePublishing = true>
folly::detail::distributed_mutex::DistributedMutex< Atomic, TimePublishing >::DistributedMutex ( const DistributedMutex< Atomic, TimePublishing > & ) = delete

Member Function Documentation

template<template< typename > class Atomic, bool TimePublishing>
DistributedMutex< Atomic, TimePublishing >::DistributedMutexStateProxy folly::detail::distributed_mutex::DistributedMutex< Atomic, TimePublishing >::lock ( )

Acquires the mutex in exclusive mode

This returns an ephemeral proxy that contains internal mutex state. This must be kept around for the duration of the critical section and passed subsequently to unlock() as an rvalue

The proxy has no public API and is intended to be for internal usage only

This is not a recursive mutex - trying to acquire the mutex twice from the same thread without unlocking it results in undefined behavior
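A minimal usage sketch (the instance name and surrounding scaffolding are hypothetical):

  #include <utility>

  #include <folly/synchronization/DistributedMutex.h>

  int main() {
    folly::DistributedMutex mutex;

    auto proxy = mutex.lock();       // acquire; keep the proxy alive for the critical section
    // ... critical section ...
    mutex.unlock(std::move(proxy));  // hand the proxy back as an rvalue
    return 0;
  }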

Definition at line 455 of file DistributedMutex-inl.h.

References folly::detail::distributed_mutex::extractAddress(), folly::detail::distributed_mutex::kAboutToWait, folly::detail::distributed_mutex::kLocked, folly::detail::distributed_mutex::kUninitialized, folly::detail::distributed_mutex::recordTimedWaiterAndClearTimedBit(), folly::detail::distributed_mutex::DistributedMutex< Atomic, TimePublishing >::state_, folly::detail::distributed_mutex::DistributedMutex< Atomic, TimePublishing >::try_lock(), and folly::detail::distributed_mutex::wait().

Referenced by burn().

455 {
456   // first try and acquire the lock as a fast path, the underlying
457   // implementation is slightly faster than using std::atomic::exchange() as
458   // is used in this function. So we get a small perf boost in the
459   // uncontended case
460   if (auto state = try_lock()) {
461     return state;
462   }
463 
464   auto previous = std::uintptr_t{0};
465   auto waitMode = kUninitialized;
466   auto nextWaitMode = kAboutToWait;
467   auto timedWaiter = false;
468   CachelinePadded<Waiter<Atomic>>* nextSleeper = nullptr;
469   while (true) {
470     // construct the state needed to wait
471     auto&& state = CachelinePadded<Waiter<Atomic>>{waitMode};
472     auto&& address = reinterpret_cast<std::uintptr_t>(&state);
473     DCHECK(!(address & 0b1));
474 
475     // set the locked bit in the address we will be persisting in the mutex
476     address |= kLocked;
477 
478     // attempt to acquire the mutex, mutex acquisition is successful if the
479     // previous value is zeroed out
480     //
481     // we use memory_order_acq_rel here because we want the read-modify-write
482     // operation to be both acquire and release. Acquire because if this is a
483     // successful lock acquisition, we want to acquire state any other thread
484     // has released from a prior unlock. We want release semantics because
485     // other threads that read the address of this value should see the full
486     // well-initialized node we are going to wait on if the mutex acquisition
487     // was unsuccessful
488     previous = state_.exchange(address, std::memory_order_acq_rel);
489     recordTimedWaiterAndClearTimedBit(timedWaiter, previous);
490     state->next_ = previous;
491     if (previous == kUnlocked) {
492       return {nullptr, address, timedWaiter, {}, nullptr, nextSleeper};
493     }
494     DCHECK(previous & kLocked);
495 
496     // wait until we get a signal from another thread, if this returns false,
497     // we got skipped and had probably been scheduled out, so try again
498     if (!wait(&state, (waitMode == kAboutToWait), nextSleeper)) {
499       std::swap(waitMode, nextWaitMode);
500       continue;
501     }
502 
503     // at this point it is safe to access the other fields in the waiter state,
504     // since the thread that woke us up is gone and nobody will be touching this
505     // state again, note that this requires memory ordering, and this is why we
506     // use memory_order_acquire (among other reasons) in the above wait
507     //
508     // first we see if the value we took off the mutex state was the thread that
509     // initiated the wakeups, if so, we are the terminal node of the current
510     // contention chain. If we are the terminal node, then we should expect to
511     // see a kLocked in the mutex state when we unlock, if we see that, we can
512     // commit the unlock to the centralized mutex state. If not, we need to
513     // continue wakeups
514     //
515     // a nice consequence of passing kLocked as the current address if we are
516     // the terminal node is that it naturally just works with the algorithm. If
517     // we get a contention chain when coming out of a contention chain, the tail
518     // of the new contention chain will have kLocked set as the previous, which,
519     // as it happens "just works", since we have now established a recursive
520     // relationship until broken
521     auto next = previous;
522     auto expected = address;
523     if (previous == state->wakerMetadata_.waker_) {
524       next = 0;
525       expected = kLocked;
526     }
527 
528     // if we are just coming out of a futex call, then it means that the next
529     // waiter we are responsible for is also a waiter waiting on a futex, so
530     // we return that list in the list of ready threads. We will be waking up
531     // the ready threads on unlock no matter what
532     return {extractAddress<CachelinePadded<Waiter<Atomic>>>(next),
533             expected,
534             timedWaiter,
535             state->wakerMetadata_,
536             extractAddress<CachelinePadded<Waiter<Atomic>>>(state->waiters_),
537             nextSleeper};
538   }
539 }
template<template< typename > class Atomic = std::atomic, bool TimePublishing = true>
DistributedMutex & folly::detail::distributed_mutex::DistributedMutex< Atomic, TimePublishing >::operator= ( DistributedMutex< Atomic, TimePublishing > && ) = delete
 
template<template< typename > class Atomic = std::atomic, bool TimePublishing = true>
DistributedMutex & folly::detail::distributed_mutex::DistributedMutex< Atomic, TimePublishing >::operator= ( const DistributedMutex< Atomic, TimePublishing > & ) = delete
template<template< typename > class Atomic, bool TimePublishing>
DistributedMutex< Atomic, TimePublishing >::DistributedMutexStateProxy folly::detail::distributed_mutex::DistributedMutex< Atomic, TimePublishing >::try_lock ( )

Try to acquire the mutex

A non-blocking version of the lock() function. The returned object is contextually convertible to bool, and has the value true when the mutex was successfully acquired, and false otherwise

This is allowed to fail spuriously, i.e. it is not guaranteed to return true even when the mutex is currently unlocked. In the event of a failed acquisition, this does not impose any memory ordering constraints for other threads
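Because the returned proxy is contextually convertible to bool, it composes naturally with an if statement (a sketch; the instance name is hypothetical):

  #include <utility>

  #include <folly/synchronization/DistributedMutex.h>

  folly::DistributedMutex mutex;

  void tryCriticalSection() {
    if (auto proxy = mutex.try_lock()) {
      // ... critical section ...
      mutex.unlock(std::move(proxy));
    }
    // on failure (possibly spurious) there is nothing to release and no
    // memory ordering is implied
  }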

Definition at line 766 of file DistributedMutex-inl.h.

References folly::atomic_fetch_set(), and folly::detail::distributed_mutex::DistributedMutex< Atomic, TimePublishing >::state_.

Referenced by folly::detail::distributed_mutex::DistributedMutex< Atomic, TimePublishing >::lock(), folly::detail::distributed_mutex::DistributedMutex< Atomic, TimePublishing >::try_lock_for(), and folly::detail::distributed_mutex::DistributedMutex< Atomic, TimePublishing >::try_lock_until().

766 {
767   // Try and set the least significant bit of the centralized lock state to 1,
768   // indicating locked.
769   //
770   // If this succeeds, it must have been the case that we had a kUnlocked (or
771   // 0) in the centralized storage before, since that is the only case where a
772   // 0 can be found. So we assert that in debug mode
773   //
774   // If this fails, then it is a no-op
775   auto previous = atomic_fetch_set(state_, 0, std::memory_order_acquire);
776   if (!previous) {
777     return {nullptr, kLocked};
778   }
779 
780   return {nullptr, 0};
781 }
template<template< typename > class Atomic, bool TimePublishing>
template<typename Rep , typename Period >
DistributedMutex< Atomic, TimePublishing >::DistributedMutexStateProxy folly::detail::distributed_mutex::DistributedMutex< Atomic, TimePublishing >::try_lock_for ( const std::chrono::duration< Rep, Period > &  duration)

Try to acquire the mutex, blocking for the given time

Like try_lock(), this is allowed to fail spuriously and is not guaranteed to return true even when the mutex is currently unlocked; it will, however, only return a failed result after the given time has elapsed

try_lock_for() accepts a duration to block for, and try_lock_until() accepts an absolute wall clock time point
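For instance (a sketch; the instance name is hypothetical, and both overloads are shown back to back):

  #include <chrono>
  #include <utility>

  #include <folly/synchronization/DistributedMutex.h>

  folly::DistributedMutex mutex;

  void timedCriticalSection() {
    using namespace std::chrono_literals;

    // duration-based form
    if (auto proxy = mutex.try_lock_for(10ms)) {
      mutex.unlock(std::move(proxy));
      return;
    }

    // equivalent deadline-based form
    auto deadline = std::chrono::steady_clock::now() + 10ms;
    if (auto proxy = mutex.try_lock_until(deadline)) {
      mutex.unlock(std::move(proxy));
    }
  }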

Definition at line 870 of file DistributedMutex-inl.h.

References now(), folly::detail::distributed_mutex::DistributedMutex< Atomic, TimePublishing >::state_, folly::detail::distributed_mutex::timedLock(), and folly::detail::distributed_mutex::DistributedMutex< Atomic, TimePublishing >::try_lock().

871 {
872   // fast path for the uncontended case. Reasoning for doing this here is the
873   // same as in try_lock_until()
874   if (auto state = try_lock()) {
875     return state;
876   }
877 
878   // fall back to the timed locking algorithm
879   using Proxy = DistributedMutexStateProxy;
880   auto deadline = std::chrono::steady_clock::now() + duration;
881   return timedLock(state_, deadline, [](auto... as) { return Proxy{as...}; });
882 }
template<template< typename > class Atomic, bool TimePublishing>
template<typename Clock , typename Duration >
DistributedMutex< Atomic, TimePublishing >::DistributedMutexStateProxy folly::detail::distributed_mutex::DistributedMutex< Atomic, TimePublishing >::try_lock_until ( const std::chrono::time_point< Clock, Duration > &  deadline)

Try to acquire the lock, blocking until the given deadline

Other than the difference in the meaning of the second argument, the semantics of this function are identical to try_lock_for()

Definition at line 847 of file DistributedMutex-inl.h.

References folly::detail::distributed_mutex::DistributedMutex< Atomic, TimePublishing >::state_, folly::detail::distributed_mutex::timedLock(), and folly::detail::distributed_mutex::DistributedMutex< Atomic, TimePublishing >::try_lock().

848 {
849   // fast path for the uncontended case
850   //
851   // we get the time after trying to acquire the mutex because in the
852   // uncontended case, the price of getting the time is about 1/3 of the
853   // actual mutex acquisition. So we only pay the price of that extra bit of
854   // latency when needed
855   //
856   // this is even higher when VDSO is involved on architectures that do not
857   // offer a direct interface to the timestamp counter
858   if (auto state = try_lock()) {
859     return state;
860   }
861 
862   // fall back to the timed locking algorithm
863   using Proxy = DistributedMutexStateProxy;
864   return timedLock(state_, deadline, [](auto... as) { return Proxy{as...}; });
865 }
template<template< typename > class Atomic, bool Publish>
void folly::detail::distributed_mutex::DistributedMutex< Atomic, Publish >::unlock ( DistributedMutexStateProxy  proxy)

Unlocks the mutex

The proxy returned by lock() must be passed to unlock() as an rvalue. No other option is possible here, since the proxy is only movable and not copyable

Definition at line 705 of file DistributedMutex-inl.h.

References folly::detail::distributed_mutex::doFutexWake(), folly::exchange(), folly::detail::distributed_mutex::DistributedMutex< Atomic, TimePublishing >::DistributedMutexStateProxy< Atomic, TimePublishing >::expected_, folly::detail::distributed_mutex::kLocked, folly::detail::distributed_mutex::DistributedMutex< Atomic, TimePublishing >::DistributedMutexStateProxy< Atomic, TimePublishing >::next_, folly::detail::distributed_mutex::DistributedMutex< Atomic, TimePublishing >::DistributedMutexStateProxy< Atomic, TimePublishing >::ready_, folly::detail::distributed_mutex::recordTimedWaiterAndClearTimedBit(), SCOPE_EXIT, folly::detail::distributed_mutex::DistributedMutex< Atomic, TimePublishing >::state_, folly::detail::distributed_mutex::DistributedMutex< Atomic, TimePublishing >::DistributedMutexStateProxy< Atomic, TimePublishing >::timedWaiters_, folly::detail::distributed_mutex::tryUnlockClean(), folly::detail::distributed_mutex::DistributedMutex< Atomic, TimePublishing >::DistributedMutexStateProxy< Atomic, TimePublishing >::waiters_, folly::detail::distributed_mutex::wake(), folly::detail::distributed_mutex::DistributedMutex< Atomic, TimePublishing >::DistributedMutexStateProxy< Atomic, TimePublishing >::wakerMetadata_, and folly::detail::distributed_mutex::wakeTimedWaiters().

Referenced by burn().

706 {
707   // we always wake up ready threads and timed waiters if we saw either
708   DCHECK(proxy) << "Invalid proxy passed to DistributedMutex::unlock()";
709   SCOPE_EXIT {
710     doFutexWake(proxy.ready_);
711     wakeTimedWaiters(&state_, proxy.timedWaiters_);
712   };
713 
714   // if there is a wait queue we are responsible for, try and start wakeups,
715   // don't bother with the mutex state
716   auto sleepers = proxy.waiters_;
717   if (proxy.next_) {
718     if (wake(Publish, *proxy.next_, proxy.wakerMetadata_, sleepers)) {
719       return;
720     }
721 
722     // At this point, if we are in the if statement, we were not the terminal
723     // node of the wakeup chain. Terminal nodes have the next_ pointer set to
724     // null in lock()
725     //
726     // So we need to pretend we were the end of the contention chain. Coming
727     // out of a contention chain always has the kLocked state set in the
728     // mutex. Unless there is another contention chain lined up, which does
729     // not matter since we are the terminal node anyway
730     proxy.expected_ = kLocked;
731   }
732 
733   while (true) {
734     // otherwise, since we don't have anyone we need to wake up, we try and
735     // release the mutex just as is
736     //
737     // if this is successful, we can return, the unlock was successful, we have
738     // committed a nice kUnlocked to the central storage, yay
739     if (tryUnlockClean(state_, proxy, sleepers)) {
740       return;
741     }
742 
743     // here we have a contention chain built up on the mutex. We grab the
744     // wait queue and start executing wakeups. We leave a locked bit on the
745     // centralized storage and hand off control to the head of the queue
746     //
747     // we use memory_order_acq_rel here because we want to see the
748     // full well-initialized node that the other thread is waiting on
749     //
750     // If we are unable to wake the contention chain, it is possible that when
751     // we come back to looping here, a new contention chain will form. In
752     // that case we need to use kLocked as the waker_ value because the
753     // terminal node of the new chain will see kLocked in the central storage
754     auto head = state_.exchange(kLocked, std::memory_order_acq_rel);
755     recordTimedWaiterAndClearTimedBit(proxy.timedWaiters_, head);
756     auto next = extractAddress<CachelinePadded<Waiter<Atomic>>>(head);
757     DCHECK((head & kLocked) && (head != kLocked)) << "incorrect state " << head;
758     if (wake(Publish, *next, {exchange(proxy.expected_, kLocked)}, sleepers)) {
759       break;
760     }
761   }
762 }

Member Data Documentation

template<template< typename > class Atomic = std::atomic, bool TimePublishing = true>
Atomic< std::uintptr_t > folly::detail::distributed_mutex::DistributedMutex< Atomic, TimePublishing >::state_ {0}
private

The single pointer-length word of centralized mutex state

The documentation for this class was generated from the following files:

DistributedMutex.h
DistributedMutex-inl.h