proxygen
folly::AccessSpreader< Atom > Struct Template Reference

#include <CacheLocality.h>

Classes

class  CpuCache
 Caches the current CPU and refreshes the cache every so often. More...
 

Public Member Functions

template<>
size_t current (size_t numStripes)
 

Static Public Member Functions

static size_t current (size_t numStripes)
 
static size_t cachedCurrent (size_t numStripes)
 Fallback implementation when thread-local storage isn't available. More...
 

Private Types

enum  { kMaxCpus = 128 }
 
typedef uint8_t CompactStripe
 

Private Member Functions

template<>
Getcpu::Func pickGetcpuFunc ()
 
template<>
Getcpu::Func pickGetcpuFunc ()
 
template<>
Getcpu::Func pickGetcpuFunc ()
 
template<>
Getcpu::Func pickGetcpuFunc ()
 

Static Private Member Functions

static Getcpu::Func pickGetcpuFunc ()
 Returns the best getcpu implementation for Atom. More...
 
static int degenerateGetcpu (unsigned *cpu, unsigned *node, void *)
 Always claims to be on CPU zero, node zero. More...
 
static bool initialize ()
 

Static Private Attributes

static Getcpu::Func getcpuFunc
 
static CompactStripe widthAndCpuToStripe [kMaxCpus+1][kMaxCpus] = {}
 
static bool initialized = AccessSpreader<Atom>::initialize()
 

Detailed Description

template<template< typename > class Atom = std::atomic>
struct folly::AccessSpreader< Atom >

AccessSpreader arranges access to a striped data structure in such a way that concurrently executing threads are likely to be accessing different stripes. It does NOT guarantee uncontended access. Your underlying algorithm must be thread-safe without spreading; this is merely an optimization. AccessSpreader::current(n) is typically much faster than a cache miss (12 nanos on my dev box; fast on both 2.6 and 3.2 kernels).
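The usage pattern this enables can be sketched as follows. This is a self-contained illustration, not folly's API: `pick_stripe` is a hypothetical stand-in for `AccessSpreader<>::current(numStripes)` (here just hashing the thread id), and `StripedCounter` is an invented example type.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>

// Hypothetical stand-in for AccessSpreader<>::current(numStripes).
// Any thread -> stripe mapping below numStripes is *correct*, because the
// stripes themselves must already be thread-safe; spreading only reduces
// contention. Here we just hash the thread id.
static size_t pick_stripe(size_t numStripes) {
  size_t h = std::hash<std::thread::id>{}(std::this_thread::get_id());
  return h % numStripes;
}

constexpr size_t kStripes = 8;

// Invented example type: a counter striped across cache lines.
struct StripedCounter {
  struct alignas(64) Slot {  // one cache line per stripe, no false sharing
    std::atomic<uint64_t> v{0};
  };
  std::array<Slot, kStripes> slots;

  void add(uint64_t n) {
    slots[pick_stripe(kStripes)].v.fetch_add(n, std::memory_order_relaxed);
  }

  uint64_t read() const {  // sum over all stripes; thread-safe, not atomic
    uint64_t sum = 0;
    for (auto& s : slots) {
      sum += s.v.load(std::memory_order_relaxed);
    }
    return sum;
  }
};
```

Even with a poor thread-to-stripe mapping the counter stays correct; a better mapping (like the real `current()`) only reduces cache-line contention.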

If available (and not using the deterministic testing implementation) AccessSpreader uses the getcpu system call via VDSO and the precise locality information retrieved from sysfs by CacheLocality. This provides optimal anti-sharing at a fraction of the cost of a cache miss.

When there are not as many stripes as processors, we try to optimally place the cache sharing boundaries. This means that if you have 2 stripes and run on a dual-socket system, your 2 stripes will each get all of the cores from a single socket. If you have 16 stripes on a 16 core system plus hyperthreading (32 cpus), each core will get its own stripe and there will be no cache sharing at all.
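The boundary placement above reduces to the `(index * numStripes) / n` transform visible in `initialize()`: CPUs are ordered by locality index so that cache-sharing neighbors are adjacent, and contiguous runs map to the same stripe. A minimal self-contained sketch, with `stripeFor` as an invented helper name:

```cpp
#include <cassert>
#include <cstddef>

// Sketch of the stripe-placement transform: localityIndex orders CPUs so
// that cache-sharing neighbors are adjacent, then index*numStripes/n
// assigns each contiguous run of nearby CPUs to one stripe.
static size_t stripeFor(size_t localityIndex, size_t numStripes, size_t n) {
  return (localityIndex * numStripes) / n;
}
```

For n = 8 CPUs on two sockets (locality indices 0-3 on socket 0, 4-7 on socket 1) and 2 stripes, each socket gets its own stripe; for 16 stripes on 32 CPUs, each pair of adjacent locality indices (hyperthread siblings on one core) shares a stripe.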

AccessSpreader has a fallback mechanism for when __vdso_getcpu can't be loaded, or for use during deterministic testing. Using sched_getcpu or the getcpu syscall would negate the performance advantages of access spreading, so we use a thread-local value and a shared atomic counter to spread access out. On systems lacking both a fast getcpu() and TLS, we hash the thread id to spread accesses.
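The thread-local-plus-shared-counter idea can be sketched as below. This is a hedged illustration of the strategy, not folly's exact code; `fallbackCurrent` is an invented name.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Sketch of the TLS fallback: a shared atomic counter hands each thread a
// persistent "virtual CPU" the first time it asks, so threads end up
// spread round-robin across stripes with no real CPU information at all.
static size_t fallbackCurrent(size_t numStripes) {
  static std::atomic<size_t> nextVirtualCpu{0};
  thread_local size_t virtualCpu =
      nextVirtualCpu.fetch_add(1, std::memory_order_relaxed);
  return virtualCpu % numStripes;
}
```

The `thread_local` initializer runs once per thread, so the shared counter is touched only on a thread's first call; every later call is a cheap TLS read plus a modulo.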

AccessSpreader is templated on the template type that is used to implement atomics, as a way to instantiate the underlying heuristics differently for production use and deterministic unit testing. See DeterministicScheduler for more. If you aren't using DeterministicScheduler, you can just use the default template parameter all of the time.

Definition at line 228 of file CacheLocality.h.

Member Typedef Documentation

template<template< typename > class Atom = std::atomic>
typedef uint8_t folly::AccessSpreader< Atom >::CompactStripe
private

Definition at line 265 of file CacheLocality.h.

Member Enumeration Documentation

template<template< typename > class Atom = std::atomic>
anonymous enum
private

If there are more CPUs than this, nothing will crash, but there might be unnecessary sharing.

Enumerator
kMaxCpus 

Definition at line 263 of file CacheLocality.h.

Member Function Documentation

template<template< typename > class Atom = std::atomic>
static size_t folly::AccessSpreader< Atom >::cachedCurrent ( size_t  numStripes)
inlinestatic

Fallback implementation when thread-local storage isn't available.

Definition at line 255 of file CacheLocality.h.

Referenced by BENCHMARK(), and folly::AccessSpreader< Atom >::current().

255  {
256  return current(numStripes);
257  }
template<>
size_t folly::AccessSpreader< CachedCurrentTag >::current ( size_t  numStripes)

Definition at line 63 of file CacheLocalityBenchmark.cpp.

References folly::AccessSpreader< Atom >::cachedCurrent().

63  {
64  return AccessSpreader<>::cachedCurrent(numStripes);
65 }
template<template< typename > class Atom = std::atomic>
static size_t folly::AccessSpreader< Atom >::current ( size_t  numStripes)
inlinestatic

Returns the stripe associated with the current CPU. The returned value will be < numStripes.

Definition at line 231 of file CacheLocality.h.

References current, and min.

Referenced by folly::detail::DigestBuilder< DigestT >::append(), BENCHMARK(), contentionAtWidth(), folly::CoreCachedSharedPtr< T, kNumSlots >::get(), folly::CoreCachedWeakPtr< T, kNumSlots >::get(), folly::AtomicCoreCachedSharedPtr< T, kNumSlots >::get(), and TEST().

231  {
232  // widthAndCpuToStripe[0] will actually work okay (all zeros), but
233  // something's wrong with the caller
234  assert(numStripes > 0);
235 
236  unsigned cpu;
237  getcpuFunc(&cpu, nullptr, nullptr);
238  return widthAndCpuToStripe[std::min(size_t(kMaxCpus), numStripes)]
239  [cpu % kMaxCpus];
240  }
template<template< typename > class Atom = std::atomic>
static int folly::AccessSpreader< Atom >::degenerateGetcpu ( unsigned *  cpu,
unsigned *  node,
void *   
)
inlinestaticprivate

Always claims to be on CPU zero, node zero.

Definition at line 321 of file CacheLocality.h.

321  {
322  if (cpu != nullptr) {
323  *cpu = 0;
324  }
325  if (node != nullptr) {
326  *node = 0;
327  }
328  return 0;
329  }
template<template< typename > class Atom = std::atomic>
static bool folly::AccessSpreader< Atom >::initialize ( )
inlinestaticprivate

Definition at line 344 of file CacheLocality.h.

References Atom, and max.

344  {
345  getcpuFunc = pickGetcpuFunc();
346 
347  auto& cacheLocality = CacheLocality::system<Atom>();
348  auto n = cacheLocality.numCpus;
349  for (size_t width = 0; width <= kMaxCpus; ++width) {
350  auto numStripes = std::max(size_t{1}, width);
351  for (size_t cpu = 0; cpu < kMaxCpus && cpu < n; ++cpu) {
352  auto index = cacheLocality.localityIndexByCpu[cpu];
353  assert(index < n);
354  // as index goes from 0..n, post-transform value goes from
355  // 0..numStripes
356  widthAndCpuToStripe[width][cpu] =
357  CompactStripe((index * numStripes) / n);
358  assert(widthAndCpuToStripe[width][cpu] < numStripes);
359  }
360  for (size_t cpu = n; cpu < kMaxCpus; ++cpu) {
361  widthAndCpuToStripe[width][cpu] = widthAndCpuToStripe[width][cpu - n];
362  }
363  }
364  return true;
365  }
template<>
Getcpu::Func folly::AccessSpreader< ThreadLocalTag >::pickGetcpuFunc ( )
private

Definition at line 50 of file CacheLocalityBenchmark.cpp.

template<>
Getcpu::Func folly::AccessSpreader< PthreadSelfTag >::pickGetcpuFunc ( )
private

Definition at line 54 of file CacheLocalityBenchmark.cpp.

template<template< typename > class Atom = std::atomic>
static Getcpu::Func folly::AccessSpreader< Atom >::pickGetcpuFunc ( )
inlinestaticprivate

Returns the best getcpu implementation for Atom.

Definition at line 315 of file CacheLocality.h.

References folly::FallbackGetcpu< ThreadId >::getcpu(), and folly::Getcpu::resolveVdsoFunc().

Referenced by folly::test::DeterministicMutex::unlock().

315  {
316  auto best = Getcpu::resolveVdsoFunc();
317  return best ? best : &FallbackGetcpuType::getcpu;
318  }
template<>
Getcpu::Func folly::AccessSpreader< test::DeterministicAtomic >::pickGetcpuFunc ( )
private

Definition at line 429 of file DeterministicSchedule.cpp.

References folly::test::DeterministicSchedule::getcpu().

429  {
430  return &DeterministicSchedule::getcpu;
431 }

Member Data Documentation

template<template< typename > class Atom = std::atomic>
Getcpu::Func folly::AccessSpreader< Atom >::getcpuFunc
staticprivate
Initial value:

 AccessSpreader<Atom>::degenerateGetcpu
Points to the getcpu-like function we are using to obtain the current cpu. It should not be assumed that the returned cpu value is in range. We use a static for this so that we can prearrange a valid value in the pre-constructed state and avoid the need for a conditional on every subsequent invocation (not normally a big win, but 20% on some inner loops here).

Definition at line 269 of file CacheLocality.h.

Referenced by folly::AccessSpreader< Atom >::CpuCache::cpu().
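The "prearranged valid value" idea above can be sketched independently of folly. A hedged illustration with invented names (`g_getcpu`, `initGetcpu`):

```cpp
#include <cassert>

// The function pointer starts at a safe degenerate implementation, so a
// call made before initialization runs is still valid -- no per-call
// "initialized yet?" branch is needed; init just swaps in a better one.
using GetcpuFunc = int (*)(unsigned* cpu, unsigned* node, void* unused);

static int degenerate(unsigned* cpu, unsigned* node, void*) {
  if (cpu) { *cpu = 0; }
  if (node) { *node = 0; }
  return 0;
}

static GetcpuFunc g_getcpu = degenerate;  // valid in the pre-init state

static void initGetcpu(GetcpuFunc best) {
  if (best) { g_getcpu = best; }  // upgrade when a real getcpu is found
}
```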

template<template< typename > class Atom = std::atomic>
bool folly::AccessSpreader< Atom >::initialized = AccessSpreader<Atom>::initialize()
staticprivate

Definition at line 312 of file CacheLocality.h.

template<template< typename > class Atom = std::atomic>
AccessSpreader< Atom >::CompactStripe folly::AccessSpreader< Atom >::widthAndCpuToStripe = {}
staticprivate

For each level of splitting up to kMaxCpus, maps the cpu (mod kMaxCpus) to the stripe. Rather than performing any inequalities or modulo on the actual number of cpus, we just fill in the entire array.

Definition at line 286 of file CacheLocality.h.


The documentation for this struct was generated from the following file:

CacheLocality.h