folly::AccessSpreader< Atom > Struct Template Reference
#include <CacheLocality.h>
Classes

class CpuCache
    Caches the current CPU and refreshes the cache every so often.

Public Member Functions

template<> size_t current (size_t numStripes)

Static Public Member Functions

static size_t current (size_t numStripes)
static size_t cachedCurrent (size_t numStripes)
    Fallback implementation when thread-local storage isn't available.

Private Types

enum { kMaxCpus = 128 }
typedef uint8_t CompactStripe

Private Member Functions

template<> Getcpu::Func pickGetcpuFunc ()
template<> Getcpu::Func pickGetcpuFunc ()
template<> Getcpu::Func pickGetcpuFunc ()
template<> Getcpu::Func pickGetcpuFunc ()

Static Private Member Functions

static Getcpu::Func pickGetcpuFunc ()
    Returns the best getcpu implementation for Atom.
static int degenerateGetcpu (unsigned *cpu, unsigned *node, void *)
    Always claims to be on CPU zero, node zero.
static bool initialize ()

Static Private Attributes

static Getcpu::Func getcpuFunc
static CompactStripe widthAndCpuToStripe [kMaxCpus+1][kMaxCpus] = {}
static bool initialized = AccessSpreader<Atom>::initialize()
AccessSpreader arranges access to a striped data structure in such a way that concurrently executing threads are likely to be accessing different stripes. It does NOT guarantee uncontended access. Your underlying algorithm must be thread-safe without spreading; this is merely an optimization. AccessSpreader::current(n) is typically much faster than a cache miss (12 nanos on my dev box, tested fast in both 2.6 and 3.2 kernels).
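As a usage sketch, suppose we spread a counter across stripes. The StripedCounter class below is illustrative and not part of folly; only AccessSpreader::current() comes from this header, and the include path varies across folly versions.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

#include <folly/concurrency/CacheLocality.h> // include path varies by folly version

// Hypothetical striped counter: each increment goes to the stripe that
// AccessSpreader associates with the calling thread's current CPU, so threads
// on different cores rarely touch the same cache line.
class StripedCounter {
 public:
  static constexpr size_t kStripes = 8;

  void add(uint64_t n) {
    // Always returns a value < kStripes; nearby CPUs map to the same stripe.
    size_t stripe = folly::AccessSpreader<>::current(kStripes);
    stripes_[stripe].value.fetch_add(n, std::memory_order_relaxed);
  }

  uint64_t read() const {
    uint64_t sum = 0;
    for (auto const& s : stripes_) {
      sum += s.value.load(std::memory_order_relaxed);
    }
    return sum;
  }

 private:
  struct alignas(64) Padded { // pad to a cache line to avoid false sharing
    std::atomic<uint64_t> value{0};
  };
  std::array<Padded, kStripes> stripes_;
};
```

The stripe count is a tuning knob: more stripes means less write contention but more cache lines to visit when summing.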
If available (and not using the deterministic testing implementation) AccessSpreader uses the getcpu system call via VDSO and the precise locality information retrieved from sysfs by CacheLocality. This provides optimal anti-sharing at a fraction of the cost of a cache miss.
When there are not as many stripes as processors, we try to optimally place the cache sharing boundaries. This means that if you have 2 stripes and run on a dual-socket system, your 2 stripes will each get all of the cores from a single socket. If you have 16 stripes on a 16 core system plus hyperthreading (32 cpus), each core will get its own stripe and there will be no cache sharing at all.
AccessSpreader has a fallback mechanism for when __vdso_getcpu can't be loaded, or for use during deterministic testing. Using sched_getcpu or the getcpu syscall would negate the performance advantages of access spreading, so we use a thread-local value and a shared atomic counter to spread access out. On systems lacking both a fast getcpu() and TLS, we hash the thread id to spread accesses.
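A minimal sketch of that fallback idea, using illustrative names rather than folly's actual FallbackGetcpu/ThreadId code:

```cpp
#include <atomic>
#include <cstddef>

namespace fallback_sketch {

// Hand each thread a stable pseudo-CPU id the first time it asks.
inline unsigned pseudoCpu() {
  static std::atomic<unsigned> nextCpu{0};
  thread_local unsigned myCpu = nextCpu.fetch_add(1, std::memory_order_relaxed);
  return myCpu;
}

// Round-robin threads across stripes: no locality information is available in
// this mode, but threads still tend to land on different stripes, which is the
// point of spreading.
inline size_t spreadCurrent(size_t numStripes) {
  return pseudoCpu() % numStripes;
}

} // namespace fallback_sketch
```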
AccessSpreader is templated on the template type that is used to implement atomics, as a way to instantiate the underlying heuristics differently for production use and deterministic unit testing. See DeterministicScheduler for more. If you aren't using DeterministicScheduler, you can just use the default template parameter all of the time.
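Concretely, production code relies on the default parameter, while tests that need reproducibility instantiate with the deterministic atomic. The test-side header and call shown commented out below are assumptions about folly's test utilities, not a verified snippet.

```cpp
#include <cstddef>

#include <folly/concurrency/CacheLocality.h> // include path varies by folly version

size_t productionStripe() {
  // Production: the default Atom (std::atomic) is what you want.
  return folly::AccessSpreader<>::current(16);
}

// Deterministic testing (sketch): the same call through an instantiation that
// uses the deterministic atomic, so the spreading heuristic is reproducible.
// #include <folly/test/DeterministicSchedule.h>
// size_t testStripe =
//     folly::AccessSpreader<folly::test::DeterministicAtomic>::current(16);
```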
Definition at line 228 of file CacheLocality.h.
typedef uint8_t folly::AccessSpreader< Atom >::CompactStripe [private]
Definition at line 265 of file CacheLocality.h.
anonymous enum [private]
If there are more cpus than this, nothing will crash, but there might be unnecessary sharing.
Enumerator: kMaxCpus
Definition at line 263 of file CacheLocality.h.
size_t folly::AccessSpreader< Atom >::cachedCurrent ( size_t numStripes ) [inline, static]
Fallback implementation when thread-local storage isn't available.
Definition at line 255 of file CacheLocality.h.
Referenced by BENCHMARK(), and folly::AccessSpreader< Atom >::current().
size_t folly::AccessSpreader< CachedCurrentTag >::current ( size_t numStripes )
Definition at line 63 of file CacheLocalityBenchmark.cpp.
References folly::AccessSpreader< Atom >::cachedCurrent().
size_t folly::AccessSpreader< Atom >::current ( size_t numStripes ) [inline, static]
Returns the stripe associated with the current CPU. The returned value will be < numStripes.
Definition at line 231 of file CacheLocality.h.
Referenced by folly::detail::DigestBuilder< DigestT >::append(), BENCHMARK(), contentionAtWidth(), folly::CoreCachedSharedPtr< T, kNumSlots >::get(), folly::CoreCachedWeakPtr< T, kNumSlots >::get(), folly::AtomicCoreCachedSharedPtr< T, kNumSlots >::get(), and TEST().
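Because the result is always below numStripes, it can index a fixed-size array directly. A hypothetical helper (not folly code):

```cpp
#include <array>
#include <cstddef>
#include <mutex>

#include <folly/concurrency/CacheLocality.h> // include path varies by folly version

constexpr size_t kNumShards = 4;
std::array<std::mutex, kNumShards> shardLocks;

// Threads running on nearby CPUs get the same shard, so each lock is shared by
// fewer, more cache-friendly peers.
std::mutex& lockForThisCpu() {
  size_t idx = folly::AccessSpreader<>::current(kNumShards); // always < kNumShards
  return shardLocks[idx];
}
```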
int folly::AccessSpreader< Atom >::degenerateGetcpu ( unsigned * cpu, unsigned * node, void * ) [inline, static, private]
Always claims to be on CPU zero, node zero.
Definition at line 321 of file CacheLocality.h.
bool folly::AccessSpreader< Atom >::initialize ( ) [inline, static, private]
Definition at line 344 of file CacheLocality.h.
template<> Getcpu::Func pickGetcpuFunc ( ) [private]
Definition at line 50 of file CacheLocalityBenchmark.cpp.
template<> Getcpu::Func pickGetcpuFunc ( ) [private]
Definition at line 54 of file CacheLocalityBenchmark.cpp.
Getcpu::Func folly::AccessSpreader< Atom >::pickGetcpuFunc ( ) [inline, static, private]
Returns the best getcpu implementation for Atom.
Definition at line 315 of file CacheLocality.h.
References folly::FallbackGetcpu< ThreadId >::getcpu(), and folly::Getcpu::resolveVdsoFunc().
Referenced by folly::test::DeterministicMutex::unlock().
template<> Getcpu::Func pickGetcpuFunc ( ) [private]
Definition at line 429 of file DeterministicSchedule.cpp.
References folly::test::DeterministicSchedule::getcpu().
template<> Getcpu::Func pickGetcpuFunc ( ) [private]
Getcpu::Func folly::AccessSpreader< Atom >::getcpuFunc [static, private]
Points to the getcpu-like function we are using to obtain the current CPU. It should not be assumed that the returned CPU value is in range. We use a static for this so that we can prearrange a valid value in the pre-constructed state and avoid the need for a conditional on every subsequent invocation (not normally a big win, but 20% on some inner loops here).
Definition at line 269 of file CacheLocality.h.
Referenced by folly::AccessSpreader< Atom >::CpuCache::cpu().
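The pattern can be sketched as follows; the names are illustrative and not folly's own definitions. The pointer starts out aimed at a safe degenerate implementation, so the hot path is an unconditional indirect call with no "initialized yet?" branch.

```cpp
namespace sketch {

using GetcpuFunc = int (*)(unsigned* cpu, unsigned* node, void* unused);

// Safe pre-constructed behaviour: always claims CPU 0 / node 0.
inline int degenerateGetcpu(unsigned* cpu, unsigned* node, void*) {
  if (cpu != nullptr) { *cpu = 0; }
  if (node != nullptr) { *node = 0; }
  return 0;
}

// Valid before any initializer runs, so callers never check for null.
// (C++17 inline variable, used here for brevity.)
inline GetcpuFunc getcpuFunc = degenerateGetcpu;

inline unsigned currentCpu() {
  unsigned cpu;
  getcpuFunc(&cpu, nullptr, nullptr); // unconditional indirect call
  return cpu;
}

// A later initializer can swap in a faster implementation, e.g. one resolved
// from the vDSO, simply by assigning to getcpuFunc.

} // namespace sketch
```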
bool folly::AccessSpreader< Atom >::initialized [static, private]
Definition at line 312 of file CacheLocality.h.
CompactStripe folly::AccessSpreader< Atom >::widthAndCpuToStripe[kMaxCpus+1][kMaxCpus] = {} [static, private]
For each level of splitting up to kMaxCpus, maps the cpu (mod kMaxCpus) to the stripe. Rather than performing any inequalities or modulo on the actual number of cpus, we just fill in the entire array.
Definition at line 286 of file CacheLocality.h.
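A sketch of the table idea, with a naive fill that ignores cache topology; folly's real initialize() instead uses the locality information that CacheLocality retrieves from sysfs.

```cpp
#include <cstddef>
#include <cstdint>

namespace sketch {

constexpr size_t kMaxCpus = 128;
using CompactStripe = uint8_t;

// Row = desired stripe count (width), column = cpu mod kMaxCpus.
CompactStripe widthAndCpuToStripe[kMaxCpus + 1][kMaxCpus] = {};

// Naive fill: contiguous blocks of cpus per stripe.  numCpus must be >= 1.
void fillNaive(size_t numCpus) {
  for (size_t width = 1; width <= kMaxCpus; ++width) {
    for (size_t cpu = 0; cpu < kMaxCpus; ++cpu) {
      size_t effectiveCpu = cpu % numCpus; // wrap cpus beyond the real count
      widthAndCpuToStripe[width][cpu] =
          static_cast<CompactStripe>(effectiveCpu * width / numCpus); // < width
    }
  }
}

// Hot-path lookup: no division or comparison against the actual cpu count.
size_t stripeFor(size_t width, unsigned cpu) {
  return widthAndCpuToStripe[width][cpu % kMaxCpus];
}

} // namespace sketch
```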