# F14 Hash Table

F14 is a 14-way probing hash table that resolves collisions by double
hashing. Up to 14 keys are stored in a chunk at a single hash table
position. Vector instructions (SSE2 on x86_64, NEON on aarch64)
are used to filter within a chunk; intra-chunk search takes only a
handful of instructions. **F14** refers to the fact that the algorithm
**F**ilters up to **14** keys at a time. This strategy allows the hash
table to be operated at a high maximum load factor (12/14) while still
keeping probe chains very short.

F14 provides compelling replacements for most of the hash tables we use in
production at Facebook. Switching to it can improve memory efficiency
and performance at the same time. The hash table implementations
widely deployed in C++ at Facebook exist along a spectrum of space/time
tradeoffs. The fastest is the least memory efficient, and the most
memory efficient (google::sparse_hash_map) is much slower than the rest.
F14 moves the curve, simultaneously improving memory efficiency and
performance when compared to most of the existing algorithms.

## F14 VARIANTS

The core hash table implementation has a pluggable storage strategy,
with three policies provided:

F14NodeMap stores values indirectly, calling malloc on each insert like
std::unordered_map. This implementation is the most memory efficient
for medium and large keys. It provides the same iterator and reference
stability guarantees as the standard map while being faster and more
memory efficient, so you can substitute F14NodeMap for std::unordered_map
safely in production code. F14's filtering substantially reduces
indirection (and cache misses) when compared to std::unordered_map.

F14ValueMap stores values inline, like google::dense_hash_map.
Inline storage is the most memory efficient for small values, but for
medium and large values it wastes space. Because it can tolerate a much
higher load factor, F14ValueMap is almost twice as memory efficient as
dense_hash_map while also faster for most workloads.

F14VectorMap keeps values packed in a contiguous array. The main hash
array stores 32-bit indexes into the value vector. Compared to the
existing internal implementations that use a similar strategy, F14 is
slower for simple keys and small or medium-sized tables (because of the
cost of bit mixing), faster for complex keys and large tables, and saves
about 16 bytes per entry on average.

We also provide:

F14FastMap inherits from either F14ValueMap or F14VectorMap depending
on entry size. When the key and mapped_type are less than 24 bytes, it
inherits from F14ValueMap. For medium and large entries, it inherits
from F14VectorMap. This strategy provides the best performance, while
also providing better memory efficiency than dense_hash_map or the other
hash tables in use at Facebook that don't individually allocate nodes.

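As an illustration of picking among the variants, here is a minimal sketch;
the `Widget` type is hypothetical, invented only to make the entry-size
tradeoff concrete:

```cpp
#include <folly/container/F14Map.h>

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical value type used only to illustrate a "large entry".
struct Widget {
  std::string name;
  std::vector<double> samples;
};

int main() {
  // Small entries (16 bytes here): F14FastMap behaves like F14ValueMap.
  folly::F14FastMap<uint64_t, uint64_t> counts;
  counts[42] = 1;

  // Large entries: F14FastMap behaves like F14VectorMap.
  folly::F14FastMap<std::string, Widget> widgets;
  widgets["sprocket"] = Widget{};

  // Need reference stability across rehashes? Use the node-based variant.
  folly::F14NodeMap<std::string, Widget> stableWidgets;
  stableWidgets["gear"] = Widget{};
  return 0;
}
```
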
## WHICH F14 VARIANT IS RIGHT FOR ME?

F14FastMap is a good default choice. If you care more about memory
efficiency than performance, F14NodeMap is better for medium and
large entries. F14NodeMap is the only F14 variant that doesn't move
its elements, so in the rare case that you need reference stability you
should use it.

## HETEROGENEOUS KEY TYPE WITH TRANSPARENT HASH AND EQUALITY

In some cases it makes sense to define hash and key equality across
types. For example, StringPiece's hash and equality are capable of
accepting std::string (because std::string is implicitly convertible
to StringPiece). If you mark the hash functor and key equality functor
as _transparent_, then F14 will allow you to search the table directly
using any of the accepted key types without converting the key.

For example, using H =
folly::transparent<folly::hasher<folly::StringPiece>> and
E = folly::transparent<std::equal_to<folly::StringPiece>>, an
F14FastSet<std::string, H, E> will allow you to use a StringPiece key
without the need to construct a std::string.

Heterogeneous lookup and erase work for any key types that can be passed
to operator() on the hasher and key_equal functors. For operations
such as operator[] that might insert there is an additional constraint,
which is that the passed-in key must be explicitly convertible to the
table's key_type. F14 maps understand all possible forms that can be
used to construct the underlying std::pair<key_type const, mapped_type>,
so heterogeneous keys can be used even with insert and emplace.

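A sketch of the StringPiece example above (treat the exact include
locations for folly::transparent and folly::hasher as an assumption):

```cpp
#include <folly/Range.h>             // folly::StringPiece
#include <folly/container/F14Set.h>
#include <folly/hash/Hash.h>         // folly::hasher, folly::transparent (assumed)

#include <functional>
#include <string>

using H = folly::transparent<folly::hasher<folly::StringPiece>>;
using E = folly::transparent<std::equal_to<folly::StringPiece>>;

int main() {
  folly::F14FastSet<std::string, H, E> names;
  names.emplace("alice");                  // the stored key is a std::string

  folly::StringPiece needle{"alice"};
  bool present = names.count(needle) > 0;  // heterogeneous lookup, no std::string built
  names.erase(needle);                     // heterogeneous erase works the same way
  (void)present;
  return 0;
}
```
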
## WHY CHUNKS?

Assuming that you have a magic wand that lets you search all of the keys
in a chunk in a single step (our wand is called _mm_cmpeq_epi8), then
using chunks fundamentally improves the load factor/collision tradeoff.
The cost is proportional only to the number of chunks visited to find
the key.

It's kind of like the birthday paradox in reverse. In a room with 23
people there is a 50/50 chance that two of them have the same birthday
(overflowing a chunk with capacity 1), but the chance that 8 of them
were born in the same week (overflowing a chunk with capacity 7) is
very small. Even though the chance of any two people being born in
the same week is higher (1/52 instead of 1/365), the larger number of
coincidences required means that the final probability is much lower
(less than 1 in a million). It would require 160 people to reach a 50/50
chance that 8 of them were born in the same week.

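As a quick sanity check of those numbers, here is a Monte Carlo sketch;
the point is only that the "8 people in one week" probability is tiny,
not its exact value:

```cpp
#include <array>
#include <cstdio>
#include <random>

// Estimate the probability that, among 23 people with uniformly random
// birth weeks (52 weeks), some week contains 8 or more of them.
int main() {
  std::mt19937_64 rng{12345};
  std::uniform_int_distribution<int> week{0, 51};
  const long trials = 20'000'000;
  long overflows = 0;
  for (long t = 0; t < trials; ++t) {
    std::array<int, 52> perWeek{};
    bool overflow = false;
    for (int person = 0; person < 23; ++person) {
      if (++perWeek[week(rng)] >= 8) {
        overflow = true;
        break;
      }
    }
    overflows += overflow ? 1 : 0;
  }
  std::printf("estimated probability: %.2e\n",
              static_cast<double>(overflows) / trials);
  return 0;
}
```
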
## WHY PROBING?

Chaining to a new chunk on collision is not very memory efficient,
because the new chunk is almost certain to be under-filled. We tried
chaining to individual entries, but that bloated the lookup code and
can't match the performance of a probing strategy.

At our max load factor of 12/14, the expected probe length when searching
for an existing key (find hit) is 1.04, and fewer than 1% of keys are
not found in one of the first 3 chunks. When searching for a key that is
not in the map (find miss) the expected probe length at max load factor
is 1.275 and the P99 probe length is 4.

## CHUNK OVERFLOW COUNTS: REFERENCE-COUNTED TOMBSTONES

Hash tables with a complex probing strategy (quadratic or double-hashing)
typically use a tombstone on erase, because it is very difficult to
find the keys that might have been displaced by a full bucket (i.e.,
chunk in F14). If the probing strategy allows only a small number of
potential destinations for a displaced key (linear probing, Robin Hood
hashing, or Cuckoo hashing), it is also an option to find a displaced key,
relocate it, and then recursively repair the new hole.

Tombstones must be eventually reclaimed to deal with workloads that
continuously insert and erase. google::dense_hash_map eventually triggers
a rehash in this case, for example. Unfortunately, to avoid quadratic
behavior this rehash may have to halve the max load factor of the table,
resulting in a huge decrease in memory efficiency.

Although most probing algorithms just keep probing until they find an
empty slot, probe lengths can be substantially reduced if you track
whether a bucket has actually rejected a key. This "overflow bit"
is set when an attempt is made to place a key into the bucket but the
bucket was full. (An especially unlucky key might have to try several
buckets, setting the overflow bit in each.) Amble and Knuth describe an
overflow bit in the "Further development" section of "Ordered hash tables"
(https://academic.oup.com/comjnl/article/17/2/135/525363).

The overflow bit subsumes the role of a tombstone, since a tombstone's
only effect is to cause a probe search to continue. Unlike a tombstone,
however, the overflow bit is a property of the keys that were displaced
rather than the key that was erased. It's only a small step to turn
this into a counter that records the number of displaced keys, and that
can be decremented on erase. Overflow counts give us both an earlier
exit from probing and the effect of a reference-counted tombstone.
They automatically clean themselves up in a steady-state insert and
erase workload, giving us the upsides of double-hashing without the
normal downsides of tombstones.

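A toy sketch of the bookkeeping described above (this is not folly's
actual layout or code; the struct, field names, and probe loop are
illustrative only):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy chunk metadata for a chunked probing table.
struct Chunk {
  uint8_t size = 0;                    // occupied slots, up to kCapacity
  uint16_t outboundOverflowCount = 0;  // keys displaced past this chunk
  static constexpr uint8_t kCapacity = 14;
};

// Insert-side bookkeeping: every full chunk we pass increments its count,
// acting like a reference-counted tombstone.
std::size_t probeForInsert(std::vector<Chunk>& chunks,
                           std::size_t firstChoice, std::size_t stride) {
  std::size_t i = firstChoice;
  while (chunks[i].size == Chunk::kCapacity) {
    ++chunks[i].outboundOverflowCount;
    i = (i + stride) % chunks.size();  // double-hashed probe step
  }
  return i;
}

// Miss-side early exit: stop probing as soon as a chunk has never
// displaced a key.
bool mayBeInLaterChunk(const Chunk& c) {
  return c.outboundOverflowCount != 0;
}

// Erase-side cleanup: decrement the counts along the key's probe path,
// so steady-state insert+erase workloads don't accumulate tombstones.
void unwindOverflow(std::vector<Chunk>& chunks, std::size_t firstChoice,
                    std::size_t stride, std::size_t finalChunk) {
  for (std::size_t i = firstChoice; i != finalChunk;
       i = (i + stride) % chunks.size()) {
    --chunks[i].outboundOverflowCount;
  }
}
```
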
## HOW DOES VECTOR FILTERING WORK?

F14 computes a secondary hash value for each key, which we call the key's
tag. Tags are 1 byte: 7 bits of entropy with the top bit set. The 14
tags are joined with 2 additional bytes of metadata to form a 16-byte
aligned __m128i at the beginning of the chunk. When we're looking for a
key we can compare the needle's tag to all 14 tags in a chunk in parallel.
The result of the comparison is a bitmask that identifies only slots in
a chunk that might have a non-empty matching key. Failing searches are
unlikely to perform any key comparisons, successful searches are likely
to perform exactly 1 comparison, and all of the resulting branches are
pretty predictable.

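On x86_64 the filtering step looks roughly like this sketch (the helper
name and the exact metadata layout are illustrative, not folly's real
code):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Return a bitmask of slots in a chunk whose tag byte equals the needle's
// tag. 'tags' points at the 16-byte aligned metadata at the start of the
// chunk; here we assume the 14 tag bytes occupy the low lanes.
inline unsigned candidateSlots(const uint8_t* tags, uint8_t needleTag) {
  __m128i chunkTags = _mm_load_si128(reinterpret_cast<const __m128i*>(tags));
  __m128i needle = _mm_set1_epi8(static_cast<char>(needleTag));
  __m128i eq = _mm_cmpeq_epi8(chunkTags, needle);   // byte-wise compare
  unsigned mask = static_cast<unsigned>(_mm_movemask_epi8(eq));
  return mask & 0x3FFF;                             // keep only the 14 tag lanes
}
```

Each set bit in the returned mask marks a slot whose stored key is worth
comparing against the needle; empty slots have a zero tag byte and can
never match, because real tags always have the top bit set.
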
The vector search is coded using SIMD intrinsics, SSE2 on x86_64 and
NEON on aarch64. These instructions are a non-optional part of those
platforms (unlike later SIMD instruction sets like AVX2 or SVE), so no
special compilation flags are required. The exact vector operations
performed differ between x86_64 and aarch64 because aarch64 lacks a
movemask instruction, but the F14 algorithm is the same.

## WHAT ABOUT MEMORY OVERHEAD FOR SMALL TABLES?

The F14 algorithm works well for large tables, because the tags can
fit in cache even when the keys and values can't. Tiny hash tables are
by far the most numerous, however, so it's important that we minimize
the footprint when the table is empty or has only 1 or 2 elements.
Conveniently, tags cause keys to be densely packed into the bottom of
a chunk and filter all memory accesses to the portions of a chunk that
are not used. That means that we can also support capacities that are
a fraction of 1 chunk with no change to any of the search and insertion
algorithms. The only change required is in the check to see if a rehash
is required. F14's first three capacities all use one chunk and one
16-byte metadata vector, but allocate space for 2, 6, and then 12 keys.

## IS F14NODEMAP FULLY STANDARDS-COMPLIANT?

No. F14 does provide full support for stateful allocators, fancy
pointers, and as many parts of the C++ standard for unordered associative
containers as it can, but it is not fully standards-compliant.

We don't know of a way to efficiently implement the full bucket API
in a table that uses double-hashed probing, in particular size_type
bucket(key_type const&). This function must compute the bucket index
for any key, even before it is inserted into the table. That means
that a local_iterator range can't partition the key space by the chunk
that terminated probing during insert; the only partition choice with
reasonable locality would be the first-choice chunk. The probe sequence
for a key in double-hashing depends on the key, not the first-choice
chunk, however, so it is infeasible to search for all of the displaced
keys given only their first-choice location. We're unwilling to use an
inferior probing strategy or dedicate space to the required metadata just
to support the full bucket API. Implementing the rest of the bucket API,
such as local_iterator begin(size_type), would not be difficult.

F14 does not allow max_load_factor to be adjusted. Probing tables
can't support load factors greater than 1, so the standards-required
ability to temporarily disable rehashing by temporarily setting a very
high max load factor just isn't possible. We have also measured that
there is no performance advantage to forcing a low load factor, so it's
better just to omit the field and save space in every F14 instance.
This is part of the way we get empty maps down to 32 bytes. The void
max_load_factor(float) method is still present, but does nothing. We use
the default max_load_factor of 1.0f all of the time, adjusting the value
returned from size_type bucket_count() so that the externally-visible
load factor reaches 1 just as the actual internal load factor reaches
our threshold of 12/14.

The standard requires that a hash table be iterable in O(size()) time
regardless of its load factor (rather than O(bucket_count())). That means
if you insert 1 million keys then erase all but 10, iteration should
be O(10). For std::unordered_map the cost of supporting this scenario
is an extra level of indirection in every read and every write, which is
part of why we can improve substantially on its performance. Low load
factor iteration occurs in practice when erasing keys during iteration
(for example by repeatedly calling map.erase(map.begin())), so we provide
the weaker guarantee that iteration is O(size()) after erasing any prefix
of the iteration order. F14VectorMap doesn't have this problem.

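For example, the erase-a-prefix pattern that this guarantee is meant to
keep efficient (sketch; `shouldDrop` is a hypothetical predicate):

```cpp
#include <folly/container/F14Map.h>
#include <string>

bool shouldDrop(const std::string& key);  // hypothetical predicate

void dropPrefix(folly::F14FastMap<std::string, int>& map) {
  // Repeatedly erasing the front of the iteration order; per the guarantee
  // above, iteration stays O(size()) even as the load factor falls.
  while (!map.empty() && shouldDrop(map.begin()->first)) {
    map.erase(map.begin());
  }
}
```
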
The standard requires that clear() be O(size()), which has the practical
effect of prohibiting a change to bucket_count. F14 deallocates
all memory during clear() if it has space for more than 100 keys, to
avoid leaving a large table that will be expensive to iterate (see the
previous paragraph). google::dense_hash_map works around this tradeoff
by providing both clear() and clear_no_resize(); we could do something
similar.

As stated above, F14NodeMap and F14NodeSet are the only F14 variants
that provide reference stability. When running under ASAN the other
storage policies will probabilistically perform extra rehashes, which
makes it likely that reference stability problems will be found by the
address sanitizer.

An additional subtlety for hash tables that don't provide reference
stability is whether they rehash before evaluating the arguments passed
to insert(). F14 tables may rehash before evaluating the arguments
to a method that causes an insertion, so it's not safe to write
something like `map.emplace(k2, map[k1])` with F14FastMap, F14ValueMap,
or F14VectorMap. This behavior matches google::dense_hash_map and the
excellent absl::flat_hash_map.

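A sketch of the hazard and a safe rewrite (key names are placeholders):

```cpp
#include <folly/container/F14Map.h>
#include <string>

void copyValue(folly::F14FastMap<std::string, int>& map,
               const std::string& k1, const std::string& k2) {
  // Risky with the non-node F14 maps: the table may rehash to make room
  // for k2 before the mapped value is read through the reference produced
  // by map[k1], and that rehash moves elements.
  //   map.emplace(k2, map[k1]);

  // Safe: materialize the source value first, then insert.
  int v = map[k1];
  map.emplace(k2, v);
}
```
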
F14NodeMap does not currently support the C++17 node API, but it could
be trivially added.

* Nathan Bronson -- <ngbronson@fb.com>
* Xiao Shi -- <xshi@fb.com>