`folly/Benchmark.h`
-----------------

`folly/Benchmark.h` provides a simple framework for writing and
executing benchmarks. Currently the framework targets only
single-threaded testing (though you can internally use fork-join
parallelism and measure total run time).

To use this library, you need to be using gcc 4.6 or later. Include
`folly/Benchmark.h` and make sure `folly/benchmark.cpp` is part of the
build (either directly or packaged with a library).

### Overview
***

Using `folly/Benchmark.h` is very simple. Here's an example:

``` Cpp
#include <folly/Benchmark.h>
#include <folly/container/Foreach.h>
#include <vector>

using namespace std;
using namespace folly;

BENCHMARK(insertFrontVector) {
  // Let's insert 100 elements at the front of a vector
  vector<int> v;
  FOR_EACH_RANGE (i, 0, 100) {
    v.insert(v.begin(), i);
  }
}

BENCHMARK(insertBackVector) {
  // Let's insert 100 elements at the back of a vector
  vector<int> v;
  FOR_EACH_RANGE (i, 0, 100) {
    v.insert(v.end(), i);
  }
}

int main() {
  runBenchmarks();
}
```

Compiling and running this code produces the following output:

```
===============================================================================
test.cpp                                        relative  ns/iter    iters/s
===============================================================================
insertFrontVector                                           3.84K    260.38K
insertBackVector                                            1.61K    622.75K
===============================================================================
```

Let's worry about the empty column "relative" later. The table
contains, for each benchmark, the time spent per call and, conversely,
the number of calls per second. Numbers are represented in metric
notation (K for thousands, M for millions, etc.). As expected, in this
example the second function is much faster (fewer ns/iter and more
iters/s).

The macro `BENCHMARK` introduces a function and also adds it to an
internal array containing all benchmarks in the system. The defined
function takes no arguments and returns `void`.

The framework calls the function many times to collect statistics
about it. Sometimes the function itself would want to do that
iteration---for example, how about inserting `n` elements instead of
100 elements? To do the iteration internally, use `BENCHMARK` with two
parameters. The second parameter is the number of iterations and is
passed by the framework down to the function. The type of the count is
implicitly `unsigned`. Consider a slightly reworked example:

``` Cpp
#include <folly/Benchmark.h>
#include <folly/container/Foreach.h>
#include <vector>

using namespace std;
using namespace folly;

BENCHMARK(insertFrontVector, n) {
  vector<int> v;
  FOR_EACH_RANGE (i, 0, n) {
    v.insert(v.begin(), i);
  }
}

BENCHMARK(insertBackVector, n) {
  vector<int> v;
  FOR_EACH_RANGE (i, 0, n) {
    v.insert(v.end(), i);
  }
}

int main() {
  runBenchmarks();
}
```

The produced numbers are substantially different:

```
===============================================================================
Benchmark                                       relative  ns/iter    iters/s
===============================================================================
insertFrontVector                                           39.92     25.05M
insertBackVector                                             3.46    288.89M
===============================================================================
```

Now the numbers indicate the speed of a single insertion, because the
framework assumed the user-defined function used internal iteration
(which it does). So inserting at the back of a vector is more than 10
times faster than inserting at the front! Speaking of comparisons...

### Baselines
***

Choosing one or more good baselines is a crucial activity in any
measurement. Without a baseline there is little information to derive
from the sheer numbers. If, for example, you experiment with
algorithms, a good baseline is often an established approach (e.g. the
built-in `std::sort` for sorting). Essentially all experimental
numbers should be compared against some baseline.

To support baseline-driven measurements, `folly/Benchmark.h` defines
`BENCHMARK_RELATIVE`, which works much like `BENCHMARK`, except it
considers the most recent lexically-occurring `BENCHMARK` a baseline,
and fills the "relative" column. Say, for example, we want to use
front insertion for a vector as a baseline and see how back insertion
compares with it:

``` Cpp
#include <folly/Benchmark.h>
#include <folly/container/Foreach.h>
#include <vector>

using namespace std;
using namespace folly;

BENCHMARK(insertFrontVector, n) {
  vector<int> v;
  FOR_EACH_RANGE (i, 0, n) {
    v.insert(v.begin(), i);
  }
}

BENCHMARK_RELATIVE(insertBackVector, n) {
  vector<int> v;
  FOR_EACH_RANGE (i, 0, n) {
    v.insert(v.end(), i);
  }
}

int main() {
  runBenchmarks();
}
```

This program prints something like:

```
===============================================================================
Benchmark                                       relative  ns/iter    iters/s
===============================================================================
insertFrontVector                                           42.65     23.45M
insertBackVector                               1208.24%     3.53    283.30M
===============================================================================
```

showing the 1208.24% relative speed advantage of inserting at the back
compared to the front. The scale is chosen such that 100% means
identical speed, numbers smaller than 100% indicate the benchmark is
slower than the baseline, and numbers greater than 100% indicate the
benchmark is faster. For example, if you see 42%, the speed of the
benchmark is 0.42 times the baseline speed. If you see 123%, the
benchmark is 23% faster, i.e. 1.23 times as fast.

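Since both columns derive from the same measurement, you can recompute the relative figure by hand from the ns/iter values. A minimal sketch (the helper name is purely illustrative, not part of the library):

``` Cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: relative speed (in percent) of a benchmark
// versus its baseline, computed from the ns/iter column. Fewer ns per
// iteration means a faster benchmark, so the baseline goes in the
// numerator.
double relativePercent(double baselineNsPerIter, double benchmarkNsPerIter) {
  return 100.0 * baselineNsPerIter / benchmarkNsPerIter;
}
```

With the table above, `relativePercent(42.65, 3.53)` comes out near 1208%, matching the printed column up to display rounding.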
To close the current benchmark group and start another, simply use
`BENCHMARK` again.

### Ars Gratia Artis
***

If you want to draw a horizontal line of dashes (e.g. at the end of a
group or for whatever reason), use `BENCHMARK_DRAW_LINE()`. The line
fulfills a purely aesthetic role; it doesn't interact with
measurements in any way.

``` Cpp
BENCHMARK(foo) {
  Foo foo;
  foo.doSomething();
}

BENCHMARK_DRAW_LINE();

BENCHMARK(bar) {
  Bar bar;
  bar.doSomething();
}
```

### Suspending a benchmark
***

Sometimes benchmarking code must do some preparation work that is
physically inside the benchmark function, but should not count toward
its time budget. To temporarily suspend the benchmark, use the
pseudo-statement `BENCHMARK_SUSPEND` as follows:

``` Cpp
BENCHMARK(insertBackVector, n) {
  vector<int> v;
  BENCHMARK_SUSPEND {
    v.reserve(n);
  }
  FOR_EACH_RANGE (i, 0, n) {
    v.insert(v.end(), i);
  }
}
```

The preallocation effected with `v.reserve(n)` will not count toward
the total run time of the benchmark.

Only the main thread should call `BENCHMARK_SUSPEND` (and of course it
should not call it while other threads are doing actual work). This is
because the timer is application-global.

If the scope introduced by `BENCHMARK_SUSPEND` is not desired, you may
want to use the `BenchmarkSuspender` type directly. Constructing such
an object suspends time measurement, and destroying it resumes the
measurement. If you want to resume time measurement before the
destructor runs, call `dismiss` on the `BenchmarkSuspender`
object. The previous example could have been written like this:

``` Cpp
BENCHMARK(insertBackVector, n) {
  BenchmarkSuspender braces;
  vector<int> v;
  v.reserve(n);
  braces.dismiss();
  FOR_EACH_RANGE (i, 0, n) {
    v.insert(v.end(), i);
  }
}
```

### `doNotOptimizeAway`
***

Finally, the small utility function `doNotOptimizeAway` prevents
compiler optimizations that may interfere with benchmarking. Call
`doNotOptimizeAway(var)` on variables that you use for benchmarking
but that are otherwise useless. The compiler tends to do a good job
at eliminating unused variables, and this function fools it into
thinking a variable is in fact needed. Example:

``` Cpp
BENCHMARK(fpOps, n) {
  double d = 1;
  FOR_EACH_RANGE (i, 1, n) {
    d += i;
    d -= i;
    d *= i;
    d /= i;
  }
  doNotOptimizeAway(d);
}
```

### A look under the hood
***

`folly/Benchmark.h` has a simple, systematic approach to collecting
timings.

First, it organizes measurements into several large epochs, and takes
the minimum over all epochs. Taking the minimum gives the result
closest to the real run time. Benchmark timings are not a regular
random variable that fluctuates around an average. Instead, the real
time we're looking for is one to which a variety of additive noise is
applied (i.e. there is no noise that could actually shorten the
benchmark time below its real value). In theory, the minimum over an
infinite number of samples converges to the actual time that needs
measuring. That's why the accuracy of benchmarking increases with the
number of epochs.

Clearly, in a real application there will also be noise and a variety
of effects caused by the running context. But the noise during the
benchmark (straight setup, simple looping) is a poor model for the
noise in the real application. So taking the minimum across several
epochs is the most informative result.

Inside each epoch, the function being measured is iterated an
increasing number of times until the total run time is large enough to
make noise negligible. At that point the time is collected, and the
time per iteration is computed. As mentioned, the minimum time per
iteration over all epochs is the final result.

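The strategy described above can be sketched as follows. This is a simplified illustration, not the framework's actual implementation: the real code also grows the iteration count adaptively within an epoch and corrects for timer overhead.

``` Cpp
#include <algorithm>
#include <chrono>
#include <limits>

// Simplified sketch of the epoch/minimum strategy: time `fn` over
// several epochs and keep the smallest per-iteration cost, since noise
// can only inflate a measurement, never shrink it.
template <class Fn>
double minNsPerIter(Fn fn, unsigned epochs, unsigned itersPerEpoch) {
  double best = std::numeric_limits<double>::max();
  for (unsigned e = 0; e < epochs; ++e) {
    auto start = std::chrono::steady_clock::now();
    for (unsigned i = 0; i < itersPerEpoch; ++i) {
      fn();
    }
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - start)
                  .count();
    best = std::min(best, static_cast<double>(ns) / itersPerEpoch);
  }
  return best;
}
```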
The timer function used is `clock_gettime` with the `CLOCK_REALTIME`
clock id. Note that you must use a recent Linux kernel (2.6.38 or
newer); otherwise the resolution of `CLOCK_REALTIME` is inadequate.
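
You can check the resolution your system reports with the companion call `clock_getres` (a quick standalone check, not part of the library):

``` Cpp
#include <ctime>

// Ask the kernel for the resolution of CLOCK_REALTIME; a modern Linux
// kernel typically reports 1 ns.
long realtimeResolutionNs() {
  timespec ts{};
  clock_getres(CLOCK_REALTIME, &ts);
  return ts.tv_sec * 1000000000L + ts.tv_nsec;
}
```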