1 # Benchmarking `llvm-libc`'s memory functions
## Foreword

Microbenchmarks are valuable tools to assess and compare the performance of
isolated pieces of code. However, they don't capture all the interactions of
complex systems, so other metrics can be equally important:
9 - **code size** (to reduce instruction cache pressure),
10 - **Profile Guided Optimization** friendliness,
11 - **hyperthreading / multithreading** friendliness.
15 The goal here is to satisfy the [Benchmarking
16 Principles](https://en.wikipedia.org/wiki/Benchmark_(computing)#Benchmarking_Principles).
18 1. **Relevance**: Benchmarks should measure relatively vital features.
19 2. **Representativeness**: Benchmark performance metrics should be broadly
20 accepted by industry and academia.
21 3. **Equity**: All systems should be fairly compared.
22 4. **Repeatability**: Benchmark results can be verified.
23 5. **Cost-effectiveness**: Benchmark tests are economical.
6. **Scalability**: Benchmark tests should measure from single server to
multiple servers.
26 7. **Transparency**: Benchmark metrics should be easy to understand.
28 Benchmarking is a [subtle
29 art](https://en.wikipedia.org/wiki/Benchmark_(computing)#Challenges) and
30 benchmarking memory functions is no exception. Here we'll dive into
peculiarities of designing good microbenchmarks for `llvm-libc` memory
functions.
36 As seen in the [README.md](README.md#stochastic-mode) the microbenchmarking
37 facility should focus on measuring **low latency code**. If copying a few bytes
takes on the order of a few cycles, the benchmark should be able to **measure
39 accurately down to the cycle**.
41 ### Measuring instruments
There are different sources of time in a computer (ordered from high to low
resolution):

- [Performance
Counters](https://en.wikipedia.org/wiki/Hardware_performance_counter): used to
46 introspect the internals of the CPU,
47 - [High Precision Event
48 Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer): used to
trigger short-lived actions,
50 - [Real-Time Clocks (RTC)](https://en.wikipedia.org/wiki/Real-time_clock): used
51 to keep track of the computer's time.
53 In theory **Performance Counters** provide cycle accurate measurement via the
`cpu cycles` event. But as we'll see, they are not really practical in this
context.
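For reference, on Linux the `cpu cycles` event is read through the
[`perf_event_open`](https://man7.org/linux/man-pages/man2/perf_event_open.2.html)
system call. Below is a minimal, Linux-specific sketch (error handling elided),
not the code used by the benchmarking harness:

```cpp
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#include <cstdint>
#include <cstdio>
#include <cstring>

// Opens a hardware `cpu cycles` counter for the calling thread. There is no
// glibc wrapper for perf_event_open; it goes through syscall(2).
static int OpenCycleCounter() {
  perf_event_attr Attr;
  std::memset(&Attr, 0, sizeof(Attr));
  Attr.type = PERF_TYPE_HARDWARE;
  Attr.size = sizeof(Attr);
  Attr.config = PERF_COUNT_HW_CPU_CYCLES;
  Attr.disabled = 1;       // Start disabled; enabled explicitly below.
  Attr.exclude_kernel = 1; // Count user-space cycles only.
  return syscall(SYS_perf_event_open, &Attr, /*pid=*/0, /*cpu=*/-1,
                 /*group_fd=*/-1, /*flags=*/0);
}

int main() {
  const int Fd = OpenCycleCounter();
  if (Fd < 0)
    return 1; // Counter unavailable (permissions, VM, unsupported CPU).
  ioctl(Fd, PERF_EVENT_IOC_RESET, 0);
  ioctl(Fd, PERF_EVENT_IOC_ENABLE, 0);
  // ... code under measurement ...
  ioctl(Fd, PERF_EVENT_IOC_DISABLE, 0);
  uint64_t Cycles = 0;
  read(Fd, &Cycles, sizeof(Cycles)); // Reading is itself a system call.
  std::printf("cycles: %llu\n", static_cast<unsigned long long>(Cycles));
  close(Fd);
  return 0;
}
```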
57 ### Performance counters and modern processor architecture
Modern CPUs are [out of
order](https://en.wikipedia.org/wiki/Out-of-order_execution) and
[superscalar](https://en.wikipedia.org/wiki/Superscalar_processor); as a
consequence, it is [hard to know what is included when the counter is
read](https://en.wikipedia.org/wiki/Hardware_performance_counter#Instruction_based_sampling):
some instructions may still be **in flight**, others may be executing
[**speculatively**](https://en.wikipedia.org/wiki/Speculative_execution). As a
matter of fact, **on the same machine, measuring the same piece of code twice
will yield different results.**
### Performance counter semantics and availability
Although they have the same name, the exact semantics of performance counters
are micro-architecture dependent: **it is generally not possible to compare
measurements across micro-architectures, even when they expose identically
named counters.**
75 Each vendor decides which performance counters to implement and their exact
meaning. Although we want to benchmark `llvm-libc` memory functions for all
[target
triples](https://clang.llvm.org/docs/CrossCompilation.html#target-triple), there
79 are **no guarantees that the counter we're interested in is available.**
81 ### Additional imprecisions
83 - Reading performance counters is done through Kernel [System
84 calls](https://en.wikipedia.org/wiki/System_call). The System call itself
is costly (hundreds of cycles) and will perturb the counter's value.
86 - [Interruptions](https://en.wikipedia.org/wiki/Interrupt#Processor_response)
87 can occur during measurement.
- If the system is already under monitoring (virtual machines or system-wide
profiling) the kernel can decide to multiplex the performance counters,
leading to lower precision or even completely missed measurements.
91 - The Kernel can decide to [migrate the
process](https://en.wikipedia.org/wiki/Process_migration) to a different
core.
- [Frequency
scaling](https://en.wikipedia.org/wiki/Dynamic_frequency_scaling) can kick
96 in during the measurement and change the ticking duration. **Ultimately we
care about the amount of work over a period of time**. This weakens the case
for measuring cycles rather than **raw time**.
100 ### Cycle accuracy conclusion
102 We have seen that performance counters are: not widely available, semantically
inconsistent across micro-architectures and imprecise on modern CPUs for small
snippets of code.
In order to achieve the needed precision we would need to resort to more widely
109 available counters and derive the time from a high number of runs: going from a
110 single deterministic measure to a probabilistic one.
112 **To get a good signal to noise ratio we need the running time of the piece of
113 code to be orders of magnitude greater than the measurement precision.**
For instance, if the measurement precision is 10 cycles, the function under
test must run for more than 1000 cycles to achieve a 1%
117 [SNR](https://en.wikipedia.org/wiki/Signal-to-noise_ratio).
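Concretely, this means reading a coarse but widely available timer only twice,
around a whole batch of calls, so that its imprecision is amortized. A minimal
sketch (the `MeasureBatchNs` helper is illustrative, not the actual `llvm-libc`
API):

```cpp
#include <chrono>
#include <cstddef>

// Runs F back-to-back Iterations times and returns the mean time per call in
// nanoseconds. The timer is only read twice, so its imprecision and overhead
// are amortized over the whole batch.
template <typename Function>
double MeasureBatchNs(Function &&F, std::size_t Iterations) {
  const auto Start = std::chrono::steady_clock::now();
  for (std::size_t I = 0; I < Iterations; ++I)
    F(); // Code under measurement.
  const auto Stop = std::chrono::steady_clock::now();
  const std::chrono::duration<double, std::nano> Elapsed = Stop - Start;
  return Elapsed.count() / Iterations;
}
```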
119 ### Repeating code N-times until precision is sufficient
121 The algorithm is as follows:
- We measure the time it takes to run the code _N_ times (initially _N_ is 10,
for instance).
- We deduce an approximation of the runtime of one iteration (= _runtime_ /
_N_).
127 - We increase _N_ by _X%_ and repeat the measurement (geometric progression).
128 - We keep track of the _one iteration runtime approximation_ and build a
weighted mean of all the samples so far (the weight is proportional to _N_).
130 - We stop the process when the difference between the weighted mean and the
131 last estimation is smaller than _ε_ or when other stopping conditions are
132 met (total runtime, maximum iterations or maximum sample count).
134 This method allows us to be as precise as needed provided that the measured
135 runtime is proportional to _N_. Longer run times also smooth out imprecision
136 related to _interrupts_ and _context switches_.
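Putting it together, here is a simplified sketch of the loop above, reusing the
hypothetical `MeasureBatchNs` helper from the previous section (the real
stopping conditions are richer):

```cpp
#include <cmath>
#include <cstddef>

// Grows the batch size geometrically while maintaining an N-weighted mean of
// the per-iteration estimates, and stops once the latest estimate agrees with
// the weighted mean to within Epsilon (or the iteration budget runs out).
template <typename Function>
double EstimatePerCallNs(Function &&F, double Epsilon = 0.01,
                         double MaxTotalIterations = 1e9) {
  std::size_t N = 10;       // Initial number of repetitions.
  double WeightedSum = 0.0; // Sum of Estimate * N over all samples.
  double TotalWeight = 0.0; // Sum of N over all samples.
  for (;;) {
    const double Estimate = MeasureBatchNs(F, N); // = runtime / N.
    WeightedSum += Estimate * N;
    TotalWeight += N;
    const double Mean = WeightedSum / TotalWeight;
    if (std::fabs(Estimate - Mean) <= Epsilon * Mean)
      return Mean; // Converged to within Epsilon.
    if (TotalWeight >= MaxTotalIterations)
      return Mean; // Stopping condition: iteration budget exhausted.
    N += N / 5 + 1; // Increase N by ~20% (geometric progression).
  }
}
```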
138 Note: When measuring longer runtimes (e.g. copying several megabytes of data)
139 the above assumption doesn't hold anymore and the _ε_ precision cannot be
140 reached by increasing iterations. The whole benchmarking process becomes
141 prohibitively slow. In this case the algorithm is limited to a single sample and
142 repeated several times to get a decent 95% confidence interval.
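For completeness, here is a sketch of one common way to derive such an interval
from repeated samples, assuming the sample means are roughly normally
distributed and that there are at least two samples:

```cpp
#include <cmath>
#include <vector>

struct Interval {
  double Lower, Upper;
};

// Normal-approximation 95% confidence interval for the mean of the samples
// (1.96 is the two-sided z-value for 95%).
Interval ConfidenceInterval95(const std::vector<double> &Samples) {
  const double N = static_cast<double>(Samples.size());
  double Sum = 0.0;
  for (const double S : Samples)
    Sum += S;
  const double Mean = Sum / N;
  double SumSquaredDiff = 0.0;
  for (const double S : Samples)
    SumSquaredDiff += (S - Mean) * (S - Mean);
  const double StdDev = std::sqrt(SumSquaredDiff / (N - 1)); // Sample stddev.
  const double Margin = 1.96 * StdDev / std::sqrt(N);
  return {Mean - Margin, Mean + Margin};
}
```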
144 ### Effect of branch prediction
146 When measuring code with branches, repeating the same call again and again will
147 allow the processor to learn the branching patterns and perfectly predict all
148 the branches, leading to unrealistic results.
150 **Decision: When benchmarking small buffer sizes, the function parameters should
151 be randomized between calls to prevent perfect branch predictions.**
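One way to implement this decision is to draw all the sizes up front with a
fixed seed (for **Repeatability**) so the measured loop only consumes them. A
sketch, using `memcpy` as an example (names are illustrative):

```cpp
#include <cstddef>
#include <cstring>
#include <random>
#include <vector>

// Pre-draws call sizes so consecutive calls follow different code paths and
// the branch predictor cannot lock onto a single pattern. A fixed seed keeps
// runs repeatable.
std::vector<std::size_t> DrawSizes(std::size_t Count, std::size_t MaxSize) {
  std::mt19937 Rng(/*seed=*/0);
  std::uniform_int_distribution<std::size_t> Dist(0, MaxSize);
  std::vector<std::size_t> Sizes(Count);
  for (auto &Size : Sizes)
    Size = Dist(Rng);
  return Sizes;
}

// The benchmarked loop only consumes pre-drawn sizes, so the random number
// generation stays out of the measured region.
void BenchmarkLoop(char *Dst, const char *Src,
                   const std::vector<std::size_t> &Sizes) {
  for (const std::size_t Size : Sizes)
    std::memcpy(Dst, Src, Size);
}
```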
153 ### Effect of the memory subsystem
155 The CPU is tightly coupled to the memory subsystem. It is common to see `L1`,
156 `L2` and `L3` data caches.
158 We may be tempted to randomize data accesses widely to exercise all the caching
159 layers down to RAM but the [cost of accessing lower layers of
160 memory](https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html)
161 completely dominates the runtime for small sizes.
So to respect the **Equity** and **Repeatability** principles we should make
sure we
164 **do not** depend on the memory subsystem.
166 **Decision: When benchmarking small buffer sizes, the data accessed by the
167 function should stay in `L1`.**
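A simple way to honor this decision is to confine all accesses to a buffer
smaller than the `L1` data cache and warm it up before measuring. A sketch (the
cache size is hardcoded here for brevity; a real harness would query the host):

```cpp
#include <cstddef>
#include <vector>

// Typical L1 data cache size; in practice this should be queried from the
// host (e.g. via sysfs on Linux) rather than hardcoded.
constexpr std::size_t kL1DataCacheSize = 32 * 1024;

// A working set no larger than half of L1 leaves room for both source and
// destination buffers; after the warm-up pass every access hits L1 and the
// deeper memory subsystem stays out of the measurement.
std::vector<char> MakeL1ResidentBuffer() {
  std::vector<char> Buffer(kL1DataCacheSize / 2);
  for (char &C : Buffer)
    C = 1; // Warm-up: fault the pages in and pull the lines into L1.
  return Buffer;
}
```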
169 ### Effect of prefetching
For small buffer sizes,
[prefetching](https://en.wikipedia.org/wiki/Cache_prefetching) should not kick
in, but for large buffers it may introduce a bias.
175 **Decision: When benchmarking large buffer sizes, the data should be accessed in
176 a random fashion to lower the impact of prefetching between calls.**
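A sketch of what such an access pattern could look like, again using `memcpy`
as an example (names and sizes are illustrative):

```cpp
#include <cstddef>
#include <cstring>
#include <random>

// Copies BlockSize bytes between random, non-overlapping offsets inside a
// large arena (ArenaSize must be at least 2 * BlockSize). Consecutive calls
// touch unrelated addresses, so the hardware prefetcher gets little
// opportunity to help one call based on the previous one.
void RandomizedLargeCopies(char *Arena, std::size_t ArenaSize,
                           std::size_t BlockSize, std::size_t Calls) {
  std::mt19937 Rng(/*seed=*/0); // Fixed seed keeps runs repeatable.
  std::uniform_int_distribution<std::size_t> Dist(0, ArenaSize - 2 * BlockSize);
  for (std::size_t I = 0; I < Calls; ++I) {
    const std::size_t Offset = Dist(Rng);
    std::memcpy(Arena + Offset, Arena + Offset + BlockSize, BlockSize);
  }
}
```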
178 ### Effect of dynamic frequency scaling
180 Modern processors implement [dynamic frequency
181 scaling](https://en.wikipedia.org/wiki/Dynamic_frequency_scaling). In so-called
182 `performance` mode the CPU will increase its frequency and run faster than usual
within [some limits](https://en.wikipedia.org/wiki/Intel_Turbo_Boost): _"The
184 increased clock rate is limited by the processor's power, current, and thermal
limits, the number of cores currently in use, and the maximum frequency of the
active cores."_
188 **Decision: When benchmarking we want to make sure the dynamic frequency scaling
is always set to `performance`. We also want to make sure that the time-based
190 events are not impacted by frequency scaling.**
192 See [README.md](README.md) on how to set this up.
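On Linux, a benchmark can at least assert the policy before running by reading
the standard `cpufreq` `sysfs` interface; a minimal sketch:

```cpp
#include <fstream>
#include <string>

// Returns true if cpu0's cpufreq governor is `performance`. A thorough check
// would inspect every CPU the benchmark may be scheduled on.
bool IsPerformanceGovernor() {
  std::ifstream File("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor");
  std::string Governor;
  File >> Governor;
  return Governor == "performance";
}
```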
194 ### Reserved and pinned cores
196 Some operating systems allow [core
197 reservation](https://stackoverflow.com/questions/13583146/whole-one-core-dedicated-to-single-process).
This removes a set of perturbation sources such as process migration, context
switches and interrupts. When a core is hyperthreaded, both logical cores
should be reserved.
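Short of full core reservation, pinning the benchmark thread already prevents
migrations. On Linux/glibc this is a single `sched_setaffinity` call (a
sketch; the `CPU_*` macros are glibc-specific):

```cpp
#include <sched.h>

// Pins the calling thread to CpuIndex so the kernel will not migrate it to
// another core during the measurement.
bool PinToCpu(int CpuIndex) {
  cpu_set_t Set;
  CPU_ZERO(&Set);
  CPU_SET(CpuIndex, &Set);
  return sched_setaffinity(/*pid=*/0, sizeof(Set), &Set) == 0;
}
```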
## Microbenchmark limitations
As stated in the Foreword section, a number of effects play a role in
production but are not directly measurable through microbenchmarks. The code
206 size of the benchmark is (much) smaller than the hot code of real applications
207 and **doesn't exhibit instruction cache pressure as much**.
### Instruction cache pressure

Fundamental functions that are called frequently will occupy the L1 iCache
212 ([illustration](https://en.wikipedia.org/wiki/CPU_cache#Example:_the_K8)). If
they are too big they will prevent other hot code from staying in the cache and
incur
214 [stalls](https://en.wikipedia.org/wiki/CPU_cache#CPU_stalls). So the memory
215 functions should be as small as possible.
### iTLB pressure

The same reasoning goes for the instruction Translation Lookaside Buffer
([iTLB](https://en.wikipedia.org/wiki/Translation_lookaside_buffer)), incurring
costly [TLB
misses](https://en.wikipedia.org/wiki/Translation_lookaside_buffer#TLB-miss_handling).

## FAQ
226 1. Why don't you use Google Benchmark directly?
We reuse some parts of Google Benchmark (detection of frequency scaling, CPU
cache hierarchy information) but when it comes to measuring memory functions
Google Benchmark has a few issues:
- Google Benchmark favors code-based configuration via macros and
builders, typically done in a static manner. In our case the
parameters we need to set up are a mix of what's usually controlled by
235 the framework (number of trials, maximum number of iterations, size
236 ranges) and parameters that are more tied to the function under test
237 (randomization strategies, custom values). Achieving this with Google
238 Benchmark is cumbersome as it involves templated benchmarks and
duplicated code. In the end, the configuration would be spread across
command line flags (via the framework's options or custom flags) and code.
- Output of the measurements is done through a `BenchmarkReporter` class,
which makes it hard to access the parameters discussed above.