Documentation/scheduler/sched-capacity.rst

   1 =========================
   2 Capacity Aware Scheduling
   3 =========================
   4
   5 1. CPU Capacity
   6 ===============
   7
   8 1.1 Introduction
   9 ----------------
  10
  11 Conventional, homogeneous SMP platforms are composed of purely identical
  12 CPUs. Heterogeneous platforms on the other hand are composed of CPUs with
  13 different performance characteristics - on such platforms, not all CPUs can be
  14 considered equal.
  15
  16 CPU capacity is a measure of the performance a CPU can reach, normalized against
  17 the most performant CPU in the system. Heterogeneous systems are also called
  18 asymmetric CPU capacity systems, as they contain CPUs of different capacities.
  19
  20 Disparity in maximum attainable performance (IOW in maximum CPU capacity) stems
  21 from two factors:
  22
  23 - not all CPUs may have the same microarchitecture (µarch).
  24 - with Dynamic Voltage and Frequency Scaling (DVFS), not all CPUs may be
  25   physically able to attain the higher Operating Performance Points (OPP).
  26
  27 Arm big.LITTLE systems are an example of both. The big CPUs are more
  28 performance-oriented than the LITTLE ones (more pipeline stages, bigger caches,
  29 smarter predictors, etc), and can usually reach higher OPPs than the LITTLE ones
  30 can.
  31
  32 CPU performance is usually expressed in Millions of Instructions Per Second
  33 (MIPS), which can also be expressed as a given amount of instructions attainable
  34 per Hz, leading to::
  35
  36   capacity(cpu) = work_per_hz(cpu) * max_freq(cpu)
  37
  38 1.2 Scheduler terms
  39 -------------------
  40
  41 Two different capacity values are used within the scheduler. A CPU's
  42 ``original capacity`` is its maximum attainable capacity, i.e. its maximum
  43 attainable performance level. This original capacity is returned by
  44 the function arch_scale_cpu_capacity(). A CPU's ``capacity`` is its ``original
  45 capacity`` to which some loss of available performance (e.g. time spent
  46 handling IRQs) is subtracted.
  47
  48 Note that a CPU's ``capacity`` is solely intended to be used by the CFS class,
  49 while ``original capacity`` is class-agnostic. The rest of this document will use
  50 the term ``capacity`` interchangeably with ``original capacity`` for the sake of
  51 brevity.
  52
  53 1.3 Platform examples
  54 ---------------------
  55
  56 1.3.1 Identical OPPs
  57 ~~~~~~~~~~~~~~~~~~~~
  58
  59 Consider an hypothetical dual-core asymmetric CPU capacity system where
  60
  61 - work_per_hz(CPU0) = W
  62 - work_per_hz(CPU1) = W/2
  63 - all CPUs are running at the same fixed frequency
  64
  65 By the above definition of capacity:
  66
  67 - capacity(CPU0) = C
  68 - capacity(CPU1) = C/2
  69
  70 To draw the parallel with Arm big.LITTLE, CPU0 would be a big while CPU1 would
  71 be a LITTLE.
  72
  73 With a workload that periodically does a fixed amount of work, you will get an
  74 execution trace like so::
  75
  76  CPU0 work ^
  77            |     ____                ____                ____
  78            |    |    |              |    |              |    |
  79            +----+----+----+----+----+----+----+----+----+----+-> time
  80
  81  CPU1 work ^
  82            |     _________           _________           ____
  83            |    |         |         |         |         |
  84            +----+----+----+----+----+----+----+----+----+----+-> time
  85
  86 CPU0 has the highest capacity in the system (C), and completes a fixed amount of
  87 work W in T units of time. On the other hand, CPU1 has half the capacity of
  88 CPU0, and thus only completes W/2 in T.
  89
  90 1.3.2 Different max OPPs
  91 ~~~~~~~~~~~~~~~~~~~~~~~~
  92
  93 Usually, CPUs of different capacity values also have different maximum
  94 OPPs. Consider the same CPUs as above (i.e. same work_per_hz()) with:
  95
  96 - max_freq(CPU0) = F
  97 - max_freq(CPU1) = 2/3 * F
  98
  99 This yields:
 100
 101 - capacity(CPU0) = C
 102 - capacity(CPU1) = C/3
 103
 104 Executing the same workload as described in 1.3.1, which each CPU running at its
 105 maximum frequency results in::
 106
 107  CPU0 work ^
 108            |     ____                ____                ____
 109            |    |    |              |    |              |    |
 110            +----+----+----+----+----+----+----+----+----+----+-> time
 111
 112                             workload on CPU1
 113  CPU1 work ^
 114            |     ______________      ______________      ____
 115            |    |              |    |              |    |
 116            +----+----+----+----+----+----+----+----+----+----+-> time
 117
 118 1.4 Representation caveat
 119 -------------------------
 120
 121 It should be noted that having a *single* value to represent differences in CPU
 122 performance is somewhat of a contentious point. The relative performance
 123 difference between two different µarchs could be X% on integer operations, Y% on
 124 floating point operations, Z% on branches, and so on. Still, results using this
 125 simple approach have been satisfactory for now.
 126
 127 2. Task utilization
 128 ===================
 129
 130 2.1 Introduction
 131 ----------------
 132
 133 Capacity aware scheduling requires an expression of a task's requirements with
 134 regards to CPU capacity. Each scheduler class can express this differently, and
 135 while task utilization is specific to CFS, it is convenient to describe it here
 136 in order to introduce more generic concepts.
 137
 138 Task utilization is a percentage meant to represent the throughput requirements
 139 of a task. A simple approximation of it is the task's duty cycle, i.e.::
 140
 141   task_util(p) = duty_cycle(p)
 142
 143 On an SMP system with fixed frequencies, 100% utilization suggests the task is a
 144 busy loop. Conversely, 10% utilization hints it is a small periodic task that
 145 spends more time sleeping than executing. Variable CPU frequencies and
 146 asymmetric CPU capacities complexify this somewhat; the following sections will
 147 expand on these.
 148
 149 2.2 Frequency invariance
 150 ------------------------
 151
 152 One issue that needs to be taken into account is that a workload's duty cycle is
 153 directly impacted by the current OPP the CPU is running at. Consider running a
 154 periodic workload at a given frequency F::
 155
 156   CPU work ^
 157            |     ____                ____                ____
 158            |    |    |              |    |              |    |
 159            +----+----+----+----+----+----+----+----+----+----+-> time
 160
 161 This yields duty_cycle(p) == 25%.
 162
 163 Now, consider running the *same* workload at frequency F/2::
 164
 165   CPU work ^
 166            |     _________           _________           ____
 167            |    |         |         |         |         |
 168            +----+----+----+----+----+----+----+----+----+----+-> time
 169
 170 This yields duty_cycle(p) == 50%, despite the task having the exact same
 171 behaviour (i.e. executing the same amount of work) in both executions.
 172
 173 The task utilization signal can be made frequency invariant using the following
 174 formula::
 175
 176   task_util_freq_inv(p) = duty_cycle(p) * (curr_frequency(cpu) / max_frequency(cpu))
 177
 178 Applying this formula to the two examples above yields a frequency invariant
 179 task utilization of 25%.
 180
 181 2.3 CPU invariance
 182 ------------------
 183
 184 CPU capacity has a similar effect on task utilization in that running an
 185 identical workload on CPUs of different capacity values will yield different
 186 duty cycles.
 187
 188 Consider the system described in 1.3.2., i.e.::
 189
 190 - capacity(CPU0) = C
 191 - capacity(CPU1) = C/3
 192
 193 Executing a given periodic workload on each CPU at their maximum frequency would
 194 result in::
 195
 196  CPU0 work ^
 197            |     ____                ____                ____
 198            |    |    |              |    |              |    |
 199            +----+----+----+----+----+----+----+----+----+----+-> time
 200
 201  CPU1 work ^
 202            |     ______________      ______________      ____
 203            |    |              |    |              |    |
 204            +----+----+----+----+----+----+----+----+----+----+-> time
 205
 206 IOW,
 207
 208 - duty_cycle(p) == 25% if p runs on CPU0 at its maximum frequency
 209 - duty_cycle(p) == 75% if p runs on CPU1 at its maximum frequency
 210
 211 The task utilization signal can be made CPU invariant using the following
 212 formula::
 213
 214   task_util_cpu_inv(p) = duty_cycle(p) * (capacity(cpu) / max_capacity)
 215
 216 with ``max_capacity`` being the highest CPU capacity value in the
 217 system. Applying this formula to the above example above yields a CPU
 218 invariant task utilization of 25%.
 219
 220 2.4 Invariant task utilization
 221 ------------------------------
 222
 223 Both frequency and CPU invariance need to be applied to task utilization in
 224 order to obtain a truly invariant signal. The pseudo-formula for a task
 225 utilization that is both CPU and frequency invariant is thus, for a given
 226 task p::
 227
 228                                      curr_frequency(cpu)   capacity(cpu)
 229   task_util_inv(p) = duty_cycle(p) * ------------------- * -------------
 230                                      max_frequency(cpu)    max_capacity
 231
 232 In other words, invariant task utilization describes the behaviour of a task as
 233 if it were running on the highest-capacity CPU in the system, running at its
 234 maximum frequency.
 235
 236 Any mention of task utilization in the following sections will imply its
 237 invariant form.
 238
 239 2.5 Utilization estimation
 240 --------------------------
 241
 242 Without a crystal ball, task behaviour (and thus task utilization) cannot
 243 accurately be predicted the moment a task first becomes runnable. The CFS class
 244 maintains a handful of CPU and task signals based on the Per-Entity Load
 245 Tracking (PELT) mechanism, one of those yielding an *average* utilization (as
 246 opposed to instantaneous).
 247
 248 This means that while the capacity aware scheduling criteria will be written
 249 considering a "true" task utilization (using a crystal ball), the implementation
 250 will only ever be able to use an estimator thereof.
 251
 252 3. Capacity aware scheduling requirements
 253 =========================================
 254
 255 3.1 CPU capacity
 256 ----------------
 257
 258 Linux cannot currently figure out CPU capacity on its own, this information thus
 259 needs to be handed to it. Architectures must define arch_scale_cpu_capacity()
 260 for that purpose.
 261
 262 The arm, arm64, and RISC-V architectures directly map this to the arch_topology driver
 263 CPU scaling data, which is derived from the capacity-dmips-mhz CPU binding; see
 264 Documentation/devicetree/bindings/cpu/cpu-capacity.txt.
 265
 266 3.2 Frequency invariance
 267 ------------------------
 268
 269 As stated in 2.2, capacity-aware scheduling requires a frequency-invariant task
 270 utilization. Architectures must define arch_scale_freq_capacity(cpu) for that
 271 purpose.
 272
 273 Implementing this function requires figuring out at which frequency each CPU
 274 have been running at. One way to implement this is to leverage hardware counters
 275 whose increment rate scale with a CPU's current frequency (APERF/MPERF on x86,
 276 AMU on arm64). Another is to directly hook into cpufreq frequency transitions,
 277 when the kernel is aware of the switched-to frequency (also employed by
 278 arm/arm64).
 279
 280 4. Scheduler topology
 281 =====================
 282
 283 During the construction of the sched domains, the scheduler will figure out
 284 whether the system exhibits asymmetric CPU capacities. Should that be the
 285 case:
 286
 287 - The sched_asym_cpucapacity static key will be enabled.
 288 - The SD_ASYM_CPUCAPACITY_FULL flag will be set at the lowest sched_domain
 289   level that spans all unique CPU capacity values.
 290 - The SD_ASYM_CPUCAPACITY flag will be set for any sched_domain that spans
 291   CPUs with any range of asymmetry.
 292
 293 The sched_asym_cpucapacity static key is intended to guard sections of code that
 294 cater to asymmetric CPU capacity systems. Do note however that said key is
 295 *system-wide*. Imagine the following setup using cpusets::
 296
 297   capacity    C/2          C
 298             ________    ________
 299            /        \  /        \
 300   CPUs     0  1  2  3  4  5  6  7
 301            \__/  \______________/
 302   cpusets   cs0         cs1
 303
 304 Which could be created via:
 305
 306 .. code-block:: sh
 307
 308   mkdir /sys/fs/cgroup/cpuset/cs0
 309   echo 0-1 > /sys/fs/cgroup/cpuset/cs0/cpuset.cpus
 310   echo 0 > /sys/fs/cgroup/cpuset/cs0/cpuset.mems
 311
 312   mkdir /sys/fs/cgroup/cpuset/cs1
 313   echo 2-7 > /sys/fs/cgroup/cpuset/cs1/cpuset.cpus
 314   echo 0 > /sys/fs/cgroup/cpuset/cs1/cpuset.mems
 315
 316   echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
 317
 318 Since there *is* CPU capacity asymmetry in the system, the
 319 sched_asym_cpucapacity static key will be enabled. However, the sched_domain
 320 hierarchy of CPUs 0-1 spans a single capacity value: SD_ASYM_CPUCAPACITY isn't
 321 set in that hierarchy, it describes an SMP island and should be treated as such.
 322
 323 Therefore, the 'canonical' pattern for protecting codepaths that cater to
 324 asymmetric CPU capacities is to:
 325
 326 - Check the sched_asym_cpucapacity static key
 327 - If it is enabled, then also check for the presence of SD_ASYM_CPUCAPACITY in
 328   the sched_domain hierarchy (if relevant, i.e. the codepath targets a specific
 329   CPU or group thereof)
 330
 331 5. Capacity aware scheduling implementation
 332 ===========================================
 333
 334 5.1 CFS
 335 -------
 336
 337 5.1.1 Capacity fitness
 338 ~~~~~~~~~~~~~~~~~~~~~~
 339
 340 The main capacity scheduling criterion of CFS is::
 341
 342   task_util(p) < capacity(task_cpu(p))
 343
 344 This is commonly called the capacity fitness criterion, i.e. CFS must ensure a
 345 task "fits" on its CPU. If it is violated, the task will need to achieve more
 346 work than what its CPU can provide: it will be CPU-bound.
 347
 348 Furthermore, uclamp lets userspace specify a minimum and a maximum utilization
 349 value for a task, either via sched_setattr() or via the cgroup interface (see
 350 Documentation/admin-guide/cgroup-v2.rst). As its name imply, this can be used to
 351 clamp task_util() in the previous criterion.
 352
 353 5.1.2 Wakeup CPU selection
 354 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 355
 356 CFS task wakeup CPU selection follows the capacity fitness criterion described
 357 above. On top of that, uclamp is used to clamp the task utilization values,
 358 which lets userspace have more leverage over the CPU selection of CFS
 359 tasks. IOW, CFS wakeup CPU selection searches for a CPU that satisfies::
 360
 361   clamp(task_util(p), task_uclamp_min(p), task_uclamp_max(p)) < capacity(cpu)
 362
 363 By using uclamp, userspace can e.g. allow a busy loop (100% utilization) to run
 364 on any CPU by giving it a low uclamp.max value. Conversely, it can force a small
 365 periodic task (e.g. 10% utilization) to run on the highest-performance CPUs by
 366 giving it a high uclamp.min value.
 367
 368 .. note::
 369
 370   Wakeup CPU selection in CFS can be eclipsed by Energy Aware Scheduling
 371   (EAS), which is described in Documentation/scheduler/sched-energy.rst.
 372
 373 5.1.3 Load balancing
 374 ~~~~~~~~~~~~~~~~~~~~
 375
 376 A pathological case in the wakeup CPU selection occurs when a task rarely
 377 sleeps, if at all - it thus rarely wakes up, if at all. Consider::
 378
 379   w == wakeup event
 380
 381   capacity(CPU0) = C
 382   capacity(CPU1) = C / 3
 383
 384                            workload on CPU0
 385   CPU work ^
 386            |     _________           _________           ____
 387            |    |         |         |         |         |
 388            +----+----+----+----+----+----+----+----+----+----+-> time
 389                 w                   w                   w
 390
 391                            workload on CPU1
 392   CPU work ^
 393            |     ____________________________________________
 394            |    |
 395            +----+----+----+----+----+----+----+----+----+----+->
 396                 w
 397
 398 This workload should run on CPU0, but if the task either:
 399
 400 - was improperly scheduled from the start (inaccurate initial
 401   utilization estimation)
 402 - was properly scheduled from the start, but suddenly needs more
 403   processing power
 404
 405 then it might become CPU-bound, IOW ``task_util(p) > capacity(task_cpu(p))``;
 406 the CPU capacity scheduling criterion is violated, and there may not be any more
 407 wakeup event to fix this up via wakeup CPU selection.
 408
 409 Tasks that are in this situation are dubbed "misfit" tasks, and the mechanism
 410 put in place to handle this shares the same name. Misfit task migration
 411 leverages the CFS load balancer, more specifically the active load balance part
 412 (which caters to migrating currently running tasks). When load balance happens,
 413 a misfit active load balance will be triggered if a misfit task can be migrated
 414 to a CPU with more capacity than its current one.
 415
 416 5.2 RT
 417 ------
 418
 419 5.2.1 Wakeup CPU selection
 420 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 421
 422 RT task wakeup CPU selection searches for a CPU that satisfies::
 423
 424   task_uclamp_min(p) <= capacity(task_cpu(cpu))
 425
 426 while still following the usual priority constraints. If none of the candidate
 427 CPUs can satisfy this capacity criterion, then strict priority based scheduling
 428 is followed and CPU capacities are ignored.
 429
 430 5.3 DL
 431 ------
 432
 433 5.3.1 Wakeup CPU selection
 434 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 435
 436 DL task wakeup CPU selection searches for a CPU that satisfies::
 437
 438   task_bandwidth(p) < capacity(task_cpu(p))
 439
 440 while still respecting the usual bandwidth and deadline constraints. If
 441 none of the candidate CPUs can satisfy this capacity criterion, then the
 442 task will remain on its current CPU.