1 <!DOCTYPE HTML PUBLIC
"-//W3C//DTD HTML 4.01 Transitional//EN"
2 "http://www.w3.org/TR/html4/loose.dtd">
4 <head><title>A Tour Through TREE_RCU's Grace-Period Memory Ordering
</title>
5 <meta HTTP-EQUIV=
"Content-Type" CONTENT=
"text/html; charset=iso-8859-1">
8 <p>This article was contributed by Paul E.
McKenney
</p>
12 <p>This document gives a rough visual overview of how Tree RCU's
13 grace-period memory ordering guarantee is provided.
16 <li> <a href=
"#What Is Tree RCU's Grace Period Memory Ordering Guarantee?">
17 What Is Tree RCU's Grace Period Memory Ordering Guarantee?
</a>
18 <li> <a href=
"#Tree RCU Grace Period Memory Ordering Building Blocks">
19 Tree RCU Grace Period Memory Ordering Building Blocks
</a>
20 <li> <a href=
"#Tree RCU Grace Period Memory Ordering Components">
21 Tree RCU Grace Period Memory Ordering Components
</a>
22 <li> <a href=
"#Putting It All Together">Putting It All Together
</a>
25 <h3><a name=
"What Is Tree RCU's Grace Period Memory Ordering Guarantee?">
26 What Is Tree RCU's Grace Period Memory Ordering Guarantee?
</a></h3>
28 <p>RCU grace periods provide extremely strong memory-ordering guarantees
29 for non-idle non-offline code.
30 Any code that happens after the end of a given RCU grace period is guaranteed
31 to see the effects of all accesses prior to the beginning of that grace
32 period that are within RCU read-side critical sections.
33 Similarly, any code that happens before the beginning of a given RCU grace
34 period is guaranteed to see the effects of all accesses following the end
35 of that grace period that are within RCU read-side critical sections.
37 <p>This guarantee is particularly pervasive for
<tt>synchronize_sched()
</tt>,
38 for which RCU-sched read-side critical sections include any region
39 of code for which preemption is disabled.
40 Given that each individual machine instruction can be thought of as
41 an extremely small region of preemption-disabled code, one can think of
42 <tt>synchronize_sched()
</tt> as
<tt>smp_mb()
</tt> on steroids.
44 <p>RCU updaters use this guarantee by splitting their updates into
45 two phases, one of which is executed before the grace period and
46 the other of which is executed after the grace period.
47 In the most common use case, phase one removes an element from
48 a linked RCU-protected data structure, and phase two frees that element.
49 For this to work, any readers that have witnessed state prior to the
50 phase-one update (in the common case, removal) must not witness state
51 following the phase-two update (in the common case, freeing).
53 <p>The RCU implementation provides this guarantee using a network
54 of lock-based critical sections, memory barriers, and per-CPU
55 processing, as is described in the following sections.
57 <h3><a name=
"Tree RCU Grace Period Memory Ordering Building Blocks">
58 Tree RCU Grace Period Memory Ordering Building Blocks
</a></h3>
60 <p>The workhorse for RCU's grace-period memory ordering is the
61 critical section for the
<tt>rcu_node
</tt> structure's
63 These critical sections use helper functions for lock acquisition, including
64 <tt>raw_spin_lock_rcu_node()
</tt>,
65 <tt>raw_spin_lock_irq_rcu_node()
</tt>, and
66 <tt>raw_spin_lock_irqsave_rcu_node()
</tt>.
67 Their lock-release counterparts are
68 <tt>raw_spin_unlock_rcu_node()
</tt>,
69 <tt>raw_spin_unlock_irq_rcu_node()
</tt>, and
70 <tt>raw_spin_unlock_irqrestore_rcu_node()
</tt>,
73 <tt>raw_spin_trylock_rcu_node()
</tt>
75 The key point is that the lock-acquisition functions, including
76 <tt>raw_spin_trylock_rcu_node()
</tt>, all invoke
77 <tt>smp_mb__after_unlock_lock()
</tt> immediately after successful
78 acquisition of the lock.
80 <p>Therefore, for any given
<tt>rcu_node
</tt> struction, any access
81 happening before one of the above lock-release functions will be seen
82 by all CPUs as happening before any access happening after a later
83 one of the above lock-acquisition functions.
84 Furthermore, any access happening before one of the
85 above lock-release function on any given CPU will be seen by all
86 CPUs as happening before any access happening after a later one
87 of the above lock-acquisition functions executing on that same CPU,
88 even if the lock-release and lock-acquisition functions are operating
89 on different
<tt>rcu_node
</tt> structures.
90 Tree RCU uses these two ordering guarantees to form an ordering
91 network among all CPUs that were in any way involved in the grace
92 period, including any CPUs that came online or went offline during
93 the grace period in question.
95 <p>The following litmus test exhibits the ordering effects of these
96 lock-acquisition and lock-release functions:
103 5 raw_spin_lock_rcu_node(rnp);
106 8 raw_spin_unlock_rcu_node(rnp);
111 13 raw_spin_lock_rcu_node(rnp);
113 15 r2 = READ_ONCE(z);
114 16 raw_spin_unlock_rcu_node(rnp);
121 23 r3 = READ_ONCE(x);
124 26 WARN_ON(r1 ==
0 && r2 ==
0 && r3 ==
0);
127 <p>The
<tt>WARN_ON()
</tt> is evaluated at
“the end of time
”,
128 after all changes have propagated throughout the system.
129 Without the
<tt>smp_mb__after_unlock_lock()
</tt> provided by the
130 acquisition functions, this
<tt>WARN_ON()
</tt> could trigger, for example
132 The
<tt>smp_mb__after_unlock_lock()
</tt> invocations prevent this
133 <tt>WARN_ON()
</tt> from triggering.
135 <p>This approach must be extended to include idle CPUs, which need
136 RCU's grace-period memory ordering guarantee to extend to any
137 RCU read-side critical sections preceding and following the current
139 This case is handled by calls to the strongly ordered
140 <tt>atomic_add_return()
</tt> read-modify-write atomic operation that
141 is invoked within
<tt>rcu_dynticks_eqs_enter()
</tt> at idle-entry
142 time and within
<tt>rcu_dynticks_eqs_exit()
</tt> at idle-exit time.
143 The grace-period kthread invokes
<tt>rcu_dynticks_snap()
</tt> and
144 <tt>rcu_dynticks_in_eqs_since()
</tt> (both of which invoke
145 an
<tt>atomic_add_return()
</tt> of zero) to detect idle CPUs.
148 <tr><th> </th></tr>
149 <tr><th align=
"left">Quick Quiz:
</th></tr>
151 But what about CPUs that remain offline for the entire
154 <tr><th align=
"left">Answer:
</th></tr>
155 <tr><td bgcolor=
"#ffffff"><font color=
"ffffff">
156 Such CPUs will be offline at the beginning of the grace period,
157 so the grace period won't expect quiescent states from them.
158 Races between grace-period start and CPU-hotplug operations
159 are mediated by the CPU's leaf
<tt>rcu_node
</tt> structure's
160 <tt>-
>lock
</tt> as described above.
162 <tr><td> </td></tr>
165 <p>The approach must be extended to handle one final case, that
166 of waking a task blocked in
<tt>synchronize_rcu()
</tt>.
167 This task might be affinitied to a CPU that is not yet aware that
168 the grace period has ended, and thus might not yet be subject to
169 the grace period's memory ordering.
170 Therefore, there is an
<tt>smp_mb()
</tt> after the return from
171 <tt>wait_for_completion()
</tt> in the
<tt>synchronize_rcu()
</tt>
175 <tr><th> </th></tr>
176 <tr><th align=
"left">Quick Quiz:
</th></tr>
179 I don't see any
<tt>smp_mb()
</tt> after the return from
180 <tt>wait_for_completion()
</tt>!!!
182 <tr><th align=
"left">Answer:
</th></tr>
183 <tr><td bgcolor=
"#ffffff"><font color=
"ffffff">
184 That would be because I spotted the need for that
185 <tt>smp_mb()
</tt> during the creation of this documentation,
186 and it is therefore unlikely to hit mainline before v4.14.
187 Kudos to Lance Roy, Will Deacon, Peter Zijlstra, and
188 Jonathan Cameron for asking questions that sensitized me
189 to the rather elaborate sequence of events that demonstrate
190 the need for this memory barrier.
192 <tr><td> </td></tr>
195 <p>Tree RCU's grace--period memory-ordering guarantees rely most
196 heavily on the
<tt>rcu_node
</tt> structure's
<tt>-
>lock
</tt>
197 field, so much so that it is necessary to abbreviate this pattern
198 in the diagrams in the next section.
199 For example, consider the
<tt>rcu_prepare_for_idle()
</tt> function
200 shown below, which is one of several functions that enforce ordering
201 of newly arrived RCU callbacks against future grace periods:
204 1 static void rcu_prepare_for_idle(void)
207 4 struct rcu_data *rdp;
208 5 struct rcu_dynticks *rdtp = this_cpu_ptr(
&rcu_dynticks);
209 6 struct rcu_node *rnp;
210 7 struct rcu_state *rsp;
213 10 if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL) ||
214 11 rcu_is_nocb_cpu(smp_processor_id()))
216 13 tne = READ_ONCE(tick_nohz_active);
217 14 if (tne != rdtp-
>tick_nohz_enabled_snap) {
218 15 if (rcu_cpu_has_callbacks(NULL))
219 16 invoke_rcu_core();
220 17 rdtp-
>tick_nohz_enabled_snap = tne;
225 22 if (rdtp-
>all_lazy
&&
226 23 rdtp-
>nonlazy_posted != rdtp-
>nonlazy_posted_snap) {
227 24 rdtp-
>all_lazy = false;
228 25 rdtp-
>nonlazy_posted_snap = rdtp-
>nonlazy_posted;
229 26 invoke_rcu_core();
232 29 if (rdtp-
>last_accelerate == jiffies)
234 31 rdtp-
>last_accelerate = jiffies;
235 32 for_each_rcu_flavor(rsp) {
236 33 rdp = this_cpu_ptr(rsp-
>rda);
237 34 if (rcu_segcblist_pend_cbs(
&rdp-
>cblist))
239 36 rnp = rdp-
>mynode;
240 37 raw_spin_lock_rcu_node(rnp);
241 38 needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
242 39 raw_spin_unlock_rcu_node(rnp);
244 41 rcu_gp_kthread_wake(rsp);
249 <p>But the only part of
<tt>rcu_prepare_for_idle()
</tt> that really
250 matters for this discussion are lines
37–39.
251 We will therefore abbreviate this function as follows:
253 </p><p><img src=
"rcu_node-lock.svg" alt=
"rcu_node-lock.svg">
255 <p>The box represents the
<tt>rcu_node
</tt> structure's
<tt>-
>lock
</tt>
256 critical section, with the double line on top representing the additional
257 <tt>smp_mb__after_unlock_lock()
</tt>.
259 <h3><a name=
"Tree RCU Grace Period Memory Ordering Components">
260 Tree RCU Grace Period Memory Ordering Components
</a></h3>
262 <p>Tree RCU's grace-period memory-ordering guarantee is provided by
263 a number of RCU components:
266 <li> <a href=
"#Callback Registry">Callback Registry
</a>
267 <li> <a href=
"#Grace-Period Initialization">Grace-Period Initialization
</a>
268 <li> <a href=
"#Self-Reported Quiescent States">
269 Self-Reported Quiescent States
</a>
270 <li> <a href=
"#Dynamic Tick Interface">Dynamic Tick Interface
</a>
271 <li> <a href=
"#CPU-Hotplug Interface">CPU-Hotplug Interface
</a>
272 <li> <a href=
"Forcing Quiescent States">Forcing Quiescent States
</a>
273 <li> <a href=
"Grace-Period Cleanup">Grace-Period Cleanup
</a>
274 <li> <a href=
"Callback Invocation">Callback Invocation
</a>
277 <p>Each of the following section looks at the corresponding component
280 <h4><a name=
"Callback Registry">Callback Registry
</a></h4>
282 <p>If RCU's grace-period guarantee is to mean anything at all, any
283 access that happens before a given invocation of
<tt>call_rcu()
</tt>
284 must also happen before the corresponding grace period.
285 The implementation of this portion of RCU's grace period guarantee
286 is shown in the following figure:
288 </p><p><img src=
"TreeRCU-callback-registry.svg" alt=
"TreeRCU-callback-registry.svg">
290 <p>Because
<tt>call_rcu()
</tt> normally acts only on CPU-local state,
291 it provides no ordering guarantees, either for itself or for
292 phase one of the update (which again will usually be removal of
293 an element from an RCU-protected data structure).
294 It simply enqueues the
<tt>rcu_head
</tt> structure on a per-CPU list,
295 which cannot become associated with a grace period until a later
296 call to
<tt>rcu_accelerate_cbs()
</tt>, as shown in the diagram above.
298 <p>One set of code paths shown on the left invokes
299 <tt>rcu_accelerate_cbs()
</tt> via
300 <tt>note_gp_changes()
</tt>, either directly from
<tt>call_rcu()
</tt> (if
301 the current CPU is inundated with queued
<tt>rcu_head
</tt> structures)
302 or more likely from an
<tt>RCU_SOFTIRQ
</tt> handler.
303 Another code path in the middle is taken only in kernels built with
304 <tt>CONFIG_RCU_FAST_NO_HZ=y
</tt>, which invokes
305 <tt>rcu_accelerate_cbs()
</tt> via
<tt>rcu_prepare_for_idle()
</tt>.
306 The final code path on the right is taken only in kernels built with
307 <tt>CONFIG_HOTPLUG_CPU=y
</tt>, which invokes
308 <tt>rcu_accelerate_cbs()
</tt> via
309 <tt>rcu_advance_cbs()
</tt>,
<tt>rcu_migrate_callbacks
</tt>,
310 <tt>rcutree_migrate_callbacks()
</tt>, and
<tt>takedown_cpu()
</tt>,
311 which in turn is invoked on a surviving CPU after the outgoing
312 CPU has been completely offlined.
314 <p>There are a few other code paths within grace-period processing
315 that opportunistically invoke
<tt>rcu_accelerate_cbs()
</tt>.
316 However, either way, all of the CPU's recently queued
<tt>rcu_head
</tt>
317 structures are associated with a future grace-period number under
318 the protection of the CPU's lead
<tt>rcu_node
</tt> structure's
320 In all cases, there is full ordering against any prior critical section
321 for that same
<tt>rcu_node
</tt> structure's
<tt>-
>lock
</tt>, and
322 also full ordering against any of the current task's or CPU's prior critical
323 sections for any
<tt>rcu_node
</tt> structure's
<tt>-
>lock
</tt>.
325 <p>The next section will show how this ordering ensures that any
326 accesses prior to the
<tt>call_rcu()
</tt> (particularly including phase
328 happen before the start of the corresponding grace period.
331 <tr><th> </th></tr>
332 <tr><th align=
"left">Quick Quiz:
</th></tr>
334 But what about
<tt>synchronize_rcu()
</tt>?
336 <tr><th align=
"left">Answer:
</th></tr>
337 <tr><td bgcolor=
"#ffffff"><font color=
"ffffff">
338 The
<tt>synchronize_rcu()
</tt> passes
<tt>call_rcu()
</tt>
339 to
<tt>wait_rcu_gp()
</tt>, which invokes it.
340 So either way, it eventually comes down to
<tt>call_rcu()
</tt>.
342 <tr><td> </td></tr>
345 <h4><a name=
"Grace-Period Initialization">Grace-Period Initialization
</a></h4>
347 <p>Grace-period initialization is carried out by
348 the grace-period kernel thread, which makes several passes over the
349 <tt>rcu_node
</tt> tree within the
<tt>rcu_gp_init()
</tt> function.
350 This means that showing the full flow of ordering through the
351 grace-period computation will require duplicating this tree.
352 If you find this confusing, please note that the state of the
353 <tt>rcu_node
</tt> changes over time, just like Heraclitus's river.
354 However, to keep the
<tt>rcu_node
</tt> river tractable, the
355 grace-period kernel thread's traversals are presented in multiple
356 parts, starting in this section with the various phases of
357 grace-period initialization.
359 <p>The first ordering-related grace-period initialization action is to
360 increment the
<tt>rcu_state
</tt> structure's
<tt>-
>gpnum
</tt>
361 grace-period-number counter, as shown below:
363 </p><p><img src=
"TreeRCU-gp-init-1.svg" alt=
"TreeRCU-gp-init-1.svg" width=
"75%">
365 <p>The actual increment is carried out using
<tt>smp_store_release()
</tt>,
366 which helps reject false-positive RCU CPU stall detection.
367 Note that only the root
<tt>rcu_node
</tt> structure is touched.
369 <p>The first pass through the
<tt>rcu_node
</tt> tree updates bitmasks
370 based on CPUs having come online or gone offline since the start of
371 the previous grace period.
372 In the common case where the number of online CPUs for this
<tt>rcu_node
</tt>
373 structure has not transitioned to or from zero,
374 this pass will scan only the leaf
<tt>rcu_node
</tt> structures.
375 However, if the number of online CPUs for a given leaf
<tt>rcu_node
</tt>
376 structure has transitioned from zero,
377 <tt>rcu_init_new_rnp()
</tt> will be invoked for the first incoming CPU.
378 Similarly, if the number of online CPUs for a given leaf
<tt>rcu_node
</tt>
379 structure has transitioned to zero,
380 <tt>rcu_cleanup_dead_rnp()
</tt> will be invoked for the last outgoing CPU.
381 The diagram below shows the path of ordering if the leftmost
382 <tt>rcu_node
</tt> structure onlines its first CPU and if the next
383 <tt>rcu_node
</tt> structure has no online CPUs
384 (or, alternatively if the leftmost
<tt>rcu_node
</tt> structure offlines
385 its last CPU and if the next
<tt>rcu_node
</tt> structure has no online CPUs).
387 </p><p><img src=
"TreeRCU-gp-init-2.svg" alt=
"TreeRCU-gp-init-1.svg" width=
"75%">
389 <p>The final
<tt>rcu_gp_init()
</tt> pass through the
<tt>rcu_node
</tt>
390 tree traverses breadth-first, setting each
<tt>rcu_node
</tt> structure's
391 <tt>-
>gpnum
</tt> field to the newly incremented value from the
392 <tt>rcu_state
</tt> structure, as shown in the following diagram.
394 </p><p><img src=
"TreeRCU-gp-init-3.svg" alt=
"TreeRCU-gp-init-1.svg" width=
"75%">
396 <p>This change will also cause each CPU's next call to
397 <tt>__note_gp_changes()
</tt>
398 to notice that a new grace period has started, as described in the next
400 But because the grace-period kthread started the grace period at the
401 root (with the increment of the
<tt>rcu_state
</tt> structure's
402 <tt>-
>gpnum
</tt> field) before setting each leaf
<tt>rcu_node
</tt>
403 structure's
<tt>-
>gpnum
</tt> field, each CPU's observation of
404 the start of the grace period will happen after the actual start
408 <tr><th> </th></tr>
409 <tr><th align=
"left">Quick Quiz:
</th></tr>
411 But what about the CPU that started the grace period?
412 Why wouldn't it see the start of the grace period right when
413 it started that grace period?
415 <tr><th align=
"left">Answer:
</th></tr>
416 <tr><td bgcolor=
"#ffffff"><font color=
"ffffff">
417 In some deep philosophical and overly anthromorphized
418 sense, yes, the CPU starting the grace period is immediately
419 aware of having done so.
420 However, if we instead assume that RCU is not self-aware,
421 then even the CPU starting the grace period does not really
422 become aware of the start of this grace period until its
423 first call to
<tt>__note_gp_changes()
</tt>.
424 On the other hand, this CPU potentially gets early notification
425 because it invokes
<tt>__note_gp_changes()
</tt> during its
426 last
<tt>rcu_gp_init()
</tt> pass through its leaf
427 <tt>rcu_node
</tt> structure.
429 <tr><td> </td></tr>
432 <h4><a name=
"Self-Reported Quiescent States">
433 Self-Reported Quiescent States
</a></h4>
435 <p>When all entities that might block the grace period have reported
436 quiescent states (or as described in a later section, had quiescent
437 states reported on their behalf), the grace period can end.
438 Online non-idle CPUs report their own quiescent states, as shown
439 in the following diagram:
441 </p><p><img src=
"TreeRCU-qs.svg" alt=
"TreeRCU-qs.svg" width=
"75%">
443 <p>This is for the last CPU to report a quiescent state, which signals
444 the end of the grace period.
445 Earlier quiescent states would push up the
<tt>rcu_node
</tt> tree
446 only until they encountered an
<tt>rcu_node
</tt> structure that
447 is waiting for additional quiescent states.
448 However, ordering is nevertheless preserved because some later quiescent
449 state will acquire that
<tt>rcu_node
</tt> structure's
<tt>-
>lock
</tt>.
451 <p>Any number of events can lead up to a CPU invoking
452 <tt>note_gp_changes
</tt> (or alternatively, directly invoking
453 <tt>__note_gp_changes()
</tt>), at which point that CPU will notice
454 the start of a new grace period while holding its leaf
455 <tt>rcu_node
</tt> lock.
456 Therefore, all execution shown in this diagram happens after the
457 start of the grace period.
458 In addition, this CPU will consider any RCU read-side critical
459 section that started before the invocation of
<tt>__note_gp_changes()
</tt>
460 to have started before the grace period, and thus a critical
461 section that the grace period must wait on.
464 <tr><th> </th></tr>
465 <tr><th align=
"left">Quick Quiz:
</th></tr>
467 But a RCU read-side critical section might have started
468 after the beginning of the grace period
469 (the
<tt>-
>gpnum++
</tt> from earlier), so why should
470 the grace period wait on such a critical section?
472 <tr><th align=
"left">Answer:
</th></tr>
473 <tr><td bgcolor=
"#ffffff"><font color=
"ffffff">
474 It is indeed not necessary for the grace period to wait on such
476 However, it is permissible to wait on it.
477 And it is furthermore important to wait on it, as this
478 lazy approach is far more scalable than a
“big bang
”
479 all-at-once grace-period start could possibly be.
481 <tr><td> </td></tr>
484 <p>If the CPU does a context switch, a quiescent state will be
485 noted by
<tt>rcu_node_context_switch()
</tt> on the left.
486 On the other hand, if the CPU takes a scheduler-clock interrupt
487 while executing in usermode, a quiescent state will be noted by
488 <tt>rcu_check_callbacks()
</tt> on the right.
489 Either way, the passage through a quiescent state will be noted
490 in a per-CPU variable.
492 <p>The next time an
<tt>RCU_SOFTIRQ
</tt> handler executes on
493 this CPU (for example, after the next scheduler-clock
494 interrupt),
<tt>__rcu_process_callbacks()
</tt> will invoke
495 <tt>rcu_check_quiescent_state()
</tt>, which will notice the
496 recorded quiescent state, and invoke
497 <tt>rcu_report_qs_rdp()
</tt>.
498 If
<tt>rcu_report_qs_rdp()
</tt> verifies that the quiescent state
499 really does apply to the current grace period, it invokes
500 <tt>rcu_report_rnp()
</tt> which traverses up the
<tt>rcu_node
</tt>
501 tree as shown at the bottom of the diagram, clearing bits from
502 each
<tt>rcu_node
</tt> structure's
<tt>-
>qsmask
</tt> field,
503 and propagating up the tree when the result is zero.
505 <p>Note that traversal passes upwards out of a given
<tt>rcu_node
</tt>
506 structure only if the current CPU is reporting the last quiescent
507 state for the subtree headed by that
<tt>rcu_node
</tt> structure.
508 A key point is that if a CPU's traversal stops at a given
<tt>rcu_node
</tt>
509 structure, then there will be a later traversal by another CPU
510 (or perhaps the same one) that proceeds upwards
511 from that point, and the
<tt>rcu_node
</tt> <tt>-
>lock
</tt>
512 guarantees that the first CPU's quiescent state happens before the
513 remainder of the second CPU's traversal.
514 Applying this line of thought repeatedly shows that all CPUs'
515 quiescent states happen before the last CPU traverses through
516 the root
<tt>rcu_node
</tt> structure, the
“last CPU
”
517 being the one that clears the last bit in the root
<tt>rcu_node
</tt>
518 structure's
<tt>-
>qsmask
</tt> field.
520 <h4><a name=
"Dynamic Tick Interface">Dynamic Tick Interface
</a></h4>
522 <p>Due to energy-efficiency considerations, RCU is forbidden from
523 disturbing idle CPUs.
524 CPUs are therefore required to notify RCU when entering or leaving idle
525 state, which they do via fully ordered value-returning atomic operations
526 on a per-CPU variable.
527 The ordering effects are as shown below:
529 </p><p><img src=
"TreeRCU-dyntick.svg" alt=
"TreeRCU-dyntick.svg" width=
"50%">
531 <p>The RCU grace-period kernel thread samples the per-CPU idleness
532 variable while holding the corresponding CPU's leaf
<tt>rcu_node
</tt>
533 structure's
<tt>-
>lock
</tt>.
534 This means that any RCU read-side critical sections that precede the
535 idle period (the oval near the top of the diagram above) will happen
536 before the end of the current grace period.
537 Similarly, the beginning of the current grace period will happen before
538 any RCU read-side critical sections that follow the
539 idle period (the oval near the bottom of the diagram above).
541 <p>Plumbing this into the full grace-period execution is described
542 <a href=
"#Forcing Quiescent States">below
</a>.
544 <h4><a name=
"CPU-Hotplug Interface">CPU-Hotplug Interface
</a></h4>
546 <p>RCU is also forbidden from disturbing offline CPUs, which might well
547 be powered off and removed from the system completely.
548 CPUs are therefore required to notify RCU of their comings and goings
549 as part of the corresponding CPU hotplug operations.
550 The ordering effects are shown below:
552 </p><p><img src=
"TreeRCU-hotplug.svg" alt=
"TreeRCU-hotplug.svg" width=
"50%">
554 <p>Because CPU hotplug operations are much less frequent than idle transitions,
555 they are heavier weight, and thus acquire the CPU's leaf
<tt>rcu_node
</tt>
556 structure's
<tt>-
>lock
</tt> and update this structure's
557 <tt>-
>qsmaskinitnext
</tt>.
558 The RCU grace-period kernel thread samples this mask to detect CPUs
559 having gone offline since the beginning of this grace period.
561 <p>Plumbing this into the full grace-period execution is described
562 <a href=
"#Forcing Quiescent States">below
</a>.
564 <h4><a name=
"Forcing Quiescent States">Forcing Quiescent States
</a></h4>
566 <p>As noted above, idle and offline CPUs cannot report their own
567 quiescent states, and therefore the grace-period kernel thread
568 must do the reporting on their behalf.
569 This process is called
“forcing quiescent states
”, it is
570 repeated every few jiffies, and its ordering effects are shown below:
572 </p><p><img src=
"TreeRCU-gp-fqs.svg" alt=
"TreeRCU-gp-fqs.svg" width=
"100%">
574 <p>Each pass of quiescent state forcing is guaranteed to traverse the
575 leaf
<tt>rcu_node
</tt> structures, and if there are no new quiescent
576 states due to recently idled and/or offlined CPUs, then only the
577 leaves are traversed.
578 However, if there is a newly offlined CPU as illustrated on the left
579 or a newly idled CPU as illustrated on the right, the corresponding
580 quiescent state will be driven up towards the root.
581 As with self-reported quiescent states, the upwards driving stops
582 once it reaches an
<tt>rcu_node
</tt> structure that has quiescent
583 states outstanding from other CPUs.
586 <tr><th> </th></tr>
587 <tr><th align=
"left">Quick Quiz:
</th></tr>
589 The leftmost drive to root stopped before it reached
590 the root
<tt>rcu_node
</tt> structure, which means that
591 there are still CPUs subordinate to that structure on
592 which the current grace period is waiting.
593 Given that, how is it possible that the rightmost drive
594 to root ended the grace period?
596 <tr><th align=
"left">Answer:
</th></tr>
597 <tr><td bgcolor=
"#ffffff"><font color=
"ffffff">
599 It is in fact impossible in the absence of bugs in RCU.
600 But this diagram is complex enough as it is, so simplicity
602 You can think of it as poetic license, or you can think of
603 it as misdirection that is resolved in the
604 <a href=
"#Putting It All Together">stitched-together diagram
</a>.
606 <tr><td> </td></tr>
609 <h4><a name=
"Grace-Period Cleanup">Grace-Period Cleanup
</a></h4>
611 <p>Grace-period cleanup first scans the
<tt>rcu_node
</tt> tree
612 breadth-first setting all the
<tt>-
>completed
</tt> fields equal
613 to the number of the newly completed grace period, then it sets
614 the
<tt>rcu_state
</tt> structure's
<tt>-
>completed
</tt> field,
615 again to the number of the newly completed grace period.
616 The ordering effects are shown below:
618 </p><p><img src=
"TreeRCU-gp-cleanup.svg" alt=
"TreeRCU-gp-cleanup.svg" width=
"75%">
620 <p>As indicated by the oval at the bottom of the diagram, once
621 grace-period cleanup is complete, the next grace period can begin.
624 <tr><th> </th></tr>
625 <tr><th align=
"left">Quick Quiz:
</th></tr>
627 But when precisely does the grace period end?
629 <tr><th align=
"left">Answer:
</th></tr>
630 <tr><td bgcolor=
"#ffffff"><font color=
"ffffff">
631 There is no useful single point at which the grace period
633 The earliest reasonable candidate is as soon as the last
634 CPU has reported its quiescent state, but it may be some
635 milliseconds before RCU becomes aware of this.
636 The latest reasonable candidate is once the
<tt>rcu_state
</tt>
637 structure's
<tt>-
>completed
</tt> field has been updated,
638 but it is quite possible that some CPUs have already completed
639 phase two of their updates by that time.
640 In short, if you are going to work with RCU, you need to
641 learn to embrace uncertainty.
643 <tr><td> </td></tr>
647 <h4><a name=
"Callback Invocation">Callback Invocation
</a></h4>
649 <p>Once a given CPU's leaf
<tt>rcu_node
</tt> structure's
650 <tt>-
>completed
</tt> field has been updated, that CPU can begin
651 invoking its RCU callbacks that were waiting for this grace period
653 These callbacks are identified by
<tt>rcu_advance_cbs()
</tt>,
654 which is usually invoked by
<tt>__note_gp_changes()
</tt>.
655 As shown in the diagram below, this invocation can be triggered by
656 the scheduling-clock interrupt (
<tt>rcu_check_callbacks()
</tt> on
657 the left) or by idle entry (
<tt>rcu_cleanup_after_idle()
</tt> on
658 the right, but only for kernels build with
659 <tt>CONFIG_RCU_FAST_NO_HZ=y
</tt>).
660 Either way,
<tt>RCU_SOFTIRQ
</tt> is raised, which results in
661 <tt>rcu_do_batch()
</tt> invoking the callbacks, which in turn
662 allows those callbacks to carry out (either directly or indirectly
663 via wakeup) the needed phase-two processing for each update.
665 </p><p><img src=
"TreeRCU-callback-invocation.svg" alt=
"TreeRCU-callback-invocation.svg" width=
"60%">
667 <p>Please note that callback invocation can also be prompted by any
668 number of corner-case code paths, for example, when a CPU notes that
669 it has excessive numbers of callbacks queued.
670 In all cases, the CPU acquires its leaf
<tt>rcu_node
</tt> structure's
671 <tt>-
>lock
</tt> before invoking callbacks, which preserves the
672 required ordering against the newly completed grace period.
674 <p>However, if the callback function communicates to other CPUs,
675 for example, doing a wakeup, then it is that function's responsibility
676 to maintain ordering.
677 For example, if the callback function wakes up a task that runs on
678 some other CPU, proper ordering must in place in both the callback
679 function and the task being awakened.
680 To see why this is important, consider the top half of the
681 <a href=
"#Grace-Period Cleanup">grace-period cleanup
</a> diagram.
682 The callback might be running on a CPU corresponding to the leftmost
683 leaf
<tt>rcu_node
</tt> structure, and awaken a task that is to run on
684 a CPU corresponding to the rightmost leaf
<tt>rcu_node
</tt> structure,
685 and the grace-period kernel thread might not yet have reached the
687 In this case, the grace period's memory ordering might not yet have
688 reached that CPU, so again the callback function and the awakened
689 task must supply proper ordering.
691 <h3><a name=
"Putting It All Together">Putting It All Together
</a></h3>
693 <p>A stitched-together diagram is
694 <a href=
"Tree-RCU-Diagram.html">here
</a>.
696 <h3><a name=
"Legal Statement">
697 Legal Statement
</a></h3>
699 <p>This work represents the view of the author and does not necessarily
700 represent the view of IBM.
702 </p><p>Linux is a registered trademark of Linus Torvalds.
704 </p><p>Other company, product, and service names may be trademarks or
705 service marks of others.