Linux 4.8.3
[linux/fpc-iii.git] / Documentation / RCU / Design / Data-Structures / Data-Structures.html
blob7eb47ac25ad772bdc25b812e016982d974752f8c
1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
2 "http://www.w3.org/TR/html4/loose.dtd">
3 <html>
4 <head><title>A Tour Through TREE_RCU's Data Structures [LWN.net]</title>
5 <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
7 <p>January 27, 2016</p>
8 <p>This article was contributed by Paul E.&nbsp;McKenney</p>
10 <h3>Introduction</h3>
12 This document describes RCU's major data structures and their relationship
13 to each other.
15 <ol>
16 <li> <a href="#Data-Structure Relationships">
17 Data-Structure Relationships</a>
18 <li> <a href="#The rcu_state Structure">
19 The <tt>rcu_state</tt> Structure</a>
20 <li> <a href="#The rcu_node Structure">
21 The <tt>rcu_node</tt> Structure</a>
22 <li> <a href="#The rcu_data Structure">
23 The <tt>rcu_data</tt> Structure</a>
24 <li> <a href="#The rcu_dynticks Structure">
25 The <tt>rcu_dynticks</tt> Structure</a>
26 <li> <a href="#The rcu_head Structure">
27 The <tt>rcu_head</tt> Structure</a>
28 <li> <a href="#RCU-Specific Fields in the task_struct Structure">
29 RCU-Specific Fields in the <tt>task_struct</tt> Structure</a>
30 <li> <a href="#Accessor Functions">
31 Accessor Functions</a>
32 </ol>
34 At the end we have the
35 <a href="#Answers to Quick Quizzes">answers to the quick quizzes</a>.
37 <h3><a name="Data-Structure Relationships">Data-Structure Relationships</a></h3>
39 <p>RCU is for all intents and purposes a large state machine, and its
40 data structures maintain the state in such a way as to allow RCU readers
41 to execute extremely quickly, while also processing the RCU grace periods
42 requested by updaters in an efficient and extremely scalable fashion.
43 The efficiency and scalability of RCU updaters is provided primarily
44 by a combining tree, as shown below:
46 </p><p><img src="BigTreeClassicRCU.svg" alt="BigTreeClassicRCU.svg" width="30%">
48 </p><p>This diagram shows an enclosing <tt>rcu_state</tt> structure
49 containing a tree of <tt>rcu_node</tt> structures.
50 Each leaf node of the <tt>rcu_node</tt> tree has up to 16
51 <tt>rcu_data</tt> structures associated with it, so that there
52 are <tt>NR_CPUS</tt> number of <tt>rcu_data</tt> structures,
53 one for each possible CPU.
54 This structure is adjusted at boot time, if needed, to handle the
55 common case where <tt>nr_cpu_ids</tt> is much less than
56 <tt>NR_CPUs</tt>.
57 For example, a number of Linux distributions set <tt>NR_CPUs=4096</tt>,
58 which results in a three-level <tt>rcu_node</tt> tree.
59 If the actual hardware has only 16 CPUs, RCU will adjust itself
60 at boot time, resulting in an <tt>rcu_node</tt> tree with only a single node.
62 </p><p>The purpose of this combining tree is to allow per-CPU events
63 such as quiescent states, dyntick-idle transitions,
64 and CPU hotplug operations to be processed efficiently
65 and scalably.
66 Quiescent states are recorded by the per-CPU <tt>rcu_data</tt> structures,
67 and other events are recorded by the leaf-level <tt>rcu_node</tt>
68 structures.
69 All of these events are combined at each level of the tree until finally
70 grace periods are completed at the tree's root <tt>rcu_node</tt>
71 structure.
72 A grace period can be completed at the root once every CPU
73 (or, in the case of <tt>CONFIG_PREEMPT_RCU</tt>, task)
74 has passed through a quiescent state.
75 Once a grace period has completed, record of that fact is propagated
76 back down the tree.
78 </p><p>As can be seen from the diagram, on a 64-bit system
79 a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout
80 of 64 at the root and a fanout of 16 at the leaves.
82 <table>
83 <tr><th>&nbsp;</th></tr>
84 <tr><th align="left">Quick Quiz:</th></tr>
85 <tr><td>
86 Why isn't the fanout at the leaves also 64?
87 </td></tr>
88 <tr><th align="left">Answer:</th></tr>
89 <tr><td bgcolor="#ffffff"><font color="ffffff">
90 Because there are more types of events that affect the leaf-level
91 <tt>rcu_node</tt> structures than further up the tree.
92 Therefore, if the leaf <tt>rcu_node</tt> structures have fanout of
93 64, the contention on these structures' <tt>-&gt;structures</tt>
94 becomes excessive.
95 Experimentation on a wide variety of systems has shown that a fanout
96 of 16 works well for the leaves of the <tt>rcu_node</tt> tree.
97 </font>
99 <p><font color="ffffff">Of course, further experience with
100 systems having hundreds or thousands of CPUs may demonstrate
101 that the fanout for the non-leaf <tt>rcu_node</tt> structures
102 must also be reduced.
103 Such reduction can be easily carried out when and if it proves
104 necessary.
105 In the meantime, if you are using such a system and running into
106 contention problems on the non-leaf <tt>rcu_node</tt> structures,
107 you may use the <tt>CONFIG_RCU_FANOUT</tt> kernel configuration
108 parameter to reduce the non-leaf fanout as needed.
109 </font>
111 <p><font color="ffffff">Kernels built for systems with
112 strong NUMA characteristics might also need to adjust
113 <tt>CONFIG_RCU_FANOUT</tt> so that the domains of the
114 <tt>rcu_node</tt> structures align with hardware boundaries.
115 However, there has thus far been no need for this.
116 </font></td></tr>
117 <tr><td>&nbsp;</td></tr>
118 </table>
120 <p>If your system has more than 1,024 CPUs (or more than 512 CPUs on
121 a 32-bit system), then RCU will automatically add more levels to the
122 tree.
123 For example, if you are crazy enough to build a 64-bit system with 65,536
124 CPUs, RCU would configure the <tt>rcu_node</tt> tree as follows:
126 </p><p><img src="HugeTreeClassicRCU.svg" alt="HugeTreeClassicRCU.svg" width="50%">
128 </p><p>RCU currently permits up to a four-level tree, which on a 64-bit system
129 accommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for
130 32-bit systems.
131 On the other hand, you can set <tt>CONFIG_RCU_FANOUT</tt> to be
132 as small as 2 if you wish, which would permit only 16 CPUs, which
133 is useful for testing.
135 </p><p>This multi-level combining tree allows us to get most of the
136 performance and scalability
137 benefits of partitioning, even though RCU grace-period detection is
138 inherently a global operation.
139 The trick here is that only the last CPU to report a quiescent state
140 into a given <tt>rcu_node</tt> structure need advance to the <tt>rcu_node</tt>
141 structure at the next level up the tree.
142 This means that at the leaf-level <tt>rcu_node</tt> structure, only
143 one access out of sixteen will progress up the tree.
144 For the internal <tt>rcu_node</tt> structures, the situation is even
145 more extreme: Only one access out of sixty-four will progress up
146 the tree.
147 Because the vast majority of the CPUs do not progress up the tree,
148 the lock contention remains roughly constant up the tree.
149 No matter how many CPUs there are in the system, at most 64 quiescent-state
150 reports per grace period will progress all the way to the root
151 <tt>rcu_node</tt> structure, thus ensuring that the lock contention
152 on that root <tt>rcu_node</tt> structure remains acceptably low.
154 </p><p>In effect, the combining tree acts like a big shock absorber,
155 keeping lock contention under control at all tree levels regardless
156 of the level of loading on the system.
158 </p><p>The Linux kernel actually supports multiple flavors of RCU
159 running concurrently, so RCU builds separate data structures for each
160 flavor.
161 For example, for <tt>CONFIG_TREE_RCU=y</tt> kernels, RCU provides
162 rcu_sched and rcu_bh, as shown below:
164 </p><p><img src="BigTreeClassicRCUBH.svg" alt="BigTreeClassicRCUBH.svg" width="33%">
166 </p><p>Energy efficiency is increasingly important, and for that
167 reason the Linux kernel provides <tt>CONFIG_NO_HZ_IDLE</tt>, which
168 turns off the scheduling-clock interrupts on idle CPUs, which in
169 turn allows those CPUs to attain deeper sleep states and to consume
170 less energy.
171 CPUs whose scheduling-clock interrupts have been turned off are
172 said to be in <i>dyntick-idle mode</i>.
173 RCU must handle dyntick-idle CPUs specially
174 because RCU would otherwise wake up each CPU on every grace period,
175 which would defeat the whole purpose of <tt>CONFIG_NO_HZ_IDLE</tt>.
176 RCU uses the <tt>rcu_dynticks</tt> structure to track
177 which CPUs are in dyntick idle mode, as shown below:
179 </p><p><img src="BigTreeClassicRCUBHdyntick.svg" alt="BigTreeClassicRCUBHdyntick.svg" width="33%">
181 </p><p>However, if a CPU is in dyntick-idle mode, it is in that mode
182 for all flavors of RCU.
183 Therefore, a single <tt>rcu_dynticks</tt> structure is allocated per
184 CPU, and all of a given CPU's <tt>rcu_data</tt> structures share
185 that <tt>rcu_dynticks</tt>, as shown in the figure.
187 </p><p>Kernels built with <tt>CONFIG_PREEMPT_RCU</tt> support
188 rcu_preempt in addition to rcu_sched and rcu_bh, as shown below:
190 </p><p><img src="BigTreePreemptRCUBHdyntick.svg" alt="BigTreePreemptRCUBHdyntick.svg" width="35%">
192 </p><p>RCU updaters wait for normal grace periods by registering
193 RCU callbacks, either directly via <tt>call_rcu()</tt> and
194 friends (namely <tt>call_rcu_bh()</tt> and <tt>call_rcu_sched()</tt>),
195 there being a separate interface per flavor of RCU)
196 or indirectly via <tt>synchronize_rcu()</tt> and friends.
197 RCU callbacks are represented by <tt>rcu_head</tt> structures,
198 which are queued on <tt>rcu_data</tt> structures while they are
199 waiting for a grace period to elapse, as shown in the following figure:
201 </p><p><img src="BigTreePreemptRCUBHdyntickCB.svg" alt="BigTreePreemptRCUBHdyntickCB.svg" width="40%">
203 </p><p>This figure shows how <tt>TREE_RCU</tt>'s and
204 <tt>PREEMPT_RCU</tt>'s major data structures are related.
205 Lesser data structures will be introduced with the algorithms that
206 make use of them.
208 </p><p>Note that each of the data structures in the above figure has
209 its own synchronization:
211 <p><ol>
212 <li> Each <tt>rcu_state</tt> structures has a lock and a mutex,
213 and some fields are protected by the corresponding root
214 <tt>rcu_node</tt> structure's lock.
215 <li> Each <tt>rcu_node</tt> structure has a spinlock.
216 <li> The fields in <tt>rcu_data</tt> are private to the corresponding
217 CPU, although a few can be read and written by other CPUs.
218 <li> Similarly, the fields in <tt>rcu_dynticks</tt> are private
219 to the corresponding CPU, although a few can be read by
220 other CPUs.
221 </ol>
223 <p>It is important to note that different data structures can have
224 very different ideas about the state of RCU at any given time.
225 For but one example, awareness of the start or end of a given RCU
226 grace period propagates slowly through the data structures.
227 This slow propagation is absolutely necessary for RCU to have good
228 read-side performance.
229 If this balkanized implementation seems foreign to you, one useful
230 trick is to consider each instance of these data structures to be
231 a different person, each having the usual slightly different
232 view of reality.
234 </p><p>The general role of each of these data structures is as
235 follows:
237 </p><ol>
238 <li> <tt>rcu_state</tt>:
239 This structure forms the interconnection between the
240 <tt>rcu_node</tt> and <tt>rcu_data</tt> structures,
241 tracks grace periods, serves as short-term repository
242 for callbacks orphaned by CPU-hotplug events,
243 maintains <tt>rcu_barrier()</tt> state,
244 tracks expedited grace-period state,
245 and maintains state used to force quiescent states when
246 grace periods extend too long,
247 <li> <tt>rcu_node</tt>: This structure forms the combining
248 tree that propagates quiescent-state
249 information from the leaves to the root, and also propagates
250 grace-period information from the root to the leaves.
251 It provides local copies of the grace-period state in order
252 to allow this information to be accessed in a synchronized
253 manner without suffering the scalability limitations that
254 would otherwise be imposed by global locking.
255 In <tt>CONFIG_PREEMPT_RCU</tt> kernels, it manages the lists
256 of tasks that have blocked while in their current
257 RCU read-side critical section.
258 In <tt>CONFIG_PREEMPT_RCU</tt> with
259 <tt>CONFIG_RCU_BOOST</tt>, it manages the
260 per-<tt>rcu_node</tt> priority-boosting
261 kernel threads (kthreads) and state.
262 Finally, it records CPU-hotplug state in order to determine
263 which CPUs should be ignored during a given grace period.
264 <li> <tt>rcu_data</tt>: This per-CPU structure is the
265 focus of quiescent-state detection and RCU callback queuing.
266 It also tracks its relationship to the corresponding leaf
267 <tt>rcu_node</tt> structure to allow more-efficient
268 propagation of quiescent states up the <tt>rcu_node</tt>
269 combining tree.
270 Like the <tt>rcu_node</tt> structure, it provides a local
271 copy of the grace-period information to allow for-free
272 synchronized
273 access to this information from the corresponding CPU.
274 Finally, this structure records past dyntick-idle state
275 for the corresponding CPU and also tracks statistics.
276 <li> <tt>rcu_dynticks</tt>:
277 This per-CPU structure tracks the current dyntick-idle
278 state for the corresponding CPU.
279 Unlike the other three structures, the <tt>rcu_dynticks</tt>
280 structure is not replicated per RCU flavor.
281 <li> <tt>rcu_head</tt>:
282 This structure represents RCU callbacks, and is the
283 only structure allocated and managed by RCU users.
284 The <tt>rcu_head</tt> structure is normally embedded
285 within the RCU-protected data structure.
286 </ol>
288 <p>If all you wanted from this article was a general notion of how
289 RCU's data structures are related, you are done.
290 Otherwise, each of the following sections give more details on
291 the <tt>rcu_state</tt>, <tt>rcu_node</tt>, <tt>rcu_data</tt>,
292 and <tt>rcu_dynticks</tt> data structures.
294 <h3><a name="The rcu_state Structure">
295 The <tt>rcu_state</tt> Structure</a></h3>
297 <p>The <tt>rcu_state</tt> structure is the base structure that
298 represents a flavor of RCU.
299 This structure forms the interconnection between the
300 <tt>rcu_node</tt> and <tt>rcu_data</tt> structures,
301 tracks grace periods, contains the lock used to
302 synchronize with CPU-hotplug events,
303 and maintains state used to force quiescent states when
304 grace periods extend too long,
306 </p><p>A few of the <tt>rcu_state</tt> structure's fields are discussed,
307 singly and in groups, in the following sections.
308 The more specialized fields are covered in the discussion of their
309 use.
311 <h5>Relationship to rcu_node and rcu_data Structures</h5>
313 This portion of the <tt>rcu_state</tt> structure is declared
314 as follows:
316 <pre>
317 1 struct rcu_node node[NUM_RCU_NODES];
318 2 struct rcu_node *level[NUM_RCU_LVLS + 1];
319 3 struct rcu_data __percpu *rda;
320 </pre>
322 <table>
323 <tr><th>&nbsp;</th></tr>
324 <tr><th align="left">Quick Quiz:</th></tr>
325 <tr><td>
326 Wait a minute!
327 You said that the <tt>rcu_node</tt> structures formed a tree,
328 but they are declared as a flat array!
329 What gives?
330 </td></tr>
331 <tr><th align="left">Answer:</th></tr>
332 <tr><td bgcolor="#ffffff"><font color="ffffff">
333 The tree is laid out in the array.
334 The first node In the array is the head, the next set of nodes in the
335 array are children of the head node, and so on until the last set of
336 nodes in the array are the leaves.
337 </font>
339 <p><font color="ffffff">See the following diagrams to see how
340 this works.
341 </font></td></tr>
342 <tr><td>&nbsp;</td></tr>
343 </table>
345 <p>The <tt>rcu_node</tt> tree is embedded into the
346 <tt>-&gt;node[]</tt> array as shown in the following figure:
348 </p><p><img src="TreeMapping.svg" alt="TreeMapping.svg" width="40%">
350 </p><p>One interesting consequence of this mapping is that a
351 breadth-first traversal of the tree is implemented as a simple
352 linear scan of the array, which is in fact what the
353 <tt>rcu_for_each_node_breadth_first()</tt> macro does.
354 This macro is used at the beginning and ends of grace periods.
356 </p><p>Each entry of the <tt>-&gt;level</tt> array references
357 the first <tt>rcu_node</tt> structure on the corresponding level
358 of the tree, for example, as shown below:
360 </p><p><img src="TreeMappingLevel.svg" alt="TreeMappingLevel.svg" width="40%">
362 </p><p>The zero<sup>th</sup> element of the array references the root
363 <tt>rcu_node</tt> structure, the first element references the
364 first child of the root <tt>rcu_node</tt>, and finally the second
365 element references the first leaf <tt>rcu_node</tt> structure.
367 </p><p>For whatever it is worth, if you draw the tree to be tree-shaped
368 rather than array-shaped, it is easy to draw a planar representation:
370 </p><p><img src="TreeLevel.svg" alt="TreeLevel.svg" width="60%">
372 </p><p>Finally, the <tt>-&gt;rda</tt> field references a per-CPU
373 pointer to the corresponding CPU's <tt>rcu_data</tt> structure.
375 </p><p>All of these fields are constant once initialization is complete,
376 and therefore need no protection.
378 <h5>Grace-Period Tracking</h5>
380 <p>This portion of the <tt>rcu_state</tt> structure is declared
381 as follows:
383 <pre>
384 1 unsigned long gpnum;
385 2 unsigned long completed;
386 </pre>
388 <p>RCU grace periods are numbered, and
389 the <tt>-&gt;gpnum</tt> field contains the number of the grace
390 period that started most recently.
391 The <tt>-&gt;completed</tt> field contains the number of the
392 grace period that completed most recently.
393 If the two fields are equal, the RCU grace period that most recently
394 started has already completed, and therefore the corresponding
395 flavor of RCU is idle.
396 If <tt>-&gt;gpnum</tt> is one greater than <tt>-&gt;completed</tt>,
397 then <tt>-&gt;gpnum</tt> gives the number of the current RCU
398 grace period, which has not yet completed.
399 Any other combination of values indicates that something is broken.
400 These two fields are protected by the root <tt>rcu_node</tt>'s
401 <tt>-&gt;lock</tt> field.
403 </p><p>There are <tt>-&gt;gpnum</tt> and <tt>-&gt;completed</tt> fields
404 in the <tt>rcu_node</tt> and <tt>rcu_data</tt> structures
405 as well.
406 The fields in the <tt>rcu_state</tt> structure represent the
407 most current values, and those of the other structures are compared
408 in order to detect the start of a new grace period in a distributed
409 fashion.
410 The values flow from <tt>rcu_state</tt> to <tt>rcu_node</tt>
411 (down the tree from the root to the leaves) to <tt>rcu_data</tt>.
413 <h5>Miscellaneous</h5>
415 <p>This portion of the <tt>rcu_state</tt> structure is declared
416 as follows:
418 <pre>
419 1 unsigned long gp_max;
420 2 char abbr;
421 3 char *name;
422 </pre>
424 <p>The <tt>-&gt;gp_max</tt> field tracks the duration of the longest
425 grace period in jiffies.
426 It is protected by the root <tt>rcu_node</tt>'s <tt>-&gt;lock</tt>.
428 <p>The <tt>-&gt;name</tt> field points to the name of the RCU flavor
429 (for example, &ldquo;rcu_sched&rdquo;), and is constant.
430 The <tt>-&gt;abbr</tt> field contains a one-character abbreviation,
431 for example, &ldquo;s&rdquo; for RCU-sched.
433 <h3><a name="The rcu_node Structure">
434 The <tt>rcu_node</tt> Structure</a></h3>
436 <p>The <tt>rcu_node</tt> structures form the combining
437 tree that propagates quiescent-state
438 information from the leaves to the root and also that propagates
439 grace-period information from the root down to the leaves.
440 They provides local copies of the grace-period state in order
441 to allow this information to be accessed in a synchronized
442 manner without suffering the scalability limitations that
443 would otherwise be imposed by global locking.
444 In <tt>CONFIG_PREEMPT_RCU</tt> kernels, they manage the lists
445 of tasks that have blocked while in their current
446 RCU read-side critical section.
447 In <tt>CONFIG_PREEMPT_RCU</tt> with
448 <tt>CONFIG_RCU_BOOST</tt>, they manage the
449 per-<tt>rcu_node</tt> priority-boosting
450 kernel threads (kthreads) and state.
451 Finally, they record CPU-hotplug state in order to determine
452 which CPUs should be ignored during a given grace period.
454 </p><p>The <tt>rcu_node</tt> structure's fields are discussed,
455 singly and in groups, in the following sections.
457 <h5>Connection to Combining Tree</h5>
459 <p>This portion of the <tt>rcu_node</tt> structure is declared
460 as follows:
462 <pre>
463 1 struct rcu_node *parent;
464 2 u8 level;
465 3 u8 grpnum;
466 4 unsigned long grpmask;
467 5 int grplo;
468 6 int grphi;
469 </pre>
471 <p>The <tt>-&gt;parent</tt> pointer references the <tt>rcu_node</tt>
472 one level up in the tree, and is <tt>NULL</tt> for the root
473 <tt>rcu_node</tt>.
474 The RCU implementation makes heavy use of this field to push quiescent
475 states up the tree.
476 The <tt>-&gt;level</tt> field gives the level in the tree, with
477 the root being at level zero, its children at level one, and so on.
478 The <tt>-&gt;grpnum</tt> field gives this node's position within
479 the children of its parent, so this number can range between 0 and 31
480 on 32-bit systems and between 0 and 63 on 64-bit systems.
481 The <tt>-&gt;level</tt> and <tt>-&gt;grpnum</tt> fields are
482 used only during initialization and for tracing.
483 The <tt>-&gt;grpmask</tt> field is the bitmask counterpart of
484 <tt>-&gt;grpnum</tt>, and therefore always has exactly one bit set.
485 This mask is used to clear the bit corresponding to this <tt>rcu_node</tt>
486 structure in its parent's bitmasks, which are described later.
487 Finally, the <tt>-&gt;grplo</tt> and <tt>-&gt;grphi</tt> fields
488 contain the lowest and highest numbered CPU served by this
489 <tt>rcu_node</tt> structure, respectively.
491 </p><p>All of these fields are constant, and thus do not require any
492 synchronization.
494 <h5>Synchronization</h5>
496 <p>This field of the <tt>rcu_node</tt> structure is declared
497 as follows:
499 <pre>
500 1 raw_spinlock_t lock;
501 </pre>
503 <p>This field is used to protect the remaining fields in this structure,
504 unless otherwise stated.
505 That said, all of the fields in this structure can be accessed without
506 locking for tracing purposes.
507 Yes, this can result in confusing traces, but better some tracing confusion
508 than to be heisenbugged out of existence.
510 <h5>Grace-Period Tracking</h5>
512 <p>This portion of the <tt>rcu_node</tt> structure is declared
513 as follows:
515 <pre>
516 1 unsigned long gpnum;
517 2 unsigned long completed;
518 </pre>
520 <p>These fields are the counterparts of the fields of the same name in
521 the <tt>rcu_state</tt> structure.
522 They each may lag up to one behind their <tt>rcu_state</tt>
523 counterparts.
524 If a given <tt>rcu_node</tt> structure's <tt>-&gt;gpnum</tt> and
525 <tt>-&gt;complete</tt> fields are equal, then this <tt>rcu_node</tt>
526 structure believes that RCU is idle.
527 Otherwise, as with the <tt>rcu_state</tt> structure,
528 the <tt>-&gt;gpnum</tt> field will be one greater than the
529 <tt>-&gt;complete</tt> fields, with <tt>-&gt;gpnum</tt>
530 indicating which grace period this <tt>rcu_node</tt> believes
531 is still being waited for.
533 </p><p>The <tt>&gt;gpnum</tt> field of each <tt>rcu_node</tt>
534 structure is updated at the beginning
535 of each grace period, and the <tt>-&gt;completed</tt> fields are
536 updated at the end of each grace period.
538 <h5>Quiescent-State Tracking</h5>
540 <p>These fields manage the propagation of quiescent states up the
541 combining tree.
543 </p><p>This portion of the <tt>rcu_node</tt> structure has fields
544 as follows:
546 <pre>
547 1 unsigned long qsmask;
548 2 unsigned long expmask;
549 3 unsigned long qsmaskinit;
550 4 unsigned long expmaskinit;
551 </pre>
553 <p>The <tt>-&gt;qsmask</tt> field tracks which of this
554 <tt>rcu_node</tt> structure's children still need to report
555 quiescent states for the current normal grace period.
556 Such children will have a value of 1 in their corresponding bit.
557 Note that the leaf <tt>rcu_node</tt> structures should be
558 thought of as having <tt>rcu_data</tt> structures as their
559 children.
560 Similarly, the <tt>-&gt;expmask</tt> field tracks which
561 of this <tt>rcu_node</tt> structure's children still need to report
562 quiescent states for the current expedited grace period.
563 An expedited grace period has
564 the same conceptual properties as a normal grace period, but the
565 expedited implementation accepts extreme CPU overhead to obtain
566 much lower grace-period latency, for example, consuming a few
567 tens of microseconds worth of CPU time to reduce grace-period
568 duration from milliseconds to tens of microseconds.
569 The <tt>-&gt;qsmaskinit</tt> field tracks which of this
570 <tt>rcu_node</tt> structure's children cover for at least
571 one online CPU.
572 This mask is used to initialize <tt>-&gt;qsmask</tt>,
573 and <tt>-&gt;expmaskinit</tt> is used to initialize
574 <tt>-&gt;expmask</tt> and the beginning of the
575 normal and expedited grace periods, respectively.
577 <table>
578 <tr><th>&nbsp;</th></tr>
579 <tr><th align="left">Quick Quiz:</th></tr>
580 <tr><td>
581 Why are these bitmasks protected by locking?
582 Come on, haven't you heard of atomic instructions???
583 </td></tr>
584 <tr><th align="left">Answer:</th></tr>
585 <tr><td bgcolor="#ffffff"><font color="ffffff">
586 Lockless grace-period computation! Such a tantalizing possibility!
587 </font>
589 <p><font color="ffffff">But consider the following sequence of events:
590 </font>
592 <ol>
593 <li> <font color="ffffff">CPU&nbsp;0 has been in dyntick-idle
594 mode for quite some time.
595 When it wakes up, it notices that the current RCU
596 grace period needs it to report in, so it sets a
597 flag where the scheduling clock interrupt will find it.
598 </font><p>
599 <li> <font color="ffffff">Meanwhile, CPU&nbsp;1 is running
600 <tt>force_quiescent_state()</tt>,
601 and notices that CPU&nbsp;0 has been in dyntick idle mode,
602 which qualifies as an extended quiescent state.
603 </font><p>
604 <li> <font color="ffffff">CPU&nbsp;0's scheduling clock
605 interrupt fires in the
606 middle of an RCU read-side critical section, and notices
607 that the RCU core needs something, so commences RCU softirq
608 processing.
609 </font>
611 <li> <font color="ffffff">CPU&nbsp;0's softirq handler
612 executes and is just about ready
613 to report its quiescent state up the <tt>rcu_node</tt>
614 tree.
615 </font><p>
616 <li> <font color="ffffff">But CPU&nbsp;1 beats it to the punch,
617 completing the current
618 grace period and starting a new one.
619 </font><p>
620 <li> <font color="ffffff">CPU&nbsp;0 now reports its quiescent
621 state for the wrong
622 grace period.
623 That grace period might now end before the RCU read-side
624 critical section.
625 If that happens, disaster will ensue.
626 </font>
627 </ol>
629 <p><font color="ffffff">So the locking is absolutely required in
630 order to coordinate
631 clearing of the bits with the grace-period numbers in
632 <tt>-&gt;gpnum</tt> and <tt>-&gt;completed</tt>.
633 </font></td></tr>
634 <tr><td>&nbsp;</td></tr>
635 </table>
637 <h5>Blocked-Task Management</h5>
639 <p><tt>PREEMPT_RCU</tt> allows tasks to be preempted in the
640 midst of their RCU read-side critical sections, and these tasks
641 must be tracked explicitly.
642 The details of exactly why and how they are tracked will be covered
643 in a separate article on RCU read-side processing.
644 For now, it is enough to know that the <tt>rcu_node</tt>
645 structure tracks them.
647 <pre>
648 1 struct list_head blkd_tasks;
649 2 struct list_head *gp_tasks;
650 3 struct list_head *exp_tasks;
651 4 bool wait_blkd_tasks;
652 </pre>
654 <p>The <tt>-&gt;blkd_tasks</tt> field is a list header for
655 the list of blocked and preempted tasks.
656 As tasks undergo context switches within RCU read-side critical
657 sections, their <tt>task_struct</tt> structures are enqueued
658 (via the <tt>task_struct</tt>'s <tt>-&gt;rcu_node_entry</tt>
659 field) onto the head of the <tt>-&gt;blkd_tasks</tt> list for the
660 leaf <tt>rcu_node</tt> structure corresponding to the CPU
661 on which the outgoing context switch executed.
662 As these tasks later exit their RCU read-side critical sections,
663 they remove themselves from the list.
664 This list is therefore in reverse time order, so that if one of the tasks
665 is blocking the current grace period, all subsequent tasks must
666 also be blocking that same grace period.
667 Therefore, a single pointer into this list suffices to track
668 all tasks blocking a given grace period.
669 That pointer is stored in <tt>-&gt;gp_tasks</tt> for normal
670 grace periods and in <tt>-&gt;exp_tasks</tt> for expedited
671 grace periods.
672 These last two fields are <tt>NULL</tt> if either there is
673 no grace period in flight or if there are no blocked tasks
674 preventing that grace period from completing.
675 If either of these two pointers is referencing a task that
676 removes itself from the <tt>-&gt;blkd_tasks</tt> list,
677 then that task must advance the pointer to the next task on
678 the list, or set the pointer to <tt>NULL</tt> if there
679 are no subsequent tasks on the list.
681 </p><p>For example, suppose that tasks&nbsp;T1, T2, and&nbsp;T3 are
682 all hard-affinitied to the largest-numbered CPU in the system.
683 Then if task&nbsp;T1 blocked in an RCU read-side
684 critical section, then an expedited grace period started,
685 then task&nbsp;T2 blocked in an RCU read-side critical section,
686 then a normal grace period started, and finally task&nbsp;3 blocked
687 in an RCU read-side critical section, then the state of the
688 last leaf <tt>rcu_node</tt> structure's blocked-task list
689 would be as shown below:
691 </p><p><img src="blkd_task.svg" alt="blkd_task.svg" width="60%">
693 </p><p>Task&nbsp;T1 is blocking both grace periods, task&nbsp;T2 is
694 blocking only the normal grace period, and task&nbsp;T3 is blocking
695 neither grace period.
696 Note that these tasks will not remove themselves from this list
697 immediately upon resuming execution.
698 They will instead remain on the list until they execute the outermost
699 <tt>rcu_read_unlock()</tt> that ends their RCU read-side critical
700 section.
703 The <tt>-&gt;wait_blkd_tasks</tt> field indicates whether or not
704 the current grace period is waiting on a blocked task.
706 <h5>Sizing the <tt>rcu_node</tt> Array</h5>
708 <p>The <tt>rcu_node</tt> array is sized via a series of
709 C-preprocessor expressions as follows:
711 <pre>
712 1 #ifdef CONFIG_RCU_FANOUT
713 2 #define RCU_FANOUT CONFIG_RCU_FANOUT
714 3 #else
715 4 # ifdef CONFIG_64BIT
716 5 # define RCU_FANOUT 64
717 6 # else
718 7 # define RCU_FANOUT 32
719 8 # endif
720 9 #endif
722 11 #ifdef CONFIG_RCU_FANOUT_LEAF
723 12 #define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF
724 13 #else
725 14 # ifdef CONFIG_64BIT
726 15 # define RCU_FANOUT_LEAF 64
727 16 # else
728 17 # define RCU_FANOUT_LEAF 32
729 18 # endif
730 19 #endif
732 21 #define RCU_FANOUT_1 (RCU_FANOUT_LEAF)
733 22 #define RCU_FANOUT_2 (RCU_FANOUT_1 * RCU_FANOUT)
734 23 #define RCU_FANOUT_3 (RCU_FANOUT_2 * RCU_FANOUT)
735 24 #define RCU_FANOUT_4 (RCU_FANOUT_3 * RCU_FANOUT)
737 26 #if NR_CPUS &lt;= RCU_FANOUT_1
738 27 # define RCU_NUM_LVLS 1
739 28 # define NUM_RCU_LVL_0 1
740 29 # define NUM_RCU_NODES NUM_RCU_LVL_0
741 30 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0 }
742 31 # define RCU_NODE_NAME_INIT { "rcu_node_0" }
743 32 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0" }
744 33 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0" }
745 34 #elif NR_CPUS &lt;= RCU_FANOUT_2
746 35 # define RCU_NUM_LVLS 2
747 36 # define NUM_RCU_LVL_0 1
748 37 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
749 38 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1)
750 39 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1 }
751 40 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1" }
752 41 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1" }
753 42 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1" }
754 43 #elif NR_CPUS &lt;= RCU_FANOUT_3
755 44 # define RCU_NUM_LVLS 3
756 45 # define NUM_RCU_LVL_0 1
757 46 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
758 47 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
759 48 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2)
760 49 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 }
761 50 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2" }
762 51 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" }
763 52 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2" }
764 53 #elif NR_CPUS &lt;= RCU_FANOUT_4
765 54 # define RCU_NUM_LVLS 4
766 55 # define NUM_RCU_LVL_0 1
767 56 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3)
768 57 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
769 58 # define NUM_RCU_LVL_3 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
770 59 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3)
771 60 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 }
772 61 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" }
773 62 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" }
774 63 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2", "rcu_node_exp_3" }
775 64 #else
776 65 # error "CONFIG_RCU_FANOUT insufficient for NR_CPUS"
777 66 #endif
778 </pre>
780 <p>The maximum number of levels in the <tt>rcu_node</tt> structure
781 is currently limited to four, as specified by lines&nbsp;21-24
782 and the structure of the subsequent &ldquo;if&rdquo; statement.
783 For 32-bit systems, this allows 16*32*32*32=524,288 CPUs, which
784 should be sufficient for the next few years at least.
785 For 64-bit systems, 16*64*64*64=4,194,304 CPUs is allowed, which
786 should see us through the next decade or so.
787 This four-level tree also allows kernels built with
788 <tt>CONFIG_RCU_FANOUT=8</tt> to support up to 4096 CPUs,
789 which might be useful in very large systems having eight CPUs per
790 socket (but please note that no one has yet shown any measurable
791 performance degradation due to misaligned socket and <tt>rcu_node</tt>
792 boundaries).
793 In addition, building kernels with a full four levels of <tt>rcu_node</tt>
794 tree permits better testing of RCU's combining-tree code.
796 </p><p>The <tt>RCU_FANOUT</tt> symbol controls how many children
797 are permitted at each non-leaf level of the <tt>rcu_node</tt> tree.
798 If the <tt>CONFIG_RCU_FANOUT</tt> Kconfig option is not specified,
799 it is set based on the word size of the system, which is also
800 the Kconfig default.
802 </p><p>The <tt>RCU_FANOUT_LEAF</tt> symbol controls how many CPUs are
803 handled by each leaf <tt>rcu_node</tt> structure.
804 Experience has shown that allowing a given leaf <tt>rcu_node</tt>
805 structure to handle 64 CPUs, as permitted by the number of bits in
806 the <tt>-&gt;qsmask</tt> field on a 64-bit system, results in
807 excessive contention for the leaf <tt>rcu_node</tt> structures'
808 <tt>-&gt;lock</tt> fields.
809 The number of CPUs per leaf <tt>rcu_node</tt> structure is therefore
810 limited to 16 given the default value of <tt>CONFIG_RCU_FANOUT_LEAF</tt>.
811 If <tt>CONFIG_RCU_FANOUT_LEAF</tt> is unspecified, the value
812 selected is based on the word size of the system, just as for
813 <tt>CONFIG_RCU_FANOUT</tt>.
814 Lines&nbsp;11-19 perform this computation.
816 </p><p>Lines&nbsp;21-24 compute the maximum number of CPUs supported by
817 a single-level (which contains a single <tt>rcu_node</tt> structure),
818 two-level, three-level, and four-level <tt>rcu_node</tt> tree,
819 respectively, given the fanout specified by <tt>RCU_FANOUT</tt>
820 and <tt>RCU_FANOUT_LEAF</tt>.
821 These numbers of CPUs are retained in the
822 <tt>RCU_FANOUT_1</tt>,
823 <tt>RCU_FANOUT_2</tt>,
824 <tt>RCU_FANOUT_3</tt>, and
825 <tt>RCU_FANOUT_4</tt>
826 C-preprocessor variables, respectively.
828 </p><p>These variables are used to control the C-preprocessor <tt>#if</tt>
829 statement spanning lines&nbsp;26-66 that computes the number of
830 <tt>rcu_node</tt> structures required for each level of the tree,
831 as well as the number of levels required.
832 The number of levels is placed in the <tt>NUM_RCU_LVLS</tt>
833 C-preprocessor variable by lines&nbsp;27, 35, 44, and&nbsp;54.
834 The number of <tt>rcu_node</tt> structures for the topmost level
835 of the tree is always exactly one, and this value is unconditionally
836 placed into <tt>NUM_RCU_LVL_0</tt> by lines&nbsp;28, 36, 45, and&nbsp;55.
837 The rest of the levels (if any) of the <tt>rcu_node</tt> tree
838 are computed by dividing the maximum number of CPUs by the
839 fanout supported by the number of levels from the current level down,
840 rounding up. This computation is performed by lines&nbsp;37,
841 46-47, and&nbsp;56-58.
842 Lines&nbsp;31-33, 40-42, 50-52, and&nbsp;62-63 create initializers
843 for lockdep lock-class names.
844 Finally, lines&nbsp;64-66 produce an error if the maximum number of
845 CPUs is too large for the specified fanout.
847 <h3><a name="The rcu_data Structure">
848 The <tt>rcu_data</tt> Structure</a></h3>
850 <p>The <tt>rcu_data</tt> maintains the per-CPU state for the
851 corresponding flavor of RCU.
852 The fields in this structure may be accessed only from the corresponding
853 CPU (and from tracing) unless otherwise stated.
854 This structure is the
855 focus of quiescent-state detection and RCU callback queuing.
856 It also tracks its relationship to the corresponding leaf
857 <tt>rcu_node</tt> structure to allow more-efficient
858 propagation of quiescent states up the <tt>rcu_node</tt>
859 combining tree.
860 Like the <tt>rcu_node</tt> structure, it provides a local
861 copy of the grace-period information to allow for-free
862 synchronized
863 access to this information from the corresponding CPU.
864 Finally, this structure records past dyntick-idle state
865 for the corresponding CPU and also tracks statistics.
867 </p><p>The <tt>rcu_data</tt> structure's fields are discussed,
868 singly and in groups, in the following sections.
870 <h5>Connection to Other Data Structures</h5>
872 <p>This portion of the <tt>rcu_data</tt> structure is declared
873 as follows:
875 <pre>
876 1 int cpu;
877 2 struct rcu_state *rsp;
878 3 struct rcu_node *mynode;
879 4 struct rcu_dynticks *dynticks;
880 5 unsigned long grpmask;
881 6 bool beenonline;
882 </pre>
884 <p>The <tt>-&gt;cpu</tt> field contains the number of the
885 corresponding CPU, the <tt>-&gt;rsp</tt> pointer references
886 the corresponding <tt>rcu_state</tt> structure (and is most frequently
887 used to locate the name of the corresponding flavor of RCU for tracing),
888 and the <tt>-&gt;mynode</tt> field references the corresponding
889 <tt>rcu_node</tt> structure.
890 The <tt>-&gt;mynode</tt> is used to propagate quiescent states
891 up the combining tree.
892 <p>The <tt>-&gt;dynticks</tt> pointer references the
893 <tt>rcu_dynticks</tt> structure corresponding to this
894 CPU.
895 Recall that a single per-CPU instance of the <tt>rcu_dynticks</tt>
896 structure is shared among all flavors of RCU.
897 These first four fields are constant and therefore require not
898 synchronization.
900 </p><p>The <tt>-&gt;grpmask</tt> field indicates the bit in
901 the <tt>-&gt;mynode-&gt;qsmask</tt> corresponding to this
902 <tt>rcu_data</tt> structure, and is also used when propagating
903 quiescent states.
904 The <tt>-&gt;beenonline</tt> flag is set whenever the corresponding
905 CPU comes online, which means that the debugfs tracing need not dump
906 out any <tt>rcu_data</tt> structure for which this flag is not set.
908 <h5>Quiescent-State and Grace-Period Tracking</h5>
910 <p>This portion of the <tt>rcu_data</tt> structure is declared
911 as follows:
913 <pre>
914 1 unsigned long completed;
915 2 unsigned long gpnum;
916 3 bool cpu_no_qs;
917 4 bool core_needs_qs;
918 5 bool gpwrap;
919 6 unsigned long rcu_qs_ctr_snap;
920 </pre>
922 <p>The <tt>completed</tt> and <tt>gpnum</tt>
923 fields are the counterparts of the fields of the same name
924 in the <tt>rcu_state</tt> and <tt>rcu_node</tt> structures.
925 They may each lag up to one behind their <tt>rcu_node</tt>
926 counterparts, but in <tt>CONFIG_NO_HZ_IDLE</tt> and
927 <tt>CONFIG_NO_HZ_FULL</tt> kernels can lag
928 arbitrarily far behind for CPUs in dyntick-idle mode (but these counters
929 will catch up upon exit from dyntick-idle mode).
930 If a given <tt>rcu_data</tt> structure's <tt>-&gt;gpnum</tt> and
931 <tt>-&gt;complete</tt> fields are equal, then this <tt>rcu_data</tt>
932 structure believes that RCU is idle.
933 Otherwise, as with the <tt>rcu_state</tt> and <tt>rcu_node</tt>
934 structure,
935 the <tt>-&gt;gpnum</tt> field will be one greater than the
936 <tt>-&gt;complete</tt> fields, with <tt>-&gt;gpnum</tt>
937 indicating which grace period this <tt>rcu_data</tt> believes
938 is still being waited for.
940 <table>
941 <tr><th>&nbsp;</th></tr>
942 <tr><th align="left">Quick Quiz:</th></tr>
943 <tr><td>
944 All this replication of the grace period numbers can only cause
945 massive confusion.
946 Why not just keep a global pair of counters and be done with it???
947 </td></tr>
948 <tr><th align="left">Answer:</th></tr>
949 <tr><td bgcolor="#ffffff"><font color="ffffff">
950 Because if there was only a single global pair of grace-period
951 numbers, there would need to be a single global lock to allow
952 safely accessing and updating them.
953 And if we are not going to have a single global lock, we need
954 to carefully manage the numbers on a per-node basis.
955 Recall from the answer to a previous Quick Quiz that the consequences
956 of applying a previously sampled quiescent state to the wrong
957 grace period are quite severe.
958 </font></td></tr>
959 <tr><td>&nbsp;</td></tr>
960 </table>
962 <p>The <tt>-&gt;cpu_no_qs</tt> flag indicates that the
963 CPU has not yet passed through a quiescent state,
964 while the <tt>-&gt;core_needs_qs</tt> flag indicates that the
965 RCU core needs a quiescent state from the corresponding CPU.
966 The <tt>-&gt;gpwrap</tt> field indicates that the corresponding
967 CPU has remained idle for so long that the <tt>completed</tt>
968 and <tt>gpnum</tt> counters are in danger of overflow, which
969 will cause the CPU to disregard the values of its counters on
970 its next exit from idle.
971 Finally, the <tt>rcu_qs_ctr_snap</tt> field is used to detect
972 cases where a given operation has resulted in a quiescent state
973 for all flavors of RCU, for example, <tt>cond_resched_rcu_qs()</tt>.
975 <h5>RCU Callback Handling</h5>
977 <p>In the absence of CPU-hotplug events, RCU callbacks are invoked by
978 the same CPU that registered them.
979 This is strictly a cache-locality optimization: callbacks can and
980 do get invoked on CPUs other than the one that registered them.
981 After all, if the CPU that registered a given callback has gone
982 offline before the callback can be invoked, there really is no other
983 choice.
985 </p><p>This portion of the <tt>rcu_data</tt> structure is declared
986 as follows:
988 <pre>
989 1 struct rcu_head *nxtlist;
990 2 struct rcu_head **nxttail[RCU_NEXT_SIZE];
991 3 unsigned long nxtcompleted[RCU_NEXT_SIZE];
992 4 long qlen_lazy;
993 5 long qlen;
994 6 long qlen_last_fqs_check;
995 7 unsigned long n_force_qs_snap;
996 8 unsigned long n_cbs_invoked;
997 9 unsigned long n_cbs_orphaned;
998 10 unsigned long n_cbs_adopted;
999 11 long blimit;
1000 </pre>
1002 <p>The <tt>-&gt;nxtlist</tt> pointer and the
1003 <tt>-&gt;nxttail[]</tt> array form a four-segment list with
1004 older callbacks near the head and newer ones near the tail.
1005 Each segment contains callbacks with the corresponding relationship
1006 to the current grace period.
1007 The pointer out of the end of each of the four segments is referenced
1008 by the element of the <tt>-&gt;nxttail[]</tt> array indexed by
1009 <tt>RCU_DONE_TAIL</tt> (for callbacks handled by a prior grace period),
1010 <tt>RCU_WAIT_TAIL</tt> (for callbacks waiting on the current grace period),
1011 <tt>RCU_NEXT_READY_TAIL</tt> (for callbacks that will wait on the next
1012 grace period), and
1013 <tt>RCU_NEXT_TAIL</tt> (for callbacks that are not yet associated
1014 with a specific grace period)
1015 respectively, as shown in the following figure.
1017 </p><p><img src="nxtlist.svg" alt="nxtlist.svg" width="40%">
1019 </p><p>In this figure, the <tt>-&gt;nxtlist</tt> pointer references the
1020 first
1021 RCU callback in the list.
1022 The <tt>-&gt;nxttail[RCU_DONE_TAIL]</tt> array element references
1023 the <tt>-&gt;nxtlist</tt> pointer itself, indicating that none
1024 of the callbacks is ready to invoke.
1025 The <tt>-&gt;nxttail[RCU_WAIT_TAIL]</tt> array element references callback
1026 CB&nbsp;2's <tt>-&gt;next</tt> pointer, which indicates that
1027 CB&nbsp;1 and CB&nbsp;2 are both waiting on the current grace period.
1028 The <tt>-&gt;nxttail[RCU_NEXT_READY_TAIL]</tt> array element
1029 references the same RCU callback that <tt>-&gt;nxttail[RCU_WAIT_TAIL]</tt>
1030 does, which indicates that there are no callbacks waiting on the next
1031 RCU grace period.
1032 The <tt>-&gt;nxttail[RCU_NEXT_TAIL]</tt> array element references
1033 CB&nbsp;4's <tt>-&gt;next</tt> pointer, indicating that all the
1034 remaining RCU callbacks have not yet been assigned to an RCU grace
1035 period.
1036 Note that the <tt>-&gt;nxttail[RCU_NEXT_TAIL]</tt> array element
1037 always references the last RCU callback's <tt>-&gt;next</tt> pointer
1038 unless the callback list is empty, in which case it references
1039 the <tt>-&gt;nxtlist</tt> pointer.
1041 </p><p>CPUs advance their callbacks from the
1042 <tt>RCU_NEXT_TAIL</tt> to the <tt>RCU_NEXT_READY_TAIL</tt> to the
1043 <tt>RCU_WAIT_TAIL</tt> to the <tt>RCU_DONE_TAIL</tt> list segments
1044 as grace periods advance.
1045 The CPU advances the callbacks in its <tt>rcu_data</tt> structure
1046 whenever it notices that another RCU grace period has completed.
1047 The CPU detects the completion of an RCU grace period by noticing
1048 that the value of its <tt>rcu_data</tt> structure's
1049 <tt>-&gt;completed</tt> field differs from that of its leaf
1050 <tt>rcu_node</tt> structure.
1051 Recall that each <tt>rcu_node</tt> structure's
1052 <tt>-&gt;completed</tt> field is updated at the end of each
1053 grace period.
1055 </p><p>The <tt>-&gt;nxtcompleted[]</tt> array records grace-period
1056 numbers corresponding to the list segments.
1057 This allows CPUs that go idle for extended periods to determine
1058 which of their callbacks are ready to be invoked after reawakening.
1060 </p><p>The <tt>-&gt;qlen</tt> counter contains the number of
1061 callbacks in <tt>-&gt;nxtlist</tt>, and the
1062 <tt>-&gt;qlen_lazy</tt> contains the number of those callbacks that
1063 are known to only free memory, and whose invocation can therefore
1064 be safely deferred.
1065 The <tt>-&gt;qlen_last_fqs_check</tt> and
1066 <tt>-&gt;n_force_qs_snap</tt> coordinate the forcing of quiescent
1067 states from <tt>call_rcu()</tt> and friends when callback
1068 lists grow excessively long.
1070 </p><p>The <tt>-&gt;n_cbs_invoked</tt>,
1071 <tt>-&gt;n_cbs_orphaned</tt>, and <tt>-&gt;n_cbs_adopted</tt>
1072 fields count the number of callbacks invoked,
1073 sent to other CPUs when this CPU goes offline,
1074 and received from other CPUs when those other CPUs go offline.
1075 Finally, the <tt>-&gt;blimit</tt> counter is the maximum number of
1076 RCU callbacks that may be invoked at a given time.
1078 <h5>Dyntick-Idle Handling</h5>
1080 <p>This portion of the <tt>rcu_data</tt> structure is declared
1081 as follows:
1083 <pre>
1084 1 int dynticks_snap;
1085 2 unsigned long dynticks_fqs;
1086 </pre>
1088 The <tt>-&gt;dynticks_snap</tt> field is used to take a snapshot
1089 of the corresponding CPU's dyntick-idle state when forcing
1090 quiescent states, and is therefore accessed from other CPUs.
1091 Finally, the <tt>-&gt;dynticks_fqs</tt> field is used to
1092 count the number of times this CPU is determined to be in
1093 dyntick-idle state, and is used for tracing and debugging purposes.
1095 <h3><a name="The rcu_dynticks Structure">
1096 The <tt>rcu_dynticks</tt> Structure</a></h3>
1098 <p>The <tt>rcu_dynticks</tt> maintains the per-CPU dyntick-idle state
1099 for the corresponding CPU.
1100 Unlike the other structures, <tt>rcu_dynticks</tt> is not
1101 replicated over the different flavors of RCU.
1102 The fields in this structure may be accessed only from the corresponding
1103 CPU (and from tracing) unless otherwise stated.
1104 Its fields are as follows:
1106 <pre>
1107 1 int dynticks_nesting;
1108 2 int dynticks_nmi_nesting;
1109 3 atomic_t dynticks;
1110 </pre>
1112 <p>The <tt>-&gt;dynticks_nesting</tt> field counts the
1113 nesting depth of normal interrupts.
1114 In addition, this counter is incremented when exiting dyntick-idle
1115 mode and decremented when entering it.
1116 This counter can therefore be thought of as counting the number
1117 of reasons why this CPU cannot be permitted to enter dyntick-idle
1118 mode, aside from non-maskable interrupts (NMIs).
1119 NMIs are counted by the <tt>-&gt;dynticks_nmi_nesting</tt>
1120 field, except that NMIs that interrupt non-dyntick-idle execution
1121 are not counted.
1123 </p><p>Finally, the <tt>-&gt;dynticks</tt> field counts the corresponding
1124 CPU's transitions to and from dyntick-idle mode, so that this counter
1125 has an even value when the CPU is in dyntick-idle mode and an odd
1126 value otherwise.
1128 <table>
1129 <tr><th>&nbsp;</th></tr>
1130 <tr><th align="left">Quick Quiz:</th></tr>
1131 <tr><td>
1132 Why not just count all NMIs?
1133 Wouldn't that be simpler and less error prone?
1134 </td></tr>
1135 <tr><th align="left">Answer:</th></tr>
1136 <tr><td bgcolor="#ffffff"><font color="ffffff">
1137 It seems simpler only until you think hard about how to go about
1138 updating the <tt>rcu_dynticks</tt> structure's
1139 <tt>-&gt;dynticks</tt> field.
1140 </font></td></tr>
1141 <tr><td>&nbsp;</td></tr>
1142 </table>
1144 <p>Additional fields are present for some special-purpose
1145 builds, and are discussed separately.
1147 <h3><a name="The rcu_head Structure">
1148 The <tt>rcu_head</tt> Structure</a></h3>
1150 <p>Each <tt>rcu_head</tt> structure represents an RCU callback.
1151 These structures are normally embedded within RCU-protected data
1152 structures whose algorithms use asynchronous grace periods.
1153 In contrast, when using algorithms that block waiting for RCU grace periods,
1154 RCU users need not provide <tt>rcu_head</tt> structures.
1156 </p><p>The <tt>rcu_head</tt> structure has fields as follows:
1158 <pre>
1159 1 struct rcu_head *next;
1160 2 void (*func)(struct rcu_head *head);
1161 </pre>
1163 <p>The <tt>-&gt;next</tt> field is used
1164 to link the <tt>rcu_head</tt> structures together in the
1165 lists within the <tt>rcu_data</tt> structures.
1166 The <tt>-&gt;func</tt> field is a pointer to the function
1167 to be called when the callback is ready to be invoked, and
1168 this function is passed a pointer to the <tt>rcu_head</tt>
1169 structure.
1170 However, <tt>kfree_rcu()</tt> uses the <tt>-&gt;func</tt>
1171 field to record the offset of the <tt>rcu_head</tt>
1172 structure within the enclosing RCU-protected data structure.
1174 </p><p>Both of these fields are used internally by RCU.
1175 From the viewpoint of RCU users, this structure is an
1176 opaque &ldquo;cookie&rdquo;.
1178 <table>
1179 <tr><th>&nbsp;</th></tr>
1180 <tr><th align="left">Quick Quiz:</th></tr>
1181 <tr><td>
1182 Given that the callback function <tt>-&gt;func</tt>
1183 is passed a pointer to the <tt>rcu_head</tt> structure,
1184 how is that function supposed to find the beginning of the
1185 enclosing RCU-protected data structure?
1186 </td></tr>
1187 <tr><th align="left">Answer:</th></tr>
1188 <tr><td bgcolor="#ffffff"><font color="ffffff">
1189 In actual practice, there is a separate callback function per
1190 type of RCU-protected data structure.
1191 The callback function can therefore use the <tt>container_of()</tt>
1192 macro in the Linux kernel (or other pointer-manipulation facilities
1193 in other software environments) to find the beginning of the
1194 enclosing structure.
1195 </font></td></tr>
1196 <tr><td>&nbsp;</td></tr>
1197 </table>
1199 <h3><a name="RCU-Specific Fields in the task_struct Structure">
1200 RCU-Specific Fields in the <tt>task_struct</tt> Structure</a></h3>
1202 <p>The <tt>CONFIG_PREEMPT_RCU</tt> implementation uses some
1203 additional fields in the <tt>task_struct</tt> structure:
1205 <pre>
1206 1 #ifdef CONFIG_PREEMPT_RCU
1207 2 int rcu_read_lock_nesting;
1208 3 union rcu_special rcu_read_unlock_special;
1209 4 struct list_head rcu_node_entry;
1210 5 struct rcu_node *rcu_blocked_node;
1211 6 #endif /* #ifdef CONFIG_PREEMPT_RCU */
1212 7 #ifdef CONFIG_TASKS_RCU
1213 8 unsigned long rcu_tasks_nvcsw;
1214 9 bool rcu_tasks_holdout;
1215 10 struct list_head rcu_tasks_holdout_list;
1216 11 int rcu_tasks_idle_cpu;
1217 12 #endif /* #ifdef CONFIG_TASKS_RCU */
1218 </pre>
1220 <p>The <tt>-&gt;rcu_read_lock_nesting</tt> field records the
1221 nesting level for RCU read-side critical sections, and
1222 the <tt>-&gt;rcu_read_unlock_special</tt> field is a bitmask
1223 that records special conditions that require <tt>rcu_read_unlock()</tt>
1224 to do additional work.
1225 The <tt>-&gt;rcu_node_entry</tt> field is used to form lists of
1226 tasks that have blocked within preemptible-RCU read-side critical
1227 sections and the <tt>-&gt;rcu_blocked_node</tt> field references
1228 the <tt>rcu_node</tt> structure whose list this task is a member of,
1229 or <tt>NULL</tt> if it is not blocked within a preemptible-RCU
1230 read-side critical section.
1232 <p>The <tt>-&gt;rcu_tasks_nvcsw</tt> field tracks the number of
1233 voluntary context switches that this task had undergone at the
1234 beginning of the current tasks-RCU grace period,
1235 <tt>-&gt;rcu_tasks_holdout</tt> is set if the current tasks-RCU
1236 grace period is waiting on this task, <tt>-&gt;rcu_tasks_holdout_list</tt>
1237 is a list element enqueuing this task on the holdout list,
1238 and <tt>-&gt;rcu_tasks_idle_cpu</tt> tracks which CPU this
1239 idle task is running, but only if the task is currently running,
1240 that is, if the CPU is currently idle.
1242 <h3><a name="Accessor Functions">
1243 Accessor Functions</a></h3>
1245 <p>The following listing shows the
1246 <tt>rcu_get_root()</tt>, <tt>rcu_for_each_node_breadth_first</tt>,
1247 <tt>rcu_for_each_nonleaf_node_breadth_first()</tt>, and
1248 <tt>rcu_for_each_leaf_node()</tt> function and macros:
1250 <pre>
1251 1 static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
1253 3 return &amp;rsp-&gt;node[0];
1256 6 #define rcu_for_each_node_breadth_first(rsp, rnp) \
1257 7 for ((rnp) = &amp;(rsp)-&gt;node[0]; \
1258 8 (rnp) &lt; &amp;(rsp)-&gt;node[NUM_RCU_NODES]; (rnp)++)
1260 10 #define rcu_for_each_nonleaf_node_breadth_first(rsp, rnp) \
1261 11 for ((rnp) = &amp;(rsp)-&gt;node[0]; \
1262 12 (rnp) &lt; (rsp)-&gt;level[NUM_RCU_LVLS - 1]; (rnp)++)
1264 14 #define rcu_for_each_leaf_node(rsp, rnp) \
1265 15 for ((rnp) = (rsp)-&gt;level[NUM_RCU_LVLS - 1]; \
1266 16 (rnp) &lt; &amp;(rsp)-&gt;node[NUM_RCU_NODES]; (rnp)++)
1267 </pre>
1269 <p>The <tt>rcu_get_root()</tt> simply returns a pointer to the
1270 first element of the specified <tt>rcu_state</tt> structure's
1271 <tt>-&gt;node[]</tt> array, which is the root <tt>rcu_node</tt>
1272 structure.
1274 </p><p>As noted earlier, the <tt>rcu_for_each_node_breadth_first()</tt>
1275 macro takes advantage of the layout of the <tt>rcu_node</tt>
1276 structures in the <tt>rcu_state</tt> structure's
1277 <tt>-&gt;node[]</tt> array, performing a breadth-first traversal by
1278 simply traversing the array in order.
1279 The <tt>rcu_for_each_nonleaf_node_breadth_first()</tt> macro operates
1280 similarly, but traverses only the first part of the array, thus excluding
1281 the leaf <tt>rcu_node</tt> structures.
1282 Finally, the <tt>rcu_for_each_leaf_node()</tt> macro traverses only
1283 the last part of the array, thus traversing only the leaf
1284 <tt>rcu_node</tt> structures.
1286 <table>
1287 <tr><th>&nbsp;</th></tr>
1288 <tr><th align="left">Quick Quiz:</th></tr>
1289 <tr><td>
1290 What do <tt>rcu_for_each_nonleaf_node_breadth_first()</tt> and
1291 <tt>rcu_for_each_leaf_node()</tt> do if the <tt>rcu_node</tt> tree
1292 contains only a single node?
1293 </td></tr>
1294 <tr><th align="left">Answer:</th></tr>
1295 <tr><td bgcolor="#ffffff"><font color="ffffff">
1296 In the single-node case,
1297 <tt>rcu_for_each_nonleaf_node_breadth_first()</tt> is a no-op
1298 and <tt>rcu_for_each_leaf_node()</tt> traverses the single node.
1299 </font></td></tr>
1300 <tr><td>&nbsp;</td></tr>
1301 </table>
1303 <h3><a name="Summary">
1304 Summary</a></h3>
1306 So each flavor of RCU is represented by an <tt>rcu_state</tt> structure,
1307 which contains a combining tree of <tt>rcu_node</tt> and
1308 <tt>rcu_data</tt> structures.
1309 Finally, in <tt>CONFIG_NO_HZ_IDLE</tt> kernels, each CPU's dyntick-idle
1310 state is tracked by an <tt>rcu_dynticks</tt> structure.
1312 If you made it this far, you are well prepared to read the code
1313 walkthroughs in the other articles in this series.
1315 <h3><a name="Acknowledgments">
1316 Acknowledgments</a></h3>
1318 I owe thanks to Cyrill Gorcunov, Mathieu Desnoyers, Dhaval Giani, Paul
1319 Turner, Abhishek Srivastava, Matt Kowalczyk, and Serge Hallyn
1320 for helping me get this document into a more human-readable state.
1322 <h3><a name="Legal Statement">
1323 Legal Statement</a></h3>
1325 <p>This work represents the view of the author and does not necessarily
1326 represent the view of IBM.
1328 </p><p>Linux is a registered trademark of Linus Torvalds.
1330 </p><p>Other company, product, and service names may be trademarks or
1331 service marks of others.
1333 </body></html>