User Interface for Resource Allocation in Intel Resource Director Technology

Copyright (C) 2016 Intel Corporation

Fenghua Yu <fenghua.yu@intel.com>
Tony Luck <tony.luck@intel.com>
Vikas Shivappa <vikas.shivappa@intel.com>
This feature is enabled by the CONFIG_INTEL_RDT_A Kconfig and the
X86 /proc/cpuinfo flag bits "rdt", "cat_l3" and "cdp_l3".

To use the feature mount the file system:

 # mount -t resctrl resctrl [-o cdp] /sys/fs/resctrl

mount options are:

"cdp": Enable code/data prioritization in L3 cache allocations.
Info directory
--------------

The 'info' directory contains information about the enabled
resources. Each resource has its own subdirectory. The subdirectory
names reflect the resource names.

The cache resource (L3/L2) subdirectory contains the following files:
"num_closids":   The number of CLOSIDs which are valid for this
                 resource. The kernel uses the smallest number of
                 CLOSIDs of all enabled resources as the limit.

"cbm_mask":      The bitmask which is valid for this resource.
                 This mask is equivalent to 100%.

"min_cbm_bits":  The minimum number of consecutive bits which
                 must be set when writing a mask.
The memory bandwidth (MB) subdirectory contains the following files
(an example of reading the info files follows the list):

"min_bandwidth":  The minimum memory bandwidth percentage which
                  the user can request.

"bandwidth_gran": The granularity in which the memory bandwidth
                  percentage is allocated. The allocated
                  b/w percentage is rounded off to the next
                  control step available on the hardware. The
                  available bandwidth control steps are:
                  min_bandwidth + N * bandwidth_gran.

"delay_linear":   Indicates if the delay scale is linear or
                  non-linear. This field is purely informational.
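
For example, on a system with both L3 cache allocation and memory bandwidth
allocation enabled, these files can simply be read. The values shown below
are only illustrative and will differ between CPU models:

 # cat /sys/fs/resctrl/info/L3/cbm_mask
 fffff
 # cat /sys/fs/resctrl/info/L3/num_closids
 16
 # cat /sys/fs/resctrl/info/MB/min_bandwidth
 10
 # cat /sys/fs/resctrl/info/MB/bandwidth_gran
 10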
Resource groups
---------------
Resource groups are represented as directories in the resctrl file
system. The default group is the root directory. Other groups may be
created as desired by the system administrator using the "mkdir(1)"
command, and removed using "rmdir(1)".
The following files are associated with each group (a short usage
example follows the list):
"tasks": A list of tasks that belong to this group. Tasks can be
         added to a group by writing the task ID to the "tasks" file
         (which will automatically remove them from the previous
         group to which they belonged). New tasks created by fork(2)
         and clone(2) are added to the same group as their parent.
         If a pid is not in any sub partition, it is in the root partition
         (i.e. default partition).
"cpus": A bitmask of logical CPUs assigned to this group. Writing
        a new mask can add/remove CPUs from this group. Added CPUs
        are removed from their previous group. Removed ones are
        given to the default (root) group. You cannot remove CPUs
        from the default group.
"cpus_list": One or more CPU ranges of logical CPUs assigned to this
             group. The same rules as for the "cpus" file apply.
"schemata": A list of all the resources available to this group.
            Each resource has its own line and format - see below for
            details.
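
As a brief sketch (the group name, PID and CPU range below are made up
purely for illustration), a group is created and populated like this:

 # mkdir /sys/fs/resctrl/grp0
 # echo 1234 > /sys/fs/resctrl/grp0/tasks
 # echo 4-7 > /sys/fs/resctrl/grp0/cpus_list

Writing PID 1234 moves that task out of whichever group it was in before,
and writing the CPU list removes CPUs 4-7 from their previous group.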
When a task is running the following rules define which resources
are available to it:
1) If the task is a member of a non-default group, then the schemata
   for that group is used.

2) Else if the task belongs to the default group, but is running on a
   CPU that is assigned to some specific group, then the schemata for
   the CPU's group is used.

3) Otherwise the schemata for the default group is used.
Schemata files - general concepts
---------------------------------
Each line in the file describes one resource. The line starts with
the name of the resource, followed by specific values to be applied
in each of the instances of that resource on the system.
Cache IDs
---------
On current generation systems there is one L3 cache per socket and L2
caches are generally just shared by the hyperthreads on a core, but this
isn't an architectural requirement. We could have multiple separate L3
caches on a socket, or multiple cores could share an L2 cache. So instead
of using "socket" or "core" to define the set of logical CPUs sharing
a resource we use a "Cache ID". At a given cache level this will be a
unique number across the whole system (but it isn't guaranteed to be a
contiguous sequence, there may be gaps). To find the ID for each logical
CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id
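
For example, the L3 cache ID of CPU 0 can be read like this. On many
Intel systems index3 corresponds to the L3 cache, but the "level" file
of each index directory should be checked rather than assumed; the
output below is illustrative:

 # cat /sys/devices/system/cpu/cpu0/cache/index3/level
 3
 # cat /sys/devices/system/cpu/cpu0/cache/index3/id
 0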
Cache Bit Masks (CBM)
---------------------
For cache resources we describe the portion of the cache that is available
for allocation using a bitmask. The maximum value of the mask is defined
by each cpu model (and may be different for different cache levels). It
is found using CPUID, but is also provided in the "info" directory of
the resctrl file system in "info/{resource}/cbm_mask". X86 hardware
requires that these masks have all the '1' bits in a contiguous block. So
0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9
and 0xA are not. On a system with a 20-bit mask each bit represents 5%
of the capacity of the cache. You could partition the cache into four
equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
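
As a sketch of that partitioning (the group names are made up, and a single
L3 cache with ID 0 is assumed), each quarter could be handed to its own
resource group using the schemata line format described below:

 # mkdir /sys/fs/resctrl/q0 /sys/fs/resctrl/q1 /sys/fs/resctrl/q2 /sys/fs/resctrl/q3
 # echo "L3:0=1f"    > /sys/fs/resctrl/q0/schemata
 # echo "L3:0=3e0"   > /sys/fs/resctrl/q1/schemata
 # echo "L3:0=7c00"  > /sys/fs/resctrl/q2/schemata
 # echo "L3:0=f8000" > /sys/fs/resctrl/q3/schemata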
Memory bandwidth (b/w) percentage
---------------------------------
For the memory b/w resource, the user controls the resource by indicating
the percentage of total memory b/w.
The minimum bandwidth percentage value for each cpu model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
granularity that is allocated is also dependent on the cpu model and can
be looked up at "info/MB/bandwidth_gran". The available bandwidth
control steps are: min_bw + N * bw_gran. Intermediate values are rounded
to the next control step available on the hardware.
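
For example, on a hypothetical system with min_bandwidth of 10 and
bandwidth_gran of 10, the valid control steps are 10, 20, 30, ..., 100.
A request of 35 is not itself a valid step and will be rounded off to
one of the available steps.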
Bandwidth throttling is a core-specific mechanism on some Intel
SKUs. Using a high bandwidth and a low bandwidth setting on two threads
sharing a core will result in both threads being throttled to use the
low bandwidth.
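
For example, if two hyperthreads sharing such a core belong to groups with
memory b/w settings of 80% and 20%, both threads end up throttled to the
20% setting.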
L3 details (code and data prioritization disabled)
--------------------------------------------------
With CDP disabled the L3 schemata format is:

	L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
L3 details (CDP enabled via mount option to resctrl)
----------------------------------------------------
When CDP is enabled L3 control is split into two separate resources
so you can specify independent masks for code and data like this:

	L3DATA:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
	L3CODE:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
L2 details
----------
L2 cache does not support code and data prioritization, so the
schemata format is always:

	L2:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
Memory b/w Allocation details
-----------------------------
The memory b/w domain is the L3 cache.

	MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...
Reading/writing the schemata file
---------------------------------
Reading the schemata file will show the state of all resources
on all domains. When writing you only need to specify those values
which you wish to change. E.g.

# cat schemata
L3DATA:0=fffff;1=fffff;2=fffff;3=fffff
L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
# echo "L3DATA:2=3c0;" > schemata
# cat schemata
L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
Example 1
---------
On a two socket machine (one L3 cache per socket) with just four bits
for cache bit masks, a minimum b/w of 10% and a memory bandwidth
granularity of 10%:

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p0 p1
# echo "L3:0=3;1=c\nMB:0=50;1=50" > /sys/fs/resctrl/p0/schemata
# echo "L3:0=3;1=3\nMB:0=50;1=50" > /sys/fs/resctrl/p1/schemata
The default resource group is unmodified, so we have access to all parts
of all caches (its schemata file reads "L3:0=f;1=f").

Tasks that are under the control of group "p0" may only allocate from the
"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
Tasks in group "p1" use the "lower" 50% of cache on both sockets.
Similarly, tasks that are under the control of group "p0" may use a
maximum memory b/w of 50% on socket 0 and 50% on socket 1.
Tasks in group "p1" may also use 50% memory b/w on both sockets.
Note that unlike cache masks, memory b/w cannot specify whether these
allocations can overlap or not. The allocation specifies the maximum
b/w that the group may be able to use and the system admin can configure
the b/w accordingly.
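
Assuming the writes above succeeded, the settings can be verified by
reading the schemata files back. The output below simply reflects the
values written; the exact formatting and ordering of lines may differ
between kernel versions:

# cat /sys/fs/resctrl/p0/schemata
L3:0=3;1=c
MB:0=50;1=50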
Example 2
---------
Again two sockets, but this time with a more realistic 20-bit mask.

Two real time tasks, pid=1234 running on processor 0 and pid=5678 running on
processor 1 on socket 0 of a 2-socket, dual core machine. To avoid noisy
neighbors, each of the two real-time tasks exclusively occupies one quarter
of L3 cache on socket 0.
# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0 and 50% of memory b/w cannot be used by
ordinary tasks:

# echo "L3:0=3ff;1=fffff\nMB:0=50;1=100" > schemata
Next we make a resource group for our first real time task and give
it access to the "top" 25% of the cache on socket 0.

# mkdir p0
# echo "L3:0=f8000;1=fffff" > p0/schemata
Finally we move our first real time task into this resource group. We
also use taskset(1) to ensure the task always runs on a dedicated CPU
on socket 0. Most uses of resource groups will also constrain which
processors tasks run on.

# echo 1234 > p0/tasks
# taskset -cp 1 1234
Ditto for the second real time task (with the remaining 25% of cache):

# mkdir p1
# echo "L3:0=7c00;1=fffff" > p1/schemata
# echo 5678 > p1/tasks
# taskset -cp 2 5678
For the same 2 socket system with the memory b/w resource and CAT L3, the
schemata would look like this (assume min_bandwidth is 10 and
bandwidth_gran is 10).

For our first real time task this would request 20% memory b/w on socket
0:

# echo -e "L3:0=f8000;1=fffff\nMB:0=20;1=100" > p0/schemata
For our second real time task this would request another 20% memory b/w
on socket 0:

# echo -e "L3:0=7c00;1=fffff\nMB:0=20;1=100" > p1/schemata
Example 3
---------
A single socket system which has real-time tasks running on cores 4-7 and
a non real-time workload assigned to cores 0-3. The real-time tasks share text
and data, so a per task association is not required and due to interaction
with the kernel it's desired that the kernel on these cores shares L3 with
the tasks.
# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
First we reset the schemata for the default group so that the "upper"
50% of the L3 cache on socket 0, and 50% of memory bandwidth on socket 0
cannot be used by ordinary tasks:

# echo "L3:0=3ff\nMB:0=50" > schemata
Next we make a resource group for our real time cores and give it access
to the "top" 50% of the cache on socket 0 and 50% of memory bandwidth on
socket 0:

# mkdir p0
# echo "L3:0=ffc00\nMB:0=50" > p0/schemata
Finally we move cores 4-7 over to the new group and make sure that the
kernel and the tasks running there get 50% of the cache. They should
also get 50% of memory bandwidth assuming that the cores 4-7 are SMT
siblings and only the real time threads are scheduled on the cores 4-7.
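
The move itself is done by writing the CPU bitmask for cores 4-7 to the
group's "cpus" file; 0xf0 is the mask with bits 4-7 set:

# echo f0 > p0/cpus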
Locking between applications
----------------------------
Certain operations on the resctrl filesystem, composed of read/writes
to/from multiple files, must be atomic.

As an example, the allocation of an exclusive reservation of L3 cache
involves:
  1. Read the cbmmasks from each directory
  2. Find a contiguous set of bits in the global CBM bitmask that is clear
     in all of the directory cbmmasks
  3. Create a new directory
  4. Write the bits found in step 2 to the new directory's "schemata" file
If two applications attempt to allocate space concurrently then they can
end up allocating the same bits so the reservations are shared instead of
exclusive.
To coordinate atomic operations on the resctrl filesystem and to avoid the
problem above, the following locking procedure is recommended:

Locking is based on flock, which is available in libc and also as a shell
script command.
Write lock:

 A) Take flock(LOCK_EX) on /sys/fs/resctrl
 B) Read/write the directory structure.
 C) Release the lock with flock(LOCK_UN).

Read lock:

 A) Take flock(LOCK_SH) on /sys/fs/resctrl
 B) If successful, read the directory structure.
 C) Release the lock with flock(LOCK_UN).
Example with bash:

# Atomically read directory structure
$ flock -s /sys/fs/resctrl/ find /sys/fs/resctrl

# Read directory contents and create new subdirectory

$ cat create-dir.sh
find /sys/fs/resctrl/ > output.txt
mask=$(function-of output.txt)   # "function-of" is a placeholder for the mask computation
mkdir /sys/fs/resctrl/newres/
echo "$mask" > /sys/fs/resctrl/newres/schemata

$ flock /sys/fs/resctrl/ ./create-dir.sh
Example with C:

/*
 * Example code to take advisory locks
 * before accessing the resctrl filesystem.
 */
#include <sys/file.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

void resctrl_take_shared_lock(int fd)
{
        /* take shared (read) lock on resctrl filesystem */
        if (flock(fd, LOCK_SH)) {
                perror("flock");
                exit(-1);
        }
}

void resctrl_take_exclusive_lock(int fd)
{
        /* take exclusive (write) lock on resctrl filesystem */
        if (flock(fd, LOCK_EX)) {
                perror("flock");
                exit(-1);
        }
}

void resctrl_release_lock(int fd)
{
        /* release lock on resctrl filesystem */
        if (flock(fd, LOCK_UN)) {
                perror("flock");
                exit(-1);
        }
}

int main(void)
{
        int fd = open("/sys/fs/resctrl", O_DIRECTORY);

        if (fd == -1) {
                perror("open");
                exit(-1);
        }

        resctrl_take_shared_lock(fd);
        /* code to read directory contents */
        resctrl_release_lock(fd);

        resctrl_take_exclusive_lock(fd);
        /* code to read and write directory contents */
        resctrl_release_lock(fd);

        return 0;
}