1 =====================================================
2 Memory Resource Controller(Memcg) Implementation Memo
3 =====================================================
Base Kernel Version: based on 2.6.33-rc7-mm (candidate for 34).
Because the VM is getting complex (one of the reasons is memcg itself),
memcg's behavior is also complex. This is a document for memcg's
internal behavior.
Please note that implementation details may change.
(*) Topics on the API should be in Documentation/admin-guide/cgroup-v1/memory.rst.
0. How to record usage?
=======================
page_cgroup ... an object per page.
22 Allocated at boot or memory hotplug. Freed at memory hot removal.
24 swap_cgroup ... an entry per swp_entry.
26 Allocated at swapon(). Freed at swapoff().
The page_cgroup has a USED bit, so double counting against a
page_cgroup never occurs. swap_cgroup is used only when a charged page
is swapped out.
1. Charge
=========

a page/swp_entry may be charged (usage += PAGE_SIZE) at

  mem_cgroup_try_charge()
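As a quick sanity check from user space, usage should always move in
PAGE_SIZE units. A minimal sketch, assuming the memory controller is
mounted at /cgroup (the mount point and the group name A are
assumptions, not part of the kernel interface)::

  # mount -t cgroup none /cgroup -o memory
  # mkdir /cgroup/A
  # echo $$ > /cgroup/A/tasks
  # cat /cgroup/A/memory.usage_in_bytes

The value read is a multiple of PAGE_SIZE, since charging is done page
by page.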
2. Uncharge
===========

a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by

  mem_cgroup_uncharge()
    Called when a page's refcount goes down to 0.

  mem_cgroup_uncharge_swap()
    Called when swp_entry's refcnt goes down to 0. A charge against
    swap vanishes.
50 3. charge-commit-cancel
51 =======================
53 Memcg pages are charged in two steps:
55 - mem_cgroup_try_charge()
56 - mem_cgroup_commit_charge() or mem_cgroup_cancel_charge()
At try_charge(), there are no flags to say "this page is charged".
At this point, usage += PAGE_SIZE.
61 At commit(), the page is associated with the memcg.
63 At cancel(), simply usage -= PAGE_SIZE.
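A failed try_charge() is visible from user space as a bump in
memory.failcnt. A hedged sketch (group name, limit, and workload are
arbitrary)::

  # mkdir /cgroup/A
  # echo 4M > /cgroup/A/memory.limit_in_bytes
  # echo $$ > /cgroup/A/tasks
  # dd if=/dev/zero of=/tmp/testfile bs=1M count=64
  # cat /cgroup/A/memory.failcnt

Each time a charge attempt hits the limit, failcnt is incremented;
reclaim then usually lets a retry succeed.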
In the explanation below, we assume CONFIG_MEM_RES_CTRL_SWAP=y.
4. Anonymous
============

Anonymous page is newly allocated at

- page fault into MAP_ANONYMOUS mapping.
- Copy-On-Write.
4.1 Swap-in.
------------

At swap-in, the page is taken from swap-cache. There are 2 cases.
77 (a) If the SwapCache is newly allocated and read, it has no charges.
(b) If the SwapCache has been mapped by processes, it has been
    charged already.
4.2 Swap-out.
-------------

At swap-out, the typical state transition is below.
(a) add to swap cache. (marked as SwapCache)
    swp_entry's refcnt += 1.
(b) fully unmapped.
    swp_entry's refcnt += # of ptes.
(c) write back to swap.
(d) delete from swap cache. (remove from SwapCache)
    swp_entry's refcnt -= 1.
Finally, at task exit,

(e) zap_pte() is called and swp_entry's refcnt -= 1, reaching 0.
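The swap side of this accounting can be watched from user space. A
hedged sketch, assuming swap is enabled and swap accounting is
configured (group name and sizes are arbitrary)::

  # mkdir /cgroup/A
  # echo 10M > /cgroup/A/memory.limit_in_bytes
  # echo $$ > /cgroup/A/tasks

  Run a program that touches clearly more than 10M of anonymous
  memory, then:

  # grep swap /cgroup/A/memory.stat

As anonymous pages go through states (a)-(d) above, the swap counter
grows.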
5. Page Cache
=============

Page Cache is charged at

- add_to_page_cache_locked().

and uncharged at

- __remove_from_page_cache().

The logic is very clear. (About migration, see below.)

(*) __remove_from_page_cache() is called by remove_from_page_cache()
    and __remove_mapping().
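Page-cache charging can be watched the same way as above. A minimal
sketch (file path and size are arbitrary)::

  # echo $$ > /cgroup/A/tasks
  # dd if=/dev/zero of=/tmp/pagecache_test bs=1M count=10
  # grep cache /cgroup/A/memory.stat

The cache counter grows by roughly the amount written, unless the
limit forces reclaim first.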
6. Shmem (tmpfs) Page Cache
===========================
The best way to understand shmem's page state transitions is to read
mm/shmem.c.
But a brief explanation of memcg's behavior around shmem will be
helpful for understanding the logic.
Shmem's page (just leaf page, not direct/indirect block) can be on

- radix-tree of shmem's inode.
- SwapCache.
- Both on radix-tree and SwapCache. This happens at swap-in
  and swap-out.
It's charged when...

- A new page is added to shmem's radix-tree.
- A swp page is read. (move a charge from swap_cgroup to page_cgroup)
7. Page Migration
=================

  mem_cgroup_migrate()

8. LRU
======

Each memcg has its own vector of LRUs (inactive anon, active anon,
inactive file, active file, unevictable) of pages from each node,
each LRU handled under a single lru_lock for that memcg and node.
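The per-memcg LRU state is exported through memory.stat; a quick way
to eyeball it (group name assumed as above)::

  # grep -E 'anon|file|unevictable' /cgroup/A/memory.stat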
9. Typical Tests.
=================

Tests for racy cases.
145 9.1 Small limit to memcg.
146 -------------------------
When you test racy cases, it is a good idea to set memcg's limit very
small (megabytes) rather than in gigabytes. Many races are found only
under a small limit.

(Memory behavior under GB-scale limits and under MB-scale limits shows
very different situations.)
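A hedged sketch of such a racy test; the 1M limit and the workload are
arbitrary, and $BASHPID assumes bash::

  mkdir /cgroup/A
  echo 1M > /cgroup/A/memory.limit_in_bytes
  for i in `seq 8`
  do
      (echo $BASHPID > /cgroup/A/tasks
       dd if=/dev/zero of=/tmp/f$i bs=1M count=32) &
  done
  wait

Several jobs charging and reclaiming against the same tiny limit is
exactly the situation where races show up.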
9.2 Shmem
---------

Historically, memcg's shmem handling was poor and we saw a number of
troubles here. This is because shmem is page cache but can also be
SwapCache. A test with shmem/tmpfs is always a good test.
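A hedged sketch of such a test, assuming swap is enabled (mount point,
group name, and sizes are arbitrary)::

  # mkdir -p /mnt/tmpfs
  # mount -t tmpfs none /mnt/tmpfs
  # mkdir /cgroup/A
  # echo 10M > /cgroup/A/memory.limit_in_bytes
  # echo $$ > /cgroup/A/tasks
  # dd if=/dev/zero of=/mnt/tmpfs/file bs=1M count=20
  # grep -E 'cache|swap' /cgroup/A/memory.stat

Writing past the limit pushes shmem pages through the SwapCache path,
so both the page-cache and the swap side of memcg are exercised.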
9.3 Migration
-------------

For NUMA, migration is another special case. cpuset is useful for easy
testing. The following is a sample script to do migration::
  mount -t cgroup -o cpuset none /opt/cpuset

  mkdir /opt/cpuset/01
  echo 1 > /opt/cpuset/01/cpuset.cpus
  echo 0 > /opt/cpuset/01/cpuset.mems
  echo 1 > /opt/cpuset/01/cpuset.memory_migrate
  mkdir /opt/cpuset/02
  echo 1 > /opt/cpuset/02/cpuset.cpus
  echo 1 > /opt/cpuset/02/cpuset.mems
  echo 1 > /opt/cpuset/02/cpuset.memory_migrate
In the above set, when you move a task from 01 to 02, page migration
from node 0 to node 1 will occur. The following is a script to migrate
all tasks under a cpuset::
  move_task()
  {
      for pid in $1
      do
          /bin/echo $pid > $2/tasks 2>/dev/null
      done
  }

  G1_TASK=`cat ${G1}/tasks`
  G2_TASK=`cat ${G2}/tasks`
  move_task "${G1_TASK}" ${G2} &
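For example, if the snippet above is saved as migrate.sh (a
hypothetical file name), it can be driven like this::

  # G1=/opt/cpuset/01 G2=/opt/cpuset/02 sh migrate.sh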
9.4 Memory hotplug
------------------

The memory hotplug test is another good test.

To offline memory, do the following::

  # echo offline > /sys/devices/system/memory/memoryXXX/state

(XXX is the memory section number.)

This is an easy way to test page migration, too.
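The section can be brought back through the same sysfs interface::

  # cat /sys/devices/system/memory/memoryXXX/state
  # echo online > /sys/devices/system/memory/memoryXXX/state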
9.5 nested cgroups
------------------

Use tests like the following for testing nested cgroups::

  mkdir /opt/cgroup/01/child_a
  mkdir /opt/cgroup/01/child_b

  set limit to 01.
  add limit to 01/child_b
  run jobs under child_a and child_b
Create/delete the following groups at random while the jobs are
running (a sketch of such a loop follows below)::

  /opt/cgroup/01/child_a/child_aa
  /opt/cgroup/01/child_b/child_bb
  /opt/cgroup/01/child_c
Running new jobs in a new group is also good.
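A hedged sketch of the create/delete loop mentioned above (timing and
error handling are arbitrary)::

  while true
  do
      mkdir /opt/cgroup/01/child_a/child_aa 2>/dev/null
      sleep 1
      rmdir /opt/cgroup/01/child_a/child_aa 2>/dev/null
  done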
233 9.6 Mount with other subsystems
234 -------------------------------
Mounting with other subsystems is a good test because there are races
and lock dependencies with the other cgroup subsystems.

For example, do the following::

  # mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices

and do task move, mkdir, rmdir, etc. under this.
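A hedged sketch of such a stress loop (the group name test is
arbitrary)::

  while true
  do
      mkdir /cgroup/test 2>/dev/null
      echo $$ > /cgroup/test/tasks
      echo $$ > /cgroup/tasks
      rmdir /cgroup/test
  done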
9.7 swapoff
-----------

Besides the fact that management of swap is one of the complicated
parts of memcg, the call path of swap-in at swapoff is not the same as
the usual swap-in path. It is worth testing explicitly.

For example, a test like the following is good:
(Shell-A)::

  # mount -t cgroup none /cgroup -o memory
  # mkdir /cgroup/test
  # echo 40M > /cgroup/test/memory.limit_in_bytes
  # echo 0 > /cgroup/test/tasks

Run a malloc(100M) program under this. You'll see 60M of swap.

(Shell-B)::

  # move all tasks in /cgroup/test to /cgroup
  # /sbin/swapoff -a
  # rmdir /cgroup/test
  # kill malloc task.

Of course, the tmpfs vs. swapoff case should be tested, too.
9.8 OOM-Killer
--------------

Out-of-memory caused by a memcg's limit will kill tasks under
the memcg. When hierarchy is used, a task under the hierarchy
will be killed by the kernel.

In this case, panic_on_oom shouldn't be invoked and tasks
in other groups shouldn't be killed.

It's not difficult to cause OOM under memcg, as follows.
Case A) when you can swapoff::

  #swapoff -a
  #echo 50M > /memory.limit_in_bytes
  run 51M of malloc
Case B) when you use mem+swap limitation::

  #echo 50M > memory.limit_in_bytes
  #echo 50M > memory.memsw.limit_in_bytes
  run 51M of malloc
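If no dedicated malloc test program is at hand, roughly 51M can be
held from the shell itself. A hedged sketch assuming bash and GNU
coreutils; note the shell then becomes the OOM victim::

  # a=$(head -c 51M /dev/zero | tr '\0' x)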
298 9.9 Move charges at task migration
299 ----------------------------------
301 Charges associated with a task can be moved along with task migration.
(Shell-A)::

  #mkdir /cgroup/A
  #echo $$ >/cgroup/A/tasks

Run some programs which use some amount of memory in /cgroup/A.
(Shell-B)::

  #mkdir /cgroup/B
  #echo 1 >/cgroup/B/memory.move_charge_at_immigrate
  #echo "pid of the program running in group A" >/cgroup/B/tasks
316 You can see charges have been moved by reading ``*.usage_in_bytes`` or
317 memory.stat of both A and B.
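For example (rss is a memory.stat field; group paths as above)::

  # grep rss /cgroup/A/memory.stat
  # grep rss /cgroup/B/memory.stat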
See section 8.2 of Documentation/admin-guide/cgroup-v1/memory.rst for
what value should be written to move_charge_at_immigrate.
322 9.10 Memory thresholds
323 ----------------------
The memory controller implements memory thresholds using the cgroups
notification API. You can use tools/cgroup/cgroup_event_listener.c to
test it.
(Shell-A) Create a cgroup and run the event listener::

  # mkdir /cgroup/A
  # ./cgroup_event_listener /cgroup/A/memory.usage_in_bytes 5M
(Shell-B) Add a task to the cgroup and try to allocate and free
memory::

  # echo $$ >/cgroup/A/tasks
  # a="$(dd if=/dev/zero bs=1M count=10)"
  # a=

You will see a message from cgroup_event_listener every time you cross
the thresholds.
342 Use /cgroup/A/memory.memsw.usage_in_bytes to test memsw thresholds.
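That is, the same listener invocation pointed at the memsw file::

  # ./cgroup_event_listener /cgroup/A/memory.memsw.usage_in_bytes 5M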
It's a good idea to test the root cgroup as well.