.. SPDX-License-Identifier: GPL-2.0

===============
Physical Memory
===============

Linux is available for a wide range of architectures so there is a need for an
architecture-independent abstraction to represent the physical memory. This
chapter describes the structures used to manage physical memory in a running
system.

The first principal concept prevalent in the memory management is
`Non-Uniform Memory Access (NUMA)
<https://en.wikipedia.org/wiki/Non-uniform_memory_access>`_.
With multi-core and multi-socket machines, memory may be arranged into banks
that incur a different cost to access depending on the “distance” from the
processor. For example, there might be a bank of memory assigned to each CPU or
a bank of memory very suitable for DMA near peripheral devices.

Each bank is called a node and the concept is represented under Linux by a
``struct pglist_data`` even if the architecture is UMA. This structure is
always referenced by its typedef ``pg_data_t``. A ``pg_data_t`` structure
for a particular node can be referenced by the ``NODE_DATA(nid)`` macro where
``nid`` is the ID of that node.

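
On most configurations ``NODE_DATA(nid)`` boils down to indexing an array of
``pg_data_t`` pointers by node ID. The following is a minimal user-space
sketch of that lookup, not kernel code: the types are simplified stand-ins
and ``setup_nodes()`` is a hypothetical substitute for the early-boot
registration done by architecture specific code::

  #include <assert.h>

  /* Simplified stand-in for the kernel's struct pglist_data. */
  typedef struct pglist_data {
          int node_id;
  } pg_data_t;

  #define MAX_NUMNODES 4

  /* One pointer per node, filled in early at boot by arch code. */
  static pg_data_t *node_data[MAX_NUMNODES];

  /* NODE_DATA(nid) is conceptually just this array lookup. */
  #define NODE_DATA(nid) (node_data[(nid)])

  static pg_data_t node0 = { .node_id = 0 };
  static pg_data_t node1 = { .node_id = 1 };

  /* Mimics the registration the architecture code performs at boot. */
  static void setup_nodes(void)
  {
          node_data[0] = &node0;
          node_data[1] = &node1;
  }

  int main(void)
  {
          setup_nodes();
          assert(NODE_DATA(0)->node_id == 0);
          assert(NODE_DATA(1)->node_id == 1);
          return 0;
  }
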
For NUMA architectures, the node structures are allocated by the architecture
specific code early during boot. Usually, these structures are allocated
locally on the memory bank they represent. For UMA architectures, only one
static ``pg_data_t`` structure called ``contig_page_data`` is used. Nodes will
be discussed further in Section :ref:`Nodes <nodes>`.

The entire physical address space is partitioned into one or more blocks
called zones which represent ranges within memory. These ranges are usually
determined by architectural constraints for accessing the physical memory.
The memory range within a node that corresponds to a particular zone is
described by a ``struct zone``, typedeffed to ``zone_t``. Each zone has
one of the types described below.

* ``ZONE_DMA`` and ``ZONE_DMA32`` historically represented memory suitable for
  DMA by peripheral devices that cannot access all of the addressable
  memory. For many years there have been better and more robust interfaces to
  get memory with DMA specific requirements
  (Documentation/core-api/dma-api.rst), but ``ZONE_DMA`` and ``ZONE_DMA32``
  still represent memory ranges that have restrictions on how they can be
  accessed.
  Depending on the architecture, either of these zone types or even both of
  them can be disabled at build time using the ``CONFIG_ZONE_DMA`` and
  ``CONFIG_ZONE_DMA32`` configuration options. Some 64-bit platforms may need
  both zones as they support peripherals with different DMA addressing
  limitations.

* ``ZONE_NORMAL`` is for normal memory that can be accessed by the kernel all
  the time. DMA operations can be performed on pages in this zone if the DMA
  devices support transfers to all addressable memory. ``ZONE_NORMAL`` is
  always enabled.

* ``ZONE_HIGHMEM`` is the part of the physical memory that is not covered by a
  permanent mapping in the kernel page tables. The memory in this zone is only
  accessible to the kernel using temporary mappings. This zone is available
  only on some 32-bit architectures and is enabled with ``CONFIG_HIGHMEM``.

* ``ZONE_MOVABLE`` is for normal accessible memory, just like ``ZONE_NORMAL``.
  The difference is that the contents of most pages in ``ZONE_MOVABLE`` are
  movable. That means that while virtual addresses of these pages do not
  change, their content may move between different physical pages. Often
  ``ZONE_MOVABLE`` is populated during memory hotplug, but it may also be
  populated on boot using one of the ``kernelcore``, ``movablecore`` and
  ``movable_node`` kernel command line parameters. See
  Documentation/mm/page_migration.rst and
  Documentation/admin-guide/mm/memory-hotplug.rst for additional details.

* ``ZONE_DEVICE`` represents memory residing on devices such as PMEM and GPU.
  It has different characteristics than RAM zone types and it exists to provide
  :ref:`struct page <Pages>` and memory map services for device driver
  identified physical address ranges. ``ZONE_DEVICE`` is enabled with
  configuration option ``CONFIG_ZONE_DEVICE``.

It is important to note that many kernel operations can only take place using
``ZONE_NORMAL`` so it is the most performance critical zone. Zones are
discussed further in Section :ref:`Zones <zones>`.

The relation between node and zone extents is determined by the physical memory
map reported by the firmware, architectural constraints for memory addressing
and certain parameters in the kernel command line.

For example, with a 32-bit kernel on an x86 UMA machine with 2 Gbytes of RAM
the entire memory will be on node 0 and there will be three zones:
``ZONE_DMA``, ``ZONE_NORMAL`` and ``ZONE_HIGHMEM``::

  0                                                            2G
  +-------------------------------------------------------------+
  |                            node 0                           |
  +-------------------------------------------------------------+

  0         16M                    896M                        2G
  +----------+-----------------------+--------------------------+
  | ZONE_DMA |      ZONE_NORMAL      |       ZONE_HIGHMEM       |
  +----------+-----------------------+--------------------------+

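
The boundaries in this example follow from classic 32-bit x86 limits: ISA DMA
can only reach the first 16 Mbytes, and the kernel's permanent direct mapping
traditionally ends near 896 Mbytes. A quick user-space sketch of the
arithmetic (the 16M/896M split is the traditional x86 layout and the 4K page
size is an assumption; exact values are configuration dependent)::

  #include <assert.h>
  #include <stdint.h>
  #include <stdio.h>

  #define MB (1024ULL * 1024)
  #define GB (1024 * MB)
  #define PAGE_SIZE 4096ULL

  /* Number of page frames in the physical range [start, end). */
  static uint64_t zone_pages(uint64_t start, uint64_t end)
  {
          return (end - start) / PAGE_SIZE;
  }

  int main(void)
  {
          uint64_t dma_pages     = zone_pages(0, 16 * MB);        /* ISA DMA limit */
          uint64_t normal_pages  = zone_pages(16 * MB, 896 * MB); /* direct map */
          uint64_t highmem_pages = zone_pages(896 * MB, 2 * GB);  /* the rest */

          printf("DMA: %llu  NORMAL: %llu  HIGHMEM: %llu pages\n",
                 (unsigned long long)dma_pages,
                 (unsigned long long)normal_pages,
                 (unsigned long long)highmem_pages);

          /* The three zones together cover all 2 Gbytes of RAM. */
          assert(dma_pages + normal_pages + highmem_pages ==
                 zone_pages(0, 2 * GB));
          return 0;
  }
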
With a kernel built with ``ZONE_DMA`` disabled and ``ZONE_DMA32`` enabled and
booted with ``movablecore=80%`` parameter on an arm64 machine with 16 Gbytes of
RAM equally split between two nodes, there will be ``ZONE_DMA32``,
``ZONE_NORMAL`` and ``ZONE_MOVABLE`` on node 0, and ``ZONE_NORMAL`` and
``ZONE_MOVABLE`` on node 1::

  1G                                9G                         17G
  +--------------------------------+ +--------------------------+
  |              node 0            | |          node 1          |
  +--------------------------------+ +--------------------------+

  1G       4G         4200M        9G         9320M            17G
  +---------+----------+-----------+ +------------+-------------+
  |  DMA32  |  NORMAL  |  MOVABLE  | |   NORMAL   |   MOVABLE   |
  +---------+----------+-----------+ +------------+-------------+

Memory banks may belong to interleaving nodes. In the example below an x86
machine has 16 Gbytes of RAM in 4 memory banks, even banks belong to node 0
and odd banks belong to node 1::

  0              4G              8G             12G            16G
  +-------------+ +-------------+ +-------------+ +-------------+
  |    node 0   | |    node 1   | |    node 0   | |    node 1   |
  +-------------+ +-------------+ +-------------+ +-------------+

  0   1G         4G              8G             12G            16G
  +-----+-------+ +-------------+ +-------------+ +-------------+
  | DMA | DMA32 | |    NORMAL   | |    NORMAL   | |    NORMAL   |
  +-----+-------+ +-------------+ +-------------+ +-------------+

In this case node 0 will span from 0 to 12 Gbytes and node 1 will span from
4 to 16 Gbytes.

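
The span/present distinction in this interleaved layout can be checked with
simple arithmetic: node 0 owns the banks at 0-4G and 8-12G, so it spans 12
Gbytes while only 8 Gbytes are actually present; the 4 Gbyte hole belongs to
node 1. A plain user-space sketch, mirroring the ``node_start_pfn``,
``node_spanned_pages`` and ``node_present_pages`` fields (the variable names
are illustrative, not kernel code)::

  #include <assert.h>
  #include <stdint.h>

  #define GB (1024ULL * 1024 * 1024)
  #define PAGE_SIZE 4096ULL

  int main(void)
  {
          /* Node 0 owns the banks at [0, 4G) and [8G, 12G). */
          uint64_t node0_start_pfn     = 0;
          uint64_t node0_spanned_pages = (12 * GB - 0) / PAGE_SIZE;
          uint64_t node0_present_pages = (8 * GB) / PAGE_SIZE;

          /* The 4G hole in node 0's span is node 1's first bank. */
          assert(node0_start_pfn == 0);
          assert(node0_spanned_pages - node0_present_pages ==
                 (4 * GB) / PAGE_SIZE);

          /* Node 1 owns [4G, 8G) and [12G, 16G): same span, same hole. */
          uint64_t node1_spanned_pages = (16 * GB - 4 * GB) / PAGE_SIZE;
          assert(node1_spanned_pages == node0_spanned_pages);
          return 0;
  }
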

.. _nodes:

Nodes
=====

As we have mentioned, each node in memory is described by a ``pg_data_t`` which
is a typedef for a ``struct pglist_data``. When allocating a page, by default
Linux uses a node-local allocation policy to allocate memory from the node
closest to the running CPU. As processes tend to run on the same CPU, it is
likely the memory from the current node will be used. The allocation policy can
be controlled by users as described in
Documentation/admin-guide/mm/numa_memory_policy.rst.

Most NUMA architectures maintain an array of pointers to the node
structures. The actual structures are allocated early during boot when
architecture specific code parses the physical memory map reported by the
firmware. The bulk of the node initialization happens slightly later in the
boot process in the free_area_init() function, described later in Section
:ref:`Initialization <initialization>`.

Along with the node structures, the kernel maintains an array of ``nodemask_t``
bitmasks called ``node_states``. Each bitmask in this array represents a set of
nodes with particular properties as defined by ``enum node_states``:

``N_POSSIBLE``
  The node could become online at some point.
``N_ONLINE``
  The node is online.
``N_NORMAL_MEMORY``
  The node has regular memory.
``N_HIGH_MEMORY``
  The node has regular or high memory. When ``CONFIG_HIGHMEM`` is disabled
  aliased to ``N_NORMAL_MEMORY``.
``N_MEMORY``
  The node has memory (regular, high or movable).
``N_CPU``
  The node has one or more CPUs.

For each node that has a property described above, the bit corresponding to the
node ID in the ``node_states[<property>]`` bitmask is set.

For example, for node 2 with normal memory and CPUs, bit 2 will be set in::

  node_states[N_POSSIBLE]
  node_states[N_ONLINE]
  node_states[N_NORMAL_MEMORY]
  node_states[N_HIGH_MEMORY]
  node_states[N_MEMORY]
  node_states[N_CPU]

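
Conceptually, each ``node_states[]`` entry is just a bitmap indexed by node
ID. The following toy user-space model reproduces the bookkeeping for the
node 2 example above; the enum is a simplified subset and the helpers are
stand-ins for the kernel's nodemask operations, not the real implementation::

  #include <assert.h>

  /* Simplified subset of the kernel's enum node_states. */
  enum node_states {
          N_POSSIBLE,
          N_ONLINE,
          N_NORMAL_MEMORY,
          N_HIGH_MEMORY,
          N_MEMORY,
          N_CPU,
          NR_NODE_STATES
  };

  /* One bitmask per property; bit n describes node n. */
  static unsigned long node_states[NR_NODE_STATES];

  static void node_set_state(int nid, enum node_states state)
  {
          node_states[state] |= 1UL << nid;
  }

  static int node_state(int nid, enum node_states state)
  {
          return !!(node_states[state] & (1UL << nid));
  }

  int main(void)
  {
          /* Node 2 comes up with normal memory and CPUs. */
          node_set_state(2, N_POSSIBLE);
          node_set_state(2, N_ONLINE);
          node_set_state(2, N_NORMAL_MEMORY);
          node_set_state(2, N_HIGH_MEMORY);
          node_set_state(2, N_MEMORY);
          node_set_state(2, N_CPU);

          assert(node_state(2, N_MEMORY));
          assert(!node_state(1, N_ONLINE));
          return 0;
  }
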
For various operations possible with nodemasks please refer to
``include/linux/nodemask.h``.

Among other things, nodemasks are used to provide macros for node traversal,
namely ``for_each_node()`` and ``for_each_online_node()``.

For instance, to call a function foo() for each online node::

  for_each_online_node(nid) {
          pg_data_t *pgdat = NODE_DATA(nid);

          foo(pgdat);
  }

The node structure ``struct pglist_data`` is declared in
``include/linux/mmzone.h``. Here we briefly describe fields of this
structure:

General
~~~~~~~

``node_zones``
  The zones for this node. Not all of the zones may be populated, but it is
  the full list. It is referenced by this node's node_zonelists as well as
  other node's node_zonelists.

``node_zonelists``
  The list of all zones in all nodes. This list defines the order of zones
  that allocations are preferred from. The ``node_zonelists`` is set up by
  ``build_zonelists()`` in ``mm/page_alloc.c`` during the initialization of
  core memory management structures.

``nr_zones``
  Number of populated zones in this node.

``node_mem_map``
  For UMA systems that use the FLATMEM memory model, node 0's
  ``node_mem_map`` is an array of struct pages representing each physical
  frame.

``node_page_ext``
  For UMA systems that use the FLATMEM memory model, node 0's
  ``node_page_ext`` is an array of extensions of struct pages. Available only
  in kernels built with ``CONFIG_PAGE_EXTENSION`` enabled.

``node_start_pfn``
  The page frame number of the starting page frame in this node.

``node_present_pages``
  Total number of physical pages present in this node.

``node_spanned_pages``
  Total size of the physical page range, including holes.

``node_size_lock``
  A lock that protects the fields defining the node extents. Only defined when
  at least one of the ``CONFIG_MEMORY_HOTPLUG`` or
  ``CONFIG_DEFERRED_STRUCT_PAGE_INIT`` configuration options is enabled.
  ``pgdat_resize_lock()`` and ``pgdat_resize_unlock()`` are provided to
  manipulate ``node_size_lock`` without checking for ``CONFIG_MEMORY_HOTPLUG``
  or ``CONFIG_DEFERRED_STRUCT_PAGE_INIT``.

``node_id``
  The Node ID (NID) of the node, starts at 0.

``totalreserve_pages``
  This is a per-node reserve of pages that are not available to userspace
  allocations.

``first_deferred_pfn``
  If memory initialization on large machines is deferred then this is the
  first PFN that needs to be initialized. Defined only when
  ``CONFIG_DEFERRED_STRUCT_PAGE_INIT`` is enabled.

``deferred_split_queue``
  Per-node queue of huge pages whose split was deferred. Defined only when
  ``CONFIG_TRANSPARENT_HUGEPAGE`` is enabled.

``__lruvec``
  Per-node lruvec holding LRU lists and related parameters. Used only when
  memory cgroups are disabled. It should not be accessed directly, use
  ``mem_cgroup_lruvec()`` to look up lruvecs instead.

Reclaim control
~~~~~~~~~~~~~~~

See also Documentation/mm/page_reclaim.rst.

``kswapd``
  Per-node instance of the kswapd kernel thread.

``kswapd_wait``, ``pfmemalloc_wait``, ``reclaim_wait``
  Workqueues used to synchronize memory reclaim tasks.

``nr_writeback_throttled``
  Number of tasks that are throttled waiting on dirty pages to clean.

``nr_reclaim_start``
  Number of pages written while reclaim is throttled waiting for writeback.

``kswapd_order``
  Controls the order kswapd tries to reclaim.

``kswapd_highest_zoneidx``
  The highest zone index to be reclaimed by kswapd.

``kswapd_failures``
  Number of runs kswapd was unable to reclaim any pages.

``min_unmapped_pages``
  Minimal number of unmapped file backed pages that cannot be reclaimed.
  Determined by the ``vm.min_unmapped_ratio`` sysctl. Only defined when
  ``CONFIG_NUMA`` is enabled.

``min_slab_pages``
  Minimal number of SLAB pages that cannot be reclaimed. Determined by the
  ``vm.min_slab_ratio`` sysctl. Only defined when ``CONFIG_NUMA`` is enabled.

``flags``
  Flags controlling reclaim behavior.

Compaction control
~~~~~~~~~~~~~~~~~~

``kcompactd_max_order``
  Page order that kcompactd should try to achieve.

``kcompactd_highest_zoneidx``
  The highest zone index to be compacted by kcompactd.

``kcompactd_wait``
  Workqueue used to synchronize memory compaction tasks.

``kcompactd``
  Per-node instance of the kcompactd kernel thread.

``proactive_compact_trigger``
  Determines if proactive compaction is enabled. Controlled by the
  ``vm.compaction_proactiveness`` sysctl.

Statistics
~~~~~~~~~~

``per_cpu_nodestats``
  Per-CPU VM statistics for the node.

``vm_stat``
  VM statistics for the node.

.. _zones:

Zones
=====

This section is incomplete. Please list and describe the appropriate fields.

.. _pages:

Pages
=====

This section is incomplete. Please list and describe the appropriate fields.

.. _folios:

Folios
======

This section is incomplete. Please list and describe the appropriate fields.

.. _initialization:

Initialization
==============

This section is incomplete. Please list and describe the appropriate fields.