10 There are a few existing ways to dump kernel data into user space. The most
11 popular one is the ``/proc`` system. For example, ``cat /proc/net/tcp6`` dumps
12 all tcp6 sockets in the system, and ``cat /proc/net/netlink`` dumps all netlink
13 sockets in the system. However, their output format tends to be fixed, and if
14 users want more information about these sockets, they have to patch the kernel,
15 which often takes time to publish upstream and release. The same is true for popular
16 tools like `ss <https://man7.org/linux/man-pages/man8/ss.8.html>`_ where any
17 additional information needs a kernel patch.
19 To solve this problem, the `drgn
20 <https://www.kernel.org/doc/html/latest/bpf/drgn.html>`_ tool is often used to
21 dig out the kernel data with no kernel change. However, the main drawback for
22 drgn is performance, as it cannot do pointer tracing inside the kernel. In
23 addition, drgn cannot validate a pointer value and may read invalid data if the
24 pointer becomes invalid inside the kernel.
26 The BPF iterator solves the above problem by providing flexibility on what data
27 (e.g., tasks, bpf_maps, etc.) to collect by calling BPF programs for each kernel
30 ----------------------
31 How BPF Iterators Work
32 ----------------------
34 A BPF iterator is a type of BPF program that allows users to iterate over
35 specific types of kernel objects. Unlike traditional BPF tracing programs that
36 allow users to define callbacks that are invoked at particular points of
37 execution in the kernel, BPF iterators allow users to define callbacks that
38 should be executed for every entry in a variety of kernel data structures.
40 For example, users can define a BPF iterator that iterates over every task on
41 the system and dumps the total amount of CPU runtime currently used by each of
42 them. Another BPF task iterator may instead dump the cgroup information for each
43 task. Such flexibility is the core value of BPF iterators.
45 A BPF program is always loaded into the kernel at the behest of a user space
46 process. A user space process loads a BPF program by opening and initializing
47 the program skeleton as required and then invoking a syscall to have the BPF
48 program verified and loaded by the kernel.
50 In traditional tracing programs, a program is activated by having user space
51 obtain a ``bpf_link`` to the program with ``bpf_program__attach()``. Once
52 activated, the program callback will be invoked whenever the tracepoint is
53 triggered in the main kernel. For BPF iterator programs, a ``bpf_link`` to the
54 program is obtained using ``bpf_link_create()``, and the program callback is
55 invoked by issuing system calls from user space.
57 Next, let us see how you can use the iterators to iterate on kernel objects and
60 ------------------------
61 How to Use BPF iterators
62 ------------------------
64 BPF selftests are a great resource to illustrate how to use the iterators. In
65 this section, we’ll walk through a BPF selftest which shows how to load and use
66 a BPF iterator program. To begin, we’ll look at `bpf_iter.c
67 <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/prog_tests/bpf_iter.c>`_,
68 which illustrates how to load and trigger BPF iterators on the user space side.
69 Later, we’ll look at a BPF program that runs in kernel space.
71 Loading a BPF iterator in the kernel from user space typically involves the
74 * The BPF program is loaded into the kernel through ``libbpf``. Once the kernel
75 has verified and loaded the program, it returns a file descriptor (fd) to user
77 * Obtain a ``link_fd`` to the BPF program by calling the ``bpf_link_create()``
78 specified with the BPF program file descriptor received from the kernel.
79 * Next, obtain a BPF iterator file descriptor (``bpf_iter_fd``) by calling the
80 ``bpf_iter_create()`` specified with the ``bpf_link`` received from Step 2.
81 * Trigger the iteration by calling ``read(bpf_iter_fd)`` until no data is
83 * Close the iterator fd using ``close(bpf_iter_fd)``.
84 * If needed to reread the data, get a new ``bpf_iter_fd`` and do the read again.
86 The following are a few examples of selftest BPF iterator programs:
88 * `bpf_iter_tcp4.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_tcp4.c>`_
89 * `bpf_iter_task_vma.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_vma.c>`_
90 * `bpf_iter_task_file.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_file.c>`_
92 Let us look at ``bpf_iter_task_file.c``, which runs in kernel space:
94 Here is the definition of ``bpf_iter__task_file`` in `vmlinux.h
95 <https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html#btf>`_.
96 Any struct name in ``vmlinux.h`` in the format ``bpf_iter__<iter_name>``
97 represents a BPF iterator. The suffix ``<iter_name>`` represents the type of
102 struct bpf_iter__task_file {
104 struct bpf_iter_meta *meta;
107 struct task_struct *task;
115 In the above code, the field 'meta' contains the metadata, which is the same for
116 all BPF iterator programs. The rest of the fields are specific to different
117 iterators. For example, for task_file iterators, the kernel layer provides the
118 'task', 'fd' and 'file' field values. The 'task' and 'file' are `reference
120 <https://facebookmicrosites.github.io/bpf/blog/2018/08/31/object-lifetime.html#file-descriptors-and-reference-counters>`_,
121 so they won't go away when the BPF program runs.
123 Here is a snippet from the ``bpf_iter_task_file.c`` file:
127 SEC("iter/task_file")
128 int dump_task_file(struct bpf_iter__task_file *ctx)
130 struct seq_file *seq = ctx->meta->seq;
131 struct task_struct *task = ctx->task;
132 struct file *file = ctx->file;
135 if (task == NULL || file == NULL)
138 if (ctx->meta->seq_num == 0) {
140 BPF_SEQ_PRINTF(seq, " tgid gid fd file\n");
143 if (tgid == task->tgid && task->tgid != task->pid)
146 if (last_tgid != task->tgid) {
147 last_tgid = task->tgid;
151 BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
156 In the above example, the section name ``SEC(iter/task_file)``, indicates that
157 the program is a BPF iterator program to iterate all files from all tasks. The
158 context of the program is ``bpf_iter__task_file`` struct.
160 The user space program invokes the BPF iterator program running in the kernel
161 by issuing a ``read()`` syscall. Once invoked, the BPF
162 program can export data to user space using a variety of BPF helper functions.
163 You can use either ``bpf_seq_printf()`` (and BPF_SEQ_PRINTF helper macro) or
164 ``bpf_seq_write()`` function based on whether you need formatted output or just
165 binary data, respectively. For binary-encoded data, the user space applications
166 can process the data from ``bpf_seq_write()`` as needed. For the formatted data,
167 you can use ``cat <path>`` to print the results similar to ``cat
168 /proc/net/netlink`` after pinning the BPF iterator to the bpffs mount. Later,
169 use ``rm -f <path>`` to remove the pinned iterator.
171 For example, you can use the following command to create a BPF iterator from the
172 ``bpf_iter_ipv6_route.o`` object file and pin it to the ``/sys/fs/bpf/my_route``
177 $ bpftool iter pin ./bpf_iter_ipv6_route.o /sys/fs/bpf/my_route
179 And then print out the results using the following command:
183 $ cat /sys/fs/bpf/my_route
186 -------------------------------------------------------
187 Implement Kernel Support for BPF Iterator Program Types
188 -------------------------------------------------------
190 To implement a BPF iterator in the kernel, the developer must make a one-time
191 change to the following key data structure defined in the `bpf.h
192 <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/include/linux/bpf.h>`_
197 struct bpf_iter_reg {
199 bpf_iter_attach_target_t attach_target;
200 bpf_iter_detach_target_t detach_target;
201 bpf_iter_show_fdinfo_t show_fdinfo;
202 bpf_iter_fill_link_info_t fill_link_info;
203 bpf_iter_get_func_proto_t get_func_proto;
204 u32 ctx_arg_info_size;
206 struct bpf_ctx_arg_aux ctx_arg_info[BPF_ITER_CTX_ARG_MAX];
207 const struct bpf_iter_seq_info *seq_info;
210 After filling the data structure fields, call ``bpf_iter_reg_target()`` to
211 register the iterator to the main BPF iterator subsystem.
213 The following is the breakdown for each field in struct ``bpf_iter_reg``.
222 - Specifies the name of the BPF iterator. For example: ``bpf_map``,
223 ``bpf_map_elem``. The name should be different from other ``bpf_iter`` target names in the kernel.
224 * - attach_target and detach_target
225 - Allows for target specific ``link_create`` action since some targets
226 may need special processing. Called during the user space link_create stage.
227 * - show_fdinfo and fill_link_info
228 - Called to fill target specific information when user tries to get link
229 info associated with the iterator.
231 - Permits a BPF iterator to access BPF helpers specific to the iterator.
232 * - ctx_arg_info_size and ctx_arg_info
233 - Specifies the verifier states for BPF program arguments associated with
236 - Specifies certain action requests in the kernel BPF iterator
237 infrastructure. Currently, only BPF_ITER_RESCHED is supported. This means
238 that the kernel function cond_resched() is called to avoid other kernel
239 subsystem (e.g., rcu) misbehaving.
241 - Specifies the set of seq operations for the BPF iterator and helpers to
242 initialize/free the private data for the corresponding ``seq_file``.
245 <https://lore.kernel.org/bpf/20210212183107.50963-2-songliubraving@fb.com/>`_
246 to see an implementation of the ``task_vma`` BPF iterator in the kernel.
248 ---------------------------------
249 Parameterizing BPF Task Iterators
250 ---------------------------------
252 By default, BPF iterators walk through all the objects of the specified types
253 (processes, cgroups, maps, etc.) across the entire system to read relevant
254 kernel data. But often, there are cases where we only care about a much smaller
255 subset of iterable kernel objects, such as only iterating tasks within a
256 specific process. Therefore, BPF iterator programs support filtering out objects
257 from iteration by allowing user space to configure the iterator program when it
260 --------------------------
261 BPF Task Iterator Program
262 --------------------------
264 The following code is a BPF iterator program to print files and task information
265 through the ``seq_file`` of the iterator. It is a standard BPF iterator program
266 that visits every file of an iterator. We will use this BPF program in our
272 #include <bpf/bpf_helpers.h>
274 char _license[] SEC("license") = "GPL";
276 SEC("iter/task_file")
277 int dump_task_file(struct bpf_iter__task_file *ctx)
279 struct seq_file *seq = ctx->meta->seq;
280 struct task_struct *task = ctx->task;
281 struct file *file = ctx->file;
283 if (task == NULL || file == NULL)
285 if (ctx->meta->seq_num == 0) {
286 BPF_SEQ_PRINTF(seq, " tgid pid fd file\n");
288 BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
293 ----------------------------------------
294 Creating a File Iterator with Parameters
295 ----------------------------------------
297 Now, let us look at how to create an iterator that includes only files of a
300 First, fill the ``bpf_iter_attach_opts`` struct as shown below:
304 LIBBPF_OPTS(bpf_iter_attach_opts, opts);
305 union bpf_iter_link_info linfo;
306 memset(&linfo, 0, sizeof(linfo));
307 linfo.task.pid = getpid();
308 opts.link_info = &linfo;
309 opts.link_info_len = sizeof(linfo);
311 ``linfo.task.pid``, if it is non-zero, directs the kernel to create an iterator
312 that only includes opened files for the process with the specified ``pid``. In
313 this example, we will only be iterating files for our process. If
314 ``linfo.task.pid`` is zero, the iterator will visit every opened file of every
315 process. Similarly, ``linfo.task.tid`` directs the kernel to create an iterator
316 that visits opened files of a specific thread, not a process. In this example,
317 ``linfo.task.tid`` is different from ``linfo.task.pid`` only if the thread has a
318 separate file descriptor table. In most circumstances, all process threads share
319 a single file descriptor table.
321 Now, in the userspace program, pass the pointer of struct to the
322 ``bpf_program__attach_iter()``.
326 link = bpf_program__attach_iter(prog, &opts); iter_fd =
327 bpf_iter_create(bpf_link__fd(link));
329 If both *tid* and *pid* are zero, an iterator created from this struct
330 ``bpf_iter_attach_opts`` will include every opened file of every task in the
331 system (in the namespace, actually.) It is the same as passing a NULL as the
332 second argument to ``bpf_program__attach_iter()``.
334 The whole program looks like the following code:
341 #include <bpf/libbpf.h>
342 #include "bpf_iter_task_ex.skel.h"
344 static int do_read_opts(struct bpf_program *prog, struct bpf_iter_attach_opts *opts)
346 struct bpf_link *link;
348 int iter_fd = -1, len;
351 link = bpf_program__attach_iter(prog, opts);
353 fprintf(stderr, "bpf_program__attach_iter() fails\n");
356 iter_fd = bpf_iter_create(bpf_link__fd(link));
358 fprintf(stderr, "bpf_iter_create() fails\n");
362 /* not check contents, but ensure read() ends without error */
363 while ((len = read(iter_fd, buf, sizeof(buf) - 1)) > 0) {
371 bpf_link__destroy(link);
375 static void test_task_file(void)
377 LIBBPF_OPTS(bpf_iter_attach_opts, opts);
378 struct bpf_iter_task_ex *skel;
379 union bpf_iter_link_info linfo;
380 skel = bpf_iter_task_ex__open_and_load();
383 memset(&linfo, 0, sizeof(linfo));
384 linfo.task.pid = getpid();
385 opts.link_info = &linfo;
386 opts.link_info_len = sizeof(linfo);
387 printf("PID %d\n", getpid());
388 do_read_opts(skel->progs.dump_task_file, &opts);
389 bpf_iter_task_ex__destroy(skel);
392 int main(int argc, const char * const * argv)
398 The following lines are the output of the program.
404 1859 1859 0 ffffffff82270aa0
405 1859 1859 1 ffffffff82270aa0
406 1859 1859 2 ffffffff82270aa0
407 1859 1859 3 ffffffff82272980
408 1859 1859 4 ffffffff8225e120
409 1859 1859 5 ffffffff82255120
410 1859 1859 6 ffffffff82254f00
411 1859 1859 7 ffffffff82254d80
412 1859 1859 8 ffffffff8225abe0
418 Let us look at how a BPF iterator without parameters skips files of other
419 processes in the system. In this case, the BPF program has to check the pid or
420 the tid of tasks, or it will receive every opened file in the system (in the
421 current *pid* namespace, actually). So, we usually add a global variable in the
422 BPF program to pass a *pid* to the BPF program.
424 The BPF program would look like the following block.
431 SEC("iter/task_file")
432 int dump_task_file(struct bpf_iter__task_file *ctx)
435 if (task->tgid != target_pid) /* Check task->pid instead to check thread IDs */
437 BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
442 The user space program would look like the following block:
447 static void test_task_file(void)
450 skel = bpf_iter_task_ex__open_and_load();
453 skel->bss->target_pid = getpid(); /* process ID. For thread id, use gettid() */
454 memset(&linfo, 0, sizeof(linfo));
455 linfo.task.pid = getpid();
456 opts.link_info = &linfo;
457 opts.link_info_len = sizeof(linfo);
461 ``target_pid`` is a global variable in the BPF program. The user space program
462 should initialize the variable with a process ID to skip opened files of other
463 processes in the BPF program. When you parametrize a BPF iterator, the iterator
464 calls the BPF program fewer times which can save significant resources.
466 ---------------------------
467 Parametrizing VMA Iterators
468 ---------------------------
470 By default, a BPF VMA iterator includes every VMA in every process. However,
471 you can still specify a process or a thread to include only its VMAs. Unlike
472 files, a thread can not have a separate address space (since Linux 2.6.0-test6).
473 Here, using *tid* makes no difference from using *pid*.
475 ----------------------------
476 Parametrizing Task Iterators
477 ----------------------------
479 A BPF task iterator with *pid* includes all tasks (threads) of a process. The
480 BPF program receives these tasks one after another. You can specify a BPF task
481 iterator with *tid* parameter to include only the tasks that match the given