<!-- subject: Will the real <code>ARG_MAX</code> please stand up? Part 2 -->
<!-- date: 2021-04-18 01:49:29 -->
<!-- tags: arg_max, unix, linux -->
<!-- categories: Articles, Techblog -->
<p>In <a href="/2021/the-real-arg-max-part-1/">part one</a> we looked at
the <code>ARG_MAX</code> parameter on Linux-based systems. We established
experimentally how it limits arguments passed to programs and what influences
its value. This time, we’ll look directly at the source to verify our
findings and see how <code>ARG_MAX</code> looks from the point of view of
the system libraries and the kernel itself.
<!-- FULL -->

<h2>C system library</h2>
<p>Applications get the value of the <code>ARG_MAX</code> parameter from
the <code>sysconf</code> function. It’s what the <code>getconf</code> utility
uses to report the limit. But even though the result of the function is
closely related to the kernel, looking for its definition in the Linux source
code is an exercise in futility. Rather, the function is defined in the
C system library which, in GNU/Linux distributions, is commonly provided by
the glibc package.
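<p>As a minimal sketch of how an application might query the limit, the
function below wraps <code>sysconf</code>. The name <code>query_arg_max</code>
and the <code>MIN_DEFINITE_ARG_MAX</code> macro are made up for this
illustration; only <code>sysconf</code> and <code>_SC_ARG_MAX</code> come from
POSIX.

```c
#include <assert.h>
#include <unistd.h>

/* Smallest value POSIX allows for a definite ARG_MAX (_POSIX_ARG_MAX). */
#define MIN_DEFINITE_ARG_MAX 4096L

/* Hypothetical helper: returns the ARG_MAX value reported by the C library,
   or -1 if the library cannot give a definite limit. */
long query_arg_max(void)
{
    long value = sysconf(_SC_ARG_MAX);

    /* Sanity check: a definite limit is never below the POSIX minimum. */
    assert(value == -1 || value >= MIN_DEFINITE_ARG_MAX);
    return value;
}
```

<p>On a system where the C library and kernel behave as described in this
article, the returned value should match what <code>getconf ARG_MAX</code>
prints.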
<p>glibc is a cross-platform library which supports many kernels and
architectures. It often includes multiple definitions of the same function,
each tailored to a particular platform. Such is the case
with <code>sysconf</code>. Thankfully, our analysis is limited to Linux and
in glibc 2.33, the implementation we’re interested in is located
in the <code>sysdeps/unix/sysv/linux/sysconf.c</code> file and looks as follows:
<pre>
#define legacy_ARG_MAX 131072

<i>/* […] */</i>

long int
__sysconf (int name)
{
  const char *procfname = NULL;

  switch (name)
    {
      <i>/* […] */</i>

    case _SC_ARG_MAX:
      {
        struct rlimit rlimit;
        /* Use getrlimit to get the stack limit. */
        if (__getrlimit (RLIMIT_STACK, &amp;rlimit) == 0)
          return MAX (legacy_ARG_MAX, rlimit.rlim_cur / 4);

        return legacy_ARG_MAX;
      }

      <i>/* […] */</i>
    }

  return posix_sysconf (name);
}
</pre>
<p>This code explains the discrepancies we observed
when <a href="/2021/the-real-arg-max-part-1/#bigstack">testing a large stack
size limit</a>. While glibc implements the 128 KiB lower bound, it’s unaware
of the 6 MiB upper bound. Since the <code>getconf</code> utility relies
on the <code>sysconf</code> library function, the above implementation
means that for large stacks the tool will wrongly report <code>ARG_MAX</code>
as a quarter of the maximum stack size.
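<p>The effect is easy to reproduce with a small model of the calculation.
This is a sketch rather than glibc code: the helper
name <code>glibc_2_33_arg_max</code> is hypothetical and it takes the soft
stack limit as a plain argument instead of calling <code>getrlimit</code>.

```c
#include <assert.h>

#define LEGACY_ARG_MAX 131072UL /* mirrors legacy_ARG_MAX in the listing */

static_assert(LEGACY_ARG_MAX == 128UL * 1024, "legacy_ARG_MAX is 128 KiB");

/* Hypothetical model of what glibc 2.33 reports for _SC_ARG_MAX, given
   the soft stack size limit in bytes. */
unsigned long glibc_2_33_arg_max(unsigned long stack_soft_limit)
{
    unsigned long quarter = stack_soft_limit / 4;

    /* Only the 128 KiB lower bound is applied; nothing caps the result. */
    return quarter > LEGACY_ARG_MAX ? quarter : LEGACY_ARG_MAX;
}
```

<p>With an 8 MiB stack limit the model yields 2 MiB, but with a 64 MiB limit
it yields 16 MiB, well above the 6 MiB the kernel actually allows.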
<p>glibc isn’t the only C library used on Linux systems. Others have their
own <code>sysconf</code> implementations which may return different values.
uClibc-ng 1.0.38 behaves the same way glibc does, while bionic 10.0, dietlibc
0.34 and musl 1.2 return 128 KiB as <code>ARG_MAX</code>.
<p>The good news is that the situation with glibc has since improved. glibc 2.34
was released
with <a href="https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=a9880586eedb3ba89ca6a7c5e3f0664c279cf636">my
commit</a>, which makes <code>sysconf</code> aware of the 6 MiB upper bound.
Recent GNU/Linux systems will report <code>ARG_MAX</code> correctly even for
large stacks.
<h2>Linux kernel</h2>

<p>On the kernel side, we want to look at the <code>execve</code> system call.
It is defined using a <code>SYSCALL_DEFINE<var>n</var></code> macro and it
doesn’t take long to find its implementation in the <code>fs/exec.c</code> file.
In Linux 5.11.11 it looks as follows:
<pre>
SYSCALL_DEFINE3(execve,
                const char __user *, filename,
                const char __user *const __user *, argv,
                const char __user *const __user *, envp)
{
        return do_execve(getname(filename), argv, envp);
}
</pre>
<p>The definition of <code>do_execve</code> can be found a few lines earlier in
the same file. All it does is call the <code>do_execveat_common</code> function,
so that’s what we’re going to take a closer look at. It is where most of the
checks and calculations happen:
<pre>
static int do_execveat_common(int fd, struct filename *filename,
                              struct user_arg_ptr argv,
                              struct user_arg_ptr envp,
                              int flags)
{
        struct linux_binprm *bprm;
        int retval;
        <i>/* […] */</i>

        retval = count(argv, MAX_ARG_STRINGS);
        if (retval &lt; 0)
                goto out_free;
        bprm->argc = retval;

        retval = count(envp, MAX_ARG_STRINGS);
        if (retval &lt; 0)
                goto out_free;
        bprm->envc = retval;

        retval = bprm_stack_limits(bprm);
        if (retval &lt; 0)
                goto out_free;

        retval = copy_string_kernel(bprm->filename, bprm);
        if (retval &lt; 0)
                goto out_free;
        bprm->exec = bprm->p;

        retval = copy_strings(bprm->envc, envp, bprm);
        if (retval &lt; 0)
                goto out_free;

        retval = copy_strings(bprm->argc, argv, bprm);
        if (retval &lt; 0)
                goto out_free;

        retval = bprm_execve(bprm, fd, filename, flags);

        <i>/* […] */</i>
        return retval;
}
</pre>
<p>The two invocations of the <code>count</code> function calculate the number of
command line arguments and environment variables. Each call may fail if the
number exceeds <code>MAX_ARG_STRINGS</code>. Technically speaking this is
another limit, but in practice the constant is over two billion and, as we’ll
see later, there is no way to reach this number without reaching other limits
first. The only other situation in which the <code>count</code> function may
return an error is a memory fault, but that’s not interesting for our
analysis.
<h3>Limit calculation</h3>

<p><code>bprm_stack_limits</code> is where the actual calculation happens. The
function determines the limit and stores it in the <code>bprm</code>
structure. It’s defined as follows:
<pre>
static int bprm_stack_limits(struct linux_binprm *bprm)
{
        unsigned long limit, ptr_size;

        limit = _STK_LIM / 4 * 3;
        limit = min(limit, bprm->rlim_stack.rlim_cur / 4);
        limit = max_t(unsigned long, limit, ARG_MAX);

        ptr_size = (bprm->argc + bprm->envc) * sizeof(void *);
        if (limit &lt;= ptr_size)
                return -E2BIG;
        limit -= ptr_size;

        bprm->argmin = bprm->p - limit;
        return 0;
}
</pre>
<p><code>_STK_LIM</code> is the default stack size limit and equals 8 MiB. The
first expression in the function takes three quarters of that value, which is
what introduces the upper bound of 6 MiB for arguments. It’s worth noting that
it’s a relatively new restriction introduced in Linux 4.13 (and later
back-ported to previous releases). Why it’s there might be a story for
another time.
<p>The second expression in the function is what implements the ‘quarter of the
stack size’ rule. This is what could be called the ‘normal’ case and it is
definitely the most typical of common desktop and server configurations. With
the default maximum stack size limit being 8 MiB, the default limit for
executable arguments ends up being 2 MiB.
<p>The third expression sets the limit to be no less
than <code>ARG_MAX</code>. This gets a bit confusing. <code>ARG_MAX</code> is
supposed to be a dynamic value and here we see a constant of the same name.
As is often the case, the explanation lies in the past. Historically the
value was constant and defined as a macro in kernel headers. Eventually,
a more dynamic approach was introduced but the definition of the macro stuck.
To maintain backwards compatibility, the dynamic calculation kept the old
static value as a lower bound.
<p>The last adjustment in the function is to reserve space for
the <code>argv</code> and <code>envp</code> arrays. If the limit cannot
accommodate them, the function returns an error; otherwise the limit is reduced
by the necessary space. This is where we can see that the limit of two
billion arguments and environment variables (imposed by the <code>count</code>
function called in <code>do_execveat_common</code>) can never be reached.
With a 6 MiB upper bound for the limit, the most one could hope for is about
1.25 million arguments, and that’s only on a 32-bit system with all strings
empty, where each argument costs a four-byte pointer plus a NUL byte.
<p>The calculated limit is finally stored in the <code>argmin</code> field of
the <code>bprm</code> structure. It specifies the lowest address at which
arguments can still be stored, and the value will be checked later on when
the program’s executable path, environment variables and command line arguments
are copied. Recall that the stack grows downward, which is why the field
specifies the minimum and why it’s calculated by subtracting the argument size
limit from the current top of the stack (specified by <code>bprm->p</code>).
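<p>The three bounds and the pointer reservation can be modelled in user space.
The following is a sketch under stated assumptions, not kernel code: the
function names and the <code>STK_LIM</code> and <code>KERNEL_ARG_MAX</code>
macros are made up to mirror the kernel’s <code>_STK_LIM</code>
and <code>ARG_MAX</code>, the pointer size is passed in explicitly, and the
space taken by the executable path is ignored.

```c
#include <assert.h>

#define KIB 1024UL
#define MIB (1024UL * KIB)
#define STK_LIM (8 * MIB)          /* stands in for the kernel's _STK_LIM */
#define KERNEL_ARG_MAX (128 * KIB) /* stands in for the kernel's ARG_MAX */

/* Hypothetical model of bprm_stack_limits: returns the number of bytes left
   for the strings themselves, or -1 where the kernel would return -E2BIG. */
long model_string_budget(unsigned long stack_soft_limit,
                         unsigned long argc, unsigned long envc,
                         unsigned long ptr_size)
{
    unsigned long limit = STK_LIM / 4 * 3;        /* 6 MiB upper bound */
    unsigned long quarter = stack_soft_limit / 4; /* quarter-of-stack rule */
    unsigned long ptrs;

    if (quarter < limit)
        limit = quarter;
    if (limit < KERNEL_ARG_MAX)
        limit = KERNEL_ARG_MAX;                   /* 128 KiB lower bound */

    ptrs = (argc + envc) * ptr_size;              /* argv and envp arrays */
    if (limit <= ptrs)
        return -1;
    return (long)(limit - ptrs);
}

/* Upper estimate of how many empty strings fit in `limit` bytes: each one
   costs a pointer plus a single NUL byte. */
unsigned long max_empty_strings(unsigned long limit, unsigned long ptr_size)
{
    assert(ptr_size > 0);
    return limit / (ptr_size + 1);
}
```

<p>For example, <code>model_string_budget(8 * MIB, 0, 0, 8)</code> gives
2 MiB, a 64 MiB stack limit gives 6 MiB,
and <code>max_empty_strings(6 * MIB, 4)</code> gives 1258291, which is where
the figure of roughly 1.25 million arguments comes from.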
<h3>Copying strings</h3>

<p>Eventually, <code>do_execveat_common</code> checks the lengths of the strings
while copying them to the new program’s memory. First, the path to the program’s
executable is transferred with the help of the <code>copy_string_kernel</code>
function, which is defined as follows:
<pre>
int copy_string_kernel(const char *arg, struct linux_binprm *bprm)
{
        int len = strnlen(arg, MAX_ARG_STRLEN) + 1 <i>/* terminating NUL */</i>;
        unsigned long pos = bprm->p;

        if (len == 0)
                return -EFAULT;
        if (!valid_arg_len(bprm, len))
                return -E2BIG;

        arg += len;
        bprm->p -= len;
        if (IS_ENABLED(CONFIG_MMU) && bprm->p &lt; bprm->argmin)
                return -E2BIG;

        <i>/* [… copy the string …] */</i>
        <i>/* [… analogous to memcpy(bprm->p, arg, len); …] */</i>

        return 0;
}
</pre>
<p>Firstly, <code>strnlen</code> paired with a call to <code>valid_arg_len</code>
checks whether the string exceeds <code>MAX_ARG_STRLEN</code> bytes (128 KiB
on systems with 4 KiB pages). <code>valid_arg_len</code> is a trivial inline
function whose body simply states <code>return len &lt;= MAX_ARG_STRLEN;</code>.
If the size of the string exceeds the limit, the argument list is deemed too
long and the function returns an error.
<p>Then, the function checks if there’s enough space on the stack to fit the
string. This is done by moving the stack pointer downwards
(i.e. subtracting <code>len</code> from the <code>bprm->p</code> field) to reserve
memory for the argument and checking whether the new position of the edge of
the stack crossed the limit (by checking if <code>bprm->p &lt;
bprm->argmin</code>). If so, the argument list is too long. Otherwise the
argument is copied onto the stack.
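<p>The two checks can be sketched as a user-space model.
The <code>stack_model</code> structure and <code>reserve_string</code>
function are hypothetical stand-ins for the relevant <code>linux_binprm</code>
fields and kernel logic, and the 128 KiB constant assumes 4 KiB pages. Unlike
the kernel, the model compares before subtracting, so a failed call leaves the
state untouched; the accepted and rejected lengths are the same.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_MAX_ARG_STRLEN (32UL * 4096) /* 128 KiB, assuming 4 KiB pages */

/* Hypothetical stand-in for the two linux_binprm fields used above. */
struct stack_model {
    unsigned long p;      /* current edge of the downward-growing stack */
    unsigned long argmin; /* lowest address the strings may occupy */
};

/* Reserves `len` bytes (string length including the terminating NUL).
   Returns 0 on success or -E2BIG if either limit would be exceeded. */
int reserve_string(struct stack_model *m, unsigned long len)
{
    assert(m != NULL && m->p >= m->argmin);

    if (len > MODEL_MAX_ARG_STRLEN)
        return -E2BIG;  /* a single string is too long */
    if (len > m->p - m->argmin)
        return -E2BIG;  /* p - len would fall below argmin */

    m->p -= len;        /* move the stack edge downwards */
    return 0;
}
```

<p>Note that landing exactly on <code>argmin</code> is still allowed, since the
kernel only rejects <code>bprm->p &lt; bprm->argmin</code>.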
<p>The <code>copy_strings</code> function, which <code>do_execveat_common</code>
calls to transfer environment variables and command line arguments, is
entirely analogous. The two differences are that i) the source data lives in
user-space and ii) the function operates in a loop, copying a sequence of
strings.
<pre>
static int copy_strings(int argc, struct user_arg_ptr argv,
                        struct linux_binprm *bprm)
{
        <i>/* […] */</i>
        int ret;

        while (argc-- > 0) {
                const char __user *str;
                int len;
                unsigned long pos;

                ret = -EFAULT;
                str = get_user_arg_ptr(argv, argc);
                if (IS_ERR(str))
                        goto out;

                len = strnlen_user(str, MAX_ARG_STRLEN);
                if (!len)
                        goto out;

                ret = -E2BIG;
                if (!valid_arg_len(bprm, len))
                        goto out;

                pos = bprm->p;
                str += len;
                bprm->p -= len;
                if (bprm->p &lt; bprm->argmin)
                        goto out;

                <i>/* [… copy the string …] */</i>
                <i>/* [… analogous to memcpy(bprm->p, str, len); …] */</i>
        }
        ret = 0;
out:
        <i>/* […] */</i>
        return ret;
}
</pre>
<p>Having to read from user-space complicates the function, though much of that
complexity has been hidden from the listing above in the elided code. The
visible differences are the calls to <code>get_user_arg_ptr</code>
and <code>strnlen_user</code> instead of <code>strnlen</code>.
<p>The parts that interest us remain the same: the <code>valid_arg_len</code>
call and the <code>bprm->p &lt; bprm->argmin</code> comparison.
<h2>Conclusion</h2>

<p>This concludes the investigation. In the previous article we saw how the
argument length limit affects user-space; here we looked at the source code of
the C library and the kernel to confirm our previous findings. There are still
a few minor mysteries (such as why the 6 MiB limit exists or what happens if
the maximum stack size is less than 128 KiB) which I may tackle at another time.
<p>It remains important to remember that our findings are true for Linux only.
Other kernels will set the limit differently and count different things
towards it. POSIX leaves the details purposefully vague. As a result,
a portable application may struggle to interpret the limit; it should not only
take the value of <code>ARG_MAX</code> with a grain of salt but ideally also
recover from an <code>E2BIG</code> error by reducing the number of arguments.
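<p>One way to reduce the number of arguments is to split them into batches
that fit within a conservative budget. The helper below is only a sketch:
<code>count_fitting_args</code> is a made-up name, the budget is whatever the
caller considers safe, and charging each string its length, a NUL byte and one
pointer mirrors the Linux accounting described above, which is an assumption
on other systems. A robust program would still retry with a smaller batch if
the exec call fails with <code>E2BIG</code>.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts how many leading strings of args[0..n) fit
   within `budget` bytes. */
size_t count_fitting_args(const char *const *args, size_t n, size_t budget)
{
    size_t used = 0;
    size_t i;

    assert(args != NULL || n == 0);

    for (i = 0; i < n; i++) {
        /* String bytes, terminating NUL and the pointer in argv. */
        size_t cost = strlen(args[i]) + 1 + sizeof(char *);

        if (cost > budget - used)
            break;      /* the next string would overflow the budget */
        used += cost;
    }
    return i;
}
```

<p>A caller would pass the first batch to the command, advance
by the returned count and repeat until no arguments are left.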
<p>Fortunately, UNIX-like systems provide a simple solution in the form
of the <code>xargs</code> and <code>find … -exec … +</code> commands. Those
should be much easier to use and sufficient for most cases. They will
typically know how to deal with the command’s argument size limit.
<p>Whatever the case may be, I hope this article has been informative and
provided further understanding of the kernel and its interaction with
user-space.