posts/the-real-arg-max-part-2.en.html

   1 <!-- subject: Will the real <code>ARG_MAX</code> please stand up? Part 2 -->
   2 <!-- date: 2021-04-18 01:49:29 -->
   3 <!-- tags: arg_max, unix, linux -->
   4 <!-- categories: Articles, Techblog -->
   5
   6 <p>In <a href="/2021/the-real-arg-max-part-1/">part one</a> we looked at
   7   the <code>ARG_MAX</code> parameter on Linux-based systems.  We established
   8   experimentally how it limits arguments passed to programs and what influences
   9   its value.  This time, we’ll look directly at the source to verify our
  10   findings and see how <code>ARG_MAX</code> looks from the point of view of
  11   system libraries and the kernel itself.
  12
  13 <!-- FULL -->
  14
  15 <h2>C system library</h2>
  16
  17 <p>Applications get the value of the <code>ARG_MAX</code> parameter from
  18   the <code>sysconf</code> function.  It’s what the <code>getconf</code> utility
  19   uses to report the limit.  But even though the result of the function is
  20   closely related to the kernel, looking for its definition in the Linux source
  21   code is an exercise in futility.  Rather, the function is defined in the
  22   C system library which, in GNU/Linux distributions, is commonly provided by
  23   the glibc package.
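
<p>As a minimal sketch of the call described above, the following C fragment
  asks the C library for the limit through <code>sysconf</code>.  The helper
  names <code>query_arg_max</code> and <code>print_arg_max</code> are made up
  for this illustration; only <code>sysconf</code> and <code>_SC_ARG_MAX</code>
  come from the system headers.  The value printed depends on the library and
  the current stack size limit, so no particular output is assumed.

```c
#include <stdio.h>
#include <unistd.h>

/* Ask the C library for ARG_MAX.  A negative result means the library
   reported an error or no definite limit. */
long query_arg_max(void)
{
    return sysconf(_SC_ARG_MAX);
}

/* Print the limit to the given stream.  Returns 0 on success and -1 if
   the limit is unavailable or the write failed. */
int print_arg_max(FILE *out)
{
    long value = query_arg_max();

    if (value < 0)
        return -1;
    return fprintf(out, "ARG_MAX: %ld bytes\n", value) < 0 ? -1 : 0;
}
```

<p>Calling <code>print_arg_max(stdout)</code> from <code>main</code> is enough
  to try it out.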
  24
  25 <p>glibc is a cross-platform library which supports many kernels and
  26   architectures.  It often includes multiple definitions of the same function,
  27   each tailored for a particular platform.  Such is the case
  28   with <code>sysconf</code>.  Thankfully, our analysis is limited to Linux and
  29   in glibc 2.33, the implementation we’re interested in is located in
  30   the <code>sysdeps/unix/sysv/linux/sysconf.c</code> file and looks as follows:
  31
  32 <pre>
  33 #define legacy_ARG_MAX 131072
  34
  35 <i>/* […] */</i>
  36
  37 long int
  38 __sysconf (int name)
  39 {
  40   const char *procfname = NULL;
  41
  42   switch (name)
  43     {
  44       <i>/* […] */</i>
  45
  46     case _SC_ARG_MAX:
  47       {
  48         struct rlimit rlimit;
  49         /* Use getrlimit to get the stack limit.  */
  50         if (__getrlimit (RLIMIT_STACK, &amp;rlimit) == 0)
  51           return MAX (legacy_ARG_MAX, rlimit.rlim_cur / 4);
  52
  53         return legacy_ARG_MAX;
  54       }
  55
  56       <i>/* […] */</i>
  57     }
  58
  59   return posix_sysconf (name);
  60 }
  61 </pre>
  62
  63 <p>This code explains the discrepancies we observed
  64   when <a href="/2021/the-real-arg-max-part-1/#bigstack">testing a large stack
  65   size limit</a>.  While glibc implements the 128 KiB lower bound, it’s unaware
  66   of the 6 MiB upper bound.  Since the <code>getconf</code> utility relies on
  67   the <code>sysconf</code> library function, having the above implementation
  68   means that for large stacks the tool will wrongly report <code>ARG_MAX</code>
  69   as a quarter of the maximum stack size.
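
<p>To make the discrepancy concrete, here is a small user-space model of the
  glibc 2.33 calculation.  It is only a sketch: the function names are
  hypothetical, the stack limit is passed in as a plain number instead of being
  read with <code>getrlimit</code>, and the 6 MiB constant is simply the upper
  bound mentioned above.

```c
#define KIB 1024ULL
#define MIB (1024ULL * KIB)

/* Model of the _SC_ARG_MAX case in glibc 2.33: a quarter of the soft
   stack limit, but never less than the legacy 128 KiB. */
unsigned long long glibc_233_arg_max(unsigned long long stack_limit)
{
    unsigned long long quarter = stack_limit / 4;

    return quarter > 128 * KIB ? quarter : 128 * KIB;
}

/* Non-zero when the modelled glibc 2.33 result is larger than the
   6 MiB upper bound, i.e. when the reported value is too optimistic. */
int overstates_limit(unsigned long long stack_limit)
{
    return glibc_233_arg_max(stack_limit) > 6 * MIB;
}
```

<p>With an 8 MiB stack limit the model yields 2 MiB, which is within the
  bound.  With a 64 MiB stack limit it yields 16 MiB, well above the 6 MiB
  upper bound.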
  70
  71 <p>glibc isn’t the only library used on Linux systems.  Others have their
  72   own <code>sysconf</code> implementations which may return different values.
  73   uClibc-ng 1.0.38 behaves the same way glibc does while bionic 10.0, dietlibc
  74   0.34 and musl 1.2 return 128 KiB as <code>ARG_MAX</code>.
  75
  76 <p>The good news is that the situation with glibc has since improved.  glibc
  77   2.34 was released
  78   with <a href="https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=a9880586eedb3ba89ca6a7c5e3f0664c279cf636">my
  79   commit</a> which makes <code>sysconf</code> aware of the 6 MiB upper bound.
  80   Recent GNU/Linux systems will report <code>ARG_MAX</code> correctly even for
  81   large stacks.
  82
  83
  84 <h2>Linux kernel</h2>
  85
  86 <p>On the kernel side, we want to look at the <code>execve</code> system call.
  87   It is defined using a <code>SYSCALL_DEFINE<var>n</var></code> macro and it
  88   doesn’t take long to find its implementation in the <code>fs/exec.c</code> file.
  89   In Linux 5.11.11 it looks as follows:
  90
  91 <pre>
  92 SYSCALL_DEFINE3(execve,
  93                 const char __user *, filename,
  94                 const char __user *const __user *, argv,
  95                 const char __user *const __user *, envp)
  96 {
  97         return do_execve(getname(filename), argv, envp);
  98 }
  99 </pre>
 100
 101 <p>The definition of <code>do_execve</code> can be found a few lines earlier in
 102   the same file.  All it does is call the <code>do_execveat_common</code>
 103   function, so that’s what we’re going to take a closer look at.  It is where
 104   most of the checks and calculations happen:
 105
 106 <pre>
 107 static int do_execveat_common(int fd, struct filename *filename,
 108                               struct user_arg_ptr argv,
 109                               struct user_arg_ptr envp,
 110                               int flags)
 111 {
 112         struct linux_binprm *bprm;
 113         int retval;
 114         <i>/* […] */</i>
 115
 116         retval = count(argv, MAX_ARG_STRINGS);
 117         if (retval &lt; 0)
 118                 goto out_free;
 119         bprm->argc = retval;
 120
 121         retval = count(envp, MAX_ARG_STRINGS);
 122         if (retval &lt; 0)
 123                 goto out_free;
 124         bprm->envc = retval;
 125
 126         retval = bprm_stack_limits(bprm);
 127         if (retval &lt; 0)
 128                 goto out_free;
 129
 130         retval = copy_string_kernel(bprm->filename, bprm);
 131         if (retval &lt; 0)
 132                 goto out_free;
 133         bprm->exec = bprm->p;
 134
 135         retval = copy_strings(bprm->envc, envp, bprm);
 136         if (retval &lt; 0)
 137                 goto out_free;
 138
 139         retval = copy_strings(bprm->argc, argv, bprm);
 140         if (retval &lt; 0)
 141                 goto out_free;
 142
 143         retval = bprm_execve(bprm, fd, filename, flags);
 144
 145         <i>/* […] */</i>
 146         return retval;
 147 }
 148 </pre>
 149
 150 <p>The two invocations of the <code>count</code> function calculate the number
 151   of command line arguments and environment variables.  Each call may fail if
 152   the number exceeds <code>MAX_ARG_STRINGS</code>.  Technically speaking, this
 153   is another limit, but in practice the constant is over two billion and, as
 154   we’ll see later, there is no way to reach this number without reaching other
 155   limits first.  The only other situation in which <code>count</code> may
 156   return an error is a memory fault, but that’s not interesting for our
 157   analysis.
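
<p>The kernel’s <code>count</code> is not shown in this article, so the
  following is only a user-space approximation of the behaviour described above:
  it walks a NULL-terminated vector and gives up with <code>-E2BIG</code> once
  more than <code>max</code> entries are seen.  The
  name <code>count_strings</code> and the <code>MODEL_</code> prefix are
  hypothetical, and the memory fault case is left out because a user-space
  program has no equivalent of it.

```c
#include <errno.h>
#include <stddef.h>

/* Same value as the kernel's MAX_ARG_STRINGS: a little over two billion. */
#define MODEL_MAX_ARG_STRINGS 0x7FFFFFFF

/* Count the entries of a NULL-terminated vector of strings.  Returns the
   number of entries, or -E2BIG if there are more than `max` of them. */
int count_strings(const char *const *vec, int max)
{
    int n = 0;

    if (vec == NULL)
        return 0;
    while (vec[n] != NULL) {
        if (n >= max)
            return -E2BIG;
        n++;
    }
    return n;
}
```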
 158
 159
 160 <h3>Limit calculation</h3>
 161
 162 <p><code>bprm_stack_limits</code> is where the actual calculation happens.  The
 163   function determines the limit and stores it in the <code>bprm</code>
 164   structure.  It’s defined as follows:
 165
 166 <pre>
 167 static int bprm_stack_limits(struct linux_binprm *bprm)
 168 {
 169         unsigned long limit, ptr_size;
 170
 171         limit = _STK_LIM / 4 * 3;
 172         limit = min(limit, bprm->rlim_stack.rlim_cur / 4);
 173         limit = max_t(unsigned long, limit, ARG_MAX);
 174
 175         ptr_size = (bprm->argc + bprm->envc) * sizeof(void *);
 176         if (limit &lt;= ptr_size)
 177                 return -E2BIG;
 178         limit -= ptr_size;
 179
 180         bprm->argmin = bprm->p - limit;
 181         return 0;
 182 }
 183 </pre>
 184
 185 <p><code>_STK_LIM</code> is the default stack size limit and equals 8 MiB.  The
 186   first expression in the function is what introduces the upper bound of 6 MiB
 187   for arguments.  It’s worth noting that it’s a relatively new restriction
 188   introduced in Linux 4.13 (and later back-ported to previous releases).  Why
 189   it’s there might be a story for another time.
 190
 191 <p>The second expression in the function is what implements the ‘quarter of the
 192   stack size’ rule.  This is what could be called the ‘normal’ case and it is
 193   the one most typical of common desktop and server configurations.  With the
 194   default maximum stack size limit being 8 MiB, the default limit for executable
 195   arguments ends up being 2 MiB.
 196
 197 <p>The third expression sets the limit to be no less
 198   than <code>ARG_MAX</code>.  This gets a bit confusing.  <code>ARG_MAX</code> is
 199   supposed to be a dynamic value and here we see a constant of the same name.
 200   As is often the case, the explanation lies in the past.  Historically the
 201   value was constant and defined as a macro in kernel headers.  Eventually,
 202   a more dynamic approach was introduced but the definition of the macro stuck.
 203   To maintain backwards compatibility, the dynamic calculation kept the old
 204   static value (128 KiB) as a lower bound.
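
<p>The three expressions can be replayed in user-space to see which bound wins
  for a given stack limit.  This is a sketch rather than kernel code:
  <code>arg_size_limit</code> and the <code>MODEL_</code> constants are names
  invented here, with values taken from the text (8 MiB default stack limit,
  128 KiB legacy <code>ARG_MAX</code>).

```c
#define MODEL_STK_LIM (8UL * 1024UL * 1024UL) /* default stack limit, 8 MiB */
#define MODEL_ARG_MAX (128UL * 1024UL)        /* legacy constant, 128 KiB */

/* Replay the first three statements of bprm_stack_limits for a given
   soft stack limit (in bytes). */
unsigned long arg_size_limit(unsigned long stack_rlim_cur)
{
    unsigned long limit = MODEL_STK_LIM / 4 * 3;  /* upper bound: 6 MiB */

    if (stack_rlim_cur / 4 < limit)               /* quarter of the stack */
        limit = stack_rlim_cur / 4;
    if (limit < MODEL_ARG_MAX)                    /* lower bound: 128 KiB */
        limit = MODEL_ARG_MAX;
    return limit;
}
```

<p>An 8 MiB stack limit gives 2 MiB, a 64 MiB one is capped at 6 MiB and
  a 64 KiB one is raised to 128 KiB.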
 205
 206 <p>The last adjustment in the function is to reserve space for
 207   the <code>argv</code> and <code>envp</code> arrays.  If the limit cannot
 208   accommodate them the function returns an error; otherwise the limit is reduced
 209   by the necessary space.  This is where we can see that the limit of two
 210   billion arguments and environment variables (imposed by the <code>count</code>
 211   function called in <code>do_execveat_common</code>) can never be reached.
 212   With a 6 MiB upper bound for the limit, the most one could hope for is 1.25
 213   million arguments and that’s only on a 32-bit system with all strings empty.
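
<p>The 1.25 million figure follows from simple arithmetic: an empty string costs
  one pointer plus one NUL byte, so with 4-byte pointers each one takes 5 bytes
  out of the 6 MiB.  The sketch below spells that out.  Both function names are
  hypothetical, and the model ignores the executable path, which is copied too
  and takes a few bytes of its own.

```c
#include <errno.h>

/* Bytes left for string data once the argv and envp pointers are
   reserved, or -E2BIG if the pointers alone use up the whole limit. */
long string_space(unsigned long limit, unsigned long nstrings,
                  unsigned long ptr_size)
{
    unsigned long long ptr_bytes = (unsigned long long)nstrings * ptr_size;

    if (limit <= ptr_bytes)
        return -E2BIG;
    return (long)(limit - ptr_bytes);
}

/* Largest number of empty strings that fit in `limit` bytes when each
   one costs a pointer plus a terminating NUL. */
unsigned long max_empty_strings(unsigned long limit, unsigned long ptr_size)
{
    return limit / (ptr_size + 1);
}
```

<p>For a 6 MiB limit, <code>max_empty_strings(6291456, 4)</code> evaluates to
  1258291, and with 8-byte pointers the result drops to 699050.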
 214
 215 <p>The calculated limit is finally stored in the <code>argmin</code> field of
 216   the <code>bprm</code> structure.  It specifies the lowest address at which
 217   arguments can still be stored and the value will be checked later on when
 218   the program’s executable path, environment variables and command line arguments
 219   are copied.  Recall that the stack grows downward, which is why the field
 220   specifies the minimum and why it’s calculated by subtracting the argument size
 221   limit from the current top of the stack (specified by <code>bprm->p</code>).
 222
 223
 224 <h3>Copying strings</h3>
 225
 226 <p>Eventually, <code>do_execveat_common</code> checks the lengths of the strings
 227   while copying them to the new program’s memory.  First, the path to the
 228   program’s executable is transferred with the help of
 229   the <code>copy_string_kernel</code> function, which is defined as follows:
 230
 231 <pre>
 232 int copy_string_kernel(const char *arg, struct linux_binprm *bprm)
 233 {
 234         int len = strnlen(arg, MAX_ARG_STRLEN) + 1 <i>/* terminating NUL */</i>;
 235         unsigned long pos = bprm->p;
 236
 237         if (len == 0)
 238                 return -EFAULT;
 239         if (!valid_arg_len(bprm, len))
 240                 return -E2BIG;
 241
 242         arg += len;
 243         bprm->p -= len;
 244         if (IS_ENABLED(CONFIG_MMU) &amp;&amp; bprm->p &lt; bprm->argmin)
 245                 return -E2BIG;
 246
 247         <i>/* [… copy the string …] */</i>
 248         <i>/* [… analogous to memcpy(bprm->p, arg, len); …] */</i>
 249
 250         return 0;
 251 }
 252 </pre>
 253
 254 <p>Firstly, <code>strnlen</code> paired with a call to <code>valid_arg_len</code>
 255   checks whether the string exceeds <code>MAX_ARG_STRLEN</code> bytes (or
 256   128 KiB).  <code>valid_arg_len</code> is a trivial inline function whose body
 257   simply states <code>return len &lt;= MAX_ARG_STRLEN;</code>.  If the size of
 258   the string exceeds the limit, the argument list is deemed too long and the
 259   function returns an error.
 260
 261 <p>Then, the function checks if there’s enough space on the stack to fit the
 262   string.  This is done by moving the stack pointer downwards
 263   (i.e. subtracting <code>len</code> from the <code>bprm->p</code> field) to reserve
 264   memory for the argument and checking whether the new position of the edge of
 265   the stack crossed the limit (by checking if <code>bprm->p &lt;
 266   bprm->argmin</code>).  If so, the argument list is too long.  Otherwise the
 267   argument is copied onto the stack.
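
<p>The two checks can be mimicked with plain integers standing in for stack
  addresses.  What follows is a rough model, not the kernel routine: the
  <code>stack_model</code> structure and <code>reserve_string</code> are
  invented for this example, nothing is actually copied, and the 128 KiB
  constant assumes 4 KiB pages.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* 128 KiB, matching the MAX_ARG_STRLEN value quoted in the text. */
#define MODEL_MAX_ARG_STRLEN (32UL * 4096UL)

struct stack_model {
    unsigned long p;      /* current edge of the stack (moves down) */
    unsigned long argmin; /* lowest address the strings may reach */
};

/* Reserve room for one string including its terminating NUL.  Returns 0
   on success and -E2BIG if the string alone is too long or if it would
   push the edge of the stack below argmin. */
int reserve_string(struct stack_model *s, const char *arg)
{
    size_t len = strlen(arg) + 1;

    if (len > MODEL_MAX_ARG_STRLEN)  /* per-string limit */
        return -E2BIG;

    s->p -= len;                     /* move the edge downwards first */
    if (s->p < s->argmin)            /* then test against the limit */
        return -E2BIG;
    return 0;
}
```

<p>Like the kernel function, the model moves the pointer before testing it, so
  after a failure <code>p</code> is no longer meaningful.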
 268
 269 <p>The <code>copy_strings</code> function, which <code>do_execveat_common</code>
 270   calls to transfer environment variables and command line arguments, is
 271   entirely analogous.  The two differences are that i) the source data lives in
 272   user-space and ii) the function operates in a loop copying a sequence of
 273   strings.
 274
 275 <pre>
 276 static int copy_strings(int argc, struct user_arg_ptr argv,
 277                         struct linux_binprm *bprm)
 278 {
 279         <i>/* […] */</i>
 280         int ret;
 281
 282         while (argc-- > 0) {
 283                 const char __user *str;
 284                 int len;
 285                 unsigned long pos;
 286
 287                 ret = -EFAULT;
 288                 str = get_user_arg_ptr(argv, argc);
 289                 if (IS_ERR(str))
 290                         goto out;
 291
 292                 len = strnlen_user(str, MAX_ARG_STRLEN);
 293                 if (!len)
 294                         goto out;
 295
 296                 ret = -E2BIG;
 297                 if (!valid_arg_len(bprm, len))
 298                         goto out;
 299
 300                 pos = bprm->p;
 301                 str += len;
 302                 bprm->p -= len;
 303                 if (bprm->p &lt; bprm->argmin)
 304                         goto out;
 305
 306                 <i>/* [… copy the string …] */</i>
 307                 <i>/* [… analogous to memcpy(bprm->p, str, len); …] */</i>
 308         }
 309         ret = 0;
 310 out:
 311         <i>/* […] */</i>
 312         return ret;
 313 }
 314 </pre>
 315
 316 <p>Having to read from user-space complicates the function, though much of that
 317   complexity has been hidden from the listing above in the elided code.  The
 318   visible parts are calls to <code>get_user_arg_ptr</code>
 319   and <code>strnlen_user</code> instead of <code>strnlen</code>.
 320
 321 <p>The parts that interest us remain the same: the <code>valid_arg_len</code>
 322   call and the <code>bprm->p &lt; bprm->argmin</code> comparison.
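
<p>Putting the pieces together, a program could estimate up front whether
  a given <code>execve</code> call would pass these checks.  The sketch below
  adds up the same costs the kernel code accounts for: one pointer per argument
  and environment variable, every string with its NUL and the executable path.
  It is an approximation under several assumptions: <code>exec_would_fit</code>
  and <code>tally</code> are hypothetical names, <code>limit</code> is supplied
  by the caller, pages are 4 KiB, and the pointer size of the calling program is
  assumed to match the one the kernel uses.

```c
#include <stddef.h>
#include <string.h>

/* 128 KiB, matching the MAX_ARG_STRLEN value quoted in the text. */
#define MODEL_MAX_ARG_STRLEN (32UL * 4096UL)

/* Add the entries of a NULL-terminated vector to the running totals.
   Returns -1 if any single string exceeds the per-string limit. */
static int tally(char *const *vec, size_t *count, size_t *bytes)
{
    if (vec == NULL)
        return 0;
    for (; *vec != NULL; vec++) {
        size_t len = strlen(*vec) + 1;

        if (len > MODEL_MAX_ARG_STRLEN)
            return -1;
        *count += 1;
        *bytes += len;
    }
    return 0;
}

/* Returns 1 if path, argv and envp fit within `limit` bytes once the
   pointer arrays are accounted for, and 0 otherwise. */
int exec_would_fit(const char *path, char *const *argv, char *const *envp,
                   size_t limit)
{
    size_t count = 0;
    size_t bytes = strlen(path) + 1;
    size_t ptr_bytes;

    if (bytes > MODEL_MAX_ARG_STRLEN)
        return 0;
    if (tally(argv, &count, &bytes) != 0 || tally(envp, &count, &bytes) != 0)
        return 0;

    ptr_bytes = count * sizeof(void *);
    if (limit <= ptr_bytes)
        return 0;
    return bytes <= limit - ptr_bytes;
}
```

<p>The answer is only a hint; the kernel has the final say and the call may
  still fail with <code>E2BIG</code>.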
 323
 324
 325 <h2>Conclusion</h2>
 326
 327 <p>This concludes the investigation.  In the previous article we saw how the
 328   argument length limit affects user-space; here we looked at the source code of
 329   glibc and the kernel to confirm our previous findings.  There are still a few
 330   minor mysteries (such as why the 6 MiB limit exists or what happens if the
 331   maximum stack size is less than 128 KiB) which I may tackle at another time.
 332
 333 <p>It remains important to remember that our findings are true for Linux only.
 334   Other kernels will set the limit differently and count different things
 335   towards it.  POSIX leaves the details purposefully vague.  As a result,
 336   a portable application may struggle to interpret the limit; it should not only
 337   take the value of <code>ARG_MAX</code> with a grain of salt but ideally also
 338   recover from an <code>E2BIG</code> error by reducing the number of arguments.
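
<p>One way to reduce the number of arguments is to split them into batches that
  fit a byte budget.  The helper below is a minimal sketch of that idea, not
  a description of how any particular tool does it: <code>batch_size</code> is
  a hypothetical name, the budget is chosen by the caller (comfortably
  below <code>ARG_MAX</code> to leave room for the environment and the fixed
  arguments), and each string is charged its bytes, a NUL and one pointer.

```c
#include <stddef.h>
#include <string.h>

/* Return how many leading entries of args[0..n) fit into `budget` bytes
   when each entry costs its length, a terminating NUL and one pointer. */
size_t batch_size(char *const *args, size_t n, size_t budget)
{
    size_t used = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        size_t cost = strlen(args[i]) + 1 + sizeof(char *);

        if (cost > budget - used)  /* next entry would not fit */
            break;
        used += cost;
    }
    return i;
}
```

<p>A caller would run the command on the first <code>batch_size(…)</code>
  entries, advance past them and repeat.  If the command still fails
  with <code>E2BIG</code>, it can retry with a smaller budget.  A result of zero
  for a non-empty list means even the first entry exceeds the budget.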
 339
 340 <p>Fortunately, UNIX-like systems provide a simple solution in the form
 341   of <code>xargs</code> and <code>find … -exec … +</code> commands.  Those
 342   should be much easier to use and sufficient for most cases.  They will
 343   typically know how to deal with the command’s argument size limit.
 344
 345 <p>Whatever the case may be, I hope this article has been informative and
 346   provided further understanding of the kernel and its interaction with
 347   user-space.