use bulk copy in pipe mode
while testing the new pipe mode, it turned out that reading, piping,
and writing line-by-line incurs a huge syscall overhead.
for example, just echoing the input this way is 500x slower than
busybox cat.
that's probably not a big problem with tasks that take considerable
time themselves, but if you want to parallelize simple tasks this
overhead will dominate the runtime.
what we do now is read a big chunk of stdin (16KB) into a buffer
that's page-aligned because it's allocated with mmap(), and in case
pipe mode is used we just pass the biggest possible chunk to one task.
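roughly, the pipe-mode fast path boils down to the following
standalone sketch (not the actual jobflow code; the fds and the names
are made up, only the 16KB chunk size comes from above):

    #include <sys/mman.h>
    #include <unistd.h>

    #define CHUNK_SIZE (16*1024)

    int main(void) {
        /* mmap() hands back a page-aligned buffer */
        char *buf = mmap(0, CHUNK_SIZE, PROT_READ|PROT_WRITE,
                         MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) return 1;
        ssize_t n;
        while ((n = read(0, buf, CHUNK_SIZE)) > 0) {
            /* hand the whole chunk over in one go; stdout stands in
               for the task's stdin pipe here */
            ssize_t off = 0;
            while (off < n) {
                ssize_t w = write(1, buf + off, n - off);
                if (w < 0) return 1;
                off += w;
            }
        }
        munmap(buf, CHUNK_SIZE);
        return 0;
    }

this is what removes the overhead: a couple of syscalls per 16KB
instead of a couple per line.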
in non-pipe mode (that means command line argument permutation is
used), we're forced to process line-by-line anyway, so in that case
we pass on line after line from our readbuf until no more newlines
can be found. the unprocessed rest of the buffer is then copied to a
memory area just before the read buffer, and another chunk is fetched.
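the leftover handling could look roughly like this; again just a
sketch, the fixed spare area in front of the read buffer and all
names are illustrative, not the real implementation:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define CHUNK_SIZE (16*1024)
    #define SPARE      (4*1024)  /* room for an unfinished line */

    static char space[SPARE + CHUNK_SIZE];
    static char *const readbuf = space + SPARE;  /* read()s always land here */

    static void pass_line(const char *line, size_t len) {
        /* stand-in for handing one line to a task: just echo it */
        fwrite(line, 1, len, stdout);
    }

    int main(void) {
        size_t leftover = 0;
        ssize_t n;
        while ((n = read(0, readbuf, CHUNK_SIZE)) > 0) {
            char *start = readbuf - leftover;  /* leftover sits right before readbuf */
            size_t avail = leftover + (size_t)n;
            char *p = start, *nl;
            while ((nl = memchr(p, '\n', avail - (p - start)))) {
                pass_line(p, (nl - p) + 1);    /* pass one complete line on */
                p = nl + 1;
            }
            /* no newline in the rest: stash it right in front of readbuf
               so the next chunk continues the unfinished line */
            leftover = avail - (p - start);
            if (leftover > SPARE) return 1;    /* overlong line; not handled in this toy */
            if (leftover) memmove(readbuf - leftover, p, leftover);
        }
        return 0;
    }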
passing a 16KB chunk to a single task is not optimal for all use
cases. for example, the total input might be less than that while
you're using long-running tasks that you want to distribute evenly
among several cores. in that case all input would be passed to the
first job while all other jobs sit idle.
thus the plan is to add an option to explicitly enable this mode.
right now, if pipe mode is used, all options that rely on the line
number counter, such as -skip, are broken as well, since it is
currently unknown how many lines are passed on.