description: none
repository URL: https://github.com/rofl0r/jobflow.git
owner: retnyg@gmx.net
last change: Sun, 19 Dec 2021 18:44:04 +0000
last refresh: Thu, 21 Nov 2024 09:05:06 +0000
README.md

jobflow by rofl0r

this program is inspired by GNU parallel, but tries to keep overhead low and to follow the UNIX philosophy of doing one thing well.

how it works

basically, it reads stdin and launches one process per line. the line itself can be passed to the started program as an argument (argv). this allows for easy parallelization of standard unix tasks.
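
as a rough mental model, the per-line dispatch can be sketched in plain shell. this is only a sequential sketch under stated assumptions, not how jobflow is implemented: it runs one command at a time, while jobflow keeps several processes running in parallel. the name run_per_line is hypothetical.

```sh
# hypothetical helper, not part of jobflow: run the given command once
# per line of stdin, appending the line as the last argument.
run_per_line() {
    while IFS= read -r line; do
        # keep the child from consuming the remaining input lines
        "$@" "$line" </dev/null
    done
}

printf '%s\n' a b c | run_per_line echo item
# prints:
# item a
# item b
# item c
```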

it is possible to save the number of the line currently being processed, so that a run that was killed can be continued later.

example usage

you have a list of things, and a tool that processes a single thing.

cat things.list | jobflow -threads=8 -exec ./mytask {}

seq 100 | jobflow -threads=100 -exec echo {}

cat urls.txt | jobflow -threads=32 -exec wget {}

find . -name '*.bmp' | jobflow -threads=8 -exec bmp2jpeg {.}.bmp {.}.jpg

run jobflow without arguments to see a list of possible command line options, and argument permutations.

starting from version 1.3.1, jobflow can also be used to extract a range of lines, e.g.:

seq 100 | jobflow -skip 10 -count 10  # print lines 11 to 20
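
for comparison, the same range can be selected with standard tools. the sketch below assumes, as the example above suggests, that -skip N drops the first N lines and -count N keeps the next N. line_range is a hypothetical helper and not part of jobflow.

```sh
# hypothetical helper: print `count` lines of stdin after skipping
# the first `skip` lines.
line_range() {
    awk -v skip="$1" -v count="$2" 'NR > skip && NR <= skip + count'
}

seq 100 | line_range 10 10  # prints lines 11 to 20
```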

Comparison with GNU parallel

GNU parallel is written in perl, which has the following disadvantages:

jobflow OTOH is written in C, which has numerous advantages.

apart from the chosen language and related performance differences, the following other differences exist between GNU parallel and jobflow:

available command line options

-skip N -threads N -resume -statefile=/tmp/state -delayedflush
-delayedspinup N -buffered -joinoutput -limits mem=16M,cpu=10
-eof=XXX
-exec ./mycommand {}

-skip N

N=number of entries to skip

-count N

N=number of lines to process (after skipping)

-threads N (alternative: -j N)

N=number of parallel processes to spawn

-resume

resume from last jobnumber stored in statefile

-eof XXX

use XXX as the EOF marker on stdin
if the marker is encountered, behave as if stdin was closed
not compatible with pipe/bulk mode

-statefile XXX

XXX=filename
saves last launched jobnumber into a file

-delayedflush

only write to statefile whenever all processes are busy,
and at program end

-delayedspinup N

N=maximum number of milliseconds to wait when spinning up a fresh set
of processes.
a random value between 0 and the chosen amount is used to delay initial
spinup.
this can be handy to circumvent an I/O lockdown caused by a burst of
activity on program startup.

-buffered

store the stdout and stderr of launched processes into a temporary file
which will be printed after a process has finished.
this prevents mixing up of output of different processes.
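
the idea behind buffering can be sketched for a single command in shell. this is an illustration only, not jobflow's code: run_buffered is a hypothetical helper, and it assumes a mktemp utility is available.

```sh
# hypothetical helper: capture stdout and stderr of one command in
# temporary files and replay them only after the command has exited.
run_buffered() {
    out=$(mktemp) || return 1
    err=$(mktemp) || { rm -f "$out"; return 1; }
    status=0
    "$@" >"$out" 2>"$err" || status=$?
    cat "$out"          # replay captured stdout
    cat "$err" >&2      # replay captured stderr
    rm -f "$out" "$err"
    return "$status"    # pass on the command's exit status
}

run_buffered sh -c 'echo to-stdout; echo to-stderr >&2'
```

because each command's output is replayed in one piece after it exits, output from processes running at the same time does not get interleaved.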

-joinoutput

if -buffered is given, write both stdout and stderr into the same file.
this preserves the chronological order of the output, and the combined
output will only be printed to stdout.

-bulk N

do bulk copies with a buffer of N bytes. only usable in pipe mode.
this passes (almost) the entire buffer to the next scheduled job.
the passed buffer will be truncated to the last line break boundary,
so jobs always get entire lines to work with.
this option is useful when you have huge input files and relatively short
task runtimes. by using it, syscall overhead can be reduced to a minimum.
N must be a multiple of 4KB. the suffixes G/M/K are detected.
actual memory allocation will be twice the amount passed.
note that pipe buffer size is limited to 64K on linux, so anything higher
than that probably doesn't make sense.

-limits [mem=N,cpu=N,stack=N,fsize=N,nofiles=N]

sets the rlimits of the newly created processes.
see "man setrlimit" for an explanation. the suffixes G/M/K are detected.
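
a similar effect for a single command can be had from the shell's ulimit builtin, which also sets rlimits. the sketch below is not how jobflow does it. with_cpu_limit is a hypothetical helper, and it assumes a shell whose ulimit accepts -t (CPU seconds), as bash and dash do; POSIX itself only specifies -f.

```sh
# hypothetical helper: run a command with a CPU-time limit in seconds.
# the limit is set in a subshell, so the calling shell keeps its own.
with_cpu_limit() {
    (
        ulimit -t "$1" || exit 1
        shift
        exec "$@"
    )
}

with_cpu_limit 10 sh -c 'ulimit -t'  # prints 10
```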

-exec command with args

everything past -exec is treated as the command to execute on each line of
stdin received. the line can be passed as an argument using {}.
{.} passes everything before the last dot in a line as an argument.
it is possible to use multiple substitutions inside a single argument,
but currently only of one type.
if -exec is omitted, input will merely be dumped to stdout (like cat).
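
the {.} substitution matches what shell parameter expansion does when it removes the shortest suffix that starts with a dot. the sketch below shows that rule. strip_last_ext is a hypothetical name, and how jobflow treats a line that contains no dot is not stated here, so that case is only an assumption of the sketch (the shell leaves such a line unchanged).

```sh
# hypothetical helper: print everything before the last dot of the
# argument, like {.} is described above.
strip_last_ext() {
    printf '%s\n' "${1%.*}"
}

strip_last_ext photos/cat.bmp   # prints photos/cat
strip_last_ext archive.tar.gz   # prints archive.tar
```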

BUILD

just run make.

you may override variables used in the Makefile and set optimization CFLAGS and similar things using a file called config.mak, e.g.:

echo "CFLAGS=-O2 -g" > config.mak
make -j2

shortlog
2021-12-19 rofl0r: add a test for -limit functionality [master]
2021-12-19 rofl0r: propagate error exit status of called process
2021-12-19 rofl0r: README: mention new -count option
2021-12-19 rofl0r: relicense as MIT
2021-12-16 rofl0r: bump version to 1.3.1 [v1.3.1]
2021-12-16 rofl0r: add new -count option
2021-12-16 rofl0r: fix new command line parser for -j
2021-12-16 rofl0r: add new linenumber substitution feature
2021-12-16 rofl0r: fix typo in --help output
2020-10-24 rofl0r: add some argument permutation tests
2020-10-24 rofl0r: README: fix markdown
2020-10-24 rofl0r: bump version to 1.3.0 [v1.3.0]
2020-10-24 rofl0r: Makefile: add check target
2020-10-24 rofl0r: fix test.sh
2020-10-24 rofl0r: make all funcs static
2020-10-24 rofl0r: use parse_human_number for all command line args with...
...
tags
2 years ago v1.3.1
4 years ago v1.3.0
4 years ago v1.2.4
5 years ago v1.2.3
6 years ago v1.2.2
7 years ago v1.2.1
7 years ago v1.2.0
8 years ago v1.1.1
10 years ago v1.0.0
heads
2 years ago master