   1 Version 5.1.2 -- 2007-12-26
   2
   3 ================================================================
   4 To use UnixBench:
   5
   6 1. UnixBench from version 5.1 on has both system and graphics tests.
   7    If you want to use the graphic tests, edit the Makefile and make sure
   8    that the line "GRAPHIC_TESTS = defined" is not commented out; then check
   9    that the "GL_LIBS" definition is OK for your system.  Also make sure
  10    that the "x11perf" command is on your search path.
  11
  12    If you don't want the graphics tests, then comment out the
  13    "GRAPHIC_TESTS = defined" line.  Note: comment it out, don't
  14    set it to anything.
  15
  16 2. Do "make".
  17
  18 3. Do "Run" to run the system test; "Run graphics" to run the graphics
  19    tests; "Run gindex" to run both.
  20
  21 You will need perl, as Run is written in perl.
  22
  23 For more information on using the tests, read "USAGE".
  24
  25 For information on adding tests into the benchmark, see "WRITING_TESTS".
  26
  27
  28 ===================== RELEASE NOTES =====================================
  29
  30 ========================  Dec 07 ==========================
  31
  32 v5.1.2
  33
  34 One big fix: if unixbench is installed in a directory whose pathname contains
  35 a space, it should now run (previously it failed).
  36
  37 To avoid possible clashes, the environment variables unixbench uses are now
  38 prefixed with "UB_".  These are all optional, and for most people will be
  39 completely unnecessary, but if you want you can set these:
  40
  41     UB_BINDIR      Directory where the test programs live.
  42     UB_TMPDIR      Temp directory, for temp files.
  43     UB_RESULTDIR   Directory to put results in.
  44     UB_TESTDIR     Directory where the tests are executed.
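
As a rough illustration of the variables above, the following Python sketch
builds an environment with UB_ overrides applied. Only the four UB_ variable
names come from these notes; the helper name ub_environment and the directory
paths are made up for the example.

```python
import os

def ub_environment(overrides, base=None):
    """Return a copy of the environment with UB_-prefixed overrides applied."""
    allowed = {"UB_BINDIR", "UB_TMPDIR", "UB_RESULTDIR", "UB_TESTDIR"}
    env = dict(os.environ if base is None else base)
    for name, value in overrides.items():
        if name not in allowed:
            raise ValueError("not a UnixBench variable: " + name)
        env[name] = value
    return env

# Hypothetical locations; adjust to your own layout.
env = ub_environment({
    "UB_TMPDIR": "/var/tmp/unixbench",
    "UB_RESULTDIR": "/var/tmp/unixbench/results",
})
# To launch the benchmark from the unixbench directory with these settings:
#   import subprocess; subprocess.run(["./Run"], env=env, check=True)
```

Variables that are left unset keep whatever defaults Run chooses.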
  45
  46 And a couple of tiny fixes:
  47 * In pgms/tst.sh, changed "sort -n +1" to "sort -n -k 1"
  48 * In Makefile, made it clearer that GRAPHIC_TESTS should be commented
  49   out (not set to 0) to disable graphics
  50 Thanks to nordi for pointing these out.
  51
  52
  53 Ian Smith, December 26, 2007
  54 johantheghost at yahoo period com
  55
  56
  57 ========================  Oct 07 ==========================
  58
  59 v5.1.1
  60
  61 It turns out that the setting of LANG is crucial to the results.  This
  62 explains why people in different regions were seeing odd results, and also
  63 why runlevel 1 produced odd results -- runlevel 1 doesn't set LANG, and
  64 hence reverts to ASCII, whereas most people use a UTF-8 encoding, which is
  65 much slower in some tests (e.g. shell tests).
  66
  67 So now we manually set LANG to "en_US.utf8", which is configured with the
  68 variable "$language".  Don't change this if you want to share your results.
  69 We also report the language settings in use.
  70
  71 See "The Language Setting" in USAGE for more info.  Thanks to nordi for
  72 pointing out the LANG issue.
  73
  74 I also added the "grep" and "sysexec" tests.  These are non-index tests,
  75 and "grep" uses the system's grep, so it's not much use for comparing
  76 different systems.  But some folks on the OpenSuSE list have been finding
  77 these useful.  They aren't in any of the main test groups; do "Run grep
  78 sysexec" to run them.
  79
  80 Index Changes
  81 -------------
  82
  83 The setting of LANG will affect consistency with systems where this is
  84 not the default value.  However, it should produce more consistent results
  85 in future.
  86
  87
  88 Ian Smith, October 15, 2007
  89 johantheghost at yahoo period com
  90
  91
  92 ========================  Oct 07 ==========================
  93
  94 v5.1
  95
  96 The major new feature in this version is the addition of graphical
  97 benchmarks.  Since these may not compile on all systems, you can enable/
  98 disable them with the GRAPHIC_TESTS variable in the Makefile.
  99
 100 As before, each test is run for 3 or 10 iterations.  However, we now discard
 101 the worst 1/3 of the scores before averaging the remainder.  The logic is
 102 that a glitch in the system (background process waking up, for example) may
 103 make one or two runs go slow, so let's discard those.  Hopefully this will
 104 produce more consistent and repeatable results.  Check the log file
 105 for a test run to see the discarded scores.
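
The discard rule can be sketched in Python as follows. This is an
illustration, not the code in Run: it assumes that higher scores are better,
that the number of scores discarded is the iteration count divided by 3 and
rounded down (1 of 3, 3 of 10), and that the remainder is combined with a
plain arithmetic mean. The function name is hypothetical.

```python
def average_best_scores(scores):
    """Drop the worst third of the scores (rounded down) and average the rest.

    Returns (mean of the kept scores, list of discarded scores).
    """
    if not scores:
        raise ValueError("need at least one score")
    discard = len(scores) // 3
    ordered = sorted(scores)          # lowest (worst) scores come first
    kept = ordered[discard:]
    return sum(kept) / len(kept), ordered[:discard]

# One of three runs was slowed by a background process.
mean, dropped = average_best_scores([98.0, 100.0, 61.0])
# mean is 99.0 and dropped is [61.0]
```

Without the discard step the same three runs would average about 86.3, which
is why results under this scheme tend to come out slightly higher.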
 106
 107 Made the tests compile and run on x86-64/Linux (fixed an execl bug passing
 108 int instead of pointer).
 109
 110 Also fixed some general bugs.
 111
 112 Thanks to Stefan Esser for help and testing / bug reporting.
 113
 114 Index Changes
 115 -------------
 116
 117 The tests are now divided into categories, and each category generates
 118 its own index.  This keeps the graphics test results separate from
 119 the system tests.
 120
 121 The "graphics" test and corresponding index are new.
 122
 123 The "discard the worst scores" strategy should produce slightly higher
 124 test scores, but at least they should (hopefully!) be more consistent.
 125 The scores should not be higher than the best scores you would have got
 126 with 5.0, so this should not be a huge consistency issue.
 127
 128 Ian Smith, October 11, 2007
 129 johantheghost at yahoo period com
 130
 131
 132 ========================  Sep 07 ==========================
 133
 134 v5.0
 135
 136 All the work I've done on this release is Linux-based, because that's
 137 the only Unix I have access to.  I've tried to make it more OS-agnostic
 138 if anything; for example, it no longer has to figure out the format reported
 139 by /usr/bin/time.  However, it's possible that portability has been damaged.
 140 If anyone wants to fix this, please feel free to mail me patches.
 141
 142 In particular, the analysis of the system's CPUs is done via /proc/cpuinfo.
 143 For systems which don't have this, please make appropriate changes in
 144 getCpuInfo() and getSystemInfo().
 145
 146 The big change has been to make the tests multi-CPU aware.  See the
 147 "Multiple CPUs" section in "USAGE" for details.  Other changes:
 148
 149 * Completely rewrote Run in Perl; drastically simplified the way data is
 150   processed.  The confusing system of interlocking shell and awk scripts is
 151   now just one script.  Various intermediate files used to store and process
 152   results are now replaced by Perl data structures internal to the script.
 153
 154 * Removed from the index runs file system read and write tests which were
 155   ignored for the index and wasted about 10 minutes per run (see fstime.c).
 156   The read and write tests can now be selected individually.  Made fstime.c
 157   take parameters, so we no longer need to build 3 versions of it.
 158
 159 * Made the output file names unique; they are built from
 160   hostname-date-sequence.
 161
 162 * Worked on result reporting, error handling, and logging.  See TESTS.
 163   We now generate both text and HTML reports.
 164
 165 * Removed some obsolete files.
 166
 167 Index Changes
 168 -------------
 169
 170 The index is still based on David Niemi's SPARCstation 20-61 (rated at 10.0),
 171 and the intention in the changes I've made has been to keep the tests
 172 unchanged, in order to maintain consistency with old result sets.
 173
 174 However, the following changes have been made to the index:
 175
 176 * The Pipe-based Context Switching test (context1) was being dropped
 177   from the index report in v4.1.0 due to a bug; I've put it back in.
 178
 179 * I've added shell1 to the index, to get a measure of how the shell tests
 180   scale with multiple CPUs (shell8 already exercises all the CPUs, even
 181   in single-copy mode).  I made up the baseline score for this by
 182   extrapolation.
 183
 184 Both of these tests can be dropped, if you wish, by editing the "TEST
 185 SPECIFICATIONS" section of Run.
 186
 187 Ian Smith, September 20, 2007
 188 johantheghost at yahoo period com
 189
 190 ========================  Aug 97 ==========================
 191
 192 v4.1.0
 193
 194 Double precision Whetstone put in place instead of the old "double" benchmark.
 195
 196 Removal of some obsolete files.
 197
 198 "system" suite adds shell8.
 199
 200 perlbench and poll added as "exhibition" (non-index) benchmarks.
 201
 202 Incorporates several suggestions by Andre Derrick Balsa <andrewbalsa@usa.net>
 203
 204 Code cleanups to reduce compiler warnings by David C Niemi <niemi@tux.org>
 205 and Andy Kahn <kahn@zk3.dec.com>; Digital Unix options by Andy Kahn.
 206
 207 ========================  Jun 97 ==========================
 208
 209 v4.0.1
 210
 211 Minor change to fstime.c to fix overflow problems on fast machines.  Counting
 212 is now done in units of 256 (smallest BUFSIZE) and unsigned longs are used,
 213 giving another 23 dB or so of headroom ;^)  Results should be virtually
 214 identical aside from very small rounding errors.
 215
 216 ========================  Dec 95 ==========================
 217
 218 v4.0
 219
 220 Byte no longer seems to have anything to do with this benchmark, and I was
 221 unable to reach any of the original authors; so I have taken it upon myself
 222 to clean it up.
 223
 224 This is version 4.  Major assumptions made in these benchmarks have changed
 225 since they were written, but they are nonetheless popular (particularly for
 226 measuring hardware for Linux).  Some changes made:
 227
 228 -       The biggest change is to put a lot more operating system-oriented
 229         tests into the index.  I experimented for a while with a decibel-like
 230         logarithmic scale, but finally settled on using a geometric mean for
 231         the final index (the individual scores are normalized, and their
 232         logs are averaged; the resulting value is exponentiated).
 233
 234         "George", a certain SPARCstation 20-61 with 128 MB RAM, a SPARC Storage
 235         Array, and Solaris 2.3 is my new baseline; it is rated at 10.0 in each
 236         of the index scores for a final score of 10.0.
 237
 238         Overall I find the geometric averaging is a big improvement for
 239         avoiding the skew that was once possible (e.g. a Pentium-75 which got
 240         40 on the buggy version of fstime, such that fstime accounted for over
 241         half of its total score and hence wildly skewed its average).
 242
 243         I also expect that the new numbers look different enough from the old
 244         ones that no one is too likely to casually mistake them for each other.
 245
 246         I am finding new SPARCs running Solaris 2.4 getting about 15-20, and
 247         my 486 DX2-66 Compaq running Linux 1.3.45 got a 9.1.  It got
 248         understandably poor scores on CPU and FPU benchmarks (a horrible
 249         1.8 on "double" and 1.3 on "fsdisk"); but made up for it by averaging
 250         over 20 on the OS-oriented benchmarks.  The Pentium-75 running
 251         Linux gets about 20 (and it *still* runs Windows 3.1 slowly.  Oh well).
 252
 253 -       It is difficult to get a modern compiler to even consider making
 254         dhry2 without registers, short of turning off *all* optimizations.
 255         This is also not a terribly meaningful test, even if it were possible,
 256         as no one compiles without registers nowadays.  Replaced this benchmark
 257         with dhry2reg in the index, and dropped it out of usage in general as
 258         it is so hard to make a legitimate one.
 259
 260 -       fstime: this had some bugs when compiled on modern systems which return
 261         the number of bytes read/written for read(2)/write(2) calls.  The code
 262         assumed that a negative return code was given for EOF, but most modern
 263         systems return 0 (certainly on SunOS 4, Solaris2, and Linux, which is
 264         what counts for me).  The old code yielded wildly inflated read scores,
 265         would eat up tens of MB of disk space on fast systems, and yielded
 266         copy scores roughly 50% lower than they should have been.
 267
 268         Also, it counted partial blocks *fully*; made it count the proportional
 269         part of the block which was actually finished.
 270
 271         Made bigger and smaller variants of fstime which are designed to beat
 272         up the disk I/O and the buffer cache, respectively.  Adjusted the
 273         sleeps so that they are short for short benchmarks.
 274
 275 -       Instead of 1,2,4, and 8-shell benchmarks, went to 1, 8, and 16 to
 276         give a broader range of information (and to run 1 fewer test).
 277         The only real problem with this is that not many iterations get
 278         done with 16 at a time on slow systems, so there are some significant
 279         rounding errors; 8 is therefore still used for the benchmark.  There is
 280         also the problem that the last (uncompleted) loop is counted as a full
 281         loop, so it is impossible to score below 1.0 lpm (which gave my laptop
 282         a break).  Probably redesigning Shell to do each loop a bit more
 283         quickly (but with less intensity) would be a good idea.
 284
 285         This benchmark appears to be very heavily influenced by the speed
 286         of the loader, by which shell is being used as /bin/sh, and by how
 287         well-compiled some of the common shell utilities like grep, sed, and
 288         sort are.  With a consistent tool set it is also a good indicator of
 289         the bandwidth between main memory and the CPU (e.g. Pentia score about
 290         twice as high as 486es due to their 64-bit bus).  Small, sometimes
 291         broken shells like "ash-linux" do particularly well here, while big,
 292         robust shells like bash do not.
 293
 294 -       "dc" is a somewhat iffy benchmark, because there are two versions of
 295         it floating around, one being small, very fast, and buggy, and one
 296         being more correct but slow.  It was never in the index anyway.
 297
 298 -       Execl is a somewhat troubling benchmark in that it yields much higher
 299         scores if compiled statically.  I frown on this practice because it
 300         distorts the scores away from reflecting how programs are really used
 301         (i.e. dynamically linked).
 302
 303 -       Arithoh is really more an indicator of the compiler quality than of
 304         the computer itself.  For example, GCC 2.7.x with -O2 and a few extra
 305         options optimizes much of it away, resulting in about a 1200% boost
 306         to the score.  Clearly not a good one for the index.
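
The geometric-mean index described in the first item above can be written as
a short Python sketch. Each raw result is divided by the baseline machine's
result and scaled so that the baseline rates 10.0; the logs of those scores
are averaged and the result is exponentiated. The function name, the test
names, and the sample numbers are invented for illustration and are not taken
from any real run.

```python
import math

def index_score(results, baseline, scale=10.0):
    """Geometric mean of per-test scores, where the baseline machine rates `scale`."""
    logs = []
    for test, value in results.items():
        score = scale * value / baseline[test]   # normalize against the baseline
        logs.append(math.log(score))
    return math.exp(sum(logs) / len(logs))

# Made-up numbers: one test runs 4x faster than the baseline, one equals it.
baseline = {"test_a": 100.0, "test_b": 50.0}
results = {"test_a": 400.0, "test_b": 50.0}
print(round(index_score(results, baseline), 1))   # geometric mean of 40 and 10
```

The two per-test scores here are 40 and 10. An arithmetic mean would report
25, while the geometric mean reports 20, so a single outlying test pulls the
final index less.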
 307
 308 I am still a bit unhappy with the variance in some of the benchmarks, most
 309 notably the fstime suite; and with how long it takes to run.  But I think
 310 it gets significantly more reliable results than the older version in less
 311 time.
 312
 313 If anyone has ideas on how to make these benchmarks faster, lower-variance,
 314 or more meaningful; or has nice, new, portable benchmarks to add, don't
 315 hesitate to e-mail me.
 316
 317 David C Niemi <niemi@tux.org>           7 Dec 1995
 318
 319 ========================  May 91 ==========================
 320 This is version 3. This set of programs should be able to determine if
 321 your system is BSD or SysV. (It uses the output format of time (1)
 322 to see.) If you have any problems, contact me (by email,
 323 preferably): ben@bytepb.byte.com
 324
 325 ---
 326
 327 The document doc/bench.doc describes the basic flow of the
 328 benchmark system. The document doc/bench3.doc describes the major
 329 changes in design of this version. As a user of the benchmarks,
 330 you should understand some of the methods that have been
 331 implemented to generate loop counts:
 332
 333 Tests that are compiled C code:
 334   The function wake_me(second, func) is included (from the file
 335 timeit.c). This function uses signal and alarm to set a countdown
 336 for the time requested by the benchmark administration script
 337 (Run). As soon as the clock is started, the test is run with a
 338 counter keeping track of the number of loops that the test makes.
 339 When alarm sends its signal, the loop counter value is sent to stderr
 340 and the program terminates. Since the time resolution, signal
 341 trapping and other factors don't ensure that the test runs for the
 342 precise time that was requested, the test program is also run
 343 from the time (1) command. The real time value returned from time
 344 (1) is what is used in calculating the number of loops per second
 345 (or minute, depending on the test).  As is obvious, there is some
 346 overhead time that is not taken into account, therefore the
 347 number of loops per second is not absolute. The overhead of the
 348 test starting and stopping and the signal and alarm calls is
 349 common to the overhead of real applications. If a program loads
 350 quickly, the number of loops per second increases; a phenomenon
 351 that favors systems that can load programs quickly. (Setting the
 352 sticky bit of the test programs is not considered fair play.)
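
The timing scheme described above can be approximated in Python. This is a
sketch of the idea only, not a translation of timeit.c: the function name
count_loops is hypothetical, it uses signal.setitimer in place of alarm so
that fractional seconds work, and it returns the loop count and elapsed real
time instead of writing the count to stderr and exiting. It works on Unix
only and must be called from the main thread.

```python
import signal
import time

def count_loops(seconds, work):
    """Call work() repeatedly until a SIGALRM timer fires.

    Returns (loops completed, elapsed real time in seconds).
    """
    expired = False

    def on_alarm(signum, frame):
        nonlocal expired
        expired = True            # tell the loop below to stop

    previous = signal.signal(signal.SIGALRM, on_alarm)
    start = time.monotonic()
    signal.setitimer(signal.ITIMER_REAL, seconds)   # start the countdown
    loops = 0
    try:
        while not expired:
            work()
            loops += 1
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)     # cancel any pending timer
        signal.signal(signal.SIGALRM, previous)     # restore the old handler
    return loops, time.monotonic() - start

loops, elapsed = count_loops(0.2, lambda: sum(range(100)))
rate = loops / elapsed   # loops per second, using measured real time
```

As in the description, the rate is computed from the measured elapsed time
rather than from the requested duration, because the signal does not arrive
at exactly the requested moment.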
 353
 354 Tests that use existing UNIX programs or shell scripts:
 355   The concept is the same as that of compiled tests, except the
 356 alarm and signal are contained in a separate compiled program,
 357 looper (source is looper.c). Looper uses an execvp to invoke the
 358 test with its arguments. Here, the overhead includes the
 359 invocation and execution of looper.
 360
 361 --
 362
 363 The index numbers are generated from a baseline file that is in
 364 pgms/index.base. You can put tests that you wish in this file.
 365 All you need to do is take the results/log file from your
 366 baseline machine, edit out the comment and blank lines, and sort
 367 the result (vi/ex command: 1,$!sort). The sort is necessary
 368 because the process of generating the index report uses join (1).
 369 You can regenerate the reports by running "make report."
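
The clean-up step described above (remove comment and blank lines, then sort)
could also be scripted. The Python sketch below assumes that comment lines
start with "#"; check your own log file before relying on that. The function
name and the sample log lines are invented for the example and do not show
the real log format.

```python
def prepare_baseline(log_text, comment_prefix="#"):
    """Drop blank and comment lines from a results log and sort what is left."""
    kept = []
    for line in log_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(comment_prefix):
            continue              # skip blank lines and comments
        kept.append(line)
    return sorted(kept)

# Invented sample; a real log will have a different layout.
sample = "# baseline run\n\nshell8 4.0\ncontext1 1318.5\n"
lines = prepare_baseline(sample)
# lines is ["context1 1318.5", "shell8 4.0"]
```

Python's sorted() orders strings by code point, which matches sort(1) in the
C locale. join(1) expects both of its inputs to be sorted under the same
collation, so sort the other input the same way.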
 370
 371 --
 372
 373 ========================= Jan 90 =============================
 374 Tom Yager has joined the effort here at BYTE; he is responsible
 375 for many refinements in the UNIX benchmarks.
 376
 377 The memory access tests have been deleted from the benchmarks.
 378 The file access tests have been reversed so that the test is run
 379 for a fixed time. The amount of data transferred (written, read,
 380 and copied) is the variable. !WARNING! This test can eat up a
 381 large hunk of disk space.
 382
 383 The initial line of all shell scripts has been changed from the
 384 SCO and XENIX form (:) to the more standard form "#! /bin/sh".
 385 But different systems handle shell switching differently. Check
 386 the documentation on your system and find out how you are
 387 supposed to do it. Or, simpler yet, just run the benchmarks from
 388 the Bourne shell. (You may need to set SHELL=/bin/sh as well.)
 389
 390 The options to Run have not been checked in a while. They may no
 391 longer function. Next time, I'll get back on them. There needs to
 392 be another option added (next time) that halts testing between
 393 each test. !WARNING! Some systems have caches that are not getting flushed
 394 before the next test or iteration is run. This can cause
 395 erroneous values.
 396
 397 ========================= Sept 89 =============================
 398 The database (db) programs now have a tuneable message queue space.
 399 The default set in the Run script is 1024 bytes.
 400 Other major changes are in the format of the times. We now show
 401 Arithmetic and Geometric mean and standard deviation for User
 402 Time, System Time, and Real Time. Generally, in reporting, we
 403 plan on using the Real Time values with the benchmarks run with one
 404 active user (the bench user). Comments and arguments are requested.
 405
 406 contact: BIX bensmith or rick_g