arch/x86/math-emu/README

   1  +---------------------------------------------------------------------------+
   2  |  wm-FPU-emu   an FPU emulator for 80386 and 80486SX microprocessors.      |
   3  |                                                                           |
   4  | Copyright (C) 1992,1993,1994,1995,1996,1997,1999                          |
   5  |                       W. Metzenthen, 22 Parker St, Ormond, Vic 3163,      |
   6  |                       Australia.  E-mail billm@melbpc.org.au              |
   7  |                                                                           |
   8  |    This program is free software; you can redistribute it and/or modify   |
   9  |    it under the terms of the GNU General Public License version 2 as      |
  10  |    published by the Free Software Foundation.                             |
  11  |                                                                           |
  12  |    This program is distributed in the hope that it will be useful,        |
  13  |    but WITHOUT ANY WARRANTY; without even the implied warranty of         |
  14  |    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the          |
  15  |    GNU General Public License for more details.                           |
  16  |                                                                           |
  17  |    You should have received a copy of the GNU General Public License      |
  18  |    along with this program; if not, write to the Free Software            |
  19  |    Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.              |
  20  |                                                                           |
  21  +---------------------------------------------------------------------------+
  22
  23
  24
  25 wm-FPU-emu is an FPU emulator for Linux. It is derived from wm-emu387
  26 which was my 80387 emulator for early versions of djgpp (gcc under
  27 msdos); wm-emu387 was in turn based upon emu387 which was written by
  28 DJ Delorie for djgpp.  The interface to the Linux kernel is based upon
  29 the original Linux math emulator by Linus Torvalds.
  30
  31 My target FPU for wm-FPU-emu is that described in the Intel486
  32 Programmer's Reference Manual (1992 edition). Unfortunately, numerous
  33 facets of the functioning of the FPU are not well covered in the
  34 Reference Manual. The information in the manual has been supplemented
  35 with measurements on real 80486's. Unfortunately, it is simply not
  36 possible to be sure that all of the peculiarities of the 80486 have
  37 been discovered, so there are always likely to be obscure differences
  38 in the detailed behaviour of the emulator and a real 80486.
  39
  40 wm-FPU-emu does not implement all of the behaviour of the 80486 FPU,
  41 but is very close.  See "Limitations" later in this file for a list of
  42 some differences.
  43
  44 Please report bugs, etc to me at:
  45        billm@melbpc.org.au
  46 or     b.metzenthen@medoto.unimelb.edu.au
  47
  48 For more information on the emulator and on floating point topics, see
  49 my web pages, currently at  http://www.suburbia.net/~billm/
  50
  51
  52 --Bill Metzenthen
  53   December 1999
  54
  55
  56 ----------------------- Internals of wm-FPU-emu -----------------------
  57
  58 Numeric algorithms:
  59 (1) Add, subtract, and multiply. Nothing remarkable in these.
  60 (2) Divide has been tuned to get reasonable performance. The algorithm
  61     is not the obvious one which most people seem to use, but is designed
  62     to take advantage of the characteristics of the 80386. I expect that
  63     it has been invented many times before I discovered it, but I have not
  64     seen it. It is based upon one of those ideas which one carries around
  65     for years without ever bothering to check it out.
  66 (3) The sqrt function has been tuned to get good performance. It is based
  67     upon Newton's classic method. Performance was improved by capitalizing
  68     upon the properties of Newton's method, and the code is once again
  69     structured taking account of the 80386 characteristics.
  70 (4) The trig, log, and exp functions are based in each case upon quasi-
  71     "optimal" polynomial approximations. My definition of "optimal" was
  72     based upon getting good accuracy with reasonable speed.
  73 (5) The argument reducing code for the trig function effectively uses
  74     a value of pi which is accurate to more than 128 bits. As a consequence,
  75     the reduced argument is accurate to more than 64 bits for arguments up
  76     to a few pi, and accurate to more than 64 bits for most arguments,
  77     even for arguments approaching 2^63. This is far superior to an
  78     80486, which uses a value of pi which is accurate to 66 bits.
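
As an illustration of item (3), here is a minimal sketch of Newton's
method applied to an integer square root. It is written in Python for
readability and is not the emulator's code, which is 80386 assembler;
the name newton_isqrt and the choice of starting guess are assumptions
made for this example. Newton's method converges quadratically, so the
number of correct bits roughly doubles with each step and a 64-bit
significand needs only a handful of iterations.

```python
def newton_isqrt(n):
    """Return floor(sqrt(n)) for a non-negative integer n.

    Hypothetical helper for illustration only; not taken from wm-FPU-emu.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return 0
    # Start from a power of two that is guaranteed to be >= sqrt(n).
    x = 1 << ((n.bit_length() + 1) // 2)
    while True:
        # One Newton step for f(x) = x*x - n, in integer arithmetic.
        y = (x + n // x) // 2
        if y >= x:
            # The sequence has stopped decreasing: x is floor(sqrt(n)).
            return x
        x = y


# The 64-bit significand of sqrt(2) is floor(sqrt(2) * 2**63),
# which is the integer square root of 2**127.
significand = newton_isqrt(1 << 127)
print(hex(significand))
```

Starting above the true root keeps the iterates decreasing, which makes
the stopping test a single comparison.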
  79
  80 The code of the emulator is complicated slightly by the need to
  81 account for a limited form of re-entrancy. Normally, the emulator will
  82 emulate each FPU instruction to completion without interruption.
  83 However, it may happen that when the emulator is accessing the user
  84 memory space, swapping may be needed. In this case the emulator may be
  85 temporarily suspended while disk i/o takes place. During this time
  86 another process may use the emulator, thereby perhaps changing static
  87 variables. The code which accesses user memory is confined to five
  88 files:
  89     fpu_entry.c
  90     reg_ld_str.c
  91     load_store.c
  92     get_address.c
  93     errors.c
  94 As from version 1.12 of the emulator, no static variables are used
  95 (apart from those in the kernel's per-process tables). The emulator is
  96 therefore now fully re-entrant, rather than having just the restricted
  97 form of re-entrancy which is required by the Linux kernel.
  98
  99 ----------------------- Limitations of wm-FPU-emu -----------------------
 100
 101 There are a number of differences between the current wm-FPU-emu
 102 (version 2.01) and the 80486 FPU (apart from bugs).  The differences
 103 are fewer than those which applied to the 1.xx series of the emulator.
 104 Some of the more important differences are listed below:
 105
 106 The Roundup flag does not have much meaning for the transcendental
 107 functions and its 80486 value with these functions is likely to differ
 108 from its emulator value.
 109
 110 In a few rare cases the Underflow flag obtained with the emulator will
 111 be different from that obtained with an 80486. This occurs when the
 112 following conditions apply simultaneously:
 113 (a) the operands have a higher precision than the current setting of the
 114     precision control (PC) flags.
 115 (b) the underflow exception is masked.
 116 (c) the magnitude of the exact result (before rounding) is less than 2^-16382.
 117 (d) the magnitude of the final result (after rounding) is exactly 2^-16382.
 118 (e) the magnitude of the exact result would be exactly 2^-16382 if the
 119     operands were rounded to the current precision before the arithmetic
 120     operation was performed.
 121 If all of these apply, the emulator will set the Underflow flag but a real
 122 80486 will not.
 123
 124 NOTE: Certain formats of Extended Real are UNSUPPORTED. They are
 125 unsupported by the 80486. They are the Pseudo-NaNs, Pseudoinfinities,
 126 and Unnormals. None of these will be generated by an 80486 or by the
 127 emulator. Do not use them. The emulator treats them differently in
 128 detail from the way an 80486 does.
 129
 130 Self modifying code can cause the emulator to fail. An example of such
 131 code is:
 132           movl %esp,(%ebx)
 133           fld1
 134 The FPU instruction may be (usually will be) loaded into the pre-fetch
 135 queue of the CPU before the mov instruction is executed. If the
 136 destination of the 'movl' overlaps the FPU instruction then the bytes
 137 in the prefetch queue and memory will be inconsistent when the FPU
 138 instruction is executed. The emulator will be invoked but will not be
 139 able to find the instruction which caused the device-not-present
 140 exception. For this case, the emulator cannot emulate the behaviour of
 141 an 80486DX.
 142
 143 Handling of the address size override prefix byte (0x67) has not been
 144 extensively tested yet. A major problem exists because using it in
 145 vm86 mode can cause a general protection fault. Address offsets
 146 greater than 0xffff appear to be illegal in vm86 mode but are quite
 147 acceptable (and work) in real mode. A small test program developed to
 148 check the addressing, and which runs successfully in real mode,
 149 crashes dosemu under Linux and also brings Windows down with a general
 150 protection fault message when run under the MS-DOS prompt of Windows
 151 3.1. (The program simply reads data from a valid address).
 152
 153 The emulator supports 16-bit protected mode, with one difference from
 154 an 80486DX.  An 80486DX will allow some floating point instructions to
 155 write a few bytes below the lowest address of the stack.  The emulator
 156 will not allow this in 16-bit protected mode: no instructions are
 157 allowed to write outside the bounds set by the protection.
 158
 159 ----------------------- Performance of wm-FPU-emu -----------------------
 160
 161 Speed.
 162 -----
 163
 164 The speed of floating point computation with the emulator will depend
 165 upon instruction mix. Relative performance is best for the instructions
 166 which require most computation. The simple instructions are adversely
 167 affected by the FPU instruction trap overhead.
 168
 169
 170 Timing: Some simple timing tests have been made on the emulator functions.
 171 The times include load/store instructions. All times are in microseconds
 172 measured on a 33MHz 386 with 64k cache. The Turbo C tests were under
 173 ms-dos; the next two columns are for emulators running with the djgpp
 174 ms-dos extender. The final column is for wm-FPU-emu in Linux 0.97,
 175 using libm4.0 (hard).
 176
 177 function      Turbo C        djgpp 1.06        WM-emu387     wm-FPU-emu
 178
 179    +          60.5           154.8              76.5          139.4
 180    -          61.1-65.5      157.3-160.8        76.2-79.5     142.9-144.7
 181    *          71.0           190.8              79.6          146.6
 182    /          61.2-75.0      261.4-266.9        75.3-91.6     142.2-158.1
 183
 184  sin()        310.8          4692.0            319.0          398.5
 185  cos()        284.4          4855.2            308.0          388.7
 186  tan()        495.0          8807.1            394.9          504.7
 187  atan()       328.9          4866.4            601.1          419.5-491.9
 188
 189  sqrt()       128.7          crashed           145.2          227.0
 190  log()        413.1-419.1    5103.4-5354.21    254.7-282.2    409.4-437.1
 191  exp()        479.1          6619.2            469.1          850.8
 192
 193
 194 The performance under Linux is improved by the use of look-ahead code.
 195 The following results show the improvement which is obtained under
 196 Linux due to the look-ahead code. Also given are the times for the
 197 original Linux emulator with the 4.1 'soft' lib.
 198
 199  [ Linus' note: I changed look-ahead to be the default under linux, as
 200    there was no reason not to use it after I had edited it to be
 201    disabled during tracing ]
 202
 203             wm-FPU-emu w     original w
 204             look-ahead       'soft' lib
 205    +         106.4             190.2
 206    -         108.6-111.6      192.4-216.2
 207    *         113.4             193.1
 208    /         108.8-124.4      700.1-706.2
 209
 210  sin()       390.5            2642.0
 211  cos()       381.5            2767.4
 212  tan()       496.5            3153.3
 213  atan()      367.2-435.5     2439.4-3396.8
 214
 215  sqrt()      195.1            4732.5
 216  log()       358.0-387.5     3359.2-3390.3
 217  exp()       619.3            4046.4
 218
 219
 220 These figures are now somewhat out-of-date. The emulator has become
 221 progressively slower for most functions as more of the 80486 features
 222 have been implemented.
 223
 224
 225 ----------------------- Accuracy of wm-FPU-emu -----------------------
 226
 227
 228 The accuracy of the emulator is in almost all cases equal to or better
 229 than that of an Intel 80486 FPU.
 230
 231 The results of the basic arithmetic functions (+,-,*,/), and fsqrt
 232 match those of an 80486 FPU. They are the best possible; the error for
 233 these never exceeds 1/2 an lsb. The fprem and fprem1 instructions
 234 return exact results; they have no error.
 235
 236
 237 The following table compares the emulator accuracy for the sqrt(),
 238 trig and log functions against the Turbo C "emulator". For this table,
 239 each function was tested at about 400 points. Ideal worst-case results
 240 would be 64 bits. The reduced Turbo C accuracy of cos() and tan() for
 241 arguments greater than pi/4 can be thought of as being related to the
 242 precision of the argument x; e.g. an argument of pi/2-(1e-10) which is
 243 accurate to 64 bits can result in a relative accuracy in cos() of
 244 about 64 + log2(cos(x)) = 31 bits.
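
The rule of thumb quoted above can be checked with a few lines of
Python. This is only the arithmetic from the paragraph, not a model of
Turbo C; the function name expected_relative_bits is an assumption made
for this example.

```python
import math


def expected_relative_bits(arg_bits, cos_x):
    """Rule of thumb from the text: relative accuracy (in bits) left in
    cos(x) when the argument x is itself accurate to arg_bits bits."""
    return arg_bits + math.log2(cos_x)


# cos(pi/2 - d) equals sin(d); using sin(d) avoids the cancellation
# that would occur when forming pi/2 - d in double precision.
d = 1e-10
bits = expected_relative_bits(64, math.sin(d))
print(round(bits, 1))  # 30.8, i.e. about 31 bits
```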
 245
 246
 247 Function      Tested x range            Worst result                Turbo C
 248                                         (relative bits)
 249
 250 sqrt(x)       1 .. 2                    64.1                         63.2
 251 atan(x)       1e-10 .. 200              64.2                         62.8
 252 cos(x)        0 .. pi/2-(1e-10)         64.4 (x <= pi/4)             62.4
 253                                         64.1 (x = pi/2-(1e-10))      31.9
 254 sin(x)        1e-10 .. pi/2             64.0                         62.8
 255 tan(x)        1e-10 .. pi/2-(1e-10)     64.0 (x <= pi/4)             62.1
 256                                         64.1 (x = pi/2-(1e-10))      31.9
 257 exp(x)        0 .. 1                    63.1 **                      62.9
 258 log(x)        1+1e-6 .. 2               63.8 **                      62.1
 259
 260 ** The accuracy for exp() and log() is low because the FPU (emulator)
 261 does not compute them directly; two operations are required.
 262
 263
 264 The emulator passes the "paranoia" tests (compiled with gcc 2.3.3 or
 265 later) for 'float' variables (24 bit precision numbers) when precision
 266 control is set to 24, 53 or 64 bits, and for 'double' variables (53
 267 bit precision numbers) when precision control is set to 53 bits (a
 268 properly performing FPU cannot pass the 'paranoia' tests for 'double'
 269 variables when precision control is set to 64 bits).
 270
 271 The code for reducing the argument for the trig functions (fsin, fcos,
 272 fptan and fsincos) has been improved and now effectively uses a value
 273 for pi which is accurate to more than 128 bits precision. As a
 274 consequence, the accuracy of these functions for large arguments has
 275 been dramatically improved (and is now very much better than an 80486
 276 FPU). There is also now no degradation of accuracy for fcos and fptan
 277 for operands close to pi/2. Measured results are (note that the
 278 definition of accuracy has changed slightly from that used for the
 279 above table):
 280
 281 Function      Tested x range          Worst result
 282                                      (absolute bits)
 283
 284 cos(x)        0 .. 9.22e+18              62.0
 285 sin(x)        1e-16 .. 9.22e+18          62.1
 286 tan(x)        1e-16 .. 9.22e+18          61.8
 287
 288 It is possible with some effort to find very large arguments which
 289 give much degraded precision. For example, the integer number
 290            8227740058411162616.0
 291 is within about 1e-7 of a multiple of pi. To find the tan (for
 292 example) of this number to 64 bits precision it would be necessary to
 293 have a value of pi which had about 150 bits precision. The FPU
 294 emulator computes the result to about 42.6 bits precision (the correct
 295 result is about -9.739715e-8). On the other hand, an 80486 FPU returns
 296 0.01059, which in relative terms is hopelessly inaccurate.
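
The effect of the precision of pi on this example can be reproduced
with exact rational arithmetic. The sketch below is an illustration
under stated assumptions, not the emulator's algorithm and not a
description of the 80486 microcode: it assumes the 80486 result can be
modelled by reducing the argument exactly with pi truncated to 66
significant bits, and it uses a 100-digit decimal value of pi as a
stand-in for a sufficiently accurate one. The helper names are
hypothetical.

```python
from fractions import Fraction
import math

# pi to 100 decimal places (far more than the example needs).
PI_100_DIGITS = ("3.14159265358979323846264338327950288419716939937510"
                 "58209749445923078164062862089986280348253421170679")
PI = Fraction(PI_100_DIGITS)


def truncate_to_bits(value, bits):
    """Truncate a Fraction >= 1 to `bits` significant binary digits."""
    int_bits = (value.numerator // value.denominator).bit_length()
    scale = Fraction(2) ** (bits - int_bits)
    return Fraction(math.floor(value * scale)) / scale


def tan_after_reduction(x, pi):
    """Subtract the nearest multiple of `pi` from x exactly, then take
    the tangent of the (small) reduced argument in double precision."""
    k = round(Fraction(x) / pi)
    reduced = Fraction(x) - k * pi
    return math.tan(float(reduced))


X = 8227740058411162616

tan_precise = tan_after_reduction(X, PI)
tan_66bit = tan_after_reduction(X, truncate_to_bits(PI, 66))

print("accurate pi :", tan_precise)
print("66-bit pi   :", tan_66bit)
```

Under these assumptions the first figure should come out close to the
correct result quoted above, and the second close to the value returned
by an 80486. The whole discrepancy comes from multiplying the tiny
error in the 66-bit pi by a multiplier of about 2.6e18.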
 297
 298 For arguments close to critical angles (which occur at multiples of
 299 pi/2) the emulator is more accurate than an 80486 FPU. For very large
 300 arguments, the emulator is far more accurate.
 301
 302
 303 Prior to version 1.20 of the emulator, the accuracy of the results for
 304 the transcendental functions (in their principal range) was not as
 305 good as the results from an 80486 FPU. From version 1.20, the accuracy
 306 has been considerably improved and these functions now give measured
 307 worst-case results which are better than the worst-case results given
 308 by an 80486 FPU.
 309
 310 The following table gives the measured results for the emulator. The
 311 number of randomly selected arguments in each case is about half a
 312 million.  The group of three columns gives the frequency of the given
 313 accuracy in number of times per million, thus the second of these
 314 columns shows that an accuracy of between 63.80 and 63.89 bits was
 315 found at a rate of 133 times per one million measurements for fsin.
 316 The results show that the fsin, fcos and fptan instructions return
 317 results which are in error (i.e. less accurate than the best possible
 318 result (which is 64 bits)) for about one per cent of all arguments
 319 between -pi/2 and +pi/2.  The other instructions have a lower
 320 frequency of results which are in error.  The last two columns give
 321 the worst accuracy which was found (in bits) and the approximate value
 322 of the argument which produced it.
 323
 324                                 frequency (per M)
 325                                -------------------   ---------------
 326 instr   arg range    # tests   63.7   63.8    63.9   worst   at arg
 327                                bits   bits    bits    bits
 328 -----  ------------  -------   ----   ----   -----   -----  --------
 329 fsin     (0,pi/2)     547756      0    133   10673   63.89  0.451317
 330 fcos     (0,pi/2)     547563      0    126   10532   63.85  0.700801
 331 fptan    (0,pi/2)     536274     11    267   10059   63.74  0.784876
 332 fpatan  4 quadrants   517087      0      8    1855   63.88  0.435121 (4q)
 333 fyl2x     (0,20)      541861      0      0    1323   63.94  1.40923  (x)
 334 fyl2xp1 (-.293,.414)  520256      0      0    5678   63.93  0.408542 (x)
 335 f2xm1     (-1,1)      538847      4    481    6488   63.79  0.167709
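
The "about one per cent" figure quoted before the table can be checked
from the table itself. The short Python sketch below sums the three
frequency columns for fsin, fcos and fptan; it assumes that these
columns account for every result below 64 bits, which is consistent
with the worst-case column (no result is worse than 63.74 bits). The
names used are chosen for this example only.

```python
# Frequencies per million, copied from the 63.7, 63.8 and 63.9 bit
# columns of the emulator table above.
per_million = {
    "fsin":  (0, 133, 10673),
    "fcos":  (0, 126, 10532),
    "fptan": (11, 267, 10059),
}


def percent_in_error(counts):
    """Share of results below the ideal 64 bits, as a percentage."""
    return sum(counts) / 1_000_000 * 100


for name, counts in per_million.items():
    print(f"{name:6s} {percent_in_error(counts):.2f}%")
```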
 336
 337
 338 Tests performed on an 80486 FPU showed results of lower accuracy. The
 339 following table gives the results which were obtained with an AMD
 340 486DX2/66 (other tests indicate that an Intel 486DX produces
 341 identical results).  The tests were basically the same as those used
 342 to measure the emulator (the values, being random, were in general not
 343 the same).  The total number of tests for each instruction is given
 344 at the end of the table; in each case about 100k tests were performed.
 345 Another line of figures at the end of the table shows that most of the
 346 instructions return results which are in error for more than 10
 347 percent of the arguments tested.
 348
 349 The numbers in the body of the table give the approx number of times a
 350 result of the given accuracy in bits (given in the left-most column)
 351 was obtained per one million arguments. For three of the instructions,
 352 two columns of results are given: * The second column for f2xm1 gives
 353 the number of cases where the results of the first column were for a
 354 positive argument; this shows that this instruction gives better
 355 results for positive arguments than it does for negative.  * In the
 356 cases of fcos and fptan, the first column gives the results after all
 357 cases with arguments greater than 1.5 were removed from the results
 358 given in the second column. Unlike the emulator, an 80486 FPU returns
 359 results of relatively poor accuracy for these instructions when the
 360 argument approaches pi/2. The table does not show those cases when the
 361 accuracy of the results was less than 62 bits, which occurs quite
 362 often for fsin and fptan when the argument approaches pi/2. This poor
 363 accuracy is discussed above in relation to the Turbo C "emulator", and
 364 the accuracy of the value of pi.
 365
 366
 367 bits   f2xm1  f2xm1 fpatan   fcos   fcos  fyl2x fyl2xp1  fsin  fptan  fptan
 368 62.0       0      0      0      0    437      0      0      0      0    925
 369 62.1       0      0     10      0    894      0      0      0      0   1023
 370 62.2      14      0      0      0   1033      0      0      0      0    945
 371 62.3      57      0      0      0   1202      0      0      0      0   1023
 372 62.4     385      0      0     10   1292      0     23      0      0   1178
 373 62.5    1140      0      0    119   1649      0     39      0      0   1149
 374 62.6    2037      0      0    189   1620      0     16      0      0   1169
 375 62.7    5086     14      0    646   2315     10    101     35     39   1402
 376 62.8    8818     86      0    984   3050     59    287    131    224   2036
 377 62.9   11340   1355      0   2126   4153     79    605    357    321   1948
 378 63.0   15557   4750      0   3319   5376    246   1281    862    808   2688
 379 63.1   20016   8288      0   4620   6628    511   2569   1723   1510   3302
 380 63.2   24945  11127     10   6588   8098   1120   4470   2968   2990   4724
 381 63.3   25686  12382     69   8774  10682   1906   6775   4482   5474   7236
 382 63.4   29219  14722     79  11109  12311   3094   9414   7259   8912  10587
 383 63.5   30458  14936    393  13802  15014   5874  12666   9609  13762  15262
 384 63.6   32439  16448   1277  17945  19028  10226  15537  14657  19158  20346
 385 63.7   35031  16805   4067  23003  23947  18910  20116  21333  25001  26209
 386 63.8   33251  15820   7673  24781  25675  24617  25354  24440  29433  30329
 387 63.9   33293  16833  18529  28318  29233  31267  31470  27748  29676  30601
 388
 389 Per cent with error:
 390         30.9           3.2          18.5    9.8   13.1   11.6          17.4
 391 Total arguments tested:
 392        70194  70099 101784 100641 100641 101799 128853 114893 102675 102675
 393
 394
 395 ------------------------- Contributors -------------------------------
 396
 397 A number of people have contributed to the development of the
 398 emulator, often by just reporting bugs, sometimes with suggested
 399 fixes, and a few kind people have provided me with access in one way
 400 or another to an 80486 machine. Contributors include (to those people
 401 whom I may have forgotten, please forgive me):
 402
 403 Linus Torvalds
 404 Tommy.Thorn@daimi.aau.dk
 405 Andrew.Tridgell@anu.edu.au
 406 Nick Holloway, alfie@dcs.warwick.ac.uk
 407 Hermano Moura, moura@dcs.gla.ac.uk
 408 Jon Jagger, J.Jagger@scp.ac.uk
 409 Lennart Benschop
 410 Brian Gallew, geek+@CMU.EDU
 411 Thomas Staniszewski, ts3v+@andrew.cmu.edu
 412 Martin Howell, mph@plasma.apana.org.au
 413 M Saggaf, alsaggaf@athena.mit.edu
 414 Peter Barker, PETER@socpsy.sci.fau.edu
 415 tom@vlsivie.tuwien.ac.at
 416 Dan Russel, russed@rpi.edu
 417 Daniel Carosone, danielce@ee.mu.oz.au
 418 cae@jpmorgan.com
 419 Hamish Coleman, t933093@minyos.xx.rmit.oz.au
 420 Bruce Evans, bde@kralizec.zeta.org.au
 421 Timo Korvola, Timo.Korvola@hut.fi
 422 Rick Lyons, rick@razorback.brisnet.org.au
 423 Rick, jrs@world.std.com
 424
 425 ...and numerous others who responded to my request for help with
 426 a real 80486.
 427