html/intro.html

   1 <HTML><HEAD><TITLE>FLK introduction</TITLE></HEAD><BODY>
   2 <TABLE>
   3 <TR><TD></TD><TD ALIGN=CENTER><A HREF="index.html">Up</A> </TD><TD></TD></TR>
   4 <TR><TD><A HREF="index.html">Previous</A></TD>
   5     <TD><IMG SRC="flk.gif"></TD>
   6     <TD><A HREF=code.html>Next</A></TD></TR>
   7 <TR><TD></TD>
   8     <!--TD ALIGN=CENTER>
   9       <A HREF="mailto:ai108@rz.tu-ilmenau.de">Mail the author</A></TD-->
  10     <TD></TD></TR>
  11 </TABLE><HR>
  12 <H1>FLK introduction</H1>
  13
  14 <A NAME=history><H2>History and Objective</H2></A>
  15 FLK started as an experiment in my 1997 summer holidays. I tried to improve
  16 the performance of FPC by inlined assembler code. Within two evenings the
  17 basic concept that FLK uses was implemented.<p>
  18 Then the new semester began and there was little time for further work. Till
  19 February 1998 almost nothing happened except that I wrote the assembler.
  20 Within the next two month the system grew and became a fully ANS (draft)
  21 compatible FORTH.<p>
  22 I now want FLK to be a fast standard system. It is meant to be an experiment
  23 in both meta compilation and code generation, but it should be a fully
  24 functional standalone FORTH too.<p>
  25 Since most of my work turns around neural networks fast floating point support
  26 is nessesary. Together with vector and matrix operations and some
  27 visualiziation tools FLK could become a good system for experiments in that
  28 field.<p>
  29 A different field that I'm interested in is symbolic computation. Sooner or
  30 later FLK will contain a computer algebra tool. It therefore has to be fast in
  31 non-floating point calculations too.<p>
  32
  33 <A NAME=usage><H2>Start-up and user interface</H2></A>
  34 To start FLK execute the program <TT>flk</TT>. Any commandline options are
  35 interpreted as file names to be <CODE>INCLUDED</CODE>.<p>
  36 When FLK is up and running you can enter words to execute. To include a file
  37 use <CODE>S" filename.ext" INCLUDED</CODE> or <CODE>INCLUDE</CODE>. The latter
  38 lets you input a filename using a different history list and completer.<p>
  39 A history list is accessible from the word <CODE>ACCEPT</CODE> only. If you
  40 press the up- or down-arrow key you can cycle your previously made inputs.
  41 Once you are sure that this the text you want press the return key and your
  42 text is appended to the history list and the word returns.<p>
  43 A completer is a special word to save you typing. Two completer words are
  44 implemented: One for the normal FORTH command line and one for filenames. If
  45 you want to learn to implement one yourself look in the files
  46 <TT>flkinput.fs</TT> and <TT>flktools.fs</TT> for the existing completers.<p>
  47 To activate the completer when <CODE>ACCEPT</CODE> is running press the
  48 Tabulator or Set-Tabulator key. In the FORTH commandline the completer
  49 searches the beginning of the word the cursor is in or behind. All words in
  50 the current search order with this beginning are searched and the longest
  51 common string of their names is generated. This string replaces the beginning
  52 in the inputline. If there is no word with this beginning one alert is
  53 produced, if there is more than one word with the beginning, two alerts are
  54 produced.<p>
  55 The filename completer takes the whole inputline and performs similiar to
  56 <TT>tcsh</TT>'s completer. It tries to expand the path level by level using
  57 the longest common string method similiar to the command line completer. Two
  58 alerts are produced if more than one file in the directory matches, one if
  59 none matches.<p>
  60 Upon startup all copies of flkkern (flk is one of them.) search for an system
  61 image to load in four places:
  62 <OL>
  63   <LI>The current executable in the current directory. You can produce
  64 standalone executables this way.
  65   <LI>The current executable in the installation directory. This is nessesary
  66 because argv[0] does not contain the path to the program.
  67   <LI>The file flk.flk in the current directory. This is a way to store
  68 e.g. project specific systems under the same name.
  69   <LI>The file default.flk in the installation directory as noted in the
  70 Makefile. This is meant to be a default system if no file or directory
  71 specific images could be found.
  72 </OL>
  73
  74 <A NAME=compiler><H2>Compiler basics</H2></A>
  75 FLK is an optimizing native code compiler. It acts similiar to the so-called
  76 nano-compilers where every word knows how to compile itself. In FLK only a few
  77 words, the so-called optimizer or primitives, can do this. Any word that isn't
  78 a primitive doesn't contain code to compile itself. Such words are compiled
  79 the usual way by <CODE>COMPILE,</CODE>. <p>
  80 With version 1.2 the so-called level 2 compiler words are introduced. They try
  81 to fold more than one word into fewer machine code than the separate compiling
  82 would produce. Since they have access to the last few literals (including
  83 CONSTANTs and CREATEd words) it is possible to include these literals into the
  84 code instead of loading a register and then working with that register.<p>
  85 The return stack is addressed by <CODE>esp</CODE>, the data stack by <CODE>ebp</CODE>.
  86 Since the indexed access using <CODE>ebp</CODE> requires an offset value this is
  87 the first opportunity to save time. Instead of increasing and decreasing
  88 <CODE>ebp</CODE> itself the offset is increased and decreased. At each access to
  89 the stack one <CODE>add</CODE> operation less is nessesary. Before calling another
  90 word or returning from this word the accumulated offset has to added to
  91 <CODE>ebp</CODE>.<p>
  92 Control-flow words like <CODE>IF</CODE> or <CODE>DO</CODE> have to save the offset to
  93 <CODE>ebp</CODE> and words like <CODE>THEN</CODE> or <CODE>LOOP</CODE> restore the value
  94 by adding the difference to <CODE>ebp</CODE>.<p>
  95 The next possible optimization is to keep the top few items of the data stack
  96 in the CPU registers to reduce fetch and store operations. Since every word
  97 has a different number of accepted and produced items a defined state has to
  98 be reached at the beginning of each word. In this state <CODE>eax</CODE> caches
  99 the top of stack item and no other registers (except <CODE>ebp</CODE> and
 100 <CODE>esp</CODE>) have a defined meaning.<p>
 101 Each primitive first resets the register allocator and then requests the
 102 stack items and free registers (in that order) it needs, performs its
 103 operation and eventually marks the requested register free or puts free
 104 registers onto the stack.<p>
 105 One important point to mention is that each saved image contains a relocation
 106 table. This table contains the addresses of cells whos contents have to be
 107 corrected relative to the memory address of the first byte of the image. The
 108 contents of these cells are absolute addresses. Words are provided for the
 109 handling of relocation issues.<p>
 110
 111 <A NAME=primitives><H2>Adding your own primitives</H2></A>
 112 This section describes the creation of primitives by the example of the word
 113 <CODE>COUNT</CODE>. The only way to compile a primitive is to put it into the file
 114 <TT>flkprim.fs</TT>. If you want to write a compiling word without
 115 interpretation semantics it is better to program an immediate word an throw an
 116 exception if interpreting. <p>
 117 <CODE>COUNT</CODE> can be written as the colon definition:
 118 <PRE> : COUNT ( caddr -- caddr+1 len ) DUP CHAR+ SWAP C@ ; </PRE><p>
 119 As a primitive it is written as:
 120 <PRE> p: COUNT ( caddr -- caddr+1 len )
 121   regalloc-reset
 122   req-any
 123   req-free
 124   free0 free0 xor,
 125   0 [tos0] free0l mov,
 126   tos0 inc,
 127   0 free&gt;tos ; </PRE><p>
 128
 129 The line <CODE> p: COUNT ( c-addr1 -- c-addr2 u )</CODE> defines the primitive
 130 and informs about the stack effect. Only one space before the name of
 131 the primitive is allowed. Tabs are allowed after the name only if a space
 132 immediately follows the name.<p>
 133 The first thing to do is to reset the register alloctor using
 134 <CODE>regalloc-reset</CODE>. Now we request one item from the stack and one free
 135 register by <CODE>req-any req-free</CODE>. Then the actual code generation starts.
 136 <PRE>  free0 free0 xor,
 137   0 [tos0] free0l mov,
 138   tos0 inc, </PRE><p>
 139 The byte at caddr is fetched into the cleared <CODE>free0</CODE> meta register.
 140 Which register is hidden behind <CODE>free0</CODE> is not interesting. Neither the
 141 user nor the programmer need to know it.<p>
 142 The last line puts the <CODE>free0</CODE> register on top of the stack.<p>
 143 Other control words for the register allocator can be found in
 144 <TT>flkprim.fs</TT> in the definitions of the other primitives.<p>
 145
 146 <A NAME=level2><H2>Adding your own level 2 compilers</H2></A>
 147 This section contains the desciption of a level 2 compiler (found in
 148 <TT>flkopt.fs</TT>).<p>
 149 Each level 2 optimizer consumes zero items and produces no items either. To
 150 declare an optimizer edit <TT>flkopt.fs</TT> for optimizers that work in host
 151 and target or <TT>flktopt.fs</TT> for those that only run in the target.<p>
 152 First thing to do is to declare the sequence to optimize away:
 153 <CODE>opt( ''# '' + '' @ )opt:</CODE> does this. This optimizer is declared
 154 for the sequence <I>number additiion fetch</I>. Whenever this sequence is
 155 found, the following code is executed instead of their individual
 156 optimizers.<p>
 157 The rest of the word is very similiar to a primitive declaration. There are
 158 three exceptions: You have to delete the optimized words at the end of the
 159 word and you have to get or set the actual value of the number parameter. How
 160 to do this is shown in the code snippets below.<p>
 161 <PRE>opt( ''# '' + '' @ )opt:
 162     ( Get the actual value and a flag telling if it is an address. )
 163     0 opt-getlit      \ x rel?
 164     ( Normal code generation. )
 165     regalloc-reset
 166     req-any         \ tos0=offs
 167     ?+relocate
 168     [tos0] tos0 mov,
 169     ( All items used up. )
 170     0 3 opt-remove
 171     ;opt
 172
 173 opt( ''# ''# '' + )opt:
 174    ( get left parameter to + )
 175    1 opt-getlit       \ x1 rel1
 176    ( get right parameter to + )
 177    0 opt-getlit       \ x1 rel1 x0 rel0
 178    ( If one is an address, result is an address to. )
 179    ROT OR -ROT        \ rel tos1 tos0
 180    ( Perform the actual calculation. )
 181    + SWAP         \ x rel
 182    ( Store it back into the cache. )
 183    0 opt-setlit
 184    ( Delete the words optimized away from the cache. )
 185    1 2 opt-remove
 186 ;opt
 187 </PRE>
 188
 189 <A NAME=warn><H2>Unexpected behaviour</H2></A>
 190 This section is meant to be a warning. FLK is still a construction site. You
 191 could end up having your car buried under a pile of dirt if you don't drive
 192 carefully. :-)<p>
 193 But seriously, some of the mistakes made by users (and programmers) are not
 194 reported at the moment. Some of them never will.<p>
 195 Among these unreported errors are data stack over- and underflows, return
 196 stack over- and underflows and floating point stack overflows. Some of them
 197 can produce unexpected or wrong results, some of them cause segmentation
 198 faults.<p>
 199 For a more detailed list of ambiguous conditions see
 200 <A HREF=ANS.html>here</A>.<p>
 201
 202 <A NAME=benchmarks><H2>Benchmarks</H2></A>
 203 To summarize this section: 63 % of all statistics are faked. 17 % of all
 204 people know that.<p>
 205 Seriously: I used the benchmarks of <A
 206 HREF="http://www.complang.tuwien.ac.at/forth/bench.tar.gz">
 207 Anton Ertl's Benchmark suite</A> to compare the speed of FLK with that of
 208 gforth. You can indirectly compare several other systems with FLK at <A
 209 HREF="http://www.complang.tuwien.ac.at/forth/performance.html">Anton Ertl's
 210 performance web page</A>. <p>
 211 The following sections describe the benchmark programs, show a list of times
 212 of gforth and FLK and explain which optimiziers have been implemented to
 213 achieve the speed-up.<p>
 214 The used system was a 133MHz Pentium without MMX running Linux kernel 2.0.30
 215 and KDE. All times can differ a bit due to limited timer resolution and cpu
 216 load. The cpu used was between 97 and 99 % in all tests.<p>
 217 The initial state had no optimizers except combining <CODE>OVER</CODE> or
 218 <CODE>2DUP</CODE>, relational operators and <CODE>IF</CODE> and
 219 <CODE>WHILE</CODE> to allocate fewer registers and not to generate an
 220 intermediate flag on the stack. All other changes are incremental. These tests
 221 were performed with version 1.2 but aplly for later versions too.<p>
 222
 223 <H3>Sieve</H3>
 224 What is a benchmark without the sieve of Erastothenes. The implementation in
 225 <TT>sieve.fs</TT> is a straight-forward one: Two nested loops to check and
 226 clear the flags. To have a reasonable time the search of all primes between 1
 227 and 8190 is repeated 1000 times. The most frequent used words are: <CODE>I C@
 228 C! DO +LOOP DUP</CODE.><p>
 229 <TABLE BORDER=2>
 230 <TR><TD>Optimization</TD><TD>Time of FLK in sec.</TD><TD>Speed factor
 231 (gforth: 14.37 sec)</TD>
 232 <TR><TD>initial test</TD><TD>3.94</TD><TD>3.6</TD>
 233 <TR><TD>DUP and +LOOP combined (1 register less used)</TD><TD>3.9</TD><TD>3.6</TD>
 234 <TR><TD>LOOP (short jumps when possible instead always near jumps)</TD><TD>3.7</TD><TD>3.9</TD>
 235 <TR><TD>I (esp access using SIB addressing instead exchanges and ebp access
 236 using MOD/RM addressing)</TD><TD>2.4</TD><TD>6</TD>
 237 </TABLE>
 238 As you can see in the second row saving a register when enough of them are
 239 available gains very little. Removing unnessary jump gains a bit more due to
 240 the saved space in the branch predictor of the pentium. The last change
 241 removes at least two AGIs (address generation interlock) per <CODE>I</CODE> in
 242 the innermost loop. That gains at least four cycles per loop.<p>
 243
 244 <H3>Bubble sort</H3>
 245 Another classical benchmark: sorting 6000 random numbers. Implementation: two
 246 nested loops. The most frequent used words are: <CODE>I 2@ &gt; SWAP
 247 2!</CODE.><p>
 248 <TABLE BORDER=2>
 249 <TR><TD>Optimization</TD><TD>Time of FLK in sec.</TD><TD>Speed factor
 250 (gforth: 14.54 sec)</TD>
 251 <TR><TD>initial test</TD><TD>4.29</TD><TD>3.4</TD>
 252 <TR><TD>all opt. above</TD><TD>2.72</TD><TD>5.4</TD>
 253 </TABLE>
 254
 255 <H3>Fibonacci</H3>
 256 This little word has two recursive calls and measures mostly call/return
 257 performance.<p>
 258 <TABLE BORDER=2>
 259 <TR><TD>Optimization</TD><TD>Time of FLK in sec.</TD><TD>Speed factor
 260 (gforth: 17.13 sec)</TD>
 261 <TR><TD>initial test</TD><TD>2.35</TD><TD>7.3</TD>
 262 <TR><TD>all opt. above</TD><TD>2.2</TD><TD>7.8</TD>
 263 <TR><TD>+ changed to SWAP +</TD><TD>2.16</TD><TD>7.9</TD>
 264 </TABLE>
 265 The change of <CODE>+</CODE> to </CODE>SWAP +</CODE> produces code that looks
 266 better before an <CODE>EXIT</CODE>. The four hundreds of a second saved can be
 267 blamed on the timer tolerance.
 268
 269 <HR><TABLE>
 270 <TR><TD></TD><TD ALIGN=CENTER><A HREF="index.html">Up</A> </TD><TD></TD></TR>
 271 <TR><TD><A HREF="index.html">Previous</A></TD>
 272     <TD ALIGN=CENTER><H1>FLK</H1></TD>
 273     <TD><A HREF=code.html>Next</A></TD></TR>
 274 <TR><TD></TD>
 275     <!--TD ALIGN=CENTER>
 276       <A HREF="mailto:ai108@rz.tu-ilmenau.de">Mail the author</A></TD-->
 277     <TD></TD></TR>
 278 </TABLE>
 279 </BODY></HTML>
 280 <!--  vim:tw=78
 281 >