cosmetix
[k8flk.git] / html / intro.html
blob8f745dd167dab15c056de99b53b937c56f7a994d
1 <HTML><HEAD><TITLE>FLK introduction</TITLE></HEAD><BODY>
2 <TABLE>
3 <TR><TD></TD><TD ALIGN=CENTER><A HREF="index.html">Up</A> </TD><TD></TD></TR>
4 <TR><TD><A HREF="index.html">Previous</A></TD>
5 <TD><IMG SRC="flk.gif"></TD>
6 <TD><A HREF=code.html>Next</A></TD></TR>
7 <TR><TD></TD>
8 <!--TD ALIGN=CENTER>
9 <A HREF="mailto:ai108@rz.tu-ilmenau.de">Mail the author</A></TD-->
10 <TD></TD></TR>
11 </TABLE><HR>
12 <H1>FLK introduction</H1>
14 <A NAME=history><H2>History and Objective</H2></A>
15 FLK started as an experiment in my 1997 summer holidays. I tried to improve
16 the performance of FPC by inlined assembler code. Within two evenings the
17 basic concept that FLK uses was implemented.<p>
18 Then the new semester began and there was little time for further work. Till
19 February 1998 almost nothing happened except that I wrote the assembler.
20 Within the next two month the system grew and became a fully ANS (draft)
21 compatible FORTH.<p>
22 I now want FLK to be a fast standard system. It is meant to be an experiment
23 in both meta compilation and code generation, but it should be a fully
24 functional standalone FORTH too.<p>
25 Since most of my work turns around neural networks fast floating point support
26 is nessesary. Together with vector and matrix operations and some
27 visualiziation tools FLK could become a good system for experiments in that
28 field.<p>
29 A different field that I'm interested in is symbolic computation. Sooner or
30 later FLK will contain a computer algebra tool. It therefore has to be fast in
31 non-floating point calculations too.<p>
33 <A NAME=usage><H2>Start-up and user interface</H2></A>
34 To start FLK execute the program <TT>flk</TT>. Any commandline options are
35 interpreted as file names to be <CODE>INCLUDED</CODE>.<p>
36 When FLK is up and running you can enter words to execute. To include a file
37 use <CODE>S" filename.ext" INCLUDED</CODE> or <CODE>INCLUDE</CODE>. The latter
38 lets you input a filename using a different history list and completer.<p>
39 A history list is accessible from the word <CODE>ACCEPT</CODE> only. If you
40 press the up- or down-arrow key you can cycle your previously made inputs.
41 Once you are sure that this the text you want press the return key and your
42 text is appended to the history list and the word returns.<p>
43 A completer is a special word to save you typing. Two completer words are
44 implemented: One for the normal FORTH command line and one for filenames. If
45 you want to learn to implement one yourself look in the files
46 <TT>flkinput.fs</TT> and <TT>flktools.fs</TT> for the existing completers.<p>
47 To activate the completer when <CODE>ACCEPT</CODE> is running press the
48 Tabulator or Set-Tabulator key. In the FORTH commandline the completer
49 searches the beginning of the word the cursor is in or behind. All words in
50 the current search order with this beginning are searched and the longest
51 common string of their names is generated. This string replaces the beginning
52 in the inputline. If there is no word with this beginning one alert is
53 produced, if there is more than one word with the beginning, two alerts are
54 produced.<p>
55 The filename completer takes the whole inputline and performs similiar to
56 <TT>tcsh</TT>'s completer. It tries to expand the path level by level using
57 the longest common string method similiar to the command line completer. Two
58 alerts are produced if more than one file in the directory matches, one if
59 none matches.<p>
60 Upon startup all copies of flkkern (flk is one of them.) search for an system
61 image to load in four places:
62 <OL>
63 <LI>The current executable in the current directory. You can produce
64 standalone executables this way.
65 <LI>The current executable in the installation directory. This is nessesary
66 because argv[0] does not contain the path to the program.
67 <LI>The file flk.flk in the current directory. This is a way to store
68 e.g. project specific systems under the same name.
69 <LI>The file default.flk in the installation directory as noted in the
70 Makefile. This is meant to be a default system if no file or directory
71 specific images could be found.
72 </OL>
74 <A NAME=compiler><H2>Compiler basics</H2></A>
75 FLK is an optimizing native code compiler. It acts similiar to the so-called
76 nano-compilers where every word knows how to compile itself. In FLK only a few
77 words, the so-called optimizer or primitives, can do this. Any word that isn't
78 a primitive doesn't contain code to compile itself. Such words are compiled
79 the usual way by <CODE>COMPILE,</CODE>. <p>
80 With version 1.2 the so-called level 2 compiler words are introduced. They try
81 to fold more than one word into fewer machine code than the separate compiling
82 would produce. Since they have access to the last few literals (including
83 CONSTANTs and CREATEd words) it is possible to include these literals into the
84 code instead of loading a register and then working with that register.<p>
85 The return stack is addressed by <CODE>esp</CODE>, the data stack by <CODE>ebp</CODE>.
86 Since the indexed access using <CODE>ebp</CODE> requires an offset value this is
87 the first opportunity to save time. Instead of increasing and decreasing
88 <CODE>ebp</CODE> itself the offset is increased and decreased. At each access to
89 the stack one <CODE>add</CODE> operation less is nessesary. Before calling another
90 word or returning from this word the accumulated offset has to added to
91 <CODE>ebp</CODE>.<p>
92 Control-flow words like <CODE>IF</CODE> or <CODE>DO</CODE> have to save the offset to
93 <CODE>ebp</CODE> and words like <CODE>THEN</CODE> or <CODE>LOOP</CODE> restore the value
94 by adding the difference to <CODE>ebp</CODE>.<p>
95 The next possible optimization is to keep the top few items of the data stack
96 in the CPU registers to reduce fetch and store operations. Since every word
97 has a different number of accepted and produced items a defined state has to
98 be reached at the beginning of each word. In this state <CODE>eax</CODE> caches
99 the top of stack item and no other registers (except <CODE>ebp</CODE> and
100 <CODE>esp</CODE>) have a defined meaning.<p>
101 Each primitive first resets the register allocator and then requests the
102 stack items and free registers (in that order) it needs, performs its
103 operation and eventually marks the requested register free or puts free
104 registers onto the stack.<p>
105 One important point to mention is that each saved image contains a relocation
106 table. This table contains the addresses of cells whos contents have to be
107 corrected relative to the memory address of the first byte of the image. The
108 contents of these cells are absolute addresses. Words are provided for the
109 handling of relocation issues.<p>
111 <A NAME=primitives><H2>Adding your own primitives</H2></A>
112 This section describes the creation of primitives by the example of the word
113 <CODE>COUNT</CODE>. The only way to compile a primitive is to put it into the file
114 <TT>flkprim.fs</TT>. If you want to write a compiling word without
115 interpretation semantics it is better to program an immediate word an throw an
116 exception if interpreting. <p>
117 <CODE>COUNT</CODE> can be written as the colon definition:
118 <PRE> : COUNT ( caddr -- caddr+1 len ) DUP CHAR+ SWAP C@ ; </PRE><p>
119 As a primitive it is written as:
120 <PRE> p: COUNT ( caddr -- caddr+1 len )
121 regalloc-reset
122 req-any
123 req-free
124 free0 free0 xor,
125 0 [tos0] free0l mov,
126 tos0 inc,
127 0 free&gt;tos ; </PRE><p>
129 The line <CODE> p: COUNT ( c-addr1 -- c-addr2 u )</CODE> defines the primitive
130 and informs about the stack effect. Only one space before the name of
131 the primitive is allowed. Tabs are allowed after the name only if a space
132 immediately follows the name.<p>
133 The first thing to do is to reset the register alloctor using
134 <CODE>regalloc-reset</CODE>. Now we request one item from the stack and one free
135 register by <CODE>req-any req-free</CODE>. Then the actual code generation starts.
136 <PRE> free0 free0 xor,
137 0 [tos0] free0l mov,
138 tos0 inc, </PRE><p>
139 The byte at caddr is fetched into the cleared <CODE>free0</CODE> meta register.
140 Which register is hidden behind <CODE>free0</CODE> is not interesting. Neither the
141 user nor the programmer need to know it.<p>
142 The last line puts the <CODE>free0</CODE> register on top of the stack.<p>
143 Other control words for the register allocator can be found in
144 <TT>flkprim.fs</TT> in the definitions of the other primitives.<p>
146 <A NAME=level2><H2>Adding your own level 2 compilers</H2></A>
147 This section contains the desciption of a level 2 compiler (found in
148 <TT>flkopt.fs</TT>).<p>
149 Each level 2 optimizer consumes zero items and produces no items either. To
150 declare an optimizer edit <TT>flkopt.fs</TT> for optimizers that work in host
151 and target or <TT>flktopt.fs</TT> for those that only run in the target.<p>
152 First thing to do is to declare the sequence to optimize away:
153 <CODE>opt( ''# '' + '' @ )opt:</CODE> does this. This optimizer is declared
154 for the sequence <I>number additiion fetch</I>. Whenever this sequence is
155 found, the following code is executed instead of their individual
156 optimizers.<p>
157 The rest of the word is very similiar to a primitive declaration. There are
158 three exceptions: You have to delete the optimized words at the end of the
159 word and you have to get or set the actual value of the number parameter. How
160 to do this is shown in the code snippets below.<p>
161 <PRE>opt( ''# '' + '' @ )opt:
162 ( Get the actual value and a flag telling if it is an address. )
163 0 opt-getlit \ x rel?
164 ( Normal code generation. )
165 regalloc-reset
166 req-any \ tos0=offs
167 ?+relocate
168 [tos0] tos0 mov,
169 ( All items used up. )
170 0 3 opt-remove
171 ;opt
173 opt( ''# ''# '' + )opt:
174 ( get left parameter to + )
175 1 opt-getlit \ x1 rel1
176 ( get right parameter to + )
177 0 opt-getlit \ x1 rel1 x0 rel0
178 ( If one is an address, result is an address to. )
179 ROT OR -ROT \ rel tos1 tos0
180 ( Perform the actual calculation. )
181 + SWAP \ x rel
182 ( Store it back into the cache. )
183 0 opt-setlit
184 ( Delete the words optimized away from the cache. )
185 1 2 opt-remove
186 ;opt
187 </PRE>
189 <A NAME=warn><H2>Unexpected behaviour</H2></A>
190 This section is meant to be a warning. FLK is still a construction site. You
191 could end up having your car buried under a pile of dirt if you don't drive
192 carefully. :-)<p>
193 But seriously, some of the mistakes made by users (and programmers) are not
194 reported at the moment. Some of them never will.<p>
195 Among these unreported errors are data stack over- and underflows, return
196 stack over- and underflows and floating point stack overflows. Some of them
197 can produce unexpected or wrong results, some of them cause segmentation
198 faults.<p>
199 For a more detailed list of ambiguous conditions see
200 <A HREF=ANS.html>here</A>.<p>
202 <A NAME=benchmarks><H2>Benchmarks</H2></A>
203 To summarize this section: 63 % of all statistics are faked. 17 % of all
204 people know that.<p>
205 Seriously: I used the benchmarks of <A
206 HREF="http://www.complang.tuwien.ac.at/forth/bench.tar.gz">
207 Anton Ertl's Benchmark suite</A> to compare the speed of FLK with that of
208 gforth. You can indirectly compare several other systems with FLK at <A
209 HREF="http://www.complang.tuwien.ac.at/forth/performance.html">Anton Ertl's
210 performance web page</A>. <p>
211 The following sections describe the benchmark programs, show a list of times
212 of gforth and FLK and explain which optimiziers have been implemented to
213 achieve the speed-up.<p>
214 The used system was a 133MHz Pentium without MMX running Linux kernel 2.0.30
215 and KDE. All times can differ a bit due to limited timer resolution and cpu
216 load. The cpu used was between 97 and 99 % in all tests.<p>
217 The initial state had no optimizers except combining <CODE>OVER</CODE> or
218 <CODE>2DUP</CODE>, relational operators and <CODE>IF</CODE> and
219 <CODE>WHILE</CODE> to allocate fewer registers and not to generate an
220 intermediate flag on the stack. All other changes are incremental. These tests
221 were performed with version 1.2 but aplly for later versions too.<p>
223 <H3>Sieve</H3>
224 What is a benchmark without the sieve of Erastothenes. The implementation in
225 <TT>sieve.fs</TT> is a straight-forward one: Two nested loops to check and
226 clear the flags. To have a reasonable time the search of all primes between 1
227 and 8190 is repeated 1000 times. The most frequent used words are: <CODE>I C@
228 C! DO +LOOP DUP</CODE.><p>
229 <TABLE BORDER=2>
230 <TR><TD>Optimization</TD><TD>Time of FLK in sec.</TD><TD>Speed factor
231 (gforth: 14.37 sec)</TD>
232 <TR><TD>initial test</TD><TD>3.94</TD><TD>3.6</TD>
233 <TR><TD>DUP and +LOOP combined (1 register less used)</TD><TD>3.9</TD><TD>3.6</TD>
234 <TR><TD>LOOP (short jumps when possible instead always near jumps)</TD><TD>3.7</TD><TD>3.9</TD>
235 <TR><TD>I (esp access using SIB addressing instead exchanges and ebp access
236 using MOD/RM addressing)</TD><TD>2.4</TD><TD>6</TD>
237 </TABLE>
238 As you can see in the second row saving a register when enough of them are
239 available gains very little. Removing unnessary jump gains a bit more due to
240 the saved space in the branch predictor of the pentium. The last change
241 removes at least two AGIs (address generation interlock) per <CODE>I</CODE> in
242 the innermost loop. That gains at least four cycles per loop.<p>
244 <H3>Bubble sort</H3>
245 Another classical benchmark: sorting 6000 random numbers. Implementation: two
246 nested loops. The most frequent used words are: <CODE>I 2@ &gt; SWAP
247 2!</CODE.><p>
248 <TABLE BORDER=2>
249 <TR><TD>Optimization</TD><TD>Time of FLK in sec.</TD><TD>Speed factor
250 (gforth: 14.54 sec)</TD>
251 <TR><TD>initial test</TD><TD>4.29</TD><TD>3.4</TD>
252 <TR><TD>all opt. above</TD><TD>2.72</TD><TD>5.4</TD>
253 </TABLE>
255 <H3>Fibonacci</H3>
256 This little word has two recursive calls and measures mostly call/return
257 performance.<p>
258 <TABLE BORDER=2>
259 <TR><TD>Optimization</TD><TD>Time of FLK in sec.</TD><TD>Speed factor
260 (gforth: 17.13 sec)</TD>
261 <TR><TD>initial test</TD><TD>2.35</TD><TD>7.3</TD>
262 <TR><TD>all opt. above</TD><TD>2.2</TD><TD>7.8</TD>
263 <TR><TD>+ changed to SWAP +</TD><TD>2.16</TD><TD>7.9</TD>
264 </TABLE>
265 The change of <CODE>+</CODE> to </CODE>SWAP +</CODE> produces code that looks
266 better before an <CODE>EXIT</CODE>. The four hundreds of a second saved can be
267 blamed on the timer tolerance.
269 <HR><TABLE>
270 <TR><TD></TD><TD ALIGN=CENTER><A HREF="index.html">Up</A> </TD><TD></TD></TR>
271 <TR><TD><A HREF="index.html">Previous</A></TD>
272 <TD ALIGN=CENTER><H1>FLK</H1></TD>
273 <TD><A HREF=code.html>Next</A></TD></TR>
274 <TR><TD></TD>
275 <!--TD ALIGN=CENTER>
276 <A HREF="mailto:ai108@rz.tu-ilmenau.de">Mail the author</A></TD-->
277 <TD></TD></TR>
278 </TABLE>
279 </BODY></HTML>
280 <!-- vim:tw=78