NEW-OPCODES

Introduction

This document describes additional features and functionality of the threaded version of the ngaro vm.

DTC streams and token threaded code:

There are three categories of executable code in ngaro: CPU-dependent machine code, DTC streams
and bytecode sequences. The C code primitives, which form the basic instruction set, are implemented
as labeled code sequences and generated at compile time. A gcc extension ("labels as values") allows
label addresses to be stored in variables; thanks to this feature the primitives can be executed
through a simple goto statement.

Example:

  void *pointer_array[256];             /* one label address per opcode */

  pointer_array[0] = &&labelA;
  ...

  goto *pointer_array[vm_code[vpc++]];  /* fetch the next token and jump to its primitive */

labelA:                                 /* C code of primitive 0 ... */
  goto *pointer_array[vm_code[vpc++]];

With this technique it is possible to implement direct (DTC), indirect (ITC) and token threaded (TTC)
interpreters without including assembler code. The vm uses TTC because it is a straightforward and
fast way to decode the ngaro bytecode. TTC has the disadvantage of being slower than DTC, because
every bytecode must be translated to an address before it can be executed, but it has the advantage
of fewer TLB misses in the case of vm branch instructions.
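A minimal sketch of token threading with the gcc labels-as-values extension (clang accepts it too). The opcodes OP_LIT, OP_ADD, OP_MUL, OP_HALT and the function run_ttc are hypothetical and not part of ngaro; the sketch only shows that every dispatch first translates the token into a label address through a table lookup, which is the extra step TTC pays compared to DTC.

```c
#include <stdint.h>

enum { OP_LIT, OP_ADD, OP_MUL, OP_HALT };   /* hypothetical opcodes */

/* Token threaded interpreter: code holds small opcode numbers (tokens).
 * Each dispatch looks the token up in a table of label addresses before
 * jumping; this lookup is the translation step that DTC avoids. */
int32_t run_ttc(const int32_t *code)
{
    static void *dispatch[] = { &&op_lit, &&op_add, &&op_mul, &&op_halt };
    int32_t stack[16];
    int sp = 0;                 /* number of stack elements */
    int vpc = 0;                /* virtual program counter */

#define NEXT goto *dispatch[code[vpc++]]
    NEXT;
op_lit:
    stack[sp++] = code[vpc++];  /* inline parameter follows the opcode */
    NEXT;
op_add:
    sp--;
    stack[sp - 1] += stack[sp];
    NEXT;
op_mul:
    sp--;
    stack[sp - 1] *= stack[sp];
    NEXT;
op_halt:
    return stack[sp - 1];
#undef NEXT
}
```

For example, run_ttc on the token sequence OP_LIT 2, OP_LIT 3, OP_ADD, OP_LIT 4, OP_MUL, OP_HALT returns 20.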

The vm is able to compile linear bytecode sequences to DTC code (a stream). DTC streams can then be
executed like any other bytecode, so the interpreter can extend its instruction set at runtime. It
is also possible to compile and execute a stream without permanently extending the instruction set
(analogous to the generation of super instructions in other Forth systems such as gforth). Combining
TTC and DTC joins good branch performance with a faster dispatch method, which can result in a large
performance gain over traditional threading techniques. However, to profit from this capability the
retro compiler must at least support the generation of super instructions.
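The following sketch shows, under stated assumptions, how a linear bytecode sequence could be compiled into a DTC stream and executed once, in the spirit of a temporary super instruction. The opcode set, the parameter table n_params and compile_and_run are hypothetical; ngaro's actual stream compiler is not shown in this document. Tokens are replaced by label addresses while compiling, so executing the stream needs no table lookup.

```c
#include <assert.h>
#include <stdint.h>

enum { OP_LIT, OP_ADD, OP_MUL, OP_RET, OP_COUNT };    /* hypothetical opcodes */
static const int n_params[OP_COUNT] = { 1, 0, 0, 0 }; /* LIT has one inline value */

#define MAX_CELLS 64

/* Translates n bytecodes into a DTC stream (label addresses with inline
 * parameters) and runs it once. Returns the top of the data stack. */
int32_t compile_and_run(const int32_t *code, int n)
{
    static void *labels[OP_COUNT] = { &&op_lit, &&op_add, &&op_mul, &&op_ret };
    void *stream[MAX_CELLS];
    int32_t stack[16];
    int sp = 0, len = 0, pc = 0;

    /* Compile step: tokens become label addresses; parameters are copied. */
    for (int i = 0; i < n; i++) {
        int op = code[pc++];
        assert(len + 1 + n_params[op] < MAX_CELLS);  /* keep room for OP_RET */
        stream[len++] = labels[op];
        for (int p = 0; p < n_params[op]; p++)
            stream[len++] = (void *)(intptr_t)code[pc++];
    }
    stream[len] = &&op_ret;                          /* terminate the stream */

    /* Execute step: each dispatch is one indirect jump, no table lookup. */
    void **ip = stream;
    goto **ip++;
op_lit: stack[sp++] = (int32_t)(intptr_t)*ip++; goto **ip++;
op_add: sp--; stack[sp - 1] += stack[sp];       goto **ip++;
op_mul: sp--; stack[sp - 1] *= stack[sp];       goto **ip++;
op_ret: return stack[sp - 1];
}
```

Note that n counts instructions, not cells: the five instructions LIT 2, LIT 3, ADD, LIT 4, MUL occupy eight cells and leave 20 on the stack.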

Compiling a DTC stream:

To compile a new stream there is a special instruction:

  SINST N  instruction_1, instruction_2 .. instruction_N      (N = number of instructions)

Because DTC and TTC code are threaded in incompatible ways, it is not possible to branch from a
stream to bytecode directly, and vice versa. Instead, subroutines must be compiled separately to new
instructions and can then be included as extended instructions with an XOP prefix (similar to the
opcode prefix technique, well known from Intel and Zilog processors, that supports more than 255
opcodes). Another possibility would be to use two special instructions (TCALL and TRETURN) for this
task, but these have not yet been tested and are not included in the current vm version (coming
soon :-).
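A minimal sketch of decoding an XOP-prefixed instruction. The value chosen for XOP, struct insn and decode are assumptions for illustration, not ngaro's real encoding. A plain opcode selects a primitive, while XOP followed by a number selects an extended (separately compiled) instruction, in the same way an opcode prefix lets an 8-bit opcode space grow beyond 255 opcodes.

```c
#include <stdint.h>

#define XOP 255   /* hypothetical prefix value, not ngaro's real encoding */

/* A decoded instruction: either a primitive or an extended one. */
struct insn {
    int extended;     /* 1 if the opcode was prefixed with XOP */
    int32_t number;   /* primitive opcode or extended-instruction number */
};

/* Decodes the instruction at code[*pc] and advances *pc past it. */
struct insn decode(const int32_t *code, int *pc)
{
    struct insn in = { 0, code[(*pc)++] };
    if (in.number == XOP) {
        in.extended = 1;
        in.number = code[(*pc)++];   /* number following the prefix */
    }
    return in;
}
```

The cost of the prefix is one extra cell and one extra comparison per extended instruction; plain opcodes decode as before.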

The SINST bytecode pushes the start offset of the generated stream onto the data stack, so the
bytecode of the new instruction is identical to its stream offset. To implement a temporary super
instruction, the bytecode can be executed with the EXE instruction; afterwards the TRESET instruction
moves the internal offset pointer into the stream memory back to its old position, which frees the
memory allocated for the instruction:

Example:

  SINST 6
  LI_IAC 0
  LIT 1
  ADD
  DUP
  LIT 1000000
  LT_JUMP 2
  EXE
  TRESET

This example shows two details of the SINST bytecode. The instruction count N counts instructions
only (from one up), not their parameters, whereas the immediate destination of a conditional branch
is an offset in which parameters are counted too, starting at 0. Here, six bytecodes are compiled to
a stream, and LT_JUMP 2 branches to the LIT bytecode (LI_IAC is at offset 0, its parameter at 1).
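The bookkeeping behind SINST and TRESET can be sketched as a simple bump allocator over the stream memory. Everything here is hypothetical: the opcode numbers, n_params, sinst, treset and insn_at_offset are illustrative names, and TRESET is assumed to restore the offset pointer to where it was before the most recent SINST. For simplicity the sketch copies raw opcodes instead of label addresses. insn_at_offset puts the two counting schemes side by side: SINST counts instructions, branch destinations count cells including parameters.

```c
#include <assert.h>
#include <stdint.h>

/* Opcodes of the example above; the numeric values are placeholders,
 * only the parameter counts matter here. */
enum { LI_IAC, LIT, ADD, DUP, LT_JUMP, N_OPS };
static const int n_params[N_OPS] = { 1, 1, 0, 0, 1 };

#define STREAM_CELLS 1024              /* hypothetical stream memory size */
static int32_t stream_mem[STREAM_CELLS];
static int stream_top = 0;             /* internal offset pointer */
static int last_start = 0;             /* restored by treset() (assumption) */

/* SINST n: copies n instructions (parameters not counted in n) into the
 * stream memory and returns the start offset, which doubles as the bytecode
 * of the new instruction. A real vm would store label addresses instead. */
int sinst(const int32_t *code, int n)
{
    int start = stream_top, cell = 0;
    for (int i = 0; i < n; i++) {
        int len = 1 + n_params[code[cell]];     /* opcode plus parameters */
        assert(stream_top + len <= STREAM_CELLS);
        for (int k = 0; k < len; k++)
            stream_mem[stream_top++] = code[cell++];
    }
    last_start = start;
    return start;
}

/* TRESET: move the offset pointer back, freeing the most recent stream. */
void treset(void) { stream_top = last_start; }

/* Branch destinations count parameters too (from 0): returns the index of
 * the instruction starting at cell offset dest, or -1 if dest is a
 * parameter or lies beyond the first n instructions. */
int insn_at_offset(const int32_t *code, int n, int dest)
{
    for (int i = 0, cell = 0; i < n && cell <= dest; i++) {
        if (cell == dest)
            return i;
        cell += 1 + n_params[code[cell]];
    }
    return -1;
}
```

With the example program (six instructions in ten cells), the first sinst call returns offset 0, and insn_at_offset(prog, 6, 2) returns 1, the index of the LIT instruction that LT_JUMP 2 branches to.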

Avoiding stack addressing:

Register-based vm designs offer the chance of better vm performance because their primitives don't
need to address the stack for arithmetic and logic operations. Ngaro compensates for this through
static stack caching with two registers (2r), with the possibility to address the two cache
registers for the first and second stack element directly:

  LI_IAC  value
  LI_IOP  value

These two instructions load the TOS and NOS cache registers (IAC and IOP) directly with their
immediate parameters.
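A sketch of static two-register stack caching, assuming IAC always holds TOS and IOP always holds NOS, with deeper elements in memory. struct vm and the functions lit, add, li_iac and li_iop are hypothetical names, and LI_IAC/LI_IOP are assumed to overwrite their register rather than push. Pushing a value shifts the cache and spills NOS to memory, whereas LI_IAC and LI_IOP write a register without any memory access.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: both registers always count as occupied (static
 * cache); only elements below NOS are kept in memory. */
struct vm {
    int32_t iac, iop;        /* cached TOS and NOS */
    int32_t mem[64];         /* elements below NOS */
    int depth;               /* number of elements in mem */
};

/* LIT: pushing shifts the cache, so NOS spills to memory. */
void lit(struct vm *v, int32_t x)
{
    assert(v->depth < 64);
    v->mem[v->depth++] = v->iop;
    v->iop = v->iac;
    v->iac = x;
}

/* ADD: computed from the two registers; memory is read only to refill NOS. */
void add(struct vm *v)
{
    assert(v->depth > 0);
    v->iac += v->iop;
    v->iop = v->mem[--v->depth];
}

/* LI_IAC / LI_IOP: overwrite a cache register with the immediate, without
 * shifting the cache and without touching memory. */
void li_iac(struct vm *v, int32_t x) { v->iac = x; }
void li_iop(struct vm *v, int32_t x) { v->iop = x; }
```

Starting from a zeroed struct vm, lit 40, lit 2 and add leave 42 in IAC with one spilled cell in memory; a following li_iac and li_iop change only the registers.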

  RADD
  RSUB
  RMUL
  RAND  value
  ROR   value
  RXOR  value

Arithmetic and logic operations: RADD, RSUB and RMUL take their operand from IOP, while RAND, ROR
and RXOR take it from their immediate parameter; the result is stored in IAC.
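A sketch of these register operations under the reading above. The names are hypothetical; RSUB is assumed to compute IAC - IOP (the operand order is not stated here), and ROR is read as bitwise OR to match the other logic operations, not as rotate right.

```c
#include <stdint.h>

/* Hypothetical model of the two cache registers. */
struct regs {
    int32_t iac;   /* accumulator, caches TOS; receives every result */
    int32_t iop;   /* operand register, caches NOS */
};

/* RADD, RSUB, RMUL: operand taken from IOP (RSUB order is an assumption). */
void r_add(struct regs *r) { r->iac += r->iop; }
void r_sub(struct regs *r) { r->iac -= r->iop; }
void r_mul(struct regs *r) { r->iac *= r->iop; }

/* RAND, ROR, RXOR: operand is the immediate parameter. */
void r_and(struct regs *r, int32_t value) { r->iac &= value; }
void r_or(struct regs *r, int32_t value)  { r->iac |= value; }
void r_xor(struct regs *r, int32_t value) { r->iac ^= value; }
```

For instance, LI_IAC 6, LI_IOP 7, RMUL leaves 42 in IAC, and a following RAND 15 leaves 10, without any access to the memory stack.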