assets/developer-notes/stephanie-gawroriski/2016/04/20.mkd

   1 # 2016/04/20
   2
   3 ## 08:57
   4
   5 So for the single pass compilation, I will also need an existing run-time state
   6 so that final variables may potentially be optimized globally and so that I
   7 actually get global optimizations. I can have the interpreter be similar.
   8 When compiling the memory space will be similar to an actual running system
   9 which essentially sees everything as a bunch of integers. The SSA operations
  10 would essentially be the required operations so that it is easily translated
  11 to the target system.
  12
  13 ## 09:16
  14
  15 I suppose for native systems, the SSA will need personalities. For example if
  16 I target a 16-bit CPU it will not have 32-bit add operations so they will need
  17 to be split up if specific values need to be calculated. That is, I will need
  18 to keep a potential value store. If an if is done and a value is say below
  19 100 but greater than zero and a value of 7 is added, I can just keep it as
  20 a 8-bit add instead of requiring a full 32-bits. This would be better for
  21 8-bit and 16-bit systems which I do plan to target. Otherwise even simple math
  22 would take lots of memory and far too many cycles to complete. The personality
  23 will provide details such as the native integer size, if it can do Java
  24 compatible floating point, the memory model (flat, segmented, split data/code,
  25 and banked).
  26
  27 ## 09:40
  28
  29 With all of the stack operations and such, and copies I can have a basic high
  30 speed to SSA first which is rather ugly. Then once all of it is loaded in then
  31 optimization can be performed to generate a new program which is faster and
  32 potentially smaller. For the program state, instead of having what I have
  33 before with a set of variables with values I can have instead a global register
  34 file which operations perform work on. So instead of operations having state
  35 such as "variable 2 is an int with value 42." the operations would at on
  36 unique variables given by their IDs (so for example `add $17 = $14 + $12`).
  37 Then in the register file there would be slots for 17, 14, and 12. Another
  38 possible kind of SSA thing would be to only care about side effects. The
  39 register file will contain operations and their derived values from other
  40 registers. Then if a method returns variable 12 then it goes backwards and
  41 performs the required operations to compute the given value. However this would
  42 be easily done if there were no method calls for example. A bunch of methods
  43 could be called which change the state of a global field which to the method
  44 itself it never set in that method, but the one that is called. One major
  45 hurdle is getting the SSA representation to be clean and the ability for
  46 optimization to update variable data and such. At least with a register file
  47 all of the used variables would be in a single location. Each variable could
  48 have direct dependencies. If an operation adds 14 and 12 and puts it into 17
  49 then the variable 17 will have a dependency on that operation.
  50
  51 ## 09:50
  52
  53 However, even with copy operations in the SSA code, the native code generation
  54 pass which takes input SSA could determine if any actual work was performed.
  55
  56 ## 10:14
  57
  58 However, a very limited system such as ancient 16-bit systems with about only
  59 64K of RAM will be unlikely to load the entire program into SSA form, optimize
  60 it, then generate native code very fast. So, I propose instead a straight
  61 generation pass from Java byte code to machine code with stack caching in a
  62 linear sense. However, in specific cases I cannot cache some items because they
  63 could for example be the start of a for loop. However, if I do not scan the
  64 method before hand for jumps going back I can do a rewind. Since everything is
  65 done in a single pass so to speak, if there is a goto into the back and it goes
  66 for example to the start of a loop then I can just drop everything I performed
  67 and flag a specific state. This would only be handled for a single operation
  68 however. Once its jump back point has been calculated it is dropped and not
  69 done again. However any jump backs will require recalculation. The next thing
  70 to consider is jump forwards (such as `tableswitch` and such). For future
  71 operations I can just set a marker to say that there is a given state. I will
  72 need to store the variable types so that verification is possible and the
  73 stack map can be used. I will do the same thing as before by using a linear
  74 passing of the stack map. However, jump backs which revert state will also
  75 force the stack map to go back also so that it has to be handled again. So if
  76 a jump back occurs, then all known state following the target instruction is
  77 dropped as if it never occured except for that flagging bit. Then for basic
  78 optimization of values instead of having value ranges I can have a sign
  79 extended mask. That is the highest bit of the mask represents the sign bit so
  80 a value of `127` will be `0xFF` along with `-127` using the same even though
  81 `127` is `0x7F`, the extra bit set will represent the sign bit to extend. This
  82 would only work for integers however. Then to remove multiple null checks I
  83 can have a null flagging bit which says that something is known to be null or
  84 not null. For arrays I can have a specifier which gives the array length. For
  85 example, an unknown array would have length `n`. If a request is made to access
  86 index 8 in the array, then a check is made against the size. Following that
  87 access or a similar check against the length of an array will result in the
  88 array having the indices up 8 values be unchecked. The important optimization
  89 would be removing array bounds checks and null pointer checks since those can
  90 be costly. So an operation which performs lots of work on a single array would
  91 eventually get all of its checks removed depending on how the method works.
  92
  93 ## 10:30
  94
  95 The next thing would be exceptions. I will do as I thought of before, have the
  96 exceptions be ran first so that the tables are generated at the start. I can
  97 also drop some temporaries because when it comes to exceptions anything that is
  98 on the stack is destroyed. The only thing on the stack is the exception which
  99 is to be handled. So as before, a dedicated register and an index which
 100 specifies the exception table entry which should be handled. With that, there
 101 would only be a need to have a single lookup for exception handling. One
 102 thing however I need to determine when exceptions and gotos exist is block
 103 barriers. Say an exception handler is between a jump forward and another kind
 104 of block and there is no potential jump in. A jump back should not remove the
 105 exception handler code at all. However, it is possible for an exception handler
 106 to jump back into the normal program control flow (an exception handler in a
 107 for loop for example). So exceptions complicate this. So if an exception
 108 performs a goto or similar then it will perform a jump back and the exception
 109 potentially rehandled since it could modify the state of the program. Then
 110 as before, jump forwards would set a future flag that an instruction later is
 111 a jump target. So the most important thing is reading the exception table first
 112 to determine the potential future jumps since past jumps would require
 113 recalculation.
 114
 115 ## 11:34
 116
 117 Took a walk, have some ideas. Instead of `CFMethod` returning an `InputStream`
 118 of a byte array containing the code attribute it instead has a
 119 `CFCodeAttribute` which can provide an `InputStream` for `CPProgram` usage.
 120 It would also handle exceptions. This way when I write the processor I do not
 121 have to have another copy of the all of the data such as exceptions, the byte
 122 code, and the attributes. The class will just provide a view of the class.
 123
 124 Then something similar could be done with the deflate processor. I can have the
 125 sliding window provide the `InputStream` data and it would support limited
 126 mark and such. However, it could only be done once. So instead of an output
 127 queue, it would just use the window directly. I also thought about having a
 128 system property which can set the maximum size of the sliding window. So for
 129 example on very limited systems with barely any memory, a sliding window of
 130 32KiB can be very large and not all of it might be needed at all.
 131
 132 ## 15:56
 133
 134 So I will need an output program which contains the parsing state which is set
 135 by an input program. The input program parser will read and handle the byte
 136 codes. Actually one property of the byte code is that all operations must have
 137 the same variable type state. So jump backs, if their value states differ then
 138 they have to be calculated. However types which are removed do nothing at all.
 139
 140 ## 16:05
 141
 142 For simplicity the stack caching idea would have a fast code generator although
 143 it would not produce the fastest code. So basically only local variables would
 144 count for the most part. The stack operations would really do nothing. The
 145 locals stay the same during exception handlers, so they are assigned actual
 146 registers. The stack entries would just be virtual temporaries which are
 147 assigned no registers. So essentially local variables will be permanent. When
 148 a local variable is written to in the byte code the stack is consumed and the
 149 code to be generated is executed. Other points such as returning values and
 150 loading and storing from fields and arrays, and would trigger events to occur.
 151 So essentially there would be synchronization events. Stack entries would be
 152 pure virtual. If there is an event where a method is called, then its return
 153 value would be placed in a real register (since it has to go somewhere). That
 154 variable will be like a local except that it is dropped during exception
 155 handlers. So I would flow through the program quite normally, and in the case
 156 of these synchronizations, depending on the operations I would not need to
 157 redo operations if I jump back unless a synchronization occurs. I do however
 158 have to consider that there could be an entry on the stack which has a
 159 different value when it is jumped to. So jump targets which are not exception
 160 handlers would have to have all their stack entries on a barrier of sorts. So
 161 that the stack entries become temporary real registers. One of the good things
 162 is that I never have to worry about `jsr` and `ret` since no code will ever
 163 have those instructions.
 164
 165 ## 16:16
 166
 167 So I may have to go through all the instructions to determine jump targets. If
 168 an instruction has stack entries then I will do an implicit generate of the
 169 entries and real register allocation on them. There would just be real
 170 registers contianing the actual values and the calculations dropped so that
 171 they just become values. Then normal virtual stack registers would be used
 172 while the temporary real ones will just contain some values. Then when that
 173 jump back point is reached, all the input values will be calculated between
 174 the slices so that it can jump back. Doing it this way, although not as
 175 optimized will be faster and simpler to implement. So I only perform stack
 176 caching when an instruction is not a target of a jump. If I jump to an
 177 instruction in the future, it would be the same. There would be temporary
 178 register sets. Thus, the basic blocks of a method will start with the
 179 instruction which is the start of an exception handler, the method, or the
 180 target of a jump. The end of basic blocks would be return values, method
 181 calls, load/store fields/arrays, jumps to a target, or instructions which have
 182 jumps to it. Each basic block has its own set of stack temporaries if
 183 required. At the end of a basic block, all values are synchronized into
 184 these temporaries. During the transition of one block to another, temporaries
 185 may be moved and shuffled as needed so the values fit.
 186
 187 I should note that as before, locals are always up to date regardless of
 188 whether they are just set, and never used again.
 189
 190 ## 16:25
 191
 192 For basic blocks though, should I end a block at remote call or should it end
 193 on the instruction before it? Before a method is called however, registers need
 194 to be saved and the stack and registers setup so that arguments can be passed
 195 to the given method. If I end before it, I would have all of the stack items
 196 saved accordingly since they are all popped from the stack. That would be good
 197 because then a method call or similar would start a basic block and it would
 198 have none of the stack temporaries used as method arguments in it. Then the
 199 same thing can be done for field get and set, throwing, and returning of
 200 values (and some others). So operations such as `getfield` will be the start
 201 of a basic block. The same would be done for all the synchronizing operations.
 202
 203 ## 16:31
 204
 205 This would not produce the best code, however it would be simple to perform and
 206 stuff such as method invocation would be simplified because methods being at
 207 the start would mean that all stack temporaries are synced. So after setting up
 208 the exception jump table, I determine the bounds of basic blocks. If there are
 209 two method calls in a row or similar then they would be in their own basic
 210 block since there can only be a method call at the start of the block.
 211
 212 ## 17:46
 213
 214 To start the virtual machine faster, there can be an initial program state
 215 which has already been calculated in memory. For example, all of the static
 216 variables and instances would for the most part already exist. This could be
 217 a bonus on systems with a small amount of memory where initialization and
 218 loading of classes could be quite filling.
 219
 220 ## 17:58
 221
 222 If I can get the run-time to run in under 8KiB then it could run on an NES.
 223 This would be an achievement. Including the 8KiB save chip, the NES comes with
 224 2KiB of space. So there is about 10KiB of RAM available. With the right ROM
 225 size and enough bank switching, it would be possible to have a system on a
 226 large ROM. However I would want it to be as close as possible to 2KiB so that
 227 it does not require "pirate" ROM carts.
 228
 229 ## 18:18
 230
 231 One thing though, are conditionals at the bounds of a basic block?
 232
 233 ## 18:39
 234
 235 For JIT simplicity, the entire program on start will either be compiled
 236 completely at the start and potentially cached or just compiled all at once.
 237 This would permit for some whole program optimization regarding fields. If
 238 say a final field with an int contains the value `42`, a `getstatic` or
 239 `getfield` of that field would just result in `42` with no actual memory
 240 operation required.
 241
 242 ## 18:45
 243
 244 The compiler's class library lookup will need a separated detail. I will
 245 require this so that I do need to actually have to depend on
 246 `javame-class-file` since for example the JIT when cross compiling will use
 247 class files, however on the target system when it is running all of the classes
 248 in the main library are built into the executable. Thus, they cannot be in
 249 class file (they can but that would waste much space).
 250
 251 ## 19:01
 252
 253 Having a bridge over the class file code will permit as stated before. Although
 254 there would be some duplication it would be better than hacking the class file
 255 code to support built in classes, including the class files in the ROM, or
 256 including the CLDC JARs somewhere.
 257