                A new JIT compiler for the Mono Project

        Miguel de Icaza (miguel@{ximian.com,gnome.org}),
        Paolo Molaro (lupus@{ximian.com,debian.org})

Mini is a new compilation engine for the Mono runtime.  The new
engine is designed to bring new code generation optimizations,
portability and pre-compilation.

In this document we describe the design decisions and the
architecture of the new compilation engine.

Mono is an Open Source implementation of the .NET Framework: it is
made up of a runtime engine that implements the ECMA Common Language
Infrastructure (CLI), a set of compilers that target the CLI, and a
large collection of class libraries.

This article discusses the new code generation facilities that have
been added to the Mono runtime.

First we discuss the overall architecture of the Mono runtime and how
code generation fits into it; then we discuss the development and
basic architecture of our first JIT compiler for the ECMA CIL
framework.  The next section covers the objectives for the work on
the new JIT compiler, then we discuss the new features available in
the new JIT compiler, and finally we give a technical description of
the new code generation engine.

* Architecture of the Mono Runtime

The Mono runtime is an implementation of the ECMA Common Language
Infrastructure (CLI), whose aim is to be a common platform for
executing code in multiple languages.

Languages that target the CLI generate images that contain code in a
high-level intermediate representation called the "Common
Intermediate Language" (CIL).  This intermediate language is rich
enough to allow programs and pre-compiled libraries to be reflected.
The execution model is object oriented, with single inheritance and
multiple interface implementations.

The runtime provides a number of services for programs that are
targeted to it: Just-in-Time compilation of CIL code into native
code, garbage collection, thread management, I/O routines, single,
double and decimal floating point, asynchronous method invocation,
application domains, and a framework for building arbitrary RPC
systems (remoting) and integration with system libraries through the
Platform Invoke functionality.

The focus of this document is on the services provided by the Mono
runtime to transform CIL bytecodes into code that is native to the
underlying architecture.

The code generation interface is a set of macros that allow a C
programmer to generate code on the fly; these macros are found in the
mono/jit/arch/ directory and are used by the JIT compiler to generate
native code.

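As an illustration, here is a minimal sketch of how such macros can
be used to emit and run a tiny function.  It assumes the x86 macro
set (x86_mov_reg_imm, x86_ret, X86_EAX); the include path, the mmap
buffer and the error handling are details of this example, not of the
runtime itself.

    /* Emit "mov $42, %eax; ret" into an executable buffer and call it. */
    #include <stdio.h>
    #include <sys/mman.h>
    #include "x86/x86-codegen.h"

    int
    main (void)
    {
            unsigned char *buf, *code;
            int (*fn) (void);

            /* allocate a small executable buffer for the generated code */
            buf = mmap (NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (buf == MAP_FAILED)
                    return 1;
            code = buf;

            x86_mov_reg_imm (code, X86_EAX, 42);    /* mov $42, %eax */
            x86_ret (code);                         /* ret */

            fn = (int (*) (void)) buf;
            printf ("the generated code returned %d\n", fn ());
            return 0;
    }
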
The platform invocation code is interesting: it generates CIL code on
the fly to marshal parameters, and this code is in turn processed by
the JIT engine.

* Previous Experiences

Mono has built a JIT engine, which has been used to bootstrap Mono
since January 2002.  This JIT engine has reasonable performance and
uses a tree-pattern-matching instruction selector based on the BURS
technology.  It was designed by Dietmar Maurer, Paolo Molaro and
Miguel de Icaza.

The existing JIT compiler has three phases:

  * Re-creation of the semantic tree from the CIL byte codes.

  * Instruction selection, with a cost-driven engine.

  * Code generation and register allocation.

It is also hooked into the rest of the runtime to provide services
like marshaling, just-in-time compilation and invocation of "internal
calls".

This engine constructed a collection of trees, which we referred to
as the "forest of trees"; this forest was created by "hydrating" the
CIL instruction stream.

The first step was to identify the basic blocks in the method and to
compute the control flow graph (cfg) for it.  Once this information
was computed, a stack analysis of each basic block was performed to
create a forest of trees for each one of them.

So, for example, the following statement:

       b = a + 1;

Which would be represented in CIL as:

       ldloc.1
       ldc.i4.1
       add
       stloc.2

After the stack analysis this would create the following tree:

       (STIND_I4 ADDR_L[EBX|2] (
            ADD (LDIND_I4 ADDR_L[ESI|1])
                 CONST_I4[1]))

This tree contains information from the stack analysis: for instance,
notice that the operations explicitly encode the data types they are
operating on.  There is no longer any ambiguity about the types,
because this information has been inferred.

At this point the JIT would pass the constructed forest of trees to
the architecture-dependent JIT compiler.

The architecture-dependent code then performed register allocation
(optionally using linear scan allocation for variables, based on
liveness analysis).

Once variables had been assigned, tree pattern matching with dynamic
programming was used to select instructions (the tree pattern matcher
is custom built for each architecture, using a code generator:
monoburg).  The instruction selector used cost functions to select
the best instruction patterns.

The instruction selector is able to produce instructions that take
advantage of the x86 indexed addressing modes, for example.

One problem, though, is that the code emitter and the register
allocator did not have any visibility outside the current tree, which
meant that some redundant instructions were generated.  A peephole
optimizer with this architecture was hard to write, given the
tree-based representation that was used.

This JIT was functional, but it did not provide a good architecture
on which to base future optimizations.  Also, the line between
architecture-neutral and architecture-specific code and optimizations
was hard to draw.

The JIT engine supported two code generation modes to support the two
optimization modes for applications that host multiple application
domains: generate code that will be shared across application
domains, or generate code that will not be shared across application
domains.

* Objectives of the new JIT engine.

We wanted to support a number of features that were missing:

  * Ahead-of-time compilation.

    The idea is to allow developers to pre-compile their code to
    native code to reduce startup time, and to reduce the working set
    that is used at runtime by the just-in-time compiler.

    Although in Mono this has not been a visible problem, we wanted
    to address it proactively.

    When an assembly (a Mono/.NET executable) is installed on the
    system, it would then be possible to pre-compile the code, and
    have the JIT compiler tune the generated code to the particular
    CPU on which the software is installed.

    This is done in the Microsoft .NET world with a tool called
    ngen.exe.

  * Have a good platform for doing code optimizations.

    The design called for a good architecture that would enable
    various levels of optimizations: some optimizations are better
    performed on high-level intermediate representations, some on
    medium-level and some on low-level representations.

    It should also be possible to turn these on or off
    conditionally.  Some optimizations are too expensive to be used
    in just-in-time compilation scenarios, but these expensive
    optimizations can be turned on for ahead-of-time compilations or
    when using profile-guided optimizations on a subset of the
    executed methods.

  * Reduce the effort required to port the Mono code generator to
    new architectures.

    For Mono to gain wide adoption in the Unix world, it is necessary
    that the JIT engine works on most of today's commercial hardware
    platforms.

* Features of the new JIT engine.

The new JIT engine was architected by Dietmar Maurer and Paolo
Molaro, based on the new objectives.

Mono provides a number of services to applications running with the
new JIT compiler:

  * Just-in-Time compilation of CLI code into native code.

  * Ahead-of-Time compilation of CLI code, to reduce the startup
    time of applications.

A number of software development features are also available:

  * Execution time profiling (--profile)

    Generates a report of the time consumed by each routine, the
    number of times it was invoked, and its callers.

  * Memory usage profiling (--profile)

    Generates a report of the memory usage by a program that is run
    under the Mono JIT.

  * Code coverage (--coverage)

    Reports which code paths in a program were actually executed.

People who are interested in developing and improving the Mini JIT
compiler will also find a few useful facilities:

  * Timing of the JIT compiler.

    This is used to time the execution time of the JIT itself when
    compiling a routine.

  * Control Flow Graph and Dominator Tree drawing.

    These are visual aids for the JIT developer: they render
    representations of the Control Flow Graph and, for the more
    advanced optimizations, they draw the dominator tree graph.

    This requires Dot (from the graphviz package) and Ghostview.

  * Code generator regression tests.

    The engine contains support for running regression tests on the
    virtual machine, which is very helpful to developers interested
    in improving the engine.

  * Optimization benchmark framework.

    The JIT engine will generate graphs that compare various
    benchmarks embedded in an assembly, and run the various tests
    with different optimization flags.

    This requires Perl and GD::Graph.

Flexibility is probably the most important property of the new code
generation engine: the internals are relatively easy to replace and
update, and even large passes can be replaced and implemented
differently.

Compiling a method begins with the `mini_method_to_ir' routine that
converts the CIL representation into a medium-level intermediate
representation.

The mini_method_to_ir routine performs a number of operations:

  * Flow analysis and control flow graph computation.

    Unlike the previous version, stack analysis and control flow
    graph computation are performed in a single pass in the
    mini_method_to_ir function.  This is done for performance
    reasons: although the complexity increases, the benefit for a JIT
    compiler is that there is more time available for performing
    other optimizations.

  * Basic block computation.

    mini_method_to_ir populates the MonoCompile structure with an
    array of basic blocks, each of which contains a forest of trees
    made up of MonoInst structures (a simplified sketch of this step
    appears after this list).

  * Inlining.

    Inlining is no longer restricted to methods containing a single
    basic block; instead it is possible to inline arbitrarily complex
    methods.

    The heuristics used to choose what to inline are likely going to
    be tuned in the future.

  * Method to opcode conversion.

    Some method call invocations, like `call Math.Sin', are
    transformed into an opcode: this turns the call into a
    semantically rich node, which is later inlined into an FPU
    instruction.

    Various Array method invocations are turned into opcodes as well
    (the Get, Set and Address methods).

  * Tail recursion elimination.

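The following self-contained sketch illustrates the basic block
computation step referenced above; the opcode names and structures
are invented for this illustration and are not the runtime's internal
types.

    /* Mark the instructions that start a basic block: the method entry,
     * every branch target and every instruction that follows a branch. */
    #include <stdio.h>

    enum { OP_NOP, OP_BRANCH, OP_BRANCH_COND, OP_RET };

    typedef struct {
            int opcode;
            int target;     /* branch target index, if any */
    } ToyInsn;

    static void
    mark_block_starts (const ToyInsn *code, int len, unsigned char *is_leader)
    {
            int i;

            is_leader [0] = 1;      /* the method entry starts a block */
            for (i = 0; i < len; ++i) {
                    if (code [i].opcode == OP_BRANCH || code [i].opcode == OP_BRANCH_COND) {
                            is_leader [code [i].target] = 1;  /* branch targets start blocks */
                            if (i + 1 < len)
                                    is_leader [i + 1] = 1;    /* fall-through after a branch */
                    }
            }
    }

    int
    main (void)
    {
            ToyInsn code [] = {
                    { OP_NOP, 0 }, { OP_BRANCH_COND, 4 }, { OP_NOP, 0 },
                    { OP_BRANCH, 5 }, { OP_NOP, 0 }, { OP_RET, 0 }
            };
            unsigned char leaders [6] = { 0 };
            int i;

            mark_block_starts (code, 6, leaders);
            for (i = 0; i < 6; ++i)
                    if (leaders [i])
                            printf ("basic block starts at %d\n", i);
            return 0;
    }
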
The MonoInst structure holds the actual decoded instruction, with the
semantic information from the stack analysis.  MonoInst is
interesting because initially it is part of a tree structure.  Here
is a sample of the same tree in the new JIT engine:

       (stind.i4 regoffset[0xffffffd4(%ebp)]
            (add (ldind.i4 regoffset[0xffffffd8(%ebp)])
                 iconst[1]))

This is a medium-level intermediate representation (MIR).

Some complex opcodes are decomposed at this stage into a collection
of simpler opcodes.  Not every complex opcode is decomposed at this
stage, as we need to preserve the semantic information during the
various optimization phases.

For example, a NEWARR opcode carries the length and the type of the
array, which could be used later to avoid type checks or array
bounds checks.

There are a number of operations supported on this representation:

  * Branch optimizations.

  * Loop optimizations: the dominator trees are computed, loops are
    detected, and their nesting level computed.

  * Conversion of the method into static single assignment (SSA)
    form.

  * Dead code elimination.

  * Constant propagation.

Once the above optimizations have (optionally) been performed, a
decomposition phase is used to turn some complex opcodes into
internal method calls.  In the initial version of the JIT engine,
various operations on longs are emulated instead of being inlined,
and the newarr invocation is turned into a call into the runtime.

At this point, after computing variable liveness, it is possible to
use the linear scan algorithm for allocating variables to registers.
The linear scan pass uses the information that was previously
gathered by the loop nesting and loop structure computation to favor
variables in inner loops.  This process updates the basic block
`nesting' field, which is later used during liveness analysis.

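A compact sketch of the linear scan idea follows; the interval
representation is invented for this example and is not the JIT's
internal data structure.  Intervals are visited in order of their
start point, intervals that have ended release their register, and
when no register is free the interval that ends last is spilled.

    #include <stdio.h>

    #define NUM_REGS 3

    typedef struct {
            int start, end;     /* liveness interval, in instruction indexes */
            int reg;            /* assigned register, or -1 if spilled */
    } Interval;

    /* intervals must be sorted by increasing start point */
    static void
    linear_scan (Interval *v, int n)
    {
            Interval *active [NUM_REGS];
            int nactive = 0, free_regs = (1 << NUM_REGS) - 1;
            int i, j, k;

            for (i = 0; i < n; ++i) {
                    /* expire the intervals that ended before the current start */
                    for (j = 0; j < nactive; ) {
                            if (active [j]->end < v [i].start) {
                                    free_regs |= 1 << active [j]->reg;
                                    active [j] = active [--nactive];
                            } else {
                                    ++j;
                            }
                    }
                    if (nactive == NUM_REGS) {
                            /* no register left: spill the interval that ends last */
                            Interval *spill = active [0];
                            for (j = 1, k = 0; j < nactive; ++j)
                                    if (active [j]->end > spill->end) {
                                            spill = active [j];
                                            k = j;
                                    }
                            if (spill->end > v [i].end) {
                                    v [i].reg = spill->reg;
                                    spill->reg = -1;
                                    active [k] = &v [i];
                            } else {
                                    v [i].reg = -1;
                            }
                    } else {
                            /* grab any free register */
                            for (j = 0; j < NUM_REGS; ++j)
                                    if (free_regs & (1 << j))
                                            break;
                            free_regs &= ~(1 << j);
                            v [i].reg = j;
                            active [nactive++] = &v [i];
                    }
            }
    }

    int
    main (void)
    {
            Interval v [] = { {0, 9, -1}, {1, 3, -1}, {2, 6, -1}, {4, 8, -1}, {5, 7, -1} };
            int i;

            linear_scan (v, 5);
            for (i = 0; i < 5; ++i) {
                    if (v [i].reg < 0)
                            printf ("interval [%d,%d] -> spilled\n", v [i].start, v [i].end);
                    else
                            printf ("interval [%d,%d] -> R%d\n", v [i].start, v [i].end, v [i].reg);
            }
            return 0;
    }
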
Stack space is then reserved for the local variables and for any
temporary variables generated during the various optimization phases.

** Instruction selection

At this point, the BURS instruction selector is invoked to transform
the tree-based representation into a list of instructions.  This is
done using a tree pattern matcher that is generated for the
architecture using the `monoburg' tool.

Monoburg takes as input a file that describes tree patterns; these
are matched against the trees that were produced by the engine in the
previous stages.

The pattern matching might produce more than one match for a
particular tree.  In this case, the match selected is the one whose
cost is the smallest.  A cost can be attached to each rule, and if no
cost is provided, the implicit cost is one.  Smaller costs are
selected over higher costs.

The cost function can also be used to select particular blocks of
code for a given architecture, or, by using a prohibitively high
number, to prevent a rule from matching.

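The toy sketch below shows the dynamic programming flavor of this
selection: each node is assigned the cheapest cost of reducing it to
a register, and a combined rule competes with the composition of
simpler rules.  The node types and rules are invented for the
illustration and are far simpler than the monoburg-generated
matchers.

    #include <stdio.h>

    enum { T_CONST, T_LOAD, T_ADD };

    typedef struct ToyTree {
            int op;
            struct ToyTree *left, *right;
    } ToyTree;

    /* returns the minimal cost of producing the value of `t' in a register */
    static int
    reg_cost (const ToyTree *t)
    {
            switch (t->op) {
            case T_CONST:
                    return 1;                          /* reg: CONST       cost 1 */
            case T_LOAD:
                    return reg_cost (t->left) + 1;     /* reg: LOAD(reg)   cost 1 */
            case T_ADD: {
                    /* generic rule: reg: ADD(reg, reg), cost 1 plus the children */
                    int generic = reg_cost (t->left) + reg_cost (t->right) + 1;
                    /* combined rule: reg: ADD(LOAD(reg), CONST), cost 1, which only
                     * matches when the children have the right shape */
                    if (t->left->op == T_LOAD && t->right->op == T_CONST) {
                            int combined = reg_cost (t->left->left) + 1;
                            return combined < generic ? combined : generic;
                    }
                    return generic;
            }
            }
            return 0;
    }

    int
    main (void)
    {
            ToyTree addr = { T_CONST, NULL, NULL };
            ToyTree load = { T_LOAD, &addr, NULL };
            ToyTree one  = { T_CONST, NULL, NULL };
            ToyTree add  = { T_ADD, &load, &one };

            printf ("cheapest cover of ADD(LOAD(CONST), CONST) costs %d\n", reg_cost (&add));
            return 0;
    }
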
The various rules that our JIT engine uses transform a tree of
MonoInsts into a list of MonoInsts:

   +-----------------------------------------------------------+
   |  Tree                                          List        |
   |   of    ===>  Instruction selection  ===>       of         |
   |  MonoInst                                      MonoInst    |
   +-----------------------------------------------------------+

During this process various kinds of MonoInst nodes disappear and are
turned into lower-level representations.  The JIT compiler just
happens to reuse the same structure (this is done to reduce memory
usage and improve memory locality).

The instruction selection rules are split into a number of files,
each one with a particular purpose:

    inssel.brg
        Contains the generic instruction selection rules.

    inssel-x86.brg
        Contains x86-specific rules.

    inssel-ppc.brg
        Contains PowerPC-specific rules.

    inssel-long32.brg
        Burg file for 64-bit instructions on 32-bit architectures.

    inssel-long.brg
        Burg file for 64-bit architectures.

    inssel-float.brg
        Burg file for floating point instructions.

For a given build, a set of those files is included.  For example,
the build of Mono on the x86 uses the following files:

    inssel.brg inssel-x86.brg inssel-long32.brg inssel-float.brg

** Native method generation

The native method generation has a number of steps:

  * Architecture-specific register allocation.

    The information about loop nesting that was previously gathered
    is used here to hint the register allocator.

  * Generating the method prolog/epilog.

  * Optionally generating code to introduce tracing facilities.

  * Hooking into the debugger.

  * Performing any pending fixups.

The actual code generation is contained in the architecture-specific
portion of the compiler.  The input to the code generator is each one
of the basic blocks with its list of instructions, as produced in the
instruction selection phase.

During the instruction selection phase, virtual registers are
assigned.  Just before the peephole optimization is performed,
physical registers are assigned.

A simple peephole and algebraic optimizer is run at this point.

The peephole optimizer removes some redundant operations.  This is
possible because the code generation at this point has visibility
into the basic block that spans the original trees.

The algebraic optimizer performs some simple algebraic optimizations
that replace expensive operations with cheaper ones where possible.

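Below is a small sketch of the kind of rewrites such a pass performs,
over an invented instruction list rather than the real MonoInst list:
an addition of zero is dropped and a multiplication by a power of two
becomes a shift.

    #include <stdio.h>

    enum { OP_ADD_IMM, OP_MUL_IMM, OP_SHL_IMM, OP_NOP };

    typedef struct {
            int opcode;
            int dreg;       /* destination register */
            int imm;        /* immediate operand */
    } Insn;

    static void
    peephole (Insn *code, int len)
    {
            int i;

            for (i = 0; i < len; ++i) {
                    if (code [i].opcode == OP_ADD_IMM && code [i].imm == 0) {
                            code [i].opcode = OP_NOP;       /* x + 0 == x */
                    } else if (code [i].opcode == OP_MUL_IMM && code [i].imm > 0 &&
                               (code [i].imm & (code [i].imm - 1)) == 0) {
                            int shift = 0, v = code [i].imm;

                            while (v > 1) {
                                    v >>= 1;
                                    shift++;
                            }
                            code [i].opcode = OP_SHL_IMM;   /* x * 2^k == x << k */
                            code [i].imm = shift;
                    }
            }
    }

    int
    main (void)
    {
            const char *names [] = { "add_imm", "mul_imm", "shl_imm", "nop" };
            Insn code [] = { { OP_ADD_IMM, 0, 0 }, { OP_MUL_IMM, 1, 8 } };
            int i;

            peephole (code, 2);
            for (i = 0; i < 2; ++i)
                    printf ("insn %d: %s R%d, %d\n", i, names [code [i].opcode],
                            code [i].dreg, code [i].imm);
            return 0;
    }
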
The rest of the code generation is fairly simple: a switch statement
is used to generate code for each of the MonoInsts.  This lives in
the mono/mini/mini-ARCH.c files, in a method called
"mono_arch_output_basic_block".

We always try to allocate code in sequence, instead of just using
malloc.  This way we increase spatial locality, which gives a massive
speedup on most architectures.

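A minimal sketch of the idea, assuming a single fixed-size executable
chunk (a real allocator would chain new chunks and flush instruction
caches where needed):

    #include <stddef.h>
    #include <sys/mman.h>

    #define CODE_CHUNK_SIZE (64 * 1024)

    static unsigned char *code_chunk;
    static size_t code_used;

    /* hand out code buffers sequentially from one executable region */
    static void *
    code_alloc (size_t size)
    {
            void *p;

            if (!code_chunk) {
                    code_chunk = mmap (NULL, CODE_CHUNK_SIZE,
                                       PROT_READ | PROT_WRITE | PROT_EXEC,
                                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                    if (code_chunk == MAP_FAILED)
                            return NULL;
            }
            if (code_used + size > CODE_CHUNK_SIZE)
                    return NULL;    /* a real allocator would chain a new chunk here */
            p = code_chunk + code_used;
            code_used += (size + 15) & ~((size_t) 15);  /* keep methods 16-byte aligned */
            return p;
    }
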
*** Ahead of Time compilation

Ahead-of-Time (AOT) compilation is a new feature of our new
compilation engine.  The compilation engine is shared by the
Just-in-Time (JIT) compiler and the Ahead-of-Time compiler.

The difference is in the set of optimizations that are turned on for
each mode: Just-in-Time compilation should be as fast as possible,
while Ahead-of-Time compilation can take as long as required, because
it is not done at a time-critical point.

With AOT compilation, we can afford to turn on all of the
computationally expensive optimizations.

After the code generation phase is done, the code and any required
fixup information is saved into a file that is readable by "as" (the
native assembler available on all systems).  This assembly file is
then passed to the native assembler, which generates a loadable
module.

At execution time, when an assembly is loaded from disk, the runtime
engine probes for the existence of a pre-compiled image.  If the
pre-compiled image exists, it is loaded, and method invocations are
resolved to the code contained in the loaded module.

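A hedged sketch of that probing step is shown below; the ".so" naming
convention and the use of dlopen are assumptions of this example, not
the exact scheme used by the runtime.

    #include <dlfcn.h>
    #include <stdio.h>

    /* look for a pre-compiled module next to the assembly, e.g.
     * "program.exe.so" next to "program.exe" */
    static void *
    probe_aot_image (const char *assembly_path)
    {
            char aot_path [1024];
            void *module;

            snprintf (aot_path, sizeof (aot_path), "%s.so", assembly_path);
            module = dlopen (aot_path, RTLD_LAZY);
            if (!module) {
                    printf ("no AOT image for %s, falling back to the JIT\n", assembly_path);
                    return NULL;
            }
            printf ("using pre-compiled code from %s\n", aot_path);
            return module;
    }

    int
    main (void)
    {
            probe_aot_image ("program.exe");
            return 0;
    }
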
The code generated in the AOT scenario is slightly different from the
JIT scenario: it is application-domain relative and can be shared
among multiple application domains.

This is the same code generation that is used when the runtime is
instructed to maximize code sharing on a multi-application domain
setup.

* SSA-based optimizations

SSA form simplifies many optimizations because each variable has
exactly one definition site.  This means that each variable is only
initialized once.

For example, code like this:

    a = 1
    ...
    a = 2
    call (a)

Is internally turned into:

    a1 = 1
    ...
    a2 = 2
    call (a2)

In the presence of branches, like:

    if (conditional)
        a = 1
    else
        a = 2
    call (a)

The code is turned into:

    if (conditional)
        a1 = 1
    else
        a2 = 2
    a3 = phi (a1, a2)
    call (a3)

All uses of a variable are "dominated" by its definition.

This representation is useful because it simplifies the
implementation of a number of optimizations, like conditional
constant propagation, array bounds check removal and dead code
elimination.

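As a small illustration of why the single-definition property helps,
here is a sketch of dead code elimination over an invented SSA
instruction list: a definition whose value is never used and has no
side effects can simply be dropped (a full pass would iterate, since
removing one definition can make others dead).

    #include <stdio.h>

    typedef struct {
            int dreg;               /* SSA variable defined, -1 if none */
            int sreg1, sreg2;       /* SSA variables used, -1 if unused */
            int has_side_effects;
            int dead;
    } SsaInsn;

    static void
    dead_code_elim (SsaInsn *code, int len)
    {
            int use_count [64] = { 0 };
            int i;

            /* with SSA there is a single definition per variable, so a plain
             * use count per variable is enough to find dead definitions */
            for (i = 0; i < len; ++i) {
                    if (code [i].sreg1 >= 0)
                            use_count [code [i].sreg1]++;
                    if (code [i].sreg2 >= 0)
                            use_count [code [i].sreg2]++;
            }
            for (i = 0; i < len; ++i) {
                    if (code [i].dreg >= 0 && use_count [code [i].dreg] == 0 &&
                        !code [i].has_side_effects)
                            code [i].dead = 1;      /* value never used: remove */
            }
    }

    int
    main (void)
    {
            /* a1 = 1 (never used), a2 = 2, call (a2) */
            SsaInsn code [] = {
                    {  0, -1, -1, 0, 0 },   /* a1 = 1 */
                    {  1, -1, -1, 0, 0 },   /* a2 = 2 */
                    { -1,  1, -1, 1, 0 },   /* call (a2) */
            };
            int i;

            dead_code_elim (code, 3);
            for (i = 0; i < 3; ++i)
                    printf ("insn %d: %s\n", i, code [i].dead ? "dead" : "live");
            return 0;
    }
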
* Register allocation.

Global register allocation is performed on the medium-level
intermediate representation just before instruction selection is
performed on the method.  Local register allocation is performed
later, at the basic-block level, on the lower-level representation.

Global register allocation uses the following input:

1) The set of register-sized variables that can be allocated to a
   register (this is an architecture-specific setting; for x86 these
   registers are the callee-saved registers ESI, EDI and EBX).

2) Liveness information for the variables.

3) (Optionally) loop information, to favor variables that are used in
   inner loops.

During the instruction selection phase, symbolic registers are
assigned to temporary values in expressions.

Local register allocation assigns hard registers to the symbolic
registers; it is performed just before the code is actually emitted,
and works at the basic block level.  A CPU description file describes
the input registers, output registers, fixed registers and the
registers clobbered by each operation.

* BURG Code Generator Generator

monoburg was written by Dietmar Maurer.  It is based on the papers by
Christopher W. Fraser, Robert R. Henry and Todd A. Proebsting:
"BURG - Fast Optimal Instruction Selection and Tree Parsing" and
"Engineering a Simple, Efficient Code Generator Generator".

The original BURG implementation is unable to work on DAGs; only
trees are allowed.  Our monoburg implementation is able to generate a
tree matcher which also works on DAGs, and we use this feature in the
new JIT.  This simplifies the code because we can directly pass DAGs
and don't need to convert them to trees.

* Adding IL opcodes: an exercise (from a post by Paolo Molaro)

mini.c is the file that reads the IL code stream and decides how each
single IL instruction is implemented (the mono_method_to_ir ()
function), so you always have to add an entry to the big switch
inside that function: there are plenty of examples in that file.

An IL opcode can be implemented in a number of ways, depending on
what it does and how it needs to do it.

Some opcodes are implemented using a helper function: one of the
simpler examples is the CEE_STELEM_REF implementation.

In this case the opcode implementation is written in a C function.
You will need to register the function with the JIT before you can
use it (mono_register_jit_call), and you need to emit the call to the
helper using the mono_emit_jit_icall() function.

This is the simplest way to add a new opcode and it doesn't require
any arch-specific changes (though it's limited to what you can do in
C code, and the performance may be limited by the function call).

Other opcodes can be implemented with one or more of the already
implemented low-level instructions.

An example is the OP_STRLEN opcode, which implements String.Length
using a simple load from memory.  In this case you need to add a rule
to the appropriate burg file, describing what the arguments of the
opcode are and what its 'return' value is, if any.

The OP_STRLEN case is:

    reg: OP_STRLEN (reg) {
        MONO_EMIT_LOAD_MEMBASE_OP (s, tree, OP_LOADI4_MEMBASE, state->reg1,
            state->left->reg1, G_STRUCT_OFFSET (MonoString, length));
    }

The above means: OP_STRLEN takes a register as an argument and
returns its value in a register.  The implementation is included in
the braces.

The opcode returns a value in an integer register (state->reg1) by
performing an int32 load of the length field of the MonoString
represented by the input register (state->left->reg1).  Before the
burg rules are applied, the internal representation is based on
trees, so you get the left/right pointers (state->left and
state->right respectively); the result is stored in state->reg1.

This instruction implementation doesn't require arch-specific changes
(it uses the MONO_EMIT_LOAD_MEMBASE_OP macro, which is available on
all platforms), and usually the produced code is fast.

Next we have opcodes that must be implemented with new low-level,
architecture-specific instructions (either because of performance
considerations or because the functionality can't be implemented in
other ways).

You need a burg rule in this case, too.  For example, consider the
OP_CHECK_THIS opcode (used to raise an exception if the this pointer
is null).  The burg rule simply reads:

    stmt: OP_CHECK_THIS (reg) {
        mono_bblock_add_inst (s->cbb, tree);
    }

Note that this opcode does not return a value (hence the "stmt") and
it takes a register as input.

mono_bblock_add_inst (s->cbb, tree) just adds the instruction (the
tree variable) to the current basic block (s->cbb).  In mini this is
the place where the internal representation switches from the tree
format to the low-level format (the list of simple instructions).

In this case the actual opcode implementation is delegated to the
arch-specific code.  A low-level opcode needs an entry in the machine
description (the *.md files in mini/).  This entry describes what
kind of registers, if any, are used by the instruction, as well as
other details such as constraints or other hints to the low-level
engine, which are architecture specific.

cpu-pentium.md, for example, has the following entry:

    checkthis: src1:b len:3

This means the instruction uses an integer register as a base pointer
(basically a load or store is done on it) and it takes 3 bytes of
native code to implement.

Now you just need to provide the low-level implementation for the
opcode in one of the mini-$arch.c files, in the
mono_arch_output_basic_block() function.  There is a big switch here
too.  The x86 implementation is:

    case OP_CHECK_THIS:
        /* ensure ins->sreg1 is not NULL */
        x86_alu_membase_imm (code, X86_CMP, ins->sreg1, 0, 0);
        break;

If the $arch-codegen.h header file doesn't have the code to emit the
low-level native code, you'll need to write that as well.

Complex opcodes with register constraints may require other changes
to the local register allocator, but usually they are not needed.

Profile-based optimization is something that we are very interested
in supporting.  There are two possible usage scenarios:

  * Based on the profile information gathered during the execution
    of a program, hot methods can be compiled with the highest level
    of optimizations, while bootstrap code and cold methods can be
    compiled with the smallest set of optimizations and placed in a
    discardable list.

  * Code reordering: this profile-based optimization would only make
    sense for pre-compiled code.  The profile information is used to
    re-order the assembly code on disk so that the code is laid out
    on disk in the order in which it is executed, which reduces
    paging.

    This is the same principle on which SGI's cord program operates.

The nature of the CIL allows the above optimizations to be easy to
implement and deploy.  Since we live in and define our own universe
for these things, no interactions with system tools are required, nor
are any upgrades to the underlying infrastructure.

Instruction scheduling is important for certain kinds of processors,
and some of the framework to cope with it exists today in our
register allocator and instruction selector, but it has not been
finished.  The instruction scheduling would happen at the same time
as local register allocation.