What is Just-in-Time Compilation?
=================================
Just-in-Time compilation (JIT) is the process of turning some form of
interpreted program evaluation into a native program, and doing so at
runtime.
For example, instead of using a facility that can evaluate arbitrary
SQL expressions to evaluate an SQL predicate like WHERE a.col = 3, it
is possible to generate a function that can be natively executed by
the CPU that just handles that expression, yielding a speedup.
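As a sketch (all names here are invented for illustration, not
PostgreSQL code), the difference between generic evaluation and a
function specialized for WHERE a.col = 3 looks roughly like this:

```c
#include <stdbool.h>

/* Toy expression node: one operator comparing a column to a constant. */
typedef struct ToyExpr
{
    int op;             /* 0 = equality, the only operator here */
    int constval;       /* the constant being compared against */
} ToyExpr;

/* A generic interpreter must inspect the expression on every call. */
static bool
eval_generic(const ToyExpr *expr, int colval)
{
    switch (expr->op)
    {
        case 0:
            return colval == expr->constval;
        default:
            return false;
    }
}

/*
 * What a JIT-generated function for "WHERE a.col = 3" effectively
 * boils down to: the operator dispatch and the constant lookup have
 * been folded away at code-generation time.
 */
static bool
eval_col_eq_3(int colval)
{
    return colval == 3;
}
```

Both functions compute the same result; the specialized one simply has
nothing left to branch or indirect on.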
This is JIT, rather than ahead-of-time (AOT) compilation, because it
is done at query execution time, and perhaps only in cases where the
relevant task is repeated a number of times. Given the way JIT
compilation is used in PostgreSQL, the lines between interpretation,
AOT and JIT are somewhat blurry.
Note that the interpreted program turned into a native program does
not necessarily have to be a program in the classical sense. E.g. it
is highly beneficial to JIT compile tuple deforming into a native
function just handling a specific type of table, despite tuple
deforming not commonly being understood as a "program".
Why JIT?
========

Parts of PostgreSQL are commonly bottlenecked by comparatively small
pieces of CPU intensive code. In a number of cases that is because the
relevant code has to be very generic (e.g. handling arbitrary SQL
level expressions, over arbitrary tables, with arbitrary extensions
installed). This often leads to a large number of indirect jumps and
unpredictable branches, and generally a high number of instructions
for a given task. E.g. just evaluating an expression comparing a
column in a database to an integer ends up needing several hundred
cycles.
By generating native code large numbers of indirect jumps can be
removed by either making them into direct branches (e.g. replacing the
indirect call to an SQL operator's implementation with a direct call
to that function), or by removing them entirely (e.g. by evaluating
the branch at compile time because the input is constant). Similarly a
lot of branches can be removed entirely, again by evaluating the
branch at compile time because the input is constant. The latter is
particularly beneficial for removing branches during tuple deforming.
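To make the tuple deforming point concrete, here is an illustrative
sketch (invented types, not PostgreSQL's actual deforming code): a
generic deformer branches on each attribute's length at runtime, while
a function specialized for a known (int4, int8) row layout is pure
straight-line code.

```c
#include <stdint.h>
#include <string.h>

typedef struct ToyAttr
{
    int attlen;                 /* byte length of the column's type */
} ToyAttr;

/* Generic deforming: a data-dependent branch for every column. */
static void
deform_generic(const char *tup, const ToyAttr *attrs, int natts,
               int64_t *values)
{
    size_t off = 0;

    for (int i = 0; i < natts; i++)
    {
        switch (attrs[i].attlen)        /* unpredictable branch */
        {
            case 4:
            {
                int32_t v;

                memcpy(&v, tup + off, 4);
                values[i] = v;
                off += 4;
                break;
            }
            case 8:
            {
                int64_t v;

                memcpy(&v, tup + off, 8);
                values[i] = v;
                off += 8;
                break;
            }
        }
    }
}

/*
 * Specialized for a known (int4, int8) layout: the offsets are
 * compile-time constants and all branches are gone.
 */
static void
deform_int4_int8(const char *tup, int64_t *values)
{
    int32_t v0;
    int64_t v1;

    memcpy(&v0, tup, 4);
    memcpy(&v1, tup + 4, 8);
    values[0] = v0;
    values[1] = v1;
}
```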
How to JIT
==========

PostgreSQL, by default, uses LLVM to perform JIT. LLVM was chosen
because it is developed by several large corporations and therefore
unlikely to be discontinued, because it has a license compatible with
PostgreSQL, and because its IR can be generated from C using the Clang
compiler.
Shared Library Separation
-------------------------
To avoid the main PostgreSQL binary directly depending on LLVM, which
would prevent LLVM support being independently installed by OS package
managers, the LLVM dependent code is located in a shared library that
is loaded on demand.
An additional benefit of doing so is that it is relatively easy to
evaluate JIT compilation that does not use LLVM, by changing out the
shared library used to provide JIT compilation.
To achieve this, code intending to perform JIT (e.g. expression evaluation)
calls an LLVM independent wrapper located in jit.c to do so. If the
shared library providing JIT support can be loaded (i.e. PostgreSQL was
compiled with LLVM support and the shared library is installed), the task
of JIT compiling an expression gets handed off to the shared library. This
obviously requires that the function in jit.c is allowed to fail in case
no JIT provider can be loaded.
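In miniature, the wrapper pattern looks like the following sketch (the
names are invented; the real interface lives in jit.c): the
provider-independent entry point reports failure when no provider
could be loaded, and the caller simply continues with interpretation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Callbacks a JIT provider fills in when its shared library loads. */
typedef struct ToyJitCallbacks
{
    bool (*compile_expr)(const char *expr);
} ToyJitCallbacks;

/* NULL until a provider shared library has been loaded successfully. */
static ToyJitCallbacks *loaded_provider = NULL;

/* Stand-in for a provider's implementation (e.g. an LLVM based one). */
static bool
toy_provider_compile(const char *expr)
{
    (void) expr;
    return true;                /* pretend code generation succeeded */
}

/*
 * Provider-independent wrapper: allowed to fail, in which case the
 * caller falls back to interpreted execution.
 */
static bool
toy_jit_compile_expr(const char *expr)
{
    if (loaded_provider == NULL)
        return false;
    return loaded_provider->compile_expr(expr);
}
```

Returning false here means "keep interpreting", not an error, which is
what allows JIT support to remain an optional, separately installed
component.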
Which shared library is loaded is determined by the jit_provider GUC,
defaulting to "llvmjit".
Cloistering code performing JIT into a shared library unfortunately
also means that code doing JIT compilation for various parts of code
has to be located separately from the code doing so without
JIT. E.g. the JIT version of execExprInterp.c is located in jit/llvm/
rather than executor/.
JIT Context Usage
-----------------

For performance and convenience reasons it is useful to allow JITed
functions to be emitted and deallocated together. It is e.g. very
common to create a number of functions at query initialization time,
use them during query execution, and then deallocate all of them
together at the end of the query.
Lifetimes of JITed functions are managed via JITContext. Exactly one
such context should be created for work in which all created JITed
functions should have the same lifetime. E.g. there's exactly one
JITContext for each query executed, in the query's EState. Only the
release of a JITContext is exposed to the provider independent
facility, as the creation of one is done on-demand by the JIT
providers.
Emitting individual functions separately is more expensive than
emitting several functions at once, and emitting them together can
provide additional optimization opportunities. To facilitate that, the
LLVM provider separates defining functions from optimizing and
emitting functions in an executable manner.
Creating functions into the current mutable module (a module
essentially is LLVM's equivalent of a translation unit in C) is done
via
extern LLVMModuleRef llvm_mutable_module(LLVMJitContext *context);
in which it then can emit as much code using the LLVM APIs as it
wants. Whenever a function actually needs to be called
extern void *llvm_get_function(LLVMJitContext *context, const char *funcname);
returns a pointer to it.
E.g. in the expression evaluation case this setup allows most
functions in a query to be emitted during ExecInitNode(), delaying
function emission to the time a function is actually used for the
first time.
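The batching this enables can be sketched with a toy analog (all names
invented): definitions accumulate in a pending "module", and the first
lookup triggers a single emission step covering all of them.

```c
#include <string.h>

#define TOY_MAX_FUNCS 8

typedef int (*toy_fn)(int);

typedef struct ToyModule
{
    const char *names[TOY_MAX_FUNCS];
    toy_fn funcs[TOY_MAX_FUNCS];
    int nfuncs;
    int emitted;        /* has the pending module been "compiled"? */
    int emit_count;     /* how many emission passes have run */
} ToyModule;

/* Analogous to defining a function into the current mutable module. */
static void
toy_define_function(ToyModule *mod, const char *name, toy_fn fn)
{
    mod->names[mod->nfuncs] = name;
    mod->funcs[mod->nfuncs] = fn;
    mod->nfuncs++;
    mod->emitted = 0;           /* new definitions reopen the module */
}

/* Analogous to llvm_get_function(): emits pending code on first use. */
static toy_fn
toy_get_function(ToyModule *mod, const char *name)
{
    if (!mod->emitted)
    {
        /* emit all pending functions in one, cheaper, batch */
        mod->emitted = 1;
        mod->emit_count++;
    }
    for (int i = 0; i < mod->nfuncs; i++)
    {
        if (strcmp(mod->names[i], name) == 0)
            return mod->funcs[i];
    }
    return NULL;
}

static int toy_double(int x) { return 2 * x; }
static int toy_incr(int x)  { return x + 1; }
```

Defining several functions and then looking one up runs emission once;
subsequent lookups reuse the already-emitted batch.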
Error Handling
--------------

There are two aspects of error handling. Firstly, generated (LLVM IR)
and emitted functions (mmap()ed segments) need to be cleaned up both
after a successful query execution and after an error. This is done by
registering each created JITContext with the current resource owner,
and cleaning it up on error / end of transaction. If it is desirable
to release resources earlier, jit_release_context() can be used.
The second, less pretty, aspect of error handling is OOM handling
inside LLVM itself. The above resowner based mechanism takes care of
cleaning up emitted code upon ERROR, but there's also the chance that
LLVM itself runs out of memory. LLVM by default does *not* use any C++
exceptions. Its allocations are primarily funneled through the
standard "new" handlers, and some direct use of malloc() and
mmap(). For the former a 'new handler' exists:
http://en.cppreference.com/w/cpp/memory/new/set_new_handler
For the latter LLVM provides callbacks that get called upon failure
(unfortunately mmap() failures are treated as fatal rather than OOM errors).
What we've chosen to do for now is have two functions that LLVM using
code has to use:
extern void llvm_enter_fatal_on_oom(void);
extern void llvm_leave_fatal_on_oom(void);
before interacting with LLVM code.
When a libstdc++ new or LLVM error occurs, the handlers set up by the
above functions trigger a FATAL error. We have to use FATAL rather
than ERROR, as we *cannot* reliably throw ERROR inside a foreign
library without risking corrupting its internal state.
Users of the above sections do *not* have to use PG_TRY/CATCH blocks;
the handlers are instead reset at the toplevel sigsetjmp() level.
Using a relatively small enter/leave protected section of code, rather
than setting up these handlers globally, avoids negative interactions
with extensions that might use C++ such as PostGIS. As LLVM code
generation should never execute arbitrary code, just setting these
handlers temporarily ought to suffice.
Type Synchronization
--------------------

To be able to generate code that can perform tasks done by "interpreted"
PostgreSQL, it obviously is required that code generation knows about at
least a few PostgreSQL types. While it is possible to inform LLVM about
type definitions by recreating them manually in C code, that is failure
prone and labor intensive.
Instead there is one small file (llvmjit_types.c) which references each of
the types required for JITing. That file is translated to bitcode at
compile time, and loaded when LLVM is initialized in a backend.
That works very well to synchronize the type definition, but unfortunately
it does *not* synchronize offsets as the IR level representation doesn't
know field names. Instead, required offsets are maintained as defines in
the original struct definition, like so:
#define FIELDNO_TUPLETABLESLOT_NVALID 9
        int             tts_nvalid;     /* # of valid values in tts_values */
While that still needs to be defined, it's only required for a
relatively small number of fields, and it's bunched together with the
struct definition, so it's easily kept synchronized.
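The convention can be illustrated with an invented struct (this is not
the real TupleTableSlot, and the field numbers are made up): the
define records the field's index, which code generation uses to
address the field in LLVM's copy of the type, and keeping it adjacent
to the field makes drift obvious when the struct changes.

```c
#include <stddef.h>

/* Invented example struct following the FIELDNO convention. */
typedef struct ToySlot
{
    int     tts_flags;      /* field 0 */
    short   tts_nvalid;     /* field 1 */
#define FIELDNO_TOYSLOT_NVALID 1
    void   *tts_values;     /* field 2 */
} ToySlot;

/*
 * Conceptually what generated code ends up hard-wiring: a mapping
 * from field number to location. If the define and the struct drift
 * apart, accesses through the field number silently go wrong.
 */
static const size_t toyslot_offsets[] = {
    offsetof(ToySlot, tts_flags),
    offsetof(ToySlot, tts_nvalid),
    offsetof(ToySlot, tts_values),
};
```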
Inlining
--------

One big advantage of JITing expressions is that it can significantly
reduce the overhead of PostgreSQL's extensible function/operator
mechanism, by inlining the body of called functions/operators.
It obviously is undesirable to maintain a second implementation of
commonly used functions, just for inlining purposes. Instead we take
advantage of the fact that the Clang compiler can emit LLVM IR.
The ability to do so allows us to get the LLVM IR for all operators
(e.g. int8eq, float8pl etc), without maintaining two copies. These
bitcode files get installed into the server's
$pkglibdir/bitcode/postgres/
Using existing LLVM functionality (for parallel LTO compilation), an
index over these is additionally stored to
$pkglibdir/bitcode/postgres.index.bc
Similarly extensions can install code into
$pkglibdir/bitcode/[extension]/
and
$pkglibdir/bitcode/[extension].index.bc
just alongside the actual library. An extension's index will be used
to look up symbols when located in the corresponding shared
library. Symbols that are used inside the extension, when inlined,
will be first looked up in the main binary and then the extension's.
Caching
-------

Currently it is not yet possible to cache generated functions, even
though that'd be desirable from a performance point of view. The
problem is that the generated functions commonly contain pointers into
per-execution memory. The expression evaluation machinery needs to
be redesigned a bit to avoid that. Basically all per-execution memory
needs to be referenced as an offset to one block of memory stored in
an ExprState, rather than absolute pointers into memory.
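The difference can be sketched as follows (invented names; not the
actual ExprState layout): a function that bakes in absolute pointers
is welded to one execution's memory, while one that addresses
everything as offsets from a base pointer passed at call time could be
shared, and therefore cached, across executions.

```c
#include <stdint.h>
#include <string.h>

/* Fixed offsets into a per-execution state block, known at "compile"
 * time; only the block's base address varies between executions. */
#define TOY_OFF_RESULT 0
#define TOY_OFF_INPUT  8

/*
 * Shape of a cacheable "compiled" function: all per-execution memory
 * is reached relative to the base pointer, no absolute addresses.
 */
static void
toy_compiled_eval(char *base)
{
    int64_t input;
    int64_t result;

    memcpy(&input, base + TOY_OFF_INPUT, sizeof(input));
    result = input + 1;
    memcpy(base + TOY_OFF_RESULT, &result, sizeof(result));
}
```

One such function can then serve any number of per-execution blocks,
which is exactly what would make an IR-keyed cache possible.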
Once that is addressed, adding an LRU cache that's keyed by the
generated LLVM IR will allow the usage of optimized functions even for
faster queries.
A longer term project is to move expression compilation to the planner
stage, allowing e.g. to tie compiled expressions to prepared
statements.
An even more advanced approach would be to use JIT with few
optimizations initially, and build an optimized version in the
background. But that's even further off.
What to JIT
===========

Currently expression evaluation and tuple deforming are JITed. Those
were chosen because they commonly are major CPU bottlenecks in
analytics queries, but are by no means the only potentially beneficial cases.
For JITing to be beneficial a piece of code first and foremost has to
be a CPU bottleneck. But also importantly, JITing can only be
beneficial if overhead can be removed by doing so. E.g. in the tuple
deforming case the knowledge about the number of columns and their
types can remove a significant number of branches, and in the
expression evaluation case a lot of indirect jumps/calls can be
removed. If neither of these is the case, JITing is a waste of
resources.
Future avenues for JITing are tuple sorting, COPY parsing/output
generation, and later compiling larger parts of queries.
When to JIT
===========

Currently there are a number of GUCs that influence JITing:
- jit_above_cost = -1, 0-DBL_MAX - all queries with a higher total cost
  get JITed, *without* optimization (expensive part), corresponding to
  -O0. This commonly already results in significant speedups if
  expression/deforming is a bottleneck (removing dynamic branches
  mostly).
- jit_optimize_above_cost = -1, 0-DBL_MAX - all queries with a higher
  total cost get JITed, *with* optimization (expensive part).
- jit_inline_above_cost = -1, 0-DBL_MAX - inlining is tried if query has
  higher cost.
Whenever a query's total cost is above these limits, JITing is
performed.
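For concreteness, a postgresql.conf fragment exercising these settings
might look as follows (the threshold values are illustrative, not
recommendations):

```
# master switch; JIT additionally requires build-time support
jit = on

# JIT (without the expensive optimization passes) above this cost
jit_above_cost = 100000

# additionally run the optimization passes above this cost
jit_optimize_above_cost = 500000

# attempt inlining of functions/operators above this cost
jit_inline_above_cost = 500000
```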
Alternative costing models, e.g. by generating separate paths for
parts of a query with lower cpu_* costs, are also a possibility, but
it's doubtful the benefit would outweigh the overhead of doing so.
Another alternative would be to count the number of times individual
expressions are estimated to be evaluated, and perform JITing of these
individual expressions.
The obvious-seeming approach of JITing expressions individually after
a number of executions turns out not to work too well. Primarily
because emitting many small functions individually has significant
overhead. Secondarily because the time until JITing occurs causes
relative slowdowns that eat into the gain of JIT compilation.