The executor processes a tree of "plan nodes". The plan tree is essentially
a demand-pull pipeline of tuple processing operations. Each node, when
called, will produce the next tuple in its output sequence, or NULL if no
more tuples are available. If the node is not a primitive relation-scanning
node, it will have child node(s) that it calls in turn to obtain input
tuples.
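
As a concrete illustration of the demand-pull protocol, here is a minimal
standalone sketch (the Node and Tuple types and the next callback are
invented for this example; this is not PostgreSQL's actual node interface)
in which a filter node repeatedly pulls tuples from a leaf scan node:

    /* Minimal sketch of a demand-pull pipeline; NOT actual PostgreSQL code.
     * Each node returns its next output tuple, or NULL when exhausted. */
    #include <stdio.h>

    typedef struct Tuple { int value; } Tuple;

    typedef struct Node {
        struct Node *child;                 /* NULL for a leaf scan node */
        Tuple *(*next) (struct Node *self); /* produce next tuple or NULL */
        int    pos;                         /* private state for the scan */
    } Node;

    /* Leaf node: "scans" the integers 0..9. */
    static Tuple *
    scan_next(Node *self)
    {
        static Tuple t;

        if (self->pos >= 10)
            return NULL;                    /* no more tuples */
        t.value = self->pos++;
        return &t;
    }

    /* Filter node: pulls from its child, passes through even values only. */
    static Tuple *
    filter_next(Node *self)
    {
        Tuple *tup;

        while ((tup = self->child->next(self->child)) != NULL)
        {
            if (tup->value % 2 == 0)
                return tup;
        }
        return NULL;                        /* child exhausted */
    }

    int
    main(void)
    {
        Node scan = {NULL, scan_next, 0};
        Node filter = {&scan, filter_next, 0};
        Tuple *tup;

        /* Top level repeatedly demands tuples until the pipeline runs dry. */
        while ((tup = filter.next(&filter)) != NULL)
            printf("%d\n", tup->value);
        return 0;
    }

In the real executor the pulling is done through ExecProcNode (see the
control-flow sketch below), but the shape of the interaction is the same:
a parent demands one tuple at a time, and a child either produces it or
reports that it has no more.
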
Refinements on this basic model include:

* Choice of scan direction (forwards or backwards). Caution: this is not
currently well-supported. It works for primitive scan nodes, but not very
well for joins, aggregates, etc.

* Rescan command to reset a node and make it generate its output sequence
over again.

* Parameters that can alter a node's results. After adjusting a parameter,
the rescan command must be applied to that node and all nodes above it.
There is a moderately intelligent scheme to avoid rescanning nodes
unnecessarily (for example, Sort does not rescan its input if no parameters
of the input have changed, since it can just reread its stored sorted data).
A toy sketch of rescanning appears below.
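
Continuing the toy model above (again with invented names, not the real
executor interface), a rescan simply resets a node's private state so that
its output sequence can be regenerated, for example after a parameter has
been changed:

    /* Sketch of the rescan refinement; NOT PostgreSQL's real node interface. */
    #include <stdio.h>

    typedef struct Tuple { int value; } Tuple;

    typedef struct ScanNode {
        int param;                  /* imagine this is set by an outer query */
        int pos;                    /* current scan position */
    } ScanNode;

    /* Produce the next value >= param, or NULL when the scan is exhausted. */
    static Tuple *
    scan_next(ScanNode *node)
    {
        static Tuple t;

        while (node->pos < 10)
        {
            int v = node->pos++;

            if (v >= node->param)
            {
                t.value = v;
                return &t;
            }
        }
        return NULL;
    }

    /* Rescan: reset the node so it will regenerate its output sequence. */
    static void
    scan_rescan(ScanNode *node)
    {
        node->pos = 0;
    }

    int
    main(void)
    {
        ScanNode scan = {5, 0};
        Tuple *tup;

        while ((tup = scan_next(&scan)) != NULL)
            printf("%d ", tup->value);
        printf("\n");

        /* After adjusting the parameter, the node must be rescanned before
         * its output is requested again. */
        scan.param = 8;
        scan_rescan(&scan);
        while ((tup = scan_next(&scan)) != NULL)
            printf("%d ", tup->value);
        printf("\n");
        return 0;
    }
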
The plan tree concept implements SELECT directly: it is only necessary to
deliver the top-level result tuples to the client, or insert them into
another table in the case of INSERT ... SELECT. (INSERT ... VALUES is
handled similarly, but the plan tree is just a Result node with no source
tables.) For UPDATE, the plan tree selects the tuples that need to be
updated (WHERE condition) and delivers a new calculated tuple value for each
such tuple, plus a "junk" (hidden) tuple CTID identifying the target tuple.
The executor's top level then uses this information to update the correct
tuple. DELETE is similar to UPDATE except that only a CTID need be
delivered by the plan tree.
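
Schematically, the UPDATE case looks like the standalone sketch below
(invented types; real tuples, CTIDs, and heap access are of course far more
involved): the plan delivers the recomputed row values plus the hidden row
identifier, and the top level applies the change to the row so identified.

    /* Schematic sketch of UPDATE handling; NOT actual PostgreSQL code. */
    #include <stdio.h>

    #define NROWS 3

    typedef struct Row { int id; int balance; } Row;

    /* What the plan delivers for each row matching the WHERE condition:
     * the new column values, plus the "junk" ctid (here an array index). */
    typedef struct PlannedUpdate { Row newvals; int ctid; } PlannedUpdate;

    int
    main(void)
    {
        Row table[NROWS] = {{1, 100}, {2, 200}, {3, 300}};

        /* Pretend the plan tree selected the second row (ctid = 1) and
         * computed SET balance = balance + 50 for it. */
        PlannedUpdate upd = {{2, 250}, 1};

        /* Executor top level: locate the target row via the junk ctid and
         * replace it with the newly computed tuple. */
        table[upd.ctid] = upd.newvals;

        for (int i = 0; i < NROWS; i++)
            printf("id=%d balance=%d\n", table[i].id, table[i].balance);
        return 0;
    }
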
XXX a great deal more documentation needs to be written here...


Plan Trees and State Trees
--------------------------

The plan tree delivered by the planner contains a tree of Plan nodes (struct
types derived from struct Plan). Each Plan node may have expression trees
associated with it, to represent its target list, qualification conditions,
etc. During executor startup we build a parallel tree of identical structure
containing executor state nodes --- every plan and expression node type has
a corresponding executor state node type. Each node in the state tree has a
pointer to its corresponding node in the plan tree, plus executor state data
as needed to implement that node type. This arrangement allows the plan
tree to be completely read-only as far as the executor is concerned: all data
that is modified during execution is in the state tree. Read-only plan trees
make life much simpler for plan caching and reuse.
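
The plan-tree/state-tree split can be pictured with the following simplified
standalone sketch (the real Plan and PlanState structs carry many more
fields, and in PostgreSQL the state tree is built by ExecInitNode, described
below): the plan side is never written to, and everything that changes at
runtime lives in the mirror-image state node.

    /* Simplified sketch of parallel plan and state trees; the real
     * Plan/PlanState structs carry far more information. */
    #include <stdio.h>
    #include <stdlib.h>

    /* Plan node: produced by the planner, never modified by the executor. */
    typedef struct Plan {
        const char  *nodename;              /* e.g. "SeqScan", "Sort" */
        struct Plan *lefttree;              /* child plan, or NULL */
    } Plan;

    /* State node: one per plan node; holds all mutable runtime data. */
    typedef struct PlanState {
        const Plan       *plan;             /* link to read-only plan node */
        long              tuples_emitted;   /* example of runtime state */
        struct PlanState *lefttree;         /* child state node, or NULL */
    } PlanState;

    /* Executor startup: build a state tree mirroring the plan tree's shape. */
    static PlanState *
    init_node(const Plan *plan)
    {
        PlanState *ps;

        if (plan == NULL)
            return NULL;
        ps = calloc(1, sizeof(PlanState));
        ps->plan = plan;
        ps->lefttree = init_node(plan->lefttree);
        return ps;
    }

    int
    main(void)
    {
        Plan scan = {"SeqScan", NULL};
        Plan sort = {"Sort", &scan};
        PlanState *root = init_node(&sort);

        printf("state node mirrors plan node: %s\n", root->plan->nodename);

        /* In PostgreSQL these state nodes are not freed retail; they vanish
         * when the per-query memory context is destroyed (see below). */
        free(root->lefttree);
        free(root);
        return 0;
    }
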
Altogether there are four classes of nodes used in these trees: Plan nodes,
their corresponding PlanState nodes, Expr nodes, and their corresponding
ExprState nodes. (Actually, there are also List nodes, which are used as
"glue" in all four kinds of tree.)


Memory Management
-----------------

A "per query" memory context is created during CreateExecutorState();
all storage allocated during an executor invocation is allocated in that
context or a child context. This allows easy reclamation of storage
during executor shutdown --- rather than messing with retail pfree's and
probable storage leaks, we just destroy the memory context.

In particular, the plan state trees and expression state trees described
in the previous section are allocated in the per-query memory context.

To avoid intra-query memory leaks, most processing while a query runs
is done in "per tuple" memory contexts, which are so-called because they
are typically reset to empty once per tuple. Per-tuple contexts are usually
associated with ExprContexts, and commonly each PlanState node has its own
ExprContext to evaluate its qual and targetlist expressions in.
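
The reset-per-tuple discipline can be illustrated with a toy arena allocator
(a standalone sketch only; PostgreSQL's MemoryContext machinery is far more
general): scratch allocations made while processing one tuple are reclaimed
in a single step before the next tuple, and the whole arena disappears at
shutdown.

    /* Toy arena allocator illustrating the per-tuple reset pattern;
     * NOT PostgreSQL's MemoryContext implementation. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct Arena {
        char   *buf;
        size_t  size;
        size_t  used;
    } Arena;

    static void
    arena_init(Arena *a, size_t size)
    {
        a->buf = malloc(size);
        a->size = size;
        a->used = 0;
    }

    /* Bump allocation: cheap and individually unfreeable, in the spirit of
     * palloc in a short-lived context (no overflow checking here). */
    static void *
    arena_alloc(Arena *a, size_t n)
    {
        void *p = a->buf + a->used;

        a->used += n;
        return p;
    }

    /* Reset: all allocations vanish at once, as with ResetExprContext. */
    static void
    arena_reset(Arena *a)
    {
        a->used = 0;
    }

    int
    main(void)
    {
        Arena per_tuple;
        int   tupno;

        arena_init(&per_tuple, 1024);

        for (tupno = 0; tupno < 3; tupno++)
        {
            /* Scratch space for evaluating expressions over this tuple. */
            int *scratch = arena_alloc(&per_tuple, 10 * sizeof(int));

            scratch[0] = tupno;
            printf("tuple %d used %zu bytes\n", tupno, per_tuple.used);

            /* Done with this tuple: reclaim everything in one step. */
            arena_reset(&per_tuple);
        }

        free(per_tuple.buf);        /* whole-arena teardown at shutdown */
        return 0;
    }

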
Query Processing Control Flow
-----------------------------

This is a sketch of control flow for full query processing:

    CreateQueryDesc

    ExecutorStart
        CreateExecutorState
            creates per-query context
        switch to per-query context to run ExecInitNode
        ExecInitNode --- recursively scans plan tree
            CreateExprContext
                creates per-tuple context
            ExecInitExpr

    ExecutorRun
        ExecProcNode --- recursively called in per-query context
            ExecEvalExpr --- called in per-tuple context
            ResetExprContext --- to free memory

    ExecutorEnd
        ExecEndNode --- recursively releases resources
        FreeExecutorState
            frees per-query context and child contexts

    FreeQueryDesc

Per above comments, it's not really critical for ExecEndNode to free any
memory; it'll all go away in FreeExecutorState anyway. However, we do need to
be careful to close relations, drop buffer pins, etc, so we do need to scan
the plan state tree to find these sorts of resources.
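
The kind of walk ExecEndNode performs can be sketched standalone as follows
(invented node type; in PostgreSQL the resources in question are open
relations, buffer pins, and so on): external resources are released
explicitly, while plain memory is simply left for the per-query context
teardown.

    /* Sketch of recursive end-of-query cleanup; NOT actual PostgreSQL code. */
    #include <stdio.h>

    typedef struct StateNode {
        FILE             *datafile;     /* stands in for relations, pins, ... */
        struct StateNode *lefttree;
    } StateNode;

    /* Recursively walk the state tree, releasing non-memory resources. */
    static void
    end_node(StateNode *node)
    {
        if (node == NULL)
            return;
        if (node->datafile != NULL)
        {
            fclose(node->datafile);
            node->datafile = NULL;
        }
        end_node(node->lefttree);
        /* No freeing of the node itself: its memory goes away when the
         * per-query context is destroyed. */
    }

    int
    main(void)
    {
        StateNode child = {fopen("/dev/null", "r"), NULL};
        StateNode root = {NULL, &child};

        end_node(&root);
        printf("resources released\n");
        return 0;
    }
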
The executor can also be used to evaluate simple expressions without any Plan
tree ("simple" meaning "no aggregates and no sub-selects", though such might
be hidden inside function calls). This case has a flow of control like

    CreateExecutorState
        creates per-query context

    CreateExprContext -- or use GetPerTupleExprContext(estate)
        creates per-tuple context

    ExecPrepareExpr
        temporarily switch to per-query context
        run the expression through expression_planner
        ExecInitExpr

    Repeatedly do:
        ExecEvalExprSwitchContext
            ExecEvalExpr --- called in per-tuple context
        ResetExprContext --- to free memory

    FreeExecutorState
        frees per-query context, as well as ExprContext
        (a separate FreeExprContext call is not necessary)


EvalPlanQual (READ COMMITTED Update Checking)
---------------------------------------------

For simple SELECTs, the executor need only pay attention to tuples that are
valid according to the snapshot seen by the current transaction (ie, they
were inserted by a previously committed transaction, and not deleted by any
previously committed transaction). However, for UPDATE and DELETE it is not
cool to modify or delete a tuple that's been modified by an open or
concurrently-committed transaction. If we are running in SERIALIZABLE
isolation level then we just raise an error when this condition is seen to
occur. In READ COMMITTED isolation level, we must work a lot harder.

The basic idea in READ COMMITTED mode is to take the modified tuple
committed by the concurrent transaction (after waiting for it to commit,
if need be) and re-evaluate the query qualifications to see if it would
still meet the quals. If so, we regenerate the updated tuple (if we are
doing an UPDATE) from the modified tuple, and finally update/delete the
modified tuple. SELECT FOR UPDATE/SHARE behaves similarly, except that its
action is just to lock the modified tuple.
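
The essence of the recheck can be sketched standalone as follows (invented
Row type and qual function; the actual mechanism re-runs the plan, as
described next): decide whether the newest version of the concurrently
modified row still satisfies the query's qualification.

    /* Standalone sketch of the READ COMMITTED recheck idea; this is not
     * PostgreSQL's EvalPlanQual machinery. */
    #include <stdio.h>
    #include <stdbool.h>

    typedef struct Row { int id; int balance; } Row;

    /* The query's qualification, say WHERE balance >= 100. */
    static bool
    qual(const Row *row)
    {
        return row->balance >= 100;
    }

    int
    main(void)
    {
        Row original = {1, 150};    /* the version our snapshot saw */
        Row modified = {1, 90};     /* the version committed concurrently */

        printf("snapshot version qualifies: %s\n",
               qual(&original) ? "yes" : "no");

        /* We set out to update the row because its snapshot version passed
         * the qual, but found a newer committed version; re-evaluate the
         * qual against that newest version to decide whether to proceed. */
        if (qual(&modified))
            printf("recheck passed: update/delete/lock row %d\n", modified.id);
        else
            printf("recheck failed: ignore row %d and move on\n", modified.id);
        return 0;
    }
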
To implement this checking, we actually re-run the entire query from scratch
for each modified tuple, but with the scan node that sourced the original
tuple set to return only the modified tuple, not the original tuple or any
of the rest of the relation. If this query returns a tuple, then the
modified tuple passes the quals (and the query output is the suitably
modified update tuple, if we're doing UPDATE). If no tuple is returned,
then the modified tuple fails the quals, so we ignore it and continue the
original query. (This is reasonably efficient for simple queries, but may
be horribly slow for joins. A better design would be nice; one thought for
future investigation is to treat the tuple substitution like a parameter,
so that we can avoid rescanning unrelated nodes.)

Note a fundamental bogosity of this approach: if the relation containing
the original tuple is being used in a self-join, the other instance(s) of
the relation will be treated as still containing the original tuple, whereas
logical consistency would demand that the modified tuple appear in them too.
But we'd have to actually substitute the modified tuple for the original,
while still returning all the rest of the relation, to ensure consistent
answers. Implementing this correctly is a task for future work.
In UPDATE/DELETE, only the target relation needs to be handled this way,
so only one special recheck query needs to execute at a time. In SELECT FOR
UPDATE, there may be multiple relations flagged FOR UPDATE, so it's possible
that while we are executing a recheck query for one modified tuple, we will
hit another modified tuple in another relation. In this case we "stack up"
recheck queries: a sub-recheck query is spawned in which both the first and
second modified tuples will be returned as the only components of their
relations. (In the event of success, all these modified tuples will be
locked.) Again, this isn't necessarily quite the right thing ... but in
simple cases it works. Potentially, recheck queries could get nested to the
depth of the number of FOR UPDATE/SHARE relations in the query.

It should be noted also that UPDATE/DELETE expect at most one tuple to
result from the modified query, whereas in the FOR UPDATE case it's possible
for multiple tuples to result (since we could be dealing with a join in
which multiple tuples join to the modified tuple). We want FOR UPDATE to
lock all relevant tuples, so we pass all tuples output by all the stacked
recheck queries back to the executor top level for locking.