src/backend/access/transam/README.parallel

   1 Overview
   2 ========
   3
   4 PostgreSQL provides some simple facilities to make writing parallel algorithms
   5 easier.  Using a data structure called a ParallelContext, you can arrange to
   6 launch background worker processes, initialize their state to match that of
   7 the backend which initiated parallelism, communicate with them via dynamic
   8 shared memory, and write reasonably complex code that can run either in the
   9 user backend or in one of the parallel workers without needing to be aware of
  10 where it's running.
  11
  12 The backend which starts a parallel operation (hereafter, the initiating
  13 backend) starts by creating a dynamic shared memory segment which will last
  14 for the lifetime of the parallel operation.  This dynamic shared memory segment
  15 will contain (1) a shm_mq that can be used to transport errors (and other
  16 messages reported via elog/ereport) from the worker back to the initiating
  17 backend; (2) serialized representations of the initiating backend's private
  18 state, so that the worker can synchronize its state with of the initiating
  19 backend; and (3) any other data structures which a particular user of the
  20 ParallelContext data structure may wish to add for its own purposes.  Once
  21 the initiating backend has initialized the dynamic shared memory segment, it
  22 asks the postmaster to launch the appropriate number of parallel workers.
  23 These workers then connect to the dynamic shared memory segment, initiate
  24 their state, and then invoke the appropriate entrypoint, as further detailed
  25 below.
  26
  27 Error Reporting
  28 ===============
  29
  30 When started, each parallel worker begins by attaching the dynamic shared
  31 memory segment and locating the shm_mq to be used for error reporting; it
  32 redirects all of its protocol messages to this shm_mq.  Prior to this point,
  33 any failure of the background worker will not be reported to the initiating
  34 backend; from the point of view of the initiating backend, the worker simply
  35 failed to start.  The initiating backend must anyway be prepared to cope
  36 with fewer parallel workers than it originally requested, so catering to
  37 this case imposes no additional burden.
  38
  39 Whenever a new message (or partial message; very large messages may wrap) is
  40 sent to the error-reporting queue, PROCSIG_PARALLEL_MESSAGE is sent to the
  41 initiating backend.  This causes the next CHECK_FOR_INTERRUPTS() in the
  42 initiating backend to read and rethrow the message.  For the most part, this
  43 makes error reporting in parallel mode "just work".  Of course, to work
  44 properly, it is important that the code the initiating backend is executing
  45 CHECK_FOR_INTERRUPTS() regularly and avoid blocking interrupt processing for
  46 long periods of time, but those are good things to do anyway.
  47
  48 (A currently-unsolved problem is that some messages may get written to the
  49 system log twice, once in the backend where the report was originally
  50 generated, and again when the initiating backend rethrows the message.  If
  51 we decide to suppress one of these reports, it should probably be second one;
  52 otherwise, if the worker is for some reason unable to propagate the message
  53 back to the initiating backend, the message will be lost altogether.)
  54
  55 State Sharing
  56 =============
  57
  58 It's possible to write C code which works correctly without parallelism, but
  59 which fails when parallelism is used.  No parallel infrastructure can
  60 completely eliminate this problem, because any global variable is a risk.
  61 There's no general mechanism for ensuring that every global variable in the
  62 worker will have the same value that it does in the initiating backend; even
  63 if we could ensure that, some function we're calling could update the variable
  64 after each call, and only the backend where that update is performed will see
  65 the new value.  Similar problems can arise with any more-complex data
  66 structure we might choose to use.  For example, a pseudo-random number
  67 generator should, given a particular seed value, produce the same predictable
  68 series of values every time.  But it does this by relying on some private
  69 state which won't automatically be shared between cooperating backends.  A
  70 parallel-safe PRNG would need to store its state in dynamic shared memory, and
  71 would require locking.  The parallelism infrastructure has no way of knowing
  72 whether the user intends to call code that has this sort of problem, and can't
  73 do anything about it anyway.
  74
  75 Instead, we take a more pragmatic approach. First, we try to make as many of
  76 the operations that are safe outside of parallel mode work correctly in
  77 parallel mode as well.  Second, we try to prohibit common unsafe operations
  78 via suitable error checks.  These checks are intended to catch 100% of
  79 unsafe things that a user might do from the SQL interface, but code written
  80 in C can do unsafe things that won't trigger these checks.  The error checks
  81 are engaged via EnterParallelMode(), which should be called before creating
  82 a parallel context, and disarmed via ExitParallelMode(), which should be
  83 called after all parallel contexts have been destroyed.  The most
  84 significant restriction imposed by parallel mode is that all operations must
  85 be strictly read-only; we allow no writes to the database and no DDL.  We
  86 might try to relax these restrictions in the future.
  87
  88 To make as many operations as possible safe in parallel mode, we try to copy
  89 the most important pieces of state from the initiating backend to each parallel
  90 worker.  This includes:
  91
  92   - The set of libraries dynamically loaded by dfmgr.c.
  93
  94   - The authenticated user ID and current database.  Each parallel worker
  95     will connect to the same database as the initiating backend, using the
  96     same user ID.
  97
  98   - The values of all GUCs.  Accordingly, permanent changes to the value of
  99     any GUC are forbidden while in parallel mode; but temporary changes,
 100     such as entering a function with non-NULL proconfig, are OK.
 101
 102   - The current subtransaction's XID, the top-level transaction's XID, and
 103     the list of XIDs considered current (that is, they are in-progress or
 104     subcommitted).  This information is needed to ensure that tuple visibility
 105     checks return the same results in the worker as they do in the
 106     initiating backend.  See also the section Transaction Integration, below.
 107
 108   - The combo CID mappings.  This is needed to ensure consistent answers to
 109     tuple visibility checks.  The need to synchronize this data structure is
 110     a major reason why we can't support writes in parallel mode: such writes
 111     might create new combo CIDs, and we have no way to let other workers
 112     (or the initiating backend) know about them.
 113
 114   - The transaction snapshot.
 115
 116   - The active snapshot, which might be different from the transaction
 117     snapshot.
 118
 119   - The currently active user ID and security context.  Note that this is
 120     the fourth user ID we restore: the initial step of binding to the correct
 121     database also involves restoring the authenticated user ID.  When GUC
 122     values are restored, this incidentally sets SessionUserId and OuterUserId
 123     to the correct values.  This final step restores CurrentUserId.
 124
 125   - State related to pending REINDEX operations, which prevents access to
 126     an index that is currently being rebuilt.
 127
 128   - Active relmapper.c mapping state.  This is needed to allow consistent
 129     answers when fetching the current relfilenumber for relation oids of
 130     mapped relations.
 131
 132 To prevent unprincipled deadlocks when running in parallel mode, this code
 133 also arranges for the leader and all workers to participate in group
 134 locking.  See src/backend/storage/lmgr/README for more details.
 135
 136 Transaction Integration
 137 =======================
 138
 139 Regardless of what the TransactionState stack looks like in the parallel
 140 leader, each parallel worker begins with a stack of depth 1.  This stack
 141 entry is marked with the special transaction block state
 142 TBLOCK_PARALLEL_INPROGRESS so that it's not confused with an ordinary
 143 toplevel transaction.  The XID of this TransactionState is set to the XID of
 144 the innermost currently-active subtransaction in the initiating backend.  The
 145 initiating backend's toplevel XID, and the XIDs of all current (in-progress
 146 or subcommitted) XIDs are stored separately from the TransactionState stack,
 147 but in such a way that GetTopTransactionId(), GetTopTransactionIdIfAny(), and
 148 TransactionIdIsCurrentTransactionId() return the same values that they would
 149 in the initiating backend.  We could copy the entire transaction state stack,
 150 but most of it would be useless: for example, you can't roll back to a
 151 savepoint from within a parallel worker, and there are no resources to
 152 associated with the memory contexts or resource owners of intermediate
 153 subtransactions.
 154
 155 No meaningful change to the transaction state can be made while in parallel
 156 mode.  No XIDs can be assigned, and no command counter increments can happen,
 157 because we have no way of communicating these state changes to cooperating
 158 backends, or of synchronizing them.  It's clearly unworkable for the initiating
 159 backend to exit any transaction or subtransaction that was in progress when
 160 parallelism was started before all parallel workers have exited; and it's even
 161 more clearly crazy for a parallel worker to try to subcommit or subabort the
 162 current subtransaction and execute in some other transaction context than was
 163 present in the initiating backend.  However, we allow internal subtransactions
 164 (e.g. those used to implement a PL/pgSQL EXCEPTION block) to be used in
 165 parallel mode, provided that they remain XID-less, because other backends
 166 don't really need to know about those transactions or do anything differently
 167 because of them.
 168
 169 At the end of a parallel operation, which can happen either because it
 170 completed successfully or because it was interrupted by an error, parallel
 171 workers associated with that operation exit.  In the error case, transaction
 172 abort processing in the parallel leader kills off any remaining workers, and
 173 the parallel leader then waits for them to die.  In the case of a successful
 174 parallel operation, the parallel leader does not send any signals, but must
 175 wait for workers to complete and exit of their own volition.  In either
 176 case, it is very important that all workers actually exit before the
 177 parallel leader cleans up the (sub)transaction in which they were created;
 178 otherwise, chaos can ensue.  For example, if the leader is rolling back the
 179 transaction that created the relation being scanned by a worker, the
 180 relation could disappear while the worker is still busy scanning it.  That's
 181 not safe.
 182
 183 Generally, the cleanup performed by each worker at this point is similar to
 184 top-level commit or abort.  Each backend has its own resource owners: buffer
 185 pins, catcache or relcache reference counts, tuple descriptors, and so on
 186 are managed separately by each backend, and must free them before exiting.
 187 There are, however, some important differences between parallel worker
 188 commit or abort and a real top-level transaction commit or abort.  Most
 189 importantly:
 190
 191   - No commit or abort record is written; the initiating backend is
 192     responsible for this.
 193
 194   - Cleanup of pg_temp namespaces is not done.  Parallel workers cannot
 195     safely access the initiating backend's pg_temp namespace, and should
 196     not create one of their own.
 197
 198 Coding Conventions
 199 ===================
 200
 201 Before beginning any parallel operation, call EnterParallelMode(); after all
 202 parallel operations are completed, call ExitParallelMode().  To actually
 203 parallelize a particular operation, use a ParallelContext.  The basic coding
 204 pattern looks like this:
 205
 206         EnterParallelMode();            /* prohibit unsafe state changes */
 207
 208         pcxt = CreateParallelContext("library_name", "function_name", nworkers);
 209
 210         /* Allow space for application-specific data here. */
 211         shm_toc_estimate_chunk(&pcxt->estimator, size);
 212         shm_toc_estimate_keys(&pcxt->estimator, keys);
 213
 214         InitializeParallelDSM(pcxt);    /* create DSM and copy state to it */
 215
 216         /* Store the data for which we reserved space. */
 217         space = shm_toc_allocate(pcxt->toc, size);
 218         shm_toc_insert(pcxt->toc, key, space);
 219
 220         LaunchParallelWorkers(pcxt);
 221
 222         /* do parallel stuff */
 223
 224         WaitForParallelWorkersToFinish(pcxt);
 225
 226         /* read any final results from dynamic shared memory */
 227
 228         DestroyParallelContext(pcxt);
 229
 230         ExitParallelMode();
 231
 232 If desired, after WaitForParallelWorkersToFinish() has been called, the
 233 context can be reset so that workers can be launched anew using the same
 234 parallel context.  To do this, first call ReinitializeParallelDSM() to
 235 reinitialize state managed by the parallel context machinery itself; then,
 236 perform any other necessary resetting of state; after that, you can again
 237 call LaunchParallelWorkers.