src/backend/access/transam/README

   3 The Transaction System
   4 ======================
   5
   6 PostgreSQL's transaction system is a three-layer system.  The bottom layer
   7 implements low-level transactions and subtransactions, on top of which rests
   8 the mainloop's control code, which in turn implements user-visible
   9 transactions and savepoints.
  10
  11 The middle layer of code is called by postgres.c before and after the
  12 processing of each query, or after detecting an error:
  13
  14                 StartTransactionCommand
  15                 CommitTransactionCommand
  16                 AbortCurrentTransaction
  17
  18 Meanwhile, the user can alter the system's state by issuing the SQL commands
  19 BEGIN, COMMIT, ROLLBACK, SAVEPOINT, ROLLBACK TO or RELEASE.  The traffic cop
  20 redirects these calls to the toplevel routines
  21
  22                 BeginTransactionBlock
  23                 EndTransactionBlock
  24                 UserAbortTransactionBlock
  25                 DefineSavepoint
  26                 RollbackToSavepoint
  27                 ReleaseSavepoint
  28
  29 respectively.  Depending on the current state of the system, these functions
  30 call low level functions to activate the real transaction system:
  31
  32                 StartTransaction
  33                 CommitTransaction
  34                 AbortTransaction
  35                 CleanupTransaction
  36                 StartSubTransaction
  37                 CommitSubTransaction
  38                 AbortSubTransaction
  39                 CleanupSubTransaction
  40
  41 Additionally, within a transaction, CommandCounterIncrement is called to
  42 increment the command counter, which allows future commands to "see" the
  43 effects of previous commands within the same transaction.  Note that this is
  44 done automatically by CommitTransactionCommand after each query inside a
  45 transaction block, but some utility functions also do it internally to allow
  46 some operations (usually in the system catalogs) to be seen by future
  47 operations in the same utility command.  (For example, in DefineRelation it is
  48 done after creating the heap so the pg_class row is visible, to be able to
  49 lock it.)
  50
  51
  52 For example, consider the following sequence of user commands:
  53
  54 1)              BEGIN
  55 2)              SELECT * FROM foo
  56 3)              INSERT INTO foo VALUES (...)
  57 4)              COMMIT
  58
  59 In the main processing loop, this results in the following function call
  60 sequence:
  61
  62      /  StartTransactionCommand;
  63     /       StartTransaction;
  64 1) <    ProcessUtility;                 << BEGIN
  65     \       BeginTransactionBlock;
  66      \  CommitTransactionCommand;
  67
  68     /   StartTransactionCommand;
  69 2) /    PortalRunSelect;                << SELECT ...
  70    \    CommitTransactionCommand;
  71     \       CommandCounterIncrement;
  72
  73     /   StartTransactionCommand;
  74 3) /    ProcessQuery;                   << INSERT ...
  75    \    CommitTransactionCommand;
  76     \       CommandCounterIncrement;
  77
  78      /  StartTransactionCommand;
  79     /   ProcessUtility;                 << COMMIT
  80 4) <        EndTransactionBlock;
  81     \   CommitTransactionCommand;
  82      \      CommitTransaction;
  83
  84 The point of this example is to demonstrate the need for
  85 StartTransactionCommand and CommitTransactionCommand to be state smart -- they
  86 should call CommandCounterIncrement between the calls to BeginTransactionBlock
  87 and EndTransactionBlock and outside these calls they need to do normal start,
  88 commit or abort processing.
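
The "state smart" behavior can be sketched as a toy state machine.  This is
not the code in xact.c: the state names are meant to follow the TBLOCK_*
names used there, but the ToyXact struct, the toy_* functions and the
counters are hypothetical, and subtransactions and aborts are left out.

```c
#include <assert.h>

/* Reduced set of transaction-block states (toy model only). */
typedef enum
{
    TBLOCK_DEFAULT,     /* idle, no transaction */
    TBLOCK_STARTED,     /* running a single-query transaction */
    TBLOCK_BEGIN,       /* BEGIN was just processed */
    TBLOCK_INPROGRESS,  /* inside a live transaction block */
    TBLOCK_END          /* COMMIT received, close at next commit command */
} ToyBlockState;

typedef struct
{
    ToyBlockState state;
    int starts;         /* times the low-level StartTransaction ran */
    int commits;        /* times the low-level CommitTransaction ran */
    int cci;            /* times CommandCounterIncrement ran */
} ToyXact;

/* Middle layer: start a real transaction only when none is open. */
void toy_start_command(ToyXact *x)
{
    if (x->state == TBLOCK_DEFAULT)
    {
        x->starts++;
        x->state = TBLOCK_STARTED;
    }
}

/* Top layer: BEGIN and COMMIT only change the block state. */
void toy_begin_block(ToyXact *x)
{
    assert(x->state == TBLOCK_STARTED);
    x->state = TBLOCK_BEGIN;
}

void toy_end_block(ToyXact *x)
{
    assert(x->state == TBLOCK_INPROGRESS);
    x->state = TBLOCK_END;
}

/* Middle layer: what "commit" means depends on the block state. */
void toy_commit_command(ToyXact *x)
{
    switch (x->state)
    {
        case TBLOCK_STARTED:
        case TBLOCK_END:
            x->commits++;               /* really commit */
            x->state = TBLOCK_DEFAULT;
            break;
        case TBLOCK_BEGIN:
            x->state = TBLOCK_INPROGRESS;
            break;
        case TBLOCK_INPROGRESS:
            x->cci++;                   /* only bump the command counter */
            break;
        case TBLOCK_DEFAULT:
            break;
    }
}
```

Driving the model with the four commands of the example performs one real
start, two command-counter increments and one real commit.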
  89
  90 Furthermore, suppose the "SELECT * FROM foo" caused an abort condition. In
  91 this case AbortCurrentTransaction is called, and the transaction is put in
  92 aborted state.  In this state, any user input is ignored except for
  93 transaction-termination statements, or ROLLBACK TO <savepoint> commands.
  94
  95 Transaction aborts can occur in two ways:
  96
  97 1) system aborts the transaction for some internal reason (syntax error, etc)
  98 2) user types ROLLBACK
  99
 100 The reason we have to distinguish them is illustrated by the following two
 101 situations:
 102
 103         case 1                                  case 2
 104         ------                                  ------
 105 1) user types BEGIN                     1) user types BEGIN
 106 2) user does something                  2) user does something
 107 3) user does not like what              3) system aborts for some reason
 108    she sees and types ABORT                (syntax error, etc)
 109
 110 In case 1, we want to abort the transaction and return to the default state.
 111 In case 2, there may be more commands coming our way which are part of the
 112 same transaction block; we have to ignore these commands until we see a COMMIT
 113 or ROLLBACK.
 114
 115 Internal aborts are handled by AbortCurrentTransaction, while user aborts are
 116 handled by UserAbortTransactionBlock.  Both of them rely on AbortTransaction
 117 to do all the real work.  The only difference is what state we enter after
 118 AbortTransaction does its work:
 119
 120 * AbortCurrentTransaction leaves us in TBLOCK_ABORT,
 121 * UserAbortTransactionBlock leaves us in TBLOCK_ABORT_END
 122
 123 Low-level transaction abort handling is divided in two phases:
 124 * AbortTransaction executes as soon as we realize the transaction has
 125   failed.  It should release all shared resources (locks etc) so that we do
 126   not delay other backends unnecessarily.
 127 * CleanupTransaction executes when we finally see a user COMMIT
 128   or ROLLBACK command; it cleans things up and gets us out of the transaction
 129   completely.  In particular, we mustn't destroy TopTransactionContext until
 130   this point.
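
The two abort phases can be illustrated with a small model of the
internal-error path.  Everything here is hypothetical (the TOY_* states, the
ToyAbortXact struct and the toy_* functions); the two booleans merely stand
in for "shared resources such as locks" and for TopTransactionContext.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
    TOY_DEFAULT,        /* no transaction */
    TOY_INPROGRESS,     /* live transaction block */
    TOY_ABORT,          /* failed, waiting for the user's ROLLBACK/COMMIT */
    TOY_ABORT_END       /* failed and ROLLBACK received */
} ToyAbortState;

typedef struct
{
    ToyAbortState state;
    bool locks_held;        /* shared resources, released in phase 1 */
    bool context_alive;     /* stand-in for TopTransactionContext */
} ToyAbortXact;

/* Phase 1: an error was detected; free shared resources right away. */
void toy_internal_abort(ToyAbortXact *x)
{
    assert(x->state == TOY_INPROGRESS);
    x->locks_held = false;      /* do not delay other backends */
    x->state = TOY_ABORT;       /* context_alive stays true for now */
}

/* In the aborted state only transaction-termination commands get through. */
bool toy_accepts_command(const ToyAbortXact *x, bool is_termination)
{
    if (x->state == TOY_ABORT)
        return is_termination;
    return x->state == TOY_INPROGRESS;
}

/* The user finally sends ROLLBACK for the failed block. */
void toy_user_rollback(ToyAbortXact *x)
{
    assert(x->state == TOY_ABORT);
    x->state = TOY_ABORT_END;
}

/* Phase 2: the next commit-command call cleans up and leaves the block. */
void toy_finish_command(ToyAbortXact *x)
{
    if (x->state == TOY_ABORT_END)
    {
        x->context_alive = false;
        x->state = TOY_DEFAULT;
    }
}
```

Note that the model keeps context_alive true between the two phases, which
is the point of deferring cleanup.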
 131
 132 Also, note that when a transaction is committed, we don't close it right away.
 133 Rather it's put in TBLOCK_END state, which means that when
 134 CommitTransactionCommand is called after the query has finished processing,
 135 the transaction has to be closed.  The distinction is subtle but important,
 136 because it means that control will leave the xact.c code with the transaction
 137 open, and the main loop will be able to keep processing inside the same
 138 transaction.  So, in a sense, transaction commit is also handled in two
 139 phases, the first at EndTransactionBlock and the second at
 140 CommitTransactionCommand (which is where CommitTransaction is actually
 141 called).
 142
 143 The rest of the code in xact.c consists of routines to support the creation and
 144 finishing of transactions and subtransactions.  For example, AtStart_Memory
 145 takes care of initializing the memory subsystem at main transaction start.
 146
 147
 148 Subtransaction Handling
 149 -----------------------
 150
 151 Subtransactions are implemented using a stack of TransactionState structures,
 152 each of which has a pointer to its parent transaction's struct.  When a new
 153 subtransaction is to be opened, PushTransaction is called, which creates a new
 154 TransactionState, with its parent link pointing to the current transaction.
 155 StartSubTransaction is in charge of initializing the new TransactionState to
 156 sane values, and properly initializing other subsystems (AtSubStart routines).
 157
 158 When closing a subtransaction, either CommitSubTransaction has to be called
 159 (if the subtransaction is committing), or AbortSubTransaction and
 160 CleanupSubTransaction (if it's aborting).  In either case, PopTransaction is
 161 called so the system returns to the parent transaction.
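
A minimal sketch of such a parent-linked stack follows.  The struct and
function names are hypothetical stand-ins for TransactionState,
PushTransaction and PopTransaction, and the real structure carries far more
state than a nesting level.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct ToyTransactionState
{
    int nestingLevel;                       /* 1 = top-level transaction */
    struct ToyTransactionState *parent;     /* NULL for the top level */
} ToyTransactionState;

/* Open a subtransaction: a new entry whose parent is the current one. */
ToyTransactionState *toy_push_transaction(ToyTransactionState *current)
{
    ToyTransactionState *s;

    assert(current != NULL);
    s = malloc(sizeof(*s));
    assert(s != NULL);
    s->nestingLevel = current->nestingLevel + 1;
    s->parent = current;
    return s;
}

/* Close a subtransaction: free its entry and return to the parent. */
ToyTransactionState *toy_pop_transaction(ToyTransactionState *current)
{
    ToyTransactionState *parent = current->parent;

    assert(parent != NULL);     /* the top level is never popped */
    free(current);
    return parent;
}
```

The caller keeps a "current" pointer and replaces it with the return value
of each push or pop.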
 162
 163 One important point regarding subtransaction handling is that several may need
 164 to be closed in response to a single user command.  That's because savepoints
 165 have names, and we allow a savepoint to be committed or rolled back by name; the
 166 named savepoint is not necessarily the one that was last opened.  Also a COMMIT or ROLLBACK
 167 command must be able to close out the entire stack.  We handle this by having
 168 the utility command subroutine mark all the state stack entries as commit-
 169 pending or abort-pending, and then when the main loop reaches
 170 CommitTransactionCommand, the real work is done.  The main point of doing
 171 things this way is that if we get an error while popping state stack entries,
 172 the remaining stack entries still show what we need to do to finish up.
 173
 174 In the case of ROLLBACK TO <savepoint>, we abort all the subtransactions up
 175 through the one identified by the savepoint name, and then re-create that
 176 subtransaction level with the same name.  So it's a completely new
 177 subtransaction as far as the internals are concerned.
 178
 179 Other subsystems are allowed to start "internal" subtransactions, which are
 180 handled by BeginInternalSubTransaction.  This is to allow implementing
 181 exception handling, e.g. in PL/pgSQL.  ReleaseCurrentSubTransaction and
 182 RollbackAndReleaseCurrentSubTransaction allow the subsystem to close said
 183 subtransactions.  The main difference between this and the savepoint/release
 184 path is that we execute the complete state transition immediately in each
 185 subroutine, rather than deferring some work until CommitTransactionCommand.
 186 Another difference is that BeginInternalSubTransaction is allowed when no
 187 explicit transaction block has been established, while DefineSavepoint is not.
 188
 189
 190 Transaction and Subtransaction Numbering
 191 ----------------------------------------
 192
 193 Transactions and subtransactions are assigned permanent XIDs only when/if
 194 they first do something that requires one --- typically, insert/update/delete
 195 a tuple, though there are a few other places that need an XID assigned.
 196 If a subtransaction requires an XID, we always first assign one to its
 197 parent.  This maintains the invariant that child transactions have XIDs later
 198 than their parents, which is assumed in a number of places.
 199
 200 The subsidiary actions of obtaining a lock on the XID and entering it into
 201 pg_subtrans and PGPROC are done at the time it is assigned.
 202
 203 A transaction that has no XID still needs to be identified for various
 204 purposes, notably holding locks.  For this purpose we assign a "virtual
 205 transaction ID" or VXID to each top-level transaction.  VXIDs are formed from
 206 two fields, the procNumber and a backend-local counter; this arrangement
 207 allows assignment of a new VXID at transaction start without any contention
 208 for shared memory.  To ensure that a VXID isn't re-used too soon after backend
 209 exit, we store the last local counter value into shared memory at backend
 210 exit, and initialize it from the previous value for the same PGPROC slot at
 211 backend start.  All these counters go back to zero at shared memory
 212 re-initialization, but that's OK because VXIDs never appear anywhere on-disk.
 213
 214 Internally, a backend needs a way to identify subtransactions whether or not
 215 they have XIDs; but this need only lasts as long as the parent top transaction
 216 endures.  Therefore, we have SubTransactionId, which is somewhat like
 217 CommandId in that it's generated from a counter that we reset at the start of
 218 each top transaction.  The top-level transaction itself has SubTransactionId 1,
 219 and subtransactions have IDs 2 and up.  (Zero is reserved for
 220 InvalidSubTransactionId.)  Note that subtransactions do not have their
 221 own VXIDs; they use the parent top transaction's VXID.
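
The numbering rules for VXIDs and SubTransactionIds can be sketched as
follows.  The ToyVxid and ToyBackend types and the toy_* functions are
hypothetical; the sketch ignores the shared-memory save/restore of the
local counter at backend exit and start.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_INVALID_SUBXACT_ID 0u   /* reserved, never handed out */
#define TOY_TOP_SUBXACT_ID     1u   /* the top-level transaction itself */

typedef struct
{
    int      procNumber;    /* which backend slot */
    uint32_t localXid;      /* backend-local counter value */
} ToyVxid;

typedef struct
{
    int      procNumber;
    uint32_t nextLocalXid;  /* advanced once per top-level transaction */
    uint32_t nextSubXactId; /* reset at each top-level transaction start */
    ToyVxid  vxid;          /* shared by the whole transaction tree */
} ToyBackend;

/* Start a top-level transaction: new VXID, SubTransactionId counter reset. */
uint32_t toy_start_top(ToyBackend *b)
{
    b->vxid.procNumber = b->procNumber;
    b->vxid.localXid = b->nextLocalXid++;   /* no shared memory touched */
    b->nextSubXactId = TOY_TOP_SUBXACT_ID + 1;
    return TOY_TOP_SUBXACT_ID;
}

/* Start a subtransaction: next SubTransactionId, VXID left unchanged. */
uint32_t toy_start_sub(ToyBackend *b)
{
    assert(b->nextSubXactId > TOY_TOP_SUBXACT_ID);
    return b->nextSubXactId++;
}
```

Subtransaction IDs restart at 2 in every top-level transaction, while the
VXID changes only when a new top-level transaction begins.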
 222
 223
 224 Interlocking Transaction Begin, Transaction End, and Snapshots
 225 --------------------------------------------------------------
 226
 227 We try hard to minimize the amount of overhead and lock contention involved
 228 in the frequent activities of beginning/ending a transaction and taking a
 229 snapshot.  Unfortunately, we must have some interlocking for this, because
 230 we must ensure consistency about the commit order of transactions.
 231 For example, suppose an UPDATE in xact A is blocked by xact B's prior
 232 update of the same row, and xact B is doing commit while xact C gets a
 233 snapshot.  Xact A can complete and commit as soon as B releases its locks.
 234 If xact C's GetSnapshotData sees xact B as still running, then it had
 235 better see xact A as still running as well, or it will be able to see two
 236 tuple versions - one deleted by xact B and one inserted by xact A.  Another
 237 reason why this would be bad is that C would see (in the row inserted by A)
 238 earlier changes by B, and it would be inconsistent for C not to see any
 239 of B's changes elsewhere in the database.
 240
 241 Formally, the correctness requirement is "if a snapshot A considers
 242 transaction X as committed, and any of transaction X's snapshots considered
 243 transaction Y as committed, then snapshot A must consider transaction Y as
 244 committed".
 245
 246 What we actually enforce is strict serialization of commits and rollbacks
 247 with snapshot-taking: we do not allow any transaction to exit the set of
 248 running transactions while a snapshot is being taken.  (This rule is
 249 stronger than necessary for consistency, but is relatively simple to
 250 enforce, and it assists with some other issues as explained below.)  The
 251 implementation of this is that GetSnapshotData takes the ProcArrayLock in
 252 shared mode (so that multiple backends can take snapshots in parallel),
 253 but ProcArrayEndTransaction must take the ProcArrayLock in exclusive mode
 254 while clearing the ProcGlobal->xids[] entry at transaction end (either
 255 commit or abort). (To reduce context switching, when multiple transactions
 256 commit nearly simultaneously, we have one backend take ProcArrayLock and
 257 clear the XIDs of multiple processes at once.)
 258
 259 ProcArrayEndTransaction also holds the lock while advancing the shared
 260 latestCompletedXid variable.  This allows GetSnapshotData to use
 261 latestCompletedXid + 1 as xmax for its snapshot: there can be no
 262 transaction >= this xid value that the snapshot needs to consider as
 263 completed.
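
To illustrate how xmax is used, here is a simplified snapshot model.  It is
only a sketch: ToySnapshot and the toy_* functions are hypothetical, XIDs
are compared as plain integers (no wraparound handling), subtransactions
are ignored, and no locking is shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_RUNNING 8

typedef struct
{
    uint32_t xmin;                  /* every XID < xmin has finished */
    uint32_t xmax;                  /* every XID >= xmax counts as running */
    uint32_t xip[TOY_MAX_RUNNING];  /* XIDs running when snapshot was taken */
    size_t   xcnt;
} ToySnapshot;

/* Build a snapshot from the set of running XIDs. */
ToySnapshot toy_get_snapshot(const uint32_t *running, size_t n,
                             uint32_t latestCompletedXid)
{
    ToySnapshot snap = {0};

    assert(n <= TOY_MAX_RUNNING);
    snap.xmax = latestCompletedXid + 1;
    snap.xmin = snap.xmax;
    for (size_t i = 0; i < n; i++)
    {
        if (running[i] >= snap.xmax)
            continue;               /* already covered by the xmax test */
        snap.xip[snap.xcnt++] = running[i];
        if (running[i] < snap.xmin)
            snap.xmin = running[i];
    }
    return snap;
}

/* True if the snapshot treats xid as still in progress. */
bool toy_xid_in_snapshot(const ToySnapshot *snap, uint32_t xid)
{
    if (xid < snap->xmin)
        return false;
    if (xid >= snap->xmax)
        return true;
    for (size_t i = 0; i < snap->xcnt; i++)
        if (snap->xip[i] == xid)
            return true;
    return false;
}
```

With running XIDs 100 and 103 and latestCompletedXid 104, XID 101 is seen as
finished while 100, 103 and anything from 105 upward are treated as running.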
 264
 265 In short, then, the rule is that no transaction may exit the set of
 266 currently-running transactions between the time we fetch latestCompletedXid
 267 and the time we finish building our snapshot.  However, this restriction
 268 only applies to transactions that have an XID --- read-only transactions
 269 can end without acquiring ProcArrayLock, since they don't affect anyone
 270 else's snapshot or latestCompletedXid.
 271
 272 Transaction start, per se, doesn't have any interlocking with these
 273 considerations, since we no longer assign an XID immediately at transaction
 274 start.  But when we do decide to allocate an XID, GetNewTransactionId must
 275 store the new XID into the shared ProcArray before releasing XidGenLock.
 276 This ensures that all top-level XIDs <= latestCompletedXid are either
 277 present in the ProcArray, or not running anymore.  (This guarantee doesn't
 278 apply to subtransaction XIDs, because of the possibility that there's not
 279 room for them in the subxid array; instead we guarantee that they are
 280 present or the overflow flag is set.)  If a backend released XidGenLock
 281 before storing its XID into ProcGlobal->xids[], then it would be possible for
 282 another backend to allocate and commit a later XID, causing latestCompletedXid
 283 to pass the first backend's XID, before that value became visible in the
 284 ProcArray.  That would break ComputeXidHorizons, as discussed below.
 285
 286 We allow GetNewTransactionId to store the XID into ProcGlobal->xids[] (or the
 287 subxid array) without taking ProcArrayLock.  This was once necessary to
 288 avoid deadlock; while that is no longer the case, it's still beneficial for
 289 performance.  We are thereby relying on fetch/store of an XID to be atomic,
 290 else other backends might see a partially-set XID.  This also means that
 291 readers of the ProcArray xid fields must be careful to fetch a value only
 292 once, rather than assume they can read it multiple times and get the same
 293 answer each time.  (Use volatile-qualified pointers when doing this, to
 294 ensure that the C compiler does exactly what you tell it to.)
 295
 296 Another important activity that uses the shared ProcArray is
 297 ComputeXidHorizons, which must determine a lower bound for the oldest xmin
 298 of any active MVCC snapshot, system-wide.  Each individual backend
 299 advertises the smallest xmin of its own snapshots in MyProc->xmin, or zero
 300 if it currently has no live snapshots (eg, if it's between transactions or
 301 hasn't yet set a snapshot for a new transaction).  ComputeXidHorizons takes
 302 the MIN() of the valid xmin fields.  It does this with only shared lock on
 303 ProcArrayLock, which means there is a potential race condition against other
 304 backends doing GetSnapshotData concurrently: we must be certain that a
 305 concurrent backend that is about to set its xmin does not compute an xmin
 306 less than what ComputeXidHorizons determines.  We ensure that by including
 307 all the active XIDs into the MIN() calculation, along with the valid xmins.
 308 The rule that transactions can't exit without taking exclusive ProcArrayLock
 309 ensures that concurrent holders of shared ProcArrayLock will compute the
 310 same minimum of currently-active XIDs: no xact, in particular not the
 311 oldest, can exit while we hold shared ProcArrayLock.  So
 312 ComputeXidHorizons's view of the minimum active XID will be the same as that
 313 of any concurrent GetSnapshotData, and so it can't produce an overestimate.
 314 If there is no active transaction at all, ComputeXidHorizons uses
 315 latestCompletedXid + 1, which is a lower bound for the xmin that might
 316 be computed by concurrent or later GetSnapshotData calls.  (We know that no
 317 XID less than this could be about to appear in the ProcArray, because of the
 318 XidGenLock interlock discussed above.)
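
The MIN() computation described above might look like this in miniature.
ToyProc and toy_compute_horizon are hypothetical names, zero stands for "no
XID" or "no xmin", and XIDs are compared as plain integers with no
wraparound handling; the real ComputeXidHorizons tracks several horizons.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_INVALID_XID 0u

typedef struct
{
    uint32_t xid;   /* assigned XID, or TOY_INVALID_XID */
    uint32_t xmin;  /* advertised snapshot xmin, or TOY_INVALID_XID */
} ToyProc;

/* Lower bound for the oldest xmin of any snapshot, current or upcoming. */
uint32_t toy_compute_horizon(const ToyProc *procs, size_t n,
                             uint32_t latestCompletedXid)
{
    /* With no active transaction at all, this is the answer. */
    uint32_t result = latestCompletedXid + 1;

    assert(procs != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
    {
        /* Fetch each shared field once and work on the local copy. */
        uint32_t xid = procs[i].xid;
        uint32_t xmin = procs[i].xmin;

        /* Active XIDs are included as well as the valid xmins. */
        if (xid != TOY_INVALID_XID && xid < result)
            result = xid;
        if (xmin != TOY_INVALID_XID && xmin < result)
            result = xmin;
    }
    return result;
}
```

Including the active XIDs is what keeps the result from exceeding the xmin
that a concurrent snapshot is about to advertise.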
 319
 320 As GetSnapshotData is performance critical, it does not perform an accurate
 321 oldest-xmin calculation (it used to, until v14). The contents of a snapshot
 322 only depend on the xids of other backends, not their xmin.  As a backend's xmin
 323 changes much more often than its xid, having GetSnapshotData look at xmins
 324 can lead to a lot of unnecessary cacheline ping-pong.  Instead
 325 GetSnapshotData updates approximate thresholds (one that guarantees that all
 326 deleted rows older than it can be removed, another determining that deleted
 327 rows newer than it cannot be removed).  The GlobalVisTest* functions use those
 328 thresholds to make invisibility decisions, falling back to ComputeXidHorizons if
 329 necessary.
 330
 331 Note that while it is certain that two concurrent executions of
 332 GetSnapshotData will compute the same xmin for their own snapshots, there is
 333 no such guarantee for the horizons computed by ComputeXidHorizons.  This is
 334 because we allow XID-less transactions to clear their MyProc->xmin
 335 asynchronously (without taking ProcArrayLock), so one execution might see
 336 what had been the oldest xmin, and another not.  This is OK since the
 337 thresholds need only be a valid lower bound.  As noted above, we are already
 338 assuming that fetch/store of the xid fields is atomic, so assuming it for
 339 xmin as well is no extra risk.
 340
 341
 342 pg_xact and pg_subtrans
 343 -----------------------
 344
 345 pg_xact and pg_subtrans are permanent (on-disk) storage of transaction related
 346 information.  There is a limited number of pages of each kept in memory, so
 347 in many cases there is no need to actually read from disk.  However, if
 348 there's a long running transaction or a backend sitting idle with an open
 349 transaction, it may be necessary to be able to read and write this information
 350 from disk.  They also allow information to be permanent across server restarts.
 351
 352 pg_xact records the commit status for each transaction that has been assigned
 353 an XID.  A transaction can be in progress, committed, aborted, or
 354 "sub-committed".  This last state means that it's a subtransaction that's no
 355 longer running, but its parent has not updated its state yet.  It is not
 356 necessary to update a subtransaction's transaction status to subcommit, so we
 357 can just defer it until main transaction commit.  The main role of marking
 358 transactions as sub-committed is to provide an atomic commit protocol when
 359 transaction status is spread across multiple clog pages. As a result, whenever
 360 transaction status spreads across multiple pages we must use a two-phase commit
 361 protocol: the first phase is to mark the subtransactions as sub-committed, then
 362 we mark the top level transaction and all its subtransactions committed (in
 363 that order).  Thus, subtransactions that have not aborted appear as in-progress
 364 even when they have already finished, and the subcommit status appears as a
 365 very short transitory state during main transaction commit.  Subtransaction
 366 abort is always marked in clog as soon as it occurs.  When the transaction
 367 statuses all fit in a single CLOG page, we atomically mark them all as committed
 368 without bothering with the intermediate sub-commit state.
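
The commit protocol can be sketched with a toy status array.  The sketch
assumes two status bits per XID, and the page size, array size and all
toy_* names are made up for illustration.  It marks every subtransaction as
sub-committed in phase one, which is a simplification, and it does not model
the page-level atomicity or locking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum
{
    TOY_IN_PROGRESS = 0,
    TOY_COMMITTED = 1,
    TOY_ABORTED = 2,
    TOY_SUB_COMMITTED = 3
};

#define TOY_XACTS_PER_PAGE 16u  /* tiny on purpose */
#define TOY_NUM_XACTS      64u

/* Two bits per XID; zero-initialized, so everything starts in progress. */
static uint8_t toy_clog[TOY_NUM_XACTS / 4];

void toy_set_status(uint32_t xid, int status)
{
    unsigned shift = (xid % 4) * 2;

    assert(xid < TOY_NUM_XACTS && status >= 0 && status <= 3);
    toy_clog[xid / 4] = (uint8_t) ((toy_clog[xid / 4] & ~(3u << shift)) |
                                   ((unsigned) status << shift));
}

int toy_get_status(uint32_t xid)
{
    assert(xid < TOY_NUM_XACTS);
    return (toy_clog[xid / 4] >> ((xid % 4) * 2)) & 3;
}

/* Commit a top-level XID together with its non-aborted subtransactions. */
void toy_commit_tree(uint32_t top, const uint32_t *subs, size_t nsubs)
{
    int multi_page = 0;

    for (size_t i = 0; i < nsubs; i++)
        if (subs[i] / TOY_XACTS_PER_PAGE != top / TOY_XACTS_PER_PAGE)
            multi_page = 1;

    /* Phase 1: needed only when the statuses span several pages. */
    if (multi_page)
        for (size_t i = 0; i < nsubs; i++)
            toy_set_status(subs[i], TOY_SUB_COMMITTED);

    /* Phase 2: the top-level XID first, then its subtransactions. */
    toy_set_status(top, TOY_COMMITTED);
    for (size_t i = 0; i < nsubs; i++)
        toy_set_status(subs[i], TOY_COMMITTED);
}
```

A reader that finds a sub-committed XID must consult the parent's status,
which is why the top-level XID is marked committed before the children.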
 369
 370 Savepoints are implemented using subtransactions.  A subtransaction is a
 371 transaction inside a transaction; its commit or abort status is not only
 372 dependent on whether it committed itself, but also whether its parent
 373 transaction committed.  To implement multiple savepoints in a transaction we
 374 allow unlimited transaction nesting depth, so any particular subtransaction's
 375 commit state is dependent on the commit status of each and every ancestor
 376 transaction.
 377
 378 The "subtransaction parent" (pg_subtrans) mechanism records, for each
 379 transaction with an XID, the TransactionId of its parent transaction.  This
 380 information is stored as soon as the subtransaction is assigned an XID.
 381 Top-level transactions do not have a parent, so they leave their pg_subtrans
 382 entries set to the default value of zero (InvalidTransactionId).
 383
 384 pg_subtrans is used to check whether the transaction in question is still
 385 running --- the main Xid of a transaction is recorded in ProcGlobal->xids[],
 386 with a copy in PGPROC->xid, but since we allow arbitrary nesting of
 387 subtransactions, we can't fit all Xids in shared memory, so we have to store
 388 them on disk.  Note, however, that for each transaction we keep a "cache" of
 389 Xids that are known to be part of the transaction tree, so we can skip looking
 390 at pg_subtrans unless we know the cache has been overflowed.  See
 391 storage/ipc/procarray.c for the gory details.
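
As an illustration of the parent-link idea, the sketch below keeps the
parent of each XID in a small in-memory array and walks the links up to the
top-level transaction.  The array, its size and the toy_* functions are
hypothetical; the real mechanism is page-based and goes through slru.c.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_INVALID_XID 0u
#define TOY_MAX_XID     32u

/* Parent XID per transaction; zero (the default) means "no parent". */
static uint32_t toy_parent[TOY_MAX_XID];

/* Record the parent when a subtransaction is assigned its XID. */
void toy_subtrans_set_parent(uint32_t xid, uint32_t parent)
{
    /* Children always have later XIDs than their parents. */
    assert(xid < TOY_MAX_XID && parent < xid);
    toy_parent[xid] = parent;
}

/* Follow parent links until a transaction without a parent is reached. */
uint32_t toy_get_topmost(uint32_t xid)
{
    assert(xid != TOY_INVALID_XID && xid < TOY_MAX_XID);
    while (toy_parent[xid] != TOY_INVALID_XID)
        xid = toy_parent[xid];
    return xid;
}
```

Because a child XID is always later than its parent's, the walk strictly
decreases and must terminate.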
 392
 393 slru.c is the supporting mechanism for both pg_xact and pg_subtrans.  It
 394 implements the LRU policy for in-memory buffer pages.  The high-level routines
 395 for pg_xact are implemented in transam.c, while the low-level functions are in
 396 clog.c.  pg_subtrans is contained completely in subtrans.c.
 397
 398
 399 Write-Ahead Log Coding
 400 ----------------------
 401
 402 The WAL subsystem (also called XLOG in the code) exists to guarantee crash
 403 recovery.  It can also be used to provide point-in-time recovery, as well as
 404 hot-standby replication via log shipping.  Here are some notes about
 405 non-obvious aspects of its design.
 406
 407 A basic assumption of a write AHEAD log is that log entries must reach stable
 408 storage before the data-page changes they describe.  This ensures that
 409 replaying the log to its end will bring us to a consistent state where there
 410 are no partially-performed transactions.  To guarantee this, each data page
 411 (either heap or index) is marked with the LSN (log sequence number --- in
 412 practice, a WAL file location) of the latest XLOG record affecting the page.
 413 Before the bufmgr can write out a dirty page, it must ensure that xlog has
 414 been flushed to disk at least up to the page's LSN.  This low-level
 415 interaction improves performance by not waiting for XLOG I/O until necessary.
 416 The LSN check exists only in the shared-buffer manager, not in the local
 417 buffer manager used for temp tables; hence operations on temp tables must not
 418 be WAL-logged.
 419
 420 During WAL replay, we can check the LSN of a page to detect whether the change
 421 recorded by the current log entry is already applied (it has been, if the page
 422 LSN is >= the log entry's WAL location).
 423
 424 Usually, log entries contain just enough information to redo a single
 425 incremental update on a page (or small group of pages).  This will work only
 426 if the filesystem and hardware implement data page writes as atomic actions,
 427 so that a page is never left in a corrupt partly-written state.  Since that's
 428 often an untenable assumption in practice, we log additional information to
 429 allow complete reconstruction of modified pages.  The first WAL record
 430 affecting a given page after a checkpoint is made to contain a copy of the
 431 entire page, and we implement replay by restoring that page copy instead of
 432 redoing the update.  (This is more reliable than the data storage itself would
 433 be because we can check the validity of the WAL record's CRC.)  We can detect
 434 the "first change after checkpoint" by noting whether the page's old LSN
 435 precedes the end of WAL as of the last checkpoint (the RedoRecPtr).
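
The three LSN comparisons discussed so far can be put side by side in a toy
model.  ToyPage, ToyLsn and the toy_* functions are hypothetical, and a page
is reduced to a single integer.  The full-page-image test is written here as
"at or before" the redo pointer, which is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t ToyLsn;

typedef struct
{
    ToyLsn lsn;     /* LSN of the latest WAL record that touched the page */
    int    value;   /* stand-in for the page contents */
} ToyPage;

/* WAL before data: write a dirty page only once WAL is flushed to its LSN. */
bool toy_can_write_page(const ToyPage *page, ToyLsn flushed_up_to)
{
    return page->lsn <= flushed_up_to;
}

/* First change after a checkpoint: the page LSN is not past the redo point. */
bool toy_needs_full_page_image(const ToyPage *page, ToyLsn redo_rec_ptr)
{
    return page->lsn <= redo_rec_ptr;
}

/*
 * Replay an "add delta" record.  Returns false, leaving the page alone,
 * when the page LSN shows the change has already been applied.
 */
bool toy_redo_add(ToyPage *page, ToyLsn record_lsn, int delta)
{
    assert(record_lsn > 0);
    if (page->lsn >= record_lsn)
        return false;
    page->value += delta;
    page->lsn = record_lsn;
    return true;
}
```

Replaying the same record twice is harmless in this model, since the second
attempt finds the page LSN already at the record's location.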
 436
 437 The general schema for executing a WAL-logged action is
 438
 439 1. Pin and exclusive-lock the shared buffer(s) containing the data page(s)
 440 to be modified.
 441
 442 2. START_CRIT_SECTION()  (Any error during the next three steps must cause a
 443 PANIC because the shared buffers will contain unlogged changes, which we
 444 have to ensure don't get to disk.  Obviously, you should check conditions
 445 such as whether there's enough free space on the page before you start the
 446 critical section.)
 447
 448 3. Apply the required changes to the shared buffer(s).
 449
 450 4. Mark the shared buffer(s) as dirty with MarkBufferDirty().  (This must
 451 happen before the WAL record is inserted; see notes in SyncOneBuffer().)
 452 Note that a buffer should be marked dirty with MarkBufferDirty() if and
 453 only if you write a WAL record; see Writing Hints below.
 454
 455 5. If the relation requires WAL-logging, build a WAL record using
 456 XLogBeginInsert and XLogRegister* functions, and insert it.  (See
 457 "Constructing a WAL record" below).  Then update the page's LSN using the
 458 returned XLOG location.  For instance,
 459
 460                 XLogBeginInsert();
 461                 XLogRegisterBuffer(...);
 462                 XLogRegisterData(...);
 463                 recptr = XLogInsert(rmgr_id, info);
 464
 465                 PageSetLSN(dp, recptr);
 466
 467 6. END_CRIT_SECTION()
 468
 469 7. Unlock and unpin the buffer(s).
 470
 471 Complex changes (such as a multilevel index insertion) normally need to be
 472 described by a series of atomic-action WAL records.  The intermediate states
 473 must be self-consistent, so that if the replay is interrupted between any
 474 two actions, the system is fully functional.  In btree indexes, for example,
 475 a page split requires a new page to be allocated, and an insertion of a new
 476 key in the parent btree level, but for locking reasons this has to be
 477 reflected by two separate WAL records.  Replaying the first record, to
 478 allocate the new page and move tuples to it, sets a flag on the page to
 479 indicate that the key has not been inserted to the parent yet.  Replaying the
 480 second record clears the flag.  This intermediate state is never seen by
 481 other backends during normal operation, because the lock on the child page
 482 is held across the two actions, but will be seen if the operation is
 483 interrupted before writing the second WAL record.  The search algorithm works
 484 with the intermediate state as normal, but if an insertion encounters a page
 485 with the incomplete-split flag set, it will finish the interrupted split by
 486 inserting the key to the parent, before proceeding.
 487
 488
 489 Constructing a WAL record
 490 -------------------------
 491
 492 A WAL record consists of a header common to all WAL record types,
 493 record-specific data, and information about the data blocks modified.  Each
 494 modified data block is identified by an ID number, and can optionally have
 495 more record-specific data associated with the block.  If XLogInsert decides
 496 that a full-page image of a block needs to be taken, the data associated
 497 with that block is not included.
 498
 499 The API for constructing a WAL record consists of five functions:
 500 XLogBeginInsert, XLogRegisterBuffer, XLogRegisterData, XLogRegisterBufData,
 501 and XLogInsert.  First, call XLogBeginInsert().  Then register all the buffers
 502 modified, and data needed to replay the changes, using XLogRegister*
 503 functions.  Finally, insert the constructed record to the WAL by calling
 504 XLogInsert().
 505
 506         XLogBeginInsert();
 507
 508         /* register buffers modified as part of this WAL-logged action */
 509         XLogRegisterBuffer(0, lbuffer, REGBUF_STANDARD);
 510         XLogRegisterBuffer(1, rbuffer, REGBUF_STANDARD);
 511
 512         /* register data that is always included in the WAL record */
 513         XLogRegisterData(&xlrec, SizeOfFictionalAction);
 514
 515         /*
 516          * register data associated with a buffer. This will not be included
 517          * in the record if a full-page image is taken.
 518          */
 519         XLogRegisterBufData(0, tuple->data, tuple->len);
 520
 521         /* more data associated with the buffer */
 522         XLogRegisterBufData(0, data2, len2);
 523
 524         /*
 525          * Ok, all the data and buffers to include in the WAL record have
 526          * been registered. Insert the record.
 527          */
 528         recptr = XLogInsert(RM_FOO_ID, XLOG_FOOBAR_DO_STUFF);
 529
 530 Details of the API functions:
 531
 532 void XLogBeginInsert(void)
 533
 534     Must be called before XLogRegisterBuffer and XLogRegisterData.
 535
 536 void XLogResetInsertion(void)
 537
 538     Clear any currently registered data and buffers from the WAL record
 539     construction workspace.  This is only needed if you have already called
 540     XLogBeginInsert(), but decide to not insert the record after all.
 541
 542 void XLogEnsureRecordSpace(int max_block_id, int ndatas)
 543
 544     Normally, the WAL record construction buffers have the following limits:
 545
 546     * highest block ID that can be used is 4 (allowing five block references)
 547     * Max 20 chunks of registered data
 548
 549     These default limits are enough for most record types that change some
 550     on-disk structures.  For the odd case that requires more data, or needs to
 551     modify more buffers, these limits can be raised by calling
 552     XLogEnsureRecordSpace().  XLogEnsureRecordSpace() must be called before
 553     XLogBeginInsert(), and outside a critical section.
 554
 555 void XLogRegisterBuffer(uint8 block_id, Buffer buf, uint8 flags);
 556
 557     XLogRegisterBuffer adds information about a data block to the WAL record.
 558     block_id is an arbitrary number used to identify this page reference in
 559     the redo routine.  The information needed to re-find the page at redo -
 560     relfilelocator, fork, and block number - is included in the WAL record.
 561
 562     XLogInsert will automatically include a full copy of the page contents, if
 563     this is the first modification of the buffer since the last checkpoint.
 564     It is important to register every buffer modified by the action with
 565     XLogRegisterBuffer, to avoid torn-page hazards.
 566
 567     The flags control when and how the buffer contents are included in the
 568     WAL record.  Normally, a full-page image is taken only if the page has not
 569     been modified since the last checkpoint, and only if full_page_writes=on
 570     or an online backup is in progress.  The REGBUF_FORCE_IMAGE flag can be
 571     used to force a full-page image to always be included; that is useful
 572     e.g. for an operation that rewrites most of the page, so that tracking the
 573     details is not worth it.  For the rare case where it is not necessary to
 574     protect against torn pages, the REGBUF_NO_IMAGE flag can be used to
 575     suppress the full-page image.  REGBUF_WILL_INIT also suppresses a full
 576     page image, but the redo routine must re-generate the page from scratch,
 577     without looking at the old page contents.  Re-initializing the page
 578     protects from torn page hazards like a full page image does.
 579
 580     The REGBUF_STANDARD flag can be specified together with the other flags to
 581     indicate that the page follows the standard page layout.  It causes the
 582     area between pd_lower and pd_upper to be left out from the image, reducing
 583     WAL volume.
 584
 585     If the REGBUF_KEEP_DATA flag is given, any per-buffer data registered with
 586     XLogRegisterBufData() is included in the WAL record even if a full-page
 587     image is taken.
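
The flag rules above can be summarised in a small standalone sketch.  This is
a toy model, not the code in xloginsert.c: the TOY_ names and their bit values
are hypothetical, and comparing the page LSN against a checkpoint redo pointer
is an assumption used here to stand in for "first modification since the last
checkpoint".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits for this sketch; not the values in xloginsert.h. */
enum
{
    TOY_FORCE_IMAGE = 1 << 0,   /* always include a full-page image */
    TOY_NO_IMAGE = 1 << 1,      /* never include a full-page image */
    TOY_WILL_INIT = 1 << 2,     /* redo rebuilds the page from scratch */
    TOY_KEEP_DATA = 1 << 3      /* keep per-buffer data even with an image */
};

/* Would the record carry a full-page image of this buffer? */
static bool
toy_needs_image(unsigned flags, uint64_t page_lsn, uint64_t redo_ptr,
                bool full_page_writes, bool backup_in_progress)
{
    /* The sketch treats contradictory flag combinations as caller bugs. */
    assert(!((flags & TOY_FORCE_IMAGE) &&
             (flags & (TOY_NO_IMAGE | TOY_WILL_INIT))));

    if (flags & TOY_FORCE_IMAGE)
        return true;
    if (flags & (TOY_NO_IMAGE | TOY_WILL_INIT))
        return false;

    /* Assumed test for "not modified since the last checkpoint". */
    return (full_page_writes || backup_in_progress) && page_lsn <= redo_ptr;
}

/* Is data registered for the buffer kept in the record? */
static bool
toy_includes_buf_data(unsigned flags, bool image_taken)
{
    return !image_taken || (flags & TOY_KEEP_DATA) != 0;
}
```

For instance, with no flags set, a page whose LSN is older than the redo
pointer gets an image only when full_page_writes is on or a backup is running,
and its per-buffer data is then dropped unless TOY_KEEP_DATA is given.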
 588
 589 void XLogRegisterData(const char *data, int len);
 590
 591     XLogRegisterData is used to include arbitrary data in the WAL record.  If
 592     XLogRegisterData() is called multiple times, the data are appended, and
 593     will be made available to the redo routine as one contiguous chunk.
 594
 595 void XLogRegisterBufData(uint8 block_id, const char *data, int len);
 596
 597     XLogRegisterBufData is used to include data associated with a particular
 598     buffer that was registered earlier with XLogRegisterBuffer().  If
 599     XLogRegisterBufData() is called multiple times with the same block ID, the
 600     data are appended, and will be made available to the redo routine as one
 601     contiguous chunk.
 602
 603     If a full-page image of the buffer is taken at insertion, the data is not
 604     included in the WAL record, unless the REGBUF_KEEP_DATA flag is used.
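
The calling rules for the construction workspace can be illustrated with a
standalone toy model.  Everything named Toy or toy_ below is hypothetical, and
the sketch copies registered data eagerly into one array purely for
simplicity; it only mirrors the rules stated above (default limits, raising
limits before beginning, appending chunks, and resetting).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_DEFAULT_MAX_BLOCK_ID 4  /* allows five block references */
#define TOY_DEFAULT_MAX_CHUNKS   20
#define TOY_MAIN_DATA_CAP        256

typedef struct ToyWorkspace
{
    bool    begun;          /* has toy_begin() been called? */
    int     max_block_id;   /* highest usable block ID */
    int     max_chunks;     /* limit on registered data chunks */
    int     nchunks;        /* chunks registered so far */
    size_t  main_len;       /* bytes of appended main data */
    char    main_data[TOY_MAIN_DATA_CAP];
} ToyWorkspace;

static void
toy_init(ToyWorkspace *ws)
{
    memset(ws, 0, sizeof(*ws));
    ws->max_block_id = TOY_DEFAULT_MAX_BLOCK_ID;
    ws->max_chunks = TOY_DEFAULT_MAX_CHUNKS;
}

/* Raise the limits; only allowed before construction begins. */
static bool
toy_ensure_space(ToyWorkspace *ws, int max_block_id, int ndatas)
{
    if (ws->begun)
        return false;
    if (max_block_id > ws->max_block_id)
        ws->max_block_id = max_block_id;
    if (ndatas > ws->max_chunks)
        ws->max_chunks = ndatas;
    return true;
}

static bool
toy_begin(ToyWorkspace *ws)
{
    if (ws->begun)
        return false;
    ws->begun = true;
    return true;
}

static bool
toy_block_id_ok(const ToyWorkspace *ws, int block_id)
{
    return block_id >= 0 && block_id <= ws->max_block_id;
}

/* Append one chunk; chunks end up as a single contiguous byte string. */
static bool
toy_register_data(ToyWorkspace *ws, const char *data, size_t len)
{
    assert(data != NULL);
    if (!ws->begun || ws->nchunks >= ws->max_chunks)
        return false;
    if (len > TOY_MAIN_DATA_CAP - ws->main_len)
        return false;
    memcpy(ws->main_data + ws->main_len, data, len);
    ws->main_len += len;
    ws->nchunks++;
    return true;
}

/* Abandon the record under construction. */
static void
toy_reset(ToyWorkspace *ws)
{
    ws->begun = false;
    ws->nchunks = 0;
    ws->main_len = 0;
}
```

Registering "abc" and then "de" in this model leaves the five bytes "abcde" in
main_data, which corresponds to the redo routine seeing one contiguous chunk.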
 605
 606
 607 Writing a REDO routine
 608 ----------------------
 609
 610 A REDO routine uses the data and page references included in the WAL record
 611 to reconstruct the new state of the page.  The record decoding functions
 612 and macros in xlogreader.c/h can be used to extract the data from the record.
 613
 614 When replaying a WAL record that describes changes on multiple pages, you
 615 must be careful to lock the pages properly to prevent concurrent Hot Standby
 616 queries from seeing an inconsistent state.  If this requires that two
 617 or more buffer locks be held concurrently, you must lock the pages in
 618 appropriate order, and not release the locks until all the changes are done.
 619
 620 Note that we must only use PageSetLSN()/PageGetLSN() when we know the action
 621 is serialised.  Only the Startup process may modify data blocks during
 622 recovery, so it may execute PageGetLSN() without fear of serialisation
 623 problems.  All other processes must only call these functions when holding
 624 either an exclusive buffer lock or a shared lock plus the buffer header lock,
 625 or when writing the data block directly rather than through shared buffers
 626 while holding AccessExclusiveLock on the relation.
 627
 628
 629 Writing Hints
 630 -------------
 631
 632 In some cases, we write additional information to data blocks without
 633 writing a preceding WAL record. This should happen only if the data can
 634 be reconstructed later following a crash and the action is simply a way
 635 of optimising for performance. When a hint is written we use
 636 MarkBufferDirtyHint() to mark the block dirty.
 637
 638 If the buffer is clean and checksums are in use then MarkBufferDirtyHint()
 639 inserts an XLOG_FPI_FOR_HINT record to ensure that we take a full page image
 640 that includes the hint. We do this to avoid a partial page write, when we
 641 write the dirtied page. WAL is not written during recovery, so we simply skip
 642 dirtying blocks because of hints when in recovery.
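
The decision described above can be sketched as a small standalone function.
The names are hypothetical, and the sketch reads the text as saying that the
dirtying is skipped in recovery only in the case where a WAL record would
otherwise have been required; that reading is an assumption of this example.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum ToyHintAction
{
    TOY_HINT_SKIPPED,           /* leave the buffer as it is */
    TOY_HINT_DIRTY_ONLY,        /* mark dirty, no WAL */
    TOY_HINT_DIRTY_WITH_FPI     /* log a full-page image, then mark dirty */
} ToyHintAction;

static ToyHintAction
toy_mark_dirty_hint(bool buffer_clean, bool checksums_enabled,
                    bool in_recovery)
{
    if (buffer_clean && checksums_enabled)
    {
        /* A full-page image is needed, but recovery cannot write WAL. */
        if (in_recovery)
            return TOY_HINT_SKIPPED;
        return TOY_HINT_DIRTY_WITH_FPI;
    }

    /* Already dirty, or torn pages cannot cause a checksum failure. */
    return TOY_HINT_DIRTY_ONLY;
}
```

Losing the hint in the skipped case is harmless, since by definition it can be
reconstructed later.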
 643
 644 If you do decide to optimise away a WAL record, then any calls to
 645 MarkBufferDirty() must be replaced by MarkBufferDirtyHint(),
 646 otherwise you will expose the risk of partial page writes.
 647
 648 The all-visible hint in a heap page (PD_ALL_VISIBLE) is a special
 649 case, because it is treated like a durable change in some respects and
 650 a hint in other respects. It must satisfy the invariant that, if a
 651 heap page's associated visibilitymap (VM) bit is set, then
 652 PD_ALL_VISIBLE is set on the heap page itself. Clearing of
 653 PD_ALL_VISIBLE is always treated like a fully-durable change to
 654 maintain this invariant. Additionally, if checksums or wal_log_hints
 655 are enabled, setting PD_ALL_VISIBLE is also treated like a
 656 fully-durable change to protect against torn pages.
 657
 658 But, if neither checksums nor wal_log_hints are enabled, torn pages
 659 are of no consequence if the only change is to PD_ALL_VISIBLE; so no
 660 full heap page image is taken, and the heap page's LSN is not
 661 updated. NB: it would be incorrect to update the heap page's LSN when
 662 applying this optimization, even though there is an associated WAL
 663 record, because subsequent modifiers (e.g. an unrelated UPDATE) of the
 664 page may falsely believe that a full page image is not required.
 665
 666 Write-Ahead Logging for Filesystem Actions
 667 ------------------------------------------
 668
 669 The previous section described how to WAL-log actions that only change page
 670 contents within shared buffers.  For that type of action it is generally
 671 possible to check all likely error cases (such as insufficient space on the
 672 page) before beginning to make the actual change.  Therefore we can make
 673 the change and the creation of the associated WAL log record "atomic" by
 674 wrapping them into a critical section --- the odds of failure partway
 675 through are low enough that PANIC is acceptable if it does happen.
 676
 677 Clearly, that approach doesn't work for cases where there's a significant
 678 probability of failure within the action to be logged, such as creation
 679 of a new file or database.  We don't want to PANIC, and we especially don't
 680 want to PANIC after having already written a WAL record that says we did
 681 the action --- if we did, replay of the record would probably fail again
 682 and PANIC again, making the failure unrecoverable.  This means that the
 683 ordinary WAL rule of "write WAL before the changes it describes" doesn't
 684 work, and we need a different design for such cases.
 685
 686 There are several basic types of filesystem actions that have this
 687 issue.  Here is how we deal with each:
 688
 689 1. Adding a disk page to an existing table.
 690
 691 This action isn't WAL-logged at all.  We extend a table by writing a page
 692 of zeroes at its end.  We must actually do this write so that we are sure
 693 the filesystem has allocated the space.  If the write fails we can just
 694 error out normally.  Once the space is known allocated, we can initialize
 695 and fill the page via one or more normal WAL-logged actions.  Because it's
 696 possible that we crash between extending the file and writing out the WAL
 697 entries, we have to treat discovery of an all-zeroes page in a table or
 698 index as being a non-error condition.  In such cases we can just reclaim
 699 the space for re-use.
 700
 701 2. Creating a new table, which requires a new file in the filesystem.
 702
 703 We try to create the file, and if successful we make a WAL record saying
 704 we did it.  If not successful, we can just throw an error.  Notice that
 705 there is a window where we have created the file but not yet written any
 706 WAL about it to disk.  If we crash during this window, the file remains
 707 on disk as an "orphan".  It would be possible to clean up such orphans
 708 by having database restart search for files that don't have any committed
 709 entry in pg_class, but that currently isn't done because of the possibility
 710 of deleting data that is useful for forensic analysis of the crash.
 711 Orphan files are harmless --- at worst they waste a bit of disk space ---
 712 because we check for on-disk collisions when allocating new relfilenumber
 713 OIDs.  So cleaning up isn't really necessary.
 714
 715 3. Deleting a table, which requires an unlink() that could fail.
 716
 717 Our approach here is to WAL-log the operation first, but to treat failure
 718 of the actual unlink() call as a warning rather than error condition.
 719 Again, this can leave an orphan file behind, but that's cheap compared to
 720 the alternatives.  Since we can't actually do the unlink() until after
 721 we've committed the DROP TABLE transaction, throwing an error would be out
 722 of the question anyway.  (It may be worth noting that the WAL entry about
 723 the file deletion is actually part of the commit record for the dropping
 724 transaction.)
 725
 726 4. Creating and deleting databases and tablespaces, which requires creating
 727 and deleting directories and entire directory trees.
 728
 729 These cases are handled similarly to creating individual files, i.e., we
 730 try to do the action first and then write a WAL entry if it succeeded.
 731 The potential amount of wasted disk space is rather larger, of course.
 732 In the creation case we try to delete the directory tree again if creation
 733 fails, so as to reduce the risk of wasted space.  Failure partway through
 734 a deletion operation results in a corrupt database: the DROP failed, but
 735 some of the data is gone anyway.  There is little we can do about that,
 736 though, and in any case it was presumably data the user no longer wants.
 737
 738 In all of these cases, if WAL replay fails to redo the original action
 739 we must panic and abort recovery.  The DBA will have to manually clean up
 740 (for instance, free up some disk space or fix directory permissions) and
 741 then restart recovery.  This is part of the reason for not writing a WAL
 742 entry until we've successfully done the original action.
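
The two orderings used above can be contrasted in a standalone sketch.  The
log here is just an in-memory array and the actions are callbacks; ToyLog,
ToyAction and the toy_ functions are hypothetical and stand in for the real
smgr and WAL machinery.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_LOG_CAP 8

typedef struct ToyLog
{
    int         nrecords;
    const char *records[TOY_LOG_CAP];
} ToyLog;

/* A filesystem action that may fail; returns true on success. */
typedef bool (*ToyAction) (void *arg);

/*
 * Creation-style actions: do the action first and log it only on success.
 * A failure is an ordinary error and leaves no record to replay.
 */
static bool
toy_do_then_log(ToyLog *log, ToyAction action, void *arg, const char *desc)
{
    assert(log != NULL && action != NULL);

    if (!action(arg))
        return false;

    assert(log->nrecords < TOY_LOG_CAP);
    log->records[log->nrecords++] = desc;
    return true;
}

/*
 * Deletion-style actions: log first, then act.  A failed action is only
 * worth a warning; the function returns true if a warning was raised.
 */
static bool
toy_log_then_do(ToyLog *log, ToyAction action, void *arg, const char *desc)
{
    assert(log != NULL && action != NULL);
    assert(log->nrecords < TOY_LOG_CAP);

    log->records[log->nrecords++] = desc;
    return !action(arg);
}

/* Trivial actions for experimenting with the two orderings. */
static bool
toy_succeed(void *arg)
{
    (void) arg;
    return true;
}

static bool
toy_fail(void *arg)
{
    (void) arg;
    return false;
}
```

With toy_fail as the action, toy_do_then_log leaves the log empty, while
toy_log_then_do still adds its record and merely reports a warning.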
 743
 744
 745 Skipping WAL for New RelFileLocator
 746 -----------------------------------
 747
 748 Under wal_level=minimal, if a change modifies a relfilenumber that ROLLBACK
 749 would unlink, in-tree access methods write no WAL for that change.  Code that
 750 writes WAL without calling RelationNeedsWAL() must check for this case.  This
 751 skipping is mandatory.  If a WAL-writing change preceded a WAL-skipping change
 752 for the same block, REDO could overwrite the WAL-skipping change.  If a
 753 WAL-writing change followed a WAL-skipping change for the same block, a
 754 related problem would arise.  When a WAL record contains no full-page image,
 755 REDO expects the page to match its contents from just before record insertion.
 756 A WAL-skipping change may not reach disk at all, violating REDO's expectation
 757 under full_page_writes=off.  For any access method, CommitTransaction() writes
 758 and fsyncs affected blocks before recording the commit.
 759
 760 Prefer to do the same in future access methods.  However, two other approaches
 761 can work.  First, an access method can irreversibly transition a given fork
 762 from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and
 763 smgrimmedsync().  Second, an access method can opt to write WAL
 764 unconditionally for permanent relations.  Under these approaches, the access
 765 method callbacks must not call functions that react to RelationNeedsWAL().
 766
 767 This applies only to WAL records whose replay would modify bytes stored in the
 768 new relfilenumber.  It does not apply to other records about the relfilenumber,
 769 such as XLOG_SMGR_CREATE.  Because it operates at the level of individual
 770 relfilenumbers, RelationNeedsWAL() can differ for tightly-coupled relations.
 771 Consider "CREATE TABLE t (); BEGIN; ALTER TABLE t ADD c text; ..." in which
 772 ALTER TABLE adds a TOAST relation.  The TOAST relation will skip WAL, while
 773 the table owning it will not.  ALTER TABLE SET TABLESPACE will cause a table
 774 to skip WAL, but that won't affect its indexes.
 775
 776
 777 Asynchronous Commit
 778 -------------------
 779
 780 As of PostgreSQL 8.3 it is possible to perform asynchronous commits - i.e.,
 781 we don't wait while the WAL record for the commit is fsync'ed.
 782 We perform an asynchronous commit when synchronous_commit = off.  Instead
 783 of performing an XLogFlush() up to the LSN of the commit, we merely note
 784 the LSN in shared memory.  The backend then continues with other work.
 785 We record the LSN only for an asynchronous commit, not an abort; there's
 786 never any need to flush an abort record, since the presumption after a
 787 crash would be that the transaction aborted anyway.
 788
 789 We always force synchronous commit when the transaction is deleting
 790 relations, to ensure the commit record is down to disk before the relations
 791 are removed from the filesystem.  Also, certain utility commands that have
 792 non-roll-backable side effects (such as filesystem changes) force sync
 793 commit to minimize the window in which the filesystem change has been made
 794 but the transaction isn't guaranteed committed.
 795
 796 The walwriter regularly wakes up (via wal_writer_delay) or is woken up
 797 (via its latch, which is set by backends committing asynchronously) and
 798 performs an XLogBackgroundFlush().  This checks the location of the last
 799 completely filled WAL page.  If that has moved forwards, then we write all
 800 the changed buffers up to that point, so that under full load we write
 801 only whole buffers.  If there has been a break in activity and the current
 802 WAL page is the same as before, then we find out the LSN of the most
 803 recent asynchronous commit, and write up to that point, if required (i.e.
 804 if it's in the current WAL page).  If more than wal_writer_delay has
 805 passed, or more than wal_writer_flush_after blocks have been written, since
 806 the last flush, WAL is also flushed up to the current location.  This
 807 arrangement in itself would guarantee that an async commit record reaches
 808 disk after at most two times wal_writer_delay after the transaction
 809 completes. However, we also allow XLogFlush to write/flush full buffers
 810 "flexibly" (i.e., not wrapping around at the end of the circular WAL buffer
 811 area), so as to minimize the number of writes issued under high load when
 812 multiple WAL pages are filled per walwriter cycle. This makes the worst-case
 813 delay three wal_writer_delay cycles.
 814
 815 There are some other subtle points to consider with asynchronous commits.
 816 First, for each page of CLOG we must remember the LSN of the latest commit
 817 affecting the page, so that we can enforce the same flush-WAL-before-write
 818 rule that we do for ordinary relation pages.  Otherwise the record of the
 819 commit might reach disk before the WAL record does.  Again, abort records
 820 need not factor into this consideration.
 821
 822 In fact, we store more than one LSN for each clog page.  This relates to
 823 the way we set transaction status hint bits during visibility tests.
 824 We must not set a transaction-committed hint bit on a relation page and
 825 have that record make it to disk prior to the WAL record of the commit.
 826 Since visibility tests are normally made while holding buffer share locks,
 827 we do not have the option of changing the page's LSN to guarantee WAL
 828 synchronization.  Instead, we defer the setting of the hint bit if we have
 829 not yet flushed WAL as far as the LSN associated with the transaction.
 830 This requires tracking the LSN of each unflushed async commit.  It is
 831 convenient to associate this data with clog buffers: because we will flush
 832 WAL before writing a clog page, we know that we do not need to remember a
 833 transaction's LSN longer than the clog page holding its commit status
 834 remains in memory.  However, the naive approach of storing an LSN for each
 835 clog position is unattractive: the LSNs are 32x bigger than the two-bit
 836 commit status fields, and so we'd need 256K of additional shared memory for
 837 each 8K clog buffer page.  We choose instead to store a smaller number of
 838 LSNs per page, where each LSN is the highest LSN associated with any
 839 transaction commit in a contiguous range of transaction IDs on that page.
 840 This saves storage at the price of some possibly-unnecessary delay in
 841 setting transaction hint bits.
 842
 843 How many transactions should share the same cached LSN (N)?  If the
 844 system's workload consists only of small async-commit transactions, then
 845 it's reasonable to have N similar to the number of transactions per
 846 walwriter cycle, since that is the granularity with which transactions will
 847 become truly committed (and thus hintable) anyway.  The worst case is where
 848 a sync-commit xact shares a cached LSN with an async-commit xact that
 849 commits a bit later; even though we paid to sync the first xact to disk,
 850 we won't be able to hint its outputs until the second xact is sync'd, up to
 851 three walwriter cycles later.  This argues for keeping N (the group size)
 852 as small as possible.  For the moment we are setting the group size to 32,
 853 which makes the LSN cache space the same size as the actual clog buffer
 854 space (independently of BLCKSZ).
 855
 856 It is useful that we can run both synchronous and asynchronous commit
 857 transactions concurrently, but the safety of this is perhaps not
 858 immediately obvious.  Assume we have two transactions, T1 and T2.  The Log
 859 Sequence Number (LSN) is the point in the WAL sequence where a transaction
 860 commit is recorded, so LSN1 and LSN2 are the commit records of those
 861 transactions.  If T2 can see changes made by T1 then when T2 commits it
 862 must be true that LSN2 follows LSN1.  Thus when T2 commits it is certain
 863 that all of the changes made by T1 are also now recorded in the WAL.  This
 864 is true whether T1 was asynchronous or synchronous.  As a result, it is
 865 safe for asynchronous commits and synchronous commits to work concurrently
 866 without endangering data written by synchronous commits.  Sub-transactions
 867 are not important here since the final write to disk only occurs at the
 868 commit of the top level transaction.
 869
 870 Changes to data blocks cannot reach disk unless WAL is flushed up to the
 871 point of the LSN of the data blocks.  Any attempt to write unsafe data to
 872 disk will trigger a write which ensures the safety of all data written by
 873 that and prior transactions.  Data blocks and clog pages are both protected
 874 by LSNs.
 875
 876 Changes to a temp table are not WAL-logged, hence could reach disk in
 877 advance of T1's commit, but we don't care since temp table contents don't
 878 survive crashes anyway.
 879
 880 Database writes that skip WAL for new relfilenumbers are also safe.  In these
 881 cases it's entirely possible for the data to reach disk before T1's commit,
 882 because T1 will fsync it down to disk without any sort of interlock.  However,
 883 all these paths are designed to write data that no other transaction can see
 884 until after T1 commits.  The situation is thus not different from ordinary
 885 WAL-logged updates.
 886
 887 Transaction Emulation during Recovery
 888 -------------------------------------
 889
 890 During Recovery we replay transaction changes in the order they occurred.
 891 As part of this replay we emulate some transactional behaviour, so that
 892 read-only backends can take MVCC snapshots. We do this by maintaining a
 893 list of XIDs belonging to transactions that are being replayed, so that
 894 each transaction that has recorded WAL records for database writes exists
 895 in the array until it commits. Further details are given in comments in
 896 procarray.c.
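
The bookkeeping described above can be pictured with a standalone toy array of
in-progress XIDs.  This is only a sketch under simplifying assumptions: the
ToyKnownXids structure and toy_ functions are hypothetical, the array is
small and unsorted, and none of the locking or subtransaction handling done in
procarray.c is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_XIDS 16

typedef struct ToyKnownXids
{
    int      count;
    uint32_t xids[TOY_MAX_XIDS];
} ToyKnownXids;

/* Return the array index of xid, or -1 if it is not present. */
static int
toy_find(const ToyKnownXids *known, uint32_t xid)
{
    for (int i = 0; i < known->count; i++)
    {
        if (known->xids[i] == xid)
            return i;
    }
    return -1;
}

/* Replaying a write by xid: make sure it is tracked as in progress. */
static void
toy_replay_write(ToyKnownXids *known, uint32_t xid)
{
    if (toy_find(known, xid) >= 0)
        return;
    assert(known->count < TOY_MAX_XIDS);
    known->xids[known->count++] = xid;
}

/* Replaying a commit or abort record: stop tracking the xid. */
static void
toy_replay_end(ToyKnownXids *known, uint32_t xid)
{
    int     i = toy_find(known, xid);

    if (i >= 0)
        known->xids[i] = known->xids[--known->count];
}

/* What a snapshot taken on the standby would treat as still running. */
static bool
toy_in_progress(const ToyKnownXids *known, uint32_t xid)
{
    return toy_find(known, xid) >= 0;
}
```

A transaction that never writes WAL is never passed to toy_replay_write, so it
never appears in the array.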
 897
 898 Many actions write no WAL records at all, for example read only transactions.
 899 These have no effect on MVCC in recovery and we can pretend they never
 900 occurred at all. Subtransaction commit does not write a WAL record either
 901 and has very little effect, since lock waiters need to wait for the
 902 parent transaction to complete.
 903
 904 Not all transactional behaviour is emulated, for example we do not insert
 905 a transaction entry into the lock table, nor do we maintain the transaction
 906 stack in memory. Clog, multixact and commit_ts entries are made normally.
 907 Subtrans is maintained during recovery but the details of the transaction
 908 tree are ignored and all subtransactions reference the top-level TransactionId
 909 directly. Since commit is atomic this provides correct lock wait behaviour
 910 yet simplifies emulation of subtransactions considerably.
 911
 912 Further details on locking mechanics in recovery are given in comments
 913 with the Lock rmgr code.