src/backend/storage/lmgr/README

   3 Locking Overview
   4 ================
   5
   6 Postgres uses four types of interprocess locks:
   7
   8 * Spinlocks.  These are intended for *very* short-term locks.  If a lock
   9 is to be held more than a few dozen instructions, or across any sort of
  10 kernel call (or even a call to a nontrivial subroutine), don't use a
  11 spinlock. Spinlocks are primarily used as infrastructure for lightweight
  12 locks. They are implemented using a hardware atomic-test-and-set
  13 instruction, if available.  Waiting processes busy-loop until they can
  14 get the lock. There is no provision for deadlock detection, automatic
  15 release on error, or any other nicety.  There is a timeout if the lock
  16 cannot be gotten after a minute or so (which is approximately forever in
  17 comparison to the intended lock hold time, so this is certainly an error
  18 condition).
  19
  20 * Lightweight locks (LWLocks).  These locks are typically used to
  21 interlock access to data structures in shared memory.  LWLocks support
  22 both exclusive and shared lock modes (for read/write and read-only
  23 access to a shared object). There is no provision for deadlock
  24 detection, but the LWLock manager will automatically release held
  25 LWLocks during elog() recovery, so it is safe to raise an error while
  26 holding LWLocks.  Obtaining or releasing an LWLock is quite fast (a few
  27 dozen instructions) when there is no contention for the lock.  When a
  28 process has to wait for an LWLock, it blocks on a SysV semaphore so as
  29 to not consume CPU time.  Waiting processes will be granted the lock in
  30 arrival order.  There is no timeout.
  31
  32 * Regular locks (a/k/a heavyweight locks).  The regular lock manager
  33 supports a variety of lock modes with table-driven semantics, and it has
  34 full deadlock detection and automatic release at transaction end.
  35 Regular locks should be used for all user-driven lock requests.
  36
  37 * SIReadLock predicate locks.  See separate README-SSI file for details.
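
The test-and-set busy loop described for spinlocks above can be sketched
with C11 atomics.  This is only an illustration: toy_spinlock and the
toy_spin_* functions are hypothetical names, not the PostgreSQL spinlock
API, and the sketch omits the stuck-lock timeout mentioned above.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical type for illustration; not PostgreSQL's spinlock type. */
typedef struct
{
    atomic_flag held;
} toy_spinlock;

static void
toy_spin_init(toy_spinlock *lock)
{
    atomic_flag_clear(&lock->held);
}

/* One atomic test-and-set attempt: returns true if we got the lock. */
static bool
toy_spin_try(toy_spinlock *lock)
{
    return !atomic_flag_test_and_set_explicit(&lock->held,
                                              memory_order_acquire);
}

/* Busy-loop until the lock is free.  No deadlock detection, no cleanup. */
static void
toy_spin_acquire(toy_spinlock *lock)
{
    while (!toy_spin_try(lock))
        ;                       /* spin */
}

static void
toy_spin_release(toy_spinlock *lock)
{
    atomic_flag_clear_explicit(&lock->held, memory_order_release);
}
```

Because a waiter burns CPU for as long as it spins, this pattern only
makes sense for the very short hold times described above.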
  38
  39 Acquisition of either a spinlock or a lightweight lock causes query
  40 cancel and die() interrupts to be held off until all such locks are
  41 released. No such restriction exists for regular locks, however.  Also
  42 note that we can accept query cancel and die() interrupts while waiting
  43 for a regular lock, but we will not accept them while waiting for
  44 spinlocks or LWLocks.  It is therefore not a good idea to use LWLocks
  45 when the wait time might exceed a few seconds.
  46
  47 The rest of this README file discusses the regular lock manager in detail.
  48
  49
  50 Lock Data Structures
  51 --------------------
  52
  53 Lock methods describe the overall locking behavior.  Currently there are
  54 two lock methods: DEFAULT and USER.
  55
  56 Lock modes describe the type of the lock (read/write or shared/exclusive).
  57 In principle, each lock method can have its own set of lock modes with
  58 different conflict rules, but currently DEFAULT and USER methods use
  59 identical lock mode sets. See src/include/storage/lock.h for more details.
  60 (Lock modes are also called lock types in some places in the code and
  61 documentation.)
  62
  63 There are two main methods for recording locks in shared memory.  The primary
  64 mechanism uses two main structures: the per-lockable-object LOCK struct, and
  65 the per-lock-and-requestor PROCLOCK struct.  A LOCK object exists for each
  66 lockable object that currently has locks held or requested on it.  A PROCLOCK
  67 struct exists for each backend that is holding or requesting lock(s) on each
  68 LOCK object.
  69
  70 There is also a special "fast path" mechanism which backends may use to
  71 record a limited number of locks with very specific characteristics: they must
  72 use the DEFAULT lock method; they must represent a lock on a database relation
  73 (not a shared relation); they must be a "weak" lock which is unlikely to
  74 conflict (AccessShareLock, RowShareLock, or RowExclusiveLock); and the system
  75 must be able to quickly verify that no conflicting locks could possibly be
  76 present.  See "Fast Path Locking", below, for more details.
  77
  78 Each backend also maintains an unshared LOCALLOCK structure for each lockable
  79 object and lock mode that it is currently holding or requesting.  The shared
  80 lock structures only allow a single lock grant to be made per lockable
  81 object/lock mode/backend.  Internally to a backend, however, the same lock may
  82 be requested and perhaps released multiple times in a transaction, and it can
  83 also be held both transactionally and session-wide.  The internal request
  84 counts are held in LOCALLOCK so that the shared data structures need not be
  85 accessed to alter them.
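
As a rough sketch of the idea (ToyLocalLock and both functions are
hypothetical names, and the real LOCALLOCK carries considerably more
state), the local count lets a backend touch shared memory only on the
first acquisition and the last release:

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for one LOCALLOCK entry. */
typedef struct
{
    int nLocks;     /* times this backend holds this object/lock mode */
} ToyLocalLock;

/* Returns true if the shared lock table must be touched (first hold). */
static bool
toy_local_acquire(ToyLocalLock *locallock)
{
    return locallock->nLocks++ == 0;
}

/* Returns true if the shared lock must now be released (last hold). */
static bool
toy_local_release(ToyLocalLock *locallock)
{
    return --locallock->nLocks == 0;
}
```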
  86
  87 ---------------------------------------------------------------------------
  88
  89 The lock manager's LOCK objects contain:
  90
  91 tag -
  92     The key fields that are used for hashing locks in the shared memory
  93     lock hash table.  The contents of the tag essentially define an
  94     individual lockable object.  See include/storage/lock.h for details
  95     about the supported types of lockable objects.  This is declared as
  96     a separate struct to ensure that we always zero out the correct number
  97     of bytes.  It is critical that any alignment-padding bytes the compiler
  98     might insert in the struct be zeroed out, else the hash computation
  99     will be random.  (Currently, we are careful to define struct LOCKTAG
 100     so that there are no padding bytes.)
 101
 102 grantMask -
 103     This bitmask indicates what types of locks are currently held on the
 104     given lockable object.  It is used (against the lock table's conflict
 105     table) to determine if a new lock request will conflict with existing
 106     lock types held.  Conflicts are determined by bitwise AND operations
 107     between the grantMask and the conflict table entry for the requested
 108     lock type.  Bit i of grantMask is 1 if and only if granted[i] > 0.
 109
 110 waitMask -
 111     This bitmask shows the types of locks being waited for.  Bit i of waitMask
 112     is 1 if and only if requested[i] > granted[i].
 113
 114 procLocks -
 115     This is a shared memory queue of all the PROCLOCK structs associated with
 116     the lock object.  Note that both granted and waiting PROCLOCKs are in this
 117     list (indeed, the same PROCLOCK might have some already-granted locks and
 118     be waiting for more!).
 119
 120 waitProcs -
 121     This is a shared memory queue of all PGPROC structures corresponding to
 122     backends that are waiting (sleeping) until another backend releases this
 123     lock.  The process structure holds the information needed to determine
 124     if it should be woken up when the lock is released.
 125
 126 nRequested -
 127     Keeps a count of how many attempts have been made to acquire this
 128     lock.  The count includes attempts by processes which were put
 129     to sleep due to conflicts.  It also counts the same backend twice
 130     if, for example, a backend process first acquires a read and then
 131     acquires a write.  (But multiple acquisitions of the same lock/lock mode
 132     within a backend are not multiply counted here; they are recorded
 133     only in the backend's LOCALLOCK structure.)
 134
 135 requested -
 136     Keeps a count of how many locks of each type have been requested.  Only
 137     elements 1 through MAX_LOCKMODES-1 are used as they correspond to the lock
 138     type defined constants.  Summing the values of requested[] should come out
 139     equal to nRequested.
 140
 141 nGranted -
 142     Keeps count of how many times this lock has been successfully acquired.
 143     This count does not include attempts that are waiting due to conflicts.
 144     Otherwise the counting rules are the same as for nRequested.
 145
 146 granted -
 147     Keeps count of how many locks of each type are currently held.  Once again
 148     only elements 1 through MAX_LOCKMODES-1 are used (0 is not).  Also, like
 149     requested[], summing the values of granted[] should total to the value
 150     of nGranted.
 151
 152 We should always have 0 <= nGranted <= nRequested, and
 153 0 <= granted[i] <= requested[i] for each i.  When all the request counts
 154 go to zero, the LOCK object is no longer needed and can be freed.
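
The relationships among these fields can be sketched as follows.  This
is a toy model, not the real struct: it uses a made-up three-mode
conflict table, treats every call as coming from a different backend,
and ignores the wait queue entirely.  All Toy/toy_ names are
hypothetical.

```c
#include <stdbool.h>

#define TOY_MAX_LOCKMODES 4             /* modes 1..3 are used, 0 is not */
#define TOY_BIT(mode) (1u << (mode))

enum { TOY_READ = 1, TOY_WRITE = 2, TOY_EXCL = 3 };

/* Entry i is the mask of modes that conflict with mode i (made up). */
static const unsigned toy_conflicts[TOY_MAX_LOCKMODES] = {
    0,
    TOY_BIT(TOY_EXCL),
    TOY_BIT(TOY_WRITE) | TOY_BIT(TOY_EXCL),
    TOY_BIT(TOY_READ) | TOY_BIT(TOY_WRITE) | TOY_BIT(TOY_EXCL)
};

typedef struct
{
    unsigned grantMask, waitMask;
    int nRequested, nGranted;
    int requested[TOY_MAX_LOCKMODES], granted[TOY_MAX_LOCKMODES];
} ToyLock;

/* Record a request; grant it only if no held mode conflicts with it. */
static bool
toy_request(ToyLock *lock, int mode)
{
    lock->nRequested++;
    lock->requested[mode]++;
    if (toy_conflicts[mode] & lock->grantMask)  /* the bitwise AND test */
    {
        lock->waitMask |= TOY_BIT(mode);
        return false;
    }
    lock->nGranted++;
    lock->granted[mode]++;
    lock->grantMask |= TOY_BIT(mode);
    return true;
}

/* Check the invariants stated in the text above. */
static bool
toy_invariants_hold(const ToyLock *lock)
{
    int sumRequested = 0, sumGranted = 0;

    for (int i = 1; i < TOY_MAX_LOCKMODES; i++)
    {
        bool gbit = (lock->grantMask & TOY_BIT(i)) != 0;
        bool wbit = (lock->waitMask & TOY_BIT(i)) != 0;

        if (lock->granted[i] < 0 || lock->granted[i] > lock->requested[i])
            return false;
        if (gbit != (lock->granted[i] > 0))
            return false;
        if (wbit != (lock->requested[i] > lock->granted[i]))
            return false;
        sumRequested += lock->requested[i];
        sumGranted += lock->granted[i];
    }
    return sumRequested == lock->nRequested &&
        sumGranted == lock->nGranted &&
        lock->nGranted <= lock->nRequested;
}
```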
 155
 156 ---------------------------------------------------------------------------
 157
 158 The lock manager's PROCLOCK objects contain:
 159
 160 tag -
 161     The key fields that are used for hashing entries in the shared memory
 162     PROCLOCK hash table.  This is declared as a separate struct to ensure that
 163     we always zero out the correct number of bytes.  It is critical that any
 164     alignment-padding bytes the compiler might insert in the struct be zeroed
 165     out, else the hash computation will be random.  (Currently, we are careful
 166     to define struct PROCLOCKTAG so that there are no padding bytes.)
 167
 168     tag.myLock
 169         Pointer to the shared LOCK object this PROCLOCK is for.
 170
 171     tag.myProc
 172         Pointer to the PGPROC of the backend process that owns this PROCLOCK.
 173
 174     Note: it's OK to use pointers here because a PROCLOCK never outlives
 175     either its lock or its proc.  The tag is therefore unique for as long
 176     as it needs to be, even though the same tag values might mean something
 177     else at other times.
 178
 179 holdMask -
 180     A bitmask for the lock modes successfully acquired by this PROCLOCK.
 181     This should be a subset of the LOCK object's grantMask, and also a
 182     subset of the PGPROC object's heldLocks mask (if the PGPROC is
 183     currently waiting for another lock mode on this lock).
 184
 185 releaseMask -
 186     A bitmask for the lock modes due to be released during LockReleaseAll.
 187     This must be a subset of the holdMask.  Note that it is modified without
 188     taking the partition LWLock, and therefore it is unsafe for any
 189     backend except the one owning the PROCLOCK to examine/change it.
 190
 191 lockLink -
 192     List link for shared memory queue of all the PROCLOCK objects for the
 193     same LOCK.
 194
 195 procLink -
 196     List link for shared memory queue of all the PROCLOCK objects for the
 197     same backend.
 198
 199 ---------------------------------------------------------------------------
 200
 201
 202 Lock Manager Internal Locking
 203 -----------------------------
 204
 205 Before PostgreSQL 8.2, all of the shared-memory data structures used by
 206 the lock manager were protected by a single LWLock, the LockMgrLock;
 207 any operation involving these data structures had to exclusively lock
 208 LockMgrLock.  Not too surprisingly, this became a contention bottleneck.
 209 To reduce contention, the lock manager's data structures have been split
 210 into multiple "partitions", each protected by an independent LWLock.
 211 Most operations only need to lock the single partition they are working in.
 212 Here are the details:
 213
 214 * Each possible lock is assigned to one partition according to a hash of
 215 its LOCKTAG value.  The partition's LWLock is considered to protect all the
 216 LOCK objects of that partition as well as their subsidiary PROCLOCKs.
 217
 218 * The shared-memory hash tables for LOCKs and PROCLOCKs are organized
 219 so that different partitions use different hash chains, and thus there
 220 is no conflict in working with objects in different partitions.  This
 221 is supported directly by dynahash.c's "partitioned table" mechanism
 222 for the LOCK table: we need only ensure that the partition number is
 223 taken from the low-order bits of the dynahash hash value for the LOCKTAG.
 224 To make it work for PROCLOCKs, we have to ensure that a PROCLOCK's hash
 225 value has the same low-order bits as its associated LOCK.  This requires
 226 a specialized hash function (see proclock_hash).
 227
 228 * Formerly, each PGPROC had a single list of PROCLOCKs belonging to it.
 229 This has now been split into per-partition lists, so that access to a
 230 particular PROCLOCK list can be protected by the associated partition's
 231 LWLock.  (This rule allows one backend to manipulate another backend's
 232 PROCLOCK lists, which was not originally necessary but is now required in
 233 connection with fast-path locking; see below.)
 234
 235 * The other lock-related fields of a PGPROC are only interesting when
 236 the PGPROC is waiting for a lock, so we consider that they are protected
 237 by the partition LWLock of the awaited lock.
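
The low-order-bits requirement can be illustrated with a small sketch.
The partition count, the function names, and the use of a plain integer
to identify a backend are all assumptions made for the example; see
proclock_hash for what the code really does.

```c
#include <stdint.h>

/* Assumed partition count for illustration only. */
#define TOY_LOG2_PARTITIONS 4
#define TOY_NUM_PARTITIONS  (1u << TOY_LOG2_PARTITIONS)

/* The partition number comes from the low-order bits of the hash. */
static uint32_t
toy_partition(uint32_t hashcode)
{
    return hashcode & (TOY_NUM_PARTITIONS - 1);
}

/*
 * Mix the owning backend into the LOCK's hash without disturbing the
 * low-order bits, so the PROCLOCK lands in the same partition as its
 * LOCK.
 */
static uint32_t
toy_proclock_hash(uint32_t lock_hash, uint32_t proc_id)
{
    return lock_hash ^ (proc_id << TOY_LOG2_PARTITIONS);
}
```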
 238
 239 For normal lock acquisition and release, it is sufficient to lock the
 240 partition containing the desired lock.  Deadlock checking needs to touch
 241 multiple partitions in general; for simplicity, we just make it lock all
 242 the partitions in partition-number order.  (To prevent LWLock deadlock,
 243 we establish the rule that any backend needing to lock more than one
 244 partition at once must lock them in partition-number order.)  It's
 245 possible that deadlock checking could be done without touching every
 246 partition in typical cases, but since in a properly functioning system
 247 deadlock checking should not occur often enough to be performance-critical,
 248 trying to make this work does not seem a productive use of effort.
 249
 250 A backend's internal LOCALLOCK hash table is not partitioned.  We do store
 251 a copy of the locktag hash code in LOCALLOCK table entries, from which the
 252 partition number can be computed, but this is a straight speed-for-space
 253 tradeoff: we could instead recalculate the partition number from the LOCKTAG
 254 when needed.
 255
 256
 257 Fast Path Locking
 258 -----------------
 259
 260 Fast path locking is a special purpose mechanism designed to reduce the
 261 overhead of taking and releasing certain types of locks which are taken
 262 and released very frequently but rarely conflict.  Currently, this includes
 263 two categories of locks:
 264
 265 (1) Weak relation locks.  SELECT, INSERT, UPDATE, and DELETE must acquire a
 266 lock on every relation they operate on, as well as various system catalogs
 267 that can be used internally.  Many DML operations can proceed in parallel
 268 against the same table at the same time; only DDL operations such as
 269 CLUSTER, ALTER TABLE, or DROP -- or explicit user action such as LOCK TABLE
 270 -- will create lock conflicts with the "weak" locks (AccessShareLock,
 271 RowShareLock, RowExclusiveLock) acquired by DML operations.
 272
 273 (2) VXID locks.  Every transaction takes a lock on its own virtual
 274 transaction ID.  Currently, the only operations that wait for these locks
 275 are CREATE INDEX CONCURRENTLY and Hot Standby (in the case of a conflict),
 276 so most VXID locks are taken and released by the owner without anyone else
 277 needing to care.
 278
 279 The primary locking mechanism does not cope well with this workload.  Even
 280 though the lock manager locks are partitioned, the locktag for any given
 281 relation still falls in one, and only one, partition.  Thus, if many short
 282 queries are accessing the same relation, the lock manager partition lock for
 283 that partition becomes a contention bottleneck.  This effect is measurable
 284 even on 2-core servers, and becomes very pronounced as core count increases.
 285
 286 To alleviate this bottleneck, beginning in PostgreSQL 9.2, each backend is
 287 permitted to record a limited number of locks on unshared relations in an
 288 array within its PGPROC structure, rather than using the primary lock table.
 289 This mechanism can only be used when the locker can verify that no conflicting
 290 locks exist at the time of taking the lock.
 291
 292 A key point of this algorithm is that it must be possible to verify the
 293 absence of possibly conflicting locks without fighting over a shared LWLock or
 294 spinlock.  Otherwise, this effort would simply move the contention bottleneck
 295 from one place to another.  We accomplish this using an array of 1024 integer
 296 counters, which are in effect a 1024-way partitioning of the lock space.
 297 Each counter records the number of "strong" locks (that is, ShareLock,
 298 ShareRowExclusiveLock, ExclusiveLock, and AccessExclusiveLock) on unshared
 299 relations that fall into that partition.  When this counter is non-zero, the
 300 fast path mechanism may not be used to take new relation locks within that
 301 partition.  A strong locker bumps the counter and then scans each per-backend
 302 array for matching fast-path locks; any which are found must be transferred to
 303 the primary lock table before attempting to acquire the lock, to ensure proper
 304 lock conflict and deadlock detection.
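
A single-process sketch of the counter check follows.  It deliberately
leaves out the per-backend arrays, the transfer of fast-path locks to
the primary lock table, and all synchronization; the toy_ names and the
enum are hypothetical, with lock modes listed from weakest to strongest.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_STRONG_PARTITIONS 1024

enum ToyMode
{
    TOY_ACCESS_SHARE = 1, TOY_ROW_SHARE, TOY_ROW_EXCLUSIVE,
    TOY_SHARE_UPDATE_EXCLUSIVE, TOY_SHARE, TOY_SHARE_ROW_EXCLUSIVE,
    TOY_EXCLUSIVE, TOY_ACCESS_EXCLUSIVE
};

/* One counter of "strong" lockers per partition of the lock space. */
static uint32_t toy_strong_count[TOY_STRONG_PARTITIONS];

static bool toy_is_weak(enum ToyMode m)   { return m <= TOY_ROW_EXCLUSIVE; }
static bool toy_is_strong(enum ToyMode m) { return m >= TOY_SHARE; }

static uint32_t
toy_strong_slot(uint32_t locktag_hash)
{
    return locktag_hash % TOY_STRONG_PARTITIONS;
}

/* A weak locker may use the fast path only if the counter is zero. */
static bool
toy_may_use_fast_path(enum ToyMode m, bool shared_relation, uint32_t hash)
{
    return toy_is_weak(m) && !shared_relation &&
        toy_strong_count[toy_strong_slot(hash)] == 0;
}

/* A strong locker bumps the counter before doing anything else. */
static void
toy_begin_strong_lock(enum ToyMode m, uint32_t hash)
{
    if (toy_is_strong(m))
        toy_strong_count[toy_strong_slot(hash)]++;
}

static void
toy_end_strong_lock(enum ToyMode m, uint32_t hash)
{
    if (toy_is_strong(m))
        toy_strong_count[toy_strong_slot(hash)]--;
}
```

Note that a strong lock blocks the fast path for every relation whose
hash falls into the same counter, not just for the relation actually
being locked.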
 305
 306 On an SMP system, we must guarantee proper memory synchronization.  Here we
 307 rely on the fact that LWLock acquisition acts as a memory sequence point: if
 308 A performs a store, A and B both acquire an LWLock in either order, and B
 309 then performs a load on the same memory location, it is guaranteed to see
 310 A's store.  In this case, each backend's fast-path lock queue is protected
 311 by an LWLock.  A backend wishing to acquire a fast-path lock grabs this
 312 LWLock before examining FastPathStrongRelationLocks to check for the presence
 313 of a conflicting strong lock.  And the backend attempting to acquire a strong
 314 lock, because it must transfer any matching weak locks taken via the fast-path
 315 mechanism to the shared lock table, will acquire every LWLock protecting a
 316 backend fast-path queue in turn.  So, if we examine
 317 FastPathStrongRelationLocks and see a zero, then either the value is truly
 318 zero, or if it is a stale value, the strong locker has yet to acquire the
 319 per-backend LWLock we now hold (or, indeed, even the first per-backend LWLock)
 320 and will notice any weak lock we take when it does.
 321
 322 Fast-path VXID locks do not use the FastPathStrongRelationLocks table.  The
 323 first lock taken on a VXID is always the ExclusiveLock taken by its owner.
 324 Any subsequent lockers are share lockers waiting for the VXID to terminate.
 325 Indeed, the only reason VXID locks use the lock manager at all (rather than
 326 waiting for the VXID to terminate via some other method) is for deadlock
 327 detection.  Thus, the initial VXID lock can *always* be taken via the fast
 328 path without checking for conflicts.  Any subsequent locker must check
 329 whether the lock has been transferred to the main lock table, and if not,
 330 do so.  The backend owning the VXID must be careful to clean up any entry
 331 made in the main lock table at end of transaction.
 332
 333 Deadlock detection does not need to examine the fast-path data structures,
 334 because any lock that could possibly be involved in a deadlock must have
 335 been transferred to the main tables beforehand.
 336
 337
 338 The Deadlock Detection Algorithm
 339 --------------------------------
 340
 341 Since we allow user transactions to request locks in any order, deadlock
 342 is possible.  We use a deadlock detection/breaking algorithm that is
 343 fairly standard in essence, but there are many special considerations
 344 needed to deal with Postgres' generalized locking model.
 345
 346 A key design consideration is that we want to make routine operations
 347 (lock grant and release) run quickly when there is no deadlock, and
 348 avoid the overhead of deadlock handling as much as possible.  We do this
 349 using an "optimistic waiting" approach: if a process cannot acquire the
 350 lock it wants immediately, it goes to sleep without any deadlock check.
 351 But it also sets a delay timer, with a delay of DeadlockTimeout
 352 milliseconds (typically set to one second).  If the delay expires before
 353 the process is granted the lock it wants, it runs the deadlock
 354 detection/breaking code. Normally this code will determine that there is
 355 no deadlock condition, and then the process will go back to sleep and
 356 wait quietly until it is granted the lock.  But if a deadlock condition
 357 does exist, it will be resolved, usually by aborting the detecting
 358 process' transaction.  In this way, we avoid deadlock handling overhead
 359 whenever the wait time for a lock is less than DeadlockTimeout, while
 360 not imposing an unreasonable delay of detection when there is an error.
 361
 362 Lock acquisition (routines LockAcquire and ProcSleep) follows these rules:
 363
 364 1. A lock request is granted immediately if it does not conflict with
 365 any existing or waiting lock request, or if the process already holds an
 366 instance of the same lock type (e.g., there's no penalty for acquiring a
 367 read lock twice).  Note that a process never conflicts with itself; e.g.,
 368 it can obtain a read lock when it already holds an exclusive lock.
 369
 370 2. Otherwise the process joins the lock's wait queue.  Normally it will
 371 be added to the end of the queue, but there is an exception: if the
 372 process already holds locks on this same lockable object that conflict
 373 with the request of any pending waiter, then the process will be
 374 inserted in the wait queue just ahead of the first such waiter.  (If we
 375 did not make this check, the deadlock detection code would adjust the
 376 queue order to resolve the conflict, but it's relatively cheap to make
 377 the check in ProcSleep and avoid a deadlock timeout delay in this case.)
 378 Note the special case when inserting before the end of the queue: if the
 379 process's request conflicts with neither any existing lock nor any
 380 waiting request before its insertion point, then we go ahead and grant
 381 the lock without waiting.
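
The choice of insertion point in rule 2 might look roughly like this.
The two-mode conflict table and the function name are made up for the
example, and the immediate-grant special case is not modeled.

```c
#define TOY_BIT(mode) (1u << (mode))

enum { TOY_READ = 1, TOY_WRITE = 2 };

/* Entry i is the mask of modes that conflict with mode i (made up). */
static const unsigned toy_conflicts[3] = {
    0,
    TOY_BIT(TOY_WRITE),
    TOY_BIT(TOY_READ) | TOY_BIT(TOY_WRITE)
};

/*
 * Return the queue index at which a new waiter should be inserted:
 * just ahead of the first waiter whose request conflicts with a lock
 * the new waiter already holds, or at the end if there is none.
 */
static int
toy_insert_position(unsigned my_held_mask, const int *waiter_modes,
                    int nwaiters)
{
    for (int i = 0; i < nwaiters; i++)
    {
        if (toy_conflicts[waiter_modes[i]] & my_held_mask)
            return i;
    }
    return nwaiters;
}
```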
 382
 383 When a lock is released, the lock release routine (ProcLockWakeup) scans
 384 the lock object's wait queue.  Each waiter is awoken if (a) its request
 385 does not conflict with already-granted locks, and (b) its request does
 386 not conflict with the requests of prior un-wakable waiters.  Rule (b)
 387 ensures that conflicting requests are granted in order of arrival. There
 388 are cases where a later waiter must be allowed to go in front of
 389 conflicting earlier waiters to avoid deadlock, but it is not
 390 ProcLockWakeup's responsibility to recognize these cases; instead, the
 391 deadlock detection code will re-order the wait queue when necessary.
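
Rules (a) and (b) amount to a single front-to-back scan of the queue.
The sketch below assumes a made-up two-mode conflict table and does not
model the fact that a waiter never conflicts with locks it holds
itself; toy_wakeup is a hypothetical name, not the real routine.

```c
#include <stdbool.h>

#define TOY_BIT(mode) (1u << (mode))

enum { TOY_READ = 1, TOY_WRITE = 2 };

/* Entry i is the mask of modes that conflict with mode i (made up). */
static const unsigned toy_conflicts[3] = {
    0,
    TOY_BIT(TOY_WRITE),
    TOY_BIT(TOY_READ) | TOY_BIT(TOY_WRITE)
};

/*
 * Scan the wait queue from the front.  Sets woken[i] for each waiter
 * that can be granted its lock, updates *grantMask accordingly, and
 * returns the number of waiters woken.
 */
static int
toy_wakeup(unsigned *grantMask, const int *waiter_modes, bool *woken,
           int nwaiters)
{
    unsigned aheadRequests = 0;     /* modes wanted by un-wakable waiters */
    int nwoken = 0;

    for (int i = 0; i < nwaiters; i++)
    {
        int mode = waiter_modes[i];
        bool ok = (toy_conflicts[mode] & *grantMask) == 0 &&    /* (a) */
            (toy_conflicts[mode] & aheadRequests) == 0;         /* (b) */

        woken[i] = ok;
        if (ok)
        {
            *grantMask |= TOY_BIT(mode);
            nwoken++;
        }
        else
            aheadRequests |= TOY_BIT(mode);
    }
    return nwoken;
}
```

With an empty grantMask and waiters wanting READ, WRITE, READ in that
order, only the first waiter is woken: the writer conflicts with the
newly granted READ, and the second reader is held back by rule (b).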
 392
 393 To perform deadlock checking, we use the standard method of viewing the
 394 various processes as nodes in a directed graph (the waits-for graph or
 395 WFG).  There is a graph edge leading from process A to process B if A
 396 waits for B, i.e., A is waiting for some lock and B holds a conflicting
 397 lock.  There is a deadlock condition if and only if the WFG contains a
 398 cycle.  We detect cycles by searching outward along waits-for edges to
 399 see if we return to our starting point.  There are three possible
 400 outcomes:
 401
 402 1. All outgoing paths terminate at a running process (which has no
 403 outgoing edge).
 404
 405 2. A deadlock is detected by looping back to the start point.  We
 406 resolve such a deadlock by canceling the start point's lock request and
 407 reporting an error in that transaction, which normally leads to
 408 transaction abort and release of that transaction's held locks.  Note
 409 that it's sufficient to cancel one request to remove the cycle; we don't
 410 need to kill all the transactions involved.
 411
 412 3. Some path(s) loop back to a node other than the start point.  This
 413 indicates a deadlock, but one that does not involve our starting
 414 process. We ignore this condition on the grounds that resolving such a
 415 deadlock is the responsibility of the processes involved --- killing our
 416 start-point process would not resolve the deadlock.  So, cases 1 and 3
 417 both report "no deadlock".
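
The three outcomes can be reproduced with a plain depth-first search
over an adjacency matrix.  This is a sketch only: ToyWFG and the toy_
functions are hypothetical, and the real search also has to tell soft
edges from hard ones.

```c
#include <stdbool.h>

#define TOY_MAX_PROCS 8

/* edge[a][b] is true when process a waits for process b. */
typedef struct
{
    int nprocs;
    bool edge[TOY_MAX_PROCS][TOY_MAX_PROCS];
} ToyWFG;

/* Depth-first search: can we get from cur back to start? */
static bool
toy_search(const ToyWFG *g, int cur, int start, bool *visited)
{
    for (int next = 0; next < g->nprocs; next++)
    {
        if (!g->edge[cur][next])
            continue;
        if (next == start)
            return true;        /* outcome 2: looped back to the start */
        if (visited[next])
            continue;           /* outcome 3: a loop not involving start */
        visited[next] = true;
        if (toy_search(g, next, start, visited))
            return true;
    }
    return false;               /* outcome 1: all paths ended */
}

/* Report a deadlock only if the cycle includes the start process. */
static bool
toy_deadlock_involving(const ToyWFG *g, int start)
{
    bool visited[TOY_MAX_PROCS] = {false};

    visited[start] = true;
    return toy_search(g, start, start, visited);
}
```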
 418
 419 Postgres' situation is a little more complex than the standard discussion
 420 of deadlock detection, for two reasons:
 421
 422 1. A process can be waiting for more than one other process, since there
 423 might be multiple PROCLOCKs of (non-conflicting) lock types that all
 424 conflict with the waiter's request.  This creates no real difficulty
 425 however; we simply need to be prepared to trace more than one outgoing
 426 edge.
 427
 428 2. If a process A is behind a process B in some lock's wait queue, and
 429 their requested locks conflict, then we must say that A waits for B, since
 430 ProcLockWakeup will never awaken A before B.  This creates additional
 431 edges in the WFG.  We call these "soft" edges, as opposed to the "hard"
 432 edges induced by locks already held.  Note that if B already holds any
 433 locks conflicting with A's request, then their relationship is a hard edge
 434 not a soft edge.
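
The distinction between the two kinds of edge can be written down as a
small classification function.  The conflict table, the enum, and the
function name are all assumptions made for this sketch.

```c
#include <stdbool.h>

#define TOY_BIT(mode) (1u << (mode))

enum { TOY_READ = 1, TOY_WRITE = 2 };

/* Entry i is the mask of modes that conflict with mode i (made up). */
static const unsigned toy_conflicts[3] = {
    0,
    TOY_BIT(TOY_WRITE),
    TOY_BIT(TOY_READ) | TOY_BIT(TOY_WRITE)
};

typedef enum { TOY_NO_EDGE, TOY_SOFT_EDGE, TOY_HARD_EDGE } ToyEdge;

/*
 * Classify the waits-for edge from waiter A to process B on one lock.
 * b_request is 0 if B is not itself waiting on this lock.
 */
static ToyEdge
toy_edge_from_a_to_b(int a_request, unsigned b_held_mask, int b_request,
                     bool b_ahead_of_a)
{
    /* B already holds something A's request conflicts with: hard. */
    if (toy_conflicts[a_request] & b_held_mask)
        return TOY_HARD_EDGE;
    /* B is merely queued ahead of A with a conflicting request: soft. */
    if (b_request != 0 && b_ahead_of_a &&
        (toy_conflicts[a_request] & TOY_BIT(b_request)))
        return TOY_SOFT_EDGE;
    return TOY_NO_EDGE;
}
```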
 435
 436 A "soft" block, or wait-priority block, has the same potential for
 437 inducing deadlock as a hard block.  However, we may be able to resolve
 438 a soft block without aborting the transactions involved: we can instead
 439 rearrange the order of the wait queue.  This rearrangement reverses the
 440 direction of the soft edge between two processes with conflicting requests
 441 whose queue order is reversed.  If we can find a rearrangement that
 442 eliminates a cycle without creating new ones, then we can avoid an abort.
 443 Checking for such possible rearrangements is the trickiest part of the
 444 algorithm.
 445
 446 The workhorse of the deadlock detector is a routine FindLockCycle() which
 447 is given a starting point process (which must be a waiting process).
 448 It recursively scans outward across waits-for edges as discussed above.
 449 If it finds no cycle involving the start point, it returns "false".
 450 (As discussed above, we can ignore cycles not involving the start point.)
 451 When a cycle involving the start point is found, it returns "true", and as it
 452 unwinds it also builds a list of any "soft" edges involved in the cycle.
 453 If the resulting list is empty then there is a hard deadlock and the
 454 configuration cannot succeed.  However, if the list is not empty, then
 455 reversing any one of the listed edges through wait-queue rearrangement
 456 will eliminate that cycle.  Since such a reversal might create cycles
 457 elsewhere, we may need to try every possibility.  Therefore, we need to
 458 be able to invoke FindLockCycle() on hypothetical configurations (wait
 459 orders) as well as the current real order.
 460
 461 The easiest way to handle this seems to be to have a lookaside table that
 462 shows the proposed new queue order for each wait queue that we are
 463 considering rearranging.  This table is checked by FindLockCycle, and it
 464 believes the proposed queue order rather than the real order for each lock
 465 that has an entry in the lookaside table.
 466
 467 We build a proposed new queue order by doing a "topological sort" of the
 468 existing entries.  Each soft edge that we are currently considering
 469 reversing creates a property of the partial order that the topological sort
 470 has to enforce.  We must use a sort method that preserves the input
 471 ordering as much as possible, so as not to gratuitously break arrival
 472 order for processes not involved in a deadlock.  (This is not true of the
 473 tsort method shown in Knuth, for example, but it's easily done by a simple
 474 doubly-nested-loop method that emits the first legal candidate at each
 475 step.  Fortunately, we don't need a highly efficient sort algorithm, since
 476 the number of partial order constraints is not likely to be large.)  Note
 477 that failure of the topological sort tells us we have conflicting ordering
 478 constraints, and therefore that the last-added soft edge reversal
 479 conflicts with a prior edge reversal.  We need to detect this case to
 480 avoid an infinite loop in the case where no possible rearrangement will
 481 work: otherwise, we might try a reversal, find that it still leads to
 482 a cycle, then try to un-reverse the reversal while trying to get rid of
 483 that cycle, and so on.  Topological sort failure tells us the un-reversal
 484 is not a legitimate move in this context.
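
An order-preserving sort of the kind described above can be sketched
with simple nested loops that emit the first legal candidate at each
step.  This follows the description in the text rather than the actual
code, which may pick among legal orders differently; ToyConstraint,
TOY_MAX_QUEUE, and toy_topo_sort are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_QUEUE 16

/* "before" must end up ahead of "after" in the output order. */
typedef struct
{
    int before;
    int after;
} ToyConstraint;

/*
 * Write the rearranged queue to out[].  Returns false if the
 * constraints contradict each other, in which case out[] is garbage.
 */
static bool
toy_topo_sort(const int *queue, int n, const ToyConstraint *cons,
              int ncons, int *out)
{
    bool emitted[TOY_MAX_QUEUE] = {false};

    assert(n <= TOY_MAX_QUEUE);
    for (int k = 0; k < n; k++)
    {
        int pick = -1;

        /* Take the first unemitted entry with no pending predecessor. */
        for (int i = 0; i < n && pick < 0; i++)
        {
            bool legal = !emitted[i];

            for (int c = 0; c < ncons && legal; c++)
            {
                if (cons[c].after != queue[i])
                    continue;
                for (int j = 0; j < n; j++)
                {
                    if (!emitted[j] && queue[j] == cons[c].before)
                        legal = false;
                }
            }
            if (legal)
                pick = i;
        }
        if (pick < 0)
            return false;       /* conflicting ordering constraints */
        emitted[pick] = true;
        out[k] = queue[pick];
    }
    return true;
}
```

Given the queue 0,1,2,3 and the single constraint "3 before 1", this
yields 0,2,3,1: entries 0 and 2 keep their relative arrival order, and
only entry 1 is moved.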
 485
 486 So, the basic step in our rearrangement method is to take a list of
 487 soft edges in a cycle (as returned by FindLockCycle()) and successively
 488 try the reversal of each one as a topological-sort constraint added to
 489 whatever constraints we are already considering.  We recursively search
 490 through all such sets of constraints to see if any one eliminates all
 491 the deadlock cycles at once.  Although this might seem impossibly
 492 inefficient, it shouldn't be a big problem in practice, because there
 493 will normally be very few, and not very large, deadlock cycles --- if
 494 any at all.  So the combinatorial inefficiency isn't going to hurt us.
 495 Besides, it's better to spend some time to guarantee that we've checked
 496 all possible escape routes than to abort a transaction when we didn't
 497 really have to.
 498
 499 Each edge reversal constraint can be viewed as requesting that the waiting
 500 process A be moved to before the blocking process B in the wait queue they
 501 are both in.  This action will reverse the desired soft edge, as well as
 502 any other soft edges between A and other processes it is advanced over.
 503 No other edges will be affected.  (This is actually a constraint on our
 504 topological sort method: it must not re-order the queue more than necessary.)
 505 Therefore, we can be sure we have not created any new deadlock cycles if
 506 neither FindLockCycle(A) nor FindLockCycle(B) discovers any cycle.  Given
 507 the above-defined behavior of FindLockCycle, each of these searches is
 508 necessary as well as sufficient, since FindLockCycle starting at the
 509 original start point will not complain about cycles that include A or B
 510 but not the original start point.
 511
 512 In short then, a proposed rearrangement of the wait queue(s) is determined
 513 by one or more broken soft edges A->B, fully specified by the output of
 514 topological sorts of each wait queue involved, and then tested by invoking
 515 FindLockCycle() starting at the original start point as well as each of
 516 the mentioned processes (A's and B's).  If none of the tests detect a
 517 cycle, then we have a valid configuration and can implement it by
 518 reordering the wait queues per the sort outputs (and then applying
 519 ProcLockWakeup on each reordered queue, in case a waiter has become wakable).
 520 If any test detects a soft cycle, we can try to resolve it by adding each
 521 soft link in that cycle, in turn, to the proposed rearrangement list.
 522 This is repeated recursively until we either find a workable rearrangement
 523 or determine that none exists.  In the latter case, the outer level
 524 resolves the deadlock by aborting the original start-point transaction.
 525
 526 The particular order in which rearrangements are tried depends on the
 527 order FindLockCycle() happens to scan in, so if there are multiple
 528 workable rearrangements of the wait queues, then it is unspecified which
 529 one will be chosen.  What's more important is that we guarantee to try
 530 every queue rearrangement that could lead to success.  (For example,
 531 if we have A before B before C and the needed order constraints are
 532 C before A and B before C, we would first discover that A before C
 533 doesn't work and try the rearrangement C before A before B.  This would
 534 eventually lead to the discovery of the additional constraint B before C.)
 535
 536 Got that?
 537
 538 Miscellaneous Notes
 539 -------------------
 540
 541 1. It is easily proven that no deadlock will be missed due to our
 542 asynchronous invocation of deadlock checking.  A deadlock cycle in the WFG
 543 is formed when the last edge in the cycle is added; therefore the last
 544 process in the cycle to wait (the one from which that edge is outgoing) is
 545 certain to detect and resolve the cycle when it later runs CheckDeadLock.
 546 This holds even if that edge addition created multiple cycles; the process
 547 may indeed abort without ever noticing those additional cycles, but we
 548 don't particularly care.  The only other possible creation of deadlocks is
 549 during deadlock resolution's rearrangement of wait queues, and we already
 550 saw that that algorithm will prove that it creates no new deadlocks before
 551 it attempts to actually execute any rearrangement.
 552
 553 2. It is not certain that a deadlock will be resolved by aborting the
 554 last-to-wait process.  If earlier waiters in the cycle have not yet run
 555 CheckDeadLock, then the first one to do so will be the victim.
 556
 557 3. No live (wakable) process can be missed by ProcLockWakeup, since it
 558 examines every member of the wait queue (this was not true in the 7.0
 559 implementation, BTW).  Therefore, if ProcLockWakeup is always invoked
 560 after a lock is released or a wait queue is rearranged, there can be no
 561 failure to wake a wakable process.  One should also note that
 562 LockErrorCleanup (abort a waiter due to outside factors) must run
 563 ProcLockWakeup, in case the canceled waiter was soft-blocking other
 564 waiters.
 565
 566 4. We can minimize excess rearrangement-trial work by being careful to
 567 scan the wait queue from the front when looking for soft edges.  For
 568 example, if we have queue order A,B,C and C has deadlock conflicts with
 569 both A and B, we want to generate the "C before A" constraint first,
 570 rather than wasting time with "C before B", which won't move C far
 571 enough up.  So we look for soft edges outgoing from C starting at the
 572 front of the wait queue.
 573
 574 5. The working data structures needed by the deadlock detection code can
 575 be limited to numbers of entries computed from MaxBackends.  Therefore,
 576 we can allocate the worst-case space needed during backend startup. This
 577 seems a safer approach than trying to allocate workspace on the fly; we
 578 don't want to risk having the deadlock detector run out of memory, else
 579 we really have no guarantees at all that deadlock will be detected.
 580
 581 6. We abuse the deadlock detector to implement autovacuum cancellation.
 582 When we run the detector and we find that there's an autovacuum worker
 583 involved in the waits-for graph, we store a pointer to its PGPROC, and
 584 return a special return code (unless a hard deadlock has been detected).
 585 The caller can then send a cancellation signal.  This implements the
 586 principle that autovacuum has a low locking priority (e.g., it must not block
 587 DDL on the table).
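
The front-to-back scan described in note 4 can be sketched as follows.  The
conflict matrix and the names first_soft_blocker and MAX_WAITERS are
hypothetical; the real code works on the lock's wait queue and conflict
table rather than on a precomputed matrix.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_WAITERS 16          /* hypothetical upper bound for this sketch */

/*
 * Return the queue position of the first waiter ahead of position "me"
 * whose request conflicts with ours, scanning from the front of the queue.
 * Returns -1 if no earlier waiter conflicts.  conflicts[i][j] is assumed
 * to be true when waiters i and j request conflicting lock modes.
 */
static int
first_soft_blocker(bool conflicts[MAX_WAITERS][MAX_WAITERS], int me)
{
    assert(me >= 0 && me < MAX_WAITERS);

    for (int i = 0; i < me; i++)
    {
        if (conflicts[me][i])
            return i;           /* frontmost conflict moves us furthest up */
    }
    return -1;
}
```

With queue order A, B, C at positions 0, 1, 2 and C conflicting with both A
and B, first_soft_blocker(conflicts, 2) returns 0, which corresponds to
generating the "C before A" constraint first.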
 588
 589 Group Locking
 590 -------------
 591
 592 As if all of that weren't already complicated enough, PostgreSQL now supports
 593 parallelism (see src/backend/access/transam/README.parallel), which means that
 594 we might need to resolve deadlocks that occur between gangs of related
 595 processes rather than individual processes.  This doesn't change the basic
 596 deadlock detection algorithm very much, but it makes the bookkeeping more
 597 complicated.
 598
 599 We choose to regard locks held by processes in the same parallel group as
 600 non-conflicting, with the exception of the relation extension lock.  This means that
 601 two processes in a parallel group can hold a self-exclusive lock on the same
 602 relation at the same time, or one process can acquire an AccessShareLock while
 603 the other already holds AccessExclusiveLock.  This might seem dangerous and
 604 could be in some cases (more on that below), but if we didn't do this then
 605 parallel query would be extremely prone to self-deadlock.  For example, a
 606 parallel query against a relation on which the leader already had
 607 AccessExclusiveLock would hang, because the workers would try to lock the same
 608 relation and be blocked by the leader; yet the leader can't finish until it
 609 receives completion indications from all workers.  An undetected deadlock
 610 results.  This is far from the only scenario where such a problem happens.  The
 611 same thing will occur if the leader holds only AccessShareLock, the worker
 612 seeks AccessShareLock, but between the time the leader attempts to acquire the
 613 lock and the time the worker attempts to acquire it, some other process queues
 614 up waiting for an AccessExclusiveLock.  In this case, too, an indefinite hang
 615 results.
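
A minimal sketch of the rule, under loud simplifications: only two lock
modes are modeled, only already-granted locks are considered (not waiters in
the queue), and the relation extension exception is left out.  All type and
function names here (ToyProc, ToyHolder, request_conflicts, and so on) are
hypothetical and are not the lock manager's own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Only two of the lock modes are modeled in this sketch. */
typedef enum ToyLockMode
{
    ACCESS_SHARE,
    ACCESS_EXCLUSIVE
} ToyLockMode;

typedef struct ToyProc
{
    struct ToyProc *lockGroupLeader;    /* NULL if not in a lock group */
} ToyProc;

typedef struct ToyHolder
{
    ToyProc    *proc;           /* process holding the lock */
    ToyLockMode mode;           /* mode in which it is held */
} ToyHolder;

/* In this two-mode model, anything involving ACCESS_EXCLUSIVE conflicts. */
static bool
modes_conflict(ToyLockMode a, ToyLockMode b)
{
    return a == ACCESS_EXCLUSIVE || b == ACCESS_EXCLUSIVE;
}

/* Two processes are related if identical or if they share a group leader. */
static bool
same_lock_group(const ToyProc *a, const ToyProc *b)
{
    if (a == b)
        return true;
    return a->lockGroupLeader != NULL &&
        a->lockGroupLeader == b->lockGroupLeader;
}

/* Does the request conflict with any lock held outside the requester's group? */
static bool
request_conflicts(const ToyProc *requester, ToyLockMode mode,
                  const ToyHolder *holders, int nholders)
{
    assert(nholders >= 0);

    for (int i = 0; i < nholders; i++)
    {
        if (same_lock_group(requester, holders[i].proc))
            continue;           /* locks within a group never conflict */
        if (modes_conflict(mode, holders[i].mode))
            return true;
    }
    return false;
}
```

In this model a worker requesting ACCESS_SHARE on a relation its leader
holds in ACCESS_EXCLUSIVE mode is not blocked, while an unrelated process
making the same request is.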
 616
 617 It might seem that we could predict which locks the workers will attempt to
 618 acquire and ensure before going parallel that those locks would be acquired
 619 successfully.  But this is very difficult to make work in a general way.  For
 620 example, a parallel worker's portion of the query plan could involve an
 621 SQL-callable function which generates a query dynamically, and that query
 622 might happen to hit a table on which the leader happens to hold
 623 AccessExclusiveLock.  By imposing enough restrictions on what workers can do,
 624 we could eventually create a situation where their behavior can be adequately
 625 restricted, but these restrictions would be fairly onerous, and even then, the
 626 system required to decide whether the workers will succeed at acquiring the
 627 necessary locks would be complex and possibly buggy.
 628
 629 So, instead, we take the approach of deciding that locks within a lock group
 630 do not conflict.  This eliminates the possibility of an undetected deadlock,
 631 but also opens up some problem cases: if the leader and worker try to do some
 632 operation at the same time which would ordinarily be prevented by the
 633 heavyweight lock mechanism, undefined behavior might result.  In practice, the
 634 dangers are modest.  The leader and worker share the same transaction,
 635 snapshot, and combo CID hash, and neither can perform any DDL or, indeed,
 636 write any data at all.  Thus, for either to read a table locked exclusively by
 637 the other is safe enough.  Problems would occur if the leader initiated
 638 parallelism from a point in the code at which it had some backend-private
 639 state that made table access from another process unsafe.  For example, after
 640 calling SetReindexProcessing and before calling ResetReindexProcessing,
 641 catastrophe could ensue, because the worker won't have that state.  Similarly,
 642 problems could occur with certain kinds of non-relation locks, such as
 643 GIN page locks.  It's no safer for two related processes to perform GIN cleanup
 644 at the same time than for unrelated processes to do the same.
 645 However, since parallel mode is strictly read-only at present, neither this
 646 nor most of the similar cases can arise at present.  To allow parallel writes,
 647 we'll either need to (1) further enhance the deadlock detector to handle those
 648 types of locks in a different way than other types; or (2) have parallel
 649 workers use some other mutual exclusion method for such cases.
 650
 651 Group locking adds three new members to each PGPROC: lockGroupLeader,
 652 lockGroupMembers, and lockGroupLink. A PGPROC's lockGroupLeader is NULL for
 653 processes not involved in parallel query. When a process wants to cooperate
 654 with parallel workers, it becomes a lock group leader, which means setting
 655 this field to point to its own PGPROC. When a parallel worker starts up, it
 656 points this field at the leader. The lockGroupMembers field is only used in
 657 the leader; it is a list of the member PGPROCs of the lock group (the leader
 658 and all workers). The lockGroupLink field is the list link for this list.
 659
 660 All three of these fields are considered to be protected by a lock manager
 661 partition lock.  The partition lock that protects these fields within a given
 662 lock group is chosen by taking the leader's pgprocno modulo the number of lock
 663 manager partitions.  This unusual arrangement has a major advantage: the
 664 deadlock detector can count on the fact that no lockGroupLeader field can
 665 change while the deadlock detector is running, because it knows that it holds
 666 all the lock manager locks.  Also, holding this single lock allows safe
 667 manipulation of the lockGroupMembers list for the lock group.
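
The partition choice can be sketched in a few lines.  The value 16 for
NUM_LOCK_PARTITIONS is an assumption made for this example, and
lock_group_partition is a hypothetical helper name.  The point is that every
member of a group maps to the same partition, because the leader's pgprocno
is used rather than the member's own.

```c
#include <assert.h>

/* Assumed value for this sketch; the real constant lives in the headers. */
#define NUM_LOCK_PARTITIONS 16

/*
 * Return the index of the lock manager partition whose lock protects the
 * lock group fields of the group led by the process with this pgprocno.
 */
static int
lock_group_partition(int leader_pgprocno)
{
    assert(leader_pgprocno >= 0);
    return leader_pgprocno % NUM_LOCK_PARTITIONS;
}
```

Under that assumption, a leader with pgprocno 35 and all of its workers
would use partition 3.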
 668
 669 We need an additional interlock when setting these fields, because a newly
 670 started parallel worker has to try to join the leader's lock group, but it
 671 has no guarantee that the group leader is still alive by the time it gets
 672 started.  We try to ensure that the parallel leader dies after all workers
 673 in normal cases, but also that the system could survive relatively intact
 674 if that somehow fails to happen.  This is one of the precautions against
 675 such a scenario: the leader relays its PGPROC and also its PID to the
 676 worker, and the worker fails to join the lock group unless the given PGPROC
 677 still has the same PID and is still a lock group leader.  We assume that
 678 PIDs are not recycled quickly enough for this interlock to fail.
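
The join protocol can be sketched as below.  This is a simplified model, not
the backend's code: ToyProc stands in for PGPROC, a member count stands in
for the lockGroupMembers list, the function names are hypothetical, and the
partition lock that the real code would hold around these updates is only
indicated by comments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyProc
{
    int         pid;            /* process ID currently using this slot */
    struct ToyProc *lockGroupLeader;    /* NULL if not in a lock group */
    int         nmembers;       /* stand-in for the lockGroupMembers list */
} ToyProc;

/* Make "me" a lock group leader by pointing the leader field at itself. */
static void
toy_become_leader(ToyProc *me)
{
    if (me->lockGroupLeader == me)
        return;                 /* already a leader */
    assert(me->lockGroupLeader == NULL);

    /* real code: acquire the partition lock chosen from me's pgprocno */
    me->lockGroupLeader = me;
    me->nmembers = 1;           /* the leader is itself a member */
    /* real code: release the partition lock */
}

/*
 * Try to join the group led by "leader".  The join is refused unless the
 * slot still belongs to the expected PID and is still a group leader.
 */
static bool
toy_become_member(ToyProc *me, ToyProc *leader, int leader_pid)
{
    bool        ok = false;

    assert(me != leader);

    /* real code: acquire the partition lock chosen from leader's pgprocno */
    if (leader->pid == leader_pid && leader->lockGroupLeader == leader)
    {
        me->lockGroupLeader = leader;
        leader->nmembers++;
        ok = true;
    }
    /* real code: release the partition lock */
    return ok;
}
```

A worker handed a stale leader pointer (the slot has since been reused by a
process with a different PID, or its owner is no longer a group leader) gets
false back and must not proceed as a group member.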
 679
 680
 681 User Locks (Advisory Locks)
 682 ---------------------------
 683
 684 User locks are handled totally on the application side as long-term
 685 cooperative locks which may extend beyond the normal transaction boundaries.
 686 Their purpose is to indicate to an application that someone is `working'
 687 on an item.  So it is possible to put a user lock on a tuple's oid,
 688 retrieve the tuple, work on it for an hour and then update it and remove
 689 the lock.  While the lock is active other clients can still read and write
 690 the tuple but they can be aware that it has been locked at the application
 691 level by someone.
 692
 693 User locks and normal locks are completely orthogonal and they don't
 694 interfere with each other.
 695
 696 User locks can be acquired either at session level or transaction level.
 697 A session-level lock request is not automatically released at transaction
 698 end, but must be explicitly released by the application.  (However, any
 699 remaining locks are always released at session end.)  Transaction-level
 700 user lock requests behave the same as normal lock requests, in that they
 701 are released at transaction end and do not need explicit unlocking.
 702
 703 Locking during Hot Standby
 704 --------------------------
 705
 706 The Startup process is the only backend that can make changes during
 707 recovery; all other backends are read-only.  As a result, the Startup
 708 process does not acquire locks on relations or objects except when the lock
 709 level is AccessExclusiveLock.
 710
 711 Regular backends are only allowed to take locks on relations or objects
 712 at RowExclusiveLock or lower. This ensures that they do not conflict with
 713 each other or with the Startup process, unless AccessExclusiveLocks are
 714 requested by the Startup process.
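
The restriction on regular backends amounts to a comparison on the lock mode
number.  The sketch below assumes the usual ordering of the eight lock modes
from weakest to strongest; the enum is redeclared locally so the example is
self-contained, and regular_backend_lock_allowed is a hypothetical name.  It
covers only locks on relations or objects requested by regular backends, not
the Startup process.

```c
#include <assert.h>
#include <stdbool.h>

/* Lock modes in increasing strength, redeclared locally for this sketch. */
typedef enum ToyLockMode
{
    AccessShareLock = 1,
    RowShareLock,
    RowExclusiveLock,
    ShareUpdateExclusiveLock,
    ShareLock,
    ShareRowExclusiveLock,
    ExclusiveLock,
    AccessExclusiveLock
} ToyLockMode;

/*
 * May a regular backend take a relation or object lock of this mode while
 * the server is in recovery?  Only RowExclusiveLock or weaker is allowed.
 */
static bool
regular_backend_lock_allowed(ToyLockMode mode)
{
    assert(mode >= AccessShareLock && mode <= AccessExclusiveLock);
    return mode <= RowExclusiveLock;
}
```

Since none of the three permitted modes conflict with each other, and of the
modes above them only AccessExclusiveLock is taken during recovery, the only
remaining conflicts are with AccessExclusiveLocks held by the Startup
process.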
 715
 716 Deadlocks involving AccessExclusiveLocks are not possible, so we need
 717 not be concerned that a user-initiated deadlock can prevent recovery from
 718 progressing.
 719
 720 AccessExclusiveLocks on the primary node generate WAL records
 721 that are then applied by the Startup process. Locks are released at end
 722 of transaction just as they are in normal processing. These locks are
 723 held by the Startup process, acting as a proxy for the backends that
 724 originally acquired these locks. Again, these locks cannot conflict with
 725 one another, so the Startup process cannot deadlock itself either.
 726
 727 Although deadlock is not possible, a regular backend's weak lock can
 728 prevent the Startup process from making progress in applying WAL, which is
 729 usually not something that should be tolerated for very long.  Mechanisms
 730 exist to forcibly cancel a regular backend's query if it blocks the
 731 Startup process for too long.