src/backend/access/hash/README

   1 $PostgreSQL$
   2
   3 Hash Indexing
   4 =============
   5
   6 This directory contains an implementation of hash indexing for Postgres.  Most
   7 of the core ideas are taken from Margo Seltzer and Ozan Yigit, A New Hashing
   8 Package for UNIX, Proceedings of the Winter USENIX Conference, January 1991.
   9 (Our in-memory hashtable implementation, src/backend/utils/hash/dynahash.c,
  10 also relies on some of the same concepts; it is derived from code written by
  11 Esmond Pitt and later improved by Margo among others.)
  12
  13 A hash index consists of two or more "buckets", into which tuples are
  14 placed whenever their hash key maps to the bucket number.  The
  15 key-to-bucket-number mapping is chosen so that the index can be
  16 incrementally expanded.  When a new bucket is to be added to the index,
  17 exactly one existing bucket will need to be "split", with some of its
  18 tuples being transferred to the new bucket according to the updated
  19 key-to-bucket-number mapping.  This is essentially the same hash table
  20 management technique embodied in src/backend/utils/hash/dynahash.c for
  21 in-memory hash tables.
  22
  23 Each bucket in the hash index comprises one or more index pages.  The
  24 bucket's first page is permanently assigned to it when the bucket is
  25 created.  Additional pages, called "overflow pages", are added if the
  26 bucket receives too many tuples to fit in the primary bucket page.
  27 The pages of a bucket are chained together in a doubly-linked list
  28 using fields in the index page special space.
  29
  30 There is currently no provision to shrink a hash index, other than by
  31 rebuilding it with REINDEX.  Overflow pages can be recycled for reuse
  32 in other buckets, but we never give them back to the operating system.
  33 There is no provision for reducing the number of buckets, either.
  34
  35
  36 Page Addressing
  37 ---------------
  38
  39 There are four kinds of pages in a hash index: the meta page (page zero),
  40 which contains statically allocated control information; primary bucket
  41 pages; overflow pages; and bitmap pages, which keep track of overflow
  42 pages that have been freed and are available for re-use.  For addressing
  43 purposes, bitmap pages are regarded as a subset of the overflow pages.
  44
  45 Primary bucket pages and overflow pages are allocated independently (since
  46 any given index might need more or fewer overflow pages relative to its
  47 number of buckets).  The hash code uses an interesting set of addressing
  48 rules to support a variable number of overflow pages while not having to
  49 move primary bucket pages around after they are created.
  50
  51 Primary bucket pages (henceforth just "bucket pages") are allocated in
  52 power-of-2 groups, called "split points" in the code.  Buckets 0 and 1
  53 are created when the index is initialized.  At the first split, buckets 2
  54 and 3 are allocated; when bucket 4 is needed, buckets 4-7 are allocated;
  55 when bucket 8 is needed, buckets 8-15 are allocated; etc.  All the bucket
  56 pages of a power-of-2 group appear consecutively in the index.  This
  57 addressing scheme allows the physical location of a bucket page to be
  58 computed from the bucket number relatively easily, using only a small
  59 amount of control information.  We take the log2() of the bucket number
  60 to determine which split point S the bucket belongs to, and then simply
  61 add "hashm_spares[S] + 1" (where hashm_spares[] is an array stored in the
  62 metapage) to compute the physical address.  hashm_spares[S] can be
  63 interpreted as the total number of overflow pages that have been allocated
  64 before the bucket pages of splitpoint S.  hashm_spares[0] is always 0,
  65 so that buckets 0 and 1 (which belong to splitpoint 0) always appear at
  66 block numbers 1 and 2, just after the meta page.  We always have
  67 hashm_spares[N] <= hashm_spares[N+1], since the latter count includes the
  68 former.  The difference between the two represents the number of overflow
  69 pages appearing between the bucket page groups of splitpoints N and N+1.
  70
  71 (Note: the above describes what happens when filling an initially minimally
  72 sized hash index.  In practice, we try to estimate the required index size
  73 and allocate a suitable number of splitpoints immediately, to avoid
  74 expensive re-splitting during initial index build.)
  75
  76 When S splitpoints exist altogether, the array entries hashm_spares[0]
  77 through hashm_spares[S] are valid; hashm_spares[S] records the current
  78 total number of overflow pages.  New overflow pages are created as needed
  79 at the end of the index, and recorded by incrementing hashm_spares[S].
  80 When it is time to create a new splitpoint's worth of bucket pages, we
  81 copy hashm_spares[S] into hashm_spares[S+1] and increment S (which is
  82 stored in the hashm_ovflpoint field of the meta page).  This has the
  83 effect of reserving the correct number of bucket pages at the end of the
  84 index, and preparing to allocate additional overflow pages after those
  85 bucket pages.  hashm_spares[] entries before S cannot change anymore,
  86 since that would require moving already-created bucket pages.
  87
  88 The last page nominally used by the index is always determinable from
  89 hashm_spares[S].  To avoid complaints from smgr, the logical EOF as seen by
  90 the filesystem and smgr must always be greater than or equal to this page.
  91 We have to allow the case "greater than" because it's possible that during
  92 an index extension we crash after allocating filesystem space and before
  93 updating the metapage.  Note that on filesystems that allow "holes" in
  94 files, it's entirely likely that pages before the logical EOF are not yet
  95 allocated: when we allocate a new splitpoint's worth of bucket pages, we
  96 physically zero the last such page to force the EOF up, and the first such
  97 page will be used immediately, but the intervening pages are not written
  98 until needed.
  99
 100 Since overflow pages may be recycled if enough tuples are deleted from
 101 their bucket, we need a way to keep track of currently-free overflow
 102 pages.  The state of each overflow page (0 = available, 1 = not available)
 103 is recorded in "bitmap" pages dedicated to this purpose.  The entries in
 104 the bitmap are indexed by "bit number", a zero-based count in which every
 105 overflow page has a unique entry.  We can convert between an overflow
 106 page's physical block number and its bit number using the information in
 107 hashm_spares[] (see hashovfl.c for details).  The bit number sequence
 108 includes the bitmap pages, which is the reason for saying that bitmap
 109 pages are a subset of the overflow pages.  It turns out in fact that each
 110 bitmap page's first bit represents itself --- this is not an essential
 111 property, but falls out of the fact that we only allocate another bitmap
 112 page when we really need one.  Bit number zero always corresponds to the
 113 first bitmap page, which is allocated during index creation just after all
 114 the initially created buckets.
 115
 116
 117 Lock Definitions
 118 ----------------
 119
 120 We use both lmgr locks ("heavyweight" locks) and buffer context locks
 121 (LWLocks) to control access to a hash index.  lmgr locks are needed for
 122 long-term locking since there is a (small) risk of deadlock, which we must
 123 be able to detect.  Buffer context locks are used for short-term access
 124 control to individual pages of the index.
 125
 126 We define the following lmgr locks for a hash index:
 127
 128 LockPage(rel, 0) represents the right to modify the hash-code-to-bucket
 129 mapping.  A process attempting to enlarge the hash table by splitting a
 130 bucket must exclusive-lock this lock before modifying the metapage data
 131 representing the mapping.  Processes intending to access a particular
 132 bucket must share-lock this lock until they have acquired lock on the
 133 correct target bucket.
 134
 135 LockPage(rel, page), where page is the page number of a hash bucket page,
 136 represents the right to split or compact an individual bucket.  A process
 137 splitting a bucket must exclusive-lock both old and new halves of the
 138 bucket until it is done.  A process doing VACUUM must exclusive-lock the
 139 bucket it is currently purging tuples from.  Processes doing scans or
 140 insertions must share-lock the bucket they are scanning or inserting into.
 141 (It is okay to allow concurrent scans and insertions.)
 142
 143 The lmgr lock IDs corresponding to overflow pages are currently unused.
 144 These are available for possible future refinements.
 145
 146 Note that these lock definitions are conceptually distinct from any sort
 147 of lock on the pages whose numbers they share.  A process must also obtain
 148 read or write buffer lock on the metapage or bucket page before accessing
 149 said page.
 150
 151 Processes performing hash index scans must hold share lock on the bucket
 152 they are scanning throughout the scan.  This seems to be essential, since
 153 there is no reasonable way for a scan to cope with its bucket being split
 154 underneath it.  This creates a possibility of deadlock external to the
 155 hash index code, since a process holding one of these locks could block
 156 waiting for an unrelated lock held by another process.  If that process
 157 then does something that requires exclusive lock on the bucket, we have
 158 deadlock.  Therefore the bucket locks must be lmgr locks so that deadlock
 159 can be detected and recovered from.  This also forces the page-zero lock
 160 to be an lmgr lock, because as we'll see below it is held while attempting
 161 to acquire a bucket lock, and so it could also participate in a deadlock.
 162
 163 Processes must obtain read (share) buffer context lock on any hash index
 164 page while reading it, and write (exclusive) lock while modifying it.
 165 To prevent deadlock we enforce these coding rules: no buffer lock may be
 166 held long term (across index AM calls), nor may any buffer lock be held
 167 while waiting for an lmgr lock, nor may more than one buffer lock
 168 be held at a time by any one process.  (The third restriction is probably
 169 stronger than necessary, but it makes the proof of no deadlock obvious.)
 170
 171
 172 Pseudocode Algorithms
 173 ---------------------
 174
 175 The operations we need to support are: readers scanning the index for
 176 entries of a particular hash code (which by definition are all in the same
 177 bucket); insertion of a new tuple into the correct bucket; enlarging the
 178 hash table by splitting an existing bucket; and garbage collection
 179 (deletion of dead tuples and compaction of buckets).  Bucket splitting is
 180 done at conclusion of any insertion that leaves the hash table more full
 181 than the target load factor, but it is convenient to consider it as an
 182 independent operation.  Note that we do not have a bucket-merge operation
 183 --- the number of buckets never shrinks.  Insertion, splitting, and
 184 garbage collection may all need access to freelist management, which keeps
 185 track of available overflow pages.
 186
 187 The reader algorithm is:
 188
 189         share-lock page 0 (to prevent active split)
 190         read/sharelock meta page
 191         compute bucket number for target hash key
 192         release meta page
 193         share-lock bucket page (to prevent split/compact of this bucket)
 194         release page 0 share-lock
 195 -- then, per read request:
 196         read/sharelock current page of bucket
 197                 step to next page if necessary (no chaining of locks)
 198         get tuple
 199         release current page
 200 -- at scan shutdown:
 201         release bucket share-lock
 202
 203 By holding the page-zero lock until lock on the target bucket is obtained,
 204 the reader ensures that the target bucket calculation is valid (otherwise
 205 the bucket might be split before the reader arrives at it, and the target
 206 entries might go into the new bucket).  Holding the bucket sharelock for
 207 the remainder of the scan prevents the reader's current-tuple pointer from
 208 being invalidated by other processes.  Notice though that the reader need
 209 not prevent other buckets from being split or compacted.
 210
 211 The insertion algorithm is rather similar:
 212
 213         share-lock page 0 (to prevent active split)
 214         read/sharelock meta page
 215         compute bucket number for target hash key
 216         release meta page
 217         share-lock bucket page (to prevent split/compact of this bucket)
 218         release page 0 share-lock
 219 -- (so far same as reader)
 220         read/exclusive-lock current page of bucket
 221         if full, release, read/exclusive-lock next page; repeat as needed
 222         >> see below if no space in any page of bucket
 223         insert tuple
 224         write/release current page
 225         release bucket share-lock
 226         read/exclusive-lock meta page
 227         increment tuple count, decide if split needed
 228         write/release meta page
 229         done if no split needed, else enter Split algorithm below
 230
 231 It is okay for an insertion to take place in a bucket that is being
 232 actively scanned, because it does not change the position of any existing
 233 item in the bucket, so scan states are not invalidated.  We only need the
 234 short-term buffer locks to ensure that readers do not see a
 235 partially-updated page.
 236
 237 It is clearly impossible for readers and inserters to deadlock, and in
 238 fact this algorithm allows them a very high degree of concurrency.
 239 (The exclusive metapage lock taken to update the tuple count is stronger
 240 than necessary, since readers do not care about the tuple count, but the
 241 lock is held for such a short time that this is probably not an issue.)
 242
 243 When an inserter cannot find space in any existing page of a bucket, it
 244 must obtain an overflow page and add that page to the bucket's chain.
 245 Details of that part of the algorithm appear later.
 246
 247 The page split algorithm is entered whenever an inserter observes that the
 248 index is overfull (has a higher-than-wanted ratio of tuples to buckets).
 249 The algorithm attempts, but does not necessarily succeed, to split one
 250 existing bucket in two, thereby lowering the fill ratio:
 251
 252         exclusive-lock page 0 (assert the right to begin a split)
 253         read/exclusive-lock meta page
 254         check split still needed
 255         if split not needed anymore, drop locks and exit
 256         decide which bucket to split
 257         Attempt to X-lock old bucket number (definitely could fail)
 258         Attempt to X-lock new bucket number (shouldn't fail, but...)
 259         if above fail, drop locks and exit
 260         update meta page to reflect new number of buckets
 261         write/release meta page
 262         release X-lock on page 0
 263         -- now, accesses to all other buckets can proceed.
 264         Perform actual split of bucket, moving tuples as needed
 265         >> see below about acquiring needed extra space
 266         Release X-locks of old and new buckets
 267
 268 Note the page zero and metapage locks are not held while the actual tuple
 269 rearrangement is performed, so accesses to other buckets can proceed in
 270 parallel; in fact, it's possible for multiple bucket splits to proceed
 271 in parallel.
 272
 273 Split's attempt to X-lock the old bucket number could fail if another
 274 process holds S-lock on it.  We do not want to wait if that happens, first
 275 because we don't want to wait while holding the metapage exclusive-lock,
 276 and second because it could very easily result in deadlock.  (The other
 277 process might be out of the hash AM altogether, and could do something
 278 that blocks on another lock this process holds; so even if the hash
 279 algorithm itself is deadlock-free, a user-induced deadlock could occur.)
 280 So, this is a conditional LockAcquire operation, and if it fails we just
 281 abandon the attempt to split.  This is all right since the index is
 282 overfull but perfectly functional.  Every subsequent inserter will try to
 283 split, and eventually one will succeed.  If multiple inserters failed to
 284 split, the index might still be overfull, but eventually, the index will
 285 not be overfull and split attempts will stop.  (We could make a successful
 286 splitter loop to see if the index is still overfull, but it seems better to
 287 distribute the split overhead across successive insertions.)
 288
 289 A problem is that if a split fails partway through (eg due to insufficient
 290 disk space) the index is left corrupt.  The probability of that could be
 291 made quite low if we grab a free page or two before we update the meta
 292 page, but the only real solution is to treat a split as a WAL-loggable,
 293 must-complete action.  I'm not planning to teach hash about WAL in this
 294 go-round.
 295
 296 The fourth operation is garbage collection (bulk deletion):
 297
 298         next bucket := 0
 299         read/sharelock meta page
 300         fetch current max bucket number
 301         release meta page
 302         while next bucket <= max bucket do
 303                 Acquire X lock on target bucket
 304                 Scan and remove tuples, compact free space as needed
 305                 Release X lock
 306                 next bucket ++
 307         end loop
 308         exclusive-lock meta page
 309         check if number of buckets changed
 310         if so, release lock and return to for-each-bucket loop
 311         else update metapage tuple count
 312         write/release meta page
 313
 314 Note that this is designed to allow concurrent splits.  If a split occurs,
 315 tuples relocated into the new bucket will be visited twice by the scan,
 316 but that does no harm.  (We must however be careful about the statistics
 317 reported by the VACUUM operation.  What we can do is count the number of
 318 tuples scanned, and believe this in preference to the stored tuple count
 319 if the stored tuple count and number of buckets did *not* change at any
 320 time during the scan.  This provides a way of correcting the stored tuple
 321 count if it gets out of sync for some reason.  But if a split or insertion
 322 does occur concurrently, the scan count is untrustworthy; instead,
 323 subtract the number of tuples deleted from the stored tuple count and
 324 use that.)
 325
 326 The exclusive lock request could deadlock in some strange scenarios, but
 327 we can just error out without any great harm being done.
 328
 329
 330 Free Space Management
 331 ---------------------
 332
 333 (Question: why is this so complicated?  Why not just have a linked list
 334 of free pages with the list head in the metapage?  It's not like we
 335 avoid needing to modify the metapage with all this.)
 336
 337 Free space management consists of two sub-algorithms, one for reserving
 338 an overflow page to add to a bucket chain, and one for returning an empty
 339 overflow page to the free pool.
 340
 341 Obtaining an overflow page:
 342
 343         read/exclusive-lock meta page
 344         determine next bitmap page number; if none, exit loop
 345         release meta page lock
 346         read/exclusive-lock bitmap page
 347         search for a free page (zero bit in bitmap)
 348         if found:
 349                 set bit in bitmap
 350                 write/release bitmap page
 351                 read/exclusive-lock meta page
 352                 if first-free-bit value did not change,
 353                         update it and write meta page
 354                 release meta page
 355                 return page number
 356         else (not found):
 357         release bitmap page
 358         loop back to try next bitmap page, if any
 359 -- here when we have checked all bitmap pages; we hold meta excl. lock
 360         extend index to add another overflow page; update meta information
 361         write/release meta page
 362         return page number
 363
 364 It is slightly annoying to release and reacquire the metapage lock
 365 multiple times, but it seems best to do it that way to minimize loss of
 366 concurrency against processes just entering the index.  We don't want
 367 to hold the metapage exclusive lock while reading in a bitmap page.
 368 (We can at least avoid repeated buffer pin/unpin here.)
 369
 370 The normal path for extending the index does not require doing I/O while
 371 holding the metapage lock.  We do have to do I/O when the extension
 372 requires adding a new bitmap page as well as the required overflow page
 373 ... but that is an infrequent case, so the loss of concurrency seems
 374 acceptable.
 375
 376 The portion of tuple insertion that calls the above subroutine looks
 377 like this:
 378
 379         -- having determined that no space is free in the target bucket:
 380         remember last page of bucket, drop write lock on it
 381         call free-page-acquire routine
 382         re-write-lock last page of bucket
 383         if it is not last anymore, step to the last page
 384         update (former) last page to point to new page
 385         write-lock and initialize new page, with back link to former last page
 386         write and release former last page
 387         insert tuple into new page
 388         -- etc.
 389
 390 Notice this handles the case where two concurrent inserters try to extend
 391 the same bucket.  They will end up with a valid, though perhaps
 392 space-inefficient, configuration: two overflow pages will be added to the
 393 bucket, each containing one tuple.
 394
 395 The last part of this violates the rule about holding write lock on two
 396 pages concurrently, but it should be okay to write-lock the previously
 397 free page; there can be no other process holding lock on it.
 398
 399 Bucket splitting uses a similar algorithm if it has to extend the new
 400 bucket, but it need not worry about concurrent extension since it has
 401 exclusive lock on the new bucket.
 402
 403 Freeing an overflow page is done by garbage collection and by bucket
 404 splitting (the old bucket may contain no-longer-needed overflow pages).
 405 In both cases, the process holds exclusive lock on the containing bucket,
 406 so need not worry about other accessors of pages in the bucket.  The
 407 algorithm is:
 408
 409         delink overflow page from bucket chain
 410         (this requires read/update/write/release of fore and aft siblings)
 411         read/share-lock meta page
 412         determine which bitmap page contains the free space bit for page
 413         release meta page
 414         read/exclusive-lock bitmap page
 415         update bitmap bit
 416         write/release bitmap page
 417         if page number is less than what we saw as first-free-bit in meta:
 418         read/exclusive-lock meta page
 419         if page number is still less than first-free-bit,
 420                 update first-free-bit field and write meta page
 421         release meta page
 422
 423 We have to do it this way because we must clear the bitmap bit before
 424 changing the first-free-bit field (hashm_firstfree).  It is possible that
 425 we set first-free-bit too small (because someone has already reused the
 426 page we just freed), but that is okay; the only cost is the next overflow
 427 page acquirer will scan more bitmap bits than he needs to.  What must be
 428 avoided is having first-free-bit greater than the actual first free bit,
 429 because then that free page would never be found by searchers.
 430
 431 All the freespace operations should be called while holding no buffer
 432 locks.  Since they need no lmgr locks, deadlock is not possible.
 433
 434
 435 Other Notes
 436 -----------
 437
 438 All the shenanigans with locking prevent a split occurring while *another*
 439 process is stopped in a given bucket.  They do not ensure that one of
 440 our *own* backend's scans is not stopped in the bucket, because lmgr
 441 doesn't consider a process's own locks to conflict.  So the Split
 442 algorithm must check for that case separately before deciding it can go
 443 ahead with the split.  VACUUM does not have this problem since nothing
 444 else can be happening within the vacuuming backend.
 445
 446 Should we instead try to fix the state of any conflicting local scan?
 447 Seems mighty ugly --- got to move the held bucket S-lock as well as lots
 448 of other messiness.  For now, just punt and don't split.