src/backend/access/brin/README

   1 Block Range Indexes (BRIN)
   2 ==========================
   3
   4 BRIN indexes intend to enable very fast scanning of extremely large tables.
   5
   6 The essential idea of a BRIN index is to keep track of summarizing values in
   7 consecutive groups of heap pages (page ranges); for example, the minimum and
   8 maximum values for datatypes with a btree opclass, or the bounding box for
   9 geometric types.  These values can be used to avoid scanning such pages
  10 during a table scan, depending on query quals.
  11
  12 The cost of this is having to update the stored summary values of each page
  13 range as tuples are inserted into them.
  14
  15
  16 Access Method Design
  17 --------------------
  18
  19 Since item pointers are not stored inside indexes of this type, it is not
  20 possible to support the amgettuple interface.  Instead, we only provide
  21 amgetbitmap support.  The amgetbitmap routine returns a lossy TIDBitmap
  22 comprising all pages in those page ranges that match the query
  23 qualifications.  The recheck step in the BitmapHeapScan node prunes tuples
  24 that are not visible according to the query qualifications.
  25
  26 An operator class must have the following entries:
  27
  28 - generic support procedures (pg_amproc), identical to all opclasses:
  29   * "opcinfo" (BRIN_PROCNUM_OPCINFO) initializes a structure for index
  30     creation or scanning
  31   * "addValue" (BRIN_PROCNUM_ADDVALUE) takes an index tuple and a heap item,
  32     and possibly changes the index tuple so that it includes the heap item
  33     values
  34   * "consistent" (BRIN_PROCNUM_CONSISTENT) takes an index tuple and query
  35     quals, and returns whether the index tuple values match the query quals.
  36   * "union" (BRIN_PROCNUM_UNION) takes two index tuples and modifies the first
  37     one so that it represents the union of the two.
  38 Procedure numbers up to 10 are reserved for future expansion.
  39
  40 Additionally, each opclass needs additional support functions:
  41 - Minmax-style operator classes:
  42   * Proc numbers 11-14 are used for the functions implementing inequality
  43     operators for the type, in this order: less than, less or equal,
  44     greater or equal, greater than.
  45
  46 Opclasses using a different design will require different additional procedure
  47 numbers.
  48
  49 Operator classes also need to have operator (pg_amop) entries so that the
  50 optimizer can choose the index to execute queries.
  51 - Minmax-style operator classes:
  52   * The same operators as btree (<=, <, =, >=, >)
  53
  54 Each index tuple stores some NULL bits and some opclass-specified values, which
  55 are stored in a single null bitmask of length twice the number of columns.  The
  56 generic NULL bits indicate, for each column:
  57   * bt_hasnulls: Whether there's any NULL value at all in the page range
  58   * bt_allnulls: Whether all values are NULLs in the page range
  59
  60 The opclass-specified values are:
  61 - Minmax-style operator classes
  62   * minimum value across all tuples in the range
  63   * maximum value across all tuples in the range
  64
  65 Note that the addValue and Union support procedures  must be careful to
  66 datumCopy() the values they want to store in the in-memory BRIN tuple, and
  67 must pfree() the old copies when replacing older ones.  Since some values
  68 referenced from the tuple persist and others go away, there is no
  69 well-defined lifetime for a memory context that would make this automatic.
  70
  71
  72 The Range Map
  73 -------------
  74
  75 To find the index tuple for a particular page range, we have an internal
  76 structure we call the range map, or "revmap" for short.  This stores one TID
  77 per page range, which is the address of the index tuple summarizing that
  78 range.  Since the map entries are fixed size, it is possible to compute the
  79 address of the range map entry for any given heap page by simple arithmetic.
  80
  81 When a new heap tuple is inserted in a summarized page range, we compare the
  82 existing index tuple with the new heap tuple.  If the heap tuple is outside
  83 the summarization data given by the index tuple for any indexed column (or
  84 if the new heap tuple contains null values but the index tuple indicates
  85 there are no nulls), the index is updated with the new values.  In many
  86 cases it is possible to update the index tuple in-place, but if the new
  87 index tuple is larger than the old one and there's not enough space in the
  88 page, it is necessary to create a new index tuple with the new values.  The
  89 range map can be updated quickly to point to it; the old index tuple is
  90 removed.
  91
  92 If the range map points to an invalid TID, the corresponding page range is
  93 considered to be not summarized.  When tuples are added to unsummarized
  94 pages, nothing needs to happen.
  95
  96 To scan a table following a BRIN index, we scan the range map sequentially.
  97 This yields index tuples in ascending page range order.  Query quals are
  98 matched to each index tuple; if they match, each page within the page range
  99 is returned as part of the output TID bitmap.  If there's no match, they are
 100 skipped.  Range map entries returning invalid index TIDs, that is
 101 unsummarized page ranges, are also returned in the TID bitmap.
 102
 103 The revmap is stored in the first few blocks of the index main fork,
 104 immediately following the metapage.  Whenever the revmap needs to be
 105 extended by another page, existing tuples in that page are moved to some
 106 other page.
 107
 108 Heap tuples can be removed from anywhere without restriction.  It might be
 109 useful to mark the corresponding index tuple somehow, if the heap tuple is
 110 one of the constraining values of the summary data (i.e. either min or max
 111 in the case of a btree-opclass-bearing datatype), so that in the future we
 112 are aware of the need to re-execute summarization on that range, leading to
 113 a possible tightening of the summary values.
 114
 115 Summarization
 116 -------------
 117
 118 At index creation time, the whole table is scanned; for each page range the
 119 summarizing values of each indexed column and nulls bitmap are collected and
 120 stored in the index.  The partially-filled page range at the end of the
 121 table is also summarized.
 122
 123 As new tuples get inserted at the end of the table, they may update the
 124 index tuple that summarizes the partial page range at the end.  Eventually
 125 that page range is complete and new tuples belong in a new page range that
 126 hasn't yet been summarized.  Those insertions do not create a new index
 127 entry; instead, the page range remains unsummarized until later.
 128
 129 Whenever VACUUM is run on the table, all unsummarized page ranges are
 130 summarized.  This action can also be invoked by the user via
 131 brin_summarize_new_values().  Both these procedures scan all the
 132 unsummarized ranges, and create a summary tuple.  Again, this includes the
 133 partially-filled page range at the end of the table.
 134
 135 Vacuuming
 136 ---------
 137
 138 Since no heap TIDs are stored in a BRIN index, it's not necessary to scan the
 139 index when heap tuples are removed.  It might be that some summary values can
 140 be tightened if heap tuples have been deleted; but this would represent an
 141 optimization opportunity only, not a correctness issue.  It's simpler to
 142 represent this as the need to re-run summarization on the affected page range
 143 rather than "subtracting" values from the existing one.  This is not
 144 currently implemented.
 145
 146 Note that if there are no indexes on the table other than the BRIN index,
 147 usage of maintenance_work_mem by vacuum can be decreased significantly, because
 148 no detailed index scan needs to take place (and thus it's not necessary for
 149 vacuum to save TIDs to remove).  It's unlikely that BRIN would be the only
 150 indexes in a table, though, because primary keys can be btrees only, and so
 151 we don't implement this optimization.
 152
 153
 154 Optimizer
 155 ---------
 156
 157 The optimizer selects the index based on the operator class' pg_amop
 158 entries for the column.
 159
 160
 161 Future improvements
 162 -------------------
 163
 164 * Different-size page ranges?
 165   In the current design, each "index entry" in a BRIN index covers the same
 166   number of pages.  There's no hard reason for this; it might make sense to
 167   allow the index to self-tune so that some index entries cover smaller page
 168   ranges, if this allows the summary values to be more compact.  This would incur
 169   larger BRIN overhead for the index itself, but might allow better pruning of
 170   page ranges during scan.  In the limit of one index tuple per page, the index
 171   itself would occupy too much space, even though we would be able to skip
 172   reading the most heap pages, because the summary values are tight; in the
 173   opposite limit of a single tuple that summarizes the whole table, we wouldn't
 174   be able to prune anything even though the index is very small.  This can
 175   probably be made to work by using the range map as an index in itself.
 176
 177 * More compact representation for TIDBitmap?
 178   TIDBitmap is the structure used to represent bitmap scans.  The
 179   representation of lossy page ranges is not optimal for our purposes, because
 180   it uses a Bitmapset to represent pages in the range; since we're going to return
 181   all pages in a large range, it might be more convenient to allow for a
 182   struct that uses start and end page numbers to represent the range, instead.
 183
 184 * Better vacuuming?
 185   It might be useful to enable passing more useful info to BRIN indexes during
 186   vacuuming about tuples that are deleted, i.e. do not require the callback to
 187   pass each tuple's TID.  For instance we might need a callback that passes a
 188   block number instead of a TID.  That would help determine when to re-run
 189   summarization on blocks that have seen lots of tuple deletions.