doc/src/sgml/indexam.sgml

<!-- $PostgreSQL$ -->

<chapter id="indexam">
 <title>Index Access Method Interface Definition</title>

  <para>
   This chapter defines the interface between the core
   <productname>PostgreSQL</productname> system and <firstterm>index access
   methods</>, which manage individual index types.  The core system
   knows nothing about indexes beyond what is specified here, so it is
   possible to develop entirely new index types by writing add-on code.
  </para>

  <para>
   All indexes in <productname>PostgreSQL</productname> are what are known
   technically as <firstterm>secondary indexes</>; that is, the index is
   physically separate from the table file that it describes.  Each index
   is stored as its own physical <firstterm>relation</> and so is described
   by an entry in the <structname>pg_class</> catalog.  The contents of an
   index are entirely under the control of its index access method.  In
   practice, all index access methods divide indexes into standard-size
   pages so that they can use the regular storage manager and buffer manager
   to access the index contents.  (All the existing index access methods
   furthermore use the standard page layout described in <xref
   linkend="storage-page-layout">, and they all use the same format for index
   tuple headers; but these decisions are not forced on an access method.)
  </para>

  <para>
   An index is effectively a mapping from some data key values to
   <firstterm>tuple identifiers</>, or <acronym>TIDs</>, of row versions
   (tuples) in the index's parent table.  A TID consists of a
   block number and an item number within that block (see <xref
   linkend="storage-page-layout">).  This is sufficient
   information to fetch a particular row version from the table.
   Indexes are not directly aware that under MVCC, there might be multiple
   extant versions of the same logical row; to an index, each tuple is
   an independent object that needs its own index entry.  Thus, an
   update of a row always creates all-new index entries for the row, even if
   the key values did not change.  (HOT tuples are an exception to this
   statement; but indexes do not deal with those, either.)  Index entries for
   dead tuples are reclaimed (by vacuuming) when the dead tuples themselves
   are reclaimed.
  </para>

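  <para>
   As a rough illustration of the TID described above, the following
   standalone sketch models a TID as a block number plus an item number.
   The names <literal>DemoTid</literal>, <literal>demo_tid_equal</literal>
   and <literal>demo_tid_compare</literal> are hypothetical and chosen for
   this example only; the field widths are assumptions, and this is not
   the server's actual definition.
  </para>

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a TID: a block number plus an item number. */
typedef struct DemoTid
{
    uint32_t block;             /* block (page) number within the table */
    uint16_t item;              /* item number within that block */
} DemoTid;

/* Two TIDs name the same row version only if both parts match. */
static bool
demo_tid_equal(DemoTid a, DemoTid b)
{
    return a.block == b.block && a.item == b.item;
}

/* Order TIDs by block first, then by item within the block. */
static int
demo_tid_compare(DemoTid a, DemoTid b)
{
    if (a.block != b.block)
        return (a.block < b.block) ? -1 : 1;
    if (a.item != b.item)
        return (a.item < b.item) ? -1 : 1;
    return 0;
}
```

  <para>
   Because each row version has its own TID, two versions of the same
   logical row compare as different objects in this model, which is why
   each one needs its own index entry.
  </para>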
 <sect1 id="index-catalog">
  <title>Catalog Entries for Indexes</title>

  <para>
   Each index access method is described by a row in the
   <structname>pg_am</structname> system catalog (see
   <xref linkend="catalog-pg-am">).  The principal contents of a
   <structname>pg_am</structname> row are references to
   <link linkend="catalog-pg-proc"><structname>pg_proc</structname></link>
   entries that identify the index access
   functions supplied by the access method.  The APIs for these functions
   are defined later in this chapter.  In addition, the
   <structname>pg_am</structname> row specifies a few fixed properties of
   the access method, such as whether it can support multicolumn indexes.
   There is not currently any special support
   for creating or deleting <structname>pg_am</structname> entries;
   anyone able to write a new access method is expected to be competent
   to insert an appropriate row for themselves.
  </para>

  <para>
   To be useful, an index access method must also have one or more
   <firstterm>operator families</> and
   <firstterm>operator classes</> defined in
   <link linkend="catalog-pg-opfamily"><structname>pg_opfamily</structname></link>,
   <link linkend="catalog-pg-opclass"><structname>pg_opclass</structname></link>,
   <link linkend="catalog-pg-amop"><structname>pg_amop</structname></link>, and
   <link linkend="catalog-pg-amproc"><structname>pg_amproc</structname></link>.
   These entries allow the planner
   to determine what kinds of query qualifications can be used with
   indexes of this access method.  Operator families and classes are described
   in <xref linkend="xindex">, which is prerequisite material for reading
   this chapter.
  </para>

  <para>
   An individual index is defined by a
   <link linkend="catalog-pg-class"><structname>pg_class</structname></link>
   entry that describes it as a physical relation, plus a
   <link linkend="catalog-pg-index"><structname>pg_index</structname></link>
   entry that shows the logical content of the index &mdash; that is, the set
   of index columns it has and the semantics of those columns, as captured by
   the associated operator classes.  The index columns (key values) can be
   either simple columns of the underlying table or expressions over the table
   rows.  The index access method normally has no interest in where the index
   key values come from (it is always handed precomputed key values) but it
   will be very interested in the operator class information in
   <structname>pg_index</structname>.  Both of these catalog entries can be
   accessed as part of the <structname>Relation</> data structure that is
   passed to all operations on the index.
  </para>

  <para>
   Some of the flag columns of <structname>pg_am</structname> have nonobvious
   implications.  The requirements of <structfield>amcanunique</structfield>
   are discussed in <xref linkend="index-unique-checks">.
   The <structfield>amcanmulticol</structfield> flag asserts that the
   access method supports multicolumn indexes, while
   <structfield>amoptionalkey</structfield> asserts that it allows scans
   where no indexable restriction clause is given for the first index column.
   When <structfield>amcanmulticol</structfield> is false,
   <structfield>amoptionalkey</structfield> essentially says whether the
   access method allows full-index scans without any restriction clause.
   Access methods that support multiple index columns <emphasis>must</>
   support scans that omit restrictions on any or all of the columns after
   the first; however they are permitted to require some restriction to
   appear for the first index column, and this is signaled by setting
   <structfield>amoptionalkey</structfield> false.
   <structfield>amindexnulls</structfield> asserts that index entries are
   created for NULL key values.  Since most indexable operators are
   strict and hence cannot return TRUE for NULL inputs,
   it is at first sight attractive to not store index entries for null values:
   they could never be returned by an index scan anyway.  However, this
   argument fails when an index scan has no restriction clause for a given
   index column.  In practice this means that
   indexes that have <structfield>amoptionalkey</structfield> true must
   index nulls, since the planner might decide to use such an index
   with no scan keys at all.  A related restriction is that an index
   access method that supports multiple index columns <emphasis>must</>
   support indexing null values in columns after the first, because the planner
   will assume the index can be used for queries that do not restrict
   these columns.  For example, consider an index on (a,b) and a query with
   <literal>WHERE a = 4</literal>.  The system will assume the index can be
   used to scan for rows with <literal>a = 4</literal>, which is wrong if the
   index omits rows where <literal>b</> is null.
   It is, however, OK to omit rows where the first indexed column is null.
   Thus, <structfield>amindexnulls</structfield> should be set true only if the
   index access method indexes all rows, including arbitrary combinations of
   null values.  An index access method that sets
   <structfield>amindexnulls</structfield> may also set
   <structfield>amsearchnulls</structfield>, indicating that it supports
   <literal>IS NULL</> clauses as search conditions.
  </para>

 </sect1>

 <sect1 id="index-functions">
  <title>Index Access Method Functions</title>

  <para>
   The index construction and maintenance functions that an index access
   method must provide are:
  </para>

  <para>
<programlisting>
IndexBuildResult *
ambuild (Relation heapRelation,
         Relation indexRelation,
         IndexInfo *indexInfo);
</programlisting>
   Build a new index.  The index relation has been physically created,
   but is empty.  It must be filled in with whatever fixed data the
   access method requires, plus entries for all tuples already existing
   in the table.  Ordinarily the <function>ambuild</> function will call
   <function>IndexBuildHeapScan()</> to scan the table for existing tuples
   and compute the keys that need to be inserted into the index.
   The function must return a palloc'd struct containing statistics about
   the new index.
  </para>

  <para>
<programlisting>
bool
aminsert (Relation indexRelation,
          Datum *values,
          bool *isnull,
          ItemPointer heap_tid,
          Relation heapRelation,
          bool check_uniqueness);
</programlisting>
   Insert a new tuple into an existing index.  The <literal>values</> and
   <literal>isnull</> arrays give the key values to be indexed, and
   <literal>heap_tid</> is the TID to be indexed.
   If the access method supports unique indexes (its
   <structname>pg_am</>.<structfield>amcanunique</> flag is true) then
   <literal>check_uniqueness</> might be true, in which case the access method
   must verify that there is no conflicting row; this is the only situation in
   which the access method normally needs the <literal>heapRelation</>
   parameter.  See <xref linkend="index-unique-checks"> for details.
   The result is TRUE if an index entry was inserted, FALSE if not.  (A FALSE
   result does not denote an error condition, but is used for cases such
   as an index method refusing to index a NULL.)
  </para>

  <para>
<programlisting>
IndexBulkDeleteResult *
ambulkdelete (IndexVacuumInfo *info,
              IndexBulkDeleteResult *stats,
              IndexBulkDeleteCallback callback,
              void *callback_state);
</programlisting>
   Delete tuple(s) from the index.  This is a <quote>bulk delete</> operation
   that is intended to be implemented by scanning the whole index and checking
   each entry to see if it should be deleted.
   The passed-in <literal>callback</> function must be called, in the style
   <literal>callback(<replaceable>TID</>, callback_state) returns bool</literal>,
   to determine whether any particular index entry, as identified by its
   referenced TID, is to be deleted.  The function must return either NULL
   or a palloc'd struct containing statistics about the effects of the
   deletion operation.
   It is OK to return NULL if no information needs to be passed on to
   <function>amvacuumcleanup</>.
  </para>

  <para>
   Because of limited <varname>maintenance_work_mem</>,
   <function>ambulkdelete</> might need to be called more than once when many
   tuples are to be deleted.  The <literal>stats</> argument is the result
   of the previous call for this index (it is NULL for the first call within a
   <command>VACUUM</> operation).  This allows the AM to accumulate statistics
   across the whole operation.  Typically, <function>ambulkdelete</> will
   modify and return the same struct if the passed <literal>stats</> is not
   null.
  </para>

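  <para>
   The callback protocol and the accumulation of statistics across repeated
   calls can be sketched with a small standalone model.  Everything here is
   an assumption made for illustration: the <literal>Demo*</literal> types
   stand in for the server's structs, the index is a plain array, and
   <function>calloc</function> replaces palloc.
  </para>

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the server's TID and statistics structs. */
typedef struct DemoTid { uint32_t block; uint16_t item; } DemoTid;
typedef struct DemoStats { double tuples_removed; double num_index_tuples; } DemoStats;

/* Same shape as the callback described above: (TID, state) returns bool. */
typedef bool (*DemoCallback) (const DemoTid *tid, void *callback_state);

/* A toy "index": just an array of TIDs. */
typedef struct DemoIndex { DemoTid *entries; size_t nentries; } DemoIndex;

static DemoStats *
demo_bulkdelete(DemoIndex *index, DemoStats *stats,
                DemoCallback callback, void *callback_state)
{
    size_t kept = 0;

    /* First call of the operation: no previous result, so allocate one. */
    if (stats == NULL)
    {
        stats = calloc(1, sizeof(DemoStats));
        if (stats == NULL)
            return NULL;
    }

    /* Scan the whole index and ask the callback about every entry. */
    for (size_t i = 0; i < index->nentries; i++)
    {
        if (callback(&index->entries[i], callback_state))
            stats->tuples_removed += 1;     /* accumulate across calls */
        else
            index->entries[kept++] = index->entries[i];
    }

    index->nentries = kept;
    stats->num_index_tuples = (double) kept;
    return stats;               /* same struct is modified and returned */
}

/* Example callback: delete every entry that points into one block. */
static bool
demo_in_block(const DemoTid *tid, void *callback_state)
{
    return tid->block == *(const uint32_t *) callback_state;
}
```

  <para>
   A caller would pass NULL for <literal>stats</literal> on the first call
   and the previous result on later calls, so the removed-tuple count
   reflects the whole operation rather than only the last pass.
  </para>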
  <para>
<programlisting>
IndexBulkDeleteResult *
amvacuumcleanup (IndexVacuumInfo *info,
                 IndexBulkDeleteResult *stats);
</programlisting>
   Clean up after a <command>VACUUM</command> operation (zero or more
   <function>ambulkdelete</> calls).  This does not have to do anything
   beyond returning index statistics, but it might perform bulk cleanup
   such as reclaiming empty index pages.  <literal>stats</> is whatever the
   last <function>ambulkdelete</> call returned, or NULL if
   <function>ambulkdelete</> was not called because no tuples needed to be
   deleted.  If the result is not NULL it must be a palloc'd struct.
   The statistics it contains will be used to update <structname>pg_class</>,
   and will be reported by <command>VACUUM</> if <literal>VERBOSE</> is given.
   It is OK to return NULL if the index was not changed at all during the
   <command>VACUUM</command> operation, but otherwise correct stats should
   be returned.
  </para>

  <para>
   As of <productname>PostgreSQL</productname> 8.4,
   <function>amvacuumcleanup</> will also be called at completion of an
   <command>ANALYZE</> operation.  In this case <literal>stats</> is always
   NULL and any return value will be ignored.  This case can be distinguished
   by checking <literal>info-&gt;analyze_only</literal>.  It is recommended
   that the access method do nothing except post-insert cleanup in such a
   call, and that only in an autovacuum worker process.
  </para>

  <para>
<programlisting>
void
amcostestimate (PlannerInfo *root,
                IndexOptInfo *index,
                List *indexQuals,
                RelOptInfo *outer_rel,
                Cost *indexStartupCost,
                Cost *indexTotalCost,
                Selectivity *indexSelectivity,
                double *indexCorrelation);
</programlisting>
   Estimate the costs of an index scan.  This function is described fully
   in <xref linkend="index-cost-estimation">, below.
  </para>

  <para>
<programlisting>
bytea *
amoptions (ArrayType *reloptions,
           bool validate);
</programlisting>
   Parse and validate the reloptions array for an index.  This is called only
   when a non-null reloptions array exists for the index.
   <parameter>reloptions</> is a <type>text</> array containing entries of the
   form <replaceable>name</><literal>=</><replaceable>value</>.
   The function should construct a <type>bytea</> value, which will be copied
   into the <structfield>rd_options</> field of the index's relcache entry.
   The data contents of the <type>bytea</> value are open for the access
   method to define; most of the standard access methods use struct
   <structname>StdRdOptions</>.
   When <parameter>validate</> is true, the function should report a suitable
   error message if any of the options are unrecognized or have invalid
   values; when <parameter>validate</> is false, invalid entries should be
   silently ignored.  (<parameter>validate</> is false when loading options
   already stored in <structname>pg_class</>; an invalid entry could only
   be found if the access method has changed its rules for options, and in
   that case ignoring obsolete entries is appropriate.)
   It is OK to return NULL if default behavior is wanted.
  </para>

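  <para>
   The effect of the <parameter>validate</parameter> flag can be sketched
   with a standalone parser for
   <replaceable>name</replaceable>=<replaceable>value</replaceable>
   strings.  This is only a model under stated assumptions: the
   <literal>DemoOptions</literal> struct, the single
   <literal>fillfactor</literal> option with a 10 to 100 range, and the
   boolean return value (in place of reporting an error) are all chosen
   for the example and are not the server's implementation.
  </para>

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parsed-options struct for this sketch. */
typedef struct DemoOptions { int fillfactor; } DemoOptions;

#define DEMO_DEFAULT_FILLFACTOR 90

/*
 * Parse "name=value" entries into *result.
 * Returns false only when validate is true and an entry is unrecognized
 * or out of range; with validate false such entries are silently ignored.
 */
static bool
demo_parse_options(const char *const *reloptions, int noptions,
                   bool validate, DemoOptions *result)
{
    result->fillfactor = DEMO_DEFAULT_FILLFACTOR;

    for (int i = 0; i < noptions; i++)
    {
        const char *entry = reloptions[i];
        const char *eq = strchr(entry, '=');
        bool        ok = false;

        if (eq != NULL &&
            (size_t) (eq - entry) == strlen("fillfactor") &&
            strncmp(entry, "fillfactor", (size_t) (eq - entry)) == 0)
        {
            char *end;
            long  value = strtol(eq + 1, &end, 10);

            /* Require a complete integer within the allowed range. */
            if (end != eq + 1 && *end == '\0' && value >= 10 && value <= 100)
            {
                result->fillfactor = (int) value;
                ok = true;
            }
        }

        if (!ok && validate)
            return false;       /* a real AM would report an error here */
    }
    return true;
}
```

  <para>
   With <parameter>validate</parameter> true, an entry such as
   <literal>bogus=1</literal> makes the call fail; with it false, the same
   entry is skipped and the remaining options still take effect.
  </para>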
  <para>
   The purpose of an index, of course, is to support scans for tuples matching
   an indexable <literal>WHERE</> condition, often called a
   <firstterm>qualifier</> or <firstterm>scan key</>.  The semantics of
   index scanning are described more fully in <xref linkend="index-scanning">,
   below.  An index access method can support <quote>plain</> index scans,
   <quote>bitmap</> index scans, or both.  The scan-related functions that an
   index access method must or may provide are:
  </para>

  <para>
<programlisting>
IndexScanDesc
ambeginscan (Relation indexRelation,
             int nkeys,
             ScanKey key);
</programlisting>
   Begin a new scan.  The <literal>key</> array (of length <literal>nkeys</>)
   describes the scan key(s) for the index scan.  The result must be a
   palloc'd struct.  For implementation reasons the index access method
   <emphasis>must</> create this struct by calling
   <function>RelationGetIndexScan()</>.  In most cases
   <function>ambeginscan</> itself does little beyond making that call;
   the interesting parts of index-scan startup are in <function>amrescan</>.
  </para>

  <para>
<programlisting>
bool
amgettuple (IndexScanDesc scan,
            ScanDirection direction);
</programlisting>
   Fetch the next tuple in the given scan, moving in the given
   direction (forward or backward in the index).  Returns TRUE if a tuple was
   obtained, FALSE if no matching tuples remain.  In the TRUE case the tuple
   TID is stored into the <literal>scan</> structure.  Note that
   <quote>success</> means only that the index contains an entry that matches
   the scan keys, not that the tuple necessarily still exists in the heap or
   will pass the caller's snapshot test.  On success, <function>amgettuple</>
   must also set <literal>scan-&gt;xs_recheck</> to TRUE or FALSE.
   FALSE means it is certain that the index entry matches the scan keys.
   TRUE means this is not certain, and the conditions represented by the
   scan keys must be rechecked against the heap tuple after fetching it.
   This provision supports <quote>lossy</> index operators.
   Note that rechecking will extend only to the scan conditions; a partial
   index predicate (if any) is never rechecked by <function>amgettuple</>
   callers.
  </para>

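  <para>
   The recheck protocol can be pictured with a toy lossy index that stores
   only a coarsened form of each key.  The sketch below is a standalone
   model: the <literal>Demo*</literal> types, the coarsening rule (key
   divided by ten), and the <literal>recheck</literal> field standing in
   for <literal>xs_recheck</literal> are assumptions made for the example.
  </para>

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy heap row and a lossy index entry that keeps only key / 10. */
typedef struct DemoHeapRow { int key; } DemoHeapRow;
typedef struct DemoIndexEntry { int coarse_key; size_t heap_slot; } DemoIndexEntry;

typedef struct DemoScan
{
    const DemoIndexEntry *entries;
    size_t      nentries;
    size_t      next;           /* next index entry to examine */
    int         search_key;     /* scan key: key = search_key */
    size_t      current_slot;   /* output: heap slot of returned entry */
    bool        recheck;        /* output: plays the role of xs_recheck */
} DemoScan;

/* Return the next candidate; the index cannot be sure it really matches. */
static bool
demo_gettuple(DemoScan *scan)
{
    while (scan->next < scan->nentries)
    {
        const DemoIndexEntry *e = &scan->entries[scan->next++];

        if (e->coarse_key == scan->search_key / 10)
        {
            scan->current_slot = e->heap_slot;
            scan->recheck = true;   /* lossy: caller must recheck */
            return true;
        }
    }
    return false;
}

/* Caller side: fetch the heap row and reapply the scan condition. */
static size_t
demo_count_matches(DemoScan *scan, const DemoHeapRow *heap)
{
    size_t matches = 0;

    while (demo_gettuple(scan))
    {
        if (scan->recheck && heap[scan->current_slot].key != scan->search_key)
            continue;           /* false positive from the lossy index */
        matches++;
    }
    return matches;
}
```

  <para>
   In this model the index returns every entry that could match, and the
   caller discards the false positives after looking at the heap row.
  </para>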
  <para>
   The <function>amgettuple</> function need only be provided if the access
   method supports <quote>plain</> index scans.  If it doesn't, the
   <structfield>amgettuple</> field in its <structname>pg_am</> row must
   be set to zero.
  </para>

  <para>
<programlisting>
int64
amgetbitmap (IndexScanDesc scan,
             TIDBitmap *tbm);
</programlisting>
   Fetch all tuples in the given scan and add them to the caller-supplied
   TIDBitmap (that is, OR the set of tuple IDs into whatever set is already
   in the bitmap).  The number of tuples fetched is returned (this might be
   just an approximate count, for instance some AMs do not detect duplicates).
   While inserting tuple IDs into the bitmap, <function>amgetbitmap</> can
   indicate that rechecking of the scan conditions is required for specific
   tuple IDs.  This is analogous to the <literal>xs_recheck</> output parameter
   of <function>amgettuple</>.  Note: in the current implementation, support
   for this feature is conflated with support for lossy storage of the bitmap
   itself, and therefore callers recheck both the scan conditions and the
   partial index predicate (if any) for recheckable tuples.  That might not
   always be true, however.
   <function>amgetbitmap</> and
   <function>amgettuple</> cannot be used in the same index scan; there
   are other restrictions too when using <function>amgetbitmap</>, as explained
   in <xref linkend="index-scanning">.
  </para>

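  <para>
   The <quote>OR into the existing set</quote> behavior and the
   approximate return count can be shown with a deliberately tiny model.
   Here the bitmap is a single 64-bit word with one bit per heap slot; the
   <literal>DemoBitmap</literal> type and the slot numbering are
   assumptions for this sketch and bear no relation to the real
   <structname>TIDBitmap</structname> layout.
  </para>

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_SLOTS 64

/* Toy bitmap: bit n set means heap slot n is in the result set. */
typedef struct DemoBitmap { uint64_t bits; } DemoBitmap;

/*
 * OR all matching slots into the caller's bitmap and return how many
 * index entries were added.  Duplicates are not detected, so the count
 * may exceed the number of distinct slots.
 */
static int64_t
demo_getbitmap(const unsigned *matching_slots, size_t nmatching,
               DemoBitmap *tbm)
{
    int64_t ntids = 0;

    for (size_t i = 0; i < nmatching; i++)
    {
        if (matching_slots[i] >= DEMO_MAX_SLOTS)
            continue;           /* outside the toy bitmap's range */
        tbm->bits |= UINT64_C(1) << matching_slots[i];
        ntids++;
    }
    return ntids;
}
```

  <para>
   Bits already present in the bitmap before the call are left set, which
   is what allows several index scans to be combined into one bitmap.
  </para>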
  <para>
   The <function>amgetbitmap</> function need only be provided if the access
   method supports <quote>bitmap</> index scans.  If it doesn't, the
   <structfield>amgetbitmap</> field in its <structname>pg_am</> row must
   be set to zero.
  </para>

  <para>
<programlisting>
void
amrescan (IndexScanDesc scan,
          ScanKey key);
</programlisting>
   Restart the given scan, possibly with new scan keys (to continue using
   the old keys, NULL is passed for <literal>key</>).  Note that it is not
   possible for the number of keys to be changed.  In practice the restart
   feature is used when a new outer tuple is selected by a nested-loop join
   and so a new key comparison value is needed, but the scan key structure
   remains the same.  This function is also called by
   <function>RelationGetIndexScan()</>, so it is used for initial setup
   of an index scan as well as rescanning.
  </para>

  <para>
<programlisting>
void
amendscan (IndexScanDesc scan);
</programlisting>
   End a scan and release resources.  The <literal>scan</> struct itself
   should not be freed, but any locks or pins taken internally by the
   access method must be released.
  </para>

  <para>
<programlisting>
void
ammarkpos (IndexScanDesc scan);
</programlisting>
   Mark current scan position.  The access method need only support one
   remembered scan position per scan.
  </para>

  <para>
<programlisting>
void
amrestrpos (IndexScanDesc scan);
</programlisting>
   Restore the scan to the most recently marked position.
  </para>

  <para>
   By convention, the <literal>pg_proc</literal> entry for an index
   access method function should show the correct number of arguments,
   but declare them all as type <type>internal</> (since most of the arguments
   have types that are not known to SQL, and we don't want users calling
   the functions directly anyway).  The return type is declared as
   <type>void</>, <type>internal</>, or <type>boolean</> as appropriate.
   The only exception is <function>amoptions</>, which should be correctly
   declared as taking <type>text[]</> and <type>bool</> and returning
   <type>bytea</>.  This provision allows client code to execute
   <function>amoptions</> to test validity of options settings.
  </para>

 </sect1>

 <sect1 id="index-scanning">
  <title>Index Scanning</title>

  <para>
   In an index scan, the index access method is responsible for regurgitating
   the TIDs of all the tuples it has been told about that match the
   <firstterm>scan keys</>.  The access method is <emphasis>not</> involved in
   actually fetching those tuples from the index's parent table, nor in
   determining whether they pass the scan's time qualification test or other
   conditions.
  </para>

  <para>
   A scan key is the internal representation of a <literal>WHERE</> clause of
   the form <replaceable>index_key</> <replaceable>operator</>
   <replaceable>constant</>, where the index key is one of the columns of the
   index and the operator is one of the members of the operator family
   associated with that index column.  An index scan has zero or more scan
   keys, which are implicitly ANDed &mdash; the returned tuples are expected
   to satisfy all the indicated conditions.
  </para>

  <para>
   The access method can report that the index is <firstterm>lossy</>, or
   requires rechecks, for a particular query.  This implies that the index
   scan will return all the entries that pass the scan key, plus possibly
   additional entries that do not.  The core system's index-scan machinery
   will then apply the index conditions again to the heap tuple to verify
   whether or not it really should be selected.  If the recheck option is not
   specified, the index scan must return exactly the set of matching entries.
  </para>

  <para>
   Note that it is entirely up to the access method to ensure that it
   correctly finds all and only the entries passing all the given scan keys.
   Also, the core system will simply hand off all the <literal>WHERE</>
   clauses that match the index keys and operator families, without any
   semantic analysis to determine whether they are redundant or
   contradictory.  As an example, given
   <literal>WHERE x &gt; 4 AND x &gt; 14</> where <literal>x</> is a b-tree
   indexed column, it is left to the b-tree <function>amrescan</> function
   to realize that the first scan key is redundant and can be discarded.
   The extent of preprocessing needed during <function>amrescan</> will
   depend on the extent to which the index access method needs to reduce
   the scan keys to a <quote>normalized</> form.
  </para>

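  <para>
   The kind of preprocessing described above can be sketched for the
   simplest case: several strict inequality keys on one integer column.
   This is a hedged standalone model, not the b-tree code; the
   <literal>DemoScanKey</literal> type, the two strategies, and the
   function name are assumptions made for the example.
  </para>

```c
#include <stdbool.h>
#include <stddef.h>

/* Two illustrative strategies: "x < constant" and "x > constant". */
typedef enum DemoStrategy { DEMO_LESS, DEMO_GREATER } DemoStrategy;

typedef struct DemoScanKey
{
    DemoStrategy strategy;
    int          constant;
} DemoScanKey;

/*
 * Reduce implicitly ANDed keys on one column to at most one lower and
 * one upper bound.  Returns the number of keys written to out[], and
 * sets *contradictory when the remaining bounds cannot both hold.
 */
static size_t
demo_normalize_keys(const DemoScanKey *keys, size_t nkeys,
                    DemoScanKey out[2], bool *contradictory)
{
    bool   have_lower = false;
    bool   have_upper = false;
    int    lower = 0;
    int    upper = 0;
    size_t nout = 0;

    for (size_t i = 0; i < nkeys; i++)
    {
        int c = keys[i].constant;

        if (keys[i].strategy == DEMO_GREATER)
        {
            /* Of several "x > c" keys, only the largest c matters. */
            if (!have_lower || c > lower)
                lower = c;
            have_lower = true;
        }
        else
        {
            /* Of several "x < c" keys, only the smallest c matters. */
            if (!have_upper || c < upper)
                upper = c;
            have_upper = true;
        }
    }

    if (have_lower)
    {
        out[nout].strategy = DEMO_GREATER;
        out[nout].constant = lower;
        nout++;
    }
    if (have_upper)
    {
        out[nout].strategy = DEMO_LESS;
        out[nout].constant = upper;
        nout++;
    }

    /* "x > lower AND x < upper" has no solution when lower >= upper. */
    *contradictory = have_lower && have_upper && lower >= upper;
    return nout;
}
```

  <para>
   Applied to <literal>x &gt; 4 AND x &gt; 14</literal>, the sketch keeps
   only <literal>x &gt; 14</literal>; applied to
   <literal>x &gt; 10 AND x &lt; 5</literal>, it flags the keys as
   contradictory so the scan could return nothing without touching the
   index.
  </para>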
  <para>
   Some access methods return index entries in a well-defined order, others
   do not.  If entries are returned in sorted order, the access method should
   set <structname>pg_am</>.<structfield>amcanorder</> true to indicate that
   it supports ordered scans.
   All such access methods must use btree-compatible strategy numbers for
   their equality and ordering operators.
  </para>

  <para>
   The <function>amgettuple</> function has a <literal>direction</> argument,
   which can be either <literal>ForwardScanDirection</> (the normal case)
   or <literal>BackwardScanDirection</>.  If the first call after
   <function>amrescan</> specifies <literal>BackwardScanDirection</>, then the
   set of matching index entries is to be scanned back-to-front rather than in
   the normal front-to-back direction, so <function>amgettuple</> must return
   the last matching tuple in the index, rather than the first one as it
   normally would.  (This will only occur for access
   methods that advertise they support ordered scans.)  After the
   first call, <function>amgettuple</> must be prepared to advance the scan in
   either direction from the most recently returned entry.  (But if
   <structname>pg_am</>.<structfield>amcanbackward</> is false, all subsequent
   calls will have the same direction as the first one.)
  </para>

  <para>
   Access methods that support ordered scans must support <quote>marking</> a
   position in a scan and later returning to the marked position.  The same
   position might be restored multiple times.  However, only one position need
   be remembered per scan; a new <function>ammarkpos</> call overrides the
   previously marked position.  An access method that does not support
   ordered scans should still provide mark and restore functions in
   <structname>pg_am</>, but it is sufficient to have them throw errors if
   called.
  </para>

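  <para>
   The direction and mark/restore rules can be modeled with a scan over a
   sorted array.  This is a minimal sketch under stated assumptions: the
   <literal>DemoScan</literal> struct, the <literal>DemoDirection</literal>
   values, and the three functions are hypothetical names, and concurrency
   is ignored entirely.
  </para>

```c
#include <stdbool.h>

typedef enum DemoDirection { DEMO_BACKWARD = -1, DEMO_FORWARD = 1 } DemoDirection;

typedef struct DemoScan
{
    const int  *items;          /* matching entries, in index order */
    long        nitems;
    long        pos;            /* most recently returned entry */
    bool        started;        /* false until the first fetch */
    long        mark;           /* the single remembered position */
    bool        mark_started;
} DemoScan;

/*
 * Fetch the next entry in the given direction.  The first call decides
 * whether the scan starts at the front or at the back; later calls move
 * from the most recently returned entry.
 */
static bool
demo_gettuple(DemoScan *scan, DemoDirection dir, int *out)
{
    long next;

    if (!scan->started)
        next = (dir == DEMO_FORWARD) ? 0 : scan->nitems - 1;
    else
        next = scan->pos + (long) dir;

    if (next < 0 || next >= scan->nitems)
        return false;           /* no more entries in this direction */

    scan->pos = next;
    scan->started = true;
    *out = scan->items[next];
    return true;
}

/* Remember the current position, replacing any earlier mark. */
static void
demo_markpos(DemoScan *scan)
{
    scan->mark = scan->pos;
    scan->mark_started = scan->started;
}

/* Return to the marked position; this may be done repeatedly. */
static void
demo_restrpos(DemoScan *scan)
{
    scan->pos = scan->mark;
    scan->started = scan->mark_started;
}
```

  <para>
   Starting backward over the entries 10, 20, 30 yields 30 first; marking
   after 20 and later restoring makes the next forward fetch return 30
   again.
  </para>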
  <para>
   Both the scan position and the mark position (if any) must be maintained
   consistently in the face of concurrent insertions or deletions in the
   index.  It is OK if a freshly-inserted entry is not returned by a scan that
   would have found the entry if it had existed when the scan started, or for
   the scan to return such an entry upon rescanning or backing
   up even though it had not been returned the first time through.  Similarly,
   a concurrent delete might or might not be reflected in the results of a scan.
   What is important is that insertions or deletions not cause the scan to
   miss or multiply return entries that were not themselves being inserted or
   deleted.
  </para>

  <para>
   Instead of using <function>amgettuple</>, an index scan can be done with
   <function>amgetbitmap</> to fetch all tuples in one call.  This can be
   noticeably more efficient than <function>amgettuple</> because it allows
   avoiding lock/unlock cycles within the access method.  In principle
   <function>amgetbitmap</> should have the same effects as repeated
   <function>amgettuple</> calls, but we impose several restrictions to
   simplify matters.  First of all, <function>amgetbitmap</> returns all
   tuples at once and marking or restoring scan positions isn't
   supported.  Secondly, the tuples are returned in a bitmap which doesn't
   have any specific ordering, which is why <function>amgetbitmap</> doesn't
   take a <literal>direction</> argument.  Finally, <function>amgetbitmap</>
   does not guarantee any locking of the returned tuples, with implications
   spelled out in <xref linkend="index-locking">.
  </para>

  <para>
   Note that it is permitted for an access method to implement only
   <function>amgetbitmap</> and not <function>amgettuple</>, or vice versa,
   if its internal implementation is unsuited to one API or the other.
  </para>

 </sect1>

 <sect1 id="index-locking">
  <title>Index Locking Considerations</title>

  <para>
   Index access methods must handle concurrent updates
   of the index by multiple processes.
   The core <productname>PostgreSQL</productname> system obtains
   <literal>AccessShareLock</> on the index during an index scan, and
   <literal>RowExclusiveLock</> when updating the index (including plain
   <command>VACUUM</>).  Since these lock
   types do not conflict, the access method is responsible for handling any
   fine-grained locking it might need.  An exclusive lock on the index as a whole
   will be taken only during index creation, destruction,
   <command>REINDEX</>, or <command>VACUUM FULL</>.
  </para>

 573
 574   <para>
 575    Building an index type that supports concurrent updates usually requires
 576    extensive and subtle analysis of the required behavior.  For the b-tree
 577    and hash index types, you can read about the design decisions involved in
 578    <filename>src/backend/access/nbtree/README</> and
 579    <filename>src/backend/access/hash/README</>.
 580   </para>
 581
  <para>
   Aside from the index's own internal consistency requirements, concurrent
   updates create issues about consistency between the parent table (the
   <firstterm>heap</>) and the index.  Because
   <productname>PostgreSQL</productname> separates accesses
   and updates of the heap from those of the index, there are windows in
   which the index might be inconsistent with the heap.  We handle this problem
   with the following rules:

    <itemizedlist>
     <listitem>
      <para>
       A new heap entry is made before making its index entries.  (Therefore
       a concurrent index scan is likely to fail to see the heap entry.
       This is okay because the index reader would be uninterested in an
       uncommitted row anyway.  But see <xref linkend="index-unique-checks">.)
      </para>
     </listitem>
     <listitem>
      <para>
       When a heap entry is to be deleted (by <command>VACUUM</>), all its
       index entries must be removed first.
      </para>
     </listitem>
     <listitem>
      <para>
       An index scan must maintain a pin
       on the index page holding the item last returned by
       <function>amgettuple</>, and <function>ambulkdelete</> cannot delete
       entries from pages that are pinned by other backends.  The need
       for this rule is explained below.
      </para>
     </listitem>
    </itemizedlist>

   Without the third rule, it is possible for an index reader to
   see an index entry just before it is removed by <command>VACUUM</>, and
   then to arrive at the corresponding heap entry after that was removed by
   <command>VACUUM</>.
   This creates no serious problems if that item
   number is still unused when the reader reaches it, since an empty
   item slot will be ignored by <function>heap_fetch()</>.  But what if a
   third backend has already re-used the item slot for something else?
   When using an MVCC-compliant snapshot, there is no problem because
   the new occupant of the slot is certain to be too new to pass the
   snapshot test.  However, with a non-MVCC-compliant snapshot (such as
   <literal>SnapshotNow</>), it would be possible to accept and return
   a row that does not in fact match the scan keys.  We could defend
   against this scenario by requiring the scan keys to be rechecked
   against the heap row in all cases, but that is too expensive.  Instead,
   we use a pin on an index page as a proxy to indicate that the reader
   might still be <quote>in flight</> from the index entry to the matching
   heap entry.  Making <function>ambulkdelete</> block on such a pin ensures
   that <command>VACUUM</> cannot delete the heap entry before the reader
   is done with it.  This solution costs little in run time, and adds blocking
   overhead only in the rare cases where there actually is a conflict.
  </para>

 640   <para>
 641    This solution requires that index scans be <quote>synchronous</>: we have
 642    to fetch each heap tuple immediately after scanning the corresponding index
 643    entry.  This is expensive for a number of reasons.  An
 644    <quote>asynchronous</> scan in which we collect many TIDs from the index,
 645    and only visit the heap tuples sometime later, requires much less index
 646    locking overhead and can allow a more efficient heap access pattern.
 647    Per the above analysis, we must use the synchronous approach for
 648    non-MVCC-compliant snapshots, but an asynchronous scan is workable
 649    for a query using an MVCC snapshot.
 650   </para>
 651
 652   <para>
 653    In an <function>amgetbitmap</> index scan, the access method does not
 654    keep an index pin on any of the returned tuples.  Therefore
 655    it is only safe to use such scans with MVCC-compliant snapshots.
 656   </para>
 657
 658  </sect1>
 659
 660  <sect1 id="index-unique-checks">
 661   <title>Index Uniqueness Checks</title>
 662
 663   <para>
 664    <productname>PostgreSQL</productname> enforces SQL uniqueness constraints
 665    using <firstterm>unique indexes</>, which are indexes that disallow
 666    multiple entries with identical keys.  An access method that supports this
 667    feature sets <structname>pg_am</>.<structfield>amcanunique</> true.
 668    (At present, only b-tree supports it.)
 669   </para>
 670
 671   <para>
 672    Because of MVCC, it is always necessary to allow duplicate entries to
 673    exist physically in an index: the entries might refer to successive
 674    versions of a single logical row.  The behavior we actually want to
 675    enforce is that no MVCC snapshot could include two rows with equal
 676    index keys.  This breaks down into the following cases that must be
 677    checked when inserting a new row into a unique index:
 678
 679     <itemizedlist>
 680      <listitem>
 681       <para>
 682        If a conflicting valid row has been deleted by the current transaction,
 683        it's okay.  (In particular, since an <command>UPDATE</> always deletes the old row
 684        version before inserting the new version, this will allow an <command>UPDATE</> on
 685        a row without changing the key.)
 686       </para>
 687      </listitem>
 688      <listitem>
 689       <para>
 690        If a conflicting row has been inserted by an as-yet-uncommitted
 691        transaction, the would-be inserter must wait to see if that transaction
 692        commits.  If it rolls back then there is no conflict.  If it commits
 693        without deleting the conflicting row again, there is a uniqueness
 694        violation.  (In practice we just wait for the other transaction to
 695        end and then redo the visibility check <foreignphrase>in toto</>.)
 696       </para>
 697      </listitem>
 698      <listitem>
 699       <para>
 700        Similarly, if a conflicting valid row has been deleted by an
 701        as-yet-uncommitted transaction, the would-be inserter must wait
 702        for that transaction to commit or abort, and then repeat the test.
 703       </para>
 704      </listitem>
 705     </itemizedlist>
 706   </para>
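
  <para>
   The cases above, together with the outcomes they imply once the other
   transaction has ended, can be summarized as a small decision function.
   The following is only a standalone sketch, not code taken from any access
   method: the types <literal>XactState</>, <literal>ConflictingRow</> and
   <literal>UniqueVerdict</> and the function <function>check_conflict()</>
   are hypothetical names invented for this illustration.  The treatment of
   rows whose inserting or deleting transaction has already committed or
   aborted is an assumption derived from the rules, not something they state
   explicitly.
  </para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical transaction states; these are not PostgreSQL types. */
typedef enum
{
    XACT_COMMITTED,         /* transaction has committed */
    XACT_ABORTED,           /* transaction has rolled back */
    XACT_CURRENT,           /* the transaction performing the insertion */
    XACT_OTHER_RUNNING      /* some other, as-yet-uncommitted transaction */
} XactState;

/* Hypothetical summary of a heap row whose key equals the new key. */
typedef struct
{
    XactState inserter;     /* state of the transaction that inserted it */
    bool      deleted;      /* has some transaction marked it deleted? */
    XactState deleter;      /* meaningful only when deleted is true */
} ConflictingRow;

typedef enum
{
    UNIQ_NO_CONFLICT,       /* insertion may proceed */
    UNIQ_WAIT_AND_RECHECK,  /* wait for the other transaction, then retest */
    UNIQ_VIOLATION          /* raise a uniqueness violation */
} UniqueVerdict;

UniqueVerdict
check_conflict(const ConflictingRow *row)
{
    assert(row != NULL);

    /* Assumption: a row inserted by an aborted transaction never existed. */
    if (row->inserter == XACT_ABORTED)
        return UNIQ_NO_CONFLICT;

    /* Inserted by an as-yet-uncommitted transaction: wait and retest. */
    if (row->inserter == XACT_OTHER_RUNNING)
        return UNIQ_WAIT_AND_RECHECK;

    /* From here on the row is valid (inserter committed or is us). */
    if (!row->deleted || row->deleter == XACT_ABORTED)
        return UNIQ_VIOLATION;

    /* Deleted by an as-yet-uncommitted transaction: wait and retest. */
    if (row->deleter == XACT_OTHER_RUNNING)
        return UNIQ_WAIT_AND_RECHECK;

    /* Deleted by the current transaction, or by a committed one. */
    return UNIQ_NO_CONFLICT;
}
```
]]></programlisting>

  <para>
   In this sketch, a caller that receives
   <literal>UNIQ_WAIT_AND_RECHECK</> would wait for the other transaction
   to end, rebuild the row summary, and call the function again, which
   mirrors the <quote>wait and redo the visibility check</> behavior
   described in the list.
  </para>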
 707
 708   <para>
 709    Furthermore, immediately before raising a uniqueness violation
 710    according to the above rules, the access method must recheck the
 711    liveness of the row being inserted.  If it is committed dead then
 712    no error should be raised.  (This case cannot occur during the
 713    ordinary scenario of inserting a row that's just been created by
 714    the current transaction.  It can happen during
 715    <command>CREATE UNIQUE INDEX CONCURRENTLY</>, however.)
 716   </para>
 717
 718   <para>
 719    We require the index access method to apply these tests itself, which
 720    means that it must reach into the heap to check the commit status of
 721    any row that is shown to have a duplicate key according to the index
 722    contents.  This is without a doubt ugly and non-modular, but it saves
 723    redundant work: if we did a separate probe then the index lookup for
 724    a conflicting row would be essentially repeated while finding the place to
 725    insert the new row's index entry.  What's more, there is no obvious way
 726    to avoid race conditions unless the conflict check is an integral part
 727    of insertion of the new index entry.
 728   </para>
 729
 730   <para>
 731    The main limitation of this scheme is that it has no convenient way
 732    to support deferred uniqueness checks.
 733   </para>
 734
 735  </sect1>
 736
 737  <sect1 id="index-cost-estimation">
 738   <title>Index Cost Estimation Functions</title>
 739
 740   <para>
 741    The <function>amcostestimate</> function is given a list of WHERE clauses that have
 742    been determined to be usable with the index.  It must return estimates
 743    of the cost of accessing the index and the selectivity of the WHERE
 744    clauses (that is, the fraction of parent-table rows that will be
 745    retrieved during the index scan).  For simple cases, nearly all the
 746    work of the cost estimator can be done by calling standard routines
 747    in the optimizer; the point of having an <function>amcostestimate</> function is
 748    to allow index access methods to provide index-type-specific knowledge,
 749    in case it is possible to improve on the standard estimates.
 750   </para>
 751
 752   <para>
 753    Each <function>amcostestimate</> function must have the signature:
 754
 755 <programlisting>
 756 void
 757 amcostestimate (PlannerInfo *root,
 758                 IndexOptInfo *index,
 759                 List *indexQuals,
 760                 RelOptInfo *outer_rel,
 761                 Cost *indexStartupCost,
 762                 Cost *indexTotalCost,
 763                 Selectivity *indexSelectivity,
 764                 double *indexCorrelation);
 765 </programlisting>
 766
 767    The first four parameters are inputs:
 768
 769    <variablelist>
 770     <varlistentry>
 771      <term>root</term>
 772      <listitem>
 773       <para>
 774        The planner's information about the query being processed.
 775       </para>
 776      </listitem>
 777     </varlistentry>
 778
 779     <varlistentry>
 780      <term>index</term>
 781      <listitem>
 782       <para>
 783        The index being considered.
 784       </para>
 785      </listitem>
 786     </varlistentry>
 787
 788     <varlistentry>
 789      <term>indexQuals</term>
 790      <listitem>
 791       <para>
 792        List of index qual clauses (implicitly ANDed);
 793        a NIL list indicates no qualifiers are available.
 794        Note that the list contains expression trees, not ScanKeys.
 795       </para>
 796      </listitem>
 797     </varlistentry>
 798
 799     <varlistentry>
 800      <term>outer_rel</term>
 801      <listitem>
 802       <para>
 803        If the index is being considered for use in a join inner indexscan,
 804        the planner's information about the outer side of the join.  Otherwise
 805        NULL.  When non-NULL, some of the qual clauses will be join clauses
 806        with this rel rather than being simple restriction clauses.  Also,
 807        the cost estimator should expect that the index scan will be repeated
 808        for each row of the outer rel.
 809       </para>
 810      </listitem>
 811     </varlistentry>
 812    </variablelist>
 813   </para>
 814
 815   <para>
 816    The last four parameters are pass-by-reference outputs:
 817
 818    <variablelist>
 819     <varlistentry>
 820      <term>*indexStartupCost</term>
 821      <listitem>
 822       <para>
 823        Set to the cost of index start-up processing.
 824       </para>
 825      </listitem>
 826     </varlistentry>
 827
 828     <varlistentry>
 829      <term>*indexTotalCost</term>
 830      <listitem>
 831       <para>
 832        Set to the total cost of index processing.
 833       </para>
 834      </listitem>
 835     </varlistentry>
 836
 837     <varlistentry>
 838      <term>*indexSelectivity</term>
 839      <listitem>
 840       <para>
 841        Set to the index selectivity.
 842       </para>
 843      </listitem>
 844     </varlistentry>
 845
 846     <varlistentry>
 847      <term>*indexCorrelation</term>
 848      <listitem>
 849       <para>
 850        Set to the correlation coefficient between the index scan order and
 851        the underlying table's order.
 852       </para>
 853      </listitem>
 854     </varlistentry>
 855    </variablelist>
 856   </para>
 857
 858   <para>
 859    Note that cost estimate functions must be written in C, not in SQL or
 860    any available procedural language, because they must access internal
 861    data structures of the planner/optimizer.
 862   </para>
 863
 864   <para>
 865    The index access costs should be computed using the parameters used by
 866    <filename>src/backend/optimizer/path/costsize.c</filename>: a sequential
 867    disk block fetch has cost <varname>seq_page_cost</>, a nonsequential fetch
 868    has cost <varname>random_page_cost</>, and the cost of processing one index
 869    row should usually be taken as <varname>cpu_index_tuple_cost</>.  In
 870    addition, an appropriate multiple of <varname>cpu_operator_cost</> should
 871    be charged for any comparison operators invoked during index processing
 872    (especially evaluation of the indexQuals themselves).
 873   </para>
 874
 875   <para>
 876    The access costs should include all disk and CPU costs associated with
 877    scanning the index itself, but <emphasis>not</> the costs of retrieving or
 878    processing the parent-table rows that are identified by the index.
 879   </para>
 880
 881   <para>
 882    The <quote>start-up cost</quote> is the part of the total scan cost that
 883    must be expended before we can begin to fetch the first row.  For most
 884    indexes this can be taken as zero, but an index type with a high start-up
 885    cost might want to set it nonzero.
 886   </para>
 887
 888   <para>
 889    The indexSelectivity should be set to the estimated fraction of the parent
 890    table rows that will be retrieved during the index scan.  In the case
 891    of a lossy query, this will typically be higher than the fraction of
 892    rows that actually pass the given qual conditions.
 893   </para>
 894
 895   <para>
 896    The indexCorrelation should be set to the correlation (ranging between
 897    -1.0 and 1.0) between the index order and the table order.  This is used
 898    to adjust the estimate for the cost of fetching rows from the parent
 899    table.
 900   </para>
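
  <para>
   As an illustration of how a correlation value could be used to adjust a
   heap-fetch cost estimate, the sketch below interpolates between a
   worst-case and a best-case I/O cost using the square of the correlation.
   This is an assumption made for the example: the function
   <function>toy_interpolate_io_cost()</> and its parameters are
   hypothetical, and the planner's actual computation is in
   <filename>src/backend/optimizer/path/costsize.c</filename> and might
   differ in detail.
  </para>

<programlisting><![CDATA[
```c
#include <assert.h>

/*
 * Hypothetical helper, not a planner function.
 *
 * min_io_cost: heap I/O cost if index order matches table order perfectly
 * max_io_cost: heap I/O cost if index order is unrelated to table order
 * correlation: value in [-1.0, 1.0], as reported via *indexCorrelation
 */
double
toy_interpolate_io_cost(double min_io_cost, double max_io_cost,
                        double correlation)
{
    double csquared;

    assert(correlation >= -1.0 && correlation <= 1.0);
    assert(min_io_cost <= max_io_cost);

    /* Squaring makes ascending and descending correlation equivalent. */
    csquared = correlation * correlation;

    /* correlation 0 yields max_io_cost; correlation +/-1 yields min_io_cost */
    return max_io_cost + csquared * (min_io_cost - max_io_cost);
}
```
]]></programlisting>

  <para>
   Under this sketch, reporting zero when the correlation is unknown leaves
   the caller with the worst-case estimate.
  </para>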
 901
 902   <para>
 903    In the join case, the returned numbers should be averages expected for
 904    any one scan of the index.
 905   </para>
 906
 907   <procedure>
 908    <title>Cost Estimation</title>
 909    <para>
 910     A typical cost estimator will proceed as follows:
 911    </para>
 912
 913    <step>
 914     <para>
 915      Estimate and return the fraction of parent-table rows that will be visited
 916      based on the given qual conditions.  In the absence of any index-type-specific
 917      knowledge, use the standard optimizer function <function>clauselist_selectivity()</function>:
 918
 919 <programlisting>
 920 *indexSelectivity = clauselist_selectivity(root, indexQuals,
 921                                            index-&gt;rel-&gt;relid,
 922                                            JOIN_INNER, NULL);
 923 </programlisting>
 924     </para>
 925    </step>
 926
 927    <step>
 928     <para>
 929      Estimate the number of index rows that will be visited during the
 930      scan.  For many index types this is the same as indexSelectivity times
 931      the number of rows in the index, but it might be more.  (Note that the
 932      index's size in pages and rows is available from the <structname>IndexOptInfo</> struct.)
 933     </para>
 934    </step>
 935
 936    <step>
 937     <para>
 938      Estimate the number of index pages that will be retrieved during the scan.
 939      This might be just indexSelectivity times the index's size in pages.
 940     </para>
 941    </step>
 942
 943    <step>
 944     <para>
 945      Compute the index access cost.  A generic estimator might do this:
 946
 947 <programlisting>
 948     /*
 949      * Our generic assumption is that the index pages will be read
 950      * sequentially, so they cost seq_page_cost each, not random_page_cost.
 951      * Also, we charge for evaluation of the indexquals at each index row.
 952      * All the costs are assumed to be paid incrementally during the scan.
 953      */
 954     cost_qual_eval(&amp;index_qual_cost, indexQuals, root);
 955     *indexStartupCost = index_qual_cost.startup;
 956     *indexTotalCost = seq_page_cost * numIndexPages +
 957         (cpu_index_tuple_cost + index_qual_cost.per_tuple) * numIndexTuples;
 958 </programlisting>
 959
 960      However, the above does not account for amortization of index reads
 961      across repeated index scans in the join case.
 962     </para>
 963    </step>
 964
 965    <step>
 966     <para>
 967      Estimate the index correlation.  For a simple ordered index on a single
 968      field, this can be retrieved from <structname>pg_statistic</>.  If the correlation
 969      is not known, the conservative estimate is zero (no correlation).
 970     </para>
 971    </step>
 972   </procedure>
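
  <para>
   The arithmetic of the second through fourth steps can be tried outside the
   server with a standalone sketch such as the following.  It is not a real
   <function>amcostestimate</> function: <literal>ToyIndexInfo</>,
   <literal>ToyIndexCost</> and <function>toy_generic_cost()</> are
   hypothetical stand-ins for the planner's data structures, the cost
   parameters are hard-coded to their usual default settings, and each index
   qual is assumed to cost one <varname>cpu_operator_cost</> per index row
   with no start-up cost.
  </para>

<programlisting><![CDATA[
```c
#include <assert.h>

/* Cost parameters, fixed here at their usual defaults for illustration. */
static const double seq_page_cost = 1.0;
static const double cpu_index_tuple_cost = 0.005;
static const double cpu_operator_cost = 0.0025;

/* Hypothetical stand-in for the size fields of IndexOptInfo. */
typedef struct
{
    double pages;           /* index size in pages */
    double tuples;          /* index size in rows */
} ToyIndexInfo;

/* Hypothetical container for the two cost outputs. */
typedef struct
{
    double startup;
    double total;
} ToyIndexCost;

ToyIndexCost
toy_generic_cost(const ToyIndexInfo *index, double selectivity, int num_quals)
{
    ToyIndexCost cost;
    double       num_index_tuples;
    double       num_index_pages;
    double       qual_cost_per_tuple;

    assert(selectivity >= 0.0 && selectivity <= 1.0);
    assert(num_quals >= 0);

    /* Step 2: index rows visited, assuming no extra rows are scanned. */
    num_index_tuples = selectivity * index->tuples;

    /* Step 3: index pages retrieved, using the same fraction. */
    num_index_pages = selectivity * index->pages;

    /* Assumption: one operator evaluation per qual per index row. */
    qual_cost_per_tuple = cpu_operator_cost * num_quals;

    /* Step 4: sequential page reads plus per-row CPU charges. */
    cost.startup = 0.0;
    cost.total = seq_page_cost * num_index_pages +
        (cpu_index_tuple_cost + qual_cost_per_tuple) * num_index_tuples;

    return cost;
}
```
]]></programlisting>

  <para>
   For example, with a 100-page index holding 10000 rows, a selectivity of
   0.25 and a single qual, the sketch charges 25 page fetches plus 2500 rows
   at 0.0075 each, for a total of 43.75.  Like the generic estimator shown in
   the procedure, it ignores amortization across repeated scans in the join
   case.
  </para>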
 973
 974   <para>
 975    Examples of cost estimator functions can be found in
 976    <filename>src/backend/utils/adt/selfuncs.c</filename>.
 977   </para>
 978  </sect1>
 979 </chapter>