doc/src/sgml/indexam.sgml

   1 <!-- doc/src/sgml/indexam.sgml -->
   2
   3 <chapter id="indexam">
   4  <title>Index Access Method Interface Definition</title>
   5
   6  <indexterm>
   7   <primary>Index Access Method</primary>
   8  </indexterm>
   9  <indexterm>
  10   <primary>indexam</primary>
  11   <secondary>Index Access Method</secondary>
  12  </indexterm>
  13
  14   <para>
  15    This chapter defines the interface between the core
  16    <productname>PostgreSQL</productname> system and <firstterm>index access
  17    methods</firstterm>, which manage individual index types.  The core system
  18    knows nothing about indexes beyond what is specified here, so it is
  19    possible to develop entirely new index types by writing add-on code.
  20   </para>
  21
  22   <para>
  23    All indexes in <productname>PostgreSQL</productname> are what are known
  24    technically as <firstterm>secondary indexes</firstterm>; that is, the index is
  25    physically separate from the table file that it describes.  Each index
  26    is stored as its own physical <firstterm>relation</firstterm> and so is described
  27    by an entry in the <structname>pg_class</structname> catalog.  The contents of an
  28    index are entirely under the control of its index access method.  In
  29    practice, all index access methods divide indexes into standard-size
  30    pages so that they can use the regular storage manager and buffer manager
  31    to access the index contents.  (All the existing index access methods
  32    furthermore use the standard page layout described in <xref
  33    linkend="storage-page-layout"/>, and most use the same format for index
  34    tuple headers; but these decisions are not forced on an access method.)
  35   </para>
  36
  37   <para>
  38    An index is effectively a mapping from some data key values to
  39    <firstterm>tuple identifiers</firstterm>, or <acronym>TIDs</acronym>, of row versions
  40    (tuples) in the index's parent table.  A TID consists of a
  41    block number and an item number within that block (see <xref
  42    linkend="storage-page-layout"/>).  This is sufficient
  43    information to fetch a particular row version from the table.
  44    Indexes are not directly aware that under MVCC, there might be multiple
  45    extant versions of the same logical row; to an index, each tuple is
  46    an independent object that needs its own index entry.  Thus, an
  47    update of a row always creates all-new index entries for the row, even if
  48    the key values did not change.  (<link linkend="storage-hot">HOT
  49    tuples</link> are an exception to this
  50    statement; but indexes do not deal with those, either.)  Index entries for
  51    dead tuples are reclaimed (by vacuuming) when the dead tuples themselves
  52    are reclaimed.
  53   </para>
  54
  55  <sect1 id="index-api">
  56   <title>Basic API Structure for Indexes</title>
  57
  58   <para>
  59    Each index access method is described by a row in the
  60    <link linkend="catalog-pg-am"><structname>pg_am</structname></link>
  61    system catalog.  The <structname>pg_am</structname> entry
  62    specifies a name and a <firstterm>handler function</firstterm> for the index
  63    access method.  These entries can be created and deleted using the
  64    <xref linkend="sql-create-access-method"/> and
  65    <xref linkend="sql-drop-access-method"/> SQL commands.
  66   </para>
  67
  68   <para>
  69    An index access method handler function must be declared to accept a
  70    single argument of type <type>internal</type> and to return the
  71    pseudo-type <type>index_am_handler</type>.  The argument is a dummy value that
  72    simply serves to prevent handler functions from being called directly from
  73    SQL commands.  The result of the function must be a palloc'd struct of
  74    type <structname>IndexAmRoutine</structname>, which contains everything
  75    that the core code needs to know to make use of the index access method.
  76    The <structname>IndexAmRoutine</structname> struct, also called the access
  77    method's <firstterm>API struct</firstterm>, includes fields specifying assorted
  78    fixed properties of the access method, such as whether it can support
  79    multicolumn indexes.  More importantly, it contains pointers to support
  80    functions for the access method, which do all of the real work to access
  81    indexes.  These support functions are plain C functions and are not
  82    visible or callable at the SQL level.  The support functions are described
  83    in <xref linkend="index-functions"/>.
  84   </para>
  85
  86   <para>
  87    The structure <structname>IndexAmRoutine</structname> is defined thus:
  88 <programlisting>
  89 typedef struct IndexAmRoutine
  90 {
  91     NodeTag     type;
  92
  93     /*
  94      * Total number of strategies (operators) by which we can traverse/search
  95      * this AM.  Zero if AM does not have a fixed set of strategy assignments.
  96      */
  97     uint16      amstrategies;
  98     /* total number of support functions that this AM uses */
  99     uint16      amsupport;
 100     /* opclass options support function number or 0 */
 101     uint16      amoptsprocnum;
 102     /* does AM support ORDER BY indexed column's value? */
 103     bool        amcanorder;
 104     /* does AM support ORDER BY result of an operator on indexed column? */
 105     bool        amcanorderbyop;
 106     /* does AM support backward scanning? */
 107     bool        amcanbackward;
 108     /* does AM support UNIQUE indexes? */
 109     bool        amcanunique;
 110     /* does AM support multi-column indexes? */
 111     bool        amcanmulticol;
 112     /* does AM require scans to have a constraint on the first index column? */
 113     bool        amoptionalkey;
 114     /* does AM handle ScalarArrayOpExpr quals? */
 115     bool        amsearcharray;
 116     /* does AM handle IS NULL/IS NOT NULL quals? */
 117     bool        amsearchnulls;
 118     /* can index storage data type differ from column data type? */
 119     bool        amstorage;
 120     /* can an index of this type be clustered on? */
 121     bool        amclusterable;
 122     /* does AM handle predicate locks? */
 123     bool        ampredlocks;
 124     /* does AM support parallel scan? */
 125     bool        amcanparallel;
 126     /* does AM support parallel build? */
 127     bool        amcanbuildparallel;
 128     /* does AM support columns included with clause INCLUDE? */
 129     bool        amcaninclude;
 130     /* does AM use maintenance_work_mem? */
 131     bool        amusemaintenanceworkmem;
 132     /* does AM summarize tuples, with at least all tuples in the block
 133      * summarized in one summary */
 134     bool        amsummarizing;
 135     /* OR of parallel vacuum flags */
 136     uint8       amparallelvacuumoptions;
 137     /* type of data stored in index, or InvalidOid if variable */
 138     Oid         amkeytype;
 139
 140     /* interface functions */
 141     ambuild_function ambuild;
 142     ambuildempty_function ambuildempty;
 143     aminsert_function aminsert;
 144     aminsertcleanup_function aminsertcleanup;
 145     ambulkdelete_function ambulkdelete;
 146     amvacuumcleanup_function amvacuumcleanup;
 147     amcanreturn_function amcanreturn;   /* can be NULL */
 148     amcostestimate_function amcostestimate;
 149     amgettreeheight_function amgettreeheight;   /* can be NULL */
 150     amoptions_function amoptions;
 151     amproperty_function amproperty;     /* can be NULL */
 152     ambuildphasename_function ambuildphasename;   /* can be NULL */
 153     amvalidate_function amvalidate;
 154     amadjustmembers_function amadjustmembers; /* can be NULL */
 155     ambeginscan_function ambeginscan;
 156     amrescan_function amrescan;
 157     amgettuple_function amgettuple;     /* can be NULL */
 158     amgetbitmap_function amgetbitmap;   /* can be NULL */
 159     amendscan_function amendscan;
 160     ammarkpos_function ammarkpos;       /* can be NULL */
 161     amrestrpos_function amrestrpos;     /* can be NULL */
 162
 163     /* interface functions to support parallel index scans */
 164     amestimateparallelscan_function amestimateparallelscan;    /* can be NULL */
 165     aminitparallelscan_function aminitparallelscan;    /* can be NULL */
 166     amparallelrescan_function amparallelrescan;    /* can be NULL */
 167 } IndexAmRoutine;
 168 </programlisting>
 169   </para>
 170
 171   <para>
 172    To be useful, an index access method must also have one or more
 173    <firstterm>operator families</firstterm> and
 174    <firstterm>operator classes</firstterm> defined in
 175    <link linkend="catalog-pg-opfamily"><structname>pg_opfamily</structname></link>,
 176    <link linkend="catalog-pg-opclass"><structname>pg_opclass</structname></link>,
 177    <link linkend="catalog-pg-amop"><structname>pg_amop</structname></link>, and
 178    <link linkend="catalog-pg-amproc"><structname>pg_amproc</structname></link>.
 179    These entries allow the planner
 180    to determine what kinds of query qualifications can be used with
 181    indexes of this access method.  Operator families and classes are described
 182    in <xref linkend="xindex"/>, which is prerequisite material for reading
 183    this chapter.
 184   </para>
 185
 186   <para>
 187    An individual index is defined by a
 188    <link linkend="catalog-pg-class"><structname>pg_class</structname></link>
 189    entry that describes it as a physical relation, plus a
 190    <link linkend="catalog-pg-index"><structname>pg_index</structname></link>
 191    entry that shows the logical content of the index &mdash; that is, the set
 192    of index columns it has and the semantics of those columns, as captured by
 193    the associated operator classes.  The index columns (key values) can be
 194    either simple columns of the underlying table or expressions over the table
 195    rows.  The index access method normally has no interest in where the index
 196    key values come from (it is always handed precomputed key values) but it
 197    will be very interested in the operator class information in
 198    <structname>pg_index</structname>.  Both of these catalog entries can be
 199    accessed as part of the <structname>Relation</structname> data structure that is
 200    passed to all operations on the index.
 201   </para>
 202
 203   <para>
 204    Some of the flag fields of <structname>IndexAmRoutine</structname> have nonobvious
 205    implications.  The requirements of <structfield>amcanunique</structfield>
 206    are discussed in <xref linkend="index-unique-checks"/>.
 207    The <structfield>amcanmulticol</structfield> flag asserts that the
 208    access method supports multi-key-column indexes, while
 209    <structfield>amoptionalkey</structfield> asserts that it allows scans
 210    where no indexable restriction clause is given for the first index column.
 211    When <structfield>amcanmulticol</structfield> is false,
 212    <structfield>amoptionalkey</structfield> essentially says whether the
 213    access method supports full-index scans without any restriction clause.
 214    Access methods that support multiple index columns <emphasis>must</emphasis>
 215    support scans that omit restrictions on any or all of the columns after
 216    the first; however they are permitted to require some restriction to
 217    appear for the first index column, and this is signaled by setting
 218    <structfield>amoptionalkey</structfield> false.
 219    One reason that an index <acronym>AM</acronym> might set
 220    <structfield>amoptionalkey</structfield> false is if it doesn't index
 221    null values.  Since most indexable operators are
 222    strict and hence cannot return true for null inputs,
 223    it is at first sight attractive to not store index entries for null values:
 224    they could never be returned by an index scan anyway.  However, this
 225    argument fails when an index scan has no restriction clause for a given
 226    index column.  In practice this means that
 227    indexes that have <structfield>amoptionalkey</structfield> true must
 228    index nulls, since the planner might decide to use such an index
 229    with no scan keys at all.  A related restriction is that an index
 230    access method that supports multiple index columns <emphasis>must</emphasis>
 231    support indexing null values in columns after the first, because the planner
 232    will assume the index can be used for queries that do not restrict
 233    these columns.  For example, consider an index on (a,b) and a query with
 234    <literal>WHERE a = 4</literal>.  The system will assume the index can be
 235    used to scan for rows with <literal>a = 4</literal>, which is wrong if the
 236    index omits rows where <literal>b</literal> is null.
 237    It is, however, OK to omit rows where the first indexed column is null.
 238    An index access method that does index nulls may also set
 239    <structfield>amsearchnulls</structfield>, indicating that it supports
 240    <literal>IS NULL</literal> and <literal>IS NOT NULL</literal> clauses as search
 241    conditions.
 242   </para>
 243
 244   <para>
 245    The <structfield>amcaninclude</structfield> flag indicates whether the
 246    access method supports <quote>included</quote> columns, that is it can
 247    store (without processing) additional columns beyond the key column(s).
 248    The requirements of the preceding paragraph apply only to the key
 249    columns.  In particular, the combination
 250    of <structfield>amcanmulticol</structfield>=<literal>false</literal>
 251    and <structfield>amcaninclude</structfield>=<literal>true</literal> is
 252    sensible: it means that there can only be one key column, but there can
 253    also be included column(s).  Also, included columns must be allowed to be
 254    null, independently of <structfield>amoptionalkey</structfield>.
 255   </para>
 256
 257   <para>
 258    The <structfield>amsummarizing</structfield> flag indicates whether the
 259    access method summarizes the indexed tuples, with summarizing granularity
 260    of at least per block.
 261    Access methods that do not point to individual tuples, but to block ranges
 262    (like <acronym>BRIN</acronym>), may allow the <acronym>HOT</acronym> optimization
 263    to continue. This does not apply to attributes referenced in index
 264    predicates, an update of such an attribute always disables <acronym>HOT</acronym>.
 265   </para>
 266
 267  </sect1>
 268
 269  <sect1 id="index-functions">
 270   <title>Index Access Method Functions</title>
 271
 272   <para>
 273    The index construction and maintenance functions that an index access
 274    method must provide in <structname>IndexAmRoutine</structname> are:
 275   </para>
 276
 277   <para>
 278 <programlisting>
 279 IndexBuildResult *
 280 ambuild (Relation heapRelation,
 281          Relation indexRelation,
 282          IndexInfo *indexInfo);
 283 </programlisting>
 284    Build a new index.  The index relation has been physically created,
 285    but is empty.  It must be filled in with whatever fixed data the
 286    access method requires, plus entries for all tuples already existing
 287    in the table.  Ordinarily the <function>ambuild</function> function will call
 288    <function>table_index_build_scan()</function> to scan the table for existing tuples
 289    and compute the keys that need to be inserted into the index.
 290    The function must return a palloc'd struct containing statistics about
 291    the new index.
 292    The <structfield>amcanbuildparallel</structfield> flag indicates whether
 293    the access method supports parallel index builds. When set to <literal>true</literal>,
 294    the system will attempt to allocate parallel workers for the build.
 295    Access methods supporting only non-parallel index builds should leave
 296    this flag set to <literal>false</literal>.
 297   </para>
 298
 299   <para>
 300 <programlisting>
 301 void
 302 ambuildempty (Relation indexRelation);
 303 </programlisting>
 304    Build an empty index, and write it to the initialization fork (<symbol>INIT_FORKNUM</symbol>)
 305    of the given relation.  This method is called only for unlogged indexes; the
 306    empty index written to the initialization fork will be copied over the main
 307    relation fork on each server restart.
 308   </para>
 309
 310   <para>
 311 <programlisting>
 312 bool
 313 aminsert (Relation indexRelation,
 314           Datum *values,
 315           bool *isnull,
 316           ItemPointer heap_tid,
 317           Relation heapRelation,
 318           IndexUniqueCheck checkUnique,
 319           bool indexUnchanged,
 320           IndexInfo *indexInfo);
 321 </programlisting>
 322    Insert a new tuple into an existing index.  The <literal>values</literal> and
 323    <literal>isnull</literal> arrays give the key values to be indexed, and
 324    <literal>heap_tid</literal> is the TID to be indexed.
 325    If the access method supports unique indexes (its
 326    <structfield>amcanunique</structfield> flag is true) then
 327    <literal>checkUnique</literal> indicates the type of uniqueness check to
 328    perform.  This varies depending on whether the unique constraint is
 329    deferrable; see <xref linkend="index-unique-checks"/> for details.
 330    Normally the access method only needs the <literal>heapRelation</literal>
 331    parameter when performing uniqueness checking (since then it will have to
 332    look into the heap to verify tuple liveness).
 333   </para>
 334
 335   <para>
 336    The <literal>indexUnchanged</literal> Boolean value gives a hint
 337    about the nature of the tuple to be indexed.  When it is true,
 338    the tuple is a duplicate of some existing tuple in the index.  The
 339    new tuple is a logically unchanged successor MVCC tuple version.  This
 340    happens when an <command>UPDATE</command> takes place that does not
 341    modify any columns covered by the index, but nevertheless requires a
 342    new version in the index.  The index AM may use this hint to decide
 343    to apply bottom-up index deletion in parts of the index where many
 344    versions of the same logical row accumulate.  Note that updating a non-key
 345    column or a column that only appears in a partial index predicate does not
 346    affect the value of <literal>indexUnchanged</literal>.  The core code
 347    determines each tuple's <literal>indexUnchanged</literal> value using a low
 348    overhead approach that allows both false positives and false negatives.
 349    Index AMs must not treat <literal>indexUnchanged</literal> as an
 350    authoritative source of information about tuple visibility or versioning.
 351   </para>
 352
 353   <para>
 354    The function's Boolean result value is significant only when
 355    <literal>checkUnique</literal> is <literal>UNIQUE_CHECK_PARTIAL</literal>.
 356    In this case a true result means the new entry is known unique, whereas
 357    false means it might be non-unique (and a deferred uniqueness check must
 358    be scheduled).  For other cases a constant false result is recommended.
 359   </para>
 360
 361   <para>
 362    Some indexes might not index all tuples.  If the tuple is not to be
 363    indexed, <function>aminsert</function> should just return without doing anything.
 364   </para>
 365
 366   <para>
 367    If the index AM wishes to cache data across successive index insertions
 368    within an SQL statement, it can allocate space
 369    in <literal>indexInfo-&gt;ii_Context</literal> and store a pointer to the
 370    data in <literal>indexInfo-&gt;ii_AmCache</literal> (which will be NULL
 371    initially).  If resources other than memory have to be released after
 372    index insertions, <function>aminsertcleanup</function> may be provided,
 373    which will be called before the memory is released.
 374   </para>
 375
 376   <para>
 377 <programlisting>
 378 void
 379 aminsertcleanup (Relation indexRelation,
 380                  IndexInfo *indexInfo);
 381 </programlisting>
 382    Clean up state that was maintained across successive inserts in
 383    <literal>indexInfo-&gt;ii_AmCache</literal>. This is useful if the data
 384    requires additional cleanup steps (e.g., releasing pinned buffers), and
 385    simply releasing the memory is not sufficient.
 386   </para>
 387
 388   <para>
 389 <programlisting>
 390 IndexBulkDeleteResult *
 391 ambulkdelete (IndexVacuumInfo *info,
 392               IndexBulkDeleteResult *stats,
 393               IndexBulkDeleteCallback callback,
 394               void *callback_state);
 395 </programlisting>
 396    Delete tuple(s) from the index.  This is a <quote>bulk delete</quote> operation
 397    that is intended to be implemented by scanning the whole index and checking
 398    each entry to see if it should be deleted.
 399    The passed-in <literal>callback</literal> function must be called, in the style
 400    <literal>callback(<replaceable>TID</replaceable>, callback_state) returns bool</literal>,
 401    to determine whether any particular index entry, as identified by its
 402    referenced TID, is to be deleted.  Must return either NULL or a palloc'd
 403    struct containing statistics about the effects of the deletion operation.
 404    It is OK to return NULL if no information needs to be passed on to
 405    <function>amvacuumcleanup</function>.
 406   </para>
 407
 408   <para>
 409    Because of limited <varname>maintenance_work_mem</varname>,
 410    <function>ambulkdelete</function> might need to be called more than once when many
 411    tuples are to be deleted.  The <literal>stats</literal> argument is the result
 412    of the previous call for this index (it is NULL for the first call within a
 413    <command>VACUUM</command> operation).  This allows the AM to accumulate statistics
 414    across the whole operation.  Typically, <function>ambulkdelete</function> will
 415    modify and return the same struct if the passed <literal>stats</literal> is not
 416    null.
 417   </para>
 418
 419   <para>
 420 <programlisting>
 421 IndexBulkDeleteResult *
 422 amvacuumcleanup (IndexVacuumInfo *info,
 423                  IndexBulkDeleteResult *stats);
 424 </programlisting>
 425    Clean up after a <command>VACUUM</command> operation (zero or more
 426    <function>ambulkdelete</function> calls).  This does not have to do anything
 427    beyond returning index statistics, but it might perform bulk cleanup
 428    such as reclaiming empty index pages.  <literal>stats</literal> is whatever the
 429    last <function>ambulkdelete</function> call returned, or NULL if
 430    <function>ambulkdelete</function> was not called because no tuples needed to be
 431    deleted.  If the result is not NULL it must be a palloc'd struct.
 432    The statistics it contains will be used to update <structname>pg_class</structname>,
 433    and will be reported by <command>VACUUM</command> if <literal>VERBOSE</literal> is given.
 434    It is OK to return NULL if the index was not changed at all during the
 435    <command>VACUUM</command> operation, but otherwise correct stats should
 436    be returned.
 437   </para>
 438
 439   <para>
 440    <function>amvacuumcleanup</function> will also be called at completion of an
 441    <command>ANALYZE</command> operation.  In this case <literal>stats</literal> is always
 442    NULL and any return value will be ignored.  This case can be distinguished
 443    by checking <literal>info-&gt;analyze_only</literal>.  It is recommended
 444    that the access method do nothing except post-insert cleanup in such a
 445    call, and that only in an autovacuum worker process.
 446   </para>
 447
 448   <para>
 449 <programlisting>
 450 bool
 451 amcanreturn (Relation indexRelation, int attno);
 452 </programlisting>
 453    Check whether the index can support <link
 454    linkend="indexes-index-only-scans"><firstterm>index-only scans</firstterm></link> on
 455    the given column, by returning the column's original indexed value.
 456    The attribute number is 1-based, i.e., the first column's attno is 1.
 457    Returns true if supported, else false.
 458    This function should always return true for included columns
 459    (if those are supported), since there's little point in an included
 460    column that can't be retrieved.
 461    If the access method does not support index-only scans at all,
 462    the <structfield>amcanreturn</structfield> field in its <structname>IndexAmRoutine</structname>
 463    struct can be set to NULL.
 464   </para>
 465
 466   <para>
 467 <programlisting>
 468 void
 469 amcostestimate (PlannerInfo *root,
 470                 IndexPath *path,
 471                 double loop_count,
 472                 Cost *indexStartupCost,
 473                 Cost *indexTotalCost,
 474                 Selectivity *indexSelectivity,
 475                 double *indexCorrelation,
 476                 double *indexPages);
 477 </programlisting>
 478    Estimate the costs of an index scan.  This function is described fully
 479    in <xref linkend="index-cost-estimation"/>, below.
 480   </para>
 481
 482   <para>
 483 <programlisting>
 484 int
 485 amgettreeheight (Relation rel);
 486 </programlisting>
 487    Compute the height of a tree-shaped index.  This information is supplied to
 488    the <function>amcostestimate</function> function in
 489    <literal>path->indexinfo->tree_height</literal> and can be used to support
 490    the cost estimation.  The result is not used anywhere else, so this
 491    function can actually be used to compute any kind of data (that fits into
 492    an integer) about the index that the cost estimation function might want to
 493    know.  If the computation is expensive, it could be useful to cache the
 494    result as part of <literal>RelationData.rd_amcache</literal>.
 495   </para>
 496
 497   <para>
 498 <programlisting>
 499 bytea *
 500 amoptions (ArrayType *reloptions,
 501            bool validate);
 502 </programlisting>
 503    Parse and validate the reloptions array for an index.  This is called only
 504    when a non-null reloptions array exists for the index.
 505    <parameter>reloptions</parameter> is a <type>text</type> array containing entries of the
 506    form <replaceable>name</replaceable><literal>=</literal><replaceable>value</replaceable>.
 507    The function should construct a <type>bytea</type> value, which will be copied
 508    into the <structfield>rd_options</structfield> field of the index's relcache entry.
 509    The data contents of the <type>bytea</type> value are open for the access
 510    method to define; most of the standard access methods use struct
 511    <structname>StdRdOptions</structname>.
 512    When <parameter>validate</parameter> is true, the function should report a suitable
 513    error message if any of the options are unrecognized or have invalid
 514    values; when <parameter>validate</parameter> is false, invalid entries should be
 515    silently ignored.  (<parameter>validate</parameter> is false when loading options
 516    already stored in <structname>pg_catalog</structname>; an invalid entry could only
 517    be found if the access method has changed its rules for options, and in
 518    that case ignoring obsolete entries is appropriate.)
 519    It is OK to return NULL if default behavior is wanted.
 520   </para>
 521
 522   <para>
 523 <programlisting>
 524 bool
 525 amproperty (Oid index_oid, int attno,
 526             IndexAMProperty prop, const char *propname,
 527             bool *res, bool *isnull);
 528 </programlisting>
 529    The <function>amproperty</function> method allows index access methods to override
 530    the default behavior of <function>pg_index_column_has_property</function>
 531    and related functions.
 532    If the access method does not have any special behavior for index property
 533    inquiries, the <structfield>amproperty</structfield> field in
 534    its <structname>IndexAmRoutine</structname> struct can be set to NULL.
 535    Otherwise, the <function>amproperty</function> method will be called with
 536    <parameter>index_oid</parameter> and <parameter>attno</parameter> both zero for
 537    <function>pg_indexam_has_property</function> calls,
 538    or with <parameter>index_oid</parameter> valid and <parameter>attno</parameter> zero for
 539    <function>pg_index_has_property</function> calls,
 540    or with <parameter>index_oid</parameter> valid and <parameter>attno</parameter> greater than
 541    zero for <function>pg_index_column_has_property</function> calls.
 542    <parameter>prop</parameter> is an enum value identifying the property being tested,
 543    while <parameter>propname</parameter> is the original property name string.
 544    If the core code does not recognize the property name
 545    then <parameter>prop</parameter> is <literal>AMPROP_UNKNOWN</literal>.
 546    Access methods can define custom property names by
 547    checking <parameter>propname</parameter> for a match (use <function>pg_strcasecmp</function>
 548    to match, for consistency with the core code); for names known to the core
 549    code, it's better to inspect <parameter>prop</parameter>.
 550    If the <structfield>amproperty</structfield> method returns <literal>true</literal> then
 551    it has determined the property test result: it must set <literal>*res</literal>
 552    to the Boolean value to return, or set <literal>*isnull</literal>
 553    to <literal>true</literal> to return a NULL.  (Both of the referenced variables
 554    are initialized to <literal>false</literal> before the call.)
 555    If the <structfield>amproperty</structfield> method returns <literal>false</literal> then
 556    the core code will proceed with its normal logic for determining the
 557    property test result.
 558   </para>
 559
 560   <para>
 561    Access methods that support ordering operators should
 562    implement <literal>AMPROP_DISTANCE_ORDERABLE</literal> property testing, as the
 563    core code does not know how to do that and will return NULL.  It may
 564    also be advantageous to implement <literal>AMPROP_RETURNABLE</literal> testing,
 565    if that can be done more cheaply than by opening the index and calling
 566    <function>amcanreturn</function>, which is the core code's default behavior.
 567    The default behavior should be satisfactory for all other standard
 568    properties.
 569   </para>
 570
 571   <para>
 572 <programlisting>
 573 char *
 574 ambuildphasename (int64 phasenum);
 575 </programlisting>
 576    Return the textual name of the given build phase number.
 577    The phase numbers are those reported during an index build via the
 578    <function>pgstat_progress_update_param</function> interface.
 579    The phase names are then exposed in the
 580    <structname>pg_stat_progress_create_index</structname> view.
 581   </para>
 582
 583   <para>
 584 <programlisting>
 585 bool
 586 amvalidate (Oid opclassoid);
 587 </programlisting>
 588    Validate the catalog entries for the specified operator class, so far as
 589    the access method can reasonably do that.  For example, this might include
 590    testing that all required support functions are provided.
 591    The <function>amvalidate</function> function must return false if the opclass is
 592    invalid.  Problems should be reported with <function>ereport</function>
 593    messages, typically at <literal>INFO</literal> level.
 594   </para>
 595
 596   <para>
 597 <programlisting>
 598 void
 599 amadjustmembers (Oid opfamilyoid,
 600                  Oid opclassoid,
 601                  List *operators,
 602                  List *functions);
 603 </programlisting>
 604    Validate proposed new operator and function members of an operator family,
 605    so far as the access method can reasonably do that, and set their
 606    dependency types if the default is not satisfactory.  This is called
 607    during <command>CREATE OPERATOR CLASS</command> and during
 608    <command>ALTER OPERATOR FAMILY ADD</command>; in the latter
 609    case <parameter>opclassoid</parameter> is <literal>InvalidOid</literal>.
 610    The <type>List</type> arguments are lists
 611    of <structname>OpFamilyMember</structname> structs, as defined
 612    in <filename>amapi.h</filename>.
 613
 614    Tests done by this function will typically be a subset of those
 615    performed by <function>amvalidate</function>,
 616    since <function>amadjustmembers</function> cannot assume that it is
 617    seeing a complete set of members.  For example, it would be reasonable
 618    to check the signature of a support function, but not to check whether
 619    all required support functions are provided.  Any problems can be
 620    reported by throwing an error.
 621
 622    The dependency-related fields of
 623    the <structname>OpFamilyMember</structname> structs are initialized by
 624    the core code to create hard dependencies on the opclass if this
 625    is <command>CREATE OPERATOR CLASS</command>, or soft dependencies on the
 626    opfamily if this is <command>ALTER OPERATOR FAMILY ADD</command>.
 627    <function>amadjustmembers</function> can adjust these fields if some other
 628    behavior is more appropriate.  For example, GIN, GiST, and SP-GiST
 629    always set operator members to have soft dependencies on the opfamily,
 630    since the connection between an operator and an opclass is relatively
 631    weak in these index types; so it is reasonable to allow operator members
 632    to be added and removed freely.  Optional support functions are typically
 633    also given soft dependencies, so that they can be removed if necessary.
 634   </para>
 635
 636
 637   <para>
 638    The purpose of an index, of course, is to support scans for tuples matching
 639    an indexable <literal>WHERE</literal> condition, often called a
 640    <firstterm>qualifier</firstterm> or <firstterm>scan key</firstterm>.  The semantics of
 641    index scanning are described more fully in <xref linkend="index-scanning"/>,
 642    below.  An index access method can support <quote>plain</quote> index scans,
 643    <quote>bitmap</quote> index scans, or both.  The scan-related functions that an
 644    index access method must or may provide are:
 645   </para>
 646
 647   <para>
 648 <programlisting>
 649 IndexScanDesc
 650 ambeginscan (Relation indexRelation,
 651              int nkeys,
 652              int norderbys);
 653 </programlisting>
 654    Prepare for an index scan.  The <literal>nkeys</literal> and <literal>norderbys</literal>
 655    parameters indicate the number of quals and ordering operators that will be
 656    used in the scan; these may be useful for space allocation purposes.
 657    Note that the actual values of the scan keys aren't provided yet.
 658    The result must be a palloc'd struct.
 659    For implementation reasons the index access method
 660    <emphasis>must</emphasis> create this struct by calling
 661    <function>RelationGetIndexScan()</function>.  In most cases
 662    <function>ambeginscan</function> does little beyond making that call and perhaps
 663    acquiring locks;
 664    the interesting parts of index-scan startup are in <function>amrescan</function>.
 665   </para>
 666
 667   <para>
 668 <programlisting>
 669 void
 670 amrescan (IndexScanDesc scan,
 671           ScanKey keys,
 672           int nkeys,
 673           ScanKey orderbys,
 674           int norderbys);
 675 </programlisting>
 676    Start or restart an index scan, possibly with new scan keys.  (To restart
 677    using previously-passed keys, NULL is passed for <literal>keys</literal> and/or
 678    <literal>orderbys</literal>.)  Note that it is not allowed for
 679    the number of keys or order-by operators to be larger than
 680    what was passed to <function>ambeginscan</function>.  In practice the restart
 681    feature is used when a new outer tuple is selected by a nested-loop join
 682    and so a new key comparison value is needed, but the scan key structure
 683    remains the same.
 684   </para>
 685
 686   <para>
 687 <programlisting>
 688 bool
 689 amgettuple (IndexScanDesc scan,
 690             ScanDirection direction);
 691 </programlisting>
 692    Fetch the next tuple in the given scan, moving in the given
 693    direction (forward or backward in the index).  Returns true if a tuple was
 694    obtained, false if no matching tuples remain.  In the true case the tuple
 695    TID is stored into the <literal>scan</literal> structure.  Note that
 696    <quote>success</quote> means only that the index contains an entry that matches
 697    the scan keys, not that the tuple necessarily still exists in the heap or
 698    will pass the caller's snapshot test.  On success, <function>amgettuple</function>
 699    must also set <literal>scan-&gt;xs_recheck</literal> to true or false.
 700    False means it is certain that the index entry matches the scan keys.
 701    True means this is not certain, and the conditions represented by the
 702    scan keys must be rechecked against the heap tuple after fetching it.
 703    This provision supports <quote>lossy</quote> index operators.
 704    Note that rechecking will extend only to the scan conditions; a partial
 705    index predicate (if any) is never rechecked by <function>amgettuple</function>
 706    callers.
 707   </para>
 708
 709   <para>
 710    If the index supports <link linkend="indexes-index-only-scans">index-only
 711    scans</link> (i.e., <function>amcanreturn</function> returns true for any
 712    of its columns),
 713    then on success the AM must also check <literal>scan-&gt;xs_want_itup</literal>,
 714    and if that is true it must return the originally indexed data for the
 715    index entry.  Columns for which <function>amcanreturn</function> returns
 716    false can be returned as nulls.
 717    The data can be returned in the form of an
 718    <structname>IndexTuple</structname> pointer stored at <literal>scan-&gt;xs_itup</literal>,
 719    with tuple descriptor <literal>scan-&gt;xs_itupdesc</literal>; or in the form of
 720    a <structname>HeapTuple</structname> pointer stored at <literal>scan-&gt;xs_hitup</literal>,
 721    with tuple descriptor <literal>scan-&gt;xs_hitupdesc</literal>.  (The latter
 722    format should be used when reconstructing data that might possibly not fit
 723    into an <structname>IndexTuple</structname>.)  In either case,
 724    management of the data referenced by the pointer is the access method's
 725    responsibility.  The data must remain good at least until the next
 726    <function>amgettuple</function>, <function>amrescan</function>, or <function>amendscan</function>
 727    call for the scan.
 728   </para>
 729
 730   <para>
 731    The <function>amgettuple</function> function need only be provided if the access
 732    method supports <quote>plain</quote> index scans.  If it doesn't, the
 733    <structfield>amgettuple</structfield> field in its <structname>IndexAmRoutine</structname>
 734    struct must be set to NULL.
 735   </para>
 736
 737   <para>
 738 <programlisting>
 739 int64
 740 amgetbitmap (IndexScanDesc scan,
 741              TIDBitmap *tbm);
 742 </programlisting>
 743    Fetch all tuples in the given scan and add them to the caller-supplied
 744    <type>TIDBitmap</type> (that is, OR the set of tuple IDs into whatever set is already
 745    in the bitmap).  The number of tuples fetched is returned (this might be
 746    just an approximate count, for instance some AMs do not detect duplicates).
 747    While inserting tuple IDs into the bitmap, <function>amgetbitmap</function> can
 748    indicate that rechecking of the scan conditions is required for specific
 749    tuple IDs.  This is analogous to the <literal>xs_recheck</literal> output parameter
 750    of <function>amgettuple</function>.  Note: in the current implementation, support
 751    for this feature is conflated with support for lossy storage of the bitmap
 752    itself, and therefore callers recheck both the scan conditions and the
 753    partial index predicate (if any) for recheckable tuples.  That might not
 754    always be true, however.
 755    <function>amgetbitmap</function> and
 756    <function>amgettuple</function> cannot be used in the same index scan; there
 757    are other restrictions too when using <function>amgetbitmap</function>, as explained
 758    in <xref linkend="index-scanning"/>.
 759   </para>
 760
 761   <para>
 762    The <function>amgetbitmap</function> function need only be provided if the access
 763    method supports <quote>bitmap</quote> index scans.  If it doesn't, the
 764    <structfield>amgetbitmap</structfield> field in its <structname>IndexAmRoutine</structname>
 765    struct must be set to NULL.
 766   </para>
 767
 768   <para>
 769 <programlisting>
 770 void
 771 amendscan (IndexScanDesc scan);
 772 </programlisting>
 773    End a scan and release resources.  The <literal>scan</literal> struct itself
 774    should not be freed, but any locks or pins taken internally by the
 775    access method must be released, as well as any other memory allocated
 776    by <function>ambeginscan</function> and other scan-related functions.
 777   </para>
 778
 779   <para>
 780 <programlisting>
 781 void
 782 ammarkpos (IndexScanDesc scan);
 783 </programlisting>
 784    Mark current scan position.  The access method need only support one
 785    remembered scan position per scan.
 786   </para>
 787
 788   <para>
 789    The <function>ammarkpos</function> function need only be provided if the access
 790    method supports ordered scans.  If it doesn't,
 791    the <structfield>ammarkpos</structfield> field in its <structname>IndexAmRoutine</structname>
 792    struct may be set to NULL.
 793   </para>
 794
 795   <para>
 796 <programlisting>
 797 void
 798 amrestrpos (IndexScanDesc scan);
 799 </programlisting>
 800    Restore the scan to the most recently marked position.
 801   </para>
 802
 803   <para>
 804    The <function>amrestrpos</function> function need only be provided if the access
 805    method supports ordered scans.  If it doesn't,
 806    the <structfield>amrestrpos</structfield> field in its <structname>IndexAmRoutine</structname>
 807    struct may be set to NULL.
 808   </para>
 809
 810   <para>
 811    In addition to supporting ordinary index scans, some types of index
 812    may wish to support <firstterm>parallel index scans</firstterm>, which allow
 813    multiple backends to cooperate in performing an index scan.  The
 814    index access method should arrange things so that each cooperating
 815    process returns a subset of the tuples that would be performed by
 816    an ordinary, non-parallel index scan, but in such a way that the
 817    union of those subsets is equal to the set of tuples that would be
 818    returned by an ordinary, non-parallel index scan.  Furthermore, while
 819    there need not be any global ordering of tuples returned by a parallel
 820    scan, the ordering of that subset of tuples returned within each
 821    cooperating backend must match the requested ordering.  The following
 822    functions may be implemented to support parallel index scans:
 823   </para>
 824
 825   <para>
 826 <programlisting>
 827 Size
 828 amestimateparallelscan (int nkeys,
 829                         int norderbys);
 830 </programlisting>
 831    Estimate and return the number of bytes of dynamic shared memory which
 832    the access method will be needed to perform a parallel scan.  (This number
 833    is in addition to, not in lieu of, the amount of space needed for
 834    AM-independent data in <structname>ParallelIndexScanDescData</structname>.)
 835   </para>
 836
 837   <para>
 838    The <literal>nkeys</literal> and <literal>norderbys</literal>
 839    parameters indicate the number of quals and ordering operators that will be
 840    used in the scan; the same values will be passed to <function>amrescan</function>.
 841    Note that the actual values of the scan keys aren't provided yet.
 842   </para>
 843
 844   <para>
 845    It is not necessary to implement this function for access methods which
 846    do not support parallel scans or for which the number of additional bytes
 847    of storage required is zero.
 848   </para>
 849
 850   <para>
 851 <programlisting>
 852 void
 853 aminitparallelscan (void *target);
 854 </programlisting>
 855    This function will be called to initialize dynamic shared memory at the
 856    beginning of a parallel scan.  <parameter>target</parameter> will point to at least
 857    the number of bytes previously returned by
 858    <function>amestimateparallelscan</function>, and this function may use that
 859    amount of space to store whatever data it wishes.
 860   </para>
 861
 862   <para>
 863    It is not necessary to implement this function for access methods which
 864    do not support parallel scans or in cases where the shared memory space
 865    required needs no initialization.
 866   </para>
 867
 868   <para>
 869 <programlisting>
 870 void
 871 amparallelrescan (IndexScanDesc scan);
 872 </programlisting>
 873    This function, if implemented, will be called when a parallel index scan
 874    must be restarted.  It should reset any shared state set up by
 875    <function>aminitparallelscan</function> such that the scan will be restarted from
 876    the beginning.
 877   </para>
 878
 879  </sect1>
 880
 881  <sect1 id="index-scanning">
 882   <title>Index Scanning</title>
 883
 884   <para>
 885    In an index scan, the index access method is responsible for regurgitating
 886    the TIDs of all the tuples it has been told about that match the
 887    <firstterm>scan keys</firstterm>.  The access method is <emphasis>not</emphasis> involved in
 888    actually fetching those tuples from the index's parent table, nor in
 889    determining whether they pass the scan's visibility test or other
 890    conditions.
 891   </para>
 892
 893   <para>
 894    A scan key is the internal representation of a <literal>WHERE</literal> clause of
 895    the form <replaceable>index_key</replaceable> <replaceable>operator</replaceable>
 896    <replaceable>constant</replaceable>, where the index key is one of the columns of the
 897    index and the operator is one of the members of the operator family
 898    associated with that index column.  An index scan has zero or more scan
 899    keys, which are implicitly ANDed &mdash; the returned tuples are expected
 900    to satisfy all the indicated conditions.
 901   </para>
 902
 903   <para>
 904    The access method can report that the index is <firstterm>lossy</firstterm>, or
 905    requires rechecks, for a particular query.  This implies that the index
 906    scan will return all the entries that pass the scan key, plus possibly
 907    additional entries that do not.  The core system's index-scan machinery
 908    will then apply the index conditions again to the heap tuple to verify
 909    whether or not it really should be selected.  If the recheck option is not
 910    specified, the index scan must return exactly the set of matching entries.
 911   </para>
 912
 913   <para>
 914    Note that it is entirely up to the access method to ensure that it
 915    correctly finds all and only the entries passing all the given scan keys.
 916    Also, the core system will simply hand off all the <literal>WHERE</literal>
 917    clauses that match the index keys and operator families, without any
 918    semantic analysis to determine whether they are redundant or
 919    contradictory.  As an example, given
 920    <literal>WHERE x &gt; 4 AND x &gt; 14</literal> where <literal>x</literal> is a b-tree
 921    indexed column, it is left to the b-tree <function>amrescan</function> function
 922    to realize that the first scan key is redundant and can be discarded.
 923    The extent of preprocessing needed during <function>amrescan</function> will
 924    depend on the extent to which the index access method needs to reduce
 925    the scan keys to a <quote>normalized</quote> form.
 926   </para>
 927
 928   <para>
 929    Some access methods return index entries in a well-defined order, others
 930    do not.  There are actually two different ways that an access method can
 931    support sorted output:
 932
 933     <itemizedlist>
 934      <listitem>
 935       <para>
 936        Access methods that always return entries in the natural ordering
 937        of their data (such as btree) should set
 938        <structfield>amcanorder</structfield> to true.
 939        Currently, such access methods must use btree-compatible strategy
 940        numbers for their equality and ordering operators.
 941       </para>
 942      </listitem>
 943      <listitem>
 944       <para>
 945        Access methods that support ordering operators should set
 946        <structfield>amcanorderbyop</structfield> to true.
 947        This indicates that the index is capable of returning entries in
 948        an order satisfying <literal>ORDER BY</literal> <replaceable>index_key</replaceable>
 949        <replaceable>operator</replaceable> <replaceable>constant</replaceable>.  Scan modifiers
 950        of that form can be passed to <function>amrescan</function> as described
 951        previously.
 952       </para>
 953      </listitem>
 954     </itemizedlist>
 955   </para>
 956
 957   <para>
 958    The <function>amgettuple</function> function has a <literal>direction</literal> argument,
 959    which can be either <literal>ForwardScanDirection</literal> (the normal case)
 960    or  <literal>BackwardScanDirection</literal>.  If the first call after
 961    <function>amrescan</function> specifies <literal>BackwardScanDirection</literal>, then the
 962    set of matching index entries is to be scanned back-to-front rather than in
 963    the normal front-to-back direction, so <function>amgettuple</function> must return
 964    the last matching tuple in the index, rather than the first one as it
 965    normally would.  (This will only occur for access
 966    methods that set <structfield>amcanorder</structfield> to true.)  After the
 967    first call, <function>amgettuple</function> must be prepared to advance the scan in
 968    either direction from the most recently returned entry.  (But if
 969    <structfield>amcanbackward</structfield> is false, all subsequent
 970    calls will have the same direction as the first one.)
 971   </para>
 972
 973   <para>
 974    Access methods that support ordered scans must support <quote>marking</quote> a
 975    position in a scan and later returning to the marked position.  The same
 976    position might be restored multiple times.  However, only one position need
 977    be remembered per scan; a new <function>ammarkpos</function> call overrides the
 978    previously marked position.  An access method that does not support ordered
 979    scans need not provide <function>ammarkpos</function> and <function>amrestrpos</function>
 980    functions in <structname>IndexAmRoutine</structname>; set those pointers to NULL
 981    instead.
 982   </para>
 983
 984   <para>
 985    Both the scan position and the mark position (if any) must be maintained
 986    consistently in the face of concurrent insertions or deletions in the
 987    index.  It is OK if a freshly-inserted entry is not returned by a scan that
 988    would have found the entry if it had existed when the scan started, or for
 989    the scan to return such an entry upon rescanning or backing
 990    up even though it had not been returned the first time through.  Similarly,
 991    a concurrent delete might or might not be reflected in the results of a scan.
 992    What is important is that insertions or deletions not cause the scan to
 993    miss or multiply return entries that were not themselves being inserted or
 994    deleted.
 995   </para>
 996
 997   <para>
 998    If the index stores the original indexed data values (and not some lossy
 999    representation of them), it is useful to
1000    support <link linkend="indexes-index-only-scans">index-only scans</link>, in
1001    which the index returns the actual data not just the TID of the heap tuple.
1002    This will only avoid I/O if the visibility map shows that the TID is on an
1003    all-visible page; else the heap tuple must be visited anyway to check
1004    MVCC visibility.  But that is no concern of the access method's.
1005   </para>
1006
1007   <para>
1008    Instead of using <function>amgettuple</function>, an index scan can be done with
1009    <function>amgetbitmap</function> to fetch all tuples in one call.  This can be
1010    noticeably more efficient than <function>amgettuple</function> because it allows
1011    avoiding lock/unlock cycles within the access method.  In principle
1012    <function>amgetbitmap</function> should have the same effects as repeated
1013    <function>amgettuple</function> calls, but we impose several restrictions to
1014    simplify matters.  First of all, <function>amgetbitmap</function> returns all
1015    tuples at once and marking or restoring scan positions isn't
1016    supported. Secondly, the tuples are returned in a bitmap which doesn't
1017    have any specific ordering, which is why <function>amgetbitmap</function> doesn't
1018    take a <literal>direction</literal> argument.  (Ordering operators will never be
1019    supplied for such a scan, either.)
1020    Also, there is no provision for index-only scans with
1021    <function>amgetbitmap</function>, since there is no way to return the contents of
1022    index tuples.
1023    Finally, <function>amgetbitmap</function>
1024    does not guarantee any locking of the returned tuples, with implications
1025    spelled out in <xref linkend="index-locking"/>.
1026   </para>
1027
1028   <para>
1029    Note that it is permitted for an access method to implement only
1030    <function>amgetbitmap</function> and not <function>amgettuple</function>, or vice versa,
1031    if its internal implementation is unsuited to one API or the other.
1032   </para>
1033
1034  </sect1>
1035
1036  <sect1 id="index-locking">
1037   <title>Index Locking Considerations</title>
1038
1039   <para>
1040    Index access methods must handle concurrent updates
1041    of the index by multiple processes.
1042    The core <productname>PostgreSQL</productname> system obtains
1043    <literal>AccessShareLock</literal> on the index during an index scan, and
1044    <literal>RowExclusiveLock</literal> when updating the index (including plain
1045    <command>VACUUM</command>).  Since these lock types do not conflict, the access
1046    method is responsible for handling any fine-grained locking it might need.
1047    An <literal>ACCESS EXCLUSIVE</literal> lock on the index as a whole will be
1048    taken only during index creation, destruction, or <command>REINDEX</command>
1049    (<literal>SHARE UPDATE EXCLUSIVE</literal> is taken instead with
1050    <literal>CONCURRENTLY</literal>).
1051   </para>
1052
1053   <para>
1054    Building an index type that supports concurrent updates usually requires
1055    extensive and subtle analysis of the required behavior.  For the b-tree
1056    and hash index types, you can read about the design decisions involved in
1057    <filename>src/backend/access/nbtree/README</filename> and
1058    <filename>src/backend/access/hash/README</filename>.
1059   </para>
1060
1061   <para>
1062    Aside from the index's own internal consistency requirements, concurrent
1063    updates create issues about consistency between the parent table (the
1064    <firstterm>heap</firstterm>) and the index.  Because
1065    <productname>PostgreSQL</productname> separates accesses
1066    and updates of the heap from those of the index, there are windows in
1067    which the index might be inconsistent with the heap.  We handle this problem
1068    with the following rules:
1069
1070     <itemizedlist>
1071      <listitem>
1072       <para>
1073        A new heap entry is made before making its index entries.  (Therefore
1074        a concurrent index scan is likely to fail to see the heap entry.
1075        This is okay because the index reader would be uninterested in an
1076        uncommitted row anyway.  But see <xref linkend="index-unique-checks"/>.)
1077       </para>
1078      </listitem>
1079      <listitem>
1080       <para>
1081        When a heap entry is to be deleted (by <command>VACUUM</command>), all its
1082        index entries must be removed first.
1083       </para>
1084      </listitem>
1085      <listitem>
1086       <para>
1087        An index scan must maintain a pin
1088        on the index page holding the item last returned by
1089        <function>amgettuple</function>, and <function>ambulkdelete</function> cannot delete
1090        entries from pages that are pinned by other backends.  The need
1091        for this rule is explained below.
1092       </para>
1093      </listitem>
1094     </itemizedlist>
1095
1096    Without the third rule, it is possible for an index reader to
1097    see an index entry just before it is removed by <command>VACUUM</command>, and
1098    then to arrive at the corresponding heap entry after that was removed by
1099    <command>VACUUM</command>.
1100    This creates no serious problems if that item
1101    number is still unused when the reader reaches it, since an empty
1102    item slot will be ignored by <function>heap_fetch()</function>.  But what if a
1103    third backend has already re-used the item slot for something else?
1104    When using an MVCC-compliant snapshot, there is no problem because
1105    the new occupant of the slot is certain to be too new to pass the
1106    snapshot test.  However, with a non-MVCC-compliant snapshot (such as
1107    <literal>SnapshotAny</literal>), it would be possible to accept and return
1108    a row that does not in fact match the scan keys.  We could defend
1109    against this scenario by requiring the scan keys to be rechecked
1110    against the heap row in all cases, but that is too expensive.  Instead,
1111    we use a pin on an index page as a proxy to indicate that the reader
1112    might still be <quote>in flight</quote> from the index entry to the matching
1113    heap entry.  Making <function>ambulkdelete</function> block on such a pin ensures
1114    that <command>VACUUM</command> cannot delete the heap entry before the reader
1115    is done with it.  This solution costs little in run time, and adds blocking
1116    overhead only in the rare cases where there actually is a conflict.
1117   </para>
1118
1119   <para>
1120    This solution requires that index scans be <quote>synchronous</quote>: we have
1121    to fetch each heap tuple immediately after scanning the corresponding index
1122    entry.  This is expensive for a number of reasons.  An
1123    <quote>asynchronous</quote> scan in which we collect many TIDs from the index,
1124    and only visit the heap tuples sometime later, requires much less index
1125    locking overhead and can allow a more efficient heap access pattern.
1126    Per the above analysis, we must use the synchronous approach for
1127    non-MVCC-compliant snapshots, but an asynchronous scan is workable
1128    for a query using an MVCC snapshot.
1129   </para>
1130
1131   <para>
1132    In an <function>amgetbitmap</function> index scan, the access method does not
1133    keep an index pin on any of the returned tuples.  Therefore
1134    it is only safe to use such scans with MVCC-compliant snapshots.
1135   </para>
1136
1137   <para>
1138    When the <structfield>ampredlocks</structfield> flag is not set, any scan using that
1139    index access method within a serializable transaction will acquire a
1140    nonblocking predicate lock on the full index.  This will generate a
1141    read-write conflict with the insert of any tuple into that index by a
1142    concurrent serializable transaction.  If certain patterns of read-write
1143    conflicts are detected among a set of concurrent serializable
1144    transactions, one of those transactions may be canceled to protect data
1145    integrity.  When the flag is set, it indicates that the index access
1146    method implements finer-grained predicate locking, which will tend to
1147    reduce the frequency of such transaction cancellations.
1148   </para>
1149
1150  </sect1>
1151
1152  <sect1 id="index-unique-checks">
1153   <title>Index Uniqueness Checks</title>
1154
1155   <para>
1156    <productname>PostgreSQL</productname> enforces SQL uniqueness constraints
1157    using <firstterm>unique indexes</firstterm>, which are indexes that disallow
1158    multiple entries with identical keys.  An access method that supports this
1159    feature sets <structfield>amcanunique</structfield> true.
1160    (At present, only b-tree supports it.)  Columns listed in the
1161    <literal>INCLUDE</literal> clause are not considered when enforcing
1162    uniqueness.
1163   </para>
1164
1165   <para>
1166    Because of MVCC, it is always necessary to allow duplicate entries to
1167    exist physically in an index: the entries might refer to successive
1168    versions of a single logical row.  The behavior we actually want to
1169    enforce is that no MVCC snapshot could include two rows with equal
1170    index keys.  This breaks down into the following cases that must be
1171    checked when inserting a new row into a unique index:
1172
1173     <itemizedlist>
1174      <listitem>
1175       <para>
1176        If a conflicting valid row has been deleted by the current transaction,
1177        it's okay.  (In particular, since an UPDATE always deletes the old row
1178        version before inserting the new version, this will allow an UPDATE on
1179        a row without changing the key.)
1180       </para>
1181      </listitem>
1182      <listitem>
1183       <para>
1184        If a conflicting row has been inserted by an as-yet-uncommitted
1185        transaction, the would-be inserter must wait to see if that transaction
1186        commits.  If it rolls back then there is no conflict.  If it commits
1187        without deleting the conflicting row again, there is a uniqueness
1188        violation.  (In practice we just wait for the other transaction to
1189        end and then redo the visibility check in toto.)
1190       </para>
1191      </listitem>
1192      <listitem>
1193       <para>
1194        Similarly, if a conflicting valid row has been deleted by an
1195        as-yet-uncommitted transaction, the would-be inserter must wait
1196        for that transaction to commit or abort, and then repeat the test.
1197       </para>
1198      </listitem>
1199     </itemizedlist>
1200   </para>
1201
1202   <para>
1203    Furthermore, immediately before reporting a uniqueness violation
1204    according to the above rules, the access method must recheck the
1205    liveness of the row being inserted.  If it is committed dead then
1206    no violation should be reported.  (This case cannot occur during the
1207    ordinary scenario of inserting a row that's just been created by
1208    the current transaction.  It can happen during
1209    <command>CREATE UNIQUE INDEX CONCURRENTLY</command>, however.)
1210   </para>
1211
1212   <para>
1213    We require the index access method to apply these tests itself, which
1214    means that it must reach into the heap to check the commit status of
1215    any row that is shown to have a duplicate key according to the index
1216    contents.  This is without a doubt ugly and non-modular, but it saves
1217    redundant work: if we did a separate probe then the index lookup for
1218    a conflicting row would be essentially repeated while finding the place to
1219    insert the new row's index entry.  What's more, there is no obvious way
1220    to avoid race conditions unless the conflict check is an integral part
1221    of insertion of the new index entry.
1222   </para>
1223
1224   <para>
1225    If the unique constraint is deferrable, there is additional complexity:
1226    we need to be able to insert an index entry for a new row, but defer any
1227    uniqueness-violation error until end of statement or even later.  To
1228    avoid unnecessary repeat searches of the index, the index access method
1229    should do a preliminary uniqueness check during the initial insertion.
1230    If this shows that there is definitely no conflicting live tuple, we
1231    are done.  Otherwise, we schedule a recheck to occur when it is time to
1232    enforce the constraint.  If, at the time of the recheck, both the inserted
1233    tuple and some other tuple with the same key are live, then the error
1234    must be reported.  (Note that for this purpose, <quote>live</quote> actually
1235    means <quote>any tuple in the index entry's HOT chain is live</quote>.)
1236    To implement this, the <function>aminsert</function> function is passed a
1237    <literal>checkUnique</literal> parameter having one of the following values:
1238
1239     <itemizedlist>
1240      <listitem>
1241       <para>
1242        <literal>UNIQUE_CHECK_NO</literal> indicates that no uniqueness checking
1243        should be done (this is not a unique index).
1244       </para>
1245      </listitem>
1246      <listitem>
1247       <para>
1248        <literal>UNIQUE_CHECK_YES</literal> indicates that this is a non-deferrable
1249        unique index, and the uniqueness check must be done immediately, as
1250        described above.
1251       </para>
1252      </listitem>
1253      <listitem>
1254       <para>
1255        <literal>UNIQUE_CHECK_PARTIAL</literal> indicates that the unique
1256        constraint is deferrable. <productname>PostgreSQL</productname>
1257        will use this mode to insert each row's index entry.  The access
1258        method must allow duplicate entries into the index, and report any
1259        potential duplicates by returning false from <function>aminsert</function>.
1260        For each row for which false is returned, a deferred recheck will
1261        be scheduled.
1262       </para>
1263
1264       <para>
1265        The access method must identify any rows which might violate the
1266        unique constraint, but it is not an error for it to report false
1267        positives. This allows the check to be done without waiting for other
1268        transactions to finish; conflicts reported here are not treated as
1269        errors and will be rechecked later, by which time they may no longer
1270        be conflicts.
1271       </para>
1272      </listitem>
1273      <listitem>
1274       <para>
1275        <literal>UNIQUE_CHECK_EXISTING</literal> indicates that this is a deferred
1276        recheck of a row that was reported as a potential uniqueness violation.
1277        Although this is implemented by calling <function>aminsert</function>, the
1278        access method must <emphasis>not</emphasis> insert a new index entry in this
1279        case.  The index entry is already present.  Rather, the access method
1280        must check to see if there is another live index entry.  If so, and
1281        if the target row is also still live, report error.
1282       </para>
1283
1284       <para>
1285        It is recommended that in a <literal>UNIQUE_CHECK_EXISTING</literal> call,
1286        the access method further verify that the target row actually does
1287        have an existing entry in the index, and report error if not.  This
1288        is a good idea because the index tuple values passed to
1289        <function>aminsert</function> will have been recomputed.  If the index
1290        definition involves functions that are not really immutable, we
1291        might be checking the wrong area of the index.  Checking that the
1292        target row is found in the recheck verifies that we are scanning
1293        for the same tuple values as were used in the original insertion.
1294       </para>
1295      </listitem>
1296     </itemizedlist>
1297   </para>
1298
1299  </sect1>
1300
1301  <sect1 id="index-cost-estimation">
1302   <title>Index Cost Estimation Functions</title>
1303
1304   <para>
1305    The <function>amcostestimate</function> function is given information describing
1306    a possible index scan, including lists of WHERE and ORDER BY clauses that
1307    have been determined to be usable with the index.  It must return estimates
1308    of the cost of accessing the index and the selectivity of the WHERE
1309    clauses (that is, the fraction of parent-table rows that will be
1310    retrieved during the index scan).  For simple cases, nearly all the
1311    work of the cost estimator can be done by calling standard routines
1312    in the optimizer; the point of having an <function>amcostestimate</function> function is
1313    to allow index access methods to provide index-type-specific knowledge,
1314    in case it is possible to improve on the standard estimates.
1315   </para>
1316
1317   <para>
1318    Each <function>amcostestimate</function> function must have the signature:
1319
1320 <programlisting>
1321 void
1322 amcostestimate (PlannerInfo *root,
1323                 IndexPath *path,
1324                 double loop_count,
1325                 Cost *indexStartupCost,
1326                 Cost *indexTotalCost,
1327                 Selectivity *indexSelectivity,
1328                 double *indexCorrelation,
1329                 double *indexPages);
1330 </programlisting>
1331
1332    The first three parameters are inputs:
1333
1334    <variablelist>
1335     <varlistentry>
1336      <term><parameter>root</parameter></term>
1337      <listitem>
1338       <para>
1339        The planner's information about the query being processed.
1340       </para>
1341      </listitem>
1342     </varlistentry>
1343
1344     <varlistentry>
1345      <term><parameter>path</parameter></term>
1346      <listitem>
1347       <para>
1348        The index access path being considered.  All fields except cost and
1349        selectivity values are valid.
1350       </para>
1351      </listitem>
1352     </varlistentry>
1353
1354     <varlistentry>
1355      <term><parameter>loop_count</parameter></term>
1356      <listitem>
1357       <para>
1358        The number of repetitions of the index scan that should be factored
1359        into the cost estimates.  This will typically be greater than one when
1360        considering a parameterized scan for use in the inside of a nestloop
1361        join.  Note that the cost estimates should still be for just one scan;
1362        a larger <parameter>loop_count</parameter> means that it may be appropriate
1363        to allow for some caching effects across multiple scans.
1364       </para>
1365      </listitem>
1366     </varlistentry>
1367    </variablelist>
1368   </para>
1369
1370   <para>
1371    The last five parameters are pass-by-reference outputs:
1372
1373    <variablelist>
1374     <varlistentry>
1375      <term><parameter>*indexStartupCost</parameter></term>
1376      <listitem>
1377       <para>
1378        Set to cost of index start-up processing
1379       </para>
1380      </listitem>
1381     </varlistentry>
1382
1383     <varlistentry>
1384      <term><parameter>*indexTotalCost</parameter></term>
1385      <listitem>
1386       <para>
1387        Set to total cost of index processing
1388       </para>
1389      </listitem>
1390     </varlistentry>
1391
1392     <varlistentry>
1393      <term><parameter>*indexSelectivity</parameter></term>
1394      <listitem>
1395       <para>
1396        Set to index selectivity
1397       </para>
1398      </listitem>
1399     </varlistentry>
1400
1401     <varlistentry>
1402      <term><parameter>*indexCorrelation</parameter></term>
1403      <listitem>
1404       <para>
1405        Set to correlation coefficient between index scan order and
1406        underlying table's order
1407       </para>
1408      </listitem>
1409     </varlistentry>
1410
1411     <varlistentry>
1412      <term><parameter>*indexPages</parameter></term>
1413      <listitem>
1414       <para>
1415        Set to number of index leaf pages
1416       </para>
1417      </listitem>
1418     </varlistentry>
1419    </variablelist>
1420   </para>
1421
1422   <para>
1423    Note that cost estimate functions must be written in C, not in SQL or
1424    any available procedural language, because they must access internal
1425    data structures of the planner/optimizer.
1426   </para>
1427
1428   <para>
1429    The index access costs should be computed using the parameters used by
1430    <filename>src/backend/optimizer/path/costsize.c</filename>: a sequential
1431    disk block fetch has cost <varname>seq_page_cost</varname>, a nonsequential fetch
1432    has cost <varname>random_page_cost</varname>, and the cost of processing one index
1433    row should usually be taken as <varname>cpu_index_tuple_cost</varname>.  In
1434    addition, an appropriate multiple of <varname>cpu_operator_cost</varname> should
1435    be charged for any comparison operators invoked during index processing
1436    (especially evaluation of the indexquals themselves).
1437   </para>
1438
1439   <para>
1440    The access costs should include all disk and CPU costs associated with
1441    scanning the index itself, but <emphasis>not</emphasis> the costs of retrieving or
1442    processing the parent-table rows that are identified by the index.
1443   </para>
1444
1445   <para>
1446    The <quote>start-up cost</quote> is the part of the total scan cost that
1447    must be expended before we can begin to fetch the first row.  For most
1448    indexes this can be taken as zero, but an index type with a high start-up
1449    cost might want to set it nonzero.
1450   </para>
1451
1452   <para>
1453    The <parameter>indexSelectivity</parameter> should be set to the estimated fraction of the parent
1454    table rows that will be retrieved during the index scan.  In the case
1455    of a lossy query, this will typically be higher than the fraction of
1456    rows that actually pass the given qual conditions.
1457   </para>
1458
1459   <para>
1460    The <parameter>indexCorrelation</parameter> should be set to the correlation (ranging between
1461    -1.0 and 1.0) between the index order and the table order.  This is used
1462    to adjust the estimate for the cost of fetching rows from the parent
1463    table.
1464   </para>
1465
1466   <para>
1467    The <parameter>indexPages</parameter> should be set to the number of leaf pages.
1468    This is used to estimate the number of workers for parallel index scan.
1469   </para>
1470
1471   <para>
1472    When <parameter>loop_count</parameter> is greater than one, the returned numbers
1473    should be averages expected for any one scan of the index.
1474   </para>
1475
1476   <procedure>
1477    <title>Cost Estimation</title>
1478    <para>
1479     A typical cost estimator will proceed as follows:
1480    </para>
1481
1482    <step>
1483     <para>
1484      Estimate and return the fraction of parent-table rows that will be visited
1485      based on the given qual conditions.  In the absence of any index-type-specific
1486      knowledge, use the standard optimizer function <function>clauselist_selectivity()</function>:
1487
1488 <programlisting>
1489 *indexSelectivity = clauselist_selectivity(root, path-&gt;indexquals,
1490                                            path-&gt;indexinfo-&gt;rel-&gt;relid,
1491                                            JOIN_INNER, NULL);
1492 </programlisting>
1493     </para>
1494    </step>
1495
1496    <step>
1497     <para>
1498      Estimate the number of index rows that will be visited during the
1499      scan.  For many index types this is the same as <parameter>indexSelectivity</parameter> times
1500      the number of rows in the index, but it might be more.  (Note that the
1501      index's size in pages and rows is available from the
1502      <literal>path-&gt;indexinfo</literal> struct.)
1503     </para>
1504    </step>
1505
1506    <step>
1507     <para>
1508      Estimate the number of index pages that will be retrieved during the scan.
1509      This might be just <parameter>indexSelectivity</parameter> times the index's size in pages.
1510     </para>
1511    </step>
1512
1513    <step>
1514     <para>
1515      Compute the index access cost.  A generic estimator might do this:
1516
1517 <programlisting>
1518 /*
1519  * Our generic assumption is that the index pages will be read
1520  * sequentially, so they cost seq_page_cost each, not random_page_cost.
1521  * Also, we charge for evaluation of the indexquals at each index row.
1522  * All the costs are assumed to be paid incrementally during the scan.
1523  */
1524 cost_qual_eval(&amp;index_qual_cost, path-&gt;indexquals, root);
1525 *indexStartupCost = index_qual_cost.startup;
1526 *indexTotalCost = seq_page_cost * numIndexPages +
1527     (cpu_index_tuple_cost + index_qual_cost.per_tuple) * numIndexTuples;
1528 </programlisting>
1529
1530      However, the above does not account for amortization of index reads
1531      across repeated index scans.
1532     </para>
1533    </step>
1534
1535    <step>
1536     <para>
1537      Estimate the index correlation.  For a simple ordered index on a single
1538      field, this can be retrieved from pg_statistic.  If the correlation
1539      is not known, the conservative estimate is zero (no correlation).
1540     </para>
1541    </step>
1542   </procedure>
1543
1544   <para>
1545    Examples of cost estimator functions can be found in
1546    <filename>src/backend/utils/adt/selfuncs.c</filename>.
1547   </para>
1548  </sect1>
1549 </chapter>