doc/src/sgml/indexam.sgml

   1 <!-- doc/src/sgml/indexam.sgml -->
   2
   3 <chapter id="indexam">
   4  <title>Index Access Method Interface Definition</title>
   5
   6  <indexterm>
   7   <primary>Index Access Method</primary>
   8  </indexterm>
   9  <indexterm>
  10   <primary>indexam</primary>
  11   <secondary>Index Access Method</secondary>
  12  </indexterm>
  13
  14   <para>
  15    This chapter defines the interface between the core
  16    <productname>PostgreSQL</productname> system and <firstterm>index access
  17    methods</firstterm>, which manage individual index types.  The core system
  18    knows nothing about indexes beyond what is specified here, so it is
  19    possible to develop entirely new index types by writing add-on code.
  20   </para>
  21
  22   <para>
  23    All indexes in <productname>PostgreSQL</productname> are what are known
  24    technically as <firstterm>secondary indexes</firstterm>; that is, the index is
  25    physically separate from the table file that it describes.  Each index
  26    is stored as its own physical <firstterm>relation</firstterm> and so is described
  27    by an entry in the <structname>pg_class</structname> catalog.  The contents of an
  28    index are entirely under the control of its index access method.  In
  29    practice, all index access methods divide indexes into standard-size
  30    pages so that they can use the regular storage manager and buffer manager
  31    to access the index contents.  (All the existing index access methods
  32    furthermore use the standard page layout described in <xref
  33    linkend="storage-page-layout"/>, and most use the same format for index
  34    tuple headers; but these decisions are not forced on an access method.)
  35   </para>
  36
  37   <para>
  38    An index is effectively a mapping from some data key values to
  39    <firstterm>tuple identifiers</firstterm>, or <acronym>TIDs</acronym>, of row versions
  40    (tuples) in the index's parent table.  A TID consists of a
  41    block number and an item number within that block (see <xref
  42    linkend="storage-page-layout"/>).  This is sufficient
  43    information to fetch a particular row version from the table.
  44    Indexes are not directly aware that under MVCC, there might be multiple
  45    extant versions of the same logical row; to an index, each tuple is
  46    an independent object that needs its own index entry.  Thus, an
  47    update of a row always creates all-new index entries for the row, even if
  48    the key values did not change.  (HOT tuples are an exception to this
  49    statement; but indexes do not deal with those, either.)  Index entries for
  50    dead tuples are reclaimed (by vacuuming) when the dead tuples themselves
  51    are reclaimed.
  52   </para>
  53
  54  <sect1 id="index-api">
  55   <title>Basic API Structure for Indexes</title>
  56
  57   <para>
  58    Each index access method is described by a row in the
  59    <link linkend="catalog-pg-am"><structname>pg_am</structname></link>
  60    system catalog.  The <structname>pg_am</structname> entry
  61    specifies a name and a <firstterm>handler function</firstterm> for the index
  62    access method.  These entries can be created and deleted using the
  63    <xref linkend="sql-create-access-method"/> and
  64    <xref linkend="sql-drop-access-method"/> SQL commands.
  65   </para>
  66
  67   <para>
  68    An index access method handler function must be declared to accept a
  69    single argument of type <type>internal</type> and to return the
  70    pseudo-type <type>index_am_handler</type>.  The argument is a dummy value that
  71    simply serves to prevent handler functions from being called directly from
  72    SQL commands.  The result of the function must be a palloc'd struct of
  73    type <structname>IndexAmRoutine</structname>, which contains everything
  74    that the core code needs to know to make use of the index access method.
  75    The <structname>IndexAmRoutine</structname> struct, also called the access
  76    method's <firstterm>API struct</firstterm>, includes fields specifying assorted
  77    fixed properties of the access method, such as whether it can support
  78    multicolumn indexes.  More importantly, it contains pointers to support
  79    functions for the access method, which do all of the real work to access
  80    indexes.  These support functions are plain C functions and are not
  81    visible or callable at the SQL level.  The support functions are described
  82    in <xref linkend="index-functions"/>.
  83   </para>
  84
  85   <para>
  86    The structure <structname>IndexAmRoutine</structname> is defined thus:
  87 <programlisting>
  88 typedef struct IndexAmRoutine
  89 {
  90     NodeTag     type;
  91
  92     /*
  93      * Total number of strategies (operators) by which we can traverse/search
  94      * this AM.  Zero if AM does not have a fixed set of strategy assignments.
  95      */
  96     uint16      amstrategies;
  97     /* total number of support functions that this AM uses */
  98     uint16      amsupport;
  99     /* opclass options support function number or 0 */
 100     uint16      amoptsprocnum;
 101     /* does AM support ORDER BY indexed column's value? */
 102     bool        amcanorder;
 103     /* does AM support ORDER BY result of an operator on indexed column? */
 104     bool        amcanorderbyop;
 105     /* does AM support backward scanning? */
 106     bool        amcanbackward;
 107     /* does AM support UNIQUE indexes? */
 108     bool        amcanunique;
 109     /* does AM support multi-column indexes? */
 110     bool        amcanmulticol;
 111     /* does AM require scans to have a constraint on the first index column? */
 112     bool        amoptionalkey;
 113     /* does AM handle ScalarArrayOpExpr quals? */
 114     bool        amsearcharray;
 115     /* does AM handle IS NULL/IS NOT NULL quals? */
 116     bool        amsearchnulls;
 117     /* can index storage data type differ from column data type? */
 118     bool        amstorage;
 119     /* can an index of this type be clustered on? */
 120     bool        amclusterable;
 121     /* does AM handle predicate locks? */
 122     bool        ampredlocks;
 123     /* does AM support parallel scan? */
 124     bool        amcanparallel;
 125     /* does AM support columns included with clause INCLUDE? */
 126     bool        amcaninclude;
 127     /* does AM use maintenance_work_mem? */
 128     bool        amusemaintenanceworkmem;
 129     /* OR of parallel vacuum flags */
 130     uint8       amparallelvacuumoptions;
 131     /* type of data stored in index, or InvalidOid if variable */
 132     Oid         amkeytype;
 133
 134     /* interface functions */
 135     ambuild_function ambuild;
 136     ambuildempty_function ambuildempty;
 137     aminsert_function aminsert;
 138     ambulkdelete_function ambulkdelete;
 139     amvacuumcleanup_function amvacuumcleanup;
 140     amcanreturn_function amcanreturn;   /* can be NULL */
 141     amcostestimate_function amcostestimate;
 142     amoptions_function amoptions;
 143     amproperty_function amproperty;     /* can be NULL */
 144     ambuildphasename_function ambuildphasename;   /* can be NULL */
 145     amvalidate_function amvalidate;
 146     amadjustmembers_function amadjustmembers; /* can be NULL */
 147     ambeginscan_function ambeginscan;
 148     amrescan_function amrescan;
 149     amgettuple_function amgettuple;     /* can be NULL */
 150     amgetbitmap_function amgetbitmap;   /* can be NULL */
 151     amendscan_function amendscan;
 152     ammarkpos_function ammarkpos;       /* can be NULL */
 153     amrestrpos_function amrestrpos;     /* can be NULL */
 154
 155     /* interface functions to support parallel index scans */
 156     amestimateparallelscan_function amestimateparallelscan;    /* can be NULL */
 157     aminitparallelscan_function aminitparallelscan;    /* can be NULL */
 158     amparallelrescan_function amparallelrescan;    /* can be NULL */
 159 } IndexAmRoutine;
 160 </programlisting>
 161   </para>
 162
 163   <para>
 164    To be useful, an index access method must also have one or more
 165    <firstterm>operator families</firstterm> and
 166    <firstterm>operator classes</firstterm> defined in
 167    <link linkend="catalog-pg-opfamily"><structname>pg_opfamily</structname></link>,
 168    <link linkend="catalog-pg-opclass"><structname>pg_opclass</structname></link>,
 169    <link linkend="catalog-pg-amop"><structname>pg_amop</structname></link>, and
 170    <link linkend="catalog-pg-amproc"><structname>pg_amproc</structname></link>.
 171    These entries allow the planner
 172    to determine what kinds of query qualifications can be used with
 173    indexes of this access method.  Operator families and classes are described
 174    in <xref linkend="xindex"/>, which is prerequisite material for reading
 175    this chapter.
 176   </para>
 177
 178   <para>
 179    An individual index is defined by a
 180    <link linkend="catalog-pg-class"><structname>pg_class</structname></link>
 181    entry that describes it as a physical relation, plus a
 182    <link linkend="catalog-pg-index"><structname>pg_index</structname></link>
 183    entry that shows the logical content of the index &mdash; that is, the set
 184    of index columns it has and the semantics of those columns, as captured by
 185    the associated operator classes.  The index columns (key values) can be
 186    either simple columns of the underlying table or expressions over the table
 187    rows.  The index access method normally has no interest in where the index
 188    key values come from (it is always handed precomputed key values) but it
 189    will be very interested in the operator class information in
 190    <structname>pg_index</structname>.  Both of these catalog entries can be
 191    accessed as part of the <structname>Relation</structname> data structure that is
 192    passed to all operations on the index.
 193   </para>
 194
 195   <para>
 196    Some of the flag fields of <structname>IndexAmRoutine</structname> have nonobvious
 197    implications.  The requirements of <structfield>amcanunique</structfield>
 198    are discussed in <xref linkend="index-unique-checks"/>.
 199    The <structfield>amcanmulticol</structfield> flag asserts that the
 200    access method supports multi-key-column indexes, while
 201    <structfield>amoptionalkey</structfield> asserts that it allows scans
 202    where no indexable restriction clause is given for the first index column.
 203    When <structfield>amcanmulticol</structfield> is false,
 204    <structfield>amoptionalkey</structfield> essentially says whether the
 205    access method supports full-index scans without any restriction clause.
 206    Access methods that support multiple index columns <emphasis>must</emphasis>
 207    support scans that omit restrictions on any or all of the columns after
 208    the first; however they are permitted to require some restriction to
 209    appear for the first index column, and this is signaled by setting
 210    <structfield>amoptionalkey</structfield> false.
 211    One reason that an index AM might set
 212    <structfield>amoptionalkey</structfield> false is if it doesn't index
 213    null values.  Since most indexable operators are
 214    strict and hence cannot return true for null inputs,
 215    it is at first sight attractive to not store index entries for null values:
 216    they could never be returned by an index scan anyway.  However, this
 217    argument fails when an index scan has no restriction clause for a given
 218    index column.  In practice this means that
 219    indexes that have <structfield>amoptionalkey</structfield> true must
 220    index nulls, since the planner might decide to use such an index
 221    with no scan keys at all.  A related restriction is that an index
 222    access method that supports multiple index columns <emphasis>must</emphasis>
 223    support indexing null values in columns after the first, because the planner
 224    will assume the index can be used for queries that do not restrict
 225    these columns.  For example, consider an index on (a,b) and a query with
 226    <literal>WHERE a = 4</literal>.  The system will assume the index can be
 227    used to scan for rows with <literal>a = 4</literal>, which is wrong if the
 228    index omits rows where <literal>b</literal> is null.
 229    It is, however, OK to omit rows where the first indexed column is null.
 230    An index access method that does index nulls may also set
 231    <structfield>amsearchnulls</structfield>, indicating that it supports
 232    <literal>IS NULL</literal> and <literal>IS NOT NULL</literal> clauses as search
 233    conditions.
 234   </para>
 235
 236   <para>
 237    The <structfield>amcaninclude</structfield> flag indicates whether the
 238    access method supports <quote>included</quote> columns, that is it can
 239    store (without processing) additional columns beyond the key column(s).
 240    The requirements of the preceding paragraph apply only to the key
 241    columns.  In particular, the combination
 242    of <structfield>amcanmulticol</structfield>=<literal>false</literal>
 243    and <structfield>amcaninclude</structfield>=<literal>true</literal> is
 244    sensible: it means that there can only be one key column, but there can
 245    also be included column(s).  Also, included columns must be allowed to be
 246    null, independently of <structfield>amoptionalkey</structfield>.
 247   </para>
 248
 249  </sect1>
 250
 251  <sect1 id="index-functions">
 252   <title>Index Access Method Functions</title>
 253
 254   <para>
 255    The index construction and maintenance functions that an index access
 256    method must provide in <structname>IndexAmRoutine</structname> are:
 257   </para>
 258
 259   <para>
 260 <programlisting>
 261 IndexBuildResult *
 262 ambuild (Relation heapRelation,
 263          Relation indexRelation,
 264          IndexInfo *indexInfo);
 265 </programlisting>
 266    Build a new index.  The index relation has been physically created,
 267    but is empty.  It must be filled in with whatever fixed data the
 268    access method requires, plus entries for all tuples already existing
 269    in the table.  Ordinarily the <function>ambuild</function> function will call
 270    <function>table_index_build_scan()</function> to scan the table for existing tuples
 271    and compute the keys that need to be inserted into the index.
 272    The function must return a palloc'd struct containing statistics about
 273    the new index.
 274   </para>
 275
 276   <para>
 277 <programlisting>
 278 void
 279 ambuildempty (Relation indexRelation);
 280 </programlisting>
 281    Build an empty index, and write it to the initialization fork (<symbol>INIT_FORKNUM</symbol>)
 282    of the given relation.  This method is called only for unlogged indexes; the
 283    empty index written to the initialization fork will be copied over the main
 284    relation fork on each server restart.
 285   </para>
 286
 287   <para>
 288 <programlisting>
 289 bool
 290 aminsert (Relation indexRelation,
 291           Datum *values,
 292           bool *isnull,
 293           ItemPointer heap_tid,
 294           Relation heapRelation,
 295           IndexUniqueCheck checkUnique,
 296           bool indexUnchanged,
 297           IndexInfo *indexInfo);
 298 </programlisting>
 299    Insert a new tuple into an existing index.  The <literal>values</literal> and
 300    <literal>isnull</literal> arrays give the key values to be indexed, and
 301    <literal>heap_tid</literal> is the TID to be indexed.
 302    If the access method supports unique indexes (its
 303    <structfield>amcanunique</structfield> flag is true) then
 304    <literal>checkUnique</literal> indicates the type of uniqueness check to
 305    perform.  This varies depending on whether the unique constraint is
 306    deferrable; see <xref linkend="index-unique-checks"/> for details.
 307    Normally the access method only needs the <literal>heapRelation</literal>
 308    parameter when performing uniqueness checking (since then it will have to
 309    look into the heap to verify tuple liveness).
 310   </para>
 311
 312   <para>
 313    The <literal>indexUnchanged</literal> boolean value gives a hint
 314    about the nature of the tuple to be indexed.  When it is true,
 315    the tuple is a duplicate of some existing tuple in the index.  The
 316    new tuple is a logically unchanged successor MVCC tuple version.  This
 317    happens when an <command>UPDATE</command> takes place that does not
 318    modify any columns covered by the index, but nevertheless requires a
 319    new version in the index.  The index AM may use this hint to decide
 320    to apply bottom-up index deletion in parts of the index where many
 321    versions of the same logical row accumulate.  Note that updating a
 322    non-key column does not affect the value of
 323    <literal>indexUnchanged</literal>.
 324   </para>
 325
 326   <para>
 327    The function's Boolean result value is significant only when
 328    <literal>checkUnique</literal> is <literal>UNIQUE_CHECK_PARTIAL</literal>.
 329    In this case a true result means the new entry is known unique, whereas
 330    false means it might be non-unique (and a deferred uniqueness check must
 331    be scheduled).  For other cases a constant false result is recommended.
 332   </para>
 333
 334   <para>
 335    Some indexes might not index all tuples.  If the tuple is not to be
 336    indexed, <function>aminsert</function> should just return without doing anything.
 337   </para>
 338
 339   <para>
 340    If the index AM wishes to cache data across successive index insertions
 341    within a SQL statement, it can allocate space
 342    in <literal>indexInfo-&gt;ii_Context</literal> and store a pointer to the
 343    data in <literal>indexInfo-&gt;ii_AmCache</literal> (which will be NULL
 344    initially).
 345   </para>
 346
 347   <para>
 348 <programlisting>
 349 IndexBulkDeleteResult *
 350 ambulkdelete (IndexVacuumInfo *info,
 351               IndexBulkDeleteResult *stats,
 352               IndexBulkDeleteCallback callback,
 353               void *callback_state);
 354 </programlisting>
 355    Delete tuple(s) from the index.  This is a <quote>bulk delete</quote> operation
 356    that is intended to be implemented by scanning the whole index and checking
 357    each entry to see if it should be deleted.
 358    The passed-in <literal>callback</literal> function must be called, in the style
 359    <literal>callback(<replaceable>TID</replaceable>, callback_state) returns bool</literal>,
 360    to determine whether any particular index entry, as identified by its
 361    referenced TID, is to be deleted.  Must return either NULL or a palloc'd
 362    struct containing statistics about the effects of the deletion operation.
 363    It is OK to return NULL if no information needs to be passed on to
 364    <function>amvacuumcleanup</function>.
 365   </para>
 366
 367   <para>
 368    Because of limited <varname>maintenance_work_mem</varname>,
 369    <function>ambulkdelete</function> might need to be called more than once when many
 370    tuples are to be deleted.  The <literal>stats</literal> argument is the result
 371    of the previous call for this index (it is NULL for the first call within a
 372    <command>VACUUM</command> operation).  This allows the AM to accumulate statistics
 373    across the whole operation.  Typically, <function>ambulkdelete</function> will
 374    modify and return the same struct if the passed <literal>stats</literal> is not
 375    null.
 376   </para>
 377
 378   <para>
 379 <programlisting>
 380 IndexBulkDeleteResult *
 381 amvacuumcleanup (IndexVacuumInfo *info,
 382                  IndexBulkDeleteResult *stats);
 383 </programlisting>
 384    Clean up after a <command>VACUUM</command> operation (zero or more
 385    <function>ambulkdelete</function> calls).  This does not have to do anything
 386    beyond returning index statistics, but it might perform bulk cleanup
 387    such as reclaiming empty index pages.  <literal>stats</literal> is whatever the
 388    last <function>ambulkdelete</function> call returned, or NULL if
 389    <function>ambulkdelete</function> was not called because no tuples needed to be
 390    deleted.  If the result is not NULL it must be a palloc'd struct.
 391    The statistics it contains will be used to update <structname>pg_class</structname>,
 392    and will be reported by <command>VACUUM</command> if <literal>VERBOSE</literal> is given.
 393    It is OK to return NULL if the index was not changed at all during the
 394    <command>VACUUM</command> operation, but otherwise correct stats should
 395    be returned.
 396   </para>
 397
 398   <para>
 399    <function>amvacuumcleanup</function> will also be called at completion of an
 400    <command>ANALYZE</command> operation.  In this case <literal>stats</literal> is always
 401    NULL and any return value will be ignored.  This case can be distinguished
 402    by checking <literal>info-&gt;analyze_only</literal>.  It is recommended
 403    that the access method do nothing except post-insert cleanup in such a
 404    call, and that only in an autovacuum worker process.
 405   </para>
 406
 407   <para>
 408 <programlisting>
 409 bool
 410 amcanreturn (Relation indexRelation, int attno);
 411 </programlisting>
 412    Check whether the index can support <link
 413    linkend="indexes-index-only-scans"><firstterm>index-only scans</firstterm></link> on
 414    the given column, by returning the column's original indexed value.
 415    The attribute number is 1-based, i.e., the first column's attno is 1.
 416    Returns true if supported, else false.
 417    This function should always return true for included columns
 418    (if those are supported), since there's little point in an included
 419    column that can't be retrieved.
 420    If the access method does not support index-only scans at all,
 421    the <structfield>amcanreturn</structfield> field in its <structname>IndexAmRoutine</structname>
 422    struct can be set to NULL.
 423   </para>
 424
 425   <para>
 426 <programlisting>
 427 void
 428 amcostestimate (PlannerInfo *root,
 429                 IndexPath *path,
 430                 double loop_count,
 431                 Cost *indexStartupCost,
 432                 Cost *indexTotalCost,
 433                 Selectivity *indexSelectivity,
 434                 double *indexCorrelation,
 435                 double *indexPages);
 436 </programlisting>
 437    Estimate the costs of an index scan.  This function is described fully
 438    in <xref linkend="index-cost-estimation"/>, below.
 439   </para>
 440
 441   <para>
 442 <programlisting>
 443 bytea *
 444 amoptions (ArrayType *reloptions,
 445            bool validate);
 446 </programlisting>
 447    Parse and validate the reloptions array for an index.  This is called only
 448    when a non-null reloptions array exists for the index.
 449    <parameter>reloptions</parameter> is a <type>text</type> array containing entries of the
 450    form <replaceable>name</replaceable><literal>=</literal><replaceable>value</replaceable>.
 451    The function should construct a <type>bytea</type> value, which will be copied
 452    into the <structfield>rd_options</structfield> field of the index's relcache entry.
 453    The data contents of the <type>bytea</type> value are open for the access
 454    method to define; most of the standard access methods use struct
 455    <structname>StdRdOptions</structname>.
 456    When <parameter>validate</parameter> is true, the function should report a suitable
 457    error message if any of the options are unrecognized or have invalid
 458    values; when <parameter>validate</parameter> is false, invalid entries should be
 459    silently ignored.  (<parameter>validate</parameter> is false when loading options
 460    already stored in <structname>pg_catalog</structname>; an invalid entry could only
 461    be found if the access method has changed its rules for options, and in
 462    that case ignoring obsolete entries is appropriate.)
 463    It is OK to return NULL if default behavior is wanted.
 464   </para>
 465
 466   <para>
 467 <programlisting>
 468 bool
 469 amproperty (Oid index_oid, int attno,
 470             IndexAMProperty prop, const char *propname,
 471             bool *res, bool *isnull);
 472 </programlisting>
 473    The <function>amproperty</function> method allows index access methods to override
 474    the default behavior of <function>pg_index_column_has_property</function>
 475    and related functions.
 476    If the access method does not have any special behavior for index property
 477    inquiries, the <structfield>amproperty</structfield> field in
 478    its <structname>IndexAmRoutine</structname> struct can be set to NULL.
 479    Otherwise, the <function>amproperty</function> method will be called with
 480    <parameter>index_oid</parameter> and <parameter>attno</parameter> both zero for
 481    <function>pg_indexam_has_property</function> calls,
 482    or with <parameter>index_oid</parameter> valid and <parameter>attno</parameter> zero for
 483    <function>pg_index_has_property</function> calls,
 484    or with <parameter>index_oid</parameter> valid and <parameter>attno</parameter> greater than
 485    zero for <function>pg_index_column_has_property</function> calls.
 486    <parameter>prop</parameter> is an enum value identifying the property being tested,
 487    while <parameter>propname</parameter> is the original property name string.
 488    If the core code does not recognize the property name
 489    then <parameter>prop</parameter> is <literal>AMPROP_UNKNOWN</literal>.
 490    Access methods can define custom property names by
 491    checking <parameter>propname</parameter> for a match (use <function>pg_strcasecmp</function>
 492    to match, for consistency with the core code); for names known to the core
 493    code, it's better to inspect <parameter>prop</parameter>.
 494    If the <structfield>amproperty</structfield> method returns <literal>true</literal> then
 495    it has determined the property test result: it must set <literal>*res</literal>
 496    to the boolean value to return, or set <literal>*isnull</literal>
 497    to <literal>true</literal> to return a NULL.  (Both of the referenced variables
 498    are initialized to <literal>false</literal> before the call.)
 499    If the <structfield>amproperty</structfield> method returns <literal>false</literal> then
 500    the core code will proceed with its normal logic for determining the
 501    property test result.
 502   </para>
 503
 504   <para>
 505    Access methods that support ordering operators should
 506    implement <literal>AMPROP_DISTANCE_ORDERABLE</literal> property testing, as the
 507    core code does not know how to do that and will return NULL.  It may
 508    also be advantageous to implement <literal>AMPROP_RETURNABLE</literal> testing,
 509    if that can be done more cheaply than by opening the index and calling
 510    <function>amcanreturn</function>, which is the core code's default behavior.
 511    The default behavior should be satisfactory for all other standard
 512    properties.
 513   </para>
 514
 515   <para>
 516 <programlisting>
 517 char *
 518 ambuildphasename (int64 phasenum);
 519 </programlisting>
 520    Return the textual name of the given build phase number.
 521    The phase numbers are those reported during an index build via the
 522    <function>pgstat_progress_update_param</function> interface.
 523    The phase names are then exposed in the
 524    <structname>pg_stat_progress_create_index</structname> view.
 525   </para>
 526
 527   <para>
 528 <programlisting>
 529 bool
 530 amvalidate (Oid opclassoid);
 531 </programlisting>
 532    Validate the catalog entries for the specified operator class, so far as
 533    the access method can reasonably do that.  For example, this might include
 534    testing that all required support functions are provided.
 535    The <function>amvalidate</function> function must return false if the opclass is
 536    invalid.  Problems should be reported with <function>ereport</function>
 537    messages, typically at <literal>INFO</literal> level.
 538   </para>
 539
 540   <para>
 541 <programlisting>
 542 void
 543 amadjustmembers (Oid opfamilyoid,
 544                  Oid opclassoid,
 545                  List *operators,
 546                  List *functions);
 547 </programlisting>
 548    Validate proposed new operator and function members of an operator family,
 549    so far as the access method can reasonably do that, and set their
 550    dependency types if the default is not satisfactory.  This is called
 551    during <command>CREATE OPERATOR CLASS</command> and during
 552    <command>ALTER OPERATOR FAMILY ADD</command>; in the latter
 553    case <parameter>opclassoid</parameter> is <literal>InvalidOid</literal>.
 554    The <type>List</type> arguments are lists
 555    of <structname>OpFamilyMember</structname> structs, as defined
 556    in <filename>amapi.h</filename>.
 557
 558    Tests done by this function will typically be a subset of those
 559    performed by <function>amvalidate</function>,
 560    since <function>amadjustmembers</function> cannot assume that it is
 561    seeing a complete set of members.  For example, it would be reasonable
 562    to check the signature of a support function, but not to check whether
 563    all required support functions are provided.  Any problems can be
 564    reported by throwing an error.
 565
 566    The dependency-related fields of
 567    the <structname>OpFamilyMember</structname> structs are initialized by
 568    the core code to create hard dependencies on the opclass if this
 569    is <command>CREATE OPERATOR CLASS</command>, or soft dependencies on the
 570    opfamily if this is <command>ALTER OPERATOR FAMILY ADD</command>.
 571    <function>amadjustmembers</function> can adjust these fields if some other
 572    behavior is more appropriate.  For example, GIN, GiST, and SP-GiST
 573    always set operator members to have soft dependencies on the opfamily,
 574    since the connection between an operator and an opclass is relatively
 575    weak in these index types; so it is reasonable to allow operator members
 576    to be added and removed freely.  Optional support functions are typically
 577    also given soft dependencies, so that they can be removed if necessary.
 578   </para>
 579
 580
 581   <para>
 582    The purpose of an index, of course, is to support scans for tuples matching
 583    an indexable <literal>WHERE</literal> condition, often called a
 584    <firstterm>qualifier</firstterm> or <firstterm>scan key</firstterm>.  The semantics of
 585    index scanning are described more fully in <xref linkend="index-scanning"/>,
 586    below.  An index access method can support <quote>plain</quote> index scans,
 587    <quote>bitmap</quote> index scans, or both.  The scan-related functions that an
 588    index access method must or may provide are:
 589   </para>
 590
 591   <para>
 592 <programlisting>
 593 IndexScanDesc
 594 ambeginscan (Relation indexRelation,
 595              int nkeys,
 596              int norderbys);
 597 </programlisting>
 598    Prepare for an index scan.  The <literal>nkeys</literal> and <literal>norderbys</literal>
 599    parameters indicate the number of quals and ordering operators that will be
 600    used in the scan; these may be useful for space allocation purposes.
 601    Note that the actual values of the scan keys aren't provided yet.
 602    The result must be a palloc'd struct.
 603    For implementation reasons the index access method
 604    <emphasis>must</emphasis> create this struct by calling
 605    <function>RelationGetIndexScan()</function>.  In most cases
 606    <function>ambeginscan</function> does little beyond making that call and perhaps
 607    acquiring locks;
 608    the interesting parts of index-scan startup are in <function>amrescan</function>.
 609   </para>
 610
 611   <para>
 612 <programlisting>
 613 void
 614 amrescan (IndexScanDesc scan,
 615           ScanKey keys,
 616           int nkeys,
 617           ScanKey orderbys,
 618           int norderbys);
 619 </programlisting>
 620    Start or restart an index scan, possibly with new scan keys.  (To restart
 621    using previously-passed keys, NULL is passed for <literal>keys</literal> and/or
 622    <literal>orderbys</literal>.)  Note that it is not allowed for
 623    the number of keys or order-by operators to be larger than
 624    what was passed to <function>ambeginscan</function>.  In practice the restart
 625    feature is used when a new outer tuple is selected by a nested-loop join
 626    and so a new key comparison value is needed, but the scan key structure
 627    remains the same.
 628   </para>
 629
 630   <para>
 631 <programlisting>
 632 boolean
 633 amgettuple (IndexScanDesc scan,
 634             ScanDirection direction);
 635 </programlisting>
 636    Fetch the next tuple in the given scan, moving in the given
 637    direction (forward or backward in the index).  Returns true if a tuple was
 638    obtained, false if no matching tuples remain.  In the true case the tuple
 639    TID is stored into the <literal>scan</literal> structure.  Note that
 640    <quote>success</quote> means only that the index contains an entry that matches
 641    the scan keys, not that the tuple necessarily still exists in the heap or
 642    will pass the caller's snapshot test.  On success, <function>amgettuple</function>
 643    must also set <literal>scan-&gt;xs_recheck</literal> to true or false.
 644    False means it is certain that the index entry matches the scan keys.
 645    True means this is not certain, and the conditions represented by the
 646    scan keys must be rechecked against the heap tuple after fetching it.
 647    This provision supports <quote>lossy</quote> index operators.
 648    Note that rechecking will extend only to the scan conditions; a partial
 649    index predicate (if any) is never rechecked by <function>amgettuple</function>
 650    callers.
 651   </para>
 652
 653   <para>
 654    If the index supports <link linkend="indexes-index-only-scans">index-only
 655    scans</link> (i.e., <function>amcanreturn</function> returns true for any
 656    of its columns),
 657    then on success the AM must also check <literal>scan-&gt;xs_want_itup</literal>,
 658    and if that is true it must return the originally indexed data for the
 659    index entry.  Columns for which <function>amcanreturn</function> returns
 660    false can be returned as nulls.
 661    The data can be returned in the form of an
 662    <structname>IndexTuple</structname> pointer stored at <literal>scan-&gt;xs_itup</literal>,
 663    with tuple descriptor <literal>scan-&gt;xs_itupdesc</literal>; or in the form of
 664    a <structname>HeapTuple</structname> pointer stored at <literal>scan-&gt;xs_hitup</literal>,
 665    with tuple descriptor <literal>scan-&gt;xs_hitupdesc</literal>.  (The latter
 666    format should be used when reconstructing data that might possibly not fit
 667    into an <structname>IndexTuple</structname>.)  In either case,
 668    management of the data referenced by the pointer is the access method's
 669    responsibility.  The data must remain good at least until the next
 670    <function>amgettuple</function>, <function>amrescan</function>, or <function>amendscan</function>
 671    call for the scan.
 672   </para>
 673
 674   <para>
 675    The <function>amgettuple</function> function need only be provided if the access
 676    method supports <quote>plain</quote> index scans.  If it doesn't, the
 677    <structfield>amgettuple</structfield> field in its <structname>IndexAmRoutine</structname>
 678    struct must be set to NULL.
 679   </para>
 680
 681   <para>
 682 <programlisting>
 683 int64
 684 amgetbitmap (IndexScanDesc scan,
 685              TIDBitmap *tbm);
 686 </programlisting>
 687    Fetch all tuples in the given scan and add them to the caller-supplied
 688    <type>TIDBitmap</type> (that is, OR the set of tuple IDs into whatever set is already
 689    in the bitmap).  The number of tuples fetched is returned (this might be
 690    just an approximate count, for instance some AMs do not detect duplicates).
 691    While inserting tuple IDs into the bitmap, <function>amgetbitmap</function> can
 692    indicate that rechecking of the scan conditions is required for specific
 693    tuple IDs.  This is analogous to the <literal>xs_recheck</literal> output parameter
 694    of <function>amgettuple</function>.  Note: in the current implementation, support
 695    for this feature is conflated with support for lossy storage of the bitmap
 696    itself, and therefore callers recheck both the scan conditions and the
 697    partial index predicate (if any) for recheckable tuples.  That might not
 698    always be true, however.
 699    <function>amgetbitmap</function> and
 700    <function>amgettuple</function> cannot be used in the same index scan; there
 701    are other restrictions too when using <function>amgetbitmap</function>, as explained
 702    in <xref linkend="index-scanning"/>.
 703   </para>
 704
 705   <para>
 706    The <function>amgetbitmap</function> function need only be provided if the access
 707    method supports <quote>bitmap</quote> index scans.  If it doesn't, the
 708    <structfield>amgetbitmap</structfield> field in its <structname>IndexAmRoutine</structname>
 709    struct must be set to NULL.
 710   </para>
 711
 712   <para>
 713 <programlisting>
 714 void
 715 amendscan (IndexScanDesc scan);
 716 </programlisting>
 717    End a scan and release resources.  The <literal>scan</literal> struct itself
 718    should not be freed, but any locks or pins taken internally by the
 719    access method must be released, as well as any other memory allocated
 720    by <function>ambeginscan</function> and other scan-related functions.
 721   </para>
 722
 723   <para>
 724 <programlisting>
 725 void
 726 ammarkpos (IndexScanDesc scan);
 727 </programlisting>
 728    Mark current scan position.  The access method need only support one
 729    remembered scan position per scan.
 730   </para>
 731
 732   <para>
 733    The <function>ammarkpos</function> function need only be provided if the access
 734    method supports ordered scans.  If it doesn't,
 735    the <structfield>ammarkpos</structfield> field in its <structname>IndexAmRoutine</structname>
 736    struct may be set to NULL.
 737   </para>
 738
 739   <para>
 740 <programlisting>
 741 void
 742 amrestrpos (IndexScanDesc scan);
 743 </programlisting>
 744    Restore the scan to the most recently marked position.
 745   </para>
 746
 747   <para>
 748    The <function>amrestrpos</function> function need only be provided if the access
 749    method supports ordered scans.  If it doesn't,
 750    the <structfield>amrestrpos</structfield> field in its <structname>IndexAmRoutine</structname>
 751    struct may be set to NULL.
 752   </para>
 753
 754   <para>
 755    In addition to supporting ordinary index scans, some types of index
 756    may wish to support <firstterm>parallel index scans</firstterm>, which allow
 757    multiple backends to cooperate in performing an index scan.  The
 758    index access method should arrange things so that each cooperating
 759    process returns a subset of the tuples that would be performed by
 760    an ordinary, non-parallel index scan, but in such a way that the
 761    union of those subsets is equal to the set of tuples that would be
 762    returned by an ordinary, non-parallel index scan.  Furthermore, while
 763    there need not be any global ordering of tuples returned by a parallel
 764    scan, the ordering of that subset of tuples returned within each
 765    cooperating backend must match the requested ordering.  The following
 766    functions may be implemented to support parallel index scans:
 767   </para>
 768
 769   <para>
 770 <programlisting>
 771 Size
 772 amestimateparallelscan (void);
 773 </programlisting>
 774    Estimate and return the number of bytes of dynamic shared memory which
 775    the access method will be needed to perform a parallel scan.  (This number
 776    is in addition to, not in lieu of, the amount of space needed for
 777    AM-independent data in <structname>ParallelIndexScanDescData</structname>.)
 778   </para>
 779
 780   <para>
 781    It is not necessary to implement this function for access methods which
 782    do not support parallel scans or for which the number of additional bytes
 783    of storage required is zero.
 784   </para>
 785
 786   <para>
 787 <programlisting>
 788 void
 789 aminitparallelscan (void *target);
 790 </programlisting>
 791    This function will be called to initialize dynamic shared memory at the
 792    beginning of a parallel scan.  <parameter>target</parameter> will point to at least
 793    the number of bytes previously returned by
 794    <function>amestimateparallelscan</function>, and this function may use that
 795    amount of space to store whatever data it wishes.
 796   </para>
 797
 798   <para>
 799    It is not necessary to implement this function for access methods which
 800    do not support parallel scans or in cases where the shared memory space
 801    required needs no initialization.
 802   </para>
 803
 804   <para>
 805 <programlisting>
 806 void
 807 amparallelrescan (IndexScanDesc scan);
 808 </programlisting>
 809    This function, if implemented, will be called when a parallel index scan
 810    must be restarted.  It should reset any shared state set up by
 811    <function>aminitparallelscan</function> such that the scan will be restarted from
 812    the beginning.
 813   </para>
 814
 815  </sect1>
 816
 817  <sect1 id="index-scanning">
 818   <title>Index Scanning</title>
 819
 820   <para>
 821    In an index scan, the index access method is responsible for regurgitating
 822    the TIDs of all the tuples it has been told about that match the
 823    <firstterm>scan keys</firstterm>.  The access method is <emphasis>not</emphasis> involved in
 824    actually fetching those tuples from the index's parent table, nor in
 825    determining whether they pass the scan's visibility test or other
 826    conditions.
 827   </para>
 828
 829   <para>
 830    A scan key is the internal representation of a <literal>WHERE</literal> clause of
 831    the form <replaceable>index_key</replaceable> <replaceable>operator</replaceable>
 832    <replaceable>constant</replaceable>, where the index key is one of the columns of the
 833    index and the operator is one of the members of the operator family
 834    associated with that index column.  An index scan has zero or more scan
 835    keys, which are implicitly ANDed &mdash; the returned tuples are expected
 836    to satisfy all the indicated conditions.
 837   </para>
 838
 839   <para>
 840    The access method can report that the index is <firstterm>lossy</firstterm>, or
 841    requires rechecks, for a particular query.  This implies that the index
 842    scan will return all the entries that pass the scan key, plus possibly
 843    additional entries that do not.  The core system's index-scan machinery
 844    will then apply the index conditions again to the heap tuple to verify
 845    whether or not it really should be selected.  If the recheck option is not
 846    specified, the index scan must return exactly the set of matching entries.
 847   </para>
 848
 849   <para>
 850    Note that it is entirely up to the access method to ensure that it
 851    correctly finds all and only the entries passing all the given scan keys.
 852    Also, the core system will simply hand off all the <literal>WHERE</literal>
 853    clauses that match the index keys and operator families, without any
 854    semantic analysis to determine whether they are redundant or
 855    contradictory.  As an example, given
 856    <literal>WHERE x &gt; 4 AND x &gt; 14</literal> where <literal>x</literal> is a b-tree
 857    indexed column, it is left to the b-tree <function>amrescan</function> function
 858    to realize that the first scan key is redundant and can be discarded.
 859    The extent of preprocessing needed during <function>amrescan</function> will
 860    depend on the extent to which the index access method needs to reduce
 861    the scan keys to a <quote>normalized</quote> form.
 862   </para>
 863
 864   <para>
 865    Some access methods return index entries in a well-defined order, others
 866    do not.  There are actually two different ways that an access method can
 867    support sorted output:
 868
 869     <itemizedlist>
 870      <listitem>
 871       <para>
 872        Access methods that always return entries in the natural ordering
 873        of their data (such as btree) should set
 874        <structfield>amcanorder</structfield> to true.
 875        Currently, such access methods must use btree-compatible strategy
 876        numbers for their equality and ordering operators.
 877       </para>
 878      </listitem>
 879      <listitem>
 880       <para>
 881        Access methods that support ordering operators should set
 882        <structfield>amcanorderbyop</structfield> to true.
 883        This indicates that the index is capable of returning entries in
 884        an order satisfying <literal>ORDER BY</literal> <replaceable>index_key</replaceable>
 885        <replaceable>operator</replaceable> <replaceable>constant</replaceable>.  Scan modifiers
 886        of that form can be passed to <function>amrescan</function> as described
 887        previously.
 888       </para>
 889      </listitem>
 890     </itemizedlist>
 891   </para>
 892
 893   <para>
 894    The <function>amgettuple</function> function has a <literal>direction</literal> argument,
 895    which can be either <literal>ForwardScanDirection</literal> (the normal case)
 896    or  <literal>BackwardScanDirection</literal>.  If the first call after
 897    <function>amrescan</function> specifies <literal>BackwardScanDirection</literal>, then the
 898    set of matching index entries is to be scanned back-to-front rather than in
 899    the normal front-to-back direction, so <function>amgettuple</function> must return
 900    the last matching tuple in the index, rather than the first one as it
 901    normally would.  (This will only occur for access
 902    methods that set <structfield>amcanorder</structfield> to true.)  After the
 903    first call, <function>amgettuple</function> must be prepared to advance the scan in
 904    either direction from the most recently returned entry.  (But if
 905    <structfield>amcanbackward</structfield> is false, all subsequent
 906    calls will have the same direction as the first one.)
 907   </para>
 908
 909   <para>
 910    Access methods that support ordered scans must support <quote>marking</quote> a
 911    position in a scan and later returning to the marked position.  The same
 912    position might be restored multiple times.  However, only one position need
 913    be remembered per scan; a new <function>ammarkpos</function> call overrides the
 914    previously marked position.  An access method that does not support ordered
 915    scans need not provide <function>ammarkpos</function> and <function>amrestrpos</function>
 916    functions in <structname>IndexAmRoutine</structname>; set those pointers to NULL
 917    instead.
 918   </para>
 919
 920   <para>
 921    Both the scan position and the mark position (if any) must be maintained
 922    consistently in the face of concurrent insertions or deletions in the
 923    index.  It is OK if a freshly-inserted entry is not returned by a scan that
 924    would have found the entry if it had existed when the scan started, or for
 925    the scan to return such an entry upon rescanning or backing
 926    up even though it had not been returned the first time through.  Similarly,
 927    a concurrent delete might or might not be reflected in the results of a scan.
 928    What is important is that insertions or deletions not cause the scan to
 929    miss or multiply return entries that were not themselves being inserted or
 930    deleted.
 931   </para>
 932
 933   <para>
 934    If the index stores the original indexed data values (and not some lossy
 935    representation of them), it is useful to
 936    support <link linkend="indexes-index-only-scans">index-only scans</link>, in
 937    which the index returns the actual data not just the TID of the heap tuple.
 938    This will only avoid I/O if the visibility map shows that the TID is on an
 939    all-visible page; else the heap tuple must be visited anyway to check
 940    MVCC visibility.  But that is no concern of the access method's.
 941   </para>
 942
 943   <para>
 944    Instead of using <function>amgettuple</function>, an index scan can be done with
 945    <function>amgetbitmap</function> to fetch all tuples in one call.  This can be
 946    noticeably more efficient than <function>amgettuple</function> because it allows
 947    avoiding lock/unlock cycles within the access method.  In principle
 948    <function>amgetbitmap</function> should have the same effects as repeated
 949    <function>amgettuple</function> calls, but we impose several restrictions to
 950    simplify matters.  First of all, <function>amgetbitmap</function> returns all
 951    tuples at once and marking or restoring scan positions isn't
 952    supported. Secondly, the tuples are returned in a bitmap which doesn't
 953    have any specific ordering, which is why <function>amgetbitmap</function> doesn't
 954    take a <literal>direction</literal> argument.  (Ordering operators will never be
 955    supplied for such a scan, either.)
 956    Also, there is no provision for index-only scans with
 957    <function>amgetbitmap</function>, since there is no way to return the contents of
 958    index tuples.
 959    Finally, <function>amgetbitmap</function>
 960    does not guarantee any locking of the returned tuples, with implications
 961    spelled out in <xref linkend="index-locking"/>.
 962   </para>
 963
 964   <para>
 965    Note that it is permitted for an access method to implement only
 966    <function>amgetbitmap</function> and not <function>amgettuple</function>, or vice versa,
 967    if its internal implementation is unsuited to one API or the other.
 968   </para>
 969
 970  </sect1>
 971
 972  <sect1 id="index-locking">
 973   <title>Index Locking Considerations</title>
 974
 975   <para>
 976    Index access methods must handle concurrent updates
 977    of the index by multiple processes.
 978    The core <productname>PostgreSQL</productname> system obtains
 979    <literal>AccessShareLock</literal> on the index during an index scan, and
 980    <literal>RowExclusiveLock</literal> when updating the index (including plain
 981    <command>VACUUM</command>).  Since these lock types do not conflict, the access
 982    method is responsible for handling any fine-grained locking it might need.
 983    An exclusive lock on the index as a whole will be taken only during index
 984    creation, destruction, or <command>REINDEX</command>.
 985   </para>
 986
 987   <para>
 988    Building an index type that supports concurrent updates usually requires
 989    extensive and subtle analysis of the required behavior.  For the b-tree
 990    and hash index types, you can read about the design decisions involved in
 991    <filename>src/backend/access/nbtree/README</filename> and
 992    <filename>src/backend/access/hash/README</filename>.
 993   </para>
 994
 995   <para>
 996    Aside from the index's own internal consistency requirements, concurrent
 997    updates create issues about consistency between the parent table (the
 998    <firstterm>heap</firstterm>) and the index.  Because
 999    <productname>PostgreSQL</productname> separates accesses
1000    and updates of the heap from those of the index, there are windows in
1001    which the index might be inconsistent with the heap.  We handle this problem
1002    with the following rules:
1003
1004     <itemizedlist>
1005      <listitem>
1006       <para>
1007        A new heap entry is made before making its index entries.  (Therefore
1008        a concurrent index scan is likely to fail to see the heap entry.
1009        This is okay because the index reader would be uninterested in an
1010        uncommitted row anyway.  But see <xref linkend="index-unique-checks"/>.)
1011       </para>
1012      </listitem>
1013      <listitem>
1014       <para>
1015        When a heap entry is to be deleted (by <command>VACUUM</command>), all its
1016        index entries must be removed first.
1017       </para>
1018      </listitem>
1019      <listitem>
1020       <para>
1021        An index scan must maintain a pin
1022        on the index page holding the item last returned by
1023        <function>amgettuple</function>, and <function>ambulkdelete</function> cannot delete
1024        entries from pages that are pinned by other backends.  The need
1025        for this rule is explained below.
1026       </para>
1027      </listitem>
1028     </itemizedlist>
1029
1030    Without the third rule, it is possible for an index reader to
1031    see an index entry just before it is removed by <command>VACUUM</command>, and
1032    then to arrive at the corresponding heap entry after that was removed by
1033    <command>VACUUM</command>.
1034    This creates no serious problems if that item
1035    number is still unused when the reader reaches it, since an empty
1036    item slot will be ignored by <function>heap_fetch()</function>.  But what if a
1037    third backend has already re-used the item slot for something else?
1038    When using an MVCC-compliant snapshot, there is no problem because
1039    the new occupant of the slot is certain to be too new to pass the
1040    snapshot test.  However, with a non-MVCC-compliant snapshot (such as
1041    <literal>SnapshotAny</literal>), it would be possible to accept and return
1042    a row that does not in fact match the scan keys.  We could defend
1043    against this scenario by requiring the scan keys to be rechecked
1044    against the heap row in all cases, but that is too expensive.  Instead,
1045    we use a pin on an index page as a proxy to indicate that the reader
1046    might still be <quote>in flight</quote> from the index entry to the matching
1047    heap entry.  Making <function>ambulkdelete</function> block on such a pin ensures
1048    that <command>VACUUM</command> cannot delete the heap entry before the reader
1049    is done with it.  This solution costs little in run time, and adds blocking
1050    overhead only in the rare cases where there actually is a conflict.
1051   </para>
1052
1053   <para>
1054    This solution requires that index scans be <quote>synchronous</quote>: we have
1055    to fetch each heap tuple immediately after scanning the corresponding index
1056    entry.  This is expensive for a number of reasons.  An
1057    <quote>asynchronous</quote> scan in which we collect many TIDs from the index,
1058    and only visit the heap tuples sometime later, requires much less index
1059    locking overhead and can allow a more efficient heap access pattern.
1060    Per the above analysis, we must use the synchronous approach for
1061    non-MVCC-compliant snapshots, but an asynchronous scan is workable
1062    for a query using an MVCC snapshot.
1063   </para>
1064
1065   <para>
1066    In an <function>amgetbitmap</function> index scan, the access method does not
1067    keep an index pin on any of the returned tuples.  Therefore
1068    it is only safe to use such scans with MVCC-compliant snapshots.
1069   </para>
1070
1071   <para>
1072    When the <structfield>ampredlocks</structfield> flag is not set, any scan using that
1073    index access method within a serializable transaction will acquire a
1074    nonblocking predicate lock on the full index.  This will generate a
1075    read-write conflict with the insert of any tuple into that index by a
1076    concurrent serializable transaction.  If certain patterns of read-write
1077    conflicts are detected among a set of concurrent serializable
1078    transactions, one of those transactions may be canceled to protect data
1079    integrity.  When the flag is set, it indicates that the index access
1080    method implements finer-grained predicate locking, which will tend to
1081    reduce the frequency of such transaction cancellations.
1082   </para>
1083
1084  </sect1>
1085
1086  <sect1 id="index-unique-checks">
1087   <title>Index Uniqueness Checks</title>
1088
1089   <para>
1090    <productname>PostgreSQL</productname> enforces SQL uniqueness constraints
1091    using <firstterm>unique indexes</firstterm>, which are indexes that disallow
1092    multiple entries with identical keys.  An access method that supports this
1093    feature sets <structfield>amcanunique</structfield> true.
1094    (At present, only b-tree supports it.)  Columns listed in the
1095    <literal>INCLUDE</literal> clause are not considered when enforcing
1096    uniqueness.
1097   </para>
1098
1099   <para>
1100    Because of MVCC, it is always necessary to allow duplicate entries to
1101    exist physically in an index: the entries might refer to successive
1102    versions of a single logical row.  The behavior we actually want to
1103    enforce is that no MVCC snapshot could include two rows with equal
1104    index keys.  This breaks down into the following cases that must be
1105    checked when inserting a new row into a unique index:
1106
1107     <itemizedlist>
1108      <listitem>
1109       <para>
1110        If a conflicting valid row has been deleted by the current transaction,
1111        it's okay.  (In particular, since an UPDATE always deletes the old row
1112        version before inserting the new version, this will allow an UPDATE on
1113        a row without changing the key.)
1114       </para>
1115      </listitem>
1116      <listitem>
1117       <para>
1118        If a conflicting row has been inserted by an as-yet-uncommitted
1119        transaction, the would-be inserter must wait to see if that transaction
1120        commits.  If it rolls back then there is no conflict.  If it commits
1121        without deleting the conflicting row again, there is a uniqueness
1122        violation.  (In practice we just wait for the other transaction to
1123        end and then redo the visibility check in toto.)
1124       </para>
1125      </listitem>
1126      <listitem>
1127       <para>
1128        Similarly, if a conflicting valid row has been deleted by an
1129        as-yet-uncommitted transaction, the would-be inserter must wait
1130        for that transaction to commit or abort, and then repeat the test.
1131       </para>
1132      </listitem>
1133     </itemizedlist>
1134   </para>
1135
1136   <para>
1137    Furthermore, immediately before reporting a uniqueness violation
1138    according to the above rules, the access method must recheck the
1139    liveness of the row being inserted.  If it is committed dead then
1140    no violation should be reported.  (This case cannot occur during the
1141    ordinary scenario of inserting a row that's just been created by
1142    the current transaction.  It can happen during
1143    <command>CREATE UNIQUE INDEX CONCURRENTLY</command>, however.)
1144   </para>
1145
1146   <para>
1147    We require the index access method to apply these tests itself, which
1148    means that it must reach into the heap to check the commit status of
1149    any row that is shown to have a duplicate key according to the index
1150    contents.  This is without a doubt ugly and non-modular, but it saves
1151    redundant work: if we did a separate probe then the index lookup for
1152    a conflicting row would be essentially repeated while finding the place to
1153    insert the new row's index entry.  What's more, there is no obvious way
1154    to avoid race conditions unless the conflict check is an integral part
1155    of insertion of the new index entry.
1156   </para>
1157
1158   <para>
1159    If the unique constraint is deferrable, there is additional complexity:
1160    we need to be able to insert an index entry for a new row, but defer any
1161    uniqueness-violation error until end of statement or even later.  To
1162    avoid unnecessary repeat searches of the index, the index access method
1163    should do a preliminary uniqueness check during the initial insertion.
1164    If this shows that there is definitely no conflicting live tuple, we
1165    are done.  Otherwise, we schedule a recheck to occur when it is time to
1166    enforce the constraint.  If, at the time of the recheck, both the inserted
1167    tuple and some other tuple with the same key are live, then the error
1168    must be reported.  (Note that for this purpose, <quote>live</quote> actually
1169    means <quote>any tuple in the index entry's HOT chain is live</quote>.)
1170    To implement this, the <function>aminsert</function> function is passed a
1171    <literal>checkUnique</literal> parameter having one of the following values:
1172
1173     <itemizedlist>
1174      <listitem>
1175       <para>
1176        <literal>UNIQUE_CHECK_NO</literal> indicates that no uniqueness checking
1177        should be done (this is not a unique index).
1178       </para>
1179      </listitem>
1180      <listitem>
1181       <para>
1182        <literal>UNIQUE_CHECK_YES</literal> indicates that this is a non-deferrable
1183        unique index, and the uniqueness check must be done immediately, as
1184        described above.
1185       </para>
1186      </listitem>
1187      <listitem>
1188       <para>
1189        <literal>UNIQUE_CHECK_PARTIAL</literal> indicates that the unique
1190        constraint is deferrable. <productname>PostgreSQL</productname>
1191        will use this mode to insert each row's index entry.  The access
1192        method must allow duplicate entries into the index, and report any
1193        potential duplicates by returning false from <function>aminsert</function>.
1194        For each row for which false is returned, a deferred recheck will
1195        be scheduled.
1196       </para>
1197
1198       <para>
1199        The access method must identify any rows which might violate the
1200        unique constraint, but it is not an error for it to report false
1201        positives. This allows the check to be done without waiting for other
1202        transactions to finish; conflicts reported here are not treated as
1203        errors and will be rechecked later, by which time they may no longer
1204        be conflicts.
1205       </para>
1206      </listitem>
1207      <listitem>
1208       <para>
1209        <literal>UNIQUE_CHECK_EXISTING</literal> indicates that this is a deferred
1210        recheck of a row that was reported as a potential uniqueness violation.
1211        Although this is implemented by calling <function>aminsert</function>, the
1212        access method must <emphasis>not</emphasis> insert a new index entry in this
1213        case.  The index entry is already present.  Rather, the access method
1214        must check to see if there is another live index entry.  If so, and
1215        if the target row is also still live, report error.
1216       </para>
1217
1218       <para>
1219        It is recommended that in a <literal>UNIQUE_CHECK_EXISTING</literal> call,
1220        the access method further verify that the target row actually does
1221        have an existing entry in the index, and report error if not.  This
1222        is a good idea because the index tuple values passed to
1223        <function>aminsert</function> will have been recomputed.  If the index
1224        definition involves functions that are not really immutable, we
1225        might be checking the wrong area of the index.  Checking that the
1226        target row is found in the recheck verifies that we are scanning
1227        for the same tuple values as were used in the original insertion.
1228       </para>
1229      </listitem>
1230     </itemizedlist>
1231   </para>
1232
1233  </sect1>
1234
1235  <sect1 id="index-cost-estimation">
1236   <title>Index Cost Estimation Functions</title>
1237
1238   <para>
1239    The <function>amcostestimate</function> function is given information describing
1240    a possible index scan, including lists of WHERE and ORDER BY clauses that
1241    have been determined to be usable with the index.  It must return estimates
1242    of the cost of accessing the index and the selectivity of the WHERE
1243    clauses (that is, the fraction of parent-table rows that will be
1244    retrieved during the index scan).  For simple cases, nearly all the
1245    work of the cost estimator can be done by calling standard routines
1246    in the optimizer; the point of having an <function>amcostestimate</function> function is
1247    to allow index access methods to provide index-type-specific knowledge,
1248    in case it is possible to improve on the standard estimates.
1249   </para>
1250
1251   <para>
1252    Each <function>amcostestimate</function> function must have the signature:
1253
1254 <programlisting>
1255 void
1256 amcostestimate (PlannerInfo *root,
1257                 IndexPath *path,
1258                 double loop_count,
1259                 Cost *indexStartupCost,
1260                 Cost *indexTotalCost,
1261                 Selectivity *indexSelectivity,
1262                 double *indexCorrelation,
1263                 double *indexPages);
1264 </programlisting>
1265
1266    The first three parameters are inputs:
1267
1268    <variablelist>
1269     <varlistentry>
1270      <term><parameter>root</parameter></term>
1271      <listitem>
1272       <para>
1273        The planner's information about the query being processed.
1274       </para>
1275      </listitem>
1276     </varlistentry>
1277
1278     <varlistentry>
1279      <term><parameter>path</parameter></term>
1280      <listitem>
1281       <para>
1282        The index access path being considered.  All fields except cost and
1283        selectivity values are valid.
1284       </para>
1285      </listitem>
1286     </varlistentry>
1287
1288     <varlistentry>
1289      <term><parameter>loop_count</parameter></term>
1290      <listitem>
1291       <para>
1292        The number of repetitions of the index scan that should be factored
1293        into the cost estimates.  This will typically be greater than one when
1294        considering a parameterized scan for use in the inside of a nestloop
1295        join.  Note that the cost estimates should still be for just one scan;
1296        a larger <parameter>loop_count</parameter> means that it may be appropriate
1297        to allow for some caching effects across multiple scans.
1298       </para>
1299      </listitem>
1300     </varlistentry>
1301    </variablelist>
1302   </para>
1303
1304   <para>
1305    The last five parameters are pass-by-reference outputs:
1306
1307    <variablelist>
1308     <varlistentry>
1309      <term><parameter>*indexStartupCost</parameter></term>
1310      <listitem>
1311       <para>
1312        Set to cost of index start-up processing
1313       </para>
1314      </listitem>
1315     </varlistentry>
1316
1317     <varlistentry>
1318      <term><parameter>*indexTotalCost</parameter></term>
1319      <listitem>
1320       <para>
1321        Set to total cost of index processing
1322       </para>
1323      </listitem>
1324     </varlistentry>
1325
1326     <varlistentry>
1327      <term><parameter>*indexSelectivity</parameter></term>
1328      <listitem>
1329       <para>
1330        Set to index selectivity
1331       </para>
1332      </listitem>
1333     </varlistentry>
1334
1335     <varlistentry>
1336      <term><parameter>*indexCorrelation</parameter></term>
1337      <listitem>
1338       <para>
1339        Set to correlation coefficient between index scan order and
1340        underlying table's order
1341       </para>
1342      </listitem>
1343     </varlistentry>
1344
1345     <varlistentry>
1346      <term><parameter>*indexPages</parameter></term>
1347      <listitem>
1348       <para>
1349        Set to number of index leaf pages
1350       </para>
1351      </listitem>
1352     </varlistentry>
1353    </variablelist>
1354   </para>
1355
1356   <para>
1357    Note that cost estimate functions must be written in C, not in SQL or
1358    any available procedural language, because they must access internal
1359    data structures of the planner/optimizer.
1360   </para>
1361
1362   <para>
1363    The index access costs should be computed using the parameters used by
1364    <filename>src/backend/optimizer/path/costsize.c</filename>: a sequential
1365    disk block fetch has cost <varname>seq_page_cost</varname>, a nonsequential fetch
1366    has cost <varname>random_page_cost</varname>, and the cost of processing one index
1367    row should usually be taken as <varname>cpu_index_tuple_cost</varname>.  In
1368    addition, an appropriate multiple of <varname>cpu_operator_cost</varname> should
1369    be charged for any comparison operators invoked during index processing
1370    (especially evaluation of the indexquals themselves).
1371   </para>
1372
1373   <para>
1374    The access costs should include all disk and CPU costs associated with
1375    scanning the index itself, but <emphasis>not</emphasis> the costs of retrieving or
1376    processing the parent-table rows that are identified by the index.
1377   </para>
1378
1379   <para>
1380    The <quote>start-up cost</quote> is the part of the total scan cost that
1381    must be expended before we can begin to fetch the first row.  For most
1382    indexes this can be taken as zero, but an index type with a high start-up
1383    cost might want to set it nonzero.
1384   </para>
1385
1386   <para>
1387    The <parameter>indexSelectivity</parameter> should be set to the estimated fraction of the parent
1388    table rows that will be retrieved during the index scan.  In the case
1389    of a lossy query, this will typically be higher than the fraction of
1390    rows that actually pass the given qual conditions.
1391   </para>
1392
1393   <para>
1394    The <parameter>indexCorrelation</parameter> should be set to the correlation (ranging between
1395    -1.0 and 1.0) between the index order and the table order.  This is used
1396    to adjust the estimate for the cost of fetching rows from the parent
1397    table.
1398   </para>
1399
1400   <para>
1401    The <parameter>indexPages</parameter> should be set to the number of leaf pages.
1402    This is used to estimate the number of workers for parallel index scan.
1403   </para>
1404
1405   <para>
1406    When <parameter>loop_count</parameter> is greater than one, the returned numbers
1407    should be averages expected for any one scan of the index.
1408   </para>
1409
1410   <procedure>
1411    <title>Cost Estimation</title>
1412    <para>
1413     A typical cost estimator will proceed as follows:
1414    </para>
1415
1416    <step>
1417     <para>
1418      Estimate and return the fraction of parent-table rows that will be visited
1419      based on the given qual conditions.  In the absence of any index-type-specific
1420      knowledge, use the standard optimizer function <function>clauselist_selectivity()</function>:
1421
1422 <programlisting>
1423 *indexSelectivity = clauselist_selectivity(root, path-&gt;indexquals,
1424                                            path-&gt;indexinfo-&gt;rel-&gt;relid,
1425                                            JOIN_INNER, NULL);
1426 </programlisting>
1427     </para>
1428    </step>
1429
1430    <step>
1431     <para>
1432      Estimate the number of index rows that will be visited during the
1433      scan.  For many index types this is the same as <parameter>indexSelectivity</parameter> times
1434      the number of rows in the index, but it might be more.  (Note that the
1435      index's size in pages and rows is available from the
1436      <literal>path-&gt;indexinfo</literal> struct.)
1437     </para>
1438    </step>
1439
1440    <step>
1441     <para>
1442      Estimate the number of index pages that will be retrieved during the scan.
1443      This might be just <parameter>indexSelectivity</parameter> times the index's size in pages.
1444     </para>
1445    </step>
1446
1447    <step>
1448     <para>
1449      Compute the index access cost.  A generic estimator might do this:
1450
1451 <programlisting>
1452 /*
1453  * Our generic assumption is that the index pages will be read
1454  * sequentially, so they cost seq_page_cost each, not random_page_cost.
1455  * Also, we charge for evaluation of the indexquals at each index row.
1456  * All the costs are assumed to be paid incrementally during the scan.
1457  */
1458 cost_qual_eval(&amp;index_qual_cost, path-&gt;indexquals, root);
1459 *indexStartupCost = index_qual_cost.startup;
1460 *indexTotalCost = seq_page_cost * numIndexPages +
1461     (cpu_index_tuple_cost + index_qual_cost.per_tuple) * numIndexTuples;
1462 </programlisting>
1463
1464      However, the above does not account for amortization of index reads
1465      across repeated index scans.
1466     </para>
1467    </step>
1468
1469    <step>
1470     <para>
1471      Estimate the index correlation.  For a simple ordered index on a single
1472      field, this can be retrieved from pg_statistic.  If the correlation
1473      is not known, the conservative estimate is zero (no correlation).
1474     </para>
1475    </step>
1476   </procedure>
1477
1478   <para>
1479    Examples of cost estimator functions can be found in
1480    <filename>src/backend/utils/adt/selfuncs.c</filename>.
1481   </para>
1482  </sect1>
1483 </chapter>