Remove old RULE privilege completely.
[pgsql.git] / doc / src / sgml / indexam.sgml
blobdc7d14b60dd24ddcd95c0b6c93d3a9464aba9394
1 <!-- doc/src/sgml/indexam.sgml -->
3 <chapter id="indexam">
4 <title>Index Access Method Interface Definition</title>
6 <indexterm>
7 <primary>Index Access Method</primary>
8 </indexterm>
9 <indexterm>
10 <primary>indexam</primary>
11 <secondary>Index Access Method</secondary>
12 </indexterm>
14 <para>
15 This chapter defines the interface between the core
16 <productname>PostgreSQL</productname> system and <firstterm>index access
17 methods</firstterm>, which manage individual index types. The core system
18 knows nothing about indexes beyond what is specified here, so it is
19 possible to develop entirely new index types by writing add-on code.
20 </para>
22 <para>
23 All indexes in <productname>PostgreSQL</productname> are what are known
24 technically as <firstterm>secondary indexes</firstterm>; that is, the index is
25 physically separate from the table file that it describes. Each index
26 is stored as its own physical <firstterm>relation</firstterm> and so is described
27 by an entry in the <structname>pg_class</structname> catalog. The contents of an
28 index are entirely under the control of its index access method. In
29 practice, all index access methods divide indexes into standard-size
30 pages so that they can use the regular storage manager and buffer manager
31 to access the index contents. (All the existing index access methods
32 furthermore use the standard page layout described in <xref
33 linkend="storage-page-layout"/>, and most use the same format for index
34 tuple headers; but these decisions are not forced on an access method.)
35 </para>
37 <para>
38 An index is effectively a mapping from some data key values to
39 <firstterm>tuple identifiers</firstterm>, or <acronym>TIDs</acronym>, of row versions
40 (tuples) in the index's parent table. A TID consists of a
41 block number and an item number within that block (see <xref
42 linkend="storage-page-layout"/>). This is sufficient
43 information to fetch a particular row version from the table.
44 Indexes are not directly aware that under MVCC, there might be multiple
45 extant versions of the same logical row; to an index, each tuple is
46 an independent object that needs its own index entry. Thus, an
47 update of a row always creates all-new index entries for the row, even if
48 the key values did not change. (<link linkend="storage-hot">HOT
49 tuples</link> are an exception to this
50 statement; but indexes do not deal with those, either.) Index entries for
51 dead tuples are reclaimed (by vacuuming) when the dead tuples themselves
52 are reclaimed.
53 </para>
55 <sect1 id="index-api">
56 <title>Basic API Structure for Indexes</title>
58 <para>
59 Each index access method is described by a row in the
60 <link linkend="catalog-pg-am"><structname>pg_am</structname></link>
61 system catalog. The <structname>pg_am</structname> entry
62 specifies a name and a <firstterm>handler function</firstterm> for the index
63 access method. These entries can be created and deleted using the
64 <xref linkend="sql-create-access-method"/> and
65 <xref linkend="sql-drop-access-method"/> SQL commands.
66 </para>
68 <para>
69 An index access method handler function must be declared to accept a
70 single argument of type <type>internal</type> and to return the
71 pseudo-type <type>index_am_handler</type>. The argument is a dummy value that
72 simply serves to prevent handler functions from being called directly from
73 SQL commands. The result of the function must be a palloc'd struct of
74 type <structname>IndexAmRoutine</structname>, which contains everything
75 that the core code needs to know to make use of the index access method.
76 The <structname>IndexAmRoutine</structname> struct, also called the access
77 method's <firstterm>API struct</firstterm>, includes fields specifying assorted
78 fixed properties of the access method, such as whether it can support
79 multicolumn indexes. More importantly, it contains pointers to support
80 functions for the access method, which do all of the real work to access
81 indexes. These support functions are plain C functions and are not
82 visible or callable at the SQL level. The support functions are described
83 in <xref linkend="index-functions"/>.
84 </para>
86 <para>
87 The structure <structname>IndexAmRoutine</structname> is defined thus:
88 <programlisting>
89 typedef struct IndexAmRoutine
91 NodeTag type;
94 * Total number of strategies (operators) by which we can traverse/search
95 * this AM. Zero if AM does not have a fixed set of strategy assignments.
97 uint16 amstrategies;
98 /* total number of support functions that this AM uses */
99 uint16 amsupport;
100 /* opclass options support function number or 0 */
101 uint16 amoptsprocnum;
102 /* does AM support ORDER BY indexed column's value? */
103 bool amcanorder;
104 /* does AM support ORDER BY result of an operator on indexed column? */
105 bool amcanorderbyop;
106 /* does AM support backward scanning? */
107 bool amcanbackward;
108 /* does AM support UNIQUE indexes? */
109 bool amcanunique;
110 /* does AM support multi-column indexes? */
111 bool amcanmulticol;
112 /* does AM require scans to have a constraint on the first index column? */
113 bool amoptionalkey;
114 /* does AM handle ScalarArrayOpExpr quals? */
115 bool amsearcharray;
116 /* does AM handle IS NULL/IS NOT NULL quals? */
117 bool amsearchnulls;
118 /* can index storage data type differ from column data type? */
119 bool amstorage;
120 /* can an index of this type be clustered on? */
121 bool amclusterable;
122 /* does AM handle predicate locks? */
123 bool ampredlocks;
124 /* does AM support parallel scan? */
125 bool amcanparallel;
126 /* does AM support parallel build? */
127 bool amcanbuildparallel;
128 /* does AM support columns included with clause INCLUDE? */
129 bool amcaninclude;
130 /* does AM use maintenance_work_mem? */
131 bool amusemaintenanceworkmem;
132 /* does AM summarize tuples, with at least all tuples in the block
133 * summarized in one summary */
134 bool amsummarizing;
135 /* OR of parallel vacuum flags */
136 uint8 amparallelvacuumoptions;
137 /* type of data stored in index, or InvalidOid if variable */
138 Oid amkeytype;
140 /* interface functions */
141 ambuild_function ambuild;
142 ambuildempty_function ambuildempty;
143 aminsert_function aminsert;
144 aminsertcleanup_function aminsertcleanup;
145 ambulkdelete_function ambulkdelete;
146 amvacuumcleanup_function amvacuumcleanup;
147 amcanreturn_function amcanreturn; /* can be NULL */
148 amcostestimate_function amcostestimate;
149 amgettreeheight_function amgettreeheight; /* can be NULL */
150 amoptions_function amoptions;
151 amproperty_function amproperty; /* can be NULL */
152 ambuildphasename_function ambuildphasename; /* can be NULL */
153 amvalidate_function amvalidate;
154 amadjustmembers_function amadjustmembers; /* can be NULL */
155 ambeginscan_function ambeginscan;
156 amrescan_function amrescan;
157 amgettuple_function amgettuple; /* can be NULL */
158 amgetbitmap_function amgetbitmap; /* can be NULL */
159 amendscan_function amendscan;
160 ammarkpos_function ammarkpos; /* can be NULL */
161 amrestrpos_function amrestrpos; /* can be NULL */
163 /* interface functions to support parallel index scans */
164 amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
165 aminitparallelscan_function aminitparallelscan; /* can be NULL */
166 amparallelrescan_function amparallelrescan; /* can be NULL */
167 } IndexAmRoutine;
168 </programlisting>
169 </para>
171 <para>
172 To be useful, an index access method must also have one or more
173 <firstterm>operator families</firstterm> and
174 <firstterm>operator classes</firstterm> defined in
175 <link linkend="catalog-pg-opfamily"><structname>pg_opfamily</structname></link>,
176 <link linkend="catalog-pg-opclass"><structname>pg_opclass</structname></link>,
177 <link linkend="catalog-pg-amop"><structname>pg_amop</structname></link>, and
178 <link linkend="catalog-pg-amproc"><structname>pg_amproc</structname></link>.
179 These entries allow the planner
180 to determine what kinds of query qualifications can be used with
181 indexes of this access method. Operator families and classes are described
182 in <xref linkend="xindex"/>, which is prerequisite material for reading
183 this chapter.
184 </para>
186 <para>
187 An individual index is defined by a
188 <link linkend="catalog-pg-class"><structname>pg_class</structname></link>
189 entry that describes it as a physical relation, plus a
190 <link linkend="catalog-pg-index"><structname>pg_index</structname></link>
191 entry that shows the logical content of the index &mdash; that is, the set
192 of index columns it has and the semantics of those columns, as captured by
193 the associated operator classes. The index columns (key values) can be
194 either simple columns of the underlying table or expressions over the table
195 rows. The index access method normally has no interest in where the index
196 key values come from (it is always handed precomputed key values) but it
197 will be very interested in the operator class information in
198 <structname>pg_index</structname>. Both of these catalog entries can be
199 accessed as part of the <structname>Relation</structname> data structure that is
200 passed to all operations on the index.
201 </para>
203 <para>
204 Some of the flag fields of <structname>IndexAmRoutine</structname> have nonobvious
205 implications. The requirements of <structfield>amcanunique</structfield>
206 are discussed in <xref linkend="index-unique-checks"/>.
207 The <structfield>amcanmulticol</structfield> flag asserts that the
208 access method supports multi-key-column indexes, while
209 <structfield>amoptionalkey</structfield> asserts that it allows scans
210 where no indexable restriction clause is given for the first index column.
211 When <structfield>amcanmulticol</structfield> is false,
212 <structfield>amoptionalkey</structfield> essentially says whether the
213 access method supports full-index scans without any restriction clause.
214 Access methods that support multiple index columns <emphasis>must</emphasis>
215 support scans that omit restrictions on any or all of the columns after
216 the first; however they are permitted to require some restriction to
217 appear for the first index column, and this is signaled by setting
218 <structfield>amoptionalkey</structfield> false.
219 One reason that an index <acronym>AM</acronym> might set
220 <structfield>amoptionalkey</structfield> false is if it doesn't index
221 null values. Since most indexable operators are
222 strict and hence cannot return true for null inputs,
223 it is at first sight attractive to not store index entries for null values:
224 they could never be returned by an index scan anyway. However, this
225 argument fails when an index scan has no restriction clause for a given
226 index column. In practice this means that
227 indexes that have <structfield>amoptionalkey</structfield> true must
228 index nulls, since the planner might decide to use such an index
229 with no scan keys at all. A related restriction is that an index
230 access method that supports multiple index columns <emphasis>must</emphasis>
231 support indexing null values in columns after the first, because the planner
232 will assume the index can be used for queries that do not restrict
233 these columns. For example, consider an index on (a,b) and a query with
234 <literal>WHERE a = 4</literal>. The system will assume the index can be
235 used to scan for rows with <literal>a = 4</literal>, which is wrong if the
236 index omits rows where <literal>b</literal> is null.
237 It is, however, OK to omit rows where the first indexed column is null.
238 An index access method that does index nulls may also set
239 <structfield>amsearchnulls</structfield>, indicating that it supports
240 <literal>IS NULL</literal> and <literal>IS NOT NULL</literal> clauses as search
241 conditions.
242 </para>
244 <para>
245 The <structfield>amcaninclude</structfield> flag indicates whether the
246 access method supports <quote>included</quote> columns, that is it can
247 store (without processing) additional columns beyond the key column(s).
248 The requirements of the preceding paragraph apply only to the key
249 columns. In particular, the combination
250 of <structfield>amcanmulticol</structfield>=<literal>false</literal>
251 and <structfield>amcaninclude</structfield>=<literal>true</literal> is
252 sensible: it means that there can only be one key column, but there can
253 also be included column(s). Also, included columns must be allowed to be
254 null, independently of <structfield>amoptionalkey</structfield>.
255 </para>
257 <para>
258 The <structfield>amsummarizing</structfield> flag indicates whether the
259 access method summarizes the indexed tuples, with summarizing granularity
260 of at least per block.
261 Access methods that do not point to individual tuples, but to block ranges
262 (like <acronym>BRIN</acronym>), may allow the <acronym>HOT</acronym> optimization
263 to continue. This does not apply to attributes referenced in index
264 predicates, an update of such an attribute always disables <acronym>HOT</acronym>.
265 </para>
267 </sect1>
269 <sect1 id="index-functions">
270 <title>Index Access Method Functions</title>
272 <para>
273 The index construction and maintenance functions that an index access
274 method must provide in <structname>IndexAmRoutine</structname> are:
275 </para>
277 <para>
278 <programlisting>
279 IndexBuildResult *
280 ambuild (Relation heapRelation,
281 Relation indexRelation,
282 IndexInfo *indexInfo);
283 </programlisting>
284 Build a new index. The index relation has been physically created,
285 but is empty. It must be filled in with whatever fixed data the
286 access method requires, plus entries for all tuples already existing
287 in the table. Ordinarily the <function>ambuild</function> function will call
288 <function>table_index_build_scan()</function> to scan the table for existing tuples
289 and compute the keys that need to be inserted into the index.
290 The function must return a palloc'd struct containing statistics about
291 the new index.
292 The <structfield>amcanbuildparallel</structfield> flag indicates whether
293 the access method supports parallel index builds. When set to <literal>true</literal>,
294 the system will attempt to allocate parallel workers for the build.
295 Access methods supporting only non-parallel index builds should leave
296 this flag set to <literal>false</literal>.
297 </para>
299 <para>
300 <programlisting>
301 void
302 ambuildempty (Relation indexRelation);
303 </programlisting>
304 Build an empty index, and write it to the initialization fork (<symbol>INIT_FORKNUM</symbol>)
305 of the given relation. This method is called only for unlogged indexes; the
306 empty index written to the initialization fork will be copied over the main
307 relation fork on each server restart.
308 </para>
310 <para>
311 <programlisting>
312 bool
313 aminsert (Relation indexRelation,
314 Datum *values,
315 bool *isnull,
316 ItemPointer heap_tid,
317 Relation heapRelation,
318 IndexUniqueCheck checkUnique,
319 bool indexUnchanged,
320 IndexInfo *indexInfo);
321 </programlisting>
322 Insert a new tuple into an existing index. The <literal>values</literal> and
323 <literal>isnull</literal> arrays give the key values to be indexed, and
324 <literal>heap_tid</literal> is the TID to be indexed.
325 If the access method supports unique indexes (its
326 <structfield>amcanunique</structfield> flag is true) then
327 <literal>checkUnique</literal> indicates the type of uniqueness check to
328 perform. This varies depending on whether the unique constraint is
329 deferrable; see <xref linkend="index-unique-checks"/> for details.
330 Normally the access method only needs the <literal>heapRelation</literal>
331 parameter when performing uniqueness checking (since then it will have to
332 look into the heap to verify tuple liveness).
333 </para>
335 <para>
336 The <literal>indexUnchanged</literal> Boolean value gives a hint
337 about the nature of the tuple to be indexed. When it is true,
338 the tuple is a duplicate of some existing tuple in the index. The
339 new tuple is a logically unchanged successor MVCC tuple version. This
340 happens when an <command>UPDATE</command> takes place that does not
341 modify any columns covered by the index, but nevertheless requires a
342 new version in the index. The index AM may use this hint to decide
343 to apply bottom-up index deletion in parts of the index where many
344 versions of the same logical row accumulate. Note that updating a non-key
345 column or a column that only appears in a partial index predicate does not
346 affect the value of <literal>indexUnchanged</literal>. The core code
347 determines each tuple's <literal>indexUnchanged</literal> value using a low
348 overhead approach that allows both false positives and false negatives.
349 Index AMs must not treat <literal>indexUnchanged</literal> as an
350 authoritative source of information about tuple visibility or versioning.
351 </para>
353 <para>
354 The function's Boolean result value is significant only when
355 <literal>checkUnique</literal> is <literal>UNIQUE_CHECK_PARTIAL</literal>.
356 In this case a true result means the new entry is known unique, whereas
357 false means it might be non-unique (and a deferred uniqueness check must
358 be scheduled). For other cases a constant false result is recommended.
359 </para>
361 <para>
362 Some indexes might not index all tuples. If the tuple is not to be
363 indexed, <function>aminsert</function> should just return without doing anything.
364 </para>
366 <para>
367 If the index AM wishes to cache data across successive index insertions
368 within an SQL statement, it can allocate space
369 in <literal>indexInfo-&gt;ii_Context</literal> and store a pointer to the
370 data in <literal>indexInfo-&gt;ii_AmCache</literal> (which will be NULL
371 initially). If resources other than memory have to be released after
372 index insertions, <function>aminsertcleanup</function> may be provided,
373 which will be called before the memory is released.
374 </para>
376 <para>
377 <programlisting>
378 void
379 aminsertcleanup (Relation indexRelation,
380 IndexInfo *indexInfo);
381 </programlisting>
382 Clean up state that was maintained across successive inserts in
383 <literal>indexInfo-&gt;ii_AmCache</literal>. This is useful if the data
384 requires additional cleanup steps (e.g., releasing pinned buffers), and
385 simply releasing the memory is not sufficient.
386 </para>
388 <para>
389 <programlisting>
390 IndexBulkDeleteResult *
391 ambulkdelete (IndexVacuumInfo *info,
392 IndexBulkDeleteResult *stats,
393 IndexBulkDeleteCallback callback,
394 void *callback_state);
395 </programlisting>
396 Delete tuple(s) from the index. This is a <quote>bulk delete</quote> operation
397 that is intended to be implemented by scanning the whole index and checking
398 each entry to see if it should be deleted.
399 The passed-in <literal>callback</literal> function must be called, in the style
400 <literal>callback(<replaceable>TID</replaceable>, callback_state) returns bool</literal>,
401 to determine whether any particular index entry, as identified by its
402 referenced TID, is to be deleted. Must return either NULL or a palloc'd
403 struct containing statistics about the effects of the deletion operation.
404 It is OK to return NULL if no information needs to be passed on to
405 <function>amvacuumcleanup</function>.
406 </para>
408 <para>
409 Because of limited <varname>maintenance_work_mem</varname>,
410 <function>ambulkdelete</function> might need to be called more than once when many
411 tuples are to be deleted. The <literal>stats</literal> argument is the result
412 of the previous call for this index (it is NULL for the first call within a
413 <command>VACUUM</command> operation). This allows the AM to accumulate statistics
414 across the whole operation. Typically, <function>ambulkdelete</function> will
415 modify and return the same struct if the passed <literal>stats</literal> is not
416 null.
417 </para>
419 <para>
420 <programlisting>
421 IndexBulkDeleteResult *
422 amvacuumcleanup (IndexVacuumInfo *info,
423 IndexBulkDeleteResult *stats);
424 </programlisting>
425 Clean up after a <command>VACUUM</command> operation (zero or more
426 <function>ambulkdelete</function> calls). This does not have to do anything
427 beyond returning index statistics, but it might perform bulk cleanup
428 such as reclaiming empty index pages. <literal>stats</literal> is whatever the
429 last <function>ambulkdelete</function> call returned, or NULL if
430 <function>ambulkdelete</function> was not called because no tuples needed to be
431 deleted. If the result is not NULL it must be a palloc'd struct.
432 The statistics it contains will be used to update <structname>pg_class</structname>,
433 and will be reported by <command>VACUUM</command> if <literal>VERBOSE</literal> is given.
434 It is OK to return NULL if the index was not changed at all during the
435 <command>VACUUM</command> operation, but otherwise correct stats should
436 be returned.
437 </para>
439 <para>
440 <function>amvacuumcleanup</function> will also be called at completion of an
441 <command>ANALYZE</command> operation. In this case <literal>stats</literal> is always
442 NULL and any return value will be ignored. This case can be distinguished
443 by checking <literal>info-&gt;analyze_only</literal>. It is recommended
444 that the access method do nothing except post-insert cleanup in such a
445 call, and that only in an autovacuum worker process.
446 </para>
448 <para>
449 <programlisting>
450 bool
451 amcanreturn (Relation indexRelation, int attno);
452 </programlisting>
453 Check whether the index can support <link
454 linkend="indexes-index-only-scans"><firstterm>index-only scans</firstterm></link> on
455 the given column, by returning the column's original indexed value.
456 The attribute number is 1-based, i.e., the first column's attno is 1.
457 Returns true if supported, else false.
458 This function should always return true for included columns
459 (if those are supported), since there's little point in an included
460 column that can't be retrieved.
461 If the access method does not support index-only scans at all,
462 the <structfield>amcanreturn</structfield> field in its <structname>IndexAmRoutine</structname>
463 struct can be set to NULL.
464 </para>
466 <para>
467 <programlisting>
468 void
469 amcostestimate (PlannerInfo *root,
470 IndexPath *path,
471 double loop_count,
472 Cost *indexStartupCost,
473 Cost *indexTotalCost,
474 Selectivity *indexSelectivity,
475 double *indexCorrelation,
476 double *indexPages);
477 </programlisting>
478 Estimate the costs of an index scan. This function is described fully
479 in <xref linkend="index-cost-estimation"/>, below.
480 </para>
482 <para>
483 <programlisting>
485 amgettreeheight (Relation rel);
486 </programlisting>
487 Compute the height of a tree-shaped index. This information is supplied to
488 the <function>amcostestimate</function> function in
489 <literal>path->indexinfo->tree_height</literal> and can be used to support
490 the cost estimation. The result is not used anywhere else, so this
491 function can actually be used to compute any kind of data (that fits into
492 an integer) about the index that the cost estimation function might want to
493 know. If the computation is expensive, it could be useful to cache the
494 result as part of <literal>RelationData.rd_amcache</literal>.
495 </para>
497 <para>
498 <programlisting>
499 bytea *
500 amoptions (ArrayType *reloptions,
501 bool validate);
502 </programlisting>
503 Parse and validate the reloptions array for an index. This is called only
504 when a non-null reloptions array exists for the index.
505 <parameter>reloptions</parameter> is a <type>text</type> array containing entries of the
506 form <replaceable>name</replaceable><literal>=</literal><replaceable>value</replaceable>.
507 The function should construct a <type>bytea</type> value, which will be copied
508 into the <structfield>rd_options</structfield> field of the index's relcache entry.
509 The data contents of the <type>bytea</type> value are open for the access
510 method to define; most of the standard access methods use struct
511 <structname>StdRdOptions</structname>.
512 When <parameter>validate</parameter> is true, the function should report a suitable
513 error message if any of the options are unrecognized or have invalid
514 values; when <parameter>validate</parameter> is false, invalid entries should be
515 silently ignored. (<parameter>validate</parameter> is false when loading options
516 already stored in <structname>pg_catalog</structname>; an invalid entry could only
517 be found if the access method has changed its rules for options, and in
518 that case ignoring obsolete entries is appropriate.)
519 It is OK to return NULL if default behavior is wanted.
520 </para>
522 <para>
523 <programlisting>
524 bool
525 amproperty (Oid index_oid, int attno,
526 IndexAMProperty prop, const char *propname,
527 bool *res, bool *isnull);
528 </programlisting>
529 The <function>amproperty</function> method allows index access methods to override
530 the default behavior of <function>pg_index_column_has_property</function>
531 and related functions.
532 If the access method does not have any special behavior for index property
533 inquiries, the <structfield>amproperty</structfield> field in
534 its <structname>IndexAmRoutine</structname> struct can be set to NULL.
535 Otherwise, the <function>amproperty</function> method will be called with
536 <parameter>index_oid</parameter> and <parameter>attno</parameter> both zero for
537 <function>pg_indexam_has_property</function> calls,
538 or with <parameter>index_oid</parameter> valid and <parameter>attno</parameter> zero for
539 <function>pg_index_has_property</function> calls,
540 or with <parameter>index_oid</parameter> valid and <parameter>attno</parameter> greater than
541 zero for <function>pg_index_column_has_property</function> calls.
542 <parameter>prop</parameter> is an enum value identifying the property being tested,
543 while <parameter>propname</parameter> is the original property name string.
544 If the core code does not recognize the property name
545 then <parameter>prop</parameter> is <literal>AMPROP_UNKNOWN</literal>.
546 Access methods can define custom property names by
547 checking <parameter>propname</parameter> for a match (use <function>pg_strcasecmp</function>
548 to match, for consistency with the core code); for names known to the core
549 code, it's better to inspect <parameter>prop</parameter>.
550 If the <structfield>amproperty</structfield> method returns <literal>true</literal> then
551 it has determined the property test result: it must set <literal>*res</literal>
552 to the Boolean value to return, or set <literal>*isnull</literal>
553 to <literal>true</literal> to return a NULL. (Both of the referenced variables
554 are initialized to <literal>false</literal> before the call.)
555 If the <structfield>amproperty</structfield> method returns <literal>false</literal> then
556 the core code will proceed with its normal logic for determining the
557 property test result.
558 </para>
560 <para>
561 Access methods that support ordering operators should
562 implement <literal>AMPROP_DISTANCE_ORDERABLE</literal> property testing, as the
563 core code does not know how to do that and will return NULL. It may
564 also be advantageous to implement <literal>AMPROP_RETURNABLE</literal> testing,
565 if that can be done more cheaply than by opening the index and calling
566 <function>amcanreturn</function>, which is the core code's default behavior.
567 The default behavior should be satisfactory for all other standard
568 properties.
569 </para>
571 <para>
572 <programlisting>
573 char *
574 ambuildphasename (int64 phasenum);
575 </programlisting>
576 Return the textual name of the given build phase number.
577 The phase numbers are those reported during an index build via the
578 <function>pgstat_progress_update_param</function> interface.
579 The phase names are then exposed in the
580 <structname>pg_stat_progress_create_index</structname> view.
581 </para>
583 <para>
584 <programlisting>
585 bool
586 amvalidate (Oid opclassoid);
587 </programlisting>
588 Validate the catalog entries for the specified operator class, so far as
589 the access method can reasonably do that. For example, this might include
590 testing that all required support functions are provided.
591 The <function>amvalidate</function> function must return false if the opclass is
592 invalid. Problems should be reported with <function>ereport</function>
593 messages, typically at <literal>INFO</literal> level.
594 </para>
596 <para>
597 <programlisting>
598 void
599 amadjustmembers (Oid opfamilyoid,
600 Oid opclassoid,
601 List *operators,
602 List *functions);
603 </programlisting>
604 Validate proposed new operator and function members of an operator family,
605 so far as the access method can reasonably do that, and set their
606 dependency types if the default is not satisfactory. This is called
607 during <command>CREATE OPERATOR CLASS</command> and during
608 <command>ALTER OPERATOR FAMILY ADD</command>; in the latter
609 case <parameter>opclassoid</parameter> is <literal>InvalidOid</literal>.
610 The <type>List</type> arguments are lists
611 of <structname>OpFamilyMember</structname> structs, as defined
612 in <filename>amapi.h</filename>.
614 Tests done by this function will typically be a subset of those
615 performed by <function>amvalidate</function>,
616 since <function>amadjustmembers</function> cannot assume that it is
617 seeing a complete set of members. For example, it would be reasonable
618 to check the signature of a support function, but not to check whether
619 all required support functions are provided. Any problems can be
620 reported by throwing an error.
622 The dependency-related fields of
623 the <structname>OpFamilyMember</structname> structs are initialized by
624 the core code to create hard dependencies on the opclass if this
625 is <command>CREATE OPERATOR CLASS</command>, or soft dependencies on the
626 opfamily if this is <command>ALTER OPERATOR FAMILY ADD</command>.
627 <function>amadjustmembers</function> can adjust these fields if some other
628 behavior is more appropriate. For example, GIN, GiST, and SP-GiST
629 always set operator members to have soft dependencies on the opfamily,
630 since the connection between an operator and an opclass is relatively
631 weak in these index types; so it is reasonable to allow operator members
632 to be added and removed freely. Optional support functions are typically
633 also given soft dependencies, so that they can be removed if necessary.
634 </para>
637 <para>
638 The purpose of an index, of course, is to support scans for tuples matching
639 an indexable <literal>WHERE</literal> condition, often called a
640 <firstterm>qualifier</firstterm> or <firstterm>scan key</firstterm>. The semantics of
641 index scanning are described more fully in <xref linkend="index-scanning"/>,
642 below. An index access method can support <quote>plain</quote> index scans,
643 <quote>bitmap</quote> index scans, or both. The scan-related functions that an
644 index access method must or may provide are:
645 </para>
647 <para>
648 <programlisting>
649 IndexScanDesc
650 ambeginscan (Relation indexRelation,
651 int nkeys,
652 int norderbys);
653 </programlisting>
654 Prepare for an index scan. The <literal>nkeys</literal> and <literal>norderbys</literal>
655 parameters indicate the number of quals and ordering operators that will be
656 used in the scan; these may be useful for space allocation purposes.
657 Note that the actual values of the scan keys aren't provided yet.
658 The result must be a palloc'd struct.
659 For implementation reasons the index access method
660 <emphasis>must</emphasis> create this struct by calling
661 <function>RelationGetIndexScan()</function>. In most cases
662 <function>ambeginscan</function> does little beyond making that call and perhaps
663 acquiring locks;
664 the interesting parts of index-scan startup are in <function>amrescan</function>.
665 </para>
667 <para>
668 <programlisting>
669 void
670 amrescan (IndexScanDesc scan,
671 ScanKey keys,
672 int nkeys,
673 ScanKey orderbys,
674 int norderbys);
675 </programlisting>
676 Start or restart an index scan, possibly with new scan keys. (To restart
677 using previously-passed keys, NULL is passed for <literal>keys</literal> and/or
678 <literal>orderbys</literal>.) Note that it is not allowed for
679 the number of keys or order-by operators to be larger than
680 what was passed to <function>ambeginscan</function>. In practice the restart
681 feature is used when a new outer tuple is selected by a nested-loop join
682 and so a new key comparison value is needed, but the scan key structure
683 remains the same.
684 </para>
686 <para>
687 <programlisting>
688 bool
689 amgettuple (IndexScanDesc scan,
690 ScanDirection direction);
691 </programlisting>
692 Fetch the next tuple in the given scan, moving in the given
693 direction (forward or backward in the index). Returns true if a tuple was
694 obtained, false if no matching tuples remain. In the true case the tuple
695 TID is stored into the <literal>scan</literal> structure. Note that
696 <quote>success</quote> means only that the index contains an entry that matches
697 the scan keys, not that the tuple necessarily still exists in the heap or
698 will pass the caller's snapshot test. On success, <function>amgettuple</function>
699 must also set <literal>scan-&gt;xs_recheck</literal> to true or false.
700 False means it is certain that the index entry matches the scan keys.
701 True means this is not certain, and the conditions represented by the
702 scan keys must be rechecked against the heap tuple after fetching it.
703 This provision supports <quote>lossy</quote> index operators.
704 Note that rechecking will extend only to the scan conditions; a partial
705 index predicate (if any) is never rechecked by <function>amgettuple</function>
706 callers.
707 </para>
709 <para>
710 If the index supports <link linkend="indexes-index-only-scans">index-only
711 scans</link> (i.e., <function>amcanreturn</function> returns true for any
712 of its columns),
713 then on success the AM must also check <literal>scan-&gt;xs_want_itup</literal>,
714 and if that is true it must return the originally indexed data for the
715 index entry. Columns for which <function>amcanreturn</function> returns
716 false can be returned as nulls.
717 The data can be returned in the form of an
718 <structname>IndexTuple</structname> pointer stored at <literal>scan-&gt;xs_itup</literal>,
719 with tuple descriptor <literal>scan-&gt;xs_itupdesc</literal>; or in the form of
720 a <structname>HeapTuple</structname> pointer stored at <literal>scan-&gt;xs_hitup</literal>,
721 with tuple descriptor <literal>scan-&gt;xs_hitupdesc</literal>. (The latter
722 format should be used when reconstructing data that might possibly not fit
723 into an <structname>IndexTuple</structname>.) In either case,
724 management of the data referenced by the pointer is the access method's
725 responsibility. The data must remain good at least until the next
726 <function>amgettuple</function>, <function>amrescan</function>, or <function>amendscan</function>
727 call for the scan.
728 </para>
730 <para>
731 The <function>amgettuple</function> function need only be provided if the access
732 method supports <quote>plain</quote> index scans. If it doesn't, the
733 <structfield>amgettuple</structfield> field in its <structname>IndexAmRoutine</structname>
734 struct must be set to NULL.
735 </para>
737 <para>
738 <programlisting>
739 int64
740 amgetbitmap (IndexScanDesc scan,
741 TIDBitmap *tbm);
742 </programlisting>
743 Fetch all tuples in the given scan and add them to the caller-supplied
744 <type>TIDBitmap</type> (that is, OR the set of tuple IDs into whatever set is already
745 in the bitmap). The number of tuples fetched is returned (this might be
746 just an approximate count, for instance some AMs do not detect duplicates).
747 While inserting tuple IDs into the bitmap, <function>amgetbitmap</function> can
748 indicate that rechecking of the scan conditions is required for specific
749 tuple IDs. This is analogous to the <literal>xs_recheck</literal> output parameter
750 of <function>amgettuple</function>. Note: in the current implementation, support
751 for this feature is conflated with support for lossy storage of the bitmap
752 itself, and therefore callers recheck both the scan conditions and the
753 partial index predicate (if any) for recheckable tuples. That might not
754 always be true, however.
755 <function>amgetbitmap</function> and
756 <function>amgettuple</function> cannot be used in the same index scan; there
757 are other restrictions too when using <function>amgetbitmap</function>, as explained
758 in <xref linkend="index-scanning"/>.
759 </para>
761 <para>
762 The <function>amgetbitmap</function> function need only be provided if the access
763 method supports <quote>bitmap</quote> index scans. If it doesn't, the
764 <structfield>amgetbitmap</structfield> field in its <structname>IndexAmRoutine</structname>
765 struct must be set to NULL.
766 </para>
768 <para>
769 <programlisting>
770 void
771 amendscan (IndexScanDesc scan);
772 </programlisting>
773 End a scan and release resources. The <literal>scan</literal> struct itself
774 should not be freed, but any locks or pins taken internally by the
775 access method must be released, as well as any other memory allocated
776 by <function>ambeginscan</function> and other scan-related functions.
777 </para>
779 <para>
780 <programlisting>
781 void
782 ammarkpos (IndexScanDesc scan);
783 </programlisting>
784 Mark current scan position. The access method need only support one
785 remembered scan position per scan.
786 </para>
788 <para>
789 The <function>ammarkpos</function> function need only be provided if the access
790 method supports ordered scans. If it doesn't,
791 the <structfield>ammarkpos</structfield> field in its <structname>IndexAmRoutine</structname>
792 struct may be set to NULL.
793 </para>
795 <para>
796 <programlisting>
797 void
798 amrestrpos (IndexScanDesc scan);
799 </programlisting>
800 Restore the scan to the most recently marked position.
801 </para>
803 <para>
804 The <function>amrestrpos</function> function need only be provided if the access
805 method supports ordered scans. If it doesn't,
806 the <structfield>amrestrpos</structfield> field in its <structname>IndexAmRoutine</structname>
807 struct may be set to NULL.
808 </para>
810 <para>
811 In addition to supporting ordinary index scans, some types of index
812 may wish to support <firstterm>parallel index scans</firstterm>, which allow
813 multiple backends to cooperate in performing an index scan. The
814 index access method should arrange things so that each cooperating
815 process returns a subset of the tuples that would be performed by
816 an ordinary, non-parallel index scan, but in such a way that the
817 union of those subsets is equal to the set of tuples that would be
818 returned by an ordinary, non-parallel index scan. Furthermore, while
819 there need not be any global ordering of tuples returned by a parallel
820 scan, the ordering of that subset of tuples returned within each
821 cooperating backend must match the requested ordering. The following
822 functions may be implemented to support parallel index scans:
823 </para>
825 <para>
826 <programlisting>
827 Size
828 amestimateparallelscan (int nkeys,
829 int norderbys);
830 </programlisting>
831 Estimate and return the number of bytes of dynamic shared memory which
832 the access method will be needed to perform a parallel scan. (This number
833 is in addition to, not in lieu of, the amount of space needed for
834 AM-independent data in <structname>ParallelIndexScanDescData</structname>.)
835 </para>
837 <para>
838 The <literal>nkeys</literal> and <literal>norderbys</literal>
839 parameters indicate the number of quals and ordering operators that will be
840 used in the scan; the same values will be passed to <function>amrescan</function>.
841 Note that the actual values of the scan keys aren't provided yet.
842 </para>
844 <para>
845 It is not necessary to implement this function for access methods which
846 do not support parallel scans or for which the number of additional bytes
847 of storage required is zero.
848 </para>
850 <para>
851 <programlisting>
852 void
853 aminitparallelscan (void *target);
854 </programlisting>
855 This function will be called to initialize dynamic shared memory at the
856 beginning of a parallel scan. <parameter>target</parameter> will point to at least
857 the number of bytes previously returned by
858 <function>amestimateparallelscan</function>, and this function may use that
859 amount of space to store whatever data it wishes.
860 </para>
862 <para>
863 It is not necessary to implement this function for access methods which
864 do not support parallel scans or in cases where the shared memory space
865 required needs no initialization.
866 </para>
868 <para>
869 <programlisting>
870 void
871 amparallelrescan (IndexScanDesc scan);
872 </programlisting>
873 This function, if implemented, will be called when a parallel index scan
874 must be restarted. It should reset any shared state set up by
875 <function>aminitparallelscan</function> such that the scan will be restarted from
876 the beginning.
877 </para>
879 </sect1>
881 <sect1 id="index-scanning">
882 <title>Index Scanning</title>
884 <para>
885 In an index scan, the index access method is responsible for regurgitating
886 the TIDs of all the tuples it has been told about that match the
887 <firstterm>scan keys</firstterm>. The access method is <emphasis>not</emphasis> involved in
888 actually fetching those tuples from the index's parent table, nor in
889 determining whether they pass the scan's visibility test or other
890 conditions.
891 </para>
893 <para>
894 A scan key is the internal representation of a <literal>WHERE</literal> clause of
895 the form <replaceable>index_key</replaceable> <replaceable>operator</replaceable>
896 <replaceable>constant</replaceable>, where the index key is one of the columns of the
897 index and the operator is one of the members of the operator family
898 associated with that index column. An index scan has zero or more scan
899 keys, which are implicitly ANDed &mdash; the returned tuples are expected
900 to satisfy all the indicated conditions.
901 </para>
903 <para>
904 The access method can report that the index is <firstterm>lossy</firstterm>, or
905 requires rechecks, for a particular query. This implies that the index
906 scan will return all the entries that pass the scan key, plus possibly
907 additional entries that do not. The core system's index-scan machinery
908 will then apply the index conditions again to the heap tuple to verify
909 whether or not it really should be selected. If the recheck option is not
910 specified, the index scan must return exactly the set of matching entries.
911 </para>
913 <para>
914 Note that it is entirely up to the access method to ensure that it
915 correctly finds all and only the entries passing all the given scan keys.
916 Also, the core system will simply hand off all the <literal>WHERE</literal>
917 clauses that match the index keys and operator families, without any
918 semantic analysis to determine whether they are redundant or
919 contradictory. As an example, given
920 <literal>WHERE x &gt; 4 AND x &gt; 14</literal> where <literal>x</literal> is a b-tree
921 indexed column, it is left to the b-tree <function>amrescan</function> function
922 to realize that the first scan key is redundant and can be discarded.
923 The extent of preprocessing needed during <function>amrescan</function> will
924 depend on the extent to which the index access method needs to reduce
925 the scan keys to a <quote>normalized</quote> form.
926 </para>
928 <para>
929 Some access methods return index entries in a well-defined order, others
930 do not. There are actually two different ways that an access method can
931 support sorted output:
933 <itemizedlist>
934 <listitem>
935 <para>
936 Access methods that always return entries in the natural ordering
937 of their data (such as btree) should set
938 <structfield>amcanorder</structfield> to true.
939 Currently, such access methods must use btree-compatible strategy
940 numbers for their equality and ordering operators.
941 </para>
942 </listitem>
943 <listitem>
944 <para>
945 Access methods that support ordering operators should set
946 <structfield>amcanorderbyop</structfield> to true.
947 This indicates that the index is capable of returning entries in
948 an order satisfying <literal>ORDER BY</literal> <replaceable>index_key</replaceable>
949 <replaceable>operator</replaceable> <replaceable>constant</replaceable>. Scan modifiers
950 of that form can be passed to <function>amrescan</function> as described
951 previously.
952 </para>
953 </listitem>
954 </itemizedlist>
955 </para>
957 <para>
958 The <function>amgettuple</function> function has a <literal>direction</literal> argument,
959 which can be either <literal>ForwardScanDirection</literal> (the normal case)
960 or <literal>BackwardScanDirection</literal>. If the first call after
961 <function>amrescan</function> specifies <literal>BackwardScanDirection</literal>, then the
962 set of matching index entries is to be scanned back-to-front rather than in
963 the normal front-to-back direction, so <function>amgettuple</function> must return
964 the last matching tuple in the index, rather than the first one as it
965 normally would. (This will only occur for access
966 methods that set <structfield>amcanorder</structfield> to true.) After the
967 first call, <function>amgettuple</function> must be prepared to advance the scan in
968 either direction from the most recently returned entry. (But if
969 <structfield>amcanbackward</structfield> is false, all subsequent
970 calls will have the same direction as the first one.)
971 </para>
973 <para>
974 Access methods that support ordered scans must support <quote>marking</quote> a
975 position in a scan and later returning to the marked position. The same
976 position might be restored multiple times. However, only one position need
977 be remembered per scan; a new <function>ammarkpos</function> call overrides the
978 previously marked position. An access method that does not support ordered
979 scans need not provide <function>ammarkpos</function> and <function>amrestrpos</function>
980 functions in <structname>IndexAmRoutine</structname>; set those pointers to NULL
981 instead.
982 </para>
984 <para>
985 Both the scan position and the mark position (if any) must be maintained
986 consistently in the face of concurrent insertions or deletions in the
987 index. It is OK if a freshly-inserted entry is not returned by a scan that
988 would have found the entry if it had existed when the scan started, or for
989 the scan to return such an entry upon rescanning or backing
990 up even though it had not been returned the first time through. Similarly,
991 a concurrent delete might or might not be reflected in the results of a scan.
992 What is important is that insertions or deletions not cause the scan to
993 miss or multiply return entries that were not themselves being inserted or
994 deleted.
995 </para>
997 <para>
998 If the index stores the original indexed data values (and not some lossy
999 representation of them), it is useful to
1000 support <link linkend="indexes-index-only-scans">index-only scans</link>, in
1001 which the index returns the actual data not just the TID of the heap tuple.
1002 This will only avoid I/O if the visibility map shows that the TID is on an
1003 all-visible page; else the heap tuple must be visited anyway to check
1004 MVCC visibility. But that is no concern of the access method's.
1005 </para>
1007 <para>
1008 Instead of using <function>amgettuple</function>, an index scan can be done with
1009 <function>amgetbitmap</function> to fetch all tuples in one call. This can be
1010 noticeably more efficient than <function>amgettuple</function> because it allows
1011 avoiding lock/unlock cycles within the access method. In principle
1012 <function>amgetbitmap</function> should have the same effects as repeated
1013 <function>amgettuple</function> calls, but we impose several restrictions to
1014 simplify matters. First of all, <function>amgetbitmap</function> returns all
1015 tuples at once and marking or restoring scan positions isn't
1016 supported. Secondly, the tuples are returned in a bitmap which doesn't
1017 have any specific ordering, which is why <function>amgetbitmap</function> doesn't
1018 take a <literal>direction</literal> argument. (Ordering operators will never be
1019 supplied for such a scan, either.)
1020 Also, there is no provision for index-only scans with
1021 <function>amgetbitmap</function>, since there is no way to return the contents of
1022 index tuples.
1023 Finally, <function>amgetbitmap</function>
1024 does not guarantee any locking of the returned tuples, with implications
1025 spelled out in <xref linkend="index-locking"/>.
1026 </para>
1028 <para>
1029 Note that it is permitted for an access method to implement only
1030 <function>amgetbitmap</function> and not <function>amgettuple</function>, or vice versa,
1031 if its internal implementation is unsuited to one API or the other.
1032 </para>
1034 </sect1>
1036 <sect1 id="index-locking">
1037 <title>Index Locking Considerations</title>
1039 <para>
1040 Index access methods must handle concurrent updates
1041 of the index by multiple processes.
1042 The core <productname>PostgreSQL</productname> system obtains
1043 <literal>AccessShareLock</literal> on the index during an index scan, and
1044 <literal>RowExclusiveLock</literal> when updating the index (including plain
1045 <command>VACUUM</command>). Since these lock types do not conflict, the access
1046 method is responsible for handling any fine-grained locking it might need.
1047 An <literal>ACCESS EXCLUSIVE</literal> lock on the index as a whole will be
1048 taken only during index creation, destruction, or <command>REINDEX</command>
1049 (<literal>SHARE UPDATE EXCLUSIVE</literal> is taken instead with
1050 <literal>CONCURRENTLY</literal>).
1051 </para>
1053 <para>
1054 Building an index type that supports concurrent updates usually requires
1055 extensive and subtle analysis of the required behavior. For the b-tree
1056 and hash index types, you can read about the design decisions involved in
1057 <filename>src/backend/access/nbtree/README</filename> and
1058 <filename>src/backend/access/hash/README</filename>.
1059 </para>
1061 <para>
1062 Aside from the index's own internal consistency requirements, concurrent
1063 updates create issues about consistency between the parent table (the
1064 <firstterm>heap</firstterm>) and the index. Because
1065 <productname>PostgreSQL</productname> separates accesses
1066 and updates of the heap from those of the index, there are windows in
1067 which the index might be inconsistent with the heap. We handle this problem
1068 with the following rules:
1070 <itemizedlist>
1071 <listitem>
1072 <para>
1073 A new heap entry is made before making its index entries. (Therefore
1074 a concurrent index scan is likely to fail to see the heap entry.
1075 This is okay because the index reader would be uninterested in an
1076 uncommitted row anyway. But see <xref linkend="index-unique-checks"/>.)
1077 </para>
1078 </listitem>
1079 <listitem>
1080 <para>
1081 When a heap entry is to be deleted (by <command>VACUUM</command>), all its
1082 index entries must be removed first.
1083 </para>
1084 </listitem>
1085 <listitem>
1086 <para>
1087 An index scan must maintain a pin
1088 on the index page holding the item last returned by
1089 <function>amgettuple</function>, and <function>ambulkdelete</function> cannot delete
1090 entries from pages that are pinned by other backends. The need
1091 for this rule is explained below.
1092 </para>
1093 </listitem>
1094 </itemizedlist>
1096 Without the third rule, it is possible for an index reader to
1097 see an index entry just before it is removed by <command>VACUUM</command>, and
1098 then to arrive at the corresponding heap entry after that was removed by
1099 <command>VACUUM</command>.
1100 This creates no serious problems if that item
1101 number is still unused when the reader reaches it, since an empty
1102 item slot will be ignored by <function>heap_fetch()</function>. But what if a
1103 third backend has already re-used the item slot for something else?
1104 When using an MVCC-compliant snapshot, there is no problem because
1105 the new occupant of the slot is certain to be too new to pass the
1106 snapshot test. However, with a non-MVCC-compliant snapshot (such as
1107 <literal>SnapshotAny</literal>), it would be possible to accept and return
1108 a row that does not in fact match the scan keys. We could defend
1109 against this scenario by requiring the scan keys to be rechecked
1110 against the heap row in all cases, but that is too expensive. Instead,
1111 we use a pin on an index page as a proxy to indicate that the reader
1112 might still be <quote>in flight</quote> from the index entry to the matching
1113 heap entry. Making <function>ambulkdelete</function> block on such a pin ensures
1114 that <command>VACUUM</command> cannot delete the heap entry before the reader
1115 is done with it. This solution costs little in run time, and adds blocking
1116 overhead only in the rare cases where there actually is a conflict.
1117 </para>
1119 <para>
1120 This solution requires that index scans be <quote>synchronous</quote>: we have
1121 to fetch each heap tuple immediately after scanning the corresponding index
1122 entry. This is expensive for a number of reasons. An
1123 <quote>asynchronous</quote> scan in which we collect many TIDs from the index,
1124 and only visit the heap tuples sometime later, requires much less index
1125 locking overhead and can allow a more efficient heap access pattern.
1126 Per the above analysis, we must use the synchronous approach for
1127 non-MVCC-compliant snapshots, but an asynchronous scan is workable
1128 for a query using an MVCC snapshot.
1129 </para>
1131 <para>
1132 In an <function>amgetbitmap</function> index scan, the access method does not
1133 keep an index pin on any of the returned tuples. Therefore
1134 it is only safe to use such scans with MVCC-compliant snapshots.
1135 </para>
1137 <para>
1138 When the <structfield>ampredlocks</structfield> flag is not set, any scan using that
1139 index access method within a serializable transaction will acquire a
1140 nonblocking predicate lock on the full index. This will generate a
1141 read-write conflict with the insert of any tuple into that index by a
1142 concurrent serializable transaction. If certain patterns of read-write
1143 conflicts are detected among a set of concurrent serializable
1144 transactions, one of those transactions may be canceled to protect data
1145 integrity. When the flag is set, it indicates that the index access
1146 method implements finer-grained predicate locking, which will tend to
1147 reduce the frequency of such transaction cancellations.
1148 </para>
1150 </sect1>
1152 <sect1 id="index-unique-checks">
1153 <title>Index Uniqueness Checks</title>
1155 <para>
1156 <productname>PostgreSQL</productname> enforces SQL uniqueness constraints
1157 using <firstterm>unique indexes</firstterm>, which are indexes that disallow
1158 multiple entries with identical keys. An access method that supports this
1159 feature sets <structfield>amcanunique</structfield> true.
1160 (At present, only b-tree supports it.) Columns listed in the
1161 <literal>INCLUDE</literal> clause are not considered when enforcing
1162 uniqueness.
1163 </para>
1165 <para>
1166 Because of MVCC, it is always necessary to allow duplicate entries to
1167 exist physically in an index: the entries might refer to successive
1168 versions of a single logical row. The behavior we actually want to
1169 enforce is that no MVCC snapshot could include two rows with equal
1170 index keys. This breaks down into the following cases that must be
1171 checked when inserting a new row into a unique index:
1173 <itemizedlist>
1174 <listitem>
1175 <para>
1176 If a conflicting valid row has been deleted by the current transaction,
1177 it's okay. (In particular, since an UPDATE always deletes the old row
1178 version before inserting the new version, this will allow an UPDATE on
1179 a row without changing the key.)
1180 </para>
1181 </listitem>
1182 <listitem>
1183 <para>
1184 If a conflicting row has been inserted by an as-yet-uncommitted
1185 transaction, the would-be inserter must wait to see if that transaction
1186 commits. If it rolls back then there is no conflict. If it commits
1187 without deleting the conflicting row again, there is a uniqueness
1188 violation. (In practice we just wait for the other transaction to
1189 end and then redo the visibility check in toto.)
1190 </para>
1191 </listitem>
1192 <listitem>
1193 <para>
1194 Similarly, if a conflicting valid row has been deleted by an
1195 as-yet-uncommitted transaction, the would-be inserter must wait
1196 for that transaction to commit or abort, and then repeat the test.
1197 </para>
1198 </listitem>
1199 </itemizedlist>
1200 </para>
1202 <para>
1203 Furthermore, immediately before reporting a uniqueness violation
1204 according to the above rules, the access method must recheck the
1205 liveness of the row being inserted. If it is committed dead then
1206 no violation should be reported. (This case cannot occur during the
1207 ordinary scenario of inserting a row that's just been created by
1208 the current transaction. It can happen during
1209 <command>CREATE UNIQUE INDEX CONCURRENTLY</command>, however.)
1210 </para>
1212 <para>
1213 We require the index access method to apply these tests itself, which
1214 means that it must reach into the heap to check the commit status of
1215 any row that is shown to have a duplicate key according to the index
1216 contents. This is without a doubt ugly and non-modular, but it saves
1217 redundant work: if we did a separate probe then the index lookup for
1218 a conflicting row would be essentially repeated while finding the place to
1219 insert the new row's index entry. What's more, there is no obvious way
1220 to avoid race conditions unless the conflict check is an integral part
1221 of insertion of the new index entry.
1222 </para>
1224 <para>
1225 If the unique constraint is deferrable, there is additional complexity:
1226 we need to be able to insert an index entry for a new row, but defer any
1227 uniqueness-violation error until end of statement or even later. To
1228 avoid unnecessary repeat searches of the index, the index access method
1229 should do a preliminary uniqueness check during the initial insertion.
1230 If this shows that there is definitely no conflicting live tuple, we
1231 are done. Otherwise, we schedule a recheck to occur when it is time to
1232 enforce the constraint. If, at the time of the recheck, both the inserted
1233 tuple and some other tuple with the same key are live, then the error
1234 must be reported. (Note that for this purpose, <quote>live</quote> actually
1235 means <quote>any tuple in the index entry's HOT chain is live</quote>.)
1236 To implement this, the <function>aminsert</function> function is passed a
1237 <literal>checkUnique</literal> parameter having one of the following values:
1239 <itemizedlist>
1240 <listitem>
1241 <para>
1242 <literal>UNIQUE_CHECK_NO</literal> indicates that no uniqueness checking
1243 should be done (this is not a unique index).
1244 </para>
1245 </listitem>
1246 <listitem>
1247 <para>
1248 <literal>UNIQUE_CHECK_YES</literal> indicates that this is a non-deferrable
1249 unique index, and the uniqueness check must be done immediately, as
1250 described above.
1251 </para>
1252 </listitem>
1253 <listitem>
1254 <para>
1255 <literal>UNIQUE_CHECK_PARTIAL</literal> indicates that the unique
1256 constraint is deferrable. <productname>PostgreSQL</productname>
1257 will use this mode to insert each row's index entry. The access
1258 method must allow duplicate entries into the index, and report any
1259 potential duplicates by returning false from <function>aminsert</function>.
1260 For each row for which false is returned, a deferred recheck will
1261 be scheduled.
1262 </para>
1264 <para>
1265 The access method must identify any rows which might violate the
1266 unique constraint, but it is not an error for it to report false
1267 positives. This allows the check to be done without waiting for other
1268 transactions to finish; conflicts reported here are not treated as
1269 errors and will be rechecked later, by which time they may no longer
1270 be conflicts.
1271 </para>
1272 </listitem>
1273 <listitem>
1274 <para>
1275 <literal>UNIQUE_CHECK_EXISTING</literal> indicates that this is a deferred
1276 recheck of a row that was reported as a potential uniqueness violation.
1277 Although this is implemented by calling <function>aminsert</function>, the
1278 access method must <emphasis>not</emphasis> insert a new index entry in this
1279 case. The index entry is already present. Rather, the access method
1280 must check to see if there is another live index entry. If so, and
1281 if the target row is also still live, report error.
1282 </para>
1284 <para>
1285 It is recommended that in a <literal>UNIQUE_CHECK_EXISTING</literal> call,
1286 the access method further verify that the target row actually does
1287 have an existing entry in the index, and report error if not. This
1288 is a good idea because the index tuple values passed to
1289 <function>aminsert</function> will have been recomputed. If the index
1290 definition involves functions that are not really immutable, we
1291 might be checking the wrong area of the index. Checking that the
1292 target row is found in the recheck verifies that we are scanning
1293 for the same tuple values as were used in the original insertion.
1294 </para>
1295 </listitem>
1296 </itemizedlist>
1297 </para>
1299 </sect1>
1301 <sect1 id="index-cost-estimation">
1302 <title>Index Cost Estimation Functions</title>
1304 <para>
1305 The <function>amcostestimate</function> function is given information describing
1306 a possible index scan, including lists of WHERE and ORDER BY clauses that
1307 have been determined to be usable with the index. It must return estimates
1308 of the cost of accessing the index and the selectivity of the WHERE
1309 clauses (that is, the fraction of parent-table rows that will be
1310 retrieved during the index scan). For simple cases, nearly all the
1311 work of the cost estimator can be done by calling standard routines
1312 in the optimizer; the point of having an <function>amcostestimate</function> function is
1313 to allow index access methods to provide index-type-specific knowledge,
1314 in case it is possible to improve on the standard estimates.
1315 </para>
1317 <para>
1318 Each <function>amcostestimate</function> function must have the signature:
1320 <programlisting>
1321 void
1322 amcostestimate (PlannerInfo *root,
1323 IndexPath *path,
1324 double loop_count,
1325 Cost *indexStartupCost,
1326 Cost *indexTotalCost,
1327 Selectivity *indexSelectivity,
1328 double *indexCorrelation,
1329 double *indexPages);
1330 </programlisting>
1332 The first three parameters are inputs:
1334 <variablelist>
1335 <varlistentry>
1336 <term><parameter>root</parameter></term>
1337 <listitem>
1338 <para>
1339 The planner's information about the query being processed.
1340 </para>
1341 </listitem>
1342 </varlistentry>
1344 <varlistentry>
1345 <term><parameter>path</parameter></term>
1346 <listitem>
1347 <para>
1348 The index access path being considered. All fields except cost and
1349 selectivity values are valid.
1350 </para>
1351 </listitem>
1352 </varlistentry>
1354 <varlistentry>
1355 <term><parameter>loop_count</parameter></term>
1356 <listitem>
1357 <para>
1358 The number of repetitions of the index scan that should be factored
1359 into the cost estimates. This will typically be greater than one when
1360 considering a parameterized scan for use in the inside of a nestloop
1361 join. Note that the cost estimates should still be for just one scan;
1362 a larger <parameter>loop_count</parameter> means that it may be appropriate
1363 to allow for some caching effects across multiple scans.
1364 </para>
1365 </listitem>
1366 </varlistentry>
1367 </variablelist>
1368 </para>
1370 <para>
1371 The last five parameters are pass-by-reference outputs:
1373 <variablelist>
1374 <varlistentry>
1375 <term><parameter>*indexStartupCost</parameter></term>
1376 <listitem>
1377 <para>
1378 Set to cost of index start-up processing
1379 </para>
1380 </listitem>
1381 </varlistentry>
1383 <varlistentry>
1384 <term><parameter>*indexTotalCost</parameter></term>
1385 <listitem>
1386 <para>
1387 Set to total cost of index processing
1388 </para>
1389 </listitem>
1390 </varlistentry>
1392 <varlistentry>
1393 <term><parameter>*indexSelectivity</parameter></term>
1394 <listitem>
1395 <para>
1396 Set to index selectivity
1397 </para>
1398 </listitem>
1399 </varlistentry>
1401 <varlistentry>
1402 <term><parameter>*indexCorrelation</parameter></term>
1403 <listitem>
1404 <para>
1405 Set to correlation coefficient between index scan order and
1406 underlying table's order
1407 </para>
1408 </listitem>
1409 </varlistentry>
1411 <varlistentry>
1412 <term><parameter>*indexPages</parameter></term>
1413 <listitem>
1414 <para>
1415 Set to number of index leaf pages
1416 </para>
1417 </listitem>
1418 </varlistentry>
1419 </variablelist>
1420 </para>
1422 <para>
1423 Note that cost estimate functions must be written in C, not in SQL or
1424 any available procedural language, because they must access internal
1425 data structures of the planner/optimizer.
1426 </para>
1428 <para>
1429 The index access costs should be computed using the parameters used by
1430 <filename>src/backend/optimizer/path/costsize.c</filename>: a sequential
1431 disk block fetch has cost <varname>seq_page_cost</varname>, a nonsequential fetch
1432 has cost <varname>random_page_cost</varname>, and the cost of processing one index
1433 row should usually be taken as <varname>cpu_index_tuple_cost</varname>. In
1434 addition, an appropriate multiple of <varname>cpu_operator_cost</varname> should
1435 be charged for any comparison operators invoked during index processing
1436 (especially evaluation of the indexquals themselves).
1437 </para>
1439 <para>
1440 The access costs should include all disk and CPU costs associated with
1441 scanning the index itself, but <emphasis>not</emphasis> the costs of retrieving or
1442 processing the parent-table rows that are identified by the index.
1443 </para>
1445 <para>
1446 The <quote>start-up cost</quote> is the part of the total scan cost that
1447 must be expended before we can begin to fetch the first row. For most
1448 indexes this can be taken as zero, but an index type with a high start-up
1449 cost might want to set it nonzero.
1450 </para>
1452 <para>
1453 The <parameter>indexSelectivity</parameter> should be set to the estimated fraction of the parent
1454 table rows that will be retrieved during the index scan. In the case
1455 of a lossy query, this will typically be higher than the fraction of
1456 rows that actually pass the given qual conditions.
1457 </para>
1459 <para>
1460 The <parameter>indexCorrelation</parameter> should be set to the correlation (ranging between
1461 -1.0 and 1.0) between the index order and the table order. This is used
1462 to adjust the estimate for the cost of fetching rows from the parent
1463 table.
1464 </para>
1466 <para>
1467 The <parameter>indexPages</parameter> should be set to the number of leaf pages.
1468 This is used to estimate the number of workers for parallel index scan.
1469 </para>
1471 <para>
1472 When <parameter>loop_count</parameter> is greater than one, the returned numbers
1473 should be averages expected for any one scan of the index.
1474 </para>
1476 <procedure>
1477 <title>Cost Estimation</title>
1478 <para>
1479 A typical cost estimator will proceed as follows:
1480 </para>
1482 <step>
1483 <para>
1484 Estimate and return the fraction of parent-table rows that will be visited
1485 based on the given qual conditions. In the absence of any index-type-specific
1486 knowledge, use the standard optimizer function <function>clauselist_selectivity()</function>:
1488 <programlisting>
1489 *indexSelectivity = clauselist_selectivity(root, path-&gt;indexquals,
1490 path-&gt;indexinfo-&gt;rel-&gt;relid,
1491 JOIN_INNER, NULL);
1492 </programlisting>
1493 </para>
1494 </step>
1496 <step>
1497 <para>
1498 Estimate the number of index rows that will be visited during the
1499 scan. For many index types this is the same as <parameter>indexSelectivity</parameter> times
1500 the number of rows in the index, but it might be more. (Note that the
1501 index's size in pages and rows is available from the
1502 <literal>path-&gt;indexinfo</literal> struct.)
1503 </para>
1504 </step>
1506 <step>
1507 <para>
1508 Estimate the number of index pages that will be retrieved during the scan.
1509 This might be just <parameter>indexSelectivity</parameter> times the index's size in pages.
1510 </para>
1511 </step>
1513 <step>
1514 <para>
1515 Compute the index access cost. A generic estimator might do this:
1517 <programlisting>
1519 * Our generic assumption is that the index pages will be read
1520 * sequentially, so they cost seq_page_cost each, not random_page_cost.
1521 * Also, we charge for evaluation of the indexquals at each index row.
1522 * All the costs are assumed to be paid incrementally during the scan.
1524 cost_qual_eval(&amp;index_qual_cost, path-&gt;indexquals, root);
1525 *indexStartupCost = index_qual_cost.startup;
1526 *indexTotalCost = seq_page_cost * numIndexPages +
1527 (cpu_index_tuple_cost + index_qual_cost.per_tuple) * numIndexTuples;
1528 </programlisting>
1530 However, the above does not account for amortization of index reads
1531 across repeated index scans.
1532 </para>
1533 </step>
1535 <step>
1536 <para>
1537 Estimate the index correlation. For a simple ordered index on a single
1538 field, this can be retrieved from pg_statistic. If the correlation
1539 is not known, the conservative estimate is zero (no correlation).
1540 </para>
1541 </step>
1542 </procedure>
1544 <para>
1545 Examples of cost estimator functions can be found in
1546 <filename>src/backend/utils/adt/selfuncs.c</filename>.
1547 </para>
1548 </sect1>
1549 </chapter>