Pass down "logically unchanged index" hint.
[pgsql.git] / doc / src / sgml / indexam.sgml
blobec5741df6d16414691565a93a959978e4b14b04b
1 <!-- doc/src/sgml/indexam.sgml -->
3 <chapter id="indexam">
4 <title>Index Access Method Interface Definition</title>
6 <indexterm>
7 <primary>Index Access Method</primary>
8 </indexterm>
9 <indexterm>
10 <primary>indexam</primary>
11 <secondary>Index Access Method</secondary>
12 </indexterm>
14 <para>
15 This chapter defines the interface between the core
16 <productname>PostgreSQL</productname> system and <firstterm>index access
17 methods</firstterm>, which manage individual index types. The core system
18 knows nothing about indexes beyond what is specified here, so it is
19 possible to develop entirely new index types by writing add-on code.
20 </para>
22 <para>
23 All indexes in <productname>PostgreSQL</productname> are what are known
24 technically as <firstterm>secondary indexes</firstterm>; that is, the index is
25 physically separate from the table file that it describes. Each index
26 is stored as its own physical <firstterm>relation</firstterm> and so is described
27 by an entry in the <structname>pg_class</structname> catalog. The contents of an
28 index are entirely under the control of its index access method. In
29 practice, all index access methods divide indexes into standard-size
30 pages so that they can use the regular storage manager and buffer manager
31 to access the index contents. (All the existing index access methods
32 furthermore use the standard page layout described in <xref
33 linkend="storage-page-layout"/>, and most use the same format for index
34 tuple headers; but these decisions are not forced on an access method.)
35 </para>
37 <para>
38 An index is effectively a mapping from some data key values to
39 <firstterm>tuple identifiers</firstterm>, or <acronym>TIDs</acronym>, of row versions
40 (tuples) in the index's parent table. A TID consists of a
41 block number and an item number within that block (see <xref
42 linkend="storage-page-layout"/>). This is sufficient
43 information to fetch a particular row version from the table.
44 Indexes are not directly aware that under MVCC, there might be multiple
45 extant versions of the same logical row; to an index, each tuple is
46 an independent object that needs its own index entry. Thus, an
47 update of a row always creates all-new index entries for the row, even if
48 the key values did not change. (HOT tuples are an exception to this
49 statement; but indexes do not deal with those, either.) Index entries for
50 dead tuples are reclaimed (by vacuuming) when the dead tuples themselves
51 are reclaimed.
52 </para>
54 <sect1 id="index-api">
55 <title>Basic API Structure for Indexes</title>
57 <para>
58 Each index access method is described by a row in the
59 <link linkend="catalog-pg-am"><structname>pg_am</structname></link>
60 system catalog. The <structname>pg_am</structname> entry
61 specifies a name and a <firstterm>handler function</firstterm> for the index
62 access method. These entries can be created and deleted using the
63 <xref linkend="sql-create-access-method"/> and
64 <xref linkend="sql-drop-access-method"/> SQL commands.
65 </para>
67 <para>
68 An index access method handler function must be declared to accept a
69 single argument of type <type>internal</type> and to return the
70 pseudo-type <type>index_am_handler</type>. The argument is a dummy value that
71 simply serves to prevent handler functions from being called directly from
72 SQL commands. The result of the function must be a palloc'd struct of
73 type <structname>IndexAmRoutine</structname>, which contains everything
74 that the core code needs to know to make use of the index access method.
75 The <structname>IndexAmRoutine</structname> struct, also called the access
76 method's <firstterm>API struct</firstterm>, includes fields specifying assorted
77 fixed properties of the access method, such as whether it can support
78 multicolumn indexes. More importantly, it contains pointers to support
79 functions for the access method, which do all of the real work to access
80 indexes. These support functions are plain C functions and are not
81 visible or callable at the SQL level. The support functions are described
82 in <xref linkend="index-functions"/>.
83 </para>
85 <para>
86 The structure <structname>IndexAmRoutine</structname> is defined thus:
87 <programlisting>
88 typedef struct IndexAmRoutine
90 NodeTag type;
93 * Total number of strategies (operators) by which we can traverse/search
94 * this AM. Zero if AM does not have a fixed set of strategy assignments.
96 uint16 amstrategies;
97 /* total number of support functions that this AM uses */
98 uint16 amsupport;
99 /* opclass options support function number or 0 */
100 uint16 amoptsprocnum;
101 /* does AM support ORDER BY indexed column's value? */
102 bool amcanorder;
103 /* does AM support ORDER BY result of an operator on indexed column? */
104 bool amcanorderbyop;
105 /* does AM support backward scanning? */
106 bool amcanbackward;
107 /* does AM support UNIQUE indexes? */
108 bool amcanunique;
109 /* does AM support multi-column indexes? */
110 bool amcanmulticol;
111 /* does AM require scans to have a constraint on the first index column? */
112 bool amoptionalkey;
113 /* does AM handle ScalarArrayOpExpr quals? */
114 bool amsearcharray;
115 /* does AM handle IS NULL/IS NOT NULL quals? */
116 bool amsearchnulls;
117 /* can index storage data type differ from column data type? */
118 bool amstorage;
119 /* can an index of this type be clustered on? */
120 bool amclusterable;
121 /* does AM handle predicate locks? */
122 bool ampredlocks;
123 /* does AM support parallel scan? */
124 bool amcanparallel;
125 /* does AM support columns included with clause INCLUDE? */
126 bool amcaninclude;
127 /* does AM use maintenance_work_mem? */
128 bool amusemaintenanceworkmem;
129 /* OR of parallel vacuum flags */
130 uint8 amparallelvacuumoptions;
131 /* type of data stored in index, or InvalidOid if variable */
132 Oid amkeytype;
134 /* interface functions */
135 ambuild_function ambuild;
136 ambuildempty_function ambuildempty;
137 aminsert_function aminsert;
138 ambulkdelete_function ambulkdelete;
139 amvacuumcleanup_function amvacuumcleanup;
140 amcanreturn_function amcanreturn; /* can be NULL */
141 amcostestimate_function amcostestimate;
142 amoptions_function amoptions;
143 amproperty_function amproperty; /* can be NULL */
144 ambuildphasename_function ambuildphasename; /* can be NULL */
145 amvalidate_function amvalidate;
146 amadjustmembers_function amadjustmembers; /* can be NULL */
147 ambeginscan_function ambeginscan;
148 amrescan_function amrescan;
149 amgettuple_function amgettuple; /* can be NULL */
150 amgetbitmap_function amgetbitmap; /* can be NULL */
151 amendscan_function amendscan;
152 ammarkpos_function ammarkpos; /* can be NULL */
153 amrestrpos_function amrestrpos; /* can be NULL */
155 /* interface functions to support parallel index scans */
156 amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
157 aminitparallelscan_function aminitparallelscan; /* can be NULL */
158 amparallelrescan_function amparallelrescan; /* can be NULL */
159 } IndexAmRoutine;
160 </programlisting>
161 </para>
163 <para>
164 To be useful, an index access method must also have one or more
165 <firstterm>operator families</firstterm> and
166 <firstterm>operator classes</firstterm> defined in
167 <link linkend="catalog-pg-opfamily"><structname>pg_opfamily</structname></link>,
168 <link linkend="catalog-pg-opclass"><structname>pg_opclass</structname></link>,
169 <link linkend="catalog-pg-amop"><structname>pg_amop</structname></link>, and
170 <link linkend="catalog-pg-amproc"><structname>pg_amproc</structname></link>.
171 These entries allow the planner
172 to determine what kinds of query qualifications can be used with
173 indexes of this access method. Operator families and classes are described
174 in <xref linkend="xindex"/>, which is prerequisite material for reading
175 this chapter.
176 </para>
178 <para>
179 An individual index is defined by a
180 <link linkend="catalog-pg-class"><structname>pg_class</structname></link>
181 entry that describes it as a physical relation, plus a
182 <link linkend="catalog-pg-index"><structname>pg_index</structname></link>
183 entry that shows the logical content of the index &mdash; that is, the set
184 of index columns it has and the semantics of those columns, as captured by
185 the associated operator classes. The index columns (key values) can be
186 either simple columns of the underlying table or expressions over the table
187 rows. The index access method normally has no interest in where the index
188 key values come from (it is always handed precomputed key values) but it
189 will be very interested in the operator class information in
190 <structname>pg_index</structname>. Both of these catalog entries can be
191 accessed as part of the <structname>Relation</structname> data structure that is
192 passed to all operations on the index.
193 </para>
195 <para>
196 Some of the flag fields of <structname>IndexAmRoutine</structname> have nonobvious
197 implications. The requirements of <structfield>amcanunique</structfield>
198 are discussed in <xref linkend="index-unique-checks"/>.
199 The <structfield>amcanmulticol</structfield> flag asserts that the
200 access method supports multi-key-column indexes, while
201 <structfield>amoptionalkey</structfield> asserts that it allows scans
202 where no indexable restriction clause is given for the first index column.
203 When <structfield>amcanmulticol</structfield> is false,
204 <structfield>amoptionalkey</structfield> essentially says whether the
205 access method supports full-index scans without any restriction clause.
206 Access methods that support multiple index columns <emphasis>must</emphasis>
207 support scans that omit restrictions on any or all of the columns after
208 the first; however they are permitted to require some restriction to
209 appear for the first index column, and this is signaled by setting
210 <structfield>amoptionalkey</structfield> false.
211 One reason that an index AM might set
212 <structfield>amoptionalkey</structfield> false is if it doesn't index
213 null values. Since most indexable operators are
214 strict and hence cannot return true for null inputs,
215 it is at first sight attractive to not store index entries for null values:
216 they could never be returned by an index scan anyway. However, this
217 argument fails when an index scan has no restriction clause for a given
218 index column. In practice this means that
219 indexes that have <structfield>amoptionalkey</structfield> true must
220 index nulls, since the planner might decide to use such an index
221 with no scan keys at all. A related restriction is that an index
222 access method that supports multiple index columns <emphasis>must</emphasis>
223 support indexing null values in columns after the first, because the planner
224 will assume the index can be used for queries that do not restrict
225 these columns. For example, consider an index on (a,b) and a query with
226 <literal>WHERE a = 4</literal>. The system will assume the index can be
227 used to scan for rows with <literal>a = 4</literal>, which is wrong if the
228 index omits rows where <literal>b</literal> is null.
229 It is, however, OK to omit rows where the first indexed column is null.
230 An index access method that does index nulls may also set
231 <structfield>amsearchnulls</structfield>, indicating that it supports
232 <literal>IS NULL</literal> and <literal>IS NOT NULL</literal> clauses as search
233 conditions.
234 </para>
236 <para>
237 The <structfield>amcaninclude</structfield> flag indicates whether the
238 access method supports <quote>included</quote> columns, that is it can
239 store (without processing) additional columns beyond the key column(s).
240 The requirements of the preceding paragraph apply only to the key
241 columns. In particular, the combination
242 of <structfield>amcanmulticol</structfield>=<literal>false</literal>
243 and <structfield>amcaninclude</structfield>=<literal>true</literal> is
244 sensible: it means that there can only be one key column, but there can
245 also be included column(s). Also, included columns must be allowed to be
246 null, independently of <structfield>amoptionalkey</structfield>.
247 </para>
249 </sect1>
251 <sect1 id="index-functions">
252 <title>Index Access Method Functions</title>
254 <para>
255 The index construction and maintenance functions that an index access
256 method must provide in <structname>IndexAmRoutine</structname> are:
257 </para>
259 <para>
260 <programlisting>
261 IndexBuildResult *
262 ambuild (Relation heapRelation,
263 Relation indexRelation,
264 IndexInfo *indexInfo);
265 </programlisting>
266 Build a new index. The index relation has been physically created,
267 but is empty. It must be filled in with whatever fixed data the
268 access method requires, plus entries for all tuples already existing
269 in the table. Ordinarily the <function>ambuild</function> function will call
270 <function>table_index_build_scan()</function> to scan the table for existing tuples
271 and compute the keys that need to be inserted into the index.
272 The function must return a palloc'd struct containing statistics about
273 the new index.
274 </para>
276 <para>
277 <programlisting>
278 void
279 ambuildempty (Relation indexRelation);
280 </programlisting>
281 Build an empty index, and write it to the initialization fork (<symbol>INIT_FORKNUM</symbol>)
282 of the given relation. This method is called only for unlogged indexes; the
283 empty index written to the initialization fork will be copied over the main
284 relation fork on each server restart.
285 </para>
287 <para>
288 <programlisting>
289 bool
290 aminsert (Relation indexRelation,
291 Datum *values,
292 bool *isnull,
293 ItemPointer heap_tid,
294 Relation heapRelation,
295 IndexUniqueCheck checkUnique,
296 bool indexUnchanged,
297 IndexInfo *indexInfo);
298 </programlisting>
299 Insert a new tuple into an existing index. The <literal>values</literal> and
300 <literal>isnull</literal> arrays give the key values to be indexed, and
301 <literal>heap_tid</literal> is the TID to be indexed.
302 If the access method supports unique indexes (its
303 <structfield>amcanunique</structfield> flag is true) then
304 <literal>checkUnique</literal> indicates the type of uniqueness check to
305 perform. This varies depending on whether the unique constraint is
306 deferrable; see <xref linkend="index-unique-checks"/> for details.
307 Normally the access method only needs the <literal>heapRelation</literal>
308 parameter when performing uniqueness checking (since then it will have to
309 look into the heap to verify tuple liveness).
310 </para>
312 <para>
313 The <literal>indexUnchanged</literal> boolean value gives a hint
314 about the nature of the tuple to be indexed. When it is true,
315 the tuple is a duplicate of some existing tuple in the index. The
316 new tuple is a logically unchanged successor MVCC tuple version. This
317 happens when an <command>UPDATE</command> takes place that does not
318 modify any columns covered by the index, but nevertheless requires a
319 new version in the index. The index AM may use this hint to decide
320 to apply bottom-up index deletion in parts of the index where many
321 versions of the same logical row accumulate. Note that updating a
322 non-key column does not affect the value of
323 <literal>indexUnchanged</literal>.
324 </para>
326 <para>
327 The function's Boolean result value is significant only when
328 <literal>checkUnique</literal> is <literal>UNIQUE_CHECK_PARTIAL</literal>.
329 In this case a true result means the new entry is known unique, whereas
330 false means it might be non-unique (and a deferred uniqueness check must
331 be scheduled). For other cases a constant false result is recommended.
332 </para>
334 <para>
335 Some indexes might not index all tuples. If the tuple is not to be
336 indexed, <function>aminsert</function> should just return without doing anything.
337 </para>
339 <para>
340 If the index AM wishes to cache data across successive index insertions
341 within a SQL statement, it can allocate space
342 in <literal>indexInfo-&gt;ii_Context</literal> and store a pointer to the
343 data in <literal>indexInfo-&gt;ii_AmCache</literal> (which will be NULL
344 initially).
345 </para>
347 <para>
348 <programlisting>
349 IndexBulkDeleteResult *
350 ambulkdelete (IndexVacuumInfo *info,
351 IndexBulkDeleteResult *stats,
352 IndexBulkDeleteCallback callback,
353 void *callback_state);
354 </programlisting>
355 Delete tuple(s) from the index. This is a <quote>bulk delete</quote> operation
356 that is intended to be implemented by scanning the whole index and checking
357 each entry to see if it should be deleted.
358 The passed-in <literal>callback</literal> function must be called, in the style
359 <literal>callback(<replaceable>TID</replaceable>, callback_state) returns bool</literal>,
360 to determine whether any particular index entry, as identified by its
361 referenced TID, is to be deleted. Must return either NULL or a palloc'd
362 struct containing statistics about the effects of the deletion operation.
363 It is OK to return NULL if no information needs to be passed on to
364 <function>amvacuumcleanup</function>.
365 </para>
367 <para>
368 Because of limited <varname>maintenance_work_mem</varname>,
369 <function>ambulkdelete</function> might need to be called more than once when many
370 tuples are to be deleted. The <literal>stats</literal> argument is the result
371 of the previous call for this index (it is NULL for the first call within a
372 <command>VACUUM</command> operation). This allows the AM to accumulate statistics
373 across the whole operation. Typically, <function>ambulkdelete</function> will
374 modify and return the same struct if the passed <literal>stats</literal> is not
375 null.
376 </para>
378 <para>
379 <programlisting>
380 IndexBulkDeleteResult *
381 amvacuumcleanup (IndexVacuumInfo *info,
382 IndexBulkDeleteResult *stats);
383 </programlisting>
384 Clean up after a <command>VACUUM</command> operation (zero or more
385 <function>ambulkdelete</function> calls). This does not have to do anything
386 beyond returning index statistics, but it might perform bulk cleanup
387 such as reclaiming empty index pages. <literal>stats</literal> is whatever the
388 last <function>ambulkdelete</function> call returned, or NULL if
389 <function>ambulkdelete</function> was not called because no tuples needed to be
390 deleted. If the result is not NULL it must be a palloc'd struct.
391 The statistics it contains will be used to update <structname>pg_class</structname>,
392 and will be reported by <command>VACUUM</command> if <literal>VERBOSE</literal> is given.
393 It is OK to return NULL if the index was not changed at all during the
394 <command>VACUUM</command> operation, but otherwise correct stats should
395 be returned.
396 </para>
398 <para>
399 <function>amvacuumcleanup</function> will also be called at completion of an
400 <command>ANALYZE</command> operation. In this case <literal>stats</literal> is always
401 NULL and any return value will be ignored. This case can be distinguished
402 by checking <literal>info-&gt;analyze_only</literal>. It is recommended
403 that the access method do nothing except post-insert cleanup in such a
404 call, and that only in an autovacuum worker process.
405 </para>
407 <para>
408 <programlisting>
409 bool
410 amcanreturn (Relation indexRelation, int attno);
411 </programlisting>
412 Check whether the index can support <link
413 linkend="indexes-index-only-scans"><firstterm>index-only scans</firstterm></link> on
414 the given column, by returning the column's original indexed value.
415 The attribute number is 1-based, i.e., the first column's attno is 1.
416 Returns true if supported, else false.
417 This function should always return true for included columns
418 (if those are supported), since there's little point in an included
419 column that can't be retrieved.
420 If the access method does not support index-only scans at all,
421 the <structfield>amcanreturn</structfield> field in its <structname>IndexAmRoutine</structname>
422 struct can be set to NULL.
423 </para>
425 <para>
426 <programlisting>
427 void
428 amcostestimate (PlannerInfo *root,
429 IndexPath *path,
430 double loop_count,
431 Cost *indexStartupCost,
432 Cost *indexTotalCost,
433 Selectivity *indexSelectivity,
434 double *indexCorrelation,
435 double *indexPages);
436 </programlisting>
437 Estimate the costs of an index scan. This function is described fully
438 in <xref linkend="index-cost-estimation"/>, below.
439 </para>
441 <para>
442 <programlisting>
443 bytea *
444 amoptions (ArrayType *reloptions,
445 bool validate);
446 </programlisting>
447 Parse and validate the reloptions array for an index. This is called only
448 when a non-null reloptions array exists for the index.
449 <parameter>reloptions</parameter> is a <type>text</type> array containing entries of the
450 form <replaceable>name</replaceable><literal>=</literal><replaceable>value</replaceable>.
451 The function should construct a <type>bytea</type> value, which will be copied
452 into the <structfield>rd_options</structfield> field of the index's relcache entry.
453 The data contents of the <type>bytea</type> value are open for the access
454 method to define; most of the standard access methods use struct
455 <structname>StdRdOptions</structname>.
456 When <parameter>validate</parameter> is true, the function should report a suitable
457 error message if any of the options are unrecognized or have invalid
458 values; when <parameter>validate</parameter> is false, invalid entries should be
459 silently ignored. (<parameter>validate</parameter> is false when loading options
460 already stored in <structname>pg_catalog</structname>; an invalid entry could only
461 be found if the access method has changed its rules for options, and in
462 that case ignoring obsolete entries is appropriate.)
463 It is OK to return NULL if default behavior is wanted.
464 </para>
466 <para>
467 <programlisting>
468 bool
469 amproperty (Oid index_oid, int attno,
470 IndexAMProperty prop, const char *propname,
471 bool *res, bool *isnull);
472 </programlisting>
473 The <function>amproperty</function> method allows index access methods to override
474 the default behavior of <function>pg_index_column_has_property</function>
475 and related functions.
476 If the access method does not have any special behavior for index property
477 inquiries, the <structfield>amproperty</structfield> field in
478 its <structname>IndexAmRoutine</structname> struct can be set to NULL.
479 Otherwise, the <function>amproperty</function> method will be called with
480 <parameter>index_oid</parameter> and <parameter>attno</parameter> both zero for
481 <function>pg_indexam_has_property</function> calls,
482 or with <parameter>index_oid</parameter> valid and <parameter>attno</parameter> zero for
483 <function>pg_index_has_property</function> calls,
484 or with <parameter>index_oid</parameter> valid and <parameter>attno</parameter> greater than
485 zero for <function>pg_index_column_has_property</function> calls.
486 <parameter>prop</parameter> is an enum value identifying the property being tested,
487 while <parameter>propname</parameter> is the original property name string.
488 If the core code does not recognize the property name
489 then <parameter>prop</parameter> is <literal>AMPROP_UNKNOWN</literal>.
490 Access methods can define custom property names by
491 checking <parameter>propname</parameter> for a match (use <function>pg_strcasecmp</function>
492 to match, for consistency with the core code); for names known to the core
493 code, it's better to inspect <parameter>prop</parameter>.
494 If the <structfield>amproperty</structfield> method returns <literal>true</literal> then
495 it has determined the property test result: it must set <literal>*res</literal>
496 to the boolean value to return, or set <literal>*isnull</literal>
497 to <literal>true</literal> to return a NULL. (Both of the referenced variables
498 are initialized to <literal>false</literal> before the call.)
499 If the <structfield>amproperty</structfield> method returns <literal>false</literal> then
500 the core code will proceed with its normal logic for determining the
501 property test result.
502 </para>
504 <para>
505 Access methods that support ordering operators should
506 implement <literal>AMPROP_DISTANCE_ORDERABLE</literal> property testing, as the
507 core code does not know how to do that and will return NULL. It may
508 also be advantageous to implement <literal>AMPROP_RETURNABLE</literal> testing,
509 if that can be done more cheaply than by opening the index and calling
510 <function>amcanreturn</function>, which is the core code's default behavior.
511 The default behavior should be satisfactory for all other standard
512 properties.
513 </para>
515 <para>
516 <programlisting>
517 char *
518 ambuildphasename (int64 phasenum);
519 </programlisting>
520 Return the textual name of the given build phase number.
521 The phase numbers are those reported during an index build via the
522 <function>pgstat_progress_update_param</function> interface.
523 The phase names are then exposed in the
524 <structname>pg_stat_progress_create_index</structname> view.
525 </para>
527 <para>
528 <programlisting>
529 bool
530 amvalidate (Oid opclassoid);
531 </programlisting>
532 Validate the catalog entries for the specified operator class, so far as
533 the access method can reasonably do that. For example, this might include
534 testing that all required support functions are provided.
535 The <function>amvalidate</function> function must return false if the opclass is
536 invalid. Problems should be reported with <function>ereport</function>
537 messages, typically at <literal>INFO</literal> level.
538 </para>
540 <para>
541 <programlisting>
542 void
543 amadjustmembers (Oid opfamilyoid,
544 Oid opclassoid,
545 List *operators,
546 List *functions);
547 </programlisting>
548 Validate proposed new operator and function members of an operator family,
549 so far as the access method can reasonably do that, and set their
550 dependency types if the default is not satisfactory. This is called
551 during <command>CREATE OPERATOR CLASS</command> and during
552 <command>ALTER OPERATOR FAMILY ADD</command>; in the latter
553 case <parameter>opclassoid</parameter> is <literal>InvalidOid</literal>.
554 The <type>List</type> arguments are lists
555 of <structname>OpFamilyMember</structname> structs, as defined
556 in <filename>amapi.h</filename>.
558 Tests done by this function will typically be a subset of those
559 performed by <function>amvalidate</function>,
560 since <function>amadjustmembers</function> cannot assume that it is
561 seeing a complete set of members. For example, it would be reasonable
562 to check the signature of a support function, but not to check whether
563 all required support functions are provided. Any problems can be
564 reported by throwing an error.
566 The dependency-related fields of
567 the <structname>OpFamilyMember</structname> structs are initialized by
568 the core code to create hard dependencies on the opclass if this
569 is <command>CREATE OPERATOR CLASS</command>, or soft dependencies on the
570 opfamily if this is <command>ALTER OPERATOR FAMILY ADD</command>.
571 <function>amadjustmembers</function> can adjust these fields if some other
572 behavior is more appropriate. For example, GIN, GiST, and SP-GiST
573 always set operator members to have soft dependencies on the opfamily,
574 since the connection between an operator and an opclass is relatively
575 weak in these index types; so it is reasonable to allow operator members
576 to be added and removed freely. Optional support functions are typically
577 also given soft dependencies, so that they can be removed if necessary.
578 </para>
581 <para>
582 The purpose of an index, of course, is to support scans for tuples matching
583 an indexable <literal>WHERE</literal> condition, often called a
584 <firstterm>qualifier</firstterm> or <firstterm>scan key</firstterm>. The semantics of
585 index scanning are described more fully in <xref linkend="index-scanning"/>,
586 below. An index access method can support <quote>plain</quote> index scans,
587 <quote>bitmap</quote> index scans, or both. The scan-related functions that an
588 index access method must or may provide are:
589 </para>
591 <para>
592 <programlisting>
593 IndexScanDesc
594 ambeginscan (Relation indexRelation,
595 int nkeys,
596 int norderbys);
597 </programlisting>
598 Prepare for an index scan. The <literal>nkeys</literal> and <literal>norderbys</literal>
599 parameters indicate the number of quals and ordering operators that will be
600 used in the scan; these may be useful for space allocation purposes.
601 Note that the actual values of the scan keys aren't provided yet.
602 The result must be a palloc'd struct.
603 For implementation reasons the index access method
604 <emphasis>must</emphasis> create this struct by calling
605 <function>RelationGetIndexScan()</function>. In most cases
606 <function>ambeginscan</function> does little beyond making that call and perhaps
607 acquiring locks;
608 the interesting parts of index-scan startup are in <function>amrescan</function>.
609 </para>
611 <para>
612 <programlisting>
613 void
614 amrescan (IndexScanDesc scan,
615 ScanKey keys,
616 int nkeys,
617 ScanKey orderbys,
618 int norderbys);
619 </programlisting>
620 Start or restart an index scan, possibly with new scan keys. (To restart
621 using previously-passed keys, NULL is passed for <literal>keys</literal> and/or
622 <literal>orderbys</literal>.) Note that it is not allowed for
623 the number of keys or order-by operators to be larger than
624 what was passed to <function>ambeginscan</function>. In practice the restart
625 feature is used when a new outer tuple is selected by a nested-loop join
626 and so a new key comparison value is needed, but the scan key structure
627 remains the same.
628 </para>
630 <para>
631 <programlisting>
632 boolean
633 amgettuple (IndexScanDesc scan,
634 ScanDirection direction);
635 </programlisting>
636 Fetch the next tuple in the given scan, moving in the given
637 direction (forward or backward in the index). Returns true if a tuple was
638 obtained, false if no matching tuples remain. In the true case the tuple
639 TID is stored into the <literal>scan</literal> structure. Note that
640 <quote>success</quote> means only that the index contains an entry that matches
641 the scan keys, not that the tuple necessarily still exists in the heap or
642 will pass the caller's snapshot test. On success, <function>amgettuple</function>
643 must also set <literal>scan-&gt;xs_recheck</literal> to true or false.
644 False means it is certain that the index entry matches the scan keys.
645 True means this is not certain, and the conditions represented by the
646 scan keys must be rechecked against the heap tuple after fetching it.
647 This provision supports <quote>lossy</quote> index operators.
648 Note that rechecking will extend only to the scan conditions; a partial
649 index predicate (if any) is never rechecked by <function>amgettuple</function>
650 callers.
651 </para>
653 <para>
654 If the index supports <link linkend="indexes-index-only-scans">index-only
655 scans</link> (i.e., <function>amcanreturn</function> returns true for any
656 of its columns),
657 then on success the AM must also check <literal>scan-&gt;xs_want_itup</literal>,
658 and if that is true it must return the originally indexed data for the
659 index entry. Columns for which <function>amcanreturn</function> returns
660 false can be returned as nulls.
661 The data can be returned in the form of an
662 <structname>IndexTuple</structname> pointer stored at <literal>scan-&gt;xs_itup</literal>,
663 with tuple descriptor <literal>scan-&gt;xs_itupdesc</literal>; or in the form of
664 a <structname>HeapTuple</structname> pointer stored at <literal>scan-&gt;xs_hitup</literal>,
665 with tuple descriptor <literal>scan-&gt;xs_hitupdesc</literal>. (The latter
666 format should be used when reconstructing data that might possibly not fit
667 into an <structname>IndexTuple</structname>.) In either case,
668 management of the data referenced by the pointer is the access method's
669 responsibility. The data must remain good at least until the next
670 <function>amgettuple</function>, <function>amrescan</function>, or <function>amendscan</function>
671 call for the scan.
672 </para>
674 <para>
675 The <function>amgettuple</function> function need only be provided if the access
676 method supports <quote>plain</quote> index scans. If it doesn't, the
677 <structfield>amgettuple</structfield> field in its <structname>IndexAmRoutine</structname>
678 struct must be set to NULL.
679 </para>
681 <para>
682 <programlisting>
683 int64
684 amgetbitmap (IndexScanDesc scan,
685 TIDBitmap *tbm);
686 </programlisting>
687 Fetch all tuples in the given scan and add them to the caller-supplied
688 <type>TIDBitmap</type> (that is, OR the set of tuple IDs into whatever set is already
689 in the bitmap). The number of tuples fetched is returned (this might be
690 just an approximate count, for instance some AMs do not detect duplicates).
691 While inserting tuple IDs into the bitmap, <function>amgetbitmap</function> can
692 indicate that rechecking of the scan conditions is required for specific
693 tuple IDs. This is analogous to the <literal>xs_recheck</literal> output parameter
694 of <function>amgettuple</function>. Note: in the current implementation, support
695 for this feature is conflated with support for lossy storage of the bitmap
696 itself, and therefore callers recheck both the scan conditions and the
697 partial index predicate (if any) for recheckable tuples. That might not
698 always be true, however.
699 <function>amgetbitmap</function> and
700 <function>amgettuple</function> cannot be used in the same index scan; there
701 are other restrictions too when using <function>amgetbitmap</function>, as explained
702 in <xref linkend="index-scanning"/>.
703 </para>
705 <para>
706 The <function>amgetbitmap</function> function need only be provided if the access
707 method supports <quote>bitmap</quote> index scans. If it doesn't, the
708 <structfield>amgetbitmap</structfield> field in its <structname>IndexAmRoutine</structname>
709 struct must be set to NULL.
710 </para>
712 <para>
713 <programlisting>
714 void
715 amendscan (IndexScanDesc scan);
716 </programlisting>
717 End a scan and release resources. The <literal>scan</literal> struct itself
718 should not be freed, but any locks or pins taken internally by the
719 access method must be released, as well as any other memory allocated
720 by <function>ambeginscan</function> and other scan-related functions.
721 </para>
723 <para>
724 <programlisting>
725 void
726 ammarkpos (IndexScanDesc scan);
727 </programlisting>
728 Mark current scan position. The access method need only support one
729 remembered scan position per scan.
730 </para>
732 <para>
733 The <function>ammarkpos</function> function need only be provided if the access
734 method supports ordered scans. If it doesn't,
735 the <structfield>ammarkpos</structfield> field in its <structname>IndexAmRoutine</structname>
736 struct may be set to NULL.
737 </para>
739 <para>
740 <programlisting>
741 void
742 amrestrpos (IndexScanDesc scan);
743 </programlisting>
744 Restore the scan to the most recently marked position.
745 </para>
747 <para>
748 The <function>amrestrpos</function> function need only be provided if the access
749 method supports ordered scans. If it doesn't,
750 the <structfield>amrestrpos</structfield> field in its <structname>IndexAmRoutine</structname>
751 struct may be set to NULL.
752 </para>
754 <para>
755 In addition to supporting ordinary index scans, some types of index
756 may wish to support <firstterm>parallel index scans</firstterm>, which allow
757 multiple backends to cooperate in performing an index scan. The
758 index access method should arrange things so that each cooperating
759 process returns a subset of the tuples that would be performed by
760 an ordinary, non-parallel index scan, but in such a way that the
761 union of those subsets is equal to the set of tuples that would be
762 returned by an ordinary, non-parallel index scan. Furthermore, while
763 there need not be any global ordering of tuples returned by a parallel
764 scan, the ordering of that subset of tuples returned within each
765 cooperating backend must match the requested ordering. The following
766 functions may be implemented to support parallel index scans:
767 </para>
769 <para>
770 <programlisting>
771 Size
772 amestimateparallelscan (void);
773 </programlisting>
774 Estimate and return the number of bytes of dynamic shared memory which
775 the access method will be needed to perform a parallel scan. (This number
776 is in addition to, not in lieu of, the amount of space needed for
777 AM-independent data in <structname>ParallelIndexScanDescData</structname>.)
778 </para>
780 <para>
781 It is not necessary to implement this function for access methods which
782 do not support parallel scans or for which the number of additional bytes
783 of storage required is zero.
784 </para>
786 <para>
787 <programlisting>
788 void
789 aminitparallelscan (void *target);
790 </programlisting>
791 This function will be called to initialize dynamic shared memory at the
792 beginning of a parallel scan. <parameter>target</parameter> will point to at least
793 the number of bytes previously returned by
794 <function>amestimateparallelscan</function>, and this function may use that
795 amount of space to store whatever data it wishes.
796 </para>
798 <para>
799 It is not necessary to implement this function for access methods which
800 do not support parallel scans or in cases where the shared memory space
801 required needs no initialization.
802 </para>
804 <para>
805 <programlisting>
806 void
807 amparallelrescan (IndexScanDesc scan);
808 </programlisting>
809 This function, if implemented, will be called when a parallel index scan
810 must be restarted. It should reset any shared state set up by
811 <function>aminitparallelscan</function> such that the scan will be restarted from
812 the beginning.
813 </para>
815 </sect1>
817 <sect1 id="index-scanning">
818 <title>Index Scanning</title>
820 <para>
821 In an index scan, the index access method is responsible for regurgitating
822 the TIDs of all the tuples it has been told about that match the
823 <firstterm>scan keys</firstterm>. The access method is <emphasis>not</emphasis> involved in
824 actually fetching those tuples from the index's parent table, nor in
825 determining whether they pass the scan's visibility test or other
826 conditions.
827 </para>
829 <para>
830 A scan key is the internal representation of a <literal>WHERE</literal> clause of
831 the form <replaceable>index_key</replaceable> <replaceable>operator</replaceable>
832 <replaceable>constant</replaceable>, where the index key is one of the columns of the
833 index and the operator is one of the members of the operator family
834 associated with that index column. An index scan has zero or more scan
835 keys, which are implicitly ANDed &mdash; the returned tuples are expected
836 to satisfy all the indicated conditions.
837 </para>
839 <para>
840 The access method can report that the index is <firstterm>lossy</firstterm>, or
841 requires rechecks, for a particular query. This implies that the index
842 scan will return all the entries that pass the scan key, plus possibly
843 additional entries that do not. The core system's index-scan machinery
844 will then apply the index conditions again to the heap tuple to verify
845 whether or not it really should be selected. If the recheck option is not
846 specified, the index scan must return exactly the set of matching entries.
847 </para>
849 <para>
850 Note that it is entirely up to the access method to ensure that it
851 correctly finds all and only the entries passing all the given scan keys.
852 Also, the core system will simply hand off all the <literal>WHERE</literal>
853 clauses that match the index keys and operator families, without any
854 semantic analysis to determine whether they are redundant or
855 contradictory. As an example, given
856 <literal>WHERE x &gt; 4 AND x &gt; 14</literal> where <literal>x</literal> is a b-tree
857 indexed column, it is left to the b-tree <function>amrescan</function> function
858 to realize that the first scan key is redundant and can be discarded.
859 The extent of preprocessing needed during <function>amrescan</function> will
860 depend on the extent to which the index access method needs to reduce
861 the scan keys to a <quote>normalized</quote> form.
862 </para>
864 <para>
865 Some access methods return index entries in a well-defined order, others
866 do not. There are actually two different ways that an access method can
867 support sorted output:
869 <itemizedlist>
870 <listitem>
871 <para>
872 Access methods that always return entries in the natural ordering
873 of their data (such as btree) should set
874 <structfield>amcanorder</structfield> to true.
875 Currently, such access methods must use btree-compatible strategy
876 numbers for their equality and ordering operators.
877 </para>
878 </listitem>
879 <listitem>
880 <para>
881 Access methods that support ordering operators should set
882 <structfield>amcanorderbyop</structfield> to true.
883 This indicates that the index is capable of returning entries in
884 an order satisfying <literal>ORDER BY</literal> <replaceable>index_key</replaceable>
885 <replaceable>operator</replaceable> <replaceable>constant</replaceable>. Scan modifiers
886 of that form can be passed to <function>amrescan</function> as described
887 previously.
888 </para>
889 </listitem>
890 </itemizedlist>
891 </para>
893 <para>
894 The <function>amgettuple</function> function has a <literal>direction</literal> argument,
895 which can be either <literal>ForwardScanDirection</literal> (the normal case)
896 or <literal>BackwardScanDirection</literal>. If the first call after
897 <function>amrescan</function> specifies <literal>BackwardScanDirection</literal>, then the
898 set of matching index entries is to be scanned back-to-front rather than in
899 the normal front-to-back direction, so <function>amgettuple</function> must return
900 the last matching tuple in the index, rather than the first one as it
901 normally would. (This will only occur for access
902 methods that set <structfield>amcanorder</structfield> to true.) After the
903 first call, <function>amgettuple</function> must be prepared to advance the scan in
904 either direction from the most recently returned entry. (But if
905 <structfield>amcanbackward</structfield> is false, all subsequent
906 calls will have the same direction as the first one.)
907 </para>
909 <para>
910 Access methods that support ordered scans must support <quote>marking</quote> a
911 position in a scan and later returning to the marked position. The same
912 position might be restored multiple times. However, only one position need
913 be remembered per scan; a new <function>ammarkpos</function> call overrides the
914 previously marked position. An access method that does not support ordered
915 scans need not provide <function>ammarkpos</function> and <function>amrestrpos</function>
916 functions in <structname>IndexAmRoutine</structname>; set those pointers to NULL
917 instead.
918 </para>
920 <para>
921 Both the scan position and the mark position (if any) must be maintained
922 consistently in the face of concurrent insertions or deletions in the
923 index. It is OK if a freshly-inserted entry is not returned by a scan that
924 would have found the entry if it had existed when the scan started, or for
925 the scan to return such an entry upon rescanning or backing
926 up even though it had not been returned the first time through. Similarly,
927 a concurrent delete might or might not be reflected in the results of a scan.
928 What is important is that insertions or deletions not cause the scan to
929 miss or multiply return entries that were not themselves being inserted or
930 deleted.
931 </para>
933 <para>
934 If the index stores the original indexed data values (and not some lossy
935 representation of them), it is useful to
936 support <link linkend="indexes-index-only-scans">index-only scans</link>, in
937 which the index returns the actual data not just the TID of the heap tuple.
938 This will only avoid I/O if the visibility map shows that the TID is on an
939 all-visible page; else the heap tuple must be visited anyway to check
940 MVCC visibility. But that is no concern of the access method's.
941 </para>
943 <para>
944 Instead of using <function>amgettuple</function>, an index scan can be done with
945 <function>amgetbitmap</function> to fetch all tuples in one call. This can be
946 noticeably more efficient than <function>amgettuple</function> because it allows
947 avoiding lock/unlock cycles within the access method. In principle
948 <function>amgetbitmap</function> should have the same effects as repeated
949 <function>amgettuple</function> calls, but we impose several restrictions to
950 simplify matters. First of all, <function>amgetbitmap</function> returns all
951 tuples at once and marking or restoring scan positions isn't
952 supported. Secondly, the tuples are returned in a bitmap which doesn't
953 have any specific ordering, which is why <function>amgetbitmap</function> doesn't
954 take a <literal>direction</literal> argument. (Ordering operators will never be
955 supplied for such a scan, either.)
956 Also, there is no provision for index-only scans with
957 <function>amgetbitmap</function>, since there is no way to return the contents of
958 index tuples.
959 Finally, <function>amgetbitmap</function>
960 does not guarantee any locking of the returned tuples, with implications
961 spelled out in <xref linkend="index-locking"/>.
962 </para>
964 <para>
965 Note that it is permitted for an access method to implement only
966 <function>amgetbitmap</function> and not <function>amgettuple</function>, or vice versa,
967 if its internal implementation is unsuited to one API or the other.
968 </para>
970 </sect1>
972 <sect1 id="index-locking">
973 <title>Index Locking Considerations</title>
975 <para>
976 Index access methods must handle concurrent updates
977 of the index by multiple processes.
978 The core <productname>PostgreSQL</productname> system obtains
979 <literal>AccessShareLock</literal> on the index during an index scan, and
980 <literal>RowExclusiveLock</literal> when updating the index (including plain
981 <command>VACUUM</command>). Since these lock types do not conflict, the access
982 method is responsible for handling any fine-grained locking it might need.
983 An exclusive lock on the index as a whole will be taken only during index
984 creation, destruction, or <command>REINDEX</command>.
985 </para>
987 <para>
988 Building an index type that supports concurrent updates usually requires
989 extensive and subtle analysis of the required behavior. For the b-tree
990 and hash index types, you can read about the design decisions involved in
991 <filename>src/backend/access/nbtree/README</filename> and
992 <filename>src/backend/access/hash/README</filename>.
993 </para>
995 <para>
996 Aside from the index's own internal consistency requirements, concurrent
997 updates create issues about consistency between the parent table (the
998 <firstterm>heap</firstterm>) and the index. Because
999 <productname>PostgreSQL</productname> separates accesses
1000 and updates of the heap from those of the index, there are windows in
1001 which the index might be inconsistent with the heap. We handle this problem
1002 with the following rules:
1004 <itemizedlist>
1005 <listitem>
1006 <para>
1007 A new heap entry is made before making its index entries. (Therefore
1008 a concurrent index scan is likely to fail to see the heap entry.
1009 This is okay because the index reader would be uninterested in an
1010 uncommitted row anyway. But see <xref linkend="index-unique-checks"/>.)
1011 </para>
1012 </listitem>
1013 <listitem>
1014 <para>
1015 When a heap entry is to be deleted (by <command>VACUUM</command>), all its
1016 index entries must be removed first.
1017 </para>
1018 </listitem>
1019 <listitem>
1020 <para>
1021 An index scan must maintain a pin
1022 on the index page holding the item last returned by
1023 <function>amgettuple</function>, and <function>ambulkdelete</function> cannot delete
1024 entries from pages that are pinned by other backends. The need
1025 for this rule is explained below.
1026 </para>
1027 </listitem>
1028 </itemizedlist>
1030 Without the third rule, it is possible for an index reader to
1031 see an index entry just before it is removed by <command>VACUUM</command>, and
1032 then to arrive at the corresponding heap entry after that was removed by
1033 <command>VACUUM</command>.
1034 This creates no serious problems if that item
1035 number is still unused when the reader reaches it, since an empty
1036 item slot will be ignored by <function>heap_fetch()</function>. But what if a
1037 third backend has already re-used the item slot for something else?
1038 When using an MVCC-compliant snapshot, there is no problem because
1039 the new occupant of the slot is certain to be too new to pass the
1040 snapshot test. However, with a non-MVCC-compliant snapshot (such as
1041 <literal>SnapshotAny</literal>), it would be possible to accept and return
1042 a row that does not in fact match the scan keys. We could defend
1043 against this scenario by requiring the scan keys to be rechecked
1044 against the heap row in all cases, but that is too expensive. Instead,
1045 we use a pin on an index page as a proxy to indicate that the reader
1046 might still be <quote>in flight</quote> from the index entry to the matching
1047 heap entry. Making <function>ambulkdelete</function> block on such a pin ensures
1048 that <command>VACUUM</command> cannot delete the heap entry before the reader
1049 is done with it. This solution costs little in run time, and adds blocking
1050 overhead only in the rare cases where there actually is a conflict.
1051 </para>
1053 <para>
1054 This solution requires that index scans be <quote>synchronous</quote>: we have
1055 to fetch each heap tuple immediately after scanning the corresponding index
1056 entry. This is expensive for a number of reasons. An
1057 <quote>asynchronous</quote> scan in which we collect many TIDs from the index,
1058 and only visit the heap tuples sometime later, requires much less index
1059 locking overhead and can allow a more efficient heap access pattern.
1060 Per the above analysis, we must use the synchronous approach for
1061 non-MVCC-compliant snapshots, but an asynchronous scan is workable
1062 for a query using an MVCC snapshot.
1063 </para>
1065 <para>
1066 In an <function>amgetbitmap</function> index scan, the access method does not
1067 keep an index pin on any of the returned tuples. Therefore
1068 it is only safe to use such scans with MVCC-compliant snapshots.
1069 </para>
1071 <para>
1072 When the <structfield>ampredlocks</structfield> flag is not set, any scan using that
1073 index access method within a serializable transaction will acquire a
1074 nonblocking predicate lock on the full index. This will generate a
1075 read-write conflict with the insert of any tuple into that index by a
1076 concurrent serializable transaction. If certain patterns of read-write
1077 conflicts are detected among a set of concurrent serializable
1078 transactions, one of those transactions may be canceled to protect data
1079 integrity. When the flag is set, it indicates that the index access
1080 method implements finer-grained predicate locking, which will tend to
1081 reduce the frequency of such transaction cancellations.
1082 </para>
1084 </sect1>
1086 <sect1 id="index-unique-checks">
1087 <title>Index Uniqueness Checks</title>
1089 <para>
1090 <productname>PostgreSQL</productname> enforces SQL uniqueness constraints
1091 using <firstterm>unique indexes</firstterm>, which are indexes that disallow
1092 multiple entries with identical keys. An access method that supports this
1093 feature sets <structfield>amcanunique</structfield> true.
1094 (At present, only b-tree supports it.) Columns listed in the
1095 <literal>INCLUDE</literal> clause are not considered when enforcing
1096 uniqueness.
1097 </para>
1099 <para>
1100 Because of MVCC, it is always necessary to allow duplicate entries to
1101 exist physically in an index: the entries might refer to successive
1102 versions of a single logical row. The behavior we actually want to
1103 enforce is that no MVCC snapshot could include two rows with equal
1104 index keys. This breaks down into the following cases that must be
1105 checked when inserting a new row into a unique index:
1107 <itemizedlist>
1108 <listitem>
1109 <para>
1110 If a conflicting valid row has been deleted by the current transaction,
1111 it's okay. (In particular, since an UPDATE always deletes the old row
1112 version before inserting the new version, this will allow an UPDATE on
1113 a row without changing the key.)
1114 </para>
1115 </listitem>
1116 <listitem>
1117 <para>
1118 If a conflicting row has been inserted by an as-yet-uncommitted
1119 transaction, the would-be inserter must wait to see if that transaction
1120 commits. If it rolls back then there is no conflict. If it commits
1121 without deleting the conflicting row again, there is a uniqueness
1122 violation. (In practice we just wait for the other transaction to
1123 end and then redo the visibility check in toto.)
1124 </para>
1125 </listitem>
1126 <listitem>
1127 <para>
1128 Similarly, if a conflicting valid row has been deleted by an
1129 as-yet-uncommitted transaction, the would-be inserter must wait
1130 for that transaction to commit or abort, and then repeat the test.
1131 </para>
1132 </listitem>
1133 </itemizedlist>
1134 </para>
1136 <para>
1137 Furthermore, immediately before reporting a uniqueness violation
1138 according to the above rules, the access method must recheck the
1139 liveness of the row being inserted. If it is committed dead then
1140 no violation should be reported. (This case cannot occur during the
1141 ordinary scenario of inserting a row that's just been created by
1142 the current transaction. It can happen during
1143 <command>CREATE UNIQUE INDEX CONCURRENTLY</command>, however.)
1144 </para>
1146 <para>
1147 We require the index access method to apply these tests itself, which
1148 means that it must reach into the heap to check the commit status of
1149 any row that is shown to have a duplicate key according to the index
1150 contents. This is without a doubt ugly and non-modular, but it saves
1151 redundant work: if we did a separate probe then the index lookup for
1152 a conflicting row would be essentially repeated while finding the place to
1153 insert the new row's index entry. What's more, there is no obvious way
1154 to avoid race conditions unless the conflict check is an integral part
1155 of insertion of the new index entry.
1156 </para>
1158 <para>
1159 If the unique constraint is deferrable, there is additional complexity:
1160 we need to be able to insert an index entry for a new row, but defer any
1161 uniqueness-violation error until end of statement or even later. To
1162 avoid unnecessary repeat searches of the index, the index access method
1163 should do a preliminary uniqueness check during the initial insertion.
1164 If this shows that there is definitely no conflicting live tuple, we
1165 are done. Otherwise, we schedule a recheck to occur when it is time to
1166 enforce the constraint. If, at the time of the recheck, both the inserted
1167 tuple and some other tuple with the same key are live, then the error
1168 must be reported. (Note that for this purpose, <quote>live</quote> actually
1169 means <quote>any tuple in the index entry's HOT chain is live</quote>.)
1170 To implement this, the <function>aminsert</function> function is passed a
1171 <literal>checkUnique</literal> parameter having one of the following values:
1173 <itemizedlist>
1174 <listitem>
1175 <para>
1176 <literal>UNIQUE_CHECK_NO</literal> indicates that no uniqueness checking
1177 should be done (this is not a unique index).
1178 </para>
1179 </listitem>
1180 <listitem>
1181 <para>
1182 <literal>UNIQUE_CHECK_YES</literal> indicates that this is a non-deferrable
1183 unique index, and the uniqueness check must be done immediately, as
1184 described above.
1185 </para>
1186 </listitem>
1187 <listitem>
1188 <para>
1189 <literal>UNIQUE_CHECK_PARTIAL</literal> indicates that the unique
1190 constraint is deferrable. <productname>PostgreSQL</productname>
1191 will use this mode to insert each row's index entry. The access
1192 method must allow duplicate entries into the index, and report any
1193 potential duplicates by returning false from <function>aminsert</function>.
1194 For each row for which false is returned, a deferred recheck will
1195 be scheduled.
1196 </para>
1198 <para>
1199 The access method must identify any rows which might violate the
1200 unique constraint, but it is not an error for it to report false
1201 positives. This allows the check to be done without waiting for other
1202 transactions to finish; conflicts reported here are not treated as
1203 errors and will be rechecked later, by which time they may no longer
1204 be conflicts.
1205 </para>
1206 </listitem>
1207 <listitem>
1208 <para>
1209 <literal>UNIQUE_CHECK_EXISTING</literal> indicates that this is a deferred
1210 recheck of a row that was reported as a potential uniqueness violation.
1211 Although this is implemented by calling <function>aminsert</function>, the
1212 access method must <emphasis>not</emphasis> insert a new index entry in this
1213 case. The index entry is already present. Rather, the access method
1214 must check to see if there is another live index entry. If so, and
1215 if the target row is also still live, report error.
1216 </para>
1218 <para>
1219 It is recommended that in a <literal>UNIQUE_CHECK_EXISTING</literal> call,
1220 the access method further verify that the target row actually does
1221 have an existing entry in the index, and report error if not. This
1222 is a good idea because the index tuple values passed to
1223 <function>aminsert</function> will have been recomputed. If the index
1224 definition involves functions that are not really immutable, we
1225 might be checking the wrong area of the index. Checking that the
1226 target row is found in the recheck verifies that we are scanning
1227 for the same tuple values as were used in the original insertion.
1228 </para>
1229 </listitem>
1230 </itemizedlist>
1231 </para>
1233 </sect1>
1235 <sect1 id="index-cost-estimation">
1236 <title>Index Cost Estimation Functions</title>
1238 <para>
1239 The <function>amcostestimate</function> function is given information describing
1240 a possible index scan, including lists of WHERE and ORDER BY clauses that
1241 have been determined to be usable with the index. It must return estimates
1242 of the cost of accessing the index and the selectivity of the WHERE
1243 clauses (that is, the fraction of parent-table rows that will be
1244 retrieved during the index scan). For simple cases, nearly all the
1245 work of the cost estimator can be done by calling standard routines
1246 in the optimizer; the point of having an <function>amcostestimate</function> function is
1247 to allow index access methods to provide index-type-specific knowledge,
1248 in case it is possible to improve on the standard estimates.
1249 </para>
1251 <para>
1252 Each <function>amcostestimate</function> function must have the signature:
1254 <programlisting>
1255 void
1256 amcostestimate (PlannerInfo *root,
1257 IndexPath *path,
1258 double loop_count,
1259 Cost *indexStartupCost,
1260 Cost *indexTotalCost,
1261 Selectivity *indexSelectivity,
1262 double *indexCorrelation,
1263 double *indexPages);
1264 </programlisting>
1266 The first three parameters are inputs:
1268 <variablelist>
1269 <varlistentry>
1270 <term><parameter>root</parameter></term>
1271 <listitem>
1272 <para>
1273 The planner's information about the query being processed.
1274 </para>
1275 </listitem>
1276 </varlistentry>
1278 <varlistentry>
1279 <term><parameter>path</parameter></term>
1280 <listitem>
1281 <para>
1282 The index access path being considered. All fields except cost and
1283 selectivity values are valid.
1284 </para>
1285 </listitem>
1286 </varlistentry>
1288 <varlistentry>
1289 <term><parameter>loop_count</parameter></term>
1290 <listitem>
1291 <para>
1292 The number of repetitions of the index scan that should be factored
1293 into the cost estimates. This will typically be greater than one when
1294 considering a parameterized scan for use in the inside of a nestloop
1295 join. Note that the cost estimates should still be for just one scan;
1296 a larger <parameter>loop_count</parameter> means that it may be appropriate
1297 to allow for some caching effects across multiple scans.
1298 </para>
1299 </listitem>
1300 </varlistentry>
1301 </variablelist>
1302 </para>
1304 <para>
1305 The last five parameters are pass-by-reference outputs:
1307 <variablelist>
1308 <varlistentry>
1309 <term><parameter>*indexStartupCost</parameter></term>
1310 <listitem>
1311 <para>
1312 Set to cost of index start-up processing
1313 </para>
1314 </listitem>
1315 </varlistentry>
1317 <varlistentry>
1318 <term><parameter>*indexTotalCost</parameter></term>
1319 <listitem>
1320 <para>
1321 Set to total cost of index processing
1322 </para>
1323 </listitem>
1324 </varlistentry>
1326 <varlistentry>
1327 <term><parameter>*indexSelectivity</parameter></term>
1328 <listitem>
1329 <para>
1330 Set to index selectivity
1331 </para>
1332 </listitem>
1333 </varlistentry>
1335 <varlistentry>
1336 <term><parameter>*indexCorrelation</parameter></term>
1337 <listitem>
1338 <para>
1339 Set to correlation coefficient between index scan order and
1340 underlying table's order
1341 </para>
1342 </listitem>
1343 </varlistentry>
1345 <varlistentry>
1346 <term><parameter>*indexPages</parameter></term>
1347 <listitem>
1348 <para>
1349 Set to number of index leaf pages
1350 </para>
1351 </listitem>
1352 </varlistentry>
1353 </variablelist>
1354 </para>
1356 <para>
1357 Note that cost estimate functions must be written in C, not in SQL or
1358 any available procedural language, because they must access internal
1359 data structures of the planner/optimizer.
1360 </para>
1362 <para>
1363 The index access costs should be computed using the parameters used by
1364 <filename>src/backend/optimizer/path/costsize.c</filename>: a sequential
1365 disk block fetch has cost <varname>seq_page_cost</varname>, a nonsequential fetch
1366 has cost <varname>random_page_cost</varname>, and the cost of processing one index
1367 row should usually be taken as <varname>cpu_index_tuple_cost</varname>. In
1368 addition, an appropriate multiple of <varname>cpu_operator_cost</varname> should
1369 be charged for any comparison operators invoked during index processing
1370 (especially evaluation of the indexquals themselves).
1371 </para>
1373 <para>
1374 The access costs should include all disk and CPU costs associated with
1375 scanning the index itself, but <emphasis>not</emphasis> the costs of retrieving or
1376 processing the parent-table rows that are identified by the index.
1377 </para>
1379 <para>
1380 The <quote>start-up cost</quote> is the part of the total scan cost that
1381 must be expended before we can begin to fetch the first row. For most
1382 indexes this can be taken as zero, but an index type with a high start-up
1383 cost might want to set it nonzero.
1384 </para>
1386 <para>
1387 The <parameter>indexSelectivity</parameter> should be set to the estimated fraction of the parent
1388 table rows that will be retrieved during the index scan. In the case
1389 of a lossy query, this will typically be higher than the fraction of
1390 rows that actually pass the given qual conditions.
1391 </para>
1393 <para>
1394 The <parameter>indexCorrelation</parameter> should be set to the correlation (ranging between
1395 -1.0 and 1.0) between the index order and the table order. This is used
1396 to adjust the estimate for the cost of fetching rows from the parent
1397 table.
1398 </para>
1400 <para>
1401 The <parameter>indexPages</parameter> should be set to the number of leaf pages.
1402 This is used to estimate the number of workers for parallel index scan.
1403 </para>
1405 <para>
1406 When <parameter>loop_count</parameter> is greater than one, the returned numbers
1407 should be averages expected for any one scan of the index.
1408 </para>
1410 <procedure>
1411 <title>Cost Estimation</title>
1412 <para>
1413 A typical cost estimator will proceed as follows:
1414 </para>
1416 <step>
1417 <para>
1418 Estimate and return the fraction of parent-table rows that will be visited
1419 based on the given qual conditions. In the absence of any index-type-specific
1420 knowledge, use the standard optimizer function <function>clauselist_selectivity()</function>:
1422 <programlisting>
1423 *indexSelectivity = clauselist_selectivity(root, path-&gt;indexquals,
1424 path-&gt;indexinfo-&gt;rel-&gt;relid,
1425 JOIN_INNER, NULL);
1426 </programlisting>
1427 </para>
1428 </step>
1430 <step>
1431 <para>
1432 Estimate the number of index rows that will be visited during the
1433 scan. For many index types this is the same as <parameter>indexSelectivity</parameter> times
1434 the number of rows in the index, but it might be more. (Note that the
1435 index's size in pages and rows is available from the
1436 <literal>path-&gt;indexinfo</literal> struct.)
1437 </para>
1438 </step>
1440 <step>
1441 <para>
1442 Estimate the number of index pages that will be retrieved during the scan.
1443 This might be just <parameter>indexSelectivity</parameter> times the index's size in pages.
1444 </para>
1445 </step>
1447 <step>
1448 <para>
1449 Compute the index access cost. A generic estimator might do this:
1451 <programlisting>
1453 * Our generic assumption is that the index pages will be read
1454 * sequentially, so they cost seq_page_cost each, not random_page_cost.
1455 * Also, we charge for evaluation of the indexquals at each index row.
1456 * All the costs are assumed to be paid incrementally during the scan.
1458 cost_qual_eval(&amp;index_qual_cost, path-&gt;indexquals, root);
1459 *indexStartupCost = index_qual_cost.startup;
1460 *indexTotalCost = seq_page_cost * numIndexPages +
1461 (cpu_index_tuple_cost + index_qual_cost.per_tuple) * numIndexTuples;
1462 </programlisting>
1464 However, the above does not account for amortization of index reads
1465 across repeated index scans.
1466 </para>
1467 </step>
1469 <step>
1470 <para>
1471 Estimate the index correlation. For a simple ordered index on a single
1472 field, this can be retrieved from pg_statistic. If the correlation
1473 is not known, the conservative estimate is zero (no correlation).
1474 </para>
1475 </step>
1476 </procedure>
1478 <para>
1479 Examples of cost estimator functions can be found in
1480 <filename>src/backend/utils/adt/selfuncs.c</filename>.
1481 </para>
1482 </sect1>
1483 </chapter>