The code to unlink dropped relations in FinishPreparedTransaction() was
[PostgreSQL.git] / doc / src / sgml / indexam.sgml
blob46f7c1520ca1db2dd2e243f1ec492c7a082f0626
1 <!-- $PostgreSQL$ -->
3 <chapter id="indexam">
4 <title>Index Access Method Interface Definition</title>
6 <para>
7 This chapter defines the interface between the core
8 <productname>PostgreSQL</productname> system and <firstterm>index access
9 methods</>, which manage individual index types. The core system
10 knows nothing about indexes beyond what is specified here, so it is
11 possible to develop entirely new index types by writing add-on code.
12 </para>
14 <para>
15 All indexes in <productname>PostgreSQL</productname> are what are known
16 technically as <firstterm>secondary indexes</>; that is, the index is
17 physically separate from the table file that it describes. Each index
18 is stored as its own physical <firstterm>relation</> and so is described
19 by an entry in the <structname>pg_class</> catalog. The contents of an
20 index are entirely under the control of its index access method. In
21 practice, all index access methods divide indexes into standard-size
22 pages so that they can use the regular storage manager and buffer manager
23 to access the index contents. (All the existing index access methods
24 furthermore use the standard page layout described in <xref
25 linkend="storage-page-layout">, and they all use the same format for index
26 tuple headers; but these decisions are not forced on an access method.)
27 </para>
29 <para>
30 An index is effectively a mapping from some data key values to
31 <firstterm>tuple identifiers</>, or <acronym>TIDs</>, of row versions
32 (tuples) in the index's parent table. A TID consists of a
33 block number and an item number within that block (see <xref
34 linkend="storage-page-layout">). This is sufficient
35 information to fetch a particular row version from the table.
36 Indexes are not directly aware that under MVCC, there might be multiple
37 extant versions of the same logical row; to an index, each tuple is
38 an independent object that needs its own index entry. Thus, an
39 update of a row always creates all-new index entries for the row, even if
40 the key values did not change. (HOT tuples are an exception to this
41 statement; but indexes do not deal with those, either.) Index entries for
42 dead tuples are reclaimed (by vacuuming) when the dead tuples themselves
43 are reclaimed.
44 </para>
46 <sect1 id="index-catalog">
47 <title>Catalog Entries for Indexes</title>
49 <para>
50 Each index access method is described by a row in the
51 <structname>pg_am</structname> system catalog (see
52 <xref linkend="catalog-pg-am">). The principal contents of a
53 <structname>pg_am</structname> row are references to
54 <link linkend="catalog-pg-proc"><structname>pg_proc</structname></link>
55 entries that identify the index access
56 functions supplied by the access method. The APIs for these functions
57 are defined later in this chapter. In addition, the
58 <structname>pg_am</structname> row specifies a few fixed properties of
59 the access method, such as whether it can support multicolumn indexes.
60 There is not currently any special support
61 for creating or deleting <structname>pg_am</structname> entries;
62 anyone able to write a new access method is expected to be competent
63 to insert an appropriate row for themselves.
64 </para>
66 <para>
67 To be useful, an index access method must also have one or more
68 <firstterm>operator families</> and
69 <firstterm>operator classes</> defined in
70 <link linkend="catalog-pg-opfamily"><structname>pg_opfamily</structname></link>,
71 <link linkend="catalog-pg-opclass"><structname>pg_opclass</structname></link>,
72 <link linkend="catalog-pg-amop"><structname>pg_amop</structname></link>, and
73 <link linkend="catalog-pg-amproc"><structname>pg_amproc</structname></link>.
74 These entries allow the planner
75 to determine what kinds of query qualifications can be used with
76 indexes of this access method. Operator families and classes are described
77 in <xref linkend="xindex">, which is prerequisite material for reading
78 this chapter.
79 </para>
81 <para>
82 An individual index is defined by a
83 <link linkend="catalog-pg-class"><structname>pg_class</structname></link>
84 entry that describes it as a physical relation, plus a
85 <link linkend="catalog-pg-index"><structname>pg_index</structname></link>
86 entry that shows the logical content of the index &mdash; that is, the set
87 of index columns it has and the semantics of those columns, as captured by
88 the associated operator classes. The index columns (key values) can be
89 either simple columns of the underlying table or expressions over the table
90 rows. The index access method normally has no interest in where the index
91 key values come from (it is always handed precomputed key values) but it
92 will be very interested in the operator class information in
93 <structname>pg_index</structname>. Both of these catalog entries can be
94 accessed as part of the <structname>Relation</> data structure that is
95 passed to all operations on the index.
96 </para>
98 <para>
99 Some of the flag columns of <structname>pg_am</structname> have nonobvious
100 implications. The requirements of <structfield>amcanunique</structfield>
101 are discussed in <xref linkend="index-unique-checks">.
102 The <structfield>amcanmulticol</structfield> flag asserts that the
103 access method supports multicolumn indexes, while
104 <structfield>amoptionalkey</structfield> asserts that it allows scans
105 where no indexable restriction clause is given for the first index column.
106 When <structfield>amcanmulticol</structfield> is false,
107 <structfield>amoptionalkey</structfield> essentially says whether the
108 access method allows full-index scans without any restriction clause.
109 Access methods that support multiple index columns <emphasis>must</>
110 support scans that omit restrictions on any or all of the columns after
111 the first; however they are permitted to require some restriction to
112 appear for the first index column, and this is signaled by setting
113 <structfield>amoptionalkey</structfield> false.
114 <structfield>amindexnulls</structfield> asserts that index entries are
115 created for NULL key values. Since most indexable operators are
116 strict and hence cannot return TRUE for NULL inputs,
117 it is at first sight attractive to not store index entries for null values:
118 they could never be returned by an index scan anyway. However, this
119 argument fails when an index scan has no restriction clause for a given
120 index column. In practice this means that
121 indexes that have <structfield>amoptionalkey</structfield> true must
122 index nulls, since the planner might decide to use such an index
123 with no scan keys at all. A related restriction is that an index
124 access method that supports multiple index columns <emphasis>must</>
125 support indexing null values in columns after the first, because the planner
126 will assume the index can be used for queries that do not restrict
127 these columns. For example, consider an index on (a,b) and a query with
128 <literal>WHERE a = 4</literal>. The system will assume the index can be
129 used to scan for rows with <literal>a = 4</literal>, which is wrong if the
130 index omits rows where <literal>b</> is null.
131 It is, however, OK to omit rows where the first indexed column is null.
132 Thus, <structfield>amindexnulls</structfield> should be set true only if the
133 index access method indexes all rows, including arbitrary combinations of
134 null values. An index access method that sets
135 <structfield>amindexnulls</structfield> may also set
136 <structfield>amsearchnulls</structfield>, indicating that it supports
137 <literal>IS NULL</> clauses as search conditions.
138 </para>
140 </sect1>
142 <sect1 id="index-functions">
143 <title>Index Access Method Functions</title>
145 <para>
146 The index construction and maintenance functions that an index access
147 method must provide are:
148 </para>
150 <para>
151 <programlisting>
152 IndexBuildResult *
153 ambuild (Relation heapRelation,
154 Relation indexRelation,
155 IndexInfo *indexInfo);
156 </programlisting>
157 Build a new index. The index relation has been physically created,
158 but is empty. It must be filled in with whatever fixed data the
159 access method requires, plus entries for all tuples already existing
160 in the table. Ordinarily the <function>ambuild</> function will call
161 <function>IndexBuildHeapScan()</> to scan the table for existing tuples
162 and compute the keys that need to be inserted into the index.
163 The function must return a palloc'd struct containing statistics about
164 the new index.
165 </para>
167 <para>
168 <programlisting>
169 bool
170 aminsert (Relation indexRelation,
171 Datum *values,
172 bool *isnull,
173 ItemPointer heap_tid,
174 Relation heapRelation,
175 bool check_uniqueness);
176 </programlisting>
177 Insert a new tuple into an existing index. The <literal>values</> and
178 <literal>isnull</> arrays give the key values to be indexed, and
179 <literal>heap_tid</> is the TID to be indexed.
180 If the access method supports unique indexes (its
181 <structname>pg_am</>.<structfield>amcanunique</> flag is true) then
182 <literal>check_uniqueness</> might be true, in which case the access method
183 must verify that there is no conflicting row; this is the only situation in
184 which the access method normally needs the <literal>heapRelation</>
185 parameter. See <xref linkend="index-unique-checks"> for details.
186 The result is TRUE if an index entry was inserted, FALSE if not. (A FALSE
187 result does not denote an error condition, but is used for cases such
188 as an index method refusing to index a NULL.)
189 </para>
191 <para>
192 <programlisting>
193 IndexBulkDeleteResult *
194 ambulkdelete (IndexVacuumInfo *info,
195 IndexBulkDeleteResult *stats,
196 IndexBulkDeleteCallback callback,
197 void *callback_state);
198 </programlisting>
199 Delete tuple(s) from the index. This is a <quote>bulk delete</> operation
200 that is intended to be implemented by scanning the whole index and checking
201 each entry to see if it should be deleted.
202 The passed-in <literal>callback</> function must be called, in the style
203 <literal>callback(<replaceable>TID</>, callback_state) returns bool</literal>,
204 to determine whether any particular index entry, as identified by its
205 referenced TID, is to be deleted. Must return either NULL or a palloc'd
206 struct containing statistics about the effects of the deletion operation.
207 It is OK to return NULL if no information needs to be passed on to
208 <function>amvacuumcleanup</>.
209 </para>
211 <para>
212 Because of limited <varname>maintenance_work_mem</>,
213 <function>ambulkdelete</> might need to be called more than once when many
214 tuples are to be deleted. The <literal>stats</> argument is the result
215 of the previous call for this index (it is NULL for the first call within a
216 <command>VACUUM</> operation). This allows the AM to accumulate statistics
217 across the whole operation. Typically, <function>ambulkdelete</> will
218 modify and return the same struct if the passed <literal>stats</> is not
219 null.
220 </para>
222 <para>
223 <programlisting>
224 IndexBulkDeleteResult *
225 amvacuumcleanup (IndexVacuumInfo *info,
226 IndexBulkDeleteResult *stats);
227 </programlisting>
228 Clean up after a <command>VACUUM</command> operation (zero or more
229 <function>ambulkdelete</> calls). This does not have to do anything
230 beyond returning index statistics, but it might perform bulk cleanup
231 such as reclaiming empty index pages. <literal>stats</> is whatever the
232 last <function>ambulkdelete</> call returned, or NULL if
233 <function>ambulkdelete</> was not called because no tuples needed to be
234 deleted. If the result is not NULL it must be a palloc'd struct.
235 The statistics it contains will be used to update <structname>pg_class</>,
236 and will be reported by <command>VACUUM</> if <literal>VERBOSE</> is given.
237 It is OK to return NULL if the index was not changed at all during the
238 <command>VACUUM</command> operation, but otherwise correct stats should
239 be returned.
240 </para>
242 <para>
243 As of <productname>PostgreSQL</productname> 8.4,
244 <function>amvacuumcleanup</> will also be called at completion of an
245 <command>ANALYZE</> operation. In this case <literal>stats</> is always
246 NULL and any return value will be ignored. This case can be distinguished
247 by checking <literal>info-&gt;analyze_only</literal>. It is recommended
248 that the access method do nothing except post-insert cleanup in such a
249 call, and that only in an autovacuum worker process.
250 </para>
252 <para>
253 <programlisting>
254 void
255 amcostestimate (PlannerInfo *root,
256 IndexOptInfo *index,
257 List *indexQuals,
258 RelOptInfo *outer_rel,
259 Cost *indexStartupCost,
260 Cost *indexTotalCost,
261 Selectivity *indexSelectivity,
262 double *indexCorrelation);
263 </programlisting>
264 Estimate the costs of an index scan. This function is described fully
265 in <xref linkend="index-cost-estimation">, below.
266 </para>
268 <para>
269 <programlisting>
270 bytea *
271 amoptions (ArrayType *reloptions,
272 bool validate);
273 </programlisting>
274 Parse and validate the reloptions array for an index. This is called only
275 when a non-null reloptions array exists for the index.
276 <parameter>reloptions</> is a <type>text</> array containing entries of the
277 form <replaceable>name</><literal>=</><replaceable>value</>.
278 The function should construct a <type>bytea</> value, which will be copied
279 into the <structfield>rd_options</> field of the index's relcache entry.
280 The data contents of the <type>bytea</> value are open for the access
281 method to define; most of the standard access methods use struct
282 <structname>StdRdOptions</>.
283 When <parameter>validate</> is true, the function should report a suitable
284 error message if any of the options are unrecognized or have invalid
285 values; when <parameter>validate</> is false, invalid entries should be
286 silently ignored. (<parameter>validate</> is false when loading options
287 already stored in <structname>pg_catalog</>; an invalid entry could only
288 be found if the access method has changed its rules for options, and in
289 that case ignoring obsolete entries is appropriate.)
290 It is OK to return NULL if default behavior is wanted.
291 </para>
293 <para>
294 The purpose of an index, of course, is to support scans for tuples matching
295 an indexable <literal>WHERE</> condition, often called a
296 <firstterm>qualifier</> or <firstterm>scan key</>. The semantics of
297 index scanning are described more fully in <xref linkend="index-scanning">,
298 below. An index access method can support <quote>plain</> index scans,
299 <quote>bitmap</> index scans, or both. The scan-related functions that an
300 index access method must or may provide are:
301 </para>
303 <para>
304 <programlisting>
305 IndexScanDesc
306 ambeginscan (Relation indexRelation,
307 int nkeys,
308 ScanKey key);
309 </programlisting>
310 Begin a new scan. The <literal>key</> array (of length <literal>nkeys</>)
311 describes the scan key(s) for the index scan. The result must be a
312 palloc'd struct. For implementation reasons the index access method
313 <emphasis>must</> create this struct by calling
314 <function>RelationGetIndexScan()</>. In most cases
315 <function>ambeginscan</> itself does little beyond making that call;
316 the interesting parts of index-scan startup are in <function>amrescan</>.
317 </para>
319 <para>
320 <programlisting>
321 boolean
322 amgettuple (IndexScanDesc scan,
323 ScanDirection direction);
324 </programlisting>
325 Fetch the next tuple in the given scan, moving in the given
326 direction (forward or backward in the index). Returns TRUE if a tuple was
327 obtained, FALSE if no matching tuples remain. In the TRUE case the tuple
328 TID is stored into the <literal>scan</> structure. Note that
329 <quote>success</> means only that the index contains an entry that matches
330 the scan keys, not that the tuple necessarily still exists in the heap or
331 will pass the caller's snapshot test. On success, <function>amgettuple</>
332 must also set <literal>scan-&gt;xs_recheck</> to TRUE or FALSE.
333 FALSE means it is certain that the index entry matches the scan keys.
334 TRUE means this is not certain, and the conditions represented by the
335 scan keys must be rechecked against the heap tuple after fetching it.
336 This provision supports <quote>lossy</> index operators.
337 Note that rechecking will extend only to the scan conditions; a partial
338 index predicate (if any) is never rechecked by <function>amgettuple</>
339 callers.
340 </para>
342 <para>
343 The <function>amgettuple</> function need only be provided if the access
344 method supports <quote>plain</> index scans. If it doesn't, the
345 <structfield>amgettuple</> field in its <structname>pg_am</> row must
346 be set to zero.
347 </para>
349 <para>
350 <programlisting>
351 int64
352 amgetbitmap (IndexScanDesc scan,
353 TIDBitmap *tbm);
354 </programlisting>
355 Fetch all tuples in the given scan and add them to the caller-supplied
356 TIDBitmap (that is, OR the set of tuple IDs into whatever set is already
357 in the bitmap). The number of tuples fetched is returned (this might be
358 just an approximate count, for instance some AMs do not detect duplicates).
359 While inserting tuple IDs into the bitmap, <function>amgetbitmap</> can
360 indicate that rechecking of the scan conditions is required for specific
361 tuple IDs. This is analogous to the <literal>xs_recheck</> output parameter
362 of <function>amgettuple</>. Note: in the current implementation, support
363 for this feature is conflated with support for lossy storage of the bitmap
364 itself, and therefore callers recheck both the scan conditions and the
365 partial index predicate (if any) for recheckable tuples. That might not
366 always be true, however.
367 <function>amgetbitmap</> and
368 <function>amgettuple</> cannot be used in the same index scan; there
369 are other restrictions too when using <function>amgetbitmap</>, as explained
370 in <xref linkend="index-scanning">.
371 </para>
373 <para>
374 The <function>amgetbitmap</> function need only be provided if the access
375 method supports <quote>bitmap</> index scans. If it doesn't, the
376 <structfield>amgetbitmap</> field in its <structname>pg_am</> row must
377 be set to zero.
378 </para>
380 <para>
381 <programlisting>
382 void
383 amrescan (IndexScanDesc scan,
384 ScanKey key);
385 </programlisting>
386 Restart the given scan, possibly with new scan keys (to continue using
387 the old keys, NULL is passed for <literal>key</>). Note that it is not
388 possible for the number of keys to be changed. In practice the restart
389 feature is used when a new outer tuple is selected by a nested-loop join
390 and so a new key comparison value is needed, but the scan key structure
391 remains the same. This function is also called by
392 <function>RelationGetIndexScan()</>, so it is used for initial setup
393 of an index scan as well as rescanning.
394 </para>
396 <para>
397 <programlisting>
398 void
399 amendscan (IndexScanDesc scan);
400 </programlisting>
401 End a scan and release resources. The <literal>scan</> struct itself
402 should not be freed, but any locks or pins taken internally by the
403 access method must be released.
404 </para>
406 <para>
407 <programlisting>
408 void
409 ammarkpos (IndexScanDesc scan);
410 </programlisting>
411 Mark current scan position. The access method need only support one
412 remembered scan position per scan.
413 </para>
415 <para>
416 <programlisting>
417 void
418 amrestrpos (IndexScanDesc scan);
419 </programlisting>
420 Restore the scan to the most recently marked position.
421 </para>
423 <para>
424 By convention, the <literal>pg_proc</literal> entry for an index
425 access method function should show the correct number of arguments,
426 but declare them all as type <type>internal</> (since most of the arguments
427 have types that are not known to SQL, and we don't want users calling
428 the functions directly anyway). The return type is declared as
429 <type>void</>, <type>internal</>, or <type>boolean</> as appropriate.
430 The only exception is <function>amoptions</>, which should be correctly
431 declared as taking <type>text[]</> and <type>bool</> and returning
432 <type>bytea</>. This provision allows client code to execute
433 <function>amoptions</> to test validity of options settings.
434 </para>
436 </sect1>
438 <sect1 id="index-scanning">
439 <title>Index Scanning</title>
441 <para>
442 In an index scan, the index access method is responsible for regurgitating
443 the TIDs of all the tuples it has been told about that match the
444 <firstterm>scan keys</>. The access method is <emphasis>not</> involved in
445 actually fetching those tuples from the index's parent table, nor in
446 determining whether they pass the scan's time qualification test or other
447 conditions.
448 </para>
450 <para>
451 A scan key is the internal representation of a <literal>WHERE</> clause of
452 the form <replaceable>index_key</> <replaceable>operator</>
453 <replaceable>constant</>, where the index key is one of the columns of the
454 index and the operator is one of the members of the operator family
455 associated with that index column. An index scan has zero or more scan
456 keys, which are implicitly ANDed &mdash; the returned tuples are expected
457 to satisfy all the indicated conditions.
458 </para>
460 <para>
461 The access method can report that the index is <firstterm>lossy</>, or
462 requires rechecks, for a particular query. This implies that the index
463 scan will return all the entries that pass the scan key, plus possibly
464 additional entries that do not. The core system's index-scan machinery
465 will then apply the index conditions again to the heap tuple to verify
466 whether or not it really should be selected. If the recheck option is not
467 specified, the index scan must return exactly the set of matching entries.
468 </para>
470 <para>
471 Note that it is entirely up to the access method to ensure that it
472 correctly finds all and only the entries passing all the given scan keys.
473 Also, the core system will simply hand off all the <literal>WHERE</>
474 clauses that match the index keys and operator families, without any
475 semantic analysis to determine whether they are redundant or
476 contradictory. As an example, given
477 <literal>WHERE x &gt; 4 AND x &gt; 14</> where <literal>x</> is a b-tree
478 indexed column, it is left to the b-tree <function>amrescan</> function
479 to realize that the first scan key is redundant and can be discarded.
480 The extent of preprocessing needed during <function>amrescan</> will
481 depend on the extent to which the index access method needs to reduce
482 the scan keys to a <quote>normalized</> form.
483 </para>
485 <para>
486 Some access methods return index entries in a well-defined order, others
487 do not. If entries are returned in sorted order, the access method should
488 set <structname>pg_am</>.<structfield>amcanorder</> true to indicate that
489 it supports ordered scans.
490 All such access methods must use btree-compatible strategy numbers for
491 their equality and ordering operators.
492 </para>
494 <para>
495 The <function>amgettuple</> function has a <literal>direction</> argument,
496 which can be either <literal>ForwardScanDirection</> (the normal case)
497 or <literal>BackwardScanDirection</>. If the first call after
498 <function>amrescan</> specifies <literal>BackwardScanDirection</>, then the
499 set of matching index entries is to be scanned back-to-front rather than in
500 the normal front-to-back direction, so <function>amgettuple</> must return
501 the last matching tuple in the index, rather than the first one as it
502 normally would. (This will only occur for access
503 methods that advertise they support ordered scans.) After the
504 first call, <function>amgettuple</> must be prepared to advance the scan in
505 either direction from the most recently returned entry. (But if
506 <structname>pg_am</>.<structfield>amcanbackward</> is false, all subsequent
507 calls will have the same direction as the first one.)
508 </para>
510 <para>
511 Access methods that support ordered scans must support <quote>marking</> a
512 position in a scan and later returning to the marked position. The same
513 position might be restored multiple times. However, only one position need
514 be remembered per scan; a new <function>ammarkpos</> call overrides the
515 previously marked position. An access method that does not support
516 ordered scans should still provide mark and restore functions in
517 <structname>pg_am</>, but it is sufficient to have them throw errors if
518 called.
519 </para>
521 <para>
522 Both the scan position and the mark position (if any) must be maintained
523 consistently in the face of concurrent insertions or deletions in the
524 index. It is OK if a freshly-inserted entry is not returned by a scan that
525 would have found the entry if it had existed when the scan started, or for
526 the scan to return such an entry upon rescanning or backing
527 up even though it had not been returned the first time through. Similarly,
528 a concurrent delete might or might not be reflected in the results of a scan.
529 What is important is that insertions or deletions not cause the scan to
530 miss or multiply return entries that were not themselves being inserted or
531 deleted.
532 </para>
534 <para>
535 Instead of using <function>amgettuple</>, an index scan can be done with
536 <function>amgetbitmap</> to fetch all tuples in one call. This can be
537 noticeably more efficient than <function>amgettuple</> because it allows
538 avoiding lock/unlock cycles within the access method. In principle
539 <function>amgetbitmap</> should have the same effects as repeated
540 <function>amgettuple</> calls, but we impose several restrictions to
541 simplify matters. First of all, <function>amgetbitmap</> returns all
542 tuples at once and marking or restoring scan positions isn't
543 supported. Secondly, the tuples are returned in a bitmap which doesn't
544 have any specific ordering, which is why <function>amgetbitmap</> doesn't
545 take a <literal>direction</> argument. Finally, <function>amgetbitmap</>
546 does not guarantee any locking of the returned tuples, with implications
547 spelled out in <xref linkend="index-locking">.
548 </para>
550 <para>
551 Note that it is permitted for an access method to implement only
552 <function>amgetbitmap</> and not <function>amgettuple</>, or vice versa,
553 if its internal implementation is unsuited to one API or the other.
554 </para>
556 </sect1>
558 <sect1 id="index-locking">
559 <title>Index Locking Considerations</title>
561 <para>
562 Index access methods must handle concurrent updates
563 of the index by multiple processes.
564 The core <productname>PostgreSQL</productname> system obtains
565 <literal>AccessShareLock</> on the index during an index scan, and
566 <literal>RowExclusiveLock</> when updating the index (including plain
567 <command>VACUUM</>). Since these lock
568 types do not conflict, the access method is responsible for handling any
569 fine-grained locking it might need. An exclusive lock on the index as a whole
570 will be taken only during index creation, destruction,
571 <command>REINDEX</>, or <command>VACUUM FULL</>.
572 </para>
574 <para>
575 Building an index type that supports concurrent updates usually requires
576 extensive and subtle analysis of the required behavior. For the b-tree
577 and hash index types, you can read about the design decisions involved in
578 <filename>src/backend/access/nbtree/README</> and
579 <filename>src/backend/access/hash/README</>.
580 </para>
582 <para>
583 Aside from the index's own internal consistency requirements, concurrent
584 updates create issues about consistency between the parent table (the
585 <firstterm>heap</>) and the index. Because
586 <productname>PostgreSQL</productname> separates accesses
587 and updates of the heap from those of the index, there are windows in
588 which the index might be inconsistent with the heap. We handle this problem
589 with the following rules:
591 <itemizedlist>
592 <listitem>
593 <para>
594 A new heap entry is made before making its index entries. (Therefore
595 a concurrent index scan is likely to fail to see the heap entry.
596 This is okay because the index reader would be uninterested in an
597 uncommitted row anyway. But see <xref linkend="index-unique-checks">.)
598 </para>
599 </listitem>
600 <listitem>
601 <para>
602 When a heap entry is to be deleted (by <command>VACUUM</>), all its
603 index entries must be removed first.
604 </para>
605 </listitem>
606 <listitem>
607 <para>
608 An index scan must maintain a pin
609 on the index page holding the item last returned by
610 <function>amgettuple</>, and <function>ambulkdelete</> cannot delete
611 entries from pages that are pinned by other backends. The need
612 for this rule is explained below.
613 </para>
614 </listitem>
615 </itemizedlist>
617 Without the third rule, it is possible for an index reader to
618 see an index entry just before it is removed by <command>VACUUM</>, and
619 then to arrive at the corresponding heap entry after that was removed by
620 <command>VACUUM</>.
621 This creates no serious problems if that item
622 number is still unused when the reader reaches it, since an empty
623 item slot will be ignored by <function>heap_fetch()</>. But what if a
624 third backend has already re-used the item slot for something else?
625 When using an MVCC-compliant snapshot, there is no problem because
626 the new occupant of the slot is certain to be too new to pass the
627 snapshot test. However, with a non-MVCC-compliant snapshot (such as
628 <literal>SnapshotNow</>), it would be possible to accept and return
629 a row that does not in fact match the scan keys. We could defend
630 against this scenario by requiring the scan keys to be rechecked
631 against the heap row in all cases, but that is too expensive. Instead,
632 we use a pin on an index page as a proxy to indicate that the reader
633 might still be <quote>in flight</> from the index entry to the matching
634 heap entry. Making <function>ambulkdelete</> block on such a pin ensures
635 that <command>VACUUM</> cannot delete the heap entry before the reader
636 is done with it. This solution costs little in run time, and adds blocking
637 overhead only in the rare cases where there actually is a conflict.
638 </para>
640 <para>
641 This solution requires that index scans be <quote>synchronous</>: we have
642 to fetch each heap tuple immediately after scanning the corresponding index
643 entry. This is expensive for a number of reasons. An
644 <quote>asynchronous</> scan in which we collect many TIDs from the index,
645 and only visit the heap tuples sometime later, requires much less index
646 locking overhead and can allow a more efficient heap access pattern.
647 Per the above analysis, we must use the synchronous approach for
648 non-MVCC-compliant snapshots, but an asynchronous scan is workable
649 for a query using an MVCC snapshot.
650 </para>
652 <para>
653 In an <function>amgetbitmap</> index scan, the access method does not
654 keep an index pin on any of the returned tuples. Therefore
655 it is only safe to use such scans with MVCC-compliant snapshots.
656 </para>
658 </sect1>
660 <sect1 id="index-unique-checks">
661 <title>Index Uniqueness Checks</title>
663 <para>
664 <productname>PostgreSQL</productname> enforces SQL uniqueness constraints
665 using <firstterm>unique indexes</>, which are indexes that disallow
666 multiple entries with identical keys. An access method that supports this
667 feature sets <structname>pg_am</>.<structfield>amcanunique</> true.
668 (At present, only b-tree supports it.)
669 </para>
671 <para>
672 Because of MVCC, it is always necessary to allow duplicate entries to
673 exist physically in an index: the entries might refer to successive
674 versions of a single logical row. The behavior we actually want to
675 enforce is that no MVCC snapshot could include two rows with equal
676 index keys. This breaks down into the following cases that must be
677 checked when inserting a new row into a unique index:
679 <itemizedlist>
680 <listitem>
681 <para>
682 If a conflicting valid row has been deleted by the current transaction,
683 it's okay. (In particular, since an UPDATE always deletes the old row
684 version before inserting the new version, this will allow an UPDATE on
685 a row without changing the key.)
686 </para>
687 </listitem>
688 <listitem>
689 <para>
690 If a conflicting row has been inserted by an as-yet-uncommitted
691 transaction, the would-be inserter must wait to see if that transaction
692 commits. If it rolls back then there is no conflict. If it commits
693 without deleting the conflicting row again, there is a uniqueness
694 violation. (In practice we just wait for the other transaction to
695 end and then redo the visibility check in toto.)
696 </para>
697 </listitem>
698 <listitem>
699 <para>
700 Similarly, if a conflicting valid row has been deleted by an
701 as-yet-uncommitted transaction, the would-be inserter must wait
702 for that transaction to commit or abort, and then repeat the test.
703 </para>
704 </listitem>
705 </itemizedlist>
706 </para>
708 <para>
709 Furthermore, immediately before raising a uniqueness violation
710 according to the above rules, the access method must recheck the
711 liveness of the row being inserted. If it is committed dead then
712 no error should be raised. (This case cannot occur during the
713 ordinary scenario of inserting a row that's just been created by
714 the current transaction. It can happen during
715 <command>CREATE UNIQUE INDEX CONCURRENTLY</>, however.)
716 </para>
718 <para>
719 We require the index access method to apply these tests itself, which
720 means that it must reach into the heap to check the commit status of
721 any row that is shown to have a duplicate key according to the index
722 contents. This is without a doubt ugly and non-modular, but it saves
723 redundant work: if we did a separate probe then the index lookup for
724 a conflicting row would be essentially repeated while finding the place to
725 insert the new row's index entry. What's more, there is no obvious way
726 to avoid race conditions unless the conflict check is an integral part
727 of insertion of the new index entry.
728 </para>
730 <para>
731 The main limitation of this scheme is that it has no convenient way
732 to support deferred uniqueness checks.
733 </para>
735 </sect1>
737 <sect1 id="index-cost-estimation">
738 <title>Index Cost Estimation Functions</title>
740 <para>
741 The amcostestimate function is given a list of WHERE clauses that have
742 been determined to be usable with the index. It must return estimates
743 of the cost of accessing the index and the selectivity of the WHERE
744 clauses (that is, the fraction of parent-table rows that will be
745 retrieved during the index scan). For simple cases, nearly all the
746 work of the cost estimator can be done by calling standard routines
747 in the optimizer; the point of having an amcostestimate function is
748 to allow index access methods to provide index-type-specific knowledge,
749 in case it is possible to improve on the standard estimates.
750 </para>
752 <para>
753 Each amcostestimate function must have the signature:
755 <programlisting>
756 void
757 amcostestimate (PlannerInfo *root,
758 IndexOptInfo *index,
759 List *indexQuals,
760 RelOptInfo *outer_rel,
761 Cost *indexStartupCost,
762 Cost *indexTotalCost,
763 Selectivity *indexSelectivity,
764 double *indexCorrelation);
765 </programlisting>
767 The first four parameters are inputs:
769 <variablelist>
770 <varlistentry>
771 <term>root</term>
772 <listitem>
773 <para>
774 The planner's information about the query being processed.
775 </para>
776 </listitem>
777 </varlistentry>
779 <varlistentry>
780 <term>index</term>
781 <listitem>
782 <para>
783 The index being considered.
784 </para>
785 </listitem>
786 </varlistentry>
788 <varlistentry>
789 <term>indexQuals</term>
790 <listitem>
791 <para>
792 List of index qual clauses (implicitly ANDed);
793 a NIL list indicates no qualifiers are available.
794 Note that the list contains expression trees, not ScanKeys.
795 </para>
796 </listitem>
797 </varlistentry>
799 <varlistentry>
800 <term>outer_rel</term>
801 <listitem>
802 <para>
803 If the index is being considered for use in a join inner indexscan,
804 the planner's information about the outer side of the join. Otherwise
805 NULL. When non-NULL, some of the qual clauses will be join clauses
806 with this rel rather than being simple restriction clauses. Also,
807 the cost estimator should expect that the index scan will be repeated
808 for each row of the outer rel.
809 </para>
810 </listitem>
811 </varlistentry>
812 </variablelist>
813 </para>
815 <para>
816 The last four parameters are pass-by-reference outputs:
818 <variablelist>
819 <varlistentry>
820 <term>*indexStartupCost</term>
821 <listitem>
822 <para>
823 Set to cost of index start-up processing
824 </para>
825 </listitem>
826 </varlistentry>
828 <varlistentry>
829 <term>*indexTotalCost</term>
830 <listitem>
831 <para>
832 Set to total cost of index processing
833 </para>
834 </listitem>
835 </varlistentry>
837 <varlistentry>
838 <term>*indexSelectivity</term>
839 <listitem>
840 <para>
841 Set to index selectivity
842 </para>
843 </listitem>
844 </varlistentry>
846 <varlistentry>
847 <term>*indexCorrelation</term>
848 <listitem>
849 <para>
850 Set to correlation coefficient between index scan order and
851 underlying table's order
852 </para>
853 </listitem>
854 </varlistentry>
855 </variablelist>
856 </para>
858 <para>
859 Note that cost estimate functions must be written in C, not in SQL or
860 any available procedural language, because they must access internal
861 data structures of the planner/optimizer.
862 </para>
864 <para>
865 The index access costs should be computed using the parameters used by
866 <filename>src/backend/optimizer/path/costsize.c</filename>: a sequential
867 disk block fetch has cost <varname>seq_page_cost</>, a nonsequential fetch
868 has cost <varname>random_page_cost</>, and the cost of processing one index
869 row should usually be taken as <varname>cpu_index_tuple_cost</>. In
870 addition, an appropriate multiple of <varname>cpu_operator_cost</> should
871 be charged for any comparison operators invoked during index processing
872 (especially evaluation of the indexQuals themselves).
873 </para>
875 <para>
876 The access costs should include all disk and CPU costs associated with
877 scanning the index itself, but <emphasis>not</> the costs of retrieving or
878 processing the parent-table rows that are identified by the index.
879 </para>
881 <para>
882 The <quote>start-up cost</quote> is the part of the total scan cost that
883 must be expended before we can begin to fetch the first row. For most
884 indexes this can be taken as zero, but an index type with a high start-up
885 cost might want to set it nonzero.
886 </para>
888 <para>
889 The indexSelectivity should be set to the estimated fraction of the parent
890 table rows that will be retrieved during the index scan. In the case
891 of a lossy query, this will typically be higher than the fraction of
892 rows that actually pass the given qual conditions.
893 </para>
895 <para>
896 The indexCorrelation should be set to the correlation (ranging between
897 -1.0 and 1.0) between the index order and the table order. This is used
898 to adjust the estimate for the cost of fetching rows from the parent
899 table.
900 </para>
902 <para>
903 In the join case, the returned numbers should be averages expected for
904 any one scan of the index.
905 </para>
907 <procedure>
908 <title>Cost Estimation</title>
909 <para>
910 A typical cost estimator will proceed as follows:
911 </para>
913 <step>
914 <para>
915 Estimate and return the fraction of parent-table rows that will be visited
916 based on the given qual conditions. In the absence of any index-type-specific
917 knowledge, use the standard optimizer function <function>clauselist_selectivity()</function>:
919 <programlisting>
920 *indexSelectivity = clauselist_selectivity(root, indexQuals,
921 index-&gt;rel-&gt;relid,
922 JOIN_INNER, NULL);
923 </programlisting>
924 </para>
925 </step>
927 <step>
928 <para>
929 Estimate the number of index rows that will be visited during the
930 scan. For many index types this is the same as indexSelectivity times
931 the number of rows in the index, but it might be more. (Note that the
932 index's size in pages and rows is available from the IndexOptInfo struct.)
933 </para>
934 </step>
936 <step>
937 <para>
938 Estimate the number of index pages that will be retrieved during the scan.
939 This might be just indexSelectivity times the index's size in pages.
940 </para>
941 </step>
943 <step>
944 <para>
945 Compute the index access cost. A generic estimator might do this:
947 <programlisting>
949 * Our generic assumption is that the index pages will be read
950 * sequentially, so they cost seq_page_cost each, not random_page_cost.
951 * Also, we charge for evaluation of the indexquals at each index row.
952 * All the costs are assumed to be paid incrementally during the scan.
954 cost_qual_eval(&amp;index_qual_cost, indexQuals, root);
955 *indexStartupCost = index_qual_cost.startup;
956 *indexTotalCost = seq_page_cost * numIndexPages +
957 (cpu_index_tuple_cost + index_qual_cost.per_tuple) * numIndexTuples;
958 </programlisting>
960 However, the above does not account for amortization of index reads
961 across repeated index scans in the join case.
962 </para>
963 </step>
965 <step>
966 <para>
967 Estimate the index correlation. For a simple ordered index on a single
968 field, this can be retrieved from pg_statistic. If the correlation
969 is not known, the conservative estimate is zero (no correlation).
970 </para>
971 </step>
972 </procedure>
974 <para>
975 Examples of cost estimator functions can be found in
976 <filename>src/backend/utils/adt/selfuncs.c</filename>.
977 </para>
978 </sect1>
979 </chapter>