Repair memory leaks in plpython.
[pgsql.git] / doc / src / sgml / btree.sgml
blob2b3997988cff0fd38db32da643266c05fab7ba1a
1 <!-- doc/src/sgml/btree.sgml -->
3 <sect1 id="btree">
4 <title>B-Tree Indexes</title>
6 <indexterm>
7 <primary>index</primary>
8 <secondary>B-Tree</secondary>
9 </indexterm>
11 <sect2 id="btree-intro">
12 <title>Introduction</title>
14 <para>
15 <productname>PostgreSQL</productname> includes an implementation of the
16 standard <acronym>btree</acronym> (multi-way balanced tree) index data
17 structure. Any data type that can be sorted into a well-defined linear
18 order can be indexed by a btree index. The only limitation is that an
19 index entry cannot exceed approximately one-third of a page (after TOAST
20 compression, if applicable).
21 </para>
23 <para>
24 Because each btree operator class imposes a sort order on its data type,
25 btree operator classes (or, really, operator families) have come to be
26 used as <productname>PostgreSQL</productname>'s general representation
27 and understanding of sorting semantics. Therefore, they've acquired
28 some features that go beyond what would be needed just to support btree
29 indexes, and parts of the system that are quite distant from the
30 btree <acronym>AM</acronym> make use of them.
31 </para>
33 </sect2>
35 <sect2 id="btree-behavior">
36 <title>Behavior of B-Tree Operator Classes</title>
38 <para>
39 As shown in <xref linkend="xindex-btree-strat-table"/>, a btree operator
40 class must provide five comparison operators,
41 <literal>&lt;</literal>,
42 <literal>&lt;=</literal>,
43 <literal>=</literal>,
44 <literal>&gt;=</literal> and
45 <literal>&gt;</literal>.
46 One might expect that <literal>&lt;&gt;</literal> should also be part of
47 the operator class, but it is not, because it would almost never be
48 useful to use a <literal>&lt;&gt;</literal> WHERE clause in an index
49 search. (For some purposes, the planner treats <literal>&lt;&gt;</literal>
50 as associated with a btree operator class; but it finds that operator via
51 the <literal>=</literal> operator's negator link, rather than
52 from <structname>pg_amop</structname>.)
53 </para>
55 <para>
56 When several data types share near-identical sorting semantics, their
57 operator classes can be grouped into an operator family. Doing so is
58 advantageous because it allows the planner to make deductions about
59 cross-type comparisons. Each operator class within the family should
60 contain the single-type operators (and associated support functions)
61 for its input data type, while cross-type comparison operators and
62 support functions are <quote>loose</quote> in the family. It is
63 recommendable that a complete set of cross-type operators be included
64 in the family, thus ensuring that the planner can represent any
65 comparison conditions that it deduces from transitivity.
66 </para>
68 <para>
69 There are some basic assumptions that a btree operator family must
70 satisfy:
71 </para>
73 <itemizedlist>
74 <listitem>
75 <para>
76 An <literal>=</literal> operator must be an equivalence relation; that
77 is, for all non-null values <replaceable>A</replaceable>,
78 <replaceable>B</replaceable>, <replaceable>C</replaceable> of the
79 data type:
81 <itemizedlist>
82 <listitem>
83 <para>
84 <replaceable>A</replaceable> <literal>=</literal>
85 <replaceable>A</replaceable> is true
86 (<firstterm>reflexive law</firstterm>)
87 </para>
88 </listitem>
89 <listitem>
90 <para>
91 if <replaceable>A</replaceable> <literal>=</literal>
92 <replaceable>B</replaceable>,
93 then <replaceable>B</replaceable> <literal>=</literal>
94 <replaceable>A</replaceable>
95 (<firstterm>symmetric law</firstterm>)
96 </para>
97 </listitem>
98 <listitem>
99 <para>
100 if <replaceable>A</replaceable> <literal>=</literal>
101 <replaceable>B</replaceable> and <replaceable>B</replaceable>
102 <literal>=</literal> <replaceable>C</replaceable>,
103 then <replaceable>A</replaceable> <literal>=</literal>
104 <replaceable>C</replaceable>
105 (<firstterm>transitive law</firstterm>)
106 </para>
107 </listitem>
108 </itemizedlist>
109 </para>
110 </listitem>
112 <listitem>
113 <para>
114 A <literal>&lt;</literal> operator must be a strong ordering relation;
115 that is, for all non-null values <replaceable>A</replaceable>,
116 <replaceable>B</replaceable>, <replaceable>C</replaceable>:
118 <itemizedlist>
119 <listitem>
120 <para>
121 <replaceable>A</replaceable> <literal>&lt;</literal>
122 <replaceable>A</replaceable> is false
123 (<firstterm>irreflexive law</firstterm>)
124 </para>
125 </listitem>
126 <listitem>
127 <para>
128 if <replaceable>A</replaceable> <literal>&lt;</literal>
129 <replaceable>B</replaceable>
130 and <replaceable>B</replaceable> <literal>&lt;</literal>
131 <replaceable>C</replaceable>,
132 then <replaceable>A</replaceable> <literal>&lt;</literal>
133 <replaceable>C</replaceable>
134 (<firstterm>transitive law</firstterm>)
135 </para>
136 </listitem>
137 </itemizedlist>
138 </para>
139 </listitem>
141 <listitem>
142 <para>
143 Furthermore, the ordering is total; that is, for all non-null
144 values <replaceable>A</replaceable>, <replaceable>B</replaceable>:
146 <itemizedlist>
147 <listitem>
148 <para>
149 exactly one of <replaceable>A</replaceable> <literal>&lt;</literal>
150 <replaceable>B</replaceable>, <replaceable>A</replaceable>
151 <literal>=</literal> <replaceable>B</replaceable>, and
152 <replaceable>B</replaceable> <literal>&lt;</literal>
153 <replaceable>A</replaceable> is true
154 (<firstterm>trichotomy law</firstterm>)
155 </para>
156 </listitem>
157 </itemizedlist>
159 (The trichotomy law justifies the definition of the comparison support
160 function, of course.)
161 </para>
162 </listitem>
163 </itemizedlist>
165 <para>
166 The other three operators are defined in terms of <literal>=</literal>
167 and <literal>&lt;</literal> in the obvious way, and must act consistently
168 with them.
169 </para>
171 <para>
172 For an operator family supporting multiple data types, the above laws must
173 hold when <replaceable>A</replaceable>, <replaceable>B</replaceable>,
174 <replaceable>C</replaceable> are taken from any data types in the family.
175 The transitive laws are the trickiest to ensure, as in cross-type
176 situations they represent statements that the behaviors of two or three
177 different operators are consistent.
178 As an example, it would not work to put <type>float8</type>
179 and <type>numeric</type> into the same operator family, at least not with
180 the current semantics that <type>numeric</type> values are converted
181 to <type>float8</type> for comparison to a <type>float8</type>. Because
182 of the limited accuracy of <type>float8</type>, this means there are
183 distinct <type>numeric</type> values that will compare equal to the
184 same <type>float8</type> value, and thus the transitive law would fail.
185 </para>
187 <para>
188 Another requirement for a multiple-data-type family is that any implicit
189 or binary-coercion casts that are defined between data types included in
190 the operator family must not change the associated sort ordering.
191 </para>
193 <para>
194 It should be fairly clear why a btree index requires these laws to hold
195 within a single data type: without them there is no ordering to arrange
196 the keys with. Also, index searches using a comparison key of a
197 different data type require comparisons to behave sanely across two
198 data types. The extensions to three or more data types within a family
199 are not strictly required by the btree index mechanism itself, but the
200 planner relies on them for optimization purposes.
201 </para>
203 </sect2>
205 <sect2 id="btree-support-funcs">
206 <title>B-Tree Support Functions</title>
208 <para>
209 As shown in <xref linkend="xindex-btree-support-table"/>, btree defines
210 one required and four optional support functions. The five
211 user-defined methods are:
212 </para>
213 <variablelist>
214 <varlistentry>
215 <term><function>order</function></term>
216 <listitem>
217 <para>
218 For each combination of data types that a btree operator family
219 provides comparison operators for, it must provide a comparison
220 support function, registered in
221 <structname>pg_amproc</structname> with support function number 1
223 <structfield>amproclefttype</structfield>/<structfield>amprocrighttype</structfield>
224 equal to the left and right data types for the comparison (i.e.,
225 the same data types that the matching operators are registered
226 with in <structname>pg_amop</structname>). The comparison
227 function must take two non-null values
228 <replaceable>A</replaceable> and <replaceable>B</replaceable> and
229 return an <type>int32</type> value that is
230 <literal>&lt;</literal> <literal>0</literal>,
231 <literal>0</literal>, or <literal>&gt;</literal>
232 <literal>0</literal> when <replaceable>A</replaceable>
233 <literal>&lt;</literal> <replaceable>B</replaceable>,
234 <replaceable>A</replaceable> <literal>=</literal>
235 <replaceable>B</replaceable>, or <replaceable>A</replaceable>
236 <literal>&gt;</literal> <replaceable>B</replaceable>,
237 respectively. A null result is disallowed: all values of the
238 data type must be comparable. See
239 <filename>src/backend/access/nbtree/nbtcompare.c</filename> for
240 examples.
241 </para>
243 <para>
244 If the compared values are of a collatable data type, the
245 appropriate collation OID will be passed to the comparison
246 support function, using the standard
247 <function>PG_GET_COLLATION()</function> mechanism.
248 </para>
249 </listitem>
250 </varlistentry>
251 <varlistentry>
252 <term><function>sortsupport</function></term>
253 <listitem>
254 <para>
255 Optionally, a btree operator family may provide <firstterm>sort
256 support</firstterm> function(s), registered under support
257 function number 2. These functions allow implementing
258 comparisons for sorting purposes in a more efficient way than
259 naively calling the comparison support function. The APIs
260 involved in this are defined in
261 <filename>src/include/utils/sortsupport.h</filename>.
262 </para>
263 </listitem>
264 </varlistentry>
265 <varlistentry>
266 <term><function>in_range</function></term>
267 <listitem>
268 <indexterm>
269 <primary>in_range support functions</primary>
270 </indexterm>
272 <indexterm>
273 <primary>support functions</primary>
274 <secondary>in_range</secondary>
275 </indexterm>
276 <para>
277 Optionally, a btree operator family may provide
278 <firstterm>in_range</firstterm> support function(s), registered
279 under support function number 3. These are not used during btree
280 index operations; rather, they extend the semantics of the
281 operator family so that it can support window clauses containing
282 the <literal>RANGE</literal> <replaceable>offset</replaceable>
283 <literal>PRECEDING</literal> and <literal>RANGE</literal>
284 <replaceable>offset</replaceable> <literal>FOLLOWING</literal>
285 frame bound types (see <xref
286 linkend="syntax-window-functions"/>). Fundamentally, the extra
287 information provided is how to add or subtract an
288 <replaceable>offset</replaceable> value in a way that is
289 compatible with the family's data ordering.
290 </para>
292 <para>
293 An <function>in_range</function> function must have the signature
294 <synopsis>
295 in_range(<replaceable>val</replaceable> type1, <replaceable>base</replaceable> type1, <replaceable>offset</replaceable> type2, <replaceable>sub</replaceable> bool, <replaceable>less</replaceable> bool)
296 returns bool
297 </synopsis>
298 <replaceable>val</replaceable> and
299 <replaceable>base</replaceable> must be of the same type, which
300 is one of the types supported by the operator family (i.e., a
301 type for which it provides an ordering). However,
302 <replaceable>offset</replaceable> could be of a different type,
303 which might be one otherwise unsupported by the family. An
304 example is that the built-in <literal>time_ops</literal> family
305 provides an <function>in_range</function> function that has
306 <replaceable>offset</replaceable> of type <type>interval</type>.
307 A family can provide <function>in_range</function> functions for
308 any of its supported types and one or more
309 <replaceable>offset</replaceable> types. Each
310 <function>in_range</function> function should be entered in
311 <structname>pg_amproc</structname> with
312 <structfield>amproclefttype</structfield> equal to
313 <type>type1</type> and <structfield>amprocrighttype</structfield>
314 equal to <type>type2</type>.
315 </para>
317 <para>
318 The essential semantics of an <function>in_range</function>
319 function depend on the two Boolean flag parameters. It should
320 add or subtract <replaceable>base</replaceable> and
321 <replaceable>offset</replaceable>, then compare
322 <replaceable>val</replaceable> to the result, as follows:
323 <itemizedlist>
324 <listitem>
325 <para>
326 if <literal>!</literal><replaceable>sub</replaceable> and
327 <literal>!</literal><replaceable>less</replaceable>, return
328 <replaceable>val</replaceable> <literal>&gt;=</literal>
329 (<replaceable>base</replaceable> <literal>+</literal>
330 <replaceable>offset</replaceable>)
331 </para>
332 </listitem>
333 <listitem>
334 <para>
335 if <literal>!</literal><replaceable>sub</replaceable> and
336 <replaceable>less</replaceable>, return
337 <replaceable>val</replaceable> <literal>&lt;=</literal>
338 (<replaceable>base</replaceable> <literal>+</literal>
339 <replaceable>offset</replaceable>)
340 </para>
341 </listitem>
342 <listitem>
343 <para>
344 if <replaceable>sub</replaceable> and
345 <literal>!</literal><replaceable>less</replaceable>, return
346 <replaceable>val</replaceable> <literal>&gt;=</literal>
347 (<replaceable>base</replaceable> <literal>-</literal>
348 <replaceable>offset</replaceable>)
349 </para>
350 </listitem>
351 <listitem>
352 <para>
353 if <replaceable>sub</replaceable> and
354 <replaceable>less</replaceable>, return
355 <replaceable>val</replaceable> <literal>&lt;=</literal>
356 (<replaceable>base</replaceable> <literal>-</literal>
357 <replaceable>offset</replaceable>)
358 </para>
359 </listitem>
360 </itemizedlist>
361 Before doing so, the function should check the sign of
362 <replaceable>offset</replaceable>: if it is less than zero, raise
363 error
364 <literal>ERRCODE_INVALID_PRECEDING_OR_FOLLOWING_SIZE</literal>
365 (22013) with error text like <quote>invalid preceding or
366 following size in window function</quote>. (This is required by
367 the SQL standard, although nonstandard operator families might
368 perhaps choose to ignore this restriction, since there seems to
369 be little semantic necessity for it.) This requirement is
370 delegated to the <function>in_range</function> function so that
371 the core code needn't understand what <quote>less than
372 zero</quote> means for a particular data type.
373 </para>
375 <para>
376 An additional expectation is that <function>in_range</function>
377 functions should, if practical, avoid throwing an error if
378 <replaceable>base</replaceable> <literal>+</literal>
379 <replaceable>offset</replaceable> or
380 <replaceable>base</replaceable> <literal>-</literal>
381 <replaceable>offset</replaceable> would overflow. The correct
382 comparison result can be determined even if that value would be
383 out of the data type's range. Note that if the data type
384 includes concepts such as <quote>infinity</quote> or
385 <quote>NaN</quote>, extra care may be needed to ensure that
386 <function>in_range</function>'s results agree with the normal
387 sort order of the operator family.
388 </para>
390 <para>
391 The results of the <function>in_range</function> function must be
392 consistent with the sort ordering imposed by the operator family.
393 To be precise, given any fixed values of
394 <replaceable>offset</replaceable> and
395 <replaceable>sub</replaceable>, then:
396 <itemizedlist>
397 <listitem>
398 <para>
399 If <function>in_range</function> with
400 <replaceable>less</replaceable> = true is true for some
401 <replaceable>val1</replaceable> and
402 <replaceable>base</replaceable>, it must be true for every
403 <replaceable>val2</replaceable> <literal>&lt;=</literal>
404 <replaceable>val1</replaceable> with the same
405 <replaceable>base</replaceable>.
406 </para>
407 </listitem>
408 <listitem>
409 <para>
410 If <function>in_range</function> with
411 <replaceable>less</replaceable> = true is false for some
412 <replaceable>val1</replaceable> and
413 <replaceable>base</replaceable>, it must be false for every
414 <replaceable>val2</replaceable> <literal>&gt;=</literal>
415 <replaceable>val1</replaceable> with the same
416 <replaceable>base</replaceable>.
417 </para>
418 </listitem>
419 <listitem>
420 <para>
421 If <function>in_range</function> with
422 <replaceable>less</replaceable> = true is true for some
423 <replaceable>val</replaceable> and
424 <replaceable>base1</replaceable>, it must be true for every
425 <replaceable>base2</replaceable> <literal>&gt;=</literal>
426 <replaceable>base1</replaceable> with the same
427 <replaceable>val</replaceable>.
428 </para>
429 </listitem>
430 <listitem>
431 <para>
432 If <function>in_range</function> with
433 <replaceable>less</replaceable> = true is false for some
434 <replaceable>val</replaceable> and
435 <replaceable>base1</replaceable>, it must be false for every
436 <replaceable>base2</replaceable> <literal>&lt;=</literal>
437 <replaceable>base1</replaceable> with the same
438 <replaceable>val</replaceable>.
439 </para>
440 </listitem>
441 </itemizedlist>
442 Analogous statements with inverted conditions hold when
443 <replaceable>less</replaceable> = false.
444 </para>
446 <para>
447 If the type being ordered (<type>type1</type>) is collatable, the
448 appropriate collation OID will be passed to the
449 <function>in_range</function> function, using the standard
450 PG_GET_COLLATION() mechanism.
451 </para>
453 <para>
454 <function>in_range</function> functions need not handle NULL
455 inputs, and typically will be marked strict.
456 </para>
457 </listitem>
458 </varlistentry>
459 <varlistentry>
460 <term><function>equalimage</function></term>
461 <listitem>
462 <para>
463 Optionally, a btree operator family may provide
464 <function>equalimage</function> (<quote>equality implies image
465 equality</quote>) support functions, registered under support
466 function number 4. These functions allow the core code to
467 determine when it is safe to apply the btree deduplication
468 optimization. Currently, <function>equalimage</function>
469 functions are only called when building or rebuilding an index.
470 </para>
471 <para>
472 An <function>equalimage</function> function must have the
473 signature
474 <synopsis>
475 equalimage(<replaceable>opcintype</replaceable> <type>oid</type>) returns bool
476 </synopsis>
477 The return value is static information about an operator class
478 and collation. Returning <literal>true</literal> indicates that
479 the <function>order</function> function for the operator class is
480 guaranteed to only return <literal>0</literal> (<quote>arguments
481 are equal</quote>) when its <replaceable>A</replaceable> and
482 <replaceable>B</replaceable> arguments are also interchangeable
483 without any loss of semantic information. Not registering an
484 <function>equalimage</function> function or returning
485 <literal>false</literal> indicates that this condition cannot be
486 assumed to hold.
487 </para>
488 <para>
489 The <replaceable>opcintype</replaceable> argument is the
490 <literal><structname>pg_type</structname>.oid</literal> of the
491 data type that the operator class indexes. This is a convenience
492 that allows reuse of the same underlying
493 <function>equalimage</function> function across operator classes.
494 If <replaceable>opcintype</replaceable> is a collatable data
495 type, the appropriate collation OID will be passed to the
496 <function>equalimage</function> function, using the standard
497 <function>PG_GET_COLLATION()</function> mechanism.
498 </para>
499 <para>
500 As far as the operator class is concerned, returning
501 <literal>true</literal> indicates that deduplication is safe (or
502 safe for the collation whose OID was passed to its
503 <function>equalimage</function> function). However, the core
504 code will only deem deduplication safe for an index when
505 <emphasis>every</emphasis> indexed column uses an operator class
506 that registers an <function>equalimage</function> function, and
507 each function actually returns <literal>true</literal> when
508 called.
509 </para>
510 <para>
511 Image equality is <emphasis>almost</emphasis> the same condition
512 as simple bitwise equality. There is one subtle difference: When
513 indexing a varlena data type, the on-disk representation of two
514 image equal datums may not be bitwise equal due to inconsistent
515 application of <acronym>TOAST</acronym> compression on input.
516 Formally, when an operator class's
517 <function>equalimage</function> function returns
518 <literal>true</literal>, it is safe to assume that the
519 <literal>datum_image_eq()</literal> C function will always agree
520 with the operator class's <function>order</function> function
521 (provided that the same collation OID is passed to both the
522 <function>equalimage</function> and <function>order</function>
523 functions).
524 </para>
525 <para>
526 The core code is fundamentally unable to deduce anything about
527 the <quote>equality implies image equality</quote> status of an
528 operator class within a multiple-data-type family based on
529 details from other operator classes in the same family. Also, it
530 is not sensible for an operator family to register a cross-type
531 <function>equalimage</function> function, and attempting to do so
532 will result in an error. This is because <quote>equality implies
533 image equality</quote> status does not just depend on
534 sorting/equality semantics, which are more or less defined at the
535 operator family level. In general, the semantics that one
536 particular data type implements must be considered separately.
537 </para>
538 <para>
539 The convention followed by the operator classes included with the
540 core <productname>PostgreSQL</productname> distribution is to
541 register a stock, generic <function>equalimage</function>
542 function. Most operator classes register
543 <function>btequalimage()</function>, which indicates that
544 deduplication is safe unconditionally. Operator classes for
545 collatable data types such as <type>text</type> register
546 <function>btvarstrequalimage()</function>, which indicates that
547 deduplication is safe with deterministic collations. Best
548 practice for third-party extensions is to register their own
549 custom function to retain control.
550 </para>
551 </listitem>
552 </varlistentry>
553 <varlistentry>
554 <term><function>options</function></term>
555 <listitem>
556 <para>
557 Optionally, a B-tree operator family may provide
558 <function>options</function> (<quote>operator class specific
559 options</quote>) support functions, registered under support
560 function number 5. These functions define a set of user-visible
561 parameters that control operator class behavior.
562 </para>
563 <para>
564 An <function>options</function> support function must have the
565 signature
566 <synopsis>
567 options(<replaceable>relopts</replaceable> <type>local_relopts *</type>) returns void
568 </synopsis>
569 The function is passed a pointer to a <structname>local_relopts</structname>
570 struct, which needs to be filled with a set of operator class
571 specific options. The options can be accessed from other support
572 functions using the <literal>PG_HAS_OPCLASS_OPTIONS()</literal> and
573 <literal>PG_GET_OPCLASS_OPTIONS()</literal> macros.
574 </para>
575 <para>
576 Currently, no B-Tree operator class has an <function>options</function>
577 support function. B-tree doesn't allow flexible representation of keys
578 like GiST, SP-GiST, GIN and BRIN do. So, <function>options</function>
579 probably doesn't have much application in the current B-tree index
580 access method. Nevertheless, this support function was added to B-tree
581 for uniformity, and will probably find uses during further
582 evolution of B-tree in <productname>PostgreSQL</productname>.
583 </para>
584 </listitem>
585 </varlistentry>
586 </variablelist>
588 </sect2>
590 <sect2 id="btree-implementation">
591 <title>Implementation</title>
593 <para>
594 This section covers B-Tree index implementation details that may be
595 of use to advanced users. See
596 <filename>src/backend/access/nbtree/README</filename> in the source
597 distribution for a much more detailed, internals-focused description
598 of the B-Tree implementation.
599 </para>
600 <sect3 id="btree-structure">
601 <title>B-Tree Structure</title>
602 <para>
603 <productname>PostgreSQL</productname> B-Tree indexes are
604 multi-level tree structures, where each level of the tree can be
605 used as a doubly-linked list of pages. A single metapage is stored
606 in a fixed position at the start of the first segment file of the
607 index. All other pages are either leaf pages or internal pages.
608 Leaf pages are the pages on the lowest level of the tree. All
609 other levels consist of internal pages. Each leaf page contains
610 tuples that point to table rows. Each internal page contains
611 tuples that point to the next level down in the tree. Typically,
612 over 99% of all pages are leaf pages. Both internal pages and leaf
613 pages use the standard page format described in <xref
614 linkend="storage-page-layout"/>.
615 </para>
616 <para>
617 New leaf pages are added to a B-Tree index when an existing leaf
618 page cannot fit an incoming tuple. A <firstterm>page
619 split</firstterm> operation makes room for items that originally
620 belonged on the overflowing page by moving a portion of the items
621 to a new page. Page splits must also insert a new
622 <firstterm>downlink</firstterm> to the new page in the parent page,
623 which may cause the parent to split in turn. Page splits
624 <quote>cascade upwards</quote> in a recursive fashion. When the
625 root page finally cannot fit a new downlink, a <firstterm>root page
626 split</firstterm> operation takes place. This adds a new level to
627 the tree structure by creating a new root page that is one level
628 above the original root page.
629 </para>
630 </sect3>
632 <sect3 id="btree-deletion">
633 <title>Bottom-up Index Deletion</title>
634 <para>
635 B-Tree indexes are not directly aware that under MVCC, there might
636 be multiple extant versions of the same logical table row; to an
637 index, each tuple is an independent object that needs its own index
638 entry. <quote>Version churn</quote> tuples may sometimes
639 accumulate and adversely affect query latency and throughput. This
640 typically occurs with <command>UPDATE</command>-heavy workloads
641 where most individual updates cannot apply the
642 <link linkend="storage-hot"><acronym>HOT</acronym> optimization.</link>
643 Changing the value of only
644 one column covered by one index during an <command>UPDATE</command>
645 <emphasis>always</emphasis> necessitates a new set of index tuples
646 &mdash; one for <emphasis>each and every</emphasis> index on the
647 table. Note in particular that this includes indexes that were not
648 <quote>logically modified</quote> by the <command>UPDATE</command>.
649 All indexes will need a successor physical index tuple that points
650 to the latest version in the table. Each new tuple within each
651 index will generally need to coexist with the original
652 <quote>updated</quote> tuple for a short period of time (typically
653 until shortly after the <command>UPDATE</command> transaction
654 commits).
655 </para>
656 <para>
657 B-Tree indexes incrementally delete version churn index tuples by
658 performing <firstterm>bottom-up index deletion</firstterm> passes.
659 Each deletion pass is triggered in reaction to an anticipated
660 <quote>version churn page split</quote>. This only happens with
661 indexes that are not logically modified by
662 <command>UPDATE</command> statements, where concentrated build up
663 of obsolete versions in particular pages would occur otherwise. A
664 page split will usually be avoided, though it's possible that
665 certain implementation-level heuristics will fail to identify and
666 delete even one garbage index tuple (in which case a page split or
667 deduplication pass resolves the issue of an incoming new tuple not
668 fitting on a leaf page). The worst-case number of versions that
669 any index scan must traverse (for any single logical row) is an
670 important contributor to overall system responsiveness and
671 throughput. A bottom-up index deletion pass targets suspected
672 garbage tuples in a single leaf page based on
673 <emphasis>qualitative</emphasis> distinctions involving logical
674 rows and versions. This contrasts with the <quote>top-down</quote>
675 index cleanup performed by autovacuum workers, which is triggered
676 when certain <emphasis>quantitative</emphasis> table-level
677 thresholds are exceeded (see <xref linkend="autovacuum"/>).
678 </para>
679 <note>
680 <para>
681 Not all deletion operations that are performed within B-Tree
682 indexes are bottom-up deletion operations. There is a distinct
683 category of index tuple deletion: <firstterm>simple index tuple
684 deletion</firstterm>. This is a deferred maintenance operation
685 that deletes index tuples that are known to be safe to delete
686 (those whose item identifier's <literal>LP_DEAD</literal> bit is
687 already set). Like bottom-up index deletion, simple index
688 deletion takes place at the point that a page split is anticipated
689 as a way of avoiding the split.
690 </para>
691 <para>
692 Simple deletion is opportunistic in the sense that it can only
693 take place when recent index scans set the
694 <literal>LP_DEAD</literal> bits of affected items in passing.
695 Prior to <productname>PostgreSQL</productname> 14, the only
696 category of B-Tree deletion was simple deletion. The main
697 differences between it and bottom-up deletion are that only the
698 former is opportunistically driven by the activity of passing
699 index scans, while only the latter specifically targets version
700 churn from <command>UPDATE</command>s that do not logically modify
701 indexed columns.
702 </para>
703 </note>
704 <para>
705 Bottom-up index deletion performs the vast majority of all garbage
706 index tuple cleanup for particular indexes with certain workloads.
707 This is expected with any B-Tree index that is subject to
708 significant version churn from <command>UPDATE</command>s that
709 rarely or never logically modify the columns that the index covers.
710 The average and worst-case number of versions per logical row can
711 be kept low purely through targeted incremental deletion passes.
712 It's quite possible that the on-disk size of certain indexes will
713 never increase by even one single page/block despite
714 <emphasis>constant</emphasis> version churn from
715 <command>UPDATE</command>s. Even then, an exhaustive <quote>clean
716 sweep</quote> by a <command>VACUUM</command> operation (typically
717 run in an autovacuum worker process) will eventually be required as
718 a part of <emphasis>collective</emphasis> cleanup of the table and
719 each of its indexes.
720 </para>
721 <para>
722 Unlike <command>VACUUM</command>, bottom-up index deletion does not
723 provide any strong guarantees about how old the oldest garbage
724 index tuple may be. No index can be permitted to retain
725 <quote>floating garbage</quote> index tuples that became dead prior
726 to a conservative cutoff point shared by the table and all of its
727 indexes collectively. This fundamental table-level invariant makes
728 it safe to recycle table <acronym>TID</acronym>s. This is how it
729 is possible for distinct logical rows to reuse the same table
730 <acronym>TID</acronym> over time (though this can never happen with
731 two logical rows whose lifetimes span the same
732 <command>VACUUM</command> cycle).
733 </para>
734 </sect3>
736 <sect3 id="btree-deduplication">
737 <title>Deduplication</title>
738 <para>
739 A duplicate is a leaf page tuple (a tuple that points to a table
740 row) where <emphasis>all</emphasis> indexed key columns have values
741 that match corresponding column values from at least one other leaf
742 page tuple in the same index. Duplicate tuples are quite common in
743 practice. B-Tree indexes can use a special, space-efficient
744 representation for duplicates when an optional technique is
745 enabled: <firstterm>deduplication</firstterm>.
746 </para>
747 <para>
748 Deduplication works by periodically merging groups of duplicate
749 tuples together, forming a single <firstterm>posting list</firstterm> tuple for each
750 group. The column key value(s) only appear once in this
751 representation. This is followed by a sorted array of
752 <acronym>TID</acronym>s that point to rows in the table. This
753 significantly reduces the storage size of indexes where each value
754 (or each distinct combination of column values) appears several
755 times on average. The latency of queries can be reduced
756 significantly. Overall query throughput may increase
757 significantly. The overhead of routine index vacuuming may also be
758 reduced significantly.
759 </para>
760 <note>
761 <para>
762 B-Tree deduplication is just as effective with
763 <quote>duplicates</quote> that contain a NULL value, even though
764 NULL values are never equal to each other according to the
765 <literal>=</literal> member of any B-Tree operator class. As far
766 as any part of the implementation that understands the on-disk
767 B-Tree structure is concerned, NULL is just another value from the
768 domain of indexed values.
769 </para>
770 </note>
771 <para>
772 The deduplication process occurs lazily, when a new item is
773 inserted that cannot fit on an existing leaf page, though only when
774 index tuple deletion could not free sufficient space for the new
775 item (typically deletion is briefly considered and then skipped
776 over). Unlike GIN posting list tuples, B-Tree posting list tuples
777 do not need to expand every time a new duplicate is inserted; they
778 are merely an alternative physical representation of the original
779 logical contents of the leaf page. This design prioritizes
780 consistent performance with mixed read-write workloads. Most
781 client applications will at least see a moderate performance
782 benefit from using deduplication. Deduplication is enabled by
783 default.
784 </para>
785 <para>
786 <command>CREATE INDEX</command> and <command>REINDEX</command>
787 apply deduplication to create posting list tuples, though the
788 strategy they use is slightly different. Each group of duplicate
789 ordinary tuples encountered in the sorted input taken from the
790 table is merged into a posting list tuple
791 <emphasis>before</emphasis> being added to the current pending leaf
792 page. Individual posting list tuples are packed with as many
793 <acronym>TID</acronym>s as possible. Leaf pages are written out in
794 the usual way, without any separate deduplication pass. This
795 strategy is well-suited to <command>CREATE INDEX</command> and
796 <command>REINDEX</command> because they are once-off batch
797 operations.
798 </para>
799 <para>
800 Write-heavy workloads that don't benefit from deduplication due to
801 having few or no duplicate values in indexes will incur a small,
802 fixed performance penalty (unless deduplication is explicitly
803 disabled). The <literal>deduplicate_items</literal> storage
804 parameter can be used to disable deduplication within individual
805 indexes. There is never any performance penalty with read-only
806 workloads, since reading posting list tuples is at least as
807 efficient as reading the standard tuple representation. Disabling
808 deduplication isn't usually helpful.
809 </para>
810 <para>
811 It is sometimes possible for unique indexes (as well as unique
812 constraints) to use deduplication. This allows leaf pages to
813 temporarily <quote>absorb</quote> extra version churn duplicates.
814 Deduplication in unique indexes augments bottom-up index deletion,
815 especially in cases where a long-running transaction holds a
816 snapshot that blocks garbage collection. The goal is to buy time
817 for the bottom-up index deletion strategy to become effective
818 again. Delaying page splits until a single long-running
819 transaction naturally goes away can allow a bottom-up deletion pass
820 to succeed where an earlier deletion pass failed.
821 </para>
822 <tip>
823 <para>
824 A special heuristic is applied to determine whether a
825 deduplication pass in a unique index should take place. It can
826 often skip straight to splitting a leaf page, avoiding a
827 performance penalty from wasting cycles on unhelpful deduplication
828 passes. If you're concerned about the overhead of deduplication,
829 consider setting <literal>deduplicate_items = off</literal>
830 selectively. Leaving deduplication enabled in unique indexes has
831 little downside.
832 </para>
833 </tip>
834 <para>
835 Deduplication cannot be used in all cases due to
836 implementation-level restrictions. Deduplication safety is
837 determined when <command>CREATE INDEX</command> or
838 <command>REINDEX</command> is run.
839 </para>
840 <para>
841 Note that deduplication is deemed unsafe and cannot be used in the
842 following cases involving semantically significant differences
843 among equal datums:
844 </para>
845 <para>
846 <itemizedlist>
847 <listitem>
848 <para>
849 <type>text</type>, <type>varchar</type>, and <type>char</type>
850 cannot use deduplication when a
851 <emphasis>nondeterministic</emphasis> collation is used. Case
852 and accent differences must be preserved among equal datums.
853 </para>
854 </listitem>
856 <listitem>
857 <para>
858 <type>numeric</type> cannot use deduplication. Numeric display
859 scale must be preserved among equal datums.
860 </para>
861 </listitem>
863 <listitem>
864 <para>
865 <type>jsonb</type> cannot use deduplication, since the
866 <type>jsonb</type> B-Tree operator class uses
867 <type>numeric</type> internally.
868 </para>
869 </listitem>
871 <listitem>
872 <para>
873 <type>float4</type> and <type>float8</type> cannot use
874 deduplication. These types have distinct representations for
875 <literal>-0</literal> and <literal>0</literal>, which are
876 nevertheless considered equal. This difference must be
877 preserved.
878 </para>
879 </listitem>
880 </itemizedlist>
881 </para>
882 <para>
883 There is one further implementation-level restriction that may be
884 lifted in a future version of
885 <productname>PostgreSQL</productname>:
886 </para>
887 <para>
888 <itemizedlist>
889 <listitem>
890 <para>
891 Container types (such as composite types, arrays, or range
892 types) cannot use deduplication.
893 </para>
894 </listitem>
895 </itemizedlist>
896 </para>
897 <para>
898 There is one further implementation-level restriction that applies
899 regardless of the operator class or collation used:
900 </para>
901 <para>
902 <itemizedlist>
903 <listitem>
904 <para>
905 <literal>INCLUDE</literal> indexes can never use deduplication.
906 </para>
907 </listitem>
908 </itemizedlist>
909 </para>
911 </sect3>
912 </sect2>
914 </sect1>