At update of non-LP_NORMAL TID, fail instead of corrupting page header.
[pgsql.git] / doc / src / sgml / gin.sgml
blob46e87e01324dd00c460dd98746788bd95d5ccfa3
1 <!-- doc/src/sgml/gin.sgml -->
3 <sect1 id="gin">
4 <title>GIN Indexes</title>
6 <indexterm>
7 <primary>index</primary>
8 <secondary>GIN</secondary>
9 </indexterm>
11 <sect2 id="gin-intro">
12 <title>Introduction</title>
14 <para>
15 <acronym>GIN</acronym> stands for Generalized Inverted Index.
16 <acronym>GIN</acronym> is designed for handling cases where the items
17 to be indexed are composite values, and the queries to be handled by
18 the index need to search for element values that appear within
19 the composite items. For example, the items could be documents,
20 and the queries could be searches for documents containing specific words.
21 </para>
23 <para>
24 We use the word <firstterm>item</firstterm> to refer to a composite value that
25 is to be indexed, and the word <firstterm>key</firstterm> to refer to an element
26 value. <acronym>GIN</acronym> always stores and searches for keys,
27 not item values per se.
28 </para>
30 <para>
31 A <acronym>GIN</acronym> index stores a set of (key, posting list) pairs,
32 where a <firstterm>posting list</firstterm> is a set of row IDs in which the key
33 occurs. The same row ID can appear in multiple posting lists, since
34 an item can contain more than one key. Each key value is stored only
35 once, so a <acronym>GIN</acronym> index is very compact for cases
36 where the same key appears many times.
37 </para>
39 <para>
40 <acronym>GIN</acronym> is generalized in the sense that the
41 <acronym>GIN</acronym> access method code does not need to know the
42 specific operations that it accelerates.
43 Instead, it uses custom strategies defined for particular data types.
44 The strategy defines how keys are extracted from indexed items and
45 query conditions, and how to determine whether a row that contains
46 some of the key values in a query actually satisfies the query.
47 </para>
49 <para>
50 One advantage of <acronym>GIN</acronym> is that it allows the development
51 of custom data types with the appropriate access methods, by
52 an expert in the domain of the data type, rather than a database expert.
53 This is much the same advantage as using <acronym>GiST</acronym>.
54 </para>
56 <para>
57 The <acronym>GIN</acronym>
58 implementation in <productname>PostgreSQL</productname> is primarily
59 maintained by Teodor Sigaev and Oleg Bartunov. There is more
60 information about <acronym>GIN</acronym> on their
61 <ulink url="http://www.sai.msu.su/~megera/wiki/Gin">website</ulink>.
62 </para>
63 </sect2>
65 <sect2 id="gin-builtin-opclasses">
66 <title>Built-in Operator Classes</title>
68 <para>
69 The core <productname>PostgreSQL</productname> distribution
70 includes the <acronym>GIN</acronym> operator classes shown in
71 <xref linkend="gin-builtin-opclasses-table"/>.
72 (Some of the optional modules described in <xref linkend="contrib"/>
73 provide additional <acronym>GIN</acronym> operator classes.)
74 </para>
76 <table id="gin-builtin-opclasses-table">
77 <title>Built-in <acronym>GIN</acronym> Operator Classes</title>
78 <tgroup cols="2">
79 <thead>
80 <row>
81 <entry>Name</entry>
82 <entry>Indexable Operators</entry>
83 </row>
84 </thead>
85 <tbody>
86 <row>
87 <entry morerows="3" valign="middle"><literal>array_ops</literal></entry>
88 <entry><literal>&amp;&amp; (anyarray,anyarray)</literal></entry>
89 </row>
90 <row>
91 <entry><literal>@&gt; (anyarray,anyarray)</literal></entry>
92 </row>
93 <row>
94 <entry><literal>&lt;@ (anyarray,anyarray)</literal></entry>
95 </row>
96 <row>
97 <entry><literal>= (anyarray,anyarray)</literal></entry>
98 </row>
99 <row>
100 <entry morerows="5" valign="middle"><literal>jsonb_ops</literal></entry>
101 <entry><literal>@&gt; (jsonb,jsonb)</literal></entry>
102 </row>
103 <row>
104 <entry><literal>@? (jsonb,jsonpath)</literal></entry>
105 </row>
106 <row>
107 <entry><literal>@@ (jsonb,jsonpath)</literal></entry>
108 </row>
109 <row>
110 <entry><literal>? (jsonb,text)</literal></entry>
111 </row>
112 <row>
113 <entry><literal>?| (jsonb,text[])</literal></entry>
114 </row>
115 <row>
116 <entry><literal>?&amp; (jsonb,text[])</literal></entry>
117 </row>
118 <row>
119 <entry morerows="2" valign="middle"><literal>jsonb_path_ops</literal></entry>
120 <entry><literal>@&gt; (jsonb,jsonb)</literal></entry>
121 </row>
122 <row>
123 <entry><literal>@? (jsonb,jsonpath)</literal></entry>
124 </row>
125 <row>
126 <entry><literal>@@ (jsonb,jsonpath)</literal></entry>
127 </row>
128 <row>
129 <entry valign="middle"><literal>tsvector_ops</literal></entry>
130 <entry><literal>@@ (tsvector,tsquery)</literal></entry>
131 </row>
132 </tbody>
133 </tgroup>
134 </table>
136 <para>
137 Of the two operator classes for type <type>jsonb</type>, <literal>jsonb_ops</literal>
138 is the default. <literal>jsonb_path_ops</literal> supports fewer operators but
139 offers better performance for those operators.
140 See <xref linkend="json-indexing"/> for details.
141 </para>
143 </sect2>
145 <sect2 id="gin-extensibility">
146 <title>Extensibility</title>
148 <para>
149 The <acronym>GIN</acronym> interface has a high level of abstraction,
150 requiring the access method implementer only to implement the semantics of
151 the data type being accessed. The <acronym>GIN</acronym> layer itself
152 takes care of concurrency, logging and searching the tree structure.
153 </para>
155 <para>
156 All it takes to get a <acronym>GIN</acronym> access method working is to
157 implement a few user-defined methods, which define the behavior of
158 keys in the tree and the relationships between keys, indexed items,
159 and indexable queries. In short, <acronym>GIN</acronym> combines
160 extensibility with generality, code reuse, and a clean interface.
161 </para>
163 <para>
164 There are two methods that an operator class for
165 <acronym>GIN</acronym> must provide:
167 <variablelist>
168 <varlistentry>
169 <term><function>Datum *extractValue(Datum itemValue, int32 *nkeys,
170 bool **nullFlags)</function></term>
171 <listitem>
172 <para>
173 Returns a palloc'd array of keys given an item to be indexed. The
174 number of returned keys must be stored into <literal>*nkeys</literal>.
175 If any of the keys can be null, also palloc an array of
176 <literal>*nkeys</literal> <type>bool</type> fields, store its address at
177 <literal>*nullFlags</literal>, and set these null flags as needed.
178 <literal>*nullFlags</literal> can be left <symbol>NULL</symbol> (its initial value)
179 if all keys are non-null.
180 The return value can be <symbol>NULL</symbol> if the item contains no keys.
181 </para>
182 </listitem>
183 </varlistentry>
185 <varlistentry>
186 <term><function>Datum *extractQuery(Datum query, int32 *nkeys,
187 StrategyNumber n, bool **pmatch, Pointer **extra_data,
188 bool **nullFlags, int32 *searchMode)</function></term>
189 <listitem>
190 <para>
191 Returns a palloc'd array of keys given a value to be queried; that is,
192 <literal>query</literal> is the value on the right-hand side of an
193 indexable operator whose left-hand side is the indexed column.
194 <literal>n</literal> is the strategy number of the operator within the
195 operator class (see <xref linkend="xindex-strategies"/>).
196 Often, <function>extractQuery</function> will need
197 to consult <literal>n</literal> to determine the data type of
198 <literal>query</literal> and the method it should use to extract key values.
199 The number of returned keys must be stored into <literal>*nkeys</literal>.
200 If any of the keys can be null, also palloc an array of
201 <literal>*nkeys</literal> <type>bool</type> fields, store its address at
202 <literal>*nullFlags</literal>, and set these null flags as needed.
203 <literal>*nullFlags</literal> can be left <symbol>NULL</symbol> (its initial value)
204 if all keys are non-null.
205 The return value can be <symbol>NULL</symbol> if the <literal>query</literal> contains no keys.
206 </para>
208 <para>
209 <literal>searchMode</literal> is an output argument that allows
210 <function>extractQuery</function> to specify details about how the search
211 will be done.
212 If <literal>*searchMode</literal> is set to
213 <literal>GIN_SEARCH_MODE_DEFAULT</literal> (which is the value it is
214 initialized to before call), only items that match at least one of
215 the returned keys are considered candidate matches.
216 If <literal>*searchMode</literal> is set to
217 <literal>GIN_SEARCH_MODE_INCLUDE_EMPTY</literal>, then in addition to items
218 containing at least one matching key, items that contain no keys at
219 all are considered candidate matches. (This mode is useful for
220 implementing is-subset-of operators, for example.)
221 If <literal>*searchMode</literal> is set to <literal>GIN_SEARCH_MODE_ALL</literal>,
222 then all non-null items in the index are considered candidate
223 matches, whether they match any of the returned keys or not. (This
224 mode is much slower than the other two choices, since it requires
225 scanning essentially the entire index, but it may be necessary to
226 implement corner cases correctly. An operator that needs this mode
227 in most cases is probably not a good candidate for a GIN operator
228 class.)
229 The symbols to use for setting this mode are defined in
230 <filename>access/gin.h</filename>.
231 </para>
233 <para>
234 <literal>pmatch</literal> is an output argument for use when partial match
235 is supported. To use it, <function>extractQuery</function> must allocate
236 an array of <literal>*nkeys</literal> <type>bool</type>s and store its address at
237 <literal>*pmatch</literal>. Each element of the array should be set to true
238 if the corresponding key requires partial match, false if not.
239 If <literal>*pmatch</literal> is set to <symbol>NULL</symbol> then GIN assumes partial match
240 is not required. The variable is initialized to <symbol>NULL</symbol> before call,
241 so this argument can simply be ignored by operator classes that do
242 not support partial match.
243 </para>
245 <para>
246 <literal>extra_data</literal> is an output argument that allows
247 <function>extractQuery</function> to pass additional data to the
248 <function>consistent</function> and <function>comparePartial</function> methods.
249 To use it, <function>extractQuery</function> must allocate
250 an array of <literal>*nkeys</literal> pointers and store its address at
251 <literal>*extra_data</literal>, then store whatever it wants to into the
252 individual pointers. The variable is initialized to <symbol>NULL</symbol> before
253 call, so this argument can simply be ignored by operator classes that
254 do not require extra data. If <literal>*extra_data</literal> is set, the
255 whole array is passed to the <function>consistent</function> method, and
256 the appropriate element to the <function>comparePartial</function> method.
257 </para>
259 </listitem>
260 </varlistentry>
261 </variablelist>
263 An operator class must also provide a function to check if an indexed item
264 matches the query. It comes in two flavors, a Boolean <function>consistent</function>
265 function, and a ternary <function>triConsistent</function> function.
266 <function>triConsistent</function> covers the functionality of both, so providing
267 <function>triConsistent</function> alone is sufficient. However, if the Boolean
268 variant is significantly cheaper to calculate, it can be advantageous to
269 provide both. If only the Boolean variant is provided, some optimizations
270 that depend on refuting index items before fetching all the keys are
271 disabled.
273 <variablelist>
274 <varlistentry>
275 <term><function>bool consistent(bool check[], StrategyNumber n, Datum query,
276 int32 nkeys, Pointer extra_data[], bool *recheck,
277 Datum queryKeys[], bool nullFlags[])</function></term>
278 <listitem>
279 <para>
280 Returns true if an indexed item satisfies the query operator with
281 strategy number <literal>n</literal> (or might satisfy it, if the recheck
282 indication is returned). This function does not have direct access
283 to the indexed item's value, since <acronym>GIN</acronym> does not
284 store items explicitly. Rather, what is available is knowledge
285 about which key values extracted from the query appear in a given
286 indexed item. The <literal>check</literal> array has length
287 <literal>nkeys</literal>, which is the same as the number of keys previously
288 returned by <function>extractQuery</function> for this <literal>query</literal> datum.
289 Each element of the
290 <literal>check</literal> array is true if the indexed item contains the
291 corresponding query key, i.e., if (check[i] == true) the i-th key of the
292 <function>extractQuery</function> result array is present in the indexed item.
293 The original <literal>query</literal> datum is
294 passed in case the <function>consistent</function> method needs to consult it,
295 and so are the <literal>queryKeys[]</literal> and <literal>nullFlags[]</literal>
296 arrays previously returned by <function>extractQuery</function>.
297 <literal>extra_data</literal> is the extra-data array returned by
298 <function>extractQuery</function>, or <symbol>NULL</symbol> if none.
299 </para>
301 <para>
302 When <function>extractQuery</function> returns a null key in
303 <literal>queryKeys[]</literal>, the corresponding <literal>check[]</literal> element
304 is true if the indexed item contains a null key; that is, the
305 semantics of <literal>check[]</literal> are like <literal>IS NOT DISTINCT
306 FROM</literal>. The <function>consistent</function> function can examine the
307 corresponding <literal>nullFlags[]</literal> element if it needs to tell
308 the difference between a regular value match and a null match.
309 </para>
311 <para>
312 On success, <literal>*recheck</literal> should be set to true if the heap
313 tuple needs to be rechecked against the query operator, or false if
314 the index test is exact. That is, a false return value guarantees
315 that the heap tuple does not match the query; a true return value with
316 <literal>*recheck</literal> set to false guarantees that the heap tuple does
317 match the query; and a true return value with
318 <literal>*recheck</literal> set to true means that the heap tuple might match
319 the query, so it needs to be fetched and rechecked by evaluating the
320 query operator directly against the originally indexed item.
321 </para>
322 </listitem>
323 </varlistentry>
325 <varlistentry>
326 <term><function>GinTernaryValue triConsistent(GinTernaryValue check[], StrategyNumber n, Datum query,
327 int32 nkeys, Pointer extra_data[],
328 Datum queryKeys[], bool nullFlags[])</function></term>
329 <listitem>
330 <para>
331 <function>triConsistent</function> is similar to <function>consistent</function>,
332 but instead of Booleans in the <literal>check</literal> vector, there are
333 three possible values for each
334 key: <literal>GIN_TRUE</literal>, <literal>GIN_FALSE</literal> and
335 <literal>GIN_MAYBE</literal>. <literal>GIN_FALSE</literal> and <literal>GIN_TRUE</literal>
336 have the same meaning as regular Boolean values, while
337 <literal>GIN_MAYBE</literal> means that the presence of that key is not known.
338 When <literal>GIN_MAYBE</literal> values are present, the function should only
339 return <literal>GIN_TRUE</literal> if the item certainly matches whether or
340 not the index item contains the corresponding query keys. Likewise, the
341 function must return <literal>GIN_FALSE</literal> only if the item certainly
342 does not match, whether or not it contains the <literal>GIN_MAYBE</literal>
343 keys. If the result depends on the <literal>GIN_MAYBE</literal> entries, i.e.,
344 the match cannot be confirmed or refuted based on the known query keys,
345 the function must return <literal>GIN_MAYBE</literal>.
346 </para>
347 <para>
348 When there are no <literal>GIN_MAYBE</literal> values in the <literal>check</literal>
349 vector, a <literal>GIN_MAYBE</literal> return value is the equivalent of
350 setting the <literal>recheck</literal> flag in the
351 Boolean <function>consistent</function> function.
352 </para>
353 </listitem>
354 </varlistentry>
355 </variablelist>
356 </para>
358 <para>
359 In addition, GIN must have a way to sort the key values stored in the index.
360 The operator class can define the sort ordering by specifying a comparison
361 method:
363 <variablelist>
364 <varlistentry>
365 <term><function>int compare(Datum a, Datum b)</function></term>
366 <listitem>
367 <para>
368 Compares two keys (not indexed items!) and returns an integer less than
369 zero, zero, or greater than zero, indicating whether the first key is
370 less than, equal to, or greater than the second. Null keys are never
371 passed to this function.
372 </para>
373 </listitem>
374 </varlistentry>
375 </variablelist>
377 Alternatively, if the operator class does not provide a <function>compare</function>
378 method, GIN will look up the default btree operator class for the index
379 key data type, and use its comparison function. It is recommended to
380 specify the comparison function in a GIN operator class that is meant for
381 just one data type, as looking up the btree operator class costs a few
382 cycles. However, polymorphic GIN operator classes (such
383 as <literal>array_ops</literal>) typically cannot specify a single comparison
384 function.
385 </para>
387 <para>
388 An operator class for <acronym>GIN</acronym> can optionally supply the
389 following methods:
391 <variablelist>
392 <varlistentry>
393 <term><function>int comparePartial(Datum partial_key, Datum key, StrategyNumber n,
394 Pointer extra_data)</function></term>
395 <listitem>
396 <para>
397 Compare a partial-match query key to an index key. Returns an integer
398 whose sign indicates the result: less than zero means the index key
399 does not match the query, but the index scan should continue; zero
400 means that the index key does match the query; greater than zero
401 indicates that the index scan should stop because no more matches
402 are possible. The strategy number <literal>n</literal> of the operator
403 that generated the partial match query is provided, in case its
404 semantics are needed to determine when to end the scan. Also,
405 <literal>extra_data</literal> is the corresponding element of the extra-data
406 array made by <function>extractQuery</function>, or <symbol>NULL</symbol> if none.
407 Null keys are never passed to this function.
408 </para>
409 </listitem>
410 </varlistentry>
411 <varlistentry>
412 <term><function>void options(local_relopts *relopts)</function></term>
413 <listitem>
414 <para>
415 Defines a set of user-visible parameters that control operator class
416 behavior.
417 </para>
419 <para>
420 The <function>options</function> function is passed a pointer to a
421 <structname>local_relopts</structname> struct, which needs to be
422 filled with a set of operator class specific options. The options
423 can be accessed from other support functions using the
424 <literal>PG_HAS_OPCLASS_OPTIONS()</literal> and
425 <literal>PG_GET_OPCLASS_OPTIONS()</literal> macros.
426 </para>
428 <para>
429 Since both key extraction of indexed values and representation of the
430 key in <acronym>GIN</acronym> are flexible, they may depend on
431 user-specified parameters.
432 </para>
433 </listitem>
434 </varlistentry>
435 </variablelist>
436 </para>
438 <para>
439 To support <quote>partial match</quote> queries, an operator class must
440 provide the <function>comparePartial</function> method, and its
441 <function>extractQuery</function> method must set the <literal>pmatch</literal>
442 parameter when a partial-match query is encountered. See
443 <xref linkend="gin-partial-match"/> for details.
444 </para>
446 <para>
447 The actual data types of the various <literal>Datum</literal> values mentioned
448 above vary depending on the operator class. The item values passed to
449 <function>extractValue</function> are always of the operator class's input type, and
450 all key values must be of the class's <literal>STORAGE</literal> type. The type of
451 the <literal>query</literal> argument passed to <function>extractQuery</function>,
452 <function>consistent</function> and <function>triConsistent</function> is whatever is the
453 right-hand input type of the class member operator identified by the
454 strategy number. This need not be the same as the indexed type, so long as
455 key values of the correct type can be extracted from it. However, it is
456 recommended that the SQL declarations of these three support functions use
457 the opclass's indexed data type for the <literal>query</literal> argument, even
458 though the actual type might be something else depending on the operator.
459 </para>
461 </sect2>
463 <sect2 id="gin-implementation">
464 <title>Implementation</title>
466 <para>
467 Internally, a <acronym>GIN</acronym> index contains a B-tree index
468 constructed over keys, where each key is an element of one or more indexed
469 items (a member of an array, for example) and where each tuple in a leaf
470 page contains either a pointer to a B-tree of heap pointers (a
471 <quote>posting tree</quote>), or a simple list of heap pointers (a <quote>posting
472 list</quote>) when the list is small enough to fit into a single index tuple along
473 with the key value. <xref linkend="gin-internals-figure"/> illustrates
474 these components of a GIN index.
475 </para>
477 <para>
478 As of <productname>PostgreSQL</productname> 9.1, null key values can be
479 included in the index. Also, placeholder nulls are included in the index
480 for indexed items that are null or contain no keys according to
481 <function>extractValue</function>. This allows searches that should find empty
482 items to do so.
483 </para>
485 <para>
486 Multicolumn <acronym>GIN</acronym> indexes are implemented by building
487 a single B-tree over composite values (column number, key value). The
488 key values for different columns can be of different types.
489 </para>
491 <figure id="gin-internals-figure">
492 <title>GIN Internals</title>
493 <mediaobject>
494 <imageobject>
495 <imagedata fileref="images/gin.svg" format="SVG" width="100%"/>
496 </imageobject>
497 </mediaobject>
498 </figure>
500 <sect3 id="gin-fast-update">
501 <title>GIN Fast Update Technique</title>
503 <para>
504 Updating a <acronym>GIN</acronym> index tends to be slow because of the
505 intrinsic nature of inverted indexes: inserting or updating one heap row
506 can cause many inserts into the index (one for each key extracted
507 from the indexed item).
508 <acronym>GIN</acronym> is capable of postponing much of this work by inserting
509 new tuples into a temporary, unsorted list of pending entries.
510 When the table is vacuumed or autoanalyzed, or when
511 <function>gin_clean_pending_list</function> function is called, or if the
512 pending list becomes larger than
513 <xref linkend="guc-gin-pending-list-limit"/>, the entries are moved to the
514 main <acronym>GIN</acronym> data structure using the same bulk insert
515 techniques used during initial index creation. This greatly improves
516 <acronym>GIN</acronym> index update speed, even counting the additional
517 vacuum overhead. Moreover the overhead work can be done by a background
518 process instead of in foreground query processing.
519 </para>
521 <para>
522 The main disadvantage of this approach is that searches must scan the list
523 of pending entries in addition to searching the regular index, and so
524 a large list of pending entries will slow searches significantly.
525 Another disadvantage is that, while most updates are fast, an update
526 that causes the pending list to become <quote>too large</quote> will incur an
527 immediate cleanup cycle and thus be much slower than other updates.
528 Proper use of autovacuum can minimize both of these problems.
529 </para>
531 <para>
532 If consistent response time is more important than update speed,
533 use of pending entries can be disabled by turning off the
534 <literal>fastupdate</literal> storage parameter for a
535 <acronym>GIN</acronym> index. See <xref linkend="sql-createindex"/>
536 for details.
537 </para>
538 </sect3>
540 <sect3 id="gin-partial-match">
541 <title>Partial Match Algorithm</title>
543 <para>
544 GIN can support <quote>partial match</quote> queries, in which the query
545 does not determine an exact match for one or more keys, but the possible
546 matches fall within a reasonably narrow range of key values (within the
547 key sorting order determined by the <function>compare</function> support method).
548 The <function>extractQuery</function> method, instead of returning a key value
549 to be matched exactly, returns a key value that is the lower bound of
550 the range to be searched, and sets the <literal>pmatch</literal> flag true.
551 The key range is then scanned using the <function>comparePartial</function>
552 method. <function>comparePartial</function> must return zero for a matching
553 index key, less than zero for a non-match that is still within the range
554 to be searched, or greater than zero if the index key is past the range
555 that could match.
556 </para>
557 </sect3>
559 </sect2>
561 <sect2 id="gin-tips">
562 <title>GIN Tips and Tricks</title>
564 <variablelist>
565 <varlistentry>
566 <term>Create vs. insert</term>
567 <listitem>
568 <para>
569 Insertion into a <acronym>GIN</acronym> index can be slow
570 due to the likelihood of many keys being inserted for each item.
571 So, for bulk insertions into a table it is advisable to drop the GIN
572 index and recreate it after finishing bulk insertion.
573 </para>
575 <para>
576 When <literal>fastupdate</literal> is enabled for <acronym>GIN</acronym>
577 (see <xref linkend="gin-fast-update"/> for details), the penalty is
578 less than when it is not. But for very large updates it may still be
579 best to drop and recreate the index.
580 </para>
581 </listitem>
582 </varlistentry>
584 <varlistentry>
585 <term><xref linkend="guc-maintenance-work-mem"/></term>
586 <listitem>
587 <para>
588 Build time for a <acronym>GIN</acronym> index is very sensitive to
589 the <varname>maintenance_work_mem</varname> setting; it doesn't pay to
590 skimp on work memory during index creation.
591 </para>
592 </listitem>
593 </varlistentry>
595 <varlistentry>
596 <term><xref linkend="guc-gin-pending-list-limit"/></term>
597 <listitem>
598 <para>
599 During a series of insertions into an existing <acronym>GIN</acronym>
600 index that has <literal>fastupdate</literal> enabled, the system will clean up
601 the pending-entry list whenever the list grows larger than
602 <varname>gin_pending_list_limit</varname>. To avoid fluctuations in observed
603 response time, it's desirable to have pending-list cleanup occur in the
604 background (i.e., via autovacuum). Foreground cleanup operations
605 can be avoided by increasing <varname>gin_pending_list_limit</varname>
606 or making autovacuum more aggressive.
607 However, enlarging the threshold of the cleanup operation means that
608 if a foreground cleanup does occur, it will take even longer.
609 </para>
610 <para>
611 <varname>gin_pending_list_limit</varname> can be overridden for individual
612 GIN indexes by changing storage parameters, which allows each
613 GIN index to have its own cleanup threshold.
614 For example, it's possible to increase the threshold only for the GIN
615 index which can be updated heavily, and decrease it otherwise.
616 </para>
617 </listitem>
618 </varlistentry>
620 <varlistentry>
621 <term><xref linkend="guc-gin-fuzzy-search-limit"/></term>
622 <listitem>
623 <para>
624 The primary goal of developing <acronym>GIN</acronym> indexes was
625 to create support for highly scalable full-text search in
626 <productname>PostgreSQL</productname>, and there are often situations when
627 a full-text search returns a very large set of results. Moreover, this
628 often happens when the query contains very frequent words, so that the
629 large result set is not even useful. Since reading many
630 tuples from the disk and sorting them could take a lot of time, this is
631 unacceptable for production. (Note that the index search itself is very
632 fast.)
633 </para>
634 <para>
635 To facilitate controlled execution of such queries,
636 <acronym>GIN</acronym> has a configurable soft upper limit on the
637 number of rows returned: the
638 <varname>gin_fuzzy_search_limit</varname> configuration parameter.
639 It is set to 0 (meaning no limit) by default.
640 If a non-zero limit is set, then the returned set is a subset of
641 the whole result set, chosen at random.
642 </para>
643 <para>
644 <quote>Soft</quote> means that the actual number of returned results
645 could differ somewhat from the specified limit, depending on the query
646 and the quality of the system's random number generator.
647 </para>
648 <para>
649 From experience, values in the thousands (e.g., 5000 &mdash; 20000)
650 work well.
651 </para>
652 </listitem>
653 </varlistentry>
654 </variablelist>
656 </sect2>
658 <sect2 id="gin-limit">
659 <title>Limitations</title>
661 <para>
662 <acronym>GIN</acronym> assumes that indexable operators are strict. This
663 means that <function>extractValue</function> will not be called at all on a null
664 item value (instead, a placeholder index entry is created automatically),
665 and <function>extractQuery</function> will not be called on a null query
666 value either (instead, the query is presumed to be unsatisfiable). Note
667 however that null key values contained within a non-null composite item
668 or query value are supported.
669 </para>
670 </sect2>
672 <sect2 id="gin-examples">
673 <title>Examples</title>
675 <para>
676 The core <productname>PostgreSQL</productname> distribution
677 includes the <acronym>GIN</acronym> operator classes previously shown in
678 <xref linkend="gin-builtin-opclasses-table"/>.
679 The following <filename>contrib</filename> modules also contain
680 <acronym>GIN</acronym> operator classes:
682 <variablelist>
683 <varlistentry>
684 <term><filename>btree_gin</filename></term>
685 <listitem>
686 <para>B-tree equivalent functionality for several data types</para>
687 </listitem>
688 </varlistentry>
690 <varlistentry>
691 <term><filename>hstore</filename></term>
692 <listitem>
693 <para>Module for storing (key, value) pairs</para>
694 </listitem>
695 </varlistentry>
697 <varlistentry>
698 <term><filename>intarray</filename></term>
699 <listitem>
700 <para>Enhanced support for <type>int[]</type></para>
701 </listitem>
702 </varlistentry>
704 <varlistentry>
705 <term><filename>pg_trgm</filename></term>
706 <listitem>
707 <para>Text similarity using trigram matching</para>
708 </listitem>
709 </varlistentry>
710 </variablelist>
711 </para>
712 </sect2>
714 </sect1>