doc/src/sgml/gin.sgml

   1 <!-- $PostgreSQL$ -->
   2
   3 <chapter id="GIN">
   4 <title>GIN Indexes</title>
   5
   6    <indexterm>
   7     <primary>index</primary>
   8     <secondary>GIN</secondary>
   9    </indexterm>
  10
  11 <sect1 id="gin-intro">
  12  <title>Introduction</title>
  13
  14  <para>
  15    <acronym>GIN</acronym> stands for Generalized Inverted Index.  It is
  16    an index structure storing a set of (key, posting list) pairs, where
  17    a <quote>posting list</> is a set of rows in which the key occurs. Each
  18    indexed value can contain many keys, so the same row ID can appear in
  19    multiple posting lists.
  20  </para>
  21
  22  <para>
  23    It is generalized in the sense that a <acronym>GIN</acronym> index
  24    does not need to be aware of the operation that it accelerates.
  25    Instead, it uses custom strategies defined for particular data types.
  26  </para>
  27
  28  <para>
  29   One advantage of <acronym>GIN</acronym> is that it allows the development
  30   of custom data types with the appropriate access methods by
  31   an expert in the domain of the data type, rather than by a database expert.
  32   This is much the same advantage as using <acronym>GiST</acronym>.
  33  </para>
  34
  35  <para>
  36   The <acronym>GIN</acronym>
  37   implementation in <productname>PostgreSQL</productname> is primarily
  38   maintained by Teodor Sigaev and Oleg Bartunov. There is more
  39   information about <acronym>GIN</acronym> on their
  40   <ulink url="http://www.sai.msu.su/~megera/wiki/Gin">website</ulink>.
  41  </para>
  42 </sect1>
  43
  44 <sect1 id="gin-extensibility">
  45  <title>Extensibility</title>
  46
  47  <para>
  48    The <acronym>GIN</acronym> interface has a high level of abstraction,
  49    requiring the access method implementer only to implement the semantics of
  50    the data type being accessed.  The <acronym>GIN</acronym> layer itself
  51    takes care of concurrency, logging and searching the tree structure.
  52  </para>
  53
  54  <para>
  55    All it takes to get a <acronym>GIN</acronym> access method working is to
  56    implement four (or five) user-defined methods, which define the behavior of
  57    keys in the tree and the relationships between keys, indexed values,
  58    and indexable queries. In short, <acronym>GIN</acronym> combines
  59    extensibility with generality, code reuse, and a clean interface.
  60  </para>
  61
  62  <para>
  63    The four methods that an operator class for
  64    <acronym>GIN</acronym> must provide are:
  65  </para>
  66
  67  <variablelist>
  68     <varlistentry>
  69      <term>int compare(Datum a, Datum b)</term>
  70      <listitem>
  71       <para>
  72        Compares keys (not indexed values!) and returns an integer less than
  73        zero, zero, or greater than zero, indicating whether the first key is
  74        less than, equal to, or greater than the second.
  75       </para>
  76      </listitem>
  77     </varlistentry>
  78
  79     <varlistentry>
  80      <term>Datum *extractValue(Datum inputValue, int32 *nkeys)</term>
  81      <listitem>
  82       <para>
  83        Returns an array of keys given a value to be indexed.  The
  84        number of returned keys must be stored into <literal>*nkeys</>.
  85       </para>
  86      </listitem>
  87     </varlistentry>
  88
  89     <varlistentry>
  90      <term>Datum *extractQuery(Datum query, int32 *nkeys,
  91         StrategyNumber n, bool **pmatch, Pointer **extra_data)</term>
  92      <listitem>
  93       <para>
  94        Returns an array of keys given a value to be queried; that is,
  95        <literal>query</> is the value on the right-hand side of an
  96        indexable operator whose left-hand side is the indexed column.
  97        <literal>n</> is the strategy number of the operator within the
  98        operator class (see <xref linkend="xindex-strategies">).
  99        Often, <function>extractQuery</> will need
 100        to consult <literal>n</> to determine the data type of
 101        <literal>query</> and the key values that need to be extracted.
 102        The number of returned keys must be stored into <literal>*nkeys</>.
 103        If the query contains no keys then <function>extractQuery</>
 104        should store 0 or -1 into <literal>*nkeys</>, depending on the
 105        semantics of the operator.  0 means that every
 106        value matches the <literal>query</> and a full-index scan should be
 107        performed (but see <xref linkend="gin-limit">).
 108        -1 means that nothing can match the <literal>query</>, and
 109        so the index scan can be skipped entirely.
 110        <literal>pmatch</> is an output argument for use when partial match
 111        is supported.  To use it, <function>extractQuery</> must allocate
 112        an array of <literal>*nkeys</> booleans and store its address at
 113        <literal>*pmatch</>.  Each element of the array should be set to TRUE
 114        if the corresponding key requires partial match, FALSE if not.
 115        If <literal>*pmatch</> is set to NULL then <acronym>GIN</acronym> assumes partial match
 116        is not required.  The variable is initialized to NULL before the call,
 117        so this argument can simply be ignored by operator classes that do
 118        not support partial match.
 119        <literal>extra_data</> is an output argument that allows
 120        <function>extractQuery</> to pass additional data to the
 121        <function>consistent</> and <function>comparePartial</> methods.
 122        To use it, <function>extractQuery</> must allocate
 123        an array of <literal>*nkeys</> Pointers and store its address at
 124        <literal>*extra_data</>, then store whatever it wants to into the
 125        individual pointers.  The variable is initialized to NULL before the
 126        call, so this argument can simply be ignored by operator classes that
 127        do not require extra data.  If <literal>*extra_data</> is set, the
 128        whole array is passed to the <function>consistent</> method, and
 129        the appropriate element to the <function>comparePartial</> method.
 130       </para>
 131
 132      </listitem>
 133     </varlistentry>
 134
 135     <varlistentry>
 136      <term>bool consistent(bool check[], StrategyNumber n, Datum query,
 137                            int32 nkeys, Pointer extra_data[], bool *recheck)</term>
 138      <listitem>
 139       <para>
 140        Returns TRUE if the indexed value satisfies the query operator with
 141        strategy number <literal>n</> (or might satisfy, if the recheck
 142        indication is returned).  The <literal>check</> array has length
 143        <literal>nkeys</>, which is the same as the number of keys previously
 144        returned by <function>extractQuery</> for this <literal>query</> datum.
 145        Each element of the
 146        <literal>check</> array is TRUE if the indexed value contains the
 147        corresponding query key, i.e., if (check[i] == TRUE) the i-th key of the
 148        <function>extractQuery</> result array is present in the indexed value.
 149        The original <literal>query</> datum (not the extracted key array!) is
 150        passed in case the <function>consistent</> method needs to consult it.
 151        <literal>extra_data</> is the extra-data array returned by
 152        <function>extractQuery</>, or NULL if none.
 153        On success, <literal>*recheck</> should be set to TRUE if the heap
 154        tuple needs to be rechecked against the query operator, or FALSE if
 155        the index test is exact.
 156       </para>
 157      </listitem>
 158     </varlistentry>
 159
 160   </variablelist>
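
 <para>
  As a rough illustration of how the four required methods cooperate, the
  following self-contained C sketch models them for a toy set-of-integers
  type.  It is not <productname>PostgreSQL</productname> code: the
  <literal>Datum</literal> typedef, the <literal>IntSet</literal> struct,
  the <literal>TOY_OVERLAP</literal> and <literal>TOY_CONTAINS</literal>
  strategy numbers, every <literal>toy_</literal> function and the
  simplified signatures are assumptions made for this example.  The helper
  <function>toy_matches</function> merely stands in for the work the
  <acronym>GIN</acronym> layer does between the calls.
 </para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Stand-ins for illustration only; the real Datum and StrategyNumber differ. */
typedef long Datum;
typedef struct { int n; const int *elems; } IntSet;  /* hypothetical indexed type */
enum { TOY_OVERLAP = 1, TOY_CONTAINS = 2 };          /* hypothetical strategy numbers */

/* compare: orders two keys, never whole indexed values. */
static int toy_compare(Datum a, Datum b)
{
    return (a > b) - (a < b);
}

/* extractValue: one key per element; the key count is stored into *nkeys. */
static Datum *toy_extract_value(const IntSet *value, int *nkeys)
{
    Datum *keys = malloc(sizeof(Datum) * (size_t) (value->n > 0 ? value->n : 1));

    assert(keys != NULL);
    for (int i = 0; i < value->n; i++)
        keys[i] = value->elems[i];
    *nkeys = value->n;
    return keys;
}

/*
 * extractQuery: an empty query matches every value under "contains"
 * (*nkeys = 0) and no value under "overlap" (*nkeys = -1).
 */
static Datum *toy_extract_query(const IntSet *query, int *nkeys, int strategy)
{
    if (query->n == 0)
    {
        *nkeys = (strategy == TOY_CONTAINS) ? 0 : -1;
        return NULL;
    }
    return toy_extract_value(query, nkeys);
}

/* consistent: combines the per-key presence flags according to the strategy. */
static bool toy_consistent(const bool check[], int strategy, int nkeys, bool *recheck)
{
    bool any = false;
    bool all = true;

    for (int i = 0; i < nkeys; i++)
    {
        any = any || check[i];
        all = all && check[i];
    }
    *recheck = false;           /* the toy index test is exact */
    return (strategy == TOY_CONTAINS) ? all : any;
}

/* Stand-in for the GIN layer: builds check[] for one indexed value. */
static bool toy_matches(const IntSet *value, const IntSet *query, int strategy)
{
    int nv, nq;
    bool recheck, result;
    Datum *vkeys = toy_extract_value(value, &nv);
    Datum *qkeys = toy_extract_query(query, &nq, strategy);

    if (nq <= 0)
    {
        /* The toy treats 0 as match-all; -1 means nothing can match. */
        free(vkeys);
        return nq == 0;
    }

    bool *check = calloc((size_t) nq, sizeof(bool));

    assert(check != NULL);
    for (int i = 0; i < nq; i++)
        for (int j = 0; j < nv; j++)
            if (toy_compare(qkeys[i], vkeys[j]) == 0)
                check[i] = true;

    result = toy_consistent(check, strategy, nq, &recheck);
    free(check);
    free(qkeys);
    free(vkeys);
    return result;
}
```
]]></programlisting>

 <para>
  In this sketch, calling <function>toy_matches</function> with the indexed
  value {1,2,3} and the query {2,3} succeeds under either strategy, while
  the query {3,4} succeeds only under <literal>TOY_OVERLAP</literal>.
 </para>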
 161
 162  <para>
 163   Optionally, an operator class for
 164   <acronym>GIN</acronym> can supply a fifth method:
 165  </para>
 166
 167   <variablelist>
 168
 169     <varlistentry>
 170      <term>int comparePartial(Datum partial_key, Datum key, StrategyNumber n,
 171                               Pointer extra_data)</term>
 172      <listitem>
 173       <para>
 174        Compares a partial-match query to an index key.  Returns an integer
 175        whose sign indicates the result: less than zero means the index key
 176        does not match the query, but the index scan should continue; zero
 177        means that the index key does match the query; greater than zero
 178        indicates that the index scan should stop because no more matches
 179        are possible.  The strategy number <literal>n</> of the operator
 180        that generated the partial match query is provided, in case its
 181        semantics are needed to determine when to end the scan.  Also,
 182        <literal>extra_data</> is the corresponding element of the extra-data
 183        array made by <function>extractQuery</>, or NULL if none.
 184       </para>
 185      </listitem>
 186     </varlistentry>
 187
 188   </variablelist>
 189
 190  <para>
 191   To support <quote>partial match</> queries, an operator class must
 192   provide the <function>comparePartial</> method, and its
 193   <function>extractQuery</> method must set the <literal>pmatch</>
 194   parameter when a partial-match query is encountered.  See
 195   <xref linkend="gin-partial-match"> for details.
 196  </para>
 197
 198 </sect1>
 199
 200 <sect1 id="gin-implementation">
 201  <title>Implementation</title>
 202
 203  <para>
 204   Internally, a <acronym>GIN</acronym> index contains a B-tree index
 205   constructed over keys, where each key is an element of the indexed value
 206   (a member of an array, for example) and where each tuple in a leaf page is
 207   either a pointer to a B-tree over heap pointers (PT, posting tree), or a
 208   list of heap pointers (PL, posting list) if the list is small enough.
 209  </para>
 210
 211  <sect2 id="gin-fast-update">
 212   <title>GIN fast update technique</title>
 213
 214   <para>
 215    Updating a <acronym>GIN</acronym> index tends to be slow because of the
 216    intrinsic nature of inverted indexes: inserting or updating one heap row
 217    can cause many inserts into the index (one for each key extracted
 218    from the indexed value). As of <productname>PostgreSQL</productname> 8.4,
 219    <acronym>GIN</> is capable of postponing much of this work by inserting
 220    new tuples into a temporary, unsorted list of pending entries.
 221    When the table is vacuumed, or if the pending list becomes too large
 222    (larger than <xref linkend="guc-work-mem">), the entries are moved to the
 223    main <acronym>GIN</acronym> data structure using the same bulk insert
 224    techniques used during initial index creation.  This greatly improves
 225    <acronym>GIN</acronym> index update speed, even counting the additional
 226    vacuum overhead.  Moreover, this work can be performed by a background
 227    process instead of in foreground query processing.
 228   </para>
 229
 230   <para>
 231    The main disadvantage of this approach is that searches must scan the list
 232    of pending entries in addition to searching the regular index, and so
 233    a large list of pending entries will slow searches significantly.
 234    Another disadvantage is that, while most updates are fast, an update
 235    that causes the pending list to become <quote>too large</> will incur an
 236    immediate cleanup cycle and thus be much slower than other updates.
 237    Proper use of autovacuum can minimize both of these problems.
 238   </para>
 239
 240   <para>
 241    If consistent response time is more important than update speed,
 242    use of pending entries can be disabled by turning off the
 243    <literal>FASTUPDATE</literal> storage parameter for a
 244    <acronym>GIN</acronym> index.  See <xref linkend="sql-createindex"
 245    endterm="sql-createindex-title"> for details.
 246   </para>
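
  <para>
   The pending-list behavior described in this section can be pictured
   with a small toy model.  The C sketch below is only an illustration
   under stated assumptions, not the actual implementation: the
   <literal>ToyIndex</literal> struct, the <literal>toy_</literal>
   functions and the fixed <literal>TOY_PENDING_LIMIT</literal> (which
   stands in for the <varname>work_mem</varname> threshold) are all
   hypothetical, and a sorted array stands in for the main
   <acronym>GIN</acronym> structure.
  </para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>

enum { TOY_PENDING_LIMIT = 4, TOY_MAIN_CAPACITY = 64 };  /* hypothetical sizes */

typedef struct
{
    int main_keys[TOY_MAIN_CAPACITY];  /* sorted; stands in for the main structure */
    int nmain;
    int pending[TOY_PENDING_LIMIT];    /* unsorted pending entries */
    int npending;
    int cleanups;                      /* number of cleanup cycles so far */
} ToyIndex;

/* Moves all pending entries into the sorted main array (what a vacuum would do). */
static void toy_cleanup(ToyIndex *idx)
{
    for (int i = 0; i < idx->npending; i++)
    {
        int key = idx->pending[i];
        int pos = idx->nmain;

        assert(idx->nmain < TOY_MAIN_CAPACITY);
        while (pos > 0 && idx->main_keys[pos - 1] > key)
        {
            idx->main_keys[pos] = idx->main_keys[pos - 1];
            pos--;
        }
        idx->main_keys[pos] = key;
        idx->nmain++;
    }
    idx->npending = 0;
    idx->cleanups++;
}

/*
 * Cheap insert: append to the pending list.  The insert that fills the
 * list pays for an immediate cleanup cycle, so it is slower than the rest.
 */
static void toy_insert(ToyIndex *idx, int key)
{
    assert(idx->npending < TOY_PENDING_LIMIT);
    idx->pending[idx->npending++] = key;
    if (idx->npending >= TOY_PENDING_LIMIT)
        toy_cleanup(idx);
}

/* A search must consult the main structure and also scan the pending list. */
static bool toy_search(const ToyIndex *idx, int key)
{
    int lo = 0;
    int hi = idx->nmain - 1;

    while (lo <= hi)                       /* binary search of the main array */
    {
        int mid = lo + (hi - lo) / 2;

        if (idx->main_keys[mid] == key)
            return true;
        if (idx->main_keys[mid] < key)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    for (int i = 0; i < idx->npending; i++)  /* linear scan of pending entries */
        if (idx->pending[i] == key)
            return true;
    return false;
}
```
]]></programlisting>

  <para>
   The model shows both trade-offs in miniature: every search pays for a
   linear scan of the pending entries, and the one insert that fills the
   list triggers the cleanup.
  </para>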
 247  </sect2>
 248
 249  <sect2 id="gin-partial-match">
 250   <title>Partial match algorithm</title>
 251
 252   <para>
 253    <acronym>GIN</acronym> can support <quote>partial match</> queries, in which the query
 254    does not determine an exact match for one or more keys, but the possible
 255    matches fall within a reasonably narrow range of key values (within the
 256    key sorting order determined by the <function>compare</> support method).
 257    The <function>extractQuery</> method, instead of returning a key value
 258    to be matched exactly, returns a key value that is the lower bound of
 259    the range to be searched, and sets the <literal>pmatch</> flag to true.
 260    The key range is then searched using the <function>comparePartial</>
 261    method.  <function>comparePartial</> must return zero for an actual
 262    match, less than zero for a non-match that is still within the range
 263    to be searched, or greater than zero if the index key is past the range
 264    that could match.
 265   </para>
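
  <para>
   The return-value convention of <function>comparePartial</> can be
   sketched for the common case of prefix matching over string keys.
   The following C fragment is a minimal illustration, assuming keys
   sorted in <function>strcmp</function> order; the names
   <function>toy_compare_partial</function> and
   <function>toy_partial_scan</function> and their simplified signatures
   are hypothetical and do not correspond to the real support-function
   interface.
  </para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy comparePartial: does the index key start with the partial (prefix) key? */
static int toy_compare_partial(const char *partial_key, const char *key)
{
    size_t plen = strlen(partial_key);
    int    cmp = strncmp(key, partial_key, plen);

    if (cmp == 0)
        return 0;                /* key has the prefix: a match */
    return (cmp < 0) ? -1 : 1;   /* before the range: continue; past it: stop */
}

/*
 * Stand-in for the index scan.  keys[] must be sorted in strcmp order.
 * Returns the number of matching keys and stores in *visited how many
 * keys were passed to toy_compare_partial.
 */
static int toy_partial_scan(const char *const keys[], int nkeys,
                            const char *prefix, int *visited)
{
    int matches = 0;
    int i = 0;

    assert(keys != NULL && prefix != NULL && visited != NULL);

    /* Seek to the lower bound of the range: the first key >= prefix. */
    while (i < nkeys && strcmp(keys[i], prefix) < 0)
        i++;

    *visited = 0;
    for (; i < nkeys; i++)
    {
        int r = toy_compare_partial(prefix, keys[i]);

        (*visited)++;
        if (r > 0)
            break;               /* no more matches are possible */
        if (r == 0)
            matches++;
    }
    return matches;
}
```
]]></programlisting>

  <para>
   Because the scan starts at the lower bound and stops at the first
   positive result, only the keys inside the candidate range plus one
   key past it are examined.
  </para>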
 266  </sect2>
 267
 268 </sect1>
 269
 270 <sect1 id="gin-tips">
 271 <title>GIN tips and tricks</title>
 272
 273  <variablelist>
 274   <varlistentry>
 275    <term>Create vs insert</term>
 276    <listitem>
 277     <para>
 278      Insertion into a <acronym>GIN</acronym> index can be slow
 279      due to the likelihood of many keys being inserted for each value.
 280      So, for bulk insertions into a table it is advisable to drop the GIN
 281      index and recreate it after finishing bulk insertion.
 282     </para>
 283
 284     <para>
 285      As of <productname>PostgreSQL</productname> 8.4, this advice is less
 286      necessary since delayed indexing is used (see <xref
 287      linkend="gin-fast-update"> for details).  But for very large updates
 288      it may still be best to drop and recreate the index.
 289     </para>
 290    </listitem>
 291   </varlistentry>
 292
 293   <varlistentry>
 294    <term><xref linkend="guc-maintenance-work-mem"></term>
 295    <listitem>
 296     <para>
 297      Build time for a <acronym>GIN</acronym> index is very sensitive to
 298      the <varname>maintenance_work_mem</> setting; it doesn't pay to
 299      skimp on work memory during index creation.
 300     </para>
 301    </listitem>
 302   </varlistentry>
 303
 304   <varlistentry>
 305    <term><xref linkend="guc-work-mem"></term>
 306    <listitem>
 307     <para>
 308      During a series of insertions into an existing <acronym>GIN</acronym>
 309      index that has <literal>FASTUPDATE</> enabled, the system will clean up
 310      the pending-entry list whenever it grows larger than
 311      <varname>work_mem</>.  To avoid fluctuations in observed response time,
 312      it's desirable to have pending-list cleanup occur in the background
 313      (i.e., via autovacuum).  Foreground cleanup operations can be avoided by
 314      increasing <varname>work_mem</> or making autovacuum more aggressive.
 315      However, enlarging <varname>work_mem</> means that if a foreground
 316      cleanup does occur, it will take even longer.
 317     </para>
 318    </listitem>
 319   </varlistentry>
 320
 321   <varlistentry>
 322    <term><xref linkend="guc-gin-fuzzy-search-limit"></term>
 323    <listitem>
 324     <para>
 325      The primary goal of developing <acronym>GIN</acronym> indexes was
 326      to create support for highly scalable, full-text search in
 327      <productname>PostgreSQL</productname>, and there are often situations when
 328      a full-text search returns a very large set of results.  Moreover, this
 329      often happens when the query contains very frequent words, so that the
 330      large result set is not even useful.  Since reading many
 331      tuples from the disk and sorting them could take a lot of time, this is
 332      unacceptable for production.  (Note that the index search itself is very
 333      fast.)
 334     </para>
 335     <para>
 336      To facilitate controlled execution of such queries,
 337      <acronym>GIN</acronym> has a configurable soft upper limit on the
 338      number of rows returned, the
 339      <varname>gin_fuzzy_search_limit</varname> configuration parameter.
 340      It is set to 0 (meaning no limit) by default.
 341      If a non-zero limit is set, then the returned set is a subset of
 342      the whole result set, chosen at random.
 343     </para>
 344     <para>
 345      <quote>Soft</quote> means that the actual number of returned results
 346      could differ slightly from the specified limit, depending on the query
 347      and the quality of the system's random number generator.
 348     </para>
 349    </listitem>
 350   </varlistentry>
 351  </variablelist>
 352
 353 </sect1>
 354
 355 <sect1 id="gin-limit">
 356  <title>Limitations</title>
 357
 358  <para>
 359   <acronym>GIN</acronym> doesn't support full index scans.  The reason for
 360   this is that <function>extractValue</> is allowed to return zero keys,
 361   as for example might happen with an empty string or empty array.  In such
 362   a case the indexed value will be unrepresented in the index.  It is
 363   therefore impossible for <acronym>GIN</acronym> to guarantee that a
 364   scan of the index can find every row in the table.
 365  </para>
 366
 367  <para>
 368   Because of this limitation, when <function>extractQuery</function> returns
 369   <literal>nkeys = 0</> to indicate that all values match the query,
 370   <acronym>GIN</acronym> will emit an error.  (If there are multiple ANDed
 371   indexable operators in the query, this happens only if they all return zero
 372   for <literal>nkeys</>.)
 373  </para>
 374
 375  <para>
 376   It is possible for an operator class to circumvent the restriction against
 377   full index scan.  To do that, <function>extractValue</> must return at least
 378   one (possibly dummy) key for every indexed value, and
 379   <function>extractQuery</function> must convert an unrestricted search into
 380   a partial-match query that will scan the whole index.  This is inefficient
 381   but might be necessary to avoid corner-case failures with operators such
 382   as <literal>LIKE</> or subset inclusion.
 383  </para>
 384
 385  <para>
 386   <acronym>GIN</acronym> assumes that indexable operators are strict.
 387   This means that <function>extractValue</> will not be called at all on
 388   a NULL value (so the value will go unindexed), and
 389   <function>extractQuery</function> will not be called on a NULL comparison
 390   value either (instead, the query is presumed to be unmatchable).
 391  </para>
 392
 393  <para>
 394   A possibly more serious limitation is that <acronym>GIN</acronym> cannot
 395   handle NULL keys &mdash; for example, an array containing a NULL cannot
 396   be handled except by ignoring the NULL.
 397  </para>
 398 </sect1>
 399
 400 <sect1 id="gin-examples">
 401  <title>Examples</title>
 402
 403  <para>
 404   The <productname>PostgreSQL</productname> source distribution includes
 405   <acronym>GIN</acronym> operator classes for <type>tsvector</> and
 406   for one-dimensional arrays of all internal types.  Prefix searching in
 407   <type>tsvector</> is implemented using the <acronym>GIN</> partial match
 408   feature.
 409   The following <filename>contrib</> modules also contain
 410   <acronym>GIN</acronym> operator classes:
 411  </para>
 412
 413  <variablelist>
 414   <varlistentry>
 415    <term>btree_gin</term>
 416    <listitem>
 417     <para>B-Tree equivalent functionality for several data types</para>
 418    </listitem>
 419   </varlistentry>
 420
 421   <varlistentry>
 422    <term>hstore</term>
 423    <listitem>
 424     <para>Module for storing (key, value) pairs</para>
 425    </listitem>
 426   </varlistentry>
 427
 428   <varlistentry>
 429    <term>intarray</term>
 430    <listitem>
 431     <para>Enhanced support for int4[]</para>
 432    </listitem>
 433   </varlistentry>
 434
 435   <varlistentry>
 436    <term>pg_trgm</term>
 437    <listitem>
 438     <para>Text similarity using trigram matching</para>
 439    </listitem>
 440   </varlistentry>
 441  </variablelist>
 442 </sect1>
 443
 444 </chapter>