Remove old RULE privilege completely.
[pgsql.git] / doc / src / sgml / pgtrgm.sgml
blob07bfcac93191ed91822e24e7a746e87ac3e9d0df
1 <!-- doc/src/sgml/pgtrgm.sgml -->
3 <sect1 id="pgtrgm" xreflabel="pg_trgm">
4 <title>pg_trgm &mdash;
5 support for similarity of text using trigram matching</title>
7 <indexterm zone="pgtrgm">
8 <primary>pg_trgm</primary>
9 </indexterm>
11 <para>
12 The <filename>pg_trgm</filename> module provides functions and operators
13 for determining the similarity of
14 alphanumeric text based on trigram matching, as
15 well as index operator classes that support fast searching for similar
16 strings.
17 </para>
19 <para>
20 This module is considered <quote>trusted</quote>, that is, it can be
21 installed by non-superusers who have <literal>CREATE</literal> privilege
22 on the current database.
23 </para>
25 <sect2 id="pgtrgm-concepts">
26 <title>Trigram (or Trigraph) Concepts</title>
28 <para>
29 A trigram is a group of three consecutive characters taken
30 from a string. We can measure the similarity of two strings by
31 counting the number of trigrams they share. This simple idea
32 turns out to be very effective for measuring the similarity of
33 words in many natural languages.
34 </para>
36 <note>
37 <para>
38 <filename>pg_trgm</filename> ignores non-word characters
39 (non-alphanumerics) when extracting trigrams from a string.
40 Each word is considered to have two spaces
41 prefixed and one space suffixed when determining the set
42 of trigrams contained in the string.
43 For example, the set of trigrams in the string
44 <quote><literal>cat</literal></quote> is
45 <quote><literal> c</literal></quote>,
46 <quote><literal> ca</literal></quote>,
47 <quote><literal>cat</literal></quote>, and
48 <quote><literal>at </literal></quote>.
49 The set of trigrams in the string
50 <quote><literal>foo|bar</literal></quote> is
51 <quote><literal> f</literal></quote>,
52 <quote><literal> fo</literal></quote>,
53 <quote><literal>foo</literal></quote>,
54 <quote><literal>oo </literal></quote>,
55 <quote><literal> b</literal></quote>,
56 <quote><literal> ba</literal></quote>,
57 <quote><literal>bar</literal></quote>, and
58 <quote><literal>ar </literal></quote>.
59 </para>
60 </note>
61 </sect2>
63 <sect2 id="pgtrgm-funcs-ops">
64 <title>Functions and Operators</title>
66 <para>
67 The functions provided by the <filename>pg_trgm</filename> module
68 are shown in <xref linkend="pgtrgm-func-table"/>, the operators
69 in <xref linkend="pgtrgm-op-table"/>.
70 </para>
72 <table id="pgtrgm-func-table">
73 <title><filename>pg_trgm</filename> Functions</title>
74 <tgroup cols="1">
75 <thead>
76 <row>
77 <entry role="func_table_entry"><para role="func_signature">
78 Function
79 </para>
80 <para>
81 Description
82 </para></entry>
83 </row>
84 </thead>
86 <tbody>
87 <row>
88 <entry role="func_table_entry"><para role="func_signature">
89 <indexterm><primary>similarity</primary></indexterm>
90 <function>similarity</function> ( <type>text</type>, <type>text</type> )
91 <returnvalue>real</returnvalue>
92 </para>
93 <para>
94 Returns a number that indicates how similar the two arguments are.
95 The range of the result is zero (indicating that the two strings are
96 completely dissimilar) to one (indicating that the two strings are
97 identical).
98 </para></entry>
99 </row>
101 <row>
102 <entry role="func_table_entry"><para role="func_signature">
103 <indexterm><primary>show_trgm</primary></indexterm>
104 <function>show_trgm</function> ( <type>text</type> )
105 <returnvalue>text[]</returnvalue>
106 </para>
107 <para>
108 Returns an array of all the trigrams in the given string.
109 (In practice this is seldom useful except for debugging.)
110 </para></entry>
111 </row>
113 <row>
114 <entry role="func_table_entry"><para role="func_signature">
115 <indexterm><primary>word_similarity</primary></indexterm>
116 <function>word_similarity</function> ( <type>text</type>, <type>text</type> )
117 <returnvalue>real</returnvalue>
118 </para>
119 <para>
120 Returns a number that indicates the greatest similarity between
121 the set of trigrams in the first string and any continuous extent
122 of an ordered set of trigrams in the second string. For details, see
123 the explanation below.
124 </para></entry>
125 </row>
127 <row>
128 <entry role="func_table_entry"><para role="func_signature">
129 <indexterm><primary>strict_word_similarity</primary></indexterm>
130 <function>strict_word_similarity</function> ( <type>text</type>, <type>text</type> )
131 <returnvalue>real</returnvalue>
132 </para>
133 <para>
134 Same as <function>word_similarity</function>, but forces
135 extent boundaries to match word boundaries. Since we don't have
136 cross-word trigrams, this function actually returns greatest similarity
137 between first string and any continuous extent of words of the second
138 string.
139 </para></entry>
140 </row>
142 <row>
143 <entry role="func_table_entry"><para role="func_signature">
144 <indexterm><primary>show_limit</primary></indexterm>
145 <function>show_limit</function> ()
146 <returnvalue>real</returnvalue>
147 </para>
148 <para>
149 Returns the current similarity threshold used by the <literal>%</literal>
150 operator. This sets the minimum similarity between
151 two words for them to be considered similar enough to
152 be misspellings of each other, for example.
153 (<emphasis>Deprecated</emphasis>; instead use <command>SHOW</command>
154 <varname>pg_trgm.similarity_threshold</varname>.)
155 </para></entry>
156 </row>
158 <row>
159 <entry role="func_table_entry"><para role="func_signature">
160 <indexterm><primary>set_limit</primary></indexterm>
161 <function>set_limit</function> ( <type>real</type> )
162 <returnvalue>real</returnvalue>
163 </para>
164 <para>
165 Sets the current similarity threshold that is used by the <literal>%</literal>
166 operator. The threshold must be between 0 and 1 (default is 0.3).
167 Returns the same value passed in.
168 (<emphasis>Deprecated</emphasis>; instead use <command>SET</command>
169 <varname>pg_trgm.similarity_threshold</varname>.)
170 </para></entry>
171 </row>
172 </tbody>
173 </tgroup>
174 </table>
176 <para>
177 Consider the following example:
179 <programlisting>
180 # SELECT word_similarity('word', 'two words');
181 word_similarity
182 -----------------
184 (1 row)
185 </programlisting>
187 In the first string, the set of trigrams is
188 <literal>{" w"," wo","wor","ord","rd "}</literal>.
189 In the second string, the ordered set of trigrams is
190 <literal>{" t"," tw","two","wo "," w"," wo","wor","ord","rds","ds "}</literal>.
191 The most similar extent of an ordered set of trigrams in the second string
192 is <literal>{" w"," wo","wor","ord"}</literal>, and the similarity is
193 <literal>0.8</literal>.
194 </para>
196 <para>
197 This function returns a value that can be approximately understood as the
198 greatest similarity between the first string and any substring of the second
199 string. However, this function does not add padding to the boundaries of
200 the extent. Thus, the number of additional characters present in the
201 second string is not considered, except for the mismatched word boundaries.
202 </para>
204 <para>
205 At the same time, <function>strict_word_similarity</function>
206 selects an extent of words in the second string. In the example above,
207 <function>strict_word_similarity</function> would select the
208 extent of a single word <literal>'words'</literal>, whose set of trigrams is
209 <literal>{" w"," wo","wor","ord","rds","ds "}</literal>.
211 <programlisting>
212 # SELECT strict_word_similarity('word', 'two words'), similarity('word', 'words');
213 strict_word_similarity | similarity
214 ------------------------+------------
215 0.571429 | 0.571429
216 (1 row)
217 </programlisting>
218 </para>
220 <para>
221 Thus, the <function>strict_word_similarity</function> function
222 is useful for finding the similarity to whole words, while
223 <function>word_similarity</function> is more suitable for
224 finding the similarity for parts of words.
225 </para>
227 <table id="pgtrgm-op-table">
228 <title><filename>pg_trgm</filename> Operators</title>
229 <tgroup cols="1">
230 <thead>
231 <row>
232 <entry role="func_table_entry"><para role="func_signature">
233 Operator
234 </para>
235 <para>
236 Description
237 </para></entry>
238 </row>
239 </thead>
241 <tbody>
242 <row>
243 <entry role="func_table_entry"><para role="func_signature">
244 <type>text</type> <literal>%</literal> <type>text</type>
245 <returnvalue>boolean</returnvalue>
246 </para>
247 <para>
248 Returns <literal>true</literal> if its arguments have a similarity
249 that is greater than the current similarity threshold set by
250 <varname>pg_trgm.similarity_threshold</varname>.
251 </para></entry>
252 </row>
254 <row>
255 <entry role="func_table_entry"><para role="func_signature">
256 <type>text</type> <literal>&lt;%</literal> <type>text</type>
257 <returnvalue>boolean</returnvalue>
258 </para>
259 <para>
260 Returns <literal>true</literal> if the similarity between the trigram
261 set in the first argument and a continuous extent of an ordered trigram
262 set in the second argument is greater than the current word similarity
263 threshold set by <varname>pg_trgm.word_similarity_threshold</varname>
264 parameter.
265 </para></entry>
266 </row>
268 <row>
269 <entry role="func_table_entry"><para role="func_signature">
270 <type>text</type> <literal>%&gt;</literal> <type>text</type>
271 <returnvalue>boolean</returnvalue>
272 </para>
273 <para>
274 Commutator of the <literal>&lt;%</literal> operator.
275 </para></entry>
276 </row>
278 <row>
279 <entry role="func_table_entry"><para role="func_signature">
280 <type>text</type> <literal>&lt;&lt;%</literal> <type>text</type>
281 <returnvalue>boolean</returnvalue>
282 </para>
283 <para>
284 Returns <literal>true</literal> if its second argument has a continuous
285 extent of an ordered trigram set that matches word boundaries,
286 and its similarity to the trigram set of the first argument is greater
287 than the current strict word similarity threshold set by the
288 <varname>pg_trgm.strict_word_similarity_threshold</varname> parameter.
289 </para></entry>
290 </row>
292 <row>
293 <entry role="func_table_entry"><para role="func_signature">
294 <type>text</type> <literal>%&gt;&gt;</literal> <type>text</type>
295 <returnvalue>boolean</returnvalue>
296 </para>
297 <para>
298 Commutator of the <literal>&lt;&lt;%</literal> operator.
299 </para></entry>
300 </row>
302 <row>
303 <entry role="func_table_entry"><para role="func_signature">
304 <type>text</type> <literal>&lt;-&gt;</literal> <type>text</type>
305 <returnvalue>real</returnvalue>
306 </para>
307 <para>
308 Returns the <quote>distance</quote> between the arguments, that is
309 one minus the <function>similarity()</function> value.
310 </para></entry>
311 </row>
313 <row>
314 <entry role="func_table_entry"><para role="func_signature">
315 <type>text</type> <literal>&lt;&lt;-&gt;</literal> <type>text</type>
316 <returnvalue>real</returnvalue>
317 </para>
318 <para>
319 Returns the <quote>distance</quote> between the arguments, that is
320 one minus the <function>word_similarity()</function> value.
321 </para></entry>
322 </row>
324 <row>
325 <entry role="func_table_entry"><para role="func_signature">
326 <type>text</type> <literal>&lt;-&gt;&gt;</literal> <type>text</type>
327 <returnvalue>real</returnvalue>
328 </para>
329 <para>
330 Commutator of the <literal>&lt;&lt;-&gt;</literal> operator.
331 </para></entry>
332 </row>
334 <row>
335 <entry role="func_table_entry"><para role="func_signature">
336 <type>text</type> <literal>&lt;&lt;&lt;-&gt;</literal> <type>text</type>
337 <returnvalue>real</returnvalue>
338 </para>
339 <para>
340 Returns the <quote>distance</quote> between the arguments, that is
341 one minus the <function>strict_word_similarity()</function> value.
342 </para></entry>
343 </row>
345 <row>
346 <entry role="func_table_entry"><para role="func_signature">
347 <type>text</type> <literal>&lt;-&gt;&gt;&gt;</literal> <type>text</type>
348 <returnvalue>real</returnvalue>
349 </para>
350 <para>
351 Commutator of the <literal>&lt;&lt;&lt;-&gt;</literal> operator.
352 </para></entry>
353 </row>
354 </tbody>
355 </tgroup>
356 </table>
357 </sect2>
359 <sect2 id="pgtrgm-guc">
360 <title>GUC Parameters</title>
362 <variablelist>
363 <varlistentry id="guc-pgtrgm-similarity-threshold" xreflabel="pg_trgm.similarity_threshold">
364 <term>
365 <varname>pg_trgm.similarity_threshold</varname> (<type>real</type>)
366 <indexterm>
367 <primary><varname>pg_trgm.similarity_threshold</varname> configuration parameter</primary>
368 </indexterm>
369 </term>
370 <listitem>
371 <para>
372 Sets the current similarity threshold that is used by the <literal>%</literal>
373 operator. The threshold must be between 0 and 1 (default is 0.3).
374 </para>
375 </listitem>
376 </varlistentry>
377 <varlistentry id="guc-pgtrgm-word-similarity-threshold" xreflabel="pg_trgm.word_similarity_threshold">
378 <term>
379 <varname>pg_trgm.word_similarity_threshold</varname> (<type>real</type>)
380 <indexterm>
381 <primary><varname>pg_trgm.word_similarity_threshold</varname> configuration parameter</primary>
382 </indexterm>
383 </term>
384 <listitem>
385 <para>
386 Sets the current word similarity threshold that is used by the
387 <literal>&lt;%</literal> and <literal>%&gt;</literal> operators. The threshold
388 must be between 0 and 1 (default is 0.6).
389 </para>
390 </listitem>
391 </varlistentry>
392 <varlistentry id="guc-pgtrgm-strict-word-similarity-threshold" xreflabel="pg_trgm.strict_word_similarity_threshold">
393 <term>
394 <varname>pg_trgm.strict_word_similarity_threshold</varname> (<type>real</type>)
395 <indexterm>
396 <primary><varname>pg_trgm.strict_word_similarity_threshold</varname> configuration parameter</primary>
397 </indexterm>
398 </term>
399 <listitem>
400 <para>
401 Sets the current strict word similarity threshold that is used by the
402 <literal>&lt;&lt;%</literal> and <literal>%&gt;&gt;</literal> operators. The threshold
403 must be between 0 and 1 (default is 0.5).
404 </para>
405 </listitem>
406 </varlistentry>
407 </variablelist>
408 </sect2>
410 <sect2 id="pgtrgm-index">
411 <title>Index Support</title>
413 <para>
414 The <filename>pg_trgm</filename> module provides GiST and GIN index
415 operator classes that allow you to create an index over a text column for
416 the purpose of very fast similarity searches. These index types support
417 the above-described similarity operators, and additionally support
418 trigram-based index searches for <literal>LIKE</literal>, <literal>ILIKE</literal>,
419 <literal>~</literal>, <literal>~*</literal> and <literal>=</literal> queries.
420 The similarity comparisons are case-insensitive in a default build of
421 <filename>pg_trgm</filename>.
422 Inequality operators are not supported.
423 Note that those indexes may not be as efficient as regular B-tree indexes
424 for equality operator.
425 </para>
427 <para>
428 Example:
430 <programlisting>
431 CREATE TABLE test_trgm (t text);
432 CREATE INDEX trgm_idx ON test_trgm USING GIST (t gist_trgm_ops);
433 </programlisting>
435 <programlisting>
436 CREATE INDEX trgm_idx ON test_trgm USING GIN (t gin_trgm_ops);
437 </programlisting>
438 </para>
440 <para>
441 <literal>gist_trgm_ops</literal> GiST opclass approximates a set of
442 trigrams as a bitmap signature. Its optional integer parameter
443 <literal>siglen</literal> determines the
444 signature length in bytes. The default length is 12 bytes.
445 Valid values of signature length are between 1 and 2024 bytes. Longer
446 signatures lead to a more precise search (scanning a smaller fraction of the index and
447 fewer heap pages), at the cost of a larger index.
448 </para>
450 <para>
451 Example of creating such an index with a signature length of 32 bytes:
452 </para>
453 <programlisting>
454 CREATE INDEX trgm_idx ON test_trgm USING GIST (t gist_trgm_ops(siglen=32));
455 </programlisting>
457 <para>
458 At this point, you will have an index on the <structfield>t</structfield> column that
459 you can use for similarity searching. A typical query is
460 <programlisting>
461 SELECT t, similarity(t, '<replaceable>word</replaceable>') AS sml
462 FROM test_trgm
463 WHERE t % '<replaceable>word</replaceable>'
464 ORDER BY sml DESC, t;
465 </programlisting>
466 This will return all values in the text column that are sufficiently
467 similar to <replaceable>word</replaceable>, sorted from best match to worst. The
468 index will be used to make this a fast operation even over very large data
469 sets.
470 </para>
472 <para>
473 A variant of the above query is
474 <programlisting>
475 SELECT t, t &lt;-&gt; '<replaceable>word</replaceable>' AS dist
476 FROM test_trgm
477 ORDER BY dist LIMIT 10;
478 </programlisting>
479 This can be implemented quite efficiently by GiST indexes, but not
480 by GIN indexes. It will usually beat the first formulation when only
481 a small number of the closest matches is wanted.
482 </para>
484 <para>
485 Also you can use an index on the <structfield>t</structfield> column for word
486 similarity or strict word similarity. Typical queries are:
487 <programlisting>
488 SELECT t, word_similarity('<replaceable>word</replaceable>', t) AS sml
489 FROM test_trgm
490 WHERE '<replaceable>word</replaceable>' &lt;% t
491 ORDER BY sml DESC, t;
492 </programlisting>
494 <programlisting>
495 SELECT t, strict_word_similarity('<replaceable>word</replaceable>', t) AS sml
496 FROM test_trgm
497 WHERE '<replaceable>word</replaceable>' &lt;&lt;% t
498 ORDER BY sml DESC, t;
499 </programlisting>
500 This will return all values in the text column for which there is a
501 continuous extent in the corresponding ordered trigram set that is
502 sufficiently similar to the trigram set of <replaceable>word</replaceable>,
503 sorted from best match to worst. The index will be used to make this
504 a fast operation even over very large data sets.
505 </para>
507 <para>
508 Possible variants of the above queries are:
509 <programlisting>
510 SELECT t, '<replaceable>word</replaceable>' &lt;&lt;-&gt; t AS dist
511 FROM test_trgm
512 ORDER BY dist LIMIT 10;
513 </programlisting>
515 <programlisting>
516 SELECT t, '<replaceable>word</replaceable>' &lt;&lt;&lt;-&gt; t AS dist
517 FROM test_trgm
518 ORDER BY dist LIMIT 10;
519 </programlisting>
520 This can be implemented quite efficiently by GiST indexes, but not
521 by GIN indexes.
522 </para>
525 <para>
526 Beginning in <productname>PostgreSQL</productname> 9.1, these index types also support
527 index searches for <literal>LIKE</literal> and <literal>ILIKE</literal>, for example
528 <programlisting>
529 SELECT * FROM test_trgm WHERE t LIKE '%foo%bar';
530 </programlisting>
531 The index search works by extracting trigrams from the search string
532 and then looking these up in the index. The more trigrams in the search
533 string, the more effective the index search is. Unlike B-tree based
534 searches, the search string need not be left-anchored.
535 </para>
537 <para>
538 Beginning in <productname>PostgreSQL</productname> 9.3, these index types also support
539 index searches for regular-expression matches
540 (<literal>~</literal> and <literal>~*</literal> operators), for example
541 <programlisting>
542 SELECT * FROM test_trgm WHERE t ~ '(foo|bar)';
543 </programlisting>
544 The index search works by extracting trigrams from the regular expression
545 and then looking these up in the index. The more trigrams that can be
546 extracted from the regular expression, the more effective the index search
547 is. Unlike B-tree based searches, the search string need not be
548 left-anchored.
549 </para>
551 <para>
552 For both <literal>LIKE</literal> and regular-expression searches, keep in mind
553 that a pattern with no extractable trigrams will degenerate to a full-index
554 scan.
555 </para>
557 <para>
558 The choice between GiST and GIN indexing depends on the relative
559 performance characteristics of GiST and GIN, which are discussed elsewhere.
560 </para>
561 </sect2>
563 <sect2 id="pgtrgm-text-search">
564 <title>Text Search Integration</title>
566 <para>
567 Trigram matching is a very useful tool when used in conjunction
568 with a full text index. In particular it can help to recognize
569 misspelled input words that will not be matched directly by the
570 full text search mechanism.
571 </para>
573 <para>
574 The first step is to generate an auxiliary table containing all
575 the unique words in the documents:
577 <programlisting>
578 CREATE TABLE words AS SELECT word FROM
579 ts_stat('SELECT to_tsvector(''simple'', bodytext) FROM documents');
580 </programlisting>
582 where <structname>documents</structname> is a table that has a text field
583 <structfield>bodytext</structfield> that we wish to search. The reason for using
584 the <literal>simple</literal> configuration with the <function>to_tsvector</function>
585 function, instead of using a language-specific configuration,
586 is that we want a list of the original (unstemmed) words.
587 </para>
589 <para>
590 Next, create a trigram index on the word column:
592 <programlisting>
593 CREATE INDEX words_idx ON words USING GIN (word gin_trgm_ops);
594 </programlisting>
596 Now, a <command>SELECT</command> query similar to the previous example can
597 be used to suggest spellings for misspelled words in user search terms.
598 A useful extra test is to require that the selected words are also of
599 similar length to the misspelled word.
600 </para>
602 <note>
603 <para>
604 Since the <structname>words</structname> table has been generated as a separate,
605 static table, it will need to be periodically regenerated so that
606 it remains reasonably up-to-date with the document collection.
607 Keeping it exactly current is usually unnecessary.
608 </para>
609 </note>
610 </sect2>
612 <sect2 id="pgtrgm-references">
613 <title>References</title>
615 <para>
616 GiST Development Site
617 <ulink url="http://www.sai.msu.su/~megera/postgres/gist/"></ulink>
618 </para>
619 <para>
620 Tsearch2 Development Site
621 <ulink url="http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/"></ulink>
622 </para>
623 </sect2>
625 <sect2 id="pgtrgm-authors">
626 <title>Authors</title>
628 <para>
629 Oleg Bartunov <email>oleg@sai.msu.su</email>, Moscow, Moscow University, Russia
630 </para>
631 <para>
632 Teodor Sigaev <email>teodor@sigaev.ru</email>, Moscow, Delta-Soft Ltd.,Russia
633 </para>
634 <para>
635 Alexander Korotkov <email>a.korotkov@postgrespro.ru</email>, Moscow, Postgres Professional, Russia
636 </para>
637 <para>
638 Documentation: Christopher Kings-Lynne
639 </para>
640 <para>
641 This module is sponsored by Delta-Soft Ltd., Moscow, Russia.
642 </para>
643 </sect2>
645 </sect1>