1 <!-- doc/src/sgml/pgtrgm.sgml -->
3 <sect1 id=
"pgtrgm" xreflabel=
"pg_trgm">
5 support for similarity of text using trigram matching
</title>
7 <indexterm zone=
"pgtrgm">
8 <primary>pg_trgm
</primary>
12 The
<filename>pg_trgm
</filename> module provides functions and operators
13 for determining the similarity of
14 alphanumeric text based on trigram matching, as
15 well as index operator classes that support fast searching for similar
20 This module is considered
<quote>trusted
</quote>, that is, it can be
21 installed by non-superusers who have
<literal>CREATE
</literal> privilege
22 on the current database.
25 <sect2 id=
"pgtrgm-concepts">
26 <title>Trigram (or Trigraph) Concepts
</title>
29 A trigram is a group of three consecutive characters taken
30 from a string. We can measure the similarity of two strings by
31 counting the number of trigrams they share. This simple idea
32 turns out to be very effective for measuring the similarity of
33 words in many natural languages.
38 <filename>pg_trgm
</filename> ignores non-word characters
39 (non-alphanumerics) when extracting trigrams from a string.
40 Each word is considered to have two spaces
41 prefixed and one space suffixed when determining the set
42 of trigrams contained in the string.
43 For example, the set of trigrams in the string
44 <quote><literal>cat
</literal></quote> is
45 <quote><literal> c
</literal></quote>,
46 <quote><literal> ca
</literal></quote>,
47 <quote><literal>cat
</literal></quote>, and
48 <quote><literal>at
</literal></quote>.
49 The set of trigrams in the string
50 <quote><literal>foo|bar
</literal></quote> is
51 <quote><literal> f
</literal></quote>,
52 <quote><literal> fo
</literal></quote>,
53 <quote><literal>foo
</literal></quote>,
54 <quote><literal>oo
</literal></quote>,
55 <quote><literal> b
</literal></quote>,
56 <quote><literal> ba
</literal></quote>,
57 <quote><literal>bar
</literal></quote>, and
58 <quote><literal>ar
</literal></quote>.
63 <sect2 id=
"pgtrgm-funcs-ops">
64 <title>Functions and Operators
</title>
67 The functions provided by the
<filename>pg_trgm
</filename> module
68 are shown in
<xref linkend=
"pgtrgm-func-table"/>, the operators
69 in
<xref linkend=
"pgtrgm-op-table"/>.
72 <table id=
"pgtrgm-func-table">
73 <title><filename>pg_trgm
</filename> Functions
</title>
77 <entry role=
"func_table_entry"><para role=
"func_signature">
88 <entry role=
"func_table_entry"><para role=
"func_signature">
89 <indexterm><primary>similarity
</primary></indexterm>
90 <function>similarity
</function> (
<type>text
</type>,
<type>text
</type> )
91 <returnvalue>real
</returnvalue>
94 Returns a number that indicates how similar the two arguments are.
95 The range of the result is zero (indicating that the two strings are
96 completely dissimilar) to one (indicating that the two strings are
102 <entry role=
"func_table_entry"><para role=
"func_signature">
103 <indexterm><primary>show_trgm
</primary></indexterm>
104 <function>show_trgm
</function> (
<type>text
</type> )
105 <returnvalue>text[]
</returnvalue>
108 Returns an array of all the trigrams in the given string.
109 (In practice this is seldom useful except for debugging.)
114 <entry role=
"func_table_entry"><para role=
"func_signature">
115 <indexterm><primary>word_similarity
</primary></indexterm>
116 <function>word_similarity
</function> (
<type>text
</type>,
<type>text
</type> )
117 <returnvalue>real
</returnvalue>
120 Returns a number that indicates the greatest similarity between
121 the set of trigrams in the first string and any continuous extent
122 of an ordered set of trigrams in the second string. For details, see
123 the explanation below.
128 <entry role=
"func_table_entry"><para role=
"func_signature">
129 <indexterm><primary>strict_word_similarity
</primary></indexterm>
130 <function>strict_word_similarity
</function> (
<type>text
</type>,
<type>text
</type> )
131 <returnvalue>real
</returnvalue>
134 Same as
<function>word_similarity
</function>, but forces
135 extent boundaries to match word boundaries. Since we don't have
136 cross-word trigrams, this function actually returns greatest similarity
137 between first string and any continuous extent of words of the second
143 <entry role=
"func_table_entry"><para role=
"func_signature">
144 <indexterm><primary>show_limit
</primary></indexterm>
145 <function>show_limit
</function> ()
146 <returnvalue>real
</returnvalue>
149 Returns the current similarity threshold used by the
<literal>%
</literal>
150 operator. This sets the minimum similarity between
151 two words for them to be considered similar enough to
152 be misspellings of each other, for example.
153 (
<emphasis>Deprecated
</emphasis>; instead use
<command>SHOW
</command>
154 <varname>pg_trgm.similarity_threshold
</varname>.)
159 <entry role=
"func_table_entry"><para role=
"func_signature">
160 <indexterm><primary>set_limit
</primary></indexterm>
161 <function>set_limit
</function> (
<type>real
</type> )
162 <returnvalue>real
</returnvalue>
165 Sets the current similarity threshold that is used by the
<literal>%
</literal>
166 operator. The threshold must be between
0 and
1 (default is
0.3).
167 Returns the same value passed in.
168 (
<emphasis>Deprecated
</emphasis>; instead use
<command>SET
</command>
169 <varname>pg_trgm.similarity_threshold
</varname>.)
177 Consider the following example:
180 # SELECT word_similarity('word', 'two words');
187 In the first string, the set of trigrams is
188 <literal>{
" w",
" wo",
"wor",
"ord",
"rd "}
</literal>.
189 In the second string, the ordered set of trigrams is
190 <literal>{
" t",
" tw",
"two",
"wo ",
" w",
" wo",
"wor",
"ord",
"rds",
"ds "}
</literal>.
191 The most similar extent of an ordered set of trigrams in the second string
192 is
<literal>{
" w",
" wo",
"wor",
"ord"}
</literal>, and the similarity is
193 <literal>0.8</literal>.
197 This function returns a value that can be approximately understood as the
198 greatest similarity between the first string and any substring of the second
199 string. However, this function does not add padding to the boundaries of
200 the extent. Thus, the number of additional characters present in the
201 second string is not considered, except for the mismatched word boundaries.
205 At the same time,
<function>strict_word_similarity
</function>
206 selects an extent of words in the second string. In the example above,
207 <function>strict_word_similarity
</function> would select the
208 extent of a single word
<literal>'words'
</literal>, whose set of trigrams is
209 <literal>{
" w",
" wo",
"wor",
"ord",
"rds",
"ds "}
</literal>.
212 # SELECT strict_word_similarity('word', 'two words'), similarity('word', 'words');
213 strict_word_similarity | similarity
214 ------------------------+------------
221 Thus, the
<function>strict_word_similarity
</function> function
222 is useful for finding the similarity to whole words, while
223 <function>word_similarity
</function> is more suitable for
224 finding the similarity for parts of words.
227 <table id=
"pgtrgm-op-table">
228 <title><filename>pg_trgm
</filename> Operators
</title>
232 <entry role=
"func_table_entry"><para role=
"func_signature">
243 <entry role=
"func_table_entry"><para role=
"func_signature">
244 <type>text
</type> <literal>%
</literal> <type>text
</type>
245 <returnvalue>boolean
</returnvalue>
248 Returns
<literal>true
</literal> if its arguments have a similarity
249 that is greater than the current similarity threshold set by
250 <varname>pg_trgm.similarity_threshold
</varname>.
255 <entry role=
"func_table_entry"><para role=
"func_signature">
256 <type>text
</type> <literal><%
</literal> <type>text
</type>
257 <returnvalue>boolean
</returnvalue>
260 Returns
<literal>true
</literal> if the similarity between the trigram
261 set in the first argument and a continuous extent of an ordered trigram
262 set in the second argument is greater than the current word similarity
263 threshold set by
<varname>pg_trgm.word_similarity_threshold
</varname>
269 <entry role=
"func_table_entry"><para role=
"func_signature">
270 <type>text
</type> <literal>%
></literal> <type>text
</type>
271 <returnvalue>boolean
</returnvalue>
274 Commutator of the
<literal><%
</literal> operator.
279 <entry role=
"func_table_entry"><para role=
"func_signature">
280 <type>text
</type> <literal><<%
</literal> <type>text
</type>
281 <returnvalue>boolean
</returnvalue>
284 Returns
<literal>true
</literal> if its second argument has a continuous
285 extent of an ordered trigram set that matches word boundaries,
286 and its similarity to the trigram set of the first argument is greater
287 than the current strict word similarity threshold set by the
288 <varname>pg_trgm.strict_word_similarity_threshold
</varname> parameter.
293 <entry role=
"func_table_entry"><para role=
"func_signature">
294 <type>text
</type> <literal>%
>></literal> <type>text
</type>
295 <returnvalue>boolean
</returnvalue>
298 Commutator of the
<literal><<%
</literal> operator.
303 <entry role=
"func_table_entry"><para role=
"func_signature">
304 <type>text
</type> <literal><-
></literal> <type>text
</type>
305 <returnvalue>real
</returnvalue>
308 Returns the
<quote>distance
</quote> between the arguments, that is
309 one minus the
<function>similarity()
</function> value.
314 <entry role=
"func_table_entry"><para role=
"func_signature">
315 <type>text
</type> <literal><<-
></literal> <type>text
</type>
316 <returnvalue>real
</returnvalue>
319 Returns the
<quote>distance
</quote> between the arguments, that is
320 one minus the
<function>word_similarity()
</function> value.
325 <entry role=
"func_table_entry"><para role=
"func_signature">
326 <type>text
</type> <literal><-
>></literal> <type>text
</type>
327 <returnvalue>real
</returnvalue>
330 Commutator of the
<literal><<-
></literal> operator.
335 <entry role=
"func_table_entry"><para role=
"func_signature">
336 <type>text
</type> <literal><<<-
></literal> <type>text
</type>
337 <returnvalue>real
</returnvalue>
340 Returns the
<quote>distance
</quote> between the arguments, that is
341 one minus the
<function>strict_word_similarity()
</function> value.
346 <entry role=
"func_table_entry"><para role=
"func_signature">
347 <type>text
</type> <literal><-
>>></literal> <type>text
</type>
348 <returnvalue>real
</returnvalue>
351 Commutator of the
<literal><<<-
></literal> operator.
359 <sect2 id=
"pgtrgm-guc">
360 <title>GUC Parameters
</title>
363 <varlistentry id=
"guc-pgtrgm-similarity-threshold" xreflabel=
"pg_trgm.similarity_threshold">
365 <varname>pg_trgm.similarity_threshold
</varname> (
<type>real
</type>)
367 <primary><varname>pg_trgm.similarity_threshold
</varname> configuration parameter
</primary>
372 Sets the current similarity threshold that is used by the
<literal>%
</literal>
373 operator. The threshold must be between
0 and
1 (default is
0.3).
377 <varlistentry id=
"guc-pgtrgm-word-similarity-threshold" xreflabel=
"pg_trgm.word_similarity_threshold">
379 <varname>pg_trgm.word_similarity_threshold
</varname> (
<type>real
</type>)
381 <primary><varname>pg_trgm.word_similarity_threshold
</varname> configuration parameter
</primary>
386 Sets the current word similarity threshold that is used by the
387 <literal><%
</literal> and
<literal>%
></literal> operators. The threshold
388 must be between
0 and
1 (default is
0.6).
392 <varlistentry id=
"guc-pgtrgm-strict-word-similarity-threshold" xreflabel=
"pg_trgm.strict_word_similarity_threshold">
394 <varname>pg_trgm.strict_word_similarity_threshold
</varname> (
<type>real
</type>)
396 <primary><varname>pg_trgm.strict_word_similarity_threshold
</varname> configuration parameter
</primary>
401 Sets the current strict word similarity threshold that is used by the
402 <literal><<%
</literal> and
<literal>%
>></literal> operators. The threshold
403 must be between
0 and
1 (default is
0.5).
410 <sect2 id=
"pgtrgm-index">
411 <title>Index Support
</title>
414 The
<filename>pg_trgm
</filename> module provides GiST and GIN index
415 operator classes that allow you to create an index over a text column for
416 the purpose of very fast similarity searches. These index types support
417 the above-described similarity operators, and additionally support
418 trigram-based index searches for
<literal>LIKE
</literal>,
<literal>ILIKE
</literal>,
419 <literal>~
</literal>,
<literal>~*
</literal> and
<literal>=
</literal> queries.
420 The similarity comparisons are case-insensitive in a default build of
421 <filename>pg_trgm
</filename>.
422 Inequality operators are not supported.
423 Note that those indexes may not be as efficient as regular B-tree indexes
424 for equality operator.
431 CREATE TABLE test_trgm (t text);
432 CREATE INDEX trgm_idx ON test_trgm USING GIST (t gist_trgm_ops);
436 CREATE INDEX trgm_idx ON test_trgm USING GIN (t gin_trgm_ops);
441 <literal>gist_trgm_ops
</literal> GiST opclass approximates a set of
442 trigrams as a bitmap signature. Its optional integer parameter
443 <literal>siglen
</literal> determines the
444 signature length in bytes. The default length is
12 bytes.
445 Valid values of signature length are between
1 and
2024 bytes. Longer
446 signatures lead to a more precise search (scanning a smaller fraction of the index and
447 fewer heap pages), at the cost of a larger index.
451 Example of creating such an index with a signature length of
32 bytes:
454 CREATE INDEX trgm_idx ON test_trgm USING GIST (t gist_trgm_ops(siglen=
32));
458 At this point, you will have an index on the
<structfield>t
</structfield> column that
459 you can use for similarity searching. A typical query is
461 SELECT t, similarity(t, '
<replaceable>word
</replaceable>') AS sml
463 WHERE t % '
<replaceable>word
</replaceable>'
464 ORDER BY sml DESC, t;
466 This will return all values in the text column that are sufficiently
467 similar to
<replaceable>word
</replaceable>, sorted from best match to worst. The
468 index will be used to make this a fast operation even over very large data
473 A variant of the above query is
475 SELECT t, t
<-
> '
<replaceable>word
</replaceable>' AS dist
477 ORDER BY dist LIMIT
10;
479 This can be implemented quite efficiently by GiST indexes, but not
480 by GIN indexes. It will usually beat the first formulation when only
481 a small number of the closest matches is wanted.
485 Also you can use an index on the
<structfield>t
</structfield> column for word
486 similarity or strict word similarity. Typical queries are:
488 SELECT t, word_similarity('
<replaceable>word
</replaceable>', t) AS sml
490 WHERE '
<replaceable>word
</replaceable>'
<% t
491 ORDER BY sml DESC, t;
495 SELECT t, strict_word_similarity('
<replaceable>word
</replaceable>', t) AS sml
497 WHERE '
<replaceable>word
</replaceable>'
<<% t
498 ORDER BY sml DESC, t;
500 This will return all values in the text column for which there is a
501 continuous extent in the corresponding ordered trigram set that is
502 sufficiently similar to the trigram set of
<replaceable>word
</replaceable>,
503 sorted from best match to worst. The index will be used to make this
504 a fast operation even over very large data sets.
508 Possible variants of the above queries are:
510 SELECT t, '
<replaceable>word
</replaceable>'
<<-
> t AS dist
512 ORDER BY dist LIMIT
10;
516 SELECT t, '
<replaceable>word
</replaceable>'
<<<-
> t AS dist
518 ORDER BY dist LIMIT
10;
520 This can be implemented quite efficiently by GiST indexes, but not
526 Beginning in
<productname>PostgreSQL
</productname> 9.1, these index types also support
527 index searches for
<literal>LIKE
</literal> and
<literal>ILIKE
</literal>, for example
529 SELECT * FROM test_trgm WHERE t LIKE '%foo%bar';
531 The index search works by extracting trigrams from the search string
532 and then looking these up in the index. The more trigrams in the search
533 string, the more effective the index search is. Unlike B-tree based
534 searches, the search string need not be left-anchored.
538 Beginning in
<productname>PostgreSQL
</productname> 9.3, these index types also support
539 index searches for regular-expression matches
540 (
<literal>~
</literal> and
<literal>~*
</literal> operators), for example
542 SELECT * FROM test_trgm WHERE t ~ '(foo|bar)';
544 The index search works by extracting trigrams from the regular expression
545 and then looking these up in the index. The more trigrams that can be
546 extracted from the regular expression, the more effective the index search
547 is. Unlike B-tree based searches, the search string need not be
552 For both
<literal>LIKE
</literal> and regular-expression searches, keep in mind
553 that a pattern with no extractable trigrams will degenerate to a full-index
558 The choice between GiST and GIN indexing depends on the relative
559 performance characteristics of GiST and GIN, which are discussed elsewhere.
563 <sect2 id=
"pgtrgm-text-search">
564 <title>Text Search Integration
</title>
567 Trigram matching is a very useful tool when used in conjunction
568 with a full text index. In particular it can help to recognize
569 misspelled input words that will not be matched directly by the
570 full text search mechanism.
574 The first step is to generate an auxiliary table containing all
575 the unique words in the documents:
578 CREATE TABLE words AS SELECT word FROM
579 ts_stat('SELECT to_tsvector(''simple'', bodytext) FROM documents');
582 where
<structname>documents
</structname> is a table that has a text field
583 <structfield>bodytext
</structfield> that we wish to search. The reason for using
584 the
<literal>simple
</literal> configuration with the
<function>to_tsvector
</function>
585 function, instead of using a language-specific configuration,
586 is that we want a list of the original (unstemmed) words.
590 Next, create a trigram index on the word column:
593 CREATE INDEX words_idx ON words USING GIN (word gin_trgm_ops);
596 Now, a
<command>SELECT
</command> query similar to the previous example can
597 be used to suggest spellings for misspelled words in user search terms.
598 A useful extra test is to require that the selected words are also of
599 similar length to the misspelled word.
604 Since the
<structname>words
</structname> table has been generated as a separate,
605 static table, it will need to be periodically regenerated so that
606 it remains reasonably up-to-date with the document collection.
607 Keeping it exactly current is usually unnecessary.
612 <sect2 id=
"pgtrgm-references">
613 <title>References
</title>
616 GiST Development Site
617 <ulink url=
"http://www.sai.msu.su/~megera/postgres/gist/"></ulink>
620 Tsearch2 Development Site
621 <ulink url=
"http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/"></ulink>
625 <sect2 id=
"pgtrgm-authors">
626 <title>Authors
</title>
629 Oleg Bartunov
<email>oleg@sai.msu.su
</email>, Moscow, Moscow University, Russia
632 Teodor Sigaev
<email>teodor@sigaev.ru
</email>, Moscow, Delta-Soft Ltd.,Russia
635 Alexander Korotkov
<email>a.korotkov@postgrespro.ru
</email>, Moscow, Postgres Professional, Russia
638 Documentation: Christopher Kings-Lynne
641 This module is sponsored by Delta-Soft Ltd., Moscow, Russia.