<!-- doc/src/sgml/textsearch.sgml -->

<chapter id="textsearch">
<title>Full Text Search</title>

<indexterm zone="textsearch">
<primary>full text search</primary>
</indexterm>

<indexterm zone="textsearch">
<primary>text search</primary>
</indexterm>

<sect1 id="textsearch-intro">
<title>Introduction</title>

<para>
Full Text Searching (or just <firstterm>text search</firstterm>) provides
the capability to identify natural-language <firstterm>documents</firstterm> that
satisfy a <firstterm>query</firstterm>, and optionally to sort them by
relevance to the query. The most common type of search
is to find all documents containing given <firstterm>query terms</firstterm>
and return them in order of their <firstterm>similarity</firstterm> to the
query. Notions of <varname>query</varname> and
<varname>similarity</varname> are very flexible and depend on the specific
application. The simplest search considers <varname>query</varname> as a
set of words and <varname>similarity</varname> as the frequency of query
words in the document.
</para>

<para>
Textual search operators have existed in databases for years.
<productname>PostgreSQL</productname> has
<literal>~</literal>, <literal>~*</literal>, <literal>LIKE</literal>, and
<literal>ILIKE</literal> operators for textual data types, but they lack
many essential properties required by modern information systems:
</para>

<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
There is no linguistic support, even for English. Regular expressions
are not sufficient because they cannot easily handle derived words, e.g.,
<literal>satisfies</literal> and <literal>satisfy</literal>. You might
miss documents that contain <literal>satisfies</literal>, although you
probably would like to find them when searching for
<literal>satisfy</literal>. It is possible to use <literal>OR</literal>
to search for multiple derived forms, but this is tedious and error-prone
(some words can have several thousand derivatives).
</para>
</listitem>

<listitem>
<para>
They provide no ordering (ranking) of search results, which makes them
ineffective when thousands of matching documents are found.
</para>
</listitem>

<listitem>
<para>
They tend to be slow because there is no index support, so they must
process all documents for every search.
</para>
</listitem>
</itemizedlist>

<para>
Full text indexing allows documents to be <emphasis>preprocessed</emphasis>
and an index saved for later rapid searching. Preprocessing includes:
</para>

<itemizedlist mark="none">
<listitem>
<para>
<emphasis>Parsing documents into <firstterm>tokens</firstterm></emphasis>. It is
useful to identify various classes of tokens, e.g., numbers, words,
complex words, email addresses, so that they can be processed
differently. In principle token classes depend on the specific
application, but for most purposes it is adequate to use a predefined
set of classes.
<productname>PostgreSQL</productname> uses a <firstterm>parser</firstterm> to
perform this step. A standard parser is provided, and custom parsers
can be created for specific needs.
</para>
</listitem>

<listitem>
<para>
<emphasis>Converting tokens into <firstterm>lexemes</firstterm></emphasis>.
A lexeme is a string, just like a token, but it has been
<firstterm>normalized</firstterm> so that different forms of the same word
are made alike. For example, normalization almost always includes
folding upper-case letters to lower-case, and often involves removal
of suffixes (such as <literal>s</literal> or <literal>es</literal> in English).
This allows searches to find variant forms of the
same word, without tediously entering all the possible variants.
Also, this step typically eliminates <firstterm>stop words</firstterm>, which
are words that are so common that they are useless for searching.
(In short, then, tokens are raw fragments of the document text, while
lexemes are words that are believed useful for indexing and searching.)
<productname>PostgreSQL</productname> uses <firstterm>dictionaries</firstterm> to
perform this step. Various standard dictionaries are provided, and
custom ones can be created for specific needs.
</para>
</listitem>

<listitem>
<para>
<emphasis>Storing preprocessed documents optimized for
searching</emphasis>. For example, each document can be represented
as a sorted array of normalized lexemes. Along with the lexemes it is
often desirable to store positional information to use for
<firstterm>proximity ranking</firstterm>, so that a document that
contains a more <quote>dense</quote> region of query words is
assigned a higher rank than one with scattered query words.
</para>
</listitem>
</itemizedlist>

<para>
Dictionaries allow fine-grained control over how tokens are normalized.
With appropriate dictionaries, you can:
</para>

<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
Define stop words that should not be indexed.
</para>
</listitem>

<listitem>
<para>
Map synonyms to a single word using <application>Ispell</application>.
</para>
</listitem>

<listitem>
<para>
Map phrases to a single word using a thesaurus.
</para>
</listitem>

<listitem>
<para>
Map different variations of a word to a canonical form using
an <application>Ispell</application> dictionary.
</para>
</listitem>

<listitem>
<para>
Map different variations of a word to a canonical form using
<application>Snowball</application> stemmer rules.
</para>
</listitem>
</itemizedlist>

<para>
A data type <type>tsvector</type> is provided for storing preprocessed
documents, along with a type <type>tsquery</type> for representing processed
queries (<xref linkend="datatype-textsearch"/>). There are many
functions and operators available for these data types
(<xref linkend="functions-textsearch"/>), the most important of which is
the match operator <literal>@@</literal>, which we introduce in
<xref linkend="textsearch-matching"/>. Full text searches can be accelerated
using indexes (<xref linkend="textsearch-indexes"/>).
</para>

<sect2 id="textsearch-document">
<title>What Is a Document?</title>

<indexterm zone="textsearch-document">
<primary>document</primary>
<secondary>text search</secondary>
</indexterm>

<para>
A <firstterm>document</firstterm> is the unit of searching in a full text search
system; for example, a magazine article or email message. The text search
engine must be able to parse documents and store associations of lexemes
(key words) with their parent document. Later, these associations are
used to search for documents that contain query words.
</para>

<para>
For searches within <productname>PostgreSQL</productname>,
a document is normally a textual field within a row of a database table,
or possibly a combination (concatenation) of such fields, perhaps stored
in several tables or obtained dynamically. In other words, a document can
be constructed from different parts for indexing and it might not be
stored anywhere as a whole. For example:

<programlisting>
SELECT title || ' ' || author || ' ' || abstract || ' ' || body AS document
FROM messages
WHERE mid = 12;

SELECT m.title || ' ' || m.author || ' ' || m.abstract || ' ' || d.body AS document
FROM messages m, docs d
WHERE m.mid = d.did AND m.mid = 12;
</programlisting>
</para>

<note>
<para>
Actually, in these example queries, <function>coalesce</function>
should be used to prevent a single <literal>NULL</literal> attribute from
causing a <literal>NULL</literal> result for the whole document.
</para>
</note>
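<para>
For example, the first of the two queries above can be made null-safe
like this:

<programlisting>
SELECT coalesce(title, '') || ' ' || coalesce(author, '') || ' ' ||
       coalesce(abstract, '') || ' ' || coalesce(body, '') AS document
FROM messages
WHERE mid = 12;
</programlisting>
</para>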
<para>
Another possibility is to store the documents as simple text files in the
file system. In this case, the database can be used to store the full text
index and to execute searches, and some unique identifier can be used to
retrieve the document from the file system. However, retrieving files
from outside the database requires superuser permissions or special
function support, so this is usually less convenient than keeping all
the data inside <productname>PostgreSQL</productname>. Also, keeping
everything inside the database allows easy access
to document metadata to assist in indexing and display.
</para>

<para>
For text search purposes, each document must be reduced to the
preprocessed <type>tsvector</type> format. Searching and ranking
are performed entirely on the <type>tsvector</type> representation
of a document &mdash; the original text need only be retrieved
when the document has been selected for display to a user.
We therefore often speak of the <type>tsvector</type> as being the
document, but of course it is only a compact representation of
the full document.
</para>
</sect2>

<sect2 id="textsearch-matching">
<title>Basic Text Matching</title>

<para>
Full text searching in <productname>PostgreSQL</productname> is based on
the match operator <literal>@@</literal>, which returns
<literal>true</literal> if a <type>tsvector</type>
(document) matches a <type>tsquery</type> (query).
It doesn't matter which data type is written first:

<programlisting>
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat &amp; rat'::tsquery;
 ?column?
----------
 t

SELECT 'fat &amp; cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
 ?column?
----------
 f
</programlisting>
</para>
<para>
As the above example suggests, a <type>tsquery</type> is not just raw
text, any more than a <type>tsvector</type> is. A <type>tsquery</type>
contains search terms, which must be already-normalized lexemes, and
may combine multiple terms using AND, OR, NOT, and FOLLOWED BY operators.
(For syntax details see <xref linkend="datatype-tsquery"/>.) There are
functions <function>to_tsquery</function>, <function>plainto_tsquery</function>,
and <function>phraseto_tsquery</function>
that are helpful in converting user-written text into a proper
<type>tsquery</type>, primarily by normalizing words appearing in
the text. Similarly, <function>to_tsvector</function> is used to parse and
normalize a document string. So in practice a text search match would
look more like this:
<programlisting>
SELECT to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat &amp; rat');
 ?column?
----------
 t
</programlisting>

Observe that this match would not succeed if written as

<programlisting>
SELECT 'fat cats ate fat rats'::tsvector @@ to_tsquery('fat &amp; rat');
 ?column?
----------
 f
</programlisting>
since here no normalization of the word <literal>rats</literal> will occur.
The elements of a <type>tsvector</type> are lexemes, which are assumed
already normalized, so <literal>rats</literal> does not match <literal>rat</literal>.
</para>

<para>
The <literal>@@</literal> operator also
supports <type>text</type> input, allowing explicit conversion of a text
string to <type>tsvector</type> or <type>tsquery</type> to be skipped
in simple cases. The variants available are:

<programlisting>
tsvector @@ tsquery
tsquery @@ tsvector
text @@ tsquery
text @@ text
</programlisting>
</para>

<para>
The first two of these we saw already.
The form <type>text</type> <literal>@@</literal> <type>tsquery</type>
is equivalent to <literal>to_tsvector(x) @@ y</literal>.
The form <type>text</type> <literal>@@</literal> <type>text</type>
is equivalent to <literal>to_tsvector(x) @@ plainto_tsquery(y)</literal>.
</para>
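<para>
As a small illustration (assuming the default configuration is
<literal>english</literal>, so that the words are normalized before
matching), both of the following return <literal>true</literal>:

<programlisting>
SELECT 'fat cats ate fat rats' @@ to_tsquery('fat &amp; rat');   -- text @@ tsquery
SELECT 'fat cats ate fat rats' @@ 'fat rats'::text;            -- text @@ text
</programlisting>
</para>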
<para>
Within a <type>tsquery</type>, the <literal>&amp;</literal> (AND) operator
specifies that both its arguments must appear in the document to have a
match. Similarly, the <literal>|</literal> (OR) operator specifies that
at least one of its arguments must appear, while the <literal>!</literal> (NOT)
operator specifies that its argument must <emphasis>not</emphasis> appear in
order to have a match.
For example, the query <literal>fat &amp; ! rat</literal> matches documents that
contain <literal>fat</literal> but not <literal>rat</literal>.
</para>
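<para>
For example (again assuming an English-stemming configuration):

<programlisting>
SELECT to_tsvector('fat cats ate rats') @@ to_tsquery('fat &amp; !cow');   -- true
SELECT to_tsvector('fat cats ate rats') @@ to_tsquery('fat &amp; !rat');   -- false
</programlisting>
</para>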
<para>
Searching for phrases is possible with the help of
the <literal>&lt;-&gt;</literal> (FOLLOWED BY) <type>tsquery</type> operator, which
matches only if its arguments have matches that are adjacent and in the
given order. For example:

<programlisting>
SELECT to_tsvector('fatal error') @@ to_tsquery('fatal &lt;-&gt; error');
 ?column?
----------
 t

SELECT to_tsvector('error is not fatal') @@ to_tsquery('fatal &lt;-&gt; error');
 ?column?
----------
 f
</programlisting>
There is a more general version of the FOLLOWED BY operator having the
form <literal>&lt;<replaceable>N</replaceable>&gt;</literal>,
where <replaceable>N</replaceable> is an integer standing for the difference between
the positions of the matching lexemes. <literal>&lt;1&gt;</literal> is
the same as <literal>&lt;-&gt;</literal>, while <literal>&lt;2&gt;</literal>
allows exactly one other lexeme to appear between the matches, and so
on. The <literal>phraseto_tsquery</literal> function makes use of this
operator to construct a <literal>tsquery</literal> that can match a multi-word
phrase when some of the words are stop words. For example:

<programlisting>
SELECT phraseto_tsquery('cats ate rats');
       phraseto_tsquery
-------------------------------
 'cat' &lt;-&gt; 'ate' &lt;-&gt; 'rat'

SELECT phraseto_tsquery('the cats ate the rats');
       phraseto_tsquery
-------------------------------
 'cat' &lt;-&gt; 'ate' &lt;2&gt; 'rat'
</programlisting>
</para>

<para>
A special case that's sometimes useful is that <literal>&lt;0&gt;</literal>
can be used to require that two patterns match the same word.
</para>

<para>
Parentheses can be used to control nesting of the <type>tsquery</type>
operators. Without parentheses, <literal>|</literal> binds least tightly,
then <literal>&amp;</literal>, then <literal>&lt;-&gt;</literal>,
and <literal>!</literal> most tightly.
</para>
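<para>
For example, the following two queries are parsed differently, as the
comments indicate:

<programlisting>
SELECT 'fat &amp; rat | cow'::tsquery;     -- means (fat &amp; rat) | cow
SELECT 'fat &amp; (rat | cow)'::tsquery;   -- parentheses override the default
</programlisting>
</para>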
<para>
It's worth noticing that the AND/OR/NOT operators mean something subtly
different when they are within the arguments of a FOLLOWED BY operator
than when they are not, because within FOLLOWED BY the exact position of
the match is significant. For example, normally <literal>!x</literal> matches
only documents that do not contain <literal>x</literal> anywhere.
But <literal>!x &lt;-&gt; y</literal> matches <literal>y</literal> if it is not
immediately after an <literal>x</literal>; an occurrence of <literal>x</literal>
elsewhere in the document does not prevent a match. Another example is
that <literal>x &amp; y</literal> normally only requires that <literal>x</literal>
and <literal>y</literal> both appear somewhere in the document, but
<literal>(x &amp; y) &lt;-&gt; z</literal> requires <literal>x</literal>
and <literal>y</literal> to match at the same place, immediately before
a <literal>z</literal>. Thus this query behaves differently from
<literal>x &lt;-&gt; z &amp; y &lt;-&gt; z</literal>, which will match a
document containing two separate sequences <literal>x z</literal> and
<literal>y z</literal>. (This specific query is useless as written,
since <literal>x</literal> and <literal>y</literal> could not match at the same place;
but with more complex situations such as prefix-match patterns, a query
of this form could be useful.)
</para>
</sect2>

<sect2 id="textsearch-intro-configurations">
<title>Configurations</title>

<para>
The above are all simple text search examples. As mentioned before, full
text search functionality includes the ability to do many more things:
skip indexing certain words (stop words), process synonyms, and use
sophisticated parsing, e.g., parse based on more than just white space.
This functionality is controlled by <firstterm>text search
configurations</firstterm>. <productname>PostgreSQL</productname> comes with predefined
configurations for many languages, and you can easily create your own
configurations. (<application>psql</application>'s <command>\dF</command> command
shows all available configurations.)
</para>

<para>
During installation an appropriate configuration is selected and
<xref linkend="guc-default-text-search-config"/> is set accordingly
in <filename>postgresql.conf</filename>. If you are using the same text search
configuration for the entire cluster you can use the value in
<filename>postgresql.conf</filename>. To use different configurations
throughout the cluster but the same configuration within any one database,
use <command>ALTER DATABASE ... SET</command>. Otherwise, you can set
<varname>default_text_search_config</varname> in each session.
</para>
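<para>
For example, to inspect the active configuration and override it for
the current session:

<programlisting>
SHOW default_text_search_config;

SET default_text_search_config = 'pg_catalog.english';
</programlisting>
</para>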
<para>
Each text search function that depends on a configuration has an optional
<type>regconfig</type> argument, so that the configuration to use can be
specified explicitly. <varname>default_text_search_config</varname>
is used only when this argument is omitted.
</para>
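<para>
For example, these two calls are equivalent whenever
<varname>default_text_search_config</varname> is set
to <literal>english</literal>:

<programlisting>
SELECT to_tsvector('english', 'The quick brown fox');
SELECT to_tsvector('The quick brown fox');
</programlisting>
</para>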
<para>
To make it easier to build custom text search configurations, a
configuration is built up from simpler database objects.
<productname>PostgreSQL</productname>'s text search facility provides
four types of configuration-related database objects:
</para>

<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
<firstterm>Text search parsers</firstterm> break documents into tokens
and classify each token (for example, as words or numbers).
</para>
</listitem>

<listitem>
<para>
<firstterm>Text search dictionaries</firstterm> convert tokens to normalized
form and reject stop words.
</para>
</listitem>

<listitem>
<para>
<firstterm>Text search templates</firstterm> provide the functions underlying
dictionaries. (A dictionary simply specifies a template and a set
of parameters for the template.)
</para>
</listitem>

<listitem>
<para>
<firstterm>Text search configurations</firstterm> select a parser and a set
of dictionaries to use to normalize the tokens produced by the parser.
</para>
</listitem>
</itemizedlist>

<para>
Text search parsers and templates are built from low-level C functions;
therefore it requires C programming ability to develop new ones, and
superuser privileges to install one into a database. (There are examples
of add-on parsers and templates in the <filename>contrib/</filename> area of the
<productname>PostgreSQL</productname> distribution.) Since dictionaries and
configurations just parameterize and connect together some underlying
parsers and templates, no special privilege is needed to create a new
dictionary or configuration. Examples of creating custom dictionaries and
configurations appear later in this chapter.
</para>

</sect2>

</sect1>

<sect1 id="textsearch-tables">
<title>Tables and Indexes</title>

<para>
The examples in the previous section illustrated full text matching using
simple constant strings. This section shows how to search table data,
optionally using indexes.
</para>

<sect2 id="textsearch-tables-search">
<title>Searching a Table</title>

<para>
It is possible to do a full text search without an index. A simple query
to print the <structname>title</structname> of each row that contains the word
<literal>friend</literal> in its <structfield>body</structfield> field is:

<programlisting>
SELECT title
FROM pgweb
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend');
</programlisting>

This will also find related words such as <literal>friends</literal>
and <literal>friendly</literal>, since all these are reduced to the same
normalized lexeme.
</para>

<para>
The query above specifies that the <literal>english</literal> configuration
is to be used to parse and normalize the strings. Alternatively we
could omit the configuration parameters:

<programlisting>
SELECT title
FROM pgweb
WHERE to_tsvector(body) @@ to_tsquery('friend');
</programlisting>

This query will use the configuration set by <xref
linkend="guc-default-text-search-config"/>.
</para>

<para>
A more complex example is to
select the ten most recent documents that contain <literal>create</literal> and
<literal>table</literal> in the <structname>title</structname> or <structname>body</structname>:

<programlisting>
SELECT title
FROM pgweb
WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('create &amp; table')
ORDER BY last_mod_date DESC
LIMIT 10;
</programlisting>

For clarity we omitted the <function>coalesce</function> function calls
which would be needed to find rows that contain <literal>NULL</literal>
in one of the two fields.
</para>

<para>
Although these queries will work without an index, most applications
will find this approach too slow, except perhaps for occasional ad-hoc
searches. Practical use of text searching usually requires creating
an index.
</para>

</sect2>

<sect2 id="textsearch-tables-index">
<title>Creating Indexes</title>

<para>
We can create a <acronym>GIN</acronym> index (<xref
linkend="textsearch-indexes"/>) to speed up text searches:

<programlisting>
CREATE INDEX pgweb_idx ON pgweb USING GIN (to_tsvector('english', body));
</programlisting>

Notice that the 2-argument version of <function>to_tsvector</function> is
used. Only text search functions that specify a configuration name can
be used in expression indexes (<xref linkend="indexes-expressional"/>).
This is because the index contents must be unaffected by <xref
linkend="guc-default-text-search-config"/>. If they were affected, the
index contents might be inconsistent because different entries could
contain <type>tsvector</type>s that were created with different text search
configurations, and there would be no way to guess which was which. It
would be impossible to dump and restore such an index correctly.
</para>

<para>
Because the two-argument version of <function>to_tsvector</function> was
used in the index above, only a query reference that uses the 2-argument
version of <function>to_tsvector</function> with the same configuration
name will use that index. That is, <literal>WHERE
to_tsvector('english', body) @@ 'a &amp; b'</literal> can use the index,
but <literal>WHERE to_tsvector(body) @@ 'a &amp; b'</literal> cannot.
This ensures that an index will be used only with the same configuration
used to create the index entries.
</para>

<para>
It is possible to set up more complex expression indexes wherein the
configuration name is specified by another column, e.g.:

<programlisting>
CREATE INDEX pgweb_idx ON pgweb USING GIN (to_tsvector(config_name, body));
</programlisting>

where <literal>config_name</literal> is a column in the <literal>pgweb</literal>
table. This allows mixed configurations in the same index while
recording which configuration was used for each index entry. This
would be useful, for example, if the document collection contained
documents in different languages. Again,
queries that are meant to use the index must be phrased to match, e.g.,
<literal>WHERE to_tsvector(config_name, body) @@ 'a &amp; b'</literal>.
</para>

<para>
Indexes can even concatenate columns:

<programlisting>
CREATE INDEX pgweb_idx ON pgweb USING GIN (to_tsvector('english', title || ' ' || body));
</programlisting>
</para>

<para>
Another approach is to create a separate <type>tsvector</type> column
to hold the output of <function>to_tsvector</function>. To keep this
column automatically up to date with its source data, use a stored
generated column. This example is a
concatenation of <literal>title</literal> and <literal>body</literal>,
using <function>coalesce</function> to ensure that one field will still be
indexed when the other is <literal>NULL</literal>:

<programlisting>
ALTER TABLE pgweb
    ADD COLUMN textsearchable_index_col tsvector
    GENERATED ALWAYS AS (to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))) STORED;
</programlisting>

Then we create a <acronym>GIN</acronym> index to speed up the search:

<programlisting>
CREATE INDEX textsearch_idx ON pgweb USING GIN (textsearchable_index_col);
</programlisting>

Now we are ready to perform a fast full text search:

<programlisting>
SELECT title
FROM pgweb
WHERE textsearchable_index_col @@ to_tsquery('create &amp; table')
ORDER BY last_mod_date DESC
LIMIT 10;
</programlisting>
</para>

<para>
One advantage of the separate-column approach over an expression index
is that it is not necessary to explicitly specify the text search
configuration in queries in order to make use of the index. As shown
in the example above, the query can depend on
<varname>default_text_search_config</varname>. Another advantage is that
searches will be faster, since it will not be necessary to redo the
<function>to_tsvector</function> calls to verify index matches. (This is more
important when using a GiST index than a GIN index; see <xref
linkend="textsearch-indexes"/>.) The expression-index approach is
simpler to set up, however, and it requires less disk space since the
<type>tsvector</type> representation is not stored explicitly.
</para>

</sect2>
</sect1>

<sect1 id="textsearch-controls">
<title>Controlling Text Search</title>

<para>
To implement full text searching there must be a function to create a
<type>tsvector</type> from a document and a <type>tsquery</type> from a
user query. Also, we need to return results in a useful order, so we need
a function that compares documents with respect to their relevance to
the query. It's also important to be able to display the results nicely.
<productname>PostgreSQL</productname> provides support for all of these
functions.
</para>

<sect2 id="textsearch-parsing-documents">
<title>Parsing Documents</title>

<para>
<productname>PostgreSQL</productname> provides the
function <function>to_tsvector</function> for converting a document to
the <type>tsvector</type> data type.
</para>

<indexterm>
<primary>to_tsvector</primary>
</indexterm>

<synopsis>
to_tsvector(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">document</replaceable> <type>text</type>) returns <type>tsvector</type>
</synopsis>

<para>
<function>to_tsvector</function> parses a textual document into tokens,
reduces the tokens to lexemes, and returns a <type>tsvector</type> which
lists the lexemes together with their positions in the document.
The document is processed according to the specified or default
text search configuration.
Here is a simple example:

<screen>
SELECT to_tsvector('english', 'a fat cat sat on a mat - it ate a fat rats');
                  to_tsvector
-----------------------------------------------------
 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
</screen>
</para>

<para>
In the example above we see that the resulting <type>tsvector</type> does not
contain the words <literal>a</literal>, <literal>on</literal>, or
<literal>it</literal>, the word <literal>rats</literal> became
<literal>rat</literal>, and the punctuation sign <literal>-</literal> was
ignored.
</para>

<para>
The <function>to_tsvector</function> function internally calls a parser
which breaks the document text into tokens and assigns a type to
each token. For each token, a list of
dictionaries (<xref linkend="textsearch-dictionaries"/>) is consulted,
where the list can vary depending on the token type. The first dictionary
that <firstterm>recognizes</firstterm> the token emits one or more normalized
<firstterm>lexemes</firstterm> to represent the token. For example,
<literal>rats</literal> became <literal>rat</literal> because one of the
dictionaries recognized that the word <literal>rats</literal> is a plural
form of <literal>rat</literal>. Some words are recognized as
<firstterm>stop words</firstterm> (<xref linkend="textsearch-stopwords"/>), which
causes them to be ignored since they occur too frequently to be useful in
searching. In our example these are
<literal>a</literal>, <literal>on</literal>, and <literal>it</literal>.
If no dictionary in the list recognizes the token then it is also ignored.
In this example that happened to the punctuation sign <literal>-</literal>
because there are in fact no dictionaries assigned for its token type
(<literal>Space symbols</literal>), meaning space tokens will never be
indexed. The choices of parser, dictionaries and which types of tokens to
index are determined by the selected text search configuration (<xref
linkend="textsearch-configuration"/>). It is possible to have
many different configurations in the same database, and predefined
configurations are available for various languages. In our example
we used the default configuration <literal>english</literal> for the
English language.
</para>

<para>
The function <function>setweight</function> can be used to label the
entries of a <type>tsvector</type> with a given <firstterm>weight</firstterm>,
where a weight is one of the letters <literal>A</literal>, <literal>B</literal>,
<literal>C</literal>, or <literal>D</literal>.
This is typically used to mark entries coming from
different parts of a document, such as title versus body. Later, this
information can be used for ranking of search results.
</para>

<para>
Because <function>to_tsvector</function>(<literal>NULL</literal>) will
return <literal>NULL</literal>, it is recommended to use
<function>coalesce</function> whenever a field might be null.
Here is the recommended method for creating
a <type>tsvector</type> from a structured document:

<programlisting>
UPDATE tt SET ti =
    setweight(to_tsvector(coalesce(title,'')), 'A')    ||
    setweight(to_tsvector(coalesce(keyword,'')), 'B')  ||
    setweight(to_tsvector(coalesce(abstract,'')), 'C') ||
    setweight(to_tsvector(coalesce(body,'')), 'D');
</programlisting>

Here we have used <function>setweight</function> to label the source
of each lexeme in the finished <type>tsvector</type>, and then merged
the labeled <type>tsvector</type> values using the <type>tsvector</type>
concatenation operator <literal>||</literal>. (<xref
linkend="textsearch-manipulate-tsvector"/> gives details about these
operations.)
</para>

</sect2>
<sect2 id="textsearch-parsing-queries">
<title>Parsing Queries</title>

<para>
<productname>PostgreSQL</productname> provides the
functions <function>to_tsquery</function>,
<function>plainto_tsquery</function>,
<function>phraseto_tsquery</function> and
<function>websearch_to_tsquery</function>
for converting a query to the <type>tsquery</type> data type.
<function>to_tsquery</function> offers access to more features
than either <function>plainto_tsquery</function> or
<function>phraseto_tsquery</function>, but it is less forgiving about its
input. <function>websearch_to_tsquery</function> is a simplified version
of <function>to_tsquery</function> with an alternative syntax, similar
to the one used by web search engines.
</para>

<indexterm>
<primary>to_tsquery</primary>
</indexterm>

<synopsis>
to_tsquery(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">querytext</replaceable> <type>text</type>) returns <type>tsquery</type>
</synopsis>

<para>
<function>to_tsquery</function> creates a <type>tsquery</type> value from
<replaceable>querytext</replaceable>, which must consist of single tokens
separated by the <type>tsquery</type> operators <literal>&amp;</literal> (AND),
<literal>|</literal> (OR), <literal>!</literal> (NOT), and
<literal>&lt;-&gt;</literal> (FOLLOWED BY), possibly grouped
using parentheses. In other words, the input to
<function>to_tsquery</function> must already follow the general rules for
<type>tsquery</type> input, as described in <xref
linkend="datatype-tsquery"/>. The difference is that while basic
<type>tsquery</type> input takes the tokens at face value,
<function>to_tsquery</function> normalizes each token into a lexeme using
the specified or default configuration, and discards any tokens that are
stop words according to the configuration. For example:

<screen>
SELECT to_tsquery('english', 'The &amp; Fat &amp; Rats');
  to_tsquery
---------------
 'fat' &amp; 'rat'
</screen>

As in basic <type>tsquery</type> input, weight(s) can be attached to each
lexeme to restrict it to match only <type>tsvector</type> lexemes of those
weight(s). For example:

<screen>
SELECT to_tsquery('english', 'Fat | Rats:AB');
    to_tsquery
------------------
 'fat' | 'rat':AB
</screen>

Also, <literal>*</literal> can be attached to a lexeme to specify prefix matching:

<screen>
SELECT to_tsquery('supern:*A &amp; star:A*B');
        to_tsquery
--------------------------
 'supern':*A &amp; 'star':*AB
</screen>

Such a lexeme will match any word in a <type>tsvector</type> that begins
with the given string.
</para>

<para>
<function>to_tsquery</function> can also accept single-quoted
phrases. This is primarily useful when the configuration includes a
thesaurus dictionary that may trigger on such phrases.
In the example below, a thesaurus contains the rule <literal>supernovae
stars : sn</literal>:

<screen>
SELECT to_tsquery('''supernovae stars'' &amp; !crab');
  to_tsquery
---------------
 'sn' &amp; !'crab'
</screen>

Without quotes, <function>to_tsquery</function> will generate a syntax
error for tokens that are not separated by an AND, OR, or FOLLOWED BY
operator.
</para>
<indexterm>
<primary>plainto_tsquery</primary>
</indexterm>

<synopsis>
plainto_tsquery(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">querytext</replaceable> <type>text</type>) returns <type>tsquery</type>
</synopsis>

<para>
<function>plainto_tsquery</function> transforms the unformatted text
<replaceable>querytext</replaceable> to a <type>tsquery</type> value.
The text is parsed and normalized much as for <function>to_tsvector</function>,
then the <literal>&amp;</literal> (AND) <type>tsquery</type> operator is
inserted between surviving words.
</para>

<para>
Example:

<screen>
SELECT plainto_tsquery('english', 'The Fat Rats');
 plainto_tsquery
-----------------
 'fat' &amp; 'rat'
</screen>

Note that <function>plainto_tsquery</function> will not
recognize <type>tsquery</type> operators, weight labels,
or prefix-match labels in its input:

<screen>
SELECT plainto_tsquery('english', 'The Fat &amp; Rats:C');
   plainto_tsquery
---------------------
 'fat' &amp; 'rat' &amp; 'c'
</screen>

Here, all the input punctuation was discarded.
</para>

<indexterm>
<primary>phraseto_tsquery</primary>
</indexterm>

<synopsis>
phraseto_tsquery(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">querytext</replaceable> <type>text</type>) returns <type>tsquery</type>
</synopsis>

<para>
<function>phraseto_tsquery</function> behaves much like
<function>plainto_tsquery</function>, except that it inserts
the <literal>&lt;-&gt;</literal> (FOLLOWED BY) operator between
surviving words instead of the <literal>&amp;</literal> (AND) operator.
Also, stop words are not simply discarded, but are accounted for by
inserting <literal>&lt;<replaceable>N</replaceable>&gt;</literal> operators rather
than <literal>&lt;-&gt;</literal> operators. This function is useful
when searching for exact lexeme sequences, since the FOLLOWED BY
operators check lexeme order not just the presence of all the lexemes.
</para>

<para>
Example:

<screen>
SELECT phraseto_tsquery('english', 'The Fat Rats');
 phraseto_tsquery
------------------
 'fat' &lt;-&gt; 'rat'
</screen>

Like <function>plainto_tsquery</function>, the
<function>phraseto_tsquery</function> function will not
recognize <type>tsquery</type> operators, weight labels,
or prefix-match labels in its input:

<screen>
SELECT phraseto_tsquery('english', 'The Fat &amp; Rats:C');
      phraseto_tsquery
-----------------------------
 'fat' &lt;-&gt; 'rat' &lt;-&gt; 'c'
</screen>
</para>
<synopsis>
websearch_to_tsquery(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">querytext</replaceable> <type>text</type>) returns <type>tsquery</type>
</synopsis>

<para>
<function>websearch_to_tsquery</function> creates a <type>tsquery</type>
value from <replaceable>querytext</replaceable> using an alternative
syntax in which simple unformatted text is a valid query.
Unlike <function>plainto_tsquery</function>
and <function>phraseto_tsquery</function>, it also recognizes certain
operators. Moreover, this function will never raise syntax errors,
which makes it possible to use raw user-supplied input for search.
The following syntax is supported:

<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
<literal>unquoted text</literal>: text not inside quote marks will be
converted to terms separated by <literal>&amp;</literal> operators, as
if processed by <function>plainto_tsquery</function>.
</para>
</listitem>
<listitem>
<para>
<literal>"quoted text"</literal>: text inside quote marks will be
converted to terms separated by <literal>&lt;-&gt;</literal>
operators, as if processed by <function>phraseto_tsquery</function>.
</para>
</listitem>
<listitem>
<para>
<literal>OR</literal>: the word <quote>or</quote> will be converted to
the <literal>|</literal> operator.
</para>
</listitem>
<listitem>
<para>
<literal>-</literal>: a dash will be converted to
the <literal>!</literal> operator.
</para>
</listitem>
</itemizedlist>

Other punctuation is ignored. So
like <function>plainto_tsquery</function>
and <function>phraseto_tsquery</function>,
the <function>websearch_to_tsquery</function> function will not
recognize <type>tsquery</type> operators, weight labels, or prefix-match
labels in its input.
</para>

<para>
Examples:
<screen>
SELECT websearch_to_tsquery('english', 'The fat rats');
 websearch_to_tsquery
----------------------
 'fat' &amp; 'rat'
(1 row)

SELECT websearch_to_tsquery('english', '"supernovae stars" -crab');
       websearch_to_tsquery
----------------------------------
 'supernova' &lt;-&gt; 'star' &amp; !'crab'
(1 row)

SELECT websearch_to_tsquery('english', '"sad cat" or "fat rat"');
       websearch_to_tsquery
-----------------------------------
 'sad' &lt;-&gt; 'cat' | 'fat' &lt;-&gt; 'rat'
(1 row)

SELECT websearch_to_tsquery('english', 'signal -"segmentation fault"');
         websearch_to_tsquery
---------------------------------------
 'signal' &amp; !( 'segment' &lt;-&gt; 'fault' )
(1 row)

SELECT websearch_to_tsquery('english', '""" )( dummy \\ query &lt;-&gt;');
 websearch_to_tsquery
----------------------
 'dummi' &amp; 'queri'
(1 row)
</screen>
</para>
</sect2>
<sect2 id="textsearch-ranking">
<title>Ranking Search Results</title>

<para>
Ranking attempts to measure how relevant documents are to a particular
query, so that when there are many matches the most relevant ones can be
shown first. <productname>PostgreSQL</productname> provides two
predefined ranking functions, which take into account lexical, proximity,
and structural information; that is, they consider how often the query
terms appear in the document, how close together the terms are in the
document, and how important is the part of the document where they occur.
However, the concept of relevancy is vague and very application-specific.
Different applications might require additional information for ranking,
e.g., document modification time. The built-in ranking functions are only
examples. You can write your own ranking functions and/or combine their
results with additional factors to fit your specific needs.
</para>

<para>
The two ranking functions currently available are:

<variablelist>

<varlistentry>

<term>
<indexterm>
<primary>ts_rank</primary>
</indexterm>

<literal>ts_rank(<optional> <replaceable class="parameter">weights</replaceable> <type>float4[]</type>, </optional> <replaceable class="parameter">vector</replaceable> <type>tsvector</type>, <replaceable class="parameter">query</replaceable> <type>tsquery</type> <optional>, <replaceable class="parameter">normalization</replaceable> <type>integer</type> </optional>) returns <type>float4</type></literal>
</term>

<listitem>
<para>
Ranks vectors based on the frequency of their matching lexemes.
</para>
</listitem>
</varlistentry>

<varlistentry>

<term>
<indexterm>
<primary>ts_rank_cd</primary>
</indexterm>

<literal>ts_rank_cd(<optional> <replaceable class="parameter">weights</replaceable> <type>float4[]</type>, </optional> <replaceable class="parameter">vector</replaceable> <type>tsvector</type>, <replaceable class="parameter">query</replaceable> <type>tsquery</type> <optional>, <replaceable class="parameter">normalization</replaceable> <type>integer</type> </optional>) returns <type>float4</type></literal>
</term>

<listitem>
<para>
This function computes the <firstterm>cover density</firstterm>
ranking for the given document vector and query, as described in
Clarke, Cormack, and Tudhope's "Relevance Ranking for One to Three
Term Queries" in the journal "Information Processing and Management",
1999. Cover density is similar to <function>ts_rank</function> ranking
except that the proximity of matching lexemes to each other is
taken into consideration.
</para>

<para>
This function requires lexeme positional information to perform
its calculation. Therefore, it ignores any <quote>stripped</quote>
lexemes in the <type>tsvector</type>. If there are no unstripped
lexemes in the input, the result will be zero. (See <xref
linkend="textsearch-manipulate-tsvector"/> for more information
about the <function>strip</function> function and positional information
in <type>tsvector</type>s.)
</para>
</listitem>
</varlistentry>

</variablelist>
</para>
<para>
For both these functions,
the optional <replaceable class="parameter">weights</replaceable>
argument offers the ability to weigh word instances more or less
heavily depending on how they are labeled. The weight arrays specify
how heavily to weigh each category of word, in the order:

<synopsis>
{D-weight, C-weight, B-weight, A-weight}
</synopsis>

If no <replaceable class="parameter">weights</replaceable> are provided,
then these defaults are used:

<programlisting>
{0.1, 0.2, 0.4, 1.0}
</programlisting>

Typically weights are used to mark words from special areas of the
document, like the title or an initial abstract, so they can be
treated with more or less importance than words in the document body.
</para>
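<para>
For example, this sketch passes a custom weight array that doubles the
default <literal>A</literal> weight, so that matches in the
<literal>A</literal>-labeled title count twice as heavily as they
would by default (the document here is built inline purely for
illustration):

<programlisting>
SELECT ts_rank('{0.1, 0.2, 0.4, 2.0}',
               setweight(to_tsvector('english', 'The Dark Matter Problem'), 'A') ||
               to_tsvector('english', 'some body text about galaxies'),
               to_tsquery('english', 'dark &amp; matter'));
</programlisting>
</para>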
<para>
Since a longer document has a greater chance of containing a query term
it is reasonable to take into account document size, e.g., a hundred-word
document with five instances of a search word is probably more relevant
than a thousand-word document with five instances. Both ranking functions
take an integer <replaceable>normalization</replaceable> option that
specifies whether and how a document's length should impact its rank.
The integer option controls several behaviors, so it is a bit mask:
you can specify one or more behaviors using
<literal>|</literal> (for example, <literal>2|4</literal>).

<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
0 (the default) ignores the document length
</para>
</listitem>
<listitem>
<para>
1 divides the rank by 1 + the logarithm of the document length
</para>
</listitem>
<listitem>
<para>
2 divides the rank by the document length
</para>
</listitem>
<listitem>
<para>
4 divides the rank by the mean harmonic distance between extents
(this is implemented only by <function>ts_rank_cd</function>)
</para>
</listitem>
<listitem>
<para>
8 divides the rank by the number of unique words in the document
</para>
</listitem>
<listitem>
<para>
16 divides the rank by 1 + the logarithm of the number
of unique words in the document
</para>
</listitem>
<listitem>
<para>
32 divides the rank by itself + 1
</para>
</listitem>
</itemizedlist>

If more than one flag bit is specified, the transformations are
applied in the order listed.
</para>

<para>
It is important to note that the ranking functions do not use any global
information, so it is impossible to produce a fair normalization to 1% or
100% as sometimes desired. Normalization option 32
(<literal>rank/(rank+1)</literal>) can be applied to scale all ranks
into the range zero to one, but of course this is just a cosmetic change;
it will not affect the ordering of the search results.
</para>

<para>
Here is an example that selects only the ten highest-ranked matches:

<screen>
SELECT title, ts_rank_cd(textsearch, query) AS rank
FROM apod, to_tsquery('neutrino|(dark &amp; matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
                     title                     |   rank
-----------------------------------------------+----------
 Neutrinos in the Sun                          |      3.1
 The Sudbury Neutrino Detector                 |      2.4
 A MACHO View of Galactic Dark Matter          |  2.01317
 Hot Gas and Dark Matter                       |  1.91171
 The Virgo Cluster: Hot Plasma and Dark Matter |  1.90953
 Rafting for Solar Neutrinos                   |      1.9
 NGC 4650A: Strange Galaxy and Dark Matter     |  1.85774
 Hot Gas and Dark Matter                       |   1.6123
 Ice Fishing for Cosmic Neutrinos              |      1.6
 Weak Lensing Distorts the Universe            | 0.818218
</screen>

This is the same example using normalized ranking:

<screen>
SELECT title, ts_rank_cd(textsearch, query, 32 /* rank/(rank+1) */ ) AS rank
FROM apod, to_tsquery('neutrino|(dark &amp; matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
                     title                     |        rank
-----------------------------------------------+-------------------
 Neutrinos in the Sun                          | 0.756097569485493
 The Sudbury Neutrino Detector                 | 0.705882361190954
 A MACHO View of Galactic Dark Matter          | 0.668123210574724
 Hot Gas and Dark Matter                       |  0.65655958650282
 The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973
 Rafting for Solar Neutrinos                   | 0.655172410958162
 NGC 4650A: Strange Galaxy and Dark Matter     | 0.650072921219637
 Hot Gas and Dark Matter                       | 0.617195790024749
 Ice Fishing for Cosmic Neutrinos              | 0.615384618911517
 Weak Lensing Distorts the Universe            | 0.450010798361481
</screen>
</para>

<para>
Ranking can be expensive since it requires consulting the
<type>tsvector</type> of each matching document, which can be I/O bound and
therefore slow. Unfortunately, it is almost impossible to avoid since
practical queries often result in large numbers of matches.
</para>

</sect2>
<sect2 id="textsearch-headline">
<title>Highlighting Results</title>

<para>
To present search results it is ideal to show a part of each document and
how it is related to the query. Usually, search engines show fragments of
the document with marked search terms. <productname>PostgreSQL</productname>
provides a function <function>ts_headline</function> that
implements this functionality.
</para>

<indexterm>
<primary>ts_headline</primary>
</indexterm>

<synopsis>
ts_headline(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">document</replaceable> <type>text</type>, <replaceable class="parameter">query</replaceable> <type>tsquery</type> <optional>, <replaceable class="parameter">options</replaceable> <type>text</type> </optional>) returns <type>text</type>
</synopsis>

<para>
<function>ts_headline</function> accepts a document along
with a query, and returns an excerpt from
the document in which terms from the query are highlighted.
Specifically, the function will use the query to select relevant
text fragments, and then highlight all words that appear in the query,
even if those word positions do not match the query's restrictions. The
configuration to be used to parse the document can be specified by
<replaceable>config</replaceable>; if <replaceable>config</replaceable>
is omitted, the
<varname>default_text_search_config</varname> configuration is used.
</para>

<para>
If an <replaceable>options</replaceable> string is specified it must
consist of a comma-separated list of one or more
<replaceable>option</replaceable><literal>=</literal><replaceable>value</replaceable> pairs.
The available options are:

<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
<literal>MaxWords</literal>, <literal>MinWords</literal> (integers):
these numbers determine the longest and shortest headlines to output.
The default values are 35 and 15.
</para>
</listitem>
<listitem>
<para>
<literal>ShortWord</literal> (integer): words of this length or less
will be dropped at the start and end of a headline, unless they are
query terms. The default value of three eliminates common English
articles.
</para>
</listitem>
<listitem>
<para>
<literal>HighlightAll</literal> (boolean): if
<literal>true</literal> the whole document will be used as the
headline, ignoring the preceding three parameters. The default
is <literal>false</literal>.
</para>
</listitem>
<listitem>
<para>
<literal>MaxFragments</literal> (integer): maximum number of text
fragments to display. The default value of zero selects a
non-fragment-based headline generation method. A value greater
than zero selects fragment-based headline generation (see below).
</para>
</listitem>
<listitem>
<para>
<literal>StartSel</literal>, <literal>StopSel</literal> (strings):
the strings with which to delimit query words appearing in the
document, to distinguish them from other excerpted words. The
default values are <quote><literal>&lt;b&gt;</literal></quote> and
<quote><literal>&lt;/b&gt;</literal></quote>, which can be suitable
for HTML output.
</para>
</listitem>
<listitem>
<para>
<literal>FragmentDelimiter</literal> (string): When more than one
fragment is displayed, the fragments will be separated by this string.
The default is <quote><literal> ... </literal></quote>.
</para>
</listitem>
</itemizedlist>

These option names are recognized case-insensitively.
You must double-quote string values if they contain spaces or commas.
</para>

<para>
In non-fragment-based headline
generation, <function>ts_headline</function> locates matches for the
given <replaceable class="parameter">query</replaceable> and chooses a
single one to display, preferring matches that have more query words
within the allowed headline length.
In fragment-based headline generation, <function>ts_headline</function>
locates the query matches and splits each match
into <quote>fragments</quote> of no more than <literal>MaxWords</literal>
words each, preferring fragments with more query words, and when
possible <quote>stretching</quote> fragments to include surrounding
words. The fragment-based mode is thus more useful when the query
matches span large sections of the document, or when it's desirable to
display multiple matches.
In either mode, if no query matches can be identified, then a single
fragment of the first <literal>MinWords</literal> words in the document
will be displayed.
</para>

<para>
For example:

<screen>
SELECT ts_headline('english',
  'The most common type of search
is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
  to_tsquery('english', 'query &amp; similarity'));
                        ts_headline
------------------------------------------------------------
 containing given &lt;b&gt;query&lt;/b&gt; terms                       +
 and return them in order of their &lt;b&gt;similarity&lt;/b&gt; to the+
 &lt;b&gt;query&lt;/b&gt;.

SELECT ts_headline('english',
  'Search terms may occur
many times in a document,
requiring ranking of the search matches to decide which
occurrences to display in the result.',
  to_tsquery('english', 'search &amp; term'),
  'MaxFragments=10, MaxWords=7, MinWords=3, StartSel=&lt;&lt;, StopSel=&gt;&gt;');
                        ts_headline
------------------------------------------------------------
 &lt;&lt;Search&gt;&gt; &lt;&lt;terms&gt;&gt; may occur                            +
 many times ... ranking of the &lt;&lt;search&gt;&gt; matches to decide
</screen>
</para>

<para>
<function>ts_headline</function> uses the original document, not a
<type>tsvector</type> summary, so it can be slow and should be used with
care.
</para>
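<para>
One common way to limit the cost is to rank and select the matching
rows first, and run <function>ts_headline</function> over only that
small result set. A sketch, reusing the <literal>apod</literal> table
from the ranking examples and assuming it also has <literal>id</literal>
and <literal>body</literal> columns:

<programlisting>
SELECT id, ts_headline(body, q), rank
FROM (SELECT id, body, q, ts_rank_cd(textsearch, q) AS rank
      FROM apod, to_tsquery('neutrino') q
      WHERE q @@ textsearch
      ORDER BY rank DESC
      LIMIT 10) AS foo;
</programlisting>
</para>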
1416 </sect2>
1418 </sect1>
1420 <sect1 id="textsearch-features">
1421 <title>Additional Features</title>
1423 <para>
1424 This section describes additional functions and operators that are
1425 useful in connection with text search.
1426 </para>
1428 <sect2 id="textsearch-manipulate-tsvector">
1429 <title>Manipulating Documents</title>
1431 <para>
1432 <xref linkend="textsearch-parsing-documents"/> showed how raw textual
1433 documents can be converted into <type>tsvector</type> values.
1434 <productname>PostgreSQL</productname> also provides functions and
1435 operators that can be used to manipulate documents that are already
1436 in <type>tsvector</type> form.
1437 </para>
1439 <variablelist>
1441 <varlistentry>
1443 <term>
1444 <indexterm>
1445 <primary>tsvector concatenation</primary>
1446 </indexterm>
1448 <literal><type>tsvector</type> || <type>tsvector</type></literal>
1449 </term>
1451 <listitem>
1452 <para>
1453 The <type>tsvector</type> concatenation operator
1454 returns a vector which combines the lexemes and positional information
1455 of the two vectors given as arguments. Positions and weight labels
1456 are retained during the concatenation.
1457 Positions appearing in the right-hand vector are offset by the largest
1458 position mentioned in the left-hand vector, so that the result is
1459 nearly equivalent to the result of performing <function>to_tsvector</function>
1460 on the concatenation of the two original document strings. (The
1461 equivalence is not exact, because any stop-words removed from the
1462 end of the left-hand argument will not affect the result, whereas
1463 they would have affected the positions of the lexemes in the
1464 right-hand argument if textual concatenation were used.)
1465 </para>
1467 <para>
1468 One advantage of using concatenation in the vector form, rather than
1469 concatenating text before applying <function>to_tsvector</function>, is that
1470 you can use different configurations to parse different sections
1471 of the document. Also, because the <function>setweight</function> function
1472 marks all lexemes of the given vector the same way, it is necessary
1473 to parse the text and do <function>setweight</function> before concatenating
1474 if you want to label different parts of the document with different
1475 weights.
1476 </para>
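<para>
For example (assuming the <literal>english</literal> configuration;
the right-hand vector's positions are offset by 2, the largest
position in the left-hand vector):

<screen>
SELECT to_tsvector('english', 'fat cat') || to_tsvector('english', 'ate rat');
?column?
---------------------------------
'ate':3 'cat':2 'fat':1 'rat':4
</screen>
</para>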
1477 </listitem>
1478 </varlistentry>
1480 <varlistentry>
1482 <term>
1483 <indexterm>
1484 <primary>setweight</primary>
1485 </indexterm>
1487 <literal>setweight(<replaceable class="parameter">vector</replaceable> <type>tsvector</type>, <replaceable class="parameter">weight</replaceable> <type>"char"</type>) returns <type>tsvector</type></literal>
1488 </term>
1490 <listitem>
1491 <para>
1492 <function>setweight</function> returns a copy of the input vector in which every
1493 position has been labeled with the given <replaceable>weight</replaceable>, either
1494 <literal>A</literal>, <literal>B</literal>, <literal>C</literal>, or
1495 <literal>D</literal>. (<literal>D</literal> is the default for new
1496 vectors and as such is not displayed on output.) These labels are
1497 retained when vectors are concatenated, allowing words from different
1498 parts of a document to be weighted differently by ranking functions.
1499 </para>
1501 <para>
1502 Note that weight labels apply to <emphasis>positions</emphasis>, not
1503 <emphasis>lexemes</emphasis>. If the input vector has been stripped of
1504 positions then <function>setweight</function> does nothing.
1505 </para>
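<para>
For example (again assuming the <literal>english</literal>
configuration):

<screen>
SELECT setweight(to_tsvector('english', 'fat cat'), 'A');
setweight
-------------------
'cat':2A 'fat':1A
</screen>
</para>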
1506 </listitem>
1507 </varlistentry>
1509 <varlistentry>
1510 <term>
1511 <indexterm>
1512 <primary>length(tsvector)</primary>
1513 </indexterm>
1515 <literal>length(<replaceable class="parameter">vector</replaceable> <type>tsvector</type>) returns <type>integer</type></literal>
1516 </term>
1518 <listitem>
1519 <para>
1520 Returns the number of lexemes stored in the vector.
1521 </para>
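<para>
For example:

<screen>
SELECT length('fat:2,4 cat:3 rat:5A'::tsvector);
length
--------
3
</screen>
</para>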
1522 </listitem>
1523 </varlistentry>
1525 <varlistentry>
1527 <term>
1528 <indexterm>
1529 <primary>strip</primary>
1530 </indexterm>
1532 <literal>strip(<replaceable class="parameter">vector</replaceable> <type>tsvector</type>) returns <type>tsvector</type></literal>
1533 </term>
1535 <listitem>
1536 <para>
1537 Returns a vector that lists the same lexemes as the given vector, but
1538 lacks any position or weight information. The result is usually much
1539 smaller than an unstripped vector, but it is also less useful.
1540 Relevance ranking does not work as well on stripped vectors as
1541 unstripped ones. Also,
1542 the <literal>&lt;-&gt;</literal> (FOLLOWED BY) <type>tsquery</type> operator
1543 will never match stripped input, since it cannot determine the
1544 distance between lexeme occurrences.
1545 </para>
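<para>
For example:

<screen>
SELECT strip('fat:2,4 cat:3 rat:5A'::tsvector);
strip
-------------------
'cat' 'fat' 'rat'
</screen>
</para>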
1546 </listitem>
1548 </varlistentry>
1550 </variablelist>
1552 <para>
1553 A full list of <type>tsvector</type>-related functions is available
1554 in <xref linkend="textsearch-functions-table"/>.
1555 </para>
1557 </sect2>
1559 <sect2 id="textsearch-manipulate-tsquery">
1560 <title>Manipulating Queries</title>
1562 <para>
1563 <xref linkend="textsearch-parsing-queries"/> showed how raw textual
1564 queries can be converted into <type>tsquery</type> values.
1565 <productname>PostgreSQL</productname> also provides functions and
1566 operators that can be used to manipulate queries that are already
1567 in <type>tsquery</type> form.
1568 </para>
1570 <variablelist>
1572 <varlistentry>
1574 <term>
1575 <literal><type>tsquery</type> &amp;&amp; <type>tsquery</type></literal>
1576 </term>
1578 <listitem>
1579 <para>
1580 Returns the AND-combination of the two given queries.
1581 </para>
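<para>
For example:

<screen>
SELECT 'fat | rat'::tsquery &amp;&amp; 'cat'::tsquery;
?column?
---------------------------
( 'fat' | 'rat' ) &amp; 'cat'
</screen>
</para>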
1582 </listitem>
1584 </varlistentry>
1586 <varlistentry>
1588 <term>
1589 <literal><type>tsquery</type> || <type>tsquery</type></literal>
1590 </term>
1592 <listitem>
1593 <para>
1594 Returns the OR-combination of the two given queries.
1595 </para>
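<para>
For example:

<screen>
SELECT 'fat | rat'::tsquery || 'cat'::tsquery;
?column?
---------------------------
( 'fat' | 'rat' ) | 'cat'
</screen>
</para>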
1596 </listitem>
1598 </varlistentry>
1600 <varlistentry>
1602 <term>
1603 <literal>!! <type>tsquery</type></literal>
1604 </term>
1606 <listitem>
1607 <para>
1608 Returns the negation (NOT) of the given query.
1609 </para>
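<para>
For example:

<screen>
SELECT !! 'cat'::tsquery;
?column?
----------
!'cat'
</screen>
</para>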
1610 </listitem>
1612 </varlistentry>
1614 <varlistentry>
1616 <term>
1617 <literal><type>tsquery</type> &lt;-&gt; <type>tsquery</type></literal>
1618 </term>
1620 <listitem>
1621 <para>
1622 Returns a query that searches for a match to the first given query
1623 immediately followed by a match to the second given query, using
1624 the <literal>&lt;-&gt;</literal> (FOLLOWED BY)
1625 <type>tsquery</type> operator. For example:
1627 <screen>
1628 SELECT to_tsquery('fat') &lt;-&gt; to_tsquery('cat | rat');
1629 ?column?
1630 ----------------------------
1631 'fat' &lt;-&gt; ( 'cat' | 'rat' )
1632 </screen>
1633 </para>
1634 </listitem>
1636 </varlistentry>
1638 <varlistentry>
1640 <term>
1641 <indexterm>
1642 <primary>tsquery_phrase</primary>
1643 </indexterm>
1645 <literal>tsquery_phrase(<replaceable class="parameter">query1</replaceable> <type>tsquery</type>, <replaceable class="parameter">query2</replaceable> <type>tsquery</type> [, <replaceable class="parameter">distance</replaceable> <type>integer</type> ]) returns <type>tsquery</type></literal>
1646 </term>
1648 <listitem>
1649 <para>
1650 Returns a query that searches for a match to the first given query
1651 followed by a match to the second given query at a distance of exactly
1652 <replaceable>distance</replaceable> lexemes, using
1653 the <literal>&lt;<replaceable>N</replaceable>&gt;</literal>
1654 <type>tsquery</type> operator. For example:
1656 <screen>
1657 SELECT tsquery_phrase(to_tsquery('fat'), to_tsquery('cat'), 10);
1658 tsquery_phrase
1659 ------------------
1660 'fat' &lt;10&gt; 'cat'
1661 </screen>
1662 </para>
1663 </listitem>
1665 </varlistentry>
1667 <varlistentry>
1669 <term>
1670 <indexterm>
1671 <primary>numnode</primary>
1672 </indexterm>
1674 <literal>numnode(<replaceable class="parameter">query</replaceable> <type>tsquery</type>) returns <type>integer</type></literal>
1675 </term>
1677 <listitem>
1678 <para>
1679 Returns the number of nodes (lexemes plus operators) in a
1680 <type>tsquery</type>. This function is useful
1681 to determine if the <replaceable>query</replaceable> is meaningful
1682 (returns &gt; 0), or contains only stop words (returns 0).
1683 Examples:
1685 <screen>
1686 SELECT numnode(plainto_tsquery('the any'));
1687 NOTICE: query contains only stopword(s) or doesn't contain lexeme(s), ignored
1688 numnode
---------
       0

SELECT numnode('foo &amp; bar'::tsquery);
1693 numnode
---------
       3
</screen>
1697 </para>
1698 </listitem>
1699 </varlistentry>
1701 <varlistentry>
1703 <term>
1704 <indexterm>
1705 <primary>querytree</primary>
1706 </indexterm>
1708 <literal>querytree(<replaceable class="parameter">query</replaceable> <type>tsquery</type>) returns <type>text</type></literal>
1709 </term>
1711 <listitem>
1712 <para>
1713 Returns the portion of a <type>tsquery</type> that can be used for
1714 searching an index. This function is useful for detecting
1715 unindexable queries, for example those containing only stop words
1716 or only negated terms. For example:
1718 <screen>
1719 SELECT querytree(to_tsquery('defined'));
1720 querytree
1721 -----------
1722 'defin'
1724 SELECT querytree(to_tsquery('!defined'));
1725 querytree
-----------
 T
</screen>
1729 </para>
1730 </listitem>
1731 </varlistentry>
1733 </variablelist>
1735 <sect3 id="textsearch-query-rewriting">
1736 <title>Query Rewriting</title>
1738 <indexterm zone="textsearch-query-rewriting">
1739 <primary>ts_rewrite</primary>
1740 </indexterm>
1742 <para>
The <function>ts_rewrite</function> family of functions searches a
given <type>tsquery</type> for occurrences of a target
subquery, and replaces each occurrence with a
substitute subquery. In essence this operation is a
1747 <type>tsquery</type>-specific version of substring replacement.
1748 A target and substitute combination can be
1749 thought of as a <firstterm>query rewrite rule</firstterm>. A collection
1750 of such rewrite rules can be a powerful search aid.
1751 For example, you can expand the search using synonyms
1752 (e.g., <literal>new york</literal>, <literal>big apple</literal>, <literal>nyc</literal>,
1753 <literal>gotham</literal>) or narrow the search to direct the user to some hot
1754 topic. There is some overlap in functionality between this feature
1755 and thesaurus dictionaries (<xref linkend="textsearch-thesaurus"/>).
1756 However, you can modify a set of rewrite rules on-the-fly without
1757 reindexing, whereas updating a thesaurus requires reindexing to be
1758 effective.
1759 </para>
1761 <variablelist>
1763 <varlistentry>
1765 <term>
1766 <literal>ts_rewrite (<replaceable class="parameter">query</replaceable> <type>tsquery</type>, <replaceable class="parameter">target</replaceable> <type>tsquery</type>, <replaceable class="parameter">substitute</replaceable> <type>tsquery</type>) returns <type>tsquery</type></literal>
1767 </term>
1769 <listitem>
1770 <para>
1771 This form of <function>ts_rewrite</function> simply applies a single
1772 rewrite rule: <replaceable class="parameter">target</replaceable>
1773 is replaced by <replaceable class="parameter">substitute</replaceable>
1774 wherever it appears in <replaceable
1775 class="parameter">query</replaceable>. For example:
1777 <screen>
1778 SELECT ts_rewrite('a &amp; b'::tsquery, 'a'::tsquery, 'c'::tsquery);
1779 ts_rewrite
1780 ------------
1781 'b' &amp; 'c'
1782 </screen>
1783 </para>
1784 </listitem>
1785 </varlistentry>
1787 <varlistentry>
1789 <term>
1790 <literal>ts_rewrite (<replaceable class="parameter">query</replaceable> <type>tsquery</type>, <replaceable class="parameter">select</replaceable> <type>text</type>) returns <type>tsquery</type></literal>
1791 </term>
1793 <listitem>
1794 <para>
1795 This form of <function>ts_rewrite</function> accepts a starting
1796 <replaceable>query</replaceable> and an SQL <replaceable>select</replaceable> command, which
1797 is given as a text string. The <replaceable>select</replaceable> must yield two
1798 columns of <type>tsquery</type> type. For each row of the
1799 <replaceable>select</replaceable> result, occurrences of the first column value
1800 (the target) are replaced by the second column value (the substitute)
1801 within the current <replaceable>query</replaceable> value. For example:
1803 <screen>
1804 CREATE TABLE aliases (t tsquery PRIMARY KEY, s tsquery);
1805 INSERT INTO aliases VALUES('a', 'c');
1807 SELECT ts_rewrite('a &amp; b'::tsquery, 'SELECT t,s FROM aliases');
1808 ts_rewrite
1809 ------------
1810 'b' &amp; 'c'
1811 </screen>
1812 </para>
1814 <para>
1815 Note that when multiple rewrite rules are applied in this way,
1816 the order of application can be important; so in practice you will
1817 want the source query to <literal>ORDER BY</literal> some ordering key.
1818 </para>
1819 </listitem>
1820 </varlistentry>
1822 </variablelist>
1824 <para>
1825 Let's consider a real-life astronomical example. We'll expand query
1826 <literal>supernovae</literal> using table-driven rewriting rules:
1828 <screen>
1829 CREATE TABLE aliases (t tsquery primary key, s tsquery);
1830 INSERT INTO aliases VALUES(to_tsquery('supernovae'), to_tsquery('supernovae|sn'));
1832 SELECT ts_rewrite(to_tsquery('supernovae &amp; crab'), 'SELECT * FROM aliases');
1833 ts_rewrite
1834 ---------------------------------
1835 'crab' &amp; ( 'supernova' | 'sn' )
1836 </screen>
1838 We can change the rewriting rules just by updating the table:
1840 <screen>
1841 UPDATE aliases
1842 SET s = to_tsquery('supernovae|sn &amp; !nebulae')
1843 WHERE t = to_tsquery('supernovae');
1845 SELECT ts_rewrite(to_tsquery('supernovae &amp; crab'), 'SELECT * FROM aliases');
1846 ts_rewrite
1847 ---------------------------------------------
1848 'crab' &amp; ( 'supernova' | 'sn' &amp; !'nebula' )
1849 </screen>
1850 </para>
1852 <para>
1853 Rewriting can be slow when there are many rewriting rules, since it
1854 checks every rule for a possible match. To filter out obvious non-candidate
1855 rules we can use the containment operators for the <type>tsquery</type>
1856 type. In the example below, we select only those rules which might match
1857 the original query:
1859 <screen>
1860 SELECT ts_rewrite('a &amp; b'::tsquery,
1861 'SELECT t,s FROM aliases WHERE ''a &amp; b''::tsquery @&gt; t');
1862 ts_rewrite
1863 ------------
1864 'b' &amp; 'c'
1865 </screen>
1866 </para>
1868 </sect3>
1870 </sect2>
1872 <sect2 id="textsearch-update-triggers">
1873 <title>Triggers for Automatic Updates</title>
1875 <indexterm>
1876 <primary>trigger</primary>
1877 <secondary>for updating a derived tsvector column</secondary>
1878 </indexterm>
1880 <note>
1881 <para>
1882 The method described in this section has been obsoleted by the use of
1883 stored generated columns, as described in <xref
1884 linkend="textsearch-tables-index"/>.
1885 </para>
1886 </note>
1888 <para>
1889 When using a separate column to store the <type>tsvector</type> representation
1890 of your documents, it is necessary to create a trigger to update the
1891 <type>tsvector</type> column when the document content columns change.
1892 Two built-in trigger functions are available for this, or you can write
1893 your own.
1894 </para>
1896 <synopsis>
1897 tsvector_update_trigger(<replaceable class="parameter">tsvector_column_name</replaceable>,&zwsp; <replaceable class="parameter">config_name</replaceable>, <replaceable class="parameter">text_column_name</replaceable> <optional>, ... </optional>)
1898 tsvector_update_trigger_column(<replaceable class="parameter">tsvector_column_name</replaceable>,&zwsp; <replaceable class="parameter">config_column_name</replaceable>, <replaceable class="parameter">text_column_name</replaceable> <optional>, ... </optional>)
1899 </synopsis>
1901 <para>
1902 These trigger functions automatically compute a <type>tsvector</type>
1903 column from one or more textual columns, under the control of
1904 parameters specified in the <command>CREATE TRIGGER</command> command.
1905 An example of their use is:
1907 <screen>
1908 CREATE TABLE messages (
1909 title text,
1910 body text,
tsv tsvector
);
1914 CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
1915 ON messages FOR EACH ROW EXECUTE FUNCTION
1916 tsvector_update_trigger(tsv, 'pg_catalog.english', title, body);
1918 INSERT INTO messages VALUES('title here', 'the body text is here');
1920 SELECT * FROM messages;
1921 title | body | tsv
1922 ------------+-----------------------+----------------------------
1923 title here | the body text is here | 'bodi':4 'text':5 'titl':1
1925 SELECT title, body FROM messages WHERE tsv @@ to_tsquery('title &amp; body');
1926 title | body
1927 ------------+-----------------------
1928 title here | the body text is here
1929 </screen>
1931 Having created this trigger, any change in <structfield>title</structfield> or
1932 <structfield>body</structfield> will automatically be reflected into
1933 <structfield>tsv</structfield>, without the application having to worry about it.
1934 </para>
1936 <para>
1937 The first trigger argument must be the name of the <type>tsvector</type>
1938 column to be updated. The second argument specifies the text search
1939 configuration to be used to perform the conversion. For
1940 <function>tsvector_update_trigger</function>, the configuration name is simply
1941 given as the second trigger argument. It must be schema-qualified as
1942 shown above, so that the trigger behavior will not change with changes
1943 in <varname>search_path</varname>. For
1944 <function>tsvector_update_trigger_column</function>, the second trigger argument
1945 is the name of another table column, which must be of type
1946 <type>regconfig</type>. This allows a per-row selection of configuration
1947 to be made. The remaining argument(s) are the names of textual columns
1948 (of type <type>text</type>, <type>varchar</type>, or <type>char</type>). These
1949 will be included in the document in the order given. NULL values will
1950 be skipped (but the other columns will still be indexed).
1951 </para>
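<para>
As a sketch of per-row configuration selection (the table and column
names here are hypothetical),
<function>tsvector_update_trigger_column</function> takes a
<type>regconfig</type> column in place of a fixed configuration name:

<programlisting>
CREATE TABLE multilingual_messages (
title text,
body text,
config regconfig,
tsv tsvector
);

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
ON multilingual_messages FOR EACH ROW EXECUTE FUNCTION
tsvector_update_trigger_column(tsv, config, title, body);
</programlisting>
</para>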
1953 <para>
1954 A limitation of these built-in triggers is that they treat all the
1955 input columns alike. To process columns differently &mdash; for
1956 example, to weight title differently from body &mdash; it is necessary
1957 to write a custom trigger. Here is an example using
1958 <application>PL/pgSQL</application> as the trigger language:
1960 <programlisting>
1961 CREATE FUNCTION messages_trigger() RETURNS trigger AS $$
1962 begin
1963 new.tsv :=
1964 setweight(to_tsvector('pg_catalog.english', coalesce(new.title,'')), 'A') ||
1965 setweight(to_tsvector('pg_catalog.english', coalesce(new.body,'')), 'D');
return new;
end
1968 $$ LANGUAGE plpgsql;
1970 CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
1971 ON messages FOR EACH ROW EXECUTE FUNCTION messages_trigger();
1972 </programlisting>
1973 </para>
1975 <para>
1976 Keep in mind that it is important to specify the configuration name
1977 explicitly when creating <type>tsvector</type> values inside triggers,
1978 so that the column's contents will not be affected by changes to
1979 <varname>default_text_search_config</varname>. Failure to do this is likely to
1980 lead to problems such as search results changing after a dump and restore.
1981 </para>
1983 </sect2>
1985 <sect2 id="textsearch-statistics">
1986 <title>Gathering Document Statistics</title>
1988 <indexterm>
1989 <primary>ts_stat</primary>
1990 </indexterm>
1992 <para>
1993 The function <function>ts_stat</function> is useful for checking your
1994 configuration and for finding stop-word candidates.
1995 </para>
1997 <synopsis>
1998 ts_stat(<replaceable class="parameter">sqlquery</replaceable> <type>text</type>, <optional> <replaceable class="parameter">weights</replaceable> <type>text</type>, </optional>
1999 OUT <replaceable class="parameter">word</replaceable> <type>text</type>, OUT <replaceable class="parameter">ndoc</replaceable> <type>integer</type>,
2000 OUT <replaceable class="parameter">nentry</replaceable> <type>integer</type>) returns <type>setof record</type>
2001 </synopsis>
2003 <para>
2004 <replaceable>sqlquery</replaceable> is a text value containing an SQL
2005 query which must return a single <type>tsvector</type> column.
2006 <function>ts_stat</function> executes the query and returns statistics about
2007 each distinct lexeme (word) contained in the <type>tsvector</type>
2008 data. The columns returned are
2010 <itemizedlist spacing="compact" mark="bullet">
2011 <listitem>
2012 <para>
2013 <replaceable>word</replaceable> <type>text</type> &mdash; the value of a lexeme
2014 </para>
2015 </listitem>
2016 <listitem>
2017 <para>
2018 <replaceable>ndoc</replaceable> <type>integer</type> &mdash; number of documents
2019 (<type>tsvector</type>s) the word occurred in
2020 </para>
2021 </listitem>
2022 <listitem>
2023 <para>
2024 <replaceable>nentry</replaceable> <type>integer</type> &mdash; total number of
2025 occurrences of the word
2026 </para>
2027 </listitem>
2028 </itemizedlist>
2030 If <replaceable>weights</replaceable> is supplied, only occurrences
2031 having one of those weights are counted.
2032 </para>
2034 <para>
2035 For example, to find the ten most frequent words in a document collection:
2037 <programlisting>
2038 SELECT * FROM ts_stat('SELECT vector FROM apod')
2039 ORDER BY nentry DESC, ndoc DESC, word
2040 LIMIT 10;
2041 </programlisting>
2043 The same, but counting only word occurrences with weight <literal>A</literal>
2044 or <literal>B</literal>:
2046 <programlisting>
2047 SELECT * FROM ts_stat('SELECT vector FROM apod', 'ab')
2048 ORDER BY nentry DESC, ndoc DESC, word
2049 LIMIT 10;
2050 </programlisting>
2051 </para>
2053 </sect2>
2055 </sect1>
2057 <sect1 id="textsearch-parsers">
2058 <title>Parsers</title>
2060 <para>
2061 Text search parsers are responsible for splitting raw document text
2062 into <firstterm>tokens</firstterm> and identifying each token's type, where
2063 the set of possible types is defined by the parser itself.
2064 Note that a parser does not modify the text at all &mdash; it simply
2065 identifies plausible word boundaries. Because of this limited scope,
2066 there is less need for application-specific custom parsers than there is
2067 for custom dictionaries. At present <productname>PostgreSQL</productname>
2068 provides just one built-in parser, which has been found to be useful for a
2069 wide range of applications.
2070 </para>
2072 <para>
2073 The built-in parser is named <literal>pg_catalog.default</literal>.
2074 It recognizes 23 token types, shown in <xref linkend="textsearch-default-parser"/>.
2075 </para>
2077 <table id="textsearch-default-parser">
2078 <title>Default Parser's Token Types</title>
2079 <tgroup cols="3">
2080 <colspec colname="col1" colwidth="2*"/>
2081 <colspec colname="col2" colwidth="2*"/>
2082 <colspec colname="col3" colwidth="3*"/>
2083 <thead>
2084 <row>
2085 <entry>Alias</entry>
2086 <entry>Description</entry>
2087 <entry>Example</entry>
2088 </row>
2089 </thead>
2090 <tbody>
2091 <row>
2092 <entry><literal>asciiword</literal></entry>
2093 <entry>Word, all ASCII letters</entry>
2094 <entry><literal>elephant</literal></entry>
2095 </row>
2096 <row>
2097 <entry><literal>word</literal></entry>
2098 <entry>Word, all letters</entry>
2099 <entry><literal>ma&ntilde;ana</literal></entry>
2100 </row>
2101 <row>
2102 <entry><literal>numword</literal></entry>
2103 <entry>Word, letters and digits</entry>
2104 <entry><literal>beta1</literal></entry>
2105 </row>
2106 <row>
2107 <entry><literal>asciihword</literal></entry>
2108 <entry>Hyphenated word, all ASCII</entry>
2109 <entry><literal>up-to-date</literal></entry>
2110 </row>
2111 <row>
2112 <entry><literal>hword</literal></entry>
2113 <entry>Hyphenated word, all letters</entry>
2114 <entry><literal>l&oacute;gico-matem&aacute;tica</literal></entry>
2115 </row>
2116 <row>
2117 <entry><literal>numhword</literal></entry>
2118 <entry>Hyphenated word, letters and digits</entry>
2119 <entry><literal>postgresql-beta1</literal></entry>
2120 </row>
2121 <row>
2122 <entry><literal>hword_asciipart</literal></entry>
2123 <entry>Hyphenated word part, all ASCII</entry>
2124 <entry><literal>postgresql</literal> in the context <literal>postgresql-beta1</literal></entry>
2125 </row>
2126 <row>
2127 <entry><literal>hword_part</literal></entry>
2128 <entry>Hyphenated word part, all letters</entry>
2129 <entry><literal>l&oacute;gico</literal> or <literal>matem&aacute;tica</literal>
2130 in the context <literal>l&oacute;gico-matem&aacute;tica</literal></entry>
2131 </row>
2132 <row>
2133 <entry><literal>hword_numpart</literal></entry>
2134 <entry>Hyphenated word part, letters and digits</entry>
2135 <entry><literal>beta1</literal> in the context
2136 <literal>postgresql-beta1</literal></entry>
2137 </row>
2138 <row>
2139 <entry><literal>email</literal></entry>
2140 <entry>Email address</entry>
2141 <entry><literal>foo@example.com</literal></entry>
2142 </row>
2143 <row>
2144 <entry><literal>protocol</literal></entry>
2145 <entry>Protocol head</entry>
2146 <entry><literal>http://</literal></entry>
2147 </row>
2148 <row>
2149 <entry><literal>url</literal></entry>
2150 <entry>URL</entry>
2151 <entry><literal>example.com/stuff/index.html</literal></entry>
2152 </row>
2153 <row>
2154 <entry><literal>host</literal></entry>
2155 <entry>Host</entry>
2156 <entry><literal>example.com</literal></entry>
2157 </row>
2158 <row>
2159 <entry><literal>url_path</literal></entry>
2160 <entry>URL path</entry>
2161 <entry><literal>/stuff/index.html</literal>, in the context of a URL</entry>
2162 </row>
2163 <row>
2164 <entry><literal>file</literal></entry>
2165 <entry>File or path name</entry>
2166 <entry><literal>/usr/local/foo.txt</literal>, if not within a URL</entry>
2167 </row>
2168 <row>
2169 <entry><literal>sfloat</literal></entry>
2170 <entry>Scientific notation</entry>
2171 <entry><literal>-1.234e56</literal></entry>
2172 </row>
2173 <row>
2174 <entry><literal>float</literal></entry>
2175 <entry>Decimal notation</entry>
2176 <entry><literal>-1.234</literal></entry>
2177 </row>
2178 <row>
2179 <entry><literal>int</literal></entry>
2180 <entry>Signed integer</entry>
2181 <entry><literal>-1234</literal></entry>
2182 </row>
2183 <row>
2184 <entry><literal>uint</literal></entry>
2185 <entry>Unsigned integer</entry>
2186 <entry><literal>1234</literal></entry>
2187 </row>
2188 <row>
2189 <entry><literal>version</literal></entry>
2190 <entry>Version number</entry>
2191 <entry><literal>8.3.0</literal></entry>
2192 </row>
2193 <row>
2194 <entry><literal>tag</literal></entry>
2195 <entry>XML tag</entry>
2196 <entry><literal>&lt;a href="dictionaries.html"&gt;</literal></entry>
2197 </row>
2198 <row>
2199 <entry><literal>entity</literal></entry>
2200 <entry>XML entity</entry>
2201 <entry><literal>&amp;amp;</literal></entry>
2202 </row>
2203 <row>
2204 <entry><literal>blank</literal></entry>
2205 <entry>Space symbols</entry>
2206 <entry>(any whitespace or punctuation not otherwise recognized)</entry>
2207 </row>
2208 </tbody>
2209 </tgroup>
2210 </table>
2212 <note>
2213 <para>
2214 The parser's notion of a <quote>letter</quote> is determined by the database's
2215 locale setting, specifically <varname>lc_ctype</varname>. Words containing
2216 only the basic ASCII letters are reported as a separate token type,
2217 since it is sometimes useful to distinguish them. In most European
2218 languages, token types <literal>word</literal> and <literal>asciiword</literal>
2219 should be treated alike.
2220 </para>
2222 <para>
2223 <literal>email</literal> does not support all valid email characters as
2224 defined by <ulink url="https://datatracker.ietf.org/doc/html/rfc5322">RFC 5322</ulink>.
2225 Specifically, the only non-alphanumeric characters supported for
2226 email user names are period, dash, and underscore.
2227 </para>
2228 </note>
2230 <para>
2231 It is possible for the parser to produce overlapping tokens from the same
2232 piece of text. As an example, a hyphenated word will be reported both
2233 as the entire word and as each component:
2235 <screen>
2236 SELECT alias, description, token FROM ts_debug('foo-bar-beta1');
2237 alias | description | token
2238 -----------------+------------------------------------------+---------------
2239 numhword | Hyphenated word, letters and digits | foo-bar-beta1
2240 hword_asciipart | Hyphenated word part, all ASCII | foo
2241 blank | Space symbols | -
2242 hword_asciipart | Hyphenated word part, all ASCII | bar
2243 blank | Space symbols | -
2244 hword_numpart | Hyphenated word part, letters and digits | beta1
2245 </screen>
2247 This behavior is desirable since it allows searches to work for both
2248 the whole compound word and for components. Here is another
2249 instructive example:
2251 <screen>
2252 SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html');
2253 alias | description | token
2254 ----------+---------------+------------------------------
2255 protocol | Protocol head | http://
2256 url | URL | example.com/stuff/index.html
2257 host | Host | example.com
2258 url_path | URL path | /stuff/index.html
2259 </screen>
2260 </para>
2262 </sect1>
2264 <sect1 id="textsearch-dictionaries">
2265 <title>Dictionaries</title>
2267 <para>
2268 Dictionaries are used to eliminate words that should not be considered in a
2269 search (<firstterm>stop words</firstterm>), and to <firstterm>normalize</firstterm> words so
2270 that different derived forms of the same word will match. A successfully
2271 normalized word is called a <firstterm>lexeme</firstterm>. Aside from
2272 improving search quality, normalization and removal of stop words reduce the
2273 size of the <type>tsvector</type> representation of a document, thereby
2274 improving performance. Normalization does not always have linguistic meaning
2275 and usually depends on application semantics.
2276 </para>
2278 <para>
2279 Some examples of normalization:
2281 <itemizedlist spacing="compact" mark="bullet">
2283 <listitem>
2284 <para>
2285 Linguistic &mdash; Ispell dictionaries try to reduce input words to a
2286 normalized form; stemmer dictionaries remove word endings
2287 </para>
2288 </listitem>
2289 <listitem>
2290 <para>
2291 <acronym>URL</acronym> locations can be canonicalized to make
2292 equivalent URLs match:
2294 <itemizedlist spacing="compact" mark="bullet">
2295 <listitem>
2296 <para>
2297 http://www.pgsql.ru/db/mw/index.html
2298 </para>
2299 </listitem>
2300 <listitem>
2301 <para>
2302 http://www.pgsql.ru/db/mw/
2303 </para>
2304 </listitem>
2305 <listitem>
2306 <para>
2307 http://www.pgsql.ru/db/../db/mw/index.html
2308 </para>
2309 </listitem>
2310 </itemizedlist>
2311 </para>
2312 </listitem>
2313 <listitem>
2314 <para>
2315 Color names can be replaced by their hexadecimal values, e.g.,
2316 <literal>red, green, blue, magenta -> FF0000, 00FF00, 0000FF, FF00FF</literal>
2317 </para>
2318 </listitem>
2319 <listitem>
2320 <para>
2321 If indexing numbers, we can
2322 remove some fractional digits to reduce the range of possible
2323 numbers, so for example <emphasis>3.14</emphasis>159265359,
2324 <emphasis>3.14</emphasis>15926, <emphasis>3.14</emphasis> will be the same
2325 after normalization if only two digits are kept after the decimal point.
2326 </para>
2327 </listitem>
2328 </itemizedlist>
2330 </para>
2332 <para>
2333 A dictionary is a program that accepts a token as
2334 input and returns:
2335 <itemizedlist spacing="compact" mark="bullet">
2336 <listitem>
2337 <para>
2338 an array of lexemes if the input token is known to the dictionary
2339 (notice that one token can produce more than one lexeme)
2340 </para>
2341 </listitem>
2342 <listitem>
2343 <para>
2344 a single lexeme with the <literal>TSL_FILTER</literal> flag set, to replace
2345 the original token with a new token to be passed to subsequent
2346 dictionaries (a dictionary that does this is called a
2347 <firstterm>filtering dictionary</firstterm>)
2348 </para>
2349 </listitem>
2350 <listitem>
2351 <para>
2352 an empty array if the dictionary knows the token, but it is a stop word
2353 </para>
2354 </listitem>
2355 <listitem>
2356 <para>
2357 <literal>NULL</literal> if the dictionary does not recognize the input token
2358 </para>
2359 </listitem>
2360 </itemizedlist>
2361 </para>
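<para>
These behaviors can be observed directly with
<function>ts_lexize</function>. For example, with the built-in
<literal>english_stem</literal> dictionary, a recognized word yields an
array of lexemes, while a stop word yields an empty array:

<screen>
SELECT ts_lexize('english_stem', 'stars');
ts_lexize
-----------
{star}

SELECT ts_lexize('english_stem', 'a');
ts_lexize
-----------
{}
</screen>
</para>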
2363 <para>
2364 <productname>PostgreSQL</productname> provides predefined dictionaries for
2365 many languages. There are also several predefined templates that can be
2366 used to create new dictionaries with custom parameters. Each predefined
2367 dictionary template is described below. If no existing
2368 template is suitable, it is possible to create new ones; see the
2369 <filename>contrib/</filename> area of the <productname>PostgreSQL</productname> distribution
2370 for examples.
2371 </para>
2373 <para>
2374 A text search configuration binds a parser together with a set of
2375 dictionaries to process the parser's output tokens. For each token
2376 type that the parser can return, a separate list of dictionaries is
2377 specified by the configuration. When a token of that type is found
2378 by the parser, each dictionary in the list is consulted in turn,
2379 until some dictionary recognizes it as a known word. If it is identified
2380 as a stop word, or if no dictionary recognizes the token, it will be
2381 discarded and not indexed or searched for.
2382 Normally, the first dictionary that returns a non-<literal>NULL</literal>
2383 output determines the result, and any remaining dictionaries are not
2384 consulted; but a filtering dictionary can replace the given word
2385 with a modified word, which is then passed to subsequent dictionaries.
2386 </para>
2388 <para>
2389 The general rule for configuring a list of dictionaries
2390 is to place first the most narrow, most specific dictionary, then the more
2391 general dictionaries, finishing with a very general dictionary, like
2392 a <application>Snowball</application> stemmer or <literal>simple</literal>, which
2393 recognizes everything. For example, for an astronomy-specific search
2394 (<literal>astro_en</literal> configuration) one could bind token type
2395 <type>asciiword</type> (ASCII word) to a synonym dictionary of astronomical
2396 terms, a general English dictionary and a <application>Snowball</application> English
2397 stemmer:
2399 <programlisting>
2400 ALTER TEXT SEARCH CONFIGURATION astro_en
2401 ADD MAPPING FOR asciiword WITH astrosyn, english_ispell, english_stem;
2402 </programlisting>
2403 </para>
2405 <para>
2406 A filtering dictionary can be placed anywhere in the list, except at the
2407 end where it'd be useless. Filtering dictionaries are useful to partially
2408 normalize words to simplify the task of later dictionaries. For example,
2409 a filtering dictionary could be used to remove accents from accented
2410 letters, as is done by the <xref linkend="unaccent"/> module.
2411 </para>
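<para>
For instance, assuming the <xref linkend="unaccent"/> extension has
been installed, its filtering dictionary can be tested directly:

<screen>
CREATE EXTENSION unaccent;

SELECT ts_lexize('unaccent', 'H&ocirc;tel');
ts_lexize
-----------
{Hotel}
</screen>
</para>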
2413 <sect2 id="textsearch-stopwords">
2414 <title>Stop Words</title>
2416 <para>
2417 Stop words are words that are very common, appear in almost every
2418 document, and have no discrimination value. Therefore, they can be ignored
2419 in the context of full text searching. For example, every English text
2420 contains words like <literal>a</literal> and <literal>the</literal>, so it is
2421 useless to store them in an index. However, stop words do affect the
2422 positions in <type>tsvector</type>, which in turn affect ranking:
2424 <screen>
2425 SELECT to_tsvector('english', 'in the list of stop words');
2426 to_tsvector
2427 ----------------------------
2428 'list':3 'stop':5 'word':6
2429 </screen>
2431 The missing positions 1,2,4 are because of stop words. Ranks
2432 calculated for documents with and without stop words are quite different:
2434 <screen>
2435 SELECT ts_rank_cd (to_tsvector('english', 'in the list of stop words'), to_tsquery('list &amp; stop'));
2436 ts_rank_cd
2437 ------------
2438 0.05
2440 SELECT ts_rank_cd (to_tsvector('english', 'list stop words'), to_tsquery('list &amp; stop'));
2441 ts_rank_cd
------------
        0.1
</screen>
2446 </para>
2448 <para>
2449 It is up to the specific dictionary how it treats stop words. For example,
2450 <literal>ispell</literal> dictionaries first normalize words and then
2451 look at the list of stop words, while <literal>Snowball</literal> stemmers
2452 first check the list of stop words. The reason for the different
2453 behavior is an attempt to decrease noise.
2454 </para>
2456 </sect2>
2458 <sect2 id="textsearch-simple-dictionary">
2459 <title>Simple Dictionary</title>
2461 <para>
2462 The <literal>simple</literal> dictionary template operates by converting the
2463 input token to lower case and checking it against a file of stop words.
2464 If it is found in the file then an empty array is returned, causing
2465 the token to be discarded. If not, the lower-cased form of the word
2466 is returned as the normalized lexeme. Alternatively, the dictionary
2467 can be configured to report non-stop-words as unrecognized, allowing
2468 them to be passed on to the next dictionary in the list.
2469 </para>
2471 <para>
2472 Here is an example of a dictionary definition using the <literal>simple</literal>
2473 template:
2475 <programlisting>
2476 CREATE TEXT SEARCH DICTIONARY public.simple_dict (
2477 TEMPLATE = pg_catalog.simple,
STOPWORDS = english
);
2480 </programlisting>
2482 Here, <literal>english</literal> is the base name of a file of stop words.
2483 The file's full name will be
2484 <filename>$SHAREDIR/tsearch_data/english.stop</filename>,
2485 where <literal>$SHAREDIR</literal> means the
2486 <productname>PostgreSQL</productname> installation's shared-data directory,
2487 often <filename>/usr/local/share/postgresql</filename> (use <command>pg_config
2488 --sharedir</command> to determine it if you're not sure).
2489 The file format is simply a list
2490 of words, one per line. Blank lines and trailing spaces are ignored,
2491 and upper case is folded to lower case, but no other processing is done
2492 on the file contents.
2493 </para>
2495 <para>
2496 Now we can test our dictionary:
2498 <screen>
2499 SELECT ts_lexize('public.simple_dict', 'YeS');
2500 ts_lexize
2501 -----------
2502 {yes}
2504 SELECT ts_lexize('public.simple_dict', 'The');
2505 ts_lexize
-----------
 {}
</screen>
2509 </para>
2511 <para>
2512 We can also choose to return <literal>NULL</literal>, instead of the lower-cased
2513 word, if it is not found in the stop words file. This behavior is
2514 selected by setting the dictionary's <literal>Accept</literal> parameter to
2515 <literal>false</literal>. Continuing the example:
2517 <screen>
2518 ALTER TEXT SEARCH DICTIONARY public.simple_dict ( Accept = false );
2520 SELECT ts_lexize('public.simple_dict', 'YeS');
2521 ts_lexize
2522 -----------
2525 SELECT ts_lexize('public.simple_dict', 'The');
2526 ts_lexize
-----------
 {}
</screen>
2530 </para>
2532 <para>
2533 With the default setting of <literal>Accept</literal> = <literal>true</literal>,
2534 it is only useful to place a <literal>simple</literal> dictionary at the end
2535 of a list of dictionaries, since it will never pass on any token to
2536 a following dictionary. Conversely, <literal>Accept</literal> = <literal>false</literal>
2537 is only useful when there is at least one following dictionary.
2538 </para>
2540 <caution>
2541 <para>
2542 Most types of dictionaries rely on configuration files, such as files of
2543 stop words. These files <emphasis>must</emphasis> be stored in UTF-8 encoding.
2544 They will be translated to the actual database encoding, if that is
2545 different, when they are read into the server.
2546 </para>
2547 </caution>
2549 <caution>
2550 <para>
2551 Normally, a database session will read a dictionary configuration file
2552 only once, when it is first used within the session. If you modify a
2553 configuration file and want to force existing sessions to pick up the
2554 new contents, issue an <command>ALTER TEXT SEARCH DICTIONARY</command> command
2555 on the dictionary. This can be a <quote>dummy</quote> update that doesn't
2556 actually change any parameter values.
2557 </para>
2558 </caution>
2560 </sect2>
2562 <sect2 id="textsearch-synonym-dictionary">
2563 <title>Synonym Dictionary</title>
2565 <para>
2566 This dictionary template is used to create dictionaries that replace a
2567 word with a synonym. Phrases are not supported (use the thesaurus
2568 template (<xref linkend="textsearch-thesaurus"/>) for that). A synonym
2569 dictionary can be used to overcome linguistic problems, for example, to
2570 prevent an English stemmer dictionary from reducing the word <quote>Paris</quote> to
2571 <quote>pari</quote>. It is enough to have a <literal>Paris paris</literal> line in the
2572 synonym dictionary and put it before the <literal>english_stem</literal>
2573 dictionary. For example:
2575 <screen>
2576 SELECT * FROM ts_debug('english', 'Paris');
2577 alias | description | token | dictionaries | dictionary | lexemes
2578 -----------+-----------------+-------+----------------+--------------+---------
2579 asciiword | Word, all ASCII | Paris | {english_stem} | english_stem | {pari}
2581 CREATE TEXT SEARCH DICTIONARY my_synonym (
2582 TEMPLATE = synonym,
SYNONYMS = my_synonyms
);
2586 ALTER TEXT SEARCH CONFIGURATION english
2587 ALTER MAPPING FOR asciiword
2588 WITH my_synonym, english_stem;
2590 SELECT * FROM ts_debug('english', 'Paris');
2591 alias | description | token | dictionaries | dictionary | lexemes
2592 -----------+-----------------+-------+---------------------------+------------+---------
2593 asciiword | Word, all ASCII | Paris | {my_synonym,english_stem} | my_synonym | {paris}
2594 </screen>
2595 </para>
2597 <para>
2598 The only parameter required by the <literal>synonym</literal> template is
2599 <literal>SYNONYMS</literal>, which is the base name of its configuration file
2600 &mdash; <literal>my_synonyms</literal> in the above example.
2601 The file's full name will be
2602 <filename>$SHAREDIR/tsearch_data/my_synonyms.syn</filename>
2603 (where <literal>$SHAREDIR</literal> means the
2604 <productname>PostgreSQL</productname> installation's shared-data directory).
2605 The file format is just one line
2606 per word to be substituted, with the word followed by its synonym,
2607 separated by white space. Blank lines and trailing spaces are ignored.
2608 </para>
2610 <para>
2611 The <literal>synonym</literal> template also has an optional parameter
2612 <literal>CaseSensitive</literal>, which defaults to <literal>false</literal>. When
2613 <literal>CaseSensitive</literal> is <literal>false</literal>, words in the synonym file
2614 are folded to lower case, as are input tokens. When it is
2615 <literal>true</literal>, words and tokens are not folded to lower case,
2616 but are compared as-is.
2617 </para>
2619 <para>
2620 An asterisk (<literal>*</literal>) can be placed at the end of a synonym
2621 in the configuration file. This indicates that the synonym is a prefix.
2622 The asterisk is ignored when the entry is used in
2623 <function>to_tsvector()</function>, but when it is used in
2624 <function>to_tsquery()</function>, the result will be a query item with
2625 the prefix match marker (see
2626 <xref linkend="textsearch-parsing-queries"/>).
2627 For example, suppose we have these entries in
2628 <filename>$SHAREDIR/tsearch_data/synonym_sample.syn</filename>:
2629 <programlisting>
2630 postgres pgsql
2631 postgresql pgsql
2632 postgre pgsql
2633 gogle googl
2634 indices index*
2635 </programlisting>
2636 Then we will get these results:
2637 <screen>
2638 mydb=# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample');
2639 mydb=# SELECT ts_lexize('syn', 'indices');
2640 ts_lexize
2641 -----------
2642 {index}
2643 (1 row)
2645 mydb=# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple);
2646 mydb=# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn;
2647 mydb=# SELECT to_tsvector('tst', 'indices');
2648 to_tsvector
2649 -------------
2650 'index':1
2651 (1 row)
2653 mydb=# SELECT to_tsquery('tst', 'indices');
2654 to_tsquery
2655 ------------
2656 'index':*
2657 (1 row)
2659 mydb=# SELECT 'indexes are very useful'::tsvector;
2660 tsvector
2661 ---------------------------------
2662 'are' 'indexes' 'useful' 'very'
2663 (1 row)
2665 mydb=# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst', 'indices');
2666 ?column?
2667 ----------
 t
(1 row)
2670 </screen>
2671 </para>
2672 </sect2>
2674 <sect2 id="textsearch-thesaurus">
2675 <title>Thesaurus Dictionary</title>
2677 <para>
2678 A thesaurus dictionary (sometimes abbreviated as <acronym>TZ</acronym>) is
2679 a collection of words that includes information about the relationships
2680 of words and phrases, i.e., broader terms (<acronym>BT</acronym>), narrower
2681 terms (<acronym>NT</acronym>), preferred terms, non-preferred terms, related
2682 terms, etc.
2683 </para>
2685 <para>
2686 Basically a thesaurus dictionary replaces all non-preferred terms by one
2687 preferred term and, optionally, preserves the original terms for indexing
2688 as well. <productname>PostgreSQL</productname>'s current implementation of the
2689 thesaurus dictionary is an extension of the synonym dictionary with added
2690 <firstterm>phrase</firstterm> support. A thesaurus dictionary requires
2691 a configuration file of the following format:
2693 <programlisting>
2694 # this is a comment
2695 sample word(s) : indexed word(s)
2696 more sample word(s) : more indexed word(s)
2698 </programlisting>
2700 where the colon (<symbol>:</symbol>) symbol acts as a delimiter between a
2701 phrase and its replacement.
2702 </para>
2704 <para>
2705 A thesaurus dictionary uses a <firstterm>subdictionary</firstterm> (which
2706 is specified in the dictionary's configuration) to normalize the input
2707 text before checking for phrase matches. It is only possible to select one
2708 subdictionary. An error is reported if the subdictionary fails to
2709 recognize a word. In that case, you should remove the use of the word or
2710 teach the subdictionary about it. You can place an asterisk
2711 (<symbol>*</symbol>) at the beginning of an indexed word to skip applying
2712 the subdictionary to it, but all sample words <emphasis>must</emphasis> be known
2713 to the subdictionary.
2714 </para>
2716 <para>
2717 The thesaurus dictionary chooses the longest match if there are multiple
2718 phrases matching the input, and ties are broken by using the last
2719 definition.
2720 </para>
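<para>
For instance, given these (hypothetical) thesaurus entries:

<programlisting>
one two : ot
one two three : ott
</programlisting>

the input <literal>one two three</literal> is replaced by
<literal>ott</literal>, since the longer sample phrase wins.
</para>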
2722 <para>
2723 Specific stop words recognized by the subdictionary cannot be
2724 specified; instead use <literal>?</literal> to mark the location where any
2725 stop word can appear. For example, assuming that <literal>a</literal> and
2726 <literal>the</literal> are stop words according to the subdictionary:
2728 <programlisting>
2729 ? one ? two : swsw
2730 </programlisting>
2732 matches <literal>a one the two</literal> and <literal>the one a two</literal>;
2733 both would be replaced by <literal>swsw</literal>.
2734 </para>
2736 <para>
Since a thesaurus dictionary has the capability to recognize phrases, it
must remember its state and interact with the parser. A thesaurus dictionary
uses its token-type assignments to check whether it should handle the next
word or stop accumulating one. The thesaurus dictionary must be configured
2741 carefully. For example, if the thesaurus dictionary is assigned to handle
2742 only the <literal>asciiword</literal> token, then a thesaurus dictionary
2743 definition like <literal>one 7</literal> will not work since token type
2744 <literal>uint</literal> is not assigned to the thesaurus dictionary.
2745 </para>
2747 <caution>
2748 <para>
2749 Thesauruses are used during indexing so any change in the thesaurus
2750 dictionary's parameters <emphasis>requires</emphasis> reindexing.
For most other dictionary types, small changes such as adding or
removing stop words do not force reindexing.
2753 </para>
2754 </caution>
2756 <sect3 id="textsearch-thesaurus-config">
2757 <title>Thesaurus Configuration</title>
2759 <para>
2760 To define a new thesaurus dictionary, use the <literal>thesaurus</literal>
2761 template. For example:
2763 <programlisting>
2764 CREATE TEXT SEARCH DICTIONARY thesaurus_simple (
2765 TEMPLATE = thesaurus,
2766 DictFile = mythesaurus,
Dictionary = pg_catalog.english_stem
);
2769 </programlisting>
2771 Here:
2772 <itemizedlist spacing="compact" mark="bullet">
2773 <listitem>
2774 <para>
2775 <literal>thesaurus_simple</literal> is the new dictionary's name
2776 </para>
2777 </listitem>
2778 <listitem>
2779 <para>
2780 <literal>mythesaurus</literal> is the base name of the thesaurus
2781 configuration file.
2782 (Its full name will be <filename>$SHAREDIR/tsearch_data/mythesaurus.ths</filename>,
2783 where <literal>$SHAREDIR</literal> means the installation shared-data
2784 directory.)
2785 </para>
2786 </listitem>
2787 <listitem>
2788 <para>
2789 <literal>pg_catalog.english_stem</literal> is the subdictionary (here,
2790 a Snowball English stemmer) to use for thesaurus normalization.
2791 Notice that the subdictionary will have its own
2792 configuration (for example, stop words), which is not shown here.
2793 </para>
2794 </listitem>
2795 </itemizedlist>
2797 Now it is possible to bind the thesaurus dictionary <literal>thesaurus_simple</literal>
2798 to the desired token types in a configuration, for example:
2800 <programlisting>
2801 ALTER TEXT SEARCH CONFIGURATION russian
2802 ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
2803 WITH thesaurus_simple;
2804 </programlisting>
2805 </para>
2807 </sect3>
2809 <sect3 id="textsearch-thesaurus-examples">
2810 <title>Thesaurus Example</title>
2812 <para>
2813 Consider a simple astronomical thesaurus <literal>thesaurus_astro</literal>,
2814 which contains some astronomical word combinations:
2816 <programlisting>
2817 supernovae stars : sn
2818 crab nebulae : crab
2819 </programlisting>
2821 Below we create a dictionary and bind some token types to
2822 an astronomical thesaurus and English stemmer:
2824 <programlisting>
2825 CREATE TEXT SEARCH DICTIONARY thesaurus_astro (
2826 TEMPLATE = thesaurus,
2827 DictFile = thesaurus_astro,
Dictionary = english_stem
);
2831 ALTER TEXT SEARCH CONFIGURATION russian
2832 ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
2833 WITH thesaurus_astro, english_stem;
2834 </programlisting>
2836 Now we can see how it works.
2837 <function>ts_lexize</function> is not very useful for testing a thesaurus,
2838 because it treats its input as a single token. Instead we can use
2839 <function>plainto_tsquery</function> and <function>to_tsvector</function>
2840 which will break their input strings into multiple tokens:
2842 <screen>
2843 SELECT plainto_tsquery('supernova star');
2844 plainto_tsquery
2845 -----------------
2846 'sn'
2848 SELECT to_tsvector('supernova star');
2849 to_tsvector
2850 -------------
2851 'sn':1
2852 </screen>
In principle, you can use <function>to_tsquery</function> if you quote
the argument:
2857 <screen>
2858 SELECT to_tsquery('''supernova star''');
2859 to_tsquery
2860 ------------
2861 'sn'
2862 </screen>
2864 Notice that <literal>supernova star</literal> matches <literal>supernovae
2865 stars</literal> in <literal>thesaurus_astro</literal> because we specified
2866 the <literal>english_stem</literal> stemmer in the thesaurus definition.
2867 The stemmer removed the <literal>e</literal> and <literal>s</literal>.
2868 </para>
2870 <para>
2871 To index the original phrase as well as the substitute, just include it
2872 in the right-hand part of the definition:
2874 <screen>
2875 supernovae stars : sn supernovae stars
2877 SELECT plainto_tsquery('supernova star');
2878 plainto_tsquery
2879 -----------------------------
2880 'sn' &amp; 'supernova' &amp; 'star'
2881 </screen>
2882 </para>
2884 </sect3>
2886 </sect2>
2888 <sect2 id="textsearch-ispell-dictionary">
2889 <title><application>Ispell</application> Dictionary</title>
2891 <para>
2892 The <application>Ispell</application> dictionary template supports
2893 <firstterm>morphological dictionaries</firstterm>, which can normalize many
2894 different linguistic forms of a word into the same lexeme. For example,
2895 an English <application>Ispell</application> dictionary can match all declensions and
2896 conjugations of the search term <literal>bank</literal>, e.g.,
2897 <literal>banking</literal>, <literal>banked</literal>, <literal>banks</literal>,
2898 <literal>banks'</literal>, and <literal>bank's</literal>.
2899 </para>
2901 <para>
2902 The standard <productname>PostgreSQL</productname> distribution does
2903 not include any <application>Ispell</application> configuration files.
2904 Dictionaries for a large number of languages are available from <ulink
2905 url="https://www.cs.hmc.edu/~geoff/ispell.html">Ispell</ulink>.
2906 Also, some more modern dictionary file formats are supported &mdash; <ulink
2907 url="https://en.wikipedia.org/wiki/MySpell">MySpell</ulink> (OO &lt; 2.0.1)
2908 and <ulink url="https://hunspell.github.io/">Hunspell</ulink>
2909 (OO &gt;= 2.0.2). A large list of dictionaries is available on the <ulink
2910 url="https://wiki.openoffice.org/wiki/Dictionaries">OpenOffice
2911 Wiki</ulink>.
2912 </para>
2914 <para>
2915 To create an <application>Ispell</application> dictionary perform these steps:
2916 </para>
2917 <itemizedlist spacing="compact" mark="bullet">
2918 <listitem>
2919 <para>
download dictionary configuration files. <productname>OpenOffice</productname>
extension files have the <filename>.oxt</filename> extension. It is necessary
to extract the <filename>.aff</filename> and <filename>.dic</filename> files and change
their extensions to <filename>.affix</filename> and <filename>.dict</filename>. For some
dictionary files it is also necessary to convert the characters to UTF-8
encoding with commands like the following (shown for a Norwegian language dictionary):
2926 <programlisting>
2927 iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff
2928 iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic
2929 </programlisting>
2930 </para>
2931 </listitem>
2932 <listitem>
2933 <para>
2934 copy files to the <filename>$SHAREDIR/tsearch_data</filename> directory
2935 </para>
2936 </listitem>
2937 <listitem>
2938 <para>
2939 load files into PostgreSQL with the following command:
2940 <programlisting>
2941 CREATE TEXT SEARCH DICTIONARY english_hunspell (
2942 TEMPLATE = ispell,
2943 DictFile = en_us,
2944 AffFile = en_us,
2945 Stopwords = english);
2946 </programlisting>
2947 </para>
2948 </listitem>
2949 </itemizedlist>
2951 <para>
2952 Here, <literal>DictFile</literal>, <literal>AffFile</literal>, and <literal>StopWords</literal>
2953 specify the base names of the dictionary, affixes, and stop-words files.
2954 The stop-words file has the same format explained above for the
2955 <literal>simple</literal> dictionary type. The format of the other files is
2956 not specified here but is available from the above-mentioned web sites.
2957 </para>
2959 <para>
2960 Ispell dictionaries usually recognize a limited set of words, so they
2961 should be followed by another broader dictionary; for
2962 example, a Snowball dictionary, which recognizes everything.
2963 </para>
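<para>
For example, continuing the <literal>english_hunspell</literal>
definition above, ASCII words could be mapped to the Ispell dictionary
first, falling back to a Snowball stemmer:

<programlisting>
ALTER TEXT SEARCH CONFIGURATION english
ALTER MAPPING FOR asciiword
WITH english_hunspell, english_stem;
</programlisting>
</para>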
2965 <para>
2966 The <filename>.affix</filename> file of <application>Ispell</application> has the following
2967 structure:
2968 <programlisting>
2969 prefixes
2970 flag *A:
2971 . > RE # As in enter > reenter
2972 suffixes
2973 flag T:
2974 E > ST # As in late > latest
2975 [^AEIOU]Y > -Y,IEST # As in dirty > dirtiest
2976 [AEIOU]Y > EST # As in gray > grayest
2977 [^EY] > EST # As in small > smallest
2978 </programlisting>
2979 </para>
2980 <para>
2981 And the <filename>.dict</filename> file has the following structure:
2982 <programlisting>
2983 lapse/ADGRS
2984 lard/DGRS
2985 large/PRTY
2986 lark/MRS
2987 </programlisting>
2988 </para>
2990 <para>
The format of the <filename>.dict</filename> file is:
2992 <programlisting>
2993 basic_form/affix_class_name
2994 </programlisting>
2995 </para>
2997 <para>
2998 In the <filename>.affix</filename> file every affix flag is described in the
2999 following format:
3000 <programlisting>
3001 condition > [-stripping_letters,] adding_affix
3002 </programlisting>
3003 </para>
3005 <para>
Here, the condition has a format similar to that of regular expressions.
3007 It can use groupings <literal>[...]</literal> and <literal>[^...]</literal>.
3008 For example, <literal>[AEIOU]Y</literal> means that the last letter of the word
3009 is <literal>"y"</literal> and the penultimate letter is <literal>"a"</literal>,
3010 <literal>"e"</literal>, <literal>"i"</literal>, <literal>"o"</literal> or <literal>"u"</literal>.
3011 <literal>[^EY]</literal> means that the last letter is neither <literal>"e"</literal>
3012 nor <literal>"y"</literal>.
3013 </para>
3015 <para>
Ispell dictionaries support splitting compound words,
a useful feature.
3018 Notice that the affix file should specify a special flag using the
3019 <literal>compoundwords controlled</literal> statement that marks dictionary
3020 words that can participate in compound formation:
3022 <programlisting>
3023 compoundwords controlled z
3024 </programlisting>
3026 Here are some examples for the Norwegian language:
3028 <programlisting>
3029 SELECT ts_lexize('norwegian_ispell', 'overbuljongterningpakkmesterassistent');
3030 {over,buljong,terning,pakk,mester,assistent}
3031 SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
3032 {sjokoladefabrikk,sjokolade,fabrikk}
3033 </programlisting>
3034 </para>
3036 <para>
3037 The <application>MySpell</application> format is a subset of the <application>Hunspell</application> format.
3038 The <filename>.affix</filename> file of <application>Hunspell</application> has the following
3039 structure:
3040 <programlisting>
3041 PFX A Y 1
3042 PFX A 0 re .
3043 SFX T N 4
3044 SFX T 0 st e
3045 SFX T y iest [^aeiou]y
3046 SFX T 0 est [aeiou]y
3047 SFX T 0 est [^ey]
3048 </programlisting>
3049 </para>
3051 <para>
3052 The first line of an affix class is the header. The fields of the affix rules are
3053 listed after the header (a worked example follows the list):
3054 </para>
3055 <itemizedlist spacing="compact" mark="bullet">
3056 <listitem>
3057 <para>
3058 parameter name (PFX or SFX)
3059 </para>
3060 </listitem>
3061 <listitem>
3062 <para>
3063 flag (name of the affix class)
3064 </para>
3065 </listitem>
3066 <listitem>
3067 <para>
3068 stripping characters from beginning (at prefix) or end (at suffix) of the
3069 word
3070 </para>
3071 </listitem>
3072 <listitem>
3073 <para>
3074 adding affix
3075 </para>
3076 </listitem>
3077 <listitem>
3078 <para>
3079 condition that has a format similar to the format of regular expressions.
3080 </para>
3081 </listitem>
3082 </itemizedlist>
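<para>
For example, the rule <literal>SFX T y iest [^aeiou]y</literal> is the
Hunspell equivalent of the Ispell rule <literal>[^AEIOU]Y &gt; -Y,IEST</literal>
shown earlier: for a word ending in a consonant followed by
<literal>y</literal>, strip the <literal>y</literal> and add
<literal>iest</literal>, as in <literal>dirty</literal> &gt;
<literal>dirtiest</literal>.
</para>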
3084 <para>
3085 The <filename>.dict</filename> file looks like the <filename>.dict</filename> file of
3086 <application>Ispell</application>:
3087 <programlisting>
3088 larder/M
3089 lardy/RT
3090 large/RSPMYT
3091 largehearted
3092 </programlisting>
3093 </para>
3095 <note>
3096 <para>
3097 <application>MySpell</application> does not support compound words.
3098 <application>Hunspell</application> has sophisticated support for compound words. At
3099 present, <productname>PostgreSQL</productname> implements only the basic
3100 compound word operations of Hunspell.
3101 </para>
3102 </note>
3104 </sect2>
3106 <sect2 id="textsearch-snowball-dictionary">
3107 <title><application>Snowball</application> Dictionary</title>
3109 <para>
3110 The <application>Snowball</application> dictionary template is based on a project
3111 by Martin Porter, inventor of the popular Porter stemming algorithm
3112 for the English language. Snowball now provides stemming algorithms for
3113 many languages (see the <ulink url="https://snowballstem.org/">Snowball
3114 site</ulink> for more information). Each algorithm understands how to
3115 reduce common variant forms of words to a base, or stem, spelling within
3116 its language. A Snowball dictionary requires a <literal>language</literal>
3117 parameter to identify which stemmer to use, and optionally can specify a
3118 <literal>stopword</literal> file name that gives a list of words to eliminate.
3119 (<productname>PostgreSQL</productname>'s standard stopword lists are also
3120 provided by the Snowball project.)
3121 For example, there is a built-in definition equivalent to
3123 <programlisting>
3124 CREATE TEXT SEARCH DICTIONARY english_stem (
3125 TEMPLATE = snowball,
3126 Language = english,
3127 StopWords = english
3128 );
3129 </programlisting>
3131 The stopword file format is the same as already explained.
3132 </para>
3134 <para>
3135 A <application>Snowball</application> dictionary recognizes everything, whether
3136 or not it is able to simplify the word, so it should be placed
3137 at the end of the dictionary list. It is useless to have it
3138 before any other dictionary because a token will never pass through it to
3139 the next dictionary.
3140 </para>
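<para>
This also means that <function>ts_lexize</function> never returns
<literal>NULL</literal> for a Snowball dictionary; a word the stemmer
cannot simplify comes back essentially unchanged (lower-cased), for
example:
<screen>
SELECT ts_lexize('english_stem', 'SQL');
 ts_lexize
-----------
 {sql}
</screen>
</para>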
3142 </sect2>
3144 </sect1>
3146 <sect1 id="textsearch-configuration">
3147 <title>Configuration Example</title>
3149 <para>
3150 A text search configuration specifies all options necessary to transform a
3151 document into a <type>tsvector</type>: the parser to use to break text
3152 into tokens, and the dictionaries to use to transform each token into a
3153 lexeme. Every call of
3154 <function>to_tsvector</function> or <function>to_tsquery</function>
3155 needs a text search configuration to perform its processing.
3156 The configuration parameter
3157 <xref linkend="guc-default-text-search-config"/>
3158 specifies the name of the default configuration, which is the
3159 one used by text search functions if an explicit configuration
3160 parameter is omitted.
3161 It can be set in <filename>postgresql.conf</filename>, or set for an
3162 individual session using the <command>SET</command> command.
3163 </para>
3165 <para>
3166 Several predefined text search configurations are available, and
3167 you can create custom configurations easily. To facilitate management
3168 of text search objects, a set of <acronym>SQL</acronym> commands
3169 is available, and there are several <application>psql</application> commands that display information
3170 about text search objects (<xref linkend="textsearch-psql"/>).
3171 </para>
3173 <para>
3174 As an example we will create a configuration
3175 <literal>pg</literal>, starting by duplicating the built-in
3176 <literal>english</literal> configuration:
3178 <programlisting>
3179 CREATE TEXT SEARCH CONFIGURATION public.pg ( COPY = pg_catalog.english );
3180 </programlisting>
3181 </para>
3183 <para>
3184 We will use a PostgreSQL-specific synonym list
3185 and store it in <filename>$SHAREDIR/tsearch_data/pg_dict.syn</filename>.
3186 The file contents look like:
3188 <programlisting>
3189 postgres pg
3190 pgsql pg
3191 postgresql pg
3192 </programlisting>
3194 We define the synonym dictionary like this:
3196 <programlisting>
3197 CREATE TEXT SEARCH DICTIONARY pg_dict (
3198 TEMPLATE = synonym,
3199 SYNONYMS = pg_dict
3200 );
3201 </programlisting>
3203 Next we register the <productname>Ispell</productname> dictionary
3204 <literal>english_ispell</literal>, which has its own configuration files:
3206 <programlisting>
3207 CREATE TEXT SEARCH DICTIONARY english_ispell (
3208 TEMPLATE = ispell,
3209 DictFile = english,
3210 AffFile = english,
3211 StopWords = english
3212 );
3213 </programlisting>
3215 Now we can set up the mappings for words in configuration
3216 <literal>pg</literal>:
3218 <programlisting>
3219 ALTER TEXT SEARCH CONFIGURATION pg
3220 ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
3221 word, hword, hword_part
3222 WITH pg_dict, english_ispell, english_stem;
3223 </programlisting>
3225 We choose not to index or search some token types that the built-in
3226 configuration does handle:
3228 <programlisting>
3229 ALTER TEXT SEARCH CONFIGURATION pg
3230 DROP MAPPING FOR email, url, url_path, sfloat, float;
3231 </programlisting>
3232 </para>
3234 <para>
3235 Now we can test our configuration:
3237 <programlisting>
3238 SELECT * FROM ts_debug('public.pg', '
3239 PostgreSQL, the highly scalable, SQL compliant, open source object-relational
3240 database management system, is now undergoing beta testing of the next
3241 version of our software.
3242 ');
3243 </programlisting>
3244 </para>
3246 <para>
3247 The next step is to set the session to use the new configuration, which was
3248 created in the <literal>public</literal> schema:
3250 <screen>
3251 =&gt; \dF
3252 List of text search configurations
3253 Schema | Name | Description
3254 ---------+------+-------------
3255 public | pg |
3257 SET default_text_search_config = 'public.pg';
3258 SET
3260 SHOW default_text_search_config;
3261 default_text_search_config
3262 ----------------------------
3263 public.pg
3264 </screen>
3265 </para>
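<para>
With the new configuration in effect, the synonym mapping is applied
automatically by the text search functions. Assuming the
<literal>english_ispell</literal> dictionary files are installed, the
result would look roughly like this (the exact lexemes depend on the
installed files):
<screen>
SELECT to_tsvector('PostgreSQL is cool');
   to_tsvector
-----------------
 'cool':3 'pg':1
</screen>
</para>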
3267 </sect1>
3269 <sect1 id="textsearch-debugging">
3270 <title>Testing and Debugging Text Search</title>
3272 <para>
3273 The behavior of a custom text search configuration can easily become
3274 confusing. The functions described
3275 in this section are useful for testing text search objects. You can
3276 test a complete configuration, or test parsers and dictionaries separately.
3277 </para>
3279 <sect2 id="textsearch-configuration-testing">
3280 <title>Configuration Testing</title>
3282 <para>
3283 The function <function>ts_debug</function> allows easy testing of a
3284 text search configuration.
3285 </para>
3287 <indexterm>
3288 <primary>ts_debug</primary>
3289 </indexterm>
3291 <synopsis>
3292 ts_debug(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">document</replaceable> <type>text</type>,
3293 OUT <replaceable class="parameter">alias</replaceable> <type>text</type>,
3294 OUT <replaceable class="parameter">description</replaceable> <type>text</type>,
3295 OUT <replaceable class="parameter">token</replaceable> <type>text</type>,
3296 OUT <replaceable class="parameter">dictionaries</replaceable> <type>regdictionary[]</type>,
3297 OUT <replaceable class="parameter">dictionary</replaceable> <type>regdictionary</type>,
3298 OUT <replaceable class="parameter">lexemes</replaceable> <type>text[]</type>)
3299 returns setof record
3300 </synopsis>
3302 <para>
3303 <function>ts_debug</function> displays information about every token of
3304 <replaceable class="parameter">document</replaceable> as produced by the
3305 parser and processed by the configured dictionaries. It uses the
3306 configuration specified by <replaceable
3307 class="parameter">config</replaceable>,
3308 or <varname>default_text_search_config</varname> if that argument is
3309 omitted.
3310 </para>
3312 <para>
3313 <function>ts_debug</function> returns one row for each token identified in the text
3314 by the parser. The columns returned are
3316 <itemizedlist spacing="compact" mark="bullet">
3317 <listitem>
3318 <para>
3319 <replaceable>alias</replaceable> <type>text</type> &mdash; short name of the token type
3320 </para>
3321 </listitem>
3322 <listitem>
3323 <para>
3324 <replaceable>description</replaceable> <type>text</type> &mdash; description of the
3325 token type
3326 </para>
3327 </listitem>
3328 <listitem>
3329 <para>
3330 <replaceable>token</replaceable> <type>text</type> &mdash; text of the token
3331 </para>
3332 </listitem>
3333 <listitem>
3334 <para>
3335 <replaceable>dictionaries</replaceable> <type>regdictionary[]</type> &mdash; the
3336 dictionaries selected by the configuration for this token type
3337 </para>
3338 </listitem>
3339 <listitem>
3340 <para>
3341 <replaceable>dictionary</replaceable> <type>regdictionary</type> &mdash; the dictionary
3342 that recognized the token, or <literal>NULL</literal> if none did
3343 </para>
3344 </listitem>
3345 <listitem>
3346 <para>
3347 <replaceable>lexemes</replaceable> <type>text[]</type> &mdash; the lexeme(s) produced
3348 by the dictionary that recognized the token, or <literal>NULL</literal> if
3349 none did; an empty array (<literal>{}</literal>) means it was recognized as a
3350 stop word
3351 </para>
3352 </listitem>
3353 </itemizedlist>
3354 </para>
3356 <para>
3357 Here is a simple example:
3359 <screen>
3360 SELECT * FROM ts_debug('english', 'a fat cat sat on a mat - it ate a fat rats');
3361 alias | description | token | dictionaries | dictionary | lexemes
3362 -----------+-----------------+-------+----------------+--------------+---------
3363 asciiword | Word, all ASCII | a | {english_stem} | english_stem | {}
3364 blank | Space symbols | | {} | |
3365 asciiword | Word, all ASCII | fat | {english_stem} | english_stem | {fat}
3366 blank | Space symbols | | {} | |
3367 asciiword | Word, all ASCII | cat | {english_stem} | english_stem | {cat}
3368 blank | Space symbols | | {} | |
3369 asciiword | Word, all ASCII | sat | {english_stem} | english_stem | {sat}
3370 blank | Space symbols | | {} | |
3371 asciiword | Word, all ASCII | on | {english_stem} | english_stem | {}
3372 blank | Space symbols | | {} | |
3373 asciiword | Word, all ASCII | a | {english_stem} | english_stem | {}
3374 blank | Space symbols | | {} | |
3375 asciiword | Word, all ASCII | mat | {english_stem} | english_stem | {mat}
3376 blank | Space symbols | | {} | |
3377 blank | Space symbols | - | {} | |
3378 asciiword | Word, all ASCII | it | {english_stem} | english_stem | {}
3379 blank | Space symbols | | {} | |
3380 asciiword | Word, all ASCII | ate | {english_stem} | english_stem | {ate}
3381 blank | Space symbols | | {} | |
3382 asciiword | Word, all ASCII | a | {english_stem} | english_stem | {}
3383 blank | Space symbols | | {} | |
3384 asciiword | Word, all ASCII | fat | {english_stem} | english_stem | {fat}
3385 blank | Space symbols | | {} | |
3386 asciiword | Word, all ASCII | rats | {english_stem} | english_stem | {rat}
3387 </screen>
3388 </para>
3390 <para>
3391 For a more extensive demonstration, we
3392 first create a <literal>public.english</literal> configuration and
3393 Ispell dictionary for the English language:
3394 </para>
3396 <programlisting>
3397 CREATE TEXT SEARCH CONFIGURATION public.english ( COPY = pg_catalog.english );
3399 CREATE TEXT SEARCH DICTIONARY english_ispell (
3400 TEMPLATE = ispell,
3401 DictFile = english,
3402 AffFile = english,
3403 StopWords = english
3404 );
3406 ALTER TEXT SEARCH CONFIGURATION public.english
3407 ALTER MAPPING FOR asciiword WITH english_ispell, english_stem;
3408 </programlisting>
3410 <screen>
3411 SELECT * FROM ts_debug('public.english', 'The Brightest supernovaes');
3412 alias | description | token | dictionaries | dictionary | lexemes
3413 -----------+-----------------+-------------+-------------------------------+----------------+-------------
3414 asciiword | Word, all ASCII | The | {english_ispell,english_stem} | english_ispell | {}
3415 blank | Space symbols | | {} | |
3416 asciiword | Word, all ASCII | Brightest | {english_ispell,english_stem} | english_ispell | {bright}
3417 blank | Space symbols | | {} | |
3418 asciiword | Word, all ASCII | supernovaes | {english_ispell,english_stem} | english_stem | {supernova}
3419 </screen>
3421 <para>
3422 In this example, the word <literal>Brightest</literal> was recognized by the
3423 parser as an <literal>ASCII word</literal> (alias <literal>asciiword</literal>).
3424 For this token type the dictionary list is
3425 <literal>english_ispell</literal> and
3426 <literal>english_stem</literal>. The word was recognized by
3427 <literal>english_ispell</literal>, which reduced it to the noun
3428 <literal>bright</literal>. The word <literal>supernovaes</literal> is
3429 unknown to the <literal>english_ispell</literal> dictionary so it
3430 was passed to the next dictionary, and, fortunately, was recognized (in
3431 fact, <literal>english_stem</literal> is a Snowball dictionary which
3432 recognizes everything; that is why it was placed at the end of the
3433 dictionary list).
3434 </para>
3436 <para>
3437 The word <literal>The</literal> was recognized by the
3438 <literal>english_ispell</literal> dictionary as a stop word (<xref
3439 linkend="textsearch-stopwords"/>) and will not be indexed.
3440 The spaces are discarded too, since the configuration provides no
3441 dictionaries at all for them.
3442 </para>
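<para>
Consequently, <function>to_tsvector</function> with this configuration
indexes only the two content words, keeping their original positions:
<screen>
SELECT to_tsvector('public.english', 'The Brightest supernovaes');
        to_tsvector
---------------------------
 'bright':2 'supernova':3
</screen>
</para>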
3444 <para>
3445 You can reduce the width of the output by explicitly specifying which columns
3446 you want to see:
3448 <screen>
3449 SELECT alias, token, dictionary, lexemes
3450 FROM ts_debug('public.english', 'The Brightest supernovaes');
3451 alias | token | dictionary | lexemes
3452 -----------+-------------+----------------+-------------
3453 asciiword | The | english_ispell | {}
3454 blank | | |
3455 asciiword | Brightest | english_ispell | {bright}
3456 blank | | |
3457 asciiword | supernovaes | english_stem | {supernova}
3458 </screen>
3459 </para>
3461 </sect2>
3463 <sect2 id="textsearch-parser-testing">
3464 <title>Parser Testing</title>
3466 <para>
3467 The following functions allow direct testing of a text search parser.
3468 </para>
3470 <indexterm>
3471 <primary>ts_parse</primary>
3472 </indexterm>
3474 <synopsis>
3475 ts_parse(<replaceable class="parameter">parser_name</replaceable> <type>text</type>, <replaceable class="parameter">document</replaceable> <type>text</type>,
3476 OUT <replaceable class="parameter">tokid</replaceable> <type>integer</type>, OUT <replaceable class="parameter">token</replaceable> <type>text</type>) returns <type>setof record</type>
3477 ts_parse(<replaceable class="parameter">parser_oid</replaceable> <type>oid</type>, <replaceable class="parameter">document</replaceable> <type>text</type>,
3478 OUT <replaceable class="parameter">tokid</replaceable> <type>integer</type>, OUT <replaceable class="parameter">token</replaceable> <type>text</type>) returns <type>setof record</type>
3479 </synopsis>
3481 <para>
3482 <function>ts_parse</function> parses the given <replaceable>document</replaceable>
3483 and returns a series of records, one for each token produced by
3484 parsing. Each record includes a <varname>tokid</varname> showing the
3485 assigned token type and a <varname>token</varname> which is the text of the
3486 token. For example:
3488 <screen>
3489 SELECT * FROM ts_parse('default', '123 - a number');
3490 tokid | token
3491 -------+--------
3492 22 | 123
3493 12 |
3494 12 | -
3495 1 | a
3496 12 |
3497 1 | number
3498 </screen>
3499 </para>
3501 <indexterm>
3502 <primary>ts_token_type</primary>
3503 </indexterm>
3505 <synopsis>
3506 ts_token_type(<replaceable class="parameter">parser_name</replaceable> <type>text</type>, OUT <replaceable class="parameter">tokid</replaceable> <type>integer</type>,
3507 OUT <replaceable class="parameter">alias</replaceable> <type>text</type>, OUT <replaceable class="parameter">description</replaceable> <type>text</type>) returns <type>setof record</type>
3508 ts_token_type(<replaceable class="parameter">parser_oid</replaceable> <type>oid</type>, OUT <replaceable class="parameter">tokid</replaceable> <type>integer</type>,
3509 OUT <replaceable class="parameter">alias</replaceable> <type>text</type>, OUT <replaceable class="parameter">description</replaceable> <type>text</type>) returns <type>setof record</type>
3510 </synopsis>
3512 <para>
3513 <function>ts_token_type</function> returns a table which describes each type of
3514 token the specified parser can recognize. For each token type, the table
3515 gives the integer <varname>tokid</varname> that the parser uses to label a
3516 token of that type, the <varname>alias</varname> that names the token type
3517 in configuration commands, and a short <varname>description</varname>. For
3518 example:
3520 <screen>
3521 SELECT * FROM ts_token_type('default');
3522 tokid | alias | description
3523 -------+-----------------+------------------------------------------
3524 1 | asciiword | Word, all ASCII
3525 2 | word | Word, all letters
3526 3 | numword | Word, letters and digits
3527 4 | email | Email address
3528 5 | url | URL
3529 6 | host | Host
3530 7 | sfloat | Scientific notation
3531 8 | version | Version number
3532 9 | hword_numpart | Hyphenated word part, letters and digits
3533 10 | hword_part | Hyphenated word part, all letters
3534 11 | hword_asciipart | Hyphenated word part, all ASCII
3535 12 | blank | Space symbols
3536 13 | tag | XML tag
3537 14 | protocol | Protocol head
3538 15 | numhword | Hyphenated word, letters and digits
3539 16 | asciihword | Hyphenated word, all ASCII
3540 17 | hword | Hyphenated word, all letters
3541 18 | url_path | URL path
3542 19 | file | File or path name
3543 20 | float | Decimal notation
3544 21 | int | Signed integer
3545 22 | uint | Unsigned integer
3546 23 | entity | XML entity
3547 </screen>
3548 </para>
3550 </sect2>
3552 <sect2 id="textsearch-dictionary-testing">
3553 <title>Dictionary Testing</title>
3555 <para>
3556 The <function>ts_lexize</function> function facilitates dictionary testing.
3557 </para>
3559 <indexterm>
3560 <primary>ts_lexize</primary>
3561 </indexterm>
3563 <synopsis>
3564 ts_lexize(<replaceable class="parameter">dict</replaceable> <type>regdictionary</type>, <replaceable class="parameter">token</replaceable> <type>text</type>) returns <type>text[]</type>
3565 </synopsis>
3567 <para>
3568 <function>ts_lexize</function> returns an array of lexemes if the input
3569 <replaceable>token</replaceable> is known to the dictionary,
3570 or an empty array if the token
3571 is known to the dictionary but it is a stop word, or
3572 <literal>NULL</literal> if it is an unknown word.
3573 </para>
3575 <para>
3576 Examples:
3578 <screen>
3579 SELECT ts_lexize('english_stem', 'stars');
3580 ts_lexize
3581 -----------
3582 {star}
3584 SELECT ts_lexize('english_stem', 'a');
3585 ts_lexize
3586 -----------
3587 {}
3588 </screen>
3589 </para>
3591 <note>
3592 <para>
3593 The <function>ts_lexize</function> function expects a single
3594 <emphasis>token</emphasis>, not text. Here is a case
3595 where this can be confusing:
3597 <screen>
3598 SELECT ts_lexize('thesaurus_astro', 'supernovae stars') is null;
3599 ?column?
3600 ----------
3601 t
3602 </screen>
3604 The thesaurus dictionary <literal>thesaurus_astro</literal> does know the
3605 phrase <literal>supernovae stars</literal>, but <function>ts_lexize</function>
3606 fails since it does not parse the input text but treats it as a single
3607 token. Use <function>plainto_tsquery</function> or <function>to_tsvector</function> to
3608 test thesaurus dictionaries, for example:
3610 <screen>
3611 SELECT plainto_tsquery('supernovae stars');
3612 plainto_tsquery
3613 -----------------
3614 'sn'
3615 </screen>
3616 </para>
3617 </note>
3619 </sect2>
3621 </sect1>
3623 <sect1 id="textsearch-indexes">
3624 <title>Preferred Index Types for Text Search</title>
3626 <indexterm zone="textsearch-indexes">
3627 <primary>text search</primary>
3628 <secondary>indexes</secondary>
3629 </indexterm>
3631 <para>
3632 There are two kinds of indexes that can be used to speed up full text
3633 searches:
3634 <link linkend="gin"><acronym>GIN</acronym></link> and
3635 <link linkend="gist"><acronym>GiST</acronym></link>.
3636 Note that indexes are not mandatory for full text searching, but in
3637 cases where a column is searched on a regular basis, an index is
3638 usually desirable.
3639 </para>
3641 <para>
3642 To create such an index, do one of:
3644 <variablelist>
3646 <varlistentry>
3648 <term>
3649 <indexterm zone="textsearch-indexes">
3650 <primary>index</primary>
3651 <secondary>GIN</secondary>
3652 <tertiary>text search</tertiary>
3653 </indexterm>
3655 <literal>CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> USING GIN (<replaceable>column</replaceable>);</literal>
3656 </term>
3658 <listitem>
3659 <para>
3660 Creates a GIN (Generalized Inverted Index)-based index.
3661 The <replaceable>column</replaceable> must be of <type>tsvector</type> type.
3662 </para>
3663 </listitem>
3664 </varlistentry>
3666 <varlistentry>
3668 <term>
3669 <indexterm zone="textsearch-indexes">
3670 <primary>index</primary>
3671 <secondary>GiST</secondary>
3672 <tertiary>text search</tertiary>
3673 </indexterm>
3675 <literal>CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> USING GIST (<replaceable>column</replaceable> [ { DEFAULT | tsvector_ops } (siglen = <replaceable>number</replaceable>) ] );</literal>
3676 </term>
3678 <listitem>
3679 <para>
3680 Creates a GiST (Generalized Search Tree)-based index.
3681 The <replaceable>column</replaceable> can be of <type>tsvector</type> or
3682 <type>tsquery</type> type.
3683 Optional integer parameter <literal>siglen</literal> determines
3684 signature length in bytes (see below for details).
3685 </para>
3686 </listitem>
3687 </varlistentry>
3689 </variablelist>
3690 </para>
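<para>
For example, assuming a table <structname>documents</structname> with a
precomputed <type>tsvector</type> column named
<structfield>textsearch</structfield> (hypothetical names), a GIN index
could be created with:
<programlisting>
CREATE INDEX documents_textsearch_idx ON documents USING GIN (textsearch);
</programlisting>
</para>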
3692 <para>
3693 GIN indexes are the preferred text search index type. As inverted
3694 indexes, they contain an index entry for each word (lexeme), with a
3695 compressed list of matching locations. Multi-word searches can find
3696 the first match, then use the index to remove rows that are lacking
3697 additional words. GIN indexes store only the words (lexemes) of
3698 <type>tsvector</type> values, and not their weight labels. Thus a table
3699 row recheck is needed when using a query that involves weights.
3700 </para>
3702 <para>
3703 A GiST index is <firstterm>lossy</firstterm>, meaning that the index
3704 might produce false matches, and it is necessary
3705 to check the actual table row to eliminate such false matches.
3706 (<productname>PostgreSQL</productname> does this automatically when needed.)
3707 GiST indexes are lossy because each document is represented in the
3708 index by a fixed-length signature. The signature length in bytes is determined
3709 by the value of the optional integer parameter <literal>siglen</literal>.
3710 The default signature length (when <literal>siglen</literal> is not specified) is
3711 124 bytes; the maximum signature length is 2024 bytes. The signature is generated by hashing
3712 each word into a single bit in an n-bit string, with all these bits OR-ed
3713 together to produce an n-bit document signature. When two words hash to
3714 the same bit position there will be a false match. If all words in
3715 the query have matches (real or false) then the table row must be
3716 retrieved to see if the match is correct. Longer signatures lead to a more
3717 precise search (scanning a smaller fraction of the index and fewer heap
3718 pages), at the cost of a larger index.
3719 </para>
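<para>
For example, here is a sketch of a GiST index with a longer, 256-byte
signature, reusing the hypothetical table and column names from above:
<programlisting>
CREATE INDEX documents_textsearch_gist_idx ON documents
    USING GIST (textsearch tsvector_ops (siglen = 256));
</programlisting>
</para>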
3721 <para>
3722 A GiST index can be covering, i.e., use the <literal>INCLUDE</literal>
3723 clause. Included columns can have data types without any GiST operator
3724 class. Included attributes will be stored uncompressed.
3725 </para>
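<para>
A minimal sketch of such a covering index, where the included
<structfield>title</structfield> column is hypothetical:
<programlisting>
CREATE INDEX documents_textsearch_covering_idx ON documents
    USING GIST (textsearch) INCLUDE (title);
</programlisting>
</para>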
3727 <para>
3728 Lossiness causes performance degradation due to unnecessary fetches of table
3729 records that turn out to be false matches. Since random access to table
3730 records is slow, this limits the usefulness of GiST indexes. The
3731 likelihood of false matches depends on several factors, in particular the
3732 number of unique words, so using dictionaries to reduce this number is
3733 recommended.
3734 </para>
3736 <para>
3737 Note that <acronym>GIN</acronym> index build time can often be improved
3738 by increasing <xref linkend="guc-maintenance-work-mem"/>, while
3739 <acronym>GiST</acronym> index build time is not sensitive to that
3740 parameter.
3741 </para>
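<para>
For example, the parameter can be raised just for the session that builds
the index (the value shown is illustrative):
<programlisting>
SET maintenance_work_mem = '1GB';
CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> USING GIN (<replaceable>column</replaceable>);
RESET maintenance_work_mem;
</programlisting>
</para>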
3743 <para>
3744 Partitioning of big collections and the proper use of GIN and GiST indexes
3745 allow the implementation of very fast searches with online update.
3746 Partitioning can be done at the database level using table inheritance,
3747 or by distributing documents over
3748 servers and collecting external search results, e.g., via <link
3749 linkend="ddl-foreign-data">Foreign Data</link> access.
3750 The latter is possible because ranking functions use
3751 only local information.
3752 </para>
3754 </sect1>
3756 <sect1 id="textsearch-psql">
3757 <title><application>psql</application> Support</title>
3759 <para>
3760 Information about text search configuration objects can be obtained
3761 in <application>psql</application> using a set of commands:
3762 <synopsis>
3763 \dF{d,p,t}<optional>+</optional> <optional>PATTERN</optional>
3764 </synopsis>
3765 An optional <literal>+</literal> produces more details.
3766 </para>
3768 <para>
3769 The optional parameter <replaceable>PATTERN</replaceable> can be the name of
3770 a text search object, optionally schema-qualified. If
3771 <replaceable>PATTERN</replaceable> is omitted then information about all
3772 visible objects will be displayed. <replaceable>PATTERN</replaceable> can be a
3773 regular expression and can provide <emphasis>separate</emphasis> patterns
3774 for the schema and object names. The following examples illustrate this:
3776 <screen>
3777 =&gt; \dF *fulltext*
3778 List of text search configurations
3779 Schema | Name | Description
3780 --------+--------------+-------------
3781 public | fulltext_cfg |
3782 </screen>
3784 <screen>
3785 =&gt; \dF *.fulltext*
3786 List of text search configurations
3787 Schema | Name | Description
3788 ----------+--------------+-------------
3789 fulltext | fulltext_cfg |
3790 public | fulltext_cfg |
3791 </screen>
3793 The available commands are:
3794 </para>
3796 <variablelist>
3797 <varlistentry>
3798 <term><literal>\dF<optional>+</optional> <optional>PATTERN</optional></literal></term>
3799 <listitem>
3800 <para>
3801 List text search configurations (add <literal>+</literal> for more detail).
3802 <screen>
3803 =&gt; \dF russian
3804 List of text search configurations
3805 Schema | Name | Description
3806 ------------+---------+------------------------------------
3807 pg_catalog | russian | configuration for russian language
3809 =&gt; \dF+ russian
3810 Text search configuration "pg_catalog.russian"
3811 Parser: "pg_catalog.default"
3812 Token | Dictionaries
3813 -----------------+--------------
3814 asciihword | english_stem
3815 asciiword | english_stem
3816 email | simple
3817 file | simple
3818 float | simple
3819 host | simple
3820 hword | russian_stem
3821 hword_asciipart | english_stem
3822 hword_numpart | simple
3823 hword_part | russian_stem
3824 int | simple
3825 numhword | simple
3826 numword | simple
3827 sfloat | simple
3828 uint | simple
3829 url | simple
3830 url_path | simple
3831 version | simple
3832 word | russian_stem
3833 </screen>
3834 </para>
3835 </listitem>
3836 </varlistentry>
3838 <varlistentry>
3839 <term><literal>\dFd<optional>+</optional> <optional>PATTERN</optional></literal></term>
3840 <listitem>
3841 <para>
3842 List text search dictionaries (add <literal>+</literal> for more detail).
3843 <screen>
3844 =&gt; \dFd
3845 List of text search dictionaries
3846 Schema | Name | Description
3847 ------------+-----------------+-----------------------------------------------------------
3848 pg_catalog | arabic_stem | snowball stemmer for arabic language
3849 pg_catalog | armenian_stem | snowball stemmer for armenian language
3850 pg_catalog | basque_stem | snowball stemmer for basque language
3851 pg_catalog | catalan_stem | snowball stemmer for catalan language
3852 pg_catalog | danish_stem | snowball stemmer for danish language
3853 pg_catalog | dutch_stem | snowball stemmer for dutch language
3854 pg_catalog | english_stem | snowball stemmer for english language
3855 pg_catalog | finnish_stem | snowball stemmer for finnish language
3856 pg_catalog | french_stem | snowball stemmer for french language
3857 pg_catalog | german_stem | snowball stemmer for german language
3858 pg_catalog | greek_stem | snowball stemmer for greek language
3859 pg_catalog | hindi_stem | snowball stemmer for hindi language
3860 pg_catalog | hungarian_stem | snowball stemmer for hungarian language
3861 pg_catalog | indonesian_stem | snowball stemmer for indonesian language
3862 pg_catalog | irish_stem | snowball stemmer for irish language
3863 pg_catalog | italian_stem | snowball stemmer for italian language
3864 pg_catalog | lithuanian_stem | snowball stemmer for lithuanian language
3865 pg_catalog | nepali_stem | snowball stemmer for nepali language
3866 pg_catalog | norwegian_stem | snowball stemmer for norwegian language
3867 pg_catalog | portuguese_stem | snowball stemmer for portuguese language
3868 pg_catalog | romanian_stem | snowball stemmer for romanian language
3869 pg_catalog | russian_stem | snowball stemmer for russian language
3870 pg_catalog | serbian_stem | snowball stemmer for serbian language
3871 pg_catalog | simple | simple dictionary: just lower case and check for stopword
3872 pg_catalog | spanish_stem | snowball stemmer for spanish language
3873 pg_catalog | swedish_stem | snowball stemmer for swedish language
3874 pg_catalog | tamil_stem | snowball stemmer for tamil language
3875 pg_catalog | turkish_stem | snowball stemmer for turkish language
3876 pg_catalog | yiddish_stem | snowball stemmer for yiddish language
3877 </screen>
3878 </para>
3879 </listitem>
3880 </varlistentry>
3882 <varlistentry>
3883 <term><literal>\dFp<optional>+</optional> <optional>PATTERN</optional></literal></term>
3884 <listitem>
3885 <para>
3886 List text search parsers (add <literal>+</literal> for more detail).
3887 <screen>
3888 =&gt; \dFp
3889 List of text search parsers
3890 Schema | Name | Description
3891 ------------+---------+---------------------
3892 pg_catalog | default | default word parser
3893 =&gt; \dFp+
3894 Text search parser "pg_catalog.default"
3895 Method | Function | Description
3896 -----------------+----------------+-------------
3897 Start parse | prsd_start |
3898 Get next token | prsd_nexttoken |
3899 End parse | prsd_end |
3900 Get headline | prsd_headline |
3901 Get token types | prsd_lextype |
3903 Token types for parser "pg_catalog.default"
3904 Token name | Description
3905 -----------------+------------------------------------------
3906 asciihword | Hyphenated word, all ASCII
3907 asciiword | Word, all ASCII
3908 blank | Space symbols
3909 email | Email address
3910 entity | XML entity
3911 file | File or path name
3912 float | Decimal notation
3913 host | Host
3914 hword | Hyphenated word, all letters
3915 hword_asciipart | Hyphenated word part, all ASCII
3916 hword_numpart | Hyphenated word part, letters and digits
3917 hword_part | Hyphenated word part, all letters
3918 int | Signed integer
3919 numhword | Hyphenated word, letters and digits
3920 numword | Word, letters and digits
3921 protocol | Protocol head
3922 sfloat | Scientific notation
3923 tag | XML tag
3924 uint | Unsigned integer
3925 url | URL
3926 url_path | URL path
3927 version | Version number
3928 word | Word, all letters
3929 (23 rows)
3930 </screen>
3931 </para>
3932 </listitem>
3933 </varlistentry>
3935 <varlistentry>
3936 <term><literal>\dFt<optional>+</optional> <optional>PATTERN</optional></literal></term>
3937 <listitem>
3938 <para>
3939 List text search templates (add <literal>+</literal> for more detail).
3940 <screen>
3941 =&gt; \dFt
3942 List of text search templates
3943 Schema | Name | Description
3944 ------------+-----------+-----------------------------------------------------------
3945 pg_catalog | ispell | ispell dictionary
3946 pg_catalog | simple | simple dictionary: just lower case and check for stopword
3947 pg_catalog | snowball | snowball stemmer
3948 pg_catalog | synonym | synonym dictionary: replace word by its synonym
3949 pg_catalog | thesaurus | thesaurus dictionary: phrase by phrase substitution
3950 </screen>
3951 </para>
3952 </listitem>
3953 </varlistentry>
3954 </variablelist>
3956 </sect1>
3958 <sect1 id="textsearch-limitations">
3959 <title>Limitations</title>
3961 <para>
3962 The current limitations of <productname>PostgreSQL</productname>'s
3963 text search features are:
3964 <itemizedlist spacing="compact" mark="bullet">
3965 <listitem>
3966 <para>The length of each lexeme must be less than 2 kilobytes</para>
3967 </listitem>
3968 <listitem>
3969 <para>The length of a <type>tsvector</type> (lexemes + positions) must be
3970 less than 1 megabyte</para>
3971 </listitem>
3972 <listitem>
3973 <!-- TODO: number of lexemes in what? This is unclear -->
3974 <para>The number of lexemes must be less than
3975 2<superscript>64</superscript></para>
3976 </listitem>
3977 <listitem>
3978 <para>Position values in <type>tsvector</type> must be greater than 0 and
3979 no more than 16,383</para>
3980 </listitem>
3981 <listitem>
3982 <para>The match distance in a <literal>&lt;<replaceable>N</replaceable>&gt;</literal>
3983 (FOLLOWED BY) <type>tsquery</type> operator cannot be more than
3984 16,384</para>
3985 </listitem>
3986 <listitem>
3987 <para>No more than 256 positions per lexeme</para>
3988 </listitem>
3989 <listitem>
3990 <para>The number of nodes (lexemes + operators) in a <type>tsquery</type>
3991 must be less than 32,768</para>
3992 </listitem>
3993 </itemizedlist>
3994 </para>
3996 <para>
3997 For comparison, the <productname>PostgreSQL</productname> 8.1 documentation
3998 contained 10,441 unique words, a total of 335,420 words, and the most
3999 frequent word <quote>postgresql</quote> was mentioned 6,127 times in 655
4000 documents.
4001 </para>
4003 <!-- TODO we need to put a date on these numbers? -->
4004 <para>
4005 Another example &mdash; the <productname>PostgreSQL</productname> mailing
4006 list archives contained 910,989 unique words with 57,491,343 lexemes in
4007 461,020 messages.
4008 </para>
4010 </sect1>
4012 </chapter>