<!-- doc/src/sgml/textsearch.sgml -->

<chapter id="textsearch">
<title>Full Text Search</title>

<indexterm zone="textsearch">
<primary>full text search</primary>
</indexterm>

<indexterm zone="textsearch">
<primary>text search</primary>
</indexterm>

<sect1 id="textsearch-intro">
<title>Introduction</title>

<para>
Full Text Searching (or just <firstterm>text search</firstterm>) provides
the capability to identify natural-language <firstterm>documents</firstterm> that
satisfy a <firstterm>query</firstterm>, and optionally to sort them by
relevance to the query. The most common type of search
is to find all documents containing given <firstterm>query terms</firstterm>
and return them in order of their <firstterm>similarity</firstterm> to the
query. Notions of <varname>query</varname> and
<varname>similarity</varname> are very flexible and depend on the specific
application. The simplest search considers <varname>query</varname> as a
set of words and <varname>similarity</varname> as the frequency of query
words in the document.
</para>

<para>
Textual search operators have existed in databases for years.
<productname>PostgreSQL</productname> has
<literal>~</literal>, <literal>~*</literal>, <literal>LIKE</literal>, and
<literal>ILIKE</literal> operators for textual data types, but they lack
many essential properties required by modern information systems:
</para>

<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
There is no linguistic support, even for English. Regular expressions
are not sufficient because they cannot easily handle derived words, e.g.,
<literal>satisfies</literal> and <literal>satisfy</literal>. You might
miss documents that contain <literal>satisfies</literal>, although you
probably would like to find them when searching for
<literal>satisfy</literal>. It is possible to use <literal>OR</literal>
to search for multiple derived forms, but this is tedious and error-prone
(some words can have several thousand derivatives).
</para>
</listitem>

<listitem>
<para>
They provide no ordering (ranking) of search results, which makes them
ineffective when thousands of matching documents are found.
</para>
</listitem>

<listitem>
<para>
They tend to be slow because there is no index support, so they must
process all documents for every search.
</para>
</listitem>
</itemizedlist>

<para>
Full text indexing allows documents to be <emphasis>preprocessed</emphasis>
and an index saved for later rapid searching. Preprocessing includes:
</para>

<itemizedlist mark="none">
<listitem>
<para>
<emphasis>Parsing documents into <firstterm>tokens</firstterm></emphasis>. It is
useful to identify various classes of tokens, e.g., numbers, words,
complex words, email addresses, so that they can be processed
differently. In principle token classes depend on the specific
application, but for most purposes it is adequate to use a predefined
set of classes.
<productname>PostgreSQL</productname> uses a <firstterm>parser</firstterm> to
perform this step. A standard parser is provided, and custom parsers
can be created for specific needs.
</para>
</listitem>

<listitem>
<para>
<emphasis>Converting tokens into <firstterm>lexemes</firstterm></emphasis>.
A lexeme is a string, just like a token, but it has been
<firstterm>normalized</firstterm> so that different forms of the same word
are made alike. For example, normalization almost always includes
folding upper-case letters to lower-case, and often involves removal
of suffixes (such as <literal>s</literal> or <literal>es</literal> in English).
This allows searches to find variant forms of the
same word, without tediously entering all the possible variants.
Also, this step typically eliminates <firstterm>stop words</firstterm>, which
are words that are so common that they are useless for searching.
(In short, then, tokens are raw fragments of the document text, while
lexemes are words that are believed useful for indexing and searching.)
<productname>PostgreSQL</productname> uses <firstterm>dictionaries</firstterm> to
perform this step. Various standard dictionaries are provided, and
custom ones can be created for specific needs.
</para>
</listitem>

<listitem>
<para>
<emphasis>Storing preprocessed documents optimized for
searching</emphasis>. For example, each document can be represented
as a sorted array of normalized lexemes. Along with the lexemes it is
often desirable to store positional information to use for
<firstterm>proximity ranking</firstterm>, so that a document that
contains a more <quote>dense</quote> region of query words is
assigned a higher rank than one with scattered query words.
</para>
</listitem>
</itemizedlist>

<para>
Dictionaries allow fine-grained control over how tokens are normalized.
With appropriate dictionaries, you can:
</para>

<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
Define stop words that should not be indexed.
</para>
</listitem>

<listitem>
<para>
Map synonyms to a single word using <application>Ispell</application>.
</para>
</listitem>

<listitem>
<para>
Map phrases to a single word using a thesaurus.
</para>
</listitem>

<listitem>
<para>
Map different variations of a word to a canonical form using
an <application>Ispell</application> dictionary.
</para>
</listitem>

<listitem>
<para>
Map different variations of a word to a canonical form using
<application>Snowball</application> stemmer rules.
</para>
</listitem>
</itemizedlist>

<para>
A data type <type>tsvector</type> is provided for storing preprocessed
documents, along with a type <type>tsquery</type> for representing processed
queries (<xref linkend="datatype-textsearch"/>). There are many
functions and operators available for these data types
(<xref linkend="functions-textsearch"/>), the most important of which is
the match operator <literal>@@</literal>, which we introduce in
<xref linkend="textsearch-matching"/>. Full text searches can be accelerated
using indexes (<xref linkend="textsearch-indexes"/>).
</para>

<sect2 id="textsearch-document">
<title>What Is a Document?</title>

<indexterm zone="textsearch-document">
<primary>document</primary>
<secondary>text search</secondary>
</indexterm>

<para>
A <firstterm>document</firstterm> is the unit of searching in a full text search
system; for example, a magazine article or email message. The text search
engine must be able to parse documents and store associations of lexemes
(key words) with their parent document. Later, these associations are
used to search for documents that contain query words.
</para>

<para>
For searches within <productname>PostgreSQL</productname>,
a document is normally a textual field within a row of a database table,
or possibly a combination (concatenation) of such fields, perhaps stored
in several tables or obtained dynamically. In other words, a document can
be constructed from different parts for indexing and it might not be
stored anywhere as a whole. For example:

<programlisting>
SELECT title || ' ' || author || ' ' || abstract || ' ' || body AS document
FROM messages
WHERE mid = 12;

SELECT m.title || ' ' || m.author || ' ' || m.abstract || ' ' || d.body AS document
FROM messages m, docs d
WHERE m.mid = d.did AND m.mid = 12;
</programlisting>
</para>

<note>
<para>
Actually, in these example queries, <function>coalesce</function>
should be used to prevent a single <literal>NULL</literal> attribute from
causing a <literal>NULL</literal> result for the whole document.
</para>
</note>
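<para>
For example, the first of the two queries above can be made null-safe
like this:

<programlisting>
SELECT coalesce(title, '') || ' ' || coalesce(author, '') || ' ' ||
       coalesce(abstract, '') || ' ' || coalesce(body, '') AS document
FROM messages
WHERE mid = 12;
</programlisting>
</para>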
<para>
Another possibility is to store the documents as simple text files in the
file system. In this case, the database can be used to store the full text
index and to execute searches, and some unique identifier can be used to
retrieve the document from the file system. However, retrieving files
from outside the database requires superuser permissions or special
function support, so this is usually less convenient than keeping all
the data inside <productname>PostgreSQL</productname>. Also, keeping
everything inside the database allows easy access
to document metadata to assist in indexing and display.
</para>

<para>
For text search purposes, each document must be reduced to the
preprocessed <type>tsvector</type> format. Searching and ranking
are performed entirely on the <type>tsvector</type> representation
of a document &mdash; the original text need only be retrieved
when the document has been selected for display to a user.
We therefore often speak of the <type>tsvector</type> as being the
document, but of course it is only a compact representation of
the full document.
</para>
</sect2>

<sect2 id="textsearch-matching">
<title>Basic Text Matching</title>

<para>
Full text searching in <productname>PostgreSQL</productname> is based on
the match operator <literal>@@</literal>, which returns
<literal>true</literal> if a <type>tsvector</type>
(document) matches a <type>tsquery</type> (query).
It doesn't matter which data type is written first:

<programlisting>
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector @@ 'cat &amp; rat'::tsquery;
 ?column?
----------
 t

SELECT 'fat &amp; cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::tsvector;
 ?column?
----------
 f
</programlisting>
</para>
<para>
As the above example suggests, a <type>tsquery</type> is not just raw
text, any more than a <type>tsvector</type> is. A <type>tsquery</type>
contains search terms, which must be already-normalized lexemes, and
may combine multiple terms using AND, OR, NOT, and FOLLOWED BY operators.
(For syntax details see <xref linkend="datatype-tsquery"/>.) There are
functions <function>to_tsquery</function>, <function>plainto_tsquery</function>,
and <function>phraseto_tsquery</function>
that are helpful in converting user-written text into a proper
<type>tsquery</type>, primarily by normalizing words appearing in
the text. Similarly, <function>to_tsvector</function> is used to parse and
normalize a document string. So in practice a text search match would
look more like this:
<programlisting>
SELECT to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat &amp; rat');
 ?column?
----------
 t
</programlisting>

Observe that this match would not succeed if written as

<programlisting>
SELECT 'fat cats ate fat rats'::tsvector @@ to_tsquery('fat &amp; rat');
 ?column?
----------
 f
</programlisting>
since here no normalization of the word <literal>rats</literal> will occur.
The elements of a <type>tsvector</type> are lexemes, which are assumed
already normalized, so <literal>rats</literal> does not match <literal>rat</literal>.
</para>

<para>
The <literal>@@</literal> operator also
supports <type>text</type> input, allowing explicit conversion of a text
string to <type>tsvector</type> or <type>tsquery</type> to be skipped
in simple cases. The variants available are:

<programlisting>
tsvector @@ tsquery
tsquery @@ tsvector
text @@ tsquery
text @@ text
</programlisting>
</para>

<para>
The first two of these we saw already.
The form <type>text</type> <literal>@@</literal> <type>tsquery</type>
is equivalent to <literal>to_tsvector(x) @@ y</literal>.
The form <type>text</type> <literal>@@</literal> <type>text</type>
is equivalent to <literal>to_tsvector(x) @@ plainto_tsquery(y)</literal>.
</para>
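<para>
As a small illustration (assuming the default configuration is
<literal>english</literal>, so that the words are normalized before
matching), both of the following return <literal>true</literal>:

<programlisting>
SELECT 'fat cats ate fat rats' @@ to_tsquery('fat &amp; rat');   -- text @@ tsquery
SELECT 'fat cats ate fat rats' @@ 'fat rats'::text;            -- text @@ text
</programlisting>
</para>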
<para>
Within a <type>tsquery</type>, the <literal>&amp;</literal> (AND) operator
specifies that both its arguments must appear in the document to have a
match. Similarly, the <literal>|</literal> (OR) operator specifies that
at least one of its arguments must appear, while the <literal>!</literal> (NOT)
operator specifies that its argument must <emphasis>not</emphasis> appear in
order to have a match.
For example, the query <literal>fat &amp; ! rat</literal> matches documents that
contain <literal>fat</literal> but not <literal>rat</literal>.
</para>
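<para>
For example (again assuming an English-stemming configuration):

<programlisting>
SELECT to_tsvector('fat cats ate rats') @@ to_tsquery('fat &amp; !cow');   -- true
SELECT to_tsvector('fat cats ate rats') @@ to_tsquery('fat &amp; !rat');   -- false
</programlisting>
</para>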
<para>
Searching for phrases is possible with the help of
the <literal>&lt;-&gt;</literal> (FOLLOWED BY) <type>tsquery</type> operator, which
matches only if its arguments have matches that are adjacent and in the
given order. For example:

<programlisting>
SELECT to_tsvector('fatal error') @@ to_tsquery('fatal &lt;-&gt; error');
 ?column?
----------
 t

SELECT to_tsvector('error is not fatal') @@ to_tsquery('fatal &lt;-&gt; error');
 ?column?
----------
 f
</programlisting>
There is a more general version of the FOLLOWED BY operator having the
form <literal>&lt;<replaceable>N</replaceable>&gt;</literal>,
where <replaceable>N</replaceable> is an integer standing for the difference between
the positions of the matching lexemes. <literal>&lt;1&gt;</literal> is
the same as <literal>&lt;-&gt;</literal>, while <literal>&lt;2&gt;</literal>
allows exactly one other lexeme to appear between the matches, and so
on. The <literal>phraseto_tsquery</literal> function makes use of this
operator to construct a <literal>tsquery</literal> that can match a multi-word
phrase when some of the words are stop words. For example:

<programlisting>
SELECT phraseto_tsquery('cats ate rats');
       phraseto_tsquery
-------------------------------
 'cat' &lt;-&gt; 'ate' &lt;-&gt; 'rat'

SELECT phraseto_tsquery('the cats ate the rats');
       phraseto_tsquery
-------------------------------
 'cat' &lt;-&gt; 'ate' &lt;2&gt; 'rat'
</programlisting>
</para>

<para>
A special case that's sometimes useful is that <literal>&lt;0&gt;</literal>
can be used to require that two patterns match the same word.
</para>

<para>
Parentheses can be used to control nesting of the <type>tsquery</type>
operators. Without parentheses, <literal>|</literal> binds least tightly,
then <literal>&amp;</literal>, then <literal>&lt;-&gt;</literal>,
and <literal>!</literal> most tightly.
</para>
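<para>
For example, the following two queries are parsed differently, as the
comments indicate:

<programlisting>
SELECT 'fat &amp; rat | cow'::tsquery;     -- means (fat &amp; rat) | cow
SELECT 'fat &amp; (rat | cow)'::tsquery;   -- parentheses override the default
</programlisting>
</para>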
<para>
It's worth noticing that the AND/OR/NOT operators mean something subtly
different when they are within the arguments of a FOLLOWED BY operator
than when they are not, because within FOLLOWED BY the exact position of
the match is significant. For example, normally <literal>!x</literal> matches
only documents that do not contain <literal>x</literal> anywhere.
But <literal>!x &lt;-&gt; y</literal> matches <literal>y</literal> if it is not
immediately after an <literal>x</literal>; an occurrence of <literal>x</literal>
elsewhere in the document does not prevent a match. Another example is
that <literal>x &amp; y</literal> normally only requires that <literal>x</literal>
and <literal>y</literal> both appear somewhere in the document, but
<literal>(x &amp; y) &lt;-&gt; z</literal> requires <literal>x</literal>
and <literal>y</literal> to match at the same place, immediately before
a <literal>z</literal>. Thus this query behaves differently from
<literal>x &lt;-&gt; z &amp; y &lt;-&gt; z</literal>, which will match a
document containing two separate sequences <literal>x z</literal> and
<literal>y z</literal>. (This specific query is useless as written,
since <literal>x</literal> and <literal>y</literal> could not match at the same place;
but with more complex situations such as prefix-match patterns, a query
of this form could be useful.)
</para>
</sect2>

<sect2 id="textsearch-intro-configurations">
<title>Configurations</title>

<para>
The above are all simple text search examples. As mentioned before, full
text search functionality includes the ability to do many more things:
skip indexing certain words (stop words), process synonyms, and use
sophisticated parsing, e.g., parse based on more than just white space.
This functionality is controlled by <firstterm>text search
configurations</firstterm>. <productname>PostgreSQL</productname> comes with predefined
configurations for many languages, and you can easily create your own
configurations. (<application>psql</application>'s <command>\dF</command> command
shows all available configurations.)
</para>

<para>
During installation an appropriate configuration is selected and
<xref linkend="guc-default-text-search-config"/> is set accordingly
in <filename>postgresql.conf</filename>. If you are using the same text search
configuration for the entire cluster you can use the value in
<filename>postgresql.conf</filename>. To use different configurations
throughout the cluster but the same configuration within any one database,
use <command>ALTER DATABASE ... SET</command>. Otherwise, you can set
<varname>default_text_search_config</varname> in each session.
</para>
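<para>
For example, to inspect the active configuration and override it for
the current session:

<programlisting>
SHOW default_text_search_config;

SET default_text_search_config = 'pg_catalog.english';
</programlisting>
</para>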
<para>
Each text search function that depends on a configuration has an optional
<type>regconfig</type> argument, so that the configuration to use can be
specified explicitly. <varname>default_text_search_config</varname>
is used only when this argument is omitted.
</para>
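<para>
For example, these two calls are equivalent whenever
<varname>default_text_search_config</varname> is set
to <literal>english</literal>:

<programlisting>
SELECT to_tsvector('english', 'The quick brown fox');
SELECT to_tsvector('The quick brown fox');
</programlisting>
</para>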
<para>
To make it easier to build custom text search configurations, a
configuration is built up from simpler database objects.
<productname>PostgreSQL</productname>'s text search facility provides
four types of configuration-related database objects:
</para>

<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
<firstterm>Text search parsers</firstterm> break documents into tokens
and classify each token (for example, as words or numbers).
</para>
</listitem>

<listitem>
<para>
<firstterm>Text search dictionaries</firstterm> convert tokens to normalized
form and reject stop words.
</para>
</listitem>

<listitem>
<para>
<firstterm>Text search templates</firstterm> provide the functions underlying
dictionaries. (A dictionary simply specifies a template and a set
of parameters for the template.)
</para>
</listitem>

<listitem>
<para>
<firstterm>Text search configurations</firstterm> select a parser and a set
of dictionaries to use to normalize the tokens produced by the parser.
</para>
</listitem>
</itemizedlist>

<para>
Text search parsers and templates are built from low-level C functions;
therefore it requires C programming ability to develop new ones, and
superuser privileges to install one into a database. (There are examples
of add-on parsers and templates in the <filename>contrib/</filename> area of the
<productname>PostgreSQL</productname> distribution.) Since dictionaries and
configurations just parameterize and connect together some underlying
parsers and templates, no special privilege is needed to create a new
dictionary or configuration. Examples of creating custom dictionaries and
configurations appear later in this chapter.
</para>

</sect2>

</sect1>

<sect1 id="textsearch-tables">
<title>Tables and Indexes</title>

<para>
The examples in the previous section illustrated full text matching using
simple constant strings. This section shows how to search table data,
optionally using indexes.
</para>

<sect2 id="textsearch-tables-search">
<title>Searching a Table</title>

<para>
It is possible to do a full text search without an index. A simple query
to print the <structname>title</structname> of each row that contains the word
<literal>friend</literal> in its <structfield>body</structfield> field is:

<programlisting>
SELECT title
FROM pgweb
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend');
</programlisting>

This will also find related words such as <literal>friends</literal>
and <literal>friendly</literal>, since all these are reduced to the same
normalized lexeme.
</para>

<para>
The query above specifies that the <literal>english</literal> configuration
is to be used to parse and normalize the strings. Alternatively we
could omit the configuration parameters:

<programlisting>
SELECT title
FROM pgweb
WHERE to_tsvector(body) @@ to_tsquery('friend');
</programlisting>

This query will use the configuration set by <xref
linkend="guc-default-text-search-config"/>.
</para>

<para>
A more complex example is to
select the ten most recent documents that contain <literal>create</literal> and
<literal>table</literal> in the <structname>title</structname> or <structname>body</structname>:

<programlisting>
SELECT title
FROM pgweb
WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('create &amp; table')
ORDER BY last_mod_date DESC
LIMIT 10;
</programlisting>

For clarity we omitted the <function>coalesce</function> function calls
which would be needed to find rows that contain <literal>NULL</literal>
in one of the two fields.
</para>

<para>
Although these queries will work without an index, most applications
will find this approach too slow, except perhaps for occasional ad-hoc
searches. Practical use of text searching usually requires creating
an index.
</para>

</sect2>

<sect2 id="textsearch-tables-index">
<title>Creating Indexes</title>

<para>
We can create a <acronym>GIN</acronym> index (<xref
linkend="textsearch-indexes"/>) to speed up text searches:

<programlisting>
CREATE INDEX pgweb_idx ON pgweb USING GIN (to_tsvector('english', body));
</programlisting>

Notice that the 2-argument version of <function>to_tsvector</function> is
used. Only text search functions that specify a configuration name can
be used in expression indexes (<xref linkend="indexes-expressional"/>).
This is because the index contents must be unaffected by <xref
linkend="guc-default-text-search-config"/>. If they were affected, the
index contents might be inconsistent because different entries could
contain <type>tsvector</type>s that were created with different text search
configurations, and there would be no way to guess which was which. It
would be impossible to dump and restore such an index correctly.
</para>

<para>
Because the two-argument version of <function>to_tsvector</function> was
used in the index above, only a query reference that uses the 2-argument
version of <function>to_tsvector</function> with the same configuration
name will use that index. That is, <literal>WHERE
to_tsvector('english', body) @@ 'a &amp; b'</literal> can use the index,
but <literal>WHERE to_tsvector(body) @@ 'a &amp; b'</literal> cannot.
This ensures that an index will be used only with the same configuration
used to create the index entries.
</para>

<para>
It is possible to set up more complex expression indexes wherein the
configuration name is specified by another column, e.g.:

<programlisting>
CREATE INDEX pgweb_idx ON pgweb USING GIN (to_tsvector(config_name, body));
</programlisting>

where <literal>config_name</literal> is a column in the <literal>pgweb</literal>
table. This allows mixed configurations in the same index while
recording which configuration was used for each index entry. This
would be useful, for example, if the document collection contained
documents in different languages. Again,
queries that are meant to use the index must be phrased to match, e.g.,
<literal>WHERE to_tsvector(config_name, body) @@ 'a &amp; b'</literal>.
</para>

<para>
Indexes can even concatenate columns:

<programlisting>
CREATE INDEX pgweb_idx ON pgweb USING GIN (to_tsvector('english', title || ' ' || body));
</programlisting>
</para>

<para>
Another approach is to create a separate <type>tsvector</type> column
to hold the output of <function>to_tsvector</function>. To keep this
column automatically up to date with its source data, use a stored
generated column. This example is a
concatenation of <literal>title</literal> and <literal>body</literal>,
using <function>coalesce</function> to ensure that one field will still be
indexed when the other is <literal>NULL</literal>:

<programlisting>
ALTER TABLE pgweb
    ADD COLUMN textsearchable_index_col tsvector
    GENERATED ALWAYS AS (to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))) STORED;
</programlisting>

Then we create a <acronym>GIN</acronym> index to speed up the search:

<programlisting>
CREATE INDEX textsearch_idx ON pgweb USING GIN (textsearchable_index_col);
</programlisting>

Now we are ready to perform a fast full text search:

<programlisting>
SELECT title
FROM pgweb
WHERE textsearchable_index_col @@ to_tsquery('create &amp; table')
ORDER BY last_mod_date DESC
LIMIT 10;
</programlisting>
</para>

<para>
One advantage of the separate-column approach over an expression index
is that it is not necessary to explicitly specify the text search
configuration in queries in order to make use of the index. As shown
in the example above, the query can depend on
<varname>default_text_search_config</varname>. Another advantage is that
searches will be faster, since it will not be necessary to redo the
<function>to_tsvector</function> calls to verify index matches. (This is more
important when using a GiST index than a GIN index; see <xref
linkend="textsearch-indexes"/>.) The expression-index approach is
simpler to set up, however, and it requires less disk space since the
<type>tsvector</type> representation is not stored explicitly.
</para>

</sect2>
</sect1>

<sect1 id="textsearch-controls">
<title>Controlling Text Search</title>

<para>
To implement full text searching there must be a function to create a
<type>tsvector</type> from a document and a <type>tsquery</type> from a
user query. Also, we need to return results in a useful order, so we need
a function that compares documents with respect to their relevance to
the query. It's also important to be able to display the results nicely.
<productname>PostgreSQL</productname> provides support for all of these
functions.
</para>

<sect2 id="textsearch-parsing-documents">
<title>Parsing Documents</title>

<para>
<productname>PostgreSQL</productname> provides the
function <function>to_tsvector</function> for converting a document to
the <type>tsvector</type> data type.
</para>

<indexterm>
<primary>to_tsvector</primary>
</indexterm>

<synopsis>
to_tsvector(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">document</replaceable> <type>text</type>) returns <type>tsvector</type>
</synopsis>

<para>
<function>to_tsvector</function> parses a textual document into tokens,
reduces the tokens to lexemes, and returns a <type>tsvector</type> which
lists the lexemes together with their positions in the document.
The document is processed according to the specified or default
text search configuration.
Here is a simple example:

<screen>
SELECT to_tsvector('english', 'a fat cat sat on a mat - it ate a fat rats');
                  to_tsvector
-----------------------------------------------------
 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
</screen>
</para>

<para>
In the example above we see that the resulting <type>tsvector</type> does not
contain the words <literal>a</literal>, <literal>on</literal>, or
<literal>it</literal>, the word <literal>rats</literal> became
<literal>rat</literal>, and the punctuation sign <literal>-</literal> was
ignored.
</para>

<para>
The <function>to_tsvector</function> function internally calls a parser
which breaks the document text into tokens and assigns a type to
each token. For each token, a list of
dictionaries (<xref linkend="textsearch-dictionaries"/>) is consulted,
where the list can vary depending on the token type. The first dictionary
that <firstterm>recognizes</firstterm> the token emits one or more normalized
<firstterm>lexemes</firstterm> to represent the token. For example,
<literal>rats</literal> became <literal>rat</literal> because one of the
dictionaries recognized that the word <literal>rats</literal> is a plural
form of <literal>rat</literal>. Some words are recognized as
<firstterm>stop words</firstterm> (<xref linkend="textsearch-stopwords"/>), which
causes them to be ignored since they occur too frequently to be useful in
searching. In our example these are
<literal>a</literal>, <literal>on</literal>, and <literal>it</literal>.
If no dictionary in the list recognizes the token then it is also ignored.
In this example that happened to the punctuation sign <literal>-</literal>
because there are in fact no dictionaries assigned for its token type
(<literal>Space symbols</literal>), meaning space tokens will never be
indexed. The choices of parser, dictionaries and which types of tokens to
index are determined by the selected text search configuration (<xref
linkend="textsearch-configuration"/>). It is possible to have
many different configurations in the same database, and predefined
configurations are available for various languages. In our example
we used the default configuration <literal>english</literal> for the
English language.
</para>

<para>
The function <function>setweight</function> can be used to label the
entries of a <type>tsvector</type> with a given <firstterm>weight</firstterm>,
where a weight is one of the letters <literal>A</literal>, <literal>B</literal>,
<literal>C</literal>, or <literal>D</literal>.
This is typically used to mark entries coming from
different parts of a document, such as title versus body. Later, this
information can be used for ranking of search results.
</para>

<para>
Because <function>to_tsvector</function>(<literal>NULL</literal>) will
return <literal>NULL</literal>, it is recommended to use
<function>coalesce</function> whenever a field might be null.
Here is the recommended method for creating
a <type>tsvector</type> from a structured document:

<programlisting>
UPDATE tt SET ti =
    setweight(to_tsvector(coalesce(title,'')), 'A')    ||
    setweight(to_tsvector(coalesce(keyword,'')), 'B')  ||
    setweight(to_tsvector(coalesce(abstract,'')), 'C') ||
    setweight(to_tsvector(coalesce(body,'')), 'D');
</programlisting>

Here we have used <function>setweight</function> to label the source
of each lexeme in the finished <type>tsvector</type>, and then merged
the labeled <type>tsvector</type> values using the <type>tsvector</type>
concatenation operator <literal>||</literal>. (<xref
linkend="textsearch-manipulate-tsvector"/> gives details about these
operations.)
</para>

</sect2>
<sect2 id="textsearch-parsing-queries">
<title>Parsing Queries</title>

<para>
<productname>PostgreSQL</productname> provides the
functions <function>to_tsquery</function>,
<function>plainto_tsquery</function>,
<function>phraseto_tsquery</function> and
<function>websearch_to_tsquery</function>
for converting a query to the <type>tsquery</type> data type.
<function>to_tsquery</function> offers access to more features
than either <function>plainto_tsquery</function> or
<function>phraseto_tsquery</function>, but it is less forgiving about its
input. <function>websearch_to_tsquery</function> is a simplified version
of <function>to_tsquery</function> with an alternative syntax, similar
to the one used by web search engines.
</para>

<indexterm>
<primary>to_tsquery</primary>
</indexterm>

<synopsis>
to_tsquery(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">querytext</replaceable> <type>text</type>) returns <type>tsquery</type>
</synopsis>

<para>
<function>to_tsquery</function> creates a <type>tsquery</type> value from
<replaceable>querytext</replaceable>, which must consist of single tokens
separated by the <type>tsquery</type> operators <literal>&amp;</literal> (AND),
<literal>|</literal> (OR), <literal>!</literal> (NOT), and
<literal>&lt;-&gt;</literal> (FOLLOWED BY), possibly grouped
using parentheses. In other words, the input to
<function>to_tsquery</function> must already follow the general rules for
<type>tsquery</type> input, as described in <xref
linkend="datatype-tsquery"/>. The difference is that while basic
<type>tsquery</type> input takes the tokens at face value,
<function>to_tsquery</function> normalizes each token into a lexeme using
the specified or default configuration, and discards any tokens that are
stop words according to the configuration. For example:

<screen>
SELECT to_tsquery('english', 'The &amp; Fat &amp; Rats');
  to_tsquery
---------------
 'fat' &amp; 'rat'
</screen>

As in basic <type>tsquery</type> input, weight(s) can be attached to each
lexeme to restrict it to match only <type>tsvector</type> lexemes of those
weight(s). For example:

<screen>
SELECT to_tsquery('english', 'Fat | Rats:AB');
    to_tsquery
------------------
 'fat' | 'rat':AB
</screen>

Also, <literal>*</literal> can be attached to a lexeme to specify prefix matching:

<screen>
SELECT to_tsquery('supern:*A &amp; star:A*B');
        to_tsquery
--------------------------
 'supern':*A &amp; 'star':*AB
</screen>

Such a lexeme will match any word in a <type>tsvector</type> that begins
with the given string.
</para>

<para>
<function>to_tsquery</function> can also accept single-quoted
phrases. This is primarily useful when the configuration includes a
thesaurus dictionary that may trigger on such phrases.
In the example below, a thesaurus contains the rule <literal>supernovae
stars : sn</literal>:

<screen>
SELECT to_tsquery('''supernovae stars'' &amp; !crab');
  to_tsquery
---------------
 'sn' &amp; !'crab'
</screen>

Without quotes, <function>to_tsquery</function> will generate a syntax
error for tokens that are not separated by an AND, OR, or FOLLOWED BY
operator.
</para>
<indexterm>
<primary>plainto_tsquery</primary>
</indexterm>

<synopsis>
plainto_tsquery(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">querytext</replaceable> <type>text</type>) returns <type>tsquery</type>
</synopsis>

<para>
<function>plainto_tsquery</function> transforms the unformatted text
<replaceable>querytext</replaceable> to a <type>tsquery</type> value.
The text is parsed and normalized much as for <function>to_tsvector</function>,
then the <literal>&amp;</literal> (AND) <type>tsquery</type> operator is
inserted between surviving words.
</para>

<para>
Example:

<screen>
SELECT plainto_tsquery('english', 'The Fat Rats');
 plainto_tsquery
-----------------
 'fat' &amp; 'rat'
</screen>

Note that <function>plainto_tsquery</function> will not
recognize <type>tsquery</type> operators, weight labels,
or prefix-match labels in its input:

<screen>
SELECT plainto_tsquery('english', 'The Fat &amp; Rats:C');
   plainto_tsquery
---------------------
 'fat' &amp; 'rat' &amp; 'c'
</screen>

Here, all the input punctuation was discarded.
</para>

<indexterm>
<primary>phraseto_tsquery</primary>
</indexterm>

<synopsis>
phraseto_tsquery(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">querytext</replaceable> <type>text</type>) returns <type>tsquery</type>
</synopsis>

<para>
<function>phraseto_tsquery</function> behaves much like
<function>plainto_tsquery</function>, except that it inserts
the <literal>&lt;-&gt;</literal> (FOLLOWED BY) operator between
surviving words instead of the <literal>&amp;</literal> (AND) operator.
Also, stop words are not simply discarded, but are accounted for by
inserting <literal>&lt;<replaceable>N</replaceable>&gt;</literal> operators rather
than <literal>&lt;-&gt;</literal> operators. This function is useful
when searching for exact lexeme sequences, since the FOLLOWED BY
operators check lexeme order not just the presence of all the lexemes.
</para>

<para>
Example:

<screen>
SELECT phraseto_tsquery('english', 'The Fat Rats');
 phraseto_tsquery
------------------
 'fat' &lt;-&gt; 'rat'
</screen>

Like <function>plainto_tsquery</function>, the
<function>phraseto_tsquery</function> function will not
recognize <type>tsquery</type> operators, weight labels,
or prefix-match labels in its input:

<screen>
SELECT phraseto_tsquery('english', 'The Fat &amp; Rats:C');
      phraseto_tsquery
-----------------------------
 'fat' &lt;-&gt; 'rat' &lt;-&gt; 'c'
</screen>
</para>
<synopsis>
websearch_to_tsquery(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">querytext</replaceable> <type>text</type>) returns <type>tsquery</type>
</synopsis>

<para>
<function>websearch_to_tsquery</function> creates a <type>tsquery</type>
value from <replaceable>querytext</replaceable> using an alternative
syntax in which simple unformatted text is a valid query.
Unlike <function>plainto_tsquery</function>
and <function>phraseto_tsquery</function>, it also recognizes certain
operators. Moreover, this function will never raise syntax errors,
which makes it possible to use raw user-supplied input for search.
The following syntax is supported:

<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
<literal>unquoted text</literal>: text not inside quote marks will be
converted to terms separated by <literal>&amp;</literal> operators, as
if processed by <function>plainto_tsquery</function>.
</para>
</listitem>
<listitem>
<para>
<literal>"quoted text"</literal>: text inside quote marks will be
converted to terms separated by <literal>&lt;-&gt;</literal>
operators, as if processed by <function>phraseto_tsquery</function>.
</para>
</listitem>
<listitem>
<para>
<literal>OR</literal>: the word <quote>or</quote> will be converted to
the <literal>|</literal> operator.
</para>
</listitem>
<listitem>
<para>
<literal>-</literal>: a dash will be converted to
the <literal>!</literal> operator.
</para>
</listitem>
</itemizedlist>

Other punctuation is ignored. So
like <function>plainto_tsquery</function>
and <function>phraseto_tsquery</function>,
the <function>websearch_to_tsquery</function> function will not
recognize <type>tsquery</type> operators, weight labels, or prefix-match
labels in its input.
</para>

<para>
Examples:
<screen>
SELECT websearch_to_tsquery('english', 'The fat rats');
 websearch_to_tsquery
----------------------
 'fat' &amp; 'rat'
(1 row)

SELECT websearch_to_tsquery('english', '"supernovae stars" -crab');
       websearch_to_tsquery
----------------------------------
 'supernova' &lt;-&gt; 'star' &amp; !'crab'
(1 row)

SELECT websearch_to_tsquery('english', '"sad cat" or "fat rat"');
       websearch_to_tsquery
-----------------------------------
 'sad' &lt;-&gt; 'cat' | 'fat' &lt;-&gt; 'rat'
(1 row)

SELECT websearch_to_tsquery('english', 'signal -"segmentation fault"');
         websearch_to_tsquery
---------------------------------------
 'signal' &amp; !( 'segment' &lt;-&gt; 'fault' )
(1 row)

SELECT websearch_to_tsquery('english', '""" )( dummy \\ query &lt;-&gt;');
 websearch_to_tsquery
----------------------
 'dummi' &amp; 'queri'
(1 row)
</screen>
</para>
</sect2>
<sect2 id="textsearch-ranking">
<title>Ranking Search Results</title>

<para>
Ranking attempts to measure how relevant documents are to a particular
query, so that when there are many matches the most relevant ones can be
shown first. <productname>PostgreSQL</productname> provides two
predefined ranking functions, which take into account lexical, proximity,
and structural information; that is, they consider how often the query
terms appear in the document, how close together the terms are in the
document, and how important is the part of the document where they occur.
However, the concept of relevancy is vague and very application-specific.
Different applications might require additional information for ranking,
e.g., document modification time. The built-in ranking functions are only
examples. You can write your own ranking functions and/or combine their
results with additional factors to fit your specific needs.
</para>

<para>
The two ranking functions currently available are:

<variablelist>

<varlistentry>

<term>
<indexterm>
<primary>ts_rank</primary>
</indexterm>

<literal>ts_rank(<optional> <replaceable class="parameter">weights</replaceable> <type>float4[]</type>, </optional> <replaceable class="parameter">vector</replaceable> <type>tsvector</type>, <replaceable class="parameter">query</replaceable> <type>tsquery</type> <optional>, <replaceable class="parameter">normalization</replaceable> <type>integer</type> </optional>) returns <type>float4</type></literal>
</term>

<listitem>
<para>
Ranks vectors based on the frequency of their matching lexemes.
</para>
</listitem>
</varlistentry>

<varlistentry>

<term>
<indexterm>
<primary>ts_rank_cd</primary>
</indexterm>

<literal>ts_rank_cd(<optional> <replaceable class="parameter">weights</replaceable> <type>float4[]</type>, </optional> <replaceable class="parameter">vector</replaceable> <type>tsvector</type>, <replaceable class="parameter">query</replaceable> <type>tsquery</type> <optional>, <replaceable class="parameter">normalization</replaceable> <type>integer</type> </optional>) returns <type>float4</type></literal>
</term>

<listitem>
<para>
This function computes the <firstterm>cover density</firstterm>
ranking for the given document vector and query, as described in
Clarke, Cormack, and Tudhope's "Relevance Ranking for One to Three
Term Queries" in the journal "Information Processing and Management",
1999. Cover density is similar to <function>ts_rank</function> ranking
except that the proximity of matching lexemes to each other is
taken into consideration.
</para>

<para>
This function requires lexeme positional information to perform
its calculation. Therefore, it ignores any <quote>stripped</quote>
lexemes in the <type>tsvector</type>. If there are no unstripped
lexemes in the input, the result will be zero. (See <xref
linkend="textsearch-manipulate-tsvector"/> for more information
about the <function>strip</function> function and positional information
in <type>tsvector</type>s.)
</para>
</listitem>
</varlistentry>

</variablelist>
</para>
<para>
For both these functions,
the optional <replaceable class="parameter">weights</replaceable>
argument offers the ability to weigh word instances more or less
heavily depending on how they are labeled. The weight arrays specify
how heavily to weigh each category of word, in the order:

<synopsis>
{D-weight, C-weight, B-weight, A-weight}
</synopsis>

If no <replaceable class="parameter">weights</replaceable> are provided,
then these defaults are used:

<programlisting>
{0.1, 0.2, 0.4, 1.0}
</programlisting>

Typically weights are used to mark words from special areas of the
document, like the title or an initial abstract, so they can be
treated with more or less importance than words in the document body.
</para>
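<para>
For example, this sketch passes a custom weight array that doubles the
default <literal>A</literal> weight, so that matches in the
<literal>A</literal>-labeled title count twice as heavily as they
would by default (the document here is built inline purely for
illustration):

<programlisting>
SELECT ts_rank('{0.1, 0.2, 0.4, 2.0}',
               setweight(to_tsvector('english', 'The Dark Matter Problem'), 'A') ||
               to_tsvector('english', 'some body text about galaxies'),
               to_tsquery('english', 'dark &amp; matter'));
</programlisting>
</para>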
<para>
Since a longer document has a greater chance of containing a query term
it is reasonable to take into account document size, e.g., a hundred-word
document with five instances of a search word is probably more relevant
than a thousand-word document with five instances. Both ranking functions
take an integer <replaceable>normalization</replaceable> option that
specifies whether and how a document's length should impact its rank.
The integer option controls several behaviors, so it is a bit mask:
you can specify one or more behaviors using
<literal>|</literal> (for example, <literal>2|4</literal>).

<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
0 (the default) ignores the document length
</para>
</listitem>
<listitem>
<para>
1 divides the rank by 1 + the logarithm of the document length
</para>
</listitem>
<listitem>
<para>
2 divides the rank by the document length
</para>
</listitem>
<listitem>
<para>
4 divides the rank by the mean harmonic distance between extents
(this is implemented only by <function>ts_rank_cd</function>)
</para>
</listitem>
<listitem>
<para>
8 divides the rank by the number of unique words in the document
</para>
</listitem>
<listitem>
<para>
16 divides the rank by 1 + the logarithm of the number
of unique words in the document
</para>
</listitem>
<listitem>
<para>
32 divides the rank by itself + 1
</para>
</listitem>
</itemizedlist>

If more than one flag bit is specified, the transformations are
applied in the order listed.
</para>

<para>
It is important to note that the ranking functions do not use any global
information, so it is impossible to produce a fair normalization to 1% or
100% as sometimes desired. Normalization option 32
(<literal>rank/(rank+1)</literal>) can be applied to scale all ranks
into the range zero to one, but of course this is just a cosmetic change;
it will not affect the ordering of the search results.
</para>

<para>
Here is an example that selects only the ten highest-ranked matches:

<screen>
SELECT title, ts_rank_cd(textsearch, query) AS rank
FROM apod, to_tsquery('neutrino|(dark &amp; matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
                     title                     |   rank
-----------------------------------------------+----------
 Neutrinos in the Sun                          |      3.1
 The Sudbury Neutrino Detector                 |      2.4
 A MACHO View of Galactic Dark Matter          |  2.01317
 Hot Gas and Dark Matter                       |  1.91171
 The Virgo Cluster: Hot Plasma and Dark Matter |  1.90953
 Rafting for Solar Neutrinos                   |      1.9
 NGC 4650A: Strange Galaxy and Dark Matter     |  1.85774
 Hot Gas and Dark Matter                       |   1.6123
 Ice Fishing for Cosmic Neutrinos              |      1.6
 Weak Lensing Distorts the Universe            | 0.818218
</screen>

This is the same example using normalized ranking:

<screen>
SELECT title, ts_rank_cd(textsearch, query, 32 /* rank/(rank+1) */ ) AS rank
FROM apod, to_tsquery('neutrino|(dark &amp; matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
                     title                     |        rank
-----------------------------------------------+-------------------
 Neutrinos in the Sun                          | 0.756097569485493
 The Sudbury Neutrino Detector                 | 0.705882361190954
 A MACHO View of Galactic Dark Matter          | 0.668123210574724
 Hot Gas and Dark Matter                       |  0.65655958650282
 The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973
 Rafting for Solar Neutrinos                   | 0.655172410958162
 NGC 4650A: Strange Galaxy and Dark Matter     | 0.650072921219637
 Hot Gas and Dark Matter                       | 0.617195790024749
 Ice Fishing for Cosmic Neutrinos              | 0.615384618911517
 Weak Lensing Distorts the Universe            | 0.450010798361481
</screen>
</para>

<para>
Ranking can be expensive since it requires consulting the
<type>tsvector</type> of each matching document, which can be I/O bound and
therefore slow. Unfortunately, it is almost impossible to avoid since
practical queries often result in large numbers of matches.
</para>

</sect2>
<sect2 id="textsearch-headline">
<title>Highlighting Results</title>

<para>
To present search results it is ideal to show a part of each document and
how it is related to the query. Usually, search engines show fragments of
the document with marked search terms. <productname>PostgreSQL</productname>
provides a function <function>ts_headline</function> that
implements this functionality.
</para>

<indexterm>
<primary>ts_headline</primary>
</indexterm>

<synopsis>
ts_headline(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">document</replaceable> <type>text</type>, <replaceable class="parameter">query</replaceable> <type>tsquery</type> <optional>, <replaceable class="parameter">options</replaceable> <type>text</type> </optional>) returns <type>text</type>
</synopsis>

<para>
<function>ts_headline</function> accepts a document along
with a query, and returns an excerpt from
the document in which terms from the query are highlighted.
Specifically, the function will use the query to select relevant
text fragments, and then highlight all words that appear in the query,
even if those word positions do not match the query's restrictions. The
configuration to be used to parse the document can be specified by
<replaceable>config</replaceable>; if <replaceable>config</replaceable>
is omitted, the
<varname>default_text_search_config</varname> configuration is used.
</para>

<para>
If an <replaceable>options</replaceable> string is specified it must
consist of a comma-separated list of one or more
<replaceable>option</replaceable><literal>=</literal><replaceable>value</replaceable> pairs.
The available options are:

<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
<literal>MaxWords</literal>, <literal>MinWords</literal> (integers):
these numbers determine the longest and shortest headlines to output.
The default values are 35 and 15.
</para>
</listitem>
<listitem>
<para>
<literal>ShortWord</literal> (integer): words of this length or less
will be dropped at the start and end of a headline, unless they are
query terms. The default value of three eliminates common English
articles.
</para>
</listitem>
<listitem>
<para>
<literal>HighlightAll</literal> (boolean): if
<literal>true</literal> the whole document will be used as the
headline, ignoring the preceding three parameters. The default
is <literal>false</literal>.
</para>
</listitem>
<listitem>
<para>
<literal>MaxFragments</literal> (integer): maximum number of text
fragments to display. The default value of zero selects a
non-fragment-based headline generation method. A value greater
than zero selects fragment-based headline generation (see below).
</para>
</listitem>
<listitem>
<para>
<literal>StartSel</literal>, <literal>StopSel</literal> (strings):
the strings with which to delimit query words appearing in the
document, to distinguish them from other excerpted words. The
default values are <quote><literal>&lt;b&gt;</literal></quote> and
<quote><literal>&lt;/b&gt;</literal></quote>, which can be suitable
for HTML output.
</para>
</listitem>
<listitem>
<para>
<literal>FragmentDelimiter</literal> (string): When more than one
fragment is displayed, the fragments will be separated by this string.
The default is <quote><literal> ... </literal></quote>.
</para>
</listitem>
</itemizedlist>

These option names are recognized case-insensitively.
You must double-quote string values if they contain spaces or commas.
</para>

<para>
In non-fragment-based headline
generation, <function>ts_headline</function> locates matches for the
given <replaceable class="parameter">query</replaceable> and chooses a
single one to display, preferring matches that have more query words
within the allowed headline length.
In fragment-based headline generation, <function>ts_headline</function>
locates the query matches and splits each match
into <quote>fragments</quote> of no more than <literal>MaxWords</literal>
words each, preferring fragments with more query words, and when
possible <quote>stretching</quote> fragments to include surrounding
words. The fragment-based mode is thus more useful when the query
matches span large sections of the document, or when it's desirable to
display multiple matches.
In either mode, if no query matches can be identified, then a single
fragment of the first <literal>MinWords</literal> words in the document
will be displayed.
</para>

<para>
For example:

<screen>
SELECT ts_headline('english',
  'The most common type of search
is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
  to_tsquery('english', 'query &amp; similarity'));
                        ts_headline
------------------------------------------------------------
 containing given &lt;b&gt;query&lt;/b&gt; terms                       +
 and return them in order of their &lt;b&gt;similarity&lt;/b&gt; to the+
 &lt;b&gt;query&lt;/b&gt;.

SELECT ts_headline('english',
  'Search terms may occur
many times in a document,
requiring ranking of the search matches to decide which
occurrences to display in the result.',
  to_tsquery('english', 'search &amp; term'),
  'MaxFragments=10, MaxWords=7, MinWords=3, StartSel=&lt;&lt;, StopSel=&gt;&gt;');
                        ts_headline
------------------------------------------------------------
 &lt;&lt;Search&gt;&gt; &lt;&lt;terms&gt;&gt; may occur                            +
 many times ... ranking of the &lt;&lt;search&gt;&gt; matches to decide
</screen>
</para>

<para>
<function>ts_headline</function> uses the original document, not a
<type>tsvector</type> summary, so it can be slow and should be used with
care.
</para>
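<para>
One common way to limit the cost is to rank and select the matching
rows first, and run <function>ts_headline</function> over only that
small result set. A sketch, reusing the <literal>apod</literal> table
from the ranking examples and assuming it also has <literal>id</literal>
and <literal>body</literal> columns:

<programlisting>
SELECT id, ts_headline(body, q), rank
FROM (SELECT id, body, q, ts_rank_cd(textsearch, q) AS rank
      FROM apod, to_tsquery('neutrino') q
      WHERE q @@ textsearch
      ORDER BY rank DESC
      LIMIT 10) AS foo;
</programlisting>
</para>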
1416 </sect2>
1418 </sect1>
1420 <sect1 id="textsearch-features">
1421 <title>Additional Features</title>
1423 <para>
1424 This section describes additional functions and operators that are
1425 useful in connection with text search.
1426 </para>
1428 <sect2 id="textsearch-manipulate-tsvector">
1429 <title>Manipulating Documents</title>
1431 <para>
1432 <xref linkend="textsearch-parsing-documents"/> showed how raw textual
1433 documents can be converted into <type>tsvector</type> values.
1434 <productname>PostgreSQL</productname> also provides functions and
1435 operators that can be used to manipulate documents that are already
1436 in <type>tsvector</type> form.
1437 </para>
1439 <variablelist>
1441 <varlistentry>
1443 <term>
1444 <indexterm>
1445 <primary>tsvector concatenation</primary>
1446 </indexterm>
1448 <literal><type>tsvector</type> || <type>tsvector</type></literal>
1449 </term>
1451 <listitem>
1452 <para>
1453 The <type>tsvector</type> concatenation operator
1454 returns a vector which combines the lexemes and positional information
1455 of the two vectors given as arguments. Positions and weight labels
1456 are retained during the concatenation.
1457 Positions appearing in the right-hand vector are offset by the largest
1458 position mentioned in the left-hand vector, so that the result is
1459 nearly equivalent to the result of performing <function>to_tsvector</function>
1460 on the concatenation of the two original document strings. (The
1461 equivalence is not exact, because any stop-words removed from the
1462 end of the left-hand argument will not affect the result, whereas
1463 they would have affected the positions of the lexemes in the
1464 right-hand argument if textual concatenation were used.)
1465 </para>
1467 <para>
1468 One advantage of using concatenation in the vector form, rather than
1469 concatenating text before applying <function>to_tsvector</function>, is that
1470 you can use different configurations to parse different sections
1471 of the document. Also, because the <function>setweight</function> function
1472 marks all lexemes of the given vector the same way, it is necessary
1473 to parse the text and do <function>setweight</function> before concatenating
1474 if you want to label different parts of the document with different
1475 weights.
1476 </para>
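<para>
For example (assuming the <literal>english</literal> configuration;
the right-hand vector's positions are offset by 2, the largest
position in the left-hand vector):

<screen>
SELECT to_tsvector('english', 'fat cat') || to_tsvector('english', 'ate rat');
?column?
---------------------------------
'ate':3 'cat':2 'fat':1 'rat':4
</screen>
</para>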
1477 </listitem>
1478 </varlistentry>
1480 <varlistentry>
1482 <term>
1483 <indexterm>
1484 <primary>setweight</primary>
1485 </indexterm>
1487 <literal>setweight(<replaceable class="parameter">vector</replaceable> <type>tsvector</type>, <replaceable class="parameter">weight</replaceable> <type>"char"</type>) returns <type>tsvector</type></literal>
1488 </term>
1490 <listitem>
1491 <para>
1492 <function>setweight</function> returns a copy of the input vector in which every
1493 position has been labeled with the given <replaceable>weight</replaceable>, either
1494 <literal>A</literal>, <literal>B</literal>, <literal>C</literal>, or
1495 <literal>D</literal>. (<literal>D</literal> is the default for new
1496 vectors and as such is not displayed on output.) These labels are
1497 retained when vectors are concatenated, allowing words from different
1498 parts of a document to be weighted differently by ranking functions.
1499 </para>
1501 <para>
1502 Note that weight labels apply to <emphasis>positions</emphasis>, not
1503 <emphasis>lexemes</emphasis>. If the input vector has been stripped of
1504 positions then <function>setweight</function> does nothing.
1505 </para>
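<para>
For example (again assuming the <literal>english</literal>
configuration):

<screen>
SELECT setweight(to_tsvector('english', 'fat cat'), 'A');
setweight
-------------------
'cat':2A 'fat':1A
</screen>
</para>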
1506 </listitem>
1507 </varlistentry>
1509 <varlistentry>
1510 <term>
1511 <indexterm>
1512 <primary>length(tsvector)</primary>
1513 </indexterm>
1515 <literal>length(<replaceable class="parameter">vector</replaceable> <type>tsvector</type>) returns <type>integer</type></literal>
1516 </term>
1518 <listitem>
1519 <para>
1520 Returns the number of lexemes stored in the vector.
1521 </para>
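<para>
For example:

<screen>
SELECT length('fat:2,4 cat:3 rat:5A'::tsvector);
length
--------
3
</screen>
</para>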
1522 </listitem>
1523 </varlistentry>
1525 <varlistentry>
1527 <term>
1528 <indexterm>
1529 <primary>strip</primary>
1530 </indexterm>
1532 <literal>strip(<replaceable class="parameter">vector</replaceable> <type>tsvector</type>) returns <type>tsvector</type></literal>
1533 </term>
1535 <listitem>
1536 <para>
1537 Returns a vector that lists the same lexemes as the given vector, but
1538 lacks any position or weight information. The result is usually much
1539 smaller than an unstripped vector, but it is also less useful.
1540 Relevance ranking does not work as well on stripped vectors as
1541 unstripped ones. Also,
1542 the <literal>&lt;-&gt;</literal> (FOLLOWED BY) <type>tsquery</type> operator
1543 will never match stripped input, since it cannot determine the
1544 distance between lexeme occurrences.
1545 </para>
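<para>
For example:

<screen>
SELECT strip('fat:2,4 cat:3 rat:5A'::tsvector);
strip
-------------------
'cat' 'fat' 'rat'
</screen>
</para>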
1546 </listitem>
1548 </varlistentry>
1550 </variablelist>
1552 <para>
1553 A full list of <type>tsvector</type>-related functions is available
1554 in <xref linkend="textsearch-functions-table"/>.
1555 </para>
1557 </sect2>
1559 <sect2 id="textsearch-manipulate-tsquery">
1560 <title>Manipulating Queries</title>
1562 <para>
1563 <xref linkend="textsearch-parsing-queries"/> showed how raw textual
1564 queries can be converted into <type>tsquery</type> values.
1565 <productname>PostgreSQL</productname> also provides functions and
1566 operators that can be used to manipulate queries that are already
1567 in <type>tsquery</type> form.
1568 </para>
1570 <variablelist>
1572 <varlistentry>
1574 <term>
1575 <literal><type>tsquery</type> &amp;&amp; <type>tsquery</type></literal>
1576 </term>
1578 <listitem>
1579 <para>
1580 Returns the AND-combination of the two given queries.
1581 </para>
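<para>
For example:

<screen>
SELECT 'fat | rat'::tsquery &amp;&amp; 'cat'::tsquery;
?column?
---------------------------
( 'fat' | 'rat' ) &amp; 'cat'
</screen>
</para>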
1582 </listitem>
1584 </varlistentry>
1586 <varlistentry>
1588 <term>
1589 <literal><type>tsquery</type> || <type>tsquery</type></literal>
1590 </term>
1592 <listitem>
1593 <para>
1594 Returns the OR-combination of the two given queries.
1595 </para>
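<para>
For example:

<screen>
SELECT 'fat | rat'::tsquery || 'cat'::tsquery;
?column?
---------------------------
( 'fat' | 'rat' ) | 'cat'
</screen>
</para>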
1596 </listitem>
1598 </varlistentry>
1600 <varlistentry>
1602 <term>
1603 <literal>!! <type>tsquery</type></literal>
1604 </term>
1606 <listitem>
1607 <para>
1608 Returns the negation (NOT) of the given query.
1609 </para>
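<para>
For example:

<screen>
SELECT !! 'cat'::tsquery;
?column?
----------
!'cat'
</screen>
</para>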
1610 </listitem>
1612 </varlistentry>
1614 <varlistentry>
1616 <term>
1617 <literal><type>tsquery</type> &lt;-&gt; <type>tsquery</type></literal>
1618 </term>
1620 <listitem>
1621 <para>
1622 Returns a query that searches for a match to the first given query
1623 immediately followed by a match to the second given query, using
1624 the <literal>&lt;-&gt;</literal> (FOLLOWED BY)
1625 <type>tsquery</type> operator. For example:
1627 <screen>
1628 SELECT to_tsquery('fat') &lt;-&gt; to_tsquery('cat | rat');
1629 ?column?
1630 ----------------------------
1631 'fat' &lt;-&gt; ( 'cat' | 'rat' )
1632 </screen>
1633 </para>
1634 </listitem>
1636 </varlistentry>
1638 <varlistentry>
1640 <term>
1641 <indexterm>
1642 <primary>tsquery_phrase</primary>
1643 </indexterm>
1645 <literal>tsquery_phrase(<replaceable class="parameter">query1</replaceable> <type>tsquery</type>, <replaceable class="parameter">query2</replaceable> <type>tsquery</type> [, <replaceable class="parameter">distance</replaceable> <type>integer</type> ]) returns <type>tsquery</type></literal>
1646 </term>
1648 <listitem>
1649 <para>
1650 Returns a query that searches for a match to the first given query
1651 followed by a match to the second given query at a distance of exactly
1652 <replaceable>distance</replaceable> lexemes, using
1653 the <literal>&lt;<replaceable>N</replaceable>&gt;</literal>
1654 <type>tsquery</type> operator. For example:
1656 <screen>
1657 SELECT tsquery_phrase(to_tsquery('fat'), to_tsquery('cat'), 10);
1658 tsquery_phrase
1659 ------------------
1660 'fat' &lt;10&gt; 'cat'
1661 </screen>
1662 </para>
1663 </listitem>
1665 </varlistentry>
1667 <varlistentry>
1669 <term>
1670 <indexterm>
1671 <primary>numnode</primary>
1672 </indexterm>
1674 <literal>numnode(<replaceable class="parameter">query</replaceable> <type>tsquery</type>) returns <type>integer</type></literal>
1675 </term>
1677 <listitem>
1678 <para>
1679 Returns the number of nodes (lexemes plus operators) in a
1680 <type>tsquery</type>. This function is useful
1681 to determine if the <replaceable>query</replaceable> is meaningful
1682 (returns &gt; 0), or contains only stop words (returns 0).
1683 Examples:
1685 <screen>
1686 SELECT numnode(plainto_tsquery('the any'));
1687 NOTICE: query contains only stopword(s) or doesn't contain lexeme(s), ignored
1688 numnode
---------
       0

SELECT numnode('foo &amp; bar'::tsquery);
1693 numnode
---------
       3
</screen>
1697 </para>
1698 </listitem>
1699 </varlistentry>
1701 <varlistentry>
1703 <term>
1704 <indexterm>
1705 <primary>querytree</primary>
1706 </indexterm>
1708 <literal>querytree(<replaceable class="parameter">query</replaceable> <type>tsquery</type>) returns <type>text</type></literal>
1709 </term>
1711 <listitem>
1712 <para>
1713 Returns the portion of a <type>tsquery</type> that can be used for
1714 searching an index. This function is useful for detecting
1715 unindexable queries, for example those containing only stop words
1716 or only negated terms. For example:
1718 <screen>
1719 SELECT querytree(to_tsquery('defined'));
1720 querytree
1721 -----------
1722 'defin'
1724 SELECT querytree(to_tsquery('!defined'));
1725 querytree
-----------
 T
</screen>
1729 </para>
1730 </listitem>
1731 </varlistentry>
1733 </variablelist>
1735 <sect3 id="textsearch-query-rewriting">
1736 <title>Query Rewriting</title>
1738 <indexterm zone="textsearch-query-rewriting">
1739 <primary>ts_rewrite</primary>
1740 </indexterm>
1742 <para>
The <function>ts_rewrite</function> family of functions searches a
given <type>tsquery</type> for occurrences of a target
subquery, and replaces each occurrence with a
substitute subquery. In essence this operation is a
1747 <type>tsquery</type>-specific version of substring replacement.
1748 A target and substitute combination can be
1749 thought of as a <firstterm>query rewrite rule</firstterm>. A collection
1750 of such rewrite rules can be a powerful search aid.
1751 For example, you can expand the search using synonyms
1752 (e.g., <literal>new york</literal>, <literal>big apple</literal>, <literal>nyc</literal>,
1753 <literal>gotham</literal>) or narrow the search to direct the user to some hot
1754 topic. There is some overlap in functionality between this feature
1755 and thesaurus dictionaries (<xref linkend="textsearch-thesaurus"/>).
1756 However, you can modify a set of rewrite rules on-the-fly without
1757 reindexing, whereas updating a thesaurus requires reindexing to be
1758 effective.
1759 </para>
1761 <variablelist>
1763 <varlistentry>
1765 <term>
1766 <literal>ts_rewrite (<replaceable class="parameter">query</replaceable> <type>tsquery</type>, <replaceable class="parameter">target</replaceable> <type>tsquery</type>, <replaceable class="parameter">substitute</replaceable> <type>tsquery</type>) returns <type>tsquery</type></literal>
1767 </term>
1769 <listitem>
1770 <para>
1771 This form of <function>ts_rewrite</function> simply applies a single
1772 rewrite rule: <replaceable class="parameter">target</replaceable>
1773 is replaced by <replaceable class="parameter">substitute</replaceable>
1774 wherever it appears in <replaceable
1775 class="parameter">query</replaceable>. For example:
1777 <screen>
1778 SELECT ts_rewrite('a &amp; b'::tsquery, 'a'::tsquery, 'c'::tsquery);
1779 ts_rewrite
1780 ------------
1781 'b' &amp; 'c'
1782 </screen>
1783 </para>
1784 </listitem>
1785 </varlistentry>
1787 <varlistentry>
1789 <term>
1790 <literal>ts_rewrite (<replaceable class="parameter">query</replaceable> <type>tsquery</type>, <replaceable class="parameter">select</replaceable> <type>text</type>) returns <type>tsquery</type></literal>
1791 </term>
1793 <listitem>
1794 <para>
1795 This form of <function>ts_rewrite</function> accepts a starting
1796 <replaceable>query</replaceable> and an SQL <replaceable>select</replaceable> command, which
1797 is given as a text string. The <replaceable>select</replaceable> must yield two
1798 columns of <type>tsquery</type> type. For each row of the
1799 <replaceable>select</replaceable> result, occurrences of the first column value
1800 (the target) are replaced by the second column value (the substitute)
1801 within the current <replaceable>query</replaceable> value. For example:
1803 <screen>
1804 CREATE TABLE aliases (t tsquery PRIMARY KEY, s tsquery);
1805 INSERT INTO aliases VALUES('a', 'c');
1807 SELECT ts_rewrite('a &amp; b'::tsquery, 'SELECT t,s FROM aliases');
1808 ts_rewrite
1809 ------------
1810 'b' &amp; 'c'
1811 </screen>
1812 </para>
1814 <para>
1815 Note that when multiple rewrite rules are applied in this way,
1816 the order of application can be important; so in practice you will
1817 want the source query to <literal>ORDER BY</literal> some ordering key.
1818 </para>
1819 </listitem>
1820 </varlistentry>
1822 </variablelist>
1824 <para>
1825 Let's consider a real-life astronomical example. We'll expand query
1826 <literal>supernovae</literal> using table-driven rewriting rules:
1828 <screen>
1829 CREATE TABLE aliases (t tsquery primary key, s tsquery);
1830 INSERT INTO aliases VALUES(to_tsquery('supernovae'), to_tsquery('supernovae|sn'));
1832 SELECT ts_rewrite(to_tsquery('supernovae &amp; crab'), 'SELECT * FROM aliases');
1833 ts_rewrite
1834 ---------------------------------
1835 'crab' &amp; ( 'supernova' | 'sn' )
1836 </screen>
1838 We can change the rewriting rules just by updating the table:
1840 <screen>
1841 UPDATE aliases
1842 SET s = to_tsquery('supernovae|sn &amp; !nebulae')
1843 WHERE t = to_tsquery('supernovae');
1845 SELECT ts_rewrite(to_tsquery('supernovae &amp; crab'), 'SELECT * FROM aliases');
1846 ts_rewrite
1847 ---------------------------------------------
1848 'crab' &amp; ( 'supernova' | 'sn' &amp; !'nebula' )
1849 </screen>
1850 </para>
1852 <para>
1853 Rewriting can be slow when there are many rewriting rules, since it
1854 checks every rule for a possible match. To filter out obvious non-candidate
1855 rules we can use the containment operators for the <type>tsquery</type>
1856 type. In the example below, we select only those rules which might match
1857 the original query:
1859 <screen>
1860 SELECT ts_rewrite('a &amp; b'::tsquery,
1861 'SELECT t,s FROM aliases WHERE ''a &amp; b''::tsquery @&gt; t');
1862 ts_rewrite
1863 ------------
1864 'b' &amp; 'c'
1865 </screen>
1866 </para>
1868 </sect3>
1870 </sect2>
1872 <sect2 id="textsearch-update-triggers">
1873 <title>Triggers for Automatic Updates</title>
1875 <indexterm>
1876 <primary>trigger</primary>
1877 <secondary>for updating a derived tsvector column</secondary>
1878 </indexterm>
1880 <note>
1881 <para>
1882 The method described in this section has been obsoleted by the use of
1883 stored generated columns, as described in <xref
1884 linkend="textsearch-tables-index"/>.
1885 </para>
1886 </note>
1888 <para>
1889 When using a separate column to store the <type>tsvector</type> representation
1890 of your documents, it is necessary to create a trigger to update the
1891 <type>tsvector</type> column when the document content columns change.
1892 Two built-in trigger functions are available for this, or you can write
1893 your own.
1894 </para>
1896 <synopsis>
1897 tsvector_update_trigger(<replaceable class="parameter">tsvector_column_name</replaceable>,&zwsp; <replaceable class="parameter">config_name</replaceable>, <replaceable class="parameter">text_column_name</replaceable> <optional>, ... </optional>)
1898 tsvector_update_trigger_column(<replaceable class="parameter">tsvector_column_name</replaceable>,&zwsp; <replaceable class="parameter">config_column_name</replaceable>, <replaceable class="parameter">text_column_name</replaceable> <optional>, ... </optional>)
1899 </synopsis>
1901 <para>
1902 These trigger functions automatically compute a <type>tsvector</type>
1903 column from one or more textual columns, under the control of
1904 parameters specified in the <command>CREATE TRIGGER</command> command.
1905 An example of their use is:
1907 <screen>
1908 CREATE TABLE messages (
1909 title text,
1910 body text,
tsv tsvector
);
1914 CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
1915 ON messages FOR EACH ROW EXECUTE FUNCTION
1916 tsvector_update_trigger(tsv, 'pg_catalog.english', title, body);
1918 INSERT INTO messages VALUES('title here', 'the body text is here');
1920 SELECT * FROM messages;
1921 title | body | tsv
1922 ------------+-----------------------+----------------------------
1923 title here | the body text is here | 'bodi':4 'text':5 'titl':1
1925 SELECT title, body FROM messages WHERE tsv @@ to_tsquery('title &amp; body');
1926 title | body
1927 ------------+-----------------------
1928 title here | the body text is here
1929 </screen>
1931 Having created this trigger, any change in <structfield>title</structfield> or
1932 <structfield>body</structfield> will automatically be reflected into
1933 <structfield>tsv</structfield>, without the application having to worry about it.
1934 </para>
1936 <para>
1937 The first trigger argument must be the name of the <type>tsvector</type>
1938 column to be updated. The second argument specifies the text search
1939 configuration to be used to perform the conversion. For
1940 <function>tsvector_update_trigger</function>, the configuration name is simply
1941 given as the second trigger argument. It must be schema-qualified as
1942 shown above, so that the trigger behavior will not change with changes
1943 in <varname>search_path</varname>. For
1944 <function>tsvector_update_trigger_column</function>, the second trigger argument
1945 is the name of another table column, which must be of type
1946 <type>regconfig</type>. This allows a per-row selection of configuration
1947 to be made. The remaining argument(s) are the names of textual columns
1948 (of type <type>text</type>, <type>varchar</type>, or <type>char</type>). These
1949 will be included in the document in the order given. NULL values will
1950 be skipped (but the other columns will still be indexed).
1951 </para>
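<para>
As a sketch of per-row configuration selection (the table and column
names here are hypothetical),
<function>tsvector_update_trigger_column</function> takes a
<type>regconfig</type> column in place of a fixed configuration name:

<programlisting>
CREATE TABLE multilingual_messages (
title text,
body text,
config regconfig,
tsv tsvector
);

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
ON multilingual_messages FOR EACH ROW EXECUTE FUNCTION
tsvector_update_trigger_column(tsv, config, title, body);
</programlisting>
</para>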
1953 <para>
1954 A limitation of these built-in triggers is that they treat all the
1955 input columns alike. To process columns differently &mdash; for
1956 example, to weight title differently from body &mdash; it is necessary
1957 to write a custom trigger. Here is an example using
1958 <application>PL/pgSQL</application> as the trigger language:
1960 <programlisting>
1961 CREATE FUNCTION messages_trigger() RETURNS trigger AS $$
1962 begin
1963 new.tsv :=
1964 setweight(to_tsvector('pg_catalog.english', coalesce(new.title,'')), 'A') ||
1965 setweight(to_tsvector('pg_catalog.english', coalesce(new.body,'')), 'D');
return new;
end
1968 $$ LANGUAGE plpgsql;
1970 CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
1971 ON messages FOR EACH ROW EXECUTE FUNCTION messages_trigger();
1972 </programlisting>
1973 </para>
1975 <para>
1976 Keep in mind that it is important to specify the configuration name
1977 explicitly when creating <type>tsvector</type> values inside triggers,
1978 so that the column's contents will not be affected by changes to
1979 <varname>default_text_search_config</varname>. Failure to do this is likely to
1980 lead to problems such as search results changing after a dump and restore.
1981 </para>
1983 </sect2>
1985 <sect2 id="textsearch-statistics">
1986 <title>Gathering Document Statistics</title>
1988 <indexterm>
1989 <primary>ts_stat</primary>
1990 </indexterm>
1992 <para>
1993 The function <function>ts_stat</function> is useful for checking your
1994 configuration and for finding stop-word candidates.
1995 </para>
1997 <synopsis>
1998 ts_stat(<replaceable class="parameter">sqlquery</replaceable> <type>text</type>, <optional> <replaceable class="parameter">weights</replaceable> <type>text</type>, </optional>
1999 OUT <replaceable class="parameter">word</replaceable> <type>text</type>, OUT <replaceable class="parameter">ndoc</replaceable> <type>integer</type>,
2000 OUT <replaceable class="parameter">nentry</replaceable> <type>integer</type>) returns <type>setof record</type>
2001 </synopsis>
2003 <para>
2004 <replaceable>sqlquery</replaceable> is a text value containing an SQL
2005 query which must return a single <type>tsvector</type> column.
2006 <function>ts_stat</function> executes the query and returns statistics about
2007 each distinct lexeme (word) contained in the <type>tsvector</type>
2008 data. The columns returned are
2010 <itemizedlist spacing="compact" mark="bullet">
2011 <listitem>
2012 <para>
2013 <replaceable>word</replaceable> <type>text</type> &mdash; the value of a lexeme
2014 </para>
2015 </listitem>
2016 <listitem>
2017 <para>
2018 <replaceable>ndoc</replaceable> <type>integer</type> &mdash; number of documents
2019 (<type>tsvector</type>s) the word occurred in
2020 </para>
2021 </listitem>
2022 <listitem>
2023 <para>
2024 <replaceable>nentry</replaceable> <type>integer</type> &mdash; total number of
2025 occurrences of the word
2026 </para>
2027 </listitem>
2028 </itemizedlist>
2030 If <replaceable>weights</replaceable> is supplied, only occurrences
2031 having one of those weights are counted.
2032 </para>
2034 <para>
2035 For example, to find the ten most frequent words in a document collection:
2037 <programlisting>
2038 SELECT * FROM ts_stat('SELECT vector FROM apod')
2039 ORDER BY nentry DESC, ndoc DESC, word
2040 LIMIT 10;
2041 </programlisting>
2043 The same, but counting only word occurrences with weight <literal>A</literal>
2044 or <literal>B</literal>:
2046 <programlisting>
2047 SELECT * FROM ts_stat('SELECT vector FROM apod', 'ab')
2048 ORDER BY nentry DESC, ndoc DESC, word
2049 LIMIT 10;
2050 </programlisting>
2051 </para>
2053 </sect2>
2055 </sect1>
2057 <sect1 id="textsearch-parsers">
2058 <title>Parsers</title>
2060 <para>
2061 Text search parsers are responsible for splitting raw document text
2062 into <firstterm>tokens</firstterm> and identifying each token's type, where
2063 the set of possible types is defined by the parser itself.
2064 Note that a parser does not modify the text at all &mdash; it simply
2065 identifies plausible word boundaries. Because of this limited scope,
2066 there is less need for application-specific custom parsers than there is
2067 for custom dictionaries. At present <productname>PostgreSQL</productname>
2068 provides just one built-in parser, which has been found to be useful for a
2069 wide range of applications.
2070 </para>
2072 <para>
2073 The built-in parser is named <literal>pg_catalog.default</literal>.
2074 It recognizes 23 token types, shown in <xref linkend="textsearch-default-parser"/>.
2075 </para>
2077 <table id="textsearch-default-parser">
2078 <title>Default Parser's Token Types</title>
2079 <tgroup cols="3">
2080 <colspec colname="col1" colwidth="2*"/>
2081 <colspec colname="col2" colwidth="2*"/>
2082 <colspec colname="col3" colwidth="3*"/>
2083 <thead>
2084 <row>
2085 <entry>Alias</entry>
2086 <entry>Description</entry>
2087 <entry>Example</entry>
2088 </row>
2089 </thead>
2090 <tbody>
2091 <row>
2092 <entry><literal>asciiword</literal></entry>
2093 <entry>Word, all ASCII letters</entry>
2094 <entry><literal>elephant</literal></entry>
2095 </row>
2096 <row>
2097 <entry><literal>word</literal></entry>
2098 <entry>Word, all letters</entry>
2099 <entry><literal>ma&ntilde;ana</literal></entry>
2100 </row>
2101 <row>
2102 <entry><literal>numword</literal></entry>
2103 <entry>Word, letters and digits</entry>
2104 <entry><literal>beta1</literal></entry>
2105 </row>
2106 <row>
2107 <entry><literal>asciihword</literal></entry>
2108 <entry>Hyphenated word, all ASCII</entry>
2109 <entry><literal>up-to-date</literal></entry>
2110 </row>
2111 <row>
2112 <entry><literal>hword</literal></entry>
2113 <entry>Hyphenated word, all letters</entry>
2114 <entry><literal>l&oacute;gico-matem&aacute;tica</literal></entry>
2115 </row>
2116 <row>
2117 <entry><literal>numhword</literal></entry>
2118 <entry>Hyphenated word, letters and digits</entry>
2119 <entry><literal>postgresql-beta1</literal></entry>
2120 </row>
2121 <row>
2122 <entry><literal>hword_asciipart</literal></entry>
2123 <entry>Hyphenated word part, all ASCII</entry>
2124 <entry><literal>postgresql</literal> in the context <literal>postgresql-beta1</literal></entry>
2125 </row>
2126 <row>
2127 <entry><literal>hword_part</literal></entry>
2128 <entry>Hyphenated word part, all letters</entry>
2129 <entry><literal>l&oacute;gico</literal> or <literal>matem&aacute;tica</literal>
2130 in the context <literal>l&oacute;gico-matem&aacute;tica</literal></entry>
2131 </row>
2132 <row>
2133 <entry><literal>hword_numpart</literal></entry>
2134 <entry>Hyphenated word part, letters and digits</entry>
2135 <entry><literal>beta1</literal> in the context
2136 <literal>postgresql-beta1</literal></entry>
2137 </row>
2138 <row>
2139 <entry><literal>email</literal></entry>
2140 <entry>Email address</entry>
2141 <entry><literal>foo@example.com</literal></entry>
2142 </row>
2143 <row>
2144 <entry><literal>protocol</literal></entry>
2145 <entry>Protocol head</entry>
2146 <entry><literal>http://</literal></entry>
2147 </row>
2148 <row>
2149 <entry><literal>url</literal></entry>
2150 <entry>URL</entry>
2151 <entry><literal>example.com/stuff/index.html</literal></entry>
2152 </row>
2153 <row>
2154 <entry><literal>host</literal></entry>
2155 <entry>Host</entry>
2156 <entry><literal>example.com</literal></entry>
2157 </row>
2158 <row>
2159 <entry><literal>url_path</literal></entry>
2160 <entry>URL path</entry>
2161 <entry><literal>/stuff/index.html</literal>, in the context of a URL</entry>
2162 </row>
2163 <row>
2164 <entry><literal>file</literal></entry>
2165 <entry>File or path name</entry>
2166 <entry><literal>/usr/local/foo.txt</literal>, if not within a URL</entry>
2167 </row>
2168 <row>
2169 <entry><literal>sfloat</literal></entry>
2170 <entry>Scientific notation</entry>
2171 <entry><literal>-1.234e56</literal></entry>
2172 </row>
2173 <row>
2174 <entry><literal>float</literal></entry>
2175 <entry>Decimal notation</entry>
2176 <entry><literal>-1.234</literal></entry>
2177 </row>
2178 <row>
2179 <entry><literal>int</literal></entry>
2180 <entry>Signed integer</entry>
2181 <entry><literal>-1234</literal></entry>
2182 </row>
2183 <row>
2184 <entry><literal>uint</literal></entry>
2185 <entry>Unsigned integer</entry>
2186 <entry><literal>1234</literal></entry>
2187 </row>
2188 <row>
2189 <entry><literal>version</literal></entry>
2190 <entry>Version number</entry>
2191 <entry><literal>8.3.0</literal></entry>
2192 </row>
2193 <row>
2194 <entry><literal>tag</literal></entry>
2195 <entry>XML tag</entry>
2196 <entry><literal>&lt;a href="dictionaries.html"&gt;</literal></entry>
2197 </row>
2198 <row>
2199 <entry><literal>entity</literal></entry>
2200 <entry>XML entity</entry>
2201 <entry><literal>&amp;amp;</literal></entry>
2202 </row>
2203 <row>
2204 <entry><literal>blank</literal></entry>
2205 <entry>Space symbols</entry>
2206 <entry>(any whitespace or punctuation not otherwise recognized)</entry>
2207 </row>
2208 </tbody>
2209 </tgroup>
2210 </table>
2212 <note>
2213 <para>
2214 The parser's notion of a <quote>letter</quote> is determined by the database's
2215 locale setting, specifically <varname>lc_ctype</varname>. Words containing
2216 only the basic ASCII letters are reported as a separate token type,
2217 since it is sometimes useful to distinguish them. In most European
2218 languages, token types <literal>word</literal> and <literal>asciiword</literal>
2219 should be treated alike.
2220 </para>
2222 <para>
2223 <literal>email</literal> does not support all valid email characters as
2224 defined by <ulink url="https://datatracker.ietf.org/doc/html/rfc5322">RFC 5322</ulink>.
2225 Specifically, the only non-alphanumeric characters supported for
2226 email user names are period, dash, and underscore.
2227 </para>
2228 </note>
2230 <para>
2231 It is possible for the parser to produce overlapping tokens from the same
2232 piece of text. As an example, a hyphenated word will be reported both
2233 as the entire word and as each component:
2235 <screen>
2236 SELECT alias, description, token FROM ts_debug('foo-bar-beta1');
2237 alias | description | token
2238 -----------------+------------------------------------------+---------------
2239 numhword | Hyphenated word, letters and digits | foo-bar-beta1
2240 hword_asciipart | Hyphenated word part, all ASCII | foo
2241 blank | Space symbols | -
2242 hword_asciipart | Hyphenated word part, all ASCII | bar
2243 blank | Space symbols | -
2244 hword_numpart | Hyphenated word part, letters and digits | beta1
2245 </screen>
2247 This behavior is desirable since it allows searches to work for both
2248 the whole compound word and for components. Here is another
2249 instructive example:
2251 <screen>
2252 SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html');
2253 alias | description | token
2254 ----------+---------------+------------------------------
2255 protocol | Protocol head | http://
2256 url | URL | example.com/stuff/index.html
2257 host | Host | example.com
2258 url_path | URL path | /stuff/index.html
2259 </screen>
2260 </para>
2262 </sect1>
2264 <sect1 id="textsearch-dictionaries">
2265 <title>Dictionaries</title>
2267 <para>
2268 Dictionaries are used to eliminate words that should not be considered in a
2269 search (<firstterm>stop words</firstterm>), and to <firstterm>normalize</firstterm> words so
2270 that different derived forms of the same word will match. A successfully
2271 normalized word is called a <firstterm>lexeme</firstterm>. Aside from
2272 improving search quality, normalization and removal of stop words reduce the
2273 size of the <type>tsvector</type> representation of a document, thereby
2274 improving performance. Normalization does not always have linguistic meaning
2275 and usually depends on application semantics.
2276 </para>
2278 <para>
2279 Some examples of normalization:
2281 <itemizedlist spacing="compact" mark="bullet">
2283 <listitem>
2284 <para>
2285 Linguistic &mdash; Ispell dictionaries try to reduce input words to a
2286 normalized form; stemmer dictionaries remove word endings
2287 </para>
2288 </listitem>
2289 <listitem>
2290 <para>
2291 <acronym>URL</acronym> locations can be canonicalized to make
2292 equivalent URLs match:
2294 <itemizedlist spacing="compact" mark="bullet">
2295 <listitem>
2296 <para>
2297 http://www.pgsql.ru/db/mw/index.html
2298 </para>
2299 </listitem>
2300 <listitem>
2301 <para>
2302 http://www.pgsql.ru/db/mw/
2303 </para>
2304 </listitem>
2305 <listitem>
2306 <para>
2307 http://www.pgsql.ru/db/../db/mw/index.html
2308 </para>
2309 </listitem>
2310 </itemizedlist>
2311 </para>
2312 </listitem>
2313 <listitem>
2314 <para>
2315 Color names can be replaced by their hexadecimal values, e.g.,
2316 <literal>red, green, blue, magenta -> FF0000, 00FF00, 0000FF, FF00FF</literal>
2317 </para>
2318 </listitem>
2319 <listitem>
2320 <para>
2321 If indexing numbers, we can
2322 remove some fractional digits to reduce the range of possible
2323 numbers, so for example <emphasis>3.14</emphasis>159265359,
2324 <emphasis>3.14</emphasis>15926, <emphasis>3.14</emphasis> will be the same
2325 after normalization if only two digits are kept after the decimal point.
2326 </para>
2327 </listitem>
2328 </itemizedlist>
2330 </para>
2332 <para>
2333 A dictionary is a program that accepts a token as
2334 input and returns:
2335 <itemizedlist spacing="compact" mark="bullet">
2336 <listitem>
2337 <para>
2338 an array of lexemes if the input token is known to the dictionary
2339 (notice that one token can produce more than one lexeme)
2340 </para>
2341 </listitem>
2342 <listitem>
2343 <para>
2344 a single lexeme with the <literal>TSL_FILTER</literal> flag set, to replace
2345 the original token with a new token to be passed to subsequent
2346 dictionaries (a dictionary that does this is called a
2347 <firstterm>filtering dictionary</firstterm>)
2348 </para>
2349 </listitem>
2350 <listitem>
2351 <para>
2352 an empty array if the dictionary knows the token, but it is a stop word
2353 </para>
2354 </listitem>
2355 <listitem>
2356 <para>
2357 <literal>NULL</literal> if the dictionary does not recognize the input token
2358 </para>
2359 </listitem>
2360 </itemizedlist>
2361 </para>
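<para>
These behaviors can be observed directly with
<function>ts_lexize</function>. For example, with the built-in
<literal>english_stem</literal> dictionary, a recognized word yields an
array of lexemes, while a stop word yields an empty array:

<screen>
SELECT ts_lexize('english_stem', 'stars');
ts_lexize
-----------
{star}

SELECT ts_lexize('english_stem', 'a');
ts_lexize
-----------
{}
</screen>
</para>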
2363 <para>
2364 <productname>PostgreSQL</productname> provides predefined dictionaries for
2365 many languages. There are also several predefined templates that can be
2366 used to create new dictionaries with custom parameters. Each predefined
2367 dictionary template is described below. If no existing
2368 template is suitable, it is possible to create new ones; see the
2369 <filename>contrib/</filename> area of the <productname>PostgreSQL</productname> distribution
2370 for examples.
2371 </para>
2373 <para>
2374 A text search configuration binds a parser together with a set of
2375 dictionaries to process the parser's output tokens. For each token
2376 type that the parser can return, a separate list of dictionaries is
2377 specified by the configuration. When a token of that type is found
2378 by the parser, each dictionary in the list is consulted in turn,
2379 until some dictionary recognizes it as a known word. If it is identified
2380 as a stop word, or if no dictionary recognizes the token, it will be
2381 discarded and not indexed or searched for.
2382 Normally, the first dictionary that returns a non-<literal>NULL</literal>
2383 output determines the result, and any remaining dictionaries are not
2384 consulted; but a filtering dictionary can replace the given word
2385 with a modified word, which is then passed to subsequent dictionaries.
2386 </para>
2388 <para>
2389 The general rule for configuring a list of dictionaries
2390 is to place first the most narrow, most specific dictionary, then the more
2391 general dictionaries, finishing with a very general dictionary, like
2392 a <application>Snowball</application> stemmer or <literal>simple</literal>, which
2393 recognizes everything. For example, for an astronomy-specific search
2394 (<literal>astro_en</literal> configuration) one could bind token type
2395 <type>asciiword</type> (ASCII word) to a synonym dictionary of astronomical
2396 terms, a general English dictionary and a <application>Snowball</application> English
2397 stemmer:
2399 <programlisting>
2400 ALTER TEXT SEARCH CONFIGURATION astro_en
2401 ADD MAPPING FOR asciiword WITH astrosyn, english_ispell, english_stem;
2402 </programlisting>
2403 </para>
2405 <para>
2406 A filtering dictionary can be placed anywhere in the list, except at the
2407 end where it'd be useless. Filtering dictionaries are useful to partially
2408 normalize words to simplify the task of later dictionaries. For example,
2409 a filtering dictionary could be used to remove accents from accented
2410 letters, as is done by the <xref linkend="unaccent"/> module.
2411 </para>
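<para>
For instance, assuming the <xref linkend="unaccent"/> extension has
been installed, its filtering dictionary can be tested directly:

<screen>
CREATE EXTENSION unaccent;

SELECT ts_lexize('unaccent', 'H&ocirc;tel');
ts_lexize
-----------
{Hotel}
</screen>
</para>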
2413 <sect2 id="textsearch-stopwords">
2414 <title>Stop Words</title>
2416 <para>
2417 Stop words are words that are very common, appear in almost every
2418 document, and have no discrimination value. Therefore, they can be ignored
2419 in the context of full text searching. For example, every English text
2420 contains words like <literal>a</literal> and <literal>the</literal>, so it is
2421 useless to store them in an index. However, stop words do affect the
2422 positions in <type>tsvector</type>, which in turn affect ranking:
2424 <screen>
2425 SELECT to_tsvector('english', 'in the list of stop words');
2426 to_tsvector
2427 ----------------------------
2428 'list':3 'stop':5 'word':6
2429 </screen>
2431 The missing positions 1,2,4 are because of stop words. Ranks
2432 calculated for documents with and without stop words are quite different:
2434 <screen>
2435 SELECT ts_rank_cd (to_tsvector('english', 'in the list of stop words'), to_tsquery('list &amp; stop'));
2436 ts_rank_cd
2437 ------------
2438 0.05
2440 SELECT ts_rank_cd (to_tsvector('english', 'list stop words'), to_tsquery('list &amp; stop'));
2441 ts_rank_cd
------------
        0.1
</screen>
2446 </para>
2448 <para>
2449 It is up to the specific dictionary how it treats stop words. For example,
2450 <literal>ispell</literal> dictionaries first normalize words and then
2451 look at the list of stop words, while <literal>Snowball</literal> stemmers
2452 first check the list of stop words. The reason for the different
2453 behavior is an attempt to decrease noise.
2454 </para>
2456 </sect2>
2458 <sect2 id="textsearch-simple-dictionary">
2459 <title>Simple Dictionary</title>
2461 <para>
2462 The <literal>simple</literal> dictionary template operates by converting the
2463 input token to lower case and checking it against a file of stop words.
2464 If it is found in the file then an empty array is returned, causing
2465 the token to be discarded. If not, the lower-cased form of the word
2466 is returned as the normalized lexeme. Alternatively, the dictionary
2467 can be configured to report non-stop-words as unrecognized, allowing
2468 them to be passed on to the next dictionary in the list.
2469 </para>
2471 <para>
2472 Here is an example of a dictionary definition using the <literal>simple</literal>
2473 template:
2475 <programlisting>
2476 CREATE TEXT SEARCH DICTIONARY public.simple_dict (
2477 TEMPLATE = pg_catalog.simple,
STOPWORDS = english
);
2480 </programlisting>
2482 Here, <literal>english</literal> is the base name of a file of stop words.
2483 The file's full name will be
2484 <filename>$SHAREDIR/tsearch_data/english.stop</filename>,
2485 where <literal>$SHAREDIR</literal> means the
2486 <productname>PostgreSQL</productname> installation's shared-data directory,
2487 often <filename>/usr/local/share/postgresql</filename> (use <command>pg_config
2488 --sharedir</command> to determine it if you're not sure).
2489 The file format is simply a list
2490 of words, one per line. Blank lines and trailing spaces are ignored,
2491 and upper case is folded to lower case, but no other processing is done
2492 on the file contents.
2493 </para>
2495 <para>
2496 Now we can test our dictionary:
2498 <screen>
2499 SELECT ts_lexize('public.simple_dict', 'YeS');
2500 ts_lexize
2501 -----------
2502 {yes}
2504 SELECT ts_lexize('public.simple_dict', 'The');
2505 ts_lexize
-----------
 {}
</screen>
2509 </para>
2511 <para>
2512 We can also choose to return <literal>NULL</literal>, instead of the lower-cased
2513 word, if it is not found in the stop words file. This behavior is
2514 selected by setting the dictionary's <literal>Accept</literal> parameter to
2515 <literal>false</literal>. Continuing the example:
2517 <screen>
2518 ALTER TEXT SEARCH DICTIONARY public.simple_dict ( Accept = false );
2520 SELECT ts_lexize('public.simple_dict', 'YeS');
2521 ts_lexize
2522 -----------
2525 SELECT ts_lexize('public.simple_dict', 'The');
2526 ts_lexize
-----------
 {}
</screen>
2530 </para>
2532 <para>
2533 With the default setting of <literal>Accept</literal> = <literal>true</literal>,
2534 it is only useful to place a <literal>simple</literal> dictionary at the end
2535 of a list of dictionaries, since it will never pass on any token to
2536 a following dictionary. Conversely, <literal>Accept</literal> = <literal>false</literal>
2537 is only useful when there is at least one following dictionary.
2538 </para>
2540 <caution>
2541 <para>
2542 Most types of dictionaries rely on configuration files, such as files of
2543 stop words. These files <emphasis>must</emphasis> be stored in UTF-8 encoding.
2544 They will be translated to the actual database encoding, if that is
2545 different, when they are read into the server.
2546 </para>
2547 </caution>
2549 <caution>
2550 <para>
2551 Normally, a database session will read a dictionary configuration file
2552 only once, when it is first used within the session. If you modify a
2553 configuration file and want to force existing sessions to pick up the
2554 new contents, issue an <command>ALTER TEXT SEARCH DICTIONARY</command> command
2555 on the dictionary. This can be a <quote>dummy</quote> update that doesn't
2556 actually change any parameter values.
2557 </para>
2558 </caution>
2560 </sect2>
2562 <sect2 id="textsearch-synonym-dictionary">
2563 <title>Synonym Dictionary</title>
2565 <para>
2566 This dictionary template is used to create dictionaries that replace a
2567 word with a synonym. Phrases are not supported (use the thesaurus
2568 template (<xref linkend="textsearch-thesaurus"/>) for that). A synonym
2569 dictionary can be used to overcome linguistic problems, for example, to
2570 prevent an English stemmer dictionary from reducing the word <quote>Paris</quote> to
2571 <quote>pari</quote>. It is enough to have a <literal>Paris paris</literal> line in the
2572 synonym dictionary and put it before the <literal>english_stem</literal>
2573 dictionary. For example:
2575 <screen>
2576 SELECT * FROM ts_debug('english', 'Paris');
2577 alias | description | token | dictionaries | dictionary | lexemes
2578 -----------+-----------------+-------+----------------+--------------+---------
2579 asciiword | Word, all ASCII | Paris | {english_stem} | english_stem | {pari}
2581 CREATE TEXT SEARCH DICTIONARY my_synonym (
2582 TEMPLATE = synonym,
SYNONYMS = my_synonyms
);
2586 ALTER TEXT SEARCH CONFIGURATION english
2587 ALTER MAPPING FOR asciiword
2588 WITH my_synonym, english_stem;
2590 SELECT * FROM ts_debug('english', 'Paris');
2591 alias | description | token | dictionaries | dictionary | lexemes
2592 -----------+-----------------+-------+---------------------------+------------+---------
2593 asciiword | Word, all ASCII | Paris | {my_synonym,english_stem} | my_synonym | {paris}
2594 </screen>
2595 </para>
2597 <para>
2598 The only parameter required by the <literal>synonym</literal> template is
2599 <literal>SYNONYMS</literal>, which is the base name of its configuration file
2600 &mdash; <literal>my_synonyms</literal> in the above example.
2601 The file's full name will be
2602 <filename>$SHAREDIR/tsearch_data/my_synonyms.syn</filename>
2603 (where <literal>$SHAREDIR</literal> means the
2604 <productname>PostgreSQL</productname> installation's shared-data directory).
2605 The file format is just one line
2606 per word to be substituted, with the word followed by its synonym,
2607 separated by white space. Blank lines and trailing spaces are ignored.
2608 </para>
2610 <para>
2611 The <literal>synonym</literal> template also has an optional parameter
2612 <literal>CaseSensitive</literal>, which defaults to <literal>false</literal>. When
2613 <literal>CaseSensitive</literal> is <literal>false</literal>, words in the synonym file
2614 are folded to lower case, as are input tokens. When it is
2615 <literal>true</literal>, words and tokens are not folded to lower case,
2616 but are compared as-is.
2617 </para>
2619 <para>
2620 An asterisk (<literal>*</literal>) can be placed at the end of a synonym
2621 in the configuration file. This indicates that the synonym is a prefix.
2622 The asterisk is ignored when the entry is used in
2623 <function>to_tsvector()</function>, but when it is used in
2624 <function>to_tsquery()</function>, the result will be a query item with
2625 the prefix match marker (see
2626 <xref linkend="textsearch-parsing-queries"/>).
2627 For example, suppose we have these entries in
2628 <filename>$SHAREDIR/tsearch_data/synonym_sample.syn</filename>:
2629 <programlisting>
2630 postgres pgsql
2631 postgresql pgsql
2632 postgre pgsql
2633 gogle googl
2634 indices index*
2635 </programlisting>
2636 Then we will get these results:
2637 <screen>
2638 mydb=# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample');
2639 mydb=# SELECT ts_lexize('syn', 'indices');
2640 ts_lexize
2641 -----------
2642 {index}
2643 (1 row)
2645 mydb=# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple);
2646 mydb=# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn;
2647 mydb=# SELECT to_tsvector('tst', 'indices');
2648 to_tsvector
2649 -------------
2650 'index':1
2651 (1 row)
2653 mydb=# SELECT to_tsquery('tst', 'indices');
2654 to_tsquery
2655 ------------
2656 'index':*
2657 (1 row)
2659 mydb=# SELECT 'indexes are very useful'::tsvector;
2660 tsvector
2661 ---------------------------------
2662 'are' 'indexes' 'useful' 'very'
2663 (1 row)
2665 mydb=# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst', 'indices');
2666 ?column?
2667 ----------
 t
(1 row)
2670 </screen>
2671 </para>
2672 </sect2>
2674 <sect2 id="textsearch-thesaurus">
2675 <title>Thesaurus Dictionary</title>
2677 <para>
2678 A thesaurus dictionary (sometimes abbreviated as <acronym>TZ</acronym>) is
2679 a collection of words that includes information about the relationships
2680 of words and phrases, i.e., broader terms (<acronym>BT</acronym>), narrower
2681 terms (<acronym>NT</acronym>), preferred terms, non-preferred terms, related
2682 terms, etc.
2683 </para>
2685 <para>
2686 Basically a thesaurus dictionary replaces all non-preferred terms by one
2687 preferred term and, optionally, preserves the original terms for indexing
2688 as well. <productname>PostgreSQL</productname>'s current implementation of the
2689 thesaurus dictionary is an extension of the synonym dictionary with added
2690 <firstterm>phrase</firstterm> support. A thesaurus dictionary requires
2691 a configuration file of the following format:
2693 <programlisting>
2694 # this is a comment
2695 sample word(s) : indexed word(s)
2696 more sample word(s) : more indexed word(s)
2698 </programlisting>
2700 where the colon (<symbol>:</symbol>) symbol acts as a delimiter between a
2701 phrase and its replacement.
2702 </para>
2704 <para>
2705 A thesaurus dictionary uses a <firstterm>subdictionary</firstterm> (which
2706 is specified in the dictionary's configuration) to normalize the input
2707 text before checking for phrase matches. It is only possible to select one
2708 subdictionary. An error is reported if the subdictionary fails to
2709 recognize a word. In that case, you should remove the use of the word or
2710 teach the subdictionary about it. You can place an asterisk
2711 (<symbol>*</symbol>) at the beginning of an indexed word to skip applying
2712 the subdictionary to it, but all sample words <emphasis>must</emphasis> be known
2713 to the subdictionary.
2714 </para>
2716 <para>
2717 The thesaurus dictionary chooses the longest match if there are multiple
2718 phrases matching the input, and ties are broken by using the last
2719 definition.
2720 </para>
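<para>
For instance, given these (hypothetical) thesaurus entries:

<programlisting>
one two : ot
one two three : ott
</programlisting>

the input <literal>one two three</literal> is replaced by
<literal>ott</literal>, since the longer sample phrase wins.
</para>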
2722 <para>
2723 Specific stop words recognized by the subdictionary cannot be
2724 specified; instead use <literal>?</literal> to mark the location where any
2725 stop word can appear. For example, assuming that <literal>a</literal> and
2726 <literal>the</literal> are stop words according to the subdictionary:
2728 <programlisting>
2729 ? one ? two : swsw
2730 </programlisting>
2732 matches <literal>a one the two</literal> and <literal>the one a two</literal>;
2733 both would be replaced by <literal>swsw</literal>.
2734 </para>
2736 <para>
Since a thesaurus dictionary has the capability to recognize phrases, it
must remember its state and interact with the parser. A thesaurus dictionary
uses its token-type assignments to check whether it should handle the next
word or stop accumulating one. The thesaurus dictionary must be configured
2741 carefully. For example, if the thesaurus dictionary is assigned to handle
2742 only the <literal>asciiword</literal> token, then a thesaurus dictionary
2743 definition like <literal>one 7</literal> will not work since token type
2744 <literal>uint</literal> is not assigned to the thesaurus dictionary.
2745 </para>
2747 <caution>
2748 <para>
2749 Thesauruses are used during indexing so any change in the thesaurus
2750 dictionary's parameters <emphasis>requires</emphasis> reindexing.
For most other dictionary types, small changes such as adding or
removing stop words do not force reindexing.
2753 </para>
2754 </caution>
2756 <sect3 id="textsearch-thesaurus-config">
2757 <title>Thesaurus Configuration</title>
2759 <para>
2760 To define a new thesaurus dictionary, use the <literal>thesaurus</literal>
2761 template. For example:
2763 <programlisting>
2764 CREATE TEXT SEARCH DICTIONARY thesaurus_simple (
2765 TEMPLATE = thesaurus,
2766 DictFile = mythesaurus,
Dictionary = pg_catalog.english_stem
);
2769 </programlisting>
2771 Here:
2772 <itemizedlist spacing="compact" mark="bullet">
2773 <listitem>
2774 <para>
2775 <literal>thesaurus_simple</literal> is the new dictionary's name
2776 </para>
2777 </listitem>
2778 <listitem>
2779 <para>
2780 <literal>mythesaurus</literal> is the base name of the thesaurus
2781 configuration file.
2782 (Its full name will be <filename>$SHAREDIR/tsearch_data/mythesaurus.ths</filename>,
2783 where <literal>$SHAREDIR</literal> means the installation shared-data
2784 directory.)
2785 </para>
2786 </listitem>
2787 <listitem>
2788 <para>
2789 <literal>pg_catalog.english_stem</literal> is the subdictionary (here,
2790 a Snowball English stemmer) to use for thesaurus normalization.
2791 Notice that the subdictionary will have its own
2792 configuration (for example, stop words), which is not shown here.
2793 </para>
2794 </listitem>
2795 </itemizedlist>
2797 Now it is possible to bind the thesaurus dictionary <literal>thesaurus_simple</literal>
2798 to the desired token types in a configuration, for example:
2800 <programlisting>
2801 ALTER TEXT SEARCH CONFIGURATION russian
2802 ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
2803 WITH thesaurus_simple;
2804 </programlisting>
2805 </para>
2807 </sect3>
2809 <sect3 id="textsearch-thesaurus-examples">
2810 <title>Thesaurus Example</title>
2812 <para>
2813 Consider a simple astronomical thesaurus <literal>thesaurus_astro</literal>,
2814 which contains some astronomical word combinations:
2816 <programlisting>
2817 supernovae stars : sn
2818 crab nebulae : crab
2819 </programlisting>
2821 Below we create a dictionary and bind some token types to
2822 an astronomical thesaurus and English stemmer:
2824 <programlisting>
2825 CREATE TEXT SEARCH DICTIONARY thesaurus_astro (
2826 TEMPLATE = thesaurus,
2827 DictFile = thesaurus_astro,
Dictionary = english_stem
);
2831 ALTER TEXT SEARCH CONFIGURATION russian
2832 ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
2833 WITH thesaurus_astro, english_stem;
2834 </programlisting>
2836 Now we can see how it works.
2837 <function>ts_lexize</function> is not very useful for testing a thesaurus,
2838 because it treats its input as a single token. Instead we can use
2839 <function>plainto_tsquery</function> and <function>to_tsvector</function>
2840 which will break their input strings into multiple tokens:
2842 <screen>
2843 SELECT plainto_tsquery('supernova star');
2844 plainto_tsquery
2845 -----------------
2846 'sn'
2848 SELECT to_tsvector('supernova star');
2849 to_tsvector
2850 -------------
2851 'sn':1
2852 </screen>
In principle, you can use <function>to_tsquery</function> if you quote
the argument:
2857 <screen>
2858 SELECT to_tsquery('''supernova star''');
2859 to_tsquery
2860 ------------
2861 'sn'
2862 </screen>
2864 Notice that <literal>supernova star</literal> matches <literal>supernovae
2865 stars</literal> in <literal>thesaurus_astro</literal> because we specified
2866 the <literal>english_stem</literal> stemmer in the thesaurus definition.
2867 The stemmer removed the <literal>e</literal> and <literal>s</literal>.
2868 </para>
2870 <para>
2871 To index the original phrase as well as the substitute, just include it
2872 in the right-hand part of the definition:
2874 <screen>
2875 supernovae stars : sn supernovae stars
2877 SELECT plainto_tsquery('supernova star');
2878 plainto_tsquery
2879 -----------------------------
2880 'sn' &amp; 'supernova' &amp; 'star'
2881 </screen>
2882 </para>
2884 </sect3>
2886 </sect2>
2888 <sect2 id="textsearch-ispell-dictionary">
2889 <title><application>Ispell</application> Dictionary</title>
2891 <para>
2892 The <application>Ispell</application> dictionary template supports
2893 <firstterm>morphological dictionaries</firstterm>, which can normalize many
2894 different linguistic forms of a word into the same lexeme. For example,
2895 an English <application>Ispell</application> dictionary can match all declensions and
2896 conjugations of the search term <literal>bank</literal>, e.g.,
2897 <literal>banking</literal>, <literal>banked</literal>, <literal>banks</literal>,
2898 <literal>banks'</literal>, and <literal>bank's</literal>.
2899 </para>
2901 <para>
2902 The standard <productname>PostgreSQL</productname> distribution does
2903 not include any <application>Ispell</application> configuration files.
2904 Dictionaries for a large number of languages are available from <ulink
2905 url="https://www.cs.hmc.edu/~geoff/ispell.html">Ispell</ulink>.
2906 Also, some more modern dictionary file formats are supported &mdash; <ulink
2907 url="https://en.wikipedia.org/wiki/MySpell">MySpell</ulink> (OO &lt; 2.0.1)
2908 and <ulink url="https://hunspell.github.io/">Hunspell</ulink>
2909 (OO &gt;= 2.0.2). A large list of dictionaries is available on the <ulink
2910 url="https://wiki.openoffice.org/wiki/Dictionaries">OpenOffice
2911 Wiki</ulink>.
2912 </para>
2914 <para>
2915 To create an <application>Ispell</application> dictionary perform these steps:
2916 </para>
2917 <itemizedlist spacing="compact" mark="bullet">
2918 <listitem>
2919 <para>
download dictionary configuration files. <productname>OpenOffice</productname>
extension files have the <filename>.oxt</filename> extension. It is necessary
to extract the <filename>.aff</filename> and <filename>.dic</filename> files and change
their extensions to <filename>.affix</filename> and <filename>.dict</filename>. For some
dictionary files it is also necessary to convert the characters to UTF-8
encoding with commands like the following (shown for a Norwegian language dictionary):
2926 <programlisting>
2927 iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff
2928 iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic
2929 </programlisting>
2930 </para>
2931 </listitem>
2932 <listitem>
2933 <para>
2934 copy files to the <filename>$SHAREDIR/tsearch_data</filename> directory
2935 </para>
2936 </listitem>
2937 <listitem>
2938 <para>
2939 load files into PostgreSQL with the following command:
2940 <programlisting>
2941 CREATE TEXT SEARCH DICTIONARY english_hunspell (
2942 TEMPLATE = ispell,
2943 DictFile = en_us,
2944 AffFile = en_us,
2945 Stopwords = english);
2946 </programlisting>
2947 </para>
2948 </listitem>
2949 </itemizedlist>
2951 <para>
2952 Here, <literal>DictFile</literal>, <literal>AffFile</literal>, and <literal>StopWords</literal>
2953 specify the base names of the dictionary, affixes, and stop-words files.
2954 The stop-words file has the same format explained above for the
2955 <literal>simple</literal> dictionary type. The format of the other files is
2956 not specified here but is available from the above-mentioned web sites.
2957 </para>
2959 <para>
2960 Ispell dictionaries usually recognize a limited set of words, so they
2961 should be followed by another broader dictionary; for
2962 example, a Snowball dictionary, which recognizes everything.
2963 </para>
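<para>
For example, continuing the <literal>english_hunspell</literal>
definition above, ASCII words could be mapped to the Ispell dictionary
first, falling back to a Snowball stemmer:

<programlisting>
ALTER TEXT SEARCH CONFIGURATION english
ALTER MAPPING FOR asciiword
WITH english_hunspell, english_stem;
</programlisting>
</para>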
2965 <para>
2966 The <filename>.affix</filename> file of <application>Ispell</application> has the following
2967 structure:
2968 <programlisting>
2969 prefixes
2970 flag *A:
2971 . > RE # As in enter > reenter
2972 suffixes
2973 flag T:
2974 E > ST # As in late > latest
2975 [^AEIOU]Y > -Y,IEST # As in dirty > dirtiest
2976 [AEIOU]Y > EST # As in gray > grayest
2977 [^EY] > EST # As in small > smallest
2978 </programlisting>
2979 </para>
2980 <para>
2981 And the <filename>.dict</filename> file has the following structure:
2982 <programlisting>
2983 lapse/ADGRS
2984 lard/DGRS
2985 large/PRTY
2986 lark/MRS
2987 </programlisting>
2988 </para>
2990 <para>
The format of the <filename>.dict</filename> file is:
2992 <programlisting>
2993 basic_form/affix_class_name
2994 </programlisting>
2995 </para>
2997 <para>
2998 In the <filename>.affix</filename> file every affix flag is described in the
2999 following format:
3000 <programlisting>
3001 condition > [-stripping_letters,] adding_affix
3002 </programlisting>
3003 </para>
3005 <para>
Here, the condition has a format similar to that of regular expressions.
3007 It can use groupings <literal>[...]</literal> and <literal>[^...]</literal>.
3008 For example, <literal>[AEIOU]Y</literal> means that the last letter of the word
3009 is <literal>"y"</literal> and the penultimate letter is <literal>"a"</literal>,
3010 <literal>"e"</literal>, <literal>"i"</literal>, <literal>"o"</literal> or <literal>"u"</literal>.
3011 <literal>[^EY]</literal> means that the last letter is neither <literal>"e"</literal>
3012 nor <literal>"y"</literal>.
3013 </para>
3015 <para>
Ispell dictionaries support splitting compound words,
a useful feature.
3018 Notice that the affix file should specify a special flag using the
3019 <literal>compoundwords controlled</literal> statement that marks dictionary
3020 words that can participate in compound formation:
3022 <programlisting>
3023 compoundwords controlled z
3024 </programlisting>
3026 Here are some examples for the Norwegian language:
3028 <programlisting>
3029 SELECT ts_lexize('norwegian_ispell', 'overbuljongterningpakkmesterassistent');
3030 {over,buljong,terning,pakk,mester,assistent}
3031 SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
3032 {sjokoladefabrikk,sjokolade,fabrikk}
3033 </programlisting>
3034 </para>
3036 <para>
3037 The <application>MySpell</application> format is a subset of the <application>Hunspell</application> format.
3038 The <filename>.affix</filename> file of <application>Hunspell</application> has the following
3039 structure:
3040 <programlisting>
3041 PFX A Y 1
3042 PFX A 0 re .
3043 SFX T N 4
3044 SFX T 0 st e
3045 SFX T y iest [^aeiou]y
3046 SFX T 0 est [aeiou]y
3047 SFX T 0 est [^ey]
3048 </programlisting>
3049 </para>
3051 <para>
3052 The first line of an affix class is the header. The fields of the affix rules are
3053 listed after the header (a worked example follows the list):
3054 </para>
3055 <itemizedlist spacing="compact" mark="bullet">
3056 <listitem>
3057 <para>
3058 parameter name (PFX or SFX)
3059 </para>
3060 </listitem>
3061 <listitem>
3062 <para>
3063 flag (name of the affix class)
3064 </para>
3065 </listitem>
3066 <listitem>
3067 <para>
3068 stripping characters from beginning (at prefix) or end (at suffix) of the
3069 word
3070 </para>
3071 </listitem>
3072 <listitem>
3073 <para>
3074 adding affix
3075 </para>
3076 </listitem>
3077 <listitem>
3078 <para>
3079 condition that has a format similar to the format of regular expressions.
3080 </para>
3081 </listitem>
3082 </itemizedlist>
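<para>
For example, the rule <literal>SFX T y iest [^aeiou]y</literal> is the
Hunspell equivalent of the Ispell rule <literal>[^AEIOU]Y &gt; -Y,IEST</literal>
shown earlier: for a word ending in a consonant followed by
<literal>y</literal>, strip the <literal>y</literal> and add
<literal>iest</literal>, as in <literal>dirty</literal> &gt;
<literal>dirtiest</literal>.
</para>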
3084 <para>
3085 The <filename>.dict</filename> file looks like the <filename>.dict</filename> file of
3086 <application>Ispell</application>:
3087 <programlisting>
3088 larder/M
3089 lardy/RT
3090 large/RSPMYT
3091 largehearted
3092 </programlisting>
3093 </para>
3095 <note>
3096 <para>
3097 <application>MySpell</application> does not support compound words.
3098 <application>Hunspell</application> has sophisticated support for compound words. At
3099 present, <productname>PostgreSQL</productname> implements only the basic
3100 compound word operations of Hunspell.
3101 </para>
3102 </note>
3104 </sect2>
3106 <sect2 id="textsearch-snowball-dictionary">
3107 <title><application>Snowball</application> Dictionary</title>
3109 <para>
3110 The <application>Snowball</application> dictionary template is based on a project
3111 by Martin Porter, inventor of the popular Porter stemming algorithm
3112 for the English language. Snowball now provides stemming algorithms for
3113 many languages (see the <ulink url="https://snowballstem.org/">Snowball
3114 site</ulink> for more information). Each algorithm understands how to
3115 reduce common variant forms of words to a base, or stem, spelling within
3116 its language. A Snowball dictionary requires a <literal>language</literal>
3117 parameter to identify which stemmer to use, and optionally can specify a
3118 <literal>stopword</literal> file name that gives a list of words to eliminate.
3119 (<productname>PostgreSQL</productname>'s standard stopword lists are also
3120 provided by the Snowball project.)
3121 For example, there is a built-in definition equivalent to
3123 <programlisting>
3124 CREATE TEXT SEARCH DICTIONARY english_stem (
3125 TEMPLATE = snowball,
3126 Language = english,
3127 StopWords = english
3128 );
3129 </programlisting>
3131 The stopword file format is the same as already explained.
3132 </para>
3134 <para>
3135 A <application>Snowball</application> dictionary recognizes everything, whether
3136 or not it is able to simplify the word, so it should be placed
3137 at the end of the dictionary list. It is useless to have it
3138 before any other dictionary because a token will never pass through it to
3139 the next dictionary.
3140 </para>
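<para>
This also means that <function>ts_lexize</function> never returns
<literal>NULL</literal> for a Snowball dictionary; a word the stemmer
cannot simplify comes back essentially unchanged (lower-cased), for
example:
<screen>
SELECT ts_lexize('english_stem', 'SQL');
 ts_lexize
-----------
 {sql}
</screen>
</para>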
3142 </sect2>
3144 </sect1>
3146 <sect1 id="textsearch-configuration">
3147 <title>Configuration Example</title>
3149 <para>
3150 A text search configuration specifies all options necessary to transform a
3151 document into a <type>tsvector</type>: the parser to use to break text
3152 into tokens, and the dictionaries to use to transform each token into a
3153 lexeme. Every call of
3154 <function>to_tsvector</function> or <function>to_tsquery</function>
3155 needs a text search configuration to perform its processing.
3156 The configuration parameter
3157 <xref linkend="guc-default-text-search-config"/>
3158 specifies the name of the default configuration, which is the
3159 one used by text search functions if an explicit configuration
3160 parameter is omitted.
3161 It can be set in <filename>postgresql.conf</filename>, or set for an
3162 individual session using the <command>SET</command> command.
3163 </para>
3165 <para>
3166 Several predefined text search configurations are available, and
3167 you can create custom configurations easily. To facilitate management
3168 of text search objects, a set of <acronym>SQL</acronym> commands
3169 is available, and there are several <application>psql</application> commands that display information
3170 about text search objects (<xref linkend="textsearch-psql"/>).
3171 </para>
3173 <para>
3174 As an example we will create a configuration
3175 <literal>pg</literal>, starting by duplicating the built-in
3176 <literal>english</literal> configuration:
3178 <programlisting>
3179 CREATE TEXT SEARCH CONFIGURATION public.pg ( COPY = pg_catalog.english );
3180 </programlisting>
3181 </para>
3183 <para>
3184 We will use a PostgreSQL-specific synonym list
3185 and store it in <filename>$SHAREDIR/tsearch_data/pg_dict.syn</filename>.
3186 The file contents look like:
3188 <programlisting>
3189 postgres pg
3190 pgsql pg
3191 postgresql pg
3192 </programlisting>
3194 We define the synonym dictionary like this:
3196 <programlisting>
3197 CREATE TEXT SEARCH DICTIONARY pg_dict (
3198 TEMPLATE = synonym,
3199 SYNONYMS = pg_dict
3200 );
3201 </programlisting>
3203 Next we register the <productname>Ispell</productname> dictionary
3204 <literal>english_ispell</literal>, which has its own configuration files:
3206 <programlisting>
3207 CREATE TEXT SEARCH DICTIONARY english_ispell (
3208 TEMPLATE = ispell,
3209 DictFile = english,
3210 AffFile = english,
3211 StopWords = english
3212 );
3213 </programlisting>
3215 Now we can set up the mappings for words in configuration
3216 <literal>pg</literal>:
3218 <programlisting>
3219 ALTER TEXT SEARCH CONFIGURATION pg
3220 ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
3221 word, hword, hword_part
3222 WITH pg_dict, english_ispell, english_stem;
3223 </programlisting>
3225 We choose not to index or search some token types that the built-in
3226 configuration does handle:
3228 <programlisting>
3229 ALTER TEXT SEARCH CONFIGURATION pg
3230 DROP MAPPING FOR email, url, url_path, sfloat, float;
3231 </programlisting>
3232 </para>
3234 <para>
3235 Now we can test our configuration:
3237 <programlisting>
3238 SELECT * FROM ts_debug('public.pg', '
3239 PostgreSQL, the highly scalable, SQL compliant, open source object-relational
3240 database management system, is now undergoing beta testing of the next
3241 version of our software.
3242 ');
3243 </programlisting>
3244 </para>
3246 <para>
3247 The next step is to set the session to use the new configuration, which was
3248 created in the <literal>public</literal> schema:
3250 <screen>
3251 =&gt; \dF
3252 List of text search configurations
3253 Schema | Name | Description
3254 ---------+------+-------------
3255 public | pg |
3257 SET default_text_search_config = 'public.pg';
3258 SET
3260 SHOW default_text_search_config;
3261 default_text_search_config
3262 ----------------------------
3263 public.pg
3264 </screen>
3265 </para>
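<para>
With the new configuration in effect, the synonym mapping is applied
automatically by the text search functions. Assuming the
<literal>english_ispell</literal> dictionary files are installed, the
result would look roughly like this (the exact lexemes depend on the
installed files):
<screen>
SELECT to_tsvector('PostgreSQL is cool');
   to_tsvector
-----------------
 'cool':3 'pg':1
</screen>
</para>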
3267 </sect1>
3269 <sect1 id="textsearch-debugging">
3270 <title>Testing and Debugging Text Search</title>
3272 <para>
3273 The behavior of a custom text search configuration can easily become
3274 confusing. The functions described
3275 in this section are useful for testing text search objects. You can
3276 test a complete configuration, or test parsers and dictionaries separately.
3277 </para>
3279 <sect2 id="textsearch-configuration-testing">
3280 <title>Configuration Testing</title>
3282 <para>
3283 The function <function>ts_debug</function> allows easy testing of a
3284 text search configuration.
3285 </para>
3287 <indexterm>
3288 <primary>ts_debug</primary>
3289 </indexterm>
3291 <synopsis>
3292 ts_debug(<optional> <replaceable class="parameter">config</replaceable> <type>regconfig</type>, </optional> <replaceable class="parameter">document</replaceable> <type>text</type>,
3293 OUT <replaceable class="parameter">alias</replaceable> <type>text</type>,
3294 OUT <replaceable class="parameter">description</replaceable> <type>text</type>,
3295 OUT <replaceable class="parameter">token</replaceable> <type>text</type>,
3296 OUT <replaceable class="parameter">dictionaries</replaceable> <type>regdictionary[]</type>,
3297 OUT <replaceable class="parameter">dictionary</replaceable> <type>regdictionary</type>,
3298 OUT <replaceable class="parameter">lexemes</replaceable> <type>text[]</type>)
3299 returns setof record
3300 </synopsis>
3302 <para>
3303 <function>ts_debug</function> displays information about every token of
3304 <replaceable class="parameter">document</replaceable> as produced by the
3305 parser and processed by the configured dictionaries. It uses the
3306 configuration specified by <replaceable
3307 class="parameter">config</replaceable>,
3308 or <varname>default_text_search_config</varname> if that argument is
3309 omitted.
3310 </para>
3312 <para>
3313 <function>ts_debug</function> returns one row for each token identified in the text
3314 by the parser. The columns returned are
3316 <itemizedlist spacing="compact" mark="bullet">
3317 <listitem>
3318 <para>
3319 <replaceable>alias</replaceable> <type>text</type> &mdash; short name of the token type
3320 </para>
3321 </listitem>
3322 <listitem>
3323 <para>
3324 <replaceable>description</replaceable> <type>text</type> &mdash; description of the
3325 token type
3326 </para>
3327 </listitem>
3328 <listitem>
3329 <para>
3330 <replaceable>token</replaceable> <type>text</type> &mdash; text of the token
3331 </para>
3332 </listitem>
3333 <listitem>
3334 <para>
3335 <replaceable>dictionaries</replaceable> <type>regdictionary[]</type> &mdash; the
3336 dictionaries selected by the configuration for this token type
3337 </para>
3338 </listitem>
3339 <listitem>
3340 <para>
3341 <replaceable>dictionary</replaceable> <type>regdictionary</type> &mdash; the dictionary
3342 that recognized the token, or <literal>NULL</literal> if none did
3343 </para>
3344 </listitem>
3345 <listitem>
3346 <para>
3347 <replaceable>lexemes</replaceable> <type>text[]</type> &mdash; the lexeme(s) produced
3348 by the dictionary that recognized the token, or <literal>NULL</literal> if
3349 none did; an empty array (<literal>{}</literal>) means it was recognized as a
3350 stop word
3351 </para>
3352 </listitem>
3353 </itemizedlist>
3354 </para>
3356 <para>
3357 Here is a simple example:
3359 <screen>
3360 SELECT * FROM ts_debug('english', 'a fat cat sat on a mat - it ate a fat rats');
3361 alias | description | token | dictionaries | dictionary | lexemes
3362 -----------+-----------------+-------+----------------+--------------+---------
3363 asciiword | Word, all ASCII | a | {english_stem} | english_stem | {}
3364 blank | Space symbols | | {} | |
3365 asciiword | Word, all ASCII | fat | {english_stem} | english_stem | {fat}
3366 blank | Space symbols | | {} | |
3367 asciiword | Word, all ASCII | cat | {english_stem} | english_stem | {cat}
3368 blank | Space symbols | | {} | |
3369 asciiword | Word, all ASCII | sat | {english_stem} | english_stem | {sat}
3370 blank | Space symbols | | {} | |
3371 asciiword | Word, all ASCII | on | {english_stem} | english_stem | {}
3372 blank | Space symbols | | {} | |
3373 asciiword | Word, all ASCII | a | {english_stem} | english_stem | {}
3374 blank | Space symbols | | {} | |
3375 asciiword | Word, all ASCII | mat | {english_stem} | english_stem | {mat}
3376 blank | Space symbols | | {} | |
3377 blank | Space symbols | - | {} | |
3378 asciiword | Word, all ASCII | it | {english_stem} | english_stem | {}
3379 blank | Space symbols | | {} | |
3380 asciiword | Word, all ASCII | ate | {english_stem} | english_stem | {ate}
3381 blank | Space symbols | | {} | |
3382 asciiword | Word, all ASCII | a | {english_stem} | english_stem | {}
3383 blank | Space symbols | | {} | |
3384 asciiword | Word, all ASCII | fat | {english_stem} | english_stem | {fat}
3385 blank | Space symbols | | {} | |
3386 asciiword | Word, all ASCII | rats | {english_stem} | english_stem | {rat}
3387 </screen>
3388 </para>
3390 <para>
3391 For a more extensive demonstration, we
3392 first create a <literal>public.english</literal> configuration and
3393 Ispell dictionary for the English language:
3394 </para>
3396 <programlisting>
3397 CREATE TEXT SEARCH CONFIGURATION public.english ( COPY = pg_catalog.english );
3399 CREATE TEXT SEARCH DICTIONARY english_ispell (
3400 TEMPLATE = ispell,
3401 DictFile = english,
3402 AffFile = english,
3403 StopWords = english
3404 );
3406 ALTER TEXT SEARCH CONFIGURATION public.english
3407 ALTER MAPPING FOR asciiword WITH english_ispell, english_stem;
3408 </programlisting>
3410 <screen>
3411 SELECT * FROM ts_debug('public.english', 'The Brightest supernovaes');
3412 alias | description | token | dictionaries | dictionary | lexemes
3413 -----------+-----------------+-------------+-------------------------------+----------------+-------------
3414 asciiword | Word, all ASCII | The | {english_ispell,english_stem} | english_ispell | {}
3415 blank | Space symbols | | {} | |
3416 asciiword | Word, all ASCII | Brightest | {english_ispell,english_stem} | english_ispell | {bright}
3417 blank | Space symbols | | {} | |
3418 asciiword | Word, all ASCII | supernovaes | {english_ispell,english_stem} | english_stem | {supernova}
3419 </screen>
3421 <para>
3422 In this example, the word <literal>Brightest</literal> was recognized by the
3423 parser as an <literal>ASCII word</literal> (alias <literal>asciiword</literal>).
3424 For this token type the dictionary list is
3425 <literal>english_ispell</literal> and
3426 <literal>english_stem</literal>. The word was recognized by
3427 <literal>english_ispell</literal>, which reduced it to the noun
3428 <literal>bright</literal>. The word <literal>supernovaes</literal> is
3429 unknown to the <literal>english_ispell</literal> dictionary so it
3430 was passed to the next dictionary, and, fortunately, was recognized (in
3431 fact, <literal>english_stem</literal> is a Snowball dictionary which
3432 recognizes everything; that is why it was placed at the end of the
3433 dictionary list).
3434 </para>
3436 <para>
3437 The word <literal>The</literal> was recognized by the
3438 <literal>english_ispell</literal> dictionary as a stop word (<xref
3439 linkend="textsearch-stopwords"/>) and will not be indexed.
3440 The spaces are discarded too, since the configuration provides no
3441 dictionaries at all for them.
3442 </para>
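<para>
Consequently, <function>to_tsvector</function> with this configuration
indexes only the two content words, keeping their original positions:
<screen>
SELECT to_tsvector('public.english', 'The Brightest supernovaes');
        to_tsvector
---------------------------
 'bright':2 'supernova':3
</screen>
</para>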
3444 <para>
3445 You can reduce the width of the output by explicitly specifying which columns
3446 you want to see:
3448 <screen>
3449 SELECT alias, token, dictionary, lexemes
3450 FROM ts_debug('public.english', 'The Brightest supernovaes');
3451 alias | token | dictionary | lexemes
3452 -----------+-------------+----------------+-------------
3453 asciiword | The | english_ispell | {}
3454 blank | | |
3455 asciiword | Brightest | english_ispell | {bright}
3456 blank | | |
3457 asciiword | supernovaes | english_stem | {supernova}
3458 </screen>
3459 </para>
3461 </sect2>
3463 <sect2 id="textsearch-parser-testing">
3464 <title>Parser Testing</title>
3466 <para>
3467 The following functions allow direct testing of a text search parser.
3468 </para>
3470 <indexterm>
3471 <primary>ts_parse</primary>
3472 </indexterm>
3474 <synopsis>
3475 ts_parse(<replaceable class="parameter">parser_name</replaceable> <type>text</type>, <replaceable class="parameter">document</replaceable> <type>text</type>,
3476 OUT <replaceable class="parameter">tokid</replaceable> <type>integer</type>, OUT <replaceable class="parameter">token</replaceable> <type>text</type>) returns <type>setof record</type>
3477 ts_parse(<replaceable class="parameter">parser_oid</replaceable> <type>oid</type>, <replaceable class="parameter">document</replaceable> <type>text</type>,
3478 OUT <replaceable class="parameter">tokid</replaceable> <type>integer</type>, OUT <replaceable class="parameter">token</replaceable> <type>text</type>) returns <type>setof record</type>
3479 </synopsis>
3481 <para>
3482 <function>ts_parse</function> parses the given <replaceable>document</replaceable>
3483 and returns a series of records, one for each token produced by
3484 parsing. Each record includes a <varname>tokid</varname> showing the
3485 assigned token type and a <varname>token</varname> which is the text of the
3486 token. For example:
3488 <screen>
3489 SELECT * FROM ts_parse('default', '123 - a number');
3490 tokid | token
3491 -------+--------
3492 22 | 123
3493 12 |
3494 12 | -
3495 1 | a
3496 12 |
3497 1 | number
3498 </screen>
3499 </para>
3501 <indexterm>
3502 <primary>ts_token_type</primary>
3503 </indexterm>
3505 <synopsis>
3506 ts_token_type(<replaceable class="parameter">parser_name</replaceable> <type>text</type>, OUT <replaceable class="parameter">tokid</replaceable> <type>integer</type>,
3507 OUT <replaceable class="parameter">alias</replaceable> <type>text</type>, OUT <replaceable class="parameter">description</replaceable> <type>text</type>) returns <type>setof record</type>
3508 ts_token_type(<replaceable class="parameter">parser_oid</replaceable> <type>oid</type>, OUT <replaceable class="parameter">tokid</replaceable> <type>integer</type>,
3509 OUT <replaceable class="parameter">alias</replaceable> <type>text</type>, OUT <replaceable class="parameter">description</replaceable> <type>text</type>) returns <type>setof record</type>
3510 </synopsis>
3512 <para>
3513 <function>ts_token_type</function> returns a table which describes each type of
3514 token the specified parser can recognize. For each token type, the table
3515 gives the integer <varname>tokid</varname> that the parser uses to label a
3516 token of that type, the <varname>alias</varname> that names the token type
3517 in configuration commands, and a short <varname>description</varname>. For
3518 example:
3520 <screen>
3521 SELECT * FROM ts_token_type('default');
3522 tokid | alias | description
3523 -------+-----------------+------------------------------------------
3524 1 | asciiword | Word, all ASCII
3525 2 | word | Word, all letters
3526 3 | numword | Word, letters and digits
3527 4 | email | Email address
3528 5 | url | URL
3529 6 | host | Host
3530 7 | sfloat | Scientific notation
3531 8 | version | Version number
3532 9 | hword_numpart | Hyphenated word part, letters and digits
3533 10 | hword_part | Hyphenated word part, all letters
3534 11 | hword_asciipart | Hyphenated word part, all ASCII
3535 12 | blank | Space symbols
3536 13 | tag | XML tag
3537 14 | protocol | Protocol head
3538 15 | numhword | Hyphenated word, letters and digits
3539 16 | asciihword | Hyphenated word, all ASCII
3540 17 | hword | Hyphenated word, all letters
3541 18 | url_path | URL path
3542 19 | file | File or path name
3543 20 | float | Decimal notation
3544 21 | int | Signed integer
3545 22 | uint | Unsigned integer
3546 23 | entity | XML entity
3547 </screen>
3548 </para>
3550 </sect2>
3552 <sect2 id="textsearch-dictionary-testing">
3553 <title>Dictionary Testing</title>
3555 <para>
3556 The <function>ts_lexize</function> function facilitates dictionary testing.
3557 </para>
3559 <indexterm>
3560 <primary>ts_lexize</primary>
3561 </indexterm>
3563 <synopsis>
3564 ts_lexize(<replaceable class="parameter">dict</replaceable> <type>regdictionary</type>, <replaceable class="parameter">token</replaceable> <type>text</type>) returns <type>text[]</type>
3565 </synopsis>
3567 <para>
3568 <function>ts_lexize</function> returns an array of lexemes if the input
3569 <replaceable>token</replaceable> is known to the dictionary,
3570 or an empty array if the token
3571 is known to the dictionary but it is a stop word, or
3572 <literal>NULL</literal> if it is an unknown word.
3573 </para>
3575 <para>
3576 Examples:
3578 <screen>
3579 SELECT ts_lexize('english_stem', 'stars');
3580 ts_lexize
3581 -----------
3582 {star}
3584 SELECT ts_lexize('english_stem', 'a');
3585 ts_lexize
3586 -----------
3587 {}
3588 </screen>
3589 </para>
3591 <note>
3592 <para>
3593 The <function>ts_lexize</function> function expects a single
3594 <emphasis>token</emphasis>, not text. Here is a case
3595 where this can be confusing:
3597 <screen>
3598 SELECT ts_lexize('thesaurus_astro', 'supernovae stars') is null;
3599 ?column?
3600 ----------
3601 t
3602 </screen>
3604 The thesaurus dictionary <literal>thesaurus_astro</literal> does know the
3605 phrase <literal>supernovae stars</literal>, but <function>ts_lexize</function>
3606 fails since it does not parse the input text but treats it as a single
3607 token. Use <function>plainto_tsquery</function> or <function>to_tsvector</function> to
3608 test thesaurus dictionaries, for example:
3610 <screen>
3611 SELECT plainto_tsquery('supernovae stars');
3612 plainto_tsquery
3613 -----------------
3614 'sn'
3615 </screen>
3616 </para>
3617 </note>
3619 </sect2>
3621 </sect1>
3623 <sect1 id="textsearch-indexes">
3624 <title>Preferred Index Types for Text Search</title>
3626 <indexterm zone="textsearch-indexes">
3627 <primary>text search</primary>
3628 <secondary>indexes</secondary>
3629 </indexterm>
3631 <para>
3632 There are two kinds of indexes that can be used to speed up full text
3633 searches:
3634 <link linkend="gin"><acronym>GIN</acronym></link> and
3635 <link linkend="gist"><acronym>GiST</acronym></link>.
3636 Note that indexes are not mandatory for full text searching, but in
3637 cases where a column is searched on a regular basis, an index is
3638 usually desirable.
3639 </para>
3641 <para>
3642 To create such an index, do one of:
3644 <variablelist>
3646 <varlistentry>
3648 <term>
3649 <indexterm zone="textsearch-indexes">
3650 <primary>index</primary>
3651 <secondary>GIN</secondary>
3652 <tertiary>text search</tertiary>
3653 </indexterm>
3655 <literal>CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> USING GIN (<replaceable>column</replaceable>);</literal>
3656 </term>
3658 <listitem>
3659 <para>
3660 Creates a GIN (Generalized Inverted Index)-based index.
3661 The <replaceable>column</replaceable> must be of <type>tsvector</type> type.
3662 </para>
3663 </listitem>
3664 </varlistentry>
3666 <varlistentry>
3668 <term>
3669 <indexterm zone="textsearch-indexes">
3670 <primary>index</primary>
3671 <secondary>GiST</secondary>
3672 <tertiary>text search</tertiary>
3673 </indexterm>
3675 <literal>CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> USING GIST (<replaceable>column</replaceable> [ { DEFAULT | tsvector_ops } (siglen = <replaceable>number</replaceable>) ] );</literal>
3676 </term>
3678 <listitem>
3679 <para>
3680 Creates a GiST (Generalized Search Tree)-based index.
3681 The <replaceable>column</replaceable> can be of <type>tsvector</type> or
3682 <type>tsquery</type> type.
3683 Optional integer parameter <literal>siglen</literal> determines
3684 signature length in bytes (see below for details).
3685 </para>
3686 </listitem>
3687 </varlistentry>
3689 </variablelist>
3690 </para>
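<para>
For example, assuming a table <structname>documents</structname> with a
precomputed <type>tsvector</type> column named
<structfield>textsearch</structfield> (hypothetical names), a GIN index
could be created with:
<programlisting>
CREATE INDEX documents_textsearch_idx ON documents USING GIN (textsearch);
</programlisting>
</para>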
3692 <para>
3693 GIN indexes are the preferred text search index type. As inverted
3694 indexes, they contain an index entry for each word (lexeme), with a
3695 compressed list of matching locations. Multi-word searches can find
3696 the first match, then use the index to remove rows that are lacking
3697 additional words. GIN indexes store only the words (lexemes) of
3698 <type>tsvector</type> values, and not their weight labels. Thus a table
3699 row recheck is needed when using a query that involves weights.
3700 </para>
3702 <para>
3703 A GiST index is <firstterm>lossy</firstterm>, meaning that the index
3704 might produce false matches, and it is necessary
3705 to check the actual table row to eliminate such false matches.
3706 (<productname>PostgreSQL</productname> does this automatically when needed.)
3707 GiST indexes are lossy because each document is represented in the
3708 index by a fixed-length signature. The signature length in bytes is determined
3709 by the value of the optional integer parameter <literal>siglen</literal>.
3710 The default signature length (when <literal>siglen</literal> is not specified) is
3711 124 bytes; the maximum signature length is 2024 bytes. The signature is generated by hashing
3712 each word into a single bit in an n-bit string, with all these bits OR-ed
3713 together to produce an n-bit document signature. When two words hash to
3714 the same bit position there will be a false match. If all words in
3715 the query have matches (real or false) then the table row must be
3716 retrieved to see if the match is correct. Longer signatures lead to a more
3717 precise search (scanning a smaller fraction of the index and fewer heap
3718 pages), at the cost of a larger index.
3719 </para>
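<para>
For example, here is a sketch of a GiST index with a longer, 256-byte
signature, reusing the hypothetical table and column names from above:
<programlisting>
CREATE INDEX documents_textsearch_gist_idx ON documents
    USING GIST (textsearch tsvector_ops (siglen = 256));
</programlisting>
</para>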
3721 <para>
3722 A GiST index can be covering, i.e., use the <literal>INCLUDE</literal>
3723 clause. Included columns can have data types without any GiST operator
3724 class. Included attributes will be stored uncompressed.
3725 </para>
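<para>
A minimal sketch of such a covering index, where the included
<structfield>title</structfield> column is hypothetical:
<programlisting>
CREATE INDEX documents_textsearch_covering_idx ON documents
    USING GIST (textsearch) INCLUDE (title);
</programlisting>
</para>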
3727 <para>
3728 Lossiness causes performance degradation due to unnecessary fetches of table
3729 records that turn out to be false matches. Since random access to table
3730 records is slow, this limits the usefulness of GiST indexes. The
3731 likelihood of false matches depends on several factors, in particular the
3732 number of unique words, so using dictionaries to reduce this number is
3733 recommended.
3734 </para>
3736 <para>
3737 Note that <acronym>GIN</acronym> index build time can often be improved
3738 by increasing <xref linkend="guc-maintenance-work-mem"/>, while
3739 <acronym>GiST</acronym> index build time is not sensitive to that
3740 parameter.
3741 </para>
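<para>
For example, the parameter can be raised just for the session that builds
the index (the value shown is illustrative):
<programlisting>
SET maintenance_work_mem = '1GB';
CREATE INDEX <replaceable>name</replaceable> ON <replaceable>table</replaceable> USING GIN (<replaceable>column</replaceable>);
RESET maintenance_work_mem;
</programlisting>
</para>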
3743 <para>
3744 Partitioning of big collections and the proper use of GIN and GiST indexes
3745 allow the implementation of very fast searches with online update.
3746 Partitioning can be done at the database level using table inheritance,
3747 or by distributing documents over
3748 servers and collecting external search results, e.g., via <link
3749 linkend="ddl-foreign-data">Foreign Data</link> access.
3750 The latter is possible because ranking functions use
3751 only local information.
3752 </para>
3754 </sect1>
3756 <sect1 id="textsearch-psql">
3757 <title><application>psql</application> Support</title>
3759 <para>
3760 Information about text search configuration objects can be obtained
3761 in <application>psql</application> using a set of commands:
3762 <synopsis>
3763 \dF{d,p,t}<optional>+</optional> <optional>PATTERN</optional>
3764 </synopsis>
3765 An optional <literal>+</literal> produces more details.
3766 </para>
3768 <para>
3769 The optional parameter <replaceable>PATTERN</replaceable> can be the name of
3770 a text search object, optionally schema-qualified. If
3771 <replaceable>PATTERN</replaceable> is omitted then information about all
3772 visible objects will be displayed. <replaceable>PATTERN</replaceable> can be a
3773 regular expression and can provide <emphasis>separate</emphasis> patterns
3774 for the schema and object names. The following examples illustrate this:
3776 <screen>
3777 =&gt; \dF *fulltext*
3778 List of text search configurations
3779 Schema | Name | Description
3780 --------+--------------+-------------
3781 public | fulltext_cfg |
3782 </screen>
3784 <screen>
3785 =&gt; \dF *.fulltext*
3786 List of text search configurations
3787 Schema | Name | Description
3788 ----------+--------------+-------------
3789 fulltext | fulltext_cfg |
3790 public | fulltext_cfg |
3791 </screen>
3793 The available commands are:
3794 </para>
3796 <variablelist>
3797 <varlistentry>
3798 <term><literal>\dF<optional>+</optional> <optional>PATTERN</optional></literal></term>
3799 <listitem>
3800 <para>
3801 List text search configurations (add <literal>+</literal> for more detail).
3802 <screen>
3803 =&gt; \dF russian
3804 List of text search configurations
3805 Schema | Name | Description
3806 ------------+---------+------------------------------------
3807 pg_catalog | russian | configuration for russian language
3809 =&gt; \dF+ russian
3810 Text search configuration "pg_catalog.russian"
3811 Parser: "pg_catalog.default"
3812 Token | Dictionaries
3813 -----------------+--------------
3814 asciihword | english_stem
3815 asciiword | english_stem
3816 email | simple
3817 file | simple
3818 float | simple
3819 host | simple
3820 hword | russian_stem
3821 hword_asciipart | english_stem
3822 hword_numpart | simple
3823 hword_part | russian_stem
3824 int | simple
3825 numhword | simple
3826 numword | simple
3827 sfloat | simple
3828 uint | simple
3829 url | simple
3830 url_path | simple
3831 version | simple
3832 word | russian_stem
3833 </screen>
3834 </para>
3835 </listitem>
3836 </varlistentry>
3838 <varlistentry>
3839 <term><literal>\dFd<optional>+</optional> <optional>PATTERN</optional></literal></term>
3840 <listitem>
3841 <para>
3842 List text search dictionaries (add <literal>+</literal> for more detail).
3843 <screen>
3844 =&gt; \dFd
3845 List of text search dictionaries
3846 Schema | Name | Description
3847 ------------+-----------------+-----------------------------------------------------------
3848 pg_catalog | arabic_stem | snowball stemmer for arabic language
3849 pg_catalog | armenian_stem | snowball stemmer for armenian language
3850 pg_catalog | basque_stem | snowball stemmer for basque language
3851 pg_catalog | catalan_stem | snowball stemmer for catalan language
3852 pg_catalog | danish_stem | snowball stemmer for danish language
3853 pg_catalog | dutch_stem | snowball stemmer for dutch language
3854 pg_catalog | english_stem | snowball stemmer for english language
3855 pg_catalog | finnish_stem | snowball stemmer for finnish language
3856 pg_catalog | french_stem | snowball stemmer for french language
3857 pg_catalog | german_stem | snowball stemmer for german language
3858 pg_catalog | greek_stem | snowball stemmer for greek language
3859 pg_catalog | hindi_stem | snowball stemmer for hindi language
3860 pg_catalog | hungarian_stem | snowball stemmer for hungarian language
3861 pg_catalog | indonesian_stem | snowball stemmer for indonesian language
3862 pg_catalog | irish_stem | snowball stemmer for irish language
3863 pg_catalog | italian_stem | snowball stemmer for italian language
3864 pg_catalog | lithuanian_stem | snowball stemmer for lithuanian language
3865 pg_catalog | nepali_stem | snowball stemmer for nepali language
3866 pg_catalog | norwegian_stem | snowball stemmer for norwegian language
3867 pg_catalog | portuguese_stem | snowball stemmer for portuguese language
3868 pg_catalog | romanian_stem | snowball stemmer for romanian language
3869 pg_catalog | russian_stem | snowball stemmer for russian language
3870 pg_catalog | serbian_stem | snowball stemmer for serbian language
3871 pg_catalog | simple | simple dictionary: just lower case and check for stopword
3872 pg_catalog | spanish_stem | snowball stemmer for spanish language
3873 pg_catalog | swedish_stem | snowball stemmer for swedish language
3874 pg_catalog | tamil_stem | snowball stemmer for tamil language
3875 pg_catalog | turkish_stem | snowball stemmer for turkish language
3876 pg_catalog | yiddish_stem | snowball stemmer for yiddish language
3877 </screen>
3878 </para>
3879 </listitem>
3880 </varlistentry>
3882 <varlistentry>
3883 <term><literal>\dFp<optional>+</optional> <optional>PATTERN</optional></literal></term>
3884 <listitem>
3885 <para>
3886 List text search parsers (add <literal>+</literal> for more detail).
3887 <screen>
3888 =&gt; \dFp
3889 List of text search parsers
3890 Schema | Name | Description
3891 ------------+---------+---------------------
3892 pg_catalog | default | default word parser
3893 =&gt; \dFp+
3894 Text search parser "pg_catalog.default"
3895 Method | Function | Description
3896 -----------------+----------------+-------------
3897 Start parse | prsd_start |
3898 Get next token | prsd_nexttoken |
3899 End parse | prsd_end |
3900 Get headline | prsd_headline |
3901 Get token types | prsd_lextype |
3903 Token types for parser "pg_catalog.default"
3904 Token name | Description
3905 -----------------+------------------------------------------
3906 asciihword | Hyphenated word, all ASCII
3907 asciiword | Word, all ASCII
3908 blank | Space symbols
3909 email | Email address
3910 entity | XML entity
3911 file | File or path name
3912 float | Decimal notation
3913 host | Host
3914 hword | Hyphenated word, all letters
3915 hword_asciipart | Hyphenated word part, all ASCII
3916 hword_numpart | Hyphenated word part, letters and digits
3917 hword_part | Hyphenated word part, all letters
3918 int | Signed integer
3919 numhword | Hyphenated word, letters and digits
3920 numword | Word, letters and digits
3921 protocol | Protocol head
3922 sfloat | Scientific notation
3923 tag | XML tag
3924 uint | Unsigned integer
3925 url | URL
3926 url_path | URL path
3927 version | Version number
3928 word | Word, all letters
3929 (23 rows)
3930 </screen>
3931 </para>
3932 </listitem>
3933 </varlistentry>
3935 <varlistentry>
3936 <term><literal>\dFt<optional>+</optional> <optional>PATTERN</optional></literal></term>
3937 <listitem>
3938 <para>
3939 List text search templates (add <literal>+</literal> for more detail).
3940 <screen>
3941 =&gt; \dFt
3942 List of text search templates
3943 Schema | Name | Description
3944 ------------+-----------+-----------------------------------------------------------
3945 pg_catalog | ispell | ispell dictionary
3946 pg_catalog | simple | simple dictionary: just lower case and check for stopword
3947 pg_catalog | snowball | snowball stemmer
3948 pg_catalog | synonym | synonym dictionary: replace word by its synonym
3949 pg_catalog | thesaurus | thesaurus dictionary: phrase by phrase substitution
3950 </screen>
3951 </para>
3952 </listitem>
3953 </varlistentry>
3954 </variablelist>
3956 </sect1>
3958 <sect1 id="textsearch-limitations">
3959 <title>Limitations</title>
3961 <para>
3962 The current limitations of <productname>PostgreSQL</productname>'s
3963 text search features are:
3964 <itemizedlist spacing="compact" mark="bullet">
3965 <listitem>
3966 <para>The length of each lexeme must be less than 2 kilobytes</para>
3967 </listitem>
3968 <listitem>
3969 <para>The length of a <type>tsvector</type> (lexemes + positions) must be
3970 less than 1 megabyte</para>
3971 </listitem>
3972 <listitem>
3973 <!-- TODO: number of lexemes in what? This is unclear -->
3974 <para>The number of lexemes must be less than
3975 2<superscript>64</superscript></para>
3976 </listitem>
3977 <listitem>
3978 <para>Position values in <type>tsvector</type> must be greater than 0 and
3979 no more than 16,383</para>
3980 </listitem>
3981 <listitem>
3982 <para>The match distance in a <literal>&lt;<replaceable>N</replaceable>&gt;</literal>
3983 (FOLLOWED BY) <type>tsquery</type> operator cannot be more than
3984 16,384</para>
3985 </listitem>
3986 <listitem>
3987 <para>No more than 256 positions per lexeme</para>
3988 </listitem>
3989 <listitem>
3990 <para>The number of nodes (lexemes + operators) in a <type>tsquery</type>
3991 must be less than 32,768</para>
3992 </listitem>
3993 </itemizedlist>
3994 </para>
3996 <para>
3997 For comparison, the <productname>PostgreSQL</productname> 8.1 documentation
3998 contained 10,441 unique words, a total of 335,420 words, and the most
3999 frequent word <quote>postgresql</quote> was mentioned 6,127 times in 655
4000 documents.
4001 </para>
4003 <!-- TODO we need to put a date on these numbers? -->
4004 <para>
4005 Another example &mdash; the <productname>PostgreSQL</productname> mailing
4006 list archives contained 910,989 unique words with 57,491,343 lexemes in
4007 461,020 messages.
4008 </para>
4010 </sect1>
4012 </chapter>