doc/src/sgml/unaccent.sgml

   1 <!-- doc/src/sgml/unaccent.sgml -->
   2
   3 <sect1 id="unaccent" xreflabel="unaccent">
   4  <title>unaccent &mdash; a text search dictionary which removes diacritics</title>
   5
   6  <indexterm zone="unaccent">
   7   <primary>unaccent</primary>
   8  </indexterm>
   9
  10  <para>
  11   <filename>unaccent</filename> is a text search dictionary that removes accents
  12   (diacritic signs) from lexemes.
  13   It is a filtering dictionary, which means its output is
  14   always passed to the next dictionary (if any), whereas a normal
  15   dictionary ends processing of a token once it recognizes it.  This
  16   allows accent-insensitive processing for full text search.
  17  </para>
  18
  19  <para>
  20   The current implementation of <filename>unaccent</filename> cannot be used as a
  21   normalizing dictionary for the <filename>thesaurus</filename> dictionary.
  22  </para>
  23
  24  <para>
  25   This module is considered <quote>trusted</quote>, that is, it can be
  26   installed by non-superusers who have <literal>CREATE</literal> privilege
  27   on the current database.
  28  </para>
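
 <para>
  As a minimal sketch, assuming the module's files are already present in
  the installation, a user with the required privilege could enable it in
  the current database with:
<programlisting>
CREATE EXTENSION unaccent;
</programlisting>
 </para>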
  29
  30  <sect2 id="unaccent-configuration">
  31   <title>Configuration</title>
  32
  33   <para>
  34    An <literal>unaccent</literal> dictionary accepts the following options:
  35   </para>
  36   <itemizedlist>
  37    <listitem>
  38     <para>
  39      <literal>RULES</literal> is the base name of the file containing the list of
  40      translation rules.  This file must be stored in
  41      <filename>$SHAREDIR/tsearch_data/</filename> (where <literal>$SHAREDIR</literal> means
  42      the <productname>PostgreSQL</productname> installation's shared-data directory).
  43      Its name must end in <literal>.rules</literal> (which is not to be included in
  44      the <literal>RULES</literal> parameter).
  45     </para>
  46    </listitem>
  47   </itemizedlist>
  48   <para>
  49    The rules file has the following format:
  50   </para>
  51   <itemizedlist>
  52    <listitem>
  53     <para>
  54      Each line represents one translation rule, consisting of an accented
  55      character followed by its unaccented replacement.  The first is translated
  56      into the second.  For example,
  57 <programlisting>
  58 &Agrave;        A
  59 &Aacute;        A
  60 &Acirc;        A
  61 &Atilde;        A
  62 &Auml;        A
  63 &Aring;        A
  64 &AElig;        AE
  65 </programlisting>
  66      The two characters must be separated by whitespace, and any leading or
  67      trailing whitespace on a line is ignored.
  68     </para>
  69    </listitem>
  70
  71    <listitem>
  72     <para>
  73      Alternatively, if only one character is given on a line, instances of
  74      that character are deleted; this is useful in languages where accents
  75      are represented by separate characters.
  76     </para>
  77    </listitem>
  78
  79    <listitem>
  80     <para>
  81      More generally, each <quote>character</quote> can be any string that does
  82      not contain whitespace, so <filename>unaccent</filename> dictionaries can also
  83      be used for other kinds of substring substitution besides diacritic removal.
  84     </para>
  85    </listitem>
  86
  87    <listitem>
  88     <para>
  89      Some translations, such as those for numeric symbols, must contain
  90      whitespace.  In that case, enclose the translated string in double quotes.
  91      To include a double quote in the translated string, write it as two
  92      double quotes.  For example:
  93 <programlisting>
  94 &frac14;      " 1/4"
  95 &frac12;      " 1/2"
  96 &frac34;      " 3/4"
  97 &ldquo;       """"
  98 &rdquo;       """"
  99 </programlisting>
 100     </para>
 101    </listitem>
 102
 103    <listitem>
 104     <para>
 105      As with other <productname>PostgreSQL</productname> text search configuration files,
 106      the rules file must be stored in UTF-8 encoding.  The data is
 107      automatically translated into the current database's encoding when
 108      loaded.  Any lines containing untranslatable characters are silently
 109      ignored, so that rules files can contain rules that are not applicable in
 110      the current encoding.
 111     </para>
 112    </listitem>
 113   </itemizedlist>
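
  <para>
   The rule formats described above can be combined in a single file.  The
   following is only an illustrative sketch of a hypothetical
   <filename>my_rules.rules</filename> file; the file name and the particular
   rules are assumptions chosen for demonstration, not something shipped with
   the module:
<programlisting>
&eacute;        e
&szlig;        ss
&#xB4;
&frac12;        " 1/2"
</programlisting>
   Here the first line is an ordinary one-to-one translation, the second
   replaces one character with a two-character string, the third (a lone
   spacing acute accent, U+00B4) deletes that character, and the last uses
   double quotes because its translation contains a space.
  </para>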
 114
 115   <para>
 116    A more complete example, which is directly useful for most European
 117    languages, can be found in <filename>unaccent.rules</filename>, which is installed
 118    in <filename>$SHAREDIR/tsearch_data/</filename> when the <filename>unaccent</filename>
 119    module is installed.  This rules file translates characters with accents
 120    to the same characters without accents, and it also expands ligatures
 121    into the equivalent series of simple characters (for example, &AElig; to
 122    AE).
 123   </para>
 124  </sect2>
 125
 126  <sect2 id="unaccent-usage">
 127   <title>Usage</title>
 128
 129   <para>
 130    Installing the <literal>unaccent</literal> extension creates a text
 131    search template <literal>unaccent</literal> and a dictionary <literal>unaccent</literal>
 132    based on it.  The <literal>unaccent</literal> dictionary has the default
 133    parameter setting <literal>RULES='unaccent'</literal>, which makes it immediately
 134    usable with the standard <filename>unaccent.rules</filename> file.
 135    If you wish, you can alter the parameter, for example
 136
 137 <programlisting>
 138 mydb=# ALTER TEXT SEARCH DICTIONARY unaccent (RULES='my_rules');
 139 </programlisting>
 140
 141    or create new dictionaries based on the template.
 142   </para>
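
  <para>
   Creating a separate dictionary leaves the default
   <literal>unaccent</literal> dictionary untouched.  A minimal sketch, in
   which the dictionary name <literal>my_unaccent</literal> is hypothetical
   and a rules file <filename>my_rules.rules</filename> is assumed to exist in
   <filename>$SHAREDIR/tsearch_data/</filename>:
<programlisting>
mydb=# CREATE TEXT SEARCH DICTIONARY my_unaccent (TEMPLATE = unaccent, RULES = 'my_rules');
</programlisting>
   The new dictionary can then be named wherever a dictionary is expected,
   for instance in <function>ts_lexize</function> or in an
   <command>ALTER TEXT SEARCH CONFIGURATION</command> mapping.
  </para>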
 143
 144   <para>
 145    To test the dictionary, you can try:
 146 <programlisting>
 147 mydb=# SELECT ts_lexize('unaccent','H&ocirc;tel');
 148  ts_lexize
 149 -----------
 150  {Hotel}
 151 (1 row)
 152 </programlisting>
 153   </para>
 154
 155   <para>
 156    Here is an example showing how to insert the
 157    <filename>unaccent</filename> dictionary into a text search configuration:
 158 <programlisting>
 159 mydb=# CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );
 160 mydb=# ALTER TEXT SEARCH CONFIGURATION fr
 161         ALTER MAPPING FOR hword, hword_part, word
 162         WITH unaccent, french_stem;
 163 mydb=# SELECT to_tsvector('fr','H&ocirc;tels de la Mer');
 164     to_tsvector
 165 -------------------
 166  'hotel':1 'mer':4
 167 (1 row)
 168
 169 mydb=# SELECT to_tsvector('fr','H&ocirc;tel de la Mer') @@ to_tsquery('fr','Hotels');
 170  ?column?
 171 ----------
 172  t
 173 (1 row)
 174
 175 mydb=# SELECT ts_headline('fr','H&ocirc;tel de la Mer',to_tsquery('fr','Hotels'));
 176       ts_headline
 177 ------------------------
 178  &lt;b&gt;H&ocirc;tel&lt;/b&gt; de la Mer
 179 (1 row)
 180 </programlisting>
 181   </para>
 182  </sect2>
 183
 184  <sect2 id="unaccent-functions">
 185  <title>Functions</title>
 186
 187  <para>
 188   The <function>unaccent()</function> function removes accents (diacritic signs) from
 189   a given string.  It is essentially a wrapper around
 190   <filename>unaccent</filename>-type dictionaries, but it can also be used outside
 191   normal text search contexts.
 192  </para>
 193
 194  <indexterm>
 195   <primary>unaccent</primary>
 196  </indexterm>
 197
 198 <synopsis>
 199 unaccent(<optional><replaceable class="parameter">dictionary</replaceable> <type>regdictionary</type>, </optional> <replaceable class="parameter">string</replaceable> <type>text</type>) returns <type>text</type>
 200 </synopsis>
 201
 202  <para>
 203   If the <replaceable class="parameter">dictionary</replaceable> argument is
 204   omitted, the text search dictionary named <literal>unaccent</literal> and
 205   appearing in the same schema as the <function>unaccent()</function>
 206   function itself is used.
 207  </para>
 208
 209  <para>
 210   For example:
 211 <programlisting>
 212 SELECT unaccent('unaccent', 'H&ocirc;tel');
 213 SELECT unaccent('H&ocirc;tel');
 214 </programlisting>
 215  </para>
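
 <para>
  Because the result is ordinary <type>text</type>, the function can also be
  used in comparisons, and the dictionary can be given with a schema-qualified
  name.  The following is a sketch only: the first query assumes the standard
  <filename>unaccent.rules</filename> file is in use, and the second assumes
  the extension was installed in the <literal>public</literal> schema:
<programlisting>
SELECT unaccent('H&ocirc;tel') = 'Hotel' AS matches;
SELECT unaccent('public.unaccent', 'H&ocirc;tel');
</programlisting>
 </para>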
 216  </sect2>
 217
 218 </sect1>