<!-- doc/src/sgml/unaccent.sgml -->
<sect1 id="unaccent" xreflabel="unaccent">
<title>unaccent &mdash; a text search dictionary which removes diacritics</title>
<indexterm zone="unaccent">
<primary>unaccent</primary>
<filename>unaccent</filename> is a text search dictionary that removes accents
(diacritic signs) from lexemes.
It's a filtering dictionary, which means its output is
always passed to the next dictionary (if any), unlike the normal
behavior of dictionaries. This allows accent-insensitive processing
for full text search.
The current implementation of <filename>unaccent</filename> cannot be used as a
normalizing dictionary for the <filename>thesaurus</filename> dictionary.
This module is considered <quote>trusted</quote>, that is, it can be
installed by non-superusers who have <literal>CREATE</literal> privilege
on the current database.
<sect2 id="unaccent-configuration">
<title>Configuration</title>
An <literal>unaccent</literal> dictionary accepts the following options:
<literal>RULES</literal> is the base name of the file containing the list of
translation rules. This file must be stored in
<filename>$SHAREDIR/tsearch_data/</filename> (where <literal>$SHAREDIR</literal> means
the <productname>PostgreSQL</productname> installation's shared-data directory).
Its name must end in <literal>.rules</literal> (which is not to be included in
the <literal>RULES</literal> parameter).
The rules file has the following format:

Each line represents one translation rule, consisting of a character with
accent followed by a character without accent. The first is translated
into the second. For example, a line consisting of <literal>À</literal>
followed by <literal>A</literal> translates <literal>À</literal> into
<literal>A</literal>.
The two characters must be separated by whitespace, and any leading or
trailing whitespace on a line is ignored.
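<para>
 As an illustration of this format, a minimal rules file might contain lines
 such as the following. The particular entries are chosen here only as an
 example, and the amount of whitespace between the two columns is arbitrary.
</para>
<programlisting>
À        A
Á        A
Â        A
Æ        AE
</programlisting>
<para>
 With these rules, each of the first three characters is replaced by
 <literal>A</literal>, and the last rule replaces a single character by the
 two-character string <literal>AE</literal>.
</para>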
Alternatively, if only one character is given on a line, instances of
that character are deleted; this is useful in languages where accents
are represented by separate characters.
Actually, each <quote>character</quote> can be any string not containing
whitespace, so <filename>unaccent</filename> dictionaries could be used for
other sorts of substring substitutions besides diacritic removal.
Some characters, such as numeric symbols, may require a translation that
contains whitespace. In that case, enclose the translated characters in
double quotes. To include a double quote within the quoted translation,
escape it with a second double quote. For example, the translation
<literal>" 1/4"</literal> begins with a space, and <literal>""""</literal>
stands for a single double quote.
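<para>
 The quoting rules can be sketched with a few hypothetical entries. These
 lines are illustrative assumptions, not a listing of any shipped rules file:
 the first two translate a fraction symbol into a string that starts with a
 space, and the third translates a left curly quote into a plain double quote.
</para>
<programlisting>
¼      " 1/4"
½      " 1/2"
“      """"
</programlisting>
<para>
 In the third rule, the outer pair of double quotes delimits the translation
 and the inner doubled quote is the escaped character itself.
</para>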
As with other <productname>PostgreSQL</productname> text search configuration files,
the rules file must be stored in UTF-8 encoding. The data is
automatically translated into the current database's encoding when
loaded. Any lines containing untranslatable characters are silently
ignored, so that rules files can contain rules that are not applicable in
the current encoding.
A more complete example, which is directly useful for most European
languages, can be found in <filename>unaccent.rules</filename>, which is installed
in <filename>$SHAREDIR/tsearch_data/</filename> when the <filename>unaccent</filename>
module is installed. This rules file translates characters with accents
to the same characters without accents, and it also expands ligatures
into the equivalent series of simple characters (for example,
Æ to AE).
<sect2 id="unaccent-usage">
<title>Usage</title>
Installing the <literal>unaccent</literal> extension creates a text
search template <literal>unaccent</literal> and a dictionary <literal>unaccent</literal>
based on it. The <literal>unaccent</literal> dictionary has the default
parameter setting <literal>RULES='unaccent'</literal>, which makes it immediately
usable with the standard <filename>unaccent.rules</filename> file.
If you wish, you can alter the parameter, for example
<programlisting>
mydb=# ALTER TEXT SEARCH DICTIONARY unaccent (RULES='my_rules');
</programlisting>
or create new dictionaries based on the template.
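<para>
 Creating a separate dictionary from the template might look like the
 following sketch. The dictionary name <literal>my_unaccent</literal> is
 hypothetical, and the example assumes that a file named
 <filename>my_rules.rules</filename> already exists in
 <filename>$SHAREDIR/tsearch_data/</filename>.
</para>
<programlisting>
-- Hypothetical dictionary name; assumes my_rules.rules is installed
-- in $SHAREDIR/tsearch_data/.
CREATE TEXT SEARCH DICTIONARY my_unaccent (
    TEMPLATE = unaccent,
    RULES = 'my_rules'
);

-- Try the new dictionary on a single token.
SELECT ts_lexize('my_unaccent', 'Hôtel');
</programlisting>
<para>
 Defining a separate dictionary leaves the default
 <literal>unaccent</literal> dictionary and its rules file unchanged.
</para>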
To test the dictionary, you can try:
<programlisting>
mydb=# select ts_lexize('unaccent','Hôtel');
</programlisting>
Here is an example showing how to insert the
<filename>unaccent</filename> dictionary into a text search configuration:
<programlisting>
mydb=# CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );
mydb=# ALTER TEXT SEARCH CONFIGURATION fr
        ALTER MAPPING FOR hword, hword_part, word
        WITH unaccent, french_stem;
mydb=# select to_tsvector('fr','Hôtels de la Mer');
mydb=# select to_tsvector('fr','Hôtel de la Mer') @@ to_tsquery('fr','Hotels');
mydb=# select ts_headline('fr','Hôtel de la Mer',to_tsquery('fr','Hotels'));
      ts_headline
------------------------
 &lt;b&gt;Hôtel&lt;/b&gt; de la Mer
</programlisting>
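<para>
 To see how tokens pass through the dictionary chain, one option is the
 built-in <function>ts_debug()</function> function. The sketch below assumes
 the <literal>fr</literal> configuration from the example above has already
 been created.
</para>
<programlisting>
-- For each token, show the dictionaries consulted, the dictionary
-- that recognized it, and the resulting lexemes.
SELECT alias, token, dictionaries, dictionary, lexemes
FROM ts_debug('fr', 'Hôtel de la Mer');
</programlisting>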
<sect2 id="unaccent-functions">
<title>Functions</title>
The <function>unaccent()</function> function removes accents (diacritic signs) from
a given string. Basically, it's a wrapper around
<filename>unaccent</filename>-type dictionaries, but it can be used outside normal
text search contexts.
<primary>unaccent</primary>
<synopsis>
unaccent(<optional><replaceable class="parameter">dictionary</replaceable> <type>regdictionary</type>, </optional> <replaceable class="parameter">string</replaceable> <type>text</type>) returns <type>text</type>
</synopsis>
If the <replaceable class="parameter">dictionary</replaceable> argument is
omitted, the text search dictionary named <literal>unaccent</literal> and
appearing in the same schema as the <function>unaccent()</function>
function itself is used.
<programlisting>
SELECT unaccent('unaccent', 'Hôtel');
SELECT unaccent('Hôtel');
</programlisting>
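<para>
 Because <function>unaccent()</function> works on ordinary strings, it can
 also be used for accent-insensitive comparisons outside of text search.
 The following is a minimal sketch that assumes the default
 <literal>unaccent</literal> dictionary with the standard rules file; the
 column alias <literal>same_word</literal> is only illustrative.
</para>
<programlisting>
-- Compare two strings while ignoring accents and letter case.
SELECT lower(unaccent('Hôtel')) = lower(unaccent('hotel')) AS same_word;
</programlisting>
<para>
 With the standard rules this comparison is expected to yield true, since
 <literal>ô</literal> is translated to <literal>o</literal> before the
 strings are compared.
</para>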