doc/src/sgml/citext.sgml

   1 <!-- doc/src/sgml/citext.sgml -->
   2
   3 <sect1 id="citext" xreflabel="citext">
   4  <title>citext &mdash; a case-insensitive character string type</title>
   5
   6  <indexterm zone="citext">
   7   <primary>citext</primary>
   8  </indexterm>
   9
  10  <para>
  11   The <filename>citext</filename> module provides a case-insensitive
  12   character string type, <type>citext</type>. Essentially, it internally calls
  13   <function>lower</function> when comparing values. Otherwise, it behaves almost
  14   exactly like <type>text</type>.
  15  </para>
  16
  17  <tip>
  18   <para>
  19    Consider using <firstterm>nondeterministic collations</firstterm> (see
  20    <xref linkend="collation-nondeterministic"/>) instead of this module.  They
  21    can be used for case-insensitive comparisons, accent-insensitive
  22    comparisons, and other combinations, and they handle more Unicode special
  23    cases correctly.
  24   </para>
  25  </tip>
  26
  27  <para>
  28   This module is considered <quote>trusted</quote>, that is, it can be
  29   installed by non-superusers who have <literal>CREATE</literal> privilege
  30   on the current database.
  31  </para>
  32
  33  <sect2 id="citext-rationale">
  34   <title>Rationale</title>
  35
  36   <para>
  37    The standard approach to doing case-insensitive matches
  38    in <productname>PostgreSQL</productname> has been to use the <function>lower</function>
  39    function when comparing values, for example
  40
  41 <programlisting>
  42 SELECT * FROM tab WHERE lower(col) = LOWER(?);
  43 </programlisting>
  44   </para>
  45
  46   <para>
  47    This works reasonably well, but has a number of drawbacks:
  48   </para>
  49
  50    <itemizedlist>
  51     <listitem>
  52      <para>
  53       It makes your SQL statements verbose, and you always have to remember to
  54       use <function>lower</function> on both the column and the query value.
  55      </para>
  56     </listitem>
  57     <listitem>
  58      <para>
  59       It won't use an index, unless you create a functional index using
  60       <function>lower</function>.
  61      </para>
  62     </listitem>
  63     <listitem>
  64      <para>
  65       If you declare a column as <literal>UNIQUE</literal> or <literal>PRIMARY
  66       KEY</literal>, the implicitly generated index is case-sensitive.  So it's
  67       useless for case-insensitive searches, and it won't enforce
  68       uniqueness case-insensitively.
  69      </para>
  70     </listitem>
  71    </itemizedlist>
  72
  73    <para>
  74     The <type>citext</type> data type allows you to eliminate calls
  75     to <function>lower</function> in SQL queries, and allows a primary key to
  76     be case-insensitive. <type>citext</type> is locale-aware, just
  77     like <type>text</type>, which means that the matching of upper case and
  78     lower case characters is dependent on the rules of
  79     the database's <literal>LC_CTYPE</literal> setting. Again, this behavior is
  80     identical to the use of <function>lower</function> in queries. But because it's
  81     done transparently by the data type, you don't have to remember to do
  82     anything special in your queries.
  83    </para>
  84
  85  </sect2>
  86
  87  <sect2 id="citext-how-to-use-it">
  88   <title>How to Use It</title>
  89
  90   <para>
  91    Here's a simple example of usage:
  92
  93 <programlisting>
  94 CREATE TABLE users (
  95     nick CITEXT PRIMARY KEY,
  96     pass TEXT   NOT NULL
  97 );
  98
  99 INSERT INTO users VALUES ( 'larry',  sha256(random()::text::bytea) );
 100 INSERT INTO users VALUES ( 'Tom',    sha256(random()::text::bytea) );
 101 INSERT INTO users VALUES ( 'Damian', sha256(random()::text::bytea) );
 102 INSERT INTO users VALUES ( 'NEAL',   sha256(random()::text::bytea) );
 103 INSERT INTO users VALUES ( 'Bj&oslash;rn',  sha256(random()::text::bytea) );
 104
 105 SELECT * FROM users WHERE nick = 'Larry';
 106 </programlisting>
 107
 108    The <command>SELECT</command> statement will return one tuple, even though
 109    the <structfield>nick</structfield> column was set to <literal>larry</literal> and the query
 110    was for <literal>Larry</literal>.
 111   </para>
 112  </sect2>
 113
 114  <sect2 id="citext-string-comparison-behavior">
 115   <title>String Comparison Behavior</title>
 116
 117   <para>
 118    <type>citext</type> performs comparisons by converting each string to lower
 119    case (as though <function>lower</function> were called) and then comparing the
 120    results normally.  Thus, for example, two strings are considered equal
 121    if <function>lower</function> would produce identical results for them.
 122   </para>
 123
 124   <para>
 125    In order to emulate a case-insensitive collation as closely as possible,
 126    there are <type>citext</type>-specific versions of a number of string-processing
 127    operators and functions.  So, for example, the regular expression
 128    operators <literal>~</literal> and <literal>~*</literal> exhibit the same behavior when
 129    applied to <type>citext</type>: they both match case-insensitively.
 130    The same is true
 131    for <literal>!~</literal> and <literal>!~*</literal>, as well as for the
 132    <literal>LIKE</literal> operators <literal>~~</literal> and <literal>~~*</literal>, and
 133    <literal>!~~</literal> and <literal>!~~*</literal>. If you'd like to match
 134    case-sensitively, you can cast the operator's arguments to <type>text</type>.
 135   </para>
 136
 137   <para>
 138    Similarly, all of the following functions perform matching
 139    case-insensitively if their arguments are <type>citext</type>:
 140   </para>
 141
 142   <itemizedlist>
 143    <listitem>
 144     <para>
 145       <function>regexp_match()</function>
 146     </para>
 147    </listitem>
 148    <listitem>
 149     <para>
 150       <function>regexp_matches()</function>
 151     </para>
 152    </listitem>
 153    <listitem>
 154     <para>
 155       <function>regexp_replace()</function>
 156     </para>
 157    </listitem>
 158    <listitem>
 159     <para>
 160       <function>regexp_split_to_array()</function>
 161     </para>
 162    </listitem>
 163    <listitem>
 164     <para>
 165       <function>regexp_split_to_table()</function>
 166     </para>
 167    </listitem>
 168    <listitem>
 169     <para>
 170       <function>replace()</function>
 171     </para>
 172    </listitem>
 173    <listitem>
 174     <para>
 175       <function>split_part()</function>
 176     </para>
 177    </listitem>
 178    <listitem>
 179     <para>
 180       <function>strpos()</function>
 181     </para>
 182    </listitem>
 183    <listitem>
 184     <para>
 185       <function>translate()</function>
 186     </para>
 187    </listitem>
 188   </itemizedlist>
 189
 190   <para>
 191    For the regexp functions, if you want to match case-sensitively, you can
 192    specify the <quote>c</quote> flag to force a case-sensitive match.  Otherwise,
 193    you must cast to <type>text</type> before using one of these functions if
 194    you want case-sensitive behavior.
 195   </para>
 196
 197  </sect2>
 198
 199  <sect2 id="citext-limitations">
 200   <title>Limitations</title>
 201
 202    <itemizedlist>
 203     <listitem>
 204      <para>
 205       <type>citext</type>'s case-folding behavior depends on
 206       the <literal>LC_CTYPE</literal> setting of your database. How it compares
 207       values is therefore determined when the database is created.
 208       It is not truly
 209       case-insensitive in the terms defined by the Unicode standard.
 210       Effectively, what this means is that, as long as you're happy with your
 211       collation, you should be happy with <type>citext</type>'s comparisons. But
 212       if you have data in different languages stored in your database, users
 213       of one language may find their query results are not as expected if the
 214       collation is for another language.
 215      </para>
 216     </listitem>
 217
 218     <listitem>
 219      <para>
 220       As of <productname>PostgreSQL</productname> 9.1, you can attach a
 221       <literal>COLLATE</literal> specification to <type>citext</type> columns or data
 222       values.  Currently, <type>citext</type> operators will honor a non-default
 223       <literal>COLLATE</literal> specification while comparing case-folded strings,
 224       but the initial folding to lower case is always done according to the
 225       database's <literal>LC_CTYPE</literal> setting (that is, as though
 226       <literal>COLLATE "default"</literal> were given).  This may be changed in a
 227       future release so that both steps follow the input <literal>COLLATE</literal>
 228       specification.
 229      </para>
 230     </listitem>
 231
 232     <listitem>
 233      <para>
 234        <type>citext</type> is not as efficient as <type>text</type> because the
 235        operator functions and the B-tree comparison functions must make copies
 236        of the data and convert it to lower case for comparisons.  Also, only
 237        <type>text</type> can support B-Tree deduplication.  However,
 238        <type>citext</type> is slightly more efficient than using
 239        <function>lower</function> to get case-insensitive matching.
 240      </para>
 241     </listitem>
 242
 243     <listitem>
 244      <para>
 245       <type>citext</type> doesn't help much if you need data to compare
 246       case-sensitively in some contexts and case-insensitively in other
 247       contexts.  The standard answer is to use the <type>text</type> type and
 248       manually use the <function>lower</function> function when you need to compare
 249       case-insensitively; this works all right if case-insensitive comparison
 250       is needed only infrequently.  If you need case-insensitive behavior most
 251       of the time and case-sensitive infrequently, consider storing the data
 252       as <type>citext</type> and explicitly casting the column to <type>text</type>
 253       when you want case-sensitive comparison.  In either situation, you will
 254       need two indexes if you want both types of searches to be fast.
 255     </para>
 256     </listitem>
 257
 258     <listitem>
 259      <para>
 260       The schema containing the <type>citext</type> operators must be
 261       in the current <varname>search_path</varname> (typically <literal>public</literal>);
 262       if it is not, the normal case-sensitive <type>text</type> operators
 263       will be invoked instead.
 264     </para>
 265     </listitem>
 266
 267     <listitem>
 268      <para>
 269       The approach of lower-casing strings for comparison does not handle some
 270       Unicode special cases correctly, for example when one upper-case letter
 271       has two lower-case letter equivalents.  Unicode distinguishes between
 272       <firstterm>case mapping</firstterm> and <firstterm>case
 273       folding</firstterm> for this reason.  Use nondeterministic collations
 274       instead of <type>citext</type> to handle that correctly.
 275      </para>
 276     </listitem>
 277    </itemizedlist>
 278  </sect2>
 279
 280  <sect2 id="citext-author">
 281   <title>Author</title>
 282
 283   <para>
 284    David E. Wheeler <email>david@kineticode.com</email>
 285   </para>
 286
 287   <para>
 288     Inspired by the original <type>citext</type> module by Donald Fraser.
 289   </para>
 290
 291  </sect2>
 292
 293 </sect1>