doc/src/sgml/citext.sgml

   1 <!-- $PostgreSQL$ -->
   2
   3 <sect1 id="citext">
   4  <title>citext</title>
   5
   6  <indexterm zone="citext">
   7   <primary>citext</primary>
   8  </indexterm>
   9
  10  <para>
  11   The <filename>citext</> module provides a case-insensitive
  12   character string type, <type>citext</>. Essentially, it internally calls
  13   <function>lower</> when comparing values. Otherwise, it behaves almost
  14   exactly like <type>text</>.
  15  </para>
  16
  17  <sect2>
  18   <title>Rationale</title>
  19
  20   <para>
  21    The standard approach to doing case-insensitive matches
  22    in <productname>PostgreSQL</> has been to use the <function>lower</>
  23    function when comparing values, for example
  24
  25   <programlisting>
  26    SELECT * FROM tab WHERE lower(col) = LOWER(?);
  27   </programlisting>
  28   </para>
  29
  30   <para>
  31    This works reasonably well, but has a number of drawbacks:
  32   </para>
  33
  34    <itemizedlist>
  35     <listitem>
  36      <para>
  37       It makes your SQL statements verbose, and you always have to remember to
  38       use <function>lower</> on both the column and the query value.
  39      </para>
  40     </listitem>
  41     <listitem>
  42      <para>
  43       It won't use an index, unless you create a functional index using
  44       <function>lower</>.
  45      </para>
  46     </listitem>
  47     <listitem>
  48      <para>
  49       If you declare a column as <literal>UNIQUE</> or <literal>PRIMARY
  50       KEY</>, the implicitly generated index is case-sensitive.  So it's
  51       useless for case-insensitive searches, and it won't enforce
  52       uniqueness case-insensitively.
  53      </para>
  54     </listitem>
  55    </itemizedlist>
  56
  57    <para>
  58     The <type>citext</> data type allows you to eliminate calls
  59     to <function>lower</> in SQL queries, and allows a primary key to
  60     be case-insensitive. <type>citext</> is locale-aware, just
  61     like <type>text</>, which means that the comparison of uppercase and
  62     lowercase characters is dependent on the rules of
  63     the <literal>LC_CTYPE</> locale setting. Again, this behavior is
  64     identical to the use of <function>lower</> in queries. But because it's
  65     done transparently by the datatype, you don't have to remember to do
  66     anything special in your queries.
  67    </para>
  68
  69  </sect2>
  70
  71  <sect2>
  72   <title>How to Use It</title>
  73
  74   <para>
  75    Here's a simple example of usage:
  76
  77   <programlisting>
  78    CREATE TABLE users (
  79        nick CITEXT PRIMARY KEY,
  80        pass TEXT   NOT NULL
  81    );
  82
  83    INSERT INTO users VALUES ( 'larry',  md5(random()::text) );
  84    INSERT INTO users VALUES ( 'Tom',    md5(random()::text) );
  85    INSERT INTO users VALUES ( 'Damian', md5(random()::text) );
  86    INSERT INTO users VALUES ( 'NEAL',   md5(random()::text) );
  87    INSERT INTO users VALUES ( 'Bj&oslash;rn',  md5(random()::text) );
  88
  89    SELECT * FROM users WHERE nick = 'Larry';
  90   </programlisting>
  91
  92    The <command>SELECT</> statement will return one tuple, even though
  93    the <structfield>nick</> column was set to <quote>larry</> and the query
  94    was for <quote>Larry</>.
  95   </para>
  96  </sect2>
  97
  98  <sect2>
  99   <title>String Comparison Behavior</title>
 100   <para>
 101    In order to emulate a case-insensitive collation as closely as possible,
 102    there are <type>citext</>-specific versions of a number of the comparison
 103    operators and functions.  So, for example, the regular expression
 104    operators <literal>~</> and <literal>~*</> exhibit the same behavior when
 105    applied to <type>citext</>: they both compare case-insensitively.
 106    The same is true
 107    for <literal>!~</> and <literal>!~*</>, as well as for the
 108    <literal>LIKE</> operators <literal>~~</> and <literal>~~*</>, and
 109    <literal>!~~</> and <literal>!~~*</>. If you'd like to match
 110    case-sensitively, you can always cast to <type>text</> before comparing.
 111   </para>
 112
 113   <para>
 114    Similarly, all of the following functions perform matching
 115    case-insensitively if their arguments are <type>citext</>:
 116   </para>
 117
 118   <itemizedlist>
 119    <listitem>
 120     <para>
 121       <function>regexp_replace()</>
 122     </para>
 123    </listitem>
 124    <listitem>
 125     <para>
 126       <function>regexp_split_to_array()</>
 127     </para>
 128    </listitem>
 129    <listitem>
 130     <para>
 131       <function>regexp_split_to_table()</>
 132     </para>
 133    </listitem>
 134    <listitem>
 135     <para>
 136       <function>replace()</>
 137     </para>
 138    </listitem>
 139    <listitem>
 140     <para>
 141       <function>split_part()</>
 142     </para>
 143    </listitem>
 144    <listitem>
 145     <para>
 146       <function>strpos()</>
 147     </para>
 148    </listitem>
 149    <listitem>
 150     <para>
 151       <function>translate()</>
 152     </para>
 153    </listitem>
 154   </itemizedlist>
 155
 156   <para>
 157    For the regexp functions, if you want to match case-sensitively, you can
 158    specify the <quote>c</> flag to force a case-sensitive match.  Otherwise,
 159    you must cast to <type>text</> before using one of these functions if
 160    you want case-sensitive behavior.
 161   </para>
 162
 163  </sect2>
 164
 165  <sect2>
 166   <title>Limitations</title>
 167
 168    <itemizedlist>
 169     <listitem>
 170      <para>
 171       <type>citext</>'s behavior depends on
 172       the <literal>LC_CTYPE</> setting of your database. How it compares
 173       values is therefore determined when
 174       <application>initdb</> is run to create the cluster. It is not truly
 175       case-insensitive in the terms defined by the Unicode standard.
 176       Effectively, what this means is that, as long as you're happy with your
 177       collation, you should be happy with <type>citext</>'s comparisons. But
 178       if you have data in different languages stored in your database, users
 179       of one language may find their query results are not as expected if the
 180       collation is for another language.
 181      </para>
 182     </listitem>
 183
 184     <listitem>
 185      <para>
 186        <type>citext</> is not as efficient as <type>text</> because the
 187        operator functions and the btree comparison functions must make copies
 188        of the data and convert it to lower case for comparisons. It is,
 189        however, slightly more efficient than using <function>lower</> to get
 190        case-insensitive matching.
 191      </para>
 192     </listitem>
 193
 194     <listitem>
 195      <para>
 196       <type>citext</> doesn't help much if you need data to compare
 197       case-sensitively in some contexts and case-insensitively in other
 198       contexts.  The standard answer is to use the <type>text</> type and
 199       manually use the <function>lower</> function when you need to compare
 200       case-insensitively; this works all right if case-insensitive comparison
 201       is needed only infrequently.  If you need case-insensitive most of
 202       the time and case-sensitive infrequently, consider storing the data
 203       as <type>citext</> and explicitly casting the column to <type>text</>
 204       when you want case-sensitive comparison.  In either situation, you
 205       will need two indexes if you want both types of searches to be fast.
 206     </para>
 207     </listitem>
 208    </itemizedlist>
 209  </sect2>
 210
 211  <sect2>
 212   <title>Author</title>
 213
 214   <para>
 215    David E. Wheeler <email>david@kineticode.com</email>
 216   </para>
 217
 218   <para>
 219     Inspired by the original <type>citext</> module by Donald Fraser.
 220   </para>
 221
 222  </sect2>
 223
 224 </sect1>