ext/icu/README.txt

   1
   2 This directory contains source code for the SQLite "ICU" extension, an
   3 integration of the "International Components for Unicode" library with
   4 SQLite. Documentation follows.
   5
   6     1. Features
   7
   8         1.1  SQL Scalars upper() and lower()
   9         1.2  Unicode Aware LIKE Operator
  10         1.3  ICU Collation Sequences
  11         1.4  SQL REGEXP Operator
  12
  13     2. Compilation and Usage
  14
  15     3. Bugs, Problems and Security Issues
  16
  17         3.1  The "case_sensitive_like" Pragma
  18         3.2  The SQLITE_MAX_LIKE_PATTERN_LENGTH Macro
  19         3.3  Collation Sequence Security Issue
  20
  21
  22 1. FEATURES
  23
  24   1.1  SQL Scalars upper() and lower()
  25
  26     SQLite's built-in implementations of these two functions only
  27     provide case mapping for the 26 letters used in the English
  28     language. The ICU based functions provided by this extension
  29     provide case mapping, where defined, for the full range of
  30     unicode characters.
  31
  32     ICU provides two types of case mapping, "general" case mapping and
  33     "language specific". Refer to ICU documentation for the differences
  34     between the two. Specifically:
  35
  36        http://www.icu-project.org/userguide/caseMappings.html
  37        http://www.icu-project.org/userguide/posix.html#case_mappings
  38
  39     To utilise "general" case mapping, the upper() or lower() scalar
  40     functions are invoked with one argument:
  41
  42         upper('abc') -> 'ABC'
  43         lower('ABC') -> 'abc'
  44
  45     To access ICU "language specific" case mapping, upper() or lower()
  46     should be invoked with two arguments. The second argument is the name
  47     of the locale to use. Passing an empty string ("") or SQL NULL value
  48     as the second argument is the same as invoking the 1 argument version
  49     of upper() or lower():
  50
  51         lower('I', 'en_us') -> 'i'
  52         lower('I', 'tr_tr') -> 'ı' (small dotless i)
  53
  54   1.2  Unicode Aware LIKE Operator
  55
  56     Similarly to the upper() and lower() functions, the built-in SQLite LIKE
  57     operator understands case equivalence for the 26 letters of the English
  58     language alphabet. The implementation of LIKE included in this
  59     extension uses the ICU function u_foldCase() to provide case
  60     independent comparisons for the full range of unicode characters.
  61
  62     The U_FOLD_CASE_DEFAULT flag is passed to u_foldCase(), meaning the
  63     dotless 'I' character used in the Turkish language is considered
  64     to be in the same equivalence class as the dotted 'I' character
  65     used by many languages (including English).
  66
  67   1.3  ICU Collation Sequences
  68
  69     A special SQL scalar function, icu_load_collation() is provided that
  70     may be used to register ICU collation sequences with SQLite. It
  71     is always called with exactly two arguments, the ICU locale
  72     identifying the collation sequence to ICU, and the name of the
  73     SQLite collation sequence to create. For example, to create an
  74     SQLite collation sequence named "turkish" using Turkish language
  75     sorting rules, the SQL statement:
  76
  77         SELECT icu_load_collation('tr_TR', 'turkish');
  78
  79     Or, for Australian English:
  80
  81         SELECT icu_load_collation('en_AU', 'australian');
  82
  83     The identifiers "turkish" and "australian" may then be used
  84     as collation sequence identifiers in SQL statements:
  85
  86         CREATE TABLE aust_turkish_penpals(
  87           australian_penpal_name TEXT COLLATE australian,
  88           turkish_penpal_name    TEXT COLLATE turkish
  89         );
  90
  91   1.4 SQL REGEXP Operator
  92
  93     This extension provides an implementation of the SQL binary
  94     comparision operator "REGEXP", based on the regular expression functions
  95     provided by the ICU library. The syntax of the operator is as described
  96     in SQLite documentation:
  97
  98         <string> REGEXP <re-pattern>
  99
 100     This extension uses the ICU defaults for regular expression matching
 101     behavior. Specifically, this means that:
 102
 103         * Matching is case-sensitive,
 104         * Regular expression comments are not allowed within patterns, and
 105         * The '^' and '$' characters match the beginning and end of the
 106           <string> argument, not the beginning and end of lines within
 107           the <string> argument.
 108
 109     Even more specifically, the value passed to the "flags" parameter
 110     of ICU C function uregex_open() is 0.
 111
 112
 113 2  COMPILATION AND USAGE
 114
 115   The easiest way to compile and use the ICU extension is to build
 116   and use it as a dynamically loadable SQLite extension. To do this
 117   using gcc on *nix:
 118
 119     gcc -fPIC -shared icu.c `pkg-config --libs --cflags icu-uc icu-io` \
 120         -o libSqliteIcu.so
 121
 122   You may need to add "-I" flags so that gcc can find sqlite3ext.h
 123   and sqlite3.h. The resulting shared lib, libSqliteIcu.so, may be
 124   loaded into sqlite in the same way as any other dynamically loadable
 125   extension.
 126
 127
 128 3 BUGS, PROBLEMS AND SECURITY ISSUES
 129
 130   3.1 The "case_sensitive_like" Pragma
 131
 132     This extension does not work well with the "case_sensitive_like"
 133     pragma. If this pragma is used before the ICU extension is loaded,
 134     then the pragma has no effect. If the pragma is used after the ICU
 135     extension is loaded, then SQLite ignores the ICU implementation and
 136     always uses the built-in LIKE operator.
 137
 138     The ICU extension LIKE operator is always case insensitive.
 139
 140   3.2 The SQLITE_MAX_LIKE_PATTERN_LENGTH Macro
 141
 142     Passing very long patterns to the built-in SQLite LIKE operator can
 143     cause excessive CPU usage. To curb this problem, SQLite defines the
 144     SQLITE_MAX_LIKE_PATTERN_LENGTH macro as the maximum length of a
 145     pattern in bytes (irrespective of encoding). The default value is
 146     defined in internal header file "limits.h".
 147
 148     The ICU extension LIKE implementation suffers from the same
 149     problem and uses the same solution. However, since the ICU extension
 150     code does not include the SQLite file "limits.h", modifying
 151     the default value therein does not affect the ICU extension.
 152     The default value of SQLITE_MAX_LIKE_PATTERN_LENGTH used by
 153     the ICU extension LIKE operator is 50000, defined in source
 154     file "icu.c".
 155
 156   3.3 Collation Sequence Security
 157
 158     Internally, SQLite assumes that indices stored in database files
 159     are sorted according to the collation sequence indicated by the
 160     SQL schema. Changing the definition of a collation sequence after
 161     an index has been built is therefore equivalent to database
 162     corruption. The SQLite library is well tested for robustness in
 163     the fact of database corruption.  Database corruption may well
 164     lead to incorrect answers, but should not cause memory errors.