documentation/manual/en/module_specs/Zend_Search_Lucene-Charset.xml

<?xml version="1.0" encoding="UTF-8"?>
<!-- Reviewed: no -->
<sect1 id="zend.search.lucene.charset">
    <title>Character Set</title>

    <sect2 id="zend.search.lucene.charset.description">
        <title>UTF-8 and single-byte character set support</title>

        <para>
            <classname>Zend_Search_Lucene</classname> works with the UTF-8 charset internally. Index
            files store Unicode data in Java's "modified UTF-8" encoding. The
            <classname>Zend_Search_Lucene</classname> core fully supports this encoding, with
            one exception.

            <footnote>
               <para>
                   <classname>Zend_Search_Lucene</classname> supports only Basic Multilingual Plane
                   (BMP) characters (from 0x0000 to 0xFFFF) and doesn't support
                   "supplementary characters" (characters whose code points are
                   greater than 0xFFFF).
               </para>

               <para>
                   Java 2 represents these characters as a pair of char (16-bit)
                   values, the first from the high-surrogates range (0xD800-0xDBFF),
                   the second from the low-surrogates range (0xDC00-0xDFFF). Each
                   surrogate is then encoded separately as a three-byte UTF-8 sequence,
                   six bytes in total. Standard UTF-8 represents a supplementary
                   character in four bytes.
               </para>
            </footnote>
        </para>
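
        <para>
            Since supplementary characters are not supported, it can be useful to detect them
            before indexing. The following is a minimal sketch and not part of
            <classname>Zend_Search_Lucene</classname>; the function name
            containsSupplementaryCharacters() is hypothetical. It assumes the input is valid
            standard UTF-8, where only code points above 0xFFFF use a four-byte sequence.
        </para>

        <programlisting language="php"><![CDATA[
/**
 * Returns true if a UTF-8 string contains characters outside the BMP.
 *
 * Assumes $text is valid UTF-8: there, the bytes 0xF0-0xF4 occur only
 * as the lead byte of a four-byte sequence (code points above 0xFFFF).
 */
function containsSupplementaryCharacters($text)
{
    return preg_match('/[\xF0-\xF4]/', $text) === 1;
}

// German text with umlauts: BMP characters only
var_dump(containsSupplementaryCharacters("Gr\xC3\xBC\xC3\x9Fe")); // bool(false)

// U+1D11E (MUSICAL SYMBOL G CLEF) is a supplementary character
var_dump(containsSupplementaryCharacters("\xF0\x9D\x84\x9E"));    // bool(true)
]]></programlisting>

        <para>
            The check works on raw bytes, so it does not depend on PCRE UTF-8 support.
        </para>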

        <para>
            The actual encoding of input data may be specified through the
            <classname>Zend_Search_Lucene</classname> <acronym>API</acronym>. The data is then
            automatically converted to UTF-8.
        </para>
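
        <para>
            Conceptually, this conversion corresponds to an iconv() call from the declared input
            encoding to UTF-8. The sketch below only illustrates the effect and is not the
            library's internal code. The helper name toUtf8() is hypothetical, and treating an
            empty encoding as "already UTF-8" is an assumption made for this example. It requires
            the iconv extension.
        </para>

        <programlisting language="php"><![CDATA[
/**
 * Converts $text from $encoding to UTF-8.
 * An empty $encoding is treated as "already UTF-8" (assumption of this sketch).
 */
function toUtf8($text, $encoding = '')
{
    if ($encoding === '' || strcasecmp($encoding, 'UTF-8') === 0) {
        return $text;
    }

    $converted = iconv($encoding, 'UTF-8', $text);
    if ($converted === false) {
        throw new InvalidArgumentException("Cannot convert from $encoding");
    }

    return $converted;
}

// "Kaese" with an a-umlaut in ISO-8859-1: the umlaut is the single byte 0xE4
$latin1 = "K\xE4se";
$utf8   = toUtf8($latin1, 'ISO-8859-1');

// In UTF-8 the umlaut becomes the two bytes 0xC3 0xA4
echo bin2hex($utf8), "\n"; // 4bc3a47365
]]></programlisting>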
    </sect2>

    <sect2 id="zend.search.lucene.charset.default_analyzer">
        <title>Default text analyzer</title>

        <para>
            Although data is converted to UTF-8, the default text analyzer (which is also used
            within the query parser) uses ctype_alpha() to tokenize text and queries.
        </para>

        <para>
            ctype_alpha() is not UTF-8 compatible, so the analyzer converts text to
            'ASCII//TRANSLIT' encoding before indexing. The same conversion is transparently
            applied during query parsing.

            <footnote>
               <para>
                   The result of conversion to 'ASCII//TRANSLIT' may depend on the current
                   locale and OS.
               </para>
            </footnote>
        </para>
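
        <para>
            The limitation is easy to reproduce. ctype_alpha() inspects single bytes, so in the
            "C" locale the bytes of a multi-byte UTF-8 character are not recognized as letters.
            The sketch below shows this, followed by the kind of transliteration described above.
            The transliterated value is deliberately not shown as expected output, because it
            depends on the locale and the iconv implementation.
        </para>

        <programlisting language="php"><![CDATA[
// Use the "C" locale so that the result is predictable
setlocale(LC_CTYPE, 'C');

// Pure ASCII letters are accepted
var_dump(ctype_alpha('Strasse'));       // bool(true)

// "Strasse" written with a sharp s (U+00DF, UTF-8 bytes C3 9F):
// the two bytes of the sharp s are not ASCII letters
var_dump(ctype_alpha("Stra\xC3\x9Fe")); // bool(false)

// Transliterate to ASCII before tokenizing. The exact result, and whether
// a character can be transliterated at all, depends on locale and OS.
$ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', "Stra\xC3\x9Fe");
var_dump($ascii);
]]></programlisting>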

        <note>
            <title/>
            <para>
                The default analyzer doesn't treat numbers as parts of terms. Use the
                corresponding 'Num' analyzer if you don't want words to be broken by numbers.
            </para>
        </note>
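
        <para>
            The difference between the two tokenization rules can be sketched with plain regular
            expressions. This illustrates the rule only and is not the analyzers' actual code;
            both function names are hypothetical.
        </para>

        <programlisting language="php"><![CDATA[
// Letters only: digits act as separators between terms
function tokenizeLetters($text)
{
    preg_match_all('/[a-zA-Z]+/', $text, $matches);
    return $matches[0];
}

// Letters and digits: digits are kept as part of a term
function tokenizeLettersAndDigits($text)
{
    preg_match_all('/[a-zA-Z0-9]+/', $text, $matches);
    return $matches[0];
}

echo implode(', ', tokenizeLetters('PHP5 and utf8')), "\n";          // PHP, and, utf
echo implode(', ', tokenizeLettersAndDigits('PHP5 and utf8')), "\n"; // PHP5, and, utf8
]]></programlisting>

        <para>
            With the letters-only rule a query for "utf8" is reduced to the term "utf", which is
            why a 'Num' analyzer is preferable for data such as version or model names.
        </para>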
    </sect2>

    <sect2 id="zend.search.lucene.charset.utf_analyzer">
        <title>UTF-8 compatible text analyzers</title>

        <para>
            <classname>Zend_Search_Lucene</classname> also contains a set of UTF-8 compatible
            analyzers: <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8</classname>,
            <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num</classname>,
            <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive</classname>,
            <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive</classname>.
        </para>

        <para>
            Any of these analyzers can be enabled with code like this:
        </para>

        <programlisting language="php"><![CDATA[
// Use the UTF-8 analyzer for subsequent indexing and query parsing
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());
]]></programlisting>

        <warning>
            <title/>
            <para>
                The UTF-8 compatible analyzers were improved in Zend Framework 1.5. Earlier
                versions of the analyzers assumed that all non-ASCII characters are letters. The
                new implementation behaves more accurately.
            </para>

            <para>
                You may need to rebuild the index so that data and search queries are tokenized
                in the same way; otherwise the search engine may return wrong result sets.
            </para>
        </warning>
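
        <para>
            To see why an index built with an earlier version may no longer match, compare two
            tokenization rules on text that contains non-ASCII punctuation. Both patterns are
            assumptions chosen to mimic the behaviors described in the warning; they are not the
            analyzers' actual code. The second pattern requires PCRE UTF-8 support.
        </para>

        <programlisting language="php"><![CDATA[
// "foo", EM DASH (U+2014, UTF-8 bytes E2 80 94), "bar"
$text = "foo\xE2\x80\x94bar";

// Old rule (assumed): every non-ASCII byte counts as a letter
preg_match_all('/[a-zA-Z\x80-\xFF]+/', $text, $old);

// New rule (assumed): only Unicode letters form a term
preg_match_all('/\p{L}+/u', $text, $new);

echo count($old[0]), "\n"; // 1: the dash is glued into a single term
echo count($new[0]), "\n"; // 2: "foo" and "bar"
]]></programlisting>

        <para>
            A term indexed under the first rule cannot be found by a query tokenized under the
            second one, which is the mismatch that rebuilding the index removes.
        </para>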

        <para>
            All of these analyzers require the PCRE (Perl-compatible regular expressions) library
            to be compiled with UTF-8 support. UTF-8 support is enabled in the PCRE library
            sources bundled with the <acronym>PHP</acronym> source code distribution, but if a
            shared library is used instead of the one bundled with the <acronym>PHP</acronym>
            sources, then UTF-8 support may depend on your operating system.
        </para>

        <para>
            Use the following code to check whether PCRE UTF-8 support is enabled:
        </para>

        <programlisting language="php"><![CDATA[
// \pL matches any Unicode letter and needs the /u (UTF-8) modifier.
// Without UTF-8 support the call fails; @ suppresses the warning.
if (@preg_match('/\pL/u', 'a') == 1) {
    echo "PCRE unicode support is turned on.\n";
} else {
    echo "PCRE unicode support is turned off.\n";
}
]]></programlisting>

        <para>
            The case-insensitive versions of the UTF-8 compatible analyzers also require the <ulink
                url="http://www.php.net/manual/en/ref.mbstring.php">mbstring</ulink> extension to
            be enabled.
        </para>
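
        <para>
            A minimal sketch of choosing an analyzer at run time, depending on whether mbstring
            is available. The helper name chooseUtf8AnalyzerClass() is hypothetical; it only
            returns one of the class names listed above, so that the caller can instantiate it
            and pass it to setDefault().
        </para>

        <programlisting language="php"><![CDATA[
/**
 * Returns the name of a UTF-8 analyzer class that can work with the
 * extensions currently loaded.
 */
function chooseUtf8AnalyzerClass()
{
    if (extension_loaded('mbstring')) {
        return 'Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive';
    }

    // Case-sensitive fallback: lowercase data and queries yourself
    return 'Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8';
}

$class = chooseUtf8AnalyzerClass();
echo $class, "\n";

// With Zend Framework loaded:
// Zend_Search_Lucene_Analysis_Analyzer::setDefault(new $class());
]]></programlisting>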

        <para>
            If you don't want to enable the mbstring extension but need case-insensitive search,
            you may use the following approach: normalize the source data before indexing and the
            query string before searching by converting both to lowercase:
        </para>

        <programlisting language="php"><![CDATA[
// Indexing
setlocale(LC_CTYPE, 'de_DE.iso-8859-1');

...

Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());

...

$doc = new Zend_Search_Lucene_Document();

// Contents field for searching (indexed, unstored), lowercased
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
                                                  strtolower($contents)));

// Title field for searching (indexed, unstored), lowercased
$doc->addField(Zend_Search_Lucene_Field::UnStored('title',
                                                  strtolower($title)));

// Title field for retrieving (unindexed, stored), original case
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('_title', $title));
]]></programlisting>

        <programlisting language="php"><![CDATA[
// Searching
setlocale(LC_CTYPE, 'de_DE.iso-8859-1');

...

Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());

...

$hits = $index->find(strtolower($query));
]]></programlisting>
    </sect2>
</sect1>