<?xml version="1.0" encoding="UTF-8"?>
<!-- Reviewed: no -->
<sect1 id="zend.search.lucene.charset">
    <title>Character Set</title>

    <sect2 id="zend.search.lucene.charset.description">
        <title>UTF-8 and single-byte character set support</title>

        <para>
            <classname>Zend_Search_Lucene</classname> works with the UTF-8 charset internally. Index
            files store Unicode data in Java's "modified UTF-8 encoding". The
            <classname>Zend_Search_Lucene</classname> core fully supports this encoding, with
            one exception.

            <footnote>
                <para>
                    <classname>Zend_Search_Lucene</classname> supports only Basic Multilingual Plane
                    (BMP) characters (from 0x0000 to 0xFFFF) and doesn't support
                    "supplementary characters" (characters whose code points are
                    greater than 0xFFFF).
                </para>

                <para>
                    Java 2 represents these characters as a pair of char (16-bit)
                    values, the first from the high-surrogates range (0xD800-0xDBFF),
                    the second from the low-surrogates range (0xDC00-0xDFFF). Each
                    surrogate is then encoded as an ordinary three-byte UTF-8 sequence,
                    giving six bytes in total. The standard UTF-8 representation uses
                    four bytes for supplementary characters.
                </para>
            </footnote>
        </para>
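
        <para>
            The difference between the two encodings described in the footnote can be sketched in
            plain <acronym>PHP</acronym>. The helper functions below are hypothetical and are not
            part of <classname>Zend_Search_Lucene</classname>; the sample code point U+1D11E is an
            arbitrary supplementary character chosen for illustration.
        </para>

        <programlisting language="php"><![CDATA[
// Hypothetical helper: encode one code point as a standard UTF-8 sequence.
function encodeStandardUtf8($cp)
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: encode a supplementary code point (> 0xFFFF) the way
// Java's modified UTF-8 does, as two surrogates of three bytes each.
function encodeModifiedUtf8Supplementary($cp)
{
    $offset = $cp - 0x10000;
    $high   = 0xD800 | ($offset >> 10);    // high surrogate, 0xD800-0xDBFF
    $low    = 0xDC00 | ($offset & 0x3FF);  // low surrogate, 0xDC00-0xDFFF

    return encodeStandardUtf8($high) . encodeStandardUtf8($low);
}

$cp = 0x1D11E; // MUSICAL SYMBOL G CLEF, outside the BMP

echo bin2hex(encodeStandardUtf8($cp)), "\n";               // f09d849e (4 bytes)
echo bin2hex(encodeModifiedUtf8Supplementary($cp)), "\n";  // eda0b4edb49e (6 bytes)
]]></programlisting>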

        <para>
            The actual encoding of the input data may be specified through the
            <classname>Zend_Search_Lucene</classname> <acronym>API</acronym>. The data is then
            automatically converted to UTF-8.
        </para>
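
        <para>
            As a rough illustration of what such a conversion involves, the sketch below converts
            an ISO-8859-1 string to UTF-8 with <acronym>PHP</acronym>'s iconv() function. It is not
            the library's own code, and the sample text and variable names are assumptions made
            for the example.
        </para>

        <programlisting language="php"><![CDATA[
// "Grüße" encoded as ISO-8859-1: one byte per character.
$latin1 = "Gr\xFC\xDFe";

// Convert to UTF-8; "ü" and "ß" become two bytes each.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

echo strlen($latin1), "\n"; // 5
echo strlen($utf8), "\n";   // 7
echo bin2hex($utf8), "\n";  // 4772c3bcc39f65
]]></programlisting>

        <para>
            The field factory methods are understood to accept the source encoding as an optional
            third argument, for example
            <code>Zend_Search_Lucene_Field::Text('title', $title, 'iso-8859-1')</code>; check the
            <acronym>API</acronym> documentation for your version before relying on this.
        </para>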
    </sect2>

    <sect2 id="zend.search.lucene.charset.default_analyzer">
        <title>Default text analyzer</title>

        <para>
            However, the default text analyzer (which is also used by the query parser) uses
            ctype_alpha() to tokenize text and queries.
        </para>

        <para>
            ctype_alpha() is not UTF-8 compatible, so the analyzer converts text to the
            'ASCII//TRANSLIT' encoding before indexing. The same conversion is performed
            transparently during query parsing.

            <footnote>
                <para>
                    The result of conversion to 'ASCII//TRANSLIT' may depend on the current locale and OS.
                </para>
            </footnote>
        </para>
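
        <para>
            The following sketch shows why the conversion is needed. It is a stand-alone
            illustration rather than the analyzer's actual code; the sample strings are
            assumptions, and the locale is set to "C" only to make the ctype_alpha() results
            predictable.
        </para>

        <programlisting language="php"><![CDATA[
// In the "C" locale ctype_alpha() recognizes ASCII letters only.
setlocale(LC_CTYPE, 'C');

$ascii = 'cafe';
$utf8  = "caf\xC3\xA9"; // "café" encoded as UTF-8

var_dump(ctype_alpha($ascii)); // bool(true)
var_dump(ctype_alpha($utf8));  // bool(false): bytes 0xC3 0xA9 are not ASCII letters

// Transliterate to ASCII so that ctype_alpha() can handle the text.
// The result depends on the iconv implementation, locale and OS
// (for example "cafe", "caf'e" or "caf?"), and iconv() may return false.
$translit = @iconv('UTF-8', 'ASCII//TRANSLIT', $utf8);
var_dump($translit);
]]></programlisting>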

        <note>
            <title/>
            <para>
                The default analyzer doesn't treat numbers as parts of terms. Use the corresponding
                'Num' analyzer if you don't want words to be broken by numbers.
            </para>
        </note>
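
        <para>
            The effect described in the note can be approximated with a small regular-expression
            sketch. The helper function is hypothetical and only mimics the behavior for ASCII
            text; it is not how the analyzers are implemented.
        </para>

        <programlisting language="php"><![CDATA[
// Hypothetical helper: return the runs of characters that belong to
// the given character class; everything else acts as a separator.
function sketchTokenize($text, $charClass)
{
    preg_match_all('/[' . $charClass . ']+/', $text, $matches);
    return $matches[0];
}

$text = 'utf8 mp3player';

// Letters only: digits break words apart.
print_r(sketchTokenize($text, 'a-zA-Z'));    // utf, mp, player

// Letters and digits: numbers stay part of the term.
print_r(sketchTokenize($text, 'a-zA-Z0-9')); // utf8, mp3player
]]></programlisting>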
    </sect2>

    <sect2 id="zend.search.lucene.charset.utf_analyzer">
        <title>UTF-8 compatible text analyzers</title>

        <para>
            <classname>Zend_Search_Lucene</classname> also contains a set of UTF-8 compatible
            analyzers: <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8</classname>,
            <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num</classname>,
            <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive</classname>,
            <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive</classname>.
        </para>

        <para>
            Any of these analyzers can be enabled with code like this:
        </para>

        <programlisting language="php"><![CDATA[
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());
]]></programlisting>

        <warning>
            <title/>
            <para>
                The UTF-8 compatible analyzers were improved in Zend Framework 1.5. Earlier versions
                of the analyzers assumed that all non-ASCII characters are letters. The new analyzer
                implementation behaves more accurately.
            </para>

            <para>
                You may therefore need to rebuild your index so that data and search queries are
                tokenized in the same way; otherwise the search engine may return wrong result sets.
            </para>
        </warning>

        <para>
            All of these analyzers require the PCRE (Perl-compatible regular expressions) library
            to be compiled with UTF-8 support turned on. PCRE UTF-8 support is turned on in the PCRE
            library sources bundled with the <acronym>PHP</acronym> source code distribution, but if
            a shared library is used instead of the one bundled with the <acronym>PHP</acronym>
            sources, then the state of UTF-8 support may depend on your operating system.
        </para>

        <para>
            Use the following code to check whether PCRE UTF-8 support is enabled:
        </para>

        <programlisting language="php"><![CDATA[
if (@preg_match('/\pL/u', 'a') == 1) {
    echo "PCRE unicode support is turned on.\n";
} else {
    echo "PCRE unicode support is turned off.\n";
}
]]></programlisting>
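
        <para>
            To see what this support makes possible, the sketch below tokenizes a UTF-8 string
            with and without the Unicode-aware pattern. The patterns and sample text are
            assumptions chosen for illustration and are not taken from the analyzers' source.
        </para>

        <programlisting language="php"><![CDATA[
// "Grüße aus München" encoded as UTF-8.
$text = "Gr\xC3\xBC\xC3\x9Fe aus M\xC3\xBCnchen";

// Byte-oriented matching: non-ASCII letters split the words.
preg_match_all('/[a-zA-Z]+/', $text, $byteTokens);
print_r($byteTokens[0]); // Gr, e, aus, M, nchen

// Unicode-aware matching; this needs PCRE UTF-8 support.
$count = @preg_match_all('/\p{L}+/u', $text, $unicodeTokens);
if ($count === false) {
    echo "PCRE UTF-8 support is not available.\n";
} else {
    print_r($unicodeTokens[0]); // Grüße, aus, München
}
]]></programlisting>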

        <para>
            The case-insensitive versions of the UTF-8 compatible analyzers also require the <ulink
                url="http://www.php.net/manual/en/ref.mbstring.php">mbstring</ulink> extension to
            be enabled.
        </para>
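
        <para>
            The extension is presumably needed for multibyte-aware lowercasing. A minimal sketch,
            assuming UTF-8 input and using an arbitrary sample word, shows how to check for the
            extension and what mb_strtolower() does with a non-ASCII letter:
        </para>

        <programlisting language="php"><![CDATA[
// "ÄPFEL" encoded as UTF-8.
$text = "\xC3\x84PFEL";

if (extension_loaded('mbstring')) {
    // Lowercase the whole string, including the two-byte "Ä".
    $lower = mb_strtolower($text, 'UTF-8');
    echo bin2hex($lower), "\n"; // c3a47066656c, that is "äpfel"
} else {
    echo "The mbstring extension is not enabled.\n";
}
]]></programlisting>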

        <para>
            If you don't want to enable the mbstring extension but need case-insensitive search,
            you may use the following approach: normalize the source data before indexing and the
            query string before searching by converting them to lowercase:
        </para>

        <programlisting language="php"><![CDATA[
// Indexing
setlocale(LC_CTYPE, 'de_DE.iso-8859-1');

Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());

$doc = new Zend_Search_Lucene_Document();

// Contents field for searching (indexed, unstored)
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
                                                  strtolower($contents)));

// Title field for searching (indexed, unstored)
$doc->addField(Zend_Search_Lucene_Field::UnStored('title',
                                                  strtolower($title)));

// Title field for retrieval (unindexed, stored)
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('_title', $title));
]]></programlisting>

        <programlisting language="php"><![CDATA[
// Searching
setlocale(LC_CTYPE, 'de_DE.iso-8859-1');

Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());

$hits = $index->find(strtolower($query));
]]></programlisting>
    </sect2>
</sect1>