<?xml version="1.0" encoding="UTF-8"?>
<sect1 id="zend.search.lucene.charset">
    <title>Character Set</title>

    <sect2 id="zend.search.lucene.charset.description">
        <title>UTF-8 and single-byte character set support</title>
        <para>
            <classname>Zend_Search_Lucene</classname> works with the UTF-8 charset internally.
            Index files store Unicode data in Java's "modified UTF-8 encoding". The
            <classname>Zend_Search_Lucene</classname> core fully supports this encoding with
            one exception: it handles only Basic Multilingual Plane (BMP) characters
            (0x0000 to 0xFFFF) and does not support "supplementary characters"
            (characters whose code points are greater than 0xFFFF).
        </para>

        <para>
            Java represents each supplementary character as a pair of 16-bit char values,
            the first from the high-surrogates range (0xD800-0xDBFF) and the second from the
            low-surrogates range (0xDC00-0xDFFF). Each surrogate is then encoded as an
            ordinary three-byte UTF-8 sequence, giving six bytes in total. Standard UTF-8
            uses four bytes for supplementary characters.
        </para>
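        <para>
            The size difference can be checked by hand. The following is a minimal sketch, not
            part of <classname>Zend_Search_Lucene</classname>: the helper name
            encodeModifiedUtf8Supplementary() is hypothetical, and U+1F600 is an arbitrary
            supplementary code point chosen for illustration.
        </para>

        <programlisting language="php"><![CDATA[
// Hypothetical helper (not a Zend_Search_Lucene function): encodes one
// supplementary code point as a surrogate pair, each half written as a
// three-byte UTF-8 sequence, as in Java's modified UTF-8.
function encodeModifiedUtf8Supplementary($codePoint)
{
    $offset = $codePoint - 0x10000;
    $high   = 0xD800 + ($offset >> 10);    // high surrogate
    $low    = 0xDC00 + ($offset & 0x3FF);  // low surrogate

    $bytes = '';
    foreach (array($high, $low) as $unit) {
        $bytes .= chr(0xE0 | ($unit >> 12))
                . chr(0x80 | (($unit >> 6) & 0x3F))
                . chr(0x80 | ($unit & 0x3F));
    }

    return $bytes;
}

$modified = encodeModifiedUtf8Supplementary(0x1F600);
$standard = "\xF0\x9F\x98\x80"; // standard UTF-8 bytes for U+1F600

echo bin2hex($modified), ' (', strlen($modified), " bytes)\n"; // eda0bdedb880 (6 bytes)
echo bin2hex($standard), ' (', strlen($standard), " bytes)\n"; // f09f9880 (4 bytes)
]]></programlisting>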
        <para>
            The actual encoding of input data may be specified through the
            <classname>Zend_Search_Lucene</classname> <acronym>API</acronym>. The data is then
            automatically converted to UTF-8.
        </para>
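        <para>
            As a rough illustration of what such a conversion involves, the sketch below
            converts a single-byte string to UTF-8 with plain iconv(). It does not call
            <classname>Zend_Search_Lucene</classname> itself; the sample string and variable
            names are assumptions made for the example.
        </para>

        <programlisting language="php"><![CDATA[
// "Grüße" as ISO-8859-1 bytes: one byte per character.
$latin1 = "Gr\xFC\xDFe";

// Convert to UTF-8; the two non-ASCII characters become two bytes each.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

echo strlen($latin1), ' -> ', strlen($utf8), " bytes\n"; // 5 -> 7 bytes
]]></programlisting>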
    </sect2>

    <sect2 id="zend.search.lucene.charset.default_analyzer">
        <title>Default text analyzer</title>
        <para>
            However, the default text analyzer (which is also used by the query parser) uses
            ctype_alpha() to tokenize text and queries.
        </para>

        <para>
            ctype_alpha() is not UTF-8 compatible, so the analyzer converts text to
            'ASCII//TRANSLIT' encoding before indexing. The same conversion is applied
            transparently during query parsing.
        </para>

        <para>
            The result of the conversion to 'ASCII//TRANSLIT' may depend on the current locale
            and operating system.
        </para>
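        <para>
            The conversion can be tried in isolation with iconv(). This is only a sketch: the
            sample text is an assumption, and no expected output is given because the
            transliterated form varies between systems (some iconv implementations do not
            support '//TRANSLIT' at all and return false).
        </para>

        <programlisting language="php"><![CDATA[
// "café Müller" encoded as UTF-8
$utf8 = "caf\xC3\xA9 M\xC3\xBCller";

// Transliterate to ASCII; the exact result depends on locale and OS.
$ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $utf8);

if ($ascii === false) {
    echo "ASCII//TRANSLIT conversion is not available here.\n";
} else {
    echo $ascii, "\n";
}
]]></programlisting>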
        <para>
            The default analyzer does not treat numbers as parts of terms. Use the
            corresponding 'Num' analyzer if you do not want words to be broken by numbers.
        </para>
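        <para>
            The effect of letter-only tokenization can be sketched with a few lines of plain
            <acronym>PHP</acronym>. The function tokenizeLetters() below is hypothetical and
            only approximates the behavior described above; it is not the analyzer's actual
            implementation.
        </para>

        <programlisting language="php"><![CDATA[
// Hypothetical approximation of a ctype_alpha()-based tokenizer:
// every run of letters becomes a token, anything else is a separator.
function tokenizeLetters($text)
{
    $tokens  = array();
    $current = '';
    $length  = strlen($text);

    for ($i = 0; $i < $length; $i++) {
        if (ctype_alpha($text[$i])) {
            $current .= $text[$i];
        } elseif ($current !== '') {
            $tokens[] = $current;
            $current  = '';
        }
    }
    if ($current !== '') {
        $tokens[] = $current;
    }

    return $tokens;
}

// Digits split the words: mp, player, and, web, py
echo implode(', ', tokenizeLetters('mp3 player and web2py')), "\n";
]]></programlisting>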
    </sect2>

    <sect2 id="zend.search.lucene.charset.utf_analyzer">
        <title>UTF-8 compatible text analyzers</title>
        <para>
            <classname>Zend_Search_Lucene</classname> also contains a set of UTF-8 compatible
            analyzers: <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8</classname>,
            <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num</classname>,
            <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive</classname>,
            <classname>Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive</classname>.
        </para>

        <para>
            Any of these analyzers can be enabled with code like this:
        </para>
        <programlisting language="php"><![CDATA[
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());
]]></programlisting>
        <para>
            The UTF-8 compatible analyzers were improved in Zend Framework 1.5. Earlier
            versions assumed that all non-ASCII characters are letters. The new
            implementation behaves more accurately.
        </para>

        <para>
            You may therefore need to rebuild the index so that data and search queries are
            tokenized in the same way. Otherwise the search engine may return wrong result
            sets.
        </para>
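        <para>
            To see why an index built with the older behavior may not match newer queries,
            compare two tokenization rules on the same input. Both regular expressions below
            are assumptions chosen to mimic the two behaviors described above; they are not
            taken from the analyzer source. The second one requires PCRE UTF-8 support.
        </para>

        <programlisting language="php"><![CDATA[
// "«café»" encoded as UTF-8; the guillemets are punctuation, not letters.
$text = "\xC2\xABcaf\xC3\xA9\xC2\xBB";

// Assumed "old" rule: every non-ASCII byte counts as part of a word.
preg_match_all('/[a-zA-Z\x80-\xFF]+/', $text, $old);

// Assumed "new" rule: only Unicode letters count as part of a word.
preg_match_all('/\p{L}+/u', $text, $new);

echo $old[0][0], "\n"; // «café»
echo $new[0][0], "\n"; // café
]]></programlisting>

        <para>
            A term indexed as "«café»" would not be found by a query tokenized as "café",
            which is why the index and the queries must be processed by the same analyzer
            version.
        </para>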
        <para>
            All of these analyzers require the PCRE (Perl-compatible regular expressions)
            library to be compiled with UTF-8 support. UTF-8 support is turned on for the PCRE
            library sources bundled with the <acronym>PHP</acronym> source distribution, but
            if a shared library is used instead of the sources bundled with
            <acronym>PHP</acronym>, the state of UTF-8 support may depend on your operating
            system.
        </para>

        <para>
            Use the following code to check whether PCRE UTF-8 support is enabled:
        </para>
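        <programlisting language="php"><![CDATA[
// \pL matches any Unicode letter; the /u modifier only works with PCRE UTF-8 support.
if (@preg_match('/\pL/u', 'a') == 1) {
    echo "PCRE unicode support is turned on.\n";
} else {
    echo "PCRE unicode support is turned off.\n";
}
]]></programlisting>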
        <para>
            The case-insensitive versions of the UTF-8 compatible analyzers also require the
            <ulink url="http://www.php.net/manual/en/ref.mbstring.php">mbstring</ulink>
            extension to be enabled.
        </para>
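        <para>
            A quick way to see what mbstring contributes is to lowercase a non-ASCII word.
            This is a minimal sketch with an assumed sample word; it checks for the extension
            first so that it also runs where mbstring is not loaded.
        </para>

        <programlisting language="php"><![CDATA[
// "ÄPFEL" encoded as UTF-8
$word = "\xC3\x84PFEL";

if (function_exists('mb_strtolower')) {
    // mb_strtolower() understands multi-byte characters when given the encoding.
    $lower = mb_strtolower($word, 'UTF-8');
    echo $lower, "\n"; // äpfel
} else {
    echo "The mbstring extension is not available.\n";
}
]]></programlisting>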
        <para>
            If you do not want to enable the mbstring extension but need case-insensitive
            search, you may use the following approach: normalize the source data before
            indexing and the query string before searching by converting both to lowercase:
        </para>
        <programlisting language="php"><![CDATA[
// Indexing
setlocale(LC_CTYPE, 'de_DE.iso-8859-1');

Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());

$doc = new Zend_Search_Lucene_Document();

// Contents field for search through (indexed, unstored), lowercased
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
                                                  strtolower($contents)));

// Title field for search through (indexed, unstored)
$doc->addField(Zend_Search_Lucene_Field::UnStored('title',
                                                  strtolower($title)));

// Title field for retrieving (unindexed, stored)
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('_title', $title));
]]></programlisting>
        <programlisting language="php"><![CDATA[
// Searching
setlocale(LC_CTYPE, 'de_DE.iso-8859-1');

Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());

// Lowercase the query in the same way as the indexed data
$hits = $index->find(strtolower($query));
]]></programlisting>
    </sect2>
</sect1>