<?xml version="1.0" encoding="UTF-8"?>
<sect1 id="zend.search.lucene.best-practice">
<title>Best Practices</title>
<sect2 id="zend.search.lucene.best-practice.field-names">
<title>Field names</title>
<classname>Zend_Search_Lucene</classname> places no restrictions on field names.
Nevertheless, it's a good idea to avoid the names '<emphasis>id</emphasis>' and
'<emphasis>score</emphasis>' to prevent ambiguity with <code>QueryHit</code>
property names.
The <classname>Zend_Search_Lucene_Search_QueryHit</classname> <code>id</code> and
<code>score</code> properties always refer to the internal Lucene document id and the hit <link
linkend="zend.search.lucene.searching.results-scoring">score</link>. If the indexed
document has stored fields with the same names, you have to use the
<methodname>getDocument()</methodname> method to access them:
<programlisting language="php"><![CDATA[
$hits = $index->find($query);
foreach ($hits as $hit) {
    // Get 'title' document field
    $title = $hit->title;
    // Get 'contents' document field
    $contents = $hit->contents;
    // Get internal Lucene document id
    $id = $hit->id;
    // Get query hit score
    $score = $hit->score;
    // Get 'id' document field
    $docId = $hit->getDocument()->id;
    // Get 'score' document field
    $docScore = $hit->getDocument()->score;
    // Another way to get 'title' document field
    $title = $hit->getDocument()->title;
}
]]></programlisting>
</sect2>
<sect2 id="zend.search.lucene.best-practice.indexing-performance">
<title>Indexing performance</title>
Indexing performance is a compromise between used resources, indexing time and index
quality.
Index quality is completely determined by the number of index segments.
Each index segment is an entirely independent portion of data, so indexes containing more
segments need more memory and time for searching.
Index optimization is the process of merging several segments into a new one. A fully
optimized index contains only one segment.
Full index optimization may be performed with the <methodname>optimize()</methodname>
method:
<programlisting language="php"><![CDATA[
$index = Zend_Search_Lucene::open($indexPath);
// Merge all segments into one
$index->optimize();
]]></programlisting>
Index optimization works with data streams and doesn't take a lot of memory, but it does
require processor resources and time.
Lucene index segments are not updatable by nature (an update operation requires
the segment file to be completely rewritten). So adding new documents to an index
always generates a new segment. This, in turn, decreases index quality.
An index auto-optimization process is performed after each segment generation and
consists of merging partial segments.
There are three options to control the behavior of auto-optimization (see the <link
linkend="zend.search.lucene.index-creation.optimization">Index optimization</link>
section):
<emphasis>MaxBufferedDocs</emphasis> is the number of documents that can be
buffered in memory before a new segment is generated and written to the hard
disk.
<emphasis>MaxMergeDocs</emphasis> is the maximum number of documents merged
by the auto-optimization process into a new segment.
<emphasis>MergeFactor</emphasis> determines how often auto-optimization is
performed.
All these options are <classname>Zend_Search_Lucene</classname> object
properties, not index properties. They affect only the behavior of the current
<classname>Zend_Search_Lucene</classname> object and may vary for
different scripts.
<emphasis>MaxBufferedDocs</emphasis> has no effect if you index only one
document per script execution. On the other hand, it's very important for batch
indexing. Greater values increase indexing performance, but also require more memory.
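<para>
As a minimal sketch of batch indexing with these options, the example below sets
<emphasis>MaxBufferedDocs</emphasis> and <emphasis>MergeFactor</emphasis> on the index
object before adding documents. The <code>$indexPath</code> and <code>$documents</code>
variables, the field names and the numeric values are assumptions made for illustration
only, not recommended settings. The sketch also assumes the index already exists at the
given path.
</para>
<programlisting language="php"><![CDATA[
require_once 'Zend/Search/Lucene.php';

// Hypothetical inputs: adjust to your application.
$indexPath = '/path/to/index';
$documents = array(
    array('title' => 'First',  'contents' => 'First document body'),
    array('title' => 'Second', 'contents' => 'Second document body'),
);

$index = Zend_Search_Lucene::open($indexPath);

// These settings belong to this object only; they are not stored in the index.
$index->setMaxBufferedDocs(100); // illustrative value, bounded by available memory
$index->setMergeFactor(10);      // illustrative value

foreach ($documents as $data) {
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::Text('title', $data['title'], 'utf-8'));
    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $data['contents'], 'utf-8'));
    $index->addDocument($doc);
}

// Write any documents still buffered in memory to a new segment.
$index->commit();
]]></programlisting>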
There is simply no way to calculate the best value for the
<emphasis>MaxBufferedDocs</emphasis> parameter because it depends on the average document
size, the analyzer in use and the allowed memory.
A good way to find the right value is to perform several tests with the largest document
you expect to be added to the index.
<methodname>memory_get_usage()</methodname> and
<methodname>memory_get_peak_usage()</methodname> may be used to monitor memory
usage. It's a best practice not to use more than half of the allowed memory.
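<para>
The measurement described above can be sketched as follows. The helper names
<code>iniSizeToBytes()</code> and <code>exceedsHalfOfLimit()</code> are hypothetical, and
the placeholder comment marks where a test indexing run would go. Only
<code>ini_get()</code> and <code>memory_get_peak_usage()</code> are standard
<acronym>PHP</acronym> functions here.
</para>
<programlisting language="php"><![CDATA[
/**
 * Convert a php.ini shorthand size such as "128M" to bytes.
 * Returns -1 when the value means "no limit".
 */
function iniSizeToBytes($value)
{
    $value = trim($value);
    if ($value === '' || $value === '-1') {
        return -1;
    }
    $number = (int) $value;
    switch (strtolower(substr($value, -1))) {
        case 'g':
            $number *= 1024;
            // fall through
        case 'm':
            $number *= 1024;
            // fall through
        case 'k':
            $number *= 1024;
    }
    return $number;
}

/**
 * Report whether peak memory usage exceeds half of the given limit.
 */
function exceedsHalfOfLimit($peakBytes, $limitBytes)
{
    if ($limitBytes < 0) {
        return false; // no limit configured
    }
    return $peakBytes > $limitBytes / 2;
}

$limit = iniSizeToBytes(ini_get('memory_limit'));

// ... index the largest expected document here with a candidate MaxBufferedDocs ...

$peak = memory_get_peak_usage();
if (exceedsHalfOfLimit($peak, $limit)) {
    echo "Peak usage of $peak bytes exceeds half of the memory limit\n";
}
]]></programlisting>
<para>
If the check reports that the threshold is exceeded, a lower
<emphasis>MaxBufferedDocs</emphasis> value is worth testing.
</para>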
<emphasis>MaxMergeDocs</emphasis> limits the segment size (in terms of documents). It
therefore also limits auto-optimization time by guaranteeing that the
<methodname>addDocument()</methodname> method is not executed more than a certain number
of times. This is very important for interactive applications.
Lowering the <emphasis>MaxMergeDocs</emphasis> parameter may also improve batch indexing
performance. Index auto-optimization is an iterative process and is performed from the
bottom up. Small segments are merged into larger segments, which are in turn merged into
even larger segments and so on. Full index optimization is achieved when only one large
segment file remains.
Small segments generally decrease index quality. Many small segments may also trigger
the "Too many open files" error, which is determined by OS limitations.
<classname>Zend_Search_Lucene</classname> keeps each segment file open to
improve search performance.
In general, background index optimization should be performed for interactive indexing
mode, and <emphasis>MaxMergeDocs</emphasis> shouldn't be too low for batch indexing.
<emphasis>MergeFactor</emphasis> affects auto-optimization frequency. Lower values
increase the quality of unoptimized indexes. Larger values increase indexing
performance, but also increase the number of merged segments. This again may trigger the
"Too many open files" error.
<emphasis>MergeFactor</emphasis> groups index segments by their size:
<para>Not greater than <emphasis>MaxBufferedDocs</emphasis>.</para>
Greater than <emphasis>MaxBufferedDocs</emphasis>, but not greater than
<emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis>.
Greater than
<emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis>, but
not greater than
<emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis>*<emphasis>MergeFactor</emphasis>.
<listitem><para>...</para></listitem>
<classname>Zend_Search_Lucene</classname> checks during each
<methodname>addDocument()</methodname> call to see if merging any segments may move the
newly created segment into the next group. If so, merging is performed.
So an index with N groups may contain <emphasis>MaxBufferedDocs</emphasis> +
(N-1)*<emphasis>MergeFactor</emphasis> segments and contains at least
<emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis><superscript>(N-1)</superscript>
documents.
This gives a good approximation for the number of segments in the index:
<emphasis>NumberOfSegments</emphasis> &lt;= <emphasis>MaxBufferedDocs</emphasis> +
<emphasis>MergeFactor</emphasis>*log
<subscript><emphasis>MergeFactor</emphasis></subscript>
(<emphasis>NumberOfDocuments</emphasis>/<emphasis>MaxBufferedDocs</emphasis>)
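<para>
The estimate above can be evaluated with a small helper. The function name
<code>estimateMaxSegments()</code> is hypothetical. As assumptions of this sketch, the
helper requires a <emphasis>MergeFactor</emphasis> of at least 2 and treats an index
holding fewer documents than <emphasis>MaxBufferedDocs</emphasis> as if the logarithm
term were zero.
</para>
<programlisting language="php"><![CDATA[
/**
 * Upper-bound estimate of the number of index segments:
 * MaxBufferedDocs + MergeFactor * log_MergeFactor(NumberOfDocuments / MaxBufferedDocs)
 */
function estimateMaxSegments($numberOfDocuments, $maxBufferedDocs, $mergeFactor)
{
    if ($maxBufferedDocs < 1 || $mergeFactor < 2) {
        throw new InvalidArgumentException(
            'MaxBufferedDocs must be >= 1 and MergeFactor must be >= 2'
        );
    }
    // Assumption: a ratio below 1 is clamped to 1, so the log term is never negative.
    $ratio = max(1, $numberOfDocuments / $maxBufferedDocs);
    return $maxBufferedDocs + $mergeFactor * log($ratio, $mergeFactor);
}

// Illustrative values only: 100,000 documents, MaxBufferedDocs = 10, MergeFactor = 10
$bound = estimateMaxSegments(100000, 10, 10);
printf("At most about %d segments\n", round($bound));
]]></programlisting>
<para>
With these illustrative values the bound is 10 + 10*4, or about 50 segments. Comparing
the result for several candidate merge factors shows how quickly the segment count grows.
</para>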
<emphasis>MaxBufferedDocs</emphasis> is determined by the allowed memory. This makes it
possible to choose an appropriate merge factor that yields a reasonable number of segments.
Tuning the <emphasis>MergeFactor</emphasis> parameter is more effective for batch
indexing performance than <emphasis>MaxMergeDocs</emphasis>, but it's also more
coarse-grained. So use the estimation above for tuning <emphasis>MergeFactor</emphasis>,
then play with <emphasis>MaxMergeDocs</emphasis> to get the best batch indexing performance.
</sect2>
<sect2 id="zend.search.lucene.best-practice.shutting-down">
<title>Index during Shut Down</title>
The <classname>Zend_Search_Lucene</classname> instance performs some work at exit time
if any documents were added to the index but not written to a new segment.
It also may trigger an auto-optimization process.
The index object is automatically closed when it, and all returned QueryHit objects, go
out of scope.
If the index object is stored in a global variable, then it's closed only at the end of script
execution.
This also may occur if the index or QueryHit instances are referred to in some
cyclical data structures, because <acronym>PHP</acronym> garbage collects
objects with cyclic references only at the end of script execution.
<acronym>PHP</acronym> exception processing is also shut down at this moment.
It doesn't prevent the normal index shutdown process, but may prevent accurate error
diagnostics if any error occurs during shutdown.
There are two ways to avoid this problem.
The first is to force the index to go out of scope:
<programlisting language="php"><![CDATA[
$index = Zend_Search_Lucene::open($indexPath);
// ... work with the index ...
unset($index);
]]></programlisting>
The second is to perform a commit operation before the end of script execution:
<programlisting language="php"><![CDATA[
$index = Zend_Search_Lucene::open($indexPath);
$index->commit();
]]></programlisting>
This possibility is also described in the "<link
linkend="zend.search.lucene.advanced.static">Advanced. Using index as static
property</link>" section.
</sect2>
<sect2 id="zend.search.lucene.best-practice.unique-id">
<title>Retrieving documents by unique id</title>
It's a common practice to store some unique document id in the index. Examples include
a URL, a path, or a database id.
<classname>Zend_Search_Lucene</classname> provides a <methodname>termDocs()</methodname>
method for retrieving documents containing specified terms.
This is more efficient than using the <methodname>find()</methodname> method:
<programlisting language="php"><![CDATA[
// Retrieving documents with find() method using a query string
$query = $idFieldName . ':' . $docId;
$hits = $index->find($query);
foreach ($hits as $hit) {
    $title = $hit->title;
    $contents = $hit->contents;
}

// Retrieving documents with find() method using the query API
$term = new Zend_Search_Lucene_Index_Term($docId, $idFieldName);
$query = new Zend_Search_Lucene_Search_Query_Term($term);
$hits = $index->find($query);
foreach ($hits as $hit) {
    $title = $hit->title;
    $contents = $hit->contents;
}

// Retrieving documents with termDocs() method
$term = new Zend_Search_Lucene_Index_Term($docId, $idFieldName);
$docIds = $index->termDocs($term);
foreach ($docIds as $id) {
    $doc = $index->getDocument($id);
    $title = $doc->title;
    $contents = $doc->contents;
}
]]></programlisting>
</sect2>
<sect2 id="zend.search.lucene.best-practice.memory-usage">
<title>Memory Usage</title>
<classname>Zend_Search_Lucene</classname> is a relatively memory-intensive module.
It uses memory to cache some information and optimize searching and indexing
performance.
The memory required differs for different modes.
The terms dictionary index is loaded during the search. It's actually every
128<superscript>th</superscript> term of the full dictionary.
The Lucene file format allows you to configure this number, but
<classname>Zend_Search_Lucene</classname> doesn't expose this in its
<acronym>API</acronym>. Nevertheless, you still have the ability to configure
this value if the index is prepared with another Lucene implementation.
Because the dictionary index grows with the dictionary, memory usage increases if you have
a high number of unique terms. This may happen if you use untokenized phrases as field
values or index a large volume of non-text information.
An unoptimized index consists of several segments, which also increases memory usage.
Segments are independent, so each segment contains its own terms dictionary and terms
dictionary index. If an index consists of <emphasis>N</emphasis> segments, it may
increase memory usage by <emphasis>N</emphasis> times in the worst case. Perform index
optimization to merge all segments into one to avoid such memory consumption.
Indexing uses the same memory as searching plus memory for buffering documents. The
amount of memory used may be managed with the <emphasis>MaxBufferedDocs</emphasis>
parameter.
Index optimization (full or partial) uses stream-style data processing and doesn't
require a lot of memory.
</sect2>
<sect2 id="zend.search.lucene.best-practice.encoding">
<title>Encoding</title>
<classname>Zend_Search_Lucene</classname> works with UTF-8 strings internally, so all
strings returned by <classname>Zend_Search_Lucene</classname> are UTF-8 encoded.
You needn't be concerned with encoding if you work with pure <acronym>ASCII</acronym>
data, but you should be careful if this is not the case.
The wrong encoding may cause error notices during encoding conversion, or loss of data.
<classname>Zend_Search_Lucene</classname> offers a wide range of encoding possibilities
for indexed documents and parsed queries.
Encoding may be explicitly specified as an optional parameter of field creation methods:
<programlisting language="php"><![CDATA[
$doc = new Zend_Search_Lucene_Document();
// The encoding arguments below are illustrative values
$doc->addField(Zend_Search_Lucene_Field::Text('title', $title, 'iso-8859-1'));
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $contents, 'utf-8'));
]]></programlisting>
This is the best way to avoid ambiguity in the encoding used.
If the optional encoding parameter is omitted, then the current locale is used. The current
locale may contain character encoding data in addition to the language specification:
<programlisting language="php"><![CDATA[
setlocale(LC_ALL, 'fr_FR');
setlocale(LC_ALL, 'de_DE.iso-8859-1');
setlocale(LC_ALL, 'ru_RU.UTF-8');
]]></programlisting>
The same approach is used to set query string encoding.
If encoding is not specified, then the current locale is used to determine the encoding.
Encoding may be passed as an optional parameter if the query is parsed explicitly:
<programlisting language="php"><![CDATA[
$query =
    Zend_Search_Lucene_Search_QueryParser::parse($queryStr, 'iso-8859-5');
$hits = $index->find($query);
]]></programlisting>
The default encoding may also be specified with the
<methodname>setDefaultEncoding()</methodname> method:
<programlisting language="php"><![CDATA[
Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('iso-8859-1');
$hits = $index->find($queryStr);
]]></programlisting>
The empty string implies 'current locale'.
If the correct encoding is specified, the text can be correctly processed by the analyzer. The
actual behavior depends on which analyzer is used. See the <link
linkend="zend.search.lucene.charset">Character Set</link> documentation section for
details.
</sect2>
<sect2 id="zend.search.lucene.best-practice.maintenance">
<title>Index maintenance</title>
It should be clear that <classname>Zend_Search_Lucene</classname>, like any other
Lucene implementation, is not a "database".
Indexes should not be used for data storage. They do not provide partial backup/restore
functionality, journaling, logging, transactions and many other features associated with
database management systems.
Nevertheless, <classname>Zend_Search_Lucene</classname> attempts to keep indexes in a
consistent state at all times.
Index backup and restoration should be performed by copying the contents of the index
folder.
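<para>
Copying the index contents can be sketched as below. The
<code>backupIndexDirectory()</code> helper is hypothetical and not part of
<classname>Zend_Search_Lucene</classname>. It assumes the index is a flat directory of
files and that no process is writing to the index while the copy runs.
</para>
<programlisting language="php"><![CDATA[
/**
 * Copy every regular file from the index directory to the backup directory.
 * Returns the number of files copied.
 */
function backupIndexDirectory($indexPath, $backupPath)
{
    if (!is_dir($indexPath)) {
        throw new InvalidArgumentException("Index directory not found: $indexPath");
    }
    if (!is_dir($backupPath) && !mkdir($backupPath, 0777, true)) {
        throw new RuntimeException("Cannot create backup directory: $backupPath");
    }

    $copied = 0;
    foreach (new DirectoryIterator($indexPath) as $file) {
        if (!$file->isFile()) {
            continue; // skip '.', '..' and subdirectories
        }
        $target = $backupPath . DIRECTORY_SEPARATOR . $file->getFilename();
        if (!copy($file->getPathname(), $target)) {
            throw new RuntimeException('Cannot copy ' . $file->getFilename());
        }
        $copied++;
    }
    return $copied;
}
]]></programlisting>
<para>
Restoration is the same operation with the two paths swapped, for example
<code>backupIndexDirectory($backupPath, $indexPath)</code>.
</para>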
If index corruption occurs for any reason, the corrupted index should be restored or
rebuilt.
So it's a good idea to back up large indexes and store changelogs to perform manual
restoration and roll-forward operations if necessary. This practice dramatically reduces
index restoration time.
</sect2>
</sect1>