documentation/manual/en/module_specs/Zend_Search_Lucene-BestPractice.xml

   1 <?xml version="1.0" encoding="UTF-8"?>
   2 <!-- Reviewed: no -->
   3 <sect1 id="zend.search.lucene.best-practice">
   4     <title>Best Practices</title>
   5
   6     <sect2 id="zend.search.lucene.best-practice.field-names">
   7         <title>Field names</title>
   8
   9         <para>
  10             There are no limitations for field names in <classname>Zend_Search_Lucene</classname>.
  11         </para>
  12
  13         <para>
  14             Nevertheless, it's a good idea not to use '<emphasis>id</emphasis>' and
  15             '<emphasis>score</emphasis>' as field names, to avoid ambiguity with the
  16             <code>QueryHit</code> property names.
  17         </para>
  18
  19         <para>
  20             The <classname>Zend_Search_Lucene_Search_QueryHit</classname> <code>id</code> and
  21             <code>score</code> properties always refer to the internal Lucene document id and the hit <link
  22                 linkend="zend.search.lucene.searching.results-scoring">score</link>. If the indexed
  23             document has stored fields with the same names, you have to use the
  24             <methodname>getDocument()</methodname> method to access them:
  25
  26             <programlisting language="php"><![CDATA[
  27 $hits = $index->find($query);
  28
  29 foreach ($hits as $hit) {
  30     // Get 'title' document field
  31     $title = $hit->title;
  32
  33     // Get 'contents' document field
  34     $contents = $hit->contents;
  35
  36     // Get internal Lucene document id
  37     $id = $hit->id;
  38
  39     // Get query hit score
  40     $score = $hit->score;
  41
  42     // Get 'id' document field
  43     $docId = $hit->getDocument()->id;
  44
  45     // Get 'score' document field
  46     $docScore = $hit->getDocument()->score;
  47
  48     // Another way to get 'title' document field
  49     $title = $hit->getDocument()->title;
  50 }
  51 ]]></programlisting>
  52         </para>
  53     </sect2>
  54
  55     <sect2 id="zend.search.lucene.best-practice.indexing-performance">
  56         <title>Indexing performance</title>
  57
  58         <para>
  59             Indexing performance is a compromise between used resources, indexing time and index
  60             quality.
  61         </para>
  62
  63         <para>
  64             Index quality is completely determined by the number of index segments.
  65         </para>
  66
  67         <para>
  68             Each index segment is an entirely independent portion of data. So indexes containing
  69             more segments need more memory and time for searching.
  70         </para>
  71
  72         <para>
  73             Index optimization is a process of merging several segments into a new one. A fully
  74             optimized index contains only one segment.
  75         </para>
  76
  77         <para>
  78             Full index optimization may be performed with the <methodname>optimize()</methodname>
  79             method:
  80
  81             <programlisting language="php"><![CDATA[
  82 $index = Zend_Search_Lucene::open($indexPath);
  83
  84 $index->optimize();
  85 ]]></programlisting>
  86         </para>
  87
  88         <para>
  89             Index optimization works with data streams and doesn't take a lot of memory but does
  90             require processor resources and time.
  91         </para>
  92
  93         <para>
  94             Lucene index segments are not updatable by their nature (the update operation requires
  95             the segment file to be completely rewritten). So adding new document(s) to an index
  96             always generates a new segment. This, in turn, decreases index quality.
  97         </para>
  98
  99         <para>
 100             An index auto-optimization process is performed after each segment generation and
 101             consists of merging partial segments.
 102         </para>
 103
 104         <para>
 105             There are three options to control the behavior of auto-optimization (see <link
 106                 linkend="zend.search.lucene.index-creation.optimization">Index optimization</link>
 107             section):
 108
 109             <itemizedlist>
 110                 <listitem>
 111                     <para>
 112                         <emphasis>MaxBufferedDocs</emphasis> is the number of documents that can be
 113                         buffered in memory before a new segment is generated and written to the hard
 114                         drive.
 115                     </para>
 116                 </listitem>
 117
 118                 <listitem>
 119                     <para>
 120                         <emphasis>MaxMergeDocs</emphasis> is the maximum number of documents merged
 121                         by the auto-optimization process into a new segment.
 122                     </para>
 123                 </listitem>
 124
 125                 <listitem>
 126                     <para>
 127                         <emphasis>MergeFactor</emphasis> determines how often auto-optimization is
 128                         performed.
 129                     </para>
 130                 </listitem>
 131             </itemizedlist>
 132
 133             <note>
 134                 <para>
 135                     All these options are <classname>Zend_Search_Lucene</classname> object
 136                     properties, not index properties. They affect only the behavior of the current
 137                     <classname>Zend_Search_Lucene</classname> object and may vary between
 138                     different scripts.
 139                 </para>
 140             </note>
 141         </para>
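
             <para>
                 As a minimal sketch, the three options might be set through the corresponding
                 setter methods of the index object. The numeric values below are illustrative
                 assumptions, not recommendations, and <code>$indexPath</code> is assumed to point
                 to an existing index:

                 <programlisting language="php"><![CDATA[
     require_once 'Zend/Search/Lucene.php';

     // $indexPath is assumed to point to an existing index
     $index = Zend_Search_Lucene::open($indexPath);

     // Buffer up to 100 documents in memory before a new segment is written
     $index->setMaxBufferedDocs(100);

     // Keep auto-optimization from building segments of more than 50000 documents
     $index->setMaxMergeDocs(50000);

     // Run auto-optimization less often than a small merge factor would
     $index->setMergeFactor(20);

     // The settings belong to this object only; the getters show what is in effect
     printf(
         "MaxBufferedDocs=%d, MaxMergeDocs=%d, MergeFactor=%d\n",
         $index->getMaxBufferedDocs(),
         $index->getMaxMergeDocs(),
         $index->getMergeFactor()
     );
     ]]></programlisting>

                 Because the values are not stored in the index, every script that writes to the
                 index has to set them again.
             </para>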
 142
 143         <para>
 144             <emphasis>MaxBufferedDocs</emphasis> doesn't have any effect if you index only one
 145             document per script execution. On the other hand, it's very important for batch
 146             indexing. Greater values increase indexing performance, but also require more memory.
 147         </para>
 148
 149         <para>
 150             There is simply no way to calculate the best value for the
 151             <emphasis>MaxBufferedDocs</emphasis> parameter because it depends on average document
 152             size, the analyzer in use and allowed memory.
 153         </para>
 154
 155         <para>
 156             A good way to find the right value is to perform several tests with the largest document
 157             you expect to be added to the index
 158
 159             <footnote>
 160                 <para>
 161                     <methodname>memory_get_usage()</methodname> and
 162                     <methodname>memory_get_peak_usage()</methodname> may be used to monitor memory
 163                     usage.
 164                 </para>
 165             </footnote>.
 166
 167             It's a best practice not to use more than half of the allowed memory.
 168         </para>
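
             <para>
                 Such a test might look like the following sketch. The scratch index path, the
                 <code>$largestDocText</code> variable and the candidate value of 50 are
                 hypothetical and only serve the illustration:

                 <programlisting language="php"><![CDATA[
     require_once 'Zend/Search/Lucene.php';

     // Hypothetical test setup: a scratch index and the candidate value to try
     $index = Zend_Search_Lucene::create('/tmp/test-index');
     $index->setMaxBufferedDocs(50);

     // $largestDocText is assumed to hold the largest text you expect to index
     for ($i = 0; $i < 50; $i++) {
         $doc = new Zend_Search_Lucene_Document();
         $doc->addField(
             Zend_Search_Lucene_Field::UnStored('contents', $largestDocText, 'utf-8')
         );
         $index->addDocument($doc);
     }
     $index->commit();

     // Compare the peak usage with the configured memory limit
     printf(
         "Peak memory: %.1f MB (memory_limit = %s)\n",
         memory_get_peak_usage() / 1048576,
         ini_get('memory_limit')
     );
     ]]></programlisting>

                 Repeating the run with different values shows where the peak crosses half of the
                 allowed memory.
             </para>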
 169
 170         <para>
 171             <emphasis>MaxMergeDocs</emphasis> limits the segment size (in terms of documents). It
 172             therefore also limits auto-optimization time by guaranteeing that a single
 173             <methodname>addDocument()</methodname> call never has to merge more than a certain
 174             number of documents. This is very important for interactive applications.
 175         </para>
 176
 177         <para>
 178             Lowering the <emphasis>MaxMergeDocs</emphasis> parameter may also improve batch indexing
 179             performance. Index auto-optimization is an iterative process and is performed from the
 180             bottom up. Small segments are merged into larger segments, which are in turn merged into
 181             even larger segments and so on. Full index optimization is achieved when only one large
 182             segment file remains.
 183         </para>
 184
 185         <para>
 186             Small segments generally decrease index quality. Many small segments may also trigger
 187             the "Too many open files" error, which is caused by operating system limits
 188
 189             <footnote>
 190                 <para>
 191                     <classname>Zend_Search_Lucene</classname> keeps each segment file open to
 192                     improve search performance.
 193                 </para>
 194             </footnote>.
 195         </para>
 196
 197         <para>
 198             In general, background index optimization should be performed for interactive indexing
 199             mode, and <emphasis>MaxMergeDocs</emphasis> shouldn't be too low for batch indexing.
 200         </para>
 201
 202         <para>
 203             <emphasis>MergeFactor</emphasis> affects auto-optimization frequency. Lower values
 204             increase the quality of unoptimized indexes. Larger values increase indexing
 205             performance, but also increase the number of merged segments. This again may trigger the
 206             "Too many open files" error.
 207         </para>
 208
 209         <para>
 210             <emphasis>MergeFactor</emphasis> groups index segments by their size:
 211
 212             <orderedlist>
 213                 <listitem>
 214                     <para>Not greater than <emphasis>MaxBufferedDocs</emphasis>.</para>
 215                 </listitem>
 216
 217                 <listitem>
 218                     <para>
 219                         Greater than <emphasis>MaxBufferedDocs</emphasis>, but not greater than
 220                         <emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis>.
 221                     </para>
 222                 </listitem>
 223
 224                 <listitem>
 225                     <para>
 226                         Greater than
 227                         <emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis>, but
 228                         not greater than
 229                         <emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis>*<emphasis>MergeFactor</emphasis>.
 230                     </para>
 231                 </listitem>
 232
 233                 <listitem><para>...</para></listitem>
 234             </orderedlist>
 235         </para>
 236
 237         <para>
 238             <classname>Zend_Search_Lucene</classname> checks during each
 239             <methodname>addDocument()</methodname> call to see if merging any segments may move the
 240             newly created segment into the next group. If yes, then merging is performed.
 241         </para>
 242
 243         <para>
 244             So an index with N groups may contain <emphasis>MaxBufferedDocs</emphasis> +
 245             (N-1)*<emphasis>MergeFactor</emphasis> segments and contains at least
 246             <emphasis>MaxBufferedDocs</emphasis>*<emphasis>MergeFactor</emphasis><superscript>(N-1)</superscript>
 247             documents.
 248         </para>
 249
 250         <para>
 251             This gives a good approximation for the number of segments in the index:
 252         </para>
 253         <para>
 254             <emphasis>NumberOfSegments</emphasis> &lt;= <emphasis>MaxBufferedDocs</emphasis> +
 255             <emphasis>MergeFactor</emphasis>*log
 256             <subscript><emphasis>MergeFactor</emphasis></subscript>
 257             (<emphasis>NumberOfDocuments</emphasis>/<emphasis>MaxBufferedDocs</emphasis>)
 258         </para>
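
             <para>
                 The approximation can be evaluated with a few lines of plain
                 <acronym>PHP</acronym>. This is only a sketch: the function name
                 <code>estimateMaxSegments()</code> is made up for the illustration, and the
                 fallback for very small indexes is an assumption of the sketch rather than part
                 of the formula:

                 <programlisting language="php"><![CDATA[
     /**
      * Upper-bound estimate for the number of segments:
      * MaxBufferedDocs + MergeFactor * log_MergeFactor(NumberOfDocuments / MaxBufferedDocs)
      */
     function estimateMaxSegments($numberOfDocuments, $maxBufferedDocs, $mergeFactor)
     {
         if ($numberOfDocuments <= $maxBufferedDocs) {
             // The logarithm would not be positive here; fall back to
             // MaxBufferedDocs (an assumption of this sketch)
             return (float) $maxBufferedDocs;
         }

         return $maxBufferedDocs
              + $mergeFactor * log($numberOfDocuments / $maxBufferedDocs, $mergeFactor);
     }

     // Compare two merge factors for one million documents and MaxBufferedDocs = 10
     foreach (array(10, 50) as $mergeFactor) {
         printf(
             "MergeFactor %d: at most about %d segments\n",
             $mergeFactor,
             ceil(estimateMaxSegments(1000000, 10, $mergeFactor))
         );
     }
     ]]></programlisting>
             </para>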
 259
 260         <para>
 261             <emphasis>MaxBufferedDocs</emphasis> is determined by allowed memory. With that value
 262             fixed, an appropriate merge factor can be chosen to get a reasonable number of segments.
 263         </para>
 264
 265         <para>
 266             Tuning the <emphasis>MergeFactor</emphasis> parameter is more effective for batch
 267             indexing performance than <emphasis>MaxMergeDocs</emphasis>. But it's also more
 268             coarse-grained. So use the estimation above for tuning <emphasis>MergeFactor</emphasis>,
 269             then adjust <emphasis>MaxMergeDocs</emphasis> to get the best batch indexing performance.
 270         </para>
 271     </sect2>
 272
 273     <sect2 id="zend.search.lucene.best-practice.shutting-down">
 274         <title>Index during Shut Down</title>
 275
 276         <para>
 277             The <classname>Zend_Search_Lucene</classname> instance performs some work at exit time
 278             if any documents were added to the index but not written to a new segment.
 279         </para>
 280
 281         <para>
 282             It may also trigger an auto-optimization process.
 283         </para>
 284
 285         <para>
 286             The index object is automatically closed when it, and all returned QueryHit objects, go
 287             out of scope.
 288         </para>
 289
 290         <para>
 291             If the index object is stored in a global variable, then it's closed only at the end of
 292             script execution
 293
 294             <footnote>
 295                 <para>
 296                     This also may occur if the index or QueryHit instances are referred to in some
 297                     cyclical data structures, because <acronym>PHP</acronym> garbage collects
 298                     objects with cyclic references only at the end of script execution.
 299                 </para>
 300             </footnote>.
 301         </para>
 302
 303         <para>
 304             <acronym>PHP</acronym> exception processing is also shut down at this moment.
 305         </para>
 306
 307         <para>
 308             This doesn't prevent the normal index shutdown process, but may prevent accurate error
 309             diagnostics if any error occurs during shutdown.
 310         </para>
 311
 312         <para>
 313             There are two ways to avoid this problem.
 314         </para>
 315
 316         <para>
 317             The first is to force going out of scope:
 318
 319             <programlisting language="php"><![CDATA[
 320 $index = Zend_Search_Lucene::open($indexPath);
 321
 322 ...
 323
 324 unset($index);
 325 ]]></programlisting>
 326         </para>
 327
 328         <para>
 329             And the second is to perform a commit operation before the end of script execution:
 330
 331             <programlisting language="php"><![CDATA[
 332 $index = Zend_Search_Lucene::open($indexPath);
 333
 334 $index->commit();
 335 ]]></programlisting>
 336
 337             This possibility is also described in the "<link
 338                 linkend="zend.search.lucene.advanced.static">Advanced. Using index as static
 339                 property</link>" section.
 340         </para>
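
             <para>
                 Both approaches can be combined with ordinary exception handling, so that errors
                 are reported while <acronym>PHP</acronym> can still process them. The following
                 is a sketch only; reporting through <methodname>error_log()</methodname> is just
                 one possible choice:

                 <programlisting language="php"><![CDATA[
     require_once 'Zend/Search/Lucene.php';

     $index = Zend_Search_Lucene::open($indexPath);

     try {
         // ... add or delete documents here ...

         // Write pending changes while exceptions can still be handled
         $index->commit();
     } catch (Zend_Search_Lucene_Exception $e) {
         // error_log() is just one possible way to report the problem
         error_log('Index update failed: ' . $e->getMessage());
     }

     // Release the index explicitly instead of waiting for script shutdown
     unset($index);
     ]]></programlisting>
             </para>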
 341     </sect2>
 342
 343     <sect2 id="zend.search.lucene.best-practice.unique-id">
 344         <title>Retrieving documents by unique id</title>
 345
 346         <para>
 347             It's a common practice to store some unique document id in the index. Examples include
 348             a URL, a path, or a database id.
 349         </para>
 350
 351         <para>
 352             <classname>Zend_Search_Lucene</classname> provides a <methodname>termDocs()</methodname>
 353             method for retrieving documents containing specified terms.
 354         </para>
 355
 356         <para>
 357             This is more efficient than using the <methodname>find()</methodname> method:
 358
 359             <programlisting language="php"><![CDATA[
 360 // Retrieving documents with find() method using a query string
 361 $query = $idFieldName . ':' . $docId;
 362 $hits  = $index->find($query);
 363 foreach ($hits as $hit) {
 364     $title    = $hit->title;
 365     $contents = $hit->contents;
 366     ...
 367 }
 368 ...
 369
 370 // Retrieving documents with find() method using the query API
 371 $term = new Zend_Search_Lucene_Index_Term($docId, $idFieldName);
 372 $query = new Zend_Search_Lucene_Search_Query_Term($term);
 373 $hits  = $index->find($query);
 374 foreach ($hits as $hit) {
 375     $title    = $hit->title;
 376     $contents = $hit->contents;
 377     ...
 378 }
 379
 380 ...
 381
 382 // Retrieving documents with termDocs() method
 383 $term = new Zend_Search_Lucene_Index_Term($docId, $idFieldName);
 384 $docIds  = $index->termDocs($term);
 385 foreach ($docIds as $id) {
 386     $doc = $index->getDocument($id);
 387     $title    = $doc->title;
 388     $contents = $doc->contents;
 389     ...
 390 }
 391 ]]></programlisting>
 392         </para>
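
             <para>
                 A typical use of this lookup is replacing a document that has changed. The sketch
                 below assumes that <code>$index</code>, <code>$docId</code>, <code>$title</code>
                 and <code>$contents</code> are already defined, and it uses the hypothetical
                 field name <code>doc_id</code> for the unique id:

                 <programlisting language="php"><![CDATA[
     require_once 'Zend/Search/Lucene.php';

     // 'doc_id' is a hypothetical name for the unique id field
     $term = new Zend_Search_Lucene_Index_Term($docId, 'doc_id');

     // Delete every indexed version of the document
     foreach ($index->termDocs($term) as $id) {
         $index->delete($id);
     }

     // Add the new version; a Keyword field is stored and indexed without tokenizing
     $doc = new Zend_Search_Lucene_Document();
     $doc->addField(Zend_Search_Lucene_Field::Keyword('doc_id', $docId));
     $doc->addField(Zend_Search_Lucene_Field::Text('title', $title, 'utf-8'));
     $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $contents, 'utf-8'));
     $index->addDocument($doc);

     $index->commit();
     ]]></programlisting>
             </para>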
 393     </sect2>
 394
 395     <sect2 id="zend.search.lucene.best-practice.memory-usage">
 396         <title>Memory Usage</title>
 397
 398         <para>
 399             <classname>Zend_Search_Lucene</classname> is a relatively memory-intensive module.
 400         </para>
 401
 402         <para>
 403             It uses memory to cache some information and optimize searching and indexing
 404             performance.
 405         </para>
 406
 407         <para>
 408             The memory required differs for different modes.
 409         </para>
 410
 411         <para>
 412             The terms dictionary index is loaded during the search. It actually contains every
 413             128<superscript>th</superscript>
 414
 415             <footnote>
 416                 <para>
 417                     The Lucene file format allows you to configure this number, but
 418                     <classname>Zend_Search_Lucene</classname> doesn't expose this in its
 419                     <acronym>API</acronym>. Nevertheless you still have the ability to configure
 420                     this value if the index is prepared with another Lucene implementation.
 421                 </para>
 422             </footnote>
 423
 424             term of the full dictionary.
 425         </para>
 426
 427         <para>
 428             Thus memory usage is increased if you have a high number of unique terms. This may
 429             happen if you use untokenized phrases as field values or index a large volume of
 430             non-text information.
 431         </para>
 432
 433         <para>
 434             An unoptimized index consists of several segments. This also increases memory usage.
 435             Segments are independent, so each segment contains its own terms dictionary and terms
 436             dictionary index. If an index consists of <emphasis>N</emphasis> segments, it may
 437             increase memory usage by <emphasis>N</emphasis> times in the worst case. Perform index
 438             optimization to merge all segments into one to avoid such memory consumption.
 439         </para>
 440
 441         <para>
 442             Indexing uses the same memory as searching plus memory for buffering documents. The
 443             amount of memory used may be managed with the <emphasis>MaxBufferedDocs</emphasis>
 444             parameter.
 445         </para>
 446
 447         <para>
 448             Index optimization (full or partial) uses stream-style data processing and doesn't
 449             require a lot of memory.
 450         </para>
 451     </sect2>
 452
 453     <sect2 id="zend.search.lucene.best-practice.encoding">
 454         <title>Encoding</title>
 455
 456         <para>
 457             <classname>Zend_Search_Lucene</classname> works with UTF-8 strings internally. So all
 458             strings returned by <classname>Zend_Search_Lucene</classname> are UTF-8 encoded.
 459         </para>
 460
 461         <para>
 462             You shouldn't be concerned with encoding if you work with pure <acronym>ASCII</acronym>
 463             data, but you should be careful if this is not the case.
 464         </para>
 465
 466         <para>
 467             An incorrect encoding may cause error notices during encoding conversion, or loss of data.
 468         </para>
 469
 470         <para>
 471             <classname>Zend_Search_Lucene</classname> offers a wide range of encoding possibilities
 472             for indexed documents and parsed queries.
 473         </para>
 474
 475         <para>
 476             Encoding may be explicitly specified as an optional parameter of field creation methods:
 477
 478             <programlisting language="php"><![CDATA[
 479 $doc = new Zend_Search_Lucene_Document();
 480 $doc->addField(Zend_Search_Lucene_Field::Text('title',
 481                                               $title,
 482                                               'iso-8859-1'));
 483 $doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
 484                                                   $contents,
 485                                                   'utf-8'));
 486 ]]></programlisting>
 487
 488             This is the best way to avoid ambiguity in the encoding used.
 489         </para>
 490
 491         <para>
 492             If the optional encoding parameter is omitted, then the current locale is used. The current
 493             locale may contain character encoding data in addition to the language specification:
 494
 495             <programlisting language="php"><![CDATA[
 496 setlocale(LC_ALL, 'fr_FR');
 497 ...
 498
 499 setlocale(LC_ALL, 'de_DE.iso-8859-1');
 500 ...
 501
 502 setlocale(LC_ALL, 'ru_RU.UTF-8');
 503 ...
 504 ]]></programlisting>
 505         </para>
 506
 507         <para>
 508             The same approach is used to set query string encoding.
 509         </para>
 510
 511         <para>
 512             If encoding is not specified, then the current locale is used to determine the encoding.
 513         </para>
 514
 515         <para>
 516             Encoding may be passed as an optional parameter, if the query is parsed explicitly
 517             before search:
 518
 519             <programlisting language="php"><![CDATA[
 520 $query =
 521     Zend_Search_Lucene_Search_QueryParser::parse($queryStr, 'iso-8859-5');
 522 $hits = $index->find($query);
 523 ...
 524 ]]></programlisting>
 525         </para>
 526
 527         <para>
 528             The default encoding may also be specified with the
 529             <methodname>setDefaultEncoding()</methodname> method:
 530
 531             <programlisting language="php"><![CDATA[
 532 Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('iso-8859-1');
 533 $hits = $index->find($queryStr);
 534 ...
 535 ]]></programlisting>
 536
 537             The empty string implies 'current locale'.
 538         </para>
 539
 540         <para>
 541             If the correct encoding is specified, the text can be processed correctly by the analyzer. The
 542             actual behavior depends on which analyzer is used. See the <link
 543                 linkend="zend.search.lucene.charset">Character Set</link> documentation section for
 544             details.
 545         </para>
 546     </sect2>
 547
 548     <sect2 id="zend.search.lucene.best-practice.maintenance">
 549         <title>Index maintenance</title>
 550
 551         <para>
 552             It should be clear that <classname>Zend_Search_Lucene</classname>, like any other
 553             Lucene implementation, is not a "database".
 554         </para>
 555
 556         <para>
 557             Indexes should not be used for data storage. They do not provide partial backup/restore
 558             functionality, journaling, logging, transactions and many other features associated with
 559             database management systems.
 560         </para>
 561
 562         <para>
 563             Nevertheless, <classname>Zend_Search_Lucene</classname> attempts to keep indexes in a
 564             consistent state at all times.
 565         </para>
 566
 567         <para>
 568             Index backup and restoration should be performed by copying the contents of the index
 569             folder.
 570         </para>
 571
 572         <para>
 573             If index corruption occurs for any reason, the corrupted index should be restored or
 574             completely rebuilt.
 575         </para>
 576
 577         <para>
 578             So it's a good idea to back up large indexes and store changelogs to perform manual
 579             restoration and roll-forward operations if necessary. This practice dramatically reduces
 580             index restoration time.
 581         </para>
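
             <para>
                 Copying the index folder can be scripted in plain <acronym>PHP</acronym>. The
                 helper below is a sketch with a made-up name,
                 <code>backupIndexFolder()</code>. It assumes a flat index folder and that no
                 process is writing to the index while the copy runs:

                 <programlisting language="php"><![CDATA[
     /**
      * Copies all files from an index folder into a backup folder.
      * Assumes a flat folder and that no process is writing to the index.
      * Returns the number of copied files.
      */
     function backupIndexFolder($indexPath, $backupPath)
     {
         if (!is_dir($backupPath) && !mkdir($backupPath, 0777, true)) {
             throw new RuntimeException("Cannot create '$backupPath'");
         }

         $copied = 0;
         foreach (scandir($indexPath) as $file) {
             $source = $indexPath . DIRECTORY_SEPARATOR . $file;
             if (!is_file($source)) {
                 continue; // skips '.', '..' and sub-folders
             }
             if (!copy($source, $backupPath . DIRECTORY_SEPARATOR . $file)) {
                 throw new RuntimeException("Cannot copy '$file'");
             }
             $copied++;
         }

         return $copied;
     }
     ]]></programlisting>

                 Restoration is the same operation with the two paths swapped, performed on an
                 empty target folder.
             </para>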
 582     </sect2>
 583 </sect1>