documentation/manual/en/tutorials/lucene-indexing.xml

<?xml version="1.0" encoding="UTF-8"?>
<!-- Reviewed: no -->
<sect1 id="learning.lucene.indexing">
    <title>Indexing</title>

    <para>
        Indexing is performed by adding a new document to an existing or new index:
    </para>

    <programlisting language="php"><![CDATA[
$index->addDocument($doc);
]]></programlisting>
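
    <para>
        The <varname>$index</varname> object used above has to be obtained first. The following
        is a minimal sketch, not the only way to do it: it assumes the index lives in a directory
        named by the illustrative variable <varname>$indexPath</varname>, opens the index if one
        exists there, and creates a new one otherwise.
    </para>

    <example id="learning.lucene.indexing.index-opening">
        <title>Opening or Creating an Index (sketch)</title>

        <programlisting language="php"><![CDATA[
require_once 'Zend/Search/Lucene.php';

// Assumption: any writable directory will do; this path is only an illustration.
$indexPath = sys_get_temp_dir() . '/example-index';

try {
    // Open the index if one already exists in the directory...
    $index = Zend_Search_Lucene::open($indexPath);
} catch (Zend_Search_Lucene_Exception $e) {
    // ...otherwise create a new, empty index there.
    $index = Zend_Search_Lucene::create($indexPath);
}
]]></programlisting>
    </example>

    <para>
        Note that this sketch treats every failure to open the index as "no index yet". An
        application that must never replace an existing index may prefer to check the
        directory explicitly before calling <methodname>create()</methodname>.
    </para>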

    <para>
        There are two ways to create a document object. The first is to construct it manually:
    </para>

    <example id="learning.lucene.indexing.doc-creation">
        <title>Manual Document Construction</title>

        <programlisting language="php"><![CDATA[
$doc = new Zend_Search_Lucene_Document();
// Text fields are tokenized, indexed, and stored with the document.
$doc->addField(Zend_Search_Lucene_Field::text('url', $docUrl));
$doc->addField(Zend_Search_Lucene_Field::text('title', $docTitle));
// UnStored fields are tokenized and indexed, but not stored.
$doc->addField(Zend_Search_Lucene_Field::unStored('contents', $docBody));
// Binary fields are stored, but neither tokenized nor indexed.
$doc->addField(Zend_Search_Lucene_Field::binary('avatar', $avatarData));
]]></programlisting>
    </example>

    <para>
        The second is to load it from <acronym>HTML</acronym> or Microsoft Office 2007 files:
    </para>

    <example id="learning.lucene.indexing.doc-loading">
        <title>Document Loading</title>

        <programlisting language="php"><![CDATA[
// Each loader returns a document object; use the one that matches the source format.
$doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
$doc = Zend_Search_Lucene_Document_Docx::loadDocxFile($path);
$doc = Zend_Search_Lucene_Document_Pptx::loadPptxFile($path);
$doc = Zend_Search_Lucene_Document_Xlsx::loadXlsxFile($path);
]]></programlisting>
    </example>

    <para>
        A document loaded from one of the supported formats can still be extended manually
        with new user-defined fields.
    </para>
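
    <para>
        As an illustration of extending a loaded document, the sketch below loads a small
        <acronym>HTML</acronym> string and then adds two fields by hand. The field names
        <emphasis>url</emphasis> and <emphasis>fetched</emphasis>, as well as the sample values,
        are assumptions made for this example only.
    </para>

    <example id="learning.lucene.indexing.doc-extending">
        <title>Extending a Loaded Document (sketch)</title>

        <programlisting language="php"><![CDATA[
require_once 'Zend/Search/Lucene.php';
require_once 'Zend/Search/Lucene/Document/Html.php';

// Illustrative input; in practice these come from your application.
$htmlString = '<html><head><title>Example page</title></head>'
            . '<body><p>Some searchable text.</p></body></html>';
$docUrl     = 'http://www.example.com/page.html';

$doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);

// Keyword fields are stored and indexed as a single term, without tokenization.
$doc->addField(Zend_Search_Lucene_Field::keyword('url', $docUrl));
// UnIndexed fields are stored with the document but are not searchable.
$doc->addField(Zend_Search_Lucene_Field::unIndexed('fetched', date('c')));
]]></programlisting>
    </example>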

    <sect2 id="learning.lucene.indexing.policy">
        <title>Indexing Policy</title>

        <para>
            Define your indexing policy as part of your application's architectural design.
        </para>

        <para>
            You may need an on-demand indexing configuration (similar to an
            <acronym>OLTP</acronym> system). Such systems usually add one document per user
            request, so the <emphasis>MaxBufferedDocs</emphasis> option has no effect. On the
            other hand, <emphasis>MaxMergeDocs</emphasis> is really helpful because it lets you
            limit the maximum script execution time. <emphasis>MergeFactor</emphasis> should be
            set to a value that balances the average indexing time (which is also affected by the
            average auto-optimization time) against search performance (the index optimization
            level depends on the number of segments).
        </para>
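
        <para>
            An on-demand configuration along these lines might look like the following sketch.
            The numeric values are illustrative assumptions, not recommendations, and
            <varname>$indexPath</varname> and <varname>$doc</varname> are assumed to be provided
            by the application.
        </para>

        <example id="learning.lucene.indexing.policy-on-demand">
            <title>On-demand Indexing Configuration (sketch)</title>

            <programlisting language="php"><![CDATA[
$index = Zend_Search_Lucene::open($indexPath);

// Upper bound on the number of documents merged into one segment by addDocument().
// A small value limits how long a single request can be held up by auto-optimization.
$index->setMaxMergeDocs(10000);

// A smaller merge factor keeps the segment count low (faster searches on an
// unoptimized index) at the price of more frequent merging (slower indexing).
$index->setMergeFactor(10);

// One document per request.
$index->addDocument($doc);
]]></programlisting>
        </example>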

        <para>
            If you primarily perform batch index updates, set the
            <emphasis>MaxBufferedDocs</emphasis> option to the maximum value that the
            available memory allows. <emphasis>MaxMergeDocs</emphasis> and
            <emphasis>MergeFactor</emphasis> should be set to values that reduce auto-optimization
            involvement as much as possible <footnote><para>An additional limit is the maximum
                    number of file handles that the operating system allows to be open
                    concurrently.</para></footnote>. Full index optimization should be applied
            after indexing.
        </para>

        <example id="learning.lucene.indexing.optimization">
            <title>Index Optimization</title>

            <programlisting language="php"><![CDATA[
$index->optimize();
]]></programlisting>
        </example>
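
        <para>
            The batch policy described above can be sketched as a small helper. The function
            name <methodname>indexBatch()</methodname> and the numeric values are assumptions
            made for illustration; the helper only relies on the index methods
            <methodname>setMaxBufferedDocs()</methodname>,
            <methodname>setMergeFactor()</methodname>, <methodname>addDocument()</methodname>,
            <methodname>commit()</methodname> and <methodname>optimize()</methodname>.
        </para>

        <example id="learning.lucene.indexing.policy-batch">
            <title>Batch Indexing (sketch)</title>

            <programlisting language="php"><![CDATA[
/**
 * Adds all given documents to the index, then optimizes it once.
 *
 * @param  object            $index     An opened index, e.g. from Zend_Search_Lucene::open()
 * @param  array|Traversable $documents Zend_Search_Lucene_Document objects
 * @return integer Number of documents added
 */
function indexBatch($index, $documents)
{
    // Illustrative values: buffer many documents in memory before a segment
    // is written, and make automatic segment merges rare during the batch.
    $index->setMaxBufferedDocs(1000);
    $index->setMergeFactor(100);

    $count = 0;
    foreach ($documents as $doc) {
        $index->addDocument($doc);
        $count++;
    }

    // Flush the buffered documents, then merge all segments in one pass.
    $index->commit();
    $index->optimize();

    return $count;
}
]]></programlisting>
        </example>

        <para>
            A larger buffer needs more memory, and a larger merge factor leaves more segment
            files open at once, so both values have to stay within the limits of the host.
        </para>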

        <para>
            In some configurations, it is more effective to serialize index updates by placing
            update requests in a queue and processing several of them in a single script
            execution. This reduces the overhead of opening the index and lets the updates
            benefit from document buffering.
        </para>
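
        <para>
            One possible shape for such a queue worker is sketched below. The function name
            <methodname>processUpdateQueue()</methodname> is hypothetical, and the
            <classname>SplQueue</classname> is only an in-memory stand-in: a real application
            would need a queue that persists between requests, such as a database table or a
            message queue.
        </para>

        <example id="learning.lucene.indexing.policy-queue">
            <title>Processing Queued Updates (sketch)</title>

            <programlisting language="php"><![CDATA[
/**
 * Applies up to $limit queued documents to the index in a single pass.
 *
 * @param  object   $index An opened index providing addDocument() and commit()
 * @param  SplQueue $queue Pending Zend_Search_Lucene_Document objects
 * @param  integer  $limit Maximum number of documents to process per run
 * @return integer Number of documents indexed
 */
function processUpdateQueue($index, SplQueue $queue, $limit = 100)
{
    $processed = 0;

    while ($processed < $limit && !$queue->isEmpty()) {
        $index->addDocument($queue->dequeue());
        $processed++;
    }

    if ($processed > 0) {
        // Write the buffered documents once for the whole batch.
        $index->commit();
    }

    return $processed;
}
]]></programlisting>
        </example>

        <para>
            A worker script would open the index once, call this function, and exit. The
            <varname>$limit</varname> argument bounds the work done per execution, so the
            remaining requests simply wait for the next run.
        </para>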
    </sect2>
</sect1>