1 <?xml version="1.0" encoding="UTF-8"?>
3 <sect1 id="zend.search.lucene.overview">
4 <title>Overview</title>
6 <sect2 id="zend.search.lucene.introduction">
7 <title>Introduction</title>
10 <classname>Zend_Search_Lucene</classname> is a general purpose text search engine
11 written entirely in <acronym>PHP</acronym> 5. Since it stores its index on the
12 filesystem and does not require a database server, it can add search capabilities to
13 almost any <acronym>PHP</acronym>-driven website.
14 <classname>Zend_Search_Lucene</classname> supports the following features:
18 <para>Ranked searching - best results returned first</para>
23 Many powerful query types: phrase queries, boolean queries, wildcard queries,
24 proximity queries, range queries and many others.
29 <para>Search by specific field (e.g., title, author, contents)</para>
33 <classname>Zend_Search_Lucene</classname> was derived from the Apache Lucene project.
As of ZF 1.6, the supported Lucene index format versions are 1.4 through 2.3.
For more information on Lucene, visit <ulink
url="http://lucene.apache.org/java/docs/"/>.
43 Previous <classname>Zend_Search_Lucene</classname> implementations support the
44 Lucene 1.4 (1.9) - 2.1 index formats.
Starting with Zend Framework 1.5, any index created in a pre-2.1 index format is
automatically upgraded to the Lucene 2.1 format after the
<classname>Zend_Search_Lucene</classname> update and is no longer compatible with
<classname>Zend_Search_Lucene</classname> implementations included in Zend
57 <sect2 id="zend.search.lucene.index-creation.documents-and-fields">
58 <title>Document and Field Objects</title>
<classname>Zend_Search_Lucene</classname> treats documents as the atomic units of
indexing. A document is divided into named fields, and fields have content that can be
searched.
A document is represented by the <classname>Zend_Search_Lucene_Document</classname>
class, and objects of this class contain instances of
<classname>Zend_Search_Lucene_Field</classname> that represent the fields of the
document.
74 It is important to note that any information can be added to the index.
75 Application-specific information or metadata can be stored in the document
76 fields, and later retrieved with the document during search.
80 It is the responsibility of your application to control the indexer.
81 This means that data can be indexed from any source
82 that is accessible by your application. For example, this could be the
83 filesystem, a database, an <acronym>HTML</acronym> form, etc.
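<para>
    As a minimal sketch of the two points above, the following example indexes records
    from an in-memory array (standing in for any data source) and reads the stored
    fields back from the search hits. The index location, the field names and the sample
    records are assumptions made for illustration only.
</para>

<programlisting language="php"><![CDATA[
require_once 'Zend/Search/Lucene.php';

// Hypothetical sample data; in practice this could come from files, a database, etc.
$records = array(
    array('url' => '/articles/1', 'title' => 'Indexing basics',
          'text' => 'How to build a search index'),
    array('url' => '/articles/2', 'title' => 'Query syntax',
          'text' => 'Phrase and wildcard queries'),
);

// Hypothetical index location (the directory must be writable).
$index = Zend_Search_Lucene::create(sys_get_temp_dir() . '/zsl-example-index');

foreach ($records as $record) {
    $doc = new Zend_Search_Lucene_Document();
    // Application-specific metadata: stored, but not searchable.
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', $record['url']));
    // Searchable and stored.
    $doc->addField(Zend_Search_Lucene_Field::Text('title', $record['title']));
    // Searchable, but not stored.
    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $record['text']));
    $index->addDocument($doc);
}
$index->commit();

// Stored fields are available as properties of each hit.
foreach ($index->find('wildcard') as $hit) {
    echo $hit->title, ' => ', $hit->url, "\n";
}
]]></programlisting>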
The <classname>Zend_Search_Lucene_Field</classname> class provides several static
methods to create fields with different characteristics:
<programlisting language="php"><![CDATA[
$doc = new Zend_Search_Lucene_Document();

// Field is not tokenized, but is indexed and stored within the index.
// Stored fields can be retrieved from the index.
$doc->addField(Zend_Search_Lucene_Field::Keyword('doctype',
                                                 'autogenerated')); // example value

// Field is neither tokenized nor indexed, but is stored in the index.
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('created',
                                                   time())); // example value

// Binary string valued field that is neither tokenized nor indexed,
// but is stored in the index. $iconData is assumed to hold the raw bytes.
$doc->addField(Zend_Search_Lucene_Field::Binary('icon',
                                                $iconData));

// Field is tokenized and indexed, and is stored in the index.
$doc->addField(Zend_Search_Lucene_Field::Text('annotation',
                                              'Document annotation text'));

// Field is tokenized and indexed, but is not stored in the index.
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
                                                  'My document content'));
]]></programlisting>
118 Each of these methods (excluding the
119 <methodname>Zend_Search_Lucene_Field::Binary()</methodname> method) has an optional
120 <varname>$encoding</varname> parameter for specifying input data encoding.
Encoding may differ for different documents as well as for different fields within one
document:
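<programlisting language="php"><![CDATA[
$doc = new Zend_Search_Lucene_Document();
// $title, $contents and both encoding names are example values.
$doc->addField(Zend_Search_Lucene_Field::Text('title', $title, 'iso-8859-1'));
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $contents, 'utf-8'));
]]></programlisting>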
If the encoding parameter is omitted, the current locale is used at processing time.
<programlisting language="php"><![CDATA[
setlocale(LC_ALL, 'de_DE.iso-8859-1');

$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $contents));
]]></programlisting>
150 Fields are always stored and returned from the index in UTF-8 encoding. Any required
151 conversion to UTF-8 happens automatically.
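<para>
    What such a conversion means for a single value can be illustrated outside the
    library with <acronym>PHP</acronym>'s <methodname>iconv()</methodname> function.
    This is only a sketch with an assumed sample string; it is not meant to describe how
    <classname>Zend_Search_Lucene</classname> performs the conversion internally.
</para>

<programlisting language="php"><![CDATA[
// "München" encoded as ISO-8859-1: the u-umlaut is the single byte 0xFC.
$latin1 = "M\xFCnchen";

// Convert the value to UTF-8, where the u-umlaut becomes the two bytes 0xC3 0xBC.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

echo strlen($latin1), "\n"; // 7 (bytes)
echo strlen($utf8), "\n";   // 8 (bytes)
echo bin2hex($utf8), "\n";  // 4dc3bc6e6368656e
]]></programlisting>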
Text analyzers (<link linkend="zend.search.lucene.extending.analysis">see below</link>)
may also convert text to other encodings. The default analyzer, for example, converts
text to 'ASCII//TRANSLIT' encoding. Be careful, however: this transliteration may depend
on the current locale.
Field names are defined at your discretion in the <methodname>addField()</methodname>
method.
167 Java Lucene uses the 'contents' field as a default field to search.
168 <classname>Zend_Search_Lucene</classname> searches through all fields by default, but
169 the behavior is configurable. See the <link
170 linkend="zend.search.lucene.query-language.fields">"Default search field"</link>
175 <sect2 id="zend.search.lucene.index-creation.understanding-field-types">
176 <title>Understanding Field Types</title>
181 <code>Keyword</code> fields are stored and indexed, meaning that they can be
182 searched as well as displayed in search results. They are not split up into
183 separate words by tokenization. Enumerated database fields usually translate
184 well to Keyword fields in <classname>Zend_Search_Lucene</classname>.
190 <code>UnIndexed</code> fields are not searchable, but they are returned with
191 search hits. Database timestamps, primary keys, file system paths, and other
192 external identifiers are good candidates for UnIndexed fields.
198 <code>Binary</code> fields are not tokenized or indexed, but are stored for
199 retrieval with search hits. They can be used to store any data encoded as a
200 binary string, such as an image icon.
206 <code>Text</code> fields are stored, indexed, and tokenized. Text fields are
207 appropriate for storing information like subjects and titles that need to be
208 searchable as well as returned with search results.
214 <code>UnStored</code> fields are tokenized and indexed, but not stored in the
215 index. Large amounts of text are best indexed using this type of field. Storing
216 data creates a larger index on disk, so if you need to search but not redisplay
217 the data, use an UnStored field. UnStored fields are practical when using a
218 <classname>Zend_Search_Lucene</classname> index in combination with a relational
database. You can index large data fields with UnStored fields for searching,
and retrieve them from your relational database by using a separate field as an
identifier.
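<table id="zend.search.lucene.index-creation.understanding-field-types.table">
    <title>Zend_Search_Lucene_Field Types</title>
    <tgroup cols="5">
        <thead>
            <row>
                <entry>Field Type</entry>
                <entry>Stored</entry>
                <entry>Indexed</entry>
                <entry>Tokenized</entry>
                <entry>Binary</entry>
            </row>
        </thead>
        <tbody>
            <row>
                <entry>Keyword</entry>
                <entry>Yes</entry>
                <entry>Yes</entry>
                <entry>No</entry>
                <entry>No</entry>
            </row>
            <row>
                <entry>UnIndexed</entry>
                <entry>Yes</entry>
                <entry>No</entry>
                <entry>No</entry>
                <entry>No</entry>
            </row>
            <row>
                <entry>Binary</entry>
                <entry>Yes</entry>
                <entry>No</entry>
                <entry>No</entry>
                <entry>Yes</entry>
            </row>
            <row>
                <entry>Text</entry>
                <entry>Yes</entry>
                <entry>Yes</entry>
                <entry>Yes</entry>
                <entry>No</entry>
            </row>
            <row>
                <entry>UnStored</entry>
                <entry>No</entry>
                <entry>Yes</entry>
                <entry>Yes</entry>
                <entry>No</entry>
            </row>
        </tbody>
    </tgroup>
</table>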
285 <sect2 id="zend.search.lucene.index-creation.html-documents">
286 <title>HTML documents</title>
<classname>Zend_Search_Lucene</classname> offers an <acronym>HTML</acronym> parsing
feature. Documents can be created directly from an <acronym>HTML</acronym> file or
string:
<programlisting language="php"><![CDATA[
$doc = Zend_Search_Lucene_Document_Html::loadHTMLFile($filename);
$index->addDocument($doc);

$doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
$index->addDocument($doc);
]]></programlisting>
The <classname>Zend_Search_Lucene_Document_Html</classname> class uses the
<methodname>DOMDocument::loadHTML()</methodname> and
<methodname>DOMDocument::loadHTMLFile()</methodname> methods to parse the source
<acronym>HTML</acronym>, so the <acronym>HTML</acronym> does not need to be well formed
or to be <acronym>XHTML</acronym>. On the other hand, it is sensitive to the encoding
specified by the "meta http-equiv" header tag.
The <classname>Zend_Search_Lucene_Document_Html</classname> class recognizes the
document title, the body and the document header meta tags.
317 The 'title' field is actually the /html/head/title value. It's stored within the index,
318 tokenized and available for search.
322 The 'body' field is the actual body content of the <acronym>HTML</acronym> file or
323 string. It doesn't include scripts, comments or attributes.
The <methodname>loadHTML()</methodname> and <methodname>loadHTMLFile()</methodname>
methods of the <classname>Zend_Search_Lucene_Document_Html</classname> class also take an
optional second argument. If it is set to <constant>TRUE</constant>, the body content is
also stored within the index and can be retrieved from it. By default, the body is
tokenized and indexed, but not stored.
The third parameter of the <methodname>loadHTML()</methodname> and
<methodname>loadHTMLFile()</methodname> methods optionally specifies the encoding of the
source <acronym>HTML</acronym> document. It is used if the encoding is not specified
with a Content-type HTTP-EQUIV meta tag.
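<para>
    Both optional arguments can be sketched as follows. The sample markup, the index
    location and the ISO-8859-1 fallback encoding are assumptions chosen for
    illustration.
</para>

<programlisting language="php"><![CDATA[
require_once 'Zend/Search/Lucene.php';
require_once 'Zend/Search/Lucene/Document/Html.php';

// Hypothetical page that does not declare its encoding.
$htmlString = '<html><head><title>Search tips</title></head>'
            . '<body><p>Use quotes for phrase queries.</p></body></html>';

// Hypothetical index location (the directory must be writable).
$index = Zend_Search_Lucene::create(sys_get_temp_dir() . '/zsl-example-html-index');

// Second argument: also store the body so it can be read back from hits.
// Third argument: encoding to assume when the HTML does not specify one.
$doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString, true, 'ISO-8859-1');
$index->addDocument($doc);
$index->commit();

foreach ($index->find('phrase') as $hit) {
    echo $hit->title, "\n";
    echo $hit->body, "\n"; // readable only because the body was stored
}
]]></programlisting>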
Other document header meta tags produce additional document fields. The field name is
taken from the 'name' attribute, and the field value from the 'content' attribute.
These fields are tokenized, indexed and stored, so documents may be searched by their
meta tags (for example, by keywords).
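<para>
    A small sketch of searching by a meta tag, assuming a page with a 'keywords' meta
    tag; the markup, the index location and the query term are illustrative only.
</para>

<programlisting language="php"><![CDATA[
require_once 'Zend/Search/Lucene.php';
require_once 'Zend/Search/Lucene/Document/Html.php';

// Hypothetical page with a "keywords" meta tag.
$htmlString = '<html><head><title>Release notes</title>'
            . '<meta name="keywords" content="lucene, indexing">'
            . '</head><body><p>Nothing else here.</p></body></html>';

$doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);

// The meta tag became a field named after its "name" attribute.
echo $doc->getFieldValue('keywords'), "\n";

// Hypothetical index location (the directory must be writable).
$index = Zend_Search_Lucene::create(sys_get_temp_dir() . '/zsl-example-meta-index');
$index->addDocument($doc);
$index->commit();

// Restrict the query to the "keywords" field.
$hits = $index->find('keywords:lucene');
echo count($hits), "\n";
]]></programlisting>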
349 Parsed documents may be augmented by the programmer with any other field:
<programlisting language="php"><![CDATA[
$doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
// time() is used here as an example value for both timestamps.
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('created',
                                                   time()));
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('updated',
                                                   time()));
$doc->addField(Zend_Search_Lucene_Field::Text('annotation',
                                              'Document annotation text'));
$index->addDocument($doc);
]]></programlisting>
364 Document links are not included in the generated document, but may be retrieved with
365 the <methodname>Zend_Search_Lucene_Document_Html::getLinks()</methodname> and
366 <methodname>Zend_Search_Lucene_Document_Html::getHeaderLinks()</methodname> methods:
<programlisting language="php"><![CDATA[
$doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
$linksArray = $doc->getLinks();
$headerLinksArray = $doc->getHeaderLinks();
]]></programlisting>
Starting with Zend Framework 1.6, it is also possible to exclude links whose
<code>rel</code> attribute is set to <code>'nofollow'</code>. Use
<methodname>Zend_Search_Lucene_Document_Html::setExcludeNoFollowLinks(true)</methodname>
to turn on this option.
The <methodname>Zend_Search_Lucene_Document_Html::getExcludeNoFollowLinks()</methodname>
method returns the current state of the "Exclude nofollow links" flag.
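<para>
    A minimal sketch of the flag in use, with assumed sample markup. Because the setting
    is static, it is switched on before the document is parsed and reset afterwards.
</para>

<programlisting language="php"><![CDATA[
require_once 'Zend/Search/Lucene.php';
require_once 'Zend/Search/Lucene/Document/Html.php';

// Hypothetical page with one ordinary link and one "nofollow" link.
$htmlString = '<html><head><title>Links</title></head><body>'
            . '<a href="/docs">Docs</a> '
            . '<a href="/login" rel="nofollow">Login</a>'
            . '</body></html>';

// Turn the option on before parsing the document.
Zend_Search_Lucene_Document_Html::setExcludeNoFollowLinks(true);
$flag = Zend_Search_Lucene_Document_Html::getExcludeNoFollowLinks();

$doc   = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
$links = $doc->getLinks();
print_r($links); // the "/login" link is expected to be left out

// Reset the flag so later documents are parsed without the exclusion.
Zend_Search_Lucene_Document_Html::setExcludeNoFollowLinks(false);
]]></programlisting>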
388 <sect2 id="zend.search.lucene.index-creation.docx-documents">
389 <title>Word 2007 documents</title>
392 <classname>Zend_Search_Lucene</classname> offers a Word 2007 parsing feature. Documents
393 can be created directly from a Word 2007 file:
<programlisting language="php"><![CDATA[
$doc = Zend_Search_Lucene_Document_Docx::loadDocxFile($filename);
$index->addDocument($doc);
]]></programlisting>
The <classname>Zend_Search_Lucene_Document_Docx</classname> class uses the
<code>ZipArchive</code> class and <code>simplexml</code> methods to parse the source
document. If the <code>ZipArchive</code> class (from module php_zip) is not available,
the <classname>Zend_Search_Lucene_Document_Docx</classname> class will not be available
for use with Zend Framework either.
The <classname>Zend_Search_Lucene_Document_Docx</classname> class recognizes document
metadata and document text. Depending on the document contents, the metadata consists of
filename, title, subject, creator, keywords, description, lastModifiedBy, revision,
modified and created.
417 The 'filename' field is the actual Word 2007 file name.
421 The 'title' field is the actual document title.
425 The 'subject' field is the actual document subject.
429 The 'creator' field is the actual document creator.
433 The 'keywords' field contains the actual document keywords.
437 The 'description' field is the actual document description.
The 'lastModifiedBy' field is the name of the user who last modified the actual document.
445 The 'revision' field is the actual document revision number.
449 The 'modified' field is the actual document last modified date / time.
453 The 'created' field is the actual document creation date / time.
The 'body' field is the actual body content of the Word 2007 document. It only includes
normal text; comments and revisions are not included.
The <methodname>loadDocxFile()</methodname> method of the
<classname>Zend_Search_Lucene_Document_Docx</classname> class also takes an optional
second argument. If it is set to <constant>TRUE</constant>, the body content is also
stored within the index and can be retrieved from it. By default, the body is tokenized
and indexed, but not stored.
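<para>
    The following sketch combines the optional second argument with a check for the
    <code>ZipArchive</code> class. The helper function
    <methodname>addDocxToIndex()</methodname> is hypothetical and not part of Zend
    Framework.
</para>

<programlisting language="php"><![CDATA[
require_once 'Zend/Search/Lucene.php';
require_once 'Zend/Search/Lucene/Document/Docx.php';

/**
 * Hypothetical helper: adds a Word 2007 file to $index and stores its body text.
 * Returns false when ZipArchive is missing or the file cannot be read.
 */
function addDocxToIndex($index, $filename)
{
    if (!class_exists('ZipArchive') || !is_readable($filename)) {
        return false;
    }

    // Second argument TRUE: the body is stored as well as indexed.
    $doc = Zend_Search_Lucene_Document_Docx::loadDocxFile($filename, true);
    $index->addDocument($doc);

    return true;
}
]]></programlisting>

<para>
    A call such as <code>addDocxToIndex($index, $filename)</code> then either adds the
    document or reports that it was skipped.
</para>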
470 Parsed documents may be augmented by the programmer with any other field:
<programlisting language="php"><![CDATA[
$doc = Zend_Search_Lucene_Document_Docx::loadDocxFile($filename);
// 'indexTime', time() and 'annotation' are example values.
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('indexTime', time()));
$doc->addField(Zend_Search_Lucene_Field::Text(
    'annotation',
    'Document annotation text')
);
$index->addDocument($doc);
]]></programlisting>
488 <sect2 id="zend.search.lucene.index-creation.pptx-documents">
<title>PowerPoint 2007 documents</title>
<classname>Zend_Search_Lucene</classname> offers a PowerPoint 2007 parsing feature.
Documents can be created directly from a PowerPoint 2007 file:
<programlisting language="php"><![CDATA[
$doc = Zend_Search_Lucene_Document_Pptx::loadPptxFile($filename);
$index->addDocument($doc);
]]></programlisting>
The <classname>Zend_Search_Lucene_Document_Pptx</classname> class uses the
<code>ZipArchive</code> class and <code>simplexml</code> methods to parse the source
document. If the <code>ZipArchive</code> class (from module php_zip) is not available,
the <classname>Zend_Search_Lucene_Document_Pptx</classname> class will not be available
for use with Zend Framework either.
The <classname>Zend_Search_Lucene_Document_Pptx</classname> class recognizes document
metadata and document text. Depending on the document contents, the metadata consists of
filename, title, subject, creator, keywords, description, lastModifiedBy, revision,
modified and created.
The 'filename' field is the actual PowerPoint 2007 file name.
521 The 'title' field is the actual document title.
525 The 'subject' field is the actual document subject.
529 The 'creator' field is the actual document creator.
533 The 'keywords' field contains the actual document keywords.
537 The 'description' field is the actual document description.
The 'lastModifiedBy' field is the name of the user who last modified the actual document.
545 The 'revision' field is the actual document revision number.
549 The 'modified' field is the actual document last modified date / time.
553 The 'created' field is the actual document creation date / time.
The 'body' field is the actual content of all slides and slide notes in the PowerPoint
2007 document.
The <methodname>loadPptxFile()</methodname> method of the
<classname>Zend_Search_Lucene_Document_Pptx</classname> class also takes an optional
second argument. If it is set to <constant>TRUE</constant>, the body content is also
stored within the index and can be retrieved from it. By default, the body is tokenized
and indexed, but not stored.
570 Parsed documents may be augmented by the programmer with any other field:
<programlisting language="php"><![CDATA[
$doc = Zend_Search_Lucene_Document_Pptx::loadPptxFile($filename);
// 'indexTime', time() and 'annotation' are example values.
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('indexTime', time()));
$doc->addField(Zend_Search_Lucene_Field::Text(
    'annotation',
    'Document annotation text'));
$index->addDocument($doc);
]]></programlisting>
585 <sect2 id="zend.search.lucene.index-creation.xlsx-documents">
586 <title>Excel 2007 documents</title>
<classname>Zend_Search_Lucene</classname> offers an Excel 2007 parsing feature. Documents
can be created directly from an Excel 2007 file:
<programlisting language="php"><![CDATA[
$doc = Zend_Search_Lucene_Document_Xlsx::loadXlsxFile($filename);
$index->addDocument($doc);
]]></programlisting>
The <classname>Zend_Search_Lucene_Document_Xlsx</classname> class uses the
<code>ZipArchive</code> class and <code>simplexml</code> methods to parse the source
document. If the <code>ZipArchive</code> class (from module php_zip) is not available,
the <classname>Zend_Search_Lucene_Document_Xlsx</classname> class will not be available
for use with Zend Framework either.
The <classname>Zend_Search_Lucene_Document_Xlsx</classname> class recognizes document
metadata and document text. Depending on the document contents, the metadata consists of
filename, title, subject, creator, keywords, description, lastModifiedBy, revision,
modified and created.
613 The 'filename' field is the actual Excel 2007 file name.
617 The 'title' field is the actual document title.
621 The 'subject' field is the actual document subject.
625 The 'creator' field is the actual document creator.
629 The 'keywords' field contains the actual document keywords.
633 The 'description' field is the actual document description.
The 'lastModifiedBy' field is the name of the user who last modified the actual document.
641 The 'revision' field is the actual document revision number.
645 The 'modified' field is the actual document last modified date / time.
649 The 'created' field is the actual document creation date / time.
The 'body' field is the actual content of all cells in all worksheets of the Excel 2007
document.
The <methodname>loadXlsxFile()</methodname> method of the
<classname>Zend_Search_Lucene_Document_Xlsx</classname> class also takes an optional
second argument. If it is set to <constant>TRUE</constant>, the body content is also
stored within the index and can be retrieved from it. By default, the body is tokenized
and indexed, but not stored.
666 Parsed documents may be augmented by the programmer with any other field:
<programlisting language="php"><![CDATA[
$doc = Zend_Search_Lucene_Document_Xlsx::loadXlsxFile($filename);
// 'indexTime', time() and 'annotation' are example values.
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('indexTime', time()));
$doc->addField(Zend_Search_Lucene_Field::Text(
    'annotation',
    'Document annotation text'));
$index->addDocument($doc);
]]></programlisting>