documentation/manual/en/module_specs/Zend_Search_Lucene-QueryLanguage.xml

   1 <?xml version="1.0" encoding="UTF-8"?>
   2 <!-- Reviewed: no -->
   3 <sect1 id="zend.search.lucene.query-language">
   4     <title>Query Language</title>
   5
   6     <para>
   7         Java Lucene and <classname>Zend_Search_Lucene</classname> provide quite powerful query
   8         languages.
   9     </para>
  10
  11     <para>
  12         These languages are mostly the same with some minor differences, which are mentioned below.
  13     </para>
  14
  15     <para>
  16         Full Java Lucene query language syntax documentation can be found
  17         <ulink url="http://lucene.apache.org/java/2_3_0/queryparsersyntax.html">here</ulink>.
  18     </para>
  19
  20     <sect2 id="zend.search.lucene.query-language.terms">
  21         <title>Terms</title>
  22
  23         <para>
  24             A query is broken up into terms and operators. There are three types of terms: Single
  25             Terms, Phrases, and Subqueries.
  26         </para>
  27
  28         <para>
  29             A Single Term is a single word such as "test" or "hello".
  30         </para>
  31
  32         <para>
  33             A Phrase is a group of words surrounded by double quotes such as "hello dolly".
  34         </para>
  35
  36         <para>
  37             A Subquery is a query surrounded by parentheses such as "(hello dolly)".
  38         </para>
  39
  40         <para>
  41             Multiple terms can be combined together with boolean operators to form complex queries
  42             (see below).
  43         </para>
  44     </sect2>
  45
  46     <sect2 id="zend.search.lucene.query-language.fields">
  47         <title>Fields</title>
  48
  49         <para>
  50             Lucene supports fields of data. When performing a search you can either specify a field,
  51             or use the default field. The field names depend on the indexed data, and the default
  52             field is defined by the current settings.
  53         </para>
  54
  55         <para>
  56             The first and most significant difference from Java Lucene is that terms are searched
  57             through <emphasis>all fields</emphasis> by default.
  58         </para>
  59
  60         <para>
  61             There are two static methods in the <classname>Zend_Search_Lucene</classname> class
  62             which allow the developer to configure these settings:
  63         </para>
  64
  65         <programlisting language="php"><![CDATA[
  66 $defaultSearchField = Zend_Search_Lucene::getDefaultSearchField();
  67 ...
  68 Zend_Search_Lucene::setDefaultSearchField('contents');
  69 ]]></programlisting>
  70
  71         <para>
  72             The <constant>NULL</constant> value indicates that the search is performed across all
  73             fields. This is the default setting.
  74         </para>
  75
  76         <para>
  77             You can search specific fields by typing the field name followed by a colon ":" followed
  78             by the term you are looking for.
  79         </para>
  80
  81         <para>
  82             As an example, let's assume a Lucene index contains two fields, title and text, with
  83             text as the default field. If you want to find the document entitled "The Right Way"
  84             which contains the text "don't go this way", you can enter:
  85         </para>
  86
  87         <programlisting language="querystring"><![CDATA[
  88 title:"The Right Way" AND text:go
  89 ]]></programlisting>
  90
  91         <para>
  92             or
  93         </para>
  94
  95         <programlisting language="querystring"><![CDATA[
  96 title:"The Right Way" AND go
  97 ]]></programlisting>
  98
  99         <para>
 100             Because "text" is the default field, the field indicator is not required.
 101         </para>
 102
 103         <para>
 104             Note: The field is only valid for the term, phrase or subquery that it directly
 105             precedes, so the query
 106         </para>
 107
 108         <programlisting language="querystring"><![CDATA[
 109 title:Do it right
 110 ]]></programlisting>
 111
 112         <para>
 113             will only find "Do" in the title field. It will find "it" and "right" in the default
 114             field (if the default field is set) or in all indexed fields (if the default field is
 115             set to <constant>NULL</constant>).
 116         </para>
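
             <para>
                 As a minimal sketch of how the default search field affects un-prefixed terms,
                 the following code reuses the "title" and "text" fields from the example above.
                 The index path is a placeholder, and the index is assumed to exist already.
             </para>

             <programlisting language="php"><![CDATA[
     require_once 'Zend/Search/Lucene.php';

     // Placeholder index location; adjust it to your application.
     $index = Zend_Search_Lucene::open('/path/to/index');

     // Look up un-prefixed terms in the "text" field only.
     Zend_Search_Lucene::setDefaultSearchField('text');

     // The phrase is searched in "title"; "go" is searched in "text".
     $hits = $index->find('title:"The Right Way" AND go');

     foreach ($hits as $hit) {
         // id is the document number within the index, score is its relevance.
         printf("%d: %.4f\n", $hit->id, $hit->score);
     }

     // Restore the default behavior: search through all fields.
     Zend_Search_Lucene::setDefaultSearchField(null);
     ]]></programlisting>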
 117     </sect2>
 118
 119     <sect2 id="zend.search.lucene.query-language.wildcard">
 120         <title>Wildcards</title>
 121
 122         <para>
 123             Lucene supports single and multiple character wildcard searches within single terms (but
 124             not within phrase queries).
 125         </para>
 126
 127         <para>
 128             To perform a single character wildcard search use the "?" symbol.
 129         </para>
 130
 131         <para>
 132             To perform a multiple character wildcard search use the "*" symbol.
 133         </para>
 134
 135         <para>
 136             The single character wildcard search looks for strings that match the term with the "?"
 137             replaced by any single character. For example, to search for "text" or "test" you can
 138             use the search:
 139         </para>
 140
 141         <programlisting language="querystring"><![CDATA[
 142 te?t
 143 ]]></programlisting>
 144
 145         <para>
 146             Multiple character wildcard searches look for 0 or more characters when matching strings
 147             against terms. For example, to search for test, tests or tester, you can use the search:
 148         </para>
 149
 150         <programlisting language="querystring"><![CDATA[
 151 test*
 152 ]]></programlisting>
 153
 154         <para>
 155             You can use "?", "*" or both anywhere in the term:
 156         </para>
 157
 158         <programlisting language="querystring"><![CDATA[
 159 *wr?t*
 160 ]]></programlisting>
 161
 162         <para>
 163             It searches for "write", "wrote", "written", "rewrite", "rewrote" and so on.
 164         </para>
 165
 166         <para>
 167             Starting from ZF 1.7.7, wildcard patterns need a non-wildcard prefix. The default minimum
 168             prefix length is 3 (as in Java Lucene), so the terms "*", "te?t" and "*wr?t*" will cause
 169             an exception
 170
 171             <footnote>
 172                 <para>
 173                     Please note that it is not a
 174                     <code>Zend_Search_Lucene_Search_QueryParserException</code>, but a
 175                     <code>Zend_Search_Lucene_Exception</code>. It is thrown during the query
 176                     rewrite (execution) operation.
 177                 </para>
 178             </footnote>.
 179         </para>
 180
 181         <para>
 182             The minimum prefix length can be retrieved and changed using the
 183             <code>Zend_Search_Lucene_Search_Query_Wildcard::getMinPrefixLength()</code> and
 184             <code>Zend_Search_Lucene_Search_Query_Wildcard::setMinPrefixLength()</code> methods.
 185         </para>
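
             <para>
                 The following sketch shows one way the prefix requirement might be handled in
                 application code. The index path is a placeholder, and lowering the minimum
                 prefix length to 0 is only an illustration; it allows patterns that may expand to
                 a very large number of terms.
             </para>

             <programlisting language="php"><![CDATA[
     require_once 'Zend/Search/Lucene.php';

     // Placeholder index location; adjust it to your application.
     $index = Zend_Search_Lucene::open('/path/to/index');

     // Remember the current setting so it can be restored afterwards.
     $previous = Zend_Search_Lucene_Search_Query_Wildcard::getMinPrefixLength();

     try {
         // "te?t" has only a two-character non-wildcard prefix.
         $hits = $index->find('te?t');
     } catch (Zend_Search_Lucene_Exception $e) {
         // Relax the requirement and retry the same query.
         Zend_Search_Lucene_Search_Query_Wildcard::setMinPrefixLength(0);
         $hits = $index->find('te?t');
     }

     // Put the original minimum prefix length back.
     Zend_Search_Lucene_Search_Query_Wildcard::setMinPrefixLength($previous);
     ]]></programlisting>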
 186     </sect2>
 187
 188     <sect2 id="zend.search.lucene.query-language.modifiers">
 189         <title>Term Modifiers</title>
 190
 191         <para>
 192             Lucene supports modifying query terms to provide a wide range of searching options.
 193         </para>
 194
 195         <para>
 196             The "~" modifier can be used to specify a proximity search for phrases or a fuzzy search
 197             for individual terms.
 198         </para>
 199     </sect2>
 200
 201     <sect2 id="zend.search.lucene.query-language.range">
 202         <title>Range Searches</title>
 203
 204         <para>
 205             Range queries allow the developer or user to match documents whose field values lie
 206             between the lower and upper bounds specified by the range query. Range queries can be
 207             inclusive or exclusive of the upper and lower bounds. Terms are compared
 208             lexicographically.
 209         </para>
 210
 211         <programlisting language="querystring"><![CDATA[
 212 mod_date:[20020101 TO 20030101]
 213 ]]></programlisting>
 214
 215         <para>
 216             This will find documents whose mod_date fields have values between 20020101 and
 217             20030101, inclusive. Note that Range Queries are not reserved for date fields. You could
 218             also use range queries with non-date fields:
 219         </para>
 220
 221         <programlisting language="querystring"><![CDATA[
 222 title:{Aida TO Carmen}
 223 ]]></programlisting>
 224
 225         <para>
 226             This will find all documents whose titles would be sorted between Aida and Carmen, but
 227             not including Aida and Carmen.
 228         </para>
 229
 230         <para>
 231             Inclusive range queries are denoted by square brackets. Exclusive range queries are
 232             denoted by curly brackets.
 233         </para>
 234
 235         <para>
 236             If a field is not specified, <classname>Zend_Search_Lucene</classname> searches for
 237             the specified interval through all fields by default.
 238         </para>
 239
 240         <programlisting language="querystring"><![CDATA[
 241 {Aida TO Carmen}
 242 ]]></programlisting>
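
             <para>
                 Because range bounds are ordered lexicographically, numeric values behave as
                 expected only when all of them have the same number of digits, as the eight-digit
                 dates above do. The following sketch uses plain PHP string comparison as a stand-in
                 for that ordering; the <code>padForRange()</code> helper is hypothetical and is
                 not part of <classname>Zend_Search_Lucene</classname>.
             </para>

             <programlisting language="php"><![CDATA[
     // Hypothetical helper: left-pad a non-negative integer with zeros so that
     // string order matches numeric order.
     function padForRange($number, $width = 8)
     {
         return sprintf('%0' . $width . 'd', $number);
     }

     // Compared as plain strings, "9" sorts after "10" ...
     var_dump(strcmp('9', '10') > 0);                         // bool(true)

     // ... while the padded forms "00000009" and "00000010" sort as expected.
     var_dump(strcmp(padForRange(9), padForRange(10)) < 0);   // bool(true)
     ]]></programlisting>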
 243     </sect2>
 244
 245     <sect2 id="zend.search.lucene.query-language.fuzzy">
 246         <title>Fuzzy Searches</title>
 247
 248         <para>
 249             <classname>Zend_Search_Lucene</classname>, like Java Lucene, supports fuzzy searches
 250             based on the Levenshtein Distance, or Edit Distance, algorithm. To do a fuzzy search use
 251             the tilde, "~", symbol at the end of a single-word term. For example, to search for a
 252             term similar in spelling to "roam" use the fuzzy search:
 253         </para>
 254
 255         <programlisting language="querystring"><![CDATA[
 256 roam~
 257 ]]></programlisting>
 258
 259         <para>
 260             This search will find terms like foam and roams. An additional (optional) parameter
 261             specifies the required similarity. The value is between 0 and 1; the closer it is to 1,
 262             the higher the similarity a term needs in order to match. For example:
 263         </para>
 264
 265         <programlisting language="querystring"><![CDATA[
 266 roam~0.8
 267 ]]></programlisting>
 268
 269         <para>
 270             The default that is used if the parameter is not given is 0.5.
 271         </para>
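
             <para>
                 To get a feel for the edit distance itself, PHP's built-in
                 <code>levenshtein()</code> function can be used. This is only an illustration of
                 the distance measure; it does not reproduce how
                 <classname>Zend_Search_Lucene</classname> turns a distance into a similarity
                 value.
             </para>

             <programlisting language="php"><![CDATA[
     // Print the edit distance from "roam" to a few candidate terms.
     foreach (array('foam', 'roams', 'framework') as $candidate) {
         printf("%-10s %d\n", $candidate, levenshtein('roam', $candidate));
     }

     // "foam" (one substitution) and "roams" (one insertion) are each
     // a single edit away from "roam".
     ]]></programlisting>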
 272     </sect2>
 273
 274     <sect2 id="zend.search.lucene.query-language.matched-terms-limitations">
 275         <title>Matched terms limitation</title>
 276
 277         <para>
 278             Wildcard, range and fuzzy search queries may match too many terms, which can severely
 279             degrade search performance.
 280         </para>
 281
 282         <para>
 283             For this reason, <classname>Zend_Search_Lucene</classname> limits the number of
 284             matching terms per query (subquery). This limit can be retrieved and set using the
 285             <code>Zend_Search_Lucene::getTermsPerQueryLimit()</code>/<code>Zend_Search_Lucene::setTermsPerQueryLimit($limit)</code>
 286             methods.
 287         </para>
 288
 289         <para>
 290             The default limit of matched terms per query is 1024.
 291         </para>
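
             <para>
                 A possible way to tighten the limit for a single search and restore it afterwards
                 is sketched below. The index path and the value 256 are arbitrary placeholders.
             </para>

             <programlisting language="php"><![CDATA[
     require_once 'Zend/Search/Lucene.php';

     // Placeholder index location; adjust it to your application.
     $index = Zend_Search_Lucene::open('/path/to/index');

     // Remember the current limit so it can be restored afterwards.
     $previousLimit = Zend_Search_Lucene::getTermsPerQueryLimit();

     // Allow fewer matched terms while running this wildcard search.
     Zend_Search_Lucene::setTermsPerQueryLimit(256);
     $hits = $index->find('test*');

     // Restore the previous setting.
     Zend_Search_Lucene::setTermsPerQueryLimit($previousLimit);
     ]]></programlisting>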
 292     </sect2>
 293
 294     <sect2 id="zend.search.lucene.query-language.proximity-search">
 295         <title>Proximity Searches</title>
 296
 297         <para>
 298             Lucene supports finding words from a phrase that are within a specified word distance in
 299             a string. To do a proximity search use the tilde, "~", symbol at the end of the phrase.
 300             For example, to search for "Zend" and "Framework" within 10 words of each other in a
 301             document, use the search:
 302         </para>
 303
 304         <programlisting language="querystring"><![CDATA[
 305 "Zend Framework"~10
 306 ]]></programlisting>
 307     </sect2>
 308
 309     <sect2 id="zend.search.lucene.query-language.boosting">
 310         <title>Boosting a Term</title>
 311
 312         <para>
 313             Java Lucene and <classname>Zend_Search_Lucene</classname> provide the relevance level of
 314             matching documents based on the terms found. To boost the relevance of a term use the
 315             caret, "^", symbol with a boost factor (a number) at the end of the term you are
 316             searching. The higher the boost factor, the more relevant the term will be.
 317         </para>
 318
 319         <para>
 320             Boosting allows you to control the relevance of a document by boosting individual terms.
 321             For example, if you are searching for
 322         </para>
 323
 324         <programlisting language="querystring"><![CDATA[
 325 PHP framework
 326 ]]></programlisting>
 327
 328         <para>
 329             and you want the term "PHP" to be more relevant, boost it using the ^ symbol along with
 330             the boost factor next to the term. You would type:
 331         </para>
 332
 333         <programlisting language="querystring"><![CDATA[
 334 PHP^4 framework
 335 ]]></programlisting>
 336
 337         <para>
 338             This will make documents with the term <acronym>PHP</acronym> appear more relevant. You
 339             can also boost phrase terms and subqueries as in the example:
 340         </para>
 341
 342         <programlisting language="querystring"><![CDATA[
 343 "PHP framework"^4 "Zend Framework"
 344 ]]></programlisting>
 345
 346         <para>
 347             By default, the boost factor is 1. Although the boost factor must be positive,
 348             it may be less than 1 (e.g. 0.2).
 349         </para>
 350     </sect2>
 351
 352     <sect2 id="zend.search.lucene.query-language.boolean">
 353         <title>Boolean Operators</title>
 354
 355         <para>
 356             Boolean operators allow terms to be combined through logic operators.
 357             Lucene supports AND, "+", OR, NOT and "-" as Boolean operators.
 358             Java Lucene requires boolean operators to be ALL CAPS.
 359             <classname>Zend_Search_Lucene</classname> does not.
 360         </para>
 361
 362         <para>
 363             The AND, OR, and NOT operators and the "+" and "-" operators define two different styles
 364             for constructing boolean queries. Unlike Java Lucene,
 365             <classname>Zend_Search_Lucene</classname> doesn't allow these two styles to be mixed.
 366         </para>
 367
 368         <para>
 369             If the AND/OR/NOT style is used, then an AND or OR operator must be present between all
 370             query terms. Each term may also be preceded by the NOT operator. The AND operator has
 371             higher precedence than the OR operator. This differs from Java Lucene behavior.
 372         </para>
 373
 374         <sect3 id="zend.search.lucene.query-language.boolean.and">
 375             <title>AND</title>
 376
 377             <para>
 378                 The AND operator means that all terms in the "AND group" must match some part of the
 379                 searched field(s).
 380             </para>
 381
 382             <para>
 383                 To search for documents that contain "PHP framework" and "Zend Framework" use the
 384                 query:
 385             </para>
 386
 387             <programlisting language="querystring"><![CDATA[
 388 "PHP framework" AND "Zend Framework"
 389 ]]></programlisting>
 390         </sect3>
 391
 392         <sect3 id="zend.search.lucene.query-language.boolean.or">
 393             <title>OR</title>
 394
 395             <para>
 396                 The OR operator divides the query into several optional terms.
 397             </para>
 398
 399             <para>
 400                 To search for documents that contain "PHP framework" or "Zend Framework" use the
 401                 query:
 402             </para>
 403
 404             <programlisting language="querystring"><![CDATA[
 405 "PHP framework" OR "Zend Framework"
 406 ]]></programlisting>
 407         </sect3>
 408
 409         <sect3 id="zend.search.lucene.query-language.boolean.not">
 410             <title>NOT</title>
 411
 412             <para>
 413                 The NOT operator excludes documents that contain the term after NOT. But an "AND
 414                 group" which contains only terms with the NOT operator gives an empty result set
 415                 instead of a full set of indexed documents.
 416             </para>
 417
 418             <para>
 419                 To search for documents that contain "PHP framework" but not "Zend Framework" use
 420                 the query:
 421             </para>
 422
 423             <programlisting language="querystring"><![CDATA[
 424 "PHP framework" AND NOT "Zend Framework"
 425 ]]></programlisting>
 426         </sect3>
 427
 428         <sect3 id="zend.search.lucene.query-language.boolean.other-form">
 429             <title>&amp;&amp;, ||, and ! operators</title>
 430
 431             <para>
 432                 &amp;&amp;, ||, and ! may be used instead of AND, OR, and NOT notation.
 433             </para>
 434         </sect3>
 435
 436         <sect3 id="zend.search.lucene.query-language.boolean.plus">
 437             <title>+</title>
 438
 439             <para>
 440                 The "+" or required operator stipulates that the term after the "+" symbol must
 441                 match the document.
 442             </para>
 443
 444             <para>
 445                 To search for documents that must contain "Zend" and may contain "Framework" use the
 446                 query:
 447             </para>
 448
 449             <programlisting language="querystring"><![CDATA[
 450 +Zend Framework
 451 ]]></programlisting>
 452         </sect3>
 453
 454         <sect3 id="zend.search.lucene.query-language.boolean.minus">
 455             <title>-</title>
 456
 457             <para>
 458                 The "-" or prohibit operator excludes documents that match the term after the "-"
 459                 symbol.
 460             </para>
 461
 462             <para>
 463                 To search for documents that contain "PHP framework" but not "Zend Framework" use
 464                 the query:
 465             </para>
 466
 467             <programlisting language="querystring"><![CDATA[
 468 "PHP framework" -"Zend Framework"
 469 ]]></programlisting>
 470         </sect3>
 471
 472         <sect3 id="zend.search.lucene.query-language.boolean.no-operator">
 473             <title>No Operator</title>
 474
 475             <para>
 476                 If no operator is used, then the search behavior is defined by the "default boolean
 477                 operator".
 478             </para>
 479
 480             <para>
 481                 This is set to <code>OR</code> by default.
 482             </para>
 483
 484             <para>
 485                 That implies each term is optional by default. It may or may not be present within
 486                 the document, but documents with this term will receive a higher score.
 487             </para>
 488
 489             <para>
 490                 To search for documents that must contain "PHP framework" and may contain "Zend
 491                 Framework" use the query:
 492             </para>
 493
 494             <programlisting language="querystring"><![CDATA[
 495 +"PHP framework" "Zend Framework"
 496 ]]></programlisting>
 497
 498             <para>
 499                 The default boolean operator may be set or retrieved with the
 500                 <classname>Zend_Search_Lucene_Search_QueryParser::setDefaultOperator($operator)</classname>
 501                 and
 502                 <classname>Zend_Search_Lucene_Search_QueryParser::getDefaultOperator()</classname>
 503                 methods, respectively.
 504             </para>
 505
 506             <para>
 507                 These methods operate with the
 508                 <classname>Zend_Search_Lucene_Search_QueryParser::B_AND</classname> and
 509                 <classname>Zend_Search_Lucene_Search_QueryParser::B_OR</classname> constants.
 510             </para>
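
                 <para>
                     As a brief sketch, the default operator could be switched to
                     <code>AND</code> before parsing a query string and restored afterwards. The
                     query text is only an example.
                 </para>

                 <programlisting language="php"><![CDATA[
     require_once 'Zend/Search/Lucene.php';

     // Remember the current default operator so it can be restored afterwards.
     $previous = Zend_Search_Lucene_Search_QueryParser::getDefaultOperator();

     // Require every term that has no explicit operator.
     Zend_Search_Lucene_Search_QueryParser::setDefaultOperator(
         Zend_Search_Lucene_Search_QueryParser::B_AND
     );

     // Both "PHP" and "framework" are now treated as required terms.
     $query = Zend_Search_Lucene_Search_QueryParser::parse('PHP framework');

     // Restore the previous default operator.
     Zend_Search_Lucene_Search_QueryParser::setDefaultOperator($previous);
     ]]></programlisting>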
 511         </sect3>
 512     </sect2>
 513
 514     <sect2 id="zend.search.lucene.query-language.grouping">
 515         <title>Grouping</title>
 516
 517         <para>
 518             Java Lucene and <classname>Zend_Search_Lucene</classname> support using parentheses to
 519             group clauses to form subqueries. This can be useful if you want to control the
 520             precedence of boolean logic operators for a query or mix different boolean query styles:
 521         </para>
 522
 523         <programlisting language="querystring"><![CDATA[
 524 +(framework OR library) +php
 525 ]]></programlisting>
 526
 527         <para>
 528             <classname>Zend_Search_Lucene</classname> supports subqueries nested to any level.
 529         </para>
 530     </sect2>
 531
 532     <sect2 id="zend.search.lucene.query-language.field-grouping">
 533         <title>Field Grouping</title>
 534
 535         <para>
 536             Lucene also supports using parentheses to group multiple clauses to a single field.
 537         </para>
 538
 539         <para>
 540             To search for a title that contains both the word "return" and the phrase "pink panther"
 541             use the query:
 542         </para>
 543
 544         <programlisting language="querystring"><![CDATA[
 545 title:(+return +"pink panther")
 546 ]]></programlisting>
 547     </sect2>
 548
 549     <sect2 id="zend.search.lucene.query-language.escaping">
 550         <title>Escaping Special Characters</title>
 551
 552         <para>
 553             Lucene supports escaping special characters that are used in query syntax. The current
 554             list of special characters is:
 555         </para>
 556
 557         <para>
 558             + - &amp;&amp; || ! ( ) { } [ ] ^ " ~ * ? : \
 559         </para>
 560
 561         <para>
 562             + and - inside single terms are automatically treated as common characters.
 563         </para>
 564
 565         <para>
 566             For other instances of these characters, put a \ before each special character you'd
 567             like to escape. For example, to search for (1+1):2 use the query:
 568         </para>
 569
 570         <programlisting language="querystring"><![CDATA[
 571 \(1\+1\)\:2
 572 ]]></programlisting>
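
             <para>
                 When query text comes from user input, the escaping can be done programmatically.
                 The helper below is a hypothetical sketch, not part of
                 <classname>Zend_Search_Lucene</classname>. It escapes every character from the
                 list above individually, including each "&amp;" and "|", and it also escapes "+"
                 and "-" inside terms, where escaping is not strictly required.
             </para>

             <programlisting language="php"><![CDATA[
     // Hypothetical helper: prefix every query-syntax character with a backslash.
     function escapeQueryString($text)
     {
         return preg_replace('/[+\-&|!(){}\[\]^"~*?:\\\\]/', '\\\\$0', $text);
     }

     echo escapeQueryString('(1+1):2'), "\n";   // \(1\+1\)\:2
     ]]></programlisting>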
 573     </sect2>
 574 </sect1>