==============
Omega overview
==============

If you just want a very quick overview, you might prefer to read the
`quick-start guide <quickstart.html>`_.

Omega operates on a set of databases.  Each database is created and updated
separately using either omindex or `scriptindex <scriptindex.html>`_.  You can
search these databases (or any other Xapian database with suitable contents)
via a web front-end provided by omega, a CGI application.  A search can also be
done over more than one database at once.

There are separate documents covering `CGI parameters <cgiparams.html>`_, the
`Term Prefixes <termprefixes.html>`_ which are conventionally used, and
`OmegaScript <omegascript.html>`_, the language used to define omega's web
interface.  Omega ships with several OmegaScript templates and you can
use these, modify them, or just write your own.  See the "Supplied Templates"
section below for details of the supplied templates.

Omega parses queries using the ``Xapian::QueryParser`` class - for the supported
syntax, see queryparser.html in the xapian-core documentation, which is
available online at https://xapian.org/docs/queryparser.html

Term construction
=================

Documents within an omega database are indexed by two types of terms: those
used for a weighted search from a parsed query string (the CGI parameter
``P``), and those used for boolean filtering (the CGI parameters ``B`` and
``N`` - the latter is a negated variant of ``B`` and was added in Omega 1.3.5).

Boolean terms always start with a prefix which is an initial capital letter (or
multiple capital letters if the first character is `X`) which denotes the
category of the term (e.g. `T` for MIME type).

Parsed query terms may have a prefix, but don't always.  Those from the body of
the document in unstemmed form don't; stemmed terms have a `Z` prefix; terms
from other fields have a prefix to indicate the field, such as `S` for the
document title; stemmed terms from a field have both prefixes, e.g. `ZS`.

The "english" stemmer is used by default - you can configure this for omindex
and scriptindex with ``--stemmer=LANGUAGE`` (use ``--stemmer=none`` to disable
stemming, see omindex ``--help`` for the list of accepted language names).  At
search time you can configure the stemmer by adding ``$set{stemmer,LANGUAGE}``
to the top of your OmegaScript template.

The two term types are used as follows when building the query:

The ``P`` parameter is parsed using `Xapian::QueryParser` to give a
`Xapian::Query` object denoted as `P-terms` below.

There are two ways that ``B`` and ``N`` parameters are handled, depending on
whether the term prefix has been configured as "non-exclusive" or not.  The
default is "exclusive" (and in versions before 1.3.4, this was how all ``B``
parameters were handled).

Exclusive Boolean Prefix
------------------------

B(oolean) terms from ``B`` parameters with the same prefix are ORed together,
like so::

                    [   OR   ]
                   /    | ... \
              B(F,1) B(F,2)...B(F,n)

Where B(F,1) is the first boolean term with prefix F from a ``B`` parameter, and
so on.

Non-Exclusive Boolean Prefix
----------------------------

For example, ``$setmap{nonexclusiveprefix,K,true}`` sets prefix `K` as
non-exclusive, which means that multiple filter terms from ``B`` parameters will
be combined with "AND" instead of "OR", like so::

                    [   AND   ]
                   /     | ... \
              B(K,1) B(K,2)... B(K,m)

Combining the Boolean Filters
-----------------------------

The subqueries for each prefix from ``B`` parameters are combined with AND,
to make this (which we refer to as "B-filter" below)::

                         [     AND     ]
                        /       |  ...  \
                       /                 \
                 [   OR   ]               [   AND  ]
                /    | ... \             /    | ... \
           B(F,1) B(F,2)...B(F,n)   B(K,1) B(K,2)...B(K,m)

Negated Boolean Terms
---------------------

All the terms from all ``N`` parameters are combined together with "OR", to
make this (which we refer to as "N-filter" below)::

                    [       OR       ]
                   / ... |     |  ... \
              N(F,1)...N(F,n) N(K,1)...N(K,m)

Putting it all together
-----------------------

The P-terms are filtered by the B-filter using "FILTER" and by the N-filter
using "AND_NOT"::

                        [ AND_NOT ]
                       /           \
                      /             \
            [ FILTER ]             N-filter
             /      \
            /        \
       P-terms      B-filter

The intent here is to allow filtering on arbitrary (and, typically,
orthogonal) characteristics of the document. For instance, by adding
boolean terms "Ttext/html", "Ttext/plain" and "J/press" you would be
filtering the parsed query to only retrieve documents that are both in
the "/press" site *and* which are either of MIME type text/html or
text/plain. (See below for more information about sites.)

If the B-filter or the N-filter is absent, that part of the query is simply
omitted.

If there is no parsed query, the boolean filter is promoted to
be the query, and the weighting scheme is set to boolean.  This has
the effect of applying the boolean filter to the whole database.  If
there is only an N-filter, then ``Query::MatchAll`` is used for the left
side of the "AND_NOT".
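
The combination rules above can be sketched in a few lines of Python.  This is
only an illustration of the logic as described here, not Omega's actual code:
plain tuples stand in for ``Xapian::Query`` objects, and the function name
``build_query`` and its parameters are hypothetical.

```python
def build_query(p_terms, b_terms, n_terms, non_exclusive=()):
    """Return a nested-tuple query tree, or None if there is nothing to search.

    p_terms: the already-parsed P query (any object), or None if absent.
    b_terms, n_terms: lists of (prefix, term) pairs from B and N parameters.
    non_exclusive: prefixes whose B terms are ANDed instead of ORed.
    """
    # Group the B terms by prefix, keeping first-seen prefix order.
    groups = {}
    for prefix, term in b_terms:
        groups.setdefault(prefix, []).append(term)

    # One subquery per prefix: OR for exclusive, AND for non-exclusive.
    subqueries = []
    for prefix, terms in groups.items():
        op = "AND" if prefix in non_exclusive else "OR"
        subqueries.append(terms[0] if len(terms) == 1 else (op, *terms))

    # The per-prefix subqueries are ANDed together to give the B-filter.
    b_filter = None
    if subqueries:
        b_filter = subqueries[0] if len(subqueries) == 1 else ("AND", *subqueries)

    # All N terms are ORed together to give the N-filter.
    n_filter = None
    if n_terms:
        terms = [term for _, term in n_terms]
        n_filter = terms[0] if len(terms) == 1 else ("OR", *terms)

    if p_terms is not None:
        query = ("FILTER", p_terms, b_filter) if b_filter is not None else p_terms
    elif b_filter is not None:
        query = b_filter      # the boolean filter is promoted to be the query
    elif n_filter is not None:
        query = "MatchAll"    # stands in for Query::MatchAll
    else:
        return None

    if n_filter is not None:
        query = ("AND_NOT", query, n_filter)
    return query
```

With ``P-terms`` plus the boolean terms ``Ttext/html``, ``Ttext/plain`` and
``J/press`` from the example above, this yields a ``FILTER`` node whose right
side is an ``AND`` of the ``OR`` over the two MIME types and the single site
term.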

In order to add more boolean prefixes, you will need to alter the
``index_file()`` function in omindex.cc. Currently omindex adds several
useful ones, detailed below.

Parsed query terms are constructed from the title, body and keywords
of a document. (Not all document types support all three areas of
text.) Title terms are stored with position data starting at 0, body
terms starting 100 beyond title terms, and keyword terms starting 100
beyond body terms. This allows queries using positional data without
causing false matches across the different types of term.
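
As a rough illustration of why the gaps help, the following Python sketch
assigns positions to the three areas of text.  The function name
``assign_positions`` is hypothetical, tokenisation is reduced to splitting on
whitespace, and the exact offset arithmetic is an assumption (here the counter
is simply advanced by 100 between areas).

```python
def assign_positions(title, body, keywords, gap=100):
    """Return (term, position) pairs for the title, body and keywords."""
    postings = []
    pos = 0  # title terms start at position 0
    for index, text in enumerate((title, body, keywords)):
        if index:
            pos += gap  # leave a gap of 100 before the next area of text
        for word in text.lower().split():
            postings.append((word, pos))
            pos += 1
    return postings
```

For a title of "big release" and a body starting "new product", the terms
``release`` and ``new`` end up 101 positions apart, so a phrase search for
"release new" cannot match across the title/body boundary.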

Sites
=====

Within a database, Omega supports multiple sites. These are recorded
using boolean terms (see 'Term construction', above) to allow
filtering on them.

A site is a set of documents which share a common base URL. For
instance, you might have two sites, one for your press area
and one for your product descriptions:

- \http://example.com/press/index.html
- \http://example.com/press/bigrelease.html
- \http://example.com/products/bigproduct.html
- \http://example.com/products/littleproduct.html

You could index all documents within \http://example.com/press/ using a
site of '/press', and all within \http://example.com/products/ using
'/products'.

Sites are also useful because omindex indexes documents through the
file system, not by fetching from the web server. If you don't have a
URL to file system mapping which puts all documents under one
hierarchy, you'll need to index each separate section as a site.

An obvious example of this is the way that many web servers map URLs
of the form <\http://example.com/~<username>/> to a directory within
that user's home directory (such as ~<username>/pub on a Unix
system). In this case, you can index each user's home page separately,
as a site of the form '/~<username>'. You can then use boolean
filters to allow people to search only a specific home page (or a
group of them), or omit such terms to search everyone's pages.

Note that the site specified when you index is used to build the
complete URL that the results page links to. Thus while a site will
typically be relative to the hostname part of the URL (e.g.
'/site' rather than '\http://example.com/site'), you can use sites
which include the hostname to have a single search across several
different hostnames. This will still work if you actually store each
distinct hostname in a different database.

omindex operation
=================

omindex is fairly simple to use, for example::

  omindex --db default --url http://example.com/ /var/www/example.com

For a full list of command line options supported, see ``man omindex``
or ``omindex --help``.

You *must* specify the database to index into (it's created if it doesn't
exist, but parent directories must exist).  You will often also want to specify
the base URL (which is used as the site, and can be relative to the hostname -
starts '/' - or absolute - starts with a scheme, e.g.
'\http://example.com/products/').  If not specified, the base URL defaults to
``/``.

You also need to tell omindex which directory to index. This is given
either as a single directory (in which case it is taken to be the
directory base of the entire site being indexed), or as two arguments,
the first being the directory base of the site being indexed, and the
second being a relative directory within that to index.

For instance, in the example above, if you separate your products by
size, you might end up with:

- \http://example.com/press/index.html
- \http://example.com/press/bigrelease.html
- \http://example.com/products/large/bigproduct.html
- \http://example.com/products/small/littleproduct.html

If the entire website is stored in the file system under the directory
/www/example, then you would probably index the site in two
passes, one for the '/press' site and one for the '/products' site. You
might use the following commands::

  $ omindex -p --db /var/lib/omega/data/default --url /press /www/example/press
  $ omindex -p --db /var/lib/omega/data/default --url /products /www/example/products

If you add a new large product, but don't want to reindex the whole of
the products section, you could do::

  $ omindex -p --db /var/lib/omega/data/default --url /products /www/example/products large

and just the large products will be reindexed. You need to do it like that, and
not as::

  $ omindex -p --db /var/lib/omega/data/default --url /products/large /www/example/products/large

because that would make the large products part of a new site,
'/products/large', which is unlikely to be what you want, as large
products would no longer come up in a search of the products
site. (Note that the ``--depth-limit`` option may come in handy if you have
sites '/products' and '/products/large', or similar.)
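
The difference between the two invocations can be seen in a small Python
sketch which maps a file to its site term and URL.  This is an illustration
under stated assumptions rather than omindex's code: the function name
``site_and_url`` is hypothetical, and the percent-encoding is a
simplification.

```python
import posixpath
from urllib.parse import quote


def site_and_url(base_url, base_dir, file_path):
    """Return (site term, URL) for a file below the site's base directory."""
    site = base_url.rstrip("/")
    # Path of the file relative to the directory base of the site.
    relative = posixpath.relpath(file_path, base_dir)
    # Encode each component separately so the '/' separators are kept.
    encoded = "/".join(quote(part) for part in relative.split("/"))
    # The J term is the base URL without any trailing slash.
    return "J" + site, site + "/" + encoded
```

Indexing ``/www/example/products/large/bigproduct.html`` with base URL
``/products`` and with base URL ``/products/large`` gives the same URL in both
cases, but the site term is ``J/products`` in the first case and
``J/products/large`` in the second, which is why the second form takes the
large products out of the '/products' site.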

omindex has built-in support for indexing HTML, PHP, text files, CSV
(Comma-Separated Values) files, SVG, Atom feeds, and AbiWord documents.  It can
also index a number of other formats using external programs.  Filter programs
are run with CPU, time and memory limits to prevent a runaway filter from
blocking indexing of other files.

The way omindex decides how to index a file is based around MIME content-types.
First of all omindex will look up a file's extension in its extension to MIME
type map.  If there's no entry, it will then ask libmagic to examine the
contents of the file and try to determine a MIME type.

The following formats are supported as standard (you can tell omindex to use
other filters too - see below):

* HTML (.html, .htm, .shtml, .shtm, .xhtml, .xhtm)
* PHP (.php) - our HTML parser knows to ignore PHP code
* text files (.txt, .text)
* SVG (.svg)
* CSV (Comma-Separated Values) files (.csv)
* PDF (.pdf) if pdftotext is available (comes with poppler or xpdf)
* PostScript (.ps, .eps, .ai) if ps2pdf (from ghostscript) and pdftotext (comes
  with poppler or xpdf) are available
* OpenOffice/StarOffice documents (.sxc, .stc, .sxd, .std, .sxi, .sti, .sxm,
  .sxw, .sxg, .stw) if unzip is available
* OpenDocument format documents (.odt, .ods, .odp, .odg, .odc, .odf, .odb,
  .odi, .odm, .ott, .ots, .otp, .otg, .otc, .otf, .oti, .oth) if unzip is
  available
* MS Word documents (.dot) if antiword is available (.doc files are left to
  libmagic, as they may actually be RTF (AbiWord saves RTF when asked to save
  as .doc, and Microsoft Word quietly loads RTF files with a .doc extension),
  or plain-text).
* MS Excel documents (.xls, .xlb, .xlt, .xlr, .xla) if xls2csv is available
  (comes with catdoc)
* MS PowerPoint documents (.ppt, .pps) if catppt is available (comes with
  catdoc)
* MS Office 2007 documents (.docx, .docm, .dotx, .dotm, .xlsx, .xlsm, .xltx,
  .xltm, .pptx, .pptm, .potx, .potm, .ppsx, .ppsm) if unzip is available
* WordPerfect documents (.wpd) if wpd2text is available (comes with libwpd)
* MS Works documents (.wps, .wpt) if wps2text is available (comes with libwps)
* MS Outlook message (.msg) if perl with Email::Outlook::Message and
  HTML::Parser modules is available
* MS Publisher documents (.pub) if pub2xhtml is available (comes with libmspub)
* MS Visio documents (.vsd, .vss, .vst, .vsw, .vsdx, .vssx, .vstx, .vsdm,
  .vssm, .vstm) if vsd2xhtml is available (comes with libvisio)
* Apple Keynote documents (.key, .kth, .apxl) if key2text is available (comes
  with libetonyek)
* Apple Numbers documents (.numbers) if numbers2text is available (comes with
  libetonyek)
* Apple Pages documents (.pages) if pages2text is available (comes with
  libetonyek)
* AbiWord documents (.abw, .awt)
* Compressed AbiWord documents (.zabw)
* Rich Text Format documents (.rtf) if unrtf is available
* Perl POD documentation (.pl, .pm, .pod) if pod2text is available
* reStructuredText (.rst, .rest) if rst2html is available (comes with
  docutils)
* Markdown (.md, .markdown) if markdown is available
* TeX DVI files (.dvi) if catdvi is available
* DjVu files (.djv, .djvu) if djvutxt is available
* OpenXPS and XPS files (.oxps, .xps) if unzip is available
* Debian packages (.deb, .udeb) if dpkg-deb is available
* RPM packages (.rpm) if rpm is available
* Atom feeds (.atom)
* MAFF (.maff) if unzip is available
* MHTML (.mhtml, .mht) if perl with MIME::Tools is available
* MIME email messages (.eml) and USENET articles if perl with MIME::Tools and
  HTML::Parser is available
* vCard files (.vcf, .vcard) if perl with Text::vCard is available

If you have additional extensions that represent one of these types, you can
add an additional MIME mapping using the ``--mime-type`` option.  For
instance, if your press releases are PostScript files with extension
``.posts`` you can tell omindex this like so::

  $ omindex --db /var/lib/omega/data/default --url /press /www/example/press --mime-type posts:application/postscript

The syntax of ``--mime-type`` is 'ext:type', where ext is the extension of
a file of that type (everything after the last '.').  The ``type`` can be any
string, but to be useful there needs to be a filter set for that type (using
``--filter`` or ``--read-filters``), or ``type`` needs to be one which is
understood by default:

.. include:: inc/mimetypes.rst

You can specify ``*`` as the MIME sub-type for ``--filter``, for example if you
have a filter you want to apply to any video files, you could specify it using
``--filter 'video/*:index-video-file'``.  Note that this is checked right after
checking for the exact MIME type, so will override any built-in filters which
would otherwise match.  Also you can't use arbitrary wildcards, just ``*`` for
the entire sub-type.  And be careful to quote ``*`` to protect it from the
shell.  Support for this was added in 1.3.3.

If there's no specific filter, and no subtype wildcard, then ``*/*`` is checked
(assuming the mimetype contains a ``/``), and after that ``*`` (for any
mimetype string).  Combined with filter command ``true`` for indexing by
meta-data only, you can specify a fallback case of indexing by meta-data
only using ``--filter '*:true'``.  Support for this was added in 1.3.4.
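
The order in which filter entries are tried can be sketched as follows.  This
is a simplified Python illustration, not omindex's implementation: the name
``find_filter`` is hypothetical, the filter table is modelled as a plain
dictionary from MIME type pattern to command, and built-in handlers are left
out.

```python
def find_filter(mimetype, filters):
    """Return the command for mimetype, or None if no entry applies."""
    candidates = [mimetype]                      # 1. the exact MIME type
    if "/" in mimetype:
        main_type = mimetype.split("/", 1)[0]
        candidates.append(main_type + "/*")      # 2. sub-type wildcard
        candidates.append("*/*")                 # 3. any type/sub-type pair
    candidates.append("*")                       # 4. any mimetype string
    for pattern in candidates:
        if pattern in filters:
            return filters[pattern]
    return None
```

For example, with ``{"video/*": "index-video-file", "*": "true"}`` a file of
type ``video/mp4`` is handled by ``index-video-file``, while any other type
falls through to ``true`` and is indexed by meta-data only.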

There are also two special values that can be specified instead of a MIME
type:

* ignore - tells omindex to quietly ignore such files
* skip - tells omindex to skip such files

By default no extensions are marked as "skip", and the following extensions are
marked as "ignore":

.. include:: inc/ignored.rst

If you wish to remove a MIME mapping, you can do this by omitting the type -
for example if you have ``.dot`` files which are inputs for the graphviz
tool ``dot``, then you may wish to remove the default mapping for ``.dot``
files and let libmagic be used to determine their type, which you can do
using: ``--mime-type=dot:`` (if you want to *ignore* all ``.dot`` files,
instead use ``--mime-type=dot:ignore``).

The lookup of extensions in the MIME mappings is case sensitive, but if an
extension isn't found and includes upper case ASCII letters, they're converted
to lower case and the lookup is repeated, so you effectively get case
insensitive lookup for mappings specified with a lower-case extension, but
you can set different handling for differently cased variants if you need
to.
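
The two-step lookup can be sketched in Python like this.  The function name
``lookup_extension`` and the dictionary standing in for the extension to MIME
type map are assumptions made for the illustration.

```python
import string

# Translation table which lower-cases ASCII letters only.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def lookup_extension(extension, mime_map):
    """Return the mapped type, or None if libmagic would have to be asked."""
    # First try the extension exactly as it appears in the filename.
    if extension in mime_map:
        return mime_map[extension]
    # If that fails and there were upper case ASCII letters, retry lower-cased.
    lowered = extension.translate(_ASCII_LOWER)
    if lowered != extension:
        return mime_map.get(lowered)
    return None
```

With a map containing ``pdf``, ``ps`` and a separately configured ``PS``, the
extension ``PDF`` resolves via the lower-cased retry, while ``PS`` and ``ps``
each get their own mapping.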

You can add support for additional MIME content types (or override existing
ones) using the ``--filter`` and/or ``--read-filters`` options to specify a
command to run.  At present, this command needs to produce output in either
HTML, SVG, or plain text format (as of 1.3.3, you can specify the character
encoding that the output will be in; in earlier versions, plain text output had
to be UTF-8).  Support for SVG output from external commands was added in
1.4.8.

As of 1.3.3, the command can include certain placeholders which are substituted
by omindex:

* Any ``%f`` in this command will be replaced with the filename of the file to
  extract (suitably escaped to protect it from the shell, so don't put quotes
  around ``%f``).

  If you don't include ``%f`` in the command, then the filename of the file to
  be extracted will be appended to the command, separated by a space.

* Any ``%t`` in this command will be replaced with a filename in a temporary
  directory (suitably escaped to protect it from the shell, so don't put
  quotes around ``%t``).  The extension of this filename will reflect the
  expected output format (either ``.html``, ``.svg`` or ``.txt``).  If you
  don't use ``%t`` in the command, then omindex will expect output on
  ``stdout`` (prior to 1.3.3, output had to be on ``stdout``).

* ``%%`` can be used should you need a literal ``%`` in the command.
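
The placeholder rules above can be modelled with a short Python sketch.  It is
illustrative only: the function name ``expand_filter_command`` is
hypothetical, and ``shlex.quote`` stands in for whatever shell escaping
omindex actually performs.

```python
import re
import shlex


def expand_filter_command(command, filename, tmpfile):
    """Return (expanded command, whether output is expected on stdout)."""
    used = set()

    def substitute(match):
        code = match.group(1)
        used.add(code)
        if code == "f":
            return shlex.quote(filename)   # escaped input filename
        if code == "t":
            return shlex.quote(tmpfile)    # escaped temporary output filename
        return "%"                         # "%%" gives a literal "%"

    expanded = re.sub(r"%([ft%])", substitute, command)
    if "f" not in used:
        # No %f given: append the filename, separated by a space.
        expanded += " " + shlex.quote(filename)
    # Without %t the filter's output is read from stdout.
    return expanded, "t" not in used
```

For instance, ``foo2utf16 %f %t`` with an input file named
``/docs/my file.foo`` expands so that the filename is quoted as a single shell
word, and the second return value is ``False`` because the output goes to the
temporary file rather than ``stdout``.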

For example, if you'd prefer to use AbiWord to extract text from Word documents
(by default, omindex uses antiword), then you can pass the option
``--filter=application/msword:'abiword --to=txt --to-name=fd://1'`` to
omindex.

Another example - if you wanted to handle files of MIME type
``application/octet-stream`` by running them through ``strings -n8``, you can
pass the option ``--filter=application/octet-stream:'strings -n8'``.

A more complex example: to process ``.foo`` files with the (fictional)
``foo2utf16`` utility which produces UTF-16 text but doesn't support writing
output to stdout, run omindex with ``-Mfoo:text/x-foo
-Ftext/x-foo,,utf-16:'foo2utf16 %f %t'``.

A less contrived example of the use of ``--filter`` makes use of LibreOffice,
via the unoconv script, to extract text from various formats.  First you
need to start a listening instance (if you don't, unoconv will start up
LibreOffice for every file, which is rather inefficient) - the ``&`` tells
the shell to run it in the background::

  unoconv --listener &

Then run omindex with options such as
``--filter=application/msword,html:'unoconv --stdout -f html'`` (you'll want
to repeat this for each format which you want to use LibreOffice on).

If you specify ``false`` as the command in ``--filter``, omindex will skip
files with the specified MIME type.  (As of 1.2.20 and 1.3.3 ``false`` is
explicitly checked for; in earlier versions this will also work, at least
on Unix where ``false`` is a command which ignores its arguments and exits with
a non-zero status).

If you specify ``true`` as the command in ``--filter``, omindex won't try
to extract text from the file, but will index it such that it can be searched
for via metadata which comes from the filing system (filename, extension, MIME
content-type, last modified time, size).  (As of 1.2.22 and 1.3.4 ``true`` is
explicitly checked for; in earlier versions this will also work, at least
on Unix where ``true`` is a command which ignores its arguments and exits with
status zero).

If you know of a reliable filter which can extract text from a file format
which might be of interest to others, please let us know so we can consider
including it as a standard filter.

The ``--duplicates`` option controls how omindex handles documents which map
to a URL which is already in the database.  The default (which can be
explicitly set with ``--duplicates=replace``) is to reindex if the last
modified time of the file is newer than that recorded in the database.
The alternative is ``--duplicates=ignore``, which will never reindex an
existing document.  If you only add documents, this avoids the overhead
of checking the last modified time.  It also allows you to prioritise
adding completely new documents to the database over updating existing ones.

By default, omindex will remove any document in the database which has a URL
that doesn't correspond to a file seen on disk - in other words, it will clear
out everything that doesn't exist any more.  However if you are building up
an omega database with several runs of omindex, this is not
appropriate (as each run would delete the data from the previous run),
so you should use the ``--no-delete`` option.  Note that if you
choose to work like this, it is impossible to prune old documents from
the database using omindex. If this is a problem for you, an
alternative is to index each subsite into a different database, and
merge all the databases together when searching.

``--depth-limit`` allows you to prevent omindex from descending more than
a certain number of directories.  Specifying ``--depth-limit=0`` means no limit
is imposed on recursion; ``--depth-limit=1`` means don't descend into any
subdirectories of the start directory.
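
The meaning of the limit values can be made concrete with a small directory
walk in Python.  This is a sketch of the semantics described above, not
omindex's own traversal code, and the function name ``files_to_index`` is
hypothetical.

```python
import os


def files_to_index(start_dir, depth_limit=0):
    """Yield paths of files under start_dir; a depth_limit of 0 means no limit."""

    def walk(directory, depth):
        with os.scandir(directory) as entries:
            for entry in sorted(entries, key=lambda e: e.name):
                if entry.is_dir(follow_symlinks=False):
                    # The start directory is depth 1, so a limit of 1 never
                    # descends into any subdirectory.
                    if depth_limit == 0 or depth < depth_limit:
                        yield from walk(entry.path, depth + 1)
                elif entry.is_file():
                    yield entry.path

    yield from walk(start_dir, 1)
```

With a limit of 2, files in the start directory and in its immediate
subdirectories are returned, but nothing deeper.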

Tracking files which couldn't be indexed
----------------------------------------

In older versions, omindex only tracked files which it successfully indexed -
if a file couldn't be read, or a filter program failed on it, or it was marked
not to be indexed (e.g. with an HTML meta tag) then it would be retried on
subsequent runs.  Starting from version 1.3.4, omindex now tracks failed
files in the user metadata of the database, along with their sizes and last
modified times, and uses this data to skip files which previously failed and
haven't changed since.

You can force omindex to retry such files using the ``--retry-failed`` option.
One situation in which this is useful is if you've upgraded a filter program
to a newer version which you suspect will index some files which previously
failed.

Currently there's no mechanism for automatically removing failure entries
when the file they refer to is removed or renamed.  These lingering entries are
harmless, except they bloat the database a little.  A simple way to clear them
out is to run periodically with ``--retry-failed`` as this removes any existing
failure entries before indexing starts.
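
The bookkeeping described in this section might look roughly like the
following Python sketch.  A plain dictionary stands in for the user metadata
of the database, and the function names are hypothetical; how omindex actually
encodes these entries is not covered here.

```python
import os


def start_run(failures, retry_failed=False):
    """Prepare the failure table at the start of an indexing run."""
    if retry_failed:
        # --retry-failed removes existing failure entries before indexing.
        failures.clear()


def record_failure(path, failures):
    """Remember that indexing path failed, with its size and mtime."""
    st = os.stat(path)
    failures[path] = (st.st_size, int(st.st_mtime))


def should_skip(path, failures):
    """True if path failed before and hasn't changed since."""
    st = os.stat(path)
    return failures.get(path) == (st.st_size, int(st.st_mtime))
```

A file which failed is skipped on later runs until its size or last modified
time changes, or until a run is started with ``retry_failed=True``.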

HTML Parsing
============

The document ``<title>`` tag is used as the document title.  Metadata in various
``<meta>`` tags is also understood - these values of the ``name`` parameter are
currently handled when found:

 * ``author``, ``dcterms.creator``, ``dcterms.contributor``: author(s)
 * ``created``, ``dcterms.issued``: document creation date
 * ``classification``: document topic
 * ``keywords``, ``dcterms.subject``, ``dcterms.description``: indexed as extra
   document text (but not stored in the sample)
 * ``description``: by default, handled as ``keywords``, as of Omega 1.4.4.
   If ``omindex`` is run with ``--sample=description``, then this is used as
   the preferred source for the stored sample of document text (HTML documents
   with no ``description`` fall back to a sample from the body; if
   ``description`` occurs multiple times then second and subsequent are handled
   as ``keywords``).  In Omega 1.4.2 and earlier, ``--sample`` wasn't supported
   and the behaviour was as if ``--sample=description`` had been specified.  In
   Omega 1.4.3, ``--sample`` was added, but the default was
   ``--sample=description`` (contrary to the intended and documented behaviour)
   - you can use ``--sample=body`` with 1.4.3 and later to store a sample from
   the document body.

The HTML parser will look for the 'robots' META tag, and won't index pages
which are marked as ``noindex`` or ``none``, for example any of the following::

    <meta name="robots" content="noindex,nofollow">
    <meta name="robots" content="noindex">
    <meta name="robots" content="none">

The ``omindex`` option ``--ignore-exclusions`` disables this behaviour, so
the files with the above will be indexed anyway.
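
The robots check can be approximated in Python using the standard library's
HTML parser.  This is a sketch of the rule as described, not Omega's own HTML
parser; the names ``RobotsCheck`` and ``should_index`` are hypothetical.

```python
from html.parser import HTMLParser


class RobotsCheck(HTMLParser):
    """Records whether a robots META tag forbids indexing."""

    def __init__(self):
        super().__init__()
        self.indexing_allowed = True

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        directives = [d.strip().lower() for d in content.split(",")]
        if "noindex" in directives or "none" in directives:
            self.indexing_allowed = False


def should_index(html, ignore_exclusions=False):
    """Return False if the page asks not to be indexed."""
    if ignore_exclusions:
        return True  # models --ignore-exclusions
    checker = RobotsCheck()
    checker.feed(html)
    return checker.indexing_allowed
```

Each of the three ``<meta>`` examples above makes ``should_index`` return
``False``, unless ``ignore_exclusions`` is set.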

Sometimes it is useful to be able to exclude just part of a page from being
indexed (for example you may not want to index navigation links, or a footer
which appears on every page).  To allow this, the parser supports "magic"
comments to mark sections of the document to not index.  Two formats are
supported - htdig_noindex (used by ht://Dig) and UdmComment (used by
mnoGoSearch)::

    Index this bit <!--htdig_noindex-->but <b>not</b> this<!--/htdig_noindex-->

::

    <!--UdmComment--><div>Boring copyright notice</div><!--/UdmComment-->

Boolean terms
=============

omindex will create the following boolean terms when it indexes a
document:

E
    Extension of the file (e.g. `Epdf`) [since Omega 1.2.5]
T
    MIME type

J
    The base URL, omitting any trailing slash (so if the base URL was just
    `/`, the term is just `J`).  If the resulting term would be > 240
    bytes, it's hashed in the same way as `U` prefix terms are.  Mnemonic: the
    Jumping-off point. [since Omega 1.3.4]
H
    hostname of site (if supplied - this term won't exist if you index a
    site with base URL '/press', for instance).  Since Omega 1.3.4, if the
    resulting term would be > 240 bytes, it's hashed in the same way as `U`
    prefix terms are.
P
    path terms - one term for the directory which the document is in, and for
    each parent directory, with no trailing slashes [since Omega 1.3.4 -
    in earlier versions, there was just one `P` term for the path of site (i.e.
    the rest of the site base URL) - this will be amongst the terms Omega 1.3.4
    adds].  Since Omega 1.3.4, if the resulting term would be > 240 bytes, it's
    hashed in the same way as `U` prefix terms are.
U
    full URL of indexed document - if the resulting term would be > 240 bytes,
    a hashing scheme is used to avoid overflowing Xapian's term length limit.

D
    date (numeric format: YYYYMMDD)

    date can also have the magical form "latest" - a document indexed
    by the term Dlatest matches any date-range without an end date.
    You can index dynamic documents which are always up to date
    with Dlatest and they'll match as expected.  (If you use sort by date,
    you'll probably also want to set the value containing the timestamp to
    a "max" value so dynamic documents match a date in the far future).
M
    month (numeric format: YYYYMM)
Y
    year (four digits)
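
As a small illustration of the three date-based terms, the following Python
sketch formats them from a date.  The function name ``date_terms`` is
hypothetical; only the term formats come from the list above.

```python
from datetime import date


def date_terms(d):
    """Return the D, M and Y boolean terms for a date."""
    return [
        f"D{d.year:04d}{d.month:02d}{d.day:02d}",  # D + YYYYMMDD
        f"M{d.year:04d}{d.month:02d}",             # M + YYYYMM
        f"Y{d.year:04d}",                          # Y + four-digit year
    ]
```

For 17 May 2023 this gives ``D20230517``, ``M202305`` and ``Y2023``.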

omega configuration
===================

Most of the omega CGI configuration is dynamic, by setting CGI
parameters. However some things must be configured using a
configuration file.  The configuration file is searched for in
various locations:

- Firstly, if the "OMEGA_CONFIG_FILE" environment variable is
  set, its value is used as the full path to a configuration file
  to read.
- Next (if the environment variable is not set, or the file pointed
  to is not present), the file "omega.conf" in the same directory as
  the Omega CGI is used.
- Next (if neither of the previous steps found a file), the file
  "${sysconfdir}/omega.conf" (e.g. /etc/omega.conf on Linux systems)
  is used.
- Finally, if no configuration file is found, default values are used.

The format of the file is very simple: a line per option, with the
option name followed by its value, separated by whitespace.  Blank
lines are ignored.  If the first non-whitespace character on a line
is a '#', omega treats the line as a comment and ignores it.

The current options are:

- `database_dir`: the directory containing all the Omega databases
- `template_dir`: the directory containing the OmegaScript templates
- `log_dir`: the directory which the OmegaScript `$log` command writes log
  files to
- `cdb_dir`: the directory which the OmegaScript `$lookup` command
  looks for CDB files in

The default values (used if no configuration file is found) are::

 database_dir /var/lib/omega/data
 template_dir /var/lib/omega/templates
 log_dir /var/log/omega
 cdb_dir /var/lib/omega/cdb
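
The search order and file format can be summarised in a short Python sketch.
It is an illustration only: the function names are hypothetical, ``/etc`` is
assumed for ``${sysconfdir}``, and the sketch assumes that options missing
from a configuration file keep their default values and that unrecognised
option names are simply stored.

```python
import posixpath

DEFAULTS = {
    "database_dir": "/var/lib/omega/data",
    "template_dir": "/var/lib/omega/templates",
    "log_dir": "/var/log/omega",
    "cdb_dir": "/var/lib/omega/cdb",
}


def candidate_paths(environ, cgi_dir, sysconfdir="/etc"):
    """List configuration file locations in the order they are tried."""
    paths = []
    if environ.get("OMEGA_CONFIG_FILE"):
        paths.append(environ["OMEGA_CONFIG_FILE"])
    paths.append(posixpath.join(cgi_dir, "omega.conf"))
    paths.append(posixpath.join(sysconfdir, "omega.conf"))
    return paths


def parse_omega_conf(text):
    """Parse the contents of an omega.conf file into a dictionary."""
    config = dict(DEFAULTS)
    for line in text.splitlines():
        stripped = line.strip()
        # Blank lines and lines starting with '#' are ignored.
        if not stripped or stripped.startswith("#"):
            continue
        # Option name, then whitespace, then the value.
        parts = stripped.split(None, 1)
        if len(parts) == 2:
            config[parts[0]] = parts[1]
    return config
```

The first path in the returned list which exists would be read and passed to
``parse_omega_conf``; if none exists, ``DEFAULTS`` applies unchanged.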

Note that, with Apache, environment variables may be set using mod_env, and
with Apache 1.3.7 or later this may be used inside a .htaccess file.  This
makes it reasonably easy to share a single system installed copy of Omega
between multiple users.

Supplied Templates
==================

The OmegaScript templates supplied with Omega are:

* query - This is the default template, providing a typical Web search
  interface.
* topterms - This is just like query, but provides a "top terms" feature
  which suggests terms the user might want to add to their query to
  obtain better results.
* godmode - Allows you to inspect a database showing which terms index
  each document, and which documents are indexed by each term.
* opensearch - Provides results in OpenSearch format (for more details
  see http://www.opensearch.org/).
* xml - Provides results in a custom XML format.
* emptydocs - Shows a list of documents with zero length.  If CGI parameter
  TERM is set to a non-empty value, then only documents indexed by that given
  term are shown (e.g. TERM=Tapplication/pdf to show PDF files with no text);
  otherwise all zero length documents are shown.

There are also "helper fragments" used by the templates above:

* inc/anyalldropbox - Provides a choice of matching "any" or "all" terms
  by default as a drop down box.
* inc/anyallradio - Provides a choice of matching "any" or "all" terms
  by default as radio buttons.
* toptermsjs - Provides some JavaScript used by the topterms template.

Document data construction
==========================

This is only useful if you need to inject your own documents into the
database independently of omindex, such as if you are indexing
dynamically-generated documents that are served using a server-side
system such as PHP or ASP, but which you can determine the contents of
in some way, such as documents generated from reasonably static
database contents.

The document data field stores some summary information about the
document, in the following (sample) format::

 url=<baseurl>
 sample=<sample>
 caption=<title>
 type=<mimetype>

Further fields may be added (although omindex doesn't currently add any
others), and may be looked up from OmegaScript using the ``$field{}``
command.

As of Omega 0.9.3, you can alternatively add something like this near the
start of your OmegaScript template::

 $set{fieldnames,$split{caption sample url}}

Then you need only give the field values in the document data, which can
save a lot of space in a large database.  With the setting of fieldnames
above, the first line of document data can be accessed with ``$field{caption}``,
the second with ``$field{sample}``, and the third with ``$field{url}``.
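
If you are generating document data yourself, the two layouts can be produced
and read back along the lines of this Python sketch.  The function names are
hypothetical, and the sketch assumes that field values contain no newlines.

```python
def make_document_data(fields):
    """Build document data in the name=value layout, one field per line."""
    return "\n".join(f"{name}={value}" for name, value in fields.items())


def parse_document_data(data, fieldnames=None):
    """Read document data back into a dictionary of fields.

    If fieldnames is given, each line holds just a value and the names are
    taken from that list in order (as with $set{fieldnames,...}); otherwise
    each line is expected to be of the form name=value.
    """
    lines = data.split("\n")
    if fieldnames is not None:
        return dict(zip(fieldnames, lines))
    fields = {}
    for line in lines:
        name, separator, value = line.partition("=")
        if separator:
            fields[name] = value
    return fields
```

For example, ``parse_document_data("Press\nBig news\n/press/index.html",
["caption", "sample", "url"])`` recovers the same three fields as the longer
``name=value`` form would, without storing the field names in every document.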

Stopword List
=============

At search time, Omega uses a built-in list of stopwords, which are::

    a about an and are as at be by en for from how i in is it of on or that the
    this to was what when where which who why will with you your