The basic format is one or more field names followed by a colon, followed by
one or more actions.  Some actions take an optional or required parameter.

Since Omega 1.4.6, the parameter value can be enclosed in double quotes,
which is necessary if it contains whitespace; it's also needed for
parameter values containing a comma for actions which support multiple
parameters (such as ``split``) since there unquoted commas are interpreted
as separating parameters.

Since Omega 1.4.8, the following C-like escape sequences are supported
for parameter values enclosed in double quotes: ``\\``, ``\"``, ``\0``, ``\t``,
``\n``, ``\r``, and ``\x`` followed by two hex digits.

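
As an illustration only, the escape sequences listed above could be decoded
along these lines.  This is a Python sketch, not the scriptindex
implementation: the function name ``decode_quoted`` is hypothetical, and
raising an error for an unknown or incomplete escape is an assumption.

```python
import string


def decode_quoted(value: str) -> str:
    """Decode the documented escapes in the text between the double quotes."""
    simple = {"\\": "\\", '"': '"', "0": "\0", "t": "\t", "n": "\n", "r": "\r"}
    out = []
    i = 0
    while i < len(value):
        ch = value[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(value):
            # Assumption: a trailing backslash is treated as an error.
            raise ValueError("dangling backslash")
        esc = value[i + 1]
        if esc in simple:
            out.append(simple[esc])
            i += 2
        elif esc == "x":
            digits = value[i + 2:i + 4]
            if len(digits) != 2 or any(c not in string.hexdigits for c in digits):
                raise ValueError("\\x must be followed by two hex digits")
            out.append(chr(int(digits, 16)))
            i += 4
        else:
            # Assumption: escapes not in the documented list are rejected.
            raise ValueError("unknown escape \\" + esc)
    return "".join(out)
```

For instance, ``decode_quoted(r'a\tb')`` gives ``a``, a tab, then ``b``.
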
The actions are applied in the specified order to each field listed, and
a field can be listed on multiple lines.

Comments are allowed on a line by themselves, introduced by a ``#``.

For example::

    desc1 : unhtml index truncate=200 field=sample
    desc2 desc3 desc4 : unhtml index
    name : field=caption weight=3 index
    ref : boolean=Q unique=Q

Don't put spaces around the ``=`` separating an action and its argument -
current versions allow spaces here (though this was never documented as
supported) but it leads to a missing argument quietly swallowing the next
action rather than using an empty value or giving an error, e.g. this takes
``hash`` as the field name, which is unlikely to be what was intended::

    url : field= hash boolean=Q unique=Q

Since 1.4.6 a deprecation warning is emitted for spaces before or after the

index the text as a single boolean term (with prefix PREFIX).  If
there's no text, no term is added.  Omega expects certain prefixes to
be used for certain purposes - those starting "X" are reserved for user
applications.  ``Q`` is conventionally used as the prefix for a unique

generate ``D``-, ``M``- and ``Y``-prefixed terms for date range
searching (e.g. ``D20021221``, ``M200212`` and ``Y2002`` for the
21st December 2002).  The following values for *FORMAT* are supported:

* ``unix``: the value is interpreted as a Unix local time_t (seconds
  since the start of 1970 in the local timezone).
* ``unixutc``: the value is interpreted as a Unix UTC time_t
  (seconds since the start of 1970 in UTC).  (Since Omega 1.4.12)
* ``yyyymmdd``: the value is interpreted as an 8 digit string, e.g.
  20021221 for 21st December 2002.

Unknown formats give an error at script parse time since Omega 1.4.12
(in earlier versions unknown formats uselessly resulted in the terms
``D``, ``M`` and ``Y`` literally being added to every document).

Invalid values result in no terms being added (and since Omega 1.4.12
a warning is emitted).

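
The mapping from a date value to the three terms can be sketched in Python
as follows.  The function name ``date_terms`` is hypothetical, the validity
check for ``yyyymmdd`` only looks at the shape of the string, and the code
is meant to illustrate the term layout rather than reproduce scriptindex.

```python
import time


def date_terms(value: str, fmt: str) -> list[str]:
    """Return the D/M/Y-prefixed terms for a date value, or [] if invalid."""
    if fmt not in ("unix", "unixutc", "yyyymmdd"):
        # Mirrors the documented behaviour of rejecting unknown formats.
        raise ValueError("unknown date format: " + fmt)
    if fmt == "yyyymmdd":
        # Shape check only: exactly 8 ASCII digits.
        if len(value) != 8 or not (value.isascii() and value.isdigit()):
            return []
        ymd = value
    else:
        try:
            secs = int(value)
            # "unix" uses the local timezone, "unixutc" uses UTC.
            tm = time.localtime(secs) if fmt == "unix" else time.gmtime(secs)
        except (ValueError, OverflowError, OSError):
            return []  # invalid values add no terms
        ymd = f"{tm.tm_year:04d}{tm.tm_mon:02d}{tm.tm_mday:02d}"
    return ["D" + ymd, "M" + ymd[:6], "Y" + ymd[:4]]
```

With this sketch, ``date_terms("20021221", "yyyymmdd")`` yields
``D20021221``, ``M200212`` and ``Y2002``.
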
add as a field to the Xapian record.  FIELDNAME defaults to the field
name in the dumpfile.  It is valid to have more than one instance of
a given field: all instances will be processed and stored in the

leave a gap of SIZE term positions.  SIZE defaults to 100.  This
provides a way to stop phrases, ``NEAR`` and ``ADJ`` from matching

Xapian has a limit on the length of a term.  To handle arbitrarily
long URLs as terms, omindex implements a scheme where the end of
a long URL is hashed (short URLs are left as-is).  You can use this
same scheme in scriptindex.  LENGTH defaults to 239, which if you
index with prefix "U" produces url terms compatible with omindex.
If specified, LENGTH must be at least 6 (because the hash takes 6

converts pairs of hex digits to binary byte values (providing a way
to specify arbitrary binary strings e.g. for use in a document value
slot).  The input must have an even length and be composed entirely
of hex digits (if it isn't, an error is reported).

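
The conversion described above is equivalent to the following Python sketch.
The function name ``hextobin`` is reused here purely for illustration, and
raising ``ValueError`` stands in for whatever error reporting scriptindex
actually does.

```python
import string


def hextobin(text: str) -> bytes:
    """Convert pairs of hex digits to the bytes they encode."""
    # Validate explicitly: bytes.fromhex() alone would tolerate whitespace.
    if len(text) % 2 != 0:
        raise ValueError("input must have an even length")
    if any(c not in string.hexdigits for c in text):
        raise ValueError("input must consist entirely of hex digits")
    return bytes.fromhex(text)
```

For example, ``hextobin("48656c6c6f")`` returns the five bytes of ``Hello``.
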
``hextobin`` was added in Omega 1.4.6.  Prior to Omega 1.4.20, the
"error" on a bad value was really handled like a warning - it didn't
cause Omega to exit with non-zero status, instead the value was

split text into words and index (with prefix PREFIX if specified).

split text into words and index (with prefix PREFIX if specified), but
don't include positional information in the database - this makes the
database smaller, but phrase searching won't work.

reads the contents of the file using the current text as the filename
and then sets the current text to the contents.  If the current text
is empty, a warning is issued (since Xapian 1.4.10).  If the file can't
be loaded (not found, wrong permissions, etc) then an error is issued and
scriptindex exits (prior to Omega 1.4.20 this "error" was really handled
as a warning - scriptindex continued with the current text set to empty,
and the final exit status wasn't affected).

If the next action is ``truncate``, then scriptindex is smart enough to
know it only needs to load the start of a large file.

lowercase the text (useful for generating boolean terms)

ltrim[=CHARACTERSTOTRIM]
    remove leading characters from the text which are in
    ``CHARACTERSTOTRIM`` (default: space, tab, formfeed, vertical tab,
    carriage return, linefeed).

    Currently only ASCII characters are supported in ``CHARACTERSTOTRIM``.

    See also ``rtrim``, ``squash`` and ``trim``.

    ``ltrim`` was added in Omega 1.4.19.

parse the text as a date string using ``strptime()`` (or C++11's
``std::get_time()`` on platforms without ``strptime()``) with the
format specified by ``FORMAT``, and set the text to the result as a
Unix ``time_t`` (seconds since the start of 1970 in UTC), which can
then be fed into ``date=unixutc`` or ``valuepacked``, for example::

    last_update : parsedate="%Y%m%d %T" field=lastmod valuepacked=0

Format strings containing ``%Z`` are rejected with an error, as it
seems that ``strptime()`` implementations don't properly support this
(glibc's just accepts any sequence of non-whitespace and ignores it).

Format strings containing ``%z`` are only supported on platforms
where ``struct tm`` has a ``tm_gmtoff`` member, which is needed to
correctly apply the timezone offset.  On other platforms ``%z`` is
also rejected with an error.

``parsedate`` was added in Omega 1.4.6.

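
A rough Python analogue of this action is sketched below.  It uses Python's
own ``datetime.strptime()``, which doesn't accept every conversion the C
function does (``%T`` is written out as ``%H:%M:%S`` here), so it only
illustrates the idea of turning a formatted date into a UTC ``time_t``.
The function name ``parsedate`` is reused for illustration, and treating a
value without an offset as UTC follows the description above.

```python
from datetime import datetime, timezone


def parsedate(text: str, fmt: str) -> int:
    """Parse text with fmt and return seconds since the start of 1970 (UTC)."""
    # Simplification: a plain substring test, so "%%Z" would also be rejected.
    if "%Z" in fmt:
        raise ValueError("%Z is not supported in format strings")
    dt = datetime.strptime(text, fmt)
    if dt.tzinfo is None:
        # No %z offset was parsed, so interpret the fields as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return int(dt.timestamp())
```

For example, ``parsedate("20021221 12:00:00", "%Y%m%d %H:%M:%S")`` returns
``1040472000``, a value suitable for ``date=unixutc`` or ``valuepacked``.
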
rtrim[=CHARACTERSTOTRIM]
    remove trailing characters from the text which are in
    ``CHARACTERSTOTRIM`` (default: space, tab, formfeed, vertical tab,
    carriage return, linefeed).

    Currently only ASCII characters are supported in ``CHARACTERSTOTRIM``.

    See also ``ltrim``, ``squash`` and ``trim``.

    ``rtrim`` was added in Omega 1.4.19.

Generate spelling correction data for any ``index`` or ``indexnopos``
actions in the remainder of this list of actions.

split=DELIMITER[,OPERATION]
    Split the text at each occurrence of ``DELIMITER``, discard any empty
    strings, perform ``OPERATION`` on the resulting list, and then for each
    entry perform all the actions which follow ``split`` in the current rule.

    ``OPERATION`` can be ``dedup`` (remove second and subsequent
    occurrences from the list of any value), ``prefixes`` (which instead of
    just giving the text between delimiters, gives the text up to each
    delimiter), ``sort`` (sort the list), or ``none`` (the default).

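
The effect of each ``OPERATION`` can be sketched in Python as below.  The
function name ``split_action`` is hypothetical.  Several details are
assumptions rather than documented behaviour: that ``prefixes`` also yields
the whole text when it doesn't end with the delimiter, that empty prefixes
are discarded, and that ``sort`` uses plain string ordering.

```python
def split_action(text: str, delimiter: str, operation: str = "none") -> list[str]:
    """Return the list of entries the following actions would be run on."""
    if not delimiter:
        raise ValueError("delimiter must not be empty")
    if operation == "prefixes":
        items = []
        pos = text.find(delimiter)
        while pos != -1:
            if pos > 0:  # assumption: an empty leading prefix is discarded
                items.append(text[:pos])
            pos = text.find(delimiter, pos + len(delimiter))
        # Assumption: the full text counts as the final prefix.
        if text and not text.endswith(delimiter):
            items.append(text)
        return items
    # Split at each delimiter and discard empty strings.
    items = [part for part in text.split(delimiter) if part]
    if operation == "dedup":
        # Keep the first occurrence of each value, preserving order.
        items = list(dict.fromkeys(items))
    elif operation == "sort":
        items.sort()
    elif operation != "none":
        raise ValueError("unknown operation: " + operation)
    return items
```

For example, ``split_action("b,a,,b", ",", "dedup")`` gives ``["b", "a"]``.
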
    If you want to specify ``,`` for delimiter, you need to quote it, e.g.

squash[=CHARACTERSTOTRIM]
    replace runs of one or more characters from ``CHARACTERSTOTRIM`` in the
    text with a single space.  Leading and trailing runs are removed entirely.

    ``CHARACTERSTOTRIM`` defaults to: space, tab, formfeed, vertical tab,
    carriage return, linefeed.

    Currently only ASCII characters are supported in ``CHARACTERSTOTRIM``.

    See also ``ltrim``, ``rtrim`` and ``trim``.

    ``squash`` was added in Omega 1.4.19.

trim[=CHARACTERSTOTRIM]
    remove leading and trailing characters from the text which are in
    ``CHARACTERSTOTRIM`` (default: space, tab, formfeed, vertical tab,
    carriage return, linefeed).

    Currently only ASCII characters are supported in ``CHARACTERSTOTRIM``.

    See also ``ltrim``, ``rtrim`` and ``squash``.

    ``trim`` was added in Omega 1.4.19.

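
To compare the four whitespace actions side by side, here is a small Python
sketch of their described behaviour.  The function names simply mirror the
action names and ``DEFAULT_CHARS`` is a hypothetical constant holding the
default character set listed above; this is not the scriptindex code.

```python
# space, tab, formfeed, vertical tab, carriage return, linefeed
DEFAULT_CHARS = " \t\f\v\r\n"


def ltrim(text: str, chars: str = DEFAULT_CHARS) -> str:
    """Remove leading characters which are in chars."""
    return text.lstrip(chars)


def rtrim(text: str, chars: str = DEFAULT_CHARS) -> str:
    """Remove trailing characters which are in chars."""
    return text.rstrip(chars)


def trim(text: str, chars: str = DEFAULT_CHARS) -> str:
    """Remove leading and trailing characters which are in chars."""
    return text.strip(chars)


def squash(text: str, chars: str = DEFAULT_CHARS) -> str:
    """Drop leading/trailing runs and replace inner runs with one space."""
    out = []
    in_run = False
    for ch in text.strip(chars):
        if ch in chars:
            in_run = True
            continue
        if in_run:
            out.append(" ")
            in_run = False
        out.append(ch)
    return "".join(out)
```

For example, ``squash("  a \t b\n")`` gives ``a b``, whereas ``trim`` on
the same input would leave the inner space, tab and space untouched.
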
truncate to at most LENGTH bytes, but avoid chopping off a word (useful
for sample and title fields)

unique[=PREFIX[,missing=MISSINGACTION]]
    use the value in this field for a unique ID.  If the value is empty,
    a warning is issued but nothing else is done.  Only one record with
    each value of the ID may be present in the index: adding a new record
    with an ID which is already present will cause the old record to be

    Deletion happens if the only input field present has the ``unique``
    action applied to it.  (Prior to 1.5.0, if there were multiple lists
    of actions applied to an input field this triggered replacement instead
    of deletion).  If you want to suppress this deletion feature, supplying
    a dummy input field which doesn't match the index script will achieve

    You should also index the field as a boolean field using the same
    prefix so that the old record can be found.  In Omega, ``Q`` is
    conventionally used as the prefix of a unique term.

    You can use ``unique`` at most once in each index script (this is only
    enforced since Omega 1.4.5, but older versions didn't handle multiple

    The optional ``missing`` parameter is supported since Omega 1.4.20.
    It controls what happens when a record is processed which doesn't
    trigger the ``unique`` action or triggers the ``unique`` action with
    an empty value.  It can be one of:

    * ``error``: Exit with an error upon encountering such a document
      (default in Omega >= 1.5.0)
    * ``new``: Create a new document (default in Omega < 1.4.20 when
      ``unique`` not triggered)
    * ``warn+new``: Issue a warning and create a new document (default in
      Omega >= 1.4.20 and in older versions when ``unique`` is triggered
    * ``skip``: Move on to the next record
    * ``warn+skip``: Issue a warning and move on to the next record

strip out XML tags, replacing with a space (``unxml`` is similar to
``unhtml``, but ``unhtml`` varies the whitespace type or omits it
entirely, based on HTML tag semantics).

``unxml`` was added in Omega 1.5.0.

add as a Xapian document value in slot VALUESLOT.  Values can be used
for collapsing equivalent documents, sorting the MSet, etc.  If you
want to perform numeric sorting, use the ``valuenumeric`` action instead.

valuenumeric=VALUESLOT
    Like ``value=VALUESLOT``, this adds as a Xapian document value in slot
    VALUESLOT, but it first encodes for numeric sorting using
    ``Xapian::sortable_serialise()``.  Values set with this action can be
    used for numeric sorting of the MSet.

valuepacked=VALUESLOT
    Like ``value=VALUESLOT``, this adds as a Xapian document value in slot
    VALUESLOT, but it first encodes as a 4 byte big-endian binary string.
    If the input is a Unix time_t value, the resulting slot can be used for
    date range filtering and to sort the MSet by date.  Can be used in
    combination with ``parsedate``, for example::

        last_update : parsedate="%Y%m%d %T" field=lastmod valuepacked=0

    ``valuepacked`` was added in Omega 1.4.6.

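
The 4 byte big-endian encoding can be illustrated with Python's ``struct``
module.  The function name ``pack_value`` is hypothetical, and rejecting
values that don't fit in an unsigned 32-bit integer is an assumption about
out-of-range input rather than documented behaviour.

```python
import struct


def pack_value(text: str) -> bytes:
    """Encode a decimal string as a 4 byte big-endian unsigned integer."""
    number = int(text)
    if not 0 <= number <= 0xFFFFFFFF:
        # Assumption: out-of-range values are treated as an error here.
        raise ValueError("value does not fit in 4 bytes")
    return struct.pack(">I", number)
```

Because the encoding is big-endian and fixed width, comparing the packed
strings byte by byte gives the same order as comparing the numbers, which
is what makes the slot usable for date sorting and range filtering.
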
set the weighting factor to FACTOR (a non-negative integer) for any
``index`` or ``indexnopos`` actions in the remainder of this list of
actions.  The default is 1.  Use this to add extra weight to titles,
keyword fields, etc, so that words in them are regarded as more
important by searches.

The data to be indexed is read in from one or more input files.  Each input
file consists of zero or more records, each separated by one or more blank

Omega 1.4.20 and later explicitly allow multiple blank lines between
records, and also blank lines before the first record and after the last
record - in earlier versions only a single blank line after each record was
explicitly handled, and extra blank lines were handled as empty records.
If you want to be compatible with older versions we recommend a single
blank line after each record (with the blank line after the final record

Each record contains one or more fields of the form "name=value".  If value
contains newlines, these must be escaped by inserting an equals sign ('=')
after each newline.  Here's an example record::

    value=This is a multi-line
    =value.  Note how each newline

See mbox2omega and mbox2omega.script for an example of how you can generate a
dump file from an external source and write an index script to be used with it.
Try "mbox2omega --help" for more information.
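
If you generate dump files from your own program, the newline escaping
described above is simple to apply.  The following Python sketch is one way
to do it; the function name ``format_record`` and the field names in the
usage note are hypothetical, and each record is followed by a single blank
line as recommended for compatibility with older versions.

```python
def format_record(fields: list[tuple[str, str]]) -> str:
    """Format (name, value) pairs as one dump file record."""
    lines = []
    for name, value in fields:
        # Escape newlines in the value by inserting '=' after each one.
        lines.append(name + "=" + value.replace("\n", "\n="))
    # One blank line terminates the record.
    return "\n".join(lines) + "\n\n"
```

For example, ``format_record([("id", "1"), ("value", "first\nsecond")])``
produces the lines ``id=1``, ``value=first`` and ``=second`` followed by a
blank line.
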