2 .. Copyright (C) 2006 Lemur Consulting Ltd
3 .. Copyright (C) 2007,2008,2009,2010,2011,2012,2016 Olly Betts
5 ============================
6 Xapian Administrator's Guide
7 ============================
9 .. contents:: Table of contents
14 This document is intended to provide general hints, tips and advice to
15 administrators of Xapian systems. It assumes that you have installed Xapian
16 on your system, and are familiar with the basics of creating and searching
19 The intended audience is system administrators who need to be able to perform
20 general management of a Xapian database, including tasks such as taking
21 backups and optimising performance. It may also be useful introductory
22 reading for Xapian application developers.
24 The document is up-to-date for Xapian version 1.4.21.
29 Xapian databases hold all the information needed to perform searches in a set
30 of tables. The default database backend for the 1.4 release series is called
31 `glass`. The default backend for the 1.2 release series was called `chert`,
32 and this is also supported by 1.4.
37 The following table always exists:
39 - The `postlist` table holds a list of all the documents indexed by each term
40 in the database (`postings`), and also chunked streams of the values in each
43 The following table exists by default, but you can choose not to have it:
45 - The `termlist` table holds a list of all the terms which index each
46 document, and also the value slots used in each document. Without this,
47 some features aren't supported - see `Xapian::DB_NO_TERMLIST` for details.
49 And the following optional tables exist only when there is data to store in
52 - The `docdata` table holds the document data associated with each document
- The `position` table stores the word positions at which each term occurs
  in each document.
56 - The `spelling` table holds data for suggesting spelling corrections.
57 - The `synonym` table holds a synonym dictionary.
59 Each of the tables is held in a separate file with extension `.glass` (e.g.
60 `postlist.glass`), allowing an administrator to see how much data is being used
61 for each of the above purposes.
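
As a minimal sketch of that kind of inspection, the following Python function
reports the size of each table file. It assumes a directory-based glass
database whose tables use the ``.glass`` extension described above; the name
``table_sizes`` is illustrative and not part of Xapian.

```python
import os


def table_sizes(db_dir, extension=".glass"):
    """Map each table name (e.g. 'postlist') to its file size in bytes."""
    sizes = {}
    for entry in os.scandir(db_dir):
        name, ext = os.path.splitext(entry.name)
        # Only table files count; files such as the lock file are skipped.
        if entry.is_file() and ext == extension:
            sizes[name] = entry.stat().st_size
    return sizes
```

Calling it on a database directory returns a dictionary keyed by table name,
which makes it easy to see which table accounts for most of the disk usage.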
63 The `.glass` file actually stores the data, and is structured as a tree of
64 blocks, which have a default size of 8KB (though this can be set, either
65 through the Xapian API, or with the ``xapian-compact`` tool described below).
67 Changing the blocksize may have performance implications, but it is hard to
68 know whether these will be positive or negative for a particular combination
69 of hardware and software without doing some profiling.
The `.baseA` and `.baseB` files you may remember if you've worked with older
Xapian database backends no longer exist in glass databases - the information
about unused blocks is stored in a freelist (itself stored in unused blocks in
the `.glass` file), and the other information is stored in the `iamglass`
file.
Glass also supports databases stored in a single file - currently these only
support read operations, and have to be created by compacting an existing
glass database. It won't save you disk space, but it means only one file needs
to be opened to open the database, which reduces initialisation overhead a
little, and a single file is more convenient if you need to copy it around.
You can even embed the database in another file so you can ship a single file
containing content and a Xapian database which provides a search of it.
88 The following tables always exist:
90 - The `postlist` holds a list of all the documents indexed by
91 each term in the database, and also chunked streams of the values in each
93 - The `record` holds the document data associated with each document
95 - The `termlist` holds a list of all the terms which index each
96 document, and also the value slots used in each document.
98 And the following optional tables exist only when there is data to store in
- The `position` holds a list of all the word positions at which each term
  occurs in each document.
103 - The `spelling` holds data for suggesting spelling corrections.
104 - The `synonym` holds a synonym dictionary.
106 Each of the tables is held in a separate file, allowing an administrator to
107 see how much data is being used for each of the above purposes. It is not
108 always necessary to fully populate these tables: for example, if phrase
109 searches are never going to be performed on the database, it is not necessary
110 to store any positionlist information.
112 If you look at a Xapian database, you will see that each of these tables
113 actually uses 2 or 3 files. For example, for a "chert" format database the
114 termlist table is stored in the files ``termlist.baseA``, ``termlist.baseB``
117 The ``.DB`` file actually stores the data, and is structured as a tree of
118 blocks, which have a default size of 8KB (though this can be set, either
119 through the Xapian API, or with some of the tools detailed later in this
122 The ``.baseA`` and ``.baseB`` files are used to keep track of where to start
123 looking for data in the ``.DB`` file (the root of the tree), and which blocks are
124 in use. Often only one of the ``.baseA`` and ``.baseB`` files will be present;
125 each of these files refers to a revision of the database, and there may be more
126 than one valid revision of the database stored in the ``.DB`` file at once.
128 Changing the blocksize may have performance implications, but it is hard to
129 tell whether these will be positive or negative for a particular combination
130 of hardware and software without doing some profiling.
135 Xapian ensures that all modifications to its database are performed
136 atomically. This means that:
- From the point of view of a separate process (or a separate database object
  in the same process) reading the database, all modifications made to a
  database are invisible until the modifications are committed.
141 - The database on disk is always in a consistent state.
142 - If the system is interrupted during a modification, the database should
143 always be left in a valid state. This applies even if the power is cut
144 unexpectedly, as long as the disk does not become corrupted due to hardware
147 Committing a modification requires several calls to the operating system to
148 make it flush any cached modifications to the database to disk. This is to
149 ensure that if the system fails at any point, the database is left in a
150 consistent state. Of course, this is a fairly slow process (since the system
151 has to wait for the disk to physically write the data), so grouping many
152 changes together will speed up the throughput considerably.
154 Many modifications can be explicitly grouped into a single transaction, so
155 that lots of changes are applied at once. Even if an application doesn't
156 explicitly protect modifications to the database using transactions, Xapian
157 will group modifications into transactions, applying the modifications in
160 Note that it is not currently possible to extend Xapian's transactions to
161 cover multiple databases, or to link them with transactions in external
162 systems, such as an RDBMS.
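
The explicit grouping described above can be sketched as follows. This is an
illustration only: it assumes ``db`` behaves like the ``WritableDatabase``
class of the Xapian Python bindings (``begin_transaction()``,
``add_document()``, ``commit_transaction()`` and ``cancel_transaction()``), and
the helper name ``add_in_one_transaction`` is hypothetical.

```python
def add_in_one_transaction(db, documents):
    """Add all documents as one transaction: all are committed, or none.

    `db` is assumed to behave like xapian.WritableDatabase.
    """
    db.begin_transaction()
    try:
        for doc in documents:
            db.add_document(doc)
    except Exception:
        # Discard the partial batch so readers never see half of it.
        db.cancel_transaction()
        raise
    # A single commit for the whole batch, so the cost of flushing to disk
    # is paid once rather than once per document.
    db.commit_transaction()
```

Batching this way trades a little latency before changes become visible for
considerably better indexing throughput.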
164 Finally, note that it is possible to compile Xapian such that it doesn't make
165 modifications in an atomic manner, in order to build very large databases more
166 quickly (search the Xapian mailing list archives for "DANGEROUS" mode for more
167 details). This isn't yet integrated into standard builds of Xapian, but may
168 be in future, if appropriate protections can be incorporated.
170 Single writer, multiple reader
171 ------------------------------
173 Xapian implements a "single writer, multiple reader" model. This means that,
174 at any given instant, there is only permitted to be a single object modifying
175 a database, but there may (simultaneously) be many objects reading the
Xapian enforces this restriction by having a writer lock the database.
Each Xapian database directory contains a lock file named
``flintlock`` (we've kept the same name as flint used, since the locking
technique is the same).
This lock file will always exist, but will be locked using ``fcntl()`` when the
database is open for writing. A major advantage of ``fcntl()`` locks is that
if a writer exits without being given a chance to clean up (for example, if the
application holding the writer is killed), any ``fcntl()`` locks held will be
automatically released by the operating system, so stale locks can't happen.
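
The automatic release can be observed with a small experiment. The sketch
below is not Xapian's own locking code: it uses Python's ``fcntl`` module on a
POSIX system and a throwaway file (``demo.lock`` is a hypothetical name, so a
real database is never touched). A child process takes the lock and is then
killed without any chance to clean up.

```python
import fcntl
import os
import subprocess
import sys
import tempfile

# Child program: take an exclusive fcntl() lock, report it, then sleep.
CHILD = """
import fcntl, sys, time
f = open(sys.argv[1], "r+")
fcntl.lockf(f, fcntl.LOCK_EX)
print("locked", flush=True)
time.sleep(3600)
"""


def try_lock(path):
    """Return True if an exclusive fcntl() lock on path can be taken now."""
    with open(path, "r+") as f:
        try:
            fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            return False
        return True  # leaving the with-block closes f and drops our lock


def stale_lock_demo():
    """Return (lock free while child alive?, lock free after child killed?)."""
    with tempfile.TemporaryDirectory() as tmp:
        lock_path = os.path.join(tmp, "demo.lock")  # hypothetical file name
        open(lock_path, "w").close()
        with subprocess.Popen([sys.executable, "-c", CHILD, lock_path],
                              stdout=subprocess.PIPE, text=True) as child:
            child.stdout.readline()  # wait until the child holds the lock
            free_while_alive = try_lock(lock_path)
            child.kill()             # no clean-up code runs in the child
            child.wait()
        free_after_kill = try_lock(lock_path)
        return free_while_alive, free_after_kill
```

On a POSIX system the lock should be unavailable while the child is alive and
available again as soon as the child has been killed.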
Unfortunately, ``fcntl()`` locking has some unhelpful semantics (if a process
closes *ANY* open file descriptor on the file, that releases the lock), so on
most POSIX platforms we spawn a child process to hold the lock for each
database opened for writing, which then exec-s ``cat``. As a result you will
see a ``cat`` subprocess of any writer process in the output of ``ps``,
``top``, etc.
"Open File Description" locks are like traditional ``fcntl()`` locks but with
this problem addressed, and Xapian will use these if available, avoiding the
extra child processes. At the time of writing it seems only Linux (since kernel
3.15) supports these, but hopefully they'll be added to POSIX in the future.
200 Under Microsoft Windows, we use a different locking technique which doesn't
201 require a child process, but still means the lock is released automatically
202 when the writing process exits.
207 Xapian databases contain a revision number. This is essentially a count of
208 the number of modifications since the database was created, and is needed to
209 implement the atomic modification functionality. It is stored as a 32 bit
210 integer, so there is a chance that a very frequently updated database could
211 cause this to overflow. The consequence of such an overflow would be to throw
212 an exception reporting that the database has run out of revision numbers.
This isn't likely to be a practical problem, since it would take more than a
year for a database to reach this limit if 100 modifications were committed
every second, and no normal Xapian system will commit more than once every few
seconds. However, if you are concerned, you can use the ``xapian-compact``
tool to make a fresh copy of the database with the revision number set to 1.
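
For a back-of-the-envelope check of your own commit rate, the arithmetic can
be written out as below. This sketch assumes the full unsigned 32-bit range is
usable and that every commit uses exactly one revision number; the function
name is illustrative.

```python
SECONDS_PER_DAY = 86400


def days_until_revision_overflow(commits_per_second, bits=32):
    """Days of continuous committing before a `bits`-wide counter runs out.

    Assumes an unsigned counter and one revision number per commit.
    """
    seconds = (2 ** bits) / commits_per_second
    return seconds / SECONDS_PER_DAY


# 100 commits per second gives roughly 497 days.
# One commit every 5 seconds (0.2 per second) gives roughly 680 years.
```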
220 The revision number of each table can be displayed by the ``xapian-check``
226 Xapian should work correctly over a network file system. However, there are
227 various potential issues with such file systems, so we recommend
228 extensive testing of your particular network file system before deployment.
230 Be warned that Xapian is heavily I/O dependent, and therefore performance over
231 a network file system is likely to be slow unless you've got a very well tuned
Xapian needs to be able to lock a file in a database directory when
modifications are being performed. On some network file systems (e.g., NFS)
this requires a lock daemon to be running.
238 Which database format to use?
239 -----------------------------
241 As of release 1.4.0, you should generally use the glass format (which is now
244 Support for the pre-1.0 quartz format (deprecated in 1.0) was removed in 1.1.0.
245 See below for how to convert a quartz database to a flint one.
247 The flint backend (the default for 1.0, and still supported by 1.2.x) was
248 removed in 1.3.0. See below for how to convert a flint database to a chert one.
250 The chert backend (the default for 1.2) is still supported by 1.4.x, but
251 deprecated - only use it if you already have databases in this format; and plan
254 .. There's also a development backend called XXXXX. The main distinguishing
255 .. feature of this is that the format may change incompatibly from time to time.
256 .. It passes Xapian's extensive testsuite, but has seen less real world use
259 Can I put other files in the database directory?
260 ------------------------------------------------
262 If you wish to store meta-data or other information relating to the Xapian
263 database, it is reasonable to wish to put this in files inside the Xapian
264 database directory, for neatness. For example, you might wish to store a list
265 of the prefixes you've applied to terms for specific fields in the database.
267 Current Xapian backends don't do anything
268 which will break this technique, so as long as you don't choose a filename
269 that Xapian uses itself, there should be no problems. However, be aware that
270 new versions of Xapian may use new files in the database directory, and it is
271 also possible that new backend formats may not be compatible with the
272 technique. And of course you can't do this with a single-file glass database.
281 - The simplest way to perform a backup is to temporarily halt modifications,
282 take a copy of all files in the database directory, and then allow
283 modifications to resume. Read access can continue while a backup is being
286 - If you have a filesystem which allows atomic snapshots to be taken of
287 directories (such as an LVM filesystem), an alternative strategy is to take
288 a snapshot and simply copy all the files in the database directory to the
289 backup medium. Such a copy will always be a valid database.
291 - Progressive backups are not easily possible; modifications are typically
292 spread throughout the database files.
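
The simplest strategy in the list above might be scripted along these lines.
This is a sketch under stated assumptions: pausing and resuming modifications
is left to the caller, since it depends entirely on your indexing application,
and ``backup_database`` is a hypothetical helper, not a Xapian tool.

```python
import shutil
from pathlib import Path


def backup_database(db_dir, backup_dir):
    """Copy every file in db_dir to a new directory backup_dir.

    The caller must ensure no modifications are committed while this runs
    (for example by pausing the indexer); readers may stay open.
    Returns the sorted names of the files that were copied.
    """
    src = Path(db_dir)
    if not src.is_dir():
        raise NotADirectoryError(f"{src} is not a database directory")
    # copytree refuses to write into an existing destination, which guards
    # against overwriting an earlier backup by accident.
    shutil.copytree(src, backup_dir)
    return sorted(p.name for p in Path(backup_dir).iterdir())
```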
297 Even though Xapian databases are often automatically generated from source
298 data which is stored in a reliable manner, it is usually desirable to keep
299 backups of Xapian databases being run in production environments. This is
300 particularly important in systems with high-availability requirements, since
301 re-building a Xapian database from scratch can take many hours. It is also
302 important in the case where the data stored in the database cannot easily be
303 recovered from external sources.
305 Xapian databases are managed such that at any instant in time, there is at
306 least one valid revision of the database written to disk (and if there are
307 multiple valid revisions, Xapian will always open the most recent).
308 Therefore, if it is possible to take an instantaneous snapshot of all the
309 database files (for example, on an LVM filesystem), this snapshot is suitable
310 for copying to a backup medium. Note that it is not sufficient to take a
311 snapshot of each database file in turn - the snapshot must be across all
312 database files simultaneously. Otherwise, there is a risk that the snapshot
313 could contain database files from different revisions.
315 If it is not possible to take an instantaneous snapshot, the best backup
316 strategy is simply to ensure that no modifications are committed during the
317 backup procedure. While the simplest way to implement this may be to stop
318 whatever processes are used to modify the database, and ensure that they close
319 the database, it is not actually necessary to ensure that no writers are open
320 on the database; it is enough to ensure that no writer makes any modification
323 Because a Xapian database can contain more than one valid revision of the
324 database, it is actually possible to allow a limited number of modifications
325 to be performed while a backup copy is being made, but this is tricky and we
326 do not recommend relying on it. Future versions of Xapian are likely to
327 support this better, by allowing the current revision of a database to be
328 preserved while modifications continue.
Progressive backups are not recommended for Xapian databases: Xapian database
files are block-structured, and modifications are spread throughout the
database file. Therefore, a progressive backup tool will not be able to take
a backup by storing only the new parts of the database. Modifications will
normally be so extensive that most parts of the database have been modified.
However, if only a small number of modifications have been made, a binary diff
algorithm might make a usable progressive backup tool.
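
One rough way to judge whether a diff-based backup is worth trying is to
measure how many blocks actually changed between two copies of the same table
file. The sketch below assumes the default 8KB block size and compares blocks
at identical offsets only; the function name is illustrative.

```python
def changed_block_fraction(old_path, new_path, block_size=8192):
    """Fraction of blocks in new_path that differ from old_path.

    Blocks are compared at the same file offset; blocks that exist only in
    the new file count as changed.
    """
    changed = total = 0
    with open(old_path, "rb") as old, open(new_path, "rb") as new:
        while True:
            new_block = new.read(block_size)
            if not new_block:
                break
            total += 1
            if old.read(block_size) != new_block:
                changed += 1
    return changed / total if total else 0.0
```

A result close to 1.0 suggests that a progressive backup would save little
over a full copy.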
339 Inspecting a database
340 =====================
342 When designing an indexing strategy, it is often useful to be able to check
343 the contents of the database. Xapian includes a simple command-line program,
344 ``xapian-delve``, to allow this (prior to 1.3.0, ``xapian-delve`` was usually
345 called ``delve``, though some packages were already renaming it).
347 For example, to display the list of terms in document "1" of the database
352 xapian-delve foo -r 1
It is also possible to perform simple searches of a database. Xapian includes
another simple command-line program, ``quest``, to support this. ``quest`` is
only able to search for un-prefixed terms, and the query string must be quoted
to protect it from the shell. To search the database "foo" for the phrase "hello
362 quest -d foo '"hello world"'
364 If you have installed the "Omega" CGI application built on Xapian, this can
365 also be used with the built-in "godmode" template to provide a web-based
366 interface for browsing a database. See Omega's documentation for more details
372 Compacting a database
373 ---------------------
375 Xapian databases normally have some spare space in each block to allow
376 new information to be efficiently slotted into the database. However, the
377 smaller a database is, the faster it can be searched, so if there aren't
378 expected to be many further modifications, it can be desirable to compact the
381 Xapian includes a tool called ``xapian-compact`` for compacting databases.
382 This tool makes a copy of a database, and takes advantage of
383 the sorted nature of the source Xapian database to write the database out
384 without leaving spare space for future modifications. This can result in a
387 The downside of compaction is that future modifications may take a little
388 longer, due to needing to reorganise the database to make space for them.
389 However, modifications are still possible, and if many modifications are made,
390 the database will gradually develop spare space.
392 There's an option (``-F``) to perform a "fuller" compaction. This option
393 compacts the database as much as possible, but it violates the design of the
394 Btree format slightly to achieve this, so it is not recommended if further
395 modifications are at all likely in future. If you do need to modify a "fuller"
396 compacted database, we recommend you run ``xapian-compact`` on it without ``-F``
399 You can specify the blocksize to use for the compacted database (which should
400 be a power of 2 between 2KB and 64KB, with the default being 8KB).
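
The constraint on the block size is easy to check before starting a long
compaction run. A minimal sketch, treating the 2KB and 64KB bounds as
inclusive (the function name is illustrative):

```python
def is_valid_blocksize(size_bytes):
    """True if size_bytes is a power of 2 between 2KB and 64KB inclusive."""
    # A positive power of two has exactly one bit set, so clearing the
    # lowest set bit leaves zero.
    is_power_of_two = size_bytes > 0 and (size_bytes & (size_bytes - 1)) == 0
    return is_power_of_two and 2048 <= size_bytes <= 65536
```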
402 Making the blocksize a multiple of (or the same as) both the sector size of the
403 device and the blocksize of the filing system which the database is on is
404 a good idea, but sector size seems to always be 4K or less
405 (at least according to https://en.wikipedia.org/wiki/Disk_sector) and FS block
406 size still seems to be 4K by default (the widely used Linux ext4 FS potentially
407 supports up to 64K but only up to the system page size which is 4K on e.g. x86
408 and x86-64). So in practice a Xapian blocksize of 4KB or more will satisfy
411 The main benefits a larger blocksize gives are slightly more efficient packing
412 and reduced total per-block overheads (and the additional gains here are
413 likely to be smaller for each extra block size doubling), while the downside is
414 needing to read/write more data to read/write a single block. The extra data is
415 at least contiguous (at least in file offset terms - maybe not always on disk
416 if the file is fragmented) but there are potentially significant negative
417 factors like added pressure on the drive cache and OS file cache. The
418 additional losses are likely to grow for each extra block size doubling.
For most people, just using the default block size is sensible. It's
something you might tune when you either care more about reducing size than
anything else, or are prepared to profile your complete system with
different block sizes to see what works best for your own situation.
425 If profiling different blocksizes including the 8KB default, remember to use a
426 compacted version for the 8KB block size database or else you won't get a fair
433 When building an index for a very large amount of data, it can be desirable to
434 index the data in smaller chunks (perhaps on separate machines), and then
435 merge the chunks together into a single database. This can be performed using
436 the ``xapian-compact`` tool, simply by supplying several source database paths.
438 Normally, merging works by reading the source databases in parallel, and
439 writing the contents in sorted order to the destination database. This will
440 work most efficiently if excessive disk seeking can be avoided; if you have
441 several disks, it may be worth placing the source databases and the
442 destination database on separate disks to obtain maximum speed.
444 The ``xapian-compact`` tool supports an additional option, ``--multipass``,
445 which is useful when merging more than three databases. This will cause the
446 postlist tables to be grouped and merged into temporary tables, which are then
447 grouped and merged, and so on until a single postlist table is created, which
448 is usually faster, but requires more disk space for the temporary files.
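
When merges are scripted, the command line can be assembled programmatically.
The sketch below only relies on what is described above (several source paths
followed by the destination, plus the optional ``--multipass`` flag); the
helper names and any database paths passed in are hypothetical.

```python
import subprocess


def compact_command(sources, destination, multipass=False):
    """Build the argv list for merging `sources` into `destination`."""
    if not sources:
        raise ValueError("at least one source database is required")
    argv = ["xapian-compact"]
    if multipass:
        argv.append("--multipass")
    # Source databases come first; the destination is the final argument.
    return argv + list(sources) + [destination]


def merge(sources, destination, multipass=False):
    """Run xapian-compact; raises CalledProcessError on a non-zero exit."""
    subprocess.run(compact_command(sources, destination, multipass),
                   check=True)
```

Passing the arguments as a list rather than a single shell string avoids
quoting problems with database paths that contain spaces.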
451 Checking database integrity
452 ---------------------------
Xapian includes a command-line tool to check that a database is
self-consistent. This tool, ``xapian-check``, runs through the entire database,
checking that all the internal nodes are correctly connected. It can also be
used on a single table; for example, this command will check the termlist table
462 xapian-check foo/termlist.DB
465 Fixing corrupted databases
466 --------------------------
The ``xapian-check`` tool is capable of fixing corrupted databases in certain
limited situations. Currently it only supports this for chert, where it is
472 * Regenerating a damaged ``iamchert`` file (if you've lost yours completely
473 just create an invalid one, e.g. with ``touch iamchert``).
475 * Regenerating damaged or lost base files from the corresponding DB files.
476 This was developed for the scenario where the database is freshly compacted
477 but should work provided the last update was cleanly applied. If the last
478 update wasn't actually committed, then it is possible that it will try to
479 pick the root block for the partial update, which isn't what you want.
480 If you are in this situation, come and talk to us - with a testcase we
481 should be able to make it handle this better.
To fix such issues, run ``xapian-check`` like so:
487 xapian-check /path/to/database F
490 Converting a chert database to a glass database
491 -----------------------------------------------
493 This can be done using the ``copydatabase`` example program included with Xapian.
494 This is a lot slower to run than ``xapian-compact``, since it has to perform the
495 sorting of the term occurrence data from scratch, but should be faster than a
496 re-index from source data since it doesn't need to perform the tokenisation
497 step. It is also useful if you no longer have the source data available.
The following command will copy a database from "SOURCE" to "DESTINATION",
creating the new database at "DESTINATION" as a glass database:
504 copydatabase SOURCE DESTINATION
By default ``copydatabase`` will renumber your documents starting with docid 1.
If the docids are stored in or come from some external system, you should
preserve them by using the ``--no-renumber`` option:
512 copydatabase --no-renumber SOURCE DESTINATION
515 Converting a pre-1.1.4 chert database to a chert database
516 ---------------------------------------------------------
518 The chert format changed in 1.1.4 - at that point the format hadn't been
519 finalised, but a number of users had already deployed it, and it wasn't hard
520 to write an updater, so we provided one called ``xapian-chert-update`` which
521 makes a copy with the updated format:
525 xapian-chert-update SOURCE DESTINATION
It works much like ``xapian-compact`` so should take a similar amount of time
(and results in a compact database). The initial version had a few bugs, so
use ``xapian-chert-update`` from Xapian 1.2.5 or later.
531 The ``xapian-chert-update`` utility was removed in Xapian 1.3.0, so you'll need
532 to install Xapian 1.2.x to use it.
535 Converting a flint database to a chert database
536 -----------------------------------------------
It is possible to convert a flint database to a chert database by installing
Xapian 1.2.x (since this has support for both flint and chert)
and using the ``copydatabase`` example program included with Xapian. This is a
lot slower to run than ``xapian-compact``, since it has to perform the
sorting of the term occurrence data from scratch, but should be faster than a
re-index from source data since it doesn't need to perform the tokenisation
step. It is also useful if you no longer have the source data available.
546 The following command will copy a database from "SOURCE" to "DESTINATION",
547 creating the new database at "DESTINATION" as a chert database:
551 copydatabase SOURCE DESTINATION
553 By default ``copydatabase`` will renumber your documents starting with docid 1.
554 If the docids are stored in or come from some external system, you should
555 preserve them by using the ``--no-renumber`` option (new in Xapian 1.2.5):
559 copydatabase --no-renumber SOURCE DESTINATION
561 Converting a quartz database to a flint database
562 ------------------------------------------------
564 It is possible to convert a quartz database to a flint database by installing
565 Xapian 1.0.x (since this has support for both quartz and flint)
566 and using the ``copydatabase`` example program included with Xapian. This is a
567 lot slower to run than ``xapian-compact``, since it has to perform the
568 sorting of the term occurrence data from scratch, but should be faster than a
569 re-index from source data since it doesn't need to perform the tokenisation
570 step. It is also useful if you no longer have the source data available.
572 The following command will copy a database from "SOURCE" to "DESTINATION",
573 creating the new database at "DESTINATION" as a flint database:
577 copydatabase SOURCE DESTINATION
580 Converting a 0.9.x flint database to work with 1.0.y
581 ----------------------------------------------------
583 In 0.9.x, flint was the development backend.
585 Due to a bug in the flint position list encoding in 0.9.x which made flint
586 databases non-portable between platforms, we had to make an incompatible
587 change in the flint format. It's not easy to write an upgrader, but you
588 can convert a database using the following procedure (although it might
589 be better to rebuild from scratch if you want to use the new UTF-8 support
590 in :xapian-class:`QueryParser`, :xapian-class:`Stem`, and
591 :xapian-class:`TermGenerator`).
593 Run the following command in your Xapian 0.9.x installation to copy your
594 0.9.x flint database "SOURCE" to a new quartz database "INTERMEDIATE":
598 copydatabase SOURCE INTERMEDIATE
600 Then run the following command in your Xapian 1.0.y installation to copy
601 your quartz database to a 1.0.y flint database "DESTINATION":
605 copydatabase INTERMEDIATE DESTINATION