This document aims to provide some theoretical background to Xapian.

In Information Retrieval (IR), the items we are trying to retrieve are
called *documents*, and each document is described by a collection of
*terms*. These two words, `document` and `term`, are now traditional in
the vocabulary of IR, and reflect its Library Science origins. Usually a
document is thought of as a piece of text, most likely in a machine
readable form, and a term as a word or phrase which helps to describe
the document, and which may indeed occur one or more times in the
document. So a document might be about dental care, and could be
described by corresponding terms `tooth`, `teeth`, `toothbrush`,
`decay`, `cavity`, `plaque`, `diet` and so on.

More generally a document can be anything we want to retrieve, and a
term any feature that helps describe the documents. So the documents
could be a collection of fossils held in a big museum collection, and
the terms could be morphological characteristics of the fossils. Or the
documents could be tunes, and the terms could then be phrases of notes
that occur in the tunes.

If, in an IR system, a document, D, is described by a term, t, t is said
to *index* D, and we can write,

.. math::

   t \to D

In fact an IR system consists of a set of documents, :math:`D_1`, :math:`D_2`, :math:`D_3` ...,
a set of terms :math:`t_1`, :math:`t_2`, :math:`t_3` ..., and a set of relationships,

.. math::

   t_i \to D_j

i.e. instances of terms indexing documents. A single instance of a
particular term indexing a particular document is called a *posting*.
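
As a concrete (if simplified) illustration of these relationships, here is a
minimal C++ sketch which creates one document and some terms indexing it
using the Xapian API; the database path, the document data and the term
strings are just placeholders::

    #include <xapian.h>

    int main() {
        // "example.db" is a placeholder path.
        Xapian::WritableDatabase db("example.db", Xapian::DB_CREATE_OR_OPEN);

        Xapian::Document doc;
        doc.set_data("A leaflet about dental care");

        // Each add_term() call records a posting: an instance of this
        // term indexing this document.
        doc.add_term("tooth");
        doc.add_term("decay");
        doc.add_term("plaque");

        db.add_document(doc);
    }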

For a document, D, there is a list of terms which index it. This is
called the *term list* of D.

For a term, t, there is a list of documents which it indexes. This is
called the *posting list* of t. (`Document list` would be more
consistent, but sounds a little too vague for this very important
concept.)

At a simple level a computerised IR system puts the terms in an *index*
file. A term can be efficiently looked up and its posting list found. In
the posting list, each document is represented by a short identifier. To
over-simplify a little, a posting list can be thought of as a list of
numbers (document ids), and a term list as a list of strings (the
terms). Some systems represent each term by a number internally, so the
term list is then also a list of numbers. Xapian doesn't - it uses the
terms themselves, and uses prefix compression to store them compactly.
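
To make the two lists concrete, here is a short C++ sketch using the Xapian
API which walks the posting list of a term and the term list of a document;
the database path, the term ``tooth`` and the document id 1 are illustrative
only::

    #include <iostream>
    #include <xapian.h>

    int main() {
        Xapian::Database db("example.db");  // placeholder path

        // Posting list: the documents indexed by the term "tooth".
        for (Xapian::PostingIterator p = db.postlist_begin("tooth");
             p != db.postlist_end("tooth"); ++p) {
            std::cout << "tooth indexes document " << *p << '\n';
        }

        // Term list: the terms which index document 1.
        for (Xapian::TermIterator t = db.termlist_begin(1);
             t != db.termlist_end(1); ++t) {
            std::cout << "document 1 is indexed by " << *t << '\n';
        }
    }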

The terms needn't be (and often aren't) just the words from the
document. Usually they are converted to lower case, and often a stemming
algorithm is applied, so a single term `connect` might derive from a
number of words, `connect`, `connects`, `connection`, `connected`
and so on. A single word might also give rise to more than one term, for
example you might index both stemmed and unstemmed forms of some or all
terms. Or a stemming algorithm could conceivably produce more than one
stem in some cases (this isn't the case for any of the stemming
algorithms Xapian currently supports, but consider the `Double
Metaphone <https://en.wikipedia.org/wiki/Double_Metaphone>`_ phonetic
algorithm which can produce two codes from a single input).
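
For instance, Xapian's Snowball stemmers can be used directly; the following
sketch shows the English stemmer conflating the word forms mentioned above
into the single term ``connect``::

    #include <iostream>
    #include <xapian.h>

    int main() {
        Xapian::Stem stemmer("english");
        std::cout << stemmer("connect") << '\n';     // connect
        std::cout << stemmer("connects") << '\n';    // connect
        std::cout << stemmer("connection") << '\n';  // connect
        std::cout << stemmer("connected") << '\n';   // connect
    }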

Xapian's context within IR
--------------------------

In the beginning IR was dominated by Boolean retrieval, described in the
next section. This could be called the antediluvian period, or
generation zero. The first generation of IR research dates from the
early sixties, and was dominated by model building, experimentation, and
heuristics. The big names were `Gerard
Salton <https://en.wikipedia.org/wiki/Gerard_Salton>`_ and `Karen Sparck
Jones <https://en.wikipedia.org/wiki/Karen_Sparck_Jones>`_. The second
period, which began in the mid-seventies, saw a big shift towards
mathematics, and a rise of the IR model based upon probability theory -
probabilistic IR. The big name here was, and continues to be, `Stephen
Robertson <http://www.soi.city.ac.uk/~ser/homepage.html>`_. More
recently, `Keith van
Rijsbergen <https://en.wikipedia.org/wiki/C._J._van_Rijsbergen>`_ has led
a group that has developed underlying logical models of IR, but
interesting as this new work is, it has not as yet led to results that
offer improvements for the IR system builder.

Xapian was built as a system for efficiently implementing the
probabilistic IR model (though this doesn't mean it is limited to only
implementing this model - other models can be implemented providing they
can be expressed in a suitable way). Xapian tries to implement the
probabilistic model faithfully, though in some places it can be told to
use short-cuts for efficiency.

The model has two striking advantages:

#. It leads to systems that give good retrieval performance. As the
   model has developed over the last 25 years, this has proved so
   consistently true that one is led to suspect that the probability
   theory model is, in some sense, the "correct" model for IR. The IR
   process would appear to function as the model suggests.
#. As new problems come up in IR, the probabilistic model can usually
   suggest a solution. This makes it a very practical mental tool for
   cutting through the jungle of possibilities when designing IR
   systems.

In simple cases the model reduces to simple formulae in general use, so
don't be alarmed by the apparent complexity of the equations below. We
need them for a full understanding of the general case.

Boolean retrieval
-----------------

A Boolean construct of terms retrieves a corresponding set of documents.
So, for example, if

| :math:`t_1` indexes documents 1 2 3 5 8
| :math:`t_2` indexes documents 2 3 6

then

| :math:`t_1` ``AND`` :math:`t_2` retrieves 2 3
| :math:`t_1` ``OR`` :math:`t_2` retrieves 1 2 3 5 6 8
| :math:`t_1` ``AND_NOT`` :math:`t_2` retrieves 1 5 8
| :math:`t_2` ``AND_NOT`` :math:`t_1` retrieves 6

The posting list of a term is a set of documents. IR becomes a matter of
constructing other sets by doing unions, intersections and differences
of these sets.

For example, in an IR system of works of literature, a Boolean query
such as::

    (lang:en OR lang:fr OR lang:de) AND (type:novel OR type:play) AND century:19

might be used to retrieve all English, French or German novels or plays
from the nineteenth century.

Boolean retrieval is often useful, but is rather inadequate on its own
as a general IR tool. Results aren't ordered by any measure of how
"good" they might be, and users require training to make effective use
of such a system. Despite this, purely Boolean IR systems continue to be
used.

By default, Xapian uses probabilistic ranking to order retrieved
documents while allowing Boolean expressions of arbitrary complexity
(some Boolean IR systems are restricted to queries in normal form) to
limit the documents retrieved, which provides the benefits of both
approaches. Pure Boolean retrieval is also supported (select the
`BoolWeight <apidoc/html/classXapian_1_1BoolWeight.html>`_ weighting
scheme using ``enquire.set_weighting_scheme(Xapian::BoolWeight());``).
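
For example, the ``AND_NOT`` query from the example above could be run with
pure Boolean weighting like this; the database path, and the terms ``tooth``
and ``decay`` standing in for :math:`t_1` and :math:`t_2`, are illustrative::

    #include <xapian.h>

    int main() {
        Xapian::Database db("example.db");  // placeholder path
        Xapian::Enquire enquire(db);

        // t1 AND_NOT t2
        Xapian::Query query(Xapian::Query::OP_AND_NOT,
                            Xapian::Query("tooth"),
                            Xapian::Query("decay"));
        enquire.set_query(query);

        // Pure Boolean retrieval: every match gets the same weight.
        enquire.set_weighting_scheme(Xapian::BoolWeight());

        Xapian::MSet matches = enquire.get_mset(0, 10);
    }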

Relevance and the idea of a query
---------------------------------

*Relevance* is a central concept to the probabilistic model. Whole
academic papers have been devoted to discussing the nature of relevance,
but essentially a document is relevant if it was what the user really
wanted! Retrieval is rarely perfect, so among documents retrieved there
will be non-relevant ones; among those not retrieved, relevant ones.

Relevance is modelled as a black or white attribute. There are no
degrees of relevance: a document either is, or is not, relevant. In the
probabilistic model there is however a probability of relevance, and
documents with a low probability of relevance in the model generally
correspond to documents that, in practice, one would describe as being
of marginal relevance.

What the user actually wants has to be expressed in some form, and the
expression of the user's need is the query. In the probabilistic model
the query is, usually, a list of terms, but that is the end point of a
chain of events. The user has a need; this is expressed in ordinary
language; this is then turned into a written form that the user judges
will yield good results in an IR system, and the IR system then turns
this form into a set, *Q*, of terms for processing the query. Relevance
must be judged against the user's original need, not against a later
interpretation of what *Q*, the set of terms, ought to mean.

Below, a query is taken to be just a set of terms, but it is important
to realise that this is a simplification. Each link in the chain that
takes us from the *information need* ("what the user is looking for") to
the abstraction in *Q* is liable to error, and these errors compound to
affect IR performance. In fact the performance of IR systems as a whole
is much worse than most people generally imagine.

Evaluating IR performance
-------------------------

It is possible to set up a test to evaluate an IR system. Suppose *Q* is
a query and, out of the complete collection of documents in the IR
system, a set *R* of R documents is relevant to the query. So if a
document is in *R* it is relevant, and if not in *R* it is non-relevant.
Suppose the IR system is able to give us back K documents, among which r
are relevant. *Precision* and *recall* are defined as

.. math::

   \text{precision} = \frac{r}{K}\,, \qquad \text{recall} = \frac{r}{R}

Precision is the density of relevant documents among those retrieved.
Recall is the proportion of relevant documents retrieved. In most IR
systems K is a parameter that can be varied, and what you find is that
when K is low you get high precision at the expense of low recall, and
when K is high you get high recall at the expense of low precision.
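
For example, suppose the collection holds R = 20 relevant documents for a
query, and the system returns K = 10 documents of which r = 7 are relevant
(the numbers are purely illustrative). Then

.. math::

   \text{precision} = \frac{7}{10} = 0.7\,, \qquad \text{recall} = \frac{7}{20} = 0.35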

The ideal value of K will depend on the use of the system. For example,
if a user wants the answer to a simple question and the system contains
many documents which would answer it, a low value of K will be best to
give a small number of relevant results. But in a system indexing legal
cases, users will often wish to make sure no potentially relevant case
is missed even if that requires them to check more non-relevant cases,
so a high value of K will be best.

Retrieval effectiveness is often shown as a graph of precision against
recall, averaged over a number of queries and plotted for different
values of K. Such curves typically have a shape similar to a hyperbola:
precision falls as recall rises.

A collection like this, consisting of a set of documents, a set of
queries, and for each query, a complete set of relevance assessments, is
called a *test collection*. With a test collection you can test out
different IR ideas, and see how well one performs against another. The
controversial part of establishing any test collection is the procedure
employed for determining the sets :math:`R_i` of relevance assessments.
Subjectivity of judgement comes in here, and people will differ about
whether a particular document is relevant to a particular query. Even
so, the averaging across queries reduces the errors that may
occasionally arise through faulty relevance judgements, and averaging
important tests across a number of test collections reduces the effects
caused by accidental features of individual collections, and the results
obtained by these tests in modern research are generally accepted as
trustworthy. Nowadays such research with test collections is organised
from `TREC <https://trec.nist.gov/>`_.

Probabilistic term weights
--------------------------

In this section we will try to present some of the thinking behind the
formulae. This is really to give a feel for where the probabilistic
model comes from. You may want to skim through this section if you're
not interested in the mathematical details.

Suppose we have an IR system with a total of N documents. And suppose
*Q* is a query in this IR system, made up of terms :math:`t_1`,
:math:`t_2` ... :math:`t_Q`. There is a set, *R*, of documents relevant
to the query.

In 1976, Stephen Robertson derived a formula which gives an ideal
numeric weight to a term t of Q. Just how this weight gets used we will
see below, but essentially a high weight means an important term and a
low weight means an unimportant term. The formula is,

.. math::

   w(t) = \log \frac{p(1-q)}{(1-p)q}

(The base of the logarithm doesn't matter, but we can suppose it is e.)
p is the probability that t indexes a relevant document, and q the
probability that t indexes a non-relevant document. And of course, 1 - p
is the probability that t does not index a relevant document, and 1 - q
the probability that t does not index a non-relevant document. More
precisely,

.. math::

   p = P(t \to D \mid D \in R)\,, \qquad q = P(t \to D \mid D \notin R)

   1 - p = P(t \not\to D \mid D \in R)\,, \qquad 1 - q = P(t \not\to D \mid D \notin R)

Suppose that t indexes n of the N documents in the IR system. As before,
we suppose also that there are R documents in *R*, and that there are r
documents in *R* which are indexed by t.

p is easily estimated by r/R, the ratio of the number of relevant
documents indexed by t to the total number of relevant documents.

The total number of non-relevant documents is N - R, and the number of
those indexed by t is n - r, so we can estimate q as (n - r)/(N - R).
This gives us the estimates,

.. math::

   p = \frac{r}{R}\,, \qquad 1 - q = \frac{N-R-n+r}{N-R}

   1 - p = \frac{R-r}{R}\,, \qquad q = \frac{n-r}{N-R}

and so substituting in the formula for w(t) we get the estimate,

.. math::

   w(t) = \log \frac{r(N - R - n + r)}{(R - r)(n - r)}

Unfortunately, this formula is subject to violent behaviour when, say,
n = r (giving infinity) or r = 0 (giving minus infinity), and so
Robertson suggests the modified form,

.. math::

   w(t) = \log \frac{(r+\frac{1}{2})(N-R-n+r+\frac{1}{2})}{(R-r+\frac{1}{2})(n-r+\frac{1}{2})}

with the reassurance that this has "some theoretical justification".
This is the form of the term weighting formula used in Xapian's
probabilistic weighting schemes.

Note that n is dependent on the term, t, and R on the query, *Q*, while
r depends both on t and *Q*. N is constant, at least until the IR system
changes.

At first sight this formula may appear to be quite useless. After all,
*R* is what we are trying to find. We can't evaluate w(t) until we have
*R*, and if we have *R* the retrieval process is over, and term weights
are no longer of any interest to us.

But the point is we can estimate p and q from a subset of *R*. As soon
as some records are found relevant by the user they can be used as a
working set for *R* from which the weights w(t) can be derived, and
these new weights can be used to improve the processing of the query.

In fact in the Xapian software *R* tends to mean not the complete set of
relevant documents, which indeed can rarely be discovered, but a small
set of documents which have been judged as relevant.

Suppose we have no documents marked as relevant. Then R = r = 0, and the
weighting formula reduces to

.. math::

   w(t) = \log \frac{N-n+\frac{1}{2}}{n+\frac{1}{2}}

This is approximately log((N - n)/n), or log(N/n), since n is usually
small compared with N. This is called inverse logarithmic weighting, and
has been used in IR for many decades, quite independently of the
probabilistic theory which underpins it. Weights of this form are in
fact the starting point in Xapian when no relevance information is
available.
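
As an illustration (with made-up numbers), suppose the system contains
N = 100000 documents and a term indexes n = 100 of them. Then, taking
natural logarithms,

.. math::

   \log \frac{100000 - 100 + \frac{1}{2}}{100 + \frac{1}{2}} \approx \log(994) \approx 6.9

which is very close to :math:`\log(N/n) = \log(1000) \approx 6.9`.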

The number n, incidentally, is often called the *frequency* of a term.
We prefer the phrase *term frequency*, to better distinguish it from the
wdf and wqf introduced below.

In extreme cases w(t) can be negative. In Xapian, negative values are
disallowed, and simply replaced by a small positive value.
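
The whole weighting formula is compact enough to write down directly. The
helper below is not part of the Xapian API; it is just a sketch of the
weight described above, with the :math:`+\frac{1}{2}` corrections and with
negative results clamped to a small positive floor (the exact floor value
used here is illustrative)::

    #include <algorithm>
    #include <cmath>

    // Robertson/Sparck Jones relevance weight for a term:
    //   N - documents in the collection, n - documents indexed by the term,
    //   R - known relevant documents,    r - relevant documents indexed by the term.
    double term_weight(double N, double n, double R, double r) {
        double num = (r + 0.5) * (N - R - n + r + 0.5);
        double den = (R - r + 0.5) * (n - r + 0.5);
        double w = std::log(num / den);
        // Negative weights are not allowed; replace them with a small
        // positive value, as Xapian does.
        return std::max(w, 1e-6);
    }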

wdp, wdf, ndl and wqf
---------------------

Before we see how the weights are used there are a few more ideas to
introduce.

As mentioned before, a term t is said to index a document D, or :math:`t \to D`.
We have emphasised that D may not be a piece of text in machine-readable
form, and that, even when it is, t may not actually occur in the text of
D. Nevertheless, it will often be the case that D is made up of a list
of words,

.. math::

   D = w_1, w_2, w_3 \ldots w_m

and that many, if not all, of the terms which index D derive from these
words (for example, the terms are often lower-cased and stemmed forms of
these words).

If a term derives from words :math:`w_9`, :math:`w_{38}`, :math:`w_{97}` and :math:`w_{221}`
in the indexing process, we can say that the term "occurs" in D at
positions 9, 38, 97 and 221, and so for each term a document may have a
vector of positional information. These are the *within-document
positions*, or *wdp*, of t in D.

The *within-document frequency*, or *wdf*, of a term t in D is the
number of times it is pulled out of D in the indexing process. Usually
this is the size of the wdp vector, but in Xapian it can exceed it,
since we can apply extra wdf to some parts of the document text. For
example, this is often done for the document title and abstract, to
attach extra importance to their contents compared to the rest of the
document.
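
One common way to do this in Xapian is with ``Xapian::TermGenerator``, which
can apply a larger wdf increment to some parts of the text. A minimal sketch
follows; the field text, the wdf increment of 5 and the database path are
all illustrative::

    #include <xapian.h>

    int main() {
        Xapian::WritableDatabase db("example.db", Xapian::DB_CREATE_OR_OPEN);

        Xapian::TermGenerator indexer;
        indexer.set_stemmer(Xapian::Stem("english"));

        Xapian::Document doc;
        indexer.set_document(doc);

        // Title words get a wdf increment of 5, body words the default of 1,
        // so title terms carry more weight in the formulae below.
        indexer.index_text("A short guide to dental care", 5);
        indexer.index_text("Brush your teeth twice a day to help avoid decay.");

        db.add_document(doc);
    }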

There are various ways in which we might measure the length of a
document, but the easiest is to suppose it is made up of m words,
:math:`w_1` to :math:`w_m`, and to define its length as m.

The *normalised document length*, or *ndl*, is then m divided by the
average length of the documents in the IR system. So a document of
average length has ndl equal to 1, short documents are less than 1, and
long documents greater than 1. We have found that very small ndl values
create problems, so Xapian actually allows a non-zero minimum value of
ndl to be specified.

In the probabilistic model the query, *Q*, is itself very much like
another document. Frequently indeed *Q* will be created from a document,
either one already in the IR system, or by an indexing process very
similar to the one used to add documents into the whole IR system. This
corresponds to a user saying "give me other documents like this one".
One can therefore attach a similar meaning to within-query position
information, within-query frequency, and normalised query length, or
wqp, wqf and nql. Xapian does not currently use the concept of wqp.

Using the weights: the *MSet*
-----------------------------

Now to pull everything together. From the probabilistic term weights we
can assign a weight to any document, d, as follows,

.. math::

   W(d) = \sum_{t \to d,\, t \in Q} \frac{(k + 1)\, f_t}{k L_d + f_t}\, w(t)

The sum extends over the terms of *Q* which index d. :math:`f_t` is
the wdf of t in d, :math:`L_d` is the ndl of d, and k is some suitably
chosen constant.

The factor :math:`k+1` is actually redundant, but helps with the interpretation
of the equation. In Xapian, this weighting scheme is implemented by the
`Xapian::TradWeight class <apidoc/html/classXapian_1_1TradWeight.html>`_
and the factor :math:`(k+1)` is ignored.

If :math:`k` is set to zero the factor before :math:`w(t)` is 1, and the wdfs are
ignored. As k tends to infinity, the factor becomes :math:`f_t / L_d`,
and the wdfs take on their greatest importance. Intermediate values
scale the wdf contribution between these extremes. The best :math:`k`
actually depends on the characteristics of the IR system as a whole, and
unfortunately no rule can be given for choosing it. By default, Xapian
sets :math:`k` to 1, which should give reasonable results for most
systems: :math:`W(d)` is merely tweaked a bit by the wdf values, and
users observe a simple pattern of retrieval. It is possible to tune
:math:`k` to provide optimal results for a specific system.
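
In practice, choosing :math:`k` in Xapian means passing it to the weighting
scheme: ``TradWeight`` takes :math:`k` as its single constructor argument.
In this sketch the database path, the query term and the value 2.0 are
purely illustrative::

    #include <xapian.h>

    int main() {
        Xapian::Database db("example.db");  // placeholder path
        Xapian::Enquire enquire(db);
        enquire.set_query(Xapian::Query("tooth"));

        // Use the traditional probabilistic weighting with k = 2.0.
        enquire.set_weighting_scheme(Xapian::TradWeight(2.0));

        Xapian::MSet matches = enquire.get_mset(0, 10);
    }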

Any :math:`d` in the IR system has a value :math:`W(d)`, but, if no term of the query
indexes :math:`d`, :math:`W(d)` will be zero. In practice only documents for which
:math:`W(d) > 0` will be of interest, and these are the documents indexed by at least
one term of *Q*. If we now take these documents and arrange them by
decreasing :math:`W(d)` value, we get a ranked list called the *match set*, or
*MSet*, of document and weight pairs:

.. math::

   (D_0, W(D_0)),\ (D_1, W(D_1)),\ (D_2, W(D_2)),\ \ldots

where :math:`W(D_i) \ge W(D_j)` whenever :math:`i < j`.

And according to the probabilistic model, the documents :math:`D_0, D_1, D_2 ...`
are ranked by decreasing order of probability of relevance. So :math:`D_0` has highest
probability of being relevant, then :math:`D_1`, and so on.

Xapian creates the MSet from the posting lists of the terms of the
query. This is the central operation of any IR system, and will be
familiar to anyone who has used one of the Internet's major search
engines, where the query is what you type in the query box, and the
resulting hit list corresponds to the top few items of the MSet.

The cutoff point, K, is chosen when the MSet is created. The candidates
for inclusion in the MSet are all documents indexed by at least one term
of *Q*, and their number will usually exceed the choice of K (K is
typically set to be 1000 or less). So the MSet is actually the best K
documents found in the match process.
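
In C++ the MSet is what ``Enquire::get_mset()`` returns; the sketch below
asks for the best 1000 candidates and inspects each document's weight
(the database path and query term are placeholders)::

    #include <iostream>
    #include <xapian.h>

    int main() {
        Xapian::Database db("example.db");  // placeholder path
        Xapian::Enquire enquire(db);
        enquire.set_query(Xapian::Query("tooth"));

        // K = 1000: the best 1000 matching documents.
        Xapian::MSet mset = enquire.get_mset(0, 1000);
        for (Xapian::MSetIterator i = mset.begin(); i != mset.end(); ++i) {
            std::cout << "document " << *i
                      << " weight " << i.get_weight()      // W(d)
                      << " (" << i.get_percent() << "%)\n";
        }
    }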

A modification of this weighting scheme can be employed that takes into
account the query itself:

.. math::

   W(d) = \sum_{t \to d,\, t \in Q} \frac{(k_3+1)\, q_t}{k_3 L' + q_t} \cdot \frac{(k+1)\, f_t}{k L_d + f_t}\, w(t)

where :math:`q_t` is the wqf of t in *Q*, :math:`L'` is the nql, or normalised
query length, and :math:`k_3` is a further constant. In computing :math:`W(d)`
across the document space, this extra factor may be viewed as just a
modification to the basic term weights, :math:`w(t)`. Like :math:`k` and :math:`k_3`,
we will need to make an inspired guess for :math:`L'`. In fact the choices for
:math:`k_3` and :math:`L'` will depend on the broader context of the use of
this formula, and more advice will be given as occasion arises.

Xapian's default weighting scheme is a generalised form of this
weighting scheme modification, known as `BM25 <bm25.html>`_. In BM25,
:math:`L'` is effectively fixed at 1, so the normalised query length
drops out of the formula.

Using the weights: the *ESet*
-----------------------------

But as well as ranking documents, Xapian can rank terms, and this is
very useful. The higher up the ranking a term is, the more likely it is
to act as a good differentiator between relevant and non-relevant
documents. It is therefore a candidate for adding back into the query.
Terms from this list can thus be used to expand the query, after which
the query can be re-run to get a better MSet. Because this list of terms
is mainly used for query expansion, it is called the *expand set* or
*ESet*.

The term expansion weighting formula is as follows,

.. math::

   W(t) = r\, w(t)

in other words we multiply the term weight by the number of relevant
documents that have been indexed by the term.

The ESet then has this form,

.. math::

   (t_0, W(t_0)),\ (t_1, W(t_1)),\ (t_2, W(t_2)),\ \ldots

where :math:`W(t_i) \ge W(t_j)` whenever :math:`i < j`.

Since the main function of the ESet is to find new terms to be added to
*Q*, we usually omit from it terms already in *Q*.

The :math:`W(t)` weight is applicable to any term in the IR system, but has a
value zero when t does not index a relevant document. The ESet is
therefore confined to be a ranking of the best K terms which index
documents in *R*.

This simple form of :math:`W(t)` is traditional in the probabilistic model, but
seems less than optimal because it does not take into account wdf
information. One can in fact try to generalise it to:

.. math::

   W(t) = \sum_{t \to d,\, d \in R} \frac{(k+1)\, f_t}{k L_d + f_t}\, w(t)

:math:`k` is again a constant, but it does not need to have the same value as
the :math:`k` used in the probabilistic term weights above. In Xapian, :math:`k`
defaults to 1.0 for ESet generation.

This reduces to :math:`W(t) = r\, w(t)` when :math:`k=0`. Certainly this form can be
recommended in the very common case where :math:`r=1`, that is, we have a
single document marked relevant.
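
In the Xapian API the relevance information is passed as an ``RSet`` and the
expansion terms come back as an ``ESet``. In this sketch the database path,
the query term and the document ids marked relevant are all illustrative::

    #include <iostream>
    #include <xapian.h>

    int main() {
        Xapian::Database db("example.db");  // placeholder path
        Xapian::Enquire enquire(db);
        enquire.set_query(Xapian::Query("tooth"));

        // Mark two documents as judged relevant.
        Xapian::RSet rset;
        rset.add_document(1);
        rset.add_document(4);

        // The 10 best candidate terms for expanding the query.
        Xapian::ESet eset = enquire.get_eset(10, rset);
        for (Xapian::ESetIterator t = eset.begin(); t != eset.end(); ++t) {
            std::cout << *t << " (weight " << t.get_weight() << ")\n";
        }
    }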

The progress of a query
-----------------------

Below we describe the general case of the IR model supported, including
use of a relevance set (`RSet <glossary.html#rset>`_), query expansion,
improved term weights and reranking. You don't have to use any of these
for Xapian to be useful, but they are available should you need them.

The user enters a query. This is parsed into a form the IR system
understands, and run by the IR system, which returns two lists: a list
of captions, derived from the MSet, and a list of terms, from the ESet.
If the RSet is empty, the first few documents of the MSet can be used as
a stand-in - after all, they have a good chance of being relevant! You
can read a document by clicking on the caption. (We assume the usual
screen/mouse environment.) But you can also mark a document as relevant
(change *R*) or cause a term to be added from the ESet to the query
(change *Q*). As soon as any change is made to the query environment the
query can be rerun, although you might have a front-end where nothing
happens until you click on some "Run Query" button.

In any case rerunning the query leads to a new MSet and ESet, and so to
a new display. The IR process is then an iterative one. You can delete
terms from the query or add them in; mark or unmark documents as being
relevant. Eventually you converge on the answer to the query, or at
least, the best answer the IR system can give you.

Further reading
---------------

If you want to find out more, then `"Simple, proven approaches to text
retrieval" <https://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.53.8337>`_
is a worthwhile read. It's a good introduction to probabilistic
information retrieval, which is basically what Xapian provides.

There are also several good books on the subject of Information
Retrieval:

- "*Information Retrieval*" by C. J. van Rijsbergen is well worth
  reading. It's out of print, but is available for free `from the
  author's website <http://www.dcs.gla.ac.uk/Keith/Preface.html>`_ (in
  HTML and PDF formats).
- "*Readings in Information Retrieval*" (published by Morgan Kaufmann,
  edited by Karen Sparck Jones and Peter Willett) is a collection of
  published papers covering many aspects of the subject.
- "*Managing Gigabytes*" (also published by Morgan Kaufmann, written by
  Ian H. Witten, Alistair Moffat and Timothy C. Bell) describes
  information retrieval and compression techniques.
- "*Modern Information Retrieval*" (published by Addison Wesley,
  written by Ricardo Baeza-Yates and Berthier Ribeiro-Neto) gives a
  good overview of the field. It was published more recently than the
  books above, and so covers some more recent developments.
- "*Introduction to Information Retrieval*" (published by Cambridge
  University Press, written by Christopher D. Manning, Prabhakar
  Raghavan and Hinrich Schütze) looks to be a good introductory work
  (we've not read it in detail yet). As well as the print version,
  there's an online version on `the book's companion
  website <http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html>`_.