This document aims to provide some theoretical background to Xapian.

In Information Retrieval (IR), the items we are trying to retrieve are
called *documents*, and each document is described by a collection of
*terms*. These two words, `document` and `term`, are now traditional in
the vocabulary of IR, and reflect its Library Science origins. Usually a
document is thought of as a piece of text, most likely in a machine
readable form, and a term as a word or phrase which helps to describe
the document, and which may indeed occur one or more times in the
document. So a document might be about dental care, and could be
described by corresponding terms `tooth`, `teeth`, `toothbrush`,
`decay`, `cavity`, `plaque`, `diet` and so on.

More generally a document can be anything we want to retrieve, and a
term any feature that helps describe the documents. So the documents
could be a collection of fossils held in a big museum collection, and
the terms could be morphological characteristics of the fossils. Or the
documents could be tunes, and the terms could then be phrases of notes
that occur in the tunes.

If, in an IR system, a document, D, is described by a term, t, t is said
to *index* D, and we can write,

.. math::

   t \to D

In fact an IR system consists of a set of documents, :math:`D_1`, :math:`D_2`, :math:`D_3` ...,
a set of terms :math:`t_1`, :math:`t_2`, :math:`t_3` ..., and a set of relationships,

.. math::

   t_i \to D_j

i.e. instances of terms indexing documents. A single instance of a
particular term indexing a particular document is called a *posting*.
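
As a concrete (if simplified) illustration of these relationships, here is a
minimal C++ sketch which creates one document and some terms indexing it
using the Xapian API; the database path, the document data and the term
strings are just placeholders::

    #include <xapian.h>

    int main() {
        // "example.db" is a placeholder path.
        Xapian::WritableDatabase db("example.db", Xapian::DB_CREATE_OR_OPEN);

        Xapian::Document doc;
        doc.set_data("A leaflet about dental care");

        // Each add_term() call records a posting: an instance of this
        // term indexing this document.
        doc.add_term("tooth");
        doc.add_term("decay");
        doc.add_term("plaque");

        db.add_document(doc);
    }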

For a document, D, there is a list of terms which index it. This is
called the *term list* of D.

For a term, t, there is a list of documents which it indexes. This is
called the *posting list* of t. (`Document list` would be more
consistent, but sounds a little too vague for this very important
concept.)

At a simple level a computerised IR system puts the terms in an *index*
file. A term can be efficiently looked up and its posting list found. In
the posting list, each document is represented by a short identifier. To
over-simplify a little, a posting list can be thought of as a list of
numbers (document ids), and a term list as a list of strings (the
terms). Some systems represent each term by a number internally, so the
term list is then also a list of numbers. Xapian doesn't - it uses the
terms themselves, and uses prefix compression to store them compactly.
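
To make the two lists concrete, here is a short C++ sketch using the Xapian
API which walks the posting list of a term and the term list of a document;
the database path, the term ``tooth`` and the document id 1 are illustrative
only::

    #include <iostream>
    #include <xapian.h>

    int main() {
        Xapian::Database db("example.db");  // placeholder path

        // Posting list: the documents indexed by the term "tooth".
        for (Xapian::PostingIterator p = db.postlist_begin("tooth");
             p != db.postlist_end("tooth"); ++p) {
            std::cout << "tooth indexes document " << *p << '\n';
        }

        // Term list: the terms which index document 1.
        for (Xapian::TermIterator t = db.termlist_begin(1);
             t != db.termlist_end(1); ++t) {
            std::cout << "document 1 is indexed by " << *t << '\n';
        }
    }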

The terms needn't be (and often aren't) just the words from the
document. Usually they are converted to lower case, and often a stemming
algorithm is applied, so a single term `connect` might derive from a
number of words, `connect`, `connects`, `connection`, `connected`
and so on. A single word might also give rise to more than one term, for
example you might index both stemmed and unstemmed forms of some or all
terms. Or a stemming algorithm could conceivably produce more than one
stem in some cases (this isn't the case for any of the stemming
algorithms Xapian currently supports, but consider the `Double
Metaphone <https://en.wikipedia.org/wiki/Double_Metaphone>`_ phonetic
algorithm which can produce two codes from a single input).
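
For instance, Xapian's Snowball stemmers can be used directly; the following
sketch shows the English stemmer conflating the word forms mentioned above
into the single term ``connect``::

    #include <iostream>
    #include <xapian.h>

    int main() {
        Xapian::Stem stemmer("english");
        std::cout << stemmer("connect") << '\n';     // connect
        std::cout << stemmer("connects") << '\n';    // connect
        std::cout << stemmer("connection") << '\n';  // connect
        std::cout << stemmer("connected") << '\n';   // connect
    }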

Xapian's context within IR
--------------------------

In the beginning IR was dominated by Boolean retrieval, described in the
next section. This could be called the antediluvian period, or
generation zero. The first generation of IR research dates from the
early sixties, and was dominated by model building, experimentation, and
heuristics. The big names were `Gerard
Salton <https://en.wikipedia.org/wiki/Gerard_Salton>`_ and `Karen Sparck
Jones <https://en.wikipedia.org/wiki/Karen_Sparck_Jones>`_. The second
period, which began in the mid-seventies, saw a big shift towards
mathematics, and a rise of the IR model based upon probability theory -
probabilistic IR. The big name here was, and continues to be, `Stephen
Robertson <http://www.soi.city.ac.uk/~ser/homepage.html>`_. More
recently, `Keith van
Rijsbergen <https://en.wikipedia.org/wiki/C._J._van_Rijsbergen>`_ has led
a group that has developed underlying logical models of IR, but
interesting as this new work is, it has not as yet led to results that
offer improvements for the IR system builder.

Xapian was built as a system for efficiently implementing the
probabilistic IR model (though this doesn't mean it is limited to only
implementing this model - other models can be implemented providing they
can be expressed in a suitable way). Xapian tries to implement the
probabilistic model faithfully, though in some places it can be told to
use short-cuts for efficiency.

The model has two striking advantages:

#. It leads to systems that give good retrieval performance. As the
   model has developed over the last 25 years, this has proved so
   consistently true that one is led to suspect that the probability
   theory model is, in some sense, the "correct" model for IR. The IR
   process would appear to function as the model suggests.
#. As new problems come up in IR, the probabilistic model can usually
   suggest a solution. This makes it a very practical mental tool for
   cutting through the jungle of possibilities when designing IR
   systems.

In simple cases the model reduces to simple formulae in general use, so
don't be alarmed by the apparent complexity of the equations below. We
need them for a full understanding of the general case.

Boolean retrieval
-----------------

A Boolean construct of terms retrieves a corresponding set of documents.
So, for example, if

| :math:`t_1` indexes documents 1 2 3 5 8
| :math:`t_2` indexes documents 2 3 6

then

| :math:`t_1` ``AND`` :math:`t_2` retrieves 2 3
| :math:`t_1` ``OR`` :math:`t_2` retrieves 1 2 3 5 6 8
| :math:`t_1` ``AND_NOT`` :math:`t_2` retrieves 1 5 8
| :math:`t_2` ``AND_NOT`` :math:`t_1` retrieves 6

The posting list of a term is a set of documents. IR becomes a matter of
constructing other sets by doing unions, intersections and differences
of these sets.

For example, in an IR system of works of literature, a Boolean query
such as::

    (lang:en OR lang:fr OR lang:de) AND (type:novel OR type:play) AND century:19

might be used to retrieve all English, French or German novels or plays
from the nineteenth century.

Boolean retrieval is often useful, but is rather inadequate on its own
as a general IR tool. Results aren't ordered by any measure of how
"good" they might be, and users require training to make effective use
of such a system. Despite this, purely Boolean IR systems continue to be
used.

By default, Xapian uses probabilistic ranking to order retrieved
documents while allowing Boolean expressions of arbitrary complexity
(some Boolean IR systems are restricted to queries in normal form) to
limit the documents retrieved, which provides the benefits of both
approaches. Pure Boolean retrieval is also supported (select the
`BoolWeight <apidoc/html/classXapian_1_1BoolWeight.html>`_ weighting
scheme using ``enquire.set_weighting_scheme(Xapian::BoolWeight());``).
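
For example, the ``AND_NOT`` query from the example above could be run with
pure Boolean weighting like this; the database path, and the terms ``tooth``
and ``decay`` standing in for :math:`t_1` and :math:`t_2`, are illustrative::

    #include <xapian.h>

    int main() {
        Xapian::Database db("example.db");  // placeholder path
        Xapian::Enquire enquire(db);

        // t1 AND_NOT t2
        Xapian::Query query(Xapian::Query::OP_AND_NOT,
                            Xapian::Query("tooth"),
                            Xapian::Query("decay"));
        enquire.set_query(query);

        // Pure Boolean retrieval: every match gets the same weight.
        enquire.set_weighting_scheme(Xapian::BoolWeight());

        Xapian::MSet matches = enquire.get_mset(0, 10);
    }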

Relevance and the idea of a query
---------------------------------

*Relevance* is a central concept to the probabilistic model. Whole
academic papers have been devoted to discussing the nature of relevance,
but essentially a document is relevant if it was what the user really
wanted! Retrieval is rarely perfect, so among documents retrieved there
will be non-relevant ones; among those not retrieved, relevant ones.

Relevance is modelled as a black or white attribute. There are no
degrees of relevance: a document either is, or is not, relevant. In the
probabilistic model there is however a probability of relevance, and
documents with a low probability of relevance in the model generally
correspond to documents that, in practice, one would describe as being
of marginal relevance.

What the user actually wants has to be expressed in some form, and the
expression of the user's need is the query. In the probabilistic model
the query is, usually, a list of terms, but that is the end point of a
chain of events. The user has a need; this is expressed in ordinary
language; this is then turned into a written form that the user judges
will yield good results in an IR system, and the IR system then turns
this form into a set, *Q*, of terms for processing the query. Relevance
must be judged against the user's original need, not against a later
interpretation of what *Q*, the set of terms, ought to mean.

Below, a query is taken to be just a set of terms, but it is important
to realise that this is a simplification. Each link in the chain that
takes us from the *information need* ("what the user is looking for") to
the abstraction in *Q* is liable to error, and these errors compound to
affect IR performance. In fact the performance of IR systems as a whole
is much worse than most people generally imagine.

Evaluating IR performance
-------------------------

It is possible to set up a test to evaluate an IR system. Suppose *Q* is
a query and, out of the complete collection of documents in the IR
system, a set *R* of R documents is relevant to the query. So if a
document is in *R* it is relevant, and if not in *R* it is non-relevant.
Suppose the IR system is able to give us back K documents, among which r
are relevant. *Precision* and *recall* are defined as

.. math::

   \text{precision} = \frac{r}{K}\,, \qquad \text{recall} = \frac{r}{R}

Precision is the density of relevant documents among those retrieved.
Recall is the proportion of relevant documents retrieved. In most IR
systems K is a parameter that can be varied, and what you find is that
when K is low you get high precision at the expense of low recall, and
when K is high you get high recall at the expense of low precision.
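
For example, suppose the collection holds R = 20 relevant documents for a
query, and the system returns K = 10 documents of which r = 7 are relevant
(the numbers are purely illustrative). Then

.. math::

   \text{precision} = \frac{7}{10} = 0.7\,, \qquad \text{recall} = \frac{7}{20} = 0.35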

The ideal value of K will depend on the use of the system. For example,
if a user wants the answer to a simple question and the system contains
many documents which would answer it, a low value of K will be best to
give a small number of relevant results. But in a system indexing legal
cases, users will often wish to make sure no potentially relevant case
is missed even if that requires them to check more non-relevant cases,
so a high value of K will be best.

Retrieval effectiveness is often shown as a graph of precision against
recall, averaged over a number of queries and plotted for different
values of K. Such curves typically have a shape similar to a hyperbola:
precision falls as recall rises.

A collection like this, consisting of a set of documents, a set of
queries, and for each query, a complete set of relevance assessments, is
called a *test collection*. With a test collection you can test out
different IR ideas, and see how well one performs against another. The
controversial part of establishing any test collection is the procedure
employed for determining the sets :math:`R_i` of relevance assessments.
Subjectivity of judgement comes in here, and people will differ about
whether a particular document is relevant to a particular query. Even
so, the averaging across queries reduces the errors that may
occasionally arise through faulty relevance judgements, and averaging
important tests across a number of test collections reduces the effects
caused by accidental features of individual collections, and the results
obtained by these tests in modern research are generally accepted as
trustworthy. Nowadays such research with test collections is organised
from `TREC <https://trec.nist.gov/>`_.

Probabilistic term weights
--------------------------

In this section we will try to present some of the thinking behind the
formulae. This is really to give a feel for where the probabilistic
model comes from. You may want to skim through this section if you're
not interested in the mathematical details.

Suppose we have an IR system with a total of N documents. And suppose
*Q* is a query in this IR system, made up of terms :math:`t_1`,
:math:`t_2` ... :math:`t_Q`. There is a set, *R*, of documents relevant
to the query.

In 1976, Stephen Robertson derived a formula which gives an ideal
numeric weight to a term t of Q. Just how this weight gets used we will
see below, but essentially a high weight means an important term and a
low weight means an unimportant term. The formula is,

.. math::

   w(t) = \log \frac{p(1-q)}{(1-p)q}

(The base of the logarithm doesn't matter, but we can suppose it is e.)
p is the probability that t indexes a relevant document, and q the
probability that t indexes a non-relevant document. And of course, 1 - p
is the probability that t does not index a relevant document, and 1 - q
the probability that t does not index a non-relevant document. More
precisely,

.. math::

   p = P(t \to D \mid D \in R)\,, \qquad q = P(t \to D \mid D \notin R)

   1 - p = P(t \not\to D \mid D \in R)\,, \qquad 1 - q = P(t \not\to D \mid D \notin R)

Suppose that t indexes n of the N documents in the IR system. As before,
we suppose also that there are R documents in *R*, and that there are r
documents in *R* which are indexed by t.

p is easily estimated by r/R, the ratio of the number of relevant
documents indexed by t to the total number of relevant documents.

The total number of non-relevant documents is N - R, and the number of
those indexed by t is n - r, so we can estimate q as (n - r)/(N - R).
This gives us the estimates,

.. math::

   p = \frac{r}{R}\,, \qquad 1 - q = \frac{N-R-n+r}{N-R}

   1 - p = \frac{R-r}{R}\,, \qquad q = \frac{n-r}{N-R}

and so substituting in the formula for w(t) we get the estimate,

.. math::

   w(t) = \log \frac{r(N - R - n + r)}{(R - r)(n - r)}

Unfortunately, this formula is subject to violent behaviour when, say,
n = r (giving infinity) or r = 0 (giving minus infinity), and so
Robertson suggests the modified form,

.. math::

   w(t) = \log \frac{(r+\frac{1}{2})(N-R-n+r+\frac{1}{2})}{(R-r+\frac{1}{2})(n-r+\frac{1}{2})}

with the reassurance that this has "some theoretical justification".
This is the form of the term weighting formula used in Xapian's
probabilistic weighting schemes.

Note that n is dependent on the term, t, and R on the query, *Q*, while
r depends both on t and *Q*. N is constant, at least until the IR system
changes.

At first sight this formula may appear to be quite useless. After all,
*R* is what we are trying to find. We can't evaluate w(t) until we have
*R*, and if we have *R* the retrieval process is over, and term weights
are no longer of any interest to us.

But the point is we can estimate p and q from a subset of *R*. As soon
as some records are found relevant by the user they can be used as a
working set for *R* from which the weights w(t) can be derived, and
these new weights can be used to improve the processing of the query.

In fact in the Xapian software *R* tends to mean not the complete set of
relevant documents, which indeed can rarely be discovered, but a small
set of documents which have been judged as relevant.

Suppose we have no documents marked as relevant. Then R = r = 0, and the
weighting formula reduces to

.. math::

   w(t) = \log \frac{N-n+\frac{1}{2}}{n+\frac{1}{2}}

This is approximately log((N - n)/n), or log(N/n), since n is usually
small compared with N. This is called inverse logarithmic weighting, and
has been used in IR for many decades, quite independently of the
probabilistic theory which underpins it. Weights of this form are in
fact the starting point in Xapian when no relevance information is
available.
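
As an illustration (with made-up numbers), suppose the system contains
N = 100000 documents and a term indexes n = 100 of them. Then, taking
natural logarithms,

.. math::

   \log \frac{100000 - 100 + \frac{1}{2}}{100 + \frac{1}{2}} \approx \log(994) \approx 6.9

which is very close to :math:`\log(N/n) = \log(1000) \approx 6.9`.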

The number n, incidentally, is often called the *frequency* of a term.
We prefer the phrase *term frequency*, to better distinguish it from the
wdf and wqf introduced below.

In extreme cases w(t) can be negative. In Xapian, negative values are
disallowed, and simply replaced by a small positive value.
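
The whole weighting formula is compact enough to write down directly. The
helper below is not part of the Xapian API; it is just a sketch of the
weight described above, with the :math:`+\frac{1}{2}` corrections and with
negative results clamped to a small positive floor (the exact floor value
used here is illustrative)::

    #include <algorithm>
    #include <cmath>

    // Robertson/Sparck Jones relevance weight for a term:
    //   N - documents in the collection, n - documents indexed by the term,
    //   R - known relevant documents,    r - relevant documents indexed by the term.
    double term_weight(double N, double n, double R, double r) {
        double num = (r + 0.5) * (N - R - n + r + 0.5);
        double den = (R - r + 0.5) * (n - r + 0.5);
        double w = std::log(num / den);
        // Negative weights are not allowed; replace them with a small
        // positive value, as Xapian does.
        return std::max(w, 1e-6);
    }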

wdp, wdf, ndl and wqf
---------------------

Before we see how the weights are used there are a few more ideas to
introduce.

As mentioned before, a term t is said to index a document D, or :math:`t \to D`.
We have emphasised that D may not be a piece of text in machine-readable
form, and that, even when it is, t may not actually occur in the text of
D. Nevertheless, it will often be the case that D is made up of a list
of words,

.. math::

   D = w_1, w_2, w_3 \ldots w_m

and that many, if not all, of the terms which index D derive from these
words (for example, the terms are often lower-cased and stemmed forms of
these words).

If a term derives from words :math:`w_9`, :math:`w_{38}`, :math:`w_{97}` and :math:`w_{221}`
in the indexing process, we can say that the term "occurs" in D at
positions 9, 38, 97 and 221, and so for each term a document may have a
vector of positional information. These are the *within-document
positions*, or *wdp*, of t in D.

The *within-document frequency*, or *wdf*, of a term t in D is the
number of times it is pulled out of D in the indexing process. Usually
this is the size of the wdp vector, but in Xapian it can exceed it,
since we can apply extra wdf to some parts of the document text. For
example, this is often done for the document title and abstract, to
attach extra importance to their contents compared to the rest of the
document.
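
One common way to do this in Xapian is with ``Xapian::TermGenerator``, which
can apply a larger wdf increment to some parts of the text. A minimal sketch
follows; the field text, the wdf increment of 5 and the database path are
all illustrative::

    #include <xapian.h>

    int main() {
        Xapian::WritableDatabase db("example.db", Xapian::DB_CREATE_OR_OPEN);

        Xapian::TermGenerator indexer;
        indexer.set_stemmer(Xapian::Stem("english"));

        Xapian::Document doc;
        indexer.set_document(doc);

        // Title words get a wdf increment of 5, body words the default of 1,
        // so title terms carry more weight in the formulae below.
        indexer.index_text("A short guide to dental care", 5);
        indexer.index_text("Brush your teeth twice a day to help avoid decay.");

        db.add_document(doc);
    }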

There are various ways in which we might measure the length of a
document, but the easiest is to suppose it is made up of m words,
:math:`w_1` to :math:`w_m`, and to define its length as m.

The *normalised document length*, or *ndl*, is then m divided by the
average length of the documents in the IR system. So a document of
average length has ndl equal to 1, short documents are less than 1, and
long documents greater than 1. We have found that very small ndl values
create problems, so Xapian actually allows a non-zero minimum value of
ndl to be specified.

In the probabilistic model the query, *Q*, is itself very much like
another document. Frequently indeed *Q* will be created from a document,
either one already in the IR system, or by an indexing process very
similar to the one used to add documents into the whole IR system. This
corresponds to a user saying "give me other documents like this one".
One can therefore attach a similar meaning to within-query position
information, within-query frequency, and normalised query length, or
wqp, wqf and nql. Xapian does not currently use the concept of wqp.

Using the weights: the *MSet*
-----------------------------

Now to pull everything together. From the probabilistic term weights we
can assign a weight to any document, d, as follows,

.. math::

   W(d) = \sum_{t \to d,\, t \in Q} \frac{(k + 1)\, f_t}{k L_d + f_t}\, w(t)

The sum extends over the terms of *Q* which index d. :math:`f_t` is
the wdf of t in d, :math:`L_d` is the ndl of d, and k is some suitably
chosen constant.

The factor :math:`k+1` is actually redundant, but helps with the interpretation
of the equation. In Xapian, this weighting scheme is implemented by the
`Xapian::TradWeight class <apidoc/html/classXapian_1_1TradWeight.html>`_
and the factor :math:`(k+1)` is ignored.

If :math:`k` is set to zero the factor before :math:`w(t)` is 1, and the wdfs are
ignored. As k tends to infinity, the factor becomes :math:`f_t / L_d`,
and the wdfs take on their greatest importance. Intermediate values
scale the wdf contribution between these extremes. The best :math:`k`
actually depends on the characteristics of the IR system as a whole, and
unfortunately no rule can be given for choosing it. By default, Xapian
sets :math:`k` to 1, which should give reasonable results for most
systems: :math:`W(d)` is merely tweaked a bit by the wdf values, and
users observe a simple pattern of retrieval. It is possible to tune
:math:`k` to provide optimal results for a specific system.
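
In practice, choosing :math:`k` in Xapian means passing it to the weighting
scheme: ``TradWeight`` takes :math:`k` as its single constructor argument.
In this sketch the database path, the query term and the value 2.0 are
purely illustrative::

    #include <xapian.h>

    int main() {
        Xapian::Database db("example.db");  // placeholder path
        Xapian::Enquire enquire(db);
        enquire.set_query(Xapian::Query("tooth"));

        // Use the traditional probabilistic weighting with k = 2.0.
        enquire.set_weighting_scheme(Xapian::TradWeight(2.0));

        Xapian::MSet matches = enquire.get_mset(0, 10);
    }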

Any :math:`d` in the IR system has a value :math:`W(d)`, but, if no term of the query
indexes :math:`d`, :math:`W(d)` will be zero. In practice only documents for which
:math:`W(d) > 0` will be of interest, and these are the documents indexed by at least
one term of *Q*. If we now take these documents and arrange them by
decreasing :math:`W(d)` value, we get a ranked list called the *match set*, or
*MSet*, of document and weight pairs:

.. math::

   (D_0, W(D_0)),\ (D_1, W(D_1)),\ (D_2, W(D_2)),\ \ldots

where :math:`W(D_i) \ge W(D_j)` whenever :math:`i < j`.

And according to the probabilistic model, the documents :math:`D_0, D_1, D_2 ...`
are ranked by decreasing order of probability of relevance. So :math:`D_0` has highest
probability of being relevant, then :math:`D_1`, and so on.

Xapian creates the MSet from the posting lists of the terms of the
query. This is the central operation of any IR system, and will be
familiar to anyone who has used one of the Internet's major search
engines, where the query is what you type in the query box, and the
resulting hit list corresponds to the top few items of the MSet.

The cutoff point, K, is chosen when the MSet is created. The candidates
for inclusion in the MSet are all documents indexed by at least one term
of *Q*, and their number will usually exceed the choice of K (K is
typically set to be 1000 or less). So the MSet is actually the best K
documents found in the match process.
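
In C++ the MSet is what ``Enquire::get_mset()`` returns; the sketch below
asks for the best 1000 candidates and inspects each document's weight
(the database path and query term are placeholders)::

    #include <iostream>
    #include <xapian.h>

    int main() {
        Xapian::Database db("example.db");  // placeholder path
        Xapian::Enquire enquire(db);
        enquire.set_query(Xapian::Query("tooth"));

        // K = 1000: the best 1000 matching documents.
        Xapian::MSet mset = enquire.get_mset(0, 1000);
        for (Xapian::MSetIterator i = mset.begin(); i != mset.end(); ++i) {
            std::cout << "document " << *i
                      << " weight " << i.get_weight()      // W(d)
                      << " (" << i.get_percent() << "%)\n";
        }
    }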

A modification of this weighting scheme can be employed that takes into
account the query itself:

.. math::

   W(d) = \sum_{t \to d,\, t \in Q} \frac{(k_3+1)\, q_t}{k_3 L' + q_t} \cdot \frac{(k+1)\, f_t}{k L_d + f_t}\, w(t)

where :math:`q_t` is the wqf of t in *Q*, :math:`L'` is the nql, or normalised
query length, and :math:`k_3` is a further constant. In computing :math:`W(d)`
across the document space, this extra factor may be viewed as just a
modification to the basic term weights, :math:`w(t)`. Like :math:`k` and :math:`k_3`,
we will need to make an inspired guess for :math:`L'`. In fact the choices for
:math:`k_3` and :math:`L'` will depend on the broader context of the use of
this formula, and more advice will be given as occasion arises.

Xapian's default weighting scheme is a generalised form of this
weighting scheme modification, known as `BM25 <bm25.html>`_. In BM25,
:math:`L'` is effectively fixed at 1, so the normalised query length
drops out of the formula.

Using the weights: the *ESet*
-----------------------------

But as well as ranking documents, Xapian can rank terms, and this is
very useful. The higher up the ranking a term is, the more likely it is
to act as a good differentiator between relevant and non-relevant
documents. It is therefore a candidate for adding back into the query.
Terms from this list can thus be used to expand the query, after which
the query can be re-run to get a better MSet. Because this list of terms
is mainly used for query expansion, it is called the *expand set* or
*ESet*.

The term expansion weighting formula is as follows,

.. math::

   W(t) = r\, w(t)

in other words we multiply the term weight by the number of relevant
documents that have been indexed by the term.

The ESet then has this form,

.. math::

   (t_0, W(t_0)),\ (t_1, W(t_1)),\ (t_2, W(t_2)),\ \ldots

where :math:`W(t_i) \ge W(t_j)` whenever :math:`i < j`.

Since the main function of the ESet is to find new terms to be added to
*Q*, we usually omit from it terms already in *Q*.

The :math:`W(t)` weight is applicable to any term in the IR system, but has a
value zero when t does not index a relevant document. The ESet is
therefore confined to be a ranking of the best K terms which index
documents in *R*.

This simple form of :math:`W(t)` is traditional in the probabilistic model, but
seems less than optimal because it does not take into account wdf
information. One can in fact try to generalise it to:

.. math::

   W(t) = \sum_{t \to d,\, d \in R} \frac{(k+1)\, f_t}{k L_d + f_t}\, w(t)

:math:`k` is again a constant, but it does not need to have the same value as
the :math:`k` used in the probabilistic term weights above. In Xapian, :math:`k`
defaults to 1.0 for ESet generation.

This reduces to :math:`W(t) = r\, w(t)` when :math:`k=0`. Certainly this form can be
recommended in the very common case where :math:`r=1`, that is, we have a
single document marked relevant.
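
In the Xapian API the relevance information is passed as an ``RSet`` and the
expansion terms come back as an ``ESet``. In this sketch the database path,
the query term and the document ids marked relevant are all illustrative::

    #include <iostream>
    #include <xapian.h>

    int main() {
        Xapian::Database db("example.db");  // placeholder path
        Xapian::Enquire enquire(db);
        enquire.set_query(Xapian::Query("tooth"));

        // Mark two documents as judged relevant.
        Xapian::RSet rset;
        rset.add_document(1);
        rset.add_document(4);

        // The 10 best candidate terms for expanding the query.
        Xapian::ESet eset = enquire.get_eset(10, rset);
        for (Xapian::ESetIterator t = eset.begin(); t != eset.end(); ++t) {
            std::cout << *t << " (weight " << t.get_weight() << ")\n";
        }
    }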

The progress of a query
-----------------------

Below we describe the general case of the IR model supported, including
use of a relevance set (`RSet <glossary.html#rset>`_), query expansion,
improved term weights and reranking. You don't have to use any of these
for Xapian to be useful, but they are available should you need them.

The user enters a query. This is parsed into a form the IR system
understands, and run by the IR system, which returns two lists: a list
of captions, derived from the MSet, and a list of terms, from the ESet.
If the RSet is empty, the first few documents of the MSet can be used as
a stand-in - after all, they have a good chance of being relevant! You
can read a document by clicking on the caption. (We assume the usual
screen/mouse environment.) But you can also mark a document as relevant
(change *R*) or cause a term to be added from the ESet to the query
(change *Q*). As soon as any change is made to the query environment the
query can be rerun, although you might have a front-end where nothing
happens until you click on some "Run Query" button.

In any case rerunning the query leads to a new MSet and ESet, and so to
a new display. The IR process is then an iterative one. You can delete
terms from the query or add them in; mark or unmark documents as being
relevant. Eventually you converge on the answer to the query, or at
least, the best answer the IR system can give you.

Further reading
---------------

If you want to find out more, then `"Simple, proven approaches to text
retrieval" <https://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.53.8337>`_
is a worthwhile read. It's a good introduction to probabilistic
information retrieval, which is basically what Xapian provides.

There are also several good books on the subject of Information
Retrieval:

- "*Information Retrieval*" by C. J. van Rijsbergen is well worth
  reading. It's out of print, but is available for free `from the
  author's website <http://www.dcs.gla.ac.uk/Keith/Preface.html>`_ (in
  HTML and PDF formats).
- "*Readings in Information Retrieval*" (published by Morgan Kaufmann,
  edited by Karen Sparck Jones and Peter Willett) is a collection of
  published papers covering many aspects of the subject.
- "*Managing Gigabytes*" (also published by Morgan Kaufmann, written by
  Ian H. Witten, Alistair Moffat and Timothy C. Bell) describes
  information retrieval and compression techniques.
- "*Modern Information Retrieval*" (published by Addison Wesley,
  written by Ricardo Baeza-Yates and Berthier Ribeiro-Neto) gives a
  good overview of the field. It was published more recently than the
  books above, and so covers some more recent developments.
- "*Introduction to Information Retrieval*" (published by Cambridge
  University Press, written by Christopher D. Manning, Prabhakar
  Raghavan and Hinrich Schütze) looks to be a good introductory work
  (we've not read it in detail yet). As well as the print version,
  there's an online version on `the book's companion
  website <http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html>`_.