=======================
Log Data Postprocessing
=======================

The raw log data consists of:

- data logged from the search results page rendered by the Omega query template, and
- clicks recorded when a document link is clicked on the results page.

Further processing of this raw data is handled by the ``postprocess``
script. This script combines the input search and click log files into the
final clickstream log file, which can be used to train click models for
clickstream data mining. It also creates a query file that the Xapian Letor
module can use to generate its training files.

The two functions defined in ``postprocess`` are:

- ``postprocess.generate_combined_log(search_log, clicks_log, final_log)``

  Generates the final log file.

  - **search_log:** Path to the search.log file.
  - **clicks_log:** Path to the clicks.log file.
  - **final_log:** Path to save the final.log file.

- ``postprocess.generate_query_file(final_log, query_file)``

  Generates the query file formatted as per the Xapian Letor documentation_.

  - **final_log:** Path to the final.log file.
  - **query_file:** Path to save the query.txt file.

.. _documentation: https://github.com/xapian/xapian/blob/master/xapian-letor/docs/letor.rst

These functions can be used independently as part of the ``postprocess`` module.
For example, if you only want to generate the final log file for training a
click model, you can use just the ``generate_combined_log`` function::

    from postprocess import generate_combined_log
    generate_combined_log(search_log, clicks_log, final_log)

Expect this documentation to be converted into a full Sphinx manual.

.. note:: All files are in CSV format.

``search_log`` file: Each line in this file contains four fields as follows:

1. **QueryID** - an identifier for each query.
2. **Query** - text of the query (as entered when the search button is clicked). This field may be empty (e.g. when the search button is clicked without any query text entered), in which case **Hits** will also be empty.
3. **Hits** - a list of the Xapian docids of all documents displayed in the search results.
4. **Offset** - document number of the first document on the current page of the hit list (starting from 0).

Some example entries in ``search_log``::

    821f03288846297c2cf43c34766a38f7,"book","45,54",0
    098f6bcd4621d373cade4e832627b4f6,"test","45,42",0
    d41d8cd98f00b204e9800998ecf8427e,"","",0

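
As an illustration of the field layout described above, the following sketch reads ``search_log`` rows with Python's standard ``csv`` module. The helper name ``parse_search_log`` and the dictionary keys are hypothetical and are not part of the ``postprocess`` script; the sample rows are the example entries shown above.

```python
import csv
import io

# Sample rows copied from the search_log example entries.
SAMPLE_SEARCH_LOG = '''821f03288846297c2cf43c34766a38f7,"book","45,54",0
098f6bcd4621d373cade4e832627b4f6,"test","45,42",0
d41d8cd98f00b204e9800998ecf8427e,"","",0
'''


def parse_search_log(stream):
    """Yield one dict per search_log row (hypothetical helper)."""
    for query_id, query, hits, offset in csv.reader(stream):
        yield {
            "query_id": query_id,
            "query": query,
            # An empty Hits field means no documents were displayed.
            "hits": [int(docid) for docid in hits.split(",")] if hits else [],
            "offset": int(offset),
        }


rows = list(parse_search_log(io.StringIO(SAMPLE_SEARCH_LOG)))
for row in rows:
    print(row["query_id"], repr(row["query"]), row["hits"], row["offset"])
```

The ``csv`` module handles the quoted **Hits** field, so the embedded commas do not split the row into extra columns.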
``clicks_log`` file: Each line in this file has two fields as follows:

1. **QueryID** - an identifier for each query.
2. **Hit** - the Xapian docid of a document that was clicked in the search results.

Some example entries in ``clicks_log``::

    821f03288846297c2cf43c34766a38f7,54
    821f03288846297c2cf43c34766a38f7,54
    098f6bcd4621d373cade4e832627b4f6,42

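
Because the same document can be clicked several times for one query, ``clicks_log`` may contain repeated rows. A minimal sketch of tallying them per query is shown here; ``count_clicks`` is a hypothetical helper, not a function provided by ``postprocess``.

```python
import csv
import io
from collections import Counter, defaultdict

# Sample rows copied from the clicks_log example entries.
SAMPLE_CLICKS_LOG = """821f03288846297c2cf43c34766a38f7,54
821f03288846297c2cf43c34766a38f7,54
098f6bcd4621d373cade4e832627b4f6,42
"""


def count_clicks(stream):
    """Map each QueryID to a Counter of clicked docids (hypothetical helper)."""
    counts = defaultdict(Counter)
    for query_id, docid in csv.reader(stream):
        counts[query_id][int(docid)] += 1
    return counts


click_counts = count_clicks(io.StringIO(SAMPLE_CLICKS_LOG))
for query_id, per_doc in click_counts.items():
    print(query_id, dict(per_doc))
```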
``final_log`` file: Similar to the ``search_log`` file but with an additional fifth field as follows:

1. **QueryID** - an identifier for each query.
2. **Query** - text of the query (as entered when the search button is clicked).
3. **Hits** - a list of the Xapian docids of all documents displayed in the search results.
4. **Offset** - document number of the first document on the current page of the hit list (starting from 0).
5. **Clicks** - a list of Xapian docids, each paired with the number of times the corresponding document was clicked.

Some example entries in ``final_log``::

    QueryID,Query,Hits,Offset,Clicks
    821f03288846297c2cf43c34766a38f7,book,"45,54",0,"45:0,54:2"
    098f6bcd4621d373cade4e832627b4f6,test,"45,42",0,"45:0,42:1"

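
The relationship between the three files can be sketched as follows. This is not the ``postprocess`` implementation: ``format_clicks`` and ``combine`` are hypothetical helpers, and skipping rows with an empty query is an assumption inferred from the example entries, where the empty-query row does not appear in the final log.

```python
import csv
import io
from collections import Counter


def format_clicks(hits, clicked):
    """Build the Clicks field, e.g. '45:0,54:2', in hit-list order."""
    counts = Counter(clicked)
    return ",".join(f"{docid}:{counts[docid]}" for docid in hits)


def combine(search_rows, click_rows):
    """Return final-log CSV text from parsed search and click rows."""
    clicks_by_query = {}
    for query_id, docid in click_rows:
        clicks_by_query.setdefault(query_id, []).append(docid)

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["QueryID", "Query", "Hits", "Offset", "Clicks"])
    for query_id, query, hits, offset in search_rows:
        if not query:
            # Assumption: rows with an empty query are left out.
            continue
        hit_ids = hits.split(",") if hits else []
        clicks = format_clicks(hit_ids, clicks_by_query.get(query_id, []))
        writer.writerow([query_id, query, hits, offset, clicks])
    return out.getvalue()


search_rows = [
    ("821f03288846297c2cf43c34766a38f7", "book", "45,54", "0"),
    ("098f6bcd4621d373cade4e832627b4f6", "test", "45,42", "0"),
    ("d41d8cd98f00b204e9800998ecf8427e", "", "", "0"),
]
click_rows = [
    ("821f03288846297c2cf43c34766a38f7", "54"),
    ("821f03288846297c2cf43c34766a38f7", "54"),
    ("098f6bcd4621d373cade4e832627b4f6", "42"),
]
final_text = combine(search_rows, click_rows)
print(final_text, end="")
```

Documents that were displayed but never clicked still appear in **Clicks** with a count of 0, which keeps the field aligned with **Hits**.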
``query.txt`` file: Each line in this file contains two fields as follows:

1. **QueryID** - an identifier for each query.
2. **Query** - text of the query.

Some example entries in ``query.txt``::

    821f03288846297c2cf43c34766a38f7,book
    098f6bcd4621d373cade4e832627b4f6,test

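
Since ``query.txt`` holds the first two columns of the final log, it can be derived with a few lines of Python. The sketch below assumes one output line per final-log row and the two-field layout shown in the example above; ``extract_queries`` is a hypothetical name, and the real ``generate_query_file`` may behave differently (for instance, in how it handles repeated queries).

```python
import csv
import io

# Sample rows copied from the final log example entries.
SAMPLE_FINAL_LOG = '''QueryID,Query,Hits,Offset,Clicks
821f03288846297c2cf43c34766a38f7,book,"45,54",0,"45:0,54:2"
098f6bcd4621d373cade4e832627b4f6,test,"45,42",0,"45:0,42:1"
'''


def extract_queries(final_stream, query_stream):
    """Write 'QueryID,Query' lines for every row of the final log."""
    reader = csv.DictReader(final_stream)  # uses the header row for keys
    writer = csv.writer(query_stream, lineterminator="\n")
    for row in reader:
        writer.writerow([row["QueryID"], row["Query"]])


out = io.StringIO()
extract_queries(io.StringIO(SAMPLE_FINAL_LOG), out)
query_text = out.getvalue()
print(query_text, end="")
```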