=====================================
Add support for a new format to Omega
=====================================

We can add support for a new file format to Omega through an external filter or a library. Doing so takes a few steps.

First of all, we need a MIME type for the new file format. Omega uses MIME types to identify the format of a file and to decide how to handle it. The official registry is at http://www.iana.org/assignments/media-types/ but not every file type has an official MIME type. In that case, a de facto standard MIME type with an "x-" prefix often exists. A good way to look for one is to ask the ``file`` utility to identify a file (Omega uses the same library as ``file`` to identify files when it does not recognise the extension)::

  file --mime-type example.fb2

which responds::

  example.fb2: text/xml

Sometimes ``file`` just returns a generic answer (most commonly ``text/plain`` or ``application/octet-stream``), and occasionally it misidentifies a file. In either case, we can associate the file extension with a particular MIME type in ``mimemap.tokens``. If a format uses multiple extensions (such as ``htm`` and ``html`` for HTML), add an entry for each.

When indexing a file whose extension is in upper case or mixed case, omindex first checks for an exact match for the extension. If there is none, it converts the extension to lower case and tries again, so just add the extension in lower case unless different cases actually have different meanings.

In this example, ``text/xml`` is too broad, so we can associate ``fb2`` with ``application/x-fictionbook+xml``, which is much more specific.
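
The two-stage extension lookup described above (exact match first, then a lower-cased retry) can be sketched as follows. This is only an illustration and not omindex's actual code: ``MimeMap`` and ``lookup_mime_type`` are hypothetical names, and a ``std::map`` stands in for whatever table omindex builds from ``mimemap.tokens``.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <string>

// Hypothetical stand-in for the extension -> MIME type table; the data
// structure omindex really uses may differ.
using MimeMap = std::map<std::string, std::string>;

// Return the MIME type for `ext`, or an empty string if it is unknown.
// An exact match is tried first; only if that fails is the extension
// lower-cased and looked up again.
std::string lookup_mime_type(const MimeMap& mime_map, std::string ext)
{
    auto it = mime_map.find(ext);
    if (it == mime_map.end()) {
        std::transform(ext.begin(), ext.end(), ext.begin(),
                       [](unsigned char ch) {
                           return static_cast<char>(std::tolower(ch));
                       });
        it = mime_map.find(ext);
    }
    return it == mime_map.end() ? std::string() : it->second;
}
```

With a table that maps ``fb2`` to ``application/x-fictionbook+xml``, looking up ``FB2`` or ``Fb2`` finds the same entry, which is why a single lower-case entry is normally enough.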

Extracted data variables
========================

To add a new filter and index a document, you will need to set some C++ variables in ``index_file.cc``:

* **dump**: contains the "body" text of the document.
* **title**: stores the title of the document.
* **author**: stores the author of the document.
* **keywords**: additional text to index, but not to show to the user.
* **sample**: if set, this is used to generate the static document "snippet" which is stored; if not set, the snippet is generated from ``dump``.
* **topic**: stores the topic of the document.

It is not necessary to set all of these variables, but try to set as many as you can.
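
To make the relationship between ``sample`` and ``dump`` concrete, here is a minimal sketch. It assumes the variables are grouped in a struct, which is purely for illustration: ``ExtractedData`` and ``snippet_source`` are hypothetical names, and in ``index_file.cc`` these are separate variables.

```cpp
#include <string>

// Hypothetical grouping of the variables listed above.
struct ExtractedData {
    std::string dump;      // "body" text of the document
    std::string title;     // title of the document
    std::string author;    // author of the document
    std::string keywords;  // extra text to index but not to show
    std::string sample;    // optional source text for the stored snippet
    std::string topic;     // topic of the document
};

// Choose the text from which the stored snippet is generated:
// `sample` if the filter set it, otherwise `dump`.
const std::string& snippet_source(const ExtractedData& data)
{
    return data.sample.empty() ? data.dump : data.sample;
}
```

A filter that only sets ``dump`` therefore still gets a snippet, while one that also sets ``sample`` controls what the snippet is built from.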

Using an external filter
========================

To add a new filter to Omega, we have to follow a series of steps:

1. The first job is to find a good external filter. Some formats have several filters to choose from. The attributes which interest us are reliable extraction of the text with word breaks in the right places, and support for Unicode (ideally as UTF-8). If you have several choices, try them on some sample files.

   The ideal (and simplest) case is a filter which can produce plain text output in UTF-8. It may require special command line options to do so, in which case work out what they are from the documentation or source code, and check that the output is indeed as documented.

   It is most efficient if the filter program can write to stdout, but output to a temporary file works too.

   For example, if we want to use ``python2text`` for handling ``text/x-python``, we should use ``python2text --utf8 --stdout``.

2. Then, we need to add the filter to Omega. Omega lets us specify additional external filters on the command line using ``--filter=M[,[T][,C]]:CMD``, which processes files with MIME Content-Type M through command CMD, producing output (on stdout or in a temporary file) in format T (a Content-Type or file extension; currently txt (the default), html or svg) and character encoding C (default: UTF-8). For example
   ::

     --filter=text/x-foo,text/html,utf-16:'foo2utf16 --content %f %t'

   In this case, we handle ``text/x-foo`` files with ``foo2utf16``, which produces HTML in UTF-16 encoding in a temporary file. Note that ``%f`` is replaced with the filename and ``%t`` with a temporary output file (created by omindex at runtime, with an extension that reflects the expected output format). This tells omindex to index files with content-type ``text/x-foo`` by running
   ::

     foo2utf16 --content path/to/file path/to/temporary/file.html

   If you don't include ``%f``, then the filename of the file to be extracted is appended to the command, separated by a space, and if you don't use ``%t``, then omindex expects output on stdout. You can use ``%%`` should you need a literal % in the command.

   If you specify ``false`` as the command in ``--filter``, omindex will skip files with the specified MIME type. If you specify ``true`` as the command in ``--filter``, omindex won't try to extract text from the file, but will index it such that it can be searched for via metadata which comes from the filing system (filename, extension, MIME content-type, last modified time, size).

   If we want to add the filter permanently, we can add a new entry in ``index_add_default_filters`` in ``index_file.cc``. Continuing the example
   ::

     index_command("text/x-foo", Filter("foo2utf16 --content %f %t", "text/html", "utf-16"));

   ``Filter`` supports further options (see ``index_file.h``).

3. In some cases, you will have to run several programs for each file or do some extra work. You will then either need to put together a script which fits what omindex supports, or else modify the source code in ``index_file.cc`` by adding a test for the new MIME type to the long if/else-if chain inside the ``index_mimetype`` function. New formats should generally go at the end, unless they are very common
   ::

     } else if (mimetype == "text/x-foo") {

   The filename of the file is in ``file``. The code you add should set the variables described in the `Extracted data variables`_ section above.
   ::

     // Temporary file which will receive the filter's HTML output.
     string tmpfile = get_tmpfile("tmp.html");
     if (tmpfile.empty())
       return;
     string cmd = "foo2utf16 --content";
     append_filename_argument(cmd, file);
     append_filename_argument(cmd, tmpfile);
     MyHtmlParser p;
     try {
       // The filter writes to tmpfile, so anything on stdout is discarded.
       (void)stdout_to_string(cmd);
       dump = file_to_string(tmpfile);
       p.parse_html(dump, "UTF-16", false);
       unlink(tmpfile.c_str());
     } catch (const ReadError&) {
       // The command failed: skip this file, but still clean up.
       skip_cmd_failed(urlterm, context, cmd, d.get_size(), d.get_mtime());
       unlink(tmpfile.c_str());
       return;
     } catch (...) {
       unlink(tmpfile.c_str());
       throw;
     }
     // Copy the fields extracted by the HTML parser.
     dump = p.dump;
     title = p.title;
     author = p.author;
     keywords = p.keywords;
     topic = p.topic;
     sample = p.sample;

   The ``stdout_to_string`` function runs a command and captures its output as a C++ ``std::string``. If the command is not installed on ``PATH``, omindex detects this automatically and disables support for the MIME type in the current run, so it will only try the first file of each such type.

   If UTF-8 output is not supported, pick the best (or only!) supported encoding and then convert the output to UTF-8. To do this, once you have ``dump``, convert it like so (replacing "UTF-16" with the character set which is produced)
   ::

     convert_to_utf8(dump, "UTF-16");

   In the example above, ``MyHtmlParser`` converts the text of the file to UTF-8 if necessary.
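
The ``--filter`` handling described in step 2 can be sketched in two small functions: one splits ``M[,[T][,C]]:CMD`` into its parts and applies the documented defaults, the other expands ``%f``, ``%t`` and ``%%`` in the command. This is a simplified illustration, not omindex's implementation: ``FilterSpec``, ``parse_filter_spec`` and ``expand_filter_command`` are hypothetical names, unknown ``%`` sequences are passed through unchanged as an assumption, and filenames are not shell-quoted here, which real code would need to do.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical record for one parsed --filter=M[,[T][,C]]:CMD option.
struct FilterSpec {
    std::string mimetype;             // M: MIME type of the input files
    std::string output_type = "txt";  // T: format of the filter's output
    std::string charset = "UTF-8";    // C: character encoding of the output
    std::string command;              // CMD: command template
};

// Split the option value into its parts, applying the documented defaults.
FilterSpec parse_filter_spec(const std::string& arg)
{
    const std::string::size_type npos = std::string::npos;
    // MIME types cannot contain ':', so the first one ends the M,T,C part.
    const auto colon = arg.find(':');
    if (colon == npos)
        throw std::invalid_argument("expected M[,[T][,C]]:CMD");

    FilterSpec spec;
    spec.command = arg.substr(colon + 1);
    const std::string head = arg.substr(0, colon);

    const auto comma1 = head.find(',');
    spec.mimetype = head.substr(0, comma1);
    if (spec.mimetype.empty())
        throw std::invalid_argument("missing MIME type");
    if (comma1 == npos)
        return spec;

    const auto comma2 = head.find(',', comma1 + 1);
    const std::string type =
        head.substr(comma1 + 1, comma2 == npos ? npos : comma2 - comma1 - 1);
    if (!type.empty())
        spec.output_type = type;
    if (comma2 != npos && comma2 + 1 < head.size())
        spec.charset = head.substr(comma2 + 1);
    return spec;
}

// Expand %f (input filename), %t (temporary output file) and %% (literal %).
// If the template has no %f, the filename is appended after a space.
std::string expand_filter_command(const std::string& cmd,
                                  const std::string& filename,
                                  const std::string& tmpfile)
{
    std::string result;
    bool saw_filename = false;
    for (std::string::size_type i = 0; i < cmd.size(); ++i) {
        if (cmd[i] != '%' || i + 1 == cmd.size()) {
            result += cmd[i];
            continue;
        }
        switch (cmd[++i]) {
        case 'f':
            result += filename;
            saw_filename = true;
            break;
        case 't':
            result += tmpfile;
            break;
        case '%':
            result += '%';
            break;
        default:
            // Unknown sequence: keep it verbatim (assumption of this sketch).
            result += '%';
            result += cmd[i];
        }
    }
    if (!saw_filename) {
        result += ' ';
        result += filename;
    }
    return result;
}
```

Parsing ``text/x-foo,text/html,utf-16:foo2utf16 --content %f %t`` gives ``text/x-foo``, ``text/html``, ``utf-16`` and the command template; expanding that template with ``path/to/file`` and ``path/to/temporary/file.html`` produces the command line shown in step 2. A specification such as ``text/x-python:python2text --utf8 --stdout`` keeps the defaults (``txt`` and ``UTF-8``) and has the filename appended to the command.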

If you find a reliable external filter or library and think it might be useful to other people, please let us know about it.

Submitting a patch
==================

Once you are happy with how your handler/filter works, please submit a patch so we can include it in future releases (creating a new trac ticket and attaching the patch is best). Before doing so, please also:

- Add the format and the extensions recognised for it to the list in ``docs/overview.rst``.
- Add the MIME type to ``mimemap.tokens``.

It would be really useful if you are able to supply some sample files with a licence which allows redistribution so we can test the filter on them. Ideally these would contain non-ASCII characters so that we know Unicode support works.

You can read more about `how to contribute to Xapian <https://xapian-developer-guide.readthedocs.io/en/latest/contributing/index.html>`_.

If you have problems you can ask for help on the `IRC channel or mailing list <https://xapian.org/lists>`_.