=====================================
Add support for a new format to Omega
=====================================

We can add support for a new file format to Omega through an external filter or a library. For this, we must follow a series of steps.

First of all, we need a MIME type for the new file format. Omega uses MIME types to identify the format of a file and handle it appropriately. The official registry is at http://www.iana.org/assignments/media-types/ but not all file types have an official MIME type. In that case, a de facto standard "x-" prefixed MIME type often exists. A good way to find one is to ask the ``file`` utility to identify a file (Omega uses the same library as ``file`` to identify files when it does not recognise the extension)::

  file --mime-type example.fb2

which responds::

  example.fb2: text/xml

Sometimes ``file`` just returns a generic answer (most commonly ``text/plain`` or ``application/octet-stream``) and occasionally it misidentifies a file. In that case, we can associate the file extension with a particular MIME type in 'mimemap.tokens'. If multiple extensions are used for a format (such as ``htm`` and ``html`` for HTML) then add an entry for each.

When indexing a filename which has an extension in upper-case or mixed-case, omindex will check for an exact match for the extension, and if none is found, it will force the extension to lower-case and try again. So just add the extension in lower-case unless different cases actually have different meanings.
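
The two-step extension match described above can be sketched as follows. This is a minimal illustration, not omindex's actual code: ``MimeMap`` and ``lookup_mime_type`` are hypothetical names, and the table merely stands in for the contents of 'mimemap.tokens'::

  #include <algorithm>
  #include <cctype>
  #include <map>
  #include <string>

  // Hypothetical stand-in for the extension-to-MIME-type table.
  using MimeMap = std::map<std::string, std::string>;

  // Look up ext exactly; if that fails, retry with ext forced to lower-case.
  // Returns an empty string when neither attempt matches.
  std::string
  lookup_mime_type(const MimeMap& mime_map, std::string ext)
  {
      auto it = mime_map.find(ext);
      if (it != mime_map.end())
          return it->second;

      std::transform(ext.begin(), ext.end(), ext.begin(),
                     [](unsigned char ch) {
                         return static_cast<char>(std::tolower(ch));
                     });
      it = mime_map.find(ext);
      return it != mime_map.end() ? it->second : std::string();
  }

With a table holding separate (hypothetical) entries for ``C`` and ``c``, a file ending in ``.C`` matches the upper-case entry exactly, while ``.FB2`` only matches an ``fb2`` entry after lower-casing.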

In this example, ``text/xml`` is too broad, so we can associate ``fb2`` with ``application/x-fictionbook+xml``, which is much more specific.

Then decide whether you are going to add the new format by `Using a library`_ or `Using an external filter`_.

Extracted data variables
========================

To add a new filter and index a document, you will need to set some C++ variables in ``index_file.cc``:

* **dump**: contains the "body" text of the document.
* **title**: stores the title of the document.
* **author**: stores the author of the document.
* **keywords**: additional text to index, but not to show the user.
* **sample**: if set, this is used to generate the static document "snippet" which is stored; if not set, this is generated from dump.
* **topic**: stores the topic of the document.

It is not necessary to set all the variables, but try to set as many as you can.
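
As a rough picture of how these fields relate, the sketch below groups them in a struct. In ``index_file.cc`` they are separate variables; the ``ExtractedData`` struct and the ``snippet_source`` helper are hypothetical, and treating an empty string as "not set" is an assumption of this sketch::

  #include <string>

  // Hypothetical grouping of the variables listed above.
  struct ExtractedData {
      std::string dump;      // "Body" text of the document.
      std::string title;
      std::string author;
      std::string keywords;  // Indexed, but not shown to the user.
      std::string sample;    // Optional source for the stored snippet.
      std::string topic;
  };

  // Return the text the stored snippet would be generated from:
  // sample when it is set (assumed to mean non-empty), otherwise dump.
  const std::string&
  snippet_source(const ExtractedData& data)
  {
      return data.sample.empty() ? data.dump : data.sample;
  }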

Using a library
===============

For safety reasons, external libraries must not be used directly from the Omega source code. A library may contain bugs which affect the correct operation of the program, so Omega isolates external libraries in subprocesses using the ``worker`` and ``assistant`` modules. These modules take care of the safety measures and use handlers to access the libraries.

1. To begin with, we need a library to extract information from the desired format. If many options are available, try to choose a library which has an active community, supports UTF-8 encoding, and has a licence compatible with the MIT/X licence and GNU GPL version 2 and later. If UTF-8 output is not supported, you can convert the encoding using ``convert_to_utf8`` (replacing "ISO-8859-1" with the character set which is produced):
   ::

     convert_to_utf8(text, "ISO-8859-1");

2. Once you have the library and the MIME type, it is time to modify the code. We have to create a new handler, which is the process omindex uses to access the library. To do this, create a file ``handler_yourlibrary.cc`` which includes ``handler.h`` and defines the function ``extract`` (there are some examples in xapian-applications/omega, such as 'handler_tesseract.cc'):
   ::

     void
     extract(const std::string& filename,
             const std::string& mimetype);

   In the function body you will use the library to extract the necessary information and call ``response()``, passing each piece of information. These can be passed as C++ ``std::string``, or as ``const char*`` pointers with or without lengths (if without, the strings must be nul-terminated).

   If there's an error, call ``fail()`` instead. If you don't call ``response()`` or ``fail()`` before returning, the harness will effectively call ``fail("?")`` for you.

   You can get more information about these functions in 'xapian-applications/omega/handler.h'.

3. After the handler is implemented, the build system must be updated. In particular, it is necessary to modify 'configure.ac' and 'Makefile.am'.

   In 'configure.ac', we need to check if the library is available using ``PKG_CHECK_MODULES`` and define a compilation variable if we can use it. It is highly recommended to follow the example of the other handlers. Some macros that you might find useful are ``AC_CHECK_HEADERS``, ``AC_DEFINE``, ``AC_COMPILE_IFELSE`` and ``AC_LINK_IFELSE``.

   In 'Makefile.am', we should add the program to ``EXTRA_PROGRAMS`` and define the variable ``omindex_yourlibrary_SOURCES``, plus ``omindex_yourlibrary_LDADD`` and ``omindex_yourlibrary_CXXFLAGS`` if they are necessary.

   For example, if we want to add 'omindex_tesseract' we will get
   ::

     dnl Check if libtesseract is available.
     PKG_CHECK_MODULES([TESSERACT], [tesseract, lept], [
         AC_DEFINE(HAVE_TESSERACT, 1, [Define HAVE_TESSERACT if the library is available])
         OMINDEX_MODULES="$OMINDEX_MODULES omindex_tesseract"],[ ])

   in 'configure.ac', and
   ::

     EXTRA_PROGRAMS = omindex_tesseract

   and
   ::

     omindex_tesseract_SOURCES = assistant.cc worker_comms.cc handler_tesseract.cc
     omindex_tesseract_LDADD = $(TESSERACT_LIBS)
     omindex_tesseract_CXXFLAGS = $(TESSERACT_CFLAGS)

   in 'Makefile.am'.

4. The last step is adding a new worker for the MIME type to omindex. We can do this in the function ``add_default_libraries`` in 'index_file.cc'. Here we will need the compilation variable which was defined in 'configure.ac'.

   Continuing with the tesseract example:
   ::

     add_default_libraries() {
     #if defined HAVE_TESSERACT
         Worker* omindex_tesseract = new Worker("omindex_tesseract");
         index_library("image/png", omindex_tesseract);
     #endif

Finally, we can compile our program to check that everything is okay. If the modifications are correct, we will find a new executable ``omindex_yourlibrary`` in the working directory.
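
The contract between a handler and its harness described in step 2 can be pictured with the stand-alone mock below. Only the names ``extract``, ``response`` and ``fail`` come from the text above; their signatures here are simplified assumptions (see 'handler.h' for the real ones), and ``Reply``, ``run_handler`` and the ``text/x-demo`` types are hypothetical stand-ins::

  #include <string>

  // Hypothetical record of what the handler reported back.
  struct Reply {
      bool sent = false;   // Has response() or fail() been called?
      bool ok = false;     // True only after response().
      std::string dump;
      std::string title;
      std::string error;
  };

  static Reply reply;

  // Simplified stand-in for response(): the real one accepts more fields.
  void
  response(const std::string& dump, const std::string& title)
  {
      reply.sent = true;
      reply.ok = true;
      reply.dump = dump;
      reply.title = title;
  }

  // Simplified stand-in for fail().
  void
  fail(const std::string& error)
  {
      reply.sent = true;
      reply.ok = false;
      reply.error = error;
  }

  // Example handler: a real one would call the library here.
  void
  extract(const std::string& filename, const std::string& mimetype)
  {
      if (mimetype == "text/x-demo") {
          response("body text of " + filename, "Demo title");
      } else if (mimetype == "text/x-broken") {
          fail("unsupported variant");
      }
      // Any other type falls through without replying.
  }

  // Mock harness: runs extract() and reports fail("?") if the handler
  // returned without calling response() or fail().
  Reply
  run_handler(const std::string& filename, const std::string& mimetype)
  {
      reply = Reply();
      extract(filename, mimetype);
      if (!reply.sent)
          fail("?");
      return reply;
  }

In this mock, a handler which returns silently for an unexpected type is still reported as a failure with the message ``?``, which is why it is worth calling ``fail()`` with a meaningful message on every error path.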

Using an external filter
========================

To add a new filter to Omega we have to follow a series of steps:

1. The first job is to find a good external filter. Some formats have several filters to choose from. The attributes which interest us are reliably extracting the text with word breaks in the right places, and supporting Unicode (ideally as UTF-8). If you have several choices, try them on some sample files.

   The ideal (and simplest) case is that you have a filter which can produce plain text output in UTF-8. It may require special command line options to do so, in which case work out what they are from the documentation or source code, and check that the output is indeed as documented.

   It is most efficient if the filter program can write to stdout, but output to a temporary file works too.

   For example, if we want to use ``python2text`` for handling ``text/x-python``, we should use ``python2text --utf8 --stdout``.

2. Then, we need to add the filter to Omega. omindex lets you specify additional external filters on the command line using ``--filter=M[,[T][,C]]:CMD``, which processes files with MIME Content-Type M through command CMD, producing output (on stdout or in a temporary file) with format T (Content-Type or file extension; currently txt (default), html or svg) in character encoding C (default: UTF-8). For example
   ::

     --filter=text/x-foo,text/html,utf-16:'foo2utf16 --content %f %t'

   In this case, ``text/x-foo`` files are handled by ``foo2utf16``, which produces HTML in UTF-16 encoding in a temporary file. Note that ``%f`` will be replaced with the filename and ``%t`` with a temporary output file (which omindex creates at runtime, with an extension reflecting the expected output format). This tells omindex to index files with content-type ``text/x-foo`` by running
   ::

     foo2utf16 --content path/to/file path/to/temporary/file.html

   If you don't include ``%f``, then the filename of the file to be extracted will be appended to the command, separated by a space. If you don't use ``%t``, then omindex will expect output on stdout. Use ``%%`` should you need a literal ``%`` in the command.

   If you specify ``false`` as the command in ``--filter``, omindex will skip files with the specified MIME type. If you specify ``true`` as the command in ``--filter``, omindex won't try to extract text from the file, but will index it such that it can be searched for via metadata which comes from the filing system (filename, extension, MIME content-type, last modified time, size).

   If we want to add the filter permanently, we can add a new entry in ``index_add_default_filters`` in 'index_file.cc'. Continuing with the example
   ::

     index_command("text/x-foo", Filter("foo2utf16 --content %f %t", "text/html", "utf-16"));

   There are more options that we can use for Filter (see 'index_file.h').

3. In some cases, you will have to run several programs for each file or do some extra work, so you will either need to put together a script which fits what omindex supports, or else modify the source code in ``index_file.cc`` by adding a test for the new MIME type to the long if/else-if chain inside the ``index_mimetype`` function. New formats should generally go at the end, unless they are very common
   ::

     } else if (mimetype == "text/x-foo") {

   The filename of the file is in ``file``. The code you add should set the variables described in the `Extracted data variables`_ section above.
   ::

     // Create a temporary file to receive the filter's HTML output.
     string tmpfile = get_tmpfile("tmp.html");
     if (tmpfile.empty())
       return;
     string cmd = "foo2utf16 --content";
     append_filename_argument(cmd, file);
     append_filename_argument(cmd, tmpfile);
     HtmlParser p;
     try {
       // The filter writes to tmpfile, so its stdout is discarded.
       (void)stdout_to_string(cmd);
       dump = file_to_string(tmpfile);
       p.parse(dump, "UTF-16", false);
       unlink(tmpfile.c_str());
     } catch (const ReadError&) {
       // The command failed: skip this file and clean up.
       skip_cmd_failed(urlterm, context, cmd, d.get_size(), d.get_mtime());
       unlink(tmpfile.c_str());
       return;
     } catch (...) {
       unlink(tmpfile.c_str());
       throw;
     }
     // Copy the parsed fields into the extracted data variables.
     dump = p.dump;
     title = p.title;
     author = p.author;
     keywords = p.keywords;
     topic = p.topic;
     sample = p.sample;

   The ``stdout_to_string`` function runs a command and captures its output as a C++ ``std::string``. If the command is not installed on PATH, omindex detects this automatically and disables support for the MIME type in the current run, so it will only try the first file of each such type.

   If UTF-8 output is not supported, pick the best (or only!) supported encoding and then convert the output to UTF-8. To do this, once you have ``dump``, convert it like so (replacing "UTF-16" with the character set which is produced)
   ::

     convert_to_utf8(dump, "UTF-16");

   In the example above, ``HtmlParser`` will convert the text of the file to UTF-8 if necessary.
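
The placeholder rules from step 2 can be sketched as a small stand-alone function. This is an illustration rather than omindex's implementation: ``expand_filter_command`` and ``shell_quote`` are hypothetical names, and both the POSIX shell quoting of filenames and the handling of unknown ``%`` sequences are assumptions of the sketch::

  #include <string>

  // Quote arg for a POSIX shell: wrap it in single quotes and escape any
  // embedded single quote as '\''.
  std::string
  shell_quote(const std::string& arg)
  {
      std::string out = "'";
      for (char ch : arg) {
          if (ch == '\'')
              out += "'\\''";
          else
              out += ch;
      }
      out += '\'';
      return out;
  }

  // Expand %f (input file), %t (temporary output file) and %% (literal %).
  // If cmd contains no %f, the quoted input filename is appended after a
  // space.
  std::string
  expand_filter_command(const std::string& cmd,
                        const std::string& input,
                        const std::string& tmp_output)
  {
      std::string out;
      bool used_input = false;
      for (std::string::size_type i = 0; i < cmd.size(); ++i) {
          if (cmd[i] != '%' || i + 1 == cmd.size()) {
              out += cmd[i];
              continue;
          }
          switch (cmd[++i]) {
              case 'f':
                  out += shell_quote(input);
                  used_input = true;
                  break;
              case 't':
                  out += shell_quote(tmp_output);
                  break;
              case '%':
                  out += '%';
                  break;
              default:
                  // Unknown sequence: keep it verbatim (an assumption).
                  out += '%';
                  out += cmd[i];
          }
      }
      if (!used_input) {
          out += ' ';
          out += shell_quote(input);
      }
      return out;
  }

For the example filter, ``expand_filter_command("foo2utf16 --content %f %t", "in.foo", "/tmp/out.html")`` gives ``foo2utf16 --content 'in.foo' '/tmp/out.html'``, while a command without ``%f`` simply has the quoted filename appended.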

If you find a reliable external filter or library and think it might be useful to other people, please let us know about it.

Submitting a patch
==================

Once you are happy with how your handler/filter works, please submit a patch so we can include it in future releases (creating a new Trac ticket and attaching the patch is best). Before doing so, please also:

- Add the format and the extensions recognised for it to the list in docs/overview.rst.
- Add the MIME type to 'mimemap.tokens'.

It would be really useful if you are able to supply some sample files with a licence which allows redistribution so we can test the filters on them, ideally ones with non-ASCII characters so that we know Unicode support works.

You can read more about `how to contribute to Xapian <https://xapian-developer-guide.readthedocs.io/en/latest/contributing/index.html>`_.

If you have problems you can ask for help on the `IRC channel or mailing list <https://xapian.org/lists>`_.