xapian-applications/omega/docs/encodings.rst

===================
Character Encodings
===================

The omega CGI assumes that text in the database is encoded as UTF-8.

If you are writing your own search form, it is best to ensure that the query
will be sent as UTF-8.  By default, web browsers will send the form parameters
with the same encoding as the page the form is on (and the default encoding for
HTML pages is ISO-8859-1).  You can override this by adding the attribute
`accept-charset="UTF-8"` to the `<form>` tag of your search form (and it's
safe to do this even in a page which is explicitly UTF-8).

If the form parameters get sent as ISO-8859-1, there are two issues:

The first is that characters which aren't representable in ISO-8859-1 get
sent as numeric HTML entities, such as `&#25991;`.  But there's no way
to distinguish these from the same text literally entered into the form
by the user.

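The numeric-entity ambiguity can be reproduced with a small Python sketch.
It assumes that Python's `xmlcharrefreplace` error handler is a fair
approximation of what a browser does when a form is submitted as ISO-8859-1;
the helper name `encode_like_latin1_form` is hypothetical and is not part of
Omega.

```python
def encode_like_latin1_form(text: str) -> bytes:
    """Encode form text as ISO-8859-1, replacing unrepresentable characters
    with numeric character references (an approximation of browser
    behaviour, not Omega code)."""
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


# The user typed the single character U+6587.
typed_character = encode_like_latin1_form("\u6587")
# The user literally typed the eight ASCII characters "&#25991;".
typed_entity = encode_like_latin1_form("&#25991;")

# Both submissions arrive as identical bytes, so the receiving side cannot
# tell which one the user meant.
print(typed_character == typed_entity)
```

Both inputs produce the bytes `&#25991;`, so the comparison prints `True`.
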
The second is that Omega can't simply re-encode the form data, as the
encoding used isn't specified in the form submission (whether that is by
GET or POST).

If Xapian is asked to parse a query string which isn't valid UTF-8, it will
fall back to handling it as ISO-8859-1, which will usually do the right thing
for queries which are representable in ISO-8859-1.  However, things like
boolean filters in `B` parameters will be used as-is, so any which contain
non-ASCII characters won't work properly.

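The fallback described above can be illustrated roughly as follows.  This is
a simplified sketch which decodes the whole string at once; Xapian's own
handling may differ in detail, and the function name `decode_query` is
hypothetical.

```python
def decode_query(raw: bytes) -> str:
    """Decode raw query bytes as UTF-8, falling back to ISO-8859-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte value is defined in ISO-8859-1, so this cannot fail.
        return raw.decode("iso-8859-1")


# Valid UTF-8 is decoded as UTF-8.
print(decode_query("café".encode("utf-8")))
# A lone 0xE9 byte is not valid UTF-8, so the fallback is used.
print(decode_query("café".encode("iso-8859-1")))
```

Both calls recover the same text, which is why queries representable in
ISO-8859-1 usually still work.
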
omindex
=======

When using omindex to index, the text in the database should automatically be
UTF-8: omindex converts text extracted from documents to UTF-8 if it isn't
already in this encoding.  There's built-in code to handle the following:
ISO-8859-1, ISO-8859-15, WINDOWS-1252, CP-1252, UTF-16, UCS-2, UTF-16BE,
UCS-2BE, UTF-16LE, UCS-2LE.  If omindex is built with iconv, many other
encodings can be handled too.

For plain text, omindex looks for a Byte Order Mark (BOM) to recognise
UTF-8, UTF-16BE, UTF-16LE, UTF-32BE and UTF-32LE.  Otherwise files are
assumed to be UTF-8, or ISO-8859-1 if they contain byte sequences which
aren't valid UTF-8.

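The plain-text detection order can be sketched in Python as below.  This is
an illustration of the rules just described, not omindex's actual code; the
names `BOMS` and `decode_plain_text` are hypothetical, and the order in which
omindex itself tests the BOMs is an assumption.

```python
import codecs

# Test the 4-byte BOMs first: the UTF-32LE BOM begins with the same two
# bytes as the UTF-16LE BOM.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def decode_plain_text(data: bytes) -> str:
    """Pick an encoding from a BOM if present, else try UTF-8 and fall
    back to ISO-8859-1."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            # Drop the BOM itself before decoding the remaining bytes.
            return data[len(bom):].decode(encoding)
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("iso-8859-1")


sample = codecs.BOM_UTF16_LE + "hello".encode("utf-16-le")
print(decode_plain_text(sample))
```
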
When omindex builds URLs, it percent-encodes bytes according to RFC 3986.
On modern systems, filenames are usually encoded in UTF-8, and the bytes
which make up multi-byte UTF-8 sequences will get encoded.  In Omega 1.2.21
or 1.3.3 and later, the OmegaScript `$prettyurl` command will reverse this
encoding for valid UTF-8 sequences, and so filenames should be shown with only
the bare minimum of characters escaped.

However, if your filenames aren't encoded in UTF-8, `$prettyurl` will leave
alone percent-encoded bytes for non-ASCII characters (it is possible it could
find a valid UTF-8 sequence in other data and so show the wrong character, but
this is unlikely in real-world data).  Everything should still work, though
such filenames will be shown with their percent-encoded bytes.

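The round trip can be sketched in Python as follows.  This is not Omega's
implementation: `percent_encode_path` and `pretty_url` are hypothetical
helpers, the sketch only deals with escapes for non-ASCII bytes, and it
treats each run of such escapes as a unit, which is a simplification.

```python
import re
from urllib.parse import quote


def percent_encode_path(path_bytes: bytes) -> str:
    """Percent-encode the raw bytes of a path, keeping '/' as-is."""
    return quote(path_bytes, safe="/")


# A run of one or more %XX escapes whose byte values are 0x80 or above.
_NON_ASCII_RUN = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def pretty_url(url: str) -> str:
    """Unescape runs of non-ASCII escapes, but only when the run decodes
    as valid UTF-8; anything else is left percent-encoded."""
    def replace(match: re.Match) -> str:
        raw = bytes.fromhex(match.group(0).replace("%", ""))
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            return match.group(0)
    return _NON_ASCII_RUN.sub(replace, url)


encoded = percent_encode_path("café.txt".encode("utf-8"))
print(encoded)
print(pretty_url(encoded))
# A filename stored as ISO-8859-1 yields %E9, which is left untouched.
print(pretty_url("caf%E9.txt"))
```
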
scriptindex
===========

When using scriptindex, you should ensure that text you feed to scriptindex is
UTF-8.
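
If your source data is in another encoding which you already know, one way
to meet this requirement is to re-encode it before indexing.  The sketch
below is a minimal illustration in Python; the helper name `to_utf8` and the
sample data are hypothetical.

```python
def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known source encoding into UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


# 0xA4 is the euro sign in ISO-8859-15.
converted = to_utf8(b"price: 10 \xa4", "iso-8859-15")
print(converted.decode("utf-8"))
```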