The omega CGI assumes that text in the database is encoded as UTF-8.

If you are writing your own search form, it is best to ensure that the query
is sent as UTF-8. By default, web browsers send form parameters
in the same encoding as the page the form is on (and the default encoding for
HTML pages is ISO-8859-1). You can override this by adding the attribute
`accept-charset="UTF-8"` to the `<form>` tag of your search form (it is
safe to do this even in a page which is explicitly UTF-8).
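
To see why the form's encoding matters, the following sketch uses Python's `urllib.parse.urlencode` to mimic how a form field is serialised under each encoding. The field name `P` and the query text are illustrative values only, and a real browser may differ in details such as which characters it leaves unescaped.

```python
from urllib.parse import urlencode

# Illustrative form data: a single query field containing a non-ASCII character.
query = {"P": "café"}

# What is sent when the form is submitted as UTF-8.
as_utf8 = urlencode(query, encoding="utf-8")

# What is sent when the form is submitted as ISO-8859-1.
as_latin1 = urlencode(query, encoding="iso-8859-1")

print(as_utf8)    # P=caf%C3%A9
print(as_latin1)  # P=caf%E9
```

The same text arrives as different bytes, and nothing in the submission itself says which encoding was used.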
If the form parameters get sent as ISO-8859-1, there are several issues:

The first is that characters which aren't representable in ISO-8859-1 get
sent as numeric HTML entities, such as `&#25991;`. But there's no way
to distinguish these from the same text literally entered into the form

The second is that Omega can't simply re-encode the form data, as the
encoding used isn't specified in the form submission (whether that is by

If Xapian is asked to parse a query string which isn't valid UTF-8, it will
fall back to handling it as ISO-8859-1, which will usually do the right thing
for queries which are representable in ISO-8859-1. However, things like
boolean filters in `B` parameters will be used as-is, so any which contain
non-ASCII characters won't work properly.
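
The entity ambiguity and the fallback described above can be sketched in Python. This is not Xapian's or Omega's code: `decode_query` is a hypothetical helper, it applies the fallback to the whole string (an assumption of this sketch), and Python's `xmlcharrefreplace` error handler is used only as an approximation of what a browser does with unrepresentable characters.

```python
def decode_query(raw: bytes) -> str:
    """Decode as UTF-8; if the bytes are not valid UTF-8, treat them as ISO-8859-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")

# "café" submitted as ISO-8859-1 is not valid UTF-8, so the fallback applies.
print(decode_query(b"caf\xe9"))

# A character outside ISO-8859-1 is replaced by a numeric entity...
from_cjk = "文".encode("iso-8859-1", errors="xmlcharrefreplace")
# ...which is byte-for-byte what arrives if the user typed the entity text itself.
from_literal = "&#25991;".encode("iso-8859-1")

print(from_cjk == from_literal)
```

Because the two byte strings compare equal, the receiving side cannot tell which one the user meant.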
When using omindex to index, the text in the database should automatically be
UTF-8: omindex converts text extracted from documents to UTF-8 if it isn't
already in this encoding. There's built-in code to handle the following: ISO-8859-1,
ISO-8859-15, WINDOWS-1252, CP-1252, UTF-16, UCS-2, UTF-16BE, UCS-2BE, UTF-16LE,
UCS-2LE. If omindex is built with iconv, many other encodings can be handled as well.
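
As a rough illustration of this kind of conversion (using Python's codecs rather than omindex's own code), the hypothetical helper `to_utf8` below re-encodes bytes from a declared character set. The sample bytes are the euro sign in three of the encodings listed above.

```python
def to_utf8(raw: bytes, charset: str) -> bytes:
    """Re-encode raw bytes from the declared charset into UTF-8."""
    return raw.decode(charset).encode("utf-8")

# The euro sign is a different byte sequence in each source encoding...
samples = [
    (b"\xa4", "iso-8859-15"),
    (b"\x80", "windows-1252"),
    ("€".encode("utf-16le"), "utf-16le"),
]

# ...but every conversion produces the same UTF-8 bytes.
for raw, charset in samples:
    print(charset, to_utf8(raw, charset))
```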
For plain text, omindex looks for a Byte Order Mark (BOM) to recognise
UTF-8, UTF-16BE, UTF-16LE, UTF-32BE and UTF-32LE. Otherwise files are
assumed to be UTF-8, or ISO-8859-1 if they contain byte sequences which
aren't valid UTF-8.
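
A minimal sketch of BOM-based detection follows. It is not omindex's implementation: `decode_plain_text` is a hypothetical name, and the sketch assumes the ISO-8859-1 fallback is triggered when the bytes fail to decode as UTF-8.

```python
import codecs

# Check the UTF-32 BOMs first: the UTF-32LE BOM begins with the UTF-16LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def decode_plain_text(raw: bytes) -> str:
    """Decode using a BOM if present, else UTF-8, else ISO-8859-1."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            # Strip the BOM and decode the rest in the encoding it signals.
            return raw[len(bom):].decode(encoding)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte sequence is decodable as ISO-8859-1, so this cannot fail.
        return raw.decode("iso-8859-1")
```

The order of the table matters: testing for the UTF-16LE BOM before the UTF-32LE one would misidentify UTF-32LE files.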
When omindex builds URLs, it percent-encodes bytes according to RFC 3986.
On modern systems, filenames are usually encoded in UTF-8, and the bytes
which make up multi-byte UTF-8 sequences get percent-encoded. In Omega 1.2.21
or 1.3.3 and later, the OmegaScript `$prettyurl` command reverses this
encoding for valid UTF-8 sequences, so filenames should be shown with only
the bare minimum of characters escaped.
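
The effect of percent-encoding on filename bytes can be illustrated with Python's `urllib.parse.quote`. The path is made up, and the exact set of characters that `quote` leaves unescaped may not match what omindex does; the point is only that each non-ASCII byte becomes one `%XX` escape.

```python
from urllib.parse import quote

# The same (made-up) filename as raw bytes in two different filesystem encodings.
utf8_name = "/docs/café.txt".encode("utf-8")
latin1_name = "/docs/café.txt".encode("iso-8859-1")

# quote() accepts bytes and escapes each non-ASCII byte individually.
utf8_url = quote(utf8_name)
latin1_url = quote(latin1_name)

print(utf8_url)    # /docs/caf%C3%A9.txt
print(latin1_url)  # /docs/caf%E9.txt
```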
However, if your filenames aren't encoded in UTF-8, `$prettyurl` will leave
percent-encoded bytes for non-ASCII characters alone. (It is possible that it could
find a valid UTF-8 sequence in other data and so show the wrong character, but
this is unlikely in real-world data.) Even so, everything should still work.
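
The behaviour described for `$prettyurl` can be approximated as follows. This is a simplified sketch, not Omega's implementation: `pretty_url_sketch` is a hypothetical function, it only looks at escapes of non-ASCII bytes, and it treats each run of consecutive escapes as a unit (a run containing any invalid UTF-8 is left untouched).

```python
import re

# One or more consecutive percent-escapes of bytes in the range 0x80-0xFF.
_HIGH_RUN = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")

def pretty_url_sketch(url: str) -> str:
    """Undo percent-encoding only where the escaped bytes are valid UTF-8."""
    def replace(match: re.Match) -> str:
        raw = bytes.fromhex(match.group(0).replace("%", ""))
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            # Not valid UTF-8 (e.g. an ISO-8859-1 filename): keep the escapes.
            return match.group(0)
    return _HIGH_RUN.sub(replace, url)

print(pretty_url_sketch("/docs/caf%C3%A9.txt"))  # UTF-8 filename: decoded
print(pretty_url_sketch("/docs/caf%E9.txt"))     # ISO-8859-1 filename: unchanged
print(pretty_url_sketch("/docs/my%20file.txt"))  # ASCII escape: unchanged
```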
When using scriptindex, you should ensure that text you feed to scriptindex is