NOTES

Q: Why does libiconv support encoding XXX? Why does libiconv not support
   encoding ZZZ?

A: libiconv, as an internationalization library, supports those character
   sets and encodings which are in widespread use in at least one territory
   of the world.

   Hint1: On http://www.w3c.org/International/O-charset-lang.html you find a
   page "Languages, countries, and the charsets typically used for them".
   From this table, we can conclude that the following are in active use:

     ISO-8859-1, CP1252   Afrikaans, Albanian, Basque, Catalan, Danish, Dutch,
                          English, Faroese, Finnish, French, Galician, German,
                          Icelandic, Irish, Italian, Norwegian, Portuguese,
                          Scottish, Spanish, Swedish
     ISO-8859-2           Croatian, Czech, Hungarian, Polish, Romanian, Slovak,
                          Slovenian
     ISO-8859-3           Esperanto, Maltese
     ISO-8859-5           Bulgarian, Byelorussian, Macedonian, Russian,
                          Serbian, Ukrainian
     ISO-8859-6           Arabic
     ISO-8859-7           Greek
     ISO-8859-8           Hebrew
     ISO-8859-9, CP1254   Turkish
     ISO-8859-10          Inuit, Lapp
     ISO-8859-13          Latvian, Lithuanian
     ISO-8859-15          Estonian
     KOI8-R               Russian
     SHIFT_JIS            Japanese
     ISO-2022-JP          Japanese
     EUC-JP               Japanese

   Ordered by frequency on the web (1997):
     ISO-8859-1, CP1252   96%
     SHIFT_JIS             1.6%
     ISO-2022-JP           1.2%
     EUC-JP                0.4%
     CP1250                0.3%
     CP1251                0.2%
     CP850                 0.1%
     MACINTOSH             0.1%
     ISO-8859-5            0.1%
     ISO-8859-2            0.0%

   Hint2: The character sets mentioned in the XFree86 4.0 locale.alias file.

     ISO-8859-1           Afrikaans, Basque, Breton, Catalan, Danish, Dutch,
                          English, Estonian, Faroese, Finnish, French,
                          Galician, German, Greenlandic, Icelandic,
                          Indonesian, Irish, Italian, Lithuanian, Norwegian,
                          Occitan, Portuguese, Scottish, Spanish, Swedish,
                          Walloon, Welsh
     ISO-8859-2           Albanian, Croatian, Czech, Hungarian, Polish,
                          Romanian, Serbian, Slovak, Slovenian
     ISO-8859-3           Esperanto
     ISO-8859-4           Estonian, Latvian, Lithuanian
     ISO-8859-5           Bulgarian, Byelorussian, Macedonian, Russian,
                          Serbian, Ukrainian
     ISO-8859-6           Arabic
     ISO-8859-7           Greek
     ISO-8859-8           Hebrew
     ISO-8859-9           Turkish
     ISO-8859-14          Breton, Irish, Scottish, Welsh
     ISO-8859-15          Basque, Breton, Catalan, Danish, Dutch, Estonian,
                          Faroese, Finnish, French, Galician, German,
                          Greenlandic, Icelandic, Irish, Italian, Lithuanian,
                          Norwegian, Occitan, Portuguese, Scottish, Spanish,
                          Swedish, Walloon, Welsh
     KOI8-R               Russian
     KOI8-U               Russian, Ukrainian
     EUC-JP (alias eucJP)      Japanese
     ISO-2022-JP (alias JIS7)  Japanese
     SHIFT_JIS (alias SJIS)    Japanese
     U90                       Japanese
     S90                       Japanese
     EUC-CN (alias eucCN)      Chinese
     EUC-TW (alias eucTW)      Chinese
     BIG5                      Chinese
     EUC-KR (alias eucKR)      Korean
     ARMSCII-8                 Armenian
     GEORGIAN-ACADEMY          Georgian
     GEORGIAN-PS               Georgian
     TIS-620 (alias TACTIS)    Thai
     MULELAO-1                 Laotian
     IBM-CP1133                Laotian
     VISCII                    Vietnamese
     TCVN                      Vietnamese
     NUNACOM-8                 Inuktitut

   Hint3: The character sets supported by Netscape Communicator 4.

     Where is this documented? For the complete picture, I had to use
     "strings netscape" and then a lot of guesswork. For a quick take,
     look at the "View - Character set" menu of Netscape Communicator 4.6:

     ISO-8859-{1,2,5,7,9,15}
     WINDOWS-{1250,1251,1253}
     KOI8-R               Cyrillic
     CP866                Cyrillic
     Autodetect           Japanese  (EUC-JP, ISO-2022-JP, ISO-2022-JP-2, SJIS)
     EUC-JP               Japanese
     SHIFT_JIS            Japanese
     GB2312               Chinese
     BIG5                 Chinese
     EUC-TW               Chinese
     Autodetect           Korean    (EUC-KR, ISO-2022-KR, but not JOHAB)

     UTF-8
     UTF-7

   Hint4: The character sets supported by Microsoft Internet Explorer 4.

     ISO-8859-{1,2,3,4,5,6,7,8,9}
     WINDOWS-{1250,1251,1252,1253,1254,1255,1256,1257}
     KOI8-R               Cyrillic
     KOI8-RU              Ukrainian
     ASMO-708             Arabic
     EUC-JP               Japanese
     ISO-2022-JP          Japanese
     SHIFT_JIS            Japanese
     GB2312               Chinese
     HZ-GB-2312           Chinese
     BIG5                 Chinese
     EUC-KR               Korean
     ISO-2022-KR          Korean
     WINDOWS-874          Thai
     WINDOWS-1258         Vietnamese

     UTF-8
     UTF-7
     UNICODE             actually UNICODE-LITTLE
     UNICODEFEFF         actually UNICODE-BIG

     and various DOS character sets: DOS-720, DOS-862, IBM852, CP866.

   We take the union of these four sets. The result is:

   European and Semitic languages
     * ASCII.
       We implement this because it is occasionally useful to know or to
       check whether some text is entirely ASCII (i.e. if the conversion
       ISO-8859-x -> UTF-8 is trivial).
     * ISO-8859-{1,2,3,4,5,6,7,8,9,10}
       We implement these because they are widely used, except ISO-8859-4,
       which appears to have been superseded by ISO-8859-13 in the Baltic
       countries. But it's an ISO standard anyway.
     * ISO-8859-13
       We implement this because it's a standard in Lithuania and Latvia.
     * ISO-8859-14
       We implement this because it's an ISO standard.
     * ISO-8859-15
       We implement this because it's increasingly used in Europe, because
       of the Euro symbol.
     * ISO-8859-16
       We implement this because it's an ISO standard.
     * KOI8-R, KOI8-U
       We implement these because they appear to be the predominant encodings
       on Unix in Russia and Ukraine, respectively.
     * KOI8-RU
       We implement this because MSIE4 supports it.
     * KOI8-T
       We implement this because it is the locale encoding in glibc's Tajik
       locale.
     * PT154
       We implement this because it is the locale encoding in glibc's Kazakh
       locale.
     * RK1048
       We implement this because it's a standard in Kazakhstan.
     * CP{1250,1251,1252,1253,1254,1255,1256,1257}
       We implement these because they are the predominant Windows encodings
       in Europe.
     * CP850
       We implement this because it is mentioned as occurring in the web
       in the aforementioned statistics.
     * CP862
       We implement this because Ron Aaron says it is sometimes used in web
       pages and emails.
     * CP866
       We implement this because Netscape Communicator does.
     * CP1131
       We implement this because it is the locale encoding of a Belarusian
       locale in FreeBSD and MacOS X.
     * Mac{Roman,CentralEurope,Croatian,Romania,Cyrillic,Greek,Turkish} and
       Mac{Hebrew,Arabic}
       We implement these because the Sun JDK does, and because Mac users
       don't deserve to be punished.
     * Macintosh
       We implement this because it is mentioned as occurring in the web
       in the aforementioned statistics.
   Japanese
     * EUC-JP, SHIFT_JIS, ISO-2022-JP
       We implement these because they are widely used. EUC-JP and SHIFT_JIS
       are used more for files, whereas ISO-2022-JP is recommended for email.
     * CP932
       We implement this because it is the Microsoft variant of SHIFT_JIS,
       used on Windows.
     * ISO-2022-JP-2
       We implement this because it's the common way to represent mails which
       make use of JIS X 0212 characters.
     * ISO-2022-JP-1
       We implement this because it's in the RFCs, but I don't think it is
       really used.
     * ISO-2022-JP-MS
       We implement this because Microsoft Outlook Express / Microsoft MimeOLE
       sends emails in this encoding.
     * U90, S90
       We DON'T implement these because I have no information about what they
       are or who uses them.
   Simplified Chinese
     * EUC-CN = GB2312
       We implement this because it is the widely used representation
       of simplified Chinese.
     * GBK
       We implement this because it appears to be used on Solaris and Windows.
     * GB18030
       We implement this because it is an official requirement in the
       People's Republic of China.
     * ISO-2022-CN
       We implement this because it is in the RFCs, but I have no idea
       whether it is really used.
     * ISO-2022-CN-EXT
       We implement this because it's in the RFCs, but I don't think it is
       really used.
     * HZ = HZ-GB-2312
       We implement this because the RFCs recommend it for Usenet postings,
       and because MSIE4 supports it.
   Traditional Chinese
     * EUC-TW
       We implement it because it appears to be used on Unix.
     * BIG5
       We implement it because it is the de-facto standard for traditional
       Chinese.
     * CP950
       We implement this because it is the Microsoft variant of BIG5, used
       on Windows.
     * BIG5+
       We DON'T implement this because it doesn't appear to be in wide use.
       Only the CWEX fonts use this encoding. Furthermore, the conversion
       tables in the big5p package are not coherent: If you convert directly,
       you get different results than when you convert via GBK.
     * BIG5-HKSCS
       We implement it because it is the de-facto standard for traditional
       Chinese in Hong Kong.
   Korean
     * EUC-KR
       We implement this because it appears to be the widely used
       representation for Korean.
     * CP949
       We implement this because it is the Microsoft variant of EUC-KR, used
       on Windows.
     * ISO-2022-KR
       We implement it because it is in the RFCs and because MSIE4 supports
       it, but I have no idea whether it's really used.
     * JOHAB
       We implement this because it is apparently used on Windows as a locale
       encoding (codepage 1361).
     * ISO-646-KR
       We DON'T implement this because, although it is an old ASCII variant,
       its glyph for 0x7E is not clear: RFC 1345 and unicode.org's JOHAB.TXT
       say it's a tilde, but Ken Lunde's "CJKV information processing" says
       it's an overline. And it is not ISO-IR registered.
   Armenian
     * ARMSCII-8
       We implement it because XFree86 supports it.
   Georgian
     * Georgian-Academy, Georgian-PS
       We implement these because both appear to be used for Georgian;
       XFree86 supports them.
   Thai
     * ISO-8859-11, TIS-620
       We implement these because they seem to be standard for Thai.
     * CP874
       We implement this because MSIE4 supports it.
     * MacThai
       We implement this because the Sun JDK does, and because Mac users
       don't deserve to be punished.
   Laotian
     * MuleLao-1, CP1133
       We implement these because XFree86 supports them. I have no idea which
       one is used more widely.
   Vietnamese
     * VISCII, TCVN
       We implement these because XFree86 supports them.
     * CP1258
       We implement this because MSIE4 supports it.
   Other languages
     * NUNACOM-8 (Inuktitut)
       We DON'T implement this because it isn't part of Unicode yet, and
       therefore doesn't convert to anything except itself.
   Platform specifics
     * HP-ROMAN8, NEXTSTEP
       We implement these because they were the native character set on HPs
       and NeXTs for a long time, and libiconv is intended to be usable on
       these old machines.
   Full Unicode
     * UTF-8, UCS-2, UCS-4
       We implement these. Obviously.
     * UCS-2BE, UCS-2LE, UCS-4BE, UCS-4LE
       We implement these because they are the preferred internal
       representation of strings in Unicode aware applications. These are
       non-ambiguous names, known to glibc. (glibc doesn't have
       UCS-2-INTERNAL and UCS-4-INTERNAL.)
     * UTF-16, UTF-16BE, UTF-16LE
       We implement these, because UTF-16 is still the favourite encoding of
       the president of the Unicode Consortium (for political reasons), and
       because they appear in RFC 2781.
     * UTF-32, UTF-32BE, UTF-32LE
       We implement these because they are part of Unicode 3.1.
     * UTF-7
       We implement this because it is essential functionality for mail
       applications.
     * C99
       We implement it because it's used for C and C++ programs and because
       it's a nice encoding for debugging.
     * JAVA
       We implement it because it's used for Java programs and because it's
       a nice encoding for debugging.
     * UNICODE (little endian), UNICODEFEFF (big endian)
       We DON'T implement these because they are stupid and not standardized.
   Full Unicode, in terms of 'uint16_t' or 'uint32_t'
   (with machine dependent endianness and alignment)
     * UCS-2-INTERNAL, UCS-4-INTERNAL
       We implement these because they are the preferred internal
       representation of strings in Unicode aware applications.

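   Example: The ASCII entry in the list above mentions checking whether some
   text is entirely ASCII. The following is only a sketch of how such a check
   might look through the standard iconv(3) interface. The function name
   is_entirely_ascii is made up for this note and is not part of libiconv;
   the sketch assumes that the encoding names "ASCII" and "UTF-8" are both
   known to the iconv implementation in use.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper, not part of libiconv.
   Returns 1 if buf[0..len) is entirely ASCII, 0 if it is not,
   and -1 if the converter could not be opened. */
static int is_entirely_ascii(const char *buf, size_t len)
{
    iconv_t cd = iconv_open("UTF-8", "ASCII");
    if (cd == (iconv_t)-1)
        return -1;

    char *in = (char *)buf;   /* iconv() does not modify the input bytes */
    size_t inleft = len;
    int result = 1;

    while (inleft > 0) {
        char out[256];        /* scratch output, discarded after each round */
        char *outp = out;
        size_t outleft = sizeof out;

        if (iconv(cd, &in, &inleft, &outp, &outleft) == (size_t)-1) {
            if (errno == E2BIG)
                continue;     /* scratch buffer full: convert the next part */
            result = 0;       /* EILSEQ or EINVAL: a non-ASCII byte was seen */
            break;
        }
    }

    iconv_close(cd);
    return result;
}
```

   A caller would pass the raw bytes and their length, for instance
   is_entirely_ascii(text, strlen(text)). A plain loop testing each byte
   against 0x80 gives the same answer; the iconv route is shown only because
   it exercises the ASCII converter discussed here.
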
Q: Support encodings mentioned in RFC 1345?
A: No, they are not in use any more. Supporting ISO-646 variants is pointless
   since ISO-8859-* have been adopted.

Q: Support EBCDIC?
A: Available through --enable-extra-encodings.
   Why? Because several people (Ulrich Schwab, Calvin Buckley) have shown
   interest in these encodings, by preparing forks of GNU libiconv.

Q: How do I add a new character set?
A: 1. Explain the "why" in this file, above.
   2. You need to have a conversion table from/to Unicode. Transform it into
   the format used by the mapping tables found on ftp.unicode.org: each line
   contains the character code, in hex, with 0x prefix, then whitespace,
   then the Unicode code point, in hex, 4 hex digits, with 0x prefix. '#'
   counts as a comment delimiter until end of line.
   Please also send your table to Mark Leisher <mleisher@crl.nmsu.edu> so he
   can include it in his collection.
   3. If it's an 8-bit character set, use the '8bit_tab_to_h' program in the
   tools directory to generate the C code for the conversion. You may tweak
   the resulting C code if you are not satisfied with its quality, but this
   is rarely needed.
   If it's a two-dimensional character set (with rows and columns), use the
   'cjk_tab_to_h' program in the tools directory to generate the C code for
   the conversion. You will need to modify the main() function to recognize
   the new character set name, with the proper dimensions, but that shouldn't
   be too hard. This yields the CCS. The CES you have to write by hand.
   4. Store the resulting C code file in the lib directory. Add a #include
   directive to converters.h, and add an entry to the encodings.def file.
   5. Compile the package, and test your new encoding using a program like
   iconv(1) or clisp(1).
   6. Augment the testsuite: Add a line to tests/Makefile.in. For a stateless
   encoding, create the complete table as a TXT file. For a stateful encoding,
   provide a text snippet encoded using your new encoding and its UTF-8
   equivalent.
   7. Update the README and man/iconv_open.3, to mention the new encoding.
   Add a note in the NEWS file.

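   Example: To illustrate the table format described in step 2, here is a
   small sketch that reads one line of such a mapping table. It is not the
   code used by '8bit_tab_to_h' or 'cjk_tab_to_h'; the function name
   parse_mapping_line is invented for this note. It assumes lines shorter
   than 256 bytes, and it does not enforce the "4 hex digits" rule for the
   Unicode column.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not taken from the libiconv tools directory.
   Parses one table line of the form "0xHH <whitespace> 0xHHHH  # comment".
   Returns 1 and fills *code and *ucs on success; returns 0 for blank,
   comment-only, overlong, or malformed lines. */
static int parse_mapping_line(const char *line,
                              unsigned int *code, unsigned int *ucs)
{
    char buf[256];
    size_t n = strcspn(line, "#\n");   /* '#' starts a comment until EOL */

    if (n >= sizeof buf)
        return 0;
    memcpy(buf, line, n);
    buf[n] = '\0';

    /* Both columns are hexadecimal and carry a 0x prefix. */
    return sscanf(buf, " 0x%x 0x%x", code, ucs) == 2;
}
```

   For a line such as "0x80<TAB>0x20AC<TAB># EURO SIGN", this stores 0x80 in
   *code and 0x20AC in *ucs. Lines that consist only of a comment are
   reported as "no mapping", so a caller can simply skip them.
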
Q: What about bidirectional text? Should it be tagged or reversed when
   converting from ISO-8859-8 or ISO-8859-6 to Unicode? Qt appears to do
   this, see qt-2.0.1/src/tools/qrtlcodec.cpp.
A: After reading RFC 1556: I don't think so. Support for ISO-8859-8-I and
   ISO-8859-E remains to be implemented.
   On the other hand, a page on www.w3c.org says that ISO-8859-8 in *email*
   is visually encoded, ISO-8859-8 in *HTML* is logically encoded, i.e.
   the same as ISO-8859-8-I. I'm confused.

Other character sets not implemented:
"MNEMONIC" = "csMnemonic"
"MNEM" = "csMnem"
"ISO-10646-UCS-Basic" = "csUnicodeASCII"
"ISO-10646-Unicode-Latin1" = "csUnicodeLatin1" = "ISO-10646"
"ISO-10646-J-1"
"UNICODE-1-1" = "csUnicode11"
"csWindows31Latin5"

Other aliases not implemented (and not implemented in glibc-2.1 either):
  From MSIE4:
    ISO-8859-1: alias ISO8859-1
    ISO-8859-2: alias ISO8859-2
    KSC_5601: alias KS_C_5601
    UTF-8: aliases UNICODE-1-1-UTF-8 UNICODE-2-0-UTF-8


Q: How can I integrate libiconv into my package?
A: Just copy the entire libiconv package into a subdirectory of your package.
   At configuration time, call libiconv's configure script with the
   appropriate --srcdir option and maybe --enable-static or --disable-shared.
   Then "cd libiconv && make && make install-lib libdir=... includedir=...".
   'install-lib' is a special (not GNU standardized) target which installs
   only the include file - in $(includedir) - and the library - in $(libdir) -
   and does not use other directory variables. After "installing" libiconv
   in your package's build directory, building of your package can proceed.

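   Example: Once the include file and the library are in place, code in
   your package can call the converter through the usual iconv(3) interface.
   The following is only a sketch of such a caller; the function name
   latin1_to_utf8 is invented for this note, and the caller is assumed to
   provide an output buffer that is large enough (at most two bytes of UTF-8
   per byte of ISO-8859-1). When linking against libiconv rather than the
   C library's own iconv, the library is typically added with -liconv.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical caller, not part of libiconv.
   Converts len bytes of ISO-8859-1 text to UTF-8.
   Returns the number of bytes written to out, or (size_t)-1 on failure
   (converter unavailable, or output buffer too small). */
static size_t latin1_to_utf8(const char *in, size_t len,
                             char *out, size_t outsize)
{
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inp = (char *)in;   /* iconv() does not modify the input bytes */
    char *outp = out;
    size_t inleft = len;
    size_t outleft = outsize;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    return outsize - outleft;
}
```

   Converting the single byte 0xE9 (e with acute accent) this way produces
   the two bytes 0xC3 0xA9.
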
Q: Why is the testsuite so big?
A: Because some of the tests are very comprehensive.
   If you don't feel like using the testsuite, you can simply remove the
   tests/ directory.
