1 Q: Why does libiconv support encoding XXX? Why does libiconv not support
4 A: libiconv, as an internationalization library, supports those character
5 sets and encodings which are in wide-spread use in at least one territory
8 Hint1: On http://www.w3c.org/International/O-charset-lang.html you will find a
9 page "Languages, countries, and the charsets typically used for them".
10 From this table, we can conclude that the following are in active use:
12 ISO-8859-1, CP1252 Afrikaans, Albanian, Basque, Catalan, Danish, Dutch,
13 English, Faroese, Finnish, French, Galician, German,
14 Icelandic, Irish, Italian, Norwegian, Portuguese,
15 Scottish, Spanish, Swedish
16 ISO-8859-2 Croatian, Czech, Hungarian, Polish, Romanian, Slovak,
18 ISO-8859-3 Esperanto, Maltese
19 ISO-8859-5 Bulgarian, Byelorussian, Macedonian, Russian,
24 ISO-8859-9, CP1254 Turkish
25 ISO-8859-10 Inuit, Lapp
26 ISO-8859-13 Latvian, Lithuanian
33 Ordered by frequency on the web (1997):
34 ISO-8859-1, CP1252 96%
45 Hint2: The character sets mentioned in the XFree86 4.0 locale.alias file.
47 ISO-8859-1 Afrikaans, Basque, Breton, Catalan, Danish, Dutch,
48 English, Estonian, Faroese, Finnish, French,
49 Galician, German, Greenlandic, Icelandic,
50 Indonesian, Irish, Italian, Lithuanian, Norwegian,
51 Occitan, Portuguese, Scottish, Spanish, Swedish,
53 ISO-8859-2 Albanian, Croatian, Czech, Hungarian, Polish,
54 Romanian, Serbian, Slovak, Slovenian
56 ISO-8859-4 Estonian, Latvian, Lithuanian
57 ISO-8859-5 Bulgarian, Byelorussian, Macedonian, Russian,
63 ISO-8859-14 Breton, Irish, Scottish, Welsh
64 ISO-8859-15 Basque, Breton, Catalan, Danish, Dutch, Estonian,
65 Faroese, Finnish, French, Galician, German,
66 Greenlandic, Icelandic, Irish, Italian, Lithuanian,
67 Norwegian, Occitan, Portuguese, Scottish, Spanish,
68 Swedish, Walloon, Welsh
70 KOI8-U Russian, Ukrainian
71 EUC-JP (alias eucJP) Japanese
72 ISO-2022-JP (alias JIS7) Japanese
73 SHIFT_JIS (alias SJIS) Japanese
76 EUC-CN (alias eucCN) Chinese
77 EUC-TW (alias eucTW) Chinese
79 EUC-KR (alias eucKR) Korean
81 GEORGIAN-ACADEMY Georgian
83 TIS-620 (alias TACTIS) Thai
90 Hint3: The character sets supported by Netscape Communicator 4.
92 Where is this documented? For the complete picture, I had to use
93 "strings netscape" and then a lot of guesswork. For a quick take,
94 look at the "View - Character set" menu of Netscape Communicator 4.6:
96 ISO-8859-{1,2,5,7,9,15}
97 WINDOWS-{1250,1251,1253}
100 Autodetect Japanese (EUC-JP, ISO-2022-JP, ISO-2022-JP-2, SJIS)
106 Autodetect Korean (EUC-KR, ISO-2022-KR, but not JOHAB)
111 Hint4: The character sets supported by Microsoft Internet Explorer 4.
113 ISO-8859-{1,2,3,4,5,6,7,8,9}
114 WINDOWS-{1250,1251,1252,1253,1254,1255,1256,1257}
127 WINDOWS-1258 Vietnamese
131 UNICODE actually UNICODE-LITTLE
132 UNICODEFEFF actually UNICODE-BIG
134 and various DOS character sets: DOS-720, DOS-862, IBM852, CP866.
136 We take the union of all these four sets. The result is:
138 European and Semitic languages
140 We implement this because it is occasionally useful to know or to
141 check whether some text is entirely ASCII (i.e. if the conversion
142 ISO-8859-x -> UTF-8 is trivial).
143 * ISO-8859-{1,2,3,4,5,6,7,8,9,10}
144 We implement these because they are widely used, except ISO-8859-4,
145 which appears to have been superseded by ISO-8859-13 in the Baltic
146 countries; we implement it anyway because it is an ISO standard.
148 We implement this because it's a standard in Lithuania and Latvia.
150 We implement this because it's an ISO standard.
152 We implement this because it's increasingly used in Europe, because
155 We implement this because it's an ISO standard.
157 We implement this because it appears to be the predominant encoding
158 on Unix in Russia and Ukraine, respectively.
160 We implement this because MSIE4 supports it.
162 We implement this because it is the locale encoding in glibc's Tajik
165 We implement this because it is the locale encoding in glibc's Kazakh
168 We implement this because it's a standard in Kazakhstan.
169 * CP{1250,1251,1252,1253,1254,1255,1256,1257}
170 We implement these because they are the predominant Windows encodings
173 We implement this because it is mentioned as occurring in the web
174 in the aforementioned statistics.
176 We implement this because Ron Aaron says it is sometimes used in web
179 We implement this because Netscape Communicator does.
181 We implement this because it is the locale encoding of a Belarusian
182 locale in FreeBSD and MacOS X.
183 * Mac{Roman,CentralEurope,Croatian,Romania,Cyrillic,Greek,Turkish} and
185 We implement these because the Sun JDK does, and because Mac users
186 don't deserve to be punished.
188 We implement this because it is mentioned as occurring in the web
189 in the aforementioned statistics.
191 * EUC-JP, SHIFT_JIS, ISO-2022-JP
192 We implement these because they are widely used. EUC-JP and SHIFT_JIS
193 are used more for files, whereas ISO-2022-JP is recommended for email.
195 We implement this because it is the Microsoft variant of SHIFT_JIS,
198 We implement this because it's the common way to represent mails which
199 make use of JIS X 0212 characters.
201 We implement this because it's in the RFCs, but I don't think it is
204 We implement this because Microsoft Outlook Express / Microsoft MimeOLE
205 sends emails in this encoding.
207 We DON'T implement this because I have no information about what it
211 We implement this because it is the widely used representation
212 of simplified Chinese.
214 We implement this because it appears to be used on Solaris and Windows.
216 We implement this because it is an official requirement in the
217 People's Republic of China.
219 We implement this because it is in the RFCs, but I have no idea
220 whether it is really used.
222 We implement this because it's in the RFCs, but I don't think it is
225 We implement this because the RFCs recommend it for Usenet postings,
226 and because MSIE4 supports it.
229 We implement it because it appears to be used on Unix.
231 We implement it because it is the de-facto standard for traditional
234 We implement this because it is the Microsoft variant of BIG5, used
237 We DON'T implement this because it doesn't appear to be in wide use.
238 Only the CWEX fonts use this encoding. Furthermore, the conversion
239 tables in the big5p package are not coherent: If you convert directly,
240 you get different results than when you convert via GBK.
242 We implement it because it is the de-facto standard for traditional
246 We implement these because they appear to be the widely used
247 representations for Korean.
249 We implement this because it is the Microsoft variant of EUC-KR, used
252 We implement it because it is in the RFCs and because MSIE4 supports
253 it, but I have no idea whether it's really used.
255 We implement this because it is apparently used on Windows as a locale
256 encoding (codepage 1361).
258 We DON'T implement this because although an old ASCII variant, its
259 glyph for 0x7E is not clear: RFC 1345 and unicode.org's JOHAB.TXT
260 say it's a tilde, but Ken Lunde's "CJKV information processing" says
261 it's an overline. And it is not ISO-IR registered.
264 We implement it because XFree86 supports it.
266 * Georgian-Academy, Georgian-PS
267 We implement these because both appear to be used for Georgian;
268 XFree86 supports them.
270 * ISO-8859-11, TIS-620
271 We implement these because they seem to be the standard for Thai.
273 We implement this because MSIE4 supports it.
275 We implement this because the Sun JDK does, and because Mac users
276 don't deserve to be punished.
279 We implement these because XFree86 supports them. I have no idea which
280 one is used more widely.
283 We implement these because XFree86 supports them.
285 We implement this because MSIE4 supports it.
287 * NUNACOM-8 (Inuktitut)
288 We DON'T implement this because it isn't part of Unicode yet, and
289 therefore doesn't convert to anything except itself.
291 * HP-ROMAN8, NEXTSTEP
292 We implement these because they were the native character set on HPs
293 and NeXTs for a long time, and libiconv is intended to be usable on
296 * UTF-8, UCS-2, UCS-4
297 We implement these. Obviously.
298 * UCS-2BE, UCS-2LE, UCS-4BE, UCS-4LE
299 We implement these because they are the preferred internal
300 representation of strings in Unicode aware applications. These are
301 non-ambiguous names, known to glibc. (glibc doesn't have
302 UCS-2-INTERNAL and UCS-4-INTERNAL.)
303 * UTF-16, UTF-16BE, UTF-16LE
304 We implement these, because UTF-16 is still the favourite encoding of
305 the president of the Unicode Consortium (for political reasons), and
306 because they appear in RFC 2781.
307 * UTF-32, UTF-32BE, UTF-32LE
308 We implement these because they are part of Unicode 3.1.
310 We implement this because it is essential functionality for mail
313 We implement it because it's used for C and C++ programs and because
314 it's a nice encoding for debugging.
316 We implement it because it's used for Java programs and because it's
317 a nice encoding for debugging.
318 * UNICODE (big endian), UNICODEFEFF (little endian)
319 We DON'T implement these because they are stupid and not standardized.
320 Full Unicode, in terms of 'uint16_t' or 'uint32_t'
321 (with machine dependent endianness and alignment)
322 * UCS-2-INTERNAL, UCS-4-INTERNAL
323 We implement these because they are the preferred internal
324 representation of strings in Unicode aware applications.
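
To illustrate why the byte-order-explicit names in the list above are
unambiguous, here is a minimal sketch (not taken from libiconv itself) that
uses the POSIX iconv(3) interface to convert a UTF-8 string into a named
target encoding. The helper name utf8_to is hypothetical, and the sketch
assumes a single call to iconv() suffices, which holds only for short inputs
and a large enough output buffer.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Convert the NUL-terminated UTF-8 string `in` to the encoding `to`.
   Writes at most `outsize` bytes to `out` and returns the number of
   bytes written, or (size_t)-1 on any error.
   utf8_to is a hypothetical helper name used for illustration only. */
size_t utf8_to(const char *to, const char *in, char *out, size_t outsize)
{
    iconv_t cd = iconv_open(to, "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;          /* target encoding not supported */

    char *inptr = (char *)in;       /* iconv() takes char **; the input is not modified */
    size_t inleft = strlen(in);
    char *outptr = out;
    size_t outleft = outsize;

    size_t rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
    iconv_close(cd);

    if (rc == (size_t)-1 || inleft != 0)
        return (size_t)-1;          /* invalid input or output buffer too small */
    return outsize - outleft;
}
```

Calling utf8_to("UCS-2BE", "A", buf, sizeof buf) should store the two bytes
00 41, while "UCS-2LE" should store 41 00. With the -INTERNAL names the byte
order would instead follow the machine, which is why they suit in-memory
strings but not data interchange.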
326 Q: Should libiconv support the encodings mentioned in RFC 1345?
327 A: No, they are not in use any more. Supporting ISO-646 variants is pointless
328 since ISO-8859-* have been adopted.
331 A: Available through --enable-extra-encodings.
332 Why? Because several people (Ulrich Schwab, Calvin Buckley) have shown
333 interest in these encodings, by preparing forks of GNU libiconv.
335 Q: How do I add a new character set?
336 A: 1. Explain the "why" in this file, above.
337 2. You need to have a conversion table from/to Unicode. Transform it into
338 the format used by the mapping tables found on ftp.unicode.org: each line
339 contains the character code, in hex, with 0x prefix, then whitespace,
340 then the Unicode code point, in hex, 4 hex digits, with 0x prefix. '#'
341 counts as a comment delimiter until end of line.
342 Please also send your table to Mark Leisher <mleisher@crl.nmsu.edu> so he
343 can include it in his collection.
344 3. If it's an 8-bit character set, use the '8bit_tab_to_h' program in the
345 tools directory to generate the C code for the conversion. You may tweak
346 the resulting C code if you are not satisfied with its quality, but this
348 If it's a two-dimensional character set (with rows and columns), use the
349 'cjk_tab_to_h' program in the tools directory to generate the C code for
350 the conversion. You will need to modify the main() function to recognize
351 the new character set name, with the proper dimensions, but that shouldn't
352 be too hard. This yields the CCS (coded character set). The CES (character encoding scheme) you have to write by hand.
353 4. Store the resulting C code file in the lib directory. Add a #include
354 directive to converters.h, and add an entry to the encodings.def file.
355 5. Compile the package, and test your new encoding using a program like
356 iconv(1) or clisp(1).
357 6. Augment the testsuite: Add a line to tests/Makefile.in. For a stateless
358 encoding, create the complete table as a TXT file. For a stateful encoding,
359 provide a text snippet encoded using your new encoding and its UTF-8
361 7. Update the README and man/iconv_open.3, to mention the new encoding.
362 Add a note in the NEWS file.
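
The table format described in step 2 can be sketched as a small line parser.
This is only an illustration of the format, not the code used by
'8bit_tab_to_h' or 'cjk_tab_to_h'. The function name parse_mapping_line and
its return convention are assumptions made for this example, and the sketch
does not enforce the "4 hex digits" width of the Unicode code point.

```c
#include <stdio.h>
#include <string.h>

/* Parse one line of a mapping table: character code in hex with 0x prefix,
   whitespace, Unicode code point in hex with 0x prefix; '#' starts a
   comment that runs to the end of the line.
   Returns 1 and stores both values if the line holds a mapping,
   0 if the line is blank or only a comment, and -1 if it is malformed.
   parse_mapping_line is a hypothetical name used for illustration only. */
int parse_mapping_line(const char *line, unsigned int *code, unsigned int *ucs)
{
    char buf[256];
    size_t len = strcspn(line, "#\r\n");    /* cut at comment or line end */
    if (len >= sizeof buf)
        return -1;
    memcpy(buf, line, len);
    buf[len] = '\0';

    if (buf[strspn(buf, " \t")] == '\0')
        return 0;                           /* nothing but whitespace left */

    int consumed = 0;
    if (sscanf(buf, " 0x%x 0x%x %n", code, ucs, &consumed) != 2)
        return -1;                          /* missing field or missing 0x prefix */
    if (buf[consumed] != '\0')
        return -1;                          /* unexpected text after the mapping */
    return 1;
}
```

For example, the line "0x80<TAB>0x20AC<TAB># EURO SIGN" (the CP1252 mapping
for the euro sign) should yield code 0x80 and code point 0x20AC, while a line
without the 0x prefixes is rejected.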
364 Q: What about bidirectional text? Should it be tagged or reversed when
365 converting from ISO-8859-8 or ISO-8859-6 to Unicode? Qt appears to do
366 this, see qt-2.0.1/src/tools/qrtlcodec.cpp.
367 A: After reading RFC 1556: I don't think so. Support for ISO-8859-8-I and
368 ISO-8859-E remains to be implemented.
369 On the other hand, a page on www.w3c.org says that ISO-8859-8 in *email*
370 is visually encoded, ISO-8859-8 in *HTML* is logically encoded, i.e.
371 the same as ISO-8859-8-I. I'm confused.
373 Other character sets not implemented:
374 "MNEMONIC" = "csMnemonic"
376 "ISO-10646-UCS-Basic" = "csUnicodeASCII"
377 "ISO-10646-Unicode-Latin1" = "csUnicodeLatin1" = "ISO-10646"
379 "UNICODE-1-1" = "csUnicode11"
382 Other aliases not implemented (and not implemented in glibc-2.1 either):
384 ISO-8859-1: alias ISO8859-1
385 ISO-8859-2: alias ISO8859-2
386 KSC_5601: alias KS_C_5601
387 UTF-8: aliases UNICODE-1-1-UTF-8 UNICODE-2-0-UTF-8
390 Q: How can I integrate libiconv into my package?
391 A: Just copy the entire libiconv package into a subdirectory of your package.
392 At configuration time, call libiconv's configure script with the
393 appropriate --srcdir option and maybe --enable-static or --disable-shared.
394 Then "cd libiconv && make && make install-lib libdir=... includedir=...".
395 'install-lib' is a special (not GNU standardized) target which installs
396 only the include file - in $(includedir) - and the library - in $(libdir) -
397 and does not use other directory variables. After "installing" libiconv
398 in your package's build directory, building of your package can proceed.
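
Once the library is installed in the build directory, a package may want to
check that the encodings it relies on are actually available. The following
is a minimal sketch of such a probe through the POSIX iconv_open(3) call. The
function name encoding_supported is hypothetical, and probing against UTF-8
is an arbitrary choice for this example.

```c
#include <iconv.h>

/* Return 1 if the linked iconv implementation can convert from the
   encoding `name` to UTF-8, and 0 otherwise.
   encoding_supported is a hypothetical name used for illustration only. */
int encoding_supported(const char *name)
{
    iconv_t cd = iconv_open("UTF-8", name);
    if (cd == (iconv_t)-1)
        return 0;               /* iconv_open failed: conversion not available */
    iconv_close(cd);
    return 1;
}
```

A package could call this at startup, for instance with "ISO-8859-1", and
fall back or report an error when the answer is 0.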
400 Q: Why is the testsuite so big?
401 A: Because some of the tests are very comprehensive.
402 If you don't feel like using the testsuite, you can simply remove the