1 @node Encoding conversions
2 @chapter Encoding conversions (@file{iconv.h})
4 This chapter describes the Newlib iconv library.
The iconv function declarations are in
6 @file{iconv.h}.
8 @menu
9 * Function iconv:: Encoding conversion routines
10 * Introduction to iconv:: Introduction to iconv and encodings
11 * Supported encodings:: The list of currently supported encodings
12 * iconv design decisions:: General iconv library design issues
13 * iconv configuration:: iconv-related configure script options
14 * Encoding names:: How encodings are named.
15 * CCS tables:: CCS tables format and 'mktbl.pl' Perl script
16 * CES converters:: CES converters description
17 * The encodings description file:: The 'encoding.deps' file and 'mkdeps.pl'
18 * How to add new encoding:: The steps to add new encoding support
19 * The locale support interfaces:: Locale-related iconv interfaces
20 * Contact:: The author contact
21 @end menu
23 @page
24 @include iconv/lib/iconv.def
26 @page
27 @node Introduction to iconv
28 @section Introduction to iconv
29 @findex encoding
30 @findex character set
31 @findex charset
32 @findex CES
33 @findex CCS
35 The iconv library is intended to convert characters from one encoding to
36 another. It implements iconv(), iconv_open() and iconv_close()
37 calls, which are defined by the Single Unix Specification.
In addition to these user-level interfaces, the iconv library also has
several useful interfaces which are needed to support the coding
capabilities of the Newlib Locale infrastructure. Since Locale
support also needs to convert various character sets to and from the
@emph{wide character set}, the iconv library shares its capabilities with the Newlib Locale
subsystem. Moreover, the iconv library supports several features which are
only needed for the Locale infrastructure (for example, the MB_CUR_MAX value).
50 The Newlib iconv library was created using concepts from another iconv
51 library implemented by Konstantin Chuguev (ver 2.0). The Newlib iconv library
52 was rewritten from scratch and contains a lot of improvements with respect to
53 the original iconv library.
56 Terms like @dfn{encoding} or @dfn{character set} aren't well defined and
57 are often used with various meanings. The following are the definitions of terms
58 which are used in this documentation as well as in the iconv library
59 implementation:
61 @itemize @bullet
62 @item
63 @dfn{encoding} - a machine representation of characters by means of bits;
65 @item
66 @dfn{Character Set} or @dfn{Charset} - just a collection of
67 characters, i.e. the encoding is the machine representation of the character set;
69 @item
@dfn{CCS} (@dfn{Coded Character Set}) - a mapping from a character set to a
set of integers (@dfn{character codes});
73 @item
74 @dfn{CES} (@dfn{Character Encoding Scheme}) - a mapping from a set of character
75 codes to a sequence of bytes;
76 @end itemize
79 Users usually deal with encodings, for example, KOI8-R, Unicode, UTF-8,
80 ASCII, etc. Encodings are formed by the following chain of steps:
82 @enumerate
83 @item
The user has a set of characters which are specific to his or her language (the character set).
@item
Each character from this set is uniquely numbered, resulting in a CCS.
@item
Each number from the CCS is converted to a sequence of bits or bytes by means
of a CES, forming an encoding. Thus, a CES may be considered as a
function of a CCS which produces an encoding. Note that a CES may be
applied to more than one CCS.
94 @end enumerate
97 Thus, an encoding may be considered as one or more CCS + CES.
Sometimes there is no CES, and in such cases the encoding is equivalent
to the CCS, e.g. KOI8-R or ASCII.
104 An example of a more complicated encoding is UTF-8 which is the UCS
105 (or Unicode) CCS plus the UTF-8 CES.
108 The following is a brief list of iconv library features:
109 @itemize
110 @item
111 Generic architecture;
112 @item
113 Locale infrastructure support;
114 @item
115 Automatic generation of the program code which handles
116 CES/CCS/Encoding/Names/Aliases dependencies;
117 @item
The ability to choose a size- or speed-optimized
configuration;
120 @item
121 The ability to exclude a lot of unneeded code and data from the linking step.
122 @end itemize
127 @page
128 @node Supported encodings
129 @section Supported encodings
130 @findex big5
131 @findex cp775
132 @findex cp850
133 @findex cp852
134 @findex cp855
135 @findex cp866
136 @findex euc_jp
137 @findex euc_kr
138 @findex euc_tw
139 @findex iso_8859_1
140 @findex iso_8859_10
141 @findex iso_8859_11
142 @findex iso_8859_13
143 @findex iso_8859_14
144 @findex iso_8859_15
145 @findex iso_8859_2
146 @findex iso_8859_3
147 @findex iso_8859_4
148 @findex iso_8859_5
149 @findex iso_8859_6
150 @findex iso_8859_7
151 @findex iso_8859_8
152 @findex iso_8859_9
153 @findex iso_ir_111
154 @findex koi8_r
155 @findex koi8_ru
156 @findex koi8_u
157 @findex koi8_uni
158 @findex ucs_2
159 @findex ucs_2_internal
160 @findex ucs_2be
161 @findex ucs_2le
162 @findex ucs_4
163 @findex ucs_4_internal
164 @findex ucs_4be
165 @findex ucs_4le
166 @findex us_ascii
167 @findex utf_16
168 @findex utf_16be
169 @findex utf_16le
170 @findex utf_8
171 @findex win_1250
172 @findex win_1251
173 @findex win_1252
174 @findex win_1253
175 @findex win_1254
176 @findex win_1255
177 @findex win_1256
178 @findex win_1257
179 @findex win_1258
181 The following is the list of currently supported encodings. The first column
182 corresponds to the encoding name, the second column is the list of aliases,
183 the third column is its CES and CCS components names, and the fourth column
184 is a short description.
186 @multitable @columnfractions .20 .26 .24 .30
187 @item
188 Name
189 @tab
190 Aliases
191 @tab
192 CES/CCS
193 @tab
194 Short description
195 @item
196 @tab
197 @tab
198 @tab
201 @item
202 big5
203 @tab
204 csbig5, big_five, bigfive, cn_big5, cp950
205 @tab
206 table_pcs / big5, us_ascii
207 @tab
208 The encoding for the Traditional Chinese.
211 @item
212 cp775
213 @tab
214 ibm775, cspc775baltic
215 @tab
216 table / cp775
217 @tab
The updated version of CP 437 that supports the Baltic languages.
221 @item
222 cp850
223 @tab
224 ibm850, 850, cspc850multilingual
225 @tab
226 table / cp850
227 @tab
IBM 850 - the updated version of CP 437 where several Latin 1 characters have been
added instead of some less often used characters such as the line-drawing
and the Greek ones.
233 @item
234 cp852
235 @tab
236 ibm852, 852, cspcp852
237 @tab
238 @tab
IBM 852 - the updated version of CP 437 where several Latin 2 characters have been added
instead of some less often used characters such as the line-drawing and the Greek ones.
243 @item
244 cp855
245 @tab
246 ibm855, 855, csibm855
247 @tab
248 table / cp855
249 @tab
250 IBM 855 - the updated version of CP 437 that supports Cyrillic.
253 @item
254 cp866
255 @tab
256 866, IBM866, CSIBM866
257 @tab
258 table / cp866
259 @tab
IBM 866 - the updated version of CP 855 which follows the more logical Russian alphabet
ordering of the alternative variant that is preferred by many Russian users.
264 @item
265 euc_jp
266 @tab
267 eucjp
268 @tab
269 euc / jis_x0208_1990, jis_x0201_1976, jis_x0212_1990
270 @tab
271 EUC-JP - The EUC for Japanese.
274 @item
275 euc_kr
276 @tab
277 euckr
278 @tab
279 euc / ksx1001
280 @tab
281 EUC-KR - The EUC for Korean.
284 @item
285 euc_tw
286 @tab
287 euctw
288 @tab
289 euc / cns11643_plane1, cns11643_plane2, cns11643_plane14
290 @tab
291 EUC-TW - The EUC for Traditional Chinese.
294 @item
295 iso_8859_1
296 @tab
297 iso8859_1, iso88591, iso_8859_1:1987, iso_ir_100, latin1, l1, ibm819, cp819, csisolatin1
298 @tab
299 table / iso_8859_1
300 @tab
301 ISO 8859-1:1987 - Latin 1, West European.
304 @item
305 iso_8859_10
306 @tab
307 iso_8859_10:1992, iso_ir_157, iso885910, latin6, l6, csisolatin6, iso8859_10
308 @tab
309 table / iso_8859_10
310 @tab
311 ISO 8859-10:1992 - Latin 6, Nordic.
314 @item
315 iso_8859_11
316 @tab
317 iso8859_11, iso885911
318 @tab
319 table / iso_8859_11
320 @tab
321 ISO 8859-11 - Thai.
324 @item
325 iso_8859_13
326 @tab
327 iso_8859_13:1998, iso8859_13, iso885913
328 @tab
329 table / iso_8859_13
330 @tab
331 ISO 8859-13:1998 - Latin 7, Baltic Rim.
334 @item
335 iso_8859_14
336 @tab
337 iso_8859_14:1998, iso885914, iso8859_14
338 @tab
339 table / iso_8859_14
340 @tab
341 ISO 8859-14:1998 - Latin 8, Celtic.
344 @item
345 iso_8859_15
346 @tab
iso885915, iso_8859_15:1998, iso8859_15
348 @tab
349 table / iso_8859_15
350 @tab
351 ISO 8859-15:1998 - Latin 9, West Europe, successor of Latin 1.
354 @item
355 iso_8859_2
356 @tab
357 iso8859_2, iso88592, iso_8859_2:1987, iso_ir_101, latin2, l2, csisolatin2
358 @tab
359 table / iso_8859_2
360 @tab
361 ISO 8859-2:1987 - Latin 2, East European.
364 @item
365 iso_8859_3
366 @tab
367 iso_8859_3:1988, iso_ir_109, iso8859_3, latin3, l3, csisolatin3, iso88593
368 @tab
369 table / iso_8859_3
370 @tab
371 ISO 8859-3:1988 - Latin 3, South European.
374 @item
375 iso_8859_4
376 @tab
377 iso8859_4, iso88594, iso_8859_4:1988, iso_ir_110, latin4, l4, csisolatin4
378 @tab
379 table / iso_8859_4
380 @tab
381 ISO 8859-4:1988 - Latin 4, North European.
384 @item
385 iso_8859_5
386 @tab
387 iso8859_5, iso88595, iso_8859_5:1988, iso_ir_144, cyrillic, csisolatincyrillic
388 @tab
389 table / iso_8859_5
390 @tab
391 ISO 8859-5:1988 - Cyrillic.
394 @item
395 iso_8859_6
396 @tab
397 iso_8859_6:1987, iso_ir_127, iso8859_6, ecma_114, asmo_708, arabic, csisolatinarabic, iso88596
398 @tab
399 table / iso_8859_6
400 @tab
ISO 8859-6:1987 - Arabic.
404 @item
405 iso_8859_7
406 @tab
407 iso_8859_7:1987, iso_ir_126, iso8859_7, elot_928, ecma_118, greek, greek8, csisolatingreek, iso88597
408 @tab
409 table / iso_8859_7
410 @tab
411 ISO 8859-7:1987 - Greek.
414 @item
415 iso_8859_8
416 @tab
417 iso_8859_8:1988, iso_ir_138, iso8859_8, hebrew, csisolatinhebrew, iso88598
418 @tab
419 table / iso_8859_8
420 @tab
421 ISO 8859-8:1988 - Hebrew.
424 @item
425 iso_8859_9
426 @tab
427 iso_8859_9:1989, iso_ir_148, iso8859_9, latin5, l5, csisolatin5, iso88599
428 @tab
429 table / iso_8859_9
430 @tab
431 ISO 8859-9:1989 - Latin 5, Turkish.
434 @item
435 iso_ir_111
436 @tab
437 ecma_cyrillic, koi8_e, koi8e, csiso111ecmacyrillic
438 @tab
439 table / iso_ir_111
440 @tab
441 ISO IR 111/ECMA Cyrillic.
444 @item
445 koi8_r
446 @tab
447 cskoi8r, koi8r, koi8
448 @tab
449 table / koi8_r
450 @tab
451 RFC 1489 Cyrillic.
454 @item
455 koi8_ru
456 @tab
457 koi8ru
458 @tab
459 table / koi8_ru
460 @tab
461 The obsolete Ukrainian.
464 @item
465 koi8_u
466 @tab
467 koi8u
468 @tab
469 table / koi8_u
470 @tab
471 RFC 2319 Ukrainian.
474 @item
475 koi8_uni
476 @tab
477 koi8uni
478 @tab
479 table / koi8_uni
480 @tab
481 KOI8 Unified.
484 @item
485 ucs_2
486 @tab
487 ucs2, iso_10646_ucs_2, iso10646_ucs_2, iso_10646_ucs2, iso10646_ucs2, iso10646ucs2, csUnicode
488 @tab
489 ucs_2 / (UCS)
490 @tab
ISO-10646-UCS-2. Big Endian, ZWNBSP (U+FEFF) is always interpreted as ZWNBSP (BOM isn't supported).
494 @item
495 ucs_2_internal
496 @tab
497 ucs2_internal, ucs_2internal, ucs2internal
498 @tab
499 ucs_2_internal / (UCS)
500 @tab
501 ISO-10646-UCS-2 in system byte order.
ZWNBSP (U+FEFF) is always interpreted as ZWNBSP (BOM isn't supported).
505 @item
506 ucs_2be
507 @tab
508 ucs2be
509 @tab
510 ucs_2 / (UCS)
511 @tab
512 Big Endian version of ISO-10646-UCS-2 (in fact, equivalent to ucs_2).
ZWNBSP (U+FEFF) is always interpreted as ZWNBSP (BOM isn't supported).
516 @item
517 ucs_2le
518 @tab
519 ucs2le
520 @tab
521 ucs_2 / (UCS)
522 @tab
523 Little Endian version of ISO-10646-UCS-2.
ZWNBSP (U+FEFF) is always interpreted as ZWNBSP (BOM isn't supported).
527 @item
528 ucs_4
529 @tab
530 ucs4, iso_10646_ucs_4, iso10646_ucs_4, iso_10646_ucs4, iso10646_ucs4, iso10646ucs4
531 @tab
532 ucs_4 / (UCS)
533 @tab
ISO-10646-UCS-4. Big Endian, ZWNBSP (U+FEFF) is always interpreted as ZWNBSP (BOM isn't supported).
537 @item
538 ucs_4_internal
539 @tab
540 ucs4_internal, ucs_4internal, ucs4internal
541 @tab
542 ucs_4_internal / (UCS)
543 @tab
544 ISO-10646-UCS-4 in system byte order.
ZWNBSP (U+FEFF) is always interpreted as ZWNBSP (BOM isn't supported).
548 @item
549 ucs_4be
550 @tab
551 ucs4be
552 @tab
553 ucs_4 / (UCS)
554 @tab
555 Big Endian version of ISO-10646-UCS-4 (in fact, equivalent to ucs_4).
ZWNBSP (U+FEFF) is always interpreted as ZWNBSP (BOM isn't supported).
559 @item
560 ucs_4le
561 @tab
562 ucs4le
563 @tab
564 ucs_4 / (UCS)
565 @tab
566 Little Endian version of ISO-10646-UCS-4.
ZWNBSP (U+FEFF) is always interpreted as ZWNBSP (BOM isn't supported).
570 @item
571 us_ascii
572 @tab
573 ansi_x3.4_1968, ansi_x3.4_1986, iso_646.irv:1991, ascii, iso646_us, us, ibm367, cp367, csascii
574 @tab
575 us_ascii / (ASCII)
576 @tab
577 7-bit ASCII.
580 @item
581 utf_16
582 @tab
583 utf16
584 @tab
585 utf_16 / (UCS)
586 @tab
RFC 2781 UTF-16. The very first ZWNBSP (U+FEFF) code in the stream is interpreted as a BOM.
590 @item
591 utf_16be
592 @tab
593 utf16be
594 @tab
595 utf_16 / (UCS)
596 @tab
597 Big Endian version of RFC 2781 UTF-16.
ZWNBSP (U+FEFF) is always interpreted as ZWNBSP (BOM isn't supported).
601 @item
602 utf_16le
603 @tab
604 utf16le
605 @tab
606 utf_16 / (UCS)
607 @tab
608 Little Endian version of RFC 2781 UTF-16.
ZWNBSP (U+FEFF) is always interpreted as ZWNBSP (BOM isn't supported).
612 @item
613 utf_8
614 @tab
615 utf8
616 @tab
617 utf_8 / (UCS)
618 @tab
619 RFC 3629 UTF-8.
622 @item
623 win_1250
624 @tab
625 cp1250
626 @tab
627 @tab
628 Win-1250 Croatian.
631 @item
632 win_1251
633 @tab
634 cp1251
635 @tab
636 table / win_1251
637 @tab
638 Win-1251 - Cyrillic.
641 @item
642 win_1252
643 @tab
644 cp1252
645 @tab
646 table / win_1252
647 @tab
648 Win-1252 - Latin 1.
651 @item
652 win_1253
653 @tab
654 cp1253
655 @tab
656 table / win_1253
657 @tab
658 Win-1253 - Greek.
661 @item
662 win_1254
663 @tab
664 cp1254
665 @tab
666 table / win_1254
667 @tab
668 Win-1254 - Turkish.
671 @item
672 win_1255
673 @tab
674 cp1255
675 @tab
676 table / win_1255
677 @tab
678 Win-1255 - Hebrew.
681 @item
682 win_1256
683 @tab
684 cp1256
685 @tab
686 table / win_1256
687 @tab
688 Win-1256 - Arabic.
691 @item
692 win_1257
693 @tab
694 cp1257
695 @tab
696 table / win_1257
697 @tab
698 Win-1257 - Baltic.
701 @item
702 win_1258
703 @tab
704 cp1258
705 @tab
706 table / win_1258
707 @tab
Win-1258 - Vietnamese.
709 @end multitable
715 @page
716 @node iconv design decisions
717 @section iconv design decisions
718 @findex CCS table
719 @findex CES converter
720 @findex Speed-optimized tables
721 @findex Size-optimized tables
723 The first iconv library design issue arises when considering the
724 following two design approaches:
726 @enumerate
727 @item
Have modules which implement conversion from the encoding A to the encoding B
and vice versa, i.e., one conversion module relates to any two encodings.
@item
Have modules which implement conversion from the encoding A to the fixed
encoding C and vice versa, i.e., one conversion module relates to any
one encoding A and one fixed encoding C. In this case, to convert from
the encoding A to the encoding B, two modules are needed (in order to convert
from A to C and then from C to B).
736 @end enumerate
There is an obvious tradeoff between generality/flexibility and
efficiency: the first method is more efficient since it converts
directly; however, it is less flexible since a distinct module is
needed for each encoding pair.

The Newlib iconv model uses the second method and always converts through the 32-bit
UCS, but its design also allows one to write specialized conversion
modules if the conversion speed is critical.
The second design issue is how to break down (decompose) encodings.
The Newlib iconv library uses the fact that any encoding may be
considered as one or more CCS-es plus a CES. Accordingly, it decomposes its
conversion modules into a @dfn{CES converter} plus one or more @dfn{CCS
tables}. CCS tables map CCS to UCS and vice versa; the CES converters
map CCS to the encoding and vice versa.
As an example, let's consider the conversion between the big5 and
EUC-TW encodings. The big5 encoding may be decomposed into the ASCII and BIG5
CCS-es plus the BIG5 CES. EUC-TW may be decomposed into the CNS11643_PLANE1, CNS11643_PLANE2,
and CNS11643_PLANE14 CCS-es plus the EUC CES.

The euc_tw -> big5 conversion is performed as follows:
766 @enumerate
767 @item
The EUC converter transforms the EUC-TW encoding to the codes of the corresponding
CCS-es (CNS11643_PLANE1, CNS11643_PLANE2 and CNS11643_PLANE14);
771 @item
772 The obtained CCS codes are transformed to the UCS codes using the CNS11643_PLANE1,
773 CNS11643_PLANE2 and CNS11643_PLANE14 CCS tables;
774 @item
775 The resulting UCS codes are transformed to the ASCII and BIG5 codes using
776 the corresponding CCS tables;
777 @item
778 The obtained CCS codes are transformed to the big5 encoding using the corresponding
779 CES converter.
780 @end enumerate
783 Analogously, the backward conversion is performed as follows:
785 @enumerate
786 @item
The BIG5 converter transforms the big5 encoding to the codes of the corresponding
CCS-es (ASCII and BIG5);
@item
The obtained CCS codes are transformed to the UCS codes using the ASCII and BIG5 CCS tables;
@item
The resulting UCS codes are transformed to the CNS11643_PLANE1, CNS11643_PLANE2 and
CNS11643_PLANE14 codes using the corresponding CCS tables;
794 @item
795 The obtained CCS codes are transformed to the EUC-TW encoding using the corresponding
796 CES converter.
797 @end enumerate
Note that the above is just an example; the real names of the CES converters and
the CCS tables implemented in the Newlib iconv are slightly different.
The third design issue also relates to flexibility. Obviously, it isn't
desirable to always link all the CES converters and the CCS tables into the library;
instead, we want to be able to load the needed converters and tables
dynamically on demand. This isn't a problem on "big" machines such as
a PC, but it may be very problematic on "small" embedded systems.
Since the CCS tables are just data, it is possible to load them
dynamically from external files. The CES converters, on the other hand,
are algorithms with some code, so a dynamic library loading
capability is required.
817 Apart from possible restrictions applied by embedded systems (small
818 RAM for example), Newlib itself has no dynamic library support and
819 therefore, all the CES converters which will ever be used must be linked into
820 the library. However, loading of the dynamic CCS tables is possible and is
821 implemented in the Newlib iconv library. It may be enabled via the Newlib
822 configure script options.
The next design issue is fine-tuning the iconv library
configuration. One important ability is for iconv to not link all its
converters and tables (if dynamic loading is not enabled) but instead
enable only those encodings which are specified at configuration
time (see the section about the configure script options).
In addition, the Newlib iconv library configure options distinguish between
conversion directions. This means that not only are the supported encodings
selectable, the conversion direction is as well. For example, if the user wants
a configuration which allows conversions from UTF-8 to UTF-16 and
doesn't plan to use the "UTF-16 to UTF-8" conversions, he or she can
enable only this conversion direction (i.e., no "UTF-16 -> UTF-8"-related code will
be included), thus saving some memory (note that this technique allows
one half of a CCS table, which may be quite big, to be excluded from linking).
One more design aspect is the speed- and size-optimized tables. Users can
select between them using configure script options. The
speed-optimized CCS tables are the same as the size-optimized ones in
the case of 8-bit CCS-es (e.g., KOI8-R), but for 16-bit CCS-es the size-optimized
CCS tables may be 1.5 to 2 times smaller than the speed-optimized ones. On the
other hand, conversion with speed-optimized tables is several times faster.
It is worth stressing that new encoding support can't be
dynamically added to an already compiled Newlib library, even if it
needs only an additional CCS table and iconv is configured to use
the external files with CCS tables (this isn't a fundamental restriction:
the ability to add new table-based encoding support dynamically, by
just adding a new .cct file, could easily be added).
Theoretically, the compiled-in CCS tables should be more appropriate for
embedded systems than dynamically loaded CCS tables. This is because the
compiled-in tables are read-only and can be placed in ROM,
whereas dynamic loading requires RAM. Moreover, in the current iconv
implementation, a distinct copy of the dynamic CCS file is loaded for each
opened iconv descriptor, even in the case of the same encoding.
This means, for example, that if two iconv descriptors for
"KOI8-R -> UCS-4BE" and "KOI8-R -> UTF-16BE" are opened, two copies of
the koi8-r .cct file will be loaded (actually, iconv loads only the needed part
of these files). On the other hand, in the case of compiled-in CCS tables,
there will always be only one copy.
868 @page
869 @node iconv configuration
870 @section iconv configuration
871 @findex iconv configuration
872 @findex --enable-newlib-iconv-encodings
873 @findex --enable-newlib-iconv-from-encodings
874 @findex --enable-newlib-iconv-to-encodings
875 @findex --enable-newlib-iconv-external-ccs
876 @findex NLSPATH
To enable an encoding, the @option{--enable-newlib-iconv-encodings} configure
script option should be used. This option accepts a comma-separated list
of @emph{encodings} that should be enabled. The option enables each encoding in both
("to" and "from") directions.
884 The @option{--enable-newlib-iconv-from-encodings} configure script option enables
885 "from" support for each encoding that was passed to it.
888 The @option{--enable-newlib-iconv-to-encodings} configure script option enables
889 "to" support for each encoding that was passed to it.
Example: if the user plans only the "KOI8-R -> UTF-8", "UTF-8 -> ISO-8859-5" and
"KOI8-R -> UCS-2" conversions, the optimal way (minimal iconv
code and data will be linked) is to configure Newlib with the following
options:
897 @code{--enable-newlib-iconv-encodings=UTF-8
898 --enable-newlib-iconv-from-encodings=KOI8-R
899 --enable-newlib-iconv-to-encodings=UCS-2,ISO-8859-5}
901 which is the same as
903 @code{--enable-newlib-iconv-from-encodings=KOI8-R,UTF-8
904 --enable-newlib-iconv-to-encodings=UCS-2,ISO-8859-5,UTF-8}
The user may also just use the
908 @code{--enable-newlib-iconv-encodings=KOI8-R,ISO-8859-5,UTF-8,UCS-2}
configure script option, but this is less optimal since some
unneeded data and code will be linked.
914 The @option{--enable-newlib-iconv-external-ccs} option enables iconv's
915 capabilities to work with the external CCS files.
The @option{--enable-target-optspace} Newlib configure script option also affects
the iconv library. If this option is present, the library uses the
size-optimized CCS tables. This means that only the size-optimized CCS
tables will be linked or, if the
@option{--enable-newlib-iconv-external-ccs} configure script option was used,
the iconv library will load the size-optimized tables. If the
@option{--enable-target-optspace} configure script option is disabled,
the speed-optimized CCS tables are used.
928 Note: .cct files are searched by iconv_open in the $NLSPATH/iconv_data/ directory.
929 Thus, the NLSPATH environment variable should be set.
935 @page
936 @node Encoding names
937 @section Encoding names
938 @findex encoding name
939 @findex encoding alias
940 @findex normalized name
Each encoding has one @dfn{name} and a number of @dfn{aliases}. When
the user works with the iconv library (i.e., when the @code{iconv_open} call
is used), either the name or any of the aliases may be used. The same applies when encoding
names are used in configure script options.
948 Names and aliases may be specified in any case (small or capital
949 letters) and the @kbd{-} symbol is equivalent to the @kbd{_} symbol.
Internally the Newlib iconv library always converts aliases to names. It
also converts names and aliases to the @dfn{normalized} form, which means
that all capital letters are converted to small letters and the @kbd{-}
symbols are converted to @kbd{_} symbols.
960 @page
961 @node CCS tables
962 @section CCS tables
963 @findex Size-optimized CCS table
964 @findex Speed-optimized CCS table
965 @findex mktbl.pl Perl script
966 @findex .cct files
967 @findex The CCT tables source files
968 @findex CCS source files
The iconv library stores files with CCS tables in the @emph{ccs/}
subdirectory. The CCS tables for any CCS may be kept in two forms: in binary form
(@dfn{.cct files}, see the @emph{ccs/binary/} subdirectory) and in the form
of compilable .c source files. The .cct files are only used when the
974 @option{--enable-newlib-iconv-external-ccs} configure script option is enabled.
975 The .c files are linked to the Newlib library if the corresponding
976 encoding is enabled.
979 As stated earlier, the Newlib iconv library performs all
980 conversions through the 32-bit UCS, but the codes which are used
981 in most CCS-es, fit into the first 16-bit subset of the 32-bit UCS set.
982 Thus, in order to make the CCS tables more compact, the 16-bit UCS-2 is
983 used instead of the 32-bit UCS-4.
CCS tables may be 8- or 16-bit wide. 8-bit CCS tables map 8-bit CCS to
16-bit UCS-2 and vice versa while 16-bit CCS tables map
16-bit CCS to 16-bit UCS-2 and vice versa.
8-bit tables are small while 16-bit tables may be quite large.
Because of this, 16-bit CCS tables may be
either speed- or size-optimized. Size-optimized CCS tables are
smaller than speed-optimized ones, but the conversion process is
slower if the size-optimized CCS tables are used. 8-bit CCS tables have only
a speed-optimized variant.
Each CCS table (both speed- and size-optimized) consists of
@dfn{from_ucs} and @dfn{to_ucs} subtables. The "from_ucs" subtable maps
UCS-2 codes to CCS codes, while the "to_ucs" subtable maps CCS codes to
UCS-2 codes.

Almost all 16-bit CCS tables contain fewer than 0xFFFF codes and
a lot of gaps exist.
1005 @subsection Speed-optimized tables format
In the case of 8-bit speed-optimized CCS tables the "to_ucs" subtable format is
trivial: it is just an array of 256 16-bit UCS codes. Therefore, the
UCS-2 code @emph{Y} corresponding to a CCS code @emph{X} is calculated
as @emph{Y = to_ucs[X]}.
Obviously, the simplest way to create the "from_ucs" table or the
16-bit "to_ucs" table is to use a huge 16-bit array, as in the case
of the 8-bit "to_ucs" table. But almost all the 16-bit CCS tables contain
fewer than 0xFFFF code maps and this fact may be exploited to reduce
the size of the CCS tables.

This subsection describes the "UCS-2 -> CCS" 8-bit CCS table format. The
16-bit "CCS -> UCS-2" CCS table format is the same, except for the mapping
direction and the number of CCS bits.

In the case of the 8-bit speed-optimized table the "from_ucs" subtable
corresponds to the "from_ucs" array and has the following layout:
1029 from_ucs array:
1031 -------------------------------------
1033 0xFF mapping (2 bytes) (only for
1034 8-bit table).
1036 -------------------------------------
1038 Heading block
1040 -------------------------------------
1042 Block 1
1044 -------------------------------------
1046 Block 2
1048 -------------------------------------
1052 -------------------------------------
1054 Block N
1056 -------------------------------------
The 0x0000-0xFFFF 16-bit code range is divided into 256 code subranges. Each
subrange is represented by a 256-element @dfn{block} (256 1-byte
elements, or 256 2-byte elements in the case of a 16-bit CCS table) with
elements which are equivalent to the CCS codes of this subrange.
If the "UCS-2 -> CCS" mapping has big enough gaps, some blocks will be
absent and there will be fewer than 256 blocks.
Any element number @emph{m} of @dfn{the heading block} (which contains
256 2-byte elements) corresponds to the @emph{m}-th 256-element subrange.
If the subrange contains some codes, the @emph{m}-th element of
the heading block contains the offset of the corresponding block in the
"from_ucs" array. If there are no codes in the subrange, the heading
block element contains 0xFFFF.

If there are some gaps in a block, the corresponding block elements have
the 0xFF value. If there is a 0xFF code present in the CCS, its mapping
is defined in the first 2-byte element of the "from_ucs" array.
Given this table format, the algorithm for finding the CCS code
@emph{X} which corresponds to the UCS-2 code @emph{Y} is as follows.

@enumerate
@item If @emph{Y} is equal to the value of the first 2-byte element
of the "from_ucs" array, @emph{X} is 0xFF. Else, continue to search.

@item Calculate the block number: @emph{BlkN = (Y & 0xFF00) >> 8}.

@item If the heading block element with number @emph{BlkN} is 0xFFFF, there
is no corresponding CCS code (error, wrong input data). Else, fetch the
"from_ucs" array index @emph{BlkIdx} of the @emph{BlkN}-th block from that element.

@item Calculate the offset of the @emph{X} code in its block:
@emph{Xindex = Y & 0xFF}

@item If the value of the @emph{Xindex}-th element of the block (which is
@emph{from_ucs[BlkIdx+Xindex]}) is 0xFF, there is no corresponding
CCS code (error, wrong input data). Else, @emph{X = from_ucs[BlkIdx+Xindex]}.
1100 @end enumerate
1102 @subsection Size-optimized tables format
As stated above, size-optimized tables exist only for 16-bit CCS-es.
This is because there is too little difference between the speed-optimized
and the size-optimized table sizes in the case of 8-bit CCS-es.

The formats of the "to_ucs" and "from_ucs" subtables are equivalent in the case of
size-optimized tables.

This section describes the format of the "UCS-2 -> CCS" size-optimized
CCS table. The format of the "CCS -> UCS-2" table is the same.

The idea of the size-optimized tables is to split the UCS-2 codes
("from" codes) into @dfn{ranges} (a @dfn{range} is a number of consecutive UCS-2 codes).
Then CCS codes ("to" codes) are stored only for the codes from these
ranges. Distinct "from" codes which belong to no range (@dfn{unranged codes}) are stored
together with the corresponding "to" codes.
1122 The following is the layout of the size-optimized table array:
1125 size_arr array:
1127 -------------------------------------
1129 Ranges number (2 bytes)
1131 -------------------------------------
1133 Unranged codes number (2 bytes)
1135 -------------------------------------
1137 Unranged codes array index (2 bytes)
1139 -------------------------------------
1141 Ranges indexes (triads)
1143 -------------------------------------
1145 Ranges
1147 -------------------------------------
1149 Unranged codes array
1151 -------------------------------------
1154 The @dfn{Unranged codes array index} @emph{size_arr} section helps to find
1155 the offset of the needed range in the @emph{size_arr} and has
1156 the following format (triads):
1158 the first code in range, the last code in range, range offset.
1161 The array of these triads is sorted by the first element, therefore it is
1162 possible to quickly find the needed range index.
1165 Each range has the corresponding sub-array containing the "to" codes. These
1166 sub-arrays are stored in the place marked as "Ranges" in the layout
1167 diagram.
1170 The "Unranged codes array" contains pairs ("from" code, "to" code) for
1171 each unranged code. The array of these pairs is sorted by "from" code
1172 values, therefore it is possible to find the needed pair quickly.
1175 Note that each range requires 6 bytes to form its index. If, for
1176 example, there are two ranges (1 - 5 and 9 - 10), and one unranged code
1177 (7), 12 bytes are needed for two range indexes and 4 bytes for the unranged
1178 code (total 16). But it is better to join both ranges as 1 - 10 and
1179 mark codes 6 and 8 as absent. In this case, only 6 additional bytes for the
1180 range index and 4 bytes to mark codes 6 and 8 as absent are needed
1181 (total 10 bytes). This optimization is done in the size-optimized tables.
1182 Thus, ranges may contain small gaps. The absent codes in ranges are marked
1183 as 0xFFFF.
1186 Note that a pair of adjacent "from" codes is stored by means of unranged codes since
1187 the space needed to form a range is greater than the space needed to store two
1188 unranged codes (5 two-byte words against 4, that is, 10 bytes against 8).
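
The byte counts in the example above can be reproduced with a small sketch. It counts only the overhead bytes that the text counts (range indexes, unranged pairs and gap markers), not the "to" codes stored inside the ranges. The function names are hypothetical and are not part of the Newlib sources.

```c
#include <stddef.h>

/* Sizes taken from the text: a range index (triad) takes 6 bytes,
 * an unranged ("from", "to") pair takes 4 bytes and an absent-code
 * marker (0xFFFF) inside a range takes 2 bytes. */
enum { RANGE_INDEX_BYTES = 6, UNRANGED_PAIR_BYTES = 4, GAP_MARKER_BYTES = 2 };

/* Overhead when the ranges are kept separate. */
static size_t separate_overhead(size_t ranges, size_t unranged)
{
    return ranges * RANGE_INDEX_BYTES + unranged * UNRANGED_PAIR_BYTES;
}

/* Overhead when everything is joined into one range with gaps. */
static size_t joined_overhead(size_t gaps)
{
    return RANGE_INDEX_BYTES + gaps * GAP_MARKER_BYTES;
}
```

For the example in the text, separate_overhead(2, 1) gives 16 and joined_overhead(2) gives 10.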
1191 The algorithm for finding the CCS code
1192 @emph{X} which corresponds to the UCS-2 code @emph{Y} (input) in the "UCS-2 ->
1193 CCS" size-optimized table is as follows.
1196 @enumerate
1197 @item Try to find the corresponding triad in the "Ranges indexes"
1198 section. Since the array is sorted, this can be done quickly with a
1199 binary search (divide by 2, compare, etc).
1201 @item If the triad is found, fetch the @emph{X} code from the corresponding
1202 range array. If it is 0xFFFF, return an error.
1204 @item If there is no corresponding triad, search the @emph{X} code among the
1205 sorted unranged codes. Return an error if nothing was found.
1206 @end enumerate
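
The search steps above can be sketched in C as follows. This is only an illustration under stated assumptions: the table is viewed as an array of 16-bit values in native byte order, the "range offset" and the "Unranged codes array index" are taken to be element indexes into that array, and the triads start right after the three header values. The name size_tbl_lookup is hypothetical; the real Newlib code may be organized differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ABSENT_CODE 0xFFFF  /* marks a gap inside a range */

/* Look up UCS-2 code y; on success store the CCS code in *x and
 * return 0, otherwise return -1. */
static int size_tbl_lookup(const uint16_t *size_arr, uint16_t y, uint16_t *x)
{
    assert(size_arr != NULL && x != NULL);

    size_t nranges = size_arr[0];                  /* ranges number */
    size_t nunranged = size_arr[1];                /* unranged codes number */
    const uint16_t *pairs = size_arr + size_arr[2];/* unranged codes array */
    const uint16_t *triads = size_arr + 3;         /* ranges indexes */

    /* Step 1: binary search over the (first, last, offset) triads. */
    size_t lo = 0, hi = nranges;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const uint16_t *t = triads + 3 * mid;
        if (y < t[0]) {
            hi = mid;
        } else if (y > t[1]) {
            lo = mid + 1;
        } else {
            /* Step 2: fetch the code from the range sub-array. */
            uint16_t code = size_arr[t[2] + (size_t)(y - t[0])];
            if (code == ABSENT_CODE)
                return -1;
            *x = code;
            return 0;
        }
    }

    /* Step 3: binary search over the sorted ("from", "to") pairs. */
    lo = 0;
    hi = nunranged;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const uint16_t *p = pairs + 2 * mid;
        if (y < p[0]) {
            hi = mid;
        } else if (y > p[0]) {
            lo = mid + 1;
        } else {
            *x = p[1];
            return 0;
        }
    }
    return -1;
}
```

Both searches are logarithmic in the number of entries, so the lookup cost stays small even though the table is much more compact than the speed-optimized one.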
1208 @subsection .cct and .c CCS Table files
1210 The .c source files for 8-bit CCS tables have "to_ucs" and "from_ucs"
1211 speed-optimized tables. The .c source files for 16-bit CCS tables have
1212 "to_ucs_speed", "to_ucs_size", "from_ucs_speed" and "from_ucs_size"
1213 tables.
1216 When .c files are compiled and used, all the 16-bit and 32-bit values
1217 have the native endian format (Big Endian for the BE systems and Little
1218 Endian for the LE systems) since they are compiled for the system before
1219 they are used.
1222 In case of .cct files, which are intended for dynamic loading of CCS
1223 tables, the CCS tables are stored either in LE or BE format. Since the
1224 .cct files are generated by the 'mktbl.pl' Perl script, it is possible
1225 to choose the endianness of the tables. It is also possible to store two
1226 copies (both LE and BE) of the CCS tables in one .cct file. The default
1227 .cct files (which come with the Newlib sources) have both LE and BE CCS
1228 tables. The Newlib iconv library automatically chooses the needed CCS tables
1229 (with appropriate endianness).
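
A minimal sketch of how a loader might pick the copy that matches the host byte order is shown below. The .cct file layout is not described in this section, so the two byte buffers simply stand for the LE and BE copies of the same table; all names here are hypothetical and are not taken from the Newlib sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 on a little-endian host and 0 on a big-endian host. */
static int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    unsigned char first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Pick the table copy whose byte order matches the host, so that
 * its values can be used without byte swapping. */
static const unsigned char *pick_table_copy(const unsigned char *le_copy,
                                            const unsigned char *be_copy)
{
    assert(le_copy != NULL && be_copy != NULL);
    return host_is_little_endian() ? le_copy : be_copy;
}

/* Read one 16-bit value in host byte order from the chosen copy. */
static uint16_t read_native_u16(const unsigned char *p)
{
    uint16_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

With both copies available, the value read through the chosen copy is the same on LE and BE hosts, which is why shipping both copies in one file keeps the default .cct files portable.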
1232 Note that the .cct files are only used when the
1233 @option{--enable-newlib-iconv-external-ccs} configure option is used.
1235 @subsection The 'mktbl.pl' Perl script
1237 The 'mktbl.pl' script is intended to generate .cct and .c CCS table
1238 files from the @dfn{CCS source files}.
1241 The CCS source files are just text files which have one or more columns
1242 with the CCS <-> UCS-2 code mapping. For an example of a CCS table
1243 source file, see one of the URLs given below.
1246 The following table describes where the source files for CCS table files
1247 provided by the Newlib distribution are located.
1249 @multitable @columnfractions .25 .75
1250 @item
1251 Name
1252 @tab
1253 URL
1255 @item
1256 @tab
1258 @item
1259 big5
1260 @tab
1261 http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT
1263 @item
1264 cns11643_plane1
1265 cns11643_plane14
1266 cns11643_plane2
1267 @tab
1268 http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/CNS11643.TXT
1270 @item
1271 cp775
1272 cp850
1273 cp852
1274 cp855
1275 cp866
1276 @tab
1277 http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/
1279 @item
1280 iso_8859_1
1281 iso_8859_2
1282 iso_8859_3
1283 iso_8859_4
1284 iso_8859_5
1285 iso_8859_6
1286 iso_8859_7
1287 iso_8859_8
1288 iso_8859_9
1289 iso_8859_10
1290 iso_8859_11
1291 iso_8859_13
1292 iso_8859_14
1293 iso_8859_15
1294 @tab
1295 http://www.unicode.org/Public/MAPPINGS/ISO8859/
1297 @item
1298 iso_ir_111
1299 @tab
1300 http://crl.nmsu.edu/~mleisher/csets/ISOIR111.TXT
1302 @item
1303 jis_x0201_1976
1304 jis_x0208_1990
1305 jis_x0212_1990
1306 @tab
1307 http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0201.TXT
1309 @item
1310 koi8_r
1311 @tab
1312 http://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8-R.TXT
1314 @item
1315 koi8_ru
1316 @tab
1317 http://crl.nmsu.edu/~mleisher/csets/KOI8RU.TXT
1319 @item
1320 koi8_u
1321 @tab
1322 http://crl.nmsu.edu/~mleisher/csets/KOI8U.TXT
1324 @item
1325 koi8_uni
1326 @tab
1327 http://crl.nmsu.edu/~mleisher/csets/KOI8UNI.TXT
1329 @item
1330 ksx1001
1331 @tab
1332 http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSX1001.TXT
1334 @item
1335 win_1250
1336 win_1251
1337 win_1252
1338 win_1253
1339 win_1254
1340 win_1255
1341 win_1256
1342 win_1257
1343 win_1258
1344 @tab
1345 http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/
1346 @end multitable
1348 The CCS source files aren't distributed with Newlib because of license
1349 restrictions on most of the Unicode.org files.
1351 The following are the 'mktbl.pl' options which were used to generate the .cct
1352 files. Note that to generate the CCS table .c source files, the @option{-s} option
1353 should be added.
1355 @enumerate
1356 @item For the iso_8859_10.cct, iso_8859_13.cct, iso_8859_14.cct, iso_8859_15.cct,
1357 iso_8859_1.cct, iso_8859_2.cct, iso_8859_3.cct, iso_8859_4.cct,
1358 iso_8859_5.cct, iso_8859_6.cct, iso_8859_7.cct, iso_8859_8.cct,
1359 iso_8859_9.cct, iso_8859_11.cct, win_1250.cct, win_1252.cct, win_1254.cct,
1360 win_1256.cct, win_1258.cct, win_1251.cct,
1361 win_1253.cct, win_1255.cct, win_1257.cct,
1362 koi8_r.cct, koi8_ru.cct, koi8_u.cct, koi8_uni.cct, iso_ir_111.cct,
1363 big5.cct, cp775.cct, cp850.cct, cp852.cct, cp855.cct, cp866.cct, cns11643.cct
1364 files, only the @option{-i <SRC_FILE_NAME>} option was used.
1366 @item To generate the jis_x0208_1990.cct file, the
1367 @option{-i jis_x0208_1990.txt -x 2 -y 3} options were used.
1369 @item To generate the cns11643_plane1.cct file, the
1370 @option{-i cns11643.txt -p1 -N cns11643_plane1 -o cns11643_plane1.cct}
1371 options were used.
1373 @item To generate the cns11643_plane2.cct file, the
1374 @option{-i cns11643.txt -p2 -N cns11643_plane2 -o cns11643_plane2.cct}
1375 options were used.
1377 @item To generate the cns11643_plane14.cct file, the
1378 @option{-i cns11643.txt -p0xE -N cns11643_plane14 -o cns11643_plane14.cct}
1379 options were used.
1380 @end enumerate
1383 For more info about the 'mktbl.pl' options, see the 'mktbl.pl -h' output.
1386 It is assumed that CCS codes are 16 bits wide or less. If there are wider CCS codes
1387 in the CCS source file, the bits above the lower 16 define the plane (see the
1388 cns11643.txt CCS source file).
1391 Sometimes it is impossible to map some CCS codes to the 16-bit UCS, for example when
1392 several different CCS codes are mapped to one UCS-2 code or one CCS code is mapped to
1393 a pair of UCS-2 codes. In these cases, such CCS codes (@dfn{lost
1394 codes}) aren't simply rejected; instead, they are mapped to the default
1395 UCS-2 code (which is currently the @kbd{?} character's code).
1401 @page
1402 @node CES converters
1403 @section CES converters
1404 @findex PCS
1406 Similar to the CCS tables, CES converters are also split into "from UCS"
1407 and "to UCS" parts. Depending on the iconv library configuration, these
1408 parts are enabled or disabled.
1411 The following is the list of CES converters which are currently present
1412 in the Newlib iconv library.
1414 @itemize @bullet
1415 @item
1416 @emph{euc} - supports the @emph{euc_jp}, @emph{euc_kr} and @emph{euc_tw}
1417 encodings. The @emph{euc} CES converter uses the @emph{table} and the
1418 @emph{us_ascii} CES converters.
1420 @item
1421 @emph{table} - this CES converter corresponds to "null" and just performs
1422 tables-based conversion using 8- and 16-bit CCS tables. This converter
1423 is also used by any other CES converter which needs the CCS table-based
1424 conversions. The @emph{table} converter is also responsible for loading
1425 .cct files.
1427 @item
1428 @emph{table_pcs} - this is the wrapper over the @emph{table} converter
1429 which is intended for 16-bit encodings which also use the @dfn{Portable
1430 Character Set} (@dfn{PCS}) which is the same as the @emph{US-ASCII}.
1431 This means that if the first byte of the CCS code is in the range [0x00-0x7f],
1432 it is a 7-bit PCS code. Otherwise, it is a 16-bit CCS code. Of course,
1433 the 16-bit codes must not contain bytes in the range of [0x00-0x7f].
1434 The @emph{big5} encoding uses the @emph{table_pcs} CES converter and the
1435 @emph{table_pcs} CES converter depends on the @emph{table} CES converter.
1437 @item
1438 @emph{ucs_2} - intended for the @emph{ucs_2}, @emph{ucs_2be} and
1439 @emph{ucs_2le} encodings support.
1441 @item
1442 @emph{ucs_4} - intended for the @emph{ucs_4}, @emph{ucs_4be} and
1443 @emph{ucs_4le} encodings support.
1445 @item
1446 @emph{ucs_2_internal} - intended for the @emph{ucs_2_internal} encoding support.
1448 @item
1449 @emph{ucs_4_internal} - intended for the @emph{ucs_4_internal} encoding support.
1451 @item
1452 @emph{us_ascii} - intended for the @emph{us_ascii} encoding support. In
1453 principle, the most natural way to support the @emph{us_ascii} encoding
1454 is to define the @emph{us_ascii} CCS and use the @emph{table} CES
1455 converter. But for the optimization purposes, the specialized
1456 @emph{us_ascii} CES converter was created.
1458 @item
1459 @emph{utf_16} - intended for the @emph{utf_16}, @emph{utf_16be} and
1460 @emph{utf_16le} encodings support.
1462 @item
1463 @emph{utf_8} - intended for the @emph{utf_8} encoding support.
1464 @end itemize
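
The first-byte rule described for the @emph{table_pcs} converter can be sketched as follows. This is an illustration only: the function name is hypothetical, the 16-bit code is assumed to be stored with its first byte as the high byte, and the real converter goes on to look the code up in a CCS table, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split the next code off a byte stream.
 * Returns the number of bytes consumed (1 or 2), or 0 if the input
 * is empty or ends in the middle of a 16-bit code.
 * *is_pcs is set to 1 for a 7-bit PCS (US-ASCII) code, 0 otherwise. */
static size_t table_pcs_next(const unsigned char *in, size_t len,
                             uint16_t *code, int *is_pcs)
{
    assert(in != NULL && code != NULL && is_pcs != NULL);

    if (len == 0)
        return 0;

    if (in[0] <= 0x7f) {          /* first byte in [0x00-0x7f]: PCS code */
        *code = in[0];
        *is_pcs = 1;
        return 1;
    }

    if (len < 2)                  /* truncated 16-bit code */
        return 0;

    *code = (uint16_t)((in[0] << 8) | in[1]);
    *is_pcs = 0;
    return 2;
}
```

Only the first byte decides which branch is taken, so the caller can walk a mixed buffer by repeatedly advancing by the returned length.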
1470 @page
1471 @node The encodings description file
1472 @section The encodings description file
1473 @findex encoding.deps description file
1474 @findex mkdeps.pl Perl script
1476 To simplify the process of adding support for new encodings, a lot of
1477 "glue" files are generated automatically.
1480 There is the 'encoding.deps' file in the @emph{lib/} subdirectory which
1481 is used to describe encoding's properties. The 'mkdeps.pl' Perl script
1482 uses 'encoding.deps' to generate the "glue" files.
1485 The 'encoding.deps' file is composed of sections, each section consists
1486 of entries, each entry contains some encoding/CES/CCS description.
1489 The 'encoding.deps' file's syntax is very simple. Currently only two
1490 sections are defined: @emph{ENCODINGS} and @emph{CES_DEPENDENCIES}.
1493 Each @emph{ENCODINGS} section's entry describes one encoding and
1494 contains the following information.
1496 @itemize @bullet
1497 @item
1498 Encoding name (the @emph{ENCODING} field). The name should
1499 be unique and only one name is possible.
1501 @item
1502 The encoding's CES converter name (the @emph{CES} field). Only one CES
1503 converter is allowed.
1505 @item
1506 The whitespace-separated list of CCS table names which are used by the
1507 encoding (the @emph{CCS} field).
1509 @item
1510 The whitespace-separated list of alias names (the @emph{ALIASES}
1511 field).
1512 @end itemize
1515 Note that all names in the 'encoding.deps' file must be in normalized
1516 form.
1519 Each @emph{CES_DEPENDENCIES} section's entry describes the dependencies of
1520 one CES converter. For example, the @emph{euc} CES converter depends on
1521 the @emph{table} and the @emph{us_ascii} CES converters since the
1522 @emph{euc} CES converter uses them. This means that both the @emph{table}
1523 and @emph{us_ascii} CES converters should be linked if the @emph{euc}
1524 CES converter is enabled.
1527 The @emph{CES_DEPENDENCIES} section defines the following:
1529 @itemize @bullet
1530 @item
1531 the CES converter name for which the dependencies are defined in this
1532 entry (the @emph{CES} field);
1534 @item
1535 the whitespace-separated list of CES converters which are needed for
1536 this CES converter (the @emph{USED_CES} field).
1537 @end itemize
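
The effect of the @emph{CES_DEPENDENCIES} entries can be illustrated with a small C sketch that enables a CES converter together with everything it depends on. The two dependency entries come from the converter descriptions in this document (@emph{euc} uses @emph{table} and @emph{us_ascii}; @emph{table_pcs} depends on @emph{table}). The data structures and function names are hypothetical; in Newlib this resolution is done through generated preprocessor macros rather than at run time.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CES 16

/* One dependency entry: a CES name and a NULL-terminated USED_CES list. */
struct ces_dep {
    const char *ces;
    const char *used[4];
};

static const struct ces_dep ces_deps[] = {
    { "euc",       { "table", "us_ascii", NULL } },
    { "table_pcs", { "table", NULL } },
};

/* The set of CES converters that must be linked in. */
struct ces_set {
    const char *names[MAX_CES];
    size_t count;
};

static int ces_set_has(const struct ces_set *s, const char *name)
{
    for (size_t i = 0; i < s->count; i++)
        if (strcmp(s->names[i], name) == 0)
            return 1;
    return 0;
}

/* Enable a converter and, recursively, every converter it uses. */
static void ces_enable(struct ces_set *s, const char *name)
{
    assert(s != NULL && name != NULL);

    if (ces_set_has(s, name) || s->count == MAX_CES)
        return;
    s->names[s->count++] = name;

    for (size_t i = 0; i < sizeof ces_deps / sizeof ces_deps[0]; i++) {
        if (strcmp(ces_deps[i].ces, name) != 0)
            continue;
        for (size_t j = 0; ces_deps[i].used[j] != NULL; j++)
            ces_enable(s, ces_deps[i].used[j]);
    }
}
```

Enabling @emph{euc} in this sketch yields the set euc, table, us_ascii, which matches the example given in the text.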
1540 The 'mkdeps.pl' Perl script automatically solves the following tasks.
1542 @itemize @bullet
1543 @item
1544 The user works with the iconv library in terms of encodings and doesn't need to
1545 know anything about CES converters and CCS tables. The script automatically
1546 generates code which enables all needed CES converters and CCS tables
1547 for all encodings, which were enabled by the user.
1549 @item
1550 The CES converters may have dependencies and the script automatically
1551 generates the code which handles these dependencies.
1553 @item
1554 The list of encoding's aliases is also automatically generated.
1556 @item
1557 The script uses a lot of macros in order to enable only the minimum set
1558 of code/data which is needed to support the requested encodings in the
1559 requested directions.
1560 @end itemize
1563 The 'mkdeps.pl' Perl script interprets the 'encoding.deps'
1564 file and generates the following files.
1566 @itemize @bullet
1567 @item
1568 @emph{lib/encnames.h} - this header file contains macro definitions for all
1569 encoding names.
1571 @item
1572 @emph{lib/aliasesbi.c} - the array of encoding names and aliases. The array
1573 is used to find the name of the requested encoding by its alias.
1575 @item
1576 @emph{ces/cesbi.c} - this file defines two arrays
1577 (@code{_iconv_from_ucs_ces} and @code{_iconv_to_ucs_ces}) which contain
1578 descriptions of the enabled "from UCS" and "to UCS" CES converters and the
1579 names of encodings which are supported by these CES converters.
1581 @item
1582 @emph{ces/cesbi.h} - this file contains the set of macros which define
1583 the set of CES converters which should be enabled if only the set of
1584 enabled encodings is given (through macros defined in the
1585 @emph{newlib.h} file). Note, that one CES converter may handle several
1586 encodings.
1588 @item
1589 @emph{ces/cesdeps.h} - the CES converters dependencies are handled in
1590 this file.
1592 @item
1593 @emph{ccs/ccsdeps.h} - the array of linked-in CCS tables is defined
1594 here.
1596 @item
1597 @emph{ccs/ccsnames.h} - this header file contains macro definitions for all
1598 CCS names.
1600 @item
1601 @emph{encoding.aliases} - the list of supported encodings and their
1602 aliases which is intended for the Newlib configure scripts in order to
1603 handle the iconv-related configure script options.
1604 @end itemize
1610 @page
1611 @node How to add new encoding
1612 @section How to add new encoding
1614 First, the new encoding should be broken down into CCS and CES. Then,
1615 the process of adding the new encoding is split into the following activities.
1617 @enumerate
1618 @item Generate the .cct CCS file and the .c source file for the new
1619 encoding's CCS (if it isn't already present). To do this, the CCS source
1620 file must be available and the 'mktbl.pl' script should be used.
1622 @item Write the corresponding CES converter (if it isn't already
1623 present). Use the existing CES converters as an example.
1625 @item
1626 Add the corresponding entries to the 'encoding.deps' file and regenerate
1627 the autogenerated "glue" files using the 'mkdeps.pl' script.
1629 @item
1630 Don't forget to add entries to the newlib/newlib.hin file.
1632 @item
1633 Of course, the 'Makefile.am'-s should also be updated (if new files were
1634 added) and the 'Makefile.in'-s should be regenerated using the correct
1635 version of 'automake'.
1637 @item
1638 Don't forget to update the documentation (the list of
1639 supported encodings and CES converters).
1640 @end enumerate
1642 In case a new encoding doesn't fit the CES/CCS decomposition model or
1643 it is desired to add specialized (non UCS-based) conversion support,
1644 the Newlib iconv library code should be upgraded.
1650 @page
1651 @node The locale support interfaces
1652 @section The locale support interfaces
1654 The Newlib iconv library also has some interface functions (besides the
1655 @code{iconv}, @code{iconv_open} and @code{iconv_close} interfaces) which
1656 are intended for the Locale subsystem. All the locale-related code is
1657 placed in the @emph{lib/iconvnls.c} file.
1660 The following is the description of the locale-related interfaces:
1662 @itemize @bullet
1663 @item
1664 @code{_iconv_nls_open} - opens two iconv descriptors for "CCS ->
1665 wchar_t" and "wchar_t -> CCS" conversions. The normalized CCS name is
1666 passed in the function parameters. The @emph{wchar_t} characters encoding is
1667 either ucs_2_internal or ucs_4_internal depending on size of
1668 @emph{wchar_t}.
1670 @item
1671 @code{_iconv_nls_conv} - the function is similar to the @code{iconv}
1672 function, but if there is no character in the output encoding which
1673 corresponds to the character in the input encoding, the default
1674 conversion isn't performed (the @code{iconv} function sets such output
1675 characters to the @kbd{?} symbol and this is the behavior, which is
1676 specified in SUSv3).
1678 @item
1679 @code{_iconv_nls_get_state} - returns the current encoding's shift state
1680 (the @code{mbstate_t} object).
1682 @item
1683 @code{_iconv_nls_set_state} - sets the current encoding's shift state (the
1684 @code{mbstate_t} object).
1686 @item
1687 @code{_iconv_nls_is_stateful} - checks whether the encoding is stateful
1688 or stateless.
1690 @item
1691 @code{_iconv_nls_get_mb_cur_max} - returns the maximum length (the
1692 maximum bytes number) of the encoding's characters.
1693 @end itemize
1698 @page
1699 @node Contact
1700 @section Contact
1702 The author of the original BSD iconv library (Alexander Chuguev) no longer
1703 supports that code.
1706 Any questions regarding the iconv library may be forwarded to
1707 Artem B. Bityuckiy (dedekind@@oktetlabs.ru or dedekind@@mail.ru) as
1708 well as to the public Newlib mailing list.