@node Encoding conversions
@chapter Encoding conversions (@file{iconv.h})

This chapter describes the Newlib iconv library.
The iconv function declarations are in @file{iconv.h}.
* Function iconv:: Encoding conversion routines
* Introduction to iconv:: Introduction to iconv and encodings
* Supported encodings:: The list of currently supported encodings
* iconv design decisions:: General iconv library design issues
* iconv configuration:: iconv-related configure script options
* Encoding names:: How encodings are named.
* CCS tables:: CCS tables format and 'mktbl.pl' Perl script
* CES converters:: CES converters description
* The encodings description file:: The 'encoding.deps' file and 'mkdeps.pl'
* How to add new encoding:: The steps to add new encoding support
* The locale support interfaces:: Locale-related iconv interfaces
* Contact:: The author contact

@include iconv/lib/iconv.def

@node Introduction to iconv
@section Introduction to iconv

The iconv library converts characters from one encoding to
another. It implements the @code{iconv()}, @code{iconv_open()} and
@code{iconv_close()} calls, which are defined by the Single Unix Specification.
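
As a brief illustration of these three calls, the following sketch converts a NUL-terminated string in a single step. The helper name @code{convert_string} and its buffer policy are assumptions made for this example and are not part of the library; only @code{iconv_open}, @code{iconv} and @code{iconv_close} come from the specification.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert the NUL-terminated string `in` from the
   encoding `from` to the encoding `to`, writing at most outsz bytes
   (including the terminating NUL) to `out`.
   Returns the number of bytes written (without the NUL), or -1 on error. */
static long convert_string(const char *to, const char *from,
                           const char *in, char *out, size_t outsz)
{
    assert(in != NULL && out != NULL);
    if (outsz == 0)
        return -1;

    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return -1;                  /* unsupported encoding pair */

    char *inp = (char *)in;         /* iconv() takes a non-const pointer */
    size_t inleft = strlen(in);
    char *outp = out;
    size_t outleft = outsz - 1;     /* keep room for the terminating NUL */

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return -1;                  /* invalid input or output buffer too small */

    *outp = '\0';
    return (long)(outp - out);
}
```

A caller might convert an ISO 8859-1 string to UTF-8 with @code{convert_string("UTF-8", "ISO-8859-1", text, buf, sizeof buf)}, provided both encodings were enabled when the library was configured.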

In addition to these user-level interfaces, the iconv library also provides
several interfaces which are needed to support the encoding-related
capabilities of the Newlib Locale infrastructure. Since the Locale
infrastructure must also convert various character sets to and from the
@emph{wide character set}, the iconv library shares its capabilities with
the Newlib Locale subsystem. Moreover, the iconv library supports several
features which are only needed for the Locale infrastructure (for example,
the MB_CUR_MAX value).

The Newlib iconv library was created using concepts from another iconv
library implemented by Konstantin Chuguev (version 2.0). The Newlib iconv
library was rewritten from scratch and contains many improvements over
the original iconv library.

Terms like @dfn{encoding} or @dfn{character set} aren't well defined and
are often used with various meanings. The following are the definitions of
terms as they are used in this documentation and in the iconv library:

@itemize @bullet
@item
@dfn{encoding} - a machine representation of characters by means of bits;

@item
@dfn{Character Set} or @dfn{Charset} - just a collection of
characters, i.e. the encoding is the machine representation of the character set;

@item
@dfn{CCS} (@dfn{Coded Character Set}) - a mapping from a character set to a
set of integers (@dfn{character codes});

@item
@dfn{CES} (@dfn{Character Encoding Scheme}) - a mapping from a set of character
codes to a sequence of bytes;
@end itemize

Users usually deal with encodings, for example, KOI8-R, Unicode, UTF-8,
ASCII, etc. Encodings are formed by the following chain of steps:

@enumerate
@item
A user has a set of characters which are specific to his or her language (the character set).

@item
Each character from this set is uniquely numbered, resulting in a CCS.

@item
Each number from the CCS is converted to a sequence of bits or bytes by means
of a CES, forming an encoding. Thus, a CES may be considered as a
function of a CCS which produces an encoding. Note that a CES may be
applied to more than one CCS.
@end enumerate

Thus, an encoding may be considered as one or more CCS plus a CES.

Sometimes there is no CES, and in such cases the encoding is equivalent
to the CCS, e.g. KOI8-R or ASCII.

An example of a more complicated encoding is UTF-8, which is the UCS
(or Unicode) CCS plus the UTF-8 CES.
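
To make the CCS/CES split concrete, the following sketch shows the CES half of UTF-8: it takes a UCS code (the number assigned by the CCS) and produces the byte sequence. The function name @code{utf8_encode} is an assumption made for this illustration; it is not the name of the converter used inside the library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CES step of UTF-8: encode one UCS code as a byte sequence.
   Returns the number of bytes written to out (1 to 4), or 0 if the
   code is a surrogate or lies beyond U+10FFFF. */
static size_t utf8_encode(uint32_t ucs, unsigned char out[4])
{
    assert(out != NULL);
    if (ucs < 0x80) {
        out[0] = (unsigned char)ucs;
        return 1;
    }
    if (ucs < 0x800) {
        out[0] = (unsigned char)(0xC0 | (ucs >> 6));
        out[1] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 2;
    }
    if (ucs < 0x10000) {
        if (ucs >= 0xD800 && ucs <= 0xDFFF)
            return 0;               /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (ucs >> 12));
        out[1] = (unsigned char)(0x80 | ((ucs >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 3;
    }
    if (ucs <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (ucs >> 18));
        out[1] = (unsigned char)(0x80 | ((ucs >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((ucs >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 4;
    }
    return 0;
}
```

The same UCS code fed to a different CES (UCS-2 or UTF-16, for instance) would give a different byte sequence, which is why the CCS and the CES are treated as separate components.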

The following is a brief list of iconv library features:

@itemize @bullet
@item
Generic architecture;
@item
Locale infrastructure support;
@item
Automatic generation of the program code which handles
CES/CCS/Encoding/Names/Aliases dependencies;
@item
The ability to choose size- or speed-optimized CCS tables;
@item
The ability to exclude a lot of unneeded code and data from the linking step.
@end itemize
@node Supported encodings
@section Supported encodings
@findex ucs_2_internal
@findex ucs_4_internal

The following is the list of currently supported encodings. The first column
is the encoding name, the second column is the list of aliases,
the third column gives the names of its CES and CCS components, and the fourth
column is a short description.

@multitable @columnfractions .20 .26 .24 .30
csbig5, big_five, bigfive, cn_big5, cp950
table_pcs / big5, us_ascii
The encoding for Traditional Chinese.
ibm775, cspc775baltic
The updated version of CP 437 that supports the Baltic languages.
ibm850, 850, cspc850multilingual
IBM 850 - the updated version of CP 437 where several Latin 1 characters have been
added instead of some less-often used characters like the line-drawing
ibm852, 852, cspcp852
IBM 852 - the updated version of CP 437 where several Latin 2 characters have been added
instead of some less-often used characters like the line-drawing and the Greek ones.
ibm855, 855, csibm855
IBM 855 - the updated version of CP 437 that supports Cyrillic.
866, IBM866, CSIBM866
IBM 866 - the updated version of CP 855 which more closely follows the logical Russian
alphabet ordering of the alternative variant that is preferred by many Russian users.
euc / jis_x0208_1990, jis_x0201_1976, jis_x0212_1990
EUC-JP - The EUC for Japanese.
EUC-KR - The EUC for Korean.
euc / cns11643_plane1, cns11643_plane2, cns11643_plane14
EUC-TW - The EUC for Traditional Chinese.
iso8859_1, iso88591, iso_8859_1:1987, iso_ir_100, latin1, l1, ibm819, cp819, csisolatin1
ISO 8859-1:1987 - Latin 1, West European.
iso_8859_10:1992, iso_ir_157, iso885910, latin6, l6, csisolatin6, iso8859_10
ISO 8859-10:1992 - Latin 6, Nordic.
iso8859_11, iso885911
iso_8859_13:1998, iso8859_13, iso885913
ISO 8859-13:1998 - Latin 7, Baltic Rim.
iso_8859_14:1998, iso885914, iso8859_14
ISO 8859-14:1998 - Latin 8, Celtic.
iso885915, iso_8859_15:1998, iso8859_15,
ISO 8859-15:1998 - Latin 9, West Europe, successor of Latin 1.
iso8859_2, iso88592, iso_8859_2:1987, iso_ir_101, latin2, l2, csisolatin2
ISO 8859-2:1987 - Latin 2, East European.
iso_8859_3:1988, iso_ir_109, iso8859_3, latin3, l3, csisolatin3, iso88593
ISO 8859-3:1988 - Latin 3, South European.
iso8859_4, iso88594, iso_8859_4:1988, iso_ir_110, latin4, l4, csisolatin4
ISO 8859-4:1988 - Latin 4, North European.
iso8859_5, iso88595, iso_8859_5:1988, iso_ir_144, cyrillic, csisolatincyrillic
ISO 8859-5:1988 - Cyrillic.
iso_8859_6:1987, iso_ir_127, iso8859_6, ecma_114, asmo_708, arabic, csisolatinarabic, iso88596
ISO 8859-6:1987 - Arabic.
iso_8859_7:1987, iso_ir_126, iso8859_7, elot_928, ecma_118, greek, greek8, csisolatingreek, iso88597
ISO 8859-7:1987 - Greek.
iso_8859_8:1988, iso_ir_138, iso8859_8, hebrew, csisolatinhebrew, iso88598
ISO 8859-8:1988 - Hebrew.
iso_8859_9:1989, iso_ir_148, iso8859_9, latin5, l5, csisolatin5, iso88599
ISO 8859-9:1989 - Latin 5, Turkish.
ecma_cyrillic, koi8_e, koi8e, csiso111ecmacyrillic
ISO IR 111/ECMA Cyrillic.
The obsolete Ukrainian.
ucs2, iso_10646_ucs_2, iso10646_ucs_2, iso_10646_ucs2, iso10646_ucs2, iso10646ucs2, csUnicode
ISO-10646-UCS-2. Big Endian, NBSP is always interpreted as NBSP (BOM isn't supported).
ucs2_internal, ucs_2internal, ucs2internal
ucs_2_internal / (UCS)
ISO-10646-UCS-2 in system byte order.
NBSP is always interpreted as NBSP (BOM isn't supported).
Big Endian version of ISO-10646-UCS-2 (in fact, equivalent to ucs_2).
Big Endian, NBSP is always interpreted as NBSP (BOM isn't supported).
Little Endian version of ISO-10646-UCS-2.
Little Endian, NBSP is always interpreted as NBSP (BOM isn't supported).
ucs4, iso_10646_ucs_4, iso10646_ucs_4, iso_10646_ucs4, iso10646_ucs4, iso10646ucs4
ISO-10646-UCS-4. Big Endian, NBSP is always interpreted as NBSP (BOM isn't supported).
ucs4_internal, ucs_4internal, ucs4internal
ucs_4_internal / (UCS)
ISO-10646-UCS-4 in system byte order.
NBSP is always interpreted as NBSP (BOM isn't supported).
Big Endian version of ISO-10646-UCS-4 (in fact, equivalent to ucs_4).
Big Endian, NBSP is always interpreted as NBSP (BOM isn't supported).
Little Endian version of ISO-10646-UCS-4.
Little Endian, NBSP is always interpreted as NBSP (BOM isn't supported).
ansi_x3.4_1968, ansi_x3.4_1986, iso_646.irv:1991, ascii, iso646_us, us, ibm367, cp367, csascii
RFC 2781 UTF-16. The very first NBSP code in the stream is interpreted as BOM.
Big Endian version of RFC 2781 UTF-16.
NBSP is always interpreted as NBSP (BOM isn't supported).
Little Endian version of RFC 2781 UTF-16.
NBSP is always interpreted as NBSP (BOM isn't supported).
Win-1258 - Vietnamese.
@node iconv design decisions
@section iconv design decisions
@findex CES converter
@findex Speed-optimized tables
@findex Size-optimized tables

The first iconv library design issue arises when considering the
following two design approaches:

@enumerate
@item
Have modules which implement conversion from encoding A to encoding B
and vice versa, i.e., one conversion module relates to any two encodings.

@item
Have modules which implement conversion from encoding A to the fixed
encoding C and vice versa, i.e., one conversion module relates to
one encoding A and one fixed encoding C. In this case, to convert from
encoding A to encoding B, two modules are needed (in order to convert
from A to C and then from C to B).
@end enumerate

There is an obvious tradeoff between generality/flexibility and
efficiency: the first method is more efficient since it converts
directly; however, it is less flexible since a distinct module is needed
for each encoding pair.

The Newlib iconv model uses the second method and always converts through the
32-bit UCS, but its design also allows one to write specialized conversion
modules if the conversion speed is critical.
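
The second approach can be sketched with two half-modules joined by a generic driver. Everything below is a toy written for this illustration: the function names, the @code{UCS_INVALID} marker and the reduced coverage (ASCII plus the basic Russian letters) are assumptions and do not reflect the library's internal interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UCS_INVALID 0xFFFFFFFFu   /* hypothetical "no mapping" marker */

/* "to UCS" half of a toy ISO 8859-5 module (ASCII and 0xB0-0xEF only). */
static uint32_t iso8859_5_to_ucs(unsigned char c)
{
    if (c < 0x80)
        return c;
    if (c >= 0xB0 && c <= 0xEF)
        return 0x0410u + (uint32_t)(c - 0xB0);
    return UCS_INVALID;
}

/* "from UCS" half of a toy CP866 module; returns -1 if not covered. */
static int cp866_from_ucs(uint32_t ucs)
{
    if (ucs < 0x80)
        return (int)ucs;
    if (ucs >= 0x0410 && ucs <= 0x043F)
        return (int)(0x80 + (ucs - 0x0410));
    if (ucs >= 0x0440 && ucs <= 0x044F)
        return (int)(0xE0 + (ucs - 0x0440));
    return -1;
}

typedef uint32_t (*to_ucs_fn)(unsigned char);
typedef int (*from_ucs_fn)(uint32_t);

/* Generic driver: A -> UCS -> B for single-byte encodings.
   Returns the number of bytes converted, or -1 on the first
   character that one of the two halves cannot map. */
static long convert_via_ucs(to_ucs_fn to_ucs, from_ucs_fn from_ucs,
                            const unsigned char *in, size_t len,
                            unsigned char *out)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < len; i++) {
        uint32_t ucs = to_ucs(in[i]);
        if (ucs == UCS_INVALID)
            return -1;
        int c = from_ucs(ucs);
        if (c < 0)
            return -1;
        out[i] = (unsigned char)c;
    }
    return (long)len;
}
```

With this arrangement, supporting one more encoding means writing one "to UCS" and one "from UCS" half rather than a module for every pair, at the cost of two lookups per character.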

The second design issue is how to break down (decompose) encodings.
The Newlib iconv library uses the fact that any encoding may be
considered as one or more CCS plus a CES. It therefore decomposes its
conversion modules into a @dfn{CES converter} plus one or more @dfn{CCS
tables}. CCS tables map CCS to UCS and vice versa; the CES converters
map CCS to the encoding and vice versa.

As an example, let's consider the conversion between the big5 encoding and
the EUC-TW encoding. The big5 encoding may be decomposed into the ASCII and BIG5
CCS-es plus the BIG5 CES. EUC-TW may be decomposed into the CNS11643_PLANE1, CNS11643_PLANE2,
and CNS11643_PLANE14 CCS-es plus the EUC CES.

The euc_tw -> big5 conversion is performed as follows:

@enumerate
@item
The EUC converter transforms the EUC-TW encoding to the corresponding CCS-es
(CNS11643_PLANE1, CNS11643_PLANE2 and CNS11643_PLANE14);

@item
The obtained CCS codes are transformed to UCS codes using the CNS11643_PLANE1,
CNS11643_PLANE2 and CNS11643_PLANE14 CCS tables;

@item
The resulting UCS codes are transformed to ASCII and BIG5 codes using
the corresponding CCS tables;

@item
The obtained CCS codes are transformed to the big5 encoding using the
BIG5 CES converter.
@end enumerate

Analogously, the backward conversion is performed as follows:

@enumerate
@item
The BIG5 converter transforms the big5 encoding to the corresponding CCS-es
(the ASCII and BIG5 CCS-es);

@item
The obtained CCS codes are transformed to UCS codes using the ASCII and BIG5 CCS tables;

@item
The resulting UCS codes are transformed to CNS11643_PLANE1, CNS11643_PLANE2
and CNS11643_PLANE14 codes using the corresponding CCS tables;

@item
The obtained CCS codes are transformed to the EUC-TW encoding using the
EUC CES converter.
@end enumerate

Note that the above is just an example; the real names of the CES converters
and the CCS tables implemented in the Newlib iconv are slightly different.

The third design issue also relates to flexibility. It isn't
desirable to always link all the CES converters and the CCS tables into the library;
instead, we want to be able to load the needed converters and tables
dynamically on demand. This isn't a problem on "big" machines such as
a PC, but it may be very problematic on "small" embedded systems.

Since the CCS tables are just data, it is possible to load them
dynamically from external files. The CES converters, on the other hand,
are algorithms implemented as code, so a dynamic library loading
capability would be required.

Apart from possible restrictions imposed by embedded systems (small
RAM, for example), Newlib itself has no dynamic library support and
therefore all the CES converters which will ever be used must be linked into
the library. However, dynamic loading of the CCS tables is possible and is
implemented in the Newlib iconv library. It may be enabled via the Newlib
configure script options.

The next design issue is fine-tuning the iconv library
configuration. One important ability is for iconv to not link all its
converters and tables (if dynamic loading is not enabled) but instead
enable only those encodings which are specified at configuration
time (see the section about the configure script options).

In addition, the Newlib iconv library configure options distinguish between
conversion directions. This means that not only the supported encodings are
selectable, but the conversion direction as well. For example, if a user wants
a configuration which allows conversions from UTF-8 to UTF-16 and
doesn't plan to use the "UTF-16 to UTF-8" conversions, he or she can enable only
this conversion direction (i.e., no "UTF-16 -> UTF-8"-related code will
be included), thus saving some memory (note that this technique allows
excluding one half of a CCS table from linking, which may be quite large).

One more design aspect is the speed- and size-optimized tables. Users can
select between them using configure script options. The
speed-optimized CCS tables are the same as the size-optimized ones in
the case of 8-bit CCS (e.g., KOI8-R), but for 16-bit CCS-es the size-optimized
CCS tables may be 1.5 to 2 times smaller than the speed-optimized ones. On the
other hand, conversion with speed-optimized tables is several times faster.

It's worth stressing that support for a new encoding can't be
dynamically added to an already compiled Newlib library, even if it
needs only an additional CCS table and iconv is configured to use
external files with CCS tables (this isn't a fundamental restriction,
and the ability to add new table-based encoding support dynamically,
just by adding a new .cct file, could easily be added).

Theoretically, the compiled-in CCS tables should be more appropriate for
embedded systems than dynamically loaded CCS tables. This is because the
compiled-in tables are read-only and can be placed in ROM,
whereas dynamic loading requires RAM. Moreover, in the current iconv
implementation, a distinct copy of the dynamic CCS file is loaded for each
opened iconv descriptor, even for the same encoding.
This means, for example, that if two iconv descriptors for
"KOI8-R -> UCS-4BE" and "KOI8-R -> UTF-16BE" are opened, two copies of
the koi8-r .cct file will be loaded (actually, iconv loads only the needed part
of these files). On the other hand, with compiled-in CCS tables there will
always be only one copy.
@node iconv configuration
@section iconv configuration
@findex iconv configuration
@findex --enable-newlib-iconv-encodings
@findex --enable-newlib-iconv-from-encodings
@findex --enable-newlib-iconv-to-encodings
@findex --enable-newlib-iconv-external-ccs

To enable an encoding, the @option{--enable-newlib-iconv-encodings} configure
script option should be used. This option accepts a comma-separated list
of @emph{encodings} that should be enabled. The option enables each encoding in both
("to" and "from") directions.

The @option{--enable-newlib-iconv-from-encodings} configure script option enables
"from" support for each encoding that was passed to it.

The @option{--enable-newlib-iconv-to-encodings} configure script option enables
"to" support for each encoding that was passed to it.

Example: if a user plans only the "KOI8-R -> UTF-8", "UTF-8 -> ISO-8859-5" and
"KOI8-R -> UCS-2" conversions, the optimal way (minimal iconv
code and data will be linked) is to configure Newlib with the following
options:

@code{--enable-newlib-iconv-encodings=UTF-8
--enable-newlib-iconv-from-encodings=KOI8-R
--enable-newlib-iconv-to-encodings=UCS-2,ISO-8859-5}

or, equivalently:

@code{--enable-newlib-iconv-from-encodings=KOI8-R,UTF-8
--enable-newlib-iconv-to-encodings=UCS-2,ISO-8859-5,UTF-8}

The user may also just use the

@code{--enable-newlib-iconv-encodings=KOI8-R,ISO-8859-5,UTF-8,UCS-2}

configure script option, but this is less optimal since some
unneeded data and code will be linked.

The @option{--enable-newlib-iconv-external-ccs} option enables iconv's
ability to work with external CCS files.

The @option{--enable-target-optspace} Newlib configure script option also affects
the iconv library. If this option is present, the library uses the
size-optimized CCS tables. This means that only the size-optimized CCS
tables will be linked or, if the
@option{--enable-newlib-iconv-external-ccs} configure script option was used,
the iconv library will load the size-optimized tables. If the
@option{--enable-target-optspace} configure script option is disabled,
the speed-optimized CCS tables are used.

Note: .cct files are searched for by @code{iconv_open} in the $NLSPATH/iconv_data/ directory.
Thus, the NLSPATH environment variable should be set.
@section Encoding names
@findex encoding name
@findex encoding alias
@findex normalized name

Each encoding has one @dfn{name} and a number of @dfn{aliases}. When
a user works with the iconv library (i.e., when the @code{iconv_open} call
is used), either the name or any of the aliases may be used. The same applies
when encoding names are used in configure script options.

Names and aliases may be specified in any case (small or capital
letters) and the @kbd{-} symbol is equivalent to the @kbd{_} symbol.

Internally the Newlib iconv library always converts aliases to names. It
also converts names and aliases to the @dfn{normalized} form, which means
that all capital letters are converted to small letters and the @kbd{-}
symbols are converted to @kbd{_} symbols.
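
The normalization rule can be sketched in a few lines of C. The function name @code{normalize_encoding_name} and its error convention are assumptions made for this example; the routine the library actually uses may differ, and the alias-to-name lookup is not shown.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: write the normalized form of `name` to `out`
   (capital letters become small letters, '-' becomes '_').
   Returns 0 on success or -1 if `out` is too small. */
static int normalize_encoding_name(const char *name, char *out, size_t outsz)
{
    assert(name != NULL && out != NULL);
    size_t len = strlen(name);
    if (len + 1 > outsz)
        return -1;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)name[i];
        out[i] = (c == '-') ? '_' : (char)tolower(c);
    }
    out[len] = '\0';
    return 0;
}
```

Under this rule "KOI8-R", "koi8-r" and "Koi8_R" all normalize to the same string, so any of them can be passed to @code{iconv_open} or to the configure script options.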
@findex Size-optimized CCS table
@findex Speed-optimized CCS table
@findex mktbl.pl Perl script
@findex The CCT tables source files
@findex CCS source files

The iconv library stores files with CCS tables in the @emph{ccs/}
subdirectory. The CCS tables for any CCS may be kept in two forms: in binary form
(@dfn{.cct files}, see the @emph{ccs/binary/} subdirectory) and in the form
of compilable .c source files. The .cct files are only used when the
@option{--enable-newlib-iconv-external-ccs} configure script option is enabled.
The .c files are linked to the Newlib library if the corresponding

As stated earlier, the Newlib iconv library performs all
conversions through the 32-bit UCS, but the codes which are used
in most CCS-es fit into the first 16-bit subset of the 32-bit UCS set.
Thus, in order to make the CCS tables more compact, the 16-bit UCS-2 is
used instead of the 32-bit UCS-4.

CCS tables may be 8 or 16 bits wide. 8-bit CCS tables map 8-bit CCS to
16-bit UCS-2 and vice versa, while 16-bit CCS tables map
16-bit CCS to 16-bit UCS-2 and vice versa.
8-bit tables are small, while 16-bit tables may be quite large.
Because of this, 16-bit CCS tables may be
either speed- or size-optimized. Size-optimized CCS tables are
smaller than speed-optimized ones, but the conversion process is
slower if the size-optimized CCS tables are used. 8-bit CCS tables have only
the speed-optimized variant.

Each CCS table (both speed- and size-optimized) consists of
@dfn{from_ucs} and @dfn{to_ucs} subtables. The "from_ucs" subtable maps
UCS-2 codes to CCS codes, while the "to_ucs" subtable maps CCS codes to
UCS-2 codes.

Almost all 16-bit CCS tables contain fewer than 0xFFFF codes, and
many gaps exist.
@subsection Speed-optimized tables format

In the case of 8-bit speed-optimized CCS tables, the "to_ucs" subtable format is
trivial: it is just an array of 256 16-bit UCS codes. Therefore, the
UCS-2 code @emph{Y} corresponding to a CCS code @emph{X} is calculated
as @emph{Y = to_ucs[X]}.

The simplest way to create the "from_ucs" table or the
16-bit "to_ucs" table would be to use a huge 16-bit array, as in the case
of the 8-bit "to_ucs" table. But almost all 16-bit CCS tables contain
fewer than 0xFFFF code mappings, and this fact may be exploited to reduce
the size of the CCS tables.

This section describes the "UCS-2 -> CCS" 8-bit CCS table format. The
16-bit "CCS -> UCS-2" CCS table format is the same, except for the mapping
direction and the number of CCS bits.

In the case of the 8-bit speed-optimized table, the "from_ucs" subtable
corresponds to the "from_ucs" array and has the following layout:

@example
-------------------------------------
0xFF mapping (2 bytes) (only for
8-bit table)
-------------------------------------
Heading block
-------------------------------------
Block 1
-------------------------------------
Block 2
-------------------------------------
 ...
-------------------------------------
Block N
-------------------------------------
@end example

The 0x0000-0xFFFF 16-bit code range is divided into 256 code subranges. Each
subrange is represented by a 256-element @dfn{block} (256 1-byte
elements, or 256 2-byte elements in the case of a 16-bit CCS table) whose
elements are the CCS codes of this subrange.
If the "UCS-2 -> CCS" mapping has large enough gaps, some blocks will be
absent and there will be fewer than 256 blocks.

Element number @emph{m} of @dfn{the heading block} (which contains
256 2-byte elements) corresponds to the @emph{m}-th 256-element subrange.
If the subrange contains some codes, the @emph{m}-th element of
the heading block contains the offset of the corresponding block in the
"from_ucs" array. If there are no codes in the subrange, the heading
block element contains 0xFFFF.

If there are gaps in a block, the corresponding block elements have
the 0xFF value. If the 0xFF code is present in the CCS, its mapping
is defined in the first 2-byte element of the "from_ucs" array.

With this table format, the algorithm for finding the CCS code
@emph{X} which corresponds to the UCS-2 code @emph{Y} is as follows.

@enumerate
@item If @emph{Y} is equal to the value of the first 2-byte element
of the "from_ucs" array, @emph{X} is 0xFF. Else, continue to search.

@item Calculate the block number: @emph{BlkN = (Y & 0xFF00) >> 8}.

@item If the heading block element with number @emph{BlkN} is 0xFFFF, there
is no corresponding CCS code (error, wrong input data). Else, fetch from that
element the "from_ucs" array index @emph{BlkIdx} of the @emph{BlkN}-th block.

@item Calculate the offset of the @emph{X} code in its block:
@emph{Xindex = Y & 0xFF}

@item If the value of the @emph{Xindex}-th element of the block (which is
@emph{from_ucs[BlkIdx+Xindex]}) is 0xFF, there is no corresponding
CCS code (error, wrong input data). Else, @emph{X = from_ucs[BlkIdx+Xindex]}.
@end enumerate
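
The lookup steps above can be sketched as follows. This is a simplified in-memory model, not the actual table layout used by the library: the struct @code{from_ucs8}, its field names, the @code{-1} error value and the toy KOI8-R subset are all assumptions made for the example, and block offsets are taken relative to a separate @code{blocks} array for readability.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NO_BLOCK 0xFFFFu   /* heading element: subrange has no codes */
#define NO_CODE  0xFFu     /* block element: gap */

/* Simplified model of an 8-bit speed-optimized "from_ucs" subtable. */
struct from_ucs8 {
    uint16_t ff_mapping;       /* UCS-2 code which maps to CCS code 0xFF */
    uint16_t heading[256];     /* offset of each block, or NO_BLOCK */
    uint8_t  blocks[2 * 256];  /* room for two blocks in this toy table */
};

/* Returns the CCS code X for the UCS-2 code y, or -1 if there is none. */
static int from_ucs8_lookup(const struct from_ucs8 *t, uint16_t y)
{
    assert(t != NULL);
    if (y == t->ff_mapping)
        return 0xFF;                           /* step 1 */
    unsigned blkn = (y & 0xFF00u) >> 8;        /* step 2 */
    uint16_t blkidx = t->heading[blkn];        /* step 3 */
    if (blkidx == NO_BLOCK)
        return -1;
    unsigned xindex = y & 0xFFu;               /* step 4 */
    uint8_t x = t->blocks[blkidx + xindex];    /* step 5 */
    return x == NO_CODE ? -1 : (int)x;
}

/* Fill a toy table: ASCII plus three KOI8-R Cyrillic letters. */
static void toy_koi8r_init(struct from_ucs8 *t)
{
    assert(t != NULL);
    memset(t->heading, 0xFF, sizeof t->heading);   /* all NO_BLOCK */
    memset(t->blocks, NO_CODE, sizeof t->blocks);  /* all gaps */
    t->ff_mapping = 0x042A;                        /* KOI8-R 0xFF */
    t->heading[0x00] = 0;                          /* U+0000..U+00FF */
    for (unsigned c = 0; c < 0x80; c++)
        t->blocks[c] = (uint8_t)c;
    t->heading[0x04] = 256;                        /* U+0400..U+04FF */
    t->blocks[256 + 0x10] = 0xE1;                  /* U+0410 */
    t->blocks[256 + 0x30] = 0xC1;                  /* U+0430 */
}
```

Each lookup costs a constant number of array accesses regardless of table size, which is what makes this format faster than the size-optimized one.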
@subsection Size-optimized tables format

As stated above, size-optimized tables exist only for 16-bit CCS-es.
This is because the difference between the speed-optimized
and the size-optimized table sizes is too small in the case of 8-bit CCS-es.

The formats of the "to_ucs" and "from_ucs" subtables are the same in the case of
size-optimized tables.

This section describes the format of the "UCS-2 -> CCS" size-optimized
CCS table. The format of the "CCS -> UCS-2" table is the same.

The idea of the size-optimized tables is to split the UCS-2 codes
("from" codes) into @dfn{ranges} (a @dfn{range} is a run of consecutive UCS-2 codes).
Then CCS codes ("to" codes) are stored only for the codes from these
ranges. Distinct "from" codes which belong to no range (@dfn{unranged codes}) are stored
together with the corresponding "to" codes.

The following is the layout of the size-optimized table array:

@example
-------------------------------------
Ranges number (2 bytes)
-------------------------------------
Unranged codes number (2 bytes)
-------------------------------------
Unranged codes array index (2 bytes)
-------------------------------------
Ranges indexes (triads)
-------------------------------------
Ranges
-------------------------------------
Unranged codes array
-------------------------------------
@end example

The @dfn{Ranges indexes} section helps to find
the offset of the needed range in the @emph{size_arr} array and has
the following format (triads):

the first code in the range, the last code in the range, the range offset.

The array of these triads is sorted by the first element, therefore it is
possible to quickly find the needed range index.

Each range has a corresponding sub-array containing the "to" codes. These
sub-arrays are stored in the place marked as "Ranges" in the layout
above.

The "Unranged codes array" contains pairs ("from" code, "to" code) for
each unranged code. The array of these pairs is sorted by "from" code
values, therefore it is possible to find the needed pair quickly.

Note that each range requires 6 bytes to form its index. If, for
example, there are two ranges (1 - 5 and 9 - 10) and one unranged code
(7), 12 bytes are needed for the two range indexes and 4 bytes for the unranged
code (16 in total). But it is better to join both ranges as 1 - 10 and
mark codes 6 and 8 as absent. In this case, only 6 additional bytes for the
range index and 4 bytes to mark codes 6 and 8 as absent are needed
(10 bytes in total). This optimization is done in the size-optimized tables.
Thus, ranges may contain small gaps. The absent codes in ranges are marked
as 0xFFFF.

Note that a pair of consecutive "from" codes is stored by means of unranged
codes, since the number of 2-byte elements needed to form a range of two codes
(a triad plus two "to" codes) is greater than the number needed to store
two unranged codes (5 against 4).

The algorithm for finding the CCS code
@emph{X} which corresponds to the UCS-2 code @emph{Y} (input) in the "UCS-2 ->
CCS" size-optimized table is as follows.

@enumerate
@item Try to find the corresponding triad in the "Ranges indexes"
section. Since the array is sorted, a binary search finds it quickly
(divide by 2, compare, etc).

@item If the triad is found, fetch the @emph{X} code from the corresponding
range array. If it is 0xFFFF, return an error.

@item If there is no corresponding triad, search for the @emph{X} code among the
sorted unranged codes. Return an error if nothing was found.
@end enumerate
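
A minimal sketch of this search is shown below. It assumes that the whole table is an array of 16-bit elements and that the range offsets and the unranged codes array index are element indexes into that same array; the names @code{size_lookup} and @code{toy_size_arr}, the @code{-1} error value and the toy contents are invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ABSENT 0xFFFFu   /* marks an absent code inside a range */

/* Toy table: one range (codes 1-10, with 6 and 8 absent) and two
   unranged codes, laid out as described in the text. */
static const uint16_t toy_size_arr[] = {
    1,                  /* [0] ranges number */
    2,                  /* [1] unranged codes number */
    16,                 /* [2] unranged codes array index */
    1, 10, 6,           /* [3..5] triad: first code, last code, range offset */
    0x101, 0x102, 0x103, 0x104, 0x105,   /* [6..10]  "to" codes for 1..5 */
    ABSENT, 0x107, ABSENT, 0x109, 0x10A, /* [11..15] "to" codes for 6..10 */
    0x20, 0x2000,       /* [16..17] unranged pair: "from", "to" */
    0x30, 0x3000        /* [18..19] unranged pair: "from", "to" */
};

/* Returns the "to" code for the "from" code y, or -1 if there is none. */
static long size_lookup(const uint16_t *t, uint16_t y)
{
    assert(t != NULL);
    size_t nranges = t[0], nunranged = t[1], uidx = t[2];
    const uint16_t *triads = t + 3;

    /* Step 1: binary search among the sorted triads. */
    size_t lo = 0, hi = nranges;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const uint16_t *tr = triads + 3 * mid;
        if (y < tr[0]) {
            hi = mid;
        } else if (y > tr[1]) {
            lo = mid + 1;
        } else {
            /* Step 2: fetch the code from the range sub-array. */
            uint16_t x = t[tr[2] + (size_t)(y - tr[0])];
            return x == ABSENT ? -1 : (long)x;
        }
    }

    /* Step 3: binary search among the sorted unranged pairs. */
    lo = 0;
    hi = nunranged;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const uint16_t *pair = t + uidx + 2 * mid;
        if (y < pair[0])
            hi = mid;
        else if (y > pair[0])
            lo = mid + 1;
        else
            return pair[1];
    }
    return -1;
}
```

Both searches are logarithmic in the number of ranges or unranged codes, which accounts for the speed difference compared with the constant-time lookup in the speed-optimized format.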
@subsection .cct and .c CCS Table files

The .c source files for 8-bit CCS tables have "to_ucs" and "from_ucs"
speed-optimized tables. The .c source files for 16-bit CCS tables have
"to_ucs_speed", "to_ucs_size", "from_ucs_speed" and "from_ucs_size"
tables.

When .c files are compiled and used, all the 16-bit and 32-bit values
have the native endian format (Big Endian for BE systems and Little
Endian for LE systems) since they are compiled for the system before
they are used.

In the case of .cct files, which are intended for dynamic CCS table
loading, the CCS tables are stored in either LE or BE format. Since the
.cct files are generated by the 'mktbl.pl' Perl script, it is possible
to choose the endianness of the tables. It is also possible to store two
copies (both LE and BE) of the CCS tables in one .cct file. The default
.cct files (which come with the Newlib sources) have both LE and BE CCS
tables. The Newlib iconv library automatically chooses the needed CCS tables
(with the appropriate endianness).

Note that the .cct files are only used when the
@option{--enable-newlib-iconv-external-ccs} option is used.
@subsection The 'mktbl.pl' Perl script

The 'mktbl.pl' script is intended to generate .cct and .c CCS table
files from the @dfn{CCS source files}.

The CCS source files are just text files which have one or more columns
with the CCS <-> UCS-2 code mapping. To see an example of a CCS table
source file, follow one of the URLs given below.

The following table describes where the source files for the CCS table files
provided by the Newlib distribution are located.

@multitable @columnfractions .25 .75
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/CNS11643.TXT
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/
http://www.unicode.org/Public/MAPPINGS/ISO8859/
http://crl.nmsu.edu/~mleisher/csets/ISOIR111.TXT
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0201.TXT
http://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8-R.TXT
http://crl.nmsu.edu/~mleisher/csets/KOI8RU.TXT
http://crl.nmsu.edu/~mleisher/csets/KOI8U.TXT
http://crl.nmsu.edu/~mleisher/csets/KOI8UNI.TXT
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSX1001.TXT
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/

The CCS source files aren't distributed with Newlib because of license
restrictions in most Unicode.org's files.

The following are the 'mktbl.pl' options which were used to generate the .cct
files. Note that to generate the CCS table source files, the @option{-s} option
should be added.

@enumerate
@item For the iso_8859_10.cct, iso_8859_13.cct, iso_8859_14.cct, iso_8859_15.cct,
iso_8859_1.cct, iso_8859_2.cct, iso_8859_3.cct, iso_8859_4.cct,
iso_8859_5.cct, iso_8859_6.cct, iso_8859_7.cct, iso_8859_8.cct,
iso_8859_9.cct, iso_8859_11.cct, win_1250.cct, win_1252.cct, win_1254.cct,
win_1256.cct, win_1258.cct, win_1251.cct,
win_1253.cct, win_1255.cct, win_1257.cct,
koi8_r.cct, koi8_ru.cct, koi8_u.cct, koi8_uni.cct, iso_ir_111.cct,
big5.cct, cp775.cct, cp850.cct, cp852.cct, cp855.cct, cp866.cct, cns11643.cct
files, only the @option{-i <SRC_FILE_NAME>} option was used.

@item To generate the jis_x0208_1990.cct file, the
@option{-i jis_x0208_1990.txt -x 2 -y 3} options were used.

@item To generate the cns11643_plane1.cct file, the
@option{-i cns11643.txt -p1 -N cns11643_plane1 -o cns11643_plane1.cct}
options were used.

@item To generate the cns11643_plane2.cct file, the
@option{-i cns11643.txt -p2 -N cns11643_plane2 -o cns11643_plane2.cct}
options were used.

@item To generate the cns11643_plane14.cct file, the
@option{-i cns11643.txt -p0xE -N cns11643_plane14 -o cns11643_plane14.cct}
options were used.
@end enumerate

For more information about the 'mktbl.pl' options, see the 'mktbl.pl -h' output.
1386 It is assumed that CCS codes are
16 or less bits wide. If there are wider CCS codes
1387 in the CCS source file, the bits which are higher then
16 defines plane (see the
1388 cns11643.txt CCS source file).
Sometimes it is impossible to map some CCS codes to the 16-bit UCS if, for example,
several different CCS codes are mapped to one UCS-2 code or one CCS code is mapped to
a pair of UCS-2 codes. In these cases, such CCS codes (@dfn{lost
codes}) aren't simply rejected; instead, they are mapped to the default
UCS-2 code (which is currently the code of the @kbd{?} character).
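The handling of lost codes can be sketched in C as a table lookup with a default. This is only an illustration under stated assumptions: the flat 256-entry array, the @code{LOST_CODE} marker and the function name @code{ccs8_to_ucs2} are hypothetical and do not reflect the real .cct table layout or the code that 'mktbl.pl' generates.

```c
#include <stdint.h>

#define DEFAULT_UCS2 0x003Fu  /* UCS-2 code of the '?' character */
#define LOST_CODE    0xFFFFu  /* hypothetical marker for a lost code */

/* Hypothetical lookup for an 8-bit CCS: table[i] holds the UCS-2 code
   for CCS code i, or LOST_CODE when no one-to-one mapping exists.
   Lost codes are mapped to the default code instead of being rejected. */
static uint16_t ccs8_to_ucs2(const uint16_t table[256], uint8_t ccs)
{
    uint16_t ucs = table[ccs];

    return ucs == LOST_CODE ? DEFAULT_UCS2 : ucs;
}
```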
@node CES converters
@section CES converters
Similar to the CCS tables, CES converters are also split into "from UCS"
and "to UCS" parts. Depending on the iconv library configuration, these
parts are enabled or disabled.
The following is the list of CES converters which are currently present
in the Newlib iconv library.
@emph{euc} - supports the @emph{euc_jp}, @emph{euc_kr} and @emph{euc_tw}
encodings. The @emph{euc} CES converter uses the @emph{table} and the
@emph{us_ascii} CES converters.
@emph{table} - this CES converter corresponds to "null" and simply performs
table-based conversion using 8- and 16-bit CCS tables. This converter
is also used by any other CES converter which needs CCS table-based
conversions. The @emph{table} converter is also responsible for loading .cct files.
@emph{table_pcs} - this is a wrapper over the @emph{table} converter,
intended for 16-bit encodings which also use the @dfn{Portable
Character Set} (@dfn{PCS}), which is the same as @emph{US-ASCII}.
This means that if the first byte of the CCS code is in the range [0x00-0x7f],
it is a 7-bit PCS code. Otherwise, it is a 16-bit CCS code. Of course,
the 16-bit codes must not contain bytes in the range [0x00-0x7f].
The @emph{big5} encoding uses the @emph{table_pcs} CES converter, and the
@emph{table_pcs} CES converter depends on the @emph{table} CES converter.
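The first-byte rule used by @emph{table_pcs} can be sketched in C as follows. The function @code{table_pcs_next} is a hypothetical helper written for this illustration, not the Newlib converter; it only decides whether the next code is a 7-bit PCS code or a 16-bit CCS code and assumes that the first byte of a 16-bit code is the high-order byte.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not the Newlib converter. Reads one code from
   buf and stores it in *code. Returns the number of bytes consumed
   (1 for a 7-bit PCS code, 2 for a 16-bit CCS code), or 0 if the
   input is empty or ends in the middle of a 16-bit code. */
static size_t table_pcs_next(const unsigned char *buf, size_t len,
                             uint16_t *code)
{
    if (len == 0)
        return 0;
    if (buf[0] <= 0x7f) {       /* first byte in [0x00-0x7f]: PCS code */
        *code = buf[0];
        return 1;
    }
    if (len < 2)
        return 0;               /* truncated 16-bit code */
    *code = (uint16_t)((buf[0] << 8) | buf[1]);
    return 2;
}
```

The resulting 16-bit code would then be handed to the @emph{table} converter for the actual CCS lookup, while a PCS code needs no table at all.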
@emph{ucs_2} - intended to support the @emph{ucs_2}, @emph{ucs_2be} and
@emph{ucs_2le} encodings.
@emph{ucs_4} - intended to support the @emph{ucs_4}, @emph{ucs_4be} and
@emph{ucs_4le} encodings.
@emph{ucs_2_internal} - intended to support the @emph{ucs_2_internal} encoding.
@emph{ucs_4_internal} - intended to support the @emph{ucs_4_internal} encoding.
@emph{us_ascii} - intended to support the @emph{us_ascii} encoding. In
principle, the most natural way to support the @emph{us_ascii} encoding
is to define the @emph{us_ascii} CCS and use the @emph{table} CES
converter. But for optimization purposes, a specialized
@emph{us_ascii} CES converter was created.
@emph{utf_16} - intended to support the @emph{utf_16}, @emph{utf_16be} and
@emph{utf_16le} encodings.
@emph{utf_8} - intended to support the @emph{utf_8} encoding.
@node The encodings description file
@section The encodings description file
@findex encoding.deps description file
@findex mkdeps.pl Perl script
To simplify the process of adding support for new encodings, a lot of
"glue" files are generated automatically.
The 'encoding.deps' file in the @emph{lib/} subdirectory
describes the properties of each encoding. The 'mkdeps.pl' Perl script
uses 'encoding.deps' to generate the "glue" files.
The 'encoding.deps' file is composed of sections. Each section consists
of entries, and each entry contains an encoding, CES or CCS description.
The syntax of the 'encoding.deps' file is very simple. Currently only two
sections are defined: @emph{ENCODINGS} and @emph{CES_DEPENDENCIES}.
Each entry of the @emph{ENCODINGS} section describes one encoding and
contains the following information.
Encoding name (the @emph{ENCODING} field). The name must
be unique, and only one name is allowed.
The encoding's CES converter name (the @emph{CES} field). Only one CES
converter is allowed.
The whitespace-separated list of CCS table names which are used by the
encoding (the @emph{CCS} field).
The whitespace-separated list of alias names (the @emph{ALIASES}
field).
Note that all names in the 'encoding.deps' file must be in the normalized
form.
Each entry of the @emph{CES_DEPENDENCIES} section describes the dependencies of
one CES converter. For example, the @emph{euc} CES converter depends on
the @emph{table} and the @emph{us_ascii} CES converters since the
@emph{euc} CES converter uses them. This means that both the @emph{table}
and @emph{us_ascii} CES converters should be linked if the @emph{euc}
CES converter is enabled.
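The effect of such dependencies can be sketched in C as a closure over a small dependency table. This is an illustration only: the bit flags, the @code{ces_dep} structure and the @code{ces_close} function are hypothetical, and the real glue code generated by 'mkdeps.pl' resolves dependencies with preprocessor macros rather than at run time. The two dependency entries are the ones stated in this chapter (@emph{euc} and @emph{table_pcs}).

```c
#include <stddef.h>

/* Hypothetical bit flags, one per CES converter. */
enum {
    CES_EUC       = 1u << 0,
    CES_TABLE     = 1u << 1,
    CES_US_ASCII  = 1u << 2,
    CES_TABLE_PCS = 1u << 3
};

/* One entry per converter that depends on other converters. */
struct ces_dep {
    unsigned ces;       /* converter the entry describes   */
    unsigned used_ces;  /* converters it needs to be linked */
};

static const struct ces_dep deps[] = {
    { CES_EUC,       CES_TABLE | CES_US_ASCII },
    { CES_TABLE_PCS, CES_TABLE },
};

/* Returns the enabled set extended with everything it depends on.
   The loop repeats until no new converter is added, so chains of
   dependencies are followed as well. */
static unsigned ces_close(unsigned enabled)
{
    unsigned prev;

    do {
        prev = enabled;
        for (size_t i = 0; i < sizeof deps / sizeof deps[0]; i++)
            if (enabled & deps[i].ces)
                enabled |= deps[i].used_ces;
    } while (enabled != prev);
    return enabled;
}
```

With this table, enabling only @emph{euc} yields the set containing @emph{euc}, @emph{table} and @emph{us_ascii}.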
The @emph{CES_DEPENDENCIES} section defines the following:
the CES converter name for which the dependencies are defined in this
entry (the @emph{CES} field);
the whitespace-separated list of CES converters which are needed for
this CES converter (the @emph{USED_CES} field).
The 'mkdeps.pl' Perl script automatically solves the following tasks.
The user works with the iconv library in terms of encodings and need not know
anything about CES converters and CCS tables. The script automatically
generates code which enables all CES converters and CCS tables needed
for the encodings enabled by the user.
The CES converters may have dependencies, and the script automatically
generates the code which handles these dependencies.
The list of each encoding's aliases is also generated automatically.
The script uses a lot of macros in order to enable only the minimum set
of code and data which is needed to support the requested encodings in the
requested directions.
The 'mkdeps.pl' Perl script interprets the 'encoding.deps'
file and generates the following files.
@emph{lib/encnames.h} - this header file contains macro definitions for all
encoding names.
@emph{lib/aliasesbi.c} - the array of encoding names and aliases. The array
is used to find the name of the requested encoding by its alias.
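Finding an encoding name by its alias can be sketched in C as a simple search. The table layout, its entries and the function name @code{find_encoding_name} are assumptions made for this illustration; the generated @emph{lib/aliasesbi.c} file may store the data differently, and the alias spellings shown are examples rather than a copy of the real list.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: each entry pairs an alias with the encoding
   name it stands for. An encoding name is also its own alias. */
struct enc_alias {
    const char *alias;
    const char *name;
};

/* Illustrative entries only, not the real generated list. */
static const struct enc_alias aliases[] = {
    { "iso_8859_1", "iso_8859_1" },
    { "latin1",     "iso_8859_1" },
    { "us_ascii",   "us_ascii"   },
    { "ascii",      "us_ascii"   },
};

/* Returns the encoding name for a normalized alias, or NULL if the
   alias is unknown. */
static const char *find_encoding_name(const char *alias)
{
    for (size_t i = 0; i < sizeof aliases / sizeof aliases[0]; i++)
        if (strcmp(aliases[i].alias, alias) == 0)
            return aliases[i].name;
    return NULL;
}
```

The sketch expects the alias to be in normalized form already, as required for all names in 'encoding.deps'.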
@emph{ces/cesbi.c} - this file defines two arrays
(@code{_iconv_from_ucs_ces} and @code{_iconv_to_ucs_ces}) which contain
descriptions of the enabled "from UCS" and "to UCS" CES converters and the
names of the encodings which are supported by these CES converters.
@emph{ces/cesbi.h} - this file contains the set of macros which determine
which CES converters should be enabled, given only the set of
enabled encodings (through macros defined in the
@emph{newlib.h} file). Note that one CES converter may handle several
encodings.
@emph{ces/cesdeps.h} - the CES converter dependencies are handled in
this file.
@emph{ccs/ccsdeps.h} - the array of linked-in CCS tables is defined
in this file.
@emph{ccs/ccsnames.h} - this header file contains macro definitions for all
CCS names.
@emph{encoding.aliases} - the list of supported encodings and their
aliases, which is intended for the Newlib configure scripts in order to
handle the iconv-related configure script options.
@node How to add new encoding
@section How to add new encoding
First, the new encoding should be broken down into a CCS and a CES. Then,
the process of adding the new encoding is split into the following activities.
@item Generate the .cct CCS file and the .c source file for the new
encoding's CCS (if it isn't already present). To do this, obtain the CCS source
file and run the 'mktbl.pl' script.
@item Write the corresponding CES converter (if it isn't already
present). Use the existing CES converters as an example.
Add the corresponding entries to the 'encoding.deps' file and regenerate
the autogenerated "glue" files using the 'mkdeps.pl' script.
Don't forget to add entries to the newlib/newlib.hin file.
Of course, the 'Makefile.am'-s should also be updated (if new files were
added) and the 'Makefile.in'-s should be regenerated using the correct
version of 'automake'.
Don't forget to update the documentation (the list of
supported encodings and CES converters).
If a new encoding doesn't fit the CES/CCS decomposition model, or if
specialized (non-UCS-based) conversion support is desired,
the Newlib iconv library code itself has to be extended.
@node The locale support interfaces
@section The locale support interfaces
The Newlib iconv library also has some interface functions (besides the
@code{iconv}, @code{iconv_open} and @code{iconv_close} interfaces) which
are intended for the Locale subsystem. All the locale-related code is
placed in the @emph{lib/iconvnls.c} file.
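For contrast with the locale-related functions, the three user-level interfaces can be sketched in C as follows. The helper @code{latin1_to_utf8} is hypothetical and written only for this illustration. It assumes that the iconv implementation in use accepts the names "ISO-8859-1" and "UTF-8" and, in the case of Newlib, that both encodings were enabled at configure time.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: converts the NUL-terminated ISO-8859-1 string
   `in` to UTF-8 in `out`. Returns the number of bytes written
   (excluding the terminator), or (size_t)-1 on failure. */
static size_t latin1_to_utf8(const char *in, char *out, size_t outsize)
{
    iconv_t cd;
    char *inp = (char *)in;     /* iconv() takes a non-const pointer */
    char *outp = out;
    size_t inleft = strlen(in);
    size_t outleft;
    size_t rc;

    if (outsize == 0)
        return (size_t)-1;
    outleft = outsize - 1;      /* reserve room for the terminator */

    cd = iconv_open("UTF-8", "ISO-8859-1");   /* (to, from) order */
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;      /* e.g. output buffer too small */

    *outp = '\0';
    return (size_t)(outp - out);
}
```

Note the argument order of @code{iconv_open}: the target encoding comes first and the source encoding second.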
The following is the description of the locale-related interfaces:
@code{_iconv_nls_open} - opens two iconv descriptors for "CCS ->
wchar_t" and "wchar_t -> CCS" conversions. The normalized CCS name is
passed in the function parameters. The encoding of @emph{wchar_t} characters is
either ucs_2_internal or ucs_4_internal, depending on the size of
@emph{wchar_t}.
@code{_iconv_nls_conv} - the function is similar to the @code{iconv}
function, but if there is no character in the output encoding which
corresponds to the character in the input encoding, the default
conversion isn't performed (the @code{iconv} function sets such output
characters to the @kbd{?} symbol, and this is the behavior which is
specified in SUSv3).
@code{_iconv_nls_get_state} - returns the current encoding's shift state
(the @code{mbstate_t} object).
@code{_iconv_nls_set_state} - sets the current encoding's shift state (the
@code{mbstate_t} object).
@code{_iconv_nls_is_stateful} - checks whether the encoding is stateful.
@code{_iconv_nls_get_mb_cur_max} - returns the maximum length (the
maximum number of bytes) of the encoding's characters.
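The choice of the @emph{wchar_t} encoding made by @code{_iconv_nls_open} can be sketched in C as follows. The function @code{wchar_internal_encoding} is a hypothetical helper, not a Newlib interface; it only assumes that a 2-byte @code{wchar_t} corresponds to ucs_2_internal and any other size to ucs_4_internal.

```c
#include <wchar.h>

/* Hypothetical helper, not a Newlib function: picks the internal
   encoding name that matches the size of wchar_t on this target. */
static const char *wchar_internal_encoding(void)
{
    return sizeof(wchar_t) == 2 ? "ucs_2_internal" : "ucs_4_internal";
}
```

Because the size of @code{wchar_t} is fixed for a given target, the same decision could equally be made at compile time.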
The author of the original BSD iconv library (Konstantin Chuguev) no longer
Any questions regarding the iconv library may be forwarded to
Artem B. Bityuckiy (dedekind@@oktetlabs.ru or dedekind@@mail.ru) as
well as to the public Newlib mailing list.