newlib/libc/iconv/iconv.tex

   1 @node Encoding conversions
   2 @chapter Encoding conversions (@file{iconv.h})
   3
   4 This chapter describes the Newlib iconv library.
   5 The iconv function declarations are in
   6 @file{iconv.h}.
   7
   8 @menu
   9 * Function iconv::                  Encoding conversion routines
  10 * Introduction to iconv::           Introduction to iconv and encodings
  11 * Supported encodings::             The list of currently supported encodings
  12 * iconv design decisions::          General iconv library design issues
  13 * iconv configuration::             iconv-related configure script options
  14 * Encoding names::                  How encodings are named.
  15 * CCS tables::                      CCS tables format and 'mktbl.pl' Perl script
  16 * CES converters::                  CES converters description
  17 * The encodings description file::  The 'encoding.deps' file and 'mkdeps.pl'
  18 * How to add new encoding::         The steps to add new encoding support
  19 * The locale support interfaces::   Locale-related iconv interfaces
  20 * Contact::                         The author contact
  21 @end menu
  22
  23 @page
  24 @include iconv/lib/iconv.def
  25
  26 @page
  27 @node Introduction to iconv
  28 @section Introduction to iconv
  29 @findex encoding
  30 @findex character set
  31 @findex charset
  32 @findex CES
  33 @findex CCS
  34 @*
  35 The iconv library is intended to convert characters from one encoding to
  36 another. It implements iconv(), iconv_open() and iconv_close()
  37 calls, which are defined by the Single Unix Specification.
  38
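@*
As a minimal sketch of how these three calls fit together, the following C helper converts one buffer. The helper name @code{convert_buffer} is hypothetical and not part of the library; only @code{iconv_open}, @code{iconv} and @code{iconv_close} come from the Single Unix Specification, and which encoding names are accepted depends on how the library was configured.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert `inlen` bytes of `in` from encoding `from`
   to encoding `to`, writing at most `outcap` bytes to `out`.
   Returns the number of bytes written, or (size_t)-1 on failure. */
size_t convert_buffer(const char *to, const char *from,
                      const char *in, size_t inlen,
                      char *out, size_t outcap)
{
    iconv_t cd = iconv_open(to, from);      /* target encoding comes first */
    if (cd == (iconv_t)-1)
        return (size_t)-1;                  /* encoding pair not available */

    char *inp = (char *)in;                 /* iconv() takes a non-const pointer */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outcap;

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &outp, &outleft);  /* flush any shift state */
    iconv_close(cd);

    return rc == (size_t)-1 ? (size_t)-1 : outcap - outleft;
}
```

For example, converting the single ISO-8859-1 byte 0xE9 to UTF-8 with this helper should yield the two bytes 0xC3 0xA9, provided both encodings are enabled.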
  39 @*
  40 In addition to these user-level interfaces, the iconv library also has
  41 several useful interfaces which are needed to support coding
  42 capabilities of the Newlib Locale infrastructure.  Since Locale
  43 support also needs to
  44 convert various character sets to and from the @emph{wide character
  45 set}, the iconv library shares its capabilities with the Newlib Locale
  46 subsystem. Moreover, the iconv library supports several features which are
  47 only needed for the Locale infrastructure (for example, the MB_CUR_MAX value).
  48
  49 @*
  50 The Newlib iconv library was created using concepts from another iconv
  51 library implemented by Konstantin Chuguev (ver 2.0). The Newlib iconv library
  52 was rewritten from scratch and contains a lot of improvements with respect to
  53 the original iconv library.
  54
  55 @*
  56 Terms like @dfn{encoding} or @dfn{character set} aren't well defined and
  57 are often used with various meanings. The following are the definitions of terms
  58 which are used in this documentation as well as in the iconv library
  59 implementation:
  60
  61 @itemize @bullet
  62 @item
  63 @dfn{encoding} - a machine representation of characters by means of bits;
  64
  65 @item
  66 @dfn{Character Set} or @dfn{Charset} - just a collection of
  67 characters, i.e. the encoding is the machine representation of the character set;
  68
  69 @item
  70 @dfn{CCS} (@dfn{Coded Character Set}) - a mapping from a character set to a
  71 set of integers (@dfn{character codes});
  72
  73 @item
  74 @dfn{CES} (@dfn{Character Encoding Scheme}) - a mapping from a set of character
  75 codes to a sequence of bytes;
  76 @end itemize
  77
  78 @*
  79 Users usually deal with encodings, for example, KOI8-R, Unicode, UTF-8,
  80 ASCII, etc. Encodings are formed by the following chain of steps:
  81
  82 @enumerate
  83 @item
  84 The user has a set of characters which are specific to his or her language (a character set).
  85
  86 @item
  87 Each character from this set is uniquely numbered, resulting in a CCS.
  88
  89 @item
  90 Each number from the CCS is converted to a sequence of bits or bytes by means
  91 of a CES, forming an encoding. Thus, a CES may be considered as a
  92 function of a CCS which produces an encoding. Note that a CES may be
  93 applied to more than one CCS.
  94 @end enumerate
  95
  96 @*
  97 Thus, an encoding may be considered as one or more CCS + CES.
  98
  99 @*
 100 Sometimes there is no CES, and in such cases the encoding is equivalent
 101 to the CCS, e.g. KOI8-R or ASCII.
 102
 103 @*
 104 An example of a more complicated encoding is UTF-8 which is the UCS
 105 (or Unicode) CCS plus the UTF-8 CES.
 106
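@*
To make the CCS plus CES split concrete, here is a sketch of the UTF-8 CES written as a C function: it takes one UCS code (the CCS part) and produces the byte sequence defined by RFC 3629. The function name @code{utf8_ces_encode} is illustrative only and is not the Newlib implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch: apply the UTF-8 CES to one UCS code.
   Writes 1 to 4 bytes to `out` and returns how many were written,
   or 0 if `ucs` cannot be encoded. */
size_t utf8_ces_encode(uint32_t ucs, unsigned char out[4])
{
    if (ucs < 0x80) {                       /* 7-bit codes map to themselves */
        out[0] = (unsigned char)ucs;
        return 1;
    }
    if (ucs < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (ucs >> 6));
        out[1] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 2;
    }
    if (ucs < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (ucs >= 0xD800 && ucs <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (ucs >> 12));
        out[1] = (unsigned char)(0x80 | ((ucs >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 3;
    }
    if (ucs <= 0x10FFFF) {                  /* 11110xxx followed by three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (ucs >> 18));
        out[1] = (unsigned char)(0x80 | ((ucs >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((ucs >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 4;
    }
    return 0;                               /* outside the UCS range of RFC 3629 */
}
```

The same UCS code fed to a different CES (for instance a UCS-2 or UTF-16 one) would give a different byte sequence, which is why the CCS and the CES are treated as separate components.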
 107 @*
 108 The following is a brief list of iconv library features:
 109 @itemize
 110 @item
 111 Generic architecture;
 112 @item
 113 Locale infrastructure support;
 114 @item
 115 Automatic generation of the program code which handles
 116 CES/CCS/Encoding/Names/Aliases dependencies;
 117 @item
 118 The ability to choose size- or speed-optimized
 119 configuration;
 120 @item
 121 The ability to exclude a lot of unneeded code and data from the linking step.
 122 @end itemize
 123
 124
 125
 126
 127 @page
 128 @node Supported encodings
 129 @section Supported encodings
 130 @findex big5
 131 @findex cp775
 132 @findex cp850
 133 @findex cp852
 134 @findex cp855
 135 @findex cp866
 136 @findex euc_jp
 137 @findex euc_kr
 138 @findex euc_tw
 139 @findex iso_8859_1
 140 @findex iso_8859_10
 141 @findex iso_8859_11
 142 @findex iso_8859_13
 143 @findex iso_8859_14
 144 @findex iso_8859_15
 145 @findex iso_8859_2
 146 @findex iso_8859_3
 147 @findex iso_8859_4
 148 @findex iso_8859_5
 149 @findex iso_8859_6
 150 @findex iso_8859_7
 151 @findex iso_8859_8
 152 @findex iso_8859_9
 153 @findex iso_ir_111
 154 @findex koi8_r
 155 @findex koi8_ru
 156 @findex koi8_u
 157 @findex koi8_uni
 158 @findex ucs_2
 159 @findex ucs_2_internal
 160 @findex ucs_2be
 161 @findex ucs_2le
 162 @findex ucs_4
 163 @findex ucs_4_internal
 164 @findex ucs_4be
 165 @findex ucs_4le
 166 @findex us_ascii
 167 @findex utf_16
 168 @findex utf_16be
 169 @findex utf_16le
 170 @findex utf_8
 171 @findex win_1250
 172 @findex win_1251
 173 @findex win_1252
 174 @findex win_1253
 175 @findex win_1254
 176 @findex win_1255
 177 @findex win_1256
 178 @findex win_1257
 179 @findex win_1258
 180 @*
 181 The following is the list of currently supported encodings. The first column
 182 corresponds to the encoding name, the second column is the list of aliases,
 183 the third column gives the names of its CES and CCS components, and the fourth column
 184 is a short description.
 185
 186 @multitable @columnfractions .20 .26 .24 .30
 187 @item
 188 Name
 189 @tab
 190 Aliases
 191 @tab
 192 CES/CCS
 193 @tab
 194 Short description
 195 @item
 196 @tab
 197 @tab
 198 @tab
 199
 200
 201 @item
 202 big5
 203 @tab
 204 csbig5, big_five, bigfive, cn_big5, cp950
 205 @tab
 206 table_pcs / big5, us_ascii
 207 @tab
 208 The encoding for the Traditional Chinese.
 209
 210
 211 @item
 212 cp775
 213 @tab
 214 ibm775, cspc775baltic
 215 @tab
 216 table / cp775
 217 @tab
 218 The updated version of CP 437 that supports the Baltic languages.
 219
 220
 221 @item
 222 cp850
 223 @tab
 224 ibm850, 850, cspc850multilingual
 225 @tab
 226 table / cp850
 227 @tab
 228 IBM 850 - the updated version of CP 437 where several Latin 1 characters have been
 229 added instead of some less-often used characters like the line-drawing
 230 and the Greek ones.
 231
 232
 233 @item
 234 cp852
 235 @tab
 236 ibm852, 852, cspcp852
 237 @tab table / cp852
 238 @tab
 239 IBM 852 - the updated version of CP 437 where several Latin 2 characters have been added
 240 instead of some less-often used characters like the line-drawing and the Greek ones.
 241
 242
 243 @item
 244 cp855
 245 @tab
 246 ibm855, 855, csibm855
 247 @tab
 248 table / cp855
 249 @tab
 250 IBM 855 - the updated version of CP 437 that supports Cyrillic.
 251
 252
 253 @item
 254 cp866
 255 @tab
 256 866, IBM866, CSIBM866
 257 @tab
 258 table / cp866
 259 @tab
 260 IBM 866 - the updated version of CP 855 which more closely follows the logical Russian alphabet
 261 ordering of the alternative variant that is preferred by many Russian users.
 262
 263
 264 @item
 265 euc_jp
 266 @tab
 267 eucjp
 268 @tab
 269 euc / jis_x0208_1990, jis_x0201_1976, jis_x0212_1990
 270 @tab
 271 EUC-JP - The EUC for Japanese.
 272
 273
 274 @item
 275 euc_kr
 276 @tab
 277 euckr
 278 @tab
 279 euc / ksx1001
 280 @tab
 281 EUC-KR - The EUC for Korean.
 282
 283
 284 @item
 285 euc_tw
 286 @tab
 287 euctw
 288 @tab
 289 euc / cns11643_plane1, cns11643_plane2, cns11643_plane14
 290 @tab
 291 EUC-TW - The EUC for Traditional Chinese.
 292
 293
 294 @item
 295 iso_8859_1
 296 @tab
 297 iso8859_1, iso88591, iso_8859_1:1987, iso_ir_100, latin1, l1, ibm819, cp819, csisolatin1
 298 @tab
 299 table / iso_8859_1
 300 @tab
 301 ISO 8859-1:1987 - Latin 1, West European.
 302
 303
 304 @item
 305 iso_8859_10
 306 @tab
 307 iso_8859_10:1992, iso_ir_157, iso885910, latin6, l6, csisolatin6, iso8859_10
 308 @tab
 309 table / iso_8859_10
 310 @tab
 311 ISO 8859-10:1992 - Latin 6, Nordic.
 312
 313
 314 @item
 315 iso_8859_11
 316 @tab
 317 iso8859_11, iso885911
 318 @tab
 319 table / iso_8859_11
 320 @tab
 321 ISO 8859-11 - Thai.
 322
 323
 324 @item
 325 iso_8859_13
 326 @tab
 327 iso_8859_13:1998, iso8859_13, iso885913
 328 @tab
 329 table / iso_8859_13
 330 @tab
 331 ISO 8859-13:1998 - Latin 7, Baltic Rim.
 332
 333
 334 @item
 335 iso_8859_14
 336 @tab
 337 iso_8859_14:1998, iso885914, iso8859_14
 338 @tab
 339 table / iso_8859_14
 340 @tab
 341 ISO 8859-14:1998 - Latin 8, Celtic.
 342
 343
 344 @item
 345 iso_8859_15
 346 @tab
 347 iso885915, iso_8859_15:1998, iso8859_15
 348 @tab
 349 table / iso_8859_15
 350 @tab
 351 ISO 8859-15:1998 - Latin 9, West Europe, successor of Latin 1.
 352
 353
 354 @item
 355 iso_8859_2
 356 @tab
 357 iso8859_2, iso88592, iso_8859_2:1987, iso_ir_101, latin2, l2, csisolatin2
 358 @tab
 359 table / iso_8859_2
 360 @tab
 361 ISO 8859-2:1987 - Latin 2, East European.
 362
 363
 364 @item
 365 iso_8859_3
 366 @tab
 367 iso_8859_3:1988, iso_ir_109, iso8859_3, latin3, l3, csisolatin3, iso88593
 368 @tab
 369 table / iso_8859_3
 370 @tab
 371 ISO 8859-3:1988 - Latin 3, South European.
 372
 373
 374 @item
 375 iso_8859_4
 376 @tab
 377 iso8859_4, iso88594, iso_8859_4:1988, iso_ir_110, latin4, l4, csisolatin4
 378 @tab
 379 table / iso_8859_4
 380 @tab
 381 ISO 8859-4:1988 - Latin 4, North European.
 382
 383
 384 @item
 385 iso_8859_5
 386 @tab
 387 iso8859_5, iso88595, iso_8859_5:1988, iso_ir_144, cyrillic, csisolatincyrillic
 388 @tab
 389 table / iso_8859_5
 390 @tab
 391 ISO 8859-5:1988 - Cyrillic.
 392
 393
 394 @item
 395 iso_8859_6
 396 @tab
 397 iso_8859_6:1987, iso_ir_127, iso8859_6, ecma_114, asmo_708, arabic, csisolatinarabic, iso88596
 398 @tab
 399 table / iso_8859_6
 400 @tab
 401 ISO 8859-6:1987 - Arabic.
 402
 403
 404 @item
 405 iso_8859_7
 406 @tab
 407 iso_8859_7:1987, iso_ir_126, iso8859_7, elot_928, ecma_118, greek, greek8, csisolatingreek, iso88597
 408 @tab
 409 table / iso_8859_7
 410 @tab
 411 ISO 8859-7:1987 - Greek.
 412
 413
 414 @item
 415 iso_8859_8
 416 @tab
 417 iso_8859_8:1988, iso_ir_138, iso8859_8, hebrew, csisolatinhebrew, iso88598
 418 @tab
 419 table / iso_8859_8
 420 @tab
 421 ISO 8859-8:1988 - Hebrew.
 422
 423
 424 @item
 425 iso_8859_9
 426 @tab
 427 iso_8859_9:1989, iso_ir_148, iso8859_9, latin5, l5, csisolatin5, iso88599
 428 @tab
 429 table / iso_8859_9
 430 @tab
 431 ISO 8859-9:1989 - Latin 5, Turkish.
 432
 433
 434 @item
 435 iso_ir_111
 436 @tab
 437 ecma_cyrillic, koi8_e, koi8e, csiso111ecmacyrillic
 438 @tab
 439 table / iso_ir_111
 440 @tab
 441 ISO IR 111/ECMA Cyrillic.
 442
 443
 444 @item
 445 koi8_r
 446 @tab
 447 cskoi8r, koi8r, koi8
 448 @tab
 449 table / koi8_r
 450 @tab
 451 RFC 1489 Cyrillic.
 452
 453
 454 @item
 455 koi8_ru
 456 @tab
 457 koi8ru
 458 @tab
 459 table / koi8_ru
 460 @tab
 461 The obsolete Ukrainian.
 462
 463
 464 @item
 465 koi8_u
 466 @tab
 467 koi8u
 468 @tab
 469 table / koi8_u
 470 @tab
 471 RFC 2319 Ukrainian.
 472
 473
 474 @item
 475 koi8_uni
 476 @tab
 477 koi8uni
 478 @tab
 479 table / koi8_uni
 480 @tab
 481 KOI8 Unified.
 482
 483
 484 @item
 485 ucs_2
 486 @tab
 487 ucs2, iso_10646_ucs_2, iso10646_ucs_2, iso_10646_ucs2, iso10646_ucs2, iso10646ucs2, csUnicode
 488 @tab
 489 ucs_2 / (UCS)
 490 @tab
 491 ISO-10646-UCS-2. Big Endian, ZWNBSP is always interpreted as ZWNBSP (BOM isn't supported).
 492
 493
 494 @item
 495 ucs_2_internal
 496 @tab
 497 ucs2_internal, ucs_2internal, ucs2internal
 498 @tab
 499 ucs_2_internal / (UCS)
 500 @tab
 501 ISO-10646-UCS-2 in system byte order.
 502 ZWNBSP is always interpreted as ZWNBSP (BOM isn't supported).
 503
 504
 505 @item
 506 ucs_2be
 507 @tab
 508 ucs2be
 509 @tab
 510 ucs_2 / (UCS)
 511 @tab
 512 Big Endian version of ISO-10646-UCS-2 (in fact, equivalent to ucs_2).
 513 Big Endian, ZWNBSP is always interpreted as ZWNBSP (BOM isn't supported).
 514
 515
 516 @item
 517 ucs_2le
 518 @tab
 519 ucs2le
 520 @tab
 521 ucs_2 / (UCS)
 522 @tab
 523 Little Endian version of ISO-10646-UCS-2.
 524 Little Endian, ZWNBSP is always interpreted as ZWNBSP (BOM isn't supported).
 525
 526
 527 @item
 528 ucs_4
 529 @tab
 530 ucs4, iso_10646_ucs_4, iso10646_ucs_4, iso_10646_ucs4, iso10646_ucs4, iso10646ucs4
 531 @tab
 532 ucs_4 / (UCS)
 533 @tab
 534 ISO-10646-UCS-4. Big Endian, ZWNBSP is always interpreted as ZWNBSP (BOM isn't supported).
 535
 536
 537 @item
 538 ucs_4_internal
 539 @tab
 540 ucs4_internal, ucs_4internal, ucs4internal
 541 @tab
 542 ucs_4_internal / (UCS)
 543 @tab
 544 ISO-10646-UCS-4 in system byte order.
 545 ZWNBSP is always interpreted as ZWNBSP (BOM isn't supported).
 546
 547
 548 @item
 549 ucs_4be
 550 @tab
 551 ucs4be
 552 @tab
 553 ucs_4 / (UCS)
 554 @tab
 555 Big Endian version of ISO-10646-UCS-4 (in fact, equivalent to ucs_4).
 556 Big Endian, ZWNBSP is always interpreted as ZWNBSP (BOM isn't supported).
 557
 558
 559 @item
 560 ucs_4le
 561 @tab
 562 ucs4le
 563 @tab
 564 ucs_4 / (UCS)
 565 @tab
 566 Little Endian version of ISO-10646-UCS-4.
 567 Little Endian, ZWNBSP is always interpreted as ZWNBSP (BOM isn't supported).
 568
 569
 570 @item
 571 us_ascii
 572 @tab
 573 ansi_x3.4_1968, ansi_x3.4_1986, iso_646.irv:1991, ascii, iso646_us, us, ibm367, cp367, csascii
 574 @tab
 575 us_ascii / (ASCII)
 576 @tab
 577 7-bit ASCII.
 578
 579
 580 @item
 581 utf_16
 582 @tab
 583 utf16
 584 @tab
 585 utf_16 / (UCS)
 586 @tab
 587 RFC 2781 UTF-16. The very first ZWNBSP code in the stream is interpreted as the BOM.
 588
 589
 590 @item
 591 utf_16be
 592 @tab
 593 utf16be
 594 @tab
 595 utf_16 / (UCS)
 596 @tab
 597 Big Endian version of RFC 2781 UTF-16.
 598 ZWNBSP is always interpreted as ZWNBSP (BOM isn't supported).
 599
 600
 601 @item
 602 utf_16le
 603 @tab
 604 utf16le
 605 @tab
 606 utf_16 / (UCS)
 607 @tab
 608 Little Endian version of RFC 2781 UTF-16.
 609 ZWNBSP is always interpreted as ZWNBSP (BOM isn't supported).
 610
 611
 612 @item
 613 utf_8
 614 @tab
 615 utf8
 616 @tab
 617 utf_8 / (UCS)
 618 @tab
 619 RFC 3629 UTF-8.
 620
 621
 622 @item
 623 win_1250
 624 @tab
 625 cp1250
 626 @tab table / win_1250
 627 @tab
 628 Win-1250 Croatian.
 629
 630
 631 @item
 632 win_1251
 633 @tab
 634 cp1251
 635 @tab
 636 table / win_1251
 637 @tab
 638 Win-1251 - Cyrillic.
 639
 640
 641 @item
 642 win_1252
 643 @tab
 644 cp1252
 645 @tab
 646 table / win_1252
 647 @tab
 648 Win-1252 - Latin 1.
 649
 650
 651 @item
 652 win_1253
 653 @tab
 654 cp1253
 655 @tab
 656 table / win_1253
 657 @tab
 658 Win-1253 - Greek.
 659
 660
 661 @item
 662 win_1254
 663 @tab
 664 cp1254
 665 @tab
 666 table / win_1254
 667 @tab
 668 Win-1254 - Turkish.
 669
 670
 671 @item
 672 win_1255
 673 @tab
 674 cp1255
 675 @tab
 676 table / win_1255
 677 @tab
 678 Win-1255 - Hebrew.
 679
 680
 681 @item
 682 win_1256
 683 @tab
 684 cp1256
 685 @tab
 686 table / win_1256
 687 @tab
 688 Win-1256 - Arabic.
 689
 690
 691 @item
 692 win_1257
 693 @tab
 694 cp1257
 695 @tab
 696 table / win_1257
 697 @tab
 698 Win-1257 - Baltic.
 699
 700
 701 @item
 702 win_1258
 703 @tab
 704 cp1258
 705 @tab
 706 table / win_1258
 707 @tab
 708 Win-1258 - Vietnamese.
 709 @end multitable
 710
 711
 712
 713
 714
 715 @page
 716 @node iconv design decisions
 717 @section iconv design decisions
 718 @findex CCS table
 719 @findex CES converter
 720 @findex Speed-optimized tables
 721 @findex Size-optimized tables
 722 @*
 723 The first iconv library design issue arises when considering the
 724 following two design approaches:
 725
 726 @enumerate
 727 @item
 728 Have modules which implement conversion from the encoding A to the encoding B
 729 and vice versa, i.e., one conversion module relates to any two encodings.
 730 @item
 731 Have modules which implement conversion from the encoding A to the fixed
 732 encoding C and vice versa, i.e., one conversion module relates to any
 733 one encoding A and one fixed encoding C. In this case, to convert from
 734 the encoding A to the encoding B, two modules are needed (in order to convert
 735 from A to C and then from C to B).
 736 @end enumerate
 737
 738 @*
 739 Obviously, there is a tradeoff between commonality/flexibility and
 740 efficiency: the first method is more efficient since it converts
 741 directly; however, it is less flexible since a distinct module is
 742 needed for each encoding pair.
 743
 744 @*
 745 The Newlib iconv model uses the second method and always converts through the 32-bit
 746 UCS, but its design also allows one to write specialized conversion
 747 modules if the conversion speed is critical.
 748
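@*
Conversion through a fixed intermediate encoding can be sketched as two table lookups with a UCS code in between. The two tables below are tiny hypothetical samples that cover only three Cyrillic letters, and the names @code{ccs_pair}, @code{sample_to_ucs} and @code{sample_from_ucs} are illustrative; they do not reflect the real Newlib table format.

```c
#include <stddef.h>
#include <stdint.h>

/* One (CCS code, UCS code) pair of a sample table. */
struct ccs_pair {
    uint8_t  code;
    uint16_t ucs;
};

/* Sample excerpts only: the letters U+0430, U+0431 and U+0432. */
static const struct ccs_pair koi8_r_sample[] = {
    { 0xC1, 0x0430 }, { 0xC2, 0x0431 }, { 0xD7, 0x0432 }
};
static const struct ccs_pair iso_8859_5_sample[] = {
    { 0xD0, 0x0430 }, { 0xD1, 0x0431 }, { 0xD2, 0x0432 }
};

/* CCS code -> UCS code, or -1 if the code is not in the table. */
static int sample_to_ucs(const struct ccs_pair *t, size_t n, uint8_t code)
{
    for (size_t i = 0; i < n; i++)
        if (t[i].code == code)
            return t[i].ucs;
    return -1;
}

/* UCS code -> CCS code, or -1 if the code is not in the table. */
static int sample_from_ucs(const struct ccs_pair *t, size_t n, uint16_t ucs)
{
    for (size_t i = 0; i < n; i++)
        if (t[i].ucs == ucs)
            return t[i].code;
    return -1;
}

/* A -> C -> B: KOI8-R to ISO-8859-5 with UCS as the fixed encoding C. */
int koi8_r_to_iso_8859_5(uint8_t code)
{
    int ucs = sample_to_ucs(koi8_r_sample, 3, code);
    if (ucs < 0)
        return -1;
    return sample_from_ucs(iso_8859_5_sample, 3, (uint16_t)ucs);
}
```

Adding a third encoding in this scheme needs only one more table, whereas the pairwise approach would need a new module for every existing encoding.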
 749 @*
 750 The second design issue is how to break down (decompose) encodings.
 751 The Newlib iconv library uses the fact that any encoding may be
 752 considered as one or more CCS plus a CES. It also decomposes its
 753 conversion modules into a @dfn{CES converter} plus one or more @dfn{CCS
 754 tables}. CCS tables map CCS to UCS and vice versa; the CES converters
 755 map CCS to the encoding and vice versa.
 756
 757 @*
 758 As an example, let's consider conversion between the big5 encoding and
 759 the EUC-TW encoding. The big5 encoding may be decomposed into the ASCII and BIG5
 760 CCS-es plus the BIG5 CES. EUC-TW may be decomposed into the CNS11643_PLANE1, CNS11643_PLANE2,
 761 and CNS11643_PLANE14 CCS-es plus the EUC CES.
 762
 763 @*
 764 The euc_tw -> big5 conversion is performed as follows:
 765
 766 @enumerate
 767 @item
 768 The EUC converter transforms the EUC-TW encoding to the corresponding CCS-es
 769 (the CNS11643_PLANE1, CNS11643_PLANE2 and CNS11643_PLANE14
 770 CCS-es);
 771 @item
 772 The obtained CCS codes are transformed to the UCS codes using the CNS11643_PLANE1,
 773 CNS11643_PLANE2 and CNS11643_PLANE14 CCS tables;
 774 @item
 775 The resulting UCS codes are transformed to the ASCII and BIG5 codes using
 776 the corresponding CCS tables;
 777 @item
 778 The obtained CCS codes are transformed to the big5 encoding using the corresponding
 779 CES converter.
 780 @end enumerate
 781
 782 @*
 783 Analogously, the backward conversion is performed as follows:
 784
 785 @enumerate
 786 @item
 787 The BIG5 converter transforms the big5 encoding to the corresponding CCS-es
 788 (the ASCII and BIG5 CCS-es);
 789 @item
 790 The obtained CCS codes are transformed to the UCS codes using the ASCII and BIG5 CCS tables;
 791 @item
 792 The resulting UCS codes are transformed to the CNS11643_PLANE1, CNS11643_PLANE2 and
 793 CNS11643_PLANE14 codes using the corresponding CCS tables;
 794 @item
 795 The obtained CCS codes are transformed to the EUC-TW encoding using the corresponding
 796 CES converter.
 797 @end enumerate
 798
 799 @*
 800 Note that the above is just an example; the real names (as implemented
 801 in the Newlib iconv) of the CES converters and the CCS tables are slightly different.
 802
 803 @*
 804 The third design issue also relates to flexibility. Obviously, it isn't
 805 desirable to always link all the CES converters and the CCS tables to the library;
 806 instead, we want to be able to load the needed converters and tables
 807 dynamically on demand. This isn't a problem on "big" machines such as
 808 a PC, but it may be very problematic within "small" embedded systems.
 809
 810 @*
 811 Since the CCS tables are just data, it is possible to load them
 812 dynamically from external files.  The CES converters, on the other hand,
 813 are algorithms with some code, so a dynamic library loading
 814 capability is required.
 815
 816 @*
 817 Apart from possible restrictions applied by embedded systems (small
 818 RAM for example), Newlib itself has no dynamic library support and
 819 therefore, all the CES converters which will ever be used must be linked into
 820 the library.   However, loading of the dynamic CCS tables is possible and is
 821 implemented in the Newlib iconv library.  It may be enabled via the Newlib
 822 configure script options.
 823
 824 @*
 825 The next design issue is fine-tuning the iconv library
 826 configuration.  One important ability is for iconv to not link all its
 827 converters and tables (if dynamic loading is not enabled) but instead,
 828 enable only those encodings which are specified at configuration
 829 time (see the section about the configure script options).
 830
 831 @*
 832 In addition, the Newlib iconv library configure options distinguish between
 833 conversion directions. This means that not only are supported encodings
 834 selectable, the conversion direction is as well. For example, if a user wants
 835 a configuration which allows conversions from UTF-8 to UTF-16 and
 836 doesn't plan to use the "UTF-16 to UTF-8" conversions, he or she can
 837 enable only
 838 this conversion direction (i.e., no "UTF-16 -> UTF-8"-related code will
 839 be included), thus saving some memory (note that this technique allows
 840 one half of a CCS table, which may be quite big, to be excluded from linking).
 841
 842 @*
 843 One more design aspect is the speed- and size-optimized tables. Users can
 844 select between them using configure script options. The
 845 speed-optimized CCS tables are the same as the size-optimized ones in
 846 case of 8-bit CCS (e.g., KOI8-R), but for 16-bit CCS-es the size-optimized
 847 CCS tables may be 1.5 to 2 times smaller than the speed-optimized ones. On the
 848 other hand, conversion with speed tables is several times faster.
 849
 850 @*
 851 It is worth stressing that support for a new encoding can't be
 852 dynamically added into an already compiled Newlib library, even if it
 853 needs only an additional CCS table and iconv is configured to use
 854 the external files with CCS tables (this isn't a fundamental restriction,
 855 and the ability to add new table-based encoding support dynamically, by
 856 just adding a new .cct file, could easily be added).
 857
 858 @*
 859 Theoretically, the compiled-in CCS tables should be more appropriate for
 860 embedded systems than dynamically loaded CCS tables.  This is because the compiled-in tables are read-only and can be placed in ROM
 861 whereas dynamic loading requires RAM.  Moreover, in the current iconv
 862 implementation, a distinct copy of the dynamic CCS file is loaded for each opened iconv descriptor even in case of the same encoding.
 863 This means, for example, that if two iconv descriptors for
 864 "KOI8-R -> UCS-4BE" and "KOI8-R -> UTF-16BE" are opened, two copies of
 865 the koi8-r .cct file will be loaded (actually, iconv loads only the needed part
 866 of these files).  On the other hand, in the case of compiled-in CCS tables, there will always be only one copy.
 867
 868 @page
 869 @node iconv configuration
 870 @section iconv configuration
 871 @findex iconv configuration
 872 @findex --enable-newlib-iconv-encodings
 873 @findex --enable-newlib-iconv-from-encodings
 874 @findex --enable-newlib-iconv-to-encodings
 875 @findex --enable-newlib-iconv-external-ccs
 876 @findex NLSPATH
 877 @*
 878 To enable an encoding, the @option{--enable-newlib-iconv-encodings} configure
 879 script option should be used. This option accepts a comma-separated list
 880 of @emph{encodings} that should be enabled. The option enables each encoding in both
 881 ("to" and "from") directions.
 882
 883 @*
 884 The @option{--enable-newlib-iconv-from-encodings} configure script option enables
 885 "from" support for each encoding that was passed to it.
 886
 887 @*
 888 The @option{--enable-newlib-iconv-to-encodings} configure script option enables
 889 "to" support for each encoding that was passed to it.
 890
 891 @*
 892 Example: if a user plans to use only the "KOI8-R -> UTF-8", "UTF-8 -> ISO-8859-5" and
 893 "KOI8-R -> UCS-2" conversions, the optimal way (the minimal iconv
 894 code and data will be linked) is to configure Newlib with the following
 895 options:
 896 @*
 897 @code{--enable-newlib-iconv-encodings=UTF-8
 898 --enable-newlib-iconv-from-encodings=KOI8-R
 899 --enable-newlib-iconv-to-encodings=UCS-2,ISO-8859-5}
 900 @*
 901 which is the same as
 902 @*
 903 @code{--enable-newlib-iconv-from-encodings=KOI8-R,UTF-8
 904 --enable-newlib-iconv-to-encodings=UCS-2,ISO-8859-5,UTF-8}
 905 @*
 906 The user may also just use the
 907 @*
 908 @code{--enable-newlib-iconv-encodings=KOI8-R,ISO-8859-5,UTF-8,UCS-2}
 909 @*
 910 configure script option, but this is less optimal since some
 911 unneeded data and code will be linked.
 912
 913 @*
 914 The @option{--enable-newlib-iconv-external-ccs} option enables iconv's
 915 capabilities to work with the external CCS files.
 916
 917 @*
 918 The @option{--enable-target-optspace} Newlib configure script option also affects
 919 the iconv library. If this option is present, the library uses the size
 920 optimized CCS tables. This means that only the size-optimized CCS
 921 tables will be linked or, if the
 922 @option{--enable-newlib-iconv-external-ccs} configure script option was used,
 923 the iconv library will load the size-optimized tables. If the
 924 @option{--enable-target-optspace} configure script option is disabled,
 925 the speed-optimized CCS tables are used.
 926
 927 @*
 928 Note: .cct files are searched by iconv_open in the $NLSPATH/iconv_data/ directory.
 929 Thus, the NLSPATH environment variable should be set.
 930
 931
 932
 933
 934
 935 @page
 936 @node Encoding names
 937 @section Encoding names
 938 @findex encoding name
 939 @findex encoding alias
 940 @findex normalized name
 941 @*
 942 Each encoding has one @dfn{name} and a number of @dfn{aliases}. When
 943 a user works with the iconv library (i.e., when the @code{iconv_open} call
 944 is used), either the name or any of the aliases may be used. The same applies when encoding
 945 names are used in configure script options.
 946
 947 @*
 948 Names and aliases may be specified in any case (small or capital
 949 letters) and the @kbd{-} symbol is equivalent to the @kbd{_} symbol.
 950
 951 @*
 952 Internally the Newlib iconv library always converts aliases to names. It
 953 also converts names and aliases to the @dfn{normalized} form, which means
 954 that all capital letters are converted to small letters and the @kbd{-}
 955 symbols are converted to @kbd{_} symbols.
 956
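@*
The normalization rule described above is simple enough to sketch in a few lines of C. The function name @code{normalize_encoding_name} is hypothetical; the sketch covers only the case and hyphen rules and does not perform the alias-to-name lookup.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical sketch: copy `name` into `out` (at most `outsz` bytes,
   including the terminating NUL), converting capital letters to small
   letters and '-' to '_'. */
void normalize_encoding_name(const char *name, char *out, size_t outsz)
{
    size_t i = 0;

    if (outsz == 0)
        return;
    for (; name[i] != '\0' && i + 1 < outsz; i++) {
        unsigned char c = (unsigned char)name[i];
        out[i] = (c == '-') ? '_' : (char)tolower(c);
    }
    out[i] = '\0';
}
```

With this rule, both "UTF-8" and "utf_8" normalize to "utf_8", so either spelling selects the same encoding.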
 957
 958
 959
 960 @page
 961 @node CCS tables
 962 @section CCS tables
 963 @findex Size-optimized CCS table
 964 @findex Speed-optimized CCS table
 965 @findex mktbl.pl Perl script
 966 @findex .cct files
 967 @findex The CCT tables source files
 968 @findex CCS source files
 969 @*
 970 The iconv library stores files with CCS tables in the @emph{ccs/}
 971 subdirectory. The CCS tables for any CCS may be kept in two forms: in the binary form
 972 (@dfn{.cct files}, see the @emph{ccs/binary/} subdirectory) and in the form
 973 of compilable .c source files. The .cct files are only used when the
 974 @option{--enable-newlib-iconv-external-ccs} configure script option is enabled.
 975 The .c files are linked to the Newlib library if the corresponding
 976 encoding is enabled.
 977
 978 @*
 979 As stated earlier, the Newlib iconv library performs all
 980 conversions through the 32-bit UCS, but the codes which are used
 981 in most CCS-es, fit into the first 16-bit subset of the 32-bit UCS set.
 982 Thus, in order to make the CCS tables more compact, the 16-bit UCS-2 is
 983 used instead of the 32-bit UCS-4.
 984
 985 @*
 986 CCS tables may be 8- or 16-bit wide. 8-bit CCS tables map 8-bit CCS to
 987 16-bit UCS-2 and vice versa while 16-bit CCS tables map
 988 16-bit CCS to 16-bit UCS-2 and vice versa.
 989 8-bit tables are small (in size) while 16-bit tables may be quite big.
 990 Because of this, 16-bit CCS tables may be
 991 either speed- or size-optimized. Size-optimized CCS tables are
 992 smaller than speed-optimized ones, but the conversion process is
 993 slower if the size-optimized CCS tables are used. 8-bit CCS tables have only
 994 a speed-optimized variant.
 995
 996 Each CCS table (both speed- and size-optimized) consists of
 997 @dfn{from_ucs} and @dfn{to_ucs} subtables. The "from_ucs" subtable maps
 998 UCS-2 codes to CCS codes, while the "to_ucs" subtable maps CCS codes to
 999 UCS-2 codes.
1000
1001 @*
1002 Almost all 16-bit CCS tables contain fewer than 0xFFFF codes and
1003 a lot of gaps exist.
1004
1005 @subsection Speed-optimized tables format
1006 @*
1007 In the case of 8-bit speed-optimized CCS tables, the "to_ucs" subtable format is
1008 trivial - it is just an array of 256 16-bit UCS codes. Therefore, a
1009 UCS-2 code @emph{Y} corresponding to a CCS code @emph{X} is calculated
1010 as @emph{Y = to_ucs[X]}.
1011
1012 @*
1013 Obviously, the simplest way to create the "from_ucs" table or the
1014 16-bit "to_ucs" table is to use a huge 16-bit array, as in the case
1015 of the 8-bit "to_ucs" table. But almost all the 16-bit CCS tables contain
1016 fewer than 0xFFFF code maps, and this fact may be exploited to reduce
1017 the size of the CCS tables.
1018
1019 @*
1020 This section describes the "UCS-2 -> CCS" 8-bit CCS table format. The
1021 16-bit "CCS -> UCS-2" CCS table format is the same, except for the mapping
1022 direction and the number of CCS bits.
1023
1024 @*
1025 In the case of the 8-bit speed-optimized table, the "from_ucs" subtable
1026 corresponds to the "from_ucs" array and has the following layout:
1027
1028 @*
1029 from_ucs array:
1030 @*
1031 -------------------------------------
1032 @*
1033 0xFF mapping (2 bytes) (only for
1034 8-bit table).
1035 @*
1036 -------------------------------------
1037 @*
1038 Heading block
1039 @*
1040 -------------------------------------
1041 @*
1042 Block 1
1043 @*
1044 -------------------------------------
1045 @*
1046 Block 2
1047 @*
1048 -------------------------------------
1049 @*
1050   ...
1051 @*
1052 -------------------------------------
1053 @*
1054 Block N
1055 @*
1056 -------------------------------------
1057
1058 @*
1059 The 0x0000-0xFFFF 16-bit code range is divided into 256 code subranges. Each
1060 subrange is represented by a 256-element @dfn{block} (256 1-byte
1061 elements or 256 2-byte elements in case of a 16-bit CCS table) with
1062 elements which are equivalent to the CCS codes of this subrange.
1063 If the "UCS-2 -> CCS" mapping has big enough gaps, some blocks will be
1064 absent and there will be fewer than 256 blocks.
1065
1066 @*
1067 Any element number @emph{m} of @dfn{the heading block} (which contains
1068 256 2-byte elements) corresponds to the @emph{m}-th 256-element subrange.
1069 If the subrange contains some codes, the @emph{m}-th element of
1070 the heading block contains the offset of the corresponding block in the
1071 "from_ucs" array. If there are no codes in the subrange, the heading
1072 block element contains 0xFFFF.
1073
1074 @*
1075 If there are some gaps in a block, the corresponding block elements have
1076 the 0xFF value. If there is a 0xFF code present in the CCS, its mapping
1077 is defined in the first 2-byte element of the "from_ucs" array.
1078
1079 @*
1080 Given such a table format, the algorithm for finding the CCS code
1081 @emph{X} which corresponds to the UCS-2 code @emph{Y} is as follows.
1082
1083 @*
1084 @enumerate
1085 @item If @emph{Y} is equal to the value of the first 2-byte element
1086 of the "from_ucs" array, @emph{X} is 0xFF. Otherwise, continue to search.
1087
1088 @item Calculate the block number: @emph{BlkN = (Y & 0xFF00) >> 8}.
1089
1090 @item If the heading block element with number @emph{BlkN} is 0xFFFF, there
1091 is no corresponding CCS code (error, wrong input data). Else, fetch the
1092 "flom_ucs" array index of the @emph{BlkN}-th block.
1093
1094 @item Calculate the offset of the @emph{X} code in its block:
1095 @emph{Xindex = Y & 0xFF}
1096
1097 @item If the @emph{Xindex}-th element of the block (which is equivalent to
1098 @emph{from_ucs[BlkN+Xindex]}) value is 0xFF, there is no corresponding
1099 CCS code (error, wrong input data). Else, @emph{X = from_ucs[BlkN+Xindex]}.
1100 @end enumerate
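
@*
The lookup steps above can be sketched in C for the 8-bit case. This is
only an illustration under stated assumptions, not the code used by the
library: the structure below is a simplified in-memory model rather than
the real array layout, block offsets are taken relative to a separate
block area, and all names (@code{from_ucs_8bit}, @code{from_ucs_lookup},
@code{LOOKUP_FAIL}) are hypothetical.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_BLOCK    0xFFFFu /* heading element: the subrange has no codes */
#define NO_CODE     0xFFu   /* block element: gap in an 8-bit table       */
#define LOOKUP_FAIL (-1)    /* hypothetical error value for this sketch   */

/* Simplified model of an 8-bit "from_ucs" table (an assumption). */
struct from_ucs_8bit {
    uint16_t ff_mapping;    /* UCS-2 code which maps to CCS code 0xFF  */
    uint16_t heading[256];  /* offset of each block, or NO_BLOCK       */
    const uint8_t *blocks;  /* concatenated 256-element blocks         */
};

/* Return the CCS code for the UCS-2 code y, or LOOKUP_FAIL. */
static int from_ucs_lookup(const struct from_ucs_8bit *t, uint16_t y)
{
    assert(t != NULL && t->blocks != NULL);

    if (y == t->ff_mapping)                    /* step 1 */
        return 0xFF;

    unsigned blk_n = (y & 0xFF00u) >> 8;       /* step 2 */

    uint16_t blk_idx = t->heading[blk_n];      /* step 3 */
    if (blk_idx == NO_BLOCK)
        return LOOKUP_FAIL;

    unsigned x_index = y & 0xFFu;              /* step 4 */

    uint8_t x = t->blocks[blk_idx + x_index];  /* step 5 */
    return x == NO_CODE ? LOOKUP_FAIL : x;
}
```
@end verbatim

@*
The whole lookup costs two array reads and no searching, which is why this
format is called speed-optimized; the price is one full 256-element block
for every subrange that holds at least one code.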
1101
1102 @subsection Size-optimized tables format
1103 @*
As stated above, size-optimized tables exist only for 16-bit CCS-es.
This is because the difference between the speed-optimized
and the size-optimized table sizes is too small in the case of 8-bit CCS-es.
1107
1108 @*
The formats of the "to_ucs" and "from_ucs" subtables are the same in the
case of size-optimized tables.

This section describes the format of the "UCS-2 -> CCS" size-optimized
CCS table. The format of the "CCS -> UCS-2" table is the same.
1114
The idea of the size-optimized tables is to split the UCS-2 codes
("from" codes) into @dfn{ranges} (a @dfn{range} is a run of consecutive UCS-2 codes).
Then CCS codes ("to" codes) are stored only for the codes from these
ranges. Distinct "from" codes which belong to no range (@dfn{unranged codes}) are stored
together with the corresponding "to" codes.
1120
1121 @*
1122 The following is the layout of the size-optimized table array:
1123
1124 @*
1125 size_arr array:
1126 @*
1127 -------------------------------------
1128 @*
1129 Ranges number (2 bytes)
1130 @*
1131 -------------------------------------
1132 @*
1133 Unranged codes number (2 bytes)
1134 @*
1135 -------------------------------------
1136 @*
1137 Unranged codes array index (2 bytes)
1138 @*
1139 -------------------------------------
1140 @*
1141 Ranges indexes (triads)
1142 @*
1143 -------------------------------------
1144 @*
1145 Ranges
1146 @*
1147 -------------------------------------
1148 @*
1149 Unranged codes array
1150 @*
1151 -------------------------------------
1152
1153 @*
The @dfn{Ranges indexes} section of @emph{size_arr} helps to find
the offset of the needed range in @emph{size_arr} and consists of
triads in the following format:
@*
the first code in range, the last code in range, range offset.
1159
1160 @*
The array of these triads is sorted by the first element, therefore it is
possible to quickly find the needed range index.
1163
1164 @*
1165 Each range has the corresponding sub-array containing the "to" codes. These
1166 sub-arrays are stored in the place marked as "Ranges" in the layout
1167 diagram.
1168
1169 @*
The "Unranged codes array" contains pairs ("from" code, "to" code) for
each unranged code. The array of these pairs is sorted by "from" code
values, therefore it is possible to find the needed pair quickly.
1173
1174 @*
Note that each range requires 6 bytes to form its index. If, for
1176 example, there are two ranges (1 - 5 and 9 - 10), and one unranged code
1177 (7), 12 bytes are needed for two range indexes and 4 bytes for the unranged
1178 code (total 16). But it is better to join both ranges as 1 - 10 and
1179 mark codes 6 and 8 as absent. In this case, only 6 additional bytes for the
1180 range index and 4 bytes to mark codes 6 and 8 as absent are needed
1181 (total 10 bytes). This optimization is done in the size-optimized tables.
1182 Thus, ranges may contain small gaps. The absent codes in ranges are marked
1183 as 0xFFFF.
1184
1185 @*
Note that a pair of consecutive "from" codes is stored as two unranged codes,
since forming a range for them takes more space than storing two unranged
codes (five 2-byte elements against four: a three-element index triad plus
two "to" codes, versus two pairs of two elements each).
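
@*
The size trade-offs in the two notes above can be checked with a little
arithmetic. The sketch below is an illustration only; the helper names
@code{range_bytes} and @code{unranged_bytes} are hypothetical. Unlike the
overhead-only figures quoted in the text, these helpers count every stored
byte, including the "to" codes that live inside ranges.

@verbatim
```c
#include <assert.h>
#include <stddef.h>

/* Bytes used by one range: a 6-byte index triad (first, last, offset)
   plus one 2-byte "to" code for every slot, gaps included. */
static size_t range_bytes(unsigned first, unsigned last)
{
    assert(first <= last);
    return 6 + 2 * (size_t)(last - first + 1);
}

/* Bytes used by n unranged codes: one ("from", "to") pair of
   2-byte codes for each of them. */
static size_t unranged_bytes(size_t n)
{
    return 4 * n;
}
```
@end verbatim

@*
With these helpers, the ranges 1 - 5 and 9 - 10 plus the unranged code 7
take 16 + 10 + 4 = 30 bytes, while the single merged range 1 - 10 takes
26 bytes. A range of just two codes takes 10 bytes against 8 bytes for two
unranged pairs, that is, 5 against 4 two-byte elements.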
1189
1190 @*
The algorithm for finding the CCS code
@emph{X} which corresponds to the UCS-2 code @emph{Y} (input) in the "UCS-2 ->
CCS" size-optimized table is as follows.
1194
1195 @*
1196 @enumerate
@item Try to find the corresponding triad in the "Ranges indexes"
section. Since we are searching in a sorted array, we can do it quickly
with a binary search (divide by 2, compare, etc).

@item If the triad is found, fetch the @emph{X} code from the corresponding
range array. If it is 0xFFFF, return an error.

@item If there is no corresponding triad, search for the @emph{X} code among the
sorted unranged codes. Return an error if nothing was found.
1206 @end enumerate
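
@*
The three steps above can be sketched in C as two binary searches over a
flat array of 2-byte elements laid out as in the diagram. This is a sketch
under assumptions, not the library's code: the range offsets and the
unranged codes array index are assumed to be counted in 2-byte elements
from the start of the array, ranges are assumed not to overlap, and the
names @code{size_tbl_lookup} and @code{LOOKUP_FAIL} are hypothetical.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ABSENT      0xFFFFu /* gap marker inside a range                */
#define LOOKUP_FAIL (-1L)   /* hypothetical error value for this sketch */

/* Return the "to" code for the "from" code y, or LOOKUP_FAIL. */
static long size_tbl_lookup(const uint16_t *size_arr, uint16_t y)
{
    assert(size_arr != NULL);

    size_t n_ranges   = size_arr[0];      /* ranges number              */
    size_t n_unranged = size_arr[1];      /* unranged codes number      */
    size_t unranged_i = size_arr[2];      /* unranged codes array index */
    const uint16_t *triads = size_arr + 3;

    /* Step 1: binary search of the triads, sorted by first code. */
    size_t lo = 0, hi = n_ranges;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const uint16_t *t = triads + 3 * mid; /* first, last, offset */
        if (y < t[0]) {
            hi = mid;
        } else if (y > t[1]) {
            lo = mid + 1;
        } else {
            /* Step 2: fetch the code from the range sub-array. */
            uint16_t x = size_arr[t[2] + (y - t[0])];
            return x == ABSENT ? LOOKUP_FAIL : (long)x;
        }
    }

    /* Step 3: binary search of the sorted ("from", "to") pairs. */
    const uint16_t *pairs = size_arr + unranged_i;
    lo = 0;
    hi = n_unranged;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        uint16_t from = pairs[2 * mid];
        if (y < from)
            hi = mid;
        else if (y > from)
            lo = mid + 1;
        else
            return pairs[2 * mid + 1];
    }
    return LOOKUP_FAIL;
}
```
@end verbatim

@*
Both searches are logarithmic in the number of ranges and unranged codes,
so a lookup is slower than in the speed-optimized format but the table
stores no full blocks for sparsely populated subranges.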
1207
@subsection .cct and .c CCS Table files
1209 @*
1210 The .c source files for 8-bit CCS tables have "to_ucs" and "from_ucs"
1211 speed-optimized tables. The .c source files for 16-bit CCS tables have
1212 "to_ucs_speed", "to_ucs_size", "from_ucs_speed" and "from_ucs_size"
1213 tables.
1214
1215 @*
When .c files are compiled and used, all the 16-bit and 32-bit values
have the native endian format (Big Endian for the BE systems and Little
Endian for the LE systems) since they are compiled for the system before
they are used.
1220
1221 @*
In the case of .cct files, which are intended for dynamic loading of CCS
tables, the CCS tables are stored in either LE or BE format. Since the
.cct files are generated by the 'mktbl.pl' Perl script, it is possible
to choose the endianness of the tables. It is also possible to store two
copies (both LE and BE) of the CCS tables in one .cct file. The default
.cct files (which come with the Newlib sources) have both LE and BE CCS
tables. The Newlib iconv library automatically chooses the needed CCS tables
(with the appropriate endianness).
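
@*
To see why the byte order of a loaded table matters, consider how a
16-bit table element is decoded from raw file bytes. The following C
fragment is a generic illustration, not the selection logic used by the
Newlib iconv library (which is not described here); the function names
@code{read_u16} and @code{host_is_big_endian} are hypothetical.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode the 16-bit value stored at p in the given byte order. */
static uint16_t read_u16(const unsigned char *p, int big_endian)
{
    assert(p != NULL);
    return big_endian ? (uint16_t)((p[0] << 8) | p[1])
                      : (uint16_t)((p[1] << 8) | p[0]);
}

/* Return nonzero when the host stores the most significant byte first. */
static int host_is_big_endian(void)
{
    const uint16_t probe = 0x0102;
    return *(const unsigned char *)&probe == 0x01;
}
```
@end verbatim

@*
A table copy whose byte order matches the host can be used in place as an
array of 16-bit values, while the other copy would need every element to
be swapped first. Shipping both copies in one .cct file avoids that work
at the cost of a larger file.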
1230
1231 @*
Note that the .cct files are only used when the
@option{--enable-newlib-iconv-external-ccs} configure option is given.
1234
1235 @subsection The 'mktbl.pl' Perl script
1236 @*
1237 The 'mktbl.pl' script is intended to generate .cct and .c CCS table
1238 files from the @dfn{CCS source files}.
1239
1240 @*
The CCS source files are just text files which have one or more columns
with the CCS <-> UCS-2 code mapping. For an example of a CCS
source file, see one of the files at the URLs given below.
1244
1245 @*
1246 The following table describes where the source files for CCS table files
1247 provided by the Newlib distribution are located.
1248
1249 @multitable @columnfractions .25 .75
1250 @item
1251 Name
1252 @tab
1253 URL
1254
1255 @item
1256 @tab
1257
1258 @item
1259 big5
1260 @tab
1261 http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT
1262
1263 @item
1264 cns11643_plane1
1265 cns11643_plane14
1266 cns11643_plane2
1267 @tab
1268 http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/CNS11643.TXT
1269
1270 @item
1271 cp775
1272 cp850
1273 cp852
1274 cp855
1275 cp866
1276 @tab
1277 http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/
1278
1279 @item
1280 iso_8859_1
1281 iso_8859_2
1282 iso_8859_3
1283 iso_8859_4
1284 iso_8859_5
1285 iso_8859_6
1286 iso_8859_7
1287 iso_8859_8
1288 iso_8859_9
1289 iso_8859_10
1290 iso_8859_11
1291 iso_8859_13
1292 iso_8859_14
1293 iso_8859_15
1294 @tab
1295 http://www.unicode.org/Public/MAPPINGS/ISO8859/
1296
1297 @item
1298 iso_ir_111
1299 @tab
1300 http://crl.nmsu.edu/~mleisher/csets/ISOIR111.TXT
1301
1302 @item
1303 jis_x0201_1976
1304 jis_x0208_1990
1305 jis_x0212_1990
1306 @tab
1307 http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0201.TXT
1308
1309 @item
1310 koi8_r
1311 @tab
1312 http://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8-R.TXT
1313
1314 @item
1315 koi8_ru
1316 @tab
1317 http://crl.nmsu.edu/~mleisher/csets/KOI8RU.TXT
1318
1319 @item
1320 koi8_u
1321 @tab
1322 http://crl.nmsu.edu/~mleisher/csets/KOI8U.TXT
1323
1324 @item
1325 koi8_uni
1326 @tab
1327 http://crl.nmsu.edu/~mleisher/csets/KOI8UNI.TXT
1328
1329 @item
1330 ksx1001
1331 @tab
1332 http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSX1001.TXT
1333
1334 @item
1335 win_1250
1336 win_1251
1337 win_1252
1338 win_1253
1339 win_1254
1340 win_1255
1341 win_1256
1342 win_1257
1343 win_1258
1344 @tab
1345 http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/
1346 @end multitable
1347
The CCS source files aren't distributed with Newlib because of license
restrictions on most of the Unicode.org files.
1350
The following are the 'mktbl.pl' options which were used to generate the .cct
files. Note that to generate CCS table source files, the @option{-s} option
should be added.
1354
1355 @enumerate
1356 @item For the iso_8859_10.cct, iso_8859_13.cct, iso_8859_14.cct, iso_8859_15.cct,
1357 iso_8859_1.cct, iso_8859_2.cct, iso_8859_3.cct, iso_8859_4.cct,
1358 iso_8859_5.cct, iso_8859_6.cct, iso_8859_7.cct, iso_8859_8.cct,
iso_8859_9.cct, iso_8859_11.cct, win_1250.cct, win_1252.cct, win_1254.cct,
1360 win_1256.cct, win_1258.cct, win_1251.cct,
1361 win_1253.cct, win_1255.cct, win_1257.cct,
1362 koi8_r.cct, koi8_ru.cct, koi8_u.cct, koi8_uni.cct, iso_ir_111.cct,
1363 big5.cct, cp775.cct, cp850.cct, cp852.cct, cp855.cct, cp866.cct, cns11643.cct
files, only the @option{-i <SRC_FILE_NAME>} option was used.
1365
1366 @item To generate the jis_x0208_1990.cct file, the
1367 @option{-i jis_x0208_1990.txt -x 2 -y 3} options were used.
1368
1369 @item To generate the cns11643_plane1.cct file, the
1370 @option{-i cns11643.txt -p1 -N cns11643_plane1  -o cns11643_plane1.cct}
1371 options were used.
1372
1373 @item To generate the cns11643_plane2.cct file, the
1374 @option{-i cns11643.txt -p2 -N cns11643_plane2  -o cns11643_plane2.cct}
1375 options were used.
1376
1377 @item To generate the cns11643_plane14.cct file, the
1378 @option{-i cns11643.txt -p0xE -N cns11643_plane14  -o cns11643_plane14.cct}
1379 options were used.
1380 @end enumerate
1381
1382 @*
1383 For more info about the 'mktbl.pl' options, see the 'mktbl.pl -h' output.
1384
1385 @*
It is assumed that CCS codes are 16 bits wide or less. If there are wider CCS codes
in the CCS source file, the bits above the low 16 define the plane (see the
cns11643.txt CCS source file).
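
@*
The split of a wide CCS code into a plane number and a 16-bit code can be
sketched in C as below. This only illustrates the bit arithmetic described
above; the function name @code{split_wide_ccs_code} is hypothetical and
the sample value in the note that follows is made up for illustration.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split a CCS code wider than 16 bits: the bits above the low 16
   select the plane, the low 16 bits are the code within that plane. */
static void split_wide_ccs_code(uint32_t wide, unsigned *plane,
                                uint16_t *code)
{
    assert(plane != NULL && code != NULL);
    *plane = (unsigned)(wide >> 16);
    *code  = (uint16_t)(wide & 0xFFFFu);
}
```
@end verbatim

@*
For example, the wide code 0xE2121 splits into plane 0xE (14) and the
16-bit code 0x2121; a code that already fits into 16 bits ends up in
plane 0.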
1389
1390 @*
Sometimes it is impossible to map some CCS codes to 16-bit UCS-2 codes, for example when
several different CCS codes are mapped to one UCS-2 code or one CCS code is mapped to
a pair of UCS-2 codes. In these cases, such CCS codes (@dfn{lost
codes}) aren't just rejected; instead, they are mapped to the default
UCS-2 code (which is currently the @kbd{?} character's code).
1396
1397
1398
1399
1400
1401 @page
1402 @node CES converters
1403 @section CES converters
1404 @findex PCS
1405 @*
1406 Similar to the CCS tables, CES converters are also split into "from UCS"
1407 and "to UCS" parts. Depending on the iconv library configuration, these
1408 parts are enabled or disabled.
1409
1410 @*
The following is the list of CES converters which are currently present
1412 in the Newlib iconv library.
1413
1414 @itemize @bullet
1415 @item
1416 @emph{euc} - supports the @emph{euc_jp}, @emph{euc_kr} and @emph{euc_tw}
1417 encodings. The @emph{euc} CES converter uses the @emph{table} and the
1418 @emph{us_ascii} CES converters.
1419
1420 @item
@emph{table} - this CES converter corresponds to "null" and just performs
table-based conversion using 8- and 16-bit CCS tables. This converter
is also used by any other CES converter which needs CCS table-based
conversions. The @emph{table} converter is also responsible for loading
.cct files.
1426
1427 @item
1428 @emph{table_pcs} - this is the wrapper over the @emph{table} converter
1429 which is intended for 16-bit encodings which also use the @dfn{Portable
1430 Character Set} (@dfn{PCS}) which is the same as the @emph{US-ASCII}.
This means that if the first byte of the CCS code is in the range [0x00-0x7f],
it is a 7-bit PCS code. Otherwise, it starts a 16-bit CCS code. Of course,
the first byte of a 16-bit code must not be in the range [0x00-0x7f].
1434 The @emph{big5} encoding uses the @emph{table_pcs} CES converter and the
1435 @emph{table_pcs} CES converter depends on the @emph{table} CES converter.
1436
1437 @item
1438 @emph{ucs_2} - intended for the @emph{ucs_2}, @emph{ucs_2be} and
1439 @emph{ucs_2le} encodings support.
1440
1441 @item
1442 @emph{ucs_4} - intended for the @emph{ucs_4}, @emph{ucs_4be} and
1443 @emph{ucs_4le} encodings support.
1444
1445 @item
1446 @emph{ucs_2_internal} - intended for the @emph{ucs_2_internal} encoding support.
1447
1448 @item
1449 @emph{ucs_4_internal} - intended for the @emph{ucs_4_internal} encoding support.
1450
1451 @item
1452 @emph{us_ascii} - intended for the @emph{us_ascii} encoding support. In
1453 principle, the most natural way to support the @emph{us_ascii} encoding
1454 is to define the @emph{us_ascii} CCS and use the @emph{table} CES
converter. But for optimization purposes, the specialized
1456 @emph{us_ascii} CES converter was created.
1457
1458 @item
1459 @emph{utf_16} - intended for the @emph{utf_16}, @emph{utf_16be} and
1460 @emph{utf_16le} encodings support.
1461
1462 @item
1463 @emph{utf_8} - intended for the @emph{utf_8} encoding support.
1464 @end itemize
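
@*
The first-byte rule described for the @emph{table_pcs} converter in the
list above can be sketched in C as follows. This is an illustration under
assumptions rather than the converter's actual code: the function name
@code{table_pcs_next} is hypothetical, and the 16-bit code is assumed to
be formed with the first byte as the high byte.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read the next character from in[0..avail-1]. Stores its code in
   *code and returns the number of bytes consumed: 1 for a PCS byte,
   2 for a 16-bit CCS code, 0 if the input is empty or truncated. */
static size_t table_pcs_next(const unsigned char *in, size_t avail,
                             uint16_t *code)
{
    assert(in != NULL && code != NULL);

    if (avail == 0)
        return 0;

    if (in[0] <= 0x7F) {            /* first byte in [0x00-0x7f]: PCS */
        *code = in[0];
        return 1;
    }

    if (avail < 2)                  /* second byte is missing */
        return 0;

    *code = (uint16_t)((in[0] << 8) | in[1]);
    return 2;
}
```
@end verbatim

@*
Only the first byte decides between the two cases, so the wrapper can
handle PCS characters itself and pass every 16-bit code on to the
@emph{table} converter.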
1465
1466
1467
1468
1469
1470 @page
1471 @node The encodings description file
1472 @section The encodings description file
1473 @findex encoding.deps description file
1474 @findex mkdeps.pl Perl script
1475 @*
To simplify the process of adding support for new encodings, a lot of the
"glue" files are generated automatically.
1478
1479 @*
1480 There is the 'encoding.deps' file in the @emph{lib/} subdirectory which
1481 is used to describe encoding's properties. The 'mkdeps.pl' Perl script
uses 'encoding.deps' to generate the "glue" files.
1483
1484 @*
1485 The 'encoding.deps' file is composed of sections, each section consists
1486 of entries, each entry contains some encoding/CES/CCS description.
1487
1488 @*
1489 The 'encoding.deps' file's syntax is very simple. Currently only two
1490 sections are defined: @emph{ENCODINGS} and @emph{CES_DEPENDENCIES}.
1491
1492 @*
1493 Each @emph{ENCODINGS} section's entry describes one encoding and
1494 contains the following information.
1495
1496 @itemize @bullet
1497 @item
1498 Encoding name (the @emph{ENCODING} field). The name should
1499 be unique and only one name is possible.
1500
1501 @item
1502 The encoding's CES converter name (the @emph{CES} field). Only one CES
1503 converter is allowed.
1504
1505 @item
1506 The whitespace-separated list of CCS table names which are used by the
1507 encoding (the @emph{CCS} field).
1508
1509 @item
The whitespace-separated list of alias names (the @emph{ALIASES}
field).
1512 @end itemize
1513
1514 @*
Note that all names in the 'encoding.deps' file must be in normalized
form.
1517
1518 @*
Each @emph{CES_DEPENDENCIES} section's entry describes the dependencies of
one CES converter. For example, the @emph{euc} CES converter depends on
the @emph{table} and the @emph{us_ascii} CES converters since the
@emph{euc} CES converter uses them. This means that both the @emph{table}
and @emph{us_ascii} CES converters should be linked if the @emph{euc}
CES converter is enabled.
1525
1526 @*
1527 The @emph{CES_DEPENDENCIES} section defines the following:
1528
1529 @itemize @bullet
1530 @item
1531 the CES converter name for which the dependencies are defined in this
1532 entry (the @emph{CES} field);
1533
1534 @item
1535 the whitespace-separated list of CES converters which are needed for
1536 this CES converter (the @emph{USED_CES} field).
1537 @end itemize
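
@*
Resolving such dependencies amounts to computing a closure over the
@emph{USED_CES} lists. The C sketch below shows that idea at run time
using the dependencies named in this chapter (@emph{euc} uses
@emph{table} and @emph{us_ascii}, @emph{table_pcs} uses @emph{table}).
It is only an illustration: the generated "glue" files are header and
source files rather than a routine like this, and the names
@code{ces_used}, @code{ces_closure} and @code{CES_BIT} are hypothetical.

@verbatim
```c
#include <assert.h>

/* Some CES converters from this chapter; bit i of a mask is set when
   converter i is enabled. */
enum ces { CES_EUC, CES_TABLE, CES_TABLE_PCS, CES_US_ASCII, CES_COUNT };

#define CES_BIT(c) (1u << (c))

/* Direct dependencies of each converter (its USED_CES list). */
static const unsigned ces_used[CES_COUNT] = {
    [CES_EUC]       = CES_BIT(CES_TABLE) | CES_BIT(CES_US_ASCII),
    [CES_TABLE_PCS] = CES_BIT(CES_TABLE),
};

/* Return the enabled set extended with everything it depends on. */
static unsigned ces_closure(unsigned enabled)
{
    assert(enabled < CES_BIT(CES_COUNT));

    unsigned before;
    do {
        before = enabled;
        for (int c = 0; c < CES_COUNT; c++)
            if (enabled & CES_BIT(c))
                enabled |= ces_used[c];
    } while (enabled != before);    /* repeat until nothing is added */

    return enabled;
}
```
@end verbatim

@*
Enabling only @emph{euc} therefore yields the set @emph{euc},
@emph{table} and @emph{us_ascii}, which matches the linking requirement
described above.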
1538
1539 @*
The 'mkdeps.pl' Perl script automatically solves the following tasks.
1541
1542 @itemize @bullet
1543 @item
The user works with the iconv library in terms of encodings and doesn't need to know
anything about CES converters and CCS tables. The script automatically
generates code which enables all the CES converters and CCS tables needed
for the encodings which were enabled by the user.
1548
1549 @item
1550 The CES converters may have dependencies and the script automatically
1551 generates the code which handles these dependencies.
1552
1553 @item
1554 The list of encoding's aliases is also automatically generated.
1555
1556 @item
1557 The script uses a lot of macros in order to enable only the minimum set
1558 of code/data which is needed to support the requested encodings in the
1559 requested directions.
1560 @end itemize
1561
1562 @*
The 'mkdeps.pl' Perl script interprets the 'encoding.deps'
file and generates the following files.
1565
1566 @itemize @bullet
1567 @item
@emph{lib/encnames.h} - this header file contains macro definitions for all
encoding names.
1570
1571 @item
1572 @emph{lib/aliasesbi.c} - the array of encoding names and aliases. The array
is used to find the name of the requested encoding by its alias.
1574
1575 @item
@emph{ces/cesbi.c} - this file defines two arrays
(@code{_iconv_from_ucs_ces} and @code{_iconv_to_ucs_ces}) which contain
descriptions of the enabled "from UCS" and "to UCS" CES converters and the
names of the encodings which are supported by these CES converters.
1580
1581 @item
1582 @emph{ces/cesbi.h} - this file contains the set of macros which defines
1583 the set of CES converters which should be enabled if only the set of
1584 enabled encodings is given (through macros defined in the
1585 @emph{newlib.h} file). Note, that one CES converter may handle several
1586 encodings.
1587
1588 @item
1589 @emph{ces/cesdeps.h} - the CES converters dependencies are handled in
1590 this file.
1591
1592 @item
1593 @emph{ccs/ccsdeps.h} - the array of linked-in CCS tables is defined
1594 here.
1595
1596 @item
@emph{ccs/ccsnames.h} - this header file contains macro definitions for all
CCS names.
1599
1600 @item
1601 @emph{encoding.aliases} - the list of supported encodings and their
1602 aliases which is intended for the Newlib configure scripts in order to
1603 handle the iconv-related configure script options.
1604 @end itemize
1605
1606
1607
1608
1609
1610 @page
1611 @node How to add new encoding
1612 @section How to add new encoding
1613 @*
First, the new encoding should be broken down into its CCS and CES. Then,
the process of adding the new encoding consists of the following steps.
1616
1617 @enumerate
@item Generate the .cct CCS file and the .c source file for the new
encoding's CCS (if it isn't already present). To do this, obtain the CCS source
file and run the 'mktbl.pl' script on it.
1621
1622 @item Write the corresponding CES converter (if it isn't already
1623 present). Use the existing CES converters as an example.
1624
1625 @item
1626 Add the corresponding entries to the 'encoding.deps' file and regenerate
1627 the autogenerated "glue" files using the 'mkdeps.pl' script.
1628
1629 @item
1630 Don't forget to add entries to the newlib/newlib.hin file.
1631
1632 @item
1633 Of course, the 'Makefile.am'-s should also be updated (if new files were
1634 added) and the 'Makefile.in'-s should be regenerated using the correct
1635 version of 'automake'.
1636
1637 @item
1638 Don't forget to update the documentation (the list of
1639 supported encodings and CES converters).
1640 @end enumerate
1641
If a new encoding doesn't fit the CES/CCS decomposition model, or
if specialized (non UCS-based) conversion support is desired,
the Newlib iconv library code itself has to be extended.
1645
1646
1647
1648
1649
1650 @page
1651 @node The locale support interfaces
1652 @section The locale support interfaces
1653 @*
The Newlib iconv library also has some interface functions (besides the
1655 @code{iconv}, @code{iconv_open} and @code{iconv_close} interfaces) which
1656 are intended for the Locale subsystem. All the locale-related code is
1657 placed in the @emph{lib/iconvnls.c} file.
1658
1659 @*
1660 The following is the description of the locale-related interfaces:
1661
1662 @itemize @bullet
1663 @item
1664 @code{_iconv_nls_open} - opens two iconv descriptors for "CCS ->
1665 wchar_t" and "wchar_t -> CCS" conversions. The normalized CCS name is
1666 passed in the function parameters. The @emph{wchar_t} characters encoding is
1667 either ucs_2_internal or ucs_4_internal depending on size of
1668 @emph{wchar_t}.
1669
1670 @item
@code{_iconv_nls_conv} - the function is similar to the @code{iconv}
function, but if there is no character in the output encoding which
1673 corresponds to the character in the input encoding, the default
1674 conversion isn't performed (the @code{iconv} function sets such output
1675 characters to the @kbd{?} symbol and this is the behavior, which is
1676 specified in SUSv3).
1677
1678 @item
1679 @code{_iconv_nls_get_state} - returns the current encoding's shift state
1680 (the @code{mbstate_t} object).
1681
1682 @item
@code{_iconv_nls_set_state} - sets the current encoding's shift state (the
1684 @code{mbstate_t} object).
1685
1686 @item
1687 @code{_iconv_nls_is_stateful} - checks whether the encoding is stateful
1688 or stateless.
1689
1690 @item
@code{_iconv_nls_get_mb_cur_max} - returns the maximum length (the
maximum number of bytes) of the encoding's characters.
1693 @end itemize
1694
1695
1696
1697
1698 @page
1699 @node Contact
1700 @section Contact
1701 @*
1702 The author of the original BSD iconv library (Alexander Chuguev) no longer
1703 supports that code.
1704
1705 @*
1706 Any questions regarding the iconv library may be forwarded to
1707 Artem B. Bityuckiy (dedekind@@oktetlabs.ru or dedekind@@mail.ru) as
1708 well as to the public Newlib mailing list.
1709