doc/src/sgml/charset.sgml

   1 <!-- $PostgreSQL$ -->
   2
   3 <chapter id="charset">
   4  <title>Localization</>
   5
   6  <para>
   7   This chapter describes the available localization features from the
   8   point of view of the administrator.
   9   <productname>PostgreSQL</productname> supports localization with
  10   two approaches:
  11
  12    <itemizedlist>
  13     <listitem>
  14      <para>
  15       Using the locale features of the operating system to provide
  16       locale-specific collation order, number formatting, translated
  17       messages, and other aspects.
  18      </para>
  19     </listitem>
  20
  21     <listitem>
  22      <para>
  23       Providing a number of different character sets to support storing text
  24       in all kinds of languages, and providing character set translation
  25       between client and server.
  26      </para>
  27     </listitem>
  28    </itemizedlist>
  29   </para>
  30
  31
  32  <sect1 id="locale">
  33   <title>Locale Support</title>
  34
  35   <indexterm zone="locale"><primary>locale</></>
  36
  37   <para>
  38    <firstterm>Locale</> support refers to an application respecting
  39    cultural preferences regarding alphabets, sorting, number
  40    formatting, etc.  <productname>PostgreSQL</> uses the standard ISO
  41    C and <acronym>POSIX</acronym> locale facilities provided by the server operating
  42    system.  For additional information refer to the documentation of your
  43    system.
  44   </para>
  45
  46   <sect2>
  47    <title>Overview</>
  48
  49    <para>
  50     Locale support is automatically initialized when a database
  51     cluster is created using <command>initdb</command>.
  52     <command>initdb</command> will initialize the database cluster
  53     with the locale setting of its execution environment by default,
  54     so if your system is already set to use the locale that you want
  55     in your database cluster then there is nothing else you need to
  56     do.  If you want to use a different locale (or you are not sure
  57     which locale your system is set to), you can instruct
  58     <command>initdb</command> exactly which locale to use by
  59     specifying the <option>--locale</option> option. For example:
  60 <screen>
  61 initdb --locale=sv_SE
  62 </screen>
  63    </para>
  64
  65    <para>
  66     This example for Unix systems sets the locale to Swedish
  67     (<literal>sv</>) as spoken
  68     in Sweden (<literal>SE</>).  Other possibilities might be
  69     <literal>en_US</> (U.S. English) and <literal>fr_CA</> (French
  70     Canadian).  If more than one character set can be useful for a
  71     locale then the specifications look like this:
  72     <literal>cs_CZ.ISO8859-2</>. What locales are available under what
  73     names on your system depends on what was provided by the operating
  74     system vendor and what was installed.  On most Unix systems, the command
  75     <literal>locale -a</> will provide a list of available locales.
  76     Windows uses more verbose locale names, such as <literal>German_Germany</>
  77     or <literal>Swedish_Sweden.1252</>, but the principles are the same.
  78    </para>
  79
  80    <para>
  81     Occasionally it is useful to mix rules from several locales, e.g.,
  82     use English collation rules but Spanish messages.  To support that, a
  83     set of locale subcategories exist that control only a certain
  84     aspect of the localization rules:
  85
  86     <informaltable>
  87      <tgroup cols="2">
  88       <tbody>
  89        <row>
  90         <entry><envar>LC_COLLATE</></>
  91         <entry>String sort order</>
  92        </row>
  93        <row>
  94         <entry><envar>LC_CTYPE</></>
  95         <entry>Character classification (What is a letter? Its upper-case equivalent?)</>
  96        </row>
  97        <row>
  98         <entry><envar>LC_MESSAGES</></>
  99         <entry>Language of messages</>
 100        </row>
 101        <row>
 102         <entry><envar>LC_MONETARY</></>
 103         <entry>Formatting of currency amounts</>
 104        </row>
 105        <row>
 106         <entry><envar>LC_NUMERIC</></>
 107         <entry>Formatting of numbers</>
 108        </row>
 109        <row>
 110         <entry><envar>LC_TIME</></>
 111         <entry>Formatting of dates and times</>
 112        </row>
 113       </tbody>
 114      </tgroup>
 115     </informaltable>
 116
 117     The category names translate into names of
 118     <command>initdb</command> options to override the locale choice
 119     for a specific category.  For instance, to set the locale to
 120     French Canadian, but use U.S. rules for formatting currency, use
 121     <literal>initdb --locale=fr_CA --lc-monetary=en_US</literal>.
 122    </para>
 123
 124    <para>
 125     If you want the system to behave as if it had no locale support,
 126     use the special locale <literal>C</> or <literal>POSIX</>.
 127    </para>
 128
 129    <para>
 130     The nature of some locale categories is that their value has to be
 131     fixed when the database is created.  You can use different settings
 132     for different databases, but once a database is created, you cannot
 133     change them for that database anymore. <literal>LC_COLLATE</literal>
 134     and <literal>LC_CTYPE</literal> are these categories.  They affect
 135     the sort order of indexes, so they must be kept fixed, or indexes on
 136     text columns will become corrupt.  The default values for these
 137     categories are determined when <command>initdb</command> is run, and
 138     those values are used when new databases are created, unless
 139     specified otherwise in the <command>CREATE DATABASE</command> command.
 140    </para>
 141
 142    <para>
 143     The other locale categories can be changed whenever desired
 144     by setting the server configuration parameters
 145     that have the same name as the locale categories (see <xref
 146     linkend="runtime-config-client-format"> for details).  The values
 147     that are chosen by <command>initdb</command> are actually only written
 148     into the configuration file <filename>postgresql.conf</filename> to
 149     serve as defaults when the server is started.  If you delete these
 150     assignments from <filename>postgresql.conf</filename> then the
 151     server will inherit the settings from its execution environment.
 152    </para>
 153
 154    <para>
 155     Note that the locale behavior of the server is determined by the
 156     environment variables seen by the server, not by the environment
 157     of any client.  Therefore, be careful to configure the correct locale settings
 158     before starting the server.  A consequence of this is that if
 159     client and server are set up in different locales, messages might
 160     appear in different languages depending on where they originated.
 161    </para>
 162
 163    <note>
 164     <para>
 165      When we speak of inheriting the locale from the execution
 166      environment, this means the following on most operating systems:
 167      For a given locale category, say the collation, the following
 168      environment variables are consulted in this order until one is
 169      found to be set: <envar>LC_ALL</envar>, <envar>LC_COLLATE</envar>
 170      (or the variable corresponding to the respective category),
 171      <envar>LANG</envar>.  If none of these environment variables are
 172      set then the locale defaults to <literal>C</literal>.
 173     </para>
 174
 175     <para>
 176      Some message localization libraries also look at the environment
 177      variable <envar>LANGUAGE</envar> which overrides all other locale
 178      settings for the purpose of setting the language of messages.  If
 179      in doubt, please refer to the documentation of your operating
 180      system, in particular the documentation about
 181      <application>gettext</>, for more information.
 182     </para>
 183    </note>
 184
 185    <para>
 186     To enable messages to be translated to the user's preferred language,
 187     <acronym>NLS</acronym> must have been selected at build time
 188     (<literal>configure --enable-nls</>).  All other locale support is
 189     built in automatically.
 190    </para>
 191   </sect2>
 192
 193   <sect2>
 194    <title>Behavior</>
 195
 196    <para>
 197     The locale settings influence the following SQL features:
 198
 199     <itemizedlist>
 200      <listitem>
 201       <para>
 202        Sort order in queries using <literal>ORDER BY</> or the standard
 203        comparison operators on textual data
 204        <indexterm><primary>ORDER BY</><secondary>and locales</></indexterm>
 205       </para>
 206      </listitem>
 207
 208      <listitem>
 209       <para>
 210        The ability to use indexes with <literal>LIKE</> clauses
 211        <indexterm><primary>LIKE</><secondary>and locales</></indexterm>
 212       </para>
 213      </listitem>
 214
 215      <listitem>
 216       <para>
 217        The <function>upper</>,  <function>lower</>,  and <function>initcap</>
 218        functions
 219        <indexterm><primary>upper</><secondary>and locales</></indexterm>
 220        <indexterm><primary>lower</><secondary>and locales</></indexterm>
 221       </para>
 222      </listitem>
 223
 224      <listitem>
 225       <para>
 226        The <function>to_char</> family of functions
 227        <indexterm><primary>to_char</><secondary>and locales</></indexterm>
 228       </para>
 229      </listitem>
 230     </itemizedlist>
 231    </para>
 232
 233    <para>
 234     The drawback of using locales other than <literal>C</> or
 235     <literal>POSIX</> in <productname>PostgreSQL</> is its performance
 236     impact. It slows character handling and prevents ordinary indexes
 237     from being used by <literal>LIKE</>. For this reason use locales
 238     only if you actually need them.
 239    </para>
 240
 241    <para>
 242     As a workaround to allow <productname>PostgreSQL</> to use indexes
 243     with <literal>LIKE</> clauses under a non-C locale, several custom
 244     operator classes exist. These allow the creation of an index that
 245     performs a strict character-by-character comparison, ignoring
 246     locale comparison rules. Refer to <xref linkend="indexes-opclass">
 247     for more information.
 248    </para>
 249   </sect2>
 250
 251   <sect2>
 252    <title>Problems</>
 253
 254    <para>
 255     If locale support doesn't work according to the explanation above,
 256     check that the locale support in your operating system is
 257     correctly configured.  To check what locales are installed on your
 258     system, you can use the command <literal>locale -a</literal> if
 259     your operating system provides it.
 260    </para>
 261
 262    <para>
 263     Check that <productname>PostgreSQL</> is actually using the locale
 264     that you think it is.  The <envar>LC_COLLATE</> and <envar>LC_CTYPE</>
 265     settings are determined when a database is created, and cannot be
 266     changed except by creating a new database.  Other locale
 267     settings including <envar>LC_MESSAGES</> and <envar>LC_MONETARY</>
 268     are initially determined by the environment the server is started
 269     in, but can be changed on-the-fly.  You can check the active locale
 270     settings using the <command>SHOW</> command.
 271    </para>
 272
 273    <para>
 274     The directory <filename>src/test/locale</> in the source
 275     distribution contains a test suite for
 276     <productname>PostgreSQL</>'s locale support.
 277    </para>
 278
 279    <para>
 280     Client applications that handle server-side errors by parsing the
 281     text of the error message will obviously have problems when the
 282     server's messages are in a different language.  Authors of such
 283     applications are advised to make use of the error code scheme
 284     instead.
 285    </para>
 286
 287    <para>
 288     Maintaining catalogs of message translations requires the on-going
 289     efforts of many volunteers that want to see
 290     <productname>PostgreSQL</> speak their preferred language well.
 291     If messages in your language are currently not available or not fully
 292     translated, your assistance would be appreciated.  If you want to
 293     help, refer to <xref linkend="nls"> or write to the developers'
 294     mailing list.
 295    </para>
 296   </sect2>
 297  </sect1>
 298
 299
 300  <sect1 id="multibyte">
 301   <title>Character Set Support</title>
 302
 303   <indexterm zone="multibyte"><primary>character set</></>
 304
 305   <para>
 306    The character set support in <productname>PostgreSQL</productname>
 307    allows you to store text in a variety of character sets (also called
 308    encodings), including
 309    single-byte character sets such as the ISO 8859 series and
 310    multiple-byte character sets such as <acronym>EUC</> (Extended Unix
 311    Code), UTF-8, and Mule internal code.  All supported character sets
 312    can be used transparently by clients, but a few are not supported
 313    for use within the server (that is, as a server-side encoding).
 314    The default character set is selected while
 315    initializing your <productname>PostgreSQL</productname> database
 316    cluster using <command>initdb</>.  It can be overridden when you
 317    create a database, so you can have multiple
 318    databases each with a different character set.
 319   </para>
 320
 321   <para>
 322    An important restriction, however, is that each database's character set
 323    must be compatible with the database's <envar>LC_CTYPE</> and
 324    <envar>LC_COLLATE</> locale settings. For <literal>C</> or
 325    <literal>POSIX</> locale, any character set is allowed, but for other
 326    locales there is only one character set that will work correctly.
 327    (On Windows, however, UTF-8 encoding can be used with any locale.)
 328   </para>
 329
 330    <sect2 id="multibyte-charset-supported">
 331     <title>Supported Character Sets</title>
 332
 333     <para>
 334      <xref linkend="charset-table"> shows the character sets available
 335      for use in <productname>PostgreSQL</productname>.
 336     </para>
 337
 338      <table id="charset-table">
 339       <title><productname>PostgreSQL</productname> Character Sets</title>
 340       <tgroup cols="6">
 341        <thead>
 342         <row>
 343          <entry>Name</entry>
 344          <entry>Description</entry>
 345          <entry>Language</entry>
 346          <entry>Server?</entry>
 347          <!--
 348           The Bytes/Char field is populated by looking at the values returned
 349           by pg_wchar_table.mblen function for each encoding.
 350          -->
 351          <entry>Bytes/Char</entry>
 352          <entry>Aliases</entry>
 353         </row>
 354        </thead>
 355        <tbody>
 356         <row>
 357          <entry><literal>BIG5</literal></entry>
 358          <entry>Big Five</entry>
 359          <entry>Traditional Chinese</entry>
 360          <entry>No</entry>
 361          <entry>1-2</entry>
 362          <entry><literal>WIN950</>, <literal>Windows950</></entry>
 363         </row>
 364         <row>
 365          <entry><literal>EUC_CN</literal></entry>
 366          <entry>Extended UNIX Code-CN</entry>
 367          <entry>Simplified Chinese</entry>
 368          <entry>Yes</entry>
 369          <entry>1-3</entry>
 370          <entry></entry>
 371         </row>
 372         <row>
 373          <entry><literal>EUC_JP</literal></entry>
 374          <entry>Extended UNIX Code-JP</entry>
 375          <entry>Japanese</entry>
 376          <entry>Yes</entry>
 377          <entry>1-3</entry>
 378          <entry></entry>
 379         </row>
 380         <row>
 381          <entry><literal>EUC_JIS_2004</literal></entry>
 382          <entry>Extended UNIX Code-JP, JIS X 0213</entry>
 383          <entry>Japanese</entry>
 384          <entry>Yes</entry>
 385          <entry>1-3</entry>
 386          <entry></entry>
 387         </row>
 388         <row>
 389          <entry><literal>EUC_KR</literal></entry>
 390          <entry>Extended UNIX Code-KR</entry>
 391          <entry>Korean</entry>
 392          <entry>Yes</entry>
 393          <entry>1-3</entry>
 394          <entry></entry>
 395         </row>
 396         <row>
 397          <entry><literal>EUC_TW</literal></entry>
 398          <entry>Extended UNIX Code-TW</entry>
 399          <entry>Traditional Chinese, Taiwanese</entry>
 400          <entry>Yes</entry>
 401          <entry>1-3</entry>
 402          <entry></entry>
 403         </row>
 404         <row>
 405          <entry><literal>GB18030</literal></entry>
 406          <entry>National Standard</entry>
 407          <entry>Chinese</entry>
 408          <entry>No</entry>
 409          <entry>1-2</entry>
 410          <entry></entry>
 411         </row>
 412         <row>
 413          <entry><literal>GBK</literal></entry>
 414          <entry>Extended National Standard</entry>
 415          <entry>Simplified Chinese</entry>
 416          <entry>No</entry>
 417          <entry>1-2</entry>
 418          <entry><literal>WIN936</>, <literal>Windows936</></entry>
 419         </row>
 420         <row>
 421          <entry><literal>ISO_8859_5</literal></entry>
 422          <entry>ISO 8859-5, <acronym>ECMA</> 113</entry>
 423          <entry>Latin/Cyrillic</entry>
 424          <entry>Yes</entry>
 425          <entry>1</entry>
 426          <entry></entry>
 427         </row>
 428         <row>
 429          <entry><literal>ISO_8859_6</literal></entry>
 430          <entry>ISO 8859-6, <acronym>ECMA</> 114</entry>
 431          <entry>Latin/Arabic</entry>
 432          <entry>Yes</entry>
 433          <entry>1</entry>
 434          <entry></entry>
 435         </row>
 436         <row>
 437          <entry><literal>ISO_8859_7</literal></entry>
 438          <entry>ISO 8859-7, <acronym>ECMA</> 118</entry>
 439          <entry>Latin/Greek</entry>
 440          <entry>Yes</entry>
 441          <entry>1</entry>
 442          <entry></entry>
 443         </row>
 444         <row>
 445          <entry><literal>ISO_8859_8</literal></entry>
 446          <entry>ISO 8859-8, <acronym>ECMA</> 121</entry>
 447          <entry>Latin/Hebrew</entry>
 448          <entry>Yes</entry>
 449          <entry>1</entry>
 450          <entry></entry>
 451         </row>
 452         <row>
 453          <entry><literal>JOHAB</literal></entry>
 454          <entry><acronym>JOHAB</></entry>
 455          <entry>Korean (Hangul)</entry>
 456          <entry>No</entry>
 457          <entry>1-3</entry>
 458          <entry></entry>
 459         </row>
 460         <row>
 461          <entry><literal>KOI8R</literal></entry>
 462          <entry><acronym>KOI</acronym>8-R</entry>
 463          <entry>Cyrillic (Russian)</entry>
 464          <entry>Yes</entry>
 465          <entry>1</entry>
 466          <entry><literal>KOI8</></entry>
 467         </row>
 468         <row>
 469          <entry><literal>KOI8U</literal></entry>
 470          <entry><acronym>KOI</acronym>8-U</entry>
 471          <entry>Cyrillic (Ukrainian)</entry>
 472          <entry>Yes</entry>
 473          <entry>1</entry>
 474          <entry></entry>
 475         </row>
 476         <row>
 477          <entry><literal>LATIN1</literal></entry>
 478          <entry>ISO 8859-1, <acronym>ECMA</> 94</entry>
 479          <entry>Western European</entry>
 480          <entry>Yes</entry>
 481          <entry>1</entry>
 482          <entry><literal>ISO88591</></entry>
 483         </row>
 484         <row>
 485          <entry><literal>LATIN2</literal></entry>
 486          <entry>ISO 8859-2, <acronym>ECMA</> 94</entry>
 487          <entry>Central European</entry>
 488          <entry>Yes</entry>
 489          <entry>1</entry>
 490          <entry><literal>ISO88592</></entry>
 491         </row>
 492         <row>
 493          <entry><literal>LATIN3</literal></entry>
 494          <entry>ISO 8859-3, <acronym>ECMA</> 94</entry>
 495          <entry>South European</entry>
 496          <entry>Yes</entry>
 497          <entry>1</entry>
 498          <entry><literal>ISO88593</></entry>
 499         </row>
 500         <row>
 501          <entry><literal>LATIN4</literal></entry>
 502          <entry>ISO 8859-4, <acronym>ECMA</> 94</entry>
 503          <entry>North European</entry>
 504          <entry>Yes</entry>
 505          <entry>1</entry>
 506          <entry><literal>ISO88594</></entry>
 507         </row>
 508         <row>
 509          <entry><literal>LATIN5</literal></entry>
 510          <entry>ISO 8859-9, <acronym>ECMA</> 128</entry>
 511          <entry>Turkish</entry>
 512          <entry>Yes</entry>
 513          <entry>1</entry>
 514          <entry><literal>ISO88599</></entry>
 515         </row>
 516         <row>
 517          <entry><literal>LATIN6</literal></entry>
 518          <entry>ISO 8859-10, <acronym>ECMA</> 144</entry>
 519          <entry>Nordic</entry>
 520          <entry>Yes</entry>
 521          <entry>1</entry>
 522          <entry><literal>ISO885910</></entry>
 523         </row>
 524         <row>
 525          <entry><literal>LATIN7</literal></entry>
 526          <entry>ISO 8859-13</entry>
 527          <entry>Baltic</entry>
 528          <entry>Yes</entry>
 529          <entry>1</entry>
 530          <entry><literal>ISO885913</></entry>
 531         </row>
 532         <row>
 533          <entry><literal>LATIN8</literal></entry>
 534          <entry>ISO 8859-14</entry>
 535          <entry>Celtic</entry>
 536          <entry>Yes</entry>
 537          <entry>1</entry>
 538          <entry><literal>ISO885914</></entry>
 539         </row>
 540         <row>
 541          <entry><literal>LATIN9</literal></entry>
 542          <entry>ISO 8859-15</entry>
 543          <entry>LATIN1 with Euro and accents</entry>
 544          <entry>Yes</entry>
 545          <entry>1</entry>
 546          <entry>ISO885915</entry>
 547         </row>
 548         <row>
 549          <entry><literal>LATIN10</literal></entry>
 550          <entry>ISO 8859-16, <acronym>ASRO</> SR 14111</entry>
 551          <entry>Romanian</entry>
 552          <entry>Yes</entry>
 553          <entry>1</entry>
 554          <entry><literal>ISO885916</></entry>
 555         </row>
 556         <row>
 557          <entry><literal>MULE_INTERNAL</literal></entry>
 558          <entry>Mule internal code</entry>
 559          <entry>Multilingual Emacs</entry>
 560          <entry>Yes</entry>
 561          <entry>1-4</entry>
 562          <entry></entry>
 563         </row>
 564         <row>
 565          <entry><literal>SJIS</literal></entry>
 566          <entry>Shift JIS</entry>
 567          <entry>Japanese</entry>
 568          <entry>No</entry>
 569          <entry>1-2</entry>
 570          <entry><literal>Mskanji</>, <literal>ShiftJIS</>, <literal>WIN932</>, <literal>Windows932</></entry>
 571         </row>
 572         <row>
 573          <entry><literal>SHIFT_JIS_2004</literal></entry>
 574          <entry>Shift JIS, JIS X 0213</entry>
 575          <entry>Japanese</entry>
 576          <entry>No</entry>
 577          <entry>1-2</entry>
 578          <entry></entry>
 579         </row>
 580         <row>
 581          <entry><literal>SQL_ASCII</literal></entry>
 582          <entry>unspecified (see text)</entry>
 583          <entry><emphasis>any</></entry>
 584          <entry>Yes</entry>
 585          <entry>1</entry>
 586          <entry></entry>
 587         </row>
 588         <row>
 589          <entry><literal>UHC</literal></entry>
 590          <entry>Unified Hangul Code</entry>
 591          <entry>Korean</entry>
 592          <entry>No</entry>
 593          <entry>1-2</entry>
 594          <entry><literal>WIN949</>, <literal>Windows949</></entry>
 595         </row>
 596         <row>
 597          <entry><literal>UTF8</literal></entry>
 598          <entry>Unicode, 8-bit</entry>
 599          <entry><emphasis>all</></entry>
 600          <entry>Yes</entry>
 601          <entry>1-4</entry>
 602          <entry><literal>Unicode</></entry>
 603         </row>
 604         <row>
 605          <entry><literal>WIN866</literal></entry>
 606          <entry>Windows CP866</entry>
 607          <entry>Cyrillic</entry>
 608          <entry>Yes</entry>
 609          <entry>1</entry>
 610          <entry><literal>ALT</></entry>
 611         </row>
 612         <row>
 613          <entry><literal>WIN874</literal></entry>
 614          <entry>Windows CP874</entry>
 615          <entry>Thai</entry>
 616          <entry>Yes</entry>
 617          <entry>1</entry>
 618          <entry></entry>
 619         </row>
 620         <row>
 621          <entry><literal>WIN1250</literal></entry>
 622          <entry>Windows CP1250</entry>
 623          <entry>Central European</entry>
 624          <entry>Yes</entry>
 625          <entry>1</entry>
 626          <entry></entry>
 627         </row>
 628         <row>
 629          <entry><literal>WIN1251</literal></entry>
 630          <entry>Windows CP1251</entry>
 631          <entry>Cyrillic</entry>
 632          <entry>Yes</entry>
 633          <entry>1</entry>
 634          <entry><literal>WIN</></entry>
 635         </row>
 636         <row>
 637          <entry><literal>WIN1252</literal></entry>
 638          <entry>Windows CP1252</entry>
 639          <entry>Western European</entry>
 640          <entry>Yes</entry>
 641          <entry>1</entry>
 642          <entry></entry>
 643         </row>
 644         <row>
 645          <entry><literal>WIN1253</literal></entry>
 646          <entry>Windows CP1253</entry>
 647          <entry>Greek</entry>
 648          <entry>Yes</entry>
 649          <entry>1</entry>
 650          <entry></entry>
 651         </row>
 652         <row>
 653          <entry><literal>WIN1254</literal></entry>
 654          <entry>Windows CP1254</entry>
 655          <entry>Turkish</entry>
 656          <entry>Yes</entry>
 657          <entry>1</entry>
 658          <entry></entry>
 659         </row>
 660         <row>
 661          <entry><literal>WIN1255</literal></entry>
 662          <entry>Windows CP1255</entry>
 663          <entry>Hebrew</entry>
 664          <entry>Yes</entry>
 665          <entry>1</entry>
 666          <entry></entry>
 667         </row>
 668         <row>
 669          <entry><literal>WIN1256</literal></entry>
 670          <entry>Windows CP1256</entry>
 671          <entry>Arabic</entry>
 672          <entry>Yes</entry>
 673          <entry>1</entry>
 674          <entry></entry>
 675         </row>
 676         <row>
 677          <entry><literal>WIN1257</literal></entry>
 678          <entry>Windows CP1257</entry>
 679          <entry>Baltic</entry>
 680          <entry>Yes</entry>
 681          <entry>1</entry>
 682          <entry></entry>
 683         </row>
 684         <row>
 685          <entry><literal>WIN1258</literal></entry>
 686          <entry>Windows CP1258</entry>
 687          <entry>Vietnamese</entry>
 688          <entry>Yes</entry>
 689          <entry>1</entry>
 690          <entry><literal>ABC</>, <literal>TCVN</>, <literal>TCVN5712</>, <literal>VSCII</></entry>
 691         </row>
 692        </tbody>
 693       </tgroup>
 694      </table>
 695
 696      <para>
 697       Not all <acronym>API</>s support all the listed character sets. For example, the
 698       <productname>PostgreSQL</>
 699       JDBC driver does not support <literal>MULE_INTERNAL</>, <literal>LATIN6</>,
 700       <literal>LATIN8</>, and <literal>LATIN10</>.
 701      </para>
 702
 703      <para>
 704       The <literal>SQL_ASCII</> setting behaves considerably differently
 705       from the other settings.  When the server character set is
 706       <literal>SQL_ASCII</>, the server interprets byte values 0-127
 707       according to the ASCII standard, while byte values 128-255 are taken
 708       as uninterpreted characters.  No encoding conversion will be done when
 709       the setting is <literal>SQL_ASCII</>.  Thus, this setting is not so
 710       much a declaration that a specific encoding is in use, as a declaration
 711       of ignorance about the encoding.  In most cases, if you are
 712       working with any non-ASCII data, it is unwise to use the
 713       <literal>SQL_ASCII</> setting, because
 714       <productname>PostgreSQL</productname> will be unable to help you by
 715       converting or validating non-ASCII characters.
 716      </para>
 717     </sect2>
 718
 719    <sect2>
 720     <title>Setting the Character Set</title>
 721
 722     <para>
 723      <command>initdb</> defines the default character set
 724      for a <productname>PostgreSQL</productname> cluster. For example,
 725
 726 <screen>
 727 initdb -E EUC_JP
 728 </screen>
 729
 730      sets the default character set (encoding) to
 731      <literal>EUC_JP</literal> (Extended Unix Code for Japanese).  You
 732      can use <option>--encoding</option> instead of
 733      <option>-E</option> if you prefer to type longer option strings.
 734      If no <option>-E</> or <option>--encoding</option> option is
 735      given, <command>initdb</> attempts to determine the appropriate
 736      encoding to use based on the specified or default locale.
 737     </para>
 738
 739     <para>
 740      You can specify a non-default encoding at database creation time,
 741      provided that the encoding is compatible with the selected locale:
 742
 743 <screen>
 744 createdb -E EUC_KR -T template0 --lc-collate=ko_KR.euckr --lc-ctype=ko_KR.euckr korean
 745 </screen>
 746
 747      This will create a database named <literal>korean</literal> that
 748      uses the character set <literal>EUC_KR</literal>, and locale <literal>ko_KR</literal>.
 749      Another way to accomplish this is to use this SQL command:
 750
 751 <programlisting>
 752 CREATE DATABASE korean WITH ENCODING 'EUC_KR' LC_COLLATE='ko_KR.euckr' LC_CTYPE='ko_KR.euckr' TEMPLATE=template0;
 753 </programlisting>
 754
 755      Notice that the above commands specify copying the <literal>template0</>
 756      database.  When copying any other database, the encoding and locale
 757      settings cannot be changed from those of the source database, because
 758      that might result in corrupt data.  For more information see
 759      <xref linkend="manage-ag-templatedbs">.
 760     </para>
 761
 762     <para>
 763      The encoding for a database is stored in the system catalog
 764      <literal>pg_database</literal>.  You can see it by using the
 765      <option>-l</option> option or the <command>\l</command> command
 766      of <command>psql</command>.
 767
 768 <screen>
 769 $ <userinput>psql -l</userinput>
 770                                          List of databases
 771    Name    |  Owner   | Encoding  |  Collation  |    Ctype    |          Access Privileges
 772 -----------+----------+-----------+-------------+-------------+-------------------------------------
 773  clocaledb | hlinnaka | SQL_ASCII | C           | C           |
 774  englishdb | hlinnaka | UTF8      | en_GB.UTF8  | en_GB.UTF8  |
 775  japanese  | hlinnaka | UTF8      | ja_JP.UTF8  | ja_JP.UTF8  |
 776  korean    | hlinnaka | EUC_KR    | ko_KR.euckr | ko_KR.euckr |
 777  postgres  | hlinnaka | UTF8      | fi_FI.UTF8  | fi_FI.UTF8  |
 778  template0 | hlinnaka | UTF8      | fi_FI.UTF8  | fi_FI.UTF8  | {=c/hlinnaka,hlinnaka=CTc/hlinnaka}
 779  template1 | hlinnaka | UTF8      | fi_FI.UTF8  | fi_FI.UTF8  | {=c/hlinnaka,hlinnaka=CTc/hlinnaka}
 780 (7 rows)
 781 </screen>
 782     </para>
 783
 784     <important>
 785      <para>
 786       On most modern operating systems, <productname>PostgreSQL</productname>
 787       can determine which character set is implied by an <envar>LC_CTYPE</>
 788       setting, and it will enforce that only the matching database encoding is
 789       used.  On older systems it is your responsibility to ensure that you use
 790       the encoding expected by the locale you have selected.  A mistake in
 791       this area is likely to lead to strange misbehavior of locale-dependent
 792       operations such as sorting.
 793      </para>
 794
 795      <para>
 796       <productname>PostgreSQL</productname> will allow superusers to create
 797       databases with <literal>SQL_ASCII</> encoding even when
 798       <envar>LC_CTYPE</> is not <literal>C</> or <literal>POSIX</>.  As noted
 799       above, <literal>SQL_ASCII</> does not enforce that the data stored in
 800       the database has any particular encoding, and so this choice poses risks
 801       of locale-dependent misbehavior.  Using this combination of settings is
 802       deprecated and may someday be forbidden altogether.
 803      </para>
 804     </important>
 805    </sect2>
 806
 807    <sect2>
 808     <title>Automatic Character Set Conversion Between Server and Client</title>
 809
 810     <para>
 811      <productname>PostgreSQL</productname> supports automatic
 812      character set conversion between server and client for certain
 813      character set combinations. The conversion information is stored in the
 814      <literal>pg_conversion</> system catalog.  <productname>PostgreSQL</>
 815      comes with some predefined conversions, as shown in <xref
 816      linkend="multibyte-translation-table">. You can create a new
 817      conversion using the SQL command <command>CREATE CONVERSION</command>.
 818     </para>
 819
 820      <table id="multibyte-translation-table">
 821       <title>Client/Server Character Set Conversions</title>
 822       <tgroup cols="2">
 823        <thead>
 824         <row>
 825          <entry>Server Character Set</entry>
 826          <entry>Available Client Character Sets</entry>
 827         </row>
 828        </thead>
 829        <tbody>
 830         <row>
 831          <entry><literal>BIG5</literal></entry>
 832          <entry><emphasis>not supported as a server encoding</emphasis>
 833          </entry>
 834         </row>
 835         <row>
 836          <entry><literal>EUC_CN</literal></entry>
 837          <entry><emphasis>EUC_CN</emphasis>,
 838          <literal>MULE_INTERNAL</literal>,
 839          <literal>UTF8</literal>
 840          </entry>
 841         </row>
 842         <row>
 843          <entry><literal>EUC_JP</literal></entry>
 844          <entry><emphasis>EUC_JP</emphasis>,
 845          <literal>MULE_INTERNAL</literal>,
 846          <literal>SJIS</literal>,
 847          <literal>UTF8</literal>
 848          </entry>
 849         </row>
 850         <row>
 851          <entry><literal>EUC_KR</literal></entry>
 852          <entry><emphasis>EUC_KR</emphasis>,
 853          <literal>MULE_INTERNAL</literal>,
 854          <literal>UTF8</literal>
 855          </entry>
 856         </row>
 857         <row>
 858          <entry><literal>EUC_TW</literal></entry>
 859          <entry><emphasis>EUC_TW</emphasis>,
 860          <literal>BIG5</literal>,
 861          <literal>MULE_INTERNAL</literal>,
 862          <literal>UTF8</literal>
 863          </entry>
 864         </row>
 865         <row>
 866          <entry><literal>GB18030</literal></entry>
 867          <entry><emphasis>not supported as a server encoding</emphasis>
 868          </entry>
 869         </row>
 870         <row>
 871          <entry><literal>GBK</literal></entry>
 872          <entry><emphasis>not supported as a server encoding</emphasis>
 873          </entry>
 874         </row>
 875         <row>
 876          <entry><literal>ISO_8859_5</literal></entry>
 877          <entry><emphasis>ISO_8859_5</emphasis>,
 878          <literal>KOI8R</literal>,
 879          <literal>MULE_INTERNAL</literal>,
 880          <literal>UTF8</literal>,
 881          <literal>WIN866</literal>,
 882          <literal>WIN1251</literal>
 883          </entry>
 884         </row>
 885         <row>
 886          <entry><literal>ISO_8859_6</literal></entry>
 887          <entry><emphasis>ISO_8859_6</emphasis>,
 888          <literal>UTF8</literal>
 889          </entry>
 890         </row>
 891         <row>
 892          <entry><literal>ISO_8859_7</literal></entry>
 893          <entry><emphasis>ISO_8859_7</emphasis>,
 894          <literal>UTF8</literal>
 895          </entry>
 896         </row>
 897         <row>
 898          <entry><literal>ISO_8859_8</literal></entry>
 899          <entry><emphasis>ISO_8859_8</emphasis>,
 900          <literal>UTF8</literal>
 901          </entry>
 902         </row>
 903         <row>
 904          <entry><literal>JOHAB</literal></entry>
 905          <entry><emphasis>JOHAB</emphasis>,
 906          <literal>UTF8</literal>
 907          </entry>
 908         </row>
 909         <row>
 910          <entry><literal>KOI8R</literal></entry>
 911          <entry><emphasis>KOI8R</emphasis>,
 912          <literal>ISO_8859_5</literal>,
 913          <literal>MULE_INTERNAL</literal>,
 914          <literal>UTF8</literal>,
 915          <literal>WIN866</literal>,
 916          <literal>WIN1251</literal>
 917          </entry>
 918         </row>
 919         <row>
 920          <entry><literal>KOI8U</literal></entry>
 921          <entry><emphasis>KOI8U</emphasis>,
 922          <literal>UTF8</literal>
 923          </entry>
 924         </row>
 925         <row>
 926          <entry><literal>LATIN1</literal></entry>
 927          <entry><emphasis>LATIN1</emphasis>,
 928          <literal>MULE_INTERNAL</literal>,
 929          <literal>UTF8</literal>
 930          </entry>
 931         </row>
 932         <row>
 933          <entry><literal>LATIN2</literal></entry>
 934          <entry><emphasis>LATIN2</emphasis>,
 935          <literal>MULE_INTERNAL</literal>,
 936          <literal>UTF8</literal>,
 937          <literal>WIN1250</literal>
 938          </entry>
 939         </row>
 940         <row>
 941          <entry><literal>LATIN3</literal></entry>
 942          <entry><emphasis>LATIN3</emphasis>,
 943          <literal>MULE_INTERNAL</literal>,
 944          <literal>UTF8</literal>
 945          </entry>
 946         </row>
 947         <row>
 948          <entry><literal>LATIN4</literal></entry>
 949          <entry><emphasis>LATIN4</emphasis>,
 950          <literal>MULE_INTERNAL</literal>,
 951          <literal>UTF8</literal>
 952          </entry>
 953         </row>
 954         <row>
 955          <entry><literal>LATIN5</literal></entry>
 956          <entry><emphasis>LATIN5</emphasis>,
 957          <literal>UTF8</literal>
 958          </entry>
 959         </row>
 960         <row>
 961          <entry><literal>LATIN6</literal></entry>
 962          <entry><emphasis>LATIN6</emphasis>,
 963          <literal>UTF8</literal>
 964          </entry>
 965         </row>
 966         <row>
 967          <entry><literal>LATIN7</literal></entry>
 968          <entry><emphasis>LATIN7</emphasis>,
 969          <literal>UTF8</literal>
 970          </entry>
 971         </row>
 972         <row>
 973          <entry><literal>LATIN8</literal></entry>
 974          <entry><emphasis>LATIN8</emphasis>,
 975          <literal>UTF8</literal>
 976          </entry>
 977         </row>
 978         <row>
 979          <entry><literal>LATIN9</literal></entry>
 980          <entry><emphasis>LATIN9</emphasis>,
 981          <literal>UTF8</literal>
 982          </entry>
 983         </row>
 984         <row>
 985          <entry><literal>LATIN10</literal></entry>
 986          <entry><emphasis>LATIN10</emphasis>,
 987          <literal>UTF8</literal>
 988          </entry>
 989         </row>
 990         <row>
 991          <entry><literal>MULE_INTERNAL</literal></entry>
 992          <entry><emphasis>MULE_INTERNAL</emphasis>,
 993           <literal>BIG5</literal>,
 994           <literal>EUC_CN</literal>,
 995           <literal>EUC_JP</literal>,
 996           <literal>EUC_KR</literal>,
 997           <literal>EUC_TW</literal>,
 998           <literal>ISO_8859_5</literal>,
 999           <literal>KOI8R</literal>,
1000           <literal>LATIN1</literal> to <literal>LATIN4</literal>,
1001           <literal>SJIS</literal>,
1002           <literal>WIN866</literal>,
1003           <literal>WIN1250</literal>,
1004           <literal>WIN1251</literal>
1005          </entry>
1006         </row>
1007         <row>
1008          <entry><literal>SJIS</literal></entry>
1009          <entry><emphasis>not supported as a server encoding</emphasis>
1010          </entry>
1011         </row>
1012         <row>
1013          <entry><literal>SQL_ASCII</literal></entry>
1014          <entry><emphasis>any (no conversion will be performed)</emphasis>
1015          </entry>
1016         </row>
1017         <row>
1018          <entry><literal>UHC</literal></entry>
1019          <entry><emphasis>not supported as a server encoding</emphasis>
1020          </entry>
1021         </row>
1022         <row>
1023          <entry><literal>UTF8</literal></entry>
1024          <entry><emphasis>all supported encodings</emphasis>
1025          </entry>
1026         </row>
1027         <row>
1028          <entry><literal>WIN866</literal></entry>
1029          <entry><emphasis>WIN866</emphasis>,
1030           <literal>ISO_8859_5</literal>,
1031           <literal>KOI8R</literal>,
1032           <literal>MULE_INTERNAL</literal>,
1033           <literal>UTF8</literal>,
1034           <literal>WIN1251</literal>
1035          </entry>
1036         </row>
1037         <row>
1038          <entry><literal>WIN874</literal></entry>
1039          <entry><emphasis>WIN874</emphasis>,
1040          <literal>UTF8</literal>
1041          </entry>
1042         </row>
1043         <row>
1044          <entry><literal>WIN1250</literal></entry>
1045          <entry><emphasis>WIN1250</emphasis>,
1046           <literal>LATIN2</literal>,
1047           <literal>MULE_INTERNAL</literal>,
1048           <literal>UTF8</literal>
1049          </entry>
1050         </row>
1051         <row>
1052          <entry><literal>WIN1251</literal></entry>
1053          <entry><emphasis>WIN1251</emphasis>,
1054           <literal>ISO_8859_5</literal>,
1055           <literal>KOI8R</literal>,
1056           <literal>MULE_INTERNAL</literal>,
1057           <literal>UTF8</literal>,
1058           <literal>WIN866</literal>
1059          </entry>
1060         </row>
1061         <row>
1062          <entry><literal>WIN1252</literal></entry>
1063          <entry><emphasis>WIN1252</emphasis>,
1064           <literal>UTF8</literal>
1065          </entry>
1066         </row>
1067         <row>
1068          <entry><literal>WIN1253</literal></entry>
1069          <entry><emphasis>WIN1253</emphasis>,
1070           <literal>UTF8</literal>
1071          </entry>
1072         </row>
1073         <row>
1074          <entry><literal>WIN1254</literal></entry>
1075          <entry><emphasis>WIN1254</emphasis>,
1076           <literal>UTF8</literal>
1077          </entry>
1078         </row>
1079         <row>
1080          <entry><literal>WIN1255</literal></entry>
1081          <entry><emphasis>WIN1255</emphasis>,
1082           <literal>UTF8</literal>
1083          </entry>
1084         </row>
1085         <row>
1086          <entry><literal>WIN1256</literal></entry>
1087          <entry><emphasis>WIN1256</emphasis>,
1088          <literal>UTF8</literal>
1089          </entry>
1090         </row>
1091         <row>
1092          <entry><literal>WIN1257</literal></entry>
1093          <entry><emphasis>WIN1257</emphasis>,
1094           <literal>UTF8</literal>
1095          </entry>
1096         </row>
1097         <row>
1098          <entry><literal>WIN1258</literal></entry>
1099          <entry><emphasis>WIN1258</emphasis>,
1100          <literal>UTF8</literal>
1101          </entry>
1102         </row>
1103        </tbody>
1104       </tgroup>
1105      </table>
1106
1107     <para>
1108      To enable automatic character set conversion, you have to
1109      tell <productname>PostgreSQL</productname> the character set
1110      (encoding) you would like to use in the client. There are several
1111      ways to accomplish this:
1112
1113      <itemizedlist>
1114       <listitem>
1115        <para>
1116         Using the <command>\encoding</command> command in
1117         <application>psql</application>.
1118         <command>\encoding</command> allows you to change client
1119         encoding on the fly. For
1120         example, to change the encoding to <literal>SJIS</literal>, type:
1121
1122 <programlisting>
1123 \encoding SJIS
1124 </programlisting>
1125        </para>
1126       </listitem>
1127
1128       <listitem>
1129        <para>
1130         <application>libpq</> (<xref linkend="libpq-control">) has functions to control the client encoding.
1131        </para>
1132       </listitem>
1133
1134       <listitem>
1135        <para>
1136         Using <command>SET client_encoding TO</command>.
1137
1138         Setting the client encoding can be done with this SQL command:
1139
1140 <programlisting>
1141 SET CLIENT_ENCODING TO '<replaceable>value</>';
1142 </programlisting>
1143
1144         Also you can use the standard SQL syntax <literal>SET NAMES</literal>
1145         for this purpose:
1146
1147 <programlisting>
1148 SET NAMES '<replaceable>value</>';
1149 </programlisting>
1150
1151         To query the current client encoding:
1152
1153 <programlisting>
1154 SHOW client_encoding;
1155 </programlisting>
1156
1157         To return to the default encoding:
1158
1159 <programlisting>
1160 RESET client_encoding;
1161 </programlisting>
1162        </para>
1163       </listitem>
1164
1165       <listitem>
1166        <para>
1167         Using <envar>PGCLIENTENCODING</envar>. If the environment variable
1168         <envar>PGCLIENTENCODING</envar> is defined in the client's
1169         environment, that client encoding is automatically selected
1170         when a connection to the server is made.  (This can
1171         subsequently be overridden using any of the other methods
1172         mentioned above.)
1173        </para>
1174       </listitem>
1175
1176       <listitem>
1177       <para>
1178        Using the configuration variable <xref
1179        linkend="guc-client-encoding">. If the
1180        <varname>client_encoding</> variable is set, that client
1181        encoding is automatically selected when a connection to the
1182        server is made.  (This can subsequently be overridden using any
1183        of the other methods mentioned above.)
1184        </para>
1185       </listitem>
1186
1187      </itemizedlist>
1188     </para>
1189
1190     <para>
1191      If the conversion of a particular character is not possible
1192      &mdash; suppose you chose <literal>EUC_JP</literal> for the
1193      server and <literal>LATIN1</literal> for the client, then some
1194      Japanese characters do not have a representation in
1195      <literal>LATIN1</literal> &mdash; then an error is reported.
1196     </para>
1197
1198     <para>
1199      If the client character set is defined as <literal>SQL_ASCII</>,
1200      encoding conversion is disabled, regardless of the server's character
1201      set.  Just as for the server, use of <literal>SQL_ASCII</> is unwise
1202      unless you are working with all-ASCII data.
1203     </para>
1204    </sect2>
1205
1206    <sect2>
1207     <title>Further Reading</title>
1208
1209     <para>
1210      These are good sources to start learning about various kinds of encoding
1211      systems.
1212
1213      <variablelist>
1214       <varlistentry>
1215        <term><ulink url="http://www.i18ngurus.com/docs/984813247.html"></ulink></term>
1216
1217        <listitem>
1218         <para>
1219          An extensive collection of documents about character sets, encodings,
1220          and code pages.
1221         </para>
1222        </listitem>
1223       </varlistentry>
1224
1225       <varlistentry>
1226        <term><ulink url="ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf"></ulink></term>
1227
1228        <listitem>
1229         <para>
1230          Detailed explanations of <literal>EUC_JP</literal>,
1231          <literal>EUC_CN</literal>, <literal>EUC_KR</literal>,
1232          <literal>EUC_TW</literal> appear in section 3.2.
1233         </para>
1234        </listitem>
1235       </varlistentry>
1236
1237       <varlistentry>
1238        <term><ulink url="http://www.unicode.org/"></ulink></term>
1239
1240        <listitem>
1241         <para>
1242          The web site of the Unicode Consortium.
1243         </para>
1244        </listitem>
1245       </varlistentry>
1246
1247       <varlistentry>
1248        <term>RFC 3629</term>
1249
1250        <listitem>
1251         <para>
1252          <acronym>UTF</acronym>-8 is defined here.
1253         </para>
1254        </listitem>
1255       </varlistentry>
1256      </variablelist>
1257     </para>
1258    </sect2>
1259
1260   </sect1>
1261
1262 </chapter>