Fix xslt_process() to ensure that it inserts a NULL terminator after the
[PostgreSQL.git] / doc / src / sgml / charset.sgml
blobf942b537a0e0606ee9ea117113da3d5d18a6e3ef
1 <!-- $PostgreSQL$ -->
3 <chapter id="charset">
4 <title>Localization</>
6 <para>
7 This chapter describes the available localization features from the
8 point of view of the administrator.
9 <productname>PostgreSQL</productname> supports localization with
10 two approaches:
12 <itemizedlist>
13 <listitem>
14 <para>
15 Using the locale features of the operating system to provide
16 locale-specific collation order, number formatting, translated
17 messages, and other aspects.
18 </para>
19 </listitem>
21 <listitem>
22 <para>
23 Providing a number of different character sets to support storing text
24 in all kinds of languages, and providing character set translation
25 between client and server.
26 </para>
27 </listitem>
28 </itemizedlist>
29 </para>
32 <sect1 id="locale">
33 <title>Locale Support</title>
35 <indexterm zone="locale"><primary>locale</></>
37 <para>
38 <firstterm>Locale</> support refers to an application respecting
39 cultural preferences regarding alphabets, sorting, number
40 formatting, etc. <productname>PostgreSQL</> uses the standard ISO
41 C and <acronym>POSIX</acronym> locale facilities provided by the server operating
42 system. For additional information refer to the documentation of your
43 system.
44 </para>
46 <sect2>
47 <title>Overview</>
49 <para>
50 Locale support is automatically initialized when a database
51 cluster is created using <command>initdb</command>.
52 <command>initdb</command> will initialize the database cluster
53 with the locale setting of its execution environment by default,
54 so if your system is already set to use the locale that you want
55 in your database cluster then there is nothing else you need to
56 do. If you want to use a different locale (or you are not sure
57 which locale your system is set to), you can instruct
58 <command>initdb</command> exactly which locale to use by
59 specifying the <option>--locale</option> option. For example:
60 <screen>
61 initdb --locale=sv_SE
62 </screen>
63 </para>
65 <para>
66 This example for Unix systems sets the locale to Swedish
67 (<literal>sv</>) as spoken
68 in Sweden (<literal>SE</>). Other possibilities might be
69 <literal>en_US</> (U.S. English) and <literal>fr_CA</> (French
70 Canadian). If more than one character set can be useful for a
71 locale then the specifications look like this:
72 <literal>cs_CZ.ISO8859-2</>. What locales are available under what
73 names on your system depends on what was provided by the operating
74 system vendor and what was installed. On most Unix systems, the command
75 <literal>locale -a</> will provide a list of available locales.
76 Windows uses more verbose locale names, such as <literal>German_Germany</>
77 or <literal>Swedish_Sweden.1252</>, but the principles are the same.
78 </para>
80 <para>
81 Occasionally it is useful to mix rules from several locales, e.g.,
82 use English collation rules but Spanish messages. To support that, a
83 set of locale subcategories exist that control only a certain
84 aspect of the localization rules:
86 <informaltable>
87 <tgroup cols="2">
88 <tbody>
89 <row>
90 <entry><envar>LC_COLLATE</></>
91 <entry>String sort order</>
92 </row>
93 <row>
94 <entry><envar>LC_CTYPE</></>
95 <entry>Character classification (What is a letter? Its upper-case equivalent?)</>
96 </row>
97 <row>
98 <entry><envar>LC_MESSAGES</></>
99 <entry>Language of messages</>
100 </row>
101 <row>
102 <entry><envar>LC_MONETARY</></>
103 <entry>Formatting of currency amounts</>
104 </row>
105 <row>
106 <entry><envar>LC_NUMERIC</></>
107 <entry>Formatting of numbers</>
108 </row>
109 <row>
110 <entry><envar>LC_TIME</></>
111 <entry>Formatting of dates and times</>
112 </row>
113 </tbody>
114 </tgroup>
115 </informaltable>
117 The category names translate into names of
118 <command>initdb</command> options to override the locale choice
119 for a specific category. For instance, to set the locale to
120 French Canadian, but use U.S. rules for formatting currency, use
121 <literal>initdb --locale=fr_CA --lc-monetary=en_US</literal>.
122 </para>
124 <para>
125 If you want the system to behave as if it had no locale support,
126 use the special locale <literal>C</> or <literal>POSIX</>.
127 </para>
129 <para>
130 The nature of some locale categories is that their value has to be
131 fixed when the database is created. You can use different settings
132 for different databases, but once a database is created, you cannot
133 change them for that database anymore. <literal>LC_COLLATE</literal>
134 and <literal>LC_CTYPE</literal> are these categories. They affect
135 the sort order of indexes, so they must be kept fixed, or indexes on
136 text columns will become corrupt. The default values for these
137 categories are determined when <command>initdb</command> is run, and
138 those values are used when new databases are created, unless
139 specified otherwise in the <command>CREATE DATABASE</command> command.
140 </para>
142 <para>
143 The other locale categories can be changed whenever desired
144 by setting the server configuration parameters
145 that have the same name as the locale categories (see <xref
146 linkend="runtime-config-client-format"> for details). The values
147 that are chosen by <command>initdb</command> are actually only written
148 into the configuration file <filename>postgresql.conf</filename> to
149 serve as defaults when the server is started. If you delete these
150 assignments from <filename>postgresql.conf</filename> then the
151 server will inherit the settings from its execution environment.
152 </para>
154 <para>
155 Note that the locale behavior of the server is determined by the
156 environment variables seen by the server, not by the environment
157 of any client. Therefore, be careful to configure the correct locale settings
158 before starting the server. A consequence of this is that if
159 client and server are set up in different locales, messages might
160 appear in different languages depending on where they originated.
161 </para>
163 <note>
164 <para>
165 When we speak of inheriting the locale from the execution
166 environment, this means the following on most operating systems:
167 For a given locale category, say the collation, the following
168 environment variables are consulted in this order until one is
169 found to be set: <envar>LC_ALL</envar>, <envar>LC_COLLATE</envar>
170 (or the variable corresponding to the respective category),
171 <envar>LANG</envar>. If none of these environment variables are
172 set then the locale defaults to <literal>C</literal>.
173 </para>
175 <para>
176 Some message localization libraries also look at the environment
177 variable <envar>LANGUAGE</envar> which overrides all other locale
178 settings for the purpose of setting the language of messages. If
179 in doubt, please refer to the documentation of your operating
180 system, in particular the documentation about
181 <application>gettext</>, for more information.
182 </para>
183 </note>
185 <para>
186 To enable messages to be translated to the user's preferred language,
187 <acronym>NLS</acronym> must have been selected at build time
188 (<literal>configure --enable-nls</>). All other locale support is
189 built in automatically.
190 </para>
191 </sect2>
193 <sect2>
194 <title>Behavior</>
196 <para>
197 The locale settings influence the following SQL features:
199 <itemizedlist>
200 <listitem>
201 <para>
202 Sort order in queries using <literal>ORDER BY</> or the standard
203 comparison operators on textual data
204 <indexterm><primary>ORDER BY</><secondary>and locales</></indexterm>
205 </para>
206 </listitem>
208 <listitem>
209 <para>
210 The ability to use indexes with <literal>LIKE</> clauses
211 <indexterm><primary>LIKE</><secondary>and locales</></indexterm>
212 </para>
213 </listitem>
215 <listitem>
216 <para>
217 The <function>upper</>, <function>lower</>, and <function>initcap</>
218 functions
219 <indexterm><primary>upper</><secondary>and locales</></indexterm>
220 <indexterm><primary>lower</><secondary>and locales</></indexterm>
221 </para>
222 </listitem>
224 <listitem>
225 <para>
226 The <function>to_char</> family of functions
227 <indexterm><primary>to_char</><secondary>and locales</></indexterm>
228 </para>
229 </listitem>
230 </itemizedlist>
231 </para>
233 <para>
234 The drawback of using locales other than <literal>C</> or
235 <literal>POSIX</> in <productname>PostgreSQL</> is its performance
236 impact. It slows character handling and prevents ordinary indexes
237 from being used by <literal>LIKE</>. For this reason use locales
238 only if you actually need them.
239 </para>
241 <para>
242 As a workaround to allow <productname>PostgreSQL</> to use indexes
243 with <literal>LIKE</> clauses under a non-C locale, several custom
244 operator classes exist. These allow the creation of an index that
245 performs a strict character-by-character comparison, ignoring
246 locale comparison rules. Refer to <xref linkend="indexes-opclass">
247 for more information.
248 </para>
249 </sect2>
251 <sect2>
252 <title>Problems</>
254 <para>
255 If locale support doesn't work according to the explanation above,
256 check that the locale support in your operating system is
257 correctly configured. To check what locales are installed on your
258 system, you can use the command <literal>locale -a</literal> if
259 your operating system provides it.
260 </para>
262 <para>
263 Check that <productname>PostgreSQL</> is actually using the locale
264 that you think it is. The <envar>LC_COLLATE</> and <envar>LC_CTYPE</>
265 settings are determined when a database is created, and cannot be
266 changed except by creating a new database. Other locale
267 settings including <envar>LC_MESSAGES</> and <envar>LC_MONETARY</>
268 are initially determined by the environment the server is started
269 in, but can be changed on-the-fly. You can check the active locale
270 settings using the <command>SHOW</> command.
271 </para>
273 <para>
274 The directory <filename>src/test/locale</> in the source
275 distribution contains a test suite for
276 <productname>PostgreSQL</>'s locale support.
277 </para>
279 <para>
280 Client applications that handle server-side errors by parsing the
281 text of the error message will obviously have problems when the
282 server's messages are in a different language. Authors of such
283 applications are advised to make use of the error code scheme
284 instead.
285 </para>
287 <para>
288 Maintaining catalogs of message translations requires the on-going
289 efforts of many volunteers that want to see
290 <productname>PostgreSQL</> speak their preferred language well.
291 If messages in your language are currently not available or not fully
292 translated, your assistance would be appreciated. If you want to
293 help, refer to <xref linkend="nls"> or write to the developers'
294 mailing list.
295 </para>
296 </sect2>
297 </sect1>
300 <sect1 id="multibyte">
301 <title>Character Set Support</title>
303 <indexterm zone="multibyte"><primary>character set</></>
305 <para>
306 The character set support in <productname>PostgreSQL</productname>
307 allows you to store text in a variety of character sets (also called
308 encodings), including
309 single-byte character sets such as the ISO 8859 series and
310 multiple-byte character sets such as <acronym>EUC</> (Extended Unix
311 Code), UTF-8, and Mule internal code. All supported character sets
312 can be used transparently by clients, but a few are not supported
313 for use within the server (that is, as a server-side encoding).
314 The default character set is selected while
315 initializing your <productname>PostgreSQL</productname> database
316 cluster using <command>initdb</>. It can be overridden when you
317 create a database, so you can have multiple
318 databases each with a different character set.
319 </para>
321 <para>
322 An important restriction, however, is that each database's character set
323 must be compatible with the database's <envar>LC_CTYPE</> and
324 <envar>LC_COLLATE</> locale settings. For <literal>C</> or
325 <literal>POSIX</> locale, any character set is allowed, but for other
326 locales there is only one character set that will work correctly.
327 (On Windows, however, UTF-8 encoding can be used with any locale.)
328 </para>
330 <sect2 id="multibyte-charset-supported">
331 <title>Supported Character Sets</title>
333 <para>
334 <xref linkend="charset-table"> shows the character sets available
335 for use in <productname>PostgreSQL</productname>.
336 </para>
338 <table id="charset-table">
339 <title><productname>PostgreSQL</productname> Character Sets</title>
340 <tgroup cols="6">
341 <thead>
342 <row>
343 <entry>Name</entry>
344 <entry>Description</entry>
345 <entry>Language</entry>
346 <entry>Server?</entry>
347 <!--
348 The Bytes/Char field is populated by looking at the values returned
349 by pg_wchar_table.mblen function for each encoding.
351 <entry>Bytes/Char</entry>
352 <entry>Aliases</entry>
353 </row>
354 </thead>
355 <tbody>
356 <row>
357 <entry><literal>BIG5</literal></entry>
358 <entry>Big Five</entry>
359 <entry>Traditional Chinese</entry>
360 <entry>No</entry>
361 <entry>1-2</entry>
362 <entry><literal>WIN950</>, <literal>Windows950</></entry>
363 </row>
364 <row>
365 <entry><literal>EUC_CN</literal></entry>
366 <entry>Extended UNIX Code-CN</entry>
367 <entry>Simplified Chinese</entry>
368 <entry>Yes</entry>
369 <entry>1-3</entry>
370 <entry></entry>
371 </row>
372 <row>
373 <entry><literal>EUC_JP</literal></entry>
374 <entry>Extended UNIX Code-JP</entry>
375 <entry>Japanese</entry>
376 <entry>Yes</entry>
377 <entry>1-3</entry>
378 <entry></entry>
379 </row>
380 <row>
381 <entry><literal>EUC_JIS_2004</literal></entry>
382 <entry>Extended UNIX Code-JP, JIS X 0213</entry>
383 <entry>Japanese</entry>
384 <entry>Yes</entry>
385 <entry>1-3</entry>
386 <entry></entry>
387 </row>
388 <row>
389 <entry><literal>EUC_KR</literal></entry>
390 <entry>Extended UNIX Code-KR</entry>
391 <entry>Korean</entry>
392 <entry>Yes</entry>
393 <entry>1-3</entry>
394 <entry></entry>
395 </row>
396 <row>
397 <entry><literal>EUC_TW</literal></entry>
398 <entry>Extended UNIX Code-TW</entry>
399 <entry>Traditional Chinese, Taiwanese</entry>
400 <entry>Yes</entry>
401 <entry>1-3</entry>
402 <entry></entry>
403 </row>
404 <row>
405 <entry><literal>GB18030</literal></entry>
406 <entry>National Standard</entry>
407 <entry>Chinese</entry>
408 <entry>No</entry>
409 <entry>1-2</entry>
410 <entry></entry>
411 </row>
412 <row>
413 <entry><literal>GBK</literal></entry>
414 <entry>Extended National Standard</entry>
415 <entry>Simplified Chinese</entry>
416 <entry>No</entry>
417 <entry>1-2</entry>
418 <entry><literal>WIN936</>, <literal>Windows936</></entry>
419 </row>
420 <row>
421 <entry><literal>ISO_8859_5</literal></entry>
422 <entry>ISO 8859-5, <acronym>ECMA</> 113</entry>
423 <entry>Latin/Cyrillic</entry>
424 <entry>Yes</entry>
425 <entry>1</entry>
426 <entry></entry>
427 </row>
428 <row>
429 <entry><literal>ISO_8859_6</literal></entry>
430 <entry>ISO 8859-6, <acronym>ECMA</> 114</entry>
431 <entry>Latin/Arabic</entry>
432 <entry>Yes</entry>
433 <entry>1</entry>
434 <entry></entry>
435 </row>
436 <row>
437 <entry><literal>ISO_8859_7</literal></entry>
438 <entry>ISO 8859-7, <acronym>ECMA</> 118</entry>
439 <entry>Latin/Greek</entry>
440 <entry>Yes</entry>
441 <entry>1</entry>
442 <entry></entry>
443 </row>
444 <row>
445 <entry><literal>ISO_8859_8</literal></entry>
446 <entry>ISO 8859-8, <acronym>ECMA</> 121</entry>
447 <entry>Latin/Hebrew</entry>
448 <entry>Yes</entry>
449 <entry>1</entry>
450 <entry></entry>
451 </row>
452 <row>
453 <entry><literal>JOHAB</literal></entry>
454 <entry><acronym>JOHAB</></entry>
455 <entry>Korean (Hangul)</entry>
456 <entry>No</entry>
457 <entry>1-3</entry>
458 <entry></entry>
459 </row>
460 <row>
461 <entry><literal>KOI8R</literal></entry>
462 <entry><acronym>KOI</acronym>8-R</entry>
463 <entry>Cyrillic (Russian)</entry>
464 <entry>Yes</entry>
465 <entry>1</entry>
466 <entry><literal>KOI8</></entry>
467 </row>
468 <row>
469 <entry><literal>KOI8U</literal></entry>
470 <entry><acronym>KOI</acronym>8-U</entry>
471 <entry>Cyrillic (Ukrainian)</entry>
472 <entry>Yes</entry>
473 <entry>1</entry>
474 <entry></entry>
475 </row>
476 <row>
477 <entry><literal>LATIN1</literal></entry>
478 <entry>ISO 8859-1, <acronym>ECMA</> 94</entry>
479 <entry>Western European</entry>
480 <entry>Yes</entry>
481 <entry>1</entry>
482 <entry><literal>ISO88591</></entry>
483 </row>
484 <row>
485 <entry><literal>LATIN2</literal></entry>
486 <entry>ISO 8859-2, <acronym>ECMA</> 94</entry>
487 <entry>Central European</entry>
488 <entry>Yes</entry>
489 <entry>1</entry>
490 <entry><literal>ISO88592</></entry>
491 </row>
492 <row>
493 <entry><literal>LATIN3</literal></entry>
494 <entry>ISO 8859-3, <acronym>ECMA</> 94</entry>
495 <entry>South European</entry>
496 <entry>Yes</entry>
497 <entry>1</entry>
498 <entry><literal>ISO88593</></entry>
499 </row>
500 <row>
501 <entry><literal>LATIN4</literal></entry>
502 <entry>ISO 8859-4, <acronym>ECMA</> 94</entry>
503 <entry>North European</entry>
504 <entry>Yes</entry>
505 <entry>1</entry>
506 <entry><literal>ISO88594</></entry>
507 </row>
508 <row>
509 <entry><literal>LATIN5</literal></entry>
510 <entry>ISO 8859-9, <acronym>ECMA</> 128</entry>
511 <entry>Turkish</entry>
512 <entry>Yes</entry>
513 <entry>1</entry>
514 <entry><literal>ISO88599</></entry>
515 </row>
516 <row>
517 <entry><literal>LATIN6</literal></entry>
518 <entry>ISO 8859-10, <acronym>ECMA</> 144</entry>
519 <entry>Nordic</entry>
520 <entry>Yes</entry>
521 <entry>1</entry>
522 <entry><literal>ISO885910</></entry>
523 </row>
524 <row>
525 <entry><literal>LATIN7</literal></entry>
526 <entry>ISO 8859-13</entry>
527 <entry>Baltic</entry>
528 <entry>Yes</entry>
529 <entry>1</entry>
530 <entry><literal>ISO885913</></entry>
531 </row>
532 <row>
533 <entry><literal>LATIN8</literal></entry>
534 <entry>ISO 8859-14</entry>
535 <entry>Celtic</entry>
536 <entry>Yes</entry>
537 <entry>1</entry>
538 <entry><literal>ISO885914</></entry>
539 </row>
540 <row>
541 <entry><literal>LATIN9</literal></entry>
542 <entry>ISO 8859-15</entry>
543 <entry>LATIN1 with Euro and accents</entry>
544 <entry>Yes</entry>
545 <entry>1</entry>
546 <entry>ISO885915</entry>
547 </row>
548 <row>
549 <entry><literal>LATIN10</literal></entry>
550 <entry>ISO 8859-16, <acronym>ASRO</> SR 14111</entry>
551 <entry>Romanian</entry>
552 <entry>Yes</entry>
553 <entry>1</entry>
554 <entry><literal>ISO885916</></entry>
555 </row>
556 <row>
557 <entry><literal>MULE_INTERNAL</literal></entry>
558 <entry>Mule internal code</entry>
559 <entry>Multilingual Emacs</entry>
560 <entry>Yes</entry>
561 <entry>1-4</entry>
562 <entry></entry>
563 </row>
564 <row>
565 <entry><literal>SJIS</literal></entry>
566 <entry>Shift JIS</entry>
567 <entry>Japanese</entry>
568 <entry>No</entry>
569 <entry>1-2</entry>
570 <entry><literal>Mskanji</>, <literal>ShiftJIS</>, <literal>WIN932</>, <literal>Windows932</></entry>
571 </row>
572 <row>
573 <entry><literal>SHIFT_JIS_2004</literal></entry>
574 <entry>Shift JIS, JIS X 0213</entry>
575 <entry>Japanese</entry>
576 <entry>No</entry>
577 <entry>1-2</entry>
578 <entry></entry>
579 </row>
580 <row>
581 <entry><literal>SQL_ASCII</literal></entry>
582 <entry>unspecified (see text)</entry>
583 <entry><emphasis>any</></entry>
584 <entry>Yes</entry>
585 <entry>1</entry>
586 <entry></entry>
587 </row>
588 <row>
589 <entry><literal>UHC</literal></entry>
590 <entry>Unified Hangul Code</entry>
591 <entry>Korean</entry>
592 <entry>No</entry>
593 <entry>1-2</entry>
594 <entry><literal>WIN949</>, <literal>Windows949</></entry>
595 </row>
596 <row>
597 <entry><literal>UTF8</literal></entry>
598 <entry>Unicode, 8-bit</entry>
599 <entry><emphasis>all</></entry>
600 <entry>Yes</entry>
601 <entry>1-4</entry>
602 <entry><literal>Unicode</></entry>
603 </row>
604 <row>
605 <entry><literal>WIN866</literal></entry>
606 <entry>Windows CP866</entry>
607 <entry>Cyrillic</entry>
608 <entry>Yes</entry>
609 <entry>1</entry>
610 <entry><literal>ALT</></entry>
611 </row>
612 <row>
613 <entry><literal>WIN874</literal></entry>
614 <entry>Windows CP874</entry>
615 <entry>Thai</entry>
616 <entry>Yes</entry>
617 <entry>1</entry>
618 <entry></entry>
619 </row>
620 <row>
621 <entry><literal>WIN1250</literal></entry>
622 <entry>Windows CP1250</entry>
623 <entry>Central European</entry>
624 <entry>Yes</entry>
625 <entry>1</entry>
626 <entry></entry>
627 </row>
628 <row>
629 <entry><literal>WIN1251</literal></entry>
630 <entry>Windows CP1251</entry>
631 <entry>Cyrillic</entry>
632 <entry>Yes</entry>
633 <entry>1</entry>
634 <entry><literal>WIN</></entry>
635 </row>
636 <row>
637 <entry><literal>WIN1252</literal></entry>
638 <entry>Windows CP1252</entry>
639 <entry>Western European</entry>
640 <entry>Yes</entry>
641 <entry>1</entry>
642 <entry></entry>
643 </row>
644 <row>
645 <entry><literal>WIN1253</literal></entry>
646 <entry>Windows CP1253</entry>
647 <entry>Greek</entry>
648 <entry>Yes</entry>
649 <entry>1</entry>
650 <entry></entry>
651 </row>
652 <row>
653 <entry><literal>WIN1254</literal></entry>
654 <entry>Windows CP1254</entry>
655 <entry>Turkish</entry>
656 <entry>Yes</entry>
657 <entry>1</entry>
658 <entry></entry>
659 </row>
660 <row>
661 <entry><literal>WIN1255</literal></entry>
662 <entry>Windows CP1255</entry>
663 <entry>Hebrew</entry>
664 <entry>Yes</entry>
665 <entry>1</entry>
666 <entry></entry>
667 </row>
668 <row>
669 <entry><literal>WIN1256</literal></entry>
670 <entry>Windows CP1256</entry>
671 <entry>Arabic</entry>
672 <entry>Yes</entry>
673 <entry>1</entry>
674 <entry></entry>
675 </row>
676 <row>
677 <entry><literal>WIN1257</literal></entry>
678 <entry>Windows CP1257</entry>
679 <entry>Baltic</entry>
680 <entry>Yes</entry>
681 <entry>1</entry>
682 <entry></entry>
683 </row>
684 <row>
685 <entry><literal>WIN1258</literal></entry>
686 <entry>Windows CP1258</entry>
687 <entry>Vietnamese</entry>
688 <entry>Yes</entry>
689 <entry>1</entry>
690 <entry><literal>ABC</>, <literal>TCVN</>, <literal>TCVN5712</>, <literal>VSCII</></entry>
691 </row>
692 </tbody>
693 </tgroup>
694 </table>
696 <para>
697 Not all <acronym>API</>s support all the listed character sets. For example, the
698 <productname>PostgreSQL</>
699 JDBC driver does not support <literal>MULE_INTERNAL</>, <literal>LATIN6</>,
700 <literal>LATIN8</>, and <literal>LATIN10</>.
701 </para>
703 <para>
704 The <literal>SQL_ASCII</> setting behaves considerably differently
705 from the other settings. When the server character set is
706 <literal>SQL_ASCII</>, the server interprets byte values 0-127
707 according to the ASCII standard, while byte values 128-255 are taken
708 as uninterpreted characters. No encoding conversion will be done when
709 the setting is <literal>SQL_ASCII</>. Thus, this setting is not so
710 much a declaration that a specific encoding is in use, as a declaration
711 of ignorance about the encoding. In most cases, if you are
712 working with any non-ASCII data, it is unwise to use the
713 <literal>SQL_ASCII</> setting, because
714 <productname>PostgreSQL</productname> will be unable to help you by
715 converting or validating non-ASCII characters.
716 </para>
717 </sect2>
719 <sect2>
720 <title>Setting the Character Set</title>
722 <para>
723 <command>initdb</> defines the default character set
724 for a <productname>PostgreSQL</productname> cluster. For example,
726 <screen>
727 initdb -E EUC_JP
728 </screen>
730 sets the default character set (encoding) to
731 <literal>EUC_JP</literal> (Extended Unix Code for Japanese). You
732 can use <option>--encoding</option> instead of
733 <option>-E</option> if you prefer to type longer option strings.
734 If no <option>-E</> or <option>--encoding</option> option is
735 given, <command>initdb</> attempts to determine the appropriate
736 encoding to use based on the specified or default locale.
737 </para>
739 <para>
740 You can specify a non-default encoding at database creation time,
741 provided that the encoding is compatible with the selected locale:
743 <screen>
744 createdb -E EUC_KR -T template0 --lc-collate=ko_KR.euckr --lc-ctype=ko_KR.euckr korean
745 </screen>
747 This will create a database named <literal>korean</literal> that
748 uses the character set <literal>EUC_KR</literal>, and locale <literal>ko_KR</literal>.
749 Another way to accomplish this is to use this SQL command:
751 <programlisting>
752 CREATE DATABASE korean WITH ENCODING 'EUC_KR' LC_COLLATE='ko_KR.euckr' LC_CTYPE='ko_KR.euckr' TEMPLATE=template0;
753 </programlisting>
755 Notice that the above commands specify copying the <literal>template0</>
756 database. When copying any other database, the encoding and locale
757 settings cannot be changed from those of the source database, because
758 that might result in corrupt data. For more information see
759 <xref linkend="manage-ag-templatedbs">.
760 </para>
762 <para>
763 The encoding for a database is stored in the system catalog
764 <literal>pg_database</literal>. You can see it by using the
765 <option>-l</option> option or the <command>\l</command> command
766 of <command>psql</command>.
768 <screen>
769 $ <userinput>psql -l</userinput>
770 List of databases
771 Name | Owner | Encoding | Collation | Ctype | Access Privileges
772 -----------+----------+-----------+-------------+-------------+-------------------------------------
773 clocaledb | hlinnaka | SQL_ASCII | C | C |
774 englishdb | hlinnaka | UTF8 | en_GB.UTF8 | en_GB.UTF8 |
775 japanese | hlinnaka | UTF8 | ja_JP.UTF8 | ja_JP.UTF8 |
776 korean | hlinnaka | EUC_KR | ko_KR.euckr | ko_KR.euckr |
777 postgres | hlinnaka | UTF8 | fi_FI.UTF8 | fi_FI.UTF8 |
778 template0 | hlinnaka | UTF8 | fi_FI.UTF8 | fi_FI.UTF8 | {=c/hlinnaka,hlinnaka=CTc/hlinnaka}
779 template1 | hlinnaka | UTF8 | fi_FI.UTF8 | fi_FI.UTF8 | {=c/hlinnaka,hlinnaka=CTc/hlinnaka}
780 (7 rows)
781 </screen>
782 </para>
784 <important>
785 <para>
786 On most modern operating systems, <productname>PostgreSQL</productname>
787 can determine which character set is implied by an <envar>LC_CTYPE</>
788 setting, and it will enforce that only the matching database encoding is
789 used. On older systems it is your responsibility to ensure that you use
790 the encoding expected by the locale you have selected. A mistake in
791 this area is likely to lead to strange misbehavior of locale-dependent
792 operations such as sorting.
793 </para>
795 <para>
796 <productname>PostgreSQL</productname> will allow superusers to create
797 databases with <literal>SQL_ASCII</> encoding even when
798 <envar>LC_CTYPE</> is not <literal>C</> or <literal>POSIX</>. As noted
799 above, <literal>SQL_ASCII</> does not enforce that the data stored in
800 the database has any particular encoding, and so this choice poses risks
801 of locale-dependent misbehavior. Using this combination of settings is
802 deprecated and may someday be forbidden altogether.
803 </para>
804 </important>
805 </sect2>
807 <sect2>
808 <title>Automatic Character Set Conversion Between Server and Client</title>
810 <para>
811 <productname>PostgreSQL</productname> supports automatic
812 character set conversion between server and client for certain
813 character set combinations. The conversion information is stored in the
814 <literal>pg_conversion</> system catalog. <productname>PostgreSQL</>
815 comes with some predefined conversions, as shown in <xref
816 linkend="multibyte-translation-table">. You can create a new
817 conversion using the SQL command <command>CREATE CONVERSION</command>.
818 </para>
820 <table id="multibyte-translation-table">
821 <title>Client/Server Character Set Conversions</title>
822 <tgroup cols="2">
823 <thead>
824 <row>
825 <entry>Server Character Set</entry>
826 <entry>Available Client Character Sets</entry>
827 </row>
828 </thead>
829 <tbody>
830 <row>
831 <entry><literal>BIG5</literal></entry>
832 <entry><emphasis>not supported as a server encoding</emphasis>
833 </entry>
834 </row>
835 <row>
836 <entry><literal>EUC_CN</literal></entry>
837 <entry><emphasis>EUC_CN</emphasis>,
838 <literal>MULE_INTERNAL</literal>,
839 <literal>UTF8</literal>
840 </entry>
841 </row>
842 <row>
843 <entry><literal>EUC_JP</literal></entry>
844 <entry><emphasis>EUC_JP</emphasis>,
845 <literal>MULE_INTERNAL</literal>,
846 <literal>SJIS</literal>,
847 <literal>UTF8</literal>
848 </entry>
849 </row>
850 <row>
851 <entry><literal>EUC_KR</literal></entry>
852 <entry><emphasis>EUC_KR</emphasis>,
853 <literal>MULE_INTERNAL</literal>,
854 <literal>UTF8</literal>
855 </entry>
856 </row>
857 <row>
858 <entry><literal>EUC_TW</literal></entry>
859 <entry><emphasis>EUC_TW</emphasis>,
860 <literal>BIG5</literal>,
861 <literal>MULE_INTERNAL</literal>,
862 <literal>UTF8</literal>
863 </entry>
864 </row>
865 <row>
866 <entry><literal>GB18030</literal></entry>
867 <entry><emphasis>not supported as a server encoding</emphasis>
868 </entry>
869 </row>
870 <row>
871 <entry><literal>GBK</literal></entry>
872 <entry><emphasis>not supported as a server encoding</emphasis>
873 </entry>
874 </row>
875 <row>
876 <entry><literal>ISO_8859_5</literal></entry>
877 <entry><emphasis>ISO_8859_5</emphasis>,
878 <literal>KOI8R</literal>,
879 <literal>MULE_INTERNAL</literal>,
880 <literal>UTF8</literal>,
881 <literal>WIN866</literal>,
882 <literal>WIN1251</literal>
883 </entry>
884 </row>
885 <row>
886 <entry><literal>ISO_8859_6</literal></entry>
887 <entry><emphasis>ISO_8859_6</emphasis>,
888 <literal>UTF8</literal>
889 </entry>
890 </row>
891 <row>
892 <entry><literal>ISO_8859_7</literal></entry>
893 <entry><emphasis>ISO_8859_7</emphasis>,
894 <literal>UTF8</literal>
895 </entry>
896 </row>
897 <row>
898 <entry><literal>ISO_8859_8</literal></entry>
899 <entry><emphasis>ISO_8859_8</emphasis>,
900 <literal>UTF8</literal>
901 </entry>
902 </row>
903 <row>
904 <entry><literal>JOHAB</literal></entry>
905 <entry><emphasis>JOHAB</emphasis>,
906 <literal>UTF8</literal>
907 </entry>
908 </row>
909 <row>
910 <entry><literal>KOI8R</literal></entry>
911 <entry><emphasis>KOI8R</emphasis>,
912 <literal>ISO_8859_5</literal>,
913 <literal>MULE_INTERNAL</literal>,
914 <literal>UTF8</literal>,
915 <literal>WIN866</literal>,
916 <literal>WIN1251</literal>
917 </entry>
918 </row>
919 <row>
920 <entry><literal>KOI8U</literal></entry>
921 <entry><emphasis>KOI8U</emphasis>,
922 <literal>UTF8</literal>
923 </entry>
924 </row>
925 <row>
926 <entry><literal>LATIN1</literal></entry>
927 <entry><emphasis>LATIN1</emphasis>,
928 <literal>MULE_INTERNAL</literal>,
929 <literal>UTF8</literal>
930 </entry>
931 </row>
932 <row>
933 <entry><literal>LATIN2</literal></entry>
934 <entry><emphasis>LATIN2</emphasis>,
935 <literal>MULE_INTERNAL</literal>,
936 <literal>UTF8</literal>,
937 <literal>WIN1250</literal>
938 </entry>
939 </row>
940 <row>
941 <entry><literal>LATIN3</literal></entry>
942 <entry><emphasis>LATIN3</emphasis>,
943 <literal>MULE_INTERNAL</literal>,
944 <literal>UTF8</literal>
945 </entry>
946 </row>
947 <row>
948 <entry><literal>LATIN4</literal></entry>
949 <entry><emphasis>LATIN4</emphasis>,
950 <literal>MULE_INTERNAL</literal>,
951 <literal>UTF8</literal>
952 </entry>
953 </row>
954 <row>
955 <entry><literal>LATIN5</literal></entry>
956 <entry><emphasis>LATIN5</emphasis>,
957 <literal>UTF8</literal>
958 </entry>
959 </row>
960 <row>
961 <entry><literal>LATIN6</literal></entry>
962 <entry><emphasis>LATIN6</emphasis>,
963 <literal>UTF8</literal>
964 </entry>
965 </row>
966 <row>
967 <entry><literal>LATIN7</literal></entry>
968 <entry><emphasis>LATIN7</emphasis>,
969 <literal>UTF8</literal>
970 </entry>
971 </row>
972 <row>
973 <entry><literal>LATIN8</literal></entry>
974 <entry><emphasis>LATIN8</emphasis>,
975 <literal>UTF8</literal>
976 </entry>
977 </row>
978 <row>
979 <entry><literal>LATIN9</literal></entry>
980 <entry><emphasis>LATIN9</emphasis>,
981 <literal>UTF8</literal>
982 </entry>
983 </row>
984 <row>
985 <entry><literal>LATIN10</literal></entry>
986 <entry><emphasis>LATIN10</emphasis>,
987 <literal>UTF8</literal>
988 </entry>
989 </row>
990 <row>
991 <entry><literal>MULE_INTERNAL</literal></entry>
992 <entry><emphasis>MULE_INTERNAL</emphasis>,
993 <literal>BIG5</literal>,
994 <literal>EUC_CN</literal>,
995 <literal>EUC_JP</literal>,
996 <literal>EUC_KR</literal>,
997 <literal>EUC_TW</literal>,
998 <literal>ISO_8859_5</literal>,
999 <literal>KOI8R</literal>,
1000 <literal>LATIN1</literal> to <literal>LATIN4</literal>,
1001 <literal>SJIS</literal>,
1002 <literal>WIN866</literal>,
1003 <literal>WIN1250</literal>,
1004 <literal>WIN1251</literal>
1005 </entry>
1006 </row>
1007 <row>
1008 <entry><literal>SJIS</literal></entry>
1009 <entry><emphasis>not supported as a server encoding</emphasis>
1010 </entry>
1011 </row>
1012 <row>
1013 <entry><literal>SQL_ASCII</literal></entry>
1014 <entry><emphasis>any (no conversion will be performed)</emphasis>
1015 </entry>
1016 </row>
1017 <row>
1018 <entry><literal>UHC</literal></entry>
1019 <entry><emphasis>not supported as a server encoding</emphasis>
1020 </entry>
1021 </row>
1022 <row>
1023 <entry><literal>UTF8</literal></entry>
1024 <entry><emphasis>all supported encodings</emphasis>
1025 </entry>
1026 </row>
1027 <row>
1028 <entry><literal>WIN866</literal></entry>
1029 <entry><emphasis>WIN866</emphasis>,
1030 <literal>ISO_8859_5</literal>,
1031 <literal>KOI8R</literal>,
1032 <literal>MULE_INTERNAL</literal>,
1033 <literal>UTF8</literal>,
1034 <literal>WIN1251</literal>
1035 </entry>
1036 </row>
1037 <row>
1038 <entry><literal>WIN874</literal></entry>
1039 <entry><emphasis>WIN874</emphasis>,
1040 <literal>UTF8</literal>
1041 </entry>
1042 </row>
1043 <row>
1044 <entry><literal>WIN1250</literal></entry>
1045 <entry><emphasis>WIN1250</emphasis>,
1046 <literal>LATIN2</literal>,
1047 <literal>MULE_INTERNAL</literal>,
1048 <literal>UTF8</literal>
1049 </entry>
1050 </row>
1051 <row>
1052 <entry><literal>WIN1251</literal></entry>
1053 <entry><emphasis>WIN1251</emphasis>,
1054 <literal>ISO_8859_5</literal>,
1055 <literal>KOI8R</literal>,
1056 <literal>MULE_INTERNAL</literal>,
1057 <literal>UTF8</literal>,
1058 <literal>WIN866</literal>
1059 </entry>
1060 </row>
1061 <row>
1062 <entry><literal>WIN1252</literal></entry>
1063 <entry><emphasis>WIN1252</emphasis>,
1064 <literal>UTF8</literal>
1065 </entry>
1066 </row>
1067 <row>
1068 <entry><literal>WIN1253</literal></entry>
1069 <entry><emphasis>WIN1253</emphasis>,
1070 <literal>UTF8</literal>
1071 </entry>
1072 </row>
1073 <row>
1074 <entry><literal>WIN1254</literal></entry>
1075 <entry><emphasis>WIN1254</emphasis>,
1076 <literal>UTF8</literal>
1077 </entry>
1078 </row>
1079 <row>
1080 <entry><literal>WIN1255</literal></entry>
1081 <entry><emphasis>WIN1255</emphasis>,
1082 <literal>UTF8</literal>
1083 </entry>
1084 </row>
1085 <row>
1086 <entry><literal>WIN1256</literal></entry>
1087 <entry><emphasis>WIN1256</emphasis>,
1088 <literal>UTF8</literal>
1089 </entry>
1090 </row>
1091 <row>
1092 <entry><literal>WIN1257</literal></entry>
1093 <entry><emphasis>WIN1257</emphasis>,
1094 <literal>UTF8</literal>
1095 </entry>
1096 </row>
1097 <row>
1098 <entry><literal>WIN1258</literal></entry>
1099 <entry><emphasis>WIN1258</emphasis>,
1100 <literal>UTF8</literal>
1101 </entry>
1102 </row>
1103 </tbody>
1104 </tgroup>
1105 </table>
1107 <para>
1108 To enable automatic character set conversion, you have to
1109 tell <productname>PostgreSQL</productname> the character set
1110 (encoding) you would like to use in the client. There are several
1111 ways to accomplish this:
1113 <itemizedlist>
1114 <listitem>
1115 <para>
1116 Using the <command>\encoding</command> command in
1117 <application>psql</application>.
1118 <command>\encoding</command> allows you to change client
1119 encoding on the fly. For
1120 example, to change the encoding to <literal>SJIS</literal>, type:
1122 <programlisting>
1123 \encoding SJIS
1124 </programlisting>
1125 </para>
1126 </listitem>
1128 <listitem>
1129 <para>
1130 <application>libpq</> (<xref linkend="libpq-control">) has functions to control the client encoding.
1131 </para>
1132 </listitem>
1134 <listitem>
1135 <para>
1136 Using <command>SET client_encoding TO</command>.
1138 Setting the client encoding can be done with this SQL command:
1140 <programlisting>
1141 SET CLIENT_ENCODING TO '<replaceable>value</>';
1142 </programlisting>
1144 Also you can use the standard SQL syntax <literal>SET NAMES</literal>
1145 for this purpose:
1147 <programlisting>
1148 SET NAMES '<replaceable>value</>';
1149 </programlisting>
1151 To query the current client encoding:
1153 <programlisting>
1154 SHOW client_encoding;
1155 </programlisting>
1157 To return to the default encoding:
1159 <programlisting>
1160 RESET client_encoding;
1161 </programlisting>
1162 </para>
1163 </listitem>
1165 <listitem>
1166 <para>
1167 Using <envar>PGCLIENTENCODING</envar>. If the environment variable
1168 <envar>PGCLIENTENCODING</envar> is defined in the client's
1169 environment, that client encoding is automatically selected
1170 when a connection to the server is made. (This can
1171 subsequently be overridden using any of the other methods
1172 mentioned above.)
1173 </para>
1174 </listitem>
1176 <listitem>
1177 <para>
1178 Using the configuration variable <xref
1179 linkend="guc-client-encoding">. If the
1180 <varname>client_encoding</> variable is set, that client
1181 encoding is automatically selected when a connection to the
1182 server is made. (This can subsequently be overridden using any
1183 of the other methods mentioned above.)
1184 </para>
1185 </listitem>
1187 </itemizedlist>
1188 </para>
1190 <para>
1191 If the conversion of a particular character is not possible
1192 &mdash; suppose you chose <literal>EUC_JP</literal> for the
1193 server and <literal>LATIN1</literal> for the client, then some
1194 Japanese characters do not have a representation in
1195 <literal>LATIN1</literal> &mdash; then an error is reported.
1196 </para>
1198 <para>
1199 If the client character set is defined as <literal>SQL_ASCII</>,
1200 encoding conversion is disabled, regardless of the server's character
1201 set. Just as for the server, use of <literal>SQL_ASCII</> is unwise
1202 unless you are working with all-ASCII data.
1203 </para>
1204 </sect2>
1206 <sect2>
1207 <title>Further Reading</title>
1209 <para>
1210 These are good sources to start learning about various kinds of encoding
1211 systems.
1213 <variablelist>
1214 <varlistentry>
1215 <term><ulink url="http://www.i18ngurus.com/docs/984813247.html"></ulink></term>
1217 <listitem>
1218 <para>
1219 An extensive collection of documents about character sets, encodings,
1220 and code pages.
1221 </para>
1222 </listitem>
1223 </varlistentry>
1225 <varlistentry>
1226 <term><ulink url="ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf"></ulink></term>
1228 <listitem>
1229 <para>
1230 Detailed explanations of <literal>EUC_JP</literal>,
1231 <literal>EUC_CN</literal>, <literal>EUC_KR</literal>,
1232 <literal>EUC_TW</literal> appear in section 3.2.
1233 </para>
1234 </listitem>
1235 </varlistentry>
1237 <varlistentry>
1238 <term><ulink url="http://www.unicode.org/"></ulink></term>
1240 <listitem>
1241 <para>
1242 The web site of the Unicode Consortium.
1243 </para>
1244 </listitem>
1245 </varlistentry>
1247 <varlistentry>
1248 <term>RFC 3629</term>
1250 <listitem>
1251 <para>
1252 <acronym>UTF</acronym>-8 is defined here.
1253 </para>
1254 </listitem>
1255 </varlistentry>
1256 </variablelist>
1257 </para>
1258 </sect2>
1260 </sect1>
1262 </chapter>