*multibyte.txt* For Vim version 5.8. Last change: 2000 Jun 07
VIM REFERENCE MANUAL by Bram Moolenaar et al.
Multi-byte support *multibyte* *multi-byte*
*Chinese* *Japanese* *Korean*
Some languages have many characters that cannot be represented using one byte
(one octet). These are Chinese (simplified or traditional), Japanese and
Korean. These languages use more than one byte to represent a character.

This is limited information on the support in Vim for editing files that use
more than one byte per character. Currently, only two-byte codes are
supported.

Also see |+multi_byte| and |'fileencoding'|.
1. Introduction                     |multibyte-intro|
2. Compiling                        |multibyte-compiling|
3. Display (X fontset support)      |multibyte-display|
4. Input (XIM support)              |multibyte-input|
5. UTF-8 in XFree86 xterm           |UTF8-xterm|

==============================================================================
1. Introduction                     *multibyte-intro*
There are many languages in the world, and at least as many different cultures
and environments. A linguistic environment corresponding to an area is called
a "|locale|". The POSIX standard defines the concept of a |locale|, which
includes information about the |charset|, the collating order for sorting, the
date format, the currency format and so on.

Your system needs to support the |locale| system and the language |locale| of
your choice. Some systems have only a few language |locale|s, so the |locale|
of the language you want to use may not be on your system. If so, you have to
add the language |locale|. On some systems it is not possible to add other
|locale|s. In that case, install X |locale|s by installing X compiled with
X_LOCALE. Add "-DX_LOCALE" to the CFLAGS if your X lib supports X_LOCALE.

For example, when you are using a Linux system and want to use Japanese, set
up your system in one of the following ways:
    - libc5 + X compiled with X_LOCALE
    - glibc-2.0 + libwcsmbs + X compiled without X_LOCALE
    - glibc-2.1 + locale-ja + X compiled without X_LOCALE
The location in which the |locale|s are installed varies from system to
system, for example "/usr/share/locale" or "/usr/lib/locale". See your system's
*locale-name* *$LANG-multibyte*
The format of a |locale| name is:
    language[_territory[.codeset]]
Territory means the country, codeset means the |charset|. For example, the
|locale| name "ja_JP.eucJP" means the language is Japanese, the country is
Japan and the codeset is EUC-JP. But it could also be "ja", "ja_JP.EUC",
"ja_JP.ujis", etc. Unfortunately, the |locale| name for a specific language,
territory and codeset is not standardized and depends on your system.

This name is used as the value of the LANG environment variable. When you
want to use Korean and the |locale| name is "ko", do this:
Examples of locale names:
    |charset|    language                |locale-name|
    GB2312       Chinese (simplified)    zh_CN.EUC, zh_CN.GB2312
    Big5         Chinese (traditional)   zh_TW.BIG5, zh_TW.Big5
    CNS-11643    Chinese (traditional)   zh_TW
    EUC-JP       Japanese                ja, ja_JP.EUC, ja_JP.ujis, ja_JP.eucJP
    Shift_JIS    Japanese                ja_JP.SJIS, ja_JP.Shift_JIS
    EUC-KR       Korean                  ko, ko_KR.EUC
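
The locale name format described above can be taken apart mechanically. The
following is a minimal sketch in Python, not something Vim does itself; the
names LocaleName and parse_locale_name are made up for this illustration, and
the set of characters accepted in each part is an assumption.

```python
import re
from typing import NamedTuple, Optional


class LocaleName(NamedTuple):
    language: str
    territory: Optional[str]
    codeset: Optional[str]


# language[_territory[.codeset]]: a codeset is only accepted after a territory.
# The character classes below are an assumption made for this sketch.
_LOCALE_RE = re.compile(r"^([A-Za-z]+)(?:_([A-Za-z]+)(?:\.([A-Za-z0-9_-]+))?)?$")


def parse_locale_name(name: str) -> LocaleName:
    """Split a locale name into its language, territory and codeset parts."""
    m = _LOCALE_RE.match(name)
    if m is None:
        raise ValueError(f"not in language[_territory[.codeset]] form: {name!r}")
    return LocaleName(*m.groups())


for example in ("ja_JP.eucJP", "ja", "zh_TW.Big5", "ko_KR.EUC"):
    print(example, "->", parse_locale_name(example))
```

Parts that are absent from the name come back as None, which matches the
optional brackets in the format.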
Even if your system does not have the multibyte language |locale| of your
choice, or does not have a sufficient implementation of the locale, Vim can
still handle the multibyte languages to some extent. Add "--enable-broken-locale" flag at
CODED CHARACTER SET (CCS)
*coded-character-set* *CCS*
A |CCS| is a mapping from a set of characters to a set of integers. For
example, ((65, A), (66, B), (67, C)) is a |CCS| and ((0x41, A), (0x42, B),
(0x43, C)) is also a |CCS|. Examples of |CCS|es are ISO 10646, US-ASCII, the
ISO-8859 series, JIS X 0208, JIS X 0201, KS C 5601 (KS X 1001) and KS C 5636

The term "integer" means a code point or character number, which is different
from octets or bit combinations.

Typically, a |CCS| is a character table. The column/line position, written as
a hexadecimal number, is the code point of the character. For example, the
US-ASCII CCS has an 8x16 character table: the column numbers run from 0 to 7
and the line numbers from 0 to F. The code point of the character at 4/1 is
0x41.
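
The column/line rule can be written out as a one-line calculation. This is
only an illustrative Python sketch; the function name code_point is made up,
and it assumes a table with 16 lines per column as in the US-ASCII example.

```python
def code_point(column: int, line: int) -> int:
    """Code point of the cell at column/line in a table with 16 lines."""
    if not 0 <= line <= 0xF:
        raise ValueError("line must be in the range 0..F")
    if column < 0:
        raise ValueError("column must not be negative")
    # The column is the high hexadecimal digit, the line the low one.
    return column * 16 + line


cp = code_point(4, 1)      # the 4/1 cell used as the example in the text
print(hex(cp), chr(cp))
```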
CHARACTER ENCODING SCHEME (CES)
*character-encode-scheme* *CES*
A |CES| is a mapping from a sequence of elements in one or more |CCS|es to a
sequence of octets. Examples of |CES|s are EUC-JP, EUC-KR, EUC-CN (GB 2312),
EUC-TW (CNS-11643), ISO-2022-JP, ISO-2022-KR, ISO-2022-CN, UTF-8, etc.

A |charset| is a method of converting a sequence of octets into a sequence of
characters: the combination of one or more |CCS|es and a |CES|. For example,
the ISO-2022-JP |charset| is the combination of the ASCII, JIS X 0201 and
JIS X 0208 |CCS|es and the ISO-2022-JP |CES|. Examples of |charset|s are
US-ASCII, the ISO-8859 series, GB2312, EUC-JP, EUC-KR, Shift_JIS, Big5, UTF-8,
etc.

Note that this is not a term used by other standards bodies, such as ISO, but
a term defined in RFC 2130. The term "codeset" in POSIX has the same meaning
as |charset| here. |charset| does not mean character set (a set of
characters); the term "character repertoire" is used for a collection of
distinct characters. There are historical reasons for this, see RFC 2130.
One language can have several |charset|s. For example, Japanese has the
ISO-2022-JP, EUC-JP and Shift_JIS |charset|s. ISO-2022-JP is used mainly for
internet messages, because it is a 7-bit encoding. EUC-JP is mainly used on
Unix, Shift_JIS is mainly used on Windows and MacOS.
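
To make the difference between these three |charset|s concrete, the sketch
below encodes one JIS X 0208 character with each of them and checks whether
the result is 7-bit clean. It uses Python's standard codecs (named
"iso2022_jp", "euc_jp" and "shift_jis" there) purely as an illustration; this
is unrelated to how Vim handles the text.

```python
TEXT = "\u3042"   # HIRAGANA LETTER A, a character from JIS X 0208

for charset in ("iso2022_jp", "euc_jp", "shift_jis"):
    octets = TEXT.encode(charset)
    # A charset is usable on a 7-bit channel only if no octet has bit 8 set.
    seven_bit = all(b < 0x80 for b in octets)
    print(f"{charset:11} {octets.hex(' ')}  7-bit: {seven_bit}")
```

The same character comes out as three different octet sequences, and only the
ISO-2022-JP form, which wraps the two code octets in escape sequences, stays
below 0x80.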
Vim does not automatically convert to the locale's |charset| at display time.
So if a file's |charset| differs from your locale's |charset|, the file is
not displayed correctly. You must find out the file's |charset| in some way
(guessing, using some utilities, etc.) and convert the |charset| to the locale's

Useful utilities for converting the |charset|:
Nkf is the "Network Kanji code conversion Filter". One of the most
distinctive features of nkf is that it guesses the input Kanji code, so
you don't need to know what the input file's |charset| is. When converting
to EUC-JP from ISO-2022-JP or Shift_JIS, simply do the following command
http://www.sfc.wide.ad.jp/~max/FreeBSD/ports/distfiles/nkf-1.62.tar.gz

Hc is the "Hanzi Converter". Hc converts a GB file to a Big5 file, or a
Big5 file to a GB file. Hc can be found at:
ftp://ftp.cuhk.hk/pub/chinese/ifcss/software/unix/convert/hc-30.tar.gz

Hmconv is a Korean code conversion utility, especially for e-mail. It can
convert between EUC-KR and ISO-2022-KR. Hmconv can be found at:
ftp://ftp.kaist.ac.kr/pub/hangul/code/hmconv/hmconv1.0pl3

Lv is a powerful multilingual file viewer. It can also be used as a
|charset| converter. Supported |charset|s: ISO-2022-CN, ISO-2022-JP,
ISO-2022-KR, EUC-CN, EUC-JP, EUC-KR, EUC-TW, UTF-7, UTF-8, the ISO-8859
series, Shift_JIS, Big5 and HZ. Lv can be found at:
http://www.ff.iij4u.or.jp/~nrt/freeware/lv4493.tar.gz
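
What such a converter does can be sketched in a few lines of Python. This is
not how nkf or any of the utilities above is implemented; the guessing here is
a naive trial decode over a fixed candidate list, and the names CANDIDATES,
guess_charset and convert are made up for the illustration.

```python
# Candidate charsets, tried in this order. The order matters: many EUC-JP
# octet sequences also happen to be valid Shift_JIS, so EUC-JP is tried first.
CANDIDATES = ("iso2022_jp", "euc_jp", "shift_jis")


def guess_charset(data: bytes, candidates=CANDIDATES) -> str:
    """Return the first candidate charset that decodes data without error."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    raise ValueError("none of the candidate charsets fits the data")


def convert(data: bytes, target: str, source: str = "") -> bytes:
    """Convert data to the target charset, guessing the source if not given."""
    source = source or guess_charset(data)
    return data.decode(source).encode(target)


sjis = "\u65e5\u672c\u8a9e".encode("shift_jis")   # "Japanese", in Kanji
print(guess_charset(sjis))
print(convert(sjis, "euc_jp").hex(" "))
```

A trial decode can guess wrong on short or ambiguous input (plain ASCII fits
every candidate), so passing the source charset explicitly is safer whenever
it is known.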
X LOGICAL FONT DESCRIPTION (XLFD)
XLFD is the X font name and contains the information about the font size,
|CCS|, etc. The name is in this format:
    FOUNDRY-FAMILY-WEIGHT-SLANT-WIDTH-STYLE-PIXEL-POINT-X-Y-SPACE-AVE-CR-CE
    - FOUNDRY: FOUNDRY field. The company that created the font.
    - FAMILY: FAMILY_NAME field. Basic font family name. (helvetica, gothic,
    - WEIGHT: WEIGHT_NAME field. How thick the letters are. (light, medium,
    - SLANT: SLANT field.
    - WIDTH: SETWIDTH_NAME field. Width of characters. (normal, condensed,
    - STYLE: ADD_STYLE_NAME field. Extra info to describe font. (Serif, Sans
      Serif, Informal, Decorated, etc)
    - PIXEL: PIXEL_SIZE field. Height, in pixels, of characters.
    - POINT: POINT_SIZE field. Ten times height of characters in points.
    - X: RESOLUTION_X field. X resolution (dots per inch).
    - Y: RESOLUTION_Y field. Y resolution (dots per inch).
    - SPACE: SPACING field.
    - AVE: AVERAGE_WIDTH field. Ten times average width in pixels.
    - CR: CHARSET_REGISTRY field. The name of the font's |CCS|.
    - CE: CHARSET_ENCODING field. In some CCSes, such as the ISO-8859 series,
      this field is part of the |CCS| name. In other CCSes, such as JIS
      X 0208, if this field is 0, code points have the same value as GL,
For example, in the case of a 16-dot font corresponding to JIS X 0208, it is
    -misc-fixed-medium-r-normal--16-110-100-100-c-160-jisx0208.1990-0
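
The fourteen fields can be read back out of such a name by splitting on "-".
The following Python sketch is only an illustration under simplifying
assumptions: it handles fully specified names without wildcards, and it
assumes no field value itself contains a "-". The names XLFD_FIELDS and
parse_xlfd are made up here.

```python
XLFD_FIELDS = (
    "FOUNDRY", "FAMILY_NAME", "WEIGHT_NAME", "SLANT", "SETWIDTH_NAME",
    "ADD_STYLE_NAME", "PIXEL_SIZE", "POINT_SIZE", "RESOLUTION_X",
    "RESOLUTION_Y", "SPACING", "AVERAGE_WIDTH", "CHARSET_REGISTRY",
    "CHARSET_ENCODING",
)


def parse_xlfd(name: str) -> dict:
    """Map each XLFD field name to its value in a fully specified font name."""
    if not name.startswith("-"):
        raise ValueError("an XLFD name starts with '-'")
    values = name[1:].split("-")
    if len(values) != len(XLFD_FIELDS):
        raise ValueError(f"expected {len(XLFD_FIELDS)} fields, got {len(values)}")
    return dict(zip(XLFD_FIELDS, values))


font = parse_xlfd("-misc-fixed-medium-r-normal--16-110-100-100-c-160-jisx0208.1990-0")
print(font["PIXEL_SIZE"], font["CHARSET_REGISTRY"], font["CHARSET_ENCODING"])
```

The empty ADD_STYLE_NAME field is why two "-" characters appear next to each
other in the name.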
A |CCS| is typically associated with one font. Languages that need multiple
|CCS|es therefore need multiple fonts. X11R5 introduced the FontSet for the
internationalization of the output API. With it, Xlib takes care of switching
fonts for display. Up to X11R4, applications had to manage this themselves.

The |locale| database has the information about the |charset| of the |locale|:
which |CCS|(es) are needed and which |CES| the locale uses. When you use a
locale that needs multiple |CCS|es, you have to specify a font for each |CCS|
in the 'guifontset' option.
    |charset|    language                |CCS|es
    GB2312       Chinese (simplified)    ISO-8859-1 and GB 2312
    Big5         Chinese (traditional)   ISO-8859-1 and Big5
    CNS-11643    Chinese (traditional)   ISO-8859-1, CNS 11643-1 and CNS 11643-2
    EUC-JP       Japanese                JIS X 0201 and JIS X 0208
    EUC-KR       Korean                  ISO-8859-1 and KS C 5601 (KS X 1001)
The |XLFD| contains the |CCS| information, so you can find the font for a
|CCS| by searching in fonts.dir. The fonts.dir file is in each fonts directory
(e.g. /usr/X11R6/lib/X11/fonts/*). The format of the file is:
    First line:  the number of fonts listed in this fonts.dir
    Other lines: FILENAME |XLFD|
Or you can search for fonts using the xlsfonts command. For example, when
you're searching for a font for KS C 5601:
    > xlsfonts | grep ksc5601
will show you the list of matching fonts.
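
The fonts.dir search can also be done programmatically. Below is a small
Python sketch that follows the file format given above; the function name
fonts_for_ccs and the sample file contents (including the file names) are
invented for the illustration and do not describe any particular system.

```python
def fonts_for_ccs(fonts_dir_text: str, registry: str) -> list:
    """Return (filename, XLFD) pairs whose CHARSET_REGISTRY starts with registry."""
    lines = fonts_dir_text.splitlines()
    count = int(lines[0])                 # first line: number of fonts
    matches = []
    for line in lines[1:count + 1]:       # other lines: FILENAME XLFD
        filename, xlfd = line.split(None, 1)
        # CHARSET_REGISTRY is the second-to-last "-"-separated field.
        charset_registry = xlfd.rsplit("-", 2)[1]
        if charset_registry.lower().startswith(registry.lower()):
            matches.append((filename, xlfd))
    return matches


# Hypothetical fonts.dir contents, made up for this example.
SAMPLE = """\
2
k14.pcf.gz -misc-fixed-medium-r-normal--14-130-75-75-c-140-jisx0208.1983-0
hanglm16.pcf.gz -daewoo-mincho-medium-r-normal--16-120-100-100-c-160-ksc5601.1987-0
"""

for filename, xlfd in fonts_for_ccs(SAMPLE, "ksc5601"):
    print(filename, xlfd)
```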
*base_font_name_list*
In the 'guifontset' option and in ~/.Xdefaults you specify the
|base_font_name_list|, a comma-separated list of |XLFD| font names that Xlib
uses to load the fonts needed for the |locale|.

For example, when you use the ja_JP.eucJP |locale|, which requires the
JIS X 0201 and JIS X 0208 |CCS|es, you could supply a |base_font_name_list|
that explicitly specifies the charsets, like:
    guifontset=-misc-fixed-medium-r-normal--14-130-75-75-c-140-jisx0208.1983-0,
    \-misc-fixed-medium-r-normal--14-130-75-75-c-70-jisx0201.1976-0

Alternatively, you could supply a base font name list that omits the |CCS|
name, letting Xlib select the font characters required for the locale. For
example:
    guifontset=-misc-fixed-medium-r-normal--14-130-75-75-c-140,
    \-misc-fixed-medium-r-normal--14-130-75-75-c-70

Alternatively, you could supply a single base font name that allows Xlib to
select from all available fonts. For example:
    guifontset=-misc-fixed-medium-r-normal--14-*

Alternatively, the user could specify the alias name. See fonts.alias in
Note that in East Asian fonts, the standard character cell is square. When
mixing a Latin font and an East Asian font, the East Asian font width should
be twice the Latin font width. GVIM also needs fixed-width fonts.
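
The width rule can be checked from the AVERAGE_WIDTH fields of the font names
before they are put in 'guifontset'. The Python sketch below assumes a list of
exactly two fully specified |XLFD| names (no wildcards or aliases); the
function names are made up, and nothing like this check is claimed to exist
in Vim itself.

```python
def average_width(xlfd: str) -> int:
    """AVERAGE_WIDTH is the third field from the end of a full XLFD name."""
    return int(xlfd.split("-")[-3])


def widths_match(base_font_name_list: str) -> bool:
    """True if, of two listed fonts, the wider is twice as wide as the narrower."""
    names = [n.strip() for n in base_font_name_list.split(",") if n.strip()]
    widths = sorted(average_width(n) for n in names)
    return len(widths) == 2 and widths[1] == 2 * widths[0]


FONTSET = ("-misc-fixed-medium-r-normal--14-130-75-75-c-140-jisx0208.1983-0,"
           "-misc-fixed-medium-r-normal--14-130-75-75-c-70-jisx0201.1976-0")
print(widths_match(FONTSET))
```

With the 14-pixel example from the text, the JIS X 0208 font has an average
width field of 140 and the JIS X 0201 font one of 70, so the pair satisfies
the rule.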
X INPUT METHOD (XIM) *XIM* *xim* *x-input-method*
XIM (X Input Method) is an international input module for X. There are two
kinds of structures: the Xlib unit type and the |IM-server| (Input-Method
server) type. The |IM-server| type is suitable for complex input, like CJK
input.

With |IM-server| type input structures, input events are handled in one of
two ways: the FrontEnd system or the BackEnd system. In the FrontEnd system,
input events are first intercepted by the |IM-server|, which then gives the
application the result of the input. The BackEnd system works in the reverse
order. MS Windows adopts the BackEnd system. In X, most |IM-server|s adopt
the FrontEnd system. The drawback of the BackEnd system is the large
communication overhead, but it provides safe synchronization with no
restrictions on applications.

For example, xwnmo and kinput2 are Japanese |IM-server|s; both are FrontEnd
systems. Xwnmo is distributed with Wnn (see below), kinput2 can be found at:
ftp://ftp.sra.co.jp/pub/x11/kinput2/

For Chinese, there's a great XIM server named "xcin", with which you can input
both Traditional and Simplified Chinese characters. It can also accept other
locales if you make a correct input table. Xcin can be found at:
http://xcin.linux.org.tw/
Some systems need an additional server: a conversion server. Most Japanese
|IM-server|s need a Kana-Kanji conversion server. For Chinese input it
depends on the input method; some methods need a PinYin or ZhuYin to HanZi
conversion server. For Korean input, a Hangul-Hanja conversion server is
needed if you want to input Hanja.

For example, the Japanese input process is divided into two steps. There are
very many Kanji characters (6349 Kanji characters are defined in JIS X 0208),
while the number of Hira-gana characters is 76. So first we pre-input the text
as pronounced in Hira-gana, and second we convert the Hira-gana to Kanji or
Kata-Kana, if needed. There are several Kana-Kanji conversion servers: jserver
(distributed with Wnn, see below) and canna. Canna can be found at:
ftp://ftp.nec.co.jp/pub/Canna/

There is a good input system: Wnn4.2. Wnn 4.2 contains:
    xwnmo (|multilingualized| |IM-server|)
    jserver (Japanese Kana-Kanji conversion server)
    cserver (Chinese PinYin or ZhuYin to simplified HanZi conversion server)
    tserver (Chinese PinYin or ZhuYin to traditional HanZi conversion server)
    kserver (Hangul-Hanja conversion server)
Wnn 4.2 can be found at:
ftp://ftp.FreeBSD.ORG/pub/FreeBSD/ports/distfiles/Wnn4.2.tar.gz
When inputting CJK, four areas are needed:

    1. The area to display the input in progress.
    2. The area to display the input mode.
    3. The area to display the next candidates for selection.
    4. The area to display other tools.

The third area is needed when converting. For example, in Japanese input,
multiple Kanji characters can have the same pronunciation, so a sequence of
Hira-gana characters can map to more than one distinct sequence of Kanji
characters.

The first and second areas are defined in the international input of X under
the names "Preedit Area" and "Status Area" respectively. The third and fourth
areas are not defined and are left to be managed by the |IM-server|. In the
international input, four input styles have been defined using combinations
of Preedit Area and Status Area: |OnTheSpot|, |OffTheSpot|, |OverTheSpot|
and |Root|.

Currently, GUI Vim supports three styles: |OverTheSpot|, |OffTheSpot| and
|Root|.
*. on-the-spot *OnTheSpot*
    Preedit Area and Status Area are displayed by the client application in
    the area of the application. The client application is directed by the
    |IM-server| to display all pre-edit data at the location of text
    insertion. The client registers callbacks invoked by the input method
*. over-the-spot *OverTheSpot*
    Status Area is created in a fixed position within the area of the
    application; in the case of Vim, the position is the additional status
    line. Preedit Area is made at the present input position of the
    application. The input method displays pre-edit data in a window which it
    brings up directly over the text insertion position.
*. off-the-spot *OffTheSpot*
    Preedit Area and Status Area are displayed in the area of the application;
    in the case of Vim, the area is the additional status line. The client
    application provides display windows for the pre-edit data to the input
    method, which displays into them directly.
*. root-window *Root*
    Preedit Area and Status Area are displayed outside of the area of the
    application. The input method displays all pre-edit data in a separate
    area of the screen in a window specific to the input method.
LOCALIZATION, INTERNATIONALIZATION AND MULTILINGUALIZATION
*localized* *Localization* *L10N*
Localization (L10N) To fit a system or an application with a
*internationalized* *Internationalization* *I18N*
Internationalization (I18N) To enable a system or an application to fit
with a specific language according to the
*multilingualized* *Multilingualization* *M17N*
Multilingualization (M17N) To enable a system or an application to be
able to use multiple languages at the same

For example, JVim (Japanized version of Vim 3.0) is a |localized| application
for Japanese. Cxterm (|localized| xterm for Chinese), kterm (|localized| xterm
for Japanese) and hanterm (|localized| xterm for Korean) are also |localized|
applications. Gnome is an |internationalized| application. It can be
|localized| for many languages according to the |locale|. Mule (Multilingual
Enhancement for GNU Emacs) is a |multilingualized| application. It can handle
multiple |charset|s and can maintain a mixture of languages in a single

Vim is an |internationalized| application. So you can change the language by
specifying the |locale| and some options at startup.
==============================================================================
2. Compiling                        *multibyte-compiling*

-. Before you start to compile Vim, be sure that your system has the language
   |locale| of your choice. You might need to add "-DX_LOCALE" to CFLAGS.
   > ./configure --with-x --enable-multibyte --enable-fontset --enable-xim
-. You can use multi-byte in the Vim GUI, which fully supports the
   |+multi_byte| feature. If you only use console Vim, low-level multibyte
   input/output depends on your console. For example, if you run Vim in an
   xterm, you should use a |localized| xterm or an xterm which supports |XIM|.
   |localized| xterms are, for example, kterm (Kanji term) and hanterm (for
   Korean). Known |XIM| supporting xterms are Eterm (Enlightened terminal)
==============================================================================
3. Display                          *multibyte-display*

Note that Display and Input are independent. It is possible to see your
language even though you have no input method for it.

Multibyte output uses the |xfontset| feature.

-. Be sure that your system has the fonts corresponding to the |CCS|es which
   the |locale| needs to manage. See: |xfontset|.
-. The following are requirements for using a multibyte language.
   If needed, insert the lines below in your $HOME/.Xdefaults file.
   The GTK+ version of GUI Vim does not use .Xdefaults, thus this change is
   not needed for the GTK+ version.

   These 3 lines are specific to Vim:
       Vim.font: |base_font_name_list|
       Vim*fontSet: |base_font_name_list|
       Vim*fontList: your_language_font:
   Note: Vim.font is for the text area.
         Vim*fontSet is for the menu.
         Vim*fontList is for the menu (for the Motif GUI).

   For example, when you are using Japanese and a 14-dot font,
   > Vim.font: -misc-fixed-medium-r-normal--14-*
   > Vim*fontSet: -misc-fixed-medium-r-normal--14-*
   > Vim*fontList: -misc-fixed-medium-r-normal--14-*
   > Vim.fontSet: k14,r14
   You should set the 'guifontset' option to display a multi-byte language.
       :set guifontset=|base_font_name_list|
   For example, when you are using Japanese and a 14-dot font,
   > set guifontset=-misc-fixed-medium-r-normal--14-*
   > set guifontset=k14,r14
   Note: You cannot use IM unless you specify 'guifontset'.
   Therefore, Latin users, you have to also use 'guifontset'

   You should not set 'guifont'. If it is set, Vim ignores 'guifontset'.
   This means Vim runs without fontset support and you can see only English.
   Multi-byte characters are then displayed corrupted.

   After the |+xfontset| feature is enabled as explained above, Vim does not
   allow using 'font'. For example, if you use:
   > :set guifontset=eng_font,your_font
   in your .gvimrc, then you should use for highlighting:
   > :hi Comment font=another_eng_font,another_your_font
   > :hi Comment font=another_eng_font
   VIM will also try to use it as a fontset. So, if it cannot display your
   |locale| dependent codeset, you will see an error message.

-. In your .vimrc, add this
   > set fileencoding=korea
   You can change "korea" to some other name, such as japan or taiwan.
   See |'fileencoding'| for the supported encodings.

-. If a file's charset is different from your |locale|'s charset, you need to
   convert the charset. See |charset-conversion|.
==============================================================================
4. Input (XIM, X Input Method support)  *multibyte-input*

Note that Display and Input are independent. It is possible to see your
language even though you have no input method for it. But when your Display
method doesn't match your Input method, the text will be displayed wrong.

-. To input your language you should run the |IM-server| which supports your
   language, and a |conversion-server| if needed. Multibyte input uses |XIM|

   The next 3 lines are common to all X applications which use |XIM|.
   If you already use |XIM|, you can skip this.
   > *international: True
   > *.inputMethod: your_input_server_name
   > *.preeditType: your_input_style
   Note: input_server_name is your |IM-server| name (check your
         your_input_style is one of |OverTheSpot|, |OffTheSpot|, |Root|.
         See also |xim-input-style|.
         *international may not be necessary if you use X11R6.
         *.inputMethod and *.preeditType are optional if you use X11R6.

   For example, when you are using kinput2 as |IM-server|,
   > *international: True
   > *.inputMethod: kinput2
   > *.preeditType: OverTheSpot

   When using |OverTheSpot|, GUI Vim always connects to the IM Server, even in
   Normal mode, so you can input your language with commands like "f" and
   "r". But when using one of the other two methods, GUI Vim connects to the
   IM Server only if it is not in Normal mode.

   If your IM Server does not support |OverTheSpot|, and you want to use
   your language with some Normal mode command like "f" or "r", then you
   should use a |localized| xterm or an xterm which supports |XIM|
-. If needed, you can set the XMODIFIERS environment variable.
       sh:  export XMODIFIERS="@im=input_server_name"
       csh: setenv XMODIFIERS "@im=input_server_name"
   For example, when you are using kinput2 as |IM-server| and sh,
   > export XMODIFIERS="@im=kinput2"
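
When Vim is started from a script rather than from an interactive shell, the
same value can be put into the environment of the child process. The Python
sketch below only shows how the "@im=..." value is built and read back; the
helper names are made up, and it assumes the value is a concatenation of
"@category=value" entries as in the example above.

```python
import os


def xmodifiers_for(im_server: str) -> str:
    """Build the XMODIFIERS value that selects the given IM-server."""
    return f"@im={im_server}"


def im_server_from(xmodifiers: str):
    """Return the IM-server named by an XMODIFIERS value, or None."""
    for entry in xmodifiers.split("@"):
        category, sep, value = entry.partition("=")
        if category == "im" and sep:
            return value
    return None


# A copy of the current environment with XMODIFIERS set for kinput2.
env = dict(os.environ, XMODIFIERS=xmodifiers_for("kinput2"))
print(env["XMODIFIERS"], "->", im_server_from(env["XMODIFIERS"]))
```

The resulting env mapping could then be handed to whatever call launches the
GUI, for example the env argument of subprocess.Popen.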
Contributions specifically for the multi-byte features by:
    Chi-Deok Hwang <hwang@mizi.co.kr>
    Sung-Hyun Nam <namsh@lgic.co.kr>
    K.Nagano <nagano@atese.advantest.co.jp>
    Taro Muraoka <koron@tka.att.ne.jp>
    Yasuhiro Matsumoto <mattn@mail.goo.ne.jp>
==============================================================================
5. UTF-8 in XFree86 xterm           *UTF8-xterm*

This is a short explanation of how to use UTF-8 character encoding in the
xterm that comes with XFree86 by Thomas Dickey (text by Markus Kuhn).

NOTE: Editing and viewing UTF-8 text in Vim does not work as expected yet!

Get the latest xterm version, which now has UTF-8 support:
    http://www.clark.net/pub/dickey/xterm/xterm.tar.gz
Compile it with "./configure --enable-wide-chars ; make"

Also get the ISO 10646-1 version of the 6x13 font, which is available on
    http://www.cl.cam.ac.uk/~mgk25/download/ucs-fonts.tar.gz
and install the font as described in the README file.
    > xterm -u8 -fn -misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1
and you will have a working UTF-8 terminal emulator. Try both
with the demo text that comes with ucs-fonts.tar.gz in order to see
whether there are any problems with UTF-8 in your xterm.