Compound Text Encoding

X Consortium Standard

Robert W. Scheifler

X Consortium

X Version 11, Release 7.7

Version 1.1

Copyright © 1989 X Consortium

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
of the Software, and to permit persons to whom the Software is furnished to do
so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE X
CONSORTIUM BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Except as contained in this notice, the name of the X Consortium shall not be
used in advertising or otherwise to promote the sale, use or other dealings in
this Software without prior written authorization from the X Consortium.

X Window System is a trademark of The Open Group.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Table of Contents

Overview
Values
Control Characters
Standard Character Set Encodings
Approved Standard Encodings
Non-Standard Character Set Encodings
Directionality
Resources
Font Names
Extensions
Errors

Overview

Compound Text is a format for multiple character set data, such as
multi-lingual text. The format is based on ISO standards for encoding and
combining character sets. Compound Text is intended to be used in three main
contexts: inter-client communication using selections, as defined in the
Inter-Client Communication Conventions Manual (ICCCM); window properties (e.g.,
window manager hints as defined in the ICCCM); and resources (e.g., as defined
in Xlib and the Xt Intrinsics).

Compound Text is intended as an external representation, or interchange format,
not as an internal representation. It is expected (but not required) that
clients will convert Compound Text to some internal representation for
processing and rendering, and convert from that internal representation to
Compound Text when providing textual data to another client.

Values

The name of this encoding is "COMPOUND_TEXT". When text values are used in the
ICCCM-compliant selection mechanism or are stored as window properties in the
server, the type used should be the atom for "COMPOUND_TEXT".

Octet values are represented in this document as two decimal numbers in the
form col/row. This means the value (col * 16) + row. For example, 02/01 means
the value 33.

For our purposes, the octet encoding space is divided into four ranges:

C0 octets from 00/00 to 01/15
GL octets from 02/00 to 07/15
C1 octets from 08/00 to 09/15
GR octets from 10/00 to 15/15

C0 and C1 are "control character" sets, while GL and GR are "graphic character"
sets. Only a subset of C0 and C1 octets are used in the encoding, and depending
on the character set encoding defined as GL or GR, a subset of GL and GR octets
may be used; see below for details. All octets (00/00 to 15/15) may appear
inside the text of extended segments (defined below).

[For those familiar with ISO 2022, we will use only an 8-bit environment, and
we will always use G0 for GL and G1 for GR.]

Control Characters

In C0, only the following values will be used:

00/09 HT  HORIZONTAL TABULATION
00/10 NL  NEW LINE
01/11 ESC ESCAPE

In C1, only the following value will be used:

09/11 CSI CONTROL SEQUENCE INTRODUCER

[The alternate 7-bit CSI encoding 01/11 05/11 is not used in Compound Text.]

No control sequences are defined in Compound Text for changing the C0 and C1
sets.

A horizontal tab can be represented with the octet 00/09. Specification of
tabulation width settings is not part of Compound Text and must be obtained
from context (in an unspecified manner).

[Inclusion of horizontal tab is for consistency with the STRING type currently
defined in the ICCCM.]

A newline (line separator/terminator) can be represented with the octet 00/10.

[Note that 00/10 is normally LINEFEED, but is being interpreted as NEWLINE.
This can be thought of as using the (deprecated) NEW LINE mode, E.1.3, in ISO
6429. Use of this value instead of 08/05 (NEL, NEXT LINE) is for consistency
with the STRING type currently defined in the ICCCM.]

The remaining C0 and C1 values (01/11 and 09/11) are only used in the control
sequences defined below.

Standard Character Set Encodings

The default GL and GR sets in Compound Text correspond to the left and right
halves of ISO 8859-1 (Latin 1). As such, any legal instance of a STRING type
(as defined in the ICCCM) is also a legal instance of type COMPOUND_TEXT.

[The implied initial state in ISO 2022 is defined with the sequence:

01/11 02/00 04/03 G0 and G1 in an 8-bit environment only. Designation also
                  invokes.
01/11 02/00 04/07 In an 8-bit environment, C1 represented as 8 bits.
01/11 02/00 04/09 Graphic character sets can be 94 or 96.
01/11 02/00 04/11 8-bit code is used.
01/11 02/08 04/02 Designate ASCII into G0.
01/11 02/13 04/01 Designate right-hand part of ISO Latin-1 into G1.
]

To define one of the approved standard character set encodings to be the GL
set, one of the following control sequences is used:

01/11 02/08 {I} F       94 character set
01/11 02/04 02/08 {I} F 94^N character set

To define one of the approved standard character set encodings to be the GR
set, one of the following control sequences is used:

01/11 02/09 {I} F       94 character set
01/11 02/13 {I} F       96 character set
01/11 02/04 02/09 {I} F 94^N character set

The "F" in the control sequences above stands for "Final character", which is
always in the range 04/00 to 07/14. The "{I}" stands for zero or more
"intermediate characters", which are always in the range 02/00 to 02/15, with
the first intermediate character always in the range 02/01 to 02/03. The
registration authority has defined an "{I} F" sequence for each registered
character set encoding.

[Final characters for private encodings (in the range 03/00 to 03/15) are not
permitted here in Compound Text.]

For GL, octet 02/00 is always defined as SPACE, and octet 07/15 (normally
DELETE) is never used. For a 94-character set defined as GR, octets 10/00 and
15/15 are never used.

[This is consistent with ISO 2022.]

A 94^N character set uses N octets (N > 1) for each character. The value of N
is derived from the column value for F:

column 04 or 05 2 octets
column 06       3 octets
column 07       4 or more octets

In a 94^N encoding, the octet values 02/00 and 07/15 (in GL) and 10/00 and
15/15 (in GR) are never used.

[The column definitions come from ISO 2022.]

Once a GL or GR set has been defined, all further octets in that range (except
within control sequences and extended segments) are interpreted with respect to
that character set encoding, until the GL or GR set is redefined. GL and GR
sets can be defined independently; they do not have to be defined in pairs.

Note that when actually using a character set encoding as the GR set, you must
force the most significant bit (08/00) of each octet to be a one, so that it
falls in the range 10/00 to 15/15.

[Control sequences to specify character set encoding revisions (as in section
6.3.13 of ISO 2022) are not used in Compound Text. Revision indicators do not
appear to provide useful information in the context of Compound Text. The most
recent revision can always be assumed, since revisions are upward compatible.]

Approved Standard Encodings

The following are the approved standard encodings to be used with Compound
Text. Note that none have Intermediate characters; however, a good parser will
still deal with Intermediate characters in the event that additional encodings
are later added to this list.

┌─────┬─────┬─────────────────────────────────────────────────────────────────┐
│{I} F│94/96│Description                                                      │
├─────┼─────┼─────────────────────────────────────────────────────────────────┤
│04/02│94   │7-bit ASCII graphics (ANSI X3.4-1968), Left half of ISO 8859 sets│
├─────┼─────┼─────────────────────────────────────────────────────────────────┤
│04/09│94   │Right half of JIS X0201-1976 (reaffirmed 1984), 8-Bit            │
│     │     │Alphanumeric-Katakana Code                                       │
├─────┼─────┼─────────────────────────────────────────────────────────────────┤
│04/10│94   │Left half of JIS X0201-1976 (reaffirmed 1984), 8-Bit             │
│     │     │Alphanumeric-Katakana Code                                       │
├─────┼─────┼─────────────────────────────────────────────────────────────────┤
│04/01│96   │Right half of ISO 8859-1, Latin alphabet No. 1                   │
├─────┼─────┼─────────────────────────────────────────────────────────────────┤
│04/02│96   │Right half of ISO 8859-2, Latin alphabet No. 2                   │
├─────┼─────┼─────────────────────────────────────────────────────────────────┤
│04/03│96   │Right half of ISO 8859-3, Latin alphabet No. 3                   │
├─────┼─────┼─────────────────────────────────────────────────────────────────┤
│04/04│96   │Right half of ISO 8859-4, Latin alphabet No. 4                   │
├─────┼─────┼─────────────────────────────────────────────────────────────────┤
│04/06│96   │Right half of ISO 8859-7, Latin/Greek alphabet                   │
├─────┼─────┼─────────────────────────────────────────────────────────────────┤
│04/07│96   │Right half of ISO 8859-6, Latin/Arabic alphabet                  │
├─────┼─────┼─────────────────────────────────────────────────────────────────┤
│04/08│96   │Right half of ISO 8859-8, Latin/Hebrew alphabet                  │
├─────┼─────┼─────────────────────────────────────────────────────────────────┤
│04/12│96   │Right half of ISO 8859-5, Latin/Cyrillic alphabet                │
├─────┼─────┼─────────────────────────────────────────────────────────────────┤
│04/13│96   │Right half of ISO 8859-9, Latin alphabet No. 5                   │
├─────┼─────┼─────────────────────────────────────────────────────────────────┤
│04/01│94^2 │GB2312-1980, China (PRC) Hanzi                                   │
├─────┼─────┼─────────────────────────────────────────────────────────────────┤
│04/02│94^2 │JIS X0208-1983, Japanese Graphic Character Set                   │
├─────┼─────┼─────────────────────────────────────────────────────────────────┤
│04/03│94^2 │KS C5601-1987, Korean Graphic Character Set                      │
└─────┴─────┴─────────────────────────────────────────────────────────────────┘

The sets listed as "Left half of ..." should always be defined as GL. The sets
listed as "Right half of ..." should always be defined as GR. Other sets can be
defined either as GL or GR.

Non-Standard Character Set Encodings

Character set encodings that are not in the list of approved standard encodings
can be included using "extended segments". An extended segment begins with one
of the following sequences:

01/11 02/05 02/15 03/00 M L variable number of octets per character
01/11 02/05 02/15 03/01 M L 1 octet per character
01/11 02/05 02/15 03/02 M L 2 octets per character
01/11 02/05 02/15 03/03 M L 3 octets per character
01/11 02/05 02/15 03/04 M L 4 octets per character

[This uses the "other coding system" of ISO 2022, using private Final
characters.]

The "M" and "L" octets represent a 14-bit unsigned value giving the number of
octets that appear in the remainder of the segment. The number is computed as
((M - 128) * 128) + (L - 128). The most significant bits of M and L are always
set to one. The remainder of the segment consists of two parts, the name of the
character set encoding and the actual text. The name of the encoding comes
first and is separated from the text by the octet 00/02 (STX, START OF TEXT).
Note that the length defined by M and L includes the encoding name and
separator.

[The encoding of the length is chosen to avoid having zero octets in Compound
Text when possible, because embedded NUL values are problematic in many C
language routines. The use of zero octets cannot be ruled out entirely however,
since some octets in the actual text of the extended segment may have to be
zero.]

The name of the encoding should be registered with the X Consortium to avoid
conflicts and should when appropriate match the CharSet Registry and Encoding
registration used in the X Logical Font Description. The name itself should be
encoded using ISO 8859-1 (Latin 1), should not use question mark (03/15) or
asterisk (02/10), and should use hyphen (02/13) only in accordance with the X
Logical Font Description.

Extended segments are not to be used for any character set encoding that can be
constructed from a GL/GR pair of approved standard encodings. For example, it
is incorrect to use an extended segment for any of the ISO 8859 family of
encodings.

It should be noted that the contents of an extended segment are arbitrary; for
example, they may contain octets in the C0 and C1 ranges, including 00/00, and
octets comprising a given character may differ in their most significant bit.

[ISO-registered "other coding systems" are not used in Compound Text; extended
segments are the only mechanism for non-2022 encodings.]

Directionality

If desired, horizontal text direction can be indicated using the following
control sequences:

09/11 03/01 05/13 begin left-to-right text
09/11 03/02 05/13 begin right-to-left text
09/11 05/13       end of string

[This is a subset of the SDS (START DIRECTED STRING) control in the Draft
Bidirectional Addendum to ISO 6429.]

Directionality can be nested. Logically, a stack of directions is maintained.
Each of the first two control sequences pushes a new direction on the stack,
and the third sequence (revert) pops a direction from the stack. The stack
starts out empty at the beginning of a Compound Text string. When the stack is
empty, the directionality of the text is unspecified.

Directionality applies to all subsequent text, whether in GL, GR, or an
extended segment. If the desired directionality of GL, GR, or extended segments
differs, then directionality control sequences must be inserted when switching
between them.

Note that definition of GL and GR sets is independent of directionality;
defining a new GL or GR set does not change the current directionality, and
pushing or popping a directionality does not change the current GL and GR
definitions.

Specification of directionality is entirely optional; text direction should be
clear from context in most cases. However, it must be the case that either all
characters in a Compound Text string have explicitly specified direction or
that all characters have unspecified direction. That is, if directionality
control sequences are used, the first such control sequence must precede the
first graphic character in a Compound Text string, and graphic characters are
not permitted whenever the directionality stack is empty.

Resources

To use Compound Text in a resource, you can simply treat all octets as if they
were ASCII/Latin-1 and just replace all "\" octets (05/12) with the two octets
"\\", all newline octets (00/10) with the two octets "\n", and all zero octets
with the four octets "\000". It is up to the client making use of the resource
to interpret the data as Compound Text; the policy by which this is ascertained
is not constrained by the Compound Text specification.

Font Names

The following CharSet names for the standard character set encodings are
registered for use in font names under the X Logical Font Description:

┌───────────────┬──────────────────────────────┬──────────────────────────────┐
│Name           │Encoding Standard             │Description                   │
├───────────────┼──────────────────────────────┼──────────────────────────────┤
│ISO8859-1      │ISO 8859-1                    │Latin alphabet No. 1          │
├───────────────┼──────────────────────────────┼──────────────────────────────┤
│ISO8859-2      │ISO 8859-2                    │Latin alphabet No. 2          │
├───────────────┼──────────────────────────────┼──────────────────────────────┤
│ISO8859-3      │ISO 8859-3                    │Latin alphabet No. 3          │
├───────────────┼──────────────────────────────┼──────────────────────────────┤
│ISO8859-4      │ISO 8859-4                    │Latin alphabet No. 4          │
├───────────────┼──────────────────────────────┼──────────────────────────────┤
│ISO8859-5      │ISO 8859-5                    │Latin/Cyrillic alphabet       │
├───────────────┼──────────────────────────────┼──────────────────────────────┤
│ISO8859-6      │ISO 8859-6                    │Latin/Arabic alphabet         │
├───────────────┼──────────────────────────────┼──────────────────────────────┤
│ISO8859-7      │ISO 8859-7                    │Latin/Greek alphabet          │
├───────────────┼──────────────────────────────┼──────────────────────────────┤
│ISO8859-8      │ISO 8859-8                    │Latin/Hebrew alphabet         │
├───────────────┼──────────────────────────────┼──────────────────────────────┤
│ISO8859-9      │ISO 8859-9                    │Latin alphabet No. 5          │
├───────────────┼──────────────────────────────┼──────────────────────────────┤
│JISX0201.1976-0│JIS X0201-1976 (reaffirmed    │8-bit Alphanumeric-Katakana   │
│               │1984)                         │Code                          │
├───────────────┼──────────────────────────────┼──────────────────────────────┤
│GB2312.1980-0  │GB2312-1980, GL encoding      │China (PRC) Hanzi             │
├───────────────┼──────────────────────────────┼──────────────────────────────┤
│JISX0208.1983-0│JIS X0208-1983, GL encoding   │Japanese Graphic Character Set│
├───────────────┼──────────────────────────────┼──────────────────────────────┤
│KSC5601.1987-0 │KS C5601-1987, GL encoding    │Korean Graphic Character Set  │
└───────────────┴──────────────────────────────┴──────────────────────────────┘

Extensions

There is no absolute requirement for a parser to deal with anything but the
particular encoding syntax defined in this specification. However, it is
possible that Compound Text may be extended in the future, and as such it may
be desirable to construct the parser to handle 2022/6429 syntax more generally.

There are two general formats covering all control sequences that are expected
to appear in extensions:

01/11 {I} F

For this format, I is always in the range 02/00 to 02/15, and F is always in
the range 03/00 to 07/14.

09/11 {P} {I} F

For this format, P is always in the range 03/00 to 03/15, I is always in the
range 02/00 to 02/15, and F is always in the range 04/00 to 07/14.

In addition, new (singleton) control characters (in the C0 and C1 ranges) might
be defined in the future.

Finally, new kinds of "segments" might be defined in the future using syntax
similar to extended segments:

01/11 02/05 02/15 F M L

For this format, F is in the range 03/05 to 03/15. M and L are as defined in
extended segments. Such a segment will always be followed by the number of
octets defined by M and L. These octets can have arbitrary values and need not
follow the internal structure defined for current extended segments.

If extensions to this specification are defined in the future, then any string
incorporating instances of such extensions must start with one of the following
control sequences:

01/11 02/03 V 03/00 ignoring extensions is OK
01/11 02/03 V 03/01 ignoring extensions is not OK

In either case, V is in the range 02/00 to 02/15 and indicates the major
version minus one of the specification being used. These version control
sequences are for use by clients that implement earlier versions, but have
implemented a general parser. The first control sequence indicates that it is
acceptable to ignore all extension control sequences; no mandatory information
will be lost in the process. The second control sequence indicates that it is
unacceptable to ignore any extension control sequences; mandatory information
would be lost in the process. In general, it will be up to the client
generating the Compound Text to decide which control sequence to use.

Errors

If a Compound Text string does not match the specification here (e.g., uses
undefined control characters, or undefined control sequences, or incorrectly
formatted extended segments), it is best to treat the entire string as invalid,
except as indicated by a version control sequence.
 443 except as indicated by a version control sequence.
 444