1 # Copyright (C) 2008-2010, Parrot Foundation.
8 This PDD describes the conventions for strings in Parrot,
9 including but not limited to support for multiple character sets,
10 encodings, and languages.
20 A character is the abstract description of a symbol. It's the smallest
21 chunk of text a computer knows how to deal with. Internally to
22 the computer, a character (just like everything else) is a number, so
23 a few further definitions are needed.
27 The Unicode Standard prefers the concepts of I<character repertoire> (a
28 collection of characters) and I<character code> (a mapping which tells you
29 what number represents which character in the repertoire). Character set is
30 commonly used to mean the standard which defines both a repertoire and a code.
34 A codepoint is the numeric representation of a character according to a
35 given character set. So in ASCII, the character C<A> has codepoint 0x41.
An encoding determines how a codepoint is represented inside a computer.
Simple encodings like ASCII define that the codepoints 0-127 simply
live as their numeric equivalents inside eight-bit bytes. Other
fixed-width encodings like UCS-2 use more bytes to encode more
codepoints. Variable-width encodings like UTF-8 use one byte for
codepoints 0-127, two bytes for codepoints 128-2047, and so on.
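As a small illustration of the variable-width case, the following sketch computes how many bytes UTF-8 needs for a single codepoint. The function name C<utf8_encoded_length> is made up for this example and is not part of Parrot's API. The sketch does not reject surrogate values.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of bytes UTF-8 needs to encode one codepoint.
 * Returns 0 for values above 0x10FFFF, the top of the Unicode codespace.
 * Simplification: surrogates (0xD800-0xDFFF) are not rejected here. */
size_t utf8_encoded_length(uint32_t codepoint)
{
    if (codepoint <= 0x7F)     return 1;  /* 0-127, identical to ASCII */
    if (codepoint <= 0x7FF)    return 2;  /* 128-2047 */
    if (codepoint <= 0xFFFF)   return 3;  /* 2048-65535 */
    if (codepoint <= 0x10FFFF) return 4;  /* supplementary planes */
    return 0;
}
```

For example, C<CYRILLIC SMALL LETTER I> (C<0x438>) falls in the two-byte range, while C<A> (C<0x41>) needs a single byte.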
46 Character sets and encodings are related but separate concepts. An
47 encoding is the lower-level representation of a string's data, whereas
48 the character set determines higher-level semantics. Typically,
49 character set functions will ask a string's encoding functions to
50 retrieve data from the string, and then process the retrieved data.
52 =head3 Combining Character
A combining character is a Unicode concept. It is a character which
modifies the preceding character. Examples include accents, lines, circles,
and boxes, which are not displayed on their own but are composed with the
preceding character.
61 In linguistics, a grapheme is a single symbol in a writing system (letter,
62 number, punctuation mark, kanji, hiragana, Arabic glyph, Devanagari symbol,
63 etc), including any modifiers (diacritics, etc).
65 The Unicode Standard defines a I<grapheme cluster> (commonly simplified to
66 just I<grapheme>) as one or more characters forming a visible whole when
67 displayed, in other words, a bundle of a character and all of its combining
68 characters. Because graphemes are the highest-level abstract idea of a
69 "character", they're useful for converting between character sets.
71 =head3 Normalization Form
73 A normalization form standardizes the representation of a string by
74 transforming a sequence of combining characters into a more complex character
75 (composition), or by transforming a complex character into a sequence of
76 composing characters (decomposition). The decomposition forms also define a
77 standard order for the composing characters, to allow string comparisons. The
78 Unicode Standard defines four normalization forms: NFC and NFKC are
79 composition, NFD and NFKD are decomposition. See L<Unicode Normalization
80 Forms|http://www.unicode.org/reports/tr15/> for more details.
82 =head3 Grapheme Normalization Form
84 Grapheme normalization form (NFG) is a normalization which allocates exactly
85 one codepoint to each grapheme.
93 Parrot supports multiple string formats, and so users of Parrot strings must
94 be aware at all times of string encoding issues and how these relate to the
99 Parrot provides an interface for interacting with strings and converting
100 between character sets and encodings.
104 Operations that require understanding the semantics of a string must respect
105 the character set of the string.
109 Operations that require understanding the layout of the string must respect
110 the encoding of the string.
114 In addition to common string formats, Parrot provides an additional string
115 format that is a sequence of 32-bit Unicode codepoints in NFG.
119 =head2 Implementation
121 Parrot was designed from the outset to support multiple string formats:
122 multiple character sets and multiple encodings. We don't standardize on
123 Unicode internally, converting all strings to Unicode strings, because for the
124 majority of use cases it's still far more efficient to deal with whatever
125 input data the user sends us.
127 Consumers of Parrot strings need to be aware that there is a plurality of
128 string encodings inside Parrot. (Producers of Parrot strings can do whatever
129 is most efficient for them.) To put it in simple terms: if you find yourself
130 writing C<*s++> or any other C string idioms, you need to stop and think if
131 that's what you really mean. Not everything is byte-based anymore.
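To make the point concrete, here is a minimal sketch, assuming well-formed UTF-8 input, that counts codepoints by skipping continuation bytes. The name C<utf8_codepoint_count> is hypothetical and not a Parrot function. It shows that stepping through a buffer one byte at a time does not step one character at a time.

```c
#include <stddef.h>

/* Count the codepoints in a UTF-8 buffer.
 * Continuation bytes have the bit pattern 10xxxxxx; every other byte
 * starts a new codepoint. Assumes the input is well-formed UTF-8. */
size_t utf8_codepoint_count(const unsigned char *buf, size_t byte_len)
{
    size_t count = 0;
    for (size_t i = 0; i < byte_len; i++) {
        if ((buf[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

The four bytes C<0xD0 0xB8 0xCC 0x8F> encode only two codepoints (C<0x438> followed by C<0x30F>), so a byte-wise loop over them would visit twice as many positions as there are characters.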
133 =head3 Grapheme Normalization Form
135 Unicode characters can be expressed in a number of different ways according to
136 the Unicode Standard. This is partly to do with maintaining compatibility with
137 existing character encodings. For instance, in Serbo-Croatian and Slovenian,
138 there's a letter which looks like an C<i> without the dot but with two grave
139 (C<`>) accents (E<0x209>). Unicode can represent this letter as a composed
140 character C<0x209>, also known as C<LATIN SMALL LETTER I WITH DOUBLE GRAVE>,
141 which does the job all in one go. It can also represent this letter as a
142 decomposed sequence: C<LATIN SMALL LETTER I> (C<0x69>) followed by C<COMBINING
143 DOUBLE GRAVE ACCENT> (C<0x30F>). We use the term I<grapheme> to refer to a
144 "letter" whether it's represented by a single codepoint or multiple
147 String operations on this kind of variable-byte encoding can be complex and
148 expensive. Operations like comparison and traversal require a series of
149 computations and lookaheads, because any given grapheme may be a sequence of
150 combining characters. The Unicode Standard defines several "normalization
151 forms" that help with this problem. Normalization Form C (NFC), for example,
152 decomposes everything, then re-composes as much as possible. So if you see the
153 integer stream C<0x69 0x30F>, it needs to be replaced by C<0x209>. However,
154 Unicode's normalization forms don't go quite far enough to completely solve
155 the problem. For example, Serbo-Croat is sometimes also written with Cyrillic
156 letters rather than Latin letters. Unicode doesn't have a single composed
157 character for the Cyrillic equivalent of the Serbo-Croat C<LATIN SMALL LETTER
158 I WITH DOUBLE GRAVE>, so it is represented as a decomposed pair C<CYRILLIC
159 SMALL LETTER I> (C<0x438>) with C<COMBINING DOUBLE GRAVE ACCENT> (C<0x30F>).
160 This means that even in the most normalized Unicode form, string manipulation
161 code must always assume a variable-byte encoding, and use expensive
162 lookaheads. The cost is incurred on every operation, though the particular
163 string operated on might not contain combining characters. It's particularly
noticeable in parsing and regular expression matches, where backtracking
165 operations may re-traverse the characters of a simple string hundreds of
168 In order to reduce the cost of variable-byte operations and simplify some
169 string manipulation tasks, Parrot defines an additional normalization:
170 Normalization Form G (NFG). In NFG, every grapheme is guaranteed to be
171 represented by a single codepoint. Graphemes that don't have a single
172 codepoint representation in Unicode are given a dynamically generated
173 codepoint unique to the NFG string.
175 An NFG string is a sequence of signed 32-bit Unicode codepoints. It's
176 equivalent to UCS-4 except for the normalization form semantics. UCS-4
177 specifies an encoding for Unicode codepoints from 0 to 0x7FFFFFFF. In other
178 words, any codepoints with the first bit set are undefined. NFG interprets the
179 unused bit as a sign bit, and reserves all negative codepoints as dynamic
180 codepoints. A negative codepoint acts as an index into a lookup table, which
181 maps between a dynamic codepoint and its associated decomposition.
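The following sketch shows one way the sign test and table lookup could look. All names here (C<nfg_codepoint>, C<grapheme_entry>, C<nfg_is_dynamic>, C<nfg_table_slot>, C<nfg_decompose>) are invented for illustration, and the mapping of dynamic codepoint -1 to array slot 0 is an assumption about layout, not something this document specifies.

```c
#include <stddef.h>
#include <stdint.h>

typedef int32_t nfg_codepoint;          /* hypothetical type name */

/* One table entry: the codepoint sequence a dynamic codepoint stands for. */
typedef struct {
    const uint32_t *codepoints;
    size_t          count;
} grapheme_entry;

/* A codepoint is dynamic exactly when it is negative. */
int nfg_is_dynamic(nfg_codepoint cp)
{
    return cp < 0;
}

/* Assumed layout: -1 maps to slot 0, -2 to slot 1, and so on.
 * Only meaningful for negative (dynamic) codepoints. */
size_t nfg_table_slot(nfg_codepoint cp)
{
    return (size_t)(-(cp + 1));
}

/* Return the decomposition for a dynamic codepoint, or NULL if the
 * codepoint is an ordinary one or has no entry in the table. */
const grapheme_entry *nfg_decompose(const grapheme_entry *table,
                                    size_t table_len, nfg_codepoint cp)
{
    if (!nfg_is_dynamic(cp))
        return NULL;
    size_t slot = nfg_table_slot(cp);
    return slot < table_len ? &table[slot] : NULL;
}
```

Code that does not care about decompositions never needs to call the lookup. It can treat a negative value as an opaque character.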
183 In practice, this goes as follows: When our Russified Serbo-Croat string is
184 converted to NFG, it is normalized to a single character having the codepoint
C<0xFFFFFFFF> (in other words, -1 in 2's complement). At the same time,
186 Parrot inserts an entry into the string's grapheme table at array index -1,
187 containing the Unicode decomposition of the grapheme C<0x00000438
190 Parrot will provide both grapheme-aware and codepoint-aware string operations,
191 such as iterators for string traversal and calculations of string length.
192 Individual language implementations can choose between the two types of
193 operations depending on whether their string semantics are character-based or
194 codepoint-based. For languages that don't currently have Unicode support, the
195 grapheme operations will allow them to safely manipulate Unicode data without
196 changing their string semantics.
Applications that don't care about graphemes can handle an NFG codepoint in a
201 string as if it's any other character. Only applications that care about the
202 specific properties of Unicode characters need to take the load of peeking
203 inside the grapheme table and reading the decomposition.
205 Using negative numbers for dynamic codepoints allows Parrot to check if a
206 particular codepoint is dynamic using a single sign-comparison operation. It
207 also means that NFG can be used without conflict on encodings from 7-bit
208 (signed 8-bit integers) to 63-bit (using signed 64-bit integers) and beyond.
210 Because any grapheme from any character set can be represented by a single NFG
211 codepoint, NFG strings are useful as an intermediate representation for
212 converting between string types.
216 A 32-bit encoding is quite large, considering the fact that the Unicode
217 codespace only requires up to C<0x10FFFF>. The Unicode Consortium's FAQ notes
218 that most Unicode interfaces use UTF-16 instead of UTF-32, out of memory
219 considerations. This means that although Parrot will use 32-bit NFG strings
220 for optimizations within operations, for the most part individual users should
221 use the native character set and encoding of their data, rather than using NFG
224 The conceptual cost of adding a normalization form beyond those defined in the
225 Unicode Standard has to be considered. However, to fully support Unicode,
226 Parrot already needs to keep track of what normalization form a given string
227 is in, and provide functions to convert between normalization forms. The
228 conceptual cost of one additional normalization form is relatively small.
230 =head4 The grapheme table
232 When constructing strings in NFG, graphemes not expressible as a single
233 character in Unicode are represented by a dynamic codepoint index into the
234 string's grapheme table. When Parrot comes across a multi-codepoint grapheme,
235 it must first determine whether or not the grapheme already has an entry in
236 the grapheme table. Therefore the table cannot strictly be an array, as that
237 would make lookup inefficient. The grapheme table is represented, then, as
238 both an array and a hash structure. The array interface provides
239 forward-lookup and the hash interface reverse lookup. Converting a
240 multi-codepoint grapheme into a dynamic codepoint can be demonstrated with the
241 following Perl 5 pseudocode, for the grapheme C<0x438 0x30F>:
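  # Reuse the dynamic codepoint if this grapheme was seen before;
  # otherwise append it to the table and allocate the next negative index.
  $codepoint = ($grapheme_lookup->{0x438}{0x30F} ||= do {
      push @grapheme_table, "\x{438}\x{30F}";
      -scalar(@grapheme_table);   # first entry gets -1, the next -2, ...
  });
  push @string, $codepoint;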
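The same idea can be sketched in C. This is an illustration under stated assumptions, not Parrot's implementation: the type C<grapheme_table>, the function C<grapheme_intern>, and the fixed capacities are all invented here, and a linear scan stands in for the hash-based reverse lookup described above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_GRAPHEMES 64   /* arbitrary capacity for this sketch */
#define MAX_SEQ_LEN    8   /* longest codepoint sequence stored */

typedef struct {
    uint32_t seq[MAX_GRAPHEMES][MAX_SEQ_LEN];  /* forward lookup array */
    size_t   len[MAX_GRAPHEMES];
    size_t   used;
} grapheme_table;

/* Return the dynamic (negative) codepoint for a multi-codepoint grapheme,
 * adding it to the table the first time it is seen.
 * Returns 0 if the sequence is empty or too long, or the table is full. */
int32_t grapheme_intern(grapheme_table *t, const uint32_t *seq, size_t n)
{
    if (n == 0 || n > MAX_SEQ_LEN)
        return 0;

    /* Reverse lookup: a linear scan stands in for the hash. */
    for (size_t i = 0; i < t->used; i++) {
        if (t->len[i] == n && memcmp(t->seq[i], seq, n * sizeof *seq) == 0)
            return -(int32_t)(i + 1);
    }

    if (t->used == MAX_GRAPHEMES)
        return 0;

    /* Not found: append, then hand out the next negative index. */
    memcpy(t->seq[t->used], seq, n * sizeof *seq);
    t->len[t->used] = n;
    t->used++;
    return -(int32_t)t->used;
}
```

Interning C<0x438 0x30F> twice yields the same dynamic codepoint both times, which is the property the reverse lookup exists to guarantee.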
251 Strings in the Parrot core should use the Parrot C<STRING> structure. Parrot
252 developers generally shouldn't deal with C<char *> or other string-like types
253 outside of this abstraction. It's also best not to access members of the
254 C<STRING> structure directly. The interpretation of the data inside the
255 structure is determined by the data's encoding. Parrot's strings are
256 encoding-aware so your functions don't need to be.
258 Parrot's internal strings (C<STRING>s) have the following structure:
260 struct parrot_string_t {
268 const struct _encoding *encoding;
277 A pointer to the buffer for the string data.
281 The size of the buffer in bytes.
285 Binary flags used for garbage collection, copy-on-write tracking, and other
290 The amount of the buffer currently in use, in bytes.
294 The length of the string, in bytes. {{NOTE, not in characters, as characters
295 may be variably sized.}}
299 A cache of the hash value of the string, for rapid lookups when the string is
304 What sort of string data is in the buffer, for example ASCII, ISO-8859-1,
307 The encoding structure specifies the encoding (by index number and by name,
308 for ease of lookup), the maximum number of bytes that a single character will
309 occupy in that encoding, as well as functions for manipulating strings with
314 {{DEPRECATION NOTE: the enum C<parrot_string_representation_t> will be removed
315 from the parrot string structure. It's been commented out for years.}}
317 {{DEPRECATION NOTE: the C<char *> pointer C<strstart> will be removed. It
318 complicates the entire string subsystem for a tiny optimization on substring
319 operations, and offset math is messy with encodings that aren't byte-based.}}
321 =head4 Conversions between normalization form, encoding, and charset
323 Conversion will be done with a function called C<Parrot_str_grapheme_copy>:
325 INTVAL Parrot_str_grapheme_copy(STRING *src, STRING *dst)
327 Converting a string from one format to another involves creating a new empty
328 string with the required attributes, and passing the source string and the new
329 string to C<Parrot_str_grapheme_copy>. This function iterates through the
330 source string one grapheme at a time, using the character set function pointer
331 C<get_grapheme> (which may read ahead multiple characters with strings that
332 aren't in NFG). For each source grapheme, the function will call
333 C<set_grapheme> on the destination string (which may append multiple
334 characters in non-NFG strings). This conversion effectively uses an
335 intermediate NFG representation.
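The loop described above might be sketched as follows. Everything in this example is hypothetical: C<toy_string> is a stand-in for C<STRING>, the C<get_grapheme> and C<set_grapheme> signatures are guesses at what such function pointers could look like, and returning the number of graphemes copied is an assumption, since the meaning of the C<INTVAL> result is not stated here. Combining marks are approximated by the range C<0x300>-C<0x36F> only.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_GRAPHEME_LEN 8

/* One grapheme: a base codepoint plus any combining marks. */
typedef struct { uint32_t cp[MAX_GRAPHEME_LEN]; size_t n; } grapheme;

typedef struct toy_string toy_string;
struct toy_string {
    uint32_t *data;   /* codepoints */
    size_t    len;    /* codepoints in use */
    size_t    cap;    /* capacity in codepoints */
    /* Read the grapheme at *pos and advance *pos; return 0 at the end. */
    int (*get_grapheme)(const toy_string *s, size_t *pos, grapheme *out);
    /* Append one grapheme; return 0 if it does not fit. */
    int (*set_grapheme)(toy_string *s, const grapheme *g);
};

/* Simplification: only the Combining Diacritical Marks block. */
static int is_combining(uint32_t cp) { return cp >= 0x0300 && cp <= 0x036F; }

int cp_get_grapheme(const toy_string *s, size_t *pos, grapheme *out)
{
    if (*pos >= s->len)
        return 0;
    out->n = 0;
    out->cp[out->n++] = s->data[(*pos)++];
    /* Read ahead over combining marks (up to the sketch's size limit). */
    while (*pos < s->len && is_combining(s->data[*pos])
           && out->n < MAX_GRAPHEME_LEN)
        out->cp[out->n++] = s->data[(*pos)++];
    return 1;
}

int cp_set_grapheme(toy_string *s, const grapheme *g)
{
    if (s->len + g->n > s->cap)
        return 0;
    for (size_t i = 0; i < g->n; i++)
        s->data[s->len++] = g->cp[i];
    return 1;
}

/* Copy src to dst one grapheme at a time.
 * Returns the number of graphemes copied, or -1 if dst ran out of room. */
long toy_grapheme_copy(const toy_string *src, toy_string *dst)
{
    size_t pos = 0;
    long copied = 0;
    grapheme g;
    while (src->get_grapheme(src, &pos, &g)) {
        if (!dst->set_grapheme(dst, &g))
            return -1;
        copied++;
    }
    return copied;
}
```

In this toy both strings share one format, so the copy is trivial. The point is the shape of the loop: a destination with a different C<set_grapheme> (for example, one that interns multi-codepoint graphemes into a table) would turn the same loop into a format conversion.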
338 =head3 String Interface Functions
340 The current string functions will be maintained, with some modifications for
341 the addition of the NFG string format. Many string functions that are part of
342 Parrot's external API will be renamed for the standard "Parrot_*" naming
345 =head4 Parrot_str_concat (was string_concat)
347 Concatenate two strings. Takes two strings as arguments.
349 =head4 Parrot_str_new (was string_from_cstring)
351 Return a new string with the default encoding and character set. Accepts two
352 arguments, a C string (C<char *>) to initialize the value of the string, and
353 an integer length of the string (number of characters). If the integer length
354 isn't passed, the function will calculate the length.
356 {{NOTE: the integer length isn't really necessary, and is under consideration
359 =head4 Parrot_str_new_noinit (was string_make_empty)
361 Returns a new empty string with the default encoding and character set.
363 =head4 Parrot_str_new_init (was string_make_direct)
365 Returns a new string of the requested encoding, character set, and
366 normalization form, initializing the string value to the value passed in. The
367 three arguments are a C string (C<char *>), an integer length of the string
368 argument in bytes, and a struct pointer for the encoding struct. If the C
369 string (C<char *>) value is not passed, returns an empty string. If the
370 encoding is passed as null value, a default value is used.
372 {{ NOTE: the crippled version of this function, C<string_make>, used to accept
373 a string name for the character set. This behavior is no longer supported, but
374 C<Parrot_find_encoding> and C<Parrot_find_charset> can look up the encoding or
375 character set structs. }}
377 =head4 Parrot_str_new_constant (was const_string)
379 Creates and returns a new Parrot constant string. Takes one C string (a C<char
380 *>) as an argument, the value of the constant string. The length of the C
381 string is calculated internally.
383 =head4 Parrot_str_length (was string_compute_strlen)
Returns the number of characters in the string. Combining characters are each
counted separately. Variable-width encodings may look ahead.
388 =head4 Parrot_str_grapheme_length
390 Returns the number of graphemes in the string. Groups of combining characters
391 count as a single grapheme.
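The difference between the two lengths can be shown with a small sketch over an array of codepoints. The function names are hypothetical, and combining characters are approximated by the Combining Diacritical Marks block (C<0x300>-C<0x36F>); a real implementation would consult the Unicode character database.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplification: treat only 0x0300-0x036F as combining characters. */
int is_combining_mark(uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Character length: every codepoint counts, combining or not. */
size_t codepoint_length(const uint32_t *s, size_t n)
{
    (void)s;
    return n;
}

/* Grapheme length: a base character and the combining marks that
 * follow it count once. A leading combining mark counts as its own
 * grapheme in this sketch. */
size_t grapheme_length(const uint32_t *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (i == 0 || !is_combining_mark(s[i]))
            count++;
    }
    return count;
}
```

For the sequence C<0x438 0x30F>, the character length is 2 and the grapheme length is 1.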
393 =head4 Parrot_str_byte_length (was string_length)
395 Returns the number of bytes in the string. The character width of
396 variable-width encodings is ignored. Combining characters are not treated any
397 differently than other characters. This is equivalent to accessing the
398 C<strlen> member of the C<STRING> struct directly.
400 =head4 Parrot_str_indexed (was string_index)
Returns the character at the specified index (the Nth character from the start
of the string). Combining characters are counted separately. Variable-width
encodings will look ahead to capture full character values.
406 =head4 Parrot_str_grapheme_indexed
408 Returns the grapheme at the given index (the Nth grapheme from the string's
409 start). Groups of combining characters count as a single grapheme, so this
410 function may return multiple characters.
412 =head4 Parrot_str_find_index (was string_str_index)
414 Search for a given substring within a string. If it's found, return an integer
415 index to the substring's location (the Nth character from the start of the
416 string). Combining characters are counted separately. Variable-width encodings
will look ahead to capture full character values. Returns -1 unless the
420 =head4 Parrot_str_copy (was string_copy)
Make a COW copy of a string (a new string header pointing to the same string
425 =head4 Parrot_str_grapheme_copy (new)
427 Accepts two string arguments: a destination and a source. Iterates through the
428 source string one grapheme at a time and appends it to the destination string.
430 This function can be used to convert a string of one format to another format.
432 =head4 Parrot_str_repeat (was string_repeat)
434 Return a string containing the passed string argument, repeated the number of
435 times in the integer argument.
437 =head4 Parrot_str_substr (was string_substr)
439 Return a substring starting at an integer offset with an integer length. The
440 offset and length specify characters. Combining characters are counted
separately. Variable-width encodings will look ahead to capture full character
444 =head4 Parrot_str_grapheme_substr
446 Return a substring starting at an integer offset with an integer length. The
447 offset and length specify graphemes. Groups of combining characters count as a
450 =head4 Parrot_str_replace (was string_replace)
452 Replaces a substring within the first string argument with the second string
453 argument. An integer offset and length, in characters, specify where the
454 removed substring starts and how long it is.
456 =head4 Parrot_str_grapheme_replace
458 Replaces a substring within the first string argument with the second string
459 argument. An integer offset and length in graphemes specify where the removed
460 substring starts and how long it is.
462 =head4 Parrot_str_chopn (was string_chopn)
464 Chop the requested number of characters off the end of a string without
465 modifying the original string.
467 =head4 Parrot_str_grapheme_chopn
469 Chop the requested number of graphemes off the end of a string without
470 modifying the original string.
472 =head4 Parrot_str_compare (was string_compare)
474 Compare two strings to each other. Return 0 if they are equal, 1 if the first
475 is greater and -1 if the second is greater. Uses character set collation order
476 for the comparison. (Two strings that are logically equivalent in terms of
477 display, but stored in different normalizations are not equal.)
479 =head4 Parrot_str_grapheme_compare
481 Compare two strings to each other. Return 0 if they are equal, 1 if the first
482 is greater and -1 if the second is greater. Uses NFG normalization to compare
485 =head4 Parrot_str_equal
487 Compare two strings, return 1 if they are equal, 0 if they are not equal.
489 =head4 Parrot_str_not_equal (was string_equal)
491 Compare two strings, return 0 if they are equal, 1 if they are not equal.
{{DEPRECATION NOTE: The return value of C<Parrot_str_equal> is reversed from
the old logic, but C<Parrot_str_not_equal> is provided as a drop-in
replacement for the old function.}}
497 =head4 Parrot_str_grapheme_equal
499 Compare two strings using NFG normalization, return 1 if they are equal, 0 if
502 =head4 Parrot_str_split
504 Splits the string C<str> at the delimiter C<delim>.
506 =head3 Internal String Functions
508 The following functions are used internally and are not part of the public
511 =head4 Parrot_str_init (was string_init)
513 Initialize Parrot's string subsystem, including string allocation and garbage
516 =head4 Parrot_str_finish (was string_deinit)
518 Terminate and clean up Parrot's string subsystem, including string allocation
519 and garbage collection.
521 =head3 Deprecated String Functions
523 The following string functions are slated to be deprecated.
525 =head4 string_max_bytes
527 Calculate the number of bytes needed to hold a given number of characters in a
528 particular encoding, multiplying the maximum possible width of a character in
529 the encoding by the number of characters requested.
531 {{NOTE: pretty primitive and not very useful. May be deprecated.}}
533 =head4 string_primary_encoding_for_representation
Not useful; it only ever returned ASCII.
537 =head4 string_rep_compatible
539 Only useful on a very narrow set of string encodings/character sets.
543 A crippled version of a string initializer, now replaced with the full version
544 C<Parrot_str_new_init>.
546 =head4 string_capacity
548 This was used to calculate the size of the buffer after the C<strstart>
549 pointer. Deprecated with C<strstart>.
553 Replaced by C<Parrot_str_indexed>.
This is handled just fine by C<Parrot_str_new>; we don't need a special
version for a single character.
562 An archaic function that uses a method of describing strings that hasn't been
565 =head4 string_to_cstring_nullable
567 Just the implementation of string_to_cstring, no need for a separate function
568 that specially allows returning a NULL string.
570 =head4 string_increment
Old Perl 5-style behavior where "aa" goes to "ab". Only useful for ASCII
strings, and not terribly useful even there.
575 =head4 Parrot_str_cstring
577 Unsafe, and behavior handled by Parrot_str_to_cstring.
579 =head4 Parrot_str_free (was string_free)
Unsafe and not useful; let the garbage collector take care of it.
583 =head3 String PMC API
585 The String PMC provides a high-level object interface to the string
586 functionality. It contains a standard Parrot string, holding the string data.
588 =head4 Vtable Functions
590 The String PMC implements the following vtable functions.
596 Initialize a new String PMC.
604 Mark the string value of the String PMC as live.
609 Return the integer representation of the string.
613 Return the floating-point representation of the string.
617 Return the string value of the String PMC.
621 Return the boolean value of the string.
623 =item set_integer_native
625 Set the string to an integer value, transforming the integer to its string
630 Set the string to a boolean (integer) value, transforming the boolean to its
633 =item set_number_native
635 Set the string to a floating-point value by transforming the number to its
638 =item set_string_native
640 Set the String PMC's stored string value to be the string argument. If the
641 passed in string is a constant, store a copy.
643 =item assign_string_native
645 Set the String PMC's stored string value to a copy of the string argument.
647 =item set_string_same
649 Set the String PMC's stored string value to the same as another String PMC's
650 stored string value. {{NOTE: uses direct access into the storage of the two
655 Set the String PMC's stored string value to the same as another PMC's string
656 value, as returned by that PMC's C<get_string> vtable function.
660 All the bitwise string vtable functions, for AND, OR, XOR, and NOT, both
661 inplace and standard return.
665 Compares the string values of two PMCs and returns true if they match exactly.
669 Compares the numeric values of two PMCs (first transforming any strings to
670 numbers) and returns true if they match exactly.
672 =item is_equal_string
674 Compares the string values of two PMCs and returns true if they match exactly.
675 {{ NOTE: the documentation for the PMC says that it returns FALSE if they
676 match. This is not the desired behavior. }}
680 Compares two PMCs and returns true if they are the same PMC class and contain
681 the same string (not an equivalent string value, but aliases to the same
686 Compares two PMCs and returns 1 if SELF is shorter, 0 if they are equal length
687 strings, and -1 if the passed in string argument is shorter.
691 Compares the numeric values of two PMCs (first changing those values to
692 numbers) and returns 1 if SELF is smaller, 0 if they are equal, and -1 if the
693 passed in string argument is smaller.
697 Compares two PMCs and returns 1 if SELF is shorter, 0 if they are equal length
698 strings, and -1 if the passed in string argument is shorter.
702 Extract a substring of a given length starting from a given offset (in
703 graphemes) and store the result in the string argument.
707 Extract a substring of a given length starting from a given offset (in
708 graphemes) and return the string.
712 Return true if the Nth grapheme in the string exists. Negative numbers count
715 =item get_string_keyed
717 Return the Nth grapheme in the string. Negative numbers count from the end.
719 =item set_string_keyed
721 Insert a string at the Nth grapheme position in the string. {{NOTE: this is
722 different than the current implementation.}}
724 =item get_integer_keyed
726 Returns the integer value of the Nth C<char> in the string. {{DEPRECATE}}
728 =item set_integer_keyed
730 Replace the C<char> at the Nth character position in the string with the
731 C<char> that corresponds to the passed integer value key. {{DEPRECATE}}
737 The String PMC provides the following methods.
743 Replace every occurrence of one string with another.
747 Return the integer equivalent of a string.
751 Change the string to all lowercase.
755 Translate an ASCII string with entries from a translation table.
757 {{NOTE: likely to be deprecated.}}
761 Reverse a string, one grapheme at a time. {{ NOTE: Currently only works for
762 ASCII strings, because it reverses one C<char> at a time. }}
767 Checks if the string is just an integer. {{ NOTE: Currently only works for
768 ASCII strings, fix or deprecate. }}
775 L<http://sirviente.9grid.es/sources/plan9/sys/doc/utf.ps> - Plan 9's Runes are
776 not dissimilar to NFG strings, and this is a good introduction to the Unicode
779 L<http://www.unicode.org/reports/tr15/> - The Unicode Consortium's
780 explanation of different normalization forms.
782 L<http://unicode.org/reports/tr29/> - "grapheme clusters" in the Unicode
785 "Unicode: A Primer", Tony Graham - Arguably the most readable book on
788 "Advanced Perl Programming", Chapter 6, "Unicode"