esc/es4-doc/unicode.txt

   1 NAME:       "Unicode updates"
   2 CATEGORY:   Lexical conventions (E262-3 ch 7)
   3 SOURCES:    Reference [1]
   4 AUTHOR:     ?
   5 STATUS:     ?
   6 REVIEWS:    ?
   7 RI STATUS:  ?
   8 ESC STATUS: ?
   9 TEST CASE:  ?
  10
  11
  12            **** VERY PRELIMINARY ****
  13
  14
  15
  16 THE CONSTRAINTS
  17
  18   * Don't really want to force implementations to move to full
  19     Unicode, though most probably will anyhow.  (A lot of hair with
  20     the browser API, DOM, etc -- significant work.)  Would be nice to
  21     let browsers stay in the ES3 world for another iteration.
  22
  23   * Would like to define what it means to implement full Unicode in
  24     the language.
  25
  26 ISSUES
  27
  28   * Backward compatibility:
  29
  30     Currently "\uXXXX\uYYYY".length is 2 even if the two code points
  31     form a surrogate pair (i.e., a single character).  But Waldemar
  32     claims the Unicode people will lynch us if we support full Unicode
  33     and don't require merging.
  34
  35     Ditto for the use of Unicode escapes in identifiers.
  36
  37     String.prototype.charCodeAt is defined as returning a number less
  38     than 2^16 (or NaN).
  39
  40     String.fromCharCode is defined as chopping each argument
  41     (after conversion to an integer) to 16 bits.
  42
  43     If s1 and s2 are strings, then (s1 + s2).length == s1.length + s2.length.
  44     But if we merge surrogate pairs on concatenation, that will no longer
  45     be true.  (This is not the same case as the first one above: that one
  46     concerns source text, this one run-time semantics.)
  47
  48   * Usability / reliability of the solution
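The backward-compatibility points listed under the first issue can be
observed directly.  The following is a minimal sketch, assuming an
engine that keeps the ES3 model of strings as sequences of 16-bit
values; the character U+1D400 and all variable names are arbitrary
choices made for illustration.

```javascript
// U+1D400 written as two \uXXXX escapes (an illustrative choice).
const pair = "\uD835\uDC00";

// Two \uXXXX escapes give a string of length 2, surrogate pair or not.
const pairLength = pair.length;

// charCodeAt yields a value below 2^16, or NaN for an out-of-range index.
const firstValue = pair.charCodeAt(0);
const pastEnd = pair.charCodeAt(2);

// fromCharCode keeps only the low 16 bits of its argument.
const chopped = String.fromCharCode(0x1D400).charCodeAt(0);

// Concatenation is additive in length even when it joins two halves.
const s1 = "a\uD835";
const s2 = "\uDC00b";
const additive = (s1 + s2).length === s1.length + s2.length;
```

Any merging rule adopted for ES4 would have to say which of these
observations are allowed to change, and for which script versions.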
  49
  50
  51 DESCRIPTION
  52
  53 There are several parts to this:
  54
  55   * Upgrade to Unicode 5 and the full Unicode character set (going
  56     beyond 16-bit code points)
  57
  58   * Encoding of Unicode code points in source text
  59
  60   * Handling surrogate pairs in existing source text
  61
  62   * Handling splitting of code points into surrogate pairs in
  63     implementations that don't support 21-bit code points
  64
  65
  66 ENCODING UNICODE CODE POINTS
  67
  68 Unicode code points can be expressed using a new escape character
  69 syntax that is valid wherever the current \u syntax is valid.  A code
  70 point is expressed as "\u{n...n}" where each n is a hexadecimal digit.
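
As a rough illustration of the proposed syntax, the sketch below
recognizes one escape of the form \u{n...n} and returns its integer
value.  The function name parseBraceEscape is hypothetical, and the
sketch assumes that "n...n" means one or more hexadecimal digits, which
the text above does not state explicitly.

```javascript
// Hypothetical helper: returns the integer value of a \u{n...n} escape,
// or null if the text is not of that form.
// Assumption: at least one hexadecimal digit is required inside the braces.
function parseBraceEscape(text) {
  const match = /^\\u\{([0-9A-Fa-f]+)\}$/.exec(text);
  if (match === null) {
    return null;
  }
  return parseInt(match[1], 16);
}

// The escape text is built with a doubled backslash so that this sketch
// does not depend on the host engine's own treatment of \u{...}.
const value = parseBraceEscape("\\u{1D400}");
const oldStyle = parseBraceEscape("\\u0041");
```

The sketch does no range checking; what happens for values that are not
valid code points is left to the implementation, as described next.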
  71
  72 If the integer value of n...n does not denote a valid Unicode code
  73 point then the effects are implementation-defined.
  74
  75 If the implementation does not support 21-bit code points then the
  76 code point must be split into a surrogate pair of 16-bit code point
  77 values.
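
The split into a surrogate pair can be sketched as follows, using the
standard UTF-16 arithmetic.  The function name toSurrogatePair is
hypothetical, and throwing a RangeError for out-of-range input is only
one possible choice, since the proposal leaves that case
implementation-defined.

```javascript
// Hypothetical helper: maps a code point to the 16-bit values that an
// implementation without 21-bit code points would store.
function toSurrogatePair(codePoint) {
  // Assumption: reject values outside the Unicode range; the proposal
  // itself leaves this case implementation-defined.
  if (codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError("not a Unicode code point: " + codePoint);
  }
  // Values below 2^16 fit in a single 16-bit value and are not split.
  if (codePoint < 0x10000) {
    return [codePoint];
  }
  // Subtract 2^16, then put the top 10 bits in the high half and the
  // bottom 10 bits in the low half.
  const offset = codePoint - 0x10000;
  return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
}

const halves = toSurrogatePair(0x1D400);
```

For U+1D400 this gives the pair 0xD835, 0xDC00, the same two values an
ES3 program would write as "\uD835\uDC00".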
  78
  79
  80 MERGING SURROGATE PAIRS
  81
  82 Probably uncontroversial:
  83
  84 In version 4 scripts on implementations that support 21-bit code
  85 points, surrogate pairs in the source text are merged to a single code
  86 point.
  87
  88 In version 3 scripts, surrogate pairs are never merged under any
  89 circumstances, for backward compatibility reasons.
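
The two rules above can be sketched as a single pass over the 16-bit
values of the source text.  The function name mergeSurrogatePairs and
the scriptVersion parameter are hypothetical; the sketch only models an
implementation that does support 21-bit code points.

```javascript
// Hypothetical helper: units is an array of 16-bit values taken from
// source text; the result is an array of code point values.
function mergeSurrogatePairs(units, scriptVersion) {
  // Version 3 scripts: never merge, for backward compatibility.
  if (scriptVersion < 4) {
    return units.slice();
  }
  const merged = [];
  for (let i = 0; i < units.length; i++) {
    const high = units[i];
    const low = units[i + 1]; // undefined at the end; fails the tests below
    const isPair = high >= 0xD800 && high <= 0xDBFF &&
                   low >= 0xDC00 && low <= 0xDFFF;
    if (isPair) {
      merged.push(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00));
      i++; // skip the low half that was just consumed
    } else {
      merged.push(high); // ordinary value or unpaired half, kept as is
    }
  }
  return merged;
}

const v4 = mergeSurrogatePairs([0x61, 0xD835, 0xDC00], 4);
const v3 = mergeSurrogatePairs([0x61, 0xD835, 0xDC00], 3);
```

Under these assumptions v4 holds two values and v3 holds three, which
is the observable difference between the two script versions.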
  90
  91 The implementation either supports or does not support 21-bit code
  92 points.  If it does, then version 3 code may read code point values
  93 of 2^16 or above from a string, and String.fromCharCode called from
  94 version 3 code may not truncate the input value to 16 bits.
  95
  96
  97 Maybe more open to discussion:
  98
  99 Code points that form half of a surrogate pair are allowed by
 100 themselves.  (Even if we do prohibit them and prohibit their creation
 101 in String.fromCharCode, ES3 code can still create them by using a
 102 string literal containing a pair and then reading the individual parts
 103 from that string.  We can't prohibit that because that would break
 104 backward compatibility.  But obviously those values can leak into ES4
 105 code.)
 106
 107 Surrogate pairs are never automatically merged when written into a
 108 string singly, as in c1 + c2 where c1 ends with the high part of a
 109 code point and c2 starts with the low part.  (It wouldn't be hard to
 110 change this, though it would mean different string concatenation
 111 semantics in ES3 and ES4 code.)
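
The two paragraphs above describe behaviour that ES3 code can already
produce.  The sketch below shows it, assuming an engine with the ES3
16-bit string model; the variable names and the choice of U+1D400 are
for illustration only.

```javascript
// A string literal containing a surrogate pair.
const literal = "\uD835\uDC00";

// Reading the parts individually yields strings holding a lone half.
const highHalf = literal.charAt(0);
const lowHalf = literal.charAt(1);
const highValue = highHalf.charCodeAt(0);
const isLoneHigh = highHalf.length === 1 &&
                   highValue >= 0xD800 && highValue <= 0xDBFF;

// Writing the halves back with + does not merge them: the result still
// has two 16-bit values and compares equal to the original literal.
const rejoined = highHalf + lowHalf;
const stillTwo = rejoined.length === 2;
const sameAsLiteral = rejoined === literal;
```

If concatenation did merge in ES4 code, rejoined would have length 1
there, which is the difference in semantics noted above.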
 112
 113
 114 NOTES
 115
 116 We are not using the more obvious notation \Unnnnnnnn instead of the
 117 proposed \u{n...n} because the former is backward incompatible with
 118 ES3.  In current ES3 programs, "\U" means "U".  The proposed syntax
 119 does not have this problem.  We could allow \Unnnnnnnn after all by
 120 interpreting \U differently under the version=4 tag, and I think that
 121 would be my preference, especially since we are proposing to handle
 122 surrogate pairs differently under version=4 in any case.
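
The incompatibility is easy to see in an engine with ES3 string-literal
rules, where a backslash before a character with no escape meaning
simply yields that character.  A minimal sketch, with illustrative
variable names:

```javascript
// Under ES3 rules "\U" is just "U", so this literal is nine ordinary
// characters rather than one escaped code point.
const legacy = "\U0001D400";
const startsWithU = legacy.charAt(0) === "U";
const legacyLength = legacy.length;
const sameAsPlain = legacy === "U0001D400";
```

Any existing program relying on this would change meaning if \U were
reinterpreted without a version tag.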
 123
 124
 125 MORE RADICAL APPROACHES
 126
 127 Just say that ES4 upgrades to 21-bit Unicode and that implementations
 128 that claim to support the full language will have to follow; code may
 129 break but c'est la guerre.
 130
 131
 132
 133 REFERENCES
 134
 135 [1] http://wiki.ecmascript.org/doku.php?id=proposals:update_unicode