notes/unicode-composition-for-filenames

                                                                 -*- Text -*-


Content
=======

 * Context
 * Issue description
 * Pre-resolution state of affairs
   - Single platform
   - Multi-platform: Windows + MacOS X
 * Proposed support library
   - Assumptions
   - Options
 * Proposed normal form
 * Possible solutions
   - Normalization of path-input on MacOS X
   - Normalization of path-input everywhere
   - Comparison routines (client side)
   - Comparison routines (everywhere)
 * Short term (ie before 2.0) solution
 * Long term solution (ie 2.0+)
 * Short term solution implementation consequences
 * References


Context
=======

Within Unicode, some characters - those with diacritical marks - can be
represented in 2 ways: as a single precomposed codepoint, as used in
Normal Form Composed (NFC), or as a base character followed by combining
marks, as used in Normal Form Decomposed (NFD).  A string of Unicode
characters can contain any mixture of both forms.

This note explicitly does not concern itself with invisible
characters, spaces or other characters unlikely to be present in
filenames.  It also explicitly excludes the NFKC/NFKD (compatibility)
normal forms, because they are lossy: they fold away distinctions such
as formatting variants of a character.


Because there are 2 forms for representing (some) characters in Unicode,
it's possible to produce different sequences of codepoints that are
meant to indicate the same sequence of characters [1].  UTF-8, the
internal Unicode encoding of choice for Subversion, encodes each
codepoint as (a series of) bytes (octets).  Because the sequences of
codepoints specifying a character may differ, so may the resulting
UTF-8.  Hence, we end up with more than one way to specify the same path.
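
As an illustration (not part of the original analysis), the following
Python sketch shows the effect for a single name.  The sample name is
arbitrary; the normalization comes from Python's standard unicodedata
module.

```python
import unicodedata

name = "caf\u00e9"                        # precomposed e-acute: U+00E9
nfc = unicodedata.normalize("NFC", name)
nfd = unicodedata.normalize("NFD", name)  # 'e' followed by U+0301

nfc_bytes = nfc.encode("utf-8")
nfd_bytes = nfd.encode("utf-8")

print(nfc_bytes)               # b'caf\xc3\xa9'
print(nfd_bytes)               # b'cafe\xcc\x81'
print(nfc_bytes == nfd_bytes)  # False
```

Both byte strings render as the same name on screen, yet a byte-wise
comparison treats them as different paths.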


The following table specifies the behaviour of OSes with regard to the
handling of Unicode filenames:


            Accepts   Gives back    See
MacOS X       any       NFD(*)      [2]
Linux         any       <input>
Windows       any       <input>
Others         ?           ?

*) There are some remarks to be made regarding full or partial
   NFD here, but the essential thing is: If you send in NFC, don't
   expect it back!
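
A row for another platform could be filled in with a small probe like
the one below.  This is only a sketch: the helper name name_as_stored is
made up for this note, and the result depends on the file system in use,
so no particular output is claimed for the first line.

```python
import os
import tempfile
import unicodedata

def name_as_stored(name: str) -> str:
    """Create an empty file called `name` in a scratch directory and
    return the name the OS reports back for it."""
    with tempfile.TemporaryDirectory() as scratch:
        open(os.path.join(scratch, name), "w").close()
        (stored,) = os.listdir(scratch)
        return stored

sent = unicodedata.normalize("NFC", "r\u00e9sum\u00e9.txt")
got = name_as_stored(sent)

# Platform dependent: expected to differ where the OS hands back NFD.
print("byte-identical:        ", got == sent)
# Whatever the OS did, the name should still be canonically equivalent.
print("canonically equivalent:", unicodedata.normalize("NFC", got) == sent)
```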


Issue description
=================

From the context above, 2 problems follow:

 1) We can't generally depend on the OS to give us back the
    exact filename we gave it
 2) The same filename may be encoded as different sequences of
    codepoints

Problem (1) is mainly a client side issue, something which might be
resolved in the client side libraries (client/subr/wc).

Problem (2) is much broader than that, especially given the fact that
we already have lots of populated repositories "out there": it means
that a filename coming from the operating system which differs
byte-for-byte from the one in the repository does not necessarily name
a different file.  This has repository (ie. server-side) impact.


Pre-resolution state of affairs
===============================

This section describes the problems to be expected in different
combinations of client/server OSes.  As indicated in the table in the
context section, Linux and Windows are expected to behave identically.
This section therefore does not consider Linux as a separate system.

The platforms below are strictly client side: the server side problems
mentioned in the issue description section relate solely to the
repository, which can be located on any server platform.


Single platform
---------------
This can be multiple MacOSX machines or multiple Windows machines.  In this
scenario, no interoperability problems are to be expected.


Multi-platform: Windows + MacOSX
--------------------------------
Consider a file whose name contains one or more precomposed (NFC)
characters being committed from Windows.  When the MacOSX developer
updates, the file is written under its NFC name, but as stated in the
context section, the Mac recodes that name to NFD.  Now, comparing what
comes from the disk (NFD) with what's in the entries file (NFC) results
in a missing file (the NFC encoded one) and an unversioned file (the NFD
encoded one).  Both of these files look exactly the same to the person
reading the Subversion output on the screen. [==> confusion!]
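
The scenario can be simulated without a Mac.  In the sketch below the
directory listing is faked by NFD-normalizing the entry names; the file
names and the set-difference "status" logic are illustrative only and do
not reflect how the working copy library is actually written.

```python
import unicodedata

# Names recorded in the entries file, as committed from Windows (NFC).
entries = {unicodedata.normalize("NFC", "r\u00e9sum\u00e9.txt"), "README"}
# Names a MacOSX directory listing would report (NFD), simulated here.
on_disk = {unicodedata.normalize("NFD", n) for n in entries}

# A naive, byte-wise status check.
missing = sorted(entries - on_disk)
unversioned = sorted(on_disk - entries)

print("missing:    ", missing)
print("unversioned:", unversioned)
```

The plain ASCII name matches as usual, while the accented name shows up
once in each list, and the two spellings look identical when printed.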

Committing a file the other way around might be less problematic, since
Windows is capable of storing NFD filenames.


Proposed support library
========================

Assumptions
-----------
The main assumption is that we'll keep using APR for character set
conversion, meaning that the chosen recoding solution does not need
to provide any functionality other than recoding.

Options
-------
There are 2 options (that I'm aware of [dionisos]) for choosing a library
which supports the required functionality:

1) ICU - International Components for Unicode [3]
   a library with a very wide range of targeted functions, with a
   memory footprint to match.  In order to be able to use it, we'd need
   to trim this library down significantly.
2) utf8proc - a library for processing UTF-8 encoded unicode strings [4]
   a library specifically targeted at a limited number of operations
   to be performed on UTF-8 encoded strings.  It consists of 2 .c files
   and 1 .h file, with a total source size of 1MB (compiled less than
   0.5MB).

Of these 2, under the given assumption, it only makes sense to use
utf8proc.


Proposed normal form
====================

The proposed internal 'normal' normal form should be NFC, if only
because it's the more compact form of the two: when allocating memory
to store a conversion result, it will normally not be necessary to
allocate more than the size of the input buffer.  (This is not a strict
guarantee: a few characters which are excluded from composition, such as
U+0344, become longer under NFC.)

This would give the maximum performance from utf8proc, which requires 2
recoding runs when the buffer is too small: the first to retrieve the
required buffer size, the second to actually store the result.
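
The size argument can be checked with a few lines of Python.  The sketch
below uses the standard unicodedata module rather than utf8proc, and the
sample strings are arbitrary.  It also shows a rare character that grows
under NFC, so an input-sized buffer is a good first guess rather than a
hard upper bound.

```python
import unicodedata

def utf8_len(text: str, form: str) -> int:
    """Size in bytes of `text` after normalizing to `form`."""
    return len(unicodedata.normalize(form, text).encode("utf-8"))

# A typical file name with three accented letters.
typical = "Se\u00f1or_\u00c5ngstr\u00f6m.txt"
print(utf8_len(typical, "NFC"), utf8_len(typical, "NFD"))   # 21 24

# U+0344 is excluded from composition: NFC replaces it by U+0308 U+0301.
rare = "\u0344"
print(len(rare.encode("utf-8")), utf8_len(rare, "NFC"))     # 2 4
```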


Possible solutions
==================

Several options are available for resolving this problem, each
with its pros and cons, to be outlined below.

 1) Normalization of (path) input on MacOSX
    Since the Mac seems to be the only platform which mutilates its
    pathname input to be NFD, this seems like a logical (low impact)
    solution.
 2) Normalization of (path) input on all platforms
    Since paths can't differ only in encoding if we standardize on
    one encoding, this seems like a logical (relatively low impact)
    solution.
 3) Normalization of path input in the client and server
    On the server side, non-normalized paths may have become part
    of the repository.  We can achieve full in-memory standardization
    by converting any path coming from the repository as well as from
    the client.
 4) Client and server-side path comparison routines
    Because paths read from the repository may be used to access said
    repository, possibly by calculating hash values, paths from the
    repository can't be munged (repository-side).  To eliminate the
    effect, we acknowledge we're not going to be 'clean': we'll always
    need path comparison routines.


Solution (1) has a very strong CON: it will break all pre-existing
MacOSX-only environments.  Consider a client which starts sending NFC
encoded paths in an environment where all paths have been NFD encoded
until that time - without proper support in the server.  This would
result in commits with NFC encoded paths to files for which the path
in the repository is NFD encoded: breakage.

Solution (2) has the same problem as solution (1) on MacOSX, but
on the upside it prevents new NFD paths from entering the repository
(for sufficiently broad definitions of 'client' [think mod_dav_svn]).

As already stated, solution (3) may prevent paths from being found if
the retrieval mechanism is hash-based, meaning it could break any
repository backend which uses hashing to store information about paths.
(Don't we store locks in FSFS based on hashing?)
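
The hashing concern can be illustrated with an ordinary dictionary.  The
lock table below is purely hypothetical (it is not a description of how
FSFS stores locks); it only shows that a lookup keyed on path bytes
stops working once the in-memory path has been renormalized.

```python
import unicodedata

# Hypothetical store, keyed on the exact bytes the path was saved with.
stored_path = unicodedata.normalize("NFD", "trunk/r\u00e9sum\u00e9.txt")
locks = {stored_path.encode("utf-8"): "lock-token-1"}

def find_lock(path: str):
    """Look up a lock by the UTF-8 bytes of `path`."""
    return locks.get(path.encode("utf-8"))

as_stored = find_lock(stored_path)
after_recoding = find_lock(unicodedata.normalize("NFC", stored_path))

print(as_stored)        # lock-token-1
print(after_recoding)   # None
```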

Solution (4) defines no internal standard representation, assuming it's
not possible to maintain a clean in-memory state, given all the problems
found in the earlier solutions.  Instead, it requires all path comparisons
to be performed using special NFC/NFD encoding aware functions.
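
A minimal sketch of such an encoding aware comparison, written in Python
for brevity.  The function name paths_equivalent is made up for this
note and is not an existing Subversion API; an implementation in C would
use the chosen support library instead of unicodedata.

```python
import unicodedata

def paths_equivalent(a: str, b: str) -> bool:
    """True if `a` and `b` name the same path, ignoring NFC/NFD
    differences.  Neither argument is modified."""
    if a == b:                  # cheap check for the common case
        return True
    return (unicodedata.normalize("NFC", a)
            == unicodedata.normalize("NFC", b))

nfc_path = unicodedata.normalize("NFC", "docs/r\u00e9sum\u00e9.txt")
nfd_path = unicodedata.normalize("NFD", "docs/r\u00e9sum\u00e9.txt")

print(nfc_path == nfd_path)                          # False
print(paths_equivalent(nfc_path, nfd_path))          # True
print(paths_equivalent(nfc_path, "docs/resume.txt")) # False
```

Only the temporaries used for the comparison are normalized; the paths
themselves keep the encoding they arrived in.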


Short term solution
===================

Because of our interoperability guarantees, the client and server
should be considered separate universes, each of which can use its own
(internal) solution.  However, the client should at all times use the
exact path the server sent it.  The same applies the other way around.

Given the above, the short term (before 2.0) solution should be to
use path comparison routines as stated in solution (4).


Long term solution
==================

The long term (2.0+) solution would be to use solution (2), which ensures
recoding of all input paths into the 'normal' normal form (NFC).  In that
case, specialised path comparison routines will no longer be required
(although they might still be desired for other design considerations).
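
In Python terms, normalizing at the input boundary could look like the
sketch below.  The name normalize_input_path is hypothetical; the point
is only that once every path has passed through it, plain equality is
sufficient.

```python
import unicodedata

def normalize_input_path(path: str) -> str:
    """Recode a path to NFC at the point where it enters the system."""
    return unicodedata.normalize("NFC", path)

from_mac = unicodedata.normalize("NFD", "docs/r\u00e9sum\u00e9.txt")
from_windows = unicodedata.normalize("NFC", "docs/r\u00e9sum\u00e9.txt")

a = normalize_input_path(from_mac)
b = normalize_input_path(from_windows)

print(from_mac == from_windows)       # False: raw inputs differ
print(a == b)                         # True: plain comparison suffices
print(normalize_input_path(a) == a)   # True: normalizing twice is harmless
```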


Short term solution implementation consequences
===============================================

As stated before, since we don't know whether the other side of the
equation might be a pre-normalization-aware client or server until
we break backward compat in 2.0, the client and server should be
able to talk backward compatibly with a pre-NF-aware 'other side'.

Hence, solving this problem means considering the client and the server
separate universes, each of which can employ its own internal solution.


Implementing solution (4) means:

 A. Comparing file names with entry paths using NFC/NFD aware comparison
    functions. Then, when there's a match, *use the pathname from the
    entries file* to communicate with the server; after all, the path
    might have been added with a different encoding than we got back
    from the disk.

 B. Matching working copy paths with entries-file paths using NFC/NFD
    aware comparison functions. On a match, use the entries-file path to
    communicate with the server.
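
The matching rule above can be sketched as follows.  The helper name
entry_for_disk_name and the list standing in for the entries file are
assumptions made for this illustration only.

```python
import unicodedata
from typing import Iterable, Optional

def entry_for_disk_name(disk_name: str,
                        entry_names: Iterable[str]) -> Optional[str]:
    """Return the entries-file spelling matching `disk_name`, or None
    if the name on disk is unversioned."""
    key = unicodedata.normalize("NFC", disk_name)
    for entry in entry_names:
        if unicodedata.normalize("NFC", entry) == key:
            return entry        # the entries-file spelling, unchanged
    return None

# The entry was added in NFC; the disk listing hands the name back in NFD.
entries = [unicodedata.normalize("NFC", "r\u00e9sum\u00e9.txt"), "README"]
from_disk = unicodedata.normalize("NFD", "r\u00e9sum\u00e9.txt")

to_server = entry_for_disk_name(from_disk, entries)
print(to_server == entries[0])                        # True
print(to_server == from_disk)                         # False
print(entry_for_disk_name("notes.txt", entries))      # None
```

The path sent to the server is the one from the entries file, never the
one read back from the disk.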


References
==========

1) UAX #15: Unicode normalization forms
   http://unicode.org/reports/tr15/
2) Apple Technical Q&A: Path encodings in VFS
   http://developer.apple.com/qa/qa2001/qa1173.html
3) ICU - International Components for Unicode
   http://www-306.ibm.com/software/globalization/icu/index.jsp
4) utf8proc - a library targeted at processing UTF-8 encoded unicode strings
   http://www.flexiguided.de/publications.utf8proc.en.html