notes/wc-improvements

   1 Anyone who has worked on the libsvn_wc code will have discovered that
   2 the current code is a complicated mess of special cases, and that it
   3 is difficult to understand, inconsistent, slow and buggy.  I know this
   4 because I wrote some of it.  It's possible that the libsvn_wc code
   5 will gradually evolve into an elegant, efficient, code base; on the
   6 other hand comments like "when we rewrite libsvn_wc" regularly appear
   7 on the dev list.  This document is *not* a plan or design for a
   8 rewrite, it's just some of the thoughts of a libsvn_wc hacker.
   9
  10 From Past to Present
  11 ====================
  12
  13 The original code for libsvn_wc used an implementation that stored
  14 more or less all state information on disk in the .svn area on disk,
  15 so during most operations the entries files were read and written many
  16 times.  This led to the development of an API that passed around a lot
  17 of path parameters (as svn_string_t* originally, as const char* now)
  18 and to the development of the svn_io_xxx functions, which also accept
  19 path parameters.  The implementation was slow, and didn't scale
  20 particularly well as working copies got larger.  To improve things the
  21 current access baton and entries caching system was gradually hacked
  22 in, and libsvn_wc is now faster and scales a bit better, but problems
  23 still remain.
  24
  25 A good example of the problems caused by the "path as parameter" API
  26 is svn_wc_text_modified_p.  Its basic function is to determine if the
  27 base text file and the working file are the same or different, but
  28 physical IO operations have to be repeated because they are buried
  29 behind several layers of API.  It's difficult to fix without
  30 rewriting, or duplicating, a number of svn_io_xxx and svn_wc_xxx
  31 functions.  Aside from the repeated IO itself, each IO operation also
  32 has to repeat the UTF-8 to native path conversion.
  33
  34 The current entries caching makes things faster than in the past, but
  35 has its own problems.  Most operations now cache the entire entries
  36 hierarchy in memory which limits the size of the working copies that
  37 can be handled.  The problem is difficult to solve as some operations
  38 make multiple passes--commit for instance makes a first pass searching
  39 for modifications, a second pass reporting to the repository, and a
  40 third pass to do post-commit processing.
  41
  42 The original code also did not always make a distinction between the
  43 versioned hierarchy in the entries file and the physical hierarchy on
  44 disk.  Things like using stat() or svn_io_check_path() calls to
  45 determine whether an item was versioned as file or directory do not
  46 work when the working copy on disk is obstructed or incomplete.
  47
  48 The Future
  49 ==========
  50
  51 Some of these ideas are trivial, some of them are difficult to
  52 implement, some of them may not work at all.
  53
  54 - Have an svn_wc_t context object, opaque outside the library, that
  55   would replace the access batons.  This would get passed through most
  56   of the libsvn_wc functions and could read/cache the entries files on
  57   demand as the working copy was traversed.  It could also cache the
  58   UTF-8 xlate handle.
  59
  60 - Have an API to svn_wc_entry_t, perhaps make the struct opaque, so
  61   that things like URL need not be constructed when the entries file
  62   is read but can be created on demand if required and possibly cached
  63   once created.  The aim would be to reduce the memory used by the
  64   entries cache.
  65
  66 - Consider caching physical IO results in svn_wc_entry_t/svn_wc_t.
  67   Should we really stat() any file more than once?  This becomes less
  68   important as we reduce the number of IO operations.
  69
  70 - Consider caching UTF-8 to native path conversions either in
  71   svn_wc_t, or svn_wc_entry_t, or locally in functions and using
  72   svn_io_xxx equivalents that accept native paths.  This becomes less
  73   important as we reduce the number of IO operations.
  74
  75 - Make interfaces pass svn_wc_entry_t* rather than simple paths.  The
  76   public API using const char* paths would remain to be used by
  77   libsvn_client et al.
  78
  79 - Maintain a clear distinction between the versioned hierarchy and the
  80   physical hierarchy when writing code, it's usually a mistake to use
  81   one when the other should be used.  To this end, audit the use of
  82   svn_io_check_path().
  83
  84 - Avoid using stat() to determine if an item is present on disk before
  85   using the item, just use it straight away and handle the error if it
  86   doesn't exist.
  87
  88 - Search out and destroy functions that read and discard entries files
  89   e.g. the apparently "simple" functions like svn_wc_is_wc_root or
  90   check_wc_root.  Such overhead is expensive when used by operations
  91   that are not going to do much other work, running status on a single
  92   file for example.  The overhead may not matter to a command line
  93   client, but it can matter to a GUI that makes many such calls.
  94
  95 - Consider supporting out of tree .svn directories.
  96
  97 - In the present code most operations are IO bound and have CPU to
  98   spare.  Perhaps compressed text-bases would make things faster
  99   rather than slower, by trading spare CPU for reduced IO?
 100
 101 - Keep track of the last text time written into an entries file and
 102   store it in svn_wc_t.  Then when we come to do a timestamp sleep
 103   we can do it from that time rather than the current time.
 104
 105 - Store working file size in the entries file and use it as another
 106   shortcut to detect modifications.  This should not need any extra
 107   system calls, the stat() for timestamp can also return the size.
 108   When it triggers it will be much faster than possibly detranslating
 109   and then doing a byte-by-byte comparison.
 110   ### Problem: This doesn't work when the file needs translation, because the
 111   ### file might be modified in such a way that these modifications disappear
 112   ### when the file is detranslated.
 113
 114 - Consider using the translated stream API to get a (de)translated
 115   version of a file, when the temp file is not needed.
 116
 117 - Make the entries file smaller.  The properties committed-date
 118   committed-rev and last-author are really only needed for keyword
 119   expansion, so only store them if the appropriate svn:keywords value
 120   is present.  Note that committed-rev has a more general use as rPREV,
 121   however just about all uses of rPREV involve repository access so
 122   rPREV could be determined via an RA call.  Removing the three
 123   properties could reduce entries file size by as much as one third,
 124   it's possible that might make reading, writing and parsing faster.
 125   It would reduce the memory used to cache the entries, an ABI change
 126   to svn_wc_entry_t might reduce it further.
 127
 128 - Look at calls to svn_wc__get_keywords and svn_wc__get_eol_style
 129   Each of those reads the properties file.  If they occur together
 130   then consider replacing them with a single call to svn_wc_prop_list,
 131   and perhaps write some functions that accept the properties hash
 132   as an argument.  Alternatively, consider caching the existence of
 133   these two properties in the entries file to avoid reading the props
 134   file at all in some cases.
 135
 136 - Optimise update/incomplete handling to reduce the number of times
 137   the entry file gets written.
 138   http://svn.haxx.se/dev/archive-2005-03/0060.shtml
 139
 140   * Avoid adding incomplete="true" if the revision is not changing.
 141
 142   * Don't write incomplete="true" immediately, cache it in the access
 143     baton and only write it when next writing the entries file.
 144
 145   * Combine removing incomplete="true", and revision bumping, with the
 146     last change due to the update.
 147
 148 - The svn_wc_t context could work in conjunction with a more advanced
 149   svn_wc_crawl_revisions system.  This would provide a way of plugging
 150   multiple callbacks into a queue, probably with some sort of ordering
 151   and filtering ability, the aim being to replace most/all of the
 152   existing explicit loops.  This would put more of the pool handling in
 153   one central location, it may even be possible to provide different
 154   entry caching schemes.  I don't know how practical this idea is, or
 155   even if it is desirable.
 156
 157 - Have a .svn/deleted directory so that schedule delete directories
 158   can be moved out of the working copy. At present a skeleton hierarchy
 159   of schedule delete directories remains in the working copy until the
 160   delete is committed.
 161
 162 - When handling a delete received during update/switch perhaps do it
 163   in two stages.  First move the item into a holding area within .svn
 164   and finally delete all such items at the end of the update.  This
 165   would allow adds-with-history to use the deleted item and so might
 166   be a way to handle moves (implemented as delete plus add) in the
 167   presence of local modifications.  Thought would have to be given to
 168   the revision of the local deleted item, what happens if it doesn't
 169   match the copyfrom revision?  Perhaps we could get diffs, rather
 170   than full text, for adds-with-history if the copyfrom source is
 171   reported to the repository?
 172
 173 - Consider implementing atomic move for wc-to-wc moves, rather than
 174   using copy+delete.  This would be considerably faster for big
 175   directories, would lead to better revert behaviour, and avoid
 176   case-insensitivity problems (and if we ever get atomic mv in
 177   libsvn_fs then the wc code would be ready for it).
 178
 179 - Consider writing some libsvn_wc compiled C regression tests to allow
 180   more complete coverage.  Most of the current libsvn_wc testing is
 181   done via the command line client and it can be hard to get a working
 182   copy into the state necessary to test all code paths.
 183
 184 - There are some basic features that are fragile.  Switch has some
 185   bugs that can break a working copy, see issue 1906.  I don't know
 186   how the system is supposed to work in theory, let alone how it
 187   should be implemented.  Non-recursive checkout is broken, see issue
 188   695; this probably applies to non-recursive update and switch as well.
 189
 190 - Use absolute paths within libsvn_wc so that "." is not automatically
 191   a wc root.
 192
 193 - Read notes/entries-caching for some details of the logging/caching
 194   in the current libsvn_wc.  It's important that writing the entries
 195   file is handled efficiently.