notes/fs-improvements.txt

   1 Last updated: $Date: 2001/07/02 21:46:52 $
   2
   3 Three things that need to happen in the filesystem:
   4
   5    a) Switch to a sane node key system (issue #654).
   6
   7    b) We need to operate on file contents without holding the entire
   8       contents in RAM.  Berkeley DB gives us tools to operate on
   9       regions within a value, we just need to use them.
  10
  11    c) We need to do reverse-delta storage in the filesystem (with
  12       checksums).
  13
  14    d) Record (potentially) multiple parents per change.
  15
  16    e) Implement atomic renames.
  17
  18 Some thoughts on them:
  19
  20 \f
  21 a) Switch to a sane node key system (issue #654)
  22 ================================================
  23
  24 For more background, read the archived dev list thread with subject
  25 "Maintaining NodeID sanity":
  26
  27    http://subversion.tigris.org/servlets/ReadMsg?msgId=72265&listName=dev
  28
  29 Here's the plan, mostly by Bill Tutt and Branko, with assists from
  30 Karl and Mike:
  31
  32 Note:
  33
  34    - This is described in terms of a BDB implementation.  Translate to
  35      RDB-speak in your head as necessary :-).
  36
  37    - This proposal supports one copy (a.k.a. branch) operation.  You
  38      can call it anything you want: "copy", "branch", "split",
  39      "pumpkin", whatever.  We're calling it "copy" :-).  It is the SCM
  40      branch operation.
  41
  42 First, a node's key consists of three parts:
  43
  44    nodeID.copyID.txnID
  45
  46 The "txnID" is really just a unique identifier, but we happened to
  47 inherit it from the fs txn, and we needed a name for that portion,
  48 so... :-) Also, the copyID could theoretically live in the node's
  49 value instead of in its key, but it feels right to put it in the
  50 key.  (How's that for a powerful argument?)  For nodes that are not
  51 copies, the copyID is just "0" or some other special value.
  52
  53 There are no more mutability flags -- mutability is determined by
  54 examining whether the node key's txnID matches the txn in question.
  55 Therefore, there is no stabilization walk at commit time.
  56
  57 When we commit a change to a node, the nodeID and copyID stay the
  58 same, only the txnID changes (actually there is a circumstance where
  59 the copyID can change, but more on that later).  The new txnID is not
  60 necessarily greater than the old one -- sometimes txns get committed
  61 out of order! -- but anyway it's different from the old txnID, and the
  62 same new txnID is used for all other changes in that commit.
  63
  64   [Greg and Karl disagree on whether to use integer types or `char *'
  65    for the parsed representation of IDs.  See the dev list thread
  66    that starts here for the details of this:
  67    http://subversion.tigris.org/servlets/ReadMsg?msgId=72277&listName=dev
  68   ]
  69
  70 After a commit, the txn record in the transactions table does not go
  71 away; instead, it is updated so it now maps the txnID to the new
  72 revision.  This allows us to determine the revision a node was
  73 committed in, in constant time, given the node's key.
  74
  75 Each new version of a node stores the node's predecessor (and does not
  76 store copyform history).  When node "5.0.fzb" is committed as a
  77 successor to "5.0.qnr", the new node's value stores a reference to
  78 "5.0.qnr".
  79
  80 What about copies?
  81
  82 As in the current fs, copies are shallow.  The top of the copied tree
  83 gets a new node, but the nodes below it are shared with the copy
  84 source.  The new node key keeps the same nodeID, but gets a new txnID,
  85 and gets the next unique copyID (replacing the current copyID, if
  86 any).
  87
  88 In the example below, we copy `A' to `Y'.  Node keys for `A', `Y', and
  89 `bar' are given in parentheses:
  90
  91             BEFORE THE COPY               AFTER THE COPY
  92                <root>                         <root>
  93               /  |                           /  |  \
  94             /    |                         /    |    \
  95           /      |                       /      |      \
  96         A(3.0.m) B                     A(3.0.m) B       Y(3.jfb.p)
  97         / \      |                     / \      |      / \
  98      foo  bar   qux                 foo  bar   qux   foo  bar
  99        (5.0.abc)                       (5.0.abc)        (5.0.abc)
 100
 101 Let's flesh out the example with some commits under A and Y.  To save
 102 space, the colons represent time flow, not directory hierarchy --
 103 imagine they're the Z axis coming out of the screen or something :-).
 104
 105                          <root>
 106                         /  |  \
 107                       /    |    \
 108                     /      |      \
 109                  A(3.0.m)  B       Y(3.jfb.p)
 110                   / \      |      / \
 111                foo  bar   qux   foo  bar
 112                  (5.0.abc)         (5.0.abc)
 113                      :                 :
 114                      :                 :
 115                  (5.0.ejk)         (5.jfb.mht)
 116                      :                 :
 117                      :                 :
 118                  (5.0.xyz)         (5.jfb.uuu)
 119                      :                 :
 120                      :                 :
 121                  (5.0.r2d2)        (5.jfb.2pz)
 122                      :                 :
 123                      :                 :
 124                  (5.0.c3po)        (5.jfb.rdt)
 125
 126 Let's see how easy it is to answer various questions in this system:
 127
 128 Obviously, is_related() is simple -- just check that the nodeID
 129 portion is the same.  You may not know if the relationship is cousins
 130 vs ancestor/descendant, but you know whether or not they're related.
 131
 132 Asking is_predecessor(A,B) is also easy.  Just fetch the predecessor
 133 pointer from B and see if it's A.
 134
 135 Finding out what revisions a node changed in is proportional to the
 136 number of changes the node has undergone: start at the node, walk back
 137 through its predecessors, and for each txnID, look up the revision
 138 number via the transactions table (as described earlier).
 139
 140 During this walk, you can always tell when you encounter a node that
 141 results from a copy, because the copyID portion will either change or
 142 disappear entirely.  When this happens, you know one of two things is
 143 true: either the previous node in the walk was the top of a copied
 144 tree, or *this* node (the one with the different copyID) was one of
 145 the unchanged nodes down inside a copied tree.
 146
 147 One might think "Oh, we'll distinguish between these cases by walking
 148 up the parents of the node, and seeing if we eventually encounter the
 149 old copyID in one of the parents.  That would tell us that we're in
 150 the second case.  If we never encounter it, that tells us we're in the
 151 first."
 152
 153 Not so fast, Spidey.  We don't have parent pointers -- this is a
 154 predecessor walk by node keys; we can't just walk up the parent path
 155 like that.  Fortunately, copyIDs are generated from a new `copies'
 156 table, which maps unique copyIDs onto (REV COPY_DST_PATH
 157 COPY_DST_NODEKEY).  We look up the rev/path for the old copyID,
 158 convert it to a node key, and compare it to the node key we're
 159 currently on.  Voilà!  Actually, we're not sure we'll store all of
 160 those in the copies table, it may boil down to just the DST_NODEKEY or
 161 just the other two, we'll see.
 162
 163 Writing those predecessor walk loops well is left as an exercise for
 164 the reader (umm, more likely for the writers, heh), but you can see
 165 that the necessary questions can be answered efficiently.
 166
 167 Note that, like txnIDs, copyIDs are just unique numbers.  They may be
 168 increasing monotonically in the `copies' table, but (due to the fact
 169 that txn A may be started before txn B yet be committed afterwards)
 170 it's quite possible that a higher copyID will become visible in the
 171 revision history before a lower one.
 172
 173 The one thing we can know is that a lower copyID can never be a
 174 branchwise descendent of a lower copyID, since the lower one must have
 175 been committed before any of its descendants txns were started, of
 176 course.  I'm not sure this minimal inference will ever be useful, but
 177 anyway it's all we've got.  Anyway, right now, we only ever need ask
 178 if two copyIDs are equal -- we don't order them.
 179
 180 Okay, now what if A already had copied trees underneath it when we
 181 copied it to Y?  Suppose `bar' is down in one of those subdirectories?
 182
 183 Then when we're committing on /Y/.../bar, we watch for copyIDs as we
 184 walk down from root, like usual, and if we're in a copy underneath a
 185 copy, we bubble down a _new_ copyID, distinct from both Y's and B's,
 186 starting from that point.  Justification: a branch of a branch is
 187 still a branch, so it gets its own copyID.
 188
 189 At this point, I'm going to hand-wave on describing the
 190 copy-under-copy behavior any further.  I think the above is enough to
 191 see that there are no insurmountable problems here, and that the
 192 filesystem will now have node keys that [lazily] reflect actual
 193 branching patterns.
 194
 195 A few more notes:
 196
 197 The is_ancestor(X,Y) question is really only asked during merge(),
 198 that is, during commits.  If entry "somedir/blah" has NodeID Y in the
 199 txn being committed, and NodeID X in head tree being merged into that
 200 txn, then we need to make sure X is an ancestor of Y, so that when the
 201 commit replaces X with Y, we know we're not losing any history.
 202
 203 Therefore, we think we can get away with storing the ancestors in a
 204 distributed fashion, as a chain: each node knows its immediate
 205 predecessor (or "precessors", in the future), and you walk back until
 206 you have your answer.  In real life, we won't usually be walking back
 207 too far, and the search can be further bounded by the revision range
 208 implied by the two nodes.  If profiling proves these walks to be a
 209 bottleneck, we can start caching the results of such walks in a new
 210 table whose keys are node keys, and values are _full_ ancestor lists.
 211
 212
 213 \f
 214 b) Operate on portions of files efficiently.
 215 ============================================
 216
 217    [still pondering this section]
 218
 219 We should take advantage of Berkeley DB's partial record stuff, all
 220 the way up to the top of the svn fs interfaces.
 221
 222    - dag_node_t gets two new fields: contents_offset and contents_len.
 223      They apply to the node's cache of the contents, not the header or
 224      proplist.
 225
 226    - svn_fs__get_node_rev_contents() takes offset and len arguments,
 227      fetches only that data.  The dag_node_t will remember the offset
 228      and len.
 229
 230    - svn_fs__put_node_rev_contents() takes offset and len args as
 231      well.
 232
 233    - change set_node_revision() accordingly.
 234
 235    - ... todo thinking here ...
 236
 237 So now, whenever you read or write a node revision, you are operating
 238 on a range.  There will be some way to say "I mean the whole thing",
 239 of course, so it won't be necessary to know the size in advance.
 240
 241 Thought: possibly we should stop storing data in the dag_node_t
 242 itself, and just return the data in a void pointer passed to
 243 svn_fs__get_node_rev_contents().  Still pondering.
 244
 245
 246 \f
 247 c) Reverse-delta storage.
 248 =========================
 249
 250 The naive way to recover an old text is:
 251
 252    retrieve_node_rev (N)
 253    {
 254      grab_node_revision (&N);
 255
 256      if (is_fulltext (N))
 257        return N;
 258      else if (is_shared (N))
 259        return retrieve_node_rev (get_sharee (N));
 260      else if (is_svndiff (N))
 261        return svnpatch (get_svndiff (N), retrieve_node_rev (get_base (N)))
 262    }
 263
 264 (Loose pseudo-code, obviously, and the recursion could be a loop, but
 265 you get the idea.)
 266
 267 The trouble with this is that it constructs and patches each
 268 intermediate revision.  That'll probably be i/o bound, and anyway much
 269 of the intermediate material may not end up in the final target, in
 270 which case reconstructing it was a waste of time.
 271
 272 What we really want is a way to compose a bunch of svndiffs, and then
 273 apply that composition to the current head, to give us the older
 274 revision in one step (well, one application anyway).  Both the
 275 composition and the final application need to happen in a streamy or
 276 windowed fashion -- we shouldn't have to hold the entire diff in
 277 memory, nor the entire source, nor the target.
 278
 279 Here's a way to do this:
 280
 281 An svndiff is a series of instructions that are followed to
 282 reconstruct the target.  There are three possible instructions:
 283
 284    a) Insert X bytes of new text into the TARGET, and I'm giving you
 285       those bytes right here.
 286    b) Copy N bytes, starting from offset F in the SOURCE, into the
 287       TARGET.
 288    c) Copy N bytes, starting from offset F in the TARGET, into the
 289       TARGET.
 290
 291 (Note that (c) can actually run past the current end of the target, as
 292 long by the time you get there, the target is longer.)
 293
 294 To compose two svndiffs...
 295
 296    ...and I hate to tantalize you, but I'm late and have to run now,
 297       so I'll try to finish this tomorrow... crimminy... The quick
 298       summary is, we build a new svndiff (the composition of all the
 299       intermediates), and as it gets too large, we windowify as we go
 300       and put each window temporarily in the database; this makes the
 301       composition as a whole less efficient, but means that at any
 302       given time we don't have to have the whole thing in memory.  The
 303       arithmetic for offset-adjustment is fairly straightforward even
 304       when one has to offload windows, I believe.  It's nice that the
 305       source is already in the db and we get offset+length style
 306       operations from Berkeley naturally anyway.  Branko or anyone,
 307       feel free to continue this recipe and see if you can take it
 308       somewhere before I get in tomorrow morning...  -kff
 309
 310
 311 ------------------------------------------------------------------------
 312    Notes from JimB about optimizing the in-repository delta generation
 313    to make deltas that can be composed more quickly:
 314
 315 I talked about this with Karl on the phone, and gave pretty bad
 316 explanations of my thinking; I'll try to do better today.  This will
 317 also provide some background for other readers.
 318
 319 I'm told that RCS reconstructs older file revisions by walking the
 320 list of diffs, ignoring the actual text, and simply recording the line
 321 numbers and sizes of the substitutions.  Then, it can run over that
 322 data and do arithmetic to construct a single `composed' diff against
 323 the youngest revision would reconstructs the older revision.  Then,
 324 you apply that composed diff to get the revision you wanted.
 325
 326 For example, if your fulltext is:
 327
 328 rev 3:  abcdefghijklmnopqrst
 329
 330 and your deltas are:
 331
 332 rev 2:  replace 3--6 with "howdy"    (yielding abchowdyghijklmnopqrst)
 333 rev 1:  replace 6--8 with " are you" (yielding abchow are youghijklmnopqrst)
 334
 335 then the RCS algorithm would gather this info:
 336
 337 rev 2:  replace 3--6 with 5 chars (call them X)
 338 rev 1:  replace 6--8 with 8 chars (call them Y)
 339
 340 Now, without looking at any of the text, it can compose the two deltas
 341 to get the following delta, yielding rev 1:
 342
 343         replace 3--6 with range 0--3 from X
 344         replace 6--6 with range 0--8 from Y (i.e. insert Y at 6)
 345
 346 If we then apply this `composed' delta to the original text, we get:
 347
 348         abchow are youghijklmnopqrst
 349
 350 The point of all this is that you don't need to mess around with
 351 actual text until the very end.  Until that point, the amount of work
 352 depends on the amount of change, not on the amount of text involved.
 353 And when you do get around to actually assembling text, the amount of
 354 work depends on the size of the output file --- because you're only
 355 touching each piece of text once --- and only weakly on the amount of
 356 change.
 357
 358 Our current svndiff format frustrates this somewhat by compressing the
 359 new text, as well as pulling in pieces of the original.  The
 360 compression process produces a lot of delta ops that deal with small
 361 pieces of text.  (Or at least, I expect it does...)  So even if the
 362 change is something simple --- replacing a single block of text in the
 363 middle of the file, say --- you end up with an svndiff with a lot of
 364 ops, mostly concerned with building the replacement text from pieces
 365 of itself.  This is great, except that having lots of small ops
 366 increases the amount of work the delta composition phase needs to do.
 367 In fact, if the ops usually deal with really small pieces of text ---
 368 a few dozen bytes or so --- I expect it'd be faster to just throw the
 369 actual text around.  Memcpy is pretty optimized on real systems; you
 370 could copy a lot of bytes in the time it would take to do funky
 371 intersections and adjustments on a list of ops talking about those
 372 bytes, so those ops had better refer to large blocks of bytes.
 373
 374 I'm not sure what to do with that.  It almost seems like we want the
 375 text delta computation algorithm to optimize deltas for network
 376 transmission and deltas for storage differently.
 377
 378 \f
 379 ------------------------------------------------------------------------
 380   Notes from Brane: Delta composition
 381
 382 Despite JimB's concerns, it turns out that delta composition is
 383 straight-forward. The basic idea is that combining two deltas is
 384 equivalent to applying the second delta to a representation of the
 385 first delta's result. Bear with me while I shamelessly abuse what
 386 little I remember about linear algebra.
 387
 388 Given two deltas, A and B, and the source stream S, we want to
 389 construct a composed delta, AB, that converts S to the target stream
 390 T. Let's borrow notation from linear algebra and write this
 391 transformation like this:
 392
 393     T = AB(S)
 394
 395 Abusing algebra some more, I'll assume that a delta behaves like any
 396 other linear transformatsion; therefore,
 397
 398     AB(S) = B(A(S))
 399
 400 and since I'm not about to develop any rigorous proofs here, I'll
 401 just say that it follows from the above that
 402
 403     T = B(S'), where S' = A(S)
 404
 405 A small note here: Every delta is actually an n-tuple of delta
 406 opserations, represented by what I'll call the delta operators [n]
 407 (for new data), [s] (for source copies) and [t] (for target
 408 copies). [s] and [t] create bits of the target stream by operating on
 409 contiguous parts of the source and (existing) target stream,
 410 respectively; while [n] does the same by operating on raw data.
 411
 412
 413 Now, what we actually want to do is derive some form of AB (which, by
 414 the way, does not have a unique representation, sice we're not trying
 415 to find the optimal ("normalized") transform) that doesn't in any way
 416 rely on the value of S'. We do that by building a representation of S'
 417 that relies only on S, and any new data introduced by the [n]
 418 operators in A. That's possible because any [t] ops in A merely copy
 419 parts of S' that have been previously defined by [s] and [n]
 420 ops. Transforming A by (recursively) replacing all [t] ops with the
 421 equivalent [s] and [n] ops gives us exactly such a representation,
 422 which I'll call A'. [*]
 423
 424 Building AB from B and A' is trivial: traversing the list of delta ops
 425 in B, copy all [n] and [t] ops into the result; for [s] ops, copy the
 426 range of ops from A' that define the appropriate range in S'. For some
 427 of the copies, the first or last op from the range in A' will have to
 428 be split, and the first op in the copy range can sometimes be merged
 429 with the previous op in AB.
 430
 431
 432 Of course, stopping here could give a very sub-optimal AB, because it
 433 could contain many duplicate copies of the same op ranges from A'. We
 434 fix this by doing exactly the opposite transformation than A->A': by
 435 transforming [s] ops from B into [t] ops. We do this by recording the
 436 source and target of each copy from A' to AB and, whenever the [s]
 437 from B describes a range in T that was already defined, converting
 438 that into a [t] instead [**]. Unlike the A->A' transform, we can't remove
 439 all copies from A' to AB (we can only do that when AB doesn't refer to
 440 S at all), but we can significantly reduce the number of such copies.
 441
 442 The resulting AB will usually not be the optimal delta from S to T,
 443 because we will never actually look at S while constructing AB.
 444
 445
 446 Summarizing the above, we get the following delta composition
 447 algorithm:
 448
 449     ;; X is the set of byte ranges in T defined by copies from S'
 450     ;; Y is the current offset in T, defined by the ops in AB
 451     foreach OP in B:
 452       if (OP = [t]) or (OP = [n]):
 453         copy OP to AB
 454       else
 455         R = (set of ops from A that define S'[OP.offset, OP.length])
 456 [**]    using X, find the optimal set O of [t] ops to replace part of R
 457         foreach OP' in R:
 458           if OP' is in (R - O):
 459             if (OP' = [t]):
 460 [*]           replace OP' with equivalent [s] and [n] ops
 461             copy OP' to AB
 462           else
 463 [**]        copy OP's equivalent in O to AB
 464         insert S'[OP.offset, OP.length]->[Y, OP.length] into X
 465
 466
 467 This algorithm ignores details such as op splitting and merging,
 468 ensuring every op from O gets copied exactly once, helper indexes,
 469 etc..., but those are implementation details.
 470
 471 \f
 472 ------------------------------------------------------------------------
 473   Notes from Brane: Delta storage
 474
 475 O.K., now that delta composition is out of the way, let's talk a bit
 476 about storing and retrieving deltified text from the filesystem. I'll
 477 just jot down a few thoughts about how this could be done, assuming
 478 one of two possible models of storage management in the filesystem:
 479
 480   a) all data store and retrieve operations are synchronous; or,
 481   b) deltification can run in the background.
 482
 483
 484 1) Storing new data
 485
 486 New data arrives either as fulltext (from add) or as a delta from an
 487 existing version (copy, modify).
 488
 489   a) if the new data is a delta, convert it to fulltext (possibly
 490      combining several existing deltas in the process). Store the
 491      fulltext in the tip, and replace the previous tip (which, by
 492      induction, contains fulltext) with a delta from the current tip.
 493
 494   b) store the fulltext or delta in the tip and mark it for async
 495      modification. Do a) in the background.
 496
 497 2) Retrieving old data
 498
 499   a) If the data is a delta, follow the delte references (combining
 500      the deltas) until a fulltext is found; apply the combined delta
 501      to get the required fulltext.
 502
 503      If the combined delta reduces to a no-op (the two fulltexts are
 504      the same), store the fulltext in the younger of the two nodes and
 505      replace the older node's data with a "same" note.
 506
 507   b) Same as para 1 of a), then mark the node for async
 508      modification. In the background, find the diff between the two
 509      fulltexts. If they're equal, do para 2 of a). Otherwise, if the
 510      diff is smaller than the current diff in the node, replace the
 511      current representation. ("Smaller" could be construed as "more
 512      optimal" -- it would make sense to take into account the number
 513      of delta combinations that could be skipped by replacing the
 514      current representation when comparing sizes.)
 515
 516 \f
 517 d) Multiple parents per change
 518 ==============================
 519
 520 This is necessary for -- at least -- accurrate Merge Tracking, to
 521 allow for accurate calculation of change set graph.  Use cases
 522 include:
 523
 524 1) Avoiding repeated merges when performing cyclic merging
 525    (e.g branch A -> B -> A).
 526
 527 2) Sussing explicit merge info when a change in merge info occurs
 528    during a copy operation (e.g. to avoid paying attention to implied
 529    merge info from the copy source).
 530
 531 Mercurial (hg) does this.
 532
 533 \f
 534 e) Atomic renames
 535 =================
 536
 537 This may just be a means to an end?  Mercurial (hg) gets by without
 538 this, but this may be due to its distributed repository implementation.
 539
 540 It seems we can handle a lot of the desired use cases (see
 541 notes/tree-conflicts.txt) without true renames.
 542
 543 Exploratory work has been started on the fs-atomic-renames branch.