subversion/libsvn_fs_base/notes/fs-history

   1                             -*- text -*-
   2
   3                    Subversion Filesystem History
   4          (a love song for libsvn_fs, by C. Michael Pilato)
   5
   6
   7 The Subversion filesystem can be your best friend, or your worst
   8 enemy, usually depending on which side of the public API you are
   9 working on.  Callers of the libsvn_fs interfaces do their work in a
  10 world pleasantly addressed by roots (the name given to a revision or
  11 transaction snapshot of the versioned directory tree) and paths under
  12 those roots.  But once you swim beneath the surface, you quickly
  13 realize that there is a beast both beautiful and dangerous lying in
  14 wait.  What looks to the outside world as a sort of coordinate system
  15 with axes for "Time" and "Location" is, in fact, a complicated DAG
  16 subsystem, with nodes that represent revisions of versioned locations
  17 all interconnected in various relationships with each other.
  18
  19 The goal of this document is straightforward: to relay knowledge about
  20 how to untangle that DAG subsystem -- knowledge which likely lives
  21 only in the minds of a few developers -- so that the Few might become
  22 the Many.
  23
  24
  25 \f
  26 Node-Revisions: The Nodes of the DAG
  27
  28 When working outside the filesystem API, people generally talk about
  29 their versioned resources in terms of the paths of those resources,
  30 and the global revisions (or revisions-to-be) in which those paths
  31 are located.  But inside the filesystem, paths are broken down and
  32 stored as a hierarchical linked-list of path components.  Each of
  33 these path components has its own historical lineage (because
  34 Subversion versions directories, too!), and each revision of that
  35 lineage is referred to as a "node-revision".  It is these
  36 node-revisions which are the nodes of the DAG subsystem, or "DAG
  37 nodes".
  38
  39 DAG nodes are identified by unique keys called "node-revision IDs",
  40 and are inter-connected in a variety of ways.  A DAG node that
  41 represents a directory stores information about which other DAG nodes
  42 represent the children of that directory.  A DAG node contains
  43 information about which other DAG node is its historical predecessor.
  44 By tracing these links from node to node, we can effectively traverse
  45 both space and time, both the geography and the chronology of the
  46 filesystem landscape.
  47
  48 For example, the path "/trunk/src/main.c" in revision 4 of the
  49 filesystem consumes four DAG nodes: one for "/", one for "/trunk", one
  50 for "/trunk/src", and one for "/trunk/src/main.c".  The DAG node for
  51 "/" contains a list of the names and node-revision IDs of its
  52 children, among which is the node-revision ID for the child named
  53 "trunk".  Similar links are found in "/trunk" (for "src") and
  54 "/trunk/src" (for "main.c").  Additionally, if these paths existed in
  55 different forms in previous revisions of the filesystem, their DAG
  56 nodes would store the node-revision IDs of their respective
  57 predecessor nodes.
  58
  59 Whenever someone uses the public API to query for information about a
  60 versioned path under a particular root, the typical course of action
  61 under-the-hood is as follows:
  62
  63    1. The root refers to a particular snapshot of the DAG node tree,
  64       and from this we can learn the node-revision ID of the node
  65       which represents the root directory ("/") as it appears in that
  66       snaphot.  Given this node-revision ID, it's all DAG from here.
  67
  68    2. The path is split into components and traversed, beginning with
  69       the root node, and walking down toward the full path.  Each
  70       intermediate node-revision is read, its entries list parsed, and
  71       the next component looked up in that entries list to find the
  72       next node-revision ID along the traversal path.
  73
  74    3. Finally, we wind up with a node-revision ID for our original
  75       path.  We use it and its associated node-revision to answer the
  76       query.
  77
  78 Seems pretty easy, doesn't it?  Keep reading.
  79
  80
  81 \f
  82 All About Node-Revision IDs
  83
  84 As previously mentioned, each node-revision in the filesystem has a
  85 unique key, referred to as the node-revision ID.  This key is
  86 typically represented as a string which looks like a period-separated
  87 list of its three components:
  88
  89    1. node ID: This key is unique to the members of a single
  90       historical lineage.  Differing versions of the same versioned
  91       resource, irrespective of the paths and revision in which those
  92       versions are located, all share this ID.  If two node-revisions
  93       have different node IDs, their are historically unrelated.
  94
  95    2. copy ID: This key uniquely identifies a copy operation, and is
  96       sometimes referred to (or at least thought of) as a "branch ID."
  97       If two node-revisions have the same copy ID, they are said to be
  98       on the same branch.  The only exception to this is in the key
  99       "0", a special key that means "not branched".
 100
 101    3. txn ID: This key uniquely identifies the Subversion transaction
 102       in which this node-revision came into existence.
 103
 104 Whenever someone uses the public API to *modify* a versioned resource,
 105 these actions are much the same as those used when querying.  But
 106 there are important differences.
 107
 108    1. The path is traversed in the same manner is described in the
 109       previous section.  The result is an in-memory linked-list of
 110       information about the node-revisions which comprise the
 111       components of the path.
 112
 113    2. But before any changes can be made to a path, its node-revision
 114       and those of its parent directories must first be cloned so that
 115       changes to them don't affect previous incarnations of those
 116       node-revisions.  This process is called "making the path
 117       mutable".  If previous operations under this transaction caused
 118       one or more of the parent directories to be made mutable
 119       already, they are not again cloned.
 120
 121    3. Once the path and all its parents are mutable, the desired
 122       changes can be made to the cloned node-revision, and they in no
 123       way affect prior history.
 124
 125 To clone a node-revision means to literally make a duplicate of it
 126 which is granted its own unique node-revision ID.  The new
 127 node-revision ID consists of the same node ID as the node-revision
 128 that was cloned (since this is just another point along the historical
 129 lineage of this versioned resource), a copy ID (which will be
 130 discussed later), and the txn ID in which this modification is
 131 occuring.
 132
 133 There are some cool things we can read between the lines above.  Since
 134 the only time a node-revision comes into existence is when it is brand
 135 new or a fresh clone, and we never do cloning except during a
 136 modification, then we can use the txn ID as a sort of mutability flag.
 137 Mutability of a node-revision is determined by comparing the txn ID of
 138 the node-revision with the ID of the Subversion transaction being used
 139 to modify the filesystem -- if, and only if, they are the same, the node
 140 is allowed to be changed inside that transaction.
 141
 142 So, we know how txn IDs come into existence now.  And the origin of
 143 node IDs hardly warrants its own paragraph: brand new lines of history
 144 (introduced with svn_fs_make_file() and svn_fs_make_dir()) get new
 145 unique node IDs, and every other node-revision that is created simply
 146 keeps the same node ID as the node-revision on which it is based.
 147
 148 So what about those copy IDs?
 149
 150 Copy IDs are assigned to nodes-revisions to denote on which "branch"
 151 of a line of history that node-revision resides.  (They are called
 152 copy IDs for political reasons, really -- Subversion doesn't expose a
 153 branching API per se, instead promoting the idea that branches are
 154 just forks in the development of a line of history that can just as
 155 easily be represented using path semantics.)  New copy IDs are
 156 allocated whenever a branching operation occurs.  New node-revisions
 157 can inherit the copy IDs of their predecessors (in the trivial cloning
 158 case), inherit the copy-IDs of one of their parents (by nature of
 159 their parent being copied), or inherit new copy-IDs.  In the absence
 160 of any branching, node-revisions are assigned the special copy ID "0".
 161
 162
 163 \f
 164 Copies and Copy IDs
 165
 166 Currently there are two kinds of copy operation.  The first is a
 167 "real" copy, and is the direct result of a call to svn_fs_copy().
 168 When a real copy is made, the node-revision of the copy source is
 169 cloned, and earns its own brand new unique node-revision ID.  This
 170 node-revision ID is constructed from the original node ID, a brand new
 171 copy ID, and (as always) the txn ID of the transaction in which the
 172 copy is being made.
 173
 174 The Subversion filesystem uses a "cheap copy/lazy migration" model.
 175 This means that when a directory node-revision is copied via
 176 svn_fs_copy(), only the node-revision of the top of the copied "tree"
 177 is cloned (again, earning a new copy ID), not every child of that
 178 tree.  This makes the svn_fs_copy() operation quite fast, at least
 179 compared to the alternative.  From that point, any children of the
 180 copy target are lazily migrated.  The first time they are themselves
 181 modified after the original copy, they are cloned from their original
 182 source location into the new target location.  This lazy migration
 183 procedure costs about the same as a regular cloning operation, which
 184 keeps the "cheap copy" cheap, even the morning after.
 185
 186 Copies of files behave no differently than copies of directories.  But
 187 files never have children, so effectively the "tree" being copied is
 188 exactly one node-revision.  This node-revision is explicitly cloned at
 189 the time of the copy, and there is nothing to lazily migrate
 190 afterwards.
 191
 192 The second type of copy operation is a "soft" copy.  These types of
 193 copies are not explicitly triggered via the filesystem API, but are
 194 implementation artifacts of other filesystem operations.  A soft copy
 195 happens whenever a node-revision exists in a different branch than
 196 that of its parent, and the parent is again branched.  Huh?!  Let's
 197 see if an example will help explain this a bit.
 198
 199 Say you have a directory, "/trunk".  Now, into "/trunk" you copy a
 200 file "README" from some other part of the tree.  You have now
 201 effectively branched the original "README"'s history -- part of it
 202 will live on in the original location, but part of it now thrives in
 203 its new "/trunk/README" location.  The copy operation assigned a brand
 204 new copy ID to the new node-revision for "/trunk/README", which is
 205 necessarily different from the copy ID assigned to the node-revision
 206 for "/trunk".
 207
 208 Later, you decide to copy "/trunk" to "/branches/mine".  So the new
 209 "/branches/mine" also gets a brand new copy ID, since it is now a
 210 historical branch of "/trunk".  But what happens when
 211 "/branches/mine/README" is cloned later as part of some edits you are
 212 making?  What copy ID does the new clone get?  Because "/trunk/README"
 213 was on a different historical branch than "/trunk", our copy of
 214 "/trunk" causes (in "README") a branch of a branch.  So
 215 "/branches/mine/README" gets a brand new copy ID, and the filesystem
 216 remembers that the copy operation associated with that ID was a soft
 217 copy.
 218
 219    [### Right about here, C-Mike's memory starts getting fuzzy ###]
 220
 221 The following is the copy ID inheritence algorithm, used to calculate
 222 what copy ID a node revision will use when cloned for mutability.
 223 Remember that a node revision is never cloned until its parent is
 224 first cloned.
 225
 226   1.  If the node revision is already mutable, its copy ID never
 227       changes.
 228
 229   2.  If the node revision has a copy ID of "0" (which is to say, it's
 230       never been itself copied or cloned as a child of a copied
 231       parent), then it inherits whatever copy ID its parent winds up
 232       with.
 233
 234   3.  If the node revision is on the same branch as its parent before
 235       cloning, it will remain on the same branch as its parent after
 236       cloning.  A node revision can be said to be on the same branch
 237       as its parent if:
 238
 239         a) their copy IDs are the same, or
 240
 241         b) the node revision is not a branch point (meaning, it was
 242            not the node revision created by the copy associated with
 243            its copy ID), or
 244
 245         c) the node revision is a branch point which being accessed via
 246            its copy destination path.
 247
 248    4.  If, however, the node revision is *not* on the same branch as
 249        its parent before cloning, it cannot be on the same branch as
 250        its parent after cloning.  This breaks down to two cases:
 251
 252         a) If the node revision was the target of the copy operation
 253            whose ID it holds, then it gets to keep its same copy ID.
 254
 255         b) Otherwise, the node revision is the unedited child of some
 256            parent that was copied, and wasn't on the same branch as
 257            that parent before the copy.  In this special case, the
 258            cloned node revision will get a brand new copy ID which
 259            points to one of those "soft copy" things we've been
 260            talking about.
 261
 262 The initial root directory's node revision, created when the
 263 filesystem is initialized, begins life with a magical "0" copy ID.
 264 Afterward, any new nodes (as in, freshly created files and
 265 directories) begin life with the same copy ID as their parent.
 266
 267 \f
 268 Traversing History
 269
 270    ### todo:  put the history harvesting algorithm here