technical/sparse-index.txt

Git Sparse-Index Design Document
================================

The sparse-checkout feature allows users to focus a working directory on
a subset of the files at HEAD. The cone mode patterns, enabled by
`core.sparseCheckoutCone`, allow for very fast pattern matching to
discover which files at HEAD belong in the sparse-checkout cone.

Three important scale dimensions for a Git working directory are:

* `HEAD`: How many files are present at `HEAD`?

* Populated: How many files are within the sparse-checkout cone?

* Modified: How many files has the user modified in the working directory?

We will use big-O notation -- O(X) -- to denote how expensive certain
operations are in terms of these dimensions.

These dimensions are ordered by their magnitude: users (typically) modify
fewer files than are populated, and we can only populate files at `HEAD`.

Problems occur if there is an extreme imbalance in these dimensions. For
example, if `HEAD` contains millions of paths but the populated set has
only tens of thousands, then commands like `git status` and `git add` can
be dominated by work that scales as O(`HEAD`) instead of O(Populated).
Most of that cost is in parsing and rewriting the index, which is filled
primarily with files at `HEAD` that are marked with the `SKIP_WORKTREE`
bit.

The sparse-index intends to take these commands that read and modify the
index from O(`HEAD`) to O(Populated). To do this, we need to modify the
index format in a significant way: add "sparse directory" entries.

With cone mode patterns, it is possible to detect when an entire
directory will have its contents outside of the sparse-checkout definition.
Instead of listing all of the files it contains as individual entries, a
sparse-index contains an entry with the directory name, referencing the
object ID of the tree at `HEAD` and marked with the `SKIP_WORKTREE` bit.
If we need to discover the details for paths within that directory, we
can parse trees to find that list.
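
To make the collapse concrete, here is a minimal sketch in C over a toy
model. It is not Git's implementation: the names `sparse_dir_for()` and
`collapse()` are hypothetical, paths are plain strings, the cone is
approximated by a list of directory prefixes ending in `/`, and the tree
object ID that a real sparse-directory entry carries is omitted.

```c
#include <assert.h>
#include <string.h>

#define MAX_PATH_LEN 256

/* Return 1 if `s` begins with `prefix`. */
static int starts_with(const char *s, const char *prefix)
{
	return strncmp(s, prefix, strlen(prefix)) == 0;
}

/*
 * Find the shallowest directory containing `path` that neither lies in
 * the toy cone nor is an ancestor of a cone directory. On success, copy
 * its name (with trailing '/') into `out` and return 1; otherwise
 * return 0, meaning the path must stay a regular file entry.
 */
static int sparse_dir_for(const char *path, const char **cone, size_t ncone,
			  char *out, size_t outsz)
{
	const char *slash = path;

	while ((slash = strchr(slash, '/')) != NULL) {
		size_t len = (size_t)(slash - path) + 1; /* keep the '/' */
		int touches_cone = 0;
		size_t i;

		assert(len < outsz);
		memcpy(out, path, len);
		out[len] = '\0';

		for (i = 0; i < ncone; i++)
			if (starts_with(out, cone[i]) ||
			    starts_with(cone[i], out))
				touches_cone = 1;
		if (!touches_cone)
			return 1;
		slash++;
	}
	return 0;
}

/*
 * Walk the sorted `paths` and write the toy sparse entry list to `out`,
 * which must have room for `n` entries. Returns the number written.
 */
static size_t collapse(const char **paths, size_t n,
		       const char **cone, size_t ncone,
		       char out[][MAX_PATH_LEN])
{
	size_t nout = 0, i;
	char dir[MAX_PATH_LEN];

	for (i = 0; i < n; i++) {
		if (sparse_dir_for(paths[i], cone, ncone, dir, sizeof(dir))) {
			/* Sorted input keeps a directory's files adjacent. */
			if (nout && strcmp(out[nout - 1], dir) == 0)
				continue;
			strcpy(out[nout++], dir);
		} else {
			assert(strlen(paths[i]) < MAX_PATH_LEN);
			strcpy(out[nout++], paths[i]);
		}
	}
	return nout;
}
```

Given the sorted paths `README`, `docs/api/x.txt`, `docs/guide.txt`,
`src/lib/a.c`, `src/main.c`, `src/tools/t.c`, and `src/tools/u.c`, with
`src/lib/` as the only cone directory, this sketch produces five entries:
`README`, `docs/`, `src/lib/a.c`, `src/main.c`, and `src/tools/`.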

At time of writing, sparse-directory entries violate expectations about the
index format and its in-memory data structure. There are many consumers in
the codebase that expect to iterate through all of the index entries and
see only files. In fact, these loops expect to see a reference to every
staged file. One way to handle this is to parse trees to replace a
sparse-directory entry with all of the files within that tree as the index
is loaded. However, parsing trees is slower than parsing the index format,
so loading a sparse-index this way is slower than loading a full index.
The plan is to make all of these integrations "sparse aware" so this
expansion through tree parsing is unnecessary and they use fewer resources
than when using a full index.

The implementation plan below follows four phases to gradually integrate
with the sparse-index. The intention is to incrementally update Git
commands to interact safely with the sparse-index without significant
slowdowns. This may not always be possible, but the hope is that the
primary commands that users need in their daily work are dramatically
improved.

Phase I: Format and initial speedups
------------------------------------

During this phase, Git learns to enable the sparse-index and safely parse
one. Protections are put in place so that every consumer of the in-memory
data structure can operate with its current assumption of every file at
`HEAD`.

At first, every index parse will call a helper method,
`ensure_full_index()`, which scans the index for sparse-directory entries
(pointing to trees) and replaces them with the full list of paths
(pointing to blobs) by parsing tree objects. This will be slower in all
cases. The only noticeable change in behavior will be that the serialized
index file contains sparse-directory entries.
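
The expansion step can be sketched in C as follows. This is a toy model
under stated assumptions, not the real `ensure_full_index()`: the types
`toy_node` and `toy_entry` and the function `toy_ensure_full_index()` are
hypothetical, trees live in memory instead of the object database, and
entries are reduced to path strings.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_PATH_LEN 256

/* A toy tree entry: a subtree when `kids` is non-NULL, otherwise a blob. */
struct toy_node {
	const char *name;
	const struct toy_node *kids;
	size_t nkids;
};

/* A toy index entry: a sparse directory when `tree` is non-NULL. */
struct toy_entry {
	const char *path;            /* sparse directories end in '/' */
	const struct toy_node *tree;
};

/* Append every blob path beneath `tree` to `out`, depth first. */
static size_t expand_tree(const char *prefix, const struct toy_node *tree,
			  char out[][MAX_PATH_LEN], size_t nout)
{
	size_t i;

	for (i = 0; i < tree->nkids; i++) {
		const struct toy_node *kid = &tree->kids[i];
		char path[MAX_PATH_LEN];
		int len = snprintf(path, sizeof(path), "%s%s%s", prefix,
				   kid->name, kid->kids ? "/" : "");

		assert(len > 0 && (size_t)len < sizeof(path));
		if (kid->kids)
			nout = expand_tree(path, kid, out, nout);
		else
			strcpy(out[nout++], path);
	}
	return nout;
}

/*
 * Copy `in` to `out`, replacing each sparse-directory entry with the
 * file paths found by walking its tree. `out` must be large enough for
 * the expanded list. Returns the number of entries written.
 */
static size_t toy_ensure_full_index(const struct toy_entry *in, size_t n,
				    char out[][MAX_PATH_LEN])
{
	size_t nout = 0, i;

	for (i = 0; i < n; i++) {
		if (in[i].tree) {
			nout = expand_tree(in[i].path, in[i].tree, out, nout);
		} else {
			assert(strlen(in[i].path) < MAX_PATH_LEN);
			strcpy(out[nout++], in[i].path);
		}
	}
	return nout;
}
```

The work done here is proportional to the number of files beneath every
sparse directory, which is why expansion returns the command to O(`HEAD`)
cost.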

To start, we use a new required index extension, `sdir`, to allow
inserting sparse-directory entries into indexes with file format
versions 2, 3, and 4. This prevents Git versions that do not understand
the sparse-index from operating on one, while allowing tools that do not
understand the sparse-index to operate on repositories as long as they do
not interact with the index. A new format, index v5, will be introduced
that includes sparse-directory entries by default. It might also
introduce other features that have been considered for improving the
index.
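
The reason a required extension blocks older readers follows from the
index format's signature rule: an extension whose four-byte signature
starts with an uppercase letter `A`..`Z` is optional and may be ignored,
while any other signature must be understood. A minimal sketch of that
check is below; the function names are hypothetical and this is not the
code Git uses to read extensions.

```c
#include <assert.h>
#include <string.h>

/* Extension signatures are exactly four bytes long. */
#define EXT_SIG_LEN 4

/* Optional extensions start with an uppercase ASCII letter. */
static int extension_is_optional(const char *sig)
{
	return sig[0] >= 'A' && sig[0] <= 'Z';
}

/*
 * Decide whether a reader that understands only the signatures listed
 * in `known` may keep going after meeting extension `sig`.
 * Returns 1 to continue, 0 to refuse the index.
 */
static int reader_can_proceed(const char *sig,
			      const char **known, size_t nknown)
{
	size_t i;

	assert(strlen(sig) == EXT_SIG_LEN);
	for (i = 0; i < nknown; i++)
		if (memcmp(sig, known[i], EXT_SIG_LEN) == 0)
			return 1;
	/* Unknown extension: only safe to skip when it is optional. */
	return extension_is_optional(sig);
}
```

Under this rule, a reader that knows only `TREE` and `REUC` can skip an
unknown `UNTR` extension but must stop when it meets `sdir`.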

Next, consumers of the index will be guarded against operating on a
sparse-index by inserting calls to `ensure_full_index()` or
`expand_index_to_path()`. Requests for a specific path will be protected
from within the `index_file_exists()` and `index_name_pos()` API calls:
they will call `ensure_full_index()` if necessary. The intention here is
to preserve existing behavior when interacting with a sparse-checkout. We
don't want a change to happen by accident, without tests. Many of these
locations may not need any change before removing the guards, but we
should not do so without tests to ensure the expected behavior happens.
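
A path lookup guard of this kind might look like the following sketch.
It is a toy model: `toy_index`, `toy_ensure_full_index()`, and
`toy_index_file_exists()` are hypothetical stand-ins, the expanded entry
list is supplied up front instead of being produced by tree parsing, and
the search is linear for brevity.

```c
#include <assert.h>
#include <string.h>

struct toy_index {
	const char **entries;      /* a trailing '/' marks a sparse directory */
	size_t nr;
	const char **full_entries; /* what tree parsing would produce */
	size_t full_nr;
	int is_sparse;
};

/* Stand-in for expansion: swap in the pre-expanded entry list. */
static void toy_ensure_full_index(struct toy_index *idx)
{
	if (!idx->is_sparse)
		return;
	idx->entries = idx->full_entries;
	idx->nr = idx->full_nr;
	idx->is_sparse = 0;
}

/* Return 1 if `path` lies beneath one of the sparse-directory entries. */
static int inside_sparse_dir(const struct toy_index *idx, const char *path)
{
	size_t i;

	for (i = 0; i < idx->nr; i++) {
		size_t len = strlen(idx->entries[i]);

		if (len && idx->entries[i][len - 1] == '/' &&
		    strncmp(path, idx->entries[i], len) == 0)
			return 1;
	}
	return 0;
}

/* Guarded lookup: expand only when the answer depends on a collapsed tree. */
static int toy_index_file_exists(struct toy_index *idx, const char *path)
{
	size_t i;

	assert(path && *path);
	if (idx->is_sparse && inside_sparse_dir(idx, path))
		toy_ensure_full_index(idx);

	for (i = 0; i < idx->nr; i++)
		if (strcmp(idx->entries[i], path) == 0)
			return 1;
	return 0;
}
```

In this sketch, a lookup for a path inside the cone leaves the index
sparse, while a lookup beneath a sparse directory triggers the expansion
first, so callers see the same answers they would get from a full index.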

It may be desirable to _change_ the behavior of some commands in the
presence of a sparse index or more generally in any sparse-checkout
scenario. In such cases, these should be carefully communicated and
tested. No such behavior changes are intended during this phase.

During a scan of the codebase, not every iteration of the cache entries
needs an `ensure_full_index()` check. The basic reasons include:

1. The loop is scanning for entries with non-zero stage. These entries
   are not collapsed into a sparse-directory entry.

2. The loop is scanning for submodules. These entries are not collapsed
   into a sparse-directory entry.

3. The loop is part of the index API, especially around reading or
   writing the format.

4. The loop is checking for correct order of cache entries and that is
   correct if and only if the sparse-directory entries are in the correct
   location.

5. The loop ignores entries with the `SKIP_WORKTREE` bit set, or is
   otherwise already aware of sparse directory entries.

6. The sparse-index is disabled at this point when using the split-index
   feature, so no effort is made to protect the split-index API.
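
Reasons 1 and 5 can be illustrated with two toy loops that give the same
answer on a full index and on its sparse equivalent, so neither would
need a guard. The struct and function names below are hypothetical and
the entry layout is far simpler than Git's cache entries.

```c
#include <assert.h>
#include <stddef.h>

struct toy_entry {
	const char *path;
	unsigned stage;    /* 0 = merged, 1..3 = conflict stages */
	int skip_worktree; /* set for files and directories outside the cone */
};

/* Reason 1: conflicted entries are never folded into a sparse directory. */
static size_t count_unmerged(const struct toy_entry *e, size_t n)
{
	size_t i, count = 0;

	assert(e || n == 0);
	for (i = 0; i < n; i++)
		if (e[i].stage)
			count++;
	return count;
}

/* Reason 5: the loop skips everything marked SKIP_WORKTREE. */
static size_t count_populated(const struct toy_entry *e, size_t n)
{
	size_t i, count = 0;

	assert(e || n == 0);
	for (i = 0; i < n; i++)
		if (!e[i].skip_worktree)
			count++;
	return count;
}
```

Both loops only look at entries that a sparse directory can never hide,
so collapsing `docs/api/x.txt` and `docs/guide.txt` into a single
`docs/` entry does not change their results.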

Even after inserting these guards, we will keep expanding sparse-indexes
for most Git commands using the `command_requires_full_index` repository
setting. This setting will be on by default and disabled one builtin at a
time until we have sufficient confidence that all of the index operations
are properly guarded.
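
The opt-out pattern could be modeled as below. This is only a sketch:
`toy_repo`, `toy_prepare_repo_settings()`, and `toy_read_index()` are
hypothetical, and the expansion itself is reduced to flipping a flag and
counting how often it happens.

```c
#include <assert.h>

struct toy_repo {
	int command_requires_full_index; /* the setting described above */
	int index_is_sparse;             /* state of the in-memory index */
	int expansions;                  /* how many expansions were paid for */
};

/* Defaults: every command is assumed to need a full index. */
static void toy_prepare_repo_settings(struct toy_repo *r)
{
	assert(r);
	r->command_requires_full_index = 1;
	r->index_is_sparse = 0;
	r->expansions = 0;
}

/* Load a sparse on-disk index and expand it unless the command opted out. */
static void toy_read_index(struct toy_repo *r)
{
	assert(r);
	r->index_is_sparse = 1;
	if (r->command_requires_full_index) {
		r->index_is_sparse = 0; /* stand-in for ensure_full_index() */
		r->expansions++;
	}
}
```

A builtin that has been integrated would clear
`command_requires_full_index` after preparing the settings and before
reading the index. Every other builtin keeps the default and pays for
one expansion per index read.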

To complete this phase, the commands `git status` and `git add` will be
integrated with the sparse-index so that they operate with O(Populated)
performance. They will be carefully tested for operations within and
outside the sparse-checkout definition.

Phase II: Careful integrations
------------------------------

This phase focuses on ensuring that all index extensions and APIs work
well with a sparse-index. This requires significant increases to our test
coverage, especially for operations that interact with the working
directory outside of the sparse-checkout definition. Some of the current
behaviors may not be the desirable ones, as shown by some tests already
marked for failure in `t1092-sparse-checkout-compatibility.sh`.

The index extensions that may require special integrations are:

* FS Monitor
* Untracked cache

While integrating with these features, we should look for patterns that
might lead to better APIs for interacting with the index. Coalescing
common usage patterns into an API call can reduce the number of places
where sparse-directories need to be handled carefully.

Phase III: Important command speedups
-------------------------------------

At this point, the patterns for testing and implementing sparse-directory
logic should be relatively stable. This phase focuses on updating some of
the most common builtins that use the index to operate as O(Populated).
Here is a potential list of commands that could be valuable to integrate
at this point:

* `git commit`
* `git checkout`
* `git merge`
* `git rebase`

Hopefully, commands such as `git merge` and `git rebase` can benefit
instead from merge algorithms that do not use the index as a data
structure, such as the merge-ORT strategy. As these topics mature, we
may enable the ORT strategy by default for repositories using the
sparse-index feature.

Along with `git status` and `git add`, these commands cover the majority
of users' interactions with the working directory. In addition, we can
integrate with these commands:

* `git grep`
* `git rm`

These have been proposed as commands whose behavior could change when in
a repo with a sparse-checkout definition. It would be good to include
this behavior automatically when using a sparse-index. The behavior
switch needs to be communicated clearly to the user.

This phase is the first where parallel work might be possible without too
many conflicts between topics.

Phase IV: The long tail
-----------------------

This last phase is less a "phase" and more "the new normal" after all of
the previous work.

To start, the `command_requires_full_index` option could be removed in
favor of expanding only when hitting an API guard.

There are many Git commands that could use special attention to operate as
O(Populated), while some might be so rare that it is acceptable to leave
them with additional overhead when a sparse-index is present.

Here are some commands that might be useful to update:

* `git sparse-checkout set`
* `git am`
* `git clean`
* `git stash`