gitformat-chunk.txt

gitformat-chunk(5)
==================

NAME
----
gitformat-chunk - Chunk-based file formats

SYNOPSIS
--------

Used by linkgit:gitformat-commit-graph[5] and the "MIDX" format (see
the pack format documentation in linkgit:gitformat-pack[5]).

DESCRIPTION
-----------

Some file formats in Git use a common concept of "chunks" to describe
sections of the file. This allows structured access to a large file by
scanning a small "table of contents" for the remaining data. This common
format is used by the `commit-graph` and `multi-pack-index` files. See
the `multi-pack-index` format in linkgit:gitformat-pack[5] and
the `commit-graph` format in linkgit:gitformat-commit-graph[5] for
how they use the chunks to describe structured data.

A chunk-based file format begins with some header information custom to
that format. That header should include enough information to identify
the file type, format version, and number of chunks in the file. From this
information, a reader can determine the start of the chunk-based region.

The chunk-based region starts with a table of contents describing where
each chunk starts and ends. This consists of (C+1) rows of 12 bytes each,
where C is the number of chunks. Consider the following table:

  | Chunk ID (4 bytes) | Chunk Offset (8 bytes) |
  |--------------------|------------------------|
  | ID[0]              | OFFSET[0]              |
  | ...                | ...                    |
  | ID[C-1]            | OFFSET[C-1]            |
  | 0x00000000         | OFFSET[C]              |

Each row consists of a 4-byte chunk identifier (ID) and an 8-byte offset.
Each integer is stored in network byte order.

The chunk identifier `ID[i]` is a label for the data stored within this
file from `OFFSET[i]` (inclusive) to `OFFSET[i+1]` (exclusive). Thus, the
size of the `i`th chunk is equal to the difference between `OFFSET[i+1]`
and `OFFSET[i]`. This requires that the chunk data appears contiguously
in the same order as the table of contents.

The final entry in the table of contents must have an ID of four zero
bytes. This confirms that the table of contents is ending and provides the
offset for the end of the chunk-based data.

Note: The chunk-based format expects that the file contains _at least_ a
trailing hash after the offset given in the final entry of the table of
contents.
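
The table-of-contents layout can be illustrated with a small standalone
parser. This is a minimal sketch, not Git's implementation in
`chunk-format.c`: `struct toc_entry`, `read_be32()`, `read_be64()`, and
`parse_toc()` are hypothetical names, and rejecting a zero ID before the
final row is a choice made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOC_ROW_SIZE 12 /* 4-byte ID followed by an 8-byte offset */

/* Hypothetical in-memory form of one table-of-contents row. */
struct toc_entry {
    uint32_t id;
    uint64_t offset; /* where the chunk starts */
    uint64_t size;   /* next row's offset minus this row's offset */
};

/* Read a 32-bit integer stored in network byte order. */
static uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Read a 64-bit integer stored in network byte order. */
static uint64_t read_be64(const unsigned char *p)
{
    return ((uint64_t)read_be32(p) << 32) | read_be32(p + 4);
}

/*
 * Parse the rows for `nr` chunks plus the terminating row from `toc`,
 * which holds `toc_len` bytes. Returns 0 on success and -1 if the
 * table is too short or malformed.
 */
static int parse_toc(const unsigned char *toc, size_t toc_len,
                     size_t nr, struct toc_entry *out)
{
    size_t i;

    assert(toc && out);
    if (nr >= toc_len / TOC_ROW_SIZE)
        return -1; /* no room for nr rows plus the terminator */

    for (i = 0; i < nr; i++) {
        const unsigned char *row = toc + i * TOC_ROW_SIZE;
        uint32_t id = read_be32(row);
        uint64_t start = read_be64(row + 4);
        uint64_t end = read_be64(row + TOC_ROW_SIZE + 4);

        if (!id || end < start)
            return -1; /* early terminator or offsets out of order */
        out[i].id = id;
        out[i].offset = start;
        out[i].size = end - start;
    }

    /* The final row must carry an ID of four zero bytes. */
    if (read_be32(toc + nr * TOC_ROW_SIZE) != 0)
        return -1;
    return 0;
}
```

The size of each chunk is never stored explicitly; it is derived from the
offset in the following row, which is why the terminating row is required.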

Functions for working with chunk-based file formats are declared in
`chunk-format.h`. Using these functions provides extra checks that assist
developers when creating new file formats.

Writing chunk-based file formats
--------------------------------

To write a chunk-based file format, create a `struct chunkfile` by
calling `init_chunkfile()` and pass a `struct hashfile` pointer. The
caller is responsible for opening the `hashfile` and writing header
information so the file format is identifiable before the chunk-based
format begins.

Then, call `add_chunk()` for each chunk to be written. This
populates the `chunkfile` with information about the order and size of
each chunk to write. Provide a `chunk_write_fn` function pointer to
perform the write of the chunk data upon request.

Call `write_chunkfile()` to write the table of contents to the `hashfile`
followed by each of the chunks. This will verify that each chunk wrote
the expected amount of data so the table of contents is correct.

Finally, call `free_chunkfile()` to clear the `struct chunkfile` data. The
caller is responsible for finalizing the `hashfile` by writing the trailing
hash and closing the file.
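
As a rough illustration of what the written table of contents contains,
the following standalone sketch serializes the rows for a list of chunk
IDs and sizes. It does not use the `hashfile` or `chunkfile` structures:
`struct chunk_desc`, `put_be32()`, `put_be64()`, and `serialize_toc()` are
hypothetical names, and the sketch assumes that offsets are measured from
the start of the file and that the first chunk immediately follows the
table of contents.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOC_ROW_SIZE 12 /* 4-byte ID followed by an 8-byte offset */

/* Hypothetical description of one chunk queued for writing. */
struct chunk_desc {
    uint32_t id;
    uint64_t size;
};

/* Store a 32-bit integer in network byte order. */
static void put_be32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)v;
}

/* Store a 64-bit integer in network byte order. */
static void put_be64(unsigned char *p, uint64_t v)
{
    put_be32(p, (uint32_t)(v >> 32));
    put_be32(p + 4, (uint32_t)v);
}

/*
 * Write (nr + 1) rows into `out`, which must have room for
 * (nr + 1) * TOC_ROW_SIZE bytes. `header_size` is the number of bytes
 * the format writes before the table of contents. Returns the offset
 * at which the chunk data ends.
 */
static uint64_t serialize_toc(unsigned char *out,
                              const struct chunk_desc *chunks,
                              size_t nr, uint64_t header_size)
{
    /* Assumption: the first chunk starts right after the table. */
    uint64_t offset = header_size + (uint64_t)(nr + 1) * TOC_ROW_SIZE;
    size_t i;

    assert(out && (chunks || !nr));
    for (i = 0; i < nr; i++) {
        put_be32(out, chunks[i].id);
        put_be64(out + 4, offset);
        offset += chunks[i].size;
        out += TOC_ROW_SIZE;
    }

    /* Terminating row: zero ID plus the end of the chunk data. */
    put_be32(out, 0);
    put_be64(out + 4, offset);
    return offset;
}
```

Because every offset is computed from the declared sizes before any chunk
data is written, a chunk that writes more or fewer bytes than it declared
would leave the table inconsistent with the file, which is the kind of
mismatch the size check in `write_chunkfile()` guards against.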

Reading chunk-based file formats
--------------------------------

To read a chunk-based file format, the file must be opened as a
memory-mapped region. The chunk-format API expects that the entire file
is mapped as a contiguous memory region.

Initialize a `struct chunkfile` pointer with `init_chunkfile(NULL)`.

After reading the header information from the beginning of the file,
including the chunk count, call `read_table_of_contents()` to populate
the `struct chunkfile` with the list of chunks, their offsets, and their
sizes.

Extract the data information for each chunk using `pair_chunk()` or
`read_chunk()`:

* `pair_chunk()` assigns a given pointer with the location inside the
  memory-mapped file corresponding to that chunk's offset. If the chunk
  does not exist, then the pointer is not modified.

* `read_chunk()` takes a `chunk_read_fn` function pointer and calls it
  with the appropriate initial pointer and size information. The function
  is not called if the chunk does not exist. Use this function to read
  chunks if you need to perform immediate parsing or if you need to execute
  logic based on the size of the chunk.

After calling these functions, call `free_chunkfile()` to clear the
`struct chunkfile` data. This will not close the memory-mapped region.
Callers are expected to keep that region mapped for as long as the pointers
into it are needed.
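
The difference between the two lookup styles can be sketched without the
real API. The code below is an illustration only: `struct chunk_info`,
`chunk_cb`, `lookup_pair()`, `lookup_read()`, and `record_size()` are
hypothetical names, and their signatures and return values are assumptions
rather than the declarations found in `chunk-format.h`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one chunk found in the table of contents. */
struct chunk_info {
    uint32_t id;
    const unsigned char *start; /* points into the mapped file */
    size_t size;
};

/* Hypothetical callback type: receives the chunk's start and size. */
typedef int (*chunk_cb)(const unsigned char *start, size_t size,
                        void *data);

/*
 * Pointer-assignment style: on a match, store the chunk's start in *p
 * and return 0. If the chunk is absent, leave *p untouched and
 * return 1.
 */
static int lookup_pair(const struct chunk_info *chunks, size_t nr,
                       uint32_t id, const unsigned char **p)
{
    size_t i;

    assert(p);
    for (i = 0; i < nr; i++) {
        if (chunks[i].id == id) {
            *p = chunks[i].start;
            return 0;
        }
    }
    return 1;
}

/*
 * Callback style: on a match, call fn with the chunk's start and size
 * and return its result. If the chunk is absent, fn is not called and
 * 1 is returned.
 */
static int lookup_read(const struct chunk_info *chunks, size_t nr,
                       uint32_t id, chunk_cb fn, void *data)
{
    size_t i;

    assert(fn);
    for (i = 0; i < nr; i++) {
        if (chunks[i].id == id)
            return fn(chunks[i].start, chunks[i].size, data);
    }
    return 1;
}

/* Example callback: record the chunk size in the size_t behind data. */
static int record_size(const unsigned char *start, size_t size,
                       void *data)
{
    (void)start;
    *(size_t *)data = size;
    return 0;
}
```

The callback style suits chunks whose size must be validated before use,
because the size arrives together with the pointer; the pointer-assignment
style suits optional chunks, where a pointer initialized to `NULL` stays
`NULL` when the chunk is missing.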

Examples
--------

These file formats use the chunk-format API, and can be used as examples
for future formats:

* *commit-graph:* see `write_commit_graph_file()` and `parse_commit_graph()`
  in `commit-graph.c` for how the chunk-format API is used to write and
  parse the commit-graph file format documented in
  linkgit:gitformat-commit-graph[5].

* *multi-pack-index:* see `write_midx_internal()` and `load_multi_pack_index()`
  in `midx.c` for how the chunk-format API is used to write and
  parse the multi-pack-index file format documented in
  the multi-pack-index file format section of linkgit:gitformat-pack[5].

GIT
---
Part of the linkgit:git[1] suite
 132 ---
 133 Part of the linkgit:git[1] suite