src/main/asciidoc/_chapters/appendix_hfile_format.adoc

   1 ////
   2 /**
   3  *
   4  * Licensed to the Apache Software Foundation (ASF) under one
   5  * or more contributor license agreements.  See the NOTICE file
   6  * distributed with this work for additional information
   7  * regarding copyright ownership.  The ASF licenses this file
   8  * to you under the Apache License, Version 2.0 (the
   9  * "License"); you may not use this file except in compliance
  10  * with the License.  You may obtain a copy of the License at
  11  *
  12  *     http://www.apache.org/licenses/LICENSE-2.0
  13  *
  14  * Unless required by applicable law or agreed to in writing, software
  15  * distributed under the License is distributed on an "AS IS" BASIS,
  16  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  17  * See the License for the specific language governing permissions and
  18  * limitations under the License.
  19  */
  20 ////
  21
  22 [appendix]
  23 == HFile format
  24 :doctype: book
  25 :numbered:
  26 :toc: left
  27 :icons: font
  28 :experimental:
  29 :toc: left
  30 :source-language: java
  31
  32 This appendix describes the evolution of the HFile format.
  33
  34 [[hfilev1]]
  35 === HBase File Format (version 1)
  36
  37 As we will be discussing changes to the HFile format, it is useful to give a short overview of the original (HFile version 1) format.
  38
  39 [[hfilev1.overview]]
  40 ==== Overview of Version 1
  41
  42 An HFile in version 1 format is structured as follows:
  43
  44 .HFile V1 Format
  45 image::hfile.png[HFile Version 1]
  46
  47 ====  Block index format in version 1
  48
  49 The block index in version 1 is very straightforward.
  50 For each entry, it contains:
  51
  52 . Offset (long)
  53 . Uncompressed size (int)
  54 . Key (a serialized byte array written using Bytes.writeByteArray)
  55 .. Key length as a variable-length integer (VInt)
  56 .. Key bytes
  57
  58
  59 The number of entries in the block index is stored in the fixed file trailer, and has to be passed in to the method that reads the block index.
  60 One of the limitations of the block index in version 1 is that it does not provide the compressed size of a block, which turns out to be necessary for decompression.
  61 Therefore, the HFile reader has to infer this compressed size from the offset difference between blocks.
  62 We fix this limitation in version 2, where we store on-disk block size instead of uncompressed size, and get uncompressed size from the block header.
  63
  64 [[hfilev2]]
  65 === HBase file format with inline blocks (version 2)
  66
  67 Note:  this feature was introduced in HBase 0.92
  68
  69 ==== Motivation
  70
  71 We found it necessary to revise the HFile format after encountering high memory usage and slow startup times caused by large Bloom filters and block indexes in the region server.
  72 Bloom filters can get as large as 100 MB per HFile, which adds up to 2 GB when aggregated over 20 regions.
  73 Block indexes can grow as large as 6 GB in aggregate size over the same set of regions.
  74 A region is not considered opened until all of its block index data is loaded.
  75 Large Bloom filters produce a different performance problem: the first get request that requires a Bloom filter lookup will incur the latency of loading the entire Bloom filter bit array.
  76
  77 To speed up region server startup we break Bloom filters and block indexes into multiple blocks and write those blocks out as they fill up, which also reduces the HFile writer's memory footprint.
  78 In the Bloom filter case, "filling up a block" means accumulating enough keys to efficiently utilize a fixed-size bit array, and in the block index case we accumulate an "index block" of the desired size.
  79 Bloom filter blocks and index blocks (we call these "inline blocks") become interspersed with data blocks, and as a side effect we can no longer rely on the difference between block offsets to determine data block length, as it was done in version 1.
  80
  81 HFile is a low-level file format by design, and it should not deal with application-specific details such as Bloom filters, which are handled at StoreFile level.
  82 Therefore, we call Bloom filter blocks in an HFile "inline" blocks.
  83 We also supply HFile with an interface to write those inline blocks.
  84
  85 Another format modification aimed at reducing the region server startup time is to use a contiguous "load-on-open" section that has to be loaded in memory at the time an HFile is being opened.
  86 Currently, as an HFile opens, there are separate seek operations to read the trailer, data/meta indexes, and file info.
  87 To read the Bloom filter, there are two more seek operations for its "data" and "meta" portions.
  88 In version 2, we seek once to read the trailer and seek again to read everything else we need to open the file from a contiguous block.
  89
  90 [[hfilev2.overview]]
  91 ==== Overview of Version 2
  92
  93 The version of HBase introducing the above features reads both version 1 and 2 HFiles, but only writes version 2 HFiles.
  94 A version 2 HFile is structured as follows:
  95
  96 .HFile Version 2 Structure
  97 image::hfilev2.png[HFile Version 2]
  98
  99 ==== Unified version 2 block format
 100
 101 In the version 2 every block in the data section contains the following fields:
 102
 103 . 8 bytes: Block type, a sequence of bytes equivalent to version 1's "magic records". Supported block types are:
 104 .. DATA – data blocks
 105 .. LEAF_INDEX – leaf-level index blocks in a multi-level-block-index
 106 .. BLOOM_CHUNK – Bloom filter chunks
 107 .. META – meta blocks (not used for Bloom filters in version 2 anymore)
 108 .. INTERMEDIATE_INDEX – intermediate-level index blocks in a multi-level blockindex
 109 .. ROOT_INDEX – root>level index blocks in a multi>level block index
 110 .. FILE_INFO – the ``file info'' block, a small key>value map of metadata
 111 .. BLOOM_META – a Bloom filter metadata block in the load>on>open section
 112 .. TRAILER – a fixed>size file trailer.
 113   As opposed to the above, this is not an HFile v2 block but a fixed>size (for each HFile version) data structure
 114 .. INDEX_V1 – this block type is only used for legacy HFile v1 block
 115 . Compressed size of the block's data, not including the header (int).
 116 +
 117 Can be used for skipping the current data block when scanning HFile data.
 118 . Uncompressed size of the block's data, not including the header (int)
 119 +
 120 This is equal to the compressed size if the compression algorithm is NONE
 121 . File offset of the previous block of the same type (long)
 122 +
 123 Can be used for seeking to the previous data/index block
 124 . Compressed data (or uncompressed data if the compression algorithm is NONE).
 125
 126 The above format of blocks is used in the following HFile sections:
 127
 128 Scanned block section::
 129   The section is named so because it contains all data blocks that need to be read when an HFile is scanned sequentially.
 130   Also contains leaf block index and Bloom chunk blocks.
 131 Non-scanned block section::
 132   This section still contains unified-format v2 blocks but it does not have to be read when doing a sequential scan.
 133   This section contains "meta" blocks and intermediate-level index blocks.
 134
 135 We are supporting "meta" blocks in version 2 the same way they were supported in version 1, even though we do not store Bloom filter data in these blocks anymore.
 136
 137 ====  Block index in version 2
 138
 139 There are three types of block indexes in HFile version 2, stored in two different formats (root and non-root):
 140
 141 . Data index -- version 2 multi-level block index, consisting of:
 142 .. Version 2 root index, stored in the data block index section of the file
 143 .. Optionally, version 2 intermediate levels, stored in the non%root format in   the data index section of the file. Intermediate levels can only be present if leaf level blocks are present
 144 .. Optionally, version 2 leaf levels, stored in the non%root format inline with   data blocks
 145 . Meta index -- version 2 root index format only, stored in the meta index section of the file
 146 . Bloom index -- version 2 root index format only, stored in the ``load-on-open'' section as part of Bloom filter metadata.
 147
 148 ==== Root block index format in version 2
 149
 150 This format applies to:
 151
 152 . Root level of the version 2 data index
 153 . Entire meta and Bloom indexes in version 2, which are always single-level.
 154
 155 A version 2 root index block is a sequence of entries of the following format, similar to entries of a version 1 block index, but storing on-disk size instead of uncompressed size.
 156
 157 . Offset (long)
 158 +
 159 This offset may point to a data block or to a deeper>level index block.
 160
 161 . On-disk size (int)
 162 . Key (a serialized byte array stored using Bytes.writeByteArray)
 163 +
 164 . Key (VInt)
 165 . Key bytes
 166
 167
 168 A single-level version 2 block index consists of just a single root index block.
 169 To read a root index block of version 2, one needs to know the number of entries.
 170 For the data index and the meta index the number of entries is stored in the trailer, and for the Bloom index it is stored in the compound Bloom filter metadata.
 171
 172 For a multi-level block index we also store the following fields in the root index block in the load-on-open section of the HFile, in addition to the data structure described above:
 173
 174 . Middle leaf index block offset
 175 . Middle leaf block on-disk size (meaning the leaf index block containing the reference to the ``middle'' data block of the file)
 176 . The index of the mid-key (defined below) in the middle leaf-level block.
 177
 178
 179
 180 These additional fields are used to efficiently retrieve the mid-key of the HFile used in HFile splits, which we define as the first key of the block with a zero-based index of (n – 1) / 2, if the total number of blocks in the HFile is n.
 181 This definition is consistent with how the mid-key was determined in HFile version 1, and is reasonable in general, because blocks are likely to be the same size on average, but we don't have any estimates on individual key/value pair sizes.
 182
 183
 184
 185 When writing a version 2 HFile, the total number of data blocks pointed to by every leaf-level index block is kept track of.
 186 When we finish writing and the total number of leaf-level blocks is determined, it is clear which leaf-level block contains the mid-key, and the fields listed above are computed.
 187 When reading the HFile and the mid-key is requested, we retrieve the middle leaf index block (potentially from the block cache) and get the mid-key value from the appropriate position inside that leaf block.
 188
 189 ==== Non-root block index format in version 2
 190
 191 This format applies to intermediate-level and leaf index blocks of a version 2 multi-level data block index.
 192 Every non-root index block is structured as follows.
 193
 194 . numEntries: the number of entries (int).
 195 . entryOffsets: the "secondary index" of offsets of entries in the block, to facilitate
 196   a quick binary search on the key (`numEntries + 1` int values). The last value
 197   is the total length of all entries in this index block. For example, in a non-root
 198   index block with entry sizes 60, 80, 50 the "secondary index" will contain the
 199   following int array: `{0, 60, 140, 190}`.
 200 . Entries.
 201   Each entry contains:
 202 +
 203 . Offset of the block referenced by this entry in the file (long)
 204 . On>disk size of the referenced block (int)
 205 . Key.
 206   The length can be calculated from entryOffsets.
 207
 208
 209 ==== Bloom filters in version 2
 210
 211 In contrast with version 1, in a version 2 HFile Bloom filter metadata is stored in the load-on-open section of the HFile for quick startup.
 212
 213 . A compound Bloom filter.
 214 +
 215 . Bloom filter version = 3 (int). There used to be a DynamicByteBloomFilter class that had the Bloom   filter version number 2
 216 . The total byte size of all compound Bloom filter chunks (long)
 217 . Number of hash functions (int
 218 . Type of hash functions (int)
 219 . The total key count inserted into the Bloom filter (long)
 220 . The maximum total number of keys in the Bloom filter (long)
 221 . The number of chunks (int)
 222 . Comparator class used for Bloom filter keys, a UTF>8 encoded string stored   using Bytes.writeByteArray
 223 . Bloom block index in the version 2 root block index format
 224
 225
 226 ==== File Info format in versions 1 and 2
 227
 228 The file info block is a serialized map from byte arrays to byte arrays, with the following keys, among others.
 229 StoreFile-level logic adds more keys to this.
 230
 231 [cols="1,1", frame="all"]
 232 |===
 233 |hfile.LASTKEY| The last key of the file (byte array)
 234 |hfile.AVG_KEY_LEN| The average key length in the file (int)
 235 |hfile.AVG_VALUE_LEN| The average value length in the file (int)
 236 |===
 237
 238 In version 2, we did not change the file format, but we moved the file info to
 239 the final section of the file, which can be loaded as one block when the HFile
 240 is being opened.
 241
 242 Also, we do not store the comparator in the version 2 file info anymore.
 243 Instead, we store it in the fixed file trailer.
 244 This is because we need to know the comparator at the time of parsing the load-on-open section of the HFile.
 245
 246 ==== Fixed file trailer format differences between versions 1 and 2
 247
 248 The following table shows common and different fields between fixed file trailers in versions 1 and 2.
 249 Note that the size of the trailer is different depending on the version, so it is ``fixed'' only within one version.
 250 However, the version is always stored as the last four-byte integer in the file.
 251
 252 .Differences between HFile Versions 1 and 2
 253 [cols="1,1", frame="all"]
 254 |===
 255 | Version 1 | Version 2
 256 | |File info offset (long)
 257 | Data index offset (long)
 258 | loadOnOpenOffset (long) /The offset of the section that we need to load when opening the file./
 259 | | Number of data index entries (int)
 260 | metaIndexOffset (long) /This field is not being used by the version 1 reader, so we removed it from version 2./ | uncompressedDataIndexSize (long) /The total uncompressed size of the whole data block index, including root-level, intermediate-level, and leaf-level blocks./
 261 | | Number of meta index entries (int)
 262 | | Total uncompressed bytes (long)
 263 | numEntries (int) | numEntries (long)
 264 | Compression codec: 0 = LZO, 1 = GZ, 2 = NONE (int) | Compression codec: 0 = LZO, 1 = GZ, 2 = NONE (int)
 265 | | The number of levels in the data block index (int)
 266 | | firstDataBlockOffset (long) /The offset of the first data block. Used when scanning./
 267 | | lastDataBlockEnd (long) /The offset of the first byte after the last key/value data block. We don't need to go beyond this offset when scanning./
 268 | Version: 1 (int) | Version: 2 (int)
 269 |===
 270
 271
 272
 273 ==== getShortMidpointKey(an optimization for data index block)
 274
 275 Note: this optimization was introduced in HBase 0.95+
 276
 277 HFiles contain many blocks that contain a range of sorted Cells.
 278 Each cell has a key.
 279 To save IO when reading Cells, the HFile also has an index that maps a Cell's start key to the offset of the beginning of a particular block.
 280 Prior to this optimization, HBase would use the key of the first cell in each data block as the index key.
 281
 282 In HBASE-7845, we generate a new key that is lexicographically larger than the last key of the previous block and lexicographically equal or smaller than the start key of the current block.
 283 While actual keys can potentially be very long, this "fake key" or "virtual key" can be much shorter.
 284 For example, if the stop key of previous block is "the quick brown fox", the start key of current block is "the who", we could use "the r" as our virtual key in our hfile index.
 285
 286 There are two benefits to this:
 287
 288 * having shorter keys reduces the hfile index size, (allowing us to keep more indexes in memory), and
 289 * using something closer to the end key of the previous block allows us to avoid a potential extra IO when the target key lives in between the "virtual key" and the key of the first element in the target block.
 290
 291 This optimization (implemented by the getShortMidpointKey method) is inspired by LevelDB's ByteWiseComparatorImpl::FindShortestSeparator() and FindShortSuccessor().
 292
 293 [[hfilev3]]
 294 === HBase File Format with Security Enhancements (version 3)
 295
 296 Note: this feature was introduced in HBase 0.98
 297
 298 [[hfilev3.motivation]]
 299 ==== Motivation
 300
 301 Version 3 of HFile makes changes needed to ease management of encryption at rest and cell-level metadata (which in turn is needed for cell-level ACLs and cell-level visibility labels). For more information see <<hbase.encryption.server,hbase.encryption.server>>, <<hbase.tags,hbase.tags>>, <<hbase.accesscontrol.configuration,hbase.accesscontrol.configuration>>, and <<hbase.visibility.labels,hbase.visibility.labels>>.
 302
 303 [[hfilev3.overview]]
 304 ==== Overview
 305
 306 The version of HBase introducing the above features reads HFiles in versions 1, 2, and 3 but only writes version 3 HFiles.
 307 Version 3 HFiles are structured the same as version 2 HFiles.
 308 For more information see <<hfilev2.overview,hfilev2.overview>>.
 309
 310 [[hvilev3.infoblock]]
 311 ==== File Info Block in Version 3
 312
 313 Version 3 added two additional pieces of information to the reserved keys in the file info block.
 314
 315 [cols="1,1", frame="all"]
 316 |===
 317 | hfile.MAX_TAGS_LEN | The maximum number of bytes needed to store the serialized tags for any single cell in this hfile (int)
 318  | hfile.TAGS_COMPRESSED | Does the block encoder for this hfile compress tags? (boolean). Should only be present if hfile.MAX_TAGS_LEN is also present.
 319 |===
 320
 321 When reading a Version 3 HFile the presence of `MAX_TAGS_LEN` is used to determine how to deserialize the cells within a data block.
 322 Therefore, consumers must read the file's info block prior to reading any data blocks.
 323
 324 When writing a Version 3 HFile, HBase will always include `MAX_TAGS_LEN ` when flushing the memstore to underlying filesystem and when using prefix tree encoding for data blocks, as described in <<compression,compression>>.
 325
 326 When compacting extant files, the default writer will omit `MAX_TAGS_LEN` if all of the files selected do not themselves contain any cells with tags.
 327
 328 See <<compaction,compaction>> for details on the compaction file selection algorithm.
 329
 330 [[hfilev3.datablock]]
 331 ==== Data Blocks in Version 3
 332
 333 Within an HFile, HBase cells are stored in data blocks as a sequence of KeyValues (see <<hfilev1.overview,hfilev1.overview>>, or link:http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html[Lars George's
 334         excellent introduction to HBase Storage]). In version 3, these KeyValue optionally will include a set of 0 or more tags:
 335
 336 [cols="1,1", frame="all"]
 337 |===
 338 | Version 1 & 2,  Version 3 without MAX_TAGS_LEN | Version 3 with MAX_TAGS_LEN
 339 2+| Key Length (4 bytes)
 340 2+| Value Length (4 bytes)
 341 2+| Key bytes (variable)
 342 2+| Value bytes (variable)
 343 | | Tags Length (2 bytes)
 344 | | Tags bytes (variable)
 345 |===
 346
 347 If the info block for a given HFile contains an entry for `MAX_TAGS_LEN` each cell will have the length of that cell's tags included, even if that length is zero.
 348 The actual tags are stored as a sequence of tag length (2 bytes), tag type (1 byte), tag bytes (variable). The format an individual tag's bytes depends on the tag type.
 349
 350 Note that the dependence on the contents of the info block implies that prior to reading any data blocks you must first process a file's info block.
 351 It also implies that prior to writing a data block you must know if the file's info block will include `MAX_TAGS_LEN`.
 352
 353 [[hfilev3.fixedtrailer]]
 354 ==== Fixed File Trailer in Version 3
 355
 356 The fixed file trailers written with HFile version 3 are always serialized with protocol buffers.
 357 Additionally, it adds an optional field to the version 2 protocol buffer named encryption_key.
 358 If HBase is configured to encrypt HFiles this field will store a data encryption key for this particular HFile, encrypted with the current cluster master key using AES.
 359 For more information see <<hbase.encryption.server,hbase.encryption.server>>.
 360
 361 :numbered: