src/main/asciidoc/_chapters/offheap_read_write.adoc

////
/**
 *
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
////

[[offheap_read_write]]
= RegionServer Offheap Read/Write Path
:doctype: book
:numbered:
:toc: left
:icons: font
:experimental:

[[regionserver.offheap.overview]]
== Overview

To help reduce P99/P999 RPC latencies, HBase 2.x makes the read and write paths use a pool of offheap buffers. Cells are
allocated in offheap memory, outside the purview of the JVM garbage collector, with an attendant reduction in GC pressure.
In the write path, the request packet received from the client is read into a pre-allocated offheap buffer and retained
offheap until its cells are successfully persisted to the WAL and MemStore. The in-memory data structure in the MemStore does
not store the cell memory directly; it references the cells encoded in the offheap buffers. The read path works similarly:
we try the block cache first and, on a cache miss, go to the HFile and read the respective block. The
workflow from reading blocks to sending cells to the client does its best to avoid on-heap memory allocations, reducing the
amount of work the GC has to do.

image::offheap-overview.png[]

The read section of the diagram above still shows one onheap step; for how to address it, see <<regionserver.read.hdfs.block.offheap>>.

[[regionserver.offheap.readpath]]
== Offheap read-path
In HBase 2.0.0, link:https://issues.apache.org/jira/browse/HBASE-11425[HBASE-11425] changed the HBase read path so that it
can hold read data off-heap, avoiding copying cached data (BlockCache) onto the Java heap (for uncached data,
see the note under the diagram in the section above). This reduces GC pauses because less garbage is created and so there is less
to clear. The off-heap read path can perform similarly to, or better than, the on-heap LRU cache.
For more details and test results on the off-heap read path, see the blog posts
link:https://blogs.apache.org/hbase/entry/offheaping_the_read_path_in[Offheaping the Read Path in Apache HBase: Part 1 of 2]
and link:https://blogs.apache.org/hbase/entry/offheap-read-path-in-production[Offheap Read-Path in Production - The Alibaba story].

For an end-to-end off-heap read path, all you have to do is enable an off-heap backed <<offheap.blockcache>> (BC).
To do this, set `hbase.bucketcache.ioengine` to `offheap` in _hbase-site.xml_ (see <<bc.deploy.modes>> to learn
more about the `hbase.bucketcache.ioengine` options). Also specify the total capacity of the BC using `hbase.bucketcache.size`.
Remember to adjust the value of _HBASE_OFFHEAPSIZE_ in _hbase-env.sh_ (see <<bc.example>> for help with sizing and an example
of enabling it). This setting specifies the maximum possible off-heap memory allocation for the RegionServer Java
process. It should be bigger than the off-heap BC size to accommodate other features that use off-heap memory,
such as the server RPC buffer pool and short-circuit reads (see the discussion in <<bc.example>>).

Keep in mind that there is no default for `hbase.bucketcache.ioengine`, which means the `BucketCache` is OFF by default
(see <<direct.memory>>).

This is all you need to do to enable the off-heap read path. Most buffers in HBase are already off-heap. With the BC off-heap,
the read pipeline copies data between HDFS and the server socket (with the caveat described in <<hbase.ipc.server.reservoir.initial.max>>)
when sending results back to the client.

[[regionserver.offheap.rpc.bb.tuning]]
=== Tuning the RPC buffer pool
You can tune the ByteBuffer pool that the RPC server uses to accumulate cell bytes and create the result
cell blocks sent back to the client. Use `hbase.ipc.server.reservoir.enabled` to turn this pool ON or OFF. By
default the pool is ON: HBase creates off-heap ByteBuffers and pools them. Do not
turn it OFF if you want an end-to-end off-heap read path.

If the pool is turned off, the server creates temporary onheap buffers to accumulate the cell bytes and
make the result cell block. This can increase GC load on a read-heavy server.

NOTE: The config keys that start with the prefix `hbase.ipc.server.reservoir` are deprecated in hbase-3.x (the
internal pool implementation changed). If you are on hbase-2.2.x or older, use the old config
keys. If you are on hbase-3.x or hbase-2.3.x+, use the new config keys
(see <<regionserver.read.hdfs.block.offheap,deprecated and new configs in HBase3.x>>).

The pool itself can be tuned in two ways: how many buffers it holds and how large each ByteBuffer is. Use the config
`hbase.ipc.server.reservoir.initial.buffer.size` to set the size of each buffer. The default is 64KB for hbase-2.2.x
and earlier, and 65KB for hbase-2.3.x+
(see link:https://issues.apache.org/jira/browse/HBASE-22532[HBASE-22532]).

When a result is larger than a single ByteBuffer (64KB by default), the server tries to grab more than one
ByteBuffer and builds the result cell block from a collection of fixed-size ByteBuffers. When the pool runs
out of buffers, the server skips the pool and creates temporary on-heap buffers.

The maximum number of ByteBuffers in the pool is set with the config `hbase.ipc.server.reservoir.initial.max`.
Its default is derived from the RegionServer handler count (see the config `hbase.regionserver.handler.count`). The
math assumes a result cell block of 2 MB per read and that every handler is
serving a read. A 2 MB result needs 32 buffers of 64 KB each (the default pool buffer size), so each handler needs
32 ByteBuffers (BBs). We allocate twice that as the maximum BB count so that a handler can create a response,
hand it to the RPC Responder thread, and then handle a new request and create a new response cell block (using
pooled buffers). Even if the responder cannot send back the first TCP reply immediately, this count should leave
enough buffers in the pool that we do not have to make temporary buffers on the heap. For smaller
random row reads, tune this maximum count accordingly. The buffers are created lazily; the count is the maximum number to be pooled.
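
The sizing arithmetic above can be sketched in a few lines of plain Java. This is only an illustration of the math in the text, not HBase source code: the class and method names are hypothetical, and the handler count of 30 is used as an example value.

```java
// Illustrative sketch of the default pool sizing math (hypothetical names, not HBase code).
final class RpcReservoirSizing {
    // 2 MB: the result cell block size per read assumed by the text.
    static final int RESULT_BLOCK_BYTES = 2 * 1024 * 1024;

    // Number of fixed-size buffers needed to hold one 2 MB result.
    static int buffersPerResult(int bufferSizeBytes) {
        return RESULT_BLOCK_BYTES / bufferSizeBytes;
    }

    // Twice the per-result count for each handler: one response in flight
    // with the responder, one being built for the next request.
    static int defaultMaxBuffers(int handlerCount, int bufferSizeBytes) {
        return handlerCount * buffersPerResult(bufferSizeBytes) * 2;
    }

    public static void main(String[] args) {
        int handlers = 30;            // example handler count
        int bufferSize = 64 * 1024;   // 64 KB pool buffer size
        System.out.println("buffers per result = " + buffersPerResult(bufferSize));
        System.out.println("max pooled buffers = " + defaultMaxBuffers(handlers, bufferSize));
    }
}
```

With 30 handlers and 64 KB buffers this gives 32 buffers per result and at most 1920 pooled buffers.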

If you still see GC issues even after making the end-to-end read path off-heap, look for issues in the appropriate buffer
pool. Check for the following RegionServer log line at INFO level in HBase 2.x:

[source]
----
Pool already reached its max capacity : XXX and no free buffers now. Consider increasing the value for 'hbase.ipc.server.reservoir.initial.max' ?
----

Or the following log message in HBase 3.x:

[source]
----
Pool already reached its max capacity : XXX and no free buffers now. Consider increasing the value for 'hbase.server.allocator.max.buffer.count' ?
----

[[hbase.offheapsize]]
The setting for _HBASE_OFFHEAPSIZE_ in _hbase-env.sh_ should also account for this off-heap buffer pool on the server side.
Configure the maximum off-heap size for the RegionServer to be somewhat higher than the sum of the maximum pool size and
the off-heap cache size. The TCP layer also needs to create direct ByteBuffers for TCP communication, and the DFS
client needs some off-heap memory to do its work, especially if short-circuit reads are configured. Allocating an extra
1 - 2 GB for the maximum direct memory size has worked in tests.
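
As a rough aid for the sizing advice above, the following sketch adds up the off-heap cache size, the maximum pool size, and some headroom. All names are hypothetical, and the 8 GB cache, 1890 buffers of 66560 bytes, and 2 GB headroom are example inputs only; substitute the values from your own configuration.

```java
// Illustrative off-heap budget estimate (hypothetical names and example values).
final class OffheapBudgetSketch {
    // Maximum bytes the buffer pool can hold: buffer count times buffer size.
    static long poolBytes(int maxBufferCount, int bufferSizeBytes) {
        return (long) maxBufferCount * bufferSizeBytes;
    }

    // Off-heap cache + buffer pool + headroom for TCP and DFS client buffers.
    static long suggestedMaxDirectBytes(long bucketCacheBytes, long poolBytes, long headroomBytes) {
        return bucketCacheBytes + poolBytes + headroomBytes;
    }

    public static void main(String[] args) {
        long gb = 1024L * 1024 * 1024;
        long cache = 8 * gb;                  // example off-heap cache size
        long pool = poolBytes(1890, 66560);   // example pool: 1890 buffers of 65 KB
        long headroom = 2 * gb;               // upper end of the 1 - 2 GB suggested in the text
        long total = suggestedMaxDirectBytes(cache, pool, headroom);
        System.out.println("suggested max direct memory (bytes) = " + total);
    }
}
```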

If you use coprocessors and refer to the Cells in read results, DO NOT store references to these Cells outside
the scope of the CP hook methods. Sometimes a CP wants to store information about a Cell (such as its row key) for use
in the next CP hook call. In such cases, clone the required fields, or the entire Cell, as the use case requires
(see the `CellUtil#cloneXXX(Cell)` APIs).
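
The hazard can be illustrated with plain `java.nio` buffers. The sketch below is an analogy only, using hypothetical helper names and no HBase classes: a retained view of a reused direct buffer sees the new contents, while bytes copied out beforehand stay stable. In a real coprocessor the copy would be made with the `CellUtil` clone methods mentioned above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Analogy for retaining a reference into a pooled buffer (hypothetical names).
final class PooledBufferReuseDemo {
    // Overwrites the buffer from position 0, as a pool would when reusing it.
    static void write(ByteBuffer buf, String value) {
        buf.clear();
        buf.put(value.getBytes(StandardCharsets.US_ASCII));
    }

    // Copies the first `length` bytes into a fresh on-heap array.
    static byte[] cloneBytes(ByteBuffer buf, int length) {
        byte[] copy = new byte[length];
        ByteBuffer dup = buf.duplicate(); // independent position, shared content
        dup.clear();
        dup.get(copy, 0, length);
        return copy;
    }

    // Reads the first `length` bytes through a (possibly stale) view.
    static String readView(ByteBuffer view, int length) {
        return new String(cloneBytes(view, length), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        ByteBuffer pooled = ByteBuffer.allocateDirect(8);
        write(pooled, "row-0001");
        ByteBuffer retained = pooled.duplicate();   // unsafe: a reference kept past the hook
        byte[] cloned = cloneBytes(pooled, 8);      // safe: a private copy
        write(pooled, "row-9999");                  // the buffer is reused for another request
        System.out.println("retained view: " + readView(retained, 8));
        System.out.println("cloned copy:   " + new String(cloned, StandardCharsets.US_ASCII));
    }
}
```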

[[regionserver.read.hdfs.block.offheap]]
== Read block from HDFS to offheap directly

In HBase 2.x, the RegionServer reads blocks from HDFS into a temporary onheap ByteBuffer and then flushes them to
the BucketCache. Even if the BucketCache is offheap, the HDFS read is first pulled onheap before being written
out to the offheap BucketCache. This can cause considerable GC pressure when the cache hit ratio is low (e.g. a cacheHitRatio ~ 60%).
link:https://issues.apache.org/jira/browse/HBASE-21879[HBASE-21879] addresses this issue (requires hbase-2.3.x/hbase-3.x).
It depends on a supporting HDFS being in place (hadoop-2.10.x or hadoop-3.3.x), and it may require patching
HBase itself (as of this writing); see
link:https://issues.apache.org/jira/browse/HBASE-21879[HBASE-21879 Read HFile's block to ByteBuffer directly instead of to byte for reducing young gc purpose].
When set up appropriately, reads from HDFS go into offheap buffers, which are passed offheap to the offheap BlockCache for caching.

For more details about the design and the performance improvement, see the
link:https://docs.google.com/document/d/1xSy9axGxafoH-Qc17zbD2Bd--rWjjI00xTWQZ8ZwI_E[Design Doc - Read HFile's block to Offheap].

Below we share some best practices for performance tuning, but first we introduce the new (hbase-3.x/hbase-2.3.x) configuration names
that go with the new internal pool implementation (`ByteBuffAllocator` vs the old `ByteBufferPool`), some of which mirror the now deprecated
hbase-2.2.x configurations discussed above in <<regionserver.offheap.rpc.bb.tuning>>. Much of the advice here overlaps with the advice
in that section because the implementations have similar configurations.

1. `hbase.server.allocator.pool.enabled` controls whether the RegionServer uses the pooled offheap ByteBuffer allocator. The default
value is true. In hbase-2.x, the deprecated `hbase.ipc.server.reservoir.enabled` did something similar and is mapped to this config
until support for the old configuration is removed. The new name is used in hbase-3.x and hbase-2.3.x+.
2. `hbase.server.allocator.minimal.allocate.size` is the threshold at which we start allocating from the pool. Smaller
requests are allocated onheap directly because it would be wasteful to allocate small amounts from our pool of fixed-size
ByteBuffers. The default minimum is `hbase.server.allocator.buffer.size/6`.
3. `hbase.server.allocator.max.buffer.count`: The `ByteBuffAllocator`, the new pool/reservoir implementation, has fixed-size
ByteBuffers. This config sets how many buffers to pool. Its default value is 2MB * 2 * hbase.regionserver.handler.count / 65KB
(similar to the discussion above in <<regionserver.offheap.rpc.bb.tuning>>). With the default `hbase.regionserver.handler.count` of 30, the default is 1890.
4. `hbase.server.allocator.buffer.size`: The byte size of each ByteBuffer. The default value is 66560 (65KB); 65KB was chosen instead of 64KB
because of link:https://issues.apache.org/jira/browse/HBASE-22532[HBASE-22532].

The three config keys introduced in hbase-2.x (`hbase.ipc.server.reservoir.enabled`, `hbase.ipc.server.reservoir.initial.buffer.size` and `hbase.ipc.server.reservoir.initial.max`)
have been renamed and deprecated in hbase-3.x/hbase-2.3.x. Use the new config keys instead, respectively:
`hbase.server.allocator.pool.enabled`, `hbase.server.allocator.buffer.size` and `hbase.server.allocator.max.buffer.count`.
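
The defaults listed above follow from simple integer arithmetic, sketched here in plain Java. The class and method names are hypothetical and the code only mirrors the formulas stated in the list; it is not the `ByteBuffAllocator` implementation.

```java
// Illustrative computation of the allocator defaults described above (hypothetical names).
final class AllocatorDefaultsSketch {
    static final long TWO_MB = 2L * 1024 * 1024;

    // 2MB * 2 * handler count / buffer size, using integer division.
    static int defaultMaxBufferCount(int handlerCount, int bufferSizeBytes) {
        return (int) (TWO_MB * 2 * handlerCount / bufferSizeBytes);
    }

    // Default threshold: one sixth of the buffer size.
    static int defaultMinimalAllocateSize(int bufferSizeBytes) {
        return bufferSizeBytes / 6;
    }

    // Requests at or above the threshold are served from the pool; smaller ones go onheap.
    static boolean servedFromPool(int requestBytes, int minimalAllocateSize) {
        return requestBytes >= minimalAllocateSize;
    }

    public static void main(String[] args) {
        int bufferSize = 66560; // 65 KB
        System.out.println("max buffer count      = " + defaultMaxBufferCount(30, bufferSize));
        System.out.println("minimal allocate size = " + defaultMinimalAllocateSize(bufferSize));
    }
}
```

With 30 handlers and 66560-byte buffers this yields 1890 buffers and a pool threshold of 11093 bytes.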

Next, we have some suggestions regarding performance.

.Please make sure that there are enough pooled DirectByteBuffers in your ByteBuffAllocator.

The ByteBuffAllocator allocates ByteBuffers from the DirectByteBuffer pool first. If
no ByteBuffer is available in the pool, it allocates the ByteBuffers onheap.
By default, we will pre-allocate 4MB for each RPC handler (the handler count is set by the config
`hbase.regionserver.handler.count`, which defaults to 30). That is to say, if your `hbase.server.allocator.buffer.size`
is 65KB, then your pool will have 2MB * 2 / 65KB * 30 = 1890 DirectByteBuffers. If you have large scans and a big cache,
you may have RPC responses larger than 2MB (the other 2MB is for receiving the RPC request), in which case it is
better to increase `hbase.server.allocator.max.buffer.count`.

The RegionServer web UI has statistics on the ByteBuffAllocator:

image::bytebuff-allocator-stats.png[]

If the following condition is met, you may need to increase `hbase.server.allocator.max.buffer.count`:

  heapAllocationRatio >= hbase.server.allocator.minimal.allocate.size / hbase.server.allocator.buffer.size * 100%
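
The condition above can be checked with a small helper. This is a sketch with hypothetical names; it assumes `heapAllocationRatio` is read from the web UI as a percentage and that the two sizes are the configured values in bytes.

```java
// Illustrative check of the heapAllocationRatio condition (hypothetical names).
final class HeapAllocationRatioCheck {
    // minimal.allocate.size / buffer.size, expressed as a percentage.
    static double thresholdPercent(int minimalAllocateSize, int bufferSize) {
        return 100.0 * minimalAllocateSize / bufferSize;
    }

    // True when the observed ratio reaches the threshold, suggesting more pooled buffers.
    static boolean shouldIncreaseMaxBufferCount(double heapAllocationRatioPercent,
                                                int minimalAllocateSize, int bufferSize) {
        return heapAllocationRatioPercent >= thresholdPercent(minimalAllocateSize, bufferSize);
    }

    public static void main(String[] args) {
        int bufferSize = 66560;               // 65 KB
        int minimalSize = bufferSize / 6;     // default threshold per the text
        System.out.println("threshold % = " + thresholdPercent(minimalSize, bufferSize));
        System.out.println("increase at 20%? " + shouldIncreaseMaxBufferCount(20.0, minimalSize, bufferSize));
    }
}
```

With the default sizes the threshold works out to roughly one sixth, a little under 17%.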

.Please make sure the buffer size is greater than your block size.

The default block size is 64KB, so almost all data blocks are 64KB plus a small delta, where the delta
depends on the size of the last Cell. If we set `hbase.server.allocator.buffer.size`=64KB,
then each block is allocated as two ByteBuffers: one 64KB DirectByteBuffer and one HeapByteBuffer for the delta bytes.
Ideally, a data block should be allocated as a single ByteBuffer: it has a simpler data structure, faster access,
and lower heap usage. Also, if a block is a composite of multiple ByteBuffers, validating the checksum
requires a temporary heap copy (see link:https://issues.apache.org/jira/browse/HBASE-21917[HBASE-21917]),
whereas with a single ByteBuffer we can speed up the checksum by calling Hadoop's native checksum library, which is faster.
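
To make the difference concrete, the sketch below plans an allocation for a block slightly larger than 64KB under both buffer sizes. It is a simplified model of the behavior described in this section, with hypothetical names and a made-up delta of 100 bytes; it is not the allocator's actual code.

```java
// Simplified model of pooled vs. heap allocation for one block (hypothetical names).
final class BlockAllocationSketch {
    // Returns {pooled buffer count, bytes left for a heap buffer}.
    static int[] plan(int blockBytes, int bufferSize) {
        int minimalAllocateSize = bufferSize / 6; // default threshold per the text
        int pooled = 0;
        int remaining = blockBytes;
        // Take pooled buffers while the remainder is large enough to justify one.
        while (remaining >= minimalAllocateSize) {
            pooled++;
            remaining -= Math.min(remaining, bufferSize);
        }
        return new int[] {pooled, remaining};
    }

    public static void main(String[] args) {
        int block = 64 * 1024 + 100; // 64 KB block plus an example 100-byte delta
        int[] with64k = plan(block, 64 * 1024);
        int[] with65k = plan(block, 65 * 1024);
        System.out.println("64KB buffers: pooled=" + with64k[0] + ", heap bytes=" + with64k[1]);
        System.out.println("65KB buffers: pooled=" + with65k[0] + ", heap bytes=" + with65k[1]);
    }
}
```

With 64KB buffers the example block needs one pooled buffer plus a 100-byte heap buffer; with 65KB buffers it fits in a single pooled buffer.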

Please also see: link:https://issues.apache.org/jira/browse/HBASE-22483[HBASE-22483]

Don't forget to increase your _HBASE_OFFHEAPSIZE_ accordingly. See <<hbase.offheapsize>>.

[[regionserver.offheap.writepath]]
== Offheap write-path

In hbase-2.x, link:https://issues.apache.org/jira/browse/HBASE-15179[HBASE-15179] made the HBase write path work off-heap. By default, the MemStores in
HBase have always used MemStore Local Allocation Buffers (MSLABs) to avoid memory fragmentation; an MSLAB creates bigger fixed-size chunks, and the
MemStore's Cell data is copied into these MSLAB chunks. These chunks can also be pooled and, from hbase-2.x on, the MSLAB pool is ON by default.
Write off-heaping makes use of the MSLAB pool. It creates MSLAB chunks as direct ByteBuffers and pools them.

`hbase.regionserver.offheap.global.memstore.size` is the configuration key that controls the amount of off-heap data. Its value is the number of megabytes
of off-heap memory that the MSLAB should use (e.g. `25` results in 25MB of off-heap memory). Be sure to increase _HBASE_OFFHEAPSIZE_, which sets the JVM's
MaxDirectMemorySize property (see <<hbase.offheapsize>> for more on _HBASE_OFFHEAPSIZE_). The default value of
`hbase.regionserver.offheap.global.memstore.size` is 0, which means the MSLAB uses onheap, not offheap, chunks by default.

`hbase.hregion.memstore.mslab.chunksize` controls the size of each off-heap chunk. The default is `2097152` (2MB).

When a Cell is added to a MemStore, the bytes for that Cell are copied into these off-heap buffers (if `hbase.regionserver.offheap.global.memstore.size` is non-zero)
and a Cell POJO refers to this memory area. This can greatly reduce the on-heap occupancy of the MemStores and the total heap utilization of RegionServers
under a write-heavy workload. On-heap and off-heap memory utilization are tracked at multiple levels to implement low-level and high-level memory management.
The decision to flush a MemStore considers both the on-heap and off-heap usage of that MemStore. At the Region level, we sum the on-heap and off-heap usage and
compare the total against the region flush size (128MB by default). Globally, the on-heap occupancy of all MemStores is tracked, as is the off-heap size. When any of
these sizes breaches the lower mark (`hbase.regionserver.global.memstore.size.lower.limit`) or the maximum size (`hbase.regionserver.global.memstore.size`), all
regions are selected for forced flushes.
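
The Region-level part of this decision can be sketched as follows. The names are hypothetical, and whether the real comparison is strict or inclusive is an assumption here; the sketch only shows that on-heap and off-heap usage are summed before being compared with the flush size.

```java
// Illustrative Region-level flush check (hypothetical names; inclusive comparison is an assumption).
final class RegionFlushSketch {
    // 128 MB: the default region flush size mentioned in the text.
    static final long DEFAULT_FLUSH_SIZE_BYTES = 128L * 1024 * 1024;

    // Sum on-heap and off-heap MemStore usage and compare with the flush size.
    static boolean shouldFlushRegion(long onHeapBytes, long offHeapBytes, long flushSizeBytes) {
        return onHeapBytes + offHeapBytes >= flushSizeBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        // Small on-heap footprint (Cell objects) plus large off-heap data (MSLAB chunks).
        System.out.println(shouldFlushRegion(10 * mb, 120 * mb, DEFAULT_FLUSH_SIZE_BYTES));
        System.out.println(shouldFlushRegion(10 * mb, 100 * mb, DEFAULT_FLUSH_SIZE_BYTES));
    }
}
```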