src/main/asciidoc/_chapters/architecture.adoc

   1 ////
   2 /**
   3  *
   4  * Licensed to the Apache Software Foundation (ASF) under one
   5  * or more contributor license agreements.  See the NOTICE file
   6  * distributed with this work for additional information
   7  * regarding copyright ownership.  The ASF licenses this file
   8  * to you under the Apache License, Version 2.0 (the
   9  * "License"); you may not use this file except in compliance
  10  * with the License.  You may obtain a copy of the License at
  11  *
  12  *     http://www.apache.org/licenses/LICENSE-2.0
  13  *
  14  * Unless required by applicable law or agreed to in writing, software
  15  * distributed under the License is distributed on an "AS IS" BASIS,
  16  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  17  * See the License for the specific language governing permissions and
  18  * limitations under the License.
  19  */
  20 ////
  21
  22 = Architecture
  23 :doctype: book
  24 :numbered:
  25 :toc: left
  26 :icons: font
  27 :experimental:
  28 :toc: left
  29 :source-language: java
  30
  31 [[arch.overview]]
  32 == Overview
  33
  34 [[arch.overview.nosql]]
  35 === NoSQL?
  36
  37 HBase is a type of "NoSQL" database.
  38 "NoSQL" is a general term meaning that the database isn't an RDBMS which supports SQL as its primary access language, but there are many types of NoSQL databases: BerkeleyDB is an example of a local NoSQL database, whereas HBase is very much a distributed database.
  39 Technically speaking, HBase is really more a "Data Store" than "Data Base" because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages, etc.
  40
  41 However, HBase has many features which supports both linear and modular scaling.
  42 HBase clusters expand by adding RegionServers that are hosted on commodity class servers.
  43 If a cluster expands from 10 to 20 RegionServers, for example, it doubles both in terms of storage and as well as processing capacity.
  44 An RDBMS can scale well, but only up to a point - specifically, the size of a single database
  45 server - and for the best performance requires specialized hardware and storage devices.
  46 HBase features of note are:
  47
  48 * Strongly consistent reads/writes:  HBase is not an "eventually consistent" DataStore.
  49   This makes it very suitable for tasks such as high-speed counter aggregation.
  50 * Automatic sharding: HBase tables are distributed on the cluster via regions, and regions are automatically split and re-distributed as your data grows.
  51 * Automatic RegionServer failover
  52 * Hadoop/HDFS Integration: HBase supports HDFS out of the box as its distributed file system.
  53 * MapReduce: HBase supports massively parallelized processing via MapReduce for using HBase as both source and sink.
  54 * Java Client API: HBase supports an easy to use Java API for programmatic access.
  55 * Thrift/REST API: HBase also supports Thrift and REST for non-Java front-ends.
  56 * Block Cache and Bloom Filters: HBase supports a Block Cache and Bloom Filters for high volume query optimization.
  57 * Operational Management: HBase provides build-in web-pages for operational insight as well as JMX metrics.
  58
  59 [[arch.overview.when]]
  60 === When Should I Use HBase?
  61
  62 HBase isn't suitable for every problem.
  63
  64 First, make sure you have enough data.
  65 If you have hundreds of millions or billions of rows, then HBase is a good candidate.
  66 If you only have a few thousand/million rows, then using a traditional RDBMS might be a better choice due to the fact that all of your data might wind up on a single node (or two) and the rest of the cluster may be sitting idle.
  67
  68 Second, make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns, secondary indexes, transactions, advanced query languages, etc.)  An application built against an RDBMS cannot be "ported" to HBase by simply changing a JDBC driver, for example.
  69 Consider moving from an RDBMS to HBase as a complete redesign as opposed to a port.
  70
  71 Third, make sure you have enough hardware.
  72 Even HDFS doesn't do well with anything less than 5 DataNodes (due to things such as HDFS block replication which has a default of 3), plus a NameNode.
  73
  74 HBase can run quite well stand-alone on a laptop - but this should be considered a development configuration only.
  75
  76 [[arch.overview.hbasehdfs]]
  77 === What Is The Difference Between HBase and Hadoop/HDFS?
  78
  79 link:https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html[HDFS] is a distributed file system that is well suited for the storage of large files.
  80 Its documentation states that it is not, however, a general purpose file system, and does not provide fast individual record lookups in files.
  81 HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables.
  82 This can sometimes be a point of conceptual confusion.
  83 HBase internally puts your data in indexed "StoreFiles" that exist on HDFS for high-speed lookups.
  84 See the <<datamodel>> and the rest of this chapter for more information on how HBase achieves its goals.
  85
  86 [[arch.catalog]]
  87 == Catalog Tables
  88
  89 The catalog table `hbase:meta` exists as an HBase table and is filtered out of the HBase shell's `list` command, but is in fact a table just like any other.
  90
  91 [[arch.catalog.meta]]
  92 === hbase:meta
  93
  94 The `hbase:meta` table (previously called `.META.`) keeps a list of all regions in the system, and the location of `hbase:meta` is stored in ZooKeeper.
  95
  96 The `hbase:meta` table structure is as follows:
  97
  98 .Key
  99
 100 * Region key of the format (`[table],[region start key],[region id]`)
 101
 102 .Values
 103
 104 * `info:regioninfo` (serialized link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HRegionInfo.html[HRegionInfo] instance for this region)
 105 * `info:server` (server:port of the RegionServer containing this region)
 106 * `info:serverstartcode` (start-time of the RegionServer process containing this region)
 107
 108 When a table is in the process of splitting, two other columns will be created, called `info:splitA` and `info:splitB`.
 109 These columns represent the two daughter regions.
 110 The values for these columns are also serialized HRegionInfo instances.
 111 After the region has been split, eventually this row will be deleted.
 112
 113 .Note on HRegionInfo
 114 [NOTE]
 115 ====
 116 The empty key is used to denote table start and table end.
 117 A region with an empty start key is the first region in a table.
 118 If a region has both an empty start and an empty end key, it is the only region in the table
 119 ====
 120
 121 In the (hopefully unlikely) event that programmatic processing of catalog metadata
 122 is required, see the link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/RegionInfo.html#parseFrom-byte:A-[RegionInfo.parseFrom] utility.
 123
 124 [[arch.catalog.startup]]
 125 === Startup Sequencing
 126
 127 First, the location of `hbase:meta` is looked up in ZooKeeper.
 128 Next, `hbase:meta` is updated with server and startcode values.
 129
 130 For information on region-RegionServer assignment, see <<regions.arch.assignment>>.
 131
 132 [[architecture.client]]
 133 == Client
 134
 135 The HBase client finds the RegionServers that are serving the particular row range of interest.
 136 It does this by querying the `hbase:meta` table.
 137 See <<arch.catalog.meta>> for details.
 138 After locating the required region(s), the client contacts the RegionServer serving that region, rather than going through the master, and issues the read or write request.
 139 This information is cached in the client so that subsequent requests need not go through the lookup process.
 140 Should a region be reassigned either by the master load balancer or because a RegionServer has died, the client will requery the catalog tables to determine the new location of the user region.
 141
 142 See <<master.runtime>> for more information about the impact of the Master on HBase Client communication.
 143
 144 Administrative functions are done via an instance of link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Admin.html[Admin]
 145
 146 [[client.connections]]
 147 === Cluster Connections
 148
 149 The API changed in HBase 1.0. For connection configuration information, see <<client_dependencies>>.
 150
 151 ==== API as of HBase 1.0.0
 152
 153 It's been cleaned up and users are returned Interfaces to work against rather than particular types.
 154 In HBase 1.0, obtain a `Connection` object from `ConnectionFactory` and thereafter, get from it instances of `Table`, `Admin`, and `RegionLocator` on an as-need basis.
 155 When done, close the obtained instances.
 156 Finally, be sure to cleanup your `Connection` instance before exiting.
 157 `Connections` are heavyweight objects but thread-safe so you can create one for your application and keep the instance around.
 158 `Table`, `Admin` and `RegionLocator` instances are lightweight.
 159 Create as you go and then let go as soon as you are done by closing them.
 160 See the link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/package-summary.html[Client Package Javadoc Description] for example usage of the new HBase 1.0 API.
 161
 162 ==== API before HBase 1.0.0
 163
 164 Instances of `HTable` are the way to interact with an HBase cluster earlier than 1.0.0. _link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html[Table] instances are not thread-safe_. Only one thread can use an instance of Table at any given time.
 165 When creating Table instances, it is advisable to use the same link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HBaseConfiguration[HBaseConfiguration] instance.
 166 This will ensure sharing of ZooKeeper and socket instances to the RegionServers which is usually what you want.
 167 For example, this is preferred:
 168
 169 [source,java]
 170 ----
 171 HBaseConfiguration conf = HBaseConfiguration.create();
 172 HTable table1 = new HTable(conf, "myTable");
 173 HTable table2 = new HTable(conf, "myTable");
 174 ----
 175
 176 as opposed to this:
 177
 178 [source,java]
 179 ----
 180 HBaseConfiguration conf1 = HBaseConfiguration.create();
 181 HTable table1 = new HTable(conf1, "myTable");
 182 HBaseConfiguration conf2 = HBaseConfiguration.create();
 183 HTable table2 = new HTable(conf2, "myTable");
 184 ----
 185
 186 For more information about how connections are handled in the HBase client, see link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/ConnectionFactory.html[ConnectionFactory].
 187
 188 [[client.connection.pooling]]
 189 ===== Connection Pooling
 190
 191 For applications which require high-end multithreaded access (e.g., web-servers or application servers that may serve many application threads in a single JVM), you can pre-create a `Connection`, as shown in the following example:
 192
 193 .Pre-Creating a `Connection`
 194 ====
 195 [source,java]
 196 ----
 197 // Create a connection to the cluster.
 198 Configuration conf = HBaseConfiguration.create();
 199 try (Connection connection = ConnectionFactory.createConnection(conf);
 200      Table table = connection.getTable(TableName.valueOf(tablename))) {
 201   // use table as needed, the table returned is lightweight
 202 }
 203 ----
 204 ====
 205
 206 .`HTablePool` is Deprecated
 207 [WARNING]
 208 ====
 209 Previous versions of this guide discussed `HTablePool`, which was deprecated in HBase 0.94, 0.95, and 0.96, and removed in 0.98.1, by link:https://issues.apache.org/jira/browse/HBASE-6580[HBASE-6580], or `HConnection`, which is deprecated in HBase 1.0 by `Connection`.
 210 Please use link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Connection.html[Connection] instead.
 211 ====
 212
 213 [[client.writebuffer]]
 214 === WriteBuffer and Batch Methods
 215
 216 In HBase 1.0 and later, link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/HTable.html[HTable] is deprecated in favor of link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html[Table]. `Table` does not use autoflush. To do buffered writes, use the BufferedMutator class.
 217
 218 In HBase 2.0 and later, link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/client/HTable.html[HTable] does not use BufferedMutator to execute the ``Put`` operation. Refer to link:https://issues.apache.org/jira/browse/HBASE-18500[HBASE-18500] for more information.
 219
 220 For additional information on write durability, review the link:/acid-semantics.html[ACID semantics] page.
 221
 222 For fine-grained control of batching of ``Put``s or ``Delete``s, see the link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#batch-java.util.List-java.lang.Object:A-[batch] methods on Table.
 223
 224 [[async.client]]
 225 === Asynchronous Client ===
 226
 227 It is a new API introduced in HBase 2.0 which aims to provide the ability to access HBase asynchronously.
 228
 229 You can obtain an `AsyncConnection` from `ConnectionFactory`, and then get a asynchronous table instance from it to access HBase. When done, close the `AsyncConnection` instance(usually when your program exits).
 230
 231 For the asynchronous table, most methods have the same meaning with the old `Table` interface, expect that the return value is wrapped with a CompletableFuture usually. We do not have any buffer here so there is no close method for asynchronous table, you do not need to close it. And it is thread safe.
 232
 233 There are several differences for scan:
 234
 235 * There is still a `getScanner` method which returns a `ResultScanner`. You can use it in the old way and it works like the old `ClientAsyncPrefetchScanner`.
 236 * There is a `scanAll` method which will return all the results at once. It aims to provide a simpler way for small scans which you want to get the whole results at once usually.
 237 * The Observer Pattern. There is a scan method which accepts a `ScanResultConsumer` as a parameter. It will pass the results to the consumer.
 238
 239 Notice that `AsyncTable` interface is templatized. The template parameter specifies the type of `ScanResultConsumerBase` used by scans, which means the observer style scan APIs are different. The two types of scan consumers are - `ScanResultConsumer` and `AdvancedScanResultConsumer`.
 240
 241 `ScanResultConsumer` needs a separate thread pool which is used to execute the callbacks registered to the returned CompletableFuture. Because the use of separate thread pool frees up RPC threads, callbacks are free to do anything. Use this if the callbacks are not quick, or when in doubt.
 242
 243 `AdvancedScanResultConsumer` executes callbacks inside the framework thread. It is not allowed to do time consuming work in the callbacks else it will likely block the framework threads and cause very bad performance impact. As its name, it is designed for advanced users who want to write high performance code. See `org.apache.hadoop.hbase.client.example.HttpProxyExample` for how to write fully asynchronous code with it.
 244
 245 [[async.admin]]
 246 === Asynchronous Admin ===
 247
 248 You can obtain an `AsyncConnection` from `ConnectionFactory`, and then get a `AsyncAdmin` instance from it to access HBase. Notice that there are two `getAdmin` methods to get a `AsyncAdmin` instance. One method has one extra thread pool parameter which is used to execute callbacks. It is designed for normal users. Another method doesn't need a thread pool and all the callbacks are executed inside the framework thread so it is not allowed to do time consuming works in the callbacks. It is designed for advanced users.
 249
 250 The default `getAdmin` methods will return a `AsyncAdmin` instance which use default configs. If you want to customize some configs, you can use `getAdminBuilder` methods to get a `AsyncAdminBuilder` for creating `AsyncAdmin` instance. Users are free to only set the configs they care about to create a new `AsyncAdmin` instance.
 251
 252 For the `AsyncAdmin` interface, most methods have the same meaning with the old `Admin` interface, expect that the return value is wrapped with a CompletableFuture usually.
 253
 254 For most admin operations, when the returned CompletableFuture is done, it means the admin operation has also been done. But for compact operation, it only means the compact request was sent to HBase and may need some time to finish the compact operation. For `rollWALWriter` method, it only means the rollWALWriter request was sent to the region server and may need some time to finish the `rollWALWriter` operation.
 255
 256 For region name, we only accept `byte[]` as the parameter type and it may be a full region name or a encoded region name. For server name, we only accept `ServerName` as the parameter type. For table name, we only accept `TableName` as the parameter type. For `list*` operations, we only accept `Pattern` as the parameter type if you want to do regex matching.
 257
 258 [[client.external]]
 259 === External Clients
 260
 261 Information on non-Java clients and custom protocols is covered in <<external_apis>>
 262
 263 [[client.masterregistry]]
 264
 265 === Master Registry (new as of 2.3.0)
 266 Client internally works with a _connection registry_ to fetch the metadata needed by connections.
 267 This connection registry implementation is responsible for fetching the following metadata.
 268
 269 * Active master address
 270 * Current meta region(s) locations
 271 * Cluster ID (unique to this cluster)
 272
 273 This information is needed as a part of various client operations like connection set up, scans,
 274 gets, etc. Traditionally, the connection registry implementation has been based on ZooKeeper as the
 275 source of truth and clients fetched the metadata directly from the ZooKeeper quorum. HBase 2.3.0
 276 introduces a new connection registry implementation based on direct communication with the Masters.
 277 With this implementation, clients now fetch required metadata via master RPC end points instead of
 278 maintaining connections to ZooKeeper. This change was done for the following reasons.
 279
 280 * Reduce load on ZooKeeper since that is critical for cluster operation.
 281 * Holistic client timeout and retry configurations since the new registry brings all the client
 282 operations under HBase rpc framework.
 283 * Remove the ZooKeeper client dependency on HBase client library.
 284
 285 This means:
 286
 287 * At least a single active or stand by master is needed for cluster connection setup. Refer to
 288 <<master.runtime>> for more details.
 289 * Master can be in a critical path of read/write operations, especially if the client metadata cache
 290 is empty or stale.
 291 * There is higher connection load on the masters that before since the clients talk directly to
 292 HMasters instead of ZooKeeper ensemble`
 293
 294 To reduce hot-spotting on a single master, all the masters (active & stand-by) expose the needed
 295 service to fetch the connection metadata. This lets the client connect to any master (not just active).
 296 Both ZooKeeper-based and Master-based connection registry implementations are available in 2.3+. For
 297 2.3 and earlier, the ZooKeeper-based implementation remains the default configuration.
 298 The Master-based implementation becomes the default in 3.0.0.
 299
 300 Change the connection registry implementation by updating the value configured for
 301 `hbase.client.registry.impl`. To explicitly enable the ZooKeeper-based registry, use
 302
 303 [source, xml]
 304 <property>
 305     <name>hbase.client.registry.impl</name>
 306     <value>org.apache.hadoop.hbase.client.ZKConnectionRegistry</value>
 307  </property>
 308
 309 To explicitly enable the Master-based registry, use
 310
 311 [source, xml]
 312 <property>
 313     <name>hbase.client.registry.impl</name>
 314     <value>org.apache.hadoop.hbase.client.MasterRegistry</value>
 315  </property>
 316
 317 ==== MasterRegistry RPC hedging
 318
 319 MasterRegistry implements hedging of connection registry RPCs across active and stand-by masters.
 320 This lets the client make the same request to multiple servers and which ever responds first is
 321 returned back to the client immediately. This improves performance, especially when a subset of
 322 servers are under load. The hedging fan out size is configurable, meaning the number of requests
 323 that are hedged in a single attempt, using the configuration key
 324 _hbase.client.master_registry.hedged.fanout_ in the client configuration. It defaults to 2. With
 325 this default, the RPCs are tried in batches of 2. The hedging policy is still primitive and does not
 326 adapt to any sort of live rpc performance metrics.
 327
 328 ==== Additional Notes
 329
 330 * Clients hedge the requests in a randomized order to avoid hot-spotting a single master.
 331 * Cluster internal connections (masters &lt;-&gt; regionservers) still use ZooKeeper based connection
 332 registry.
 333 * Cluster internal state is still tracked in Zookeeper, hence ZK availability requirements are same
 334 as before.
 335 * Inter cluster replication still uses ZooKeeper based connection registry to simplify configuration
 336 management.
 337
 338 For more implementation details, please refer to the https://github.com/apache/hbase/tree/master/dev-support/design-docs[design doc] and
 339 https://issues.apache.org/jira/browse/HBASE-18095[HBASE-18095].
 340
 341 '''
 342 NOTE: (Advanced) In case of any issues with the master based registry, use the following
 343 configuration to fallback to the ZooKeeper based connection registry implementation.
 344 [source, xml]
 345 <property>
 346     <name>hbase.client.registry.impl</name>
 347     <value>org.apache.hadoop.hbase.client.ZKConnectionRegistry</value>
 348  </property>
 349
 350 [[client.filter]]
 351 == Client Request Filters
 352
 353 link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html[Get] and link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html[Scan] instances can be optionally configured with link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/Filter.html[filters] which are applied on the RegionServer.
 354
 355 Filters can be confusing because there are many different types, and it is best to approach them by understanding the groups of Filter functionality.
 356
 357 [[client.filter.structural]]
 358 === Structural
 359
 360 Structural Filters contain other Filters.
 361
 362 [[client.filter.structural.fl]]
 363 ==== FilterList
 364
 365 link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FilterList.html[FilterList] represents a list of Filters with a relationship of `FilterList.Operator.MUST_PASS_ALL` or `FilterList.Operator.MUST_PASS_ONE` between the Filters.
 366 The following example shows an 'or' between two Filters (checking for either 'my value' or 'my other value' on the same attribute).
 367
 368 [source,java]
 369 ----
 370 FilterList list = new FilterList(FilterList.Operator.MUST_PASS_ONE);
 371 SingleColumnValueFilter filter1 = new SingleColumnValueFilter(
 372   cf,
 373   column,
 374   CompareOperator.EQUAL,
 375   Bytes.toBytes("my value")
 376   );
 377 list.add(filter1);
 378 SingleColumnValueFilter filter2 = new SingleColumnValueFilter(
 379   cf,
 380   column,
 381   CompareOperator.EQUAL,
 382   Bytes.toBytes("my other value")
 383   );
 384 list.add(filter2);
 385 scan.setFilter(list);
 386 ----
 387
 388 [[client.filter.cv]]
 389 === Column Value
 390
 391 [[client.filter.cv.scvf]]
 392 ==== SingleColumnValueFilter
 393
 394 A SingleColumnValueFilter (see:
 395 https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/SingleColumnValueFilter.html)
 396 can be used to test column values for equivalence (`CompareOperaor.EQUAL`),
 397 inequality (`CompareOperaor.NOT_EQUAL`), or ranges (e.g., `CompareOperaor.GREATER`). The following is an
 398 example of testing equivalence of a column to a String value "my value"...
 399
 400 [source,java]
 401 ----
 402 SingleColumnValueFilter filter = new SingleColumnValueFilter(
 403   cf,
 404   column,
 405   CompareOperaor.EQUAL,
 406   Bytes.toBytes("my value")
 407   );
 408 scan.setFilter(filter);
 409 ----
 410
 411 [[client.filter.cv.cvf]]
 412 ==== ColumnValueFilter
 413
 414 Introduced in HBase-2.0.0 version as a complementation of SingleColumnValueFilter, ColumnValueFilter
 415 gets matched cell only, while SingleColumnValueFilter gets the entire row
 416 (has other columns and values) to which the matched cell belongs. Parameters of constructor of
 417 ColumnValueFilter are the same as SingleColumnValueFilter.
 418 [source,java]
 419 ----
 420 ColumnValueFilter filter = new ColumnValueFilter(
 421   cf,
 422   column,
 423   CompareOperaor.EQUAL,
 424   Bytes.toBytes("my value")
 425   );
 426 scan.setFilter(filter);
 427 ----
 428
 429 Note. For simple query like "equals to a family:qualifier:value", we highly recommend to use the
 430 following way instead of using SingleColumnValueFilter or ColumnValueFilter:
 431 [source,java]
 432 ----
 433 Scan scan = new Scan();
 434 scan.addColumn(Bytes.toBytes("family"), Bytes.toBytes("qualifier"));
 435 ValueFilter vf = new ValueFilter(CompareOperator.EQUAL,
 436   new BinaryComparator(Bytes.toBytes("value")));
 437 scan.setFilter(vf);
 438 ...
 439 ----
 440 This scan will restrict to the specified column 'family:qualifier', avoiding scan of unrelated
 441 families and columns, which has better performance, and `ValueFilter` is the condition used to do
 442 the value filtering.
 443
 444 But if query is much more complicated beyond this book, then please make your good choice case by case.
 445
 446 [[client.filter.cvp]]
 447 === Column Value Comparators
 448
 449 There are several Comparator classes in the Filter package that deserve special mention.
 450 These Comparators are used in concert with other Filters, such as <<client.filter.cv.scvf>>.
 451
 452 [[client.filter.cvp.rcs]]
 453 ==== RegexStringComparator
 454
 455 link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/RegexStringComparator.html[RegexStringComparator] supports regular expressions for value comparisons.
 456
 457 [source,java]
 458 ----
 459 RegexStringComparator comp = new RegexStringComparator("my.");   // any value that starts with 'my'
 460 SingleColumnValueFilter filter = new SingleColumnValueFilter(
 461   cf,
 462   column,
 463   CompareOperaor.EQUAL,
 464   comp
 465   );
 466 scan.setFilter(filter);
 467 ----
 468
 469 See the Oracle JavaDoc for link:http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html[supported RegEx patterns in Java].
 470
 471 [[client.filter.cvp.substringcomparator]]
 472 ==== SubstringComparator
 473
 474 link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/SubstringComparator.html[SubstringComparator] can be used to determine if a given substring exists in a value.
 475 The comparison is case-insensitive.
 476
 477 [source,java]
 478 ----
 479
 480 SubstringComparator comp = new SubstringComparator("y val");   // looking for 'my value'
 481 SingleColumnValueFilter filter = new SingleColumnValueFilter(
 482   cf,
 483   column,
 484   CompareOperaor.EQUAL,
 485   comp
 486   );
 487 scan.setFilter(filter);
 488 ----
 489
 490 [[client.filter.cvp.bfp]]
 491 ==== BinaryPrefixComparator
 492
 493 See link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/BinaryPrefixComparator.html[BinaryPrefixComparator].
 494
 495 [[client.filter.cvp.bc]]
 496 ==== BinaryComparator
 497
 498 See link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/BinaryComparator.html[BinaryComparator].
 499
 500 [[client.filter.cvp.bcc]]
 501 ==== BinaryComponentComparator
 502
 503 link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/BinaryComponentComparator.html[BinaryComponentComparator] can be used to compare specific value at specific location with in the cell value. The comparison can be done for both ascii and binary data.
 504
 505 [source,java]
 506 ----
 507 byte[] partialValue = Bytes.toBytes("partial_value");
 508     int partialValueOffset =
 509     Filter partialValueFilter = new ValueFilter(CompareFilter.CompareOp.GREATER,
 510             new BinaryComponentComparator(partialValue,partialValueOffset));
 511 ----
 512 See link:https://issues.apache.org/jira/browse/HBASE-22969[HBASE-22969] for other use cases and details.
 513
 514 [[client.filter.kvm]]
 515 === KeyValue Metadata
 516
 517 As HBase stores data internally as KeyValue pairs, KeyValue Metadata Filters evaluate the existence of keys (i.e., ColumnFamily:Column qualifiers) for a row, as opposed to values the previous section.
 518
 519 [[client.filter.kvm.ff]]
 520 ==== FamilyFilter
 521
 522 link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FamilyFilter.html[FamilyFilter] can be used to filter on the ColumnFamily.
 523 It is generally a better idea to select ColumnFamilies in the Scan than to do it with a Filter.
 524
 525 [[client.filter.kvm.qf]]
 526 ==== QualifierFilter
 527
 528 link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/QualifierFilter.html[QualifierFilter] can be used to filter based on Column (aka Qualifier) name.
 529
 530 [[client.filter.kvm.cpf]]
 531 ==== ColumnPrefixFilter
 532
 533 link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/ColumnPrefixFilter.html[ColumnPrefixFilter] can be used to filter based on the lead portion of Column (aka Qualifier) names.
 534
 535 A ColumnPrefixFilter seeks ahead to the first column matching the prefix in each row and for each involved column family.
 536 It can be used to efficiently get a subset of the columns in very wide rows.
 537
 538 Note: The same column qualifier can be used in different column families.
 539 This filter returns all matching columns.
 540
 541 Example: Find all columns in a row and family that start with "abc"
 542
 543 [source,java]
 544 ----
 545 Table t = ...;
 546 byte[] row = ...;
 547 byte[] family = ...;
 548 byte[] prefix = Bytes.toBytes("abc");
 549 Scan scan = new Scan(row, row); // (optional) limit to one row
 550 scan.addFamily(family); // (optional) limit to one family
 551 Filter f = new ColumnPrefixFilter(prefix);
 552 scan.setFilter(f);
 553 scan.setBatch(10); // set this if there could be many columns returned
 554 ResultScanner rs = t.getScanner(scan);
 555 for (Result r = rs.next(); r != null; r = rs.next()) {
 556   for (Cell cell : result.listCells()) {
 557     // each cell represents a column
 558   }
 559 }
 560 rs.close();
 561 ----
 562
 563 [[client.filter.kvm.mcpf]]
 564 ==== MultipleColumnPrefixFilter
 565
 566 link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/MultipleColumnPrefixFilter.html[MultipleColumnPrefixFilter] behaves like ColumnPrefixFilter but allows specifying multiple prefixes.
 567
 568 Like ColumnPrefixFilter, MultipleColumnPrefixFilter efficiently seeks ahead to the first column matching the lowest prefix and also seeks past ranges of columns between prefixes.
 569 It can be used to efficiently get discontinuous sets of columns from very wide rows.
 570
 571 Example: Find all columns in a row and family that start with "abc" or "xyz"
 572
 573 [source,java]
 574 ----
 575 Table t = ...;
 576 byte[] row = ...;
 577 byte[] family = ...;
 578 byte[][] prefixes = new byte[][] {Bytes.toBytes("abc"), Bytes.toBytes("xyz")};
 579 Scan scan = new Scan(row, row); // (optional) limit to one row
 580 scan.addFamily(family); // (optional) limit to one family
 581 Filter f = new MultipleColumnPrefixFilter(prefixes);
 582 scan.setFilter(f);
 583 scan.setBatch(10); // set this if there could be many columns returned
 584 ResultScanner rs = t.getScanner(scan);
 585 for (Result r = rs.next(); r != null; r = rs.next()) {
 586   for (Cell cell : result.listCells()) {
 587     // each cell represents a column
 588   }
 589 }
 590 rs.close();
 591 ----
 592
 593 [[client.filter.kvm.crf]]
 594 ==== ColumnRangeFilter
 595
 596 A link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/ColumnRangeFilter.html[ColumnRangeFilter] allows efficient intra row scanning.
 597
 598 A ColumnRangeFilter can seek ahead to the first matching column for each involved column family.
 599 It can be used to efficiently get a 'slice' of the columns of a very wide row.
 600 i.e.
 601 you have a million columns in a row but you only want to look at columns bbbb-bbdd.
 602
 603 Note: The same column qualifier can be used in different column families.
 604 This filter returns all matching columns.
 605
 606 Example: Find all columns in a row and family between "bbbb" (inclusive) and "bbdd" (inclusive)
 607
 608 [source,java]
 609 ----
 610 Table t = ...;
 611 byte[] row = ...;
 612 byte[] family = ...;
 613 byte[] startColumn = Bytes.toBytes("bbbb");
 614 byte[] endColumn = Bytes.toBytes("bbdd");
 615 Scan scan = new Scan(row, row); // (optional) limit to one row
 616 scan.addFamily(family); // (optional) limit to one family
 617 Filter f = new ColumnRangeFilter(startColumn, true, endColumn, true);
 618 scan.setFilter(f);
 619 scan.setBatch(10); // set this if there could be many columns returned
 620 ResultScanner rs = t.getScanner(scan);
 621 for (Result r = rs.next(); r != null; r = rs.next()) {
 622   for (Cell cell : result.listCells()) {
 623     // each cell represents a column
 624   }
 625 }
 626 rs.close();
 627 ----
 628
 629 Note:  Introduced in HBase 0.92
 630
 631 [[client.filter.row]]
 632 === RowKey
 633
 634 [[client.filter.row.rf]]
 635 ==== RowFilter
 636
 637 It is generally a better idea to use the startRow/stopRow methods on Scan for row selection, however link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/RowFilter.html[RowFilter] can also be used.
 638
 639 You can supplement a scan (both bounded and unbounded) with RowFilter constructed from link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/BinaryComponentComparator.html[BinaryComponentComparator] for further filtering out or filtering in rows. See link:https://issues.apache.org/jira/browse/HBASE-22969[HBASE-22969] for use cases and other details.
 640
 641 [[client.filter.utility]]
 642 === Utility
 643
 644 [[client.filter.utility.fkof]]
 645 ==== FirstKeyOnlyFilter
 646
 647 This is primarily used for rowcount jobs.
 648 See link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FirstKeyOnlyFilter.html[FirstKeyOnlyFilter].
 649
 650 [[architecture.master]]
 651 == Master
 652
 653 `HMaster` is the implementation of the Master Server.
 654 The Master server is responsible for monitoring all RegionServer instances in the cluster, and is the interface for all metadata changes.
 655 In a distributed cluster, the Master typically runs on the <<arch.hdfs.nn>>.
 656 J Mohamed Zahoor goes into some more detail on the Master Architecture in this blog posting, link:http://blog.zahoor.in/2012/08/hbase-hmaster-architecture/[HBase HMaster Architecture ].
 657
 658 [[master.startup]]
 659 === Startup Behavior
 660
 661 If run in a multi-Master environment, all Masters compete to run the cluster.
 662 If the active Master loses its lease in ZooKeeper (or the Master shuts down), then the remaining Masters jostle to take over the Master role.
 663
 664 [[master.runtime]]
 665 === Runtime Impact
 666
 667 A common dist-list question involves what happens to an HBase cluster when the Master goes down. This information has changed starting 3.0.0.
 668
 669 ==== Up until releases 2.x.y
 670 Because the HBase client talks directly to the RegionServers, the cluster can still function in a "steady state". Additionally, per <<arch.catalog>>, `hbase:meta` exists as an HBase table and is not resident in the Master.
 671 However, the Master controls critical functions such as RegionServer failover and completing region splits.
 672 So while the cluster can still run for a short time without the Master, the Master should be restarted as soon as possible.
 673
 674 ==== Staring release 3.0.0
 675 As mentioned in section <<client.masterregistry>>, the default connection registry for clients is now based on master rpc end points. Hence the requirements for
 676 masters' uptime are even tighter starting this release.
 677
 678 - At least one active or stand by master is needed for a connection set up, unlike before when all the clients needed was a ZooKeeper ensemble.
 679 - Master is now in critical path for read/write operations. For example, if the meta region bounces off to a different region server, clients
 680 need master to fetch the new locations. Earlier this was done by fetching this information directly from ZooKeeper.
 681 - Masters will now have higher connection load than before. So, the server side configuration might need adjustment depending on the load.
 682
 683 Overall, the master uptime requirements, when this feature is enabled, are even higher for the client operations to go through.
 684
 685 [[master.api]]
 686 === Interface
 687
 688 The methods exposed by `HMasterInterface` are primarily metadata-oriented methods:
 689
 690 * Table (createTable, modifyTable, removeTable, enable, disable)
 691 * ColumnFamily (addColumn, modifyColumn, removeColumn)
 692 * Region (move, assign, unassign) For example, when the `Admin` method `disableTable` is invoked, it is serviced by the Master server.
 693
 694 [[master.processes]]
 695 === Processes
 696
 697 The Master runs several background threads:
 698
 699 [[master.processes.loadbalancer]]
 700 ==== LoadBalancer
 701
 702 Periodically, and when there are no regions in transition, a load balancer will run and move regions around to balance the cluster's load.
 703 See <<balancer_config>> for configuring this property.
 704
 705 See <<regions.arch.assignment>> for more information on region assignment.
 706
 707 [[master.processes.catalog]]
 708 ==== CatalogJanitor
 709
 710 Periodically checks and cleans up the `hbase:meta` table.
 711 See <<arch.catalog.meta>> for more information on the meta table.
 712
 713 [[master.wal]]
 714 === MasterProcWAL
 715
 716 _MasterProcWAL is replaced in hbase-2.3.0 by an alternate Procedure Store implementation; see
 717 <<in-master-procedure-store-region>>. This section pertains to hbase-2.0.0 through hbase-2.2.x_
 718
 719 HMaster records administrative operations and their running states, such as the handling of a crashed server,
 720 table creation, and other DDLs, into a Procedure Store. The Procedure Store WALs are stored under the
 721 MasterProcWALs directory. The Master WALs are not like RegionServer WALs. Keeping up the Master WAL allows
 722 us to run a state machine that is resilient across Master failures. For example, if a HMaster was in the
 723 middle of creating a table encounters an issue and fails, the next active HMaster can take up where
 724 the previous left off and carry the operation to completion. Since hbase-2.0.0, a
 725 new AssignmentManager (A.K.A AMv2) was introduced and the HMaster handles region assignment
 726 operations, server crash processing, balancing, etc., all via AMv2 persisting all state and
 727 transitions into MasterProcWALs rather than up into ZooKeeper, as we do in hbase-1.x.
 728
 729 See <<amv2>> (and <<pv2>> for its basis) if you would like to learn more about the new
 730 AssignmentManager.
 731
 732 [[master.wal.conf]]
 733 ==== Configurations for MasterProcWAL
 734 Here are the list of configurations that effect MasterProcWAL operation.
 735 You should not have to change your defaults.
 736
 737 [[hbase.procedure.store.wal.periodic.roll.msec]]
 738 *`hbase.procedure.store.wal.periodic.roll.msec`*::
 739 +
 740 .Description
 741 Frequency of generating a new WAL
 742 +
 743 .Default
 744 `1h (3600000 in msec)`
 745
 746 [[hbase.procedure.store.wal.roll.threshold]]
 747 *`hbase.procedure.store.wal.roll.threshold`*::
 748 +
 749 .Description
 750 Threshold in size before the WAL rolls. Every time the WAL reaches this size or the above period, 1 hour, passes since last log roll, the HMaster will generate a new WAL.
 751 +
 752 .Default
 753 `32MB (33554432 in byte)`
 754
 755 [[hbase.procedure.store.wal.warn.threshold]]
 756 *`hbase.procedure.store.wal.warn.threshold`*::
 757 +
 758 .Description
 759 If the number of WALs goes beyond this threshold, the following message should appear in the HMaster log with WARN level when rolling.
 760
 761  procedure WALs count=xx above the warning threshold 64. check running procedures to see if something is stuck.
 762
 763 +
 764 .Default
 765 `64`
 766
 767 [[hbase.procedure.store.wal.max.retries.before.roll]]
 768 *`hbase.procedure.store.wal.max.retries.before.roll`*::
 769 +
 770 .Description
 771 Max number of retry when syncing slots (records) to its underlying storage, such as HDFS. Every attempt, the following message should appear in the HMaster log.
 772
 773  unable to sync slots, retry=xx
 774
 775 +
 776 .Default
 777 `3`
 778
 779 [[hbase.procedure.store.wal.sync.failure.roll.max]]
 780 *`hbase.procedure.store.wal.sync.failure.roll.max`*::
 781 +
 782 .Description
 783 After the above 3 retrials, the log is rolled and the retry count is reset to 0, thereon a new set of retrial starts. This configuration controls the max number of attempts of log rolling upon sync failure. That is, HMaster is allowed to fail to sync 9 times in total. Once it exceeds, the following log should appear in the HMaster log.
 784
 785  Sync slots after log roll failed, abort.
 786 +
 787 .Default
 788 `3`
 789
 790 [[regionserver.arch]]
 791 == RegionServer
 792
 793 `HRegionServer` is the RegionServer implementation.
 794 It is responsible for serving and managing regions.
 795 In a distributed cluster, a RegionServer runs on a <<arch.hdfs.dn>>.
 796
 797 [[regionserver.arch.api]]
 798 === Interface
 799
 800 The methods exposed by `HRegionRegionInterface` contain both data-oriented and region-maintenance methods:
 801
 802 * Data (get, put, delete, next, etc.)
 803 * Region (splitRegion, compactRegion, etc.) For example, when the `Admin` method `majorCompact` is invoked on a table, the client is actually iterating through all regions for the specified table and requesting a major compaction directly to each region.
 804
 805 [[regionserver.arch.processes]]
 806 === Processes
 807
 808 The RegionServer runs a variety of background threads:
 809
 810 [[regionserver.arch.processes.compactsplit]]
 811 ==== CompactSplitThread
 812
 813 Checks for splits and handle minor compactions.
 814
 815 [[regionserver.arch.processes.majorcompact]]
 816 ==== MajorCompactionChecker
 817
 818 Checks for major compactions.
 819
 820 [[regionserver.arch.processes.memstore]]
 821 ==== MemStoreFlusher
 822
 823 Periodically flushes in-memory writes in the MemStore to StoreFiles.
 824
 825 [[regionserver.arch.processes.log]]
 826 ==== LogRoller
 827
 828 Periodically checks the RegionServer's WAL.
 829
 830 === Coprocessors
 831
 832 Coprocessors were added in 0.92.
 833 There is a thorough link:https://blogs.apache.org/hbase/entry/coprocessor_introduction[Blog Overview of CoProcessors] posted.
 834 Documentation will eventually move to this reference guide, but the blog is the most current information available at this time.
 835
 836 [[block.cache]]
 837 === Block Cache
 838
 839 HBase provides two different BlockCache implementations to cache data read from HDFS:
 840 the default on-heap `LruBlockCache` and the `BucketCache`, which is (usually) off-heap.
 841 This section discusses benefits and drawbacks of each implementation, how to choose the
 842 appropriate option, and configuration options for each.
 843
 844 .Block Cache Reporting: UI
 845 [NOTE]
 846 ====
 847 See the RegionServer UI for detail on caching deploy.
 848 See configurations, sizings, current usage, time-in-the-cache, and even detail on block counts and types.
 849 ====
 850
 851 ==== Cache Choices
 852
 853 `LruBlockCache` is the original implementation, and is entirely within the Java heap.
 854 `BucketCache` is optional and mainly intended for keeping block cache data off-heap, although `BucketCache` can also be a file-backed cache.
 855  In file-backed we can either use it in the file mode or the mmaped mode.
 856  We also have pmem mode where the bucket cache resides on the persistent memory device.
 857
 858 When you enable BucketCache, you are enabling a two tier caching system. We used to describe the
 859 tiers as "L1" and "L2" but have deprecated this terminology as of hbase-2.0.0. The "L1" cache referred to an
 860 instance of LruBlockCache and "L2" to an off-heap BucketCache. Instead, when BucketCache is enabled,
 861 all DATA blocks are kept in the BucketCache tier and meta blocks -- INDEX and BLOOM blocks -- are on-heap in the `LruBlockCache`.
 862 Management of these two tiers and the policy that dictates how blocks move between them is done by `CombinedBlockCache`.
 863
 864 [[cache.configurations]]
 865 ==== General Cache Configurations
 866
 867 Apart from the cache implementation itself, you can set some general configuration options to control how the cache performs.
 868 See link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/io/hfile/CacheConfig.html[CacheConfig].
 869 After setting any of these options, restart or rolling restart your cluster for the configuration to take effect.
 870 Check logs for errors or unexpected behavior.
 871
 872 See also <<blockcache.prefetch>>, which discusses a new option introduced in link:https://issues.apache.org/jira/browse/HBASE-9857[HBASE-9857].
 873
 874 [[block.cache.design]]
 875 ==== LruBlockCache Design
 876
 877 The LruBlockCache is an LRU cache that contains three levels of block priority to allow for scan-resistance and in-memory ColumnFamilies:
 878
 879 * Single access priority: The first time a block is loaded from HDFS it normally has this priority and it will be part of the first group to be considered during evictions.
 880   The advantage is that scanned blocks are more likely to get evicted than blocks that are getting more usage.
 881 * Multi access priority: If a block in the previous priority group is accessed again, it upgrades to this priority.
 882   It is thus part of the second group considered during evictions.
 883 * In-memory access priority: If the block's family was configured to be "in-memory", it will be part of this priority disregarding the number of times it was accessed.
 884   Catalog tables are configured like this.
 885   This group is the last one considered during evictions.
 886 +
 887 To mark a column family as in-memory, call
 888
 889 [source,java]
 890 ----
 891 HColumnDescriptor.setInMemory(true);
 892 ----
 893
 894 if creating a table from java, or set `IN_MEMORY => true` when creating or altering a table in the shell: e.g.
 895
 896 [source]
 897 ----
 898 hbase(main):003:0> create  't', {NAME => 'f', IN_MEMORY => 'true'}
 899 ----
 900
 901 For more information, see the LruBlockCache source
 902
 903 [[block.cache.usage]]
 904 ==== LruBlockCache Usage
 905
 906 Block caching is enabled by default for all the user tables which means that any read operation will load the LRU cache.
 907 This might be good for a large number of use cases, but further tunings are usually required in order to achieve better performance.
 908 An important concept is the link:http://en.wikipedia.org/wiki/Working_set_size[working set size], or WSS, which is: "the amount of memory needed to compute the answer to a problem". For a website, this would be the data that's needed to answer the queries over a short amount of time.
 909
 910 The way to calculate how much memory is available in HBase for caching is:
 911
 912 [source]
 913 ----
 914 number of region servers * heap size * hfile.block.cache.size * 0.99
 915 ----
 916
 917 The default value for the block cache is 0.4 which represents 40% of the available heap.
 918 The last value (99%) is the default acceptable loading factor in the LRU cache after which eviction is started.
 919 The reason it is included in this equation is that it would be unrealistic to say that it is possible to use 100% of the available memory since this would make the process blocking from the point where it loads new blocks.
 920 Here are some examples:
 921
 922 * One region server with the heap size set to 1 GB and the default block cache size will have 405 MB of block cache available.
 923 * 20 region servers with the heap size set to 8 GB and a default block cache size will have 63.3 GB of block cache.
 924 * 100 region servers with the heap size set to 24 GB and a block cache size of 0.5 will have about 1.16 TB of block cache.
 925
 926 Your data is not the only resident of the block cache.
 927 Here are others that you may have to take into account:
 928
 929 Catalog Tables::
 930   The `hbase:meta` table is forced into the block cache and have the in-memory priority which means that they are harder to evict.
 931
 932 NOTE: The hbase:meta tables can occupy a few MBs depending on the number of regions.
 933
 934 HFiles Indexes::
 935   An _HFile_ is the file format that HBase uses to store data in HDFS.
 936   It contains a multi-layered index which allows HBase to seek the data without having to read the whole file.
 937   The size of those indexes is a factor of the block size (64KB by default), the size of your keys and the amount of data you are storing.
 938   For big data sets it's not unusual to see numbers around 1GB per region server, although not all of it will be in cache because the LRU will evict indexes that aren't used.
 939
 940 Keys::
 941   The values that are stored are only half the picture, since each value is stored along with its keys (row key, family qualifier, and timestamp). See <<keysize>>.
 942
 943 Bloom Filters::
 944   Just like the HFile indexes, those data structures (when enabled) are stored in the LRU.
 945
 946 Currently the recommended way to measure HFile indexes and bloom filters sizes is to look at the region server web UI and checkout the relevant metrics.
 947 For keys, sampling can be done by using the HFile command line tool and look for the average key size metric.
 948 Since HBase 0.98.3, you can view details on BlockCache stats and metrics in a special Block Cache section in the UI.
 949
 950 It's generally bad to use block caching when the WSS doesn't fit in memory.
 951 This is the case when you have for example 40GB available across all your region servers' block caches but you need to process 1TB of data.
 952 One of the reasons is that the churn generated by the evictions will trigger more garbage collections unnecessarily.
 953 Here are two use cases:
 954
 955 * Fully random reading pattern: This is a case where you almost never access the same row twice within a short amount of time such that the chance of hitting a cached block is close to 0.
 956   Setting block caching on such a table is a waste of memory and CPU cycles, more so that it will generate more garbage to pick up by the JVM.
 957   For more information on monitoring GC, see <<trouble.log.gc>>.
 958 * Mapping a table: In a typical MapReduce job that takes a table in input, every row will be read only once so there's no need to put them into the block cache.
 959   The Scan object has the option of turning this off via the setCacheBlocks method (set it to false). You can still keep block caching turned on on this table if you need fast random read access.
 960   An example would be counting the number of rows in a table that serves live traffic, caching every block of that table would create massive churn and would surely evict data that's currently in use.
 961
 962 [[data.blocks.in.fscache]]
 963 ===== Caching META blocks only (DATA blocks in fscache)
 964
 965 An interesting setup is one where we cache META blocks only and we read DATA blocks in on each access.
 966 If the DATA blocks fit inside fscache, this alternative may make sense when access is completely random across a very large dataset.
 967 To enable this setup, alter your table and for each column family set `BLOCKCACHE => 'false'`.
 968 You are 'disabling' the BlockCache for this column family only. You can never disable the caching of META blocks.
 969 Since link:https://issues.apache.org/jira/browse/HBASE-4683[HBASE-4683 Always cache index and bloom blocks], we will cache META blocks even if the BlockCache is disabled.
 970
 971 [[offheap.blockcache]]
 972 ==== Off-heap Block Cache
 973
 974 [[enable.bucketcache]]
 975 ===== How to Enable BucketCache
 976
 977 The usual deployment of BucketCache is via a managing class that sets up two caching tiers:
 978 an on-heap cache implemented by LruBlockCache and a second  cache implemented with BucketCache.
 979 The managing class is link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/io/hfile/CombinedBlockCache.html[CombinedBlockCache] by default.
 980 The previous link describes the caching 'policy' implemented by CombinedBlockCache.
 981 In short, it works by keeping meta blocks -- INDEX and BLOOM in the on-heap LruBlockCache tier -- and DATA blocks are kept in the BucketCache tier.
 982
 983 ====
 984 Pre-hbase-2.0.0 versions::
 985 Fetching will always be slower when fetching from BucketCache in pre-hbase-2.0.0,
 986 as compared to the native on-heap LruBlockCache. However, latencies tend to be less
 987 erratic across time, because there is less garbage collection when you use BucketCache since it is managing BlockCache allocations, not the GC.
 988 If the BucketCache is deployed in off-heap mode, this memory is not managed by the GC at all.
 989 This is why you'd use BucketCache in pre-2.0.0, so your latencies are less erratic,
 990 to mitigate GCs and heap fragmentation, and so you can safely use more memory.
 991 See Nick Dimiduk's link:http://www.n10k.com/blog/blockcache-101/[BlockCache 101] for comparisons running on-heap vs off-heap tests.
 992 Also see link:https://people.apache.org/~stack/bc/[Comparing BlockCache Deploys] which finds that if your dataset fits inside your LruBlockCache deploy, use it otherwise if you are experiencing cache churn (or you want your cache to exist beyond the vagaries of java GC), use BucketCache.
 993 +
 994 In pre-2.0.0,
 995 one can configure the BucketCache so it receives the `victim` of an LruBlockCache eviction.
 996 All Data and index blocks are cached in L1 first. When eviction happens from L1, the blocks (or `victims`) will get moved to L2.
 997 Set `cacheDataInL1` via `(HColumnDescriptor.setCacheDataInL1(true)` or in the shell, creating or amending column families setting `CACHE_DATA_IN_L1` to true: e.g.
 998 [source]
 999 ----
1000 hbase(main):003:0> create 't', {NAME => 't', CONFIGURATION => {CACHE_DATA_IN_L1 => 'true'}}
1001 ----
1002
1003 hbase-2.0.0+ versions::
1004 HBASE-11425 changed the HBase read path so it could hold the read-data off-heap avoiding copying of cached data on to the java heap.
1005 See <<regionserver.offheap.readpath>>. In hbase-2.0.0, off-heap latencies approach those of on-heap cache latencies with the added
1006 benefit of NOT provoking GC.
1007 +
1008 From HBase 2.0.0 onwards, the notions of L1 and L2 have been deprecated. When BucketCache is turned on, the DATA blocks will always go to BucketCache and INDEX/BLOOM blocks go to on heap LRUBlockCache. `cacheDataInL1` support has been removed.
1009 ====
1010
1011 [[bc.deloy.modes]]
1012 ====== BucketCache Deploy Modes
1013 The BucketCache Block Cache can be deployed _offheap_, _file_ or _mmaped_ file mode.
1014
1015 You set which via the `hbase.bucketcache.ioengine` setting.
1016 Setting it to `offheap` will have BucketCache make its allocations off-heap, and an ioengine setting of `file:PATH_TO_FILE` will direct BucketCache to use file caching (Useful in particular if you have some fast I/O attached to the box such as SSDs). From 2.0.0, it is possible to have more than one file backing the BucketCache. This is very useful especially when the Cache size requirement is high. For multiple backing files, configure ioengine as `files:PATH_TO_FILE1,PATH_TO_FILE2,PATH_TO_FILE3`. BucketCache can be configured to use an mmapped file also. Configure ioengine as `mmap:PATH_TO_FILE` for this.
1017
1018 It is possible to deploy a tiered setup where we bypass the CombinedBlockCache policy and have BucketCache working as a strict L2 cache to the L1 LruBlockCache.
1019 For such a setup, set `hbase.bucketcache.combinedcache.enabled` to `false`.
1020 In this mode, on eviction from L1, blocks go to L2.
1021 When a block is cached, it is cached first in L1.
1022 When we go to look for a cached block, we look first in L1 and if none found, then search L2.
1023 Let us call this deploy format, _Raw L1+L2_.
1024 NOTE: This L1+L2 mode is removed from 2.0.0. When BucketCache is used, it will be strictly the DATA cache and the LruBlockCache will cache INDEX/META blocks.
1025
1026 Other BucketCache configs include: specifying a location to persist cache to across restarts, how many threads to use writing the cache, etc.
1027 See the link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/io/hfile/CacheConfig.html[CacheConfig.html] class for configuration options and descriptions.
1028
1029 To check it enabled, look for the log line describing cache setup; it will detail how BucketCache has been deployed.
1030 Also see the UI. It will detail the cache tiering and their configuration.
1031
1032 [[bc.example]]
1033 ====== BucketCache Example Configuration
1034 This sample provides a configuration for a 4 GB off-heap BucketCache with a 1 GB on-heap cache.
1035
1036 Configuration is performed on the RegionServer.
1037
1038 Setting `hbase.bucketcache.ioengine` and `hbase.bucketcache.size` > 0 enables `CombinedBlockCache`.
1039 Let us presume that the RegionServer has been set to run with a 5G heap: i.e. `HBASE_HEAPSIZE=5g`.
1040
1041
1042 . First, edit the RegionServer's _hbase-env.sh_ and set `HBASE_OFFHEAPSIZE` to a value greater than the off-heap size wanted, in this case, 4 GB (expressed as 4G). Let's set it to 5G.
1043   That'll be 4G for our off-heap cache and 1G for any other uses of off-heap memory (there are other users of off-heap memory other than BlockCache; e.g.
1044   DFSClient in RegionServer can make use of off-heap memory). See <<direct.memory>>.
1045 +
1046 [source]
1047 ----
1048 HBASE_OFFHEAPSIZE=5G
1049 ----
1050
1051 . Next, add the following configuration to the RegionServer's _hbase-site.xml_.
1052 +
1053 [source,xml]
1054 ----
1055 <property>
1056   <name>hbase.bucketcache.ioengine</name>
1057   <value>offheap</value>
1058 </property>
1059 <property>
1060   <name>hfile.block.cache.size</name>
1061   <value>0.2</value>
1062 </property>
1063 <property>
1064   <name>hbase.bucketcache.size</name>
1065   <value>4196</value>
1066 </property>
1067 ----
1068
1069 . Restart or rolling restart your cluster, and check the logs for any issues.
1070
1071
1072 In the above, we set the BucketCache to be 4G.
1073 We configured the on-heap LruBlockCache have 20% (0.2) of the RegionServer's heap size (0.2 * 5G = 1G). In other words, you configure the L1 LruBlockCache as you would normally (as if there were no L2 cache present).
1074
1075 link:https://issues.apache.org/jira/browse/HBASE-10641[HBASE-10641] introduced the ability to configure multiple sizes for the buckets of the BucketCache, in HBase 0.98 and newer.
1076 To configurable multiple bucket sizes, configure the new property `hbase.bucketcache.bucket.sizes` to a comma-separated list of block sizes, ordered from smallest to largest, with no spaces.
1077 The goal is to optimize the bucket sizes based on your data access patterns.
1078 The following example configures buckets of size 4096 and 8192.
1079
1080 [source,xml]
1081 ----
1082 <property>
1083   <name>hbase.bucketcache.bucket.sizes</name>
1084   <value>4096,8192</value>
1085 </property>
1086 ----
1087
1088 [[direct.memory]]
1089 .Direct Memory Usage In HBase
1090 [NOTE]
1091 ====
1092 The default maximum direct memory varies by JVM.
1093 Traditionally it is 64M or some relation to allocated heap size (-Xmx) or no limit at all (JDK7 apparently). HBase servers use direct memory, in particular short-circuit reading (See <<perf.hdfs.configs.localread>>), the hosted DFSClient will allocate direct memory buffers. How much the DFSClient uses is not easy to quantify; it is the number of open HFiles * `hbase.dfs.client.read.shortcircuit.buffer.size` where `hbase.dfs.client.read.shortcircuit.buffer.size` is set to 128k in HBase -- see _hbase-default.xml_ default configurations.
1094 If you do off-heap block caching, you'll be making use of direct memory.
1095 The RPCServer uses a ByteBuffer pool. From 2.0.0, these buffers are off-heap ByteBuffers.
1096 Starting your JVM, make sure the `-XX:MaxDirectMemorySize` setting in _conf/hbase-env.sh_ considers off-heap BlockCache (`hbase.bucketcache.size`), DFSClient usage, RPC side ByteBufferPool max size. This has to be bit higher than sum of off heap BlockCache size and max ByteBufferPool size. Allocating an extra of 1-2 GB for the max direct memory size has worked in tests. Direct memory, which is part of the Java process heap, is separate from the object heap allocated by -Xmx.
1097 The value allocated by `MaxDirectMemorySize` must not exceed physical RAM, and is likely to be less than the total available RAM due to other memory requirements and system constraints.
1098
1099 You can see how much memory -- on-heap and off-heap/direct -- a RegionServer is configured to use and how much it is using at any one time by looking at the _Server Metrics: Memory_ tab in the UI.
1100 It can also be gotten via JMX.
1101 In particular the direct memory currently used by the server can be found on the `java.nio.type=BufferPool,name=direct` bean.
1102 Terracotta has a link:http://terracotta.org/documentation/4.0/bigmemorygo/configuration/storage-options[good write up] on using off-heap memory in Java.
1103 It is for their product BigMemory but a lot of the issues noted apply in general to any attempt at going off-heap. Check it out.
1104 ====
1105
1106 .hbase.bucketcache.percentage.in.combinedcache
1107 [NOTE]
1108 ====
1109 This is a pre-HBase 1.0 configuration removed because it was confusing.
1110 It was a float that you would set to some value between 0.0 and 1.0.
1111 Its default was 0.9.
1112 If the deploy was using CombinedBlockCache, then the LruBlockCache L1 size was calculated to be `(1 - hbase.bucketcache.percentage.in.combinedcache) * size-of-bucketcache`  and the BucketCache size was `hbase.bucketcache.percentage.in.combinedcache * size-of-bucket-cache`.
1113 where size-of-bucket-cache itself is EITHER the value of the configuration `hbase.bucketcache.size` IF it was specified as Megabytes OR `hbase.bucketcache.size` * `-XX:MaxDirectMemorySize` if `hbase.bucketcache.size` is between 0 and 1.0.
1114
1115 In 1.0, it should be more straight-forward.
1116 Onheap LruBlockCache size is set as a fraction of java heap using `hfile.block.cache.size setting` (not the best name) and BucketCache is set as above in absolute Megabytes.
1117 ====
1118
1119 ==== Compressed BlockCache
1120
1121 link:https://issues.apache.org/jira/browse/HBASE-11331[HBASE-11331] introduced lazy BlockCache decompression, more simply referred to as compressed BlockCache.
1122 When compressed BlockCache is enabled data and encoded data blocks are cached in the BlockCache in their on-disk format, rather than being decompressed and decrypted before caching.
1123
1124 For a RegionServer hosting more data than can fit into cache, enabling this feature with SNAPPY compression has been shown to result in 50% increase in throughput and 30% improvement in mean latency while, increasing garbage collection by 80% and increasing overall CPU load by 2%. See HBASE-11331 for more details about how performance was measured and achieved.
1125 For a RegionServer hosting data that can comfortably fit into cache, or if your workload is sensitive to extra CPU or garbage-collection load, you may receive less benefit.
1126
1127 The compressed BlockCache is disabled by default. To enable it, set `hbase.block.data.cachecompressed` to `true` in _hbase-site.xml_ on all RegionServers.
1128
1129 [[regionserver_splitting_implementation]]
1130 === RegionServer Splitting Implementation
1131
1132 As write requests are handled by the region server, they accumulate in an in-memory storage system called the _memstore_. Once the memstore fills, its content are written to disk as additional store files. This event is called a _memstore flush_. As store files accumulate, the RegionServer will <<compaction,compact>> them into fewer, larger files. After each flush or compaction finishes, the amount of data stored in the region has changed. The RegionServer consults the region split policy to determine if the region has grown too large or should be split for another policy-specific reason. A region split request is enqueued if the policy recommends it.
1133
1134 Logically, the process of splitting a region is simple. We find a suitable point in the keyspace of the region where we should divide the region in half, then split the region's data into two new regions at that point. The details of the process however are not simple.  When a split happens, the newly created _daughter regions_ do not rewrite all the data into new files immediately. Instead, they create small files similar to symbolic link files, named link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/io/Reference.html[Reference files], which point to either the top or bottom part of the parent store file according to the split point. The reference file is used just like a regular data file, but only half of the records are considered. The region can only be split if there are no more references to the immutable data files of the parent region. Those reference files are cleaned gradually by compactions, so that the region will stop referring to its parents files, and can be split further.
1135
1136 Although splitting the region is a local decision made by the RegionServer, the split process itself must coordinate with many actors. The RegionServer notifies the Master before and after the split, updates the `.META.` table so that clients can discover the new daughter regions, and rearranges the directory structure and data files in HDFS. Splitting is a multi-task process. To enable rollback in case of an error, the RegionServer keeps an in-memory journal about the execution state. The steps taken by the RegionServer to execute the split are illustrated in <<regionserver_split_process_image>>. Each step is labeled with its step number. Actions from RegionServers or Master are shown in red, while actions from the clients are shown in green.
1137
1138 [[regionserver_split_process_image]]
1139 .RegionServer Split Process
1140 image::region_split_process.png[Region Split Process]
1141
1142 . The RegionServer decides locally to split the region, and prepares the split. *THE SPLIT TRANSACTION IS STARTED.* As a first step, the RegionServer acquires a shared read lock on the table to prevent schema modifications during the splitting process. Then it creates a znode in zookeeper under `/hbase/region-in-transition/region-name`, and sets the znode's state to `SPLITTING`.
1143 . The Master learns about this znode, since it has a watcher for the parent `region-in-transition` znode.
1144 . The RegionServer creates a sub-directory named `.splits` under the parent’s `region` directory in HDFS.
1145 . The RegionServer closes the parent region and marks the region as offline in its local data structures. *THE SPLITTING REGION IS NOW OFFLINE.* At this point, client requests coming to the parent region will throw `NotServingRegionException`. The client will retry with some backoff. The closing region is flushed.
1146 . The RegionServer creates region directories under the `.splits` directory, for daughter
1147 regions A and B, and creates necessary data structures. Then it splits the store files,
1148 in the sense that it creates two Reference files per store file in the parent region.
1149 Those reference files will point to the parent region's files.
1150 . The RegionServer creates the actual region directory in HDFS, and moves the reference files for each daughter.
1151 . The RegionServer sends a `Put` request to the `.META.` table, to set the parent as offline in the `.META.` table and add information about daughter regions. At this point, there won’t be individual entries in `.META.` for the daughters. Clients will see that the parent region is split if they scan `.META.`, but won’t know about the daughters until they appear in `.META.`. Also, if this `Put` to `.META`. succeeds, the parent will be effectively split. If the RegionServer fails before this RPC succeeds, Master and the next Region Server opening the region will clean dirty state about the region split. After the `.META.` update, though, the region split will be rolled-forward by Master.
1152 . The RegionServer opens daughters A and B in parallel.
1153 . The RegionServer adds the daughters A and B to `.META.`, together with information that it hosts the regions. *THE SPLIT REGIONS (DAUGHTERS WITH REFERENCES TO PARENT) ARE NOW ONLINE.* After this point, clients can discover the new regions and issue requests to them. Clients cache the `.META.` entries locally, but when they make requests to the RegionServer or `.META.`, their caches will be invalidated, and they will learn about the new regions from `.META.`.
1154 . The RegionServer updates znode `/hbase/region-in-transition/region-name` in ZooKeeper to state `SPLIT`, so that the master can learn about it. The balancer can freely re-assign the daughter regions to other region servers if necessary. *THE SPLIT TRANSACTION IS NOW FINISHED.*
1155 . After the split, `.META.` and HDFS will still contain references to the parent region. Those references will be removed when compactions in daughter regions rewrite the data files. Garbage collection tasks in the master periodically check whether the daughter regions still refer to the parent region's files. If not, the parent region will be removed.
1156
1157 [[wal]]
1158 === Write Ahead Log (WAL)
1159
1160 [[purpose.wal]]
1161 ==== Purpose
1162
1163 The _Write Ahead Log (WAL)_ records all changes to data in HBase, to file-based storage.
1164 Under normal operations, the WAL is not needed because data changes move from the MemStore to StoreFiles.
1165 However, if a RegionServer crashes or becomes unavailable before the MemStore is flushed, the WAL ensures that the changes to the data can be replayed.
1166 If writing to the WAL fails, the entire operation to modify the data fails.
1167
1168 HBase uses an implementation of the link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/wal/WAL.html[WAL] interface.
1169 Usually, there is only one instance of a WAL per RegionServer. An exception
1170 is the RegionServer that is carrying _hbase:meta_; the _meta_ table gets its
1171 own dedicated WAL.
1172 The RegionServer records Puts and Deletes to its WAL, before recording them
1173 these Mutations <<store.memstore>> for the affected <<store>>.
1174
1175 .The HLog
1176 [NOTE]
1177 ====
1178 Prior to 2.0, the interface for WALs in HBase was named `HLog`.
1179 In 0.94, HLog was the name of the implementation of the WAL.
1180 You will likely find references to the HLog in documentation tailored to these older versions.
1181 ====
1182
1183 The WAL resides in HDFS in the _/hbase/WALs/_ directory, with subdirectories per region.
1184
1185 For more general information about the concept of write ahead logs, see the Wikipedia
1186 link:http://en.wikipedia.org/wiki/Write-ahead_logging[Write-Ahead Log] article.
1187
1188
1189 [[wal.providers]]
1190 ==== WAL Providers
1191 In HBase, there are a number of WAL implementations (or 'Providers'). Each is known
1192 by a short name label (that unfortunately is not always descriptive). You set the provider in
1193 _hbase-site.xml_ passing the WAL provider short-name as the value on the
1194 _hbase.wal.provider_ property (Set the provider for _hbase:meta_ using the
1195 _hbase.wal.meta_provider_ property, otherwise it uses the same provider configured
1196 by _hbase.wal.provider_).
1197
1198  * _asyncfs_: The *default*. New since hbase-2.0.0 (HBASE-15536, HBASE-14790). This _AsyncFSWAL_ provider, as it identifies itself in RegionServer logs, is built on a new non-blocking dfsclient implementation. It is currently resident in the hbase codebase but intent is to move it back up into HDFS itself. WALs edits are written concurrently ("fan-out") style to each of the WAL-block replicas on each DataNode rather than in a chained pipeline as the default client does. Latencies should be better. See link:https://www.slideshare.net/HBaseCon/apache-hbase-improvements-and-practices-at-xiaomi[Apache HBase Improvements and Practices at Xiaomi] at slide 14 onward for more detail on implementation.
1199  * _filesystem_: This was the default in hbase-1.x releases. It is built on the blocking _DFSClient_ and writes to replicas in classic _DFSCLient_ pipeline mode. In logs it identifies as _FSHLog_ or _FSHLogProvider_.
1200  * _multiwal_: This provider is made of multiple instances of _asyncfs_ or  _filesystem_. See the next section for more on _multiwal_.
1201
1202 Look for the lines like the below in the RegionServer log to see which provider is in place (The below shows the default AsyncFSWALProvider):
1203
1204 ----
1205 2018-04-02 13:22:37,983 INFO  [regionserver/ve0528:16020] wal.WALFactory: Instantiating WALProvider of type class org.apache.hadoop.hbase.wal.AsyncFSWALProvider
1206 ----
1207
1208 NOTE: As the _AsyncFSWAL_ hacks into the internal of DFSClient implementation, it will be easily broken by upgrading the hadoop dependencies, even for a simple patch release. So if you do not specify the wal provider explicitly, we will first try to use the _asyncfs_, if failed, we will fall back to use _filesystem_. And notice that this may not always work, so if you still have problem starting HBase due to the problem of starting _AsyncFSWAL_, please specify _filesystem_ explicitly in the config file.
1209
1210 NOTE: EC support has been added to hadoop-3.x, and it is incompatible with WAL as the EC output stream does not support hflush/hsync. In order to create a non-EC file in an EC directory, we need to use the new builder-based create API for _FileSystem_, but it is only introduced in hadoop-2.9+ and for HBase we still need to support hadoop-2.7.x. So please do not enable EC for the WAL directory until we find a way to deal with it.
1211
1212 ==== MultiWAL
1213 With a single WAL per RegionServer, the RegionServer must write to the WAL serially, because HDFS files must be sequential. This causes the WAL to be a performance bottleneck.
1214
1215 HBase 1.0 introduces support MultiWal in link:https://issues.apache.org/jira/browse/HBASE-5699[HBASE-5699]. MultiWAL allows a RegionServer to write multiple WAL streams in parallel, by using multiple pipelines in the underlying HDFS instance, which increases total throughput during writes. This parallelization is done by partitioning incoming edits by their Region. Thus, the current implementation will not help with increasing the throughput to a single Region.
1216
1217 RegionServers using the original WAL implementation and those using the MultiWAL implementation can each handle recovery of either set of WALs, so a zero-downtime configuration update is possible through a rolling restart.
1218
1219 .Configure MultiWAL
1220 To configure MultiWAL for a RegionServer, set the value of the property `hbase.wal.provider` to `multiwal` by pasting in the following XML:
1221
1222 [source,xml]
1223 ----
1224 <property>
1225   <name>hbase.wal.provider</name>
1226   <value>multiwal</value>
1227 </property>
1228 ----
1229
1230 Restart the RegionServer for the changes to take effect.
1231
1232 To disable MultiWAL for a RegionServer, unset the property and restart the RegionServer.
1233
1234
1235 [[wal_flush]]
1236 ==== WAL Flushing
1237
1238 TODO (describe).
1239
1240 ==== WAL Splitting
1241
1242 A RegionServer serves many regions.
1243 All of the regions in a region server share the same active WAL file.
1244 Each edit in the WAL file includes information about which region it belongs to.
1245 When a region is opened, the edits in the WAL file which belong to that region need to be replayed.
1246 Therefore, edits in the WAL file must be grouped by region so that particular sets can be replayed to regenerate the data in a particular region.
1247 The process of grouping the WAL edits by region is called _log splitting_.
1248 It is a critical process for recovering data if a region server fails.
1249
1250 Log splitting is done by the HMaster during cluster start-up or by the ServerShutdownHandler as a region server shuts down.
1251 So that consistency is guaranteed, affected regions are unavailable until data is restored.
1252 All WAL edits need to be recovered and replayed before a given region can become available again.
1253 As a result, regions affected by log splitting are unavailable until the process completes.
1254
1255 .Procedure: Log Splitting, Step by Step
1256 . The _/hbase/WALs/<host>,<port>,<startcode>_ directory is renamed.
1257 +
1258 Renaming the directory is important because a RegionServer may still be up and accepting requests even if the HMaster thinks it is down.
1259 If the RegionServer does not respond immediately and does not heartbeat its ZooKeeper session, the HMaster may interpret this as a RegionServer failure.
1260 Renaming the logs directory ensures that existing, valid WAL files which are still in use by an active but busy RegionServer are not written to by accident.
1261 +
1262 The new directory is named according to the following pattern:
1263 +
1264 ----
1265 /hbase/WALs/<host>,<port>,<startcode>-splitting
1266 ----
1267 +
1268 An example of such a renamed directory might look like the following:
1269 +
1270 ----
1271 /hbase/WALs/srv.example.com,60020,1254173957298-splitting
1272 ----
1273
1274 . Each log file is split, one at a time.
1275 +
1276 The log splitter reads the log file one edit entry at a time and puts each edit entry into the buffer corresponding to the edit's region.
1277 At the same time, the splitter starts several writer threads.
1278 Writer threads pick up a corresponding buffer and write the edit entries in the buffer to a temporary recovered edit file.
1279 The temporary edit file is stored to disk with the following naming pattern:
1280 +
1281 ----
1282 /hbase/<table_name>/<region_id>/recovered.edits/.temp
1283 ----
1284 +
1285 This file is used to store all the edits in the WAL log for this region.
1286 After log splitting completes, the _.temp_ file is renamed to the sequence ID of the first log written to the file.
1287 +
1288 To determine whether all edits have been written, the sequence ID is compared to the sequence of the last edit that was written to the HFile.
1289 If the sequence of the last edit is greater than or equal to the sequence ID included in the file name, it is clear that all writes from the edit file have been completed.
1290
1291 . After log splitting is complete, each affected region is assigned to a RegionServer.
1292 +
1293 When the region is opened, the _recovered.edits_ folder is checked for recovered edits files.
1294 If any such files are present, they are replayed by reading the edits and saving them to the MemStore.
1295 After all edit files are replayed, the contents of the MemStore are written to disk (HFile) and the edit files are deleted.
1296
1297
1298 ===== Handling of Errors During Log Splitting
1299
1300 If you set the `hbase.hlog.split.skip.errors` option to `true`, errors are treated as follows:
1301
1302 * Any error encountered during splitting will be logged.
1303 * The problematic WAL log will be moved into the _.corrupt_ directory under the hbase `rootdir`,
1304 * Processing of the WAL will continue
1305
1306 If the `hbase.hlog.split.skip.errors` option is set to `false`, the default, the exception will be propagated and the split will be logged as failed.
1307 See link:https://issues.apache.org/jira/browse/HBASE-2958[HBASE-2958 When
1308 hbase.hlog.split.skip.errors is set to false, we fail the split but that's it].
1309 We need to do more than just fail split if this flag is set.
1310
1311 ====== How EOFExceptions are treated when splitting a crashed RegionServer's WALs
1312
1313 If an EOFException occurs while splitting logs, the split proceeds even when `hbase.hlog.split.skip.errors` is set to `false`.
1314 An EOFException while reading the last log in the set of files to split is likely, because the RegionServer was likely in the process of writing a record at the time of a crash.
1315 For background, see link:https://issues.apache.org/jira/browse/HBASE-2643[HBASE-2643 Figure how to deal with eof splitting logs]
1316
1317 ===== Performance Improvements during Log Splitting
1318
1319 WAL log splitting and recovery can be resource intensive and take a long time, depending on the number of RegionServers involved in the crash and the size of the regions. <<distributed.log.splitting>> was developed to improve performance during log splitting.
1320
1321 [[distributed.log.splitting]]
1322 .Enabling or Disabling Distributed Log Splitting
1323
1324 Distributed log processing is enabled by default since HBase 0.92.
1325 The setting is controlled by the `hbase.master.distributed.log.splitting` property, which can be set to `true` or `false`, but defaults to `true`.
1326
1327 ==== WAL splitting based on procedureV2
1328 After HBASE-20610, we introduce a new way to do WAL splitting coordination by procedureV2 framework. This can simplify the process of WAL splitting and no need to connect zookeeper any more.
1329
1330 [[background]]
1331 .Background
1332 Currently, splitting WAL processes are coordinated by zookeeper. Each region server are trying to grab tasks from zookeeper. And the burden becomes heavier when the number of region server increase.
1333
1334 [[implementation.on.master.side]]
1335 .Implementation on Master side
1336 During ServerCrashProcedure, SplitWALManager will create one SplitWALProcedure for each WAL file which should be split. Then each SplitWALProcedure will spawn a SplitWalRemoteProcedure to send the request to region server.
1337 SplitWALProcedure is a StateMachineProcedure and here is the state transfer diagram.
1338
1339 .WAL_splitting_coordination
1340 image::WAL_splitting.png[]
1341
1342 [[implementation.on.region.server.side]]
1343 .Implementation on Region Server side
1344 Region Server will receive a SplitWALCallable and execute it, which is much more straightforward than before. It will return null if success and return exception if there is any error.
1345
1346 [[preformance]]
1347 .Performance
1348 According to tests on a cluster which has 5 regionserver and 1 master.
1349 procedureV2 coordinated WAL splitting has a better performance than ZK coordinated WAL splitting no master when restarting the whole cluster or one region server crashing.
1350
1351 [[enable.this.feature]]
1352 .Enable this feature
1353 To enable this feature, first we should ensure our package of HBase already contains these code. If not, please upgrade the package of HBase cluster without any configuration change first.
1354 Then change configuration 'hbase.split.wal.zk.coordinated' to false. Rolling upgrade the master with new configuration. Now WAL splitting are handled by our new implementation.
1355 But region server are still trying to grab tasks from zookeeper, we can rolling upgrade the region servers with the new configuration to stop that.
1356
1357 * steps as follows:
1358 ** Upgrade whole cluster to get the new Implementation.
1359 ** Upgrade Master with new configuration 'hbase.split.wal.zk.coordinated'=false.
1360 ** Upgrade region server to stop grab tasks from zookeeper.
1361
1362 [[wal.compression]]
1363 ==== WAL Compression ====
1364
1365 The content of the WAL can be compressed using LRU Dictionary compression.
1366 This can be used to speed up WAL replication to different datanodes.
1367 The dictionary can store up to 2^15^ elements; eviction starts after this number is exceeded.
1368
1369 To enable WAL compression, set the `hbase.regionserver.wal.enablecompression` property to `true`.
1370 The default value for this property is `false`.
1371 By default, WAL tag compression is turned on when WAL compression is enabled.
1372 You can turn off WAL tag compression by setting the `hbase.regionserver.wal.tags.enablecompression` property to 'false'.
1373
1374 A possible downside to WAL compression is that we lose more data from the last block in the WAL if it is ill-terminated
1375 mid-write. If entries in this last block were added with new dictionary entries but we failed persist the amended
1376 dictionary because of an abrupt termination, a read of this last block may not be able to resolve last-written entries.
1377
1378 [[wal.durability]]
1379 ==== Durability
1380 It is possible to set _durability_ on each Mutation or on a Table basis. Options include:
1381
1382  * _SKIP_WAL_: Do not write Mutations to the WAL (See the next section, <<wal.disable>>).
1383  * _ASYNC_WAL_: Write the WAL asynchronously; do not hold-up clients waiting on the sync of their write to the filesystem but return immediately. The edit becomes visible. Meanwhile, in the background, the Mutation will be flushed to the WAL at some time later. This option currently may lose data. See HBASE-16689.
1384  * _SYNC_WAL_: The *default*. Each edit is sync'd to HDFS before we return success to the client.
1385  * _FSYNC_WAL_: Each edit is fsync'd to HDFS and the filesystem before we return success to the client.
1386
1387 Do not confuse the _ASYNC_WAL_ option on a Mutation or Table with the _AsyncFSWAL_ writer; they are distinct
1388 options unfortunately closely named
1389
1390 [[arch.custom.wal.dir]]
1391 ==== Custom WAL Directory
1392 HBASE-17437 added support for specifying a WAL directory outside the HBase root directory or even in a different FileSystem since 1.3.3/2.0+. Some FileSystems (such as Amazon S3) don’t support append or consistent writes, in such scenario WAL directory needs to be configured in a different FileSystem to avoid loss of writes.
1393
1394 Following configurations are added to accomplish this:
1395
1396 . `hbase.wal.dir`
1397 +
1398 This defines where the root WAL directory is located, could be on a different FileSystem than the root directory. WAL directory can not be set to a subdirectory of the root directory. The default value of this is the root directory if unset.
1399
1400 . `hbase.rootdir.perms`
1401 +
1402 Configures FileSystem permissions to set on the root directory. This is '700' by default.
1403
1404 . `hbase.wal.dir.perms`
1405 +
1406 Configures FileSystem permissions to set on the WAL directory FileSystem. This is '700' by default.
1407
1408 NOTE: While migrating to custom WAL dir (outside the HBase root directory or a different FileSystem) existing WAL files must be copied manually to new WAL dir, otherwise it may lead to data loss/inconsistency as HMaster has no information about previous WAL directory.
1409
1410 [[wal.disable]]
1411 ==== Disabling the WAL
1412
1413 It is possible to disable the WAL, to improve performance in certain specific situations.
1414 However, disabling the WAL puts your data at risk.
1415 The only situation where this is recommended is during a bulk load.
1416 This is because, in the event of a problem, the bulk load can be re-run with no risk of data loss.
1417
1418 The WAL is disabled by calling the HBase client field `Mutation.writeToWAL(false)`.
1419 Use the `Mutation.setDurability(Durability.SKIP_WAL)` and Mutation.getDurability() methods to set and get the field's value.
1420 There is no way to disable the WAL for only a specific table.
1421
1422 WARNING: If you disable the WAL for anything other than bulk loads, your data is at risk.
1423
1424
1425 [[regions.arch]]
1426 == Regions
1427
1428 Regions are the basic element of availability and distribution for tables, and are comprised of a Store per Column Family.
1429 The hierarchy of objects is as follows:
1430
1431 ----
1432 Table                    (HBase table)
1433     Region               (Regions for the table)
1434         Store            (Store per ColumnFamily for each Region for the table)
1435             MemStore     (MemStore for each Store for each Region for the table)
1436             StoreFile    (StoreFiles for each Store for each Region for the table)
1437                 Block    (Blocks within a StoreFile within a Store for each Region for the table)
1438 ----
1439
1440 For a description of what HBase files look like when written to HDFS, see <<trouble.namenode.hbase.objects>>.
1441
1442 [[arch.regions.size]]
1443 === Considerations for Number of Regions
1444
1445 In general, HBase is designed to run with a small (20-200) number of relatively large (5-20Gb) regions per server.
1446 The considerations for this are as follows:
1447
1448 [[too_many_regions]]
1449 ==== Why should I keep my Region count low?
1450
1451 Typically you want to keep your region count low on HBase for numerous reasons.
1452 Usually right around 100 regions per RegionServer has yielded the best results.
1453 Here are some of the reasons below for keeping region count low:
1454
1455 . MSLAB (MemStore-local allocation buffer) requires 2MB per MemStore (that's 2MB per family per region). 1000 regions that have 2 families each is 3.9GB of heap used, and it's not even storing data yet.
1456   NB: the 2MB value is configurable.
1457 . If you fill all the regions at somewhat the same rate, the global memory usage makes it that it forces tiny flushes when you have too many regions which in turn generates compactions.
1458   Rewriting the same data tens of times is the last thing you want.
1459   An example is filling 1000 regions (with one family) equally and let's consider a lower bound for global MemStore usage of 5GB (the region server would have a big heap). Once it reaches 5GB it will force flush the biggest region, at that point they should almost all have about 5MB of data so it would flush that amount.
1460   5MB inserted later, it would flush another region that will now have a bit over 5MB of data, and so on.
1461   This is currently the main limiting factor for the number of regions; see <<ops.capacity.regions.count>> for detailed formula.
1462 . The master as is is allergic to tons of regions, and will take a lot of time assigning them and moving them around in batches.
1463   The reason is that it's heavy on ZK usage, and it's not very async at the moment (could really be improved -- and has been improved a bunch in 0.96 HBase).
1464 . In older versions of HBase (pre-HFile v2, 0.90 and previous), tons of regions on a few RS can cause the store file index to rise, increasing heap usage and potentially creating memory pressure or OOME on the RSs
1465
1466 Another issue is the effect of the number of regions on MapReduce jobs; it is typical to have one mapper per HBase region.
1467 Thus, hosting only 5 regions per RS may not be enough to get sufficient number of tasks for a MapReduce job, while 1000 regions will generate far too many tasks.
1468
1469 See <<ops.capacity.regions>> for configuration guidelines.
1470
1471 [[regions.arch.assignment]]
1472 === Region-RegionServer Assignment
1473
1474 This section describes how Regions are assigned to RegionServers.
1475
1476 [[regions.arch.assignment.startup]]
1477 ==== Startup
1478
1479 When HBase starts regions are assigned as follows (short version):
1480
1481 . The Master invokes the `AssignmentManager` upon startup.
1482 . The `AssignmentManager` looks at the existing region assignments in `hbase:meta`.
1483 . If the region assignment is still valid (i.e., if the RegionServer is still online) then the assignment is kept.
1484 . If the assignment is invalid, then the `LoadBalancerFactory` is invoked to assign the region.
1485   The load balancer (`StochasticLoadBalancer` by default in HBase 1.0) assign the region to a RegionServer.
1486 . `hbase:meta` is updated with the RegionServer assignment (if needed) and the RegionServer start codes (start time of the RegionServer process) upon region opening by the RegionServer.
1487
1488 [[regions.arch.assignment.failover]]
1489 ==== Failover
1490
1491 When a RegionServer fails:
1492
1493 . The regions immediately become unavailable because the RegionServer is down.
1494 . The Master will detect that the RegionServer has failed.
1495 . The region assignments will be considered invalid and will be re-assigned just like the startup sequence.
1496 . In-flight queries are re-tried, and not lost.
1497 . Operations are switched to a new RegionServer within the following amount of time:
1498 +
1499 [source]
1500 ----
1501 ZooKeeper session timeout + split time + assignment/replay time
1502 ----
1503
1504
1505 [[regions.arch.balancer]]
1506 ==== Region Load Balancing
1507
1508 Regions can be periodically moved by the <<master.processes.loadbalancer>>.
1509
1510 [[regions.arch.states]]
1511 ==== Region State Transition
1512
1513 HBase maintains a state for each region and persists the state in `hbase:meta`.
1514 The state of the `hbase:meta` region itself is persisted in ZooKeeper.
1515 You can see the states of regions in transition in the Master web UI.
1516 Following is the list of possible region states.
1517
1518 .Possible Region States
1519 * `OFFLINE`: the region is offline and not opening
1520 * `OPENING`: the region is in the process of being opened
1521 * `OPEN`: the region is open and the RegionServer has notified the master
1522 * `FAILED_OPEN`: the RegionServer failed to open the region
1523 * `CLOSING`: the region is in the process of being closed
1524 * `CLOSED`: the RegionServer has closed the region and notified the master
1525 * `FAILED_CLOSE`: the RegionServer failed to close the region
1526 * `SPLITTING`: the RegionServer notified the master that the region is splitting
1527 * `SPLIT`: the RegionServer notified the master that the region has finished splitting
1528 * `SPLITTING_NEW`: this region is being created by a split which is in progress
1529 * `MERGING`: the RegionServer notified the master that this region is being merged with another region
1530 * `MERGED`: the RegionServer notified the master that this region has been merged
1531 * `MERGING_NEW`: this region is being created by a merge of two regions
1532
1533 .Region State Transitions
1534 image::region_states.png[]
1535
1536 .Graph Legend
1537 * Brown: Offline state, a special state that can be transient (after closed before opening), terminal (regions of disabled tables), or initial (regions of newly created tables)
1538 * Palegreen: Online state that regions can serve requests
1539 * Lightblue: Transient states
1540 * Red: Failure states that need OPS attention
1541 * Gold: Terminal states of regions split/merged
1542 * Grey: Initial states of regions created through split/merge
1543
1544 .Transition State Descriptions
1545 . The master moves a region from `OFFLINE` to `OPENING` state and tries to assign the region to a RegionServer.
1546   The RegionServer may or may not have received the open region request.
1547   The master retries sending the open region request to the RegionServer until the RPC goes through or the master runs out of retries.
1548   After the RegionServer receives the open region request, the RegionServer begins opening the region.
1549 . If the master is running out of retries, the master prevents the RegionServer from opening the region by moving the region to `CLOSING` state and trying to close it, even if the RegionServer is starting to open the region.
1550 . After the RegionServer opens the region, it continues to try to notify the master until the master moves the region to `OPEN` state and notifies the RegionServer.
1551   The region is now open.
1552 . If the RegionServer cannot open the region, it notifies the master.
1553   The master moves the region to `CLOSED` state and tries to open the region on a different RegionServer.
1554 . If the master cannot open the region on any of a certain number of regions, it moves the region to `FAILED_OPEN` state, and takes no further action until an operator intervenes from the HBase shell, or the server is dead.
1555 . The master moves a region from `OPEN` to `CLOSING` state.
1556   The RegionServer holding the region may or may not have received the close region request.
1557   The master retries sending the close request to the server until the RPC goes through or the master runs out of retries.
1558 . If the RegionServer is not online, or throws `NotServingRegionException`, the master moves the region to `OFFLINE` state and re-assigns it to a different RegionServer.
1559 . If the RegionServer is online, but not reachable after the master runs out of retries, the master moves the region to `FAILED_CLOSE` state and takes no further action until an operator intervenes from the HBase shell, or the server is dead.
1560 . If the RegionServer gets the close region request, it closes the region and notifies the master.
1561   The master moves the region to `CLOSED` state and re-assigns it to a different RegionServer.
1562 . Before assigning a region, the master moves the region to `OFFLINE` state automatically if it is in `CLOSED` state.
1563 . When a RegionServer is about to split a region, it notifies the master.
1564   The master moves the region to be split from `OPEN` to `SPLITTING` state and add the two new regions to be created to the RegionServer.
1565   These two regions are in `SPLITTING_NEW` state initially.
1566 . After notifying the master, the RegionServer starts to split the region.
1567   Once past the point of no return, the RegionServer notifies the master again so the master can update the `hbase:meta` table.
1568   However, the master does not update the region states until it is notified by the server that the split is done.
1569   If the split is successful, the splitting region is moved from `SPLITTING` to `SPLIT` state and the two new regions are moved from `SPLITTING_NEW` to `OPEN` state.
1570 . If the split fails, the splitting region is moved from `SPLITTING` back to `OPEN` state, and the two new regions which were created are moved from `SPLITTING_NEW` to `OFFLINE` state.
1571 . When a RegionServer is about to merge two regions, it notifies the master first.
1572   The master moves the two regions to be merged from `OPEN` to `MERGING` state, and adds the new region which will hold the contents of the merged regions region to the RegionServer.
1573   The new region is in `MERGING_NEW` state initially.
1574 . After notifying the master, the RegionServer starts to merge the two regions.
1575   Once past the point of no return, the RegionServer notifies the master again so the master can update the META.
1576   However, the master does not update the region states until it is notified by the RegionServer that the merge has completed.
1577   If the merge is successful, the two merging regions are moved from `MERGING` to `MERGED` state and the new region is moved from `MERGING_NEW` to `OPEN` state.
1578 . If the merge fails, the two merging regions are moved from `MERGING` back to `OPEN` state, and the new region which was created to hold the contents of the merged regions is moved from `MERGING_NEW` to `OFFLINE` state.
1579 . For regions in `FAILED_OPEN` or `FAILED_CLOSE` states, the master tries to close them again when they are reassigned by an operator via HBase Shell.
1580
1581 [[regions.arch.locality]]
1582 === Region-RegionServer Locality
1583
1584 Over time, Region-RegionServer locality is achieved via HDFS block replication.
1585 The HDFS client does the following by default when choosing locations to write replicas:
1586
1587 . First replica is written to local node
1588 . Second replica is written to a random node on another rack
1589 . Third replica is written on the same rack as the second, but on a different node chosen randomly
1590 . Subsequent replicas are written on random nodes on the cluster.
1591   See _Replica Placement: The First Baby Steps_ on this page: link:https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html[HDFS Architecture]
1592
1593 Thus, HBase eventually achieves locality for a region after a flush or a compaction.
1594 In a RegionServer failover situation a RegionServer may be assigned regions with non-local StoreFiles (because none of the replicas are local), however as new data is written in the region, or the table is compacted and StoreFiles are re-written, they will become "local" to the RegionServer.
1595
1596 For more information, see _Replica Placement: The First Baby Steps_ on this page: link:https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html[HDFS Architecture] and also Lars George's blog on link:http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html[HBase and HDFS locality].
1597
1598 [[arch.region.splits]]
1599 === Region Splits
1600
1601 Regions split when they reach a configured threshold.
1602 Below we treat the topic in short.
1603 For a longer exposition, see link:http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/[Apache HBase Region Splitting and Merging] by our Enis Soztutar.
1604
1605 Splits run unaided on the RegionServer; i.e. the Master does not participate.
1606 The RegionServer splits a region, offlines the split region and then adds the daughter regions to `hbase:meta`, opens daughters on the parent's hosting RegionServer and then reports the split to the Master.
1607 See <<disable.splitting>> for how to manually manage splits (and for why you might do this).
1608
1609 ==== Custom Split Policies
1610 You can override the default split policy using a custom
1611 link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/regionserver/RegionSplitPolicy.html[RegionSplitPolicy](HBase 0.94+).
1612 Typically a custom split policy should extend HBase's default split policy:
1613 link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/regionserver/IncreasingToUpperBoundRegionSplitPolicy.html[IncreasingToUpperBoundRegionSplitPolicy].
1614
1615 The policy can set globally through the HBase configuration or on a per-table
1616 basis.
1617
1618 .Configuring the Split Policy Globally in _hbase-site.xml_
1619 [source,xml]
1620 ----
1621 <property>
1622   <name>hbase.regionserver.region.split.policy</name>
1623   <value>org.apache.hadoop.hbase.regionserver.IncreasingToUpperBoundRegionSplitPolicy</value>
1624 </property>
1625 ----
1626
1627 .Configuring a Split Policy On a Table Using the Java API
1628 [source,java]
1629 HTableDescriptor tableDesc = new HTableDescriptor("test");
1630 tableDesc.setValue(HTableDescriptor.SPLIT_POLICY, ConstantSizeRegionSplitPolicy.class.getName());
1631 tableDesc.addFamily(new HColumnDescriptor(Bytes.toBytes("cf1")));
1632 admin.createTable(tableDesc);
1633 ----
1634
1635 [source]
1636 .Configuring the Split Policy On a Table Using HBase Shell
1637 ----
1638 hbase> create 'test', {METADATA => {'SPLIT_POLICY' => 'org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy'}},{NAME => 'cf1'}
1639 ----
1640
1641 The policy can be set globally through the HBaseConfiguration used or on a per table basis:
1642 [source,java]
1643 ----
1644 HTableDescriptor myHtd = ...;
1645 myHtd.setValue(HTableDescriptor.SPLIT_POLICY, MyCustomSplitPolicy.class.getName());
1646 ----
1647
1648 NOTE: The `DisabledRegionSplitPolicy` policy blocks manual region splitting.
1649
1650 [[manual_region_splitting_decisions]]
1651 === Manual Region Splitting
1652
1653 It is possible to manually split your table, either at table creation (pre-splitting), or at a later time as an administrative action.
1654 You might choose to split your region for one or more of the following reasons.
1655 There may be other valid reasons, but the need to manually split your table might also point to problems with your schema design.
1656
1657 .Reasons to Manually Split Your Table
1658 * Your data is sorted by timeseries or another similar algorithm that sorts new data at the end of the table.
1659   This means that the Region Server holding the last region is always under load, and the other Region Servers are idle, or mostly idle.
1660   See also <<timeseries>>.
1661 * You have developed an unexpected hotspot in one region of your table.
1662   For instance, an application which tracks web searches might be inundated by a lot of searches for a celebrity in the event of news about that celebrity.
1663   See <<perf.one.region,perf.one.region>> for more discussion about this particular scenario.
1664 * After a big increase in the number of RegionServers in your cluster, to get the load spread out quickly.
1665 * Before a bulk-load which is likely to cause unusual and uneven load across regions.
1666
1667 See <<disable.splitting>> for a discussion about the dangers and possible benefits of managing splitting completely manually.
1668
1669 NOTE: The `DisabledRegionSplitPolicy` policy blocks manual region splitting.
1670
1671 ==== Determining Split Points
1672
1673 The goal of splitting your table manually is to improve the chances of balancing the load across the cluster in situations where good rowkey design alone won't get you there.
1674 Keeping that in mind, the way you split your regions is very dependent upon the characteristics of your data.
1675 It may be that you already know the best way to split your table.
1676 If not, the way you split your table depends on what your keys are like.
1677
1678 Alphanumeric Rowkeys::
1679   If your rowkeys start with a letter or number, you can split your table at letter or number boundaries.
1680   For instance, the following command creates a table with regions that split at each vowel, so the first region has A-D, the second region has E-H, the third region has I-N, the fourth region has O-V, and the fifth region has U-Z.
1681
1682 Using a Custom Algorithm::
1683   The RegionSplitter tool is provided with HBase, and uses a _SplitAlgorithm_ to determine split points for you.
1684   As parameters, you give it the algorithm, desired number of regions, and column families.
1685   It includes three split algorithms.
1686   The first is the
1687   `link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/util/RegionSplitter.HexStringSplit.html[HexStringSplit]`
1688   algorithm, which assumes the row keys are hexadecimal strings.
1689   The second is the
1690   `link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/util/RegionSplitter.DecimalStringSplit.html[DecimalStringSplit]`
1691   algorithm, which assumes the row keys are decimal strings in the range 00000000 to 99999999.
1692   The third,
1693   `link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/util/RegionSplitter.UniformSplit.html[UniformSplit]`,
1694   assumes the row keys are random byte arrays.
1695   You will probably need to develop your own
1696   `link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/util/RegionSplitter.SplitAlgorithm.html[SplitAlgorithm]`,
1697   using the provided ones as models.
1698
1699 === Online Region Merges
1700
1701 Both Master and RegionServer participate in the event of online region merges.
1702 Client sends merge RPC to the master, then the master moves the regions together to the RegionServer where the more heavily loaded region resided. Finally the master sends the merge request to this RegionServer which then runs the merge.
1703 Similar to process of region splitting, region merges run as a local transaction on the RegionServer. It offlines the regions and then merges two regions on the file system, atomically delete merging regions from `hbase:meta` and adds the merged region to `hbase:meta`, opens the merged region on the RegionServer and reports the merge to the Master.
1704
1705 An example of region merges in the HBase shell
1706 [source,bourne]
1707 ----
1708 $ hbase> merge_region 'ENCODED_REGIONNAME', 'ENCODED_REGIONNAME'
1709 $ hbase> merge_region 'ENCODED_REGIONNAME', 'ENCODED_REGIONNAME', true
1710 ----
1711 It's an asynchronous operation and call returns immediately without waiting merge completed.
1712 Passing `true` as the optional third parameter will force a merge. Normally only adjacent regions can be merged.
1713 The `force` parameter overrides this behaviour and is for expert use only.
1714
1715 [[store]]
1716 === Store
1717
1718 A Store hosts a MemStore and 0 or more StoreFiles (HFiles). A Store corresponds to a column family for a table for a given region.
1719
1720 [[store.memstore]]
1721 ==== MemStore
1722
1723 The MemStore holds in-memory modifications to the Store.
1724 Modifications are Cells/KeyValues.
1725 When a flush is requested, the current MemStore is moved to a snapshot and is cleared.
1726 HBase continues to serve edits from the new MemStore and backing snapshot until the flusher reports that the flush succeeded.
1727 At this point, the snapshot is discarded.
1728 Note that when the flush happens, MemStores that belong to the same region will all be flushed.
1729
1730 ==== MemStore Flush
1731
1732 A MemStore flush can be triggered under any of the conditions listed below.
1733 The minimum flush unit is per region, not at individual MemStore level.
1734
1735 . When a MemStore reaches the size specified by `hbase.hregion.memstore.flush.size`,
1736   all MemStores that belong to its region will be flushed out to disk.
1737
1738 . When the overall MemStore usage reaches the value specified by
1739   `hbase.regionserver.global.memstore.upperLimit`, MemStores from various regions
1740   will be flushed out to disk to reduce overall MemStore usage in a RegionServer.
1741 +
1742 The flush order is based on the descending order of a region's MemStore usage.
1743 +
1744 Regions will have their MemStores flushed until the overall MemStore usage drops
1745 to or slightly below `hbase.regionserver.global.memstore.lowerLimit`.
1746
1747 . When the number of WAL log entries in a given region server's WAL reaches the
1748   value specified in `hbase.regionserver.max.logs`, MemStores from various regions
1749   will be flushed out to disk to reduce the number of logs in the WAL.
1750 +
1751 The flush order is based on time.
1752 +
1753 Regions with the oldest MemStores are flushed first until WAL count drops below
1754 `hbase.regionserver.max.logs`.
1755
1756 [[hregion.scans]]
1757 ==== Scans
1758
1759 * When a client issues a scan against a table, HBase generates `RegionScanner` objects, one per region, to serve the scan request.
1760 * The `RegionScanner` object contains a list of `StoreScanner` objects, one per column family.
1761 * Each `StoreScanner` object further contains a list of `StoreFileScanner` objects, corresponding to each StoreFile and HFile of the corresponding column family, and a list of `KeyValueScanner` objects for the MemStore.
1762 * The two lists are merged into one, which is sorted in ascending order with the scan object for the MemStore at the end of the list.
1763 * When a `StoreFileScanner` object is constructed, it is associated with a `MultiVersionConcurrencyControl` read point, which is the current `memstoreTS`, filtering out any new updates beyond the read point.
1764
1765 [[hfile]]
1766 ==== StoreFile (HFile)
1767
1768 StoreFiles are where your data lives.
1769
1770 ===== HFile Format
1771
1772 The _HFile_ file format is based on the SSTable file described in the link:http://research.google.com/archive/bigtable.html[BigTable [2006]] paper and on Hadoop's link:https://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/file/tfile/TFile.html[TFile] (The unit test suite and the compression harness were taken directly from TFile). Schubert Zhang's blog post on link:http://cloudepr.blogspot.com/2009/09/hfile-block-indexed-file-format-to.html[HFile: A Block-Indexed File Format to Store Sorted Key-Value Pairs] makes for a thorough introduction to HBase's HFile.
1773 Matteo Bertozzi has also put up a helpful description, link:http://th30z.blogspot.com/2011/02/hbase-io-hfile.html?spref=tw[HBase I/O: HFile].
1774
1775 For more information, see the HFile source code.
1776 Also see <<hfilev2>> for information about the HFile v2 format that was included in 0.92.
1777
1778 [[hfile_tool]]
1779 ===== HFile Tool
1780
1781 To view a textualized version of HFile content, you can use the `hbase hfile` tool.
1782 Type the following to see usage:
1783
1784 [source,bash]
1785 ----
1786 $ ${HBASE_HOME}/bin/hbase hfile
1787 ----
1788 For example, to view the content of the file _hdfs://10.81.47.41:8020/hbase/default/TEST/1418428042/DSMP/4759508618286845475_, type the following:
1789 [source,bash]
1790 ----
1791  $ ${HBASE_HOME}/bin/hbase hfile -v -f hdfs://10.81.47.41:8020/hbase/default/TEST/1418428042/DSMP/4759508618286845475
1792 ----
1793 If you leave off the option -v to see just a summary on the HFile.
1794 See usage for other things to do with the `hfile` tool.
1795
1796 NOTE: In the output of this tool, you might see 'seqid=0' for certain keys in places such as 'Mid-key'/'firstKey'/'lastKey'. These are
1797  'KeyOnlyKeyValue' type instances - meaning their seqid is irrelevant & we just need the keys of these Key-Value instances.
1798
1799 [[store.file.dir]]
1800 ===== StoreFile Directory Structure on HDFS
1801
1802 For more information of what StoreFiles look like on HDFS with respect to the directory structure, see <<trouble.namenode.hbase.objects>>.
1803
1804 [[hfile.blocks]]
1805 ==== Blocks
1806
1807 StoreFiles are composed of blocks.
1808 The blocksize is configured on a per-ColumnFamily basis.
1809
1810 Compression happens at the block level within StoreFiles.
1811 For more information on compression, see <<compression>>.
1812
1813 For more information on blocks, see the HFileBlock source code.
1814
1815 [[keyvalue]]
1816 ==== KeyValue
1817
1818 The KeyValue class is the heart of data storage in HBase.
1819 KeyValue wraps a byte array and takes offsets and lengths into the passed array which specify where to start interpreting the content as KeyValue.
1820
1821 The KeyValue format inside a byte array is:
1822
1823 * keylength
1824 * valuelength
1825 * key
1826 * value
1827
1828 The Key is further decomposed as:
1829
1830 * rowlength
1831 * row (i.e., the rowkey)
1832 * columnfamilylength
1833 * columnfamily
1834 * columnqualifier
1835 * timestamp
1836 * keytype (e.g., Put, Delete, DeleteColumn, DeleteFamily)
1837
1838 KeyValue instances are _not_ split across blocks.
1839 For example, if there is an 8 MB KeyValue, even if the block-size is 64kb this KeyValue will be read in as a coherent block.
1840 For more information, see the KeyValue source code.
1841
1842 [[keyvalue.example]]
1843 ===== Example
1844
1845 To emphasize the points above, examine what happens with two Puts for two different columns for the same row:
1846
1847 * Put #1: `rowkey=row1, cf:attr1=value1`
1848 * Put #2: `rowkey=row1, cf:attr2=value2`
1849
1850 Even though these are for the same row, a KeyValue is created for each column:
1851
1852 Key portion for Put #1:
1853
1854 * `rowlength ------------> 4`
1855 * `row ------------------> row1`
1856 * `columnfamilylength ---> 2`
1857 * `columnfamily ---------> cf`
1858 * `columnqualifier ------> attr1`
1859 * `timestamp ------------> server time of Put`
1860 * `keytype --------------> Put`
1861
1862 Key portion for Put #2:
1863
1864 * `rowlength ------------> 4`
1865 * `row ------------------> row1`
1866 * `columnfamilylength ---> 2`
1867 * `columnfamily ---------> cf`
1868 * `columnqualifier ------> attr2`
1869 * `timestamp ------------> server time of Put`
1870 * `keytype --------------> Put`
1871
1872 It is critical to understand that the rowkey, ColumnFamily, and column (aka columnqualifier) are embedded within the KeyValue instance.
1873 The longer these identifiers are, the bigger the KeyValue is.
1874
1875 [[compaction]]
1876 ==== Compaction
1877
1878 .Ambiguous Terminology
1879 * A _StoreFile_ is a facade of HFile.
1880   In terms of compaction, use of StoreFile seems to have prevailed in the past.
1881 * A _Store_ is the same thing as a ColumnFamily.
1882   StoreFiles are related to a Store, or ColumnFamily.
1883 * If you want to read more about StoreFiles versus HFiles and Stores versus ColumnFamilies, see link:https://issues.apache.org/jira/browse/HBASE-11316[HBASE-11316].
1884
1885 When the MemStore reaches a given size (`hbase.hregion.memstore.flush.size`), it flushes its contents to a StoreFile.
1886 The number of StoreFiles in a Store increases over time. _Compaction_ is an operation which reduces the number of StoreFiles in a Store, by merging them together, in order to increase performance on read operations.
1887 Compactions can be resource-intensive to perform, and can either help or hinder performance depending on many factors.
1888
1889 Compactions fall into two categories: minor and major.
1890 Minor and major compactions differ in the following ways.
1891
1892 _Minor compactions_ usually select a small number of small, adjacent StoreFiles and rewrite them as a single StoreFile.
1893 Minor compactions do not drop (filter out) deletes or expired versions, because of potential side effects.
1894 See <<compaction.and.deletes>> and <<compaction.and.versions>> for information on how deletes and versions are handled in relation to compactions.
1895 The end result of a minor compaction is fewer, larger StoreFiles for a given Store.
1896
1897 The end result of a _major compaction_ is a single StoreFile per Store.
1898 Major compactions also process delete markers and max versions.
1899 See <<compaction.and.deletes>> and <<compaction.and.versions>> for information on how deletes and versions are handled in relation to compactions.
1900
1901 [[compaction.and.deletes]]
1902 .Compaction and Deletions
1903 When an explicit deletion occurs in HBase, the data is not actually deleted.
1904 Instead, a _tombstone_ marker is written.
1905 The tombstone marker prevents the data from being returned with queries.
1906 During a major compaction, the data is actually deleted, and the tombstone marker is removed from the StoreFile.
1907 If the deletion happens because of an expired TTL, no tombstone is created.
1908 Instead, the expired data is filtered out and is not written back to the compacted StoreFile.
1909
1910 [[compaction.and.versions]]
1911 .Compaction and Versions
1912 When you create a Column Family, you can specify the maximum number of versions to keep, by specifying `ColumnFamilyDescriptorBuilder.setMaxVersions(int versions)`.
1913 The default value is `1`.
1914 If more versions than the specified maximum exist, the excess versions are filtered out and not written back to the compacted StoreFile.
1915
1916 .Major Compactions Can Impact Query Results
1917 [NOTE]
1918 ====
1919 In some situations, older versions can be inadvertently resurrected if a newer version is explicitly deleted.
1920 See <<major.compactions.change.query.results>> for a more in-depth explanation.
1921 This situation is only possible before the compaction finishes.
1922 ====
1923
1924 In theory, major compactions improve performance.
1925 However, on a highly loaded system, major compactions can require an inappropriate number of resources and adversely affect performance.
1926 In a default configuration, major compactions are scheduled automatically to run once in a 7-day period.
1927 This is sometimes inappropriate for systems in production.
1928 You can manage major compactions manually.
1929 See <<managed.compactions>>.
1930
1931 Compactions do not perform region merges.
1932 See <<ops.regionmgt.merge>> for more information on region merging.
1933
1934 .Compaction Switch
1935 We can switch on and off the compactions at region servers. Switching off compactions will also
1936 interrupt any currently ongoing compactions. It can be done dynamically using the "compaction_switch"
1937 command from hbase shell. If done from the command line, this setting will be lost on restart of the
1938 server. To persist the changes across region servers modify the configuration hbase.regionserver
1939 .compaction.enabled in hbase-site.xml and restart HBase.
1940
1941
1942 [[compaction.file.selection]]
1943 ===== Compaction Policy - HBase 0.96.x and newer
1944
1945 Compacting large StoreFiles, or too many StoreFiles at once, can cause more IO load than your cluster is able to handle without causing performance problems.
1946 The method by which HBase selects which StoreFiles to include in a compaction (and whether the compaction is a minor or major compaction) is called the _compaction policy_.
1947
1948 Prior to HBase 0.96.x, there was only one compaction policy.
1949 That original compaction policy is still available as `RatioBasedCompactionPolicy`. The new compaction default policy, called `ExploringCompactionPolicy`, was subsequently backported to HBase 0.94 and HBase 0.95, and is the default in HBase 0.96 and newer.
1950 It was implemented in link:https://issues.apache.org/jira/browse/HBASE-7842[HBASE-7842].
1951 In short, `ExploringCompactionPolicy` attempts to select the best possible set of StoreFiles to compact with the least amount of work, while the `RatioBasedCompactionPolicy` selects the first set that meets the criteria.
1952
1953 Regardless of the compaction policy used, file selection is controlled by several configurable parameters and happens in a multi-step approach.
1954 These parameters will be explained in context, and then will be given in a table which shows their descriptions, defaults, and implications of changing them.
1955
1956 [[compaction.being.stuck]]
1957 ====== Being Stuck
1958
1959 When the MemStore gets too large, it needs to flush its contents to a StoreFile.
1960 However, Stores are configured with a bound on the number StoreFiles,
1961 `hbase.hstore.blockingStoreFiles`, and if in excess, the MemStore flush must wait
1962 until the StoreFile count is reduced by one or more compactions. If the MemStore
1963 is too large and the number of StoreFiles is also too high, the algorithm is said
1964 to be "stuck". By default we'll wait on compactions up to
1965 `hbase.hstore.blockingWaitTime` milliseconds. If this period expires, we'll flush
1966 anyways even though we are in excess of the
1967 `hbase.hstore.blockingStoreFiles` count.
1968
1969 Upping the `hbase.hstore.blockingStoreFiles` count will allow flushes to happen
1970 but a Store with many StoreFiles in will likely have higher read latencies. Try to
1971 figure why Compactions are not keeping up. Is it a write spurt that is bringing
1972 about this situation or is a regular occurance and the cluster is under-provisioned
1973 for the volume of writes?
1974
1975 [[exploringcompaction.policy]]
1976 ====== The ExploringCompactionPolicy Algorithm
1977
1978 The ExploringCompactionPolicy algorithm considers each possible set of adjacent StoreFiles before choosing the set where compaction will have the most benefit.
1979
1980 One situation where the ExploringCompactionPolicy works especially well is when you are bulk-loading data and the bulk loads create larger StoreFiles than the StoreFiles which are holding data older than the bulk-loaded data.
1981 This can "trick" HBase into choosing to perform a major compaction each time a compaction is needed, and cause a lot of extra overhead.
1982 With the ExploringCompactionPolicy, major compactions happen much less frequently because minor compactions are more efficient.
1983
1984 In general, ExploringCompactionPolicy is the right choice for most situations, and thus is the default compaction policy.
1985 You can also use ExploringCompactionPolicy along with <<ops.stripe>>.
1986
1987 The logic of this policy can be examined in hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/compactions/ExploringCompactionPolicy.java.
1988 The following is a walk-through of the logic of the ExploringCompactionPolicy.
1989
1990
1991 . Make a list of all existing StoreFiles in the Store.
1992   The rest of the algorithm filters this list to come up with the subset of HFiles which will be chosen for compaction.
1993 . If this was a user-requested compaction, attempt to perform the requested compaction type, regardless of what would normally be chosen.
1994   Note that even if the user requests a major compaction, it may not be possible to perform a major compaction.
1995   This may be because not all StoreFiles in the Column Family are available to compact or because there are too many Stores in the Column Family.
1996 . Some StoreFiles are automatically excluded from consideration.
1997   These include:
1998 +
1999 * StoreFiles that are larger than `hbase.hstore.compaction.max.size`
2000 * StoreFiles that were created by a bulk-load operation which explicitly excluded compaction.
2001   You may decide to exclude StoreFiles resulting from bulk loads, from compaction.
2002   To do this, specify the `hbase.mapreduce.hfileoutputformat.compaction.exclude` parameter during the bulk load operation.
2003
2004 . Iterate through the list from step 1, and make a list of all potential sets of StoreFiles to compact together.
2005   A potential set is a grouping of `hbase.hstore.compaction.min` contiguous StoreFiles in the list.
2006   For each set, perform some sanity-checking and figure out whether this is the best compaction that could be done:
2007 +
2008 * If the number of StoreFiles in this set (not the size of the StoreFiles) is fewer than `hbase.hstore.compaction.min` or more than `hbase.hstore.compaction.max`, take it out of consideration.
2009 * Compare the size of this set of StoreFiles with the size of the smallest possible compaction that has been found in the list so far.
2010   If the size of this set of StoreFiles represents the smallest compaction that could be done, store it to be used as a fall-back if the algorithm is "stuck" and no StoreFiles would otherwise be chosen.
2011   See <<compaction.being.stuck>>.
2012 * Do size-based sanity checks against each StoreFile in this set of StoreFiles.
2013 ** If the size of this StoreFile is larger than `hbase.hstore.compaction.max.size`, take it out of consideration.
2014 ** If the size is greater than or equal to `hbase.hstore.compaction.min.size`, sanity-check it against the file-based ratio to see whether it is too large to be considered.
2015 +
2016 The sanity-checking is successful if:
2017 ** There is only one StoreFile in this set, or
2018 ** For each StoreFile, its size multiplied by `hbase.hstore.compaction.ratio` (or `hbase.hstore.compaction.ratio.offpeak` if off-peak hours are configured and it is during off-peak hours) is less than the sum of the sizes of the other HFiles in the set.
2019
2020 . If this set of StoreFiles is still in consideration, compare it to the previously-selected best compaction.
2021   If it is better, replace the previously-selected best compaction with this one.
2022 . When the entire list of potential compactions has been processed, perform the best compaction that was found.
2023   If no StoreFiles were selected for compaction, but there are multiple StoreFiles, assume the algorithm is stuck (see <<compaction.being.stuck>>) and if so, perform the smallest compaction that was found in step 3.
2024
2025 [[compaction.ratiobasedcompactionpolicy.algorithm]]
2026 ====== RatioBasedCompactionPolicy Algorithm
2027
2028 The RatioBasedCompactionPolicy was the only compaction policy prior to HBase 0.96, though ExploringCompactionPolicy has now been backported to HBase 0.94 and 0.95.
2029 To use the RatioBasedCompactionPolicy rather than the ExploringCompactionPolicy, set `hbase.hstore.defaultengine.compactionpolicy.class` to `RatioBasedCompactionPolicy` in the _hbase-site.xml_ file.
2030 To switch back to the ExploringCompactionPolicy, remove the setting from the _hbase-site.xml_.
2031
2032 The following section walks you through the algorithm used to select StoreFiles for compaction in the RatioBasedCompactionPolicy.
2033
2034
2035 . The first phase is to create a list of all candidates for compaction.
2036   A list is created of all StoreFiles not already in the compaction queue, and all StoreFiles newer than the newest file that is currently being compacted.
2037   This list of StoreFiles is ordered by the sequence ID.
2038   The sequence ID is generated when a Put is appended to the write-ahead log (WAL), and is stored in the metadata of the HFile.
2039 . Check to see if the algorithm is stuck (see <<compaction.being.stuck>>, and if so, a major compaction is forced.
2040   This is a key area where <<exploringcompaction.policy>> is often a better choice than the RatioBasedCompactionPolicy.
2041 . If the compaction was user-requested, try to perform the type of compaction that was requested.
2042   Note that a major compaction may not be possible if all HFiles are not available for compaction or if too many StoreFiles exist (more than `hbase.hstore.compaction.max`).
2043 . Some StoreFiles are automatically excluded from consideration.
2044   These include:
2045 +
2046 * StoreFiles that are larger than `hbase.hstore.compaction.max.size`
2047 * StoreFiles that were created by a bulk-load operation which explicitly excluded compaction.
2048   You may decide to exclude StoreFiles resulting from bulk loads, from compaction.
2049   To do this, specify the `hbase.mapreduce.hfileoutputformat.compaction.exclude` parameter during the bulk load operation.
2050
2051 . The maximum number of StoreFiles allowed in a major compaction is controlled by the `hbase.hstore.compaction.max` parameter.
2052   If the list contains more than this number of StoreFiles, a minor compaction is performed even if a major compaction would otherwise have been done.
2053   However, a user-requested major compaction still occurs even if there are more than `hbase.hstore.compaction.max` StoreFiles to compact.
2054 . If the list contains fewer than `hbase.hstore.compaction.min` StoreFiles to compact, a minor compaction is aborted.
2055   Note that a major compaction can be performed on a single HFile.
2056   Its function is to remove deletes and expired versions, and reset locality on the StoreFile.
2057 . The value of the `hbase.hstore.compaction.ratio` parameter is multiplied by the sum of StoreFiles smaller than a given file, to determine whether that StoreFile is selected for compaction during a minor compaction.
2058   For instance, if hbase.hstore.compaction.ratio is 1.2, FileX is 5MB, FileY is 2MB, and FileZ is 3MB:
2059 +
2060 ----
2061 5 <= 1.2 x (2 + 3)            or            5 <= 6
2062 ----
2063 +
2064 In this scenario, FileX is eligible for minor compaction.
2065 If FileX were 7MB, it would not be eligible for minor compaction.
2066 This ratio favors smaller StoreFile.
2067 You can configure a different ratio for use in off-peak hours, using the parameter `hbase.hstore.compaction.ratio.offpeak`, if you also configure `hbase.offpeak.start.hour` and `hbase.offpeak.end.hour`.
2068
2069 . If the last major compaction was too long ago and there is more than one StoreFile to be compacted, a major compaction is run, even if it would otherwise have been minor.
2070   By default, the maximum time between major compactions is 7 days, plus or minus a 4.8 hour period, and determined randomly within those parameters.
2071   Prior to HBase 0.96, the major compaction period was 24 hours.
2072   See `hbase.hregion.majorcompaction` in the table below to tune or disable time-based major compactions.
2073
2074 [[compaction.parameters]]
2075 ====== Parameters Used by Compaction Algorithm
2076
2077 This table contains the main configuration parameters for compaction.
2078 This list is not exhaustive.
2079 To tune these parameters from the defaults, edit the _hbase-default.xml_ file.
2080 For a full list of all configuration parameters available, see <<config.files,config.files>>
2081
2082 `hbase.hstore.compaction.min`::
2083   The minimum number of StoreFiles which must be eligible for compaction before compaction can run.
2084   The goal of tuning `hbase.hstore.compaction.min` is to avoid ending up with too many tiny StoreFiles
2085   to compact. Setting this value to 2 would cause a minor compaction each time you have two StoreFiles
2086   in a Store, and this is probably not appropriate. If you set this value too high, all the other
2087   values will need to be adjusted accordingly. For most cases, the default value is appropriate.
2088   In previous versions of HBase, the parameter `hbase.hstore.compaction.min` was called
2089   `hbase.hstore.compactionThreshold`.
2090 +
2091 *Default*: 3
2092
2093 `hbase.hstore.compaction.max`::
2094   The maximum number of StoreFiles which will be selected for a single minor compaction,
2095   regardless of the number of eligible StoreFiles. Effectively, the value of
2096   `hbase.hstore.compaction.max` controls the length of time it takes a single
2097   compaction to complete. Setting it larger means that more StoreFiles are included
2098   in a compaction. For most cases, the default value is appropriate.
2099 +
2100 *Default*: 10
2101
2102 `hbase.hstore.compaction.min.size`::
2103   A StoreFile smaller than this size will always be eligible for minor compaction.
2104   StoreFiles this size or larger are evaluated by `hbase.hstore.compaction.ratio`
2105   to determine if they are eligible. Because this limit represents the "automatic
2106   include" limit for all StoreFiles smaller than this value, this value may need
2107   to be reduced in write-heavy environments where many files in the 1-2 MB range
2108   are being flushed, because every StoreFile will be targeted for compaction and
2109   the resulting StoreFiles may still be under the minimum size and require further
2110   compaction. If this parameter is lowered, the ratio check is triggered more quickly.
2111   This addressed some issues seen in earlier versions of HBase but changing this
2112   parameter is no longer necessary in most situations.
2113 +
2114 *Default*:128 MB
2115
2116 `hbase.hstore.compaction.max.size`::
2117   A StoreFile larger than this size will be excluded from compaction. The effect of
2118   raising `hbase.hstore.compaction.max.size` is fewer, larger StoreFiles that do not
2119   get compacted often. If you feel that compaction is happening too often without
2120   much benefit, you can try raising this value.
2121 +
2122 *Default*: `Long.MAX_VALUE`
2123
2124 `hbase.hstore.compaction.ratio`::
2125   For minor compaction, this ratio is used to determine whether a given StoreFile
2126   which is larger than `hbase.hstore.compaction.min.size` is eligible for compaction.
2127   Its effect is to limit compaction of large StoreFile. The value of
2128   `hbase.hstore.compaction.ratio` is expressed as a floating-point decimal.
2129 +
2130 * A large ratio, such as 10, will produce a single giant StoreFile. Conversely,
2131   a value of .25, will produce behavior similar to the BigTable compaction algorithm,
2132   producing four StoreFiles.
2133 * A moderate value of between 1.0 and 1.4 is recommended. When tuning this value,
2134   you are balancing write costs with read costs. Raising the value (to something like
2135   1.4) will have more write costs, because you will compact larger StoreFiles.
2136   However, during reads, HBase will need to seek through fewer StoreFiles to
2137   accomplish the read. Consider this approach if you cannot take advantage of <<blooms>>.
2138 * Alternatively, you can lower this value to something like 1.0 to reduce the
2139   background cost of writes, and use  to limit the number of StoreFiles touched
2140   during reads. For most cases, the default value is appropriate.
2141 +
2142 *Default*: `1.2F`
2143
2144 `hbase.hstore.compaction.ratio.offpeak`::
2145   The compaction ratio used during off-peak compactions, if off-peak hours are
2146   also configured (see below). Expressed as a floating-point decimal. This allows
2147   for more aggressive (or less aggressive, if you set it lower than
2148   `hbase.hstore.compaction.ratio`) compaction during a set time period. Ignored
2149   if off-peak is disabled (default). This works the same as
2150   `hbase.hstore.compaction.ratio`.
2151 +
2152 *Default*: `5.0F`
2153
2154 `hbase.offpeak.start.hour`::
2155   The start of off-peak hours, expressed as an integer between 0 and 23, inclusive.
2156   Set to -1 to disable off-peak.
2157 +
2158 *Default*: `-1` (disabled)
2159
2160 `hbase.offpeak.end.hour`::
2161   The end of off-peak hours, expressed as an integer between 0 and 23, inclusive.
2162   Set to -1 to disable off-peak.
2163 +
2164 *Default*: `-1` (disabled)
2165
2166 `hbase.regionserver.thread.compaction.throttle`::
2167   There are two different thread pools for compactions, one for large compactions
2168   and the other for small compactions. This helps to keep compaction of lean tables
2169   (such as `hbase:meta`) fast. If a compaction is larger than this threshold,
2170   it goes into the large compaction pool. In most cases, the default value is
2171   appropriate.
2172 +
2173 *Default*: `2 x hbase.hstore.compaction.max x hbase.hregion.memstore.flush.size`
2174 (which defaults to `128`)
2175
2176 `hbase.hregion.majorcompaction`::
2177   Time between major compactions, expressed in milliseconds. Set to 0 to disable
2178   time-based automatic major compactions. User-requested and size-based major
2179   compactions will still run. This value is multiplied by
2180   `hbase.hregion.majorcompaction.jitter` to cause compaction to start at a
2181   somewhat-random time during a given window of time.
2182 +
2183 *Default*: 7 days (`604800000` milliseconds)
2184
2185 `hbase.hregion.majorcompaction.jitter`::
2186   A multiplier applied to hbase.hregion.majorcompaction to cause compaction to
2187   occur a given amount of time either side of `hbase.hregion.majorcompaction`.
2188   The smaller the number, the closer the compactions will happen to the
2189   `hbase.hregion.majorcompaction` interval. Expressed as a floating-point decimal.
2190 +
2191 *Default*: `.50F`
2192
2193 [[compaction.file.selection.old]]
2194 ===== Compaction File Selection
2195
2196 .Legacy Information
2197 [NOTE]
2198 ====
2199 This section has been preserved for historical reasons and refers to the way compaction worked prior to HBase 0.96.x.
2200 You can still use this behavior if you enable <<compaction.ratiobasedcompactionpolicy.algorithm>>. For information on the way that compactions work in HBase 0.96.x and later, see <<compaction>>.
2201 ====
2202
2203 To understand the core algorithm for StoreFile selection, there is some ASCII-art in the Store source code that will serve as useful reference.
2204
2205 It has been copied below:
2206 [source]
2207 ----
2208 /* normal skew:
2209  *
2210  *         older ----> newer
2211  *     _
2212  *    | |   _
2213  *    | |  | |   _
2214  *  --|-|- |-|- |-|---_-------_-------  minCompactSize
2215  *    | |  | |  | |  | |  _  | |
2216  *    | |  | |  | |  | | | | | |
2217  *    | |  | |  | |  | | | | | |
2218  */
2219 ----
2220 .Important knobs:
2221 * `hbase.hstore.compaction.ratio` Ratio used in compaction file selection algorithm (default 1.2f).
2222 * `hbase.hstore.compaction.min` (in HBase v 0.90 this is called `hbase.hstore.compactionThreshold`) (files) Minimum number of StoreFiles per Store to be selected for a compaction to occur (default 2).
2223 * `hbase.hstore.compaction.max` (files) Maximum number of StoreFiles to compact per minor compaction (default 10).
2224 * `hbase.hstore.compaction.min.size` (bytes) Any StoreFile smaller than this setting with automatically be a candidate for compaction.
2225   Defaults to `hbase.hregion.memstore.flush.size` (128 mb).
2226 * `hbase.hstore.compaction.max.size` (.92) (bytes) Any StoreFile larger than this setting with automatically be excluded from compaction (default Long.MAX_VALUE).
2227
2228 The minor compaction StoreFile selection logic is size based, and selects a file for compaction when the `file <= sum(smaller_files) * hbase.hstore.compaction.ratio`.
2229
2230 [[compaction.file.selection.example1]]
2231 ====== Minor Compaction File Selection - Example #1 (Basic Example)
2232
2233 This example mirrors an example from the unit test `TestCompactSelection`.
2234
2235 * `hbase.hstore.compaction.ratio` = 1.0f
2236 * `hbase.hstore.compaction.min` = 3 (files)
2237 * `hbase.hstore.compaction.max` = 5 (files)
2238 * `hbase.hstore.compaction.min.size` = 10 (bytes)
2239 * `hbase.hstore.compaction.max.size` = 1000 (bytes)
2240
2241 The following StoreFiles exist: 100, 50, 23, 12, and 12 bytes apiece (oldest to newest). With the above parameters, the files that would be selected for minor compaction are 23, 12, and 12.
2242
2243 Why?
2244
2245 * 100 -> No, because sum(50, 23, 12, 12) * 1.0 = 97.
2246 * 50 -> No, because sum(23, 12, 12) * 1.0 = 47.
2247 * 23 -> Yes, because sum(12, 12) * 1.0 = 24.
2248 * 12 -> Yes, because the previous file has been included, and because this does not exceed the max-file limit of 5
2249 * 12 -> Yes, because the previous file had been included, and because this does not exceed the max-file limit of 5.
2250
2251 [[compaction.file.selection.example2]]
2252 ====== Minor Compaction File Selection - Example #2 (Not Enough Files ToCompact)
2253
2254 This example mirrors an example from the unit test `TestCompactSelection`.
2255
2256 * `hbase.hstore.compaction.ratio` = 1.0f
2257 * `hbase.hstore.compaction.min` = 3 (files)
2258 * `hbase.hstore.compaction.max` = 5 (files)
2259 * `hbase.hstore.compaction.min.size` = 10 (bytes)
2260 * `hbase.hstore.compaction.max.size` = 1000 (bytes)
2261
2262 The following StoreFiles exist: 100, 25, 12, and 12 bytes apiece (oldest to newest). With the above parameters, no compaction will be started.
2263
2264 Why?
2265
2266 * 100 -> No, because sum(25, 12, 12) * 1.0 = 47
2267 * 25 -> No, because sum(12, 12) * 1.0 = 24
2268 * 12 -> No. Candidate because sum(12) * 1.0 = 12, there are only 2 files to compact and that is less than the threshold of 3
2269 * 12 -> No. Candidate because the previous StoreFile was, but there are not enough files to compact
2270
2271 [[compaction.file.selection.example3]]
2272 ====== Minor Compaction File Selection - Example #3 (Limiting Files To Compact)
2273
2274 This example mirrors an example from the unit test `TestCompactSelection`.
2275
2276 * `hbase.hstore.compaction.ratio` = 1.0f
2277 * `hbase.hstore.compaction.min` = 3 (files)
2278 * `hbase.hstore.compaction.max` = 5 (files)
2279 * `hbase.hstore.compaction.min.size` = 10 (bytes)
2280 * `hbase.hstore.compaction.max.size` = 1000 (bytes)
2281
2282 The following StoreFiles exist: 7, 6, 5, 4, 3, 2, and 1 bytes apiece (oldest to newest). With the above parameters, the files that would be selected for minor compaction are 7, 6, 5, 4, 3.
2283
2284 Why?
2285
2286 * 7 -> Yes, because sum(6, 5, 4, 3, 2, 1) * 1.0 = 21.
2287   Also, 7 is less than the min-size
2288 * 6 -> Yes, because sum(5, 4, 3, 2, 1) * 1.0 = 15.
2289   Also, 6 is less than the min-size.
2290 * 5 -> Yes, because sum(4, 3, 2, 1) * 1.0 = 10.
2291   Also, 5 is less than the min-size.
2292 * 4 -> Yes, because sum(3, 2, 1) * 1.0 = 6.
2293   Also, 4 is less than the min-size.
2294 * 3 -> Yes, because sum(2, 1) * 1.0 = 3.
2295   Also, 3 is less than the min-size.
2296 * 2 -> No.
2297   Candidate because previous file was selected and 2 is less than the min-size, but the max-number of files to compact has been reached.
2298 * 1 -> No.
2299   Candidate because previous file was selected and 1 is less than the min-size, but max-number of files to compact has been reached.
2300
2301 [[compaction.config.impact]]
2302 .Impact of Key Configuration Options
2303
2304 NOTE: This information is now included in the configuration parameter table in <<compaction.parameters>>.
2305
2306 [[ops.date.tiered]]
2307 ===== Date Tiered Compaction
2308
2309 Date tiered compaction is a date-aware store file compaction strategy that is beneficial for time-range scans for time-series data.
2310
2311 [[ops.date.tiered.when]]
2312 ===== When To Use Date Tiered Compactions
2313
2314 Consider using Date Tiered Compaction for reads for limited time ranges, especially scans of recent data
2315
2316 Don't use it for
2317
2318 * random gets without a limited time range
2319 * frequent deletes and updates
2320 * Frequent out of order data writes creating long tails, especially writes with future timestamps
2321 * frequent bulk loads with heavily overlapping time ranges
2322
2323 .Performance Improvements
2324 Performance testing has shown that the performance of time-range scans improve greatly for limited time ranges, especially scans of recent data.
2325
2326 [[ops.date.tiered.enable]]
2327 ====== Enabling Date Tiered Compaction
2328
2329 You can enable Date Tiered compaction for a table or a column family, by setting its `hbase.hstore.engine.class` to `org.apache.hadoop.hbase.regionserver.DateTieredStoreEngine`.
2330
2331 You also need to set `hbase.hstore.blockingStoreFiles` to a high number, such as 60, if using all default settings, rather than the default value of 12). Use 1.5~2 x projected file count if changing the parameters, Projected file count = windows per tier x tier count + incoming window min + files older than max age
2332
2333 You also need to set `hbase.hstore.compaction.max` to the same value as `hbase.hstore.blockingStoreFiles` to unblock major compaction.
2334
2335 .Procedure: Enable Date Tiered Compaction
2336 . Run one of following commands in the HBase shell.
2337   Replace the table name `orders_table` with the name of your table.
2338 +
2339 [source,sql]
2340 ----
2341 alter 'orders_table', CONFIGURATION => {'hbase.hstore.engine.class' => 'org.apache.hadoop.hbase.regionserver.DateTieredStoreEngine', 'hbase.hstore.blockingStoreFiles' => '60', 'hbase.hstore.compaction.min'=>'2', 'hbase.hstore.compaction.max'=>'60'}
2342 alter 'orders_table', {NAME => 'blobs_cf', CONFIGURATION => {'hbase.hstore.engine.class' => 'org.apache.hadoop.hbase.regionserver.DateTieredStoreEngine', 'hbase.hstore.blockingStoreFiles' => '60', 'hbase.hstore.compaction.min'=>'2', 'hbase.hstore.compaction.max'=>'60'}}
2343 create 'orders_table', 'blobs_cf', CONFIGURATION => {'hbase.hstore.engine.class' => 'org.apache.hadoop.hbase.regionserver.DateTieredStoreEngine', 'hbase.hstore.blockingStoreFiles' => '60', 'hbase.hstore.compaction.min'=>'2', 'hbase.hstore.compaction.max'=>'60'}
2344 ----
2345
2346 . Configure other options if needed.
2347   See <<ops.date.tiered.config>> for more information.
2348
2349 .Procedure: Disable Date Tiered Compaction
2350 . Set the `hbase.hstore.engine.class` option to either nil or `org.apache.hadoop.hbase.regionserver.DefaultStoreEngine`.
2351   Either option has the same effect.
2352   Make sure you set the other options you changed to the original settings too.
2353 +
2354 [source,sql]
2355 ----
2356 alter 'orders_table', CONFIGURATION => {'hbase.hstore.engine.class' => 'org.apache.hadoop.hbase.regionserver.DefaultStoreEngine'， 'hbase.hstore.blockingStoreFiles' => '12', 'hbase.hstore.compaction.min'=>'6', 'hbase.hstore.compaction.max'=>'12'}}
2357 ----
2358
2359 When you change the store engine either way, a major compaction will likely be performed on most regions.
2360 This is not necessary on new tables.
2361
2362 [[ops.date.tiered.config]]
2363 ====== Configuring Date Tiered Compaction
2364
2365 Each of the settings for date tiered compaction should be configured at the table or column family level.
2366 If you use HBase shell, the general command pattern is as follows:
2367
2368 [source,sql]
2369 ----
2370 alter 'orders_table', CONFIGURATION => {'key' => 'value', ..., 'key' => 'value'}}
2371 ----
2372
2373 [[ops.date.tiered.config.parameters]]
2374 .Tier Parameters
2375
2376 You can configure your date tiers by changing the settings for the following parameters:
2377
2378 .Date Tier Parameters
2379 [cols="1,1a", frame="all", options="header"]
2380 |===
2381 | Setting
2382 | Notes
2383
2384 |`hbase.hstore.compaction.date.tiered.max.storefile.age.millis`
2385 |Files with max-timestamp smaller than this will no longer be compacted.Default at Long.MAX_VALUE.
2386
2387 | `hbase.hstore.compaction.date.tiered.base.window.millis`
2388 | Base window size in milliseconds. Default at 6 hours.
2389
2390 | `hbase.hstore.compaction.date.tiered.windows.per.tier`
2391 | Number of windows per tier. Default at 4.
2392
2393 | `hbase.hstore.compaction.date.tiered.incoming.window.min`
2394 | Minimal number of files to compact in the incoming window. Set it to expected number of files in the window to avoid wasteful compaction. Default at 6.
2395
2396 | `hbase.hstore.compaction.date.tiered.window.policy.class`
2397 | The policy to select store files within the same time window. It doesn’t apply to the incoming window. Default at exploring compaction. This is to avoid wasteful compaction.
2398 |===
2399
2400 [[ops.date.tiered.config.compaction.throttler]]
2401 .Compaction Throttler
2402
2403 With tiered compaction all servers in the cluster will promote windows to higher tier at the same time, so using a compaction throttle is recommended:
2404 Set `hbase.regionserver.throughput.controller` to `org.apache.hadoop.hbase.regionserver.compactions.PressureAwareCompactionThroughputController`.
2405
2406 NOTE: For more information about date tiered compaction, please refer to the design specification at https://docs.google.com/document/d/1_AmlNb2N8Us1xICsTeGDLKIqL6T-oHoRLZ323MG_uy8
2407 [[ops.stripe]]
2408 ===== Experimental: Stripe Compactions
2409
2410 Stripe compactions is an experimental feature added in HBase 0.98 which aims to improve compactions for large regions or non-uniformly distributed row keys.
2411 In order to achieve smaller and/or more granular compactions, the StoreFiles within a region are maintained separately for several row-key sub-ranges, or "stripes", of the region.
2412 The stripes are transparent to the rest of HBase, so other operations on the HFiles or data work without modification.
2413
2414 Stripe compactions change the HFile layout, creating sub-regions within regions.
2415 These sub-regions are easier to compact, and should result in fewer major compactions.
2416 This approach alleviates some of the challenges of larger regions.
2417
2418 Stripe compaction is fully compatible with <<compaction>> and works in conjunction with either the ExploringCompactionPolicy or RatioBasedCompactionPolicy.
2419 It can be enabled for existing tables, and the table will continue to operate normally if it is disabled later.
2420
2421 [[ops.stripe.when]]
2422 ===== When To Use Stripe Compactions
2423
2424 Consider using stripe compaction if you have either of the following:
2425
2426 * Large regions.
2427   You can get the positive effects of smaller regions without additional overhead for MemStore and region management overhead.
2428 * Non-uniform keys, such as time dimension in a key.
2429   Only the stripes receiving the new keys will need to compact.
2430   Old data will not compact as often, if at all
2431
2432 .Performance Improvements
2433 Performance testing has shown that the performance of reads improves somewhat, and variability of performance of reads and writes is greatly reduced.
2434 An overall long-term performance improvement is seen on large non-uniform-row key regions, such as a hash-prefixed timestamp key.
2435 These performance gains are the most dramatic on a table which is already large.
2436 It is possible that the performance improvement might extend to region splits.
2437
2438 [[ops.stripe.enable]]
2439 ====== Enabling Stripe Compaction
2440
2441 You can enable stripe compaction for a table or a column family, by setting its `hbase.hstore.engine.class` to `org.apache.hadoop.hbase.regionserver.StripeStoreEngine`.
2442 You also need to set the `hbase.hstore.blockingStoreFiles` to a high number, such as 100 (rather than the default value of 10).
2443
2444 .Procedure: Enable Stripe Compaction
2445 . Run one of following commands in the HBase shell.
2446   Replace the table name `orders_table` with the name of your table.
2447 +
2448 [source,sql]
2449 ----
2450 alter 'orders_table', CONFIGURATION => {'hbase.hstore.engine.class' => 'org.apache.hadoop.hbase.regionserver.StripeStoreEngine', 'hbase.hstore.blockingStoreFiles' => '100'}
2451 alter 'orders_table', {NAME => 'blobs_cf', CONFIGURATION => {'hbase.hstore.engine.class' => 'org.apache.hadoop.hbase.regionserver.StripeStoreEngine', 'hbase.hstore.blockingStoreFiles' => '100'}}
2452 create 'orders_table', 'blobs_cf', CONFIGURATION => {'hbase.hstore.engine.class' => 'org.apache.hadoop.hbase.regionserver.StripeStoreEngine', 'hbase.hstore.blockingStoreFiles' => '100'}
2453 ----
2454
2455 . Configure other options if needed.
2456   See <<ops.stripe.config>> for more information.
2457 . Enable the table.
2458
2459 .Procedure: Disable Stripe Compaction
2460 . Set the `hbase.hstore.engine.class` option to either nil or `org.apache.hadoop.hbase.regionserver.DefaultStoreEngine`.
2461   Either option has the same effect.
2462 +
2463 [source,sql]
2464 ----
2465 alter 'orders_table', CONFIGURATION => {'hbase.hstore.engine.class' => 'rg.apache.hadoop.hbase.regionserver.DefaultStoreEngine'}
2466 ----
2467
2468 . Enable the table.
2469
2470 When you enable a large table after changing the store engine either way, a major compaction will likely be performed on most regions.
2471 This is not necessary on new tables.
2472
2473 [[ops.stripe.config]]
2474 ====== Configuring Stripe Compaction
2475
2476 Each of the settings for stripe compaction should be configured at the table or column family level.
2477 If you use HBase shell, the general command pattern is as follows:
2478
2479 [source,sql]
2480 ----
2481 alter 'orders_table', CONFIGURATION => {'key' => 'value', ..., 'key' => 'value'}}
2482 ----
2483
2484 [[ops.stripe.config.sizing]]
2485 .Region and stripe sizing
2486
2487 You can configure your stripe sizing based upon your region sizing.
2488 By default, your new regions will start with one stripe.
2489 On the next compaction after the stripe has grown too large (16 x MemStore flushes size), it is split into two stripes.
2490 Stripe splitting continues as the region grows, until the region is large enough to split.
2491
2492 You can improve this pattern for your own data.
2493 A good rule is to aim for a stripe size of at least 1 GB, and about 8-12 stripes for uniform row keys.
2494 For example, if your regions are 30 GB, 12 x 2.5 GB stripes might be a good starting point.
2495
2496 .Stripe Sizing Settings
2497 [cols="1,1a", frame="all", options="header"]
2498 |===
2499 | Setting
2500 | Notes
2501
2502 |`hbase.store.stripe.initialStripeCount`
2503 |The number of stripes to create when stripe compaction is enabled. You can use it as follows:
2504
2505 * For relatively uniform row keys, if you know the approximate
2506   target number of stripes from the above, you can avoid some
2507   splitting overhead by starting with several stripes (2, 5, 10...).
2508   If the early data is not representative of overall row key
2509   distribution, this will not be as efficient.
2510
2511 * For existing tables with a large amount of data, this setting
2512   will effectively pre-split your stripes.
2513
2514 * For keys such as hash-prefixed sequential keys, with more than
2515   one hash prefix per region, pre-splitting may make sense.
2516
2517
2518 | `hbase.store.stripe.sizeToSplit`
2519 | The maximum size a stripe grows before splitting. Use this in
2520 conjunction with `hbase.store.stripe.splitPartCount` to
2521 control the target stripe size (`sizeToSplit = splitPartsCount * target
2522 stripe size`), according to the above sizing considerations.
2523
2524 | `hbase.store.stripe.splitPartCount`
2525 | The number of new stripes to create when splitting a stripe. The default is 2, which is appropriate for most cases. For non-uniform row keys, you can experiment with increasing the number to 3 or 4, to isolate the arriving updates into narrower slice of the region without additional splits being required.
2526 |===
2527
2528 [[ops.stripe.config.memstore]]
2529 .MemStore Size Settings
2530
2531 By default, the flush creates several files from one MemStore, according to existing stripe boundaries and row keys to flush.
2532 This approach minimizes write amplification, but can be undesirable if the MemStore is small and there are many stripes, because the files will be too small.
2533
2534 In this type of situation, you can set `hbase.store.stripe.compaction.flushToL0` to `true`.
2535 This will cause a MemStore flush to create a single file instead.
2536 When at least `hbase.store.stripe.compaction.minFilesL0` such files (by default, 4) accumulate, they will be compacted into striped files.
2537
2538 [[ops.stripe.config.compact]]
2539 .Normal Compaction Configuration and Stripe Compaction
2540
2541 All the settings that apply to normal compactions (see <<compaction.parameters>>) apply to stripe compactions.
2542 The exceptions are the minimum and maximum number of files, which are set to higher values by default because the files in stripes are smaller.
2543 To control these for stripe compactions, use `hbase.store.stripe.compaction.minFiles` and `hbase.store.stripe.compaction.maxFiles`, rather than `hbase.hstore.compaction.min` and `hbase.hstore.compaction.max`.
2544
2545 [[ops.fifo]]
2546 ===== FIFO Compaction
2547
2548 FIFO compaction policy selects only files which have all cells expired. The column family *MUST* have non-default TTL.
2549 Essentially, FIFO compactor only collects expired store files.
2550
2551 Because we don't do any real compaction, we do not use CPU and IO (disk and network) and evict hot data from a block cache.
2552 As a result, both RW throughput and latency can be improved.
2553
2554 [[ops.fifo.when]]
2555 ===== When To Use FIFO Compaction
2556
2557 Consider using FIFO Compaction when your use case is
2558
2559 * Very high volume raw data which has low TTL and which is the source of another data (after additional processing).
2560 * Data which can be kept entirely in a a block cache (RAM/SSD). No need for compaction of a raw data at all.
2561
2562 Do not use FIFO compaction when
2563
2564 * Table/ColumnFamily has MIN_VERSION > 0
2565 * Table/ColumnFamily has TTL = FOREVER (HColumnDescriptor.DEFAULT_TTL)
2566
2567 [[ops.fifo.enable]]
2568 ====== Enabling FIFO Compaction
2569
2570 For Table:
2571
2572 [source,java]
2573 ----
2574 HTableDescriptor desc = new HTableDescriptor(tableName);
2575     desc.setConfiguration(DefaultStoreEngine.DEFAULT_COMPACTION_POLICY_CLASS_KEY,
2576       FIFOCompactionPolicy.class.getName());
2577 ----
2578
2579 For Column Family:
2580
2581 [source,java]
2582 ----
2583 HColumnDescriptor desc = new HColumnDescriptor(family);
2584     desc.setConfiguration(DefaultStoreEngine.DEFAULT_COMPACTION_POLICY_CLASS_KEY,
2585       FIFOCompactionPolicy.class.getName());
2586 ----
2587
2588 From HBase Shell:
2589
2590 [source,bash]
2591 ----
2592 create 'x',{NAME=>'y', TTL=>'30'}, {CONFIGURATION => {'hbase.hstore.defaultengine.compactionpolicy.class' => 'org.apache.hadoop.hbase.regionserver.compactions.FIFOCompactionPolicy', 'hbase.hstore.blockingStoreFiles' => 1000}}
2593 ----
2594
2595 Although region splitting is still supported, for optimal performance it should be disabled, either by setting explicitly `DisabledRegionSplitPolicy` or by setting `ConstantSizeRegionSplitPolicy` and very large max region size.
2596 You will have to increase to a very large number store's blocking file (`hbase.hstore.blockingStoreFiles`) as well.
2597 There is a sanity check on table/column family configuration in case of FIFO compaction and minimum value for number of blocking file is 1000.
2598
2599 [[arch.bulk.load]]
2600 == Bulk Loading
2601
2602 [[arch.bulk.load.overview]]
2603 === Overview
2604
2605 HBase includes several methods of loading data into tables.
2606 The most straightforward method is to either use the `TableOutputFormat` class from a MapReduce job, or use the normal client APIs; however, these are not always the most efficient methods.
2607
2608 The bulk load feature uses a MapReduce job to output table data in HBase's internal data format, and then directly load the generated StoreFiles into a running cluster.
2609 Using bulk load will use less CPU and network resources than loading via the HBase API.
2610
2611 [[arch.bulk.load.arch]]
2612 === Bulk Load Architecture
2613
2614 The HBase bulk load process consists of two main steps.
2615
2616 [[arch.bulk.load.prep]]
2617 ==== Preparing data via a MapReduce job
2618
2619 The first step of a bulk load is to generate HBase data files (StoreFiles) from a MapReduce job using `HFileOutputFormat2`.
2620 This output format writes out data in HBase's internal storage format so that they can be later loaded efficiently into the cluster.
2621
2622 In order to function efficiently, `HFileOutputFormat2` must be configured such that each output HFile fits within a single region.
2623 In order to do this, jobs whose output will be bulk loaded into HBase use Hadoop's `TotalOrderPartitioner` class to partition the map output into disjoint ranges of the key space, corresponding to the key ranges of the regions in the table.
2624
2625 `HFileOutputFormat2` includes a convenience function, `configureIncrementalLoad()`, which automatically sets up a `TotalOrderPartitioner` based on the current region boundaries of a table.
2626
2627 [[arch.bulk.load.complete]]
2628 ==== Completing the data load
2629
2630 After a data import has been prepared, either by using the `importtsv` tool with the "`importtsv.bulk.output`" option or by some other MapReduce job using the `HFileOutputFormat`, the `completebulkload` tool is used to import the data into the running cluster.
2631 This command line tool iterates through the prepared data files, and for each one determines the region the file belongs to.
2632 It then contacts the appropriate RegionServer which adopts the HFile, moving it into its storage directory and making the data available to clients.
2633
2634 If the region boundaries have changed during the course of bulk load preparation, or between the preparation and completion steps, the `completebulkload` utility will automatically split the data files into pieces corresponding to the new boundaries.
2635 This process is not optimally efficient, so users should take care to minimize the delay between preparing a bulk load and importing it into the cluster, especially if other clients are simultaneously loading data through other means.
2636
2637 [[arch.bulk.load.complete.help]]
2638 [source,bash]
2639 ----
2640 $ hadoop jar hbase-mapreduce-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable
2641 ----
2642
2643 The `-c config-file` option can be used to specify a file containing the appropriate hbase parameters (e.g., hbase-site.xml) if not supplied already on the CLASSPATH (In addition, the CLASSPATH must contain the directory that has the zookeeper configuration file if zookeeper is NOT managed by HBase).
2644
2645 [[arch.bulk.load.also]]
2646 === See Also
2647
2648 For more information about the referenced utilities, see <<importtsv>> and  <<completebulkload>>.
2649
2650 See link:http://blog.cloudera.com/blog/2013/09/how-to-use-hbase-bulk-loading-and-why/[How-to: Use HBase Bulk Loading, and Why] for an old blog post on loading.
2651
2652 [[arch.bulk.load.adv]]
2653 === Advanced Usage
2654
2655 Although the `importtsv` tool is useful in many cases, advanced users may want to generate data programmatically, or import data from other formats.
2656 To get started doing so, dig into `ImportTsv.java` and check the JavaDoc for HFileOutputFormat.
2657
2658 The import step of the bulk load can also be done programmatically.
2659 See the `LoadIncrementalHFiles` class for more information.
2660
2661 [[arch.bulk.load.complete.strays]]
2662 ==== 'Adopting' Stray Data
2663 Should an HBase cluster lose account of regions or files during an outage or error, you can use
2664 the `completebulkload` tool to add back the dropped data. HBase operator tooling such as
2665 link:https://github.com/apache/hbase-operator-tools/tree/master/hbase-hbck2[HBCK2] or
2666 the reporting added to the Master's UI under the `HBCK Report` (Since HBase 2.0.6/2.1.6/2.2.1)
2667 can identify such 'orphan' directories.
2668
2669 Before you begin the 'adoption', ensure the `hbase:meta` table is in a healthy state.
2670 Run the `CatalogJanitor` by executing the `catalogjanitor_run` command on the HBase shell.
2671 When finished, check the `HBCK Report` page on the Master UI. Work on fixing any
2672 inconsistencies, holes, or overlaps found before proceeding. The `hbase:meta` table
2673 is the authority on where all data is to be found and must be consistent for
2674 the `completebulkload` tool to work properly.
2675
2676 The `completebulkload` tool takes a directory and a `tablename`.
2677 The directory has subdirectories named for column families of the targeted `tablename`.
2678 In these subdirectories are `hfiles` to load. Given this structure, you can pass
2679 errant region directories (and the table name to which the region directory belongs)
2680 and the tool will bring the data files back into the fold by moving them under the
2681 approprate serving directory. If stray files, then you will need to mock up this
2682 structure before invoking the `completebulkload` tool; you may have to look at the
2683 file content using the <<hfile.tool>> to see what the column family to use is.
2684 When the tool completes its run, you will notice that the
2685 source errant directory has had its storefiles moved/removed. It is now desiccated
2686 since its data has been drained, and the pointed-to directory can be safely
2687 removed. It may still have `.regioninfo` files and other
2688 subdirectories but they are of no relevance now (There may be content still
2689 under the _recovered_edits_ directory; a TODO is tooling to replay the
2690 content of _recovered_edits_ if needed; see
2691 link:https://issues.apache.org/jira/browse/HBASE-22976[Add RecoveredEditsPlayer]).
2692 If you pass `completebulkload` a directory without store files, it will run and
2693 note the directory is storefile-free. Just remove such 'empty' directories.
2694
2695 For example, presuming a directory at the top level in HDFS named
2696 `eb3352fb5c9c9a05feeb2caba101e1cc` has data we need to re-add to the
2697 HBase `TestTable`:
2698
2699 [source,bash]
2700 ----
2701 $ ${HBASE_HOME}/bin/hbase --config ~/hbase-conf completebulkload hdfs://server.example.org:9000/eb3352fb5c9c9a05feeb2caba101e1cc TestTable
2702 ----
2703
2704 After it successfully completes, any files that were in  `eb3352fb5c9c9a05feeb2caba101e1cc` have been moved
2705 under hbase and the  `eb3352fb5c9c9a05feeb2caba101e1cc` directory can be deleted (Check content
2706 before and after by running `ls -r` on the HDFS directory).
2707
2708 [[arch.bulk.load.replication]]
2709 === Bulk Loading Replication
2710 HBASE-13153 adds replication support for bulk loaded HFiles, available since HBase 1.3/2.0. This feature is enabled by setting `hbase.replication.bulkload.enabled` to `true` (default is `false`).
2711 You also need to copy the source cluster configuration files to the destination cluster.
2712
2713 Additional configurations are required too:
2714
2715 . `hbase.replication.source.fs.conf.provider`
2716 +
2717 This defines the class which loads the source cluster file system client configuration in the destination cluster. This should be configured for all the RS in the destination cluster. Default is `org.apache.hadoop.hbase.replication.regionserver.DefaultSourceFSConfigurationProvider`.
2718 +
2719 . `hbase.replication.conf.dir`
2720 +
2721 This represents the base directory where the file system client configurations of the source cluster are copied to the destination cluster. This should be configured for all the RS in the destination cluster. Default is `$HBASE_CONF_DIR`.
2722 +
2723 . `hbase.replication.cluster.id`
2724 +
2725 This configuration is required in the cluster where replication for bulk loaded data is enabled. A source cluster is uniquely identified by the destination cluster using this id. This should be configured for all the RS in the source cluster configuration file for all the RS.
2726 +
2727
2728
2729
2730 For example: If source cluster FS client configurations are copied to the destination cluster under directory `/home/user/dc1/`, then `hbase.replication.cluster.id` should be configured as `dc1` and `hbase.replication.conf.dir` as `/home/user`.
2731
2732 NOTE: `DefaultSourceFSConfigurationProvider` supports only `xml` type files. It loads source cluster FS client configuration only once, so if source cluster FS client configuration files are updated, every peer(s) cluster RS must be restarted to reload the configuration.
2733
2734 [[arch.hdfs]]
2735 == HDFS
2736
2737 As HBase runs on HDFS (and each StoreFile is written as a file on HDFS), it is important to have an understanding of the HDFS Architecture especially in terms of how it stores files, handles failovers, and replicates blocks.
2738
2739 See the Hadoop documentation on link:https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html[HDFS Architecture] for more information.
2740
2741 [[arch.hdfs.nn]]
2742 === NameNode
2743
2744 The NameNode is responsible for maintaining the filesystem metadata.
2745 See the above HDFS Architecture link for more information.
2746
2747 [[arch.hdfs.dn]]
2748 === DataNode
2749
2750 The DataNodes are responsible for storing HDFS blocks.
2751 See the above HDFS Architecture link for more information.
2752
2753 [[arch.timelineconsistent.reads]]
2754 == Timeline-consistent High Available Reads
2755
2756 [[casestudies.timelineconsistent.intro]]
2757 === Introduction
2758
2759 HBase, architecturally, always had the strong consistency guarantee from the start.
2760 All reads and writes are routed through a single region server, which guarantees that all writes happen in an order, and all reads are seeing the most recent committed data.
2761
2762
2763 However, because of this single homing of the reads to a single location, if the server becomes unavailable, the regions of the table that were hosted in the region server become unavailable for some time.
2764 There are three phases in the region recovery process - detection, assignment, and recovery.
2765 Of these, the detection is usually the longest and is presently in the order of 20-30 seconds depending on the ZooKeeper session timeout.
2766 During this time and before the recovery is complete, the clients will not be able to read the region data.
2767
2768 However, for some use cases, either the data may be read-only, or doing reads against some stale data is acceptable.
2769 With timeline-consistent high available reads, HBase can be used for these kind of latency-sensitive use cases where the application can expect to have a time bound on the read completion.
2770
2771
2772 For achieving high availability for reads, HBase provides a feature called _region replication_. In this model, for each region of a table, there will be multiple replicas that are opened in different RegionServers.
2773 By default, the region replication is set to 1, so only a single region replica is deployed and there will not be any changes from the original model.
2774 If region replication is set to 2 or more, then the master will assign replicas of the regions of the table.
2775 The Load Balancer ensures that the region replicas are not co-hosted in the same region servers and also in the same rack (if possible).
2776
2777 All of the replicas for a single region will have a unique replica_id, starting from 0.
2778 The region replica having replica_id==0 is called the primary region, and the others _secondary regions_ or secondaries.
2779 Only the primary can accept writes from the client, and the primary will always contain the latest changes.
2780 Since all writes still have to go through the primary region, the writes are not highly-available (meaning they might block for some time if the region becomes unavailable).
2781
2782
2783 === Timeline Consistency
2784
2785 With this feature, HBase introduces a Consistency definition, which can be provided per read operation (get or scan).
2786 [source,java]
2787 ----
2788 public enum Consistency {
2789     STRONG,
2790     TIMELINE
2791 }
2792 ----
2793 `Consistency.STRONG` is the default consistency model provided by HBase.
2794 In case the table has region replication = 1, or in a table with region replicas but the reads are done with this consistency, the read is always performed by the primary regions, so that there will not be any change from the previous behaviour, and the client always observes the latest data.
2795
2796
2797 In case a read is performed with `Consistency.TIMELINE`, then the read RPC will be sent to the primary region server first.
2798 After a short interval (`hbase.client.primaryCallTimeout.get`, 10ms by default), parallel RPC for secondary region replicas will also be sent if the primary does not respond back.
2799 After this, the result is returned from whichever RPC is finished first.
2800 If the response came back from the primary region replica, we can always know that the data is latest.
2801 For this Result.isStale() API has been added to inspect the staleness.
2802 If the result is from a secondary region, then Result.isStale() will be set to true.
2803 The user can then inspect this field to possibly reason about the data.
2804
2805
2806 In terms of semantics, TIMELINE consistency as implemented by HBase differs from pure eventual consistency in these respects:
2807
2808 * Single homed and ordered updates: Region replication or not, on the write side, there is still only 1 defined replica (primary) which can accept writes.
2809   This replica is responsible for ordering the edits and preventing conflicts.
2810   This guarantees that two different writes are not committed at the same time by different replicas and the data diverges.
2811   With this, there is no need to do read-repair or last-timestamp-wins kind of conflict resolution.
2812 * The secondaries also apply the edits in the order that the primary committed them.
2813   This way the secondaries will contain a snapshot of the primaries data at any point in time.
2814   This is similar to RDBMS replications and even HBase's own multi-datacenter replication, however in a single cluster.
2815 * On the read side, the client can detect whether the read is coming from up-to-date data or is stale data.
2816   Also, the client can issue reads with different consistency requirements on a per-operation basis to ensure its own semantic guarantees.
2817 * The client can still observe edits out-of-order, and can go back in time, if it observes reads from one secondary replica first, then another secondary replica.
2818   There is no stickiness to region replicas or a transaction-id based guarantee.
2819   If required, this can be implemented later though.
2820
2821 .Timeline Consistency
2822 image::timeline_consistency.png[Timeline Consistency]
2823
2824 To better understand the TIMELINE semantics, let's look at the above diagram.
2825 Let's say that there are two clients, and the first one writes x=1 at first, then x=2 and x=3 later.
2826 As above, all writes are handled by the primary region replica.
2827 The writes are saved in the write ahead log (WAL), and replicated to the other replicas asynchronously.
2828 In the above diagram, notice that replica_id=1 received 2 updates, and its data shows that x=2, while the replica_id=2 only received a single update, and its data shows that x=1.
2829
2830
2831 If client1 reads with STRONG consistency, it will only talk with the replica_id=0, and thus is guaranteed to observe the latest value of x=3.
2832 In case of a client issuing TIMELINE consistency reads, the RPC will go to all replicas (after primary timeout) and the result from the first response will be returned back.
2833 Thus the client can see either 1, 2 or 3 as the value of x.
2834 Let's say that the primary region has failed and log replication cannot continue for some time.
2835 If the client does multiple reads with TIMELINE consistency, she can observe x=2 first, then x=1, and so on.
2836
2837
2838 === Tradeoffs
2839
2840 Having secondary regions hosted for read availability comes with some tradeoffs which should be carefully evaluated per use case.
2841 Following are advantages and disadvantages.
2842
2843 .Advantages
2844 * High availability for read-only tables
2845 * High availability for stale reads
2846 * Ability to do very low latency reads with very high percentile (99.9%+) latencies for stale reads
2847
2848 .Disadvantages
2849 * Double / Triple MemStore usage (depending on region replication count) for tables with region replication > 1
2850 * Increased block cache usage
2851 * Extra network traffic for log replication
2852 * Extra backup RPCs for replicas
2853
2854 To serve the region data from multiple replicas, HBase opens the regions in secondary mode in the region servers.
2855 The regions opened in secondary mode will share the same data files with the primary region replica, however each secondary region replica will have its own MemStore to keep the unflushed data (only primary region can do flushes). Also to serve reads from secondary regions, the blocks of data files may be also cached in the block caches for the secondary regions.
2856
2857 === Where is the code
2858 This feature is delivered in two phases, Phase 1 and 2. The first phase is done in time for HBase-1.0.0 release. Meaning that using HBase-1.0.x, you can use all the features that are marked for Phase 1. Phase 2 is committed in HBase-1.1.0, meaning all HBase versions after 1.1.0 should contain Phase 2 items.
2859
2860 === Propagating writes to region replicas
2861 As discussed above writes only go to the primary region replica. For propagating the writes from the primary region replica to the secondaries, there are two different mechanisms. For read-only tables, you do not need to use any of the following methods. Disabling and enabling the table should make the data available in all region replicas. For mutable tables, you have to use *only* one of the following mechanisms: storefile refresher, or async wal replication. The latter is recommended.
2862
2863 ==== StoreFile Refresher
2864 The first mechanism is store file refresher which is introduced in HBase-1.0+. Store file refresher is a thread per region server, which runs periodically, and does a refresh operation for the store files of the primary region for the secondary region replicas. If enabled, the refresher will ensure that the secondary region replicas see the new flushed, compacted or bulk loaded files from the primary region in a timely manner. However, this means that only flushed data can be read back from the secondary region replicas, and after the refresher is run, making the secondaries lag behind the primary for an a longer time.
2865
2866 For turning this feature on, you should configure `hbase.regionserver.storefile.refresh.period` to a non-zero value. See Configuration section below.
2867
2868 [[async.wal.replication]]
2869 ==== Async WAL replication
2870 The second mechanism for propagation of writes to secondaries is done via the
2871 “Async WAL Replication” feature. It is only available in HBase-1.1+. This works
2872 similarly to HBase’s multi-datacenter replication, but instead the data from a
2873 region is replicated to the secondary regions. Each secondary replica always
2874 receives and observes the writes in the same order that the primary region
2875 committed them. In some sense, this design can be thought of as “in-cluster
2876 replication”, where instead of replicating to a different datacenter, the data
2877 goes to secondary regions to keep secondary region’s in-memory state up to date.
2878 The data files are shared between the primary region and the other replicas, so
2879 that there is no extra storage overhead. However, the secondary regions will
2880 have recent non-flushed data in their memstores, which increases the memory
2881 overhead. The primary region writes flush, compaction, and bulk load events
2882 to its WAL as well, which are also replicated through wal replication to
2883 secondaries. When they observe the flush/compaction or bulk load event, the
2884 secondary regions replay the event to pick up the new files and drop the old
2885 ones.
2886
2887 Committing writes in the same order as in primary ensures that the secondaries won’t diverge from the primary regions data, but since the log replication is asynchronous, the data might still be stale in secondary regions. Since this feature works as a replication endpoint, the performance and latency characteristics is expected to be similar to inter-cluster replication.
2888
2889 Async WAL Replication is *disabled* by default. You can enable this feature by
2890 setting `hbase.region.replica.replication.enabled` to `true`.  The Async WAL
2891 Replication feature will add a new replication peer named
2892 `region_replica_replication` as a replication peer when you create a table with
2893 region replication > 1 for the first time. Once enabled, if you want to disable
2894 this feature, you need to do two actions in the following order:
2895 * Set configuration property `hbase.region.replica.replication.enabled` to false in `hbase-site.xml` (see Configuration section below)
2896 * Disable the replication peer named `region_replica_replication` in the cluster using hbase shell or `Admin` class:
2897 [source,bourne]
2898 ----
2899         hbase> disable_peer 'region_replica_replication'
2900 ----
2901
2902 Async WAL Replication and the `hbase:meta` table is a little more involved and gets its own section below; see <<async.wal.replication.meta>>
2903
2904 === Store File TTL
2905 In both of the write propagation approaches mentioned above, store files of the primary will be opened in secondaries independent of the primary region. So for files that the primary compacted away, the secondaries might still be referring to these files for reading. Both features are using HFileLinks to refer to files, but there is no protection (yet) for guaranteeing that the file will not be deleted prematurely. Thus, as a guard, you should set the configuration property `hbase.master.hfilecleaner.ttl` to a larger value, such as 1 hour to guarantee that you will not receive IOExceptions for requests going to replicas.
2906
2907 [[async.wal.replication.meta]]
2908 === Region replication for META table’s region
2909 Async WAL Replication does not work for the META table’s WAL.
2910 The meta table’s secondary replicas refresh themselves from the persistent store
2911 files every `hbase.regionserver.meta.storefile.refresh.period`, (a non-zero value).
2912 Note how the META replication period is distinct from the user-space
2913 `hbase.regionserver.storefile.refresh.period` value.
2914
2915 ==== Async WAL Replication for META table as of hbase-2.4.0+ ====
2916 Async WAL replication for META is added as a new feature in 2.4.0. It is still under
2917 active development. Use with caution. Set
2918 `hbase.region.replica.replication.catalog.enabled` to enable async WAL Replication
2919 for META region replicas. It is off by default.
2920
2921 Regarding META replicas count, up to hbase-2.4.0, you would set the special
2922 property 'hbase.meta.replica.count'. Now you can alter the META table as you
2923 would a user-space table (if `hbase.meta.replica.count` is set, it will take
2924 precedent over what is set for replica count in the META table updating META
2925 replica count to match).
2926
2927 ===== Load Balancing META table load =====
2928
2929 hbase-2.4.0 also adds a *new* client-side `LoadBalance` mode. When enabled
2930 client-side, clients will try to read META replicas first before falling back on
2931 the primary. Before this, the replica lookup mode -- now named `HedgedRead` in
2932 hbase-2.4.0 -- had clients read the primary and if no response after a
2933 configurable amount of time had elapsed, it would start up reads against the
2934 replicas.
2935
2936 The new 'LoadBalance' mode helps alleviate hotspotting on the META
2937 table distributing META read load.
2938
2939 To enable the meta replica locator's load balance mode, please set the following
2940 configuration at on the *client-side* (only): set 'hbase.locator.meta.replicas.mode'
2941 to "LoadBalance". Valid options for this configuration are `None`, `HedgedRead`, and
2942 `LoadBalance`. Option parse is case insensitive.  The default mode is `None` (which falls
2943 through to `HedgedRead`, the current default).  Do NOT put this configuration in any
2944 hbase server-side's configuration, Master or RegionServer (Master could make decisions
2945 based off stale state -- to be avoided).
2946
2947 `LoadBalance` also is a new feature. Use with caution.
2948
2949 === Memory accounting
2950 The secondary region replicas refer to the data files of the primary region replica, but they have their own memstores (in HBase-1.1+) and uses block cache as well. However, one distinction is that the secondary region replicas cannot flush the data when there is memory pressure for their memstores. They can only free up memstore memory when the primary region does a flush and this flush is replicated to the secondary. Since in a region server hosting primary replicas for some regions and secondaries for some others, the secondaries might cause extra flushes to the primary regions in the same host. In extreme situations, there can be no memory left for adding new writes coming from the primary via wal replication. For unblocking this situation (and since secondary cannot flush by itself), the secondary is allowed to do a “store file refresh” by doing a file system list operation to pick up new files from primary, and possibly dropping its memstore. This refresh will only be performed if the memstore size of the biggest secondary region replica is at least `hbase.region.replica.storefile.refresh.memstore.multiplier` (default 4) times bigger than the biggest memstore of a primary replica. One caveat is that if this is performed, the secondary can observe partial row updates across column families (since column families are flushed independently). The default should be good to not do this operation frequently. You can set this value to a large number to disable this feature if desired, but be warned that it might cause the replication to block forever.
2951
2952 === Secondary replica failover
2953 When a secondary region replica first comes online, or fails over, it may have served some edits from its memstore. Since the recovery is handled differently for secondary replicas, the secondary has to ensure that it does not go back in time before it starts serving requests after assignment. For doing that, the secondary waits until it observes a full flush cycle (start flush, commit flush) or a “region open event” replicated from the primary. Until this happens, the secondary region replica will reject all read requests by throwing an IOException with message “The region's reads are disabled”. However, the other replicas will probably still be available to read, thus not causing any impact for the rpc with TIMELINE consistency. To facilitate faster recovery, the secondary region will trigger a flush request from the primary when it is opened. The configuration property `hbase.region.replica.wait.for.primary.flush` (enabled by default) can be used to disable this feature if needed.
2954
2955
2956
2957
2958 === Configuration properties
2959
2960 To use highly available reads, you should set the following properties in `hbase-site.xml` file.
2961 There is no specific configuration to enable or disable region replicas.
2962 Instead you can change the number of region replicas per table to increase or decrease at the table creation or with alter table. The following configuration is for using async wal replication and using meta replicas of 3.
2963
2964
2965 ==== Server side properties
2966
2967 [source,xml]
2968 ----
2969 <property>
2970     <name>hbase.regionserver.storefile.refresh.period</name>
2971     <value>0</value>
2972     <description>
2973       The period (in milliseconds) for refreshing the store files for the secondary regions. 0 means this feature is disabled. Secondary regions sees new files (from flushes and compactions) from primary once the secondary region refreshes the list of files in the region (there is no notification mechanism). But too frequent refreshes might cause extra Namenode pressure. If the files cannot be refreshed for longer than HFile TTL (hbase.master.hfilecleaner.ttl) the requests are rejected. Configuring HFile TTL to a larger value is also recommended with this setting.
2974     </description>
2975 </property>
2976
2977 <property>
2978     <name>hbase.regionserver.meta.storefile.refresh.period</name>
2979     <value>300000</value>
2980     <description>
2981       The period (in milliseconds) for refreshing the store files for the hbase:meta tables secondary regions. 0 means this feature is disabled. Secondary regions sees new files (from flushes and compactions) from primary once the secondary region refreshes the list of files in the region (there is no notification mechanism). But too frequent refreshes might cause extra Namenode pressure. If the files cannot be refreshed for longer than HFile TTL (hbase.master.hfilecleaner.ttl) the requests are rejected. Configuring HFile TTL to a larger value is also recommended with this setting. This should be a non-zero number if meta replicas are enabled.
2982     </description>
2983 </property>
2984
2985 <property>
2986     <name>hbase.region.replica.replication.enabled</name>
2987     <value>true</value>
2988     <description>
2989       Whether asynchronous WAL replication to the secondary region replicas is enabled or not. If this is enabled, a replication peer named "region_replica_replication" will be created which will tail the logs and replicate the mutations to region replicas for tables that have region replication > 1. If this is enabled once, disabling this replication also requires disabling the replication peer using shell or Admin java class. Replication to secondary region replicas works over standard inter-cluster replication.
2990     </description>
2991 </property>
2992 <property>
2993   <name>hbase.region.replica.replication.memstore.enabled</name>
2994   <value>true</value>
2995   <description>
2996     If you set this to `false`, replicas do not receive memstore updates from
2997     the primary RegionServer. If you set this to `true`, you can still disable
2998     memstore replication on a per-table basis, by setting the table's
2999     `REGION_MEMSTORE_REPLICATION` configuration property to `false`. If
3000     memstore replication is disabled, the secondaries will only receive
3001     updates for events like flushes and bulkloads, and will not have access to
3002     data which the primary has not yet flushed. This preserves the guarantee
3003     of row-level consistency, even when the read requests `Consistency.TIMELINE`.
3004   </description>
3005 </property>
3006
3007 <property>
3008     <name>hbase.master.hfilecleaner.ttl</name>
3009     <value>3600000</value>
3010     <description>
3011       The period (in milliseconds) to keep store files in the archive folder before deleting them from the file system.</description>
3012 </property>
3013
3014 <property>
3015     <name>hbase.region.replica.storefile.refresh.memstore.multiplier</name>
3016     <value>4</value>
3017     <description>
3018       The multiplier for a “store file refresh” operation for the secondary region replica. If a region server has memory pressure, the secondary region will refresh it’s store files if the memstore size of the biggest secondary replica is bigger this many times than the memstore size of the biggest primary replica. Set this to a very big value to disable this feature (not recommended).
3019     </description>
3020 </property>
3021
3022 <property>
3023  <name>hbase.region.replica.wait.for.primary.flush</name>
3024     <value>true</value>
3025     <description>
3026       Whether to wait for observing a full flush cycle from the primary before start serving data in a secondary. Disabling this might cause the secondary region replicas to go back in time for reads between region movements.
3027     </description>
3028 </property>
3029 ----
3030
3031 One thing to keep in mind also is that, region replica placement policy is only enforced by the `StochasticLoadBalancer` which is the default balancer.
3032 If you are using a custom load balancer property in hbase-site.xml (`hbase.master.loadbalancer.class`) replicas of regions might end up being hosted in the same server.
3033
3034 ==== Client side properties
3035
3036 Ensure to set the following for all clients (and servers) that will use region replicas.
3037
3038 [source,xml]
3039 ----
3040 <property>
3041     <name>hbase.ipc.client.specificThreadForWriting</name>
3042     <value>true</value>
3043     <description>
3044       Whether to enable interruption of RPC threads at the client side. This is required for region replicas with fallback RPC’s to secondary regions.
3045     </description>
3046 </property>
3047 <property>
3048   <name>hbase.client.primaryCallTimeout.get</name>
3049   <value>10000</value>
3050   <description>
3051     The timeout (in microseconds), before secondary fallback RPC’s are submitted for get requests with Consistency.TIMELINE to the secondary replicas of the regions. Defaults to 10ms. Setting this lower will increase the number of RPC’s, but will lower the p99 latencies.
3052   </description>
3053 </property>
3054 <property>
3055   <name>hbase.client.primaryCallTimeout.multiget</name>
3056   <value>10000</value>
3057   <description>
3058       The timeout (in microseconds), before secondary fallback RPC’s are submitted for multi-get requests (Table.get(List<Get>)) with Consistency.TIMELINE to the secondary replicas of the regions. Defaults to 10ms. Setting this lower will increase the number of RPC’s, but will lower the p99 latencies.
3059   </description>
3060 </property>
3061 <property>
3062   <name>hbase.client.replicaCallTimeout.scan</name>
3063   <value>1000000</value>
3064   <description>
3065     The timeout (in microseconds), before secondary fallback RPC’s are submitted for scan requests with Consistency.TIMELINE to the secondary replicas of the regions. Defaults to 1 sec. Setting this lower will increase the number of RPC’s, but will lower the p99 latencies.
3066   </description>
3067 </property>
3068 <property>
3069     <name>hbase.meta.replicas.use</name>
3070     <value>true</value>
3071     <description>
3072       Whether to use meta table replicas or not. Default is false.
3073     </description>
3074 </property>
3075 ----
3076
3077 Note HBase-1.0.x users should use `hbase.ipc.client.allowsInterrupt` rather than `hbase.ipc.client.specificThreadForWriting`.
3078
3079 === User Interface
3080
3081 In the masters user interface, the region replicas of a table are also shown together with the primary regions.
3082 You can notice that the replicas of a region will share the same start and end keys and the same region name prefix.
3083 The only difference would be the appended replica_id (which is encoded as hex), and the region encoded name will be different.
3084 You can also see the replica ids shown explicitly in the UI.
3085
3086 === Creating a table with region replication
3087
3088 Region replication is a per-table property.
3089 All tables have `REGION_REPLICATION = 1` by default, which means that there is only one replica per region.
3090 You can set and change the number of replicas per region of a table by supplying the `REGION_REPLICATION` property in the table descriptor.
3091
3092
3093 ==== Shell
3094
3095 [source]
3096 ----
3097 create 't1', 'f1', {REGION_REPLICATION => 2}
3098
3099 describe 't1'
3100 for i in 1..100
3101 put 't1', "r#{i}", 'f1:c1', i
3102 end
3103 flush 't1'
3104 ----
3105
3106 ==== Java
3107
3108 [source,java]
3109 ----
3110 HTableDescriptor htd = new HTableDescriptor(TableName.valueOf(“test_table”));
3111 htd.setRegionReplication(2);
3112 ...
3113 admin.createTable(htd);
3114 ----
3115
3116 You can also use `setRegionReplication()` and alter table to increase, decrease the region replication for a table.
3117
3118
3119 === Read API and Usage
3120
3121 ==== Shell
3122
3123 You can do reads in shell using a the Consistency.TIMELINE semantics as follows
3124
3125 [source]
3126 ----
3127 hbase(main):001:0> get 't1','r6', {CONSISTENCY => "TIMELINE"}
3128 ----
3129
3130 You can simulate a region server pausing or becoming unavailable and do a read from the secondary replica:
3131
3132 [source,bash]
3133 ----
3134 $ kill -STOP <pid or primary region server>
3135
3136 hbase(main):001:0> get 't1','r6', {CONSISTENCY => "TIMELINE"}
3137 ----
3138
3139 Using scans is also similar
3140
3141 [source]
3142 ----
3143 hbase> scan 't1', {CONSISTENCY => 'TIMELINE'}
3144 ----
3145
3146 ==== Java
3147
3148 You can set the consistency for Gets and Scans and do requests as follows.
3149
3150 [source,java]
3151 ----
3152 Get get = new Get(row);
3153 get.setConsistency(Consistency.TIMELINE);
3154 ...
3155 Result result = table.get(get);
3156 ----
3157
3158 You can also pass multiple gets:
3159
3160 [source,java]
3161 ----
3162 Get get1 = new Get(row);
3163 get1.setConsistency(Consistency.TIMELINE);
3164 ...
3165 ArrayList<Get> gets = new ArrayList<Get>();
3166 gets.add(get1);
3167 ...
3168 Result[] results = table.get(gets);
3169 ----
3170
3171 And Scans:
3172
3173 [source,java]
3174 ----
3175 Scan scan = new Scan();
3176 scan.setConsistency(Consistency.TIMELINE);
3177 ...
3178 ResultScanner scanner = table.getScanner(scan);
3179 ----
3180
3181 You can inspect whether the results are coming from primary region or not by calling the `Result.isStale()` method:
3182
3183 [source,java]
3184 ----
3185 Result result = table.get(get);
3186 if (result.isStale()) {
3187   ...
3188 }
3189 ----
3190
3191 === Resources
3192
3193 . More information about the design and implementation can be found at the jira issue: link:https://issues.apache.org/jira/browse/HBASE-10070[HBASE-10070]
3194 . HBaseCon 2014 talk: link:https://hbase.apache.org/www.hbasecon.com/#2014-PresentationsRecordings[HBase Read High Availability Using Timeline-Consistent Region Replicas] also contains some details and link:http://www.slideshare.net/enissoz/hbase-high-availability-for-reads-with-time[slides].
3195
3196 ifdef::backend-docbook[]
3197 [index]
3198 == Index
3199 // Generated automatically by the DocBook toolchain.
3200 endif::backend-docbook[]