src/main/asciidoc/_chapters/ops_mgt.adoc

   1 ////
   2 /**
   3  *
   4  * Licensed to the Apache Software Foundation (ASF) under one
   5  * or more contributor license agreements.  See the NOTICE file
   6  * distributed with this work for additional information
   7  * regarding copyright ownership.  The ASF licenses this file
   8  * to you under the Apache License, Version 2.0 (the
   9  * "License"); you may not use this file except in compliance
  10  * with the License.  You may obtain a copy of the License at
  11  *
  12  *     http://www.apache.org/licenses/LICENSE-2.0
  13  *
  14  * Unless required by applicable law or agreed to in writing, software
  15  * distributed under the License is distributed on an "AS IS" BASIS,
  16  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  17  * See the License for the specific language governing permissions and
  18  * limitations under the License.
  19  */
  20 ////
  21
  22 [[ops_mgt]]
  23 = Apache HBase Operational Management
  24 :doctype: book
  25 :numbered:
  26 :toc: left
  27 :icons: font
  28 :experimental:
  29
  30 This chapter will cover operational tools and practices required of a running Apache HBase cluster.
  31 The subject of operations is related to the topics of <<trouble>>, <<performance>>, and <<configuration>> but is a distinct topic in itself.
  32
  33 [[tools]]
  34 == HBase Tools and Utilities
  35
  36 HBase provides several tools for administration, analysis, and debugging of your cluster.
  37 The entry-point to most of these tools is the _bin/hbase_ command, though some tools are available in the _dev-support/_ directory.
  38
  39 To see usage instructions for _bin/hbase_ command, run it with no arguments, or with the `-h` argument.
  40 These are the usage instructions for HBase 0.98.x.
  41 Some commands, such as `version`, `pe`, `ltt`, `clean`, are not available in previous versions.
  42
  43 ----
  44 $ bin/hbase
  45 Usage: hbase [<options>] <command> [<args>]
  46 Options:
  47   --config DIR     Configuration direction to use. Default: ./conf
  48   --hosts HOSTS    Override the list in 'regionservers' file
  49   --auth-as-server Authenticate to ZooKeeper using servers configuration
  50
  51 Commands:
  52 Some commands take arguments. Pass no args or -h for usage.
  53   shell           Run the HBase shell
  54   hbck            Run the HBase 'fsck' tool. Defaults read-only hbck1.
  55                   Pass '-j /path/to/HBCK2.jar' to run hbase-2.x HBCK2.
  56   snapshot        Tool for managing snapshots
  57   wal             Write-ahead-log analyzer
  58   hfile           Store file analyzer
  59   zkcli           Run the ZooKeeper shell
  60   master          Run an HBase HMaster node
  61   regionserver    Run an HBase HRegionServer node
  62   zookeeper       Run a ZooKeeper server
  63   rest            Run an HBase REST server
  64   thrift          Run the HBase Thrift server
  65   thrift2         Run the HBase Thrift2 server
  66   clean           Run the HBase clean up script
  67   jshell          Run a jshell with HBase on the classpath
  68   classpath       Dump hbase CLASSPATH
  69   mapredcp        Dump CLASSPATH entries required by mapreduce
  70   pe              Run PerformanceEvaluation
  71   ltt             Run LoadTestTool
  72   canary          Run the Canary tool
  73   version         Print the version
  74   backup          Backup tables for recovery
  75   restore         Restore tables from existing backup image
  76   regionsplitter  Run RegionSplitter tool
  77   rowcounter      Run RowCounter tool
  78   cellcounter     Run CellCounter tool
  79   CLASSNAME       Run the class named CLASSNAME
  80 ----
  81
  82 Some of the tools and utilities below are Java classes which are passed directly to the _bin/hbase_ command, as referred to in the last line of the usage instructions.
  83 Others, such as `hbase shell` (<<shell>>), `hbase upgrade` (<<upgrading>>), and `hbase thrift` (<<thrift>>), are documented elsewhere in this guide.
  84
  85 === Canary
  86
  87 The Canary tool can help users "canary-test" the HBase cluster status.
  88 The default "region mode" fetches a row from every column-family of every regions.
  89 In "regionserver mode", the Canary tool will fetch a row from a random
  90 region on each of the cluster's RegionServers. In "zookeeper mode", the
  91 Canary will read the root znode on each member of the zookeeper ensemble.
  92
  93 To see usage, pass the `-help` parameter (if you pass no
  94 parameters, the Canary tool starts executing in the default
  95 region "mode" fetching a row from every region in the cluster).
  96
  97 ----
  98 2018-10-16 13:11:27,037 INFO  [main] tool.Canary: Execution thread count=16
  99 Usage: canary [OPTIONS] [<TABLE1> [<TABLE2]...] | [<REGIONSERVER1> [<REGIONSERVER2]..]
 100 Where [OPTIONS] are:
 101  -h,-help        show this help and exit.
 102  -regionserver   set 'regionserver mode'; gets row from random region on server
 103  -allRegions     get from ALL regions when 'regionserver mode', not just random one.
 104  -zookeeper      set 'zookeeper mode'; grab zookeeper.znode.parent on each ensemble member
 105  -daemon         continuous check at defined intervals.
 106  -interval <N>   interval between checks in seconds
 107  -e              consider table/regionserver argument as regular expression
 108  -f <B>          exit on first error; default=true
 109  -failureAsError treat read/write failure as error
 110  -t <N>          timeout for canary-test run; default=600000ms
 111  -writeSniffing  enable write sniffing
 112  -writeTable     the table used for write sniffing; default=hbase:canary
 113  -writeTableTimeout <N>  timeout for writeTable; default=600000ms
 114  -readTableTimeouts <tableName>=<read timeout>,<tableName>=<read timeout>,...
 115                 comma-separated list of table read timeouts (no spaces);
 116                 logs 'ERROR' if takes longer. default=600000ms
 117  -permittedZookeeperFailures <N>  Ignore first N failures attempting to
 118                 connect to individual zookeeper nodes in ensemble
 119
 120  -D<configProperty>=<value> to assign or override configuration params
 121  -Dhbase.canary.read.raw.enabled=<true/false> Set to enable/disable raw scan; default=false
 122
 123 Canary runs in one of three modes: region (default), regionserver, or zookeeper.
 124 To sniff/probe all regions, pass no arguments.
 125 To sniff/probe all regions of a table, pass tablename.
 126 To sniff/probe regionservers, pass -regionserver, etc.
 127 See http://hbase.apache.org/book.html#_canary for Canary documentation.
 128 ----
 129
 130 [NOTE]
 131 The `Sink` class is instantiated using the `hbase.canary.sink.class` configuration property.
 132
 133 This tool will return non zero error codes to user for collaborating with other monitoring tools,
 134 such as Nagios.  The error code definitions are:
 135
 136 [source,java]
 137 ----
 138 private static final int USAGE_EXIT_CODE = 1;
 139 private static final int INIT_ERROR_EXIT_CODE = 2;
 140 private static final int TIMEOUT_ERROR_EXIT_CODE = 3;
 141 private static final int ERROR_EXIT_CODE = 4;
 142 private static final int FAILURE_EXIT_CODE = 5;
 143 ----
 144
 145 Here are some examples based on the following given case: given two Table objects called test-01
 146 and test-02 each with two column family cf1 and cf2 respectively, deployed on 3 RegionServers.
 147 See the following table.
 148
 149 [cols="1,1,1", options="header"]
 150 |===
 151 | RegionServer
 152 | test-01
 153 | test-02
 154 | rs1 | r1 | r2
 155 | rs2 | r2 |
 156 | rs3 | r2 | r1
 157 |===
 158
 159 Following are some example outputs based on the previous given case.
 160
 161 ==== Canary test for every column family (store) of every region of every table
 162
 163 ----
 164 $ ${HBASE_HOME}/bin/hbase canary
 165
 166 3/12/09 03:26:32 INFO tool.Canary: read from region test-01,,1386230156732.0e3c7d77ffb6361ea1b996ac1042ca9a. column family cf1 in 2ms
 167 13/12/09 03:26:32 INFO tool.Canary: read from region test-01,,1386230156732.0e3c7d77ffb6361ea1b996ac1042ca9a. column family cf2 in 2ms
 168 13/12/09 03:26:32 INFO tool.Canary: read from region test-01,0004883,1386230156732.87b55e03dfeade00f441125159f8ca87. column family cf1 in 4ms
 169 13/12/09 03:26:32 INFO tool.Canary: read from region test-01,0004883,1386230156732.87b55e03dfeade00f441125159f8ca87. column family cf2 in 1ms
 170 ...
 171 13/12/09 03:26:32 INFO tool.Canary: read from region test-02,,1386559511167.aa2951a86289281beee480f107bb36ee. column family cf1 in 5ms
 172 13/12/09 03:26:32 INFO tool.Canary: read from region test-02,,1386559511167.aa2951a86289281beee480f107bb36ee. column family cf2 in 3ms
 173 13/12/09 03:26:32 INFO tool.Canary: read from region test-02,0004883,1386559511167.cbda32d5e2e276520712d84eaaa29d84. column family cf1 in 31ms
 174 13/12/09 03:26:32 INFO tool.Canary: read from region test-02,0004883,1386559511167.cbda32d5e2e276520712d84eaaa29d84. column family cf2 in 8ms
 175 ----
 176
 177 So you can see, table test-01 has two regions and two column families, so the Canary tool in the
 178 default "region mode" will pick 4 small piece of data from 4 (2 region * 2 store) different stores.
 179 This is a default behavior.
 180
 181 ==== Canary test for every column family (store) of every region of a specific table(s)
 182
 183 You can also test one or more specific tables by passing table names.
 184
 185 ----
 186 $ ${HBASE_HOME}/bin/hbase canary test-01 test-02
 187 ----
 188
 189 ==== Canary test with RegionServer granularity
 190
 191 In "regionserver mode", the Canary tool will pick one small piece of data
 192 from each RegionServer (You can also pass one or more RegionServer names as arguments
 193 to the canary-test when in "regionserver mode").
 194
 195 ----
 196 $ ${HBASE_HOME}/bin/hbase canary -regionserver
 197
 198 13/12/09 06:05:17 INFO tool.Canary: Read from table:test-01 on region server:rs2 in 72ms
 199 13/12/09 06:05:17 INFO tool.Canary: Read from table:test-02 on region server:rs3 in 34ms
 200 13/12/09 06:05:17 INFO tool.Canary: Read from table:test-01 on region server:rs1 in 56ms
 201 ----
 202
 203 ==== Canary test with regular expression pattern
 204
 205 You can pass regexes for table names when in "region mode" or for servernames when
 206 in "regionserver mode". The below will test both table test-01 and test-02.
 207
 208 ----
 209 $ ${HBASE_HOME}/bin/hbase canary -e test-0[1-2]
 210 ----
 211
 212 ==== Run canary test as a "daemon"
 213
 214 Run repeatedly with an interval defined via the option `-interval` (default value is 60 seconds).
 215 This daemon will stop itself and return non-zero error code if any error occur. To have
 216 the daemon keep running across errors, pass the -f flag with its value set to false
 217 (see usage above).
 218
 219 ----
 220 $ ${HBASE_HOME}/bin/hbase canary -daemon
 221 ----
 222
 223 To run repeatedly with 5 second intervals and not stop on errors, do the following.
 224
 225 ----
 226 $ ${HBASE_HOME}/bin/hbase canary -daemon -interval 5 -f false
 227 ----
 228
 229 ==== Force timeout if canary test stuck
 230
 231 In some cases the request is stuck and no response is sent back to the client. This
 232 can happen with dead RegionServers which the master has not yet noticed.
 233 Because of this we provide a timeout option to kill the canary test and return a
 234 non-zero error code. The below sets the timeout value to 60 seconds (the default value
 235 is 600 seconds).
 236
 237 ----
 238 $ ${HBASE_HOME}/bin/hbase canary -t 60000
 239 ----
 240
 241 ==== Enable write sniffing in canary
 242
 243 By default, the canary tool only checks read operations. To enable the write sniffing,
 244 you can run the canary with the `-writeSniffing` option set.  When write sniffing is
 245 enabled, the canary tool will create an hbase table and make sure the
 246 regions of the table are distributed to all region servers. In each sniffing period,
 247 the canary will try to put data to these regions to check the write availability of
 248 each region server.
 249 ----
 250 $ ${HBASE_HOME}/bin/hbase canary -writeSniffing
 251 ----
 252
 253 The default write table is `hbase:canary` and can be specified with the option `-writeTable`.
 254 ----
 255 $ ${HBASE_HOME}/bin/hbase canary -writeSniffing -writeTable ns:canary
 256 ----
 257
 258 The default value size of each put is 10 bytes. You can set it via the config key:
 259 `hbase.canary.write.value.size`.
 260
 261 ==== Treat read / write failure as error
 262
 263 By default, the canary tool only logs read failures -- due to e.g. RetriesExhaustedException, etc. --
 264 and will return the 'normal' exit code. To treat read/write failure as errors, you can run canary
 265 with the `-treatFailureAsError` option. When enabled, read/write failures will result in an
 266 error exit code.
 267 ----
 268 $ ${HBASE_HOME}/bin/hbase canary -treatFailureAsError
 269 ----
 270
 271 ==== Running Canary in a Kerberos-enabled Cluster
 272
 273 To run the Canary in a Kerberos-enabled cluster, configure the following two properties in _hbase-site.xml_:
 274
 275 * `hbase.client.keytab.file`
 276 * `hbase.client.kerberos.principal`
 277
 278 Kerberos credentials are refreshed every 30 seconds when Canary runs in daemon mode.
 279
 280 To configure the DNS interface for the client, configure the following optional properties in _hbase-site.xml_.
 281
 282 * `hbase.client.dns.interface`
 283 * `hbase.client.dns.nameserver`
 284
 285 .Canary in a Kerberos-Enabled Cluster
 286 ====
 287 This example shows each of the properties with valid values.
 288
 289 [source,xml]
 290 ----
 291 <property>
 292   <name>hbase.client.kerberos.principal</name>
 293   <value>hbase/_HOST@YOUR-REALM.COM</value>
 294 </property>
 295 <property>
 296   <name>hbase.client.keytab.file</name>
 297   <value>/etc/hbase/conf/keytab.krb5</value>
 298 </property>
 299 <!-- optional params -->
 300 <property>
 301   <name>hbase.client.dns.interface</name>
 302   <value>default</value>
 303 </property>
 304 <property>
 305   <name>hbase.client.dns.nameserver</name>
 306   <value>default</value>
 307 </property>
 308 ----
 309 ====
 310
 311 === RegionSplitter
 312
 313 ----
 314 usage: bin/hbase regionsplitter <TABLE> <SPLITALGORITHM>
 315 SPLITALGORITHM is the java class name of a class implementing
 316                       SplitAlgorithm, or one of the special strings
 317                       HexStringSplit or DecimalStringSplit or
 318                       UniformSplit, which are built-in split algorithms.
 319                       HexStringSplit treats keys as hexadecimal ASCII, and
 320                       DecimalStringSplit treats keys as decimal ASCII, and
 321                       UniformSplit treats keys as arbitrary bytes.
 322  -c <region count>        Create a new table with a pre-split number of
 323                           regions
 324  -D <property=value>      Override HBase Configuration Settings
 325  -f <family:family:...>   Column Families to create with new table.
 326                           Required with -c
 327     --firstrow <arg>      First Row in Table for Split Algorithm
 328  -h                       Print this usage help
 329     --lastrow <arg>       Last Row in Table for Split Algorithm
 330  -o <count>               Max outstanding splits that have unfinished
 331                           major compactions
 332  -r                       Perform a rolling split of an existing region
 333     --risky               Skip verification steps to complete
 334                           quickly. STRONGLY DISCOURAGED for production
 335                           systems.
 336 ----
 337
 338 For additional detail, see <<manual_region_splitting_decisions>>.
 339
 340 [[health.check]]
 341 === Health Checker
 342
 343 You can configure HBase to run a script periodically and if it fails N times (configurable), have the server exit.
 344 See _HBASE-7351 Periodic health check script_ for configurations and detail.
 345
 346 === Driver
 347
 348 Several frequently-accessed utilities are provided as `Driver` classes, and executed by the _bin/hbase_ command.
 349 These utilities represent MapReduce jobs which run on your cluster.
 350 They are run in the following way, replacing _UtilityName_ with the utility you want to run.
 351 This command assumes you have set the environment variable `HBASE_HOME` to the directory where HBase is unpacked on your server.
 352
 353 ----
 354
 355 ${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.mapreduce.UtilityName
 356 ----
 357
 358 The following utilities are available:
 359
 360 `LoadIncrementalHFiles`::
 361   Complete a bulk data load.
 362
 363 `CopyTable`::
 364   Export a table from the local cluster to a peer cluster.
 365
 366 `Export`::
 367   Write table data to HDFS.
 368
 369 `Import`::
 370   Import data written by a previous `Export` operation.
 371
 372 `ImportTsv`::
 373   Import data in TSV format.
 374
 375 `RowCounter`::
 376   Count rows in an HBase table.
 377
 378 `CellCounter`::
 379   Count cells in an HBase table.
 380
 381 `replication.VerifyReplication`::
 382   Compare the data from tables in two different clusters.
 383   WARNING: It doesn't work for incrementColumnValues'd cells since the timestamp is changed.
 384   Note that this command is in a different package than the others.
 385
 386 Each command except `RowCounter` and `CellCounter` accept a single `--help` argument to print usage instructions.
 387
 388 [[hbck]]
 389 === HBase `hbck`
 390
 391 The `hbck` tool that shipped with hbase-1.x has been made read-only in hbase-2.x. It is not able to repair
 392 hbase-2.x clusters as hbase internals have changed. Nor should its assessments in read-only mode be
 393 trusted as it does not understand hbase-2.x operation.
 394
 395 A new tool, <<HBCK2>>, described in the next section, replaces `hbck`.
 396
 397 [[HBCK2]]
 398 === HBase `HBCK2`
 399
 400 `HBCK2` is the successor to <<hbck>>, the hbase-1.x fix tool (A.K.A `hbck1`). Use it in place of `hbck1`
 401 making repairs against hbase-2.x installs.
 402
 403 `HBCK2` does not ship as part of hbase. It can be found as a subproject of the companion
 404 link:https://github.com/apache/hbase-operator-tools[hbase-operator-tools] repository at
 405 link:https://github.com/apache/hbase-operator-tools/tree/master/hbase-hbck2[Apache HBase HBCK2 Tool].
 406 `HBCK2` was moved out of hbase so it could evolve at a cadence apart from that of hbase core.
 407
 408 See the [https://github.com/apache/hbase-operator-tools/tree/master/hbase-hbck2](HBCK2) Home Page
 409 for how `HBCK2` differs from `hbck1`, and for how to build and use it.
 410
 411 Once built, you can run `HBCK2` as follows:
 412
 413 ```
 414 $ hbase hbck -j /path/to/HBCK2.jar
 415 ```
 416
 417 This will generate `HBCK2` usage describing commands and options.
 418
 419 [[hfile_tool2]]
 420 === HFile Tool
 421
 422 See <<hfile_tool>>.
 423
 424 === WAL Tools
 425 For bulk replaying WAL files or _recovered.edits_ files, see
 426 <<walplayer>>. For reading/verifying individual files, read on.
 427
 428 [[hlog_tool.prettyprint]]
 429 ==== WALPrettyPrinter
 430
 431 The `WALPrettyPrinter` is a tool with configurable options to print the contents of a WAL
 432 or a _recovered.edits_ file. You can invoke it via the HBase cli with the 'wal' command.
 433
 434 ----
 435  $ ./bin/hbase wal hdfs://example.org:8020/hbase/WALs/example.org,60020,1283516293161/10.10.21.10%3A60020.1283973724012
 436 ----
 437
 438 .WAL Printing in older versions of HBase
 439 [NOTE]
 440 ====
 441 Prior to version 2.0, the `WALPrettyPrinter` was called the `HLogPrettyPrinter`, after an internal name for HBase's write ahead log.
 442 In those versions, you can print the contents of a WAL using the same configuration as above, but with the 'hlog' command.
 443
 444 ----
 445  $ ./bin/hbase hlog hdfs://example.org:8020/hbase/.logs/example.org,60020,1283516293161/10.10.21.10%3A60020.1283973724012
 446 ----
 447 ====
 448
 449 [[compression.tool]]
 450 === Compression Tool
 451
 452 See <<compression.test,compression.test>>.
 453
 454 [[copy.table]]
 455 === CopyTable
 456
 457 CopyTable is a utility that can copy part or of all of a table, either to the same cluster or another cluster.
 458 The target table must first exist.
 459 The usage is as follows:
 460
 461 ----
 462
 463 $ ./bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --help
 464 /bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --help
 465 Usage: CopyTable [general options] [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] <tablename>
 466
 467 Options:
 468  rs.class     hbase.regionserver.class of the peer cluster,
 469               specify if different from current cluster
 470  rs.impl      hbase.regionserver.impl of the peer cluster,
 471  startrow     the start row
 472  stoprow      the stop row
 473  starttime    beginning of the time range (unixtime in millis)
 474               without endtime means from starttime to forever
 475  endtime      end of the time range.  Ignored if no starttime specified.
 476  versions     number of cell versions to copy
 477  new.name     new table's name
 478  peer.adr     Address of the peer cluster given in the format
 479               hbase.zookeeer.quorum:hbase.zookeeper.client.port:zookeeper.znode.parent
 480  families     comma-separated list of families to copy
 481               To copy from cf1 to cf2, give sourceCfName:destCfName.
 482               To keep the same name, just give "cfName"
 483  all.cells    also copy delete markers and deleted cells
 484
 485 Args:
 486  tablename    Name of the table to copy
 487
 488 Examples:
 489  To copy 'TestTable' to a cluster that uses replication for a 1 hour window:
 490  $ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1265875194289 --endtime=1265878794289 --peer.adr=server1,server2,server3:2181:/hbase --families=myOldCf:myNewCf,cf2,cf3 TestTable
 491
 492 For performance consider the following general options:
 493   It is recommended that you set the following to >=100. A higher value uses more memory but
 494   decreases the round trip time to the server and may increase performance.
 495     -Dhbase.client.scanner.caching=100
 496   The following should always be set to false, to prevent writing data twice, which may produce
 497   inaccurate results.
 498     -Dmapred.map.tasks.speculative.execution=false
 499 ----
 500
 501 .Scanner Caching
 502 [NOTE]
 503 ====
 504 Caching for the input Scan is configured via `hbase.client.scanner.caching`          in the job configuration.
 505 ====
 506
 507 .Versions
 508 [NOTE]
 509 ====
 510 By default, CopyTable utility only copies the latest version of row cells unless `--versions=n` is explicitly specified in the command.
 511 ====
 512
 513 .Data Load
 514 [NOTE]
 515 ====
 516 CopyTable does not perform a diff, it copies all Cells in between the specified startrow/stoprow starttime/endtime range.
 517 This means that already existing cells with same values will still be copied.
 518 ====
 519
 520 See Jonathan Hsieh's link:https://blog.cloudera.com/blog/2012/06/online-hbase-backups-with-copytable-2/[Online
 521           HBase Backups with CopyTable] blog post for more on `CopyTable`.
 522
 523 [[hashtable.synctable]]
 524 === HashTable/SyncTable
 525
 526 HashTable/SyncTable is a two steps tool for synchronizing table data, where each of the steps are implemented as MapReduce jobs.
 527 Similarly to CopyTable, it can be used for partial or entire table data syncing, under same or remote cluster.
 528 However, it performs the sync in a more efficient way than CopyTable. Instead of copying all cells
 529 in specified row key/time period range, HashTable (the first step) creates hashed indexes for batch of cells on source table and output those as results.
 530 On the next stage, SyncTable scans the source table and now calculates hash indexes for table cells,
 531 compares these hashes with the outputs of HashTable, then it just scans (and compares) cells for diverging hashes, only updating
 532 mismatching cells. This results in less network traffic/data transfers, which can be impacting when syncing large tables on remote clusters.
 533
 534 ==== Step 1, HashTable
 535
 536 First, run HashTable on the source table cluster (this is the table whose state will be copied to its counterpart).
 537
 538 Usage:
 539
 540 ----
 541 $ ./bin/hbase org.apache.hadoop.hbase.mapreduce.HashTable --help
 542 Usage: HashTable [options] <tablename> <outputpath>
 543
 544 Options:
 545  batchsize         the target amount of bytes to hash in each batch
 546                    rows are added to the batch until this size is reached
 547                    (defaults to 8000 bytes)
 548  numhashfiles      the number of hash files to create
 549                    if set to fewer than number of regions then
 550                    the job will create this number of reducers
 551                    (defaults to 1/100 of regions -- at least 1)
 552  startrow          the start row
 553  stoprow           the stop row
 554  starttime         beginning of the time range (unixtime in millis)
 555                    without endtime means from starttime to forever
 556  endtime           end of the time range.  Ignored if no starttime specified.
 557  scanbatch         scanner batch size to support intra row scans
 558  versions          number of cell versions to include
 559  families          comma-separated list of families to include
 560  ignoreTimestamps  if true, ignores cell timestamps
 561
 562 Args:
 563  tablename     Name of the table to hash
 564  outputpath    Filesystem path to put the output data
 565
 566 Examples:
 567  To hash 'TestTable' in 32kB batches for a 1 hour window into 50 files:
 568  $ bin/hbase org.apache.hadoop.hbase.mapreduce.HashTable --batchsize=32000 --numhashfiles=50 --starttime=1265875194289 --endtime=1265878794289 --families=cf2,cf3 TestTable /hashes/testTable
 569 ----
 570
 571 The *batchsize* property defines how much cell data for a given region will be hashed together in a single hash value.
 572 Sizing this properly has a direct impact on the sync efficiency, as it may lead to less scans executed by mapper tasks
 573 of SyncTable (the next step in the process). The rule of thumb is that, the smaller the number of cells out of sync
 574 (lower probability of finding a diff), larger batch size values can be determined.
 575
 576 ==== Step 2, SyncTable
 577
 578 Once HashTable has completed on source cluster, SyncTable can be ran on target cluster.
 579 Just like replication and other synchronization jobs, it requires that all RegionServers/DataNodes
 580 on source cluster be accessible by NodeManagers on the target cluster (where SyncTable job tasks will be running).
 581
 582 Usage:
 583
 584 ----
 585 $ ./bin/hbase org.apache.hadoop.hbase.mapreduce.SyncTable --help
 586 Usage: SyncTable [options] <sourcehashdir> <sourcetable> <targettable>
 587
 588 Options:
 589  sourcezkcluster  ZK cluster key of the source table
 590                   (defaults to cluster in classpath's config)
 591  targetzkcluster  ZK cluster key of the target table
 592                   (defaults to cluster in classpath's config)
 593  dryrun           if true, output counters but no writes
 594                   (defaults to false)
 595  doDeletes        if false, does not perform deletes
 596                   (defaults to true)
 597  doPuts           if false, does not perform puts
 598                   (defaults to true)
 599  ignoreTimestamps if true, ignores cells timestamps while comparing
 600                   cell values. Any missing cell on target then gets
 601                   added with current time as timestamp
 602                   (defaults to false)
 603
 604 Args:
 605  sourcehashdir    path to HashTable output dir for source table
 606                   (see org.apache.hadoop.hbase.mapreduce.HashTable)
 607  sourcetable      Name of the source table to sync from
 608  targettable      Name of the target table to sync to
 609
 610 Examples:
 611  For a dry run SyncTable of tableA from a remote source cluster
 612  to a local target cluster:
 613  $ bin/hbase org.apache.hadoop.hbase.mapreduce.SyncTable --dryrun=true --sourcezkcluster=zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase hdfs://nn:9000/hashes/tableA tableA tableA
 614 ----
 615
 616 Cell comparison takes ROW/FAMILY/QUALIFIER/TIMESTAMP/VALUE into account for equality. When syncing at the target, missing cells will be
 617 added with original timestamp value from source. That may cause unexpected results after SyncTable completes, for example, if missing
 618 cells on target have a delete marker with a timestamp T2 (say, a bulk delete performed by mistake), but source cells timestamps have an
 619 older value T1, then those cells would still be unavailable at target because of the newer delete marker timestamp. Since cell timestamps
 620 might not be relevant to all use cases, _ignoreTimestamps_ option adds the flexibility to avoid using cells timestamp in the comparison.
 621 When using _ignoreTimestamps_ set to true, this option must be specified for both HashTable and SyncTable steps.
 622
 623 The *dryrun* option is useful when a read only, diff report is wanted, as it will produce only COUNTERS indicating the differences, but will not perform
 624 any actual changes. It can be used as an alternative to VerifyReplication tool.
 625
 626 By default, SyncTable will cause target table to become an exact copy of source table (at least, for the specified startrow/stoprow or/and starttime/endtime).
 627
 628 Setting doDeletes to false modifies default behaviour to not delete target cells that are missing on source.
 629 Similarly, setting doPuts to false modifies default behaviour to not add missing cells on target. Setting both doDeletes
 630 and doPuts to false would give same effect as setting dryrun to true.
 631
 632
 633 .Additional info on doDeletes/doPuts
 634 [NOTE]
 635 ====
 636 "doDeletes/doPuts" were only added by
 637 link:https://jira.apache.org/jira/browse/HBASE-20305[HBASE-20305], so these may not be available on
 638 all released versions.
 639 For major 1.x versions, minimum minor release including it is *1.4.10*.
 640 For major 2.x versions, minimum minor release including it is *2.1.5*.
 641 ====
 642
 643 .Additional info on ignoreTimestamps
 644 [NOTE]
 645 ====
 646 "ignoreTimestamps" was only added by
 647 link:https://issues.apache.org/jira/browse/HBASE-24302[HBASE-24302], so it may not be available on
 648 all released versions.
 649 For major 1.x versions, minimum minor release including it is *1.4.14*.
 650 For major 2.x versions, minimum minor release including it is *2.2.5*.
 651 ====
 652
 653 .Set doDeletes to false on Two-Way Replication scenarios
 654 [NOTE]
 655 ====
 656 On Two-Way Replication or other scenarios where both source and target clusters can have data ingested, it's advisable to always set doDeletes option to false,
 657 as any additional cell inserted on SyncTable target cluster and not yet replicated to source would be deleted, and potentially lost permanently.
 658 ====
 659
 660 .Set sourcezkcluster to the actual source cluster ZK quorum
 661 [NOTE]
 662 ====
 663 Although not required, if sourcezkcluster is not set, SyncTable will connect to local HBase cluster for both source and target,
 664 which does not give any meaningful result.
 665 ====
 666
 667 .Remote Clusters on different Kerberos Realms
 668 [NOTE]
 669 ====
 670 Often, remote clusters may be deployed on different Kerberos Realms.
 671 link:https://jira.apache.org/jira/browse/HBASE-20586[HBASE-20586] added SyncTable support for
 672 cross realm authentication, allowing a SyncTable process running on target cluster to connect to
 673 source cluster and read both HashTable output files and the given HBase table when performing the
 674 required comparisons.
 675 ====
 676
 677 [[export]]
 678 === Export
 679
 680 Export is a utility that will dump the contents of table to HDFS in a sequence file.
 681 The Export can be run via a Coprocessor Endpoint or MapReduce. Invoke via:
 682
 683 *mapreduce-based Export*
 684 ----
 685 $ bin/hbase org.apache.hadoop.hbase.mapreduce.Export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]
 686 ----
 687
 688 *endpoint-based Export*
 689
 690 NOTE: Make sure the Export coprocessor is enabled by adding `org.apache.hadoop.hbase.coprocessor.Export` to `hbase.coprocessor.region.classes`.
 691 ----
 692 $ bin/hbase org.apache.hadoop.hbase.coprocessor.Export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]
 693 ----
 694 The outputdir is a HDFS directory that does not exist prior to the export. When done, the exported files will be owned by the user invoking the export command.
 695
 696 *The Comparison of Endpoint-based Export And Mapreduce-based Export*
 697 |===
 698 ||Endpoint-based Export|Mapreduce-based Export
 699
 700 |HBase version requirement
 701 |2.0+
 702 |0.2.1+
 703
 704 |Maven dependency
 705 |hbase-endpoint
 706 |hbase-mapreduce (2.0+), hbase-server(prior to 2.0)
 707
 708 |Requirement before dump
 709 |mount the endpoint.Export on the target table
 710 |deploy the MapReduce framework
 711
 712 |Read latency
 713 |low, directly read the data from region
 714 |normal, traditional RPC scan
 715
 716 |Read Scalability
 717 |depend on number of regions
 718 |depend on number of mappers (see TableInputFormatBase#getSplits)
 719
 720 |Timeout
 721 |operation timeout. configured by hbase.client.operation.timeout
 722 |scan timeout. configured by hbase.client.scanner.timeout.period
 723
 724 |Permission requirement
 725 |READ, EXECUTE
 726 |READ
 727
 728 |Fault tolerance
 729 |no
 730 |depend on MapReduce
 731 |===
 732
 733
 734 NOTE: To see usage instructions, run the command with no options. Available options include
 735 specifying column families and applying filters during the export.
 736
 737 By default, the `Export` tool only exports the newest version of a given cell, regardless of the number of versions stored. To export more than one version, replace *_<versions>_* with the desired number of versions.
 738
 739 For mapreduce based Export, if you want to export cell tags then set the following config property
 740 `hbase.client.rpc.codec` to `org.apache.hadoop.hbase.codec.KeyValueCodecWithTags`
 741
 742 Note: caching for the input Scan is configured via `hbase.client.scanner.caching` in the job configuration.
 743
 744 [[import]]
 745 === Import
 746
 747 Import is a utility that will load data that has been exported back into HBase.
 748 Invoke via:
 749
 750 ----
 751 $ bin/hbase org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>
 752 ----
 753
 754 NOTE: To see usage instructions, run the command with no options.
 755
 756 To import 0.94 exported files in a 0.96 cluster or onwards, you need to set system property "hbase.import.version" when running the import command as below:
 757
 758 ----
 759 $ bin/hbase -Dhbase.import.version=0.94 org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>
 760 ----
 761
 762 If you want to import cell tags then set the following config property
 763 `hbase.client.rpc.codec` to `org.apache.hadoop.hbase.codec.KeyValueCodecWithTags`
 764
 765 [[importtsv]]
 766 === ImportTsv
 767
 768 ImportTsv is a utility that will load data in TSV format into HBase.
 769 It has two distinct usages: loading data from TSV format in HDFS into HBase via Puts, and preparing StoreFiles to be loaded via the `completebulkload`.
 770
 771 To load data via Puts (i.e., non-bulk loading):
 772
 773 ----
 774 $ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c <tablename> <hdfs-inputdir>
 775 ----
 776
 777 To generate StoreFiles for bulk-loading:
 778
 779 [source,bourne]
 780 ----
 781 $ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c -Dimporttsv.bulk.output=hdfs://storefile-outputdir <tablename> <hdfs-data-inputdir>
 782 ----
 783
 784 These generated StoreFiles can be loaded into HBase via <<completebulkload,completebulkload>>.
 785
 786 [[importtsv.options]]
 787 ==== ImportTsv Options
 788
 789 Running `ImportTsv` with no arguments prints brief usage information:
 790
 791 ----
 792
 793 Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>
 794
 795 Imports the given input directory of TSV data into the specified table.
 796
 797 The column names of the TSV data must be specified using the -Dimporttsv.columns
 798 option. This option takes the form of comma-separated column names, where each
 799 column name is either a simple column family, or a columnfamily:qualifier. The special
 800 column name HBASE_ROW_KEY is used to designate that this column should be used
 801 as the row key for each imported record. You must specify exactly one column
 802 to be the row key, and you must specify a column name for every column that exists in the
 803 input data.
 804
 805 By default importtsv will load data directly into HBase. To instead generate
 806 HFiles of data to prepare for a bulk data load, pass the option:
 807   -Dimporttsv.bulk.output=/path/for/output
 808   Note: the target table will be created with default column family descriptors if it does not already exist.
 809
 810 Other options that may be specified with -D include:
 811   -Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
 812   '-Dimporttsv.separator=|' - eg separate on pipes instead of tabs
 813   -Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
 814   -Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of org.apache.hadoop.hbase.mapreduce.TsvImporterMapper
 815 ----
 816
 817 [[importtsv.example]]
 818 ==== ImportTsv Example
 819
 820 For example, assume that we are loading data into a table called 'datatsv' with a ColumnFamily called 'd' with two columns "c1" and "c2".
 821
 822 Assume that an input file exists as follows:
 823 ----
 824
 825 row1    c1      c2
 826 row2    c1      c2
 827 row3    c1      c2
 828 row4    c1      c2
 829 row5    c1      c2
 830 row6    c1      c2
 831 row7    c1      c2
 832 row8    c1      c2
 833 row9    c1      c2
 834 row10   c1      c2
 835 ----
 836
 837 For ImportTsv to use this input file, the command line needs to look like this:
 838
 839 ----
 840
 841  HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-mapreduce-VERSION.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,d:c1,d:c2 -Dimporttsv.bulk.output=hdfs://storefileoutput datatsv hdfs://inputfile
 842 ----
 843
 844 \... and in this example the first column is the rowkey, which is why the HBASE_ROW_KEY is used.
 845 The second and third columns in the file will be imported as "d:c1" and "d:c2", respectively.
 846
 847 [[importtsv.warning]]
 848 ==== ImportTsv Warning
 849
 850 If you have preparing a lot of data for bulk loading, make sure the target HBase table is pre-split appropriately.
 851
 852 [[importtsv.also]]
 853 ==== See Also
 854
 855 For more information about bulk-loading HFiles into HBase, see <<arch.bulk.load,arch.bulk.load>>
 856
 857 [[completebulkload]]
 858 === CompleteBulkLoad
 859
 860 The `completebulkload` utility will move generated StoreFiles into an HBase table.
 861 This utility is often used in conjunction with output from <<importtsv,importtsv>>.
 862
 863 There are two ways to invoke this utility, with explicit classname and via the driver:
 864
 865 .Explicit Classname
 866 ----
 867 $ bin/hbase org.apache.hadoop.hbase.tool.LoadIncrementalHFiles <hdfs://storefileoutput> <tablename>
 868 ----
 869
 870 .Driver
 871 ----
 872 HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-mapreduce-VERSION.jar completebulkload <hdfs://storefileoutput> <tablename>
 873 ----
 874
 875 [[completebulkload.warning]]
 876 ==== CompleteBulkLoad Warning
 877
 878 Data generated via MapReduce is often created with file permissions that are not compatible with the running HBase process.
 879 Assuming you're running HDFS with permissions enabled, those permissions will need to be updated before you run CompleteBulkLoad.
 880
 881 For more information about bulk-loading HFiles into HBase, see <<arch.bulk.load,arch.bulk.load>>.
 882
 883 [[walplayer]]
 884 === WALPlayer
 885
 886 WALPlayer is a utility to replay WAL files into HBase.
 887
 888 The WAL can be replayed for a set of tables or all tables, and a timerange can be provided (in milliseconds). The WAL is filtered to this set of tables.
 889 The output can optionally be mapped to another set of tables.
 890
 891 WALPlayer can also generate HFiles for later bulk importing, in that case only a single table and no mapping can be specified.
 892
 893 Finally, you can use WALPlayer to replay the content of a Regions `recovered.edits` directory (the files under
 894 `recovered.edits` directory have the same format as WAL files).
 895
 896 .WALPrettyPrinter
 897 [NOTE]
 898 ====
 899 To read or verify single WAL files or _recovered.edits_ files, since they share the WAL format,
 900 see <<_wal_tools>>.
 901 ====
 902
 903 Invoke via:
 904
 905 ----
 906 $ bin/hbase org.apache.hadoop.hbase.mapreduce.WALPlayer [options] <WAL inputdir> [<tables> <tableMappings>]>
 907 ----
 908
 909 For example:
 910
 911 ----
 912 $ bin/hbase org.apache.hadoop.hbase.mapreduce.WALPlayer /backuplogdir oldTable1,oldTable2 newTable1,newTable2
 913 ----
 914
 915 WALPlayer, by default, runs as a mapreduce job.
 916 To NOT run WALPlayer as a mapreduce job on your cluster, force it to run all in the local process by adding the flags `-Dmapreduce.jobtracker.address=local` on the command line.
 917
 918 [[walplayer.options]]
 919 ==== WALPlayer Options
 920
 921 Running `WALPlayer` with no arguments prints brief usage information:
 922
 923 ----
 924 Usage: WALPlayer [options] <WAL inputdir> [<tables> <tableMappings>]
 925  <WAL inputdir>   directory of WALs to replay.
 926  <tables>         comma separated list of tables. If no tables specified,
 927                   all are imported (even hbase:meta if present).
 928  <tableMappings>  WAL entries can be mapped to a new set of tables by passing
 929                   <tableMappings>, a comma separated list of target tables.
 930                   If specified, each table in <tables> must have a mapping.
 931 To generate HFiles to bulk load instead of loading HBase directly, pass:
 932  -Dwal.bulk.output=/path/for/output
 933  Only one table can be specified, and no mapping allowed!
 934 To specify a time range, pass:
 935  -Dwal.start.time=[date|ms]
 936  -Dwal.end.time=[date|ms]
 937  The start and the end date of timerange (inclusive). The dates can be
 938  expressed in milliseconds-since-epoch or yyyy-MM-dd'T'HH:mm:ss.SS format.
 939  E.g. 1234567890120 or 2009-02-13T23:32:30.12
 940 Other options:
 941  -Dmapreduce.job.name=jobName
 942  Use the specified mapreduce job name for the wal player
 943  -Dwal.input.separator=' '
 944  Change WAL filename separator (WAL dir names use default ','.)
 945 For performance also consider the following options:
 946   -Dmapreduce.map.speculative=false
 947   -Dmapreduce.reduce.speculative=false
 948 ----
 949
 950 [[rowcounter]]
 951 === RowCounter
 952
 953 link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html[RowCounter] is a mapreduce job to count all the rows of a table.
 954 This is a good utility to use as a sanity check to ensure that HBase can read all the blocks of a table if there are any concerns of metadata inconsistency.
 955 It will run the mapreduce all in a single process but it will run faster if you have a MapReduce cluster in place for it to exploit.
 956 It is possible to limit the time range of data to be scanned by using the `--starttime=[starttime]` and `--endtime=[endtime]` flags.
 957 The scanned data can be limited based on keys using the `--range=[startKey],[endKey][;[startKey],[endKey]...]` option.
 958
 959 ----
 960 $ bin/hbase rowcounter [options] <tablename> [--starttime=<start> --endtime=<end>] [--range=[startKey],[endKey][;[startKey],[endKey]...]] [<column1> <column2>...]
 961 ----
 962
 963 RowCounter only counts one version per cell.
 964
 965 For performance consider to use `-Dhbase.client.scanner.caching=100` and `-Dmapreduce.map.speculative=false` options.
 966
 967 [[cellcounter]]
 968 === CellCounter
 969
 970 HBase ships another diagnostic mapreduce job called link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/CellCounter.html[CellCounter].
 971 Like RowCounter, it gathers more fine-grained statistics about your table.
 972 The statistics gathered by CellCounter are more fine-grained and include:
 973
 974 * Total number of rows in the table.
 975 * Total number of CFs across all rows.
 976 * Total qualifiers across all rows.
 977 * Total occurrence of each CF.
 978 * Total occurrence of each qualifier.
 979 * Total number of versions of each qualifier.
 980
 981 The program allows you to limit the scope of the run.
 982 Provide a row regex or prefix to limit the rows to analyze.
 983 Specify a time range to scan the table by using the `--starttime=<starttime>` and `--endtime=<endtime>` flags.
 984
 985 Use `hbase.mapreduce.scan.column.family` to specify scanning a single column family.
 986
 987 ----
 988 $ bin/hbase cellcounter <tablename> <outputDir> [reportSeparator] [regex or prefix] [--starttime=<starttime> --endtime=<endtime>]
 989 ----
 990
 991 Note: just like RowCounter, caching for the input Scan is configured via `hbase.client.scanner.caching` in the job configuration.
 992
 993 === mlockall
 994
 995 It is possible to optionally pin your servers in physical memory making them less likely to be swapped out in oversubscribed environments by having the servers call link:http://linux.die.net/man/2/mlockall[mlockall] on startup.
 996 See link:https://issues.apache.org/jira/browse/HBASE-4391[HBASE-4391 Add ability to start RS as root and call mlockall] for how to build the optional library and have it run on startup.
 997
 998 [[compaction.tool]]
 999 === Offline Compaction Tool
1000
1001 *CompactionTool* provides a way of running compactions (either minor or major) as an independent
1002 process from the RegionServer. It reuses same internal implementation classes executed by RegionServer
1003 compaction feature. However, since this runs on a complete separate independent java process, it
1004 releases RegionServers from the overhead involved in rewrite a set of hfiles, which can be critical
1005 for latency sensitive use cases.
1006
1007 Usage:
1008 ----
1009 $ ./bin/hbase org.apache.hadoop.hbase.regionserver.CompactionTool
1010
1011 Usage: java org.apache.hadoop.hbase.regionserver.CompactionTool \
1012   [-compactOnce] [-major] [-mapred] [-D<property=value>]* files...
1013
1014 Options:
1015  mapred         Use MapReduce to run compaction.
1016  compactOnce    Execute just one compaction step. (default: while needed)
1017  major          Trigger major compaction.
1018
1019 Note: -D properties will be applied to the conf used.
1020 For example:
1021  To stop delete of compacted file, pass -Dhbase.compactiontool.delete=false
1022  To set tmp dir, pass -Dhbase.tmp.dir=ALTERNATE_DIR
1023
1024 Examples:
1025  To compact the full 'TestTable' using MapReduce:
1026  $ hbase org.apache.hadoop.hbase.regionserver.CompactionTool -mapred hdfs://hbase/data/default/TestTable
1027
1028  To compact column family 'x' of the table 'TestTable' region 'abc':
1029  $ hbase org.apache.hadoop.hbase.regionserver.CompactionTool hdfs://hbase/data/default/TestTable/abc/x
1030 ----
1031
1032 As shown by usage options above, *CompactionTool* can run as a standalone client or a mapreduce job.
1033 When running as mapreduce job, each family dir is handled as an input split, and is processed
1034 by a separate map task.
1035
1036 The *compactionOnce* parameter controls how many compaction cycles will be performed until
1037 *CompactionTool* program decides to finish its work. If omitted, it will assume it should keep
1038 running compactions on each specified family as determined by the given compaction policy
1039 configured. For more info on compaction policy, see <<compaction,compaction>>.
1040
1041 If a major compaction is desired, *major* flag can be specified. If omitted, *CompactionTool* will
1042 assume minor compaction is wanted by default.
1043
1044 It also allows for configuration overrides with `-D` flag. In the usage section above, for example,
1045 `-Dhbase.compactiontool.delete=false` option will instruct compaction engine to not delete original
1046 files from temp folder.
1047
1048 Files targeted for compaction must be specified as parent hdfs dirs. It allows for multiple dirs
1049 definition, as long as each for these dirs are either a *family*, a *region*, or a *table* dir. If a
1050 table or region dir is passed, the program will recursively iterate through related sub-folders,
1051 effectively running compaction for each family found below the table/region level.
1052
1053 Since these dirs are nested under *hbase* hdfs directory tree, *CompactionTool* requires hbase super
1054 user permissions in order to have access to required hfiles.
1055
1056 .Running in MapReduce mode
1057 [NOTE]
1058 ====
1059 MapReduce mode offers the ability to process each family dir in parallel, as a separate map task.
1060 Generally, it would make sense to run in this mode when specifying one or more table dirs as targets
1061 for compactions. The caveat, though, is that if number of families to be compacted become too large,
1062 the related mapreduce job may have indirect impacts on *RegionServers* performance .
1063 Since *NodeManagers* are normally co-located with RegionServers, such large jobs could
1064 compete for IO/Bandwidth resources with the *RegionServers*.
1065 ====
1066
1067 .MajorCompaction completely disabled on RegionServers due performance impacts
1068 [NOTE]
1069 ====
1070 *Major compactions* can be a costly operation (see <<compaction,compaction>>), and can indeed
1071 impact performance on RegionServers, leading operators to completely disable it for critical
1072 low latency application. *CompactionTool* could be used as an alternative in such scenarios,
1073 although, additional custom application logic would need to be implemented, such as deciding
1074 scheduling and selection of tables/regions/families target for a given compaction run.
1075 ====
1076
1077 For additional details about CompactionTool, see also
1078 link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/regionserver/CompactionTool.html[CompactionTool].
1079
1080 === `hbase clean`
1081
1082 The `hbase clean` command cleans HBase data from ZooKeeper, HDFS, or both.
1083 It is appropriate to use for testing.
1084 Run it with no options for usage instructions.
1085 The `hbase clean` command was introduced in HBase 0.98.
1086
1087 ----
1088
1089 $ bin/hbase clean
1090 Usage: hbase clean (--cleanZk|--cleanHdfs|--cleanAll)
1091 Options:
1092         --cleanZk   cleans hbase related data from zookeeper.
1093         --cleanHdfs cleans hbase related data from hdfs.
1094         --cleanAll  cleans hbase related data from both zookeeper and hdfs.
1095 ----
1096
1097 === `hbase pe`
1098
1099 The `hbase pe` command runs the PerformanceEvaluation tool, which is used for testing.
1100
1101 The PerformanceEvaluation tool accepts many different options and commands.
1102 For usage instructions, run the command with no options.
1103
1104 The PerformanceEvaluation tool has received many updates in recent HBase releases, including support for namespaces, support for tags, cell-level ACLs and visibility labels, multiget support for RPC calls, increased sampling sizes, an option to randomly sleep during testing, and ability to "warm up" the cluster before testing starts.
1105
1106 === `hbase ltt`
1107
1108 The `hbase ltt` command runs the LoadTestTool utility, which is used for testing.
1109
1110 You must specify either `-init_only` or at least one of `-write`, `-update`, or `-read`.
1111 For general usage instructions, pass the `-h` option.
1112
1113 The LoadTestTool has received many updates in recent HBase releases, including support for namespaces, support for tags, cell-level ACLS and visibility labels, testing security-related features, ability to specify the number of regions per server, tests for multi-get RPC calls, and tests relating to replication.
1114
1115 [[ops.pre-upgrade]]
1116 === Pre-Upgrade validator
1117 Pre-Upgrade validator tool can be used to check the cluster for known incompatibilities before upgrading from HBase 1 to HBase 2.
1118
1119 [source, bash]
1120 ----
1121 $ bin/hbase pre-upgrade command ...
1122 ----
1123
1124 ==== Coprocessor validation
1125
1126 HBase supports co-processors for a long time, but the co-processor API can be changed between major releases. Co-processor validator tries to determine
1127 whether the old co-processors are still compatible with the actual HBase version.
1128
1129 [source, bash]
1130 ----
1131 $ bin/hbase pre-upgrade validate-cp [-jar ...] [-class ... | -table ... | -config]
1132 Options:
1133  -e            Treat warnings as errors.
1134  -jar <arg>    Jar file/directory of the coprocessor.
1135  -table <arg>  Table coprocessor(s) to check.
1136  -class <arg>  Coprocessor class(es) to check.
1137  -config         Scan jar for observers.
1138 ----
1139
1140 The co-processor classes can be explicitly declared by `-class` option, or they can be obtained from HBase configuration by `-config` option.
1141 Table level co-processors can be also checked by `-table` option. The tool searches for co-processors on its classpath, but it can be extended
1142 by the `-jar` option. It is possible to test multiple classes with multiple `-class`, multiple tables with multiple `-table` options as well as
1143 adding multiple jars to the classpath with multiple `-jar` options.
1144
1145 The tool can report errors and warnings. Errors mean that HBase won't be able to load the coprocessor, because it is incompatible with the current version
1146 of HBase. Warnings mean that the co-processors can be loaded, but they won't work as expected. If `-e` option is given, then the tool will also fail
1147 for warnings.
1148
1149 Please note that this tool cannot validate every aspect of jar files, it just does some static checks.
1150
1151 For example:
1152
1153 [source, bash]
1154 ----
1155 $ bin/hbase pre-upgrade validate-cp -jar my-coprocessor.jar -class MyMasterObserver -class MyRegionObserver
1156 ----
1157
1158 It validates `MyMasterObserver` and `MyRegionObserver` classes which are located in `my-coprocessor.jar`.
1159
1160 [source, bash]
1161 ----
1162 $ bin/hbase pre-upgrade validate-cp -table .*
1163 ----
1164
1165 It validates every table level co-processors where the table name matches to `.*` regular expression.
1166
1167 ==== DataBlockEncoding validation
1168 HBase 2.0 removed `PREFIX_TREE` Data Block Encoding from column families. For further information
1169 please check <<upgrade2.0.prefix-tree.removed,_prefix-tree_ encoding removed>>.
1170 To verify that none of the column families are using incompatible Data Block Encodings in the cluster run the following command.
1171
1172 [source, bash]
1173 ----
1174 $ bin/hbase pre-upgrade validate-dbe
1175 ----
1176
1177 This check validates all column families and print out any incompatibilities. For example:
1178
1179 ----
1180 2018-07-13 09:58:32,028 WARN  [main] tool.DataBlockEncodingValidator: Incompatible DataBlockEncoding for table: t, cf: f, encoding: PREFIX_TREE
1181 ----
1182
1183 Which means that Data Block Encoding of table `t`, column family `f` is incompatible. To fix, use `alter` command in HBase shell:
1184
1185 ----
1186 alter 't', { NAME => 'f', DATA_BLOCK_ENCODING => 'FAST_DIFF' }
1187 ----
1188
1189 Please also validate HFiles, which is described in the next section.
1190
1191 ==== HFile Content validation
1192 Even though Data Block Encoding is changed from `PREFIX_TREE` it is still possible to have HFiles that contain data encoded that way.
1193 To verify that HFiles are readable with HBase 2 please use _HFile content validator_.
1194
1195 [source, bash]
1196 ----
1197 $ bin/hbase pre-upgrade validate-hfile
1198 ----
1199
1200 The tool will log the corrupt HFiles and details about the root cause.
1201 If the problem is about PREFIX_TREE encoding it is necessary to change encodings before upgrading to HBase 2.
1202
1203 The following log message shows an example of incorrect HFiles.
1204
1205 ----
1206 2018-06-05 16:20:46,976 WARN  [hfilevalidator-pool1-t3] hbck.HFileCorruptionChecker: Found corrupt HFile hdfs://example.com:8020/hbase/data/default/t/72ea7f7d625ee30f959897d1a3e2c350/prefix/7e6b3d73263c4851bf2b8590a9b3791e
1207 org.apache.hadoop.hbase.io.hfile.CorruptHFileException: Problem reading HFile Trailer from file hdfs://example.com:8020/hbase/data/default/t/72ea7f7d625ee30f959897d1a3e2c350/prefix/7e6b3d73263c4851bf2b8590a9b3791e
1208     ...
1209 Caused by: java.io.IOException: Invalid data block encoding type in file info: PREFIX_TREE
1210     ...
1211 Caused by: java.lang.IllegalArgumentException: No enum constant org.apache.hadoop.hbase.io.encoding.DataBlockEncoding.PREFIX_TREE
1212     ...
1213 2018-06-05 16:20:47,322 INFO  [main] tool.HFileContentValidator: Corrupted file: hdfs://example.com:8020/hbase/data/default/t/72ea7f7d625ee30f959897d1a3e2c350/prefix/7e6b3d73263c4851bf2b8590a9b3791e
1214 2018-06-05 16:20:47,383 INFO  [main] tool.HFileContentValidator: Corrupted file: hdfs://example.com:8020/hbase/archive/data/default/t/56be41796340b757eb7fff1eb5e2a905/f/29c641ae91c34fc3bee881f45436b6d1
1215 ----
1216
1217 ===== Fixing PREFIX_TREE errors
1218
1219 It's possible to get `PREFIX_TREE` errors after changing Data Block Encoding to a supported one. It can happen
1220 because there are some HFiles which still encoded with `PREFIX_TREE` or there are still some snapshots.
1221
1222 For fixing HFiles, please run a major compaction on the table (it was `default:t` according to the log message):
1223
1224 ----
1225 major_compact 't'
1226 ----
1227
1228 HFiles can be referenced from snapshots, too. It's the case when the HFile is located under `archive/data`.
1229 The first step is to determine which snapshot references that HFile (the name of the file was `29c641ae91c34fc3bee881f45436b6d1`
1230 according to the logs):
1231
1232 [source, bash]
1233 ----
1234 for snapshot in $(hbase snapshotinfo -list-snapshots 2> /dev/null | tail -n -1 | cut -f 1 -d \|);
1235 do
1236   echo "checking snapshot named '${snapshot}'";
1237   hbase snapshotinfo -snapshot "${snapshot}" -files 2> /dev/null | grep 29c641ae91c34fc3bee881f45436b6d1;
1238 done
1239 ----
1240
1241 The output of this shell script is:
1242
1243 ----
1244 checking snapshot named 't_snap'
1245    1.0 K t/56be41796340b757eb7fff1eb5e2a905/f/29c641ae91c34fc3bee881f45436b6d1 (archive)
1246 ----
1247
1248 Which means `t_snap` snapshot references the incompatible HFile. If the snapshot is still needed,
1249 then it has to be recreated with HBase shell:
1250
1251 ----
1252 # creating a new namespace for the cleanup process
1253 create_namespace 'pre_upgrade_cleanup'
1254
1255 # creating a new snapshot
1256 clone_snapshot 't_snap', 'pre_upgrade_cleanup:t'
1257 alter 'pre_upgrade_cleanup:t', { NAME => 'f', DATA_BLOCK_ENCODING => 'FAST_DIFF' }
1258 major_compact 'pre_upgrade_cleanup:t'
1259
1260 # removing the invalid snapshot
1261 delete_snapshot 't_snap'
1262
1263 # creating a new snapshot
1264 snapshot 'pre_upgrade_cleanup:t', 't_snap'
1265
1266 # removing temporary table
1267 disable 'pre_upgrade_cleanup:t'
1268 drop 'pre_upgrade_cleanup:t'
1269 drop_namespace 'pre_upgrade_cleanup'
1270 ----
1271
1272 For further information, please refer to
1273 link:https://issues.apache.org/jira/browse/HBASE-20649?focusedCommentId=16535476#comment-16535476[HBASE-20649].
1274
1275 === Data Block Encoding Tool
1276
1277 Tests various compression algorithms with different data block encoder for key compression on an existing HFile.
1278 Useful for testing, debugging and benchmarking.
1279
1280 You must specify `-f` which is the full path of the HFile.
1281
1282 The result shows both the performance (MB/s) of compression/decompression and encoding/decoding, and the data savings on the HFile.
1283
1284 ----
1285
1286 $ bin/hbase org.apache.hadoop.hbase.regionserver.DataBlockEncodingTool
1287 Usages: hbase org.apache.hadoop.hbase.regionserver.DataBlockEncodingTool
1288 Options:
1289         -f HFile to analyse (REQUIRED)
1290         -n Maximum number of key/value pairs to process in a single benchmark run.
1291         -b Whether to run a benchmark to measure read throughput.
1292         -c If this is specified, no correctness testing will be done.
1293         -a What kind of compression algorithm use for test. Default value: GZ.
1294         -t Number of times to run each benchmark. Default value: 12.
1295         -omit Number of first runs of every benchmark to omit from statistics. Default value: 2.
1296
1297 ----
1298
1299 [[ops.regionmgt]]
1300 == Region Management
1301
1302 [[ops.regionmgt.majorcompact]]
1303 === Major Compaction
1304
1305 Major compactions can be requested via the HBase shell or link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Admin.html#majorCompact-org.apache.hadoop.hbase.TableName-[Admin.majorCompact].
1306
1307 Note: major compactions do NOT do region merges.
1308 See <<compaction,compaction>> for more information about compactions.
1309
1310 [[ops.regionmgt.merge]]
1311 === Merge
1312
1313 Merge is a utility that can merge adjoining regions in the same table (see org.apache.hadoop.hbase.util.Merge).
1314
1315 [source,bourne]
1316 ----
1317 $ bin/hbase org.apache.hadoop.hbase.util.Merge <tablename> <region1> <region2>
1318 ----
1319
1320 If you feel you have too many regions and want to consolidate them, Merge is the utility you need.
1321 Merge must run be done when the cluster is down.
1322 See the link:https://web.archive.org/web/20111231002503/http://ofps.oreilly.com/titles/9781449396107/performance.html[O'Reilly HBase
1323           Book] for an example of usage.
1324
1325 You will need to pass 3 parameters to this application.
1326 The first one is the table name.
1327 The second one is the fully qualified name of the first region to merge, like "table_name,\x0A,1342956111995.7cef47f192318ba7ccc75b1bbf27a82b.". The third one is the fully qualified name for the second region to merge.
1328
1329 Additionally, there is a Ruby script attached to link:https://issues.apache.org/jira/browse/HBASE-1621[HBASE-1621] for region merging.
1330
1331 [[node.management]]
1332 == Node Management
1333
1334 [[decommission]]
1335 === Node Decommission
1336
1337 You can stop an individual RegionServer by running the following script in the HBase directory on the particular node:
1338
1339 ----
1340 $ ./bin/hbase-daemon.sh stop regionserver
1341 ----
1342
1343 The RegionServer will first close all regions and then shut itself down.
1344 On shutdown, the RegionServer's ephemeral node in ZooKeeper will expire.
1345 The master will notice the RegionServer gone and will treat it as a 'crashed' server; it will reassign the nodes the RegionServer was carrying.
1346
1347 .Disable the Load Balancer before Decommissioning a node
1348 [NOTE]
1349 ====
1350 If the load balancer runs while a node is shutting down, then there could be contention between the Load Balancer and the Master's recovery of the just decommissioned RegionServer.
1351 Avoid any problems by disabling the balancer first.
1352 See <<lb,lb>> below.
1353 ====
1354
1355 .Kill Node Tool
1356 [NOTE]
1357 ====
1358 In hbase-2.0, in the bin directory, we added a script named _considerAsDead.sh_ that can be used to kill a regionserver.
1359 Hardware issues could be detected by specialized monitoring tools before the zookeeper timeout has expired. _considerAsDead.sh_ is a simple function to mark a RegionServer as dead.
1360 It deletes all the znodes of the server, starting the recovery process.
1361 Plug in the script into your monitoring/fault detection tools to initiate faster failover.
1362 Be careful how you use this disruptive tool.
1363 Copy the script if you need to make use of it in a version of hbase previous to hbase-2.0.
1364 ====
1365
1366 A downside to the above stop of a RegionServer is that regions could be offline for a good period of time.
1367 Regions are closed in order.
1368 If many regions on the server, the first region to close may not be back online until all regions close and
1369 after the master notices the RegionServer's znode gone.  A node can be asked to gradually shed its load and
1370 then shutdown itself using the  _graceful_stop.sh_ script. Here is its usage:
1371
1372 ----
1373 $ ./bin/graceful_stop.sh
1374 Usage: graceful_stop.sh [--config <conf-dir>] [-e] [--restart [--reload]] [--thrift] [--rest] [-n |--noack] [--maxthreads <number of threads>] [--movetimeout <timeout in seconds>] [-nob |--nobalancer] [-d |--designatedfile <file path>] [-x |--excludefile <file path>] <hostname>
1375  thrift         If we should stop/start thrift before/after the hbase stop/start
1376  rest           If we should stop/start rest before/after the hbase stop/start
1377  restart        If we should restart after graceful stop
1378  reload         Move offloaded regions back on to the restarted server
1379  n|noack        Enable noAck mode in RegionMover. This is a best effort mode for moving regions
1380  maxthreads xx  Limit the number of threads used by the region mover. Default value is 1.
1381  movetimeout xx Timeout for moving regions. If regions are not moved by the timeout value,exit with error. Default value is INT_MAX.
1382  hostname       Hostname of server we are to stop
1383  e|failfast     Set -e so exit immediately if any command exits with non-zero status
1384  nob|nobalancer Do not manage balancer states. This is only used as optimization in rolling_restart.sh to avoid multiple calls to hbase shell
1385  d|designatedfile xx Designated file with <hostname:port> per line as unload targets
1386  x|excludefile xx Exclude file should have <hostname:port> per line. We do not unload regions to hostnames given in exclude file
1387 ----
1388
1389 To decommission a loaded RegionServer, run the following: +$
1390           ./bin/graceful_stop.sh HOSTNAME+ where `HOSTNAME` is the host carrying the RegionServer you would decommission.
1391
1392 .On `HOSTNAME`
1393 [NOTE]
1394 ====
1395 The `HOSTNAME` passed to _graceful_stop.sh_ must match the hostname that hbase is using to identify RegionServers.
1396 HBase uses fully-qualified domain names usually. Check the list of RegionServers in the master UI for how HBase
1397 is referring to servers. Whatever HBase is using, this is what you should pass the _graceful_stop.sh_ decommission script.
1398 If you pass IPs, the script is not yet smart enough to make a hostname (or FQDN) of it and so it will fail when it checks
1399 if server is currently running; the graceful unloading of regions will not run.
1400 ====
1401
1402 The _graceful_stop.sh_ script will move the regions off the decommissioned RegionServer one at a time to minimize region churn.
1403 It will verify the region deployed in the new location before it will moves the next region and so on until the decommissioned
1404 server is carrying zero regions. At this point, the _graceful_stop.sh_ tells the RegionServer `stop`.
1405 The master will at this point notice the RegionServer gone but all regions will have already been redeployed and because the
1406 RegionServer went down cleanly, there will be no WAL logs to split.
1407
1408 [[lb]]
1409 .Load Balancer
1410 [NOTE]
1411 ====
1412 It is assumed that the Region Load Balancer is disabled while the `graceful_stop` script runs (otherwise the balancer and the decommission script will end up fighting over region deployments). Use the shell to disable the balancer:
1413
1414 [source]
1415 ----
1416 hbase(main):001:0> balance_switch false
1417 true
1418 0 row(s) in 0.3590 seconds
1419 ----
1420
1421 This turns the balancer OFF.
1422 To reenable, do:
1423
1424 [source]
1425 ----
1426 hbase(main):001:0> balance_switch true
1427 false
1428 0 row(s) in 0.3590 seconds
1429 ----
1430
1431 The `graceful_stop` will check the balancer and if enabled, will turn it off before it goes to work.
1432 If it exits prematurely because of error, it will not have reset the balancer.
1433 Hence, it is better to manage the balancer apart from `graceful_stop` reenabling it after you are done w/ graceful_stop.
1434 ====
1435
1436 [[draining.servers]]
1437 ==== Decommissioning several Regions Servers concurrently
1438
1439 If you have a large cluster, you may want to decommission more than one machine at a time by gracefully stopping multiple RegionServers concurrently.
1440 To gracefully drain multiple regionservers at the same time, RegionServers can be put into a "draining" state.
1441 This is done by marking a RegionServer as a draining node by creating an entry in ZooKeeper under the _hbase_root/draining_ znode.
1442 This znode has format `name,port,startcode` just like the regionserver entries under _hbase_root/rs_ znode.
1443
1444 Without this facility, decommissioning multiple nodes may be non-optimal because regions that are being drained from one region server may be moved to other regionservers that are also draining.
1445 Marking RegionServers to be in the draining state prevents this from happening.
1446 See this link:http://inchoate-clatter.blogspot.com/2012/03/hbase-ops-automation.html[blog
1447             post] for more details.
1448
1449 [[bad.disk]]
1450 ==== Bad or Failing Disk
1451
1452 It is good having <<dfs.datanode.failed.volumes.tolerated,dfs.datanode.failed.volumes.tolerated>> set if you have a decent number of disks per machine for the case where a disk plain dies.
1453 But usually disks do the "John Wayne" -- i.e.
1454 take a while to go down spewing errors in _dmesg_ -- or for some reason, run much slower than their companions.
1455 In this case you want to decommission the disk.
1456 You have two options.
1457 You can link:https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDataNodeAdminGuide.html[decommission
1458             the datanode] or, less disruptive in that only the bad disks data will be rereplicated, can stop the datanode, unmount the bad volume (You can't umount a volume while the datanode is using it), and then restart the datanode (presuming you have set dfs.datanode.failed.volumes.tolerated > 0). The regionserver will throw some errors in its logs as it recalibrates where to get its data from -- it will likely roll its WAL log too -- but in general but for some latency spikes, it should keep on chugging.
1459
1460 .Short Circuit Reads
1461 [NOTE]
1462 ====
1463 If you are doing short-circuit reads, you will have to move the regions off the regionserver before you stop the datanode; when short-circuiting reading, though chmod'd so regionserver cannot have access, because it already has the files open, it will be able to keep reading the file blocks from the bad disk even though the datanode is down.
1464 Move the regions back after you restart the datanode.
1465 ====
1466
1467 [[rolling]]
1468 === Rolling Restart
1469
1470 Some cluster configuration changes require either the entire cluster, or the RegionServers, to be restarted in order to pick up the changes.
1471 In addition, rolling restarts are supported for upgrading to a minor or maintenance release, and to a major release if at all possible.
1472 See the release notes for release you want to upgrade to, to find out about limitations to the ability to perform a rolling upgrade.
1473
1474 There are multiple ways to restart your cluster nodes, depending on your situation.
1475 These methods are detailed below.
1476
1477 ==== Using the `rolling-restart.sh` Script
1478
1479 HBase ships with a script, _bin/rolling-restart.sh_, that allows you to perform rolling restarts on the entire cluster, the master only, or the RegionServers only.
1480 The script is provided as a template for your own script, and is not explicitly tested.
1481 It requires password-less SSH login to be configured and assumes that you have deployed using a tarball.
1482 The script requires you to set some environment variables before running it.
1483 Examine the script and modify it to suit your needs.
1484
1485 ._rolling-restart.sh_ General Usage
1486 ----
1487 $ ./bin/rolling-restart.sh --help
1488 Usage: rolling-restart.sh [--config <hbase-confdir>] [--rs-only] [--master-only] [--graceful] [--maxthreads xx]
1489 ----
1490
1491 Rolling Restart on RegionServers Only::
1492   To perform a rolling restart on the RegionServers only, use the `--rs-only` option.
1493   This might be necessary if you need to reboot the individual RegionServer or if you make a configuration change that only affects RegionServers and not the other HBase processes.
1494
1495 Rolling Restart on Masters Only::
1496   To perform a rolling restart on the active and backup Masters, use the `--master-only` option.
1497   You might use this if you know that your configuration change only affects the Master and not the RegionServers, or if you need to restart the server where the active Master is running.
1498
1499 Graceful Restart::
1500   If you specify the `--graceful` option, RegionServers are restarted using the _bin/graceful_stop.sh_ script, which moves regions off a RegionServer before restarting it.
1501   This is safer, but can delay the restart.
1502
1503 Limiting the Number of Threads::
1504   To limit the rolling restart to using only a specific number of threads, use the `--maxthreads` option.
1505
1506 [[rolling.restart.manual]]
1507 ==== Manual Rolling Restart
1508
1509 To retain more control over the process, you may wish to manually do a rolling restart across your cluster.
1510 This uses the `graceful-stop.sh` command <<decommission,decommission>>.
1511 In this method, you can restart each RegionServer individually and then move its old regions back into place, retaining locality.
1512 If you also need to restart the Master, you need to do it separately, and restart the Master before restarting the RegionServers using this method.
1513 The following is an example of such a command.
1514 You may need to tailor it to your environment.
1515 This script does a rolling restart of RegionServers only.
1516 It disables the load balancer before moving the regions.
1517
1518 ----
1519
1520 $ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i; done &> /tmp/log.txt &;
1521 ----
1522
1523 Monitor the output of the _/tmp/log.txt_ file to follow the progress of the script.
1524
1525 ==== Logic for Crafting Your Own Rolling Restart Script
1526
1527 Use the following guidelines if you want to create your own rolling restart script.
1528
1529 . Extract the new release, verify its configuration, and synchronize it to all nodes of your cluster using `rsync`, `scp`, or another secure synchronization mechanism.
1530
1531 . Restart the master first.
1532   You may need to modify these commands if your new HBase directory is different from the old one, such as for an upgrade.
1533 +
1534 ----
1535
1536 $ ./bin/hbase-daemon.sh stop master; ./bin/hbase-daemon.sh start master
1537 ----
1538
1539 . Gracefully restart each RegionServer, using a script such as the following, from the Master.
1540 +
1541 ----
1542
1543 $ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i; done &> /tmp/log.txt &
1544 ----
1545 +
1546 If you are running Thrift or REST servers, pass the --thrift or --rest options.
1547 For other available options, run the `bin/graceful-stop.sh --help`              command.
1548 +
1549 It is important to drain HBase regions slowly when restarting multiple RegionServers.
1550 Otherwise, multiple regions go offline simultaneously and must be reassigned to other nodes, which may also go offline soon.
1551 This can negatively affect performance.
1552 You can inject delays into the script above, for instance, by adding a Shell command such as `sleep`.
1553 To wait for 5 minutes between each RegionServer restart, modify the above script to the following:
1554 +
1555 ----
1556
1557 $ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i & sleep 5m; done &> /tmp/log.txt &
1558 ----
1559
1560 . Restart the Master again, to clear out the dead servers list and re-enable the load balancer.
1561
1562 [[adding.new.node]]
1563 === Adding a New Node
1564
1565 Adding a new regionserver in HBase is essentially free, you simply start it like this: `$ ./bin/hbase-daemon.sh start regionserver` and it will register itself with the master.
1566 Ideally you also started a DataNode on the same machine so that the RS can eventually start to have local files.
1567 If you rely on ssh to start your daemons, don't forget to add the new hostname in _conf/regionservers_ on the master.
1568
1569 At this point the region server isn't serving data because no regions have moved to it yet.
1570 If the balancer is enabled, it will start moving regions to the new RS.
1571 On a small/medium cluster this can have a very adverse effect on latency as a lot of regions will be offline at the same time.
1572 It is thus recommended to disable the balancer the same way it's done when decommissioning a node and move the regions manually (or even better, using a script that moves them one by one).
1573
1574 The moved regions will all have 0% locality and won't have any blocks in cache so the region server will have to use the network to serve requests.
1575 Apart from resulting in higher latency, it may also be able to use all of your network card's capacity.
1576 For practical purposes, consider that a standard 1GigE NIC won't be able to read much more than _100MB/s_.
1577 In this case, or if you are in a OLAP environment and require having locality, then it is recommended to major compact the moved regions.
1578
1579 [[hbase_metrics]]
1580 == HBase Metrics
1581
1582 HBase emits metrics which adhere to the link:https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Metrics.html[Hadoop Metrics] API.
1583 Starting with HBase 0.95footnote:[The Metrics system was redone in
1584           HBase 0.96. See Migration
1585             to the New Metrics Hotness – Metrics2 by Elliot Clark for detail], HBase is configured to emit a default set of metrics with a default sampling period of every 10 seconds.
1586 You can use HBase metrics in conjunction with Ganglia.
1587 You can also filter which metrics are emitted and extend the metrics framework to capture custom metrics appropriate for your environment.
1588
1589 === Metric Setup
1590
1591 For HBase 0.95 and newer, HBase ships with a default metrics configuration, or [firstterm]_sink_.
1592 This includes a wide variety of individual metrics, and emits them every 10 seconds by default.
1593 To configure metrics for a given region server, edit the _conf/hadoop-metrics2-hbase.properties_ file.
1594 Restart the region server for the changes to take effect.
1595
1596 To change the sampling rate for the default sink, edit the line beginning with `*.period`.
1597 To filter which metrics are emitted or to extend the metrics framework, see https://hadoop.apache.org/docs/current/api/org/apache/hadoop/metrics2/package-summary.html
1598
1599 .HBase Metrics and Ganglia
1600 [NOTE]
1601 ====
1602 By default, HBase emits a large number of metrics per region server.
1603 Ganglia may have difficulty processing all these metrics.
1604 Consider increasing the capacity of the Ganglia server or reducing the number of metrics emitted by HBase.
1605 See link:https://hadoop.apache.org/docs/current/api/org/apache/hadoop/metrics2/package-summary.html#filtering[Metrics Filtering].
1606 ====
1607
1608 === Disabling Metrics
1609
1610 To disable metrics for a region server, edit the _conf/hadoop-metrics2-hbase.properties_ file and comment out any uncommented lines.
1611 Restart the region server for the changes to take effect.
1612
1613 [[discovering.available.metrics]]
1614 === Discovering Available Metrics
1615
1616 Rather than listing each metric which HBase emits by default, you can browse through the available metrics, either as a JSON output or via JMX.
1617 Different metrics are exposed for the Master process and each region server process.
1618
1619 .Procedure: Access a JSON Output of Available Metrics
1620 . After starting HBase, access the region server's web UI, at pass:[http://REGIONSERVER_HOSTNAME:60030] by default (or port 16030 in HBase 1.0+).
1621 . Click the [label]#Metrics Dump# link near the top.
1622   The metrics for the region server are presented as a dump of the JMX bean in JSON format.
1623   This will dump out all metrics names and their values.
1624   To include metrics descriptions in the listing -- this can be useful when you are exploring what is available -- add a query string of `?description=true` so your URL becomes pass:[http://REGIONSERVER_HOSTNAME:60030/jmx?description=true].
1625   Not all beans and attributes have descriptions.
1626 . To view metrics for the Master, connect to the Master's web UI instead (defaults to pass:[http://localhost:60010] or port 16010 in HBase 1.0+) and click its [label]#Metrics
1627   Dump# link.
1628   To include metrics descriptions in the listing -- this can be useful when you are exploring what is available -- add a query string of `?description=true` so your URL becomes pass:[http://REGIONSERVER_HOSTNAME:60010/jmx?description=true].
1629   Not all beans and attributes have descriptions.
1630
1631
1632 You can use many different tools to view JMX content by browsing MBeans.
1633 This procedure uses `jvisualvm`, which is an application usually available in the JDK.
1634
1635 .Procedure: Browse the JMX Output of Available Metrics
1636 . Start HBase, if it is not already running.
1637 . Run the command `jvisualvm` command on a host with a GUI display.
1638   You can launch it from the command line or another method appropriate for your operating system.
1639 . Be sure the [label]#VisualVM-MBeans# plugin is installed. Browse to *Tools -> Plugins*. Click [label]#Installed# and check whether the plugin is listed.
1640   If not, click [label]#Available Plugins#, select it, and click btn:[Install].
1641   When finished, click btn:[Close].
1642 . To view details for a given HBase process, double-click the process in the [label]#Local# sub-tree in the left-hand panel.
1643   A detailed view opens in the right-hand panel.
1644   Click the [label]#MBeans# tab which appears as a tab in the top of the right-hand panel.
1645 . To access the HBase metrics, navigate to the appropriate sub-bean:
1646 .* Master:
1647 .* RegionServer:
1648
1649 . The name of each metric and its current value is displayed in the [label]#Attributes# tab.
1650   For a view which includes more details, including the description of each attribute, click the [label]#Metadata# tab.
1651
1652 === Units of Measure for Metrics
1653
1654 Different metrics are expressed in different units, as appropriate.
1655 Often, the unit of measure is in the name (as in the metric `shippedKBs`). Otherwise, use the following guidelines.
1656 When in doubt, you may need to examine the source for a given metric.
1657
1658 * Metrics that refer to a point in time are usually expressed as a timestamp.
1659 * Metrics that refer to an age (such as `ageOfLastShippedOp`) are usually expressed in milliseconds.
1660 * Metrics that refer to memory sizes are in bytes.
1661 * Sizes of queues (such as `sizeOfLogQueue`) are expressed as the number of items in the queue.
1662   Determine the size by multiplying by the block size (default is 64 MB in HDFS).
1663 * Metrics that refer to things like the number of a given type of operations (such as `logEditsRead`) are expressed as an integer.
1664
1665 [[master_metrics]]
1666 === Most Important Master Metrics
1667
1668 Note: Counts are usually over the last metrics reporting interval.
1669
1670 hbase.master.numRegionServers::
1671   Number of live regionservers
1672
1673 hbase.master.numDeadRegionServers::
1674   Number of dead regionservers
1675
1676 hbase.master.ritCount ::
1677   The number of regions in transition
1678
1679 hbase.master.ritCountOverThreshold::
1680   The number of regions that have been in transition longer than a threshold time (default: 60 seconds)
1681
1682 hbase.master.ritOldestAge::
1683   The age of the longest region in transition, in milliseconds
1684
1685 [[rs_metrics]]
1686 === Most Important RegionServer Metrics
1687
1688 Note: Counts are usually over the last metrics reporting interval.
1689
1690 hbase.regionserver.regionCount::
1691   The number of regions hosted by the regionserver
1692
1693 hbase.regionserver.storeFileCount::
1694   The number of store files on disk currently managed by the regionserver
1695
1696 hbase.regionserver.storeFileSize::
1697   Aggregate size of the store files on disk
1698
1699 hbase.regionserver.hlogFileCount::
1700   The number of write ahead logs not yet archived
1701
1702 hbase.regionserver.totalRequestCount::
1703   The total number of requests received
1704
1705 hbase.regionserver.readRequestCount::
1706   The number of read requests received
1707
1708 hbase.regionserver.writeRequestCount::
1709   The number of write requests received
1710
1711 hbase.regionserver.numOpenConnections::
1712   The number of open connections at the RPC layer
1713
1714 hbase.regionserver.numActiveHandler::
1715   The number of RPC handlers actively servicing requests
1716
1717 hbase.regionserver.numCallsInGeneralQueue::
1718   The number of currently enqueued user requests
1719
1720 hbase.regionserver.numCallsInReplicationQueue::
1721   The number of currently enqueued operations received from replication
1722
1723 hbase.regionserver.numCallsInPriorityQueue::
1724   The number of currently enqueued priority (internal housekeeping) requests
1725
1726 hbase.regionserver.flushQueueLength::
1727   Current depth of the memstore flush queue.
1728   If increasing, we are falling behind with clearing memstores out to HDFS.
1729
1730 hbase.regionserver.updatesBlockedTime::
1731   Number of milliseconds updates have been blocked so the memstore can be flushed
1732
1733 hbase.regionserver.compactionQueueLength::
1734   Current depth of the compaction request queue.
1735   If increasing, we are falling behind with storefile compaction.
1736
1737 hbase.regionserver.blockCacheHitCount::
1738   The number of block cache hits
1739
1740 hbase.regionserver.blockCacheMissCount::
1741   The number of block cache misses
1742
1743 hbase.regionserver.blockCacheExpressHitPercent ::
1744   The percent of the time that requests with the cache turned on hit the cache
1745
1746 hbase.regionserver.percentFilesLocal::
1747   Percent of store file data that can be read from the local DataNode, 0-100
1748
1749 hbase.regionserver.<op>_<measure>::
1750   Operation latencies, where <op> is one of Append, Delete, Mutate, Get, Replay, Increment; and where <measure> is one of min, max, mean, median, 75th_percentile, 95th_percentile, 99th_percentile
1751
1752 hbase.regionserver.slow<op>Count ::
1753   The number of operations we thought were slow, where <op> is one of the list above
1754
1755 hbase.regionserver.GcTimeMillis::
1756   Time spent in garbage collection, in milliseconds
1757
1758 hbase.regionserver.GcTimeMillisParNew::
1759   Time spent in garbage collection of the young generation, in milliseconds
1760
1761 hbase.regionserver.GcTimeMillisConcurrentMarkSweep::
1762   Time spent in garbage collection of the old generation, in milliseconds
1763
1764 hbase.regionserver.authenticationSuccesses::
1765   Number of client connections where authentication succeeded
1766
1767 hbase.regionserver.authenticationFailures::
1768   Number of client connection authentication failures
1769
1770 hbase.regionserver.mutationsWithoutWALCount ::
1771   Count of writes submitted with a flag indicating they should bypass the write ahead log
1772
1773 [[rs_meta_metrics]]
1774 === Meta Table Load Metrics
1775
1776 HBase meta table metrics collection feature is available in HBase 1.4+ but it is disabled by default, as it can
1777 affect the performance of the cluster. When it is enabled, it helps to monitor client access patterns by collecting
1778 the following statistics:
1779
1780 * number of get, put and delete operations on the `hbase:meta` table
1781 * number of get, put and delete operations made by the top-N clients
1782 * number of operations related to each table
1783 * number of operations related to the top-N regions
1784
1785
1786 When to use the feature::
1787   This feature can help to identify hot spots in the meta table by showing the regions or tables where the meta info is
1788   modified (e.g. by create, drop, split or move tables) or retrieved most frequently. It can also help to find misbehaving
1789   client applications by showing which clients are using the meta table most heavily, which can for example suggest the
1790   lack of meta table buffering or the lack of re-using open client connections in the client application.
1791
1792 .Possible side-effects of enabling this feature
1793 [WARNING]
1794 ====
1795 Having large number of clients and regions in the cluster can cause the registration and tracking of a large amount of
1796 metrics, which can increase the memory and CPU footprint of the HBase region server handling the `hbase:meta` table.
1797 It can also cause the significant increase of the JMX dump size, which can affect the monitoring or log aggregation
1798 system you use beside HBase. It is recommended to turn on this feature only during debugging.
1799 ====
1800
1801 Where to find the metrics in JMX::
1802   Each metric attribute name will start with the ‘MetaTable_’ prefix. For all the metrics you will see five different
1803   JMX attributes: count, mean rate, 1 minute rate, 5 minute rate and 15 minute rate. You will find these metrics in JMX
1804   under the following MBean:
1805   `Hadoop -> HBase -> RegionServer -> Coprocessor.Region.CP_org.apache.hadoop.hbase.coprocessor.MetaTableMetrics`.
1806
1807 .Examples: some Meta Table metrics you can see in your JMX dump
1808 [source,json]
1809 ----
1810 {
1811   "MetaTable_get_request_count": 77309,
1812   "MetaTable_put_request_mean_rate": 0.06339092997186495,
1813   "MetaTable_table_MyTestTable_request_15min_rate": 1.1020599841623246,
1814   "MetaTable_client_/172.30.65.42_lossy_request_count": 1786
1815   "MetaTable_client_/172.30.65.45_put_request_5min_rate": 0.6189810954855728,
1816   "MetaTable_region_1561131112259.c66e4308d492936179352c80432ccfe0._lossy_request_count": 38342,
1817   "MetaTable_region_1561131043640.5bdffe4b9e7e334172065c853cf0caa6._lossy_request_1min_rate": 0.04925099917433935,
1818 }
1819 ----
1820
1821 Configuration::
1822   To turn on this feature, you have to enable a custom coprocessor by adding the following section to hbase-site.xml.
1823   This coprocessor will run on all the HBase RegionServers, but will be active (i.e. consume memory / CPU) only on
1824   the server, where the `hbase:meta` table is located. It will produce JMX metrics which can be downloaded from the
1825   web UI of the given RegionServer or by a simple REST call. These metrics will not be present in the JMX dump of the
1826   other RegionServers.
1827
1828 .Enabling the Meta Table Metrics feature
1829 [source,xml]
1830 ----
1831 <property>
1832     <name>hbase.coprocessor.region.classes</name>
1833     <value>org.apache.hadoop.hbase.coprocessor.MetaTableMetrics</value>
1834 </property>
1835 ----
1836
1837 .How the top-N metrics are calculated?
1838 [NOTE]
1839 ====
1840 The 'top-N' type of metrics will be counted using the Lossy Counting Algorithm (as defined in
1841 link:http://www.vldb.org/conf/2002/S10P03.pdf[Motwani, R; Manku, G.S (2002). "Approximate frequency counts over data streams"]),
1842 which is designed to identify elements in a data stream whose frequency count exceed a user-given threshold.
1843 The frequency computed by this algorithm is not always accurate but has an error threshold that can be specified by the
1844 user as a configuration parameter. The run time space required by the algorithm is inversely proportional to the
1845 specified error threshold, hence larger the error parameter, the smaller the footprint and the less accurate are the
1846 metrics.
1847
1848 You can specify the error rate of the algorithm as a floating-point value between 0 and 1 (exclusive), it's default
1849 value is 0.02. Having the error rate set to `E` and having `N` as the total number of meta table operations, then
1850 (assuming the uniform distribution of the activity of low frequency elements) at most `7 / E` meters will be kept and
1851 each kept element will have a frequency higher than `E * N`.
1852
1853 An example: Let’s assume we are interested in the HBase clients that are most active in accessing the meta table.
1854 When there was 1,000,000 operations on the meta table so far and the error rate parameter is set to 0.02, then we can
1855 assume that only at most 350 client IP address related counters will be present in JMX and each of these clients
1856 accessed the meta table at least 20,000 times.
1857
1858 [source,xml]
1859 ----
1860 <property>
1861     <name>hbase.util.default.lossycounting.errorrate</name>
1862     <value>0.02</value>
1863 </property>
1864 ----
1865 ====
1866
1867 [[ops.monitoring]]
1868 == HBase Monitoring
1869
1870 [[ops.monitoring.overview]]
1871 === Overview
1872
1873 The following metrics are arguably the most important to monitor for each RegionServer for "macro monitoring", preferably with a system like link:http://opentsdb.net/[OpenTSDB].
1874 If your cluster is having performance issues it's likely that you'll see something unusual with this group.
1875
1876 HBase::
1877   * See <<rs_metrics,rs metrics>>
1878
1879 OS::
1880   * IO Wait
1881   * User CPU
1882
1883 Java::
1884   * GC
1885
1886 For more information on HBase metrics, see <<hbase_metrics,hbase metrics>>.
1887
1888 [[ops.slow.query]]
1889 === Slow Query Log
1890
1891 The HBase slow query log consists of parseable JSON structures describing the properties of those client operations (Gets, Puts, Deletes, etc.) that either took too long to run, or produced too much output.
1892 The thresholds for "too long to run" and "too much output" are configurable, as described below.
1893 The output is produced inline in the main region server logs so that it is easy to discover further details from context with other logged events.
1894 It is also prepended with identifying tags `(responseTooSlow)`, `(responseTooLarge)`, `(operationTooSlow)`, and `(operationTooLarge)` in order to enable easy filtering with grep, in case the user desires to see only slow queries.
1895
1896 ==== Configuration
1897
1898 There are two configuration knobs that can be used to adjust the thresholds for when queries are logged.
1899
1900 * `hbase.ipc.warn.response.time` Maximum number of milliseconds that a query can be run without being logged.
1901   Defaults to 10000, or 10 seconds.
1902   Can be set to -1 to disable logging by time.
1903 * `hbase.ipc.warn.response.size` Maximum byte size of response that a query can return without being logged.
1904   Defaults to 100 megabytes.
1905   Can be set to -1 to disable logging by size.
1906
1907 ==== Metrics
1908
1909 The slow query log exposes to metrics to JMX.
1910
1911 * `hadoop.regionserver_rpc_slowResponse` a global metric reflecting the durations of all responses that triggered logging.
1912 * `hadoop.regionserver_rpc_methodName.aboveOneSec` A metric reflecting the durations of all responses that lasted for more than one second.
1913
1914 ==== Output
1915
1916 The output is tagged with operation e.g. `(operationTooSlow)` if the call was a client operation, such as a Put, Get, or Delete, which we expose detailed fingerprint information for.
1917 If not, it is tagged `(responseTooSlow)`          and still produces parseable JSON output, but with less verbose information solely regarding its duration and size in the RPC itself. `TooLarge` is substituted for `TooSlow` if the response size triggered the logging, with `TooLarge` appearing even in the case that both size and duration triggered logging.
1918
1919 ==== Example
1920
1921
1922 [source]
1923 ----
1924 2011-09-08 10:01:25,824 WARN org.apache.hadoop.ipc.HBaseServer: (operationTooSlow): {"tables":{"riley2":{"puts":[{"totalColumns":11,"families":{"actions":[{"timestamp":1315501284459,"qualifier":"0","vlen":9667580},{"timestamp":1315501284459,"qualifier":"1","vlen":10122412},{"timestamp":1315501284459,"qualifier":"2","vlen":11104617},{"timestamp":1315501284459,"qualifier":"3","vlen":13430635}]},"row":"cfcd208495d565ef66e7dff9f98764da:0"}],"families":["actions"]}},"processingtimems":956,"client":"10.47.34.63:33623","starttimems":1315501284456,"queuetimems":0,"totalPuts":1,"class":"HRegionServer","responsesize":0,"method":"multiPut"}
1925 ----
1926
1927 Note that everything inside the "tables" structure is output produced by MultiPut's fingerprint, while the rest of the information is RPC-specific, such as processing time and client IP/port.
1928 Other client operations follow the same pattern and the same general structure, with necessary differences due to the nature of the individual operations.
1929 In the case that the call is not a client operation, that detailed fingerprint information will be completely absent.
1930
1931 This particular example, for example, would indicate that the likely cause of slowness is simply a very large (on the order of 100MB) multiput, as we can tell by the "vlen," or value length, fields of each put in the multiPut.
1932
1933 [[slow_log_responses]]
1934 ==== Get Slow Response Log from shell
1935 When an individual RPC exceeds a configurable time bound we log a complaint
1936 by way of the logging subsystem
1937
1938 e.g.
1939
1940 ----
1941 2019-10-02 10:10:22,195 WARN [,queue=15,port=60020] ipc.RpcServer - (responseTooSlow):
1942 {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
1943 "starttimems":1567203007549,
1944 "responsesize":6819737,
1945 "method":"Scan",
1946 "param":"region { type: REGION_NAME value: \"t1,\\000\\000\\215\\f)o\\\\\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000<TRUNCATED>",
1947 "processingtimems":28646,
1948 "client":"10.253.196.215:41116",
1949 "queuetimems":22453,
1950 "class":"HRegionServer"}
1951 ----
1952
1953 Unfortunately often the request parameters are truncated as per above Example.
1954 The truncation is unfortunate because it eliminates much of the utility of
1955 the warnings. For example, the region name, the start and end keys, and the
1956 filter hierarchy are all important clues for debugging performance problems
1957 caused by moderate to low selectivity queries or queries made at a high rate.
1958
1959 HBASE-22978 introduces maintaining an in-memory ring buffer of requests that were judged to
1960 be too slow in addition to the responseTooSlow logging. The in-memory representation can be
1961 complete. There is some chance a high rate of requests will cause information on other
1962 interesting requests to be overwritten before it can be read. This is an acceptable trade off.
1963
1964 In order to enable the in-memory ring buffer at RegionServers, we need to enable
1965 config:
1966 ----
1967 hbase.regionserver.slowlog.buffer.enabled
1968 ----
1969
1970 One more config determines the size of the ring buffer:
1971 ----
1972 hbase.regionserver.slowlog.ringbuffer.size
1973 ----
1974
1975 Check the config section for the detailed description.
1976
1977 This config would be disabled by default. Turn it on and these shell commands
1978 would provide expected results from the ring-buffers.
1979
1980
1981 shell commands to retrieve slowlog responses from RegionServers:
1982
1983 ----
1984 Retrieve latest SlowLog Responses maintained by each or specific RegionServers.
1985 Specify '*' to include all RS otherwise array of server names for specific
1986 RS. A server name is the host, port plus startcode of a RegionServer.
1987 e.g.: host187.example.com,60020,1289493121758 (find servername in
1988 master ui or when you do detailed status in shell)
1989
1990 Provide optional filter parameters as Hash.
1991 Default Limit of each server for providing no of slow log records is 10. User can specify
1992 more limit by 'LIMIT' param in case more than 10 records should be retrieved.
1993
1994 Examples:
1995
1996   hbase> get_slowlog_responses '*'                                 => get slowlog responses from all RS
1997   hbase> get_slowlog_responses '*', {'LIMIT' => 50}                => get slowlog responses from all RS
1998                                                                       with 50 records limit (default limit: 10)
1999   hbase> get_slowlog_responses ['SERVER_NAME1', 'SERVER_NAME2']    => get slowlog responses from SERVER_NAME1,
2000                                                                       SERVER_NAME2
2001   hbase> get_slowlog_responses '*', {'REGION_NAME' => 'hbase:meta,,1'}
2002                                                                    => get slowlog responses only related to meta
2003                                                                       region
2004   hbase> get_slowlog_responses '*', {'TABLE_NAME' => 't1'}         => get slowlog responses only related to t1 table
2005   hbase> get_slowlog_responses '*', {'CLIENT_IP' => '192.162.1.40:60225', 'LIMIT' => 100}
2006                                                                    => get slowlog responses with given client
2007                                                                       IP address and get 100 records limit
2008                                                                       (default limit: 10)
2009   hbase> get_slowlog_responses '*', {'REGION_NAME' => 'hbase:meta,,1', 'TABLE_NAME' => 't1'}
2010                                                                    => get slowlog responses with given region name
2011                                                                       or table name
2012   hbase> get_slowlog_responses '*', {'USER' => 'user_name', 'CLIENT_IP' => '192.162.1.40:60225'}
2013                                                                    => get slowlog responses that match either
2014                                                                       provided client IP address or user name
2015
2016
2017 ----
2018
2019 All of above queries with filters have default OR operation applied i.e. all
2020 records with any of the provided filters applied will be returned. However,
2021 we can also apply AND operator i.e. all records that match all (not any) of
2022 the provided filters should be returned.
2023
2024 ----
2025   hbase> get_slowlog_responses '*', {'REGION_NAME' => 'hbase:meta,,1', 'TABLE_NAME' => 't1', 'FILTER_BY_OP' => 'AND'}
2026                                                                    => get slowlog responses with given region name
2027                                                                       and table name, both should match
2028
2029   hbase> get_slowlog_responses '*', {'REGION_NAME' => 'hbase:meta,,1', 'TABLE_NAME' => 't1', 'FILTER_BY_OP' => 'OR'}
2030                                                                    => get slowlog responses with given region name
2031                                                                       or table name, any one can match
2032
2033   hbase> get_slowlog_responses '*', {'TABLE_NAME' => 't1', 'CLIENT_IP' => '192.163.41.53:52781', 'FILTER_BY_OP' => 'AND'}
2034                                                                    => get slowlog responses with given region name
2035                                                                       and client IP address, both should match
2036
2037 ----
2038 Since OR is the default filter operator, without providing 'FILTER_BY_OP', query will have
2039 same result as providing 'FILTER_BY_OP' => 'OR'.
2040
2041
2042 Sometimes output can be long pretty printed json for user to scroll in
2043 a single screen and hence user might prefer
2044 redirecting output of get_slowlog_responses to a file.
2045
2046 Example:
2047 ----
2048 echo "get_slowlog_responses '*'" | hbase shell > xyz.out 2>&1
2049 ----
2050
2051 Similar to slow RPC logs, client can also retrieve large RPC logs.
2052 Sometimes, slow logs important to debug perf issues turn out to be
2053 larger in size.
2054
2055 ----
2056
2057   hbase> get_largelog_responses '*'                                 => get largelog responses from all RS
2058   hbase> get_largelog_responses '*', {'LIMIT' => 50}                => get largelog responses from all RS
2059                                                                        with 50 records limit (default limit: 10)
2060   hbase> get_largelog_responses ['SERVER_NAME1', 'SERVER_NAME2']    => get largelog responses from SERVER_NAME1,
2061                                                                        SERVER_NAME2
2062   hbase> get_largelog_responses '*', {'REGION_NAME' => 'hbase:meta,,1'}
2063                                                                     => get largelog responses only related to meta
2064                                                                        region
2065   hbase> get_largelog_responses '*', {'TABLE_NAME' => 't1'}         => get largelog responses only related to t1 table
2066   hbase> get_largelog_responses '*', {'CLIENT_IP' => '192.162.1.40:60225', 'LIMIT' => 100}
2067                                                                     => get largelog responses with given client
2068                                                                        IP address and get 100 records limit
2069                                                                        (default limit: 10)
2070   hbase> get_largelog_responses '*', {'REGION_NAME' => 'hbase:meta,,1', 'TABLE_NAME' => 't1'}
2071                                                                     => get largelog responses with given region name
2072                                                                        or table name
2073   hbase> get_largelog_responses '*', {'USER' => 'user_name', 'CLIENT_IP' => '192.162.1.40:60225'}
2074                                                                     => get largelog responses that match either
2075                                                                        provided client IP address or user name
2076
2077   hbase> get_largelog_responses '*', {'REGION_NAME' => 'hbase:meta,,1', 'TABLE_NAME' => 't1', 'FILTER_BY_OP' => 'AND'}
2078                                                                    => get largelog responses with given region name
2079                                                                       and table name, both should match
2080
2081   hbase> get_largelog_responses '*', {'REGION_NAME' => 'hbase:meta,,1', 'TABLE_NAME' => 't1', 'FILTER_BY_OP' => 'OR'}
2082                                                                    => get largelog responses with given region name
2083                                                                       or table name, any one can match
2084
2085   hbase> get_largelog_responses '*', {'TABLE_NAME' => 't1', 'CLIENT_IP' => '192.163.41.53:52781', 'FILTER_BY_OP' => 'AND'}
2086                                                                    => get largelog responses with given region name
2087                                                                       and client IP address, both should match
2088
2089 ----
2090
2091
2092 shell command to clear slow/largelog responses from RegionServer:
2093
2094 ----
2095 Clears SlowLog Responses maintained by each or specific RegionServers.
2096 Specify array of server names for specific RS. A server name is
2097 the host, port plus startcode of a RegionServer.
2098 e.g.: host187.example.com,60020,1289493121758 (find servername in
2099 master ui or when you do detailed status in shell)
2100
2101 Examples:
2102
2103   hbase> clear_slowlog_responses                                     => clears slowlog responses from all RS
2104   hbase> clear_slowlog_responses ['SERVER_NAME1', 'SERVER_NAME2']    => clears slowlog responses from SERVER_NAME1,
2105                                                                         SERVER_NAME2
2106
2107
2108 ----
2109
2110 include::slow_log_responses_from_systable.adoc[]
2111
2112
2113 === Block Cache Monitoring
2114
2115 Starting with HBase 0.98, the HBase Web UI includes the ability to monitor and report on the performance of the block cache.
2116 To view the block cache reports, see the Block Cache section of the region server UI.
2117 Following are a few examples of the reporting capabilities.
2118
2119 .Basic Info shows the cache implementation.
2120 image::bc_basic.png[]
2121
2122 .Config shows all cache configuration options.
2123 image::bc_config.png[]
2124
2125 .Stats shows statistics about the performance of the cache.
2126 image::bc_stats.png[]
2127
2128 .L1 and L2 show information about the L1 and L2 caches.
2129 image::bc_l1.png[]
2130
2131 This is not an exhaustive list of all the screens and reports available.
2132 Have a look in the Web UI.
2133
2134 === Snapshot Space Usage Monitoring
2135
2136 Starting with HBase 0.95, Snapshot usage information on individual snapshots was shown in the HBase Master Web UI. This was further enhanced starting with HBase 1.3 to show the total Storefile size of the Snapshot Set. The following metrics are shown in the Master Web UI with HBase 1.3 and later.
2137
2138 * Shared Storefile Size is the Storefile size shared between snapshots and active tables.
2139 * Mob Storefile Size is the Mob Storefile size shared between snapshots and active tables.
2140 * Archived Storefile Size is the Storefile size in Archive.
2141
2142 The format of Archived Storefile Size is NNN(MMM). NNN is the total Storefile size in Archive, MMM is the total Storefile size in Archive that is specific to the snapshot (not shared with other snapshots and tables).
2143
2144 .Master Snapshot Overview
2145 image::master-snapshot.png[]
2146
2147 .Snapshot Storefile Stats Example 1
2148 image::1-snapshot.png[]
2149
2150 .Snapshot Storefile Stats Example 2
2151 image::2-snapshots.png[]
2152
2153 .Empty Snapshot Storfile Stats Example
2154 image::empty-snapshots.png[]
2155
2156 == Cluster Replication
2157
2158 NOTE: This information was previously available at
2159 link:https://hbase.apache.org/0.94/replication.html[Cluster Replication].
2160
2161 HBase provides a cluster replication mechanism which allows you to keep one cluster's state synchronized with that of another cluster, using the write-ahead log (WAL) of the source cluster to propagate the changes.
2162 Some use cases for cluster replication include:
2163
2164 * Backup and disaster recovery
2165 * Data aggregation
2166 * Geographic data distribution
2167 * Online data ingestion combined with offline data analytics
2168
2169 NOTE: Replication is enabled at the granularity of the column family.
2170 Before enabling replication for a column family, create the table and all column families to be replicated, on the destination cluster.
2171
2172 NOTE: Replication is asynchronous as we send WAL to another cluster in background, which means that when you want to do recovery through replication, you could loss some data. To address this problem, we have introduced a new feature called synchronous replication. As the mechanism is a bit different so we use a separated section to describe it. Please see
2173 <<Synchronous Replication,Synchronous Replication>>.
2174
2175 NOTE: At present, there is compatibility problem if Replication and WAL Compression are used together. If you need to use Replication, it is recommended to set the `hbase.regionserver.wal.enablecompression` property to `false`. See (https://issues.apache.org/jira/browse/HBASE-26849[HBASE-26849]) for details.
2176
2177 === Replication Overview
2178
2179 Cluster replication uses a source-push methodology.
2180 An HBase cluster can be a source (also called master or active, meaning that it is the originator of new data), a destination (also called slave or passive, meaning that it receives data via replication), or can fulfill both roles at once.
2181 Replication is asynchronous, and the goal of replication is eventual consistency.
2182 When the source receives an edit to a column family with replication enabled, that edit is propagated to all destination clusters using the WAL for that for that column family on the RegionServer managing the relevant region.
2183
2184 When data is replicated from one cluster to another, the original source of the data is tracked via a cluster ID which is part of the metadata.
2185 In HBase 0.96 and newer (link:https://issues.apache.org/jira/browse/HBASE-7709[HBASE-7709]), all clusters which have already consumed the data are also tracked.
2186 This prevents replication loops.
2187
2188 The WALs for each region server must be kept in HDFS as long as they are needed to replicate data to any slave cluster.
2189 Each region server reads from the oldest log it needs to replicate and keeps track of its progress processing WALs inside ZooKeeper to simplify failure recovery.
2190 The position marker which indicates a slave cluster's progress, as well as the queue of WALs to process, may be different for every slave cluster.
2191
2192 The clusters participating in replication can be of different sizes.
2193 The master cluster relies on randomization to attempt to balance the stream of replication on the slave clusters.
2194 It is expected that the slave cluster has storage capacity to hold the replicated data, as well as any data it is responsible for ingesting.
2195 If a slave cluster does run out of room, or is inaccessible for other reasons, it throws an error and the master retains the WAL and retries the replication at intervals.
2196
2197 .Consistency Across Replicated Clusters
2198 [WARNING]
2199 ====
2200 How your application builds on top of the HBase API matters when replication is in play. HBase's replication system provides at-least-once delivery of client edits for an enabled column family to each configured destination cluster. In the event of failure to reach a given destination, the replication system will retry sending edits in a way that might repeat a given message. HBase provides two ways of replication, one is the original replication and the other is serial replication. In the previous way of replication, there is not a guaranteed order of delivery for client edits. In the event of a RegionServer failing, recovery of the replication queue happens independent of recovery of the individual regions that server was previously handling. This means that it is possible for the not-yet-replicated edits to be serviced by a RegionServer that is currently slower to replicate than the one that handles edits from after the failure.
2201
2202 The combination of these two properties (at-least-once delivery and the lack of message ordering) means that some destination clusters may end up in a different state if your application makes use of operations that are not idempotent, e.g. Increments.
2203
2204 To solve the problem, HBase now supports serial replication, which sends edits to destination cluster as the order of requests from client. See <<Serial Replication,Serial Replication>>.
2205
2206 ====
2207
2208 .Terminology Changes
2209 [NOTE]
2210 ====
2211 Previously, terms such as [firstterm]_master-master_, [firstterm]_master-slave_, and [firstterm]_cyclical_ were used to describe replication relationships in HBase.
2212 These terms added confusion, and have been abandoned in favor of discussions about cluster topologies appropriate for different scenarios.
2213 ====
2214
2215 .Cluster Topologies
2216 * A central source cluster might propagate changes out to multiple destination clusters, for failover or due to geographic distribution.
2217 * A source cluster might push changes to a destination cluster, which might also push its own changes back to the original cluster.
2218 * Many different low-latency clusters might push changes to one centralized cluster for backup or resource-intensive data analytics jobs.
2219   The processed data might then be replicated back to the low-latency clusters.
2220
2221 Multiple levels of replication may be chained together to suit your organization's needs.
2222 The following diagram shows a hypothetical scenario.
2223 Use the arrows to follow the data paths.
2224
2225 .Example of a Complex Cluster Replication Configuration
2226 image::hbase_replication_diagram.jpg[]
2227
2228 HBase replication borrows many concepts from the [firstterm]_statement-based replication_ design used by MySQL.
2229 Instead of SQL statements, entire WALEdits (consisting of multiple cell inserts coming from Put and Delete operations on the clients) are replicated in order to maintain atomicity.
2230
2231 [[hbase.replication.management]]
2232 === Managing and Configuring Cluster Replication
2233 .Cluster Configuration Overview
2234
2235 . Configure and start the source and destination clusters.
2236   Create tables with the same names and column families on both the source and destination clusters, so that the destination cluster knows where to store data it will receive.
2237 . All hosts in the source and destination clusters should be reachable to each other.
2238 . If both clusters use the same ZooKeeper cluster, you must use a different `zookeeper.znode.parent`, because they cannot write in the same folder.
2239 . On the source cluster, in HBase Shell, add the destination cluster as a peer, using the `add_peer` command.
2240 . On the source cluster, in HBase Shell, enable the table replication, using the `enable_table_replication` command.
2241 . Check the logs to see if replication is taking place. If so, you will see messages like the following, coming from the ReplicationSource.
2242 ----
2243 LOG.info("Replicating "+clusterId + " -> " + peerClusterId);
2244 ----
2245
2246 .Serial Replication Configuration
2247 See <<Serial Replication,Serial Replication>>
2248
2249 .Cluster Management Commands
2250 add_peer <ID> <CLUSTER_KEY>::
2251   Adds a replication relationship between two clusters. +
2252   * ID -- a unique string, which must not contain a hyphen.
2253   * CLUSTER_KEY: composed using the following template, with appropriate place-holders: `hbase.zookeeper.quorum:hbase.zookeeper.property.clientPort:zookeeper.znode.parent`. This value can be found on the Master UI info page.
2254   * STATE(optional): ENABLED or DISABLED, default value is ENABLED
2255 list_peers:: list all replication relationships known by this cluster
2256 enable_peer <ID>::
2257   Enable a previously-disabled replication relationship
2258 disable_peer <ID>::
2259   Disable a replication relationship. HBase will no longer send edits to that
2260   peer cluster, but it still keeps track of all the new WALs that it will need
2261   to replicate if and when it is re-enabled. WALs are retained when enabling or disabling
2262   replication as long as peers exist.
2263 remove_peer <ID>::
2264   Disable and remove a replication relationship. HBase will no longer send edits to that peer cluster or keep track of WALs.
2265 enable_table_replication <TABLE_NAME>::
2266   Enable the table replication switch for all its column families. If the table is not found in the destination cluster then it will create one with the same name and column families.
2267 disable_table_replication <TABLE_NAME>::
2268   Disable the table replication switch for all its column families.
2269
2270 === Serial Replication
2271
2272 Note: this feature is introduced in HBase 2.1
2273
2274 .Function of serial replication
2275
2276 Serial replication supports to push logs to the destination cluster in the same order as logs reach to the source cluster.
2277
2278 .Why need serial replication?
2279 In replication of HBase, we push mutations to destination cluster by reading WAL in each region server. We have a queue for WAL files so we can read them in order of creation time. However, when region-move or RS failure occurs in source cluster, the hlog entries that are not pushed before region-move or RS-failure will be pushed by original RS(for region move) or another RS which takes over the remained hlog of dead RS(for RS failure), and the new entries for the same region(s) will be pushed by the RS which now serves the region(s), but they push the hlog entries of a same region concurrently without coordination.
2280
2281 This treatment can possibly lead to data inconsistency between source and destination clusters:
2282
2283 1. there are put and then delete written to source cluster.
2284
2285 2. due to region-move / RS-failure, they are pushed by different replication-source threads to peer cluster.
2286
2287 3. if delete is pushed to peer cluster before put, and flush and major-compact occurs in peer cluster before put is pushed to peer cluster, the delete is collected and the put remains in peer cluster, but in source cluster the put is masked by the delete, hence data inconsistency between source and destination clusters.
2288
2289
2290 .Serial replication configuration
2291
2292 Set the serial flag to true for a repliation peer. And the default serial flag is false.
2293
2294 * Add a new replication peer which serial flag is true
2295
2296 [source,ruby]
2297 ----
2298 hbase> add_peer '1', CLUSTER_KEY => "server1.cie.com:2181:/hbase", SERIAL => true
2299 ----
2300
2301 * Set a replication peer's serial flag to false
2302
2303 [source,ruby]
2304 ----
2305 hbase> set_peer_serial '1', false
2306 ----
2307
2308 * Set a replication peer's serial flag to true
2309
2310 [source,ruby]
2311 ----
2312 hbase> set_peer_serial '1', true
2313 ----
2314
2315 The serial replication feature had been done firstly in link:https://issues.apache.org/jira/browse/HBASE-9465[HBASE-9465] and then reverted and redone in link:https://issues.apache.org/jira/browse/HBASE-20046[HBASE-20046]. You can find more details in these issues.
2316
2317 === Verifying Replicated Data
2318
2319 The `VerifyReplication` MapReduce job, which is included in HBase, performs a systematic comparison of replicated data between two different clusters. Run the VerifyReplication job on the master cluster, supplying it with the peer ID and table name to use for validation. You can limit the verification further by specifying a time range or specific families. The job's short name is `verifyrep`. To run the job, use a command like the following:
2320 +
2321 [source,bash]
2322 ----
2323 $ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` "${HADOOP_HOME}/bin/hadoop" jar "${HBASE_HOME}/hbase-mapreduce-VERSION.jar" verifyrep --starttime=<timestamp> --endtime=<timestamp> --families=<myFam> <ID> <tableName>
2324 ----
2325 +
2326 The `VerifyReplication` command prints out `GOODROWS` and `BADROWS` counters to indicate rows that did and did not replicate correctly.
2327
2328 === Detailed Information About Cluster Replication
2329
2330 .Replication Architecture Overview
2331 image::replication_overview.png[]
2332
2333 ==== Life of a WAL Edit
2334
2335 A single WAL edit goes through several steps in order to be replicated to a slave cluster.
2336
2337 . An HBase client uses a Put or Delete operation to manipulate data in HBase.
2338 . The region server writes the request to the WAL in a way allows it to be replayed if it is not written successfully.
2339 . If the changed cell corresponds to a column family that is scoped for replication, the edit is added to the queue for replication.
2340 . In a separate thread, the edit is read from the log, as part of a batch process.
2341   Only the KeyValues that are eligible for replication are kept.
2342   Replicable KeyValues are part of a column family whose schema is scoped GLOBAL, are not part of a catalog such as `hbase:meta`, did not originate from the target slave cluster, and have not already been consumed by the target slave cluster.
2343 . The edit is tagged with the master's UUID and added to a buffer.
2344   When the buffer is filled, or the reader reaches the end of the file, the buffer is sent to a random region server on the slave cluster.
2345 . The region server reads the edits sequentially and separates them into buffers, one buffer per table.
2346   After all edits are read, each buffer is flushed using link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html[Table], HBase's normal client.
2347   The master's UUID and the UUIDs of slaves which have already consumed the data are preserved in the edits they are applied, in order to prevent replication loops.
2348 . In the master, the offset for the WAL that is currently being replicated is registered in ZooKeeper.
2349
2350 . The first three steps, where the edit is inserted, are identical.
2351 . Again in a separate thread, the region server reads, filters, and edits the log edits in the same way as above.
2352   The slave region server does not answer the RPC call.
2353 . The master sleeps and tries again a configurable number of times.
2354 . If the slave region server is still not available, the master selects a new subset of region server to replicate to, and tries again to send the buffer of edits.
2355 . Meanwhile, the WALs are rolled and stored in a queue in ZooKeeper.
2356   Logs that are [firstterm]_archived_ by their region server, by moving them from the region server's log directory to a central log directory, will update their paths in the in-memory queue of the replicating thread.
2357 . When the slave cluster is finally available, the buffer is applied in the same way as during normal processing.
2358   The master region server will then replicate the backlog of logs that accumulated during the outage.
2359
2360 .Spreading Queue Failover Load
2361 When replication is active, a subset of region servers in the source cluster is responsible for shipping edits to the sink.
2362 This responsibility must be failed over like all other region server functions should a process or node crash.
2363 The following configuration settings are recommended for maintaining an even distribution of replication activity over the remaining live servers in the source cluster:
2364
2365 * Set `replication.source.maxretriesmultiplier` to `300`.
2366 * Set `replication.source.sleepforretries` to `1` (1 second). This value, combined with the value of `replication.source.maxretriesmultiplier`, causes the retry cycle to last about 5 minutes.
2367 * Set `replication.sleep.before.failover` to `30000` (30 seconds) in the source cluster site configuration.
2368
2369 [[cluster.replication.preserving.tags]]
2370 .Preserving Tags During Replication
2371 By default, the codec used for replication between clusters strips tags, such as cell-level ACLs, from cells.
2372 To prevent the tags from being stripped, you can use a different codec which does not strip them.
2373 Configure `hbase.replication.rpc.codec` to use `org.apache.hadoop.hbase.codec.KeyValueCodecWithTags`, on both the source and sink RegionServers involved in the replication.
2374 This option was introduced in link:https://issues.apache.org/jira/browse/HBASE-10322[HBASE-10322].
2375
2376 ==== Replication Internals
2377
2378 Replication State in ZooKeeper::
2379   HBase replication maintains its state in ZooKeeper.
2380   By default, the state is contained in the base node _/hbase/replication_.
2381   This node contains two child nodes, the `Peers` znode and the `RS`                znode.
2382
2383 The `Peers` Znode::
2384   The `peers` znode is stored in _/hbase/replication/peers_ by default.
2385   It consists of a list of all peer replication clusters, along with the status of each of them.
2386   The value of each peer is its cluster key, which is provided in the HBase Shell.
2387   The cluster key contains a list of ZooKeeper nodes in the cluster's quorum, the client port for the ZooKeeper quorum, and the base znode for HBase in HDFS on that cluster.
2388
2389 The `RS` Znode::
2390   The `rs` znode contains a list of WAL logs which need to be replicated.
2391   This list is divided into a set of queues organized by region server and the peer cluster the region server is shipping the logs to.
2392   The rs znode has one child znode for each region server in the cluster.
2393   The child znode name is the region server's hostname, client port, and start code.
2394   This list includes both live and dead region servers.
2395
2396 ==== Choosing Region Servers to Replicate To
2397
2398 When a master cluster region server initiates a replication source to a slave cluster, it first connects to the slave's ZooKeeper ensemble using the provided cluster key . It then scans the _rs/_ directory to discover all the available sinks (region servers that are accepting incoming streams of edits to replicate) and randomly chooses a subset of them using a configured ratio which has a default value of 10%. For example, if a slave cluster has 150 machines, 15 will be chosen as potential recipient for edits that this master cluster region server sends.
2399 Because this selection is performed by each master region server, the probability that all slave region servers are used is very high, and this method works for clusters of any size.
2400 For example, a master cluster of 10 machines replicating to a slave cluster of 5 machines with a ratio of 10% causes the master cluster region servers to choose one machine each at random.
2401
2402 A ZooKeeper watcher is placed on the _${zookeeper.znode.parent}/rs_ node of the slave cluster by each of the master cluster's region servers.
2403 This watch is used to monitor changes in the composition of the slave cluster.
2404 When nodes are removed from the slave cluster, or if nodes go down or come back up, the master cluster's region servers will respond by selecting a new pool of slave region servers to replicate to.
2405
2406 ==== Keeping Track of Logs
2407
2408 Each master cluster region server has its own znode in the replication znodes hierarchy.
2409 It contains one znode per peer cluster (if 5 slave clusters, 5 znodes are created), and each of these contain a queue of WALs to process.
2410 Each of these queues will track the WALs created by that region server, but they can differ in size.
2411 For example, if one slave cluster becomes unavailable for some time, the WALs should not be deleted, so they need to stay in the queue while the others are processed.
2412 See <<rs.failover.details,rs.failover.details>> for an example.
2413
2414 When a source is instantiated, it contains the current WAL that the region server is writing to.
2415 During log rolling, the new file is added to the queue of each slave cluster's znode just before it is made available.
2416 This ensures that all the sources are aware that a new log exists before the region server is able to append edits into it, but this operations is now more expensive.
2417 The queue items are discarded when the replication thread cannot read more entries from a file (because it reached the end of the last block) and there are other files in the queue.
2418 This means that if a source is up to date and replicates from the log that the region server writes to, reading up to the "end" of the current file will not delete the item in the queue.
2419
2420 A log can be archived if it is no longer used or if the number of logs exceeds `hbase.regionserver.maxlogs` because the insertion rate is faster than regions are flushed.
2421 When a log is archived, the source threads are notified that the path for that log changed.
2422 If a particular source has already finished with an archived log, it will just ignore the message.
2423 If the log is in the queue, the path will be updated in memory.
2424 If the log is currently being replicated, the change will be done atomically so that the reader doesn't attempt to open the file when has already been moved.
2425 Because moving a file is a NameNode operation , if the reader is currently reading the log, it won't generate any exception.
2426
2427 ==== Reading, Filtering and Sending Edits
2428
2429 By default, a source attempts to read from a WAL and ship log entries to a sink as quickly as possible.
2430 Speed is limited by the filtering of log entries Only KeyValues that are scoped GLOBAL and that do not belong to catalog tables will be retained.
2431 Speed is also limited by total size of the list of edits to replicate per slave, which is limited to 64 MB by default.
2432 With this configuration, a master cluster region server with three slaves would use at most 192 MB to store data to replicate.
2433 This does not account for the data which was filtered but not garbage collected.
2434
2435 Once the maximum size of edits has been buffered or the reader reaches the end of the WAL, the source thread stops reading and chooses at random a sink to replicate to (from the list that was generated by keeping only a subset of slave region servers). It directly issues a RPC to the chosen region server and waits for the method to return.
2436 If the RPC was successful, the source determines whether the current file has been emptied or it contains more data which needs to be read.
2437 If the file has been emptied, the source deletes the znode in the queue.
2438 Otherwise, it registers the new offset in the log's znode.
2439 If the RPC threw an exception, the source will retry 10 times before trying to find a different sink.
2440
2441 ==== Cleaning Logs
2442
2443 If replication is not enabled, the master's log-cleaning thread deletes old logs using a configured TTL.
2444 This TTL-based method does not work well with replication, because archived logs which have exceeded their TTL may still be in a queue.
2445 The default behavior is augmented so that if a log is past its TTL, the cleaning thread looks up every queue until it finds the log, while caching queues it has found.
2446 If the log is not found in any queues, the log will be deleted.
2447 The next time the cleaning process needs to look for a log, it starts by using its cached list.
2448
2449 NOTE: WALs are saved when replication is enabled or disabled as long as peers exist.
2450
2451 [[rs.failover.details]]
2452 ==== Region Server Failover
2453
2454 When no region servers are failing, keeping track of the logs in ZooKeeper adds no value.
2455 Unfortunately, region servers do fail, and since ZooKeeper is highly available, it is useful for managing the transfer of the queues in the event of a failure.
2456
2457 Each of the master cluster region servers keeps a watcher on every other region server, in order to be notified when one dies (just as the master does). When a failure happens, they all race to create a znode called `lock` inside the dead region server's znode that contains its queues.
2458 The region server that creates it successfully then transfers all the queues to its own znode, one at a time since ZooKeeper does not support renaming queues.
2459 After queues are all transferred, they are deleted from the old location.
2460 The znodes that were recovered are renamed with the ID of the slave cluster appended with the name of the dead server.
2461
2462 Next, the master cluster region server creates one new source thread per copied queue, and each of the source threads follows the read/filter/ship pattern.
2463 The main difference is that those queues will never receive new data, since they do not belong to their new region server.
2464 When the reader hits the end of the last log, the queue's znode is deleted and the master cluster region server closes that replication source.
2465
2466 Given a master cluster with 3 region servers replicating to a single slave with id `2`, the following hierarchy represents what the znodes layout could be at some point in time.
2467 The region servers' znodes all contain a `peers`          znode which contains a single queue.
2468 The znode names in the queues represent the actual file names on HDFS in the form `address,port.timestamp`.
2469
2470 ----
2471
2472 /hbase/replication/rs/
2473   1.1.1.1,60020,123456780/
2474     2/
2475       1.1.1.1,60020.1234  (Contains a position)
2476       1.1.1.1,60020.1265
2477   1.1.1.2,60020,123456790/
2478     2/
2479       1.1.1.2,60020.1214  (Contains a position)
2480       1.1.1.2,60020.1248
2481       1.1.1.2,60020.1312
2482   1.1.1.3,60020,    123456630/
2483     2/
2484       1.1.1.3,60020.1280  (Contains a position)
2485 ----
2486
2487 Assume that 1.1.1.2 loses its ZooKeeper session.
2488 The survivors will race to create a lock, and, arbitrarily, 1.1.1.3 wins.
2489 It will then start transferring all the queues to its local peers znode by appending the name of the dead server.
2490 Right before 1.1.1.3 is able to clean up the old znodes, the layout will look like the following:
2491
2492 ----
2493
2494 /hbase/replication/rs/
2495   1.1.1.1,60020,123456780/
2496     2/
2497       1.1.1.1,60020.1234  (Contains a position)
2498       1.1.1.1,60020.1265
2499   1.1.1.2,60020,123456790/
2500     lock
2501     2/
2502       1.1.1.2,60020.1214  (Contains a position)
2503       1.1.1.2,60020.1248
2504       1.1.1.2,60020.1312
2505   1.1.1.3,60020,123456630/
2506     2/
2507       1.1.1.3,60020.1280  (Contains a position)
2508
2509     2-1.1.1.2,60020,123456790/
2510       1.1.1.2,60020.1214  (Contains a position)
2511       1.1.1.2,60020.1248
2512       1.1.1.2,60020.1312
2513 ----
2514
2515 Some time later, but before 1.1.1.3 is able to finish replicating the last WAL from 1.1.1.2, it dies too.
2516 Some new logs were also created in the normal queues.
2517 The last region server will then try to lock 1.1.1.3's znode and will begin transferring all the queues.
2518 The new layout will be:
2519
2520 ----
2521
2522 /hbase/replication/rs/
2523   1.1.1.1,60020,123456780/
2524     2/
2525       1.1.1.1,60020.1378  (Contains a position)
2526
2527     2-1.1.1.3,60020,123456630/
2528       1.1.1.3,60020.1325  (Contains a position)
2529       1.1.1.3,60020.1401
2530
2531     2-1.1.1.2,60020,123456790-1.1.1.3,60020,123456630/
2532       1.1.1.2,60020.1312  (Contains a position)
2533   1.1.1.3,60020,123456630/
2534     lock
2535     2/
2536       1.1.1.3,60020.1325  (Contains a position)
2537       1.1.1.3,60020.1401
2538
2539     2-1.1.1.2,60020,123456790/
2540       1.1.1.2,60020.1312  (Contains a position)
2541 ----
2542
2543 === Replication Metrics
2544
2545 The following metrics are exposed at the global region server level and at the peer level:
2546
2547 `source.sizeOfLogQueue`::
2548   number of WALs to process (excludes the one which is being processed) at the Replication source
2549
2550 `source.shippedOps`::
2551   number of mutations shipped
2552
2553 `source.logEditsRead`::
2554   number of mutations read from WALs at the replication source
2555
2556 `source.ageOfLastShippedOp`::
2557   age of last batch that was shipped by the replication source
2558
2559 `source.completedLogs`::
2560   The number of write-ahead-log files that have completed their acknowledged sending to the peer associated with this source. Increments to this metric are a part of normal operation of HBase replication.
2561
2562 `source.completedRecoverQueues`::
2563   The number of recovery queues this source has completed sending to the associated peer. Increments to this metric are a part of normal recovery of HBase replication in the face of failed Region Servers.
2564
2565 `source.uncleanlyClosedLogs`::
2566   The number of write-ahead-log files the replication system considered completed after reaching the end of readable entries in the face of an uncleanly closed file.
2567
2568 `source.ignoredUncleanlyClosedLogContentsInBytes`::
2569   When a write-ahead-log file is not closed cleanly, there will likely be some entry that has been partially serialized. This metric contains the number of bytes of such entries the HBase replication system believes were remaining at the end of files skipped in the face of an uncleanly closed file. Those bytes should either be in different file or represent a client write that was not acknowledged.
2570
2571 `source.restartedLogReading`::
2572   The number of times the HBase replication system detected that it failed to correctly parse a cleanly closed write-ahead-log file. In this circumstance, the system replays the entire log from the beginning, ensuring that no edits fail to be acknowledged by the associated peer. Increments to this metric indicate that the HBase replication system is having difficulty correctly handling failures in the underlying distributed storage system. No dataloss should occur, but you should check Region Server log files for details of the failures.
2573
2574 `source.repeatedLogFileBytes`::
2575   When the HBase replication system determines that it needs to replay a given write-ahead-log file, this metric is incremented by the number of bytes the replication system believes had already been acknowledged by the associated peer prior to starting over.
2576
2577 `source.closedLogsWithUnknownFileLength`::
2578   Incremented when the HBase replication system believes it is at the end of a write-ahead-log file but it can not determine the length of that file in the underlying distributed storage system. Could indicate dataloss since the replication system is unable to determine if the end of readable entries lines up with the expected end of the file. You should check Region Server log files for details of the failures.
2579
2580
2581 === Replication Configuration Options
2582
2583 [cols="1,1,1", options="header"]
2584 |===
2585 | Option
2586 | Description
2587 | Default
2588
2589 | zookeeper.znode.parent
2590 | The name of the base ZooKeeper znode used for HBase
2591 | /hbase
2592
2593 | zookeeper.znode.replication
2594 | The name of the base znode used for replication
2595 | replication
2596
2597 | zookeeper.znode.replication.peers
2598 | The name of the peer znode
2599 | peers
2600
2601 | zookeeper.znode.replication.peers.state
2602 | The name of peer-state znode
2603 | peer-state
2604
2605 | zookeeper.znode.replication.rs
2606 | The name of the rs znode
2607 | rs
2608
2609 | replication.sleep.before.failover
2610 | How many milliseconds a worker should sleep before attempting to replicate
2611                 a dead region server's WAL queues.
2612 |
2613
2614 | replication.executor.workers
2615 | The number of region servers a given region server should attempt to
2616                   failover simultaneously.
2617 | 1
2618 |===
2619
2620 === Monitoring Replication Status
2621
2622 You can use the HBase Shell command `status 'replication'` to monitor the replication status on your cluster. The  command has three variations:
2623 * `status 'replication'` -- prints the status of each source and its sinks, sorted by hostname.
2624 * `status 'replication', 'source'` -- prints the status for each replication source, sorted by hostname.
2625 * `status 'replication', 'sink'` -- prints the status for each replication sink, sorted by hostname.
2626
2627 ==== Understanding the output
2628
2629 The command output will vary according to the state of replication. For example right after a restart
2630 and if destination peer is not reachable, no replication source threads would be running,
2631 so no metrics would get displayed:
2632
2633 ----
2634 hbase01.home:
2635 SOURCE: PeerID=1
2636 Normal Queue: 1
2637 No Reader/Shipper threads runnning yet.
2638 SINK: TimeStampStarted=1591985197350, Waiting for OPs...
2639 ----
2640
2641 Under normal circumstances, a healthy, active-active replication deployment would
2642 show the following:
2643
2644 ----
2645     hbase01.home:
2646       SOURCE: PeerID=1
2647          Normal Queue: 1
2648            AgeOfLastShippedOp=0, TimeStampOfLastShippedOp=Fri Jun 12 18:49:23 BST 2020, SizeOfLogQueue=1, EditsReadFromLogQueue=1, OpsShippedToTarget=1, TimeStampOfNextToReplicate=Fri Jun 12 18:49:23 BST 2020, Replication Lag=0
2649       SINK: TimeStampStarted=1591983663458, AgeOfLastAppliedOp=0, TimeStampsOfLastAppliedOp=Fri Jun 12 18:57:18 BST 2020
2650 ----
2651
2652 The definition for each of these metrics is detailed below:
2653
2654 [cols="1,1,1", options="header"]
2655 |===
2656 | Type
2657 | Metric Name
2658 | Description
2659
2660 | Source
2661 | AgeOfLastShippedOp
2662 | How long last successfully shipped edit took to effectively get replicated on target.
2663
2664 | Source
2665 | TimeStampOfLastShippedOp
2666 | The actual date of last successful edit shipment.
2667
2668 | Source
2669 | SizeOfLogQueue
2670 | Number of wal files on this given queue.
2671
2672 | Source
2673 | EditsReadFromLogQueue
2674 | How many edits have been read from this given queue since this source thread started.
2675
2676 | Source
2677 | OpsShippedToTarget
2678 | How many edits have been shipped to target since this source thread started.
2679
2680 | Source
2681 | TimeStampOfNextToReplicate
2682 | Date of the current edit been attempted to replicate.
2683
2684 | Source
2685 | Replication Lag
2686 | The elapsed time (in millis), since the last edit to replicate was read by this source
2687 thread and effectively replicated to target
2688
2689 | Sink
2690 | TimeStampStarted
2691 | Date (in millis) of when this Sink thread started.
2692
2693 | Sink
2694 | AgeOfLastAppliedOp
2695 | How long it took to apply the last successful shipped edit.
2696
2697 | Sink
2698 | TimeStampsOfLastAppliedOp
2699 | Date of last successful applied edit.
2700
2701 |===
2702
2703 Growing values for `Source.TimeStampsOfLastAppliedOp` and/or
2704 `Source.Replication Lag` would indicate replication delays. If those numbers keep going
2705 up, while `Source.TimeStampOfLastShippedOp`, `Source.EditsReadFromLogQueue`,
2706 `Source.OpsShippedToTarget` or `Source.TimeStampOfNextToReplicate` do not change at all,
2707  then replication flow is failing to progress, and there might be problems within
2708 clusters communication. This could also happen if replication is manually paused
2709 (via hbase shell `disable_peer` command, for example), but data keeps getting ingested
2710 in the source cluster tables.
2711
2712 == Running Multiple Workloads On a Single Cluster
2713
2714 HBase provides the following mechanisms for managing the performance of a cluster
2715 handling multiple workloads:
2716 . <<quota>>
2717 . <<request_queues>>
2718 . <<multiple-typed-queues>>
2719
2720 [[quota]]
2721 === Quotas
2722 HBASE-11598 introduces RPC quotas, which allow you to throttle requests based on
2723 the following limits:
2724
2725 . <<request-quotas,The number or size of requests(read, write, or read+write) in a given timeframe>>
2726 . <<namespace_quotas,The number of tables allowed in a namespace>>
2727
2728 These limits can be enforced for a specified user, table, or namespace.
2729
2730 .Enabling Quotas
2731
2732 Quotas are disabled by default. To enable the feature, set the `hbase.quota.enabled`
2733 property to `true` in _hbase-site.xml_ file for all cluster nodes.
2734
2735 .General Quota Syntax
2736 . THROTTLE_TYPE can be expressed as READ, WRITE, or the default type(read + write).
2737 . Timeframes  can be expressed in the following units: `sec`, `min`, `hour`, `day`
2738 . Request sizes can be expressed in the following units: `B` (bytes), `K` (kilobytes),
2739 `M` (megabytes), `G` (gigabytes), `T` (terabytes), `P` (petabytes)
2740 . Numbers of requests are expressed as an integer followed by the string `req`
2741 . Limits relating to time are expressed as req/time or size/time. For instance `10req/day`
2742 or `100P/hour`.
2743 . Numbers of tables or regions are expressed as integers.
2744
2745 [[request-quotas]]
2746 .Setting Request Quotas
2747 You can set quota rules ahead of time, or you can change the throttle at runtime. The change
2748 will propagate after the quota refresh period has expired. This expiration period
2749 defaults to 5 minutes. To change it, modify the `hbase.quota.refresh.period` property
2750 in `hbase-site.xml`. This property is expressed in milliseconds and defaults to `300000`.
2751
2752 ----
2753 # Limit user u1 to 10 requests per second
2754 hbase> set_quota TYPE => THROTTLE, USER => 'u1', LIMIT => '10req/sec'
2755
2756 # Limit user u1 to 10 read requests per second
2757 hbase> set_quota TYPE => THROTTLE, THROTTLE_TYPE => READ, USER => 'u1', LIMIT => '10req/sec'
2758
2759 # Limit user u1 to 10 M per day everywhere
2760 hbase> set_quota TYPE => THROTTLE, USER => 'u1', LIMIT => '10M/day'
2761
2762 # Limit user u1 to 10 M write size per sec
2763 hbase> set_quota TYPE => THROTTLE, THROTTLE_TYPE => WRITE, USER => 'u1', LIMIT => '10M/sec'
2764
2765 # Limit user u1 to 5k per minute on table t2
2766 hbase> set_quota TYPE => THROTTLE, USER => 'u1', TABLE => 't2', LIMIT => '5K/min'
2767
2768 # Limit user u1 to 10 read requests per sec on table t2
2769 hbase> set_quota TYPE => THROTTLE, THROTTLE_TYPE => READ, USER => 'u1', TABLE => 't2', LIMIT => '10req/sec'
2770
2771 # Remove an existing limit from user u1 on namespace ns2
2772 hbase> set_quota TYPE => THROTTLE, USER => 'u1', NAMESPACE => 'ns2', LIMIT => NONE
2773
2774 # Limit all users to 10 requests per hour on namespace ns1
2775 hbase> set_quota TYPE => THROTTLE, NAMESPACE => 'ns1', LIMIT => '10req/hour'
2776
2777 # Limit all users to 10 T per hour on table t1
2778 hbase> set_quota TYPE => THROTTLE, TABLE => 't1', LIMIT => '10T/hour'
2779
2780 # Remove all existing limits from user u1
2781 hbase> set_quota TYPE => THROTTLE, USER => 'u1', LIMIT => NONE
2782
2783 # List all quotas for user u1 in namespace ns2
2784 hbase> list_quotas USER => 'u1, NAMESPACE => 'ns2'
2785
2786 # List all quotas for namespace ns2
2787 hbase> list_quotas NAMESPACE => 'ns2'
2788
2789 # List all quotas for table t1
2790 hbase> list_quotas TABLE => 't1'
2791
2792 # list all quotas
2793 hbase> list_quotas
2794 ----
2795
2796 You can also place a global limit and exclude a user or a table from the limit by applying the
2797 `GLOBAL_BYPASS` property.
2798 ----
2799 hbase> set_quota NAMESPACE => 'ns1', LIMIT => '100req/min'               # a per-namespace request limit
2800 hbase> set_quota USER => 'u1', GLOBAL_BYPASS => true                     # user u1 is not affected by the limit
2801 ----
2802
2803 [[namespace_quotas]]
2804 .Setting Namespace Quotas
2805
2806 You can specify the maximum number of tables or regions allowed in a given namespace, either
2807 when you create the namespace or by altering an existing namespace, by setting the
2808 `hbase.namespace.quota.maxtables property`  on the namespace.
2809
2810 .Limiting Tables Per Namespace
2811 ----
2812 # Create a namespace with a max of 5 tables
2813 hbase> create_namespace 'ns1', {'hbase.namespace.quota.maxtables'=>'5'}
2814
2815 # Alter an existing namespace to have a max of 8 tables
2816 hbase> alter_namespace 'ns2', {METHOD => 'set', 'hbase.namespace.quota.maxtables'=>'8'}
2817
2818 # Show quota information for a namespace
2819 hbase> describe_namespace 'ns2'
2820
2821 # Alter an existing namespace to remove a quota
2822 hbase> alter_namespace 'ns2', {METHOD => 'unset', NAME=>'hbase.namespace.quota.maxtables'}
2823 ----
2824
2825 .Limiting Regions Per Namespace
2826 ----
2827 # Create a namespace with a max of 10 regions
2828 hbase> create_namespace 'ns1', {'hbase.namespace.quota.maxregions'=>'10'
2829
2830 # Show quota information for a namespace
2831 hbase> describe_namespace 'ns1'
2832
2833 # Alter an existing namespace to have a max of 20 tables
2834 hbase> alter_namespace 'ns2', {METHOD => 'set', 'hbase.namespace.quota.maxregions'=>'20'}
2835
2836 # Alter an existing namespace to remove a quota
2837 hbase> alter_namespace 'ns2', {METHOD => 'unset', NAME=> 'hbase.namespace.quota.maxregions'}
2838 ----
2839
2840 [[request_queues]]
2841 === Request Queues
2842 If no throttling policy is configured, when the RegionServer receives multiple requests,
2843 they are now placed into a queue waiting for a free execution slot (HBASE-6721).
2844 The simplest queue is a FIFO queue, where each request waits for all previous requests in the queue
2845 to finish before running. Fast or interactive queries can get stuck behind large requests.
2846
2847 If you are able to guess how long a request will take, you can reorder requests by
2848 pushing the long requests to the end of the queue and allowing short requests to preempt
2849 them. Eventually, you must still execute the large requests and prioritize the new
2850 requests behind them. The short requests will be newer, so the result is not terrible,
2851 but still suboptimal compared to a mechanism which allows large requests to be split
2852 into multiple smaller ones.
2853
2854 HBASE-10993 introduces such a system for deprioritizing long-running scanners. There
2855 are two types of queues, `fifo` and `deadline`. To configure the type of queue used,
2856 configure the `hbase.ipc.server.callqueue.type` property in `hbase-site.xml`. There
2857 is no way to estimate how long each request may take, so de-prioritization only affects
2858 scans, and is based on the number of “next” calls a scan request has made. An assumption
2859 is made that when you are doing a full table scan, your job is not likely to be interactive,
2860 so if there are concurrent requests, you can delay long-running scans up to a limit tunable by
2861 setting the `hbase.ipc.server.queue.max.call.delay` property. The slope of the delay is calculated
2862 by a simple square root of `(numNextCall * weight)` where the weight is
2863 configurable by setting the `hbase.ipc.server.scan.vtime.weight` property.
2864
2865 [[multiple-typed-queues]]
2866 === Multiple-Typed Queues
2867
2868 You can also prioritize or deprioritize different kinds of requests by configuring
2869 a specified number of dedicated handlers and queues. You can segregate the scan requests
2870 in a single queue with a single handler, and all the other available queues can service
2871 short `Get` requests.
2872
2873 You can adjust the IPC queues and handlers based on the type of workload, using static
2874 tuning options. This approach is an interim first step that will eventually allow
2875 you to change the settings at runtime, and to dynamically adjust values based on the load.
2876
2877 .Multiple Queues
2878
2879 To avoid contention and separate different kinds of requests, configure the
2880 `hbase.ipc.server.callqueue.handler.factor` property, which allows you to increase the number of
2881 queues and control how many handlers can share the same queue., allows admins to increase the number
2882 of queues and decide how many handlers share the same queue.
2883
2884 Using more queues reduces contention when adding a task to a queue or selecting it
2885 from a queue. You can even configure one queue per handler. The trade-off is that
2886 if some queues contain long-running tasks, a handler may need to wait to execute from that queue
2887 rather than stealing from another queue which has waiting tasks.
2888
2889 .Read and Write Queues
2890 With multiple queues, you can now divide read and write requests, giving more priority
2891 (more queues) to one or the other type. Use the `hbase.ipc.server.callqueue.read.ratio`
2892 property to choose to serve more reads or more writes.
2893
2894 .Get and Scan Queues
2895 Similar to the read/write split, you can split gets and scans by tuning the `hbase.ipc.server.callqueue.scan.ratio`
2896 property to give more priority to gets or to scans. A scan ratio of `0.1` will give
2897 more queue/handlers to the incoming gets, which means that more gets can be processed
2898 at the same time and that fewer scans can be executed at the same time. A value of
2899 `0.9` will give more queue/handlers to scans, so the number of scans executed will
2900 increase and the number of gets will decrease.
2901
2902 [[space-quotas]]
2903 === Space Quotas
2904
2905 link:https://issues.apache.org/jira/browse/HBASE-16961[HBASE-16961] introduces a new type of
2906 quotas for HBase to leverage: filesystem quotas. These "space" quotas limit the amount of space
2907 on the filesystem that HBase namespaces and tables can consume. If a user, malicious or ignorant,
2908 has the ability to write data into HBase, with enough time, that user can effectively crash HBase
2909 (or worse HDFS) by consuming all available space. When there is no filesystem space available,
2910 HBase crashes because it can no longer create/sync data to the write-ahead log.
2911
2912 This feature allows a for a limit to be set on the size of a table or namespace. When a space quota is set
2913 on a namespace, the quota's limit applies to the sum of usage of all tables in that namespace.
2914 When a table with a quota exists in a namespace with a quota, the table quota takes priority
2915 over the namespace quota. This allows for a scenario where a large limit can be placed on
2916 a collection of tables, but a single table in that collection can have a fine-grained limit set.
2917
2918 The existing `set_quota` and `list_quota` HBase shell commands can be used to interact with
2919 space quotas. Space quotas are quotas with a `TYPE` of `SPACE` and have `LIMIT` and `POLICY`
2920 attributes. The `LIMIT` is a string that refers to the amount of space on the filesystem
2921 that the quota subject (e.g. the table or namespace) may consume. For example, valid values
2922 of `LIMIT` are `'10G'`, `'2T'`, or `'256M'`. The `POLICY` refers to the action that HBase will
2923 take when the quota subject's usage exceeds the `LIMIT`. The following are valid `POLICY` values.
2924
2925 * `NO_INSERTS` - No new data may be written (e.g. `Put`, `Increment`, `Append`).
2926 * `NO_WRITES` - Same as `NO_INSERTS` but `Deletes` are also disallowed.
2927 * `NO_WRITES_COMPACTIONS` - Same as `NO_WRITES` but compactions are also disallowed.
2928 ** This policy only prevents user-submitted compactions. System can still run compactions.
2929 * `DISABLE` - The table(s) are disabled, preventing all read/write access.
2930
2931 .Setting simple space quotas
2932 ----
2933 # Sets a quota on the table 't1' with a limit of 1GB, disallowing Puts/Increments/Appends when the table exceeds 1GB
2934 hbase> set_quota TYPE => SPACE, TABLE => 't1', LIMIT => '1G', POLICY => NO_INSERTS
2935
2936 # Sets a quota on the namespace 'ns1' with a limit of 50TB, disallowing Puts/Increments/Appends/Deletes
2937 hbase> set_quota TYPE => SPACE, NAMESPACE => 'ns1', LIMIT => '50T', POLICY => NO_WRITES
2938
2939 # Sets a quota on the table 't3' with a limit of 2TB, disallowing any writes and compactions when the table exceeds 2TB.
2940 hbase> set_quota TYPE => SPACE, TABLE => 't3', LIMIT => '2T', POLICY => NO_WRITES_COMPACTIONS
2941
2942 # Sets a quota on the table 't2' with a limit of 50GB, disabling the table when it exceeds 50GB
2943 hbase> set_quota TYPE => SPACE, TABLE => 't2', LIMIT => '50G', POLICY => DISABLE
2944 ----
2945
2946 Consider the following scenario to set up quotas on a namespace, overriding the quota on tables in that namespace
2947
2948 .Table and Namespace space quotas
2949 ----
2950 hbase> create_namespace 'ns1'
2951 hbase> create 'ns1:t1'
2952 hbase> create 'ns1:t2'
2953 hbase> create 'ns1:t3'
2954 hbase> set_quota TYPE => SPACE, NAMESPACE => 'ns1', LIMIT => '100T', POLICY => NO_INSERTS
2955 hbase> set_quota TYPE => SPACE, TABLE => 'ns1:t2', LIMIT => '200G', POLICY => NO_WRITES
2956 hbase> set_quota TYPE => SPACE, TABLE => 'ns1:t3', LIMIT => '20T', POLICY => NO_WRITES
2957 ----
2958
2959 In the above scenario, the tables in the namespace `ns1` will not be allowed to consume more than
2960 100TB of space on the filesystem among each other. The table 'ns1:t2' is only allowed to be 200GB in size, and will
2961 disallow all writes when the usage exceeds this limit. The table 'ns1:t3' is allowed to grow to 20TB in size
2962 and also will disallow all writes then the usage exceeds this limit. Because there is no table quota
2963 on 'ns1:t1', this table can grow up to 100TB, but only if 'ns1:t2' and 'ns1:t3' have a usage of zero bytes.
2964 Practically, it's limit is 100TB less the current usage of 'ns1:t2' and 'ns1:t3'.
2965
2966 [[ops.space.quota.deletion]]
2967 === Disabling Automatic Space Quota Deletion
2968
2969 By default, if a table or namespace is deleted that has a space quota, the quota itself is
2970 also deleted. In some cases, it may be desirable for the space quota to not be automatically deleted.
2971 In these cases, the user may configure the system to not delete any space quota automatically via hbase-site.xml.
2972
2973 [source,java]
2974 ----
2975
2976   <property>
2977     <name>hbase.quota.remove.on.table.delete</name>
2978     <value>false</value>
2979   </property>
2980 ----
2981
2982 The value is set to `true` by default.
2983
2984 === HBase Snapshots with Space Quotas
2985
2986 One common area of unintended-filesystem-use with HBase is via HBase snapshots. Because snapshots
2987 exist outside of the management of HBase tables, it is not uncommon for administrators to suddenly
2988 realize that hundreds of gigabytes or terabytes of space is being used by HBase snapshots which were
2989 forgotten and never removed.
2990
2991 link:https://issues.apache.org/jira/browse/HBASE-17748[HBASE-17748] is the umbrella JIRA issue which
2992 expands on the original space quota functionality to also include HBase snapshots. While this is a confusing
2993 subject, the implementation attempts to present this support in as reasonable and simple of a manner as
2994 possible for administrators. This feature does not make any changes to administrator interaction with
2995 space quotas, only in the internal computation of table/namespace usage. Table and namespace usage will
2996 automatically incorporate the size taken by a snapshot per the rules defined below.
2997
2998 As a review, let's cover a snapshot's lifecycle: a snapshot is metadata which points to
2999 a list of HFiles on the filesystem. This is why creating a snapshot is a very cheap operation; no HBase
3000 table data is actually copied to perform a snapshot. Cloning a snapshot into a new table or restoring
3001 a table is a cheap operation for the same reason; the new table references the files which already exist
3002 on the filesystem without a copy. To include snapshots in space quotas, we need to define which table
3003 "owns" a file when a snapshot references the file ("owns" refers to encompassing the filesystem usage
3004 of that file).
3005
3006 Consider a snapshot which was made against a table. When the snapshot refers to a file and the table no
3007 longer refers to that file, the "originating" table "owns" that file. When multiple snapshots refer to
3008 the same file and no table refers to that file, the snapshot with the lowest-sorting name (lexicographically)
3009 is chosen and the table which that snapshot was created from "owns" that file. HFiles are not "double-counted"
3010  hen a table and one or more snapshots refer to that HFile.
3011
3012 When a table is "rematerialized" (via `clone_snapshot` or `restore_snapshot`), a similar problem of file
3013 ownership arises. In this case, while the rematerialized table references a file which a snapshot also
3014 references, the table does not "own" the file. The table from which the snapshot was created still "owns"
3015 that file. When the rematerialized table is compacted or the snapshot is deleted, the rematerialized table
3016 will uniquely refer to a new file and "own" the usage of that file. Similarly, when a table is duplicated via a snapshot
3017 and `restore_snapshot`, the new table will not consume any quota size until the original table stops referring
3018 to the files, either due to a compaction on the original table, a compaction on the new table, or the
3019 original table being deleted.
3020
3021 One new HBase shell command was added to inspect the computed sizes of each snapshot in an HBase instance.
3022
3023 ----
3024 hbase> list_snapshot_sizes
3025 SNAPSHOT                                      SIZE
3026  t1.s1                                        1159108
3027 ----
3028
3029 [[ops.backup]]
3030 == HBase Backup
3031
3032 There are two broad strategies for performing HBase backups: backing up with a full cluster shutdown, and backing up on a live cluster.
3033 Each approach has pros and cons.
3034
3035 For additional information, see link:http://blog.sematext.com/2011/03/11/hbase-backup-options/[HBase Backup
3036         Options] over on the Sematext Blog.
3037
3038 [[ops.backup.fullshutdown]]
3039 === Full Shutdown Backup
3040
3041 Some environments can tolerate a periodic full shutdown of their HBase cluster, for example if it is being used a back-end analytic capacity and not serving front-end web-pages.
3042 The benefits are that the NameNode/Master are RegionServers are down, so there is no chance of missing any in-flight changes to either StoreFiles or metadata.
3043 The obvious con is that the cluster is down.
3044 The steps include:
3045
3046 [[ops.backup.fullshutdown.stop]]
3047 ==== Stop HBase
3048
3049
3050
3051 [[ops.backup.fullshutdown.distcp]]
3052 ==== Distcp
3053
3054 Distcp could be used to either copy the contents of the HBase directory in HDFS to either the same cluster in another directory, or to a different cluster.
3055
3056 Note: Distcp works in this situation because the cluster is down and there are no in-flight edits to files.
3057 Distcp-ing of files in the HBase directory is not generally recommended on a live cluster.
3058
3059 [[ops.backup.fullshutdown.restore]]
3060 ==== Restore (if needed)
3061
3062 The backup of the hbase directory from HDFS is copied onto the 'real' hbase directory via distcp.
3063 The act of copying these files creates new HDFS metadata, which is why a restore of the NameNode edits from the time of the HBase backup isn't required for this kind of restore, because it's a restore (via distcp) of a specific HDFS directory (i.e., the HBase part) not the entire HDFS file-system.
3064
3065 [[ops.backup.live.replication]]
3066 === Live Cluster Backup - Replication
3067
3068 This approach assumes that there is a second cluster.
3069 See the HBase page on link:https://hbase.apache.org/book.html#_cluster_replication[replication] for more information.
3070
3071 [[ops.backup.live.copytable]]
3072 === Live Cluster Backup - CopyTable
3073
3074 The <<copy.table,copytable>> utility could either be used to copy data from one table to another on the same cluster, or to copy data to another table on another cluster.
3075
3076 Since the cluster is up, there is a risk that edits could be missed in the copy process.
3077
3078 [[ops.backup.live.export]]
3079 === Live Cluster Backup - Export
3080
3081 The <<export,export>> approach dumps the content of a table to HDFS on the same cluster.
3082 To restore the data, the <<import,import>> utility would be used.
3083
3084 Since the cluster is up, there is a risk that edits could be missed in the export process. If you want to know more about HBase back-up and restore see the page on link:http://hbase.apache.org/book.html#backuprestore[Backup and Restore].
3085
3086 [[ops.snapshots]]
3087 == HBase Snapshots
3088
3089 HBase Snapshots allow you to take a copy of a table (both contents and metadata)with a very small performance impact. A Snapshot is an immutable
3090 collection of table metadata and a list of HFiles that comprised the table at the time the Snapshot was taken. A "clone"
3091 of a snapshot creates a new table from that snapshot, and a "restore" of a snapshot returns the contents of a table to
3092 what it was when the snapshot was created. The "clone" and "restore" operations do not require any data to be copied,
3093 as the underlying HFiles (the files which contain the data for an HBase table) are not modified with either action.
3094 Simiarly, exporting a snapshot to another cluster has little impact on RegionServers of the local cluster.
3095
3096 Prior to version 0.94.6, the only way to backup or to clone a table is to use CopyTable/ExportTable, or to copy all the hfiles in HDFS after disabling the table.
3097 The disadvantages of these methods are that you can degrade region server performance (Copy/Export Table) or you need to disable the table, that means no reads or writes; and this is usually unacceptable.
3098
3099 [[ops.snapshots.configuration]]
3100 === Configuration
3101
3102 To turn on the snapshot support just set the `hbase.snapshot.enabled`        property to true.
3103 (Snapshots are enabled by default in 0.95+ and off by default in 0.94.6+)
3104
3105 [source,java]
3106 ----
3107
3108   <property>
3109     <name>hbase.snapshot.enabled</name>
3110     <value>true</value>
3111   </property>
3112 ----
3113
3114 [[ops.snapshots.takeasnapshot]]
3115 === Take a Snapshot
3116
3117 You can take a snapshot of a table regardless of whether it is enabled or disabled.
3118 The snapshot operation doesn't involve any data copying.
3119
3120 ----
3121
3122 $ ./bin/hbase shell
3123 hbase> snapshot 'myTable', 'myTableSnapshot-122112'
3124 ----
3125
3126 .Take a Snapshot Without Flushing
3127 The default behavior is to perform a flush of data in memory before the snapshot is taken.
3128 This means that data in memory is included in the snapshot.
3129 In most cases, this is the desired behavior.
3130 However, if your set-up can tolerate data in memory being excluded from the snapshot, you can use the `SKIP_FLUSH` option of the `snapshot` command to disable and flushing while taking the snapshot.
3131
3132 ----
3133 hbase> snapshot 'mytable', 'snapshot123', {SKIP_FLUSH => true}
3134 ----
3135
3136 WARNING: There is no way to determine or predict whether a very concurrent insert or update will be included in a given snapshot, whether flushing is enabled or disabled.
3137 A snapshot is only a representation of a table during a window of time.
3138 The amount of time the snapshot operation will take to reach each Region Server may vary from a few seconds to a minute, depending on the resource load and speed of the hardware or network, among other factors.
3139 There is also no way to know whether a given insert or update is in memory or has been flushed.
3140
3141
3142 .Take a Snapshot With TTL
3143 Snapshots have a lifecycle that is independent from the table from which they are created.
3144 Although data in a table may be stored with TTL the data files containing them become
3145 frozen by the snapshot. Space consumed by expired cells will not be reclaimed by normal
3146 table housekeeping like compaction. While this is expected it can be inconvenient at scale.
3147 When many snapshots are under management and the data in various tables is expired by
3148 TTL some notion of optional TTL (and optional default TTL) for snapshots could be useful.
3149
3150
3151 ----
3152 hbase> snapshot 'mytable', 'snapshot1234', {TTL => 86400}
3153 ----
3154
3155 The above command creates snapshot `snapshot1234` with TTL of 86400 sec (24 hours)
3156 and hence, the snapshot is supposed to be cleaned up after 24 hours
3157
3158
3159
3160 .Default Snapshot TTL:
3161 - User specified default TTL with config `hbase.master.snapshot.ttl`
3162 - FOREVER if `hbase.master.snapshot.ttl` is not set
3163
3164 While creating a snapshot, if TTL in seconds is not explicitly specified, the above logic will be
3165 followed to determine the TTL. If no configs are changed, the default behavior is that all snapshots
3166 will be retained forever (until manual deletion). If a different default TTL behavior is desired,
3167 `hbase.master.snapshot.ttl` can be set to a default TTL in seconds. Any snapshot created without
3168 an explicit TTL will take this new value.
3169
3170 NOTE: If `hbase.master.snapshot.ttl` is set, a snapshot with an explicit {TTL => 0} or
3171 {TTL => -1} will also take this value. In this case, a TTL < -1 (such as {TTL => -2} should be used
3172 to indicate FOREVER.
3173
3174 To summarize concisely,
3175
3176 1. Snapshot with TTL value < -1 will stay forever regardless of any server side config changes (until deleted manually by user).
3177 2. Snapshot with TTL value > 0 will be deleted automatically soon after TTL expires.
3178 3. Snapshot created without specifying TTL will always have TTL value represented by config `hbase.master.snapshot.ttl`. Default value of this config is 0, which represents: keep the snapshot forever (until deleted manually by user).
3179 4. From client side, TTL value 0 or -1 should never be explicitly provided because they will be treated same as snapshot without TTL (same as above point 3) and hence will use TTL as per value represented by config `hbase.master.snapshot.ttl`.
3180
3181 .Take a snapshot with custom MAX_FILESIZE
3182
3183 Optionally, snapshots can be created with a custom max file size configuration that will be
3184 used by cloned tables, instead of the global `hbase.hregion.max.filesize` configuration property.
3185 This is mostly useful when exporting snapshots between different clusters. If the HBase cluster where
3186 the snapshot is originally taken has a much larger value set for `hbase.hregion.max.filesize` than
3187 one or more clusters where the snapshot is being exported to, a storm of region splits may occur when
3188 restoring the snapshot on destination clusters. Specifying `MAX_FILESIZE` on properties passed to
3189 `snapshot` command will save informed value into the table's `MAX_FILESIZE`
3190 decriptor at snapshot creation time. If the table already defines `MAX_FILESIZE` descriptor,
3191 this property would be ignored and have no effect.
3192
3193 ----
3194 snapshot 'table01', 'snap01', {MAX_FILESIZE => 21474836480}
3195 ----
3196
3197 .Enable/Disable Snapshot Auto Cleanup on running cluster:
3198
3199 By default, snapshot auto cleanup based on TTL would be enabled
3200 for any new cluster.
3201 At any point in time, if snapshot cleanup is supposed to be stopped due to
3202 some snapshot restore activity or any other reason, it is advisable
3203 to disable it using shell command:
3204
3205 ----
3206 hbase> snapshot_cleanup_switch false
3207 ----
3208
3209 We can re-enable it using:
3210
3211 ----
3212 hbase> snapshot_cleanup_switch true
3213 ----
3214
3215 The shell command with switch false would disable snapshot auto
3216 cleanup activity based on TTL and return the previous state of
3217 the activity(true: running already, false: disabled already)
3218
3219 A sample output for above commands:
3220 ----
3221 Previous snapshot cleanup state : true
3222 Took 0.0069 seconds
3223 => "true"
3224 ----
3225
3226 We can query whether snapshot auto cleanup is enabled for
3227 cluster using:
3228
3229 ----
3230 hbase> snapshot_cleanup_enabled
3231 ----
3232
3233 The command would return output in true/false.
3234
3235 [[ops.snapshots.list]]
3236 === Listing Snapshots
3237
3238 List all snapshots taken (by printing the names and relative information).
3239
3240 ----
3241
3242 $ ./bin/hbase shell
3243 hbase> list_snapshots
3244 ----
3245
3246 [[ops.snapshots.delete]]
3247 === Deleting Snapshots
3248
3249 You can remove a snapshot, and the files retained for that snapshot will be removed if no longer needed.
3250
3251 ----
3252
3253 $ ./bin/hbase shell
3254 hbase> delete_snapshot 'myTableSnapshot-122112'
3255 ----
3256
3257 [[ops.snapshots.clone]]
3258 === Clone a table from snapshot
3259
3260 From a snapshot you can create a new table (clone operation) with the same data that you had when the snapshot was taken.
3261 The clone operation, doesn't involve data copies, and a change to the cloned table doesn't impact the snapshot or the original table.
3262
3263 ----
3264
3265 $ ./bin/hbase shell
3266 hbase> clone_snapshot 'myTableSnapshot-122112', 'myNewTestTable'
3267 ----
3268
3269 [[ops.snapshots.restore]]
3270 === Restore a snapshot
3271
3272 The restore operation requires the table to be disabled, and the table will be restored to the state at the time when the snapshot was taken, changing both data and schema if required.
3273
3274 ----
3275
3276 $ ./bin/hbase shell
3277 hbase> disable 'myTable'
3278 hbase> restore_snapshot 'myTableSnapshot-122112'
3279 ----
3280
3281 NOTE: Since Replication works at log level and snapshots at file-system level, after a restore, the replicas will be in a different state from the master.
3282 If you want to use restore, you need to stop replication and redo the bootstrap.
3283
3284 In case of partial data-loss due to misbehaving client, instead of a full restore that requires the table to be disabled, you can clone the table from the snapshot and use a Map-Reduce job to copy the data that you need, from the clone to the main one.
3285
3286 [[ops.snapshots.acls]]
3287 === Snapshots operations and ACLs
3288
3289 If you are using security with the AccessController Coprocessor (See <<hbase.accesscontrol.configuration,hbase.accesscontrol.configuration>>), only a global administrator can take, clone, or restore a snapshot, and these actions do not capture the ACL rights.
3290 This means that restoring a table preserves the ACL rights of the existing table, while cloning a table creates a new table that has no ACL rights until the administrator adds them.
3291
3292 [[ops.snapshots.export]]
3293 === Export to another cluster
3294
3295 The ExportSnapshot tool copies all the data related to a snapshot (hfiles, logs, snapshot metadata) to another cluster.
3296 The tool executes a Map-Reduce job, similar to distcp, to copy files between the two clusters, and since it works at file-system level the hbase cluster does not have to be online.
3297
3298 To copy a snapshot called MySnapshot to an HBase cluster srv2 (hdfs:///srv2:8082/hbase) using 16 mappers:
3299
3300 [source,bourne]
3301 ----
3302 $ bin/hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot MySnapshot -copy-to hdfs://srv2:8082/hbase -mappers 16
3303 ----
3304
3305 .Limiting Bandwidth Consumption
3306 You can limit the bandwidth consumption when exporting a snapshot, by specifying the `-bandwidth` parameter, which expects an integer representing megabytes per second.
3307 The following example limits the above example to 200 MB/sec.
3308
3309 [source,bourne]
3310 ----
3311 $ bin/hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot MySnapshot -copy-to hdfs://srv2:8082/hbase -mappers 16 -bandwidth 200
3312 ----
3313
3314 [[snapshots_s3]]
3315 === Storing Snapshots in an Amazon S3 Bucket
3316
3317 You can store and retrieve snapshots from Amazon S3, using the following procedure.
3318
3319 NOTE: You can also store snapshots in Microsoft Azure Blob Storage. See <<snapshots_azure>>.
3320
3321 .Prerequisites
3322 - You must be using HBase 1.0 or higher and Hadoop 2.6.1 or higher, which is the first
3323 configuration that uses the Amazon AWS SDK.
3324 - You must use the `s3a://` protocol to connect to Amazon S3. The older `s3n://`
3325 and `s3://` protocols have various limitations and do not use the Amazon AWS SDK.
3326 - The `s3a://` URI must be configured and available on the server where you run
3327 the commands to export and restore the snapshot.
3328
3329 After you have fulfilled the prerequisites, take the snapshot like you normally would.
3330 Afterward, you can export it using the `org.apache.hadoop.hbase.snapshot.ExportSnapshot`
3331 command like the one below, substituting your own `s3a://` path in the `copy-from`
3332 or `copy-to` directive and substituting or modifying other options as required:
3333
3334 ----
3335 $ hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
3336     -snapshot MySnapshot \
3337     -copy-from hdfs://srv2:8082/hbase \
3338     -copy-to s3a://<bucket>/<namespace>/hbase \
3339     -chuser MyUser \
3340     -chgroup MyGroup \
3341     -chmod 700 \
3342     -mappers 16
3343 ----
3344
3345 ----
3346 $ hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
3347     -snapshot MySnapshot
3348     -copy-from s3a://<bucket>/<namespace>/hbase \
3349     -copy-to hdfs://srv2:8082/hbase \
3350     -chuser MyUser \
3351     -chgroup MyGroup \
3352     -chmod 700 \
3353     -mappers 16
3354 ----
3355
3356 You can also use the `org.apache.hadoop.hbase.snapshot.SnapshotInfo` utility with the `s3a://` path by including the
3357 `-remote-dir` option.
3358
3359 ----
3360 $ hbase org.apache.hadoop.hbase.snapshot.SnapshotInfo \
3361     -remote-dir s3a://<bucket>/<namespace>/hbase \
3362     -list-snapshots
3363 ----
3364
3365 [[snapshots_azure]]
3366 == Storing Snapshots in Microsoft Azure Blob Storage
3367
3368 You can store snapshots in Microsoft Azure Blog Storage using the same techniques
3369 as in <<snapshots_s3>>.
3370
3371 .Prerequisites
3372 - You must be using HBase 1.2 or higher with Hadoop 2.7.1 or
3373   higher. No version of HBase supports Hadoop 2.7.0.
3374 - Your hosts must be configured to be aware of the Azure blob storage filesystem.
3375   See https://hadoop.apache.org/docs/r2.7.1/hadoop-azure/index.html.
3376
3377 After you meet the prerequisites, follow the instructions
3378 in <<snapshots_s3>>, replacingthe protocol specifier with `wasb://` or `wasbs://`.
3379
3380 [[ops.capacity]]
3381 == Capacity Planning and Region Sizing
3382
3383 There are several considerations when planning the capacity for an HBase cluster and performing the initial configuration.
3384 Start with a solid understanding of how HBase handles data internally.
3385
3386 [[ops.capacity.nodes]]
3387 === Node count and hardware/VM configuration
3388
3389 [[ops.capacity.nodes.datasize]]
3390 ==== Physical data size
3391
3392 Physical data size on disk is distinct from logical size of your data and is affected by the following:
3393
3394 * Increased by HBase overhead
3395 +
3396 * See <<keyvalue,keyvalue>> and <<keysize,keysize>>.
3397   At least 24 bytes per key-value (cell), can be more.
3398   Small keys/values means more relative overhead.
3399 * KeyValue instances are aggregated into blocks, which are indexed.
3400   Indexes also have to be stored.
3401   Blocksize is configurable on a per-ColumnFamily basis.
3402   See <<regions.arch,regions.arch>>.
3403
3404 * Decreased by <<compression,compression>> and data block encoding, depending on data.
3405   You might want to test what compression and encoding (if any) make sense for your data.
3406 * Increased by size of region server <<wal,wal>> (usually fixed and negligible - less than half of RS memory size, per RS).
3407 * Increased by HDFS replication - usually x3.
3408
3409 Aside from the disk space necessary to store the data, one RS may not be able to serve arbitrarily large amounts of data due to some practical limits on region count and size (see <<ops.capacity.regions,ops.capacity.regions>>).
3410
3411 [[ops.capacity.nodes.throughput]]
3412 ==== Read/Write throughput
3413
3414 Number of nodes can also be driven by required throughput for reads and/or writes.
3415 The throughput one can get per node depends a lot on data (esp.
3416 key/value sizes) and request patterns, as well as node and system configuration.
3417 Planning should be done for peak load if it is likely that the load would be the main driver of the increase of the node count.
3418 PerformanceEvaluation and <<ycsb,ycsb>> tools can be used to test single node or a test cluster.
3419
3420 For write, usually 5-15Mb/s per RS can be expected, since every region server has only one active WAL.
3421 There's no good estimate for reads, as it depends vastly on data, requests, and cache hit rate. <<perf.casestudy,perf.casestudy>> might be helpful.
3422
3423 [[ops.capacity.nodes.gc]]
3424 ==== JVM GC limitations
3425
3426 RS cannot currently utilize very large heap due to cost of GC.
3427 There's also no good way of running multiple RS-es per server (other than running several VMs per machine). Thus, ~20-24Gb or less memory dedicated to one RS is recommended.
3428 GC tuning is required for large heap sizes.
3429 See <<gcpause,gcpause>>, <<trouble.log.gc,trouble.log.gc>> and elsewhere (TODO: where?)
3430
3431 [[ops.capacity.regions]]
3432 === Determining region count and size
3433
3434 Generally less regions makes for a smoother running cluster (you can always manually split the big regions later (if necessary) to spread the data, or request load, over the cluster); 20-200 regions per RS is a reasonable range.
3435 The number of regions cannot be configured directly (unless you go for fully <<disable.splitting,disable.splitting>>); adjust the region size to achieve the target region size given table size.
3436
3437 When configuring regions for multiple tables, note that most region settings can be set on a per-table basis via link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HTableDescriptor.html[HTableDescriptor], as well as shell commands.
3438 These settings will override the ones in `hbase-site.xml`.
3439 That is useful if your tables have different workloads/use cases.
3440
3441 Also note that in the discussion of region sizes here, _HDFS replication factor is not (and should not be) taken into account, whereas
3442           other factors <<ops.capacity.nodes.datasize,ops.capacity.nodes.datasize>> should be._ So, if your data is compressed and replicated 3 ways by HDFS, "9 Gb region" means 9 Gb of compressed data.
3443 HDFS replication factor only affects your disk usage and is invisible to most HBase code.
3444
3445 ==== Viewing the Current Number of Regions
3446
3447 You can view the current number of regions for a given table using the HMaster UI.
3448 In the [label]#Tables# section, the number of online regions for each table is listed in the [label]#Online Regions# column.
3449 This total only includes the in-memory state and does not include disabled or offline regions.
3450
3451 [[ops.capacity.regions.count]]
3452 ==== Number of regions per RS - upper bound
3453
3454 In production scenarios, where you have a lot of data, you are normally concerned with the maximum number of regions you can have per server. <<too_many_regions,too many regions>>          has technical discussion on the subject.
3455 Basically, the maximum number of regions is mostly determined by memstore memory usage.
3456 Each region has its own memstores; these grow up to a configurable size; usually in 128-256 MB range, see <<hbase.hregion.memstore.flush.size,hbase.hregion.memstore.flush.size>>.
3457 One memstore exists per column family (so there's only one per region if there's one CF in the table). The RS dedicates some fraction of total memory to its memstores (see <<hbase.regionserver.global.memstore.size,hbase.regionserver.global.memstore.size>>). If this memory is exceeded (too much memstore usage), it can cause undesirable consequences such as unresponsive server or compaction storms.
3458 A good starting point for the number of regions per RS (assuming one table) is:
3459
3460 [source]
3461 ----
3462 ((RS memory) * (total memstore fraction)) / ((memstore size)*(# column families))
3463 ----
3464
3465 This formula is pseudo-code.
3466 Here are two formulas using the actual tunable parameters, first for HBase 0.98+ and second for HBase 0.94.x.
3467
3468 HBase 0.98.x::
3469 ----
3470 ((RS Xmx) * hbase.regionserver.global.memstore.size) / (hbase.hregion.memstore.flush.size * (# column families))
3471 ----
3472 HBase 0.94.x::
3473 ----
3474 ((RS Xmx) * hbase.regionserver.global.memstore.upperLimit) / (hbase.hregion.memstore.flush.size * (# column families))+
3475 ----
3476
3477 If a given RegionServer has 16 GB of RAM, with default settings, the formula works out to 16384*0.4/128 ~ 51 regions per RS is a starting point.
3478 The formula can be extended to multiple tables; if they all have the same configuration, just use the total number of families.
3479
3480 This number can be adjusted; the formula above assumes all your regions are filled at approximately the same rate.
3481 If only a fraction of your regions are going to be actively written to, you can divide the result by that fraction to get a larger region count.
3482 Then, even if all regions are written to, all region memstores are not filled evenly, and eventually jitter appears even if they are (due to limited number of concurrent flushes). Thus, one can have as many as 2-3 times more regions than the starting point; however, increased numbers carry increased risk.
3483
3484 For write-heavy workload, memstore fraction can be increased in configuration at the expense of block cache; this will also allow one to have more regions.
3485
3486 [[ops.capacity.regions.mincount]]
3487 ==== Number of regions per RS - lower bound
3488
3489 HBase scales by having regions across many servers.
3490 Thus if you have 2 regions for 16GB data, on a 20 node machine your data will be concentrated on just a few machines - nearly the entire cluster will be idle.
3491 This really can't be stressed enough, since a common problem is loading 200MB data into HBase and then wondering why your awesome 10 node cluster isn't doing anything.
3492
3493 On the other hand, if you have a very large amount of data, you may also want to go for a larger number of regions to avoid having regions that are too large.
3494
3495 [[ops.capacity.regions.size]]
3496 ==== Maximum region size
3497
3498 For large tables in production scenarios, maximum region size is mostly limited by compactions - very large compactions, esp.
3499 major, can degrade cluster performance.
3500 Currently, the recommended maximum region size is 10-20Gb, and 5-10Gb is optimal.
3501 For older 0.90.x codebase, the upper-bound of regionsize is about 4Gb, with a default of 256Mb.
3502
3503 The size at which the region is split into two is generally configured via <<hbase.hregion.max.filesize,hbase.hregion.max.filesize>>; for details, see <<arch.region.splits,arch.region.splits>>.
3504
3505 If you cannot estimate the size of your tables well, when starting off, it's probably best to stick to the default region size, perhaps going smaller for hot tables (or manually split hot regions to spread the load over the cluster), or go with larger region sizes if your cell sizes tend to be largish (100k and up).
3506
3507 In HBase 0.98, experimental stripe compactions feature was added that would allow for larger regions, especially for log data.
3508 See <<ops.stripe,ops.stripe>>.
3509
3510 [[ops.capacity.regions.total]]
3511 ==== Total data size per region server
3512
3513 According to above numbers for region size and number of regions per region server, in an optimistic estimate 10 GB x 100 regions per RS will give up to 1TB served per region server, which is in line with some of the reported multi-PB use cases.
3514 However, it is important to think about the data vs cache size ratio at the RS level.
3515 With 1TB of data per server and 10 GB block cache, only 1% of the data will be cached, which may barely cover all block indices.
3516
3517 [[ops.capacity.config]]
3518 === Initial configuration and tuning
3519
3520 First, see <<important_configurations,important configurations>>.
3521 Note that some configurations, more than others, depend on specific scenarios.
3522 Pay special attention to:
3523
3524 * <<hbase.regionserver.handler.count,hbase.regionserver.handler.count>> - request handler thread count, vital for high-throughput workloads.
3525 * <<config.wals,config.wals>> - the blocking number of WAL files depends on your memstore configuration and should be set accordingly to prevent potential blocking when doing high volume of writes.
3526
3527 Then, there are some considerations when setting up your cluster and tables.
3528
3529 [[ops.capacity.config.compactions]]
3530 ==== Compactions
3531
3532 Depending on read/write volume and latency requirements, optimal compaction settings may be different.
3533 See <<compaction,compaction>> for some details.
3534
3535 When provisioning for large data sizes, however, it's good to keep in mind that compactions can affect write throughput.
3536 Thus, for write-intensive workloads, you may opt for less frequent compactions and more store files per regions.
3537 Minimum number of files for compactions (`hbase.hstore.compaction.min`) can be set to higher value; <<hbase.hstore.blockingStoreFiles,hbase.hstore.blockingStoreFiles>> should also be increased, as more files might accumulate in such case.
3538 You may also consider manually managing compactions: <<managed.compactions,managed.compactions>>
3539
3540 [[ops.capacity.config.presplit]]
3541 ==== Pre-splitting the table
3542
3543 Based on the target number of the regions per RS (see <<ops.capacity.regions.count,ops.capacity.regions.count>>) and number of RSes, one can pre-split the table at creation time.
3544 This would both avoid some costly splitting as the table starts to fill up, and ensure that the table starts out already distributed across many servers.
3545
3546 If the table is expected to grow large enough to justify that, at least one region per RS should be created.
3547 It is not recommended to split immediately into the full target number of regions (e.g.
3548 50 * number of RSes), but a low intermediate value can be chosen.
3549 For multiple tables, it is recommended to be conservative with presplitting (e.g.
3550 pre-split 1 region per RS at most), especially if you don't know how much each table will grow.
3551 If you split too much, you may end up with too many regions, with some tables having too many small regions.
3552
3553 For pre-splitting howto, see <<manual_region_splitting_decisions,manual region splitting decisions>> and <<precreate.regions,precreate.regions>>.
3554
3555 [[table.rename]]
3556 == Table Rename
3557
3558 In versions 0.90.x of hbase and earlier, we had a simple script that would rename the hdfs table directory and then do an edit of the hbase:meta table replacing all mentions of the old table name with the new.
3559 The script was called `./bin/rename_table.rb`.
3560 The script was deprecated and removed mostly because it was unmaintained and the operation performed by the script was brutal.
3561
3562 As of hbase 0.94.x, you can use the snapshot facility renaming a table.
3563 Here is how you would do it using the hbase shell:
3564
3565 ----
3566 hbase shell> disable 'tableName'
3567 hbase shell> snapshot 'tableName', 'tableSnapshot'
3568 hbase shell> clone_snapshot 'tableSnapshot', 'newTableName'
3569 hbase shell> delete_snapshot 'tableSnapshot'
3570 hbase shell> drop 'tableName'
3571 ----
3572
3573 or in code it would be as follows:
3574
3575 [source,java]
3576 ----
3577 void rename(Admin admin, String oldTableName, TableName newTableName) {
3578   String snapshotName = randomName();
3579   admin.disableTable(oldTableName);
3580   admin.snapshot(snapshotName, oldTableName);
3581   admin.cloneSnapshot(snapshotName, newTableName);
3582   admin.deleteSnapshot(snapshotName);
3583   admin.deleteTable(oldTableName);
3584 }
3585 ----
3586
3587 [[rsgroup]]
3588 == RegionServer Grouping
3589 RegionServer Grouping (A.K.A `rsgroup`) is an advanced feature for
3590 partitioning regionservers into distinctive groups for strict isolation. It
3591 should only be used by users who are sophisticated enough to understand the
3592 full implications and have a sufficient background in managing HBase clusters.
3593 It was developed by Yahoo! and they run it at scale on their large grid cluster.
3594 See link:http://www.slideshare.net/HBaseCon/keynote-apache-hbase-at-yahoo-scale[HBase at Yahoo! Scale].
3595
3596 RSGroups can be defined and managed with both admin methods and shell commands.
3597 A server can be added to a group with hostname and port pair and tables
3598 can be moved to this group so that only regionservers in the same rsgroup can
3599 host the regions of the table. The group for a table is stored in its
3600 TableDescriptor, the property name is `hbase.rsgroup.name`. You can also set
3601 this property on a namespace, so it will cause all the tables under this
3602 namespace to be placed into this group. RegionServers and tables can only
3603 belong to one rsgroup at a time. By default, all tables and regionservers
3604 belong to the `default` rsgroup. System tables can also be put into a
3605 rsgroup using the regular APIs. A custom balancer implementation tracks
3606 assignments per rsgroup and makes sure to move regions to the relevant
3607 regionservers in that rsgroup. The rsgroup information is stored in a regular
3608 HBase table, and a zookeeper-based read-only cache is used at cluster bootstrap
3609 time.
3610
3611 To enable, add the following to your hbase-site.xml and restart your Master:
3612
3613 [source,xml]
3614 ----
3615  <property>
3616    <name>hbase.balancer.rsgroup.enabled</name>
3617    <value>true</value>
3618  </property>
3619 ----
3620
3621 Then use the admin/shell _rsgroup_ methods/commands to create and manipulate
3622 RegionServer groups: e.g. to add a rsgroup and then add a server to it.
3623 To see the list of rsgroup commands available in the hbase shell type:
3624
3625 [source, bash]
3626 ----
3627  hbase(main):008:0> help 'rsgroup'
3628  Took 0.5610 seconds
3629 ----
3630
3631 High level, you create a rsgroup that is other than the `default` group using
3632 _add_rsgroup_ command. You then add servers and tables to this group with the
3633 _move_servers_rsgroup_ and _move_tables_rsgroup_ commands. If necessary, run
3634 a balance for the group if tables are slow to migrate to the groups dedicated
3635 server with the _balance_rsgroup_ command (Usually this is not needed). To
3636 monitor effect of the commands, see the `Tables` tab toward the end of the
3637 Master UI home page. If you click on a table, you can see what servers it is
3638 deployed across. You should see here a reflection of the grouping done with
3639 your shell commands. View the master log if issues.
3640
3641 Here is example using a few of the rsgroup commands. To add a group, do as
3642 follows:
3643
3644 [source, bash]
3645 ----
3646  hbase(main):008:0> add_rsgroup 'my_group'
3647  Took 0.5610 seconds
3648 ----
3649
3650
3651 .RegionServer Groups must be Enabled
3652 [NOTE]
3653 ====
3654 If you have not enabled the rsgroup feature and you call any of the rsgroup
3655 admin methods or shell commands the call will fail with a
3656 `DoNotRetryIOException` with a detail message that says the rsgroup feature
3657 is disabled.
3658 ====
3659
3660 Add a server (specified by hostname + port) to the just-made group using the
3661 _move_servers_rsgroup_ command as follows:
3662
3663 [source, bash]
3664 ----
3665  hbase(main):010:0> move_servers_rsgroup 'my_group',['k.att.net:51129']
3666 ----
3667
3668 .Hostname and Port vs ServerName
3669 [NOTE]
3670 ====
3671 The rsgroup feature refers to servers in a cluster with hostname and port only.
3672 It does not make use of the HBase ServerName type identifying RegionServers;
3673 i.e. hostname + port + starttime to distinguish RegionServer instances. The
3674 rsgroup feature keeps working across RegionServer restarts so the starttime of
3675 ServerName -- and hence the ServerName type -- is not appropriate.
3676 Administration
3677 ====
3678
3679 Servers come and go over the lifetime of a Cluster. Currently, you must
3680 manually align the servers referenced in rsgroups with the actual state of
3681 nodes in the running cluster. What we mean by this is that if you decommission
3682 a server, then you must update rsgroups as part of your server decommission
3683 process removing references. Notice that, by calling `clearDeadServers`
3684 manually will also remove the dead servers from any rsgroups, but the problem
3685 is that we will lost track of the dead servers after master restarts, which
3686 means you still need to update the rsgroup by your own.
3687
3688 Please use `Admin.removeServersFromRSGroup` or shell command
3689 _remove_servers_rsgroup_ to remove decommission servers from rsgroup.
3690
3691 The `default` group is not like other rsgroups in that it is dynamic. Its server
3692 list mirrors the current state of the cluster; i.e. if you shutdown a server that
3693 was part of the `default` rsgroup, and then do a _get_rsgroup_ `default` to list
3694 its content in the shell, the server will no longer be listed. For non-default
3695 groups, though a mode may be offline, it will persist in the non-default group’s
3696 list of servers. But if you move the offline server from the non-default rsgroup
3697 to default, it will not show in the `default` list. It will just be dropped.
3698
3699 === Best Practice
3700 The authors of the rsgroup feature, the Yahoo! HBase Engineering team, have been
3701 running it on their grid for a good while now and have come up with a few best
3702 practices informed by their experience.
3703
3704 ==== Isolate System Tables
3705 Either have a system rsgroup where all the system tables are or just leave the
3706 system tables in `default` rsgroup and have all user-space tables are in
3707 non-default rsgroups.
3708
3709 ==== Dead Nodes
3710 Yahoo! Have found it useful at their scale to keep a special rsgroup of dead or
3711 questionable nodes; this is one means of keeping them out of the running until repair.
3712
3713 Be careful replacing dead nodes in an rsgroup. Ensure there are enough live nodes
3714 before you start moving out the dead. Move in good live nodes first if you have to.
3715
3716 === Troubleshooting
3717 Viewing the Master log will give you insight on rsgroup operation.
3718
3719 If it appears stuck, restart the Master process.
3720
3721 === Remove RegionServer Grouping
3722 Simply disable RegionServer Grouping feature is easy, just remove the
3723 'hbase.balancer.rsgroup.enabled' from hbase-site.xml or explicitly set it to
3724 false in hbase-site.xml.
3725
3726 [source,xml]
3727 ----
3728  <property>
3729    <name>hbase.balancer.rsgroup.enabled</name>
3730    <value>false</value>
3731  </property>
3732 ----
3733
3734 But if you change the 'hbase.balancer.rsgroup.enabled' to true, the old rsgroup
3735 configs will take effect again. So if you want to completely remove the
3736 RegionServer Grouping feature from a cluster, so that if the feature is
3737 re-enabled in the future, the old meta data will not affect the functioning of
3738 the cluster, there are more steps to do.
3739
3740 - Move all tables in non-default rsgroups to `default` regionserver group
3741 [source,bash]
3742 ----
3743 #Reassigning table t1 from non default group - hbase shell
3744 hbase(main):005:0> move_tables_rsgroup 'default',['t1']
3745 ----
3746 - Move all regionservers in non-default rsgroups to `default` regionserver group
3747 [source, bash]
3748 ----
3749 #Reassigning all the servers in the non-default rsgroup to default - hbase shell
3750 hbase(main):008:0> move_servers_rsgroup 'default',['rs1.xxx.com:16206','rs2.xxx.com:16202','rs3.xxx.com:16204']
3751 ----
3752 - Remove all non-default rsgroups. `default` rsgroup created implicitly doesn't have to be removed
3753 [source,bash]
3754 ----
3755 #removing non default rsgroup - hbase shell
3756 hbase(main):009:0> remove_rsgroup 'group2'
3757 ----
3758 - Remove the changes made in `hbase-site.xml` and restart the cluster
3759 - Drop the table `hbase:rsgroup` from `hbase`
3760 [source, bash]
3761 ----
3762 #Through hbase shell drop table hbase:rsgroup
3763 hbase(main):001:0> disable 'hbase:rsgroup'
3764 0 row(s) in 2.6270 seconds
3765
3766 hbase(main):002:0> drop 'hbase:rsgroup'
3767 0 row(s) in 1.2730 seconds
3768 ----
3769 - Remove znode `rsgroup` from the cluster ZooKeeper using zkCli.sh
3770 [source, bash]
3771 ----
3772 #From ZK remove the node /hbase/rsgroup through zkCli.sh
3773 rmr /hbase/rsgroup
3774 ----
3775
3776 === ACL
3777 To enable ACL, add the following to your hbase-site.xml and restart your Master:
3778
3779 [source,xml]
3780 ----
3781 <property>
3782   <name>hbase.security.authorization</name>
3783   <value>true</value>
3784 <property>
3785 ----
3786 [[migrating.rsgroup]]
3787 === Migrating From Old Implementation
3788 The coprocessor `org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint` is
3789 deprected, but for compatible, if you want the pre 3.0.0 hbase client/shell
3790 to communicate with the new hbase cluster, you still need to add this
3791 coprocessor to master.
3792
3793 The `hbase.rsgroup.grouploadbalancer.class` config has been deprecated, as now
3794 the top level load balancer will always be `RSGroupBasedLoadBalaner`, and the
3795 `hbase.master.loadbalancer.class` config is for configuring the balancer within
3796 a group. This also means you should not set `hbase.master.loadbalancer.class`
3797 to `RSGroupBasedLoadBalaner` any more even if rsgroup feature is enabled.
3798
3799 And we have done some special changes for compatibility. First, if coprocessor
3800 `org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint` is specified, the
3801 `hbase.balancer.rsgroup.enabled` flag will be set to true automatically to
3802 enable rs group feature. Second, we will load
3803 `hbase.rsgroup.grouploadbalancer.class` prior to
3804 `hbase.master.loadbalancer.class`. And last, if you do not set
3805 `hbase.rsgroup.grouploadbalancer.class` but only set
3806 `hbase.master.loadbalancer.class` to `RSGroupBasedLoadBalancer`, we will load
3807 the default load balancer to avoid infinite nesting. This means you do not need
3808 to change anything when upgrading if you have already enabled rs group feature.
3809
3810 The main difference comparing to the old implementation is that, now the
3811 rsgroup for a table is stored in `TableDescriptor`, instead of in
3812 `RSGroupInfo`, so the `getTables` method of `RSGroupInfo` has been deprecated.
3813 And if you use the `Admin` methods to get the `RSGroupInfo`, its `getTables`
3814 method will always return empty. This is because that in the old
3815 implementation, this method is a bit broken as you can set rsgroup on namespace
3816 and make all the tables under this namespace into this group but you can not
3817 get these tables through `RSGroupInfo.getTables`. Now you should use the two
3818 new methods `listTablesInRSGroup` and
3819 `getConfiguredNamespacesAndTablesInRSGroup` in `Admin` to get tables and
3820 namespaces in a rsgroup.
3821
3822 Of course the behavior for the old RSGroupAdminEndpoint is not changed,
3823 we will fill the tables field of the RSGroupInfo before returning, to make it
3824 compatible with old hbase client/shell.
3825
3826 When upgrading, the migration between the RSGroupInfo and TableDescriptor will
3827 be done automatically. It will take sometime, but it is fine to restart master
3828 in the middle, the migration will continue after restart. And during the
3829 migration, the rs group feature will still work and in most cases the region
3830 will not be misplaced(since this is only a one time job and will not last too
3831 long so we have not test it very seriously to make sure the region will not be
3832 misplaced always, so we use the word 'in most cases'). The implementation is a
3833 bit tricky, you can see the code in `RSGroupInfoManagerImpl.migrate` if
3834 interested.
3835
3836
3837
3838
3839 [[normalizer]]
3840 == Region Normalizer
3841
3842 The Region Normalizer tries to make Regions all in a table about the same in
3843 size. It does this by first calculating total table size and average size per
3844 region. It splits any region that is larger than twice this size. Any region
3845 that is much smaller is merged into an adjacent region. The Normalizer runs on
3846 a regular schedule, which is configurable. It can also be disabled entirely via
3847 a runtime "switch". It can be run manually via the shell or Admin API call.
3848 Even if normally disabled, it is good to run manually after the cluster has
3849 been running a while or say after a burst of activity such as a large delete.
3850
3851 The Normalizer works well for bringing a table's region boundaries into
3852 alignment with the reality of data distribution after an initial effort at
3853 pre-splitting a table. It is also a nice compliment to the data TTL feature
3854 when the schema includes timestamp in the rowkey, as it will automatically
3855 merge away regions whose contents have expired.
3856
3857 (The bulk of the below detail was copied wholesale from the blog by Romil Choksi at
3858 link:https://community.hortonworks.com/articles/54987/hbase-region-normalizer.html[HBase Region Normalizer]).
3859
3860 The Region Normalizer is feature available since HBase-1.2. It runs a set of
3861 pre-calculated merge/split actions to resize regions that are either too
3862 large or too small compared to the average region size for a given table. Region
3863 Normalizer when invoked computes a normalization 'plan' for all of the tables in
3864 HBase. System tables (such as hbase:meta, hbase:namespace, Phoenix system tables
3865 etc) and user tables with normalization disabled are ignored while computing the
3866 plan. For normalization enabled tables, normalization plan is carried out in
3867 parallel across multiple tables.
3868
3869 Normalizer can be enabled or disabled globally for the entire cluster using the
3870 ‘normalizer_switch’ command in the HBase shell. Normalization can also be
3871 controlled on a per table basis, which is disabled by default when a table is
3872 created. Normalization for a table can be enabled or disabled by setting the
3873 NORMALIZATION_ENABLED table attribute to true or false.
3874
3875 To check normalizer status and enable/disable normalizer
3876
3877 [source,bash]
3878 ----
3879 hbase(main):001:0> normalizer_enabled
3880 true
3881 0 row(s) in 0.4870 seconds
3882
3883 hbase(main):002:0> normalizer_switch false
3884 true
3885 0 row(s) in 0.0640 seconds
3886
3887 hbase(main):003:0> normalizer_enabled
3888 false
3889 0 row(s) in 0.0120 seconds
3890
3891 hbase(main):004:0> normalizer_switch true
3892 false
3893 0 row(s) in 0.0200 seconds
3894
3895 hbase(main):005:0> normalizer_enabled
3896 true
3897 0 row(s) in 0.0090 seconds
3898 ----
3899
3900 When enabled, Normalizer is invoked in the background every 5 mins (by default),
3901 which can be configured using `hbase.normalization.period` in `hbase-site.xml`.
3902 Normalizer can also be invoked manually/programmatically at will using HBase shell’s
3903 `normalize` command. HBase by default uses `SimpleRegionNormalizer`, but users can
3904 design their own normalizer as long as they implement the RegionNormalizer Interface.
3905 Details about the logic used by `SimpleRegionNormalizer` to compute its normalization
3906 plan can be found link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/master/normalizer/SimpleRegionNormalizer.html[here].
3907
3908 The below example shows a normalization plan being computed for an user table, and
3909 merge action being taken as a result of the normalization plan computed by SimpleRegionNormalizer.
3910
3911 Consider an user table with some pre-split regions having 3 equally large regions
3912 (about 100K rows) and 1 relatively small region (about 25K rows). Following is the
3913 snippet from an hbase meta table scan showing each of the pre-split regions for
3914 the user table.
3915
3916 ----
3917 table_p8ddpd6q5z,,1469494305548.68b9892220865cb6048 column=info:regioninfo, timestamp=1469494306375, value={ENCODED => 68b9892220865cb604809c950d1adf48, NAME => 'table_p8ddpd6q5z,,1469494305548.68b989222 09c950d1adf48.   0865cb604809c950d1adf48.', STARTKEY => '', ENDKEY => '1'}
3918 ....
3919 table_p8ddpd6q5z,1,1469494317178.867b77333bdc75a028 column=info:regioninfo, timestamp=1469494317848, value={ENCODED => 867b77333bdc75a028bb4c5e4b235f48, NAME => 'table_p8ddpd6q5z,1,1469494317178.867b7733 bb4c5e4b235f48.  3bdc75a028bb4c5e4b235f48.', STARTKEY => '1', ENDKEY => '3'}
3920 ....
3921 table_p8ddpd6q5z,3,1469494328323.98f019a753425e7977 column=info:regioninfo, timestamp=1469494328486, value={ENCODED => 98f019a753425e7977ab8636e32deeeb, NAME => 'table_p8ddpd6q5z,3,1469494328323.98f019a7 ab8636e32deeeb.  53425e7977ab8636e32deeeb.', STARTKEY => '3', ENDKEY => '7'}
3922 ....
3923 table_p8ddpd6q5z,7,1469494339662.94c64e748979ecbb16 column=info:regioninfo, timestamp=1469494339859, value={ENCODED => 94c64e748979ecbb166f6cc6550e25c6, NAME => 'table_p8ddpd6q5z,7,1469494339662.94c64e74 6f6cc6550e25c6.   8979ecbb166f6cc6550e25c6.', STARTKEY => '7', ENDKEY => '8'}
3924 ....
3925 table_p8ddpd6q5z,8,1469494339662.6d2b3f5fd1595ab8e7 column=info:regioninfo, timestamp=1469494339859, value={ENCODED => 6d2b3f5fd1595ab8e7c031876057b1ee, NAME => 'table_p8ddpd6q5z,8,1469494339662.6d2b3f5f c031876057b1ee.   d1595ab8e7c031876057b1ee.', STARTKEY => '8', ENDKEY => ''}
3926 ----
3927 Invoking the normalizer using ‘normalize’ int the HBase shell, the below log snippet
3928 from HMaster log shows the normalization plan computed as per the logic defined for
3929 SimpleRegionNormalizer. Since the total region size (in MB) for the adjacent smallest
3930 regions in the table is less than the average region size, the normalizer computes a
3931 plan to merge these two regions.
3932
3933 ----
3934 2016-07-26 07:08:26,928 DEBUG [B.fifo.QRpcServer.handler=20,queue=2,port=20000] master.HMaster: Skipping normalization for table: hbase:namespace, as it's either system table or doesn't have auto
3935 normalization turned on
3936 2016-07-26 07:08:26,928 DEBUG [B.fifo.QRpcServer.handler=20,queue=2,port=20000] master.HMaster: Skipping normalization for table: hbase:backup, as it's either system table or doesn't have auto normalization turned on
3937 2016-07-26 07:08:26,928 DEBUG [B.fifo.QRpcServer.handler=20,queue=2,port=20000] master.HMaster: Skipping normalization for table: hbase:meta, as it's either system table or doesn't have auto normalization turned on
3938 2016-07-26 07:08:26,928 DEBUG [B.fifo.QRpcServer.handler=20,queue=2,port=20000] master.HMaster: Skipping normalization for table: table_h2osxu3wat, as it's either system table or doesn't have autonormalization turned on
3939 2016-07-26 07:08:26,928 DEBUG [B.fifo.QRpcServer.handler=20,queue=2,port=20000] normalizer.SimpleRegionNormalizer: Computing normalization plan for table: table_p8ddpd6q5z, number of regions: 5
3940 2016-07-26 07:08:26,929 DEBUG [B.fifo.QRpcServer.handler=20,queue=2,port=20000] normalizer.SimpleRegionNormalizer: Table table_p8ddpd6q5z, total aggregated regions size: 12
3941 2016-07-26 07:08:26,929 DEBUG [B.fifo.QRpcServer.handler=20,queue=2,port=20000] normalizer.SimpleRegionNormalizer: Table table_p8ddpd6q5z, average region size: 2.4
3942 2016-07-26 07:08:26,929 INFO  [B.fifo.QRpcServer.handler=20,queue=2,port=20000] normalizer.SimpleRegionNormalizer: Table table_p8ddpd6q5z, small region size: 0 plus its neighbor size: 0, less thanthe avg size 2.4, merging them
3943 2016-07-26 07:08:26,971 INFO  [B.fifo.QRpcServer.handler=20,queue=2,port=20000] normalizer.MergeNormalizationPlan: Executing merging normalization plan: MergeNormalizationPlan{firstRegion={ENCODED=> d51df2c58e9b525206b1325fd925a971, NAME => 'table_p8ddpd6q5z,,1469514755237.d51df2c58e9b525206b1325fd925a971.', STARTKEY => '', ENDKEY => '1'}, secondRegion={ENCODED => e69c6b25c7b9562d078d9ad3994f5330, NAME => 'table_p8ddpd6q5z,1,1469514767669.e69c6b25c7b9562d078d9ad3994f5330.',
3944 STARTKEY => '1', ENDKEY => '3'}}
3945 ----
3946 Region normalizer as per it’s computed plan, merged the region with start key as ‘’
3947 and end key as ‘1’, with another region having start key as ‘1’ and end key as ‘3’.
3948 Now, that these regions have been merged we see a single new region with start key
3949 as ‘’ and end key as ‘3’
3950 ----
3951 table_p8ddpd6q5z,,1469516907210.e06c9b83c4a252b130e column=info:mergeA, timestamp=1469516907431,
3952 value=PBUF\x08\xA5\xD9\x9E\xAF\xE2*\x12\x1B\x0A\x07default\x12\x10table_p8ddpd6q5z\x1A\x00"\x011(\x000\x00 ea74d246741ba.   8\x00
3953 table_p8ddpd6q5z,,1469516907210.e06c9b83c4a252b130e column=info:mergeB, timestamp=1469516907431,
3954 value=PBUF\x08\xB5\xBA\x9F\xAF\xE2*\x12\x1B\x0A\x07default\x12\x10table_p8ddpd6q5z\x1A\x011"\x013(\x000\x0 ea74d246741ba.   08\x00
3955 table_p8ddpd6q5z,,1469516907210.e06c9b83c4a252b130e column=info:regioninfo, timestamp=1469516907431, value={ENCODED => e06c9b83c4a252b130eea74d246741ba, NAME => 'table_p8ddpd6q5z,,1469516907210.e06c9b83c ea74d246741ba.   4a252b130eea74d246741ba.', STARTKEY => '', ENDKEY => '3'}
3956 ....
3957 table_p8ddpd6q5z,3,1469514778736.bf024670a847c0adff column=info:regioninfo, timestamp=1469514779417, value={ENCODED => bf024670a847c0adffb74b2e13408b32, NAME => 'table_p8ddpd6q5z,3,1469514778736.bf024670 b74b2e13408b32.  a847c0adffb74b2e13408b32.' STARTKEY => '3', ENDKEY => '7'}
3958 ....
3959 table_p8ddpd6q5z,7,1469514790152.7c5a67bc755e649db2 column=info:regioninfo, timestamp=1469514790312, value={ENCODED => 7c5a67bc755e649db22f49af6270f1e1, NAME => 'table_p8ddpd6q5z,7,1469514790152.7c5a67bc 2f49af6270f1e1.  755e649db22f49af6270f1e1.', STARTKEY => '7', ENDKEY => '8'}
3960 ....
3961 table_p8ddpd6q5z,8,1469514790152.58e7503cda69f98f47 column=info:regioninfo, timestamp=1469514790312, value={ENCODED => 58e7503cda69f98f4755178e74288c3a, NAME => 'table_p8ddpd6q5z,8,1469514790152.58e7503c 55178e74288c3a.  da69f98f4755178e74288c3a.', STARTKEY => '8', ENDKEY => ''}
3962 ----
3963
3964 A similar example can be seen for an user table with 3 smaller regions and 1
3965 relatively large region. For this example, we have an user table with 1 large region containing 100K rows, and 3 relatively smaller regions with about 33K rows each. As seen from the normalization plan, since the larger region is more than twice the average region size it ends being split into two regions – one with start key as ‘1’ and end key as ‘154717’ and the other region with start key as '154717' and end key as ‘3’
3966 ----
3967 2016-07-26 07:39:45,636 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] master.HMaster: Skipping normalization for table: hbase:backup, as it's either system table or doesn't have auto normalization turned on
3968 2016-07-26 07:39:45,636 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SimpleRegionNormalizer: Computing normalization plan for table: table_p8ddpd6q5z, number of regions: 4
3969 2016-07-26 07:39:45,636 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SimpleRegionNormalizer: Table table_p8ddpd6q5z, total aggregated regions size: 12
3970 2016-07-26 07:39:45,636 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SimpleRegionNormalizer: Table table_p8ddpd6q5z, average region size: 3.0
3971 2016-07-26 07:39:45,636 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SimpleRegionNormalizer: No normalization needed, regions look good for table: table_p8ddpd6q5z
3972 2016-07-26 07:39:45,636 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SimpleRegionNormalizer: Computing normalization plan for table: table_h2osxu3wat, number of regions: 5
3973 2016-07-26 07:39:45,636 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SimpleRegionNormalizer: Table table_h2osxu3wat, total aggregated regions size: 7
3974 2016-07-26 07:39:45,636 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SimpleRegionNormalizer: Table table_h2osxu3wat, average region size: 1.4
3975 2016-07-26 07:39:45,636 INFO  [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SimpleRegionNormalizer: Table table_h2osxu3wat, large region table_h2osxu3wat,1,1469515926544.27f2fdbb2b6612ea163eb6b40753c3db. has size 4, more than twice avg size, splitting
3976 2016-07-26 07:39:45,640 INFO [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SplitNormalizationPlan: Executing splitting normalization plan: SplitNormalizationPlan{regionInfo={ENCODED => 27f2fdbb2b6612ea163eb6b40753c3db, NAME => 'table_h2osxu3wat,1,1469515926544.27f2fdbb2b6612ea163eb6b40753c3db.', STARTKEY => '1', ENDKEY => '3'}, splitPoint=null}
3977 2016-07-26 07:39:45,656 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] master.HMaster: Skipping normalization for table: hbase:namespace, as it's either system table or doesn't have auto normalization turned on
3978 2016-07-26 07:39:45,656 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] master.HMaster: Skipping normalization for table: hbase:meta, as it's either system table or doesn't
3979 have auto normalization turned on …..…..….
3980 2016-07-26 07:39:46,246 DEBUG [AM.ZK.Worker-pool2-t278] master.RegionStates: Onlined 54de97dae764b864504704c1c8d3674a on hbase-test-rc-5.openstacklocal,16020,1469419333913 {ENCODED => 54de97dae764b864504704c1c8d3674a, NAME => 'table_h2osxu3wat,1,1469518785661.54de97dae764b864504704c1c8d3674a.', STARTKEY => '1', ENDKEY => '154717'}
3981 2016-07-26 07:39:46,246 INFO  [AM.ZK.Worker-pool2-t278] master.RegionStates: Transition {d6b5625df331cfec84dce4f1122c567f state=SPLITTING_NEW, ts=1469518786246, server=hbase-test-rc-5.openstacklocal,16020,1469419333913} to {d6b5625df331cfec84dce4f1122c567f state=OPEN, ts=1469518786246,
3982 server=hbase-test-rc-5.openstacklocal,16020,1469419333913}
3983 2016-07-26 07:39:46,246 DEBUG [AM.ZK.Worker-pool2-t278] master.RegionStates: Onlined d6b5625df331cfec84dce4f1122c567f on hbase-test-rc-5.openstacklocal,16020,1469419333913 {ENCODED => d6b5625df331cfec84dce4f1122c567f, NAME => 'table_h2osxu3wat,154717,1469518785661.d6b5625df331cfec84dce4f1122c567f.', STARTKEY => '154717', ENDKEY => '3'}
3984 ----
3985
3986
3987
3988 [[auto_reopen_regions]]
3989 == Auto Region Reopen
3990
3991 We can leak store reader references if a coprocessor or core function somehow
3992 opens a scanner, or wraps one, and then does not take care to call close on the
3993 scanner or the wrapped instance. Leaked store files can not be removed even
3994 after it is invalidated via compaction.
3995 A reasonable mitigation for a reader reference
3996 leak would be a fast reopen of the region on the same server.
3997 This will release all resources, like the refcount, leases, etc.
3998 The clients should gracefully ride over this like any other region in
3999 transition.
4000 By default this auto reopen of region feature would be disabled.
4001 To enabled it, please provide high ref count value for config
4002 `hbase.regions.recovery.store.file.ref.count`.
4003
4004 Please refer to config descriptions for
4005 `hbase.master.regions.recovery.check.interval` and
4006 `hbase.regions.recovery.store.file.ref.count`.
4007