4 * Licensed to the Apache Software Foundation (ASF) under one
5 * or more contributor license agreements. See the NOTICE file
6 * distributed with this work for additional information
7 * regarding copyright ownership. The ASF licenses this file
8 * to you under the Apache License, Version 2.0 (the
9 * "License"); you may not use this file except in compliance
10 * with the License. You may obtain a copy of the License at
12 * http://www.apache.org/licenses/LICENSE-2.0
14 * Unless required by applicable law or agreed to in writing, software
15 * distributed under the License is distributed on an "AS IS" BASIS,
16 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
17 * See the License for the specific language governing permissions and
18 * limitations under the License.
23 = Apache HBase Operational Management
30 This chapter will cover operational tools and practices required of a running Apache HBase cluster.
31 The subject of operations is related to the topics of <<trouble>>, <<performance>>, and <<configuration>> but is a distinct topic in itself.
34 == HBase Tools and Utilities
36 HBase provides several tools for administration, analysis, and debugging of your cluster.
37 The entry-point to most of these tools is the _bin/hbase_ command, though some tools are available in the _dev-support/_ directory.
To see usage instructions for the _bin/hbase_ command, run it with no arguments, or with the `-h` argument.
40 These are the usage instructions for HBase 0.98.x.
41 Some commands, such as `version`, `pe`, `ltt`, `clean`, are not available in previous versions.
45 Usage: hbase [<options>] <command> [<args>]
47 --config DIR Configuration direction to use. Default: ./conf
48 --hosts HOSTS Override the list in 'regionservers' file
49 --auth-as-server Authenticate to ZooKeeper using servers configuration
52 Some commands take arguments. Pass no args or -h for usage.
53 shell Run the HBase shell
54 hbck Run the HBase 'fsck' tool. Defaults read-only hbck1.
55 Pass '-j /path/to/HBCK2.jar' to run hbase-2.x HBCK2.
56 snapshot Tool for managing snapshots
57 wal Write-ahead-log analyzer
58 hfile Store file analyzer
59 zkcli Run the ZooKeeper shell
60 master Run an HBase HMaster node
61 regionserver Run an HBase HRegionServer node
62 zookeeper Run a ZooKeeper server
63 rest Run an HBase REST server
64 thrift Run the HBase Thrift server
65 thrift2 Run the HBase Thrift2 server
66 clean Run the HBase clean up script
67 jshell Run a jshell with HBase on the classpath
68 classpath Dump hbase CLASSPATH
69 mapredcp Dump CLASSPATH entries required by mapreduce
70 pe Run PerformanceEvaluation
72 canary Run the Canary tool
73 version Print the version
74 backup Backup tables for recovery
75 restore Restore tables from existing backup image
76 regionsplitter Run RegionSplitter tool
77 rowcounter Run RowCounter tool
78 cellcounter Run CellCounter tool
79 CLASSNAME Run the class named CLASSNAME
82 Some of the tools and utilities below are Java classes which are passed directly to the _bin/hbase_ command, as referred to in the last line of the usage instructions.
83 Others, such as `hbase shell` (<<shell>>), `hbase upgrade` (<<upgrading>>), and `hbase thrift` (<<thrift>>), are documented elsewhere in this guide.
87 The Canary tool can help users "canary-test" the HBase cluster status.
The default "region mode" fetches a row from every column family of every region.
89 In "regionserver mode", the Canary tool will fetch a row from a random
90 region on each of the cluster's RegionServers. In "zookeeper mode", the
91 Canary will read the root znode on each member of the zookeeper ensemble.
To see usage, pass the `-help` parameter (if you pass no
parameters, the Canary tool starts executing in the default
"region mode", fetching a row from every region in the cluster).
98 2018-10-16 13:11:27,037 INFO [main] tool.Canary: Execution thread count=16
99 Usage: canary [OPTIONS] [<TABLE1> [<TABLE2]...] | [<REGIONSERVER1> [<REGIONSERVER2]..]
101 -h,-help show this help and exit.
102 -regionserver set 'regionserver mode'; gets row from random region on server
103 -allRegions get from ALL regions when 'regionserver mode', not just random one.
104 -zookeeper set 'zookeeper mode'; grab zookeeper.znode.parent on each ensemble member
105 -daemon continuous check at defined intervals.
106 -interval <N> interval between checks in seconds
107 -e consider table/regionserver argument as regular expression
108 -f <B> exit on first error; default=true
109 -failureAsError treat read/write failure as error
110 -t <N> timeout for canary-test run; default=600000ms
111 -writeSniffing enable write sniffing
112 -writeTable the table used for write sniffing; default=hbase:canary
113 -writeTableTimeout <N> timeout for writeTable; default=600000ms
114 -readTableTimeouts <tableName>=<read timeout>,<tableName>=<read timeout>,...
115 comma-separated list of table read timeouts (no spaces);
116 logs 'ERROR' if takes longer. default=600000ms
117 -permittedZookeeperFailures <N> Ignore first N failures attempting to
118 connect to individual zookeeper nodes in ensemble
120 -D<configProperty>=<value> to assign or override configuration params
121 -Dhbase.canary.read.raw.enabled=<true/false> Set to enable/disable raw scan; default=false
123 Canary runs in one of three modes: region (default), regionserver, or zookeeper.
124 To sniff/probe all regions, pass no arguments.
125 To sniff/probe all regions of a table, pass tablename.
126 To sniff/probe regionservers, pass -regionserver, etc.
127 See http://hbase.apache.org/book.html#_canary for Canary documentation.
131 The `Sink` class is instantiated using the `hbase.canary.sink.class` configuration property.
This tool returns non-zero error codes to the user so that it can be integrated with other monitoring tools,
such as Nagios. The error code definitions are:
138 private static final int USAGE_EXIT_CODE = 1;
139 private static final int INIT_ERROR_EXIT_CODE = 2;
140 private static final int TIMEOUT_ERROR_EXIT_CODE = 3;
141 private static final int ERROR_EXIT_CODE = 4;
142 private static final int FAILURE_EXIT_CODE = 5;
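As a sketch of how these exit codes might be consumed by an external monitoring system (the wrapper script below, its timeout, and the mapping to Nagios statuses are illustrative assumptions, not part of HBase):

#!/usr/bin/env bash
# Hypothetical Nagios-style check that wraps the Canary; the exit-code mapping is an assumption.
"${HBASE_HOME}/bin/hbase" canary -t 60000 > /tmp/canary.log 2>&1
rc=$?
case "${rc}" in
  0) echo "OK - canary read succeeded";            exit 0 ;;  # Nagios OK
  3) echo "WARNING - canary timed out (code 3)";   exit 1 ;;  # Nagios WARNING
  *) echo "CRITICAL - canary failed (code ${rc})"; exit 2 ;;  # Nagios CRITICAL
esac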
Here are some examples based on the following case: two tables, test-01
and test-02, each with two column families, cf1 and cf2, deployed on 3 RegionServers.
See the following table.
149 [cols="1,1,1", options="header"]
The following are some example outputs based on the previous case.
161 ==== Canary test for every column family (store) of every region of every table
164 $ ${HBASE_HOME}/bin/hbase canary
13/12/09 03:26:32 INFO tool.Canary: read from region test-01,,1386230156732.0e3c7d77ffb6361ea1b996ac1042ca9a. column family cf1 in 2ms
167 13/12/09 03:26:32 INFO tool.Canary: read from region test-01,,1386230156732.0e3c7d77ffb6361ea1b996ac1042ca9a. column family cf2 in 2ms
168 13/12/09 03:26:32 INFO tool.Canary: read from region test-01,0004883,1386230156732.87b55e03dfeade00f441125159f8ca87. column family cf1 in 4ms
169 13/12/09 03:26:32 INFO tool.Canary: read from region test-01,0004883,1386230156732.87b55e03dfeade00f441125159f8ca87. column family cf2 in 1ms
171 13/12/09 03:26:32 INFO tool.Canary: read from region test-02,,1386559511167.aa2951a86289281beee480f107bb36ee. column family cf1 in 5ms
172 13/12/09 03:26:32 INFO tool.Canary: read from region test-02,,1386559511167.aa2951a86289281beee480f107bb36ee. column family cf2 in 3ms
173 13/12/09 03:26:32 INFO tool.Canary: read from region test-02,0004883,1386559511167.cbda32d5e2e276520712d84eaaa29d84. column family cf1 in 31ms
174 13/12/09 03:26:32 INFO tool.Canary: read from region test-02,0004883,1386559511167.cbda32d5e2e276520712d84eaaa29d84. column family cf2 in 8ms
As you can see, table test-01 has two regions and two column families, so the Canary tool in the
default "region mode" will pick 4 small pieces of data from 4 (2 regions * 2 stores) different stores.
This is the default behavior.
181 ==== Canary test for every column family (store) of every region of a specific table(s)
183 You can also test one or more specific tables by passing table names.
186 $ ${HBASE_HOME}/bin/hbase canary test-01 test-02
189 ==== Canary test with RegionServer granularity
191 In "regionserver mode", the Canary tool will pick one small piece of data
192 from each RegionServer (You can also pass one or more RegionServer names as arguments
193 to the canary-test when in "regionserver mode").
196 $ ${HBASE_HOME}/bin/hbase canary -regionserver
198 13/12/09 06:05:17 INFO tool.Canary: Read from table:test-01 on region server:rs2 in 72ms
199 13/12/09 06:05:17 INFO tool.Canary: Read from table:test-02 on region server:rs3 in 34ms
200 13/12/09 06:05:17 INFO tool.Canary: Read from table:test-01 on region server:rs1 in 56ms
203 ==== Canary test with regular expression pattern
You can pass regexes for table names when in "region mode", or for server names when
in "regionserver mode". The following will test both tables test-01 and test-02.
209 $ ${HBASE_HOME}/bin/hbase canary -e test-0[1-2]
212 ==== Run canary test as a "daemon"
Run repeatedly with an interval defined via the option `-interval` (default value is 60 seconds).
This daemon will stop itself and return a non-zero error code if any error occurs. To have
the daemon keep running across errors, pass the -f flag with its value set to false.
220 $ ${HBASE_HOME}/bin/hbase canary -daemon
223 To run repeatedly with 5 second intervals and not stop on errors, do the following.
226 $ ${HBASE_HOME}/bin/hbase canary -daemon -interval 5 -f false
229 ==== Force timeout if canary test stuck
In some cases the request is stuck and no response is sent back to the client. This
can happen with dead RegionServers which the master has not yet noticed.
Because of this we provide a timeout option to kill the canary test and return a
non-zero error code. The following sets the timeout value to 60 seconds (the default is 600 seconds).
238 $ ${HBASE_HOME}/bin/hbase canary -t 60000
241 ==== Enable write sniffing in canary
By default, the canary tool only checks read operations. To enable write sniffing,
you can run the canary with the `-writeSniffing` option set. When write sniffing is
enabled, the canary tool will create an hbase table and make sure the
regions of the table are distributed to all region servers. In each sniffing period,
the canary will try to put data to these regions to check the write availability of each region server.
250 $ ${HBASE_HOME}/bin/hbase canary -writeSniffing
253 The default write table is `hbase:canary` and can be specified with the option `-writeTable`.
255 $ ${HBASE_HOME}/bin/hbase canary -writeSniffing -writeTable ns:canary
258 The default value size of each put is 10 bytes. You can set it via the config key:
259 `hbase.canary.write.value.size`.
261 ==== Treat read / write failure as error
By default, the canary tool only logs read failures (due to, for example, RetriesExhaustedException)
and will return the 'normal' exit code. To treat read/write failures as errors, you can run canary
with the `-treatFailureAsError` option. When enabled, read/write failures will result in an error exit code.
268 $ ${HBASE_HOME}/bin/hbase canary -treatFailureAsError
271 ==== Running Canary in a Kerberos-enabled Cluster
273 To run the Canary in a Kerberos-enabled cluster, configure the following two properties in _hbase-site.xml_:
275 * `hbase.client.keytab.file`
276 * `hbase.client.kerberos.principal`
278 Kerberos credentials are refreshed every 30 seconds when Canary runs in daemon mode.
280 To configure the DNS interface for the client, configure the following optional properties in _hbase-site.xml_.
282 * `hbase.client.dns.interface`
283 * `hbase.client.dns.nameserver`
285 .Canary in a Kerberos-Enabled Cluster
287 This example shows each of the properties with valid values.
292 <name>hbase.client.kerberos.principal</name>
293 <value>hbase/_HOST@YOUR-REALM.COM</value>
296 <name>hbase.client.keytab.file</name>
297 <value>/etc/hbase/conf/keytab.krb5</value>
299 <!-- optional params -->
301 <name>hbase.client.dns.interface</name>
302 <value>default</value>
305 <name>hbase.client.dns.nameserver</name>
306 <value>default</value>
314 usage: bin/hbase regionsplitter <TABLE> <SPLITALGORITHM>
315 SPLITALGORITHM is the java class name of a class implementing
316 SplitAlgorithm, or one of the special strings
317 HexStringSplit or DecimalStringSplit or
318 UniformSplit, which are built-in split algorithms.
319 HexStringSplit treats keys as hexadecimal ASCII, and
320 DecimalStringSplit treats keys as decimal ASCII, and
321 UniformSplit treats keys as arbitrary bytes.
322 -c <region count> Create a new table with a pre-split number of
324 -D <property=value> Override HBase Configuration Settings
325 -f <family:family:...> Column Families to create with new table.
327 --firstrow <arg> First Row in Table for Split Algorithm
328 -h Print this usage help
329 --lastrow <arg> Last Row in Table for Split Algorithm
330 -o <count> Max outstanding splits that have unfinished
332 -r Perform a rolling split of an existing region
333 --risky Skip verification steps to complete
334 quickly. STRONGLY DISCOURAGED for production
338 For additional detail, see <<manual_region_splitting_decisions>>.
You can configure HBase to run a script periodically and, if it fails N times (configurable), have the server exit.
See _HBASE-7351 Periodic health check script_ for configuration options and details.
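A minimal sketch of the corresponding settings in _hbase-site.xml_ (the property names follow HBASE-7351; the script path, interval, timeout, and threshold values below are examples only and should be verified against your HBase version):

<property>
  <name>hbase.node.health.script.location</name>
  <value>/etc/hbase/conf/healthcheck.sh</value>
</property>
<property>
  <name>hbase.node.health.script.frequency</name>
  <!-- run the health script every 10 seconds -->
  <value>10000</value>
</property>
<property>
  <name>hbase.node.health.script.timeout</name>
  <!-- kill the script if it has not finished after 60 seconds -->
  <value>60000</value>
</property>
<property>
  <name>hbase.node.health.failure.threshold</name>
  <!-- exit the server after 3 consecutive failed checks -->
  <value>3</value>
</property>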
348 Several frequently-accessed utilities are provided as `Driver` classes, and executed by the _bin/hbase_ command.
349 These utilities represent MapReduce jobs which run on your cluster.
350 They are run in the following way, replacing _UtilityName_ with the utility you want to run.
351 This command assumes you have set the environment variable `HBASE_HOME` to the directory where HBase is unpacked on your server.
355 ${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.mapreduce.UtilityName
358 The following utilities are available:
360 `LoadIncrementalHFiles`::
361 Complete a bulk data load.
`CopyTable`::
Export a table from the local cluster to a peer cluster.

`Export`::
Write table data to HDFS.

`Import`::
Import data written by a previous `Export` operation.

`ImportTsv`::
Import data in TSV format.

`RowCounter`::
Count rows in an HBase table.

`CellCounter`::
Count cells in an HBase table.
381 `replication.VerifyReplication`::
382 Compare the data from tables in two different clusters.
383 WARNING: It doesn't work for incrementColumnValues'd cells since the timestamp is changed.
384 Note that this command is in a different package than the others.
Each command except `RowCounter` and `CellCounter` accepts a single `--help` argument to print usage instructions.
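For example (a sketch, with `TestTable` as a placeholder table name), to run the `RowCounter` utility through this entry point:

${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter TestTable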
391 The `hbck` tool that shipped with hbase-1.x has been made read-only in hbase-2.x. It is not able to repair
392 hbase-2.x clusters as hbase internals have changed. Nor should its assessments in read-only mode be
393 trusted as it does not understand hbase-2.x operation.
395 A new tool, <<HBCK2>>, described in the next section, replaces `hbck`.
`HBCK2` is the successor to <<hbck>>, the hbase-1.x fix tool (a.k.a. `hbck1`). Use it in place of `hbck1`
when making repairs against hbase-2.x installs.
403 `HBCK2` does not ship as part of hbase. It can be found as a subproject of the companion
404 link:https://github.com/apache/hbase-operator-tools[hbase-operator-tools] repository at
405 link:https://github.com/apache/hbase-operator-tools/tree/master/hbase-hbck2[Apache HBase HBCK2 Tool].
406 `HBCK2` was moved out of hbase so it could evolve at a cadence apart from that of hbase core.
See the link:https://github.com/apache/hbase-operator-tools/tree/master/hbase-hbck2[HBCK2] Home Page
409 for how `HBCK2` differs from `hbck1`, and for how to build and use it.
411 Once built, you can run `HBCK2` as follows:
414 $ hbase hbck -j /path/to/HBCK2.jar
417 This will generate `HBCK2` usage describing commands and options.
425 For bulk replaying WAL files or _recovered.edits_ files, see
426 <<walplayer>>. For reading/verifying individual files, read on.
428 [[hlog_tool.prettyprint]]
429 ==== WALPrettyPrinter
431 The `WALPrettyPrinter` is a tool with configurable options to print the contents of a WAL
432 or a _recovered.edits_ file. You can invoke it via the HBase cli with the 'wal' command.
435 $ ./bin/hbase wal hdfs://example.org:8020/hbase/WALs/example.org,60020,1283516293161/10.10.21.10%3A60020.1283973724012
438 .WAL Printing in older versions of HBase
441 Prior to version 2.0, the `WALPrettyPrinter` was called the `HLogPrettyPrinter`, after an internal name for HBase's write ahead log.
442 In those versions, you can print the contents of a WAL using the same configuration as above, but with the 'hlog' command.
445 $ ./bin/hbase hlog hdfs://example.org:8020/hbase/.logs/example.org,60020,1283516293161/10.10.21.10%3A60020.1283973724012
452 See <<compression.test,compression.test>>.
CopyTable is a utility that can copy part of or all of a table, either to the same cluster or another cluster.
458 The target table must first exist.
459 The usage is as follows:
463 $ ./bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --help
464 /bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --help
465 Usage: CopyTable [general options] [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] <tablename>
468 rs.class hbase.regionserver.class of the peer cluster,
469 specify if different from current cluster
470 rs.impl hbase.regionserver.impl of the peer cluster,
471 startrow the start row
473 starttime beginning of the time range (unixtime in millis)
474 without endtime means from starttime to forever
475 endtime end of the time range. Ignored if no starttime specified.
476 versions number of cell versions to copy
477 new.name new table's name
478 peer.adr Address of the peer cluster given in the format
479 hbase.zookeeer.quorum:hbase.zookeeper.client.port:zookeeper.znode.parent
480 families comma-separated list of families to copy
481 To copy from cf1 to cf2, give sourceCfName:destCfName.
482 To keep the same name, just give "cfName"
483 all.cells also copy delete markers and deleted cells
486 tablename Name of the table to copy
489 To copy 'TestTable' to a cluster that uses replication for a 1 hour window:
490 $ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1265875194289 --endtime=1265878794289 --peer.adr=server1,server2,server3:2181:/hbase --families=myOldCf:myNewCf,cf2,cf3 TestTable
492 For performance consider the following general options:
493 It is recommended that you set the following to >=100. A higher value uses more memory but
494 decreases the round trip time to the server and may increase performance.
495 -Dhbase.client.scanner.caching=100
The following should always be set to false, to prevent writing data twice, which may produce inaccurate results.
498 -Dmapred.map.tasks.speculative.execution=false
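As an additional sketch, the following copies 'TestTable' within the same cluster under a new name (assuming a target table named `TestTableCopy` has already been created with the same column families):

$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=TestTableCopy TestTable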
504 Caching for the input Scan is configured via `hbase.client.scanner.caching` in the job configuration.
By default, the CopyTable utility only copies the latest version of row cells unless `--versions=n` is explicitly specified in the command.
CopyTable does not perform a diff; it copies all Cells within the specified startrow/stoprow and starttime/endtime ranges.
This means that already existing cells with the same values will still be copied.
520 See Jonathan Hsieh's link:https://blog.cloudera.com/blog/2012/06/online-hbase-backups-with-copytable-2/[Online
521 HBase Backups with CopyTable] blog post for more on `CopyTable`.
523 [[hashtable.synctable]]
524 === HashTable/SyncTable
HashTable/SyncTable is a two-step tool for synchronizing table data, where each step is implemented as a MapReduce job.
Similarly to CopyTable, it can be used for partial or entire table data syncing, under the same or a remote cluster.
However, it performs the sync more efficiently than CopyTable. Instead of copying all cells
in the specified row key/time period range, HashTable (the first step) creates hashed indexes for batches of cells on the source table and outputs those as results.
In the next stage, SyncTable scans the target table, calculates hash indexes for its cells,
compares these hashes with the outputs of HashTable, and then scans (and compares) only the cells with diverging hashes, updating
mismatching cells. This results in less network traffic/data transfer, which can matter when syncing large tables on remote clusters.
534 ==== Step 1, HashTable
536 First, run HashTable on the source table cluster (this is the table whose state will be copied to its counterpart).
541 $ ./bin/hbase org.apache.hadoop.hbase.mapreduce.HashTable --help
542 Usage: HashTable [options] <tablename> <outputpath>
545 batchsize the target amount of bytes to hash in each batch
546 rows are added to the batch until this size is reached
547 (defaults to 8000 bytes)
548 numhashfiles the number of hash files to create
549 if set to fewer than number of regions then
550 the job will create this number of reducers
551 (defaults to 1/100 of regions -- at least 1)
552 startrow the start row
554 starttime beginning of the time range (unixtime in millis)
555 without endtime means from starttime to forever
556 endtime end of the time range. Ignored if no starttime specified.
557 scanbatch scanner batch size to support intra row scans
558 versions number of cell versions to include
559 families comma-separated list of families to include
560 ignoreTimestamps if true, ignores cell timestamps
563 tablename Name of the table to hash
564 outputpath Filesystem path to put the output data
567 To hash 'TestTable' in 32kB batches for a 1 hour window into 50 files:
568 $ bin/hbase org.apache.hadoop.hbase.mapreduce.HashTable --batchsize=32000 --numhashfiles=50 --starttime=1265875194289 --endtime=1265878794289 --families=cf2,cf3 TestTable /hashes/testTable
The *batchsize* property defines how much cell data for a given region will be hashed together in a single hash value.
Sizing this properly has a direct impact on sync efficiency, as it may lead to fewer scans executed by mapper tasks
of SyncTable (the next step in the process). The rule of thumb is: the smaller the number of cells out of sync
(the lower the probability of finding a diff), the larger the batch size that can be used.
576 ==== Step 2, SyncTable
Once HashTable has completed on the source cluster, SyncTable can be run on the target cluster.
Just like replication and other synchronization jobs, it requires that all RegionServers/DataNodes
on the source cluster be accessible by NodeManagers on the target cluster (where SyncTable job tasks will be running).
585 $ ./bin/hbase org.apache.hadoop.hbase.mapreduce.SyncTable --help
586 Usage: SyncTable [options] <sourcehashdir> <sourcetable> <targettable>
589 sourcezkcluster ZK cluster key of the source table
590 (defaults to cluster in classpath's config)
591 targetzkcluster ZK cluster key of the target table
592 (defaults to cluster in classpath's config)
593 dryrun if true, output counters but no writes
595 doDeletes if false, does not perform deletes
597 doPuts if false, does not perform puts
599 ignoreTimestamps if true, ignores cells timestamps while comparing
600 cell values. Any missing cell on target then gets
601 added with current time as timestamp
605 sourcehashdir path to HashTable output dir for source table
606 (see org.apache.hadoop.hbase.mapreduce.HashTable)
607 sourcetable Name of the source table to sync from
608 targettable Name of the target table to sync to
611 For a dry run SyncTable of tableA from a remote source cluster
612 to a local target cluster:
613 $ bin/hbase org.apache.hadoop.hbase.mapreduce.SyncTable --dryrun=true --sourcezkcluster=zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase hdfs://nn:9000/hashes/tableA tableA tableA
Cell comparison takes ROW/FAMILY/QUALIFIER/TIMESTAMP/VALUE into account for equality. When syncing at the target, missing cells will be
added with the original timestamp value from the source. That may cause unexpected results after SyncTable completes. For example, if missing
cells on the target have a delete marker with a timestamp T2 (say, a bulk delete performed by mistake), but the source cells' timestamps have an
older value T1, then those cells would still be unavailable at the target because of the newer delete marker timestamp. Since cell timestamps
might not be relevant to all use cases, the _ignoreTimestamps_ option adds the flexibility of not using cell timestamps in the comparison.
When set to true, _ignoreTimestamps_ must be specified for both the HashTable and SyncTable steps.
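For example, to ignore timestamps during the comparison, the option has to be passed in both steps (a sketch; the table name, hash output path, and NameNode address are placeholders):

$ bin/hbase org.apache.hadoop.hbase.mapreduce.HashTable --ignoreTimestamps=true TestTable /hashes/testTable
$ bin/hbase org.apache.hadoop.hbase.mapreduce.SyncTable --ignoreTimestamps=true hdfs://nn:9000/hashes/testTable TestTable TestTable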
The *dryrun* option is useful when a read-only diff report is wanted, as it will produce only COUNTERS indicating the differences, but will not perform
any actual changes. It can be used as an alternative to the VerifyReplication tool.
By default, SyncTable will cause the target table to become an exact copy of the source table (at least, for the specified startrow/stoprow and/or starttime/endtime).
Setting doDeletes to false modifies the default behaviour so that target cells missing on the source are not deleted.
Similarly, setting doPuts to false modifies the default behaviour so that cells missing on the target are not added. Setting both doDeletes
and doPuts to false gives the same effect as setting dryrun to true.
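For example, a one-way sync that adds missing cells on the target but never deletes from it might look as follows (a sketch; the ZooKeeper quorum, hash directory, and table names are placeholders):

$ bin/hbase org.apache.hadoop.hbase.mapreduce.SyncTable --doDeletes=false --sourcezkcluster=zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase hdfs://nn:9000/hashes/tableA tableA tableA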
633 .Additional info on doDeletes/doPuts
636 "doDeletes/doPuts" were only added by
637 link:https://jira.apache.org/jira/browse/HBASE-20305[HBASE-20305], so these may not be available on
638 all released versions.
639 For major 1.x versions, minimum minor release including it is *1.4.10*.
640 For major 2.x versions, minimum minor release including it is *2.1.5*.
643 .Additional info on ignoreTimestamps
646 "ignoreTimestamps" was only added by
647 link:https://issues.apache.org/jira/browse/HBASE-24302[HBASE-24302], so it may not be available on
648 all released versions.
649 For major 1.x versions, minimum minor release including it is *1.4.14*.
650 For major 2.x versions, minimum minor release including it is *2.2.5*.
653 .Set doDeletes to false on Two-Way Replication scenarios
On Two-Way Replication or other scenarios where both source and target clusters can have data ingested, it's advisable to always set the doDeletes option to false,
as any additional cell inserted on the SyncTable target cluster and not yet replicated to the source would be deleted, and potentially lost permanently.
660 .Set sourcezkcluster to the actual source cluster ZK quorum
Although not required, if sourcezkcluster is not set, SyncTable will connect to the local HBase cluster for both source and target,
which does not give any meaningful result.
667 .Remote Clusters on different Kerberos Realms
670 Often, remote clusters may be deployed on different Kerberos Realms.
671 link:https://jira.apache.org/jira/browse/HBASE-20586[HBASE-20586] added SyncTable support for
672 cross realm authentication, allowing a SyncTable process running on target cluster to connect to
673 source cluster and read both HashTable output files and the given HBase table when performing the
674 required comparisons.
Export is a utility that will dump the contents of a table to HDFS in a sequence file.
Export can be run via a Coprocessor Endpoint or MapReduce. Invoke via:
683 *mapreduce-based Export*
685 $ bin/hbase org.apache.hadoop.hbase.mapreduce.Export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]
688 *endpoint-based Export*
690 NOTE: Make sure the Export coprocessor is enabled by adding `org.apache.hadoop.hbase.coprocessor.Export` to `hbase.coprocessor.region.classes`.
692 $ bin/hbase org.apache.hadoop.hbase.coprocessor.Export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]
The outputdir is an HDFS directory that does not exist prior to the export. When done, the exported files will be owned by the user invoking the export command.
696 *The Comparison of Endpoint-based Export And Mapreduce-based Export*
698 ||Endpoint-based Export|Mapreduce-based Export
|HBase version requirement
|2.0+
|0.2.1+

|Maven dependency
|hbase-endpoint
|hbase-mapreduce (2.0+), hbase-server(prior to 2.0)

|Requirement before dump
|mount the endpoint.Export on the target table
|deploy the MapReduce framework

|Read latency
|low, directly read the data from region
|normal, traditional RPC scan

|Read Parallelism
|depend on number of regions
|depend on number of mappers (see TableInputFormatBase#getSplits)

|Timeout
|operation timeout. configured by hbase.client.operation.timeout
|scan timeout. configured by hbase.client.scanner.timeout.period

|Permission requirement
|READ, EXECUTE
|READ
734 NOTE: To see usage instructions, run the command with no options. Available options include
735 specifying column families and applying filters during the export.
737 By default, the `Export` tool only exports the newest version of a given cell, regardless of the number of versions stored. To export more than one version, replace *_<versions>_* with the desired number of versions.
For mapreduce-based Export, if you want to export cell tags, set the config property
`hbase.client.rpc.codec` to `org.apache.hadoop.hbase.codec.KeyValueCodecWithTags`.
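One way to pass this is on the command line (a sketch):

$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export -Dhbase.client.rpc.codec=org.apache.hadoop.hbase.codec.KeyValueCodecWithTags <tablename> <outputdir>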
742 Note: caching for the input Scan is configured via `hbase.client.scanner.caching` in the job configuration.
747 Import is a utility that will load data that has been exported back into HBase.
751 $ bin/hbase org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>
754 NOTE: To see usage instructions, run the command with no options.
756 To import 0.94 exported files in a 0.96 cluster or onwards, you need to set system property "hbase.import.version" when running the import command as below:
759 $ bin/hbase -Dhbase.import.version=0.94 org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>
If you want to import cell tags, set the config property
`hbase.client.rpc.codec` to `org.apache.hadoop.hbase.codec.KeyValueCodecWithTags`.
768 ImportTsv is a utility that will load data in TSV format into HBase.
It has two distinct usages: loading data from TSV format in HDFS into HBase via Puts, and preparing StoreFiles to be loaded via the `completebulkload` utility.
771 To load data via Puts (i.e., non-bulk loading):
774 $ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c <tablename> <hdfs-inputdir>
777 To generate StoreFiles for bulk-loading:
781 $ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c -Dimporttsv.bulk.output=hdfs://storefile-outputdir <tablename> <hdfs-data-inputdir>
784 These generated StoreFiles can be loaded into HBase via <<completebulkload,completebulkload>>.
786 [[importtsv.options]]
787 ==== ImportTsv Options
789 Running `ImportTsv` with no arguments prints brief usage information:
793 Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>
795 Imports the given input directory of TSV data into the specified table.
797 The column names of the TSV data must be specified using the -Dimporttsv.columns
798 option. This option takes the form of comma-separated column names, where each
799 column name is either a simple column family, or a columnfamily:qualifier. The special
800 column name HBASE_ROW_KEY is used to designate that this column should be used
801 as the row key for each imported record. You must specify exactly one column
802 to be the row key, and you must specify a column name for every column that exists in the
805 By default importtsv will load data directly into HBase. To instead generate
806 HFiles of data to prepare for a bulk data load, pass the option:
807 -Dimporttsv.bulk.output=/path/for/output
808 Note: the target table will be created with default column family descriptors if it does not already exist.
810 Other options that may be specified with -D include:
811 -Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
812 '-Dimporttsv.separator=|' - eg separate on pipes instead of tabs
813 -Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
814 -Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of org.apache.hadoop.hbase.mapreduce.TsvImporterMapper
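For example, a sketch that loads comma-separated data instead of tabs, using the options listed above (the column layout is illustrative):

$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv '-Dimporttsv.separator=,' -Dimporttsv.columns=HBASE_ROW_KEY,d:c1,d:c2 <tablename> <hdfs-inputdir>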
817 [[importtsv.example]]
818 ==== ImportTsv Example
820 For example, assume that we are loading data into a table called 'datatsv' with a ColumnFamily called 'd' with two columns "c1" and "c2".
822 Assume that an input file exists as follows:
837 For ImportTsv to use this input file, the command line needs to look like this:
841 HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-mapreduce-VERSION.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,d:c1,d:c2 -Dimporttsv.bulk.output=hdfs://storefileoutput datatsv hdfs://inputfile
844 \... and in this example the first column is the rowkey, which is why the HBASE_ROW_KEY is used.
845 The second and third columns in the file will be imported as "d:c1" and "d:c2", respectively.
847 [[importtsv.warning]]
848 ==== ImportTsv Warning
If you are preparing a lot of data for bulk loading, make sure the target HBase table is pre-split appropriately.
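For example, the target table could be created pre-split in the HBase shell (a sketch; the split points below are placeholders and should be chosen according to your row key distribution):

hbase> create 'datatsv', 'd', SPLITS => ['row3000', 'row6000', 'row9000']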
855 For more information about bulk-loading HFiles into HBase, see <<arch.bulk.load,arch.bulk.load>>
860 The `completebulkload` utility will move generated StoreFiles into an HBase table.
861 This utility is often used in conjunction with output from <<importtsv,importtsv>>.
863 There are two ways to invoke this utility, with explicit classname and via the driver:
867 $ bin/hbase org.apache.hadoop.hbase.tool.LoadIncrementalHFiles <hdfs://storefileoutput> <tablename>
872 HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-mapreduce-VERSION.jar completebulkload <hdfs://storefileoutput> <tablename>
875 [[completebulkload.warning]]
876 ==== CompleteBulkLoad Warning
878 Data generated via MapReduce is often created with file permissions that are not compatible with the running HBase process.
879 Assuming you're running HDFS with permissions enabled, those permissions will need to be updated before you run CompleteBulkLoad.
881 For more information about bulk-loading HFiles into HBase, see <<arch.bulk.load,arch.bulk.load>>.
886 WALPlayer is a utility to replay WAL files into HBase.
888 The WAL can be replayed for a set of tables or all tables, and a timerange can be provided (in milliseconds). The WAL is filtered to this set of tables.
889 The output can optionally be mapped to another set of tables.
891 WALPlayer can also generate HFiles for later bulk importing, in that case only a single table and no mapping can be specified.
Finally, you can use WALPlayer to replay the content of a Region's `recovered.edits` directory (the files under
the `recovered.edits` directory have the same format as WAL files).

To read or verify single WAL files or _recovered.edits_ files, since they share the WAL format, see <<hlog_tool.prettyprint>>.
906 $ bin/hbase org.apache.hadoop.hbase.mapreduce.WALPlayer [options] <WAL inputdir> [<tables> <tableMappings>]>
912 $ bin/hbase org.apache.hadoop.hbase.mapreduce.WALPlayer /backuplogdir oldTable1,oldTable2 newTable1,newTable2
WALPlayer, by default, runs as a mapreduce job.
To NOT run WALPlayer as a mapreduce job on your cluster, force it to run all in the local process by adding the flag `-Dmapreduce.jobtracker.address=local` on the command line.
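For example, to replay the WALs from the previous example entirely in the local process:

$ bin/hbase org.apache.hadoop.hbase.mapreduce.WALPlayer -Dmapreduce.jobtracker.address=local /backuplogdir oldTable1,oldTable2 newTable1,newTable2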
918 [[walplayer.options]]
919 ==== WALPlayer Options
921 Running `WALPlayer` with no arguments prints brief usage information:
924 Usage: WALPlayer [options] <WAL inputdir> [<tables> <tableMappings>]
925 <WAL inputdir> directory of WALs to replay.
926 <tables> comma separated list of tables. If no tables specified,
927 all are imported (even hbase:meta if present).
928 <tableMappings> WAL entries can be mapped to a new set of tables by passing
929 <tableMappings>, a comma separated list of target tables.
930 If specified, each table in <tables> must have a mapping.
931 To generate HFiles to bulk load instead of loading HBase directly, pass:
932 -Dwal.bulk.output=/path/for/output
933 Only one table can be specified, and no mapping allowed!
934 To specify a time range, pass:
935 -Dwal.start.time=[date|ms]
936 -Dwal.end.time=[date|ms]
937 The start and the end date of timerange (inclusive). The dates can be
938 expressed in milliseconds-since-epoch or yyyy-MM-dd'T'HH:mm:ss.SS format.
939 E.g. 1234567890120 or 2009-02-13T23:32:30.12
941 -Dmapreduce.job.name=jobName
942 Use the specified mapreduce job name for the wal player
943 -Dwal.input.separator=' '
944 Change WAL filename separator (WAL dir names use default ','.)
945 For performance also consider the following options:
946 -Dmapreduce.map.speculative=false
947 -Dmapreduce.reduce.speculative=false
953 link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html[RowCounter] is a mapreduce job to count all the rows of a table.
954 This is a good utility to use as a sanity check to ensure that HBase can read all the blocks of a table if there are any concerns of metadata inconsistency.
955 It will run the mapreduce all in a single process but it will run faster if you have a MapReduce cluster in place for it to exploit.
956 It is possible to limit the time range of data to be scanned by using the `--starttime=[starttime]` and `--endtime=[endtime]` flags.
957 The scanned data can be limited based on keys using the `--range=[startKey],[endKey][;[startKey],[endKey]...]` option.
960 $ bin/hbase rowcounter [options] <tablename> [--starttime=<start> --endtime=<end>] [--range=[startKey],[endKey][;[startKey],[endKey]...]] [<column1> <column2>...]
963 RowCounter only counts one version per cell.
For performance, consider using the `-Dhbase.client.scanner.caching=100` and `-Dmapreduce.map.speculative=false` options.
970 HBase ships another diagnostic mapreduce job called link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/CellCounter.html[CellCounter].
Like RowCounter, it is a MapReduce job that scans your table; however, the statistics it gathers are more fine-grained and include:
974 * Total number of rows in the table.
975 * Total number of CFs across all rows.
976 * Total qualifiers across all rows.
977 * Total occurrence of each CF.
978 * Total occurrence of each qualifier.
979 * Total number of versions of each qualifier.
981 The program allows you to limit the scope of the run.
982 Provide a row regex or prefix to limit the rows to analyze.
983 Specify a time range to scan the table by using the `--starttime=<starttime>` and `--endtime=<endtime>` flags.
985 Use `hbase.mapreduce.scan.column.family` to specify scanning a single column family.
988 $ bin/hbase cellcounter <tablename> <outputDir> [reportSeparator] [regex or prefix] [--starttime=<starttime> --endtime=<endtime>]
991 Note: just like RowCounter, caching for the input Scan is configured via `hbase.client.scanner.caching` in the job configuration.
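For example, to restrict the count to a single column family (a sketch; the `-D` property must come before the positional arguments):

$ bin/hbase cellcounter -Dhbase.mapreduce.scan.column.family=cf1 <tablename> <outputDir>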
995 It is possible to optionally pin your servers in physical memory making them less likely to be swapped out in oversubscribed environments by having the servers call link:http://linux.die.net/man/2/mlockall[mlockall] on startup.
996 See link:https://issues.apache.org/jira/browse/HBASE-4391[HBASE-4391 Add ability to start RS as root and call mlockall] for how to build the optional library and have it run on startup.
999 === Offline Compaction Tool
*CompactionTool* provides a way of running compactions (either minor or major) as an independent
process from the RegionServer. It reuses the same internal implementation classes executed by the RegionServer
compaction feature. However, since this runs in a completely separate, independent Java process, it
relieves RegionServers of the overhead involved in rewriting a set of HFiles, which can be critical
for latency-sensitive use cases.
1009 $ ./bin/hbase org.apache.hadoop.hbase.regionserver.CompactionTool
1011 Usage: java org.apache.hadoop.hbase.regionserver.CompactionTool \
1012 [-compactOnce] [-major] [-mapred] [-D<property=value>]* files...
1015 mapred Use MapReduce to run compaction.
1016 compactOnce Execute just one compaction step. (default: while needed)
1017 major Trigger major compaction.
1019 Note: -D properties will be applied to the conf used.
1021 To stop delete of compacted file, pass -Dhbase.compactiontool.delete=false
1022 To set tmp dir, pass -Dhbase.tmp.dir=ALTERNATE_DIR
1025 To compact the full 'TestTable' using MapReduce:
1026 $ hbase org.apache.hadoop.hbase.regionserver.CompactionTool -mapred hdfs://hbase/data/default/TestTable
1028 To compact column family 'x' of the table 'TestTable' region 'abc':
1029 $ hbase org.apache.hadoop.hbase.regionserver.CompactionTool hdfs://hbase/data/default/TestTable/abc/x
As shown by the usage options above, *CompactionTool* can run as a standalone client or as a mapreduce job.
When running as a mapreduce job, each family dir is handled as an input split and is processed
by a separate map task.
The *compactOnce* parameter controls how many compaction cycles will be performed until
*CompactionTool* decides to finish its work. If omitted, it will keep running compactions on each
specified family for as long as the configured compaction policy determines that further compaction
is needed. For more info on compaction policy, see <<compaction,compaction>>.
If a major compaction is desired, the *major* flag can be specified. If omitted, *CompactionTool* will
assume a minor compaction is wanted by default.
It also allows for configuration overrides with the `-D` flag. In the usage section above, for example,
the `-Dhbase.compactiontool.delete=false` option will instruct the compaction engine not to delete the original
files from the temp folder.
Files targeted for compaction must be specified as parent hdfs dirs. Multiple dirs may be given,
as long as each of these dirs is either a *family*, a *region*, or a *table* dir. If a
table or region dir is passed, the program will recursively iterate through related sub-folders,
effectively running compaction for each family found below the table/region level.
Since these dirs are nested under the *hbase* hdfs directory tree, *CompactionTool* requires hbase super
user permissions in order to have access to the required hfiles.
1056 .Running in MapReduce mode
MapReduce mode offers the ability to process each family dir in parallel, as a separate map task.
Generally, it would make sense to run in this mode when specifying one or more table dirs as targets
for compactions. The caveat, though, is that if the number of families to be compacted becomes too large,
the related mapreduce job may have indirect impacts on *RegionServer* performance.
Since *NodeManagers* are normally co-located with RegionServers, such large jobs could
compete for IO/bandwidth resources with the *RegionServers*.
.MajorCompaction completely disabled on RegionServers due to performance impacts
*Major compactions* can be a costly operation (see <<compaction,compaction>>), and can indeed
impact performance on RegionServers, leading operators to completely disable them for critical
low-latency applications. *CompactionTool* could be used as an alternative in such scenarios,
although additional custom application logic would need to be implemented, such as deciding on
the scheduling and selection of tables/regions/families targeted for a given compaction run.
1077 For additional details about CompactionTool, see also
1078 link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/regionserver/CompactionTool.html[CompactionTool].
1082 The `hbase clean` command cleans HBase data from ZooKeeper, HDFS, or both.
1083 It is appropriate to use for testing.
1084 Run it with no options for usage instructions.
1085 The `hbase clean` command was introduced in HBase 0.98.
1090 Usage: hbase clean (--cleanZk|--cleanHdfs|--cleanAll)
1092 --cleanZk cleans hbase related data from zookeeper.
1093 --cleanHdfs cleans hbase related data from hdfs.
1094 --cleanAll cleans hbase related data from both zookeeper and hdfs.
1099 The `hbase pe` command runs the PerformanceEvaluation tool, which is used for testing.
1101 The PerformanceEvaluation tool accepts many different options and commands.
1102 For usage instructions, run the command with no options.
1104 The PerformanceEvaluation tool has received many updates in recent HBase releases, including support for namespaces, support for tags, cell-level ACLs and visibility labels, multiget support for RPC calls, increased sampling sizes, an option to randomly sleep during testing, and ability to "warm up" the cluster before testing starts.
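For example, a small single-process (non-MapReduce) random-write run might look like this (a sketch; confirm the exact command and option spellings against the usage output of your release):

$ bin/hbase pe --nomapred --rows=100000 randomWrite 4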
1108 The `hbase ltt` command runs the LoadTestTool utility, which is used for testing.
1110 You must specify either `-init_only` or at least one of `-write`, `-update`, or `-read`.
1111 For general usage instructions, pass the `-h` option.
The LoadTestTool has received many updates in recent HBase releases, including support for namespaces, support for tags, cell-level ACLs and visibility labels, testing security-related features, ability to specify the number of regions per server, tests for multi-get RPC calls, and tests relating to replication.
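For example, a small write-then-read load might look like this (a sketch; the option formats differ between releases, so confirm them with `hbase ltt -h` first):

$ bin/hbase ltt -tn load_test -write 3:1024:10 -num_keys 100000 -read 100:10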
1116 === Pre-Upgrade validator
The Pre-Upgrade validator tool can be used to check the cluster for known incompatibilities before upgrading from HBase 1 to HBase 2.
1121 $ bin/hbase pre-upgrade command ...
1124 ==== Coprocessor validation
HBase has supported coprocessors for a long time, but the coprocessor API can change between major releases. The coprocessor validator tries to determine
whether the old coprocessors are still compatible with the current HBase version.
1131 $ bin/hbase pre-upgrade validate-cp [-jar ...] [-class ... | -table ... | -config]
1133 -e Treat warnings as errors.
1134 -jar <arg> Jar file/directory of the coprocessor.
1135 -table <arg> Table coprocessor(s) to check.
1136 -class <arg> Coprocessor class(es) to check.
1137 -config Scan jar for observers.
The coprocessor classes can be explicitly declared with the `-class` option, or they can be obtained from the HBase configuration with the `-config` option.
Table-level coprocessors can also be checked with the `-table` option. The tool searches for coprocessors on its classpath, but the classpath can be extended
with the `-jar` option. It is possible to test multiple classes with multiple `-class` options, multiple tables with multiple `-table` options, as well as
adding multiple jars to the classpath with multiple `-jar` options.
The tool can report errors and warnings. Errors mean that HBase won't be able to load the coprocessor, because it is incompatible with the current version
of HBase. Warnings mean that the coprocessors can be loaded, but they won't work as expected. If the `-e` option is given, then the tool will also fail for warnings.
Please note that this tool cannot validate every aspect of jar files; it just does some static checks.
1155 $ bin/hbase pre-upgrade validate-cp -jar my-coprocessor.jar -class MyMasterObserver -class MyRegionObserver
It validates the `MyMasterObserver` and `MyRegionObserver` classes, which are located in `my-coprocessor.jar`.
1162 $ bin/hbase pre-upgrade validate-cp -table .*
It validates every table-level coprocessor where the table name matches the `.*` regular expression.
1167 ==== DataBlockEncoding validation
1168 HBase 2.0 removed `PREFIX_TREE` Data Block Encoding from column families. For further information
1169 please check <<upgrade2.0.prefix-tree.removed,_prefix-tree_ encoding removed>>.
To verify that none of the column families are using incompatible Data Block Encodings in the cluster, run the following command.
1174 $ bin/hbase pre-upgrade validate-dbe
This check validates all column families and prints out any incompatibilities. For example:
1180 2018-07-13 09:58:32,028 WARN [main] tool.DataBlockEncodingValidator: Incompatible DataBlockEncoding for table: t, cf: f, encoding: PREFIX_TREE
This means that the Data Block Encoding of table `t`, column family `f`, is incompatible. To fix it, use the `alter` command in the HBase shell:
1186 alter 't', { NAME => 'f', DATA_BLOCK_ENCODING => 'FAST_DIFF' }
1189 Please also validate HFiles, which is described in the next section.
1191 ==== HFile Content validation
Even after the Data Block Encoding has been changed from `PREFIX_TREE`, it is still possible to have HFiles that contain data encoded that way.
To verify that HFiles are readable with HBase 2, please use the _HFile content validator_.
1197 $ bin/hbase pre-upgrade validate-hfile
The tool will log the corrupt HFiles and details about the root cause.
If the problem is about PREFIX_TREE encoding, it is necessary to change encodings before upgrading to HBase 2.
1203 The following log message shows an example of incorrect HFiles.
1206 2018-06-05 16:20:46,976 WARN [hfilevalidator-pool1-t3] hbck.HFileCorruptionChecker: Found corrupt HFile hdfs://example.com:8020/hbase/data/default/t/72ea7f7d625ee30f959897d1a3e2c350/prefix/7e6b3d73263c4851bf2b8590a9b3791e
1207 org.apache.hadoop.hbase.io.hfile.CorruptHFileException: Problem reading HFile Trailer from file hdfs://example.com:8020/hbase/data/default/t/72ea7f7d625ee30f959897d1a3e2c350/prefix/7e6b3d73263c4851bf2b8590a9b3791e
1209 Caused by: java.io.IOException: Invalid data block encoding type in file info: PREFIX_TREE
1211 Caused by: java.lang.IllegalArgumentException: No enum constant org.apache.hadoop.hbase.io.encoding.DataBlockEncoding.PREFIX_TREE
1213 2018-06-05 16:20:47,322 INFO [main] tool.HFileContentValidator: Corrupted file: hdfs://example.com:8020/hbase/data/default/t/72ea7f7d625ee30f959897d1a3e2c350/prefix/7e6b3d73263c4851bf2b8590a9b3791e
1214 2018-06-05 16:20:47,383 INFO [main] tool.HFileContentValidator: Corrupted file: hdfs://example.com:8020/hbase/archive/data/default/t/56be41796340b757eb7fff1eb5e2a905/f/29c641ae91c34fc3bee881f45436b6d1
1217 ===== Fixing PREFIX_TREE errors
It's possible to get `PREFIX_TREE` errors after changing the Data Block Encoding to a supported one. This can happen
because there are some HFiles which are still encoded with `PREFIX_TREE`, or there are still some snapshots referencing such HFiles.
1222 For fixing HFiles, please run a major compaction on the table (it was `default:t` according to the log message):
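major_compact 't'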
HFiles can be referenced from snapshots, too. This is the case when the HFile is located under `archive/data`.
The first step is to determine which snapshot references that HFile (the name of the file was `29c641ae91c34fc3bee881f45436b6d1`
according to the logs):
for snapshot in $(hbase snapshotinfo -list-snapshots 2> /dev/null | tail -n -1 | cut -f 1 -d \|);
do
  echo "checking snapshot named '${snapshot}'";
  hbase snapshotinfo -snapshot "${snapshot}" -files 2> /dev/null | grep 29c641ae91c34fc3bee881f45436b6d1;
done
1241 The output of this shell script is:
1244 checking snapshot named 't_snap'
1245 1.0 K t/56be41796340b757eb7fff1eb5e2a905/f/29c641ae91c34fc3bee881f45436b6d1 (archive)
This means the `t_snap` snapshot references the incompatible HFile. If the snapshot is still needed,
then it has to be recreated with the HBase shell:
1252 # creating a new namespace for the cleanup process
1253 create_namespace 'pre_upgrade_cleanup'
# cloning the snapshot into a temporary table
1256 clone_snapshot 't_snap', 'pre_upgrade_cleanup:t'
1257 alter 'pre_upgrade_cleanup:t', { NAME => 'f', DATA_BLOCK_ENCODING => 'FAST_DIFF' }
1258 major_compact 'pre_upgrade_cleanup:t'
1260 # removing the invalid snapshot
1261 delete_snapshot 't_snap'
1263 # creating a new snapshot
1264 snapshot 'pre_upgrade_cleanup:t', 't_snap'
1266 # removing temporary table
1267 disable 'pre_upgrade_cleanup:t'
1268 drop 'pre_upgrade_cleanup:t'
1269 drop_namespace 'pre_upgrade_cleanup'
1272 For further information, please refer to
1273 link:https://issues.apache.org/jira/browse/HBASE-20649?focusedCommentId=16535476#comment-16535476[HBASE-20649].
1275 === Data Block Encoding Tool
Tests various compression algorithms with different data block encoders for key compression on an existing HFile.
1278 Useful for testing, debugging and benchmarking.
1280 You must specify `-f` which is the full path of the HFile.
1282 The result shows both the performance (MB/s) of compression/decompression and encoding/decoding, and the data savings on the HFile.
1286 $ bin/hbase org.apache.hadoop.hbase.regionserver.DataBlockEncodingTool
1287 Usages: hbase org.apache.hadoop.hbase.regionserver.DataBlockEncodingTool
1289 -f HFile to analyse (REQUIRED)
1290 -n Maximum number of key/value pairs to process in a single benchmark run.
1291 -b Whether to run a benchmark to measure read throughput.
1292 -c If this is specified, no correctness testing will be done.
1293 -a What kind of compression algorithm use for test. Default value: GZ.
1294 -t Number of times to run each benchmark. Default value: 12.
1295 -omit Number of first runs of every benchmark to omit from statistics. Default value: 2.
1300 == Region Management
1302 [[ops.regionmgt.majorcompact]]
1303 === Major Compaction
1305 Major compactions can be requested via the HBase shell or link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Admin.html#majorCompact-org.apache.hadoop.hbase.TableName-[Admin.majorCompact].
1307 Note: major compactions do NOT do region merges.
1308 See <<compaction,compaction>> for more information about compactions.
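For example, in the HBase shell (table and column family names are placeholders):

hbase> major_compact 'my_table'
hbase> major_compact 'my_table', 'f'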
1310 [[ops.regionmgt.merge]]
1313 Merge is a utility that can merge adjoining regions in the same table (see org.apache.hadoop.hbase.util.Merge).
1317 $ bin/hbase org.apache.hadoop.hbase.util.Merge <tablename> <region1> <region2>
If you feel you have too many regions and want to consolidate them, Merge is the utility you need.
Merge must be run when the cluster is down.
1322 See the link:https://web.archive.org/web/20111231002503/http://ofps.oreilly.com/titles/9781449396107/performance.html[O'Reilly HBase
1323 Book] for an example of usage.
1325 You will need to pass 3 parameters to this application.
1326 The first one is the table name.
1327 The second one is the fully qualified name of the first region to merge, like "table_name,\x0A,1342956111995.7cef47f192318ba7ccc75b1bbf27a82b.". The third one is the fully qualified name for the second region to merge.
1329 Additionally, there is a Ruby script attached to link:https://issues.apache.org/jira/browse/HBASE-1621[HBASE-1621] for region merging.
1335 === Node Decommission
1337 You can stop an individual RegionServer by running the following script in the HBase directory on the particular node:
1340 $ ./bin/hbase-daemon.sh stop regionserver
1343 The RegionServer will first close all regions and then shut itself down.
1344 On shutdown, the RegionServer's ephemeral node in ZooKeeper will expire.
The master will notice the RegionServer gone and will treat it as a 'crashed' server; it will reassign the regions the RegionServer was carrying.
1347 .Disable the Load Balancer before Decommissioning a node
1350 If the load balancer runs while a node is shutting down, then there could be contention between the Load Balancer and the Master's recovery of the just decommissioned RegionServer.
1351 Avoid any problems by disabling the balancer first.
1352 See <<lb,lb>> below.
1358 In hbase-2.0, in the bin directory, we added a script named _considerAsDead.sh_ that can be used to kill a regionserver.
1359 Hardware issues could be detected by specialized monitoring tools before the zookeeper timeout has expired. _considerAsDead.sh_ is a simple function to mark a RegionServer as dead.
1360 It deletes all the znodes of the server, starting the recovery process.
1361 Plug in the script into your monitoring/fault detection tools to initiate faster failover.
1362 Be careful how you use this disruptive tool.
1363 Copy the script if you need to make use of it in a version of hbase previous to hbase-2.0.
1366 A downside to the above stop of a RegionServer is that regions could be offline for a good period of time.
1367 Regions are closed in order.
If there are many regions on the server, the first region to close may not be back online until all regions close and
the master notices the RegionServer's znode gone. A node can be asked to gradually shed its load and
then shut itself down using the _graceful_stop.sh_ script. Here is its usage:
1373 $ ./bin/graceful_stop.sh
1374 Usage: graceful_stop.sh [--config <conf-dir>] [-e] [--restart [--reload]] [--thrift] [--rest] [-n |--noack] [--maxthreads <number of threads>] [--movetimeout <timeout in seconds>] [-nob |--nobalancer] [-d |--designatedfile <file path>] [-x |--excludefile <file path>] <hostname>
1375 thrift If we should stop/start thrift before/after the hbase stop/start
1376 rest If we should stop/start rest before/after the hbase stop/start
1377 restart If we should restart after graceful stop
1378 reload Move offloaded regions back on to the restarted server
1379 n|noack Enable noAck mode in RegionMover. This is a best effort mode for moving regions
1380 maxthreads xx Limit the number of threads used by the region mover. Default value is 1.
1381 movetimeout xx Timeout for moving regions. If regions are not moved by the timeout value, exit with error. Default value is INT_MAX.
1382 hostname Hostname of server we are to stop
1383 e|failfast Set -e so exit immediately if any command exits with non-zero status
1384 nob|nobalancer Do not manage balancer states. This is only used as optimization in rolling_restart.sh to avoid multiple calls to hbase shell
1385 d|designatedfile xx Designated file with <hostname:port> per line as unload targets
1386 x|excludefile xx Exclude file should have <hostname:port> per line. We do not unload regions to hostnames given in exclude file
1389 To decommission a loaded RegionServer, run the following: +$
1390 ./bin/graceful_stop.sh HOSTNAME+ where `HOSTNAME` is the host carrying the RegionServer you would decommission.
1395 The `HOSTNAME` passed to _graceful_stop.sh_ must match the hostname that hbase is using to identify RegionServers.
1396 HBase usually uses fully-qualified domain names. Check the list of RegionServers in the master UI to see how HBase
1397 is referring to servers. Whatever HBase is using, this is what you should pass to the _graceful_stop.sh_ decommission script.
1398 If you pass IPs, the script is not yet smart enough to derive a hostname (or FQDN) from them, and so it will fail when it checks
1399 whether the server is currently running; the graceful unloading of regions will not run.
1402 The _graceful_stop.sh_ script will move the regions off the decommissioned RegionServer one at a time to minimize region churn.
1403 It will verify that the region has been deployed in the new location before it moves the next region, and so on, until the decommissioned
1404 server is carrying zero regions. At this point, the _graceful_stop.sh_ tells the RegionServer `stop`.
1405 The master will at this point notice that the RegionServer is gone, but all regions will have already been redeployed, and because the
1406 RegionServer went down cleanly, there will be no WAL logs to split.
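As a minimal sketch of a decommission, with a placeholder hostname, you can first check how HBase names its RegionServers and then pass exactly that hostname to the script:

[source,bash]
----
# List live RegionServers as HBase names them (host,port,startcode).
$ echo "status 'simple'" | ./bin/hbase shell

# Decommission one of them, using the hostname exactly as reported above.
$ ./bin/graceful_stop.sh rs1.example.com
----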
1412 It is assumed that the Region Load Balancer is disabled while the `graceful_stop` script runs (otherwise the balancer and the decommission script will end up fighting over region deployments). Use the shell to disable the balancer:
1416 hbase(main):001:0> balance_switch false
1418 0 row(s) in 0.3590 seconds
1421 This turns the balancer OFF.
1426 hbase(main):001:0> balance_switch true
1428 0 row(s) in 0.3590 seconds
1431 The `graceful_stop` will check the balancer and if enabled, will turn it off before it goes to work.
1432 If it exits prematurely because of an error, it will not have re-enabled the balancer.
1433 Hence, it is better to manage the balancer yourself, apart from `graceful_stop`, re-enabling it after you are done with `graceful_stop`.
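A minimal sketch of that sequence, with a placeholder hostname, managing the balancer yourself around the decommission:

[source,bash]
----
# Turn the balancer off, gracefully stop the RegionServer, then turn the balancer back on.
$ echo "balance_switch false" | ./bin/hbase shell
$ ./bin/graceful_stop.sh rs1.example.com
$ echo "balance_switch true" | ./bin/hbase shell
----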
1436 [[draining.servers]]
1437 ==== Decommissioning several Region Servers concurrently
1439 If you have a large cluster, you may want to decommission more than one machine at a time by gracefully stopping multiple RegionServers concurrently.
1440 To gracefully drain multiple regionservers at the same time, RegionServers can be put into a "draining" state.
1441 This is done by marking a RegionServer as a draining node by creating an entry in ZooKeeper under the _hbase_root/draining_ znode.
1442 This znode has format `name,port,startcode` just like the regionserver entries under _hbase_root/rs_ znode.
1444 Without this facility, decommissioning multiple nodes may be non-optimal because regions that are being drained from one region server may be moved to other regionservers that are also draining.
1445 Marking RegionServers to be in the draining state prevents this from happening.
1446 See this link:http://inchoate-clatter.blogspot.com/2012/03/hbase-ops-automation.html[blog
1447 post] for more details.
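As an illustrative sketch, assuming the default `zookeeper.znode.parent` of _/hbase_ and a hypothetical server name, a RegionServer can be marked as draining with the bundled ZooKeeper CLI; the znode you create must match that server's entry under _/hbase/rs_ exactly:

[source,bash]
----
# Look up the exact server znode names (host,port,startcode).
$ ./bin/hbase zkcli ls /hbase/rs

# Mark one server as draining so regions are no longer assigned to it.
$ ./bin/hbase zkcli create /hbase/draining/rs1.example.com,16020,1596778899999 ""
----

Once the draining znodes exist, run _graceful_stop.sh_ against each of the marked servers as usual.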
1450 ==== Bad or Failing Disk
1452 It is good having <<dfs.datanode.failed.volumes.tolerated,dfs.datanode.failed.volumes.tolerated>> set if you have a decent number of disks per machine for the case where a disk plain dies.
1453 But usually disks do the "John Wayne" -- i.e.
1454 take a while to go down spewing errors in _dmesg_ -- or for some reason, run much slower than their companions.
1455 In this case you want to decommission the disk.
1456 You have two options.
1457 You can link:https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDataNodeAdminGuide.html[decommission
1458 the datanode] or, less disruptive in that only the bad disk's data will be rereplicated, you can stop the datanode, unmount the bad volume (you can't umount a volume while the datanode is using it), and then restart the datanode (presuming you have set dfs.datanode.failed.volumes.tolerated > 0). The regionserver will throw some errors in its logs as it recalibrates where to get its data from -- it will likely roll its WAL log too -- but in general, aside from some latency spikes, it should keep on chugging.
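The following is only a rough sketch of that second option, assuming a Hadoop 3 style `hdfs --daemon` command and a hypothetical mount point of _/data/4_; adapt it to however you manage HDFS daemons, and read the Short Circuit Reads note below first if short-circuit reads are enabled:

[source,bash]
----
# Stop the datanode, unmount the failing volume, then restart the datanode.
# Requires dfs.datanode.failed.volumes.tolerated > 0 so the datanode will
# start with the volume missing.
$ hdfs --daemon stop datanode
$ sudo umount /data/4
$ hdfs --daemon start datanode
----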
1460 .Short Circuit Reads
1463 If you are doing short-circuit reads, you will have to move the regions off the regionserver before you stop the datanode; with short-circuit reads, even though the blocks are chmod'd so the regionserver cannot access them, because it already has the files open it will be able to keep reading the file blocks from the bad disk even though the datanode is down.
1464 Move the regions back after you restart the datanode.
1470 Some cluster configuration changes require either the entire cluster, or the RegionServers, to be restarted in order to pick up the changes.
1471 In addition, rolling restarts are supported for upgrading to a minor or maintenance release, and to a major release if at all possible.
1472 See the release notes for the release you want to upgrade to, to find out about limitations on the ability to perform a rolling upgrade.
1474 There are multiple ways to restart your cluster nodes, depending on your situation.
1475 These methods are detailed below.
1477 ==== Using the `rolling-restart.sh` Script
1479 HBase ships with a script, _bin/rolling-restart.sh_, that allows you to perform rolling restarts on the entire cluster, the master only, or the RegionServers only.
1480 The script is provided as a template for your own script, and is not explicitly tested.
1481 It requires password-less SSH login to be configured and assumes that you have deployed using a tarball.
1482 The script requires you to set some environment variables before running it.
1483 Examine the script and modify it to suit your needs.
1485 ._rolling-restart.sh_ General Usage
1487 $ ./bin/rolling-restart.sh --help
1488 Usage: rolling-restart.sh [--config <hbase-confdir>] [--rs-only] [--master-only] [--graceful] [--maxthreads xx]
1491 Rolling Restart on RegionServers Only::
1492 To perform a rolling restart on the RegionServers only, use the `--rs-only` option.
1493 This might be necessary if you need to reboot the individual RegionServer or if you make a configuration change that only affects RegionServers and not the other HBase processes.
1495 Rolling Restart on Masters Only::
1496 To perform a rolling restart on the active and backup Masters, use the `--master-only` option.
1497 You might use this if you know that your configuration change only affects the Master and not the RegionServers, or if you need to restart the server where the active Master is running.
Graceful Restart::
1500 If you specify the `--graceful` option, RegionServers are restarted using the _bin/graceful_stop.sh_ script, which moves regions off a RegionServer before restarting it.
1501 This is safer, but can delay the restart.
1503 Limiting the Number of Threads::
1504 To limit the rolling restart to using only a specific number of threads, use the `--maxthreads` option.
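For example, a graceful rolling restart of only the RegionServers, using at most two region-mover threads, might look like this (the option values are illustrative):

[source,bash]
----
# Gracefully restart the RegionServers only, moving regions off each server first.
$ ./bin/rolling-restart.sh --rs-only --graceful --maxthreads 2
----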
1506 [[rolling.restart.manual]]
1507 ==== Manual Rolling Restart
1509 To retain more control over the process, you may wish to manually do a rolling restart across your cluster.
1510 This uses the _graceful_stop.sh_ script described above in <<decommission,decommission>>.
1511 In this method, you can restart each RegionServer individually and then move its old regions back into place, retaining locality.
1512 If you also need to restart the Master, you need to do it separately, and restart the Master before restarting the RegionServers using this method.
1513 The following is an example of such a command.
1514 You may need to tailor it to your environment.
1515 This script does a rolling restart of RegionServers only.
1516 It disables the load balancer before moving the regions.
1520 $ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i; done &> /tmp/log.txt &
1523 Monitor the output of the _/tmp/log.txt_ file to follow the progress of the script.
1525 ==== Logic for Crafting Your Own Rolling Restart Script
1527 Use the following guidelines if you want to create your own rolling restart script.
1529 . Extract the new release, verify its configuration, and synchronize it to all nodes of your cluster using `rsync`, `scp`, or another secure synchronization mechanism.
1531 . Restart the master first.
1532 You may need to modify these commands if your new HBase directory is different from the old one, such as for an upgrade.
1536 $ ./bin/hbase-daemon.sh stop master; ./bin/hbase-daemon.sh start master
1539 . Gracefully restart each RegionServer, using a script such as the following, from the Master.
1543 $ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i; done &> /tmp/log.txt &
1546 If you are running Thrift or REST servers, pass the `--thrift` or `--rest` options.
1547 For other available options, run the `bin/graceful_stop.sh --help` command.
1549 It is important to drain HBase regions slowly when restarting multiple RegionServers.
1550 Otherwise, multiple regions go offline simultaneously and must be reassigned to other nodes, which may also go offline soon.
1551 This can negatively affect performance.
1552 You can inject delays into the script above, for instance, by adding a Shell command such as `sleep`.
1553 To wait for 5 minutes between each RegionServer restart, modify the above script to the following:
1557 $ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i && sleep 5m; done &> /tmp/log.txt &
1560 . Restart the Master again, to clear out the dead servers list and re-enable the load balancer.
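A short sketch of this final step, reusing commands shown earlier in this section:

[source,bash]
----
# Restart the Master, then make sure the load balancer is back on.
$ ./bin/hbase-daemon.sh stop master; ./bin/hbase-daemon.sh start master
$ echo "balance_switch true" | ./bin/hbase shell
----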
1563 === Adding a New Node
1565 Adding a new regionserver in HBase is essentially free: you simply start it like this: `$ ./bin/hbase-daemon.sh start regionserver`, and it will register itself with the master.
1566 Ideally you also started a DataNode on the same machine so that the RS can eventually start to have local files.
1567 If you rely on ssh to start your daemons, don't forget to add the new hostname in _conf/regionservers_ on the master.
1569 At this point the region server isn't serving data because no regions have moved to it yet.
1570 If the balancer is enabled, it will start moving regions to the new RS.
1571 On a small/medium cluster this can have a very adverse effect on latency as a lot of regions will be offline at the same time.
1572 It is thus recommended to disable the balancer the same way it's done when decommissioning a node and move the regions manually (or even better, using a script that moves them one by one).
1574 The moved regions will all have 0% locality and won't have any blocks in cache so the region server will have to use the network to serve requests.
1575 Apart from resulting in higher latency, it may also use all of your network card's capacity.
1576 For practical purposes, consider that a standard 1GigE NIC won't be able to read much more than _100MB/s_.
1577 In this case, or if you are in an OLAP environment and require locality, it is recommended to major compact the moved regions.
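As an illustration, regions can be moved and the affected table major compacted from the shell; the encoded region name, server name, and table name below are placeholders:

[source,bash]
----
# Move one region to the new RegionServer, then major compact the table so the
# moved data is rewritten with locality on the new server.
$ echo "move 'ENCODED_REGIONNAME', 'rs-new.example.com,16020,1596778899999'" | ./bin/hbase shell
$ echo "major_compact 'my_table'" | ./bin/hbase shell
----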
1582 HBase emits metrics which adhere to the link:https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Metrics.html[Hadoop Metrics] API.
1583 Starting with HBase 0.95footnote:[The Metrics system was redone in
1584 HBase 0.96. See Migration
1585 to the New Metrics Hotness – Metrics2 by Elliot Clark for detail], HBase is configured to emit a default set of metrics with a default sampling period of every 10 seconds.
1586 You can use HBase metrics in conjunction with Ganglia.
1587 You can also filter which metrics are emitted and extend the metrics framework to capture custom metrics appropriate for your environment.
1591 For HBase 0.95 and newer, HBase ships with a default metrics configuration, or [firstterm]_sink_.
1592 This includes a wide variety of individual metrics, and emits them every 10 seconds by default.
1593 To configure metrics for a given region server, edit the _conf/hadoop-metrics2-hbase.properties_ file.
1594 Restart the region server for the changes to take effect.
1596 To change the sampling rate for the default sink, edit the line beginning with `*.period`.
1597 To filter which metrics are emitted or to extend the metrics framework, see https://hadoop.apache.org/docs/current/api/org/apache/hadoop/metrics2/package-summary.html
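For example, to sample every 30 seconds instead of every 10, you might change the period and restart the RegionServer; this is a sketch that assumes the default `*.period=10` line is present in the file:

[source,bash]
----
# Raise the default sink period from 10 to 30 seconds, then restart the RegionServer.
$ sed -i 's/^\*\.period=.*/*.period=30/' conf/hadoop-metrics2-hbase.properties
$ ./bin/hbase-daemon.sh restart regionserver
----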
1599 .HBase Metrics and Ganglia
1602 By default, HBase emits a large number of metrics per region server.
1603 Ganglia may have difficulty processing all these metrics.
1604 Consider increasing the capacity of the Ganglia server or reducing the number of metrics emitted by HBase.
1605 See link:https://hadoop.apache.org/docs/current/api/org/apache/hadoop/metrics2/package-summary.html#filtering[Metrics Filtering].
1608 === Disabling Metrics
1610 To disable metrics for a region server, edit the _conf/hadoop-metrics2-hbase.properties_ file and comment out any uncommented lines.
1611 Restart the region server for the changes to take effect.
1613 [[discovering.available.metrics]]
1614 === Discovering Available Metrics
1616 Rather than listing each metric which HBase emits by default, you can browse through the available metrics, either as a JSON output or via JMX.
1617 Different metrics are exposed for the Master process and each region server process.
1619 .Procedure: Access a JSON Output of Available Metrics
1620 . After starting HBase, access the region server's web UI, at pass:[http://REGIONSERVER_HOSTNAME:60030] by default (or port 16030 in HBase 1.0+).
1621 . Click the [label]#Metrics Dump# link near the top.
1622 The metrics for the region server are presented as a dump of the JMX bean in JSON format.
1623 This will dump out all metrics names and their values.
1624 To include metrics descriptions in the listing -- this can be useful when you are exploring what is available -- add a query string of `?description=true` so your URL becomes pass:[http://REGIONSERVER_HOSTNAME:60030/jmx?description=true].
1625 Not all beans and attributes have descriptions.
1626 . To view metrics for the Master, connect to the Master's web UI instead (defaults to pass:[http://localhost:60010] or port 16010 in HBase 1.0+) and click its [label]#Metrics Dump# link.
1628 To include metrics descriptions in the listing -- this can be useful when you are exploring what is available -- add a query string of `?description=true` so your URL becomes pass:[http://REGIONSERVER_HOSTNAME:60010/jmx?description=true].
1629 Not all beans and attributes have descriptions.
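The same JSON can be fetched for scripted monitoring with a plain HTTP client; the hostnames below are placeholders, and the ports shown are the HBase 1.0+ defaults:

[source,bash]
----
# Dump RegionServer metrics, with descriptions, as JSON (head keeps the example short).
$ curl -s "http://REGIONSERVER_HOSTNAME:16030/jmx?description=true" | head

# The same for the Master.
$ curl -s "http://MASTER_HOSTNAME:16010/jmx?description=true" | head
----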
1632 You can use many different tools to view JMX content by browsing MBeans.
1633 This procedure uses `jvisualvm`, which is an application usually available in the JDK.
1635 .Procedure: Browse the JMX Output of Available Metrics
1636 . Start HBase, if it is not already running.
1637 . Run the `jvisualvm` command on a host with a GUI display.
1638 You can launch it from the command line or another method appropriate for your operating system.
1639 . Be sure the [label]#VisualVM-MBeans# plugin is installed. Browse to *Tools -> Plugins*. Click [label]#Installed# and check whether the plugin is listed.
1640 If not, click [label]#Available Plugins#, select it, and click btn:[Install].
1641 When finished, click btn:[Close].
1642 . To view details for a given HBase process, double-click the process in the [label]#Local# sub-tree in the left-hand panel.
1643 A detailed view opens in the right-hand panel.
1644 Click the [label]#MBeans# tab which appears as a tab in the top of the right-hand panel.
1645 . To access the HBase metrics, navigate to the appropriate sub-bean:
1649 . The name of each metric and its current value is displayed in the [label]#Attributes# tab.
1650 For a view which includes more details, including the description of each attribute, click the [label]#Metadata# tab.
1652 === Units of Measure for Metrics
1654 Different metrics are expressed in different units, as appropriate.
1655 Often, the unit of measure is in the name (as in the metric `shippedKBs`). Otherwise, use the following guidelines.
1656 When in doubt, you may need to examine the source for a given metric.
1658 * Metrics that refer to a point in time are usually expressed as a timestamp.
1659 * Metrics that refer to an age (such as `ageOfLastShippedOp`) are usually expressed in milliseconds.
1660 * Metrics that refer to memory sizes are in bytes.
1661 * Sizes of queues (such as `sizeOfLogQueue`) are expressed as the number of items in the queue.
1662 Determine the size by multiplying by the block size (default is 64 MB in HDFS).
1663 * Metrics that refer to things like the number of a given type of operations (such as `logEditsRead`) are expressed as an integer.
1666 === Most Important Master Metrics
1668 Note: Counts are usually over the last metrics reporting interval.
1670 hbase.master.numRegionServers::
1671 Number of live regionservers
1673 hbase.master.numDeadRegionServers::
1674 Number of dead regionservers
1676 hbase.master.ritCount ::
1677 The number of regions in transition
1679 hbase.master.ritCountOverThreshold::
1680 The number of regions that have been in transition longer than a threshold time (default: 60 seconds)
1682 hbase.master.ritOldestAge::
1683 The age of the longest region in transition, in milliseconds
1686 === Most Important RegionServer Metrics
1688 Note: Counts are usually over the last metrics reporting interval.
1690 hbase.regionserver.regionCount::
1691 The number of regions hosted by the regionserver
1693 hbase.regionserver.storeFileCount::
1694 The number of store files on disk currently managed by the regionserver
1696 hbase.regionserver.storeFileSize::
1697 Aggregate size of the store files on disk
1699 hbase.regionserver.hlogFileCount::
1700 The number of write ahead logs not yet archived
1702 hbase.regionserver.totalRequestCount::
1703 The total number of requests received
1705 hbase.regionserver.readRequestCount::
1706 The number of read requests received
1708 hbase.regionserver.writeRequestCount::
1709 The number of write requests received
1711 hbase.regionserver.numOpenConnections::
1712 The number of open connections at the RPC layer
1714 hbase.regionserver.numActiveHandler::
1715 The number of RPC handlers actively servicing requests
1717 hbase.regionserver.numCallsInGeneralQueue::
1718 The number of currently enqueued user requests
1720 hbase.regionserver.numCallsInReplicationQueue::
1721 The number of currently enqueued operations received from replication
1723 hbase.regionserver.numCallsInPriorityQueue::
1724 The number of currently enqueued priority (internal housekeeping) requests
1726 hbase.regionserver.flushQueueLength::
1727 Current depth of the memstore flush queue.
1728 If increasing, we are falling behind with clearing memstores out to HDFS.
1730 hbase.regionserver.updatesBlockedTime::
1731 Number of milliseconds updates have been blocked so the memstore can be flushed
1733 hbase.regionserver.compactionQueueLength::
1734 Current depth of the compaction request queue.
1735 If increasing, we are falling behind with storefile compaction.
1737 hbase.regionserver.blockCacheHitCount::
1738 The number of block cache hits
1740 hbase.regionserver.blockCacheMissCount::
1741 The number of block cache misses
1743 hbase.regionserver.blockCacheExpressHitPercent ::
1744 The percent of the time that requests with the cache turned on hit the cache
1746 hbase.regionserver.percentFilesLocal::
1747 Percent of store file data that can be read from the local DataNode, 0-100
1749 hbase.regionserver.<op>_<measure>::
1750 Operation latencies, where <op> is one of Append, Delete, Mutate, Get, Replay, Increment; and where <measure> is one of min, max, mean, median, 75th_percentile, 95th_percentile, 99th_percentile
1752 hbase.regionserver.slow<op>Count ::
1753 The number of operations we thought were slow, where <op> is one of the list above
1755 hbase.regionserver.GcTimeMillis::
1756 Time spent in garbage collection, in milliseconds
1758 hbase.regionserver.GcTimeMillisParNew::
1759 Time spent in garbage collection of the young generation, in milliseconds
1761 hbase.regionserver.GcTimeMillisConcurrentMarkSweep::
1762 Time spent in garbage collection of the old generation, in milliseconds
1764 hbase.regionserver.authenticationSuccesses::
1765 Number of client connections where authentication succeeded
1767 hbase.regionserver.authenticationFailures::
1768 Number of client connection authentication failures
1770 hbase.regionserver.mutationsWithoutWALCount ::
1771 Count of writes submitted with a flag indicating they should bypass the write ahead log
1774 === Meta Table Load Metrics
1776 The HBase meta table metrics collection feature is available in HBase 1.4+, but it is disabled by default because it can
1777 affect the performance of the cluster. When it is enabled, it helps you monitor client access patterns by collecting
1778 the following statistics:
1780 * number of get, put and delete operations on the `hbase:meta` table
1781 * number of get, put and delete operations made by the top-N clients
1782 * number of operations related to each table
1783 * number of operations related to the top-N regions
1786 When to use the feature::
1787 This feature can help to identify hot spots in the meta table by showing the regions or tables where the meta info is
1788 modified (e.g. by create, drop, split or move tables) or retrieved most frequently. It can also help to find misbehaving
1789 client applications by showing which clients are using the meta table most heavily, which can, for example, suggest a
1790 lack of meta table buffering or a failure to re-use open client connections in the client application.
1792 .Possible side-effects of enabling this feature
1795 Having a large number of clients and regions in the cluster can cause the registration and tracking of a large number of
1796 metrics, which can increase the memory and CPU footprint of the HBase region server handling the `hbase:meta` table.
1797 It can also significantly increase the JMX dump size, which can affect the monitoring or log aggregation
1798 system you use alongside HBase. It is recommended to turn on this feature only during debugging.
1801 Where to find the metrics in JMX::
1802 Each metric attribute name will start with the `MetaTable_` prefix. For all the metrics you will see five different
1803 JMX attributes: count, mean rate, 1 minute rate, 5 minute rate and 15 minute rate. You will find these metrics in JMX
1804 under the following MBean:
1805 `Hadoop -> HBase -> RegionServer -> Coprocessor.Region.CP_org.apache.hadoop.hbase.coprocessor.MetaTableMetrics`.
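For a quick look without a JMX client, you can pull the JSON metrics dump from the RegionServer hosting `hbase:meta` and filter on the prefix; the hostname is a placeholder and the port is the HBase 1.0+ default:

[source,bash]
----
# Show only the meta table metrics from the RegionServer's JMX JSON dump.
$ curl -s "http://REGIONSERVER_HOSTNAME:16030/jmx" | grep "MetaTable_"
----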
1807 .Examples: some Meta Table metrics you can see in your JMX dump
1811 "MetaTable_get_request_count": 77309,
1812 "MetaTable_put_request_mean_rate": 0.06339092997186495,
1813 "MetaTable_table_MyTestTable_request_15min_rate": 1.1020599841623246,
1814 "MetaTable_client_/172.30.65.42_lossy_request_count": 1786
1815 "MetaTable_client_/172.30.65.45_put_request_5min_rate": 0.6189810954855728,
1816 "MetaTable_region_1561131112259.c66e4308d492936179352c80432ccfe0._lossy_request_count": 38342,
1817 "MetaTable_region_1561131043640.5bdffe4b9e7e334172065c853cf0caa6._lossy_request_1min_rate": 0.04925099917433935,
1822 To turn on this feature, you have to enable a custom coprocessor by adding the following section to hbase-site.xml.
1823 This coprocessor will run on all the HBase RegionServers, but will be active (i.e. consume memory / CPU) only on
1824 the server where the `hbase:meta` table is located. It will produce JMX metrics which can be downloaded from the
1825 web UI of the given RegionServer or by a simple REST call. These metrics will not be present in the JMX dump of the
1826 other RegionServers.
1828 .Enabling the Meta Table Metrics feature
1832 <name>hbase.coprocessor.region.classes</name>
1833 <value>org.apache.hadoop.hbase.coprocessor.MetaTableMetrics</value>
1837 .How are the top-N metrics calculated?
1840 The 'top-N' type of metrics will be counted using the Lossy Counting Algorithm (as defined in
1841 link:http://www.vldb.org/conf/2002/S10P03.pdf[Motwani, R; Manku, G.S (2002). "Approximate frequency counts over data streams"]),
1842 which is designed to identify elements in a data stream whose frequency count exceed a user-given threshold.
1843 The frequency computed by this algorithm is not always accurate but has an error threshold that can be specified by the
1844 user as a configuration parameter. The run time space required by the algorithm is inversely proportional to the
1845 specified error threshold; hence, the larger the error parameter, the smaller the footprint and the less accurate the counters.
1848 You can specify the error rate of the algorithm as a floating-point value between 0 and 1 (exclusive); its default
1849 value is 0.02. With the error rate set to `E` and `N` as the total number of meta table operations,
1850 (assuming a uniform distribution of the activity of low frequency elements) at most `7 / E` meters will be kept and
1851 each kept element will have a frequency higher than `E * N`.
1853 An example: Let’s assume we are interested in the HBase clients that are most active in accessing the meta table.
1854 When there have been 1,000,000 operations on the meta table so far and the error rate parameter is set to 0.02, then we can
1855 assume that at most 350 client IP address related counters will be present in JMX and that each of these clients
1856 accessed the meta table at least 20,000 times.
1861 <name>hbase.util.default.lossycounting.errorrate</name>
1870 [[ops.monitoring.overview]]
1873 The following metrics are arguably the most important to monitor for each RegionServer for "macro monitoring", preferably with a system like link:http://opentsdb.net/[OpenTSDB].
1874 If your cluster is having performance issues it's likely that you'll see something unusual with this group.
1877 * See <<rs_metrics,rs metrics>>
1886 For more information on HBase metrics, see <<hbase_metrics,hbase metrics>>.
1891 The HBase slow query log consists of parseable JSON structures describing the properties of those client operations (Gets, Puts, Deletes, etc.) that either took too long to run, or produced too much output.
1892 The thresholds for "too long to run" and "too much output" are configurable, as described below.
1893 The output is produced inline in the main region server logs so that it is easy to discover further details from context with other logged events.
1894 It is also prepended with identifying tags `(responseTooSlow)`, `(responseTooLarge)`, `(operationTooSlow)`, and `(operationTooLarge)` in order to enable easy filtering with grep, in case the user desires to see only slow queries.
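For example, to pull only these complaints out of a RegionServer log with grep (the log path is a placeholder for your own layout):

[source,bash]
----
# Show only slow/large query warnings from the RegionServer logs.
$ grep -E "\((responseTooSlow|responseTooLarge|operationTooSlow|operationTooLarge)\)" \
    /var/log/hbase/hbase-*-regionserver-*.log
----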
1898 There are two configuration knobs that can be used to adjust the thresholds for when queries are logged.
1900 * `hbase.ipc.warn.response.time` Maximum number of milliseconds that a query can be run without being logged.
1901 Defaults to 10000, or 10 seconds.
1902 Can be set to -1 to disable logging by time.
1903 * `hbase.ipc.warn.response.size` Maximum byte size of response that a query can return without being logged.
1904 Defaults to 100 megabytes.
1905 Can be set to -1 to disable logging by size.
1909 The slow query log exposes two metrics to JMX.
1911 * `hadoop.regionserver_rpc_slowResponse` a global metric reflecting the durations of all responses that triggered logging.
1912 * `hadoop.regionserver_rpc_methodName.aboveOneSec` A metric reflecting the durations of all responses that lasted for more than one second.
1916 The output is tagged with operation e.g. `(operationTooSlow)` if the call was a client operation, such as a Put, Get, or Delete, which we expose detailed fingerprint information for.
1917 If not, it is tagged `(responseTooSlow)` and still produces parseable JSON output, but with less verbose information solely regarding its duration and size in the RPC itself. `TooLarge` is substituted for `TooSlow` if the response size triggered the logging, with `TooLarge` appearing even in the case that both size and duration triggered logging.
1924 2011-09-08 10:01:25,824 WARN org.apache.hadoop.ipc.HBaseServer: (operationTooSlow): {"tables":{"riley2":{"puts":[{"totalColumns":11,"families":{"actions":[{"timestamp":1315501284459,"qualifier":"0","vlen":9667580},{"timestamp":1315501284459,"qualifier":"1","vlen":10122412},{"timestamp":1315501284459,"qualifier":"2","vlen":11104617},{"timestamp":1315501284459,"qualifier":"3","vlen":13430635}]},"row":"cfcd208495d565ef66e7dff9f98764da:0"}],"families":["actions"]}},"processingtimems":956,"client":"10.47.34.63:33623","starttimems":1315501284456,"queuetimems":0,"totalPuts":1,"class":"HRegionServer","responsesize":0,"method":"multiPut"}
1927 Note that everything inside the "tables" structure is output produced by MultiPut's fingerprint, while the rest of the information is RPC-specific, such as processing time and client IP/port.
1928 Other client operations follow the same pattern and the same general structure, with necessary differences due to the nature of the individual operations.
1929 In the case that the call is not a client operation, that detailed fingerprint information will be completely absent.
1931 This particular example would indicate that the likely cause of slowness is simply a very large (on the order of 100MB) multiput, as we can tell by the "vlen," or value length, fields of each put in the multiPut.
1933 [[slow_log_responses]]
1934 ==== Get Slow Response Log from shell
1935 When an individual RPC exceeds a configurable time bound, we log a complaint
1936 by way of the logging subsystem, for example:
1941 2019-10-02 10:10:22,195 WARN [,queue=15,port=60020] ipc.RpcServer - (responseTooSlow):
1942 {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
1943 "starttimems":1567203007549,
1944 "responsesize":6819737,
1946 "param":"region { type: REGION_NAME value: \"t1,\\000\\000\\215\\f)o\\\\\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000<TRUNCATED>",
1947 "processingtimems":28646,
1948 "client":"10.253.196.215:41116",
1949 "queuetimems":22453,
1950 "class":"HRegionServer"}
1953 Unfortunately, the request parameters are often truncated, as in the example above.
1954 The truncation is unfortunate because it eliminates much of the utility of
1955 the warnings. For example, the region name, the start and end keys, and the
1956 filter hierarchy are all important clues for debugging performance problems
1957 caused by moderate to low selectivity queries or queries made at a high rate.
1959 HBASE-22978 introduces maintaining an in-memory ring buffer of requests that were judged to
1960 be too slow, in addition to the responseTooSlow logging. The in-memory representation can be
1961 complete (not truncated). There is some chance a high rate of requests will cause information on other
1962 interesting requests to be overwritten before it can be read. This is an acceptable trade off.
1964 In order to enable the in-memory ring buffer at RegionServers, we need to enable
1967 hbase.regionserver.slowlog.buffer.enabled
1970 One more config determines the size of the ring buffer:
1972 hbase.regionserver.slowlog.ringbuffer.size
1975 Check the config section for the detailed description.
1977 This config is disabled by default. Turn it on, and the following shell commands
1978 will return results from the ring buffers.
1981 shell commands to retrieve slowlog responses from RegionServers:
1984 Retrieve latest SlowLog Responses maintained by each or specific RegionServers.
1985 Specify '*' to include all RS otherwise array of server names for specific
1986 RS. A server name is the host, port plus startcode of a RegionServer.
1987 e.g.: host187.example.com,60020,1289493121758 (find servername in
1988 master ui or when you do detailed status in shell)
1990 Provide optional filter parameters as Hash.
1991 The default limit on the number of slow log records returned by each server is 10. Users can specify
1992 a higher limit with the 'LIMIT' param in case more than 10 records should be retrieved.
1996 hbase> get_slowlog_responses '*' => get slowlog responses from all RS
1997 hbase> get_slowlog_responses '*', {'LIMIT' => 50} => get slowlog responses from all RS
1998 with 50 records limit (default limit: 10)
1999 hbase> get_slowlog_responses ['SERVER_NAME1', 'SERVER_NAME2'] => get slowlog responses from SERVER_NAME1,
2001 hbase> get_slowlog_responses '*', {'REGION_NAME' => 'hbase:meta,,1'}
2002 => get slowlog responses only related to meta
2004 hbase> get_slowlog_responses '*', {'TABLE_NAME' => 't1'} => get slowlog responses only related to t1 table
2005 hbase> get_slowlog_responses '*', {'CLIENT_IP' => '192.162.1.40:60225', 'LIMIT' => 100}
2006 => get slowlog responses with given client
2007 IP address and get 100 records limit
2009 hbase> get_slowlog_responses '*', {'REGION_NAME' => 'hbase:meta,,1', 'TABLE_NAME' => 't1'}
2010 => get slowlog responses with given region name
2012 hbase> get_slowlog_responses '*', {'USER' => 'user_name', 'CLIENT_IP' => '192.162.1.40:60225'}
2013 => get slowlog responses that match either
2014 provided client IP address or user name
2019 All of the above queries with filters have the default OR operation applied, i.e. all
2020 records that match any of the provided filters will be returned. However,
2021 we can also apply the AND operator, i.e. only records that match all (not just any) of
2022 the provided filters will be returned.
2025 hbase> get_slowlog_responses '*', {'REGION_NAME' => 'hbase:meta,,1', 'TABLE_NAME' => 't1', 'FILTER_BY_OP' => 'AND'}
2026 => get slowlog responses with given region name
2027 and table name, both should match
2029 hbase> get_slowlog_responses '*', {'REGION_NAME' => 'hbase:meta,,1', 'TABLE_NAME' => 't1', 'FILTER_BY_OP' => 'OR'}
2030 => get slowlog responses with given region name
2031 or table name, any one can match
2033 hbase> get_slowlog_responses '*', {'TABLE_NAME' => 't1', 'CLIENT_IP' => '192.163.41.53:52781', 'FILTER_BY_OP' => 'AND'}
2034 => get slowlog responses with given table name
2035 and client IP address, both should match
2038 Since OR is the default filter operator, without providing 'FILTER_BY_OP' the query will have the
2039 same result as providing 'FILTER_BY_OP' => 'OR'.
2042 Sometimes the output can be long, pretty-printed JSON that is awkward to scroll through
2043 on a single screen, and hence users might prefer
2044 redirecting the output of get_slowlog_responses to a file.
2048 echo "get_slowlog_responses '*'" | hbase shell > xyz.out 2>&1
2051 Similar to slow RPC logs, clients can also retrieve large RPC logs.
2052 Sometimes, slow logs important to debug perf issues turn out to be larger in size.
2057 hbase> get_largelog_responses '*' => get largelog responses from all RS
2058 hbase> get_largelog_responses '*', {'LIMIT' => 50} => get largelog responses from all RS
2059 with 50 records limit (default limit: 10)
2060 hbase> get_largelog_responses ['SERVER_NAME1', 'SERVER_NAME2'] => get largelog responses from SERVER_NAME1,
2062 hbase> get_largelog_responses '*', {'REGION_NAME' => 'hbase:meta,,1'}
2063 => get largelog responses only related to meta
2065 hbase> get_largelog_responses '*', {'TABLE_NAME' => 't1'} => get largelog responses only related to t1 table
2066 hbase> get_largelog_responses '*', {'CLIENT_IP' => '192.162.1.40:60225', 'LIMIT' => 100}
2067 => get largelog responses with given client
2068 IP address and get 100 records limit
2070 hbase> get_largelog_responses '*', {'REGION_NAME' => 'hbase:meta,,1', 'TABLE_NAME' => 't1'}
2071 => get largelog responses with given region name
2073 hbase> get_largelog_responses '*', {'USER' => 'user_name', 'CLIENT_IP' => '192.162.1.40:60225'}
2074 => get largelog responses that match either
2075 provided client IP address or user name
2077 hbase> get_largelog_responses '*', {'REGION_NAME' => 'hbase:meta,,1', 'TABLE_NAME' => 't1', 'FILTER_BY_OP' => 'AND'}
2078 => get largelog responses with given region name
2079 and table name, both should match
2081 hbase> get_largelog_responses '*', {'REGION_NAME' => 'hbase:meta,,1', 'TABLE_NAME' => 't1', 'FILTER_BY_OP' => 'OR'}
2082 => get largelog responses with given region name
2083 or table name, any one can match
2085 hbase> get_largelog_responses '*', {'TABLE_NAME' => 't1', 'CLIENT_IP' => '192.163.41.53:52781', 'FILTER_BY_OP' => 'AND'}
2086 => get largelog responses with given table name
2087 and client IP address, both should match
2092 shell command to clear slow/largelog responses from RegionServer:
2095 Clears SlowLog Responses maintained by each or specific RegionServers.
2096 Specify array of server names for specific RS. A server name is
2097 the host, port plus startcode of a RegionServer.
2098 e.g.: host187.example.com,60020,1289493121758 (find servername in
2099 master ui or when you do detailed status in shell)
2103 hbase> clear_slowlog_responses => clears slowlog responses from all RS
2104 hbase> clear_slowlog_responses ['SERVER_NAME1', 'SERVER_NAME2'] => clears slowlog responses from SERVER_NAME1,
2110 include::slow_log_responses_from_systable.adoc[]
2113 === Block Cache Monitoring
2115 Starting with HBase 0.98, the HBase Web UI includes the ability to monitor and report on the performance of the block cache.
2116 To view the block cache reports, see the Block Cache section of the region server UI.
2117 Following are a few examples of the reporting capabilities.
2119 .Basic Info shows the cache implementation.
2120 image::bc_basic.png[]
2122 .Config shows all cache configuration options.
2123 image::bc_config.png[]
2125 .Stats shows statistics about the performance of the cache.
2126 image::bc_stats.png[]
2128 .L1 and L2 show information about the L1 and L2 caches.
2131 This is not an exhaustive list of all the screens and reports available.
2132 Have a look in the Web UI.
2134 === Snapshot Space Usage Monitoring
2136 Starting with HBase 0.95, Snapshot usage information on individual snapshots was shown in the HBase Master Web UI. This was further enhanced starting with HBase 1.3 to show the total Storefile size of the Snapshot Set. The following metrics are shown in the Master Web UI with HBase 1.3 and later.
2138 * Shared Storefile Size is the Storefile size shared between snapshots and active tables.
2139 * Mob Storefile Size is the Mob Storefile size shared between snapshots and active tables.
2140 * Archived Storefile Size is the Storefile size in Archive.
2142 The format of Archived Storefile Size is NNN(MMM). NNN is the total Storefile size in Archive, MMM is the total Storefile size in Archive that is specific to the snapshot (not shared with other snapshots and tables).
2144 .Master Snapshot Overview
2145 image::master-snapshot.png[]
2147 .Snapshot Storefile Stats Example 1
2148 image::1-snapshot.png[]
2150 .Snapshot Storefile Stats Example 2
2151 image::2-snapshots.png[]
2153 .Empty Snapshot Storefile Stats Example
2154 image::empty-snapshots.png[]
2156 == Cluster Replication
2158 NOTE: This information was previously available at
2159 link:https://hbase.apache.org/0.94/replication.html[Cluster Replication].
2161 HBase provides a cluster replication mechanism which allows you to keep one cluster's state synchronized with that of another cluster, using the write-ahead log (WAL) of the source cluster to propagate the changes.
2162 Some use cases for cluster replication include:
2164 * Backup and disaster recovery
2166 * Geographic data distribution
2167 * Online data ingestion combined with offline data analytics
2169 NOTE: Replication is enabled at the granularity of the column family.
2170 Before enabling replication for a column family, create the table and all column families to be replicated, on the destination cluster.
2172 NOTE: Replication is asynchronous, as we ship the WAL to the other cluster in the background, which means that when you want to do recovery through replication, you could lose some data. To address this problem, we have introduced a new feature called synchronous replication. Since the mechanism is a bit different, we use a separate section to describe it. Please see
2173 <<Synchronous Replication,Synchronous Replication>>.
2175 === Replication Overview
2177 Cluster replication uses a source-push methodology.
2178 An HBase cluster can be a source (also called master or active, meaning that it is the originator of new data), a destination (also called slave or passive, meaning that it receives data via replication), or can fulfill both roles at once.
2179 Replication is asynchronous, and the goal of replication is eventual consistency.
2180 When the source receives an edit to a column family with replication enabled, that edit is propagated to all destination clusters using the WAL for that column family on the RegionServer managing the relevant region.
2182 When data is replicated from one cluster to another, the original source of the data is tracked via a cluster ID which is part of the metadata.
2183 In HBase 0.96 and newer (link:https://issues.apache.org/jira/browse/HBASE-7709[HBASE-7709]), all clusters which have already consumed the data are also tracked.
2184 This prevents replication loops.
2186 The WALs for each region server must be kept in HDFS as long as they are needed to replicate data to any slave cluster.
2187 Each region server reads from the oldest log it needs to replicate and keeps track of its progress processing WALs inside ZooKeeper to simplify failure recovery.
2188 The position marker which indicates a slave cluster's progress, as well as the queue of WALs to process, may be different for every slave cluster.
2190 The clusters participating in replication can be of different sizes.
2191 The master cluster relies on randomization to attempt to balance the stream of replication on the slave clusters.
2192 It is expected that the slave cluster has storage capacity to hold the replicated data, as well as any data it is responsible for ingesting.
2193 If a slave cluster does run out of room, or is inaccessible for other reasons, it throws an error and the master retains the WAL and retries the replication at intervals.
2195 .Consistency Across Replicated Clusters
2198 How your application builds on top of the HBase API matters when replication is in play. HBase's replication system provides at-least-once delivery of client edits for an enabled column family to each configured destination cluster. In the event of failure to reach a given destination, the replication system will retry sending edits in a way that might repeat a given message. HBase provides two ways of replication, one is the original replication and the other is serial replication. In the previous way of replication, there is not a guaranteed order of delivery for client edits. In the event of a RegionServer failing, recovery of the replication queue happens independent of recovery of the individual regions that server was previously handling. This means that it is possible for the not-yet-replicated edits to be serviced by a RegionServer that is currently slower to replicate than the one that handles edits from after the failure.
2200 The combination of these two properties (at-least-once delivery and the lack of message ordering) means that some destination clusters may end up in a different state if your application makes use of operations that are not idempotent, e.g. Increments.
2202 To solve the problem, HBase now supports serial replication, which sends edits to destination cluster as the order of requests from client. See <<Serial Replication,Serial Replication>>.
2206 .Terminology Changes
2209 Previously, terms such as [firstterm]_master-master_, [firstterm]_master-slave_, and [firstterm]_cyclical_ were used to describe replication relationships in HBase.
2210 These terms added confusion, and have been abandoned in favor of discussions about cluster topologies appropriate for different scenarios.
2214 * A central source cluster might propagate changes out to multiple destination clusters, for failover or due to geographic distribution.
2215 * A source cluster might push changes to a destination cluster, which might also push its own changes back to the original cluster.
2216 * Many different low-latency clusters might push changes to one centralized cluster for backup or resource-intensive data analytics jobs.
2217 The processed data might then be replicated back to the low-latency clusters.
2219 Multiple levels of replication may be chained together to suit your organization's needs.
2220 The following diagram shows a hypothetical scenario.
2221 Use the arrows to follow the data paths.
2223 .Example of a Complex Cluster Replication Configuration
2224 image::hbase_replication_diagram.jpg[]
2226 HBase replication borrows many concepts from the [firstterm]_statement-based replication_ design used by MySQL.
2227 Instead of SQL statements, entire WALEdits (consisting of multiple cell inserts coming from Put and Delete operations on the clients) are replicated in order to maintain atomicity.
2229 [[hbase.replication.management]]
2230 === Managing and Configuring Cluster Replication
2231 .Cluster Configuration Overview
2233 . Configure and start the source and destination clusters.
2234 Create tables with the same names and column families on both the source and destination clusters, so that the destination cluster knows where to store data it will receive.
2235 . All hosts in the source and destination clusters should be reachable from each other.
2236 . If both clusters use the same ZooKeeper cluster, you must use a different `zookeeper.znode.parent`, because they cannot write in the same folder.
2237 . On the source cluster, in HBase Shell, add the destination cluster as a peer, using the `add_peer` command.
2238 . On the source cluster, in HBase Shell, enable the table replication, using the `enable_table_replication` command.
2239 . Check the logs to see if replication is taking place. If so, you will see messages like the following, coming from the ReplicationSource.
2241 LOG.info("Replicating "+clusterId + " -> " + peerClusterId);
2244 .Serial Replication Configuration
2245 See <<Serial Replication,Serial Replication>>
2247 .Cluster Management Commands
2248 add_peer <ID> <CLUSTER_KEY>::
2249 Adds a replication relationship between two clusters. +
2250 * ID -- a unique string, which must not contain a hyphen.
2251 * CLUSTER_KEY: composed using the following template, with appropriate place-holders: `hbase.zookeeper.quorum:hbase.zookeeper.property.clientPort:zookeeper.znode.parent`. This value can be found on the Master UI info page.
2252 * STATE(optional): ENABLED or DISABLED, default value is ENABLED
2253 list_peers:: list all replication relationships known by this cluster
enable_peer <ID>::
2255 Enable a previously-disabled replication relationship
disable_peer <ID>::
2257 Disable a replication relationship. HBase will no longer send edits to that
2258 peer cluster, but it still keeps track of all the new WALs that it will need
2259 to replicate if and when it is re-enabled. WALs are retained when enabling or disabling
2260 replication as long as peers exist.
remove_peer <ID>::
2262 Disable and remove a replication relationship. HBase will no longer send edits to that peer cluster or keep track of WALs.
2263 enable_table_replication <TABLE_NAME>::
2264 Enable the table replication switch for all its column families. If the table is not found in the destination cluster then it will create one with the same name and column families.
2265 disable_table_replication <TABLE_NAME>::
2266 Disable the table replication switch for all its column families.
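To illustrate these commands together, a session on the source cluster might look like the following; the peer ID, ZooKeeper quorum, and table name are hypothetical:

[source,bash]
----
$ ./bin/hbase shell
hbase> add_peer '1', CLUSTER_KEY => "zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase"
hbase> list_peers
hbase> enable_table_replication 't1'
# Later, to pause shipping edits while keeping the WAL queue, or to drop the peer entirely:
hbase> disable_peer '1'
hbase> remove_peer '1'
----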
2268 === Serial Replication
2270 Note: this feature was introduced in HBase 2.1.
2272 .Function of serial replication
2274 Serial replication pushes logs to the destination cluster in the same order in which they arrive at the source cluster.
2276 .Why is serial replication needed?
2277 In HBase replication, we push mutations to the destination cluster by reading the WAL in each region server. We have a queue of WAL files, so we can read them in order of creation time. However, when a region move or RS failure occurs in the source cluster, the hlog entries that were not pushed before the region move or RS failure will be pushed by the original RS (for a region move) or by another RS which takes over the remaining hlog of the dead RS (for an RS failure), while the new entries for the same region(s) will be pushed by the RS which now serves the region(s); they push the hlog entries of the same region concurrently, without coordination.
2279 This treatment can possibly lead to data inconsistency between source and destination clusters:
2281 1. A put and then a delete are written to the source cluster.
2283 2. Due to a region move / RS failure, they are pushed to the peer cluster by different replication-source threads.
2285 3. If the delete is pushed to the peer cluster before the put, and a flush and major compaction occur in the peer cluster before the put is pushed, the delete is collected and the put remains in the peer cluster, while in the source cluster the put is masked by the delete; hence the data is inconsistent between the source and destination clusters.
2288 .Serial replication configuration
2290 Set the serial flag to true for a replication peer. The default serial flag is false.
2292 * Add a new replication peer whose serial flag is true
2296 hbase> add_peer '1', CLUSTER_KEY => "server1.cie.com:2181:/hbase", SERIAL => true
2299 * Set a replication peer's serial flag to false
2303 hbase> set_peer_serial '1', false
2306 * Set a replication peer's serial flag to true
2310 hbase> set_peer_serial '1', true
2313 The serial replication feature was first implemented in link:https://issues.apache.org/jira/browse/HBASE-9465[HBASE-9465] and then reverted and redone in link:https://issues.apache.org/jira/browse/HBASE-20046[HBASE-20046]. You can find more details in these issues.
2315 === Verifying Replicated Data
2317 The `VerifyReplication` MapReduce job, which is included in HBase, performs a systematic comparison of replicated data between two different clusters. Run the VerifyReplication job on the master cluster, supplying it with the peer ID and table name to use for validation. You can limit the verification further by specifying a time range or specific families. The job's short name is `verifyrep`. To run the job, use a command like the following:
2321 $ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` "${HADOOP_HOME}/bin/hadoop" jar "${HBASE_HOME}/hbase-mapreduce-VERSION.jar" verifyrep --starttime=<timestamp> --endtime=<timestamp> --families=<myFam> <ID> <tableName>
2324 The `VerifyReplication` command prints out `GOODROWS` and `BADROWS` counters to indicate rows that did and did not replicate correctly.
2326 === Detailed Information About Cluster Replication
2328 .Replication Architecture Overview
2329 image::replication_overview.png[]
2331 ==== Life of a WAL Edit
2333 A single WAL edit goes through several steps in order to be replicated to a slave cluster.
2335 . An HBase client uses a Put or Delete operation to manipulate data in HBase.
2336 . The region server writes the request to the WAL in a way that allows it to be replayed if it is not written successfully.
2337 . If the changed cell corresponds to a column family that is scoped for replication, the edit is added to the queue for replication.
2338 . In a separate thread, the edit is read from the log, as part of a batch process.
2339 Only the KeyValues that are eligible for replication are kept.
2340 Replicable KeyValues are part of a column family whose schema is scoped GLOBAL, are not part of a catalog such as `hbase:meta`, did not originate from the target slave cluster, and have not already been consumed by the target slave cluster.
2341 . The edit is tagged with the master's UUID and added to a buffer.
2342 When the buffer is filled, or the reader reaches the end of the file, the buffer is sent to a random region server on the slave cluster.
2343 . The region server reads the edits sequentially and separates them into buffers, one buffer per table.
2344 After all edits are read, each buffer is flushed using link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html[Table], HBase's normal client.
2345 The master's UUID and the UUIDs of slaves which have already consumed the data are preserved in the edits when they are applied, in order to prevent replication loops.
2346 . In the master, the offset for the WAL that is currently being replicated is registered in ZooKeeper.
If the slave cluster is unavailable, the process is as follows:

2348 . The first three steps, where the edit is inserted, are identical.
2349 . Again in a separate thread, the region server reads, filters, and edits the log edits in the same way as above.
2350 The slave region server does not answer the RPC call.
2351 . The master sleeps and tries again a configurable number of times.
2352 . If the slave region server is still not available, the master selects a new subset of region servers to replicate to, and tries again to send the buffer of edits.
2353 . Meanwhile, the WALs are rolled and stored in a queue in ZooKeeper.
2354 Logs that are [firstterm]_archived_ by their region server, by moving them from the region server's log directory to a central log directory, will update their paths in the in-memory queue of the replicating thread.
2355 . When the slave cluster is finally available, the buffer is applied in the same way as during normal processing.
2356 The master region server will then replicate the backlog of logs that accumulated during the outage.
2358 .Spreading Queue Failover Load
2359 When replication is active, a subset of region servers in the source cluster is responsible for shipping edits to the sink.
2360 This responsibility must be failed over like all other region server functions should a process or node crash.
2361 The following configuration settings are recommended for maintaining an even distribution of replication activity over the remaining live servers in the source cluster:
2363 * Set `replication.source.maxretriesmultiplier` to `300`.
2364 * Set `replication.source.sleepforretries` to `1` (1 second). This value, combined with the value of `replication.source.maxretriesmultiplier`, causes the retry cycle to last about 5 minutes.
2365 * Set `replication.sleep.before.failover` to `30000` (30 seconds) in the source cluster site configuration.
2367 [[cluster.replication.preserving.tags]]
2368 .Preserving Tags During Replication
2369 By default, the codec used for replication between clusters strips tags, such as cell-level ACLs, from cells.
2370 To prevent the tags from being stripped, you can use a different codec which does not strip them.
2371 Configure `hbase.replication.rpc.codec` to use `org.apache.hadoop.hbase.codec.KeyValueCodecWithTags`, on both the source and sink RegionServers involved in the replication.
2372 This option was introduced in link:https://issues.apache.org/jira/browse/HBASE-10322[HBASE-10322].
2374 ==== Replication Internals
2376 Replication State in ZooKeeper::
2377 HBase replication maintains its state in ZooKeeper.
2378 By default, the state is contained in the base node _/hbase/replication_.
2379 This node contains two child nodes, the `Peers` znode and the `RS` znode.
2382 The `peers` znode is stored in _/hbase/replication/peers_ by default.
2383 It consists of a list of all peer replication clusters, along with the status of each of them.
2384 The value of each peer is its cluster key, which is provided in the HBase Shell.
2385 The cluster key contains a list of ZooKeeper nodes in the cluster's quorum, the client port for the ZooKeeper quorum, and the base znode for HBase in HDFS on that cluster.
2388 The `rs` znode contains a list of WAL logs which need to be replicated.
2389 This list is divided into a set of queues organized by region server and the peer cluster the region server is shipping the logs to.
2390 The rs znode has one child znode for each region server in the cluster.
2391 The child znode name is the region server's hostname, client port, and start code.
2392 This list includes both live and dead region servers.
2394 ==== Choosing Region Servers to Replicate To
When a master cluster region server initiates a replication source to a slave cluster, it first connects to the slave's ZooKeeper ensemble using the provided cluster key. It then scans the _rs/_ directory to discover all the available sinks (region servers that are accepting incoming streams of edits to replicate) and randomly chooses a subset of them using a configured ratio, which has a default value of 10%. For example, if a slave cluster has 150 machines, 15 will be chosen as potential recipients for the edits that this master cluster region server sends.
2397 Because this selection is performed by each master region server, the probability that all slave region servers are used is very high, and this method works for clusters of any size.
2398 For example, a master cluster of 10 machines replicating to a slave cluster of 5 machines with a ratio of 10% causes the master cluster region servers to choose one machine each at random.
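If you need to tune how many sinks each source considers, this ratio is assumed here to be exposed through the `replication.source.ratio` property; the following fragment is a sketch under that assumption, with an illustrative value:

[source,xml]
----
<!-- Fraction of the slave cluster's region servers considered as potential sinks (assumed property name) -->
<property>
  <name>replication.source.ratio</name>
  <value>0.25</value>
</property>
----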
2400 A ZooKeeper watcher is placed on the _${zookeeper.znode.parent}/rs_ node of the slave cluster by each of the master cluster's region servers.
2401 This watch is used to monitor changes in the composition of the slave cluster.
2402 When nodes are removed from the slave cluster, or if nodes go down or come back up, the master cluster's region servers will respond by selecting a new pool of slave region servers to replicate to.
2404 ==== Keeping Track of Logs
2406 Each master cluster region server has its own znode in the replication znodes hierarchy.
It contains one znode per peer cluster (if there are 5 slave clusters, 5 znodes are created), and each of these contains a queue of WALs to process.
2408 Each of these queues will track the WALs created by that region server, but they can differ in size.
2409 For example, if one slave cluster becomes unavailable for some time, the WALs should not be deleted, so they need to stay in the queue while the others are processed.
2410 See <<rs.failover.details,rs.failover.details>> for an example.
2412 When a source is instantiated, it contains the current WAL that the region server is writing to.
2413 During log rolling, the new file is added to the queue of each slave cluster's znode just before it is made available.
This ensures that all the sources are aware that a new log exists before the region server is able to append edits into it, but this operation is now more expensive.
2415 The queue items are discarded when the replication thread cannot read more entries from a file (because it reached the end of the last block) and there are other files in the queue.
2416 This means that if a source is up to date and replicates from the log that the region server writes to, reading up to the "end" of the current file will not delete the item in the queue.
2418 A log can be archived if it is no longer used or if the number of logs exceeds `hbase.regionserver.maxlogs` because the insertion rate is faster than regions are flushed.
2419 When a log is archived, the source threads are notified that the path for that log changed.
2420 If a particular source has already finished with an archived log, it will just ignore the message.
2421 If the log is in the queue, the path will be updated in memory.
If the log is currently being replicated, the change will be done atomically so that the reader doesn't attempt to open the file when it has already been moved.
Because moving a file is a NameNode operation, if the reader is currently reading the log, it won't generate any exception.
2425 ==== Reading, Filtering and Sending Edits
2427 By default, a source attempts to read from a WAL and ship log entries to a sink as quickly as possible.
Speed is limited by the filtering of log entries: only KeyValues that are scoped GLOBAL and that do not belong to catalog tables are retained.
2429 Speed is also limited by total size of the list of edits to replicate per slave, which is limited to 64 MB by default.
2430 With this configuration, a master cluster region server with three slaves would use at most 192 MB to store data to replicate.
2431 This does not account for the data which was filtered but not garbage collected.
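The 64 MB buffer is assumed here to be governed by the `replication.source.size.capacity` property (value in bytes); the following fragment is a sketch that simply makes that default explicit:

[source,xml]
----
<!-- Maximum size, in bytes, of edits buffered per peer before shipping (assumed property name) -->
<property>
  <name>replication.source.size.capacity</name>
  <value>67108864</value>
</property>
----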
Once the maximum size of edits has been buffered or the reader reaches the end of the WAL, the source thread stops reading and chooses at random a sink to replicate to (from the list that was generated by keeping only a subset of slave region servers). It directly issues an RPC to the chosen region server and waits for the method to return.
2434 If the RPC was successful, the source determines whether the current file has been emptied or it contains more data which needs to be read.
2435 If the file has been emptied, the source deletes the znode in the queue.
2436 Otherwise, it registers the new offset in the log's znode.
2437 If the RPC threw an exception, the source will retry 10 times before trying to find a different sink.
==== Cleaning Logs

If replication is not enabled, the master's log-cleaning thread deletes old logs using a configured TTL.
2442 This TTL-based method does not work well with replication, because archived logs which have exceeded their TTL may still be in a queue.
2443 The default behavior is augmented so that if a log is past its TTL, the cleaning thread looks up every queue until it finds the log, while caching queues it has found.
2444 If the log is not found in any queues, the log will be deleted.
2445 The next time the cleaning process needs to look for a log, it starts by using its cached list.
2447 NOTE: WALs are saved when replication is enabled or disabled as long as peers exist.
2449 [[rs.failover.details]]
2450 ==== Region Server Failover
2452 When no region servers are failing, keeping track of the logs in ZooKeeper adds no value.
2453 Unfortunately, region servers do fail, and since ZooKeeper is highly available, it is useful for managing the transfer of the queues in the event of a failure.
2455 Each of the master cluster region servers keeps a watcher on every other region server, in order to be notified when one dies (just as the master does). When a failure happens, they all race to create a znode called `lock` inside the dead region server's znode that contains its queues.
2456 The region server that creates it successfully then transfers all the queues to its own znode, one at a time since ZooKeeper does not support renaming queues.
2457 After queues are all transferred, they are deleted from the old location.
2458 The znodes that were recovered are renamed with the ID of the slave cluster appended with the name of the dead server.
2460 Next, the master cluster region server creates one new source thread per copied queue, and each of the source threads follows the read/filter/ship pattern.
2461 The main difference is that those queues will never receive new data, since they do not belong to their new region server.
2462 When the reader hits the end of the last log, the queue's znode is deleted and the master cluster region server closes that replication source.
2464 Given a master cluster with 3 region servers replicating to a single slave with id `2`, the following hierarchy represents what the znodes layout could be at some point in time.
2465 The region servers' znodes all contain a `peers` znode which contains a single queue.
2466 The znode names in the queues represent the actual file names on HDFS in the form `address,port.timestamp`.
/hbase/replication/rs/
  1.1.1.1,60020,123456780/
    2/
      1.1.1.1,60020.1234 (Contains a position)
  1.1.1.2,60020,123456790/
    2/
      1.1.1.2,60020.1214 (Contains a position)
  1.1.1.3,60020,123456630/
    2/
      1.1.1.3,60020.1280 (Contains a position)
2485 Assume that 1.1.1.2 loses its ZooKeeper session.
2486 The survivors will race to create a lock, and, arbitrarily, 1.1.1.3 wins.
2487 It will then start transferring all the queues to its local peers znode by appending the name of the dead server.
2488 Right before 1.1.1.3 is able to clean up the old znodes, the layout will look like the following:
/hbase/replication/rs/
  1.1.1.1,60020,123456780/
    2/
      1.1.1.1,60020.1234 (Contains a position)
  1.1.1.2,60020,123456790/
    lock
    2/
      1.1.1.2,60020.1214 (Contains a position)
  1.1.1.3,60020,123456630/
    2/
      1.1.1.3,60020.1280 (Contains a position)
    2-1.1.1.2,60020,123456790/
      1.1.1.2,60020.1214 (Contains a position)
2513 Some time later, but before 1.1.1.3 is able to finish replicating the last WAL from 1.1.1.2, it dies too.
2514 Some new logs were also created in the normal queues.
2515 The last region server will then try to lock 1.1.1.3's znode and will begin transferring all the queues.
2516 The new layout will be:
/hbase/replication/rs/
  1.1.1.1,60020,123456780/
    2/
      1.1.1.1,60020.1378 (Contains a position)
    2-1.1.1.3,60020,123456630/
      1.1.1.3,60020.1325 (Contains a position)
    2-1.1.1.2,60020,123456790-1.1.1.3,60020,123456630/
      1.1.1.2,60020.1312 (Contains a position)
  1.1.1.3,60020,123456630/
    lock
    2/
      1.1.1.3,60020.1325 (Contains a position)
    2-1.1.1.2,60020,123456790/
      1.1.1.2,60020.1312 (Contains a position)
2541 === Replication Metrics
2543 The following metrics are exposed at the global region server level and at the peer level:
2545 `source.sizeOfLogQueue`::
2546 number of WALs to process (excludes the one which is being processed) at the Replication source
2548 `source.shippedOps`::
2549 number of mutations shipped
2551 `source.logEditsRead`::
2552 number of mutations read from WALs at the replication source
2554 `source.ageOfLastShippedOp`::
2555 age of last batch that was shipped by the replication source
2557 `source.completedLogs`::
2558 The number of write-ahead-log files that have completed their acknowledged sending to the peer associated with this source. Increments to this metric are a part of normal operation of HBase replication.
2560 `source.completedRecoverQueues`::
2561 The number of recovery queues this source has completed sending to the associated peer. Increments to this metric are a part of normal recovery of HBase replication in the face of failed Region Servers.
2563 `source.uncleanlyClosedLogs`::
2564 The number of write-ahead-log files the replication system considered completed after reaching the end of readable entries in the face of an uncleanly closed file.
2566 `source.ignoredUncleanlyClosedLogContentsInBytes`::
When a write-ahead-log file is not closed cleanly, there will likely be some entry that has been partially serialized. This metric contains the number of bytes of such entries the HBase replication system believes were remaining at the end of files skipped in the face of an uncleanly closed file. Those bytes should either be in a different file or represent a client write that was not acknowledged.
2569 `source.restartedLogReading`::
The number of times the HBase replication system detected that it failed to correctly parse a cleanly closed write-ahead-log file. In this circumstance, the system replays the entire log from the beginning, ensuring that no edits fail to be acknowledged by the associated peer. Increments to this metric indicate that the HBase replication system is having difficulty correctly handling failures in the underlying distributed storage system. No data loss should occur, but you should check Region Server log files for details of the failures.
2572 `source.repeatedLogFileBytes`::
2573 When the HBase replication system determines that it needs to replay a given write-ahead-log file, this metric is incremented by the number of bytes the replication system believes had already been acknowledged by the associated peer prior to starting over.
2575 `source.closedLogsWithUnknownFileLength`::
Incremented when the HBase replication system believes it is at the end of a write-ahead-log file but it cannot determine the length of that file in the underlying distributed storage system. This could indicate data loss, since the replication system is unable to determine whether the end of readable entries lines up with the expected end of the file. You should check Region Server log files for details of the failures.
2579 === Replication Configuration Options
[cols="1,1,1", options="header"]
|===
| Option | Description | Default

| zookeeper.znode.parent
| The name of the base ZooKeeper znode used for HBase
| /hbase

| zookeeper.znode.replication
| The name of the base znode used for replication
| replication

| zookeeper.znode.replication.peers
| The name of the peer znode
| peers

| zookeeper.znode.replication.peers.state
| The name of the peer-state znode
| peer-state

| zookeeper.znode.replication.rs
| The name of the rs znode
| rs

| replication.sleep.before.failover
| How many milliseconds a worker should sleep before attempting to replicate a dead region server's WAL queues.
| 30000

| replication.executor.workers
| The number of region servers a given region server should attempt to failover simultaneously.
| 1
|===
2618 === Monitoring Replication Status
You can use the HBase Shell command `status 'replication'` to monitor the replication status on your cluster. The command has three variations:

2621 * `status 'replication'` -- prints the status of each source and its sinks, sorted by hostname.
2622 * `status 'replication', 'source'` -- prints the status for each replication source, sorted by hostname.
2623 * `status 'replication', 'sink'` -- prints the status for each replication sink, sorted by hostname.
2625 ==== Understanding the output
The command output will vary according to the state of replication. For example, right after a restart,
if the destination peer is not reachable, no replication source threads will be running,
so no metrics are displayed:
2635 No Reader/Shipper threads runnning yet.
2636 SINK: TimeStampStarted=1591985197350, Waiting for OPs...
Under normal circumstances, a healthy, active-active replication deployment would show values similar to the following:
2646 AgeOfLastShippedOp=0, TimeStampOfLastShippedOp=Fri Jun 12 18:49:23 BST 2020, SizeOfLogQueue=1, EditsReadFromLogQueue=1, OpsShippedToTarget=1, TimeStampOfNextToReplicate=Fri Jun 12 18:49:23 BST 2020, Replication Lag=0
2647 SINK: TimeStampStarted=1591983663458, AgeOfLastAppliedOp=0, TimeStampsOfLastAppliedOp=Fri Jun 12 18:57:18 BST 2020
2650 The definition for each of these metrics is detailed below:
[cols="1,1", options="header"]
|===
| Metric | Description

| AgeOfLastShippedOp
| How long the last successfully shipped edit took to effectively get replicated on the target.

| TimeStampOfLastShippedOp
| The actual date of the last successful edit shipment.

| SizeOfLogQueue
| Number of WAL files on this given queue.

| EditsReadFromLogQueue
| How many edits have been read from this given queue since this source thread started.

| OpsShippedToTarget
| How many edits have been shipped to the target since this source thread started.

| TimeStampOfNextToReplicate
| Date of the current edit being attempted to replicate.

| Replication Lag
| The elapsed time (in millis) since the last edit to replicate was read by this source
thread and effectively replicated to the target.

| TimeStampStarted
| Date (in millis) of when this Sink thread started.

| AgeOfLastAppliedOp
| How long it took to apply the last successfully shipped edit.

| TimeStampsOfLastAppliedOp
| Date of the last successfully applied edit.
|===
Growing values for `Source.TimeStampsOfLastAppliedOp` and/or
`Source.Replication Lag` indicate replication delays. If those numbers keep going
up while `Source.TimeStampOfLastShippedOp`, `Source.EditsReadFromLogQueue`,
`Source.OpsShippedToTarget`, or `Source.TimeStampOfNextToReplicate` do not change at all,
then the replication flow is failing to progress, and there might be problems with
communication between the clusters. This can also happen if replication is manually paused
(via the hbase shell `disable_peer` command, for example) but data keeps getting ingested
into the source cluster tables.
2710 == Running Multiple Workloads On a Single Cluster
2712 HBase provides the following mechanisms for managing the performance of a cluster
2713 handling multiple workloads:
. <<quota>>
. <<request_queues>>
2716 . <<multiple-typed-queues>>
[[quota]]
=== Quotas

HBASE-11598 introduces RPC quotas, which allow you to throttle requests based on
2721 the following limits:
. <<request-quotas,The number or size of requests (read, write, or read+write) in a given timeframe>>
2724 . <<namespace_quotas,The number of tables allowed in a namespace>>
2726 These limits can be enforced for a specified user, table, or namespace.
2730 Quotas are disabled by default. To enable the feature, set the `hbase.quota.enabled`
2731 property to `true` in _hbase-site.xml_ file for all cluster nodes.
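For example:

[source,xml]
----
<property>
  <name>hbase.quota.enabled</name>
  <value>true</value>
</property>
----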
2733 .General Quota Syntax
. THROTTLE_TYPE can be expressed as READ, WRITE, or the default type (read + write).
2735 . Timeframes can be expressed in the following units: `sec`, `min`, `hour`, `day`
2736 . Request sizes can be expressed in the following units: `B` (bytes), `K` (kilobytes),
2737 `M` (megabytes), `G` (gigabytes), `T` (terabytes), `P` (petabytes)
2738 . Numbers of requests are expressed as an integer followed by the string `req`
. Limits relating to time are expressed as req/time or size/time. For instance `10req/day` or `100P/hour`.
2741 . Numbers of tables or regions are expressed as integers.
[[request-quotas]]
.Setting Request Quotas
2745 You can set quota rules ahead of time, or you can change the throttle at runtime. The change
2746 will propagate after the quota refresh period has expired. This expiration period
2747 defaults to 5 minutes. To change it, modify the `hbase.quota.refresh.period` property
2748 in `hbase-site.xml`. This property is expressed in milliseconds and defaults to `300000`.
2751 # Limit user u1 to 10 requests per second
2752 hbase> set_quota TYPE => THROTTLE, USER => 'u1', LIMIT => '10req/sec'
2754 # Limit user u1 to 10 read requests per second
2755 hbase> set_quota TYPE => THROTTLE, THROTTLE_TYPE => READ, USER => 'u1', LIMIT => '10req/sec'
2757 # Limit user u1 to 10 M per day everywhere
2758 hbase> set_quota TYPE => THROTTLE, USER => 'u1', LIMIT => '10M/day'
2760 # Limit user u1 to 10 M write size per sec
2761 hbase> set_quota TYPE => THROTTLE, THROTTLE_TYPE => WRITE, USER => 'u1', LIMIT => '10M/sec'
2763 # Limit user u1 to 5k per minute on table t2
2764 hbase> set_quota TYPE => THROTTLE, USER => 'u1', TABLE => 't2', LIMIT => '5K/min'
2766 # Limit user u1 to 10 read requests per sec on table t2
2767 hbase> set_quota TYPE => THROTTLE, THROTTLE_TYPE => READ, USER => 'u1', TABLE => 't2', LIMIT => '10req/sec'
2769 # Remove an existing limit from user u1 on namespace ns2
2770 hbase> set_quota TYPE => THROTTLE, USER => 'u1', NAMESPACE => 'ns2', LIMIT => NONE
2772 # Limit all users to 10 requests per hour on namespace ns1
2773 hbase> set_quota TYPE => THROTTLE, NAMESPACE => 'ns1', LIMIT => '10req/hour'
2775 # Limit all users to 10 T per hour on table t1
2776 hbase> set_quota TYPE => THROTTLE, TABLE => 't1', LIMIT => '10T/hour'
2778 # Remove all existing limits from user u1
2779 hbase> set_quota TYPE => THROTTLE, USER => 'u1', LIMIT => NONE
2781 # List all quotas for user u1 in namespace ns2
hbase> list_quotas USER => 'u1', NAMESPACE => 'ns2'
2784 # List all quotas for namespace ns2
2785 hbase> list_quotas NAMESPACE => 'ns2'
2787 # List all quotas for table t1
2788 hbase> list_quotas TABLE => 't1'
2794 You can also place a global limit and exclude a user or a table from the limit by applying the
2795 `GLOBAL_BYPASS` property.
2797 hbase> set_quota NAMESPACE => 'ns1', LIMIT => '100req/min' # a per-namespace request limit
2798 hbase> set_quota USER => 'u1', GLOBAL_BYPASS => true # user u1 is not affected by the limit
2801 [[namespace_quotas]]
2802 .Setting Namespace Quotas
2804 You can specify the maximum number of tables or regions allowed in a given namespace, either
2805 when you create the namespace or by altering an existing namespace, by setting the
`hbase.namespace.quota.maxtables` (for tables) or `hbase.namespace.quota.maxregions` (for regions) property on the namespace.
2808 .Limiting Tables Per Namespace
2810 # Create a namespace with a max of 5 tables
2811 hbase> create_namespace 'ns1', {'hbase.namespace.quota.maxtables'=>'5'}
2813 # Alter an existing namespace to have a max of 8 tables
2814 hbase> alter_namespace 'ns2', {METHOD => 'set', 'hbase.namespace.quota.maxtables'=>'8'}
2816 # Show quota information for a namespace
2817 hbase> describe_namespace 'ns2'
2819 # Alter an existing namespace to remove a quota
2820 hbase> alter_namespace 'ns2', {METHOD => 'unset', NAME=>'hbase.namespace.quota.maxtables'}
2823 .Limiting Regions Per Namespace
2825 # Create a namespace with a max of 10 regions
hbase> create_namespace 'ns1', {'hbase.namespace.quota.maxregions'=>'10'}
2828 # Show quota information for a namespace
2829 hbase> describe_namespace 'ns1'
# Alter an existing namespace to have a max of 20 regions
2832 hbase> alter_namespace 'ns2', {METHOD => 'set', 'hbase.namespace.quota.maxregions'=>'20'}
2834 # Alter an existing namespace to remove a quota
2835 hbase> alter_namespace 'ns2', {METHOD => 'unset', NAME=> 'hbase.namespace.quota.maxregions'}
[[request_queues]]
=== Request Queues

If no throttling policy is configured, when the RegionServer receives multiple requests,
2841 they are now placed into a queue waiting for a free execution slot (HBASE-6721).
2842 The simplest queue is a FIFO queue, where each request waits for all previous requests in the queue
2843 to finish before running. Fast or interactive queries can get stuck behind large requests.
2845 If you are able to guess how long a request will take, you can reorder requests by
2846 pushing the long requests to the end of the queue and allowing short requests to preempt
2847 them. Eventually, you must still execute the large requests and prioritize the new
2848 requests behind them. The short requests will be newer, so the result is not terrible,
2849 but still suboptimal compared to a mechanism which allows large requests to be split
2850 into multiple smaller ones.
2852 HBASE-10993 introduces such a system for deprioritizing long-running scanners. There
2853 are two types of queues, `fifo` and `deadline`. To configure the type of queue used,
2854 configure the `hbase.ipc.server.callqueue.type` property in `hbase-site.xml`. There
2855 is no way to estimate how long each request may take, so de-prioritization only affects
2856 scans, and is based on the number of “next” calls a scan request has made. An assumption
2857 is made that when you are doing a full table scan, your job is not likely to be interactive,
2858 so if there are concurrent requests, you can delay long-running scans up to a limit tunable by
2859 setting the `hbase.ipc.server.queue.max.call.delay` property. The slope of the delay is calculated
2860 by a simple square root of `(numNextCall * weight)` where the weight is
2861 configurable by setting the `hbase.ipc.server.scan.vtime.weight` property.
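As a sketch, the properties named above can be set in _hbase-site.xml_; the delay and weight values below are illustrative rather than recommendations:

[source,xml]
----
<property>
  <name>hbase.ipc.server.callqueue.type</name>
  <value>deadline</value>
</property>
<property>
  <name>hbase.ipc.server.queue.max.call.delay</name>
  <value>5000</value>
</property>
<property>
  <name>hbase.ipc.server.scan.vtime.weight</name>
  <value>1.0</value>
</property>
----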
2863 [[multiple-typed-queues]]
2864 === Multiple-Typed Queues
2866 You can also prioritize or deprioritize different kinds of requests by configuring
2867 a specified number of dedicated handlers and queues. You can segregate the scan requests
2868 in a single queue with a single handler, and all the other available queues can service
2869 short `Get` requests.
2871 You can adjust the IPC queues and handlers based on the type of workload, using static
2872 tuning options. This approach is an interim first step that will eventually allow
2873 you to change the settings at runtime, and to dynamically adjust values based on the load.
To avoid contention and separate different kinds of requests, configure the
`hbase.ipc.server.callqueue.handler.factor` property, which allows you to increase the number of
queues and control how many handlers share the same queue.
2882 Using more queues reduces contention when adding a task to a queue or selecting it
2883 from a queue. You can even configure one queue per handler. The trade-off is that
2884 if some queues contain long-running tasks, a handler may need to wait to execute from that queue
2885 rather than stealing from another queue which has waiting tasks.
2887 .Read and Write Queues
2888 With multiple queues, you can now divide read and write requests, giving more priority
2889 (more queues) to one or the other type. Use the `hbase.ipc.server.callqueue.read.ratio`
2890 property to choose to serve more reads or more writes.
2892 .Get and Scan Queues
2893 Similar to the read/write split, you can split gets and scans by tuning the `hbase.ipc.server.callqueue.scan.ratio`
2894 property to give more priority to gets or to scans. A scan ratio of `0.1` will give
2895 more queue/handlers to the incoming gets, which means that more gets can be processed
2896 at the same time and that fewer scans can be executed at the same time. A value of
2897 `0.9` will give more queue/handlers to scans, so the number of scans executed will
2898 increase and the number of gets will decrease.
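For example, a read-heavy deployment might combine these settings as in the following sketch (the ratios are illustrative only):

[source,xml]
----
<!-- With 0.1, ten handlers share each call queue -->
<property>
  <name>hbase.ipc.server.callqueue.handler.factor</name>
  <value>0.1</value>
</property>
<!-- Dedicate 70% of the queues to reads, 30% to writes -->
<property>
  <name>hbase.ipc.server.callqueue.read.ratio</name>
  <value>0.7</value>
</property>
<!-- Of the read queues, keep scans on a small share so gets dominate -->
<property>
  <name>hbase.ipc.server.callqueue.scan.ratio</name>
  <value>0.1</value>
</property>
----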
[[ops.space.quota]]
=== Space Quotas

link:https://issues.apache.org/jira/browse/HBASE-16961[HBASE-16961] introduces a new type of
2904 quotas for HBase to leverage: filesystem quotas. These "space" quotas limit the amount of space
2905 on the filesystem that HBase namespaces and tables can consume. If a user, malicious or ignorant,
2906 has the ability to write data into HBase, with enough time, that user can effectively crash HBase
(or worse, HDFS) by consuming all available space. When there is no filesystem space available,
2908 HBase crashes because it can no longer create/sync data to the write-ahead log.
This feature allows for a limit to be set on the size of a table or namespace. When a space quota is set
2911 on a namespace, the quota's limit applies to the sum of usage of all tables in that namespace.
2912 When a table with a quota exists in a namespace with a quota, the table quota takes priority
2913 over the namespace quota. This allows for a scenario where a large limit can be placed on
2914 a collection of tables, but a single table in that collection can have a fine-grained limit set.
2916 The existing `set_quota` and `list_quota` HBase shell commands can be used to interact with
2917 space quotas. Space quotas are quotas with a `TYPE` of `SPACE` and have `LIMIT` and `POLICY`
2918 attributes. The `LIMIT` is a string that refers to the amount of space on the filesystem
2919 that the quota subject (e.g. the table or namespace) may consume. For example, valid values
2920 of `LIMIT` are `'10G'`, `'2T'`, or `'256M'`. The `POLICY` refers to the action that HBase will
2921 take when the quota subject's usage exceeds the `LIMIT`. The following are valid `POLICY` values.
2923 * `NO_INSERTS` - No new data may be written (e.g. `Put`, `Increment`, `Append`).
2924 * `NO_WRITES` - Same as `NO_INSERTS` but `Deletes` are also disallowed.
2925 * `NO_WRITES_COMPACTIONS` - Same as `NO_WRITES` but compactions are also disallowed.
** This policy only prevents user-submitted compactions. The system can still run compactions.
2927 * `DISABLE` - The table(s) are disabled, preventing all read/write access.
2929 .Setting simple space quotas
2931 # Sets a quota on the table 't1' with a limit of 1GB, disallowing Puts/Increments/Appends when the table exceeds 1GB
2932 hbase> set_quota TYPE => SPACE, TABLE => 't1', LIMIT => '1G', POLICY => NO_INSERTS
2934 # Sets a quota on the namespace 'ns1' with a limit of 50TB, disallowing Puts/Increments/Appends/Deletes
2935 hbase> set_quota TYPE => SPACE, NAMESPACE => 'ns1', LIMIT => '50T', POLICY => NO_WRITES
2937 # Sets a quota on the table 't3' with a limit of 2TB, disallowing any writes and compactions when the table exceeds 2TB.
2938 hbase> set_quota TYPE => SPACE, TABLE => 't3', LIMIT => '2T', POLICY => NO_WRITES_COMPACTIONS
2940 # Sets a quota on the table 't2' with a limit of 50GB, disabling the table when it exceeds 50GB
2941 hbase> set_quota TYPE => SPACE, TABLE => 't2', LIMIT => '50G', POLICY => DISABLE
Consider the following scenario, which sets a quota on a namespace and overrides it with quotas on individual tables in that namespace:
2946 .Table and Namespace space quotas
2948 hbase> create_namespace 'ns1'
2949 hbase> create 'ns1:t1'
2950 hbase> create 'ns1:t2'
2951 hbase> create 'ns1:t3'
2952 hbase> set_quota TYPE => SPACE, NAMESPACE => 'ns1', LIMIT => '100T', POLICY => NO_INSERTS
2953 hbase> set_quota TYPE => SPACE, TABLE => 'ns1:t2', LIMIT => '200G', POLICY => NO_WRITES
2954 hbase> set_quota TYPE => SPACE, TABLE => 'ns1:t3', LIMIT => '20T', POLICY => NO_WRITES
In the above scenario, the tables in the namespace `ns1` will not be allowed to consume more than
100TB of space on the filesystem among them. The table 'ns1:t2' is only allowed to be 200GB in size, and will
disallow all writes when the usage exceeds this limit. The table 'ns1:t3' is allowed to grow to 20TB in size
and will also disallow all writes when the usage exceeds this limit. Because there is no table quota
on 'ns1:t1', this table can grow up to 100TB, but only if 'ns1:t2' and 'ns1:t3' have a usage of zero bytes.
Practically, its limit is 100TB less the current usage of 'ns1:t2' and 'ns1:t3'.
2964 [[ops.space.quota.deletion]]
2965 === Disabling Automatic Space Quota Deletion
2967 By default, if a table or namespace is deleted that has a space quota, the quota itself is
2968 also deleted. In some cases, it may be desirable for the space quota to not be automatically deleted.
2969 In these cases, the user may configure the system to not delete any space quota automatically via hbase-site.xml.
<property>
  <name>hbase.quota.remove.on.table.delete</name>
  <value>false</value>
</property>
2980 The value is set to `true` by default.
2982 === HBase Snapshots with Space Quotas
2984 One common area of unintended-filesystem-use with HBase is via HBase snapshots. Because snapshots
2985 exist outside of the management of HBase tables, it is not uncommon for administrators to suddenly
2986 realize that hundreds of gigabytes or terabytes of space is being used by HBase snapshots which were
2987 forgotten and never removed.
2989 link:https://issues.apache.org/jira/browse/HBASE-17748[HBASE-17748] is the umbrella JIRA issue which
2990 expands on the original space quota functionality to also include HBase snapshots. While this is a confusing
subject, the implementation attempts to present this support in as reasonable and simple a manner as
2992 possible for administrators. This feature does not make any changes to administrator interaction with
2993 space quotas, only in the internal computation of table/namespace usage. Table and namespace usage will
2994 automatically incorporate the size taken by a snapshot per the rules defined below.
2996 As a review, let's cover a snapshot's lifecycle: a snapshot is metadata which points to
2997 a list of HFiles on the filesystem. This is why creating a snapshot is a very cheap operation; no HBase
2998 table data is actually copied to perform a snapshot. Cloning a snapshot into a new table or restoring
2999 a table is a cheap operation for the same reason; the new table references the files which already exist
3000 on the filesystem without a copy. To include snapshots in space quotas, we need to define which table
3001 "owns" a file when a snapshot references the file ("owns" refers to encompassing the filesystem usage
3004 Consider a snapshot which was made against a table. When the snapshot refers to a file and the table no
3005 longer refers to that file, the "originating" table "owns" that file. When multiple snapshots refer to
3006 the same file and no table refers to that file, the snapshot with the lowest-sorting name (lexicographically)
3007 is chosen and the table which that snapshot was created from "owns" that file. HFiles are not "double-counted"
when a table and one or more snapshots refer to that HFile.
3010 When a table is "rematerialized" (via `clone_snapshot` or `restore_snapshot`), a similar problem of file
3011 ownership arises. In this case, while the rematerialized table references a file which a snapshot also
3012 references, the table does not "own" the file. The table from which the snapshot was created still "owns"
3013 that file. When the rematerialized table is compacted or the snapshot is deleted, the rematerialized table
3014 will uniquely refer to a new file and "own" the usage of that file. Similarly, when a table is duplicated via a snapshot
3015 and `restore_snapshot`, the new table will not consume any quota size until the original table stops referring
3016 to the files, either due to a compaction on the original table, a compaction on the new table, or the
3017 original table being deleted.
3019 One new HBase shell command was added to inspect the computed sizes of each snapshot in an HBase instance.
3022 hbase> list_snapshot_sizes
[[ops.backup]]
== HBase Backup

There are two broad strategies for performing HBase backups: backing up with a full cluster shutdown, and backing up on a live cluster.
3031 Each approach has pros and cons.
3033 For additional information, see link:http://blog.sematext.com/2011/03/11/hbase-backup-options/[HBase Backup
3034 Options] over on the Sematext Blog.
3036 [[ops.backup.fullshutdown]]
3037 === Full Shutdown Backup
Some environments can tolerate a periodic full shutdown of their HBase cluster, for example if it is being used as a back-end analytic capacity and not serving front-end web pages.
The benefits are that the NameNode/Master and RegionServers are down, so there is no chance of missing any in-flight changes to either StoreFiles or metadata.
3041 The obvious con is that the cluster is down.
[[ops.backup.fullshutdown.stop]]
==== Stop HBase

[[ops.backup.fullshutdown.distcp]]
==== Distcp

Distcp can be used to copy the contents of the HBase directory in HDFS either to another directory on the same cluster, or to a different cluster.
3054 Note: Distcp works in this situation because the cluster is down and there are no in-flight edits to files.
3055 Distcp-ing of files in the HBase directory is not generally recommended on a live cluster.
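A hedged sketch of such a copy, with hypothetical NameNode addresses and backup paths, might look like the following (run only while HBase is fully shut down):

[source,bash]
----
# Copy the HBase root directory to a backup location on another cluster
$ hadoop distcp hdfs://prod-nn:8020/hbase hdfs://backup-nn:8020/hbase-backups/20240101
----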
3057 [[ops.backup.fullshutdown.restore]]
3058 ==== Restore (if needed)
3060 The backup of the hbase directory from HDFS is copied onto the 'real' hbase directory via distcp.
3061 The act of copying these files creates new HDFS metadata, which is why a restore of the NameNode edits from the time of the HBase backup isn't required for this kind of restore, because it's a restore (via distcp) of a specific HDFS directory (i.e., the HBase part) not the entire HDFS file-system.
3063 [[ops.backup.live.replication]]
3064 === Live Cluster Backup - Replication
3066 This approach assumes that there is a second cluster.
3067 See the HBase page on link:https://hbase.apache.org/book.html#_cluster_replication[replication] for more information.
3069 [[ops.backup.live.copytable]]
3070 === Live Cluster Backup - CopyTable
3072 The <<copy.table,copytable>> utility could either be used to copy data from one table to another on the same cluster, or to copy data to another table on another cluster.
3074 Since the cluster is up, there is a risk that edits could be missed in the copy process.
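For illustration only (the ZooKeeper hosts and table names are placeholders), a copy to another cluster could look like the following sketch:

[source,bash]
----
$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
    --peer.adr=backup-zk1,backup-zk2,backup-zk3:2181:/hbase \
    --new.name=MyTableBackup MyTable
----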
3076 [[ops.backup.live.export]]
3077 === Live Cluster Backup - Export
3079 The <<export,export>> approach dumps the content of a table to HDFS on the same cluster.
3080 To restore the data, the <<import,import>> utility would be used.
3082 Since the cluster is up, there is a risk that edits could be missed in the export process. If you want to know more about HBase back-up and restore see the page on link:http://hbase.apache.org/book.html#backuprestore[Backup and Restore].
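For illustration (the table name and HDFS paths are placeholders), an export and a later import could look like:

[source,bash]
----
# Dump the table to an HDFS directory
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export MyTable /backups/MyTable
# Later, load the dump back into a (pre-created) table
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Import MyTable /backups/MyTable
----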
[[ops.snapshots]]
== HBase Snapshots

HBase Snapshots allow you to take a copy of a table (both contents and metadata) with a very small performance impact. A Snapshot is an immutable
3088 collection of table metadata and a list of HFiles that comprised the table at the time the Snapshot was taken. A "clone"
3089 of a snapshot creates a new table from that snapshot, and a "restore" of a snapshot returns the contents of a table to
3090 what it was when the snapshot was created. The "clone" and "restore" operations do not require any data to be copied,
3091 as the underlying HFiles (the files which contain the data for an HBase table) are not modified with either action.
Similarly, exporting a snapshot to another cluster has little impact on the RegionServers of the local cluster.
Prior to version 0.94.6, the only way to back up or to clone a table was to use CopyTable/ExportTable, or to copy all the hfiles in HDFS after disabling the table.
The disadvantages of these methods are that you can degrade region server performance (Copy/Export Table), or you need to disable the table, which means no reads or writes; this is usually unacceptable.
[[ops.snapshots.configuration]]
=== Configuration
To turn on snapshot support, just set the `hbase.snapshot.enabled` property to `true`.
3101 (Snapshots are enabled by default in 0.95+ and off by default in 0.94.6+)
<property>
  <name>hbase.snapshot.enabled</name>
  <value>true</value>
</property>
[[ops.snapshots.takeasnapshot]]
=== Take a Snapshot
3115 You can take a snapshot of a table regardless of whether it is enabled or disabled.
3116 The snapshot operation doesn't involve any data copying.
3121 hbase> snapshot 'myTable', 'myTableSnapshot-122112'
3124 .Take a Snapshot Without Flushing
3125 The default behavior is to perform a flush of data in memory before the snapshot is taken.
3126 This means that data in memory is included in the snapshot.
3127 In most cases, this is the desired behavior.
However, if your set-up can tolerate data in memory being excluded from the snapshot, you can use the `SKIP_FLUSH` option of the `snapshot` command to disable flushing while taking the snapshot.
3131 hbase> snapshot 'mytable', 'snapshot123', {SKIP_FLUSH => true}
WARNING: There is no way to determine or predict whether a concurrent insert or update will be included in a given snapshot, whether flushing is enabled or disabled.
3135 A snapshot is only a representation of a table during a window of time.
3136 The amount of time the snapshot operation will take to reach each Region Server may vary from a few seconds to a minute, depending on the resource load and speed of the hardware or network, among other factors.
3137 There is also no way to know whether a given insert or update is in memory or has been flushed.
3140 .Take a Snapshot With TTL
3141 Snapshots have a lifecycle that is independent from the table from which they are created.
Although data in a table may be stored with TTL, the data files containing that data become
frozen by the snapshot. Space consumed by expired cells will not be reclaimed by normal
table housekeeping such as compaction. While this is expected, it can be inconvenient at scale.
When many snapshots are under management and the data in various tables is expired by
TTL, some notion of an optional TTL (and an optional default TTL) for snapshots can be useful.
3150 hbase> snapshot 'mytable', 'snapshot1234', {TTL => 86400}
The above command creates snapshot `snapshot1234` with a TTL of 86400 seconds (24 hours),
and hence the snapshot is expected to be cleaned up after 24 hours.
3158 .Default Snapshot TTL:
3159 - User specified default TTL with config `hbase.master.snapshot.ttl`
3160 - FOREVER if `hbase.master.snapshot.ttl` is not set
3162 While creating a snapshot, if TTL in seconds is not explicitly specified, the above logic will be
3163 followed to determine the TTL. If no configs are changed, the default behavior is that all snapshots
3164 will be retained forever (until manual deletion). If a different default TTL behavior is desired,
3165 `hbase.master.snapshot.ttl` can be set to a default TTL in seconds. Any snapshot created without
3166 an explicit TTL will take this new value.
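For example, a site-wide default TTL of 7 days could be configured as follows (the value, in seconds, is illustrative):

[source,xml]
----
<property>
  <name>hbase.master.snapshot.ttl</name>
  <value>604800</value>
</property>
----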
3168 NOTE: If `hbase.master.snapshot.ttl` is set, a snapshot with an explicit {TTL => 0} or
{TTL => -1} will also take this value. In this case, a TTL < -1 (such as {TTL => -2}) should be used
3170 to indicate FOREVER.
3172 To summarize concisely,
3174 1. Snapshot with TTL value < -1 will stay forever regardless of any server side config changes (until deleted manually by user).
3175 2. Snapshot with TTL value > 0 will be deleted automatically soon after TTL expires.
3176 3. Snapshot created without specifying TTL will always have TTL value represented by config `hbase.master.snapshot.ttl`. Default value of this config is 0, which represents: keep the snapshot forever (until deleted manually by user).
3177 4. From client side, TTL value 0 or -1 should never be explicitly provided because they will be treated same as snapshot without TTL (same as above point 3) and hence will use TTL as per value represented by config `hbase.master.snapshot.ttl`.
3179 .Take a snapshot with custom MAX_FILESIZE
3181 Optionally, snapshots can be created with a custom max file size configuration that will be
3182 used by cloned tables, instead of the global `hbase.hregion.max.filesize` configuration property.
3183 This is mostly useful when exporting snapshots between different clusters. If the HBase cluster where
3184 the snapshot is originally taken has a much larger value set for `hbase.hregion.max.filesize` than
3185 one or more clusters where the snapshot is being exported to, a storm of region splits may occur when
3186 restoring the snapshot on destination clusters. Specifying `MAX_FILESIZE` on properties passed to
`snapshot` command will save the specified value into the table's `MAX_FILESIZE`
descriptor at snapshot creation time. If the table already defines the `MAX_FILESIZE` descriptor,
this property will be ignored and have no effect.
3192 snapshot 'table01', 'snap01', {MAX_FILESIZE => 21474836480}
3195 .Enable/Disable Snapshot Auto Cleanup on running cluster:
3197 By default, snapshot auto cleanup based on TTL would be enabled
3198 for any new cluster.
3199 At any point in time, if snapshot cleanup is supposed to be stopped due to
3200 some snapshot restore activity or any other reason, it is advisable
3201 to disable it using shell command:
3204 hbase> snapshot_cleanup_switch false
3207 We can re-enable it using:
3210 hbase> snapshot_cleanup_switch true
The shell command with switch false disables the snapshot auto cleanup activity based on TTL
and returns the previous state of the activity (true: already running, false: already disabled).
3217 A sample output for above commands:
3219 Previous snapshot cleanup state : true
We can query whether snapshot auto cleanup is enabled for the cluster using:
3228 hbase> snapshot_cleanup_enabled
3231 The command would return output in true/false.
3233 [[ops.snapshots.list]]
3234 === Listing Snapshots
3236 List all snapshots taken (by printing the names and relative information).
3241 hbase> list_snapshots
3244 [[ops.snapshots.delete]]
3245 === Deleting Snapshots
3247 You can remove a snapshot, and the files retained for that snapshot will be removed if no longer needed.
3252 hbase> delete_snapshot 'myTableSnapshot-122112'
3255 [[ops.snapshots.clone]]
3256 === Clone a table from snapshot
3258 From a snapshot you can create a new table (clone operation) with the same data that you had when the snapshot was taken.
The clone operation doesn't involve data copies, and a change to the cloned table doesn't impact the snapshot or the original table.
3264 hbase> clone_snapshot 'myTableSnapshot-122112', 'myNewTestTable'
3267 [[ops.snapshots.restore]]
3268 === Restore a snapshot
3270 The restore operation requires the table to be disabled, and the table will be restored to the state at the time when the snapshot was taken, changing both data and schema if required.
3275 hbase> disable 'myTable'
3276 hbase> restore_snapshot 'myTableSnapshot-122112'
3279 NOTE: Since Replication works at log level and snapshots at file-system level, after a restore, the replicas will be in a different state from the master.
3280 If you want to use restore, you need to stop replication and redo the bootstrap.
3282 In case of partial data-loss due to misbehaving client, instead of a full restore that requires the table to be disabled, you can clone the table from the snapshot and use a Map-Reduce job to copy the data that you need, from the clone to the main one.
3284 [[ops.snapshots.acls]]
3285 === Snapshots operations and ACLs
3287 If you are using security with the AccessController Coprocessor (See <<hbase.accesscontrol.configuration,hbase.accesscontrol.configuration>>), only a global administrator can take, clone, or restore a snapshot, and these actions do not capture the ACL rights.
3288 This means that restoring a table preserves the ACL rights of the existing table, while cloning a table creates a new table that has no ACL rights until the administrator adds them.
3290 [[ops.snapshots.export]]
3291 === Export to another cluster
3293 The ExportSnapshot tool copies all the data related to a snapshot (hfiles, logs, snapshot metadata) to another cluster.
The tool executes a Map-Reduce job, similar to distcp, to copy files between the two clusters, and since it works at the file-system level, the HBase cluster does not have to be online.
To copy a snapshot called MySnapshot to an HBase cluster srv2 (hdfs://srv2:8082/hbase) using 16 mappers:
3300 $ bin/hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot MySnapshot -copy-to hdfs://srv2:8082/hbase -mappers 16
3303 .Limiting Bandwidth Consumption
3304 You can limit the bandwidth consumption when exporting a snapshot, by specifying the `-bandwidth` parameter, which expects an integer representing megabytes per second.
3305 The following example limits the above example to 200 MB/sec.
3309 $ bin/hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot MySnapshot -copy-to hdfs://srv2:8082/hbase -mappers 16 -bandwidth 200
[[snapshots_s3]]
=== Storing Snapshots in an Amazon S3 Bucket
3315 You can store and retrieve snapshots from Amazon S3, using the following procedure.
3317 NOTE: You can also store snapshots in Microsoft Azure Blob Storage. See <<snapshots_azure>>.
3320 - You must be using HBase 1.0 or higher and Hadoop 2.6.1 or higher, which is the first
3321 configuration that uses the Amazon AWS SDK.
3322 - You must use the `s3a://` protocol to connect to Amazon S3. The older `s3n://`
3323 and `s3://` protocols have various limitations and do not use the Amazon AWS SDK.
3324 - The `s3a://` URI must be configured and available on the server where you run
3325 the commands to export and restore the snapshot.
3327 After you have fulfilled the prerequisites, take the snapshot like you normally would.
3328 Afterward, you can export it using the `org.apache.hadoop.hbase.snapshot.ExportSnapshot`
3329 command like the one below, substituting your own `s3a://` path in the `copy-from`
3330 or `copy-to` directive and substituting or modifying other options as required:
3333 $ hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
3334 -snapshot MySnapshot \
3335 -copy-from hdfs://srv2:8082/hbase \
-copy-to s3a://<bucket>/<namespace>/hbase

To copy the snapshot back from Amazon S3 into HDFS, reverse the `-copy-from` and `-copy-to` directives:

$ hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
-snapshot MySnapshot \
3346 -copy-from s3a://<bucket>/<namespace>/hbase \
-copy-to hdfs://srv2:8082/hbase
3354 You can also use the `org.apache.hadoop.hbase.snapshot.SnapshotInfo` utility with the `s3a://` path by including the
3355 `-remote-dir` option.
3358 $ hbase org.apache.hadoop.hbase.snapshot.SnapshotInfo \
-remote-dir s3a://<bucket>/<namespace>/hbase \
-snapshot MySnapshot
[[snapshots_azure]]
== Storing Snapshots in Microsoft Azure Blob Storage

You can store snapshots in Microsoft Azure Blob Storage using the same techniques
3367 as in <<snapshots_s3>>.
3370 - You must be using HBase 1.2 or higher with Hadoop 2.7.1 or
3371 higher. No version of HBase supports Hadoop 2.7.0.
3372 - Your hosts must be configured to be aware of the Azure blob storage filesystem.
3373 See https://hadoop.apache.org/docs/r2.7.1/hadoop-azure/index.html.
3375 After you meet the prerequisites, follow the instructions
in <<snapshots_s3>>, replacing the protocol specifier with `wasb://` or `wasbs://`.
3379 == Capacity Planning and Region Sizing
3381 There are several considerations when planning the capacity for an HBase cluster and performing the initial configuration.
3382 Start with a solid understanding of how HBase handles data internally.
3384 [[ops.capacity.nodes]]
3385 === Node count and hardware/VM configuration
3387 [[ops.capacity.nodes.datasize]]
3388 ==== Physical data size
3390 Physical data size on disk is distinct from logical size of your data and is affected by the following:
3392 * Increased by HBase overhead
** See <<keyvalue,keyvalue>> and <<keysize,keysize>>.
3395 At least 24 bytes per key-value (cell), can be more.
3396 Small keys/values means more relative overhead.
** KeyValue instances are aggregated into blocks, which are indexed.
3398 Indexes also have to be stored.
3399 Blocksize is configurable on a per-ColumnFamily basis.
3400 See <<regions.arch,regions.arch>>.
3402 * Decreased by <<compression,compression>> and data block encoding, depending on data.
3403 You might want to test what compression and encoding (if any) make sense for your data.
3404 * Increased by size of region server <<wal,wal>> (usually fixed and negligible - less than half of RS memory size, per RS).
3405 * Increased by HDFS replication - usually x3.
3407 Aside from the disk space necessary to store the data, one RS may not be able to serve arbitrarily large amounts of data due to some practical limits on region count and size (see <<ops.capacity.regions,ops.capacity.regions>>).
3409 [[ops.capacity.nodes.throughput]]
3410 ==== Read/Write throughput
3412 Number of nodes can also be driven by required throughput for reads and/or writes.
3413 The throughput one can get per node depends a lot on data (esp.
3414 key/value sizes) and request patterns, as well as node and system configuration.
3415 Planning should be done for peak load if it is likely that the load would be the main driver of the increase of the node count.
PerformanceEvaluation and <<ycsb,ycsb>> tools can be used to test a single node or a test cluster.
3418 For write, usually 5-15Mb/s per RS can be expected, since every region server has only one active WAL.
3419 There's no good estimate for reads, as it depends vastly on data, requests, and cache hit rate. <<perf.casestudy,perf.casestudy>> might be helpful.
3421 [[ops.capacity.nodes.gc]]
3422 ==== JVM GC limitations
3424 RS cannot currently utilize very large heap due to cost of GC.
3425 There's also no good way of running multiple RS-es per server (other than running several VMs per machine). Thus, ~20-24Gb or less memory dedicated to one RS is recommended.
3426 GC tuning is required for large heap sizes.
3427 See <<gcpause,gcpause>>, <<trouble.log.gc,trouble.log.gc>> and elsewhere (TODO: where?)
3429 [[ops.capacity.regions]]
3430 === Determining region count and size
Generally, fewer regions make for a smoother running cluster (you can always manually split the big regions later (if necessary) to spread the data, or request load, over the cluster); 20-200 regions per RS is a reasonable range.
The number of regions cannot be configured directly (unless you go for fully <<disable.splitting,disable.splitting>>); adjust the region size to achieve the target region count given table size.
3435 When configuring regions for multiple tables, note that most region settings can be set on a per-table basis via link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HTableDescriptor.html[HTableDescriptor], as well as shell commands.
3436 These settings will override the ones in `hbase-site.xml`.
3437 That is useful if your tables have different workloads/use cases.
3439 Also note that in the discussion of region sizes here, _HDFS replication factor is not (and should not be) taken into account, whereas
3440 other factors <<ops.capacity.nodes.datasize,ops.capacity.nodes.datasize>> should be._ So, if your data is compressed and replicated 3 ways by HDFS, "9 Gb region" means 9 Gb of compressed data.
3441 HDFS replication factor only affects your disk usage and is invisible to most HBase code.
3443 ==== Viewing the Current Number of Regions
3445 You can view the current number of regions for a given table using the HMaster UI.
3446 In the [label]#Tables# section, the number of online regions for each table is listed in the [label]#Online Regions# column.
3447 This total only includes the in-memory state and does not include disabled or offline regions.
3449 [[ops.capacity.regions.count]]
3450 ==== Number of regions per RS - upper bound
3452 In production scenarios, where you have a lot of data, you are normally concerned with the maximum number of regions you can have per server. <<too_many_regions,too many regions>> has technical discussion on the subject.
3453 Basically, the maximum number of regions is mostly determined by memstore memory usage.
3454 Each region has its own memstores; these grow up to a configurable size; usually in 128-256 MB range, see <<hbase.hregion.memstore.flush.size,hbase.hregion.memstore.flush.size>>.
3455 One memstore exists per column family (so there's only one per region if there's one CF in the table). The RS dedicates some fraction of total memory to its memstores (see <<hbase.regionserver.global.memstore.size,hbase.regionserver.global.memstore.size>>). If this memory is exceeded (too much memstore usage), it can cause undesirable consequences such as unresponsive server or compaction storms.
3456 A good starting point for the number of regions per RS (assuming one table) is:
3460 ((RS memory) * (total memstore fraction)) / ((memstore size)*(# column families))
3463 This formula is pseudo-code.
3464 Here are two formulas using the actual tunable parameters, first for HBase 0.98+ and second for HBase 0.94.x.
3468 ((RS Xmx) * hbase.regionserver.global.memstore.size) / (hbase.hregion.memstore.flush.size * (# column families))
((RS Xmx) * hbase.regionserver.global.memstore.upperLimit) / (hbase.hregion.memstore.flush.size * (# column families))
If a given RegionServer has 16 GB of RAM, with default settings, the formula works out to 16384*0.4/128 ~ 51 regions per RS as a starting point.
3476 The formula can be extended to multiple tables; if they all have the same configuration, just use the total number of families.
3478 This number can be adjusted; the formula above assumes all your regions are filled at approximately the same rate.
3479 If only a fraction of your regions are going to be actively written to, you can divide the result by that fraction to get a larger region count.
3480 Then, even if all regions are written to, all region memstores are not filled evenly, and eventually jitter appears even if they are (due to limited number of concurrent flushes). Thus, one can have as many as 2-3 times more regions than the starting point; however, increased numbers carry increased risk.
3482 For write-heavy workload, memstore fraction can be increased in configuration at the expense of block cache; this will also allow one to have more regions.
3484 [[ops.capacity.regions.mincount]]
3485 ==== Number of regions per RS - lower bound
3487 HBase scales by having regions across many servers.
Thus if you have 2 regions for 16GB data, on a 20-node cluster your data will be concentrated on just a few machines - nearly the entire cluster will be idle.
3489 This really can't be stressed enough, since a common problem is loading 200MB data into HBase and then wondering why your awesome 10 node cluster isn't doing anything.
3491 On the other hand, if you have a very large amount of data, you may also want to go for a larger number of regions to avoid having regions that are too large.
3493 [[ops.capacity.regions.size]]
3494 ==== Maximum region size
3496 For large tables in production scenarios, maximum region size is mostly limited by compactions - very large compactions, esp.
3497 major, can degrade cluster performance.
3498 Currently, the recommended maximum region size is 10-20Gb, and 5-10Gb is optimal.
3499 For older 0.90.x codebase, the upper-bound of regionsize is about 4Gb, with a default of 256Mb.
3501 The size at which the region is split into two is generally configured via <<hbase.hregion.max.filesize,hbase.hregion.max.filesize>>; for details, see <<arch.region.splits,arch.region.splits>>.
If you cannot estimate the size of your tables well, when starting off it's probably best to stick to the default region size, perhaps going smaller for hot tables (or manually splitting hot regions to spread the load over the cluster), or going with larger region sizes if your cell sizes tend to be largish (100k and up).

In HBase 0.98, the experimental stripe compactions feature was added, which allows for larger regions, especially for log data.
3506 See <<ops.stripe,ops.stripe>>.
3508 [[ops.capacity.regions.total]]
3509 ==== Total data size per region server
According to the above numbers for region size and number of regions per region server, in an optimistic estimate 10 GB x 100 regions per RS will give up to 1 TB served per region server, which is in line with some of the reported multi-PB use cases.
3512 However, it is important to think about the data vs cache size ratio at the RS level.
3513 With 1TB of data per server and 10 GB block cache, only 1% of the data will be cached, which may barely cover all block indices.
3515 [[ops.capacity.config]]
3516 === Initial configuration and tuning
3518 First, see <<important_configurations,important configurations>>.
3519 Note that some configurations, more than others, depend on specific scenarios.
3520 Pay special attention to:
3522 * <<hbase.regionserver.handler.count,hbase.regionserver.handler.count>> - request handler thread count, vital for high-throughput workloads.
3523 * <<config.wals,config.wals>> - the blocking number of WAL files depends on your memstore configuration and should be set accordingly to prevent potential blocking when doing high volume of writes.
3525 Then, there are some considerations when setting up your cluster and tables.
3527 [[ops.capacity.config.compactions]]
==== Compactions

Depending on read/write volume and latency requirements, optimal compaction settings may be different.
3531 See <<compaction,compaction>> for some details.
3533 When provisioning for large data sizes, however, it's good to keep in mind that compactions can affect write throughput.
Thus, for write-intensive workloads, you may opt for less frequent compactions and more store files per region.
The minimum number of files for compactions (`hbase.hstore.compaction.min`) can be set to a higher value; <<hbase.hstore.blockingStoreFiles,hbase.hstore.blockingStoreFiles>> should also be increased, as more files might accumulate in that case.
3536 You may also consider manually managing compactions: <<managed.compactions,managed.compactions>>
3538 [[ops.capacity.config.presplit]]
3539 ==== Pre-splitting the table
3541 Based on the target number of the regions per RS (see <<ops.capacity.regions.count,ops.capacity.regions.count>>) and number of RSes, one can pre-split the table at creation time.
3542 This would both avoid some costly splitting as the table starts to fill up, and ensure that the table starts out already distributed across many servers.
3544 If the table is expected to grow large enough to justify that, at least one region per RS should be created.
3545 It is not recommended to split immediately into the full target number of regions (e.g.
3546 50 * number of RSes), but a low intermediate value can be chosen.
3547 For multiple tables, it is recommended to be conservative with presplitting (e.g.
3548 pre-split 1 region per RS at most), especially if you don't know how much each table will grow.
3549 If you split too much, you may end up with too many regions, with some tables having too many small regions.
3551 For pre-splitting howto, see <<manual_region_splitting_decisions,manual region splitting decisions>> and <<precreate.regions,precreate.regions>>.
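As an illustration only, the sketch below pre-splits a table at creation time through the Java client API; the table name, column family, and split keys are hypothetical, and real split keys should reflect your expected row key distribution. The shell's `create` command accepts a `SPLITS` option for the same purpose.

[source,java]
----
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PresplitExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      TableDescriptor desc = TableDescriptorBuilder
          .newBuilder(TableName.valueOf("pre_split_table")) // hypothetical table name
          .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
          .build();
      // Three split keys create four regions up front.
      byte[][] splitKeys = {
          Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("t")
      };
      admin.createTable(desc, splitKeys);
    }
  }
}
----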
[[table.rename]]
== Table Rename

In versions 0.90.x of hbase and earlier, we had a simple script that would rename the hdfs table directory and then do an edit of the hbase:meta table, replacing all mentions of the old table name with the new one.
The script was called `./bin/rename_table.rb`.
The script was deprecated and removed, mostly because it was unmaintained and the operation it performed was brutal.

As of hbase 0.94.x, you can use the snapshot facility to rename a table.
3561 Here is how you would do it using the hbase shell:
3564 hbase shell> disable 'tableName'
3565 hbase shell> snapshot 'tableName', 'tableSnapshot'
3566 hbase shell> clone_snapshot 'tableSnapshot', 'newTableName'
3567 hbase shell> delete_snapshot 'tableSnapshot'
3568 hbase shell> drop 'tableName'
3571 or in code it would be as follows:
void rename(Admin admin, TableName oldTableName, TableName newTableName) throws IOException {
  // snapshotName just needs to be unique; randomName() stands in for whatever
  // unique-name helper the caller provides.
  String snapshotName = randomName();
  admin.disableTable(oldTableName);
  admin.snapshot(snapshotName, oldTableName);
  admin.cloneSnapshot(snapshotName, newTableName);
  admin.deleteSnapshot(snapshotName);
  admin.deleteTable(oldTableName);
}
3586 == RegionServer Grouping
3587 RegionServer Grouping (A.K.A `rsgroup`) is an advanced feature for
3588 partitioning regionservers into distinctive groups for strict isolation. It
3589 should only be used by users who are sophisticated enough to understand the
3590 full implications and have a sufficient background in managing HBase clusters.
3591 It was developed by Yahoo! and they run it at scale on their large grid cluster.
3592 See link:http://www.slideshare.net/HBaseCon/keynote-apache-hbase-at-yahoo-scale[HBase at Yahoo! Scale].
3594 RSGroups can be defined and managed with both admin methods and shell commands.
3595 A server can be added to a group with hostname and port pair and tables
3596 can be moved to this group so that only regionservers in the same rsgroup can
3597 host the regions of the table. The group for a table is stored in its
3598 TableDescriptor, the property name is `hbase.rsgroup.name`. You can also set
3599 this property on a namespace, so it will cause all the tables under this
3600 namespace to be placed into this group. RegionServers and tables can only
3601 belong to one rsgroup at a time. By default, all tables and regionservers
3602 belong to the `default` rsgroup. System tables can also be put into a
3603 rsgroup using the regular APIs. A custom balancer implementation tracks
3604 assignments per rsgroup and makes sure to move regions to the relevant
3605 regionservers in that rsgroup. The rsgroup information is stored in a regular
HBase table, and a zookeeper-based read-only cache is used at cluster bootstrap time.
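For example, because the group is just a property on the table descriptor, it can be set from the Java client roughly as in the sketch below (table and group names are placeholders; newer clients may also offer a dedicated builder method for this, so treat this as one possible approach rather than the canonical one):

[source,java]
----
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class AssignTableToRSGroup {
  // Stamps the hbase.rsgroup.name property onto an existing table's descriptor.
  static void assignToGroup(Admin admin, TableName table, String group) throws IOException {
    TableDescriptor current = admin.getDescriptor(table);
    TableDescriptor updated = TableDescriptorBuilder.newBuilder(current)
        .setValue("hbase.rsgroup.name", group)  // property name documented above
        .build();
    admin.modifyTable(updated);
  }
}
----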
3609 To enable, add the following to your hbase-site.xml and restart your Master:
<property>
  <name>hbase.balancer.rsgroup.enabled</name>
  <value>true</value>
</property>
3619 Then use the admin/shell _rsgroup_ methods/commands to create and manipulate
3620 RegionServer groups: e.g. to add a rsgroup and then add a server to it.
3621 To see the list of rsgroup commands available in the hbase shell type:
3625 hbase(main):008:0> help 'rsgroup'
At a high level, you create an rsgroup other than the `default` group using the
_add_rsgroup_ command. You then add servers and tables to this group with the
_move_servers_rsgroup_ and _move_tables_rsgroup_ commands. If necessary, run
a balance for the group with the _balance_rsgroup_ command if tables are slow
to migrate to the group's dedicated servers (usually this is not needed). To
monitor the effect of the commands, see the `Tables` tab toward the end of the
Master UI home page. If you click on a table, you can see what servers it is
deployed across. You should see here a reflection of the grouping done with
your shell commands. View the master log if issues arise.
Here is an example using a few of the rsgroup commands. To add a group, do as follows:
3644 hbase(main):008:0> add_rsgroup 'my_group'
3649 .RegionServer Groups must be Enabled
If you have not enabled the rsgroup feature and you call any of the rsgroup
admin methods or shell commands, the call will fail with a
`DoNotRetryIOException` with a detail message saying that the rsgroup feature
is disabled.
3658 Add a server (specified by hostname + port) to the just-made group using the
3659 _move_servers_rsgroup_ command as follows:
3663 hbase(main):010:0> move_servers_rsgroup 'my_group',['k.att.net:51129']
3666 .Hostname and Port vs ServerName
3669 The rsgroup feature refers to servers in a cluster with hostname and port only.
3670 It does not make use of the HBase ServerName type identifying RegionServers;
3671 i.e. hostname + port + starttime to distinguish RegionServer instances. The
3672 rsgroup feature keeps working across RegionServer restarts so the starttime of
3673 ServerName -- and hence the ServerName type -- is not appropriate.
Servers come and go over the lifetime of a cluster. Currently, you must
manually align the servers referenced in rsgroups with the actual state of
nodes in the running cluster. What we mean by this is that if you decommission
a server, then you must update rsgroups as part of your server decommission
process, removing references. Note that calling `clearDeadServers`
manually will also remove the dead servers from any rsgroups, but the problem
is that we will lose track of the dead servers after the master restarts, which
means you still need to update the rsgroup on your own.
Please use `Admin.removeServersFromRSGroup` or the shell command
_remove_servers_rsgroup_ to remove decommissioned servers from an rsgroup.
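For example, a decommission procedure might drop a server from its rsgroup through the `Admin` API as in the sketch below; the hostname and port are placeholders, and the `Set<Address>` parameter shape is an assumption based on how the rsgroup feature addresses servers:

[source,java]
----
import java.io.IOException;
import java.util.Collections;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.net.Address;

public class DecommissionFromRSGroup {
  // Removes a decommissioned server, identified by hostname and port as the
  // rsgroup feature requires, from whatever rsgroup currently references it.
  static void dropFromRSGroup(Admin admin, String hostname, int port) throws IOException {
    Address server = Address.fromParts(hostname, port);
    admin.removeServersFromRSGroup(Collections.singleton(server));
  }
}
----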
3689 The `default` group is not like other rsgroups in that it is dynamic. Its server
3690 list mirrors the current state of the cluster; i.e. if you shutdown a server that
3691 was part of the `default` rsgroup, and then do a _get_rsgroup_ `default` to list
3692 its content in the shell, the server will no longer be listed. For non-default
groups, though a node may be offline, it will persist in the non-default group's
3694 list of servers. But if you move the offline server from the non-default rsgroup
3695 to default, it will not show in the `default` list. It will just be dropped.
=== Best Practices

The authors of the rsgroup feature, the Yahoo! HBase Engineering team, have been
3699 running it on their grid for a good while now and have come up with a few best
3700 practices informed by their experience.
3702 ==== Isolate System Tables
Either have a system rsgroup where all the system tables are, or just leave the
system tables in the `default` rsgroup and have all user-space tables in
non-default rsgroups.
==== Dead Nodes

Yahoo! have found it useful at their scale to keep a special rsgroup of dead or
3709 questionable nodes; this is one means of keeping them out of the running until repair.
3711 Be careful replacing dead nodes in an rsgroup. Ensure there are enough live nodes
3712 before you start moving out the dead. Move in good live nodes first if you have to.
=== Troubleshooting

Viewing the Master log will give you insight into rsgroup operation.

If it appears stuck, restart the Master process.
3719 === Remove RegionServer Grouping
Disabling the RegionServer Grouping feature is easy: just remove
'hbase.balancer.rsgroup.enabled' from hbase-site.xml or explicitly set it to
false in hbase-site.xml.
<property>
  <name>hbase.balancer.rsgroup.enabled</name>
  <value>false</value>
</property>
But if you later change 'hbase.balancer.rsgroup.enabled' back to true, the old
rsgroup configs will take effect again. So if you want to completely remove the
RegionServer Grouping feature from a cluster, so that the old metadata will not
affect the functioning of the cluster if the feature is re-enabled in the
future, there are more steps to do.
3738 - Move all tables in non-default rsgroups to `default` regionserver group
3741 #Reassigning table t1 from non default group - hbase shell
3742 hbase(main):005:0> move_tables_rsgroup 'default',['t1']
3744 - Move all regionservers in non-default rsgroups to `default` regionserver group
3747 #Reassigning all the servers in the non-default rsgroup to default - hbase shell
3748 hbase(main):008:0> move_servers_rsgroup 'default',['rs1.xxx.com:16206','rs2.xxx.com:16202','rs3.xxx.com:16204']
- Remove all non-default rsgroups. The implicitly created `default` rsgroup doesn't have to be removed
3753 #removing non default rsgroup - hbase shell
3754 hbase(main):009:0> remove_rsgroup 'group2'
3756 - Remove the changes made in `hbase-site.xml` and restart the cluster
3757 - Drop the table `hbase:rsgroup` from `hbase`
3760 #Through hbase shell drop table hbase:rsgroup
3761 hbase(main):001:0> disable 'hbase:rsgroup'
3762 0 row(s) in 2.6270 seconds
3764 hbase(main):002:0> drop 'hbase:rsgroup'
3765 0 row(s) in 1.2730 seconds
3767 - Remove znode `rsgroup` from the cluster ZooKeeper using zkCli.sh
3770 #From ZK remove the node /hbase/rsgroup through zkCli.sh
=== ACL

To enable ACL, add the following to your hbase-site.xml and restart your Master:
<property>
  <name>hbase.security.authorization</name>
  <value>true</value>
</property>
3784 [[migrating.rsgroup]]
3785 === Migrating From Old Implementation
The coprocessor `org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint` is
deprecated, but for compatibility, if you want a pre-3.0.0 hbase client/shell
to communicate with the new hbase cluster, you still need to add this
coprocessor to the master.

The `hbase.rsgroup.grouploadbalancer.class` config has been deprecated, as now
the top-level load balancer will always be `RSGroupBasedLoadBalancer`, and the
`hbase.master.loadbalancer.class` config is for configuring the balancer within
a group. This also means you should not set `hbase.master.loadbalancer.class`
to `RSGroupBasedLoadBalancer` any more, even if the rsgroup feature is enabled.
We have also made some special changes for compatibility. First, if the coprocessor
`org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint` is specified, the
`hbase.balancer.rsgroup.enabled` flag will be set to true automatically to
enable the rsgroup feature. Second, we will load
`hbase.rsgroup.grouploadbalancer.class` in preference to
`hbase.master.loadbalancer.class`. And last, if you do not set
`hbase.rsgroup.grouploadbalancer.class` but only set
`hbase.master.loadbalancer.class` to `RSGroupBasedLoadBalancer`, we will load
the default load balancer to avoid infinite nesting. This means you do not need
to change anything when upgrading if you have already enabled the rsgroup feature.
The main difference compared to the old implementation is that the rsgroup for
a table is now stored in `TableDescriptor` instead of in `RSGroupInfo`, so the
`getTables` method of `RSGroupInfo` has been deprecated. And if you use the
`Admin` methods to get the `RSGroupInfo`, its `getTables` method will always
return empty. This is because, in the old implementation, this method was a bit
broken: you could set an rsgroup on a namespace and thereby place all the
tables under that namespace into the group, but you could not get these tables
through `RSGroupInfo.getTables`. Now you should use the two new `Admin` methods
`listTablesInRSGroup` and `getConfiguredNamespacesAndTablesInRSGroup` to get
the tables and namespaces in an rsgroup.
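A rough sketch of how these two methods might be used follows; the group name is a placeholder, and the `List`/`Pair` return shapes shown are assumptions rather than guaranteed signatures:

[source,java]
----
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.util.Pair;

public class ListRSGroupContents {
  // Prints the tables currently in a group, plus the namespaces and tables
  // explicitly configured into it.
  static void show(Admin admin, String group) throws IOException {
    List<TableName> tables = admin.listTablesInRSGroup(group);
    Pair<List<String>, List<TableName>> configured =
        admin.getConfiguredNamespacesAndTablesInRSGroup(group);
    System.out.println("tables in group: " + tables);
    System.out.println("configured namespaces: " + configured.getFirst());
    System.out.println("configured tables: " + configured.getSecond());
  }
}
----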
Of course the behavior of the old RSGroupAdminEndpoint is not changed;
we will fill in the tables field of the RSGroupInfo before returning, to keep it
compatible with old hbase clients/shells.
When upgrading, the migration between RSGroupInfo and TableDescriptor will
be done automatically. It will take some time, but it is fine to restart the
master in the middle; the migration will continue after the restart. During the
migration, the rsgroup feature will still work, and in most cases regions will
not be misplaced (since this is only a one-time job and does not last long, we
have not tested it rigorously enough to guarantee that regions are never
misplaced, hence 'in most cases'). The implementation is a bit tricky; you can
see the code in `RSGroupInfoManagerImpl.migrate` if you are interested.
3838 == Region Normalizer
3840 The Region Normalizer tries to make Regions all in a table about the same in
3841 size. It does this by first calculating total table size and average size per
3842 region. It splits any region that is larger than twice this size. Any region
3843 that is much smaller is merged into an adjacent region. The Normalizer runs on
3844 a regular schedule, which is configurable. It can also be disabled entirely via
3845 a runtime "switch". It can be run manually via the shell or Admin API call.
Even if normally disabled, it is good to run the Normalizer manually after the cluster has
been running a while, or after a burst of activity such as a large delete.
3849 The Normalizer works well for bringing a table's region boundaries into
3850 alignment with the reality of data distribution after an initial effort at
3851 pre-splitting a table. It is also a nice compliment to the data TTL feature
3852 when the schema includes timestamp in the rowkey, as it will automatically
3853 merge away regions whose contents have expired.
3855 (The bulk of the below detail was copied wholesale from the blog by Romil Choksi at
3856 link:https://community.hortonworks.com/articles/54987/hbase-region-normalizer.html[HBase Region Normalizer]).
The Region Normalizer is a feature available since HBase 1.2. It runs a set of
pre-calculated merge/split actions to resize regions that are either too
large or too small compared to the average region size for a given table. When
invoked, the Region Normalizer computes a normalization 'plan' for all of the
tables in HBase. System tables (such as hbase:meta, hbase:namespace, Phoenix
system tables, etc.) and user tables with normalization disabled are ignored
while computing the plan. For normalization-enabled tables, the normalization
plan is carried out in parallel across multiple tables.
The Normalizer can be enabled or disabled globally for the entire cluster using the
`normalizer_switch` command in the HBase shell. Normalization can also be
controlled on a per-table basis, and it is disabled by default when a table is
created. Normalization for a table can be enabled or disabled by setting the
NORMALIZATION_ENABLED table attribute to true or false.
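The same per-table attribute can also be toggled from the Java client, roughly as below; the sketch assumes the `setNormalizationEnabled` builder method available in recent 2.x+ clients, and the table name is a placeholder:

[source,java]
----
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class EnableTableNormalization {
  // Turns on the per-table normalization flag (NORMALIZATION_ENABLED) for an
  // existing table by modifying its descriptor.
  static void enable(Admin admin, TableName table) throws IOException {
    admin.modifyTable(TableDescriptorBuilder.newBuilder(admin.getDescriptor(table))
        .setNormalizationEnabled(true)
        .build());
  }
}
----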
To check normalizer status and enable/disable the normalizer:
3877 hbase(main):001:0> normalizer_enabled
3879 0 row(s) in 0.4870 seconds
3881 hbase(main):002:0> normalizer_switch false
3883 0 row(s) in 0.0640 seconds
3885 hbase(main):003:0> normalizer_enabled
3887 0 row(s) in 0.0120 seconds
3889 hbase(main):004:0> normalizer_switch true
3891 0 row(s) in 0.0200 seconds
3893 hbase(main):005:0> normalizer_enabled
3895 0 row(s) in 0.0090 seconds
When enabled, the Normalizer is invoked in the background every 5 minutes (by default),
which can be configured using `hbase.normalization.period` in `hbase-site.xml`.
The Normalizer can also be invoked manually/programmatically at will using the HBase
shell's `normalize` command. HBase by default uses `SimpleRegionNormalizer`, but users
can design their own normalizer as long as they implement the `RegionNormalizer` interface.
3903 Details about the logic used by `SimpleRegionNormalizer` to compute its normalization
3904 plan can be found link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/master/normalizer/SimpleRegionNormalizer.html[here].
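A rough Java counterpart to the shell commands above might look like the following sketch, assuming the `normalizerSwitch`, `isNormalizerEnabled`, and `normalize` methods exposed by recent `Admin` client versions:

[source,java]
----
import java.io.IOException;
import org.apache.hadoop.hbase.client.Admin;

public class RunNormalizer {
  // Programmatic equivalents of normalizer_switch / normalizer_enabled / normalize.
  static void runOnce(Admin admin) throws IOException {
    boolean wasEnabled = admin.normalizerSwitch(true);      // like 'normalizer_switch true'
    System.out.println("normalizer previously enabled: " + wasEnabled);
    System.out.println("normalizer enabled now: " + admin.isNormalizerEnabled());
    boolean ran = admin.normalize();                        // like the 'normalize' command
    System.out.println("normalization ran: " + ran);
  }
}
----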
The below example shows a normalization plan being computed for a user table, and a
merge action being taken as a result of the normalization plan computed by SimpleRegionNormalizer.

Consider a user table with some pre-split regions having 3 equally large regions
(about 100K rows) and 1 relatively small region (about 25K rows). The following is a
snippet from an hbase:meta table scan showing each of the pre-split regions for the table:
3915 table_p8ddpd6q5z,,1469494305548.68b9892220865cb6048 column=info:regioninfo, timestamp=1469494306375, value={ENCODED => 68b9892220865cb604809c950d1adf48, NAME => 'table_p8ddpd6q5z,,1469494305548.68b989222 09c950d1adf48. 0865cb604809c950d1adf48.', STARTKEY => '', ENDKEY => '1'}
3917 table_p8ddpd6q5z,1,1469494317178.867b77333bdc75a028 column=info:regioninfo, timestamp=1469494317848, value={ENCODED => 867b77333bdc75a028bb4c5e4b235f48, NAME => 'table_p8ddpd6q5z,1,1469494317178.867b7733 bb4c5e4b235f48. 3bdc75a028bb4c5e4b235f48.', STARTKEY => '1', ENDKEY => '3'}
3919 table_p8ddpd6q5z,3,1469494328323.98f019a753425e7977 column=info:regioninfo, timestamp=1469494328486, value={ENCODED => 98f019a753425e7977ab8636e32deeeb, NAME => 'table_p8ddpd6q5z,3,1469494328323.98f019a7 ab8636e32deeeb. 53425e7977ab8636e32deeeb.', STARTKEY => '3', ENDKEY => '7'}
3921 table_p8ddpd6q5z,7,1469494339662.94c64e748979ecbb16 column=info:regioninfo, timestamp=1469494339859, value={ENCODED => 94c64e748979ecbb166f6cc6550e25c6, NAME => 'table_p8ddpd6q5z,7,1469494339662.94c64e74 6f6cc6550e25c6. 8979ecbb166f6cc6550e25c6.', STARTKEY => '7', ENDKEY => '8'}
3923 table_p8ddpd6q5z,8,1469494339662.6d2b3f5fd1595ab8e7 column=info:regioninfo, timestamp=1469494339859, value={ENCODED => 6d2b3f5fd1595ab8e7c031876057b1ee, NAME => 'table_p8ddpd6q5z,8,1469494339662.6d2b3f5f c031876057b1ee. d1595ab8e7c031876057b1ee.', STARTKEY => '8', ENDKEY => ''}
Invoking the normalizer using `normalize` in the HBase shell, the below log snippet
from the HMaster log shows the normalization plan computed as per the logic defined for
SimpleRegionNormalizer. Since the total region size (in MB) for the adjacent smallest
regions in the table is less than the average region size, the normalizer computes a
plan to merge these two regions.
3932 2016-07-26 07:08:26,928 DEBUG [B.fifo.QRpcServer.handler=20,queue=2,port=20000] master.HMaster: Skipping normalization for table: hbase:namespace, as it's either system table or doesn't have auto
3933 normalization turned on
3934 2016-07-26 07:08:26,928 DEBUG [B.fifo.QRpcServer.handler=20,queue=2,port=20000] master.HMaster: Skipping normalization for table: hbase:backup, as it's either system table or doesn't have auto normalization turned on
3935 2016-07-26 07:08:26,928 DEBUG [B.fifo.QRpcServer.handler=20,queue=2,port=20000] master.HMaster: Skipping normalization for table: hbase:meta, as it's either system table or doesn't have auto normalization turned on
3936 2016-07-26 07:08:26,928 DEBUG [B.fifo.QRpcServer.handler=20,queue=2,port=20000] master.HMaster: Skipping normalization for table: table_h2osxu3wat, as it's either system table or doesn't have autonormalization turned on
3937 2016-07-26 07:08:26,928 DEBUG [B.fifo.QRpcServer.handler=20,queue=2,port=20000] normalizer.SimpleRegionNormalizer: Computing normalization plan for table: table_p8ddpd6q5z, number of regions: 5
3938 2016-07-26 07:08:26,929 DEBUG [B.fifo.QRpcServer.handler=20,queue=2,port=20000] normalizer.SimpleRegionNormalizer: Table table_p8ddpd6q5z, total aggregated regions size: 12
3939 2016-07-26 07:08:26,929 DEBUG [B.fifo.QRpcServer.handler=20,queue=2,port=20000] normalizer.SimpleRegionNormalizer: Table table_p8ddpd6q5z, average region size: 2.4
3940 2016-07-26 07:08:26,929 INFO [B.fifo.QRpcServer.handler=20,queue=2,port=20000] normalizer.SimpleRegionNormalizer: Table table_p8ddpd6q5z, small region size: 0 plus its neighbor size: 0, less thanthe avg size 2.4, merging them
3941 2016-07-26 07:08:26,971 INFO [B.fifo.QRpcServer.handler=20,queue=2,port=20000] normalizer.MergeNormalizationPlan: Executing merging normalization plan: MergeNormalizationPlan{firstRegion={ENCODED=> d51df2c58e9b525206b1325fd925a971, NAME => 'table_p8ddpd6q5z,,1469514755237.d51df2c58e9b525206b1325fd925a971.', STARTKEY => '', ENDKEY => '1'}, secondRegion={ENCODED => e69c6b25c7b9562d078d9ad3994f5330, NAME => 'table_p8ddpd6q5z,1,1469514767669.e69c6b25c7b9562d078d9ad3994f5330.',
3942 STARTKEY => '1', ENDKEY => '3'}}
As per its computed plan, the region normalizer merged the region with start key ''
and end key '1' with another region having start key '1' and end key '3'.
Now that these regions have been merged, we see a single new region with start key
'' and end key '3':
3949 table_p8ddpd6q5z,,1469516907210.e06c9b83c4a252b130e column=info:mergeA, timestamp=1469516907431,
3950 value=PBUF\x08\xA5\xD9\x9E\xAF\xE2*\x12\x1B\x0A\x07default\x12\x10table_p8ddpd6q5z\x1A\x00"\x011(\x000\x00 ea74d246741ba. 8\x00
3951 table_p8ddpd6q5z,,1469516907210.e06c9b83c4a252b130e column=info:mergeB, timestamp=1469516907431,
3952 value=PBUF\x08\xB5\xBA\x9F\xAF\xE2*\x12\x1B\x0A\x07default\x12\x10table_p8ddpd6q5z\x1A\x011"\x013(\x000\x0 ea74d246741ba. 08\x00
3953 table_p8ddpd6q5z,,1469516907210.e06c9b83c4a252b130e column=info:regioninfo, timestamp=1469516907431, value={ENCODED => e06c9b83c4a252b130eea74d246741ba, NAME => 'table_p8ddpd6q5z,,1469516907210.e06c9b83c ea74d246741ba. 4a252b130eea74d246741ba.', STARTKEY => '', ENDKEY => '3'}
3955 table_p8ddpd6q5z,3,1469514778736.bf024670a847c0adff column=info:regioninfo, timestamp=1469514779417, value={ENCODED => bf024670a847c0adffb74b2e13408b32, NAME => 'table_p8ddpd6q5z,3,1469514778736.bf024670 b74b2e13408b32. a847c0adffb74b2e13408b32.' STARTKEY => '3', ENDKEY => '7'}
3957 table_p8ddpd6q5z,7,1469514790152.7c5a67bc755e649db2 column=info:regioninfo, timestamp=1469514790312, value={ENCODED => 7c5a67bc755e649db22f49af6270f1e1, NAME => 'table_p8ddpd6q5z,7,1469514790152.7c5a67bc 2f49af6270f1e1. 755e649db22f49af6270f1e1.', STARTKEY => '7', ENDKEY => '8'}
3959 table_p8ddpd6q5z,8,1469514790152.58e7503cda69f98f47 column=info:regioninfo, timestamp=1469514790312, value={ENCODED => 58e7503cda69f98f4755178e74288c3a, NAME => 'table_p8ddpd6q5z,8,1469514790152.58e7503c 55178e74288c3a. da69f98f4755178e74288c3a.', STARTKEY => '8', ENDKEY => ''}
A similar example can be seen for a user table with 3 smaller regions and 1
relatively large region. For this example, we have a user table with 1 large region containing 100K rows, and 3 relatively smaller regions with about 33K rows each. As seen from the normalization plan, since the larger region is more than twice the average region size, it ends up being split into two regions: one with start key '1' and end key '154717', and the other region with start key '154717' and end key '3'.
3965 2016-07-26 07:39:45,636 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] master.HMaster: Skipping normalization for table: hbase:backup, as it's either system table or doesn't have auto normalization turned on
3966 2016-07-26 07:39:45,636 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SimpleRegionNormalizer: Computing normalization plan for table: table_p8ddpd6q5z, number of regions: 4
3967 2016-07-26 07:39:45,636 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SimpleRegionNormalizer: Table table_p8ddpd6q5z, total aggregated regions size: 12
3968 2016-07-26 07:39:45,636 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SimpleRegionNormalizer: Table table_p8ddpd6q5z, average region size: 3.0
3969 2016-07-26 07:39:45,636 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SimpleRegionNormalizer: No normalization needed, regions look good for table: table_p8ddpd6q5z
3970 2016-07-26 07:39:45,636 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SimpleRegionNormalizer: Computing normalization plan for table: table_h2osxu3wat, number of regions: 5
3971 2016-07-26 07:39:45,636 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SimpleRegionNormalizer: Table table_h2osxu3wat, total aggregated regions size: 7
3972 2016-07-26 07:39:45,636 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SimpleRegionNormalizer: Table table_h2osxu3wat, average region size: 1.4
3973 2016-07-26 07:39:45,636 INFO [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SimpleRegionNormalizer: Table table_h2osxu3wat, large region table_h2osxu3wat,1,1469515926544.27f2fdbb2b6612ea163eb6b40753c3db. has size 4, more than twice avg size, splitting
3974 2016-07-26 07:39:45,640 INFO [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SplitNormalizationPlan: Executing splitting normalization plan: SplitNormalizationPlan{regionInfo={ENCODED => 27f2fdbb2b6612ea163eb6b40753c3db, NAME => 'table_h2osxu3wat,1,1469515926544.27f2fdbb2b6612ea163eb6b40753c3db.', STARTKEY => '1', ENDKEY => '3'}, splitPoint=null}
3975 2016-07-26 07:39:45,656 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] master.HMaster: Skipping normalization for table: hbase:namespace, as it's either system table or doesn't have auto normalization turned on
3976 2016-07-26 07:39:45,656 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] master.HMaster: Skipping normalization for table: hbase:meta, as it's either system table or doesn't
3977 have auto normalization turned on …..…..….
3978 2016-07-26 07:39:46,246 DEBUG [AM.ZK.Worker-pool2-t278] master.RegionStates: Onlined 54de97dae764b864504704c1c8d3674a on hbase-test-rc-5.openstacklocal,16020,1469419333913 {ENCODED => 54de97dae764b864504704c1c8d3674a, NAME => 'table_h2osxu3wat,1,1469518785661.54de97dae764b864504704c1c8d3674a.', STARTKEY => '1', ENDKEY => '154717'}
3979 2016-07-26 07:39:46,246 INFO [AM.ZK.Worker-pool2-t278] master.RegionStates: Transition {d6b5625df331cfec84dce4f1122c567f state=SPLITTING_NEW, ts=1469518786246, server=hbase-test-rc-5.openstacklocal,16020,1469419333913} to {d6b5625df331cfec84dce4f1122c567f state=OPEN, ts=1469518786246,
3980 server=hbase-test-rc-5.openstacklocal,16020,1469419333913}
3981 2016-07-26 07:39:46,246 DEBUG [AM.ZK.Worker-pool2-t278] master.RegionStates: Onlined d6b5625df331cfec84dce4f1122c567f on hbase-test-rc-5.openstacklocal,16020,1469419333913 {ENCODED => d6b5625df331cfec84dce4f1122c567f, NAME => 'table_h2osxu3wat,154717,1469518785661.d6b5625df331cfec84dce4f1122c567f.', STARTKEY => '154717', ENDKEY => '3'}
3986 [[auto_reopen_regions]]
3987 == Auto Region Reopen
3989 We can leak store reader references if a coprocessor or core function somehow
3990 opens a scanner, or wraps one, and then does not take care to call close on the
3991 scanner or the wrapped instance. Leaked store files can not be removed even
3992 after it is invalidated via compaction.
3993 A reasonable mitigation for a reader reference
3994 leak would be a fast reopen of the region on the same server.
3995 This will release all resources, like the refcount, leases, etc.
The clients should gracefully ride over this like any other region in transition.
By default, this auto region reopen feature is disabled.
To enable it, please provide a high ref count value for the config
`hbase.regions.recovery.store.file.ref.count`.
4002 Please refer to config descriptions for
4003 `hbase.master.regions.recovery.check.interval` and
4004 `hbase.regions.recovery.store.file.ref.count`.