4 * Licensed to the Apache Software Foundation (ASF) under one
5 * or more contributor license agreements. See the NOTICE file
6 * distributed with this work for additional information
7 * regarding copyright ownership. The ASF licenses this file
8 * to you under the Apache License, Version 2.0 (the
9 * "License"); you may not use this file except in compliance
10 * with the License. You may obtain a copy of the License at
12 * http://www.apache.org/licenses/LICENSE-2.0
14 * Unless required by applicable law or agreed to in writing, software
15 * distributed under the License is distributed on an "AS IS" BASIS,
16 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
17 * See the License for the specific language governing permissions and
18 * limitations under the License.
23 = Apache HBase Operational Management
30 This chapter will cover operational tools and practices required of a running Apache HBase cluster.
31 The subject of operations is related to the topics of <<trouble>>, <<performance>>, and <<configuration>> but is a distinct topic in itself.
34 == HBase Tools and Utilities
36 HBase provides several tools for administration, analysis, and debugging of your cluster.
37 The entry-point to most of these tools is the _bin/hbase_ command, though some tools are available in the _dev-support/_ directory.
To see usage instructions for the _bin/hbase_ command, run it with no arguments, or with the `-h` argument.
The usage instructions below are from a recent HBase release; some commands, such as `version`, `pe`, `ltt`, and `clean`, are not available in earlier versions.
45 Usage: hbase [<options>] <command> [<args>]
--config DIR Configuration directory to use. Default: ./conf
48 --hosts HOSTS Override the list in 'regionservers' file
49 --auth-as-server Authenticate to ZooKeeper using servers configuration
52 Some commands take arguments. Pass no args or -h for usage.
53 shell Run the HBase shell
54 hbck Run the HBase 'fsck' tool. Defaults read-only hbck1.
55 Pass '-j /path/to/HBCK2.jar' to run hbase-2.x HBCK2.
56 snapshot Tool for managing snapshots
57 wal Write-ahead-log analyzer
58 hfile Store file analyzer
59 zkcli Run the ZooKeeper shell
60 master Run an HBase HMaster node
61 regionserver Run an HBase HRegionServer node
62 zookeeper Run a ZooKeeper server
63 rest Run an HBase REST server
64 thrift Run the HBase Thrift server
65 thrift2 Run the HBase Thrift2 server
66 clean Run the HBase clean up script
67 jshell Run a jshell with HBase on the classpath
68 classpath Dump hbase CLASSPATH
69 mapredcp Dump CLASSPATH entries required by mapreduce
70 pe Run PerformanceEvaluation
72 canary Run the Canary tool
73 version Print the version
74 backup Backup tables for recovery
75 restore Restore tables from existing backup image
76 regionsplitter Run RegionSplitter tool
77 rowcounter Run RowCounter tool
78 cellcounter Run CellCounter tool
79 CLASSNAME Run the class named CLASSNAME
82 Some of the tools and utilities below are Java classes which are passed directly to the _bin/hbase_ command, as referred to in the last line of the usage instructions.
83 Others, such as `hbase shell` (<<shell>>), `hbase upgrade` (<<upgrading>>), and `hbase thrift` (<<thrift>>), are documented elsewhere in this guide.
87 The Canary tool can help users "canary-test" the HBase cluster status.
The default "region mode" fetches a row from every column-family of every region.
89 In "regionserver mode", the Canary tool will fetch a row from a random
90 region on each of the cluster's RegionServers. In "zookeeper mode", the
91 Canary will read the root znode on each member of the zookeeper ensemble.
To see usage, pass the `-help` parameter (if you pass no
parameters, the Canary tool starts executing in the default
"region mode", fetching a row from every region in the cluster).
98 2018-10-16 13:11:27,037 INFO [main] tool.Canary: Execution thread count=16
99 Usage: canary [OPTIONS] [<TABLE1> [<TABLE2]...] | [<REGIONSERVER1> [<REGIONSERVER2]..]
101 -h,-help show this help and exit.
102 -regionserver set 'regionserver mode'; gets row from random region on server
103 -allRegions get from ALL regions when 'regionserver mode', not just random one.
104 -zookeeper set 'zookeeper mode'; grab zookeeper.znode.parent on each ensemble member
105 -daemon continuous check at defined intervals.
106 -interval <N> interval between checks in seconds
107 -e consider table/regionserver argument as regular expression
108 -f <B> exit on first error; default=true
109 -failureAsError treat read/write failure as error
110 -t <N> timeout for canary-test run; default=600000ms
111 -writeSniffing enable write sniffing
112 -writeTable the table used for write sniffing; default=hbase:canary
113 -writeTableTimeout <N> timeout for writeTable; default=600000ms
114 -readTableTimeouts <tableName>=<read timeout>,<tableName>=<read timeout>,...
115 comma-separated list of table read timeouts (no spaces);
116 logs 'ERROR' if takes longer. default=600000ms
117 -permittedZookeeperFailures <N> Ignore first N failures attempting to
118 connect to individual zookeeper nodes in ensemble
120 -D<configProperty>=<value> to assign or override configuration params
121 -Dhbase.canary.read.raw.enabled=<true/false> Set to enable/disable raw scan; default=false
123 Canary runs in one of three modes: region (default), regionserver, or zookeeper.
124 To sniff/probe all regions, pass no arguments.
125 To sniff/probe all regions of a table, pass tablename.
126 To sniff/probe regionservers, pass -regionserver, etc.
127 See http://hbase.apache.org/book.html#_canary for Canary documentation.
131 The `Sink` class is instantiated using the `hbase.canary.sink.class` configuration property.
This tool returns non-zero error codes so that it can be integrated with other monitoring tools,
such as Nagios. The error code definitions are:
138 private static final int USAGE_EXIT_CODE = 1;
139 private static final int INIT_ERROR_EXIT_CODE = 2;
140 private static final int TIMEOUT_ERROR_EXIT_CODE = 3;
141 private static final int ERROR_EXIT_CODE = 4;
142 private static final int FAILURE_EXIT_CODE = 5;
Here are some examples based on the following case: two tables, test-01
and test-02, each with two column families, cf1 and cf2 respectively, deployed on 3 RegionServers.
Following are some example outputs based on the case described above.
161 ==== Canary test for every column family (store) of every region of every table
164 $ ${HBASE_HOME}/bin/hbase canary
13/12/09 03:26:32 INFO tool.Canary: read from region test-01,,1386230156732.0e3c7d77ffb6361ea1b996ac1042ca9a. column family cf1 in 2ms
167 13/12/09 03:26:32 INFO tool.Canary: read from region test-01,,1386230156732.0e3c7d77ffb6361ea1b996ac1042ca9a. column family cf2 in 2ms
168 13/12/09 03:26:32 INFO tool.Canary: read from region test-01,0004883,1386230156732.87b55e03dfeade00f441125159f8ca87. column family cf1 in 4ms
169 13/12/09 03:26:32 INFO tool.Canary: read from region test-01,0004883,1386230156732.87b55e03dfeade00f441125159f8ca87. column family cf2 in 1ms
171 13/12/09 03:26:32 INFO tool.Canary: read from region test-02,,1386559511167.aa2951a86289281beee480f107bb36ee. column family cf1 in 5ms
172 13/12/09 03:26:32 INFO tool.Canary: read from region test-02,,1386559511167.aa2951a86289281beee480f107bb36ee. column family cf2 in 3ms
173 13/12/09 03:26:32 INFO tool.Canary: read from region test-02,0004883,1386559511167.cbda32d5e2e276520712d84eaaa29d84. column family cf1 in 31ms
174 13/12/09 03:26:32 INFO tool.Canary: read from region test-02,0004883,1386559511167.cbda32d5e2e276520712d84eaaa29d84. column family cf2 in 8ms
As you can see, table test-01 has two regions and two column families, so the Canary tool in the
default "region mode" will fetch 4 small pieces of data from 4 (2 regions * 2 stores) different stores.
This is the default behavior.
==== Canary test for every column family (store) of every region of specific tables
183 You can also test one or more specific tables by passing table names.
186 $ ${HBASE_HOME}/bin/hbase canary test-01 test-02
189 ==== Canary test with RegionServer granularity
191 In "regionserver mode", the Canary tool will pick one small piece of data
192 from each RegionServer (You can also pass one or more RegionServer names as arguments
193 to the canary-test when in "regionserver mode").
196 $ ${HBASE_HOME}/bin/hbase canary -regionserver
198 13/12/09 06:05:17 INFO tool.Canary: Read from table:test-01 on region server:rs2 in 72ms
199 13/12/09 06:05:17 INFO tool.Canary: Read from table:test-02 on region server:rs3 in 34ms
200 13/12/09 06:05:17 INFO tool.Canary: Read from table:test-01 on region server:rs1 in 56ms
203 ==== Canary test with regular expression pattern
You can pass regexes for table names when in "region mode", or for server names when
in "regionserver mode". The below will test both tables test-01 and test-02.
209 $ ${HBASE_HOME}/bin/hbase canary -e test-0[1-2]
212 ==== Run canary test as a "daemon"
Run repeatedly with an interval defined via the option `-interval` (default value is 60 seconds).
This daemon will stop itself and return a non-zero error code if any error occurs. To have
the daemon keep running across errors, pass the `-f` flag with its value set to false (see the usage above).
220 $ ${HBASE_HOME}/bin/hbase canary -daemon
223 To run repeatedly with 5 second intervals and not stop on errors, do the following.
226 $ ${HBASE_HOME}/bin/hbase canary -daemon -interval 5 -f false
==== Force timeout if the canary test is stuck
In some cases the request is stuck and no response is sent back to the client. This
can happen with dead RegionServers which the master has not yet noticed.
Because of this we provide a timeout option to kill the canary test and return a
non-zero error code. The below sets the timeout value to 60 seconds (the default value is 600 seconds).
238 $ ${HBASE_HOME}/bin/hbase canary -t 60000
241 ==== Enable write sniffing in canary
By default, the canary tool only checks read operations. To enable write sniffing,
you can run the canary with the `-writeSniffing` option set. When write sniffing is
enabled, the canary tool will create an hbase table and make sure the
regions of the table are distributed to all region servers. In each sniffing period,
the canary will try to put data to these regions to check the write availability of each region server.
250 $ ${HBASE_HOME}/bin/hbase canary -writeSniffing
253 The default write table is `hbase:canary` and can be specified with the option `-writeTable`.
255 $ ${HBASE_HOME}/bin/hbase canary -writeSniffing -writeTable ns:canary
The default size of each put value is 10 bytes. You can change it via the config key
`hbase.canary.write.value.size`.
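For example, assuming the generic `-D` override shown in the Canary usage above, the value size could be raised for a write-sniffing run like this (the 1024-byte value is illustrative):

[source,bash]
----
$ ${HBASE_HOME}/bin/hbase canary -Dhbase.canary.write.value.size=1024 -writeSniffing
----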
261 ==== Treat read / write failure as error
By default, the canary tool only logs read failures -- due to e.g. RetriesExhaustedException, etc. --
and will return the 'normal' exit code. To treat read/write failures as errors, you can run canary
with the `-treatFailureAsError` option. When enabled, read/write failures will result in an error exit code.
268 $ ${HBASE_HOME}/bin/hbase canary -treatFailureAsError
271 ==== Running Canary in a Kerberos-enabled Cluster
273 To run the Canary in a Kerberos-enabled cluster, configure the following two properties in _hbase-site.xml_:
275 * `hbase.client.keytab.file`
276 * `hbase.client.kerberos.principal`
278 Kerberos credentials are refreshed every 30 seconds when Canary runs in daemon mode.
280 To configure the DNS interface for the client, configure the following optional properties in _hbase-site.xml_.
282 * `hbase.client.dns.interface`
283 * `hbase.client.dns.nameserver`
285 .Canary in a Kerberos-Enabled Cluster
287 This example shows each of the properties with valid values.
<property>
  <name>hbase.client.kerberos.principal</name>
  <value>hbase/_HOST@YOUR-REALM.COM</value>
</property>
<property>
  <name>hbase.client.keytab.file</name>
  <value>/etc/hbase/conf/keytab.krb5</value>
</property>

<!-- optional params -->
<property>
  <name>hbase.client.dns.interface</name>
  <value>default</value>
</property>
<property>
  <name>hbase.client.dns.nameserver</name>
  <value>default</value>
</property>
314 usage: bin/hbase regionsplitter <TABLE> <SPLITALGORITHM>
315 SPLITALGORITHM is the java class name of a class implementing
316 SplitAlgorithm, or one of the special strings
317 HexStringSplit or DecimalStringSplit or
318 UniformSplit, which are built-in split algorithms.
319 HexStringSplit treats keys as hexadecimal ASCII, and
320 DecimalStringSplit treats keys as decimal ASCII, and
321 UniformSplit treats keys as arbitrary bytes.
-c <region count> Create a new table with a pre-split number of regions
324 -D <property=value> Override HBase Configuration Settings
325 -f <family:family:...> Column Families to create with new table.
327 --firstrow <arg> First Row in Table for Split Algorithm
328 -h Print this usage help
329 --lastrow <arg> Last Row in Table for Split Algorithm
-o <count> Max outstanding splits that have unfinished states
332 -r Perform a rolling split of an existing region
--risky Skip verification steps to complete
quickly. STRONGLY DISCOURAGED for production systems.
338 For additional detail, see <<manual_region_splitting_decisions>>.
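For example, to create a new table pre-split into 10 regions using the HexStringSplit algorithm (the table and column family names here are illustrative):

[source,bash]
----
$ bin/hbase regionsplitter test_table HexStringSplit -c 10 -f f1
----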
You can configure HBase to run a script periodically and, if it fails N times (configurable), have the server exit.
See _HBASE-7351 Periodic health check script_ for configuration details, and the sketch of the relevant properties below.
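The property names below come from HBASE-7351 and _hbase-default.xml_; the script path and values are illustrative, so confirm them against your release before relying on them.

[source,xml]
----
<property>
  <name>hbase.node.health.script.location</name>
  <value>/etc/hbase/healthcheck.sh</value>
</property>
<property>
  <name>hbase.node.health.script.frequency</name>
  <value>10000</value>
</property>
<property>
  <name>hbase.node.health.script.timeout</name>
  <value>60000</value>
</property>
<property>
  <name>hbase.node.health.failure.threshold</name>
  <value>3</value>
</property>
----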
348 Several frequently-accessed utilities are provided as `Driver` classes, and executed by the _bin/hbase_ command.
349 These utilities represent MapReduce jobs which run on your cluster.
350 They are run in the following way, replacing _UtilityName_ with the utility you want to run.
351 This command assumes you have set the environment variable `HBASE_HOME` to the directory where HBase is unpacked on your server.
355 ${HBASE_HOME}/bin/hbase org.apache.hadoop.hbase.mapreduce.UtilityName
358 The following utilities are available:
360 `LoadIncrementalHFiles`::
361 Complete a bulk data load.
`CopyTable`::
Export a table from the local cluster to a peer cluster.

`Export`::
Write table data to HDFS.

`Import`::
Import data written by a previous `Export` operation.

`ImportTsv`::
Import data in TSV format.

`RowCounter`::
Count rows in an HBase table.

`CellCounter`::
Count cells in an HBase table.
381 `replication.VerifyReplication`::
382 Compare the data from tables in two different clusters.
383 WARNING: It doesn't work for incrementColumnValues'd cells since the timestamp is changed.
384 Note that this command is in a different package than the others.
Each command except `RowCounter` and `CellCounter` accepts a single `--help` argument to print usage instructions.
391 The `hbck` tool that shipped with hbase-1.x has been made read-only in hbase-2.x. It is not able to repair
392 hbase-2.x clusters as hbase internals have changed. Nor should its assessments in read-only mode be
393 trusted as it does not understand hbase-2.x operation.
395 A new tool, <<HBCK2>>, described in the next section, replaces `hbck`.
`HBCK2` is the successor to <<hbck>>, the hbase-1.x fix tool (a.k.a `hbck1`). Use it in place of `hbck1`
for making repairs against hbase-2.x installs.
403 `HBCK2` does not ship as part of hbase. It can be found as a subproject of the companion
404 link:https://github.com/apache/hbase-operator-tools[hbase-operator-tools] repository at
405 link:https://github.com/apache/hbase-operator-tools/tree/master/hbase-hbck2[Apache HBase HBCK2 Tool].
406 `HBCK2` was moved out of hbase so it could evolve at a cadence apart from that of hbase core.
See the link:https://github.com/apache/hbase-operator-tools/tree/master/hbase-hbck2[HBCK2] Home Page
409 for how `HBCK2` differs from `hbck1`, and for how to build and use it.
411 Once built, you can run `HBCK2` as follows:
414 $ hbase hbck -j /path/to/HBCK2.jar
417 This will generate `HBCK2` usage describing commands and options.
425 For bulk replaying WAL files or _recovered.edits_ files, see
426 <<walplayer>>. For reading/verifying individual files, read on.
428 [[hlog_tool.prettyprint]]
429 ==== WALPrettyPrinter
The `WALPrettyPrinter` is a tool with configurable options to print the contents of a WAL
or a _recovered.edits_ file. You can invoke it via the HBase CLI with the `wal` command.
435 $ ./bin/hbase wal hdfs://example.org:8020/hbase/WALs/example.org,60020,1283516293161/10.10.21.10%3A60020.1283973724012
438 .WAL Printing in older versions of HBase
441 Prior to version 2.0, the `WALPrettyPrinter` was called the `HLogPrettyPrinter`, after an internal name for HBase's write ahead log.
442 In those versions, you can print the contents of a WAL using the same configuration as above, but with the 'hlog' command.
445 $ ./bin/hbase hlog hdfs://example.org:8020/hbase/.logs/example.org,60020,1283516293161/10.10.21.10%3A60020.1283973724012
452 See <<compression.test,compression.test>>.
CopyTable is a utility that can copy part or all of a table, either to the same cluster or another cluster.
458 The target table must first exist.
459 The usage is as follows:
463 $ ./bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --help
464 /bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --help
465 Usage: CopyTable [general options] [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] <tablename>
468 rs.class hbase.regionserver.class of the peer cluster,
469 specify if different from current cluster
470 rs.impl hbase.regionserver.impl of the peer cluster,
startrow the start row
stoprow the stop row
473 starttime beginning of the time range (unixtime in millis)
474 without endtime means from starttime to forever
475 endtime end of the time range. Ignored if no starttime specified.
476 versions number of cell versions to copy
477 new.name new table's name
478 peer.adr Address of the peer cluster given in the format
hbase.zookeeper.quorum:hbase.zookeeper.client.port:zookeeper.znode.parent
480 families comma-separated list of families to copy
481 To copy from cf1 to cf2, give sourceCfName:destCfName.
482 To keep the same name, just give "cfName"
483 all.cells also copy delete markers and deleted cells
486 tablename Name of the table to copy
489 To copy 'TestTable' to a cluster that uses replication for a 1 hour window:
490 $ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1265875194289 --endtime=1265878794289 --peer.adr=server1,server2,server3:2181:/hbase --families=myOldCf:myNewCf,cf2,cf3 TestTable
492 For performance consider the following general options:
493 It is recommended that you set the following to >=100. A higher value uses more memory but
494 decreases the round trip time to the server and may increase performance.
495 -Dhbase.client.scanner.caching=100
The following should always be set to false, to prevent writing data twice, which may produce
inaccurate results.
-Dmapred.map.tasks.speculative.execution=false
504 Caching for the input Scan is configured via `hbase.client.scanner.caching` in the job configuration.
By default, the CopyTable utility only copies the latest version of row cells unless `--versions=n` is explicitly specified in the command.
CopyTable does not perform a diff; it copies all Cells within the specified startrow/stoprow and starttime/endtime ranges.
This means that cells already present in the target table with the same values will still be copied.
520 See Jonathan Hsieh's link:https://blog.cloudera.com/blog/2012/06/online-hbase-backups-with-copytable-2/[Online
521 HBase Backups with CopyTable] blog post for more on `CopyTable`.
523 [[hashtable.synctable]]
524 === HashTable/SyncTable
HashTable/SyncTable is a two-step tool for synchronizing table data, where each step is implemented as a MapReduce job.
Similarly to CopyTable, it can be used for partial or entire table data syncing, on the same or a remote cluster.
However, it performs the sync more efficiently than CopyTable. Instead of copying all cells
in the specified row key/time period range, HashTable (the first step) creates hashed indexes for batches of cells on the source table and outputs those as results.
In the second step, SyncTable scans the target table, calculates hash indexes for its cells,
compares these hashes with the outputs of HashTable, and then scans (and compares) only the cells with diverging hashes, updating
just the mismatching cells. This results in less network traffic/data transfer, which matters when syncing large tables on remote clusters.
534 ==== Step 1, HashTable
536 First, run HashTable on the source table cluster (this is the table whose state will be copied to its counterpart).
541 $ ./bin/hbase org.apache.hadoop.hbase.mapreduce.HashTable --help
542 Usage: HashTable [options] <tablename> <outputpath>
545 batchsize the target amount of bytes to hash in each batch
546 rows are added to the batch until this size is reached
547 (defaults to 8000 bytes)
548 numhashfiles the number of hash files to create
549 if set to fewer than number of regions then
550 the job will create this number of reducers
551 (defaults to 1/100 of regions -- at least 1)
startrow the start row
stoprow the stop row
554 starttime beginning of the time range (unixtime in millis)
555 without endtime means from starttime to forever
556 endtime end of the time range. Ignored if no starttime specified.
557 scanbatch scanner batch size to support intra row scans
558 versions number of cell versions to include
559 families comma-separated list of families to include
560 ignoreTimestamps if true, ignores cell timestamps
563 tablename Name of the table to hash
564 outputpath Filesystem path to put the output data
567 To hash 'TestTable' in 32kB batches for a 1 hour window into 50 files:
568 $ bin/hbase org.apache.hadoop.hbase.mapreduce.HashTable --batchsize=32000 --numhashfiles=50 --starttime=1265875194289 --endtime=1265878794289 --families=cf2,cf3 TestTable /hashes/testTable
The *batchsize* property defines how much cell data for a given region will be hashed together in a single hash value.
Sizing this properly has a direct impact on sync efficiency, as it may lead to fewer scans executed by the mapper tasks
of SyncTable (the next step in the process). The rule of thumb is that the smaller the number of cells out of sync
(and thus the lower the probability of finding a diff), the larger the batch size value can be.
576 ==== Step 2, SyncTable
Once HashTable has completed on the source cluster, SyncTable can be run on the target cluster.
Just like replication and other synchronization jobs, it requires that all RegionServers/DataNodes
on the source cluster be accessible by NodeManagers on the target cluster (where the SyncTable job tasks will be running).
585 $ ./bin/hbase org.apache.hadoop.hbase.mapreduce.SyncTable --help
586 Usage: SyncTable [options] <sourcehashdir> <sourcetable> <targettable>
589 sourcezkcluster ZK cluster key of the source table
590 (defaults to cluster in classpath's config)
591 targetzkcluster ZK cluster key of the target table
592 (defaults to cluster in classpath's config)
593 dryrun if true, output counters but no writes
595 doDeletes if false, does not perform deletes
597 doPuts if false, does not perform puts
599 ignoreTimestamps if true, ignores cells timestamps while comparing
600 cell values. Any missing cell on target then gets
601 added with current time as timestamp
605 sourcehashdir path to HashTable output dir for source table
606 (see org.apache.hadoop.hbase.mapreduce.HashTable)
607 sourcetable Name of the source table to sync from
608 targettable Name of the target table to sync to
611 For a dry run SyncTable of tableA from a remote source cluster
612 to a local target cluster:
613 $ bin/hbase org.apache.hadoop.hbase.mapreduce.SyncTable --dryrun=true --sourcezkcluster=zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase hdfs://nn:9000/hashes/tableA tableA tableA
Cell comparison takes ROW/FAMILY/QUALIFIER/TIMESTAMP/VALUE into account for equality. When syncing at the target, missing cells will be
added with the original timestamp value from the source. That may cause unexpected results after SyncTable completes. For example, if missing
cells on the target have a delete marker with a timestamp T2 (say, a bulk delete performed by mistake), but the source cells' timestamps have an
older value T1, then those cells would still be unavailable at the target because of the newer delete marker timestamp. Since cell timestamps
might not be relevant to all use cases, the _ignoreTimestamps_ option adds the flexibility to avoid using cell timestamps in the comparison.
When setting _ignoreTimestamps_ to true, the option must be specified for both the HashTable and SyncTable steps.
The *dryrun* option is useful when a read-only diff report is wanted, as it will produce only COUNTERS indicating the differences, but will not perform
any actual changes. It can be used as an alternative to the VerifyReplication tool.
By default, SyncTable will cause the target table to become an exact copy of the source table (at least for the specified startrow/stoprow and/or starttime/endtime range).

Setting doDeletes to false modifies the default behaviour to not delete target cells that are missing on the source.
Similarly, setting doPuts to false modifies the default behaviour to not add missing cells to the target. Setting both doDeletes
and doPuts to false gives the same effect as setting dryrun to true.
633 .Additional info on doDeletes/doPuts
636 "doDeletes/doPuts" were only added by
637 link:https://jira.apache.org/jira/browse/HBASE-20305[HBASE-20305], so these may not be available on
638 all released versions.
639 For major 1.x versions, minimum minor release including it is *1.4.10*.
640 For major 2.x versions, minimum minor release including it is *2.1.5*.
643 .Additional info on ignoreTimestamps
646 "ignoreTimestamps" was only added by
647 link:https://issues.apache.org/jira/browse/HBASE-24302[HBASE-24302], so it may not be available on
648 all released versions.
649 For major 1.x versions, minimum minor release including it is *1.4.14*.
650 For major 2.x versions, minimum minor release including it is *2.2.5*.
653 .Set doDeletes to false on Two-Way Replication scenarios
On Two-Way Replication or other scenarios where both source and target clusters can have data ingested, it's advisable to always set the doDeletes option to false,
as any cell inserted on the SyncTable target cluster that has not yet been replicated to the source would be deleted, and potentially lost permanently (see the example below).
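For instance, a sketch of such a run, reusing the hypothetical cluster and table names from the dry-run example above:

[source,bash]
----
$ bin/hbase org.apache.hadoop.hbase.mapreduce.SyncTable --doDeletes=false --sourcezkcluster=zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase hdfs://nn:9000/hashes/tableA tableA tableA
----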
660 .Set sourcezkcluster to the actual source cluster ZK quorum
Although not required, if sourcezkcluster is not set, SyncTable will connect to the local HBase cluster for both source and target,
which does not give any meaningful result.
667 .Remote Clusters on different Kerberos Realms
670 Often, remote clusters may be deployed on different Kerberos Realms.
671 link:https://jira.apache.org/jira/browse/HBASE-20586[HBASE-20586] added SyncTable support for
672 cross realm authentication, allowing a SyncTable process running on target cluster to connect to
673 source cluster and read both HashTable output files and the given HBase table when performing the
674 required comparisons.
Export is a utility that will dump the contents of a table to HDFS in a sequence file.
The Export job can be run via a Coprocessor Endpoint or MapReduce. Invoke it via:
683 *mapreduce-based Export*
685 $ bin/hbase org.apache.hadoop.hbase.mapreduce.Export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]
688 *endpoint-based Export*
690 NOTE: Make sure the Export coprocessor is enabled by adding `org.apache.hadoop.hbase.coprocessor.Export` to `hbase.coprocessor.region.classes`.
692 $ bin/hbase org.apache.hadoop.hbase.coprocessor.Export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]
The outputdir is an HDFS directory that does not exist prior to the export. When done, the exported files will be owned by the user invoking the export command.
696 *The Comparison of Endpoint-based Export And Mapreduce-based Export*
[cols="1,1,1", options="header"]
|===
||Endpoint-based Export|Mapreduce-based Export

|HBase version requirement
|2.0+
|0.2.1+

|Maven dependency
|hbase-endpoint
|hbase-mapreduce (2.0+), hbase-server(prior to 2.0)

|Requirement before dump
|mount the endpoint.Export on the target table
|deploy the MapReduce framework

|Read latency
|low, directly read the data from region
|normal, traditional RPC scan

|Read Scalability
|depend on number of regions
|depend on number of mappers (see TableInputFormatBase#getSplits)

|Timeout
|operation timeout. configured by hbase.client.operation.timeout
|scan timeout. configured by hbase.client.scanner.timeout.period

|Permission requirement
|READ, EXECUTE
|READ
|===
734 NOTE: To see usage instructions, run the command with no options. Available options include
735 specifying column families and applying filters during the export.
737 By default, the `Export` tool only exports the newest version of a given cell, regardless of the number of versions stored. To export more than one version, replace *_<versions>_* with the desired number of versions.
For MapReduce-based Export, if you want to export cell tags, set the config property
`hbase.client.rpc.codec` to `org.apache.hadoop.hbase.codec.KeyValueCodecWithTags`, as in the example below.
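A sketch of one way to pass that property on the command line, assuming your version of the MapReduce-based Export accepts Hadoop generic `-D` options before the positional arguments:

[source,bash]
----
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export -D hbase.client.rpc.codec=org.apache.hadoop.hbase.codec.KeyValueCodecWithTags <tablename> <outputdir>
----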
742 Note: caching for the input Scan is configured via `hbase.client.scanner.caching` in the job configuration.
747 Import is a utility that will load data that has been exported back into HBase.
751 $ bin/hbase org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>
754 NOTE: To see usage instructions, run the command with no options.
756 To import 0.94 exported files in a 0.96 cluster or onwards, you need to set system property "hbase.import.version" when running the import command as below:
759 $ bin/hbase -Dhbase.import.version=0.94 org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>
If you want to import cell tags, set the config property
`hbase.client.rpc.codec` to `org.apache.hadoop.hbase.codec.KeyValueCodecWithTags`.
768 ImportTsv is a utility that will load data in TSV format into HBase.
It has two distinct usages: loading data from TSV format in HDFS into HBase via Puts, and preparing StoreFiles to be loaded via the `completebulkload` utility.
771 To load data via Puts (i.e., non-bulk loading):
774 $ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c <tablename> <hdfs-inputdir>
777 To generate StoreFiles for bulk-loading:
781 $ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c -Dimporttsv.bulk.output=hdfs://storefile-outputdir <tablename> <hdfs-data-inputdir>
784 These generated StoreFiles can be loaded into HBase via <<completebulkload,completebulkload>>.
786 [[importtsv.options]]
787 ==== ImportTsv Options
789 Running `ImportTsv` with no arguments prints brief usage information:
793 Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>
795 Imports the given input directory of TSV data into the specified table.
797 The column names of the TSV data must be specified using the -Dimporttsv.columns
798 option. This option takes the form of comma-separated column names, where each
799 column name is either a simple column family, or a columnfamily:qualifier. The special
800 column name HBASE_ROW_KEY is used to designate that this column should be used
as the row key for each imported record. You must specify exactly one column
to be the row key, and you must specify a column name for every column that exists in the
input data.
805 By default importtsv will load data directly into HBase. To instead generate
806 HFiles of data to prepare for a bulk data load, pass the option:
807 -Dimporttsv.bulk.output=/path/for/output
808 Note: the target table will be created with default column family descriptors if it does not already exist.
810 Other options that may be specified with -D include:
811 -Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
812 '-Dimporttsv.separator=|' - eg separate on pipes instead of tabs
813 -Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
814 -Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of org.apache.hadoop.hbase.mapreduce.TsvImporterMapper
817 [[importtsv.example]]
818 ==== ImportTsv Example
820 For example, assume that we are loading data into a table called 'datatsv' with a ColumnFamily called 'd' with two columns "c1" and "c2".
822 Assume that an input file exists as follows:
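(The sample below is illustrative; columns are separated by tabs, with the row key first and then the values destined for d:c1 and d:c2.)

----
row1	value1a	value1b
row2	value2a	value2b
row3	value3a	value3b
row4	value4a	value4b
----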
837 For ImportTsv to use this input file, the command line needs to look like this:
841 HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-mapreduce-VERSION.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,d:c1,d:c2 -Dimporttsv.bulk.output=hdfs://storefileoutput datatsv hdfs://inputfile
844 \... and in this example the first column is the rowkey, which is why the HBASE_ROW_KEY is used.
845 The second and third columns in the file will be imported as "d:c1" and "d:c2", respectively.
847 [[importtsv.warning]]
848 ==== ImportTsv Warning
If you are preparing a lot of data for bulk loading, make sure the target HBase table is pre-split appropriately.
For more information about bulk-loading HFiles into HBase, see <<arch.bulk.load,arch.bulk.load>>.
860 The `completebulkload` utility will move generated StoreFiles into an HBase table.
861 This utility is often used in conjunction with output from <<importtsv,importtsv>>.
863 There are two ways to invoke this utility, with explicit classname and via the driver:
867 $ bin/hbase org.apache.hadoop.hbase.tool.LoadIncrementalHFiles <hdfs://storefileoutput> <tablename>
872 HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-mapreduce-VERSION.jar completebulkload <hdfs://storefileoutput> <tablename>
875 [[completebulkload.warning]]
876 ==== CompleteBulkLoad Warning
878 Data generated via MapReduce is often created with file permissions that are not compatible with the running HBase process.
879 Assuming you're running HDFS with permissions enabled, those permissions will need to be updated before you run CompleteBulkLoad.
881 For more information about bulk-loading HFiles into HBase, see <<arch.bulk.load,arch.bulk.load>>.
886 WALPlayer is a utility to replay WAL files into HBase.
888 The WAL can be replayed for a set of tables or all tables, and a timerange can be provided (in milliseconds). The WAL is filtered to this set of tables.
889 The output can optionally be mapped to another set of tables.
891 WALPlayer can also generate HFiles for later bulk importing, in that case only a single table and no mapping can be specified.
Finally, you can use WALPlayer to replay the content of a Region's _recovered.edits_ directory (the files under a
_recovered.edits_ directory have the same format as WAL files).
To read or verify single WAL files or _recovered.edits_ files, since they share the WAL format, see <<hlog_tool.prettyprint>>.
906 $ bin/hbase org.apache.hadoop.hbase.mapreduce.WALPlayer [options] <WAL inputdir> [<tables> <tableMappings>]>
912 $ bin/hbase org.apache.hadoop.hbase.mapreduce.WALPlayer /backuplogdir oldTable1,oldTable2 newTable1,newTable2
WALPlayer, by default, runs as a mapreduce job.
To NOT run WALPlayer as a mapreduce job on your cluster, force it to run all in the local process by adding the flag `-Dmapreduce.jobtracker.address=local` on the command line, as in the example below.
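For example, a local-process sketch of the earlier replay command (directory and table names are the same hypothetical ones used above):

[source,bash]
----
$ bin/hbase org.apache.hadoop.hbase.mapreduce.WALPlayer -Dmapreduce.jobtracker.address=local /backuplogdir oldTable1,oldTable2 newTable1,newTable2
----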
918 [[walplayer.options]]
919 ==== WALPlayer Options
921 Running `WALPlayer` with no arguments prints brief usage information:
924 Usage: WALPlayer [options] <WAL inputdir> [<tables> <tableMappings>]
925 <WAL inputdir> directory of WALs to replay.
926 <tables> comma separated list of tables. If no tables specified,
927 all are imported (even hbase:meta if present).
928 <tableMappings> WAL entries can be mapped to a new set of tables by passing
929 <tableMappings>, a comma separated list of target tables.
930 If specified, each table in <tables> must have a mapping.
931 To generate HFiles to bulk load instead of loading HBase directly, pass:
932 -Dwal.bulk.output=/path/for/output
933 Only one table can be specified, and no mapping allowed!
934 To specify a time range, pass:
935 -Dwal.start.time=[date|ms]
936 -Dwal.end.time=[date|ms]
937 The start and the end date of timerange (inclusive). The dates can be
938 expressed in milliseconds-since-epoch or yyyy-MM-dd'T'HH:mm:ss.SS format.
939 E.g. 1234567890120 or 2009-02-13T23:32:30.12
941 -Dmapreduce.job.name=jobName
942 Use the specified mapreduce job name for the wal player
943 -Dwal.input.separator=' '
944 Change WAL filename separator (WAL dir names use default ','.)
945 For performance also consider the following options:
946 -Dmapreduce.map.speculative=false
947 -Dmapreduce.reduce.speculative=false
953 link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html[RowCounter] is a mapreduce job to count all the rows of a table.
954 This is a good utility to use as a sanity check to ensure that HBase can read all the blocks of a table if there are any concerns of metadata inconsistency.
955 It will run the mapreduce all in a single process but it will run faster if you have a MapReduce cluster in place for it to exploit.
956 It is possible to limit the time range of data to be scanned by using the `--starttime=[starttime]` and `--endtime=[endtime]` flags.
957 The scanned data can be limited based on keys using the `--range=[startKey],[endKey][;[startKey],[endKey]...]` option.
960 $ bin/hbase rowcounter [options] <tablename> [--starttime=<start> --endtime=<end>] [--range=[startKey],[endKey][;[startKey],[endKey]...]] [<column1> <column2>...]
963 RowCounter only counts one version per cell.
For performance, consider using the `-Dhbase.client.scanner.caching=100` and `-Dmapreduce.map.speculative=false` options.
970 HBase ships another diagnostic mapreduce job called link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/CellCounter.html[CellCounter].
Like RowCounter, it scans the table, but the statistics it gathers are more fine-grained and include:
974 * Total number of rows in the table.
975 * Total number of CFs across all rows.
976 * Total qualifiers across all rows.
977 * Total occurrence of each CF.
978 * Total occurrence of each qualifier.
979 * Total number of versions of each qualifier.
981 The program allows you to limit the scope of the run.
982 Provide a row regex or prefix to limit the rows to analyze.
983 Specify a time range to scan the table by using the `--starttime=<starttime>` and `--endtime=<endtime>` flags.
985 Use `hbase.mapreduce.scan.column.family` to specify scanning a single column family.
988 $ bin/hbase cellcounter <tablename> <outputDir> [reportSeparator] [regex or prefix] [--starttime=<starttime> --endtime=<endtime>]
991 Note: just like RowCounter, caching for the input Scan is configured via `hbase.client.scanner.caching` in the job configuration.
It is possible to optionally pin your servers in physical memory, making them less likely to be swapped out in oversubscribed environments, by having the servers call link:http://linux.die.net/man/2/mlockall[mlockall] on startup.
996 See link:https://issues.apache.org/jira/browse/HBASE-4391[HBASE-4391 Add ability to start RS as root and call mlockall] for how to build the optional library and have it run on startup.
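A minimal sketch of what this looks like in _conf/hbase-env.sh_; the variable names follow HBASE-4391 and recent hbase-env.sh templates, and the UID value is illustrative, so verify against your release:

[source,bash]
----
# Pin the RegionServer's memory with mlockall on startup (requires the native library
# from HBASE-4391 and starting the process as root so it can drop to the given UID).
export HBASE_REGIONSERVER_MLOCK=true
export HBASE_REGIONSERVER_UID="hbase"
----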
999 === Offline Compaction Tool
*CompactionTool* provides a way of running compactions (either minor or major) as an independent
process from the RegionServer. It reuses the same internal implementation classes executed by the RegionServer
compaction feature. However, since this runs on a completely separate, independent java process, it
relieves RegionServers from the overhead involved in rewriting a set of hfiles, which can be critical
for latency-sensitive use cases.
1009 $ ./bin/hbase org.apache.hadoop.hbase.regionserver.CompactionTool
1011 Usage: java org.apache.hadoop.hbase.regionserver.CompactionTool \
1012 [-compactOnce] [-major] [-mapred] [-D<property=value>]* files...
1015 mapred Use MapReduce to run compaction.
1016 compactOnce Execute just one compaction step. (default: while needed)
1017 major Trigger major compaction.
1019 Note: -D properties will be applied to the conf used.
1021 To stop delete of compacted file, pass -Dhbase.compactiontool.delete=false
1022 To set tmp dir, pass -Dhbase.tmp.dir=ALTERNATE_DIR
1025 To compact the full 'TestTable' using MapReduce:
1026 $ hbase org.apache.hadoop.hbase.regionserver.CompactionTool -mapred hdfs://hbase/data/default/TestTable
1028 To compact column family 'x' of the table 'TestTable' region 'abc':
1029 $ hbase org.apache.hadoop.hbase.regionserver.CompactionTool hdfs://hbase/data/default/TestTable/abc/x
As shown by the usage options above, *CompactionTool* can run as a standalone client or as a mapreduce job.
When running as a mapreduce job, each family dir is handled as an input split, and is processed
by a separate map task.
The *compactOnce* parameter controls how many compaction cycles will be performed until
*CompactionTool* decides to finish its work. If omitted, it will keep running
compactions on each specified family until the configured compaction policy determines that no further
compaction is needed. For more info on compaction policy, see <<compaction,compaction>>.
If a major compaction is desired, the *-major* flag can be specified. If omitted, *CompactionTool* will
assume a minor compaction is wanted by default.
It also allows for configuration overrides with the `-D` flag. In the usage section above, for example,
the `-Dhbase.compactiontool.delete=false` option will instruct the compaction engine not to delete the original
files from the temp folder.
Files targeted for compaction must be specified as parent hdfs dirs. Multiple dirs may be
passed, as long as each of these dirs is either a *family*, a *region*, or a *table* dir. If a
table or region dir is passed, the program will recursively iterate through related sub-folders,
effectively running compaction for each family found below the table/region level.
Since these dirs are nested under the *hbase* hdfs directory tree, *CompactionTool* requires hbase superuser
permissions in order to access the required hfiles.
1056 .Running in MapReduce mode
MapReduce mode offers the ability to process each family dir in parallel, as a separate map task.
Generally, it would make sense to run in this mode when specifying one or more table dirs as targets
for compactions. The caveat, though, is that if the number of families to be compacted becomes too large,
the related mapreduce job may have an indirect impact on *RegionServers* performance.
Since *NodeManagers* are normally co-located with RegionServers, such large jobs could
compete for IO/Bandwidth resources with the *RegionServers*.
.MajorCompaction completely disabled on RegionServers due to performance impacts
*Major compactions* can be a costly operation (see <<compaction,compaction>>), and can indeed
impact performance on RegionServers, leading operators to completely disable it for critical
low-latency applications. *CompactionTool* can be used as an alternative in such scenarios,
although additional custom application logic would need to be implemented, such as deciding the
scheduling and selection of tables/regions/families targeted for a given compaction run.
1077 For additional details about CompactionTool, see also
1078 link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/regionserver/CompactionTool.html[CompactionTool].
1082 The `hbase clean` command cleans HBase data from ZooKeeper, HDFS, or both.
1083 It is appropriate to use for testing.
1084 Run it with no options for usage instructions.
1085 The `hbase clean` command was introduced in HBase 0.98.
1090 Usage: hbase clean (--cleanZk|--cleanHdfs|--cleanAll)
1092 --cleanZk cleans hbase related data from zookeeper.
1093 --cleanHdfs cleans hbase related data from hdfs.
1094 --cleanAll cleans hbase related data from both zookeeper and hdfs.
1099 The `hbase pe` command runs the PerformanceEvaluation tool, which is used for testing.
1101 The PerformanceEvaluation tool accepts many different options and commands.
1102 For usage instructions, run the command with no options.
1104 The PerformanceEvaluation tool has received many updates in recent HBase releases, including support for namespaces, support for tags, cell-level ACLs and visibility labels, multiget support for RPC calls, increased sampling sizes, an option to randomly sleep during testing, and ability to "warm up" the cluster before testing starts.
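As a hedged illustration of the kind of invocation PerformanceEvaluation accepts (the flags and counts below are assumptions to adapt, not recommendations), a small local write test might look like:

[source,bash]
----
$ bin/hbase pe --nomapred --rows=100000 sequentialWrite 1
----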
1108 The `hbase ltt` command runs the LoadTestTool utility, which is used for testing.
1110 You must specify either `-init_only` or at least one of `-write`, `-update`, or `-read`.
1111 For general usage instructions, pass the `-h` option.
1113 The LoadTestTool has received many updates in recent HBase releases, including support for namespaces, support for tags, cell-level ACLS and visibility labels, testing security-related features, ability to specify the number of regions per server, tests for multi-get RPC calls, and tests relating to replication.
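A minimal sketch of a LoadTestTool run, with illustrative numbers (3 columns per key averaging 1024 bytes on 10 writer threads, 100000 keys, reads verifying 100% of data on 10 threads) against an assumed table named `load_test`:

[source,bash]
----
$ bin/hbase ltt -tn load_test -write 3:1024:10 -read 100:10 -num_keys 100000
----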
1116 === Pre-Upgrade validator
The Pre-Upgrade validator tool can be used to check the cluster for known incompatibilities before upgrading from HBase 1 to HBase 2.
1121 $ bin/hbase pre-upgrade command ...
1124 ==== Coprocessor validation
HBase has supported coprocessors for a long time, but the coprocessor API can change between major releases. The coprocessor validator tries to determine
whether your old coprocessors are still compatible with the current HBase version.
1131 $ bin/hbase pre-upgrade validate-cp [-jar ...] [-class ... | -table ... | -config]
1133 -e Treat warnings as errors.
1134 -jar <arg> Jar file/directory of the coprocessor.
1135 -table <arg> Table coprocessor(s) to check.
1136 -class <arg> Coprocessor class(es) to check.
1137 -config Scan jar for observers.
The coprocessor classes can be explicitly declared with the `-class` option, or they can be obtained from the HBase configuration with the `-config` option.
Table-level coprocessors can also be checked with the `-table` option. The tool searches for coprocessors on its classpath, but the classpath can be extended
with the `-jar` option. It is possible to test multiple classes with multiple `-class` options, multiple tables with multiple `-table` options, as well as
to add multiple jars to the classpath with multiple `-jar` options.
The tool can report errors and warnings. Errors mean that HBase won't be able to load the coprocessor, because it is incompatible with the current version
of HBase. Warnings mean that the coprocessors can be loaded, but they won't work as expected. If the `-e` option is given, then the tool will also fail
for warnings.

Please note that this tool cannot validate every aspect of jar files; it just does some static checks.
1155 $ bin/hbase pre-upgrade validate-cp -jar my-coprocessor.jar -class MyMasterObserver -class MyRegionObserver
It validates the `MyMasterObserver` and `MyRegionObserver` classes, which are located in `my-coprocessor.jar`.
1162 $ bin/hbase pre-upgrade validate-cp -table .*
It validates every table-level coprocessor for tables whose name matches the `.*` regular expression.
1167 ==== DataBlockEncoding validation
1168 HBase 2.0 removed `PREFIX_TREE` Data Block Encoding from column families. For further information
1169 please check <<upgrade2.0.prefix-tree.removed,_prefix-tree_ encoding removed>>.
To verify that none of the column families are using incompatible Data Block Encodings in the cluster, run the following command.
1174 $ bin/hbase pre-upgrade validate-dbe
This check validates all column families and prints out any incompatibilities. For example:
1180 2018-07-13 09:58:32,028 WARN [main] tool.DataBlockEncodingValidator: Incompatible DataBlockEncoding for table: t, cf: f, encoding: PREFIX_TREE
This means that the Data Block Encoding of table `t`, column family `f` is incompatible. To fix it, use the `alter` command in the HBase shell:
1186 alter 't', { NAME => 'f', DATA_BLOCK_ENCODING => 'FAST_DIFF' }
1189 Please also validate HFiles, which is described in the next section.
1191 ==== HFile Content validation
Even though the Data Block Encoding has been changed from `PREFIX_TREE`, it is still possible to have HFiles that contain data encoded that way.
To verify that HFiles are readable with HBase 2, please use the _HFile content validator_.
1197 $ bin/hbase pre-upgrade validate-hfile
1200 The tool will log the corrupt HFiles and details about the root cause.
If the problem is about PREFIX_TREE encoding, it is necessary to change encodings before upgrading to HBase 2.
1203 The following log message shows an example of incorrect HFiles.
1206 2018-06-05 16:20:46,976 WARN [hfilevalidator-pool1-t3] hbck.HFileCorruptionChecker: Found corrupt HFile hdfs://example.com:8020/hbase/data/default/t/72ea7f7d625ee30f959897d1a3e2c350/prefix/7e6b3d73263c4851bf2b8590a9b3791e
1207 org.apache.hadoop.hbase.io.hfile.CorruptHFileException: Problem reading HFile Trailer from file hdfs://example.com:8020/hbase/data/default/t/72ea7f7d625ee30f959897d1a3e2c350/prefix/7e6b3d73263c4851bf2b8590a9b3791e
1209 Caused by: java.io.IOException: Invalid data block encoding type in file info: PREFIX_TREE
1211 Caused by: java.lang.IllegalArgumentException: No enum constant org.apache.hadoop.hbase.io.encoding.DataBlockEncoding.PREFIX_TREE
1213 2018-06-05 16:20:47,322 INFO [main] tool.HFileContentValidator: Corrupted file: hdfs://example.com:8020/hbase/data/default/t/72ea7f7d625ee30f959897d1a3e2c350/prefix/7e6b3d73263c4851bf2b8590a9b3791e
1214 2018-06-05 16:20:47,383 INFO [main] tool.HFileContentValidator: Corrupted file: hdfs://example.com:8020/hbase/archive/data/default/t/56be41796340b757eb7fff1eb5e2a905/f/29c641ae91c34fc3bee881f45436b6d1
1217 ===== Fixing PREFIX_TREE errors
It's possible to get `PREFIX_TREE` errors after changing the Data Block Encoding to a supported one. It can happen
because there are some HFiles which are still encoded with `PREFIX_TREE`, or because snapshots still reference such HFiles.
1222 For fixing HFiles, please run a major compaction on the table (it was `default:t` according to the log message):
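In the HBase shell, that is (the table name here matches the log message above; adjust it to your own table):

----
major_compact 't'
----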
HFiles can be referenced from snapshots, too. This is the case when the HFile is located under `archive/data`.
The first step is to determine which snapshot references that HFile (the name of the file was `29c641ae91c34fc3bee881f45436b6d1`
according to the logs):
for snapshot in $(hbase snapshotinfo -list-snapshots 2> /dev/null | tail -n -1 | cut -f 1 -d \|);
do
  echo "checking snapshot named '${snapshot}'";
  hbase snapshotinfo -snapshot "${snapshot}" -files 2> /dev/null | grep 29c641ae91c34fc3bee881f45436b6d1;
done
1241 The output of this shell script is:
1244 checking snapshot named 't_snap'
1245 1.0 K t/56be41796340b757eb7fff1eb5e2a905/f/29c641ae91c34fc3bee881f45436b6d1 (archive)
This means that the `t_snap` snapshot references the incompatible HFile. If the snapshot is still needed,
then it has to be recreated with the HBase shell:
1252 # creating a new namespace for the cleanup process
1253 create_namespace 'pre_upgrade_cleanup'
1255 # creating a new snapshot
1256 clone_snapshot 't_snap', 'pre_upgrade_cleanup:t'
1257 alter 'pre_upgrade_cleanup:t', { NAME => 'f', DATA_BLOCK_ENCODING => 'FAST_DIFF' }
1258 major_compact 'pre_upgrade_cleanup:t'
1260 # removing the invalid snapshot
1261 delete_snapshot 't_snap'
1263 # creating a new snapshot
1264 snapshot 'pre_upgrade_cleanup:t', 't_snap'
1266 # removing temporary table
1267 disable 'pre_upgrade_cleanup:t'
1268 drop 'pre_upgrade_cleanup:t'
1269 drop_namespace 'pre_upgrade_cleanup'
1272 For further information, please refer to
1273 link:https://issues.apache.org/jira/browse/HBASE-20649?focusedCommentId=16535476#comment-16535476[HBASE-20649].
1275 === Data Block Encoding Tool
Tests various compression algorithms with different data block encoders for key compression on an existing HFile.
Useful for testing, debugging and benchmarking.
1280 You must specify `-f` which is the full path of the HFile.
1282 The result shows both the performance (MB/s) of compression/decompression and encoding/decoding, and the data savings on the HFile.
1286 $ bin/hbase org.apache.hadoop.hbase.regionserver.DataBlockEncodingTool
1287 Usages: hbase org.apache.hadoop.hbase.regionserver.DataBlockEncodingTool
1289 -f HFile to analyse (REQUIRED)
1290 -n Maximum number of key/value pairs to process in a single benchmark run.
1291 -b Whether to run a benchmark to measure read throughput.
1292 -c If this is specified, no correctness testing will be done.
1293 -a What kind of compression algorithm use for test. Default value: GZ.
1294 -t Number of times to run each benchmark. Default value: 12.
1295 -omit Number of first runs of every benchmark to omit from statistics. Default value: 2.
1300 == Region Management
1302 [[ops.regionmgt.majorcompact]]
1303 === Major Compaction
1305 Major compactions can be requested via the HBase shell or link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Admin.html#majorCompact-org.apache.hadoop.hbase.TableName-[Admin.majorCompact].
1307 Note: major compactions do NOT do region merges.
1308 See <<compaction,compaction>> for more information about compactions.
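For example, in the HBase shell, a whole table or a single column family can be major compacted (table and family names here are illustrative):

----
hbase> major_compact 't1'
hbase> major_compact 't1', 'c1'
----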
1310 [[ops.regionmgt.merge]]
1313 Merge is a utility that can merge adjoining regions in the same table (see org.apache.hadoop.hbase.util.Merge).
1317 $ bin/hbase org.apache.hadoop.hbase.util.Merge <tablename> <region1> <region2>
1320 If you feel you have too many regions and want to consolidate them, Merge is the utility you need.
Merge must be run when the cluster is down.
1322 See the link:https://web.archive.org/web/20111231002503/http://ofps.oreilly.com/titles/9781449396107/performance.html[O'Reilly HBase
1323 Book] for an example of usage.
1325 You will need to pass 3 parameters to this application.
1326 The first one is the table name.
1327 The second one is the fully qualified name of the first region to merge, like "table_name,\x0A,1342956111995.7cef47f192318ba7ccc75b1bbf27a82b.". The third one is the fully qualified name for the second region to merge.
1329 Additionally, there is a Ruby script attached to link:https://issues.apache.org/jira/browse/HBASE-1621[HBASE-1621] for region merging.
1335 === Node Decommission
1337 You can stop an individual RegionServer by running the following script in the HBase directory on the particular node:
1340 $ ./bin/hbase-daemon.sh stop regionserver
1343 The RegionServer will first close all regions and then shut itself down.
1344 On shutdown, the RegionServer's ephemeral node in ZooKeeper will expire.
The master will notice the RegionServer gone and will treat it as a 'crashed' server; it will reassign the regions the RegionServer was carrying.
1347 .Disable the Load Balancer before Decommissioning a node
1350 If the load balancer runs while a node is shutting down, then there could be contention between the Load Balancer and the Master's recovery of the just decommissioned RegionServer.
1351 Avoid any problems by disabling the balancer first.
1352 See <<lb,lb>> below.
1358 In hbase-2.0, in the bin directory, we added a script named _considerAsDead.sh_ that can be used to kill a regionserver.
1359 Hardware issues could be detected by specialized monitoring tools before the zookeeper timeout has expired. _considerAsDead.sh_ is a simple function to mark a RegionServer as dead.
1360 It deletes all the znodes of the server, starting the recovery process.
1361 Plug in the script into your monitoring/fault detection tools to initiate faster failover.
1362 Be careful how you use this disruptive tool.
1363 Copy the script if you need to make use of it in a version of hbase previous to hbase-2.0.
1366 A downside to the above stop of a RegionServer is that regions could be offline for a good period of time.
1367 Regions are closed in order.
If there are many regions on the server, the first region to close may not be back online until all regions close and
1369 after the master notices the RegionServer's znode gone. A node can be asked to gradually shed its load and
1370 then shutdown itself using the _graceful_stop.sh_ script. Here is its usage:
1373 $ ./bin/graceful_stop.sh
1374 Usage: graceful_stop.sh [--config <conf-dir>] [-e] [--restart [--reload]] [--thrift] [--rest] [-n |--noack] [--maxthreads <number of threads>] [--movetimeout <timeout in seconds>] [-nob |--nobalancer] [-d |--designatedfile <file path>] [-x |--excludefile <file path>] <hostname>
1375 thrift If we should stop/start thrift before/after the hbase stop/start
1376 rest If we should stop/start rest before/after the hbase stop/start
1377 restart If we should restart after graceful stop
1378 reload Move offloaded regions back on to the restarted server
1379 n|noack Enable noAck mode in RegionMover. This is a best effort mode for moving regions
1380 maxthreads xx Limit the number of threads used by the region mover. Default value is 1.
1381 movetimeout xx Timeout for moving regions. If regions are not moved by the timeout value,exit with error. Default value is INT_MAX.
1382 hostname Hostname of server we are to stop
1383 e|failfast Set -e so exit immediately if any command exits with non-zero status
1384 nob|nobalancer Do not manage balancer states. This is only used as optimization in rolling_restart.sh to avoid multiple calls to hbase shell
1385 d|designatedfile xx Designated file with <hostname:port> per line as unload targets
1386 x|excludefile xx Exclude file should have <hostname:port> per line. We do not unload regions to hostnames given in exclude file
To decommission a loaded RegionServer, run `$ ./bin/graceful_stop.sh HOSTNAME`, where `HOSTNAME` is the host carrying the RegionServer you would like to decommission.
1395 The `HOSTNAME` passed to _graceful_stop.sh_ must match the hostname that hbase is using to identify RegionServers.
1396 HBase uses fully-qualified domain names usually. Check the list of RegionServers in the master UI for how HBase
is referring to servers. Whatever HBase is using, this is what you should pass to the _graceful_stop.sh_ decommission script.
If you pass IPs, the script is not yet smart enough to make a hostname (or FQDN) of it, so it will fail when it checks
whether the server is currently running; the graceful unloading of regions will not run.
1402 The _graceful_stop.sh_ script will move the regions off the decommissioned RegionServer one at a time to minimize region churn.
It will verify the region deployed in the new location before it moves the next region, and so on, until the decommissioned
1404 server is carrying zero regions. At this point, the _graceful_stop.sh_ tells the RegionServer `stop`.
The master will at this point notice that the RegionServer is gone, but all regions will have already been redeployed, and because the
1406 RegionServer went down cleanly, there will be no WAL logs to split.
1412 It is assumed that the Region Load Balancer is disabled while the `graceful_stop` script runs (otherwise the balancer and the decommission script will end up fighting over region deployments). Use the shell to disable the balancer:
1416 hbase(main):001:0> balance_switch false
1418 0 row(s) in 0.3590 seconds
1421 This turns the balancer OFF.
1426 hbase(main):001:0> balance_switch true
1428 0 row(s) in 0.3590 seconds
The `graceful_stop` script will check the balancer and, if enabled, will turn it off before it goes to work.
1432 If it exits prematurely because of error, it will not have reset the balancer.
Hence, it is better to manage the balancer yourself, apart from `graceful_stop`, re-enabling it after you are done with `graceful_stop`.
1436 [[draining.servers]]
1437 ==== Decommissioning several Regions Servers concurrently
1439 If you have a large cluster, you may want to decommission more than one machine at a time by gracefully stopping multiple RegionServers concurrently.
1440 To gracefully drain multiple regionservers at the same time, RegionServers can be put into a "draining" state.
1441 This is done by marking a RegionServer as a draining node by creating an entry in ZooKeeper under the _hbase_root/draining_ znode.
1442 This znode has format `name,port,startcode` just like the regionserver entries under _hbase_root/rs_ znode.
1444 Without this facility, decommissioning multiple nodes may be non-optimal because regions that are being drained from one region server may be moved to other regionservers that are also draining.
1445 Marking RegionServers to be in the draining state prevents this from happening.
1446 See this link:http://inchoate-clatter.blogspot.com/2012/03/hbase-ops-automation.html[blog
1447 post] for more details.
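
For example, assuming the default `zookeeper.znode.parent` of `/hbase`, you could mark a RegionServer as draining from `hbase zkcli` as sketched below; the server name is illustrative and must match the exact `name,port,startcode` entry listed under the _/hbase/rs_ znode:

----
$ ./bin/hbase zkcli
[zk: localhost:2181(CONNECTED) 0] ls /hbase/rs
[zk: localhost:2181(CONNECTED) 1] create /hbase/draining/host187.example.com,60020,1289493121758 ""
----

Remove the znode from _/hbase/draining_ once the decommission is finished so the server can be assigned regions again.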
1450 ==== Bad or Failing Disk
It is good to have <<dfs.datanode.failed.volumes.tolerated,dfs.datanode.failed.volumes.tolerated>> set if you have a decent number of disks per machine, for the case where a disk plain dies.
1453 But usually disks do the "John Wayne" -- i.e.
1454 take a while to go down spewing errors in _dmesg_ -- or for some reason, run much slower than their companions.
1455 In this case you want to decommission the disk.
1456 You have two options.
1457 You can link:https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDataNodeAdminGuide.html[decommission
the datanode] or, less disruptively (only the bad disk's data will be re-replicated), stop the datanode, unmount the bad volume (you can't umount a volume while the datanode is using it), and then restart the datanode (presuming you have set dfs.datanode.failed.volumes.tolerated > 0). The regionserver will throw some errors in its logs as it recalibrates where to get its data from -- it will likely roll its WAL log too -- but in general, apart from some latency spikes, it should keep on chugging.
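
A minimal sketch of the less disruptive option, assuming the Hadoop 3 `hdfs --daemon` command and an illustrative mount point; adjust both for your environment and Hadoop version:

----
$ hdfs --daemon stop datanode     # hadoop-daemon.sh stop datanode on Hadoop 2.x
$ umount /data/disk3              # the failing volume (illustrative mount point)
$ hdfs --daemon start datanode    # requires dfs.datanode.failed.volumes.tolerated > 0
----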
1460 .Short Circuit Reads
If you are doing short-circuit reads, you will have to move the regions off the regionserver before you stop the datanode; with short-circuit reads the regionserver already has the block files open, so even though they are chmod'd so that it cannot otherwise access them, it will be able to keep reading the file blocks from the bad disk even though the datanode is down.
1464 Move the regions back after you restart the datanode.
1470 Some cluster configuration changes require either the entire cluster, or the RegionServers, to be restarted in order to pick up the changes.
1471 In addition, rolling restarts are supported for upgrading to a minor or maintenance release, and to a major release if at all possible.
See the release notes for the release you want to upgrade to, to find out about limitations on the ability to perform a rolling upgrade.
1474 There are multiple ways to restart your cluster nodes, depending on your situation.
1475 These methods are detailed below.
1477 ==== Using the `rolling-restart.sh` Script
1479 HBase ships with a script, _bin/rolling-restart.sh_, that allows you to perform rolling restarts on the entire cluster, the master only, or the RegionServers only.
1480 The script is provided as a template for your own script, and is not explicitly tested.
1481 It requires password-less SSH login to be configured and assumes that you have deployed using a tarball.
1482 The script requires you to set some environment variables before running it.
1483 Examine the script and modify it to suit your needs.
1485 ._rolling-restart.sh_ General Usage
1487 $ ./bin/rolling-restart.sh --help
1488 Usage: rolling-restart.sh [--config <hbase-confdir>] [--rs-only] [--master-only] [--graceful] [--maxthreads xx]
1491 Rolling Restart on RegionServers Only::
1492 To perform a rolling restart on the RegionServers only, use the `--rs-only` option.
1493 This might be necessary if you need to reboot the individual RegionServer or if you make a configuration change that only affects RegionServers and not the other HBase processes.
1495 Rolling Restart on Masters Only::
1496 To perform a rolling restart on the active and backup Masters, use the `--master-only` option.
1497 You might use this if you know that your configuration change only affects the Master and not the RegionServers, or if you need to restart the server where the active Master is running.
1500 If you specify the `--graceful` option, RegionServers are restarted using the _bin/graceful_stop.sh_ script, which moves regions off a RegionServer before restarting it.
1501 This is safer, but can delay the restart.
1503 Limiting the Number of Threads::
1504 To limit the rolling restart to using only a specific number of threads, use the `--maxthreads` option.
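
For example, to gracefully roll only the RegionServers using up to four region-mover threads, an invocation might look like the following (the option values are illustrative):

----
$ ./bin/rolling-restart.sh --rs-only --graceful --maxthreads 4
----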
1506 [[rolling.restart.manual]]
1507 ==== Manual Rolling Restart
1509 To retain more control over the process, you may wish to manually do a rolling restart across your cluster.
This uses the _graceful_stop.sh_ script (see <<decommission,decommission>>).
1511 In this method, you can restart each RegionServer individually and then move its old regions back into place, retaining locality.
1512 If you also need to restart the Master, you need to do it separately, and restart the Master before restarting the RegionServers using this method.
1513 The following is an example of such a command.
1514 You may need to tailor it to your environment.
1515 This script does a rolling restart of RegionServers only.
1516 It disables the load balancer before moving the regions.
$ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i; done &> /tmp/log.txt &
1523 Monitor the output of the _/tmp/log.txt_ file to follow the progress of the script.
1525 ==== Logic for Crafting Your Own Rolling Restart Script
1527 Use the following guidelines if you want to create your own rolling restart script.
1529 . Extract the new release, verify its configuration, and synchronize it to all nodes of your cluster using `rsync`, `scp`, or another secure synchronization mechanism.
1531 . Restart the master first.
1532 You may need to modify these commands if your new HBase directory is different from the old one, such as for an upgrade.
1536 $ ./bin/hbase-daemon.sh stop master; ./bin/hbase-daemon.sh start master
1539 . Gracefully restart each RegionServer, using a script such as the following, from the Master.
1543 $ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i; done &> /tmp/log.txt &
If you are running Thrift or REST servers, pass the `--thrift` or `--rest` options.
For other available options, run the `bin/graceful_stop.sh --help` command.
1549 It is important to drain HBase regions slowly when restarting multiple RegionServers.
1550 Otherwise, multiple regions go offline simultaneously and must be reassigned to other nodes, which may also go offline soon.
1551 This can negatively affect performance.
1552 You can inject delays into the script above, for instance, by adding a Shell command such as `sleep`.
1553 To wait for 5 minutes between each RegionServer restart, modify the above script to the following:
1557 $ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i & sleep 5m; done &> /tmp/log.txt &
1560 . Restart the Master again, to clear out the dead servers list and re-enable the load balancer.
1563 === Adding a New Node
Adding a new regionserver in HBase is essentially free: simply start it like this: `$ ./bin/hbase-daemon.sh start regionserver`, and it will register itself with the Master.
1566 Ideally you also started a DataNode on the same machine so that the RS can eventually start to have local files.
1567 If you rely on ssh to start your daemons, don't forget to add the new hostname in _conf/regionservers_ on the master.
1569 At this point the region server isn't serving data because no regions have moved to it yet.
1570 If the balancer is enabled, it will start moving regions to the new RS.
1571 On a small/medium cluster this can have a very adverse effect on latency as a lot of regions will be offline at the same time.
1572 It is thus recommended to disable the balancer the same way it's done when decommissioning a node and move the regions manually (or even better, using a script that moves them one by one).
1574 The moved regions will all have 0% locality and won't have any blocks in cache so the region server will have to use the network to serve requests.
Apart from resulting in higher latency, it may also saturate your network card.
1576 For practical purposes, consider that a standard 1GigE NIC won't be able to read much more than _100MB/s_.
In this case, or if you are in an OLAP environment and require locality, it is recommended to major compact the moved regions.
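
A minimal sketch of that manual approach from the HBase shell; the encoded region name, target server name, and table name below are placeholders:

----
hbase> balance_switch false
hbase> move 'ENCODED_REGIONNAME', 'newnode.example.com,16020,1561131112259'
hbase> major_compact 'my_table'
hbase> balance_switch true
----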
1582 HBase emits metrics which adhere to the link:https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/Metrics.html[Hadoop Metrics] API.
1583 Starting with HBase 0.95footnote:[The Metrics system was redone in
1584 HBase 0.96. See Migration
1585 to the New Metrics Hotness – Metrics2 by Elliot Clark for detail], HBase is configured to emit a default set of metrics with a default sampling period of every 10 seconds.
1586 You can use HBase metrics in conjunction with Ganglia.
1587 You can also filter which metrics are emitted and extend the metrics framework to capture custom metrics appropriate for your environment.
1591 For HBase 0.95 and newer, HBase ships with a default metrics configuration, or [firstterm]_sink_.
1592 This includes a wide variety of individual metrics, and emits them every 10 seconds by default.
1593 To configure metrics for a given region server, edit the _conf/hadoop-metrics2-hbase.properties_ file.
1594 Restart the region server for the changes to take effect.
1596 To change the sampling rate for the default sink, edit the line beginning with `*.period`.
1597 To filter which metrics are emitted or to extend the metrics framework, see https://hadoop.apache.org/docs/current/api/org/apache/hadoop/metrics2/package-summary.html
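
For example, a sketch of changing the default sampling period (mentioned above) to 30 seconds; it assumes the stock file keeps an uncommented `*.period` line, otherwise edit the file by hand:

----
$ sed -i 's/^\*\.period=.*/*.period=30/' conf/hadoop-metrics2-hbase.properties
$ ./bin/hbase-daemon.sh restart regionserver
----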
1599 .HBase Metrics and Ganglia
1602 By default, HBase emits a large number of metrics per region server.
1603 Ganglia may have difficulty processing all these metrics.
1604 Consider increasing the capacity of the Ganglia server or reducing the number of metrics emitted by HBase.
1605 See link:https://hadoop.apache.org/docs/current/api/org/apache/hadoop/metrics2/package-summary.html#filtering[Metrics Filtering].
1608 === Disabling Metrics
1610 To disable metrics for a region server, edit the _conf/hadoop-metrics2-hbase.properties_ file and comment out any uncommented lines.
1611 Restart the region server for the changes to take effect.
1613 [[discovering.available.metrics]]
1614 === Discovering Available Metrics
1616 Rather than listing each metric which HBase emits by default, you can browse through the available metrics, either as a JSON output or via JMX.
1617 Different metrics are exposed for the Master process and each region server process.
1619 .Procedure: Access a JSON Output of Available Metrics
1620 . After starting HBase, access the region server's web UI, at pass:[http://REGIONSERVER_HOSTNAME:60030] by default (or port 16030 in HBase 1.0+).
1621 . Click the [label]#Metrics Dump# link near the top.
1622 The metrics for the region server are presented as a dump of the JMX bean in JSON format.
1623 This will dump out all metrics names and their values.
1624 To include metrics descriptions in the listing -- this can be useful when you are exploring what is available -- add a query string of `?description=true` so your URL becomes pass:[http://REGIONSERVER_HOSTNAME:60030/jmx?description=true].
1625 Not all beans and attributes have descriptions.
1626 . To view metrics for the Master, connect to the Master's web UI instead (defaults to pass:[http://localhost:60010] or port 16010 in HBase 1.0+) and click its [label]#Metrics
1628 To include metrics descriptions in the listing -- this can be useful when you are exploring what is available -- add a query string of `?description=true` so your URL becomes pass:[http://REGIONSERVER_HOSTNAME:60010/jmx?description=true].
1629 Not all beans and attributes have descriptions.
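
If you prefer the command line to a browser, the same JSON can be fetched with a tool such as `curl` (the hostname is a placeholder; use the ports noted above for your version, and swap in the Master's host and port for Master metrics):

----
$ curl -s 'http://REGIONSERVER_HOSTNAME:16030/jmx?description=true' | less
----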
1632 You can use many different tools to view JMX content by browsing MBeans.
1633 This procedure uses `jvisualvm`, which is an application usually available in the JDK.
1635 .Procedure: Browse the JMX Output of Available Metrics
1636 . Start HBase, if it is not already running.
. Run the `jvisualvm` command on a host with a GUI display.
1638 You can launch it from the command line or another method appropriate for your operating system.
1639 . Be sure the [label]#VisualVM-MBeans# plugin is installed. Browse to *Tools -> Plugins*. Click [label]#Installed# and check whether the plugin is listed.
1640 If not, click [label]#Available Plugins#, select it, and click btn:[Install].
1641 When finished, click btn:[Close].
1642 . To view details for a given HBase process, double-click the process in the [label]#Local# sub-tree in the left-hand panel.
1643 A detailed view opens in the right-hand panel.
1644 Click the [label]#MBeans# tab which appears as a tab in the top of the right-hand panel.
1645 . To access the HBase metrics, navigate to the appropriate sub-bean:
1649 . The name of each metric and its current value is displayed in the [label]#Attributes# tab.
1650 For a view which includes more details, including the description of each attribute, click the [label]#Metadata# tab.
1652 === Units of Measure for Metrics
1654 Different metrics are expressed in different units, as appropriate.
1655 Often, the unit of measure is in the name (as in the metric `shippedKBs`). Otherwise, use the following guidelines.
1656 When in doubt, you may need to examine the source for a given metric.
1658 * Metrics that refer to a point in time are usually expressed as a timestamp.
1659 * Metrics that refer to an age (such as `ageOfLastShippedOp`) are usually expressed in milliseconds.
1660 * Metrics that refer to memory sizes are in bytes.
1661 * Sizes of queues (such as `sizeOfLogQueue`) are expressed as the number of items in the queue.
1662 Determine the size by multiplying by the block size (default is 64 MB in HDFS).
1663 * Metrics that refer to things like the number of a given type of operations (such as `logEditsRead`) are expressed as an integer.
1666 === Most Important Master Metrics
1668 Note: Counts are usually over the last metrics reporting interval.
1670 hbase.master.numRegionServers::
1671 Number of live regionservers
1673 hbase.master.numDeadRegionServers::
1674 Number of dead regionservers
1676 hbase.master.ritCount ::
1677 The number of regions in transition
1679 hbase.master.ritCountOverThreshold::
1680 The number of regions that have been in transition longer than a threshold time (default: 60 seconds)
1682 hbase.master.ritOldestAge::
1683 The age of the longest region in transition, in milliseconds
1686 === Most Important RegionServer Metrics
1688 Note: Counts are usually over the last metrics reporting interval.
1690 hbase.regionserver.regionCount::
1691 The number of regions hosted by the regionserver
1693 hbase.regionserver.storeFileCount::
1694 The number of store files on disk currently managed by the regionserver
1696 hbase.regionserver.storeFileSize::
1697 Aggregate size of the store files on disk
1699 hbase.regionserver.hlogFileCount::
1700 The number of write ahead logs not yet archived
1702 hbase.regionserver.totalRequestCount::
1703 The total number of requests received
1705 hbase.regionserver.readRequestCount::
1706 The number of read requests received
1708 hbase.regionserver.writeRequestCount::
1709 The number of write requests received
1711 hbase.regionserver.numOpenConnections::
1712 The number of open connections at the RPC layer
1714 hbase.regionserver.numActiveHandler::
1715 The number of RPC handlers actively servicing requests
1717 hbase.regionserver.numCallsInGeneralQueue::
1718 The number of currently enqueued user requests
1720 hbase.regionserver.numCallsInReplicationQueue::
1721 The number of currently enqueued operations received from replication
1723 hbase.regionserver.numCallsInPriorityQueue::
1724 The number of currently enqueued priority (internal housekeeping) requests
1726 hbase.regionserver.flushQueueLength::
1727 Current depth of the memstore flush queue.
1728 If increasing, we are falling behind with clearing memstores out to HDFS.
1730 hbase.regionserver.updatesBlockedTime::
1731 Number of milliseconds updates have been blocked so the memstore can be flushed
1733 hbase.regionserver.compactionQueueLength::
1734 Current depth of the compaction request queue.
1735 If increasing, we are falling behind with storefile compaction.
1737 hbase.regionserver.blockCacheHitCount::
1738 The number of block cache hits
1740 hbase.regionserver.blockCacheMissCount::
1741 The number of block cache misses
1743 hbase.regionserver.blockCacheExpressHitPercent ::
1744 The percent of the time that requests with the cache turned on hit the cache
1746 hbase.regionserver.percentFilesLocal::
1747 Percent of store file data that can be read from the local DataNode, 0-100
1749 hbase.regionserver.<op>_<measure>::
1750 Operation latencies, where <op> is one of Append, Delete, Mutate, Get, Replay, Increment; and where <measure> is one of min, max, mean, median, 75th_percentile, 95th_percentile, 99th_percentile
1752 hbase.regionserver.slow<op>Count ::
1753 The number of operations we thought were slow, where <op> is one of the list above
1755 hbase.regionserver.GcTimeMillis::
1756 Time spent in garbage collection, in milliseconds
1758 hbase.regionserver.GcTimeMillisParNew::
1759 Time spent in garbage collection of the young generation, in milliseconds
1761 hbase.regionserver.GcTimeMillisConcurrentMarkSweep::
1762 Time spent in garbage collection of the old generation, in milliseconds
1764 hbase.regionserver.authenticationSuccesses::
1765 Number of client connections where authentication succeeded
1767 hbase.regionserver.authenticationFailures::
1768 Number of client connection authentication failures
1770 hbase.regionserver.mutationsWithoutWALCount ::
1771 Count of writes submitted with a flag indicating they should bypass the write ahead log
1774 === Meta Table Load Metrics
The HBase meta table metrics collection feature is available in HBase 1.4+ but is disabled by default, as it can
affect the performance of the cluster. When enabled, it helps to monitor client access patterns by collecting
the following statistics:
1780 * number of get, put and delete operations on the `hbase:meta` table
1781 * number of get, put and delete operations made by the top-N clients
1782 * number of operations related to each table
1783 * number of operations related to the top-N regions
1786 When to use the feature::
1787 This feature can help to identify hot spots in the meta table by showing the regions or tables where the meta info is
1788 modified (e.g. by create, drop, split or move tables) or retrieved most frequently. It can also help to find misbehaving
client applications by showing which clients are using the meta table most heavily, which can, for example, suggest a
lack of meta table buffering or a failure to reuse open client connections in the client application.
1792 .Possible side-effects of enabling this feature
Having a large number of clients and regions in the cluster can cause the registration and tracking of a large number of
metrics, which can increase the memory and CPU footprint of the HBase region server handling the `hbase:meta` table.
It can also cause a significant increase in the JMX dump size, which can affect the monitoring or log aggregation
system you use alongside HBase. It is recommended to turn on this feature only during debugging.
1801 Where to find the metrics in JMX::
Each metric attribute name will start with the `MetaTable_` prefix. For all the metrics you will see five different
1803 JMX attributes: count, mean rate, 1 minute rate, 5 minute rate and 15 minute rate. You will find these metrics in JMX
1804 under the following MBean:
1805 `Hadoop -> HBase -> RegionServer -> Coprocessor.Region.CP_org.apache.hadoop.hbase.coprocessor.MetaTableMetrics`.
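
For example (a sketch; the hostname, port, and exact MBean ObjectName may vary between versions, so cross-check against the full _/jmx_ dump), the bean can be fetched directly from the JMX servlet of the RegionServer hosting `hbase:meta` with a tool such as `curl`:

----
$ curl -s 'http://REGIONSERVER_HOSTNAME:16030/jmx?qry=Hadoop:service=HBase,name=RegionServer,sub=Coprocessor.Region.CP_org.apache.hadoop.hbase.coprocessor.MetaTableMetrics'
----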
1807 .Examples: some Meta Table metrics you can see in your JMX dump
1811 "MetaTable_get_request_count": 77309,
1812 "MetaTable_put_request_mean_rate": 0.06339092997186495,
1813 "MetaTable_table_MyTestTable_request_15min_rate": 1.1020599841623246,
1814 "MetaTable_client_/172.30.65.42_lossy_request_count": 1786
1815 "MetaTable_client_/172.30.65.45_put_request_5min_rate": 0.6189810954855728,
1816 "MetaTable_region_1561131112259.c66e4308d492936179352c80432ccfe0._lossy_request_count": 38342,
1817 "MetaTable_region_1561131043640.5bdffe4b9e7e334172065c853cf0caa6._lossy_request_1min_rate": 0.04925099917433935,
1822 To turn on this feature, you have to enable a custom coprocessor by adding the following section to hbase-site.xml.
1823 This coprocessor will run on all the HBase RegionServers, but will be active (i.e. consume memory / CPU) only on
the server where the `hbase:meta` table is located. It will produce JMX metrics which can be downloaded from the
1825 web UI of the given RegionServer or by a simple REST call. These metrics will not be present in the JMX dump of the
1826 other RegionServers.
1828 .Enabling the Meta Table Metrics feature
1832 <name>hbase.coprocessor.region.classes</name>
1833 <value>org.apache.hadoop.hbase.coprocessor.MetaTableMetrics</value>
.How are the top-N metrics calculated?
1840 The 'top-N' type of metrics will be counted using the Lossy Counting Algorithm (as defined in
1841 link:http://www.vldb.org/conf/2002/S10P03.pdf[Motwani, R; Manku, G.S (2002). "Approximate frequency counts over data streams"]),
1842 which is designed to identify elements in a data stream whose frequency count exceed a user-given threshold.
1843 The frequency computed by this algorithm is not always accurate but has an error threshold that can be specified by the
1844 user as a configuration parameter. The run time space required by the algorithm is inversely proportional to the
specified error threshold, hence the larger the error parameter, the smaller the footprint and the less accurate are the
You can specify the error rate of the algorithm as a floating-point value between 0 and 1 (exclusive); its default
value is 0.02. With the error rate set to `E` and `N` as the total number of meta table operations, then
1850 (assuming the uniform distribution of the activity of low frequency elements) at most `7 / E` meters will be kept and
1851 each kept element will have a frequency higher than `E * N`.
1853 An example: Let’s assume we are interested in the HBase clients that are most active in accessing the meta table.
If there have been 1,000,000 operations on the meta table so far and the error rate parameter is set to 0.02, then we can
assume that at most 350 client IP address related counters will be present in JMX and that each of these clients
1856 accessed the meta table at least 20,000 times.
1861 <name>hbase.util.default.lossycounting.errorrate</name>
1870 [[ops.monitoring.overview]]
1873 The following metrics are arguably the most important to monitor for each RegionServer for "macro monitoring", preferably with a system like link:http://opentsdb.net/[OpenTSDB].
If your cluster is having performance issues, it's likely that you'll see something unusual with this group.
1877 * See <<rs_metrics,rs metrics>>
1886 For more information on HBase metrics, see <<hbase_metrics,hbase metrics>>.
1891 The HBase slow query log consists of parseable JSON structures describing the properties of those client operations (Gets, Puts, Deletes, etc.) that either took too long to run, or produced too much output.
1892 The thresholds for "too long to run" and "too much output" are configurable, as described below.
1893 The output is produced inline in the main region server logs so that it is easy to discover further details from context with other logged events.
1894 It is also prepended with identifying tags `(responseTooSlow)`, `(responseTooLarge)`, `(operationTooSlow)`, and `(operationTooLarge)` in order to enable easy filtering with grep, in case the user desires to see only slow queries.
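
For example, to pull only the slow and large operation complaints out of a RegionServer log with grep (the log directory and file name below are illustrative of a default install):

----
$ grep -E 'operationTooSlow|responseTooSlow|operationTooLarge|responseTooLarge' \
    logs/hbase-$USER-regionserver-$(hostname).log
----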
1898 There are two configuration knobs that can be used to adjust the thresholds for when queries are logged.
1900 * `hbase.ipc.warn.response.time` Maximum number of milliseconds that a query can be run without being logged.
1901 Defaults to 10000, or 10 seconds.
1902 Can be set to -1 to disable logging by time.
1903 * `hbase.ipc.warn.response.size` Maximum byte size of response that a query can return without being logged.
1904 Defaults to 100 megabytes.
1905 Can be set to -1 to disable logging by size.
The slow query log exposes two metrics to JMX.
1911 * `hadoop.regionserver_rpc_slowResponse` a global metric reflecting the durations of all responses that triggered logging.
1912 * `hadoop.regionserver_rpc_methodName.aboveOneSec` A metric reflecting the durations of all responses that lasted for more than one second.
1916 The output is tagged with operation e.g. `(operationTooSlow)` if the call was a client operation, such as a Put, Get, or Delete, which we expose detailed fingerprint information for.
1917 If not, it is tagged `(responseTooSlow)` and still produces parseable JSON output, but with less verbose information solely regarding its duration and size in the RPC itself. `TooLarge` is substituted for `TooSlow` if the response size triggered the logging, with `TooLarge` appearing even in the case that both size and duration triggered logging.
1924 2011-09-08 10:01:25,824 WARN org.apache.hadoop.ipc.HBaseServer: (operationTooSlow): {"tables":{"riley2":{"puts":[{"totalColumns":11,"families":{"actions":[{"timestamp":1315501284459,"qualifier":"0","vlen":9667580},{"timestamp":1315501284459,"qualifier":"1","vlen":10122412},{"timestamp":1315501284459,"qualifier":"2","vlen":11104617},{"timestamp":1315501284459,"qualifier":"3","vlen":13430635}]},"row":"cfcd208495d565ef66e7dff9f98764da:0"}],"families":["actions"]}},"processingtimems":956,"client":"10.47.34.63:33623","starttimems":1315501284456,"queuetimems":0,"totalPuts":1,"class":"HRegionServer","responsesize":0,"method":"multiPut"}
1927 Note that everything inside the "tables" structure is output produced by MultiPut's fingerprint, while the rest of the information is RPC-specific, such as processing time and client IP/port.
1928 Other client operations follow the same pattern and the same general structure, with necessary differences due to the nature of the individual operations.
1929 In the case that the call is not a client operation, that detailed fingerprint information will be completely absent.
This particular example would indicate that the likely cause of slowness is simply a very large (on the order of 100MB) multiput, as we can tell by the "vlen," or value length, fields of each put in the multiPut.
1933 [[slow_log_responses]]
1934 ==== Get Slow Response Log from shell
When an individual RPC exceeds a configurable time bound, we log a complaint
1936 by way of the logging subsystem
1941 2019-10-02 10:10:22,195 WARN [,queue=15,port=60020] ipc.RpcServer - (responseTooSlow):
1942 {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
1943 "starttimems":1567203007549,
1944 "responsesize":6819737,
1946 "param":"region { type: REGION_NAME value: \"t1,\\000\\000\\215\\f)o\\\\\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000<TRUNCATED>",
1947 "processingtimems":28646,
1948 "client":"10.253.196.215:41116",
1949 "queuetimems":22453,
1950 "class":"HRegionServer"}
Unfortunately, the request parameters are often truncated, as in the example above.
1954 The truncation is unfortunate because it eliminates much of the utility of
1955 the warnings. For example, the region name, the start and end keys, and the
1956 filter hierarchy are all important clues for debugging performance problems
1957 caused by moderate to low selectivity queries or queries made at a high rate.
1959 HBASE-22978 introduces maintaining an in-memory ring buffer of requests that were judged to
1960 be too slow in addition to the responseTooSlow logging. The in-memory representation can be
1961 complete. There is some chance a high rate of requests will cause information on other
1962 interesting requests to be overwritten before it can be read. This is an acceptable trade off.
1964 In order to enable the in-memory ring buffer at RegionServers, we need to enable
1967 hbase.regionserver.slowlog.buffer.enabled
1970 One more config determines the size of the ring buffer:
1972 hbase.regionserver.slowlog.ringbuffer.size
1975 Check the config section for the detailed description.
This config is disabled by default. Turn it on, and the shell commands below will retrieve
results from the ring buffers.
1981 shell commands to retrieve slowlog responses from RegionServers:
1984 Retrieve latest SlowLog Responses maintained by each or specific RegionServers.
1985 Specify '*' to include all RS otherwise array of server names for specific
1986 RS. A server name is the host, port plus startcode of a RegionServer.
1987 e.g.: host187.example.com,60020,1289493121758 (find servername in
1988 master ui or when you do detailed status in shell)
1990 Provide optional filter parameters as Hash.
1991 Default Limit of each server for providing no of slow log records is 10. User can specify
1992 more limit by 'LIMIT' param in case more than 10 records should be retrieved.
1996 hbase> get_slowlog_responses '*' => get slowlog responses from all RS
1997 hbase> get_slowlog_responses '*', {'LIMIT' => 50} => get slowlog responses from all RS
1998 with 50 records limit (default limit: 10)
1999 hbase> get_slowlog_responses ['SERVER_NAME1', 'SERVER_NAME2'] => get slowlog responses from SERVER_NAME1,
2001 hbase> get_slowlog_responses '*', {'REGION_NAME' => 'hbase:meta,,1'}
2002 => get slowlog responses only related to meta
2004 hbase> get_slowlog_responses '*', {'TABLE_NAME' => 't1'} => get slowlog responses only related to t1 table
2005 hbase> get_slowlog_responses '*', {'CLIENT_IP' => '192.162.1.40:60225', 'LIMIT' => 100}
2006 => get slowlog responses with given client
2007 IP address and get 100 records limit
2009 hbase> get_slowlog_responses '*', {'REGION_NAME' => 'hbase:meta,,1', 'TABLE_NAME' => 't1'}
2010 => get slowlog responses with given region name
2012 hbase> get_slowlog_responses '*', {'USER' => 'user_name', 'CLIENT_IP' => '192.162.1.40:60225'}
2013 => get slowlog responses that match either
2014 provided client IP address or user name
All of the above queries with filters have the OR operation applied by default, i.e. all
records matching any of the provided filters will be returned. However,
we can also apply the AND operator, i.e. only records that match all (not any) of
the provided filters will be returned.
2025 hbase> get_slowlog_responses '*', {'REGION_NAME' => 'hbase:meta,,1', 'TABLE_NAME' => 't1', 'FILTER_BY_OP' => 'AND'}
2026 => get slowlog responses with given region name
2027 and table name, both should match
2029 hbase> get_slowlog_responses '*', {'REGION_NAME' => 'hbase:meta,,1', 'TABLE_NAME' => 't1', 'FILTER_BY_OP' => 'OR'}
2030 => get slowlog responses with given region name
2031 or table name, any one can match
2033 hbase> get_slowlog_responses '*', {'TABLE_NAME' => 't1', 'CLIENT_IP' => '192.163.41.53:52781', 'FILTER_BY_OP' => 'AND'}
2034 => get slowlog responses with given region name
2035 and client IP address, both should match
Since OR is the default filter operator, omitting 'FILTER_BY_OP' gives the
same result as providing 'FILTER_BY_OP' => 'OR'.
Sometimes the output is long, pretty-printed JSON that is awkward to scroll through
on a single screen, so you might prefer
redirecting the output of get_slowlog_responses to a file.
2048 echo "get_slowlog_responses '*'" | hbase shell > xyz.out 2>&1
Similar to slow RPC logs, clients can also retrieve large RPC logs.
Sometimes, slow logs important for debugging perf issues turn out to be
2057 hbase> get_largelog_responses '*' => get largelog responses from all RS
2058 hbase> get_largelog_responses '*', {'LIMIT' => 50} => get largelog responses from all RS
2059 with 50 records limit (default limit: 10)
2060 hbase> get_largelog_responses ['SERVER_NAME1', 'SERVER_NAME2'] => get largelog responses from SERVER_NAME1,
2062 hbase> get_largelog_responses '*', {'REGION_NAME' => 'hbase:meta,,1'}
2063 => get largelog responses only related to meta
2065 hbase> get_largelog_responses '*', {'TABLE_NAME' => 't1'} => get largelog responses only related to t1 table
2066 hbase> get_largelog_responses '*', {'CLIENT_IP' => '192.162.1.40:60225', 'LIMIT' => 100}
2067 => get largelog responses with given client
2068 IP address and get 100 records limit
2070 hbase> get_largelog_responses '*', {'REGION_NAME' => 'hbase:meta,,1', 'TABLE_NAME' => 't1'}
2071 => get largelog responses with given region name
2073 hbase> get_largelog_responses '*', {'USER' => 'user_name', 'CLIENT_IP' => '192.162.1.40:60225'}
2074 => get largelog responses that match either
2075 provided client IP address or user name
2077 hbase> get_largelog_responses '*', {'REGION_NAME' => 'hbase:meta,,1', 'TABLE_NAME' => 't1', 'FILTER_BY_OP' => 'AND'}
2078 => get largelog responses with given region name
2079 and table name, both should match
2081 hbase> get_largelog_responses '*', {'REGION_NAME' => 'hbase:meta,,1', 'TABLE_NAME' => 't1', 'FILTER_BY_OP' => 'OR'}
2082 => get largelog responses with given region name
2083 or table name, any one can match
2085 hbase> get_largelog_responses '*', {'TABLE_NAME' => 't1', 'CLIENT_IP' => '192.163.41.53:52781', 'FILTER_BY_OP' => 'AND'}
2086 => get largelog responses with given region name
2087 and client IP address, both should match
2092 shell command to clear slow/largelog responses from RegionServer:
2095 Clears SlowLog Responses maintained by each or specific RegionServers.
2096 Specify array of server names for specific RS. A server name is
2097 the host, port plus startcode of a RegionServer.
2098 e.g.: host187.example.com,60020,1289493121758 (find servername in
2099 master ui or when you do detailed status in shell)
2103 hbase> clear_slowlog_responses => clears slowlog responses from all RS
2104 hbase> clear_slowlog_responses ['SERVER_NAME1', 'SERVER_NAME2'] => clears slowlog responses from SERVER_NAME1,
2110 include::slow_log_responses_from_systable.adoc[]
2113 === Block Cache Monitoring
2115 Starting with HBase 0.98, the HBase Web UI includes the ability to monitor and report on the performance of the block cache.
2116 To view the block cache reports, see the Block Cache section of the region server UI.
2117 Following are a few examples of the reporting capabilities.
2119 .Basic Info shows the cache implementation.
2120 image::bc_basic.png[]
2122 .Config shows all cache configuration options.
2123 image::bc_config.png[]
2125 .Stats shows statistics about the performance of the cache.
2126 image::bc_stats.png[]
2128 .L1 and L2 show information about the L1 and L2 caches.
2131 This is not an exhaustive list of all the screens and reports available.
2132 Have a look in the Web UI.
2134 === Snapshot Space Usage Monitoring
2136 Starting with HBase 0.95, Snapshot usage information on individual snapshots was shown in the HBase Master Web UI. This was further enhanced starting with HBase 1.3 to show the total Storefile size of the Snapshot Set. The following metrics are shown in the Master Web UI with HBase 1.3 and later.
2138 * Shared Storefile Size is the Storefile size shared between snapshots and active tables.
2139 * Mob Storefile Size is the Mob Storefile size shared between snapshots and active tables.
2140 * Archived Storefile Size is the Storefile size in Archive.
2142 The format of Archived Storefile Size is NNN(MMM). NNN is the total Storefile size in Archive, MMM is the total Storefile size in Archive that is specific to the snapshot (not shared with other snapshots and tables).
2144 .Master Snapshot Overview
2145 image::master-snapshot.png[]
2147 .Snapshot Storefile Stats Example 1
2148 image::1-snapshot.png[]
2150 .Snapshot Storefile Stats Example 2
2151 image::2-snapshots.png[]
.Empty Snapshot Storefile Stats Example
2154 image::empty-snapshots.png[]
2156 == Cluster Replication
2158 NOTE: This information was previously available at
2159 link:https://hbase.apache.org/0.94/replication.html[Cluster Replication].
2161 HBase provides a cluster replication mechanism which allows you to keep one cluster's state synchronized with that of another cluster, using the write-ahead log (WAL) of the source cluster to propagate the changes.
2162 Some use cases for cluster replication include:
2164 * Backup and disaster recovery
2166 * Geographic data distribution
2167 * Online data ingestion combined with offline data analytics
2169 NOTE: Replication is enabled at the granularity of the column family.
2170 Before enabling replication for a column family, create the table and all column families to be replicated, on the destination cluster.
NOTE: Replication is asynchronous, as we ship the WAL to another cluster in the background, which means that when you want to do recovery through replication, you could lose some data. To address this problem, we have introduced a new feature called synchronous replication. As the mechanism is a bit different, we describe it in a separate section. Please see
2173 <<Synchronous Replication,Synchronous Replication>>.
NOTE: At present, there is a compatibility problem if Replication and WAL Compression are used together. If you need to use Replication, it is recommended to set the `hbase.regionserver.wal.enablecompression` property to `false`. See (https://issues.apache.org/jira/browse/HBASE-26849[HBASE-26849]) for details.
2177 === Replication Overview
2179 Cluster replication uses a source-push methodology.
2180 An HBase cluster can be a source (also called master or active, meaning that it is the originator of new data), a destination (also called slave or passive, meaning that it receives data via replication), or can fulfill both roles at once.
2181 Replication is asynchronous, and the goal of replication is eventual consistency.
When the source receives an edit to a column family with replication enabled, that edit is propagated to all destination clusters using the WAL for that column family on the RegionServer managing the relevant region.
2184 When data is replicated from one cluster to another, the original source of the data is tracked via a cluster ID which is part of the metadata.
2185 In HBase 0.96 and newer (link:https://issues.apache.org/jira/browse/HBASE-7709[HBASE-7709]), all clusters which have already consumed the data are also tracked.
2186 This prevents replication loops.
2188 The WALs for each region server must be kept in HDFS as long as they are needed to replicate data to any slave cluster.
2189 Each region server reads from the oldest log it needs to replicate and keeps track of its progress processing WALs inside ZooKeeper to simplify failure recovery.
2190 The position marker which indicates a slave cluster's progress, as well as the queue of WALs to process, may be different for every slave cluster.
2192 The clusters participating in replication can be of different sizes.
2193 The master cluster relies on randomization to attempt to balance the stream of replication on the slave clusters.
2194 It is expected that the slave cluster has storage capacity to hold the replicated data, as well as any data it is responsible for ingesting.
2195 If a slave cluster does run out of room, or is inaccessible for other reasons, it throws an error and the master retains the WAL and retries the replication at intervals.
2197 .Consistency Across Replicated Clusters
How your application builds on top of the HBase API matters when replication is in play. HBase's replication system provides at-least-once delivery of client edits for an enabled column family to each configured destination cluster. In the event of failure to reach a given destination, the replication system will retry sending edits in a way that might repeat a given message. HBase provides two modes of replication: the original replication and serial replication. With the original replication, there is no guaranteed order of delivery for client edits. In the event of a RegionServer failing, recovery of the replication queue happens independently of recovery of the individual regions that server was previously handling. This means that it is possible for the not-yet-replicated edits to be serviced by a RegionServer that is currently slower to replicate than the one that handles edits from after the failure.
2202 The combination of these two properties (at-least-once delivery and the lack of message ordering) means that some destination clusters may end up in a different state if your application makes use of operations that are not idempotent, e.g. Increments.
2204 To solve the problem, HBase now supports serial replication, which sends edits to destination cluster as the order of requests from client. See <<Serial Replication,Serial Replication>>.
2208 .Terminology Changes
2211 Previously, terms such as [firstterm]_master-master_, [firstterm]_master-slave_, and [firstterm]_cyclical_ were used to describe replication relationships in HBase.
2212 These terms added confusion, and have been abandoned in favor of discussions about cluster topologies appropriate for different scenarios.
2216 * A central source cluster might propagate changes out to multiple destination clusters, for failover or due to geographic distribution.
2217 * A source cluster might push changes to a destination cluster, which might also push its own changes back to the original cluster.
2218 * Many different low-latency clusters might push changes to one centralized cluster for backup or resource-intensive data analytics jobs.
2219 The processed data might then be replicated back to the low-latency clusters.
2221 Multiple levels of replication may be chained together to suit your organization's needs.
2222 The following diagram shows a hypothetical scenario.
2223 Use the arrows to follow the data paths.
2225 .Example of a Complex Cluster Replication Configuration
2226 image::hbase_replication_diagram.jpg[]
2228 HBase replication borrows many concepts from the [firstterm]_statement-based replication_ design used by MySQL.
2229 Instead of SQL statements, entire WALEdits (consisting of multiple cell inserts coming from Put and Delete operations on the clients) are replicated in order to maintain atomicity.
2231 [[hbase.replication.management]]
2232 === Managing and Configuring Cluster Replication
2233 .Cluster Configuration Overview
2235 . Configure and start the source and destination clusters.
2236 Create tables with the same names and column families on both the source and destination clusters, so that the destination cluster knows where to store data it will receive.
. All hosts in the source and destination clusters should be reachable from each other.
2238 . If both clusters use the same ZooKeeper cluster, you must use a different `zookeeper.znode.parent`, because they cannot write in the same folder.
2239 . On the source cluster, in HBase Shell, add the destination cluster as a peer, using the `add_peer` command.
2240 . On the source cluster, in HBase Shell, enable the table replication, using the `enable_table_replication` command.
2241 . Check the logs to see if replication is taking place. If so, you will see messages like the following, coming from the ReplicationSource.
2243 LOG.info("Replicating "+clusterId + " -> " + peerClusterId);
2246 .Serial Replication Configuration
2247 See <<Serial Replication,Serial Replication>>
2249 .Cluster Management Commands
2250 add_peer <ID> <CLUSTER_KEY>::
2251 Adds a replication relationship between two clusters. +
2252 * ID -- a unique string, which must not contain a hyphen.
2253 * CLUSTER_KEY: composed using the following template, with appropriate place-holders: `hbase.zookeeper.quorum:hbase.zookeeper.property.clientPort:zookeeper.znode.parent`. This value can be found on the Master UI info page.
* STATE (optional): ENABLED or DISABLED, default value is ENABLED
list_peers:: List all replication relationships known by this cluster
2257 Enable a previously-disabled replication relationship
2259 Disable a replication relationship. HBase will no longer send edits to that
2260 peer cluster, but it still keeps track of all the new WALs that it will need
2261 to replicate if and when it is re-enabled. WALs are retained when enabling or disabling
2262 replication as long as peers exist.
2264 Disable and remove a replication relationship. HBase will no longer send edits to that peer cluster or keep track of WALs.
2265 enable_table_replication <TABLE_NAME>::
2266 Enable the table replication switch for all its column families. If the table is not found in the destination cluster then it will create one with the same name and column families.
2267 disable_table_replication <TABLE_NAME>::
2268 Disable the table replication switch for all its column families.
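
Putting these commands together, a minimal sketch of enabling replication for one table from the source cluster's shell (the peer ID, ZooKeeper quorum, and table name are placeholders):

----
hbase> add_peer '1', CLUSTER_KEY => "zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase"
hbase> enable_table_replication 'my_table'
hbase> list_peers
----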
2270 === Serial Replication
Note: this feature was introduced in HBase 2.1
2274 .Function of serial replication
Serial replication pushes logs to the destination cluster in the same order in which they arrive at the source cluster.
.Why is serial replication needed?
In HBase replication, we push mutations to the destination cluster by reading the WAL in each region server. We have a queue of WAL files, so we can read them in order of creation time. However, when a region move or RS failure occurs in the source cluster, the hlog entries that were not pushed before the region move or RS failure will be pushed by the original RS (for a region move) or by another RS which takes over the remaining hlog of the dead RS (for an RS failure), while the new entries for the same region(s) will be pushed by the RS which now serves the region(s); they push the hlog entries of the same region concurrently, without coordination.
2281 This treatment can possibly lead to data inconsistency between source and destination clusters:
1. A put and then a delete are written to the source cluster.
2. Due to a region move or RS failure, they are pushed to the peer cluster by different replication-source threads.
3. If the delete is pushed to the peer cluster before the put, and a flush and major compaction occur in the peer cluster before the put is pushed, the delete is collected and the put remains in the peer cluster, but in the source cluster the put is masked by the delete; hence data inconsistency between the source and destination clusters.
2290 .Serial replication configuration
Set the serial flag to true for a replication peer. The default serial flag is false.
* Add a new replication peer whose serial flag is true
2298 hbase> add_peer '1', CLUSTER_KEY => "server1.cie.com:2181:/hbase", SERIAL => true
2301 * Set a replication peer's serial flag to false
2305 hbase> set_peer_serial '1', false
2308 * Set a replication peer's serial flag to true
2312 hbase> set_peer_serial '1', true
2315 The serial replication feature had been done firstly in link:https://issues.apache.org/jira/browse/HBASE-9465[HBASE-9465] and then reverted and redone in link:https://issues.apache.org/jira/browse/HBASE-20046[HBASE-20046]. You can find more details in these issues.
2317 === Verifying Replicated Data
2319 The `VerifyReplication` MapReduce job, which is included in HBase, performs a systematic comparison of replicated data between two different clusters. Run the VerifyReplication job on the master cluster, supplying it with the peer ID and table name to use for validation. You can limit the verification further by specifying a time range or specific families. The job's short name is `verifyrep`. To run the job, use a command like the following:
2323 $ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` "${HADOOP_HOME}/bin/hadoop" jar "${HBASE_HOME}/hbase-mapreduce-VERSION.jar" verifyrep --starttime=<timestamp> --endtime=<timestamp> --families=<myFam> <ID> <tableName>
2326 The `VerifyReplication` command prints out `GOODROWS` and `BADROWS` counters to indicate rows that did and did not replicate correctly.
2328 === Detailed Information About Cluster Replication
2330 .Replication Architecture Overview
2331 image::replication_overview.png[]
2333 ==== Life of a WAL Edit
2335 A single WAL edit goes through several steps in order to be replicated to a slave cluster.
2337 . An HBase client uses a Put or Delete operation to manipulate data in HBase.
. The region server writes the request to the WAL in a way that allows it to be replayed if it is not written successfully.
2339 . If the changed cell corresponds to a column family that is scoped for replication, the edit is added to the queue for replication.
2340 . In a separate thread, the edit is read from the log, as part of a batch process.
2341 Only the KeyValues that are eligible for replication are kept.
2342 Replicable KeyValues are part of a column family whose schema is scoped GLOBAL, are not part of a catalog such as `hbase:meta`, did not originate from the target slave cluster, and have not already been consumed by the target slave cluster.
2343 . The edit is tagged with the master's UUID and added to a buffer.
2344 When the buffer is filled, or the reader reaches the end of the file, the buffer is sent to a random region server on the slave cluster.
2345 . The region server reads the edits sequentially and separates them into buffers, one buffer per table.
2346 After all edits are read, each buffer is flushed using link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html[Table], HBase's normal client.
The master's UUID and the UUIDs of slaves which have already consumed the data are preserved in the edits when they are applied, in order to prevent replication loops.
2348 . In the master, the offset for the WAL that is currently being replicated is registered in ZooKeeper.
2350 . The first three steps, where the edit is inserted, are identical.
2351 . Again in a separate thread, the region server reads, filters, and edits the log edits in the same way as above.
2352 The slave region server does not answer the RPC call.
2353 . The master sleeps and tries again a configurable number of times.
. If the slave region server is still not available, the master selects a new subset of region servers to replicate to, and tries again to send the buffer of edits.
2355 . Meanwhile, the WALs are rolled and stored in a queue in ZooKeeper.
Logs that are [firstterm]_archived_ by their region server, by moving them from the region server's log directory to a central log directory, will have their paths updated in the in-memory queue of the replicating thread.
2357 . When the slave cluster is finally available, the buffer is applied in the same way as during normal processing.
2358 The master region server will then replicate the backlog of logs that accumulated during the outage.
2360 .Spreading Queue Failover Load
2361 When replication is active, a subset of region servers in the source cluster is responsible for shipping edits to the sink.
2362 This responsibility must be failed over like all other region server functions should a process or node crash.
2363 The following configuration settings are recommended for maintaining an even distribution of replication activity over the remaining live servers in the source cluster:
2365 * Set `replication.source.maxretriesmultiplier` to `300`.
2366 * Set `replication.source.sleepforretries` to `1` (1 second). This value, combined with the value of `replication.source.maxretriesmultiplier`, causes the retry cycle to last about 5 minutes.
2367 * Set `replication.sleep.before.failover` to `30000` (30 seconds) in the source cluster site configuration.
2369 [[cluster.replication.preserving.tags]]
2370 .Preserving Tags During Replication
2371 By default, the codec used for replication between clusters strips tags, such as cell-level ACLs, from cells.
2372 To prevent the tags from being stripped, you can use a different codec which does not strip them.
2373 Configure `hbase.replication.rpc.codec` to use `org.apache.hadoop.hbase.codec.KeyValueCodecWithTags`, on both the source and sink RegionServers involved in the replication.
2374 This option was introduced in link:https://issues.apache.org/jira/browse/HBASE-10322[HBASE-10322].
2376 ==== Replication Internals
2378 Replication State in ZooKeeper::
2379 HBase replication maintains its state in ZooKeeper.
2380 By default, the state is contained in the base node _/hbase/replication_.
This node contains two child nodes, the `peers` znode and the `rs` znode.
2384 The `peers` znode is stored in _/hbase/replication/peers_ by default.
2385 It consists of a list of all peer replication clusters, along with the status of each of them.
2386 The value of each peer is its cluster key, which is provided in the HBase Shell.
2387 The cluster key contains a list of ZooKeeper nodes in the cluster's quorum, the client port for the ZooKeeper quorum, and the base znode for HBase in HDFS on that cluster.
2390 The `rs` znode contains a list of WAL logs which need to be replicated.
2391 This list is divided into a set of queues organized by region server and the peer cluster the region server is shipping the logs to.
2392 The rs znode has one child znode for each region server in the cluster.
2393 The child znode name is the region server's hostname, client port, and start code.
2394 This list includes both live and dead region servers.
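As an illustration of the cluster key stored under the `peers` znode, the sketch below adds a peer from the HBase Shell; the peer id, ZooKeeper host names, and port are hypothetical:

----
hbase> add_peer '1', CLUSTER_KEY => "zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase"
----

The value recorded for peer `1` is this cluster key: the ZooKeeper quorum, its client port, and the base znode used by HBase on that cluster.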
2396 ==== Choosing Region Servers to Replicate To
When a master cluster region server initiates a replication source to a slave cluster, it first connects to the slave's ZooKeeper ensemble using the provided cluster key. It then scans the _rs/_ directory to discover all the available sinks (region servers that are accepting incoming streams of edits to replicate) and randomly chooses a subset of them using a configured ratio which has a default value of 10%. For example, if a slave cluster has 150 machines, 15 will be chosen as potential recipients for edits that this master cluster region server sends.
2399 Because this selection is performed by each master region server, the probability that all slave region servers are used is very high, and this method works for clusters of any size.
2400 For example, a master cluster of 10 machines replicating to a slave cluster of 5 machines with a ratio of 10% causes the master cluster region servers to choose one machine each at random.
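The ratio itself is configurable; assuming the `replication.source.ratio` property (the name used by recent HBase releases; verify the name and default against your version), a sketch keeping the 10% figure used in the example above:

[source,xml]
----
<!-- Fraction of the slave cluster's region servers each source considers as potential sinks -->
<property>
  <name>replication.source.ratio</name>
  <value>0.1</value>
</property>
----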
2402 A ZooKeeper watcher is placed on the _${zookeeper.znode.parent}/rs_ node of the slave cluster by each of the master cluster's region servers.
2403 This watch is used to monitor changes in the composition of the slave cluster.
2404 When nodes are removed from the slave cluster, or if nodes go down or come back up, the master cluster's region servers will respond by selecting a new pool of slave region servers to replicate to.
2406 ==== Keeping Track of Logs
2408 Each master cluster region server has its own znode in the replication znodes hierarchy.
It contains one znode per peer cluster (if 5 slave clusters, 5 znodes are created), and each of these contains a queue of WALs to process.
2410 Each of these queues will track the WALs created by that region server, but they can differ in size.
2411 For example, if one slave cluster becomes unavailable for some time, the WALs should not be deleted, so they need to stay in the queue while the others are processed.
2412 See <<rs.failover.details,rs.failover.details>> for an example.
2414 When a source is instantiated, it contains the current WAL that the region server is writing to.
2415 During log rolling, the new file is added to the queue of each slave cluster's znode just before it is made available.
This ensures that all the sources are aware that a new log exists before the region server is able to append edits into it, but this operation is now more expensive.
2417 The queue items are discarded when the replication thread cannot read more entries from a file (because it reached the end of the last block) and there are other files in the queue.
2418 This means that if a source is up to date and replicates from the log that the region server writes to, reading up to the "end" of the current file will not delete the item in the queue.
2420 A log can be archived if it is no longer used or if the number of logs exceeds `hbase.regionserver.maxlogs` because the insertion rate is faster than regions are flushed.
2421 When a log is archived, the source threads are notified that the path for that log changed.
2422 If a particular source has already finished with an archived log, it will just ignore the message.
2423 If the log is in the queue, the path will be updated in memory.
If the log is currently being replicated, the change will be done atomically so that the reader doesn't attempt to open the file when it has already been moved.
Because moving a file is a NameNode operation, if the reader is currently reading the log, it won't generate any exception.
2427 ==== Reading, Filtering and Sending Edits
2429 By default, a source attempts to read from a WAL and ship log entries to a sink as quickly as possible.
Speed is limited by the filtering of log entries: only KeyValues that are scoped GLOBAL and that do not belong to catalog tables will be retained.
2431 Speed is also limited by total size of the list of edits to replicate per slave, which is limited to 64 MB by default.
2432 With this configuration, a master cluster region server with three slaves would use at most 192 MB to store data to replicate.
2433 This does not account for the data which was filtered but not garbage collected.
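The 64 MB per-peer buffer is typically controlled by the `replication.source.size.capacity` property (name assumed here; verify against your release); a sketch stating that limit explicitly:

[source,xml]
----
<!-- Maximum size, in bytes, of edits buffered per peer before they are shipped (64 MB) -->
<property>
  <name>replication.source.size.capacity</name>
  <value>67108864</value>
</property>
----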
2435 Once the maximum size of edits has been buffered or the reader reaches the end of the WAL, the source thread stops reading and chooses at random a sink to replicate to (from the list that was generated by keeping only a subset of slave region servers). It directly issues a RPC to the chosen region server and waits for the method to return.
2436 If the RPC was successful, the source determines whether the current file has been emptied or it contains more data which needs to be read.
2437 If the file has been emptied, the source deletes the znode in the queue.
2438 Otherwise, it registers the new offset in the log's znode.
2439 If the RPC threw an exception, the source will retry 10 times before trying to find a different sink.
2443 If replication is not enabled, the master's log-cleaning thread deletes old logs using a configured TTL.
2444 This TTL-based method does not work well with replication, because archived logs which have exceeded their TTL may still be in a queue.
2445 The default behavior is augmented so that if a log is past its TTL, the cleaning thread looks up every queue until it finds the log, while caching queues it has found.
2446 If the log is not found in any queues, the log will be deleted.
2447 The next time the cleaning process needs to look for a log, it starts by using its cached list.
2449 NOTE: WALs are saved when replication is enabled or disabled as long as peers exist.
2451 [[rs.failover.details]]
2452 ==== Region Server Failover
2454 When no region servers are failing, keeping track of the logs in ZooKeeper adds no value.
2455 Unfortunately, region servers do fail, and since ZooKeeper is highly available, it is useful for managing the transfer of the queues in the event of a failure.
2457 Each of the master cluster region servers keeps a watcher on every other region server, in order to be notified when one dies (just as the master does). When a failure happens, they all race to create a znode called `lock` inside the dead region server's znode that contains its queues.
2458 The region server that creates it successfully then transfers all the queues to its own znode, one at a time since ZooKeeper does not support renaming queues.
2459 After queues are all transferred, they are deleted from the old location.
2460 The znodes that were recovered are renamed with the ID of the slave cluster appended with the name of the dead server.
2462 Next, the master cluster region server creates one new source thread per copied queue, and each of the source threads follows the read/filter/ship pattern.
2463 The main difference is that those queues will never receive new data, since they do not belong to their new region server.
2464 When the reader hits the end of the last log, the queue's znode is deleted and the master cluster region server closes that replication source.
2466 Given a master cluster with 3 region servers replicating to a single slave with id `2`, the following hierarchy represents what the znodes layout could be at some point in time.
2467 The region servers' znodes all contain a `peers` znode which contains a single queue.
2468 The znode names in the queues represent the actual file names on HDFS in the form `address,port.timestamp`.
/hbase/replication/rs/
  1.1.1.1,60020,123456780/
    2/
      1.1.1.1,60020.1234 (Contains a position)
  1.1.1.2,60020,123456790/
    2/
      1.1.1.2,60020.1214 (Contains a position)
  1.1.1.3,60020,123456630/
    2/
      1.1.1.3,60020.1280 (Contains a position)
2487 Assume that 1.1.1.2 loses its ZooKeeper session.
2488 The survivors will race to create a lock, and, arbitrarily, 1.1.1.3 wins.
2489 It will then start transferring all the queues to its local peers znode by appending the name of the dead server.
2490 Right before 1.1.1.3 is able to clean up the old znodes, the layout will look like the following:
/hbase/replication/rs/
  1.1.1.1,60020,123456780/
    2/
      1.1.1.1,60020.1234 (Contains a position)
  1.1.1.2,60020,123456790/
    2/
      1.1.1.2,60020.1214 (Contains a position)
  1.1.1.3,60020,123456630/
    2/
      1.1.1.3,60020.1280 (Contains a position)
    2-1.1.1.2,60020,123456790/
      1.1.1.2,60020.1214 (Contains a position)
2515 Some time later, but before 1.1.1.3 is able to finish replicating the last WAL from 1.1.1.2, it dies too.
2516 Some new logs were also created in the normal queues.
2517 The last region server will then try to lock 1.1.1.3's znode and will begin transferring all the queues.
2518 The new layout will be:
/hbase/replication/rs/
  1.1.1.1,60020,123456780/
    2/
      1.1.1.1,60020.1378 (Contains a position)
    2-1.1.1.3,60020,123456630/
      1.1.1.3,60020.1325 (Contains a position)
    2-1.1.1.2,60020,123456790-1.1.1.3,60020,123456630/
      1.1.1.2,60020.1312 (Contains a position)
  1.1.1.3,60020,123456630/
    2/
      1.1.1.3,60020.1325 (Contains a position)
    2-1.1.1.2,60020,123456790/
      1.1.1.2,60020.1312 (Contains a position)
2543 === Replication Metrics
2545 The following metrics are exposed at the global region server level and at the peer level:
2547 `source.sizeOfLogQueue`::
2548 number of WALs to process (excludes the one which is being processed) at the Replication source
2550 `source.shippedOps`::
2551 number of mutations shipped
2553 `source.logEditsRead`::
2554 number of mutations read from WALs at the replication source
2556 `source.ageOfLastShippedOp`::
2557 age of last batch that was shipped by the replication source
2559 `source.completedLogs`::
2560 The number of write-ahead-log files that have completed their acknowledged sending to the peer associated with this source. Increments to this metric are a part of normal operation of HBase replication.
2562 `source.completedRecoverQueues`::
2563 The number of recovery queues this source has completed sending to the associated peer. Increments to this metric are a part of normal recovery of HBase replication in the face of failed Region Servers.
2565 `source.uncleanlyClosedLogs`::
2566 The number of write-ahead-log files the replication system considered completed after reaching the end of readable entries in the face of an uncleanly closed file.
2568 `source.ignoredUncleanlyClosedLogContentsInBytes`::
When a write-ahead-log file is not closed cleanly, there will likely be some entry that has been partially serialized. This metric contains the number of bytes of such entries the HBase replication system believes were remaining at the end of files skipped in the face of an uncleanly closed file. Those bytes should either be in a different file or represent a client write that was not acknowledged.
2571 `source.restartedLogReading`::
The number of times the HBase replication system detected that it failed to correctly parse a cleanly closed write-ahead-log file. In this circumstance, the system replays the entire log from the beginning, ensuring that no edits fail to be acknowledged by the associated peer. Increments to this metric indicate that the HBase replication system is having difficulty correctly handling failures in the underlying distributed storage system. No data loss should occur, but you should check Region Server log files for details of the failures.
2574 `source.repeatedLogFileBytes`::
2575 When the HBase replication system determines that it needs to replay a given write-ahead-log file, this metric is incremented by the number of bytes the replication system believes had already been acknowledged by the associated peer prior to starting over.
2577 `source.closedLogsWithUnknownFileLength`::
Incremented when the HBase replication system believes it is at the end of a write-ahead-log file but it can not determine the length of that file in the underlying distributed storage system. Could indicate data loss since the replication system is unable to determine if the end of readable entries lines up with the expected end of the file. You should check Region Server log files for details of the failures.
2581 === Replication Configuration Options
[cols="1,1,1", options="header"]
|===
| Option | Description | Default

| zookeeper.znode.parent
| The name of the base ZooKeeper znode used for HBase
| /hbase

| zookeeper.znode.replication
| The name of the base znode used for replication
| replication

| zookeeper.znode.replication.peers
| The name of the peer znode
| peers

| zookeeper.znode.replication.peers.state
| The name of the peer-state znode
| peer-state

| zookeeper.znode.replication.rs
| The name of the rs znode
| rs

| replication.sleep.before.failover
| How many milliseconds a worker should sleep before attempting to replicate a dead region server's WAL queues.
| 30000

| replication.executor.workers
| The number of region servers a given region server should attempt to failover simultaneously.
| 1
|===
2620 === Monitoring Replication Status
2622 You can use the HBase Shell command `status 'replication'` to monitor the replication status on your cluster. The command has three variations:
2623 * `status 'replication'` -- prints the status of each source and its sinks, sorted by hostname.
2624 * `status 'replication', 'source'` -- prints the status for each replication source, sorted by hostname.
2625 * `status 'replication', 'sink'` -- prints the status for each replication sink, sorted by hostname.
2627 ==== Understanding the output
The command output will vary according to the state of replication. For example, right after a restart,
if the destination peer is not reachable, no replication source threads will be running,
so no metrics are displayed:
2637 No Reader/Shipper threads runnning yet.
2638 SINK: TimeStampStarted=1591985197350, Waiting for OPs...
Under normal circumstances, a healthy, active-active replication deployment would show output similar to the following:
2648 AgeOfLastShippedOp=0, TimeStampOfLastShippedOp=Fri Jun 12 18:49:23 BST 2020, SizeOfLogQueue=1, EditsReadFromLogQueue=1, OpsShippedToTarget=1, TimeStampOfNextToReplicate=Fri Jun 12 18:49:23 BST 2020, Replication Lag=0
2649 SINK: TimeStampStarted=1591983663458, AgeOfLastAppliedOp=0, TimeStampsOfLastAppliedOp=Fri Jun 12 18:57:18 BST 2020
2652 The definition for each of these metrics is detailed below:
[cols="1,1", options="header"]
|===
| Metric | Description

| AgeOfLastShippedOp
| How long the last successfully shipped edit took to effectively get replicated on the target.

| TimeStampOfLastShippedOp
| The actual date of the last successful edit shipment.

| SizeOfLogQueue
| Number of WAL files on this given queue.

| EditsReadFromLogQueue
| How many edits have been read from this given queue since this source thread started.

| OpsShippedToTarget
| How many edits have been shipped to the target since this source thread started.

| TimeStampOfNextToReplicate
| Date of the current edit being attempted to replicate.

| Replication Lag
| The elapsed time (in millis) since the last edit to replicate was read by this source thread and effectively replicated to the target.

| TimeStampStarted
| Date (in millis) when this Sink thread started.

| AgeOfLastAppliedOp
| How long it took to apply the last successfully shipped edit.

| TimeStampsOfLastAppliedOp
| Date of the last successfully applied edit.
|===
Growing values for `Source.AgeOfLastShippedOp` and/or
`Source.Replication Lag` would indicate replication delays. If those numbers keep going
up, while `Source.TimeStampOfLastShippedOp`, `Source.EditsReadFromLogQueue`,
`Source.OpsShippedToTarget` or `Source.TimeStampOfNextToReplicate` do not change at all,
then the replication flow is failing to progress, and there might be problems with
communication between the clusters. This could also happen if replication is manually paused
(via the hbase shell `disable_peer` command, for example), but data keeps getting ingested
into the source cluster tables.
2712 == Running Multiple Workloads On a Single Cluster
2714 HBase provides the following mechanisms for managing the performance of a cluster
2715 handling multiple workloads:
2717 . <<request_queues>>
2718 . <<multiple-typed-queues>>
2722 HBASE-11598 introduces RPC quotas, which allow you to throttle requests based on
2723 the following limits:
. <<request-quotas,The number or size of requests (read, write, or read+write) in a given timeframe>>
2726 . <<namespace_quotas,The number of tables allowed in a namespace>>
2728 These limits can be enforced for a specified user, table, or namespace.
2732 Quotas are disabled by default. To enable the feature, set the `hbase.quota.enabled`
2733 property to `true` in _hbase-site.xml_ file for all cluster nodes.
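A minimal _hbase-site.xml_ sketch enabling the feature:

[source,xml]
----
<property>
  <name>hbase.quota.enabled</name>
  <value>true</value>
</property>
----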
2735 .General Quota Syntax
. THROTTLE_TYPE can be expressed as READ, WRITE, or the default type (read + write).
2737 . Timeframes can be expressed in the following units: `sec`, `min`, `hour`, `day`
2738 . Request sizes can be expressed in the following units: `B` (bytes), `K` (kilobytes),
2739 `M` (megabytes), `G` (gigabytes), `T` (terabytes), `P` (petabytes)
2740 . Numbers of requests are expressed as an integer followed by the string `req`
2741 . Limits relating to time are expressed as req/time or size/time. For instance `10req/day`
2743 . Numbers of tables or regions are expressed as integers.
2746 .Setting Request Quotas
2747 You can set quota rules ahead of time, or you can change the throttle at runtime. The change
2748 will propagate after the quota refresh period has expired. This expiration period
2749 defaults to 5 minutes. To change it, modify the `hbase.quota.refresh.period` property
2750 in `hbase-site.xml`. This property is expressed in milliseconds and defaults to `300000`.
2753 # Limit user u1 to 10 requests per second
2754 hbase> set_quota TYPE => THROTTLE, USER => 'u1', LIMIT => '10req/sec'
2756 # Limit user u1 to 10 read requests per second
2757 hbase> set_quota TYPE => THROTTLE, THROTTLE_TYPE => READ, USER => 'u1', LIMIT => '10req/sec'
2759 # Limit user u1 to 10 M per day everywhere
2760 hbase> set_quota TYPE => THROTTLE, USER => 'u1', LIMIT => '10M/day'
2762 # Limit user u1 to 10 M write size per sec
2763 hbase> set_quota TYPE => THROTTLE, THROTTLE_TYPE => WRITE, USER => 'u1', LIMIT => '10M/sec'
2765 # Limit user u1 to 5k per minute on table t2
2766 hbase> set_quota TYPE => THROTTLE, USER => 'u1', TABLE => 't2', LIMIT => '5K/min'
2768 # Limit user u1 to 10 read requests per sec on table t2
2769 hbase> set_quota TYPE => THROTTLE, THROTTLE_TYPE => READ, USER => 'u1', TABLE => 't2', LIMIT => '10req/sec'
2771 # Remove an existing limit from user u1 on namespace ns2
2772 hbase> set_quota TYPE => THROTTLE, USER => 'u1', NAMESPACE => 'ns2', LIMIT => NONE
2774 # Limit all users to 10 requests per hour on namespace ns1
2775 hbase> set_quota TYPE => THROTTLE, NAMESPACE => 'ns1', LIMIT => '10req/hour'
2777 # Limit all users to 10 T per hour on table t1
2778 hbase> set_quota TYPE => THROTTLE, TABLE => 't1', LIMIT => '10T/hour'
2780 # Remove all existing limits from user u1
2781 hbase> set_quota TYPE => THROTTLE, USER => 'u1', LIMIT => NONE
2783 # List all quotas for user u1 in namespace ns2
hbase> list_quotas USER => 'u1', NAMESPACE => 'ns2'
2786 # List all quotas for namespace ns2
2787 hbase> list_quotas NAMESPACE => 'ns2'
2789 # List all quotas for table t1
2790 hbase> list_quotas TABLE => 't1'
2796 You can also place a global limit and exclude a user or a table from the limit by applying the
2797 `GLOBAL_BYPASS` property.
2799 hbase> set_quota NAMESPACE => 'ns1', LIMIT => '100req/min' # a per-namespace request limit
2800 hbase> set_quota USER => 'u1', GLOBAL_BYPASS => true # user u1 is not affected by the limit
2803 [[namespace_quotas]]
2804 .Setting Namespace Quotas
You can specify the maximum number of tables or regions allowed in a given namespace, either
when you create the namespace or by altering an existing namespace, by setting the
`hbase.namespace.quota.maxtables` or `hbase.namespace.quota.maxregions` property on the namespace.
2810 .Limiting Tables Per Namespace
2812 # Create a namespace with a max of 5 tables
2813 hbase> create_namespace 'ns1', {'hbase.namespace.quota.maxtables'=>'5'}
2815 # Alter an existing namespace to have a max of 8 tables
2816 hbase> alter_namespace 'ns2', {METHOD => 'set', 'hbase.namespace.quota.maxtables'=>'8'}
2818 # Show quota information for a namespace
2819 hbase> describe_namespace 'ns2'
2821 # Alter an existing namespace to remove a quota
2822 hbase> alter_namespace 'ns2', {METHOD => 'unset', NAME=>'hbase.namespace.quota.maxtables'}
2825 .Limiting Regions Per Namespace
2827 # Create a namespace with a max of 10 regions
hbase> create_namespace 'ns1', {'hbase.namespace.quota.maxregions'=>'10'}
2830 # Show quota information for a namespace
2831 hbase> describe_namespace 'ns1'
# Alter an existing namespace to have a max of 20 regions
2834 hbase> alter_namespace 'ns2', {METHOD => 'set', 'hbase.namespace.quota.maxregions'=>'20'}
2836 # Alter an existing namespace to remove a quota
2837 hbase> alter_namespace 'ns2', {METHOD => 'unset', NAME=> 'hbase.namespace.quota.maxregions'}
2842 If no throttling policy is configured, when the RegionServer receives multiple requests,
2843 they are now placed into a queue waiting for a free execution slot (HBASE-6721).
2844 The simplest queue is a FIFO queue, where each request waits for all previous requests in the queue
2845 to finish before running. Fast or interactive queries can get stuck behind large requests.
2847 If you are able to guess how long a request will take, you can reorder requests by
2848 pushing the long requests to the end of the queue and allowing short requests to preempt
2849 them. Eventually, you must still execute the large requests and prioritize the new
2850 requests behind them. The short requests will be newer, so the result is not terrible,
2851 but still suboptimal compared to a mechanism which allows large requests to be split
2852 into multiple smaller ones.
2854 HBASE-10993 introduces such a system for deprioritizing long-running scanners. There
2855 are two types of queues, `fifo` and `deadline`. To configure the type of queue used,
2856 configure the `hbase.ipc.server.callqueue.type` property in `hbase-site.xml`. There
2857 is no way to estimate how long each request may take, so de-prioritization only affects
2858 scans, and is based on the number of “next” calls a scan request has made. An assumption
2859 is made that when you are doing a full table scan, your job is not likely to be interactive,
2860 so if there are concurrent requests, you can delay long-running scans up to a limit tunable by
2861 setting the `hbase.ipc.server.queue.max.call.delay` property. The slope of the delay is calculated
2862 by a simple square root of `(numNextCall * weight)` where the weight is
2863 configurable by setting the `hbase.ipc.server.scan.vtime.weight` property.
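A sketch of the related _hbase-site.xml_ settings; the delay and weight values below are purely illustrative, not recommendations:

[source,xml]
----
<!-- Use the deadline queue so long-running scans can be de-prioritized -->
<property>
  <name>hbase.ipc.server.callqueue.type</name>
  <value>deadline</value>
</property>
<!-- Upper bound, in milliseconds, on how long a scan call may be delayed (illustrative value) -->
<property>
  <name>hbase.ipc.server.queue.max.call.delay</name>
  <value>5000</value>
</property>
<!-- Weight used in sqrt(numNextCall * weight) when computing the delay (illustrative value) -->
<property>
  <name>hbase.ipc.server.scan.vtime.weight</name>
  <value>1.0</value>
</property>
----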
2865 [[multiple-typed-queues]]
2866 === Multiple-Typed Queues
2868 You can also prioritize or deprioritize different kinds of requests by configuring
2869 a specified number of dedicated handlers and queues. You can segregate the scan requests
2870 in a single queue with a single handler, and all the other available queues can service
2871 short `Get` requests.
2873 You can adjust the IPC queues and handlers based on the type of workload, using static
2874 tuning options. This approach is an interim first step that will eventually allow
2875 you to change the settings at runtime, and to dynamically adjust values based on the load.
To avoid contention and separate different kinds of requests, configure the
`hbase.ipc.server.callqueue.handler.factor` property, which allows you to increase the number of
queues and control how many handlers can share the same queue.
2884 Using more queues reduces contention when adding a task to a queue or selecting it
2885 from a queue. You can even configure one queue per handler. The trade-off is that
2886 if some queues contain long-running tasks, a handler may need to wait to execute from that queue
2887 rather than stealing from another queue which has waiting tasks.
2889 .Read and Write Queues
2890 With multiple queues, you can now divide read and write requests, giving more priority
2891 (more queues) to one or the other type. Use the `hbase.ipc.server.callqueue.read.ratio`
2892 property to choose to serve more reads or more writes.
2894 .Get and Scan Queues
2895 Similar to the read/write split, you can split gets and scans by tuning the `hbase.ipc.server.callqueue.scan.ratio`
2896 property to give more priority to gets or to scans. A scan ratio of `0.1` will give
2897 more queue/handlers to the incoming gets, which means that more gets can be processed
2898 at the same time and that fewer scans can be executed at the same time. A value of
2899 `0.9` will give more queue/handlers to scans, so the number of scans executed will
2900 increase and the number of gets will decrease.
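Putting the three knobs together, a sketch of an _hbase-site.xml_ excerpt; the ratios are illustrative only and should be derived from your own workload:

[source,xml]
----
<!-- 0.1 => roughly one call queue per ten handlers -->
<property>
  <name>hbase.ipc.server.callqueue.handler.factor</name>
  <value>0.1</value>
</property>
<!-- 0.6 => more of the queues/handlers are dedicated to reads than to writes -->
<property>
  <name>hbase.ipc.server.callqueue.read.ratio</name>
  <value>0.6</value>
</property>
<!-- 0.2 => of the read queues, a fifth serve scans and the rest serve gets -->
<property>
  <name>hbase.ipc.server.callqueue.scan.ratio</name>
  <value>0.2</value>
</property>
----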
2905 link:https://issues.apache.org/jira/browse/HBASE-16961[HBASE-16961] introduces a new type of
2906 quotas for HBase to leverage: filesystem quotas. These "space" quotas limit the amount of space
2907 on the filesystem that HBase namespaces and tables can consume. If a user, malicious or ignorant,
2908 has the ability to write data into HBase, with enough time, that user can effectively crash HBase
2909 (or worse HDFS) by consuming all available space. When there is no filesystem space available,
2910 HBase crashes because it can no longer create/sync data to the write-ahead log.
This feature allows a limit to be set on the size of a table or namespace. When a space quota is set
2913 on a namespace, the quota's limit applies to the sum of usage of all tables in that namespace.
2914 When a table with a quota exists in a namespace with a quota, the table quota takes priority
2915 over the namespace quota. This allows for a scenario where a large limit can be placed on
2916 a collection of tables, but a single table in that collection can have a fine-grained limit set.
2918 The existing `set_quota` and `list_quota` HBase shell commands can be used to interact with
2919 space quotas. Space quotas are quotas with a `TYPE` of `SPACE` and have `LIMIT` and `POLICY`
2920 attributes. The `LIMIT` is a string that refers to the amount of space on the filesystem
2921 that the quota subject (e.g. the table or namespace) may consume. For example, valid values
2922 of `LIMIT` are `'10G'`, `'2T'`, or `'256M'`. The `POLICY` refers to the action that HBase will
2923 take when the quota subject's usage exceeds the `LIMIT`. The following are valid `POLICY` values.
2925 * `NO_INSERTS` - No new data may be written (e.g. `Put`, `Increment`, `Append`).
2926 * `NO_WRITES` - Same as `NO_INSERTS` but `Deletes` are also disallowed.
2927 * `NO_WRITES_COMPACTIONS` - Same as `NO_WRITES` but compactions are also disallowed.
2928 ** This policy only prevents user-submitted compactions. System can still run compactions.
2929 * `DISABLE` - The table(s) are disabled, preventing all read/write access.
2931 .Setting simple space quotas
2933 # Sets a quota on the table 't1' with a limit of 1GB, disallowing Puts/Increments/Appends when the table exceeds 1GB
2934 hbase> set_quota TYPE => SPACE, TABLE => 't1', LIMIT => '1G', POLICY => NO_INSERTS
2936 # Sets a quota on the namespace 'ns1' with a limit of 50TB, disallowing Puts/Increments/Appends/Deletes
2937 hbase> set_quota TYPE => SPACE, NAMESPACE => 'ns1', LIMIT => '50T', POLICY => NO_WRITES
2939 # Sets a quota on the table 't3' with a limit of 2TB, disallowing any writes and compactions when the table exceeds 2TB.
2940 hbase> set_quota TYPE => SPACE, TABLE => 't3', LIMIT => '2T', POLICY => NO_WRITES_COMPACTIONS
2942 # Sets a quota on the table 't2' with a limit of 50GB, disabling the table when it exceeds 50GB
2943 hbase> set_quota TYPE => SPACE, TABLE => 't2', LIMIT => '50G', POLICY => DISABLE
Consider the following scenario to set up quotas on a namespace, overriding the quota on tables in that namespace:
2948 .Table and Namespace space quotas
2950 hbase> create_namespace 'ns1'
2951 hbase> create 'ns1:t1'
2952 hbase> create 'ns1:t2'
2953 hbase> create 'ns1:t3'
2954 hbase> set_quota TYPE => SPACE, NAMESPACE => 'ns1', LIMIT => '100T', POLICY => NO_INSERTS
2955 hbase> set_quota TYPE => SPACE, TABLE => 'ns1:t2', LIMIT => '200G', POLICY => NO_WRITES
2956 hbase> set_quota TYPE => SPACE, TABLE => 'ns1:t3', LIMIT => '20T', POLICY => NO_WRITES
2959 In the above scenario, the tables in the namespace `ns1` will not be allowed to consume more than
2960 100TB of space on the filesystem among each other. The table 'ns1:t2' is only allowed to be 200GB in size, and will
2961 disallow all writes when the usage exceeds this limit. The table 'ns1:t3' is allowed to grow to 20TB in size
and also will disallow all writes when the usage exceeds this limit. Because there is no table quota
on 'ns1:t1', this table can grow up to 100TB, but only if 'ns1:t2' and 'ns1:t3' have a usage of zero bytes.
Practically, its limit is 100TB less the current usage of 'ns1:t2' and 'ns1:t3'.
2966 [[ops.space.quota.deletion]]
2967 === Disabling Automatic Space Quota Deletion
2969 By default, if a table or namespace is deleted that has a space quota, the quota itself is
2970 also deleted. In some cases, it may be desirable for the space quota to not be automatically deleted.
2971 In these cases, the user may configure the system to not delete any space quota automatically via hbase-site.xml.
<property>
  <name>hbase.quota.remove.on.table.delete</name>
  <value>false</value>
</property>
2982 The value is set to `true` by default.
2984 === HBase Snapshots with Space Quotas
2986 One common area of unintended-filesystem-use with HBase is via HBase snapshots. Because snapshots
2987 exist outside of the management of HBase tables, it is not uncommon for administrators to suddenly
2988 realize that hundreds of gigabytes or terabytes of space is being used by HBase snapshots which were
2989 forgotten and never removed.
2991 link:https://issues.apache.org/jira/browse/HBASE-17748[HBASE-17748] is the umbrella JIRA issue which
2992 expands on the original space quota functionality to also include HBase snapshots. While this is a confusing
subject, the implementation attempts to present this support in as reasonable and simple a manner as
2994 possible for administrators. This feature does not make any changes to administrator interaction with
2995 space quotas, only in the internal computation of table/namespace usage. Table and namespace usage will
2996 automatically incorporate the size taken by a snapshot per the rules defined below.
2998 As a review, let's cover a snapshot's lifecycle: a snapshot is metadata which points to
2999 a list of HFiles on the filesystem. This is why creating a snapshot is a very cheap operation; no HBase
3000 table data is actually copied to perform a snapshot. Cloning a snapshot into a new table or restoring
3001 a table is a cheap operation for the same reason; the new table references the files which already exist
3002 on the filesystem without a copy. To include snapshots in space quotas, we need to define which table
3003 "owns" a file when a snapshot references the file ("owns" refers to encompassing the filesystem usage
3006 Consider a snapshot which was made against a table. When the snapshot refers to a file and the table no
3007 longer refers to that file, the "originating" table "owns" that file. When multiple snapshots refer to
3008 the same file and no table refers to that file, the snapshot with the lowest-sorting name (lexicographically)
3009 is chosen and the table which that snapshot was created from "owns" that file. HFiles are not "double-counted"
when a table and one or more snapshots refer to that HFile.
3012 When a table is "rematerialized" (via `clone_snapshot` or `restore_snapshot`), a similar problem of file
3013 ownership arises. In this case, while the rematerialized table references a file which a snapshot also
3014 references, the table does not "own" the file. The table from which the snapshot was created still "owns"
3015 that file. When the rematerialized table is compacted or the snapshot is deleted, the rematerialized table
3016 will uniquely refer to a new file and "own" the usage of that file. Similarly, when a table is duplicated via a snapshot
3017 and `restore_snapshot`, the new table will not consume any quota size until the original table stops referring
3018 to the files, either due to a compaction on the original table, a compaction on the new table, or the
3019 original table being deleted.
3021 One new HBase shell command was added to inspect the computed sizes of each snapshot in an HBase instance.
3024 hbase> list_snapshot_sizes
3032 There are two broad strategies for performing HBase backups: backing up with a full cluster shutdown, and backing up on a live cluster.
3033 Each approach has pros and cons.
3035 For additional information, see link:http://blog.sematext.com/2011/03/11/hbase-backup-options/[HBase Backup
3036 Options] over on the Sematext Blog.
3038 [[ops.backup.fullshutdown]]
3039 === Full Shutdown Backup
Some environments can tolerate a periodic full shutdown of their HBase cluster, for example if it is being used as a back-end analytic capacity and not serving front-end web-pages.
The benefits are that the NameNode/Master and RegionServers are down, so there is no chance of missing any in-flight changes to either StoreFiles or metadata.
3043 The obvious con is that the cluster is down.
3046 [[ops.backup.fullshutdown.stop]]
3051 [[ops.backup.fullshutdown.distcp]]
Distcp could be used to copy the contents of the HBase directory in HDFS either to the same cluster in another directory, or to a different cluster.
3056 Note: Distcp works in this situation because the cluster is down and there are no in-flight edits to files.
3057 Distcp-ing of files in the HBase directory is not generally recommended on a live cluster.
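A sketch of such a copy with `hadoop distcp`, run only while the cluster is fully shut down; the NameNode addresses and backup path are hypothetical:

[source,bash]
----
# Copy the quiescent HBase root directory to a backup location on another cluster
$ hadoop distcp hdfs://prod-nn:8020/hbase hdfs://backup-nn:8020/hbase-backup
----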
3059 [[ops.backup.fullshutdown.restore]]
3060 ==== Restore (if needed)
3062 The backup of the hbase directory from HDFS is copied onto the 'real' hbase directory via distcp.
3063 The act of copying these files creates new HDFS metadata, which is why a restore of the NameNode edits from the time of the HBase backup isn't required for this kind of restore, because it's a restore (via distcp) of a specific HDFS directory (i.e., the HBase part) not the entire HDFS file-system.
3065 [[ops.backup.live.replication]]
3066 === Live Cluster Backup - Replication
3068 This approach assumes that there is a second cluster.
3069 See the HBase page on link:https://hbase.apache.org/book.html#_cluster_replication[replication] for more information.
3071 [[ops.backup.live.copytable]]
3072 === Live Cluster Backup - CopyTable
3074 The <<copy.table,copytable>> utility could either be used to copy data from one table to another on the same cluster, or to copy data to another table on another cluster.
3076 Since the cluster is up, there is a risk that edits could be missed in the copy process.
3078 [[ops.backup.live.export]]
3079 === Live Cluster Backup - Export
3081 The <<export,export>> approach dumps the content of a table to HDFS on the same cluster.
3082 To restore the data, the <<import,import>> utility would be used.
Since the cluster is up, there is a risk that edits could be missed in the export process. If you want to know more about HBase backup and restore, see the page on link:http://hbase.apache.org/book.html#backuprestore[Backup and Restore].
HBase Snapshots allow you to take a copy of a table (both contents and metadata) with a very small performance impact. A Snapshot is an immutable
3090 collection of table metadata and a list of HFiles that comprised the table at the time the Snapshot was taken. A "clone"
3091 of a snapshot creates a new table from that snapshot, and a "restore" of a snapshot returns the contents of a table to
3092 what it was when the snapshot was created. The "clone" and "restore" operations do not require any data to be copied,
3093 as the underlying HFiles (the files which contain the data for an HBase table) are not modified with either action.
Similarly, exporting a snapshot to another cluster has little impact on RegionServers of the local cluster.
Prior to version 0.94.6, the only way to back up or to clone a table was to use CopyTable/ExportTable, or to copy all the hfiles in HDFS after disabling the table.
The disadvantages of these methods are that you can degrade region server performance (Copy/Export Table) or you need to disable the table, which means no reads or writes; this is usually unacceptable.
3099 [[ops.snapshots.configuration]]
To turn on snapshot support, just set the `hbase.snapshot.enabled` property to true.
3103 (Snapshots are enabled by default in 0.95+ and off by default in 0.94.6+)
<property>
  <name>hbase.snapshot.enabled</name>
  <value>true</value>
</property>
3114 [[ops.snapshots.takeasnapshot]]
3117 You can take a snapshot of a table regardless of whether it is enabled or disabled.
3118 The snapshot operation doesn't involve any data copying.
3123 hbase> snapshot 'myTable', 'myTableSnapshot-122112'
3126 .Take a Snapshot Without Flushing
3127 The default behavior is to perform a flush of data in memory before the snapshot is taken.
3128 This means that data in memory is included in the snapshot.
3129 In most cases, this is the desired behavior.
However, if your set-up can tolerate data in memory being excluded from the snapshot, you can use the `SKIP_FLUSH` option of the `snapshot` command to disable flushing while taking the snapshot.
3133 hbase> snapshot 'mytable', 'snapshot123', {SKIP_FLUSH => true}
WARNING: There is no way to determine or predict whether a concurrent insert or update will be included in a given snapshot, whether flushing is enabled or disabled.
3137 A snapshot is only a representation of a table during a window of time.
3138 The amount of time the snapshot operation will take to reach each Region Server may vary from a few seconds to a minute, depending on the resource load and speed of the hardware or network, among other factors.
3139 There is also no way to know whether a given insert or update is in memory or has been flushed.
3142 .Take a Snapshot With TTL
3143 Snapshots have a lifecycle that is independent from the table from which they are created.
Although data in a table may be stored with TTL, the data files containing them become
frozen by the snapshot. Space consumed by expired cells will not be reclaimed by normal
table housekeeping like compaction. While this is expected, it can be inconvenient at scale.
When many snapshots are under management and the data in various tables is expired by
TTL, some notion of an optional TTL (and optional default TTL) for snapshots could be useful.
3152 hbase> snapshot 'mytable', 'snapshot1234', {TTL => 86400}
The above command creates snapshot `snapshot1234` with a TTL of 86400 seconds (24 hours),
and hence the snapshot is expected to be cleaned up after 24 hours.
3160 .Default Snapshot TTL:
3161 - User specified default TTL with config `hbase.master.snapshot.ttl`
3162 - FOREVER if `hbase.master.snapshot.ttl` is not set
3164 While creating a snapshot, if TTL in seconds is not explicitly specified, the above logic will be
3165 followed to determine the TTL. If no configs are changed, the default behavior is that all snapshots
3166 will be retained forever (until manual deletion). If a different default TTL behavior is desired,
3167 `hbase.master.snapshot.ttl` can be set to a default TTL in seconds. Any snapshot created without
3168 an explicit TTL will take this new value.
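For example, a sketch of an _hbase-site.xml_ entry giving snapshots a default TTL of one day:

[source,xml]
----
<!-- Snapshots created without an explicit TTL expire after 86400 seconds (24 hours) -->
<property>
  <name>hbase.master.snapshot.ttl</name>
  <value>86400</value>
</property>
----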
NOTE: If `hbase.master.snapshot.ttl` is set, a snapshot with an explicit {TTL => 0} or
{TTL => -1} will also take this value. In this case, a TTL < -1 (such as {TTL => -2}) should be used
to indicate FOREVER.
3174 To summarize concisely,
3176 1. Snapshot with TTL value < -1 will stay forever regardless of any server side config changes (until deleted manually by user).
3177 2. Snapshot with TTL value > 0 will be deleted automatically soon after TTL expires.
3178 3. Snapshot created without specifying TTL will always have TTL value represented by config `hbase.master.snapshot.ttl`. Default value of this config is 0, which represents: keep the snapshot forever (until deleted manually by user).
4. From the client side, TTL value 0 or -1 should never be explicitly provided, because it will be treated the same as a snapshot without TTL (same as point 3 above) and hence will use the TTL represented by the config `hbase.master.snapshot.ttl`.
3181 .Take a snapshot with custom MAX_FILESIZE
3183 Optionally, snapshots can be created with a custom max file size configuration that will be
3184 used by cloned tables, instead of the global `hbase.hregion.max.filesize` configuration property.
3185 This is mostly useful when exporting snapshots between different clusters. If the HBase cluster where
3186 the snapshot is originally taken has a much larger value set for `hbase.hregion.max.filesize` than
3187 one or more clusters where the snapshot is being exported to, a storm of region splits may occur when
restoring the snapshot on destination clusters. Specifying `MAX_FILESIZE` in the properties passed to
the `snapshot` command will save the informed value into the table's `MAX_FILESIZE`
descriptor at snapshot creation time. If the table already defines a `MAX_FILESIZE` descriptor,
this property is ignored and has no effect.
3194 snapshot 'table01', 'snap01', {MAX_FILESIZE => 21474836480}
3197 .Enable/Disable Snapshot Auto Cleanup on running cluster:
By default, snapshot auto cleanup based on TTL is enabled
for any new cluster.
At any point in time, if snapshot cleanup is supposed to be stopped due to
some snapshot restore activity or any other reason, it is advisable
to disable it using the shell command:
3206 hbase> snapshot_cleanup_switch false
3209 We can re-enable it using:
3212 hbase> snapshot_cleanup_switch true
The shell command with switch false disables snapshot auto
cleanup activity based on TTL and returns the previous state of
the activity (true: running already, false: disabled already).
A sample output for the above commands:
3221 Previous snapshot cleanup state : true
We can query whether snapshot auto cleanup is enabled for the cluster using:
3230 hbase> snapshot_cleanup_enabled
3233 The command would return output in true/false.
3235 [[ops.snapshots.list]]
3236 === Listing Snapshots
3238 List all snapshots taken (by printing the names and relative information).
3243 hbase> list_snapshots
3246 [[ops.snapshots.delete]]
3247 === Deleting Snapshots
3249 You can remove a snapshot, and the files retained for that snapshot will be removed if no longer needed.
3254 hbase> delete_snapshot 'myTableSnapshot-122112'
3257 [[ops.snapshots.clone]]
3258 === Clone a table from snapshot
3260 From a snapshot you can create a new table (clone operation) with the same data that you had when the snapshot was taken.
The clone operation doesn't involve data copies, and a change to the cloned table doesn't impact the snapshot or the original table.
3266 hbase> clone_snapshot 'myTableSnapshot-122112', 'myNewTestTable'
3269 [[ops.snapshots.restore]]
3270 === Restore a snapshot
3272 The restore operation requires the table to be disabled, and the table will be restored to the state at the time when the snapshot was taken, changing both data and schema if required.
3277 hbase> disable 'myTable'
3278 hbase> restore_snapshot 'myTableSnapshot-122112'
3281 NOTE: Since Replication works at log level and snapshots at file-system level, after a restore, the replicas will be in a different state from the master.
3282 If you want to use restore, you need to stop replication and redo the bootstrap.
3284 In case of partial data-loss due to misbehaving client, instead of a full restore that requires the table to be disabled, you can clone the table from the snapshot and use a Map-Reduce job to copy the data that you need, from the clone to the main one.
3286 [[ops.snapshots.acls]]
3287 === Snapshots operations and ACLs
3289 If you are using security with the AccessController Coprocessor (See <<hbase.accesscontrol.configuration,hbase.accesscontrol.configuration>>), only a global administrator can take, clone, or restore a snapshot, and these actions do not capture the ACL rights.
3290 This means that restoring a table preserves the ACL rights of the existing table, while cloning a table creates a new table that has no ACL rights until the administrator adds them.
3292 [[ops.snapshots.export]]
3293 === Export to another cluster
3295 The ExportSnapshot tool copies all the data related to a snapshot (hfiles, logs, snapshot metadata) to another cluster.
3296 The tool executes a Map-Reduce job, similar to distcp, to copy files between the two clusters, and since it works at file-system level the hbase cluster does not have to be online.
To copy a snapshot called MySnapshot to an HBase cluster srv2 (hdfs://srv2:8082/hbase) using 16 mappers:
3302 $ bin/hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot MySnapshot -copy-to hdfs://srv2:8082/hbase -mappers 16
3305 .Limiting Bandwidth Consumption
3306 You can limit the bandwidth consumption when exporting a snapshot, by specifying the `-bandwidth` parameter, which expects an integer representing megabytes per second.
3307 The following example limits the above example to 200 MB/sec.
3311 $ bin/hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot MySnapshot -copy-to hdfs://srv2:8082/hbase -mappers 16 -bandwidth 200
3315 === Storing Snapshots in an Amazon S3 Bucket
3317 You can store and retrieve snapshots from Amazon S3, using the following procedure.
3319 NOTE: You can also store snapshots in Microsoft Azure Blob Storage. See <<snapshots_azure>>.
3322 - You must be using HBase 1.0 or higher and Hadoop 2.6.1 or higher, which is the first
3323 configuration that uses the Amazon AWS SDK.
3324 - You must use the `s3a://` protocol to connect to Amazon S3. The older `s3n://`
3325 and `s3://` protocols have various limitations and do not use the Amazon AWS SDK.
3326 - The `s3a://` URI must be configured and available on the server where you run
3327 the commands to export and restore the snapshot.
3329 After you have fulfilled the prerequisites, take the snapshot like you normally would.
3330 Afterward, you can export it using the `org.apache.hadoop.hbase.snapshot.ExportSnapshot`
3331 command like the one below, substituting your own `s3a://` path in the `copy-from`
3332 or `copy-to` directive and substituting or modifying other options as required:
3335 $ hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
3336 -snapshot MySnapshot \
3337 -copy-from hdfs://srv2:8082/hbase \
3338 -copy-to s3a://<bucket>/<namespace>/hbase \
3346 $ hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
-snapshot MySnapshot \
3348 -copy-from s3a://<bucket>/<namespace>/hbase \
3349 -copy-to hdfs://srv2:8082/hbase \
3356 You can also use the `org.apache.hadoop.hbase.snapshot.SnapshotInfo` utility with the `s3a://` path by including the
3357 `-remote-dir` option.
3360 $ hbase org.apache.hadoop.hbase.snapshot.SnapshotInfo \
3361 -remote-dir s3a://<bucket>/<namespace>/hbase \
3366 == Storing Snapshots in Microsoft Azure Blob Storage
You can store snapshots in Microsoft Azure Blob Storage using the same techniques
3369 as in <<snapshots_s3>>.
3372 - You must be using HBase 1.2 or higher with Hadoop 2.7.1 or
3373 higher. No version of HBase supports Hadoop 2.7.0.
3374 - Your hosts must be configured to be aware of the Azure blob storage filesystem.
3375 See https://hadoop.apache.org/docs/r2.7.1/hadoop-azure/index.html.
3377 After you meet the prerequisites, follow the instructions
in <<snapshots_s3>>, replacing the protocol specifier with `wasb://` or `wasbs://`.
3381 == Capacity Planning and Region Sizing
3383 There are several considerations when planning the capacity for an HBase cluster and performing the initial configuration.
3384 Start with a solid understanding of how HBase handles data internally.
3386 [[ops.capacity.nodes]]
3387 === Node count and hardware/VM configuration
3389 [[ops.capacity.nodes.datasize]]
3390 ==== Physical data size
3392 Physical data size on disk is distinct from logical size of your data and is affected by the following:
3394 * Increased by HBase overhead
3396 * See <<keyvalue,keyvalue>> and <<keysize,keysize>>.
3397 At least 24 bytes per key-value (cell), can be more.
3398 Small keys/values means more relative overhead.
3399 * KeyValue instances are aggregated into blocks, which are indexed.
3400 Indexes also have to be stored.
3401 Blocksize is configurable on a per-ColumnFamily basis.
3402 See <<regions.arch,regions.arch>>.
3404 * Decreased by <<compression,compression>> and data block encoding, depending on data.
3405 You might want to test what compression and encoding (if any) make sense for your data.
3406 * Increased by size of region server <<wal,wal>> (usually fixed and negligible - less than half of RS memory size, per RS).
3407 * Increased by HDFS replication - usually x3.
3409 Aside from the disk space necessary to store the data, one RS may not be able to serve arbitrarily large amounts of data due to some practical limits on region count and size (see <<ops.capacity.regions,ops.capacity.regions>>).
3411 [[ops.capacity.nodes.throughput]]
3412 ==== Read/Write throughput
3414 Number of nodes can also be driven by required throughput for reads and/or writes.
3415 The throughput one can get per node depends a lot on data (esp.
3416 key/value sizes) and request patterns, as well as node and system configuration.
3417 Planning should be done for peak load if it is likely that the load would be the main driver of the increase of the node count.
3418 PerformanceEvaluation and <<ycsb,ycsb>> tools can be used to test single node or a test cluster.
3420 For write, usually 5-15Mb/s per RS can be expected, since every region server has only one active WAL.
3421 There's no good estimate for reads, as it depends vastly on data, requests, and cache hit rate. <<perf.casestudy,perf.casestudy>> might be helpful.
3423 [[ops.capacity.nodes.gc]]
3424 ==== JVM GC limitations
3426 RS cannot currently utilize very large heap due to cost of GC.
3427 There's also no good way of running multiple RS-es per server (other than running several VMs per machine). Thus, ~20-24Gb or less memory dedicated to one RS is recommended.
3428 GC tuning is required for large heap sizes.
3429 See <<gcpause,gcpause>>, <<trouble.log.gc,trouble.log.gc>> and elsewhere (TODO: where?)
3431 [[ops.capacity.regions]]
3432 === Determining region count and size
Generally, fewer regions make for a smoother running cluster (you can always manually split the big regions later (if necessary) to spread the data, or request load, over the cluster); 20-200 regions per RS is a reasonable range.
The number of regions cannot be configured directly (unless you go for fully <<disable.splitting,disable.splitting>>); adjust the region size to achieve the target region count given the table size.
3437 When configuring regions for multiple tables, note that most region settings can be set on a per-table basis via link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HTableDescriptor.html[HTableDescriptor], as well as shell commands.
3438 These settings will override the ones in `hbase-site.xml`.
3439 That is useful if your tables have different workloads/use cases.
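For example, a sketch of overriding the region size for a single (hypothetical) table from the shell:

----
# Use 10 GB regions for this table only; the value is in bytes
hbase> alter 'mytable', MAX_FILESIZE => '10737418240'
----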
3441 Also note that in the discussion of region sizes here, _HDFS replication factor is not (and should not be) taken into account, whereas
3442 other factors <<ops.capacity.nodes.datasize,ops.capacity.nodes.datasize>> should be._ So, if your data is compressed and replicated 3 ways by HDFS, "9 Gb region" means 9 Gb of compressed data.
3443 HDFS replication factor only affects your disk usage and is invisible to most HBase code.
3445 ==== Viewing the Current Number of Regions
3447 You can view the current number of regions for a given table using the HMaster UI.
3448 In the [label]#Tables# section, the number of online regions for each table is listed in the [label]#Online Regions# column.
3449 This total only includes the in-memory state and does not include disabled or offline regions.
3451 [[ops.capacity.regions.count]]
3452 ==== Number of regions per RS - upper bound
3454 In production scenarios, where you have a lot of data, you are normally concerned with the maximum number of regions you can have per server. <<too_many_regions,too many regions>> has technical discussion on the subject.
3455 Basically, the maximum number of regions is mostly determined by memstore memory usage.
3456 Each region has its own memstores; these grow up to a configurable size; usually in 128-256 MB range, see <<hbase.hregion.memstore.flush.size,hbase.hregion.memstore.flush.size>>.
3457 One memstore exists per column family (so there's only one per region if there's one CF in the table). The RS dedicates some fraction of total memory to its memstores (see <<hbase.regionserver.global.memstore.size,hbase.regionserver.global.memstore.size>>). If this memory is exceeded (too much memstore usage), it can cause undesirable consequences such as unresponsive server or compaction storms.
3458 A good starting point for the number of regions per RS (assuming one table) is:
3462 ((RS memory) * (total memstore fraction)) / ((memstore size)*(# column families))
3465 This formula is pseudo-code.
3466 Here are two formulas using the actual tunable parameters, first for HBase 0.98+ and second for HBase 0.94.x.
3470 ((RS Xmx) * hbase.regionserver.global.memstore.size) / (hbase.hregion.memstore.flush.size * (# column families))
((RS Xmx) * hbase.regionserver.global.memstore.upperLimit) / (hbase.hregion.memstore.flush.size * (# column families))
If a given RegionServer has 16 GB of RAM, with default settings, the formula works out to 16384*0.4/128 ~ 51 regions per RS as a starting point.
3478 The formula can be extended to multiple tables; if they all have the same configuration, just use the total number of families.
3480 This number can be adjusted; the formula above assumes all your regions are filled at approximately the same rate.
3481 If only a fraction of your regions are going to be actively written to, you can divide the result by that fraction to get a larger region count.
3482 Then, even if all regions are written to, all region memstores are not filled evenly, and eventually jitter appears even if they are (due to limited number of concurrent flushes). Thus, one can have as many as 2-3 times more regions than the starting point; however, increased numbers carry increased risk.
3484 For write-heavy workload, memstore fraction can be increased in configuration at the expense of block cache; this will also allow one to have more regions.
3486 [[ops.capacity.regions.mincount]]
3487 ==== Number of regions per RS - lower bound
3489 HBase scales by having regions across many servers.
Thus if you have 2 regions for 16GB data, on a 20 node cluster your data will be concentrated on just a few machines - nearly the entire cluster will be idle.
3491 This really can't be stressed enough, since a common problem is loading 200MB data into HBase and then wondering why your awesome 10 node cluster isn't doing anything.
3493 On the other hand, if you have a very large amount of data, you may also want to go for a larger number of regions to avoid having regions that are too large.
3495 [[ops.capacity.regions.size]]
3496 ==== Maximum region size
3498 For large tables in production scenarios, maximum region size is mostly limited by compactions - very large compactions, esp.
3499 major, can degrade cluster performance.
3500 Currently, the recommended maximum region size is 10-20Gb, and 5-10Gb is optimal.
3501 For older 0.90.x codebase, the upper-bound of regionsize is about 4Gb, with a default of 256Mb.
3503 The size at which the region is split into two is generally configured via <<hbase.hregion.max.filesize,hbase.hregion.max.filesize>>; for details, see <<arch.region.splits,arch.region.splits>>.
When starting off, if you cannot estimate the size of your tables well, it's probably best to stick to the default region size, perhaps going smaller for hot tables (or manually splitting hot regions to spread the load over the cluster), or going with larger region sizes if your cell sizes tend to be large (100k and up).
In HBase 0.98, an experimental stripe compactions feature was added that allows for larger regions, especially for log data.
3508 See <<ops.stripe,ops.stripe>>.
3510 [[ops.capacity.regions.total]]
3511 ==== Total data size per region server
According to the above numbers for region size and number of regions per region server, in an optimistic estimate 10 GB x 100 regions per RS gives up to 1 TB served per region server, which is in line with some of the reported multi-PB use cases.
However, it is important to think about the data vs cache size ratio at the RS level.
With 1 TB of data per server and a 10 GB block cache, only 1% of the data will be cached, which may barely cover all block indices.
3517 [[ops.capacity.config]]
3518 === Initial configuration and tuning
3520 First, see <<important_configurations,important configurations>>.
3521 Note that some configurations, more than others, depend on specific scenarios.
3522 Pay special attention to:
3524 * <<hbase.regionserver.handler.count,hbase.regionserver.handler.count>> - request handler thread count, vital for high-throughput workloads.
* <<config.wals,config.wals>> - the blocking number of WAL files depends on your memstore configuration and should be set accordingly to prevent potential blocking when doing a high volume of writes.
3527 Then, there are some considerations when setting up your cluster and tables.
3529 [[ops.capacity.config.compactions]]
3532 Depending on read/write volume and latency requirements, optimal compaction settings may be different.
3533 See <<compaction,compaction>> for some details.
3535 When provisioning for large data sizes, however, it's good to keep in mind that compactions can affect write throughput.
Thus, for write-intensive workloads, you may opt for less frequent compactions and more store files per region.
The minimum number of files for compactions (`hbase.hstore.compaction.min`) can be set to a higher value; <<hbase.hstore.blockingStoreFiles,hbase.hstore.blockingStoreFiles>> should also be increased, as more files might accumulate in that case.
3538 You may also consider manually managing compactions: <<managed.compactions,managed.compactions>>
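
For example, a write-intensive cluster might raise both of these settings in _hbase-site.xml_; the values below are illustrative only and should be tuned to your workload.

[source,xml]
----
<!-- Illustrative values for a write-intensive workload; tune to your cluster. -->
<property>
  <name>hbase.hstore.compaction.min</name>
  <value>6</value>
</property>
<property>
  <name>hbase.hstore.blockingStoreFiles</name>
  <value>24</value>
</property>
----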
3540 [[ops.capacity.config.presplit]]
3541 ==== Pre-splitting the table
3543 Based on the target number of the regions per RS (see <<ops.capacity.regions.count,ops.capacity.regions.count>>) and number of RSes, one can pre-split the table at creation time.
3544 This would both avoid some costly splitting as the table starts to fill up, and ensure that the table starts out already distributed across many servers.
3546 If the table is expected to grow large enough to justify that, at least one region per RS should be created.
3547 It is not recommended to split immediately into the full target number of regions (e.g.
3548 50 * number of RSes), but a low intermediate value can be chosen.
3549 For multiple tables, it is recommended to be conservative with presplitting (e.g.
3550 pre-split 1 region per RS at most), especially if you don't know how much each table will grow.
3551 If you split too much, you may end up with too many regions, with some tables having too many small regions.
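
As a sketch, pre-splitting can also be done at table-creation time through the Java Admin API; the table name, column family, split algorithm, and region count below are illustrative (the shell `create` command with `SPLITS`, or the RegionSplitter tool, achieve the same).

[source,java]
----
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.RegionSplitter;

void createPresplitTable(Admin admin, int numRegionServers) throws IOException {
  TableDescriptor desc = TableDescriptorBuilder
      .newBuilder(TableName.valueOf("usertable"))              // illustrative table name
      .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf")) // illustrative family
      .build();
  // Start with roughly one region per RegionServer. HexStringSplit spreads
  // hex-string row keys evenly; choose a split algorithm that matches your keys.
  byte[][] splits = new RegionSplitter.HexStringSplit().split(numRegionServers);
  admin.createTable(desc, splits);
}
----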
3553 For pre-splitting howto, see <<manual_region_splitting_decisions,manual region splitting decisions>> and <<precreate.regions,precreate.regions>>.
In HBase versions 0.90.x and earlier, we had a simple script that would rename the hdfs table directory and then edit the hbase:meta table, replacing all mentions of the old table name with the new one.
3559 The script was called `./bin/rename_table.rb`.
3560 The script was deprecated and removed mostly because it was unmaintained and the operation performed by the script was brutal.
As of HBase 0.94.x, you can use the snapshot facility to rename a table.
3563 Here is how you would do it using the hbase shell:
3566 hbase shell> disable 'tableName'
3567 hbase shell> snapshot 'tableName', 'tableSnapshot'
3568 hbase shell> clone_snapshot 'tableSnapshot', 'newTableName'
3569 hbase shell> delete_snapshot 'tableSnapshot'
3570 hbase shell> drop 'tableName'
3573 or in code it would be as follows:
void rename(Admin admin, TableName oldTableName, TableName newTableName) throws IOException {
  // Any unique, valid snapshot name will do.
  String snapshotName = "snapshot-" + System.currentTimeMillis();
  admin.disableTable(oldTableName);
  admin.snapshot(snapshotName, oldTableName);
  admin.cloneSnapshot(snapshotName, newTableName);
  admin.deleteSnapshot(snapshotName);
  admin.deleteTable(oldTableName);
}
3588 == RegionServer Grouping
3589 RegionServer Grouping (A.K.A `rsgroup`) is an advanced feature for
3590 partitioning regionservers into distinctive groups for strict isolation. It
3591 should only be used by users who are sophisticated enough to understand the
3592 full implications and have a sufficient background in managing HBase clusters.
3593 It was developed by Yahoo! and they run it at scale on their large grid cluster.
3594 See link:http://www.slideshare.net/HBaseCon/keynote-apache-hbase-at-yahoo-scale[HBase at Yahoo! Scale].
3596 RSGroups can be defined and managed with both admin methods and shell commands.
A server can be added to a group with a hostname and port pair, and tables
can be moved to this group so that only regionservers in the same rsgroup can
host the regions of the table. The group for a table is stored in its
TableDescriptor; the property name is `hbase.rsgroup.name`. You can also set
this property on a namespace, which causes all the tables under that
namespace to be placed into this group. RegionServers and tables can only
3603 belong to one rsgroup at a time. By default, all tables and regionservers
3604 belong to the `default` rsgroup. System tables can also be put into a
3605 rsgroup using the regular APIs. A custom balancer implementation tracks
3606 assignments per rsgroup and makes sure to move regions to the relevant
3607 regionservers in that rsgroup. The rsgroup information is stored in a regular
3608 HBase table, and a zookeeper-based read-only cache is used at cluster bootstrap
3611 To enable, add the following to your hbase-site.xml and restart your Master:
<property>
  <name>hbase.balancer.rsgroup.enabled</name>
  <value>true</value>
</property>
3621 Then use the admin/shell _rsgroup_ methods/commands to create and manipulate
3622 RegionServer groups: e.g. to add a rsgroup and then add a server to it.
3623 To see the list of rsgroup commands available in the hbase shell type:
3627 hbase(main):008:0> help 'rsgroup'
At a high level, you create an rsgroup other than the `default` group using the
_add_rsgroup_ command. You then add servers and tables to this group with the
_move_servers_rsgroup_ and _move_tables_rsgroup_ commands. If necessary, run a
balance for the group with the _balance_rsgroup_ command if tables are slow to
migrate to the group's dedicated servers (usually this is not needed). To
monitor the effect of the commands, see the `Tables` tab toward the end of the
Master UI home page. If you click on a table, you can see what servers it is
deployed across. You should see here a reflection of the grouping done with
your shell commands. View the master log if there are issues.
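
The shell commands named above also have Java `Admin` counterparts. The sketch below assumes the rsgroup methods added to `Admin` in recent HBase releases; the group, host, port, and table names are illustrative, so check the `Admin` javadoc for your version's exact signatures.

[source,java]
----
import java.io.IOException;
import java.util.Collections;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.net.Address;

void setUpGroup(Admin admin) throws IOException {
  admin.addRSGroup("my_group");                                  // add_rsgroup 'my_group'
  admin.moveServersToRSGroup(                                    // move_servers_rsgroup
      Collections.singleton(Address.fromParts("k.att.net", 51129)), "my_group");
  admin.setRSGroup(                                              // move_tables_rsgroup
      Collections.singleton(TableName.valueOf("t1")), "my_group");
}
----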
Here is an example using a few of the rsgroup commands. To add a group, do as follows:
3646 hbase(main):008:0> add_rsgroup 'my_group'
3651 .RegionServer Groups must be Enabled
If you have not enabled the rsgroup feature and you call any of the rsgroup
admin methods or shell commands, the call will fail with a
`DoNotRetryIOException` with a detail message that says the rsgroup feature is disabled.
3660 Add a server (specified by hostname + port) to the just-made group using the
3661 _move_servers_rsgroup_ command as follows:
3665 hbase(main):010:0> move_servers_rsgroup 'my_group',['k.att.net:51129']
3668 .Hostname and Port vs ServerName
3671 The rsgroup feature refers to servers in a cluster with hostname and port only.
3672 It does not make use of the HBase ServerName type identifying RegionServers;
3673 i.e. hostname + port + starttime to distinguish RegionServer instances. The
3674 rsgroup feature keeps working across RegionServer restarts so the starttime of
3675 ServerName -- and hence the ServerName type -- is not appropriate.
Servers come and go over the lifetime of a cluster. Currently, you must
manually align the servers referenced in rsgroups with the actual state of
nodes in the running cluster. What we mean by this is that if you decommission
a server, then you must update rsgroups as part of your server decommission
process, removing references. Note that calling `clearDeadServers` manually
will also remove the dead servers from any rsgroups, but the problem is that
we will lose track of the dead servers after the master restarts, which means
you still need to update the rsgroups on your own.
Please use `Admin.removeServersFromRSGroup` or the shell command
_remove_servers_rsgroup_ to remove decommissioned servers from rsgroups.
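
For example, a minimal sketch using that Admin method (the hostname and port are illustrative):

[source,java]
----
// Uses java.util.Collections and org.apache.hadoop.hbase.net.Address.
admin.removeServersFromRSGroup(
    Collections.singleton(Address.fromParts("rs1.example.com", 16020)));
----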
3691 The `default` group is not like other rsgroups in that it is dynamic. Its server
3692 list mirrors the current state of the cluster; i.e. if you shutdown a server that
3693 was part of the `default` rsgroup, and then do a _get_rsgroup_ `default` to list
3694 its content in the shell, the server will no longer be listed. For non-default
groups, though a node may be offline, it will persist in the non-default group’s
3696 list of servers. But if you move the offline server from the non-default rsgroup
3697 to default, it will not show in the `default` list. It will just be dropped.
3700 The authors of the rsgroup feature, the Yahoo! HBase Engineering team, have been
3701 running it on their grid for a good while now and have come up with a few best
3702 practices informed by their experience.
3704 ==== Isolate System Tables
Either have a system rsgroup where all the system tables are, or just leave the
system tables in the `default` rsgroup and have all user-space tables in
non-default rsgroups.
Yahoo! have found it useful at their scale to keep a special rsgroup of dead or
3711 questionable nodes; this is one means of keeping them out of the running until repair.
3713 Be careful replacing dead nodes in an rsgroup. Ensure there are enough live nodes
3714 before you start moving out the dead. Move in good live nodes first if you have to.
3717 Viewing the Master log will give you insight on rsgroup operation.
3719 If it appears stuck, restart the Master process.
3721 === Remove RegionServer Grouping
Disabling the RegionServer Grouping feature is easy: just remove
`hbase.balancer.rsgroup.enabled` from hbase-site.xml or explicitly set it to
false in hbase-site.xml.
<property>
  <name>hbase.balancer.rsgroup.enabled</name>
  <value>false</value>
</property>
But if you set `hbase.balancer.rsgroup.enabled` back to true, the old rsgroup
configs will take effect again. So if you want to completely remove the
RegionServer Grouping feature from a cluster, so that the old metadata will not
affect the functioning of the cluster if the feature is re-enabled in the
future, there are a few more steps to perform.
3740 - Move all tables in non-default rsgroups to `default` regionserver group
3743 #Reassigning table t1 from non default group - hbase shell
3744 hbase(main):005:0> move_tables_rsgroup 'default',['t1']
3746 - Move all regionservers in non-default rsgroups to `default` regionserver group
3749 #Reassigning all the servers in the non-default rsgroup to default - hbase shell
3750 hbase(main):008:0> move_servers_rsgroup 'default',['rs1.xxx.com:16206','rs2.xxx.com:16202','rs3.xxx.com:16204']
- Remove all non-default rsgroups. The implicitly created `default` rsgroup does not have to be removed
3755 #removing non default rsgroup - hbase shell
3756 hbase(main):009:0> remove_rsgroup 'group2'
3758 - Remove the changes made in `hbase-site.xml` and restart the cluster
3759 - Drop the table `hbase:rsgroup` from `hbase`
3762 #Through hbase shell drop table hbase:rsgroup
3763 hbase(main):001:0> disable 'hbase:rsgroup'
3764 0 row(s) in 2.6270 seconds
3766 hbase(main):002:0> drop 'hbase:rsgroup'
3767 0 row(s) in 1.2730 seconds
3769 - Remove znode `rsgroup` from the cluster ZooKeeper using zkCli.sh
3772 #From ZK remove the node /hbase/rsgroup through zkCli.sh
3777 To enable ACL, add the following to your hbase-site.xml and restart your Master:
<property>
  <name>hbase.security.authorization</name>
  <value>true</value>
</property>
3786 [[migrating.rsgroup]]
3787 === Migrating From Old Implementation
The coprocessor `org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint` is
deprecated, but for compatibility, if you want a pre-3.0.0 hbase client/shell
to communicate with the new hbase cluster, you still need to add this
coprocessor to the master.
The `hbase.rsgroup.grouploadbalancer.class` config has been deprecated, as now
the top level load balancer will always be `RSGroupBasedLoadBalancer`, and the
`hbase.master.loadbalancer.class` config is for configuring the balancer within
a group. This also means you should not set `hbase.master.loadbalancer.class`
to `RSGroupBasedLoadBalancer` any more, even if the rsgroup feature is enabled.
We have also made some special changes for compatibility. First, if the coprocessor
`org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint` is specified, the
`hbase.balancer.rsgroup.enabled` flag will be set to true automatically to
enable the rsgroup feature. Second, we will load
`hbase.rsgroup.grouploadbalancer.class` in preference to
`hbase.master.loadbalancer.class`. And last, if you do not set
`hbase.rsgroup.grouploadbalancer.class` but only set
`hbase.master.loadbalancer.class` to `RSGroupBasedLoadBalancer`, we will load
the default load balancer to avoid infinite nesting. This means you do not need
to change anything when upgrading if you have already enabled the rsgroup feature.
The main difference compared to the old implementation is that the rsgroup for
a table is now stored in `TableDescriptor` instead of in `RSGroupInfo`, so the
`getTables` method of `RSGroupInfo` has been deprecated. If you use the `Admin`
methods to get the `RSGroupInfo`, its `getTables` method will always return
empty. This is because in the old implementation this method was a bit broken:
you could set an rsgroup on a namespace and thereby place all the tables under
that namespace into the group, but you could not get these tables through
`RSGroupInfo.getTables`. Now you should use the two new `Admin` methods,
`listTablesInRSGroup` and `getConfiguredNamespacesAndTablesInRSGroup`, to get
the tables and namespaces in an rsgroup.
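
A minimal sketch of the two replacement methods (the group name is illustrative; consult the `Admin` javadoc of your version for the exact return types):

[source,java]
----
// Tables placed in the group via their TableDescriptor or their namespace:
List<TableName> tables = admin.listTablesInRSGroup("my_group");
// Namespaces and tables that explicitly configure this group:
System.out.println(admin.getConfiguredNamespacesAndTablesInRSGroup("my_group"));
----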
Of course, the behavior of the old RSGroupAdminEndpoint is not changed: we will
fill in the tables field of the RSGroupInfo before returning it, to keep it
compatible with the old hbase client/shell.
When upgrading, the migration from RSGroupInfo to TableDescriptor will be done
automatically. It will take some time, but it is fine to restart the master in
the middle; the migration will continue after the restart. During the
migration, the rsgroup feature will still work, and in most cases regions will
not be misplaced (since this is a one-time job that does not last long, we have
not tested it rigorously enough to guarantee that regions are never misplaced,
hence 'in most cases'). The implementation is a bit tricky; you can see the
code in `RSGroupInfoManagerImpl.migrate` if you are interested.
3840 == Region Normalizer
The Region Normalizer tries to make all regions in a table about the same
size. It does this by first calculating total table size and average size per
region. It splits any region that is larger than twice this size. Any region
that is much smaller is merged into an adjacent region. The Normalizer runs on
a regular schedule, which is configurable. It can also be disabled entirely via
a runtime "switch". It can be run manually via the shell or an Admin API call.
Even if it is normally disabled, it is good to run it manually after the cluster
has been running a while, or after a burst of activity such as a large delete.
3851 The Normalizer works well for bringing a table's region boundaries into
3852 alignment with the reality of data distribution after an initial effort at
pre-splitting a table. It is also a nice complement to the data TTL feature
3854 when the schema includes timestamp in the rowkey, as it will automatically
3855 merge away regions whose contents have expired.
3857 (The bulk of the below detail was copied wholesale from the blog by Romil Choksi at
3858 link:https://community.hortonworks.com/articles/54987/hbase-region-normalizer.html[HBase Region Normalizer]).
The Region Normalizer is a feature available since HBase 1.2. It runs a set of
pre-calculated merge/split actions to resize regions that are either too
large or too small compared to the average region size for a given table. When
invoked, the Region Normalizer computes a normalization 'plan' for all of the
tables in HBase. System tables (such as hbase:meta, hbase:namespace, Phoenix
system tables, etc.) and user tables with normalization disabled are ignored
while computing the plan. For normalization-enabled tables, the normalization
plan is carried out in parallel across multiple tables.
The Normalizer can be enabled or disabled globally for the entire cluster using the
`normalizer_switch` command in the HBase shell. Normalization can also be
controlled on a per-table basis, and it is disabled by default when a table is
created. Normalization for a table can be enabled or disabled by setting the
NORMALIZATION_ENABLED table attribute to true or false.
3875 To check normalizer status and enable/disable normalizer
3879 hbase(main):001:0> normalizer_enabled
3881 0 row(s) in 0.4870 seconds
3883 hbase(main):002:0> normalizer_switch false
3885 0 row(s) in 0.0640 seconds
3887 hbase(main):003:0> normalizer_enabled
3889 0 row(s) in 0.0120 seconds
3891 hbase(main):004:0> normalizer_switch true
3893 0 row(s) in 0.0200 seconds
3895 hbase(main):005:0> normalizer_enabled
3897 0 row(s) in 0.0090 seconds
When enabled, the Normalizer is invoked in the background every 5 minutes (by
default), which can be configured using `hbase.normalizer.period` in
`hbase-site.xml`. The Normalizer can also be invoked manually/programmatically
at will using the HBase shell's `normalize` command. HBase by default uses
`SimpleRegionNormalizer`, but users can design their own normalizer as long as
they implement the RegionNormalizer interface. Details about the logic used by
`SimpleRegionNormalizer` to compute its normalization plan can be found
link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/master/normalizer/SimpleRegionNormalizer.html[here].
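
The cluster-wide switch and the manual invocation described above are also exposed through the Java `Admin` API; a minimal sketch:

[source,java]
----
boolean previouslyEnabled = admin.normalizerSwitch(true);  // normalizer_switch true
boolean enabled = admin.isNormalizerEnabled();             // normalizer_enabled
boolean ran = admin.normalize();                           // normalize; false if disabled
----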
The below example shows a normalization plan being computed for a user table, and
a merge action being taken as a result of the normalization plan computed by
SimpleRegionNormalizer.

Consider a user table with some pre-split regions, having 3 equally large regions
(about 100K rows) and 1 relatively small region (about 25K rows). Following is the
snippet from an hbase:meta table scan showing each of the pre-split regions for
the table:
3917 table_p8ddpd6q5z,,1469494305548.68b9892220865cb6048 column=info:regioninfo, timestamp=1469494306375, value={ENCODED => 68b9892220865cb604809c950d1adf48, NAME => 'table_p8ddpd6q5z,,1469494305548.68b989222 09c950d1adf48. 0865cb604809c950d1adf48.', STARTKEY => '', ENDKEY => '1'}
3919 table_p8ddpd6q5z,1,1469494317178.867b77333bdc75a028 column=info:regioninfo, timestamp=1469494317848, value={ENCODED => 867b77333bdc75a028bb4c5e4b235f48, NAME => 'table_p8ddpd6q5z,1,1469494317178.867b7733 bb4c5e4b235f48. 3bdc75a028bb4c5e4b235f48.', STARTKEY => '1', ENDKEY => '3'}
3921 table_p8ddpd6q5z,3,1469494328323.98f019a753425e7977 column=info:regioninfo, timestamp=1469494328486, value={ENCODED => 98f019a753425e7977ab8636e32deeeb, NAME => 'table_p8ddpd6q5z,3,1469494328323.98f019a7 ab8636e32deeeb. 53425e7977ab8636e32deeeb.', STARTKEY => '3', ENDKEY => '7'}
3923 table_p8ddpd6q5z,7,1469494339662.94c64e748979ecbb16 column=info:regioninfo, timestamp=1469494339859, value={ENCODED => 94c64e748979ecbb166f6cc6550e25c6, NAME => 'table_p8ddpd6q5z,7,1469494339662.94c64e74 6f6cc6550e25c6. 8979ecbb166f6cc6550e25c6.', STARTKEY => '7', ENDKEY => '8'}
3925 table_p8ddpd6q5z,8,1469494339662.6d2b3f5fd1595ab8e7 column=info:regioninfo, timestamp=1469494339859, value={ENCODED => 6d2b3f5fd1595ab8e7c031876057b1ee, NAME => 'table_p8ddpd6q5z,8,1469494339662.6d2b3f5f c031876057b1ee. d1595ab8e7c031876057b1ee.', STARTKEY => '8', ENDKEY => ''}
Invoking the normalizer using ‘normalize’ in the HBase shell, the below log snippet
3928 from HMaster log shows the normalization plan computed as per the logic defined for
3929 SimpleRegionNormalizer. Since the total region size (in MB) for the adjacent smallest
3930 regions in the table is less than the average region size, the normalizer computes a
3931 plan to merge these two regions.
3934 2016-07-26 07:08:26,928 DEBUG [B.fifo.QRpcServer.handler=20,queue=2,port=20000] master.HMaster: Skipping normalization for table: hbase:namespace, as it's either system table or doesn't have auto
3935 normalization turned on
3936 2016-07-26 07:08:26,928 DEBUG [B.fifo.QRpcServer.handler=20,queue=2,port=20000] master.HMaster: Skipping normalization for table: hbase:backup, as it's either system table or doesn't have auto normalization turned on
3937 2016-07-26 07:08:26,928 DEBUG [B.fifo.QRpcServer.handler=20,queue=2,port=20000] master.HMaster: Skipping normalization for table: hbase:meta, as it's either system table or doesn't have auto normalization turned on
3938 2016-07-26 07:08:26,928 DEBUG [B.fifo.QRpcServer.handler=20,queue=2,port=20000] master.HMaster: Skipping normalization for table: table_h2osxu3wat, as it's either system table or doesn't have autonormalization turned on
3939 2016-07-26 07:08:26,928 DEBUG [B.fifo.QRpcServer.handler=20,queue=2,port=20000] normalizer.SimpleRegionNormalizer: Computing normalization plan for table: table_p8ddpd6q5z, number of regions: 5
3940 2016-07-26 07:08:26,929 DEBUG [B.fifo.QRpcServer.handler=20,queue=2,port=20000] normalizer.SimpleRegionNormalizer: Table table_p8ddpd6q5z, total aggregated regions size: 12
3941 2016-07-26 07:08:26,929 DEBUG [B.fifo.QRpcServer.handler=20,queue=2,port=20000] normalizer.SimpleRegionNormalizer: Table table_p8ddpd6q5z, average region size: 2.4
3942 2016-07-26 07:08:26,929 INFO [B.fifo.QRpcServer.handler=20,queue=2,port=20000] normalizer.SimpleRegionNormalizer: Table table_p8ddpd6q5z, small region size: 0 plus its neighbor size: 0, less thanthe avg size 2.4, merging them
3943 2016-07-26 07:08:26,971 INFO [B.fifo.QRpcServer.handler=20,queue=2,port=20000] normalizer.MergeNormalizationPlan: Executing merging normalization plan: MergeNormalizationPlan{firstRegion={ENCODED=> d51df2c58e9b525206b1325fd925a971, NAME => 'table_p8ddpd6q5z,,1469514755237.d51df2c58e9b525206b1325fd925a971.', STARTKEY => '', ENDKEY => '1'}, secondRegion={ENCODED => e69c6b25c7b9562d078d9ad3994f5330, NAME => 'table_p8ddpd6q5z,1,1469514767669.e69c6b25c7b9562d078d9ad3994f5330.',
3944 STARTKEY => '1', ENDKEY => '3'}}
As per its computed plan, the region normalizer merged the region with start key as ‘’
and end key as ‘1’ with another region having start key as ‘1’ and end key as ‘3’.
Now that these regions have been merged, we see a single new region with start key
as ‘’ and end key as ‘3’:
3951 table_p8ddpd6q5z,,1469516907210.e06c9b83c4a252b130e column=info:mergeA, timestamp=1469516907431,
3952 value=PBUF\x08\xA5\xD9\x9E\xAF\xE2*\x12\x1B\x0A\x07default\x12\x10table_p8ddpd6q5z\x1A\x00"\x011(\x000\x00 ea74d246741ba. 8\x00
3953 table_p8ddpd6q5z,,1469516907210.e06c9b83c4a252b130e column=info:mergeB, timestamp=1469516907431,
3954 value=PBUF\x08\xB5\xBA\x9F\xAF\xE2*\x12\x1B\x0A\x07default\x12\x10table_p8ddpd6q5z\x1A\x011"\x013(\x000\x0 ea74d246741ba. 08\x00
3955 table_p8ddpd6q5z,,1469516907210.e06c9b83c4a252b130e column=info:regioninfo, timestamp=1469516907431, value={ENCODED => e06c9b83c4a252b130eea74d246741ba, NAME => 'table_p8ddpd6q5z,,1469516907210.e06c9b83c ea74d246741ba. 4a252b130eea74d246741ba.', STARTKEY => '', ENDKEY => '3'}
3957 table_p8ddpd6q5z,3,1469514778736.bf024670a847c0adff column=info:regioninfo, timestamp=1469514779417, value={ENCODED => bf024670a847c0adffb74b2e13408b32, NAME => 'table_p8ddpd6q5z,3,1469514778736.bf024670 b74b2e13408b32. a847c0adffb74b2e13408b32.' STARTKEY => '3', ENDKEY => '7'}
3959 table_p8ddpd6q5z,7,1469514790152.7c5a67bc755e649db2 column=info:regioninfo, timestamp=1469514790312, value={ENCODED => 7c5a67bc755e649db22f49af6270f1e1, NAME => 'table_p8ddpd6q5z,7,1469514790152.7c5a67bc 2f49af6270f1e1. 755e649db22f49af6270f1e1.', STARTKEY => '7', ENDKEY => '8'}
3961 table_p8ddpd6q5z,8,1469514790152.58e7503cda69f98f47 column=info:regioninfo, timestamp=1469514790312, value={ENCODED => 58e7503cda69f98f4755178e74288c3a, NAME => 'table_p8ddpd6q5z,8,1469514790152.58e7503c 55178e74288c3a. da69f98f4755178e74288c3a.', STARTKEY => '8', ENDKEY => ''}
A similar example can be seen for a user table with 3 smaller regions and 1
relatively large region. For this example, we have a user table with 1 large region containing 100K rows, and 3 relatively smaller regions with about 33K rows each. As seen from the normalization plan, since the larger region is more than twice the average region size, it ends up being split into two regions: one with start key as ‘1’ and end key as ‘154717’, and the other region with start key as ‘154717’ and end key as ‘3’.
3967 2016-07-26 07:39:45,636 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] master.HMaster: Skipping normalization for table: hbase:backup, as it's either system table or doesn't have auto normalization turned on
3968 2016-07-26 07:39:45,636 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SimpleRegionNormalizer: Computing normalization plan for table: table_p8ddpd6q5z, number of regions: 4
3969 2016-07-26 07:39:45,636 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SimpleRegionNormalizer: Table table_p8ddpd6q5z, total aggregated regions size: 12
3970 2016-07-26 07:39:45,636 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SimpleRegionNormalizer: Table table_p8ddpd6q5z, average region size: 3.0
3971 2016-07-26 07:39:45,636 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SimpleRegionNormalizer: No normalization needed, regions look good for table: table_p8ddpd6q5z
3972 2016-07-26 07:39:45,636 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SimpleRegionNormalizer: Computing normalization plan for table: table_h2osxu3wat, number of regions: 5
3973 2016-07-26 07:39:45,636 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SimpleRegionNormalizer: Table table_h2osxu3wat, total aggregated regions size: 7
3974 2016-07-26 07:39:45,636 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SimpleRegionNormalizer: Table table_h2osxu3wat, average region size: 1.4
3975 2016-07-26 07:39:45,636 INFO [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SimpleRegionNormalizer: Table table_h2osxu3wat, large region table_h2osxu3wat,1,1469515926544.27f2fdbb2b6612ea163eb6b40753c3db. has size 4, more than twice avg size, splitting
3976 2016-07-26 07:39:45,640 INFO [B.fifo.QRpcServer.handler=7,queue=1,port=20000] normalizer.SplitNormalizationPlan: Executing splitting normalization plan: SplitNormalizationPlan{regionInfo={ENCODED => 27f2fdbb2b6612ea163eb6b40753c3db, NAME => 'table_h2osxu3wat,1,1469515926544.27f2fdbb2b6612ea163eb6b40753c3db.', STARTKEY => '1', ENDKEY => '3'}, splitPoint=null}
3977 2016-07-26 07:39:45,656 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] master.HMaster: Skipping normalization for table: hbase:namespace, as it's either system table or doesn't have auto normalization turned on
3978 2016-07-26 07:39:45,656 DEBUG [B.fifo.QRpcServer.handler=7,queue=1,port=20000] master.HMaster: Skipping normalization for table: hbase:meta, as it's either system table or doesn't
3979 have auto normalization turned on …..…..….
3980 2016-07-26 07:39:46,246 DEBUG [AM.ZK.Worker-pool2-t278] master.RegionStates: Onlined 54de97dae764b864504704c1c8d3674a on hbase-test-rc-5.openstacklocal,16020,1469419333913 {ENCODED => 54de97dae764b864504704c1c8d3674a, NAME => 'table_h2osxu3wat,1,1469518785661.54de97dae764b864504704c1c8d3674a.', STARTKEY => '1', ENDKEY => '154717'}
3981 2016-07-26 07:39:46,246 INFO [AM.ZK.Worker-pool2-t278] master.RegionStates: Transition {d6b5625df331cfec84dce4f1122c567f state=SPLITTING_NEW, ts=1469518786246, server=hbase-test-rc-5.openstacklocal,16020,1469419333913} to {d6b5625df331cfec84dce4f1122c567f state=OPEN, ts=1469518786246,
3982 server=hbase-test-rc-5.openstacklocal,16020,1469419333913}
3983 2016-07-26 07:39:46,246 DEBUG [AM.ZK.Worker-pool2-t278] master.RegionStates: Onlined d6b5625df331cfec84dce4f1122c567f on hbase-test-rc-5.openstacklocal,16020,1469419333913 {ENCODED => d6b5625df331cfec84dce4f1122c567f, NAME => 'table_h2osxu3wat,154717,1469518785661.d6b5625df331cfec84dce4f1122c567f.', STARTKEY => '154717', ENDKEY => '3'}
3988 [[auto_reopen_regions]]
3989 == Auto Region Reopen
We can leak store reader references if a coprocessor or core function somehow
opens a scanner, or wraps one, and then does not take care to call close on the
scanner or the wrapped instance. Leaked store files cannot be removed even
after they are invalidated via compaction.
3995 A reasonable mitigation for a reader reference
3996 leak would be a fast reopen of the region on the same server.
3997 This will release all resources, like the refcount, leases, etc.
The clients should gracefully ride over this like any other region in
transition.
By default, this auto region reopen feature is disabled.
To enable it, please provide a high reference count value for the config
`hbase.regions.recovery.store.file.ref.count`.
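
For example, in _hbase-site.xml_ (the values below are illustrative; the feature stays off until the ref count is set to a positive value):

[source,xml]
----
<property>
  <name>hbase.regions.recovery.store.file.ref.count</name>
  <value>256</value> <!-- illustrative threshold; the feature is disabled by default -->
</property>
<property>
  <name>hbase.master.regions.recovery.check.interval</name>
  <value>1200000</value> <!-- how often the master checks, in milliseconds -->
</property>
----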
4004 Please refer to config descriptions for
4005 `hbase.master.regions.recovery.check.interval` and
4006 `hbase.regions.recovery.store.file.ref.count`.