////
/**
 *
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
////

= Troubleshooting and Debugging Apache HBase

[[trouble.general]]
== General Guidelines

Always start with the master log (TODO: Which lines?). Normally it's just printing the same lines over and over again.
If not, then there's an issue.
Google or link:http://search-hadoop.com[search-hadoop.com] should return some hits for the exceptions you're seeing.

An error rarely comes alone in Apache HBase: when something goes wrong, what follows is often hundreds of exceptions and stack traces coming from all over the place.
The best way to approach this type of problem is to walk the log back to where it all began. For example, one trick with RegionServers is that they will print some metrics when aborting, so grepping for _Dump_ should get you near the start of the problem.
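
For example, a minimal sketch of walking a RegionServer log back to the abort (the log path and line numbers are illustrative; substitute your own user and hostname):

[source,bourne]
----
# Find where the aborting RegionServer dumped its metrics; the lines
# just before the match are usually where the trouble started.
grep -n "Dump" $HBASE_HOME/logs/hbase-<user>-regionserver-<hostname>.log

# Then print the context around the reported line number (placeholders).
sed -n '12300,12400p' $HBASE_HOME/logs/hbase-<user>-regionserver-<hostname>.log
----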

RegionServer suicides are 'normal', as this is what they do when something goes wrong.
For example, if ulimit and max transfer threads (the two most important initial settings, see <<ulimit>> and <<dfs.datanode.max.transfer.threads>>) aren't changed, at some point the DataNodes will be unable to create new threads, which from the HBase point of view looks as if HDFS were gone.
Think about what would happen if your MySQL database was suddenly unable to access files on your local file system; it's the same with HBase and HDFS.
Another very common reason to see RegionServers committing seppuku is when they enter prolonged garbage collection pauses that last longer than the default ZooKeeper session timeout.
For more information on GC pauses, see the link:https://blog.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/[3 part blog post] by Todd Lipcon and <<gcpause>> above.

[[trouble.log]]
== Logs

The key process logs are as follows (replace <user> with the user that started the service, and <hostname> with the machine name):

NameNode: _$HADOOP_HOME/logs/hadoop-<user>-namenode-<hostname>.log_

DataNode: _$HADOOP_HOME/logs/hadoop-<user>-datanode-<hostname>.log_

JobTracker: _$HADOOP_HOME/logs/hadoop-<user>-jobtracker-<hostname>.log_

TaskTracker: _$HADOOP_HOME/logs/hadoop-<user>-tasktracker-<hostname>.log_

HMaster: _$HBASE_HOME/logs/hbase-<user>-master-<hostname>.log_

RegionServer: _$HBASE_HOME/logs/hbase-<user>-regionserver-<hostname>.log_

[[trouble.log.locations]]
=== Log Locations

For stand-alone deployments the logs are obviously going to be on a single machine, but this is a development configuration only.
Production deployments need to run on a cluster.

[[trouble.log.locations.namenode]]
==== NameNode

The NameNode log is on the NameNode server.
The HBase Master is typically run on the NameNode server, as well as ZooKeeper.

For smaller clusters the JobTracker/ResourceManager is typically run on the NameNode server as well.

[[trouble.log.locations.datanode]]
==== DataNode

Each DataNode server will have a DataNode log for HDFS, as well as a RegionServer log for HBase.

Additionally, each DataNode server will also have a TaskTracker/NodeManager log for MapReduce task execution.

[[trouble.log.levels]]
=== Log Levels

==== Enabling RPC-level logging

Enabling the RPC-level logging on a RegionServer can often give insight on timings at the server.
Once enabled, the amount of log spewed is voluminous.
It is not recommended that you leave this logging on for more than short bursts of time.
To enable RPC-level logging, browse to the RegionServer UI and click on _Log Level_.
Set the log level to `DEBUG` for the package `org.apache.hadoop.ipc` (that's right, for `hadoop.ipc`, NOT `hbase.ipc`). Then tail the RegionServer's log.

To disable, set the logging level back to `INFO`.
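
The _Log Level_ page is backed by the standard Hadoop `logLevel` servlet, so the same change can be sketched from the command line (the hostname and port are placeholders; adjust to your RegionServer's UI address):

[source,bourne]
----
# Turn RPC-level DEBUG logging on (note: hadoop.ipc, not hbase.ipc)...
curl "http://regionserver.example.com:16030/logLevel?log=org.apache.hadoop.ipc&level=DEBUG"

# ...and back off again when done.
curl "http://regionserver.example.com:16030/logLevel?log=org.apache.hadoop.ipc&level=INFO"
----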

=== JVM Garbage Collection Logs

HBase is memory intensive, and using the default GC you can see long pauses in all threads, including the _Juliet Pause_, aka "GC of Death". To help debug this, or to confirm it is happening, GC logging can be turned on in the Java virtual machine.

To enable, in _hbase-env.sh_, uncomment one of the below lines:

[source,bourne]
----
# This enables basic gc logging to the .out file.
# export SERVER_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"

# This enables basic gc logging to its own file.
# export SERVER_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<FILE-PATH>"

# This enables basic GC logging to its own file with automatic log rolling. Only applies to jdk 1.6.0_34+ and 1.7.0_2+.
# export SERVER_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<FILE-PATH> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=512M"

# If <FILE-PATH> is not replaced, the log file (.gc) would be generated in HBASE_LOG_DIR.
----

At this point you should see logs like so:

[source]
----
64898.952: [GC [1 CMS-initial-mark: 2811538K(3055704K)] 2812179K(3061272K), 0.0007360 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
64898.953: [CMS-concurrent-mark-start]
64898.971: [GC 64898.971: [ParNew: 5567K->576K(5568K), 0.0101110 secs] 2817105K->2812715K(3061272K), 0.0102200 secs] [Times: user=0.07 sys=0.00, real=0.01 secs]
----

In this section, the first line indicates a 0.0007360 second pause for the CMS to initially mark.
This pauses the entire VM, all threads for that period of time.

The third line indicates a "minor GC", which pauses the VM for 0.0101110 seconds, aka 10 milliseconds.
It has reduced the "ParNew" from about 5.5m to 576k.

Later on in this cycle we see:

[source]
----
64901.445: [CMS-concurrent-mark: 1.542/2.492 secs] [Times: user=10.49 sys=0.33, real=2.49 secs]
64901.445: [CMS-concurrent-preclean-start]
64901.453: [GC 64901.453: [ParNew: 5505K->573K(5568K), 0.0062440 secs] 2868746K->2864292K(3061272K), 0.0063360 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
64901.476: [GC 64901.476: [ParNew: 5563K->575K(5568K), 0.0072510 secs] 2869283K->2864837K(3061272K), 0.0073320 secs] [Times: user=0.05 sys=0.01, real=0.01 secs]
64901.500: [GC 64901.500: [ParNew: 5517K->573K(5568K), 0.0120390 secs] 2869780K->2865267K(3061272K), 0.0121150 secs] [Times: user=0.09 sys=0.00, real=0.01 secs]
64901.529: [GC 64901.529: [ParNew: 5507K->569K(5568K), 0.0086240 secs] 2870200K->2865742K(3061272K), 0.0087180 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
64901.554: [GC 64901.555: [ParNew: 5516K->575K(5568K), 0.0107130 secs] 2870689K->2866291K(3061272K), 0.0107820 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
64901.578: [CMS-concurrent-preclean: 0.070/0.133 secs] [Times: user=0.48 sys=0.01, real=0.14 secs]
64901.578: [CMS-concurrent-abortable-preclean-start]
64901.584: [GC 64901.584: [ParNew: 5504K->571K(5568K), 0.0087270 secs] 2871220K->2866830K(3061272K), 0.0088220 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
64901.609: [GC 64901.609: [ParNew: 5512K->569K(5568K), 0.0063370 secs] 2871771K->2867322K(3061272K), 0.0064230 secs] [Times: user=0.06 sys=0.00, real=0.01 secs]
64901.615: [CMS-concurrent-abortable-preclean: 0.007/0.037 secs] [Times: user=0.13 sys=0.00, real=0.03 secs]
64901.616: [GC[YG occupancy: 645 K (5568 K)]64901.616: [Rescan (parallel) , 0.0020210 secs]64901.618: [weak refs processing, 0.0027950 secs] [1 CMS-remark: 2866753K(3055704K)] 2867399K(3061272K), 0.0049380 secs] [Times: user=0.00 sys=0.01, real=0.01 secs]
64901.621: [CMS-concurrent-sweep-start]
----

The first line indicates that the CMS concurrent mark (finding garbage) has taken 2.4 seconds.
But this is a _concurrent_ 2.4 seconds; Java has not been paused at any point in time.

There are a few more minor GCs, then there is a pause at the 2nd last line:

[source]
----
64901.616: [GC[YG occupancy: 645 K (5568 K)]64901.616: [Rescan (parallel) , 0.0020210 secs]64901.618: [weak refs processing, 0.0027950 secs] [1 CMS-remark: 2866753K(3055704K)] 2867399K(3061272K), 0.0049380 secs] [Times: user=0.00 sys=0.01, real=0.01 secs]
----

The pause here is 0.0049380 seconds (aka 4.9 milliseconds) to 'remark' the heap.

At this point the sweep starts, and you can watch the heap size go down:

[source]
----
64901.637: [GC 64901.637: [ParNew: 5501K->569K(5568K), 0.0097350 secs] 2871958K->2867441K(3061272K), 0.0098370 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
... lines removed ...
64904.936: [GC 64904.936: [ParNew: 5532K->568K(5568K), 0.0070720 secs] 1365024K->1360689K(3061272K), 0.0071930 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
64904.953: [CMS-concurrent-sweep: 2.030/3.332 secs] [Times: user=9.57 sys=0.26, real=3.33 secs]
----

At this point, the CMS sweep took 3.332 seconds, and the heap went from about 2.8 GB to about 1.3 GB.

The key point here is to keep all these pauses low.
CMS pauses are always low, but if your ParNew starts growing, you can see minor GC pauses approach 100ms, exceed 100ms, and hit as high as 400ms.

This can be due to the size of the ParNew, which should be relatively small.
If your ParNew is very large after running HBase for a while (in one example a ParNew was about 150MB), then you might have to constrain the size of ParNew (the larger it is, the longer the collections take, but if it's too small, objects are promoted to old gen too quickly). In the below we constrain new gen size to 64m.

Add the below line in _hbase-env.sh_:

[source,bourne]
----
export SERVER_GC_OPTS="$SERVER_GC_OPTS -XX:NewSize=64m -XX:MaxNewSize=64m"
----

Similarly, to enable GC logging for client processes, uncomment one of the below lines in _hbase-env.sh_:

[source,bourne]
----
# This enables basic gc logging to the .out file.
# export CLIENT_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"

# This enables basic gc logging to its own file.
# export CLIENT_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<FILE-PATH>"

# This enables basic GC logging to its own file with automatic log rolling. Only applies to jdk 1.6.0_34+ and 1.7.0_2+.
# export CLIENT_GC_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<FILE-PATH> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=512M"

# If <FILE-PATH> is not replaced, the log file (.gc) would be generated in HBASE_LOG_DIR.
----

For more information on GC pauses, see the link:https://blog.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/[3 part blog post] by Todd Lipcon and <<gcpause>> above.

[[trouble.resources]]
== Resources

[[trouble.resources.searchhadoop]]
=== search-hadoop.com

link:http://search-hadoop.com[search-hadoop.com] indexes all the mailing lists and is great for historical searches.
Search here first when you have an issue, as it's more than likely someone has already had your problem.

[[trouble.resources.lists]]
=== Mailing Lists

Ask a question on the link:https://hbase.apache.org/mail-lists.html[Apache HBase mailing lists].
The 'dev' mailing list is aimed at the community of developers actually building Apache HBase and for features currently under development, and 'user' is generally used for questions on released versions of Apache HBase.
Before going to the mailing list, make sure your question has not already been answered by searching the mailing list archives first.
Use <<trouble.resources.searchhadoop>>.
Take some time crafting your question.
See link:http://www.mikeash.com/getting_answers.html[Getting Answers] for ideas on crafting good questions.
A quality question that includes all context and exhibits evidence the author has tried to find answers in the manual and out on lists is more likely to get a prompt response.

[[trouble.resources.slack]]
=== Slack

See the http://apache-hbase.slack.com channel on Slack.

[[trouble.resources.irc]]
=== IRC

(You will probably get a more prompt response on the Slack channel.)

#hbase on irc.freenode.net

[[trouble.resources.jira]]
=== JIRA

link:https://issues.apache.org/jira/browse/HBASE[JIRA] is also really helpful when looking for Hadoop/HBase-specific issues.

[[trouble.tools]]
== Troubleshooting Tools

[[trouble.tools.builtin]]
=== Builtin Tools

[[trouble.tools.builtin.webmaster]]
==== Master Web Interface

The Master starts a web-interface on port 16010 by default.
(Up to and including 0.98 this was port 60010.)

The Master web UI lists created tables and their definition (e.g., ColumnFamilies, blocksize, etc.). Additionally, the available RegionServers in the cluster are listed along with selected high-level metrics (requests, number of regions, usedHeap, maxHeap). The Master web UI allows navigation to each RegionServer's web UI.

[[trouble.tools.builtin.webregion]]
==== RegionServer Web Interface

RegionServers start a web-interface on port 16030 by default.
(Up to and including 0.98 this was port 60030.)

The RegionServer web UI lists online regions and their start/end keys, as well as point-in-time RegionServer metrics (requests, regions, storeFileIndexSize, compactionQueueSize, etc.).

See <<hbase_metrics>> for more information on metric definitions.

[[trouble.tools.builtin.zkcli]]
==== zkcli

`zkcli` is a very useful tool for investigating ZooKeeper-related issues.
To invoke:

[source,bourne]
----
./hbase zkcli -server host:port <cmd> <args>
----

The commands (and arguments) are:

[source]
----
connect host:port
get path [watch]
ls path [watch]
set path data [version]
delquota [-n|-b] path
quit
printwatches on|off
create [-s] [-e] path data acl
stat path [watch]
close
ls2 path [watch]
history
listquota path
setAcl path acl
getAcl path
sync path
redo cmdno
addauth scheme auth
delete path [version]
setquota -n|-b val path
----
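
For example, to sanity-check the znodes HBase has registered (the znode names below assume the default `zookeeper.znode.parent` of `/hbase`):

[source,bourne]
----
# List the HBase znodes, then see which host currently holds the master znode.
./hbase zkcli -server host:port ls /hbase
./hbase zkcli -server host:port get /hbase/master
----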

[[trouble.tools.external]]
=== External Tools

[[trouble.tools.tail]]
==== tail

`tail` is the command-line tool that lets you look at the end of a file.
Add the `-f` option and it will refresh when new data is available.
It's useful when you are wondering what's happening, for example when a cluster is taking a long time to shut down or start up, as you can just fire up a new terminal and tail the master log (and maybe a few RegionServers).
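
A minimal sketch (the log name is illustrative; substitute your own user and hostname):

[source,bourne]
----
# Follow the master log as the cluster starts up or shuts down.
tail -f $HBASE_HOME/logs/hbase-<user>-master-<hostname>.log
----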

[[trouble.tools.top]]
==== top

`top` is probably one of the most important tools when first trying to see what's running on a machine and how the resources are consumed.
Here's an example from a production system:

[source]
----
top - 14:46:59 up 39 days, 11:55, 1 user, load average: 3.75, 3.57, 3.84
Tasks: 309 total, 1 running, 308 sleeping, 0 stopped, 0 zombie
Cpu(s): 4.5%us, 1.6%sy, 0.0%ni, 91.7%id, 1.4%wa, 0.1%hi, 0.6%si, 0.0%st
Mem: 24414432k total, 24296956k used, 117476k free, 7196k buffers
Swap: 16008732k total, 14348k used, 15994384k free, 11106908k cached

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
15558 hadoop 18  -2 3292m 2.4g 3556 S   79 10.4 6523:52 java
13268 hadoop 18  -2 8967m 8.2g 4104 S   21 35.1 5170:30 java
 8895 hadoop 18  -2 1581m 497m 3420 S   11  2.1 4002:32 java
----

Here we can see that the one-, five-, and fifteen-minute load averages are 3.75, 3.57 and 3.84, which very roughly means that on average 3.75 threads were waiting for CPU time during the last minute.
In general, the _perfect_ utilization equals the number of cores; below that number the machine is underutilized, and above it the machine is overutilized.
This is an important concept; see this article to understand it more: http://www.linuxjournal.com/article/9001.

Apart from load, we can see that the system is using almost all its available RAM, but most of it is used for the OS cache (which is good). The swap only has a few KBs in it, and this is desirable; high numbers would indicate swapping activity, which is the nemesis of performance of Java systems.
Another way to detect swapping is when the load average goes through the roof (although this could also be caused by things like a dying disk, among others).

The list of processes isn't super useful by default; all we know is that 3 java processes are using about 111% of the CPUs.
To know which is which, simply type `c` and each line will be expanded.
Typing `1` will give you the detail of how each CPU is used instead of the average for all of them like shown here.

[[trouble.tools.jps]]
==== jps

`jps` is shipped with every JDK and gives the java process ids for the current user (if root, then it gives the ids for all users). Example:

[source,bourne]
----
hadoop@sv4borg12:~$ jps
----

In order, we see:

* Hadoop TaskTracker, manages the local Childs
* HBase RegionServer, serves regions
* Child, its MapReduce task, cannot tell which type exactly
* Hadoop TaskTracker, manages the local Childs
* Hadoop DataNode, serves blocks
* HQuorumPeer, a ZooKeeper ensemble member
* Jps, well... it's the current process
* ThriftServer, it's a special one that will be running only if Thrift was started
* jmx, this is a local process that's part of our monitoring platform (poorly named, maybe). You probably don't have that.

You can then do stuff like checking out the full command line that started the process:

[source,bourne]
----
hadoop@sv4borg12:~$ ps aux | grep HRegionServer
hadoop   17789  155 35.2 9067824 8604364 ?   S<l  Mar04 9855:48 /usr/java/jdk1.6.0_14/bin/java -Xmx8000m -XX:+DoEscapeAnalysis -XX:+AggressiveOpts -XX:+UseConcMarkSweepGC -XX:NewSize=64m -XX:MaxNewSize=64m -XX:CMSInitiatingOccupancyFraction=88 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/export1/hadoop/logs/gc-hbase.log -Dcom.sun.management.jmxremote.port=10102 -Dcom.sun.management.jmxremote.authenticate=true -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.password.file=/home/hadoop/hbase/conf/jmxremote.password -Dcom.sun.management.jmxremote -Dhbase.log.dir=/export1/hadoop/logs -Dhbase.log.file=hbase-hadoop-regionserver-sv4borg12.log -Dhbase.home.dir=/home/hadoop/hbase -Dhbase.id.str=hadoop -Dhbase.root.logger=INFO,DRFA -Djava.library.path=/home/hadoop/hbase/lib/native/Linux-amd64-64 -classpath /home/hadoop/hbase/bin/../conf:[many jars]:/home/hadoop/hadoop/conf org.apache.hadoop.hbase.regionserver.HRegionServer start
----

[[trouble.tools.jstack]]
==== jstack

`jstack` is one of the most important tools when trying to figure out what a java process is doing apart from looking at the logs.
It has to be used in conjunction with `jps` in order to give it a process id.
It shows a list of threads, each one has a name, and they appear in the order that they were created (so the top ones are the most recent threads). Here are a few examples:
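
A sketch of capturing a dump for the RegionServer found via `jps` (assumes you run it as the same user as the process; the output file is a placeholder):

[source,bourne]
----
# Grab the RegionServer pid from jps, then dump all of its threads to a file.
RS_PID=$(jps | grep HRegionServer | awk '{print $1}')
jstack $RS_PID > /tmp/regionserver.jstack
----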

The main thread of a RegionServer waiting for something to do from the master:

[source]
----
"regionserver60020" prio=10 tid=0x0000000040ab4000 nid=0x45cf waiting on condition [0x00007f16b6a96000..0x00007f16b6a96a70]
  java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00007f16cd5c2f30> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1963)
    at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:395)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:647)
    at java.lang.Thread.run(Thread.java:619)
----

The MemStore flusher thread that is currently flushing to a file:

[source]
----
"regionserver60020.cacheFlusher" daemon prio=10 tid=0x0000000040f4e000 nid=0x45eb in Object.wait() [0x00007f16b5b86000..0x00007f16b5b87af0]
  java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:485)
    at org.apache.hadoop.ipc.Client.call(Client.java:803)
    - locked <0x00007f16cb14b3a8> (a org.apache.hadoop.ipc.Client$Call)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:221)
    at $Proxy1.complete(Unknown Source)
    at sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at $Proxy1.complete(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3390)
    - locked <0x00007f16cb14b470> (a org.apache.hadoop.hdfs.DFSClient$DFSOutputStream)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3304)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
    at org.apache.hadoop.hbase.io.hfile.HFile$Writer.close(HFile.java:650)
    at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.close(StoreFile.java:853)
    at org.apache.hadoop.hbase.regionserver.Store.internalFlushCache(Store.java:467)
    - locked <0x00007f16d00e6f08> (a java.lang.Object)
    at org.apache.hadoop.hbase.regionserver.Store.flushCache(Store.java:427)
    at org.apache.hadoop.hbase.regionserver.Store.access$100(Store.java:80)
    at org.apache.hadoop.hbase.regionserver.Store$StoreFlusherImpl.flushCache(Store.java:1359)
    at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:907)
    at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:834)
    at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:786)
    at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:250)
    at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:224)
    at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:146)
----

A handler thread that's waiting for stuff to do (like put, delete, scan, etc.):

[source]
----
"IPC Server handler 16 on 60020" daemon prio=10 tid=0x00007f16b011d800 nid=0x4a5e waiting on condition [0x00007f16afefd000..0x00007f16afefd9f0]
  java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00007f16cd3f8dd8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1013)
----

And one that's busy doing an increment of a counter (it's in the phase where it's trying to create a scanner in order to read the last value):

[source]
----
"IPC Server handler 66 on 60020" daemon prio=10 tid=0x00007f16b006e800 nid=0x4a90 runnable [0x00007f16acb77000..0x00007f16acb77cf0]
  java.lang.Thread.State: RUNNABLE
    at org.apache.hadoop.hbase.regionserver.KeyValueHeap.<init>(KeyValueHeap.java:56)
    at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:79)
    at org.apache.hadoop.hbase.regionserver.Store.getScanner(Store.java:1202)
    at org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.<init>(HRegion.java:2209)
    at org.apache.hadoop.hbase.regionserver.HRegion.instantiateInternalScanner(HRegion.java:1063)
    at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1055)
    at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1039)
    at org.apache.hadoop.hbase.regionserver.HRegion.getLastIncrement(HRegion.java:2875)
    at org.apache.hadoop.hbase.regionserver.HRegion.incrementColumnValue(HRegion.java:2978)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.incrementColumnValue(HRegionServer.java:2433)
    at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:560)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1027)
----

A thread that receives data from HDFS:

[source]
----
"IPC Client (47) connection to sv4borg9/10.4.24.40:9000 from hadoop" daemon prio=10 tid=0x00007f16a02d0000 nid=0x4fa3 runnable [0x00007f16b517d000..0x00007f16b517dbf0]
  java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
    - locked <0x00007f17d5b68c00> (a sun.nio.ch.Util$1)
    - locked <0x00007f17d5b68be8> (a java.util.Collections$UnmodifiableSet)
    - locked <0x00007f1877959b50> (a sun.nio.ch.EPollSelectorImpl)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
    at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:332)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
    at java.io.FilterInputStream.read(FilterInputStream.java:116)
    at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:304)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
    - locked <0x00007f1808539178> (a java.io.BufferedInputStream)
    at java.io.DataInputStream.readInt(DataInputStream.java:370)
    at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:569)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:477)
----

And here is a master trying to recover a lease after a RegionServer died:

[source]
----
"LeaseChecker" daemon prio=10 tid=0x00000000407ef800 nid=0x76cd waiting on condition [0x00007f6d0eae2000..0x00007f6d0eae2a70]
  java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:485)
    at org.apache.hadoop.ipc.Client.call(Client.java:726)
    - locked <0x00007f6d1cd28f80> (a org.apache.hadoop.ipc.Client$Call)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
    at $Proxy1.recoverBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2636)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:2832)
    at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:529)
    at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:186)
    at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:530)
    at org.apache.hadoop.hbase.util.FSUtils.recoverFileLease(FSUtils.java:619)
    at org.apache.hadoop.hbase.regionserver.wal.HLog.splitLog(HLog.java:1322)
    at org.apache.hadoop.hbase.regionserver.wal.HLog.splitLog(HLog.java:1210)
    at org.apache.hadoop.hbase.master.HMaster.splitLogAfterStartup(HMaster.java:648)
    at org.apache.hadoop.hbase.master.HMaster.joinCluster(HMaster.java:572)
    at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:503)
----

[[trouble.tools.opentsdb]]
==== OpenTSDB

link:http://opentsdb.net[OpenTSDB] is an excellent alternative to Ganglia as it uses Apache HBase to store all the time series and doesn't have to downsample.
Monitoring your own HBase cluster that hosts OpenTSDB is a good exercise.

Here's an example of a cluster that's suffering from hundreds of compactions launched almost all around the same time, which severely affects the IO performance: (TODO: insert graph plotting compactionQueueSize)

It's a good practice to build dashboards with all the important graphs per machine and per cluster so that debugging issues can be done with a single quick look.
For example, at StumbleUpon there's one dashboard per cluster with the most important metrics from both the OS and Apache HBase.
You can then go down to the machine level and get even more detailed metrics.

[[trouble.tools.clustersshtop]]
==== clusterssh+top

clusterssh+top is like a poor man's monitoring system, and it can be quite useful when you have only a few machines, as it's very easy to set up.
Starting clusterssh will give you one terminal per machine and another terminal in which whatever you type will be retyped in every window.
This means that you can type `top` once and it will start it for all of your machines at the same time, giving you a full view of the current state of your cluster.
You can also tail all the logs at the same time, edit files, etc.
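
For example, with the `clusterssh` package installed, a sketch (the hostnames are placeholders):

[source,bourne]
----
# Open one synchronized terminal per node, then type `top` once in the
# control window to run it everywhere.
cssh host1 host2 host3
----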

[[trouble.client]]
== Client

For more information on the HBase client, see <<architecture.client,client>>.

=== Missed Scan Results Due To Mismatch Of `hbase.client.scanner.max.result.size` Between Client and Server

If either the client or server version is lower than 0.98.11/1.0.0 and the server
has a smaller value for `hbase.client.scanner.max.result.size` than the client, scan
requests that reach the server's `hbase.client.scanner.max.result.size` are likely
to miss data. In particular, 0.98.11 defaults `hbase.client.scanner.max.result.size`
to 2 MB but other versions default to larger values. For this reason, be very careful
using 0.98.11 servers with any other client version.

[[trouble.client.scantimeout]]
=== ScannerTimeoutException or UnknownScannerException

This is thrown if the time between RPC calls from the client to RegionServer exceeds the scan timeout.
For example, if `Scan.setCaching` is set to 500, then there will be an RPC call to fetch the next batch of rows every 500 `.next()` calls on the ResultScanner, because data is being transferred in blocks of 500 rows to the client.
Reducing the setCaching value may be an option, but setting this value too low makes for inefficient processing of rows.

See <<perf.hbase.client.caching>>.
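
As a quick experiment, you can override the caching for a single scan from the HBase shell; a sketch ('mytable' is a placeholder):

[source,bourne]
----
# Fetch 100 rows per RPC instead of the configured default; if the timeouts
# stop, the previous caching value was too high for your row size or
# per-row processing speed.
echo "scan 'mytable', {CACHE => 100}" | ./hbase shell
----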

=== Performance Differences in Thrift and Java APIs

Poor performance, or even `ScannerTimeoutExceptions`, can occur if `Scan.setCaching` is too high, as discussed in <<trouble.client.scantimeout>>.
If the Thrift client uses the wrong caching settings for a given workload, performance can suffer compared to the Java API.
To set caching for a given scan in the Thrift client, use the `scannerGetList(scannerId, numRows)` method, where `numRows` is an integer representing the number of rows to cache.
In one case, it was found that reducing the cache for Thrift scans from 1000 to 100 increased performance to near parity with the Java API given the same queries.

See also Jesse Andersen's link:http://blog.cloudera.com/blog/2014/04/how-to-use-the-hbase-thrift-interface-part-3-using-scans/[blog post] about using Scans with Thrift.

[[trouble.client.lease.exception]]
=== `LeaseException` when calling `Scanner.next`

In some situations clients that fetch data from a RegionServer get a LeaseException instead of the usual <<trouble.client.scantimeout>>.
Usually the source of the exception is `org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:230)` (line number may vary). It tends to happen in the context of a slow/freezing `RegionServer#next` call.
It can be prevented by having `hbase.rpc.timeout` > `hbase.regionserver.lease.period`.
Harsh J investigated the issue as part of the mailing list thread link:https://mail-archives.apache.org/mod_mbox/hbase-user/201209.mbox/%3CCAOcnVr3R-LqtKhFsk8Bhrm-YW2i9O6J6Fhjz2h7q6_sxvwd2yw%40mail.gmail.com%3E[HBase, mail # user - Lease does not exist exceptions].

[[trouble.client.scarylogs]]
=== Shell or client application throws lots of scary exceptions during normal operation

Since 0.20.0 the default log level for `org.apache.hadoop.hbase.*` is DEBUG.

On your clients, edit _$HBASE_HOME/conf/log4j.properties_ and change this: `log4j.logger.org.apache.hadoop.hbase=DEBUG` to this: `log4j.logger.org.apache.hadoop.hbase=INFO`, or even `log4j.logger.org.apache.hadoop.hbase=WARN`.
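
A sketch of making that edit non-interactively (assumes the DEBUG line is present exactly as shown above):

[source,bourne]
----
# Quiet the client-side HBase logging from DEBUG down to INFO.
sed -i 's/log4j.logger.org.apache.hadoop.hbase=DEBUG/log4j.logger.org.apache.hadoop.hbase=INFO/' \
    $HBASE_HOME/conf/log4j.properties
----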

[[trouble.client.longpauseswithcompression]]
=== Long Client Pauses With Compression

This is a fairly frequent question on the Apache HBase dist-list.
The scenario is that a client is typically inserting a lot of data into a relatively un-optimized HBase cluster.
Compression can exacerbate the pauses, although it is not the source of the problem.

See <<precreate.regions>> on the pattern for pre-creating regions, and confirm that the table isn't starting with a single region.
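
For instance, a pre-split table can be created from the HBase shell; a minimal sketch ('mytable', 'cf' and the split points are placeholders):

[source,bourne]
----
# Create the table with four regions instead of one, so initial load
# spreads across RegionServers immediately.
echo "create 'mytable', 'cf', SPLITS => ['g', 'm', 't']" | ./hbase shell
----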

See <<perf.configurations>> for cluster configuration, particularly `hbase.hstore.blockingStoreFiles`, `hbase.hregion.memstore.block.multiplier`, `MAX_FILESIZE` (region size), and `MEMSTORE_FLUSHSIZE`.

A slightly longer explanation of why pauses can happen is as follows: Puts are sometimes blocked on the MemStores, which are blocked by the flusher thread, which is blocked because there are too many files to compact, because the compactor is given too many small files to compact and has to compact the same data repeatedly.
This situation can occur even with minor compactions.
Compounding this situation, Apache HBase doesn't compress data in memory.
Thus, the 64MB that lives in the MemStore could become a 6MB file after compression, which results in a smaller StoreFile.
The upside is that more data is packed into the same region, but performance is achieved by being able to write larger files, which is why HBase waits until the flushsize before writing a new StoreFile.
And smaller StoreFiles become targets for compaction.
Without compression the files are much bigger and don't need as much compaction; however, this is at the expense of I/O.

For additional information, see this thread on link:http://search-hadoop.com/m/WUnLM6ojHm1/Long+client+pauses+with+compression&subj=Long+client+pauses+with+compression[Long client pauses with compression].

[[trouble.client.security.rpc.krb]]
=== Secure Client Connect ([Caused by GSSException: No valid credentials provided...])

You may encounter the following error:

[source]
----
Secure Client Connect ([Caused by GSSException: No valid credentials provided
        (Mechanism level: Request is a replay (34) V PROCESS_TGS)])
----

This issue is caused by bugs in the MIT Kerberos replay_cache component, link:http://krbdev.mit.edu/rt/Ticket/Display.html?id=1201[#1201] and link:http://krbdev.mit.edu/rt/Ticket/Display.html?id=5924[#5924].
These bugs caused the old version of krb5-server to erroneously block subsequent requests sent from a Principal.
This caused krb5-server to block the connections sent from one Client (one HTable instance with multi-threading connection instances for each RegionServer); messages such as `Request is a replay (34)` are logged in the client log.
You can ignore the messages, because HTable will retry 5 * 10 (50) times for each failed connection by default.
HTable will throw IOException if any connection to the RegionServer fails after the retries, so that the user client code for the HTable instance can handle it further.
NOTE: `HTable` is deprecated in HBase 1.0, in favor of `Table`.

Alternatively, update krb5-server to a version which solves these issues, such as krb5-server-1.10.3.
See JIRA link:https://issues.apache.org/jira/browse/HBASE-10379[HBASE-10379] for more details.

[[trouble.client.zookeeper]]
=== ZooKeeper Client Connection Errors

Errors like this...

[source]
----
11/07/05 11:26:41 WARN zookeeper.ClientCnxn: Session 0x0 for server null,
unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused: no further information
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1078)
11/07/05 11:26:43 INFO zookeeper.ClientCnxn: Opening socket connection to
server localhost/127.0.0.1:2181
11/07/05 11:26:44 WARN zookeeper.ClientCnxn: Session 0x0 for server null,
unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused: no further information
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1078)
11/07/05 11:26:45 INFO zookeeper.ClientCnxn: Opening socket connection to
server localhost/127.0.0.1:2181
----

\... are either due to ZooKeeper being down, or unreachable due to network issues.

The utility <<trouble.tools.builtin.zkcli>> may help investigate ZooKeeper issues.

[[trouble.client.oome.directmemory.leak]]
=== Client running out of memory though heap size seems to be stable (but the off-heap/direct heap keeps growing)

You are likely running into the issue that is described and worked through in the mail thread link:http://search-hadoop.com/m/ubhrX8KvcH/Suspected+memory+leak&subj=Re+Suspected+memory+leak[HBase, mail # user - Suspected memory leak] and continued over in link:http://search-hadoop.com/m/p2Agc1Zy7Va/MaxDirectMemorySize+Was%253A+Suspected+memory+leak&subj=Re+FeedbackRe+Suspected+memory+leak[HBase, mail # dev - FeedbackRe: Suspected memory leak].
A workaround is passing your client-side JVM a reasonable value for `-XX:MaxDirectMemorySize`.
By default, the `MaxDirectMemorySize` is equal to your `-Xmx` max heapsize setting (if `-Xmx` is set). Try setting it to something smaller (for example, one user had success setting it to `1g` when they had a client-side heap of `12g`). If you set it too small, it will bring on `FullGCs`, so keep it a bit hefty.
You want to make this setting client-side only, especially if you are running the new experimental server-side off-heap cache, since this feature depends on being able to use big direct buffers (you may have to keep separate client-side and server-side config dirs).
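
A sketch of the client-side setting in _hbase-env.sh_ (the `1g` value is just the example from the thread above; tune it for your workload):

[source,bourne]
----
# Cap direct (off-heap) memory; apply this in the client-side config dir
# only, not on servers that use the off-heap cache.
export HBASE_OPTS="$HBASE_OPTS -XX:MaxDirectMemorySize=1g"
----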

[[trouble.client.slowdown.admin]]
=== Client Slowdown When Calling Admin Methods (flush, compact, etc.)

This is a client issue fixed by link:https://issues.apache.org/jira/browse/HBASE-5073[HBASE-5073] in 0.90.6.
There was a ZooKeeper leak in the client and the client was getting pummeled by ZooKeeper events with each additional invocation of the admin API.

[[trouble.client.security.rpc]]
=== Secure Client Cannot Connect ([Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)])

There can be several causes that produce this symptom.

First, check that you have a valid Kerberos ticket.
One is required in order to set up communication with a secure Apache HBase cluster.
Examine the ticket currently in the credential cache, if any, by running the `klist` command line utility.
If no ticket is listed, you must obtain a ticket by running the `kinit` command with either a keytab specified, or by interactively entering a password for the desired principal.
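
For example (the principal and keytab path are placeholders):

[source,bourne]
----
# Check for a current ticket...
klist

# ...and obtain one if none is listed, either from a keytab:
kinit -kt /etc/security/keytabs/hbase.keytab hbase/host.example.com@EXAMPLE.COM
# ...or interactively:
kinit user@EXAMPLE.COM
----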

Then, consult the link:http://docs.oracle.com/javase/1.5.0/docs/guide/security/jgss/tutorials/Troubleshooting.html[Java Security Guide troubleshooting section].
The most common problem addressed there is resolved by setting the `javax.security.auth.useSubjectCredsOnly` system property value to `false`.

Because of a change in the format in which MIT Kerberos writes its credentials cache, there is a bug in the Oracle JDK 6 Update 26 and earlier that causes Java to be unable to read the Kerberos credentials cache created by versions of MIT Kerberos 1.8.1 or higher.
If you have this problematic combination of components in your environment, to work around this problem, first log in with `kinit` and then immediately refresh the credential cache with `kinit -R`.
The refresh will rewrite the credential cache without the problematic formatting.

Prior to JDK 1.4, the JCE was an unbundled product, and as such, the JCA and JCE were regularly referred to as separate, distinct components.
As JCE is now bundled in the JDK 7.0, the distinction is becoming less apparent. Since the JCE uses the same architecture as the JCA, the JCE should be more properly thought of as a part of the JCA.

You may need to install the link:https://docs.oracle.com/javase/1.5.0/docs/guide/security/jce/JCERefGuide.html[Java Cryptography Extension], or JCE, if you are on JDK 1.5 or an earlier version.
Ensure the JCE jars are on the classpath on both server and client systems.

You may also need to download the link:http://www.oracle.com/technetwork/java/javase/downloads/jce-6-download-429243.html[unlimited strength JCE policy files].
Uncompress and extract the downloaded file, and install the policy jars into _<java-home>/lib/security_.

[[trouble.mapreduce]]
== MapReduce

[[trouble.mapreduce.local]]
=== You Think You're On The Cluster, But You're Actually Local

The following stacktrace happened using `ImportTsv`, but things like this can happen on any job with a misconfiguration.

[source]
----
WARN mapred.LocalJobRunner: job_local_0001
java.lang.IllegalArgumentException: Can't read partitions file
    at org.apache.hadoop.hbase.mapreduce.hadoopbackport.TotalOrderPartitioner.setConf(TotalOrderPartitioner.java:111)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:560)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:639)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)
Caused by: java.io.FileNotFoundException: File _partition.lst does not exist.
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:383)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
    at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:776)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1419)
    at org.apache.hadoop.hbase.mapreduce.hadoopbackport.TotalOrderPartitioner.readPartitions(TotalOrderPartitioner.java:296)
----

\...see the critical portion of the stack? It's...

[source]
----
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)
----

LocalJobRunner means the job is running locally, not on the cluster.

To solve this problem, you should run your MR job with your `HADOOP_CLASSPATH` set to include the HBase dependencies.
The "hbase classpath" utility can be used to do this easily.
For example (substitute VERSION with your HBase version):

[source,bourne]
----
HADOOP_CLASSPATH=`hbase classpath` hadoop jar $HBASE_HOME/hbase-mapreduce-VERSION.jar rowcounter usertable
----

See <<hbase.mapreduce.classpath,HBase, MapReduce, and the CLASSPATH>> for more information on HBase MapReduce jobs and classpaths.

[[trouble.hbasezerocopybytestring]]
=== Launching a job, you get java.lang.IllegalAccessError: com/google/protobuf/HBaseZeroCopyByteString or class com.google.protobuf.ZeroCopyLiteralByteString cannot access its superclass com.google.protobuf.LiteralByteString

See link:https://issues.apache.org/jira/browse/HBASE-10304[HBASE-10304 Running an hbase job jar: IllegalAccessError: class com.google.protobuf.ZeroCopyLiteralByteString cannot access its superclass com.google.protobuf.LiteralByteString] and link:https://issues.apache.org/jira/browse/HBASE-11118[HBASE-11118 non environment variable solution for "IllegalAccessError: class com.google.protobuf.ZeroCopyLiteralByteString cannot access its superclass com.google.protobuf.LiteralByteString"].
The issue can also show up when trying to run spark jobs.
See link:https://issues.apache.org/jira/browse/HBASE-10877[HBASE-10877 HBase non-retriable exception list should be expanded].

[[trouble.namenode]]
== NameNode

For more information on the NameNode, see <<arch.hdfs>>.

[[trouble.namenode.disk]]
=== HDFS Utilization of Tables and Regions

To determine how much space HBase is using on HDFS, use the `hadoop` shell commands from the NameNode.

[source,bourne]
----
hadoop fs -dus /hbase/
----

\...returns the summarized disk utilization for all HBase objects.

[source,bourne]
----
hadoop fs -dus /hbase/myTable
----

\...returns the summarized disk utilization for the HBase table 'myTable'.

[source,bourne]
----
hadoop fs -du /hbase/myTable
----

\...returns a list of the regions under the HBase table 'myTable' and their disk utilization.
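
Note that on current Hadoop versions `-dus` is deprecated; the equivalent is `-du -s`. For example:

[source,bourne]
----
# Summarized disk utilization for 'myTable', modern syntax.
hadoop fs -du -s /hbase/myTable
----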

For more information on HDFS shell commands, see the link:https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FileSystemShell.html[HDFS FileSystem Shell documentation].

[[trouble.namenode.hbase.objects]]
=== Browsing HDFS for HBase Objects

Sometimes it will be necessary to explore the HBase objects that exist on HDFS.
These objects could include the WALs (Write Ahead Logs), tables, regions, StoreFiles, etc.
The easiest way to do this is with the NameNode web application that runs on port 50070.
The NameNode web application will provide links to all the DataNodes in the cluster so that they can be browsed seamlessly.

The HDFS directory structure of HBase tables in the cluster is...

[source]
----
/hbase
    /<Table>                    (Tables in the cluster)
        /<Region>               (Regions for the table)
            /<ColumnFamily>     (ColumnFamilies for the Region for the table)
                /<StoreFile>    (StoreFiles for the ColumnFamily for the Regions for the table)
----

The HDFS directory structure of HBase WAL is...

[source]
----
/hbase
    /.logs
        /<RegionServer>         (RegionServers)
            /<WAL>              (WAL files for the RegionServer)
----

See the link:https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html[HDFS User Guide] for other non-shell diagnostic utilities like `fsck`.

[[trouble.namenode.0size.hlogs]]
==== Zero size WALs with data in them

Problem: when getting a listing of all the files in a RegionServer's _.logs_ directory, one file has a size of 0 but it contains data.

Answer: It's an HDFS quirk.
A file that's currently being written to will appear to have a size of 0, but once it's closed it will show its true size.
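
If you want to confirm which WAL files are still open for write (and so report size 0), `hdfs fsck` can list them; a sketch (the _.logs_ path follows the layout described above; newer versions use _/hbase/WALs_):

[source,bourne]
----
# List files under the WAL directory that are currently open for write.
hdfs fsck /hbase/.logs -openforwrite
----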

[[trouble.namenode.uncompaction]]
==== Use Cases

Two common use cases for querying HDFS for HBase objects both concern the degree of compaction of a table.
If there are a large number of StoreFiles for each ColumnFamily, it could indicate the need for a major compaction.
Additionally, after a major compaction, if the resulting StoreFile is "small" it could indicate the need for a reduction of ColumnFamilies for the table.

=== Unexpected Filesystem Growth

If you see an unexpected spike in filesystem usage by HBase, two possible culprits are snapshots and WALs.

Snapshots::
When you create a snapshot, HBase retains everything it needs to recreate the table's state at the time of the snapshot. This includes deleted cells and expired versions.
For this reason, your snapshot usage pattern should be well-planned, and you should prune snapshots that you no longer need.
Snapshots are stored in `/hbase/.hbase-snapshot`, and archives needed to restore snapshots are stored in `/hbase/archive/<_tablename_>/<_region_>/<_column_family_>/`.
+
*Do not* manage snapshots or archives manually via HDFS. HBase provides APIs and HBase Shell commands for managing them. For more information, see <<ops.snapshots>>.
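+
For example, from the HBase shell you can review and prune snapshots; a sketch ('my_snapshot' is a placeholder):
+
[source,bourne]
----
# See what snapshots exist, then drop the ones you no longer need.
echo "list_snapshots" | ./hbase shell
echo "delete_snapshot 'my_snapshot'" | ./hbase shell
----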

WAL::
Write-ahead logs (WALs) are stored in subdirectories of the HBase root directory, typically `/hbase/`, depending on their status.
Already-processed WALs are stored in `/hbase/oldWALs/` and corrupt WALs are stored in `/hbase/.corrupt/` for examination.
If the size of one of these subdirectories is growing, examine the HBase server logs to find the root cause for why WALs are not being processed correctly.
+
If you use replication and `/hbase/oldWALs/` is using more space than you expect, remember that WALs are saved when replication is disabled, as long as there are peers.
+
*Do not* manage WALs manually via HDFS.

[[trouble.network]]
== Network

[[trouble.network.spikes]]
=== Network Spikes

If you are seeing periodic network spikes you might want to check the `compactionQueues` to see if major compactions are happening.

See <<managed.compactions>> for more information on managing compactions.

[[trouble.network.loopback]]
=== Loopback IP

HBase expects the loopback IP address to be 127.0.0.1.
See the Getting Started section on <<loopback.ip>>.

[[trouble.network.ints]]
=== Network Interfaces

Are all the network interfaces functioning correctly? Are you sure? See the Troubleshooting Case Study in <<trouble.casestudy>>.

[[trouble.rs]]
== RegionServer

For more information on the RegionServers, see <<regionserver.arch>>.

[[trouble.rs.startup]]
=== Startup Errors

[[trouble.rs.startup.master_no_region]]
==== Master Starts, But RegionServers Do Not

The Master believes the RegionServers have the IP of 127.0.0.1, which is localhost and resolves to the master's own localhost.

The RegionServers are erroneously informing the Master that their IP addresses are 127.0.0.1.

Modify _/etc/hosts_ on the region servers, from...

[source]
----
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1               fully.qualified.regionservername regionservername localhost.localdomain localhost
::1             localhost6.localdomain6 localhost6
----

\... to (removing the master node's name from localhost)...

[source]
----
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1               localhost.localdomain localhost
::1             localhost6.localdomain6 localhost6
----
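
After editing, it is worth verifying what the RegionServer will actually report; a quick check to run on each RegionServer:

[source,bourne]
----
# The FQDN should resolve to the host's real address, not 127.0.0.1.
hostname -f
getent hosts $(hostname -f)
----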

[[trouble.rs.startup.compression]]
==== Compression Link Errors

Since compression algorithms such as LZO need to be installed and configured on each cluster, this is a frequent source of startup errors.
If you see messages like this...

[source]
----
11/02/20 01:32:15 ERROR lzo.GPLNativeCodeLoader: Could not load native gpl library
java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path
    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1734)
    at java.lang.Runtime.loadLibrary0(Runtime.java:823)
    at java.lang.System.loadLibrary(System.java:1028)
----

\... then there is a path issue with the compression libraries.
See the Configuration section on <<compression,LZO compression configuration>>.

[[trouble.rs.runtime]]
=== Runtime Errors

[[trouble.rs.runtime.hang]]
==== RegionServer Hanging

Are you running an old JVM (< 1.6.0_u21?)? When you look at a thread dump, does it look like threads are BLOCKED but no one holds the lock they are all blocked on? See link:https://issues.apache.org/jira/browse/HBASE-3622[HBASE 3622 Deadlock in HBaseServer (JVM bug?)].
Adding `-XX:+UseMembar` to the HBase `HBASE_OPTS` in _conf/hbase-env.sh_ may fix it.

[[trouble.rs.runtime.filehandles]]
==== java.io.IOException...(Too many open files)

If you see log messages like this...

[source]
----
2010-09-13 01:24:17,336 WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
Disk-related IOException in BlockReceiver constructor. Cause is java.io.IOException: Too many open files
    at java.io.UnixFileSystem.createFileExclusively(Native Method)
    at java.io.File.createNewFile(File.java:883)
----

\... see the Getting Started section on <<ulimit,ulimit and nproc configuration>>.

[[trouble.rs.runtime.xceivers]]
==== xceiverCount 258 exceeds the limit of concurrent xcievers 256

This typically shows up in the DataNode logs.

See the Getting Started section on <<dfs.datanode.max.transfer.threads,xceivers configuration>>.

[[trouble.rs.runtime.oom_nt]]
==== System instability, and the presence of "java.lang.OutOfMemoryError: unable to create new native thread" in HDFS DataNode logs or those of any system daemon

See the Getting Started section on <<ulimit,ulimit and nproc configuration>>.
The default on recent Linux distributions is 1024, which is far too low for HBase.

[[trouble.rs.runtime.gc]]
==== DFS instability and/or RegionServer lease timeouts

If you see warning messages like this...

[source]
----
2009-02-24 10:01:33,516 WARN org.apache.hadoop.hbase.util.Sleeper: We slept xxx ms, ten times longer than scheduled: 10000
2009-02-24 10:01:33,516 WARN org.apache.hadoop.hbase.util.Sleeper: We slept xxx ms, ten times longer than scheduled: 15000
2009-02-24 10:01:36,472 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: unable to report to master for xxx milliseconds - retrying
----

\... or see full GC compactions, then you may be experiencing full GCs.

[[trouble.rs.runtime.nolivenodes]]
==== "No live nodes contain current block" and/or YouAreDeadException

These errors can happen either when running out of OS file handles or in periods of severe network problems where the nodes are unreachable.

See the Getting Started section on <<ulimit,ulimit and nproc configuration>> and check your network.

[[trouble.rs.runtime.zkexpired]]
==== ZooKeeper SessionExpired events

Master or RegionServers shutting down with messages like those in the logs:

[source]
----
WARN org.apache.zookeeper.ClientCnxn: Exception
closing session 0x278bd16a96000f to sun.nio.ch.SelectionKeyImpl@355811ec
java.io.IOException: TIMED OUT
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
WARN org.apache.hadoop.hbase.util.Sleeper: We slept 79410ms, ten times longer than scheduled: 5000
INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server hostname/IP:PORT
INFO org.apache.zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=/IP:PORT remote=hostname/IP:PORT]
INFO org.apache.zookeeper.ClientCnxn: Server connection successful
WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x278bd16a96000d to sun.nio.ch.SelectionKeyImpl@3544d65e
java.io.IOException: Session Expired
    at org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
    at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expired
----

The JVM is doing a long running garbage collection which is pausing all threads (aka "stop the world"). Since the RegionServer's local ZooKeeper client cannot send heartbeats, the session times out.
By design, we shut down any node that isn't able to contact the ZooKeeper ensemble after getting a timeout so that it stops serving data that may already be assigned elsewhere.

* Make sure you give plenty of RAM (in _hbase-env.sh_); the default of 1GB won't be able to sustain long running imports.
* Make sure you don't swap; the JVM never behaves well under swapping.
* Make sure you are not CPU starving the RegionServer thread.
For example, if you are running a MapReduce job using 6 CPU-intensive tasks on a machine with 4 cores, you are probably starving the RegionServer enough to create longer garbage collection pauses.
* Increase the ZooKeeper session timeout.

If you wish to increase the session timeout, add the following to your _hbase-site.xml_ to increase the timeout from the default of 60 seconds to 120 seconds.

[source,xml]
----
<property>
  <name>zookeeper.session.timeout</name>
  <value>120000</value>
</property>
<property>
  <name>hbase.zookeeper.property.tickTime</name>
  <value>6000</value>
</property>
----

Be aware that setting a higher timeout means that the regions served by a failed RegionServer will take at least that amount of time to be transferred to another RegionServer.
For a production system serving live requests, we would instead recommend setting it lower than 1 minute and over-provisioning your cluster in order to lower the memory load on each machine (and hence have less garbage to collect per machine).

If this is happening during an upload which only happens once (like initially loading all your data into HBase), consider bulk loading.

See <<trouble.zookeeper.general>> for other general information about ZooKeeper troubleshooting.
[[trouble.rs.runtime.notservingregion]]
==== NotServingRegionException

This exception is "normal" when found in the RegionServer logs at DEBUG level.
This exception is returned back to the client and then the client goes back to `hbase:meta` to find the new location of the moved region.

However, if the NotServingRegionException is logged at ERROR level, then the client ran out of retries and something is probably wrong.
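
If you suspect clients are giving up too early while regions move, a sketch like the following (table name, row key, and retry count are made up for illustration) raises the client retry budget before issuing a read:

[source,java]
----
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RetryAwareGet {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Give slow region moves more time before the client runs out of retries.
    conf.setInt("hbase.client.retries.number", 20); // value is illustrative
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("mytable"))) {
      Result result = table.get(new Get(Bytes.toBytes("myrow")));
      System.out.println("row found: " + !result.isEmpty());
    }
  }
}
----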
[[trouble.rs.runtime.double_listed_regions]]
==== Regions listed by domain name, then IP

In versions of Apache HBase before 0.92.x, reverse DNS needs to give the same answer as the forward lookup.
See link:https://issues.apache.org/jira/browse/HBASE-3431[HBASE 3431 RegionServer is not using the name given it by the master; double entry in master listing of servers] for gory details.
[[brand.new.compressor]]
==== Logs flooded with '2011-01-10 12:40:48,407 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor' messages

We are not using the native versions of compression libraries.
See link:https://issues.apache.org/jira/browse/HBASE-1900[HBASE-1900 Put back native support when hadoop 0.21 is released].
Copy the native libs from Hadoop under the HBase lib dir, or symlink them into place, and the message should go away.
[[trouble.rs.runtime.client_went_away]]
==== Server handler X on 60020 caught: java.nio.channels.ClosedChannelException

If you see this type of message it means that the region server was trying to read/send data from/to a client, but the client had already gone away.
Typical causes are that the client was killed (you see a storm of messages like this when a MapReduce job is killed or fails) or that the client received a SocketTimeoutException.
It's harmless, but you should consider digging in a bit more if you aren't doing something to trigger them.
=== Snapshot Errors Due to Reverse DNS

Several operations within HBase, including snapshots, rely on properly configured reverse DNS.
Some environments, such as Amazon EC2, have trouble with reverse DNS.
If you see errors like the following on your RegionServers, check your reverse DNS configuration:

----
2013-05-01 00:04:56,356 DEBUG org.apache.hadoop.hbase.procedure.Subprocedure: Subprocedure 'backup1'
coordinator notified of 'acquire', waiting on 'reached' or 'abort' from coordinator.
----

In general, the hostname reported by the RegionServer needs to be the same as the hostname the Master is trying to reach.
You can see a hostname mismatch by looking for the following type of message in the RegionServer's logs at start-up.

----
2013-05-01 00:03:00,614 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Master passed us hostname
to use. Was=myhost-1234, Now=ip-10-55-88-99.ec2.internal
----
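
A quick way to spot such a mismatch on the affected host is to compare a forward and a reverse lookup; the sketch below (the class name is ours, not part of HBase) prints both answers side by side:

[source,java]
----
import java.net.InetAddress;

public class ReverseDnsCheck {
  public static void main(String[] args) throws Exception {
    InetAddress local = InetAddress.getLocalHost();
    String forward = local.getHostName();
    // Reverse-resolve our own address; on a correctly configured host
    // this yields the same name the forward lookup produced.
    String reverse = InetAddress.getByAddress(local.getAddress()).getCanonicalHostName();
    System.out.println("forward lookup: " + forward);
    System.out.println("reverse lookup: " + reverse);
    if (!forward.equals(reverse)) {
      System.out.println("WARNING: forward/reverse DNS disagree; fix DNS first");
    }
  }
}
----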
[[trouble.rs.shutdown]]
=== Shutdown Errors

[[trouble.master]]
== Master

For more information on the Master, see <<architecture.master,master>>.

[[trouble.master.startup]]
=== Startup Errors
[[trouble.master.startup.migration]]
==== Master says that you need to run the HBase migrations script

Upon running that, the HBase migrations script says no files in root directory.

HBase expects the root directory to either not exist, or to have already been initialized by HBase running a previous time.
If you create a new directory for HBase using Hadoop DFS, this error will occur.
Make sure the HBase root directory does not currently exist or has been initialized by a previous run of HBase.
A sure-fire solution is to just use Hadoop dfs to delete the HBase root and let HBase create and initialize the directory itself.
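
Before deleting anything, it can help to confirm what HBase actually sees. Here is a minimal sketch (the class name is illustrative; the delete is deliberately commented out because it destroys all HBase data):

[source,java]
----
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class RootDirCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Resolve the configured root directory and report whether it exists.
    Path rootDir = new Path(conf.get("hbase.rootdir"));
    FileSystem fs = rootDir.getFileSystem(conf);
    System.out.println(rootDir + " exists: " + fs.exists(rootDir));
    // fs.delete(rootDir, true); // uncomment only on a cluster with no data to keep
  }
}
----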
[[trouble.master.startup.zk.buffer]]
==== Packet len6080218 is out of range!

If you have many regions on your cluster and you see an error like the one reported in this section's title in your logs, see link:https://issues.apache.org/jira/browse/HBASE-4246[HBASE-4246 Cluster with too many regions cannot withstand some master failover scenarios].
[[trouble.master.shutdown]]
=== Shutdown Errors

[[trouble.zookeeper]]
== ZooKeeper

[[trouble.zookeeper.startup]]
=== Startup Errors
[[trouble.zookeeper.startup.address]]
==== Could not find my address: xyz in list of ZooKeeper quorum servers

A ZooKeeper server wasn't able to start and threw that error, where xyz is the name of your server.

This is a name lookup problem.
HBase tries to start a ZooKeeper server on some machine, but that machine isn't able to find itself in the `hbase.zookeeper.quorum` configuration.

Use the hostname presented in the error message instead of the value you used.
If you have a DNS server, you can set `hbase.zookeeper.dns.interface` and `hbase.zookeeper.dns.nameserver` in _hbase-site.xml_ to make sure it resolves to the correct FQDN.
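
To see why a machine cannot find itself in the quorum, a small sketch like this (the class name is ours) resolves each configured quorum entry and flags the one, if any, that maps back to the local host:

[source,java]
----
import java.net.InetAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class QuorumSelfCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    String localName = InetAddress.getLocalHost().getCanonicalHostName();
    for (String host : conf.getStrings("hbase.zookeeper.quorum", "localhost")) {
      // Resolve each quorum entry and compare it to our own canonical name.
      String resolved = InetAddress.getByName(host).getCanonicalHostName();
      boolean isLocal = resolved.equals(localName);
      System.out.println(host + " -> " + resolved + (isLocal ? "  (this machine)" : ""));
    }
  }
}
----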
[[trouble.zookeeper.general]]
=== ZooKeeper, The Cluster Canary

ZooKeeper is the cluster's "canary in the mineshaft". It'll be the first to notice issues if any, so making sure it's happy is the short-cut to a humming cluster.

See the link:https://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting[ZooKeeper Operating Environment Troubleshooting] page.
It has suggestions and tools for checking disk and networking performance; i.e.
the operating environment your ZooKeeper and HBase are running in.

Additionally, the utility <<trouble.tools.builtin.zkcli>> may help investigate ZooKeeper issues.
[[trouble.ec2]]
== Amazon EC2

[[trouble.ec2.zookeeper]]
=== ZooKeeper does not seem to work on Amazon EC2

HBase does not start when deployed as Amazon EC2 instances.
Exceptions like the below appear in the Master and/or RegionServer logs:

----
2009-10-19 11:52:27,030 INFO org.apache.zookeeper.ClientCnxn: Attempting
connection to server ec2-174-129-15-236.compute-1.amazonaws.com/10.244.9.171:2181
2009-10-19 11:52:27,032 WARN org.apache.zookeeper.ClientCnxn: Exception
closing session 0x0 to sun.nio.ch.SelectionKeyImpl@656dc861
java.net.ConnectException: Connection refused
----

Security group policy is blocking the ZooKeeper port on a public address.
Use the internal EC2 host names when configuring the ZooKeeper quorum peer list.
[[trouble.ec2.instability]]
=== Instability on Amazon EC2

Questions on HBase and Amazon EC2 come up frequently on the HBase dist-list.
Search for old threads using link:http://search-hadoop.com/[Search Hadoop].

[[trouble.ec2.connection]]
=== Remote Java Connection into EC2 Cluster Not Working

See Andrew's answer here, up on the user list: link:http://search-hadoop.com/m/sPdqNFAwyg2[Remote Java client connection into EC2 instance].
[[trouble.versions]]
== HBase and Hadoop version issues

[[trouble.versions.205]]
=== `NoClassDefFoundError` when trying to run 0.90.x on hadoop-0.20.205.x (or hadoop-1.0.x)

Apache HBase 0.90.x does not ship with hadoop-0.20.205.x, etc.
To make it run, you need to replace the hadoop jars that Apache HBase shipped with in its _lib_ directory with those of the Hadoop you want to run HBase on.
If, even after replacing the Hadoop jars, you still get the exception below:

----
sv4r6s38: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/configuration/Configuration
sv4r6s38:       at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<init>(DefaultMetricsSystem.java:37)
sv4r6s38:       at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<clinit>(DefaultMetricsSystem.java:34)
sv4r6s38:       at org.apache.hadoop.security.UgiInstrumentation.create(UgiInstrumentation.java:51)
sv4r6s38:       at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:209)
sv4r6s38:       at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:177)
sv4r6s38:       at org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:229)
sv4r6s38:       at org.apache.hadoop.security.KerberosName.<clinit>(KerberosName.java:83)
sv4r6s38:       at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:202)
sv4r6s38:       at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:177)
----

then you need to copy the _commons-configuration-X.jar_ found in your Hadoop's _lib_ directory into _hbase/lib_.
That should fix the above complaint.
[[trouble.wrong.version]]
=== ...cannot communicate with client version...

If you see something like the following in your logs [computeroutput]+... 2012-09-24 10:20:52,168 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown. org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4 ...+ ...are you trying to talk to a Hadoop 2.0.x cluster from an HBase that has a Hadoop 1.0.x client? Use the HBase built against Hadoop 2.0, or rebuild your HBase passing the +-Dhadoop.profile=2.0+ attribute to Maven (see <<maven.build.hadoop>> for more).
== IPC Configuration Conflicts with Hadoop

If the Hadoop configuration is loaded after the HBase configuration, and you have configured custom IPC settings in both HBase and Hadoop, the Hadoop values may overwrite the HBase values.
There is normally no need to change these settings for HBase, so this problem is an edge case.
However, link:https://issues.apache.org/jira/browse/HBASE-11492[HBASE-11492] renames these settings for HBase to remove the chance of a conflict.
Each of the setting names has been prefixed with `hbase.`, as shown in the following table.
No action is required related to these changes unless you are already experiencing a conflict.

These changes were backported to HBase 0.98.x and apply to all newer versions.
[cols="1,1", options="header"]
|===
| Pre-0.98.x
| 0.98-x And Newer

| ipc.server.listen.queue.size
| hbase.ipc.server.listen.queue.size

| ipc.server.max.callqueue.size
| hbase.ipc.server.max.callqueue.size

| ipc.server.callqueue.handler.factor
| hbase.ipc.server.callqueue.handler.factor

| ipc.server.callqueue.read.share
| hbase.ipc.server.callqueue.read.share

| ipc.server.callqueue.type
| hbase.ipc.server.callqueue.type

| ipc.server.queue.max.call.delay
| hbase.ipc.server.queue.max.call.delay

| ipc.server.max.callqueue.length
| hbase.ipc.server.max.callqueue.length

| ipc.server.read.threadpool.size
| hbase.ipc.server.read.threadpool.size

| ipc.server.tcpkeepalive
| hbase.ipc.server.tcpkeepalive

| ipc.server.tcpnodelay
| hbase.ipc.server.tcpnodelay

| ipc.client.call.purge.timeout
| hbase.ipc.client.call.purge.timeout

| ipc.client.connection.maxidletime
| hbase.ipc.client.connection.maxidletime

| ipc.client.idlethreshold
| hbase.ipc.client.idlethreshold

| ipc.client.kill.max
| hbase.ipc.client.kill.max

| ipc.server.scan.vtime.weight
| hbase.ipc.server.scan.vtime.weight
|===
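
If you do carry custom `ipc.*` overrides and suspect a conflict, a defensive sketch like the following (a hypothetical helper, not part of HBase) copies any legacy keys to their prefixed names so a later-loaded Hadoop configuration cannot clobber them:

[source,java]
----
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class IpcKeyMigration {
  /** Copy legacy ipc.* overrides to the conflict-free hbase.ipc.* names. */
  public static void migrate(Configuration conf) {
    List<String> legacyKeys = new ArrayList<>();
    for (Map.Entry<String, String> entry : conf) {
      if (entry.getKey().startsWith("ipc.")) {
        legacyKeys.add(entry.getKey());
      }
    }
    for (String key : legacyKeys) {
      // Only fill in the new name if it has not been set explicitly.
      if (conf.get("hbase." + key) == null) {
        conf.set("hbase." + key, conf.get(key));
      }
    }
  }

  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    migrate(conf);
    System.out.println(conf.get("hbase.ipc.server.tcpnodelay"));
  }
}
----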
[[trouble.hdfs]]
== HBase and HDFS

General configuration guidance for Apache HDFS is out of the scope of this guide.
Refer to the documentation available at https://hadoop.apache.org/ for extensive information about configuring HDFS.
This section deals with HDFS in terms of HBase.
In most cases, HBase stores its data in Apache HDFS.
This includes the HFiles containing the data, as well as the write-ahead logs (WALs) which store data before it is written to the HFiles and protect against RegionServer crashes.
HDFS provides reliability and protection to data in HBase because it is distributed.
To operate with the most efficiency, HBase needs data to be available locally.
Therefore, it is a good practice to run an HDFS DataNode on each RegionServer.
.Important Information and Guidelines for HBase and HDFS

HBase is a client of HDFS.::
HBase is an HDFS client, using the HDFS `DFSClient` class, and references to this class appear in HBase logs with other HDFS client log messages.

Configuration is necessary in multiple places.::
Some HDFS configurations relating to HBase need to be done at the HDFS (server) side.
Others must be done within HBase (at the client side). Other settings need to be set at both the server and client side.

Write errors which affect HBase may be logged in the HDFS logs rather than HBase logs.::
When writing, HDFS pipelines communications from one DataNode to another.
HBase communicates to both the HDFS NameNode and DataNode, using the HDFS client classes.
Communication problems between DataNodes are logged in the HDFS logs, not the HBase logs.

HBase communicates with HDFS using two different ports.::
HBase communicates with DataNodes using the `ipc.Client` interface and the `DataNode` class.
References to these will appear in HBase logs.
Each of these communication channels uses a different port (50010 and 50020 by default). The ports are configured in the HDFS configuration, via the `dfs.datanode.address` and `dfs.datanode.ipc.address` parameters.

Errors may be logged in HBase, HDFS, or both.::
When troubleshooting HDFS issues in HBase, check logs in both places for errors.

HDFS takes a while to mark a node as dead. You can configure HDFS to avoid using stale DataNodes.::
By default, HDFS does not mark a node as dead until it is unreachable for 630 seconds.
In Hadoop 1.1 and Hadoop 2.x, this can be alleviated by enabling checks for stale DataNodes, though this check is disabled by default.
You can enable the check for reads and writes separately, via the `dfs.namenode.avoid.read.stale.datanode` and `dfs.namenode.avoid.write.stale.datanode` settings.
A stale DataNode is one that has not been reachable for `dfs.namenode.stale.datanode.interval` (default is 30 seconds). Stale datanodes are avoided, and marked as the last possible target for a read or write operation.
For configuration details, see the HDFS documentation.

Settings for HDFS retries and timeouts are important to HBase.::
You can configure settings for various retries and timeouts.
Always refer to the HDFS documentation for current recommendations and defaults.
Some of the settings important to HBase are listed here.
Defaults are current as of Hadoop 2.3.
Check the Hadoop documentation for the most current values and recommendations.
The HBase Balancer and HDFS Balancer are incompatible::
The HDFS balancer attempts to spread HDFS blocks evenly among DataNodes. HBase relies
on compactions to restore locality after a region split or failure. These two types
of balancing do not work well together.
+
In the past, the generally accepted advice was to turn off the HDFS load balancer and rely
on the HBase balancer, since the HDFS balancer would degrade locality. This advice
is still valid if your HDFS version is lower than 2.7.1.
+
link:https://issues.apache.org/jira/browse/HDFS-6133[HDFS-6133] provides the ability
to exclude favored-nodes (pinned) blocks from the HDFS load balancer, by setting the
`dfs.datanode.block-pinning.enabled` property to `true` in the HDFS service
configuration.
+
HBase can be enabled to use the HDFS favored-nodes feature by switching the HBase balancer
class (conf: `hbase.master.loadbalancer.class`) to `org.apache.hadoop.hbase.favored.FavoredNodeLoadBalancer`,
which is documented link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/favored/FavoredNodeLoadBalancer.html[here]; a brief configuration sketch follows this list.

NOTE: HDFS-6133 is available in HDFS 2.7.0 and higher, but HBase does not support
running on HDFS 2.7.0, so you must be using HDFS 2.7.1 or higher to use this feature
with HBase.
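
The sketch referenced above shows roughly what the switch looks like programmatically; in practice both settings belong in the respective XML configuration files, and the class name here is ours:

[source,java]
----
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class FavoredNodesSetup {
  public static void main(String[] args) {
    // HDFS side: keep the balancer away from pinned (favored-node) blocks.
    // In a real deployment this goes in the HDFS service configuration.
    Configuration hdfsConf = new Configuration();
    hdfsConf.setBoolean("dfs.datanode.block-pinning.enabled", true);

    // HBase side: switch the master to the favored-node-aware balancer.
    Configuration hbaseConf = HBaseConfiguration.create();
    hbaseConf.set("hbase.master.loadbalancer.class",
        "org.apache.hadoop.hbase.favored.FavoredNodeLoadBalancer");

    System.out.println(hbaseConf.get("hbase.master.loadbalancer.class"));
  }
}
----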
.Connection Timeouts
Connection timeouts occur between the client (HBase) and the HDFS DataNode.
They may occur when establishing a connection, attempting to read, or attempting to write.
The two settings below are used in combination, and affect connections between the `DFSClient` and the DataNode, the `ipc.Client` and the DataNode, and communication between two DataNodes.

`dfs.client.socket-timeout` (default: 60000)::
The amount of time before a client connection times out when establishing a connection or reading.
The value is expressed in milliseconds, so the default is 60 seconds.

`dfs.datanode.socket.write.timeout` (default: 480000)::
The amount of time before a write operation times out.
The default is 8 minutes, expressed as milliseconds.
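
A minimal sketch (the class name is ours; the fallbacks mirror the Hadoop 2.3-era defaults quoted above) that prints the timeouts in effect:

[source,java]
----
import org.apache.hadoop.conf.Configuration;

public class DfsTimeouts {
  public static void main(String[] args) {
    // Loads core-site.xml/hdfs-site.xml if they are on the classpath.
    Configuration conf = new Configuration();
    System.out.println("connect/read timeout: "
        + conf.getInt("dfs.client.socket-timeout", 60000) + " ms");
    System.out.println("write timeout: "
        + conf.getInt("dfs.datanode.socket.write.timeout", 480000) + " ms");
  }
}
----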
The following types of errors are often seen in the logs.

`INFO HDFS.DFSClient: Failed to connect to /xxx:50010, add to deadNodes and continue java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/region-server-1:50010]`::
All DataNodes for a block are dead, and recovery is not possible.
Here is the sequence of events that leads to this error:

`INFO org.apache.hadoop.HDFS.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/region-server-2:50010]`::
This type of error indicates a write issue.
In this case, the master wants to split the log.
It does not have a local DataNode so it tries to connect to a remote DataNode, but the DataNode is dead.
[[trouble.tests]]
== Running unit or integration tests

[[trouble.hdfs_2556]]
=== Runtime exceptions from MiniDFSCluster when running tests

If you see something like the following

[source]
----
java.lang.NullPointerException: null
at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes
at org.apache.hadoop.hdfs.MiniDFSCluster.<init>
at org.apache.hadoop.hbase.MiniHBaseCluster.<init>
at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniDFSCluster
at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster
..
----

or

[source]
----
java.io.IOException: Shutting down
at org.apache.hadoop.hbase.MiniHBaseCluster.init
at org.apache.hadoop.hbase.MiniHBaseCluster.<init>
at org.apache.hadoop.hbase.MiniHBaseCluster.<init>
at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster
at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster
..
----

\... then try issuing the command +umask 022+ before launching tests.
This is a workaround for link:https://issues.apache.org/jira/browse/HDFS-2556[HDFS-2556].
[[trouble.casestudy]]
== Case Studies

For Performance and Troubleshooting Case Studies, see <<casestudies>>.
[[trouble.crypto]]
== Cryptographic Features

[[trouble.crypto.hbase_10132]]
=== sun.security.pkcs11.wrapper.PKCS11Exception: CKR_ARGUMENTS_BAD

This problem manifests as exceptions ultimately caused by:

[source]
----
Caused by: sun.security.pkcs11.wrapper.PKCS11Exception: CKR_ARGUMENTS_BAD
at sun.security.pkcs11.wrapper.PKCS11.C_DecryptUpdate(Native Method)
at sun.security.pkcs11.P11Cipher.implDoFinal(P11Cipher.java:795)
----

This problem appears to affect some versions of OpenJDK 7 shipped by some Linux vendors.
NSS is configured as the default provider.
If the host has an x86_64 architecture, depending on whether the vendor packages contain the defect, the NSS provider will not function correctly.

To work around this problem, find the JRE home directory and edit the file _lib/security/java.security_.
Edit the file to comment out the line:

[source]
----
security.provider.1=sun.security.pkcs11.SunPKCS11 ${java.home}/lib/security/nss.cfg
----

Then renumber the remaining providers accordingly.
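
For a quick experiment before editing _java.security_, the same effect can be approximated at runtime. This sketch is ours, and it assumes the affected provider registers under a name starting with `SunPKCS11`, which is the usual naming convention; it removes the provider for the current JVM only:

[source,java]
----
import java.security.Provider;
import java.security.Security;

public class DisableNssProvider {
  public static void main(String[] args) {
    for (Provider provider : Security.getProviders()) {
      // Assumption: the NSS-backed provider is registered under a SunPKCS11* name.
      if (provider.getName().startsWith("SunPKCS11")) {
        Security.removeProvider(provider.getName());
        System.out.println("removed provider: " + provider.getName());
      }
    }
  }
}
----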
== Operating System Specific Issues

=== Page Allocation Failure

NOTE: This issue is known to affect CentOS 6.2 and possibly CentOS 6.5.
It may also affect some versions of Red Hat Enterprise Linux, according to https://bugzilla.redhat.com/show_bug.cgi?id=770545.

Some users have reported seeing the following error:

----
kernel: java: page allocation failure. order:4, mode:0x20
----

Raising the value of `min_free_kbytes` was reported to fix this problem.
This parameter is set to a percentage of the amount of RAM on your system, and is described in more detail at http://www.centos.org/docs/5/html/5.1/Deployment_Guide/s3-proc-sys-vm.html.

To find the current value on your system, run the following command:

----
[user@host]# cat /proc/sys/vm/min_free_kbytes
----

Next, raise the value.
Try doubling, then quadrupling the value.
Note that setting the value too low or too high could have detrimental effects on your system.
Consult your operating system vendor for specific recommendations.

Use the following command to modify the value of `min_free_kbytes`, substituting _<value>_ with your intended value:

----
[user@host]# echo <value> > /proc/sys/vm/min_free_kbytes
----
[[trouble.jdk]]
== JDK Issues

=== NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet

If you see this in your logs:

[source]
----
Caused by: java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
at org.apache.hadoop.hbase.master.ServerManager.findServerWithSameHostnamePortWithLock(ServerManager.java:393)
at org.apache.hadoop.hbase.master.ServerManager.checkAndRecordNewServer(ServerManager.java:307)
at org.apache.hadoop.hbase.master.ServerManager.regionServerStartup(ServerManager.java:244)
at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerStartup(MasterRpcServices.java:304)
at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:7910)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2020)
----

then check whether you compiled with JDK 8 and tried to run it on JDK 7.
If so, this won't work.
Run on JDK 8 or recompile with JDK 7.
See link:https://issues.apache.org/jira/browse/HBASE-10607[HBASE-10607 JDK8 NoSuchMethodError involving ConcurrentHashMap.keySet if running on JRE 7].
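
The root cause is that JDK 8 changed the return type of `ConcurrentHashMap.keySet()` to the new `KeySetView` class, so classes compiled on JDK 8 record a method signature that does not exist on a JDK 7 runtime. If you control the source, calling through the `Map` interface sidesteps the issue, as in this sketch (ours, for illustration; it is not the HBASE-10607 patch itself):

[source,java]
----
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class KeySetLinkage {
  public static void main(String[] args) {
    // Declaring the reference as Map makes the call site bind to
    // Map.keySet(), which returns java.util.Set on both JDK 7 and JDK 8.
    Map<String, Integer> map = new ConcurrentHashMap<>();
    map.put("example", 1);
    for (String key : map.keySet()) {
      System.out.println(key);
    }
  }
}
----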