src/main/asciidoc/_chapters/case_studies.adoc

   1 ////
   2 /**
   3  *
   4  * Licensed to the Apache Software Foundation (ASF) under one
   5  * or more contributor license agreements.  See the NOTICE file
   6  * distributed with this work for additional information
   7  * regarding copyright ownership.  The ASF licenses this file
   8  * to you under the Apache License, Version 2.0 (the
   9  * "License"); you may not use this file except in compliance
  10  * with the License.  You may obtain a copy of the License at
  11  *
  12  *     http://www.apache.org/licenses/LICENSE-2.0
  13  *
  14  * Unless required by applicable law or agreed to in writing, software
  15  * distributed under the License is distributed on an "AS IS" BASIS,
  16  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  17  * See the License for the specific language governing permissions and
  18  * limitations under the License.
  19  */
  20 ////
  21
  22 [[casestudies]]
  23 = Apache HBase Case Studies
  24 :doctype: book
  25 :numbered:
  26 :toc: left
  27 :icons: font
  28 :experimental:
  29
  30 [[casestudies.overview]]
  31 == Overview
  32
  33 This chapter describes a variety of performance and troubleshooting case studies that can serve as a blueprint for diagnosing Apache HBase cluster issues.
  34
  35 For more information on Performance and Troubleshooting, see <<performance>> and <<trouble>>.
  36
  37 [[casestudies.schema]]
  38 == Schema Design
  39
  40 See the schema design case studies here: <<schema.casestudies>>
  41
  42 [[casestudies.perftroub]]
  43 == Performance/Troubleshooting
  44
  45 [[casestudies.slownode]]
  46 === Case Study #1 (Performance Issue On A Single Node)
  47
  48 ==== Scenario
  49
  50 Following a scheduled reboot, one data node began exhibiting unusual behavior.
  51 Routine MapReduce jobs run against HBase tables, which regularly completed in five or six minutes, began taking 30 or 40 minutes to finish.
  52 These jobs were consistently found to be waiting on map and reduce tasks assigned to the troubled data node (e.g., the slow map tasks all had the same Input Split). The situation came to a head during a distributed copy, when the copy was severely prolonged by the lagging node.
  53
  54 ==== Hardware
  55
  56 .Datanodes:
  57 * Two 12-core processors
  58 * Six Enterprise SATA disks
  59 * 24GB of RAM
  60 * Two bonded gigabit NICs
  61
  62 .Network:
  63 * 10 Gigabit top-of-rack switches
  64 * 20 Gigabit bonded interconnects between racks
  65
  66 ==== Hypotheses
  67
  68 ===== HBase "Hot Spot" Region
  69
  70 We hypothesized that we were experiencing a familiar point of pain: a "hot spot" region in an HBase table, where uneven key-space distribution can funnel a huge number of requests to a single HBase region, bombarding the RegionServer process and causing slow response times.
  71 Examination of the HBase Master status page showed that the number of HBase requests to the troubled node was almost zero.
  72 Further, examination of the HBase logs showed that there were no region splits, compactions, or other region transitions in progress.
  73 This effectively ruled out a "hot spot" as the root cause of the observed slowness.
  74
  75 ===== HBase Region With Non-Local Data
  76
  77 Our next hypothesis was that one of the MapReduce tasks was requesting data from HBase that was not local to the DataNode, thus forcing HDFS to request data blocks from other servers over the network.
  78 Examination of the DataNode logs showed that there were very few blocks being requested over the network, indicating that the HBase region was correctly assigned, and that the majority of the necessary data was located on the node.
  79 This ruled out the possibility of non-local data causing a slowdown.
  80
  81 ===== Excessive I/O Wait Due To Swapping Or An Over-Worked Or Failing Hard Disk
  82
  83 After concluding that Hadoop and HBase were not likely to be the culprits, we moved on to troubleshooting the DataNode's hardware.
  84 Java, by design, will periodically scan its entire memory space to do garbage collection.
  85 If system memory is heavily overcommitted, the Linux kernel may enter a vicious cycle, using up all of its resources swapping Java heap back and forth from disk to RAM as Java tries to run garbage collection.
  86 Further, a failing hard disk will often retry reads and/or writes many times before giving up and returning an error.
  87 This can manifest as high iowait, as running processes wait for reads and writes to complete.
  88 Finally, a disk nearing the upper edge of its performance envelope will begin to cause iowait as it informs the kernel that it cannot accept any more data, and the kernel queues incoming data into the dirty write pool in memory.
  89 However, using `vmstat(1)` and `free(1)`, we could see that no swap was being used, and the amount of disk IO was only a few kilobytes per second.
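
A check like this one can also be scripted. The following is a minimal Python sketch, assuming the classic Linux `vmstat` column layout (`swpd`, `si`, `so`, `bi`, `bo`). The function names, the 1024-blocks-per-second threshold, and the sample output are illustrative assumptions and were not taken from the incident.

```python
def parse_vmstat(text):
    """Parse `vmstat` output into a list of {column: int} dicts, one per sample row."""
    rows = [line.split() for line in text.strip().splitlines()]
    # The column-name row is the one containing "swpd"; data rows follow it.
    header_idx = next(i for i, cols in enumerate(rows) if "swpd" in cols)
    header = rows[header_idx]
    return [dict(zip(header, map(int, row))) for row in rows[header_idx + 1:]]


def swap_or_disk_pressure(sample, max_blocks_per_s=1024):
    """Return True if a sample shows swap in use, swap traffic, or heavy block I/O."""
    swapping = sample["swpd"] > 0 or sample["si"] > 0 or sample["so"] > 0
    # bi/bo are blocks read from / written to block devices per second.
    busy_disk = sample["bi"] + sample["bo"] > max_blocks_per_s  # hypothetical threshold
    return swapping or busy_disk


# Made-up sample for illustration only (not output from the troubled node).
SAMPLE_VMSTAT = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 812345  20480 901234    0    0     2     5  310  520  3  1 95  1  0
"""

samples = parse_vmstat(SAMPLE_VMSTAT)
print(swap_or_disk_pressure(samples[0]))
```

A result of `False` for every sample corresponds to what we observed: no swap in use and only a trickle of disk I/O.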
  90
  91 ===== Slowness Due To High Processor Usage
  92
  93 Next, we checked to see whether the system was performing slowly simply due to very high computational load. `top(1)` showed that the system load was higher than normal, but `vmstat(1)` and `mpstat(1)` showed that the amount of processor time being used for actual computation was low.
  94
  95 ===== Network Saturation (The Winner)
  96
  97 Since neither the disks nor the processors were being utilized heavily, we moved on to the performance of the network interfaces.
  98 The DataNode had two gigabit ethernet adapters, bonded to form an active-standby interface. `ifconfig(8)` showed some anomalies, namely interface errors, overruns, and framing errors.
  99 While not unheard of, these kinds of errors are exceedingly rare on modern hardware that is operating as it should:
 100
 101 ----
 102
 103 $ /sbin/ifconfig bond0
 104 bond0  Link encap:Ethernet  HWaddr 00:00:00:00:00:00
 105 inet addr:10.x.x.x  Bcast:10.x.x.255  Mask:255.255.255.0
 106 UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
 107 RX packets:2990700159 errors:12 dropped:0 overruns:1 frame:6          <--- Look Here! Errors!
 108 TX packets:3443518196 errors:0 dropped:0 overruns:0 carrier:0
 109 collisions:0 txqueuelen:0
 110 RX bytes:2416328868676 (2.4 TB)  TX bytes:3464991094001 (3.4 TB)
 111 ----
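The counters on the `RX` line can also be extracted programmatically. The following is a minimal Python sketch, assuming the older net-tools `ifconfig` output format shown above (newer versions lay the counters out differently). The function name `rx_error_counters` is hypothetical and not part of any HBase or Hadoop tooling.

```python
import re


def rx_error_counters(ifconfig_text):
    """Extract the receive counters from old-style `ifconfig` output."""
    match = re.search(
        r"RX packets:(\d+) errors:(\d+) dropped:(\d+) overruns:(\d+) frame:(\d+)",
        ifconfig_text,
    )
    if match is None:
        raise ValueError("no RX counter line found (unexpected ifconfig format?)")
    names = ("packets", "errors", "dropped", "overruns", "frame")
    return dict(zip(names, map(int, match.groups())))


# The RX line from the bond0 output above.
SAMPLE_RX = "RX packets:2990700159 errors:12 dropped:0 overruns:1 frame:6"

counters = rx_error_counters(SAMPLE_RX)
# Keep only the error-type counters that are non-zero.
nonzero = {k: v for k, v in counters.items() if k != "packets" and v > 0}
print(nonzero)  # {'errors': 12, 'overruns': 1, 'frame': 6}
```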
 112
 113 These errors immediately led us to suspect that one or more of the ethernet interfaces might have negotiated the wrong line speed.
 114 This was confirmed both by running an ICMP ping from an external host and observing round-trip times in excess of 700ms, and by running `ethtool(8)` on the members of the bond interface and discovering that the active interface was operating at 100Mb/s, full duplex.
 115
 116 ----
 117
 118 $ sudo ethtool eth0
 119 Settings for eth0:
 120 Supported ports: [ TP ]
 121 Supported link modes:   10baseT/Half 10baseT/Full
 122                        100baseT/Half 100baseT/Full
 123                        1000baseT/Full
 124 Supports auto-negotiation: Yes
 125 Advertised link modes:  10baseT/Half 10baseT/Full
 126                        100baseT/Half 100baseT/Full
 127                        1000baseT/Full
 128 Advertised pause frame use: No
 129 Advertised auto-negotiation: Yes
 130 Link partner advertised link modes:  Not reported
 131 Link partner advertised pause frame use: No
 132 Link partner advertised auto-negotiation: No
 133 Speed: 100Mb/s                                     <--- Look Here!  Should say 1000Mb/s!
 134 Duplex: Full
 135 Port: Twisted Pair
 136 PHYAD: 1
 137 Transceiver: internal
 138 Auto-negotiation: on
 139 MDI-X: Unknown
 140 Supports Wake-on: umbg
 141 Wake-on: g
 142 Current message level: 0x00000003 (3)
 143 Link detected: yes
 144 ----
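A check for a mis-negotiated link can be automated along the same lines. The following is a minimal Python sketch, assuming `ethtool` output containing `Speed:` and `Duplex:` lines as shown above. The function names and the default of 1000 Mb/s are illustrative assumptions for a gigabit adapter.

```python
import re


def link_settings(ethtool_text):
    """Return (speed_in_mbps, duplex) parsed from `ethtool <iface>` output."""
    speed = re.search(r"^\s*Speed:\s*(\d+)Mb/s", ethtool_text, re.MULTILINE)
    duplex = re.search(r"^\s*Duplex:\s*(\w+)", ethtool_text, re.MULTILINE)
    if speed is None or duplex is None:
        raise ValueError("Speed/Duplex not reported (link down or unexpected format?)")
    return int(speed.group(1)), duplex.group(1)


def link_is_degraded(ethtool_text, expected_mbps=1000):
    """True if the link runs below the expected speed or is not full duplex."""
    mbps, duplex = link_settings(ethtool_text)
    return mbps < expected_mbps or duplex != "Full"


# The two relevant lines from the eth0 output above.
SAMPLE_ETHTOOL = """\
Speed: 100Mb/s
Duplex: Full
"""

print(link_settings(SAMPLE_ETHTOOL))     # (100, 'Full')
print(link_is_degraded(SAMPLE_ETHTOOL))  # True
```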
 145
 146 In normal operation, the ICMP ping round-trip time should be around 20ms, and the interface speed and duplex should read "1000Mb/s" and "Full", respectively.
 147
 148 ==== Resolution
 149
 150 After determining that the active ethernet adapter was at the incorrect speed, we used the `ifenslave(8)` command to make the standby interface the active interface, which yielded an immediate improvement in MapReduce performance and a tenfold improvement in network throughput.
 151
 152 On the next trip to the datacenter, we determined that the line speed issue was ultimately caused by a bad network cable, which was replaced.
 153
 154 [[casestudies.perf.1]]
 155 === Case Study #2 (Performance Research 2012)
 156
 157 Investigation results of a self-described "we're not sure what's wrong, but it seems slow" problem. http://gbif.blogspot.com/2012/03/hbase-performance-evaluation-continued.html
 158
 159 [[casestudies.perf.2]]
 160 === Case Study #3 (Performance Research 2010)
 161
 162 Investigation results of general cluster performance from 2010.
 163 Although this research is on an older version of the codebase, this writeup is still very useful in terms of approach. http://hstack.org/hbase-performance-testing/
 164
 165 [[casestudies.max.transfer.threads]]
 166 === Case Study #4 (max.transfer.threads Config)
 167
 168 Case study of configuring `max.transfer.threads` (previously known as `xcievers`) and diagnosing errors from misconfigurations. http://www.larsgeorge.com/2012/03/hadoop-hbase-and-xceivers.html
 169
 170 See also <<dfs.datanode.max.transfer.threads>>.