src/main/asciidoc/_chapters/schema_design.adoc

   1 ////
   2 /**
   3  *
   4  * Licensed to the Apache Software Foundation (ASF) under one
   5  * or more contributor license agreements.  See the NOTICE file
   6  * distributed with this work for additional information
   7  * regarding copyright ownership.  The ASF licenses this file
   8  * to you under the Apache License, Version 2.0 (the
   9  * "License"); you may not use this file except in compliance
  10  * with the License.  You may obtain a copy of the License at
  11  *
  12  *     http://www.apache.org/licenses/LICENSE-2.0
  13  *
  14  * Unless required by applicable law or agreed to in writing, software
  15  * distributed under the License is distributed on an "AS IS" BASIS,
  16  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  17  * See the License for the specific language governing permissions and
  18  * limitations under the License.
  19  */
  20 ////
  21
  22 [[schema]]
  23 = HBase and Schema Design
  24 :doctype: book
  25 :numbered:
  26 :toc: left
  27 :icons: font
  28 :experimental:
  29
  30 A good introduction on the strength and weaknesses modelling on the various non-rdbms datastores is
  31 to be found in Ian Varley's Master thesis,
  32 link:http://ianvarley.com/UT/MR/Varley_MastersReport_Full_2009-08-07.pdf[No Relation: The Mixed Blessings of Non-Relational Databases].
  33 It is a little dated now but a good background read if you have a moment on how HBase schema modeling
  34 differs from how it is done in an RDBMS. Also,
  35 read <<keyvalue,keyvalue>> for how HBase stores data internally, and the section on <<schema.casestudies,schema.casestudies>>.
  36
  37 The documentation on the Cloud Bigtable website, link:https://cloud.google.com/bigtable/docs/schema-design[Designing Your Schema],
  38 is pertinent and nicely done and lessons learned there equally apply here in HBase land; just divide
  39 any quoted values by ~10 to get what works for HBase: e.g. where it says individual values can be ~10MBs in size, HBase can do similar -- perhaps best
  40 to go smaller if you can -- and where it says a maximum of 100 column families in Cloud Bigtable, think ~10 when
  41 modeling on HBase.
  42
  43 See also Robert Yokota's link:https://blogs.apache.org/hbase/entry/hbase-application-archetypes-redux[HBase Application Archetypes]
  44 (an update on work done by other HBasers), for a helpful categorization of use cases that do well on top of the HBase model.
  45
  46
  47 [[schema.creation]]
  48 ==  Schema Creation
  49
  50 HBase schemas can be created or updated using the <<shell>> or by using link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Admin.html[Admin] in the Java API.
  51
  52 Tables must be disabled when making ColumnFamily modifications, for example:
  53
  54 [source,java]
  55 ----
  56
  57 Configuration config = HBaseConfiguration.create();
  58 Admin admin = new Admin(conf);
  59 TableName table = TableName.valueOf("myTable");
  60
  61 admin.disableTable(table);
  62
  63 HColumnDescriptor cf1 = ...;
  64 admin.addColumn(table, cf1);      // adding new ColumnFamily
  65 HColumnDescriptor cf2 = ...;
  66 admin.modifyColumn(table, cf2);    // modifying existing ColumnFamily
  67
  68 admin.enableTable(table);
  69 ----
  70
  71 See <<client_dependencies,client dependencies>> for more information about configuring client connections.
  72
  73 NOTE: online schema changes are supported in the 0.92.x codebase, but the 0.90.x codebase requires the table to be disabled.
  74
  75 [[schema.updates]]
  76 === Schema Updates
  77
  78 When changes are made to either Tables or ColumnFamilies (e.g. region size, block size), these changes take effect the next time there is a major compaction and the StoreFiles get re-written.
  79
  80 See <<store,store>> for more information on StoreFiles.
  81
  82 [[table_schema_rules_of_thumb]]
  83 == Table Schema Rules Of Thumb
  84
  85 There are many different data sets, with different access patterns and service-level
  86 expectations. Therefore, these rules of thumb are only an overview. Read the rest
  87 of this chapter to get more details after you have gone through this list.
  88
  89 * Aim to have regions sized between 10 and 50 GB.
  90 * Aim to have cells no larger than 10 MB, or 50 MB if you use <<hbase_mob,mob>>. Otherwise,
  91 consider storing your cell data in HDFS and store a pointer to the data in HBase.
  92 * A typical schema has between 1 and 3 column families per table. HBase tables should
  93 not be designed to mimic RDBMS tables.
  94 * Around 50-100 regions is a good number for a table with 1 or 2 column families.
  95 Remember that a region is a contiguous segment of a column family.
  96 * Keep your column family names as short as possible. The column family names are
  97 stored for every value (ignoring prefix encoding). They should not be self-documenting
  98 and descriptive like in a typical RDBMS.
  99 * If you are storing time-based machine data or logging information, and the row key
 100 is based on device ID or service ID plus time, you can end up with a pattern where
 101 older data regions never have additional writes beyond a certain age. In this type
 102 of situation, you end up with a small number of active regions and a large number
 103 of older regions which have no new writes. For these situations, you can tolerate
 104 a larger number of regions because your resource consumption is driven by the active
 105 regions only.
 106 * If only one column family is busy with writes, only that column family accomulates
 107 memory. Be aware of write patterns when allocating resources.
 108
 109 [[regionserver_sizing_rules_of_thumb]]
 110 = RegionServer Sizing Rules of Thumb
 111
 112 Lars Hofhansl wrote a great
 113 link:http://hadoop-hbase.blogspot.com/2013/01/hbase-region-server-memory-sizing.html[blog post]
 114 about RegionServer memory sizing. The upshot is that you probably need more memory
 115 than you think you need. He goes into the impact of region size, memstore size, HDFS
 116 replication factor, and other things to check.
 117
 118 [quote, Lars Hofhansl, http://hadoop-hbase.blogspot.com/2013/01/hbase-region-server-memory-sizing.html]
 119 ____
 120 Personally I would place the maximum disk space per machine that can be served
 121 exclusively with HBase around 6T, unless you have a very read-heavy workload.
 122 In that case the Java heap should be 32GB (20G regions, 128M memstores, the rest
 123 defaults).
 124 ____
 125
 126 [[number.of.cfs]]
 127 ==  On the number of column families
 128
 129 HBase currently does not do well with anything above two or three column families so keep the number of column families in your schema low.
 130 Currently, flushing is done on a per Region basis so if one column family is carrying the bulk of the data bringing on flushes, the adjacent families will also be flushed even though the amount of data they carry is small.
 131 When many column families exist the flushing interaction can make for a bunch of needless i/o (To be addressed by changing flushing to work on a per column family basis).
 132 In addition, compactions triggered at table/region level will happen per store too.
 133
 134 Try to make do with one column family if you can in your schemas.
 135 Only introduce a second and third column family in the case where data access is usually column scoped; i.e.
 136 you query one column family or the other but usually not both at the one time.
 137
 138 [[number.of.cfs.card]]
 139 === Cardinality of ColumnFamilies
 140
 141 Where multiple ColumnFamilies exist in a single table, be aware of the cardinality (i.e., number of rows). If ColumnFamilyA has 1 million rows and ColumnFamilyB has 1 billion rows, ColumnFamilyA's data will likely be spread across many, many regions (and RegionServers). This makes mass scans for ColumnFamilyA less efficient.
 142
 143 [[rowkey.design]]
 144 == Rowkey Design
 145
 146 === Hotspotting
 147
 148 Rows in HBase are sorted lexicographically by row key.
 149 This design optimizes for scans, allowing you to store related rows, or rows that will be read together, near each other.
 150 However, poorly designed row keys are a common source of [firstterm]_hotspotting_.
 151 Hotspotting occurs when a large amount of client traffic is directed at one node, or only a few nodes, of a cluster.
 152 This traffic may represent reads, writes, or other operations.
 153 The traffic overwhelms the single machine responsible for hosting that region, causing performance degradation and potentially leading to region unavailability.
 154 This can also have adverse effects on other regions hosted by the same region server as that host is unable to service the requested load.
 155 It is important to design data access patterns such that the cluster is fully and evenly utilized.
 156
 157 To prevent hotspotting on writes, design your row keys such that rows that truly do need to be in the same region are, but in the bigger picture, data is being written to multiple regions across the cluster, rather than one at a time.
 158 Some common techniques for avoiding hotspotting are described below, along with some of their advantages and drawbacks.
 159
 160 .Salting
 161 Salting in this sense has nothing to do with cryptography, but refers to adding random data to the start of a row key.
 162 In this case, salting refers to adding a randomly-assigned prefix to the row key to cause it to sort differently than it otherwise would.
 163 The number of possible prefixes correspond to the number of regions you want to spread the data across.
 164 Salting can be helpful if you have a few "hot" row key patterns which come up over and over amongst other more evenly-distributed rows.
 165 Consider the following example, which shows that salting can spread write load across multiple RegionServers, and illustrates some of the negative implications for reads.
 166
 167 .Salting Example
 168 ====
 169 Suppose you have the following list of row keys, and your table is split such that there is one region for each letter of the alphabet.
 170 Prefix 'a' is one region, prefix 'b' is another.
 171 In this table, all rows starting with 'f' are in the same region.
 172 This example focuses on rows with keys like the following:
 173
 174 ----
 175
 176 foo0001
 177 foo0002
 178 foo0003
 179 foo0004
 180 ----
 181
 182 Now, imagine that you would like to spread these across four different regions.
 183 You decide to use four different salts: `a`, `b`, `c`, and `d`.
 184 In this scenario, each of these letter prefixes will be on a different region.
 185 After applying the salts, you have the following rowkeys instead.
 186 Since you can now write to four separate regions, you theoretically have four times the throughput when writing that you would have if all the writes were going to the same region.
 187
 188 ----
 189
 190 a-foo0003
 191 b-foo0001
 192 c-foo0004
 193 d-foo0002
 194 ----
 195
 196 Then, if you add another row, it will randomly be assigned one of the four possible salt values and end up near one of the existing rows.
 197
 198 ----
 199
 200 a-foo0003
 201 b-foo0001
 202 c-foo0003
 203 c-foo0004
 204 d-foo0002
 205 ----
 206
 207 Since this assignment will be random, you will need to do more work if you want to retrieve the rows in lexicographic order.
 208 In this way, salting attempts to increase throughput on writes, but has a cost during reads.
 209 ====
 210
 211
 212
 213 .Hashing
 214 Instead of a random assignment, you could use a one-way [firstterm]_hash_ that would cause a given row to always be "salted" with the same prefix, in a way that would spread the load across the RegionServers, but allow for predictability during reads.
 215 Using a deterministic hash allows the client to reconstruct the complete rowkey and use a Get operation to retrieve that row as normal.
 216
 217 .Hashing Example
 218 [example]
 219 Given the same situation in the salting example above, you could instead apply a one-way hash that would cause the row with key `foo0003` to always, and predictably, receive the `a` prefix.
 220 Then, to retrieve that row, you would already know the key.
 221 You could also optimize things so that certain pairs of keys were always in the same region, for instance.
 222
 223 .Reversing the Key
 224 A third common trick for preventing hotspotting is to reverse a fixed-width or numeric row key so that the part that changes the most often (the least significant digit) is first.
 225 This effectively randomizes row keys, but sacrifices row ordering properties.
 226
 227 See https://communities.intel.com/community/itpeernetwork/datastack/blog/2013/11/10/discussion-on-designing-hbase-tables, and link:https://phoenix.apache.org/salted.html[article on Salted Tables] from the Phoenix project, and the discussion in the comments of link:https://issues.apache.org/jira/browse/HBASE-11682[HBASE-11682] for more information about avoiding hotspotting.
 228
 229 [[timeseries]]
 230 ===  Monotonically Increasing Row Keys/Timeseries Data
 231
 232 In the HBase chapter of Tom White's book link:http://oreilly.com/catalog/9780596521981[Hadoop: The Definitive Guide] (O'Reilly) there is a an optimization note on watching out for a phenomenon where an import process walks in lock-step with all clients in concert pounding one of the table's regions (and thus, a single node), then moving onto the next region, etc.
 233 With monotonically increasing row-keys (i.e., using a timestamp), this will happen.
 234 See this comic by IKai Lan on why monotonically increasing row keys are problematic in BigTable-like datastores: link:http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/[monotonically increasing values are bad].
 235 The pile-up on a single region brought on by monotonically increasing keys can be mitigated by randomizing the input records to not be in sorted order, but in general it's best to avoid using a timestamp or a sequence (e.g. 1, 2, 3) as the row-key.
 236
 237 If you do need to upload time series data into HBase, you should study link:http://opentsdb.net/[OpenTSDB] as a successful example.
 238 It has a page describing the link:http://opentsdb.net/schema.html[schema] it uses in HBase.
 239 The key format in OpenTSDB is effectively [metric_type][event_timestamp], which would appear at first glance to contradict the previous advice about not using a timestamp as the key.
 240 However, the difference is that the timestamp is not in the _lead_ position of the key, and the design assumption is that there are dozens or hundreds (or more) of different metric types.
 241 Thus, even with a continual stream of input data with a mix of metric types, the Puts are distributed across various points of regions in the table.
 242
 243 See <<schema.casestudies,schema.casestudies>> for some rowkey design examples.
 244
 245 [[keysize]]
 246 === Try to minimize row and column sizes
 247
 248 In HBase, values are always freighted with their coordinates; as a cell value passes through the system, it'll be accompanied by its row, column name, and timestamp - always.
 249 If your rows and column names are large, especially compared to the size of the cell value, then you may run up against some interesting scenarios.
 250 One such is the case described by Marc Limotte at the tail of link:https://issues.apache.org/jira/browse/HBASE-3551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005272#comment-13005272[HBASE-3551] (recommended!). Therein, the indices that are kept on HBase storefiles (<<hfile>>) to facilitate random access may end up occupying large chunks of the HBase allotted RAM because the cell value coordinates are large.
 251 Mark in the above cited comment suggests upping the block size so entries in the store file index happen at a larger interval or modify the table schema so it makes for smaller rows and column names.
 252 Compression will also make for larger indices.
 253 See the thread link:https://lists.apache.org/thread.html/b158eae5d8888d3530be378298bca90c17f80982fdcdfa01d0844c3d%401306240189%40%3Cuser.hbase.apache.org%3E[a question storefileIndexSize] up on the user mailing list.
 254
 255 Most of the time small inefficiencies don't matter all that much. Unfortunately, this is a case where they do.
 256 Whatever patterns are selected for ColumnFamilies, attributes, and rowkeys they could be repeated several billion times in your data.
 257
 258 See <<keyvalue,keyvalue>> for more information on HBase stores data internally to see why this is important.
 259
 260 [[keysize.cf]]
 261 ==== Column Families
 262
 263 Try to keep the ColumnFamily names as small as possible, preferably one character (e.g. "d" for data/default).
 264
 265 See <<keyvalue>> for more information on HBase stores data internally to see why this is important.
 266
 267 [[keysize.attributes]]
 268 ==== Attributes
 269
 270 Although verbose attribute names (e.g., "myVeryImportantAttribute") are easier to read, prefer shorter attribute names (e.g., "via") to store in HBase.
 271
 272 See <<keyvalue,keyvalue>> for more information on HBase stores data internally to see why this is important.
 273
 274 [[keysize.row]]
 275 ==== Rowkey Length
 276
 277 Keep them as short as is reasonable such that they can still be useful for required data access (e.g. Get vs.
 278 Scan). A short key that is useless for data access is not better than a longer key with better get/scan properties.
 279 Expect tradeoffs when designing rowkeys.
 280
 281 [[keysize.patterns]]
 282 ==== Byte Patterns
 283
 284 A long is 8 bytes.
 285 You can store an unsigned number up to 18,446,744,073,709,551,615 in those eight bytes.
 286 If you stored this number as a String -- presuming a byte per character -- you need nearly 3x the bytes.
 287
 288 Not convinced? Below is some sample code that you can run on your own.
 289
 290 [source,java]
 291 ----
 292
 293 // long
 294 //
 295 long l = 1234567890L;
 296 byte[] lb = Bytes.toBytes(l);
 297 System.out.println("long bytes length: " + lb.length);   // returns 8
 298
 299 String s = String.valueOf(l);
 300 byte[] sb = Bytes.toBytes(s);
 301 System.out.println("long as string length: " + sb.length);    // returns 10
 302
 303 // hash
 304 //
 305 MessageDigest md = MessageDigest.getInstance("MD5");
 306 byte[] digest = md.digest(Bytes.toBytes(s));
 307 System.out.println("md5 digest bytes length: " + digest.length);    // returns 16
 308
 309 String sDigest = new String(digest);
 310 byte[] sbDigest = Bytes.toBytes(sDigest);
 311 System.out.println("md5 digest as string length: " + sbDigest.length);    // returns 26
 312 ----
 313
 314 Unfortunately, using a binary representation of a type will make your data harder to read outside of your code.
 315 For example, this is what you will see in the shell when you increment a value:
 316
 317 [source]
 318 ----
 319
 320 hbase(main):001:0> incr 't', 'r', 'f:q', 1
 321 COUNTER VALUE = 1
 322
 323 hbase(main):002:0> get 't', 'r'
 324 COLUMN                                        CELL
 325  f:q                                          timestamp=1369163040570, value=\x00\x00\x00\x00\x00\x00\x00\x01
 326 1 row(s) in 0.0310 seconds
 327 ----
 328
 329 The shell makes a best effort to print a string, and it this case it decided to just print the hex.
 330 The same will happen to your row keys inside the region names.
 331 It can be okay if you know what's being stored, but it might also be unreadable if arbitrary data can be put in the same cells.
 332 This is the main trade-off.
 333
 334 [[reverse.timestamp]]
 335 === Reverse Timestamps
 336
 337 .Reverse Scan API
 338 [NOTE]
 339 ====
 340 link:https://issues.apache.org/jira/browse/HBASE-4811[HBASE-4811] implements an API to scan a table or a range within a table in reverse, reducing the need to optimize your schema for forward or reverse scanning.
 341 This feature is available in HBase 0.98 and later.
 342 See link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setReversed-boolean-[Scan.setReversed()] for more information.
 343 ====
 344
 345 A common problem in database processing is quickly finding the most recent version of a value.
 346 A technique using reverse timestamps as a part of the key can help greatly with a special case of this problem.
 347 Also found in the HBase chapter of Tom White's book Hadoop: The Definitive Guide (O'Reilly), the technique involves appending (`Long.MAX_VALUE - timestamp`) to the end of any key, e.g. [key][reverse_timestamp].
 348
 349 The most recent value for [key] in a table can be found by performing a Scan for [key] and obtaining the first record.
 350 Since HBase keys are in sorted order, this key sorts before any older row-keys for [key] and thus is first.
 351
 352 This technique would be used instead of using <<schema.versions>> where the intent is to hold onto all versions "forever" (or a very long time) and at the same time quickly obtain access to any other version by using the same Scan technique.
 353
 354 [[rowkey.scope]]
 355 === Rowkeys and ColumnFamilies
 356
 357 Rowkeys are scoped to ColumnFamilies.
 358 Thus, the same rowkey could exist in each ColumnFamily that exists in a table without collision.
 359
 360 [[changing.rowkeys]]
 361 === Immutability of Rowkeys
 362
 363 Rowkeys cannot be changed.
 364 The only way they can be "changed" in a table is if the row is deleted and then re-inserted.
 365 This is a fairly common question on the HBase dist-list so it pays to get the rowkeys right the first time (and/or before you've inserted a lot of data).
 366
 367 [[rowkey.regionsplits]]
 368 === Relationship Between RowKeys and Region Splits
 369
 370 If you pre-split your table, it is _critical_ to understand how your rowkey will be distributed across the region boundaries.
 371 As an example of why this is important, consider the example of using displayable hex characters as the lead position of the key (e.g., "0000000000000000" to "ffffffffffffffff"). Running those key ranges through `Bytes.split` (which is the split strategy used when creating regions in `Admin.createTable(byte[] startKey, byte[] endKey, numRegions)` for 10 regions will generate the following splits...
 372
 373 ----
 374
 375 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48                                // 0
 376 54 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10                 // 6
 377 61 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -68                 // =
 378 68 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -126  // D
 379 75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 72                                // K
 380 82 18 18 18 18 18 18 18 18 18 18 18 18 18 18 14                                // R
 381 88 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -44                 // X
 382 95 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -102                // _
 383 102 102 102 102 102 102 102 102 102 102 102 102 102 102 102 102                // f
 384 ----
 385
 386 (note: the lead byte is listed to the right as a comment.) Given that the first split is a '0' and the last split is an 'f', everything is great, right? Not so fast.
 387
 388 The problem is that all the data is going to pile up in the first 2 regions and the last region thus creating a "lumpy" (and possibly "hot") region problem.
 389 To understand why, refer to an link:http://www.asciitable.com[ASCII Table].
 390 '0' is byte 48, and 'f' is byte 102, but there is a huge gap in byte values (bytes 58 to 96) that will _never appear in this keyspace_ because the only values are [0-9] and [a-f]. Thus, the middle regions will never be used.
 391 To make pre-splitting work with this example keyspace, a custom definition of splits (i.e., and not relying on the built-in split method) is required.
 392
 393 Lesson #1: Pre-splitting tables is generally a best practice, but you need to pre-split them in such a way that all the regions are accessible in the keyspace.
 394 While this example demonstrated the problem with a hex-key keyspace, the same problem can happen with _any_ keyspace.
 395 Know your data.
 396
 397 Lesson #2: While generally not advisable, using hex-keys (and more generally, displayable data) can still work with pre-split tables as long as all the created regions are accessible in the keyspace.
 398
 399 To conclude this example, the following is an example of how appropriate splits can be pre-created for hex-keys:.
 400
 401 [source,java]
 402 ----
 403 public static boolean createTable(Admin admin, HTableDescriptor table, byte[][] splits)
 404 throws IOException {
 405   try {
 406     admin.createTable( table, splits );
 407     return true;
 408   } catch (TableExistsException e) {
 409     logger.info("table " + table.getNameAsString() + " already exists");
 410     // the table already exists...
 411     return false;
 412   }
 413 }
 414
 415 public static byte[][] getHexSplits(String startKey, String endKey, int numRegions) {
 416   byte[][] splits = new byte[numRegions-1][];
 417   BigInteger lowestKey = new BigInteger(startKey, 16);
 418   BigInteger highestKey = new BigInteger(endKey, 16);
 419   BigInteger range = highestKey.subtract(lowestKey);
 420   BigInteger regionIncrement = range.divide(BigInteger.valueOf(numRegions));
 421   lowestKey = lowestKey.add(regionIncrement);
 422   for(int i=0; i < numRegions-1;i++) {
 423     BigInteger key = lowestKey.add(regionIncrement.multiply(BigInteger.valueOf(i)));
 424     byte[] b = String.format("%016x", key).getBytes();
 425     splits[i] = b;
 426   }
 427   return splits;
 428 }
 429 ----
 430
 431 [[schema.versions]]
 432 ==  Number of Versions
 433
 434 [[schema.versions.max]]
 435 === Maximum Number of Versions
 436
 437 The maximum number of row versions to store is configured per column family via link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html[HColumnDescriptor].
 438 The default for max versions is 1.
 439 This is an important parameter because as described in <<datamodel>> section HBase does _not_ overwrite row values, but rather stores different values per row by time (and qualifier). Excess versions are removed during major compactions.
 440 The number of max versions may need to be increased or decreased depending on application needs.
 441
 442 It is not recommended setting the number of max versions to an exceedingly high level (e.g., hundreds or more) unless those old values are very dear to you because this will greatly increase StoreFile size.
 443
 444 [[schema.minversions]]
 445 ===  Minimum Number of Versions
 446
 447 Like maximum number of row versions, the minimum number of row versions to keep is configured per column family via link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html[HColumnDescriptor].
 448 The default for min versions is 0, which means the feature is disabled.
 449 The minimum number of row versions parameter is used together with the time-to-live parameter and can be combined with the number of row versions parameter to allow configurations such as "keep the last T minutes worth of data, at most N versions, _but keep at least M versions around_" (where M is the value for minimum number of row versions, M<N). This parameter should only be set when time-to-live is enabled for a column family and must be less than the number of row versions.
 450
 451 [[supported.datatypes]]
 452 ==  Supported Datatypes
 453
 454 HBase supports a "bytes-in/bytes-out" interface via link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Put.html[Put] and link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Result.html[Result], so anything that can be converted to an array of bytes can be stored as a value.
 455 Input could be strings, numbers, complex objects, or even images as long as they can rendered as bytes.
 456
 457 There are practical limits to the size of values (e.g., storing 10-50MB objects in HBase would probably be too much to ask); search the mailing list for conversations on this topic.
 458 All rows in HBase conform to the <<datamodel>>, and that includes versioning.
 459 Take that into consideration when making your design, as well as block size for the ColumnFamily.
 460
 461 === Counters
 462
 463 One supported datatype that deserves special mention are "counters" (i.e., the ability to do atomic increments of numbers). See link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#increment%28org.apache.hadoop.hbase.client.Increment%29[Increment] in `Table`.
 464
 465 Synchronization on counters are done on the RegionServer, not in the client.
 466
 467 [[schema.joins]]
 468 == Joins
 469
 470 If you have multiple tables, don't forget to factor in the potential for <<joins>> into the schema design.
 471
 472 [[ttl]]
 473 == Time To Live (TTL)
 474
 475 ColumnFamilies can set a TTL length in seconds, and HBase will automatically delete rows once the expiration time is reached.
 476 This applies to _all_ versions of a row - even the current one.
 477 The TTL time encoded in the HBase for the row is specified in UTC.
 478
 479 Store files which contains only expired rows are deleted on minor compaction.
 480 Setting `hbase.store.delete.expired.storefile` to `false` disables this feature.
 481 Setting minimum number of versions to other than 0 also disables this.
 482
 483 See link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html[HColumnDescriptor] for more information.
 484
 485 Recent versions of HBase also support setting time to live on a per cell basis.
 486 See link:https://issues.apache.org/jira/browse/HBASE-10560[HBASE-10560] for more information.
 487 Cell TTLs are submitted as an attribute on mutation requests (Appends, Increments, Puts, etc.) using Mutation#setTTL.
 488 If the TTL attribute is set, it will be applied to all cells updated on the server by the operation.
 489 There are two notable differences between cell TTL handling and ColumnFamily TTLs:
 490
 491 * Cell TTLs are expressed in units of milliseconds instead of seconds.
 492 * A cell TTLs cannot extend the effective lifetime of a cell beyond a ColumnFamily level TTL setting.
 493
 494 [[cf.keep.deleted]]
 495 ==  Keeping Deleted Cells
 496
 497 By default, delete markers extend back to the beginning of time.
 498 Therefore, link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html[Get] or link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html[Scan] operations will not see a deleted cell (row or column), even when the Get or Scan operation indicates a time range before the delete marker was placed.
 499
 500 ColumnFamilies can optionally keep deleted cells.
 501 In this case, deleted cells can still be retrieved, as long as these operations specify a time range that ends before the timestamp of any delete that would affect the cells.
 502 This allows for point-in-time queries even in the presence of deletes.
 503
 504 Deleted cells are still subject to TTL and there will never be more than "maximum number of versions" deleted cells.
 505 A new "raw" scan options returns all deleted rows and the delete markers.
 506
 507 .Change the Value of `KEEP_DELETED_CELLS` Using HBase Shell
 508 ----
 509 hbase> hbase> alter ‘t1′, NAME => ‘f1′, KEEP_DELETED_CELLS => true
 510 ----
 511
 512 .Change the Value of `KEEP_DELETED_CELLS` Using the API
 513 ====
 514 [source,java]
 515 ----
 516 ...
 517 HColumnDescriptor.setKeepDeletedCells(true);
 518 ...
 519 ----
 520 ====
 521
 522 Let us illustrate the basic effect of setting the `KEEP_DELETED_CELLS` attribute on a table.
 523
 524 First, without:
 525 [source]
 526 ----
 527 create 'test', {NAME=>'e', VERSIONS=>2147483647}
 528 put 'test', 'r1', 'e:c1', 'value', 10
 529 put 'test', 'r1', 'e:c1', 'value', 12
 530 put 'test', 'r1', 'e:c1', 'value', 14
 531 delete 'test', 'r1', 'e:c1',  11
 532
 533 hbase(main):017:0> scan 'test', {RAW=>true, VERSIONS=>1000}
 534 ROW                                              COLUMN+CELL
 535  r1                                              column=e:c1, timestamp=14, value=value
 536  r1                                              column=e:c1, timestamp=12, value=value
 537  r1                                              column=e:c1, timestamp=11, type=DeleteColumn
 538  r1                                              column=e:c1, timestamp=10, value=value
 539 1 row(s) in 0.0120 seconds
 540
 541 hbase(main):018:0> flush 'test'
 542 0 row(s) in 0.0350 seconds
 543
 544 hbase(main):019:0> scan 'test', {RAW=>true, VERSIONS=>1000}
 545 ROW                                              COLUMN+CELL
 546  r1                                              column=e:c1, timestamp=14, value=value
 547  r1                                              column=e:c1, timestamp=12, value=value
 548  r1                                              column=e:c1, timestamp=11, type=DeleteColumn
 549 1 row(s) in 0.0120 seconds
 550
 551 hbase(main):020:0> major_compact 'test'
 552 0 row(s) in 0.0260 seconds
 553
 554 hbase(main):021:0> scan 'test', {RAW=>true, VERSIONS=>1000}
 555 ROW                                              COLUMN+CELL
 556  r1                                              column=e:c1, timestamp=14, value=value
 557  r1                                              column=e:c1, timestamp=12, value=value
 558 1 row(s) in 0.0120 seconds
 559 ----
 560
 561 Notice how delete cells are let go.
 562
 563 Now let's run the same test only with `KEEP_DELETED_CELLS` set on the table (you can do table or per-column-family):
 564
 565 [source]
 566 ----
 567 hbase(main):005:0> create 'test', {NAME=>'e', VERSIONS=>2147483647, KEEP_DELETED_CELLS => true}
 568 0 row(s) in 0.2160 seconds
 569
 570 => Hbase::Table - test
 571 hbase(main):006:0> put 'test', 'r1', 'e:c1', 'value', 10
 572 0 row(s) in 0.1070 seconds
 573
 574 hbase(main):007:0> put 'test', 'r1', 'e:c1', 'value', 12
 575 0 row(s) in 0.0140 seconds
 576
 577 hbase(main):008:0> put 'test', 'r1', 'e:c1', 'value', 14
 578 0 row(s) in 0.0160 seconds
 579
 580 hbase(main):009:0> delete 'test', 'r1', 'e:c1',  11
 581 0 row(s) in 0.0290 seconds
 582
 583 hbase(main):010:0> scan 'test', {RAW=>true, VERSIONS=>1000}
 584 ROW                                                                                          COLUMN+CELL
 585  r1                                                                                          column=e:c1, timestamp=14, value=value
 586  r1                                                                                          column=e:c1, timestamp=12, value=value
 587  r1                                                                                          column=e:c1, timestamp=11, type=DeleteColumn
 588  r1                                                                                          column=e:c1, timestamp=10, value=value
 589 1 row(s) in 0.0550 seconds
 590
 591 hbase(main):011:0> flush 'test'
 592 0 row(s) in 0.2780 seconds
 593
 594 hbase(main):012:0> scan 'test', {RAW=>true, VERSIONS=>1000}
 595 ROW                                                                                          COLUMN+CELL
 596  r1                                                                                          column=e:c1, timestamp=14, value=value
 597  r1                                                                                          column=e:c1, timestamp=12, value=value
 598  r1                                                                                          column=e:c1, timestamp=11, type=DeleteColumn
 599  r1                                                                                          column=e:c1, timestamp=10, value=value
 600 1 row(s) in 0.0620 seconds
 601
 602 hbase(main):013:0> major_compact 'test'
 603 0 row(s) in 0.0530 seconds
 604
 605 hbase(main):014:0> scan 'test', {RAW=>true, VERSIONS=>1000}
 606 ROW                                                                                          COLUMN+CELL
 607  r1                                                                                          column=e:c1, timestamp=14, value=value
 608  r1                                                                                          column=e:c1, timestamp=12, value=value
 609  r1                                                                                          column=e:c1, timestamp=11, type=DeleteColumn
 610  r1                                                                                          column=e:c1, timestamp=10, value=value
 611 1 row(s) in 0.0650 seconds
 612 ----
 613
 614 KEEP_DELETED_CELLS is to avoid removing Cells from HBase when the _only_ reason to remove them is the delete marker.
 615 So with KEEP_DELETED_CELLS enabled deleted cells would get removed if either you write more versions than the configured max, or you have a TTL and Cells are in excess of the configured timeout, etc.
 616
 617
 618 [[secondary.indexes]]
 619 ==  Secondary Indexes and Alternate Query Paths
 620
 621 This section could also be titled "what if my table rowkey looks like _this_ but I also want to query my table like _that_." A common example on the dist-list is where a row-key is of the format "user-timestamp" but there are reporting requirements on activity across users for certain time ranges.
 622 Thus, selecting by user is easy because it is in the lead position of the key, but time is not.
 623
 624 There is no single answer on the best way to handle this because it depends on...
 625
 626 * Number of users
 627 * Data size and data arrival rate
 628 * Flexibility of reporting requirements (e.g., completely ad-hoc date selection vs. pre-configured ranges)
 629 * Desired execution speed of query (e.g., 90 seconds may be reasonable to some for an ad-hoc report, whereas it may be too long for others)
 630
 631 and solutions are also influenced by the size of the cluster and how much processing power you have to throw at the solution.
 632 Common techniques are in sub-sections below.
 633 This is a comprehensive, but not exhaustive, list of approaches.
 634
 635 It should not be a surprise that secondary indexes require additional cluster space and processing.
 636 This is precisely what happens in an RDBMS because the act of creating an alternate index requires both space and processing cycles to update.
 637 RDBMS products are more advanced in this regard to handle alternative index management out of the box.
 638 However, HBase scales better at larger data volumes, so this is a feature trade-off.
 639
 640 Pay attention to <<performance>> when implementing any of these approaches.
 641
 642 Additionally, see the David Butler response in this dist-list thread link:https://lists.apache.org/thread.html/b0ca33407f010d5b1be67a20d1708e8d8bb1e147770f2cb7182a2e37%401300972712%40%3Cuser.hbase.apache.org%3E[HBase, mail # user - Stargate+hbase]
 643
 644 [[secondary.indexes.filter]]
 645 ===  Filter Query
 646
 647 Depending on the case, it may be appropriate to use <<client.filter>>.
 648 In this case, no secondary index is created.
 649 However, don't try a full-scan on a large table like this from an application (i.e., single-threaded client).
 650
 651 [[secondary.indexes.periodic]]
 652 ===  Periodic-Update Secondary Index
 653
 654 A secondary index could be created in another table which is periodically updated via a MapReduce job.
 655 The job could be executed intra-day, but depending on load-strategy it could still potentially be out of sync with the main data table.
 656
 657 See <<mapreduce.example.readwrite,mapreduce.example.readwrite>> for more information.
 658
 659 [[secondary.indexes.dualwrite]]
 660 ===  Dual-Write Secondary Index
 661
 662 Another strategy is to build the secondary index while publishing data to the cluster (e.g., write to data table, write to index table). If this is approach is taken after a data table already exists, then bootstrapping will be needed for the secondary index with a MapReduce job (see <<secondary.indexes.periodic,secondary.indexes.periodic>>).
 663
 664 [[secondary.indexes.summary]]
 665 ===  Summary Tables
 666
 667 Where time-ranges are very wide (e.g., year-long report) and where the data is voluminous, summary tables are a common approach.
 668 These would be generated with MapReduce jobs into another table.
 669
 670 See <<mapreduce.example.summary,mapreduce.example.summary>> for more information.
 671
 672 [[secondary.indexes.coproc]]
 673 ===  Coprocessor Secondary Index
 674
 675 Coprocessors act like RDBMS triggers. These were added in 0.92.
 676 For more information, see <<cp,coprocessors>>
 677
 678 == Constraints
 679
 680 HBase currently supports 'constraints' in traditional (SQL) database parlance.
 681 The advised usage for Constraints is in enforcing business rules for attributes
 682 in the table (e.g. make sure values are in the range 1-10). Constraints could
 683 also be used to enforce referential integrity, but this is strongly discouraged
 684 as it will dramatically decrease the write throughput of the tables where integrity
 685 checking is enabled. Extensive documentation on using Constraints can be found at
 686 link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/constraint/Constraint.html[Constraint]
 687 since version 0.94.
 688
 689 [[schema.casestudies]]
 690 == Schema Design Case Studies
 691
 692 The following will describe some typical data ingestion use-cases with HBase, and how the rowkey design and construction can be approached.
 693 Note: this is just an illustration of potential approaches, not an exhaustive list.
 694 Know your data, and know your processing requirements.
 695
 696 It is highly recommended that you read the rest of the <<schema>> first, before reading these case studies.
 697
 698 The following case studies are described:
 699
 700 * Log Data / Timeseries Data
 701 * Log Data / Timeseries on Steroids
 702 * Customer/Order
 703 * Tall/Wide/Middle Schema Design
 704 * List Data
 705
 706 [[schema.casestudies.log_timeseries]]
 707 === Case Study - Log Data and Timeseries Data
 708
 709 Assume that the following data elements are being collected.
 710
 711 * Hostname
 712 * Timestamp
 713 * Log event
 714 * Value/message
 715
 716 We can store them in an HBase table called LOG_DATA, but what will the rowkey be? From these attributes the rowkey will be some combination of hostname, timestamp, and log-event - but what specifically?
 717
 718 [[schema.casestudies.log_timeseries.tslead]]
 719 ==== Timestamp In The Rowkey Lead Position
 720
 721 The rowkey `[timestamp][hostname][log-event]` suffers from the monotonically increasing rowkey problem described in <<timeseries>>.
 722
 723 There is another pattern frequently mentioned in the dist-lists about "bucketing" timestamps, by performing a mod operation on the timestamp.
 724 If time-oriented scans are important, this could be a useful approach.
 725 Attention must be paid to the number of buckets, because this will require the same number of scans to return results.
 726
 727 [source,java]
 728 ----
 729
 730 long bucket = timestamp % numBuckets;
 731 ----
 732
 733 to construct:
 734
 735 [source]
 736 ----
 737
 738 [bucket][timestamp][hostname][log-event]
 739 ----
 740
 741 As stated above, to select data for a particular timerange, a Scan will need to be performed for each bucket.
 742 100 buckets, for example, will provide a wide distribution in the keyspace but it will require 100 Scans to obtain data for a single timestamp, so there are trade-offs.
 743
 744 [[schema.casestudies.log_timeseries.hostlead]]
 745 ==== Host In The Rowkey Lead Position
 746
 747 The rowkey `[hostname][log-event][timestamp]` is a candidate if there is a large-ish number of hosts to spread the writes and reads across the keyspace.
 748 This approach would be useful if scanning by hostname was a priority.
 749
 750 [[schema.casestudies.log_timeseries.revts]]
 751 ==== Timestamp, or Reverse Timestamp?
 752
 753 If the most important access path is to pull most recent events, then storing the timestamps as reverse-timestamps (e.g., `timestamp = Long.MAX_VALUE – timestamp`) will create the property of being able to do a Scan on `[hostname][log-event]` to obtain the most recently captured events.
 754
 755 Neither approach is wrong, it just depends on what is most appropriate for the situation.
 756
 757 .Reverse Scan API
 758 [NOTE]
 759 ====
 760 link:https://issues.apache.org/jira/browse/HBASE-4811[HBASE-4811] implements an API to scan a table or a range within a table in reverse, reducing the need to optimize your schema for forward or reverse scanning.
 761 This feature is available in HBase 0.98 and later.
 762 See link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setReversed-boolean-[Scan.setReversed()] for more information.
 763 ====
 764
 765 [[schema.casestudies.log_timeseries.varkeys]]
 766 ==== Variable Length or Fixed Length Rowkeys?
 767
 768 It is critical to remember that rowkeys are stamped on every column in HBase.
 769 If the hostname is `a` and the event type is `e1` then the resulting rowkey would be quite small.
 770 However, what if the ingested hostname is `myserver1.mycompany.com` and the event type is `com.package1.subpackage2.subsubpackage3.ImportantService`?
 771
 772 It might make sense to use some substitution in the rowkey.
 773 There are at least two approaches: hashed and numeric.
 774 In the Hostname In The Rowkey Lead Position example, it might look like this:
 775
 776 Composite Rowkey With Hashes:
 777
 778 * [MD5 hash of hostname] = 16 bytes
 779 * [MD5 hash of event-type] = 16 bytes
 780 * [timestamp] = 8 bytes
 781
 782 Composite Rowkey With Numeric Substitution:
 783
 784 For this approach another lookup table would be needed in addition to LOG_DATA, called LOG_TYPES.
 785 The rowkey of LOG_TYPES would be:
 786
 787 * `[type]` (e.g., byte indicating hostname vs. event-type)
 788 * `[bytes]` variable length bytes for raw hostname or event-type.
 789
 790 A column for this rowkey could be a long with an assigned number, which could be obtained
 791 by using an link:https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#incrementColumnValue-byte:A-byte:A-byte:A-long-[HBase counter]
 792
 793 So the resulting composite rowkey would be:
 794
 795 * [substituted long for hostname] = 8 bytes
 796 * [substituted long for event type] = 8 bytes
 797 * [timestamp] = 8 bytes
 798
 799 In either the Hash or Numeric substitution approach, the raw values for hostname and event-type can be stored as columns.
 800
 801 [[schema.casestudies.log_steroids]]
 802 === Case Study - Log Data and Timeseries Data on Steroids
 803
 804 This effectively is the OpenTSDB approach.
 805 What OpenTSDB does is re-write data and pack rows into columns for certain time-periods.
 806 For a detailed explanation, see: http://opentsdb.net/schema.html, and
 807 link:https://www.slideshare.net/cloudera/4-opentsdb-hbasecon[Lessons Learned from OpenTSDB]
 808 from HBaseCon2012.
 809
 810 But this is how the general concept works: data is ingested, for example, in this manner...
 811
 812 ----
 813
 814 [hostname][log-event][timestamp1]
 815 [hostname][log-event][timestamp2]
 816 [hostname][log-event][timestamp3]
 817 ----
 818
 819 with separate rowkeys for each detailed event, but is re-written like this...
 820
 821 ----
 822 [hostname][log-event][timerange]
 823 ----
 824
 825 and each of the above events are converted into columns stored with a time-offset relative to the beginning timerange (e.g., every 5 minutes). This is obviously a very advanced processing technique, but HBase makes this possible.
 826
 827 [[schema.casestudies.custorder]]
 828 === Case Study - Customer/Order
 829
 830 Assume that HBase is used to store customer and order information.
 831 There are two core record-types being ingested: a Customer record type, and Order record type.
 832
 833 The Customer record type would include all the things that you'd typically expect:
 834
 835 * Customer number
 836 * Customer name
 837 * Address (e.g., city, state, zip)
 838 * Phone numbers, etc.
 839
 840 The Order record type would include things like:
 841
 842 * Customer number
 843 * Order number
 844 * Sales date
 845 * A series of nested objects for shipping locations and line-items (see <<schema.casestudies.custorder.obj>> for details)
 846
 847 Assuming that the combination of customer number and sales order uniquely identify an order, these two attributes will compose the rowkey, and specifically a composite key such as:
 848
 849 ----
 850 [customer number][order number]
 851 ----
 852
 853 for an ORDER table.
 854 However, there are more design decisions to make: are the _raw_ values the best choices for rowkeys?
 855
 856 The same design questions in the Log Data use-case confront us here.
 857 What is the keyspace of the customer number, and what is the format (e.g., numeric? alphanumeric?) As it is advantageous to use fixed-length keys in HBase, as well as keys that can support a reasonable spread in the keyspace, similar options appear:
 858
 859 Composite Rowkey With Hashes:
 860
 861 * [MD5 of customer number] = 16 bytes
 862 * [MD5 of order number] = 16 bytes
 863
 864 Composite Numeric/Hash Combo Rowkey:
 865
 866 * [substituted long for customer number] = 8 bytes
 867 * [MD5 of order number] = 16 bytes
 868
 869 [[schema.casestudies.custorder.tables]]
 870 ==== Single Table? Multiple Tables?
 871
 872 A traditional design approach would have separate tables for CUSTOMER and SALES.
 873 Another option is to pack multiple record types into a single table (e.g., CUSTOMER++).
 874
 875 Customer Record Type Rowkey:
 876
 877 * [customer-id]
 878 * [type] = type indicating `1' for customer record type
 879
 880 Order Record Type Rowkey:
 881
 882 * [customer-id]
 883 * [type] = type indicating `2' for order record type
 884 * [order]
 885
 886 The advantage of this particular CUSTOMER++ approach is that organizes many different record-types by customer-id (e.g., a single scan could get you everything about that customer). The disadvantage is that it's not as easy to scan for a particular record-type.
 887
 888 [[schema.casestudies.custorder.obj]]
 889 ==== Order Object Design
 890
 891 Now we need to address how to model the Order object.
 892 Assume that the class structure is as follows:
 893
 894 Order::
 895   (an Order can have multiple ShippingLocations
 896
 897 LineItem::
 898   (a ShippingLocation can have multiple LineItems
 899
 900 there are multiple options on storing this data.
 901
 902 [[schema.casestudies.custorder.obj.norm]]
 903 ===== Completely Normalized
 904
 905 With this approach, there would be separate tables for ORDER, SHIPPING_LOCATION, and LINE_ITEM.
 906
 907 The ORDER table's rowkey was described above: <<schema.casestudies.custorder,schema.casestudies.custorder>>
 908
 909 The SHIPPING_LOCATION's composite rowkey would be something like this:
 910
 911 * `[order-rowkey]`
 912 * `[shipping location number]` (e.g., 1st location, 2nd, etc.)
 913
 914 The LINE_ITEM table's composite rowkey would be something like this:
 915
 916 * `[order-rowkey]`
 917 * `[shipping location number]` (e.g., 1st location, 2nd, etc.)
 918 * `[line item number]` (e.g., 1st lineitem, 2nd, etc.)
 919
 920 Such a normalized model is likely to be the approach with an RDBMS, but that's not your only option with HBase.
 921 The cons of such an approach is that to retrieve information about any Order, you will need:
 922
 923 * Get on the ORDER table for the Order
 924 * Scan on the SHIPPING_LOCATION table for that order to get the ShippingLocation instances
 925 * Scan on the LINE_ITEM for each ShippingLocation
 926
 927 granted, this is what an RDBMS would do under the covers anyway, but since there are no joins in HBase you're just more aware of this fact.
 928
 929 [[schema.casestudies.custorder.obj.rectype]]
 930 ===== Single Table With Record Types
 931
 932 With this approach, there would exist a single table ORDER that would contain
 933
 934 The Order rowkey was described above: <<schema.casestudies.custorder,schema.casestudies.custorder>>
 935
 936 * `[order-rowkey]`
 937 * `[ORDER record type]`
 938
 939 The ShippingLocation composite rowkey would be something like this:
 940
 941 * `[order-rowkey]`
 942 * `[SHIPPING record type]`
 943 * `[shipping location number]` (e.g., 1st location, 2nd, etc.)
 944
 945 The LineItem composite rowkey would be something like this:
 946
 947 * `[order-rowkey]`
 948 * `[LINE record type]`
 949 * `[shipping location number]` (e.g., 1st location, 2nd, etc.)
 950 * `[line item number]` (e.g., 1st lineitem, 2nd, etc.)
 951
 952 [[schema.casestudies.custorder.obj.denorm]]
 953 ===== Denormalized
 954
 955 A variant of the Single Table With Record Types approach is to denormalize and flatten some of the object hierarchy, such as collapsing the ShippingLocation attributes onto each LineItem instance.
 956
 957 The LineItem composite rowkey would be something like this:
 958
 959 * `[order-rowkey]`
 960 * `[LINE record type]`
 961 * `[line item number]` (e.g., 1st lineitem, 2nd, etc., care must be taken that there are unique across the entire order)
 962
 963 and the LineItem columns would be something like this:
 964
 965 * itemNumber
 966 * quantity
 967 * price
 968 * shipToLine1 (denormalized from ShippingLocation)
 969 * shipToLine2 (denormalized from ShippingLocation)
 970 * shipToCity (denormalized from ShippingLocation)
 971 * shipToState (denormalized from ShippingLocation)
 972 * shipToZip (denormalized from ShippingLocation)
 973
 974 The pros of this approach include a less complex object hierarchy, but one of the cons is that updating gets more complicated in case any of this information changes.
 975
 976 [[schema.casestudies.custorder.obj.singleobj]]
 977 ===== Object BLOB
 978
 979 With this approach, the entire Order object graph is treated, in one way or another, as a BLOB.
 980 For example, the ORDER table's rowkey was described above: <<schema.casestudies.custorder,schema.casestudies.custorder>>, and a single column called "order" would contain an object that could be deserialized that contained a container Order, ShippingLocations, and LineItems.
 981
 982 There are many options here: JSON, XML, Java Serialization, Avro, Hadoop Writables, etc.
 983 All of them are variants of the same approach: encode the object graph to a byte-array.
 984 Care should be taken with this approach to ensure backward compatibility in case the object model changes such that older persisted structures can still be read back out of HBase.
 985
 986 Pros are being able to manage complex object graphs with minimal I/O (e.g., a single HBase Get per Order in this example), but the cons include the aforementioned warning about backward compatibility of serialization, language dependencies of serialization (e.g., Java Serialization only works with Java clients), the fact that you have to deserialize the entire object to get any piece of information inside the BLOB, and the difficulty in getting frameworks like Hive to work with custom objects like this.
 987
 988 [[schema.smackdown]]
 989 === Case Study - "Tall/Wide/Middle" Schema Design Smackdown
 990
 991 This section will describe additional schema design questions that appear on the dist-list, specifically about tall and wide tables.
 992 These are general guidelines and not laws - each application must consider its own needs.
 993
 994 [[schema.smackdown.rowsversions]]
 995 ==== Rows vs. Versions
 996
 997 A common question is whether one should prefer rows or HBase's built-in-versioning.
 998 The context is typically where there are "a lot" of versions of a row to be retained (e.g., where it is significantly above the HBase default of 1 max versions). The rows-approach would require storing a timestamp in some portion of the rowkey so that they would not overwrite with each successive update.
 999
1000 Preference: Rows (generally speaking).
1001
1002 [[schema.smackdown.rowscols]]
1003 ==== Rows vs. Columns
1004
1005 Another common question is whether one should prefer rows or columns.
1006 The context is typically in extreme cases of wide tables, such as having 1 row with 1 million attributes, or 1 million rows with 1 columns apiece.
1007
1008 Preference: Rows (generally speaking). To be clear, this guideline is in the context is in extremely wide cases, not in the standard use-case where one needs to store a few dozen or hundred columns.
1009 But there is also a middle path between these two options, and that is "Rows as Columns."
1010
1011 [[schema.smackdown.rowsascols]]
1012 ==== Rows as Columns
1013
1014 The middle path between Rows vs.
1015 Columns is packing data that would be a separate row into columns, for certain rows.
1016 OpenTSDB is the best example of this case where a single row represents a defined time-range, and then discrete events are treated as columns.
1017 This approach is often more complex, and may require the additional complexity of re-writing your data, but has the advantage of being I/O efficient.
1018 For an overview of this approach, see <<schema.casestudies.log_steroids,schema.casestudies.log-steroids>>.
1019
1020 [[casestudies.schema.listdata]]
1021 === Case Study - List Data
1022
1023 The following is an exchange from the user dist-list regarding a fairly common question: how to handle per-user list data in Apache HBase.
1024
1025 *** QUESTION ***
1026
1027 We're looking at how to store a large amount of (per-user) list data in HBase, and we were trying to figure out what kind of access pattern made the most sense.
1028 One option is store the majority of the data in a key, so we could have something like:
1029
1030 [source]
1031 ----
1032
1033 <FixedWidthUserName><FixedWidthValueId1>:"" (no value)
1034 <FixedWidthUserName><FixedWidthValueId2>:"" (no value)
1035 <FixedWidthUserName><FixedWidthValueId3>:"" (no value)
1036 ----
1037
1038 The other option we had was to do this entirely using:
1039
1040 [source,xml]
1041 ----
1042
1043 <FixedWidthUserName><FixedWidthPageNum0>:<FixedWidthLength><FixedIdNextPageNum><ValueId1><ValueId2><ValueId3>...
1044 <FixedWidthUserName><FixedWidthPageNum1>:<FixedWidthLength><FixedIdNextPageNum><ValueId1><ValueId2><ValueId3>...
1045 ----
1046
1047 where each row would contain multiple values.
1048 So in one case reading the first thirty values would be:
1049
1050 [source,java]
1051 ----
1052
1053 scan { STARTROW => 'FixedWidthUsername' LIMIT => 30}
1054 ----
1055
1056 And in the second case it would be
1057
1058 [source]
1059 ----
1060
1061 get 'FixedWidthUserName\x00\x00\x00\x00'
1062 ----
1063
1064 The general usage pattern would be to read only the first 30 values of these lists, with infrequent access reading deeper into the lists.
1065 Some users would have <= 30 total values in these lists, and some users would have millions (i.e.
1066 power-law distribution)
1067
1068 The single-value format seems like it would take up more space on HBase, but would offer some improved retrieval / pagination flexibility.
1069 Would there be any significant performance advantages to be able to paginate via gets vs paginating with scans?
1070
1071 My initial understanding was that doing a scan should be faster if our paging size is unknown (and caching is set appropriately), but that gets should be faster if we'll always need the same page size.
1072 I've ended up hearing different people tell me opposite things about performance.
1073 I assume the page sizes would be relatively consistent, so for most use cases we could guarantee that we only wanted one page of data in the fixed-page-length case.
1074 I would also assume that we would have infrequent updates, but may have inserts into the middle of these lists (meaning we'd need to update all subsequent rows).
1075
1076 Thanks for help / suggestions / follow-up questions.
1077
1078 *** ANSWER ***
1079
1080 If I understand you correctly, you're ultimately trying to store triples in the form "user, valueid, value", right? E.g., something like:
1081
1082 [source]
1083 ----
1084
1085 "user123, firstname, Paul",
1086 "user234, lastname, Smith"
1087 ----
1088
1089 (But the usernames are fixed width, and the valueids are fixed width).
1090
1091 And, your access pattern is along the lines of: "for user X, list the next 30 values, starting with valueid Y". Is that right? And these values should be returned sorted by valueid?
1092
1093 The tl;dr version is that you should probably go with one row per user+value, and not build a complicated intra-row pagination scheme on your own unless you're really sure it is needed.
1094
1095 Your two options mirror a common question people have when designing HBase schemas: should I go "tall" or "wide"? Your first schema is "tall": each row represents one value for one user, and so there are many rows in the table for each user; the row key is user + valueid, and there would be (presumably) a single column qualifier that means "the value". This is great if you want to scan over rows in sorted order by row key (thus my question above, about whether these ids are sorted correctly). You can start a scan at any user+valueid, read the next 30, and be done.
1096 What you're giving up is the ability to have transactional guarantees around all the rows for one user, but it doesn't sound like you need that.
1097 Doing it this way is generally recommended (see here https://hbase.apache.org/book.html#schema.smackdown).
1098
1099 Your second option is "wide": you store a bunch of values in one row, using different qualifiers (where the qualifier is the valueid). The simple way to do that would be to just store ALL values for one user in a single row.
1100 I'm guessing you jumped to the "paginated" version because you're assuming that storing millions of columns in a single row would be bad for performance, which may or may not be true; as long as you're not trying to do too much in a single request, or do things like scanning over and returning all of the cells in the row, it shouldn't be fundamentally worse.
1101 The client has methods that allow you to get specific slices of columns.
1102
1103 Note that neither case fundamentally uses more disk space than the other; you're just "shifting" part of the identifying information for a value either to the left (into the row key, in option one) or to the right (into the column qualifiers in option 2). Under the covers, every key/value still stores the whole row key, and column family name.
1104 (If this is a bit confusing, take an hour and watch Lars George's excellent video about understanding HBase schema design: http://www.youtube.com/watch?v=_HLoH_PgrLk).
1105
1106 A manually paginated version has lots more complexities, as you note, like having to keep track of how many things are in each page, re-shuffling if new values are inserted, etc.
1107 That seems significantly more complex.
1108 It might have some slight speed advantages (or disadvantages!) at extremely high throughput, and the only way to really know that would be to try it out.
1109 If you don't have time to build it both ways and compare, my advice would be to start with the simplest option (one row per user+value). Start simple and iterate! :)
1110
1111 [[schema.ops]]
1112 == Operational and Performance Configuration Options
1113
1114 ===  Tune HBase Server RPC Handling
1115
1116 * Set `hbase.regionserver.handler.count` (in `hbase-site.xml`) to cores x spindles for concurrency.
1117 * Optionally, split the call queues into separate read and write queues for differentiated service. The parameter `hbase.ipc.server.callqueue.handler.factor` specifies the number of call queues:
1118 - `0` means a single shared queue
1119 - `1` means one queue for each handler.
1120 - A value between `0` and `1` allocates the number of queues proportionally to the number of handlers. For instance, a value of `.5` shares one queue between each two handlers.
1121 * Use `hbase.ipc.server.callqueue.read.ratio` (`hbase.ipc.server.callqueue.read.share` in 0.98) to split the call queues into read and write queues:
1122 - `0.5` means there will be the same number of read and write queues
1123 - `< 0.5` for more read than write
1124 - `> 0.5` for more write than read
1125 * Set `hbase.ipc.server.callqueue.scan.ratio` (HBase 1.0+)  to split read call queues into small-read and long-read queues:
1126 - 0.5 means that there will be the same number of short-read and long-read queues
1127 - `< 0.5` for more short-read
1128 - `> 0.5` for more long-read
1129
1130 ===  Disable Nagle for RPC
1131
1132 Disable Nagle’s algorithm. Delayed ACKs can add up to ~200ms to RPC round trip time. Set the following parameters:
1133
1134 * In Hadoop’s `core-site.xml`:
1135 - `ipc.server.tcpnodelay = true`
1136 - `ipc.client.tcpnodelay = true`
1137 * In HBase’s `hbase-site.xml`:
1138 - `hbase.ipc.client.tcpnodelay = true`
1139 - `hbase.ipc.server.tcpnodelay = true`
1140
1141 ===  Limit Server Failure Impact
1142
1143 Detect regionserver failure as fast as reasonable. Set the following parameters:
1144
1145 * In `hbase-site.xml`, set `zookeeper.session.timeout` to 30 seconds or less to bound failure detection (20-30 seconds is a good start).
1146 - Note: Zookeeper clients negotiate a session timeout with the server during client init. Server enforces this timeout to be in the
1147 range [`minSessionTimeout`, `maxSessionTimeout`] and both these timeouts (measured in milliseconds) are configurable in Zookeeper service configuration.
1148 If not configured, these default to 2 * `tickTime` and 20 * `tickTime` respectively (`tickTime` is the basic time unit used by ZooKeeper,
1149 as measured in milliseconds. It is used to regulate heartbeats, timeouts etc.). Refer to Zookeeper documentation for additional details.
1150
1151 * Detect and avoid unhealthy or failed HDFS DataNodes: in `hdfs-site.xml` and `hbase-site.xml`, set the following parameters:
1152 - `dfs.namenode.avoid.read.stale.datanode = true`
1153 - `dfs.namenode.avoid.write.stale.datanode = true`
1154
1155 [[shortcircuit.reads]]
1156 ===  Optimize on the Server Side for Low Latency
1157 Skip the network for local blocks when the RegionServer goes to read from HDFS by exploiting HDFS's
1158 link:https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html[Short-Circuit Local Reads] facility.
1159 Note how setup must be done both at the datanode and on the dfsclient ends of the conneciton -- i.e. at the RegionServer
1160 and how both ends need to have loaded the hadoop native `.so` library.
1161 After configuring your hadoop setting _dfs.client.read.shortcircuit_ to _true_ and configuring
1162 the _dfs.domain.socket.path_ path for the datanode and dfsclient to share and restarting, next configure
1163 the regionserver/dfsclient side.
1164
1165 * In `hbase-site.xml`, set the following parameters:
1166 - `dfs.client.read.shortcircuit = true`
1167 - `dfs.client.read.shortcircuit.skip.checksum = true` so we don't double checksum (HBase does its own checksumming to save on i/os. See <<hbase.regionserver.checksum.verify.performance>> for more on this.
1168 - `dfs.domain.socket.path` to match what was set for the datanodes.
1169 - `dfs.client.read.shortcircuit.buffer.size = 131072` Important to avoid OOME -- hbase has a default it uses if unset, see `hbase.dfs.client.read.shortcircuit.buffer.size`; its default is 131072.
1170 * Ensure data locality. In `hbase-site.xml`, set `hbase.hstore.min.locality.to.skip.major.compact = 0.7` (Meaning that 0.7 \<= n \<= 1)
1171 * Make sure DataNodes have enough handlers for block transfers. In `hdfs-site.xml`, set the following parameters:
1172 - `dfs.datanode.max.xcievers >= 8192`
1173 - `dfs.datanode.handler.count =` number of spindles
1174
1175 Check the RegionServer logs after restart. You should only see complaint if misconfiguration.
1176 Otherwise, shortcircuit read operates quietly in background. It does not provide metrics so
1177 no optics on how effective it is but read latencies should show a marked improvement, especially if
1178 good data locality, lots of random reads, and dataset is larger than available cache.
1179
1180 Other advanced configurations that you might play with, especially if shortcircuit functionality
1181 is complaining in the logs,  include `dfs.client.read.shortcircuit.streams.cache.size` and
1182 `dfs.client.socketcache.capacity`. Documentation is sparse on these options. You'll have to
1183 read source code.
1184
1185 RegionServer metric system exposes HDFS short circuit read metrics `shortCircuitBytesRead`. Other
1186 HDFS read metrics, including
1187 `totalBytesRead` (The total number of bytes read from HDFS),
1188 `localBytesRead` (The number of bytes read from the local HDFS DataNode),
1189 `zeroCopyBytesRead` (The number of bytes read through HDFS zero copy)
1190 are available and can be used to troubleshoot short-circuit read issues.
1191
1192 For more on short-circuit reads, see Colin's old blog on rollout,
1193 link:http://blog.cloudera.com/blog/2013/08/how-improved-short-circuit-local-reads-bring-better-performance-and-security-to-hadoop/[How Improved Short-Circuit Local Reads Bring Better Performance and Security to Hadoop].
1194 The link:https://issues.apache.org/jira/browse/HDFS-347[HDFS-347] issue also makes for an
1195 interesting read showing the HDFS community at its best (caveat a few comments).
1196
1197 ===  JVM Tuning
1198
1199 ====  Tune JVM GC for low collection latencies
1200
1201 * Use the CMS collector: `-XX:+UseConcMarkSweepGC`
1202 * Keep eden space as small as possible to minimize average collection time. Example:
1203
1204     -XX:CMSInitiatingOccupancyFraction=70
1205
1206 * Optimize for low collection latency rather than throughput: `-Xmn512m`
1207 * Collect eden in parallel: `-XX:+UseParNewGC`
1208 *  Avoid collection under pressure: `-XX:+UseCMSInitiatingOccupancyOnly`
1209 * Limit per request scanner result sizing so everything fits into survivor space but doesn’t tenure. In `hbase-site.xml`, set `hbase.client.scanner.max.result.size` to 1/8th of eden space (with -`Xmn512m` this is ~51MB )
1210 * Set `max.result.size` x `handler.count` less than survivor space
1211
1212 ====  OS-Level Tuning
1213
1214 * Turn transparent huge pages (THP) off:
1215
1216   echo never > /sys/kernel/mm/transparent_hugepage/enabled
1217   echo never > /sys/kernel/mm/transparent_hugepage/defrag
1218
1219 * Set `vm.swappiness = 0`
1220 * Set `vm.min_free_kbytes` to at least 1GB (8GB on larger memory systems)
1221 * Disable NUMA zone reclaim with `vm.zone_reclaim_mode = 0`
1222
1223 ==  Special Cases
1224
1225 ===  For applications where failing quickly is better than waiting
1226
1227 *  In `hbase-site.xml` on the client side, set the following parameters:
1228 - Set `hbase.client.pause = 1000`
1229 - Set `hbase.client.retries.number = 3`
1230 - If you want to ride over splits and region moves, increase `hbase.client.retries.number` substantially (>= 20)
1231 - Set the RecoverableZookeeper retry count: `zookeeper.recovery.retry = 1` (no retry)
1232 * In `hbase-site.xml` on the server side, set the Zookeeper session timeout for detecting server failures: `zookeeper.session.timeout` <= 30 seconds (20-30 is good).
1233
1234 ===  For applications that can tolerate slightly out of date information
1235
1236 **HBase timeline consistency (HBASE-10070) **
1237 With read replicas enabled, read-only copies of regions (replicas) are distributed over the cluster. One RegionServer services the default or primary replica, which is the only replica that can service writes. Other RegionServers serve the secondary replicas, follow the primary RegionServer, and only see committed updates. The secondary replicas are read-only, but can serve reads immediately while the primary is failing over, cutting read availability blips from seconds to milliseconds. Phoenix supports timeline consistency as of 4.4.0
1238 Tips:
1239
1240 * Deploy HBase 1.0.0 or later.
1241 * Enable timeline consistent replicas on the server side.
1242 * Use one of the following methods to set timeline consistency:
1243 - Use `ALTER SESSION SET CONSISTENCY = 'TIMELINE’`
1244 - Set the connection property `Consistency` to `timeline` in the JDBC connect string
1245
1246 === More Information
1247
1248 See the Performance section <<perf.schema,perf.schema>> for more information about operational and performance schema design options, such as Bloom Filters, Table-configured regionsizes, compression, and blocksizes.