web/bit-hacks/bithacks.txt

   1 Bit Twiddling Hacks
   2
   3 By Sean Eron Anderson
   4 seander@cs.stanford.edu
   5
   6 Individually, the code snippets here are in the public domain (unless otherwise
   7 noted) — feel free to use them however you please. The aggregate collection and
   8 descriptions are © 1997-2005 Sean Eron Anderson. The code and descriptions are
   9 distributed in the hope that they will be useful, but WITHOUT ANY WARRANTY and
  10 without even the implied warranty of merchantability or fitness for a
  11 particular purpose. As of May 5, 2005, all the code has been tested thoroughly.
  12 Thousands of people have read it. Moreover, Professor Randal Bryant, the Dean
  13 of Computer Science at Carnegie Mellon University, has personally tested almost
  14 everything with his Uclid code verification system. What he hasn't tested, I
  15 have checked against all possible inputs on a 32-bit machine. To the first
  16 person to inform me of a legitimate bug in the code, I'll pay a bounty of US$10
  17 (by check or Paypal). If directed to a charity, I'll pay US$20.
  18
  19 Contents
  20
  21   • About the operation counting methodology
  22   • Compute the sign of an integer
  23   • Detect if two integers have opposite signs
  24   • Compute the integer absolute value (abs) without branching
  25   • Compute the minimum (min) or maximum (max) of two integers without
  26     branching
  27   • Determining if an integer is a power of 2
  28   • Sign extending
  29       □ Sign extending from a constant bit-width
  30       □ Sign extending from a variable bit-width
  31       □ Sign extending from a variable bit-width in 3 operations
  32   • Conditionally set or clear bits without branching
  33   • Conditionally negate a value without branching
  34   • Merge bits from two values according to a mask
  35   • Counting bits set
  36       □ Counting bits set, naive way
  37       □ Counting bits set by lookup table
  38       □ Counting bits set, Brian Kernighan's way
  39       □ Counting bits set in 12, 24, or 32-bit words using 64-bit instructions
  40       □ Counting bits set, in parallel
  41       □ Count bits set (rank) from the most-significant bit upto a given
  42         position
  43       □ Select the bit position (from the most-significant bit) with the given
  44         count (rank)
  45   • Computing parity (1 if an odd number of bits set, 0 otherwise)
  46       □ Compute parity of a word the naive way
  47       □ Compute parity by lookup table
  48       □ Compute parity of a byte using 64-bit multiply and modulus division
  49       □ Compute parity of word with a multiply
  50       □ Compute parity in parallel
  51   • Swapping Values
  52       □ Swapping values with subtraction and addition
  53       □ Swapping values with XOR
  54       □ Swapping individual bits with XOR
  55   • Reversing bit sequences
  56       □ Reverse bits the obvious way
  57       □ Reverse bits in word by lookup table
  58       □ Reverse the bits in a byte with 3 operations (64-bit multiply and
  59         modulus division)
  60       □ Reverse the bits in a byte with 4 operations (64-bit multiply, no
  61         division)
  62       □ Reverse the bits in a byte with 7 operations (no 64-bit, only 32)
  63       □ Reverse an N-bit quantity in parallel with 5 * lg(N) operations
  64   • Modulus division (aka computing remainders)
  65       □ Computing modulus division by 1 << s without a division operation
  66         (obvious)
  67       □ Computing modulus division by (1 << s) - 1 without a division operation
  68       □ Computing modulus division by (1 << s) - 1 in parallel without a
  69         division operation
  70   • Finding integer log base 2 of an integer (aka the position of the highest
  71     bit set)
  72       □ Find the log base 2 of an integer with the MSB N set in O(N) operations
  73         (the obvious way)
  74       □ Find the integer log base 2 of an integer with an 64-bit IEEE float
  75       □ Find the log base 2 of an integer with a lookup table
  76       □ Find the log base 2 of an N-bit integer in O(lg(N)) operations
  77       □ Find the log base 2 of an N-bit integer in O(lg(N)) operations with
  78         multiply and lookup
  79   • Find integer log base 10 of an integer
  80   • Find integer log base 10 of an integer the obvious way
  81   • Find integer log base 2 of a 32-bit IEEE float
  82   • Find integer log base 2 of the pow(2, r)-root of a 32-bit IEEE float (for
  83     unsigned integer r)
  84   • Counting consecutive trailing zero bits (or finding bit indices)
  85       □ Count the consecutive zero bits (trailing) on the right linearly
  86       □ Count the consecutive zero bits (trailing) on the right in parallel
  87       □ Count the consecutive zero bits (trailing) on the right by binary
  88         search
  89       □ Count the consecutive zero bits (trailing) on the right by casting to a
  90         float
  91       □ Count the consecutive zero bits (trailing) on the right with modulus
  92         division and lookup
  93       □ Count the consecutive zero bits (trailing) on the right with multiply
  94         and lookup
  95   • Round up to the next highest power of 2 by float casting
  96   • Round up to the next highest power of 2
  97   • Interleaving bits (aka computing Morton Numbers)
  98       □ Interleave bits the obvious way
  99       □ Interleave bits by table lookup
 100       □ Interleave bits with 64-bit multiply
 101       □ Interleave bits by Binary Magic Numbers
 102   • Testing for ranges of bytes in a word (and counting occurances found)
 103       □ Determine if a word has a zero byte
 104       □ Determine if a word has a byte equal to n
 105       □ Determine if a word has byte less than n
 106       □ Determine if a word has a byte greater than n
 107       □ Determine if a word has a byte between m and n
 108   • Compute the lexicographically next bit permutation
 109
 110 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 111
 112 About the operation counting methodology
 113
 114 When totaling the number of operations for algorithms here, any C operator is
 115 counted as one operation. Intermediate assignments, which need not be written
 116 to RAM, are not counted. Of course, this operation counting approach only
 117 serves as an approximation of the actual number of machine instructions and CPU
 118 time. All operations are assumed to take the same amount of time, which is not
 119 true in reality, but CPUs have been heading increasingly in this direction over
 120 time. There are many nuances that determine how fast a system will run a given
 121 sample of code, such as cache sizes, memory bandwidths, instruction sets, etc.
 122 In the end, benchmarking is the best way to determine whether one method is
 123 really faster than another, so consider the techniques below as possibilities
 124 to test on your target architecture.
 125
 126 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 127
 128 Compute the sign of an integer
 129
 130 int v;      // we want to find the sign of v
 131 int sign;   // the result goes here
 132
 133 // CHAR_BIT is the number of bits per byte (normally 8).
 134 sign = -(v < 0);  // if v < 0 then -1, else 0.
 135 // or, to avoid branching on CPUs with flag registers (IA32):
 136 sign = -(int)((unsigned int)((int)v) >> (sizeof(int) * CHAR_BIT - 1));
 137 // or, for one less instruction (but not portable):
 138 sign = v >> (sizeof(int) * CHAR_BIT - 1);
 139
 140 The last expression above evaluates to sign = v >> 31 for 32-bit integers. This
 141 is one operation faster than the obvious way, sign = -(v < 0). This trick works
 142 because when signed integers are shifted right, the value of the far left bit
 143 is copied to the other bits. The far left bit is 1 when the value is negative
 144 and 0 otherwise; all 1 bits gives -1. Unfortunately, this behavior is
 145 architecture-specific.
 146
 147 Alternatively, if you prefer the result be either -1 or +1, then use:
 148
 149 sign = +1 | (v >> (sizeof(int) * CHAR_BIT - 1));  // if v < 0 then -1, else +1
 150
 151 On the other hand, if you prefer the result be either -1, 0, or +1, then use:
 152
 153 sign = (v != 0) | -(int)((unsigned int)((int)v) >> (sizeof(int) * CHAR_BIT - 1));
 154 // Or, for more speed but less portability:
 155 sign = (v != 0) | (v >> (sizeof(int) * CHAR_BIT - 1));  // -1, 0, or +1
 156 // Or, for portability, brevity, and (perhaps) speed:
 157 sign = (v > 0) - (v < 0); // -1, 0, or +1
 158
 159 If instead you want to know if something is non-negative, resulting in +1 or
 160 else 0, then use:
 161
 162 sign = 1 ^ ((unsigned int)v >> (sizeof(int) * CHAR_BIT - 1)); // if v < 0 then 0, else 1
 163
 164 Caveat: On March 7, 2003, Angus Duggan pointed out that the 1989 ANSI C
 165 specification leaves the result of signed right-shift implementation-defined,
 166 so on some systems this hack might not work. For greater portability, Toby
 167 Speight suggested on September 28, 2005 that CHAR_BIT be used here and
 168 throughout rather than assuming bytes were 8 bits long. Angus recommended the
 169 more portable versions above, involving casting on March 4, 2006. Rohit Garg
 170 suggested the version for non-negative integers on September 12, 2009.
 171
 172 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 173
 174 Detect if two integers have opposite signs
 175
 176 int x, y;               // input values to compare signs
 177
 178 bool f = ((x ^ y) < 0); // true iff x and y have opposite signs
 179
 180 Manfred Weis suggested I add this entry on November 26, 2009.
 181
 182 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 183
 184 Compute the integer absolute value (abs) without branching
 185
 186 int v;           // we want to find the absolute value of v
 187 unsigned int r;  // the result goes here
 188 int const mask = v >> sizeof(int) * CHAR_BIT - 1;
 189
 190 r = (v + mask) ^ mask;
 191
 192 Patented variation:
 193
 194 r = (v ^ mask) - mask;
 195
 196 Some CPUs don't have an integer absolute value instruction (or the compiler
 197 fails to use them). On machines where branching is expensive, the above
 198 expression can be faster than the obvious approach, r = (v < 0) ? -(unsigned)v
 199 : v, even though the number of operations is the same.
 200
 201 On March 7, 2003, Angus Duggan pointed out that the 1989 ANSI C specification
 202 leaves the result of signed right-shift implementation-defined, so on some
 203 systems this hack might not work. I've read that ANSI C does not require values
 204 to be represented as two's complement, so it may not work for that reason as
 205 well (on a diminishingly small number of old machines that still use one's
 206 complement). On March 14, 2004, Keith H. Duggar sent me the patented variation
 207 above; it is superior to the one I initially came up with, r=(+1|(v>>(sizeof
 208 (int)*CHAR_BIT-1)))*v, because a multiply is not used. Unfortunately, this
 209 method has been patented in the USA on June 6, 2000 by Vladimir Yu Volkonsky
 210 and assigned to Sun Microsystems. On August 13, 2006, Yuriy Kaminskiy told me
 211 that the patent is likely invalid because the method was published well before
 212 the patent was even filed, such as in How to Optimize for the Pentium Processor
 213 by Agner Fog, dated November, 9, 1996. Yuriy also mentioned that this document
 214 was translated to Russian in 1997, which Vladimir could have read. Moreover,
 215 the Internet Archive also has an old link to it. On January 30, 2007, Peter
 216 Kankowski shared with me an abs version he discovered that was inspired by
 217 Microsoft's Visual C++ compiler output. It is featured here as the primary
 218 solution. On December 6, 2007, Hai Jin complained that the result was signed,
 219 so when computing the abs of the most negative value, it was still negative. On
 220 April 15, 2008 Andrew Shapira pointed out that the obvious approach could
 221 overflow, as it lacked an (unsigned) cast then; for maximum portability he
 222 suggested (v < 0) ? (1 + ((unsigned)(-1-v))) : (unsigned)v. But citing the ISO
 223 C99 spec on July 9, 2008, Vincent Lefèvre convinced me to remove it becasue
 224 even on non-2s-complement machines -(unsigned)v will do the right thing. The
 225 evaluation of -(unsigned)v first converts the negative value of v to an
 226 unsigned by adding 2**N, yielding a 2s complement representation of v's value
 227 that I'll call U. Then, U is negated, giving the desired result, -U = 0 - U =
 228 2**N - U = 2**N - (v+2**N) = -v = abs(v).
 229
 230 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 231
 232 Compute the minimum (min) or maximum (max) of two integers without branching
 233
 234 int x;  // we want to find the minimum of x and y
 235 int y;
 236 int r;  // the result goes here
 237
 238 r = y ^ ((x ^ y) & -(x < y)); // min(x, y)
 239
 240 On some rare machines where branching is very expensive and no condition move
 241 instructions exist, the above expression might be faster than the obvious
 242 approach, r = (x < y) ? x : y, even though it involves two more instructions.
 243 (Typically, the obvious approach is best, though.) It works because if x < y,
 244 then -(x < y) will be all ones, so r = y ^ (x ^ y) & ~0 = y ^ x ^ y = x.
 245 Otherwise, if x >= y, then -(x < y) will be all zeros, so r = y ^ ((x ^ y) & 0)
 246 = y. On some machines, evaluating (x < y) as 0 or 1 requires a branch
 247 instruction, so there may be no advantage.
 248
 249 To find the maximum, use:
 250
 251 r = x ^ ((x ^ y) & -(x < y)); // max(x, y)
 252
 253 Quick and dirty versions:
 254
 255 If you know that INT_MIN <= x - y <= INT_MAX, then you can use the following,
 256 which are faster because (x - y) only needs to be evaluated once.
 257
 258 r = y + ((x - y) & ((x - y) >> (sizeof(int) * CHAR_BIT - 1))); // min(x, y)
 259 r = x - ((x - y) & ((x - y) >> (sizeof(int) * CHAR_BIT - 1))); // max(x, y)
 260
 261 Note that the 1989 ANSI C specification doesn't specify the result of signed
 262 right-shift, so these aren't portable. If exceptions are thrown on overflows,
 263 then the values of x and y should be unsigned or cast to unsigned for the
 264 subtractions to avoid unnecessarily throwing an exception, however the
 265 right-shift needs a signed operand to produce all one bits when negative, so
 266 cast to signed there.
 267
 268 On March 7, 2003, Angus Duggan pointed out the right-shift portability issue.
 269 On May 3, 2005, Randal E. Bryant alerted me to the need for the precondition,
 270 INT_MIN <= x - y <= INT_MAX, and suggested the non-quick and dirty version as a
 271 fix. Both of these issues concern only the quick and dirty version. Nigel
 272 Horspoon observed on July 6, 2005 that gcc produced the same code on a Pentium
 273 as the obvious solution because of how it evaluates (x < y). On July 9, 2008
 274 Vincent Lefèvre pointed out the potential for overflow exceptions with
 275 subtractions in r = y + ((x - y) & -(x < y)), which was the previous version.
 276 Timothy B. Terriberry suggested using xor rather than add and subract to avoid
 277 casting and the risk of overflows on June 2, 2009.
 278
 279 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 280
 281 Determining if an integer is a power of 2
 282
 283 unsigned int v; // we want to see if v is a power of 2
 284 bool f;         // the result goes here
 285
 286 f = (v & (v - 1)) == 0;
 287
 288 Note that 0 is incorrectly considered a power of 2 here. To remedy this, use:
 289
 290 f = v && !(v & (v - 1));
 291
 292 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 293
 294 Sign extending from a constant bit-width
 295
 296 Sign extension is automatic for built-in types, such as chars and ints. But
 297 suppose you have a signed two's complement number, x, that is stored using only
 298 b bits. Moreover, suppose you want to convert x to an int, which has more than
 299 b bits. A simple copy will work if x is positive, but if negative, the sign
 300 must be extended. For example, if we have only 4 bits to store a number, then
 301 -3 is represented as 1101 in binary. If we have 8 bits, then -3 is 11111101.
 302 The most-significant bit of the 4-bit representation is replicated sinistrally
 303 to fill in the destination when we convert to a representation with more bits;
 304 this is sign extending. In C, sign extension from a constant bit-width is
 305 trivial, since bit fields may be specified in structs or unions. For example,
 306 to convert from 5 bits to an full integer:
 307
 308 int x; // convert this from using 5 bits to a full int
 309 int r; // resulting sign extended number goes here
 310 struct {signed int x:5;} s;
 311 r = s.x = x;
 312
 313 The following is a C++ template function that uses the same language feature to
 314 convert from B bits in one operation (though the compiler is generating more,
 315 of course).
 316
 317 template <typename T, unsigned B>
 318 inline T signextend(const T x)
 319 {
 320   struct {T x:B;} s;
 321   return s.x = x;
 322 }
 323
 324 int r = signextend<signed int,5>(x);  // sign extend 5 bit number x to r
 325
 326 John Byrd caught a typo in the code (attributed to html formatting) on May 2,
 327 2005. On March 4, 2006, Pat Wood pointed out that the ANSI C standard requires
 328 that the bitfield have the keyword "signed" to be signed; otherwise, the sign
 329 is undefined.
 330 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 331
 332 Sign extending from a variable bit-width
 333
 334 Sometimes we need to extend the sign of a number but we don't know a priori the
 335 number of bits, b, in which it is represented. (Or we could be programming in a
 336 language like Java, which lacks bitfields.)
 337
 338 unsigned b; // number of bits representing the number in x
 339 int x;      // sign extend this b-bit number to r
 340 int r;      // resulting sign-extended number
 341 int const m = 1U << (b - 1); // mask can be pre-computed if b is fixed
 342
 343 x = x & ((1U << b) - 1);  // (Skip this if bits in x above position b are already zero.)
 344 r = (x ^ m) - m;
 345
 346 The code above requires four operations, but when the bitwidth is a constant
 347 rather than variable, it requires only two fast operations, assuming the upper
 348 bits are already zeroes.
 349
 350 A slightly faster but less portable method that doesn't depend on the bits in x
 351 above position b being zero is:
 352
 353 int const m = CHAR_BIT * sizeof(x) - b;
 354 r = (x << m) >> m;
 355
 356 Sean A. Irvine suggested that I add sign extension methods to this page on June
 357 13, 2004, and he provided m = (1 << (b - 1)) - 1; r = -(x & ~m) | x; as a
 358 starting point from which I optimized to get m = 1U << (b - 1); r = -(x & m) |
 359 x. But then on May 11, 2007, Shay Green suggested the version above, which
 360 requires one less operation than mine. Vipin Sharma suggested I add a step to
 361 deal with situations where x had possible ones in bits other than the b bits we
 362 wanted to sign-extend on Oct. 15, 2008. On December 31, 2009 Chris Pirazzi
 363 suggested I add the faster version, which requires two operations for constant
 364 bit-widths and three for variable widths.
 365
 366 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 367
 368 Sign extending from a variable bit-width in 3 operations
 369
 370 The following may be slow on some machines, due to the effort required for
 371 multiplication and division. This version is 4 operations. If you know that
 372 your initial bit-width, b, is greater than 1, you might do this type of sign
 373 extension in 3 operations by using r = (x * multipliers[b]) / multipliers[b],
 374 which requires only one array lookup.
 375
 376 unsigned b; // number of bits representing the number in x
 377 int x;      // sign extend this b-bit number to r
 378 int r;      // resulting sign-extended number
 379 #define M(B) (1U << ((sizeof(x) * CHAR_BIT) - B)) // CHAR_BIT=bits/byte
 380 static int const multipliers[] =
 381 {
 382   0,     M(1),  M(2),  M(3),  M(4),  M(5),  M(6),  M(7),
 383   M(8),  M(9),  M(10), M(11), M(12), M(13), M(14), M(15),
 384   M(16), M(17), M(18), M(19), M(20), M(21), M(22), M(23),
 385   M(24), M(25), M(26), M(27), M(28), M(29), M(30), M(31),
 386   M(32)
 387 }; // (add more if using more than 64 bits)
 388 static int const divisors[] =
 389 {
 390   1,    ~M(1),  M(2),  M(3),  M(4),  M(5),  M(6),  M(7),
 391   M(8),  M(9),  M(10), M(11), M(12), M(13), M(14), M(15),
 392   M(16), M(17), M(18), M(19), M(20), M(21), M(22), M(23),
 393   M(24), M(25), M(26), M(27), M(28), M(29), M(30), M(31),
 394   M(32)
 395 }; // (add more for 64 bits)
 396 #undef M
 397 r = (x * multipliers[b]) / divisors[b];
 398
 399 The following variation is not portable, but on architectures that employ an
 400 arithmetic right-shift, maintaining the sign, it should be fast.
 401
 402 const int s = -b; // OR:  sizeof(x) * CHAR_BIT - b;
 403 r = (x << s) >> s;
 404
 405 Randal E. Bryant pointed out a bug on May 3, 2005 in an earlier version (that
 406 used multipliers[] for divisors[]), where it failed on the case of x=1 and b=1.
 407
 408 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 409
 410 Conditionally set or clear bits without branching
 411
 412 bool f;         // conditional flag
 413 unsigned int m; // the bit mask
 414 unsigned int w; // the word to modify:  if (f) w |= m; else w &= ~m;
 415
 416 w ^= (-f ^ w) & m;
 417
 418 // OR, for superscalar CPUs:
 419 w = (w & ~m) | (-f & m);
 420
 421 On some architectures, the lack of branching can more than make up for what
 422 appears to be twice as many operations. For instance, informal speed tests on
 423 an AMD Athlon™ XP 2100+ indicated it was 5-10% faster. An Intel Core 2 Duo ran
 424 the superscalar version about 16% faster than the first. Glenn Slayden informed
 425 me of the first expression on December 11, 2003. Marco Yu shared the
 426 superscalar version with me on April 3, 2007 and alerted me to a typo 2 days
 427 later.
 428
 429 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 430
 431 Conditionally negate a value without branching
 432
 433 If you need to negate only when a flag is false, then use the following to
 434 avoid branching:
 435
 436 bool fDontNegate;  // Flag indicating we should not negate v.
 437 int v;             // Input value to negate if fDontNegate is false.
 438 int r;             // result = fDontNegate ? v : -v;
 439
 440 r = (fDontNegate ^ (fDontNegate - 1)) * v;
 441
 442 If you need to negate only when a flag is true, then use this:
 443
 444 bool fNegate;  // Flag indicating if we should negate v.
 445 int v;         // Input value to negate if fNegate is true.
 446 int r;         // result = fNegate ? -v : v;
 447
 448 r = (v ^ -fNegate) + fNegate;
 449
 450 Avraham Plotnitzky suggested I add the first version on June 2, 2009. Motivated
 451 to avoid the multiply, I came up with the second version on June 8, 2009.
 452 Alfonso De Gregorio pointed out that some parens were missing on November 26,
 453 2009, and received a bug bounty.
 454
 455 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 456
 457 Merge bits from two values according to a mask
 458
 459 unsigned int a;    // value to merge in non-masked bits
 460 unsigned int b;    // value to merge in masked bits
 461 unsigned int mask; // 1 where bits from b should be selected; 0 where from a.
 462 unsigned int r;    // result of (a & ~mask) | (b & mask) goes here
 463
 464 r = a ^ ((a ^ b) & mask);
 465
 466 This shaves one operation from the obvious way of combining two sets of bits
 467 according to a bit mask. If the mask is a constant, then there may be no
 468 advantage.
 469
 470 Ron Jeffery sent this to me on February 9, 2006.
 471
 472 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 473
 474 Counting bits set (naive way)
 475
 476 unsigned int v; // count the number of bits set in v
 477 unsigned int c; // c accumulates the total bits set in v
 478
 479 for (c = 0; v; v >>= 1)
 480 {
 481   c += v & 1;
 482 }
 483
 484 The naive approach requires one iteration per bit, until no more bits are set.
 485 So on a 32-bit word with only the high set, it will go through 32 iterations.
 486
 487 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 488
 489 Counting bits set by lookup table
 490
 491 static const unsigned char BitsSetTable256[256] =
 492 {
 493 #   define B2(n) n,     n+1,     n+1,     n+2
 494 #   define B4(n) B2(n), B2(n+1), B2(n+1), B2(n+2)
 495 #   define B6(n) B4(n), B4(n+1), B4(n+1), B4(n+2)
 496     B6(0), B6(1), B6(1), B6(2)
 497 };
 498
 499 unsigned int v; // count the number of bits set in 32-bit value v
 500 unsigned int c; // c is the total bits set in v
 501
 502 // Option 1:
 503 c = BitsSetTable256[v & 0xff] +
 504     BitsSetTable256[(v >> 8) & 0xff] +
 505     BitsSetTable256[(v >> 16) & 0xff] +
 506     BitsSetTable256[v >> 24];
 507
 508 // Option 2:
 509 unsigned char * p = (unsigned char *) &v;
 510 c = BitsSetTable256[p[0]] +
 511     BitsSetTable256[p[1]] +
 512     BitsSetTable256[p[2]] +
 513     BitsSetTable256[p[3]];
 514
 515
 516 // To initially generate the table algorithmically:
 517 BitsSetTable256[0] = 0;
 518 for (int i = 0; i < 256; i++)
 519 {
 520   BitsSetTable256[i] = (i & 1) + BitsSetTable256[i / 2];
 521 }
 522
 523 On July 14, 2009 Hallvard Furuseth suggested the macro compacted table.
 524
 525 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 526
 527 Counting bits set, Brian Kernighan's way
 528
 529 unsigned int v; // count the number of bits set in v
 530 unsigned int c; // c accumulates the total bits set in v
 531 for (c = 0; v; c++)
 532 {
 533   v &= v - 1; // clear the least significant bit set
 534 }
 535
 536 Brian Kernighan's method goes through as many iterations as there are set bits.
 537 So if we have a 32-bit word with only the high bit set, then it will only go
 538 once through the loop.
 539
 540 Published in 1988, the C Programming Language 2nd Ed. (by Brian W. Kernighan
 541 and Dennis M. Ritchie) mentions this in exercise 2-9. On April 19, 2006 Don
 542 Knuth pointed out to me that this method "was first published by Peter Wegner
 543 in CACM 3 (1960), 322. (Also discovered independently by Derrick Lehmer and
 544 published in 1964 in a book edited by Beckenbach.)"
 545
 546 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 547
 548 Counting bits set in 14, 24, or 32-bit words using 64-bit instructions
 549
 550 unsigned int v; // count the number of bits set in v
 551 unsigned int c; // c accumulates the total bits set in v
 552
 553 // option 1, for at most 14-bit values in v:
 554 c = (v * 0x200040008001ULL & 0x111111111111111ULL) % 0xf;
 555
 556 // option 2, for at most 24-bit values in v:
 557 c =  ((v & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
 558 c += (((v & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL)
 559      % 0x1f;
 560
 561 // option 3, for at most 32-bit values in v:
 562 c =  ((v & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
 563 c += (((v & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL) %
 564      0x1f;
 565 c += ((v >> 24) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
 566
 567 This method requires a 64-bit CPU with fast modulus division to be efficient.
 568 The first option takes only 3 operations; the second option takes 10; and the
 569 third option takes 15.
 570
 571 Rich Schroeppel originally created a 9-bit version, similiar to option 1; see
 572 the Programming Hacks section of Beeler, M., Gosper, R. W., and Schroeppel, R.
 573 HAKMEM. MIT AI Memo 239, Feb. 29, 1972. His method was the inspiration for the
 574 variants above, devised by Sean Anderson. Randal E. Bryant offered a couple bug
 575 fixes on May 3, 2005. Bruce Dawson tweaked what had been a 12-bit version and
 576 made it suitable for 14 bits using the same number of operations on Feburary 1,
 577 2007.
 578
 579 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 580
 581 Counting bits set, in parallel
 582
 583 unsigned int v; // count bits set in this (32-bit value)
 584 unsigned int c; // store the total here
 585 static const int S[] = {1, 2, 4, 8, 16}; // Magic Binary Numbers
 586 static const int B[] = {0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF, 0x0000FFFF};
 587
 588 c = v - ((v >> 1) & B[0]);
 589 c = ((c >> S[1]) & B[1]) + (c & B[1]);
 590 c = ((c >> S[2]) + c) & B[2];
 591 c = ((c >> S[3]) + c) & B[3];
 592 c = ((c >> S[4]) + c) & B[4];
 593
 594 The B array, expressed as binary, is:
 595
 596 B[0] = 0x55555555 = 01010101 01010101 01010101 01010101
 597 B[1] = 0x33333333 = 00110011 00110011 00110011 00110011
 598 B[2] = 0x0F0F0F0F = 00001111 00001111 00001111 00001111
 599 B[3] = 0x00FF00FF = 00000000 11111111 00000000 11111111
 600 B[4] = 0x0000FFFF = 00000000 00000000 11111111 11111111
 601
 602 We can adjust the method for larger integer sizes by continuing with the
 603 patterns for the Binary Magic Numbers, B and S. If there are k bits, then we
 604 need the arrays S and B to be ceil(lg(k)) elements long, and we must compute
 605 the same number of expressions for c as S or B are long. For a 32-bit v, 16
 606 operations are used.
 607
 608 The best method for counting bits in a 32-bit integer v is the following:
 609
 610 v = v - ((v >> 1) & 0x55555555);                    // reuse input as temporary
 611 v = (v & 0x33333333) + ((v >> 2) & 0x33333333);     // temp
 612 c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; // count
 613
 614 The best bit counting method takes only 12 operations, which is the same as the
 615 lookup-table method, but avoids the memory and potential cache misses of a
 616 table. It is a hybrid between the purely parallel method above and the earlier
 617 methods using multiplies (in the section on counting bits with 64-bit
 618 instructions), though it doesn't use 64-bit instructions. The counts of bits
 619 set in the bytes is done in parallel, and the sum total of the bits set in the
 620 bytes is computed by multiplying by 0x1010101 and shifting right 24 bits.
 621
 622 A generalization of the best bit counting method to integers of bit-widths upto
 623 128 (parameterized by type T) is this:
 624
 625 v = v - ((v >> 1) & (T)~(T)0/3);                           // temp
 626 v = (v & (T)~(T)0/15*3) + ((v >> 2) & (T)~(T)0/15*3);      // temp
 627 v = (v + (v >> 4)) & (T)~(T)0/255*15;                      // temp
 628 c = (T)(v * ((T)~(T)0/255)) >> (sizeof(T) - 1) * CHAR_BIT; // count
 629
 630 See Ian Ashdown's nice newsgroup post for more information on counting the
 631 number of bits set (also known as sideways addition). The best bit counting
 632 method was brought to my attention on October 5, 2005 by Andrew Shapira; he
 633 found it in pages 187-188 of Software Optimization Guide for AMD Athlon™ 64 and
 634 Opteron™ Processors. Charlie Gordon suggested a way to shave off one operation
 635 from the purely parallel version on December 14, 2005, and Don Clugston trimmed
 636 three more from it on December 30, 2005. I made a typo with Don's suggestion
 637 that Eric Cole spotted on January 8, 2006. Eric later suggested the arbitrary
 638 bit-width generalization to the best method on November 17, 2006. On April 5,
 639 2007, Al Williams observed that I had a line of dead code at the top of the
 640 first method.
 641
 642 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 643
 644 Count bits set (rank) from the most-significant bit upto a given position
 645
 646 The following finds the the rank of a bit, meaning it returns the sum of bits
 647 that are set to 1 from the most-signficant bit downto the bit at the given
 648 position.
 649
 650   uint64_t v;       // Compute the rank (bits set) in v from the MSB to pos.
 651   unsigned int pos; // Bit position to count bits upto.
 652   uint64_t r;       // Resulting rank of bit at pos goes here.
 653
 654   // Shift out bits after given position.
 655   r = v >> (sizeof(v) * CHAR_BIT - pos);
 656   // Count set bits in parallel.
 657   // r = (r & 0x5555...) + ((r >> 1) & 0x5555...);
 658   r = r - ((r >> 1) & ~0UL/3);
 659   // r = (r & 0x3333...) + ((r >> 2) & 0x3333...);
 660   r = (r & ~0UL/5) + ((r >> 2) & ~0UL/5);
 661   // r = (r & 0x0f0f...) + ((r >> 4) & 0x0f0f...);
 662   r = (r + (r >> 4)) & ~0UL/17;
 663   // r = r % 255;
 664   r = (r * (~0UL/255)) >> ((sizeof(v) - 1) * CHAR_BIT);
 665
 666 Juha Järvi sent this to me on November 21, 2009 as an inverse operation to the
 667 computing the bit position with the given rank, which follows.
 668
 669 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 670
 671 Select the bit position (from the most-significant bit) with the given count
 672 (rank)
 673
 674 The following 64-bit code selects the position of the r^th 1 bit when counting
 675 from the left. In other words if we start at the most significant bit and
 676 proceed to the right, counting the number of bits set to 1 until we reach the
 677 desired rank, r, then the position where we stop is returned. If the rank
 678 requested exceeds the count of bits set, then 64 is returned. The code may be
 679 modified for 32-bit or counting from the right.
 680
 681   uint64_t v;          // Input value to find position with rank r.
 682   unsigned int r;      // Input: bit's desired rank [1-64].
 683   unsigned int s;      // Output: Resulting position of bit with rank r [1-64]
 684   uint64_t a, b, c, d; // Intermediate temporaries for bit count.
 685   unsigned int t;      // Bit count temporary.
 686
 687   // Do a normal parallel bit count for a 64-bit integer,
 688   // but store all intermediate steps.
 689   // a = (v & 0x5555...) + ((v >> 1) & 0x5555...);
 690   a =  v - ((v >> 1) & ~0UL/3);
 691   // b = (a & 0x3333...) + ((a >> 2) & 0x3333...);
 692   b = (a & ~0UL/5) + ((a >> 2) & ~0UL/5);
 693   // c = (b & 0x0f0f...) + ((b >> 4) & 0x0f0f...);
 694   c = (b + (b >> 4)) & ~0UL/0x11;
 695   // d = (c & 0x00ff...) + ((c >> 8) & 0x00ff...);
 696   d = (c + (c >> 8)) & ~0UL/0x101;
 697   t = (d >> 32) + (d >> 48);
 698   // Now do branchless select!
 699   s  = 64;
 700   // if (r > t) {s -= 32; r -= t;}
 701   s -= ((t - r) & 256) >> 3; r -= (t & ((t - r) >> 8));
 702   t  = (d >> (s - 16)) & 0xff;
 703   // if (r > t) {s -= 16; r -= t;}
 704   s -= ((t - r) & 256) >> 4; r -= (t & ((t - r) >> 8));
 705   t  = (c >> (s - 8)) & 0xf;
 706   // if (r > t) {s -= 8; r -= t;}
 707   s -= ((t - r) & 256) >> 5; r -= (t & ((t - r) >> 8));
 708   t  = (b >> (s - 4)) & 0x7;
 709   // if (r > t) {s -= 4; r -= t;}
 710   s -= ((t - r) & 256) >> 6; r -= (t & ((t - r) >> 8));
 711   t  = (a >> (s - 2)) & 0x3;
 712   // if (r > t) {s -= 2; r -= t;}
 713   s -= ((t - r) & 256) >> 7; r -= (t & ((t - r) >> 8));
 714   t  = (v >> (s - 1)) & 0x1;
 715   // if (r > t) s--;
 716   s -= ((t - r) & 256) >> 8;
 717   s = 65 - s;
 718
 719 If branching is fast on your target CPU, consider uncommenting the
 720 if-statements and commenting the lines that follow them.
 721
 722 Juha Järvi sent this to me on November 21, 2009.
 723
 724 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 725
 726 Computing parity the naive way
 727
 728 unsigned int v;       // word value to compute the parity of
 729 bool parity = false;  // parity will be the parity of v
 730
 731 while (v)
 732 {
 733   parity = !parity;
 734   v = v & (v - 1);
 735 }
 736
 737
 738 The above code uses an approach like Brian Kernigan's bit counting, above. The
 739 time it takes is proportional to the number of bits set.
 740
 741 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 742
 743 Compute parity by lookup table
 744
 745 static const bool ParityTable256[256] =
 746 {
 747 #   define P2(n) n, n^1, n^1, n
 748 #   define P4(n) P2(n), P2(n^1), P2(n^1), P2(n)
 749 #   define P6(n) P4(n), P4(n^1), P4(n^1), P4(n)
 750     P6(0), P6(1), P6(1), P6(0)
 751 };
 752
 753 unsigned char b;  // byte value to compute the parity of
 754 bool parity = ParityTable256[b];
 755
 756 // OR, for 32-bit words:
 757 unsigned int v;
 758 v ^= v >> 16;
 759 v ^= v >> 8;
 760 bool parity = ParityTable256[v & 0xff];
 761
 762 // Variation:
 763 unsigned char * p = (unsigned char *) &v;
 764 parity = ParityTable256[p[0] ^ p[1] ^ p[2] ^ p[3]];
 765
 766
 767 Randal E. Bryant encouraged the addition of the (admittedly) obvious last
 768 variation with variable p on May 3, 2005. Bruce Rawles found a typo in an
 769 instance of the table variable's name on September 27, 2005, and he received a
 770 $10 bug bounty. On October 9, 2006, Fabrice Bellard suggested the 32-bit
 771 variations above, which require only one table lookup; the previous version had
 772 four lookups (one per byte) and were slower. On July 14, 2009 Hallvard Furuseth
 773 suggested the macro compacted table.
 774
 775 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 776
 777 Compute parity of a byte using 64-bit multiply and modulus division
 778
 779 unsigned char b;  // byte value to compute the parity of
 780 bool parity =
 781   (((b * 0x0101010101010101ULL) & 0x8040201008040201ULL) % 0x1FF) & 1;
 782
 783 The method above takes around 4 operations, but only works on bytes.
 784
 785 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 786
 787 Compute parity of word with a multiply
 788
 789 The following method computes the parity of the 32-bit value in only 8
 790 operations using a multiply.
 791
 792     unsigned int v; // 32-bit word
 793     v ^= v >> 1;
 794     v ^= v >> 2;
 795     v = (v & 0x11111111U) * 0x11111111U;
 796     return (v >> 28) & 1;
 797
 798 Also for 64-bits, 8 operations are still enough.
 799
 800     unsigned long long v; // 64-bit word
 801     v ^= v >> 1;
 802     v ^= v >> 2;
 803     v = (v & 0x1111111111111111UL) * 0x1111111111111111UL;
 804     return (v >> 60) & 1;
 805
 806 Andrew Shapira came up with this and sent it to me on Sept. 2, 2007.
 807 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 808
 809 Compute parity in parallel
 810
 811 unsigned int v;  // word value to compute the parity of
 812 v ^= v >> 16;
 813 v ^= v >> 8;
 814 v ^= v >> 4;
 815 v &= 0xf;
 816 return (0x6996 >> v) & 1;
 817
 818 The method above takes around 9 operations, and works for 32-bit words. It may
 819 be optimized to work just on bytes in 5 operations by removing the two lines
 820 immediately following "unsigned int v;". The method first shifts and XORs the
 821 eight nibbles of the 32-bit value together, leaving the result in the lowest
 822 nibble of v. Next, the binary number 0110 1001 1001 0110 (0x6996 in hex) is
 823 shifted to the right by the value represented in the lowest nibble of v. This
 824 number is like a miniature 16-bit parity-table indexed by the low four bits in
 825 v. The result has the parity of v in bit 1, which is masked and returned.
 826
 827 Thanks to Mathew Hendry for pointing out the shift-lookup idea at the end on
 828 Dec. 15, 2002. That optimization shaves two operations off using only shifting
 829 and XORing to find the parity.
 830
 831 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 832
 833 Swapping values with subtraction and addition
 834
 835 #define SWAP(a, b) ((&(a) == &(b)) || \
 836                     (((a) -= (b)), ((b) += (a)), ((a) = (b) - (a))))
 837
 838 This swaps the values of a and b without using a temporary variable. The
 839 initial check for a and b being the same location in memory may be omitted when
 840 you know this can't happen. (The compiler may omit it anyway as an
 841 optimization.) If you enable overflows exceptions, then pass unsigned values so
 842 an exception isn't thrown. The XOR method that follows may be slightly faster
 843 on some machines. Don't use this with floating-point numbers (unless you
 844 operate on their raw integer representations).
 845
 846 Sanjeev Sivasankaran suggested I add this on June 12, 2007. Vincent Lefèvre
 847 pointed out the potential for overflow exceptions on July 9, 2008
 848 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 849
 850 Swapping values with XOR
 851
 852 #define SWAP(a, b) (((a) ^= (b)), ((b) ^= (a)), ((a) ^= (b)))
 853
 854 This is an old trick to exchange the values of the variables a and b without
 855 using extra space for a temporary variable.
 856
 857 On January 20, 2005, Iain A. Fleming pointed out that the macro above doesn't
 858 work when you swap with the same memory location, such as SWAP(a[i], a[j]) with
 859 i == j. So if that may occur, consider defining the macro as (((a) == (b)) ||
 860 (((a) ^= (b)), ((b) ^= (a)), ((a) ^= (b)))). On July 14, 2009, Hallvard
 861 Furuseth suggested that on some machines, (((a) ^ (b)) && ((b) ^= (a) ^= (b),
 862 (a) ^= (b))) might be faster, since the (a) ^ (b) expression is reused.
 863
 864 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 865
 866 Swapping individual bits with XOR
 867
 868 unsigned int i, j; // positions of bit sequences to swap
 869 unsigned int n;    // number of consecutive bits in each sequence
 870 unsigned int b;    // bits to swap reside in b
 871 unsigned int r;    // bit-swapped result goes here
 872
 873 unsigned int x = ((b >> i) ^ (b >> j)) & ((1U << n) - 1); // XOR temporary
 874 r = b ^ ((x << i) | (x << j));
 875
 876 As an example of swapping ranges of bits suppose we have have b = 00101111
 877 (expressed in binary) and we want to swap the n = 3 consecutive bits starting
 878 at i = 1 (the second bit from the right) with the 3 consecutive bits starting
 879 at j = 5; the result would be r = 11100011 (binary).
 880
 881 This method of swapping is similar to the general purpose XOR swap trick, but
 882 intended for operating on individual bits.  The variable x stores the result of
 883 XORing the pairs of bit values we want to swap, and then the bits are set to
 884 the result of themselves XORed with x.  Of course, the result is undefined if
 885 the sequences overlap.
 886
 887 On July 14, 2009 Hallvard Furuseth suggested that I change the 1 << n to 1U <<
 888 n because the value was being assigned to an unsigned and to avoid shifting
 889 into a sign bit.
 890
 891 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 892
 893 Reverse bits the obvious way
 894
 895 unsigned int v;     // input bits to be reversed
 896 unsigned int r = v; // r will be reversed bits of v; first get LSB of v
 897 int s = sizeof(v) * CHAR_BIT - 1; // extra shift needed at end
 898
 899 for (v >>= 1; v; v >>= 1)
 900 {
 901   r <<= 1;
 902   r |= v & 1;
 903   s--;
 904 }
 905 r <<= s; // shift when v's highest bits are zero
 906
 907 On October 15, 2004, Michael Hoisie pointed out a bug in the original version.
 908 Randal E. Bryant suggested removing an extra operation on May 3, 2005. Behdad
 909 Esfabod suggested a slight change that eliminated one iteration of the loop on
 910 May 18, 2005. Then, on February 6, 2007, Liyong Zhou suggested a better version
 911 that loops while v is not 0, so rather than iterating over all bits it stops
 912 early.
 913
 914 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 915
 916 Reverse bits in word by lookup table
 917
 918 static const unsigned char BitReverseTable256[256] =
 919 {
 920 #   define R2(n)     n,     n + 2*64,     n + 1*64,     n + 3*64
 921 #   define R4(n) R2(n), R2(n + 2*16), R2(n + 1*16), R2(n + 3*16)
 922 #   define R6(n) R4(n), R4(n + 2*4 ), R4(n + 1*4 ), R4(n + 3*4 )
 923     R6(0), R6(2), R6(1), R6(3)
 924 };
 925
 926 unsigned int v; // reverse 32-bit value, 8 bits at time
 927 unsigned int c; // c will get v reversed
 928
 929 // Option 1:
 930 c = (BitReverseTable256[v & 0xff] << 24) |
 931     (BitReverseTable256[(v >> 8) & 0xff] << 16) |
 932     (BitReverseTable256[(v >> 16) & 0xff] << 8) |
 933     (BitReverseTable256[(v >> 24) & 0xff]);
 934
 935 // Option 2:
 936 unsigned char * p = (unsigned char *) &v;
 937 unsigned char * q = (unsigned char *) &c;
 938 q[3] = BitReverseTable256[p[0]];
 939 q[2] = BitReverseTable256[p[1]];
 940 q[1] = BitReverseTable256[p[2]];
 941 q[0] = BitReverseTable256[p[3]];
 942
 943 The first method takes about 17 operations, and the second takes about 12,
 944 assuming your CPU can load and store bytes easily.
 945
 946 On July 14, 2009 Hallvard Furuseth suggested the macro compacted table.
 947
 948 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 949
 950 Reverse the bits in a byte with 3 operations (64-bit multiply and modulus
 951 division):
 952
 953 unsigned char b; // reverse this (8-bit) byte
 954
 955 b = (b * 0x0202020202ULL & 0x010884422010ULL) % 1023;
 956
 957 The multiply operation creates five separate copies of the 8-bit byte pattern
 958 to fan-out into a 64-bit value. The AND operation selects the bits that are in
 959 the correct (reversed) positions, relative to each 10-bit groups of bits. The
 960 multiply and the AND operations copy the bits from the original byte so they
 961 each appear in only one of the 10-bit sets. The reversed positions of the bits
 962 from the original byte coincide with their relative positions within any 10-bit
 963 set. The last step, which involves modulus division by 2^10 - 1, has the effect
 964 of merging together each set of 10 bits (from positions 0-9, 10-19, 20-29, ...)
 965 in the 64-bit value. They do not overlap, so the addition steps underlying the
 966 modulus division behave like or operations.
 967
 968 This method was attributed to Rich Schroeppel in the Programming Hacks section
 969 of Beeler, M., Gosper, R. W., and Schroeppel, R. HAKMEM. MIT AI Memo 239, Feb.
 970 29, 1972.
 971
 972 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 973
 974 Reverse the bits in a byte with 4 operations (64-bit multiply, no division):
 975
 976 unsigned char b; // reverse this byte
 977
 978 b = ((b * 0x80200802ULL) & 0x0884422110ULL) * 0x0101010101ULL >> 32;
 979
 980 The following shows the flow of the bit values with the boolean variables a, b,
 981 c, d, e, f, g, and h, which comprise an 8-bit byte. Notice how the first
 982 multiply fans out the bit pattern to multiple copies, while the last multiply
 983 combines them in the fifth byte from the right.
 984
 985                                                                                         abcd efgh (-> hgfe dcba)
 986 *                                                      1000 0000  0010 0000  0000 1000  0000 0010 (0x80200802)
 987 -------------------------------------------------------------------------------------------------
 988                                             0abc defg  h00a bcde  fgh0 0abc  defg h00a  bcde fgh0
 989 &                                           0000 1000  1000 0100  0100 0010  0010 0001  0001 0000 (0x0884422110)
 990 -------------------------------------------------------------------------------------------------
 991                                             0000 d000  h000 0c00  0g00 00b0  00f0 000a  000e 0000
 992 *                                           0000 0001  0000 0001  0000 0001  0000 0001  0000 0001 (0x0101010101)
 993 -------------------------------------------------------------------------------------------------
 994                                             0000 d000  h000 0c00  0g00 00b0  00f0 000a  000e 0000
 995                                  0000 d000  h000 0c00  0g00 00b0  00f0 000a  000e 0000
 996                       0000 d000  h000 0c00  0g00 00b0  00f0 000a  000e 0000
 997            0000 d000  h000 0c00  0g00 00b0  00f0 000a  000e 0000
 998 0000 d000  h000 0c00  0g00 00b0  00f0 000a  000e 0000
 999 -------------------------------------------------------------------------------------------------
1000 0000 d000  h000 dc00  hg00 dcb0  hgf0 dcba  hgfe dcba  hgfe 0cba  0gfe 00ba  00fe 000a  000e 0000
1001 >> 32
1002 -------------------------------------------------------------------------------------------------
1003                                             0000 d000  h000 dc00  hg00 dcb0  hgf0 dcba  hgfe dcba
1004 &                                                                                       1111 1111
1005 -------------------------------------------------------------------------------------------------
1006                                                                                         hgfe dcba
1007
1008 Note that the last two steps can be combined on some processors because the
1009 registers can be accessed as bytes; just multiply so that a register stores the
1010 upper 32 bits of the result and the take the low byte. Thus, it may take only 6
1011 operations.
1012
1013 Devised by Sean Anderson, July 13, 2001.
1014
1015 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1016
1017 Reverse the bits in a byte with 7 operations (no 64-bit):
1018
1019 b = ((b * 0x0802LU & 0x22110LU) | (b * 0x8020LU & 0x88440LU)) * 0x10101LU >> 16;
1020
1021 Make sure you assign or cast the result to an unsigned char to remove garbage
1022 in the higher bits. Devised by Sean Anderson, July 13, 2001. Typo spotted and
1023 correction supplied by Mike Keith, January 3, 2002.
1024
1025 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1026
1027 Reverse an N-bit quantity in parallel in 5 * lg(N) operations:
1028
1029 unsigned int v; // 32-bit word to reverse bit order
1030
1031 // swap odd and even bits
1032 v = ((v >> 1) & 0x55555555) | ((v & 0x55555555) << 1);
1033 // swap consecutive pairs
1034 v = ((v >> 2) & 0x33333333) | ((v & 0x33333333) << 2);
1035 // swap nibbles ...
1036 v = ((v >> 4) & 0x0F0F0F0F) | ((v & 0x0F0F0F0F) << 4);
1037 // swap bytes
1038 v = ((v >> 8) & 0x00FF00FF) | ((v & 0x00FF00FF) << 8);
1039 // swap 2-byte long pairs
1040 v = ( v >> 16             ) | ( v               << 16);
1041
1042 The following variation is also O(lg(N)), however it requires more operations
1043 to reverse v. Its virtue is in taking less slightly memory by computing the
1044 constants on the fly.
1045
1046 unsigned int s = sizeof(v) * CHAR_BIT; // bit size; must be power of 2
1047 unsigned int mask = ~0;
1048 while ((s >>= 1) > 0)
1049 {
1050   mask ^= (mask << s);
1051   v = ((v >> s) & mask) | ((v << s) & ~mask);
1052 }
1053
1054 These methods above are best suited to situations where N is large. If you use
1055 the above with 64-bit ints (or larger), then you need to add more lines
1056 (following the pattern); otherwise only the lower 32 bits will be reversed and
1057 the result will be in the lower 32 bits.
1058
1059 See Dr. Dobb's Journal 1983, Edwin Freed's article on Binary Magic Numbers for
1060 more information. The second variation was suggested by Ken Raeburn on
1061 September 13, 2005. Veldmeijer mentioned that the first version could do
1062 without ANDS in the last line on March 19, 2006.
1063
1064 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1065
1066 Compute modulus division by 1 << s without a division operator
1067
1068 const unsigned int n;          // numerator
1069 const unsigned int s;
1070 const unsigned int d = 1U << s; // So d will be one of: 1, 2, 4, 8, 16, 32, ...
1071 unsigned int m;                // m will be n % d
1072 m = n & (d - 1);
1073
1074 Most programmers learn this trick early, but it was included for the sake of
1075 completeness.
1076
1077 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1078
1079 Compute modulus division by (1 << s) - 1 without a division operator
1080
1081 unsigned int n;                      // numerator
1082 const unsigned int s;                // s > 0
1083 const unsigned int d = (1 << s) - 1; // so d is either 1, 3, 7, 15, 31, ...).
1084 unsigned int m;                      // n % d goes here.
1085
1086 for (m = n; n > d; n = m)
1087 {
1088   for (m = 0; n; n >>= s)
1089   {
1090     m += n & d;
1091   }
1092 }
1093 // Now m is a value from 0 to d, but since with modulus division
1094 // we want m to be 0 when it is d.
1095 m = m == d ? 0 : m;
1096
1097 This method of modulus division by an integer that is one less than a power of
1098 2 takes at most 5 + (4 + 5 * ceil(N / s)) * ceil(lg(N / s)) operations, where N
1099 is the number of bits in the numerator. In other words, it takes at most O(N *
1100 lg(N)) time.
1101
1102 Devised by Sean Anderson, August 15, 2001. Before Sean A. Irvine corrected me
1103 on June 17, 2004, I mistakenly commented that we could alternatively assign m =
1104 ((m + 1) & d) - 1; at the end. Michael Miller spotted a typo in the code April
1105 25, 2005.
1106
1107 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1108
1109 Compute modulus division by (1 << s) - 1 in parallel without a division
1110 operator
1111
1112
1113 // The following is for a word size of 32 bits!
1114
1115 static const unsigned int M[] =
1116 {
1117   0x00000000, 0x55555555, 0x33333333, 0xc71c71c7,
1118   0x0f0f0f0f, 0xc1f07c1f, 0x3f03f03f, 0xf01fc07f,
1119   0x00ff00ff, 0x07fc01ff, 0x3ff003ff, 0xffc007ff,
1120   0xff000fff, 0xfc001fff, 0xf0003fff, 0xc0007fff,
1121   0x0000ffff, 0x0001ffff, 0x0003ffff, 0x0007ffff,
1122   0x000fffff, 0x001fffff, 0x003fffff, 0x007fffff,
1123   0x00ffffff, 0x01ffffff, 0x03ffffff, 0x07ffffff,
1124   0x0fffffff, 0x1fffffff, 0x3fffffff, 0x7fffffff
1125 };
1126
1127 static const unsigned int Q[][6] =
1128 {
1129   { 0,  0,  0,  0,  0,  0}, {16,  8,  4,  2,  1,  1}, {16,  8,  4,  2,  2,  2},
1130   {15,  6,  3,  3,  3,  3}, {16,  8,  4,  4,  4,  4}, {15,  5,  5,  5,  5,  5},
1131   {12,  6,  6,  6 , 6,  6}, {14,  7,  7,  7,  7,  7}, {16,  8,  8,  8,  8,  8},
1132   { 9,  9,  9,  9,  9,  9}, {10, 10, 10, 10, 10, 10}, {11, 11, 11, 11, 11, 11},
1133   {12, 12, 12, 12, 12, 12}, {13, 13, 13, 13, 13, 13}, {14, 14, 14, 14, 14, 14},
1134   {15, 15, 15, 15, 15, 15}, {16, 16, 16, 16, 16, 16}, {17, 17, 17, 17, 17, 17},
1135   {18, 18, 18, 18, 18, 18}, {19, 19, 19, 19, 19, 19}, {20, 20, 20, 20, 20, 20},
1136   {21, 21, 21, 21, 21, 21}, {22, 22, 22, 22, 22, 22}, {23, 23, 23, 23, 23, 23},
1137   {24, 24, 24, 24, 24, 24}, {25, 25, 25, 25, 25, 25}, {26, 26, 26, 26, 26, 26},
1138   {27, 27, 27, 27, 27, 27}, {28, 28, 28, 28, 28, 28}, {29, 29, 29, 29, 29, 29},
1139   {30, 30, 30, 30, 30, 30}, {31, 31, 31, 31, 31, 31}
1140 };
1141
1142 static const unsigned int R[][6] =
1143 {
1144   {0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000},
1145   {0x0000ffff, 0x000000ff, 0x0000000f, 0x00000003, 0x00000001, 0x00000001},
1146   {0x0000ffff, 0x000000ff, 0x0000000f, 0x00000003, 0x00000003, 0x00000003},
1147   {0x00007fff, 0x0000003f, 0x00000007, 0x00000007, 0x00000007, 0x00000007},
1148   {0x0000ffff, 0x000000ff, 0x0000000f, 0x0000000f, 0x0000000f, 0x0000000f},
1149   {0x00007fff, 0x0000001f, 0x0000001f, 0x0000001f, 0x0000001f, 0x0000001f},
1150   {0x00000fff, 0x0000003f, 0x0000003f, 0x0000003f, 0x0000003f, 0x0000003f},
1151   {0x00003fff, 0x0000007f, 0x0000007f, 0x0000007f, 0x0000007f, 0x0000007f},
1152   {0x0000ffff, 0x000000ff, 0x000000ff, 0x000000ff, 0x000000ff, 0x000000ff},
1153   {0x000001ff, 0x000001ff, 0x000001ff, 0x000001ff, 0x000001ff, 0x000001ff},
1154   {0x000003ff, 0x000003ff, 0x000003ff, 0x000003ff, 0x000003ff, 0x000003ff},
1155   {0x000007ff, 0x000007ff, 0x000007ff, 0x000007ff, 0x000007ff, 0x000007ff},
1156   {0x00000fff, 0x00000fff, 0x00000fff, 0x00000fff, 0x00000fff, 0x00000fff},
1157   {0x00001fff, 0x00001fff, 0x00001fff, 0x00001fff, 0x00001fff, 0x00001fff},
1158   {0x00003fff, 0x00003fff, 0x00003fff, 0x00003fff, 0x00003fff, 0x00003fff},
1159   {0x00007fff, 0x00007fff, 0x00007fff, 0x00007fff, 0x00007fff, 0x00007fff},
1160   {0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff},
1161   {0x0001ffff, 0x0001ffff, 0x0001ffff, 0x0001ffff, 0x0001ffff, 0x0001ffff},
1162   {0x0003ffff, 0x0003ffff, 0x0003ffff, 0x0003ffff, 0x0003ffff, 0x0003ffff},
1163   {0x0007ffff, 0x0007ffff, 0x0007ffff, 0x0007ffff, 0x0007ffff, 0x0007ffff},
1164   {0x000fffff, 0x000fffff, 0x000fffff, 0x000fffff, 0x000fffff, 0x000fffff},
1165   {0x001fffff, 0x001fffff, 0x001fffff, 0x001fffff, 0x001fffff, 0x001fffff},
1166   {0x003fffff, 0x003fffff, 0x003fffff, 0x003fffff, 0x003fffff, 0x003fffff},
1167   {0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff, 0x007fffff},
1168   {0x00ffffff, 0x00ffffff, 0x00ffffff, 0x00ffffff, 0x00ffffff, 0x00ffffff},
1169   {0x01ffffff, 0x01ffffff, 0x01ffffff, 0x01ffffff, 0x01ffffff, 0x01ffffff},
1170   {0x03ffffff, 0x03ffffff, 0x03ffffff, 0x03ffffff, 0x03ffffff, 0x03ffffff},
1171   {0x07ffffff, 0x07ffffff, 0x07ffffff, 0x07ffffff, 0x07ffffff, 0x07ffffff},
1172   {0x0fffffff, 0x0fffffff, 0x0fffffff, 0x0fffffff, 0x0fffffff, 0x0fffffff},
1173   {0x1fffffff, 0x1fffffff, 0x1fffffff, 0x1fffffff, 0x1fffffff, 0x1fffffff},
1174   {0x3fffffff, 0x3fffffff, 0x3fffffff, 0x3fffffff, 0x3fffffff, 0x3fffffff},
1175   {0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff, 0x7fffffff}
1176 };
1177
1178 unsigned int n;       // numerator
1179 const unsigned int s; // s > 0
1180 const unsigned int d = (1 << s) - 1; // so d is either 1, 3, 7, 15, 31, ...).
1181 unsigned int m;       // n % d goes here.
1182
1183 m = (n & M[s]) + ((n >> s) & M[s]);
1184
1185 for (const unsigned int * q = &Q[s][0], * r = &R[s][0]; m > d; q++, r++)
1186 {
1187   m = (m >> *q) + (m & *r);
1188 }
1189 m = m == d ? 0 : m; // OR, less portably: m = m & -((signed)(m - d) >> s);
1190
1191 This method of finding modulus division by an integer that is one less than a
1192 power of 2 takes at most O(lg(N)) time, where N is the number of bits in the
1193 numerator (32 bits, for the code above). The number of operations is at most 12
1194 + 9 * ceil(lg(N)). The tables may be removed if you know the denominator at
1195 compile time; just extract the few relevent entries and unroll the loop. It may
1196 be easily extended to more bits.
1197
1198 It finds the result by summing the values in base (1 << s) in parallel. First
1199 every other base (1 << s) value is added to the previous one. Imagine that the
1200 result is written on a piece of paper. Cut the paper in half, so that half the
1201 values are on each cut piece. Align the values and sum them onto a new piece of
1202 paper. Repeat by cutting this paper in half (which will be a quarter of the
1203 size of the previous one) and summing, until you cannot cut further. After
1204 performing lg(N/s/2) cuts, we cut no more; just continue to add the values and
1205 put the result onto a new piece of paper as before, while there are at least
1206 two s-bit values.
1207
1208 Devised by Sean Anderson, August 20, 2001. A typo was spotted by Randy E.
1209 Bryant on May 3, 2005 (after pasting the code, I had later added "unsinged" to
1210 a variable declaration). As in the previous hack, I mistakenly commented that
1211 we could alternatively assign m = ((m + 1) & d) - 1; at the end, and Don Knuth
1212 corrected me on April 19, 2006 and suggested m = m & -((signed)(m - d) >> s).
1213 On June 18, 2009 Sean Irvine proposed a change that used ((n >> s) & M[s])
1214 instead of ((n & ~M[s]) >> s), which typically requires fewer operations
1215 because the M[s] constant is already loaded.
1216
1217 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1218
1219 Find the log base 2 of an integer with the MSB N set in O(N) operations (the
1220 obvious way)
1221
1222 unsigned int v; // 32-bit word to find the log base 2 of
1223 unsigned int r = 0; // r will be lg(v)
1224
1225 while (v >>= 1) // unroll for more speed...
1226 {
1227   r++;
1228 }
1229
1230 The log base 2 of an integer is the same as the position of the highest bit set
1231 (or most significant bit set, MSB). The following log base 2 methods are faster
1232 than this one.
1233
1234 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1235
1236 Find the integer log base 2 of an integer with an 64-bit IEEE float
1237
1238 int v; // 32-bit integer to find the log base 2 of
1239 int r; // result of log_2(v) goes here
1240 union { unsigned int u[2]; double d; } t; // temp
1241
1242 t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] = 0x43300000;
1243 t.u[__FLOAT_WORD_ORDER!=LITTLE_ENDIAN] = v;
1244 t.d -= 4503599627370496.0;
1245 r = (t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] >> 20) - 0x3FF;
1246
1247 The code above loads a 64-bit (IEEE-754 floating-point) double with a 32-bit
1248 integer (with no paddding bits) by storing the integer in the mantissa while
1249 the exponent is set to 2^52. From this newly minted double, 2^52 (expressed as
1250 a double) is subtracted, which sets the resulting exponent to the log base 2 of
1251 the input value, v. All that is left is shifting the exponent bits into
1252 position (20 bits right) and subtracting the bias, 0x3FF (which is 1023
1253 decimal). This technique only takes 5 operations, but many CPUs are slow at
1254 manipulating doubles, and the endianess of the architecture must be
1255 accommodated.
1256
1257 Eric Cole sent me this on January 15, 2006. Evan Felix pointed out a typo on
1258 April 4, 2006. Vincent Lefèvre told me on July 9, 2008 to change the endian
1259 check to use the float's endian, which could differ from the integer's endian.
1260
1261 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1262
1263 Find the log base 2 of an integer with a lookup table
1264
1265 static const char LogTable256[256] =
1266 {
1267 #define LT(n) n, n, n, n, n, n, n, n, n, n, n, n, n, n, n, n
1268     -1, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
1269     LT(4), LT(5), LT(5), LT(6), LT(6), LT(6), LT(6),
1270     LT(7), LT(7), LT(7), LT(7), LT(7), LT(7), LT(7), LT(7)
1271 };
1272
1273 unsigned int v; // 32-bit word to find the log of
1274 unsigned r;     // r will be lg(v)
1275 register unsigned int t, tt; // temporaries
1276
1277 if (tt = v >> 16)
1278 {
1279   r = (t = tt >> 8) ? 24 + LogTable256[t] : 16 + LogTable256[tt];
1280 }
1281 else
1282 {
1283   r = (t = v >> 8) ? 8 + LogTable256[t] : LogTable256[v];
1284 }
1285
1286 The lookup table method takes only about 7 operations to find the log of a
1287 32-bit value. If extended for 64-bit quantities, it would take roughly 9
1288 operations. Another operation can be trimmed off by using four tables, with the
1289 possible additions incorporated into each. Using int table elements may be
1290 faster, depending on your architecture.
1291
1292 The code above is tuned to uniformly distributed output values. If your inputs
1293 are evenly distributed across all 32-bit values, then consider using the
1294 following:
1295
1296 if (tt = v >> 24)
1297 {
1298   r = 24 + LogTable256[tt];
1299 }
1300 else if (tt = v >> 16)
1301 {
1302   r = 16 + LogTable256[tt];
1303 }
1304 else if (tt = v >> 8)
1305 {
1306   r = 8 + LogTable256[tt];
1307 }
1308 else
1309 {
1310   r = LogTable256[v];
1311 }
1312
1313 To initially generate the log table algorithmically:
1314
1315 LogTable256[0] = LogTable256[1] = 0;
1316 for (int i = 2; i < 256; i++)
1317 {
1318   LogTable256[i] = 1 + LogTable256[i / 2];
1319 }
1320 LogTable256[0] = -1; // if you want log(0) to return -1
1321
1322 Behdad Esfahbod and I shaved off a fraction of an operation (on average) on May
1323 18, 2005. Yet another fraction of an operation was removed on November 14, 2006
1324 by Emanuel Hoogeveen. The variation that is tuned to evenly distributed input
1325 values was suggested by David A. Butterfield on September 19, 2008. Venkat
1326 Reddy told me on January 5, 2009 that log(0) should return -1 to indicate an
1327 error, so I changed the first entry in the table to that.
1328 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1329
1330 Find the log base 2 of an N-bit integer in O(lg(N)) operations
1331
1332 unsigned int v;  // 32-bit value to find the log2 of
1333 const unsigned int b[] = {0x2, 0xC, 0xF0, 0xFF00, 0xFFFF0000};
1334 const unsigned int S[] = {1, 2, 4, 8, 16};
1335 int i;
1336
1337 register unsigned int r = 0; // result of log2(v) will go here
1338 for (i = 4; i >= 0; i--) // unroll for speed...
1339 {
1340   if (v & b[i])
1341   {
1342     v >>= S[i];
1343     r |= S[i];
1344   }
1345 }
1346
1347
1348 // OR (IF YOUR CPU BRANCHES SLOWLY):
1349
1350 unsigned int v;          // 32-bit value to find the log2 of
1351 register unsigned int r; // result of log2(v) will go here
1352 register unsigned int shift;
1353
1354 r =     (v > 0xFFFF) << 4; v >>= r;
1355 shift = (v > 0xFF  ) << 3; v >>= shift; r |= shift;
1356 shift = (v > 0xF   ) << 2; v >>= shift; r |= shift;
1357 shift = (v > 0x3   ) << 1; v >>= shift; r |= shift;
1358                                         r |= (v >> 1);
1359
1360
1361 // OR (IF YOU KNOW v IS A POWER OF 2):
1362
1363 unsigned int v;  // 32-bit value to find the log2 of
1364 static const unsigned int b[] = {0xAAAAAAAA, 0xCCCCCCCC, 0xF0F0F0F0,
1365                                  0xFF00FF00, 0xFFFF0000};
1366 register unsigned int r = (v & b[0]) != 0;
1367 for (i = 4; i > 0; i--) // unroll for speed...
1368 {
1369   r |= ((v & b[i]) != 0) << i;
1370 }
1371
1372 Of course, to extend the code to find the log of a 33- to 64-bit number, we
1373 would append another element, 0xFFFFFFFF00000000, to b, append 32 to S, and
1374 loop from 5 to 0. This method is much slower than the earlier table-lookup
1375 version, but if you don't want big table or your architecture is slow to access
1376 memory, it's a good choice. The second variation involves slightly more
1377 operations, but it may be faster on machines with high branch costs (e.g.
1378 PowerPC).
1379
1380 The second version was sent to me by Eric Cole on January 7, 2006. Andrew
1381 Shapira subsequently trimmed a few operations off of it and sent me his
1382 variation (above) on Sept. 1, 2007. The third variation was suggested to me by
1383 John Owens on April 24, 2002; it's faster, but it is only suitable when the
1384 input is known to be a power of 2. On May 25, 2003, Ken Raeburn suggested
1385 improving the general case by using smaller numbers for b[], which load faster
1386 on some architectures (for instance if the word size is 16 bits, then only one
1387 load instruction may be needed). These values work for the general version, but
1388 not for the special-case version below it, where v is a power of 2; Glenn
1389 Slayden brought this oversight to my attention on December 12, 2003.
1390
1391 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1392
1393 Find the log base 2 of an N-bit integer in O(lg(N)) operations with multiply
1394 and lookup
1395
1396 uint32_t v; // find the log base 2 of 32-bit v
1397 int r;      // result goes here
1398
1399 static const int MultiplyDeBruijnBitPosition[32] =
1400 {
1401   0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30,
1402   8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31
1403 };
1404
1405 v |= v >> 1; // first round down to one less than a power of 2
1406 v |= v >> 2;
1407 v |= v >> 4;
1408 v |= v >> 8;
1409 v |= v >> 16;
1410
1411 r = MultiplyDeBruijnBitPosition[(uint32_t)(v * 0x07C4ACDDU) >> 27];
1412
1413 The code above computes the log base 2 of a 32-bit integer with a small table
1414 lookup and multiply. It requires only 13 operations, compared to (up to) 20 for
1415 the previous method. The purely table-based method requires the fewest
1416 operations, but this offers a reasonable compromise between table size and
1417 speed.
1418
1419 If you know that v is a power of 2, then you only need the following:
1420
1421 static const int MultiplyDeBruijnBitPosition2[32] =
1422 {
1423   0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
1424   31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
1425 };
1426 r = MultiplyDeBruijnBitPosition2[(uint32_t)(v * 0x077CB531U) >> 27];
1427
1428 Eric Cole devised this January 8, 2006 after reading about the entry below to
1429 round up to a power of 2 and the method below for computing the number of
1430 trailing bits with a multiply and lookup using a DeBruijn sequence. On December
1431 10, 2009, Mark Dickinson shaved off a couple operations by requiring v be
1432 rounded up to one less than the next power of 2 rather than the power of 2.
1433
1434 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1435
1436 Find integer log base 10 of an integer
1437
1438 unsigned int v; // non-zero 32-bit integer value to compute the log base 10 of
1439 int r;          // result goes here
1440 int t;          // temporary
1441
1442 static unsigned int const PowersOf10[] =
1443     {1, 10, 100, 1000, 10000, 100000,
1444      1000000, 10000000, 100000000, 1000000000};
1445
1446 t = (IntegerLogBase2(v) + 1) * 1233 >> 12; // (use a lg2 method from above)
1447 r = t - (v < PowersOf10[t]);
1448
1449 The integer log base 10 is computed by first using one of the techniques above
1450 for finding the log base 2. By the relationship log[10](v) = log[2](v) / log[2]
1451 (10), we need to multiply it by 1/log[2](10), which is approximately 1233/4096,
1452 or 1233 followed by a right shift of 12. Adding one is needed because the
1453 IntegerLogBase2 rounds down. Finally, since the value t is only an
1454 approximation that may be off by one, the exact value is found by subtracting
1455 the result of v < PowersOf10[t].
1456
1457 This method takes 6 more operations than IntegerLogBase2. It may be sped up (on
1458 machines with fast memory access) by modifying the log base 2 table-lookup
1459 method above so that the entries hold what is computed for t (that is, pre-add,
1460 -mulitply, and -shift). Doing so would require a total of only 9 operations to
1461 find the log base 10, assuming 4 tables were used (one for each byte of v).
1462
1463 Eric Cole suggested I add a version of this on January 7, 2006.
1464
1465 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1466
1467 Find integer log base 10 of an integer the obvious way
1468
1469 unsigned int v; // non-zero 32-bit integer value to compute the log base 10 of
1470 int r;          // result goes here
1471
1472 r = (v >= 1000000000) ? 9 : (v >= 100000000) ? 8 : (v >= 10000000) ? 7 :
1473     (v >= 1000000) ? 6 : (v >= 100000) ? 5 : (v >= 10000) ? 4 :
1474     (v >= 1000) ? 3 : (v >= 100) ? 2 : (v >= 10) ? 1 : 0;
1475
1476 This method works well when the input is uniformly distributed over 32-bit
1477 values because 76% of the inputs are caught by the first compare, 21% are
1478 caught by the second compare, 2% are caught by the third, and so on (chopping
1479 the remaining down by 90% with each comparision). As a result, less than 2.6
1480 operations are needed on average.
1481
1482 On April 18, 2007, Emanuel Hoogeveen suggested a variation on this where the
1483 conditions used divisions, which were not as fast as simple comparisons.
1484 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1485
1486 Find integer log base 2 of a 32-bit IEEE float
1487
1488 const float v; // find int(log2(v)), where v > 0.0 && finite(v) && isnormal(v)
1489 int c;         // 32-bit int c gets the result;
1490
1491 c = *(const int *) &v;  // OR, for portability:  memcpy(&c, &v, sizeof c);
1492 c = (c >> 23) - 127;
1493
1494 The above is fast, but IEEE 754-compliant architectures utilize subnormal (also
1495 called denormal) floating point numbers. These have the exponent bits set to
1496 zero (signifying pow(2,-127)), and the mantissa is not normalized, so it
1497 contains leading zeros and thus the log2 must be computed from the mantissa. To
1498 accomodate for subnormal numbers, use the following:
1499
1500 const float v;              // find int(log2(v)), where v > 0.0 && finite(v)
1501 int c;                      // 32-bit int c gets the result;
1502 int x = *(const int *) &v;  // OR, for portability:  memcpy(&x, &v, sizeof x);
1503
1504 c = x >> 23;
1505
1506 if (c)
1507 {
1508   c -= 127;
1509 }
1510 else
1511 { // subnormal, so recompute using mantissa: c = intlog2(x) - 149;
1512   register unsigned int t; // temporary
1513   // Note that LogTable256 was defined earlier
1514   if (t = x >> 16)
1515   {
1516     c = LogTable256[t] - 133;
1517   }
1518   else
1519   {
1520     c = (t = x >> 8) ? LogTable256[t] - 141 : LogTable256[x] - 149;
1521   }
1522 }
1523
1524 On June 20, 2004, Sean A. Irvine suggested that I include code to handle
1525 subnormal numbers. On June 11, 2005, Falk Hüffner pointed out that ISO C99 6.5/
1526 7 specified undefined behavior for the common type punning idiom *(int *)&,
1527 though it has worked on 99.9% of C compilers. He proposed using memcpy for
1528 maximum portability or a union with a float and an int for better code
1529 generation than memcpy on some compilers.
1530
1531 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1532
1533 Find integer log base 2 of the pow(2, r)-root of a 32-bit IEEE float (for
1534 unsigned integer r)
1535
1536 const int r;
1537 const float v; // find int(log2(pow((double) v, 1. / pow(2, r)))),
1538                // where isnormal(v) and v > 0
1539 int c;         // 32-bit int c gets the result;
1540
1541 c = *(const int *) &v;  // OR, for portability:  memcpy(&c, &v, sizeof c);
1542 c = ((((c - 0x3f800000) >> r) + 0x3f800000) >> 23) - 127;
1543
1544 So, if r is 0, for example, we have c = int(log2((double) v)). If r is 1, then
1545 we have c = int(log2(sqrt((double) v))). If r is 2, then we have c = int(log2
1546 (pow((double) v, 1./4))).
1547
1548 On June 11, 2005, Falk Hüffner pointed out that ISO C99 6.5/7 left the type
1549 punning idiom *(int *)& undefined, and he suggested using memcpy.
1550
1551 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1552
1553 Count the consecutive zero bits (trailing) on the right linearly
1554
1555 unsigned int v;  // input to count trailing zero bits
1556 int c;  // output: c will count v's trailing zero bits,
1557         // so if v is 1101000 (base 2), then c will be 3
1558 if (v)
1559 {
1560   v = (v ^ (v - 1)) >> 1;  // Set v's trailing 0s to 1s and zero rest
1561   for (c = 0; v; c++)
1562   {
1563     v >>= 1;
1564   }
1565 }
1566 else
1567 {
1568   c = CHAR_BIT * sizeof(v);
1569 }
1570
1571 The average number of trailing zero bits in a (uniformly distributed) random
1572 binary number is one, so this O(trailing zeros) solution isn't that bad
1573 compared to the faster methods below.
1574
1575 Jim Cole suggested I add a linear-time method for counting the trailing zeros
1576 on August 15, 2007. On October 22, 2007, Jason Cunningham pointed out that I
1577 had neglected to paste the unsigned modifier for v.
1578 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1579
1580 Count the consecutive zero bits (trailing) on the right in parallel
1581
1582 unsigned int v;      // 32-bit word input to count zero bits on right
1583 unsigned int c = 32; // c will be the number of zero bits on the right
1584 v &= -signed(v);
1585 if (v) c--;
1586 if (v & 0x0000FFFF) c -= 16;
1587 if (v & 0x00FF00FF) c -= 8;
1588 if (v & 0x0F0F0F0F) c -= 4;
1589 if (v & 0x33333333) c -= 2;
1590 if (v & 0x55555555) c -= 1;
1591
1592 Here, we are basically doing the same operations as finding the log base 2 in
1593 parallel, but we first isolate the lowest 1 bit, and then proceed with c
1594 starting at the maximum and decreasing. The number of operations is at most 3 *
1595 lg(N) + 4, roughly, for N bit words.
1596
1597 Bill Burdick suggested an optimization, reducing the time from 4 * lg(N) on
1598 February 4, 2011.
1599
1600 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1601
1602 Count the consecutive zero bits (trailing) on the right by binary search
1603
1604 unsigned int v;     // 32-bit word input to count zero bits on right
1605 unsigned int c;     // c will be the number of zero bits on the right,
1606                     // so if v is 1101000 (base 2), then c will be 3
1607 // NOTE: if 0 == v, then c = 31.
1608 if (v & 0x1)
1609 {
1610   // special case for odd v (assumed to happen half of the time)
1611   c = 0;
1612 }
1613 else
1614 {
1615   c = 1;
1616   if ((v & 0xffff) == 0)
1617   {
1618     v >>= 16;
1619     c += 16;
1620   }
1621   if ((v & 0xff) == 0)
1622   {
1623     v >>= 8;
1624     c += 8;
1625   }
1626   if ((v & 0xf) == 0)
1627   {
1628     v >>= 4;
1629     c += 4;
1630   }
1631   if ((v & 0x3) == 0)
1632   {
1633     v >>= 2;
1634     c += 2;
1635   }
1636   c -= v & 0x1;
1637 }
1638
1639 The code above is similar to the previous method, but it computes the number of
1640 trailing zeros by accumulating c in a manner akin to binary search. In the
1641 first step, it checks if the bottom 16 bits of v are zeros, and if so, shifts v
1642 right 16 bits and adds 16 to c, which reduces the number of bits in v to
1643 consider by half. Each of the subsequent conditional steps likewise halves the
1644 number of bits until there is only 1. This method is faster than the last one
1645 (by about 33%) because the bodies of the if statements are executed less often.
1646
1647 Matt Whitlock suggested this on January 25, 2006. Andrew Shapira shaved a
1648 couple operations off on Sept. 5, 2007 (by setting c=1 and unconditionally
1649 subtracting at the end).
1650
1651 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1652
1653 Count the consecutive zero bits (trailing) on the right by casting to a float
1654
1655 unsigned int v;            // find the number of trailing zeros in v
1656 int r;                     // the result goes here
1657 float f = (float)(v & -v); // cast the least significant bit in v to a float
1658 r = (*(uint32_t *)&f >> 23) - 0x7f;
1659
1660 Although this only takes about 6 operations, the time to convert an integer to
1661 a float can be high on some machines. The exponent of the 32-bit IEEE floating
1662 point representation is shifted down, and the bias is subtracted to give the
1663 position of the least significant 1 bit set in v. If v is zero, then the result
1664 is -127.
1665
1666 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1667
1668 Count the consecutive zero bits (trailing) on the right with modulus division
1669 and lookup
1670
1671 unsigned int v;  // find the number of trailing zeros in v
1672 int r;           // put the result in r
1673 static const int Mod37BitPosition[] = // map a bit value mod 37 to its position
1674 {
1675   32, 0, 1, 26, 2, 23, 27, 0, 3, 16, 24, 30, 28, 11, 0, 13, 4,
1676   7, 17, 0, 25, 22, 31, 15, 29, 10, 12, 6, 0, 21, 14, 9, 5,
1677   20, 8, 19, 18
1678 };
1679 r = Mod37BitPosition[(-v & v) % 37];
1680
1681 The code above finds the number of zeros that are trailing on the right, so
1682 binary 0100 would produce 2. It makes use of the fact that the first 32 bit
1683 position values are relatively prime with 37, so performing a modulus division
1684 with 37 gives a unique number from 0 to 36 for each. These numbers may then be
1685 mapped to the number of zeros using a small lookup table. It uses only 4
1686 operations, however indexing into a table and performing modulus division may
1687 make it unsuitable for some situations. I came up with this independently and
1688 then searched for a subsequence of the table values, and found it was invented
1689 earlier by Reiser, according to Hacker's Delight.
1690
1691 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1692
1693 Count the consecutive zero bits (trailing) on the right with multiply and
1694 lookup
1695
1696 unsigned int v;  // find the number of trailing zeros in 32-bit v
1697 int r;           // result goes here
1698 static const int MultiplyDeBruijnBitPosition[32] =
1699 {
1700   0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
1701   31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
1702 };
1703 r = MultiplyDeBruijnBitPosition[((uint32_t)((v & -v) * 0x077CB531U)) >> 27];
1704
1705 Converting bit vectors to indices of set bits is an example use for this. It
1706 requires one more operation than the earlier one involving modulus division,
1707 but the multiply may be faster. The expression (v & -v) extracts the least
1708 significant 1 bit from v. The constant 0x077CB531UL is a de Bruijn sequence,
1709 which produces a unique pattern of bits into the high 5 bits for each possible
1710 bit position that it is multiplied against. When there are no bits set, it
1711 returns 0. More information can be found by reading the paper Using de Bruijn
1712 Sequences to Index 1 in a Computer Word by Charles E. Leiserson, Harald Prokof,
1713 and Keith H. Randall.
1714
1715 On October 8, 2005 Andrew Shapira suggested I add this. Dustin Spicuzza asked
1716 me on April 14, 2009 to cast the result of the multiply to a 32-bit type so it
1717 would work when compiled with 64-bit ints.
1718
1719 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1720
1721 Round up to the next highest power of 2 by float casting
1722
1723 unsigned int const v; // Round this 32-bit value to the next highest power of 2
1724 unsigned int r;       // Put the result here. (So v=3 -> r=4; v=8 -> r=8)
1725
1726 if (v > 1)
1727 {
1728   float f = (float)v;
1729   unsigned int const t = 1U << ((*(unsigned int *)&f >> 23) - 0x7f);
1730   r = t << (t < v);
1731 }
1732 else
1733 {
1734   r = 1;
1735 }
1736
1737 The code above uses 8 operations, but works on all v <= (1<<31).
1738
1739 Quick and dirty version, for domain of 1 < v < (1<<25):
1740
1741 float f = (float)(v - 1);
1742 r = 1U << ((*(unsigned int*)(&f) >> 23) - 126);
1743
1744 Although the quick and dirty version only uses around 6 operations, it is
1745 roughly three times slower than the technique below (which involves 12
1746 operations) when benchmarked on an Athlon™ XP 2100+ CPU. Some CPUs will fare
1747 better with it, though.
1748
1749 On September 27, 2005 Andi Smithers suggested I include a technique for casting
1750 to floats to find the lg of a number for rounding up to a power of 2. Similar
1751 to the quick and dirty version here, his version worked with values less than
1752 (1<<25), due to mantissa rounding, but it used one more operation.
1753
1754 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1755
1756 Round up to the next highest power of 2
1757
1758 unsigned int v; // compute the next highest power of 2 of 32-bit v
1759
1760 v--;
1761 v |= v >> 1;
1762 v |= v >> 2;
1763 v |= v >> 4;
1764 v |= v >> 8;
1765 v |= v >> 16;
1766 v++;
1767
1768 In 12 operations, this code computes the next highest power of 2 for a 32-bit
1769 integer. The result may be expressed by the formula 1U << (lg(v - 1) + 1). Note
1770 that in the edge case where v is 0, it returns 0, which isn't a power of 2; you
1771 might append the expression v += (v == 0) to remedy this if it matters. It
1772 would be faster by 2 operations to use the formula and the log base 2 methed
1773 that uses a lookup table, but in some situations, lookup tables are not
1774 suitable, so the above code may be best. (On a Athlon™ XP 2100+ I've found the
1775 above shift-left and then OR code is as fast as using a single BSR assembly
1776 language instruction, which scans in reverse to find the highest set bit.) It
1777 works by copying the highest set bit to all of the lower bits, and then adding
1778 one, which results in carries that set all of the lower bits to 0 and one bit
1779 beyond the highest set bit to 1. If the original number was a power of 2, then
1780 the decrement will reduce it to one less, so that we round up to the same
1781 original value.
1782
1783 You might alternatively compute the next higher power of 2 in only 8 or 9
1784 operations using a lookup table for floor(lg(v)) and then evaluating 1
1785 <<(1+floor(lg(v))); Atul Divekar suggested I mention this on September 5, 2010.
1786
1787 Devised by Sean Anderson, Sepember 14, 2001. Pete Hart pointed me to a couple
1788 newsgroup posts by him and William Lewis in February of 1997, where they arrive
1789 at the same algorithm.
1790
1791 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1792
1793 Interleave bits the obvious way
1794
1795 unsigned short x;   // Interleave bits of x and y, so that all of the
1796 unsigned short y;   // bits of x are in the even positions and y in the odd;
1797 unsigned int z = 0; // z gets the resulting Morton Number.
1798
1799 for (int i = 0; i < sizeof(x) * CHAR_BIT; i++) // unroll for more speed...
1800 {
1801   z |= (x & 1U << i) << i | (y & 1U << i) << (i + 1);
1802 }
1803
1804 Interleaved bits (aka Morton numbers) are useful for linearizing 2D integer
1805 coordinates, so x and y are combined into a single number that can be compared
1806 easily and has the property that a number is usually close to another if their
1807 x and y values are close.
1808
1809 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1810
1811 Interleave bits by table lookup
1812
1813 static const unsigned short MortonTable256[256] =
1814 {
1815   0x0000, 0x0001, 0x0004, 0x0005, 0x0010, 0x0011, 0x0014, 0x0015,
1816   0x0040, 0x0041, 0x0044, 0x0045, 0x0050, 0x0051, 0x0054, 0x0055,
1817   0x0100, 0x0101, 0x0104, 0x0105, 0x0110, 0x0111, 0x0114, 0x0115,
1818   0x0140, 0x0141, 0x0144, 0x0145, 0x0150, 0x0151, 0x0154, 0x0155,
1819   0x0400, 0x0401, 0x0404, 0x0405, 0x0410, 0x0411, 0x0414, 0x0415,
1820   0x0440, 0x0441, 0x0444, 0x0445, 0x0450, 0x0451, 0x0454, 0x0455,
1821   0x0500, 0x0501, 0x0504, 0x0505, 0x0510, 0x0511, 0x0514, 0x0515,
1822   0x0540, 0x0541, 0x0544, 0x0545, 0x0550, 0x0551, 0x0554, 0x0555,
1823   0x1000, 0x1001, 0x1004, 0x1005, 0x1010, 0x1011, 0x1014, 0x1015,
1824   0x1040, 0x1041, 0x1044, 0x1045, 0x1050, 0x1051, 0x1054, 0x1055,
1825   0x1100, 0x1101, 0x1104, 0x1105, 0x1110, 0x1111, 0x1114, 0x1115,
1826   0x1140, 0x1141, 0x1144, 0x1145, 0x1150, 0x1151, 0x1154, 0x1155,
1827   0x1400, 0x1401, 0x1404, 0x1405, 0x1410, 0x1411, 0x1414, 0x1415,
1828   0x1440, 0x1441, 0x1444, 0x1445, 0x1450, 0x1451, 0x1454, 0x1455,
1829   0x1500, 0x1501, 0x1504, 0x1505, 0x1510, 0x1511, 0x1514, 0x1515,
1830   0x1540, 0x1541, 0x1544, 0x1545, 0x1550, 0x1551, 0x1554, 0x1555,
1831   0x4000, 0x4001, 0x4004, 0x4005, 0x4010, 0x4011, 0x4014, 0x4015,
1832   0x4040, 0x4041, 0x4044, 0x4045, 0x4050, 0x4051, 0x4054, 0x4055,
1833   0x4100, 0x4101, 0x4104, 0x4105, 0x4110, 0x4111, 0x4114, 0x4115,
1834   0x4140, 0x4141, 0x4144, 0x4145, 0x4150, 0x4151, 0x4154, 0x4155,
1835   0x4400, 0x4401, 0x4404, 0x4405, 0x4410, 0x4411, 0x4414, 0x4415,
1836   0x4440, 0x4441, 0x4444, 0x4445, 0x4450, 0x4451, 0x4454, 0x4455,
1837   0x4500, 0x4501, 0x4504, 0x4505, 0x4510, 0x4511, 0x4514, 0x4515,
1838   0x4540, 0x4541, 0x4544, 0x4545, 0x4550, 0x4551, 0x4554, 0x4555,
1839   0x5000, 0x5001, 0x5004, 0x5005, 0x5010, 0x5011, 0x5014, 0x5015,
1840   0x5040, 0x5041, 0x5044, 0x5045, 0x5050, 0x5051, 0x5054, 0x5055,
1841   0x5100, 0x5101, 0x5104, 0x5105, 0x5110, 0x5111, 0x5114, 0x5115,
1842   0x5140, 0x5141, 0x5144, 0x5145, 0x5150, 0x5151, 0x5154, 0x5155,
1843   0x5400, 0x5401, 0x5404, 0x5405, 0x5410, 0x5411, 0x5414, 0x5415,
1844   0x5440, 0x5441, 0x5444, 0x5445, 0x5450, 0x5451, 0x5454, 0x5455,
1845   0x5500, 0x5501, 0x5504, 0x5505, 0x5510, 0x5511, 0x5514, 0x5515,
1846   0x5540, 0x5541, 0x5544, 0x5545, 0x5550, 0x5551, 0x5554, 0x5555
1847 };
1848
1849 unsigned short x; // Interleave bits of x and y, so that all of the
1850 unsigned short y; // bits of x are in the even positions and y in the odd;
1851 unsigned int z;   // z gets the resulting 32-bit Morton Number.
1852
1853 z = MortonTable256[y >> 8]   << 17 |
1854     MortonTable256[x >> 8]   << 16 |
1855     MortonTable256[y & 0xFF] <<  1 |
1856     MortonTable256[x & 0xFF];
1857
1858
1859 For more speed, use an additional table with values that are MortonTable256
1860 pre-shifted one bit to the left. This second table could then be used for the y
1861 lookups, thus reducing the operations by two, but almost doubling the memory
1862 required. Extending this same idea, four tables could be used, with two of them
1863 pre-shifted by 16 to the left of the previous two, so that we would only need
1864 11 operations total.
1865 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1866
1867 Interleave bits with 64-bit multiply
1868
1869 In 11 operations, this version interleaves bits of two bytes (rather than
1870 shorts, as in the other versions), but many of the operations are 64-bit
1871 multiplies so it isn't appropriate for all machines. The input parameters, x
1872 and y, should be less than 256.
1873
1874 unsigned char x;  // Interleave bits of (8-bit) x and y, so that all of the
1875 unsigned char y;  // bits of x are in the even positions and y in the odd;
1876 unsigned short z; // z gets the resulting 16-bit Morton Number.
1877
1878 z = ((x * 0x0101010101010101ULL & 0x8040201008040201ULL) *
1879      0x0102040810204081ULL >> 49) & 0x5555 |
1880     ((y * 0x0101010101010101ULL & 0x8040201008040201ULL) *
1881      0x0102040810204081ULL >> 48) & 0xAAAA;
1882
1883 Holger Bettag was inspired to suggest this technique on October 10, 2004 after
1884 reading the multiply-based bit reversals here.
1885
1886 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1887
1888 Interleave bits by Binary Magic Numbers
1889
1890 static const unsigned int B[] = {0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF};
1891 static const unsigned int S[] = {1, 2, 4, 8};
1892
1893 unsigned int x; // Interleave lower 16 bits of x and y, so the bits of x
1894 unsigned int y; // are in the even positions and bits from y in the odd;
1895 unsigned int z; // z gets the resulting 32-bit Morton Number.
1896                 // x and y must initially be less than 65536.
1897
1898 x = (x | (x << S[3])) & B[3];
1899 x = (x | (x << S[2])) & B[2];
1900 x = (x | (x << S[1])) & B[1];
1901 x = (x | (x << S[0])) & B[0];
1902
1903 y = (y | (y << S[3])) & B[3];
1904 y = (y | (y << S[2])) & B[2];
1905 y = (y | (y << S[1])) & B[1];
1906 y = (y | (y << S[0])) & B[0];
1907
1908 z = x | (y << 1);
1909
1910 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1911
1912 Determine if a word has a zero byte
1913
1914 // Fewer operations:
1915 unsigned int v; // 32-bit word to check if any 8-bit byte in it is 0
1916 bool hasZeroByte = ~((((v & 0x7F7F7F7F) + 0x7F7F7F7F) | v) | 0x7F7F7F7F);
1917
1918 The code above may be useful when doing a fast string copy in which a word is
1919 copied at a time; it uses 5 operations. On the other hand, testing for a null
1920 byte in the obvious ways (which follow) have at least 7 operations (when
1921 counted in the most sparing way), and at most 12.
1922
1923 // More operations:
1924 bool hasNoZeroByte = ((v & 0xff) && (v & 0xff00) && (v & 0xff0000) && (v & 0xff000000))
1925 // OR:
1926 unsigned char * p = (unsigned char *) &v;
1927 bool hasNoZeroByte = *p && *(p + 1) && *(p + 2) && *(p + 3);
1928
1929 The code at the beginning of this section (labeled "Fewer operations") works by
1930 first zeroing the high bits of the 4 bytes in the word. Subsequently, it adds a
1931 number that will result in an overflow to the high bit of a byte if any of the
1932 low bits were initialy set. Next the high bits of the original word are ORed
1933 with these values; thus, the high bit of a byte is set iff any bit in the byte
1934 was set. Finally, we determine if any of these high bits are zero by ORing with
1935 ones everywhere except the high bits and inverting the result. Extending to 64
1936 bits is trivial; simply increase the constants to be 0x7F7F7F7F7F7F7F7F.
1937
1938 For an additional improvement, a fast pretest that requires only 4 operations
1939 may be performed to determine if the word may have a zero byte. The test also
1940 returns true if the high byte is 0x80, so there are occasional false positives,
1941 but the slower and more reliable version above may then be used on candidates
1942 for an overall increase in speed with correct output.
1943
1944 bool hasZeroByte = ((v + 0x7efefeff) ^ ~v) & 0x81010100;
1945 if (hasZeroByte) // or may just have 0x80 in the high byte
1946 {
1947   hasZeroByte = ~((((v & 0x7F7F7F7F) + 0x7F7F7F7F) | v) | 0x7F7F7F7F);
1948 }
1949
1950 There is yet a faster method — use hasless(v, 1), which is defined below; it
1951 works in 4 operations and requires no subsquent verification. It simplifies to
1952
1953 #define haszero(v) (((v) - 0x01010101UL) & ~(v) & 0x80808080UL)
1954
1955 The subexpression (v - 0x01010101UL), evaluates to a high bit set in any byte
1956 whenever the corresponding byte in v is zero or greater than 0x80. The
1957 sub-expression ~v & 0x80808080UL evaluates to high bits set in bytes where the
1958 byte of v doesn't have its high bit set (so the byte was less than 0x80).
1959 Finally, by ANDing these two sub-expressions the result is the high bits set
1960 where the bytes in v were zero, since the high bits set due to a value greater
1961 than 0x80 in the first sub-expression are masked off by the second.
1962
1963 Paul Messmer suggested the fast pretest improvement on October 2, 2004. Juha
1964 Järvi later suggested hasless(v, 1) on April 6, 2005, which he found on Paul
1965 Hsieh's Assembly Lab; previously it was written in a newsgroup post on April
1966 27, 1987 by Alan Mycroft.
1967
1968 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1969
1970 Determine if a word has a byte equal to n
1971
1972 We may want to know if any byte in a word has a specific value. To do so, we
1973 can XOR the value to test with a word that has been filled with the byte values
1974 in which we're interested. Because XORing a value with itself results in a zero
1975 byte and nonzero otherwise, we can pass the result to haszero.
1976
1977 #define hasvalue(x,n) \
1978 (haszero((x) ^ (~0UL/255 * (n))))
1979
1980 Stephen M Bennet suggested this on December 13, 2009 after reading the entry
1981 for haszero.
1982
1983 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1984
1985 Determine if a word has a byte less than n
1986
1987 Test if a word x contains an unsigned byte with value < n. Specifically for n=
1988 1, it can be used to find a 0-byte by examining one long at a time, or any byte
1989 by XORing x with a mask first. Uses 4 arithmetic/logical operations when n is
1990 constant.
1991
1992 Requirements: x>=0; 0<=n<=128
1993
1994 #define hasless(x,n) (((x)-~0UL/255*(n))&~(x)&~0UL/255*128)
1995
1996 To count the number of bytes in x that are less than n in 7 operations, use
1997
1998 #define countless(x,n) \
1999 (((~0UL/255*(127+(n))-((x)&~0UL/255*127))&~(x)&~0UL/255*128)/128%255)
2000
2001 Juha Järvi sent this clever technique to me on April 6, 2005. The countless
2002 macro was added by Sean Anderson on April 10, 2005, inspired by Juha's
2003 countmore, below.
2004
2005 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
2006
2007 Determine if a word has a byte greater than n
2008
2009 Test if a word x contains an unsigned byte with value > n. Uses 3 arithmetic/
2010 logical operations when n is constant.
2011
2012 Requirements: x>=0; 0<=n<=127
2013
2014 #define hasmore(x,n) (((x)+~0UL/255*(127-(n))|(x))&~0UL/255*128)
2015
2016 To count the number of bytes in x that are more than n in 6 operations, use:
2017
2018 #define countmore(x,n) \
2019 (((((x)&~0UL/255*127)+~0UL/255*(127-(n))|(x))&~0UL/255*128)/128%255)
2020
2021 The macro hasmore was suggested by Juha Järvi on April 6, 2005, and he added
2022 countmore on April 8, 2005.
2023
2024 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
2025
2026 Determine if a word has a byte between m and n
2027
2028 When m < n, this technique tests if a word x contains an unsigned byte value,
2029 such that m < value < n. It uses 7 arithmetic/logical operations when n and m
2030 are constant.
2031
2032 Note: Bytes that equal n can be reported by likelyhasbetween as false
2033 positives, so this should be checked by character if a certain result is
2034 needed.
2035
2036 Requirements: x>=0; 0<=m<=127; 0&lt=n<=128
2037
2038 #define likelyhasbetween(x,m,n) \
2039 ((((x)-~0UL/255*(n))&~(x)&((x)&~0UL/255*127)+~0UL/255*(127-(m)))&~0UL/255*128)
2040
2041 This technique would be suitable for a fast pretest. A variation that takes one
2042 more operation (8 total for constant m and n) but provides the exact answer is:
2043
2044 #define hasbetween(x,m,n) \
2045 ((~0UL/255*(127+(n))-((x)&~0UL/255*127)&~(x)&((x)&~0UL/255*127)+~0UL/255*(127-(m)))&~0UL/255*128)
2046
2047 To count the number of bytes in x that are between m and n (exclusive) in 10
2048 operations, use:
2049
2050 #define countbetween(x,m,n) (hasbetween(x,m,n)/128%255)
2051
2052 Juha Järvi suggested likelyhasbetween on April 6, 2005. From there, Sean
2053 Anderson created hasbetween and countbetween on April 10, 2005.
2054
2055 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
2056
2057 Compute the lexicographically next bit permutation
2058
2059 Suppose we have a pattern of N bits set to 1 in an integer and we want the next
2060 permutation of N 1 bits in a lexicographical sense. For example, if N is 3 and
2061 the bit pattern is 00010011, the next patterns would be 00010101, 00010110,
2062 00011001,00011010, 00011100, 00100011, and so forth. The following is a fast
2063 way to compute the next permutation.
2064
2065 unsigned int v; // current permutation of bits
2066 unsigned int w; // next permutation of bits
2067
2068 unsigned int t = v | (v - 1); // t gets v's least significant 0 bits set to 1
2069 // Next set to 1 the most significant bit to change,
2070 // set to 0 the least significant ones, and add the necessary 1 bits.
2071 w = (t + 1) | (((~t & -~t) - 1) >> (__builtin_ctz(v) + 1));
2072
2073 The __builtin_ctz(v) GNU C compiler intrinsic for x86 CPUs returns the number
2074 of trailing zeros. If you are using Microsoft compilers for x86, the intrinsic
2075 is _BitScanForward. These both emit a bsf instruction, but equivalents may be
2076 available for other architectures. If not, then consider using one of the
2077 methods for counting the consecutive zero bits mentioned earlier.
2078
2079 Here is another version that tends to be slower because of its division
2080 operator, but it does not require counting the trailing zeros.
2081
2082 unsigned int t = (v | (v - 1)) + 1;
2083 w = t | ((((t & -t) / (v & -v)) >> 1) - 1);
2084
2085 Thanks to Dario Sneidermanis of Argentina, who provided this on November 28,
2086 2009.
2087
2088 A Belorussian translation (provided by Webhostingrating) is available.