docs/release-notes/2018/major/performance.rst

Performance improvements
^^^^^^^^^^^^^^^^^^^^^^^^

Implemented support for PME long-ranged interactions on GPUs
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
A single GPU can now be used to accelerate the computation of the
long-ranged PME interactions. This gives substantial performance
improvements: in particular, a run with only 2-4 CPU cores per GPU is
now about as fast as a run with the 2016 version, which needed many more
CPU cores to balance the GPU. Performance on hardware that already had a
good balance of GPU and CPU also improves slightly, and hardware with
strong GPUs can now run simulations much more effectively.

Currently, the GPU used for PME must be either the GPU that is also used
for the short-ranged interactions, within the same single rank of the
simulation, or any GPU used by a PME-only rank. ``mdrun -pme gpu`` now
requires that PME runs on a GPU, if supported. All CUDA versions and
hardware generations supported by |Gromacs| can run this code path,
including CUDA 9.0 and Volta GPUs. However, not all combinations
of features are supported with PME on GPUs; notably, FEP calculations
are not yet available.

The user guide is updated to reflect the new capabilities, and more
documentation will be forthcoming.

Added more SIMD intrinsics support for PME spread and gather
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
This achieves a speedup of around 11% for PME spread/gather on Intel KNL
processors with typical simulation systems.

Added SIMD intrinsics version of simple update
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
In the simple case of leap-frog without pressure coupling and with at
most one temperature-coupling group, the update of velocities and
coordinates is now implemented with SIMD intrinsics for improved
simulation rate.

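As a rough illustration of why this simple case suits SIMD, the scalar
C++ sketch below performs a leap-frog step as one branch-free loop over a
flat coordinate array. It is not the |Gromacs| implementation: the
function name ``simpleLeapFrogUpdate``, the flat ``x, y, z`` array layout,
and the single velocity-scaling factor ``lambda`` (standing in for at most
one temperature-coupling group) are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar sketch of a "simple" leap-frog update:
// no pressure coupling, and one velocity-scaling factor (lambda)
// shared by all atoms. Names and data layout are illustrative only.
//
// x, v, f hold 3 floats per atom (x, y, z interleaved);
// invMass holds one inverse mass per atom.
void simpleLeapFrogUpdate(std::vector<float>&       x,
                          std::vector<float>&       v,
                          const std::vector<float>& f,
                          const std::vector<float>& invMass,
                          float                     dt,
                          float                     lambda)
{
    assert(x.size() == v.size() && x.size() == f.size());
    assert(x.size() == 3 * invMass.size());

    // Every element follows the same arithmetic with no branches,
    // so the loop body maps directly onto SIMD lanes.
    for (std::size_t i = 0; i < x.size(); ++i)
    {
        v[i] = lambda * v[i] + f[i] * invMass[i / 3] * dt; // velocity update
        x[i] += v[i] * dt;                                 // position update
    }
}
```

With ``lambda`` set to 1 the loop reduces to an uncoupled leap-frog step;
a SIMD version would process several array elements per iteration.
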
Added SIMD intrinsics version of Urey-Bradley angle kernel
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
For steps where energies and shift forces are not required, this kernel
speeds up the Urey-Bradley angle computation, which can otherwise be
rate limiting in GPU-accelerated runs, particularly with CHARMM force
fields.

Use OpenMP up to 16 threads with AMD Ryzen when automating run setup
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
AMD Ryzen appears to always perform slightly better with OpenMP
than with MPI, up to and including the use of all 16 threads on the
8-core die.

128-bit AVX2 SIMD for AMD Ryzen
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
While Ryzen supports 256-bit AVX2, the internal units are organized
to execute either a single 256-bit instruction or two 128-bit SIMD
instructions per cycle. Since most of our kernels are slightly
less efficient for wider SIMD, using 128-bit AVX2 improves performance
by roughly 10%.

Choose faster nbnxn SIMD kernels on AMD Zen
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
On AMD Zen, tabulated Ewald kernels are always faster than analytical
ones. With AVX2_256, the 2xNN kernels are also faster than the 4xN
kernels. These faster choices are now made based on CpuInfo at run time.

Refs :issue:`2328`

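The selection rule stated above can be sketched as a small C++ function.
This is only an illustration under stated assumptions: the types
``EwaldKernel``, ``ClusterSetup``, ``SimdLevel`` and ``KernelChoice`` and the
function ``chooseNbnxnKernel`` are hypothetical names, not the |Gromacs|
API, and the choice used on other CPUs is passed in by the caller because
it is not described here.

```cpp
// Hypothetical types for illustration; not GROMACS identifiers.
enum class EwaldKernel { Analytical, Tabulated };
enum class ClusterSetup { Simd4xN, Simd2xNN };
enum class SimdLevel { Avx2_128, Avx2_256 };

struct KernelChoice
{
    EwaldKernel  ewald;
    ClusterSetup setup;
};

// Applies the two Zen-specific rules from the text on top of a
// caller-supplied default choice (assumed to come from elsewhere).
KernelChoice chooseNbnxnKernel(bool         isAmdZen,
                               SimdLevel    simd,
                               KernelChoice defaultChoice)
{
    KernelChoice choice = defaultChoice;
    if (isAmdZen)
    {
        // On Zen, tabulated Ewald kernels are always faster.
        choice.ewald = EwaldKernel::Tabulated;
        // With AVX2_256, 2xNN is faster than 4xN.
        if (simd == SimdLevel::Avx2_256)
        {
            choice.setup = ClusterSetup::Simd2xNN;
        }
    }
    return choice;
}
```

In this sketch the decision depends only on information available at run
time, so one binary can pick different kernels on different hosts.
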
Enabled group-scheme SIMD with GMX_SIMD=AVX2_128
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
The group-scheme kernels can use AVX instructions from either the
AVX_128_FMA or the AVX_256 extensions. Hardware that supports the new
AVX2_128 extensions also supports AVX_256, so we enable such support
for the group-scheme kernels.

Detect AVX-512 FMA units to choose best SIMD
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
Recent Intel x86 hardware can have multiple AVX-512 FMA units. Because
of the number of those units, and the way their use interacts with how
the CPU chooses its clock speed, it can be advantageous to avoid
using AVX-512 SIMD support in |Gromacs| if there is only one such
unit. Because there is no way to query the hardware to count the
number of such units, we run code at CMake and mdrun time to compare
the performance from using such units, and recommend the version that
is best. As a result, a |Gromacs| build made on the front-end node
of a cluster might not suit the compute nodes, even when they are
all from the same generation of Intel's hardware.

Speed up nbnxn buffer clearing
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

Tweaked conditional in the nonbonded GPU kernels
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
GPU compilers miss an easy optimization of a loop invariant in the
inner-loop conditional. Precomputing part of the conditional together
with using bitwise instead of logical and/or improves performance with
most compilers by up to 5%.

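The pattern can be illustrated with a host-side C++ sketch. It is not the
GPU kernel itself: the functions ``countPairsLogical`` and
``countPairsPrecomputed`` and their parameters are hypothetical, and the
condition is only a stand-in for a test whose first part does not change
inside the loop.

```cpp
#include <vector>

// Baseline: the loop-invariant part is re-evaluated on every iteration,
// and short-circuiting logical operators introduce extra branches.
int countPairsLogical(int ci, const std::vector<int>& cjList,
                      bool skipSelfPairs, int laneI, int laneJ)
{
    int count = 0;
    for (int cj : cjList)
    {
        if (!(skipSelfPairs && laneJ <= laneI) || ci != cj)
        {
            ++count;
        }
    }
    return count;
}

// Tweaked: the invariant part is hoisted out of the loop, and the
// remaining test uses bitwise OR, which has no short-circuit branch.
int countPairsPrecomputed(int ci, const std::vector<int>& cjList,
                          bool skipSelfPairs, int laneI, int laneJ)
{
    const bool invariantPart = !(skipSelfPairs & (laneJ <= laneI));
    int        count         = 0;
    for (int cj : cjList)
    {
        if (invariantPart | (ci != cj))
        {
            ++count;
        }
    }
    return count;
}
```

Both functions return the same result for any input; only the form of the
conditional differs. Any speedup depends on the compiler and hardware.
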