docs/release-notes/2016/major/performance.rst

Performance improvements
^^^^^^^^^^^^^^^^^^^^^^^^

GPU improvements
^^^^^^^^^^^^^^^^

In addition to the changes noted below, overall minor improvements contribute
up to a 5% increase in CUDA performance, so depending on parameters and compilers
a 5-20% GPU kernel performance increase is expected.
These benefits are seen with CUDA 7.5 (which is now the version we recommend);
certain older versions (e.g. 7.0) see even larger improvements.

Even larger improvements in OpenCL performance on AMD devices are
expected, e.g. gains can exceed 50% with RF/plain cut-off and PME with potential shift
when using recent AMD OpenCL compilers.

Note that due to limitations of the NVIDIA OpenCL compiler, CUDA is still superior
in performance on NVIDIA GPUs. Hence, it is recommended to use CUDA-based GPU acceleration
on NVIDIA hardware.


Improved support for OpenCL devices
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
The OpenCL support is now fully compatible with all intra- and
inter-node parallelization modes, including MPI, thread-MPI, and GPU
sharing by PP ranks. (The previous limitations were caused by bugs in high-level
GROMACS code.)

Additionally, some prefetching in the short-ranged kernels (similar to
that in the CUDA code) that had been disabled was found to be useful
after all.

Added Lennard-Jones combination-rule kernels for GPUs
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
Implemented LJ combination-rule parameter lookup in the CUDA and
OpenCL kernels for both geometric and Lorentz-Berthelot combination
rules, and enabled it for plain LJ cut-off. This optimization was
already present in the CPU kernels. This improves performance with
e.g. OPLS, GROMOS and AMBER force fields by about 10-15% (but does not
help with CHARMM force fields because they use force-switched kernels).
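
As an illustration of what a combination-rule lookup computes, the
following Python sketch derives the pair parameters from per-atom
values instead of reading them from a table indexed by both atom types.
The function name, the rule labels and the (sigma, epsilon) parameter
layout are assumptions made for this example; they are not taken from
the GROMACS kernels.

```python
import math


def combine_lj(sigma_i, eps_i, sigma_j, eps_j, rule):
    """Return (sigma_ij, eps_ij) for one atom pair.

    rule is "geometric" or "lorentz-berthelot"; both strings are
    labels chosen for this sketch only.
    """
    # Both rules use the geometric mean for the well depth.
    eps_ij = math.sqrt(eps_i * eps_j)
    if rule == "geometric":
        sigma_ij = math.sqrt(sigma_i * sigma_j)
    elif rule == "lorentz-berthelot":
        # Lorentz-Berthelot uses the arithmetic mean for sigma.
        sigma_ij = 0.5 * (sigma_i + sigma_j)
    else:
        raise ValueError(f"unknown combination rule: {rule}")
    return sigma_ij, eps_ij
```

With such a rule, only per-atom parameters need to be loaded for each
pair, which is presumably where the gain over a type-pair table comes from.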

Added support for CUDA CC 6.0/6.1
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
Added build-system and kernel-generator support for the Pascal
architectures announced so far (GP100: 6.0, GP104: 6.1) and supported
by the CUDA 8.0 compiler.

By default we now generate binary code for both sm_60 and sm_61 and,
given the considerable differences between the two, PTX code for both
virtual architectures as well. For now we don't add CC 6.2 (GP102)
compilation support as we know nothing about it.

On the kernel-generation side, given the increased register file, the
"wider" 128 threads/block kernels are enabled for CC 6.0; on 6.1 and
later the 64 threads/block setting remains.
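
The default code generation described above corresponds to nvcc
"-gencode" options along the following lines. The helper below is a
hypothetical Python sketch that only assembles the option strings; the
real logic lives in the GROMACS build system and may differ in detail.

```python
def gencode_flags(compute_capabilities):
    """Build nvcc -gencode options for the given capabilities.

    compute_capabilities is a list of strings such as "60" or "61".
    For each one we request binary code for the real architecture
    (sm_XX) and PTX for the matching virtual architecture (compute_XX).
    """
    flags = []
    for cc in compute_capabilities:
        # Binary (SASS) code for the real architecture.
        flags.append(f"-gencode arch=compute_{cc},code=sm_{cc}")
    for cc in compute_capabilities:
        # PTX code for the virtual architecture.
        flags.append(f"-gencode arch=compute_{cc},code=compute_{cc}")
    return flags
```

Calling ``gencode_flags(["60", "61"])`` yields four options, two for
binary code and two for PTX.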

Improved GPU pair-list splitting to improve performance
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
Instead of splitting the GPU lists (to generate more work units) based
on a maximum cut-off, we now generate lists as close to the target
list size as possible. The heuristic estimate for the number of
cluster pairs is now too high by 0-1% instead of 10%. This results in
a few percent fewer pair lists, but still slightly more than
requested.

Improved CUDA GPU memory configuration
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
The new configuration makes use of the larger amount of L1 cache
available for global load caching on hardware that supports it (K40,
K80, Tegra K1, and CC 5.2) by passing the appropriate command-line
option ("-dlcm=ca").

:issue:`1804`

Automatic nstlist changes were tuned for Intel Knights Landing
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

CPU improvements
^^^^^^^^^^^^^^^^

These improvements to individual kernels will provide incremental
improvements to CPU performance for simulations where they are active,
but their value for simulations using GPU offload is much higher,
because via the auto-tuning, they permit all kinds of resource
utilization and throughput to increase.

Optimized the bonded thread force reduction
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
The multi-threaded code for bonded interactions has to combine the
per-thread forces afterwards. This reduction now uses fixed-size blocks
of 32 atoms. Instead of dividing the whole range of blocks
uniformly over the threads, only the blocks that are actually used are
divided (uniformly) over the threads. This speeds up the reduction by a
factor of the number of threads (!) for typical protein+water systems
when not using domain decomposition. With domain decomposition, the
speed-up is up to a factor of 3.
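
The idea of reducing only the used blocks can be sketched as follows.
This is a simplified Python illustration, not the GROMACS
implementation: the function names and the representation of each
thread's touched atoms as a list of atom indices are assumptions made
for the example.

```python
BLOCK_SIZE = 32  # atoms per reduction block, as stated in the text


def used_blocks(touched_atoms_per_thread):
    """Return the sorted indices of blocks that any thread wrote forces to."""
    return sorted({atom // BLOCK_SIZE
                   for atoms in touched_atoms_per_thread
                   for atom in atoms})


def assign_blocks(blocks, num_threads):
    """Split the used blocks into num_threads contiguous, near-equal chunks."""
    n = len(blocks)
    return [blocks[(t * n) // num_threads:((t + 1) * n) // num_threads]
            for t in range(num_threads)]
```

When most atoms carry no bonded interactions, the list of used blocks
is short compared to the full block range, so each thread has far
fewer blocks to reduce.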

Used SIMD transpose-scatter in bonded force reduction
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
The angle and dihedral SIMD functions now use the SIMD transpose
scatter functions for force reduction. This change gives a massive
performance improvement for bonded interactions, mainly because the
dihedral force update did a lot of vector operations without SIMD that
are now fully replaced by SIMD operations.

Added SIMD implementation of Lennard-Jones 1-4 interactions
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
This gives a several-fold speed improvement. The main improvement comes
from using a simplified analytical LJ expression instead of tables; SIMD helps a bit.
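
For reference, an analytical 12-6 Lennard-Jones evaluation needs only a
handful of multiplications per pair and no table lookup. The scalar
Python sketch below assumes the pair is parameterized by C6 and C12
coefficients, with any scaling of 1-4 interactions already folded into
them; the function name and signature are hypothetical.

```python
def lj_pair(c6, c12, r2):
    """Return (energy, force_over_r) for V(r) = c12/r**12 - c6/r**6.

    r2 is the squared pair distance. force_over_r is F/r, so
    multiplying it by the distance vector gives the force vector.
    """
    rinv2 = 1.0 / r2
    rinv6 = rinv2 * rinv2 * rinv2
    v_disp = c6 * rinv6           # dispersion term, c6 / r^6
    v_rep = c12 * rinv6 * rinv6   # repulsion term, c12 / r^12
    energy = v_rep - v_disp
    force_over_r = (12.0 * v_rep - 6.0 * v_disp) * rinv2
    return energy, force_over_r
```

Because the same arithmetic is applied to every pair, such an
expression maps naturally onto SIMD lanes.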

Added SIMD implementation of SETTLE
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
On Haswell CPUs, this makes SETTLE a factor of 5 faster.

Added SIMD support for routines that do periodic boundary coordinate transformations
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

Threading improvements
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

These improvements enhance the performance of code that runs over
multiple CPU threads.

Improved Verlet-scheme pair-list workload balancing
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

Implemented near-perfect load balancing for Verlet-scheme CPU
pair lists. This increases the search cost by 3%, but this is
outweighed by the more balanced non-bonded kernel times, particularly
for small systems.

Improved the threading of virtual-site code
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
With many threads, a significant fraction of the virtual sites would end
up in the separate serial task, thereby limiting scaling. Now two weakly
dependent tasks are generated for each thread, and one of them uses
a thread-local force buffer, parts of which are reduced by the different
threads that are responsible for those parts.

In addition, the setup now runs multi-threaded.

Added OpenMP support to more loops
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
Loops over the number of atoms cause a significant amount of serial time
with a large number of threads, which limits scaling, so more of these
loops now run with OpenMP.

Added OpenMP parallelization for the pull code
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
The pull code could take up to a third of the compute time for
OpenMP-parallel simulations with large pull groups.
Now all pull-code loops over atoms have an OpenMP-parallel version.

Other improvements
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Multi-simulations are coupled less frequently
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
For example, replica-exchange simulations communicate between simulations
only at exchange attempts. Plain multi-simulations do not communicate
between simulations. Overall performance will tend to improve whenever
one simulation can progress faster than the others (e.g. because it is
at a different pressure, or is using a quieter part of the network).