doc/src/accelerate_omp.txt

   1 "Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws -
   2 "LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c
   3
   4 :link(lws,http://lammps.sandia.gov)
   5 :link(ld,Manual.html)
   6 :link(lc,Section_commands.html#comm)
   7
   8 :line
   9
  10 "Return to Section 5 overview"_Section_accelerate.html
  11
  12 5.3.4 USER-OMP package :h5
  13
  14 The USER-OMP package was developed by Axel Kohlmeyer at Temple
  15 University.  It provides multi-threaded versions of most pair styles,
  16 nearly all bonded styles (bond, angle, dihedral, improper), several
  17 Kspace styles, and a few fix styles.  The package currently uses the
  18 OpenMP interface for multi-threading.
  19
  20 Here is a quick overview of how to use the USER-OMP package, assuming
  21 one or more 16-core nodes.  More details follow.
  22
  23 use -fopenmp with CCFLAGS and LINKFLAGS in Makefile.machine
  24 make yes-user-omp
  25 make mpi                                   # build with USER-OMP package, if settings added to Makefile.mpi
  26 make omp                                   # or Makefile.omp already has settings
  27 Make.py -v -p omp -o mpi -a file mpi       # or one-line build via Make.py :pre
  28
  29 lmp_mpi -sf omp -pk omp 16 < in.script                         # 1 MPI task, 16 threads
  30 mpirun -np 4 lmp_mpi -sf omp -pk omp 4 -in in.script           # 4 MPI tasks, 4 threads/task
  31 mpirun -np 32 -ppn 4 lmp_mpi -sf omp -pk omp 4 -in in.script   # 8 nodes, 4 MPI tasks/node, 4 threads/task :pre
  32
  33 [Required hardware/software:]
  34
  35 Your compiler must support the OpenMP interface.  You should have one
  36 or more multi-core CPUs so that multiple threads can be launched by
  37 each MPI task running on a CPU.
  38
  39 [Building LAMMPS with the USER-OMP package:]
  40
   41 The lines above illustrate how to include/build with the USER-OMP
   42 package in two steps, using the "make" command, or in one step via
   43 the src/Make.py script, described in "Section
   44 2.4"_Section_start.html#start_4 of the manual.  Type "Make.py -h" for
   45 help.
  46
  47 Note that the CCFLAGS and LINKFLAGS settings in Makefile.machine must
  48 include "-fopenmp".  Likewise, if you use an Intel compiler, the
  49 CCFLAGS setting must include "-restrict".  The Make.py command will
  50 add these automatically.
  51
  52 [Run with the USER-OMP package from the command line:]
  53
  54 The mpirun or mpiexec command sets the total number of MPI tasks used
  55 by LAMMPS (one or multiple per compute node) and the number of MPI
  56 tasks used per node.  E.g. the mpirun command in MPICH does this via
  57 its -np and -ppn switches.  Ditto for OpenMPI via -np and -npernode.
  58
  59 You need to choose how many OpenMP threads per MPI task will be used
   60 by the USER-OMP package.  Note that the product of MPI tasks/node *
   61 threads/task should not exceed the number of physical cores on a
   62 node, otherwise performance will suffer.
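
The core-count rule above can be checked with a little shell arithmetic before launching a run.  The sketch below is illustrative only: the helper name fits_node is hypothetical and not part of LAMMPS, and the 16-core node with 4 MPI tasks/node and 4 threads/task is simply the case used in the quick overview.

```shell
#!/bin/sh
# Hypothetical helper (not part of LAMMPS): succeeds if
# tasks/node * threads/task does not exceed the cores on one node.
fits_node() {
    cores_per_node=$1
    tasks_per_node=$2
    threads_per_task=$3
    [ $(( tasks_per_node * threads_per_task )) -le "$cores_per_node" ]
}

# Assumed values, matching the 16-core nodes in the overview.
if fits_node 16 4 4; then
    echo "ok: 4 tasks/node * 4 threads/task fits on 16 cores"
fi
if ! fits_node 16 4 8; then
    echo "oversubscribed: 4 tasks/node * 8 threads/task exceeds 16 cores"
fi
```

The same check applies per node regardless of how many nodes are used, since the limit is on cores within one node.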
  63
  64 As in the lines above, use the "-sf omp" "command-line
  65 switch"_Section_start.html#start_7, which will automatically append
  66 "omp" to styles that support it.  The "-sf omp" switch also issues a
   67 default "package omp 0"_package.html command, which takes the
   68 number of threads per MPI task from the OMP_NUM_THREADS environment
   69 variable.
  70
  71 You can also use the "-pk omp Nt" "command-line
  72 switch"_Section_start.html#start_7, to explicitly set Nt = # of OpenMP
  73 threads per MPI task to use, as well as additional options.  Its
  74 syntax is the same as the "package omp"_package.html command whose doc
  75 page gives details, including the default values used if it is not
  76 specified.  It also gives more details on how to set the number of
  77 threads via the OMP_NUM_THREADS environment variable.
  78
  79 [Or run with the USER-OMP package by editing an input script:]
  80
  81 The discussion above for the mpirun/mpiexec command, MPI tasks/node,
  82 and threads/MPI task is the same.
  83
  84 Use the "suffix omp"_suffix.html command, or you can explicitly add an
  85 "omp" suffix to individual styles in your input script, e.g.
  86
  87 pair_style lj/cut/omp 2.5 :pre
  88
  89 You must also use the "package omp"_package.html command to enable the
  90 USER-OMP package.  When you do this you also specify how many threads
  91 per MPI task to use.  The command doc page explains other options and
  92 how to set the number of threads via the OMP_NUM_THREADS environment
  93 variable.
  94
  95 [Speed-ups to expect:]
  96
  97 Depending on which styles are accelerated, you should look for a
  98 reduction in the "Pair time", "Bond time", "KSpace time", and "Loop
  99 time" values printed at the end of a run.
 100
 101 You may see a small performance advantage (5 to 20%) when running a
 102 USER-OMP style (in serial or parallel) with a single thread per MPI
 103 task, versus running standard LAMMPS with its standard un-accelerated
 104 styles (in serial or all-MPI parallelization with 1 task/core).  This
 105 is because many of the USER-OMP styles contain similar optimizations
 106 to those used in the OPT package, described in "Section
 107 5.3.5"_accelerate_opt.html.
 108
 109 With multiple threads/task, the optimal choice of number of MPI
 110 tasks/node and OpenMP threads/task can vary a lot and should always be
 111 tested via benchmark runs for a specific simulation running on a
 112 specific machine, paying attention to guidelines discussed in the next
 113 sub-section.
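
One possible way to set up such benchmark runs is to enumerate every split of a node's cores into MPI tasks and OpenMP threads, then time each candidate.  The sketch below only prints candidate command lines and does not launch anything.  The 16-core single-node setup and the helper name list_splits are assumptions for illustration; lmp_mpi and in.script are the placeholder names used in the quick overview.

```shell
#!/bin/sh
# Hypothetical helper: print one launch line for every way to split
# the given number of cores on a single node into
# MPI tasks * OpenMP threads/task.
list_splits() {
    cores=$1
    tasks=1
    while [ "$tasks" -le "$cores" ]; do
        # Only keep task counts that divide the core count evenly.
        if [ $(( cores % tasks )) -eq 0 ]; then
            threads=$(( cores / tasks ))
            echo "mpirun -np $tasks lmp_mpi -sf omp -pk omp $threads -in in.script"
        fi
        tasks=$(( tasks + 1 ))
    done
}

list_splits 16   # assumed 16-core node
```

Each printed line can then be run on the target machine and the "Loop time" values compared.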
 114
 115 A description of the multi-threading strategy used in the USER-OMP
 116 package and some performance examples are "presented
 117 here"_http://sites.google.com/site/akohlmey/software/lammps-icms/lammps-icms-tms2011-talk.pdf?attredirects=0&d=1
 118
 119 [Guidelines for best performance:]
 120
 121 For many problems on current generation CPUs, running the USER-OMP
 122 package with a single thread/task is faster than running with multiple
 123 threads/task.  This is because the MPI parallelization in LAMMPS is
 124 often more efficient than multi-threading as implemented in the
 125 USER-OMP package.  The parallel efficiency (in a threaded sense) also
 126 varies for different USER-OMP styles.
 127
 128 Using multiple threads/task can be more effective under the following
 129 circumstances:
 130
 131 Individual compute nodes have a significant number of CPU cores but
 132 the CPU itself has limited memory bandwidth, e.g. for Intel Xeon 53xx
 133 (Clovertown) and 54xx (Harpertown) quad-core processors.  Running one
 134 MPI task per CPU core will result in significant performance
 135 degradation, so that running with 4 or even only 2 MPI tasks per node
 136 is faster.  Running in hybrid MPI+OpenMP mode will reduce the
 137 inter-node communication bandwidth contention in the same way, but
 138 offers an additional speedup by utilizing the otherwise idle CPU
 139 cores. :ulb,l
 140
 141 The interconnect used for MPI communication does not provide
 142 sufficient bandwidth for a large number of MPI tasks per node.  For
 143 example, this applies to running over gigabit ethernet or on Cray XT4
 144 or XT5 series supercomputers.  As in the aforementioned case, this
 145 effect worsens when using an increasing number of nodes. :l
 146
 147 The system has a spatially inhomogeneous particle density which does
 148 not map well to the "domain decomposition scheme"_processors.html or
 149 "load-balancing"_balance.html options that LAMMPS provides.  This is
  150 because multi-threading achieves parallelism over the number of
 151 particles, not via their distribution in space. :l
 152
 153 A machine is being used in "capability mode", i.e. near the point
 154 where MPI parallelism is maxed out.  For example, this can happen when
 155 using the "PPPM solver"_kspace_style.html for long-range
 156 electrostatics on large numbers of nodes.  The scaling of the KSpace
 157 calculation (see the "kspace_style"_kspace_style.html command) becomes
  158 the performance-limiting factor.  Using multi-threading allows fewer
  159 MPI tasks to be invoked and can speed up the long-range solver, while
  160 increasing overall performance by parallelizing the pairwise and
  161 bonded calculations via OpenMP.  Likewise, additional speedup can
  162 sometimes be achieved by increasing the length of the Coulombic cutoff
  163 and thus reducing the work done by the long-range solver.  Using the
 164 "run_style verlet/split"_run_style.html command, which is compatible
 165 with the USER-OMP package, is an alternative way to reduce the number
 166 of MPI tasks assigned to the KSpace calculation. :l
 167 :ule
 168
 169 Additional performance tips are as follows:
 170
 171 The best parallel efficiency from {omp} styles is typically achieved
 172 when there is at least one MPI task per physical CPU chip, i.e. socket
 173 or die. :ulb,l
 174
 175 It is usually most efficient to restrict threading to a single
  176 socket, i.e. use one or more MPI tasks per socket. :l
 177
 178 NOTE: By default, several current MPI implementations use a processor
 179 affinity setting that restricts each MPI task to a single CPU core.
 180 Using multi-threading in this mode will force all threads to share the
 181 one core and thus is likely to be counterproductive.  Instead, binding
  182 MPI tasks to a (multi-core) socket should solve this issue. :l
 183 :ule
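
As a rough illustration of the socket-binding advice in the NOTE above, the sketch below prints a launch line that maps and binds each MPI task to a socket.  The helper name bind_to_socket_cmd is hypothetical, and the binding options shown are assumptions about recent Open MPI and MPICH (Hydra) releases; option names differ between MPI implementations and versions, so check the mpirun or mpiexec help output of your installation before relying on them.

```shell
#!/bin/sh
# Hypothetical helper: print a launch line that places one MPI task per
# socket and binds it there, so its threads can use all cores of the socket.
# The binding options are assumptions about recent Open MPI and MPICH
# (Hydra) releases; verify them against your own MPI installation.
bind_to_socket_cmd() {
    flavor=$1; tasks=$2; threads=$3
    case "$flavor" in
        openmpi) bind="--map-by socket --bind-to socket" ;;
        mpich)   bind="-map-by socket -bind-to socket" ;;
        *)       echo "unknown MPI flavor: $flavor" >&2; return 1 ;;
    esac
    echo "mpirun -np $tasks $bind lmp_mpi -sf omp -pk omp $threads -in in.script"
}

# Assumed example: one node with two 8-core sockets.
bind_to_socket_cmd openmpi 2 8
```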
 184
 185 [Restrictions:]
 186
 187 None.