doc/src/accelerate_intel.txt

   1 "Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws -
   2 "LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c
   3
   4 :link(lws,http://lammps.sandia.gov)
   5 :link(ld,Manual.html)
   6 :link(lc,Section_commands.html#comm)
   7
   8 :line
   9
  10 "Return to Section accelerate overview"_Section_accelerate.html
  11
  12 5.3.2 USER-INTEL package :h5
  13
  14 The USER-INTEL package is maintained by Mike Brown at Intel
  15 Corporation.  It provides two methods for accelerating simulations,
  16 depending on the hardware you have.  The first is acceleration on
  17 Intel CPUs by running in single, mixed, or double precision with
  18 vectorization.  The second is acceleration on Intel Xeon Phi
  19 coprocessors via offloading neighbor list and non-bonded force
  20 calculations to the Phi.  The same C++ code is used in both cases.
  21 When offloading to a coprocessor from a CPU, the same routine is run
  22 twice, once on the CPU and once with an offload flag. This allows
  23 LAMMPS to run on the CPU cores and coprocessor cores simultaneously.
  24
  25 [Currently Available USER-INTEL Styles:]
  26
  27 Angle Styles: charmm, harmonic :ulb,l
  28 Bond Styles: fene, harmonic :l
  29 Dihedral Styles: charmm, harmonic, opls :l
  30 Fixes: nve, npt, nvt, nvt/sllod :l
  31 Improper Styles: cvff, harmonic :l
  32 Pair Styles: buck/coul/cut, buck/coul/long, buck, eam, gayberne,
  33 charmm/coul/long, lj/cut, lj/cut/coul/long, sw, tersoff :l
  34 K-Space Styles: pppm :l
  35 :ule
  36
  37 [Speed-ups to expect:]
  38
  39 The speedups will depend on your simulation, the hardware, which
  40 styles are used, the number of atoms, and the floating-point
  41 precision mode. Performance improvements are shown compared to
  42 LAMMPS {without using other acceleration packages} as these are
  43 under active development (and subject to performance changes). The
  44 measurements were performed using the input files available in
  45 the src/USER-INTEL/TEST directory. These are scalable in size; the
  46 results given are with 512K particles (524K for Liquid Crystal).
  47 Most of the simulations are standard LAMMPS benchmarks (indicated
  48 by the filename extension in parentheses) with modifications to the
  49 run length and to add a warmup run (for use with offload
  50 benchmarks).
  51
  52 :c,image(JPG/user_intel.png)
  53
  54 Results are speedups obtained on Intel Xeon E5-2697v4 processors
  55 (code-named Broadwell) and Intel Xeon Phi 7250 processors
  56 (code-named Knights Landing) with "18 Jun 2016" LAMMPS built with
  57 Intel Parallel Studio 2016 update 3. Results are with 1 MPI task
  58 per physical core. See {src/USER-INTEL/TEST/README} for the raw
  59 simulation rates and instructions to reproduce.
  60
  61 :line
  62
  63 [Quick Start for Experienced Users:]
  64
  65 LAMMPS should be built with the USER-INTEL package installed.
  66 Simulations should be run with 1 MPI task per physical {core},
  67 not {hardware thread}.
  68
  69 For Intel Xeon CPUs:
  70
  71 Edit src/MAKE/OPTIONS/Makefile.intel_cpu_intelmpi as necessary. :ulb,l
  72 If using {kspace_style pppm} in the input script, add "neigh_modify binsize 3" and "kspace_modify diff ad" to the input script for better
  73 performance. :l
  74 "-pk intel 0 omp 2 -sf intel" added to LAMMPS command-line :l
  75 :ule
  76
  77 For Intel Xeon Phi CPUs for simulations without {kspace_style
  78 pppm} in the input script:
  79
  80 Edit src/MAKE/OPTIONS/Makefile.knl as necessary. :ulb,l
  81 Runs should be performed using MCDRAM. :l
  82 "-pk intel 0 omp 2 -sf intel" {or} "-pk intel 0 omp 4 -sf intel"
  83 should be added to the LAMMPS command-line. Choice for best
  84 performance will depend on the simulation. :l
  85 :ule
  86
  87 For Intel Xeon Phi CPUs for simulations with {kspace_style
  88 pppm} in the input script:
  89
  90 Edit src/MAKE/OPTIONS/Makefile.knl as necessary. :ulb,l
  91 Runs should be performed using MCDRAM. :l
  92 Add "neigh_modify binsize 3" to the input script for better
  93 performance. :l
  94 Add "kspace_modify diff ad" to the input script for better
  95 performance. :l
  96 export KMP_AFFINITY=none :l
  97 "-pk intel 0 omp 3 lrt yes -sf intel" or "-pk intel 0 omp 1 lrt yes
  98 -sf intel" added to LAMMPS command-line. Choice for best performance
  99 will depend on the simulation. :l
 100 :ule
 101
 102 For Intel Xeon Phi coprocessors (Offload):
 103
 104 Edit src/MAKE/OPTIONS/Makefile.intel_coprocessor as necessary. :ulb,l
 105 "-pk intel N omp 1" added to command-line where N is the number of
 106 coprocessors per node. :l
 107 :ule
 108
 109 :line
 110
 111 [Required hardware/software:]
 112
 113 In order to use offload to coprocessors, an Intel Xeon Phi
 114 coprocessor and an Intel compiler are required. For this, the
 115 recommended version of the Intel compiler is 14.0.1.106 or
 116 versions 15.0.2.044 and higher.
 117
 118 Although any compiler can be used with the USER-INTEL package,
 119 currently, vectorization directives are disabled by default when
 120 not using Intel compilers due to lack of standard support and
 121 observations of decreased performance. The OpenMP standard now
 122 supports directives for vectorization and we plan to transition the
 123 code to this standard once it is available in most compilers. We
 124 expect this to allow improved performance and support with other
 125 compilers.
 126
 127 For Intel Xeon Phi x200 series processors (code-named Knights
 128 Landing), there are multiple configuration options for the hardware.
 129 For best performance, we recommend that the MCDRAM is configured in
 130 "Flat" mode and with the cluster mode set to "Quadrant" or "SNC4".
 131 "Cache" mode can also be used, although the performance might be
 132 slightly lower.
 133
 134 [Notes about Simultaneous Multithreading:]
 135
 136 Modern CPUs often support Simultaneous Multithreading (SMT). On
 137 Intel processors, this is called Hyper-Threading (HT) technology.
 138 SMT is hardware support for running multiple threads efficiently on
 139 a single core. {Hardware threads} or {logical cores} are often used
 140 to refer to the number of threads that are supported in hardware.
 141 For example, a node with two Intel Xeon E5-2697v4 processors is described
 142 as having 36 cores and 72 threads. This means that 36 MPI processes
 143 or OpenMP threads can run simultaneously on separate cores, but that
 144 up to 72 MPI processes or OpenMP threads can be running on the CPU
 145 without costly operating system context switches.
 146
 147 Molecular dynamics simulations will often run faster when making use
 148 of SMT. If a thread becomes stalled, for example because it is
 149 waiting on data that has not yet arrived from memory, another thread
 150 can start running so that the CPU pipeline is still being used
 151 efficiently. Although benefits can be seen by launching an MPI task
 152 for every hardware thread, for multinode simulations, we recommend
 153 that OpenMP threads are used for SMT instead, either with the
 154 USER-INTEL package, "USER-OMP package"_accelerate_omp.html, or
 155 "KOKKOS package"_accelerate_kokkos.html. In the example above, up
 156 to 36X speedups can be observed by using all 36 physical cores with
 157 LAMMPS. By using all 72 hardware threads, an additional 10-30%
 158 performance gain can be achieved.
 159
 160 The BIOS on many platforms allows SMT to be disabled; however, we do
 161 not recommend this on modern processors as there is little to no
 162 benefit for any software package in most cases. The operating system
 163 will report every hardware thread as a separate core allowing one to
 164 determine the number of hardware threads available. On Linux systems,
 165 this information can normally be obtained with:
 166
 167 cat /proc/cpuinfo :pre
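
The raw listing can be long, so a small helper can summarize it. The following is a minimal sketch, assuming the x86 Linux layout of /proc/cpuinfo with "processor", "physical id", and "core id" fields; the function name count_cpus is hypothetical and not part of LAMMPS.

```shell
# Sketch: summarize hardware threads vs. physical cores from
# /proc/cpuinfo-style text (x86 Linux field names assumed).
# count_cpus is a hypothetical helper name, not part of LAMMPS.
count_cpus() {
  awk -F: '
    /^processor/   { threads++ }                 # one entry per hardware thread
    /^physical id/ { pkg = $2 }                  # socket of the current entry
    /^core id/     { cores[pkg "," $2] = 1 }     # unique (socket, core) pairs
    END {
      n = 0
      for (c in cores) n++
      printf "hardware_threads=%d physical_cores=%d\n", threads, n
    }' "${1:-/proc/cpuinfo}"
}
```

Calling count_cpus with no argument reads /proc/cpuinfo; a file name can be passed instead to inspect a saved copy. The number of physical cores is the value to use for the MPI task count recommended in this section.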
 168
 169 [Building LAMMPS with the USER-INTEL package:]
 170
 171 The USER-INTEL package must be installed into the source directory:
 172
 173 make yes-user-intel :pre
 174
 175 Several example Makefiles for building with the Intel compiler are
 176 included with LAMMPS in the src/MAKE/OPTIONS/ directory:
 177
 178 Makefile.intel_cpu_intelmpi # Intel Compiler, Intel MPI, No Offload
 179 Makefile.knl                # Intel Compiler, Intel MPI, No Offload
 180 Makefile.intel_cpu_mpich    # Intel Compiler, MPICH, No Offload
 181 Makefile.intel_cpu_openmpi  # Intel Compiler, OpenMPI, No Offload
 182 Makefile.intel_coprocessor  # Intel Compiler, Intel MPI, Offload :pre
 183
 184 Makefile.knl is identical to Makefile.intel_cpu_intelmpi except that
 185 it explicitly specifies that vectorization should be for Intel
 186 Xeon Phi x200 processors making it easier to cross-compile. For
 187 users with recent installations of Intel Parallel Studio, the
 188 process can be as simple as:
 189
 190 make yes-user-intel
 191 source /opt/intel/parallel_studio_xe_2016.3.067/psxevars.sh
 192 # or psxevars.csh for C-shell
 193 make intel_cpu_intelmpi :pre
 194
 195 Alternatively, the build can be accomplished with the src/Make.py
 196 script, described in "Section 2.4"_Section_start.html#start_4 of the
 197 manual. Type "Make.py -h" for help. For example:
 198
 199 Make.py -v -p intel omp -intel cpu -a file intel_cpu_intelmpi :pre
 200
 201 Note that if you build with support for a Phi coprocessor, the same
 202 binary can be used on nodes with or without coprocessors installed.
 203 However, if you do not have coprocessors on your system, building
 204 without offload support will produce a smaller binary.
 205
 206 The general requirements for Makefiles with the USER-INTEL package
 207 are as follows. "-DLAMMPS_MEMALIGN=64" is required for CCFLAGS. When
 208 using Intel compilers, "-restrict" is required and "-qopenmp" is
 209 highly recommended for CCFLAGS and LINKFLAGS. LIB should include
 210 "-ltbbmalloc". For builds supporting offload, "-DLMP_INTEL_OFFLOAD"
 211 is required for CCFLAGS and "-qoffload" is required for LINKFLAGS.
 212 Other recommended CCFLAGS options for best performance are
 213 "-O2 -fno-alias -ansi-alias -qoverride-limits -fp-model fast=2
 214 -no-prec-div". The Make.py command will add all of these
 215 automatically.
 216
 217 NOTE: The vectorization and math capabilities can differ depending on
 218 the CPU. For Intel compilers, the "-x" flag specifies the type of
 219 processor for which to optimize. "-xHost" specifies that the compiler
 220 should build for the processor used for compiling. For Intel Xeon Phi
 221 x200 series processors, this option is "-xMIC-AVX512". For fourth
 222 generation Intel Xeon (v4/Broadwell) processors, "-xCORE-AVX2" should
 223 be used. For older Intel Xeon processors, "-xAVX" will perform best
 224 in general for the different simulations in LAMMPS. The default
 225 in most of the example Makefiles is to use "-xHost"; however, this
 226 should not be used when cross-compiling.
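
The mapping in the note above can be written down as a small lookup. This is only a sketch: the function name intel_arch_flag and the target labels (knl, broadwell, older-xeon, host) are hypothetical, and only the flag values come from the text.

```shell
# Sketch: pick the Intel compiler "-x" flag described above for a target CPU.
# The function name and the target labels are hypothetical.
intel_arch_flag() {
  case "$1" in
    knl)        echo "-xMIC-AVX512" ;;  # Intel Xeon Phi x200 (Knights Landing)
    broadwell)  echo "-xCORE-AVX2"  ;;  # Intel Xeon v4 (Broadwell)
    older-xeon) echo "-xAVX"        ;;  # older Intel Xeon processors
    host)       echo "-xHost"       ;;  # only when build and run CPUs match
    *)          echo "unknown target: $1" >&2; return 1 ;;
  esac
}
```

When cross-compiling, for example building on a Xeon login node for Xeon Phi x200 compute nodes, the explicit target should be chosen rather than "host".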
 227
 228 [Running LAMMPS with the USER-INTEL package:]
 229
 230 Running LAMMPS with the USER-INTEL package is similar to normal use
 231 with the exceptions that one should 1) specify that LAMMPS should use
 232 the USER-INTEL package, 2) specify the number of OpenMP threads, and
 233 3) optionally specify the specific LAMMPS styles that should use the
 234 USER-INTEL package. 1) and 2) can be performed from the command-line
 235 or by editing the input script. 3) requires editing the input script.
 236 Advanced performance tuning options are also described below to get
 237 the best performance.
 238
 239 When running on a single node (including runs using offload to a
 240 coprocessor), best performance is normally obtained by using 1 MPI
 241 task per physical core and additional OpenMP threads with SMT. For
 242 Intel Xeon processors, 2 OpenMP threads should be used for SMT.
 243 For Intel Xeon Phi CPUs, 2 or 4 OpenMP threads should be used
 244 (best choice depends on the simulation). In cases where the user
 245 specifies that LRT mode is used (described below), 1 or 3 OpenMP
 246 threads should be used. For multi-node runs, using 1 MPI task per
 247 physical core will often perform best; however, depending on the
 248 machine and scale, users might get better performance by decreasing
 249 the number of MPI tasks and using more OpenMP threads. For
 250 performance, the product of the number of MPI tasks and OpenMP
 251 threads should not exceed the number of available hardware threads in
 252 almost all cases.
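
The rule that MPI tasks times OpenMP threads should not exceed the available hardware threads can be checked before launching a job. The helper below is a sketch under that single assumption; check_layout is a hypothetical name and it only prints a message, it does not run LAMMPS.

```shell
# Sketch: check that MPI tasks x OpenMP threads fits the hardware threads.
# Arguments: mpi_tasks_per_node omp_threads physical_cores smt_threads_per_core
# check_layout is a hypothetical helper name.
check_layout() {
  used=$(( $1 * $2 ))     # threads the job will start on each node
  avail=$(( $3 * $4 ))    # hardware threads available on each node
  if [ "$used" -le "$avail" ]; then
    echo "ok: $used of $avail hardware threads used"
  else
    echo "oversubscribed: $used threads requested, $avail available"
    return 1
  fi
}
```

For a node with 36 physical cores and 2 hardware threads per core, "check_layout 36 2 36 2" reports that all 72 hardware threads are used, while "check_layout 72 2 36 2" reports oversubscription.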
 253
 254 NOTE: Setting core affinity is often used to pin MPI tasks and OpenMP
 255 threads to a core or group of cores so that memory access can be
 256 uniform. Unless disabled at build time, affinity for MPI tasks and
 257 OpenMP threads on the host (CPU) will be set by default
 258 {when using offload to a coprocessor}. In this case, it is unnecessary
 259 to use other methods to control affinity (e.g. taskset, numactl,
 260 I_MPI_PIN_DOMAIN, etc.). This can be disabled with the {no_affinity}
 261 option to the "package intel"_package.html command or by disabling the
 262 option at build time (by adding -DINTEL_OFFLOAD_NOAFFINITY to the
 263 CCFLAGS line of your Makefile). Disabling this option is not
 264 recommended, especially when running on a machine with Intel
 265 Hyper-Threading technology disabled.
 266
 267 [Run with the USER-INTEL package from the command line:]
 268
 269 To enable USER-INTEL optimizations for all available styles used in
 270 the input script, the "-sf intel"
 271 "command-line switch"_Section_start.html#start_7 can be used without
 272 any requirement for editing the input script. This switch will
 273 automatically append "intel" to styles that support it. It also
 274 invokes a default command: "package intel 1"_package.html. This
 275 package command is used to set options for the USER-INTEL package.
 276 The default package command will specify that USER-INTEL calculations
 277 are performed in mixed precision, that the number of OpenMP threads
 278 is specified by the OMP_NUM_THREADS environment variable, and that
 279 if coprocessors are present and the binary was built with offload
 280 support, 1 coprocessor per node will be used with automatic
 281 balancing of work between the CPU and the coprocessor.
 282
 283 You can specify different options for the USER-INTEL package by using
 284 the "-pk intel Nphi" "command-line switch"_Section_start.html#start_7
 285 with keyword/value pairs as specified in the documentation. Here,
 286 Nphi = # of Xeon Phi coprocessors/node (ignored without offload
 287 support). Common options to the USER-INTEL package include {omp} to
 288 override any OMP_NUM_THREADS setting and specify the number of OpenMP
 289 threads, {mode} to set the floating-point precision mode, and
 290 {lrt} to enable Long-Range Thread mode as described below. See the
 291 "package intel"_package.html command for details, including the
 292 default values used for all its options if not specified, and how to
 293 set the number of OpenMP threads via the OMP_NUM_THREADS environment
 294 variable if desired.
 295
 296 Examples (see documentation for your MPI/Machine for differences in
 297 launching MPI applications):
 298
 299 mpirun -np 72 -ppn 36 lmp_machine -sf intel -in in.script                                 # 2 nodes, 36 MPI tasks/node, $OMP_NUM_THREADS OpenMP Threads
 300 mpirun -np 72 -ppn 36 lmp_machine -sf intel -in in.script -pk intel 0 omp 2 mode double   # Don't use any coprocessors that might be available, use 2 OpenMP threads for each task, use double precision :pre
 301
 302 [Or run with the USER-INTEL package by editing an input script:]
 303
 304 As an alternative to adding command-line arguments, the input script
 305 can be edited to enable the USER-INTEL package. This requires adding
 306 the "package intel"_package.html command to the top of the input
 307 script. For the second example above, this would be:
 308
 309 package intel 0 omp 2 mode double :pre
 310
 311 To enable the USER-INTEL package only for individual styles, you can
 312 add an "intel" suffix to the individual style, e.g.:
 313
 314 pair_style lj/cut/intel 2.5 :pre
 315
 316 Alternatively, the "suffix intel"_suffix.html command can be added to
 317 the input script to enable USER-INTEL styles for the commands that
 318 follow in the input script.
 319
 320 [Tuning for Performance:]
 321
 322 NOTE: The USER-INTEL package will perform better with modifications
 323 to the input script when "PPPM"_kspace_style.html is used:
 324 "kspace_modify diff ad"_kspace_modify.html and "neigh_modify binsize
 325 3"_neigh_modify.html should be added to the input script.
 326
 327 Long-Range Thread (LRT) mode is an option to the "package
 328 intel"_package.html command that can improve performance when using
 329 "PPPM"_kspace_style.html for long-range electrostatics on processors
 330 with SMT. It generates an extra pthread for each MPI task. The thread
 331 is dedicated to performing some of the PPPM calculations and MPI
 332 communications. On Intel Xeon Phi x200 series CPUs, this will likely
 333 always improve performance, even on a single node. On Intel Xeon
 334 processors, using this mode might result in better performance when
 335 using multiple nodes, depending on the machine. To use this mode,
 336 specify that the number of OpenMP threads is one less than would
 337 normally be used for the run and add the "lrt yes" option to the "-pk"
 338 command-line suffix or "package intel" command. For example, if a run
 339 would normally perform best with "-pk intel 0 omp 4", instead use
 340 "-pk intel 0 omp 3 lrt yes". When using LRT, you should set the
 341 environment variable "KMP_AFFINITY=none". LRT mode is not supported
 342 when using offload.
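
The adjustment described above (one fewer OpenMP thread, plus "lrt yes") can be sketched as a helper that prints the switches. The name lrt_options is hypothetical, and the check that at least two threads are normally used is an assumption made here so that one OpenMP thread always remains.

```shell
# Sketch: turn the usual OpenMP thread count into the LRT-mode switches.
# lrt_options is a hypothetical helper; it prints flags and runs nothing.
lrt_options() {
  normal_omp=$1
  if [ "$normal_omp" -lt 2 ]; then
    # Assumption: at least one OpenMP thread must remain after the reduction.
    echo "need a normal thread count of at least 2" >&2
    return 1
  fi
  echo "-pk intel 0 omp $((normal_omp - 1)) lrt yes -sf intel"
}

export KMP_AFFINITY=none   # recommended in the text when LRT mode is used
```

With a normal setting of 4 threads, "lrt_options 4" prints "-pk intel 0 omp 3 lrt yes -sf intel", matching the example in the paragraph above.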
 343
 344 Not all styles are supported in the USER-INTEL package. You can mix
 345 the USER-INTEL package with styles from the "OPT"_accelerate_opt.html
 346 package or the "USER-OMP package"_accelerate_omp.html. Of course,
 347 this requires that these packages were installed at build time. This
 348 can be performed automatically by using "-sf hybrid intel opt" or
 349 "-sf hybrid intel omp" command-line options. Alternatively, the "opt"
 350 and "omp" suffixes can be appended manually in the input script. For
 351 the latter, the "package omp"_package.html command must be in the
 352 input script or the "-pk omp Nt" "command-line
 353 switch"_Section_start.html#start_7 must be used where Nt is the
 354 number of OpenMP threads. The number of OpenMP threads should not be
 355 set differently for the different packages. Note that the "suffix
 356 hybrid intel omp"_suffix.html command can also be used within the
 357 input script to automatically append the "omp" suffix to styles when
 358 USER-INTEL styles are not available.
 359
 360 When running on many nodes, performance might be better when using
 361 fewer OpenMP threads and more MPI tasks. This will depend on the
 362 simulation and the machine. Using the "verlet/split"_run_style.html
 363 run style might also give better performance for simulations with
 364 "PPPM"_kspace_style.html electrostatics. Note that this is an
 365 alternative to LRT mode and the two cannot be used together.
 366
 367 Currently, when using Intel MPI with Intel Xeon Phi x200 series
 368 CPUs, better performance might be obtained by setting the
 369 environment variable "I_MPI_SHM_LMT=shm" for Linux kernels that do
 370 not yet have full support for AVX-512. Runs on Intel Xeon Phi x200
 371 series processors will always perform better using MCDRAM. Please
 372 consult your system documentation for the best approach to specify
 373 that MPI runs are performed in MCDRAM.
 374
 375 [Tuning for Offload Performance:]
 376
 377 The default settings for offload should give good performance.
 378
 379 When using LAMMPS with offload to Intel coprocessors, best performance
 380 will typically be achieved with concurrent calculations performed on
 381 both the CPU and the coprocessor. This is achieved by offloading only
 382 a fraction of the neighbor and pair computations to the coprocessor or
 383 using "hybrid"_pair_hybrid.html pair styles where only one style uses
 384 the "intel" suffix. For simulations with long-range electrostatics or
 385 bond, angle, dihedral, improper calculations, computation and data
 386 transfer to the coprocessor will run concurrently with computations
 387 and MPI communications for these calculations on the host CPU. This
 388 is illustrated in the figure below for the rhodopsin protein benchmark
 389 running on E5-2697v2 processors with an Intel Xeon Phi 7120p
 390 coprocessor. In this plot, the vertical axis is time and routines
 391 running at the same time are running concurrently on both the host and
 392 the coprocessor.
 393
 394 :c,image(JPG/offload_knc.png)
 395
 396 The fraction of the offloaded work is controlled by the {balance}
 397 keyword in the "package intel"_package.html command. A balance of 0
 398 runs all calculations on the CPU.  A balance of 1 runs all
 399 supported calculations on the coprocessor.  A balance of 0.5 runs half
 400 of the calculations on the coprocessor.  Setting the balance to -1
 401 (the default) will enable dynamic load balancing that continuously
 402 adjusts the fraction of offloaded work throughout the simulation.
 403 Because data transfer cannot be timed, this option typically produces
 404 results within 5 to 10 percent of the optimal fixed balance.
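
As an illustration of the accepted values, the sketch below validates a balance setting and prints the corresponding switches. The helper name balance_options is hypothetical, and the accepted number format (an optional minus sign, digits, and an optional decimal part) is a simplification chosen for this example.

```shell
# Sketch: validate a {balance} value and print the matching switches.
# Arguments: coprocessors_per_node balance
# balance_options is a hypothetical helper; it does not run LAMMPS.
balance_options() {
  if awk -v b="$2" 'BEGIN {
        ok = (b ~ /^-?[0-9]+(\.[0-9]+)?$/) &&
             (b + 0 == -1 || (b + 0 >= 0 && b + 0 <= 1))
        exit !ok
      }'; then
    echo "-pk intel $1 balance $2 -sf intel"
  else
    echo "balance must be -1 (dynamic) or between 0 and 1" >&2
    return 1
  fi
}
```

For example, "balance_options 1 0.5" prints the switches for offloading half of the work to one coprocessor per node, and "balance_options 1 -1" selects dynamic load balancing.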
 405
 406 If running short benchmark runs with dynamic load balancing, adding a
 407 short warm-up run (10-20 steps) will allow the load-balancer to find a
 408 near-optimal setting that will carry over to additional runs.
 409
 410 The default for the "package intel"_package.html command is to have
 411 all the MPI tasks on a given compute node use a single Xeon Phi
 412 coprocessor.  In general, running with a large number of MPI tasks on
 413 each node will perform best with offload.  Each MPI task will
 414 automatically get affinity to a subset of the hardware threads
 415 available on the coprocessor.  For example, if your card has 61 cores,
 416 with 60 cores available for offload and 4 hardware threads per core
 417 (240 total threads), running with 24 MPI tasks per node will cause
 418 each MPI task to use a subset of 10 threads on the coprocessor.  Fine
 419 tuning of the number of threads to use per MPI task or the number of
 420 threads to use per core can be accomplished with keyword settings of
 421 the "package intel"_package.html command.
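
The arithmetic in the example above is an even division of coprocessor threads among MPI tasks. The sketch below reproduces only that calculation with integer division; phi_threads_per_task is a hypothetical name, and the split actually made by LAMMPS may differ when other keywords are set.

```shell
# Sketch: coprocessor threads available to each MPI task in the example.
# Arguments: cores_for_offload threads_per_core mpi_tasks_per_node
# phi_threads_per_task is a hypothetical helper name.
phi_threads_per_task() {
  echo $(( $1 * $2 / $3 ))   # integer division; any remainder is ignored
}
```

For the card described above, "phi_threads_per_task 60 4 24" gives 10 threads per MPI task.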
 422
 423 The USER-INTEL package has two modes for deciding which atoms will be
 424 handled by the coprocessor.  This choice is controlled with the {ghost}
 425 keyword of the "package intel"_package.html command.  When set to 0,
 426 ghost atoms (atoms at the borders between MPI tasks) are not offloaded
 427 to the card.  This allows for overlap of MPI communication of forces
 428 with computation on the coprocessor when the "newton"_newton.html
 429 setting is "on".  The default is dependent on the style being used;
 430 however, better performance may be achieved by setting this option
 431 explicitly.
 432
 433 When using offload with CPU Hyper-Threading disabled, it may help
 434 performance to use fewer MPI tasks and OpenMP threads than available
 435 cores.  This is due to the fact that additional threads are generated
 436 internally to handle the asynchronous offload tasks.
 437
 438 If pair computations are being offloaded to an Intel Xeon Phi
 439 coprocessor, a diagnostic line is printed to the screen (not to the
 440 log file), during the setup phase of a run, indicating that offload
 441 mode is being used and indicating the number of coprocessor threads
 442 per MPI task.  Additionally, an offload timing summary is printed at
 443 the end of each run.  When offloading, the frequency for "atom
 444 sorting"_atom_modify.html is changed to 1 so that the per-atom data is
 445 effectively sorted at every rebuild of the neighbor lists. All the
 446 available coprocessor threads on each Phi will be divided among MPI
 447 tasks, unless the {tptask} option of the "-pk intel" "command-line
 448 switch"_Section_start.html#start_7 is used to limit the coprocessor
 449 threads per MPI task.
 450
 451 [Restrictions:]
 452
 453 When offloading to a coprocessor, "hybrid"_pair_hybrid.html styles
 454 that require skip lists for neighbor builds cannot be offloaded.
 455 Using "hybrid/overlay"_pair_hybrid.html is allowed.  Only one
 456 intel-accelerated style may be used with hybrid styles.
 457 "Special_bonds"_special_bonds.html exclusion lists are not currently
 458 supported with offload; however, the same effect can often be
 459 accomplished by setting cutoffs for excluded atom types to 0.  None of
 460 the pair styles in the USER-INTEL package currently support the
 461 "inner", "middle", "outer" options for rRESPA integration via the
 462 "run_style respa"_run_style.html command; only the "pair" option is
 463 supported.
 464
 465 [References:]
 466
 467 Brown, W.M., Carrillo, J.-M.Y., Mishra, B., Gavhane, N., Thakkar, F.M., De Kraker, A.R., Yamada, M., Ang, J.A., Plimpton, S.J., “Optimizing Classical Molecular Dynamics in LAMMPS,” in Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition, J. Jeffers, J. Reinders, A. Sodani, Eds. Morgan Kaufmann. :ulb,l
 468
 469 Brown, W.M., Semin, A., Hebenstreit, M., Khvostov, S., Raman, K., Plimpton, S.J. Increasing Molecular Dynamics Simulation Rates with an 8-Fold Increase in Electrical Power Efficiency. 2016 International Conference for High Performance Computing. In press. :l
 470
 471 Brown, W.M., Carrillo, J.-M.Y., Gavhane, N., Thakkar, F.M., Plimpton, S.J. Optimizing Legacy Molecular Dynamics Software with Directive-Based Offload. Computer Physics Communications. 2015. 195: p. 95-101. :l
 472 :ule
 473
 474
 475
 476