1 "Previous Section"_Section_packages.html - "LAMMPS WWW Site"_lws -
2 "LAMMPS Documentation"_ld - "LAMMPS Commands"_lc :c
4 :link(lws,http://lammps.sandia.gov)
6 :link(lc,Section_commands.html#comm)
10 "Return to Section accelerate overview"_Section_accelerate.html
12 5.3.3 KOKKOS package :h5
The KOKKOS package was developed primarily by Christian Trott (Sandia)
with contributions of various styles by others, including Sikandar
Mashayak (UIUC), Stan Moore (Sandia), and Ray Shan (Sandia).  The
underlying Kokkos library was written primarily by Carter Edwards,
Christian Trott, and Dan Sunderland (all Sandia).

The KOKKOS package contains versions of pair, fix, and atom styles
that use data structures and macros provided by the Kokkos library,
which is included with LAMMPS in lib/kokkos.

The Kokkos library is part of
"Trilinos"_http://trilinos.sandia.gov/packages/kokkos and can also be
downloaded from "Github"_https://github.com/kokkos/kokkos.  Kokkos is a
templated C++ library that provides two key abstractions for an
application like LAMMPS.  First, it allows a single implementation of
an application kernel (e.g. a pair style) to run efficiently on
different kinds of hardware, such as a GPU, Intel Phi, or many-core
CPU.

The Kokkos library also provides data abstractions to adjust (at
compile time) the memory layout of basic data structures like 2d and
3d arrays and allow the transparent utilization of special hardware
load and store operations.  Such data structures are used in LAMMPS to
store atom coordinates or forces or neighbor lists.  The layout is
chosen to optimize performance on different platforms.  Again this
functionality is hidden from the developer, and does not affect how
kernels are coded.

These abstractions are set at build time, when LAMMPS is compiled with
the KOKKOS package installed.  All Kokkos operations occur within the
context of an individual MPI task running on a single node of the
machine.  The total number of MPI tasks used by LAMMPS (one or
multiple per compute node) is set in the usual manner via the mpirun
or mpiexec commands, and is independent of Kokkos.

Kokkos currently provides support for 3 modes of execution (per MPI
task).  These are OpenMP (for many-core CPUs), Cuda (for NVIDIA GPUs),
and OpenMP (for Intel Phi).  Note that the KOKKOS package supports
running on the Phi in native mode, not offload mode like the
USER-INTEL package supports.  You choose the mode at build time to
produce an executable compatible with specific hardware.

Here is a quick overview of how to use the KOKKOS package
for CPU acceleration, assuming one or more 16-core nodes.
More details follow.

use a C++11 compatible compiler
make yes-kokkos
make mpi KOKKOS_DEVICES=OpenMP                 # build with the KOKKOS package
make kokkos_omp                                # or Makefile.kokkos_omp already has variable set
Make.py -v -p kokkos -kokkos omp -o mpi -a file mpi   # or one-line build via Make.py :pre

mpirun -np 16 lmp_mpi -k on -sf kk -in in.lj              # 1 node, 16 MPI tasks/node, no threads
mpirun -np 2 -ppn 1 lmp_mpi -k on t 16 -sf kk -in in.lj   # 2 nodes, 1 MPI task/node, 16 threads/task
mpirun -np 2 lmp_mpi -k on t 8 -sf kk -in in.lj           # 1 node, 2 MPI tasks/node, 8 threads/task
mpirun -np 32 -ppn 4 lmp_mpi -k on t 4 -sf kk -in in.lj   # 8 nodes, 4 MPI tasks/node, 4 threads/task :pre

specify variables and settings in your Makefile.machine that enable OpenMP, GPU, or Phi support
include the KOKKOS package and build LAMMPS
enable the KOKKOS package and its hardware options via the "-k on" command-line switch
use KOKKOS styles in your input script :ul

Here is a quick overview of how to use the KOKKOS package for GPUs,
assuming one or more nodes, each with 16 cores and a GPU.  More
details follow.

use a C++11 compatible compiler
KOKKOS_DEVICES = Cuda, OpenMP
KOKKOS_ARCH = Kepler35
make yes-kokkos
make machine :pre
Make.py -p kokkos -kokkos cuda arch=35 -o kokkos_cuda -a file kokkos_cuda :pre

mpirun -np 1 lmp_cuda -k on t 6 -sf kk -in in.lj          # one MPI task, 6 threads on CPU
mpirun -np 4 -ppn 1 lmp_cuda -k on t 6 -sf kk -in in.lj   # ditto on 4 nodes :pre

mpirun -np 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj           # two MPI tasks, 8 threads per CPU
mpirun -np 32 -ppn 2 lmp_cuda -k on t 8 g 2 -sf kk -in in.lj   # ditto on 16 nodes :pre

Here is a quick overview of how to use the KOKKOS package
for the Intel Phi:

use a C++11 compatible compiler
KOKKOS_DEVICES = OpenMP
KOKKOS_ARCH = KNC
make yes-kokkos
make machine :pre
Make.py -p kokkos -kokkos phi -o kokkos_phi -a file mpi :pre

host=MIC, Intel Phi with 61 cores (240 threads/phi via 4x hardware threading):
mpirun -np 1 lmp_g++ -k on t 240 -sf kk -in in.lj           # 1 MPI task on 1 Phi, 1*240 = 240
mpirun -np 30 lmp_g++ -k on t 8 -sf kk -in in.lj            # 30 MPI tasks on 1 Phi, 30*8 = 240
mpirun -np 12 lmp_g++ -k on t 20 -sf kk -in in.lj           # 12 MPI tasks on 1 Phi, 12*20 = 240
mpirun -np 96 -ppn 12 lmp_g++ -k on t 20 -sf kk -in in.lj   # ditto on 8 Phis :pre

[Required hardware/software:]

Kokkos support within LAMMPS must be built with a C++11 compatible
compiler.  If using gcc, version 4.7.2 or later is required.

To build with Kokkos support for CPUs, your compiler must support the
OpenMP interface.  You should have one or more multi-core CPUs so that
multiple threads can be launched by each MPI task running on a CPU.

To build with Kokkos support for NVIDIA GPUs, NVIDIA Cuda software
version 7.5 or later must be installed on your system.  See the
discussion for the "GPU"_accelerate_gpu.html package for details of
how to check and do this.

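If you are unsure what is installed, the Cuda toolkit and driver
versions can usually be checked from the command line; for example:

nvcc --version      # reports the Cuda toolkit (compiler) version
nvidia-smi          # reports the driver version and lists visible GPUs :pre
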
NOTE: For good performance of the KOKKOS package on GPUs, you must
have Kepler generation GPUs (or later).  The Kokkos library exploits
texture cache options not supported by Tesla generation GPUs (or
older).

To build with Kokkos support for Intel Xeon Phi coprocessors, your
system must be configured to use them in "native" mode, not "offload"
mode like the USER-INTEL package supports.

[Building LAMMPS with the KOKKOS package:]

You must choose at build time whether to build for CPUs (OpenMP),
GPUs, or Phi.

You can do any of these in one line, using the src/Make.py script,
described in "Section 2.4"_Section_start.html#start_4 of the manual.
Type "Make.py -h" for help.  If run from the src directory, these
commands will create src/lmp_kokkos_omp, lmp_kokkos_cuda, and
lmp_kokkos_phi.  Note that the OMP and PHI options use
src/MAKE/Makefile.mpi as the starting Makefile.machine.  The CUDA
option uses src/MAKE/OPTIONS/Makefile.kokkos_cuda.

The last two steps in the overview list above can be done using the
"-k on", "-pk kokkos" and "-sf kk" "command-line
switches"_Section_start.html#start_7 respectively.  Or the effect of
the "-pk" or "-sf" switches can be duplicated by adding the "package
kokkos"_package.html or "suffix kk"_suffix.html commands respectively
to your input script.

Or you can follow these steps:

CPU-only (run all-MPI or with OpenMP threading):

cd lammps/src
make yes-kokkos
make kokkos_omp :pre

CPU-only (only MPI, no threading):

cd lammps/src
make yes-kokkos
make kokkos_mpi_only :pre

Intel Xeon Phi (Intel Compiler, Intel MPI):

cd lammps/src
make yes-kokkos
make kokkos_phi :pre

CPUs and GPUs (with MPICH):

cd lammps/src
make yes-kokkos
make kokkos_cuda_mpich :pre

These examples set the KOKKOS-specific OMP, MIC, CUDA variables on the
make command line, which requires a GNU-compatible make command.  Try
"gmake" if your system's standard make complains.

NOTE: If you build using make command-line variables and re-build
LAMMPS twice with different KOKKOS options and the *same* target,
e.g. g++ in the first two examples above, then you *must* perform a
"make clean-all" or "make clean-machine" before each build.  This is
to force all the KOKKOS-dependent files to be re-compiled with the new
options.

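For example, a re-build of the same target with a changed KOKKOS
option might look like this (the "mpi" machine name and the option
values are only illustrative):

make mpi KOKKOS_DEVICES=OpenMP                    # first build
make clean-mpi                                    # or: make clean-all
make mpi KOKKOS_DEVICES=OpenMP KOKKOS_ARCH=HSW    # re-build with a changed option :pre
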
NOTE: Currently, there are no precision options with the KOKKOS
package.  All compilation and computation is performed in double
precision.

There are other allowed options when building with the KOKKOS package.
As above, they can be set either as variables on the make command line
or in Makefile.machine.  This is the full list of options, including
those discussed above.  Each takes a value shown below.  The default
value is listed, which is set in the lib/kokkos/Makefile.kokkos file.

KOKKOS_DEVICES, values = {OpenMP}, {Serial}, {Pthreads}, {Cuda}, default = {OpenMP}
KOKKOS_ARCH, values = {KNC}, {SNB}, {HSW}, {Kepler}, {Kepler30}, {Kepler32}, {Kepler35}, {Kepler37}, {Maxwell}, {Maxwell50}, {Maxwell52}, {Maxwell53}, {ARMv8}, {BGQ}, {Power7}, {Power8}, default = {none}
KOKKOS_DEBUG, values = {yes}, {no}, default = {no}
KOKKOS_USE_TPLS, values = {hwloc}, {librt}, default = {none}
KOKKOS_CUDA_OPTIONS, values = {force_uvm}, {use_ldg}, {rdc} :ul

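Several of these variables can be combined on a single make line.  As
an illustration only (the "mpi" machine name is a placeholder for your
own low-level Makefile), a pthreads build with hwloc support could be
requested as:

make mpi KOKKOS_DEVICES=Pthreads KOKKOS_USE_TPLS=hwloc :pre
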
KOKKOS_DEVICES sets the parallelization method used for Kokkos code
(within LAMMPS).  KOKKOS_DEVICES=OpenMP means that OpenMP will be
used.  KOKKOS_DEVICES=Pthreads means that pthreads will be used.
KOKKOS_DEVICES=Cuda means an NVIDIA GPU running CUDA will be used.

If KOKKOS_DEVICES=Cuda, then the low-level Makefile in the src/MAKE
directory must use "nvcc" as its compiler, via its CC setting.  For
best performance its CCFLAGS setting should use -O3 and have a
KOKKOS_ARCH setting that matches the compute capability of your NVIDIA
hardware and software installation, e.g. KOKKOS_ARCH=Kepler30.  Note
the minimum required compute capability is 2.0, but this will give
significantly reduced performance compared to Kepler generation GPUs
with compute capability 3.x.  For the LINK setting, "nvcc" should not
be used; instead use g++ or another compiler suitable for linking C++
applications.  Often you will want to use your MPI compiler wrapper
for this setting (i.e. mpicxx).  Finally, the low-level Makefile must
also have a "Compilation rule" for creating *.o files from *.cu files.
See src/MAKE/OPTIONS/Makefile.kokkos_cuda for an example of a
low-level Makefile with all of these settings.

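A minimal sketch of how those settings can appear in a low-level
Makefile is shown below; the flag values are illustrative only and
must be adapted to your compiler, MPI wrapper, and installation:

CC =        nvcc
CCFLAGS =   -O3
LINK =      mpicxx
LINKFLAGS = -O3
# compilation rule for creating *.o files from *.cu files
%.o:%.cu
	$(CC) $(CCFLAGS) $(EXTRA_INC) -c $< :pre
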
KOKKOS_USE_TPLS=hwloc binds threads to hardware cores, so they do not
migrate during a simulation.  KOKKOS_USE_TPLS=hwloc should always be
used when running with KOKKOS_DEVICES=Pthreads.  It is not necessary
with KOKKOS_DEVICES=OpenMP, because OpenMP provides alternative
methods via environment variables for binding threads to hardware
cores.  More info on binding threads to cores is given in "Section
5.3"_Section_accelerate.html#acc_3.

KOKKOS_ARCH=KNC enables compiler switches needed when compiling for an
Intel Xeon Phi coprocessor.

KOKKOS_USE_TPLS=librt enables use of a more accurate timer mechanism
on most Unix platforms.  This library is not available on all
platforms.

KOKKOS_DEBUG is only useful when developing a Kokkos-enabled style
within LAMMPS.  KOKKOS_DEBUG=yes enables printing of run-time
debugging information that can be useful.  It also enables runtime
bounds checking on Kokkos data structures.

KOKKOS_CUDA_OPTIONS are additional options for CUDA.

For more information on Kokkos see the Kokkos programmers' guide here:
lib/kokkos/doc/Kokkos_PG.pdf.

[Run with the KOKKOS package from the command line:]

The mpirun or mpiexec command sets the total number of MPI tasks used
by LAMMPS (one or multiple per compute node) and the number of MPI
tasks used per node.  E.g. the mpirun command in MPICH does this via
its -np and -ppn switches.  Ditto for OpenMPI via -np and -npernode.

When using KOKKOS built with host=OMP, you need to choose how many
OpenMP threads per MPI task will be used (via the "-k" command-line
switch discussed below).  Note that the product of MPI tasks * OpenMP
threads/task should not exceed the physical number of cores (on a
node), otherwise performance will suffer.

When using the KOKKOS package built with device=CUDA, you must use
exactly one MPI task per physical GPU.

When using the KOKKOS package built with host=MIC for Intel Xeon Phi
coprocessor support you need to insure there are one or more MPI tasks
per coprocessor, and choose the number of coprocessor threads to use
per MPI task (via the "-k" command-line switch discussed below).  The
product of MPI tasks * coprocessor threads/task should not exceed the
maximum number of threads the coprocessor is designed to run,
otherwise performance will suffer.  This value is 240 for current
generation Xeon Phi(TM) chips, which is 60 physical cores * 4
threads/core.  Note that with the KOKKOS package you do not need to
specify how many Phi coprocessors there are per node; each
coprocessor is simply treated as running some number of MPI tasks.

You must use the "-k on" "command-line
switch"_Section_start.html#start_7 to enable the KOKKOS package.  It
takes additional arguments for hardware settings appropriate to your
system.  Those arguments are "documented
here"_Section_start.html#start_7.  The two most commonly used
options are:

-k on t Nt g Ng :pre

The "t Nt" option applies to host=OMP (even if device=CUDA) and
host=MIC.  For host=OMP, it specifies how many OpenMP threads per MPI
task to use within a node.  For host=MIC, it specifies how many Xeon Phi
threads per MPI task to use within a node.  The default is Nt = 1.
Note that for host=OMP this is effectively MPI-only mode which may be
fine.  But for host=MIC you will typically end up using far less than
all the 240 available threads, which could give very poor performance.

299 The "g Ng" option applies to device=CUDA. It specifies how many GPUs
300 per compute node to use. The default is 1, so this only needs to be
301 specified is you have 2 or more GPUs per compute node.
303 The "-k on" switch also issues a "package kokkos" command (with no
304 additional arguments) which sets various KOKKOS options to default
305 values, as discussed on the "package"_package.html command doc page.
307 Use the "-sf kk" "command-line switch"_Section_start.html#start_7,
308 which will automatically append "kk" to styles that support it. Use
309 the "-pk kokkos" "command-line switch"_Section_start.html#start_7 if
310 you wish to change any of the default "package kokkos"_package.html
311 optionns set by the "-k on" "command-line
312 switch"_Section_start.html#start_7.
Note that the default for the "package kokkos"_package.html command is
to use "full" neighbor lists and set the Newton flag to "off" for both
pairwise and bonded interactions.  This typically gives fastest
performance.  If the "newton"_newton.html command is used in the input
script, it can override the Newton flag defaults.

However, when running in MPI-only mode with 1 thread per MPI task, it
will typically be faster to use "half" neighbor lists and set the
Newton flag to "on", just as is the case for non-accelerated pair
styles.  You can do this with the "-pk" "command-line
switch"_Section_start.html#start_7.

[Or run with the KOKKOS package by editing an input script:]

The discussion above for the mpirun/mpiexec command and setting
appropriate thread and GPU values for host=OMP or host=MIC or
device=CUDA is the same.

You must still use the "-k on" "command-line
switch"_Section_start.html#start_7 to enable the KOKKOS package, and
specify its additional arguments for hardware options appropriate to
your system, as documented above.

Use the "suffix kk"_suffix.html command, or you can explicitly add a
"kk" suffix to individual styles in your input script, e.g.

pair_style lj/cut/kk 2.5 :pre

You only need to use the "package kokkos"_package.html command if you
wish to change any of its option defaults, as set by the "-k on"
"command-line switch"_Section_start.html#start_7.

[Speed-ups to expect:]

The performance of KOKKOS running in different modes is a function of
your hardware, which KOKKOS-enabled styles are used, and the problem
size.

Generally speaking, the following rules of thumb apply:

When running on CPUs only, with a single thread per MPI task,
performance of a KOKKOS style is somewhere between the standard
(un-accelerated) styles (MPI-only mode), and those provided by the
USER-OMP package.  However the difference between all 3 is small (less
than 20%). :l

When running on CPUs only, with multiple threads per MPI task,
performance of a KOKKOS style is a bit slower than the USER-OMP
package. :l

When running large numbers of atoms per GPU, KOKKOS is typically faster
than the GPU package. :l

When running on Intel Xeon Phi, KOKKOS is not as fast as
the USER-INTEL package, which is optimized for that hardware. :l

See the "Benchmark page"_http://lammps.sandia.gov/bench.html of the
LAMMPS web site for performance of the KOKKOS package on different
hardware.

[Guidelines for best performance:]

Here are guidelines for using the KOKKOS package on the different
hardware configurations listed above.

Many of the guidelines use the "package kokkos"_package.html command.
See its doc page for details and default settings.  Experimenting with
its options can provide a speed-up for specific calculations.

[Running on a multi-core CPU:]

If N is the number of physical cores/node, then the number of MPI
tasks/node * number of threads/task should not exceed N, and should
typically equal N.  Note that the default threads/task is 1, as set by
the "t" keyword of the "-k" "command-line
switch"_Section_start.html#start_7.  If you do not change this, no
additional parallelism (beyond MPI) will be invoked on the host
CPU(s).

You can compare the performance running in different modes:

run with 1 MPI task/node and N threads/task
run with N MPI tasks/node and 1 thread/task
run with settings in between these extremes :ul

Examples of mpirun commands in these modes are shown above.

When using KOKKOS to perform multi-threading, it is important for
performance to bind both MPI tasks to physical cores, and threads to
physical cores, so they do not migrate during a simulation.

If you are not certain MPI tasks are being bound (check the defaults
for your MPI installation), binding can be forced with these flags:

OpenMPI 1.8: mpirun -np 2 -bind-to socket -map-by socket ./lmp_openmpi ...
Mvapich2 2.0: mpiexec -np 2 -bind-to socket -map-by socket ./lmp_mvapich ... :pre

For binding threads with the KOKKOS OMP option, use thread affinity
environment variables to force binding.  With OpenMP 3.1 (gcc 4.7 or
later, intel 12 or later) setting the environment variable
OMP_PROC_BIND=true should be sufficient.  For binding threads with the
KOKKOS pthreads option, compile LAMMPS with the KOKKOS_USE_TPLS=hwloc
option, as discussed in "Section 2.3.4"_Section_start.html#start_3_4
of the manual.

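For example, with a bash-compatible shell the OpenMP binding variable
can be set before launching LAMMPS:

export OMP_PROC_BIND=true    # use "setenv OMP_PROC_BIND true" for csh/tcsh :pre
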
[Running on GPUs:]

Insure the KOKKOS_ARCH setting in the machine Makefile you are using,
e.g. src/MAKE/OPTIONS/Makefile.kokkos_cuda, is correct for your GPU
hardware/software (see "this section"_Section_start.html#start_3_4 of
the manual for instructions).

The -np setting of the mpirun command should set the number of MPI
tasks/node to be equal to the # of physical GPUs on the node.

Use the "-k" "command-line switch"_Section_start.html#start_7 to
specify the number of GPUs per node, and the number of threads per MPI
task.  As above for multi-core CPUs (and no GPU), if N is the number
of physical cores/node, then the number of MPI tasks/node * number of
threads/task should not exceed N.  With one GPU (and one MPI task) it
may be faster to use fewer than all the available cores, by setting
threads/task to a smaller value.  This is because using all the cores
on a dual-socket node will incur extra cost to copy memory from the
2nd socket to the GPU.

Examples of mpirun commands that follow these rules are shown above.

NOTE: When using a GPU, you will achieve the best performance if your
input script does not use any fix or compute styles which are not yet
Kokkos-enabled.  This allows data to stay on the GPU for multiple
timesteps, without being copied back to the host CPU.  Invoking a
non-Kokkos fix or compute, or performing I/O for
"thermo"_thermo_style.html or "dump"_dump.html output will cause data
to be copied back to the CPU.

You cannot yet assign multiple MPI tasks to the same GPU with the
KOKKOS package.  We plan to support this in the future, similar to the
GPU package in LAMMPS.

You cannot yet use both the host (multi-threaded) and device (GPU)
together to compute pairwise interactions with the KOKKOS package.  We
hope to support this in the future, similar to the GPU package in
LAMMPS.

[Running on an Intel Phi:]

Kokkos only uses Intel Phi processors in their "native" mode, i.e.
not hosted by a CPU.

As illustrated above, build LAMMPS with KOKKOS_DEVICES=OpenMP (the
default) and KOKKOS_ARCH=KNC.  The latter insures code is correctly
compiled for the Intel Phi.  The OpenMP setting means OpenMP will be
used for parallelization on the Phi, which is currently the best
option within Kokkos.  In the future, other options may be added.

Current-generation Intel Phi chips have either 61 or 57 cores.  One
core should be excluded for running the OS, leaving 60 or 56 cores.
Each core is hyperthreaded, so there are effectively N = 240 (4*60) or
N = 224 (4*56) cores to run on.

The -np setting of the mpirun command sets the number of MPI
tasks/node.  The "-k on t Nt" command-line switch sets the number of
threads/task as Nt.  The product of these 2 values should be N, i.e.
240 or 224.  Also, the number of threads/task should be a multiple of
4 so that logical threads from more than one MPI task do not run on
the same physical core.

Examples of mpirun commands that follow these rules are shown above.

[Restrictions:]

As noted above, if using GPUs, the number of MPI tasks per compute
node should be equal to the number of GPUs per compute node.  In the
future Kokkos will support assigning multiple MPI tasks to a single
GPU.

Currently Kokkos does not support AMD GPUs due to limits in the
available backend programming models.  Specifically, Kokkos requires
extensive C++ support from the kernel language.  This is expected to
change in the future.