docs/OpenCLTODOList.txt

Gromacs – OpenCL Porting
TODO List

TABLE OF CONTENTS
1. KNOWN LIMITATIONS
2. CODE IMPROVEMENTS
3. ENHANCEMENTS
4. OPTIMIZATIONS
5. OTHER NOTES
6. TESTED CONFIGURATIONS

1. KNOWN LIMITATIONS
   =================
- Sharing an OpenCL GPU between two MPI ranks is not supported.
  See also Issue #91 - https://github.com/StreamComputing/gromacs/issues/91

- Using more than one OpenCL GPU on a node is thought to work with thread-MPI,
  but is known to segfault with AMD OpenCL and OpenMPI on x86 Linux.

2. CODE IMPROVEMENTS
   =================
- Errors returned by OpenCL functions are currently handled with assert calls.
  These need to be replaced with proper error checking and reporting.
  See also Issue #6 - https://github.com/StreamComputing/gromacs/issues/6

- clCreateBuffer is always called with the CL_MEM_READ_WRITE flag. Each call
  should instead pass only the flags that reflect how the buffer is used.
  For example, if the device is only going to read from a buffer,
  CL_MEM_READ_ONLY should be used.
  See also Issue #13 - https://github.com/StreamComputing/gromacs/issues/13

- The data structures shared between the OpenCL host and device are defined twice:
  once in the host code and once in the device code. They must be moved to a
  single file that is shared between the host and the device.
  See also Issue #16 - https://github.com/StreamComputing/gromacs/issues/16

- Quite a few error conditions are unhandled; they are marked with TODO
  comments in several files

- gmx_device_info_t needs struct field documentation
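
As a minimal sketch of what replacing the assert calls could look like, the
C fragment below turns a status code into a message that the caller can
report. It is not taken from the GROMACS sources: ocl_status_string and
ocl_check are hypothetical helpers, and the OCL_* constants are local
stand-ins that carry the same numeric values as the corresponding CL_* codes
in CL/cl.h, so that the fragment compiles without an OpenCL SDK. Real code
would include CL/cl.h and use cl_int and the CL_* macros directly.

```c
#include <stdbool.h>
#include <stdio.h>

/* Local stand-ins for a few OpenCL status codes (same numeric values as in
 * CL/cl.h). They exist only so that this sketch builds without an SDK. */
enum
{
    OCL_SUCCESS                       = 0,
    OCL_DEVICE_NOT_FOUND              = -1,
    OCL_MEM_OBJECT_ALLOCATION_FAILURE = -4,
    OCL_OUT_OF_RESOURCES              = -5,
    OCL_OUT_OF_HOST_MEMORY            = -6,
    OCL_INVALID_VALUE                 = -30
};

/* Hypothetical helper: map a status code to a readable name. */
static const char *ocl_status_string(int status)
{
    switch (status)
    {
        case OCL_SUCCESS:                       return "CL_SUCCESS";
        case OCL_DEVICE_NOT_FOUND:              return "CL_DEVICE_NOT_FOUND";
        case OCL_MEM_OBJECT_ALLOCATION_FAILURE: return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
        case OCL_OUT_OF_RESOURCES:              return "CL_OUT_OF_RESOURCES";
        case OCL_OUT_OF_HOST_MEMORY:            return "CL_OUT_OF_HOST_MEMORY";
        case OCL_INVALID_VALUE:                 return "CL_INVALID_VALUE";
        default:                                return "unknown OpenCL error";
    }
}

/* Hypothetical helper: returns true on success. On failure it writes a
 * message naming the failed call into msg and returns false, so the caller
 * can release resources and report the error instead of aborting. */
static bool ocl_check(int status, const char *call, char *msg, size_t msg_size)
{
    if (status == OCL_SUCCESS)
    {
        return true;
    }
    snprintf(msg, msg_size, "%s failed: %s (%d)",
             call, ocl_status_string(status), status);
    return false;
}
```

A call site would test the return value of ocl_check right after each OpenCL
call and, on failure, pass msg to whatever error-reporting mechanism the
surrounding code uses.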

3. ENHANCEMENTS
   ============
- Implement OpenCL kernels for Intel GPUs

- Implement OpenCL kernels for Intel CPUs

- Improve GPU device sorting in detect_gpus
  See also Issue #64 - https://github.com/StreamComputing/gromacs/issues/64

- Implement warp-independent kernels
  See also Issue #66 - https://github.com/StreamComputing/gromacs/issues/66

- Have one OpenCL program object per OpenCL kernel
  See also Issue #86 - https://github.com/StreamComputing/gromacs/issues/86

- Consider parallelising JIT compilation of programs over CPU cores to improve
  startup time

- Reconsider caching JIT artefacts to improve startup time
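
One possible ingredient for caching JIT artefacts is a cache key that changes
whenever anything that influences compilation changes. The sketch below is an
assumption, not the GROMACS design: it hashes the device name, driver version,
build options and kernel source with 64-bit FNV-1a and derives a file name
from the result. The function names and the nbnxn_ocl_ file prefix are
hypothetical. Retrieving the binary (clGetProgramInfo with
CL_PROGRAM_BINARIES) and reloading it (clCreateProgramWithBinary) are not
shown.

```c
#include <stdint.h>
#include <stdio.h>

/* 64-bit FNV-1a: continue an existing hash with the bytes of string s. */
static uint64_t fnv1a_update(uint64_t hash, const char *s)
{
    const uint64_t prime = 0x100000001b3ULL;

    for (; *s != '\0'; ++s)
    {
        hash ^= (unsigned char)*s;
        hash *= prime;
    }
    /* Mix in a separator byte so that moving characters between two
     * adjacent fields changes the hash. */
    hash ^= 0xff;
    hash *= prime;
    return hash;
}

/* Hypothetical helper: build a cache file name from everything that is
 * assumed here to affect the compiled binary. */
static void jit_cache_file_name(const char *device_name,
                                const char *driver_version,
                                const char *build_options,
                                const char *kernel_source,
                                char *out, size_t out_size)
{
    uint64_t hash = 0xcbf29ce484222325ULL; /* FNV-1a offset basis */

    hash = fnv1a_update(hash, device_name);
    hash = fnv1a_update(hash, driver_version);
    hash = fnv1a_update(hash, build_options);
    hash = fnv1a_update(hash, kernel_source);

    snprintf(out, out_size, "nbnxn_ocl_%016llx.bin", (unsigned long long)hash);
}
```

With a key of this kind, a cached binary is only looked up when device,
driver, options and source all match; any mismatch falls back to a normal
JIT build.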

4. OPTIMIZATIONS
   =============
- Define nbparam fields as constants when building the OpenCL kernels
  See also Issue #87 - https://github.com/StreamComputing/gromacs/issues/87

- Fix the tabulated Ewald kernel. It has the potential to be faster than
  the analytical Ewald kernel
  See also Issue #65 - https://github.com/StreamComputing/gromacs/issues/65

- Evaluate the impact of gpu_min_ci_balanced_factor on performance for AMD GPUs
  See also Issue #69 - https://github.com/StreamComputing/gromacs/issues/69

- Update ocl_pmalloc to allocate page-locked memory
  See also Issue #90 - https://github.com/StreamComputing/gromacs/issues/90

- Update the kernels for 128/256 threads per block
  See also Issue #92 - https://github.com/StreamComputing/gromacs/issues/92

- Update the kernels to use OpenCL 2.0 workgroup-level functions if they prove
  to bring a significant speedup.
  See also Issue #93 - https://github.com/StreamComputing/gromacs/issues/93

- Update the kernels to use fixed-precision accumulation for force and energy
  values, if this implementation is faster and does not affect precision.
  See also Issue #94 - https://github.com/StreamComputing/gromacs/issues/94
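
To illustrate the last item, the host-side C sketch below reads
"fixed-precision accumulation" as fixed-point accumulation into 64-bit
integers, which is an assumption about the intended technique. The names
to_fixed, from_fixed and sum_fixed, and the choice of 20 fractional bits, are
hypothetical and not taken from the GROMACS kernels. The point is that
integer addition is associative, so the result does not depend on the order
in which contributions arrive.

```c
#include <stdint.h>

/* Hypothetical scale: 2^20, i.e. 20 fractional bits. */
#define FIXED_SCALE 1048576.0f

/* Convert a float to fixed point, rounding to nearest. The caller must keep
 * |x| well below 2^43 so that the scaled value fits into an int64_t. */
static int64_t to_fixed(float x)
{
    float scaled = x * FIXED_SCALE;
    return (int64_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
}

/* Convert a fixed-point value back to float. */
static float from_fixed(int64_t v)
{
    return (float)((double)v / FIXED_SCALE);
}

/* Accumulate n values in fixed point and return the sum as a float. */
static float sum_fixed(const float *values, int n)
{
    int64_t acc = 0;
    int     i;

    for (i = 0; i < n; i++)
    {
        acc += to_fixed(values[i]);
    }
    return from_fixed(acc);
}
```

For the values 1.0e8, 0.125, -1.0e8 and 0.25, sum_fixed returns 0.375 in any
order, whereas a plain float sum of the same values can depend on the order
of the additions. Whether this is faster on a given device, and how many
fractional bits are needed for forces and energies, is exactly what the item
above asks to evaluate.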

5. OTHER NOTES
   ===========
- NVIDIA GPUs are all handled the same way, regardless of compute capability

- Because of an unfixed bug in the tabulated kernels, the current
  implementation uses only the analytical kernels and never the tabulated ones
  See also Issue #65 - https://github.com/StreamComputing/gromacs/issues/65

- Unlike the CUDA version, the OpenCL implementation uses normal buffers
  instead of textures
  See also Issue #88 - https://github.com/StreamComputing/gromacs/issues/88

6. TESTED CONFIGURATIONS
   =====================
Tested devices:
        NVIDIA GPUs: GeForce GTX 660M, GeForce GTX 750Ti, GeForce GTX 780
        AMD GPUs: FirePro W5100, HD 7950, FirePro W9100, Radeon R7 M260, R9 290

Tested kernels:
Kernel                                          |Benchmark test                                 |Remarks
--------------------------------------------------------------------------------------------------------
nbnxn_kernel_ElecCut_VdwLJ_VF_prune_opencl      |d.poly-ch2                                     |
nbnxn_kernel_ElecCut_VdwLJ_F_opencl             |d.poly-ch2                                     |
nbnxn_kernel_ElecCut_VdwLJ_F_prune_opencl       |d.poly-ch2                                     |
nbnxn_kernel_ElecCut_VdwLJ_VF_opencl            |d.poly-ch2                                     |
nbnxn_kernel_ElecRF_VdwLJ_VF_prune_opencl       |adh_cubic with rf_verlet.mdp                   |
nbnxn_kernel_ElecRF_VdwLJ_F_opencl              |adh_cubic with rf_verlet.mdp                   |
nbnxn_kernel_ElecRF_VdwLJ_F_prune_opencl        |adh_cubic with rf_verlet.mdp                   |
nbnxn_kernel_ElecEwQSTab_VdwLJ_VF_prune_opencl  |adh_cubic_vsites with pme_verlet_vsites.mdp    |Failed
nbnxn_kernel_ElecEwQSTab_VdwLJ_F_prune_opencl   |adh_cubic_vsites with pme_verlet_vsites.mdp    |Failed
nbnxn_kernel_ElecEw_VdwLJ_VF_prune_opencl       |adh_cubic_vsites with pme_verlet_vsites.mdp    |
nbnxn_kernel_ElecEw_VdwLJ_F_opencl              |adh_cubic_vsites with pme_verlet_vsites.mdp    |
nbnxn_kernel_ElecEw_VdwLJ_F_prune_opencl        |adh_cubic_vsites with pme_verlet_vsites.mdp    |
nbnxn_kernel_ElecEwTwinCut_VdwLJ_F_prune_opencl |adh_cubic_vsites with pme_verlet_vsites.mdp    |
nbnxn_kernel_ElecEwTwinCut_VdwLJ_F_opencl       |adh_cubic_vsites with pme_verlet_vsites.mdp    |

Input data used for testing - Benchmark data sets available here:
ftp://ftp.gromacs.org/pub/benchmarks