Add OpenCL pruning kernels and launch/timing logic
The kernels have been tested for correctness on NVIDIA (CC >= 3.5) and
AMD GCN devices. Tuning for AMD was done on the old fglrx stack, which
limits intra-workgroup parallelism, so the choice of the j4 concurrency
parameter should be revisited at a later stage (using the latest
AMDGPU-PRO and hopefully ROCm).
A number of possible improvements have also been identified and noted
as comments in nbnxn_ocl.cpp.
Change-Id: I7129ec247706d33317df1256846943ee8b0d540c