Fix multiple tMPI ranks per OpenCL device
The OpenCL context and program objects were stored in the gpu_info
struct, which was assumed to be constant per compute host and was
therefore shared across the tMPI ranks: gpu_info was initialized once
and all ranks used a single pointer to its data. As a result, the
OpenCL context and program objects of different ranks sharing a single
device were overwritten/corrupted by one another.
Notes:
- MPI still segfaults in clCreateContext() with multiple ranks per
  node, both with and without GPU sharing; this change does not
  address that.
- The AMD OpenCL runtime overhead with all hardware threads used is
  quite significant; as a short-term solution we should consider
  avoiding HT by launching fewer threads (and/or warning the user).
Refs #1804
Change-Id: I7c6c53a3e6a049ce727ae65ddf0978f436c04579