Reorganize PME GPU launch
Wrapped the first (prep/spread) and second stage (fft/gather) of PME GPU
in functions. Moved the second stage of the regular PME GPU mode to after
the nonbonded x transform to ensure that the transform can overlap with
spread even when the launch overhead of the FFT kernels is high.
Also removed TPI-related PME-GPU launch conditions as this should be
checked much earlier. Noted in the force flags docs that the current
code assumes GMX_FORCE_STATECHANGED is used only with TPI.
Change-Id: I7f765d66c6c4e7e54812b81b2dd23751af0b06b5