Optimized the bonded thread force reduction
The bonded thread force reduction now uses fixed size blocks of 32
atoms. Instead of dividing reduction of the whole range of blocks
uniformly over the threads, now only used blocks are divided
(uniformly) over the threads.
This speeds up the reduction by a factor #threads (!) for typical
protein+water systems when not using domain decomposition.
With domain decomposition the speed up is up to a factor of 3.
Change-Id: I255b4336d315ef62ac9bc446c22313bf47baa097