Improved wallcycle reporting

Removed reporting of MPI rank and thread counts on each row, in
favour of a single header with that information.

With npme > 0, a note is printed that the time column is not
expected to add up to the total.

Works correctly with a range of -npme, -ntomp and -ntomp_pme values:
PP times add up to the total, which equals the final walltime
reported; cycle count and percentage column totals are correct and
reflect the actual work done.

Partial fix for #1188

Change-Id: Ic870d981bf0375189601bf8c9bc67bc5d6226497