% This file is part of the GROMACS molecular simulation package.
%
% Copyright (c) 2013, by the GROMACS development team, led by
% Mark Abraham, David van der Spoel, Berk Hess, and Erik Lindahl,
% and including many others, as listed in the AUTHORS file in the
% top-level source directory and at http://www.gromacs.org.
%
% GROMACS is free software; you can redistribute it and/or
% modify it under the terms of the GNU Lesser General Public License
% as published by the Free Software Foundation; either version 2.1
% of the License, or (at your option) any later version.
%
% GROMACS is distributed in the hope that it will be useful,
% but WITHOUT ANY WARRANTY; without even the implied warranty of
% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
% Lesser General Public License for more details.
%
% You should have received a copy of the GNU Lesser General Public
% License along with GROMACS; if not, see
% http://www.gnu.org/licenses, or write to the Free Software Foundation,
% Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
%
% If you want to redistribute modifications to GROMACS, please
% consider that scientific software is very special. Version
% control is crucial - bugs must be traceable. We will be happy to
% consider code for inclusion in the official distribution, but
% derived work must not be called official GROMACS. Details are found
% in the README & COPYING files - if they are missing, get the
% official version at http://www.gromacs.org.
%
% To help us fund GROMACS development, we humbly ask that you cite
% the research papers on the package. Check out http://www.gromacs.org.

\section{Parallelization}
\label{sec:par}

\newcommand{\abs}[1]{\mid \! {#1} \! \mid}

The purpose of this section is to discuss the
\normindex{parallelization} of the
principal MD algorithm and not to describe the algorithms that are in
practical use for molecular systems with their complex variety of atoms
and terms in the force field descriptions. We shall therefore consider
as an example a simple system consisting only of a single type of atom
with a simple form of the interaction potential. The emphasis will be
on the special problems that arise when the algorithm is implemented on
a parallel computer.

The simple model problem already contains the bottleneck of all MD
simulations: the computationally intensive evaluation of the
{\em non-bonded} forces between pairs of atoms, based on the distance
between particles. Complex molecular systems will in addition
involve many different kinds of {\em bonded} forces between designated
atoms. Such interactions add to the complexity of the algorithm but do
not modify the basic considerations concerning parallelization.

\subsection{Methods of parallelization}
There are a number of methods to parallelize the MD algorithm, each
with its own advantages and disadvantages. Which method to
choose depends on the hardware and compilers available.
We list them here:
\begin{enumerate}
\item[1]        {\em \normindex{Message Passing}.}\\
        In this method, which is more or less the traditional
        way of parallel programming, all the parallelism is
        explicitly programmed by the user. The disadvantage
        is that it takes extra code and effort; the advantage
        is that the programmer keeps full control over the data
        flow and can do optimizations a compiler could not come
        up with.

        The implementation is typically done by calling a set of
        library routines to send and receive data to and from
        other processors. Almost all hardware vendors support
        this way of parallelism in their C and Fortran compilers.

\item[2]        {\em \swapindex{Data}{Parallel}.}\\
        This method lets the user define arrays on which to
        operate in parallel. Programming this way is much
        like vectorizing: recurrence is not parallelized
        ({\eg} {\tt for(i=1; (i<MAX); i++) a[i] = a[i-1] + 1;}
        neither vectorizes nor parallelizes, because for
        every i the result from the previous step is needed);
        see the sketch after this list.

        The advantage of data parallelism is that it is
        easier for the user; the compiler takes care of the
        parallelism. The disadvantage is that it is supported
        by only a small (though growing) number of hardware vendors,
        and that it is much harder to maintain a program that has to
        run on both parallel and sequential machines, because
        the only standard language that supports it is Fortran-90,
        which is not available on many platforms.
\end{enumerate}
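
As a minimal illustration of the difference, consider the following two C
loops (the function name, the arrays {\tt a} and {\tt b} and the constant
{\tt MAX} are placeholders for this example only, not part of {\gromacs}).
The first loop has independent iterations; the second carries a dependence
between iterations and therefore can neither be vectorized nor parallelized:
\begin{footnotesize}
\begin{verbatim}
#define MAX 1024

void loop_example(float a[], const float b[])
{
  int i;

  /* Independent iterations: a data parallel compiler can distribute
   * these over processors (or vector lanes) without further analysis. */
  for (i=0; (i<MAX); i++)
    a[i] = b[i] + 1;

  /* Loop-carried dependence: iteration i needs the result of
   * iteration i-1, so this loop must run sequentially. */
  for (i=1; (i<MAX); i++)
    a[i] = a[i-1] + 1;
}
\end{verbatim}
\end{footnotesize}
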
Both methods allow for the MD algorithm to be implemented without much
trouble. Message passing MD algorithms have been published
since the mid-1980s (\cite{Fincham87}, \cite{Raine89})
and development is still continuing.
Data parallel programming is newer,
but starting from a well-vectorized program it is not hard to do.

Our implementation of MD is a message passing one, the reason for which
is partly historical: the project to develop a parallel MD program started
when Fortran-90 was still in the making, and no compilers were
expected to be available.
At present, after having done some experiments with data parallel
programming on a Connection Machine (CM-5), we still believe that message
passing is the way to go, because of portability to other hardware,
the poor performance of the code produced by the compilers,
and because this way of programming
has the same drawback as vectorization: the part of the program that is
not vectorized or parallelized determines the run time of the program
(\normindex{Amdahl's law}).

The approach we took to parallelism was a minimalist one: use as few
non-standard elements in the software as possible, and use the
simplest \swapindex{processor}{topology} that does the job. We therefore
decided to use a standard language (ANSI C) with as few non-standard
routines as possible. Only five of the communication routines we use are
non-standard. It is therefore very easy to port our code to other machines.

For an $O(N^2)$ problem like MD, one of the best schemes for the
inter-processor connections is a ring, so our software demands that a
ring is present in the inter-processor connections. A ring can essentially
always be mapped onto another network like a \normindex{hypercube}, a
bus interface (Ethernet, {\eg} using
\seeindex{Message Passing Interface}{MPI} \normindex{MPI}) or
a \normindex{tree}
(CM-5). Some hardware vendors have very luxurious connection schemes
that connect every processor to every other processor, but we do not
really need this and so do not use it, even though it might come in handy
at times. The advantage of this simple scheme is that {\gromacs}
performs extremely well even on inexpensive workstation clusters.

When using a message passing scheme one has to divide the particles
over processors, which can be done in two ways:
\begin{itemize}
\item   {\em \swapindex{Space}{Decomposition}.}\\
        An element of space is allocated to each processor; when dividing
        a cubic box with edge $b$ over $P$ processors this can be done
        by giving
        each processor a slab of length $b/P$. This method
        has the advantage
        that each processor has about the same number of interactions
        to calculate (at least when the simulated system has a homogeneous
        density, like a liquid or a gas). The disadvantage is that a lot of
        bookkeeping is necessary for particles that move over processor
        boundaries. In more complex systems, such as macromolecules, there
        are also 3- and 4-atom interactions that make
        the bookkeeping too complicated for our taste.
\item   {\em \swapindex{Particle}{Decomposition}.}\\
        Every processor is allocated a number of particles. When
        dividing $N$ particles over $P$ processors each processor will
        get $N/P$ particles. The implementation of this method
        is described in the next section; a minimal sketch of such a
        division is given after this list.
\end{itemize}
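
The following is a minimal sketch of a particle decomposition (an
illustration under the assumption of a simple contiguous division; the
function name is hypothetical and the real {\gromacs} setup is done
elsewhere). It fills the {\tt start[]} and {\tt homenr[]} arrays that are
also used by the communication routine shown later in this section:
\begin{footnotesize}
\begin{verbatim}
/* Divide N particles over nprocs processors. After the call,
 * processor i owns the homenr[i] particles starting at index start[i].
 * Any remainder of N/nprocs is spread over the first processors. */
void divide_particles(int N, int nprocs, int start[], int homenr[])
{
  int i;

  for (i=0; (i<nprocs); i++) {
    homenr[i] = N/nprocs + ((i < N % nprocs) ? 1 : 0);
    start[i]  = (i == 0) ? 0 : start[i-1] + homenr[i-1];
  }
}
\end{verbatim}
\end{footnotesize}
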
\begin{figure}[p]
\centerline{\includegraphics[width=10cm]{plots/int-mat}}
\caption[The interaction matrix.]{The interaction matrix (left) and
the same using action$~=~-$reaction (right).}
\label{fig:int_mat}
\end{figure}

\begin{table}[p]
\centerline{
\newcolumntype{s}[1]{D{/}{/}{#1}}
\begin{tabular}{|l|s4|s4|s4|s4|}
\dline
            & \mcc{1}{i mod 2 = 0}
            & \mcc{1}{i mod 2 = 0}
            & \mcc{1}{i mod 2 = 1}
            & \mcc{1}{i mod 2 = 1} \\
            & \mcc{1}{i $<$ N/2}
            & \mcc{1}{i $\ge$ N/2}
            & \mcc{1}{i $<$ N/2}
            & \mcc{1}{i $\ge$ N/2} \\
\hline
N mod 2 = 1 & N/2 & N/2     & N/2     & N/2 \\
N mod 4 = 2 & N/2 & N/2     & N/2 - 1 & N/2 - 1 \\
N mod 4 = 0 & N/2 & N/2 - 1 & N/2 - 1 & N/2 \\
\dline
\end{tabular}
}
\caption[The number of interactions between particles.]{The number of
interactions between particles. The number of $j$ particles per $i$
particle is a function of the total number of particles $N$ and the
particle number $i$. Note that here the $/$ operator is used for
integer division, {\ie} truncating the remainder.}
\label{tab:decomp}
\end{table}

\begin{figure}[p]
\centerline{\includegraphics[width=\linewidth]{plots/decomp}}
\caption[Interaction matrices for different $N$.]{Interaction matrices for
different $N$. The number of $j$-particles an $i$-particle interacts with
depends on the {\em total} number of particles and on the
{\em particle number}.}
\label{fig:decomp}
\end{figure}

\subsection{MD on a ring of processors}
When a neighbor list is not used the MD problem is in principle an $O(N^2)$
problem, as each particle can interact
with every other. This can be simplified using Newton's third law
\beq
F_{ij} ~=~ -F_{ji}
\label{eqn:Newt3}
\eeq
This implies that there is half a matrix of interactions (without the
diagonal; a particle does not interact with itself) to consider
(\figref{int_mat}).
When we reflect the upper right triangle of interactions onto the lower
left triangle of the matrix, we still cover all possible interactions,
but now every row in the matrix has almost the same number of points
or possible interactions. We can now assign a (preferably equal)
number of rows to each processor to compute the forces and at the same
time a number of particles to do the update on, the {\em home}
particles. The number of interactions per particle depends on the
{\em total number} $N$ of particles (see \figref{decomp}) and on the
{\em particle number} $i$. The exact formulae are given in
\tabref{decomp}.
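
For reference, the entries of \tabref{decomp} can be written as a small C
helper (a hypothetical function, not taken from the {\gromacs} sources; the
{\tt /} and {\tt \%} operators are integer division and remainder):
\begin{footnotesize}
\begin{verbatim}
/* Number of j particles that particle i (0 <= i < N) interacts with
 * after folding the interaction matrix as in the table above. */
int n_interactions(int N, int i)
{
  if (N % 2 == 1)
    return N/2;                              /* N odd       */
  if (N % 4 == 2)
    return (i % 2 == 0) ? N/2 : N/2 - 1;     /* N mod 4 = 2 */
  if (i % 2 == 0)                            /* N mod 4 = 0 */
    return (i < N/2) ? N/2 : N/2 - 1;
  else
    return (i < N/2) ? N/2 - 1 : N/2;
}
\end{verbatim}
\end{footnotesize}
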
A flow chart of the algorithm is given in \figref{mdpar}.
\begin{figure}
\centerline{\includegraphics[height=18cm]{plots/mdpar}}
\caption[The Parallel MD algorithm.]{The Parallel MD algorithm. If
the steps marked * are left out we have the sequential algorithm
again.}
\label{fig:mdpar}
\end{figure}
\vfill

It is the same as the sequential algorithm, except for two
communication steps. After the particles have been reset in the box,
each processor sends its coordinates onward (left) and then starts computing
the forces. After this step each processor holds the {\em partial
forces} for the available particles, {\eg} processor 0 holds forces
acting on home particles from processors 0, 1, 2 and 3. These forces
must be accumulated and sent back (right) to the home
processor. Finally the update of the velocities and coordinates is done
on the home processor.

The {\tt communicate_r} routine is given below in full C code:\\
\begin{footnotesize}
\begin{verbatim}
void communicate_r(int nprocs,int pid,rvec vecs[],int start[],int homenr[])
/*
 * nprocs = number of processors
 * pid    = processor id (0..nprocs-1)
 * vecs   = vectors
 * start  = starting index in vecs for each processor
 * homenr = number of home particles for each processor
 */
{
  int i;     /* processor counter                            */
  int shift; /* the amount of processors to communicate with */
  int cur;   /* current processor to send data from          */
  int next;  /* next processor on the ring (using modulo)    */

  cur   = pid;
  shift = nprocs/2;

  for (i=0; (i<shift); i++) {
    next = (cur+1) % nprocs;
    send   (left,  vecs[start[cur]],  homenr[cur]);
    receive(right, vecs[start[next]], homenr[next]);
    cur = next;
  }
}
\end{verbatim}
\end{footnotesize}

The data flow around the ring is visualized in \figref{ring}.
Note that because of the ring topology each processor automatically
gets the proper particles to interact with.
\begin{figure}
\centerline{\includegraphics[width=6cm]{plots/ring}}
\caption{Data flow in a ring of processors.}
\label{fig:ring}
\end{figure}
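
The {\tt send} and {\tt receive} calls above are abstractions for whatever
communication layer is available. As a minimal sketch (an illustration only,
not the actual {\gromacs} communication code; single-precision coordinates
are assumed), the combined send-to-left/receive-from-right step could be
expressed with MPI as follows, where {\tt left} and {\tt right} are the
ranks of the neighboring processors on the ring:
\begin{footnotesize}
\begin{verbatim}
#include <mpi.h>

typedef float rvec[3];

/* Send n_send coordinate vectors to the left neighbor while receiving
 * n_recv vectors from the right neighbor, in one combined call so that
 * the ring does not deadlock. */
void send_receive(int left, int right,
                  rvec send_buf[], int n_send,
                  rvec recv_buf[], int n_recv)
{
  MPI_Sendrecv(send_buf, 3*n_send, MPI_FLOAT, left,  0,
               recv_buf, 3*n_recv, MPI_FLOAT, right, 0,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
\end{verbatim}
\end{footnotesize}
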
\section{Parallel Molecular Dynamics}
In this chapter we describe some details of the \swapindex{parallel}{MD}
algorithm used
in {\gromacs}. This also includes some other information on neighbor
searching and a side excursion to parallel sorting.
Please note the following definitions, which we use throughout this
chapter:\\
{\bf definition:} {\natom}: number of particles, {\nproc}: number of
processors.\\
{\gromacs} employs two different grids: the neighbor searching grid
({\nsgrid}) and the combined charge/potential grid ({\fftgrid}), as will be
described below. To maximize the confusion,
these two grids are mapped onto a grid of processors when {\gromacs} runs
on a parallel computer.

\subsection{Domain decomposition}
Modern parallel computers, such as an IBM SP/2 or a Cray T3E,
consist of relatively small numbers of relatively fast scalar
processors (typically 8 to 256). The communication channels that are
available in hardware on these machines are not directly visible to
the programmer; a software layer (usually \normindex{MPI})
hides this, and makes communication from all processors to all
others possible. In contrast, in the {\gromacs} hardware~\cite{Berendsen95a}
only communication in a ring was available, {\ie} each processor could
communicate with its direct neighbors only.

It seems logical to map the computational box of an MD simulation system
to a 3D grid of
processors ({\eg} $4 \times 4 \times 4$ for a 64 processor system). This
ensures that most
interactions that are local in space can be computed with information from
neighboring processors only. However, this means that there have to be
communication channels in 3 dimensions too, which is not necessarily the
case. Although this may be overcome in software, such a mapping complicates
the MD software as well, without clear performance benefits on most
parallel computers.

Therefore we opt for a simple one-dimensional division scheme
for the computational box. Each processor gets a slab of this box in the
X-dimension.
For the communication between processors this has two main advantages:
\begin{enumerate}
\item   Simplicity of coding. Communication can only be to two neighbors
        (called {\em left} and {\em right} in {\gromacs}).
\item   Communication can usually be done in large chunks, which makes it
        more efficient on most hardware platforms.
\end{enumerate}

Most interactions in molecular dynamics have in principle a
short-range character. Bonds, angles and dihedrals are guaranteed to
have the corresponding particles close in space.

\subsection{Domain decomposition for non-bonded forces}
For large parallel computers, domain decomposition is preferable over
particle decomposition, since it is easier to do load
balancing. Without load balancing the scaling of the code is rather
poor. For this purpose, the computational box is divided into {\nproc}
slabs, where {\nproc} is equal to the number of processors. There are
multiple ways of dividing the box over processors, but since the
{\gromacs} code assumes a ring topology for the processors, it is
logical to cut the system into slabs in just one dimension, the X
dimension. The algorithm for neighbor searching then becomes:
\begin{enumerate}
\item   Make a list of charge group indices sorted on (increasing) X
        coordinate (\figref{parsort}).
        {\bf Note} that care must be taken to parallelize the sorting
        algorithm as well. See \secref{parsort}.
\item   Divide this list into slabs, with each slab having the same number
        of charge groups.
\item   Put the particles corresponding to the local slab on a 3D {\nsgrid}
        as described in \secref{nsgrid}.
\item   Communicate the {\nsgrid} to neighboring processors (not necessarily
        to all processors). The number of neighboring {\nsgrid} cells
        (N$_{gx}$) to communicate is determined by the cut-off length $r_c$
        according to
        \beq
        N_{gx} ~=~ \frac{r_c \nproc}{l_x}
        \eeq
        where $l_x$ is the box length in the slabbing direction
        (a small sketch of this calculation is given after this list).
\item   On each processor compute the neighbor list for all charge groups
        in its slab using the normal grid neighbor-searching.
\end{enumerate}
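
As a small illustration of step 4 (a sketch only; the function name is
hypothetical and the rounding convention is an assumption, the real code
may treat the slab boundary differently), the number of cell layers to
communicate could be computed as:
\begin{footnotesize}
\begin{verbatim}
#include <math.h>

/* Number of neighboring nsgrid cell layers (in x) to communicate for
 * cut-off rc, box length lx in the slab direction and nprocs slabs:
 * N_gx = rc*nprocs/lx, rounded up to be on the safe side. */
int calc_ngx(double rc, double lx, int nprocs)
{
  return (int) ceil(rc*nprocs/lx);
}
\end{verbatim}
\end{footnotesize}
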
\begin{figure}
\centerline{\includegraphics[width=10cm]{plots/parsort}}
\caption[Index in the coordinate array.]{Index in the coordinate
array. The division in slabs is indicated by dashed lines.}
\label{fig:parsort}
\end{figure}

For homogeneous systems, this is close to an optimal load balancing,
without actually doing load balancing. For inhomogeneous systems, such
as membranes or interfaces, the slabs should be perpendicular to the
interface; this way, each processor has ``a little bit of
everything''. The {\gromacs} utility program {\tt editconf} has an
option to rotate a whole computational box.

The following observations are important here:
\begin{itemize}
\item   Particles may diffuse from one slab to the other; therefore each
        processor must hold coordinates for all particles all the time, and
        distribute forces back to all processors as well.
\item   Velocities are kept on the ``home processor'' for each particle,
        where the integration of Newton's equations is done.
\item   Fixed interaction lists (bonds, angles etc.) are each kept
        on a single processor. Since all processors have all
        coordinates, it does not matter where interactions are
        calculated. The division is actually done by the {\gromacs}
        preprocessor {\tt grompp}, and care is taken that, as far as
        possible, every processor gets the same number of bonded
        interactions.
\end{itemize}

In all, this makes for a mixed particle decomposition/domain decomposition
scheme for the parallelization of the MD code. The communication costs are
four times higher than for the simple particle decomposition method
described in \secref{par}
(the whole coordinate and force arrays are communicated across the whole
ring, rather than half the array over half the ring).
However, for large numbers of processors the improved load balancing
compensates for this easily.

\subsection{Parallel \normindex{PPPM}}
A further reason for domain decomposition is the PPPM algorithm. This
algorithm works with a 3D Fast Fourier Transform. It employs a
discrete grid of dimensions (\nx,\ny,\nz), the {\fftgrid}. The
algorithm consists of five steps, each of which has to be
parallelized:
\begin{enumerate}
\item   Spreading charges on the {\fftgrid} to obtain the charge
        distribution $\rho(\ve{r})$.
        This bit involves the following sub-steps (a minimal sketch of
        these sub-steps is given below, after the five steps):
        \begin{enumerate}
        \item[{\bf a.}] put the particle in the box
        \item[{\bf b.}] find the {\fftgrid} cell in which the particle
                        resides
        \item[{\bf c.}] add the charge of the particle times the appropriate
                        weight factor (see \secref{pppm}) to
                        each of the 27 grid points ($3 \times 3 \times 3$).
        \end{enumerate}
        In the parallel case, the {\fftgrid}
        must be filled on each processor with its
        share of the particles, and subsequently the {\fftgrid}s of all
        processors must be summed to find the total charge distribution. It
        may be clear that this induces a large amount of unnecessary work,
        unless we use domain decomposition. If each processor only has
        particles in a certain region of space, it only has to calculate
        the charge distribution for that region of space. Since {\gromacs}
        works with slabs, this means that each processor fills the
        {\fftgrid} cells corresponding to its slab in space, and addition
        of {\fftgrid}s need only be done for neighboring slabs.\\
        To be more precise, the slab $x$ range for processor $i$ is defined
        as:
        \beq
        i\, \frac{l_x}{M} \le x <\, (i+1)\frac{l_x}{M}
        \eeq
        Particles in this $x$ coordinate range will add to the charge
        distribution on the following range of {\fftgrid} slabs in the $x$
        direction:
        \beq
        {\rm trunc}\left(i\,\frac{l_x \nx}{M}\right)-1 \le i_x \le {\rm trunc}\left((i+1)\,\frac{l_x \nx}{M}\right)+2
        \eeq
        where trunc indicates the truncation of a real number to the largest
        integer smaller than or equal to that real number.

\item   Doing the Fourier transform of the charge distribution $\rho(\ve{r})$
        in parallel to obtain $\hat{\rho}(\ve{k})$. This is done using
        the FFTW library (see \href{http://www.fftw.org}{www.fftw.org}),
        which employs the MPI library for message passing programs
        (note that there are also \swapindex{shared}{memory} versions
        of the FFTW code).\\
        This FFT algorithm actually uses slabs as well (good
        thinking!). Each processor does 2D FFTs on its slab, and then
        the whole {\fftgrid} is transposed {\em in place}
        ({\ie} without using extra memory). This means that after the
        FFT the X and Y components are swapped. To complete the FFT,
        this swapping should in principle be undone (by transposing
        back). Happily the FFTW code has an option to omit this,
        which we use in the next step.
\item   Convolute $\hat{\rho}(\ve{k})$ with the Fourier transform of the
        charge spread function $\hat{g}(\ve{k})$ (which we have tabulated
        before) to obtain the potential $\hat{\phi}(\ve{k})$.
        As an optimization, we store $\hat{g}(\ve{k})$ in transposed form
        as well, matching the transposed form of $\hat{\rho}(\ve{k})$
        which we get from the FFTW routine. After this step we have the
        potential $\hat{\phi}(\ve{k})$ in Fourier space, but still on the
        transposed {\fftgrid}.
\item   Do an inverse transform of $\hat{\phi}(\ve{k})$ to obtain
        ${\phi}(\ve{r})$. Since the algorithm must do a transpose of the
        data, this step actually yields the wanted result: the un-transposed
        potential in real space.
\item   Interpolate the potential ${\phi}(\ve{r})$ in real space at the
        particle positions to obtain forces and energy. For this bit the
        same considerations about parallelism hold as for the charge
        spreading. However, in this case more neighboring grid cells are
        needed, implying that we need the following set of {\fftgrid} slabs
        in the $x$ direction:
        \beq
        {\rm trunc}\left(i\,\frac{l_x \nx}{M}\right)-3 \le i_x \le {\rm trunc}\left((i+1)\,\frac{l_x \nx}{M}\right)+4
        \eeq

\end{enumerate}
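
As referred to in step 1 above, sub-steps a--c of the charge spreading can
be sketched as follows. This is an illustration only: the function and
variable names are hypothetical, a cubic box of edge $l$ is assumed, and
the weight function used in the sketch merely stands for the tabulated
weights described in \secref{pppm}:
\begin{footnotesize}
\begin{verbatim}
typedef float rvec[3];

extern float spread_weight(int dx, int dy, int dz); /* placeholder */

/* Spread one charge q at position x onto an nx*ny*nz fftgrid (stored
 * as grid[ix][iy][iz]) for a cubic box of edge l, using the 3x3x3
 * stencil of step 1c. */
void spread_charge(float ***grid, int nx, int ny, int nz,
                   float l, rvec x, float q)
{
  int d, ix, iy, iz, jx, jy, jz;
  int cell[3], n[3];

  n[0] = nx; n[1] = ny; n[2] = nz;
  for (d=0; (d<3); d++) {
    /* a. put the particle in the box (periodic wrap) */
    while (x[d] <  0) x[d] += l;
    while (x[d] >= l) x[d] -= l;
    /* b. find the fftgrid cell in which the particle resides */
    cell[d] = (int) (x[d]*n[d]/l);
  }
  /* c. add q times the weight to each of the 27 surrounding points */
  for (ix=-1; (ix<=1); ix++)
    for (iy=-1; (iy<=1); iy++)
      for (iz=-1; (iz<=1); iz++) {
        jx = (cell[0]+ix+nx) % nx;
        jy = (cell[1]+iy+ny) % ny;
        jz = (cell[2]+iz+nz) % nz;
        grid[jx][jy][jz] += q*spread_weight(ix, iy, iz);
      }
}
\end{verbatim}
\end{footnotesize}
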
The algorithm as sketched above requires communication for spreading
the charges, for the forward and backward FFTs, and for interpolating
the forces. The {\gromacs} bits of the program use only left and
right communication, {\ie} using two communication channels. The FFTW
routines actually use other forms of communication as well, and these
routines are coded with MPI routines for message passing. This implies
that {\gromacs} can only perform the PPPM algorithm on parallel
computers that support MPI. However, most
\swapindex{shared}{memory} computers, such as the SGI Origin, also
support MPI using the shared memory for communication.

\subsection{Parallel sorting}
\label{sec:parsort}
For the domain decomposition bit of {\gromacs} it is necessary to sort the
coordinates (or rather the index to coordinates) every time a neighbor list
is made.
If we use brute force, and sort all coordinates on each processor (which is
technically possible since we have all the coordinates), then this sorting
procedure will take a constant wall clock time, proportional to
{\natom}$\log_2${\natom},
regardless of the number of processors. We can however do a little
better, if we assume that particles diffuse only slowly.
A parallel sorting algorithm can be conceived as follows: \\
At the first step of the simulation
\begin{enumerate}
\item   Do a full sort of all indices using {\eg} the Quicksort algorithm
        that is built into the standard C library (see the sketch after
        these lists)
\item   Divide the sorted array into slabs (as described above, see
        \figref{parsort}).
\end{enumerate}
At subsequent steps of the simulation:
\begin{enumerate}
\item   Send the indices for each processor to the preceding processor (if
        not processor 0) and to the next processor (if not {\nproc}-1). The
        communication associated with this operation is proportional to
        2{\natom}/{\nproc}.
\item   Sort the combined indices of the three (or two) processors. Note
        that the CPU time associated with sorting is now
        (3{\natom}/{\nproc})$\log_2$(3{\natom}/{\nproc}).
\item   On each processor, the indices belonging to its slab can be
        determined from the order of the array (\figref{parsort}).
\end{enumerate}
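
The initial full sort of step 1 can be sketched in C as follows (an
illustration only; the function names are hypothetical, and a file-scope
pointer is used to pass the coordinates to the comparison function
required by {\tt qsort}):
\begin{footnotesize}
\begin{verbatim}
#include <stdlib.h>

typedef float rvec[3];

static const rvec *sort_x; /* coordinates used by the comparator */

static int compare_x(const void *a, const void *b)
{
  float xa = sort_x[*(const int *) a][0];
  float xb = sort_x[*(const int *) b][0];

  return (xa < xb) ? -1 : ((xa > xb) ? 1 : 0);
}

/* Sort the natoms particle indices in index[] on increasing
 * x coordinate, using the standard C library quicksort. */
void sort_indices(int natoms, const rvec x[], int index[])
{
  sort_x = x;
  qsort(index, natoms, sizeof(index[0]), compare_x);
}
\end{verbatim}
\end{footnotesize}
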
%\section{A Worked Example}
%Suppose our box size is 4.0 nm and the cut-off is 0.8 nm. For neighborsearching
%we use {\dgrid} = 2, such that there 10 {\nsgrid} cells.

% LocalWords:  Parallelization parallelization parallelize Fortran vectorizing
% LocalWords:  parallelized vectorize Amdahl's MPI ij ji decomp mdpar nprocs gx
% LocalWords:  pid rvec vecs homenr dihedrals indices parsort nsgrid editconf
% LocalWords:  preprocessor grompp PPPM pppm trunc FFTW FFT FFTs Convolute un
% LocalWords:  SGI Quicksort formulae