% This file is part of the GROMACS molecular simulation package.
%
% Copyright (c) 2013, by the GROMACS development team, led by
% Mark Abraham, David van der Spoel, Berk Hess, and Erik Lindahl,
% and including many others, as listed in the AUTHORS file in the
% top-level source directory and at http://www.gromacs.org.
%
% GROMACS is free software; you can redistribute it and/or
% modify it under the terms of the GNU Lesser General Public License
% as published by the Free Software Foundation; either version 2.1
% of the License, or (at your option) any later version.
%
% GROMACS is distributed in the hope that it will be useful,
% but WITHOUT ANY WARRANTY; without even the implied warranty of
% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
% Lesser General Public License for more details.
%
% You should have received a copy of the GNU Lesser General Public
% License along with GROMACS; if not, see
% http://www.gnu.org/licenses, or write to the Free Software Foundation,
% Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
%
% If you want to redistribute modifications to GROMACS, please
% consider that scientific software is very special. Version
% control is crucial - bugs must be traceable. We will be happy to
% consider code for inclusion in the official distribution, but
% derived work must not be called official GROMACS. Details are found
% in the README & COPYING files - if they are missing, get the
% official version at http://www.gromacs.org.
%
% To help us fund GROMACS development, we humbly ask that you cite
% the research papers on the package. Check out http://www.gromacs.org.

\section{Parallelization}
\label{sec:par}

\newcommand{\abs}[1]{\mid \! {#1} \! \mid}

The purpose of this section is to discuss the
\normindex{parallelization} of the
principal MD algorithm and not to describe the algorithms that are in
practical use for molecular systems with their complex variety of atoms
and terms in the force field descriptions. We shall therefore consider
as an example a simple system consisting only of a single type of atom
with a simple form of the interaction potential. The emphasis will be
on the special problems that arise when the algorithm is implemented on
a parallel computer.

The simple model problem already contains the bottleneck of all MD
simulations: the computationally intensive evaluation of the
{\em non-bonded} forces between pairs of atoms, based on the distance
between particles. Complex molecular systems will in addition
involve many different kinds of {\em bonded} forces between designated
atoms. Such interactions add to the complexity of the algorithm but do
not modify the basic considerations concerning parallelization.

\subsection{Methods of parallelization}
There are a number of methods to parallelize the MD algorithm, each
with its own advantages and disadvantages. Which method to
choose depends on the hardware and compilers available.
We list them here:
\begin{enumerate}
\item[1]        {\em \normindex{Message Passing}.}\\
        In this method, which is more or less the traditional
        way of parallel programming, all the parallelism is
        explicitly programmed by the user. The disadvantage
        is that it takes extra code and effort; the advantage
        is that the programmer keeps full control over the data
        flow and can do optimizations a compiler could not come
        up with.

        The implementation is typically done by calling a set of
        library routines to send and receive data to and from
        other processors. Almost all hardware vendors support
        this way of parallelism in their C and Fortran compilers.

\item[2]        {\em \swapindex{Data}{Parallel}.}\\
        This method lets the user define arrays on which to
        operate in parallel. Programming this way is much
        like vectorizing: recurrence is not parallelized
        ({\eg} {\tt for(i=1; (i<MAX); i++) a[i] = a[i-1] + 1;}
        neither vectorizes nor parallelizes, because for
        every i the result from the previous step is needed);
        see the sketch after this list.

        The advantage of data parallelism is that it is
        easier for the user; the compiler takes care of the
        parallelism. The disadvantage is that it is supported
        by only a small (though growing) number of hardware vendors,
        and that it is much harder to maintain a program that has to
        run on both parallel and sequential machines, because
        the only standard language that supports it is Fortran-90,
        which is not available on many platforms.
\end{enumerate}
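
As a minimal illustration of the difference, consider the following two C
loops (the function name, the arrays {\tt a} and {\tt b} and the constant
{\tt MAX} are placeholders for this example only, not part of {\gromacs}).
The first loop has independent iterations; the second carries a dependence
between iterations and therefore can neither be vectorized nor parallelized:
\begin{footnotesize}
\begin{verbatim}
#define MAX 1024

void loop_example(float a[], const float b[])
{
  int i;

  /* Independent iterations: a data parallel compiler can distribute
   * these over processors (or vector lanes) without further analysis. */
  for (i=0; (i<MAX); i++)
    a[i] = b[i] + 1;

  /* Loop-carried dependence: iteration i needs the result of
   * iteration i-1, so this loop must run sequentially. */
  for (i=1; (i<MAX); i++)
    a[i] = a[i-1] + 1;
}
\end{verbatim}
\end{footnotesize}
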
Both methods allow for the MD algorithm to be implemented without much
trouble. Message passing MD algorithms have been published
since the mid-1980s (\cite{Fincham87}, \cite{Raine89})
and development is still continuing.
Data parallel programming is newer,
but starting from a well-vectorized program it is not hard to do.

Our implementation of MD is a message passing one, the reason for which
is partly historical: the project to develop a parallel MD program started
when Fortran-90 was still in the making, and no compilers were
expected to be available.
At present, after having done some experiments with data parallel
programming on a Connection Machine (CM-5), we still believe that message
passing is the way to go, because of portability to other hardware,
the poor performance of the code produced by the compilers,
and because this way of programming
has the same drawback as vectorization: the part of the program that is
not vectorized or parallelized determines the run time of the program
(\normindex{Amdahl's law}).

The approach we took to parallelism was a minimalist one: use as few
non-standard elements in the software as possible, and use the
simplest \swapindex{processor}{topology} that does the job. We therefore
decided to use a standard language (ANSI C) with as few non-standard
routines as possible. Only five of the communication routines we use are
non-standard. It is therefore very easy to port our code to other machines.

For an $O(N^2)$ problem like MD, one of the best schemes for the
inter-processor connections is a ring, so our software demands that a
ring is present in the inter-processor connections. A ring can essentially
always be mapped onto another network like a \normindex{hypercube}, a
bus interface (Ethernet, {\eg} using
\seeindex{Message Passing Interface}{MPI} \normindex{MPI}) or
a \normindex{tree}
(CM-5). Some hardware vendors have very luxurious connection schemes
that connect every processor to every other processor, but we do not
really need this and so do not use it, even though it might come in handy
at times. The advantage of this simple scheme is that {\gromacs}
performs extremely well even on inexpensive workstation clusters.

When using a message passing scheme one has to divide the particles
over processors, which can be done in two ways:
\begin{itemize}
\item   {\em \swapindex{Space}{Decomposition}.}\\
        An element of space is allocated to each processor; when dividing
        a cubic box with edge $b$ over $P$ processors this can be done
        by giving
        each processor a slab of length $b/P$. This method
        has the advantage
        that each processor has about the same number of interactions
        to calculate (at least when the simulated system has a homogeneous
        density, like a liquid or a gas). The disadvantage is that a lot of
        bookkeeping is necessary for particles that move over processor
        boundaries. In more complex systems, such as macromolecules, there
        are also 3- and 4-atom interactions that make
        the bookkeeping too complicated for our taste.
\item   {\em \swapindex{Particle}{Decomposition}.}\\
        Every processor is allocated a number of particles. When
        dividing $N$ particles over $P$ processors each processor will
        get $N/P$ particles. The implementation of this method
        is described in the next section; a minimal sketch of such a
        division is given after this list.
\end{itemize}
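
The following is a minimal sketch of a particle decomposition (an
illustration under the assumption of a simple contiguous division; the
function name is hypothetical and the real {\gromacs} setup is done
elsewhere). It fills the {\tt start[]} and {\tt homenr[]} arrays that are
also used by the communication routine shown later in this section:
\begin{footnotesize}
\begin{verbatim}
/* Divide N particles over nprocs processors. After the call,
 * processor i owns the homenr[i] particles starting at index start[i].
 * Any remainder of N/nprocs is spread over the first processors. */
void divide_particles(int N, int nprocs, int start[], int homenr[])
{
  int i;

  for (i=0; (i<nprocs); i++) {
    homenr[i] = N/nprocs + ((i < N % nprocs) ? 1 : 0);
    start[i]  = (i == 0) ? 0 : start[i-1] + homenr[i-1];
  }
}
\end{verbatim}
\end{footnotesize}
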
\begin{figure}[p]
\centerline{\includegraphics[width=10cm]{plots/int-mat}}
\caption[The interaction matrix.]{The interaction matrix (left) and
the same using action$~=~-$reaction (right).}
\label{fig:int_mat}
\end{figure}

\begin{table}[p]
\centerline{
\newcolumntype{s}[1]{D{/}{/}{#1}}
\begin{tabular}{|l|s4|s4|s4|s4|}
\dline
            & \mcc{1}{i mod 2 = 0}
            & \mcc{1}{i mod 2 = 0}
            & \mcc{1}{i mod 2 = 1}
            & \mcc{1}{i mod 2 = 1} \\
            & \mcc{1}{i $<$ N/2}
            & \mcc{1}{i $\ge$ N/2}
            & \mcc{1}{i $<$ N/2}
            & \mcc{1}{i $\ge$ N/2} \\
\hline
N mod 2 = 1 & N/2 & N/2     & N/2     & N/2 \\
N mod 4 = 2 & N/2 & N/2     & N/2 - 1 & N/2 - 1 \\
N mod 4 = 0 & N/2 & N/2 - 1 & N/2 - 1 & N/2 \\
\dline
\end{tabular}
}
\caption[The number of interactions between particles.]{The number of
interactions between particles. The number of $j$ particles per $i$
particle is a function of the total number of particles $N$ and the
particle number $i$. Note that here the $/$ operator is used for
integer division, {\ie} truncating the remainder.}
\label{tab:decomp}
\end{table}

\begin{figure}[p]
\centerline{\includegraphics[width=\linewidth]{plots/decomp}}
\caption[Interaction matrices for different $N$.]{Interaction matrices for
different $N$. The number of $j$-particles an $i$-particle interacts with
depends on the {\em total} number of particles and on the
{\em particle number}.}
\label{fig:decomp}
\end{figure}

\subsection{MD on a ring of processors}
When a neighbor list is not used the MD problem is in principle an $O(N^2)$
problem, as each particle can interact
with every other. This can be simplified using Newton's third law
\beq
F_{ij} ~=~ -F_{ji}
\label{eqn:Newt3}
\eeq
This implies that there is half a matrix of interactions (without the
diagonal; a particle does not interact with itself) to consider
(\figref{int_mat}).
When we reflect the upper right triangle of interactions onto the lower
left triangle of the matrix, we still cover all possible interactions,
but now every row in the matrix has almost the same number of points
or possible interactions. We can now assign a (preferably equal)
number of rows to each processor to compute the forces and at the same
time a number of particles to do the update on, the {\em home}
particles. The number of interactions per particle depends on the
{\em total number} $N$ of particles (see \figref{decomp}) and on the
{\em particle number} $i$. The exact formulae are given in
\tabref{decomp}.
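
For reference, the entries of \tabref{decomp} can be written as a small C
helper (a hypothetical function, not taken from the {\gromacs} sources; the
{\tt /} and {\tt \%} operators are integer division and remainder):
\begin{footnotesize}
\begin{verbatim}
/* Number of j particles that particle i (0 <= i < N) interacts with
 * after folding the interaction matrix as in the table above. */
int n_interactions(int N, int i)
{
  if (N % 2 == 1)
    return N/2;                              /* N odd       */
  if (N % 4 == 2)
    return (i % 2 == 0) ? N/2 : N/2 - 1;     /* N mod 4 = 2 */
  if (i % 2 == 0)                            /* N mod 4 = 0 */
    return (i < N/2) ? N/2 : N/2 - 1;
  else
    return (i < N/2) ? N/2 - 1 : N/2;
}
\end{verbatim}
\end{footnotesize}
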
A flow chart of the algorithm is given in \figref{mdpar}.
\begin{figure}
\centerline{\includegraphics[height=18cm]{plots/mdpar}}
\caption[The Parallel MD algorithm.]{The Parallel MD algorithm. If
the steps marked * are left out we have the sequential algorithm
again.}
\label{fig:mdpar}
\end{figure}
\vfill

It is the same as the sequential algorithm, except for two
communication steps. After the particles have been reset in the box,
each processor sends its coordinates onward (left) and then starts computing
the forces. After this step each processor holds the {\em partial
forces} for the available particles, {\eg} processor 0 holds forces
acting on home particles from processors 0, 1, 2 and 3. These forces
must be accumulated and sent back (right) to the home
processor. Finally the update of the velocities and coordinates is done
on the home processor.

The {\tt communicate_r} routine is given below in full C code:\\
\begin{footnotesize}
\begin{verbatim}
void communicate_r(int nprocs,int pid,rvec vecs[],int start[],int homenr[])
/*
 * nprocs = number of processors
 * pid    = processor id (0..nprocs-1)
 * vecs   = vectors
 * start  = starting index in vecs for each processor
 * homenr = number of home particles for each processor
 */
{
  int i;     /* processor counter                            */
  int shift; /* the amount of processors to communicate with */
  int cur;   /* current processor to send data from          */
  int next;  /* next processor on the ring (using modulo)    */

  cur   = pid;
  shift = nprocs/2;

  for (i=0; (i<shift); i++) {
    next = (cur+1) % nprocs;
    send   (left,  vecs[start[cur]],  homenr[cur]);
    receive(right, vecs[start[next]], homenr[next]);
    cur = next;
  }
}
\end{verbatim}
\end{footnotesize}

The data flow around the ring is visualized in \figref{ring}.
Note that because of the ring topology each processor automatically
gets the proper particles to interact with.
\begin{figure}
\centerline{\includegraphics[width=6cm]{plots/ring}}
\caption{Data flow in a ring of processors.}
\label{fig:ring}
\end{figure}
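
The {\tt send} and {\tt receive} calls above are abstractions for whatever
communication layer is available. As a minimal sketch (an illustration only,
not the actual {\gromacs} communication code; single-precision coordinates
are assumed), the combined send-to-left/receive-from-right step could be
expressed with MPI as follows, where {\tt left} and {\tt right} are the
ranks of the neighboring processors on the ring:
\begin{footnotesize}
\begin{verbatim}
#include <mpi.h>

typedef float rvec[3];

/* Send n_send coordinate vectors to the left neighbor while receiving
 * n_recv vectors from the right neighbor, in one combined call so that
 * the ring does not deadlock. */
void send_receive(int left, int right,
                  rvec send_buf[], int n_send,
                  rvec recv_buf[], int n_recv)
{
  MPI_Sendrecv(send_buf, 3*n_send, MPI_FLOAT, left,  0,
               recv_buf, 3*n_recv, MPI_FLOAT, right, 0,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
\end{verbatim}
\end{footnotesize}
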
\section{Parallel Molecular Dynamics}
In this chapter we describe some details of the \swapindex{parallel}{MD}
algorithm used
in {\gromacs}. This also includes some other information on neighbor
searching and a side excursion to parallel sorting.
Please note the following definitions, which we use throughout this
chapter:\\
{\bf definition:} {\natom}: number of particles, {\nproc}: number of
processors.\\
{\gromacs} employs two different grids: the neighbor searching grid
({\nsgrid}) and the combined charge/potential grid ({\fftgrid}), as will be
described below. To maximize the confusion,
these two grids are mapped onto a grid of processors when {\gromacs} runs
on a parallel computer.

\subsection{Domain decomposition}
Modern parallel computers, such as an IBM SP/2 or a Cray T3E,
consist of relatively small numbers of relatively fast scalar
processors (typically 8 to 256). The communication channels that are
available in hardware on these machines are not directly visible to
the programmer; a software layer (usually \normindex{MPI})
hides this, and makes communication from all processors to all
others possible. In contrast, in the {\gromacs} hardware~\cite{Berendsen95a}
only communication in a ring was available, {\ie} each processor could
communicate with its direct neighbors only.

It seems logical to map the computational box of an MD simulation system
to a 3D grid of
processors ({\eg} $4 \times 4 \times 4$ for a 64 processor system). This
ensures that most
interactions that are local in space can be computed with information from
neighboring processors only. However, this means that there have to be
communication channels in 3 dimensions too, which is not necessarily the
case. Although this may be overcome in software, such a mapping complicates
the MD software as well, without clear performance benefits on most
parallel computers.

Therefore we opt for a simple one-dimensional division scheme
for the computational box. Each processor gets a slab of this box in the
X-dimension.
For the communication between processors this has two main advantages:
\begin{enumerate}
\item   Simplicity of coding. Communication can only be to two neighbors
        (called {\em left} and {\em right} in {\gromacs}).
\item   Communication can usually be done in large chunks, which makes it
        more efficient on most hardware platforms.
\end{enumerate}

Most interactions in molecular dynamics have in principle a
short-range character. Bonds, angles and dihedrals are guaranteed to
have the corresponding particles close in space.

\subsection{Domain decomposition for non-bonded forces}
For large parallel computers, domain decomposition is preferable over
particle decomposition, since it is easier to do load
balancing. Without load balancing the scaling of the code is rather
poor. For this purpose, the computational box is divided into {\nproc}
slabs, where {\nproc} is equal to the number of processors. There are
multiple ways of dividing the box over processors, but since the
{\gromacs} code assumes a ring topology for the processors, it is
logical to cut the system into slabs in just one dimension, the X
dimension. The algorithm for neighbor searching then becomes:
\begin{enumerate}
\item   Make a list of charge group indices sorted on (increasing) X
        coordinate (\figref{parsort}).
        {\bf Note} that care must be taken to parallelize the sorting
        algorithm as well. See \secref{parsort}.
\item   Divide this list into slabs, with each slab having the same number
        of charge groups.
\item   Put the particles corresponding to the local slab on a 3D {\nsgrid}
        as described in \secref{nsgrid}.
\item   Communicate the {\nsgrid} to neighboring processors (not necessarily
        to all processors). The number of neighboring {\nsgrid} cells
        (N$_{gx}$) to communicate is determined by the cut-off length $r_c$
        according to
        \beq
        N_{gx} ~=~ \frac{r_c \nproc}{l_x}
        \eeq
        where $l_x$ is the box length in the slabbing direction
        (a small sketch of this calculation is given after this list).
\item   On each processor compute the neighbor list for all charge groups
        in its slab using the normal grid neighbor-searching.
\end{enumerate}
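
As a small illustration of step 4 (a sketch only; the function name is
hypothetical and the rounding convention is an assumption, the real code
may treat the slab boundary differently), the number of cell layers to
communicate could be computed as:
\begin{footnotesize}
\begin{verbatim}
#include <math.h>

/* Number of neighboring nsgrid cell layers (in x) to communicate for
 * cut-off rc, box length lx in the slab direction and nprocs slabs:
 * N_gx = rc*nprocs/lx, rounded up to be on the safe side. */
int calc_ngx(double rc, double lx, int nprocs)
{
  return (int) ceil(rc*nprocs/lx);
}
\end{verbatim}
\end{footnotesize}
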
\begin{figure}
\centerline{\includegraphics[width=10cm]{plots/parsort}}
\caption[Index in the coordinate array.]{Index in the coordinate
array. The division in slabs is indicated by dashed lines.}
\label{fig:parsort}
\end{figure}

For homogeneous systems, this is close to an optimal load balancing,
without actually doing load balancing. For inhomogeneous systems, such
as membranes or interfaces, the slabs should be perpendicular to the
interface; this way, each processor has ``a little bit of
everything''. The {\gromacs} utility program {\tt editconf} has an
option to rotate a whole computational box.

The following observations are important here:
\begin{itemize}
\item   Particles may diffuse from one slab to the other; therefore each
        processor must hold coordinates for all particles all the time, and
        distribute forces back to all processors as well.
\item   Velocities are kept on the ``home processor'' for each particle,
        where the integration of Newton's equations is done.
\item   Fixed interaction lists (bonds, angles etc.) are each kept
        on a single processor. Since all processors have all
        coordinates, it does not matter where interactions are
        calculated. The division is actually done by the {\gromacs}
        preprocessor {\tt grompp}, and care is taken that, as far as
        possible, every processor gets the same number of bonded
        interactions.
\end{itemize}

In all, this makes for a mixed particle decomposition/domain decomposition
scheme for the parallelization of the MD code. The communication costs are
four times higher than for the simple particle decomposition method
described in \secref{par}
(the whole coordinate and force arrays are communicated across the whole
ring, rather than half the array over half the ring).
However, for large numbers of processors the improved load balancing
compensates for this easily.

\subsection{Parallel \normindex{PPPM}}
A further reason for domain decomposition is the PPPM algorithm. This
algorithm works with a 3D Fast Fourier Transform. It employs a
discrete grid of dimensions (\nx,\ny,\nz), the {\fftgrid}. The
algorithm consists of five steps, each of which has to be
parallelized:
\begin{enumerate}
\item   Spreading charges on the {\fftgrid} to obtain the charge
        distribution $\rho(\ve{r})$.
        This bit involves the following sub-steps (a minimal sketch of
        these sub-steps is given below, after the five steps):
        \begin{enumerate}
        \item[{\bf a.}] put the particle in the box
        \item[{\bf b.}] find the {\fftgrid} cell in which the particle
                        resides
        \item[{\bf c.}] add the charge of the particle times the appropriate
                        weight factor (see \secref{pppm}) to
                        each of the 27 grid points ($3 \times 3 \times 3$).
        \end{enumerate}
        In the parallel case, the {\fftgrid}
        must be filled on each processor with its
        share of the particles, and subsequently the {\fftgrid}s of all
        processors must be summed to find the total charge distribution. It
        may be clear that this induces a large amount of unnecessary work,
        unless we use domain decomposition. If each processor only has
        particles in a certain region of space, it only has to calculate
        the charge distribution for that region of space. Since {\gromacs}
        works with slabs, this means that each processor fills the
        {\fftgrid} cells corresponding to its slab in space, and addition
        of {\fftgrid}s need only be done for neighboring slabs.\\
        To be more precise, the slab $x$ range for processor $i$ is defined
        as:
        \beq
        i\, \frac{l_x}{M} \le x <\, (i+1)\frac{l_x}{M}
        \eeq
        Particles in this $x$ coordinate range will add to the charge
        distribution on the following range of {\fftgrid} slabs in the $x$
        direction:
        \beq
        {\rm trunc}\left(i\,\frac{l_x \nx}{M}\right)-1 \le i_x \le {\rm trunc}\left((i+1)\,\frac{l_x \nx}{M}\right)+2
        \eeq
        where trunc indicates the truncation of a real number to the largest
        integer smaller than or equal to that real number.

\item   Doing the Fourier transform of the charge distribution $\rho(\ve{r})$
        in parallel to obtain $\hat{\rho}(\ve{k})$. This is done using
        the FFTW library (see \href{http://www.fftw.org}{www.fftw.org}),
        which employs the MPI library for message passing programs
        (note that there are also \swapindex{shared}{memory} versions
        of the FFTW code).\\
        This FFT algorithm actually uses slabs as well (good
        thinking!). Each processor does 2D FFTs on its slab, and then
        the whole {\fftgrid} is transposed {\em in place}
        ({\ie} without using extra memory). This means that after the
        FFT the X and Y components are swapped. To complete the FFT,
        this swapping should in principle be undone (by transposing
        back). Happily the FFTW code has an option to omit this,
        which we use in the next step.
\item   Convolute $\hat{\rho}(\ve{k})$ with the Fourier transform of the
        charge spread function $\hat{g}(\ve{k})$ (which we have tabulated
        before) to obtain the potential $\hat{\phi}(\ve{k})$.
        As an optimization, we store $\hat{g}(\ve{k})$ in transposed form
        as well, matching the transposed form of $\hat{\rho}(\ve{k})$
        which we get from the FFTW routine. After this step we have the
        potential $\hat{\phi}(\ve{k})$ in Fourier space, but still on the
        transposed {\fftgrid}.
\item   Do an inverse transform of $\hat{\phi}(\ve{k})$ to obtain
        ${\phi}(\ve{r})$. Since the algorithm must do a transpose of the
        data, this step actually yields the wanted result: the un-transposed
        potential in real space.
\item   Interpolate the potential ${\phi}(\ve{r})$ in real space at the
        particle positions to obtain forces and energy. For this bit the
        same considerations about parallelism hold as for the charge
        spreading. However, in this case more neighboring grid cells are
        needed, implying that we need the following set of {\fftgrid} slabs
        in the $x$ direction:
        \beq
        {\rm trunc}\left(i\,\frac{l_x \nx}{M}\right)-3 \le i_x \le {\rm trunc}\left((i+1)\,\frac{l_x \nx}{M}\right)+4
        \eeq

\end{enumerate}
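
As referred to in step 1 above, sub-steps a--c of the charge spreading can
be sketched as follows. This is an illustration only: the function and
variable names are hypothetical, a cubic box of edge $l$ is assumed, and
the weight function used in the sketch merely stands for the tabulated
weights described in \secref{pppm}:
\begin{footnotesize}
\begin{verbatim}
typedef float rvec[3];

extern float spread_weight(int dx, int dy, int dz); /* placeholder */

/* Spread one charge q at position x onto an nx*ny*nz fftgrid (stored
 * as grid[ix][iy][iz]) for a cubic box of edge l, using the 3x3x3
 * stencil of step 1c. */
void spread_charge(float ***grid, int nx, int ny, int nz,
                   float l, rvec x, float q)
{
  int d, ix, iy, iz, jx, jy, jz;
  int cell[3], n[3];

  n[0] = nx; n[1] = ny; n[2] = nz;
  for (d=0; (d<3); d++) {
    /* a. put the particle in the box (periodic wrap) */
    while (x[d] <  0) x[d] += l;
    while (x[d] >= l) x[d] -= l;
    /* b. find the fftgrid cell in which the particle resides */
    cell[d] = (int) (x[d]*n[d]/l);
  }
  /* c. add q times the weight to each of the 27 surrounding points */
  for (ix=-1; (ix<=1); ix++)
    for (iy=-1; (iy<=1); iy++)
      for (iz=-1; (iz<=1); iz++) {
        jx = (cell[0]+ix+nx) % nx;
        jy = (cell[1]+iy+ny) % ny;
        jz = (cell[2]+iz+nz) % nz;
        grid[jx][jy][jz] += q*spread_weight(ix, iy, iz);
      }
}
\end{verbatim}
\end{footnotesize}
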
The algorithm as sketched above requires communication for spreading
the charges, for the forward and backward FFTs, and for interpolating
the forces. The {\gromacs} bits of the program use only left and
right communication, {\ie} using two communication channels. The FFTW
routines actually use other forms of communication as well, and these
routines are coded with MPI routines for message passing. This implies
that {\gromacs} can only perform the PPPM algorithm on parallel
computers that support MPI. However, most
\swapindex{shared}{memory} computers, such as the SGI Origin, also
support MPI using the shared memory for communication.

\subsection{Parallel sorting}
\label{sec:parsort}
For the domain decomposition bit of {\gromacs} it is necessary to sort the
coordinates (or rather the index to coordinates) every time a neighbor list
is made.
If we use brute force, and sort all coordinates on each processor (which is
technically possible since we have all the coordinates), then this sorting
procedure will take a constant wall clock time, proportional to
{\natom}$\log_2${\natom},
regardless of the number of processors. We can however do a little
better, if we assume that particles diffuse only slowly.
A parallel sorting algorithm can be conceived as follows: \\
At the first step of the simulation
\begin{enumerate}
\item   Do a full sort of all indices using {\eg} the Quicksort algorithm
        that is built into the standard C library (see the sketch after
        these lists)
\item   Divide the sorted array into slabs (as described above, see
        \figref{parsort}).
\end{enumerate}
At subsequent steps of the simulation:
\begin{enumerate}
\item   Send the indices for each processor to the preceding processor (if
        not processor 0) and to the next processor (if not {\nproc}-1). The
        communication associated with this operation is proportional to
        2{\natom}/{\nproc}.
\item   Sort the combined indices of the three (or two) processors. Note
        that the CPU time associated with sorting is now
        (3{\natom}/{\nproc})$\log_2$(3{\natom}/{\nproc}).
\item   On each processor, the indices belonging to its slab can be
        determined from the order of the array (\figref{parsort}).
\end{enumerate}
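
The initial full sort of step 1 can be sketched in C as follows (an
illustration only; the function names are hypothetical, and a file-scope
pointer is used to pass the coordinates to the comparison function
required by {\tt qsort}):
\begin{footnotesize}
\begin{verbatim}
#include <stdlib.h>

typedef float rvec[3];

static const rvec *sort_x; /* coordinates used by the comparator */

static int compare_x(const void *a, const void *b)
{
  float xa = sort_x[*(const int *) a][0];
  float xb = sort_x[*(const int *) b][0];

  return (xa < xb) ? -1 : ((xa > xb) ? 1 : 0);
}

/* Sort the natoms particle indices in index[] on increasing
 * x coordinate, using the standard C library quicksort. */
void sort_indices(int natoms, const rvec x[], int index[])
{
  sort_x = x;
  qsort(index, natoms, sizeof(index[0]), compare_x);
}
\end{verbatim}
\end{footnotesize}
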
%\section{A Worked Example}
%Suppose our box size is 4.0 nm and the cut-off is 0.8 nm. For neighborsearching
%we use {\dgrid} = 2, such that there 10 {\nsgrid} cells.

% LocalWords:  Parallelization parallelization parallelize Fortran vectorizing
% LocalWords:  parallelized vectorize Amdahl's MPI ij ji decomp mdpar nprocs gx
% LocalWords:  pid rvec vecs homenr dihedrals indices parsort nsgrid editconf
% LocalWords:  preprocessor grompp PPPM pppm trunc FFTW FFT FFTs Convolute un
% LocalWords:  SGI Quicksort formulae