docs/manual/implement.tex

   1 %
   2 % This file is part of the GROMACS molecular simulation package.
   3 %
   4 % Copyright (c) 2013,2014,2015, by the GROMACS development team, led by
   5 % Mark Abraham, David van der Spoel, Berk Hess, and Erik Lindahl,
   6 % and including many others, as listed in the AUTHORS file in the
   7 % top-level source directory and at http://www.gromacs.org.
   8 %
   9 % GROMACS is free software; you can redistribute it and/or
  10 % modify it under the terms of the GNU Lesser General Public License
  11 % as published by the Free Software Foundation; either version 2.1
  12 % of the License, or (at your option) any later version.
  13 %
  14 % GROMACS is distributed in the hope that it will be useful,
  15 % but WITHOUT ANY WARRANTY; without even the implied warranty of
  16 % MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  17 % Lesser General Public License for more details.
  18 %
  19 % You should have received a copy of the GNU Lesser General Public
  20 % License along with GROMACS; if not, see
  21 % http://www.gnu.org/licenses, or write to the Free Software Foundation,
  22 % Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301  USA.
  23 %
  24 % If you want to redistribute modifications to GROMACS, please
  25 % consider that scientific software is very special. Version
  26 % control is crucial - bugs must be traceable. We will be happy to
  27 % consider code for inclusion in the official distribution, but
  28 % derived work must not be called official GROMACS. Details are found
  29 % in the README & COPYING files - if they are missing, get the
  30 % official version at http://www.gromacs.org.
  31 %
  32 % To help us fund GROMACS development, we humbly ask that you cite
  33 % the research papers on the package. Check out http://www.gromacs.org.
  34
  35 \chapter{Some implementation details}
  36 In this chapter we will present some implementation details. This is
  37 far from complete, but we deemed it necessary to clarify some things
  38 that would otherwise be hard to understand.
  39
  40 \section{Single Sum Virial in {\gromacs}}
  41 \label{sec:virial}
  42 The \normindex{virial} $\Xi$ can be written in full tensor form as:
  43 \beq
  44 \Xi~=~-\half~\sum_{i < j}^N~\rvij\otimes\Fvij
  45 \eeq
  46 where $\otimes$ denotes the {\em direct product} of two vectors.\footnote
  47 {$({\bf u}\otimes{\bf v})^{\ab}~=~{\bf u}_{\al}{\bf v}_{\be}$} When this is
  48 computed in the inner loop of an MD program 9 multiplications and 9
  49 additions are needed.\footnote{The calculation of
  50 Lennard-Jones and Coulomb forces is about 50 floating point operations.}
  51
  52 Here it is shown how it is possible to extract the virial calculation
  53 from the inner loop~\cite{Bekker93b}.
  54
  55 \subsection{Virial}
  56 In a system with periodic boundary conditions\index{periodic boundary
  57 conditions}, the
  58 periodicity must be taken into account for the virial:
  59 \beq
  60 \Xi~=~-\half~\sum_{i < j}^{N}~\rnij\otimes\Fvij
  61 \eeq
  62 where $\rnij$ denotes the distance vector of the
  63 {\em nearest image} of atom $i$ from atom $j$. In this definition we add
  64 a {\em shift vector} $\delta_i$ to the position vector $\rvi$
  65 of atom $i$. The difference vector $\rnij$ is thus equal to:
  66 \beq
  67 \rnij~=~\rvi+\delta_i-\rvj
  68 \eeq
  69 or in shorthand:
  70 \beq
  71 \rnij~=~\rni-\rvj
  72 \eeq
  73 In a triclinic system, there are 27 possible images of $i$; when a truncated
  74 octahedron is used, there are 15 possible images.
  75
  76 \subsection{Virial from non-bonded forces}
  77 Here the derivation for the single sum virial in the {\em non-bonded force}
  78 routine is given. $i \neq j$ in all formulae below.
  79 \newcommand{\di}{\delta_{i}}
  80 \newcommand{\qrt}{\frac{1}{4}}
  81 \bea
  82 \Xi
  83 &~=~&-\half~\sum_{i < j}^{N}~\rnij\otimes\Fvij                          \\
  84 &~=~&-\qrt\sum_{i=1}^N~\sum_{j=1}^N ~(\rvi+\di-\rvj)\otimes\Fvij        \\
  85 &~=~&-\qrt\sum_{i=1}^N~\sum_{j=1}^N ~(\rvi+\di)\otimes\Fvij-\rvj\otimes\Fvij    \\
  86 &~=~&-\qrt\left(\sum_{i=1}^N~\sum_{j=1}^N ~(\rvi+\di)\otimes\Fvij~-~\sum_{i=1}^N~\sum_{j=1}^N ~\rvj\otimes\Fvij\right)  \\
  87 &~=~&-\qrt\left(\sum_{i=1}^N~(\rvi+\di)\otimes\sum_{j=1}^N~\Fvij~-~\sum_{j=1}^N ~\rvj\otimes\sum_{i=1}^N~\Fvij\right)   \\
  88 &~=~&-\qrt\left(\sum_{i=1}^N~(\rvi+\di)\otimes\Fvi~+~\sum_{j=1}^N ~\rvj\otimes\Fvj\right)       \\
  89 &~=~&-\qrt\left(2~\sum_{i=1}^N~\rvi\otimes\Fvi+\sum_{i=1}^N~\di\otimes\Fvi\right)
  90 \eea
  91 In these formulae we introduced:
  92 \bea
  93 \Fvi&~=~&\sum_{j=1}^N~\Fvij                                     \\
  94 \Fvj&~=~&\sum_{i=1}^N~\Fvji
  95 \eea
  96 which is the total force on $i$ with respect to $j$. Because we use Newton's Third Law:
  97 \beq
  98 \Fvij~=~-\Fvji
  99 \eeq
 100 we must, in the implementation, double the term containing the shift $\delta_i$.
 101
 102 \subsection{The intra-molecular shift (mol-shift)}
 103 For the bonded forces and SHAKE it is possible to make a {\em mol-shift}
 104 list, in which the periodicity is stored. We simple have an array {\tt mshift}
 105 in which for each atom an index in the {\tt shiftvec} array is stored.
 106
 107 The algorithm to generate such a list can be derived from graph theory,
 108 considering each particle in a molecule as a bead in a graph, the bonds
 109 as edges.
 110 \begin{enumerate}
 111 \item[1]        Represent the bonds and atoms as bidirectional graph
 112 \item[2]        Make all atoms white
 113 \item[3]        Make one of the white atoms black (atom $i$) and put it in the
 114                 central box
 115 \item[4]        Make all of the neighbors of $i$ that are currently
 116                 white, gray
 117 \item[5]        Pick one of the gray atoms (atom $j$), give it the
 118                 correct periodicity with respect to any of
 119                 its black neighbors
 120                 and make it black
 121 \item[6]        Make all of the neighbors of $j$ that are currently
 122                 white, gray
 123 \item[7]        If any gray atom remains, go to [5]
 124 \item[8]        If any white atom remains, go to [3]
 125 \end{enumerate}
 126 Using this algorithm we can
 127 \begin{itemize}
 128 \item   optimize the bonded force calculation as well as SHAKE
 129 \item   calculate the virial from the bonded forces
 130         in the single sum method again
 131 \end{itemize}
 132
 133 Find a representation of the bonds as a bidirectional graph.
 134
 135 \subsection{Virial from Covalent Bonds}
 136 Since the covalent bond force gives a contribution to the virial, we have:
 137 \bea
 138 b       &~=~&   \|\rnij\|                                       \\
 139 V_b     &~=~&   \half k_b(b-b_0)^2                              \\
 140 \Fvi    &~=~&   -\nabla V_b                                     \\
 141         &~=~&   k_b(b-b_0)\frac{\rnij}{b}                       \\
 142 \Fvj    &~=~&   -\Fvi
 143 \eea
 144 The virial contribution from the bonds then is:
 145 \bea
 146 \Xi_b   &~=~&   -\half(\rni\otimes\Fvi~+~\rvj\otimes\Fvj)       \\
 147         &~=~&   -\half\rnij\otimes\Fvi
 148 \eea
 149
 150 \subsection{Virial from SHAKE}
 151 An important contribution to the virial comes from shake. Satisfying
 152 the constraints a force {\bf G} that is exerted on the particles ``shaken.'' If this
 153 force does not come out of the algorithm (as in standard SHAKE) it can be
 154 calculated afterward (when using {\em leap-frog}) by:
 155 \bea
 156 \Delta\rvi&~=~&\rvi(t+\Dt)-
 157 [\rvi(t)+{\bf v}_i(t-\frac{\Dt}{2})\Dt+\frac{\Fvi}{m_i}\Dt^2]   \\
 158 {\bf G}_i&~=~&\frac{m_i\Delta\rvi}{\Dt^2}
 159 \eea
 160 This does not help us in the general case. Only when no periodicity
 161 is needed (like in rigid water) this can be used, otherwise
 162 we must add the virial calculation in the inner loop of SHAKE.
 163
 164 When it {\em is} applicable the virial can be calculated in the single sum way:
 165 \beq
 166 \Xi~=~-\half\sum_i^{N_c}~\rvi\otimes\Fvi
 167 \eeq
 168 where $N_c$ is the number of constrained atoms.
 169
 170 %Another method is the Non-Iterative shake as proposed (and implemented)
 171 %by Yoneya. In this algorithm the Lagrangian multipliers are solved in a
 172 %matrix equation, and given these multipliers it is easy to get the periodicity
 173 %in the virial afterwards.
 174
 175 %More...
 176
 177
 178 \section{Optimizations}
 179 Here we describe some of the algorithmic optimizations used
 180 in {\gromacs}, apart from parallelism.
 181 One of these, the implementation of the
 182 1.0/sqrt(x) function is treated separately in \secref{sqrt}.
 183 The most important other optimizations are described below.
 184
 185 \subsection{Inner Loops for Water}
 186 \label{sec:waterloops}
 187 {\gromacs} uses special inner loops to calculate non-bonded
 188 interactions for water molecules with other atoms, and yet
 189 another set of loops for interactions between pairs of
 190 water molecules. There highly optimized loops for two types of water models.
 191 For three site models similar to
 192 SPC~\cite{Berendsen81}, {\ie}:
 193 \begin{enumerate}
 194 \item   There are three atoms in the molecule.
 195 \item   The whole molecule is a single charge group.
 196 \item   The first atom has Lennard-Jones (\secref{lj}) and
 197         Coulomb (\secref{coul}) interactions.
 198 \item   Atoms two and three have only Coulomb interactions,
 199         and equal charges.
 200 \end{enumerate}
 201 These loops also works for the SPC/E~\cite{Berendsen87} and
 202 TIP3P~\cite{Jorgensen83} water models.
 203 And for four site water models similar to TIP4P~\cite{Jorgensen83}:
 204 \begin{enumerate}
 205 \item   There are four atoms in the molecule.
 206 \item   The whole molecule is a single charge group.
 207 \item   The first atom has only Lennard-Jones (\secref{lj}) interactions.
 208 \item   Atoms two and three have only Coulomb (\secref{coul}) interactions,
 209         and equal charges.
 210 \item   Atom four has only Coulomb interactions.
 211 \end{enumerate}
 212
 213 The benefit of these implementations is that there are more floating-point
 214 operations in a single loop, which implies that some compilers
 215 can schedule the code better. However, it turns out that even
 216 some of the most advanced compilers have problems with scheduling,
 217 implying that manual tweaking is necessary to get optimum
 218 \normindex{performance}.
 219 This may include common-sub-expression elimination, or moving
 220 code around.
 221
 222 \subsection{Fortran Code}
 223 Unfortunately, on a few platforms \normindex{Fortran} compilers are
 224 still better than C-compilers. For some machines ({\eg} SGI
 225 Power Challenge) the difference may be up to a factor of 3, in the
 226 case of vector computers this may be even larger. Therefore, some of
 227 the routines that take up a lot of computer time have been translated
 228 into Fortran and even assembly code for Intel and AMD x86 processors.
 229 In most cases, the Fortran or assembly loops should be selected
 230 automatically by the {\tt configure} script when appropriate, but you can
 231 also tweak this by setting options to the {\tt configure} script.
 232
 233 \section{Computation of the 1.0/sqrt function}
 234 \label{sec:sqrt}
 235 \subsection{Introduction}
 236 The {\gromacs} project started with the development of a $1/\sqrt{x}$
 237 processor that calculates:
 238 \begin{equation}
 239 Y(x) = \frac{1}{ \sqrt{x} }
 240 \end{equation}
 241 As the project continued, the {\intel} processor was used to implement
 242 {\gromacs}, which now turned into almost a full software project.  The
 243 $1/\sqrt{x}$ processor was implemented using a Newton-Raphson
 244 iteration scheme for one step. For this it needed look-up tables to
 245 provide the initial approximation. The $1/\sqrt{x}$ function makes it
 246 possible to use two almost independent tables for the exponent seed
 247 and the fraction seed with the IEEE floating-point representation.
 248
 249 \subsection{General}
 250 According to~\cite{Bekker87} the $1/\sqrt{x}$ function can be evaluated using
 251 the Newton-Raphson iteration scheme. The inverse function is:
 252 \begin{equation}
 253 X(y) = \frac{1}{y^{2}}
 254 \end{equation}
 255 So instead of calculating:
 256 \begin{equation}
 257 Y(a) = q
 258 \end{equation}
 259 the equation:
 260 \begin{equation}
 261 X(q) - a = 0
 262 \label{eqn:inversef}
 263 \end{equation}
 264 can now be solved using Newton-Raphson. An iteration is performed by
 265 calculating:
 266 \begin{equation}
 267 y_{n+1} = y_{n} - \frac{f(y_{n})}{f'(y_{n})}
 268 \label{eqn:nr}
 269 \end{equation}
 270 The absolute error $\varepsilon$, in this approximation is defined by:
 271 \begin{equation}
 272 \varepsilon \equiv y_{n} - q
 273 \end{equation}
 274 Using Taylor series expansion to estimate the error results in:
 275 \begin{equation}
 276 \varepsilon _{n+1} = - \frac{\varepsilon _{n}^{2}}{2}
 277                        \frac{ f''(y_{n})}{f'(y_{n})}
 278 \label{eqn:taylor}
 279 \end{equation}
 280 according to~\cite{Bekker87} equation (3.2). This is an estimation of the
 281 absolute error.
 282
 283 \subsection{Applied to floating-point numbers}
 284 Floating-point numbers in IEEE 32 bit single-precision format have a nearly
 285 constant relative error of $\Delta x / x = 2^{-24}$. As seen earlier in the
 286 Taylor series expansion equation (\eqnref{taylor}), the error in every
 287 iteration step is absolute and in general dependent of $y$. If the error is
 288 expressed as a relative error $\varepsilon_{r}$ the following holds:
 289 \begin{equation}
 290 \varepsilon _{{r}_{n+1}} \equiv \frac{\varepsilon_{n+1}}{y}
 291 \end{equation}
 292 and so:
 293 \begin{equation}
 294 \varepsilon _{{r}_{n+1}} =
 295 - ( \frac{\varepsilon _{n}}{y} )^{2} y \frac{ f''}{2f'}
 296 \end{equation}
 297 For the function $f(y) = y^{-2}$ the term $y f''/2f'$ is constant (equal
 298 to $-3/2$) so the relative error $\varepsilon _{r_{n}}$ is independent of $y$.
 299 \begin{equation}
 300 \varepsilon _{{r}_{n+1}} =
 301 \frac{3}{2} (\varepsilon_{r_{n}})^{2}
 302 \label{eqn:epsr}
 303 \end{equation}
 304
 305 The conclusion of this is that the function $1/\sqrt{x}$ can be
 306 calculated with a specified accuracy.
 307
 308 \begin{figure}
 309 \begin{center}
 310 \newcommand{\twltt}{\tt}
 311 \setlength{\unitlength}{0.0080in}
 312 \begin{picture}(489,176)(40,390)
 313 \thicklines
 314 \put(180,505){$\underbrace{\hspace{2.68in}}$}
 315 \put( 60,505){$\underbrace{\hspace{0.88in}}$}
 316 \put( 40,510){\framebox(480,30){}}
 317 \put( 45,505){\vector( 0,-1){ 15}}
 318 \put( 40,540){\dashbox{4}(0,0){}}
 319 \multiput(250,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 320 \multiput(220,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 321 \multiput(190,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 322 \multiput(280,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 323 \multiput(310,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 324 \multiput(340,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 325 \multiput(370,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 326 \multiput(400,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 327 \multiput(430,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 328 \multiput(460,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 329 \multiput(490,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 330 \multiput(205,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 331 \multiput(235,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 332 \multiput(265,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 333 \multiput(295,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 334 \multiput(325,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 335 \multiput(355,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 336 \multiput(385,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 337 \multiput(415,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 338 \multiput(445,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 339 \multiput(475,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 340 \multiput(505,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 341 \put( 40,510){\framebox(480,30){}}
 342 \put( 40,540){\dashbox{4}(0,0){}}
 343 \multiput(250,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 344 \multiput(220,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 345 \multiput(190,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 346 \multiput(160,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 347 \multiput(130,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 348 \multiput(100,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 349 \multiput(280,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 350 \multiput(310,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 351 \multiput(340,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 352 \multiput(370,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 353 \multiput(400,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 354 \multiput(430,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 355 \multiput(460,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 356 \multiput(490,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 357 \multiput( 70,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 358 \put( 55,540){\line( 0,-1){ 30}}
 359 \multiput( 85,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 360 \multiput(115,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 361 \multiput(145,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 362 \multiput(205,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 363 \multiput(235,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 364 \multiput(265,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 365 \multiput(295,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 366 \multiput(325,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 367 \multiput(355,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 368 \multiput(385,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 369 \multiput(415,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 370 \multiput(445,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 371 \multiput(475,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 372 \multiput(505,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 373 \put(175,540){\line( 0,-1){ 30}}
 374 \put(175,540){\line( 0,-1){ 30}}
 375 \multiput(145,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 376 \multiput(115,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 377 \multiput( 85,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 378 \put( 55,540){\line( 0,-1){ 30}}
 379 \multiput( 70,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 380 \multiput(100,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 381 \multiput(130,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 382 \multiput(160,540)(0.00000,-8.57143){4}{\line( 0,-1){  4.286}}
 383 \put(345,470){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\twltt $F$}}}
 384 \put(110,470){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\twltt $E$}}}
 385 \put( 40,470){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\twltt $S$}}}
 386 \put(505,545){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\twltt $0$}}}
 387 \put(160,545){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\twltt $23$}}}
 388 \put( 40,545){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\twltt $31$}}}
 389 \put(140,400){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\twltt $Value=(-1)^{S}(2^{E-127})(1.F)$}}}
 390 \put(505,545){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\twltt $0$}}}
 391 \put(160,545){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\twltt $23$}}}
 392 \put( 40,545){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\twltt $31$}}}
 393 \put(140,400){\makebox(0,0)[lb]{\raisebox{0pt}[0pt][0pt]{\twltt $Value=(-1)^{S}(2^{E-127})(1.F)$}}}
 394 \end{picture}
 395 \end{center}
 396 \caption{IEEE single-precision floating-point format}
 397 \label{fig:ieee}
 398 \end{figure}
 399
 400 \subsection{Specification of the look-up table}
 401 To calculate the function $1/\sqrt{x}$ using the previously mentioned
 402 iteration scheme, it is clear that the first estimation of the solution must
 403 be accurate enough to get precise results. The requirements for the
 404 calculation are
 405 \begin{itemize}
 406 \item Maximum possible accuracy with the used IEEE format
 407 \item Use only one iteration step for maximum speed
 408 \end{itemize}
 409
 410 The first requirement states that the result of $1/\sqrt{x}$ may have a
 411 relative error $\varepsilon_{r}$ equal to the $\varepsilon_{r}$ of a IEEE 32
 412 bit single-precision floating-point number. From this, the $1/\sqrt{x}$
 413 of the initial approximation can be derived, rewriting the definition of
 414 the relative error for succeeding steps (\eqnref{epsr}):
 415 \begin{equation}
 416 \frac{\varepsilon_{n}}{y} =
 417 \sqrt{\varepsilon_ {r_{n+1}} \frac{2f'}{yf''}}
 418 \end{equation}
 419 So for the look-up table the needed accuracy is:
 420 \begin{equation}
 421 \frac{\Delta Y}{Y} = \sqrt{\frac{2}{3} 2^{-24}}
 422 \label{eqn:accy}
 423 \end{equation}
 424 which defines the width of the table that must be $\geq 13$ bit.
 425
 426 At this point the relative error, $\varepsilon_{r_{n}}$, of the look-up table
 427 is known. From this the maximum relative error in the argument can be
 428 calculated as follows. The absolute error $\Delta x$ is defined as:
 429 \begin{equation}
 430 \Delta x \equiv \frac{\Delta Y}{Y'}
 431 \end{equation}
 432 and thus:
 433 \begin{equation}
 434 \frac{\Delta x}{Y} = \frac{\Delta Y}{Y} (Y')^{-1}
 435 \end{equation}
 436 and thus:
 437 \begin{equation}
 438 \Delta x = constant \frac{Y}{Y'}
 439 \end{equation}
 440 For the $1/\sqrt{x}$ function, $Y / Y' \sim x$ holds, so
 441 $\Delta x / x = constant$. This is a property of the used floating-point
 442 representation as earlier mentioned. The needed accuracy of the argument of the
 443 look-up table follows from:
 444 \begin{equation}
 445 \frac{\Delta x}{x} = -2 \frac{\Delta Y}{Y}
 446 \end{equation}
 447 So, using the floating-point accuracy (\eqnref{accy}):
 448 \begin{equation}
 449 \frac{\Delta x}{x} = -2 \sqrt{\frac{2}{3} 2^{-24}}
 450 \end{equation}
 451 This defines the length of the look-up table which should be $\geq 12$ bit.
 452
 453 \subsection{Separate exponent and fraction computation}
 454 The used IEEE 32 bit single-precision floating-point format specifies
 455 that a number is represented by a exponent and a fraction. The previous
 456 section specifies for every possible floating-point number the look-up table
 457 length and width. Only the size of the fraction of a floating-point number
 458 defines the accuracy. The conclusion from this can be that the size of the
 459 look-up table is length of look-up table, earlier specified, times the size of
 460 the exponent ($2^{12}2^{8}, 1Mb$). The $1/\sqrt{x}$  function has the
 461 property that the exponent is independent of the fraction. This becomes clear
 462 if the floating-point representation is used. Define:
 463 \begin{equation}
 464 x \equiv (-1)^{S}(2^{E-127})(1.F)
 465 \label{eqn:fpdef}
 466 \end{equation}
 467 See \figref{ieee}, where $0 \leq S \leq 1$, $0 \leq E \leq 255$,
 468 $1 \leq 1.F < 2$ and $S$, $E$, $F$ integer (normalization conditions).
 469 The sign bit ($S$) can be omitted because $1/\sqrt{x}$ is only defined
 470 for $x > 0$. The $1/\sqrt{x}$ function applied to $x$ results in:
 471 \begin{equation}
 472 y(x) = \frac{1}{\sqrt{x}}
 473 \end{equation}
 474 or:
 475 \begin{equation}
 476 y(x) = \frac{1}{\sqrt{(2^{E-127})(1.F)}}
 477 \end{equation}
 478 This can be rewritten as:
 479 \begin{equation}
 480 y(x) = (2^{E-127})^{-1/2}(1.F)^{-1/2}
 481 \label{eqn:yx}
 482 \end{equation}
 483 Define:
 484 \begin{equation}
 485 (2^{E'-127}) \equiv (2^{E-127})^{-1/2}
 486 \end{equation}
 487 \begin{equation}
 488 1.F'\equiv (1.F)^{-1/2}
 489 \end{equation}
 490 Then $\frac{1}{\sqrt{2}} < 1.F' \leq 1$ holds, so the condition
 491 $1 \leq 1.F' < 2$, which is essential for normalized real representation, is
 492 not valid anymore. By introducing an extra term, this can be corrected.
 493 Rewrite the $1/\sqrt{x}$ function applied to floating-point numbers (\eqnref{yx}) as:
 494 \begin{equation}
 495 y(x) = (2^{\frac{127-E}{2}-1}) (2(1.F)^{-1/2})
 496 \end{equation}
 497 and:
 498 \begin{equation}
 499 (2^{E'-127}) \equiv (2^{\frac{127-E}{2}-1})
 500 \label{eqn:exp}
 501 \end{equation}
 502 \begin{equation}
 503 1.F'\equiv 2(1.F)^{-1/2}
 504 \label{eqn:frac}
 505 \end{equation}
 506 Then $\sqrt{2} < 1.F \leq 2$ holds. This is not the exact valid range as
 507 defined for normalized floating-point numbers in \eqnref{fpdef}.
 508 The value $2$  causes the problem. By mapping this value on the nearest
 509 representation $< 2$, this can be solved. The small error that is introduced
 510 by this approximation is within the allowable range.
 511
 512 The integer representation of the exponent is the next problem. Calculating
 513 $(2^{\frac{127-E}{2}-1})$ introduces a fractional result if $(127-E) = odd$.
 514 This is again easily accounted for by splitting up the calculation into an
 515 odd and an even part. For $(127-E) = even$ $E'$ in equation (\eqnref{exp})
 516 can be exactly calculated in integer arithmetic as a function of $E$.
 517 \begin{equation}
 518 E' = \frac{127-E}{2} + 126
 519 \end{equation}
 520
 521 For $(127-E) = odd$ equation (\eqnref{yx}) can be rewritten as:
 522 \begin{equation}
 523 y(x) = (2^{\frac{127-E-1}{2}})(\frac{1.F}{2})^{-1/2}
 524 \end{equation}
 525 Thus:
 526 \begin{equation}
 527 E' = \frac{126-E}{2} + 127
 528 \end{equation}
 529 which also can be calculated exactly in integer arithmetic.
 530 {\bf Note} that the fraction is automatically corrected for its range earlier
 531 mentioned, so the exponent does not need an extra correction.
 532
 533 The conclusions from this are:
 534 \begin{itemize}
 535 \item The fraction and exponent look-up table are independent. The fraction
 536 look-up table exists of two tables (odd and even exponent) so the odd/even
 537 information of the exponent (lsb bit) has to be used to select the right
 538 table.
 539 \item The exponent table is an 256 x 8 bit table, initialized for $odd$
 540 and $even$.% as specified before
 541 \end{itemize}
 542
 543 \subsection{Implementation}
 544 The look-up tables can be generated by a small C program, which uses
 545 floating-point numbers and operations with IEEE 32 bit single-precision format.
 546 Note that because of the $odd$/$even$ information that is needed, the
 547 fraction table is twice the size earlier specified (13 bit i.s.o. 12 bit).
 548 %\figref{expgen} in the appendix shows how to fill the exponent table,
 549 %\figref{fractgen} shows how to fill the fraction table.
 550
 551 The function according to~\eqnref{nr} has to be implemented.
 552 Applied to the $1/\sqrt{x}$ function, equation~\eqnref{inversef} leads to:
 553 \begin{equation}
 554 f = a - \frac{1}{y^{2}}
 555 \end{equation}
 556 and so:
 557 \begin{equation}
 558 f' = \frac{2}{y^{3}}
 559 \end{equation}
 560 so:
 561 \begin{equation}
 562 y_{n+1} = y_{n} - \frac{ a - \frac{1}{y^{2}_{n}} }{ \frac{2}{y^{3}_{n}} }
 563 \end{equation}
 564 or:
 565 \begin{equation}
 566 y_{n+1} = \frac{y_{n}}{2} (3 - a y^{2}_{n})
 567 \end{equation}
 568 Where $y_{0}$ can be found in the look-up tables, and $y_{1}$ gives the result
 569 to the maximum accuracy.
 570 %In \figref{as_implementation}) the assembler implementation is given.
 571 It is clear that only one iteration extra (in double
 572 precision) is needed for a double-precision result.
 573
 574 % LocalWords:  Virial virial triclinic intra mol mshift shiftvec sqrt SPC lj yf
 575 % LocalWords:  coul Fortran SGI AMD Raphson IEEE taylor epsr accy ieee yx fpdef
 576 % LocalWords:  lsb nr inversef src formulae GROMACS