BioInfo/BS-Seq/bwa-meth/paper/supplement.tex

   1 \documentclass[12pt]{article}
   2 %\usepackage[english]{babel}
   3 %\usepackage[utf8x]{inputenc}
   4 \usepackage{fullpage}
   5 \usepackage{float}
   6 \usepackage{booktabs}
   7
   8 \usepackage{listings}
   9 \usepackage{color}
  10
  11 \title{Supplemental Information}
  12 \author{Pedersen \textit{et al.}}
  13
  14 \date{}
  15
  16 \definecolor{mygreen}{rgb}{0,0.6,0}
  17 \definecolor{mygray}{rgb}{0.5,0.5,0.5}
  18 \definecolor{mymauve}{rgb}{0.58,0,0.82}
  19
  20 \lstset{ %
  21   backgroundcolor=\color{white},   % choose the background color
  22   basicstyle=\footnotesize,        % size of fonts used for the code
  23   breaklines=true,                 % automatic line breaking only at whitespace
  24   captionpos=b,                    % sets the caption-position to bottom
  25   commentstyle=\color{mygreen},    % comment style
  26   escapeinside={\%*}{*)},          % if you want to add LaTeX within your code
  27   keywordstyle=\color{blue},       % keyword style
  28   stringstyle=\color{mymauve},     % string literal style
  29 }
  30 \usepackage{graphicx}
  31
  32 \begin{document}
  33
  34 \maketitle
  35
  36 \section{Advantages}
  37 \textit{bwa-meth} provides a number of advantages:
  38
  39 1. ease of use -- a single script indexes the fasta, aligns the reads, and
  40    tabulates methylation. Default parameters work well.
  41
  42 2. speed -- due to the efficient parallelization by bwa, bwa-meth can run on
  43    as many CPU's are available on a single machine.
  44
  45 3. disk-usage -- bwa-meth can take either compressed or uncompressed files
  46    and the reads are streamed directly to the aligner, never written to disk. An
  47    aligner that requires trimmed reads and that writes the converted reads to
  48    disk will require 3X as much storage for just the raw sequence data.
  49
  50 4. useful output -- the BAM that results from bwa-meth is sorted, indexed and
  51    contains the proper mapping quality scores, alignment flags, and read-groups.
  52    In addition, the header @SQ lines are sorted so that tools like Picard and
  53    GATK will accept them as is.
  54
  55 5. simplicity -- we have carefully designed bwa-meth to be quite simple and
  56    the code readable so that enhancements are simple. For example, see \#6.
  57
  58 6. strand-specificity -- Devon Ryan,
  59    the author of Bison, noted that many capture methods target only the reverse
  60    strand -- commonly called the original bottom. Though this would be simple to
  61    add to any aligner, ours is the only one that currently supports this. See
  62    from the supplemental figure 5 for the reduction in off-target reads that results
  63    from only considering reads that align to the original bottom strand.
  64
  65 7. specificity and sensitivity -- while the difference between aligners in
  66    this regard is small, we have shown that bwa-meth consistently performs as
  67    the best or among the best. We believe this to be the case in general and
  68    not limited to the specific cases that we have shown.
  69
  70 8. accuracy without trimming -- the accuracy is due in part to bwa's local
  71    alignment. This is something that is not supported in bismark and means that
  72    we can achieve very good accuracy without performing additional trimming
  73    which would result in additional disk-usage and processing time.
  74
  75
  76 \section{Aligners Compared}
  77
  78 \begin{table}[!htp]
  79 \caption{Methods Compared}
  80 {\begin{tabular}{lll}\\\toprule
  81 software & version & arguments\\\midrule
  82 bismark (bowtie1) & 0.11.1 & --gzip --bam -n 2 -l 24\\
  83 bismark (bowtie2) & 0.11.1 & --gzip --bam -N 1 --score\_min L,-0.4,-0.5\\
  84 bsmap & 2.74 & bsmap -v3 -m0 -x1000 -S42 -n0 -s12 -I1\\
  85 bsmooth & fb3f7ef & --very-sensitive\\
  86 bison & 0.3.0 & --directional --very-sensitive-local -N 1\\
  87 bwa-meth & 0.10 & bwa-meth\\
  88 gsnap & 2014-02-28 & --npaths 1 --quiet-if-excessive -k 15\\
  89 last & 392 & last-bisulfite-paired.sh\\\bottomrule
  90 \end{tabular}}
  91 \end{table}
  92 We used a modified version of the calling script for last available in the
  93 github repository.
  94
  95 \section{Comparison On Real Data}
  96
  97 We compared 7 aligners on real data as described in the text. Supplemental
  98 Figure 1 below shows the same data as Figure 1 from the main text.
  99
 100 \begin{figure}[H]%supplemental figure1
 101     \centerline{\includegraphics[width=125mm]{real-quals.eps}}
 102     \caption{Percent of paired-end, 100-base reads on (y) and off (x) target for
 103     the tested aligners. Aligners that report mapping quality are shown as lines
 104 that span each quality cut-off. Reads are limited to those considered as primary, mapped alignments by the aligner. This is a color version of figure 1 from the paper}\label{suppfig:01}
 105 \end{figure}
 106
 107 When we trim the reads with \emph{trim\_galore} before aligning, the result is shown below in Supplemental Figure 2. We report the percent of reads aligned relative
 108 to the original count in the untrimmed data because we are interested in the
 109 overall mapping rate.
 110
 111 %\begin{figure}[!htpb]%figure2
 112 \begin{figure}[H]%figure2
 113     \centerline{\includegraphics[width=125mm]{real-trim-quals.eps}}
 114     \caption{Percent of paired-end, \emph{trimmed} 100-base reads on (y) and off (x) target for the tested aligners. Aligners that report mapping quality are shown as connected dots for each quality cut-off. Reads are limited to those considered as primary, mapped alignments by the aligner.}\label{suppfig:02}
 115 \end{figure}
 116
 117 \subsection{Resources}
 118
 119 %http://truben.no/latex/table/
 120 \begin{table}[H]
 121     \centering
 122     \caption{Resources on real data}
 123     \begin{tabular}{lllll} \hline
 124     trimmed & program & CPU time(min) & mem(GB) & dataset \\ \hline
 125
 126 no &    bis1 & 588.68 & 8.30 & real \\
 127 no &    bis2 & 2067.23 & 9.88 & real \\
 128 no &    bison & 1639.25 & 10.34 & real \\
 129 no &    bsmap & 8498.52 & 23.06 & real \\
 130 no &    bsmooth & 5469.96 & 9.94 & real \\
 131 no &    bwa & 490.37 & 17.70 & real \\
 132 no &    gsnap & 5309.48 & 12.73 & real \\
 133 no &    last & 611.22 & 34.53 & real \\
 134
 135 yes &    bis1 & 581.60 & 8.28 & real \\
 136 yes &    bis2 & 1921.00 & 9.64 & real \\
 137 yes &    bison & 1433.06 & 10.06 & real \\
 138 yes &    bsmap & 7259.29 & 23.19 & real \\
 139 yes &    bsmooth & 5189.16 & 9.46 & real \\
 140 yes &    bwa & 413.69 & 17.65 & real \\
 141 yes &    gsnap & 4565.57 & 13.93 & real \\
 142 yes &    last & 560.53 & 34.53 & real \\
 143
 144     \end{tabular}
 145 \end{table}
 146
 147 \section{Comparison On Simulated Data}
 148 Paired-end, 100-base reads were simulated using the tool \emph{Sherman}
 149 (http://www.bioinformatics.babraham.ac.uk/projects/sherman/) with 95\% of the
 150 reads simulated from mm10 and 5\% of the reads from \emph{E. coli}.
 151 Reads were aligned with the parameters in Supplemental Table 1.
 152 Reads from \emph{E. coli} aligning to \emph{mm10} were considered as
 153 off-target reads. The Y-axis in the plot is the percent of the \emph{mm10}
 154 reads. Exact parameters sent to Sherman are here:
 155 https://github.com/brentp/bwa-meth/blob/master/compare/src/gen-simulated.sh
 156
 157 Note that these reads have no insertions or deletions. We simulated the
 158 \emph{E. coli} reads to be 98\% unmethylated and we used a per-base error
 159 rate of 1\% for the simulations.
 160
 161 Supplemental Figure 3 below shows the result for these simulations.
 162
 163 \begin{figure}[H]%figure3
 164     \centerline{\includegraphics[width=125mm]{sim-quals.eps}}
 165     \caption{Percent of paired-end, \emph{simulated} 100-base reads on (y) and off (x)
 166     target for the tested aligners. Aligners that report mapping quality are shown as
 167     connected dots for each quality cut-off. Reads are limited to those considered as
 168     primary, mapped alignments by the aligner.
 169 }\label{suppfig:03}
 170 \end{figure}
 171
 172 Supplemental Figure 4 below shows the result for trimmed and simulated reads.
 173 Trimming removes some read-pairs if one or both reads had low-quality. We
 174 report the percent of reads aligned relative to the original,
 175 un-trimmed number from \emph{mm10} since we are interested in the overall
 176 mapping rate.
 177
 178 \begin{figure}[H]%figure3
 179     \centerline{\includegraphics[width=125mm]{sim-trim-quals.eps}}
 180     \caption{Percent of paired-end, \emph{trimmed}, \emph{simulated} 100-base reads on (y) and off (x) target for the tested aligners. Aligners that report mapping quality are shown as connected dots for each quality cut-off. Reads are limited to those considered as primary, mapped alignments by the aligner.
 181 }\label{suppfig:04}
 182 \end{figure}
 183
 184 \subsection{Resources}
 185
 186 %http://truben.no/latex/table/
 187 \begin{table}[H]
 188     \centering
 189     \caption{Resources on simulated data with 1\% error-rate
 190 (programs run with more processes will use more memory)}
 191     \begin{tabular}{lllll} \hline
 192     trimmed & program & CPU time(min) & mem(GB) & dataset \\ \hline
 193
 194 no &    bis1 & 69.07 & 8.28 & sim \\
 195 no &    bis2 & 271.13 & 9.66 & sim \\
 196 no &    bison & 222.23 & 9.51 & sim \\
 197 no &    bsmap & 227.35 & 22.99 & sim \\
 198 no &    bsmooth & 501.97 & 9.52 & sim \\
 199 no &    bwa & 145.56 & 19.40 & sim \\
 200 no &    gsnap & 265.41 & 10.84 & sim \\
 201 no &    last & 138.50 & 25.90 & sim \\
 202
 203 yes &    bis1 & 62.92 & 8.27 & sim \\
 204 yes &    bis2 & 227.30 & 9.47 & sim \\
 205 yes &    bison & 154.81 & 9.16 & sim \\
 206 yes &    bsmap & 93.70 & 22.84 & sim \\
 207 yes &    bsmooth & 461.71 & 9.31 & sim \\
 208 yes &    bwa & 199.15 & 22.25 & sim \\
 209 yes &    gsnap & 91.98 & 10.78 & sim \\
 210 yes &    last & 113.80 & 25.90 & sim \\
 211
 212
 213     \end{tabular}
 214 \end{table}
 215
 216 \section{Comparison On Error-Free Simulated Data}
 217
 218 Paired-end, error-free 100-base reads were simulated as above but
 219 without contamination or errors.
 220 Exact parameters sent to Sherman are here:
 221 https://github.com/brentp/bwa-meth/blob/master/compare/src/gen-simulated.sh
 222
 223 Supplemental Figure 5 below shows the result for these simulations.
 224
 225 \begin{figure}[H]%figure5
 226     \centerline{\includegraphics[width=125mm]{noerrorsim-quals.eps}}
 227     \caption{Percent of paired-end, \emph{error-free, simulated} 100-base reads on (y) and off (x)
 228     target for the tested aligners. Aligners that report mapping quality are shown as
 229     connected dots for each quality cut-off. Reads are limited to those considered as
 230     primary, mapped alignments by the aligner.
 231 }\label{suppfig:05}
 232 \end{figure}
 233
 234 Supplemental Figure 6 below shows the result for trimmed and simulated reads.
 235 Trimming removes some read-pairs if one or both reads had low-quality. We
 236 report the percent of reads aligned relative to the original,
 237 un-trimmed number from \emph{mm10} since we are interested in the overall
 238 mapping rate.
 239
 240 \begin{figure}[H]%figure6
 241     \centerline{\includegraphics[width=125mm]{noerrorsim-trim-quals.eps}}
 242     \caption{Percent of paired-end, \emph{trimmed}, \emph{error-free, simulated} 100-base reads on (y) and off (x) target for the tested aligners. Aligners that report mapping quality are shown as connected dots for each quality cut-off. Reads are limited to those considered as primary, mapped alignments by the aligner.
 243 }\label{suppfig:06}
 244 \end{figure}
 245
 246
 247
 248 \subsection{Resources}
 249
 250 %http://truben.no/latex/table/
 251 \begin{table}[H]
 252     \centering
 253     \caption{Resources on error-free simulated data
 254 (programs run with more processes will use more memory)}
 255     \begin{tabular}{lllll} \hline
 256     trimmed & program & CPU time(min) & mem(GB) & dataset \\ \hline
 257
 258 no &    bis1 & 61.05 & 8.27 & error-free \\
 259 no &    bis2 & 289.67 & 9.46 & error-free \\
 260 no &    bison & 179.17 & 9.27 & error-free \\
 261 no &    bsmap & 66.03 & 22.72 & error-free \\
 262 no &    bsmooth & 492.62 & 9.37 & error-free \\
 263 no &    bwa & 170.66 & 21.11 & error-free \\
 264 no &    gsnap & 48.36 & 10.25 & error-free \\
 265 no &    last & 214.68 & 25.90 & error-free \\
 266
 267 yes &    bis1 & 64.32 & 8.27 & error-free \\
 268 yes &    bis2 & 245.42 & 9.48 & error-free \\
 269 yes &    bison & 184.55 & 9.10 & error-free \\
 270 yes &    bsmap & 67.68 & 22.94 & error-free \\
 271 yes &    bsmooth & 481.99 & 9.38 & error-free \\
 272 yes &    bwa & 173.93 & 21.43 & error-free \\
 273 yes &    gsnap & 51.11 & 11.69 & error-free \\
 274 yes &    last & 138.03 & 25.90 & error-free \\
 275
 276     \end{tabular}
 277 \end{table}
 278
 279
 280
 281 \section{Improved Accuracy for Stranded Capture Experiments}
 282
 283 Below we show a reduction in the percent of off-target reads by
 284 considering only the strand targetted by our capture protocol.
 285
 286 \begin{figure}[H]%figure7
 287     \centerline{\includegraphics[width=86mm]{real-bwa-strand.eps}}
 288     \caption{For a capture experiment that targets only one strand, we can
 289     improve accuracy by considering only reads that map to that strand.
 290     Here, we show the effect from the original bwameth alignments as shown
 291     in Figure 1 compared to the alignments restricted to the strand targeted
 292     by the capture method.}\label{suppfig:07}
 293 \end{figure}
 294
 295 This is implemented in \textit{bwa-meth} via the \emph{--set-as-failed}
 296 flag so that reads mapping to a given strand are given the SAM flag
 297 indicating that they failed vendor quality control checks. Invocation
 298 would look like this:
 299
 300 \begin{lstlisting}[language=bash]
 301 bwameth.py \
 302     --reference $REF \
 303     --set-as-failed f \
 304     $READS_R1 $READS_R2
 305
 306 \end{lstlisting}
 307 to set reads mapping to the forward, or original top strand
 308 as failed.
 309
 310
 311 \section{Mapping and Trimming}
 312 We note that trimming improves accuracy for most aligners but the
 313 difference is very small for \textit{bwa-meth}.
 314
 315 \section{\textit{bwa-meth} Installation And Requirements}
 316
 317 \textit{Bwa-meth} depends on samtools and a single python library, \textit{toolshed}.
 318 The latter can be installed by running \emph{sudo python setup.py install} from the main
 319 directory of the \textit{bwa-meth} project. Samtools is a C library installed on most
 320 systems and available at https://github.com/samtools/samtools.
 321 Further installation instructions are available in the \textit{bwa-meth} README at
 322 https://github.com/brentp/bwa-meth/
 323
 324 For tabulation of methylation by CpG, \emph{Bis-SNP} \cite{bissnp} is required.
 325 The java .jar file is available from: http://sourceforge.net/projects/bissnp/files/
 326
 327 For CNV detection from BS-Seq data, the R package \emph{cn.mops} \cite{cnmops} is required.
 328 It is available
 329 from bioconductor \cite{bioconductor} at: http://bioconductor.org/packages/devel/bioc/html/cn.mops.html
 330 \section{Additional Features}
 331 \textit{Bwa-meth} and GSNAP are the only programs that output a BAM file that passes picard's ValidateSam without errors.
 332 Last does not report the proper pair information in all cases and none of the other aligners add a read-group.
 333 For the comparison, we added sorting and forced SAM output by the other aligners regardless of their default.
 334 \textit{Bwa-meth} outputs a read-group for each sample by default and allows that to be customized.
 335
 336 \textit{bwa-meth} can calculate bias of methylation estimates by location in the read:
 337
 338 \begin{lstlisting}[language=bash]
 339 python bias-plot.py input.bam ref.fa
 340 \end{lstlisting}
 341 This will create a bias-plot showing bases in the reads that should not be considered
 342 for methylation calls.
 343
 344 \textit{Bwa-meth} defers tabulation of the methylation scores to Bis-SNP \cite{bissnp} by offering a simplified interface:
 345 \begin{lstlisting}[language=bash]
 346 bwameth.py tabulate \
 347     --trim 3,3 \
 348     --map-q 30 \
 349     --bissnp BisSNP-0.82.2.jar \
 350      --reference /path/to/ref.fasta \
 351      input.bam
 352 \end{lstlisting}
 353 Where the arguments are sent to Bis-SNP to, for example trim the first and last 3 bases
 354 from each read to avoid bias.
 355
 356 A full example on real data is at:
 357 https://github.com/brentp/bwa-meth/tree/master/example/
 358
 359 %\begin{thebibliography}{}
 360 \bibliographystyle{abbrv}
 361     \bibliography{document}
 362 %\end{thebibliography}
 363
 364 \end{document}