modified: pixi.toml
[GalaxyCodeBases.git] / BioInfo / BS-Seq / bwa-meth / paper / supplement.tex
blob8c9c5efe09b4c39204d85dc7c0392976ce7d1da2
1 \documentclass[12pt]{article}
2 %\usepackage[english]{babel}
3 %\usepackage[utf8x]{inputenc}
4 \usepackage{fullpage}
5 \usepackage{float}
6 \usepackage{booktabs}
8 \usepackage{listings}
9 \usepackage{color}
11 \title{Supplemental Information}
12 \author{Pedersen \textit{et al.}}
14 \date{}
16 \definecolor{mygreen}{rgb}{0,0.6,0}
17 \definecolor{mygray}{rgb}{0.5,0.5,0.5}
18 \definecolor{mymauve}{rgb}{0.58,0,0.82}
20 \lstset{ %
21 backgroundcolor=\color{white}, % choose the background color
22 basicstyle=\footnotesize, % size of fonts used for the code
23 breaklines=true, % automatic line breaking only at whitespace
24 captionpos=b, % sets the caption-position to bottom
25 commentstyle=\color{mygreen}, % comment style
26 escapeinside={\%*}{*)}, % if you want to add LaTeX within your code
27 keywordstyle=\color{blue}, % keyword style
28 stringstyle=\color{mymauve}, % string literal style
30 \usepackage{graphicx}
32 \begin{document}
34 \maketitle
36 \section{Advantages}
37 \textit{bwa-meth} provides a number of advantages:
39 1. ease of use -- a single script indexes the fasta, aligns the reads, and
40 tabulates methylation. Default parameters work well.
42 2. speed -- due to the efficient parallelization by bwa, bwa-meth can run on
43 as many CPU's are available on a single machine.
45 3. disk-usage -- bwa-meth can take either compressed or uncompressed files
46 and the reads are streamed directly to the aligner, never written to disk. An
47 aligner that requires trimmed reads and that writes the converted reads to
48 disk will require 3X as much storage for just the raw sequence data.
50 4. useful output -- the BAM that results from bwa-meth is sorted, indexed and
51 contains the proper mapping quality scores, alignment flags, and read-groups.
52 In addition, the header @SQ lines are sorted so that tools like Picard and
53 GATK will accept them as is.
55 5. simplicity -- we have carefully designed bwa-meth to be quite simple and
56 the code readable so that enhancements are simple. For example, see \#6.
58 6. strand-specificity -- Devon Ryan,
59 the author of Bison, noted that many capture methods target only the reverse
60 strand -- commonly called the original bottom. Though this would be simple to
61 add to any aligner, ours is the only one that currently supports this. See
62 from the supplemental figure 5 for the reduction in off-target reads that results
63 from only considering reads that align to the original bottom strand.
65 7. specificity and sensitivity -- while the difference between aligners in
66 this regard is small, we have shown that bwa-meth consistently performs as
67 the best or among the best. We believe this to be the case in general and
68 not limited to the specific cases that we have shown.
70 8. accuracy without trimming -- the accuracy is due in part to bwa's local
71 alignment. This is something that is not supported in bismark and means that
72 we can achieve very good accuracy without performing additional trimming
73 which would result in additional disk-usage and processing time.
76 \section{Aligners Compared}
78 \begin{table}[!htp]
79 \caption{Methods Compared}
80 {\begin{tabular}{lll}\\\toprule
81 software & version & arguments\\\midrule
82 bismark (bowtie1) & 0.11.1 & --gzip --bam -n 2 -l 24\\
83 bismark (bowtie2) & 0.11.1 & --gzip --bam -N 1 --score\_min L,-0.4,-0.5\\
84 bsmap & 2.74 & bsmap -v3 -m0 -x1000 -S42 -n0 -s12 -I1\\
85 bsmooth & fb3f7ef & --very-sensitive\\
86 bison & 0.3.0 & --directional --very-sensitive-local -N 1\\
87 bwa-meth & 0.10 & bwa-meth\\
88 gsnap & 2014-02-28 & --npaths 1 --quiet-if-excessive -k 15\\
89 last & 392 & last-bisulfite-paired.sh\\\bottomrule
90 \end{tabular}}
91 \end{table}
92 We used a modified version of the calling script for last available in the
93 github repository.
95 \section{Comparison On Real Data}
97 We compared 7 aligners on real data as described in the text. Supplemental
98 Figure 1 below shows the same data as Figure 1 from the main text.
100 \begin{figure}[H]%supplemental figure1
101 \centerline{\includegraphics[width=125mm]{real-quals.eps}}
102 \caption{Percent of paired-end, 100-base reads on (y) and off (x) target for
103 the tested aligners. Aligners that report mapping quality are shown as lines
104 that span each quality cut-off. Reads are limited to those considered as primary, mapped alignments by the aligner. This is a color version of figure 1 from the paper}\label{suppfig:01}
105 \end{figure}
107 When we trim the reads with \emph{trim\_galore} before aligning, the result is shown below in Supplemental Figure 2. We report the percent of reads aligned relative
108 to the original count in the untrimmed data because we are interested in the
109 overall mapping rate.
111 %\begin{figure}[!htpb]%figure2
112 \begin{figure}[H]%figure2
113 \centerline{\includegraphics[width=125mm]{real-trim-quals.eps}}
114 \caption{Percent of paired-end, \emph{trimmed} 100-base reads on (y) and off (x) target for the tested aligners. Aligners that report mapping quality are shown as connected dots for each quality cut-off. Reads are limited to those considered as primary, mapped alignments by the aligner.}\label{suppfig:02}
115 \end{figure}
117 \subsection{Resources}
119 %http://truben.no/latex/table/
120 \begin{table}[H]
121 \centering
122 \caption{Resources on real data}
123 \begin{tabular}{lllll} \hline
124 trimmed & program & CPU time(min) & mem(GB) & dataset \\ \hline
126 no & bis1 & 588.68 & 8.30 & real \\
127 no & bis2 & 2067.23 & 9.88 & real \\
128 no & bison & 1639.25 & 10.34 & real \\
129 no & bsmap & 8498.52 & 23.06 & real \\
130 no & bsmooth & 5469.96 & 9.94 & real \\
131 no & bwa & 490.37 & 17.70 & real \\
132 no & gsnap & 5309.48 & 12.73 & real \\
133 no & last & 611.22 & 34.53 & real \\
135 yes & bis1 & 581.60 & 8.28 & real \\
136 yes & bis2 & 1921.00 & 9.64 & real \\
137 yes & bison & 1433.06 & 10.06 & real \\
138 yes & bsmap & 7259.29 & 23.19 & real \\
139 yes & bsmooth & 5189.16 & 9.46 & real \\
140 yes & bwa & 413.69 & 17.65 & real \\
141 yes & gsnap & 4565.57 & 13.93 & real \\
142 yes & last & 560.53 & 34.53 & real \\
144 \end{tabular}
145 \end{table}
147 \section{Comparison On Simulated Data}
148 Paired-end, 100-base reads were simulated using the tool \emph{Sherman}
149 (http://www.bioinformatics.babraham.ac.uk/projects/sherman/) with 95\% of the
150 reads simulated from mm10 and 5\% of the reads from \emph{E. coli}.
151 Reads were aligned with the parameters in Supplemental Table 1.
152 Reads from \emph{E. coli} aligning to \emph{mm10} were considered as
153 off-target reads. The Y-axis in the plot is the percent of the \emph{mm10}
154 reads. Exact parameters sent to Sherman are here:
155 https://github.com/brentp/bwa-meth/blob/master/compare/src/gen-simulated.sh
157 Note that these reads have no insertions or deletions. We simulated the
158 \emph{E. coli} reads to be 98\% unmethylated and we used a per-base error
159 rate of 1\% for the simulations.
161 Supplemental Figure 3 below shows the result for these simulations.
163 \begin{figure}[H]%figure3
164 \centerline{\includegraphics[width=125mm]{sim-quals.eps}}
165 \caption{Percent of paired-end, \emph{simulated} 100-base reads on (y) and off (x)
166 target for the tested aligners. Aligners that report mapping quality are shown as
167 connected dots for each quality cut-off. Reads are limited to those considered as
168 primary, mapped alignments by the aligner.
169 }\label{suppfig:03}
170 \end{figure}
172 Supplemental Figure 4 below shows the result for trimmed and simulated reads.
173 Trimming removes some read-pairs if one or both reads had low-quality. We
174 report the percent of reads aligned relative to the original,
175 un-trimmed number from \emph{mm10} since we are interested in the overall
176 mapping rate.
178 \begin{figure}[H]%figure3
179 \centerline{\includegraphics[width=125mm]{sim-trim-quals.eps}}
180 \caption{Percent of paired-end, \emph{trimmed}, \emph{simulated} 100-base reads on (y) and off (x) target for the tested aligners. Aligners that report mapping quality are shown as connected dots for each quality cut-off. Reads are limited to those considered as primary, mapped alignments by the aligner.
181 }\label{suppfig:04}
182 \end{figure}
184 \subsection{Resources}
186 %http://truben.no/latex/table/
187 \begin{table}[H]
188 \centering
189 \caption{Resources on simulated data with 1\% error-rate
190 (programs run with more processes will use more memory)}
191 \begin{tabular}{lllll} \hline
192 trimmed & program & CPU time(min) & mem(GB) & dataset \\ \hline
194 no & bis1 & 69.07 & 8.28 & sim \\
195 no & bis2 & 271.13 & 9.66 & sim \\
196 no & bison & 222.23 & 9.51 & sim \\
197 no & bsmap & 227.35 & 22.99 & sim \\
198 no & bsmooth & 501.97 & 9.52 & sim \\
199 no & bwa & 145.56 & 19.40 & sim \\
200 no & gsnap & 265.41 & 10.84 & sim \\
201 no & last & 138.50 & 25.90 & sim \\
203 yes & bis1 & 62.92 & 8.27 & sim \\
204 yes & bis2 & 227.30 & 9.47 & sim \\
205 yes & bison & 154.81 & 9.16 & sim \\
206 yes & bsmap & 93.70 & 22.84 & sim \\
207 yes & bsmooth & 461.71 & 9.31 & sim \\
208 yes & bwa & 199.15 & 22.25 & sim \\
209 yes & gsnap & 91.98 & 10.78 & sim \\
210 yes & last & 113.80 & 25.90 & sim \\
213 \end{tabular}
214 \end{table}
216 \section{Comparison On Error-Free Simulated Data}
218 Paired-end, error-free 100-base reads were simulated as above but
219 without contamination or errors.
220 Exact parameters sent to Sherman are here:
221 https://github.com/brentp/bwa-meth/blob/master/compare/src/gen-simulated.sh
223 Supplemental Figure 5 below shows the result for these simulations.
225 \begin{figure}[H]%figure5
226 \centerline{\includegraphics[width=125mm]{noerrorsim-quals.eps}}
227 \caption{Percent of paired-end, \emph{error-free, simulated} 100-base reads on (y) and off (x)
228 target for the tested aligners. Aligners that report mapping quality are shown as
229 connected dots for each quality cut-off. Reads are limited to those considered as
230 primary, mapped alignments by the aligner.
231 }\label{suppfig:05}
232 \end{figure}
234 Supplemental Figure 6 below shows the result for trimmed and simulated reads.
235 Trimming removes some read-pairs if one or both reads had low-quality. We
236 report the percent of reads aligned relative to the original,
237 un-trimmed number from \emph{mm10} since we are interested in the overall
238 mapping rate.
240 \begin{figure}[H]%figure6
241 \centerline{\includegraphics[width=125mm]{noerrorsim-trim-quals.eps}}
242 \caption{Percent of paired-end, \emph{trimmed}, \emph{error-free, simulated} 100-base reads on (y) and off (x) target for the tested aligners. Aligners that report mapping quality are shown as connected dots for each quality cut-off. Reads are limited to those considered as primary, mapped alignments by the aligner.
243 }\label{suppfig:06}
244 \end{figure}
248 \subsection{Resources}
250 %http://truben.no/latex/table/
251 \begin{table}[H]
252 \centering
253 \caption{Resources on error-free simulated data
254 (programs run with more processes will use more memory)}
255 \begin{tabular}{lllll} \hline
256 trimmed & program & CPU time(min) & mem(GB) & dataset \\ \hline
258 no & bis1 & 61.05 & 8.27 & error-free \\
259 no & bis2 & 289.67 & 9.46 & error-free \\
260 no & bison & 179.17 & 9.27 & error-free \\
261 no & bsmap & 66.03 & 22.72 & error-free \\
262 no & bsmooth & 492.62 & 9.37 & error-free \\
263 no & bwa & 170.66 & 21.11 & error-free \\
264 no & gsnap & 48.36 & 10.25 & error-free \\
265 no & last & 214.68 & 25.90 & error-free \\
267 yes & bis1 & 64.32 & 8.27 & error-free \\
268 yes & bis2 & 245.42 & 9.48 & error-free \\
269 yes & bison & 184.55 & 9.10 & error-free \\
270 yes & bsmap & 67.68 & 22.94 & error-free \\
271 yes & bsmooth & 481.99 & 9.38 & error-free \\
272 yes & bwa & 173.93 & 21.43 & error-free \\
273 yes & gsnap & 51.11 & 11.69 & error-free \\
274 yes & last & 138.03 & 25.90 & error-free \\
276 \end{tabular}
277 \end{table}
281 \section{Improved Accuracy for Stranded Capture Experiments}
283 Below we show a reduction in the percent of off-target reads by
284 considering only the strand targetted by our capture protocol.
286 \begin{figure}[H]%figure7
287 \centerline{\includegraphics[width=86mm]{real-bwa-strand.eps}}
288 \caption{For a capture experiment that targets only one strand, we can
289 improve accuracy by considering only reads that map to that strand.
290 Here, we show the effect from the original bwameth alignments as shown
291 in Figure 1 compared to the alignments restricted to the strand targeted
292 by the capture method.}\label{suppfig:07}
293 \end{figure}
295 This is implemented in \textit{bwa-meth} via the \emph{--set-as-failed}
296 flag so that reads mapping to a given strand are given the SAM flag
297 indicating that they failed vendor quality control checks. Invocation
298 would look like this:
300 \begin{lstlisting}[language=bash]
301 bwameth.py \
302 --reference $REF \
303 --set-as-failed f \
304 $READS_R1 $READS_R2
306 \end{lstlisting}
307 to set reads mapping to the forward, or original top strand
308 as failed.
311 \section{Mapping and Trimming}
312 We note that trimming improves accuracy for most aligners but the
313 difference is very small for \textit{bwa-meth}.
315 \section{\textit{bwa-meth} Installation And Requirements}
317 \textit{Bwa-meth} depends on samtools and a single python library, \textit{toolshed}.
318 The latter can be installed by running \emph{sudo python setup.py install} from the main
319 directory of the \textit{bwa-meth} project. Samtools is a C library installed on most
320 systems and available at https://github.com/samtools/samtools.
321 Further installation instructions are available in the \textit{bwa-meth} README at
322 https://github.com/brentp/bwa-meth/
324 For tabulation of methylation by CpG, \emph{Bis-SNP} \cite{bissnp} is required.
325 The java .jar file is available from: http://sourceforge.net/projects/bissnp/files/
327 For CNV detection from BS-Seq data, the R package \emph{cn.mops} \cite{cnmops} is required.
328 It is available
329 from bioconductor \cite{bioconductor} at: http://bioconductor.org/packages/devel/bioc/html/cn.mops.html
330 \section{Additional Features}
331 \textit{Bwa-meth} and GSNAP are the only programs that output a BAM file that passes picard's ValidateSam without errors.
332 Last does not report the proper pair information in all cases and none of the other aligners add a read-group.
333 For the comparison, we added sorting and forced SAM output by the other aligners regardless of their default.
334 \textit{Bwa-meth} outputs a read-group for each sample by default and allows that to be customized.
336 \textit{bwa-meth} can calculate bias of methylation estimates by location in the read:
338 \begin{lstlisting}[language=bash]
339 python bias-plot.py input.bam ref.fa
340 \end{lstlisting}
341 This will create a bias-plot showing bases in the reads that should not be considered
342 for methylation calls.
344 \textit{Bwa-meth} defers tabulation of the methylation scores to Bis-SNP \cite{bissnp} by offering a simplified interface:
345 \begin{lstlisting}[language=bash]
346 bwameth.py tabulate \
347 --trim 3,3 \
348 --map-q 30 \
349 --bissnp BisSNP-0.82.2.jar \
350 --reference /path/to/ref.fasta \
351 input.bam
352 \end{lstlisting}
353 Where the arguments are sent to Bis-SNP to, for example trim the first and last 3 bases
354 from each read to avoid bias.
356 A full example on real data is at:
357 https://github.com/brentp/bwa-meth/tree/master/example/
359 %\begin{thebibliography}{}
360 \bibliographystyle{abbrv}
361 \bibliography{document}
362 %\end{thebibliography}
364 \end{document}