1 \documentclass{bioinfo
}
8 \title[bwa-meth
]{Fast and accurate alignment of long bisulfite-seq reads
}
9 \author[Pedersen
\textit{et~al
}]{
10 Brent S. Pedersen\,$^
{1,
}
11 \footnote{To whom correspondence should be addressed. Email: bpederse@gmail.com
}$,
12 Kenneth Eyring\,$^
{1}$,
13 Subhajyoti De\,$^
{1,
2}$,
15 and David A. Schwartz\,$^
1$
%
18 $^
{1}$Department of Medicine, University of Colorado Denver, School of Medicine, Denver, Colorado, USA.
80045 \\
19 $^
{2}$University of Colorado Cancer Center, Molecular Oncology Program,
23 \history{Received on XXXXX; revised on XXXXX; accepted on XXXXX
}
25 \editor{Associate Editor: XXXXXXX
}
31 %http://www.oxfordjournals.org/our_journals/bioinformatics/for_authors/general.html
34 We introduce a new tool,
\textit{bwa-meth
}, to align bisulfite-treated
35 sequences and compare it to existing aligners. We show that it is fast
36 and accurate even without quality-trimming and
37 the output is immediately usable by downstream tools.
38 We gauge accuracy by comparing the number of on and off-target
39 reads from a targeted sequencing project and by simulations.
41 \section{Availability and Implementation:
}
42 The benchmarking scripts and the
\textit{bwa-meth
} software are available at
43 https://github/com/brentp/bwa-meth/ under the MIT License.
45 \section{Contact:
} \href{bpederse@gmail.com
}{bpederse@gmail.com
}
46 \section{Supplementary information:
}
47 Supplemental Information I
50 \section{Introduction
}
51 Bisulfite sequencing (BS-Seq) is a common way to explore DNA methylation.
53 has been developed to map sequence reads treated with bisulfite to a reference genome
\citep{frithlast,methylcoder,gsnap,krueger2011,bsmap,bsmooth
}.
54 Previous studies have compared alignment statistics on
55 real
\citep{methylcoder,bsmap,shrestha
} and simulated
\citep{frithlast
} reads,
56 however most of these are limited by knowledge of the ground-truth and assumptions of
57 the simulation, respectively.
59 Here, we present an analysis of current BS-Seq mappers including the "four-base" aligners
60 Last
\citep{frithlast
}, GSNAP
\citep{gsnap
} and BSMAP
\citep{bsmap
} as well as the
61 "three-base" aligners BSmooth
\citep{bsmooth
}, Bison, and Bismark
\citep{krueger2011
} which
62 perform
\emph{in silico
} conversion of cytosines to thymines. In addtion, we introduce our
63 own, simple three-base aligner that wraps
\textit{BWA mem
} \citep{bwamem
}.
65 The comparison is performed on
100-base paired-end reads
66 which are of modest length by current standards, but, to our knowledge, longer than
67 utilized in any comparison. We hypothesized that long, paired reads, with up
68 to
200 bases from the same genomic region, could change the decision on which
69 alignment method performed the best and that aligners which relied on global
70 alignment might have reduced performance on longer reads.
71 We found limitations to exisiting aligners including the writing of large temporary
72 files, required quality-trimming, high memory-use, long run-time, output that was not
73 suitable for consumption by traditional tools, or some combination of these
74 inconveniences. We wrote
75 a BS-Seq aligner based on BWA mem
\citep{bwamem
} to address these
76 limitations. This new aligner,
77 \textit{bwa-meth
}, allows indels, local alignments, and it never writes a
78 temporary-file of the reads to disk, instead streaming the
\emph{in silico
} converted
79 reads directly to the aligner and streaming the alignments directly to a well-formatted
80 alignment file suitable for use in downstream tools. In addition, it works well without
81 quality-trimming, thereby reducing storage requirements
3-fold by not creating
82 quality-trimmed or converted reads.
85 In order to determine the accuracy of an aligner,
86 we utilize a dataset from Agilent's SureSelect Mouse Methyl-Seq kit which
88 99 million bases from CpG dense regions in the mouse genome (a similar
89 approach is available for human regions).
90 We evaulate an aligner by the number of reads in the capture area as compared
91 to outside of the capture area. While there will be off-target capture, all
92 aligners are subject to the same assumptions. With those constraints, we can
93 plot a receiver operating curve (ROC) with true positives as reads within
94 and false positives as reads outside of the target regions.
95 \citealp{shrestha
} performed an unbiased analysis on
35 base reads sequenced from
96 the X chromosome and called a read as accurately mapped if it aligned
97 to the X chromosome; here, we map to the entire genome with
100-base paired-end reads.
99 In addition, we map
100-base paired-end indel-free reads simulated using
100 Sherman (v0.1
.6), the software from the authors of Bismark. All data were
101 aligned to mouse genome version
\textit{mm10
}.
105 We aligned real and simulated data, both as-is and trimmed by quality (using Trim Galore
106 [\href{http://www.bioinformatics.babraham.ac.uk/projects/trim
\_galore/
}{http://www.bioinformatics.babraham.ac.uk/projects/trim
\_galore/
}]
107 default parameters), using the software and versions in Supplemental Table
1.
108 We evaluated a few parameters for each method and
report only the
109 best-performing here. Likely, every aligner could show improved results with
110 exhaustive search of the parameters, but this is a representation of
111 reasonable selections.
112 We considered a real read to be on-target if it was within
1001 bases
115 We designed
\textit{bwa-meth
} for paired-end reads from the directional
116 protocol but it can align single-end reads.
\textit{bwa-meth
} outputs alignments
117 directly to a BAM file that is usable even by those tools that require
118 coordinate-sorted alignments and read-groups. Since it consists
119 of fewer than
600 lines of code and runs
120 quickly, it can be used as a platform to test other optimizations. For example
121 in Supplemental Figure
7, we show a feature that further reduces the
122 number of off-target reads by considering only the single strand targeted by
123 the SureSelect protocol.
125 Although our comparisons are on an inbred mouse strain with few polymorphisms
126 we expect the results will hold even with insertions and deletions due to
127 our use of BWA mem
\citep{bwamem
}.
131 \begin{figure
}[!tpb
]%figure1
132 \centerline{\includegraphics[width=
86mm
]{real-quals
}}
133 \caption{Percent of un-trimmed paired-end,
100-base reads on (y) and off (x)
134 target for the tested aligners. Aligners that
report mapping quality are
135 shown as connected dots for each quality cut-off. Reads are limited to those
136 considered as primary, mapped alignments by the aligner. Note that, as shown
137 in the supplement, many aligners have better performance on trimmed reads.
143 \subsection{Accuracy
}
144 For the aligners that
report a range of mapping quality
145 scores--an indicator of the aligner's confidence in the
146 alignment--we vary the score from
1 to the maximum,
255, to draw an ROC-like
147 curve showing the trade-off between sensitivity and specificity. For the
148 other aligners, we plot their single location. Figure
\ref{fig:
01} shows
149 the on and off-target reads for our real paired-end data. Last and
150 \textit{bwa-meth
} align the most reads on target with a low percent
151 of off-target reads, but Last can provide better control over the number
152 of off-target reads. On trimmed data,
\textit{bwa-meth
}, Last, Bison and
153 Bsmooth all perform similarly with Bison allowing very good control
154 over the number off-target reads (Supplement Figure
2).
155 Only Last, GSNAP, and
\textit{bwa-meth
} are relatively unaffected by quality-trimming.
157 Bismark performance varies depending on whether it uses bowtie version
158 1 or
2 for the backend but in either case, (as with all of the bowtie-backed
159 aligners) it performs better with trimmed reads.
161 A similar comparison is shown in Supplemental Figure
2 for trimmed data.
162 We also show the ROC figures for simulated data with errors in Supplemental
163 Figures
3 (original) and
4 (trimmed) and without errors in Supplemental
165 \textit{Bwa-meth
} does out-perform all aligners for simulated data.
167 \subsection{Computational Resources
}
168 We are more interested in the accuracy of a
169 method than the speed. In general the aligners are comparable in terms of
170 speed. While Bismark with bowtie1 is quite fast on the simulated data it
171 can not use multiple processes and so takes the longest user-time. Bsmap
172 takes the longest CPU time on the real data. We
report the exact
173 timings and maximum memory use in Supplemental Information.
174 Bsmap uses the least disk, never writing an index of the reference genome
175 and only writing the alignment files. All other aligners write an index of
176 the reference genome. Last, bsmooth, and bismark write additional copies of the
178 \textit{Bwa-meth
} avoids writing the
\emph{in silico
}
179 converted reads to disk by streaming them directly to the aligner and shows
180 nearly identical accuracy without read-trimming. The trimming and conversion
181 steps can increase data storage needs enough to be a burden in our experience.
183 None of the programs used an inordinate amount of memory, however, due to
184 the parallelization strategy, Last did require about
10GB of shared memory
188 We have used simulated reads along with reads from a capture method that
189 targets CpG-rich regions to compare aligners.
190 We show that
\textit{bwa-meth
} is very accurate, even without quality trimming.
191 Aligners that use global alignment benefit more from quality trimming and may
192 also require an additional copy of the
\emph{in silico
} converted reads.
193 \textit{bwa-meth
} is also fast and it outputs alignments to a sorted bam
194 that is immediately usable by downstream tools.
196 \section*
{Acknowledgement
}
197 We thank Devon Ryan for several helpful conversations.
198 \paragraph{Funding
\textcolon} This was funded by R01 HL097163, R01 HL101251,
1I01BX001534, RC2 HL101715, N01 AI90052, and S10 RR031832.
199 %\bibliographystyle{natbib}
200 %\bibliographystyle{achemnat}
201 %\bibliographystyle{plainnat}
202 %\bibliographystyle{abbrv}
203 %\bibliographystyle{bioinformatics}
205 %\bibliographystyle{plain}
207 %\bibliography{Document}
210 %\begin{thebibliography}{}
211 \bibliographystyle{natbib
}
212 \bibliography{document}
213 %\end{thebibliography}