BioInfo/BS-Seq/bwa-meth/paper/document.tex

   1 \documentclass{bioinfo}
   2 \copyrightyear{2005}
   3 \pubyear{2005}
   4
   5 \begin{document}
   6 \firstpage{1}
   7
   8 \title[bwa-meth]{Fast and accurate alignment of long bisulfite-seq reads}
   9 \author[Pedersen \textit{et~al}]{
  10 Brent S. Pedersen\,$^{1,}
  11 \footnote{To whom correspondence should be addressed.  Email: bpederse@gmail.com}$,
  12 Kenneth Eyring\,$^{1}$,
  13 Subhajyoti De\,$^{1,2}$,
  14 Ivana V. Yang\,$^{1}$
  15 and David A. Schwartz\,$^1$%
  16 }
  17 \address{
  18     $^{1}$Department of Medicine, University of Colorado Denver, School of Medicine, Denver, Colorado, USA. 80045 \\
  19     $^{2}$University of Colorado Cancer Center, Molecular Oncology Program,
  20     Aurora, Colorado, USA
  21 }
  22
  23 \history{Received on XXXXX; revised on XXXXX; accepted on XXXXX}
  24
  25 \editor{Associate Editor: XXXXXXX}
  26
  27 \maketitle
  28
  29 \begin{abstract}
  30
  31     %http://www.oxfordjournals.org/our_journals/bioinformatics/for_authors/general.html
  32
  33 \section{Summary:}
  34 We introduce a new tool, \textit{bwa-meth}, to align bisulfite-treated
  35 sequences and compare it to existing aligners. We show that it is fast
  36 and accurate even without quality-trimming and
  37 the output is immediately usable by downstream tools.
  38 We gauge accuracy by comparing the number of on and off-target
  39 reads from a targeted sequencing project and by simulations.
  40
  41 \section{Availability and Implementation:}
  42 The benchmarking scripts and the \textit{bwa-meth} software are available at
  43 https://github/com/brentp/bwa-meth/ under the MIT License.
  44
  45 \section{Contact:} \href{bpederse@gmail.com}{bpederse@gmail.com}
  46 \section{Supplementary information:}
  47 Supplemental Information I
  48 \end{abstract}
  49
  50 \section{Introduction}
  51 Bisulfite sequencing (BS-Seq) is a common way to explore DNA methylation.
  52 As a result, software
  53 has been developed to map sequence reads treated with bisulfite to a reference genome \citep{frithlast,methylcoder,gsnap,krueger2011,bsmap,bsmooth}.
  54 Previous studies have compared alignment statistics on
  55 real \citep{methylcoder,bsmap,shrestha} and simulated \citep{frithlast} reads,
  56 however most of these are limited by knowledge of the ground-truth and assumptions of
  57 the simulation, respectively.
  58
  59 Here, we present an analysis of current BS-Seq mappers including the "four-base" aligners
  60 Last \citep{frithlast}, GSNAP \citep{gsnap} and BSMAP \citep{bsmap} as well as the
  61 "three-base" aligners BSmooth \citep{bsmooth}, Bison, and Bismark \citep{krueger2011} which
  62 perform \emph{in silico} conversion of cytosines to thymines. In addtion, we introduce our
  63 own, simple three-base aligner that wraps \textit{BWA mem} \citep{bwamem}.
  64
  65 The comparison is performed on 100-base paired-end reads
  66 which are of modest length by current standards, but, to our knowledge, longer than
  67 utilized in any comparison. We hypothesized that long, paired reads, with up
  68 to 200 bases from the same genomic region, could change the decision on which
  69 alignment method performed the best and that aligners which relied on global
  70 alignment might have reduced performance on longer reads.
  71 We found limitations to exisiting aligners including the writing of large temporary
  72 files, required quality-trimming, high memory-use, long run-time, output that was not
  73 suitable for consumption by traditional tools, or some combination of these
  74 inconveniences. We wrote
  75 a BS-Seq aligner based on BWA mem \citep{bwamem} to address these
  76 limitations. This new aligner,
  77 \textit{bwa-meth}, allows indels, local alignments, and it never writes a
  78 temporary-file of the reads to disk, instead streaming the \emph{in silico} converted
  79 reads directly to the aligner and streaming the alignments directly to a well-formatted
  80 alignment file suitable for use in downstream tools. In addition, it works well without
  81 quality-trimming, thereby reducing storage requirements 3-fold by not creating
  82 quality-trimmed or converted reads.
  83
  84 \section{Approach}
  85 In order to determine the accuracy of an aligner,
  86 we utilize a dataset from Agilent's SureSelect Mouse Methyl-Seq kit which
  87 captures about
  88 99 million bases from CpG dense regions in the mouse genome (a similar
  89 approach is available for human regions).
  90 We evaulate an aligner by the number of reads in the capture area as compared
  91 to outside of the capture area. While there will be off-target capture, all
  92 aligners are subject to the same assumptions. With those constraints, we can
  93 plot a receiver operating curve (ROC) with true positives as reads within
  94 and false positives as reads outside of the target regions.
  95 \citealp{shrestha} performed an unbiased analysis on 35 base reads sequenced from
  96 the X chromosome and called a read as accurately mapped if it aligned
  97 to the X chromosome; here, we map to the entire genome with 100-base paired-end reads.
  98
  99 In addition, we map 100-base paired-end indel-free reads simulated using
 100 Sherman (v0.1.6), the software from the authors of Bismark. All data were
 101 aligned to mouse genome version \textit{mm10}.
 102
 103 \begin{methods}
 104 \section{Methods}
 105 We aligned real and simulated data, both as-is and trimmed by quality (using Trim Galore
 106 [\href{http://www.bioinformatics.babraham.ac.uk/projects/trim\_galore/}{http://www.bioinformatics.babraham.ac.uk/projects/trim\_galore/}]
 107 default parameters), using the software and versions in Supplemental Table 1.
 108 We evaluated a few parameters for each method and report only the
 109 best-performing here. Likely, every aligner could show improved results with
 110 exhaustive search of the parameters, but this is a representation of
 111 reasonable selections.
 112 We considered a real read to be on-target if it was within 1001 bases
 113 of a target region.
 114
 115 We designed \textit{bwa-meth} for paired-end reads from the directional
 116 protocol but it can align single-end reads. \textit{bwa-meth} outputs alignments
 117 directly to a BAM file that is usable even by those tools that require
 118 coordinate-sorted alignments and read-groups. Since it consists
 119 of fewer than 600 lines of code and runs
 120 quickly, it can be used as a platform to test other optimizations. For example
 121 in Supplemental Figure 7, we show a feature that further reduces the
 122 number of off-target reads by considering only the single strand targeted by
 123 the SureSelect protocol.
 124
 125 Although our comparisons are on an inbred mouse strain with few polymorphisms
 126 we expect the results will hold even with insertions and deletions due to
 127 our use of BWA mem \citep{bwamem}.
 128
 129 \end{methods}
 130
 131 \begin{figure}[!tpb]%figure1
 132     \centerline{\includegraphics[width=86mm]{real-quals}}
 133     \caption{Percent of un-trimmed paired-end, 100-base reads on (y) and off (x)
 134     target for the tested aligners. Aligners that report mapping quality are
 135 shown as connected dots for each quality cut-off. Reads are limited to those
 136 considered as primary, mapped alignments by the aligner. Note that, as shown
 137 in the supplement, many aligners have better performance on trimmed reads.
 138 }\label{fig:01}
 139 \end{figure}
 140
 141 \section{Discussion}
 142
 143 \subsection{Accuracy}
 144 For the aligners that report a range of mapping quality
 145 scores--an indicator of the aligner's confidence in the
 146 alignment--we vary the score from 1 to the maximum, 255, to draw an ROC-like
 147 curve showing the trade-off between sensitivity and specificity. For the
 148 other aligners, we plot their single location. Figure \ref{fig:01} shows
 149 the on and off-target reads for our real paired-end data. Last and
 150 \textit{bwa-meth} align the most reads on target with a low percent
 151 of off-target reads, but Last can provide better control over the number
 152 of off-target reads. On trimmed data, \textit{bwa-meth}, Last, Bison and
 153 Bsmooth all perform similarly with Bison allowing very good control
 154 over the number off-target reads (Supplement Figure 2).
 155 Only Last, GSNAP, and \textit{bwa-meth} are relatively unaffected by quality-trimming.
 156
 157 Bismark performance varies depending on whether it uses bowtie version
 158 1 or 2 for the backend but in either case, (as with all of the bowtie-backed
 159 aligners) it performs better with trimmed reads.
 160
 161 A similar comparison is shown in Supplemental Figure 2 for trimmed data.
 162 We also show the ROC figures for simulated data with errors in Supplemental
 163 Figures 3 (original) and 4 (trimmed) and without errors in Supplemental
 164 Figures 5 and 6.
 165 \textit{Bwa-meth} does out-perform all aligners for simulated data.
 166
 167 \subsection{Computational Resources}
 168 We are more interested in the accuracy of a
 169 method than the speed. In general the aligners are comparable in terms of
 170 speed. While Bismark with bowtie1 is quite fast on the simulated data it
 171 can not use multiple processes and so takes the longest user-time. Bsmap
 172 takes the longest CPU time on the real data. We report the exact
 173 timings and maximum memory use in Supplemental Information.
 174 Bsmap uses the least disk, never writing an index of the reference genome
 175 and only writing the alignment files. All other aligners write an index of
 176 the reference genome. Last, bsmooth, and bismark write additional copies of the
 177 reads to disk.
 178 \textit{Bwa-meth} avoids writing the \emph{in silico}
 179 converted reads to disk by streaming them directly to the aligner and shows
 180 nearly identical accuracy without read-trimming. The trimming and conversion
 181 steps can increase data storage needs enough to be a burden in our experience.
 182
 183 None of the programs used an inordinate amount of memory, however, due to
 184 the parallelization strategy, Last did require about 10GB of shared memory
 185 per process.
 186
 187 \section{Conclusion}
 188 We have used simulated reads along with reads from a capture method that
 189 targets CpG-rich regions to compare aligners.
 190 We show that \textit{bwa-meth} is very accurate, even without quality trimming.
 191 Aligners that use global alignment benefit more from quality trimming and may
 192 also require an additional copy of the \emph{in silico} converted reads.
 193 \textit{bwa-meth} is also fast and it outputs alignments to a sorted bam
 194 that is immediately usable by downstream tools.
 195
 196 \section*{Acknowledgement}
 197 We thank Devon Ryan for several helpful conversations.
 198 \paragraph{Funding\textcolon} This was funded by R01 HL097163, R01 HL101251, 1I01BX001534, RC2 HL101715, N01 AI90052, and S10 RR031832.
 199 %\bibliographystyle{natbib}
 200 %\bibliographystyle{achemnat}
 201 %\bibliographystyle{plainnat}
 202 %\bibliographystyle{abbrv}
 203 %\bibliographystyle{bioinformatics}
 204 %
 205 %\bibliographystyle{plain}
 206 %
 207 %\bibliography{Document}
 208
 209
 210 %\begin{thebibliography}{}
 211 \bibliographystyle{natbib}
 212     \bibliography{document}
 213 %\end{thebibliography}
 214 \end{document}