http://sourceforge.net/projects/anytag/files/anytag2.0/
1, Optimize short read alignment.
1.1, Remove the huge temporary files (which stored the intermediate hits)
1.2, Load all indexes in memory, so that one read can query all indexes at once
1.3, The number of seeds can be customized by the user

2, Optimize local assembly
2.1, Add re-alignment to retrieve more reads in the local group, which increases the accuracy of the consensus sequence
2.2, Remove the pair_score module. I plan to estimate pair_score in the future
2.3, Add MSA and SAM files for all FIS. The MSA file contains the multiple alignments.
2.4, Filter out read pairs which have only one end inside the FIS
2.5, Remove PCR duplicates

3, Add new command `all`, to invoke aln and asm more efficiently

4, Add new function `lnk`. I am trying to use mate-paired reads to build mate information for FIS. It is in development.
[The purpose of anytag]

NGS can produce read pairs that come from a single original DNA fragment. If the internal gap between the two reads is filled in correctly,
we get the full-length sequence of the original DNA fragment.

FIS is short for Filled-In Sequence. The sequence of a FIS represents the original DNA fragment.

AR is short for Anchoring Reads. If the gap between the two ends of an AR is filled in correctly, it becomes a FIS.

SR is short for Supporting Reads. SRs are used to fill in the gap of an AR. The insert size of SRs should be smaller than that of the ARs.
For example, if all of our reads are 2*100bp:
1, inserts(bp): [250(ARs)], there are only ARs, the length of FIS will be 250bp
2, inserts(bp): [200(SRs), 400(ARs)], the length of FIS will be 400bp
3, inserts(bp): [200(SRs), 300(SRs), 600(ARs)], the length of FIS will be 600bp
4, inserts(bp): [200, 250, 300, 350, 400, 450, 500, 550(|<-SRs), 600(ARs)], the length of FIS will be 600bp
5, inserts(bp): [200, 250, 300, 350, 400, 450, 500, 550, 600(|<-SRs), 1000(ARs)], the length of FIS will be 1000bp,
   but this requires much more sequencing depth
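
The library examples above can be summarized in a few lines of Python. This is only an illustrative sketch, not part of anytag: the function name `describe_libraries` is hypothetical, and it assumes that the library with the largest insert size supplies the ARs, that every smaller library supplies SRs, and that the insert size is the full fragment length including both reads.

```python
def describe_libraries(insert_sizes, read_len):
    """Split paired-end libraries into SRs and ARs and report the FIS length.

    Assumptions (for illustration only, not taken from anytag's code):
    - the largest insert size is the AR library, all others are SR libraries
    - insert size = full fragment length, so gap = insert - 2 * read_len
    """
    if not insert_sizes:
        raise ValueError("need at least one library")
    ordered = sorted(insert_sizes)
    ar_insert = ordered[-1]
    return {
        "sr_inserts": ordered[:-1],                  # libraries used to fill the gap
        "ar_insert": ar_insert,                      # library that anchors the FIS
        "fis_length": ar_insert,                     # a filled AR spans its whole insert
        "gap": max(ar_insert - 2 * read_len, 0),     # bases between the two AR ends
    }


if __name__ == "__main__":
    # Example 2 from the list above: 2*100bp reads, inserts 200 (SRs) and 400 (ARs)
    print(describe_libraries([200, 400], read_len=100))
```

For 2*100bp reads and a 400bp AR library, the internal gap to fill is 200bp, which is why SR libraries with smaller inserts are needed to bridge it.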
[How does anytag work]

1, Clustering. Align SRs to the AR. The minimum overlap between the AR and an SR should be long enough to be nearly unique, such as 30bp.
   All alignments are forward. That is, the AR and its SRs are on the same strand.
2, Local assembly. Build an overlap graph for the AR and its SRs. Pairwise alignment of the AR and all SRs is done to find overlaps.
   Then, find a path from one end of the AR to the other end. To make sure the path is correct, the largest overlaps are traversed first.
   Finally, call the consensus sequence (FIS) for the AR.
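
The path-finding idea in step 2 can be sketched as a greedy walk that always follows the largest overlap. This is a toy illustration under strong assumptions, not anytag's implementation: overlaps are exact suffix/prefix matches instead of pairwise alignments, all sequences are already on the same strand, there are no sequencing errors, and no consensus is called. The names `overlap` and `fill_gap` are hypothetical.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of `a` equal to a proper prefix of `b`.

    Returns 0 if no such overlap of at least `min_len` bases exists.
    Exact matching is a simplification; real reads need alignment.
    """
    for n in range(min(len(a), len(b) - 1), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def fill_gap(left_end, right_end, reads, min_overlap=30):
    """Greedily walk from the left AR end to the right AR end.

    At each step the unused read with the largest overlap is appended.
    Returns the filled-in sequence, or None if the walk gets stuck.
    """
    contig = left_end
    unused = list(reads)
    while True:
        # Stop as soon as the right end of the AR can be attached.
        n = overlap(contig, right_end, min_overlap)
        if n:
            return contig + right_end[n:]
        # Otherwise extend with the read that overlaps the contig the most.
        best = max(unused, key=lambda r: overlap(contig, r, min_overlap), default=None)
        if best is None:
            return None
        n = overlap(contig, best, min_overlap)
        if n == 0:
            return None
        contig += best[n:]
        unused.remove(best)
```

A real implementation has to tolerate mismatches and indels in the overlaps and resolve branches in the overlap graph; the greedy walk only shows why preferring the largest overlaps tends to keep the path on the correct fragment.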
[Can I use anytag on a huge genome]

Yes, anytag-2.0 can handle huge genomes. It is fast enough. Roughly, constructing 10X FIS for human requires 32 CPUs * 3 days.

If a repeat is shorter than the insert size of the AR, anytag can build correct FISs that cross it.
[Why is FIS better for whole genome assembly than direct short read assembly]

1, for repeats. anytag can resolve most repeats that are shorter than the insert size of the AR.
2, for heterozygosity. anytag uses the Smith-Waterman algorithm to align short reads in local assembly. Most short read assemblers cannot do this.
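
To illustrate why Smith-Waterman alignment helps with heterozygosity, here is a minimal score-only version in Python. It is a generic textbook sketch, not anytag's code: the function name and the scoring parameters (match=2, mismatch=-1, linear gap=-2) are assumptions chosen for the example.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Uses a linear gap penalty and keeps only two rows of the DP matrix.
    Scoring values are illustrative, not anytag's actual parameters.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # Two haplotypes that differ by a single SNP still align over their full length.
    print(smith_waterman("ACGTACGT", "ACGTTCGT"))
```

With these parameters, two reads that differ at one heterozygous site still score close to a perfect match, so they can be placed in the same local group, whereas exact k-mer matching would split them at the variant.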

In our simulations of the dm3 genome (50X, 2 * 80bp, max insert size 570bp):
------------------------------
|Heterozygosity|Method|N50   |
------------------------------
|0.001         |anytag|150.0k|
|0.001         |velvet|16.88k|
------------------------------

The accuracy of FIS is higher than that of the first several bases of Solexa reads, and also higher than that of Sanger reads.

Although I haven't tested it in a resequencing project, it will be useful for detecting small INDELs and SVs, especially for detecting breakpoints.

ruanjue@gmail.com ruanj@big.ac.cn Beijing Institute of Genomics, Chinese Academy of Sciences