http://sourceforge.net/projects/anytag/files/anytag2.0/

anytag-2.0

[2.0 Main Changes]

1, Optimize short-read alignment.
1.1, Remove huge temporary files (which stored the intermediate hits)
1.2, Load all indexes into memory, so that one read can query all indexes at once
1.3, The number of seeds can be customized by the user

2, Optimize local assembly
2.1, Add re-alignment to retrieve more reads into the local group, which increases the accuracy of the consensus sequence
2.2, Remove the pair_score module; I plan to estimate pair_score in the future
2.3, Add MSA and SAM files for all FIS. The MSA file contains the multiple alignments.
2.4, Filter read pairs that have only one end inside the FIS
2.5, Remove PCR duplicates

3, Add new command `all` to invoke aln and asm more efficiently

4, Add new function `lnk`. It tries to use mate-paired reads to build mate information for FIS. It is in development.

[The purpose of anytag]

NGS can produce read pairs in which both reads come from one original DNA fragment. If the internal gap between
the two reads can be correctly filled, we get the full-length sequence of the original DNA fragment.

[FIS]

FIS is short for Filled-In Sequence. The sequence of a FIS represents the original DNA fragment.

[AR]

AR is short for Anchoring Reads. If the gap between the two ends of an AR is well filled, it becomes a FIS.

[SR]

SR is short for Supporting Reads. SRs are used to fill in the gap of an AR. The insert size of SRs should be smaller than that of ARs.

For example, if all reads are 2*100bp:
1, inserts(bp): [250(ARs)], there are only ARs, the length of FIS will be 250bp
2, inserts(bp): [200(SRs), 400(ARs)], the length of FIS will be 400bp
3, inserts(bp): [200(SRs), 300(SRs), 600(ARs)], the length of FIS will be 600bp
4, inserts(bp): [200, 250, 300, 350, 400, 450, 500, 550(|<-SRs), 600(ARs)], the length of FIS will be 600bp
5, inserts(bp): [200, 250, 300, 350, 400, 450, 500, 550, 600(|<-SRs), 1000(ARs)], the length of FIS will be 1000bp,
   but this requires much more sequencing depth
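
The arithmetic behind these examples can be sketched in a few lines of Python. This is only an illustration, not part
of anytag: the function name fis_layout is hypothetical, and it assumes that the library with the largest insert size
supplies the ARs, that every smaller library supplies SRs, and that the gap to fill is the insert size minus the two
read lengths.

```python
def fis_layout(read_len, insert_sizes):
    """Return (sr_inserts, ar_insert, fis_len, gap) for paired reads of read_len bp."""
    sizes = sorted(insert_sizes)
    ar_insert = sizes[-1]      # assumption: the largest insert library supplies the ARs
    sr_inserts = sizes[:-1]    # assumption: all smaller libraries supply SRs
    fis_len = ar_insert        # a filled-in AR spans its whole insert
    gap = max(0, ar_insert - 2 * read_len)  # bases between the two ends that must be filled
    return sr_inserts, ar_insert, fis_len, gap


for sizes in ([250], [200, 400], [200, 300, 600]):
    print(fis_layout(100, sizes))
```

With 2*100bp reads, the 250bp library leaves a 50bp gap, while the 600bp library leaves a 400bp gap that the 200bp and
300bp SRs have to cover.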

[How does anytag work]

1, Clustering. Align SRs to the AR. The minimum overlap between an AR and its SRs should be large enough to be nearly unique, for example 30bp.
   All alignments are forward, that is, the AR and its SRs are on the same strand.
2, Local assembly. Build an overlap graph for the AR and its SRs. Pairwise alignment among the AR and all SRs is done to find overlaps.
   Then find a path from one end of the AR to the other end. To make sure the path is correct, the traversal prefers the largest overlaps.
   Finally, call the consensus sequence (FIS) for the AR.
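
A toy Python sketch of the idea in step 2, under simplifying assumptions that are not anytag's actual implementation:
overlaps are exact suffix/prefix matches instead of pairwise alignments, the walk is greedy without backtracking, and
no consensus is called. The names overlap and fill_gap and the example sequences are hypothetical.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def fill_gap(left_end, right_end, supporting_reads, min_len=4):
    """Extend from left_end through supporting reads, largest overlap first, until right_end is reached."""
    contig = left_end
    unused = list(supporting_reads)
    while True:
        k = overlap(contig, right_end, min_len)
        if k:
            return contig + right_end[k:]      # reached the other end of the AR
        scored = [(overlap(contig, r, min_len), r) for r in unused]
        k, best = max(scored, default=(0, None))
        if k == 0:
            return None                        # no read extends the contig: gap not filled
        contig += best[k:]                     # take the read with the largest overlap
        unused.remove(best)


srs = ["GAAGTCCGAT", "CGATAGGCTT", "GTACCTGAAG"]
print(fill_gap("ATGCGTAC", "GCTTAACG", srs))
```

In this example the two 8bp ends are joined through three 10bp reads, reconstructing a 30bp fragment. A real overlap
graph would also have to tolerate sequencing errors and choose between competing paths.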

[Can I use anytag on a huge genome]

Yes, anytag-2.0 can handle huge genomes. It is fast enough: roughly, constructing 10X FIS for human requires 32 CPUs * 3 days.

[Repeats]

If a repeat is shorter than the insert size of AR, anytag can build correct FISs that cross it.

[Why FIS is better for whole genome assembly than direct short-read assembly]

1, Repeats. anytag can resolve most repeats that are shorter than the insert size of AR.
2, Heterozygosity. anytag uses the Smith-Waterman algorithm to align short reads in local assembly. Most short-read assemblers do not.
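
To illustrate why Smith-Waterman alignment helps with heterozygosity, here is a minimal textbook version in Python
that returns only the best local alignment score. The scoring values (match=2, mismatch=-1, gap=-2) and the function
name are assumptions chosen for the example; they are not anytag's parameters.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b, using a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # a local alignment may restart anywhere, so scores never drop below 0
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


print(smith_waterman("ACGTACGT", "ACGTACGT"))  # identical reads: 16
print(smith_waterman("ACGTACGT", "ACGAACGT"))  # one SNP: 13
```

A read carrying a SNP still aligns across the variant with only a small score penalty, so reads from both haplotypes
can be kept in the same local group, whereas an exact-match overlap would break at that base.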

In our simulations of the dm3 genome (50X, 2 * 80bp, max insert size 570bp):
------------------------------
|Heterozygosity|Method|N50   |
------------------------------
|0.001         |anytag|150.0k|
|0.001         |velvet|16.88k|
------------------------------
|0.01          |anytag|136.0k|
|0.01          |velvet|3.48k |
------------------------------
|0.02          |anytag|125.6k|
|0.02          |velvet|1.81k |
------------------------------
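
For readers unfamiliar with the N50 column, the sketch below computes it in Python using the usual definition: sort the
sequence lengths from longest to shortest and report the length at which the running total first reaches half of the
total assembly size. The function name and the example lengths are made up for illustration and are not taken from the
simulations above.

```python
def n50(lengths):
    """Length at which the running total, longest first, reaches half of the summed lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


print(n50([80, 70, 50, 40, 30, 20]))  # total 290; 80 + 70 = 150 >= 145, so N50 = 70
```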

[The accuracy of FIS]

The accuracy of FIS is higher than that of the first several bases of Solexa reads, and also higher than that of Sanger reads.

[FIS in resequencing]

Although I haven't tested it in a resequencing project, it should be useful for detecting small INDELs and SVs, especially for detecting breakpoints.

[Feedback]

ruanjue@gmail.com ruanj@big.ac.cn Beijing Institute of Genomics, Chinese Academy of Sciences