# Manual of SOAPdenovo2

## What's next for SOAPdenovo2

MEGAHIT is the formal successor of SOAPdenovo2.

MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph
http://www.ncbi.nlm.nih.gov/pubmed/25609793
https://github.com/voutcn/megahit

## Introduction

SOAPdenovo is a novel short-read assembly method that can build a de novo draft assembly for human-sized genomes. The program is specially designed to assemble Illumina GA short reads. It creates new opportunities for building reference sequences and carrying out accurate analyses of unexplored genomes in a cost-effective way.

## System Requirements

SOAPdenovo is aimed at large plant and animal genomes, although it also works well on bacterial and fungal genomes. It runs on 64-bit Linux systems with a minimum of 5 GB of physical memory. For big genomes such as human, about 150 GB of memory is required.

## Installation
1. Download the pre-compiled binary for your platform, unpack it with "tar -zxf download.tgz -C ${destination folder}", and execute it directly.
2. Or download the source code, unpack it to ${destination folder} in the same way, and compile it with GNU make by running "make" in ${destination folder}/SOAPdenovo-V2.04. Then install the executable to ${destination folder}/SOAPdenovo-V2.04/bin with "make install".

## How to use it

### 1. Configuration file

For big genome projects with deep sequencing, the data is usually organized as multiple read sequence files generated from multiple libraries. The configuration file tells the assembler where to find these files and the relevant information. "example.config" is an example of such a file (see Appendix A).

The configuration file has a section for global information, followed by multiple library sections. Right now only "max_rd_len" is included in the global information section. Any read longer than max_rd_len will be cut to this length.

The library information and the information of sequencing data generated from the library should be organized in the corresponding library section. Each library section starts with the tag [LIB] and includes the following items:

<pre>
1) avg_ins
   This value indicates the average insert size of this library or the peak value position in the insert size distribution figure.
2) reverse_seq
   This option takes value 0 or 1. It tells the assembler if the read sequences need to be complementarily reversed.
   Illumina GA produces two types of paired-end libraries: a) forward-reverse, generated from fragmented DNA ends with typical insert size less than 500 bp; b) reverse-forward, generated from circularizing libraries with typical insert size greater than 2 Kb. The parameter "reverse_seq" should be set to indicate this: 0, forward-reverse; 1, reverse-forward.
3) asm_flags
   This indicator decides in which part(s) the reads are used. It takes value 1 (only contig assembly), 2 (only scaffold assembly), 3 (both contig and scaffold assembly), or 4 (only gap closure).
4) rd_len_cutoff
   The assembler will cut the reads from the current library to this length.
5) rank
   It takes integer values and decides in which order the reads are used for scaffold assembly. Libraries with the same "rank" are used at the same time during scaffold assembly.
6) pair_num_cutoff
   This parameter is the cutoff value of pair number for a reliable connection between two contigs or pre-scaffolds. The minimum number for paired-end reads and mate-pair reads is 3 and 5 respectively.
7) map_len
   This takes effect in the "map" step and is the minimum alignment length between a read and a contig required for a reliable read location. The minimum length for paired-end reads and mate-pair reads is 32 and 35 respectively.
</pre>

The assembler accepts read files in three formats: FASTA, FASTQ and BAM. The mate-pair relationship can be indicated in two ways: two sequence files with the reads of each pair in the same order, or two adjacent reads in a single file (FASTA only) belonging to a pair. If a read in a BAM file fails platform/vendor quality checks (the flag field 0x0200 is set), the read and its paired read are ignored.

In the configuration file, single-end files are indicated by "f=/path/filename" or "q=/path/filename" for FASTA or FASTQ format respectively. Paired reads in two FASTA sequence files are indicated by "f1=" and "f2=", while paired reads in two FASTQ sequence files are indicated by "q1=" and "q2=". Paired reads in a single FASTA sequence file are indicated by the "p=" item. Reads in BAM sequence files are indicated by "b=".

All the above items in each library section are optional. The assembler assigns default values for most of them. If you are not sure how to set a parameter, you can remove it from your configuration file.
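
As a rough illustration of the layout described above, the following sketch writes a minimal configuration with one global setting and a single paired-end FASTQ library. The function name `write_min_config`, the output file name `min.config` and the read paths are placeholders invented for this example; the parameter values are copied from the example in Appendix A.

```shell
#!/bin/sh
# Sketch: write a minimal one-library configuration file.
# write_min_config, min.config and the read paths are hypothetical names.
write_min_config() {
    out="$1"; read1="$2"; read2="$3"
    cat > "$out" <<EOF
max_rd_len=100
[LIB]
avg_ins=200
reverse_seq=0
asm_flags=3
rank=1
q1=$read1
q2=$read2
EOF
}

write_min_config min.config /path/to/lib_read_1.fq /path/to/lib_read_2.fq
```

Items left out of the [LIB] section (here rd_len_cutoff, pair_num_cutoff and map_len) fall back to the assembler's defaults, as noted above.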

### 2. Get started
Once the configuration file is available, a typical way to run the assembler is:
<pre>
${bin} all -s config_file -K 63 -R -o graph_prefix 1>ass.log 2>ass.err

Users can also choose to run the assembly process step by step:
step1:
${bin} pregraph -s config_file -K 63 -R -o graph_prefix 1>pregraph.log 2>pregraph.err
OR
${bin} sparse_pregraph -s config_file -K 63 -z 5000000000 -R -o graph_prefix 1>pregraph.log 2>pregraph.err

step2:
${bin} contig -g graph_prefix -R 1>contig.log 2>contig.err

step3:
${bin} map -s config_file -g graph_prefix 1>map.log 2>map.err

step4:
${bin} scaff -g graph_prefix -F 1>scaff.log 2>scaff.err
</pre>
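
The step-by-step commands above can be generated for a given binary, configuration file, K-mer size and output prefix with a small helper. This is only a sketch: the function name `soap_steps` is invented for this example, it uses the plain pregraph variant of step 1, and it assumes none of the arguments contain spaces.

```shell
#!/bin/sh
# Sketch: print the four step-by-step commands, one per line.
# soap_steps is a hypothetical helper; arguments must not contain spaces.
soap_steps() {
    bin="$1"; config="$2"; k="$3"; prefix="$4"
    echo "$bin pregraph -s $config -K $k -R -o $prefix 1>pregraph.log 2>pregraph.err"
    echo "$bin contig -g $prefix -R 1>contig.log 2>contig.err"
    echo "$bin map -s $config -g $prefix 1>map.log 2>map.err"
    echo "$bin scaff -g $prefix -F 1>scaff.log 2>scaff.err"
}

# Print the commands for review.
soap_steps ./SOAPdenovo2 example.config 63 graph_prefix
```

After reviewing the printed commands, piping the output to `sh -e` would run the steps in order and stop at the first one that fails.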

## 3. Options

### 3.1 Options for all (pregraph-contig-map-scaff)
<pre>
  -s <string>    configFile: the config file of solexa reads
  -o <string>    outputGraph: prefix of output graph file name
  -K <int>       kmer(min 13, max 63/127): kmer size, [23]
  -p <int>       n_cpu: number of cpu for use, [8]
  -a <int>       initMemoryAssumption: memory assumption initialized to avoid further reallocation, unit G, [0]
  -d <int>       KmerFreqCutoff: kmers with frequency no larger than KmerFreqCutoff will be deleted, [0]
  -R (optional)  resolve repeats by reads, [NO]
  -D <int>       EdgeCovCutoff: edges with coverage no larger than EdgeCovCutoff will be deleted, [1]
  -M <int>       mergeLevel(min 0, max 3): the strength of merging similar sequences during contiging, [1]
  -m <int>       max k when using multi kmer
  -e <int>       weight to filter arc when linearize two edges(default 0)
  -r (optional)  keep available read(*.read)
  -E (optional)  merge clean bubble before iterate
  -f (optional)  output gap related reads in map step for using SRkgf to fill gap, [NO]
  -k <int>       kmer_R2C(min 13, max 63): kmer size used for mapping read to contig, [K]
  -F (optional)  fill gaps in scaffold, [NO]
  -u (optional)  un-mask contigs with high/low coverage before scaffolding, [mask]
  -w (optional)  keep contigs weakly connected to other contigs in scaffold, [NO]
  -G <int>       gapLenDiff: allowed length difference between estimated and filled gap, [50]
  -L <int>       minContigLen: shortest contig for scaffolding, [K+2]
  -c <float>     minContigCvg: minimum contig coverage (c*avgCvg), contigs shorter than 100bp with coverage smaller than c*avgCvg will be masked before scaffolding unless -u is set, [0.1]
  -C <float>     maxContigCvg: maximum contig coverage (C*avgCvg), contigs with coverage larger than C*avgCvg or contigs shorter than 100bp with coverage larger than 0.8*C*avgCvg will be masked before scaffolding unless -u is set, [2]
  -b <float>     insertSizeUpperBound: (b*avg_ins) will be used as upper bound of insert size for large insert size ( > 1000) when handling pair-end connections between contigs if b is set to larger than 1, [1.5]
  -B <float>     bubbleCoverage: remove contig with lower coverage in bubble structure if both contigs' coverage are smaller than bubbleCoverage*avgCvg, [0.6]
  -N <int>       genomeSize: genome size for statistics, [0]
  -V (optional)  output visualization information of assembly, [NO]
</pre>

### 3.2 Options for sparse_pregraph
<pre>
  Usage: ./SOAPdenovo2 sparse_pregraph -s configFile -K kmer -z genomeSize -o outputGraph [-g maxKmerEdgeLength -d kmerFreqCutoff -e kmerEdgeFreqCutoff -R -r runMode -p n_cpu]
  -s <string>     configFile: the config file of solexa reads
  -K <int>        kmer(min 13, max 63/127): kmer size, [23]
  -g <int>        maxKmerEdgeLength(min 1, max 25): number of skipped intermediate kmers, [15]
  -z <int>        genomeSize(mandatory): estimated genome size
  -d <int>        kmerFreqCutoff: delete kmers with frequency no larger than,[1]
  -e <int>        kmerEdgeFreqCutoff: delete kmers' related edge with frequency no larger than [1]
  -R (optional)   output extra information for resolving repeats in contig step, [NO]
  -r <int>        runMode: 0 build graph & build edge and preArc, 1 load graph by prefix & build edge and preArc, 2 build graph only, 3 build edges only, 4 build preArcs only [0]
  -p <int>        n_cpu: number of cpu for use,[8]
  -o <string>     outputGraph: prefix of output graph file name
</pre>

## 4. Output files

### 4.1 These files are output as assembly results:
<pre>
a. *.contig
  contig sequences without using mate pair information.
b. *.scafSeq
  scaffold sequences (final contig sequences can be extracted by breaking down scaffold sequences at gap regions).
</pre>
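
Breaking scaffolds down at gap regions, as mentioned for *.scafSeq, could be done along the following lines. This sketch assumes that gaps appear in the FASTA sequence as runs of N (or n), which is not stated above; the function name `split_scaffolds` and the `_contigN` naming of the output records are invented for this example.

```shell
#!/bin/sh
# Sketch: split each scaffold sequence at gap regions to recover contigs.
# Assumption: gaps are written as runs of N or n in the FASTA sequence.
split_scaffolds() {
    awk '
        # Print the pieces of the current scaffold, skipping empty pieces.
        function emit(   n, parts, i, c) {
            if (name == "") return
            n = split(seq, parts, /[Nn]+/)
            c = 0
            for (i = 1; i <= n; i++)
                if (parts[i] != "") {
                    c++
                    printf(">%s_contig%d\n%s\n", name, c, parts[i])
                }
            seq = ""
        }
        /^>/ { emit(); name = substr($1, 2); next }
        { seq = seq $0 }
        END { emit() }
    ' "$@"
}
```

For example, `split_scaffolds graph_prefix.scafSeq > contigs_from_scaffolds.fa` would write one record per gap-free stretch. The sequence of each scaffold is joined across lines before splitting, so a gap that wraps over a line break is treated as a single gap.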

### 4.2 There are some other files that provide useful information for advanced users, which are listed in Appendix B.

## 5. FAQ

### 5.1 How to set K-mer size?

The program accepts odd numbers from 13 up to the maximum listed for -K in section 3.1 (63 or 127). Larger K-mers have a higher rate of uniqueness in the genome and make the graph simpler, but they require greater sequencing depth and longer reads to guarantee overlap at any genomic location.

The sparse pregraph module usually needs a K-mer length 2-10 bp smaller to achieve the same performance as the original pregraph module.
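
A K-mer size can be checked before launching a long run. The sketch below tests that the value is an odd integer between 13 and an upper limit; the limit defaults to 63 and can be raised to 127, following the "min 13, max 63/127" note on the -K option in section 3.1. The function name `check_kmer` is invented for this example.

```shell
#!/bin/sh
# Sketch: succeed only if K is an odd integer in [13, max].
# check_kmer is a hypothetical helper; max defaults to 63.
check_kmer() {
    k="$1"
    max="${2:-63}"
    # Reject empty or non-numeric input.
    case "$k" in
        ''|*[!0-9]*) return 1 ;;
    esac
    # Odd numbers end in an odd digit.
    case "$k" in
        *[13579]) ;;
        *) return 1 ;;
    esac
    [ "$k" -ge 13 ] && [ "$k" -le "$max" ]
}

if check_kmer 63; then echo "K=63 is acceptable"; fi
```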

### 5.2 How to set genome size (-z) for sparse pregraph module?

The -z parameter for sparse pregraph should be set a little larger than the real genome size; it is used to allocate memory.
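
One way to pick a value is to pad the estimated genome size by a fixed percentage. The 10% default margin in this sketch is an assumption made for illustration, not a documented recommendation, and the function name `suggest_z` is invented for this example.

```shell
#!/bin/sh
# Sketch: pad an estimated genome size (in bp) by a percentage margin.
# The 10% default is an assumption, not a value taken from the manual.
suggest_z() {
    size="$1"
    margin="${2:-10}"
    echo $(( size + size * margin / 100 ))
}

suggest_z 3000000000    # prints 3300000000
```

The result can then be passed to sparse_pregraph as the -z value.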

### 5.3 How to set library rank?

SOAPdenovo uses the paired-end libraries in order of insert size, from smaller to larger, to construct scaffolds. Libraries with the same rank are used at the same time. For example, in a dataset of a human genome, we set five ranks for five libraries with insert sizes of 200 bp, 500 bp, 2 Kb, 5 Kb and 10 Kb respectively. Ideally, the pairs in each rank provide adequate physical coverage of the genome.
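
The smaller-to-larger ordering can be turned into rank values mechanically. The sketch below sorts a list of insert sizes (in bp) and assigns increasing ranks; giving libraries with identical insert sizes the same rank is a choice made for this example, and the function name `assign_ranks` is invented.

```shell
#!/bin/sh
# Sketch: assign ranks to libraries by ascending insert size (bp).
# Assumption of this example: equal insert sizes share one rank.
assign_ranks() {
    printf '%s\n' "$@" | sort -n | awk '
        NR == 1 || $1 != prev { rank++; prev = $1 }
        { printf("avg_ins=%s rank=%d\n", $1, rank) }
    '
}

assign_ranks 2000 200 10000 500 5000
```

For the five libraries in the example above, this prints ranks 1 to 5 for insert sizes 200, 500, 2000, 5000 and 10000.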

# Appendix A: an example.config

<pre>
#maximal read length
max_rd_len=100
[LIB]
#average insert size
avg_ins=200
#if sequence needs to be reversed
reverse_seq=0
#in which part(s) the reads are used
asm_flags=3
#use only first 100 bps of each read
rd_len_cutoff=100
#in which order the reads are used while scaffolding
rank=1
# cutoff of pair number for a reliable connection (at least 3 for short insert size)
pair_num_cutoff=3
#minimum aligned length to contigs for a reliable read location (at least 32 for short insert size)
map_len=32
#a pair of fastq file, read 1 file should always be followed by read 2 file
q1=/path/**LIBNAMEA**/fastq1_read_1.fq
q2=/path/**LIBNAMEA**/fastq1_read_2.fq
#another pair of fastq file, read 1 file should always be followed by read 2 file
q1=/path/**LIBNAMEA**/fastq2_read_1.fq
q2=/path/**LIBNAMEA**/fastq2_read_2.fq
#a pair of fasta file, read 1 file should always be followed by read 2 file
f1=/path/**LIBNAMEA**/fasta1_read_1.fa
f2=/path/**LIBNAMEA**/fasta1_read_2.fa
#another pair of fasta file, read 1 file should always be followed by read 2 file
f1=/path/**LIBNAMEA**/fasta2_read_1.fa
f2=/path/**LIBNAMEA**/fasta2_read_2.fa
#fastq file for single reads
q=/path/**LIBNAMEA**/fastq1_read_single.fq
#another fastq file for single reads
q=/path/**LIBNAMEA**/fastq2_read_single.fq
#fasta file for single reads
f=/path/**LIBNAMEA**/fasta1_read_single.fa
#another fasta file for single reads
f=/path/**LIBNAMEA**/fasta2_read_single.fa
#a single fasta file for paired reads
p=/path/**LIBNAMEA**/pairs1_in_one_file.fa
#another single fasta file for paired reads
p=/path/**LIBNAMEA**/pairs2_in_one_file.fa
#bam file for single or paired reads, reads 1 in paired reads file should always be followed by reads 2
#       NOTE: If a read in bam file fails platform/vendor quality checks (the flag field 0x0200 is set), the read and its paired read are ignored.
b=/path/**LIBNAMEA**/reads1_in_file.bam
#another bam file for single or paired reads
b=/path/**LIBNAMEA**/reads2_in_file.bam
[LIB]
avg_ins=2000
reverse_seq=1
asm_flags=2
rank=2
# cutoff of pair number for a reliable connection (at least 5 for large insert size)
pair_num_cutoff=5
#minimum aligned length to contigs for a reliable read location (at least 35 for large insert size)
map_len=35
q1=/path/**LIBNAMEB**/fastq_read_1.fq
q2=/path/**LIBNAMEB**/fastq_read_2.fq
f1=/path/**LIBNAMEA**/fasta_read_1.fa
f2=/path/**LIBNAMEA**/fasta_read_2.fa
p=/path/**LIBNAMEA**/pairs_in_one_file.fa
b=/path/**LIBNAMEA**/reads_in_file.bam
</pre>

# Appendix B: output files

## 1. Output files from the command "pregraph"
<pre>
   a. *.kmerFreq
      Each row shows the number of Kmers with a frequency equal to the row number. Note that the peaks at frequencies which are integral multiples of 63 are due to the data structure.
   b. *.edge
      Each record gives the information of an edge in the pre-graph: length, Kmers on both ends, average kmer coverage, whether it's reverse-complementarily identical and the sequence.
   c. *.markOnEdge & *.path
      These two files are for using reads to solve small repeats.
   d. *.preArc
      Connections between edges which are established by the read paths.
   e. *.vertex
      Kmers at the ends of edges.
   f. *.preGraphBasic
      Some basic information about the pre-graph: number of vertices, K value, number of edges, maximum read length etc.
</pre>

## 2. Output files from the command "contig"
<pre>
   a. *.contig
      Contig information: corresponding edge index, length, kmer coverage, whether it's a tip and the sequence. Either a contig or its reverse-complementary counterpart is included. Each reverse-complementary contig index is indicated in the *.ContigIndex file.
   b. *.Arc
      Arcs coming out of each edge and their corresponding coverage by reads.
   c. *.updated.edge
      Some information for each edge in the graph: length, Kmers at both ends, index difference between the reverse-complementary edge and this one.
   d. *.ContigIndex
      Each record gives information about each contig in the *.contig file: its edge index, length, the index difference between its reverse-complementary counterpart and itself.
</pre>

## 3. Output files from the command "map"
<pre>
   a. *.peGrads
      Information for each clone library: insert size, read index upper bound, rank and pair number cutoff for a reliable link. This file can be revised manually for scaffolding tuning.
   b. *.readOnContig
      Reads' locations on contigs. Here contigs are referred to by their edge index. However, about half of them are not listed in the *.contig file because their reverse-complementary counterparts are already included.
   c. *.readInGap
      This file includes reads that could be located in gaps between contigs. This information will be used to close gaps in scaffolds if "-F" is set.
</pre>

## 4. Output files from the command "scaff"
<pre>
   a. *.newContigIndex
      Contigs are sorted according to their length before scaffolding. Their new indexes are listed in this file. This is useful if one wants to match contigs in *.contig with those in *.links.
   b. *.links
      Links between contigs which are established by read pairs. New indexes are used.
   c. *.scaf_gap
      Contigs in gaps found by the contig graph output by the contiging procedure. Here new indexes are used.
   d. *.scaf
      Contigs for each scaffold: contig index (concordant with the index in *.contig), approximate start position on scaffold, orientation, contig length, and its links to other contigs.
   e. *.gapSeq
      Gap sequences between contigs.
   f. *.scafSeq
      Sequence of each scaffold.
   g. *.contigPosInscaff
      Contigs' positions in each scaffold.
   h. *.bubbleInScaff
      Contigs that form bubble structures in scaffolds. Every two contigs form a bubble and the contig with higher coverage will be kept in the scaffold.
   i. *.scafStatistics
      Statistical information of the final scaffolds and contigs.
</pre>