2 pIRS is a program for simulating Illumina PE reads, with a series of characters generated by
3 Illumina sequencing platform, such as insert size distribution, sequencing error, quality score
5 As all know, the insert size follows a normal distribution, so users should set the mean value
6 and standard deviation. We usually set the standard deviation to be 1/20 of the mean value. We
7 simulate the normal distribution by Box-Muller method.
8 The program simulates sequencing error, quality score and GC bias according to the empirical
9 distribution profile. We provide some default profiles counted from lots of real sequencing data.
10 To simulate reads from diploid genome, users should simulate the diploid genome sequence firstly
11 by setting the ratio of heterozygosis SNP, heterozygosis InDel and structure variation.
12 At last, several another should be determined: whether simulate GC bias and quality scores, in
13 addition, the read length, coverage depth, input reference sequence, output prefix and output type.
17 Five tools are supplied. SOAP2 or BWA, soap.coverage (http://soap.genomics.org.cn/soapaligner.html)
18 are required. The full process is shown in getprofile.sh as an example.
20 a). run soap and soap.coverage to get .depth single file(s). gzip is OK to over it.
21 b). run gc_coverage_bias on all depth single files. You will get gc-depth stat by 1 GC% and other files.
22 c). run gc_coverage_bias_plot on the gc-depth stat file. You'll get PNG plot and a .gc file by 5 GC%.
23 d). Manually check the .gc file for any abnormal levels due to the lower depth on certain GC% windows.
24 2.1.2 Error Matrix Stat:
25 a). run soap or bwa to get .{soap,single} or .sam file(s).
26 b). run error_matrix_calculator on those file(s). You will get *.{count,ratio}.matrix .
27 c). You can use error_matrix_merger to merge several .{count,ratio}.matrix files.
28 However, it is up to you to keep the read length matches.
29 2.1.3 Insert size & mapping ratio stat:
30 a). run soap or bwa to get .{soap,single} or .sam file(s).
31 b). run alignment_stator *.
32 * alignment_stator cannot stat. mapping ratio for sam files now.
35 simulate_diploid_genome: use for simulating diploid genome sequence. Read the input genome sequence and
36 then simulate SNP, InDel, SV(structure variation) on it. At last, output the
37 result genome sequence.
38 simulate_illumina_reads: use for simulatingIllumina data, output PE-read file.
40 a) If you only want to simulate the diploid genome sequence, "simulate_diploid_genome" is enough.
41 b) If you want to simulate sequencing data of haploid genome, only you need is "simulate_Illumina_reads".
42 c) If you want to simulate sequencing data of diploid genome, you first need to run "simu-late_diploid_genome"
43 to get the other diploid genome sequence, and then run "simulate_Illumina_reads" using both the original
44 genome sequence and the previous output sequence as the input.
47 3.1 simulate_diploid_genome
48 Simulate the diploid genome sequence
50 Usage£ºsimulate_diploid_genome [OPTIONS]
51 -i <string> input reference sequence (file.fa / file.fa.gz)
52 -s <double> heterozygous SNP rate (default=0.001)
53 -a <double> value of transition divided by transvertion for heterSNP (default=2)
54 -d <double> small heterozygous InDel rate (default=0.0001)
55 -v <double> structure variation (large insertion, deletion, invertion) rate (default=1e-6)
56 -c <int> output file type, 0: text, 1: gz compressed (default=1)
57 -o <string> output prefix (default=ref_sequence)
58 -h output help information and exit
60 3.2 simulate_Illumina_reads
61 Simulate sequencing data of Illumina sequencing platform
63 Usage£ºsimulate_illumina_reads [OPTIONS]
64 -i <string> input reference sequence (file.fa / file.fa.gz)
65 -I <string> another input reference for diploid genome (generated by "simu-late_diploid_genome",
66 set only when simulating reads of diploid genome)
67 -s <string> input Base-calling profile (default=$Bin/Profiles/Base-calling_profile/hum20110701.bwanosnp.count.matrix)
68 -d <string> input GC-depth profile for simulating GC bias, the default GC bias pro-files are determined
69 based on the twice of read length
70 -l <int> read length, read1 and read2 are the same length (default=100)
71 -x <double> sequencing coverage depth (default=40)
72 -m <int> average insert size (default=500)
73 -v <int> the standard deviation of insert size (default=5% of average insert size)
74 -e <double> average error rate (default=average error rate of the Base-calling profile)
75 -g <int> whether simulate GC bias, 0: no, 1: yes (default=1)
76 -q <int> whether simulate quality value, 0: no, 1: yes (default=1)
77 -Q <int> Quality value ASCII shift, generally 64 or 33 for Illumina data, default:64
78 -f <int> whether cyclize insert fragment (influence on PE reads¡¯ direction) 0:read1-forward, read2-reverse,
79 1: contrary to parameter 0 (default=0)
80 -c <int> output file type, 0: text, 1: gz compressed (default=1)
81 -o <string> output prefix (default=Illumina)
82 -h output help information and exit
85 4.1 Simulation result of diploid genome sequence.
87 simulate_diploid_genome ¨CiHuman_ref.fa ¨Cs 0.001 ¨Ca 2 ¨Cd 0.00001 ¨Cv 0.000001 ¨Co Human >simulate_seq.o 2>simulate_seq.e
89 a) Human.snp.indel.invertion.fa.gz: another diploid genome sequence.
90 b) Human_indel.lst: InDel information list.
91 c) Human_snp.lst:SNP information list.
92 d) Human_invertion.lst:invertion information list.
93 e) simulate_seq.o, simulate_seq.e: records of the program running information.
94 4.2 Simulation result of Illumina sequencing data for haploid genome.
96 simulate_Illumina_reads -iHuman_ref.fa -m 170 -l 90 -x 5 -v 10 -o Hu-man>simulate_170.o 2>simulate_170.e
98 a) Human_90_170_1.fq.gz, Human_90_170_2.fq.gz: the pair-end read files.
99 b) Human_90_170.error_rate.distr: the error distribution file.
100 c) Human_90_170.insertsize.distr: the insert size distribution file.
101 d) simulate_170.o, simulate_170.e: records of the program running information.
102 4.3 Simulation result ofIllumina sequencing data for diploid genome.
103 Example command line:
104 simulate_Illumina_reads -i Human_ref.fa -I Human.snp.indel.invertion.fa.gz -m 800 -l 70 -x 5 -v 10
105 -o Human >simulate_800.o 2>simulate_800.e
107 a) Human_70_800_1.fq.gz, Human_70_800_2.fq.gz: the pair-end read files.
108 b) Human_70_800.error_rate.distr: the error distribution file.
109 c) Human_70_800.insertsize.distr: the insert size distribution file.
110 d) simulate_800.e, simulate_800.o: records of the program running information.
114 For example: the output file name: *_ 70_800_1.fq.gz, it mean 70bp reads, the average insert size length is 800bp.
117 @read_800_21/1 seq 1 856549 70 813
118 ACGGAAAAGTTACGCTATCGCATGCGTGTAAGAACACTGCTCCTACGCCCATTTTATCGATGGCGCCCAG
120 egcggdggfgfgggggfeggggYbcgegfgggbggg^e]egfegggfbSeggdggegg`^eJgggcbEeb
121 @read_800_22/1 seq 2 129025 70 786 32,C;70,A;
122 CACGGGGGGACTTTATTTAATGAGCGGCTGTAACTTGGTCCGTCGTTTGAGAGGGGACACCTCATATGAT
124 gggggegcgeggggcgfcgc_gf_ggfefcgVgegcfcdgf`geggdd[ge`ggafeggggdgdgee^gg
127 In Row-1, "800" represent for the mean insert size; "21" represent for the order of read in file;
128 The next "1" represent for read file1; "seq" represent for id of chromosome; The next "1" represent for
129 this read come from the reference sequence which is input by -i, for simulated data of haploid genome,
130 it will be always "1", for simulated data of diploid genome, "1" represent for this read generates from
131 the reference sequence which is input by -i, "2" represent for this read generates from the reference
132 sequence which is input by -I; "856549" represent for the start position(1-based) in the chromosome;
133 "70" represent for the length of read; "813" represent for the insert size.
134 In Row-5, "32,C;70,A;" represent for two sequencing errors in this read, "32" and "74" represent for
135 the error positions(1-based); "C" and "A" represent for the right Bases.
143 Column 1: chromosome sequence ID.
144 Column 2: SNP position(1-based).
145 Column 3: Base on raw chromosome sequence.
149 Ref: a t G c a // 3 is the position of G base in the chromosome base.
150 : a t A c a // G base is a A in another diploid sequence.
157 Column 1: chromosomesequence ID.
158 Column 2: "-" Deletion; "+" Insertion.
159 Column 3: InDel position (1-based) of raw chromosome sequence. Note that this position represent
160 for the prev-base of InDel region.
161 Column 4: InDel Base number.
162 Column 5: InDel Base.
165 Ref: a t G c a // 3 is the position of G base in the chromosome base.
166 : a t G - a // following the G base is a deletion of "C" base.
168 Ref: t t g c A - - g t t// 5 is the position of A base in the chromosome base.
169 : t t g c A A C g t t// following the A base is a insertion of "AC" bases.
175 Column 1: chromosome sequence ID.
176 Column 2: invertion position (1-based) of raw chromosome sequence. Note that this position represent
177 for the prev-base of invertion region.
178 Column 3: invertion length.
181 a) When simulate small InDel: insertion and deletion share 1/2 of the total rate respectively, 1~6bp bases
182 as the InDel number ,and the rate distribution of each type as below: 1bp-64.82%, 2bp-17.17%, 3bp-7.20%,
183 4bp-7.29%, 5bp-2.18%, 6bp-1.34%, which is the empirical distribution from panda re-sequencing data, we can
184 set the total small InDel rate with option -d. When simulate SV (structural variation): large insertion,
185 deletion and invertion, each of them share 1/3 of total rate, here we use this rate distribution for simplicity:
186 100bp-70%, 200bp-20%, 500bp-7%, 1000bp-2%, 2000bp-1%, we can set the total SV rate with option -v.
187 b) Program do not simulate reads contain "N" char. If the input reference contains "N" char, reads which
188 generated in this region will be discarded.
189 c) The allowed max length of read bases on the cycle number of Base-calling profile. User should set read length
190 less than half of cycle number.
191 d) The default error rate of simulate data is in accordance with the Base-calling profile and the quality score
192 can well reflect the probability of sequencing error. However, this situation will become inefficient when user
193 set the average error rate different from that of profile.
194 e) Reference sequences are parsed one chromosome at a time, and so be the output.Indiploid mode, each reference
195 file is sampled with half depth.
196 f) Random-seed in this program is set with system time, so when you need to simulate data in batches, each task
197 has to run in different seconds.