released/pIRS.old/pIRS_simulator/ReadMe.txt

   1 1 Introduction
   2     pIRS is a program for simulating Illumina PE reads, with a series of characters generated by
   3   Illumina sequencing platform, such as insert size distribution, sequencing error, quality score
   4   and GC bias.
   5     As all know, the insert size follows a normal distribution, so users should set the mean value
   6   and standard deviation. We usually set the standard deviation to be 1/20 of the mean value. We
   7   simulate the normal distribution by Box-Muller method.
   8     The program simulates sequencing error, quality score and GC bias according to the empirical
   9   distribution profile. We provide some default profiles counted from lots of real sequencing data.
  10     To simulate reads from diploid genome, users should simulate the diploid genome sequence firstly
  11   by setting the ratio of heterozygosis SNP, heterozygosis InDel and structure variation.
  12     At last, several another should be determined: whether simulate GC bias and quality scores, in
  13   addition, the read length, coverage depth, input reference sequence, output prefix and output type.
  14
  15 2 Program framework
  16   2.1 Profile Generator
  17     Five tools are supplied. SOAP2 or BWA, soap.coverage (http://soap.genomics.org.cn/soapaligner.html)
  18   are required. The full process is shown in getprofile.sh as an example.
  19     2.1.1 GC-Depth Stat.
  20       a). run soap and soap.coverage to get .depth single file(s). gzip is OK to over it.
  21       b). run gc_coverage_bias on all depth single files. You will get gc-depth stat by 1 GC% and other files.
  22       c). run gc_coverage_bias_plot on the gc-depth stat file. You'll get PNG plot and a .gc file by 5 GC%.
  23       d). Manually check the .gc file for any abnormal levels due to the lower depth on certain GC% windows.
  24     2.1.2 Error Matrix Stat:
  25       a). run soap or bwa to get .{soap,single} or .sam file(s).
  26       b). run error_matrix_calculator on those file(s). You will get *.{count,ratio}.matrix .
  27       c). You can use error_matrix_merger to merge several .{count,ratio}.matrix files.
  28          However, it is up to you to keep the read length matches.
  29     2.1.3 Insert size & mapping ratio stat:
  30       a). run soap or bwa to get .{soap,single} or .sam file(s).
  31       b). run alignment_stator *.
  32          * alignment_stator cannot stat. mapping ratio for sam files now.
  33   2.2 Simulator
  34     Two executable file:
  35     simulate_diploid_genome: use for simulating diploid genome sequence. Read the input genome sequence and
  36                              then simulate SNP, InDel, SV(structure variation) on it. At last, output the
  37                              result genome sequence.
  38     simulate_illumina_reads: use for simulatingIllumina data, output PE-read file.
  39     Note:
  40       a) If you only want to simulate the diploid genome sequence, "simulate_diploid_genome" is enough.
  41       b) If you want to simulate sequencing data of haploid genome, only you need is "simulate_Illumina_reads".
  42       c) If you want to simulate sequencing data of diploid genome, you first need to run "simu-late_diploid_genome"
  43          to get the other diploid genome sequence, and then run "simulate_Illumina_reads" using both the original
  44          genome sequence and the previous output sequence as the input.
  45
  46 3 Usage
  47   3.1 simulate_diploid_genome
  48     Simulate the diploid genome sequence
  49
  50     Usage£ºsimulate_diploid_genome [OPTIONS]
  51       -i <string> input reference sequence (file.fa / file.fa.gz)
  52       -s <double> heterozygous SNP rate (default=0.001)
  53       -a <double> value of transition divided by transvertion for heterSNP (default=2)
  54       -d <double> small heterozygous InDel rate (default=0.0001)
  55       -v <double> structure variation (large insertion, deletion, invertion) rate (default=1e-6)
  56       -c <int>    output file type, 0: text, 1: gz compressed (default=1)
  57       -o <string> output prefix (default=ref_sequence)
  58       -h          output help information and exit
  59
  60   3.2 simulate_Illumina_reads
  61     Simulate sequencing data of Illumina sequencing platform
  62
  63     Usage£ºsimulate_illumina_reads [OPTIONS]
  64       -i <string> input reference sequence (file.fa / file.fa.gz)
  65       -I <string> another input reference for diploid genome (generated by "simu-late_diploid_genome",
  66                   set only when simulating reads of diploid genome)
  67       -s <string> input Base-calling profile (default=$Bin/Profiles/Base-calling_profile/hum20110701.bwanosnp.count.matrix)
  68       -d <string> input GC-depth profile for simulating GC bias, the default GC bias pro-files are determined
  69                   based on the twice of read length
  70       -l <int>    read length, read1 and read2 are the same length (default=100)
  71       -x <double> sequencing coverage depth (default=40)
  72       -m <int>    average insert size (default=500)
  73       -v <int>    the standard deviation of insert size (default=5% of average insert size)
  74       -e <double> average error rate (default=average error rate of the Base-calling profile)
  75       -g <int>    whether simulate GC bias, 0: no, 1: yes (default=1)
  76       -q <int>    whether simulate quality value, 0: no, 1: yes (default=1)
  77       -Q <int>    Quality value ASCII shift, generally 64 or 33 for Illumina data, default:64
  78       -f <int>    whether cyclize insert fragment (influence on PE reads¡¯ direction) 0:read1-forward, read2-reverse,
  79                   1: contrary to parameter 0 (default=0)
  80       -c <int>    output file type, 0: text, 1: gz compressed (default=1)
  81       -o <string> output prefix (default=Illumina)
  82       -h          output help information and exit
  83
  84 4 Output files
  85   4.1 Simulation result of diploid genome sequence.
  86     Example command line:
  87     simulate_diploid_genome ¨CiHuman_ref.fa ¨Cs 0.001 ¨Ca 2 ¨Cd 0.00001 ¨Cv 0.000001 ¨Co Human >simulate_seq.o 2>simulate_seq.e
  88     Output files:
  89       a) Human.snp.indel.invertion.fa.gz: another diploid genome sequence.
  90       b) Human_indel.lst: InDel information list.
  91       c) Human_snp.lst:SNP information list.
  92       d) Human_invertion.lst:invertion information list.
  93       e) simulate_seq.o, simulate_seq.e: records of the program running information.
  94   4.2 Simulation result of Illumina sequencing data for haploid genome.
  95     Example command line:
  96     simulate_Illumina_reads -iHuman_ref.fa -m 170 -l 90 -x 5 -v 10 -o Hu-man>simulate_170.o 2>simulate_170.e
  97     Output files:
  98       a) Human_90_170_1.fq.gz, Human_90_170_2.fq.gz: the pair-end read files.
  99       b) Human_90_170.error_rate.distr: the error distribution file.
 100       c) Human_90_170.insertsize.distr: the insert size distribution file.
 101       d) simulate_170.o, simulate_170.e: records of the program running information.
 102   4.3 Simulation result ofIllumina sequencing data for diploid genome.
 103     Example command line:
 104     simulate_Illumina_reads -i Human_ref.fa -I Human.snp.indel.invertion.fa.gz -m 800 -l 70 -x 5 -v 10
 105                             -o Human >simulate_800.o 2>simulate_800.e
 106     Output files:
 107       a) Human_70_800_1.fq.gz, Human_70_800_2.fq.gz: the pair-end read files.
 108       b) Human_70_800.error_rate.distr: the error distribution file.
 109       c) Human_70_800.insertsize.distr: the insert size distribution file.
 110       d) simulate_800.e, simulate_800.o: records of the program running information.
 111
 112 5 Output file format
 113   a) *.fq/*.fq.gz
 114     For example: the output file name: *_ 70_800_1.fq.gz, it mean 70bp reads, the average insert size length is 800bp.
 115     In read file1:
 116
 117       @read_800_21/1 seq 1 856549 70 813
 118       ACGGAAAAGTTACGCTATCGCATGCGTGTAAGAACACTGCTCCTACGCCCATTTTATCGATGGCGCCCAG
 119       +
 120       egcggdggfgfgggggfeggggYbcgegfgggbggg^e]egfegggfbSeggdggegg`^eJgggcbEeb
 121       @read_800_22/1 seq 2 129025 70 786 32,C;70,A;
 122       CACGGGGGGACTTTATTTAATGAGCGGCTGTAACTTGGTCCGTCGTTTGAGAGGGGACACCTCATATGAT
 123       +
 124       gggggegcgeggggcgfcgc_gf_ggfefcgVgegcfcdgf`geggdd[ge`ggafeggggdgdgee^gg
 125
 126     note:
 127     In Row-1, "800" represent for the mean insert size; "21" represent for the order of read in file;
 128     The next "1" represent for read file1; "seq" represent for id of chromosome; The next "1" represent for
 129     this read come from the reference sequence which is input by -i, for simulated data of haploid genome,
 130     it will be always "1", for simulated data of diploid genome, "1" represent for this read generates from
 131     the reference sequence which is input by -i, "2" represent for this read generates from the reference
 132     sequence which is input by -I; "856549" represent for the start position(1-based) in the chromosome;
 133     "70" represent for the length of read; "813" represent for the insert size.
 134     In Row-5, "32,C;70,A;" represent for two sequencing errors in this read, "32" and "74" represent for
 135     the error positions(1-based); "C" and "A" represent for the right Bases.
 136   b) *_snp.lst
 137     For example:
 138
 139       I 3       G       A
 140       I 45342   C       T
 141       I 104775  C       T
 142       ¡¡
 143     Column 1: chromosome sequence ID.
 144     Column 2: SNP position(1-based).
 145     Column 3: Base on raw chromosome sequence.
 146     Column 4: SNP-Base.
 147
 148       I 3 G A
 149       Ref: a t G c a // 3 is the position of G base in the chromosome base.
 150          : a t A c a // G base is a A in another diploid sequence.
 151
 152   c) *_indel.lst
 153     For example:
 154       I  - 3 1 C
 155       IV + 5 2 AC
 156       ¡¡
 157     Column 1: chromosomesequence ID.
 158     Column 2: "-" Deletion; "+" Insertion.
 159     Column 3: InDel position (1-based) of raw chromosome sequence. Note that this position represent
 160               for the prev-base of InDel region.
 161     Column 4: InDel Base number.
 162     Column 5: InDel Base.
 163
 164       I - 3 1 C
 165       Ref: a t G c a // 3 is the position of G base in the chromosome base.
 166          : a t G - a // following the G base is a deletion of "C" base.
 167       IV + 5 2 A C
 168       Ref: t t g c A - - g t t// 5 is the position of A base in the chromosome base.
 169          : t t g c A A C g t t// following the A base is a insertion of "AC" bases.
 170
 171   d) *_invertion.lst
 172     For example:
 173       I 50191   100
 174       I 948984  200
 175     Column 1: chromosome sequence ID.
 176     Column 2: invertion position (1-based) of raw chromosome sequence. Note that this position represent
 177               for the prev-base of invertion region.
 178     Column 3: invertion length.
 179
 180 6 Note
 181   a) When simulate small InDel: insertion and deletion share 1/2 of the total rate respectively, 1~6bp bases
 182   as the InDel number ,and the rate distribution of each type as below: 1bp-64.82%, 2bp-17.17%, 3bp-7.20%,
 183   4bp-7.29%, 5bp-2.18%, 6bp-1.34%, which is the empirical distribution from panda re-sequencing data, we can
 184   set the total small InDel rate with option -d. When simulate SV (structural variation): large insertion,
 185   deletion and invertion, each of them share 1/3 of total rate, here we use this rate distribution for simplicity:
 186   100bp-70%, 200bp-20%, 500bp-7%, 1000bp-2%, 2000bp-1%, we can set the total SV rate with option -v.
 187   b) Program do not simulate reads contain "N" char. If the input reference contains "N" char, reads which
 188   generated in this region will be discarded.
 189   c) The allowed max length of read bases on the cycle number of Base-calling profile. User should set read length
 190   less than half of cycle number.
 191   d) The default error rate of simulate data is in accordance with the Base-calling profile and the quality score
 192   can well reflect the probability of sequencing error. However, this situation will become inefficient when user
 193   set the average error rate different from that of profile.
 194   e) Reference sequences are parsed one chromosome at a time, and so be the output.Indiploid mode, each reference
 195   file is sampled with half depth.
 196   f) Random-seed in this program is set with system time, so when you need to simulate data in batches, each task
 197   has to run in different seconds.
 198