# BSVF

[Bisulfite Sequencing Virus integration Finder](https://github.com/BGI-SZ/BSVF)

Repo URL: <https://github.com/BGI-SZ/BSVF>

## Attention

For directional libraries only. PBAT and non-directional libraries are _NOT_ supported.
## Dependencies

[bwa-meth 0.10](https://github.com/brentp/bwa-meth/tree/0a2f9fc7c3fd3c99c4212941c94be73c9c865bb1) depends on:

 + Python 2.7+ (including Python 3)
   - the `toolshed` library, which can be installed with:
      * `easy_install toolshed` or
      * `pip install toolshed`

 + the `samtools` command on the `$PATH` (<https://github.com/samtools/samtools>)

 + `bwa mem` from <https://github.com/lh3/bwa>

 + EMBOSS from <http://emboss.sourceforge.net/>
## Install

### [Nerdy](https://thelinuxexperiment.com/?s=nerdy)

The project leader wants all relevant tools included here, even those already provided by the main Linux distributions.

For problems compiling `EMBOSS`, `BWA` or `SAMTOOLS`/`HTSLIB`, please ask the original authors.

````bash
pip install toolshed
apt-get install autoconf automake make gcc perl zlib1g-dev libbz2-dev liblzma-dev libcurl4-gnutls-dev libssl-dev
#yum install autoconf automake make gcc perl-Data-Dumper zlib-devel bzip2 bzip2-devel xz-devel curl-devel openssl-devel
git clone https://github.com/BGI-SZ/BSVF.git
cd BSVF
git submodule init
git submodule update
src/install.sh
````

If EMBOSS fails to install, download the binary from the site above and put EMBOSS's `water` into `./bin`, or just symlink `water` into `./bin`.

Your `BSVF/bin/` should look like this:

````bash
-rwxr-xr-x  398860 Feb 20 00:48 bwa
-rwxr-xr-x   21892 Sep  1 08:37 bwameth.py
-rwxr-xr-x   27040 Feb 20 01:14 water
-rwxr-xr-x  971772 Feb 20 00:48 samtools
````
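One way to confirm that layout is to check that each expected tool exists and is executable. The sketch below is a hypothetical helper, not part of BSVF: the function name `missing_tools` is made up for illustration, and it assumes the four file names in the listing are the complete set that matters.

```python
import os

# Tool names taken from the listing above; assumed to be the full set.
EXPECTED_TOOLS = ("bwa", "bwameth.py", "water", "samtools")


def missing_tools(bin_dir, tools=EXPECTED_TOOLS):
    """Return the tools that are absent from bin_dir or not executable."""
    missing = []
    for name in tools:
        path = os.path.join(bin_dir, name)
        # os.path.isfile follows symlinks, so linked binaries count too.
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            missing.append(name)
    return missing


if __name__ == "__main__":
    problems = missing_tools("./bin")
    print("missing or not executable:", ", ".join(problems) or "none")
```

Run it from the top of the `BSVF` checkout so that `./bin` resolves to `BSVF/bin/`.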
### Debian

````bash
apt install libbam-dev libhts-dev python3-pip emboss bwa samtools
pip install toolshed
git clone https://github.com/BGI-SZ/BSVF.git
cd BSVF/src/analyser
make
cd ../../bin
# symlink `bwa`, `samtools` and `water` from /usr/bin/ (or wherever they are installed), e.g.:
ln -s "$(command -v bwa)" "$(command -v samtools)" "$(command -v water)" .
````
### Homebrew/Linuxbrew

````bash
brew tap Ensembl/homebrew-external
brew install emboss bwa samtools python
pip install toolshed

ln -s `which bwa` ./bin/
ln -s `which samtools` ./bin/
ln -s `which water` ./bin/

brew install gcc
cd ./src/analyser/
make
cp -av analyser/bsanalyser ../../bin/
cd ../../bin/
ls -l
````
## Citation

Gao, S., Hu, X., Xu, F., Gao, C., Xiong, K., Zhao, X., … Pedersen, C. N. S. (2018). BS-virus-finder: virus integration calling using bisulfite sequencing data. GigaScience, 7(1), 1–7. <https://doi.org/10.1093/gigascience/gix123>

## Usage

```
./bsuit <command> <config_file>

./bsuit prepare prj.ini
./bsuit aln prj.ini
./bsuit grep prj.ini
./bsuit analyse prj.ini
```

![a Logo](https://raw.githubusercontent.com/BGI-SZ/BSVF/master/logo/BSVFlogo.png)
## Test Run

```bash
mkdir sim90 && cd sim90 && ./simVirusInserts.pl GRCh38_no_alt_analysis_set.fna.gz X04615.fa.gz s90 && cd ..
mkdir sim50 && cd sim50 && ./simVirusInserts.pl GRCh38_no_alt_analysis_set.fna.gz X04615.fa.gz s50 50 ../sim90/s90.ini && cd ..

./bsuit prepare sim90/s90.ini

./bsuit aln sim90/s90.ini
./run/s90_aln.sh
./bsuit grep sim90/s90.ini
./bsuit analyse sim90/s90.ini

./bsuit aln sim50/s50.ini
./run/s50_aln.sh
./bsuit grep sim50/s50.ini
./bsuit analyse sim50/s50.ini
```
## Reference Files

 * Human: <ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz>
 * HBV: [gi|59585|emb|X04615.1| Hepatitis B virus genome, subtype ayr](http://www.ncbi.nlm.nih.gov/nuccore/X04615.1?report=GenBank)

## Simulation

    ./simVirusInserts.pl GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz HBV.X04615.fa sim150 150

### GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

A gzipped file that contains FASTA format sequences for the [following](ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/README_analysis_sets.txt):

1. chromosomes from the GRCh38 Primary Assembly unit.  
   Note: the two PAR regions on chrY have been hard-masked with Ns.  
   The chromosome Y sequence provided therefore has the same
   coordinates as the GenBank sequence but it is not identical to the
   GenBank sequence. Similarly, duplicate copies of centromeric arrays
   and WGS on chromosomes 5, 14, 19, 21 & 22 have been hard-masked
   with Ns (locations of the unmasked copies are given below).
2. mitochondrial genome from the GRCh38 non-nuclear assembly unit.
3. unlocalized scaffolds from the GRCh38 Primary Assembly unit.
4. unplaced scaffolds from the GRCh38 Primary Assembly unit.
5. Epstein-Barr virus (EBV) sequence  
   Note: The EBV sequence is not part of the genome assembly but is
   included in the analysis set as a sink for alignment of reads that
   are often present in sequencing samples.
## Format of `config_file`

### An example

```ini
[RefFiles]
HostRef=/share/HomoGRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
VirusRef=/share/work/bsvir/HBV.AJ507799.2.fa

[DataFiles]
780_T.1=/share/work/bsvir/F12HPCCCSZ0010_Upload/s00_C.bs_1.fq.gz
780_T.2=/share/work/bsvir/F12HPCCCSZ0010_Upload/s00_C.bs_2.fq.gz
s01_P.1=/share/work/bsvir/F12HPCCCSZ0010_Upload/s01_P.bs_1.fq.gz
s01_P.2=/share/work/bsvir/F12HPCCCSZ0010_Upload/s01_P.bs_2.fq.gz
;MultiLibExample.1=/test/Lib1/AAAA.1.fq.gz, /test/Lib2/AAAA.1.fq.gz, /test/Lib3/BBBB.1.fq.gz
;MultiLibExample.2=/test/Lib1/AAAA.2.fq.gz, /test/Lib2/AAAA_2.fq , /test/Lib3/BBBB.2.fq.gz
tSE_X.1=/share/work/bsvir/F12HPCCCSZ0010_Upload/s00_C.bs_1.fq.gz,/share/work/bsvir/F12HPCCCSZ0010_Upload/s01_P.bs_2.fq.gz

[InsertSizes]
780_T=200
780_T.SD=120
s01_P=200
s01_P.SD=30
;MultiLibExample=210
;MultiLibExample.SD=70
tSE_X=90
tSE_X.SD=1

[Output]
WorkDir=/share/work/bsvir/bsI
ProjectID=SZ2015

[Parameters]
Aligner=bwa-meth
MinVirusLength=20
```
## Build

You'll need `cmake`, `autoconf`, `automake` and the development libraries, as well as `gcc` and `g++`, to compile all sources.

For Mac OS X, install [Homebrew](http://brew.sh/) first. Then:

```bash
xcode-select --install
brew install autoconf automake cmake python
brew install --without-multilib gcc
```

To build the binaries:

```bash
cd src
./download.sh
./install.sh

pip install toolshed
```
### Details of `config_file`

 + For comment lines, use `;` as the first character.

 + `RefFiles` Section
   - `HostRef` is the **Host genome**.
   - `VirusRef` is the **Virus sequence**.

 + `DataFiles` Section
   - Each *Sample* needs a **unique ID** as its *SampleID*. Use `SampleID.1` and `SampleID.2` to specify paired-end sequencing data.
   - For samples with multiple PE sets, join the files with *commas* and keep their order.

 + `InsertSizes` Section
   - For each `SampleID`, use `SampleID` to specify the average insert size and `SampleID.SD` to specify its standard deviation.

 + `Output` Section
   - `WorkDir` is the output directory.
   - `ProjectID` is a **unique ID** for this analysis, defined in the `config_file`.
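The rules above can be checked before starting a run. The following is a minimal sketch using Python's standard `configparser`, not BSVF's own parser (which may behave differently); the helper names `load_config` and `check_samples` and the `/data/...` paths are hypothetical.

```python
import configparser


def load_config(text):
    """Parse config text, keeping keys such as `780_T.SD` case-sensitive."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # the default would lower-case every key
    parser.read_string(text)  # lines starting with ';' are treated as comments
    return parser


def check_samples(parser):
    """Return a list of problems found in [DataFiles] and [InsertSizes]."""
    problems = []
    samples = {}
    for key, value in parser["DataFiles"].items():
        sample_id, _, mate = key.rpartition(".")
        if not sample_id or mate not in ("1", "2"):
            problems.append(f"{key}: expected SampleID.1 or SampleID.2")
            continue
        files = [f.strip() for f in value.split(",") if f.strip()]
        samples.setdefault(sample_id, {})[mate] = files
    for sample_id, mates in samples.items():
        # Multi-library samples must list the same number of files per mate.
        if len(mates) == 2 and len(mates["1"]) != len(mates["2"]):
            problems.append(f"{sample_id}: .1 and .2 list different numbers of files")
        for key in (sample_id, sample_id + ".SD"):
            if key not in parser["InsertSizes"]:
                problems.append(f"{sample_id}: missing InsertSizes entry {key}")
    return problems


example = """
[DataFiles]
s01_P.1=/data/s01_P.bs_1.fq.gz
s01_P.2=/data/s01_P.bs_2.fq.gz
;ignored.1=/data/ignored.fq.gz
[InsertSizes]
s01_P=200
"""
print(check_samples(load_config(example)))
```

In this example the only reported problem is the missing `s01_P.SD` entry; the commented-out `ignored.1` line is skipped.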
## Description

**BSuit** is a suite to analyse xxx.
### Formats

#### Virus integration result file

````
Chr     breakpoint      virus-start virus-end virusstrand       how-many-reads-support cluster-name
Chr1    3000    200     300     +/-     20 cluster1
````
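Downstream scripts often want these rows as typed records. The sketch below assumes exactly the seven whitespace-separated columns shown in the header above; the `Integration` type and its field names are illustrative and not part of BSVF.

```python
from typing import NamedTuple


class Integration(NamedTuple):
    chrom: str
    breakpoint_pos: int
    virus_start: int
    virus_end: int
    virus_strand: str
    support_reads: int
    cluster: str


def parse_integrations(lines):
    """Yield one Integration per data line, skipping blanks and the header."""
    for line in lines:
        fields = line.split()  # columns may be separated by tabs or spaces
        if not fields or fields[:2] == ["Chr", "breakpoint"]:
            continue
        # Unpacking raises ValueError if a line does not have seven columns.
        chrom, bp, v_start, v_end, strand, support, cluster = fields
        yield Integration(chrom, int(bp), int(v_start), int(v_end),
                          strand, int(support), cluster)
```

Passing an open file handle works as well, since the function only iterates over lines.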
#### Intermediate contig information file

````
clustername contig-number chrpoint virus-integration
cluster1        contig1 chr1:3000       virus:+:200-300
cluster1        contig2 chr2:4000       virus:-:300-400
````
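The last two columns pack several values into one field. As a hedged illustration, assuming the `chrom:position` and `label:strand:start-end` layouts seen in the example rows, a data line can be unpacked like this (`parse_contig_line` is a hypothetical helper; the header line must be skipped by the caller):

```python
def parse_contig_line(line):
    """Split one data line into a dict; raises ValueError on malformed input."""
    cluster, contig, chr_point, virus_field = line.split()
    # rsplit keeps any ':' inside the chromosome name intact.
    chrom, pos = chr_point.rsplit(":", 1)
    _label, strand, span = virus_field.split(":")
    start, end = span.split("-")
    return {
        "cluster": cluster,
        "contig": contig,
        "chrom": chrom,
        "pos": int(pos),
        "virus_strand": strand,
        "virus_start": int(start),
        "virus_end": int(end),
    }
```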
## ToDo

Compare with [ViralFusionSeq [VFS]](https://sourceforge.net/projects/viralfusionseq/) and [VirusFinder 2](https://bioinfo.uth.edu/VirusFinder/) on normal WGS data.

## See also

* [Bismark](https://www.bioinformatics.babraham.ac.uk/projects/bismark/) [0.18.1](https://github.com/FelixKrueger/Bismark/releases)
* [SVDetect](https://sourceforge.net/projects/svdetect/) [r0.8b](https://sourceforge.net/projects/svdetect/files/SVDetect/0.80/)
* [ViralFusionSeq](https://sourceforge.net/projects/viralfusionseq/) and [Virus-Clip](http://web.hku.hk/~dwhho/Virus-Clip.zip)
* [VirusSeq](http://odin.mdacc.tmc.edu/~xsu1/VirusSeq.html), which uses the [MOSAIK](https://github.com/wanpinglee/MOSAIK) aligner.
* [VirusFinder 2 (VERSE)](https://bioinfo.uth.edu/VirusFinder/)
* [Vy-PER](http://www.ikmb.uni-kiel.de/vy-per/)
* [seeksv](https://github.com/qiukunlong/seeksv)
* [BS-Seeker2](https://github.com/BSSeeker/BSseeker2)

* [Sherman](https://www.bioinformatics.babraham.ac.uk/projects/sherman/)
  * For simulation, the methylation rate is set to 800.

| Tool | Sequencing Type | Programming Language | 1st Alignment * | Assembler | 2nd Alignment # | Epub Date |
|:-----|:---------------:|:------------------:|:-----:|:-----:|:-----:|:-----:|
|[VirusSeq](https://doi.org/10.1093/bioinformatics/bts665)|RNA-Seq, WGS|Perl|MOSAIK to Human|MOSAIK to Virus|MOSAIK to Hybrid|2012 Nov 08|
|[ViralFusionSeq](https://doi.org/10.1093/bioinformatics/btt011)|RNA-Seq, WGS|Perl|BWA-SW to Human|cap3, SSAKE|Blastall to Virus|2013 Jan 12|
|[VERSE(VirusFinder2)](https://doi.org/10.1186/s13073-015-0126-6)|WGS, RNA-Seq|Perl|Bowtie2 to Human, BLAT to Virus, BLASTN to Virus|Trinity|BWA-SW to Hybrid, SVDetect, CREST|2015 Jan 20|
|[Virus-Clip](https://doi.org/10.18632/oncotarget.4187)|RNA-seq|Perl|BWA-MEM to Virus|Virus-Clip|BLASTN to Human|2015 May 19|
|[Vy-PER](https://doi.org/10.1038/srep11534)|WGS, RNA-Seq|Python2|BWA-SW to Human|Vy-PER|BLAT to Virus|2015 Jul 13|
|[seeksv](https://doi.org/10.1093/bioinformatics/btw591)|WGS|C++|BWA to Hybrid|seeksv|seeksv to Hybrid|2016 Sep 14|
|BSVF|WGBS, WGS|Perl,C,C++|BWA-MEM to Hybrid|BSVF|water(EMBOSS) to Hybrid| N/A |

\* for virus-infected reads  
\# for integration information
## One More Thing

To extract the relevant PE reads within a 500 bp range of the final result (*BS.analyse*, for example):

```bash
perl -lane '$a=$F[2]-501;$b=$F[2]+501;print join("\t",$F[1],$a,$b)' ../W2BS_analyse/BS.analyse >zones.bed
vi zones.bed # remove the first (header) line
# Sort BS.bam to BS.sort.bam and index it.
samtools view -L zones.bed BS.sort.bam > zones.sam
awk '{print $1}' zones.sam | sort | uniq > zones.ids
#samtools view BS.bam | grep -F -f zones.ids >zones.PE.sam
samtools view BS.sort.bam | grep -F -f zones.ids > zones.PEs.sam # the sorted one may be more useful
```
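
The zone-building step can also be written without the manual header edit. This is a sketch and not part of BSVF: like the Perl one-liner, it assumes the chromosome is in the second whitespace-separated column and the position in the third. Unlike the one-liner, it skips any line whose position is not a number (which covers the header) and clamps negative starts to 0. The name `zones_from_analyse` is hypothetical.

```python
def zones_from_analyse(lines, flank=501, chrom_col=1, pos_col=2):
    """Yield (chrom, start, end) windows around each reported position."""
    for line in lines:
        fields = line.split()
        if len(fields) <= max(chrom_col, pos_col):
            continue  # blank or truncated line
        if not fields[pos_col].isdigit():
            continue  # header line or non-numeric position
        pos = int(fields[pos_col])
        # BED starts cannot be negative, so clamp at 0.
        yield fields[chrom_col], max(0, pos - flank), pos + flank


def write_bed(lines, out):
    """Write the windows as tab-separated BED rows to a file-like object."""
    for chrom, start, end in zones_from_analyse(lines):
        out.write(f"{chrom}\t{start}\t{end}\n")
```

For example, calling `write_bed(open("BS.analyse"), open("zones.bed", "w"))` would produce a file usable with `samtools view -L`.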