[Bisulfite Sequencing Virus integration Finder](https://github.com/BGI-SZ/BSVF)

Repo URL: <https://github.com/BGI-SZ/BSVF>

For directional libraries only. PBAT and non-directional libraries are _NOT_ supported.

[bwa-meth 0.10](https://github.com/brentp/bwa-meth/tree/0a2f9fc7c3fd3c99c4212941c94be73c9c865bb1) depends on:

+ python 2.7+ (including python3)
  - `toolshed` library, which can be installed with:
    * `easy_install toolshed` or
    * `pip install toolshed`
+ samtools command on the `$PATH` (<https://github.com/samtools/samtools>)
+ bwa mem from: <https://github.com/lh3/bwa>
+ EMBOSS from: <http://emboss.sourceforge.net/>

### [Nerdy](https://thelinuxexperiment.com/?s=nerdy)

The project leader wants all relevant tools included here, even those already provided by the main Linux distributions.

For problems compiling `EMBOSS`, `BWA` or `SAMTOOLS`/`HTSLIB`, please ask the original developers of those tools.

```bash
apt-get install autoconf automake make gcc perl zlib1g-dev libbz2-dev liblzma-dev libcurl4-gnutls-dev libssl-dev
#yum install autoconf automake make gcc perl-Data-Dumper zlib-devel bzip2 bzip2-devel xz-devel curl-devel openssl-devel
git clone https://github.com/BGI-SZ/BSVF.git
```

If EMBOSS fails to install, download a binary from the site above and put the EMBOSS `water` program into `./bin`, or simply link `water` into `./bin`.

Your `BSVF/bin/` should look like this:

```
-rwxr-xr-x 398860 Feb 20 00:48 bwa
-rwxr-xr-x  21892 Sep  1 08:37 bwameth.py
-rwxr-xr-x  27040 Feb 20 01:14 water
-rwxr-xr-x 971772 Feb 20 00:48 samtools
```
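
A quick way to confirm that the directory matches the listing above is to check each tool for presence and the executable bit. The following is a minimal sketch, not part of BSVF; the function name `missing_tools` and the default `./bin` path are assumptions made for illustration.

```python
import os
from pathlib import Path

# Tool names taken from the directory listing above.
REQUIRED_TOOLS = ("bwa", "bwameth.py", "water", "samtools")


def missing_tools(bin_dir, required=REQUIRED_TOOLS):
    """Return the names in `required` that are absent or not executable in bin_dir.

    Hypothetical helper for illustration only.
    """
    bin_path = Path(bin_dir)
    missing = []
    for name in required:
        candidate = bin_path / name
        # is_file() follows symlinks, so a linked `water` counts as present.
        if not (candidate.is_file() and os.access(candidate, os.X_OK)):
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_tools("./bin")
    print("all tools present" if not absent else "missing: " + ", ".join(absent))
```

Run it from the top of the `BSVF` checkout; any name it reports still needs to be built, copied or linked into `./bin`.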

```bash
apt install libbam-dev libhts-dev python3-pip emboss bwa samtools
git clone https://github.com/BGI-SZ/BSVF.git
# Then symlink `bwa`, `samtools` and `water` from /usr/bin/ (or wherever they are installed) into ./bin/
```

### Homebrew/Linuxbrew

```bash
brew tap Ensembl/homebrew-external
brew install emboss bwa samtools python
ln -s "$(which bwa)" ./bin/
ln -s "$(which samtools)" ./bin/
ln -s "$(which water)" ./bin/
```

```bash
cp -av analyser/bsanalyser ../../bin/
```

Gao, S., Hu, X., Xu, F., Gao, C., Xiong, K., Zhao, X., … Pedersen, C. N. S. (2018). BS-virus-finder: virus integration calling using bisulfite sequencing data. GigaScience, 7(1), 1–7. <https://doi.org/10.1093/gigascience/gix123>

```bash
./bsuit <command> <config_file>

./bsuit prepare prj.ini
./bsuit analyse prj.ini
```

![a Logo](https://raw.githubusercontent.com/BGI-SZ/BSVF/master/logo/BSVFlogo.png)

```bash
mkdir sim90 && cd sim90 && ./simVirusInserts.pl GRCh38_no_alt_analysis_set.fna.gz X04615.fa.gz s90 && cd ..
mkdir sim50 && cd sim50 && ./simVirusInserts.pl GRCh38_no_alt_analysis_set.fna.gz X04615.fa.gz s50 50 ../sim90/s90.ini && cd ..

./bsuit prepare sim90/s90.ini
./bsuit aln sim90/s90.ini
./bsuit grep sim90/s90.ini
./bsuit analyse sim90/s90.ini

./bsuit aln sim50/s50.ini
./bsuit grep sim50/s50.ini
./bsuit analyse sim50/s50.ini
```
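
The command sequence above can also be driven from a small script, so that a failing step stops the run. This is a sketch under assumptions: the step order is copied from the `sim90` example, the names `bsuit_commands` and `run_pipeline` are hypothetical, and `./bsuit` is assumed to be in the current directory.

```python
import subprocess

# Step order as used for sim90 in the example above.
STEPS = ("prepare", "aln", "grep", "analyse")


def bsuit_commands(config_file, steps=STEPS, bsuit="./bsuit"):
    """Build one argument list per step, e.g. ['./bsuit', 'aln', 'prj.ini']."""
    return [[bsuit, step, config_file] for step in steps]


def run_pipeline(config_file, steps=STEPS, bsuit="./bsuit"):
    """Run the steps in order; check=True raises if a step exits non-zero."""
    for command in bsuit_commands(config_file, steps, bsuit):
        print("running:", " ".join(command))
        subprocess.run(command, check=True)
```

For the `sim50` run, which skips `prepare` in the example, the call would be `run_pipeline("sim50/s50.ini", steps=("aln", "grep", "analyse"))`.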

* Human: <ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz>
* HBV: [gi|59585|emb|X04615.1| Hepatitis B virus genome, subtype ayr](http://www.ncbi.nlm.nih.gov/nuccore/X04615.1?report=GenBank)

```bash
./simVirusInserts.pl GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz HBV.X04615.fa sim150 150
```

### GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

A gzipped file that contains FASTA format sequences for the [following](ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/README_analysis_sets.txt):

1. chromosomes from the GRCh38 Primary Assembly unit.
   Note: the two PAR regions on chrY have been hard-masked with Ns.
   The chromosome Y sequence provided therefore has the same
   coordinates as the GenBank sequence but it is not identical to the
   GenBank sequence. Similarly, duplicate copies of centromeric arrays
   and WGS on chromosomes 5, 14, 19, 21 & 22 have been hard-masked
   with Ns (locations of the unmasked copies are given below).
2. mitochondrial genome from the GRCh38 non-nuclear assembly unit.
3. unlocalized scaffolds from the GRCh38 Primary Assembly unit.
4. unplaced scaffolds from the GRCh38 Primary Assembly unit.
5. Epstein-Barr virus (EBV) sequence
   Note: The EBV sequence is not part of the genome assembly but is
   included in the analysis set as a sink for alignment of reads that
   are often present in sequencing samples.

## Format of `config_file`

```ini
HostRef=/share/HomoGRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
VirusRef=/share/work/bsvir/HBV.AJ507799.2.fa

780_T.1=/share/work/bsvir/F12HPCCCSZ0010_Upload/s00_C.bs_1.fq.gz
780_T.2=/share/work/bsvir/F12HPCCCSZ0010_Upload/s00_C.bs_2.fq.gz
s01_P.1=/share/work/bsvir/F12HPCCCSZ0010_Upload/s01_P.bs_1.fq.gz
s01_P.2=/share/work/bsvir/F12HPCCCSZ0010_Upload/s01_P.bs_2.fq.gz
;MultiLibExample.1=/test/Lib1/AAAA.1.fq.gz, /test/Lib2/AAAA.1.fq.gz, /test/Lib3/BBBB.1.fq.gz
;MultiLibExample.2=/test/Lib1/AAAA.2.fq.gz, /test/Lib2/AAAA_2.fq , /test/Lib3/BBBB.2.fq.gz
tSE_X.1=/share/work/bsvir/F12HPCCCSZ0010_Upload/s00_C.bs_1.fq.gz,/share/work/bsvir/F12HPCCCSZ0010_Upload/s01_P.bs_2.fq.gz

;MultiLibExample.SD=70

WorkDir=/share/work/bsvir/bsI
```

To compile all sources you'll need `cmake`, `autoconf`, `automake` and the development libraries, as well as `gcc` and `g++`.

For Mac OS X, install [Homebrew](http://brew.sh/) first. Then:

```bash
xcode-select --install
brew install autoconf automake cmake python
brew install --without-multilib gcc
```

To build the binaries:

+ For comment lines, use `;` as the first character.

- `HostRef` is the **host genome**.
- `VirusRef` is the **virus sequence**.

+ `DataFiles` Section
  - Each *sample* needs a **unique ID** as its *SampleID*. Use `SampleID.1` and `SampleID.2` to specify paired-end sequencing data.
  - For samples with multiple PE sets, join the files with *commas* and keep them in the same order in both lists.

+ `InsertSizes` Section
  - For each sample, use the key `SampleID` to specify the average insert size and `SampleID.SD` to specify its standard deviation.

- `WorkDir` is the output directory.
- `ProjectID` is a **unique ID** for the analysis defined in this `config_file`.
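
The rules above map onto a standard INI parser. The sketch below shows one way to read the `DataFiles` and `InsertSizes` sections with Python's `configparser`; it is not the parser that `bsuit` itself uses, and the function name `read_samples`, the example paths and the insert-size values are hypothetical.

```python
import configparser

# Hypothetical example; paths and sizes are made up for illustration.
EXAMPLE = """
[DataFiles]
s01_P.1=/data/lib1_1.fq.gz, /data/lib2_1.fq.gz
s01_P.2=/data/lib1_2.fq.gz, /data/lib2_2.fq.gz
;Ignored.1=/data/commented_out.fq.gz

[InsertSizes]
s01_P=200
s01_P.SD=30
"""


def read_samples(config_text):
    """Group DataFiles entries by SampleID and attach InsertSizes values."""
    parser = configparser.ConfigParser(
        delimiters=("=",), comment_prefixes=(";",), interpolation=None
    )
    parser.optionxform = str  # keep the case of keys such as `s01_P.SD`
    parser.read_string(config_text)

    samples = {}
    if parser.has_section("DataFiles"):
        for key, value in parser["DataFiles"].items():
            # `SampleID.1` -> sample "SampleID", mate "1"
            sample_id, _, mate = key.rpartition(".")
            files = [path.strip() for path in value.split(",")]
            samples.setdefault(sample_id, {"reads": {}})["reads"][mate] = files

    if parser.has_section("InsertSizes"):
        for key, value in parser["InsertSizes"].items():
            if key.endswith(".SD"):
                entry = samples.setdefault(key[:-3], {"reads": {}})
                entry["insert_sd"] = float(value)
            else:
                entry = samples.setdefault(key, {"reads": {}})
                entry["insert_mean"] = float(value)
    return samples


if __name__ == "__main__":
    for sample_id, info in read_samples(EXAMPLE).items():
        print(sample_id, info)
```

Splitting on commas and stripping whitespace tolerates both the `a, b` and `a,b` spellings used in the example configuration.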

**BSuit** is a suite to analyse xxx.

```
Chr breakpoint virus-start virus-end virusstrand how-many-reads-support cluster-name
Chr1 3000 200 300 +/- 20 cluster1
```
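
A record in the layout above can be read into a typed structure for downstream filtering. This is a minimal sketch that assumes whitespace-separated fields in exactly the seven-column order shown; the real output may differ, and the names `Integration` and `parse_integration` are hypothetical.

```python
from typing import NamedTuple


class Integration(NamedTuple):
    """One integration record; field names follow the header shown above."""
    chrom: str
    breakpoint: int
    virus_start: int
    virus_end: int
    virus_strand: str
    support_reads: int
    cluster: str


def parse_integration(line):
    """Parse one whitespace-separated record (assumed seven columns)."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError("expected 7 columns, got %d" % len(fields))
    chrom, breakpoint, v_start, v_end, strand, reads, cluster = fields
    return Integration(
        chrom, int(breakpoint), int(v_start), int(v_end), strand, int(reads), cluster
    )
```

For example, `parse_integration("Chr1 3000 200 300 + 20 cluster1").support_reads` gives the read count as an integer, which makes it easy to keep only well-supported clusters.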

```
clustername contig-number chrpoint virus-integration
cluster1 contig1 chr1:3000 virus:+:200-300
cluster1 contig2 chr2:4000 virus:-:300-400
```

Compare with [ViralFusionSeq [VFS]](https://sourceforge.net/projects/viralfusionseq/) and [VirusFinder 2](https://bioinfo.uth.edu/VirusFinder/) on normal WGS data.

* [Bismark](https://www.bioinformatics.babraham.ac.uk/projects/bismark/) [0.18.1](https://github.com/FelixKrueger/Bismark/releases)
* [SVDetect](https://sourceforge.net/projects/svdetect/) [r0.8b](https://sourceforge.net/projects/svdetect/files/SVDetect/0.80/)
* [ViralFusionSeq](https://sourceforge.net/projects/viralfusionseq/) and [Virus-Clip](http://web.hku.hk/~dwhho/Virus-Clip.zip)
* [VirusSeq](http://odin.mdacc.tmc.edu/~xsu1/VirusSeq.html), which uses [MOSAIK](https://github.com/wanpinglee/MOSAIK) aligner.
* [VirusFinder 2 (VERSE)](https://bioinfo.uth.edu/VirusFinder/)
* [Vy-PER](http://www.ikmb.uni-kiel.de/vy-per/)
* [seeksv](https://github.com/qiukunlong/seeksv)
* [BS-Seeker2](https://github.com/BSSeeker/BSseeker2)
* [Sherman](https://www.bioinformatics.babraham.ac.uk/projects/sherman/)

| Tool | Sequencing Type | Programming Language | 1st Alignment * | Assembler | 2nd Alignment # | Epub Date |
|:-----|:---------------:|:------------------:|:-----:|:-----:|:-----:|:-----:|
|[VirusSeq](https://doi.org/10.1093/bioinformatics/bts665)|RNA-Seq, WGS|Perl|MOSAIK to Human|MOSAIK to Virus|MOSAIK to Hybrid|2012 Nov 08|
|[ViralFusionSeq](https://doi.org/10.1093/bioinformatics/btt011)|RNA-Seq, WGS|Perl|BWA-SW to Human|cap3, SSAKE|Blastall to Virus|2013 Jan 12|
|[VERSE(VirusFinder2)](https://doi.org/10.1186/s13073-015-0126-6)|WGS, RNA-Seq|Perl|Bowtie2 to Human, BLAT to Virus, BLASTN to Virus|Trinity|BWA-SW to Hybrid, SVDetect, CREST|2015 Jan 20|
|[Virus-Clip](https://doi.org/10.18632/oncotarget.4187)|RNA-Seq|Perl|BWA-MEM to Virus|Virus-Clip|BLASTN to Human|2015 May 19|
|[Vy-PER](https://doi.org/10.1038/srep11534)|WGS, RNA-Seq|Python2|BWA-SW to Human|Vy-PER|BLAT to Virus|2015 Jul 13|
|[seeksv](https://doi.org/10.1093/bioinformatics/btw591)|WGS|C++|BWA to Hybrid|seeksv|seeksv to Hybrid|2016 Sep 14|
|BSVF|WGBS, WGS|Perl,C,C++|BWA-MEM to Hybrid|BSVF|water(EMBOSS) to Hybrid| N/A |

\* for virus-infected reads

\# for integration information

To extract the relevant PE reads within a 500 bp range of each position reported in the final result (*BS.analyse*, for example):

```bash
perl -lane '$a=$F[2]-501;$b=$F[2]+501;print join("\t",$F[1],$a,$b)' ../W2BS_analyse/BS.analyse >zones.bed
vi zones.bed	# Remove the first (header) line.
# Sort BS.bam to BS.sort.bam and index it.
samtools view -L zones.bed BS.sort.bam > zones.sam
awk '{print $1}' zones.sam | sort | uniq > zones.ids
#samtools view BS.bam | grep -F -f zones.ids >zones.PE.sam
samtools view BS.sort.bam | grep -F -f zones.ids > zones.PEs.sam	# The sorted one may be more useful.
```
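
The first two steps above (building `zones.bed` and deleting its header line by hand) can be combined in a short script. This sketch mirrors the column choice of the Perl one-liner (second column as chromosome, third as position) but differs in two ways: it skips the header itself and clamps negative start coordinates to 0. The function names `analyse_to_bed` and `write_bed` are hypothetical.

```python
def analyse_to_bed(lines, flank=501, has_header=True):
    """Yield (chrom, start, end) windows around the position in column 3.

    Column layout is assumed from the Perl one-liner above.
    """
    for index, line in enumerate(lines):
        if has_header and index == 0:
            continue  # replaces the manual `vi zones.bed` step
        fields = line.split()
        if len(fields) < 3:
            continue  # ignore blank or truncated lines
        chrom, position = fields[1], int(fields[2])
        # Clamp at 0 so positions near a sequence start stay valid.
        yield chrom, max(0, position - flank), position + flank


def write_bed(analyse_path, bed_path, flank=501):
    """Write the windows as tab-separated lines, one interval per line."""
    with open(analyse_path) as source, open(bed_path, "w") as target:
        for chrom, start, end in analyse_to_bed(source, flank):
            target.write("%s\t%d\t%d\n" % (chrom, start, end))
```

A call such as `write_bed("../W2BS_analyse/BS.analyse", "zones.bed")` produces a file that can be passed straight to `samtools view -L`.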