4 The `callvar` command is part of the MACS3 suite of tools and is used
5 to call variants (SNVs and small INDELs) in given peak regions from
6 the alignment BAM files.
8 ## Detailed Description of usage
10 The `callvar` command takes in treatment and control BAM files along
11 with a bed file containing peak regions. The command identifies
12 variants in these regions using a multi-process approach, greatly
13 improving the speed and efficiency of variant calling. Please check
14 the section *Callvar Algorithm* for detail on this variant calling
17 The `callvar` command assumes you have two types of BAM files. The
18 first type, what we call `TREAT`, is from DNA enrichment assay such as
19 ChIP-seq or ATAC-seq where the DNA fragments in the sequencing library
20 are enriched in certain genomics regions with potential allele biases;
21 the second type, called `CTRL` for control, is from genomic assay in
22 which the DNA enrichment is less biased in multiploid chromosomes and
23 more uniform across the whole genome (the later one is optional). In
24 order to run `callvar`, please sort (by coordinates) and index the BAM
30 `$ samtools sort TREAT.bam -o TREAT_sorted.bam`
31 `$ samtools sort CTRL.bam -o CTRL_sorted.bam`
32 2. Index the BAM file:
33 `$ samtools index TREAT_sorted.bam`
34 `$ samtools index CTRL_sorted.bam`
35 3. Make sure .bai files are available:
36 `$ ls TREAT_sorted.bam.bai`
37 `$ ls CTRL_sorted.bam.bai`
40 `$ macs3 callvar -b peaks.bed -t TREAT_sorted.bam -c CTRL_sorted.bam -o peaks.vcf`
42 ## Command Line Options
44 Here is a brief overview of these options:
46 ### Input files Options:
48 - `-b` or `--peak`: The peak regions in BED format, sorted by
49 coordinates. This option is required.
50 - `-t` or `--treatment`: The ChIP-seq/ATAC-seq treatment file in BAM
51 format, sorted by coordinates. Make sure the .bai file is avaiable
52 in the same directory. This option is required.
53 - `-c` or `--control`: Optional control file in BAM format, sorted by
54 coordinates. Make sure the .bai file is avaiable in the same
58 - `--outdir`: The directory for all output files to be written
59 to. Default: writes output files to the current working directory.
60 - `-o` or `--ofile`: The output VCF file name. Please check the
61 section *Customized fields in VCF* section for detail.
62 - `--verbose`: Set the verbose level of runtime messages. 0: only show
63 critical messages, 1: show additional warning messages, 2: show
64 process information, 3: show debug messages. DEFAULT: 2
66 ### Variant calling Options:
67 - `-g` or `--gq-hetero`: The Genotype Quality score
68 (-10log10((L00+L11)/(L01+L00+L11))) cutoff for Heterozygous allele
69 type. Default is 0, or there is no cutoff on GQ.
70 - `-G` or `--gq-homo`: The Genotype Quality score
71 (-10log10((L00+L01)/(L01+L00+L11))) cutoff for Homozygous allele
72 (not the same as reference) type. Default is 0, or there is no
74 - `-Q`: The cutoff for the quality score. Only consider bases with
75 quality score greater than this value. Default is 20, which means
76 Q20 or 0.01 error rate.
77 - `-F` or `--fermi`: The option to control when to apply local
78 assembly through fermi-lite. By default (set as 'auto'), while
79 `callvar` detects any INDEL variant in a peak region, it will
80 utilize fermi-lite to recover the actual DNA sequences to refine the
81 read alignments. If set as 'on', fermi-lite will always be
82 invoked. It can increase specificity, however sensivity and speed
83 will be significantly lower. If set as 'off', fermi-lite won't be
84 invoked at all. If so, speed and sensitivity can be higher but
85 specificity will be significantly lower.
86 - `--fermi-overlap`: The minimal overlap for fermi to initially
87 assemble two reads. Must be between 1 and read length. A longer
88 fermiMinOverlap is needed while read length is small (e.g. 30 for
89 36bp read, but 33 for 100bp read may work). Default is 30.
90 - `--top2alleles-mratio`: The reads for the top 2 most frequent
91 alleles (e.g. a ref allele and an alternative allele) at a loci
92 shouldn't be too few comparing to total reads mapped. The minimum
93 ratio is set by this optoin. Must be a float between 0.5
94 and 1. Default:0.8 which means at least 80% of reads contain the top
96 - `--altallele-count`: The count of the alternative (non-reference)
97 allele at a loci shouldn't be too few. By default, we require at
98 least two reads support the alternative allele. Default:2
99 - `--max-ar`: The maximum Allele-Ratio allowed while calculating
100 likelihood for allele-specific binding. If we allow higher maxAR, we
101 may mistakenly assign some homozygous loci as
102 heterozygous. Default:0.95
105 - `-m` or `--multiple-processing`: The CPU used for mutliple
106 processing. Please note that, assigning more CPUs does not guarantee
107 the process being faster. Creating too many parrallel processes need
108 memory operations and may negate benefit from multi
109 processing. Default: 1
113 Here is an example of how to use the `callvar` command:
116 macs3 callvar -b peaks.bed -t treatment.bam -c control.bam -o experiment1
119 In this example, the program will identify variants in the
120 `treatment.bam` file relative to the `control.bam` file. The name of
121 the experiment is 'experiment1'. All tags that pass quality filters
122 will be stored in a BAM file.
124 ## `callvar` Algorithm
126 ![Callvar Algorithm](./callvar_algorithm.jpeg)
128 Functional sequencing assays which targeted at particular sequences,
129 such as ChIP-Seq, were thought to be unsuitable for *de novo*
130 variation predictions because their genome-wide sequencing coverage is
131 not as uniform as Whole Genome Sequencing (WGS). However, if we aim at
132 discovering the variations and allele usage at the targeted genomic
133 regions, the coverage should be much higher and sufficient. We
134 therefore proposed a novel method to call the variants directly at the
135 called peaks by MACS3.
137 At each peak region, we extract the reads and assembled the DNA
138 sequences using [fermi-lite](https://github.com/lh3/fermi-lite), a
139 unitig graph based assembly algorithm developed by Heng Li. Then, we
140 align the unitigs (i.e., assembled short DNA sequences) to the
141 reference genome sequence using Smith-Waterman algorithm. Differences
142 between the reference sequence and the unitigs reveal possible SNVs
143 and INDELs. Please note that, by default, we only peform the *de novo*
144 assembly using fermi-lite for detecting INDELs to save time. For each
145 possible SNV or INDEL, we build a statistical model incorporating the
146 sequences and sequencing errors (base qualities) from both treatment
147 (ChIP) and control (genomic input) to predict the most likely genotype
148 using Bayesian Information Criterion (BIC) among four allele types:
149 homozygous loci (genotype 1/1), heterozygous loci (genotype 0/1 or
150 1/2) with allele bias, and heterozygous loci without allele bias. The
151 detailed explanation of our statistical model is as follows: we
152 retrieve the base quality scores $\epsilon$, which represents
153 sequencing errors, then we calculate the likelihoods of each of the
154 four types. We assume the independence of ChIP and control experiments
155 so that the generalized likelihood function is the product of the
156 likelihood functions of ChIP and control data:
158 $$L(\omega,\phi,g_c,g_i:D)=L(\omega,g_c:D_c)L(\phi,g_i:D_i)$$
160 where $D_c$ and $D_i$ represent the ChIP-Seq and control (e.g.,
161 genomic input) data observed at the position including base coverage
162 and base qualities. The parameter $\omega$ stands for the allele ratio
163 of allele A (chosen as the more abundant or stronger allele compared
164 with the others) from the ChIP-Seq data and $\phi$ represents the
165 allele ratio in the control. The parameter $g_c$ represents the
166 actual number of ChIPed DNA fragments containing allele A, which could
167 differ from the observed count $r_{c,A}$ considering that some
168 observations could be due to sequencing errors. The symbol $g_i$
169 represents the control analogously to $g_c$. We use $r_c$ to
170 denote the total number of observed allele A ($r_{c,A}$) and allele
171 B ($r_{c,B}$). We assume the occurrence of the allele A ($g_c$)
172 is from a Bernoulli trial from $r_c$ with the allele ratio
173 $\omega$. The probability of observing the ChIP-Seq data at a certain
174 position under a given type is as follows:
177 $$Pr(D_c|g_c,\omega) = Pr(D_c|g_c) =
178 \sum^{r_{c,A}}_{j=1}\left((1-\epsilon_j)g_c/r_c+\epsilon_j(1-g_c/r_c)\right)\sum_{j=1}^{r_{c,B}}\left((1-\epsilon_j)(1-g_c/r_c)+\epsilon_j
181 where $\epsilon_j$ represents the sequencing error of the base
182 showing difference with reference genome in case of mismatch
183 (corresponding to SNV) and insertion. In case of deletion, the
184 sequencing errors from the two bases on sequenced read surrounding the
185 deletion would be considered. We model the control data in the similar
186 way. We assess the likelihood functions of the 4 major type using the
187 following parameters: $\omega=1,\phi=1,g_c=r_{c,0},g_i=r_{i,0}$ for
188 A/A genotype; $\omega=0,\phi=0,g_c=0,g_i=0$ for B/B genotype,
189 $\omega=0.5,\phi=0.5$ and $g_c,g_i$ as free variables for A/B
190 genotype with unbiased binding; $\phi=0.5$ and $\omega,g_c,g_i$ as
191 free variables for A/B genotype with biased binding or allele
192 usage. Next, we apply the Bayesian Information Criterion (BIC) to
193 select the best type as our prediction with the minimal BIC value
194 among the 4 models. If the best type is either “A/B, noAS” or “A/B,
195 AS”, we conclude that the genotype is heterozygous (A/B). We consider
196 two types of data from the same assay independently: ChIP sample that
197 can have biased allele usage, and control sample that won’t have
198 biased allele usage. So that in case control is not available, such as
199 in ATAC-Seq assay, our model can still work. Furthermore, in case a
200 good quality WGS is available, it can be regarded as the control
201 sample and be inserted into our calculation to further increase the
204 ## Customized fields in the Output VCF file
206 The result VCF file from MACS3 `callvar` will have the following
207 customized fields in VCF flavor:
210 ##INFO=<ID=M,Number=.,Type=String,Description="MACS Model with minimum BIC value">
211 ##INFO=<ID=MT,Number=.,Type=String,Description="Mutation type: SNV/Insertion/Deletion">
212 ##INFO=<ID=DPT,Number=1,Type=Integer,Description="Depth Treatment: Read depth in ChIP-seq data">
213 ##INFO=<ID=DPC,Number=1,Type=Integer,Description="Depth Control: Read depth in control data">
214 ##INFO=<ID=DP1T,Number=.,Type=String,Description="Read depth of top1 allele in ChIP-seq data">
215 ##INFO=<ID=DP2T,Number=.,Type=String,Description="Read depth of top2 allele in ChIP-seq data">
216 ##INFO=<ID=DP1C,Number=.,Type=String,Description="Read depth of top1 allele in control data">
217 ##INFO=<ID=DP2C,Number=.,Type=String,Description="Read depth of top2 allele in control data">
218 ##INFO=<ID=DBIC,Number=.,Type=Float,Description="Difference of BIC of selected model vs second best alternative model">
219 ##INFO=<ID=BICHOMOMAJOR,Number=1,Type=Integer,Description="BIC of homozygous with major allele model">
220 ##INFO=<ID=BICHOMOMINOR,Number=1,Type=Integer,Description="BIC of homozygous with minor allele model">
221 ##INFO=<ID=BICHETERNOAS,Number=1,Type=Integer,Description="BIC of heterozygous with no allele-specific model">
222 ##INFO=<ID=BICHETERAS,Number=1,Type=Integer,Description="BIC of heterozygous with allele-specific model">
223 ##INFO=<ID=AR,Number=1,Type=Float,Description="Estimated allele ratio of heterozygous with allele-specific model">
224 ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
225 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth after filtering bad reads">
226 ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality score">
227 ##FORMAT=<ID=PL,Number=3,Type=Integer,Description="Normalized, Phred-scaled genotype likelihoods for 00, 01, 11 genotype">