docs/callvar.md

   1 # callvar
   2
   3 ## Overview
   4 The `callvar` command is part of the MACS3 suite of tools and is used
   5 to call variants (SNVs and small INDELs) in given peak regions from
   6 the alignment BAM files.
   7
   8 ## Detailed Description of usage
   9
  10 The `callvar` command takes in treatment and control BAM files along
  11 with a bed file containing peak regions. The command identifies
  12 variants in these regions using a multi-process approach, greatly
  13 improving the speed and efficiency of variant calling. Please check
  14 the section *Callvar Algorithm* for detail on this variant calling
  15 algorithm.
  16
  17 The `callvar` command assumes you have two types of BAM files. The
  18 first type, what we call `TREAT`, is from DNA enrichment assay such as
  19 ChIP-seq or ATAC-seq where the DNA fragments in the sequencing library
  20 are enriched in certain genomics regions with potential allele biases;
  21 the second type, called `CTRL` for control, is from genomic assay in
  22 which the DNA enrichment is less biased in multiploid chromosomes and
  23 more uniform across the whole genome (the later one is optional). In
  24 order to run `callvar`, please sort (by coordinates) and index the BAM
  25 files.
  26
  27 Example:
  28
  29 1. Sort the BAM file:
  30     `$ samtools sort TREAT.bam -o TREAT_sorted.bam`
  31     `$ samtools sort CTRL.bam -o CTRL_sorted.bam`
  32 2. Index the BAM file:
  33     `$ samtools index TREAT_sorted.bam`
  34     `$ samtools index CTRL_sorted.bam`
  35 3. Make sure .bai files are available:
  36     `$ ls TREAT_sorted.bam.bai`
  37     `$ ls CTRL_sorted.bam.bai`
  38
  39 To call variants:
  40     `$ macs3 callvar -b peaks.bed -t TREAT_sorted.bam -c CTRL_sorted.bam -o peaks.vcf`
  41
  42 ## Command Line Options
  43
  44 Here is a brief overview of these options:
  45
  46 ### Input files Options:
  47
  48 - `-b` or `--peak`: The peak regions in BED format, sorted by
  49   coordinates. This option is required.
  50 - `-t` or `--treatment`: The ChIP-seq/ATAC-seq treatment file in BAM
  51   format, sorted by coordinates. Make sure the .bai file is avaiable
  52   in the same directory. This option is required.
  53 - `-c` or `--control`: Optional control file in BAM format, sorted by
  54   coordinates. Make sure the .bai file is avaiable in the same
  55   directory.
  56
  57 ### Output Options:
  58 - `--outdir`: The directory for all output files to be written
  59   to. Default: writes output files to the current working directory.
  60 - `-o` or `--ofile`: The output VCF file name. Please check the
  61   section *Customized fields in VCF* section for detail.
  62 - `--verbose`: Set the verbose level of runtime messages. 0: only show
  63   critical messages, 1: show additional warning messages, 2: show
  64   process information, 3: show debug messages. DEFAULT: 2
  65
  66 ### Variant calling Options:
  67 - `-g` or `--gq-hetero`: The Genotype Quality score
  68   (-10log10((L00+L11)/(L01+L00+L11))) cutoff for Heterozygous allele
  69   type. Default is 0, or there is no cutoff on GQ.
  70 - `-G` or `--gq-homo`: The Genotype Quality score
  71   (-10log10((L00+L01)/(L01+L00+L11))) cutoff for Homozygous allele
  72   (not the same as reference) type. Default is 0, or there is no
  73   cutoff on GQ.
  74 - `-Q`: The cutoff for the quality score. Only consider bases with
  75   quality score greater than this value. Default is 20, which means
  76   Q20 or 0.01 error rate.
  77 - `-F` or `--fermi`: The option to control when to apply local
  78   assembly through fermi-lite. By default (set as 'auto'), while
  79   `callvar` detects any INDEL variant in a peak region, it will
  80   utilize fermi-lite to recover the actual DNA sequences to refine the
  81   read alignments. If set as 'on', fermi-lite will always be
  82   invoked. It can increase specificity, however sensivity and speed
  83   will be significantly lower. If set as 'off', fermi-lite won't be
  84   invoked at all. If so, speed and sensitivity can be higher but
  85   specificity will be significantly lower.
  86 - `--fermi-overlap`: The minimal overlap for fermi to initially
  87   assemble two reads. Must be between 1 and read length. A longer
  88   fermiMinOverlap is needed while read length is small (e.g. 30 for
  89   36bp read, but 33 for 100bp read may work). Default is 30.
  90 - `--top2alleles-mratio`: The reads for the top 2 most frequent
  91   alleles (e.g. a ref allele and an alternative allele) at a loci
  92   shouldn't be too few comparing to total reads mapped. The minimum
  93   ratio is set by this optoin. Must be a float between 0.5
  94   and 1. Default:0.8 which means at least 80% of reads contain the top
  95   2 alleles.
  96 - `--altallele-count`: The count of the alternative (non-reference)
  97   allele at a loci shouldn't be too few. By default, we require at
  98   least two reads support the alternative allele. Default:2
  99 - `--max-ar`: The maximum Allele-Ratio allowed while calculating
 100   likelihood for allele-specific binding. If we allow higher maxAR, we
 101   may mistakenly assign some homozygous loci as
 102   heterozygous. Default:0.95
 103
 104 ### Misc Options:
 105 - `-m` or `--multiple-processing`: The CPU used for mutliple
 106   processing. Please note that, assigning more CPUs does not guarantee
 107   the process being faster. Creating too many parrallel processes need
 108   memory operations and may negate benefit from multi
 109   processing. Default: 1
 110
 111 ## Example Usage
 112
 113 Here is an example of how to use the `callvar` command:
 114
 115 ```
 116 macs3 callvar -b peaks.bed -t treatment.bam -c control.bam -o experiment1
 117 ```
 118
 119 In this example, the program will identify variants in the
 120 `treatment.bam` file relative to the `control.bam` file. The name of
 121 the experiment is 'experiment1'. All tags that pass quality filters
 122 will be stored in a BAM file.
 123
 124 ## `callvar` Algorithm
 125
 126 ![Callvar Algorithm](./callvar_algorithm.jpeg)
 127
 128 Functional sequencing assays which targeted at particular sequences,
 129 such as ChIP-Seq, were thought to be unsuitable for *de novo*
 130 variation predictions because their genome-wide sequencing coverage is
 131 not as uniform as Whole Genome Sequencing (WGS). However, if we aim at
 132 discovering the variations and allele usage at the targeted genomic
 133 regions, the coverage should be much higher and sufficient. We
 134 therefore proposed a novel method to call the variants directly at the
 135 called peaks by MACS3.
 136
 137 At each peak region, we extract the reads and assembled the DNA
 138 sequences using [fermi-lite](https://github.com/lh3/fermi-lite), a
 139 unitig graph based assembly algorithm developed by Heng Li. Then, we
 140 align the unitigs (i.e., assembled short DNA sequences) to the
 141 reference genome sequence using Smith-Waterman algorithm. Differences
 142 between the reference sequence and the unitigs reveal possible SNVs
 143 and INDELs. Please note that, by default, we only peform the *de novo*
 144 assembly using fermi-lite for detecting INDELs to save time. For each
 145 possible SNV or INDEL, we build a statistical model incorporating the
 146 sequences and sequencing errors (base qualities) from both treatment
 147 (ChIP) and control (genomic input) to predict the most likely genotype
 148 using Bayesian Information Criterion (BIC) among four allele types:
 149 homozygous loci (genotype 1/1), heterozygous loci (genotype 0/1 or
 150 1/2) with allele bias, and heterozygous loci without allele bias. The
 151 detailed explanation of our statistical model is as follows: we
 152 retrieve the base quality scores $\epsilon$, which represents
 153 sequencing errors, then we calculate the likelihoods of each of the
 154 four types. We assume the independence of ChIP and control experiments
 155 so that the generalized likelihood function is the product of the
 156 likelihood functions of ChIP and control data:
 157
 158 $$L(\omega,\phi,g_c,g_i:D)=L(\omega,g_c:D_c)L(\phi,g_i:D_i)$$
 159
 160 where $D_c$ and $D_i$ represent the ChIP-Seq and control (e.g.,
 161 genomic input) data observed at the position including base coverage
 162 and base qualities. The parameter $\omega$ stands for the allele ratio
 163 of allele A (chosen as the more abundant or stronger allele compared
 164 with the others) from the ChIP-Seq data and $\phi$ represents the
 165 allele ratio in the control. The parameter $g_c$ represents the
 166 actual number of ChIPed DNA fragments containing allele A, which could
 167 differ from the observed count $r_{c,A}$ considering that some
 168 observations could be due to sequencing errors. The symbol $g_i$
 169 represents the control analogously to $g_c$. We use $r_c$ to
 170 denote the total number of observed allele A ($r_{c,A}$) and allele
 171 B ($r_{c,B}$). We assume the occurrence of the allele A ($g_c$)
 172 is from a Bernoulli trial from $r_c$ with the allele ratio
 173 $\omega$. The probability of observing the ChIP-Seq data at a certain
 174 position under a given type is as follows:
 175
 176
 177 $$Pr(D_c|g_c,\omega) = Pr(D_c|g_c) =
 178  \sum^{r_{c,A}}_{j=1}\left((1-\epsilon_j)g_c/r_c+\epsilon_j(1-g_c/r_c)\right)\sum_{j=1}^{r_{c,B}}\left((1-\epsilon_j)(1-g_c/r_c)+\epsilon_j
 179  g_c/r_c\right)$$
 180
 181 where $\epsilon_j$ represents the sequencing error of the base
 182 showing difference with reference genome in case of mismatch
 183 (corresponding to SNV) and insertion. In case of deletion, the
 184 sequencing errors from the two bases on sequenced read surrounding the
 185 deletion would be considered. We model the control data in the similar
 186 way. We assess the likelihood functions of the 4 major type using the
 187 following parameters: $\omega=1,\phi=1,g_c=r_{c,0},g_i=r_{i,0}$ for
 188 A/A genotype; $\omega=0,\phi=0,g_c=0,g_i=0$ for B/B genotype,
 189 $\omega=0.5,\phi=0.5$ and $g_c,g_i$ as free variables for A/B
 190 genotype with unbiased binding; $\phi=0.5$ and $\omega,g_c,g_i$ as
 191 free variables for A/B genotype with biased binding or allele
 192 usage. Next, we apply the Bayesian Information Criterion (BIC) to
 193 select the best type as our prediction with the minimal BIC value
 194 among the 4 models. If the best type is either “A/B, noAS” or “A/B,
 195 AS”, we conclude that the genotype is heterozygous (A/B). We consider
 196 two types of data from the same assay independently: ChIP sample that
 197 can have biased allele usage, and control sample that won’t have
 198 biased allele usage. So that in case control is not available, such as
 199 in ATAC-Seq assay, our model can still work. Furthermore, in case a
 200 good quality WGS is available, it can be regarded as the control
 201 sample and be inserted into our calculation to further increase the
 202 sensitivity.
 203
 204 ## Customized fields in the Output VCF file
 205
 206 The result VCF file from MACS3 `callvar` will have the following
 207 customized fields in VCF flavor:
 208
 209 ```
 210 ##INFO=<ID=M,Number=.,Type=String,Description="MACS Model with minimum BIC value">
 211 ##INFO=<ID=MT,Number=.,Type=String,Description="Mutation type: SNV/Insertion/Deletion">
 212 ##INFO=<ID=DPT,Number=1,Type=Integer,Description="Depth Treatment: Read depth in ChIP-seq data">
 213 ##INFO=<ID=DPC,Number=1,Type=Integer,Description="Depth Control: Read depth in control data">
 214 ##INFO=<ID=DP1T,Number=.,Type=String,Description="Read depth of top1 allele in ChIP-seq data">
 215 ##INFO=<ID=DP2T,Number=.,Type=String,Description="Read depth of top2 allele in ChIP-seq data">
 216 ##INFO=<ID=DP1C,Number=.,Type=String,Description="Read depth of top1 allele in control data">
 217 ##INFO=<ID=DP2C,Number=.,Type=String,Description="Read depth of top2 allele in control data">
 218 ##INFO=<ID=DBIC,Number=.,Type=Float,Description="Difference of BIC of selected model vs second best alternative model">
 219 ##INFO=<ID=BICHOMOMAJOR,Number=1,Type=Integer,Description="BIC of homozygous with major allele model">
 220 ##INFO=<ID=BICHOMOMINOR,Number=1,Type=Integer,Description="BIC of homozygous with minor allele model">
 221 ##INFO=<ID=BICHETERNOAS,Number=1,Type=Integer,Description="BIC of heterozygous with no allele-specific model">
 222 ##INFO=<ID=BICHETERAS,Number=1,Type=Integer,Description="BIC of heterozygous with allele-specific model">
 223 ##INFO=<ID=AR,Number=1,Type=Float,Description="Estimated allele ratio of heterozygous with allele-specific model">
 224 ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
 225 ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth after filtering bad reads">
 226 ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality score">
 227 ##FORMAT=<ID=PL,Number=3,Type=Integer,Description="Normalized, Phred-scaled genotype likelihoods for 00, 01, 11 genotype">
 228 ```