docs/hmmratac.md

   1 # Hidden Markov Model peak caller for ATAC-seq -- HMMRATAC
   2
   3 HMMRATAC (`macs3 hmmratac`) is a dedicated peak calling algorithm
   4 based on Hidden Markov Model for ATAC-seq data. The basic idea behind
   5 HMMRATAC is to digest ATAC-seq data according to the fragment length
   6 of read pairs into four signal tracks: short fragments,
   7 mono-nucleosomal fragments, di-nucleosomal fragments and
   8 tri-nucleosomal fragments. Then integrate the four tracks using Hidden
   9 Markov Model to consider three hidden states: open region, nucleosomal
  10 region, and background region. The [orginal
  11 paper](https://academic.oup.com/nar/article/47/16/e91/5519166) was
  12 published in 2019, and the original software was written in JAVA, by
  13 the then PhD student Evan Tarbell, a mohawk bioinformatician. In MACS3
  14 project, we implemented HMMRATAC idea in Python/Cython and optimize
  15 the whole process using existing MACS functions and hmmlearn.
  16
  17 Here's an example of runing the `hmmratac` command:
  18
  19 ```
  20 $ macs3 hmmratac -b test.bam -n test_result
  21 ```
  22
  23 Please use `macs3 hmmratac --help` to see all the options. Here we
  24 list the essential ones.
  25
  26 ## Essential Options
  27
  28 ### `-b BAM_FILE [BAM_FILE ...]` / `--bam BAM_FILE [BAM_FILE ...]`
  29
  30 This is the only REQUIRED parameter for `hmmratac`. The file can
  31 should be in BAMPE format. If multiple files are given as '-b A B C',
  32 then they will all be read and pooled together. REQUIRED.
  33
  34 ### `--outdir OUTDIR`
  35
  36 If specified all output files will be written to that directory. Default: the current working directory
  37
  38 ### `-n NAME`/ `--name NAME`
  39 Name for this experiment, which will be used as a prefix to generate output file names. DEFAULT: "NA"
  40
  41 ### `--cutoff-analysis-only`
  42
  43  Only run the cutoff analysis and output a report. After generating
  44  the report, the process will stop. The report will help user decide
  45  the three crucial parameters for `-l`, `-u`, and `-c`. So it's highly
  46  recommanded to run this first! Please read the report and
  47  instructions in `Choices of cutoff values` on how to decide the three
  48  crucial parameters
  49
  50 ### `-u HMM_UPPER` / `--upper HMM_UPPER`
  51
  52 Upper limit on fold change range for choosing training sites. This is
  53 an important parameter for training so please read. The purpose of
  54 this parameter is to EXCLUDE those unusually highly enriched chromatin
  55 regions so we can get training samples in 'ordinary' regions
  56 instead. It's highly recommended to run the `--cutoff-analysis-only`
  57 first to decide the lower cutoff `-l`, the upper cutoff `-u`, and the
  58 pre-scanning cutoff `-c`. The upper cutoff should be the cutoff in the
  59 cutoff analysis result that can capture some (typically hundreds of)
  60 extremely high enrichment and unusually wide peaks. Default: 20
  61
  62 ### `-l HMM_LOWER` / `--lower HMM_LOWER`
  63 Lower limit on fold change range for choosing training sites. This is
  64 an important parameter for training so please read. The purpose of
  65 this parameter is to ONLY INCLUDE those chromatin regions having
  66 ordinary enrichment so we can get training samples to learn the common
  67 features through HMM. It's highly recommended to run the
  68 `--cutoff-analysis-only` first to decide the lower cutoff `-l`, the
  69 upper cutoff `-u`, and the pre-scanning cutoff `-c`. The lower cutoff
  70 should be the cutoff in the cutoff analysis result that can capture
  71 moderate number ( about 10k ) of peaks with normal width ( average
  72 length 500-1000bps long). Default: 10
  73
  74 ### `-c PRESCAN_CUTOFF` / `--prescan-cutoff PRESCAN_CUTOFF`
  75
  76 The fold change cutoff for prescanning candidate regions in the whole
  77 dataset. Then we will use HMM to predict/decode states on these
  78 candidate regions. Higher the prescan cutoff, fewer regions will be
  79 considered. Must > 1. This is an important parameter for decoding so
  80 please read. The purpose of this parameter is to EXCLUDE those
  81 chromatin regions having noises/random enrichment so we can have a
  82 large number of possible regions to predict the HMM states. It's
  83 highly recommended to run the `--cutoff-analysis-only` first to decide
  84 the lower cutoff `-l`, the upper cutoff `-u`, and the pre-scanning
  85 cutoff `-c`. The pre-scanning cutoff should be the cutoff close to the
  86 BOTTOM of the cutoff analysis result that can capture large number of
  87 possible peaks with normal length (average length 500-1000bps). In
  88 most cases, please do not pick a cutoff too low that capture almost
  89 all the background noises from the data. Default: 1.2
  90
  91
  92 ## Choices of cutoff values
  93
  94 Before you proceed, it's highly recommended to run with
  95 `--cutoff-analysis-only` for the initial attempt. When this option is
  96 activated, `hmmratac` will use EM to estimate the best parameters for
  97 fragment sizes of short fragments, mono-, di-, and tri-nucleosomes,
  98 pileup fragments, convert the pileup values into fold-change, and
  99 analyze each possible cutoff. This analysis includes the number of
 100 peaks that can be called, their average peak length, and their total
 101 length. After the report is generated, you can review its contents and
 102 decide on the optimal `-l`, `-u`, and `-c`.
 103
 104 The report consists of four columns:
 105
 106 1. Score: the possible fold change cutoff value.
 107 2. npeaks: the number of peaks.
 108 3. lpeaks: the total length of all peaks.
 109 4. avelpeak: the average length of peaks.
 110
 111 While there's no universal rule, here are a few suggestions:
 112
 113 - The lower cutoff should be the cutoff in the report that captures a
 114   moderate number (about 10k) of peaks with a normal width (average
 115   length 500-1000bps long).
 116 - The upper cutoff should capture some (typically hundreds of)
 117   extremely high enrichment and unusually wide peaks in the
 118   report. The aim here is to exclude abnormal enrichment caused by
 119   artifacts such as repetitive regions.
 120 - The pre-scanning cutoff should be the cutoff close to the BOTTOM of
 121   the report that can capture a large number of potential peaks with a
 122   normal length (average length 500-1000bps). However, it's
 123   recommended not to use the lowest cutoff value in the report as this
 124   may include too much noise from the genome.
 125
 126 ## Tune the HMM model
 127
 128 It's highly recommended to check the runtime message of the HMM model
 129 after training. An example is like this:
 130
 131 ```
 132 #4 Train Hidden Markov Model with Multivariate Gaussian Emission
 133 #  Extract signals in training regions with bin size of 10
 134 #  Use Baum-Welch algorithm to train the HMM
 135 #   HMM converged: True
 136 #  Write HMM parameters into JSON: test_model.json
 137 #  The Hidden Markov Model for signals of binsize of 10 basepairs:
 138 #   open state index: state2
 139 #   nucleosomal state index: state1
 140 #   background state index: state0
 141 #   Starting probabilities of states:
 142 #                            bg        nuc       open
 143 #                        0.7994     0.1312    0.06942
 144 #   HMM Transition probabilities:
 145 #                            bg        nuc       open
 146 #               bg->     0.9842    0.01202   0.003759
 147 #              nuc->    0.03093     0.9562    0.01287
 148 #             open->   0.007891    0.01038     0.9817
 149 #   HMM Emissions (mean):
 150 #                         short       mono         di        tri
 151 #               bg:      0.2551      1.526     0.4646    0.07071
 152 #              nuc:       6.538      17.94      3.422    0.05819
 153 #             open:       5.016      17.47      6.897      2.121
 154 ```
 155
 156 We will 'guess' which hidden state is for the open region, which is
 157 the nucleosomal region, and which is the background. We compute from
 158 the HMM Emission matrix to pick the state with the highest sum of mean
 159 signals as the open state, the lowest as the backgound state, then the
 160 rest is the nucleosomal state. However it may not work in every
 161 case. In the above example, it may be tricky to call the second row as
 162 'nuc' and the third as 'open'. If the users want to exchange the state
 163 assignments of the 'nuc' and 'open', they can modify the state
 164 assignment in the HMM model file (e.g. test_model.json). For the above
 165 example, the model.json looks like this (we skipped some detail):
 166
 167 ```
 168 {"startprob": [...], "transmat": [...], "means": [...], "covars": [...],
 169 "covariance_type": "full", "n_features": 4,
 170 "i_open_region": 2, "i_background_region": 0, "i_nucleosomal_region": 1,
 171 "hmm_binsize": 10}
 172 ```
 173
 174 We can modify the assignment of: `"i_open_region": 2,
 175 "i_background_region": 0, "i_nucleosomal_region": 1,` by assigning `1`
 176 to open, and `2` to nucleosomal as: `"i_open_region": 1,
 177 "i_background_region": 0, "i_nucleosomal_region": 2,` Then save the
 178 HMM in a new model file such as `new_model.json`.
 179
 180 Then next, we can re-run `macs3 hmmratac` with the same parameters
 181 plus an extra option for the HMM model file like `macs3 hmmratac
 182 --model new_model.json`
 183