5 HMMRATAC (`macs3 hmmratac`) is a dedicated peak calling algorithm
6 based on Hidden Markov Model for ATAC-seq data. The basic idea behind
7 HMMRATAC is to digest ATAC-seq data according to the fragment length
8 of read pairs into four signal tracks: short fragments,
9 mono-nucleosomal fragments, di-nucleosomal fragments and
10 tri-nucleosomal fragments. Then integrate the four tracks using Hidden
11 Markov Model to consider three hidden states: open region, nucleosomal
12 region, and background region. The [orginal
13 paper](https://academic.oup.com/nar/article/47/16/e91/5519166) was
14 published in 2019, and the original software was written in JAVA, by
15 the then PhD student Evan Tarbell, a mohawk bioinformatician. In MACS3
16 project, we implemented HMMRATAC idea in Python/Cython and optimize
17 the whole process using existing MACS functions and hmmlearn.
19 Here's an example of how to run the `hmmratac` command:
22 $ macs3 hmmratac -i yeast.bam -n yeast
25 or with the BEDPE format
28 $ macs3 hmmratac -i yeast.bedpe.gz -f BEDPE -n yeast
31 Note: you can convert BAMPE to BEDPE by using
34 $ macs3 filterdup --keep-dup all -f BAMPE -i yeast.bam -o yeast.bedpe
37 Please use `macs3 hmmratac --help` to see all the options. Here we
38 list the essential ones.
42 ### `-i INPUT_FILE [INPUT_FILE ...]` / `--input INPUT_FILE [INPUT_FILE ...]`
44 This is the only REQUIRED parameter for `hmmratac`. Input files
45 containing the aligment results for ATAC-seq paired end reads. If
46 multiple files are given as '-t A B C', then they will all be read and
47 pooled together. The file should be in BAMPE or BEDPE format (aligned
48 in paired end mode). Files can be gzipped. Note: all files should be
49 in the same format. REQUIRED.
51 ### `-f {BAMPE,BEDPE}` / `--format {BAMPE,BEDPE}`
53 Format of input files, "BAMPE" or "BEDPE". If there are multiple
54 files, they should be in the same format -- either BAMPE or
55 BEDPE. Please note that the BEDPE only contains three columns --
56 chromosome, left position of the whole pair, right position of the
57 whole pair-- and is NOT the same BEDPE format used by BEDTOOLS. To
58 convert BAMPE to BEDPE, you can use this command `macs3 filterdup
59 --keep-dup all -f BAMPE -i input.bam -o output.bedpe`. DEFAULT:
64 If specified all output files will be written to that
65 directory. Default: the current working directory
67 ### `-n NAME`/ `--name NAME`
68 Name for this experiment, which will be used as a prefix to generate
69 output file names. DEFAULT: "NA"
72 This option will only generate the HMM model as a JSON file and
73 quit. This model can then be applied using the `--model`
74 option. Default: False
77 If provided, HMM training will be skipped and a JSON file generated
78 from a previous HMMRATAC run will be used instead of creating new
81 ### `-t HMM_TRAINING_REGIONS` / `--training HMM_TRAINING_REGIONS`
82 Customized training regions can be provided through this option. `-t`
83 takes the filename of training regions (previously was BED_file) to
84 use for training HMM, instead of using foldchange settings to
87 ### `--min-frag-p MIN_FRAG_P`
88 We will exclude the abnormal fragments that can't be assigned to any
89 of the four signal tracks. After we use EM to find the means and
90 stddevs of the four distributions, we will calculate the likelihood
91 that a given fragment length fit any of the four using normal
92 distribution. The criteria we will use is that if a fragment length
93 has less than MIN_FRAG_P probability to be like either of short,
94 mono, di, or tri-nuc fragment, we will exclude it while generating
95 the four signal tracks for later HMM training and prediction. The
96 value should be between 0 and 1. Larger the value, more abnormal
97 fragments will be allowed. So if you want to include more 'ideal'
98 fragments, make this value smaller. Default = 0.001
100 ### `--cutoff-analysis-only`
102 Only run the cutoff analysis and output a report. After generating
103 the report, the process will stop. The report will help user decide
104 the three crucial parameters for `-l`, `-u`, and `-c`. So it's highly
105 recommanded to run this first! Please read the report and
106 instructions in [Choices of cutoff values](#choices-of-cutoff-values)
107 on how to decide the three crucial parameters.
109 ### `-u HMM_UPPER` / `--upper HMM_UPPER`
111 Upper limit on fold change range for choosing training sites. This is
112 an important parameter for training so please read. The purpose of
113 this parameter is to EXCLUDE those unusually highly enriched chromatin
114 regions so we can get training samples in 'ordinary' regions
115 instead. It's highly recommended to run the `--cutoff-analysis-only`
116 first to decide the lower cutoff `-l`, the upper cutoff `-u`, and the
117 pre-scanning cutoff `-c`. The upper cutoff should be the cutoff in the
118 cutoff analysis result that can capture some (typically hundreds of)
119 extremely high enrichment and unusually wide peaks. Default: 20
121 ### `-l HMM_LOWER` / `--lower HMM_LOWER`
122 Lower limit on fold change range for choosing training sites. This is
123 an important parameter for training so please read. The purpose of
124 this parameter is to ONLY INCLUDE those chromatin regions having
125 ordinary enrichment so we can get training samples to learn the common
126 features through HMM. It's highly recommended to run the
127 `--cutoff-analysis-only` first to decide the lower cutoff `-l`, the
128 upper cutoff `-u`, and the pre-scanning cutoff `-c`. The lower cutoff
129 should be the cutoff in the cutoff analysis result that can capture
130 moderate number ( about 10k ) of peaks with normal width ( average
131 length 500-1000bps long). Default: 10
133 ### `-c PRESCAN_CUTOFF` / `--prescan-cutoff PRESCAN_CUTOFF`
135 The fold change cutoff for prescanning candidate regions in the whole
136 dataset. Then we will use HMM to predict/decode states on these
137 candidate regions. The higher the prescan cutoff, the fewer regions
138 will be considered. Must be > 1. This is an important parameter for
139 decoding so please read. The purpose of this parameter is to EXCLUDE
140 those chromatin regions having noises/random enrichment so we can have
141 a large number of possible regions to predict the HMM states. It's
142 highly recommended to run the `--cutoff-analysis-only` first to decide
143 the lower cutoff `-l`, the upper cutoff `-u`, and the pre-scanning
144 cutoff `-c`. The pre-scanning cutoff should be the cutoff close to the
145 BOTTOM of the cutoff analysis result that can capture a large number
146 of possible peaks with normal length (average length 500-1000bps). In
147 most cases, please do not pick a cutoff too low that captures almost
148 all the background noises from the data. Default: 1.2
151 ## Choices of cutoff values
153 Before you proceed, it's highly recommended to run with
154 `--cutoff-analysis-only` for the initial attempt. When this option is
155 activated, `hmmratac` will use EM to estimate the best parameters for
156 fragment sizes of short fragments, mono-, di-, and tri-nucleosomes,
157 pileup fragments, convert the pileup values into fold-change, and
158 analyze each possible cutoff. This analysis includes the number of
159 peaks that can be called, their average peak length, and their total
160 length. After the report is generated, you can review its contents and
161 decide on the optimal `-l`, `-u`, and `-c`.
163 The report consists of four columns:
165 1. Score: the possible fold change cutoff value.
166 2. npeaks: the number of peaks.
167 3. lpeaks: the total length of all peaks.
168 4. avelpeak: the average length of peaks.
170 While there's no universal rule, here are a few suggestions:
172 - The lower cutoff should be the cutoff in the report that captures a
173 moderate number (about 10k) of peaks with a normal width (average
174 length 500-1000bps long).
175 - The upper cutoff should capture some (typically hundreds of)
176 extremely high enrichment and unusually wide peaks in the
177 report. The aim here is to exclude abnormal enrichment caused by
178 artifacts such as repetitive regions.
179 - The pre-scanning cutoff should be the cutoff close to the BOTTOM of
180 the report that can capture a large number of potential peaks with a
181 normal length (average length 500-1000bps). However, it's
182 recommended not to use the lowest cutoff value in the report as this
183 may include too much noise from the genome.
185 ## Tune the HMM model
187 It's highly recommended to check the runtime message of the HMM model
188 after training. An example is like this:
191 #4 Train Hidden Markov Model with Multivariate Gaussian Emission
192 # Extract signals in training regions with bin size of 10
193 # Use Baum-Welch algorithm to train the HMM
194 # HMM converged: True
195 # Write HMM parameters into JSON: test_model.json
196 # The Hidden Markov Model for signals of binsize of 10 basepairs:
197 # open state index: state2
198 # nucleosomal state index: state1
199 # background state index: state0
200 # Starting probabilities of states:
202 # 0.7994 0.1312 0.06942
203 # HMM Transition probabilities:
205 # bg-> 0.9842 0.01202 0.003759
206 # nuc-> 0.03093 0.9562 0.01287
207 # open-> 0.007891 0.01038 0.9817
208 # HMM Emissions (mean):
210 # bg: 0.2551 1.526 0.4646 0.07071
211 # nuc: 6.538 17.94 3.422 0.05819
212 # open: 5.016 17.47 6.897 2.121
215 We will 'guess' which hidden state is for the open region, which is
216 the nucleosomal region, and which is the background. We compute from
217 the HMM Emission matrix to pick the state with the highest sum of mean
218 signals as the open state, the lowest as the backgound state, then the
219 rest is the nucleosomal state. However it may not work in every
220 case. In the above example, it may be tricky to call the second row as
221 'nuc' and the third as 'open'. If the users want to exchange the state
222 assignments of the 'nuc' and 'open', they can modify the state
223 assignment in the HMM model file (e.g. test_model.json). For the above
224 example, the model.json looks like this (we skipped some detail):
227 {"startprob": [...], "transmat": [...], "means": [...], "covars": [...],
228 "covariance_type": "full", "n_features": 4,
229 "i_open_region": 2, "i_background_region": 0, "i_nucleosomal_region": 1,
233 We can modify the assignment of: `"i_open_region": 2,
234 "i_background_region": 0, "i_nucleosomal_region": 1,` by assigning `1`
235 to open, and `2` to nucleosomal as: `"i_open_region": 1,
236 "i_background_region": 0, "i_nucleosomal_region": 2,` Then save the
237 HMM in a new model file such as `new_model.json`.
239 Then next, we can re-run `macs3 hmmratac` with the same parameters
240 plus an extra option for the HMM model file like `macs3 hmmratac
241 --model new_model.json`