4 This is the main function in MACS3. It will take alignment files in
5 various format (please check the detail below) and call the
6 significantly enriched regions in the genome as 'peaks'. It can be
7 invoked by `macs3 callpeak` . If you type this command with `-h`, you
8 will see a full description of command-line options. Here we only list
11 ## Essential Commandline Options
17 This is the only REQUIRED parameter for MACS3. The file can be in
18 any supported format -- see detail in the `--format` option. If you
19 have more than one alignment file, you can specify them as `-t A B
20 C`. MACS3 will pool up all these files together.
24 The control, genomic input or mock IP data file. Please follow the
25 same direction as for `-t`/`--treatment`.
29 The name string of the experiment. MACS3 will use this string NAME
30 to create output files like `NAME_peaks.xls`,
31 `NAME_negative_peaks.xls`, `NAME_peaks.bed` , `NAME_summits.bed`,
32 `NAME_model.r` and so on. So please avoid any confliction between
33 these filenames and your existing files.
35 - `-f`/`--format FORMAT`
37 Format of tag file can be `ELAND`, `BED`, `ELANDMULTI`,
38 `ELANDEXPORT`, `SAM`, `BAM`, `BOWTIE`, `BAMPE`, or `BEDPE`. Default
39 is `AUTO` which will allow MACS3 to decide the format
40 automatically. `AUTO` is also useful when you combine different
41 formats of files. Note that MACS3 can't detect `BAMPE` or `BEDPE`
42 format with `AUTO`, and you have to implicitly specify the format
43 for `BAMPE` and `BEDPE`.
45 Nowadays, the most common formats are `BED` or `BAM` (including
46 `BEDPE` and `BAMPE`). Our recommendation is to convert your data to
49 Also, MACS3 can detect and read gzipped file. For example, `.bed.gz`
50 file can be directly used without being uncompressed with `--format
53 Here are detailed explanation of the recommended formats:
57 The BED format can be found at [UCSC genome browser
58 website](http://genome.ucsc.edu/FAQ/FAQformat#format1).
60 The essential columns in BED format input are the 1st column
61 `chromosome name`, the 2nd `start position`, the 3rd `end
62 position`, and the 6th, `strand`.
64 Note that, for `BED` format, the 6th column of strand information
65 is required by MACS3. And please pay attention that the
66 coordinates in BED format are zero-based and half-open. See more
68 site](http://genome.ucsc.edu/FAQ/FAQtracks#tracks1).
72 If the format is `BAM`/`SAM`, please check the definition in
73 [samtools](https://samtools.github.io/hts-specs/SAMv1.pdf). If
74 the `BAM` file is generated for paired-end data, MACS3 will only
75 keep the left mate(5' end) tag. However, when format `BAMPE` is
76 specified, MACS3 will use the real fragments inferred from
77 alignment results for reads pileup.
81 A special mode will be triggered while the format is specified as
82 `BAMPE` or `BEDPE`. In this way, MACS3 will process the `BAM` or
83 `BED` files as paired-end data. Instead of building a bimodal
84 distribution of plus and minus strand reads to predict fragment
85 size, MACS3 will use actual insert sizes of pairs of reads to
86 build fragment pileup.
88 The `BAMPE` format is just a `BAM` format containing paired-end
89 alignment information, such as those from `BWA` or `BOWTIE`.
91 The `BEDPE` format is a simplified and more flexible `BED` format,
92 which only contains the first three columns defining the
93 chromosome name, left and right position of the fragment from
94 Paired-end sequencing. Please note, this is NOT the same format
95 used by `bedtools`, and the `bedtools` version of `BEDPE` is
96 actually not in a standard `BED` format. You can use MACS3
97 subcommand [`randsample`](./randsample.md) or
98 [`filterdup`](./filterdup.md) to convert a `BAMPE` file containing
99 paired-end information to a `BEDPE` format file:
102 macs3 randsample -i the_BAMPE_file.bam -f BAMPE -p 100 -o the_BEDPE_file.bed
107 macs3 filterdup -i the_BAMPE_file.bam -f BAMPE --keep-dup all -o the_BEDPE_file.bed
113 MACS3 will save all output files into the specified folder for this
114 option. A new folder will be created if necessary.
118 If this flag is on, MACS3 will store the fragment pileup, control
119 lambda in bedGraph files. The bedGraph files will be stored in the
120 current directory named `NAME_treat_pileup.bdg` for treatment data,
121 `NAME_control_lambda.bdg` for local lambda values from control.
123 ### Options controling peak calling behaviors
127 It's the mappable genome size or effective genome size which is
128 defined as the genome size which can be sequenced. Because of the
129 repetitive features on the chromosomes, the actual mappable genome
130 size will be smaller than the original size, about 90% or 70% of the
131 genome size. The default *hs* ~2.9e9 is recommended for human
132 genome. Here are all precompiled parameters for effective genome
134 [deeptools](https://deeptools.readthedocs.io/en/develop/content/feature/effectiveGenomeSize.html):
136 * hs: 2,913,022,398 for GRCh38
137 * mm: 2,652,783,500 for GRCm38
138 * ce: 100,286,401 for WBcel235
139 * dm: 142,573,017 for dm6
141 Please check deeptools webpage to find the appropriate effective
142 genome size if you want a more accurate estimation regarding
143 specific assembly and read length.
145 Users may want to use k-mer tools to simulate mapping of Xbps long
146 reads to target genome, and to find the ideal effective genome
147 size. However, usually by taking away the simple repeats and Ns from
148 the total genome, one can get an approximate number of effective
149 genome size. A slight difference in the number won't cause a big
150 difference of peak calls, because this number is used to estimate a
151 genome-wide noise level which is usually the least significant one
152 compared with the *local biases* modeled by MACS3.
156 The size of sequencing tags. If you don't specify it, MACS3 will try
157 to use the first 10 sequences from your input treatment file to
158 determine the tag size. Specifying it will override the
159 automatically determined tag size.
163 The q-value (minimum FDR) cutoff to call significant
164 regions. Default is 0.05. For broad marks, you can try 0.01 as the
165 cutoff. The q-values are calculated from p-values using the
167 procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini%E2%80%93Hochberg_procedure).
171 The p-value cutoff. If `-p` is specified, MACS3 will use p-value
174 - `--min-length`, `--max-gap`
176 These two options can be used to fine-tune the peak calling behavior
177 by specifying the minimum length of a called peak and the maximum
178 allowed a gap between two nearby regions to be merged. In other
179 words, a called peak has to be longer than `min-length`, and if the
180 distance between two nearby peaks is smaller than `max-gap` then
181 they will be merged as one. If they are not set, MACS3 will set the
182 DEFAULT value for `min-length` as the predicted fragment size `d`,
183 and the DEFAULT value for `max-gap` as the detected read
184 length. Note, if you set a `min-length` value smaller than the
185 fragment size, it may have NO effect on the result. For broad peak
186 calling with `--broad` option set, the DEFAULT `max-gap` for merging
187 nearby stronger peaks will be the same as narrow peak calling, and 4
188 times of the `max-gap` will be used to merge nearby weaker (broad)
189 peaks. You can also use `--cutoff-analysis` option with the default
190 setting, and check the column `avelpeak` under different cutoff
191 values to decide a reasonable `min-length` value.
195 With this flag on, MACS3 will use the background lambda as local
196 lambda. This means MACS3 will not consider the local bias at peak
197 candidate regions. It is particularly recommended while calling
198 peaks without control sample.
200 - `--slocal`, `--llocal`
202 These two parameters control which two levels of regions will be
203 checked around the peak regions to calculate the maximum lambda as
204 local lambda. By default, MACS3 considers 1000bp for small local
205 region(`--slocal`), and 10000bps for large local region(`--llocal`)
206 which captures the bias from a long-range effect like an open
207 chromatin domain. You can tweak these according to your
208 project. Remember that if the region is set too small, a sharp spike
209 in the input data may kill a significant peak.
213 While on, MACS3 will bypass building the shifting model. Please
214 combine the usage of `--extsize` and `--shift` to achieve the effect
219 While `--nomodel` is set, MACS3 uses this parameter to extend reads
220 in 5'->3' direction to fix-sized fragments. For example, if the size
221 of the binding region for your transcription factor is 200 bp, and
222 you want to bypass the model building by MACS3, this parameter can
223 be set as 200. This option is only valid when `--nomodel` is set or
224 when MACS3 fails to build model and `--fix-bimodal` is on.
228 Note, this is NOT the legacy `--shiftsize` option which is replaced
229 by `--extsize` from MACS version 2! You can set an arbitrary shift
230 in bp here to adjust the alignment positions of reads in the whole
231 library. Please use discretion while setting it other than the
232 default value (0). When `--nomodel` is set, MACS3 will use this
233 value to move cutting ends (5') then apply `--extsize` from 5' to 3'
234 direction to extend them to fragments. When this value is negative,
235 the cutting ends (5') will be moved toward 3'->5' direction,
236 otherwise 5'->3' direction. Recommended to keep it as default 0 for
237 ChIP-Seq datasets, or -1 * half of *EXTSIZE* together with
238 `--extsize` option for detecting enriched cutting loci such as
239 certain DNAseI-Seq datasets. Note, you can't set values other than 0
240 if the format is BAMPE or BEDPE for paired-end data. The default is
243 Here are some examples for combining `--shift` and `--extsize`:
245 1. To find enriched cutting sites such as some DNAse-Seq
246 datasets. In this case, all 5' ends of sequenced reads should be
247 extended in both directions to smooth the pileup signals. If the
248 wanted smoothing window is 200bps, then use `--nomodel --shift
251 2. For certain nucleosome-seq data, we need to pile up the centers
252 of nucleosomes using a half-nucleosome size for wavelet analysis
253 (e.g. NPS algorithm). Since the DNA wrapped on nucleosome is about
254 147bps, this option can be used: `--nomodel --shift 37 --extsize
259 It controls the MACS3 behavior towards duplicate tags at the exact
260 same location -- the same coordination and the same strand. You can
261 set this as `auto`, `all`, or an integer value. The `auto` option
262 makes MACS3 calculate the maximum tags at the exact same location
263 based on binomial distribution using 1e-5 as p-value cutoff; and the
264 `all` option keeps every tag. If an integer is given, at most this
265 number of tags will be kept at the same location. The default is to
266 keep one tag at the same location. Default: 1
270 This option, along with the `bdgbroadcall` command, facilitates
271 broad peak calling, producing results in the UCSC gappedPeak format
272 which encapsulates a nested structure of peaks. To conceptualize
273 'nested' peaks, picture a gene structure housing regions analogous
274 to exons (strong peaks) and introns coupled with UTRs (weak
275 peaks). The broad peak calling process utilizes two distinct cutoffs
276 to discern broader, weaker peaks (`--broad-cutoff`) and narrower,
277 stronger peaks (`-p` or `-q`), which are subsequently nested to
278 provide a detailed peak landscape. Please note that, the `max-gap`
279 value for merging nearby weaker/broad peaks is 4 times of `max-gap`
280 for merging nearby stronger peaks. The later one can be controlled
281 by `--max-gap` option, and by default it is the average
282 fragment/insertion length in the PE data. DEFAULT: False
284 Please note that, if you only want to call 'broader' peak and not
285 interested in the nested peak structure, please simply use `-p` or
286 `-q` with weaker cutoff instead of using `--broad` option.
290 Cutoff for the broad region. This option is not available unless
291 `--broad` is set. Please note that if `-p` is set, this is a p-value
292 cutoff, otherwise, it's a q-value cutoff. DEFAULT: 0.1
294 - `--scale-to <large|small>`
296 When set to `large`, linearly scale the smaller dataset to the same
297 depth as the larger dataset. By default or being set as `small`, the
298 larger dataset will be scaled towards the smaller dataset. Beware,
299 to scale up small data would cause more false positives. So the
300 default behavior `small` is recommended.
304 MACS3 will now reanalyze the shape of signal profile (p or q-score
305 depending on the cutoff setting) to deconvolve subpeaks within each
306 peak called from the general procedure. It's highly recommended to
307 detect adjacent binding events. While used, the output subpeaks of a
308 big peak region will have the same peak boundaries, and different
309 scores and peak summit positions.
315 MACS3 uses a buffer size for incrementally increasing internal array
316 size to store reads alignment information for each chromosome or
317 contig. To increase the buffer size, MACS3 can run faster but will
318 waste more memory if certain chromosome/contig only has very few
319 reads. In most cases, the default value 100000 works fine. However,
320 if there are a large number of chromosomes/contigs in your alignment
321 and reads per chromosome/contigs are few, it's recommended to
322 specify a smaller buffer size in order to decrease memory usage (but
323 it will take longer time to read alignment files). Minimum memory
324 requested for reading an alignment file is about # of CHROMOSOME *
325 BUFFER_SIZE * 8 Bytes. DEFAULT: 100000
327 - `--cutoff-analysis`
329 While set, MACS3 will analyze the number or total length of peaks
330 that can be called by different cutoff then output a summary table
331 to help the user decide a better cutoff. Note, minlen and maxgap may
332 affect the results. DEFAULT: False
334 Different with the option in `bdgpeakcall`, `callpeak` will perform
335 both tasks to call peaks and to generate a report for cutoff
336 analysis. Please check the section *Cutoff Analysis* for more
341 1. `NAME_peaks.xls` is a tabular file which contains information about
342 called peaks. You can open it in excel and sort/filter using excel
343 functions. Information include:
346 - start position of peak
347 - end position of peak
348 - length of peak region
349 - absolute peak summit position
350 - pileup height at peak summit
351 - -log10(pvalue) for the peak summit (e.g. pvalue =1e-10, then
352 this value should be 10)
353 - fold enrichment for this peak summit against random Poisson
354 distribution with local lambda,
355 - -log10(qvalue) at peak summit
357 Coordinates in XLS is 1-based which is different from BED
358 format. When `--broad` is enabled for broad peak calling, the
359 pileup, p-value, q-value, and fold change in the XLS file will be
360 the mean value across the entire peak region, since peak summit
361 won't be called in broad peak calling mode.
363 2. `NAME_peaks.narrowPeak` is BED6+4 format file which contains the
364 peak locations together with peak summit, p-value, and q-value. You
365 can load it to the UCSC genome browser. Definition of some specific
368 - 5th: integer score for display. It's calculated as
369 `int(-10*log10pvalue)` or `int(-10*log10qvalue)` depending on
370 whether `-p` (pvalue) or `-q` (qvalue) is used as score
371 cutoff. Please note that currently this value might be out of the
372 [0-1000] range defined in [UCSC ENCODE narrowPeak
373 format](https://genome.ucsc.edu/FAQ/FAQformat.html#format12). You
374 can let the value saturated at 1000 (i.e. p/q-value = 10^-100) by
375 using the following 1-liner awk: `awk -v OFS="\t"
376 '{$5=$5>1000?1000:$5} {print}' NAME_peaks.narrowPeak`
377 - 7th: fold-change at peak summit
378 - 8th: -log10pvalue at peak summit
379 - 9th: -log10qvalue at peak summit
380 - 10th: relative summit position to peak start
382 The file can be loaded directly to the UCSC genome browser. Remove
383 the beginning track line if you want to analyze it by other tools.
385 3. `NAME_summits.bed` is in BED format, which contains the peak
386 summits locations for every peak. The 5th column in this file is
387 the same as what is in the `narrowPeak` file. If you want to find
388 the motifs at the binding sites, this file is recommended. The file
389 can be loaded directly to the UCSC genome browser. Remove the
390 beginning track line if you want to analyze it by other tools.
392 4. `NAME_peaks.broadPeak` is in BED6+3 format which is similar to
393 `narrowPeak` file, except for missing the 10th column for
394 annotating peak summits. This file and the `gappedPeak` file will
395 only be available when `--broad` is enabled. Since in the broad
396 peak calling mode, the peak summit won't be called, the values in
397 the 5th, and 7-9th columns are the mean value across all positions
398 in the peak region. Refer to `narrowPeak` if you want to fix the
399 value issue in the 5th column.
401 5. `NAME_peaks.gappedPeak` is in BED12+3 format which contains both
402 the broad region and narrow peaks. The 5th column is the score for
403 showing grey levels on the UCSC browser as in `narrowPeak`. The 7th
404 is the start of the first narrow peak in the region, and the 8th
405 column is the end. The 9th column should be RGB color key, however,
406 we keep 0 here to use the default color, so change it if you
407 want. The 10th column tells how many blocks including the starting
408 1bp and ending 1bp of broad regions. The 11th column shows the
409 length of each block and 12th for the start of each block. 13th:
410 fold-change, 14th: *-log10pvalue*, 15th: *-log10qvalue*. The file can
411 be loaded directly to the UCSC genome browser. Refer to
412 `narrowPeak` if you want to fix the value issue in the 5th column.
414 6. `NAME_model.r` is an R script which you can use to produce a PDF
415 image of the model based on your data. Load it to R by:
417 `$ Rscript NAME_model.r`
419 Then a pdf file `NAME_model.pdf` will be generated in your current
420 directory. Note, R is required to draw this figure.
422 7. The `NAME_treat_pileup.bdg` and `NAME_control_lambda.bdg` files are
423 in bedGraph format which can be imported to the UCSC genome browser
424 or be converted into even smaller bigWig files. The
425 `NAME_treat_pielup.bdg` contains the pileup signals (normalized
426 according to `--scale-to` option) from ChIP/treatment sample. The
427 `NAME_control_lambda.bdg` contains local biases estimated for each
428 genomic location from the control sample, or from treatment sample
429 when the control sample is absent. The subcommand `bdgcmp` can be
430 used to compare these two files and make a bedGraph file of scores
431 such as p-value, q-value, log-likelihood, and log fold changes.
435 Since cutoff can be an arbitrary value during peak calling, there are
436 many approaches proposed in the community to guide the cutoff
437 selection such as the [IDR
438 approach](https://doi.org/doi:10.1214%2F11-AOAS466). In MACS3, we
439 provide a simple way to do the cutoff analysis. The cutoff analysis
440 function is provided by `--cutoff-analysis` option in `callpeak`,
441 `bdgpeakcall`, and `hmmratac`. Among them, the function in
442 `bdgpeakcall` is more flexible and can be applied on any scoring
443 scheme. We will sperate this function into a dedicated subcommand in
446 Please note that if this `--cutoff-anlaysis` option is on, the report
447 will be written into a file named `NAME_cutoff_analysis.txt`.
449 When the option is on, we will generate a list of possible pvalue
450 cutoffs to check from pscore cutoff from 0.3 to 10, with a step of
451 0.3. When -log10(pvalue) is 0.3, it represents an extremely loose
452 cutoff pvalue 0.5; and when it's 10, it represents an extremely
453 strigent cutoff pvalue 1e-10. Please note that the is different with
454 `bdgpeakcall` where users can control how the cutoff should be
457 Then for each cutoff we plan to investigate, we will check the number
458 of peaks that can be called, their average peak length, and their
461 The report consists of four columns:
463 1. score: the possible fold change cutoff value.
464 2. npeaks: the number of peaks under this cutoff.
465 3. lpeaks: the total length of all peaks.
466 4. avelpeak: the average length of peaks.
468 While there's no universal rule to suggest the best cutoff, here are a
471 - You can use elbow analysis to find the cutoff that dramatically
472 change the trend of npeaks, lpeaks, or avelpeak. But you need to
473 think about how to define 'dramatical change'.
474 - You can use some common expectation to decide the cutoff. For
475 example, the number of peaks should be thousands/ or the avelpeak
476 should be around 500bps. Of course, it's arbitrary but the table
477 will give you some insight.