docs/callpeak.md

   1 # Call peaks
   2
   3 This is the main function in MACS3. It can be invoked by `macs3
   4 callpeak` . If you type this command with `-h`, you will see a full
   5 description of command-line options. Here we only list the essentials.
   6
   7 ## Essential Options
   8
   9 ### `-t`/`--treatment FILENAME`
  10
  11 This is the only REQUIRED parameter for MACS. The file can be in any
  12 supported format -- see detail in the `--format` option. If you have
  13 more than one alignment file, you can specify them as `-t A B C`. MACS
  14 will pool up all these files together.
  15
  16 ### `-c`/`--control`
  17
  18 The control, genomic input or mock IP data file. Please follow the
  19 same direction as for `-t`/`--treatment`.
  20
  21 ### `-n`/`--name`
  22
  23 The name string of the experiment. MACS will use this string NAME to
  24 create output files like `NAME_peaks.xls`, `NAME_negative_peaks.xls`,
  25 `NAME_peaks.bed` , `NAME_summits.bed`, `NAME_model.r` and so on. So
  26 please avoid any confliction between these filenames and your existing
  27 files.
  28
  29 ### `--outdir`
  30
  31 MACS3 will save all output files into the specified folder for this
  32 option. A new folder will be created if necessary.
  33
  34 ### `-f`/`--format FORMAT`
  35
  36 Format of tag file can be `ELAND`, `BED`, `ELANDMULTI`, `ELANDEXPORT`,
  37 `SAM`, `BAM`, `BOWTIE`, `BAMPE`, or `BEDPE`. Default is `AUTO` which
  38 will allow MACS to decide the format automatically. `AUTO` is also
  39 useful when you combine different formats of files. Note that MACS
  40 can't detect `BAMPE` or `BEDPE` format with `AUTO`, and you have to
  41 implicitly specify the format for `BAMPE` and `BEDPE`.
  42
  43 Nowadays, the most common formats are `BED` or `BAM` (including
  44 `BEDPE` and `BAMPE`). Our recommendation is to convert your data to
  45 `BED` or `BAM` first.
  46
  47 Also, MACS3 can detect and read gzipped file. For example, `.bed.gz`
  48 file can be directly used without being uncompressed with `--format
  49 BED`.
  50
  51 Here are detailed explanation of the recommanded formats:
  52
  53 ## `BED`
  54
  55 The BED format can be found at [UCSC genome browser
  56 website](http://genome.ucsc.edu/FAQ/FAQformat#format1).
  57
  58 The essential columns in BED format input are the 1st column
  59 `chromosome name`, the 2nd `start position`, the 3rd `end position`,
  60 and the 6th, `strand`.
  61
  62 Note that, for `BED` format, the 6th column of strand information is
  63 required by MACS. And please pay attention that the coordinates in BED
  64 format are zero-based and half-open. See more detail at
  65 [UCSC site](http://genome.ucsc.edu/FAQ/FAQtracks#tracks1).
  66
  67 ## `BAM`/`SAM`
  68
  69 If the format is `BAM`/`SAM`, please check the definition in
  70 (http://samtools.sourceforge.net/samtools.shtml).  If the `BAM` file is
  71 generated for paired-end data, MACS will only keep the left mate(5'
  72 end) tag. However, when format `BAMPE` is specified, MACS will use the
  73 real fragments inferred from alignment results for reads pileup.
  74
  75 ## `BEDPE` or `BAMPE`
  76
  77 A special mode will be triggered while the format is specified as
  78 `BAMPE` or `BEDPE`. In this way, MACS3 will process the `BAM` or `BED`
  79 files as paired-end data. Instead of building a bimodal distribution
  80 of plus and minus strand reads to predict fragment size, MACS3 will
  81 use actual insert sizes of pairs of reads to build fragment pileup.
  82
  83 The `BAMPE` format is just a `BAM` format containing paired-end alignment
  84 information, such as those from `BWA` or `BOWTIE`.
  85
  86 The `BEDPE` format is a simplified and more flexible `BED` format,
  87 which only contains the first three columns defining the chromosome
  88 name, left and right position of the fragment from Paired-end
  89 sequencing. Please note, this is NOT the same format used by
  90 `BEDTOOLS`, and the `BEDTOOLS` version of `BEDPE` is actually not in a
  91 standard `BED` format. You can use MACS3 subcommand `randsample` to
  92 convert a `BAM` file containing paired-end information to a `BEDPE`
  93 format file:
  94
  95 ```
  96 macs3 randsample -i the_BAMPE_file.bam -f BAMPE -p 100 -o the_BEDPE_file.bed
  97 ```
  98
  99 ### `-g`/`--gsize`
 100
 101 PLEASE assign this parameter to fit your needs!
 102
 103 It's the mappable genome size or effective genome size which is
 104 defined as the genome size which can be sequenced. Because of the
 105 repetitive features on the chromosomes, the actual mappable genome
 106 size will be smaller than the original size, about 90% or 70% of the
 107 genome size. The default *hs* -- 2.7e9 is recommended for human
 108 genome. Here are all precompiled parameters for effective genome size:
 109
 110  * hs: 2.7e9
 111  * mm: 1.87e9
 112  * ce: 9e7
 113  * dm: 1.2e8
 114
 115 Users may want to use k-mer tools to simulate mapping of Xbps long
 116 reads to target genome, and to find the ideal effective genome
 117 size. However, usually by taking away the simple repeats and Ns from
 118 the total genome, one can get an approximate number of effective
 119 genome size. A slight difference in the number won't cause a big
 120 difference of peak calls, because this number is used to estimate a
 121 genome-wide noise level which is usually the least significant one
 122 compared with the *local biases* modeled by MACS.
 123
 124 ### `-s`/`--tsize`
 125
 126 The size of sequencing tags. If you don't specify it, MACS will try to
 127 use the first 10 sequences from your input treatment file to determine
 128 the tag size. Specifying it will override the automatically determined
 129 tag size.
 130
 131 ### `-q`/`--qvalue`
 132
 133 The q-value (minimum FDR) cutoff to call significant regions. Default
 134 is 0.05. For broad marks, you can try 0.05 as the cutoff. Q-values are
 135 calculated from p-values using the Benjamini-Hochberg procedure.
 136
 137 ### `-p`/`--pvalue`
 138
 139 The p-value cutoff. If `-p` is specified, MACS3 will use p-value instead
 140 of q-value.
 141
 142 ### `--min-length`, `--max-gap`
 143
 144 These two options can be used to fine-tune the peak calling behavior
 145 by specifying the minimum length of a called peak and the maximum
 146 allowed a gap between two nearby regions to be merged. In other words,
 147 a called peak has to be longer than `min-length`, and if the distance
 148 between two nearby peaks is smaller than `max-gap` then they will be
 149 merged as one. If they are not set, MACS3 will set the DEFAULT value
 150 for `min-length` as the predicted fragment size `d`, and the DEFAULT
 151 value for `max-gap` as the detected read length. Note, if you set a
 152 `min-length` value smaller than the fragment size, it may have NO
 153 effect on the result. For broad peak calling with `--broad` option
 154 set, the DEFAULT `max-gap` for merging nearby stronger peaks will be
 155 the same as narrow peak calling, and 4 times of the `max-gap` will be
 156 used to merge nearby weaker (broad) peaks. You can also use
 157 `--cutoff-analysis` option with the default setting, and check the
 158 column `avelpeak` under different cutoff values to decide a reasonable
 159 `min-length` value.
 160
 161 ### `--nolambda`
 162
 163 With this flag on, MACS will use the background lambda as local
 164 lambda. This means MACS will not consider the local bias at peak
 165 candidate regions.
 166
 167 ### `--slocal`, `--llocal`
 168
 169 These two parameters control which two levels of regions will be
 170 checked around the peak regions to calculate the maximum lambda as
 171 local lambda. By default, MACS considers 1000bp for small local
 172 region(`--slocal`), and 10000bps for large local region(`--llocal`)
 173 which captures the bias from a long-range effect like an open
 174 chromatin domain. You can tweak these according to your
 175 project. Remember that if the region is set too small, a sharp spike
 176 in the input data may kill a significant peak.
 177
 178 ### `--nomodel`
 179
 180 While on, MACS will bypass building the shifting model.
 181
 182 ### `--extsize`
 183
 184 While `--nomodel` is set, MACS uses this parameter to extend reads in
 185 5'->3' direction to fix-sized fragments. For example, if the size of
 186 the binding region for your transcription factor is 200 bp, and you
 187 want to bypass the model building by MACS, this parameter can be set
 188 as 200. This option is only valid when `--nomodel` is set or when MACS
 189 fails to build model and `--fix-bimodal` is on.
 190
 191 ### `--shift`
 192
 193 Note, this is NOT the legacy `--shiftsize` option which is replaced by
 194 `--extsize`! You can set an arbitrary shift in bp here. Please Use
 195 discretion while setting it other than the default value (0). When
 196 `--nomodel` is set, MACS will use this value to move cutting ends (5')
 197 then apply `--extsize` from 5' to 3' direction to extend them to
 198 fragments. When this value is negative, ends will be moved toward
 199 3'->5' direction, otherwise 5'->3' direction. Recommended to keep it
 200 as default 0 for ChIP-Seq datasets, or -1 * half of *EXTSIZE* together
 201 with `--extsize` option for detecting enriched cutting loci such as
 202 certain DNAseI-Seq datasets. Note, you can't set values other than 0
 203 if the format is BAMPE or BEDPE for paired-end data. The default is 0.
 204
 205 Here are some examples for combining `--shift` and `--extsize`:
 206
 207 1. To find enriched cutting sites such as some DNAse-Seq datasets. In
 208 this case, all 5' ends of sequenced reads should be extended in both
 209 directions to smooth the pileup signals. If the wanted smoothing
 210 window is 200bps, then use `--nomodel --shift -100 --extsize 200`.
 211
 212 2. For certain nucleosome-seq data, we need to pile up the centers of
 213 nucleosomes using a half-nucleosome size for wavelet analysis
 214 (e.g. NPS algorithm). Since the DNA wrapped on nucleosome is about
 215 147bps, this option can be used: `--nomodel --shift 37 --extsize 73`.
 216
 217 ### `--keep-dup`
 218
 219 It controls the MACS behavior towards duplicate tags at the exact same
 220 location -- the same coordination and the same strand. The default
 221 `auto` option makes MACS calculate the maximum tags at the exact same
 222 location based on binomial distribution using 1e-5 as p-value cutoff;
 223 and the `all` option keeps every tag.  If an integer is given, at most
 224 this number of tags will be kept at the same location. The default is
 225 to keep one tag at the same location. Default: 1
 226
 227 ### `--broad`
 228
 229 When this flag is on, MACS will try to composite broad regions in
 230 BED12 ( a gene-model-like format ) by putting nearby highly enriched
 231 regions into a broad region with loose cutoff. The broad region is
 232 controlled by another cutoff through `--broad-cutoff`. Please note
 233 that, the `max-gap` value for merging nearby weaker/broad peaks is 4
 234 times of `max-gap` for merging nearby stronger peaks. The later one
 235 can be controlled by `--max-gap` option, and by default it is the
 236 average fragment/insertion length in the PE data. DEFAULT: False
 237
 238 ### `--broad-cutoff`
 239
 240 Cutoff for the broad region. This option is not available unless
 241 `--broad` is set. If `-p` is set, this is a p-value cutoff, otherwise,
 242 it's a q-value cutoff.  DEFAULT: 0.1
 243
 244 ### `--scale-to <large|small>`
 245
 246 When set to `large`, linearly scale the smaller dataset to the same
 247 depth as the larger dataset. By default or being set as `small`, the
 248 larger dataset will be scaled towards the smaller dataset. Beware, to
 249 scale up small data would cause more false positives.
 250
 251 ### `-B`/`--bdg`
 252
 253 If this flag is on, MACS will store the fragment pileup, control
 254 lambda in bedGraph files. The bedGraph files will be stored in the
 255 current directory named `NAME_treat_pileup.bdg` for treatment data,
 256 `NAME_control_lambda.bdg` for local lambda values from control.
 257
 258 ### `--call-summits`
 259
 260 MACS will now reanalyze the shape of signal profile (p or q-score
 261 depending on the cutoff setting) to deconvolve subpeaks within each
 262 peak called from the general procedure. It's highly recommended to
 263 detect adjacent binding events. While used, the output subpeaks of a
 264 big peak region will have the same peak boundaries, and different
 265 scores and peak summit positions.
 266
 267 ### `--buffer-size`
 268
 269 MACS uses a buffer size for incrementally increasing internal array
 270 size to store reads alignment information for each chromosome or
 271 contig. To increase the buffer size, MACS can run faster but will
 272 waste more memory if certain chromosome/contig only has very few
 273 reads. In most cases, the default value 100000 works fine. However, if
 274 there are a large number of chromosomes/contigs in your alignment and
 275 reads per chromosome/contigs are few, it's recommended to specify a
 276 smaller buffer size in order to decrease memory usage (but it will
 277 take longer time to read alignment files). Minimum memory requested
 278 for reading an alignment file is about # of CHROMOSOME * BUFFER_SIZE *
 279 8 Bytes. DEFAULT: 100000
 280
 281 ## Output files
 282
 283 1. `NAME_peaks.xls` is a tabular file which contains information about
 284    called peaks. You can open it in excel and sort/filter using excel
 285    functions. Information include:
 286
 287     - chromosome name
 288     - start position of peak
 289     - end position of peak
 290     - length of peak region
 291     - absolute peak summit position
 292     - pileup height at peak summit
 293     - -log10(pvalue) for the peak summit (e.g. pvalue =1e-10, then
 294       this value should be 10)
 295     - fold enrichment for this peak summit against random Poisson
 296       distribution with local lambda,
 297     - -log10(qvalue) at peak summit
 298
 299    Coordinates in XLS is 1-based which is different from BED
 300    format. When `--broad` is enabled for broad peak calling, the
 301    pileup, p-value, q-value, and fold change in the XLS file will be
 302    the mean value across the entire peak region, since peak summit
 303    won't be called in broad peak calling mode.
 304
 305 2. `NAME_peaks.narrowPeak` is BED6+4 format file which contains the
 306    peak locations together with peak summit, p-value, and q-value. You
 307    can load it to the UCSC genome browser. Definition of some specific
 308    columns are:
 309
 310    - 5th: integer score for display. It's calculated as
 311      `int(-10*log10pvalue)` or `int(-10*log10qvalue)` depending on
 312      whether `-p` (pvalue) or `-q` (qvalue) is used as score
 313      cutoff. Please note that currently this value might be out of the
 314      [0-1000] range defined in [UCSC ENCODE narrowPeak
 315      format](https://genome.ucsc.edu/FAQ/FAQformat.html#format12). You
 316      can let the value saturated at 1000 (i.e. p/q-value = 10^-100) by
 317      using the following 1-liner awk: `awk -v OFS="\t"
 318      '{$5=$5>1000?1000:$5} {print}' NAME_peaks.narrowPeak`
 319    - 7th: fold-change at peak summit
 320    - 8th: -log10pvalue at peak summit
 321    - 9th: -log10qvalue at peak summit
 322    - 10th: relative summit position to peak start
 323
 324    The file can be loaded directly to the UCSC genome browser. Remove
 325    the beginning track line if you want to analyze it by other tools.
 326
 327 3. `NAME_summits.bed` is in BED format, which contains the peak
 328    summits locations for every peak. The 5th column in this file is
 329    the same as what is in the `narrowPeak` file. If you want to find
 330    the motifs at the binding sites, this file is recommended. The file
 331    can be loaded directly to the UCSC genome browser. Remove the
 332    beginning track line if you want to analyze it by other tools.
 333
 334 4. `NAME_peaks.broadPeak` is in BED6+3 format which is similar to
 335    `narrowPeak` file, except for missing the 10th column for
 336    annotating peak summits. This file and the `gappedPeak` file will
 337    only be available when `--broad` is enabled. Since in the broad
 338    peak calling mode, the peak summit won't be called, the values in
 339    the 5th, and 7-9th columns are the mean value across all positions
 340    in the peak region. Refer to `narrowPeak` if you want to fix the
 341    value issue in the 5th column.
 342
 343 5. `NAME_peaks.gappedPeak` is in BED12+3 format which contains both
 344    the broad region and narrow peaks. The 5th column is the score for
 345    showing grey levels on the UCSC browser as in `narrowPeak`. The 7th
 346    is the start of the first narrow peak in the region, and the 8th
 347    column is the end. The 9th column should be RGB color key, however,
 348    we keep 0 here to use the default color, so change it if you
 349    want. The 10th column tells how many blocks including the starting
 350    1bp and ending 1bp of broad regions. The 11th column shows the
 351    length of each block and 12th for the start of each block. 13th:
 352    fold-change, 14th: *-log10pvalue*, 15th: *-log10qvalue*. The file can
 353    be loaded directly to the UCSC genome browser. Refer to
 354    `narrowPeak` if you want to fix the value issue in the 5th column.
 355
 356 6. `NAME_model.r` is an R script which you can use to produce a PDF
 357    image of the model based on your data. Load it to R by:
 358
 359    `$ Rscript NAME_model.r`
 360
 361    Then a pdf file `NAME_model.pdf` will be generated in your current
 362    directory. Note, R is required to draw this figure.
 363
 364 7. The `NAME_treat_pileup.bdg` and `NAME_control_lambda.bdg` files are
 365    in bedGraph format which can be imported to the UCSC genome browser
 366    or be converted into even smaller bigWig files. The
 367    `NAME_treat_pielup.bdg` contains the pileup signals (normalized
 368    according to `--scale-to` option) from ChIP/treatment sample. The
 369    `NAME_control_lambda.bdg` contains local biases estimated for each
 370    genomic location from the control sample, or from treatment sample
 371    when the control sample is absent. The subcommand `bdgcmp` can be
 372    used to compare these two files and make a bedGraph file of scores
 373    such as p-value, q-value, log-likelihood, and log fold changes.
 374