ChangeLog

   1 2023-06-15  Tao Liu  <vladimir.liu@gmail.com>
   2         MACS 3.0.0b2
   3         * New features in MACS3:
   4
   5         1) Speed/memory optimization.  Use the cykhash to replace python
   6         dictionary. Use buffer (10MB) to read and parse input file (not
   7         available for BAM file parser). And many optimization tweaks. We
   8     added memory monitoring to the runtime messages.
   9
  10         2) Code cleanup. Reorganize source codes.
  11
  12         3) Unit testing.
  13
  14         4) R wrappers for MACS -- MACSr
  15
  16         5) Switch to Github Action for CI, support multi-arch testing
  17         including x64, armv7, aarch64, s390x and ppc64le.
  18
  19         6) MACS tag-shifting model has been refined. Now it will use a
  20         naive peak calling approach to find ALL possible paired peaks at +
  21         and - strand, then use all of them to calculate the
  22         cross-correlation. (a related bug has been fix #442)
  23
  24         7) Call variants in peak regions directly from BAM files. The
  25         function was originally developed under code name SAPPER. Now
  26         SAPPER has been merged into MACS. Also, `simde` has been added as
  27         a submodule in order to support fermi-lite library under non-x64
  28         architectures.
  29
  30         8) BAI index and random access to BAM file now is supported. #449
  31         And user can use original BAM file (instead of the subset of BAM
  32         file as in SAPPER) in the `callvar` command.
  33
  34         9) Support of Python > 3.10 #497 #498
  35
  36         10) HMMRATAC module is added. HMMRATAC is a dedicated software to
  37         analyze ATAC-seq data. The basic idea behind HMMRATAC is to digest
  38         ATAC-seq data according to the fragment length of read pairs into
  39         four signal tracks: short fragments, mononucleosomal fragments,
  40         di-nucleosomal fragments and tri-nucleosomal fragments. Then
  41         integrate the four tracks again using Hidden Markov Model to
  42         consider three hidden states: open region, nucleosomal region, and
  43         background region. The orginal paper was published in 2019 written
  44         in JAVA, by Evan Tarbell. We implemented it in Python/Cython and
  45         optimize the whole process using existing MACS functions and
  46         hmmlearn. Now it can run much faster than the original JAVA
  47         version. Note: evaluation of the peak calling results is underway.
  48
  49         11) The effective genome size parameters have been updated
  50         according to deeptools. #508
  51
  52         12) Multiple updates regarding dependencies, anaconda built, CI/CD
  53         process.
  54
  55         * Other:
  56         1) Missing header line while no peaks can be called #501 #502
  57
  58 2020-04-11  Tao Liu  <vladimir.liu@gmail.com>
  59         MACS version 2.2.7.1
  60
  61         * hotfix:
  62
  63         Add 'wheel' and 'pip' to pyproject.toml so that `pip install` can
  64         work.
  65
  66 2020-04-10  Tao Liu  <vladimir.liu@gmail.com>
  67         MACS version 2.2.7
  68
  69         * Bugs fixed
  70
  71         1) MACS2 has been tested on multiple architectures to make sure it
  72         can successfully generate consistent results. Currently the
  73         supported architectures are: AMD64, ARM64, i386, PPC64LE, and
  74         S390X. Thanks to @mr-c, @junaruga, and @tillea! Related to issue
  75         #340, #349, #351, and #359; to PR #348, #350, #360, #361, #367,
  76         and #370. The lesson is that if the project is built on Cython and
  77         is aimed at memory efficiency, we should specifically define all
  78         int/float types in pyx files such as int8_t or uint32_t using
  79         either libc or numpy (c version) instead of relying on Cython
  80         types such as short, long, double.
  81
  82         2) MACS2 setup script will check numpy and install numpy if
  83         necessary. PR #378, issue #364
  84
  85         3) `bdgbroadcall` command will correctly add the score column (5th
  86         column). The score (5th) column contains 10 times of the average
  87         score in the broad region. PR #373, issue #362
  88
  89         4) The missing test on `bdgopt` subcommand has been added. PR #363
  90
  91         5) The obsolete option `--ratio` from `callpeak` subcommand has
  92         been removed. PR #369, issue #366
  93
  94         6) Fixed the incorrect description in README on the 'maximum
  95         length of broad region is 4 times of d' to 'maximum gap for
  96         merging broad regions is 4 times of tag size by default'. PR #380,
  97         issue #365.
  98
  99         * Other
 100
 101         1) CODE OF CONDUCT document has been added to MACS2 github
 102         repository. PR #358
 103
 104 2019-12-12  Tao Liu  <vladimir.liu@gmail.com>
 105         MACS version 2.2.6
 106
 107         * New Features
 108
 109         1) Speed up MACS2. Some programming tricks and code cleanup. The
 110         filter_dup function replaces separate_dups. The later one was
 111         implemented for potentially putting back duplicate reads in
 112         certain downstream analysis. However such analysis hasn't been
 113         implemented. Optimize the speed of writing bedGraph
 114         files. Optimize BAM and BAMPE parsing with pointer casting instead
 115         of python unpack.
 116
 117         2) The comment lines in the headers of BED or SAM files will be
 118         correctly skipped. However, MACS2 won't check comment lines in the
 119         middle of the file.
 120
 121         * Bugs fixed
 122
 123         1) Cutoff-analysis in callpeak command. #341
 124
 125         2) Issues related to SAMParser and three ELAND Parsers are
 126         fixed. #347
 127
 128         * Other
 129
 130         1) cmdlinetest script in test/ folder has been updated to: 1. test
 131         cutoff-analysis with callpeak cmd; 2. output the 2 lines before
 132         and after the error or warning message during tests; 3. output
 133         only the first 10 lines if the difference between test result and
 134         standard result can be found; 4. prockreport monitor CPU time and
 135         memory usage in 1 sec interval -- a bit more accurate.
 136
 137         2) Python3.5 support is removed. Now MACS2 requires Python>=3.6.
 138
 139 2019-10-31  Tao Liu  <vladimir.liu@gmail.com>
 140         MACS version 2.2.5 (Py3 speed up)
 141
 142         * Features added
 143
 144         1) *Github code only and Not included in MACS2 release* New
 145         testing data for performance test. An subsampled ENCODE2 CTCF
 146         ChIP-seq dataset, including 5million ChIP reads and 5 million
 147         control reads, has been included in the test folder for testing
 148         CPU and memory usage (i.e. 5M test). Several related scripts ,
 149         including `prockreport` for output cpu memory usage, `pyprofile`
 150         and `pyprofile_stat` for debuging and profiling MACS2 codes, have
 151         been included.
 152
 153         2) Speed up pvalue-qvalue checkup (pqtable checkup) #335 #338.
 154         The old hashtable.pyx implementation copied from Pandas (very old
 155         version) doesn't work well in Python3+Cython. It slows down the
 156         pqtable checkup using the identical Cython codes as in
 157         v2.1.4. While running 5M test, the `__getitem__` function in the
 158         hashtable.pyx took 3.5s with 37,382,037 calls in MACS2 v2.1.4, but
 159         148.6s with the same number of calls in MACS2 v2.2.4. As a
 160         consequence, the standard python dictionary implementation has
 161         replaced hashtable.pyx for pqtable checkup. Now MACS2 runs a bit
 162         faster than py2 version, but uses a bit more memory. In general,
 163         v2.2.5 can finish 5M reads test in 20% less time than MACS2
 164         v2.1.4, but use 15% more memory.
 165
 166         * Bug fixed
 167
 168         1) More Python3 related fixes, e.g. the return value of keys from
 169         py3 dict. #333 #337
 170
 171
 172 2019-10-01  Tao Liu  <vladimir.liu@gmail.com>
 173         MACS version 2.2.4 (Python3)
 174
 175         * Features added
 176
 177         1) First Python3 version MACS2 released.
 178
 179         2) Version number 2.2.X will be used for MACS2 in Python3, in
 180         parallel to 2.1.X.
 181
 182         3) More comprehensive test.sh script to check the consistency of
 183         results from Python2 version and Python3 version.
 184
 185         4) Simplify setup.py script since the newest version transparently
 186         supports cython. And when cython is not installed by the user,
 187         setup.py can still compile using only C codes.
 188
 189         5) Fix Signal.pyx to use np.array instead of np.mat.
 190
 191 2019-09-30  Tao Liu  <vladimir.liu@gmail.com>
 192         MACS version 2.1.4
 193
 194         * Features added
 195
 196         Github Actions is used together with Travis CI for testing and
 197         deployment.
 198
 199         * Bugs fixed
 200
 201         PR #322:
 202
 203         1) #318 Random score in bdgdiff output. It turns out the sum_v is
 204         not initialized as 0 before adding. Potential bugs are fixed in
 205         other functions in ScoreTrack and CallPeakUnit codes.
 206
 207         2) #321 Cython dependency in setup.py script is removed. And place
 208         'cythonzie' call to the correct position.
 209
 210         3) A typo is fixed in Github Actions script.
 211
 212 2019-09-19  Tao Liu  <vladimir.liu@gmail.com>
 213         MACS version 2.1.3.3
 214
 215         * Features added
 216
 217         1) Support Docker auto-deploy. PR #309
 218
 219         2) Support Travis CI auto-testing, update unit-testing
 220         scripts, and enable subcommand testing on small datasets.
 221
 222         3) Update README documents. #297 PR #306
 223
 224         4) `cmbreps` supports more than 2 replicates. Merged from PR #304
 225         @Maarten-vd-Sande and PR #307 (our own chi-sq test code)
 226
 227         5) `--d-min` option is added in `callpeak` and `predictd`, to
 228         exclude predictions of fragment size smaller than the given
 229         value. Merged from PR #267 @shouldsee.
 230
 231         6) `--buffer-size` option is added in `predictd`, `filterdup`,
 232         `pileup` and `refinepeak` subcommands. Users can use this option
 233         to decrease memory usage while there are a large number of contigs
 234         in the data. Also, now `callpeak`, `predictd`, `filterdup`,
 235         `pileup` and `refinepeak` will suggest users to tweak
 236         `--buffer-size` while catching a MemoryError. #313 PR #314
 237
 238         * Bugs fixed
 239
 240         1) #265 Fixed a bug where the pseudocount hasn't been applied
 241         while calculating p-value score in ScoreTrack object.
 242
 243         2) Fixed bdgbroadcall so that it will report those broad peaks
 244         without strong peak inside, a consistent behavior as `callpeak
 245         --broad`.
 246
 247         3) Rename COPYING to LICENSE.
 248
 249 2018-10-17  Tao Liu  <vladimir.liu@gmail.com>
 250         MACS version 2.1.2
 251
 252         * New features
 253
 254         1) Added missing BEDPE support. And enable the support for BAMPE
 255         and BEDPE formats in 'pileup', 'filterdup' and 'randsample'
 256         subcommands. When format is BAMPE or BEDPE, The 'pileup' command
 257         will pile up the whole fragment defined by mapping locations of
 258         the left end and right end of each read pair. Thank @purcaro
 259
 260         2) Added options to callpeak command for tweaking max-gap and
 261         min-len during peak calling. Thank @jsh58!
 262
 263         3) The callpeak option "--to-large" option is replaced with
 264         "--scale-to large".
 265
 266         4) The randsample option "-t" has been replaced with "-i".
 267
 268         * Bug fixes
 269
 270         1) Fixed memory issue related to #122 and #146
 271
 272         2) Fixed a bug caused by a typo. Related to #249, Thank @shengqh
 273
 274         3) Fixed a bug while setting commandline qvalue cutoff.
 275
 276         4) Better describe the 5th column of narrowPeak. Thank @alexbarrera
 277
 278         5) Fixed the calculation of average fragment length for paired-end
 279         data. Thank @jsh58
 280
 281         6) Fixed bugs caused by khash while computing p/q-value and log
 282         likelihood ratios. Thank @jsh58
 283
 284         7) More spelling tweaks in source code. Thank @mr-c
 285
 286 2016-03-09  Tao Liu  <vladimir.liu@gmail.com>
 287         MACS version 2.1.1 20160309
 288
 289         * Retire the tag:rc.
 290
 291         * Fixed spelling. Merged pull request #120. Thank @mr-c!
 292
 293         * Change filtering criteria for reading BAM/SAM files
 294
 295         Related to callpeak and filterdup commands. Now the
 296         reads/alignments flagged with 1028 or 'PCR/Optical duplicate' will
 297         still be read although MACS2 may decide them as duplicates
 298         later. Related to old issue #33. Sorry I forgot to address it for
 299         years!
 300
 301 2016-02-26  Tao Liu  <vladimir.liu@gmail.com>
 302         MACS version 2.1.1 20160226 (tag:rc Zhengyue)
 303
 304         * Bug fixes
 305
 306         1) Now "-Ofast" has been replaced by "-O3 --ffast-math", because
 307         the former option is not supported by older GCC. Related to issues
 308         #91, #109.
 309
 310         2) Issue #108 is fixed. If no peak can be found in a chromosome,
 311         the PeakIO won't throw an error.
 312
 313         * New features
 314
 315         1) callpeak
 316
 317         a) A more flexible format, BEDPE, is supported. Now users can
 318         define the left and right position of the ChIPed fragment, and
 319         MACS2 will skip model building and directly pileup the
 320         fragments. Related to issue #112.
 321
 322         b) The 'tempdir' can be specified, to save cached pileup
 323         tracks. Originially, the temporary files were stored in
 324         /tmp. Thank @daler! Related to issues #97 and #105.
 325
 326         2) bdgopt
 327
 328         New operations are added, to calculate the maximum or minimum value between
 329         values in BEDGRAPH and given value.
 330
 331         3) bdgcmp
 332
 333         New method is added, to calculate the maximum value between values
 334         defined in two BEDGRAPH files.
 335
 336 2015-12-22  Tao Liu  <vladimir.liu@gmail.com>
 337         MACS version 2.1.0 20151222 (tag:rc Dongzhi)
 338
 339         * Bug fixes
 340
 341         1) Fix a bug while dealing with some chromosomes only containing
 342         one read (pair). The size of dup_plus/dup_minus arrays after
 343         filtering dups should +1.
 344
 345         2) Fix a bug related to the broad peak calling function in
 346         previous versions. The gaps were miscalculated, so segmented weak
 347         broad calls may be reported, and sometimes you would see peaks
 348         with lower than cutoff values in the output files.
 349
 350         3) "Potentially" Fixed issue #105 on temporary cache files, need
 351         further followup.
 352
 353
 354 2015-07-31  Tao Liu  <vladimir.liu@gmail.com>
 355         MACS version 2.1.0 20150731 (tag:rc)
 356
 357         * Bug fixes
 358
 359         1) Fixed issue #76: information about broad/narrow cutoff will be
 360         correctly displayed.
 361
 362         2) Fixed issue #79: bdgopt extparam option is fixed.
 363
 364         3) Fixed issue #87: reference to cProb has been fixed as 'Prob'
 365         for filterdup command.
 366
 367         4) Fixed issue #78, #88 and similar issue reported in MACS google
 368         group: MACS2 now can correctly deal with multiple alignment files
 369         for -t or -c. The 'finalize' function will be correctly
 370         called. Multiple files option is enabled for filterdup,
 371         randsample, predictd, pileup and refinepeak commands.
 372
 373         5) A related issue to #88, when BAMPE mode is used, PE pairs will
 374         be sorted by leftmost then rightmost ends.
 375
 376         6) Fixed issue #86: A wrong use of 'ndarray' to create Numpy
 377         array. This will cause 'callpeak --nolambda' hang forever while
 378         calculating pvalues and qvalues.
 379
 380 2015-04-20  Tao Liu  <vladimir.liu@gmail.com>
 381         MACS version 2.1.0 20150420 (tag:rc)
 382
 383         * New commands
 384
 385         1) bdgopt: some convenient functions to modify bedGraph files.
 386
 387         2) cmbreps: Combine scores from two replicates. Including three
 388         methods: 1. take the maximum; 2. take the average; 3. use Fisher's
 389         method to combine two p-value scores. After that, user can use
 390         bdgpeakcall to call peaks on combined scores.
 391
 392         * New features
 393
 394         1) callpeak and bdgpeakcall now can try to analyze the
 395         relationship between p-values and number/length of peaks then
 396         generate a summary to help users decide an appropriate cutoff.
 397
 398         2) callpeak now can accept fold-enrichment cutoff as a filter for
 399         final peak calls.
 400
 401         * Performance
 402
 403         Now MACS2 runs about 3X as fast as previous version. Trade
 404         clean python codes for speed... Now while processing 50M ChIP vs
 405         50M control, it will take only 10 minutes.
 406
 407         * Bug fixes
 408
 409         1) Sampling function in BAMPE mode.
 410
 411         2) Callpeak while there are >= 2 input files for -t or -c.
 412
 413         3) While reading BAM/SAM, those secondary or supplementary
 414         alignments will be correctly skipped.
 415
 416         4) Fixed issue #33: Explanation is added to callpeak --keep-dup
 417         option that MACS2 will discard those SAM/BAM alignments with bit
 418         1024 no matter how --keep-dup is set.
 419
 420         5) Fixed issue #49: setuptools is used intead of distutils
 421
 422         6) Fixed issue #51: fix the problem when using --trackline
 423         argument when control file is absent.
 424
 425         7) Fixed issue #53: Use Use SAM/BAM CIGAR to find the 5' end of
 426         read mapped to minus strand. Previous implementation will find
 427         incorrect 5' end if there is indel in alignment.
 428
 429         8) Fixed issue #56: An incorrect sorting method used for BAMPE
 430         mode which will cause incorrect filtering of duplicated reads. Now
 431         fixed.
 432
 433         9) Issue #63: Merged from jayhesselberth@github, extsize now can
 434         be 1.
 435
 436         10) Issue #71: Merged from aertslab@github, close file descriptor
 437         after creating them with mkstemp().
 438
 439 2014-06-16  Tao Liu  <vladimir.liu@gmail.com>
 440         MACS version 2.1.0 20140616 (tag:rc)
 441
 442         * callpeak module
 443
 444         "--ratio" is added to manually assign the scaling factor of ChIP
 445         vs control, e.g. from NCIS. Thank Colin D and Dietmar Rieder for
 446         implementing the patch file!
 447
 448         "--shift" is added to move cutting ends (5' end of reads) around,
 449         in order to process DNAse-Seq data, e.g., use "--shift -100
 450         --extsize 200" to get 200bps fragments around 5' ends. For general
 451         ChIP-Seq data analysis, this option should be always set as
 452         0. Thank Xi Chen and Anshul Kundaje for the discussions in user
 453         group!
 454
 455         ** Do not output negative fragment size from cross-correlation
 456         analysis. Thank Alvin Qin for the feedback!
 457
 458         ** --half-ext and --control-shift are removed. For complex read
 459         shifting and extending, combine '--shift' and '--extsize'
 460         options. For comparing two conditions, use 'bdgdiff' module
 461         instead.
 462
 463         ** a bug is fixed to output the last pileup value in bdg file
 464         correctly.
 465
 466         * filterdup
 467
 468         A 'dry-run' option is added to only output numbers, including the
 469         number of allowed duplicates, the total number of reads before and
 470         after filtering duplicates and the estimated duplication
 471         rate. Thank John Urban for the suggestion!
 472
 473
 474 2013-12-16  Tao Liu  <vladimir.liu@gmail.com>
 475         MACS version 2.0.10 20131216 (tag:alpha)
 476
 477         bug fixes and tweaks
 478
 479         * We changed license from Artistic License to 3-clauses BSD license.
 480
 481         Yes. Simpler the better.
 482
 483         * Process paired-end data with "-f BAMPE" without control
 484
 485         * GappedPeak output for --broad option has been fixed again to be
 486         consistent with official UCSC format. We add 1bp pseudo-block to
 487         left and/or right of broad region when necessary, so that you can
 488         virtualize the regions without strong enrichment inside
 489         successfully. In downstream analysis except for virtualization,
 490         you may need to remove all 1bps blocks from gappedPeak file.
 491
 492         * diffpeak subcommand is temporarily disabled. Till we
 493         re-implement it.
 494
 495 2013-10-28  Tao Liu  <vladimir.liu@gmail.com>
 496         MACS version 2.0.10 20131028 (tag:alpha)
 497
 498         * callpeak --call-summits improvement
 499
 500         The smoothing window length has been fixed as fragment length
 501         instead of short read length. The larger smoothing window will
 502         grant better smoothing results and better sub-peak summits
 503         detection.
 504
 505         * --outdir and --ofile options for almost all commands
 506
 507         Thank Björn Grüning for initially implementing these options!
 508         Now, MACS2 will save results into a specified
 509         directory by '--outdir' option, and/or save result into a
 510         specified file by '--ofile' option. Note, in case '--ofile' is
 511         available for a subcommand, '-o' now has been adjusted to be the
 512         same as '--ofile' instead of '--o-prefix'.
 513
 514         Here is the list of changes. For more detail, use 'macs2 xxx -h'
 515         for each subcommand:
 516
 517         ** callpeak: --outdir
 518         ** diffpeak: Not implemented
 519         ** bdgpeakcall: --outdir and --ofile
 520         ** bdgbroadcall: --outdir and --ofile
 521         ** bdgcmp: --outdir and --ofile. While --ofile is used, the number
 522         and the order of arguments for --ofile must be the same as for -m.
 523         ** bdgdiff: --outdir and --ofile
 524         ** filterdup: --outdir
 525         ** pileup: --outdir
 526         ** randsample: --outdir
 527         ** refinepeak: --outdir and --ofile
 528
 529
 530 2013-09-15  Tao Liu  <vladimir.liu@gmail.com>
 531         MACS version 2.0.10 20130915 (tag:alpha)
 532
 533         * callpeak Added a new option --buffer-size
 534
 535         This option is to tweak a previously hidden parameter that
 536         controls the steps to increase array size for storing alignment
 537         information. While in some rare cases, the number of
 538         chromosomes/contigs/scaffolds is huge, the original default
 539         setting will cause a huge memory waste. In these cases, we
 540         recommend to decrease --buffer-size (e.g., 1000) to save memory,
 541         although the decrease will slow process to read alignment files.
 542
 543         * an optimization to speed up pvalue-qvalue statistics
 544
 545         Previously, it took a hour to prepare p-q-table for 65M vs 65M
 546         human TF library, and now it will take 10 minutes. It was due to a
 547         single line of code to get a value from a numpy array ...
 548
 549         * fixed logLR bugs.
 550
 551 2013-07-31  Tao Liu  <vladimir.liu@gmail.com>
 552         MACS version 2.0.10 20130731 (tag:alpha)
 553
 554         * callpeak --call-summits
 555
 556         Fix bugs causing callpeak --call-summits option generating extra
 557         number of peaks and inconsistent peak boundaries comparing to
 558         default option. Thank Ben Levinson!
 559
 560         * bdgcmp output
 561
 562         Fix bugs causing bdgcmp output logLR all in positive values. Now
 563         'depletion' can be correctly represented as negative values.
 564
 565         * bdgdiff
 566
 567         Fix the behavior of bdgdiff module. Now it can take four
 568         bedGraph files, then use logLR as cutoff to call differential
 569         regions. Check command line of bdgdiff for detail.
 570
 571 2013-07-13  Tao Liu  <vladimir.liu@gmail.com>
 572         MACS version 2.0.10 20130713 (tag:alpha)
 573
 574         * fix bugs while output broadPeak and gappedPeak.
 575
 576         Note. Those weak broad regions without any strong enrichment
 577         regions inside won't be saved in gappedPeak file.
 578
 579         * bdgcmp -T and -C are merged into -S and description is updated.
 580
 581         Now, you can use it to override SPMR values in your input for
 582         bdgcmp. To use SPMR (from 'callpeak --SPMR -B') while calculating
 583         statistics will cause weird results ( in most cases, lower
 584         significancy), and won't be consistent with MACS2 callpeak
 585         behavior. So if you have SPMR bedGraphs, input the smaller/larger
 586         sample size in MILLION according to 'callpeak --to-large' option.
 587
 588 2013-07-10  Tao Liu  <vladimir.liu@gmail.com>
 589         MACS version 2.0.10 20130710 (tag:alpha)
 590
 591         * fix BED style output format of callpeak module:
 592
 593         1) without --broad: narrowPeak (BED6+4) and BED for summit will be
 594         the output. Old BED format file won't be saved.
 595
 596         2) with --broad: broadPeak (BED6+3) for broad region and
 597         gappedPeak (BED12+3) for chained enriched regions will be the
 598         output. Old BED format, narrowPeak format, summit file won't be
 599         saved.
 600
 601         * bdgcmp now can accept list of methods to calculate scores. So
 602         you can run it once to generate multiple types of scores. Thank
 603         Jon Urban for this suggestion!
 604
 605         * C codes are re-generated through Cython 0.19.1.
 606
 607 2013-05-21  Tao Liu  <vladimir.liu@gmail.com>
 608         MACS version 2.0.10 20130520 (tag:alpha)
 609
 610         * broad peak calling modules are modified in order to report all
 611         relexed regions even there is no strong enrichment inside.
 612
 613 2013-05-01  Tao Liu  <vladimir.liu@gmail.com>
 614         MACS version 2.0.10 20130501 (tag:alpha)
 615
 616         * Memory usage is decreased to about 1/4-1/5 of previous usage
 617         Now, the internal data structure and algorithm are both
 618         re-organized, so that intermediate data wouldn't be saved in
 619         memory. Intead they will be calculated on the fly. New MACS2 will
 620         spend longer time (1.5 to 2 times) however it will use less memory
 621         so can be more usable on small mem servers.
 622
 623         * --seed option is added to callpeak and randsample commands
 624         Thank Mathieu Gineste for this suggestion!
 625
 626 2013-03-05  Tao Liu  <vladimir.liu@gmail.com>
 627         MACS version 2.0.10 20130306 (tag:alpha)
 628
 629         * diffpeak module New module to detect differential binding sites
 630         with more statistics.
 631
 632         * Introduced --refine-peaks
 633         Calculates reads balancing to refine peak summits
 634
 635         * Ouput file names prefix
 636         Correct encodePeak to narrowPeak, broadPeak to bed12.
 637
 638 2012-09-13 Benjamin Schiller <benjamin.schiller@ucsf.edu>,  Tao Liu  <taoliu@jimmy.harvard.edu>
 639         MACS version 2.0.10 (tag:alpha not released)
 640
 641         * Introduced BAMPEParser
 642         Reads PE data directly, requires bedtools for now
 643
 644         * Introduced --call-summits
 645         Uses signal processing methods to call overlapping peaks
 646
 647         * Added --no-trackline
 648         By default, files have descriptive tracklines now
 649
 650         * new refinepeak command (experimental)
 651         This new function will use a similar method in SPP (wtd), to
 652         analyze raw tag distribution in peak region, then redefine the
 653         peak summit where plus and minus tags are evenly distributed
 654         around.
 655
 656         * Changes to output *
 657         cPeakDetect.pyx has full support for new print/write methods and
 658         --call-peaks, BAMPEParser, and use of paired-end data
 659
 660         * Parser optimization
 661
 662         cParser.pyx is rewritten to use io.BufferedReader to speed
 663         up. Speed is doubled.
 664
 665         Code is reorganized -- most of functions are inherited from
 666         GenericParser class.
 667
 668         * Use cross-correlation to calculate fragment size
 669
 670         First, all pairs will be used in prediction for fragment
 671         size. Previously, only no more than 1000 pairs are used. Second,
 672         cross-correlation is used to find the best phase difference
 673         between + and - tag pileups.
 674
 675         * Speed up p-value and q-value calculation
 676
 677         This part is ten times faster now. I am using a dictionary to
 678         cache p-value results from Poisson CDF function. A bit more memory
 679         will be used to increase speed. I hope this dictionary would not
 680         explode since the possible pairs of ChIP signal and control lambda
 681         are hugely redundant. Also, I rewrited part of q-value
 682         calculation.
 683
 684         * Speed up peak detection
 685
 686         This part is about hundred of times faster now.  Optimizations
 687         include using Numpy functions as much as possible, and making loop
 688         body as small as possible.
 689
 690         * Post-processing on differential calls
 691
 692         After macs2diff finds differential binding sites between two
 693         conditions, it will try to annotate the peak calls from one of two
 694         conditions, describe the changes ...
 695
 696         * Fragment size prediction in macs2diff
 697
 698         Now by default, macs2diff will try to use the average fragment
 699         size from both condition 1 and condition 2 for tag extension and
 700         peak calling. Previously, by default, it will use different sizes
 701         unless --nomodel is specified.
 702
 703         Technically, I separate model building processes out. So macs2diff
 704         will build fragment sizes for condition 1 and 2 in parallel (2
 705         processes maximum), then perform 4-way comparisons in parallel (4
 706         processes maximum).
 707
 708         * Diff score
 709
 710         Combine two p/qscore tracks together. At regions where condition 1
 711         is higher than condition 2, score would be positive, otherwise,
 712         negative.
 713
 714         * SAMParser and BAMParser
 715
 716         Bug fixed for paired-end sequencing data.
 717
 718         * BedGraph.pyx
 719
 720         Fixed a bug while calling peaks from BedGraph file. It previously
 721         mistakenly output same peaks multiple times at the end of
 722         chromosome.
 723
 724 2011-11-2  Tao Liu  <taoliu@jimmy.harvard.edu>
 725         MACS version 2.0.9 (tag:alpha)
 726
 727         * Auto fixation on predicted d is turned off by default!
 728
 729         Previous --off-auto is now default. MACS will not automatically
 730         fix d less than 2 times of tag size according to
 731         --shiftsize. While tag size is getting longer nowadays, it would
 732         be easier to have d less than 2 times of tag size, however d may
 733         still be meaningful and useful. Please judge it using your own
 734         wisdom.
 735
 736         * Scaling issue
 737
 738         Now, the default scaling while treatment and input are unbalanced
 739         has been adjusted. By default, larger sample will be scaled down
 740         linearly to match the smaller sample. In this way, background
 741         noise will be reduced more than real signals, so we expect to have
 742         more specific results than the other way around (i.e. --to-large
 743         is set).
 744
 745         Also, an alternative option to randomly sample larger data
 746         (--down-sample) is provided to replace default linear
 747         scaling. However, this option will cause results irresproducible,
 748         so be careful.
 749
 750         * randsample script
 751
 752         A new script 'randsample'  is added, which can randomly sample
 753         certain percentage or number of tags.
 754
 755         * Peak summit
 756
 757         Now, MACS will decide peak summits according to pileup height
 758         instead of qvalue scores. In this way, the summit may be more
 759         accurate.
 760
 761         * Diff score
 762
 763         MACS calculate qvalue scores as differential scores. When compare
 764         two conditions (saying A and B), the maximum qscore for comparing
 765         A to B -- maxqscore_a2b, and for comparing B to A --maxqscore_b2a
 766         will be computed. If maxqscore_a2b is bigger, the diff score is
 767         +maxqscore_a2b, otherwise, diff score is -1*maxqscore_b2a.
 768
 769 2011-09-15  Tao Liu  <taoliu@jimmy.harvard.edu>
 770         MACS version 2.0.8 (tag:alpha)
 771
 772         * bin/macs2, bin/bdgbroadcall, MACS2/IO/cScoreTrack.pyx, MACS2/IO/cBedGraph.pyx
 773
 774         New script bdgbroadcall and the extra option '--broad' for macs2
 775         script, can be used to call broad regions with a loose cutoff to
 776         link nearby significant regions. The output is represented as
 777         BED12 format.
 778
 779         * MACS2/IO/cScoreTrack.pyx
 780
 781         Fix q-value calculation to generate forcefully monotonic values.
 782
 783         * bin/eland*2bed, bin/sam2bed and bin/filterdup
 784
 785         They are combined to one more powerful script called
 786         "filterdup". The script filterdup can filter duplicated reads
 787         according to sequencing depth and genome size. The script can also
 788         convert any format supported by MACS to BED format.
 789
 790 2011-08-21  Tao Liu  <taoliu@jimmy.harvard.edu>
 791         MACS version 2.0.7 (tag:alpha)
 792
 793         * bin/macsdiff renamed to bin/bdgdiff
 794
 795         Now this script will work as a low-level finetuning tool as bdgcmp
 796         and bdgpeakcall.
 797
 798         * bin/macs2diff
 799
 800         A new script to take treatment and control files from two
 801         condition, calculate fragment size, use local poisson to get
 802         pvalues and BH process to get qvalues, then combine 4-ways result
 803         to call differential sites.
 804
 805         This script can use upto 4 cpus to speed up 4-ways calculation. (
 806         I am trying multiprocessing in python. )
 807
 808         * MACS2/Constants.py, MACS2/IO/cBedGraph.pyx,
 809         MACS2/IO/cScoreTrack.pyx, MACS2/OptValidator.py,
 810         MACS2/PeakModel.py, MACS2/cPeakDetect.pyx
 811
 812         All above files are modified for the new macs2diff script.
 813
 814         * bin/macs2, bin/macs2diff, MACS2/OptValidator.py
 815
 816         Now q-value 0.01 is the default cutoff. If -p is specified,
 817         p-value cutoff will be used instead.
 818
 819 2011-07-25  Tao Liu  <vladimir.liu@gmail.com>
 820         MACS version 2.0.6 (tag:alpha)
 821
 822         * bin/macsdiff
 823
 824         A script to call differential regions. A naive way is introduced
 825         to find the regions where:
 826
 827         1. signal from condition 1 is larger than input 1 and condition 2 --
 828         unique region in condition 1;
 829         2. signal from condition 2 is larger than input 2 and condition 1
 830         -- unique region in condition 2;
 831         3. signal from condition 1 is larger than input 1, signal from
 832         condition 2 is larger than input 2, however either signal from
 833         condition 1 or 2 is not larger than the other.
 834
 835         Here 'larger' means the pvalue or qvalue from a Poisson test is
 836         under certain cutoff.
 837
 838         (I will make another script to wrap up mulitple scripts for
 839         differential calling)
 840
 841 2011-07-07  Tao Liu  <vladimir.liu@gmail.com>
 842         MACS version 2.0.5 (tag:alpha)
 843
 844         * bin/macs2, MACS2/cPeakDetect.py, MACS2/IO/cScoreTrack.pyx,
 845         MACS2/IO/cPeakIO.pyx
 846
 847         Use hash to store peak information. Add back the feature to deal
 848         with data without control.
 849
 850         Fix bug which incorrectly allows small peaks at the end of
 851         chromosomes.
 852
 853         * bin/bdgpeakcall, bin/bdgcmp
 854
 855         Fix bugs. bdgpeakcall can output encodePeak format.
 856
 857 2011-06-22  Tao Liu  <taoliu@jimmy.harvard.edu>
 858         MACS version 2.0.4 (tag:alpha)
 859
 860         * cPeakDetect.py
 861
 862         Fix a bug, correctly assign lambda_bg while --to-small is
 863         set. Thanks Junya Seo!
 864
 865         Add rank and num of bp columns to pvalue-qvalue table.
 866
 867         * cScoreTrack.py
 868
 869         Fix bugs to correctly deal with peakless chromosomes. Thanks
 870         Vaibhav Jain!
 871
 872         Use AFDR for independent tests instead.
 873
 874         * encodePeak
 875
 876         Now MACS can output peak coordinates together with pvalue, qvalue,
 877         summit positions in a single encodePeak format (designed for
 878         ENCODE project) file. This file can be loaded to UCSC
 879         browser. Definition of some specific columns are: 5th:
 880         int(-log10pvalue*10), 7th: fold-change, 8th: -log10pvalue, 9th:
 881         -log10qvalue, 10th: relative summit position to peak start.
 882
 883
 884 2011-06-19  Tao Liu  <taoliu@jimmy.harvard.edu>
 885         MACS version 2.0.3 (tag:alpha)
 886
 887         * Rich output with qvalue, fold enrichment, and pileup height
 888
 889         Calculate q-values using a refined Benjamini–Hochberg–Yekutieli
 890         procedure:
 891
 892         http://en.wikipedia.org/wiki/False_discovery_rate#Dependent_tests
 893
 894         Now we have a similiar xls output file as before. The differences
 895         from previous file are:
 896
 897         1. Summit now is absolute summit, instead of relative summit
 898            position;
 899         2. 'Pileup' is previous 'tag' column. It's the extended fragment
 900            pileup at the peak summit;
 901         3. We now use '-log10(pvalue)' instead of '-10log10(pvalue)', so
 902            5.00 means 1e-5, simple and less confusing.
 903         4. FDR column becomes '-log10(qvalue)' column.
 904         5. The pileup, -log10pvalue, fold_enrichment and -log10qvalue are
 905            the values at the peak summit.
 906
 907         * Extra output files
 908
 909         NAME_pqtable.txt contains pvalue and qvalue relationships.
 910
 911         NAME_treat_pvalue.bdg and NAME_treat_qvalue.bdg store -log10pvalue
 912         and -log10qvalue scores in BedGraph format. Nearby regions with
 913         the same value are not merged.
 914
 915         * Separation of FeatIO.py
 916
 917         Its content has been divided into cPeakIO.pyx, cBedGraph.pyx, and
 918         cFixWidthTrack.pyx. A modified bedGraphTrackI class was
 919         implemented to store pileup, local lambda, pvalue, and qvalue
 920         alltogether in cScoreTrack.pyx.
 921
 922         * Experimental option --half-ext
 923
 924         Suggested by NPS algorithm, I added an experimental option
 925         --half-ext to let MACS only extends ChIP fragment around its
 926         middle point for only 1/2 d.
 927
 928 2011-06-12  Tao Liu  <taoliu@jimmy.harvard.edu>
 929         MACS version 2.0.2 (tag:alpha)
 930
 931         * macs2
 932
 933         Add an error check to see if there is no common chromosome names
 934         from treatment file and control file
 935
 936         * cPeakDetect.pyx, cFeatIO.pyx, cPileup.pyx
 937
 938         Reduce memory usage by removing deepcopy() calls.
 939
 940         * Modify README documents and others.
 941
 942 2011-05-19  Tao Liu  <taoliu@jimmy.harvard.edu>
 943         MACS Version 2.0.1 (tag:alpha)
 944
 945         * cPileup.pyx, cPeakDetect.pyx and peak calling process
 946
 947         Jie suggested me a brilliant simple method to pileup fragments
 948         into bedGraph track. It works extremely faster than the previous
 949         function, i.e, faster than MACS1.3 or MACS1.4. So I can include
 950         large local lambda calculation in MACSv2 now. Now I generate three
 951         bedGraphs for d-size local bias, slocal-size and llocal-size local
 952         bias, and calculate the maximum local bias as local lambda
 953         bedGraph track.
 954
 955         Minor: add_loc in bedGraphTrackI now can correctly merge the
 956         region with its preceding region if their value are the same.
 957
 958         * macs2
 959
 960         Add an option to shift control tags before extension. By default,
 961         control tags will be extended to both sides regardless of strand
 962         information.
 963
 964 2011-05-17  Tao Liu  <taoliu@jimmy.harvard.edu>
 965         MACS Version 2.0.0 (tag:alpha)
 966
 967         * Use bedGraph type to store data internally and externally.
 968
 969         We can have theoretically one-basepair resolution profiles. 10
 970         times smaller in filesize and even smaller after converting to
 971         bigWig for visualization.
 972
 973         * Peak calling process modified. Better peak boundary detection.
 974
 975         Extend ChIP tag to d, and pileup to have a ChIP bedGraph. Extend
 976         Control tag to d and 1,000bp, and pileup to two bedGraphs. (1000bp
 977         one will be averaged to d size) Then calculate the maximum value
 978         of these two tracks and a global background, to have a
 979         local-lambda bedGraph.
 980
 981         Use -10log10poisson_pvalue as scores to generate a score track
 982         before peak calling.
 983
 984         A general peak calling based on a score cutoff, min length of peak
 985         and max gap between nearby peaks.
 986
 987         * Option changes.
 988
 989         Wiggle file output is removed. Now we only support bedGraph
 990         output. The generation of bedGraph is highly recommended since it
 991         will not cost extra time. In other words, bedGraph generation is
 992         internally run even you don't want to save bedGraphs on disk, due
 993         to the peak calling algorithm in MACS v2.
 994
 995         * cProb.pyx
 996
 997         We now can calculate poisson pvalue in log space so that the score
 998         (-10*log10pvalue) will not have a upper limit of 3100 due to
 999         precision of float number.
1000
1001         * Cython is adopted to speed up Python code.
1002
1003 2011-02-28  Tao Liu  <taoliu@jimmy.harvard.edu>
1004         Small fixes
1005
1006         * Replaced with a newest WigTrackI class and fixed the wignorm script.
1007
1008 2011-02-21  Tao Liu  <taoliu@jimmy.harvard.edu>
1009         Version 1.4.0rc2 (Valentine)
1010
1011         * --single-wig option is renamed to --single-profile
1012
1013         * BedGraph output with --bdg or -B option.
1014
1015         The BedGraph output provides 1bp resolution fragment pileup
1016         profile. File size is smaller than wig file. This option can be
1017         combined with --single-profile option to produce a bedgraph file
1018         for the whole genome. This option can also make --space,
1019         --call-subpeaks invalid.
1020
1021         * Fix the description of --shiftsize to correctly state that the
1022         value is 1/2 d (fragment size).
1023
1024         * Fix a bug in the call to __filter_w_control_tags when control is
1025         not available.
1026
1027         * Fix a bug on --to-small option. Now it works as expected.
1028
1029         * Fix a bug while counting the tags in candidate peak region, an
1030         extra tag may be included. (Thanks to Jake Biesinger!)
1031
1032         * Fix the bug for the peaks extended outside of chromosome
1033         start. If the minus strand tag goes outside of chromosome start
1034         after extension of d, it will be thrown out.
1035
1036         * Post-process script for a combined wig file:
1037
1038         The "wignorm" command can be called after a full run of MACS14 as
1039         a postprocess. wignorm can calculate the local background from the
1040         control wig file from MACS14, then use either foldchange,
1041         -10*log10(pvalue) from possion test, or difference after asinh
1042         transformation as the score to build a single wig track to
1043         represent the binding strength. This script will take a
1044         significant long time to process.
1045
1046         * --wigextend has been obsoleted.
1047
1048 2010-09-21  Tao Liu  <taoliu@jimmy.harvard.edu>
1049         Version 1.4.0rc1 (Starry Sky)
1050
1051         * Duplicate reads option
1052
1053         --keep-dup behavior is changed. Now user can specify how many
1054         reads he/she wants to keep at the same genomic location. 'auto' to
1055         let MACS decide the number based on binomial distribution, 'all'
1056         to let MACS keep all reads.
1057
1058         * pvalue and FDR fixes (Thanks to Prof. Zhiping Weng)
1059
1060         By default, MACS will now scale the smaller dataset to the bigger
1061         dataset. For instance, if IP has 10 million reads, and Input has 5
1062         million, MACS will double the lambda value calculated from Input
1063         reads while calling BOTH the positive peaks and negative
1064         peaks. This will address the issue caused by unbalanced numbers of
1065         reads from IP and Input. If --to-small is turned on, MACS will
1066         scale the larger dataset to the smaller one. So from now on, if d
1067         is fixed, then the peaks from a MACS call for A vs B should be
1068         identical to the negative peaks from a B vs A.
1069
1070 2010-09-01  Tao Liu  <taoliu@jimmy.harvard.edu>
1071         Version 1.4.0beta (summer wishes)
1072
1073         * New features
1074
1075         ** Model building
1076
1077         The default behavior in the model building step is slightly
1078         changed. When MACS can't find enough pairs to build model
1079         (implemented in alpha version) or the modeled fragment length is
1080         less than 2 times of tag length (implemented in beta version),
1081         MACS will use 2 times of --shiftsize value as fragment length in
1082         the later analysis. --off-auto can turn off this default behavior.
1083
1084         ** Redundant tag filtering
1085
1086         The IO module is rewritten. The redundant tag filtering process
1087         becomes simpler and works as promise. The maximum allowed number
1088         of tags at the exact same location is calculated from the
1089         sequencing depth and genome size using a binomial distribution,
1090         for both TREAMENT and CONTROL separately. ( previously only
1091         TREATMENT is considered ) The exact same location means the same
1092         coordination and the same strand. Then MACS will only keep at most
1093         this number of tags at the exact same location in the following
1094         analysis. An option --keep-dup can let MACS skip the filtering and
1095         keep all the tags. However this may bring in a lot of sequencing
1096         bias, so you may get many false positive peaks.
1097
1098         ** Single wiggle mode
1099
1100         First thing to mention, this is not the score track that I
1101         described before. By default, MACS generates wiggle files for
1102         fragment pileup for every chromosomes separately. When you use
1103         --single-wig option, MACS will generate a single wiggle file for
1104         all the chromosomes so you will get a wig.gz for TREATMENT and
1105         another wig.gz for CONTROL if available.
1106
1107         ** Sniff -- automatic format detection
1108
1109         Now, by default or "-f AUTO", MACS will decide the input file
1110         format automatically. Technically, it will try to read at most
1111         1000 records for the first 10 non-comment lines. If it succeeds,
1112         the format is decided. I recommend not to use AUTO and specify the
1113         right format for your input files, unless you combine different
1114         formats in a single MACS run.
1115
1116         * Options changes
1117
1118         --single-wig and --keep-dup are added. Check previous section in
1119         ChangeLog for detail.
1120
1121         -f (--format) AUTO is now the default option.
1122
1123         --slocal default: 1000
1124         --llocal default: 10000
1125
1126         * Bug fixed
1127
1128         Setup script will stop the installation if python version is not
1129         python2.6 or python2.7.
1130
1131         Local lambda calculation has been changed back. MACS will check
1132         peak_region, slocal( default 1K) and llocal (default 10K) for the
1133         local bias. The previous 200bps default will cause MACS misses
1134         some peaks where the input bias is very sharp.
1135
1136         sam2bed.py script is corrected.
1137
1138         Relative pos in xls output is fixed.
1139
1140         Parser for ELAND_export is fixed to pass some of the no match
1141         lines. And elandexport2bed.py is fixed too. ( however I can't
1142         guarantee that it works on any eland_export files. )
1143
1144 2010-06-04  Tao Liu  <taoliu@jimmy.harvard.edu>
1145         Version 1.4.0alpha2 (be smarter)
1146
1147         * Options changes
1148
1149         --gsize now provides shortcuts for common genomes, including
1150         human, mouse, C. elegans and fruitfly.
1151
1152         --llocal now will be 5000 bps if there is no input file, so that
1153         local lambda doesn't overkill enriched binding sites.
1154
1155 2010-06-02  Tao Liu  <taoliu@jimmy.harvard.edu>
1156         Version 1.4alpha (be smarter)
1157
1158         * Options changes
1159
1160         --tsize option is redesigned. MACS will use the first 10 lines of
1161         the input to decide the tag size. If user specifies --tsize, it
1162         will override the auto decided tsize.
1163
1164         --lambdaset is replaced by --slocal and --llocal which mean the
1165         small local region and large local region.
1166
1167         --bw has no effect on the scan-window size now. It only affects the
1168         paired-peaks model process.
1169
1170         * Model building
1171
1172         During the model building, MACS will pick out the enriched regions
1173         which are not too high and not too low to build the paired-peak
1174         model. Default the region is from fold 10 to fold 30. If MACS
1175         fails to build the model, by default it will use the nomodel
1176         settings, like shiftsize=100bps, to shift and extend each
1177         tags. This behavior can be turned off by '--off-auto'.
1178
1179         * Output files
1180
1181         An extra file including all the summit positions are saved in
1182         *_summits.bed file. An option '--call-subpeaks' will invoke
1183         PeakSplitter developed by Mali Salmon to split wide peaks into
1184         smaller subpeaks.
1185
1186         * Sniff ( will in beta )
1187
1188         Automatically recognize the input file format, so use can combine
1189         different format in one MACS run.
1190
1191         Not implemented features/TODO:
1192
1193         * Algorithms ( in near future? )
1194
1195         MACS will try to refine the peak boundaries by calculating the
1196         scores for every point in the candidate peak regions. The score
1197         will be the -10*log(10,pvalue) on a local poisson distribution. A
1198         cutoff specified by users (--pvalue) will be applied to find the
1199         precise sub-peaks in the original candidate peak region. Peak
1200         boudaries and peak summits positions will be saved in separate BED
1201         files.
1202
1203         * Single wiggle track ( in near future? )
1204
1205         A single wiggle track will be generated to save the scores within
1206         candidate peak regions in the 10bps resolution. The wiggle file
1207         is in fixedStep format.
1208
1209
1210 2009-10-16  Tao Liu  <taoliu@jimmy.harvard.edu>
1211         Version 1.3.7.1 (Oktoberfest, bug fixed #1)
1212
1213         * bin/Constants.py
1214
1215         Fixed typo. FCSTEP -> FESTEP
1216
1217         * lib/PeakDetect.py
1218
1219         The 'femax' attribute bug is fixed
1220
1221 2009-10-02  Tao Liu  <taoliu@jimmy.harvard.edu>
1222         Version 1.3.7 (Oktoberfest)
1223
1224         * bin/macs, lib/PeakDetect.py, lib/IO/__init__.py, lib/OptValidator.py
1225
1226         Enhancements by Peter Chines:
1227
1228         1. gzip files are supported.
1229         2. when --diag is on, user can set the increment and endpoint for
1230         fold enrichment analysis by setting --fe-step and --fe-max.
1231
1232         Enhancements by Davide Cittaro:
1233
1234         1. BAM and SAM formats are supported.
1235         2. small changes in the header lines of wiggle output.
1236
1237         Enhancements by Me:
1238         1. I added --fe-min option;
1239         2. Bowtie ascii output with suffix ".map" is supported.
1240
1241         Bug fixed:
1242
1243         1. --nolambda bug is fixed. ( reported by Martin in JHU )
1244         2. --diag bug is fixed. ( reported by Bogdan Tanasa )
1245         3. Function to remove suffix '.fa' is fixed. ( reported by Jeff Johnston )
1246         4. Some "fold change" have been changed to "fold enrichment".
1247
1248 2009-06-10  Tao Liu  <taoliu@jimmy.harvard.edu>
1249         Version 1.3.6.1 (default parameter change)
1250
1251         * bin/macs, lib/PeakDetect.py
1252
1253         "--oldfdr" is removed. The 'oldfdr' behaviour becomes
1254         default. "--futurefdr" is added which can turn on the 'new' method
1255         introduced in 1.3.6. By default it's off.
1256
1257         * lib/PeakDetect.py
1258
1259         Fixed a bug. p-value is corrected a little bit.
1260
1261
1262 2009-05-11  Tao Liu  <taoliu@jimmy.harvard.edu>
1263         Version 1.3.6 (Birthday cake)
1264
1265         * bin/macs
1266
1267         "track name" is added to the header of BED output file.
1268
1269         Now the default peak detection method is to consider 5k and 10k
1270         nearby regions in treatment data and peak location, 1k, 5k, and
1271         10k regions in control data to calculate local bias. The old
1272         method can be called through '--old' option.
1273
1274         Information about how many total/unique tags in treatment or
1275         control will be saved in final .xls output.
1276
1277         * lib/IO/__init__.py
1278
1279         ".fa" will be removed from input tag alignment so only the
1280         chromosome names are kept.
1281
1282         WigTrackI class is added for Wiggle like data structure. (not used
1283         now)
1284
1285         The parser for ELAND multi PET files has been fixed. Now the 5'
1286         tag position for a pair will be kept, whereas in the previous
1287         version, the middle points are kept.
1288
1289         * lib/IO/BinKeeper.py
1290
1291         BinKeeperI class is inspired by Jim Kent's library for UCSC genome
1292         browser, which can quickly access certain region for values in a
1293         large wiggle like data file. (not used now)
1294
1295         * lib/OptValidator.py
1296
1297         typo fixed.
1298
1299         * lib/PeakDetect.py
1300
1301         Now the default peak detection method is to consider 5k and 10k
1302         nearby regions in treatment data and peak location, 1k, 5k, and
1303         10k regions in control data to calculate local bias. The old
1304         method can be called through '--old' option.
1305
1306         Two columns have beed added to BED output file. 4th column: peak
1307         name; 5th column: peak score using -10log(10,pvalue) as score.
1308
1309         * setup.py
1310
1311         Add support to build a Mac App through 'setup.py py2app', or a
1312         Windows executable through 'setup.py py2exe'. You need to install
1313         py2app or py2exe package in order to use these functions.
1314
1315 2009-02-12  Tao Liu  <taoliu@jimmy.harvard.edu>
1316         Version 1.3.5 (local lambda fixed, typo fixed, model figure improved)
1317
1318         * PeakDetect.py
1319
1320         Now, besides 1k, 5k, 10k, MACS will also consider peak size region
1321         in control data to calculate local lambda for each peak. Peak
1322         calling results will be slightly different with previous version,
1323         beware!
1324
1325         * OptValidator.py
1326
1327         Typo fixed, ELANDParser -> ELANDResultParser
1328
1329         * OutputWriter.py
1330
1331         Now, modeled d value will be shown on the model figure.
1332
1333 2009-01-06  Tao Liu  <taoliu@jimmy.harvard.edu>
1334         Version 1.3.4 (Happy New Year Version, bug fixed, ELAND multi/PET support)
1335
1336         * macs, IO/__init__.py, PeakDetect.py
1337
1338         Add support for ELAND multi format. Add support for Pair-End
1339         experiment, in this case, 5'end and 3'end ELAND multi format files
1340         are required for treatment or control data. See 00README file for
1341         detail.
1342
1343         Add wigextend option.
1344
1345         Add petdist option for Pair-End Tag experiment, which is the best
1346         distance between 5' and 3' tags.
1347
1348         * PeakDetect.py
1349
1350         Fixed a bug which cause the end positions of every peak region
1351         incorrectly added by 1 bp. ( Thanks Mali Salmon!)
1352
1353         * OutputWriter.py
1354
1355         Fix bugs while generating wiggle files. The start position of
1356         wiggle file is set to 1 instead of 0.
1357
1358         Fix a bug that every 10M bps, signals in the first 'd' range are
1359         lower than actual. ( Thanks Mali Salmon!)
1360
1361
1362 2008-12-03  Tao Liu  <taoliu@jimmy.harvard.edu>
1363         Version 1.3.3 (wiggle bugs fixed)
1364
1365         * OutputWriter.py
1366
1367         Fix bugs while generating wiggle files. 1. 'span=' is added to
1368         'variableStep' line; 2. previously, every 10M bps, the coordinates
1369         were wrongly shifted to the right for 'd' basepairs.
1370
1371         * macs, PeakDetect.py
1372
1373         Add an option to save wiggle files on different resolution.
1374
1375 2008-10-02  Tao Liu  <taoliu@jimmy.harvard.edu>
1376         Version 1.3.2 (tiny bugs fixed)
1377
1378         * IO/__init__.py
1379
1380         Fix 65536 -> 65535. ( Thank Joon)
1381
1382         * Prob.py
1383
1384         Improved for binomial function with extra large number. Imported
1385         from Cistrome project.
1386
1387         * PeakDetect.py
1388
1389         If treatment channel misses reads in some chromosome included in
1390         control channel, or vice versa, MACS will not exit. (Thank Shaun
1391         Mahony)
1392
1393         Instead, MACS will fake a tag at position -1 when calling
1394         treatment peaks vs control, but will ignore the chromosome while
1395         calling negative peaks.
1396
1397 2008-09-04  Tao Liu  <taoliu@jimmy.harvard.edu>
1398         Version 1.3.1 (tiny bugs fixed version)
1399
1400         * Prob.py
1401
1402         Hyunjin Gene Shin contributed some codes to Prob.py. Now the
1403         binomial functions can tolerate large and small numbers.
1404
1405         * IO/__init__.py
1406
1407         Parsers now split lines in BED/ELAND file using any
1408         whitespaces. 'track' or 'browser' lines will be regarded as
1409         comment lines. A bug fixed when throwing StrandFormatError. The
1410         maximum redundant tag number at a single position can be no less
1411         than 65536.
1412
1413
1414 2008-07-15  Tao Liu  <taoliu@jimmy.harvard.edu>
1415         Version 1.3 (naming clarification version)
1416
1417         * Naming clarification changes according to our manuscript:
1418
1419         'frag_len' is changed to 'd'.
1420
1421         'fold_change' is changed to 'fold_enrichment'.
1422
1423         Suggest '--bw' parameter to be determined by users from the real
1424         sonication size.
1425
1426         Maximum FDR is 100% in the output file.
1427
1428         And other clarifications in 00README file and the documents on the
1429         website.
1430
1431         * IO/__init__.py
1432         If the redundant tag number at a single position is over 32767,
1433         just remember 32767, instead of raising an overflow exception.
1434
1435         * setup.py
1436         fixed a typo.
1437
1438         * PeakDetect.py
1439         Bug fixed for diagnosis report.
1440
1441
1442 2008-07-10  Tao Liu  <taoliu@jimmy.harvard.edu>
1443         Version 1.2.2gamma
1444
1445         * Serious bugs fix:
1446
1447         Poisson distribution CDF and inverse CDF functions are
1448         corrected. They can produce right results even for huge lambda
1449         now. So that the p-value and FDR values in the final excel sheet
1450         are corrected.
1451
1452         IO package now can tolerate some rare cases; ELANDParser in IO
1453         package is fixed. (Thank Bogdan)
1454
1455         * Improvement:
1456
1457         Reverse paired peaks in model are rejected. So there will be no
1458         negative 'frag_len'. (Thank Bogdan)
1459
1460         * Features added:
1461
1462         Diagnosis function is completed. Which can output a table file for
1463         users to estimate their sequencing depth.
1464
1465
1466 2008-06-30  Tao Liu  <taoliu@jimmy.harvard.edu>
1467         Version 1.2
1468
1469         * Probe.py is added!
1470
1471         GSL is totally removed from MACS. Instead, I have implemented the
1472         CDF and inverse CDF for poisson and binomial distribution purely
1473         in python.
1474
1475         * Constants.py is added!
1476
1477         Organize constants used in MACS in the Constants.py file.
1478
1479         * All other files are modified!
1480
1481         Foldchange calculation is modified. Now the foldchange only be
1482         calculated at the peak summit position instead of the whole peak
1483         region. The values will be higher and more robust than before.
1484
1485         Features added:
1486
1487         1. MACS can save wiggle format files containing the tag number at
1488         every 10 bp along the genome. Tags are shifted according to our
1489         model before they are calculated.
1490
1491         2. Model building and local lambda calculation can be skipped with
1492         certain options.
1493
1494         3. A diagnosis report can be generated through '--diag'
1495         option. This report can help you get an assumption about the
1496         sequencing saturation. This funtion is only in beta stage.
1497
1498         4. FDR calculation speed is highly improved.
1499
1500 2008-05-28  Tao Liu  <taoliu@jimmy.harvard.edu>
1501         Version 1.1
1502
1503         * TabIO, PeakModel.py ...
1504         Bug fixed to let MACS tolerate some cases while there is no tag on
1505         either plus strand or minus strand.
1506
1507         * setup.py
1508         Check the version of python. If the version is lower than 2.4,
1509         refuse to install with warning.
1510
1511
1512 2013-07-31  Tao Liu  <vladimir.liu@gmail.com>
1513         MACS version 2.0.10 20130731 (tag:alpha)
1514
1515         * callpeak --call-summits
1516
1517         Fix bugs causing callpeak --call-summits option generating extra
1518         number of peaks and inconsistent peak boundaries comparing to
1519         default option. Thank Ben Levinson!
1520
1521         * bdgcmp output
1522
1523         Fix bugs causing bdgcmp output logLR all in positive values. Now
1524         'depletion' can be correctly represented as negative values.
1525
1526         * bdgdiff
1527
1528         Fix the behavior of bdgdiff module. Now it can take four
1529         bedGraph files, then use logLR as cutoff to call differential
1530         regions. Check command line of bdgdiff for detail.
1531
1532 2013-07-13  Tao Liu  <vladimir.liu@gmail.com>
1533         MACS version 2.0.10 20130713 (tag:alpha)
1534
1535         * fix bugs while output broadPeak and gappedPeak.
1536
1537         Note. Those weak broad regions without any strong enrichment
1538         regions inside won't be saved in gappedPeak file.
1539
1540         * bdgcmp -T and -C are merged into -S and description is updated.
1541
1542         Now, you can use it to override SPMR values in your input for
1543         bdgcmp. To use SPMR (from 'callpeak --SPMR -B') while calculating
1544         statistics will cause weird results ( in most cases, lower
1545         significancy), and won't be consistent with MACS2 callpeak
1546         behavior. So if you have SPMR bedGraphs, input the smaller/larger
1547         sample size in MILLION according to 'callpeak --to-large' option.
1548
1549 2013-07-10  Tao Liu  <vladimir.liu@gmail.com>
1550         MACS version 2.0.10 20130710 (tag:alpha)
1551
1552         * fix BED style output format of callpeak module:
1553
1554         1) without --broad: narrowPeak (BED6+4) and BED for summit will be
1555         the output. Old BED format file won't be saved.
1556
1557         2) with --broad: broadPeak (BED6+3) for broad region and
1558         gappedPeak (BED12+3) for chained enriched regions will be the
1559         output. Old BED format, narrowPeak format, summit file won't be
1560         saved.
1561
1562         * bdgcmp now can accept list of methods to calculate scores. So
1563         you can run it once to generate multiple types of scores. Thank
1564         Jon Urban for this suggestion!
1565
1566         * C codes are re-generated through Cython 0.19.1.
1567
1568 2013-05-21  Tao Liu  <vladimir.liu@gmail.com>
1569         MACS version 2.0.10 20130520 (tag:alpha)
1570
1571         * broad peak calling modules are modified in order to report all
1572         relexed regions even there is no strong enrichment inside.
1573
1574 2013-05-01  Tao Liu  <vladimir.liu@gmail.com>
1575         MACS version 2.0.10 20130501 (tag:alpha)
1576
1577         * Memory usage is decreased to about 1/4-1/5 of previous usage
1578         Now, the internal data structure and algorithm are both
1579         re-organized, so that intermediate data wouldn't be saved in
1580         memory. Intead they will be calculated on the fly. New MACS2 will
1581         spend longer time (1.5 to 2 times) however it will use less memory
1582         so can be more usable on small mem servers.
1583
1584         * --seed option is added to callpeak and randsample commands
1585         Thank Mathieu Gineste for this suggestion!
1586
1587 2013-03-05  Tao Liu  <vladimir.liu@gmail.com>
1588         MACS version 2.0.10 20130306 (tag:alpha)
1589
1590         * diffpeak module New module to detect differential binding sites
1591         with more statistics.
1592
1593         * Introduced --refine-peaks
1594         Calculates reads balancing to refine peak summits
1595
1596         * Ouput file names prefix
1597         Correct encodePeak to narrowPeak, broadPeak to bed12.
1598
1599 2012-09-13 Benjamin Schiller <benjamin.schiller@ucsf.edu>,  Tao Liu  <taoliu@jimmy.harvard.edu>
1600         MACS version 2.0.10 (tag:alpha not released)
1601
1602         * Introduced BAMPEParser
1603         Reads PE data directly, requires bedtools for now
1604
1605         * Introduced --call-summits
1606         Uses signal processing methods to call overlapping peaks
1607
1608         * Added --no-trackline
1609         By default, files have descriptive tracklines now
1610
1611         * new refinepeak command (experimental)
1612         This new function will use a similar method in SPP (wtd), to
1613         analyze raw tag distribution in peak region, then redefine the
1614         peak summit where plus and minus tags are evenly distributed
1615         around.
1616
1617         * Changes to output *
1618         cPeakDetect.pyx has full support for new print/write methods and
1619         --call-peaks, BAMPEParser, and use of paired-end data
1620
1621         * Parser optimization
1622
1623         cParser.pyx is rewritten to use io.BufferedReader to speed
1624         up. Speed is doubled.
1625
1626         Code is reorganized -- most of functions are inherited from
1627         GenericParser class.
1628
1629         * Use cross-correlation to calculate fragment size
1630
1631         First, all pairs will be used in prediction for fragment
1632         size. Previously, only no more than 1000 pairs are used. Second,
1633         cross-correlation is used to find the best phase difference
1634         between + and - tag pileups.
1635
1636         * Speed up p-value and q-value calculation
1637
1638         This part is ten times faster now. I am using a dictionary to
1639         cache p-value results from Poisson CDF function. A bit more memory
1640         will be used to increase speed. I hope this dictionary would not
1641         explode since the possible pairs of ChIP signal and control lambda
1642         are hugely redundant. Also, I rewrited part of q-value
1643         calculation.
1644
1645         * Speed up peak detection
1646
1647         This part is about hundred of times faster now.  Optimizations
1648         include using Numpy functions as much as possible, and making loop
1649         body as small as possible.
1650
1651         * Post-processing on differential calls
1652
1653         After macs2diff finds differential binding sites between two
1654         conditions, it will try to annotate the peak calls from one of two
1655         conditions, describe the changes ...
1656
1657         * Fragment size prediction in macs2diff
1658
1659         Now by default, macs2diff will try to use the average fragment
1660         size from both condition 1 and condition 2 for tag extension and
1661         peak calling. Previously, by default, it will use different sizes
1662         unless --nomodel is specified.
1663
1664         Technically, I separate model building processes out. So macs2diff
1665         will build fragment sizes for condition 1 and 2 in parallel (2
1666         processes maximum), then perform 4-way comparisons in parallel (4
1667         processes maximum).
1668
1669         * Diff score
1670
1671         Combine two p/qscore tracks together. At regions where condition 1
1672         is higher than condition 2, score would be positive, otherwise,
1673         negative.
1674
1675         * SAMParser and BAMParser
1676
1677         Bug fixed for paired-end sequencing data.
1678
1679         * BedGraph.pyx
1680
1681         Fixed a bug while calling peaks from BedGraph file. It previously
1682         mistakenly output same peaks multiple times at the end of
1683         chromosome.
1684
1685 2011-11-2  Tao Liu  <taoliu@jimmy.harvard.edu>
1686         MACS version 2.0.9 (tag:alpha)
1687
1688         * Auto fixation on predicted d is turned off by default!
1689
1690         Previous --off-auto is now default. MACS will not automatically
1691         fix d less than 2 times of tag size according to
1692         --shiftsize. While tag size is getting longer nowadays, it would
1693         be easier to have d less than 2 times of tag size, however d may
1694         still be meaningful and useful. Please judge it using your own
1695         wisdom.
1696
1697         * Scaling issue
1698
1699         Now, the default scaling while treatment and input are unbalanced
1700         has been adjusted. By default, larger sample will be scaled down
1701         linearly to match the smaller sample. In this way, background
1702         noise will be reduced more than real signals, so we expect to have
1703         more specific results than the other way around (i.e. --to-large
1704         is set).
1705
1706         Also, an alternative option to randomly sample larger data
1707         (--down-sample) is provided to replace default linear
1708         scaling. However, this option will cause results irresproducible,
1709         so be careful.
1710
1711         * randsample script
1712
1713         A new script 'randsample'  is added, which can randomly sample
1714         certain percentage or number of tags.
1715
1716         * Peak summit
1717
1718         Now, MACS will decide peak summits according to pileup height
1719         instead of qvalue scores. In this way, the summit may be more
1720         accurate.
1721
1722         * Diff score
1723
1724         MACS calculate qvalue scores as differential scores. When compare
1725         two conditions (saying A and B), the maximum qscore for comparing
1726         A to B -- maxqscore_a2b, and for comparing B to A --maxqscore_b2a
1727         will be computed. If maxqscore_a2b is bigger, the diff score is
1728         +maxqscore_a2b, otherwise, diff score is -1*maxqscore_b2a.
1729
1730 2011-09-15  Tao Liu  <taoliu@jimmy.harvard.edu>
1731         MACS version 2.0.8 (tag:alpha)
1732
1733         * bin/macs2, bin/bdgbroadcall, MACS2/IO/cScoreTrack.pyx, MACS2/IO/cBedGraph.pyx
1734
1735         New script bdgbroadcall and the extra option '--broad' for macs2
1736         script, can be used to call broad regions with a loose cutoff to
1737         link nearby significant regions. The output is represented as
1738         BED12 format.
1739
1740         * MACS2/IO/cScoreTrack.pyx
1741
1742         Fix q-value calculation to generate forcefully monotonic values.
1743
1744         * bin/eland*2bed, bin/sam2bed and bin/filterdup
1745
1746         They are combined to one more powerful script called
1747         "filterdup". The script filterdup can filter duplicated reads
1748         according to sequencing depth and genome size. The script can also
1749         convert any format supported by MACS to BED format.
1750
1751 2011-08-21  Tao Liu  <taoliu@jimmy.harvard.edu>
1752         MACS version 2.0.7 (tag:alpha)
1753
1754         * bin/macsdiff renamed to bin/bdgdiff
1755
1756         Now this script will work as a low-level finetuning tool as bdgcmp
1757         and bdgpeakcall.
1758
1759         * bin/macs2diff
1760
1761         A new script to take treatment and control files from two
1762         condition, calculate fragment size, use local poisson to get
1763         pvalues and BH process to get qvalues, then combine 4-ways result
1764         to call differential sites.
1765
1766         This script can use upto 4 cpus to speed up 4-ways calculation. (
1767         I am trying multiprocessing in python. )
1768
1769         * MACS2/Constants.py, MACS2/IO/cBedGraph.pyx,
1770         MACS2/IO/cScoreTrack.pyx, MACS2/OptValidator.py,
1771         MACS2/PeakModel.py, MACS2/cPeakDetect.pyx
1772
1773         All above files are modified for the new macs2diff script.
1774
1775         * bin/macs2, bin/macs2diff, MACS2/OptValidator.py
1776
1777         Now q-value 0.01 is the default cutoff. If -p is specified,
1778         p-value cutoff will be used instead.
1779
1780 2011-07-25  Tao Liu  <vladimir.liu@gmail.com>
1781         MACS version 2.0.6 (tag:alpha)
1782
1783         * bin/macsdiff
1784
1785         A script to call differential regions. A naive way is introduced
1786         to find the regions where:
1787
1788         1. signal from condition 1 is larger than input 1 and condition 2 --
1789         unique region in condition 1;
1790         2. signal from condition 2 is larger than input 2 and condition 1
1791         -- unique region in condition 2;
1792         3. signal from condition 1 is larger than input 1, signal from
1793         condition 2 is larger than input 2, however either signal from
1794         condition 1 or 2 is not larger than the other.
1795
1796         Here 'larger' means the pvalue or qvalue from a Poisson test is
1797         under certain cutoff.
1798
1799         (I will make another script to wrap up mulitple scripts for
1800         differential calling)
1801
1802 2011-07-07  Tao Liu  <vladimir.liu@gmail.com>
1803         MACS version 2.0.5 (tag:alpha)
1804
1805         * bin/macs2, MACS2/cPeakDetect.py, MACS2/IO/cScoreTrack.pyx,
1806         MACS2/IO/cPeakIO.pyx
1807
1808         Use hash to store peak information. Add back the feature to deal
1809         with data without control.
1810
1811         Fix bug which incorrectly allows small peaks at the end of
1812         chromosomes.
1813
1814         * bin/bdgpeakcall, bin/bdgcmp
1815
1816         Fix bugs. bdgpeakcall can output encodePeak format.
1817
1818 2011-06-22  Tao Liu  <taoliu@jimmy.harvard.edu>
1819         MACS version 2.0.4 (tag:alpha)
1820
1821         * cPeakDetect.py
1822
1823         Fix a bug, correctly assign lambda_bg while --to-small is
1824         set. Thanks Junya Seo!
1825
1826         Add rank and num of bp columns to pvalue-qvalue table.
1827
1828         * cScoreTrack.py
1829
1830         Fix bugs to correctly deal with peakless chromosomes. Thanks
1831         Vaibhav Jain!
1832
1833         Use AFDR for independent tests instead.
1834
1835         * encodePeak
1836
1837         Now MACS can output peak coordinates together with pvalue, qvalue,
1838         summit positions in a single encodePeak format (designed for
1839         ENCODE project) file. This file can be loaded to UCSC
1840         browser. Definition of some specific columns are: 5th:
1841         int(-log10pvalue*10), 7th: fold-change, 8th: -log10pvalue, 9th:
1842         -log10qvalue, 10th: relative summit position to peak start.
1843
1844
1845 2011-06-19  Tao Liu  <taoliu@jimmy.harvard.edu>
1846         MACS version 2.0.3 (tag:alpha)
1847
1848         * Rich output with qvalue, fold enrichment, and pileup height
1849
1850         Calculate q-values using a refined Benjamini–Hochberg–Yekutieli
1851         procedure:
1852
1853         http://en.wikipedia.org/wiki/False_discovery_rate#Dependent_tests
1854
1855         Now we have a similiar xls output file as before. The differences
1856         from previous file are:
1857
1858         1. Summit now is absolute summit, instead of relative summit
1859            position;
1860         2. 'Pileup' is previous 'tag' column. It's the extended fragment
1861            pileup at the peak summit;
1862         3. We now use '-log10(pvalue)' instead of '-10log10(pvalue)', so
1863            5.00 means 1e-5, simple and less confusing.
1864         4. FDR column becomes '-log10(qvalue)' column.
1865         5. The pileup, -log10pvalue, fold_enrichment and -log10qvalue are
1866            the values at the peak summit.
1867
1868         * Extra output files
1869
1870         NAME_pqtable.txt contains pvalue and qvalue relationships.
1871
1872         NAME_treat_pvalue.bdg and NAME_treat_qvalue.bdg store -log10pvalue
1873         and -log10qvalue scores in BedGraph format. Nearby regions with
1874         the same value are not merged.
1875
1876         * Separation of FeatIO.py
1877
1878         Its content has been divided into cPeakIO.pyx, cBedGraph.pyx, and
1879         cFixWidthTrack.pyx. A modified bedGraphTrackI class was
1880         implemented to store pileup, local lambda, pvalue, and qvalue
1881         alltogether in cScoreTrack.pyx.
1882
1883         * Experimental option --half-ext
1884
1885         Suggested by NPS algorithm, I added an experimental option
1886         --half-ext to let MACS only extends ChIP fragment around its
1887         middle point for only 1/2 d.
1888
1889 2011-06-12  Tao Liu  <taoliu@jimmy.harvard.edu>
1890         MACS version 2.0.2 (tag:alpha)
1891
1892         * macs2
1893
1894         Add an error check to see if there is no common chromosome names
1895         from treatment file and control file
1896
1897         * cPeakDetect.pyx, cFeatIO.pyx, cPileup.pyx
1898
1899         Reduce memory usage by removing deepcopy() calls.
1900
1901         * Modify README documents and others.
1902
1903 2011-05-19  Tao Liu  <taoliu@jimmy.harvard.edu>
1904         MACS Version 2.0.1 (tag:alpha)
1905
1906         * cPileup.pyx, cPeakDetect.pyx and peak calling process
1907
1908         Jie suggested me a brilliant simple method to pileup fragments
1909         into bedGraph track. It works extremely faster than the previous
1910         function, i.e, faster than MACS1.3 or MACS1.4. So I can include
1911         large local lambda calculation in MACSv2 now. Now I generate three
1912         bedGraphs for d-size local bias, slocal-size and llocal-size local
1913         bias, and calculate the maximum local bias as local lambda
1914         bedGraph track.
1915
1916         Minor: add_loc in bedGraphTrackI now can correctly merge the
1917         region with its preceding region if their value are the same.
1918
1919         * macs2
1920
1921         Add an option to shift control tags before extension. By default,
1922         control tags will be extended to both sides regardless of strand
1923         information.
1924
1925 2011-05-17  Tao Liu  <taoliu@jimmy.harvard.edu>
1926         MACS Version 2.0.0 (tag:alpha)
1927
1928         * Use bedGraph type to store data internally and externally.
1929
1930         We can have theoretically one-basepair resolution profiles. 10
1931         times smaller in filesize and even smaller after converting to
1932         bigWig for visualization.
1933
1934         * Peak calling process modified. Better peak boundary detection.
1935
1936         Extend ChIP tag to d, and pileup to have a ChIP bedGraph. Extend
1937         Control tag to d and 1,000bp, and pileup to two bedGraphs. (1000bp
1938         one will be averaged to d size) Then calculate the maximum value
1939         of these two tracks and a global background, to have a
1940         local-lambda bedGraph.
1941
1942         Use -10log10poisson_pvalue as scores to generate a score track
1943         before peak calling.
1944
1945         A general peak calling based on a score cutoff, min length of peak
1946         and max gap between nearby peaks.
1947
1948         * Option changes.
1949
1950         Wiggle file output is removed. Now we only support bedGraph
1951         output. The generation of bedGraph is highly recommended since it
1952         will not cost extra time. In other words, bedGraph generation is
1953         internally run even you don't want to save bedGraphs on disk, due
1954         to the peak calling algorithm in MACS v2.
1955
1956         * cProb.pyx
1957
1958         We now can calculate poisson pvalue in log space so that the score
1959         (-10*log10pvalue) will not have a upper limit of 3100 due to
1960         precision of float number.
1961
1962         * Cython is adopted to speed up Python code.
1963
1964 2011-02-28  Tao Liu  <taoliu@jimmy.harvard.edu>
1965         Small fixes
1966
1967         * Replaced with a newest WigTrackI class and fixed the wignorm script.
1968
1969 2011-02-21  Tao Liu  <taoliu@jimmy.harvard.edu>
1970         Version 1.4.0rc2 (Valentine)
1971
1972         * --single-wig option is renamed to --single-profile
1973
1974         * BedGraph output with --bdg or -B option.
1975
1976         The BedGraph output provides 1bp resolution fragment pileup
1977         profile. File size is smaller than wig file. This option can be
1978         combined with --single-profile option to produce a bedgraph file
1979         for the whole genome. This option can also make --space,
1980         --call-subpeaks invalid.
1981
1982         * Fix the description of --shiftsize to correctly state that the
1983         value is 1/2 d (fragment size).
1984
1985         * Fix a bug in the call to __filter_w_control_tags when control is
1986         not available.
1987
1988         * Fix a bug on --to-small option. Now it works as expected.
1989
1990         * Fix a bug while counting the tags in candidate peak region, an
1991         extra tag may be included. (Thanks to Jake Biesinger!)
1992
1993         * Fix the bug for the peaks extended outside of chromosome
1994         start. If the minus strand tag goes outside of chromosome start
1995         after extension of d, it will be thrown out.
1996
1997         * Post-process script for a combined wig file:
1998
1999         The "wignorm" command can be called after a full run of MACS14 as
2000         a postprocess. wignorm can calculate the local background from the
2001         control wig file from MACS14, then use either foldchange,
2002         -10*log10(pvalue) from possion test, or difference after asinh
2003         transformation as the score to build a single wig track to
2004         represent the binding strength. This script will take a
2005         significant long time to process.
2006
2007         * --wigextend has been obsoleted.
2008
2009 2010-09-21  Tao Liu  <taoliu@jimmy.harvard.edu>
2010         Version 1.4.0rc1 (Starry Sky)
2011
2012         * Duplicate reads option
2013
2014         --keep-dup behavior is changed. Now user can specify how many
2015         reads he/she wants to keep at the same genomic location. 'auto' to
2016         let MACS decide the number based on binomial distribution, 'all'
2017         to let MACS keep all reads.
2018
2019         * pvalue and FDR fixes (Thanks to Prof. Zhiping Weng)
2020
2021         By default, MACS will now scale the smaller dataset to the bigger
2022         dataset. For instance, if IP has 10 million reads, and Input has 5
2023         million, MACS will double the lambda value calculated from Input
2024         reads while calling BOTH the positive peaks and negative
2025         peaks. This will address the issue caused by unbalanced numbers of
2026         reads from IP and Input. If --to-small is turned on, MACS will
2027         scale the larger dataset to the smaller one. So from now on, if d
2028         is fixed, then the peaks from a MACS call for A vs B should be
2029         identical to the negative peaks from a B vs A.
2030
2031 2010-09-01  Tao Liu  <taoliu@jimmy.harvard.edu>
2032         Version 1.4.0beta (summer wishes)
2033
2034         * New features
2035
2036         ** Model building
2037
2038         The default behavior in the model building step is slightly
2039         changed. When MACS can't find enough pairs to build model
2040         (implemented in alpha version) or the modeled fragment length is
2041         less than 2 times of tag length (implemented in beta version),
2042         MACS will use 2 times of --shiftsize value as fragment length in
2043         the later analysis. --off-auto can turn off this default behavior.
2044
2045         ** Redundant tag filtering
2046
2047         The IO module is rewritten. The redundant tag filtering process
2048         becomes simpler and works as promise. The maximum allowed number
2049         of tags at the exact same location is calculated from the
2050         sequencing depth and genome size using a binomial distribution,
2051         for both TREAMENT and CONTROL separately. ( previously only
2052         TREATMENT is considered ) The exact same location means the same
2053         coordination and the same strand. Then MACS will only keep at most
2054         this number of tags at the exact same location in the following
2055         analysis. An option --keep-dup can let MACS skip the filtering and
2056         keep all the tags. However this may bring in a lot of sequencing
2057         bias, so you may get many false positive peaks.
2058
2059         ** Single wiggle mode
2060
2061         First thing to mention, this is not the score track that I
2062         described before. By default, MACS generates wiggle files for
2063         fragment pileup for every chromosomes separately. When you use
2064         --single-wig option, MACS will generate a single wiggle file for
2065         all the chromosomes so you will get a wig.gz for TREATMENT and
2066         another wig.gz for CONTROL if available.
2067
2068         ** Sniff -- automatic format detection
2069
2070         Now, by default or "-f AUTO", MACS will decide the input file
2071         format automatically. Technically, it will try to read at most
2072         1000 records for the first 10 non-comment lines. If it succeeds,
2073         the format is decided. I recommend not to use AUTO and specify the
2074         right format for your input files, unless you combine different
2075         formats in a single MACS run.
2076
2077         * Options changes
2078
2079         --single-wig and --keep-dup are added. Check previous section in
2080         ChangeLog for detail.
2081
2082         -f (--format) AUTO is now the default option.
2083
2084         --slocal default: 1000
2085         --llocal default: 10000
2086
2087         * Bug fixed
2088
2089         Setup script will stop the installation if python version is not
2090         python2.6 or python2.7.
2091
2092         Local lambda calculation has been changed back. MACS will check
2093         peak_region, slocal( default 1K) and llocal (default 10K) for the
2094         local bias. The previous 200bps default will cause MACS misses
2095         some peaks where the input bias is very sharp.
2096
2097         sam2bed.py script is corrected.
2098
2099         Relative pos in xls output is fixed.
2100
2101         Parser for ELAND_export is fixed to pass some of the no match
2102         lines. And elandexport2bed.py is fixed too. ( however I can't
2103         guarantee that it works on any eland_export files. )
2104
2105 2010-06-04  Tao Liu  <taoliu@jimmy.harvard.edu>
2106         Version 1.4.0alpha2 (be smarter)
2107
2108         * Options changes
2109
2110         --gsize now provides shortcuts for common genomes, including
2111         human, mouse, C. elegans and fruitfly.
2112
2113         --llocal now will be 5000 bps if there is no input file, so that
2114         local lambda doesn't overkill enriched binding sites.
2115
2116 2010-06-02  Tao Liu  <taoliu@jimmy.harvard.edu>
2117         Version 1.4alpha (be smarter)
2118
2119         * Options changes
2120
2121         --tsize option is redesigned. MACS will use the first 10 lines of
2122         the input to decide the tag size. If user specifies --tsize, it
2123         will override the auto decided tsize.
2124
2125         --lambdaset is replaced by --slocal and --llocal which mean the
2126         small local region and large local region.
2127
2128         --bw has no effect on the scan-window size now. It only affects the
2129         paired-peaks model process.
2130
2131         * Model building
2132
2133         During the model building, MACS will pick out the enriched regions
2134         which are not too high and not too low to build the paired-peak
2135         model. Default the region is from fold 10 to fold 30. If MACS
2136         fails to build the model, by default it will use the nomodel
2137         settings, like shiftsize=100bps, to shift and extend each
2138         tags. This behavior can be turned off by '--off-auto'.
2139
2140         * Output files
2141
2142         An extra file including all the summit positions are saved in
2143         *_summits.bed file. An option '--call-subpeaks' will invoke
2144         PeakSplitter developed by Mali Salmon to split wide peaks into
2145         smaller subpeaks.
2146
2147         * Sniff ( will in beta )
2148
2149         Automatically recognize the input file format, so use can combine
2150         different format in one MACS run.
2151
2152         Not implemented features/TODO:
2153
2154         * Algorithms ( in near future? )
2155
2156         MACS will try to refine the peak boundaries by calculating the
2157         scores for every point in the candidate peak regions. The score
2158         will be the -10*log(10,pvalue) on a local poisson distribution. A
2159         cutoff specified by users (--pvalue) will be applied to find the
2160         precise sub-peaks in the original candidate peak region. Peak
2161         boudaries and peak summits positions will be saved in separate BED
2162         files.
2163
2164         * Single wiggle track ( in near future? )
2165
2166         A single wiggle track will be generated to save the scores within
2167         candidate peak regions in the 10bps resolution. The wiggle file
2168         is in fixedStep format.
2169
2170
2171 2009-10-16  Tao Liu  <taoliu@jimmy.harvard.edu>
2172         Version 1.3.7.1 (Oktoberfest, bug fixed #1)
2173
2174         * bin/Constants.py
2175
2176         Fixed typo. FCSTEP -> FESTEP
2177
2178         * lib/PeakDetect.py
2179
2180         The 'femax' attribute bug is fixed
2181
2182 2009-10-02  Tao Liu  <taoliu@jimmy.harvard.edu>
2183         Version 1.3.7 (Oktoberfest)
2184
2185         * bin/macs, lib/PeakDetect.py, lib/IO/__init__.py, lib/OptValidator.py
2186
2187         Enhancements by Peter Chines:
2188
2189         1. gzip files are supported.
2190         2. when --diag is on, user can set the increment and endpoint for
2191         fold enrichment analysis by setting --fe-step and --fe-max.
2192
2193         Enhancements by Davide Cittaro:
2194
2195         1. BAM and SAM formats are supported.
2196         2. small changes in the header lines of wiggle output.
2197
2198         Enhancements by Me:
2199         1. I added --fe-min option;
2200         2. Bowtie ascii output with suffix ".map" is supported.
2201
2202         Bug fixed:
2203
2204         1. --nolambda bug is fixed. ( reported by Martin in JHU )
2205         2. --diag bug is fixed. ( reported by Bogdan Tanasa )
2206         3. Function to remove suffix '.fa' is fixed. ( reported by Jeff Johnston )
2207         4. Some "fold change" have been changed to "fold enrichment".
2208
2209 2009-06-10  Tao Liu  <taoliu@jimmy.harvard.edu>
2210         Version 1.3.6.1 (default parameter change)
2211
2212         * bin/macs, lib/PeakDetect.py
2213
2214         "--oldfdr" is removed. The 'oldfdr' behaviour becomes
2215         default. "--futurefdr" is added which can turn on the 'new' method
2216         introduced in 1.3.6. By default it's off.
2217
2218         * lib/PeakDetect.py
2219
2220         Fixed a bug. p-value is corrected a little bit.
2221
2222
2223 2009-05-11  Tao Liu  <taoliu@jimmy.harvard.edu>
2224         Version 1.3.6 (Birthday cake)
2225
2226         * bin/macs
2227
2228         "track name" is added to the header of BED output file.
2229
2230         Now the default peak detection method is to consider 5k and 10k
2231         nearby regions in treatment data and peak location, 1k, 5k, and
2232         10k regions in control data to calculate local bias. The old
2233         method can be called through '--old' option.
2234
2235         Information about how many total/unique tags in treatment or
2236         control will be saved in final .xls output.
2237
2238         * lib/IO/__init__.py
2239
2240         ".fa" will be removed from input tag alignment so only the
2241         chromosome names are kept.
2242
2243         WigTrackI class is added for Wiggle like data structure. (not used
2244         now)
2245
2246         The parser for ELAND multi PET files has been fixed. Now the 5'
2247         tag position for a pair will be kept, whereas in the previous
2248         version, the middle points are kept.
2249
2250         * lib/IO/BinKeeper.py
2251
2252         BinKeeperI class is inspired by Jim Kent's library for UCSC genome
2253         browser, which can quickly access certain region for values in a
2254         large wiggle like data file. (not used now)
2255
2256         * lib/OptValidator.py
2257
2258         typo fixed.
2259
2260         * lib/PeakDetect.py
2261
2262         Now the default peak detection method is to consider 5k and 10k
2263         nearby regions in treatment data and peak location, 1k, 5k, and
2264         10k regions in control data to calculate local bias. The old
2265         method can be called through '--old' option.
2266
2267         Two columns have beed added to BED output file. 4th column: peak
2268         name; 5th column: peak score using -10log(10,pvalue) as score.
2269
2270         * setup.py
2271
2272         Add support to build a Mac App through 'setup.py py2app', or a
2273         Windows executable through 'setup.py py2exe'. You need to install
2274         py2app or py2exe package in order to use these functions.
2275
2276 2009-02-12  Tao Liu  <taoliu@jimmy.harvard.edu>
2277         Version 1.3.5 (local lambda fixed, typo fixed, model figure improved)
2278
2279         * PeakDetect.py
2280
2281         Now, besides 1k, 5k, 10k, MACS will also consider peak size region
2282         in control data to calculate local lambda for each peak. Peak
2283         calling results will be slightly different with previous version,
2284         beware!
2285
2286         * OptValidator.py
2287
2288         Typo fixed, ELANDParser -> ELANDResultParser
2289
2290         * OutputWriter.py
2291
2292         Now, modeled d value will be shown on the model figure.
2293
2294 2009-01-06  Tao Liu  <taoliu@jimmy.harvard.edu>
2295         Version 1.3.4 (Happy New Year Version, bug fixed, ELAND multi/PET support)
2296
2297         * macs, IO/__init__.py, PeakDetect.py
2298
2299         Add support for ELAND multi format. Add support for Pair-End
2300         experiment, in this case, 5'end and 3'end ELAND multi format files
2301         are required for treatment or control data. See 00README file for
2302         detail.
2303
2304         Add wigextend option.
2305
2306         Add petdist option for Pair-End Tag experiment, which is the best
2307         distance between 5' and 3' tags.
2308
2309         * PeakDetect.py
2310
2311         Fixed a bug which cause the end positions of every peak region
2312         incorrectly added by 1 bp. ( Thanks Mali Salmon!)
2313
2314         * OutputWriter.py
2315
2316         Fix bugs while generating wiggle files. The start position of
2317         wiggle file is set to 1 instead of 0.
2318
2319         Fix a bug that every 10M bps, signals in the first 'd' range are
2320         lower than actual. ( Thanks Mali Salmon!)
2321
2322
2323 2008-12-03  Tao Liu  <taoliu@jimmy.harvard.edu>
2324         Version 1.3.3 (wiggle bugs fixed)
2325
2326         * OutputWriter.py
2327
2328         Fix bugs while generating wiggle files. 1. 'span=' is added to
2329         'variableStep' line; 2. previously, every 10M bps, the coordinates
2330         were wrongly shifted to the right for 'd' basepairs.
2331
2332         * macs, PeakDetect.py
2333
2334         Add an option to save wiggle files on different resolution.
2335
2336 2008-10-02  Tao Liu  <taoliu@jimmy.harvard.edu>
2337         Version 1.3.2 (tiny bugs fixed)
2338
2339         * IO/__init__.py
2340
2341         Fix 65536 -> 65535. ( Thank Joon)
2342
2343         * Prob.py
2344
2345         Improved for binomial function with extra large number. Imported
2346         from Cistrome project.
2347
2348         * PeakDetect.py
2349
2350         If treatment channel misses reads in some chromosome included in
2351         control channel, or vice versa, MACS will not exit. (Thank Shaun
2352         Mahony)
2353
2354         Instead, MACS will fake a tag at position -1 when calling
2355         treatment peaks vs control, but will ignore the chromosome while
2356         calling negative peaks.
2357
2358 2008-09-04  Tao Liu  <taoliu@jimmy.harvard.edu>
2359         Version 1.3.1 (tiny bugs fixed version)
2360
2361         * Prob.py
2362
2363         Hyunjin Gene Shin contributed some codes to Prob.py. Now the
2364         binomial functions can tolerate large and small numbers.
2365
2366         * IO/__init__.py
2367
2368         Parsers now split lines in BED/ELAND file using any
2369         whitespaces. 'track' or 'browser' lines will be regarded as
2370         comment lines. A bug fixed when throwing StrandFormatError. The
2371         maximum redundant tag number at a single position can be no less
2372         than 65536.
2373
2374
2375 2008-07-15  Tao Liu  <taoliu@jimmy.harvard.edu>
2376         Version 1.3 (naming clarification version)
2377
2378         * Naming clarification changes according to our manuscript:
2379
2380         'frag_len' is changed to 'd'.
2381
2382         'fold_change' is changed to 'fold_enrichment'.
2383
2384         Suggest '--bw' parameter to be determined by users from the real
2385         sonication size.
2386
2387         Maximum FDR is 100% in the output file.
2388
2389         And other clarifications in 00README file and the documents on the
2390         website.
2391
2392         * IO/__init__.py
2393         If the redundant tag number at a single position is over 32767,
2394         just remember 32767, instead of raising an overflow exception.
2395
2396         * setup.py
2397         fixed a typo.
2398
2399         * PeakDetect.py
2400         Bug fixed for diagnosis report.
2401
2402
2403 2008-07-10  Tao Liu  <taoliu@jimmy.harvard.edu>
2404         Version 1.2.2gamma
2405
2406         * Serious bugs fix:
2407
2408         Poisson distribution CDF and inverse CDF functions are
2409         corrected. They can produce right results even for huge lambda
2410         now. So that the p-value and FDR values in the final excel sheet
2411         are corrected.
2412
2413         IO package now can tolerate some rare cases; ELANDParser in IO
2414         package is fixed. (Thank Bogdan)
2415
2416         * Improvement:
2417
2418         Reverse paired peaks in model are rejected. So there will be no
2419         negative 'frag_len'. (Thank Bogdan)
2420
2421         * Features added:
2422
2423         Diagnosis function is completed. Which can output a table file for
2424         users to estimate their sequencing depth.
2425
2426
2427 2008-06-30  Tao Liu  <taoliu@jimmy.harvard.edu>
2428         Version 1.2
2429
2430         * Probe.py is added!
2431
2432         GSL is totally removed from MACS. Instead, I have implemented the
2433         CDF and inverse CDF for poisson and binomial distribution purely
2434         in python.
2435
2436         * Constants.py is added!
2437
2438         Organize constants used in MACS in the Constants.py file.
2439
2440         * All other files are modified!
2441
2442         Foldchange calculation is modified. Now the foldchange only be
2443         calculated at the peak summit position instead of the whole peak
2444         region. The values will be higher and more robust than before.
2445
2446         Features added:
2447
2448         1. MACS can save wiggle format files containing the tag number at
2449         every 10 bp along the genome. Tags are shifted according to our
2450         model before they are calculated.
2451
2452         2. Model building and local lambda calculation can be skipped with
2453         certain options.
2454
2455         3. A diagnosis report can be generated through '--diag'
2456         option. This report can help you get an assumption about the
2457         sequencing saturation. This funtion is only in beta stage.
2458
2459         4. FDR calculation speed is highly improved.
2460
2461 2008-05-28  Tao Liu  <taoliu@jimmy.harvard.edu>
2462         Version 1.1
2463
2464         * TabIO, PeakModel.py ...
2465         Bug fixed to let MACS tolerate some cases while there is no tag on
2466         either plus strand or minus strand.
2467
2468         * setup.py
2469         Check the version of python. If the version is lower than 2.4,
2470         refuse to install with warning.
2471