docs/filterdup.md

# Filterdup

## Overview
The `filterdup` command is part of the MACS3 suite and removes duplicate reads from sequencing alignments. Duplicate reads, which often arise from PCR amplification, can bias downstream analysis such as peak calling.

## Detailed Description

The `filterdup` command reads an alignment file, removes duplicate reads, and writes the remaining reads in BED format. Reads count as duplicates when they map to the same coordinate on the same strand; the `--keep-dup` option controls how many reads are kept at each such location.
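
To make the filtering rule concrete, here is a minimal shell sketch of the idea behind `--keep-dup <integer>`. It is not MACS3's implementation: it assumes a 6-column BED stream with the strand in column 6, treats reads sharing chromosome, start, and strand as duplicates (a simplification), and the names `filter_dups` and `sample_bed` as well as the sample reads are hypothetical.

```bash
# Keep at most MAX reads per (chromosome, start, strand) in a 6-column BED stream.
# Illustration only; MACS3 does this internally with its own data structures.
filter_dups() {
    local max="$1"
    awk -v max="$max" 'BEGIN { FS = OFS = "\t" }
        { key = $1 FS $2 FS $6 }   # position plus strand identifies a duplicate
        ++seen[key] <= max         # print only the first MAX reads for each key
    '
}

# Hypothetical sample: three reads share chr1:100 (+), one lies on the minus strand.
sample_bed() {
    printf 'chr1\t100\t150\tr1\t0\t+\n'
    printf 'chr1\t100\t150\tr2\t0\t+\n'
    printf 'chr1\t100\t150\tr3\t0\t+\n'
    printf 'chr1\t100\t150\tr4\t0\t-\n'
    printf 'chr1\t200\t250\tr5\t0\t+\n'
}

# Keep a single read per location and strand.
sample_bed | filter_dups 1
```

With a limit of 1, only the first read at each location and strand survives; raising the limit to 2 would also keep the second read at chr1:100 (+).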

## Command Line Options

The command line options for `filterdup` are defined in `/MACS3/Commands/filterdup_cmd.py` and `/bin/macs3`. Here is a brief overview of these options:

- `-i` or `--ifile`: The input alignment file, in any format accepted by `--format`. This option is required.
- `-o` or `--ofile`: The output file, written in BED format regardless of the input format. If it is omitted, the result is written to standard output.
- `-g` or `--gsize`: The mappable genome size, given either as a number or as one of the shortcuts `hs` (Homo sapiens), `mm` (Mus musculus), `ce` (Caenorhabditis elegans), or `dm` (Drosophila melanogaster). The default is `hs`.
- `-f` or `--format`: The format of the input file. This can be `AUTO`, `BED`, `ELAND`, `ELANDMULTI`, `ELANDEXPORT`, `SAM`, `BAM`, or `BOWTIE`. The default is `AUTO`, which means the program tries to detect the format automatically.
- `--keep-dup`: The maximum number of reads to keep at the same location. This can be `all`, `auto`, or an integer. With `all`, every read is kept. With `auto`, the program calculates the limit from the number of reads and the genome size. With an integer, at most that many reads are kept at each location. The default is `1`.
- `--buffer-size`: The buffer size used to incrementally grow the internal arrays that store read alignment information. The default is `100000`.

## Example Usage

Here is an example of how to use the `filterdup` command:

```bash
macs3 filterdup -i input.bam -o output.bed --gsize hs --format AUTO --keep-dup 1 --buffer-size 100000
```

In this example, the program removes duplicate reads from `input.bam` and writes the remaining reads to `output.bed`. The mappable genome size is set to `hs` (Homo sapiens), the input format is detected automatically, and at most one read is kept at each location.
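
When `filterdup` is called from a script, a small wrapper can catch common mistakes before the command runs. The following is a minimal sketch, not part of MACS3: the function name `run_filterdup` and its argument order are hypothetical, and it only uses the `-i`, `-o`, and `--keep-dup` options described above.

```bash
# Hypothetical wrapper: validate arguments, then call macs3 filterdup.
# Usage: run_filterdup INPUT OUTPUT [KEEP_DUP]   (KEEP_DUP defaults to 1)
run_filterdup() {
    local input="$1" output="$2" keep_dup="${3:-1}"

    # Fail early if the input file is missing or unreadable.
    if [ ! -r "$input" ]; then
        echo "error: cannot read input file: $input" >&2
        return 1
    fi

    # Accept the keywords 'all' and 'auto', or a non-negative integer.
    case "$keep_dup" in
        all|auto) ;;
        *[!0-9]*)
            echo "error: --keep-dup must be all, auto, or an integer" >&2
            return 2
            ;;
    esac

    macs3 filterdup -i "$input" -o "$output" --keep-dup "$keep_dup"
}
```

For example, `run_filterdup input.bam output.bed auto` would pass `--keep-dup auto` through, while a misspelled value is rejected before MACS3 starts.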

## FAQs about `filterdup`

Q: What does `filterdup` do?
A: `filterdup` is a tool in the MACS3 suite that filters duplicate reads from your data. Duplicate reads can bias your analysis, so it's important to remove them.

Q: How do I use `filterdup`?
A: Provide an input file, an output file, and any optional parameters that control its behavior. See the [Example Usage](#example-usage) section above for a complete command.

## Troubleshooting `filterdup`

If you're having trouble using `filterdup`, here are some things to try:

- Make sure your input file is in a supported format. `filterdup` can try to detect the format automatically, but it's best to specify it explicitly with `--format` if possible.
- Make sure the `--gsize` option matches your genome. If your organism is not covered by one of the four shortcuts (`hs`, `mm`, `ce`, `dm`), specify the genome size as a number.
- If the program is running out of memory, try a smaller `--buffer-size`; this reduces memory usage at the cost of slower reading, and helps most when the alignment contains many chromosomes or contigs. If it is taking too long and memory allows, try a larger value.
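
To get a feel for how `--buffer-size` affects memory, the sketch below estimates a rough lower bound for reading an alignment. It assumes the rule of thumb of about (number of chromosomes) × (buffer size) × 8 bytes; treat both that formula and the helper name `estimate_min_mem_mib` as assumptions rather than a guarantee about MACS3's actual memory use.

```bash
# Rough lower bound on memory for reading alignments, in MiB.
# Assumed rule of thumb: chromosomes * buffer_size * 8 bytes.
estimate_min_mem_mib() {
    local n_chrom="$1" buffer_size="$2"
    echo $(( n_chrom * buffer_size * 8 / 1024 / 1024 ))
}

estimate_min_mem_mib 25 100000      # few chromosomes, default buffer size
estimate_min_mem_mib 50000 100000   # highly fragmented assembly, default buffer size
estimate_min_mem_mib 50000 1000     # same assembly with a smaller buffer
```

Under this assumption, the estimate grows linearly with the number of sequences, which is why a smaller buffer matters mainly for assemblies with many contigs or scaffolds.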

## Known issues or limitations of `filterdup`

There are currently no known issues or limitations with `filterdup`. If you encounter a problem, please submit an issue on the MACS3 GitHub page.