etc/gatk-wdl/ReadMe.md

   1 # My GATK4 WorkFlows
   2
   3 ## Install
   4
   5 ```R
   6 install.packages('optparse')
   7 install.packages('data.table')
   8 ```
   9
  10 ## seq-format-conversion
  11 Workflows for converting between sequence data formats
  12
  13 <https://github.com/gatk-workflows/seq-format-conversion>
  14
  15 ## Local CMD
  16
  17 ```bash
  18 time cromwell run paired-fastq-to-unmapped-bam.wdl -i paired-fastq-to-unmapped-bam.inputs.json >cromwell.uBAM.log &
  19 ```
  20
  21     You can also use `date '+%Y%m%d%H%M%S'` for unique strings.
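
As a minimal sketch of the tip above, a timestamp can be embedded in the log file name so that repeated runs do not overwrite each other's logs. The `make_log_name` helper and the `cromwell.uBAM` prefix are illustrative assumptions, not part of the upstream workflow.

```bash
# Hypothetical helper: build a log file name from a prefix and the current time.
make_log_name() {
  prefix="$1"
  printf '%s.%s.log\n' "$prefix" "$(date '+%Y%m%d%H%M%S')"
}

log_file="$(make_log_name cromwell.uBAM)"
echo "$log_file"

# The name could then be used as the redirect target, for example (not run here):
# time cromwell run paired-fastq-to-unmapped-bam.wdl -i paired-fastq-to-unmapped-bam.inputs.json > "$log_file" &
```

Two runs started within the same second would still collide; appending `$$` (the shell's process ID) is one way to reduce that risk.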
  22
  23 ### paired-fastq-to-unmapped-bam :
  24 This WDL converts paired FASTQ to uBAM and adds read group information
  25
  26 *NOTE: paired-fastq-to-unmapped-bam-fc.wdl is a slightly modified version of the original that supports users interested in running on FireCloud.
  27 As input, this WDL takes a TSV in which each row is a different read group and each column holds one descriptor of that read group.*
  28
  29 #### Requirements/expectations
  30 - Paired-end sequencing data in FASTQ format (one file per orientation)
  31 - The following metadata descriptors per sample:
  32 ```
  33 readgroup   fastq_pair1_file_path   fastq_pair2_file_path   sample_name   library_name   platform_unit   run_date   platform_name   sequencing_center
  34 ```
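
Before launching the FireCloud variant, it may help to confirm that every row of the TSV carries all nine descriptors listed above. The following is a sketch only: `check_readgroup_tsv` is a hypothetical helper, and the example row contains made-up values.

```bash
# Hypothetical check: every non-empty row must have exactly 9 tab-separated fields.
check_readgroup_tsv() {
  awk -F '\t' '
    NF > 0 && NF != 9 { printf "line %d: expected 9 columns, got %d\n", NR, NF; bad = 1 }
    END { exit bad }
  ' "$1"
}

# Made-up example row, for illustration only.
tsv="$(mktemp)"
printf 'rg1\t/data/s1_R1.fastq.gz\t/data/s1_R2.fastq.gz\tsample1\tlib1\tunit1\t2016-09-01\tillumina\tcenter1\n' > "$tsv"

if check_readgroup_tsv "$tsv"; then
  echo "TSV looks well-formed"
fi
```

The check only counts columns; it does not verify that the FASTQ paths exist or that the values are meaningful.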
  35
  36 #### Outputs
  37 - Set of unmapped BAMs, one per read group
  38 - File containing a list of the generated unmapped BAMs
  39
  40
  41
  42 ## gatk4-data-processing
  43
  44 <https://github.com/gatk-workflows/gatk4-data-processing>
  45
  46 ## Local CMD
  47
  48 ```bash
  49 time cromwell run processing-for-variant-discovery-gatk4.wdl -i processing-for-variant-discovery-gatk4.hg38.wgs.D3B.inputs.json > cromwell.processing.D3B.log &
  50 time cromwell run processing-for-variant-discovery-gatk4.wdl -i processing-for-variant-discovery-gatk4.hg38.wgs.Normal.inputs.json > cromwell.processing.Normal.log &
  51 ```
  52
  53 ### Purpose :
  54 Workflows for processing high-throughput sequencing data for variant discovery with GATK4 and related tools.
  55
  56 ### processing-for-variant-discovery-gatk4 :
  57 The processing-for-variant-discovery-gatk4 WDL pipeline implements data pre-processing according to the GATK Best Practices
  58 (June 2016).
  59
  60 #### Requirements/expectations
  61 - Paired-end sequencing data in unmapped BAM (uBAM) format
  62 - One or more read groups, one per uBAM file, all belonging to a single sample (SM)
  63 - Input uBAM files must additionally comply with the following requirements:
  64   - filenames all have the same suffix (we use ".unmapped.bam")
  65   - files must pass validation by ValidateSamFile
  66   - reads are provided in query-sorted order
  67   - all reads must have an RG tag
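
The filename requirement above can be checked before a run with a few lines of shell. This is a sketch under assumptions: `check_suffix` is a hypothetical helper and the file names are placeholders. The other requirements (validation, sort order, RG tags) need the actual tools and are only hinted at in a comment.

```bash
# Hypothetical pre-flight check: all input files share the expected suffix.
check_suffix() {
  suffix="$1"
  shift
  for f in "$@"; do
    case "$f" in
      *"$suffix") ;;  # name ends with the suffix, nothing to do
      *) echo "unexpected file name: $f" >&2; return 1 ;;
    esac
  done
}

check_suffix ".unmapped.bam" rgA.unmapped.bam rgB.unmapped.bam && echo "suffixes OK"

# Validation itself would be done with Picard, for example (not run here):
# java -jar picard.jar ValidateSamFile I=rgA.unmapped.bam MODE=SUMMARY
```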
  68
  69 #### Outputs
  70 - A clean BAM file and its index, suitable for variant discovery analyses.
  71
  72 ### Software version requirements :
  73 - GATK 4 or later
  74 - Picard 2.x
  75 - Samtools (see gotc docker)
  76 - Python 2.7
  77
  78
  79
  80 ## somatic-snvs-indels
  81
  82 <https://github.com/gatk-workflows/gatk4-somatic-snvs-indels>
  83
  84 ### Purpose :
  85 Workflows for somatic short variant analysis with GATK4.
  86
  87 ### mutect2 :
  88 Implements somatic short variant discovery using [GATK Best Practices](https://software.broadinstitute.org/gatk/best-practices/workflow).
  89 Note: This repo also provides mutect2_nio, an NIO-supported version of the WDL.
  90
  91 #### Requirements/expectations
  92 - Tumor bam and index
  93 - Normal bam and index
  94
  95 #### Outputs
  96 - unfiltered vcf
  97 - unfiltered vcf index
  98 - filtered vcf
  99 - filtered vcf index
 100
 101 ### mutect2_pon :
 102 Creates a Panel of Normals (PoN) to be used in somatic short variant discovery.
 103
 104 #### Requirements/expectations
 105 - Normal bams and index
 106
 107 #### Outputs
 108 - PON vcf and index
 109 - Normal calls vcf and index
 110
 111 ### mutect2-normal-normal :
 112 Used to validate mutect2 workflow.
 113
 114 #### Requirements/expectations
 115 - One analysis-ready BAM file (and its index) for each replicate
 116
 117 #### Outputs
 118 - False-positive VCF files and their indices, with a summary
 119
 120 ### Software version requirements :
 121 - GATK4 or later
 122
 123 Cromwell version support
 124 - Successfully tested on v31
 125
 126
 127 ### Parameter descriptions :
 128 #### mutect2 (single pair/sample)
 129 - ``Mutect2.gatk4_jar`` -- Location *within the docker file* of the GATK4 jar file.  If you wish to use a different jar file, such as one on your local filesystem or a google bucket, specify that location with ``Mutect2_Multi.gatk4_jar_override``.  This parameter is ignored if ``Mutect2_Multi.gatk4_jar_override`` is specified.
 130 - ``Mutect2.intervals`` -- A file listing genomic intervals to search for somatic mutations.  This should be in the standard GATK4 format.
 131 - ``Mutect2.ref_fasta``         -- reference fasta.  For Broad internal VM:  ``/seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.fasta``
 132 - ``Mutect2.ref_fasta_index`` -- For Broad internal VM:  ``/seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.fasta.fai``
 133 - ``Mutect2.ref_dict`` -- For Broad internal VM:  ``/seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.dict``
 134 - ``Mutect2.tumor_bam`` -- File path or storage location (depending on backend) of the tumor bam file.
 135 - ``Mutect2.tumor_bam_index`` -- File path or storage location (depending on backend) of the tumor bam file index.
 136 - ``Mutect2.normal_bam`` -- (optional) File path or storage location (depending on backend) of the normal bam file.
 137 - ``Mutect2.normal_bam_index`` -- (optional, but required if ``Mutect2.normal_bam`` is specified)  File path or storage location (depending on backend) of the normal bam file index.
 138 - ``Mutect2.pon`` -- (optional) Panel of normals VCF to use for false positive reduction.
 139 - ``Mutect2.pon_index`` -- (optional, but required if ``Mutect2_Multi.pon`` is specified)  VCF index for the panel of normals.  Please see GATK4 tool ``IndexFeatureFile`` for creation of an index.
 140 - ``Mutect2.scatter_count`` -- Number of executions to split the Mutect2 task into.  The more you put here, the faster Mutect2 will return results, but at a higher cost of resources.
 141 - ``Mutect2.gnomad`` -- (optional)  gnomAD vcf containing population allele frequencies (AF) of common and rare alleles.  Download an exome or genome sites vcf [here](http://gnomad.broadinstitute.org/downloads).  Essential for determining possible germline variants in the tumor.
 142 - ``Mutect2.gnomad_index`` -- (optional, but required if ``Mutect2_Multi.gnomad`` is specified)  VCF index for gnomAD.  Please see GATK4 tool ``IndexFeatureFile`` for creation of an index.
 143 - ``Mutect2.variants_for_contamination`` -- (optional)  vcf containing population allele frequencies (AF) of common SNPs.  If omitted, cross-sample contamination will not be calculated and contamination filtering will not be applied.  This can be generated from a gnomAD vcf using the GATK4 tool ``SelectVariants`` with the argument ``--select "AF > 0.05"``.  For speed, one can get very good results using only SNPs on chromosome 1.  For example, ``java -jar $gatk SelectVariants -V gnomad.vcf -L 1 --select "AF > 0.05" -O variants_for_contamination.vcf``.
 144 - ``Mutect2.variants_for_contamination_index`` -- (optional, but required if ``Mutect2_Multi.variants_for_contamination`` is specified)  VCF index for contamination variants.  Please see GATK4 tool ``IndexFeatureFile`` for creation of an index.
 145 - ``Mutect2.is_run_orientation_bias_filter`` -- ``true``/``false`` whether the orientation bias filter should be run.
 146 - ``Mutect2.is_run_oncotator`` -- ``true``/``false`` whether the command-line version of oncotator should be run.  If ``false``, ``Mutect2_Multi.oncotator_docker`` parameter is ignored.
 147 - ``Mutect2.gatk_docker`` -- Docker image to use for Mutect2 tasks.  This is only used for backends configured to use docker.
 148 - ``Mutect2.oncotator_docker`` -- (optional)  Docker image to use for Oncotator tasks.  This is only used for backends configured to use docker.
 149 - ``Mutect2.gatk4_jar_override`` -- (optional)  A GATK4 jar file to be used instead of the jar file in the docker image.  (See ``Mutect2_Multi.gatk4_jar``)  This can be very useful for developers.  Please note that you need to be careful that the docker image you use is compatible with the GATK4 jar file given here -- no automated checks are made.
 150 - ``Mutect2.preemptible_attempts`` -- Number of times to attempt running a task on a preemptible VM.  This is only used for cloud backends in cromwell and is ignored for local and SGE backends.
 151 - ``Mutect2.onco_ds_tar_gz`` -- (optional)  A tar.gz file of the oncotator datasources -- often quite large (>15GB).  This will be uncompressed as part of the oncotator task.  Depending on the backend used, this can be specified as a path on the local filesystem or a cloud storage container (e.g. gs://...).  Typically the Oncotator default datasource can be downloaded at ``ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/oncotator/``.  Do not put the FTP URL into the json file.
 152 - ``Mutect2.onco_ds_local_db_dir`` -- (optional)  A direct path to the Oncotator datasource directory (uncompressed).  While this is the fastest approach, it cannot be used with docker unless your docker image already has the datasources in it.  For cromwell backends without docker, this can be a local filesystem path.  *This cannot be a cloud storage location.*
 153
 154  Note:  If neither ``Mutect2_Multi.onco_ds_tar_gz`` nor ``Mutect2_Multi.onco_ds_local_db_dir`` is specified, the Oncotator task will download and uncompress the datasources for each execution.
 155
 156 The following three parameters are useful for rendering TCGA MAFs using oncotator.  These parameters are ignored if ``is_run_oncotator`` is ``false``.
 157 - ``Mutect2.artifact_modes`` -- List of artifact modes to search for in the orientation bias filter.  For example to filter the OxoG artifact, you would specify ``["G/T"]``.  For both the FFPE artifact and the OxoG artifact, specify ``["G/T", "C/T"]``.  If you do not wish to search for any artifacts, please set ``Mutect2_Multi.is_run_orientation_bias_filter`` to ``false``.
 158 - ``Mutect2.picard_jar`` -- A direct path to a picard jar for using ``CollectSequencingArtifactMetrics``.  This parameter requirement will be eliminated in the future.
 159 - ``Mutect2.m2_extra_args`` -- (optional) a string of additional command line arguments of the form "-argument1 value1 -argument2 value2" for Mutect 2.  Most users will not need this.
 160 - ``Mutect2.m2_extra_filtering_args`` -- (optional) a string of additional command line arguments of the form "-argument1 value1 -argument2 value2" for Mutect 2 filtering.  Most users will not need this.
 161 - ``Mutect2.sequencing_center`` -- (optional) center reporting this variant.  Please see ``https://wiki.nci.nih.gov/display/TCGA/Mutation+Annotation+Format+%28MAF%29+Specification+-+v2.4`` for more details.
 162 - ``Mutect2.sequence_source`` -- (optional)  ``WGS`` or ``WXS`` for whole genome or whole exome sequencing, respectively.  Please note that the controlled vocabulary of the TCGA MAF spec is *not* enforced.  Please see ``https://wiki.nci.nih.gov/display/TCGA/Mutation+Annotation+Format+%28MAF%29+Specification+-+v2.4`` for more details.
 163 - ``Mutect2.default_config_file`` -- (optional)  A configuration file that can direct oncotator to use default values for unspecified annotations in the TCGA MAF.  This helps prevent MAF files with many ``__UNKNOWN__`` values.  Here is an example that should work for most users:
 164
 165 ```
 166 [manual_annotations]
 167 override:NCBI_Build=37,Strand=+,status=Somatic,phase=Phase_I,sequencer=Illumina,Tumor_Validation_Allele1=,Tumor_Validation_Allele2=,Match_Norm_Validation_Allele1=,Match_Norm_Validation_Allele2=,Verification_Status=,Validation_Status=,Validation_Method=,Score=,BAM_file=,Match_Norm_Seq_Allele1=,Match_Norm_Seq_Allele2=
 168 ```
 169 - ``Mutect2.filter_oncotator_maf`` -- (optional, default true) Whether Oncotator should remove filtered variants when rendering the MAF.  Ignored if ``is_run_oncotator`` is ``false``.
 170
 171
 172
 173 ## gatk4-germline-snps-indels
 174
 175 <https://github.com/gatk-workflows/gatk4-germline-snps-indels>
 176
 177 ### Purpose :
 178 Workflows for germline short variant discovery with GATK4.
 179
 180 ### haplotypecaller-gvcf-gatk :
 181 The haplotypecaller-gvcf-gatk4 workflow runs HaplotypeCaller
 182 from GATK4 in GVCF mode on a single sample according to the GATK Best Practices (June 2016),
 183 scattered across intervals.
 184
 185 #### Requirements/expectations
 186 - One analysis-ready BAM file for a single sample (as identified in RG:SM)
 187 - Set of variant calling intervals lists for the scatter, provided in a file
 188 #### Outputs
 189 - One GVCF file and its index
 190
 191 ### joint-discovery-gatk :
 192 The second WDL implements the joint discovery and VQSR
 193 filtering portion of the GATK Best Practices (June 2016) for germline SNP and Indel
 194 discovery in human whole-genome sequencing (WGS) and exome sequencing data.
 195
 196 *NOTE: joint-discovery-gatk4-local.wdl is a slightly modified version of the original to support users interested in running the workflow locally.*
 197
 198 #### Requirements/expectations
 199 - One or more GVCFs produced by HaplotypeCaller in GVCF mode
 200 - Bare minimum 1 WGS sample or 30 Exome samples. Gene panels are not supported.
 201 - When determining disk sizes in the json, use the guidelines below
 202   - small_disk = (num_gvcfs / 10) + 10
 203   - medium_disk = (num_gvcfs * 15) + 10
 204   - huge_disk = num_gvcfs + 10
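
The guideline formulas above are simple enough to evaluate in the shell before editing the json. This sketch assumes that rounding down in the division is acceptable; the sample count is a made-up value.

```bash
# Example sample count (made up); replace with the real number of GVCFs.
num_gvcfs=100

# Shell arithmetic is integer-only, so the division rounds down (an assumption here).
small_disk=$(( num_gvcfs / 10 + 10 ))
medium_disk=$(( num_gvcfs * 15 + 10 ))
huge_disk=$(( num_gvcfs + 10 ))

echo "small_disk=$small_disk medium_disk=$medium_disk huge_disk=$huge_disk"
# prints: small_disk=20 medium_disk=1510 huge_disk=110
```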
 205
 206 #### Outputs
 207 - A VCF file and its index, filtered using variant quality score recalibration
 208   (VQSR) with genotypes for all samples present in the input VCF. All sites that
 209   are present in the input VCF are retained; filtered sites are annotated as such
 210   in the FILTER field.
 211
 212 ### Software version requirements :
 213 - GATK 4 or later
 214 - Samtools (see gotc docker)
 215 - Python 2.7
 216
 217 Cromwell version support
 218 - Successfully tested on v31
 219 - Does not work on versions < v23 due to output syntax
 220
 221
 222
 223 ---
 224
 225
 226
 227 ## gatk-somatic-with-preprocessing
 228
 229 This WDL pipeline implements data pre-processing and initial calling for somatic SNP,
 230 Indel, and copy number variants in human whole-genome sequencing (WGS) data.
 231
 232 <https://github.com/gatk-workflows/gatk4-somatic-with-preprocessing>
 233
 234 Note: The gatk-somatic-with-preprocessing WDL is not used in any pipelines at the Broad Institute
 235 and has been provided only as a convenience for the community.  Therefore, this WDL is unsupported.
 236
 237 ## Local CMD
 238
 239 ```bash
 240 time cromwell run FullSomaticPipeline.wdl --imports FullSomaticPipeline.imports.zip -i FullSomaticPipeline.json >cromwell.02.log &
 241 ```
 242
 243 ### Requirements/expectations
 244  - Human whole-genome paired-end sequencing data in unmapped BAM (uBAM) format
 245  - One or more read groups, one per uBAM file, all belonging to a single sample (SM)
 246  - Input uBAM files must additionally comply with the following requirements:
 247    - filenames all have the same suffix (we use ".unmapped.bam")
 248    - files must pass validation by ValidateSamFile
 249    - reads are provided in query-sorted order
 250    - all reads must have an RG tag
 251  - Reference genome must be Hg38 with ALT contigs
 252
 253 ---
 254
 255 ## somatic-cnvs
 256 Workflows for somatic copy number variant analysis
 257
 258 <https://github.com/gatk-workflows/gatk4-somatic-cnvs>
 259
 260 ## Running the Somatic CNV WDL
 261
 262 ### Which WDL should you use?
 263
 264 - Building a panel of normals (PoN): ``cnv_somatic_panel_workflow.wdl``
 265 - Running a matched pair: ``cnv_somatic_pair_workflow.wdl``
 266
 267 #### Setting up parameter json file for a run
 268
 269 To get started, create the json template (using ``java -jar wdltool.jar inputs <workflow>``) for the workflow you wish to run and adjust parameters accordingly.
 270
 271 *Please note that there are optional workflow-level and task-level parameters that do not appear in the template file.  These are set to reasonable values by default, but can also be adjusted if desired.*
 272
 273 #### Required parameters in the somatic panel workflow
 274
 275 Important: The normal_bams samples in the json can be used to test the WDL; they are NOT to be used to create a panel of normals for sequence analysis. For instructions on creating a proper PoN, please refer to the user documentation at https://software.broadinstitute.org/gatk/documentation/ .
 276
 277 The reference used must be the same between PoN and case samples.
 278
 279 - ``CNVSomaticPanelWorkflow.gatk_docker`` -- GATK Docker image (e.g., ``broadinstitute/gatk:latest``).
 280 - ``CNVSomaticPanelWorkflow.intervals`` -- Picard or GATK-style interval list.  For WGS, this should typically only include the autosomal chromosomes.
 281 - ``CNVSomaticPanelWorkflow.normal_bais`` -- List of BAI files.  This list must correspond to `normal_bams`.  For example, `["Sample1.bai", "Sample2.bai"]`.
 282 - ``CNVSomaticPanelWorkflow.normal_bams`` -- List of BAM files.  This list must correspond to `normal_bais`.  For example, `["Sample1.bam", "Sample2.bam"]`.
 283 - ``CNVSomaticPanelWorkflow.pon_entity_id`` -- Name of the final PoN file.
 284 - ``CNVSomaticPanelWorkflow.ref_fasta_dict`` -- Path to reference dict file.
 285 - ``CNVSomaticPanelWorkflow.ref_fasta_fai`` -- Path to reference fasta fai file.
 286 - ``CNVSomaticPanelWorkflow.ref_fasta`` -- Path to reference fasta file.
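
Because `normal_bams` and `normal_bais` must correspond entry by entry, a quick order check before writing them into the json can catch mismatches. This is a sketch under assumptions: `check_bam_bai_order` is a hypothetical helper, it assumes the `Sample1.bam` / `Sample1.bai` naming used in the examples above, and it does not handle paths containing spaces.

```bash
# Hypothetical check: the i-th BAI must share its stem with the i-th BAM.
# Both arguments are space-separated lists of file names.
check_bam_bai_order() {
  bams="$1"
  bais="$2"
  set -- $bais  # positional parameters now hold the BAI list
  for bam in $bams; do
    [ "$#" -gt 0 ] || { echo "missing index for $bam" >&2; return 1; }
    [ "${bam%.bam}" = "${1%.bai}" ] || { echo "mismatch: $bam vs $1" >&2; return 1; }
    shift
  done
  [ "$#" -eq 0 ] || { echo "extra index files: $*" >&2; return 1; }
}

check_bam_bai_order "Sample1.bam Sample2.bam" "Sample1.bai Sample2.bai" && echo "lists correspond"
```

If your indices are named `Sample1.bam.bai` instead, the stem comparison would need to be adjusted accordingly.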
 287
 288 In addition, there are optional workflow-level and task-level parameters that may be set by advanced users; for example:
 289
 290 - ``CNVSomaticPanelWorkflow.do_explicit_gc_correction`` -- (optional) If true, perform explicit GC-bias correction when creating PoN and in subsequent denoising of case samples.  If false, rely on PCA-based denoising to correct for GC bias.
 291 - ``CNVSomaticPanelWorkflow.PreprocessIntervals.bin_length`` -- Size of bins (in bp) for coverage collection.  *This must be the same value used for all case samples.*
 292 - ``CNVSomaticPanelWorkflow.PreprocessIntervals.padding`` -- Amount of padding (in bp) to add to both sides of targets for WES coverage collection.  *This must be the same value used for all case samples.*
 293
 294 Further explanation of other task-level parameters may be found by invoking the ``--help`` documentation available in the gatk.jar for each tool.
 295
 296 #### Required parameters in the somatic pair workflow
 297
 298 The reference and bins (if specified) must be the same between PoN and case samples.
 299
 300 - ``CNVSomaticPairWorkflow.common_sites`` -- Picard or GATK-style interval list of common sites to use for collecting allelic counts.
 301 - ``CNVSomaticPairWorkflow.gatk_docker`` -- GATK Docker image (e.g., ``broadinstitute/gatk:latest``).
 302 - ``CNVSomaticPairWorkflow.intervals`` -- Picard or GATK-style interval list.  For WGS, this should typically only include the autosomal chromosomes.
 303 - ``CNVSomaticPairWorkflow.normal_bam`` -- Path to normal BAM file.
 304 - ``CNVSomaticPairWorkflow.normal_bam_idx`` -- Path to normal BAM file index.
 305 - ``CNVSomaticPairWorkflow.read_count_pon`` -- Path to read-count PoN created by the panel workflow.
 306 - ``CNVSomaticPairWorkflow.ref_fasta_dict`` -- Path to reference dict file.
 307 - ``CNVSomaticPairWorkflow.ref_fasta_fai`` -- Path to reference fasta fai file.
 308 - ``CNVSomaticPairWorkflow.ref_fasta`` -- Path to reference fasta file.
 309 - ``CNVSomaticPairWorkflow.tumor_bam`` -- Path to tumor BAM file.
 310 - ``CNVSomaticPairWorkflow.tumor_bam_idx`` -- Path to tumor BAM file index.
 311
 312 In addition, there are several task-level parameters that may be set by advanced users as above.
 313
 314 To invoke Oncotator on the called tumor copy-ratio segments:
 315
 316 - ``CNVSomaticPairWorkflow.is_run_oncotator`` -- (optional) If true, run Oncotator on the called copy-ratio segments.  This will generate both a simple TSV and a gene list.
 317
 318 ---
 319