1 ## Copyright Broad Institute, 2017
3 ## This WDL workflow runs GATK4 Mutect 2 on a single tumor-normal pair or on a single tumor sample,
4 ## and performs additional filtering and functional annotation tasks.
6 ## Main requirements/expectations :
7 ## - One analysis-ready BAM file (and its index) for each sample
9 ## Description of inputs:
12 ## gatk_docker, oncotator_docker: docker images to use for GATK 4 Mutect2 and for Oncotator
13 ## preemptible_attempts: how many preemptions to tolerate before switching to a non-preemptible machine (on Google)
14 ## max_retries: how many times to retry failed tasks -- very important on the cloud when there are transient errors
15 ## gatk_override: (optional) local file or Google bucket path to a GATK 4 java jar file to be used instead of the GATK 4 jar
16 ## in the docker image. This must be supplied when running in an environment that does not support docker
17 ## (e.g. SGE cluster on a Broad on-prem VM)
19 ## ** Workflow options **
20 ## intervals: genomic intervals (will be used for scatter)
21 ## scatter_count: number of parallel jobs to generate when scattering over intervals
22 ## artifact_modes: types of artifacts to consider in the orientation bias filter (optional)
23 ## m2_extra_args, m2_extra_filtering_args: additional arguments for Mutect2 calling and filtering (optional)
24 ## split_intervals_extra_args: additional arguments for splitting intervals before scattering (optional)
25 ## run_orientation_bias_filter: (deprecated) if true, run the orientation bias filter (optional)
26 ## run_orientation_bias_mixture_model_filter: if true, filter orientation bias sites based on the posterior probabilities computed by the read orientation artifact mixture model (optional)
27 ## run_oncotator: if true, annotate the M2 VCFs using oncotator (to produce a TCGA MAF). Important: This requires a
28 ## docker image and should not be run in environments where docker is unavailable (e.g. SGE cluster on
29 ## a Broad on-prem VM). Access to docker hub is also required, since the task downloads a public docker image.
30 ## (optional, false by default)
32 ## ** Primary inputs **
33 ## ref_fasta, ref_fai, ref_dict: reference genome, index, and dictionary
34 ## tumor_bam, tumor_bai: BAM and index for the tumor sample
35 ## normal_bam, normal_bai: BAM and index for the normal sample (optional; omit for tumor-only mode)
37 ## ** Primary resources ** (optional but strongly recommended)
38 ## pon, pon_index: optional panel of normals in VCF format containing probable technical artifacts (false positives)
39 ## gnomad, gnomad_index: optional database of known germline variants (see http://gnomad.broadinstitute.org/downloads)
40 ## variants_for_contamination, variants_for_contamination_index: VCF of common variants with allele frequencies for calculating contamination
42 ## ** Secondary resources ** (for optional tasks)
43 ## onco_ds_tar_gz, default_config_file: Oncotator datasources and config file
44 ## sequencing_center, sequence_source: metadata for Oncotator
45 ## filter_oncotator_maf: Whether the MAF generated by oncotator should have the filtered variants removed. Default: true
46 ## realignment_index_bundle: resource for FilterAlignmentArtifacts, which runs if and only if it is specified. Generated by BwaMemIndexImageCreator.
48 ## Funcotator parameters (see Funcotator help for more details).
49 ## funco_reference_version: "hg19" for hg19 or b37. "hg38" for hg38. Default: "hg19"
50 ## funco_transcript_selection_list: Transcripts (one GENCODE ID per line) to give priority during selection process.
51 ## funco_transcript_selection_mode: How to select transcripts in Funcotator. ALL, CANONICAL, or BEST_EFFECT
52 ## funco_data_sources_tar_gz: Funcotator datasources tar gz file. Bucket location is recommended when running on the cloud.
53 ## funco_annotation_defaults: Default values for annotations, when values are unspecified. Specified as <ANNOTATION>:<VALUE>. For example: "Center:Broad"
54 ## funco_annotation_overrides: Values that override annotations, even when values are already specified. Specified as <ANNOTATION>:<VALUE>. For example: "Center:Broad"
57 ## - One VCF file and its index with primary filtering applied; secondary filtering and functional annotation if requested; a bamout.bam
58 ## file of reassembled reads if requested
60 ## Cromwell version support
61 ## - Successfully tested on v29
64 ## This script is released under the WDL source code license (BSD-3) (see LICENSE in
65 ## https://github.com/broadinstitute/wdl). Note however that the programs it calls may
66 ## be subject to different licenses. Users are responsible for checking that they are
67 ## authorized to run all programs before running this script. Please see the docker
68 ## pages at https://hub.docker.com/r/broadinstitute/* for detailed licensing information
69 ## pertaining to the included programs.
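##
## Example inputs (illustrative sketch only): a minimal tumor-normal inputs JSON built from the input names
## described above. The workflow name "Mutect2" and every path are assumptions made for illustration, and the
## list may not be exhaustive (other inputs, such as gatk_docker, may be required in your environment).
##
##   {
##     "Mutect2.ref_fasta": "/path/to/reference.fasta",
##     "Mutect2.ref_fai": "/path/to/reference.fasta.fai",
##     "Mutect2.ref_dict": "/path/to/reference.dict",
##     "Mutect2.tumor_bam": "/path/to/tumor.bam",
##     "Mutect2.tumor_bai": "/path/to/tumor.bai",
##     "Mutect2.normal_bam": "/path/to/normal.bam",
##     "Mutect2.normal_bai": "/path/to/normal.bai",
##     "Mutect2.intervals": "/path/to/targets.interval_list",
##     "Mutect2.scatter_count": 10,
##     "Mutect2.run_funcotator": false
##   }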
85 File? variants_for_contamination
86 File? variants_for_contamination_index
87 File? realignment_index_bundle
88 String? realignment_extra_args
89 Boolean? run_orientation_bias_filter
90 Boolean run_ob_filter = select_first([run_orientation_bias_filter, false]) && (length(select_first([artifact_modes, ["G/T", "C/T"]])) > 0)
91 Boolean? run_orientation_bias_mixture_model_filter
92 Boolean run_ob_mm_filter = select_first([run_orientation_bias_mixture_model_filter, false])
93 File? ob_mm_filter_training_intervals
94 Array[String]? artifact_modes
95 File? tumor_sequencing_artifact_metrics
97 String? m2_extra_filtering_args
98 String? split_intervals_extra_args
100 Boolean make_bamout_or_default = select_first([make_bamout, false])
101 Boolean? compress_vcfs
102 Boolean compress = select_first([compress_vcfs, false])
107 Boolean? run_oncotator
108 Boolean run_oncotator_or_default = select_first([run_oncotator, false])
110 String? onco_ds_local_db_dir
111 String? sequencing_center
112 String? sequence_source
113 File? default_config_file
116 Boolean? run_funcotator
117 Boolean run_funcotator_or_default = select_first([run_funcotator, false])
118 String? funco_reference_version
119 File? funco_data_sources_tar_gz
120 String? funco_transcript_selection_mode
121 File? funco_transcript_selection_list
122 Array[String]? funco_annotation_defaults
123 Array[String]? funco_annotation_overrides
129 String basic_bash_docker = "ubuntu:16.04"
130 String? oncotator_docker
131 String oncotator_docker_or_default = select_first([oncotator_docker, "broadinstitute/oncotator:1.9.9.0"])
132 Boolean? filter_oncotator_maf
133 Boolean filter_oncotator_maf_or_default = select_first([filter_oncotator_maf, true])
134 Boolean? filter_funcotations
135 Boolean filter_funcotations_or_default = select_first([filter_funcotations, true])
136 String? oncotator_extra_args
137 String? funcotator_extra_args
139 Int? preemptible_attempts
142 # Use as a last resort to increase the disk given to every task in case of ill-behaved data
143 Int? emergency_extra_disk
145 # Disk sizes used for dynamic sizing
146 Int ref_size = ceil(size(ref_fasta, "GB") + size(ref_dict, "GB") + size(ref_fai, "GB"))
147 Int tumor_bam_size = ceil(size(tumor_bam, "GB") + size(tumor_bai, "GB"))
148 Int gnomad_vcf_size = if defined(gnomad) then ceil(size(gnomad, "GB") + size(gnomad_index, "GB")) else 0
149 Int normal_bam_size = if defined(normal_bam) then ceil(size(normal_bam, "GB") + size(normal_bai, "GB")) else 0
151 # If no tar is provided, the task downloads one from the Broad FTP server
152 Int onco_tar_size = if defined(onco_ds_tar_gz) then ceil(size(onco_ds_tar_gz, "GB") * 3) else 100
153 Int funco_tar_size = if defined(funco_data_sources_tar_gz) then ceil(size(funco_data_sources_tar_gz, "GB") * 3) else 100
154 Int gatk_override_size = if defined(gatk_override) then ceil(size(gatk_override, "GB")) else 0
156 # This is added to every task as padding; increase it if every call systematically needs more disk
157 Int disk_pad = 10 + gatk_override_size + select_first([emergency_extra_disk,0])
159 # These multipliers are applied to input sizes to make sure we have enough disk to accommodate possible output sizes
160 # Large is for Bams/WGS vcfs
161 # Small is for metrics/other vcfs
162 Float large_input_to_output_multiplier = 2.25
163 Float small_input_to_output_multiplier = 2.0
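# Worked example (hypothetical numbers, for illustration only): with no gatk_override and no
# emergency_extra_disk, disk_pad = 10 + 0 + 0 = 10 GB. If the scattered VCFs total 1.2 GB, the merge step requests
#   ceil(1.2 * large_input_to_output_multiplier) + disk_pad = ceil(2.7) + 10 = 13 GB.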
165 # logic about output file names -- these are the names *without* .vcf extensions
166 String output_basename = basename(tumor_bam, ".bam")
167 String unfiltered_name = output_basename + "-unfiltered"
168 String filtered_name = output_basename + "-filtered"
169 String funcotated_name = output_basename + "-funcotated"
171 String output_vcf_name = basename(tumor_bam, ".bam") + ".vcf"
174 call SplitIntervals {
176 intervals = intervals,
177 ref_fasta = ref_fasta,
180 scatter_count = scatter_count,
181 split_intervals_extra_args = split_intervals_extra_args,
182 gatk_override = gatk_override,
183 gatk_docker = gatk_docker,
184 preemptible_attempts = preemptible_attempts,
185 max_retries = max_retries,
186 disk_space = ref_size + ceil(size(intervals, "GB") * small_input_to_output_multiplier) + disk_pad
189 Int m2_output_size = tumor_bam_size / scatter_count
190 scatter (subintervals in SplitIntervals.interval_files ) {
193 intervals = subintervals,
194 ref_fasta = ref_fasta,
197 tumor_bam = tumor_bam,
198 tumor_bai = tumor_bai,
199 normal_bam = normal_bam,
200 normal_bai = normal_bai,
202 pon_index = pon_index,
204 gnomad_index = gnomad_index,
205 preemptible_attempts = preemptible_attempts,
206 max_retries = max_retries,
207 m2_extra_args = m2_extra_args,
208 make_bamout = make_bamout_or_default,
209 artifact_prior_table = LearnReadOrientationModel.artifact_prior_table,
212 gga_vcf_idx = gga_vcf_idx,
213 gatk_override = gatk_override,
214 gatk_docker = gatk_docker,
215 disk_space = tumor_bam_size + normal_bam_size + ref_size + gnomad_vcf_size + m2_output_size + disk_pad
218 Float sub_vcf_size = size(M2.unfiltered_vcf, "GB")
219 Float sub_bamout_size = size(M2.output_bamOut, "GB")
222 call SumFloats as SumSubVcfs {
224 sizes = sub_vcf_size,
225 preemptible_attempts = preemptible_attempts,
226 max_retries = max_retries
231 input_vcfs = M2.unfiltered_vcf,
232 input_vcf_indices = M2.unfiltered_vcf_index,
233 output_name = unfiltered_name,
235 gatk_override = gatk_override,
236 gatk_docker = gatk_docker,
237 preemptible_attempts = preemptible_attempts,
238 max_retries = max_retries,
239 disk_space = ceil(SumSubVcfs.total_size * large_input_to_output_multiplier) + disk_pad
242 if (make_bamout_or_default) {
243 call SumFloats as SumSubBamouts {
245 sizes = sub_bamout_size,
246 preemptible_attempts = preemptible_attempts,
247 max_retries = max_retries
252 ref_fasta = ref_fasta,
255 bam_outs = M2.output_bamOut,
256 output_vcf_name = basename(MergeVCFs.merged_vcf, ".vcf"),
257 gatk_override = gatk_override,
258 gatk_docker = gatk_docker,
259 disk_space = ceil(SumSubBamouts.total_size * large_input_to_output_multiplier) + disk_pad,
260 max_retries = max_retries
264 if (run_ob_filter && !defined(tumor_sequencing_artifact_metrics)) {
265 call CollectSequencingArtifactMetrics {
267 gatk_docker = gatk_docker,
268 ref_fasta = ref_fasta,
270 preemptible_attempts = preemptible_attempts,
271 max_retries = max_retries,
272 tumor_bam = tumor_bam,
273 tumor_bai = tumor_bai,
274 gatk_override = gatk_override,
275 disk_space = tumor_bam_size + ref_size + disk_pad
279 if (run_ob_mm_filter) {
280 call CollectF1R2Counts {
282 gatk_docker = gatk_docker,
283 ref_fasta = ref_fasta,
286 preemptible_attempts = preemptible_attempts,
287 tumor_bam = tumor_bam,
288 tumor_bai = tumor_bai,
289 gatk_override = gatk_override,
290 disk_space = tumor_bam_size + ref_size + disk_pad,
291 intervals = if defined(ob_mm_filter_training_intervals) then ob_mm_filter_training_intervals else intervals,
292 max_retries = max_retries
295 call LearnReadOrientationModel {
297 alt_table = CollectF1R2Counts.alt_table,
298 ref_histogram = CollectF1R2Counts.ref_histogram,
299 alt_histograms = CollectF1R2Counts.alt_histograms,
300 tumor_sample = CollectF1R2Counts.tumor_sample,
301 gatk_override = gatk_override,
302 gatk_docker = gatk_docker,
303 preemptible_attempts = preemptible_attempts,
304 max_retries = max_retries
308 if (defined(variants_for_contamination)) {
309 call CalculateContamination {
311 gatk_override = gatk_override,
312 intervals = intervals,
313 ref_fasta = ref_fasta,
316 preemptible_attempts = preemptible_attempts,
317 max_retries = max_retries,
318 gatk_docker = gatk_docker,
319 tumor_bam = tumor_bam,
320 tumor_bai = tumor_bai,
321 normal_bam = normal_bam,
322 normal_bai = normal_bai,
323 variants_for_contamination = variants_for_contamination,
324 variants_for_contamination_index = variants_for_contamination_index,
325 disk_space = tumor_bam_size + normal_bam_size + ceil(size(variants_for_contamination, "GB") * small_input_to_output_multiplier) + disk_pad
331 gatk_override = gatk_override,
332 gatk_docker = gatk_docker,
333 intervals = intervals,
334 unfiltered_vcf = MergeVCFs.merged_vcf,
335 unfiltered_vcf_index = MergeVCFs.merged_vcf_index,
336 output_name = filtered_name,
338 preemptible_attempts = preemptible_attempts,
339 max_retries = max_retries,
340 contamination_table = CalculateContamination.contamination_table,
341 maf_segments = CalculateContamination.maf_segments,
342 m2_extra_filtering_args = m2_extra_filtering_args,
343 disk_space = ceil(size(MergeVCFs.merged_vcf, "GB") * small_input_to_output_multiplier) + disk_pad
347 # Get the metrics either from the workflow input or CollectSequencingArtifactMetrics if no workflow input is provided
348 File input_artifact_metrics = select_first([tumor_sequencing_artifact_metrics, CollectSequencingArtifactMetrics.pre_adapter_metrics])
350 call FilterByOrientationBias {
352 gatk_override = gatk_override,
353 input_vcf = Filter.filtered_vcf,
354 input_vcf_index = Filter.filtered_vcf_index,
355 output_name = filtered_name,
357 gatk_docker = gatk_docker,
358 preemptible_attempts = preemptible_attempts,
359 max_retries = max_retries,
360 pre_adapter_metrics = input_artifact_metrics,
361 artifact_modes = artifact_modes,
362 disk_space = ceil(size(Filter.filtered_vcf, "GB") * small_input_to_output_multiplier) + ceil(size(input_artifact_metrics, "GB")) + disk_pad
366 if (defined(realignment_index_bundle)) {
367 File realignment_filter_input = select_first([FilterByOrientationBias.filtered_vcf, Filter.filtered_vcf])
368 File realignment_filter_input_idx = select_first([FilterByOrientationBias.filtered_vcf_index, Filter.filtered_vcf_index])
369 call FilterAlignmentArtifacts {
371 gatk_override = gatk_override,
374 realignment_index_bundle = select_first([realignment_index_bundle]),
375 realignment_extra_args = realignment_extra_args,
376 gatk_docker = gatk_docker,
377 max_retries = max_retries,
379 output_name = filtered_name,
380 input_vcf = realignment_filter_input,
381 input_vcf_idx = realignment_filter_input_idx
385 if (run_oncotator_or_default) {
386 File oncotate_vcf_input = select_first([FilterAlignmentArtifacts.filtered_vcf, FilterByOrientationBias.filtered_vcf, Filter.filtered_vcf])
389 m2_vcf = oncotate_vcf_input,
390 onco_ds_tar_gz = onco_ds_tar_gz,
391 onco_ds_local_db_dir = onco_ds_local_db_dir,
392 sequencing_center = sequencing_center,
393 sequence_source = sequence_source,
394 default_config_file = default_config_file,
395 case_id = M2.tumor_sample[0],
396 control_id = M2.normal_sample[0],
397 oncotator_docker = oncotator_docker_or_default,
398 preemptible_attempts = preemptible_attempts,
399 max_retries = max_retries,
400 disk_space = ceil(size(oncotate_vcf_input, "GB") * large_input_to_output_multiplier) + onco_tar_size + disk_pad,
401 filter_maf = filter_oncotator_maf_or_default,
402 oncotator_extra_args = oncotator_extra_args
406 if (run_funcotator_or_default) {
407 File funcotate_vcf_input = select_first([FilterAlignmentArtifacts.filtered_vcf, FilterByOrientationBias.filtered_vcf, Filter.filtered_vcf])
408 File funcotate_vcf_input_index = select_first([FilterAlignmentArtifacts.filtered_vcf_index, FilterByOrientationBias.filtered_vcf_index, Filter.filtered_vcf_index])
411 input_vcf = funcotate_vcf_input,
412 input_vcf_idx = funcotate_vcf_input_index,
413 ref_fasta = ref_fasta,
414 ref_fasta_index = ref_fai,
416 reference_version = select_first([funco_reference_version, "hg19"]),
417 data_sources_tar_gz = funco_data_sources_tar_gz,
418 case_id = M2.tumor_sample[0],
419 control_id = M2.normal_sample[0],
420 transcript_selection_mode = funco_transcript_selection_mode,
421 transcript_selection_list = funco_transcript_selection_list,
422 annotation_defaults = funco_annotation_defaults,
423 annotation_overrides = funco_annotation_overrides,
424 gatk_docker = gatk_docker,
425 gatk_override = gatk_override,
426 filter_funcotations = filter_funcotations_or_default,
427 sequencing_center = sequencing_center,
428 sequence_source = sequence_source,
429 disk_space_gb = ceil(size(funcotate_vcf_input, "GB") * large_input_to_output_multiplier) + funco_tar_size + disk_pad,
430 max_retries = max_retries,
431 extra_args = funcotator_extra_args
436 File filtered_vcf = select_first([FilterAlignmentArtifacts.filtered_vcf, FilterByOrientationBias.filtered_vcf, Filter.filtered_vcf])
437 File filtered_vcf_index = select_first([FilterAlignmentArtifacts.filtered_vcf_index, FilterByOrientationBias.filtered_vcf_index, Filter.filtered_vcf_index])
438 File? contamination_table = CalculateContamination.contamination_table
439 File? oncotated_m2_maf = oncotate_m2.oncotated_m2_maf
440 File? funcotated_maf = FuncotateMaf.funcotated_output
441 File? preadapter_detail_metrics = CollectSequencingArtifactMetrics.pre_adapter_metrics
442 File? bamout = MergeBamOuts.merged_bam_out
443 File? bamout_index = MergeBamOuts.merged_bam_out_index
444 File? maf_segments = CalculateContamination.maf_segments
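# How the final VCF is chosen (an illustrative reading of the select_first calls above, assuming each optional
# filter task runs only when its enabling condition holds): select_first returns the first defined element,
# so the most downstream filter that actually ran wins.
#   realignment_index_bundle is set  -> FilterAlignmentArtifacts.filtered_vcf
#   else run_ob_filter is true       -> FilterByOrientationBias.filtered_vcf
#   else                             -> Filter.filtered_vcf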
448 task SplitIntervals {
455 String? split_intervals_extra_args
462 Int? preemptible_attempts
466 Boolean use_ssd = false
468 # Mem is in units of GB but our command and memory runtime values are in MB
469 Int machine_mem = if defined(mem) then mem * 1000 else 3500
470 Int command_mem = machine_mem - 500
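# Worked example (hypothetical value): mem = 4 gives machine_mem = 4 * 1000 = 4000 MB and
# command_mem = 4000 - 500 = 3500 MB, so the JVM runs with -Xmx3500m and 500 MB is left for non-heap overhead.
# When mem is unset, machine_mem = 3500 MB and the JVM runs with -Xmx3000m.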
474 #export GATK_LOCAL_JAR=${default="/root/gatk.jar" gatk_override}
477 gatk --java-options "-Xmx${command_mem}m" SplitIntervals \
479 ${"-L " + intervals} \
480 -scatter ${scatter_count} \
482 ${split_intervals_extra_args}
483 cp interval-files/*.intervals .
489 memory: machine_mem + " MB"
490 disks: "local-disk " + select_first([disk_space, 100]) + if use_ssd then " SSD" else " HDD"
491 preemptible: select_first([preemptible_attempts, 10])
492 maxRetries: select_first([max_retries, 0])
493 cpu: select_first([cpu, 1])
497 Array[File] interval_files = glob("*.intervals")
515 String? m2_extra_args
520 File? artifact_prior_table
522 String output_vcf = "output" + if compress then ".vcf.gz" else ".vcf"
523 String output_vcf_index = output_vcf + if compress then ".tbi" else ".idx"
530 Int? preemptible_attempts
534 Boolean use_ssd = false
536 # Mem is in units of GB but our command and memory runtime values are in MB
537 Int machine_mem = if defined(mem) then mem * 1000 else 3500
538 Int command_mem = machine_mem - 500
544 #export GATK_LOCAL_JAR=${default="/root/gatk.jar" gatk_override}
546 # We need to create these files regardless, even if they stay empty
548 echo "" > normal_name.txt
550 gatk --java-options "-Xmx${command_mem}m" GetSampleName -R ${ref_fasta} -I ${tumor_bam} -O tumor_name.txt -encode
551 tumor_command_line="-I ${tumor_bam} -tumor `cat tumor_name.txt`"
553 if [[ -f "${normal_bam}" ]]; then
554 gatk --java-options "-Xmx${command_mem}m" GetSampleName -R ${ref_fasta} -I ${normal_bam} -O normal_name.txt -encode
555 normal_command_line="-I ${normal_bam} -normal `cat normal_name.txt`"
558 gatk --java-options "-Xmx${command_mem}m" Mutect2 \
560 $tumor_command_line \
561 $normal_command_line \
562 ${"--germline-resource " + gnomad} \
564 ${"-L " + intervals} \
565 ${"--genotyping-mode GENOTYPE_GIVEN_ALLELES --alleles " + gga_vcf} \
567 ${true='--bam-output bamout.bam' false='' make_bamout} \
568 ${"--orientation-bias-artifact-priors " + artifact_prior_table} \
575 memory: machine_mem + " MB"
576 disks: "local-disk " + select_first([disk_space, 100]) + if use_ssd then " SSD" else " HDD"
577 preemptible: select_first([preemptible_attempts, 10])
578 maxRetries: select_first([max_retries, 3])
579 cpu: select_first([cpu, 1])
583 File unfiltered_vcf = "${output_vcf}"
584 File unfiltered_vcf_index = "${output_vcf_index}"
585 File output_bamOut = "bamout.bam"
586 String tumor_sample = read_string("tumor_name.txt")
587 String normal_sample = read_string("normal_name.txt")
593 Array[File] input_vcfs
594 Array[File] input_vcf_indices
597 String output_vcf = output_name + if compress then ".vcf.gz" else ".vcf"
598 String output_vcf_index = output_vcf + if compress then ".tbi" else ".idx"
605 Int? preemptible_attempts
609 Boolean use_ssd = false
611 # Mem is in units of GB but our command and memory runtime values are in MB
612 Int machine_mem = if defined(mem) then mem * 1000 else 3500
613 Int command_mem = machine_mem - 1000
615 # using MergeVcfs instead of GatherVcfs so we can create indices
616 # WARNING 2015-10-28 15:01:48 GatherVcfs Index creation not currently supported when gathering block compressed VCFs.
619 #export GATK_LOCAL_JAR=${default="/root/gatk.jar" gatk_override}
620 gatk --java-options "-Xmx${command_mem}m" MergeVcfs -I ${sep=' -I ' input_vcfs} -O ${output_vcf}
626 memory: machine_mem + " MB"
627 disks: "local-disk " + select_first([disk_space, 100]) + if use_ssd then " SSD" else " HDD"
628 preemptible: select_first([preemptible_attempts, 10])
629 maxRetries: select_first([max_retries, 3])
630 cpu: select_first([cpu, 1])
634 File merged_vcf = "${output_vcf}"
635 File merged_vcf_index = "${output_vcf_index}"
644 Array[File]+ bam_outs
645 String output_vcf_name
652 Int? preemptible_attempts
656 Boolean use_ssd = false
658 # Mem is in units of GB but our command and memory runtime values are in MB
659 Int machine_mem = if defined(mem) then mem * 1000 else 7000
660 Int command_mem = machine_mem - 1000
663 # This command block assumes that there is at least one file in bam_outs.
664 # Do not call this task if len(bam_outs) == 0
666 #export GATK_LOCAL_JAR=${default="/root/gatk.jar" gatk_override}
667 gatk --java-options "-Xmx${command_mem}m" GatherBamFiles \
668 -I ${sep=" -I " bam_outs} -O unsorted.out.bam -R ${ref_fasta}
670 # We must sort because adjacent scatters may have overlapping (padded) assembly regions, hence
671 # overlapping bamouts
673 gatk --java-options "-Xmx${command_mem}m" SortSam -I unsorted.out.bam \
674 -O ${output_vcf_name}.out.bam \
675 --SORT_ORDER coordinate -VALIDATION_STRINGENCY LENIENT
676 gatk --java-options "-Xmx${command_mem}m" BuildBamIndex -I ${output_vcf_name}.out.bam -VALIDATION_STRINGENCY LENIENT
682 memory: machine_mem + " MB"
683 disks: "local-disk " + select_first([disk_space, 100]) + if use_ssd then " SSD" else " HDD"
684 preemptible: select_first([preemptible_attempts, 10])
685 maxRetries: select_first([max_retries, 3])
686 cpu: select_first([cpu, 1])
690 File merged_bam_out = "${output_vcf_name}.out.bam"
691 File merged_bam_out_index = "${output_vcf_name}.out.bai"
695 # This task is deprecated and is no longer supported
696 task CollectSequencingArtifactMetrics {
708 Int? preemptible_attempts
712 Boolean use_ssd = false
714 # Mem is in units of GB but our command and memory runtime values are in MB
715 Int machine_mem = if defined(mem) then mem * 1000 else 7000
716 Int command_mem = machine_mem - 1000
720 #export GATK_LOCAL_JAR=${default="/root/gatk.jar" gatk_override}
721 gatk --java-options "-Xmx${command_mem}m" CollectSequencingArtifactMetrics \
722 -I ${tumor_bam} -O "gatk" -R ${ref_fasta} -VALIDATION_STRINGENCY LENIENT
728 memory: machine_mem + " MB"
729 disks: "local-disk " + select_first([disk_space, 100]) + if use_ssd then " SSD" else " HDD"
730 preemptible: select_first([preemptible_attempts, 10])
731 maxRetries: select_first([max_retries, 3])
732 cpu: select_first([cpu, 1])
736 File pre_adapter_metrics = "gatk.pre_adapter_detail_metrics"
740 task CollectF1R2Counts {
755 Int? preemptible_attempts
758 Boolean use_ssd = false
760 # Mem is in units of GB but our command and memory runtime values are in MB
761 Int machine_mem = if defined(mem) then mem * 1000 else 7000
762 Int command_mem = machine_mem - 1000
766 #export GATK_LOCAL_JAR=${default="/root/gatk.jar" gatk_override}
768 # Get the sample name. The task M2 retrieves this information too, but it must be done separately here
769 # to avoid a cyclic dependency
770 gatk --java-options "-Xmx${command_mem}m" GetSampleName -R ${ref_fasta} -I ${tumor_bam} -O tumor_name.txt -encode
771 tumor_name=$(head -n 1 tumor_name.txt)
773 gatk --java-options "-Xmx${command_mem}m" CollectF1R2Counts \
774 -I ${tumor_bam} -R ${ref_fasta} \
775 ${"-L " + intervals} \
776 -alt-table "$tumor_name-alt.tsv" \
777 -ref-hist "$tumor_name-ref.metrics" \
778 -alt-hist "$tumor_name-alt-depth1.metrics"
784 memory: machine_mem + " MB"
785 disks: "local-disk " + select_first([disk_space, 100]) + if use_ssd then " SSD" else " HDD"
786 preemptible: select_first([preemptible_attempts, 10])
787 maxRetries: select_first([max_retries, 3])
788 cpu: select_first([cpu, 1])
792 File alt_table = glob("*-alt.tsv")[0]
793 File ref_histogram = glob("*-ref.metrics")[0]
794 File alt_histograms = glob("*-alt-depth1.metrics")[0]
795 String tumor_sample = read_string("tumor_name.txt")
799 task LearnReadOrientationModel {
812 Int? preemptible_attempts
815 Boolean use_ssd = false
817 # Mem is in units of GB but our command and memory runtime values are in MB
818 Int machine_mem = if defined(mem) then mem * 1000 else 8000
819 Int command_mem = machine_mem - 1000
823 #export GATK_LOCAL_JAR=${default="/root/gatk.jar" gatk_override}
825 gatk --java-options "-Xmx${command_mem}m" LearnReadOrientationModel \
826 -alt-table ${alt_table} \
827 -ref-hist ${ref_histogram} \
828 -alt-hist ${alt_histograms} \
829 -O "${tumor_sample}-artifact-prior-table.tsv"
835 memory: machine_mem + " MB"
836 disks: "local-disk " + select_first([disk_space, 100]) + if use_ssd then " SSD" else " HDD"
837 preemptible: select_first([preemptible_attempts, 10])
838 maxRetries: select_first([max_retries, 3])
839 cpu: select_first([cpu, 1])
843 File artifact_prior_table = "${tumor_sample}-artifact-prior-table.tsv"
848 task CalculateContamination {
858 File? variants_for_contamination
859 File? variants_for_contamination_index
864 Int? preemptible_attempts
870 # Mem is in units of GB but our command and memory runtime values are in MB
871 Int machine_mem = if defined(mem) then mem * 1000 else 3000
872 Int command_mem = machine_mem - 500
877 #export GATK_LOCAL_JAR=${default="/root/gatk.jar" gatk_override}
879 if [[ -f "${normal_bam}" ]]; then
880 gatk --java-options "-Xmx${command_mem}m" GetPileupSummaries -I ${normal_bam} ${"--interval-set-rule INTERSECTION -L " + intervals} \
881 -V ${variants_for_contamination} -L ${variants_for_contamination} -O normal_pileups.table
882 NORMAL_CMD="-matched normal_pileups.table"
885 gatk --java-options "-Xmx${command_mem}m" GetPileupSummaries -R ${ref_fasta} -I ${tumor_bam} ${"--interval-set-rule INTERSECTION -L " + intervals} \
886 -V ${variants_for_contamination} -L ${variants_for_contamination} -O pileups.table
887 gatk --java-options "-Xmx${command_mem}m" CalculateContamination -I pileups.table -O contamination.table --tumor-segmentation segments.table $NORMAL_CMD
893 memory: machine_mem + " MB"
894 disks: "local-disk " + select_first([disk_space, 100]) + " HDD"
895 preemptible: select_first([preemptible_attempts, 10])
896 maxRetries: select_first([max_retries, 3])
900 File pileups = "pileups.table"
901 File contamination_table = "contamination.table"
902 File maf_segments = "segments.table"
910 File unfiltered_vcf_index
913 String output_vcf = output_name + if compress then ".vcf.gz" else ".vcf"
914 String output_vcf_index = output_vcf + if compress then ".tbi" else ".idx"
915 File? contamination_table
917 String? m2_extra_filtering_args
924 Int? preemptible_attempts
928 Boolean use_ssd = false
930 # Mem is in units of GB but our command and memory runtime values are in MB
931 Int machine_mem = if defined(mem) then mem * 1000 else 7000
932 Int command_mem = machine_mem - 500
937 #export GATK_LOCAL_JAR=${default="/root/gatk.jar" gatk_override}
939 gatk --java-options "-Xmx${command_mem}m" FilterMutectCalls -V ${unfiltered_vcf} \
941 ${"--contamination-table " + contamination_table} \
942 ${"--tumor-segmentation " + maf_segments} \
943 ${m2_extra_filtering_args}
949 memory: machine_mem + " MB"
950 disks: "local-disk " + select_first([disk_space, 100]) + if use_ssd then " SSD" else " HDD"
951 preemptible: select_first([preemptible_attempts, 10])
952 maxRetries: select_first([max_retries, 3])
953 cpu: select_first([cpu, 1])
957 File filtered_vcf = "${output_vcf}"
958 File filtered_vcf_index = "${output_vcf_index}"
962 task FilterByOrientationBias {
969 String output_vcf = output_name + if compress then ".vcf.gz" else ".vcf"
970 String output_vcf_index = output_vcf + if compress then ".tbi" else ".idx"
971 File pre_adapter_metrics
972 Array[String]? artifact_modes
974 # If artifact modes is passed in to the task as [], this task will fail.
975 Array[String] final_artifact_modes = select_first([artifact_modes, ["G/T", "C/T"]])
978 Int? preemptible_attempts
984 Boolean use_ssd = false
986 # Mem is in units of GB but our command and memory runtime values are in MB
987 Int machine_mem = if defined(mem) then mem * 1000 else 7000
988 Int command_mem = machine_mem - 500
993 #export GATK_LOCAL_JAR=${default="/root/gatk.jar" gatk_override}
995 gatk --java-options "-Xmx${command_mem}m" FilterByOrientationBias \
997 -AM ${sep=" -AM " final_artifact_modes} \
998 -P ${pre_adapter_metrics} \
1003 #docker: gatk_docker
1005 memory: machine_mem + " MB"
1006 disks: "local-disk " + select_first([disk_space, 100]) + if use_ssd then " SSD" else " HDD"
1007 preemptible: select_first([preemptible_attempts, 10])
1008 maxRetries: select_first([max_retries, 3])
1009 cpu: select_first([cpu, 1])
1013 File filtered_vcf = "${output_vcf}"
1014 File filtered_vcf_index = "${output_vcf_index}"
1018 task FilterAlignmentArtifacts {
1027 String output_vcf = output_name + if compress then ".vcf.gz" else ".vcf"
1028 String output_vcf_index = output_vcf + if compress then ".tbi" else ".idx"
1029 File realignment_index_bundle
1030 String? realignment_extra_args
1035 Int? preemptible_attempts
1039 Boolean use_ssd = false
1041 # Mem is in units of GB but our command and memory runtime values are in MB
1042 Int machine_mem = if defined(mem) then mem * 1000 else 9000
1043 Int command_mem = machine_mem - 500
1048 #export GATK_LOCAL_JAR=${default="/root/gatk.jar" gatk_override}
1050 gatk --java-options "-Xmx${command_mem}m" FilterAlignmentArtifacts \
1053 --bwa-mem-index-image ${realignment_index_bundle} \
1054 ${realignment_extra_args} \
1059 #docker: gatk_docker
1061 memory: machine_mem + " MB"
1062 disks: "local-disk " + select_first([disk_space, 100]) + if use_ssd then " SSD" else " HDD"
1063 preemptible: select_first([preemptible_attempts, 10])
1064 maxRetries: select_first([max_retries, 3])
1065 cpu: select_first([cpu, 1])
1069 File filtered_vcf = "${output_vcf}"
1070 File filtered_vcf_index = "${output_vcf_index}"
File? onco_ds_tar_gz
String? onco_ds_local_db_dir
String? oncotator_exe
String? sequencing_center
String? sequence_source
File? default_config_file
String? oncotator_extra_args
String oncotator_docker
Int? preemptible_attempts
Boolean use_ssd = false
Boolean is_filter_maf = select_first([filter_maf, true])
String filter_maf_args = if (is_filter_maf) then " --collapse-filter-cols --prune-filter-cols " else ""
# Mem is in units of GB but our command and memory runtime values are in MB
Int machine_mem = if defined(mem) then mem * 1000 else 3500
Int command_mem = machine_mem - 500
# fail if *any* command below (not just the last) doesn't return 0, in particular if wget fails
# local db dir is a directory and has been specified
if [[ -d "${onco_ds_local_db_dir}" ]]; then
echo "Using local db-dir: ${onco_ds_local_db_dir}"
echo "THIS ONLY WORKS WITHOUT DOCKER!"
ln -s ${onco_ds_local_db_dir} onco_dbdir
elif [[ "${onco_ds_tar_gz}" == *.tar.gz ]]; then
echo "Using given tar file: ${onco_ds_tar_gz}"
tar zxvf ${onco_ds_tar_gz} -C onco_dbdir --strip-components 1
echo "Downloading and installing oncotator datasources from Broad FTP site..."
# Download and untar the db-dir
wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/oncotator/oncotator_v1_ds_April052016.tar.gz
tar zxvf oncotator_v1_ds_April052016.tar.gz
ln -s oncotator_v1_ds_April052016 onco_dbdir
${default="/root/oncotator_venv/bin/oncotator" oncotator_exe} --db-dir onco_dbdir/ -c $HOME/tx_exact_uniprot_matches.AKT1_CRLF2_FGFR1.txt \
-v ${m2_vcf} ${case_id}.maf.annotated hg19 -i VCF -o TCGAMAF --skip-no-alt --collapse-number-annotations --log_name oncotator.log \
-a Center:${default="Unknown" sequencing_center} \
-a source:${default="Unknown" sequence_source} \
-a normal_barcode:${control_id} \
-a tumor_barcode:${case_id} \
${"--default_config " + default_config_file} \
${filter_maf_args} \
${oncotator_extra_args}
#docker: oncotator_docker
memory: machine_mem + " MB"
disks: "local-disk " + select_first([disk_space, 100]) + if use_ssd then " SSD" else " HDD"
preemptible: select_first([preemptible_attempts, 10])
maxRetries: select_first([max_retries, 3])
cpu: select_first([cpu, 1])
File oncotated_m2_maf="${case_id}.maf.annotated"
# Calculates sum of a list of floats
# Runtime parameters
Int? preemptible_attempts
python -c "print(${sep="+" sizes})"
Float total_size = read_float(stdout())
#docker: "python:2.7"
disks: "local-disk " + 10 + " HDD"
preemptible: select_first([preemptible_attempts, 10])
maxRetries: select_first([max_retries, 3])
File ref_fasta_index
String reference_version
String output_format = "MAF"
String? sequencing_center
String? sequence_source
File? data_sources_tar_gz
String? transcript_selection_mode
File? transcript_selection_list
Array[String]? annotation_defaults
Array[String]? annotation_overrides
Boolean filter_funcotations
# Process input args:
String annotation_def_arg = if defined(annotation_defaults) then " --annotation-default " else ""
String annotation_over_arg = if defined(annotation_overrides) then " --annotation-override " else ""
String filter_funcotations_args = if (filter_funcotations) then " --remove-filtered-variants " else ""
String final_output_filename = basename(input_vcf, ".vcf") + ".maf.annotated"
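# Illustration of how the expression above resolves (file names are hypothetical):
#   input_vcf = "sample.vcf"    -> final_output_filename = "sample.maf.annotated"
#   input_vcf = "sample.vcf.gz" -> final_output_filename = "sample.vcf.gz.maf.annotated"
# basename() strips the given suffix only when the file name ends with it, so a compressed
# input keeps its ".vcf.gz" in the output name.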
Int? preemptible_attempts
Boolean use_ssd = false
# This should be updated when a new version of the data sources is released
String default_datasources_version = "funcotator_dataSources.v1.4.20180615"
# You may have to change the following two parameter values depending on the task requirements
Int default_ram_mb = 3000
# WARNING: In the workflow, you should calculate the disk space as an input to this task (disk_space_gb).
Int default_disk_space_gb = 100
# Mem is in units of GB but our command and memory runtime values are in MB
Int machine_mem = if defined(mem) then mem * 1000 else default_ram_mb
Int command_mem = machine_mem - 1000
#export GATK_LOCAL_JAR=${default="/root/gatk.jar" gatk_override}
DATA_SOURCES_TAR_GZ=${data_sources_tar_gz}
if [[ ! -e $DATA_SOURCES_TAR_GZ ]] ; then
# We have to download the data sources:
echo "Data sources gzip does not exist: $DATA_SOURCES_TAR_GZ"
echo "Downloading default data sources..."
wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/funcotator/${default_datasources_version}.tar.gz
tar -zxf ${default_datasources_version}.tar.gz
DATA_SOURCES_FOLDER=${default_datasources_version}
# Extract the tar.gz:
mkdir datasources_dir
tar zxvf ${data_sources_tar_gz} -C datasources_dir --strip-components 1
DATA_SOURCES_FOLDER="$PWD/datasources_dir"
gatk --java-options "-Xmx${command_mem}m" Funcotator \
--data-sources-path $DATA_SOURCES_FOLDER \
--ref-version ${reference_version} \
--output-file-format ${output_format} \
-O ${final_output_filename} \
${"-L " + interval_list} \
${"--transcript-selection-mode " + transcript_selection_mode} \
${"--transcript-list " + transcript_selection_list} \
--annotation-default normal_barcode:${control_id} \
--annotation-default tumor_barcode:${case_id} \
--annotation-default Center:${default="Unknown" sequencing_center} \
--annotation-default source:${default="Unknown" sequence_source} \
${annotation_def_arg}${default="" sep=" --annotation-default " annotation_defaults} \
${annotation_over_arg}${default="" sep=" --annotation-override " annotation_overrides} \
${filter_funcotations_args} \
#docker: gatk_docker
memory: machine_mem + " MB"
disks: "local-disk " + select_first([disk_space_gb, default_disk_space_gb]) + if use_ssd then " SSD" else " HDD"
preemptible: select_first([preemptible_attempts, 3])
maxRetries: select_first([max_retries, 3])
cpu: select_first([cpu, 1])
File funcotated_output = "${final_output_filename}"