Configuration
The workflow needs to be configured to perform the analysis of your choice by editing the following files in the config/
folder. Each file is described in more detail below.
Configuration file
The pipeline offers three run modes. Please specify the run mode in config.yaml
. The name of the file defaults to config.yaml
but you can use other filenames in conjunction with Snakemake’s --configfile
command line flag.
full: This mode starts with FASTQs and emits a joint-called, filtered, multi-sample VCF.
joint_genotyping: This mode starts with gVCFs and runs joint-calling and filtering, emitting a multi-sample VCF. In the event you have analyzed batches of samples in the full-run mode, these batches can then jointly re-genotyped with this run mode.
fastqc_only: This mode starts with FASTQs and emits trimmed FASTQs as well as a multiQC report for raw and trimmed reads. This run mode is meant for performing QC on FASTQ data before further downstream analysis.
Manifest file
You will need to provide a headerless, white-space delimited manifest file to run the pipeline for all three run-modes. The default name for the file is manifest.txt
but this is user configurable in the config file under sampleFile
.
For full and fastqc_only mode, the manifest.txt
requires the following columns:
Columns:
readgroup sample_ID path/to/r1.fastq path/to/r2.fastq
readgroup
values should be unique, e.g. <sampleID>_<barcode>_<lane>sample_ID
should be the same for all FASTQ pairs from a single sample, and can be different from the FASTQ filenames
For example:
Sample1_S1_L001 Sample1 input/Sample_001_S1_L001_R1.fastq input/Sample_001_S1_L001_R2.fastq
Sample1_S1_L002 Sample1 input/Sample_001_S1_L002_R1.fastq input/Sample_001_S1_L002_R2.fastq
For joint_genotyping mode:
Columns:
sample_ID path/to/file.g.vcf.gz
sample_ID
values should be unique, and should correspond to the sample IDs in the gVCF headergVCFs should be bgzipped and indexed
For example:
Sample1 vcfs/Sample1.g.vcf.gz
Sample2 vcfs/Sample2.g.vcf.gz
Intervals file
For full and joint_genotyping modes only.
Joint-calling for a large number of samples is computationally expensive and time-consuming. This pipeline was designed to mitigate these issues by parallelizing joint-calling over multiple intervals of the genome. To specify the number of intervals, and which regions to parallelize over, a 2-column tab-delmited intervals.tsv
file can be specified. The filename can be customized and edited in the config file under intervalsFile
.
This file contains two columns with headers:
interval_name
for the name of the particular interval or regionfile_path
full path to the interval/region BED file, Picard-style.interval_list
, VCF file, or GATK-style.list
or.intervals
file (see further details on these formats here)
For example:
interval_name file_path
interval_1 resources/scattered_calling_intervals/interval_1.bed
interval_2 resources/scattered_calling_intervals/interval_2.bed
The pipeline will supply these interval files to the GATK HaplotypeCaller
, GenomicsDBImport
, and GenotypeGVCFs
steps to run concurrent instances of these rules at each specified interval(s), reducing overall execution time.
We recommend specifying regions of equal size for parallelization.
Sex linker file
The pipeline provides a boolean option somalier
to estimate relatedness amongst the samples using Somalier in the config.yaml
(see check_relatedness
parameter in Configuration). This requires a 2-column, tab-delimited file. The filename defaults to sex_linker.tsv
and is specified in the config.yaml
under sexLinker
. This file requires:
First column with the header
Sample
with all sample namesSecond column with the header
Sex
containing case-insensitive encodings of sex in either m/f or male/female format
For example:
Sample Sex
NA12878 F
Subject1 female
Subject2 m
MultiQC configuration
A configuration file for MultiQC can be found in config/multiqc.yaml
and is used for generating and specifying the order of the various modules in the MultiQC report from the pipeline. We do not recommend modifying this file unless you understand how this configuration file is setup or how MultiQC works.
Config parameters
Below are descriptions and usage options for the various config parameters specified in config.yaml
.
Parameter |
Required |
Description |
---|---|---|
|
Y |
Manifest file with IDs |
|
Y |
File with interval names and file paths |
|
Y |
Max jobs to run concurrently |
|
Y |
File with reported sex of each sample ID |
|
Y |
Location of temp directory; does not have to exist prior to pipeline execution |
|
Y |
Specify run mode to use (see below) |
|
Y |
|
|
Y |
|
|
Y |
|
|
N |
Set global java options |
|
N |
Used to submit jobs to a cluster only if you are using the optional wrapper script. See Execution |
|
Y |
Name of your default cluster partition/queue; can be |
|
Y |
Name of queue/partition best suited for compute-
intensive jobs; can be |
|
Y |
Name of queue/partition best suited for memory-intensive
jobs; can be |
|
Y |
Name of sequencing center for use in |
|
Y |
Max concurrent jobs for specific high-bandwidth rules,
to avoid potentially hitting bandwidth caps if deployed
in a cloud environment; see wrapper script for an example
of how to pass this in to snakemake. Set to the same
number as |
|
Y |
Max het/hom ratio to allow through post-calling QC |
|
Y |
Minimum depth required for sample to pass post-calling QC |
|
Y |
Max % contamination to allow through post-calling QC |
|
Y |
(minutes) Exclude rules from the benchmarking report if elapsed time is below this threshold |
|
Y |
Check relatedness and sex discordance with Somalier (requires sex_linker.tsv) only available in full run mode. Support of Mac OSX is experimental, so you may want to set this to False on a Mac |
The remainder of the config.yaml
file contains a selected set of exposed per-tool parameters. For the most part, this allows tuning of resource allocation on a per-tool basis (i.e. threads
and memory
in MB). Java-based tools also allow for arbitrary java options to be passed through via java_opts
. Additional exposed parameters include:
genomicsDBImport
andgenotypeGVCFs
: We have exposed some useful parameters that have been helpful to adjust as scale increases. Please see GATK documentation for the relevant tools to learn more.verifyBamID
: Aregion
field allows the user to specify chromosomes over which to run contamination analysis, in an attempt to mitigate large memory requirements.