Overview
This workflow was designed by the Genomics & Data Science (GDS) team at 54gene and is used to analyze paired-end, short-read germline whole-genome sequencing data. The pipeline is designed to be deployed first in small batches (e.g. per flow cell), starting with FASTQs and producing gVCFs and a small-batch joint-called VCF. A second run of the pipeline can take a larger batch of gVCFs (e.g. gVCFs derived from many flow cells) and generate a large-batch joint-called VCF. The workflow, which is designed to support reproducible bioinformatics, is written in Snakemake and is platform-agnostic. All dependencies are installed by the pipeline as needed using conda. Development and testing have been done predominantly on AWS ParallelCluster running Amazon Linux, with Snakemake version 7.8.2.
Features:
Read filtering and trimming
Read alignment, deduplication, and BQSR
Variant calling and filtering
Joint-genotyping
Sex discordance and relatedness assessment
MultiQC report generation
To install the latest release, type:
git clone https://gitlab.com/data-analysis5/dna-sequencing/54gene-wgs-germline.git
Inputs
The pipeline requires the following inputs:
A headerless, whitespace-delimited manifest.txt file with sample names and paths (columns dependent on the run-mode)
A config file with the run-mode specified and other pipeline parameters configured (see the default config provided in config/config.yaml)
A tab-delimited intervals.tsv file with names of intervals and paths to region (BED) files of the genome by which to parallelize the variant calling and joint-calling steps (e.g. 50 BED files, each covering a small region of the genome)
A tab-delimited sex_linker.tsv file with the sample names in one column and sex in the other, used to identify discordances between reported and inferred sex
A multiqc.yaml config file for generating MultiQC reports (provided for you)
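Before launching a run, it can help to confirm that the delimited input files parse as expected. The following is a minimal sketch and not part of the pipeline: the helper names, the example paths, the column order (name first, then path or sex), the absence of header rows in the TSV files, and the "F" sex encoding are all assumptions made for illustration. Consult the Configuration documentation for the exact layouts.

```python
import csv
import io


def read_manifest(handle):
    """Split each non-blank line of a headerless manifest on any whitespace."""
    return [line.split() for line in handle if line.strip()]


def read_two_column_tsv(handle):
    """Parse tab-delimited text into (first, second) tuples, skipping blank lines."""
    rows = []
    for line_number, fields in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not fields or all(not field.strip() for field in fields):
            continue
        if len(fields) != 2:
            raise ValueError(
                f"line {line_number}: expected 2 columns, got {len(fields)}"
            )
        rows.append((fields[0].strip(), fields[1].strip()))
    return rows


# Hypothetical file contents; real column layouts depend on the run-mode.
manifest = read_manifest(io.StringIO("sample_A path/to/file_1 path/to/file_2\n"))

intervals = read_two_column_tsv(
    io.StringIO(
        "interval_1\tresources/beds/interval_1.bed\n"
        "interval_2\tresources/beds/interval_2.bed\n"
    )
)

# Assumed layout: sample name, then sex (the "F" encoding is an assumption).
sexes = read_two_column_tsv(io.StringIO("sample_A\tF\n"))

print(manifest)
print(intervals)
print(sexes)
```

To check real files, open them with `open(path, newline="")` and pass the file handle in place of the `io.StringIO` objects.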
Outputs
Depending on which run-mode you have set, you will be able to generate:
A hard-filtered, multi-sample joint-called VCF in full and joint_genotyping mode
Per-sample gVCFs for all regions of the genome for future joint-calling in full mode
Deduplicated and post-BQSR BAM files in full mode
Various QC metrics (e.g. FastQC, MultiQC, bcftools stats) in all three modes
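The relationship between run-mode and outputs listed above can be summarized as a small lookup. This is an illustrative sketch only: the function and variable names are hypothetical, and because this section does not name the third run-mode, the sketch simply treats any mode other than full and joint_genotyping as producing QC metrics alone.

```python
# QC metrics are produced in every run-mode, per the list above.
QC_OUTPUT = "QC metrics (e.g. FastQC, MultiQC, bcftools stats)"

# Outputs that are specific to the two run-modes named in this section.
MODE_SPECIFIC_OUTPUTS = {
    "full": [
        "hard-filtered, multi-sample joint-called VCF",
        "per-sample gVCFs",
        "deduplicated, post-BQSR BAM files",
    ],
    "joint_genotyping": [
        "hard-filtered, multi-sample joint-called VCF",
    ],
}


def expected_outputs(run_mode):
    """Return the outputs described for a run-mode (hypothetical helper)."""
    # Modes not listed above fall back to QC metrics only (an assumption).
    return MODE_SPECIFIC_OUTPUTS.get(run_mode, []) + [QC_OUTPUT]


for mode in ("full", "joint_genotyping"):
    print(mode, "->", expected_outputs(mode))
```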
See the Installation, Execution, and Configuration sections for details on setting up and running the pipeline.