Overview

This workflow was designed by the Genomics & Data Science team (GDS) at 54gene and is used to analyze paired-end short-read germline whole-genome sequencing data. This pipeline is designed to first be deployed in small batches (e.g. per flow cell), starting with FASTQs and resulting in gVCFs and a small batch joint-called VCF. A second run of the pipeline can receive a larger batch of gVCFs (e.g. gVCFs derived from many flow cells), and generates a large batch joint-called VCF. The workflow, which is designed to support reproducible bioinformatics, is written in Snakemake and is platform-agnostic. All dependencies are installed by the pipeline as-needed using conda. Development and testing has been predominantly on AWS’ ParallelCluster using Amazon Linux using Snakemake version 7.8.2.

Features:

Read filtering and trimming
Read alignment, deduplication, and BQSR
Variant calling and filtering
Joint-genotyping
Sex discordance and relatedness assessment
Generate MultiQC reports

To install the latest release, type:

git clone https://gitlab.com/data-analysis5/dna-sequencing/54gene-wgs-germline.git

Inputs

The pipeline requires the following inputs:

A headerless, whitespace delimited manifest.txt file with sample names and paths (columns dependent on the run-mode)
Config file with the run-mode specified and other pipeline parameters configured (see default config provided in config/config.yaml)
A tab-delimited intervals.tsv file with names of intervals and paths to region (BED) files of the genome you want to parallelize the variant calling and joint-calling steps by (i.e. 50 BED files each with a small region of the genome to parallelize by)
A tab-delimited sex_linker.tsv file with the sample names in one column and sex in the other to identify discordances in reported vs. inferred sex
A multiqc.yaml config file for generating MultiQC reports (provided for you)

Outputs

Depending on which run-mode you have set, you will be able to generate:

A hard-filtered, multi-sample joint-called VCF in full and joint_genotyping mode
Per-sample gVCFs for all regions of the genome for future joint-calling in full mode
Deduplicated and post-BQSR BAM files in full mode
Various QC metrics (e.g. FastQC, MultiQC, bcftools stats) in all three modes

See the Installation, Execution, and Configuration for details on setting up and running the pipeline.