Running the clonal hematopoiesis pipeline from start to finish, or optionally from aligned BAM files
The CH pipeline was designed to detect hematopoietic and tumor mutations from blood DNA after aligning demultiplexed, paired short-read FastQ files against the human reference genome; see https://github.com/amunzur/Clonal-hematopoiesis-pipeline for further details. Syntax highlighting and language support for Snakemake are available in the VS Code Marketplace and provide clean, colorful formatting for the pipeline scripts.
After FastQ processing and alignment, duplicate reads are collapsed into single-strand consensus (SSCS) read families consisting of at least two reads. Mutation calling is performed individually on each sample using an ensemble variant calling approach. The pipeline can also be used solely as a BAM generation tool, omitting the variant calling steps.
1. Input File Naming
It is important that the FastQ files follow the Patient-id_sampletype_timepoint_date_1.fq.gz format, for example TRIAL-cohort-P123-BC_WBC_first_01Jan1930_1.fq.gz, where TRIAL-cohort-P123-BC is the patient identifier. Avoid underscores in the patient identifier, because downstream processing tools split the sample name string on underscores while collecting all per-patient mutations into one data frame; for example, use TRIALXX-PatientXX instead of TRIALXX_PatientXX. If you do not know which timepoint a sample was collected at, add "CHIP" to the filename, such as TRIAL-cohort-P123-BC_WBC_CHIP_01Jan1930.fq.gz. If you do know the timepoint, add a string that describes it, such as "baseline", "during-treatment" or "sample-one". Use "WBC" in the sample names instead of gDNA (genomic DNA).
For the test samples, the acceptable strings you can use are:
cfDNA (cell-free DNA)
utDNA (urine tumor DNA)
FiT (fixed tissue)
Using other strings (such as ctDNA instead of cfDNA, or FFPE instead of FiT) will cause downstream problems. Do not interleave the two FastQ files into one; provide the read1 and read2 FastQs separately. Additionally, do not perform any preprocessing on the FastQ files such as adapter trimming; all preprocessing steps are performed by the pipeline. After cloning the GitHub repository, create the Clonal-hematopoiesis-pipeline/results/data/fastq/merged directory and place all FastQ files in that location, as shown below. This is the starting point of the pipeline.
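A minimal sketch of this setup, assuming the repository was cloned into the current directory and the raw FastQ files live under the placeholder path /path/to/raw_fastq/:

cd Clonal-hematopoiesis-pipeline
mkdir -p results/data/fastq/merged
cp /path/to/raw_fastq/TRIAL-cohort-P123-BC_WBC_CHIP_01Jan1930_1.fq.gz results/data/fastq/merged/
cp /path/to/raw_fastq/TRIAL-cohort-P123-BC_WBC_CHIP_01Jan1930_2.fq.gz results/data/fastq/merged/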
2. Sample List File
Next, navigate to the Clonal-hematopoiesis-pipeline/resources/sample_lists/sample_list.tsv file and edit it to list all the sample names you wish to work on. File extensions should not be included here, and the sample names should match the FastQ file names described above exactly. The header sample_names should remain. The pipeline will process all samples listed here; sample names that are not included in this file will be ignored and not processed. Below is an example of how this .tsv file may look; note that the sample names do not contain the _1 and _2 suffixes:
sample_names
TRIAL-cohort-P123-BC_WBC_CHIP_01Jan1930
TRIAL-cohort-P124-BC_WBC_CHIP_01Jan1930
TRIAL-cohort-P125-BC_WBC_CHIP_01Jan1930
TRIAL-cohort-P126-BC_WBC_CHIP_01Jan1930
TRIAL-cohort-P127-BC_WBC_CHIP_01Jan1930
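If the FastQ files are already in place, the sample list can be generated from them. This is a minimal sketch, assuming you are in the repository root and the files follow the naming convention above:

echo "sample_names" > resources/sample_lists/sample_list.tsv
ls results/data/fastq/merged/*_1.fq.gz | xargs -n1 basename | sed 's/_1\.fq\.gz$//' >> resources/sample_lists/sample_list.tsv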
3. Pipeline Overview and Environment Setup
The rule_all section in the workflow/snakefile file lists the various output files the pipeline will generate, covering alignment, consensus collapsing, mutation calling using three software tools, annotation via ANNOVAR and various sequencing metrics. I recommend creating a conda environment named snakemake_env, installing the Snakemake software through conda, and activating the environment with conda activate snakemake_env before you run the pipeline. This link has instructions on installing Snakemake using conda: https://snakemake.readthedocs.io/en/stable/getting_started/installation.html. Other conda environments and packages needed will be managed automatically by the pipeline. Alternatively, you can generate the Snakemake environment using the Clonal-hematopoiesis-pipeline/workflow/envs/snakemake_env.yaml file. This is the base environment that will run most of the jobs using ABRA2, fastp, Picard and fgbio. If you choose to build your own Snakemake environment, ensure you use a version 6 or 7 release; version 8 and later changed how job submissions to HPCs are handled. The snakemake_env.yaml file pins version 6.9.1.
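Both options as a short sketch (the explicit version pin mirrors the snakemake_env.yaml file):

conda env create -f workflow/envs/snakemake_env.yaml
# or build the environment yourself with a pinned version 6 release:
conda create -n snakemake_env -c conda-forge -c bioconda snakemake=6.9.1
conda activate snakemake_env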
4. Preparing the Config File and Input Files
The config/config.yaml file should also be annotated with the paths to the reference genome, the panel targets bed file and the panel baits bed file (3- or 4-column formats will work). Please follow the GATK instructions here to generate the .interval_list files that some downstream GATK tools require: https://gatk.broadinstitute.org/hc/en-us/articles/360057439071-IntervalListTools-Picard. Both the bed files and the .interval_list files need to be provided. Files for PATH_germline, PATH_PoN, PATH_known_indels, PATH_gold_std_indels and PATH_SNP_db can be obtained from the GATK resource bundle: https://gatk.broadinstitute.org/hc/en-us/articles/360035890811-Resource-bundle. Once you download them from the Google bucket, update the config.yaml file with the corresponding absolute paths.
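As one way to generate an .interval_list from a bed file, Picard's BedToIntervalList can be used (see the GATK/Picard link above; the paths below are placeholders for your own panel bed file and reference sequence dictionary):

picard BedToIntervalList I=/path/to/panel_targets.bed SD=/path/to/reference.dict O=/path/to/panel_targets.interval_list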
If you are a colleague of mine, please contact me for a copy of this file for easy setup.
5. Running the Pipeline
All Snakemake commands need to be executed from the main project folder, meaning you should navigate to Clonal-hematopoiesis-pipeline before you start the pipeline. You can execute a dry run with snakemake -n. If you are running the pipeline on a local Linux server, you can limit the number of jobs run in parallel using the -j option. For example, the command below will execute 5 jobs in parallel and print out the shell commands invoked by Snakemake:
snakemake --reason --rerun-incomplete --use-conda --keep-going --printshellcmds -j 5
Use the following to submit jobs via SLURM:
snakemake --reason --rerun-incomplete --use-conda --keep-going --cluster-config config/cluster.yaml --cluster 'sbatch -N {cluster.nodes} -c {cluster.cpus-per-task} -o {cluster.output} --mem={cluster.mem} --time={cluster.time} --job-name={rule}' -j 100
This will submit up to 100 jobs at a time. You can edit the config/cluster.yaml file to change the time and memory allocations for each job. Note that if you make any changes to the workflow, such as adding new rules or modifying the rule names, you also need to create the corresponding directory with the new rule name in Clonal-hematopoiesis-pipeline/results/logs_slurm. If the log directory is missing, job submission will not run to completion.
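For example, after adding a hypothetical rule named my_new_rule, create its log directory before submitting:

mkdir -p results/logs_slurm/my_new_rule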
You can use the MultiQC tool to compile all FastQC HTML reports into one dashboard: https://github.com/MultiQC/MultiQC
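A minimal invocation, assuming the reports live under results/ and with results/multiqc as an arbitrary output directory of your choosing:

multiqc results/ -o results/multiqc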
6. Running Only Variant Calling and Annotation (Optional)
You may also want to run only the variant calling and annotation portions of the pipeline if you are working with a set of already aligned BAM files; in this case you still need to make sure the file naming follows the criteria outlined above. After cloning the repository into your working directory, navigate to the root of the repository and run mkdir -p results/data/bam/SSCS2_final. Aligned and filtered BAM files should be copied or soft-linked into this directory from another location (see the sketch after the list below). Next, scroll down to the Modules section in the workflow/snakefile file and comment out the following rules, since FastQ processing and consensus collapsing are not needed. If these lines are not commented out, Snakemake will give an error because it requires FastQ files to be available by default (which we do not have in this scenario).
include: "rules/process_fastq.smk"
include: "rules/filter_consensus_bams.smk"
include: "rules/make_consensus_bams.smk"
You can uncomment the desired outputs in the rule all section to run only a portion of the pipeline. For example, uncommenting run_annovar_chip will run three mutation callers (Mutect2, VarDict, Freebayes), process the variants, annotate them using ANNOVAR and convert the outputs into .txt format.
7. Running Downstream Variant Processing Tools
Upon completion, the pipeline will have generated a consensus-collapsed BAM file for each sample, and variant tables if you chose to execute the variant calling rules. The computational tools outlined in this section compile these individual files, perform additional variant filtering and generate one high-confidence list of variants. These tools are available at https://github.com/amunzur/genomics_toolkit/tree/main/process_variants_steps. They should be run in this order:
STEP1_filter_variants.R: Compiles variant tables into one file and filters based on user inputs.
STEP2_combine_variant_callers.R: Combines variant calls from the different variant calling tools and retains variants called by two or three variant callers.
STEP3_make_IGV_snapshots_tumor_wbc.py: Generates IGV screenshots for manual curation (see the genomics_toolkit README for a detailed explanation of how to generate IGV snapshots).
STEP4_curate_mutations.py: Filters out excluded mutations after manual curation.
For detailed instructions on how to use these tools please see the README at https://github.com/amunzur/genomics_toolkit/tree/main.
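As a minimal sketch of the invocation order, assuming R and Python are on your PATH and that each script's arguments are supplied as described in that README:

Rscript process_variants_steps/STEP1_filter_variants.R
Rscript process_variants_steps/STEP2_combine_variant_callers.R
python process_variants_steps/STEP3_make_IGV_snapshots_tumor_wbc.py
python process_variants_steps/STEP4_curate_mutations.py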