Running copy number analysis using CNVkit
1. Setting up the Conda environment
I recommend generating a conda environment for this purpose containing cnvkit.
conda create --name cnv
conda activate cnv
conda install bioconda::cnvkit
For details on what each step does CNVkit has tutorials here: https://cnvkit.readthedocs.io/en/stable/pipeline.html. The following scripts are designed to run on single samples, then which later can be run on many samples using a sample sheet to parallelize the computation for large cohorts. These scripts will generate individual sbatch files that can be submitted to a HPC via Slurm. Scripts should be run in this order:
Calculate coverage in the BAM files: SLURM_run_cnvkit_coverage.py
Generate a pooled normal using coverage obtained from normal samples: SLURM_generate_pooled_normal.py
Detect germline heterozygous SNPs using VarDict: SLURM_run_vardict.py
Fix and segment: SLURM_run_cnvkit_fix.py
You can access these scripts here: https://github.com/amunzur/genomics_toolkit/tree/main/copy_number
2. Generating coverage files (.cnn)
To generate scripts for many samples at once you can save all sample names in a text file, then in a while loop you can create individual scripts like so:
while IFS=$'\t' read -r sample_name; do
genomics_toolkit/copy_number/SLURM_run_cnvkit_coverage.py \
--path_bam .../__.bam \ # REQUIRED, abs path to sample BAM
--path_panel .../__.bed \ # REQUIRED, abs path to BED file of target regions
--dir_output ... \ # REQUIRED, abs path for CNVkit output (.cnn files)
--dir_batch_scripts ... \ # REQUIRED, abs path for generated batch scripts
--dir_logs ... \ # REQUIRED, abs path for log files
--path_conda ... \ # OPTIONAL, abs path to user conda.sh (default: ~/anaconda3/etc/profile.d/conda.sh)
--sbatch_time_string ... \ # OPTIONAL, default = "29:00"
--sbatch_partition ... # OPTIONAL, cluster partition (long, normal, etc.)
done < /path/sample_list.tsv
3. Generating a pooled normal file
For denoising and GC correction we need a pooled normal coverage file to normalize against. It is recommended that you use the same sample types, for example for copy number analysis in cell-free DNA samples, the pooled normal should consist of 10-20 ctDNA negative samples sequenced on the same panel. Similarly urine tumor DNA (utDNA) needs to have a pooled normal constructed from non-tumor containing utDNA samples. If you don't have tumor-negative samples from the same DNA source, you may construct a pooled normal from whatever tumor-negative material you have, inspect the genome wide copy number profiles to choose the samples that have a flat profile, then regenerate the pooled normal from those samples and rerun the fix and segment steps. For example, if your goal is to perform copy number analysis in utDNA samples but you don' t have access to tumor-negative utDNA samples donated from healthy volunteers, then you could consider generating a pooled normal from ctDNA negative samples as a first step.
However generating a pooled normal from a different DNA source that underwent different biological processes (and sometimes technical process too during library preparation) will yield noisier copy number profiles. Below two genome wide copy number profiles from the same urine DNA sample are provided. Top plot was generated by normalizing against a pooled normal of tumor negative blood plasma cell-free DNA samples, and in the bottom plot a pooled normal obtained from tumor-negative urine DNA samples were used. Matching the analyte source in sample of interest and pooled normal will help reduce the noise in coverage log ratio calculations.
Example command:
genomics_toolkit/copy_number/SLURM_generate_pooled_normal.py \
--path_cnns .../__.txt \ # REQUIRED, text file of normal cnn paths separated by space
--path_hg38 .../__.fa \ # REQUIRED, reference genome fasta
--path_output .../__.cnn \ # REQUIRED, output pooled normal
--path_sbatch .../__.batch \ # REQUIRED, output batch script
--dir_logs ... \ # REQUIRED, log directory
--path_conda .../conda.sh \ # OPTIONAL, default = ~/anaconda3/etc/profile.d/conda.sh
--sbatch_time_string ... \ # OPTIONAL, default = "30:00"
--sbatch_partition ... # OPTIONAL, default = long