ASLI MUNZUR - Copy number analysis with CNVkit

Running copy number analysis using CNVkit

1. Setting up the Conda environment

I recommend creating a dedicated conda environment that contains CNVkit.

conda create --name cnv

conda activate cnv

conda install bioconda::cnvkit

For details on what each step does, see the CNVkit tutorials here: https://cnvkit.readthedocs.io/en/stable/pipeline.html. The following scripts are designed to run on single samples, and can later be run on many samples using a sample sheet to parallelize the computation for large cohorts. These scripts generate individual sbatch files that can be submitted to an HPC cluster via Slurm. Scripts should be run in this order:

Calculate coverage in the BAM files: SLURM_run_cnvkit_coverage.py
Generate a pooled normal using coverage obtained from normal samples: SLURM_generate_pooled_normal.py
Detect germline heterozygous SNPs using VarDict: SLURM_run_vardict.py
Fix and segment: SLURM_run_cnvkit_fix.py

You can access these scripts here: https://github.com/amunzur/genomics_toolkit/tree/main/copy_number
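
For orientation, the CNVkit subcommands behind these steps can be sketched as plain command lines. The snippet below only builds argument lists and runs nothing. The function name and the output file names are hypothetical, and the wrapper scripts in the repository may pass different or additional options.

```python
def cnvkit_commands(bam, targets_bed, normal_cnns, ref_fasta, antitarget_cnn, vcf=None):
    """Return CNVkit command lines, in pipeline order, as argv lists."""
    # Hypothetical output names, chosen only for this illustration.
    sample_cnn = "sample.targetcoverage.cnn"
    pooled_normal = "pooled_normal.cnn"
    ratios = "sample.cnr"

    # Per-sample coverage in the target regions.
    coverage = ["cnvkit.py", "coverage", bam, targets_bed, "-o", sample_cnn]
    # Pooled normal built from the coverage files of the normal samples.
    reference = ["cnvkit.py", "reference", *normal_cnns, "-f", ref_fasta, "-o", pooled_normal]
    # Bias correction and normalization against the pooled normal.
    fix = ["cnvkit.py", "fix", sample_cnn, antitarget_cnn, pooled_normal, "-o", ratios]
    # Segmentation, optionally informed by SNP allele frequencies from a VCF.
    segment = ["cnvkit.py", "segment", ratios, "-o", "sample.cns"]
    if vcf is not None:
        segment += ["--vcf", vcf]
    return [coverage, reference, fix, segment]


if __name__ == "__main__":
    for argv in cnvkit_commands("s.bam", "panel.bed", ["n1.cnn", "n2.cnn"], "hg38.fa", "empty.cnn"):
        print(" ".join(argv))
```

Variant calling with VarDict is a separate tool and is therefore not part of this list.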

2. Generating coverage files (.cnn)

To generate scripts for many samples at once, save all sample names in a text file (one name per line), then create the individual scripts in a while loop like so:

while IFS=$'\t' read -r sample_name; do
    # Flags are collected in a Bash array so the inline comments do not break the command.
    args=(
        --path_bam .../__.bam        # REQUIRED, abs path to sample BAM
        --path_panel .../__.bed      # REQUIRED, abs path to BED file of target regions
        --dir_output ...             # REQUIRED, abs path for CNVkit output (.cnn files)
        --dir_batch_scripts ...      # REQUIRED, abs path for generated batch scripts
        --dir_logs ...               # REQUIRED, abs path for log files
        --path_conda ...             # OPTIONAL, abs path to user conda.sh (default: ~/anaconda3/etc/profile.d/conda.sh)
        --sbatch_time_string ...     # OPTIONAL, default = "29:00"
        --sbatch_partition ...       # OPTIONAL, cluster partition (long, normal, etc.)
    )
    genomics_toolkit/copy_number/SLURM_run_cnvkit_coverage.py "${args[@]}"
done < /path/sample_list.tsv
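
To make the idea of "one sbatch file per sample" concrete, here is a minimal sketch of how such files could be written. It is not the code of SLURM_run_cnvkit_coverage.py: the function names, the .batch and .log file naming, and the chosen #SBATCH lines are assumptions for illustration. The default time string and partition are taken from the defaults listed on this page.

```python
from pathlib import Path


def read_sample_names(path_sample_list):
    """Read one sample name per line, skipping blank lines."""
    with open(path_sample_list) as handle:
        return [line.strip() for line in handle if line.strip()]


def write_sbatch(sample_name, command, dir_batch_scripts, dir_logs,
                 time_string="29:00", partition="long"):
    """Write a single sbatch file for one sample and return its path."""
    script = Path(dir_batch_scripts) / f"{sample_name}.batch"  # hypothetical naming
    log_file = Path(dir_logs) / f"{sample_name}.log"           # hypothetical naming
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={sample_name}",
        f"#SBATCH --time={time_string}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --output={log_file}",
        "",
        command,  # the command the job should run, e.g. a cnvkit.py call
    ]
    script.write_text("\n".join(lines) + "\n")
    return script
```

Each generated file can then be submitted with sbatch, one job per sample.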

3. Generating a pooled normal file

For denoising and GC correction, we need a pooled normal coverage file to normalize against. The pooled normal should be built from the same sample type as the samples being analyzed. For example, for copy number analysis in cell-free DNA samples, the pooled normal should consist of 10-20 ctDNA-negative samples sequenced on the same panel. Similarly, urine tumor DNA (utDNA) needs a pooled normal constructed from utDNA samples that do not contain tumor. If you don't have tumor-negative samples from the same DNA source, you may construct a pooled normal from whatever tumor-negative material you have, inspect the genome-wide copy number profiles to choose the samples that have a flat profile, then regenerate the pooled normal from those samples and rerun the fix and segment steps. For example, if your goal is to perform copy number analysis in utDNA samples but you don't have access to tumor-negative utDNA samples donated by healthy volunteers, you could generate a pooled normal from ctDNA-negative samples as a first step.

However, generating a pooled normal from a different DNA source that underwent different biological processes (and sometimes different technical processes during library preparation) will yield noisier copy number profiles. Two genome-wide copy number profiles from the same urine DNA sample are provided below. The top plot was generated by normalizing against a pooled normal of tumor-negative blood plasma cell-free DNA samples; for the bottom plot, a pooled normal obtained from tumor-negative urine DNA samples was used. Matching the analyte source between the sample of interest and the pooled normal helps reduce the noise in coverage log ratio calculations.
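
The "flat profile" check described above is a visual inspection, but it can be roughly approximated with a number. The sketch below ranks samples by the median absolute deviation of their log2 ratios. This metric and both function names are my own assumptions, not part of the toolkit, and the log2 ratios are assumed to be already loaded into lists. It is a crude screen: a sample with one focal change can still score as flat, so the plots should remain the final check.

```python
from statistics import median


def flatness_score(log2_ratios):
    """Median absolute deviation of log2 ratios; lower means a flatter profile."""
    center = median(log2_ratios)
    return median(abs(value - center) for value in log2_ratios)


def rank_by_flatness(profiles):
    """Sort sample names from flattest to noisiest.

    profiles: dict mapping sample name -> list of per-probe log2 ratios.
    """
    return sorted(profiles, key=lambda name: flatness_score(profiles[name]))


if __name__ == "__main__":
    example = {
        "candidate_A": [0.00, 0.01, -0.01, 0.00],
        "candidate_B": [0.50, -0.60, 0.00, 0.90],
    }
    print(rank_by_flatness(example))
```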

Example command:

# Flags are collected in a Bash array so the inline comments do not break the command.
args=(
    --path_cnns .../__.txt          # REQUIRED, text file of normal cnn paths separated by space
    --path_hg38 .../__.fa           # REQUIRED, reference genome fasta
    --path_output .../__.cnn        # REQUIRED, output pooled normal
    --path_sbatch .../__.batch      # REQUIRED, output batch script
    --dir_logs ...                  # REQUIRED, log directory
    --path_conda .../conda.sh       # OPTIONAL, default = ~/anaconda3/etc/profile.d/conda.sh
    --sbatch_time_string ...        # OPTIONAL, default = "30:00"
    --sbatch_partition ...          # OPTIONAL, default = long
)
genomics_toolkit/copy_number/SLURM_generate_pooled_normal.py "${args[@]}"
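
The text file passed to --path_cnns lists the normal coverage files separated by spaces. One way to build it is sketched below, assuming all normal .cnn files sit in a single directory. The function name and that directory layout are hypothetical. The file is written without a trailing newline, on the assumption that this is the safest form for a space-separated list.

```python
from pathlib import Path


def write_cnn_list(dir_cnn, path_out, suffix=".cnn"):
    """Write absolute paths of all coverage files in dir_cnn, separated by spaces."""
    paths = sorted(str(p.resolve()) for p in Path(dir_cnn).glob(f"*{suffix}"))
    if not paths:
        raise FileNotFoundError(f"no {suffix} files found in {dir_cnn}")
    Path(path_out).write_text(" ".join(paths))
    return paths
```

If the directory also holds coverage files from tumor samples, filter the list before writing it so that only normals enter the pooled reference.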

4. Detecting germline heterozygous SNPs

B allele frequencies at heterozygous SNPs can help during segmentation, but CNVkit can perform segmentation without them too. If you are generating a genome-wide copy number profile using coverage log ratios at heterozygous SNPs evenly dispersed across the genome, you can use the same BED file to restrict variant calling to those regions. Ideally, you would locate germline SNPs in the patient-matched leukocyte DNA samples, then calculate the VAF in the tumor samples at those positions. Because running variant calling twice is somewhat tedious, a workaround is to perform unmatched tumor-only variant calling directly, provided you are working with a BED file of verified population-level SNPs (at these positions, variation likely represents copy number changes rather than tumor-originating mutations). The output file doesn't need to be reformatted or annotated; the visualization script can work with the raw VCF file.

Download VarDictJava from the GitHub link here: https://github.com/AstraZeneca-NGS/VarDictJava

Example command:

# Flags are collected in a Bash array so the inline comments do not break the command.
args=(
    --path_bam ...               # REQUIRED, abs path to sample BAM
    --path_hg38 .../__.fa        # REQUIRED, reference genome
    --path_bed ...               # REQUIRED, BED file of target SNPs
    --dir_output ...             # REQUIRED, output directory
    --dir_batch_scripts ...      # REQUIRED, output directory for batch scripts
    --dir_logs ...               # REQUIRED, log directory
    --threshold_min_vaf ...      # REQUIRED, min VAF
    --min_alt_reads ...          # REQUIRED, min alt reads
    --dir_vardictjava ...        # OPTIONAL, path to VarDictJava (default: ~)
    --sbatch_time_string ...     # OPTIONAL, default = "30:00"
    --sbatch_partition ...       # OPTIONAL, default = long
)
genomics_toolkit/variant_calling/SLURM_run_vardict.py "${args[@]}"
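
For the ideal route described above (locating germline heterozygous SNPs in the matched leukocyte sample), a selection step could look like the sketch below. It assumes the VCF stores the allele frequency and depth as AF and DP in the INFO column. The function names, the 0.4-0.6 VAF window, and the depth cutoff are hypothetical choices for illustration, not settings used by the toolkit.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'DP=100;AF=0.48' into a dict."""
    parsed = {}
    for item in info_field.split(";"):
        key, _, value = item.partition("=")
        parsed[key] = value
    return parsed


def heterozygous_snps(vcf_lines, min_depth=30, vaf_window=(0.4, 0.6)):
    """Return (chrom, pos, vaf) for single-base variants that look heterozygous."""
    low, high = vaf_window
    selected = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt, info = fields[0], int(fields[1]), fields[3], fields[4], fields[7]
        if len(ref) != 1 or len(alt) != 1:
            continue  # keep SNPs only, drop indels and multi-allelic records
        values = parse_info(info)
        try:
            vaf, depth = float(values["AF"]), int(values["DP"])
        except (KeyError, ValueError):
            continue  # record lacks usable AF or DP
        if depth >= min_depth and low <= vaf <= high:
            selected.append((chrom, pos, vaf))
    return selected
```

In the tumor-only workaround this window would not be applied to the tumor sample itself, since copy number changes are exactly what shift the VAF away from 0.5 there.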

5. Fixing GC bias, normalizing the read-depth bias and segmentation

The CNVkit fix command can be run as shown below. Note that you can provide a blank file for path_antitarget. For path_pooled_reference_normal, provide the path to the pooled normal generated in step 3. Besides normalizing, this command also generates the default CNVkit figures (diagram and scatter). Normal and test samples must use the same BED file and reference genome.

Example command:

# Flags are collected in a Bash array so the inline comments do not break the command.
args=(
    --path_cnn ...                                 # REQUIRED, sample target coverage file
    --path_pooled_reference_normal .../__.cnn      # REQUIRED, pooled normal file
    --dir_output ...                               # REQUIRED, output directory
    --dir_batch_scripts ...                        # REQUIRED, batch scripts directory
    --dir_logs ...                                 # REQUIRED, log directory
    --dir_wbc_vcf ...                              # OPTIONAL, directory with VCFs of SNPs
    --path_antitarget .../__.bed                   # OPTIONAL, anti-target BED
    --dir_figures ...                              # REQUIRED, figure output directory
    --path_conda ...                               # OPTIONAL, default = ~/anaconda3/etc/profile.d/conda.sh
    --sbatch_time_string ...                       # OPTIONAL, default = "29:00"
    --sbatch_partition ...                         # OPTIONAL, default = long
)
genomics_toolkit/copy_number/SLURM_run_cnvkit_fix.py "${args[@]}"

6. Visualization

After segmentation, CNVkit outputs corrected copy number ratios for each probe (.cnn.fix files) and segmented regions (.cnn.fix.segment files). To interpret these results, the following script generates genome-wide copy number plots that combine log2 copy number ratios and B allele frequencies across all chromosomes. The script imports the corrected CNV data and integrates heterozygous SNPs (if provided). You can also provide a .txt file listing unreliable SNPs to exclude. A sample file for this purpose, provided by Dr. Jack Bacon, is available on GitHub; he observed that these SNPs consistently deviate from 50% in leukocyte samples. The plotting script further filters out unreliable SNPs and probes based on coverage, allelic imbalance, and local deviation from neighboring sites to minimize noise.

For each sample, two aligned panels are generated:

Top panel: Log₂ copy number ratios across the genome with segmented CN lines.
Bottom panel: Corresponding BAF distribution showing allelic imbalance.

Example command:

python plot_all_chip_cnv.py \
    --cnrdir /path/to/fix_snp \
    --vardir /path/to/variants \
    --plot_output_dir /path/figures \
    --path_SNPs_to_remove /path/to/snps.txt
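
To give a feel for what filtering on "local deviation from neighboring sites" can mean, here is a minimal sketch. It is not the logic of plot_all_chip_cnv.py: the function name, the window size, and the deviation threshold are all hypothetical. It keeps a point only if it lies close to the median of its neighbors along the genome.

```python
from statistics import median


def filter_local_outliers(values, window=2, max_deviation=0.5):
    """Return indices of points that stay close to the median of their neighbors.

    values: list of log2 ratios (or BAFs) ordered by genomic position.
    window: number of neighbors considered on each side.
    """
    kept = []
    for i, value in enumerate(values):
        neighbors = values[max(0, i - window):i] + values[i + 1:i + 1 + window]
        # A point with no neighbors cannot be judged, so it is kept.
        if not neighbors or abs(value - median(neighbors)) <= max_deviation:
            kept.append(i)
    return kept


if __name__ == "__main__":
    log2_ratios = [0.0, 0.1, 0.05, 2.0, 0.05, -0.1, 0.0]
    print(filter_local_outliers(log2_ratios))
```

With a small window, a single extreme point next to the start or end of a chromosome can also cause its neighbors to be dropped, so the window should be chosen with the probe density in mind.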

7. Calculating absolute copy numbers from segmented CN data

This script converts log2 copy ratio values from CNVkit segmentation files into absolute copy numbers by incorporating each sample's estimated diploid baseline and ctDNA fraction. These parameters should be precomputed (e.g., from CNVkit reference normalization and tumor fraction estimation) and provided as input arguments. This step isn't required to generate the genome-wide copy number plots, but the outputs are useful for downstream tools that require per-segment copy numbers, such as AmpliconArchitect.

The script reads a per-sample segmentation file (.tsv), excludes sex chromosomes, and applies the following formula for each segment:

CN = (2 * (ctdna_fraction + 2^(logratio - diploid_level) - 1)) / ctdna_fraction

This calculation adjusts observed copy number ratios for tumor purity and baseline ploidy, resulting in integer-like absolute CN values. Example command:

calculate_CN_segment.py \
    --path_segmentation /path/to/igv_tracks/sample_segments.tsv \
    --path_output /path/to/sample_segments_CN.tsv \
    --diploid_level -0.707 \
    --ctdna_fraction 0.691
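
The formula above can be checked with a few lines of Python. This is a sketch of the arithmetic only, not the code of calculate_CN_segment.py, and the function names are hypothetical. The formula is consistent with a simple mixture model in which the observed ratio is (ctdna_fraction * CN + 2 * (1 - ctdna_fraction)) / 2; the second function applies that model in the forward direction, so the two can be tested against each other.

```python
import math


def absolute_copy_number(logratio, diploid_level, ctdna_fraction):
    """Apply CN = 2 * (f + 2^(logratio - diploid_level) - 1) / f to one segment."""
    if not 0 < ctdna_fraction <= 1:
        raise ValueError("ctdna_fraction must be in (0, 1]")
    ratio = 2 ** (logratio - diploid_level)  # copy ratio relative to the diploid baseline
    return 2 * (ctdna_fraction + ratio - 1) / ctdna_fraction


def expected_logratio(copy_number, diploid_level, ctdna_fraction):
    """Inverse direction: log ratio expected for a given tumor copy number."""
    ratio = (ctdna_fraction * copy_number + 2 * (1 - ctdna_fraction)) / 2
    return diploid_level + math.log2(ratio)


if __name__ == "__main__":
    # Values from the example command above.
    diploid_level, ctdna_fraction = -0.707, 0.691
    for cn in (1, 2, 3):
        lr = expected_logratio(cn, diploid_level, ctdna_fraction)
        print(cn, round(lr, 3), round(absolute_copy_number(lr, diploid_level, ctdna_fraction), 3))
```

A segment whose log ratio equals the diploid level returns a copy number of 2, and the lower the ctDNA fraction, the more a small shift in log ratio changes the estimate.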

Most of these jobs complete in less than 30 minutes per sample in an HPC environment. Mutation calling is the most resource-intensive step and can take up to an hour for targeted and WES panels.
