Background error rate estimation workflow
This pipeline computes position-specific background error rates from whole-genome or targeted sequencing BAM files. It uses pysamstats to generate per-base statistics and calculates mean nucleotide and indel error rates. You need to have pysamstats available in a conda environment, which you can pass as an argument to the STEP2_vstat2bgerror.bash script.
Overview of Scripts
1. STEP1_bam2vstat.bash
Generates .vstat (variant statistics) files for each BAM sample using pysamstats. It will submit individual jobs to the cluster for each chromosome, and can easily be run with the sbatch STEP1_bam2vstat.bash command.
Reads a list of BAM files. These are usually tumor negative plasma cell-free DNA samples.
Generates per-base statistics for all chromosomes (chr1–22, X, Y, M) from input BAM files.
Filters out mitochondrial chromosome and low-depth positions (depth < 20).
Adds sample name column and formats output as sample.vstat.
Edit the following section in the script with appropriate paths and parameters:
ref="/path/to/hg38.fa"
outputdir=/path/to/error_rates
conda_profile_path="~/anaconda3/etc/profile.d/conda.sh" # Replace ~ with your home directory, Slurm may not recognize ~.
min_depth=20 # Minimum sequencing depth required in error rate calculation. Positions with depth less than this value will be omitted.
bam_list="/path/to/bam_list.txt" # Abs path to normal bams, one line per bam. Include the .bam file extension.
2. STEP2_vstat2bgerror.bash
Processes .vstat files to compute background error rates per genomic position. It will subset by chromosome and fix edge positions to ensure positions aren't split between files. It will invoke the compute_error.py and add_indel.py utility scripts to calculate mean base-substitution and indel error rates. Final error_rates.tsv file outputted by this script has error rate information for all genomic positions.
Resource requirements:
This script is resource-intensive (64 cores, 512 GB RAM, 24 hours). You may adjust the SBATCH directives if these resources are unavailable. For large sequencing panels, reducing memory may require increasing the runtime limit.
Edit the following section in the script:
conda_profile_path="~/anaconda3/etc/profile.d/conda.sh";
conda_env="pysamstats"; # Name of the conda environment containing pysamstats.
threads=64;
vstat_dir="/path/to/error_rates/vstat";
output_dir="/path/to/error_rates/error_rate/";
compute_error_script="/path/to/toolkit/bg_error_scripts/compute_error.py";