ASLI MUNZUR - Designing MRD panels and probe selection

Designing minimal residual disease (MRD) panels and probe sets

These are best practices for MRD panel design, although they apply to any custom probe set.

GC Content: Each probe (~120 bp) should have 35–65% GC content.
- Probes with high GC content will bind to capture targets more tightly, which may lead to nonuniform coverage across all targets. If absolutely necessary, probes with GC content up to 80% may be included, albeit these probes should remain a minority within the probe set.
- Probes with <35% GC may have reduced pull-down efficiency. Here 20% is the absolute minimum I would recommend, probes with GC content under 20% are likely to experience severe dropout during washes due to suboptimal hybridization to target regions.
For hard to capture regions like to the TERT promoter with GC content >80%, you can consider using multiple overlapping probes to improve coverage.
Specificity: Once you have a candidate probe set you can use BLAT or any other alignment tool to check if there are any probes that align to off-target regions. Up to 3 total 20 base pair contiguous and perfect matches can be tolerated and are unlikely to have significant off-target effects. There should be no stretches with ≥50 base pair with 90% match, and off-target hits with a BLAT score ≥40.
Mutation Coverage: For tracking mutations (~30 per patient), you could generate sliding windows of 1 base pair step across 100 base pairs upstream and 100 base pairs downstream of each mutation. This produces multiple candidate probes covering each mutation. After filtering for GC content and uniqueness, this method usually yields a sufficient number of probes per mutation. Whenever possible, position the mutation near the center of the probe.

Workflow Steps

1. Generate BED regions around mutations: Filters whitelist mutations and excludes mitochondrial chromosome, then produces a BED file with regions of size*2 bp around each mutation.

def generate_bed(path_tsv, path_output, chrom_col, pos_col, id_col, size=):

data = pd.read_csv(path_tsv, sep='\t')

# filter for whitelist muts

data = data[data['call'] == 'W']

# filter out mitochondrial chromosome

data = data[~(data['CHROM'] == 'M')]

# generate regions of 200 bp around each mutation position (+- 100 bp upstream and downstream)

bed = pd.DataFrame()

bed['chrom'] = data[chrom_col]

bed['start'] = data[pos_col] - size

bed['end'] = data[pos_col] + size

bed['name'] = data[id_col] + "_" + data[chrom_col] + "_" + data[pos_col].astype(str)

width = str(size*2)

path_file = path_output + "/all_muts_" + width + "bp_regions.bed"

# Remove old file if it exists

try:

os.remove(path_file)

except FileNotFoundError:

pass

# generate bed file and save in target dir

bed.to_csv(path_file, index=False, sep='\t', header=None)

print("RESULT_PATH: " + path_file)

generate_bed(path_tsv, path_output, chrom_col, pos_col, id_col, size)

2. Create sliding windows

Note that all_muts_200bp_regions.bed is the output file generated by the generate_bed() function. Bedtools generates 120 bp candidate probes with 1 bp step, and we use awk later to keep probes that are at least 120 base pairs in length to filter out edge-clipped probes.

bedtools makewindows -b all_muts_200bp_regions.bed -w 120 -s 1 -i src > all_sliding_windows.bed

awk '($3-$2) == 120' all_sliding_windows.bed > all_cand_probes.bed

3. Extract probe sequences from the reference genome

bedtools getfasta -fi hg38_masked.fa -bed all_cand_probes.bed -name -fo all_cand_seqs.fa

4. Remove multi-mapping probes to keep probes that uniquely map to a single location in the genome.

blat hg38_masked.fa all_cand_seqs.fa -stepSize=5 -repMatch=2253 -minScore=40 -minIdentity=0 blat_output.psl

pslReps blat_output.psl repeat_align_info.psr

awk '($4 == 1 && $3 == 1)' repeat_align_info.psr > unique_queries.psr

seqtk subseq all_cand_seqs.fa unique_probe_ids.txt > unique_probes.fa

5. Filter probes by GC content to retains probes with GC content between 35–65%.

seqkit fx2tab unique_probes.fa -g -H -n > gc_content.tsv

awk '(35 < $2 && $2 < 65)' gc_content.tsv > filtered_gc.tsv

seqkit grep -f filtered_gc_ids.txt unique_probes.fa -n -o qualified_probes.fa

qualified_probes.fa contains the final set of probes for MRD panel design! Note that you may need to prune the final set of probes further depending on the instructions from your provider. Certain manufacturers require a minimum number of probes per targeted sequencing panel. Thank you Zoe Shong for assisting with this documentation.