Designing  minimal residual disease (MRD) panels and probe sets

These are best practices for MRD panel design, although they apply to any custom probe set.

Workflow Steps

1. Generate BED regions around mutations: Filters whitelist mutations and excludes mitochondrial chromosome, then produces a BED file with regions of size*2 bp around each mutation.

def generate_bed(path_tsv, path_output, chrom_col, pos_col, id_col, size=):

        data = pd.read_csv(path_tsv, sep='\t')


        # filter for whitelist muts

        data = data[data['call'] == 'W']


        # filter out mitochondrial chromosome

        data = data[~(data['CHROM'] == 'M')]


        # generate regions of 200 bp around each mutation position (+- 100 bp upstream and downstream)

        bed = pd.DataFrame()

        bed['chrom'] = data[chrom_col]

        bed['start'] = data[pos_col] - size

        bed['end'] = data[pos_col] + size

        bed['name'] = data[id_col] + "_" + data[chrom_col] + "_" + data[pos_col].astype(str)

        width = str(size*2)

        path_file = path_output + "/all_muts_" + width + "bp_regions.bed"

  

        # Remove old file if it exists

try:

            os.remove(path_file)

        except FileNotFoundError:

            pass

        

       # generate bed file and save in target dir

        bed.to_csv(path_file, index=False, sep='\t', header=None)

        print("RESULT_PATH: " + path_file)

generate_bed(path_tsv, path_output, chrom_col, pos_col, id_col, size)

2. Create sliding windows

Note that all_muts_200bp_regions.bed is the output file generated by the generate_bed() function. Bedtools generates 120 bp candidate probes with 1 bp step, and we use awk later to keep probes that are at least 120 base pairs in length to filter out edge-clipped probes.

bedtools makewindows -b all_muts_200bp_regions.bed -w 120 -s 1 -i src > all_sliding_windows.bed

awk '($3-$2) == 120' all_sliding_windows.bed > all_cand_probes.bed

3. Extract probe sequences from the reference genome

bedtools getfasta -fi hg38_masked.fa -bed all_cand_probes.bed -name -fo all_cand_seqs.fa

4. Remove multi-mapping probes to keep probes that uniquely map to a single location in the genome.

blat hg38_masked.fa all_cand_seqs.fa -stepSize=5 -repMatch=2253 -minScore=40 -minIdentity=0 blat_output.psl

pslReps blat_output.psl repeat_align_info.psr

awk '($4 == 1 && $3 == 1)' repeat_align_info.psr > unique_queries.psr

seqtk subseq all_cand_seqs.fa unique_probe_ids.txt > unique_probes.fa

5. Filter probes by GC content to retains probes with GC content between 35–65%.

seqkit fx2tab unique_probes.fa -g -H -n > gc_content.tsv

awk '(35 < $2 && $2 < 65)' gc_content.tsv > filtered_gc.tsv

seqkit grep -f filtered_gc_ids.txt unique_probes.fa -n -o qualified_probes.fa

qualified_probes.fa contains the final set of probes for MRD panel design! Note that you may need to prune the final set of probes further depending on the instructions from your provider. Certain manufacturers require a minimum number of probes per targeted sequencing panel. Thank you Zoe Shong for assisting with this documentation.