RP06 - Analysis of genome data

In this project, we present a novel statistical model for uncertainty-aware haplotype quantification that reaches subclonal resolution, for example in tumors. A haplotype is a set of alleles or nucleotide variants that are inherited together from a single parent. The human leukocyte antigen (HLA) genes are a prominent example: they are located on chromosome 6, with over 36,000 alleles reported in the IPD-IMGT/HLA database [1]. HLA genes play a central role in the immune system and are associated with autoimmune disorders, the response to viral infections, organ transplantation, and cancer.

HLA typing is the process of determining the HLA types present in a patient sample. It can be performed by serological typing, sequence-specific oligonucleotide probes or PCR primers, Sanger sequencing, and, most recently, next-generation sequencing (NGS) [2]. In this project, HLA typing is performed on NGS data. This is not straightforward, because HLA loci are extremely polymorphic and highly similar in sequence, so HLA identification requires specialized approaches.

Our approach to haplotype quantification is based on a Bayesian latent variable model that uses the variant allele frequencies (VAFs) observed in sequencing data, and it proceeds in two steps. First, linear optimization over the maximum likelihood estimates of the VAFs determines the most plausible set of haplotypes. Second, the Bayesian latent variable model calculates posterior probabilities of combinations of latent haplotype fractions, using the posterior VAF distributions provided by Varlociraptor. Varlociraptor is a variant caller that handles all types of genomic variants with a unified statistical model, accounting for the biases involved (strand, read pair orientation, read position, sampling, and contamination) and the uncertainties involved (typing, alignment, and heterogeneity) [3].
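The link between haplotype fractions and VAFs can be illustrated with a toy example. The sketch below is not Orthanq's implementation: it assumes that the expected VAF of a variant is the sum of the fractions of the haplotypes carrying it, and it replaces both the linear program and the Bayesian model with a brute-force grid search that minimizes the total absolute deviation from the observed VAFs. The haplotype names, variant names, VAF values, and the `steps` grid resolution are all hypothetical.

```python
from itertools import product

# Hypothetical toy input: the variants carried by each candidate haplotype.
HAPLOTYPE_VARIANTS = {
    "hapA": {"v1", "v2"},
    "hapB": {"v2", "v3"},
    "hapC": {"v3"},
}

# Hypothetical maximum likelihood VAF estimates, one per variant.
OBSERVED_VAF = {"v1": 0.5, "v2": 1.0, "v3": 0.5}


def expected_vaf(fractions, variant):
    """Assumed model: a variant's VAF is the summed fraction of its carriers."""
    return sum(
        fraction
        for haplotype, fraction in fractions.items()
        if variant in HAPLOTYPE_VARIANTS[haplotype]
    )


def fraction_grid(haplotypes, steps):
    """Yield every assignment of fractions (multiples of 1/steps) summing to 1."""
    for counts in product(range(steps + 1), repeat=len(haplotypes)):
        if sum(counts) == steps:
            yield {h: c / steps for h, c in zip(haplotypes, counts)}


def total_deviation(fractions):
    """Sum of absolute differences between expected and observed VAFs."""
    return sum(
        abs(expected_vaf(fractions, variant) - observed)
        for variant, observed in OBSERVED_VAF.items()
    )


def best_fit(steps=4):
    """Return the grid point whose expected VAFs best match the observations."""
    haplotypes = sorted(HAPLOTYPE_VARIANTS)
    return min(fraction_grid(haplotypes, steps), key=total_deviation)


if __name__ == "__main__":
    print(best_fit())
```

In this toy input, only the combination of hapA and hapB at equal fractions reproduces all three observed VAFs exactly. A grid with `steps=2` would correspond to a diploid germline sample, while finer grids allow subclonal fractions.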
Our approach relies on pangenome read alignments to capture the most relevant genetic variation. We name our tool Orthanq, for ORThogonal evidence based HAplotype Quantification [4]. On benchmark datasets, we show that Orthanq predicts HLA types as well as or better than state-of-the-art HLA typers. Beyond HLA typing, Orthanq can also be applied to virus lineage quantification. We plan to evaluate this use case on simulated samples containing SARS-CoV-2 lineages and on public benchmark datasets. Orthanq's decisions can be traced back to individual variants, which can be explored via comprehensive visualizations. The model will help to address problems caused by tumor heterogeneity when determining the sensitivity of subclones to neoantigen-based immunotherapy. Ultimately, patients' HLA types can be integrated into a patient dashboard for use at the point of care. Orthanq is available at https://orthanq.github.io.


[1] Robinson, J., Barker, D. J., Georgiou, X., Cooper, M. A., Flicek, P., & Marsh, S. G. (2020). IPD-IMGT/HLA Database. Nucleic Acids Research, 48(D1), D948-D955.
[2] Klasberg, S., Surendranath, V., Lange, V., & Schöfl, G. (2019). Bioinformatics strategies, challenges, and opportunities for next generation sequencing-based HLA genotyping. Transfusion Medicine and Hemotherapy, 46(5), 312-325.
[3] Köster, J., Dijkstra, L. J., Marschall, T., & Schönhuth, A. (2020). Varlociraptor: enhancing sensitivity and controlling false discovery rate in somatic indel discovery. Genome Biology, 21(1), 98. https://doi.org/10.1186/s13059-020-01993-6
[4] Uzuner, H., Paschen, A., Schadendorf, D., & Köster, J. (2024). Orthanq: transparent and uncertainty-aware haplotype quantification with application in HLA-typing. BMC Bioinformatics, 25(1), 240.
