Publications by Year: Submitted
Submitted
Mobile element insertions (MEI) shape the human genome in both germline and somatic tissues. While inherited MEIs are well characterized, mapping somatic MEIs (sMEI) in non-cancer tissues remains challenging due to their low allelic fraction and repetitive nature. We established an integrative framework for sMEI analysis leveraging modern sequencing technologies and analytical innovations. We first benchmarked sMEI detection and demonstrated advantages of long-read and MEI-targeted sequencing for ultra-low-frequency events using a mixture of well-established cell lines. We then showed that haplotype phasing and donor-specific assemblies refine sMEI detection, effectively distinguishing true somatic insertions from germline insertions and false signals in in-silico tumor-normal mixtures. We further developed a source-tracing strategy based on internal sequence variation, expanding the catalogue of active source elements beyond traditional transduction-based methods. Applying this framework to donor tissues, we identified 18 rare somatic L1 insertions, revealing structural and source diversity. Our work provides a foundational framework and biological insight into sMEIs.
The adaptive immune system monitors cellular integrity by recognizing short peptides from intracellular proteins presented on Major Histocompatibility Complex class I (MHC-I) molecules, collectively termed peptide-MHC complexes (pMHC), enabling detection of foreign or mutated proteins. With the rising importance of immunotherapies targeting neoantigens in cancers, the ability to accurately predict which peptides will bind to the diverse population of MHC alleles is critically important. Current computational methods for pMHC-I prediction fall broadly into sequence-based methods, which rely heavily on large training datasets, and structure-based methods, which leverage structural modeling and the energetics of pMHC binding. While sequence-based methods are widely used, their performance depends on the size and quality of the training data. On the other hand, while structure-based approaches can generalize better across diverse MHC alleles, they traditionally depend on identifying a single global minimum energy conformation, an assumption that often fails due to the inherent binding promiscuity of MHC-I molecules. To address these limitations, we developed STRUMP-I (STRUcture-based pMHC Prediction (for class I)), a novel pMHC binding prediction tool that directly leverages a broad set of force-field-derived energy terms as machine-learning features. STRUMP-I achieves performance comparable to state-of-the-art sequence-based models while significantly outperforming them on MHC alleles with limited representation in training data. Furthermore, STRUMP-I demonstrates strong synergy when integrated with sequence-based methods, notably enhancing prediction precision. The robustness and generalizability of STRUMP-I were confirmed by evaluating its predictive performance on independent, previously unseen datasets, including an experimentally validated cancer neoantigen dataset.
This combined approach advances our capability to reliably identify clinically relevant neoantigen targets. The source code and trained models are available at https://github.com/yoonjoolab/STRUMP-I.
Clonal hematopoiesis of indeterminate potential (CHIP) represents clonal expansion of blood cells and increases the risk of hematological malignancies and cardiovascular disorders. Recent studies have examined CHIP mutations in individuals with Alzheimer's disease (AD), but it is unclear whether their role in AD pathogenesis is protective, detrimental, or neutral. In this study, we used molecular-barcoded deep gene panel sequencing (~400X) to examine CHIP mutations in 298 blood samples from AD and neurotypical individuals 60 years and older. The AD patients exhibited a significantly higher burden of CHIP mutations compared to the age-matched controls (p < 2e-7, odds ratio (OR) = 2.89), particularly in low-frequency variants often not captured by standard whole exome or whole genome sequencing (WGS). This increase was driven by individuals with the APOE ϵ3/ϵ3 genotype and was absent in ϵ4 carriers. Analysis of an independent dataset from the Alzheimer's Disease Sequencing Project (ADSP), comprising WGS data from ~30,000 individuals, confirmed increased CHIP mutations in AD versus controls (p < 0.02, OR = 1.32), again driven by individuals with the APOE ϵ3/ϵ3 genotype. CHIP mutations in AD patients also showed stronger positive selection than in controls. Our results indicate that AD patients show significantly more CHIP mutations in their blood than controls, involving more than one third of AD patients, and contributing to AD risk through a mechanism independent of APOE ϵ4.
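For readers less familiar with the odds ratio reported in the abstract above, the following is a minimal sketch of how an OR and an approximate (Wald) 95% confidence interval can be computed from a 2x2 table of carriers and non-carriers. The function name `odds_ratio` and all counts are hypothetical and chosen only for illustration; they are not data from the study, and the abstract does not state which statistical test the authors used.

```python
import math

def odds_ratio(case_pos, case_neg, ctrl_pos, ctrl_neg):
    """Return the odds ratio and a Wald 95% CI for a 2x2 table.

    case_pos / case_neg: cases with / without the mutation
    ctrl_pos / ctrl_neg: controls with / without the mutation
    """
    # Odds of carrying a mutation in cases divided by the odds in controls
    or_value = (case_pos * ctrl_neg) / (case_neg * ctrl_pos)
    # Standard error of log(OR) under the usual large-sample approximation
    se_log_or = math.sqrt(1 / case_pos + 1 / case_neg + 1 / ctrl_pos + 1 / ctrl_neg)
    lower = math.exp(math.log(or_value) - 1.96 * se_log_or)
    upper = math.exp(math.log(or_value) + 1.96 * se_log_or)
    return or_value, lower, upper

# Hypothetical counts for illustration only (NOT taken from the study)
or_value, lower, upper = odds_ratio(case_pos=60, case_neg=90, ctrl_pos=30, ctrl_neg=120)
print(f"OR = {or_value:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

An OR above 1 with a confidence interval that excludes 1 suggests that carriers are over-represented among cases. The Wald interval is only an approximation and can be unreliable when any cell count is small.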