Publications

2015

Bersani, Francesca, Eunjung Lee, Peter V Kharchenko, Andrew W Xu, Mingzhu Liu, Kristina Xega, Olivia C MacKenzie, et al. 2015. “Pericentromeric Satellite Repeat Expansions through RNA-Derived DNA Intermediates in Cancer.” Proceedings of the National Academy of Sciences of the United States of America 112 (49): 15148-53. https://doi.org/10.1073/pnas.1518008112.

Aberrant transcription of the pericentromeric human satellite II (HSATII) repeat is present in a wide variety of epithelial cancers. In deriving experimental systems to study its deregulation, we observed that HSATII expression is induced in colon cancer cells cultured as xenografts or under nonadherent conditions in vitro, but it is rapidly lost in standard 2D cultures. Unexpectedly, physiological induction of endogenous HSATII RNA, as well as introduction of synthetic HSATII transcripts, generated cDNA intermediates in the form of DNA/RNA hybrids. Single molecule sequencing of tumor xenografts showed that HSATII RNA-derived DNA (rdDNA) molecules are stably incorporated within pericentromeric loci. Suppression of RT activity using small molecule inhibitors reduced HSATII copy gain. Analysis of whole-genome sequencing data revealed that HSATII copy number gain is a common feature in primary human colon tumors and is associated with a lower overall survival. Together, our observations suggest that cancer-associated derepression of specific repetitive sequences can promote their RNA-driven genomic expansion, with potential implications on pericentromeric architecture.

2013

Gokcumen, Omer, Verena Tischler, Jelena Tica, Qihui Zhu, Rebecca C Iskow, Eunjung Lee, Markus Hsi-Yang Fritz, et al. 2013. “Primate Genome Architecture Influences Structural Variation Mechanisms and Functional Consequences.” Proceedings of the National Academy of Sciences of the United States of America 110 (39): 15764-9. https://doi.org/10.1073/pnas.1305904110.

Although nucleotide resolution maps of genomic structural variants (SVs) have provided insights into the origin and impact of phenotypic diversity in humans, comparable maps in nonhuman primates have thus far been lacking. Using massively parallel DNA sequencing, we constructed fine-resolution genomic structural variation maps in five chimpanzees, five orang-utans, and five rhesus macaques. The SV maps, which are comprised of thousands of deletions, duplications, and mobile element insertions, revealed a high activity of retrotransposition in macaques compared with great apes. By comparison, nonallelic homologous recombination is specifically active in the great apes, which is correlated with architectural differences between the genomes of great apes and macaque. Transcriptome analyses across nonhuman primates and humans revealed effects of species-specific whole-gene duplication on gene expression. We identified 13 gene duplications coinciding with the species-specific gain of tissue-specific gene expression in keeping with a role of gene duplication in the promotion of diversification and the acquisition of unique functions. Differences in the present day activity of SV formation mechanisms that our study revealed may contribute to ongoing diversification and adaptation of great ape and Old World monkey lineages.

2012

Lee, Eunjung, Rebecca Iskow, Lixing Yang, Omer Gokcumen, Psalm Haseley, Lovelace J Luquette, Jens G Lohr, et al. 2012. “Landscape of Somatic Retrotransposition in Human Cancers.” Science (New York, N.Y.) 337 (6097): 967-71. https://doi.org/10.1126/science.1222077.

Transposable elements (TEs) are abundant in the human genome, and some are capable of generating new insertions through RNA intermediates. In cancer, the disruption of cellular mechanisms that normally suppress TE activity may facilitate mutagenic retrotranspositions. We performed single-nucleotide resolution analysis of TE insertions in 43 high-coverage whole-genome sequencing data sets from five cancer types. We identified 194 high-confidence somatic TE insertions, as well as thousands of polymorphic TE insertions in matched normal genomes. Somatic insertions were present in epithelial tumors but not in blood or brain cancers. Somatic L1 insertions tend to occur in genes that are commonly mutated in cancer, disrupt the expression of the target genes, and are biased toward regions of cancer-specific DNA hypomethylation, highlighting their potential impact in tumorigenesis.

Evrony, Gilad D, Xuyu Cai, Eunjung Lee, Benjamin Hills, Princess C Elhosary, Hillel S Lehmann, J J Parker, et al. 2012. “Single-Neuron Sequencing Analysis of L1 Retrotransposition and Somatic Mutation in the Human Brain.” Cell 151 (3): 483-96. https://doi.org/10.1016/j.cell.2012.09.035.

A major unanswered question in neuroscience is whether there exists genomic variability between individual neurons of the brain, contributing to functional diversity or to an unexplained burden of neurological disease. To address this question, we developed a method to amplify genomes of single neurons from human brains. Because recent reports suggest frequent LINE-1 (L1) retrotransposition in human brains, we performed genome-wide L1 insertion profiling of 300 single neurons from cerebral cortex and caudate nucleus of three normal individuals, recovering >80% of germline insertions from single neurons. While we find somatic L1 insertions, we estimate <0.6 unique somatic insertions per neuron, and most neurons lack detectable somatic insertions, suggesting that L1 is not a major generator of neuronal diversity in cortex and caudate. We then genotyped single cortical cells to characterize the mosaicism of a somatic AKT3 mutation identified in a child with hemimegalencephaly. Single-neuron sequencing allows systematic assessment of genomic diversity in the human brain.

2011

Lee, Sejoon, Eunjung Lee, Kwang H Lee, and Doheon Lee. 2011. “Predicting Disease Phenotypes Based on the Molecular Networks With Condition-Responsive Correlation.” International Journal of Data Mining and Bioinformatics 5 (2): 131-42.

Network-based methods that integrate molecular interaction networks with gene expression profiles have been proposed to address problems that arise when the number of samples is small compared with the number of predictors. However, previous network-based methods have focused only on the expression levels of proteins (the nodes in the network) rather than on the identification of condition-responsive interactions. We propose a novel network-based classification that focuses on both nodes with discriminative expression levels and edges with Condition-Responsive Correlations (CRCs) across two phenotypes. We found that modules with condition-responsive interactions provide candidate molecular models for diseases and show improved performance compared with conventional gene-centric classification methods.

Paik, Hyojung, Eunjung Lee, Inho Park, Junho Kim, and Doheon Lee. 2011. “Prediction of Cancer Prognosis With the Genetic Basis of Transcriptional Variations.” Genomics 97 (6): 350-7. https://doi.org/10.1016/j.ygeno.2011.03.005.

Phenotypes of diseases, including prognosis, are likely to have complex etiologies and be derived from interactive mechanisms, including genetic and protein interactions. Many computational methods have been used to predict survival outcomes without explicitly identifying interactive effects, such as the genetic basis for transcriptional variations. We have therefore proposed a classification method based on the interaction between genotype and transcriptional expression features (CORE-F). This method considers the overall "genetic architecture," referring to genetically based transcriptional alterations that influence prognosis. In comparing the performance of CORE-F with the ensemble tree, the best-performing method predicting patient survival, we found that CORE-F outperformed the ensemble tree (mean AUC, 0.85 vs. 0.72). Moreover, the trained associations in the CORE-F successfully identified the genetic mechanisms underlying survival outcomes at the interaction-network level.

Xi, Ruibin, Angela G Hadjipanayis, Lovelace J Luquette, Tae-Min Kim, Eunjung Lee, Jianhua Zhang, Mark D Johnson, et al. 2011. “Copy Number Variation Detection in Whole-Genome Sequencing Data Using the Bayesian Information Criterion.” Proceedings of the National Academy of Sciences of the United States of America 108 (46): E1128-36. https://doi.org/10.1073/pnas.1110574108.

DNA copy number variations (CNVs) play an important role in the pathogenesis and progression of cancer and confer susceptibility to a variety of human disorders. Array comparative genomic hybridization has been used widely to identify CNVs genome wide, but the next-generation sequencing technology provides an opportunity to characterize CNVs genome wide with unprecedented resolution. In this study, we developed an algorithm to detect CNVs from whole-genome sequencing data and applied it to a newly sequenced glioblastoma genome with a matched control. This read-depth algorithm, called BIC-seq, can accurately and efficiently identify CNVs via minimizing the Bayesian information criterion. Using BIC-seq, we identified hundreds of CNVs as small as 40 bp in the cancer genome sequenced at 10× coverage, whereas we could only detect large CNVs (> 15 kb) in the array comparative genomic hybridization profiles for the same genome. Eighty percent (14/16) of the small variants tested (110 bp to 14 kb) were experimentally validated by quantitative PCR, demonstrating high sensitivity and true positive rate of the algorithm. We also extended the algorithm to detect recurrent CNVs in multiple samples as well as deriving error bars for breakpoints using a Gibbs sampling approach. We propose this statistical approach as a principled yet practical and efficient method to estimate CNVs in whole-genome sequencing data.

2010

Jung, Juhyun, Taewoo Ryu, Yongdeuk Hwang, Eunjung Lee, and Doheon Lee. 2010. “Prediction of Extracellular Matrix Proteins Based on Distinctive Sequence and Domain Characteristics.” Journal of Computational Biology: A Journal of Computational Molecular Cell Biology 17 (1): 97-105. https://doi.org/10.1089/cmb.2008.0236.

Extracellular matrix (ECM) proteins are secreted to the exterior of the cell, and function as mediators between resident cells and the external environment. These proteins not only support cellular structure but also participate in diverse processes, including growth, hormonal response, homeostasis, and disease progression. Despite their importance, current knowledge of the number and functions of ECM proteins is limited. Here, we propose a computational method to predict ECM proteins. Specific features, such as ECM domain score and repetitive residues, were utilized for prediction. Based on previously employed and newly generated features, discriminatory characteristics for ECM protein categorization were determined, which significantly improved the performance of Random Forest and support vector machine (SVM) classification. We additionally predicted novel ECM proteins from non-annotated human proteins, validated with gene ontology and earlier literature. Our novel prediction method is available at biosoft.kaist.ac.kr/ecm.

Paik, Hyojung, Eunjung Lee, and Doheon Lee. 2010. “Relationships Between Genetic Polymorphisms and Transcriptional Profiles for Outcome Prediction in Anticancer Agent Treatment.” BMB Reports 43 (12): 836-41. https://doi.org/10.5483/BMBRep.2010.43.12.836.

In the era of personal genomics, predicting an individual's response to drug treatment is a challenge for biomedical research. The aim of this study was to test whether interaction information between genetic and transcriptional signatures is a promising feature for predicting drug response. Because drug resistance and susceptibility result from complex associations between genetic and transcriptional activities, we predicted the inter-relationships between genetic and transcriptional signatures. With this concept, genetic polymorphisms and transcriptional profiles were captured in cancer samples. The ninety-nine samples were split into a trial set (n = 30) and a test set (n = 69); the relationship-focused model showed superior performance in the trial set (area under the curve of 0.84, P = 2.90 x 10⁻⁴), and this result was validated in the test set. The modeling results show that considering the relationships between genetic and transcriptional features is an effective approach to predicting drug-treatment outcomes.

2009

Lee, Eunjung, Hyunchul Jung, Predrag Radivojac, Jong-Won Kim, and Doheon Lee. 2009. “Analysis of AML Genes in Dysregulated Molecular Networks.” Summit on Translational Bioinformatics 2009: 1-18.

BACKGROUND: Identifying disease-causing genes and understanding their molecular mechanisms are essential to developing effective therapeutics. Thus, several computational methods have been proposed to prioritize candidate disease genes by integrating different data types, including sequence information, biomedical literature, and pathway information. Recently, molecular interaction networks have been incorporated to predict disease genes, but most of those methods do not utilize the invaluable disease-specific information available in mRNA expression profiles of patient samples.

RESULTS: Through the integration of protein-protein interaction networks and gene expression profiles of acute myeloid leukemia (AML) patients, we identified subnetworks of interacting proteins dysregulated in AML and characterized known mutation genes causally implicated in AML that are embedded in the subnetworks. The analysis shows that the set of extracted subnetworks is a reservoir rich in AML genes reflecting key leukemogenic processes such as myeloid differentiation,

CONCLUSION: We showed that the integrative approach, utilizing both gene expression profiles and molecular networks, could identify AML-causing genes, most of which were not detectable with gene expression analysis alone because of their minor changes in mRNA.