Publications

2013

Lin C, Karlson E, Canhao H, Miller T, Dligach D, Chen PJ, Perez RNG, Shen Y, Weinblatt M, Shadick N, et al. Automatic prediction of rheumatoid arthritis disease activity from the electronic medical records.. PloS one. 2013;8(8):e69932. doi:10.1371/journal.pone.0069932
OBJECTIVE: We aimed to mine the data in the Electronic Medical Record to automatically discover patients' Rheumatoid Arthritis disease activity at discrete rheumatology clinic visits. We cast the problem as a document classification task where the feature space includes concepts from the clinical narrative and lab values as stored in the Electronic Medical Record. MATERIALS AND METHODS: The Training Set consisted of 2792 clinical notes and associated lab values. Test Set 1 included 1749 clinical notes and associated lab values. Test Set 2 included 344 clinical notes for which there were no associated lab values. The Apache clinical Text Analysis and Knowledge Extraction System was used to analyze the text and transform it into informative features to be combined with relevant lab values. RESULTS: Experiments over a range of machine learning algorithms and features were conducted. The best performing combination was linear kernel Support Vector Machines with Unified Medical Language System Concept Unique Identifier features with feature selection and lab values. The Area Under the Receiver Operating Characteristic Curve (AUC) is 0.831 ($\sigma$ = 0.0317), statistically significant as compared to two baselines (AUC = 0.758, $\sigma$ = 0.0291). Algorithms demonstrated superior performance on cases clinically defined as extreme categories of disease activity (Remission and High) compared to those defined as intermediate categories (Moderate and Low) and included laboratory data on inflammatory markers. CONCLUSION: Automatic Rheumatoid Arthritis disease activity discovery from Electronic Medical Record data is a learnable task approximating human performance. As a result, this approach might have several research applications, such as the identification of patients for genome-wide pharmacogenetic studies that require large sample sizes with precise definitions of disease activity and response to therapies.

2012

Polepalli Ramesh B, Prasad R, Miller T, Harrington B, Yu H. Automatic discourse connective detection in biomedical text. Journal of the American Medical Informatics Association. 2012;19(5):800–808. doi:10.1136/amiajnl-2011-000775
OBJECTIVE: Relation extraction in biomedical text mining systems has largely focused on identifying clause-level relations, but increasing sophistication demands the recognition of relations at discourse level. A first step in identifying discourse relations involves the detection of discourse connectives: words or phrases used in text to express discourse relations. In this study supervised machine-learning approaches were developed and evaluated for automatically identifying discourse connectives in biomedical text.$\backslash$n$\backslash$nMATERIALS AND METHODS: Two supervised machine-learning models (support vector machines and conditional random fields) were explored for identifying discourse connectives in biomedical literature. In-domain supervised machine-learning classifiers were trained on the Biomedical Discourse Relation Bank, an annotated corpus of discourse relations over 24 full-text biomedical articles (\~112,000 word tokens), a subset of the GENIA corpus. Novel domain adaptation techniques were also explored to leverage the larger open-domain Penn Discourse Treebank (\~1 million word tokens). The models were evaluated using the standard evaluation metrics of precision, recall and F1 scores.$\backslash$n$\backslash$nRESULTS AND CONCLUSION: Supervised machine-learning approaches can automatically identify discourse connectives in biomedical text, and the novel domain adaptation techniques yielded the best performance: 0.761 F1 score. A demonstration version of the fully implemented classifier BioConn is available at: http://bioconn.askhermes.org.

2011

Li D, Miller T, Schuler W. A Pronoun Anaphora Resolution System based on Factorial Hidden Markov Models. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1 (HLT ’11). 2011. pp. 1169–1178.
This paper presents a supervised pronoun anaphora resolution system based on factorial hidden Markov models (FHMMs). The basic idea is that the hidden states of FHMMs are an explicit short-term memory with an antecedent buffer containing recently described referents. Thus an observed pronoun can find its antecedent from the hidden buffer, or in terms of a generative model, the entries in the hidden buffer generate the corresponding pronouns. A system implementing this model is evaluated on the ACE corpus with promising performance.

2010

Schuler W, AbdelRahman S, Miller T, Schwartz L. Broad-Coverage Parsing Using Human-Like Memory Constraints. Computational Linguistics. 2010;36(1):1–30.
Human syntactic processing shows many signs of taking place within a general-purpose short-term memory. But this kind of memory is known to have a severely constrained storage capacity—possibly constrained to as few as three or four distinct elements. This article describes a model of syntactic processing that operates successfully within these severe constraints, by recognizing constituents in a right-corner transformed representation (a variant of left-corner parsing) and mapping this representation to random variables in a Hierarchic Hidden Markov Model, a factored time-series model which probabilistically models the contents of a bounded memory store over time. Evaluations of the coverage of this model on a large syntactically annotated corpus of English sentences, and the accuracy of a a bounded-memory parsing strategy based on this model, suggest this model may be cognitively plausible.