Publications

P

Reis B, Olson K, Tian L, Bohn R, Brownstein J, Park P, Cziraky M, Wilson M, Mandl K. A pharmacoepidemiological network model for drug safety surveillance: statins and rhabdomyolysis. Drug Saf. 2012;35(5):395–406. doi:10.2165/11596610-000000000-00000
BACKGROUND: Recent withdrawals of major drugs have highlighted the critical importance of drug safety surveillance in the postmarketing phase. Limitations of spontaneous report data have led drug safety professionals to pursue alternative postmarketing surveillance approaches based on healthcare administrative claims data. These data are typically analysed by comparing the adverse event rates associated with a drug of interest to those of a single comparable reference drug. OBJECTIVE: The aim of this study was to determine whether adverse event detection can be improved by incorporating information from multiple reference drugs. We developed a pharmacological network model that implemented this approach and evaluated its performance. METHODS: We studied whether adverse event detection can be improved by incorporating information from multiple reference drugs, and describe two approaches for doing so. The first, reported previously, combines a set of related drugs into a single reference cohort. The second is a novel pharmacoepidemiological network model, which integrates multiple pair-wise comparisons across an entire set of related drugs into a unified consensus safety score for each drug. We also implemented a single reference drug approach for comparison with both multi-drug approaches. All approaches were applied within a sequential analysis framework, incorporating new information as it became available and addressing the issue of multiple testing over time. We evaluated all these approaches using statin (HMG-CoA reductase inhibitors) safety data from a large healthcare insurer in the US covering April 2000 through March 2005. RESULTS: We found that both multiple reference drug approaches offer earlier detection (6-13 months) than the single reference drug approach, without triggering additional false positives. 
CONCLUSIONS: Such combined approaches have the potential to be used with existing healthcare databases to improve the surveillance of therapeutics in the postmarketing phase over single-comparator methods. The proposed network approach also provides an integrated visualization framework enabling decision makers to understand the key high-level safety relationships amongst a group of related drugs.

M

Reis B, Brownstein J. Measuring the impact of health policies using Internet search patterns: the case of abortion. BMC Public Health. 2010;10:514. doi:10.1186/1471-2458-10-514
BACKGROUND: Internet search patterns have emerged as a novel data source for monitoring infectious disease trends. We propose that these data can also be used more broadly to study the impact of health policies across different regions in a more efficient and timely manner. METHODS: As a test use case, we studied the relationships between abortion-related search volume, local abortion rates, and local abortion policies available for study. RESULTS: Our initial integrative analysis found that, both in the US and internationally, the volume of Internet searches for abortion is inversely proportional to local abortion rates and directly proportional to local restrictions on abortion. CONCLUSION: These findings are consistent with published evidence that local restrictions on abortion lead individuals to seek abortion services outside of their area. Further validation of these methods has the potential to produce a timely, complementary data source for studying the effects of health policies.

L

Reis B, Kohane I, Mandl K. Longitudinal histories as predictors of future diagnoses of domestic abuse: modelling study. BMJ. 2009;339:b3677. doi:10.1136/bmj.b3677
OBJECTIVE: To determine whether longitudinal data in patients' historical records, commonly available in electronic health record systems, can be used to predict a patient's future risk of receiving a diagnosis of domestic abuse. DESIGN: Bayesian models, known as intelligent histories, used to predict a patient's risk of receiving a future diagnosis of abuse, based on the patient's diagnostic history. Retrospective evaluation of the model's predictions using an independent testing set. SETTING: A state-wide claims database covering six years of inpatient admissions to hospital, admissions for observation, and encounters in emergency departments. POPULATION: All patients aged over 18 who had at least four years between their earliest and latest visits recorded in the database (561,216 patients). MAIN OUTCOME MEASURES: Timeliness of detection, sensitivity, specificity, positive predictive values, and area under the ROC curve. RESULTS: 1.04% (5829) of the patients met the narrow case definition for abuse, while 3.44% (19,303) met the broader case definition for abuse. The model achieved sensitive, specific (area under the ROC curve of 0.88), and early (10-30 months in advance, on average) prediction of patients' future risk of receiving a diagnosis of abuse. Analysis of model parameters showed important differences between sexes in the risks associated with certain diagnoses. CONCLUSIONS: Commonly available longitudinal diagnostic data can be useful for predicting a patient's future risk of receiving a diagnosis of abuse. This modelling approach could serve as the basis for an early warning system to help doctors identify high risk patients for further screening.
Fine A, Nigrovic L, Reis B, Cook F, Mandl K. Linking surveillance to action: incorporation of real-time regional data into a medical decision rule. J Am Med Inform Assoc. 2007;14(2):206–11. doi:10.1197/jamia.M2253
OBJECTIVE: Broadly, to create a bidirectional communication link between public health surveillance and clinical practice. Specifically, to measure the impact of integrating public health surveillance data into an existing clinical prediction rule. We incorporate data about recent local trends in meningitis epidemiology into a prediction model differentiating aseptic from bacterial meningitis. DESIGN AND MEASUREMENTS: Retrospective analysis of a cohort of all 696 children with meningitis admitted to a large urban pediatric hospital from 1992 to 2000. We modified a published bacterial meningitis score by adding a new epidemiological context adjustor variable. We examined 540 possible rules for this adjustor, varying both the number of aseptic meningitis cases that needed to be seen, and the recent time window in which they were seen. We performed sensitivity analyses with each of 540 possibilities in order to identify the optimal rule--namely, the one that included the most cases of aseptic meningitis without missing additional cases of bacterial meningitis, as compared with the published prediction model. We used bootstrap methods to validate this new score. RESULTS: The optimal rule was found to be: "at least four cases of aseptic meningitis in the previous 10 days." The epidemiological context adjustor based on surveillance of recent cases of meningitis allowed the correct identification of an additional 47 cases (7%) of aseptic meningitis without missing any additional cases of bacterial meningitis. The epidemiological context adjustor was validated, showing significance in 84% of 1,000 bootstrap samples. CONCLUSION: Epidemiological contextual information can improve the performance of a clinical prediction rule. We provide a methodological framework for leveraging regional surveillance data to improve medical decision-making.
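The optimal rule reported above ("at least four cases of aseptic meningitis in the previous 10 days") can be illustrated with a minimal Python sketch. The function name, its parameters, and the choice to exclude the current day from the window are assumptions made for illustration; the abstract does not specify how the window boundaries were handled or how the adjustor enters the published score.

```python
from datetime import date, timedelta


def context_adjustor_active(aseptic_case_dates, today, min_cases=4, window_days=10):
    """Hypothetical helper: return True when at least `min_cases` aseptic
    meningitis cases were seen in the `window_days` days before `today`.

    Assumption: the window covers the `window_days` calendar days
    immediately preceding `today`, and `today` itself is excluded.
    """
    window_start = today - timedelta(days=window_days)
    recent = [d for d in aseptic_case_dates if window_start <= d < today]
    return len(recent) >= min_cases


if __name__ == "__main__":
    # Illustrative dates only, not data from the study.
    today = date(2000, 7, 15)
    cases = [date(2000, 7, 6), date(2000, 7, 8), date(2000, 7, 10), date(2000, 7, 14)]
    print(context_adjustor_active(cases, today))
```

Varying `min_cases` and `window_days` over a grid is one way the family of candidate rules described in the abstract could be enumerated, although the exact grid behind the 540 rules is not given here.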

I

Barak-Corren Y, Reis B. Internet activity as a proxy for vaccination compliance. Vaccine. 2015;33(21):2395–8. doi:10.1016/j.vaccine.2015.03.100
Tracking the progress of vaccination campaigns is a challenging and important public health need. Examining a recent polio outbreak in the Middle East, we show that novel methods utilizing online search trends have great potential to provide a real-time, reliable proxy for vaccination rates over space and time.
Syndromic surveillance systems are being deployed widely to monitor for signals of covert bioterrorist attacks. Regional systems are being established through the integration of local surveillance data across multiple facilities. We studied how different methods of data integration affect outbreak detection performance. We used a simulation relying on a semi-synthetic dataset, introducing simulated outbreaks of different sizes into historical visit data from two hospitals. In one simulation, we introduced the synthetic outbreak evenly into both hospital datasets (aggregate model). In the second, the outbreak was introduced into only one or the other of the hospital datasets (local model). We found that the aggregate model had a higher sensitivity for detecting outbreaks that were evenly distributed between the hospitals. However, for outbreaks that were localized to one facility, maintaining individual models for each location proved to be better. Given the complementary benefits offered by both approaches, the results suggest building a hybrid system that includes both individual models for each location, and an aggregate model that combines all the data. We also discuss options for multi-level signal integration hierarchies.
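The trade-off between aggregate and local models described above can be made concrete with a small signal-to-noise calculation. This is a simplified sketch, not the simulation used in the study: it assumes each hospital's daily visit count has independent noise with the same standard deviation `sigma`, and it measures detectability as the number of extra visits divided by the noise standard deviation. All names and numbers are hypothetical.

```python
import math


def local_z(extra_visits_at_site, sigma):
    """Signal-to-noise ratio seen by a model of one hospital."""
    return extra_visits_at_site / sigma


def aggregate_z(extra_visits_total, sigma, n_sites=2):
    """Signal-to-noise ratio seen by a model of the summed counts.

    Assumption: site noise is independent, so the standard deviation of
    the sum is sigma * sqrt(n_sites).
    """
    return extra_visits_total / (sigma * math.sqrt(n_sites))


if __name__ == "__main__":
    sigma = 10.0     # hypothetical day-to-day noise per hospital
    outbreak = 20.0  # hypothetical number of extra visits

    # Outbreak split evenly: each hospital sees half of the extra visits.
    print("even, local:     ", local_z(outbreak / 2, sigma))
    print("even, aggregate: ", aggregate_z(outbreak, sigma))

    # Outbreak confined to one hospital.
    print("local, local:    ", local_z(outbreak, sigma))
    print("local, aggregate:", aggregate_z(outbreak, sigma))
```

Under these assumptions the aggregate model sees the stronger signal for an evenly spread outbreak, while the single-hospital model sees the stronger signal for a localized one, which is consistent with the hybrid design the abstract suggests.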
McMurry A, Fitch B, Savova G, Kohane I, Reis B. Improved de-identification of physician notes through integrative modeling of both public and private medical text. BMC Med Inform Decis Mak. 2013;13:112. doi:10.1186/1472-6947-13-112
BACKGROUND: Physician notes routinely recorded during patient care represent a vast and underutilized resource for human disease studies on a population scale. Their use in research is primarily limited by the need to separate confidential patient information from clinical annotations, a process that is resource-intensive when performed manually. This study seeks to create an automated method for de-identifying physician notes that does not require large amounts of private information: in addition to training a model to recognize Protected Health Information (PHI) within private physician notes, we reverse the problem and train a model to recognize non-PHI words and phrases that appear in public medical texts. METHODS: Public and private medical text sources were analyzed to distinguish common medical words and phrases from Protected Health Information. Patient identifiers are generally nouns and numbers that appear infrequently in medical literature. To quantify this relationship, term frequencies and part of speech tags were compared between journal publications and physician notes. Standard medical concepts and phrases were then examined across ten medical dictionaries. Lists and rules were included from the US census database and previously published studies. In total, 28 features were used to train decision tree classifiers. RESULTS: The model successfully recalled 98% of PHI tokens from 220 discharge summaries. Cost sensitive classification was used to weight recall over precision (98% F10 score, 76% F1 score). More than half of the false negatives were the word "of" appearing in a hospital name. All patient names, phone numbers, and home addresses were at least partially redacted. Medical concepts such as "elevated white blood cell count" were informative for de-identification. The results exceed the previously approved criteria established by four Institutional Review Boards. 
CONCLUSIONS: The results indicate that distributional differences between private and public medical text can be used to accurately classify PHI. The data and algorithms reported here are made freely available for evaluation and improvement.
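The gap between the F10 and F1 scores reported above follows from how the F-beta measure weights recall against precision: F_beta = (1 + beta^2) * P * R / (beta^2 * P + R). The Python sketch below shows the standard formula. The precision value in the example is a hypothetical number chosen for illustration, since the abstract reports recall and the two F scores but not precision.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score; beta > 1 weights recall more than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    recall = 0.98     # recall reported in the abstract
    precision = 0.62  # hypothetical value, not reported in the abstract
    print("F1: ", round(f_beta(precision, recall, beta=1.0), 3))
    print("F10:", round(f_beta(precision, recall, beta=10.0), 3))
```

With beta = 10, recall dominates the score, so a classifier tuned to miss very little PHI can score highly on F10 even when its precision, and therefore its F1, is considerably lower.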

H

Freifeld C, Mandl K, Reis B, Brownstein J. HealthMap: global infectious disease monitoring through automated classification and visualization of Internet media reports. J Am Med Inform Assoc. 2008;15(2):150–7. doi:10.1197/jamia.M2544
OBJECTIVE: Unstructured electronic information sources, such as news reports, are proving to be valuable inputs for public health surveillance. However, staying abreast of current disease outbreaks requires scouring a continually growing number of disparate news sources and alert services, resulting in information overload. Our objective is to address this challenge through the HealthMap.org Web application, an automated system for querying, filtering, integrating and visualizing unstructured reports on disease outbreaks. DESIGN: This report describes the design principles, software architecture and implementation of HealthMap and discusses key challenges and future plans. MEASUREMENTS: We describe the process by which HealthMap collects and integrates outbreak data from a variety of sources, including news media (e.g., Google News), expert-curated accounts (e.g., ProMED Mail), and validated official alerts. Through the use of text processing algorithms, the system classifies alerts by location and disease and then overlays them on an interactive geographic map. We measure the accuracy of the classification algorithms based on the level of human curation necessary to correct misclassifications, and examine geographic coverage. RESULTS: As part of the evaluation of the system, we analyzed 778 reports with HealthMap, representing 87 disease categories and 89 countries. The automated classifier performed with 84% accuracy, demonstrating significant usefulness in managing the large volume of information processed by the system. Accuracy for ProMED alerts is 91% compared to Google News reports at 81%, as ProMED messages follow a more regular structure. CONCLUSION: HealthMap is a useful free and open resource employing text-processing algorithms to identify important disease outbreak information through a user-friendly interface.

E

Reis B, Butte, Kohane I. Extracting knowledge from dynamics in gene expression. J Biomed Inform. 2001;34(1):15–27. doi:10.1006/jbin.2001.1005
Most investigations of coordinated gene expression have focused on identifying correlated expression patterns between genes by examining their normalized static expression levels. In this study, we focus on the dynamics of gene expression by seeking to identify correlated patterns of changes in genetic expression level. In doing so, we build upon methods developed in clinical informatics to detect temporal trends of laboratory and other clinical data. We construct relevance networks from Saccharomyces cerevisiae gene-expression dynamics data and find genes with related functional annotations grouped together. While some of these associations are also found using a standard expression level analysis, many are identified exclusively through the dynamic analysis. These results strongly suggest that the analysis of gene expression dynamics is a necessary and important tool for studying regulatory and other functional relationships among genes. The source code developed for this investigation is freely available to all non-commercial investigators by contacting the authors.
Charland, Buckeridge D, Sturtevant, Melton, Reis B, Mandl K, Brownstein J. Effect of environmental factors on the spatio-temporal patterns of influenza spread. Epidemiol Infect. 2009;137(10):1377–87. doi:10.1017/S0950268809002283
Although spatio-temporal patterns of influenza spread often suggest that environmental factors play a role, their effect on the geographical variation in the timing of annual epidemics has not been assessed. We examined the effect of solar radiation, dew point, temperature and geographical position on the city-specific timing of epidemics in the USA. Using paediatric in-patient data from hospitals in 35 cities for each influenza season in the study period 2000-2005, we determined 'epidemic timing' by identifying the week of peak influenza activity. For each city we calculated averages of daily climate measurements for 1 October to 31 December. Bayesian hierarchical models were used to assess the strength of association between each variable and epidemic timing. Of the climate variables only solar radiation was significantly related to epidemic timing (95% CI -0.027 to -0.0032). Future studies may elucidate biological mechanisms intrinsically linked to solar radiation that contribute to epidemic timing in temperate regions.