Discovering New Drug-Drug Interactions by Text-Mining the Biomedical Literature

Department of Biomedical Informatics, Stanford University, Stanford, CA 94305–9510. E-mail: [email protected]. This paper was submitted as part of my final project for CS224N (Natural Language Processing) in the Winter 2011 term.

I. Introduction

Americans are living longer than ever before, and with that increased age comes a greater reliance on pharmaceuticals. For example, recent estimates by Kaiser Permanente indicate that the average 70-year-old American fills over 30 prescriptions per year [1]. The chance of an adverse drug reaction increases exponentially as each new drug is added to an individual's regimen. What many people do not know is that clinical trials for new drugs do not typically address the issue of drug-drug interactions (DDI) directly, and often test new drugs in young, healthy people who are not part of a given drug's target population. Because of this, potentially serious DDI are often not discovered until a drug is already on the market. In addition, a patient may be unaware that a symptom he experiences is due to a DDI, and may blame it on other factors. Many DDIs, therefore, probably go unreported.

Chemically speaking, most DDIs are the result of one of two possible factors. First, a drug may inhibit an enzyme that is responsible for metabolizing another drug, effectively increasing the second drug's concentration in the body. Second, a drug may cause the body to produce more of an enzyme that metabolizes another drug, effectively decreasing that drug's concentration [2]. In both cases, the DDI is actually the result of both drugs' interacting with a single enzyme, which is the protein product of a gene. Therefore, most drug-drug interactions are actually drug-gene-drug interactions.

Unfortunately, while lists of known DDIs are widely available and commonly used in clinical practice, drug-gene interactions are not as widely known. In addition, genes and drugs can interact in a variety of ways, and it is unclear which interaction types are most predictive of a drug's tendency to interact with other drugs. Furthermore, no complete databases exist that concisely describe the exact mechanisms by which drugs and genes interact; most of these interactions are only described in papers buried deep within the scientific literature.

In this environment, text mining presents a solution to the problem of uncovering novel DDIs. Previous work [7] has established methods for using a syntactic parser to identify and characterize drug-gene relationships. The end result was a semantic network of drug-gene relationships in which the edges consisted of several hundred interaction types normalized to concepts in an ontology. Here I present a method for using this approach to learn the types of drug-gene relationships that can predict drug-drug interactions, and then applying this method to predict novel DDIs.

II. Methods

A. Existing Code Base

Most of the code base that allowed me to extract semantic relationships for this project had already been constructed. However, the code had not yet been synthesized into a user-friendly pipeline, so the procedure took several manual steps:

• Create two lexicons of terms, one for gene names and one for drug names. I created two custom lexicons for this project. The first consisted of a set of 43 known pharmacologically important genes identified by the PharmGKB database [3]. These were mostly liver cytochromes and various detoxification enzymes responsible for key processes within important metabolic pathways. The second lexicon consisted of 602 unique drug names obtained from a list of drug interactions provided by the Veterans Affairs hospital system, which meant that all were guaranteed to interact with at least one other drug in the lexicon.

• Obtain a corpus of Medline article abstracts. Fortunately, the Helix Group here at Stanford had already downloaded all Medline abstracts published before 2009. The corpus contained about 17.5 million abstracts and 88 million sentences.

• Retrieve all sentences in Medline that mention both a drug and a gene of interest. (For the purposes of this project, the drug and gene entities of interest will be known as seeds.) I accomplished this using my two lexicons and running 100 search processes in parallel on Stanford's BioX2 cluster [4].

• Represent sentences as parse trees using the Stanford Parser [5]. If two seeds were not located in the same sentence clause, that sentence was removed from consideration. In addition, if a tree contained more than one clause and there was a clause that did not contain either seed, it was removed from consideration. A sample parse tree for one Medline sentence of interest is shown in Figure 1.

Fig. 1. Parse tree for a single sentence in Medline. The two seeds of interest are the drug name Pepstatin A and the gene name CYP3A4 (a liver cytochrome).
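The seed co-occurrence retrieval in the steps above can be sketched as follows. This is a minimal illustration with tiny hypothetical lexicons and a two-sentence toy corpus; the real run matched 602 drug names and 43 gene names against roughly 88 million sentences in parallel on a cluster, and the matching logic here is a stand-in, not the project's actual code.

```python
# Toy seed lexicons (hypothetical subsets; the real lexicons held 602 drugs
# and 43 genes).
drug_lexicon = {"pepstatin a", "warfarin", "erythromycin"}
gene_lexicon = {"cyp3a4", "cyp2d6"}

def find_seed_sentences(sentences):
    """Yield (sentence, drug, gene) for every sentence that mentions both a
    drug seed and a gene seed. Matching is deliberately crude (lowercase
    substring for multiword drug names, token match for gene symbols)."""
    for sentence in sentences:
        lowered = sentence.lower()
        tokens = set(lowered.replace(",", " ").replace(".", " ").split())
        drugs = {d for d in drug_lexicon if d in lowered}
        genes = {g for g in gene_lexicon if g in tokens}
        for drug in drugs:
            for gene in genes:
                yield sentence, drug, gene

corpus = [
    "Pepstatin A blocked the activity of CYP3A4 in vitro.",
    "Aspirin reduced fever in all subjects.",
]
hits = list(find_seed_sentences(corpus))
# hits contains one (sentence, drug, gene) triple, for the CYP3A4 sentence
```

A production version would need real named-entity recognition rather than substring matching, which is one reason the actual search required a cluster.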
Fig. 2. Dependency graph for the sentence shown in Figure 1. The red arrows show the path through the graph that connects the seeds Pepstatin A and CYP3A4. Because this path contains the verb "blocked", this is a valid sentence, and it moves on to the normalization step.
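The path check that Figure 2 illustrates can be sketched as a breadth-first walk over the dependency graph, keeping a sentence only if the path connecting the two seeds passes through a verb. The mini graph and part-of-speech tags below are hand-built stand-ins that mirror the figure, not actual Stanford Parser output.

```python
from collections import deque

# Hand-built dependency graph for the Figure 2 sentence, viewed as an
# undirected word-adjacency map (assumed structure, not parser output).
edges = {
    "blocked": {"Pepstatin_A", "CYP3A4"},
    "Pepstatin_A": {"blocked"},
    "CYP3A4": {"blocked"},
}
pos_tags = {"Pepstatin_A": "NN", "blocked": "VBD", "CYP3A4": "NN"}

def connecting_path(a, b):
    """Return the shortest path of words between seeds a and b, or None."""
    prev, frontier = {a: None}, deque([a])
    while frontier:
        node = frontier.popleft()
        if node == b:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for neighbour in edges.get(node, ()):
            if neighbour not in prev:
                prev[neighbour] = node
                frontier.append(neighbour)
    return None

path = connecting_path("Pepstatin_A", "CYP3A4")
# A sentence is kept when the connecting path contains a verb (VB* tag).
valid = path is not None and any(pos_tags[w].startswith("V") for w in path)
```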
• Convert parse trees into dependency graphs, also using the Stanford Parser [6]. The dependency graphs are rooted, oriented, and labeled graphs, where the nodes are words and the edges are dependency relations between words. The corresponding dependency graph for the parse tree in Figure 1 is shown in Figure 2.

• Extract raw relationships between the two entities of interest. Relations were of the form R(a, b), where a and b represent the locations of the two seeds on the dependency graph, and R is a node that connects a and b and indicates the nature of their relationship. To make it past this stage of the analysis, the relation connecting seeds a and b must have been a verb (e.g. associated) or a nominalized verb (e.g. association).

• Normalize relations. This was the trickiest part of the analysis, and depended on a custom ontology developed by members of the Helix Group at Stanford [7]. The process of normalization entails mapping the raw relations onto a much smaller set of normalized relationships taken from the ontology. For example, the raw relations associated and related both map to the ontological entity associated with. In addition, less-common terms like augment are mapped to more common synonyms, like increase. This has the advantage of decreasing the overall number of features that need to be considered in the analysis.

Based on my original lexicons of 602 drug names and 43 gene names, I was able to extract 9,418 sentences from Medline that contained two seeds of interest. All of these made it through the process of creating parse trees and dependency graphs, but only 3,522 made it through the process of extracting and normalizing the relations. Many sentences were cut because the relation connecting the two seeds of interest was not a verb or a nominalized verb, or because it was not a term recognized by the ontology. There were 344 unique relation types (isAssociatedWith, induces, etc.) represented among this final set of relations.

These 3,522 normalized relations represented the last part of the project for which I was able to use the Helix Group's previously-constructed code base. From here on, all of the analysis was conducted using my own [ugly] scripts in both R and Python.

B. Building the Semantic Network

To learn the types of gene-drug interactions that were most predictive of DDI, I built a bipartite network like the one in Figure 3, in which the nodes were the gene and drug seeds from the two lexicons and the edges were labeled with the normalized relation types found in the literature. The final network consisted of 602 drug nodes and 43 gene nodes, with 3,522 gene-drug edges. The edges were labeled using the 344 different relations that appeared in the final dataset.

Fig. 3. An example of a small semantic network. The green squares represent drugs, and the blue circles represent genes. Note that the network is bipartite, meaning that there are no gene-gene or drug-drug connections; the only connections allowed are between a gene and a drug. The relation from Figure 2 is illustrated by the edge between Pepstatin A and CYP3A4, which is labeled with the relation name "blocked". Another hypothetical relation, "isMetabolizedBy", is shown between the drug Warfarin and CYP3A4. We might hypothesize from this graph that Pepstatin A and Warfarin would interact, since both are connected to the same gene. An alternative representation of this network is an adjacency matrix, an example of which is shown at right.

Computationally, I represented this network as a three-dimensional adjacency matrix of dimensions 602 × 43 × 344. The elements of the matrix, A_ijk, were 1 if there was ≥ 1 valid sentence in the literature connecting drug i and gene j via relation k, and 0 if no such sentence existed.

C. Learning Interaction Rules

My overall goal in this project was to learn the types of relations between genes and drugs that were most predictive of drug-drug interactions. Conceptually, this meant considering all two-edge paths through the network that connected two drugs via a gene,

    drug --m--> gene <--n-- drug,

and determining which pairs of relation types were most indicative of drug-drug interactions.
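The slice-wise adjacency representation and the two-edge paths just described can be sketched as follows, using a toy network with made-up seeds and two relation types rather than the real 602 × 43 × 344 matrix:

```python
# Toy seeds and relations (hypothetical; the real network had 602 drugs,
# 43 genes, and 344 relation types).
drugs = ["pepstatin A", "warfarin"]
genes = ["CYP3A4"]

# A[k][i][j] = 1 if some valid sentence links drug i and gene j via relation k.
A = {
    "blocked": [[1], [0]],
    "isMetabolizedBy": [[0], [1]],
}

def path_counts(A_m, A_n):
    """Number of length-2 paths drug_i -> gene -> drug_j whose first edge
    carries relation m and whose second carries relation n: the matrix
    product A_m * A_n^T, written out with plain loops."""
    n_drugs, n_genes = len(A_m), len(A_m[0])
    P = [[0] * n_drugs for _ in range(n_drugs)]
    for i in range(n_drugs):
        for j in range(n_drugs):
            P[i][j] = sum(A_m[i][g] * A_n[j][g] for g in range(n_genes))
    return P

P = path_counts(A["blocked"], A["isMetabolizedBy"])
# P[0][1] == 1: pepstatin A --blocked--> CYP3A4 <--isMetabolizedBy-- warfarin
```

Summing over the gene index in A_m A_n^T is exactly what discards the identity of the shared gene while preserving the pair of relation types.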
PERCHA: FINAL PROJECT

The number of paths of length 2 that include relation types m and n between drug i and drug j is given by

    P_mn(i, j) = (A_m A_n^T)_ij,

where A_m and A_n are two-dimensional 602 × 43 "slices" of the larger 3-dimensional adjacency matrix that correspond to relations m and n. By multiplying these matrices, we eliminate all information about exactly which gene(s) the paths pass through; we only care whether paths exist that encompass the relation types of interest. Likewise, we do not differentiate between the two orderings

    drug --m--> gene <--n-- drug    and    drug --n--> gene <--m-- drug.

The actual rule-learning process could be accomplished using a variety of supervised learning techniques in which the response variable was

    y = 1 if the drug pair is a known interacting pair, and 0 otherwise,

and the predictor variables, x_i, were

    x_i = 1 if the two drugs are connected by path type i, and 0 if they are not connected by path type i.

Since there were 344 unique relations and each path contained two relations, the total number of features considered in the analysis was

    344 · 345 / 2 = 59,340.

The total size of the training set was the total number of drug pairs, or 602 · 601/2 = 180,901. The number of known interacting drug pairs was 2,217, so the data were quite sparse. The final dataframe used for the analysis consisted of 180,901 rows (all drug pairs) and 59,341 columns (59,340 features, plus one response column). I therefore represented it using the MatrixMarket sparse matrix representation format in R to increase computational efficiency.

Obviously, with so many features, some feature selection was required. I began my analysis by performing univariate t-tests for each feature, comparing its mean rate of occurrence between drug pairs that interacted and those that did not. Because my primary interest was in features that occurred more often for the interacting drug pairs than for the non-interacting drug pairs, I used a one-sided t-test and only accepted features where the proportion of occurrences was greater in the interacting group. After performing a simple Bonferroni correction [8] for multiple hypothesis testing, the p-value cutoff for a given feature to be included in the final analysis was 8.432 × 10^-7. I then incorporated the features that survived this initial cut into a multivariate logistic regression model, which I used to predict whether other drug pairs would interact.

D. Predicting New Interactions

The power of this approach is that it allows us to use observed paths within the semantic network to predict previously-unknown drug-drug interactions. To evaluate the final logistic regression model's predictive power, I built the model using only 90% of the original data (chosen randomly) and then tested it on the remaining 10%. I evaluated the model's performance at choosing the drug pairs that were part of the original set of 2,217 interactions, as well as the proportions of false negatives and false positives that occurred in the test set analysis. Finally, I consulted a popular online source of information about drug interactions to investigate the interactions predicted by the model that were not in the original list. I wanted to see if they were truly novel predictions, or if [more likely] they were known interactions that simply had not appeared in the list provided by the VA administration.

III. Results

A. Most Important Relations

As a first step in my analysis, I conducted t-tests for each feature, comparing its frequency of occurrence for interacting drug pairs vs. noninteracting drug pairs, and chose the features with the 100 lowest p-values. I then constructed a tag cloud of all the relations found within those features. The tag cloud is shown in Figure 4. The relations that occurred most often in paths linking interacting drug pairs were isMetabolizedBy and isAssociatedWith, closely followed by induces, influences, and inhibits.

The goal of this part of the analysis was simply to see whether the feature extraction method was pulling out relations that looked reasonable. Since it is likely that two drugs metabolized by the same gene would interact, or that two drugs that induce production of the same protein would interact, these terms make intuitive sense.

B. Feature Selection

Unsurprisingly, given the p-value adjustment for multiple hypothesis testing, there were only 9 features for which the proportion of occurrences among the interacting drug pairs was significantly greater than the proportion among non-interacting pairs. In addition, there were a few features that appeared or disappeared on this list depending on the specific 90% random sample chosen from the training set.

Fig. 5. Top 6 relations with P-values. This list contains the six features most strongly associated with interacting drug pairs vs. noninteracting drug pairs.
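The univariate screen described above can be sketched with a two-proportion z-test standing in for the project's t-tests (the features are binary, so the comparison of occurrence proportions is the same). The counts in the example are hypothetical; only the Bonferroni arithmetic follows the text.

```python
import math

def one_sided_pvalue(x1, n1, x0, n0):
    """Approximate one-sided p-value that a binary feature occurs MORE often
    among interacting pairs, via a pooled two-proportion z-test (a stand-in
    for the one-sided t-tests used in the project)."""
    p1, p0 = x1 / n1, x0 / n0
    pooled = (x1 + x0) / (n1 + n0)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n0))
    if se == 0:
        return 1.0
    z = (p1 - p0) / se
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability P(Z > z)

# Bonferroni cutoff for 59,340 features at an overall alpha of 0.05.
cutoff = 0.05 / 59340  # roughly 8.4e-7

# Hypothetical feature: present in 40 of 2,217 interacting pairs but only
# 50 of the 178,684 non-interacting pairs; it easily clears the cutoff.
p = one_sided_pvalue(40, 2217, 50, 178684)
keep = p < cutoff
```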
Fig. 4. Tag cloud of most important relations. The relations shown here are sized relative to how often they appeared in the top 100 features most predictive of drug-drug interactions.

The "consensus list" is shown in Figure 5. The lowest p-value occurred for the feature

    drug --inhibits--> gene <--isMetabolizedBy-- drug,

which indicates that two drugs are likely to interact if one inhibits the production of a gene product that in turn is responsible for metabolizing the other. This makes perfect sense biologically, because co-administration of those two drugs would lead to highly elevated levels of drug 2 within the body. Another important feature of interest is

    drug --induces--> gene <--isMetabolizedBy-- drug,

which would work in the opposite direction if drugs 1 and 2 were co-administered: the presence of drug 1 would induce the production of a gene product that metabolizes drug 2, leading to decreased levels of drug 2 within the body and a decrease in drug 2's therapeutic efficacy. Some of the features, such as

    drug --isAssociatedWith--> gene <--isAssociatedWith-- drug,

are less clear, mechanistically speaking. The relation isAssociatedWith is a relatively high-level term in the ontology, and encompasses a wide variety of other terms that cannot be mapped to a lower-level, more specific relation like induces or inhibits. Therefore, these interactions could very well represent biological mechanisms similar to the two discussed above; they simply weren't described that way in the literature.

One interesting feature that placed highly on the list was

    drug <--includes-- gene <--isMetabolizedBy-- drug.

On the surface, it is unclear what the verb includes is referring to. Does the protein product of the gene include a molecule or structural motif that resembles the drug? It is difficult to tell without looking at the raw sentences. Upon further inspection, we see that includes is usually used in sentences directly describing relationships between drugs and genes, such as

    Clinically important CYP3A4 inhibitors include itraconazole, ketoconazole, clarithromycin, erythromycin, nefazodone, ritonavir and grapefruit.

This single sentence includes three drugs of interest and one gene of interest; given the wide variety of drugs metabolized by CYP3A4, it is no wonder that a feature comprised of includes and isMetabolizedBy is such a strong predictor of drug interactions. It is also worth noting that the real relations described in this sentence are inhibitory:

    erythromycin --inhibits--> CYP3A4
    nefazodone --inhibits--> CYP3A4
    ritonavir --inhibits--> CYP3A4

so although the normalized relation chosen by the code was includes, the real relationships are inhibitory ones, similar to the other features discussed earlier.

C. Predicting New Interactions

Using the nine features shown in Figure 5, I built a logistic regression model using known DDI status as the outcome and a random sample of 90% of the original data as the training set. I then used that model to predict which drug pairs in the test set were most likely to interact. Unfortunately, only three drug pairs in the test set were predicted to interact; that is, the probability of their interaction, as given by the logistic regression model, was greater than 0.5. These three pairs were:

    ketoconazole / tacrolimus
    erlotinib / erythromycin
    amlodipine / erythromycin

The first was a known interaction from the list provided by the VA, but the other two were not on the list. The total number of known interactions in the test set was 217, so this seemed to indicate extremely poor model performance on this test set.

However, when I looked up the other two interactions in the online drug-interaction reference, I was surprised to find that one (erlotinib and erythromycin) was considered a moderately important interaction. For example, the following warning was issued:

    Caution is advised if erlotinib must be used with potent CYP450 3A4 inhibitors such as itraconazole, ketoconazole, voriconazole, nefazodone, delavirdine, protease inhibitors, and ketolide and certain macrolide antibiotics. According to product labeling, coadministration with the potent CYP450 3A4 inhibitor ketoconazole increased erlotinib area under the plasma concentration-time curve (AUC) by two-thirds compared to administration of erlotinib alone.

Indeed, erythromycin was one of the macrolide antibiotics known to interact with erlotinib. Erlotinib (brand name Tarceva) is a drug most often used in cancer chemotherapy, so its interaction with a drug used to treat bacterial infections is somewhat surprising. Nonetheless the model, based solely on the drugs' relationships to common genes as described in the scientific literature, was able to pick it up. The other drug combination, amlodipine and erythromycin, was not listed as a known interaction in the online reference.

D. Further Predictions

I was interested in seeing which drug pairs my model ranked most highly as likely interactions, even if the probability of those interactions, as given by the model, did not reach the cutoff of 0.5. I therefore ranked the top 20 most likely interacting pairs from the test set, as predicted by the final model. The results are shown in Figure 6.

    ketoconazole / tacrolimus
    amlodipine / erythromycin
    erlotinib / erythromycin
    gefitinib / testosterone
    codeine / quinidine
    ketoconazole / nefazodone
    nefazodone / tacrolimus
    alprazolam / nefazodone
    nefazodone / repaglinide
    nefazodone / pimozide
    gefitinib / pravastatin
    amlodipine / itraconazole
    fluoxetine / terfenadine
    erythromycin / methadone
    erythromycin / midazolam
    atomoxetine / methadone
    methadone / nicotine
    captopril / enalapril

Fig. 6. The top 20 most likely interacting drug pairs, as predicted by the final model. Although many of these drug pairs were not represented on the original VA interaction list, they had at least moderate interactions according to the online reference. A single "X" represents a moderate interaction, while "XX" represents a severe interaction.

Of the 20 drug pairs on the list, 7 (35%) had known interactions from the VA list and 14 (70%) had known moderate or severe interactions according to the online reference.

However, these results only show that the model is able to pick up known interacting drug pairs at a fairly high rate. Since the drugs from the lexicon are known to interact with at least one other drug, the model may simply have relatively high sensitivity but low specificity (i.e. it is unable to tell when a drug pair will not interact). To get a sense of the model's specificity, I chose a random sample of twenty drug pairs for which the probability of interaction (according to the model) was less than 0.02. I then repeated my analysis for those pairs. The results are shown in Figure 7. Of the 20 pairs, 2 (10%) were on the VA list and 7 (35%) had known interactions according to the online reference.

    bexarotene / potassium
    pancuronium / penicillin
    indomethacin / potassium
    miconazole / tocainide
    hyoscyamine / moxifloxacin
    bepridil / fluoxetine
    cisplatin / isoniazid
    lamotrigine / triamterene
    insulin / theophylline
    omeprazole / valsartan
    mibefradil / minocycline
    atorvastatin / fluconazole
    hydrocodone / memantine
    mephentermine / metformin
    heparin / oxyphenbutazone
    bumetanide / sulindac
    halazepam / naproxen
    amobarbital / isoniazid

Fig. 7. A random sample of drug pairs with interaction probabilities of less than 2%. The symbols are the same as those in Figure 6.

IV. Discussion

In this report, I describe a novel method for predicting drug-drug interactions based on a combination of techniques from natural language processing and machine learning. The raw extraction of textual features of interest (the normalized relations) was accomplished using the same sentence parsing techniques we explored in Programming Assignment 3. I then used basic techniques from network theory (the concept of an adjacency matrix; using an adjacency matrix representation to find all paths of length 2 in a network) and machine learning (feature selection; the Bonferroni correction; logistic regression) to evaluate the textual features and find those most predictive of drug-drug interactions. To me, this project perfectly illustrates the power of natural language processing: distilling free text into machine-readable features that many known algorithms already know how to handle. This pipeline allows us to make meaningful inferences from text that would be difficult or impossible without first deconstructing the role of each textual element and deciphering how the different elements (noun phrases, verbs, etc.) relate to each other.

Of course, the performance of the final model on the test set was not ideal. If we consider the list of known interactions from the VA as our gold standard, the test set contained 217 interacting pairs and 17,874 noninteracting (or unknown) pairs. The model was only able to detect one of the 217 pairs; there were two false positives and 216 false negatives, along with 17,872 true negatives. If we estimate the model's sensitivity and specificity based solely on this test set, therefore, we obtain

    sensitivity = 1/217 ≈ 0.5%
    specificity = 17,872/17,874 ≈ 99.99%,

which indicates that the model is great at rejecting drug pairs that do not interact, but terrible at picking up those that do.

There are several reasons for this that could be addressed in future versions of the model, however. First, the number of textual co-occurrences of terms from the two lexicons was actually quite small (9,418 sentences; 3,522 normalized relations) compared to the number of terms involved (602 drugs and 43 genes). The main reason for this was that my two lexicons did not include synonyms. I wanted to obtain preliminary results for the project as quickly as possible, and a full search of the literature that included all potential synonyms for this many gene and drug names can take over a week. To give a sense of how drastically this cut down my number of "hits", here is a list of the synonyms for erythromycin, the macrolide antibiotic discussed earlier:

    Abboticin; Abomacetin; Ak-mycin; Akne-Mycin; Eryc 125; Eryc Sprinkles; Erycen; Erycinum; Erythroguent; Erythromycin A; Erythromycin B; Erythromycin Stearate; Erythromycin estolate; Erythromycin lactobionate; Kesso-Mycin; Mephamycin; Pantomicina; R-P Mycin; Robimycin; Sans-Acne Solution; Sansac; Serp-AFD; Staticin Lot; Stiemycin; Stievamycin Forte Gel; Stievamycin Gel; T-Stat; T-Stat Lot

Searching only for "erythromycin", therefore, misses many synonyms that would have been mapped to the term "erythromycin" during the normalization process. Perhaps more crucially, genes also have many synonyms. Here is a list for CYP3A4, a liver cytochrome discussed earlier:

    NF-25; P450C3; P450PCN1; *1A; CP33; CP34; cytochrome P450, family 3, subfamily A, polypeptide 4; glucocorticoid-inducible P450; nifedipine oxidase; P450-III, steroid inducible; cytochrome P450, subfamily IIIA (niphedipine oxidase), polypeptide 3; cytochrome P450, subfamily IIIA (niphedipine oxidase), polypeptide 4

Gene nomenclature is perhaps even less standardized than drug nomenclature, so eliminating synonyms for genes was an even more serious omission. In any case, this problem is easily rectified by including synonyms in subsequent literature searches.

On a related note, the model's lowest assigned probability of interaction (Figure 7) occurred for drug pairs where neither drug appeared anywhere in the literature. Since this total absence from literature references is unlikely, its most probable cause is that the search term was actually a less-common synonym for another entity. This may help explain why several interacting drug pairs showed up in Figure 7: the model simply did not have any evidence whatsoever about those pairs, and so could not differentiate them from other pairs that truly did not interact.

In addition, the number of drug interactions from the online reference that did not show up in the VA's "gold standard" list is somewhat troubling. It seems clear that the VA's list is incomplete, and that a better gold standard will be necessary for future models. Unfortunately, obtaining a complete list of drug interactions is fairly difficult, since prominent web services like Epocrates are usually careful to guard their raw data.

Finally, some of the model's poor performance might be the result of the original Medline corpus used when extracting the raw relations. This corpus included all abstracts from before the year 2009, but the biomedical literature is one of the fastest-growing bodies of free text in existence, and thousands of new abstracts are added every year. Much more information on drug-gene interactions is available now than was available two years ago. Subsequent work on this topic, therefore, may require a more current "edition" of the Medline corpus.

V. A Note on the Code

Most of the code for this project was provided by Yael Garten and Russ Altman of the Helix Lab, and was written by Adrien Coulet, a former graduate student who is now a professor in Switzerland. Going from raw Medline sentences to normalized relations was no easy task, and the code reflected this; it was an unholy combination of Bash scripting, random bits of Perl, and about a dozen Java classes, all of which had file locations hard-coded into them. For this reason (and because the lab is still using this code and I didn't have permission to share it), I am not including the code used for extracting the parse trees and normalized relations with my submission. I am, however, including the raw data files obtained from running their code on the BioX2 cluster.

I did most of my own analysis in Python and R, and my scripts for that are included. They aren't pretty, but you can get a sense of what I did to create the adjacency matrices for the network and extract the relevant features. If you need any further information about the code, please don't hesitate to contact me.

References

[1] Kaiser Permanente, estimate of prescriptions filled by older Americans. Accessed Monday, March 7, 2011.
[2] Katzung BG, Masters SB, Trevor AJ. Basic and Clinical Pharmacology. McGraw-Hill: New York, NY, 2009.
[3] T.E. Klein, J.T. Chang, M.K. Cho, K.L. Easton, R. Fergerson, M. Hewett, Z. Lin, Y. Liu, S. Liu, D.E. Oliver, D.L. Rubin, F. Shafa, J.M. Stuart and R.B. Altman, "Integrating Genotype and Phenotype Information: An Overview of the PharmGKB Project", The Pharmacogenomics Journal (2001) 1, 167-170.
[4] The Bio-X2 cluster is the result of an NSF-funded research proposal submitted by 21 Bio-X affiliated faculty, representing 13 departments and 4 schools at Stanford. The purpose of the cluster is to facilitate biological research problems ranging in scale from molecules to organisms. The hardware represents generous donations by both Dell and Cisco.
[5] Dan Klein and Christopher D. Manning. 2003. Accurate Unlexicalized Parsing. Proceedings of the 41st Meeting of the Association for Computational Linguistics, pp. 423-430.
[6] Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. 2006. Generating Typed Dependency Parses from Phrase Structure Parses. In LREC 2006.
[7] Adrien Coulet, Nigam H Shah, Yael Garten, Mark A Musen, Russ B Altman. Using text to build semantic networks for pharmacogenomics. Journal of Biomedical Informatics 43(6):1009-19 (2010).
[8] Miller, Rupert G. (1981) Simultaneous Statistical Inference. 2nd ed. Springer Verlag, pages 6-8.

