Introduction
Drug–drug interactions (DDIs) are recognized as a major cause of adverse drug reactions (ADRs). DDIs may reduce treatment efficacy or increase adverse effects and toxicity, posing significant risks to patient health.[1] ADRs, often caused by DDIs, lead to increased hospital admissions, prolonged treatment durations, and rising healthcare costs. Consequently, there is a critical need for accurate, comprehensive, and personalized assessment of DDI risk. This need is heightened by individual factors, such as comorbidities, organ function, and pharmacogenomic variations, which can significantly alter the pharmacological mechanisms involved.[2]
Conventionally, healthcare professionals depend on specialized rule-based platforms such as UpToDate Lexidrug, Micromedex, and Drugs.com to identify potential DDIs within complex medication regimens. Although these systems are grounded in rigorously curated databases, concordance between them varies notably.[3]
Recently, Artificial Intelligence (AI) and Machine Learning (ML) have emerged as transformative approaches in drug safety.[4] In contrast to conventional rule-based systems, ML and Deep Learning (DL) models are capable of predicting complex DDI relationships by synthesizing diverse data sources, including chemical structures (SMILES), molecular pathways (KEGG), drug-target interactions, and biomedical literature processed via Natural Language Processing (NLP).[5]
ClinicalKey AI (CKAI), a proprietary clinical AI tool developed by Elsevier, utilizes Generative AI for this purpose but addresses the hallucination risks inherent in general-purpose LLMs by adopting a Retrieval-Augmented Generation (RAG) architecture.[6] This ensures that outputs are grounded solely in verified content. While LLMs have shown high accuracy (95.5%) and minimal potential for harm (0.47%) in clinical settings,[7] the opaque ‘black box’ nature of these systems necessitates rigorous validation of their transparency and reliability, especially regarding critical decisions like DDI prediction.[8]
A comprehensive performance evaluation of a dynamic, evidence-based, AI-driven decision support system such as CKAI against traditional expert systems is critical for the responsible integration of advanced AI technologies into clinical practice. The primary novelty and focus of this study is a rigorous assessment of the capability of AI in a complex clinical domain.
Methods
This study evaluated ClinicalKey AI (CKAI), a generative AI tool designed specifically for clinical practice rather than for general-purpose use, against UpToDate Lexidrug (LD), which served as one of the reference standards for drug–drug interaction (DDI) analysis. To ensure clinical validity, a dataset of 280 drug pairs was curated through a comprehensive literature review, focusing on the most frequently encountered and clinically significant interactions (Supplementary File 1).
For the reference standard, each drug pair was queried in LD. Extracted data were classified into two groups to facilitate comparison: content-based parameters and information availability. Content-based parameters included management category (Ratings B, C, D, X), severity (minor, moderate, major), level of evidence, and mechanism of interaction (pharmacokinetic vs. pharmacodynamic). To facilitate standardized statistical comparison, LD’s six-level evidence scale was consolidated into three main categories: ‘Highest’ and ‘High’ were grouped as High; ‘High-Intermediate’ and ‘Intermediate’ as Moderate; and ‘Intermediate-Low’ and ‘Lowest’ as Low. Information availability (binary: provided/not provided) was assessed based on whether the system provided information on onset of action, risk factors (e.g., age, sex), and specific alternative agent suggestions.
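The consolidation of the six-level evidence scale described above can be expressed as a simple lookup. The following is a minimal sketch for illustration only; the names EVIDENCE_MAP and consolidate_evidence are hypothetical and are not part of the study's materials.

```python
# Six-level reference labels mapped to the three consolidated categories,
# following the grouping stated in the Methods text.
# EVIDENCE_MAP and consolidate_evidence are illustrative names (assumptions).
EVIDENCE_MAP = {
    "Highest": "High",
    "High": "High",
    "High-Intermediate": "Moderate",
    "Intermediate": "Moderate",
    "Intermediate-Low": "Low",
    "Lowest": "Low",
}


def consolidate_evidence(level: str) -> str:
    """Return the consolidated category for a six-level evidence label."""
    key = level.strip()
    if key not in EVIDENCE_MAP:
        # Fail loudly instead of silently assigning an unknown label.
        raise ValueError(f"Unknown evidence level: {level!r}")
    return EVIDENCE_MAP[key]


print(consolidate_evidence("High-Intermediate"))
```

Raising an error on unrecognized labels, rather than defaulting to a category, keeps extraction mistakes visible before any agreement statistic is computed.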
To enable a standardized comparison, a structured prompt for CKAI was developed through iterative testing. The final prompt, selected for its ability to yield consistent and comparable outputs, was as follows:
“Evaluate DDI between [Drug-1] & [Drug-2] in a patient. Provide a structured summary covering: 1.Mechanism (PK/PD) 2.Clinical Severity (Major/Moderate/Minor) 3.Management (Monitor/Adjust/Avoid) 4.Onset 5.Risk Factors 6.Evidence Level 7.Specific Alternative Agents (name drugs)”.
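For reproducibility, the prompt above can be generated programmatically for each drug pair. The sketch below is an assumption about how this could be done, not a description of the procedure used in the study; build_prompt is a hypothetical helper, and the example pair is illustrative rather than taken from Supplementary File 1.

```python
# Template reproducing the structured prompt, with the two drug names
# as placeholders. build_prompt is a hypothetical helper (assumption).
PROMPT_TEMPLATE = (
    "Evaluate DDI between {drug_1} & {drug_2} in a patient. "
    "Provide a structured summary covering: "
    "1.Mechanism (PK/PD) "
    "2.Clinical Severity (Major/Moderate/Minor) "
    "3.Management (Monitor/Adjust/Avoid) "
    "4.Onset 5.Risk Factors 6.Evidence Level "
    "7.Specific Alternative Agents (name drugs)"
)


def build_prompt(drug_1: str, drug_2: str) -> str:
    """Fill the fixed template with one drug pair."""
    return PROMPT_TEMPLATE.format(drug_1=drug_1.strip(), drug_2=drug_2.strip())


# Illustrative pair only; not asserted to be part of the study dataset.
print(build_prompt("simvastatin", "clarithromycin"))
```

Keeping the wording fixed and varying only the drug names means every query differs from the others in the pair alone, which supports comparable outputs across the dataset.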
Recognizing that AI systems are inherently predisposed to provide direct answers to structured queries, a rigorous manual review of LD monographs was conducted to ensure a fair comparison. The data extraction from LD extended beyond summary fields to include a thorough review of the full discussion sections. This ensured that granular details—such as specific onset times, gender-specific risks, or mentions of safer therapeutic alternatives embedded within the narrative text—were captured and compared accurately against the AI-generated responses.
Inter-rater reliability for categorical variables (e.g., severity, management, evidence level) was assessed using Cohen’s Kappa (κ) coefficient to control for chance agreement. For ordinal variables (e.g., Severity: Major > Moderate > Minor), Weighted Kappa analysis was employed to account for the magnitude of disagreement (e.g., weighting a Major-Minor discordance more heavily than a Major-Moderate one).
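The agreement statistics described above can be illustrated with a short, dependency-free sketch. Two points are assumptions: the text does not state whether linear or quadratic weights were used (linear weights are shown here), and the ratings below are hypothetical values chosen for illustration, not study data.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b, categories, weights=None):
    """Cohen's kappa for two raters over the same items.

    weights=None   -> unweighted kappa (any disagreement counts fully)
    weights="linear" -> linearly weighted kappa; `categories` must be
                        listed in their ordinal order.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Ratings must be non-empty and of equal length.")
    k = len(categories)
    if k < 2:
        raise ValueError("At least two categories are required.")
    n = len(rater_a)
    index = {c: i for i, c in enumerate(categories)}

    def disagreement(i, j):
        # 0 on the diagonal; grows with ordinal distance when weighted.
        if weights == "linear":
            return abs(i - j) / (k - 1)
        return 0.0 if i == j else 1.0

    observed = Counter((index[a], index[b]) for a, b in zip(rater_a, rater_b))
    marg_a = Counter(index[a] for a in rater_a)
    marg_b = Counter(index[b] for b in rater_b)

    obs = sum(disagreement(i, j) * c for (i, j), c in observed.items()) / n
    exp = sum(
        disagreement(i, j) * marg_a[i] * marg_b[j]
        for i in range(k)
        for j in range(k)
    ) / (n * n)
    if exp == 0:
        raise ValueError("Kappa is undefined when no disagreement is expected.")
    return 1.0 - obs / exp


# Hypothetical severity ratings for six drug pairs (not study data).
SEVERITY = ["Minor", "Moderate", "Major"]
ld = ["Minor", "Moderate", "Moderate", "Major", "Major", "Moderate"]
ckai = ["Moderate", "Major", "Moderate", "Major", "Major", "Major"]

print(round(cohen_kappa(ld, ckai, SEVERITY), 3))                    # 0.182
print(round(cohen_kappa(ld, ckai, SEVERITY, weights="linear"), 3))  # 0.308
```

In this toy example every disagreement is between adjacent categories, so the weighted coefficient is higher than the unweighted one; a Major-Minor discordance would be penalized twice as heavily under linear weights.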
To detect systematic differences in information provision capabilities (i.e., whether the system provided the information or not), McNemar’s test was used for binary variables (Onset, Risk Factors, Alternative Suggestions). A p-value of < 0.05 was considered statistically significant. Additionally, crosstabulations were generated to visualize the direction of discordance, identifying potential trends of overestimation or underestimation of risk by CKAI compared to LD.
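McNemar's test depends only on the discordant pairs, that is, the drug pairs for which one system provided the information and the other did not. The sketch below uses the exact (binomial) form of the test; the text does not specify which variant was applied, so this choice is an assumption, and the availability flags are hypothetical.

```python
from math import comb


def discordant_counts(system_a, system_b):
    """Count pairs where only system A (b) or only system B (c) provided info."""
    if len(system_a) != len(system_b):
        raise ValueError("Both systems must be rated on the same drug pairs.")
    b = sum(1 for x, y in zip(system_a, system_b) if x and not y)
    c = sum(1 for x, y in zip(system_a, system_b) if y and not x)
    return b, c


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    # Under H0, each discordant pair favors either system with probability 0.5.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical provided/not-provided flags for ten drug pairs (not study data).
ckai = [True] * 8 + [False] * 2
ld = [True, True, True, False, False, False, False, True, True, False]

b, c = discordant_counts(ckai, ld)
print(b, c, mcnemar_exact(b, c))
```

Comparing b and c also shows the direction of discordance: b larger than c means the first system supplied the information more often than the second.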
Results
Analysis of the mechanistic origins of the evaluated drug pairs revealed a predominance of pharmacokinetic processes (Table 1).
A high degree of consensus was observed between the two systems regarding the classification of interaction mechanisms (pharmacokinetic vs. pharmacodynamic), with an agreement rate exceeding 99%.
A marked divergence was observed in risk perception, with CKAI demonstrating a systematic tendency towards higher severity classifications compared to LD. While the LD dataset included a distribution across Major, Moderate, and a minority of Minor interactions, CKAI almost entirely excluded the “Minor” category and reclassified a substantial proportion of interactions deemed “Moderate” by LD as “Major”. Consequently, the inter-rater reliability for severity grading between the two systems remained at a “Moderate/Fair” level of agreement (κ).
The disparity in risk perception directly influenced management strategies. CKAI frequently escalated interactions classified as manageable (e.g., modify) by LD to “Avoid” (Category X). As detailed in Table 2, CKAI suggested avoiding the combination in 174 cases, representing a net increase of 64 cases compared to LD. Conversely, the recommendation to “modify” therapy was significantly less frequent in CKAI than in LD.
A marked difference was identified in the assignment of evidence levels. CKAI classified 85% of the interactions as supported by “High” evidence, whereas LD assigned this level to only 23.5% (Table 3). This discrepancy resulted in a substantially low Kappa coefficient for evidence level agreement.
In terms of onset information availability, CKAI provided data in 276 of 280 cases (98.6%), whereas LD did so in only 193 of 280 cases (68.9%). Accordingly, CKAI’s ability to supply onset information was significantly higher than that of LD (p < 0.001). Onset information was frequently absent in LD, especially for interactions classified as moderate or minor.
CKAI was more comprehensive than LD in identifying interaction risk factors: CKAI listed risk factors in 278 of 280 interactions (99.3%), while LD did so in only 200 (71.4%) (p < 0.001). Similarly, CKAI provided alternative drug suggestions in 278 of 280 interactions (99.3%), in contrast to LD’s 150 (53.6%). These differences were statistically significant (p < 0.001).
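The counts reported above can be checked arithmetically, and they also constrain the discordant cells of the paired tables. The sketch below is an illustration under stated assumptions: all 280 pairs are assumed to contribute to each comparison, the exact form of McNemar's test is assumed, and the function names are hypothetical. Because the full paired tables are not given in the text, only the range of discordant counts compatible with the reported totals is derived.

```python
from math import comb

N_PAIRS = 280
# (CKAI provided, LD provided) counts as stated in the Results text.
REPORTED = {
    "onset": (276, 193),
    "risk_factors": (278, 200),
    "alternatives": (278, 150),
}


def percentage(count, total=N_PAIRS):
    """Proportion as a percentage, rounded to one decimal place."""
    return round(100 * count / total, 1)


def discordant_bounds(ckai_yes, ld_yes, total=N_PAIRS):
    """All (b, c) pairs compatible with the two marginal totals.

    b = CKAI provided, LD did not; c = LD provided, CKAI did not.
    b - c is fixed by the marginals; c cannot exceed the number of
    pairs for which CKAI provided nothing, nor the LD total.
    """
    diff = ckai_yes - ld_yes
    c_min = max(0, -diff)
    c_max = min(total - ckai_yes, ld_yes)
    return [(c + diff, c) for c in range(c_min, c_max + 1)]


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value (assumed test variant)."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


for name, (ckai_yes, ld_yes) in REPORTED.items():
    pairs = discordant_bounds(ckai_yes, ld_yes)
    largest_p = max(mcnemar_exact(b, c) for b, c in pairs)
    print(name, percentage(ckai_yes), percentage(ld_yes),
          pairs[0], pairs[-1], largest_p < 0.001)
```

For onset, for example, the marginals fix b - c at 83 and allow c to be at most 4, so every compatible table yields a p-value far below 0.001 under this sketch's assumptions.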
For specific drug–drug interactions, both systems demonstrated high sensitivity in detecting warfarin-related interactions. Combinations of statins (such as simvastatin) with potent CYP3A4 inhibitors were flagged as major in severity and designated as avoid by both CKAI and LD, indicating complete concordance in this category.
Discussion
To our knowledge, this study is the first comprehensive analysis to comparatively evaluate the concordance between ClinicalKey AI (CKAI) and UpToDate Lexidrug (LD) for drug–drug interactions (DDIs), as well as their potential impact on clinical decision-making. The findings indicate that AI-based systems have a statistically significant advantage (p < 0.001) over traditional systems in providing clinical information (onset, risk factors, and alternative recommendations); however, in terms of risk classification, management recommendations, and evidence-level grading, CKAI diverges markedly from traditional systems, exhibiting a pronounced “safety-first” (conservative) algorithmic bias.
A notable finding was that, despite achieving over 99% concordance between the two systems in identifying the pharmacological mechanism, there was a substantial divergence in the interpretation of the mechanism’s clinical outcome (severity). CKAI systematically maintained a higher perceived risk compared to LD: it escalated many interactions classified as moderate in the LD dataset to the major category and rarely utilized the minor category. This resulted in the Kappa coefficient for severity agreement between the two systems remaining at only a moderate/fair level.
This “severity inflation” phenomenon has also been reported in recent AI studies. In their 2025 study of antidote interactions, Yaowaluk et al. noted that AI models (ChatGPT and Gemini) tended to resolve discrepancies between databases by favoring the more severe rating, citing safety-oriented justifications.[9] Similarly, in our study CKAI labeled cases that LD regarded as “manageable risk” as “avoidable risk” (avoid/X), increasing such cases by 64. This suggests that specialized RAG-architecture systems may likewise tend to base their assessments on the most severe potential outcomes when dealing with uncertainties in training data.
The basis of this conservative approach may reflect a system design strategy prioritizing maximum patient safety. By contrast, rule-based systems like LD have been optimized through decades of clinical feedback and expert panel input (e.g., the Hansten and Horn criteria) to filter out interactions that are theoretically possible but clinically rare. CKAI, on the other hand, may scan case reports and theoretical pharmacokinetic models in the literature and, even when the likelihood of an interaction is low, systematically elevate the risk score by focusing on the severity of the potential outcome. For example, an interaction with the potential to increase plasma concentration by 20%—which LD considers “manageable with dose adjustment” (moderate)—is interpreted by the AI as a “toxicity risk” (major).
Wu et al. termed this the “safety paradox” in a 2025 study, warning that models issuing excessively restrictive recommendations (errors of omission) risk depriving patients of necessary treatments.[10] In our study, CKAI issued only 32 “D – Modify” recommendations compared with 108 by LD, while issuing 174 “X – Avoid” recommendations compared with 110 by LD, which provides concrete evidence of this paradox.
The pharmacological subgroup in which the discrepancy in management recommendations was most pronounced was interactions between tyrosine kinase inhibitors (TKIs) and proton pump inhibitors (PPIs). The absorption of TKIs such as erlotinib, dasatinib, and pazopanib is pH-dependent, and strong, prolonged suppression of gastric acidity by PPIs can reduce the bioavailability of these drugs by 40–60%.[11]
CKAI’s tendency to classify this group frequently as “X – avoid” and to recommend complete discontinuation of PPI use indicates that the model is guided by strict pharmacokinetic parameters. The AI classifies a 50% drop in bioavailability as a “treatment failure risk” and, instead of managing this risk, recommends eliminating its source entirely (avoiding the drug combination). While this approach is pharmacologically correct, in oncological practice it may leave patients’ quality-of-life–diminishing symptoms untreated or result in the use of a less effective cancer therapy. This finding demonstrates that AI systems still depend on human expert oversight and contextual interpretation for nuanced clinical decision-making, particularly in fields like oncology where the risk–benefit balance is highly sensitive.
One of the most methodologically notable findings of the study is the substantial discordance in evidence level grading. CKAI labeled 85% of the 280 analyzed interactions as “High” evidence level, whereas LD assigned this level to only 23.5%. This suggests that the two systems process the concept of “evidence” in a fundamentally different manner.
Rule-based systems like LD determine evidence levels using strict hierarchical criteria (e.g., the GRADE system). The “High” level typically requires randomized controlled trials (RCTs) or large-scale pharmacokinetic data, whereas case reports or in vitro data are classified as “Low” or “Lowest”. CKAI, however, uses a RAG (retrieval-augmented generation) architecture to generate answers from Elsevier’s extensive literature repositories (ClinicalKey, ScienceDirect, textbooks). When the model retrieves a detailed article or case series published in a reputable journal about an interaction, it tends to treat the accessibility and quality of that source as being equivalent to the strength of evidence.
This scenario corresponds to the problems of “overconfidence” and the “calibration gap” described in the LLM literature.[12] Models tend to assign high confidence scores to their output texts regardless of the actual correctness of the responses. CKAI’s RAG architecture may minimize the risk of “hallucinations” (fabricating non-existent information),[13] but it introduces the risk of overstating the strength of available evidence. For a clinician, this could pose a significant clinical risk: a warning presented with a high evidence label and an avoid recommendation might actually be based on only a theoretical risk or a few isolated case reports, which the clinician may not recognize. This finding underscores that, when evaluating outputs from AI-based clinical decision support systems, the indicated evidence level requires qualitative verification and one should not rely solely on the label. Nonetheless, as an interactive AI, CKAI can aid in this verification by providing follow-up questions that elaborate on the initially restrictive “avoid” and “high” statements, effectively helping to moderate overreliance on these labels.
Despite its highly conservative approach to risk evaluation, CKAI demonstrated a significant advantage in information provision compared to LD. These statistically significant differences illustrate the system’s potential as a clinical assistant (in terms of onset, risk factors, and alternative medication suggestions).
Traditional databases depend on structured fields that are manually filled by editors for each drug pair. If an editor has not entered an onset value for a specific interaction, or if precise timing information (hours/days) is not available in the literature, that field remains blank. In contrast, CKAI, leveraging its RAG architecture and natural language processing capabilities, can scan texts containing relevant pharmacokinetic profiles of the drugs (enzyme inhibition, time to Cmax, etc.) and, even without direct onset data, infer and present an estimated onset time to the clinician.
CKAI’s 99.3% success rate in providing alternative medication suggestions is significant for clinical workflows. While LD offered no alternatives in 130 cases, CKAI proposed specific agents by name in nearly every case, demonstrating its capacity to provide rapid clinical guidance in response to the inquiry, “If I cannot use this drug, what should I use?”. However, the reliability of these alternatives must be carefully evaluated; our study measured the availability of alternatives, not their accuracy. Nevertheless, a tool capable of rapidly presenting therapeutic alternatives in high-throughput clinical settings could significantly reduce cognitive load.
The results of this study provide critical insight into how generative AI tools like CKAI should be integrated into clinical workflows. If CKAI—with its current risk perceptions (low-threshold “Avoid” recommendations and high “Severity” assignments)—were integrated directly into a CPOE (Computerized Physician Order Entry) system and deployed as an interruptive alert, the alert burden on clinicians would be much higher than with LD and potentially unsustainable.
The literature indicates that even high-severity DDI alerts are overridden by clinicians up to 90% of the time.[14] This is primarily due to “alert fatigue”: clinicians become desensitized by constantly encountering alerts that are often trivial or manageable.[15] CKAI’s tendency to label situations that LD classifies as moderate as major is one factor that would increase this fatigue. Therefore, it seems more appropriate to position CKAI and similar systems not solely as an interruptive alert mechanism at the point of order entry, but as an interactive decision support tool that the clinician can consult in complex cases (polypharmacy, comorbidity, organ failure), querying it for detailed information such as risk factors, onset, and alternatives. The rich contextual information provided by the AI, as opposed to static alerts, can help the clinician understand why an alert is given (enhancing explainability), potentially increasing the alert acceptance rate.
There are several limitations to this study. First, even systems regarded as gold standards, such as Lexidrug, are subject to inherent variability; significant discordances are known to exist even between different databases (e.g., Micromedex and Drugs.com).[16] Some of the risk assessments from CKAI that appear “excessive” may actually be grounded in more recent literature. Second, we did not evaluate the clinical appropriateness and safety of the proposed “alternative” medications in patient-based scenarios or case simulations.
Conclusions
ClinicalKey AI delivers more comprehensive, detailed, and solution-oriented output than traditional systems in the management of drug–drug interactions, and it is particularly strong in addressing the “why” and “how” questions through its responses regarding mechanism, onset, and risk factors. However, its heightened risk perception and conservative management recommendations indicate that it should be utilized not as an autonomous decision-maker but as a powerful decision support tool that augments clinical judgment. Clinicians must interpret AI-generated high evidence labels and avoid warnings within the context of the individual patient’s clinical scenario. As a decision support tool, AI has the potential to accelerate information access and enhance polypharmacy management. Nevertheless, to prevent excessive alerting from leading to therapeutic inertia in clinical practice, human oversight (human-in-the-loop) remains essential.