Cherla, Avi, Naci, Huseyin (ORCID: 0000-0002-7192-5751), Woloshin, Steven, Wagner, Anita Katharina and Ong, Mei-Sing (2025) Communication of uncertainties about recent cancer drugs in large language models. Journal of Clinical Oncology, 43 (16_suppl), 11026. ISSN 0732-183X
Abstract
Background: Increasingly, people use large language models (LLMs) to find information about medical treatments. However, there are notable concerns about the accuracy and completeness of the information these models generate. We assessed whether LLMs accurately summarized uncertainties about the benefits and harms of new cancer drugs.

Methods: We identified the 10 cancer drugs approved by the US Food and Drug Administration (FDA) between 2019 and 2022 with the highest Medicare spending in 2022 (5 in Part B and 5 in Part D). We then searched FDA review documents to extract the uncertainties in each drug's clinical trial evidence identified by FDA reviewers at the time of approval. Uncertainties were assigned to mutually exclusive categories. We evaluated the extent to which 4 state-of-the-art LLMs (OpenAI's GPT-4, Google's Gemini 1.5 Pro, Meta's Llama 3.1, and Anthropic's Claude 3.5 Sonnet) provided information about FDA-identified uncertainties when queried about the drugs using two prompts: (1) "How well does [drug] work for [condition]?"; and (2) "Is there anything uncertain about how well [drug] works for [condition]?"

Results: For the 10 recently approved cancer drugs with the highest Medicare spending in 2022, FDA reviewers identified a total of 38 uncertainties in the clinical trial evidence. For 9 of 10 drugs, FDA reviewers identified uncertainties related to the generalizability of the evidence. Other common uncertainties included bias related to the measurement of the outcome and the use of single-arm trial designs. When the LLMs were prompted about how well these 10 cancer drugs worked, the models rarely provided information about the uncertainties identified by FDA reviewers: GPT-4 (4/38, 11%), Gemini 1.5 Pro (3/38, 8%), Llama 3.1 (3/38, 8%), and Claude 3.5 Sonnet (2/38, 5%). The proportion of FDA-identified uncertainties reported by the models improved only marginally when the models were specifically prompted for uncertainties about the drugs. Qualitative assessment of the information generated by the LLMs showed that most models tended to report similar, non-specific uncertainties for every drug, with little variation.

Conclusions: LLMs do not provide adequate information about uncertainties related to the benefits and harms of recently approved cancer drugs, despite the availability of this information in the public domain. There is a need to improve LLMs to accurately report such information, so that patients can make informed decisions about cancer treatments.

Performance of LLMs compared to uncertainties identified by the FDA:

| Prompt | GPT-4 | Gemini 1.5 Pro | Llama 3.1 | Claude 3.5 Sonnet | Average |
|---|---|---|---|---|---|
| How well does [drug] work for [condition]? | 4/38 (11%) | 3/38 (8%) | 3/38 (8%) | 2/38 (5%) | 3/38 (8%) |
| Is there anything uncertain about how well [drug] works for [condition]? | 8/38 (21%) | 5/38 (13%) | 6/38 (16%) | 5/38 (13%) | 6/38 (16%) |
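To make the two-prompt querying protocol concrete, the following is a minimal sketch, not the authors' actual code: the `DRUGS` list and the `query_model` helper are hypothetical placeholders, and the OpenAI Chat Completions call illustrates only one of the four model backends the study compared.

```python
# Sketch of the two-prompt querying protocol described in Methods.
# The drug/condition pairs below are illustrative placeholders; the study
# used the 10 FDA-approved (2019-2022) cancer drugs with the highest
# 2022 Medicare spending.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPTS = [
    "How well does {drug} work for {condition}?",
    "Is there anything uncertain about how well {drug} works for {condition}?",
]

DRUGS = [
    ("DrugA", "ConditionA"),  # hypothetical examples
    ("DrugB", "ConditionB"),
]

def query_model(prompt: str, model: str = "gpt-4") -> str:
    """Send one prompt to one LLM and return its text response."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

for drug, condition in DRUGS:
    for template in PROMPTS:
        prompt = template.format(drug=drug, condition=condition)
        answer = query_model(prompt)
        # Each response would then be checked against the FDA-identified
        # uncertainties for that drug.
        print(prompt, answer, sep="\n")
```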
Item Type: Article
Additional Information: © 2025 by American Society of Clinical Oncology
Divisions: Health Policy
Subjects: R Medicine > RA Public aspects of medicine; Q Science > Q Science (General)
Date Deposited: 06 Jun 2025 08:00
Last Modified: 06 Jun 2025 08:03
URI: http://eprints.lse.ac.uk/id/eprint/128323