Introduction
Patient Information Materials (PIMs) published by major urologic organizations are a reliable adjunct for improving patient understanding of kidney stone disease and for helping patients navigate complex treatment choices. Recent advances in machine learning language models, namely generative pre-trained transformers, have disrupted how patients engage with online health information. These “chatbots” supply easily accessible, apparently authoritative, but unsourced responses to queries. As patient consumption of this type of information grows rapidly, we compared the readability and accuracy of artificial intelligence-generated PIMs with those supplied by the CUA, AUA, and EAU for kidney stones.
Methods
PIMs from the CUA, AUA, and EAU on four topics related to nephrolithiasis were assessed: general information, dietary management, surgical management, and medical management. We assessed their readability with three standard readability scoring systems. Next, we completed a basic web search to identify the top patient questions related to kidney stones. The first five hits for ‘patient questions kidney stones’ were surveyed, and the ten most frequently occurring questions were chosen. These were assigned to one of the four categories used in our previous work: general information, dietary management, surgical management, and medical management. The questions were then input into ChatGPT, and the outputs were analyzed for accuracy and readability. Accuracy was assessed by two independent reviewers and graded from 1-5 (1 = accurate, 5 = inaccurate), while readability was scored with validated readability indexes: SMOG, Gunning Fog, and FKGL (as in our previous work).
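For reference, the abstract does not reproduce the scoring formulas; the standard published definitions of the three grade-level indexes, each estimated from the word, sentence, and syllable counts of the text being scored, are:

\[
\mathrm{FKGL} = 0.39\,\frac{\text{total words}}{\text{total sentences}} + 11.8\,\frac{\text{total syllables}}{\text{total words}} - 15.59
\]
\[
\mathrm{Gunning\ Fog} = 0.4\left(\frac{\text{total words}}{\text{total sentences}} + 100\,\frac{\text{complex words}}{\text{total words}}\right)
\]
\[
\mathrm{SMOG} = 1.043\sqrt{\text{polysyllabic words}\times\frac{30}{\text{total sentences}}} + 3.1291
\]

Each index returns an approximate school grade level, so higher scores indicate more complex text.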
Results
Readability scores for ChatGPT outputs ranged from 13.1-14.1, representing uniformly more complex text across all topics than the PIMs published by the CUA, which ranged from 9.5-11.6. Furthermore, CUA PIMs on medical and surgical management of stones were almost two grade levels simpler than those on dietary and general information (9.5 and 9.5 vs. 11.6 and 11.3). This trend was not seen for the ChatGPT outputs, which showed minimal variability and all scored above grade 13 (13.1-14.1). Accuracy was also investigated; the ChatGPT outputs scored 2, 2, 2, and 3, indicating ‘accurate with minor details omitted’.
Conclusion
ChatGPT’s outputs on patient questions related to nephrolithiasis do not compare favourably to the CUA PIMs in terms of patient readability. They were almost two grade levels more complex than their CUA counterparts overall and, in the case of surgical and medical management, almost four grade levels more complex. However, accuracy of the outputs was reasonable, though not perfect, with most topics having minor detail omissions. As these AI resources become more integrated into our patients’ lives, it is important that we as clinicians are aware of their limitations in order to provide appropriate counselling.
Funding
None
Lead Authors
Alec Mitchell, MD
UBC Urology
Co-Authors
Abdul Halawani, MD
UBC Urology
Readability and accuracy of artificial intelligence-generated outputs compared to CUA patient information materials
Category
Abstract
Description
MP26-01. Session: Moderated Poster Session 26: Endourology Miscellaneous