RT Journal Article
SR Electronic
T1 ChatGPT (GPT-4) versus doctors on complex cases of the Swedish family medicine specialist examination: an observational comparative study
JF BMJ Open
JO BMJ Open
FD British Medical Journal Publishing Group
SP e086148
DO 10.1136/bmjopen-2024-086148
VO 14
IS 12
A1 Arvidsson, Rasmus
A1 Gunnarsson, Ronny
A1 Entezarjou, Artin
A1 Sundemo, David
A1 Wikberg, Carl
YR 2024
UL http://bmjopen.bmj.com/content/14/12/e086148.abstract
AB Background Recent breakthroughs in artificial intelligence research include the development of generative pretrained transformers (GPT). ChatGPT has been shown to perform well when answering several sets of medical multiple-choice questions. However, it has not been tested for writing free-text assessments of complex cases in primary care. Objectives To compare the performance of ChatGPT, version GPT-4, with that of real doctors. Design and setting A blinded observational comparative study conducted in the Swedish primary care setting. Responses from GPT-4 and real doctors to cases from the Swedish family medicine specialist examination were scored by blinded reviewers, and the scores were compared. Participants Anonymous responses from the Swedish family medicine specialist examination 2017–2022 were used. Outcome measures Primary: the mean difference in scores between GPT-4’s responses and randomly selected responses by human doctors, as well as between GPT-4’s responses and top-tier responses by human doctors. Secondary: the correlation between differences in response length and response score; the intraclass correlation coefficient between reviewers; and the percentage of maximum score achieved by each group in different subject categories. Results The mean scores were 6.0, 7.2 and 4.5 for randomly selected doctor responses, top-tier doctor responses and GPT-4 responses, respectively, on a 10-point scale. The scores for the random doctor responses were, on average, 1.6 points higher than those of GPT-4 (p<0.001, 95% CI 0.9 to 2.2) and the top-tier doctor scores were, on average, 2.7 points higher than those of GPT-4 (p<0.001, 95% CI 2.2 to 3.3). Following the release of GPT-4o, the experiment was repeated, although this time with only a single reviewer scoring the answers. In this follow-up, random doctor responses were scored 0.7 points higher than those of GPT-4o (p=0.044). Conclusion In complex primary care cases, GPT-4 performs worse than human doctors taking the family medicine specialist examination. Future GPT-based chatbots may perform better, but comprehensive evaluations are needed before implementing chatbots for medical decision support in primary care. All data relevant to the study are included in the article or uploaded as supplementary information. The scores are published in the Swedish National Data Service’s Data Organisation and Information System repository. Three examples of cases and their corresponding scoring guides and GPT-4 responses have been translated into English and included as supplemental file 1. The original cases, evaluation guides and top-tier responses are publicly available in Swedish on SFAM’s website, from where they were used in this study with permission.