Responses

Original research
ChatGPT (GPT-4) versus doctors on complex cases of the Swedish family medicine specialist examination: an observational comparative study

Other responses

  • Published on:
    Comment on: ChatGPT (GPT-4) versus doctors on complex cases of the Swedish family medicine specialist examination: an observational comparative study.
    • Denise E. Hilling, Surgeon, CSO Erasmus MC Datahub, Erasmus MC, University Medical Center Rotterdam
    • Other Contributors:
      • Michel E. van Genderen, Internist-Intensivist, Director Erasmus MC Datahub
      • Jan A.J.G. van den Brand, CTO Erasmus MC Datahub

    We read with great interest the recently published article "ChatGPT (GPT-4) versus doctors on complex cases of the Swedish family medicine specialist examination: an observational comparative study" by Arvidsson et al. [1] The study provides valuable insights into the capabilities and limitations of generative AI models such as GPT-4 in complex medical decision-making. However, the study's approach, which relies on GPT-4 as a general-purpose model without domain-specific fine-tuning or optimised prompting strategies, presents an inherent limitation. Deploying an AI system in this manner does not align with best practice in any industry: in real-world applications, AI models are typically customised, fine-tuned, or integrated with structured knowledge bases to enhance their relevance and reliability in a specific domain. The zero-shot prompting approach used in this study, while convenient for an initial evaluation, does not reflect how AI solutions are implemented in practice in healthcare or other high-stakes industries.

    In the medical field, AI applications must be trained and validated within a well-defined context, leveraging domain-specific data, tailored prompts, and reinforcement learning with human feedback to improve performance over time. Successful AI implementation in healthcare involves collaboration with medical professionals to refine model outputs, ensuring that the AI system aligns with c...

    Conflict of Interest:
    None declared.
  • Published on:
    Declaring the context helps

    Hi,

    I also tried this. I started from DoctorAI and noticed that the answers were too generic for the Swedish setting, so I added "I Svensk primärvårdskontext" (in the Swedish primary care context) and got better results.
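The trick described above can be sketched as a small prompt-construction helper: declare the clinical context before the question so the model anchors its answer to the intended setting. This is a minimal illustration in an OpenAI-style chat-message format; the function name and the exact wording are hypothetical, not taken from the article or the comment.

```python
def build_messages(question: str,
                   context: str = "Swedish primary care") -> list[dict]:
    """Return chat messages that declare the clinical context up front.

    Illustrative only: the system line anchors the model to the intended
    setting before the clinical question is asked, mirroring the
    "I Svensk primärvårdskontext" prefix described in the comment.
    """
    return [
        {"role": "system",
         "content": (f"You are answering as a specialist working in "
                     f"{context}. Follow the guidelines that apply "
                     f"in that setting.")},
        {"role": "user", "content": question},
    ]

messages = build_messages("How should suspected deep vein thrombosis "
                          "be handled in primary care?")
```

The same list could then be passed to any chat-completion endpoint; the point is only that the context is stated explicitly rather than left for the model to guess.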

    Conflict of Interest:
    None declared.