A study published in Vision has tested the capabilities of ChatGPT models against human candidates sitting the European Board of Ophthalmology Diploma (EBOD) examination, and the results reveal both promise and limitations for artificial intelligence (AI) in medical education.
Researchers evaluated ChatGPT-3.5 Turbo and ChatGPT-4o using over 2,200 true/false statements and 48 single best answer (SBA) questions sourced from actual EBOD exams held between 2012 and 2023. ChatGPT-4o achieved an impressive 80.4 percent accuracy on the multiple-choice questions (MCQs), surpassing the pass mark and performing comparably to human candidates. In contrast, ChatGPT-3.5 scored 63.2 percent, slightly below the typical passing threshold. Both models performed best on text-based pathology and retina-related questions, with weaker results in optics and refraction.
However, AI's performance dropped dramatically on the SBA questions. ChatGPT-3.5 scored just 28.4 percent, and ChatGPT-4o came in slightly lower at 24.1 percent; both significantly underperformed the average human candidate. SBA questions often demand higher-order clinical reasoning and the ability to discriminate between closely related options, skills that current AI models struggle to replicate.
Interestingly, ChatGPT-4o answered all the easiest MCQs correctly but fared worse than ChatGPT-3.5 on the most challenging ones. This highlights a trade-off: newer models may excel in general knowledge, but they are not necessarily better at complex, ambiguous reasoning.
The study suggests that while ChatGPT can retrieve and interpret structured knowledge, its integration of nuanced clinical judgment remains limited. The authors conclude that ChatGPT is not yet ready to replace human judgment in high-stakes medical assessments, but that rapid advances in large language models suggest its role in ophthalmic education will continue to expand.