OpenAI has introduced HealthBench, a new open-source evaluation framework designed to assess the performance and safety of large language models (LLMs) in real-world healthcare scenarios. Developed in collaboration with 262 physicians from 60 countries, spanning 26 medical specialties, HealthBench seeks to overcome the limitations of existing benchmarks, which often fail to reflect the complexity of clinical interactions. Unlike traditional tests that rely on multiple-choice formats, HealthBench uses 5,000 realistic, multi-turn conversations between AI models and users, who range from healthcare professionals to laypeople. Model responses are scored against rubrics written by the physicians involved in the project.
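OpenAI's article does not include grading code, but the rubric-based setup can be sketched roughly as follows. The class, field names, and point values below are illustrative assumptions rather than the benchmark's actual schema: each physician-written criterion carries a weight, a grader judges whether a response satisfies it, and the response score is the fraction of achievable points earned.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    description: str   # e.g. "Advises the user to seek emergency care"
    points: int        # physician-assigned weight; may be negative for harmful content

def score_response(criteria: list[RubricCriterion], met: list[bool]) -> float:
    """Score one model response against a physician-written rubric.

    `met[i]` indicates whether the grader judged criterion i to be satisfied.
    The score is earned points over the maximum achievable (positive) points,
    clipped to the [0, 1] range.
    """
    earned = sum(c.points for c, m in zip(criteria, met) if m)
    max_points = sum(c.points for c in criteria if c.points > 0)
    return max(0.0, min(1.0, earned / max_points)) if max_points else 0.0

# Illustrative usage with made-up criteria
rubric = [
    RubricCriterion("Recommends seeking emergency care for chest pain", 10),
    RubricCriterion("Asks about symptom onset and severity", 5),
    RubricCriterion("Provides a definitive diagnosis without examination", -6),
]
print(score_response(rubric, met=[True, False, True]))  # (10 - 6) / 15 ≈ 0.27
```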
The HealthBench framework takes a more realistic approach by assessing models across seven key themes: emergency referrals, global health, health data tasks, expertise-tailored communication, context-seeking, response depth, and responding under uncertainty. Each theme targets a different aspect of medical decision-making and user interaction. HealthBench also offers two variations for more targeted insights: HealthBench Consensus, which incorporates 34 physician-validated criteria, and HealthBench Hard, a subset of 1,000 conversations that are particularly challenging for current models. Together, the themes and variants break down model performance into a more detailed picture of strengths and weaknesses.
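As a small sketch of how such a breakdown might be computed, the snippet below averages per-conversation scores within each theme. The `theme` and `score` field names are assumptions for illustration, not the released dataset's actual format.

```python
from collections import defaultdict

def scores_by_theme(results: list[dict]) -> dict[str, float]:
    """Average per-conversation scores within each theme label."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for r in results:
        buckets[r["theme"]].append(r["score"])
    return {theme: sum(s) / len(s) for theme, s in buckets.items()}

# Hypothetical results for three graded conversations
example = [
    {"theme": "emergency referrals", "score": 0.82},
    {"theme": "emergency referrals", "score": 0.74},
    {"theme": "context-seeking", "score": 0.41},
]
print(scores_by_theme(example))
```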
In its evaluation, OpenAI said it tested multiple models, including GPT-3.5 Turbo, GPT-4o, GPT-4.1, and the newer o3 model. The results show noticeable improvement: o3 scored 60% overall, while GPT-3.5 Turbo and GPT-4o achieved 16% and 32%, respectively. Interestingly, the smaller GPT-4.1 nano model outperformed the larger GPT-4o by a significant margin while being 25 times cheaper to run. Although models excelled at emergency referrals and expertise-tailored communication, context-seeking and completeness remained greater challenges.
A critical part of the evaluation compared AI model responses to physician-written responses. The results indicated that while physicians writing on their own often scored lower than the models, they were able to improve model-generated drafts, particularly those produced by earlier models. This suggests that AI models could serve as valuable tools to assist clinicians with documentation and decision support rather than replace them.
HealthBench also introduces a reliability metric, known as "worst-at-k," to assess the stability of model performance across repeated runs (sketched below). While newer models such as GPT-4.1 showed improved reliability, there is still room for growth in this area. To verify the quality of its automated grading, OpenAI said it conducted a meta-evaluation on over 60,000 annotated examples, finding that GPT-4.1, used as the grader, performed consistently well compared to individual physicians.
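The article does not spell out the computation, but one plausible reading of worst-at-k is: sample k completions per conversation, keep the worst rubric score for each, and average across the benchmark. The sketch below follows that assumption; the exact estimator HealthBench uses may differ.

```python
import random

def worst_at_k(per_example_scores: list[list[float]], k: int,
               trials: int = 1000, seed: int = 0) -> float:
    """Estimate a worst-at-k reliability score.

    `per_example_scores[i]` holds rubric scores for several independent
    completions of conversation i. For each conversation we repeatedly draw
    k completions and keep the minimum score, then average over conversations
    and trials, reflecting the idea that a model is only as reliable as its
    worst answer.
    """
    rng = random.Random(seed)
    totals = []
    for scores in per_example_scores:
        draws = [min(rng.sample(scores, k)) for _ in range(trials)]
        totals.append(sum(draws) / trials)
    return sum(totals) / len(totals)

# Illustrative data: three conversations, four sampled completions each
scores = [
    [0.90, 0.80, 0.85, 0.40],  # one bad completion drags worst-at-k down
    [0.70, 0.72, 0.68, 0.71],
    [0.30, 0.90, 0.60, 0.50],
]
print(worst_at_k(scores, k=2))
```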