similar to large language model influence on diagnostic reasoning but now with o1-preview

A classic "LLMs are good" result. I know little about healthcare or diagnostics, but "trials evaluating AI in real clinical settings" sounds like the correct next step. I think a good study would compare:

  • a doctor
  • a doctor with access to the LLM
  • a doctor primed with the AI system diagnosis
  • a doctor primed with the AI system diagnosis and access to the LLM
  • a med student (with n years of school remaining)
  • a med student with access to the LLM
  • a med student primed with the AI system diagnosis
  • a med student primed with the AI system diagnosis and access to the LLM
  • the AI system
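The arms above form a full 2×2×2 factorial (clinician type × priming × LLM access) plus an AI-only arm. A quick sketch of that structure, with arm labels that are my own naming, not anything from the linked work:

```python
from itertools import product

# 2 clinician types x 2 priming conditions x 2 LLM-access conditions,
# plus the AI system on its own.
clinicians = ["doctor", "med student"]
primed = [False, True]
llm_access = [False, True]

arms = [
    {"clinician": c, "primed": p, "llm_access": a}
    for c, p, a in product(clinicians, primed, llm_access)
]
arms.append({"clinician": None, "primed": None, "llm_access": None, "ai_only": True})

print(len(arms))  # 9 arms, matching the bullet list above
```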

I’m mostly skeptical that “access to the LLM tool” will be more useful than priming. I’d guess results like:

  • doctors scoring only marginally higher than med students (confidence intervals overlapping, mean only ~3% higher), i.e. the two groups roughly on the same level
  • access to the LLM tool alone being a similarly small boost (confidence intervals overlapping, mean only ~3% higher)
  • priming getting results ~equivalent to the AI system
  • priming plus access to the LLM being +4-5% better than the AI system
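To make "confidence intervals overlapping, mean only ~3% higher" concrete, here's a toy simulation. The accuracy means (0.75 vs 0.78), the case-level noise (0.15), and the sample size (30) are entirely made-up assumptions for illustration, not predictions about the real study:

```python
import random
import statistics

def ci95(xs):
    # normal-approximation 95% confidence interval for the mean
    m = statistics.mean(xs)
    se = statistics.stdev(xs) / len(xs) ** 0.5
    return (m - 1.96 * se, m + 1.96 * se)

random.seed(0)
# hypothetical per-case accuracy scores: a ~3% true mean gap, noisy cases
med_students = [random.gauss(0.75, 0.15) for _ in range(30)]
doctors = [random.gauss(0.78, 0.15) for _ in range(30)]

lo_s, hi_s = ci95(med_students)
lo_d, hi_d = ci95(doctors)
print(f"students: ({lo_s:.3f}, {hi_s:.3f})")
print(f"doctors:  ({lo_d:.3f}, {hi_d:.3f})")
print("overlap:", max(lo_s, lo_d) < min(hi_s, hi_d))
```

With small samples and noisy per-case scores, a true ~3% gap usually produces intervals that overlap, which is why these arms would be hard to distinguish without a large n.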