similar to large language model influence on diagnostic reasoning but now with o1-preview

A classic "LLMs are good" result. I know little about healthcare or diagnostics, but "trials evaluating AI in real clinical settings" sounds like the correct next step. I think a good study would compare:

  • a doctor
  • a doctor with access to the LLM
  • a doctor primed with the AI system diagnosis
  • a doctor primed with the AI system diagnosis and access to the LLM
  • a med student (with n years of school remaining)
  • a med student with access to the LLM
  • a med student primed with the AI system diagnosis
  • a med student primed with the AI system diagnosis and access to the LLM
  • the AI system
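The arms above form a full 2×2×2 factorial (clinician type × priming × LLM access) plus an AI-only arm. A quick sketch of that structure, with arm labels that are my own naming, not anything from the linked work:

```python
from itertools import product

# 2 clinician types x 2 priming conditions x 2 LLM-access conditions,
# plus the AI system on its own.
clinicians = ["doctor", "med student"]
primed = [False, True]
llm_access = [False, True]

arms = [
    {"clinician": c, "primed": p, "llm_access": a}
    for c, p, a in product(clinicians, primed, llm_access)
]
arms.append({"clinician": None, "primed": None, "llm_access": None, "ai_only": True})

print(len(arms))  # 9 arms, matching the bullet list above
```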

I’m mostly skeptical that “access to the LLM tool” will be more useful than priming. I’d guess results like:

  • doctors scoring only marginally higher than med students (confidence intervals overlapping, mean only ~3% higher), i.e. the two groups roughly on the same level
  • access to the LLM tool alone being a similarly small boost (confidence intervals overlapping, mean only ~3% higher)
  • priming getting results ~equivalent to the AI system
  • priming plus access to the LLM being +4-5% better than the AI system
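To make "confidence intervals overlapping, mean only ~3% higher" concrete, here's a toy simulation. The accuracy means (0.75 vs 0.78), the case-level noise (0.15), and the sample size (30) are entirely made-up assumptions for illustration, not predictions about the real study:

```python
import random
import statistics

def ci95(xs):
    # normal-approximation 95% confidence interval for the mean
    m = statistics.mean(xs)
    se = statistics.stdev(xs) / len(xs) ** 0.5
    return (m - 1.96 * se, m + 1.96 * se)

random.seed(0)
# hypothetical per-case accuracy scores: a ~3% true mean gap, noisy cases
med_students = [random.gauss(0.75, 0.15) for _ in range(30)]
doctors = [random.gauss(0.78, 0.15) for _ in range(30)]

lo_s, hi_s = ci95(med_students)
lo_d, hi_d = ci95(doctors)
print(f"students: ({lo_s:.3f}, {hi_s:.3f})")
print(f"doctors:  ({lo_d:.3f}, {hi_d:.3f})")
print("overlap:", max(lo_s, lo_d) < min(hi_s, hi_d))
```

With small samples and noisy per-case scores, a true ~3% gap usually produces intervals that overlap, which is why these arms would be hard to distinguish without a large n.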