Planetarium: A Rigorous Benchmark for Translating Text to Structured Planning Languages evaluates LLMs on converting natural language to structured planning languages (PDDL). The LLMs don't do well zero-shot: GPT-4o gets 35%, though fine-tuned 2B-parameter LLMs can get near-perfect answers.
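For a sense of what the translation target looks like, here is a sketch of how a request like "stack block a on block b" might be rendered as a PDDL problem. The domain and predicate names below follow the classic blocksworld conventions and are illustrative, not taken from the benchmark itself:

```lisp
(define (problem stack-a-on-b)
  (:domain blocksworld)            ; assumed standard blocksworld domain
  (:objects a b)
  (:init (on-table a) (on-table b) ; both blocks start on the table
         (clear a) (clear b)      ; nothing is stacked on either block
         (arm-empty))
  (:goal (on a b)))                ; desired end state: a sits on b
```

The hard part the benchmark measures is exactly this mapping: deciding which predicates the prose implies, which are left implicit, and whether the resulting formal problem is equivalent to what the text actually asked for.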
I’ve heard some recent takes that the o-series models aren’t going to translate into truly transformative change, because not everything is a math problem. But what if everything were made into a math problem? This is in fact something LLMs are already good at: translation! I certainly buy that some things cannot be accurately mapped to a math problem, but I do imagine that many tasks simply can be.