A post from Steve Newman’s fantastic Substack

Some vague thoughts I wanted to get down while reading it:

I think the analogy to chess is particularly interesting, specifically because chess is not solved. We don't have an upper bound on performance, hence Elo ratings, and what we've found is that humans are likely quite bounded when it comes to chess, while AI seems far less bounded. There are also games like checkers (according to GPT) that are solved but still impractical for humans to play perfectly. Not sure which is the better analogy!

  • Actually, I might be wrong about this, but I think I'm operating under the assumption that the Elo system makes the most sense, or maybe only makes sense, in cases where performance has no known upper bound.

It seems like we should simply be measuring the squishy stuff too! Sticking with the Elo theme, surely it's not impossible to run a version of LMSYS where users choose which response they prefer and one of the responses is written by a human? Maybe you have multiple tiers, corresponding to how much you've paid the human, etc. I imagine this has already been done in some domains. Eventually you'll get evaluations of whether people prefer their coffee made by a human or a robot barista, human or robot nannies, and plenty more, though for those, market demand and pricing will probably serve as the benchmark, haha
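To make the idea concrete: each human-vs-model preference vote can be treated as a "game," and standard Elo updates applied after each one. This is just a sketch of the mechanism, not anything from the actual LMSYS codebase; the function names and the toy numbers are mine.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Elo model: probability that contestant A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 if A's response was preferred, 0.0 if B's, 0.5 for a tie."""
    e_a = expected_score(r_a, r_b)
    delta = k * (score_a - e_a)
    return r_a + delta, r_b - delta

# Toy run: a human writer starts at the same rating as a model and
# loses three preference votes in a row.
human, model = 1000.0, 1000.0
for preferred in ("model", "model", "model"):
    score_human = 1.0 if preferred == "human" else 0.0
    human, model = elo_update(human, model, score_human)
print(round(human), round(model))
```

Note how each successive loss moves the ratings less than the one before it: as the gap widens, the model is increasingly *expected* to win, so a win confirms less. That's also why the system works without any notion of an upper bound; ratings are purely relative.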

There's something here that I haven't quite figured out; will spend like 5 more minutes on this tonight, then poast