
CaseysEvals

Benchmarking the future of AI

Philosophy

As model capabilities steadily improve and increasingly difficult benchmarks saturate, new benchmarks and evaluation methods are needed to keep pace with the most powerful models. Benchmarks often take the shape of a long list of multiple-choice questions, or questions with a single correct answer (think mathematics), which can only capture so much. There has been a recent movement towards more interactive evaluations, such as VendingBench and ARC AGI 3, a movement I want to help advance.

I try to design benchmarks that let models explore a new setting, as this is where I think some of the most important facets of intelligence lie. EsoBench is my first step: models are presented with an alien programming language and tasked with using it to solve simple problems, learning from their code's outputs. The more free and open-ended the task, the more we can learn about how models behave in new and hostile environments.
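To make the setup concrete, here is a minimal sketch of what such an iterative execution-feedback loop might look like. The names `query_model`, `run_esolang`, and `evaluate`, and the fixed turn budget, are illustrative assumptions, not EsoBench's actual harness.

```python
# Minimal sketch of an iterative execution-feedback loop.
# `query_model` and `run_esolang` are hypothetical stand-ins for a
# model API call and an esolang interpreter; they are not EsoBench's
# real interfaces.

def query_model(transcript: list[str]) -> str:
    """Ask the model for its next program, given the transcript so far."""
    raise NotImplementedError  # placeholder for a real model API call

def run_esolang(program: str) -> str:
    """Execute a program in the novel esolang and return its output."""
    raise NotImplementedError  # placeholder for a real interpreter

def evaluate(task_prompt: str, expected_output: str, max_turns: int = 10) -> bool:
    """Let the model iterate: propose a program, observe the output, retry."""
    transcript = [task_prompt]
    for _ in range(max_turns):
        program = query_model(transcript)
        output = run_esolang(program)
        if output == expected_output:
            return True  # solved within the turn budget
        # Feed the interpreter's output back so the model can learn from it.
        transcript.append(f"Program:\n{program}\nOutput:\n{output}")
    return False
```

The key design choice this sketch illustrates is that the model never sees documentation for the language; everything it learns comes from the feedback appended to the transcript each turn.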

This direction feels interesting and promising, and it is the general banner under which my benchmarks are developed.

Benchmarks


EsoBench

Learning a Novel Esolang via Iterative Execution Feedback

Research

Analysis

Benchmark Orthogonality

Coming Soon...