CaseysEvals 2025
Introducing EsoBench, an exploratory programming benchmark that tests how AI systems learn unknown languages through experimentation. Models must discover the syntax and semantics of a novel esoteric programming language (esolang) by writing code and observing outputs from an interpreter. The benchmark contains 6 tasks of increasing difficulty, with 5 instances per task. Each task provides only a token list for the esolang, an example program, its output, and a problem to solve. Models have 50 attempts to experiment with the interpreter and find a working solution. Task 1 requires only pattern matching, while Task 6 demands learning language features not shown in the example.
A solution found on the first attempt scores 100 points; each additional attempt costs 2 points, down to 2 points at attempt 50. The final score is the average across all 30 task instances. A perfect score of 100 is effectively impossible, as the harder tasks require some experimentation to understand the language's syntax.
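As a concrete reading of that rule, here is a minimal Python sketch of the scoring. The function names, the assumption that unsolved instances contribute 0 points, and the example numbers are my own illustration, not the benchmark's actual code.

```python
def attempt_score(attempt: int) -> int:
    """Points for a solve on a given attempt: 100 on attempt 1, 98 on attempt 2,
    ..., down to 2 on attempt 50."""
    if not 1 <= attempt <= 50:
        raise ValueError("attempt must be between 1 and 50")
    return 100 - 2 * (attempt - 1)


def benchmark_score(solving_attempts: list[int], total_instances: int = 30) -> float:
    """Average over all task instances; instances that are never solved are
    assumed to contribute 0 points (my assumption, not stated explicitly above)."""
    return sum(attempt_score(a) for a in solving_attempts) / total_instances


# Example: three instances solved on attempts 1, 10 and 50, the rest unsolved.
print(benchmark_score([1, 10, 50]))  # (100 + 82 + 2) / 30 ≈ 6.13
```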
Top 20 Models
Esoteric Programming Languages
The current standard for AI benchmarks is a comprehensive suite of questions that are either multiple choice or have a single verifiably correct answer (think mathematics).
I wanted to design a benchmark where the models are placed into a situation with a goal, and are then allowed to explore and experiment. I cannot present them with new
science to uncover, and they are already aware of most programming languages. Esolangs are a nice solution here, as they are easy to make and can be constructed in ways
unlike traditional programming languages. Also, a custom esolang can easily be kept private and off the internet, allowing the benchmark to remain useful for evaluating
future models.
The Esolang
I cannot go into detail, as the esolang should remain private; however, I can give an idea of its depth. The language uses a total of 22 characters (10 of which are digits)
and can print arbitrary strings and integers. Programs can perform arithmetic and loops. There are no native if statements, but it is possible to write code that
branches depending on the value of an integer.
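To give a sense of what branching without native if statements can look like in general (this is plain Python used as an analogy, not the esolang's actual syntax, which stays private): in a language with only loops and arithmetic, a branch can be emulated by deriving a loop's iteration count from the condition.

```python
def branch(flag: int, then_fn, else_fn):
    """Emulate `if flag: then_fn() else: else_fn()` using only loops and
    arithmetic. `flag` is assumed to be 0 or 1."""
    for _ in range(flag):        # runs once when flag == 1, never when flag == 0
        then_fn()
    for _ in range(1 - flag):    # runs once when flag == 0, never when flag == 1
        else_fn()


# Example: report the parity of n without using a native if statement.
n = 7
branch(1 - n % 2, lambda: print("even"), lambda: print("odd"))  # prints "odd"
```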
Task Complexity
There are 6 tasks of increasing complexity. Each task contains an example program and output, and a problem to be solved. Here is the approximate difficulty scaling,
detailing how much knowledge of the language is needed to find a solution:
Task 1: No knowledge
Task 2: Minimal knowledge with relevant example
Task 3: Major knowledge with relevant example
Task 4: Major knowledge with minimal example
Task 5: Full knowledge with relevant example
Task 6: Full knowledge with minimal example
Correct Solutions
Each task is a simple programming problem that requires printing the outputs of some algorithm. Many different programs are technically correct, so rather than
checking the code itself, we compare the output of the submitted program to the expected output. If they match, that task is marked for closer inspection. It is of
course possible for the model to manually calculate the expected outputs and instead write a simpler program that directly prints the results, so any tentative
solutions are checked to ensure that the submitted program does actually do the calculations.
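A rough sketch of the automated first pass of that check, assuming a `run_interpreter` callable that executes a program in the private interpreter and returns its output (the name and the exact comparison are my own illustration):

```python
def flag_for_inspection(program: str, expected_output: str, run_interpreter) -> bool:
    """Return True if the program's output matches the expected output exactly
    (modulo surrounding whitespace). A match only marks the submission for
    manual review; a human still checks that the program actually performs the
    calculations rather than printing hard-coded values."""
    actual = run_interpreter(program)
    return actual.strip() == expected_output.strip()
```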
Scoring Points
Models earn points based on how quickly they find a correct solution, the intent being that smarter models should find solutions more quickly on average.
Scoring points rather than using a binary win/lose system allows for more granularity in the results. For example, 9 of the 16 initial models tested solved 16.67% of
the problems, but the points give more detail by spreading those models out over 12.5–16.7 points.
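To make that spread concrete (the numbers below are an invented illustration, not measured results): two models that each solve the same 5 of 30 instances, i.e. 16.67% of the problems, are still separated by how quickly they found their solutions.

```python
def attempt_score(attempt: int) -> int:
    return 100 - 2 * (attempt - 1)

fast = [1, 1, 1, 1, 1]       # five solves, all on the first attempt
slow = [6, 11, 11, 16, 21]   # the same five tasks, solved after some experimenting

print(sum(map(attempt_score, fast)) / 30)  # 16.67 points
print(sum(map(attempt_score, slow)) / 30)  # ≈ 12.67 points
```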