CaseysEvals 2025
Introducing EsoBench, an exploratory programming benchmark that tests how AI systems learn unknown languages through experimentation. Models must discover the syntax and semantics of a novel esoteric programming language (esolang) by writing code and observing outputs from an interpreter. The benchmark contains 6 tasks of increasing difficulty, with 5 instances per task. Each task provides only a token list for the esolang, an example program, its output, and a problem to solve. Models have 50 attempts to experiment with the interpreter and find a working solution. Task 1 requires only pattern matching, while Task 6 demands learning language features not shown in the example.
Models score 100 points for solving on the first attempt, decreasing by 2 points per attempt down to 2 points at attempt 50. The final score averages across all 30 task instances. A perfect score of 100 is effectively impossible, as the harder tasks require some experimentation to understand the language's syntax.
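To make the setup concrete, here is a rough Python sketch of the evaluation loop described above. All names (run_instance, interpreter.run, and so on) are illustrative assumptions; the real harness and interpreter are private.

    MAX_ATTEMPTS = 50

    def run_instance(model, task, interpreter):
        # The model initially sees only the token list, one example program,
        # that program's output, and the problem to solve.
        history = [task.token_list, task.example_program,
                   task.example_output, task.problem_statement]
        for attempt in range(1, MAX_ATTEMPTS + 1):
            program = model.write_program(history)   # model proposes esolang code
            output = interpreter.run(program)        # interpreter output is fed back
            if task.is_solved(program, output):
                return attempt                       # the attempt number drives the score
            history.append((program, output))
        return None                                  # no solution within 50 attempts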
[Leaderboard chart: Top 20 Models]
Esoteric Programming Languages
The current standard for AI benchmarks is a comprehensive suite of questions that are either multiple choice or have one verifiably correct answer (think mathematics).
I wanted to design a benchmark where the models are placed into a situation with a goal, and are then allowed to explore and experiment. I cannot present them with new
science to uncover, and they are already aware of most programming languages. Esolangs are a nice solution here, as they are easy to make and can be constructed in ways
unlike traditional programming languages. A custom esolang can also easily be kept private and off the internet, allowing the benchmark to remain useful for evaluating
future models.
The Esolang
I cannot go into detail, as the esolang should remain private, but I can give an idea of its depth. The language uses a total of 22 characters (10 of which are digits)
and can print arbitrary strings and integers. Programs can perform arithmetic and loops. There are no native if statements; however, it is possible to write code that
branches depending on the value of an integer, as sketched below.
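To illustrate the kind of trick involved (without revealing the esolang itself), here is a small Python sketch of branching built only from loops and arithmetic; the esolang's actual mechanism may differ.

    def branch(flag, then_action, else_action):
        # flag is assumed to be 0 or 1; it is converted into loop counts,
        # so exactly one of the two actions runs, with no if statement.
        for _ in range(flag):        # executes once when flag == 1
            then_action()
        for _ in range(1 - flag):    # executes once when flag == 0
            else_action()

    branch(1, lambda: print("even"), lambda: print("odd"))   # prints "even"
    branch(0, lambda: print("even"), lambda: print("odd"))   # prints "odd"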
Task Complexity
There are 6 tasks of increasing complexity. Each task contains an example program and output, and a problem to be solved. Here is the approximate difficulty scaling,
detailing how much knowledge of the language is needed to find a solution:
Task 1: No knowledge
Task 2: Minimal knowledge with relevant example
Task 3: Major knowledge with relevant example
Task 4: Major knowledge with minimal example
Task 5: Full knowledge with relevant example
Task 6: Full knowledge with minimal example
Correct Solutions
Each task is a simple programming problem that requires printing the output of some algorithm. There are many programs that are technically correct, so we instead
compare the output of the submitted program to the expected output. If they match, that task is marked for closer inspection. It is of course possible for the
model to manually calculate the expected outputs and write a simpler program that directly prints the results, so any tentative solutions are checked to
ensure that the submitted program actually does the calculations.
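As a sketch of this checking flow (the function and names here are assumptions, not the real harness):

    def check_submission(program, interpreter, expected_output):
        # Returns "fail", or "tentative" for an output match that still needs
        # manual review to confirm the program computes the results rather
        # than just printing them.
        actual = interpreter.run(program)
        if actual != expected_output:
            return "fail"
        return "tentative"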
Scoring Points
Models earn points based on how quickly they find a correct solution, with the intent being that smarter models should find solutions more quickly on average.
Scoring points instead of using a binary win/lose system allows for more granularity in the results. For example, 9 of the 16 initial models tested solved 16.67% of
the problems, but the points reveal more detail by spreading those models out over 12.5 to 16.7 points.
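Concretely, the scoring described above works out to something like the following (a small sketch, not the harness code; unsolved instances are assumed to score 0):

    def points(attempt):
        # attempt is 1..50, or None if the instance was never solved
        return 0 if attempt is None else 100 - 2 * (attempt - 1)

    def final_score(solve_attempts):
        # solve_attempts: one entry per task instance (30 in total)
        return sum(points(a) for a in solve_attempts) / len(solve_attempts)

    # e.g. solving 5 of 30 instances, all on the first attempt:
    # final_score([1]*5 + [None]*25) == 100 * 5 / 30, roughly 16.7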
API Parameters
All models were evaluated under the same conditions, with a simple system prompt, temperature = 1, and top_p = 1. No model responses were truncated due to maximum
output tokens. Reasoning models are tested at their default settings unless otherwise specified (o4-mini and o3, for example, have been tested at both default and high). If a model has
no distinct reasoning tiers (e.g. Gemini), the reasoning token allowance is maxed out.
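For reference, the requests look roughly like the following, assuming an OpenAI-compatible chat API (the model name, system prompt, and message history shown here are placeholders):

    from openai import OpenAI

    client = OpenAI()

    MODEL_NAME = "model-under-test"   # placeholder for the model being evaluated
    SYSTEM_PROMPT = "..."             # the simple system prompt (not reproduced here)
    conversation = []                 # prior attempts and interpreter outputs

    response = client.chat.completions.create(
        model=MODEL_NAME,
        messages=[{"role": "system", "content": SYSTEM_PROMPT}, *conversation],
        temperature=1,
        top_p=1,
    )
    # Reasoning effort / thinking-token budgets are set through each
    # provider's own options where applicable.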
Grok 4 & Cheating
The initial results showed that Grok 4 appeared to solve both task 4 and task 6. However, closer examination of the submitted code revealed some interesting details.
For task 6, the solution simply printed the expected outputs rather than actually calculating them. Task 4 presented a more nuanced situation. This task required
creating a non-terminating program that generates an infinite sequence. The interpreter can handle such programs by displaying only a truncated portion of
the output, and Grok 4 managed to exploit this: it wrote code that computed only a finite number of terms, but enough of them that the output was still truncated
by the interpreter and ultimately matched the expected result.
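A toy illustration of why this works, assuming the interpreter shows only a fixed-length prefix of the output (the limit and the sequence here are made up):

    TRUNCATE_AT = 1000   # assumed display limit, for illustration only

    def truncated(output):
        return output[:TRUNCATE_AT]

    # Stand-in sequence: the "intended" non-terminating output vs. a finite one
    intended = " ".join(str(n * n) for n in range(10_000))
    finite = " ".join(str(n * n) for n in range(500))   # finite, but long enough

    print(truncated(intended) == truncated(finite))   # True: a prefix check cannot tell them apart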
o3 Performing Badly
o3's score is perhaps the most surprising here, as both the default (medium) and high reasoning efforts fail to beat Claude 3.7 Sonnet or Gemini 2.5 Flash. Looking into
its responses during the task, the reason becomes clear. For o3, about 48% of its messages failed to include any code to run, meaning it was learning about the esolang
more slowly than the other models. For o3-high, this drops to 40%, which is an improvement, but the model is still failing to use its turns to learn about the esolang.
For some context, all Claude models (as of Sonnet 4.5) have a 0% rate of not providing code, and Grok 4 had a rate of 7%. This seems to be a problem unique to o3.
Gemini 3 Performing Strangely
Gemini 3 Pro is a very competent model, yet it fails to obtain any points on question 2. This is striking, and the reason isn't clear even when reading through the
5 attempts. Despite this, the model did, on one attempt, find a solution to problem 5, which is much, much harder. Upon inspecting the solution for false positives,
as with Grok 4, it turned out the model had genuinely solved the problem. This was the first time a model had ever solved a question beyond 2, and yet it failed
to pick up any points on the preceding questions 2-4.