EsoBench

Learning a Private Esolang Through Exploration

CaseysEvals 2025

Introducing EsoBench, an exploratory programming benchmark that tests how AI systems learn unknown languages through experimentation. Models must discover the syntax and semantics of a novel esoteric programming language (esolang) by writing code and observing outputs from an interpreter. The benchmark contains 6 tasks of increasing difficulty, with 5 instances per task. Each task provides only a token list for the esolang, an example program, its output, and a problem to solve. Models have 50 attempts to experiment with the interpreter and find a working solution. Task 1 requires only pattern matching, while Task 6 demands learning language features not shown in the example.

Models score 100 points for solving on the first attempt, decreasing by 2 points per attempt down to 2 points at attempt 50. The final score averages across all 30 task instances. A perfect score of 100 is effectively impossible, as the harder tasks require some experimentation to understand the language's syntax.
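In rough pseudocode terms, the scoring rule looks like the minimal sketch below (this is not the benchmark's actual code, and it assumes an unsolved instance contributes 0 points):

  def attempt_points(attempt):
      # Points for solving on a given attempt (1-50): 100 on attempt 1,
      # dropping by 2 per attempt down to 2 on attempt 50.
      # An unsolved instance (attempt=None) is assumed here to score 0.
      if attempt is None:
          return 0
      return 102 - 2 * attempt

  def final_score(solve_attempts):
      # Average points across all 30 task instances (6 tasks x 5 instances each).
      return sum(attempt_points(a) for a in solve_attempts) / len(solve_attempts)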

Model Scores

Top 20 Models

EsoBench Leaderboard


Details

Benchmark Design

Esoteric Programming Languages
The current standard for AI benchmarks is a comprehensive suite of questions that are either multiple choice or have one verifiably correct answer (think mathematics). I wanted to design a benchmark where models are placed into a situation with a goal and are then allowed to explore and experiment. I cannot present them with new science to uncover, and they are already aware of most programming languages. Esolangs are a nice solution here, as they are easy to make and can be constructed in ways unlike traditional programming languages. A custom esolang can also easily be kept private and off the internet, allowing the benchmark to remain useful for future models.

The Esolang
I cannot go into detail, as the esolang should remain private, but I can give an idea of its depth. The language uses a total of 22 characters (10 of which are digits) and can print arbitrary strings and integers. Programs can perform arithmetic and loops. There are no native if statements, but it is possible to write code that branches depending on the value of an integer, as illustrated loosely below.
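Since the language itself must stay private, the following is only a loose Python illustration of the general idea of branching without a native if statement, using loops that run either zero times or exactly once; the esolang's actual mechanism may differ:

  def branch(x, then_action, else_action):
      # Assumes x is a non-negative integer.
      # min(x, 1) is 0 when x == 0 and 1 otherwise, so each loop body
      # runs either zero times or exactly once -- no native "if" needed.
      for _ in range(min(x, 1)):        # runs only when x != 0
          then_action()
      for _ in range(1 - min(x, 1)):    # runs only when x == 0
          else_action()

  branch(3, lambda: print("nonzero"), lambda: print("zero"))  # prints "nonzero"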

Task Complexity
There are 6 tasks of increasing complexity. Each task contains an example program and output, and a problem to be solved. Here is the approximate difficulty scaling, detailing how much knowledge of the language is needed to find a solution:

  • Task 1: No knowledge

  • Task 2: Minimal knowledge with relevant example

  • Task 3: Major knowledge with relevant example

  • Task 4: Major knowledge with minimal example

  • Task 5: Full knowledge with relevant example

  • Task 6: Full knowledge with minimal example

Evaluation & Scoring

Correct Solutions
Each task is a simple programming problem that requires printing the outputs of some algorithm. Many different programs are technically correct, so we instead compare the output of each submitted program to the expected output. If they match, the task is marked for closer inspection. It is of course possible for a model to manually calculate the expected outputs and write a simpler program that directly prints the results, so any tentative solution is checked to ensure that the submitted program actually performs the calculations.
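In rough terms, the output check amounts to something like the sketch below (interpreter.py is a placeholder name standing in for the private interpreter, not its real entry point):

  import subprocess

  def flag_for_review(program_path, expected_output, timeout=30):
      # Run the submission through the interpreter and compare its stdout
      # to the expected output for this task instance.
      result = subprocess.run(
          ["python", "interpreter.py", program_path],
          capture_output=True, text=True, timeout=timeout,
      )
      # A match only flags the submission for manual inspection; a human still
      # verifies that the program computes the results rather than printing them.
      return result.stdout.strip() == expected_output.strip()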

Scoring Points
Models earn points based on how quickly they find a correct solution, the intent being that smarter models should find solutions more quickly on average. Scoring points rather than using a binary win/lose system allows for more granularity in the results. For example, 9 of the 16 initial models tested solved 16.67% of the problems, but the points reveal more detail by spreading those models out over 12.5 - 16.7 points.

Experimental Setup

API Parameters
All models were evaluated under the same conditions: a simple system prompt, temperature = 1, and top_p = 1. No model responses were truncated due to maximum output tokens. Reasoning models are tested at their default settings unless specified (o4-mini and o3, for example, have been tested at both default and high). If a model has no distinct reasoning tiers (e.g. Gemini), the reasoning token allowance is maxed out.
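For concreteness, the sampling setup amounts to something like the following (shown with the OpenAI Python SDK purely as an illustration; the actual harness and client library are not shown here, and the model name and prompt contents are placeholders):

  from openai import OpenAI

  client = OpenAI()

  # Illustrative only: model name and prompt contents are placeholders.
  response = client.chat.completions.create(
      model="MODEL_NAME",
      messages=[
          {"role": "system", "content": "SYSTEM_PROMPT"},
          {"role": "user", "content": "TASK_PROMPT"},
      ],
      temperature=1,
      top_p=1,
  )
  print(response.choices[0].message.content)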

Notable Results

Grok 4 & Cheating
The initial results showed that Grok 4 appeared to solve both task 4 and task 6. However, closer examination of the submitted code revealed some interesting details. For task 6, the solution simply printed the expected outputs rather than actually calculating them. Task 4 presented a more nuanced situation. This task required creating a non-terminating program that generates an infinite sequence. While the interpreter can handle such programs by displaying only a truncated portion of the output, Grok 4 managed to exploit this. It wrote code that computed a finite number of terms, but calculated enough terms that the output was still truncated by the interpreter, ultimately matching the expected result.

o3 Performing Badly
o3's score is perhaps the most surprising here, as both the default medium and high reasoning efforts fail to beat Claude 3.7 Sonnet or Gemini 2.5 Flash. Looking into its responses during the task, the reason becomes clear: about 48% of o3's messages failed to include any code to run, meaning it was learning about the esolang more slowly than the other models. For o3-high, this drops to 40%, an improvement, but the model is still failing to use its turns to learn about the esolang. For context, all Claude models (up to and including Sonnet 4.5) have a 0% rate of not providing code, and Grok 4 had a rate of 7%. This seems to be a problem unique to o3.

Gemini 3 Performing Strangely
Gemini 3 Pro is a very competent model, yet it fails to obtain any points on Task 2. This is striking, and the reason isn't clear even when reading through the 5 attempts. Despite this, the model did, on one attempt, find a solution to Task 5, a much, much harder problem. Upon inspecting the solution for false positives, as with Grok 4, it turned out the model had genuinely solved the problem. This was the first time a model had solved anything beyond Task 2, and yet it failed to pick up any points on the preceding Tasks 2-4.

BibTeX

@misc{esobench,
  author = {CaseysEvals},
  title  = {EsoBench: Learning a Private Esolang Through Exploration},
  year   = {2025},
  url    = {https://caseys-evals.com/},
}

Copy the above BibTeX entry for use in your academic references.