EsoBench

Learning a Private Esolang Through Exploration

CaseysEvals 2025

Introducing EsoBench, an exploratory programming benchmark that tests how AI systems learn unknown languages through experimentation. Models must discover the syntax and semantics of a novel esoteric programming language (esolang) by writing code and observing outputs from an interpreter. The benchmark contains 6 tasks of increasing difficulty, with 5 instances per task. Each task provides only a token list for the esolang, an example program, its output, and a problem to solve. Models have 50 attempts to experiment with the interpreter and find a working solution. Task 1 requires only pattern matching, while Task 6 demands learning language features not shown in the example.

Models score 100 points for solving on the first attempt, decreasing by 2 points per attempt down to 2 points at attempt 50. An instance with no solution scores 0. The final score is the average across all 30 task instances. A perfect score of 100 is effectively impossible, as the harder tasks require some experimentation to understand the language's syntax.
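The scoring rule above can be written as a short formula. The following is a minimal sketch, not the benchmark's actual code: the function names `instance_score` and `final_score` are hypothetical, and the closed form `102 - 2 * attempt` is simply the line that passes through the stated values (100 at attempt 1, 98 at attempt 2, 2 at attempt 50).

```python
from typing import Optional, Sequence

MAX_ATTEMPTS = 50  # stated in the task description


def instance_score(solved_on: Optional[int]) -> int:
    """Points for one task instance.

    solved_on is the 1-based attempt on which the correct solution was
    submitted, or None if no solution was found (0 points).
    """
    if solved_on is None:
        return 0
    if not 1 <= solved_on <= MAX_ATTEMPTS:
        raise ValueError(f"attempt must be in 1..{MAX_ATTEMPTS}, got {solved_on}")
    return 102 - 2 * solved_on


def final_score(results: Sequence[Optional[int]]) -> float:
    """Average the per-instance points over every task instance."""
    return sum(instance_score(r) for r in results) / len(results)


# Hypothetical run: 5 instances solved on the first attempt, 25 unsolved.
print(round(final_score([1] * 5 + [None] * 25), 2))
```

Under this sketch, a model that solves 5 of the 30 instances on the first attempt and nothing else would average about 16.67 points.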

Model Scores

Top 20 Models

EsoBench Leaderboard


Details

Benchmark Design

Esoteric Programming Languages
The current standard for AI benchmarks is a comprehensive suite of questions that are either multiple choice or have one verifiably correct answer (think mathematics). I wanted to design a benchmark where the models are placed into a situation with a goal, and are then allowed to explore and experiment. I cannot present them with new science to uncover, and they are already aware of most programming languages. Esolangs are a nice solution here, as they are easy to make and can be constructed in ways unlike traditional programming languages. A custom esolang can also easily be kept private and off the internet, so the benchmark remains usable for future models.

The Esolang
I cannot go into detail, as the esolang should remain private, but I can give an idea of its depth. The language uses a total of 22 characters (10 of which are digits) and can print arbitrary strings and integers. Programs can perform arithmetic and loops. There are no native if statements, but it is possible to write code that branches depending on the value of an integer.

Task Complexity
There are 6 tasks of increasing complexity. Each task contains an example program and output, and a problem to be solved. Here is the approximate difficulty scaling, detailing how much knowledge of the language is needed to find a solution:

  • Task 1: No knowledge

  • Task 2: Minimal knowledge with relevant example

  • Task 3: Major knowledge with relevant example

  • Task 4: Major knowledge with minimal example

  • Task 5: Full knowledge with relevant example

  • Task 6: Full knowledge with minimal example

Here is a redacted problem statement, to give an idea of the complexity of the benchmark. The tasks themselves are incredibly simple; when models fail to solve them, it is because they fail to learn the language in time.
You are given a code example in an unknown programming language. Your task is to explore and understand this language, and write code in this language to solve the given problem

## Instructions
- Enclose any code you want to run in <CODE></CODE> tags
- Limit code blocks to at most 20 lines
- Only the most recent code block in each message will be executed
- Each code execution is a new program, with no memory of previous executions
- You will receive the output of your code in <OUTPUT></OUTPUT> tags
- Do not use <OUTPUT></OUTPUT> tags
- Program outputs exceeding 100 characters will be truncated with "..." at the end
- You have up to 50 messages
- The conversation ends when you either:
  - Submit code that provides the correct output
  - Reach 50 messages
- Correct solutions are scored based on how quickly you find them. A correct answer on message 1 = 100 points, message 2 = 98 points, message 50 = 2 points. No solution = 0 points
- The problem can be solved with a program that is under 20 lines of code

## Example Code and Output
<CODE>REDACTED</CODE>
<OUTPUT>1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 1048576 2097...</OUTPUT>

## Valid Tokens
REDACTED

## Problem
Write a program that calculates and prints the entire Fibonacci sequence in order. The sequence should start 1 1 and the numbers should be separated by spaces
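The message-handling rules in this prompt can be sketched in a few lines. This is an illustration only: the actual harness is not published, the helper names `extract_code` and `truncate_output` are hypothetical, and rejecting an over-long block (rather than, say, running only its first 20 lines) is an assumption. The truncation rule, first 100 characters followed by "...", is consistent with the example output shown in the prompt.

```python
import re
from typing import Optional

CODE_BLOCK = re.compile(r"<CODE>(.*?)</CODE>", re.DOTALL)
MAX_CODE_LINES = 20      # "Limit code blocks to at most 20 lines"
MAX_OUTPUT_CHARS = 100   # outputs longer than this are truncated


def extract_code(message: str) -> Optional[str]:
    """Return the most recent <CODE> block in a message, or None.

    Assumption: a block longer than MAX_CODE_LINES is rejected outright.
    """
    blocks = CODE_BLOCK.findall(message)
    if not blocks:
        return None
    code = blocks[-1].strip("\n")
    if len(code.splitlines()) > MAX_CODE_LINES:
        return None
    return code


def truncate_output(output: str) -> str:
    """Keep the first MAX_OUTPUT_CHARS characters and mark the cut with '...'."""
    if len(output) <= MAX_OUTPUT_CHARS:
        return output
    return output[:MAX_OUTPUT_CHARS] + "..."


# Powers of two, as in the redacted example's output.
powers = " ".join(str(2 ** i) for i in range(30))
print(truncate_output(powers))
```

Running the last two lines reproduces the truncated powers-of-two output from the example, ending in `1048576 2097...`.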

Evaluation & Scoring

Correct Solutions
Each task is a simple programming problem that requires printing outputs of some algorithm. There are many programs that are technically correct, so we instead compare the outputs of the submitted programs to the expected output. If they match, that task is marked for closer inspection. It is of course possible for the model to manually calculate the expected outputs and instead write a simpler program that directly prints the results, so any tentative solutions are checked to ensure that the submitted program does actually do the calculations.
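The first, automated half of this check can be sketched for the Fibonacci task from the redacted problem statement. This is a hypothetical illustration: the benchmark's real comparison rules are not published, so comparing against the expected output truncated to 100 characters and ignoring leading and trailing whitespace are both assumptions, as are the names `expected_fibonacci_output` and `outputs_match`. The second half, confirming that the program actually performs the calculation, is not modelled here.

```python
MAX_OUTPUT_CHARS = 100  # truncation limit stated in the task prompt


def expected_fibonacci_output(limit: int = MAX_OUTPUT_CHARS) -> str:
    """Space-separated Fibonacci numbers starting 1 1, truncated like interpreter output."""
    numbers = []
    a, b = 1, 1
    for _ in range(60):  # far more terms than fit in `limit` characters
        numbers.append(str(a))
        a, b = b, a + b
    text = " ".join(numbers)
    return text[:limit] + "..." if len(text) > limit else text


def outputs_match(actual: str, expected: str) -> bool:
    """Assumed comparison: exact match after trimming surrounding whitespace."""
    return actual.strip() == expected.strip()


expected = expected_fibonacci_output()
# A match only flags the submission for closer inspection; it is not a final verdict.
print(outputs_match(expected + "\n", expected))
```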

Scoring Points
Models earn points based on how quickly they find a correct solution, with the intent being that smarter models should find solutions more quickly on average. Scoring points instead of using a binary win/lose system allows for more granularity in the results. For example, 9 of the 16 initial models tested solved 16.67% of the problems, but the points let us see more detail by spreading those models out over 12.5 to 16.7 points.
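To see where that range comes from, the arithmetic can be worked through directly. The sketch below assumes that 16.67% corresponds to 5 of the 30 task instances and uses the stated scoring rule (100 points at attempt 1, minus 2 per later attempt); the function names are hypothetical.

```python
TOTAL_INSTANCES = 30
SOLVED = 5  # 5 / 30 = 16.67% of the problems


def final_score(mean_points_per_solve: float) -> float:
    """Final score when SOLVED instances earn this many points on average and the rest earn 0."""
    return SOLVED * mean_points_per_solve / TOTAL_INSTANCES


def mean_attempt_for(final: float) -> float:
    """Average solving attempt implied by a final score, given points = 102 - 2 * attempt."""
    mean_points = final * TOTAL_INSTANCES / SOLVED
    return (102 - mean_points) / 2


print(round(final_score(100), 2))  # every solve on the first attempt: 16.67
print(mean_attempt_for(12.5))      # 13.5
```

Under these assumptions, 16.7 points is the ceiling for a model that solves 5 instances (all on the first attempt), while 12.5 points corresponds to solving them on attempt 13.5 on average.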

BibTeX

@misc{esobench,
  author = {CaseysEvals},
  title = {EsoBench: Learning a Private Esolang Through Exploration},
  year = {2025},
  url = {https://caseys-evals.com/},
}
