Overview
EsoBench is an exploratory programming benchmark that tests how AI systems learn unknown languages through experimentation. Models must discover the syntax and semantics of a novel esoteric programming language (esolang) by writing code and observing outputs from an interpreter.
The benchmark contains 6 tasks of increasing difficulty, with 5 repeats per task. Each task provides only a token list for the esolang, an example program, its output, and a problem to solve.
Models have 50 attempts to experiment with the interpreter and find a working solution. Task 1 requires only pattern matching, while Task 6 demands learning language features not shown in the example.
Models score 100 points for solving on the first attempt, with the score decreasing by 2 points per attempt down to 2 points at attempt 50. The final score is the average across all 30 task instances. A perfect score of 100 is effectively impossible, as the harder tasks require some experimentation to understand the language's syntax.
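For concreteness, a solve on attempt n scores 100 - 2(n - 1) points. Here is a minimal Python sketch of the scheme (the function names are illustrative, not the benchmark's actual harness):

```python
def instance_score(solve_attempt: int | None) -> int:
    # solve_attempt is the 1-indexed message on which the correct
    # solution was submitted, or None if no solution was found.
    if solve_attempt is None:
        return 0
    # 100 on attempt 1, dropping by 2 per attempt: 98, 96, ..., 2 on attempt 50.
    return 100 - 2 * (solve_attempt - 1)

def final_score(solve_attempts: list[int | None]) -> float:
    # Average the per-instance scores across all 30 task instances.
    return sum(instance_score(a) for a in solve_attempts) / len(solve_attempts)

# A model that solves every instance on its 10th message scores 82.0 overall.
print(final_score([10] * 30))  # 82.0
```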
Results
The 20 best-performing models are shown below, color-coded by company. Click companies in the legend to highlight their models, and toggle to filter by the most recently evaluated models.
Leaderboard
Full results across all evaluated models. Scores represent average points across all 30 task instances (6 tasks, each repeated 5 times). The heatmap shows the exact score on each task instance, ranging from red (0) to green (100). Toggling "Details" shows information about how the model was tested, including the provider, quantization, and thinking effort.
Trends
Model performance over time by release date. Click companies in the legend to filter and see company-specific trends, and toggle to filter by the most recently added models.
The date range slider can be used to isolate specific time intervals, and the trendline and rate of progress are calculated based on the chosen interval. The plotted trendline can be toggled on/off, but the calculated improvement will continue to be displayed on the graph.
Benchmark Details
The current standard for AI benchmarks is a comprehensive suite of questions that are either multiple choice or have one verifiably correct answer (think mathematics). I wanted to design a benchmark where models are placed into a situation with a goal and are then allowed to explore and experiment. I cannot present them with new science to uncover, and they are already aware of most programming languages. Esolangs are a nice solution here, as they are easy to make and can be constructed in ways unlike traditional programming languages. A custom esolang can also easily be kept private and off the internet, allowing the benchmark to remain useful for future models.
The esolang must remain private, which makes transparency difficult; however, I can give an idea of its depth. The language uses a total of 22 characters (10 of which are the digits 0-9) and can print arbitrary strings and integers. Programs can perform arithmetic and loops. There are no native if statements, but it is possible to write code that branches dependent on the value of an integer, as sketched below.
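Since the language itself is private I cannot show real programs, but the branching trick is common in minimal languages: derive a loop's iteration count from an integer so the body runs either once or not at all. A rough Python analogue of the idea (this is not the esolang, just an illustration of the technique):

```python
def branch_without_if(n: int) -> None:
    # range(min(n, 1)) yields one iteration when n > 0 and none otherwise,
    # so the loop body behaves like the then-branch of "if n > 0".
    for _ in range(min(n, 1)):
        print("n is positive")

branch_without_if(5)  # prints "n is positive"
branch_without_if(0)  # prints nothing
```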
The Tasks
There are 6 tasks of increasing complexity. Each task contains an example program and output, and a problem to be solved. Here is the approximate difficulty scaling, detailing how much knowledge of the language is needed to find a solution:
Task 1: No knowledge
Task 2: Minimal knowledge with relevant example
Task 3: Major knowledge with relevant example
Task 4: Major knowledge with minimal example
Task 5: Full knowledge with relevant example
Task 6: Full knowledge with minimal example
Below is the full prompt given to models, with the language-specific details redacted:
You are given a code example in an unknown programming language. Your task is to explore and understand this language, and write code in this language to solve the given problem.
## Instructions
- Enclose any code you want to run in <CODE></CODE> tags
- Limit code blocks to at most 20 lines
- Only the most recent code block in each message will be executed
- Each code execution is a new program, with no memory of previous executions
- You will receive the output of your code in <OUTPUT></OUTPUT> tags
- Do not use <OUTPUT></OUTPUT> tags
- Program outputs exceeding 100 characters will be truncated with "..." at the end
- You have up to 50 messages
- The conversation ends when you either:
- Submit code that provides the correct output
- Reach 50 messages
- Correct solutions are scored based on how quickly you find them. A correct answer on message 1 = 100 points, message 2 = 98 points, message 50 = 2 points. No solution = 0 points
- The problem can be solved with a program that is under 20 lines of code
## Example Code and Output
<CODE>REDACTED</CODE>
<OUTPUT>1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 1048576 2097...</OUTPUT>
## Valid Tokens
REDACTED
## Problem
Write a program that calculates and prints the entire Fibonacci sequence in order. The sequence should start 1 1 and the numbers should be separated by spaces
Scoring
Each task is a simple programming problem that requires printing the outputs of some algorithm. Since many different programs can be technically correct, submissions are graded by comparing their outputs to the expected output rather than by inspecting the code. If the outputs match, the task is marked for closer inspection. It is of course possible for a model to manually calculate the expected outputs and write a simpler program that directly prints the results, so any tentative solution is checked to ensure that the submitted program actually does the calculations.
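As an illustration, here is a hedged Python sketch of what that output comparison could look like for the Fibonacci task above, including the interpreter's 100-character truncation rule (the helper names and the truncation-based comparison are my assumptions, not the benchmark's actual code):

```python
def truncate(output: str, limit: int = 100) -> str:
    # Mirror the interpreter: outputs beyond 100 characters end with "...".
    return output if len(output) <= limit else output[:limit] + "..."

def expected_fibonacci(limit: int = 100) -> str:
    # Build the reference output "1 1 2 3 5 8 ..." until it exceeds the limit.
    a, b, parts = 1, 1, []
    while len(" ".join(parts)) <= limit:
        parts.append(str(a))
        a, b = b, a + b
    return truncate(" ".join(parts), limit)

def flag_for_inspection(program_output: str) -> bool:
    # A matching output only marks the submission as tentative; it is then
    # inspected to confirm the program computes the values rather than
    # printing hardcoded results.
    return truncate(program_output) == expected_fibonacci()
```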
Models earn points based on how quickly they find a correct solution, with the intent that smarter models should find solutions more quickly on average. Awarding points instead of a binary win/lose outcome allows for more granularity in the results. For example, 9 of the 16 initial models tested solved 16.67% of the problems (5 of the 30 instances), but the points reveal more detail by spreading those models out across 12.5 to 16.7 points.
Citation
If you use EsoBench in your research, please cite:
@misc{esobench,
author = {CaseysEvals},
title = {EsoBench: Learning a Private Esolang Through Exploration},
year = {2025},
url = {https://caseys-evals.com/esobench},
}