Autoresearch and Its Use in Autonomous Experimentation

AI tools are increasingly capable of writing code, analysing data, and assisting with development tasks. But what happens when AI systems are allowed not just to generate solutions, but to test, evaluate, and improve them automatically, without a human driving each step?
A small open-source project called Autoresearch, created by AI researcher Andrej Karpathy, provides an early glimpse of this idea. The GitHub repository describes a framework and tool for automating machine learning experiments, but its real significance is much broader: it demonstrates a clean and simple architecture for autonomous experimentation loops, where AI agents repeatedly propose improvements, run experiments, and keep only the changes that work.
In other words, it's a practical example of AI iteratively improving things, including other AI systems, without direct human intervention at each step.
What Is Autoresearch?
Autoresearch is a minimal open-source project that lets an AI agent run machine learning experiments automatically. Instead of the traditional approach that involves a human researcher editing code, running experiments, and analysing results, the agent handles those steps itself. It modifies training code, runs a short experiment, measures performance, keeps the change if it helped, and discards it if it didn't. Then it starts again.
The process runs continuously, allowing the agent to explore model configurations and training approaches far faster than any human researcher could manage alone.
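That keep-what-works cycle can be sketched in a few lines of Python. This is a toy simulation, not code from the Autoresearch repository: `run_experiment` here scores a made-up configuration instead of training a model, and `propose_change` stands in for the agent editing the training script.

```python
import math
import random

def propose_change(config):
    """Stand-in for the agent's edit: perturb one hyperparameter."""
    candidate = dict(config)
    key = random.choice(list(candidate))
    candidate[key] *= random.uniform(0.5, 2.0)
    return candidate

def run_experiment(config):
    """Stand-in for a short training run: an invented score that
    peaks at lr=0.1 and width=256 (higher is better)."""
    return -abs(math.log(config["lr"] / 0.1)) - abs(math.log(config["width"] / 256))

def research_loop(config, iterations=200):
    """Propose, measure, keep the change only if it helped, repeat."""
    best_score = run_experiment(config)
    for _ in range(iterations):
        candidate = propose_change(config)
        score = run_experiment(candidate)
        if score > best_score:  # discard changes that didn't help
            config, best_score = candidate, score
    return config, best_score

random.seed(0)
best, final_score = research_loop({"lr": 1.0, "width": 64})
```

Because only strict improvements are kept, the score can never get worse across iterations, which is what makes it safe to leave the loop running unattended.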
How It's Built
One reason Autoresearch has been so widely discussed is its simplicity. The repository is intentionally small and designed so an AI agent can interact with it easily. Three components do most of the work.
The research environment (prepare.py and supporting utilities) handles the stable infrastructure: data preparation, tokenisation, evaluation metrics, and training utilities. These remain mostly fixed so that the experimentation context stays consistent across runs.
The training script (train.py) is where the agent operates. This is the component that the agent is permitted to modify: model architecture, hyperparameters, optimisation strategies, training configuration. Each modification becomes a new experiment.
The research instructions (RESEARCH.md) are written by a human researcher in plain language — high-level guidance describing what the agent should try to improve and how success is measured. The human sets the direction so that the agent can do the work.
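The division of labour between the fixed environment and the agent's edit surface can be sketched as a small harness. This is an illustrative sketch, not the repository's actual code: it assumes the agent may only rewrite `train.py`, and that the script reports its result by printing a single validation metric, which the fixed evaluation code reads back.

```python
import pathlib
import subprocess
import sys

WORKDIR = pathlib.Path("sandbox")
WORKDIR.mkdir(exist_ok=True)

# The agent's entire edit surface: one file it may rewrite each iteration.
# (The script body here is a placeholder for agent-generated training code.)
(WORKDIR / "train.py").write_text(
    "# candidate training script written by the agent\n"
    "print(0.42)  # stand-in for a measured validation metric\n"
)

def evaluate() -> float:
    """Fixed infrastructure: run the current train.py, read back its metric."""
    result = subprocess.run(
        [sys.executable, str(WORKDIR / "train.py")],
        capture_output=True, text=True, timeout=300,
    )
    return float(result.stdout.strip())

metric = evaluate()
```

Keeping the evaluation outside the file the agent edits is what keeps the score trustworthy: the agent can change how the model trains, but not how success is measured.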
The Loop
Autoresearch operates through a feedback cycle that mirrors the scientific method: hypothesise, experiment, measure, iterate. The agent proposes a change, the system runs the experiment, results are evaluated against a defined metric, and the cycle repeats. Because each experiment is short (around five minutes), this loop can run hundreds of times in a single day.
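The "hundreds of times" figure follows directly from the arithmetic, assuming five-minute experiments run back to back:

```python
# A day of back-to-back five-minute experiments.
minutes_per_day = 24 * 60          # 1440 minutes
experiment_minutes = 5
runs_per_day = minutes_per_day // experiment_minutes  # 288 iterations
```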
The difference from traditional research isn't the method; it's the speed, and the absence of a human in the loop for every iteration.
This pattern will be familiar to anyone who has followed the rise of the Ralph Wiggum Loop, a technique coined by developer Geoffrey Huntley earlier this year (2026). Ralph, named after the Simpsons character, is, at its core, a bash loop that feeds an AI agent's output back into itself, errors and all, until it arrives at the correct answer. The key thing to note is that progress doesn't live in the model's context window; it lives in files and version history. Each new iteration starts fresh, picks up where the last one left off, and keeps going until the job is done.
Autoresearch follows the same fundamental architecture: a loop that runs autonomously, persists state outside the model, and keeps only what works. The primary difference is one of direction and guidance. Ralph loops are steered by human-defined acceptance criteria: did the agent build what was asked? Autoresearch uses an automated performance score: did the model actually get better? That shift moves Autoresearch one step further from human involvement; the loop isn't just building software autonomously, it's improving the AI itself.
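The two steering mechanisms reduce to two different gate functions. The names and signatures below are illustrative, not taken from either project:

```python
def ralph_gate(tests_passed: bool, spec_met: bool) -> bool:
    """Ralph-style loop: stop when human-defined acceptance criteria hold."""
    return tests_passed and spec_met

def autoresearch_gate(new_score: float, best_score: float) -> bool:
    """Autoresearch-style loop: keep a change only if the metric improved."""
    return new_score > best_score
```

The first gate needs a human to have written the criteria up front; the second needs only a number the system can compute itself, which is what removes the human from each iteration.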
The Bigger Idea
Although Autoresearch focuses on machine learning, the architecture it demonstrates is much more widely applicable. It's a general pattern for autonomous optimisation. For it to work, three things need to be true:
1. The system can be modified programmatically
2. Performance can be measured against a clear objective
3. Experiments can run repeatedly
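The three conditions map naturally onto a minimal interface that any optimisable system could implement. The sketch below is hypothetical (none of these names come from the Autoresearch repository) and uses a deliberately non-ML example, tuning a cache size against an invented scoring function:

```python
from dataclasses import dataclass
from typing import Protocol

class Optimisable(Protocol):
    def modify(self, change: dict) -> None: ...  # 1. modifiable programmatically
    def score(self) -> float: ...                # 2. measurable objective
    # 3. "experiments can run repeatedly" = calling score() is cheap

@dataclass
class CacheConfig:
    """Toy non-ML target: pick a cache size under a made-up objective."""
    size_mb: int = 64

    def modify(self, change: dict) -> None:
        self.size_mb = change.get("size_mb", self.size_mb)

    def score(self) -> float:
        # Invented objective: returns improve up to 512 MB, then flatten.
        return min(self.size_mb, 512) / 512

# The same keep-if-better loop, applied outside machine learning.
best = CacheConfig()
best_score = best.score()
for size in (128, 256, 512, 1024):
    candidate = CacheConfig(size_mb=size)
    if candidate.score() > best_score:
        best, best_score = candidate, candidate.score()
```

Nothing in the loop body knows it is tuning a cache rather than a neural network; that is the sense in which the pattern generalises.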
When those conditions exist, and they do in many domains beyond ML, an AI agent can continuously explore variations and retain only what works.
This means the same loop could run over interface layouts and onboarding flows to improve conversion rates, over infrastructure configurations and caching strategies to reduce latency, or over operational workflows and scheduling systems to cut costs. Any environment where experiments are cheap and outcomes are measurable is a candidate.
A Shift in What Humans Do
If systems like this become more widespread, the role of human experts doesn't disappear; it shifts. Rather than running experiments themselves, humans would focus on defining the objective, designing the experimental environment, and setting the constraints. The agent handles the exploration.
In this model, human expertise moves up a level of abstraction. Instead of doing the experiments, you design the system that does them.
Why It Matters
Autoresearch is intentionally small. It is not a production platform or a finished research tool. But it demonstrates something important: AI systems can participate directly in iterative improvement processes, not just assist with individual tasks.
If that concept scales, it could change how we approach optimisation and research across a wide range of fields. The interesting question is no longer whether AI can help with experiments. It's what happens when AI is running the lab.
References: GitHub – Autoresearch · kingy.ai · philschmid.de · VentureBeat · Geoffrey Huntley – Everything is a Ralph Loop

Sanjay Dandeker
Principal Consultant