There is a tool you use for most of your working day. You open it first thing in the morning. You use it to communicate, to research, to manage, to create, to buy, to apply, to track. It is where most of your productive life actually happens.

You already know what it is.

Every tab you open, you opened. Every form you filled, you filled. Every search you ran, you ran. The browser is extraordinarily capable and almost entirely passive. It waits for you. It has always waited for you.

That passivity is worth scratching at a little, because the claim matters.

The Question to Answer

Here is the research question I'm spending my time on:

Can an autonomous agent operate intelligibly and reliably inside a browser environment: recognising failure, adapting its strategy, learning from the open web, and doing genuinely useful work, using only what's publicly available on the internet as its toolbox, without human intervention?

Not a script. Not a macro. Not a pre-programmed sequence of clicks that breaks the moment a button moves.

Something that has a goal, figures out the steps, handles the unexpected, and reports back when it's done.

The browser is the constraint. That's intentional. No private APIs, no special infrastructure, no back-end access. Just a browser, a goal, and the entire public internet as its toolbox.

The constraint is the point. If it works here in the most universal, accessible, tool-rich environment that exists, then it works everywhere that matters.

Why the Browser Specifically

Most serious attempts at AI automation reach for APIs first. Connect to the calendar API, the email API, the CRM API. This works, but it's fragile in a specific way: it only works for services that have APIs, that give you access, that don't change their endpoints, that don't rate-limit you out.

The browser doesn't have that problem. Every service that has ever built a website has already built a browser interface. The browser is the universal API. It's what every service exposes to the world by default.

There's something else too. The browser is where human judgment currently lives. When you do research, you're browsing. When you make decisions, you're reading pages, comparing options, filling forms. The cognitive work of the modern knowledge worker mostly happens inside a browser tab.

If an agent can do that work, not simulate it but actually do it, then you have something qualitatively different from any automation tool that came before.

The Experiment

The experiment is called Bunty Box.

Bunty is an autonomous browser agent, built as a browser extension, that receives goals in plain English and executes them without hand-holding. You send it a task from wherever you are, remotely. It opens tabs, browses, handles failures, and reports back when it's done, when it isn't, or when it genuinely can't continue without you.

The architecture has evolved significantly since it started. The current system has three layers:

An Overseer that receives goals and decides how to decompose them. A Conductor that manages a pool of workers and mediates information between them. And Workers: isolated AI sessions, each owning a browser tab, each responsible for one subtask.

They don't share a context window. They communicate through structured results. When one worker discovers something useful, the conductor decides whether another worker needs to know.

It's less like a single AI doing things and more like a small team that happens to move at machine speed, a pattern multi-agent systems have already shown to work in other domains.
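To make the three layers concrete, here is a minimal sketch of how goals could flow from Overseer to Conductor to Workers. Every name and structure below is an illustrative assumption, not the actual Bunty Box code; the point is only the shape of the message flow: workers keep private context and return structured results, and the conductor decides what to share.

```python
# Hypothetical sketch of the Overseer -> Conductor -> Worker flow.
from dataclasses import dataclass


@dataclass
class Result:
    subtask: str
    ok: bool
    findings: dict


class Worker:
    """An isolated session owning one browser tab and one subtask."""

    def __init__(self, subtask: str):
        self.subtask = subtask
        self.context: dict = {}  # private context; never shared directly

    def run(self) -> Result:
        # A real worker would drive its tab here; this stub just reports.
        return Result(self.subtask, ok=True, findings={"note": f"did {self.subtask}"})


class Conductor:
    """Manages the worker pool and mediates information between workers."""

    def dispatch(self, subtasks: list[str]) -> list[Result]:
        results: list[Result] = []
        for sub in subtasks:
            result = Worker(sub).run()
            results.append(result)
            # Here the conductor would decide whether a later worker
            # needs to see this worker's findings, and forward them.
        return results


class Overseer:
    """Receives a goal and decides how to decompose it."""

    def decompose(self, goal: str) -> list[str]:
        # Real decomposition would be model-driven; this is a placeholder.
        return [f"research: {goal}", f"execute: {goal}", f"verify: {goal}"]


results = Conductor().dispatch(Overseer().decompose("compare laptop prices"))
```

The key design property the sketch preserves is that workers never touch each other's context directly; everything moves through the conductor as structured results.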

What Makes This Hard

The honest answer is: almost everything.

The web is not designed to be operated programmatically by an agent. Pages change. Selectors break. Login flows have verification steps. Forms have validation logic that isn't documented anywhere. Services actively resist automation in ways both obvious and subtle.

And the agent has to handle all of that without asking you every five minutes.

The first version broke on a GitHub signup form. Not because the agent was stupid: it navigated correctly, found the form, filled in the fields. But then GitHub asked for an email verification code. The agent had no way to access an email inbox. It didn't know what to do. It got stuck.

That single failure rewrote the architecture. Now the system detects authenticated sessions in open tabs, can switch context to read an email, find the code, switch back, and continue. That capability came directly from a failure.

Most of the interesting design decisions in Bunty Box came from failures like that one.

The Philosophical Claim

There is a difference between automation and autonomy that I think gets collapsed too easily in most discussions about AI agents.

Automation is a script with a success path. It works when everything goes as expected and breaks when it doesn't. It has no model of failure. It doesn't know the difference between "this step took longer than usual" and "this step is fundamentally impossible in this context", and it has no way to ask "what should happen next?"

Autonomy is different. An autonomous system has a goal, a model of the world, and the ability to update that model when reality doesn't match expectation. It can recognise that a strategy isn't working, abandon it, and try something else. It can decide that a subtask is impossible and figure out whether the parent goal can still be achieved without it.

The gap between those two things is not a prompting problem. It's an architectural loop. How do you build a system where failure becomes information, compounds into continual learning, and leads to a considered next decision rather than a terminal state?
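That loop can be sketched in a few lines. The strategy names and the toy attempt function below are invented; the shape is what matters: each failure is recorded as an observation and fed into the next attempt, and exhausting every strategy produces a decision, not a crash.

```python
# Minimal sketch of failure-as-information: try strategies in turn,
# carrying forward what each failure taught, and end in a decision.


def pursue(goal: str, strategies: list[str], attempt) -> str:
    """Try strategies in order, feeding failures back as observations."""
    observations: list[str] = []
    for strategy in strategies:
        ok, note = attempt(strategy, observations)
        observations.append(note)  # failure becomes information
        if ok:
            return f"done via {strategy}"
    # Every strategy exhausted: report, don't terminate silently.
    return "blocked: needs human input (" + "; ".join(observations) + ")"


# Toy attempt function: the direct route fails, the fallback succeeds.
def attempt(strategy: str, observations: list[str]) -> tuple[bool, str]:
    if strategy == "fill_form_directly":
        return False, "form requires email verification"
    return True, "read code from authenticated mail tab"


outcome = pursue("sign up", ["fill_form_directly", "verify_via_mail_tab"], attempt)
```

Contrast this with a script: a script is the first branch only, and the `False` return is where it ends.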

That's the question this experiment is trying to answer in practice, not just in theory.

What This Is

This is a lab. Not a product launch, not a startup announcement. A public research log about what happens when you take the idea of browser autonomy seriously: not just building it, but actually putting it to work in the real world.

The architecture decisions and why they were made. The failures, especially the failures. The moments where the agent does something unplanned. The milestones where it starts to feel like something genuinely intelligent operating in a real-world context.

The goal isn't to build the perfect autonomous agent; Openclaw is already taking a shot at that. The goal is to find out what "good enough to be genuinely useful in a constrained environment" actually requires, and to document everything along the way.

If that sounds interesting, follow along. The next post is about the GitHub signup that broke everything and what the rebuild looked like.

Follow the experiment at [buntybox.beehiiv.com]
