"Can a general-purpose autonomous agent operate reliably and improve over time inside the constraints of a browser environment, using only what's publicly available on the internet as its toolbox?"
I tried to do it the unconventional way: no private APIs, no special infrastructure, just a browser extension and whatever the public web exposes. The tenth serious test ran into a failure that forced a full redesign. If you don't mind a little rant, read along.
The Number
The early architecture was simple: one AI session per goal, looping up to 35 steps. The model carried the entire task context: page state, history, scratchpad, tool patterns, everything. It worked fine for small tasks.
Then an eventful goal burned 583,000 tokens in a single run, and of course it failed.
To understand why that number matters, you have to understand what the context window was actually carrying. Every step of the loop appended to it: the current page state, the full action history since step one, a running scratchpad of notes the agent had written to itself, known tool patterns for the sites it had visited, memories surfaced from previous goals. By step 20 the model wasn't reasoning about the current page. It was reasoning about the current page plus the entire documented history of everything it had tried before getting there.
The longer the run, the worse the reasoning got. The context wasn't helping anymore; it was just noise the model had to filter through before it could think.
The Rebuild
I applied a modest version of Meta's continual learning design: context isolation.
The core idea from continual learning research is that you don't need the full history of a task in active memory to perform well on the current step. What matters is having the right context for right now, not everything that led to it. Meta's approach to this in their research involves preventing catastrophic forgetting by keeping representations isolated rather than letting new information overwrite old. The modest version of that applied here: stop letting the context window grow unbounded, and give each unit of work its own clean session.
Now each subtask runs in its own AI session with strict limits. A worker only sees the current page map (hard capped at 3,000 characters), a small local scratchpad (4,000 characters max), and a few results from sibling workers doing parallel tasks. No giant historical context. No carrying forward everything that happened in step 3 when you're now at step 18.
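As a sketch, the budget above can be expressed as a hard truncation at context-build time. The names here (`WorkerContext`, `buildWorkerContext`) are illustrative, not the actual code; only the limits come from the design described above.

```typescript
// Per-worker context budget: a bounded snapshot, never a growing history.
interface WorkerContext {
  pageMap: string;          // current page map, hard-capped
  scratchpad: string;       // small local scratchpad
  siblingResults: string[]; // a few results from parallel workers
}

const PAGE_MAP_CAP = 3000;    // characters
const SCRATCHPAD_CAP = 4000;  // characters
const MAX_SIBLING_RESULTS = 3; // assumed value for "a few"

function buildWorkerContext(
  pageMap: string,
  scratchpad: string,
  siblingResults: string[],
): WorkerContext {
  return {
    // Truncate rather than carry forward: no matter how long the run,
    // the worker only ever sees this much.
    pageMap: pageMap.slice(0, PAGE_MAP_CAP),
    scratchpad: scratchpad.slice(0, SCRATCHPAD_CAP),
    siblingResults: siblingResults.slice(0, MAX_SIBLING_RESULTS),
  };
}
```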
The step limit also dropped from 35 to 25 per worker. Not because the agent needed fewer steps: a clean 25-step session with focused context outperforms a bloated 35-step one. Lesson learnt: a long-running agent can't rely on one expanding context window.
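The per-subtask loop under that limit might look like the sketch below. The shape is hypothetical; the only detail taken from the text is the 25-step cap and the rule that hitting it fails the worker, not the whole goal.

```typescript
// Each subtask runs in a fresh session with a hard step cap.
const MAX_STEPS_PER_WORKER = 25;

type StepResult = "continue" | "done" | "failed";

function runSubtask(step: (n: number) => StepResult): StepResult {
  for (let n = 1; n <= MAX_STEPS_PER_WORKER; n++) {
    const result = step(n); // one model call + one browser action
    if (result !== "continue") return result;
  }
  // Step limit hit: this worker fails in isolation; nothing leaks
  // into sibling workers or the next subtask's session.
  return "failed";
}
```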
The Unexpected Benefit
Failures became much easier to debug because they stayed scoped.
In the single-session architecture, when something went wrong it was hard to know exactly where. The context was one long rope: a failure at step 22 had tendrils going back to decisions made at step 4. Untangling it meant reading through hundreds of steps of history.
With isolated sessions, a failure has an address. It happened in this worker, on this subtask, at this step, with this page state. The blast radius is contained. You fix the thing that broke, not the whole system.
That alone would have been worth the rebuild. The 583k token bill that came with it just made it urgent.
The Bug That Exposed All Of This
The bug itself was ironically simple: a GitHub signup.
The agent filled the form correctly. Navigated to github.com, found the signup form, typed in the username, email, and password fields. Clicked continue. Got to the next screen.
GitHub sent a verification code to the email address.
The agent saw a field asking for a code. Had no idea where to get it. It tried clicking around the page. Read the page content again. Navigated back and forward. Tried the same sequence of actions in a loop until it hit the step limit and gave up.
The head system spent too much time trying to find the solution within the current page because that was the only world it knew existed. Two tabs over, a Gmail session was open and authenticated. The verification email had arrived within seconds of the signup attempt. Had the agent had awareness of other authenticated contexts in the browser, it would have found that code in a flash.
It didn't, because the concept of "other tabs" simply wasn't part of its model of the world.
What the Fix Actually Unlocked
That eventually led to adding session awareness: scanning open tabs and authenticated services before each subtask starts.
Before the worker touches anything on the current page, it now checks every open tab, reads the cookies, and looks for authentication signatures. The result gets injected into the agent's working memory alongside the page state:
```
AUTHENTICATED SESSIONS:
• [tabId:4] mail.google.com
• [tabId:7] github.com
OTHER OPEN TABS:
• [tabId:2] notion.so
• [tabId:9] docs.google.com
```
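A sketch of how that summary could be assembled. In the real extension this would read `chrome.tabs` and cookies; here the tab list is mocked, and the heuristic (an auth cookie implies an authenticated session) is an assumption of mine, not a description of the actual detection logic.

```typescript
// Mocked pre-subtask scan: classify open tabs and render the summary
// that gets injected into the worker's context.
interface TabInfo {
  tabId: number;
  host: string;
  hasAuthCookie: boolean; // assumed proxy for "authenticated session"
}

function summarizeSessions(tabs: TabInfo[]): string {
  const authed = tabs.filter((t) => t.hasAuthCookie);
  const other = tabs.filter((t) => !t.hasAuthCookie);
  const line = (t: TabInfo) => `• [tabId:${t.tabId}] ${t.host}`;
  return [
    "AUTHENTICATED SESSIONS:",
    ...authed.map(line),
    "OTHER OPEN TABS:",
    ...other.map(line),
  ].join("\n");
}
```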
Now when it hits the verification step, it knows tab 4 has an authenticated Gmail session. It switches, reads the inbox, finds the code, switches back, continues. The whole sequence runs without intervention.
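The last step of that sequence, once the worker has switched to the Gmail tab and read the newest message, is a simple pattern match. The 6-to-8-digit heuristic below is my assumption for illustration, not GitHub's actual code format.

```typescript
// Pull a numeric verification code out of an email body, if present.
function extractVerificationCode(emailBody: string): string | null {
  const match = emailBody.match(/\b(\d{6,8})\b/);
  return match ? match[1] : null;
}
```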
That one fix ended up unlocking things way beyond GitHub signups: verification flows, two-factor authentication, confirmation links, cross-service tasks where the agent needs to pull information from one authenticated service to continue work in another. None of it was planned. All of it fell out of fixing one stuck signup form.
What Comes Next
Still publicly experimenting. I definitely have more failures on the way.
The architecture has moved significantly since this. There are now parallel workers that can spawn their own sub-workers, a conductor layer that mediates between them, and an overseer that decides how to decompose goals before any worker touches a browser. That's the next post.
But the two lessons from this one have stayed constant through every rebuild since:
A long-running agent needs isolated context, not an expanding one. And a browser agent needs a model of its session, not just its current page.
Everything else is downstream of those two things.
Next: why there's an Overseer, a Conductor, and Workers, and what that separation actually buys you.
Follow the experiment at [buntybox.beehiiv.com]
