A Browser Game Mostly Built By A Long Running AI Agent Over 4 Days

How a nostalgia driven point-and-click experiment became a focused AI agent workflow. Codex carried almost all of the execution autonomously inside one long running goal: research, build work, docs, validation loops and trailer production, with one proof target and no grand claims before real players have their say.

Part 1 of 2 | Updated 20 Jun 2026 | 8 min read
The Bureau of Obsolete Futures

The Brief

I started with a feeling I missed from old point-and-click adventures: walking into a strange room, clicking on everything and slowly learning how the world thinks.

The project became The Bureau of Obsolete Futures, a browser game about a junior archivist trying to decommission civic technology in a world where old machines still have paperwork to finish.

The first room is Archive Intake. The player has a simple goal: get into the staff only archive. The room makes that goal awkward through a badge reader, a missing stamp, a printer ribbon, a vending machine and paperwork with a grudge.

Two part series

Part 1: The Playable Slice

This first piece covers the game build: how a long running Codex goal turned a messy idea into a scoped browser prototype with validation gates. Part 2 follows the trailer workflow that kept running from the same project context while other work continued.

The main point

Almost All Of The Execution Was Agent Led

This was not a normal request-a-code-snippet build. A single Codex goal carried the work across 4 days of research, implementation, documentation, validation and promotion loops. My job was to keep the brief sharp, make the judgement calls and stop a playable slice being mistaken for proof from real players.

Agent led work

Research, implementation, documentation, validation checks and trailer workflow stayed inside one long running Codex goal.

Human control points

I set the goal, made taste calls, checked the claims and decided what still needed real player evidence.

Working pattern

The agent kept moving through research and validation loops over 4 days instead of resetting at every prompt.

The Goal Became The Filter

The useful shift was treating the goal as a decision filter, not a task list. Every tempting expansion had to pass the same test: does this help prove the first room, or does it only make the project look bigger?

What would prove this game idea is worth expanding?

What is the smallest playable slice that can test tone, puzzle fairness, interface clarity and art direction?

What claims are still blocked until real players create evidence?

The Trailer

The trailer uses real gameplay capture from the current vertical slice. It shows the room, the protagonist, the vending machine beat and the archive access payoff without pretending the game is bigger than one tested room.

It also became a validation surface. If the UI was unclear, the trailer would show it. If the puzzle state had no payoff, the trailer would show that too. The public artefact had to stay tied to the actual build.

The Risk Was Sprawl

A nostalgic game carries a few traps. It can copy the surface of its references. It can turn into an unfinished engine. It can pile jokes on top of weak puzzle logic. It can look polished enough to fool the builder before anyone else has played it.

The useful goal was deliberately narrow: build one playable browser room that tests tone, puzzle fairness, interaction clarity and art direction before planning anything larger.

No second room until the first room earns expansion.

No broad player claims before human validation exists.

No public playtest link while access policy stays private.

No nostalgia shortcut that only works because of reference material.

Turning Nostalgia Into Rules

The references mattered, but as production discipline rather than surface material. The project needed safe experimentation, dense authored interactions, readable staging and absurd logic that feels fair after the answer lands.

Codex helped turn those preferences into project rules: influence guardrails, tone notes, puzzle contracts, accessibility checks, deployment notes and public positioning. That sounds excessive for one room. It was the opposite. It kept the work from drifting into vibes.

Archive Intake Became The Test

The first room tests the whole game promise through one compact puzzle chain. The route starts with a locked archive and ends with a payoff that only works if the Bureau logic has made sense along the way.

1. Staff only archive door

2. Badge reader refuses the unstamped badge

3. Printer needs ribbon before paperwork moves

4. Vending machine argues about category logic

5. Stamped paperwork opens the path forward
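
One way to picture the chain above is as a handful of state flags with a guard on each interaction. The sketch below is a minimal illustration in plain TypeScript, not the game's actual code. The flag names, action names, feedback lines and the exact dependency order (ribbon, then printed paperwork, then stamp, then badge reader) are all assumptions made for the example, and Phaser is left out entirely.

```typescript
// Hypothetical flags for Archive Intake; the real build's names are not shown here.
interface RoomFlags {
  hasRibbon: boolean;
  paperworkPrinted: boolean;
  paperworkStamped: boolean;
  archiveOpen: boolean;
}

// Hypothetical player actions, one per obstacle in the chain.
type RoomAction =
  | "settleVendingDispute"
  | "printPaperwork"
  | "stampPaperwork"
  | "useBadgeReader";

interface ActionResult {
  flags: RoomFlags;
  message: string;
}

const startingFlags: RoomFlags = {
  hasRibbon: false,
  paperworkPrinted: false,
  paperworkStamped: false,
  archiveOpen: false,
};

// Pure function: returns new flags plus a feedback line and never mutates its input.
// A failed guard returns the flags unchanged, so a wrong turn costs nothing.
function applyAction(flags: RoomFlags, action: RoomAction): ActionResult {
  switch (action) {
    case "settleVendingDispute":
      return { flags: { ...flags, hasRibbon: true }, message: "The vending machine releases a ribbon." };
    case "printPaperwork":
      if (!flags.hasRibbon) {
        return { flags, message: "The printer needs a ribbon." };
      }
      return { flags: { ...flags, paperworkPrinted: true }, message: "The paperwork prints." };
    case "stampPaperwork":
      if (!flags.paperworkPrinted) {
        return { flags, message: "There is nothing to stamp yet." };
      }
      return { flags: { ...flags, paperworkStamped: true }, message: "The paperwork is stamped." };
    case "useBadgeReader":
      if (!flags.paperworkStamped) {
        return { flags, message: "The badge reader refuses the unstamped badge." };
      }
      return { flags: { ...flags, archiveOpen: true }, message: "Archive access granted." };
  }
}

// Replays a list of actions from the starting state and returns the final flags.
function playRoute(route: RoomAction[]): RoomFlags {
  let flags = startingFlags;
  for (const action of route) {
    flags = applyAction(flags, action).flags;
  }
  return flags;
}
```

Calling `playRoute(["settleVendingDispute", "printPaperwork", "stampPaperwork", "useBadgeReader"])` ends with `archiveOpen` set, while trying the badge reader first leaves the state untouched and only returns the refusal message. Keeping the guards in one pure function is a design choice for the sketch: it makes the chain easy to test without a running scene.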

Screenshots From The Slice

Archive Intake gameplay screenshot with visible hotspot labels, inventory slots and the current objective panel.

Hotspots visible

The prototype shows the browser UI, labelled hotspots, inventory and objective panel in one view.

Gameplay screenshot of the vending machine category dispute with dialogue options for classifying printer ribbon.

Vending machine logic

The vending machine puzzle tests whether absurd Bureau logic still reads as fair interaction design.

Gameplay screenshot showing archive access granted after the badge reader accepts the completed paperwork chain.

Archive access payoff

The first puzzle arc resolves with archive access granted and updated case notes.

What The AI Agent Carried Across 4 Days

This was the unusual bit. Almost all of the execution ran through one long running Codex goal over 4 days: research, build work, documentation, validation checks and the trailer workflow. Not a prompt, an answer and a handover. A working loop.

Codex kept the notes, game logic, copy, validation scripts, deployment decisions and trailer plan in the same project context. The autonomous part mattered because the agent could keep moving across disciplines. The human part still mattered because I set the goal, made the taste calls, checked the claims and decided what needed human evidence.

The visuals shown here are prototype/reference art from the AI assisted build, not final production art. Good enough to test the room. Not a reason to pretend the game is finished.

Autonomous Loop

Codex planned, edited, checked and reported back across one long running goal instead of waiting for a separate prompt for every task.

Research

Classic adventure games became influence guardrails, not content to copy.

Design

Archive Intake kept the proof small enough to test tone, puzzle fairness and interface clarity.

Build

Phaser and TypeScript handled the first room interactions, inventory, state flags and dialogue.

Validate

Checks covered docs, puzzle contracts, accessibility, deployment readiness and open evidence gaps.
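
The article does not show what a puzzle contract check looks like, so the following is only a sketch of one plausible form. It assumes each puzzle step declares the flags it requires and the flags it grants, then confirms that the goal flag can be reached and reports any step that can never fire. The `PuzzleStep` shape, the step ids and the flag names are hypothetical.

```typescript
// Hypothetical contract shape: what a step needs and what it unlocks.
interface PuzzleStep {
  id: string;
  requires: string[];
  grants: string[];
}

interface ContractReport {
  reachable: boolean; // true when the goal flag can be earned
  order: string[];    // one valid order in which steps can fire
  blocked: string[];  // steps whose requirements are never met
}

// Repeatedly fires every step whose requirements are satisfied until nothing changes.
// Flags are only ever added, so this simple fixed-point loop is enough for the sketch.
function checkContract(steps: PuzzleStep[], goal: string): ContractReport {
  const have: string[] = [];
  const order: string[] = [];
  let progressed = true;
  while (progressed) {
    progressed = false;
    for (const step of steps) {
      if (order.indexOf(step.id) !== -1) {
        continue; // already fired
      }
      if (step.requires.every((flag) => have.indexOf(flag) !== -1)) {
        for (const flag of step.grants) {
          if (have.indexOf(flag) === -1) {
            have.push(flag);
          }
        }
        order.push(step.id);
        progressed = true;
      }
    }
  }
  const blocked = steps
    .filter((step) => order.indexOf(step.id) === -1)
    .map((step) => step.id);
  return { reachable: have.indexOf(goal) !== -1, order, blocked };
}

// Assumed contract for Archive Intake, listed in the order the player meets each obstacle.
const archiveIntakeSteps: PuzzleStep[] = [
  { id: "badge-reader", requires: ["stampedPaperwork"], grants: ["archiveOpen"] },
  { id: "printer", requires: ["ribbon"], grants: ["paperwork"] },
  { id: "vending-machine", requires: [], grants: ["ribbon"] },
  { id: "stamp", requires: ["paperwork"], grants: ["stampedPaperwork"] },
];

const intakeReport = checkContract(archiveIntakeSteps, "archiveOpen");
console.log(intakeReport.reachable, intakeReport.order.join(" > "));
```

A check like this can only say the room is solvable on paper. Whether the route feels fair to someone playing it is a separate question that it cannot answer.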

Share

Real gameplay capture and Remotion turned the prototype into a trailer that stays honest about scope.

The Agentic Part

Codex moved autonomously across research, design, TypeScript, Phaser, accessibility checks, deployment hygiene, documentation, capture scripts and Remotion trailer work without losing the thread.

The Human Part

The work still needed taste, constraints and judgement. The AI moved fast. The human job was to keep the goal sharp and stop a working prototype being mistaken for player proof.

What This Proves

The game has a playable slice with defined puzzle logic, authored wrong turns, stateful interactions, a working browser build and a repeatable trailer workflow.

What It Does Not Prove Yet

Implemented is not the same as proven. Automated checks do not prove puzzle fairness, humour or comprehension. The next decision needs evidence from real people, not another internal pass.
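
The gap between implemented and proven can be made explicit in data. The sketch below is a hypothetical claims ledger, not something the article describes in the project: each claim names the kind of evidence it needs, and a claim stays blocked until that kind of evidence has been recorded. The type names and the example claims are illustrative assumptions.

```typescript
// Two kinds of evidence, assumed for this sketch.
type EvidenceKind = "automated-check" | "human-playtest";

interface Claim {
  text: string;
  needs: EvidenceKind;      // the kind of evidence that would prove the claim
  evidence: EvidenceKind[]; // the kinds of evidence collected so far
}

// A claim counts as proven only when the evidence it needs has actually been recorded.
function isProven(claim: Claim): boolean {
  return claim.evidence.indexOf(claim.needs) !== -1;
}

// Claims that must not appear in public copy yet.
function blockedClaims(claims: Claim[]): Claim[] {
  return claims.filter((claim) => !isProven(claim));
}

// Illustrative entries only; the wording is not taken from the project's docs.
const sliceClaims: Claim[] = [
  {
    text: "The puzzle chain can be completed in the browser build",
    needs: "automated-check",
    evidence: ["automated-check"],
  },
  {
    text: "The vending machine puzzle reads as fair",
    needs: "human-playtest",
    evidence: ["automated-check"], // passing checks do not count towards this claim
  },
];

for (const claim of blockedClaims(sliceClaims)) {
  console.log("Blocked until " + claim.needs + ": " + claim.text);
}
```

In this example the fairness claim stays blocked however many automated checks pass, because only a recorded human playtest satisfies it.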

Want To Join The Closed Test?

I am looking for a small number of people who care about puzzle design, retro computing, accessibility or practical AI workflows. If you want to test the Archive Intake slice, get in touch and I will share the private route when the next testing round is ready.

The Useful Lesson

The interesting part is not that AI helped write code. It is that a long running Codex goal let an AI agent carry most of the execution across research, design, implementation, validation, deployment and promotion without losing the thread. The human job stayed the same: judgement, taste, boundaries and evidence.

Got An AI Prototype That Needs To Grow Up?

Digital Boop helps turn messy ideas into focused prototypes, validation plans and handover ready workflows. No innovation theatre. No pretending a demo is production.
