What we learned building a sandbox for document agents

Cross-posted from the Raycaster blog; I’ve spent the last several months building this, and here is my take.

2025 brought us a new idiom for building AI agents, beginning with Manus and Claude Code: give the model tools to operate a computer. This breaks from the previous default, represented by ChatGPT: an LLM plus a menu of bespoke API connections that plug into various systems of record.

The limits of plaintext

Our first attempt at a document agent was to ingest documents, parse them into plaintext pages, expose search/read/write tools, and let the LLM operate over virtual directories of artifacts and pages backed by a SQL database.
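
To make that shape concrete, here is a minimal sketch of such a plaintext page store. It is an illustration only, not our actual schema: the `pages` table, the function names, and the use of SQLite are all assumptions made for the example.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory page store (hypothetical schema, for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pages ("
        " doc TEXT, page INTEGER, body TEXT,"
        " PRIMARY KEY (doc, page))"
    )
    return conn


def write_page(conn: sqlite3.Connection, doc: str, page: int, body: str) -> None:
    """Write tool: store or overwrite the plaintext of one parsed page."""
    conn.execute(
        "INSERT OR REPLACE INTO pages (doc, page, body) VALUES (?, ?, ?)",
        (doc, page, body),
    )


def read_page(conn: sqlite3.Connection, doc: str, page: int):
    """Read tool: return the plaintext of one page, or None if it is missing."""
    row = conn.execute(
        "SELECT body FROM pages WHERE doc = ? AND page = ?", (doc, page)
    ).fetchone()
    return row[0] if row else None


def search(conn: sqlite3.Connection, query: str):
    """Search tool: naive substring match, returning (doc, page) pairs."""
    rows = conn.execute(
        "SELECT doc, page FROM pages WHERE body LIKE ? ORDER BY doc, page",
        (f"%{query}%",),
    )
    return rows.fetchall()


conn = open_store()
write_page(conn, "Report.docx", 1, "Executive summary")
write_page(conn, "Report.docx", 2, "Revenue grew in Q3; see the chart below.")
hits = search(conn, "revenue")
```

Note what the second page says: "see the chart below". The chart itself never made it into the store, which is exactly the limitation described next.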

This worked immediately for agentic navigation and observational tasks, but quickly fell apart on multimodality and editing. Users bring complex Word and PDF files with tables and charts, for which even frontier VLM-based OCR approaches can only do so much: reading a textual description of a chart isn’t the same as seeing the chart. Subtle layout and formatting, such as text coloring, is often lost in the process as well. While perfectly adequate for a corporate knowledge base, our setup could not realistically ingest a rich-format Word document, reason over and edit it, then spit out a usable revision as a deliverable.

Giving agents a real computer

Our next iteration came with the realization that coding is a general ability of an LLM, not a specialty feature. Coding to an LLM is what a pair of deft hands is to a person: it’s just an advanced form of computer use. In practice this shows up everywhere as the Claude Code idiom spreads and frontier models are trained more heavily on coding. Modern LLMs reach for pandoc to convert a Word document into something they can more easily read, unpack a .docx’s underlying data to access an embedded image, and absolutely want Python on hand to pull numbers from the third sheet of an Excel report at a moment’s notice.
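
As an illustration of the "unpack a .docx" move, here is the kind of throwaway script an agent might write. A .docx is a ZIP package, and embedded images typically live under `word/media/`. The helper names and the toy document built in memory are assumptions for the example; a real file produced by Word contains many more parts.

```python
import io
import zipfile


def list_embedded_media(docx_bytes: bytes) -> list[str]:
    """List archive members under word/media/, where images usually live."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return sorted(
            name for name in archive.namelist() if name.startswith("word/media/")
        )


def read_part(docx_bytes: bytes, name: str) -> bytes:
    """Return the raw bytes of one part inside the package."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.read(name)


def make_toy_docx() -> bytes:
    """Build a stand-in package in memory; not a valid Word document."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", "<w:document/>")
        archive.writestr("word/media/image1.png", b"\x89PNG placeholder bytes")
    return buffer.getvalue()


docx = make_toy_docx()
media = list_embedded_media(docx)
first_image = read_part(docx, media[0])
```

With a plaintext-only pipeline the image is gone by the time the model sees the document. With a shell and Python on hand, the model can pull the bytes out and look at them.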

So we ended up giving agents a real computer: Linux machines on Daytona with useful utilities pre-installed and users’ documents mounted from S3 (commonly done with FUSE-based solutions, a pattern AWS has since blessed natively with S3 Files). Once the LLM has Bash and a rich toolchain, it starts to handle complex knowledge work through interleaved reasoning and coding.

Agent screenshots a page it just edited to confirm its layout.

The need for a sandbox

Combined with good document skills (such as Anthropic’s), this solved the capability problem. But the AI isn’t me. When I look at the changes an LLM proposes to a document, more often than not I’ll disagree: they are usually on point but almost never exactly what I would have written. Conventional wisdom is to gate write-capable tools behind human approval, but once the human is confirming individual tool calls, the promise of autonomy is gone.

Coding agents already show us where this ends: power users eventually switch to the dangerously-skip-permissions camp for the benefit of full autonomy. Not because developers are reckless, but because the agent is working inside a git branch where bad changes can be identified and thrown away, often in a dedicated sandbox.

Most agent sandbox discussions are about security in the virtualization layer — containers, network restrictions and syscall filters — mechanisms that stop a rogue agent from escaping and rooting the host. For us, we mean it almost in a literal sense: a carefully laid-out work surface ready for a child to explore, make a mess, build a sand castle, have another child (an adversarial reviewer) knock it down, and try again, with us watching and deciding what gets to leave the sandbox. In short, sandboxing is a situation we set up where the AI is allowed to be wrong.

The agent’s filesystem cannot be the user’s filesystem; there needs to be a translation layer between them. When the agent edits /workspace/Report.docx, it should not touch the user’s canonical Report.docx. Rather, it’s writing a candidate revision that can be reviewed.
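
A minimal sketch of that translation layer, assuming a simple overlay where agent writes land in a staging area and only an explicit approval promotes them. The `StagedWorkspace` class and its method names are hypothetical, and the in-memory dictionaries stand in for real storage.

```python
class StagedWorkspace:
    """Overlay that keeps agent writes separate from the user's canonical files."""

    def __init__(self, canonical: dict[str, bytes]):
        self._canonical = canonical           # user's committed files
        self._staged: dict[str, bytes] = {}   # candidate revisions from the agent

    def read(self, path: str) -> bytes:
        """The agent sees its own staged revision first, else the committed file."""
        if path in self._staged:
            return self._staged[path]
        return self._canonical[path]

    def write(self, path: str, data: bytes) -> None:
        """Agent writes never touch the canonical store."""
        self._staged[path] = data

    def pending(self) -> list[str]:
        """Paths with a candidate revision awaiting review."""
        return sorted(self._staged)

    def approve(self, path: str) -> None:
        """A reviewer promotes the candidate to the canonical store."""
        self._canonical[path] = self._staged.pop(path)

    def reject(self, path: str) -> None:
        """A reviewer discards the candidate; the canonical file is unchanged."""
        self._staged.pop(path, None)


user_files = {"Report.docx": b"original bytes"}
workspace = StagedWorkspace(user_files)
workspace.write("Report.docx", b"agent's revision")
```

After the write, the agent reads its own revision back while `user_files` still holds the original bytes. Nothing leaves the sandbox until `approve` is called.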

For coding agents, this is easy. Source code is plaintext, so an LLM reads and produces patches directly. Git branches and pull requests are well understood by users, who are often trained software engineers. Documents are different. A .docx is a ZIP archive of XML parts that LLMs can’t read directly, while LLM-friendly views of it are derived and multi-faceted: parsed markdown pages, chunks of text returned by semantic search, extracted tables and figures. When Report.docx is manipulated into a new revision, all the derived views need to be recomputed from the newly staged revision for the agent’s view to stay consistent, while the previous (committed) revision’s data stays visible to other consumers, such as users looking at it through the UI or another agent working on something else.

Nothing off the shelf does this, so we ended up building it: a chain of patches like git commits, one per agent turn, replayable on top of the user’s committed source of truth. Parsed artifacts and vector indices are keyed by content hash, so a candidate (staged) revision’s derived state lives alongside the committed revision’s without either clobbering the other.
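
The two ideas can be sketched together as follows. This is a simplified illustration under stated assumptions, not our implementation: each `Patch` replaces the whole file rather than carrying a diff, SHA-256 is an assumed choice of hash, and the `parse` callable stands in for real parsing and embedding.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


def content_hash(data: bytes) -> str:
    """Key derived state by the bytes it was derived from."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class Patch:
    turn: int          # agent turn that produced this patch
    path: str
    new_bytes: bytes   # whole-file replacement, for simplicity


def replay(committed: dict[str, bytes], patches: list[Patch]) -> dict[str, bytes]:
    """Apply patches in turn order on a copy; the committed state is untouched."""
    staged = dict(committed)
    for patch in sorted(patches, key=lambda p: p.turn):
        staged[patch.path] = patch.new_bytes
    return staged


class DerivedCache:
    """Parsed artifacts keyed by content hash, so revisions coexist."""

    def __init__(self, parse: Callable[[bytes], str]):
        self._parse = parse
        self._by_hash: dict[str, str] = {}

    def get(self, data: bytes) -> str:
        key = content_hash(data)
        if key not in self._by_hash:
            self._by_hash[key] = self._parse(data)
        return self._by_hash[key]

    def __len__(self) -> int:
        return len(self._by_hash)


committed = {"Report.docx": b"draft"}
patches = [
    Patch(turn=2, path="Report.docx", new_bytes=b"draft, revised twice"),
    Patch(turn=1, path="Report.docx", new_bytes=b"draft, revised"),
]
staged = replay(committed, patches)

cache = DerivedCache(parse=lambda data: data.decode().upper())
committed_view = cache.get(committed["Report.docx"])
staged_view = cache.get(staged["Report.docx"])
```

Because the cache key depends only on content, the committed and staged revisions each get their own entry, and reverting to earlier bytes reuses the existing entry instead of recomputing it.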

Document version control: current and staged revisions of Report.docx each hold source bytes, parsed markdown, and chunk embeddings. A staging journal of file operations sits underneath, capturing changes from the LLM-driven sandbox.

What we need now is GitHub

While software engineers complain that traditional git infrastructure like GitHub isn’t keeping up with the volume of LLM-produced code and warrants a rethink, the world of document-heavy general knowledge work, on the other hand, hasn’t even gotten its GitHub yet.

It’s tempting to look at adoption of AI in knowledge work and think the bottleneck is plumbing — more tools, more integrations, eventually a complete-enough toolbelt that the agent can do a human’s work as a 1:1 replacement. We started by thinking we were building Cursor for knowledge work, only to realize what needs to be built is GitHub — a refactor of how high-stakes document work is packaged, represented, and wired into a CI system for semantic review.

This is the future we’re betting on, and I’m curious to learn what you think.
