For a long time, we’ve largely viewed the relationship between large language models and code through a very narrow lens. We treated code as a final product... an ‘artifact’ generated by the model to solve a specific, isolated problem. You enter a prompt, the model calculates the most probable tokens, and it outputs a script.
But a new paper from researchers at Meta, Stanford and UIUC suggests something much more profound is happening.
In their view, code is no longer just something AI generates. It is becoming the foundation that agents use to make decisions, perform tasks, remember previous interactions and manage work across multiple stages.
How code is changing in AI agents
The role of code in AI agents is changing. Previously, code was typically the final output of a language model. Now, researchers view code as the system that enables agents to do work.
This system is known as an ‘agent harness’. It combines a language model with tools, APIs, memory, execution processes and access controls, allowing agents to work over long periods of time instead of stopping after a single response.
Code is a good fit here because actions expressed as code can be run, checked and logged. An agent's actions, its progress and its incomplete work are all visible to the team. These properties make code a good foundation for agents that need memory, coordination, and reliable execution of tasks.
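To make the idea concrete, here is a minimal sketch of such a harness in Python. It is not the paper's implementation. The `Harness` class, the `scripted_model` stand-in for a language model, and the tool names are all hypothetical, chosen only to show a model wrapped with tools, an access-control allowlist, memory, and an action log.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Harness:
    """Hypothetical harness: a model plus tools, access control, memory, and a log."""
    model: Callable[[list[str]], tuple[str, str]]   # memory -> (tool name, argument)
    tools: dict[str, Callable[[str], str]]
    allowed: set[str]                               # tools the agent is permitted to call
    memory: list[str] = field(default_factory=list)
    log: list[dict] = field(default_factory=list)

    def step(self) -> str:
        tool, arg = self.model(self.memory)
        if tool not in self.allowed or tool not in self.tools:
            result = f"denied: {tool}"              # blocked calls are recorded, not run
        else:
            result = self.tools[tool](arg)
        self.log.append({"tool": tool, "arg": arg, "result": result})
        self.memory.append(f"{tool}({arg}) -> {result}")
        return result

    def run(self, max_steps: int) -> list[dict]:
        for _ in range(max_steps):
            if self.step() == "done":
                break
        return self.log


def scripted_model(memory: list[str]) -> tuple[str, str]:
    """Stand-in for a language model: picks the next action from a fixed plan."""
    plan = [("upper", "hello"), ("delete_all", "/"), ("finish", "")]
    return plan[min(len(memory), len(plan) - 1)]


harness = Harness(
    model=scripted_model,
    tools={"upper": str.upper, "delete_all": lambda path: "deleted", "finish": lambda _: "done"},
    allowed={"upper", "finish"},
)
trace = harness.run(max_steps=5)
```

After the run, `trace` holds one entry per action, including the `delete_all` call that the allowlist refused. Because every decision passes through the same `step` method, the log captures everything the agent attempted.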
Why this matters right now
Language models have become very good at writing code. But generating code is only one part of the challenge. In real-world systems, agents often need to complete long sequences of tasks... and frankly, mistakes can happen at any stage.
When an agent fails midway through a workflow, what matters is not only the error itself. Teams need to know what information was recorded, what context the agent could access, and what checks were performed before it moved to the next task. The paper argues that treating code as the system around the agent makes these questions easier to understand and evaluate. Instead of treating them as side issues, as was done previously, it places them at the center of how reliable agent systems are built and assessed.
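One way to picture those three questions is as fields on a per-step record. The sketch below is an illustration only. `StepRecord`, `first_failure`, `failure_report`, and the sample workflow are hypothetical names and data, not anything defined in the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StepRecord:
    """Hypothetical audit record for one step of a workflow."""
    step: int
    task: str
    context_keys: tuple[str, ...]      # what the agent could access at this step
    checks: dict[str, bool]            # check name -> did it pass?
    error: Optional[str] = None        # error captured by the harness, if any


def first_failure(records: list[StepRecord]) -> Optional[StepRecord]:
    """Return the earliest step that raised an error or failed a check."""
    for record in records:
        if record.error is not None or not all(record.checks.values()):
            return record
    return None


def failure_report(record: StepRecord) -> dict:
    """Summarize what was recorded, what was accessible, and which checks failed."""
    return {
        "step": record.step,
        "error": record.error,
        "context_available": list(record.context_keys),
        "failed_checks": sorted(k for k, ok in record.checks.items() if not ok),
    }


# Made-up example workflow: step 2 fails a test before the deploy step runs.
records = [
    StepRecord(1, "fetch schema", ("db_url",), {"schema_valid": True}),
    StepRecord(2, "write migration", ("db_url", "schema"),
               {"lint": True, "tests_pass": False},
               error="AssertionError in test_migration"),
    StepRecord(3, "deploy", ("schema", "migration"), {"smoke_test": True}),
]
report = failure_report(first_failure(records))
```

Here `report` points at step 2, lists the context the agent had at that point, and names `tests_pass` as the check that failed.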
The three layers of the architecture
When we look at how these autonomous systems are actually being constructed, the researchers map out a three-tiered hierarchy that moves from the immediate moment of action up to the coordination of many agents working together.
Layer 1: The Harness Interface
This is the boundary where the neural network touches reality, representing a fundamental shift in how an AI interacts with its world across three distinct dimensions.
First, through code for reasoning, the agent offloads complex logic to interpreters and symbolic solvers instead of trying to calculate everything within the fuzzy, probabilistic space of natural language. Using frameworks like Program-of-Thoughts or Chain of Code, the model proposes a precise procedure, and the rigid runtime executes it.
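A minimal sketch of this pattern, in the spirit of Program-of-Thoughts: the model's output is treated as a program, and the interpreter does the arithmetic. The `run_program` helper and the `model_output` string are hypothetical stand-ins. A real system would get the program text from a model call. Note also that limiting `__builtins__` is only a convenience here, not a security sandbox.

```python
def run_program(program: str) -> object:
    """Execute a model-proposed program and return the value it binds to `answer`."""
    # Assumption: the program is expected to assign its result to a variable named `answer`.
    namespace = {"__builtins__": {"range": range, "sum": sum, "len": len}}
    exec(program, namespace)
    return namespace["answer"]


# Stand-in for model output answering:
# "What is the sum of all integers below 1000 that are divisible by 3 or 5?"
model_output = """
total = 0
for n in range(1000):
    if n % 3 == 0 or n % 5 == 0:
        total += n
answer = total
"""

result = run_program(model_output)   # 233168
```

The model only has to get the procedure right. The loop over a thousand integers, where token-by-token arithmetic tends to slip, is handled by the runtime.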
Second, for code as action, the agent doesn’t merely describe what it wants to do in a narrative format; it writes the actual, executable programs required to move a robotic arm, navigate a graphical user interface, or call an external API. The gap between thinking and doing is bridged by code.
Finally, in terms of environment modeling, the history of the world, the current state of the system, and environmental feedback are no longer stored as loose, rambling text summaries. Instead, they are structured as executable entities like repositories, unit tests, simulation traces, and sandboxes, meaning the world itself effectively becomes code.
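As a rough illustration of actions and environment state as code, the sketch below records every action in a trace that can be replayed to rebuild the state, and expresses an expectation about the world as an executable check. The `Workspace` class, its two actions, and `check_config` are hypothetical and kept deliberately small.

```python
from dataclasses import dataclass, field


@dataclass
class Workspace:
    """Hypothetical environment: state is a file map, history is a replayable trace."""
    files: dict[str, str] = field(default_factory=dict)
    trace: list[tuple[str, tuple[str, ...]]] = field(default_factory=list)

    def write(self, path: str, text: str) -> None:
        self.files[path] = text

    def delete(self, path: str) -> None:
        self.files.pop(path, None)

    def act(self, name: str, *args: str) -> None:
        """Run a named action and append it to the trace (unknown names raise KeyError)."""
        actions = {"write": self.write, "delete": self.delete}
        actions[name](*args)
        self.trace.append((name, args))


def replay(trace: list[tuple[str, tuple[str, ...]]]) -> Workspace:
    """Rebuild the environment from scratch by re-executing a recorded trace."""
    workspace = Workspace()
    for name, args in trace:
        workspace.act(name, *args)
    return workspace


def check_config(workspace: Workspace) -> bool:
    """An executable statement about the world, instead of a prose summary."""
    return workspace.files.get("config.toml") == "debug = false"


ws = Workspace()
ws.act("write", "config.toml", "debug = true")
ws.act("write", "tmp.txt", "scratch")
ws.act("write", "config.toml", "debug = false")
ws.act("delete", "tmp.txt")
rebuilt = replay(ws.trace)
```

Replaying `ws.trace` yields a workspace with the same files, so the history is something a harness can re-run and test rather than a summary it has to trust.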
Layer 2: Harness Mechanisms
If the first layer is about the interface, the second layer is about the internal mechanisms that allow an agent to persist through time, addressing the deeper, systemic challenges of autonomy. It governs structured planning, which breaks massive, long-horizon objectives down into discrete, deterministic steps. It manages memory segmentation, creating a clean architectural separation between immediate working context, long-term learned knowledge and historical episodic experience. It handles deliberate tool use, connecting the model to external software ecosystems without overflowing the model's limited context window.
Most importantly, it establishes immediate feedback loops that capture runtime errors and failing unit tests, instantly translating them into clear corrective signals so the agent can self-correct and steer before small mistakes compound into catastrophic failure.
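The feedback loop described above can be sketched in a few lines. This is an assumption-laden illustration: `run_checks`, `repair_loop`, and `scripted_model` are hypothetical, the "model" is a stub that returns a fixed second attempt, and the unit checks are hard-coded for a `median` function. Running candidate code with `exec` like this would need real isolation in practice.

```python
from typing import Callable, Optional

FIRST_ATTEMPT = "def median(xs):\n    return sorted(xs)[len(xs) // 2]\n"
SECOND_ATTEMPT = (
    "def median(xs):\n"
    "    s = sorted(xs)\n"
    "    mid = len(s) // 2\n"
    "    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2\n"
)


def run_checks(source: str) -> Optional[str]:
    """Run candidate code against unit checks; return None on success, else an error message."""
    namespace: dict = {}
    try:
        exec(source, namespace)
        median = namespace["median"]
        assert median([3, 1, 2]) == 2, "median([3, 1, 2]) should be 2"
        assert median([4, 1, 3, 2]) == 2.5, "median([4, 1, 3, 2]) should be 2.5"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def repair_loop(propose: Callable[[Optional[str]], str], max_attempts: int = 3):
    """Ask for code, run the checks, and feed any failure back as the next prompt signal."""
    feedback: Optional[str] = None
    history: list[Optional[str]] = []
    for _ in range(max_attempts):
        source = propose(feedback)
        feedback = run_checks(source)
        history.append(feedback)
        if feedback is None:
            return source, history
    return None, history


def scripted_model(feedback: Optional[str]) -> str:
    """Stub model: a buggy first attempt, then a fix once it receives feedback."""
    return FIRST_ATTEMPT if feedback is None else SECOND_ATTEMPT


accepted, history = repair_loop(scripted_model)
```

The first attempt fails the even-length case, the failure message becomes the corrective signal, and the second attempt passes. The history of feedback is kept so the correction itself is inspectable.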
Layer 3: Scaling the Harness
This is where the engineering gets harder, dealing with what happens when a single agent is no longer enough and a group of agents is required to solve a problem. If multiple AI agents try to communicate purely in natural language, the system quickly degrades into a chaotic, noisy committee meeting.
This layer replaces that ambiguity with structural rigidity by turning shared code repositories, execution traces, and test suites into a centralized, concrete workspace. Independent agents - operating in highly specialized roles like Planner, Coder, Reviewer, and Tester - all read from and write to the exact same, inspectable program state.
This gives multi-agent coordination a verifiable ground truth and a shared reality that natural language alone cannot provide.
Practical examples of agent-based systems
The paper grounds its claims in working systems, not hypothetical ones.
- Software development: Agents evaluated on SWE-bench, and tools such as Claude Code, work against real repositories. Unit tests serve as objective ground truth; the harness reads results, feeds them back, and retries.
- GUI automation: Agents translate natural-language instructions into DOM-grounded executable actions. Browser and OS state becomes an inspectable artifact rather than a black box.
- Scientific research: Early systems write hypothesis-testing pipelines, run experiments, parse outputs, and revise their approach. The lab notebook becomes executable code.
- Robotics: Agents write Python policies that can be inspected, corrected, and accumulated into a reusable skill library. The Voyager system pioneered this in Minecraft; the approach has since moved to physical robots.
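The reusable skill library mentioned in the robotics example can be sketched as follows. This is a simplified illustration rather than how Voyager or any specific robot stack stores skills: `SkillLibrary`, its methods, and the two sample skills are hypothetical. The one idea it keeps is that a skill is only stored after it passes a check.

```python
from typing import Callable


class SkillLibrary:
    """Hypothetical store of verified skills, kept as source code keyed by function name."""

    def __init__(self) -> None:
        self._skills: dict[str, str] = {}

    def add(self, name: str, source: str, check: Callable[[Callable], bool]) -> bool:
        """Store the skill only if its code runs and passes the supplied check."""
        namespace: dict = {}
        try:
            exec(source, namespace)
            passed = bool(check(namespace[name]))
        except Exception:
            passed = False
        if passed:
            self._skills[name] = source
        return passed

    def load(self, name: str) -> Callable:
        """Re-create a stored skill as a callable."""
        namespace: dict = {}
        exec(self._skills[name], namespace)
        return namespace[name]

    def names(self) -> list[str]:
        return sorted(self._skills)


library = SkillLibrary()

# A skill that passes its check is kept.
kept = library.add(
    "clamp",
    "def clamp(value, low, high):\n    return max(low, min(high, value))\n",
    check=lambda f: f(5, 0, 3) == 3 and f(-1, 0, 3) == 0,
)

# A skill that fails its check is rejected: it always moves up, even when the target is below.
rejected = library.add(
    "step_toward",
    "def step_toward(pos, target):\n    return pos + 1\n",
    check=lambda f: f(5, 2) == 4,
)

clamp = library.load("clamp")
```

Because skills are stored as source, they can be read, corrected, and re-verified later, which is the property the inspectable-policy approach relies on.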
The problems that remain open
This is arguably the most useful section for anyone building real systems.
- Evaluation gaps: Most benchmarks only measure whether a task succeeds at the end. They miss whether the path was safe, whether intermediate states were valid, or whether the harness is degrading over time.
- Incomplete verification: Running code confirms only that it ran without errors. It does not confirm that the code does what was intended. Formal proof tools like Lean exist but don't yet scale to production agent tasks.
- Self-improvement without regression: An agent that rewrites its own harness can improve in one area while quietly breaking things that previously worked. No standard mechanism exists yet for evolving agent capabilities without introducing regressions.
- Multi-agent state conflicts: Ten agents reading from and writing to a shared repository will create conflicts. Git handles this for human-paced workflows. Agentic workflows need transactional semantics at machine speed.
- Human oversight at scale: As tasks grow longer, the number of agent decisions before any human review grows with them. "Approve every step" doesn't scale. "Check only the output" isn't safe. The harness is where that line needs to be drawn, but no one has drawn it well yet.
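To give the multi-agent conflict problem some shape, here is a minimal sketch of one possible form of transactional write: optimistic concurrency with version checks. It is one well-known technique, not a mechanism the paper prescribes, and `VersionedStore` and `append_line` are hypothetical names.

```python
import threading
from typing import Optional


class VersionedStore:
    """Hypothetical shared store: a write succeeds only if the writer saw the latest version."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._data: dict[str, tuple[int, Optional[str]]] = {}   # key -> (version, value)

    def read(self, key: str) -> tuple[int, Optional[str]]:
        with self._lock:
            return self._data.get(key, (0, None))

    def write(self, key: str, expected_version: int, value: str) -> bool:
        """Compare-and-swap: reject the write if someone else committed first."""
        with self._lock:
            version, _ = self._data.get(key, (0, None))
            if version != expected_version:
                return False
            self._data[key] = (version + 1, value)
            return True


def append_line(store: VersionedStore, key: str, line: str, max_retries: int = 10) -> bool:
    """Read, modify, and write back, retrying when the version check fails."""
    for _ in range(max_retries):
        version, text = store.read(key)
        if store.write(key, version, (text or "") + line + "\n"):
            return True
    return False


store = VersionedStore()
version_a, _ = store.read("notes.md")          # agent A reads version 0
version_b, _ = store.read("notes.md")          # agent B reads version 0
first = store.write("notes.md", version_a, "agent A\n")    # commits, version becomes 1
second = store.write("notes.md", version_b, "agent B\n")   # stale version, rejected
retried = append_line(store, "notes.md", "agent B")        # re-reads, then commits
```

Agent B's stale write is rejected instead of silently overwriting agent A's work, and the retry merges cleanly. Whether this style of conflict handling is enough for real agent workloads is exactly the open question the paper raises.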
What this means for people building agent systems
As AI agents take on longer and more complex tasks, model capability is only part of the equation. The design of the system around the model matters just as much. A reliable agent needs access to its previous work, clear feedback when something goes wrong, a safe environment for carrying out actions, and the ability to adjust its approach based on actual outcomes.
This is where code becomes important. Code makes actions visible, records progress and helps agents respond to real-world results rather than assumptions. These qualities help make agent systems more dependable and easier to manage over time.
GoML covers research and practice at the intersection of machine learning and production systems. Stay tuned for more on our website.
Link to the original paper: https://arxiv.org/html/2605.18747v1





