
What Running an Autonomous AI Fleet Actually Taught Me About Orchestration

Design patterns for bounded, vendor-neutral multi-agent systems in production. Most teams think orchestration equals prompt chaining. That's plumbing. The hard problems are identity, mortality, trust, and economics.

An executor agent died mid-task at 3 AM.

It was 40% through a complex file operation — code written, tests half-run, nothing committed. In a naive system, this is catastrophic. You lose the work. You lose the context. You don't even know which task was in progress.

In our system, when the executor came back online four minutes later, it read its own last-known state from a persistent store, found the incomplete task still claimed in its name, and resumed. No human intervention. No lost work. The planner didn't even notice the gap — it just saw the result arrive late.

This didn't happen because we wrote good prompts. It happened because we treated AI orchestration as a distributed systems problem, not an AI problem. Every pattern that saved us — transactional task claiming, persistent agent state, time-bounded execution — came from decades of distributed computing literature. The AI part was almost incidental.


The Wrong Frame

The industry's current "multi-agent" conversation is stuck at the wrong layer of abstraction. Most tutorials show you how to make Agent A call Agent B. That's plumbing. The actual hard problems are:

Identity: How does an agent know what it was doing yesterday? How does it know what it's good at? What happens when two instances of the same agent exist?

Mortality: Agents crash. Sessions expire. Context windows fill up. Your architecture has to expect death as a normal condition, not an exception.

Trust: Not every task should be handled by the most expensive model. Not every agent should be allowed to execute destructive operations. Permission scoping isn't a feature — it's structural.

Economics: If you can't tell me what a specific task cost to execute, you don't have a production system. You have a demo.

The best architecture for coordinating autonomous AI agents was figured out in a French kitchen in the 1890s. Escoffier's brigade system solved every one of these problems: bounded roles (stations), structured communication (the ticket rail), mortality protocols (if a cook goes down, their tickets don't vanish), and ruthless economics (food cost percentage, tracked per plate).

The planner-worker-advisor topology we arrived at IS the brigade system. We didn't plan it that way. We recognized it after the fact. These four pillars — identity, mortality, trust, economics — became the organizing frame for everything that followed.


Architecture: The Topology

Three roles, straight from the brigade system.

The expeditor (planner) calls the pass. It sequences tickets, decides what fires when, and resolves conflicts between stations. It doesn't cook. It doesn't assess strategy. It dispatches.

Line cooks (workers) own stations and execute. Each one has a bounded domain. The grill doesn't do pastry. Clean separation of concerns. A worker claims a ticket, executes it, and reports results. That's the contract.

The sous chef (advisor) tastes, reviews, and sends back — but never touches the stove. Having a non-executing agent that only thinks strategically enforces the division of labor: the expeditor dispatches but doesn't assess; the sous chef assesses but doesn't execute. Clean stations.

The unusual role in AI systems is the advisor. Most multi-agent architectures skip it entirely. But in production, you need an agent that can think strategically without being burdened by execution. The advisor asks: "Is this the right approach?" not "How do I implement this?"

The Ticket Rail (Not Shared Memory)

Agents don't share a context window. They communicate through a durable message store — an open-source Firestore relay with typed message envelopes: directives, queries, results, acknowledgments. Every message has a TTL. Every message has a source and target. The relay is the ticket rail: durable, ordered, claimed.

Why tickets over shared state? Because shared state doesn't survive a cook going down. Tickets on the rail do. When an agent boots, it reads its inbox. The messages ARE the context. Same reason a kitchen uses a physical ticket system instead of just shouting — when things get loud (and they will), the durable record is what keeps service running.
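The envelope shape described above can be sketched as a small typed record. This is an illustration of the pattern, not the relay's actual schema; all field names are assumptions.

```python
# A sketch of a typed message envelope: source, target, kind, and a TTL.
# Field names are illustrative, not the relay's actual schema.
from dataclasses import dataclass, field
import time
import uuid

@dataclass
class Envelope:
    source: str                 # sending agent's identity
    target: str                 # receiving agent's inbox
    kind: str                   # "directive" | "query" | "result" | "ack"
    payload: dict
    ttl_seconds: int = 3600     # every message expires
    message_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    created_at: float = field(default_factory=time.time)

    def is_expired(self, now=None):
        now = time.time() if now is None else now
        return now - self.created_at > self.ttl_seconds

msg = Envelope(source="planner", target="worker-1",
               kind="directive", payload={"task_id": "t-42"})
print(msg.is_expired())  # False for a fresh message
```

Because every envelope carries its own expiry, an agent reading its inbox on boot can discard stale tickets before acting on them.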

Transactional Task Claiming

Tasks exist in a queue. An agent claims a task atomically — if two agents race for the same ticket, one gets it, one doesn't. No double-execution. No lost tasks. This pattern comes directly from distributed job queues (SQS, Celery, etc.) — nothing novel, except that almost nobody applies it to LLM agents.

The claim operation is a Firestore transaction. Simple. Boring. Bulletproof.

task = get_task(task_id)      # read inside the transaction
if task.status == "created":
  task.status = "claimed"
  task.claimed_by = agent_id
  task.claimed_at = now()
  update_task(task)           # write commits atomically with the read
  return task
else:
  return None

A few lines of pseudocode. Prevents every race condition we encountered in the first month.
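The same check-then-set pattern in runnable form, with an in-memory lock standing in for the Firestore transaction so the race is easy to demonstrate. All names are illustrative.

```python
import threading

# In-memory stand-in for the task store. In production the atomicity
# comes from the datastore's transaction; here a lock plays that role.
class TaskQueue:
    def __init__(self):
        self._lock = threading.Lock()
        self._tasks = {}

    def add(self, task_id):
        self._tasks[task_id] = {"status": "created", "claimed_by": None}

    def claim(self, task_id, agent_id):
        # Atomic check-then-set: of two racing agents, exactly one
        # observes status == "created" and flips it.
        with self._lock:
            task = self._tasks.get(task_id)
            if task is None or task["status"] != "created":
                return None  # lost the race (or unknown task)
            task["status"] = "claimed"
            task["claimed_by"] = agent_id
            return dict(task)

queue = TaskQueue()
queue.add("t-1")
results = {}
threads = [threading.Thread(target=lambda a=a: results.__setitem__(a, queue.claim("t-1", a)))
           for a in ("worker-a", "worker-b")]
for t in threads: t.start()
for t in threads: t.join()
winners = [a for a, r in results.items() if r]
print(winners)  # exactly one worker wins the race
```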


Bounded Autonomy: The Novel Contribution

Most systems treat agents as stateless functions. Ours doesn't. This is where mise en place becomes more than a metaphor.

Persistent Agent Identity

Each agent has an identity that survives across sessions. When a worker boots, it reads its own state: what it was last working on, what it learned, what its known failure modes are, its performance baselines. This isn't prompt injection — it's stored in a persistent state object keyed to the agent's identity.

In a kitchen, mise en place isn't just "prep work." It's the difference between a cook who walks into a clean station with everything portioned and labeled, and one who has to figure out where the shallots are mid-service. Persistent agent state is digital mise en place. The agent walks in ready. Every time.

Without persistent identity, every session starts cold. The agent has no institutional memory. It makes the same mistakes. It can't self-assess. With persistent state, agents get better over time — not because the model improves, but because the system's memory of how to use the model improves.
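A boot sequence that reads identity before doing any work might look like the sketch below. File-backed JSON stands in for the real persistent store, and every field name here is an assumption for illustration.

```python
# Persistent-identity sketch: agent state stored under the agent's ID
# and re-read on every boot. File-backed JSON stands in for the real
# persistent store; field names are illustrative.
import json
from pathlib import Path

def load_state(agent_id, state_dir=Path("/tmp/agent-state")):
    path = state_dir / f"{agent_id}.json"
    if not path.exists():
        # Cold boot: no institutional memory yet.
        return {"agent_id": agent_id, "last_task": None,
                "known_failure_modes": [], "sessions_completed": 0}
    return json.loads(path.read_text())

def save_state(state, state_dir=Path("/tmp/agent-state")):
    state_dir.mkdir(parents=True, exist_ok=True)
    (state_dir / f"{state['agent_id']}.json").write_text(json.dumps(state))

# Boot sequence: read who you are first, then work.
state = load_state("executor-1")
state["sessions_completed"] += 1
save_state(state)
```

The point isn't the storage mechanism; it's that the state is keyed to the agent's identity and outlives any single session.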

Context Handoff Protocol

When a session ends (cleanly or not), the agent writes a handoff: what was in progress, what's blocked, what questions are open, what the next agent instance should know.

Every kitchen has shift-change protocols. The closing line cook leaves notes for the opening crew. What's prepped. What's low. What got 86'd during service. Without this, the morning crew starts blind. Same principle, same stakes.

The handoff document includes:

  • Current task state and progress percentage
  • Known blockers and dependencies
  • Decisions made and their rationale
  • What the next session should prioritize
  • Any context that can't be reconstructed from the ticket alone

This isn't a nice-to-have. When an agent dies mid-session, the handoff is the only thing standing between graceful recovery and catastrophic context loss.
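The handoff fields listed above translate directly into a record the next session can read. This is a sketch; the field names are assumptions, and the real document lives in the persistent store.

```python
# The handoff document as a concrete record. Field names are
# illustrative; the real version lives in the persistent state store.
from dataclasses import dataclass, field, asdict

@dataclass
class Handoff:
    task_id: str
    progress_pct: int
    blockers: list = field(default_factory=list)
    decisions: list = field(default_factory=list)       # (decision, rationale) pairs
    next_priorities: list = field(default_factory=list)
    extra_context: str = ""  # anything not reconstructable from the ticket

handoff = Handoff(
    task_id="t-42",
    progress_pct=40,
    blockers=["tests half-run, nothing committed"],
    decisions=[("batch the writes", "single commit keeps rollback simple")],
    next_priorities=["finish test run", "commit"],
)
record = asdict(handoff)  # serialize for the persistent store
```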

Self-Assessment Baselines

Agents track their own performance: average task duration, common failure modes, sessions completed. These aren't vanity metrics. They're how the planner decides which agent to dispatch for a given task. A worker that historically struggles with complex file operations gets simpler tasks. A good expeditor knows which cook handles pressure and which one falls apart on a ten-top. The system learns its own brigade.
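A minimal sketch of dispatch-by-baseline: the planner consults each worker's tracked history before assigning a task. The numbers and names here are invented for illustration.

```python
# Dispatch-by-baseline sketch: pick the worker with the best track
# record on this task type. All numbers and names are illustrative.
baselines = {
    "worker-a": {"file_ops": {"success_rate": 0.62, "avg_minutes": 19}},
    "worker-b": {"file_ops": {"success_rate": 0.94, "avg_minutes": 11}},
}

def pick_worker(task_type, baselines):
    # Workers with no history on this task type rank lowest.
    def score(worker):
        stats = baselines[worker].get(task_type)
        return stats["success_rate"] if stats else 0.0
    return max(baselines, key=score)

print(pick_worker("file_ops", baselines))  # worker-b
```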

This is closer to how human teams work than anything in the current multi-agent literature. People have institutional knowledge. They have handoff protocols. They know what they're good at. Giving agents these same primitives makes the system dramatically more resilient.

Kitchens figured this out centuries ago. The AI industry is just catching up.


Operational Constraints as Architecture

Constraints aren't bolted on. They're load-bearing. A restaurant that doesn't track food cost, ticket times, and waste isn't a restaurant. It's a hobby that happens to serve food.

Time-Bounded Everything

Every message has a TTL. Every task has an expiration. Every session has a maximum duration. Nothing in the system lives forever, because in production, "forever" means "until it causes a problem you don't understand."

Expiration is a feature, not a limitation. A stale task that expires and returns to the queue is safer than a stale task that silently blocks progress. When a kitchen runs out of halibut, they 86 it immediately. They don't let tickets stack up for a dish that can't be plated.
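A periodic expiration sweep is one way to implement this: stale claims go back on the rail instead of silently blocking. The TTL value and field names below are assumptions.

```python
# Expiration-sweep sketch: a claim older than the TTL is presumed dead
# and the task returns to the queue. Threshold and fields illustrative.
import time

TASK_TTL_SECONDS = 900  # a claim older than this is presumed abandoned

def sweep(tasks, now=None):
    now = time.time() if now is None else now
    for task in tasks.values():
        if task["status"] == "claimed" and now - task["claimed_at"] > TASK_TTL_SECONDS:
            # Cook went down mid-ticket: put the ticket back on the rail.
            task["status"] = "created"
            task["claimed_by"] = None

tasks = {"t-1": {"status": "claimed", "claimed_by": "worker-a",
                 "claimed_at": time.time() - 2000}}
sweep(tasks)
print(tasks["t-1"]["status"])  # "created" — back in the queue
```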

Cost Attribution

Every completed task records: input tokens, output tokens, model used, estimated cost in USD. Aggregated by agent, by task type, by time period. You can answer "what did the fleet cost this week?" and "which agent is the most expensive per task?"
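The roll-up can be sketched in a few lines. The per-token prices below are invented placeholders, not any vendor's actual rates.

```python
# Cost-attribution sketch: per-task records rolled up by agent.
# Prices are illustrative placeholders, not real vendor rates.
from collections import defaultdict

PRICE_PER_1K = {"standard": (0.001, 0.002), "premium": (0.01, 0.03)}  # (in, out) USD

def task_cost(record):
    p_in, p_out = PRICE_PER_1K[record["model"]]
    return (record["input_tokens"] / 1000 * p_in
            + record["output_tokens"] / 1000 * p_out)

records = [
    {"agent": "worker-a", "model": "standard", "input_tokens": 4000, "output_tokens": 1000},
    {"agent": "worker-b", "model": "premium", "input_tokens": 2000, "output_tokens": 500},
]

by_agent = defaultdict(float)
for r in records:
    by_agent[r["agent"]] += task_cost(r)

print(dict(by_agent))  # answers "which agent is most expensive per task?"
```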

Every restaurant that survives its first year knows its food cost percentage. If you can't answer these questions for your AI fleet, you're subsidizing waste with hope.

Model Escalation Controls

Not every task needs the most capable (expensive) model. The system has thresholds — routine tasks use the standard model; complex reasoning tasks escalate to the expensive model. The decision isn't made by the agent (who would always pick the best model for itself). It's made by the architecture.

You don't put wagyu on a Tuesday lunch burger. The menu dictates the protein, not the line cook's preference.
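The routing rule lives outside the agent. A sketch, with an invented scoring rule and task-type names:

```python
# Escalation-threshold sketch: the architecture, not the agent, decides
# which model a task gets. Task types and the rule are illustrative.
COMPLEX_TASK_TYPES = {"architecture_review", "multi_file_refactor"}

def choose_model(task):
    # Routine work stays on the standard model; only tasks the system
    # flags as complex (or that already failed once) escalate.
    if task["type"] in COMPLEX_TASK_TYPES or task.get("retries", 0) > 0:
        return "premium"
    return "standard"

print(choose_model({"type": "rename_symbol"}))        # standard
print(choose_model({"type": "multi_file_refactor"}))  # premium
```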

Circuit Breakers

If an agent fails the same task twice, it doesn't retry a third time. It escalates. If cost exceeds a threshold, the system pauses for human review. These aren't safety features added after the fact. They're architectural decisions made on day one, because without them, autonomous systems degrade gracefully into expensive randomness.
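Both breakers reduce to a small decision function. The thresholds below are assumptions for illustration, not the system's actual values.

```python
# Circuit-breaker sketch: repeated failure escalates, cost overrun
# pauses the fleet for human review. Thresholds are illustrative.
MAX_FAILURES = 2
COST_CEILING_USD = 50.0

def next_action(task, fleet_cost_usd):
    if fleet_cost_usd > COST_CEILING_USD:
        return "pause_for_human_review"
    if task["failures"] >= MAX_FAILURES:
        return "escalate"  # no third retry of the same task
    return "retry"

print(next_action({"failures": 2}, fleet_cost_usd=10.0))  # escalate
print(next_action({"failures": 0}, fleet_cost_usd=80.0))  # pause_for_human_review
```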


What's Still Hard

Context window mortality. Sessions die when context fills up. Current mitigation: proactive rotation on a timer. But the rotation itself costs context (writing handoff, booting new session, reading state). The overhead is real. Better context management is the single highest-impact unsolved problem in agent orchestration.

Multi-instance coordination. What happens when two instances of the same agent exist simultaneously? Current answer: the transactional task claiming prevents double-execution, but the identity model doesn't yet handle split-brain gracefully. The agent's persistent state can get conflicting writes. This is the CAP theorem showing up uninvited.

Observability at fleet scale. We can track cost and success per agent. We can't yet trace a causal chain across agents — "this task failed because that upstream result was wrong." Distributed tracing (Jaeger, Zipkin style) for LLM agent fleets doesn't exist yet. We're building it.

These aren't theoretical concerns. They're the problems we're working on this month. If you're building multi-agent systems and hitting the same walls, that's the right sign — it means you're past the demo stage.


Christian Bourlier is a Technical Solutions Partner — Solutions Architect, Data Engineer, and AI/ML Engineer. He builds systems, closes deals, and occasionally sleeps. christianbourlier.com


Christian Bourlier

Principal Architect building AI-assisted development tools. Founder of rezzed.ai and Three Bears Data.