Five terminal windows. Five AI agents. One git repository. Nine total in the fleet, but five were enough to break everything.
The orchestrator dispatches four stories in parallel. Each agent checks out a feature branch and starts building. The plan looks great on paper.
Then agent A runs git checkout fix/budget-enforcement. Agent B, thirty seconds later, runs git checkout fix/retry-engine. Both agents share a filesystem. B's checkout just moved A to the wrong branch.
Agent A doesn't notice. It keeps editing, keeps committing. On the wrong branch. Agent C's changes land in Agent D's pull request. Backup files appear in the wrong commits. Four PRs, all contaminated.
That was a Tuesday.
Shared Filesystems Will Betray You
This is the first lesson: parallel agents that share a filesystem will corrupt each other's work.
git checkout changes the working tree for the entire directory. If two processes share that directory, one process's branch switch is the other's state corruption. There's no per-process isolation at the OS level.
We learned this with four agents working different stories in the same repository. All four PRs needed reconstruction. Two required cherry-picks and force-pushes. Thirty minutes of cleanup for a sprint that was supposed to save time.
The rule is non-negotiable now: one agent per repository at a time. Two stories in the same repo run sequentially. Need parallelism? Use separate git worktrees, each with its own HEAD. Cross-repo work can parallelize safely. Same-repo work cannot.
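The worktree setup is simple to script. A runnable sketch (branch names from the incident above; paths and the helper are illustrative, nothing here is CacheBash-specific):

```python
import os
import subprocess
import tempfile

def sh(*args, cwd=None):
    """Run a git command and return its trimmed stdout."""
    return subprocess.run(args, cwd=cwd, check=True,
                          capture_output=True, text=True).stdout.strip()

root = tempfile.mkdtemp()
repo = os.path.join(root, "repo")
sh("git", "init", "-q", "-b", "main", repo)
sh("git", "-C", repo, "-c", "user.email=ci@example.com",
   "-c", "user.name=ci", "commit", "-q", "--allow-empty", "-m", "init")

# One worktree per agent: each directory has its own HEAD, so a
# branch switch in one can never move the other onto the wrong branch.
sh("git", "-C", repo, "worktree", "add", "-q",
   "-b", "fix/budget-enforcement", os.path.join(root, "agent-a"))
sh("git", "-C", repo, "worktree", "add", "-q",
   "-b", "fix/retry-engine", os.path.join(root, "agent-b"))

branch_a = sh("git", "-C", os.path.join(root, "agent-a"),
              "rev-parse", "--abbrev-ref", "HEAD")
branch_b = sh("git", "-C", os.path.join(root, "agent-b"),
              "rev-parse", "--abbrev-ref", "HEAD")
```

Each agent gets its own directory and its own HEAD; the object store stays shared, so worktrees cost almost nothing.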
Your Coordination Layer Will Fail
The second lesson: the thing that lets your agents talk to each other will break, and your agents need to handle it gracefully.
Our agents coordinate through CacheBash, an MCP server running on Cloud Run with Firestore persistence. The server tracks sessions in memory. Cloud Run is serverless. When the instance recycles (unpredictably, sometimes every 20 minutes under light load), all in-memory session state vanishes.
The agents don't crash. They lose the ability to communicate. Every MCP tool call returns the same error: Session not found. The orchestrator can't dispatch. The builder can't report. The relay goes dark.
This happened three times in one afternoon.
The workaround: a REST API fallback. When an agent detects the session-death error code, it switches transport for the rest of the session. Not elegant. Works. The permanent fix (moving session state to the database) is architectural work that takes planning. The fallback bought months of stability while that work got scoped.
The general pattern: if your agents communicate through one channel, build a second one. The primary can be optimized for speed. The fallback just needs to not break.
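The pattern fits in a dozen lines. A minimal sketch, where SessionNotFound and the two transport callables are hypothetical stand-ins for the MCP and REST clients:

```python
# Two-channel pattern: try the primary transport, fall back when the
# session has died, and stay on the fallback for the rest of the session.
class SessionNotFound(Exception):
    """The coordination server lost its in-memory session state."""

class Relay:
    def __init__(self, mcp_send, rest_send):
        self._primary = mcp_send     # fast path, optimized transport
        self._fallback = rest_send   # slower, but survives instance recycling
        self._degraded = False       # sticky: once tripped, stay on REST

    def send(self, message):
        if not self._degraded:
            try:
                return self._primary(message)
            except SessionNotFound:
                self._degraded = True  # session died; switch transports
        return self._fallback(message)
```

The sticky flag matters: once the server has recycled, retrying the primary on every call just adds latency to a channel that is already dead.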
Context Degradation Is Silent
The third lesson: AI agents don't have infinite working memory, and they won't tell you when it's running out.
Every model has a context window. Fill 70% of it and the system starts compressing earlier messages to make room. Compaction is lossy and automatic. Reasoning chains collapse into summaries. Details vanish. An agent operating on compressed context makes worse decisions without knowing it. It doesn't slow down. It doesn't flag a warning. It just gets subtly dumber.
We enforce a hard rule: at 60% utilization, the agent stops and rotates to a fresh session. Not at 70%. Not when output quality visibly degrades. At 60%, based on proxy signals: session duration, files read, tool calls made.
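Since true token counts aren't visible mid-session, the check runs on those proxies. A sketch of the idea, with illustrative budgets rather than the production values:

```python
# Estimate context utilization from proxy signals and rotate at a hard
# 60% threshold. Budget constants are illustrative, not production values.
def estimated_utilization(minutes, files_read, tool_calls,
                          budget_minutes=30, budget_files=80,
                          budget_calls=200):
    signals = (minutes / budget_minutes,
               files_read / budget_files,
               tool_calls / budget_calls)
    return max(signals)  # the most-consumed proxy dominates

ROTATE_AT = 0.60  # hard stop, well before lossy compaction kicks in

def should_rotate(minutes, files_read, tool_calls):
    return estimated_utilization(minutes, files_read, tool_calls) >= ROTATE_AT
```

Taking the max of the proxies is deliberately conservative: any single signal running hot is treated as the session running hot.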
What makes rotation work is the shadow journal. Every agent writes its operational state to Firestore after every significant action. Not on shutdown (because shutdowns aren't always clean). Continuously. What it just finished. What it learned. What it was about to do next.
A new session boots, reads the journal, and reaches full operational context in under 10 seconds. Five API calls. No human briefing required.
Without the journal, rotation destroys productivity. With it, rotation is free. Shorter sessions, fresher context, better output. Orchestrators rotate every 15 minutes. Builders every 30. The overhead is a few extra API calls per session. The alternative is watching output quality decay and not knowing why.
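The journal itself is just an append-and-tail structure. A sketch where a dict stands in for Firestore and the entry fields are illustrative:

```python
import time

_journal = {}  # agent_id -> ordered list of entries

def record(agent_id, finished, learned, next_step):
    """Append operational state after every significant action,
    not on shutdown (shutdowns aren't always clean)."""
    _journal.setdefault(agent_id, []).append({
        "ts": time.time(),
        "finished": finished,   # what the agent just completed
        "learned": learned,     # anything a successor must know
        "next": next_step,      # what it was about to do
    })

def resume(agent_id, last_n=5):
    """A fresh session reads the journal tail instead of a human briefing."""
    return _journal.get(agent_id, [])[-last_n:]
```

A new session calls resume(), replays the tail into its context, and picks up the "next" field of the last entry as its first task.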
Message Queues Need a Side Channel
The fourth lesson: delivering a message and making an agent aware of the message are two different problems.
Our agents communicate through CacheBash's relay. The orchestrator creates a task, the builder's inbox receives it. But when an agent finishes its work and idles at the prompt, it stops making tool calls. And the message-checking hook only fires on tool calls.
An idle agent is a deaf agent.
A shell script runs every 15 seconds, polls CacheBash for pending messages, checks which tmux sessions are at the prompt, and sends a keystroke to wake them. The agent's hook fires, checks its inbox, finds the new task.
This is the orchestration plumbing nobody writes about. Between "message delivered to the queue" and "agent aware of the message" is a gap that will silently drop work. In a system where agents are processes in terminal windows, you need both a message queue and a process signal. CacheBash handles delivery. Tmux handles awareness.
Boot timing adds another wrinkle. Launching a new agent session produces two prompts that look identical: the shell prompt (instant) and the AI client's prompt (12-20 seconds later, after initialization). Send your message to the shell prompt and the AI never sees it. You need layer-aware polling: wait for the right prompt at the right layer before sending input.
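The poller's core logic, with the tmux calls injected so it can be shown self-contained. Prompt patterns, session names, and the nudge text are illustrative:

```python
import re

SHELL_PROMPT = re.compile(r"\$ $")         # wrong layer: up instantly
CLIENT_PROMPT = re.compile(r"^> $", re.M)  # AI client's prompt, 12-20s later

def ready_for_input(pane_text):
    """Only the client's prompt counts: keys sent to the shell prompt
    never reach the AI."""
    return bool(CLIENT_PROMPT.search(pane_text))

def wake_idle_agents(sessions, has_pending, capture_pane, send_keys):
    """One polling pass: nudge every session that has queued messages AND
    is idle at the client prompt, so its message-checking hook fires."""
    woken = []
    for name in sessions:
        if has_pending(name) and ready_for_input(capture_pane(name)):
            send_keys(name, "check inbox")  # any input triggers the hook
            woken.append(name)
    return woken
```

In production this would run in a loop with a sleep between passes, capture_pane shelling out to `tmux capture-pane -p -t <session>` and send_keys to `tmux send-keys`.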
Identity Leaks Across Worktrees
The fifth lesson: your agent identity system will leak across shared infrastructure, and the defaults won't protect you.
Our agents run in separate git worktrees. Each one reads a configuration file on startup that defines its role, tools, and behavioral rules. The loader is additive: it reads every configuration file it finds (project root, user directory, local override) and merges them.
The root-level file contained orchestrator-specific rules. PR merge authority. Task dispatch permissions. Fleet coordination protocols.
Every agent in every worktree loaded the orchestrator's identity on top of its own. A builder agent thought it could merge pull requests. A content agent tried to dispatch work to other agents. Nobody noticed until the wrong agent attempted a privileged operation.
The fix is two layers:
- Shared layer (tracked in git): common context that all agents need. Communication protocols, git rules, tool access patterns.
- Identity layer (gitignored, generated at launch): per-agent role, capabilities, behavioral constraints. Written fresh every time the agent spawns. Never committed. Overrides the shared layer by design.
In any multi-agent system where agents share a codebase, separate shared configuration from per-agent identity. Track the shared parts. Generate the identity parts at launch. Mix them and every agent thinks it's the orchestrator.
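The loader fix reduces to an ordered merge where the identity layer is applied last. A minimal sketch (key names are made up):

```python
# Shared layers merge additively; the generated identity layer always wins.
def load_config(shared_layers, identity):
    cfg = {}
    for layer in shared_layers:  # tracked in git: protocols, git rules, tools
        cfg.update(layer)
    cfg.update(identity)         # gitignored, written fresh at launch, wins
    return cfg
```

The point of the ordering: even if an orchestrator-specific key leaks into a shared layer, the per-agent identity written at spawn time overwrites it.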
What We'd Change
Six months and fourteen sprints later:
Isolation first. We started with shared directories because it was easier. The contamination incident cost us sprint time and trust in parallel execution. Git worktrees should have been mandatory before the first parallel dispatch.
Stateless coordination. Keeping session state in memory on a serverless platform was always going to produce session death. The database should have held that state from day one.
Budget enforcement at every layer. When agents run overnight without supervision, self-enforcement isn't enough. We use three independent layers now: the agent checks its own spend, the server enforces caps via database triggers, and a scheduled function enforces a hard timeout. Any one layer can have bugs. Three independent layers almost never fail at once.
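The three layers compose as independent checks where any one tripping stops the run. A sketch, with an illustrative cap and timeout:

```python
# A run stops when ANY layer trips; no layer trusts the others.
def over_budget(self_reported_spend, server_metered_spend, elapsed_minutes,
                cap_usd=10.0, hard_timeout_minutes=480):
    checks = (
        self_reported_spend >= cap_usd,           # layer 1: agent self-check
        server_metered_spend >= cap_usd,          # layer 2: server-side cap
        elapsed_minutes >= hard_timeout_minutes,  # layer 3: scheduled timeout
    )
    return any(checks)
```

The redundancy is the feature: the server cap trips even if the agent's self-reporting is wrong, and the timeout trips even if metering itself breaks.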
Identity as infrastructure. Agent identity shouldn't depend on someone remembering to write a config file. It should be generated, injected, and validated at boot. If an agent can't prove who it is, it shouldn't be able to claim work.
Every lesson here came from production. None of them were in the docs we read before building. The hard part of multi-agent orchestration isn't making agents work. It's keeping them from breaking each other while they do.
CacheBash is open source under MIT. The coordination server, relay messaging, session management, and fleet dispatcher are in the repo.