Notes from the Grid: The Telemetry Nobody's Building for Multi-Agent AI
You know what your model costs per token. Input, output, cached, uncached. Anthropic publishes the rates. OpenAI publishes the rates. You can calculate the unit economics of a single API call down to the tenth of a cent.
Now tell me what your last task cost to complete.
Not the API call. The task. The one that spawned three subtasks, retried twice on a transient error, rotated context at 65% because the agent was losing coherence, then handed off to a fresh session that finished the job at 2AM. What did that cost? How long did it take? Did the retry actually help, or did the agent just burn tokens restating the problem?
If you can't answer that, you don't have a production system. You have a demo with a billing page.
The gap between model observability and fleet telemetry
The LLM observability space has matured fast. OpenTelemetry semantic conventions for LLM spans are stabilizing. Tools like Arize Phoenix, Langfuse, and Braintrust give you traces, evals, and prompt versioning. OpenInference extends OTel with attributes like llm.input_messages and llm.token_count.total. The instrumentation story for individual model calls is, genuinely, getting good.
None of that helps when your problem isn't "how did the model perform" but "how is my fleet performing."
Multi-agent orchestration introduces a layer of complexity that single-model observability doesn't address. Agents make decisions. They claim tasks, spawn subtasks, send messages to other agents, and occasionally decide they need more context before proceeding. They operate on incomplete information. They have state that persists across sessions but not across context rotations. And when they fail, the failure mode isn't a 500 error. It's a slow drift into incoherence that produces plausible-looking garbage until someone notices.
Model telemetry measures the hand you were dealt. Fleet telemetry measures how you played it.
Three things worth measuring
After months of running autonomous AI agents, our telemetry needs settled into three categories. Not because we planned it that way. Because these were the questions that kept waking us up.
1. Economics: What did this actually cost?
Token counts from the API are necessary but misleading. A task that consumes 50K input tokens and 10K output tokens has a calculable API cost. But if that task required two context rotations, a retry, and a handoff between agents, the real cost is 3-4x the naive calculation. The retries burned tokens restating context. The handoff burned tokens on bootstrap and state restoration. The context rotation burned tokens on shadow journaling so the next session could resume.
We designed our task schema to track cost at the task level, not the API call level. Every completed task can report tokens_in, tokens_out, cost_usd, model, and provider. The fields are there. Getting agents to reliably populate them is its own problem — and one we're still solving. Builders spawned by orchestrators don't always capture their own token usage before session termination. The schema is the easy part. The discipline is hard.
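A minimal sketch of that task-level rollup, using the field names above. The price table and model keys here are hypothetical placeholders; real rates come from the provider's published pricing.

```python
# Hypothetical price table, USD per million tokens. These numbers are
# illustrative, not the providers' actual rates.
PRICES = {
    ("anthropic", "opus"): {"in": 15.00, "out": 75.00},
    ("anthropic", "sonnet"): {"in": 3.00, "out": 15.00},
}

def task_cost_usd(provider: str, model: str,
                  tokens_in: int, tokens_out: int) -> float:
    """Roll raw token counts up into a task-level cost figure."""
    p = PRICES[(provider, model)]
    return (tokens_in * p["in"] + tokens_out * p["out"]) / 1_000_000

# The flat fields a completed task reports, per the schema above.
telemetry = {
    "tokens_in": 50_000,
    "tokens_out": 10_000,
    "model": "sonnet",
    "provider": "anthropic",
}
telemetry["cost_usd"] = task_cost_usd(
    telemetry["provider"], telemetry["model"],
    telemetry["tokens_in"], telemetry["tokens_out"],
)
```

The calculation is trivial; the operational problem is getting every agent to write these fields before its session dies.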
The number that actually matters: cost per completed task, broken down by model tier. Our Opus agents (strategic planning, architecture review) run significantly more expensive than Sonnet agents (code generation, documentation). That ratio should stay stable. When it spikes, something is wrong. Usually an Opus agent stuck in a retry loop, burning premium tokens on a problem that should have been escalated or killed. We know this from operating the fleet. We're building the instrumentation to prove it quantitatively.
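One way to watch that ratio, sketched with hypothetical tier names and a made-up baseline; the instrumentation to feed it real numbers is the part still being built.

```python
from collections import defaultdict
from statistics import mean

def cost_per_task_by_model(completed_tasks):
    """Average cost per completed task, grouped by model tier."""
    by_model = defaultdict(list)
    for t in completed_tasks:
        by_model[t["model"]].append(t["cost_usd"])
    return {model: mean(costs) for model, costs in by_model.items()}

def tier_ratio_alert(averages, baseline_ratio, tolerance=0.5,
                     premium="opus", standard="sonnet"):
    """Flag when the premium/standard cost ratio drifts from baseline;
    often a premium agent stuck in a retry loop."""
    ratio = averages[premium] / averages[standard]
    return abs(ratio - baseline_ratio) > tolerance
```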
2. Health: Is this agent degrading?
Context windows are a finite resource that degrades non-linearly. An agent at 80% context remaining produces materially different output than the same agent at 30%. The degradation isn't gradual. It's a cliff. Output quality holds steady until somewhere around 35-40% remaining, then falls off a table.
We measure context health as percentage remaining, not percentage used. That's a deliberate choice. "72% context remaining" communicates proximity to the cliff. "28% used" communicates progress, which is the wrong frame. You don't care how far you've driven. You care how much fuel is left.
Every agent self-reports context health. When it hits 35%, a hook fires automatically and signals for rotation. The agent commits its work, journals its state, and a fresh session picks up where it left off. Total handoff gap: under 45 seconds. The telemetry that makes this work isn't the context percentage itself — it's the operational pattern data telling us where the cliff actually is. We've tuned the rotation threshold through experience. Formalizing that into per-model, per-task-type baselines is the next step.
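The rotation check itself is simple; the hard-won part is the threshold. A sketch, assuming token-count accounting:

```python
ROTATION_THRESHOLD = 0.35  # fraction remaining where the rotation hook fires

def context_remaining(used_tokens: int, window_tokens: int) -> float:
    """Health as proximity to the cliff: fraction of the window still free."""
    return 1.0 - used_tokens / window_tokens

def should_rotate(used_tokens: int, window_tokens: int) -> bool:
    """Fire the rotation hook at or below the tuned threshold."""
    return context_remaining(used_tokens, window_tokens) <= ROTATION_THRESHOLD
```

The threshold is a single tuned constant today; the per-model, per-task-type baselines mentioned above would turn it into a lookup.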
3. Coordination: Are agents stepping on each other?
Multiple agents operating on shared infrastructure will find every race condition you didn't think of. Two agents claiming the same task. Three agents pushing to branches that conflict. An orchestrator dispatching work to a builder that's mid-rotation and can't accept it.
We track coordination health through message latency (how long between send and acknowledgment), task claim collisions (how often two agents try to claim the same work), and branch contamination events (how often git operations fail due to concurrent access). Early on, we had branch contamination events daily. After enforcing one-repo-per-builder and adding a dispatch lock, they dropped to zero. We only knew the fix worked because we were watching the right signals.
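The claim-collision metric falls out of the claim operation itself: the loser of an atomic test-and-set is, by definition, a collision. A minimal in-memory stand-in; in production a Firestore transaction plays the same atomic role.

```python
import threading

class TaskBoard:
    """In-memory stand-in for the shared task store. A Firestore
    transaction would provide the atomicity in production."""

    def __init__(self):
        self._lock = threading.Lock()
        self._claims = {}    # task_id -> agent_id
        self.collisions = 0  # coordination-health counter

    def claim(self, task_id: str, agent_id: str) -> bool:
        with self._lock:  # atomic test-and-set
            if task_id in self._claims:
                self.collisions += 1  # second claimant loses, and we count it
                return False
            self._claims[task_id] = agent_id
            return True
```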
Where OpenTelemetry fits (and where it doesn't yet)
OpenTelemetry gives you two things that are hard to build yourself: a vendor-neutral wire format and a collector architecture that separates instrumentation from export.
The wire format matters because the observability landscape is fragmented and moving fast. Instrumenting your agents with a proprietary SDK means rewriting that instrumentation every time you switch backends. OTel's semantic conventions mean your spans, metrics, and logs look the same whether you're exporting to Jaeger, Datadog, or a Firestore collection you built yourself.
The collector architecture matters because multi-agent systems generate telemetry from multiple processes, multiple machines, and multiple cloud regions simultaneously. Centralizing that through OTel collectors lets you process, filter, and route telemetry without touching instrumentation code.
Where OTel falls short today: the semantic conventions for LLM workloads are still model-call-scoped. OpenInference extends OTel with LLM-specific attributes, but it's still thinking about individual inference calls. There's no standard vocabulary for agent-level spans, task lifecycle events, or fleet coordination metrics. Nobody has standardized "an agent claimed a task, worked on it across three context rotations, and completed it 47 minutes later."
Our current telemetry is custom — Firestore documents with trace IDs, span correlation across tasks and sprints, fleet health snapshots. It works for our scale. But we're building toward OTel adoption at the infrastructure layer (our Cloud Run services already run on GCP, which speaks OTel natively via Cloud Trace) while keeping our domain-specific semantics for fleet-level events. The goal: OTel for the transport and visualization, our own conventions for the agent-native concepts that OTel doesn't have vocabulary for yet. Task lifecycle as a span kind. Context rotations as linked spans with a shared task.id. Cost rolled up per task, not per call.
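Until OTel grows that vocabulary, the convention can live in plain span records: OTel-shaped fields plus the agent-native attributes. A sketch of context rotations as linked spans sharing a task.id (field names here are our assumed convention, not a standard):

```python
import time
import uuid

def make_span(name: str, task_id: str, links=None):
    """A span record in a homegrown convention: OTel-shaped fields
    plus agent-native attributes OTel has no vocabulary for yet."""
    return {
        "span_id": uuid.uuid4().hex[:16],
        "name": name,                      # e.g. "task.lifecycle", "context.rotation"
        "start_unix": time.time(),
        "attributes": {"task.id": task_id},
        "links": links or [],              # rotations link back to the task span
    }

# One task lifecycle span, with three rotations linked to it.
task_span = make_span("task.lifecycle", "task-47")
rotations = [
    make_span("context.rotation", "task-47", links=[task_span["span_id"]])
    for _ in range(3)
]
```

Because the records are OTel-shaped, mapping them onto real OTel spans later is a serialization change, not a redesign.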
We're not there yet. But the architecture is designed to get there without a rewrite.
What the data taught us
Months of running a multi-agent fleet produced a few results we didn't expect. Some of these come from telemetry. Some come from operating the system and watching it fail. We're being honest about the distinction because the insights are real either way — the instrumentation to prove them quantitatively is what we're still building.
Retries are almost never worth it. When an agent fails a task, retrying on the same context almost never works. The agent already has the problem framed wrong. Restating the same context doesn't unframe it. Rotating to a fresh session with a clean context and explicit handoff notes succeeds far more often. Our operational experience made the decision obvious: kill and rotate, don't retry. We're instrumenting retry-vs-rotate outcomes now to get the exact numbers.
Expensive models save money on complex work. The instinct is to run everything on the cheapest model. But an Opus agent completing an architectural task in one pass costs less than a Sonnet agent requiring three passes and two rounds of rework. Cost per task, not cost per token. We stopped optimizing model selection on unit price and started optimizing on task completion rate.
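The arithmetic is worth making explicit. With hypothetical per-pass costs (the real figures come from task-level telemetry):

```python
def cost_per_completed_task(cost_per_pass: float, passes: int) -> float:
    """Cost per task: per-pass cost times the passes (including rework
    rounds) needed to actually finish."""
    return cost_per_pass * passes

# Illustrative numbers only.
opus_task = cost_per_completed_task(1.80, passes=1)    # one clean pass
sonnet_task = cost_per_completed_task(0.40, passes=5)  # 3 passes + 2 rework rounds
# The "cheap" model loses on cost per task: 1.80 vs 2.00.
```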
Context rotation has a fixed cost that's worth paying. Every rotation burns tokens on state serialization and bootstrap. That felt expensive until we observed that agents deep into their context window produce significantly more rework. The rotation tax pays for itself in avoided waste. Quantifying the exact inflection point per model and task type is active work.
Coordination overhead is the real scaling bottleneck. Adding more agents doesn't increase throughput linearly. Message latency increases. Task claim collisions increase. The orchestrator spends more time dispatching than any individual builder spends building. We found that scaling agents without scaling coordination infrastructure produces diminishing returns past a handful of concurrent workers. Beyond that, you need hierarchy — orchestrators managing clusters of builders, not a flat dispatch to everyone.
The telemetry stack, briefly
No point in being coy about the architecture.
Task-level telemetry lives as flat fields on the task document in Firestore. Not a subcollection. Not a separate telemetry store. Flat fields on the same document the agent is already reading and writing. This means every task read includes its telemetry. No joins. No second query. The trade-off is document size, but task documents rarely exceed 10KB even with full telemetry fields.
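What one of those documents looks like, with hypothetical values; the size check mirrors the trade-off:

```python
import json

task_doc = {
    # Core task fields the agent already reads and writes...
    "task_id": "task-47",
    "status": "complete",
    # ...with telemetry as flat fields on the same document: no joins,
    # no second query to get cost alongside the task itself.
    "tokens_in": 50_000,
    "tokens_out": 10_000,
    "cost_usd": 0.30,
    "model": "sonnet",
    "provider": "anthropic",
}

doc_bytes = len(json.dumps(task_doc).encode())
assert doc_bytes < 10_000  # well under Firestore's 1 MiB document limit
```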
Fleet health snapshots write to a time-series collection at regular intervals. Each snapshot captures active sessions, context health, task queue depth, and message backlog. The dashboard reads these via Firestore subscriptions. No polling. No intermediate API.
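Each snapshot is a single flat record, sketched here with assumed field names:

```python
import time

def fleet_snapshot(agents, task_queue, message_backlog):
    """One time-series point: the record a dashboard subscribes to."""
    return {
        "ts_unix": time.time(),
        "active_sessions": sum(1 for a in agents if a["active"]),
        # Worst context health among active agents; 1.0 if none are active.
        "min_context_remaining": min(
            (a["context_remaining"] for a in agents if a["active"]),
            default=1.0,
        ),
        "task_queue_depth": len(task_queue),
        "message_backlog": len(message_backlog),
    }
```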
Cost aggregation is the piece we're still building out. The schema supports it — every task completion can carry cost fields. The aggregation layer that rolls those up into daily and weekly summaries, applies price table lookups, and surfaces cost-per-task-type breakdowns is on the roadmap. Right now we can query it on demand. The goal is pre-computed rollups that a dashboard can subscribe to.
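The rollup itself is straightforward once the cost fields are populated. A sketch of the daily aggregation (field names assumed):

```python
from collections import defaultdict
from datetime import datetime, timezone

def daily_cost_rollup(completed_tasks):
    """Pre-computed rollup a dashboard could subscribe to, instead of
    re-aggregating raw task documents on every read."""
    days = defaultdict(float)
    for t in completed_tasks:
        day = datetime.fromtimestamp(
            t["completed_unix"], tz=timezone.utc
        ).date().isoformat()
        days[day] += t["cost_usd"]
    return dict(days)
```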
The whole thing runs on Firestore and Cloud Run. No Kubernetes. No Kafka. No dedicated observability platform. When you're running a fleet this size, the telemetry infrastructure should be simpler than the system it's observing.
What's next
The gap between model observability and fleet observability is going to close. It has to. Anyone running more than one AI agent in production will hit the same questions we did. What did that cost? Is this agent healthy? Are my agents fighting each other?
The vendors will get there. OTel semantic conventions will expand. But the teams running multi-agent systems today can't wait for the standards body. Instrument at the task level. Track cost per completion, not cost per call. Measure context health as proximity to failure, not distance from start. And build the coordination metrics first, because that's where the surprises live.
The model is the easy part to observe. The fleet is where you learn something.
Christian Bourlier is a Technical Solutions Partner: Solutions Architect, Data Engineer, and AI/ML Engineer. He builds multi-agent systems with CacheBash and writes about what goes wrong along the way.