LLM Observability

LLM Observability is the ability to monitor, trace, and debug the performance of Large Language Models in production.

Key Metrics

  • Token Usage: Cost and efficiency tracking.
  • Latency: Time to first token and total response time.
  • Accuracy (Evals): Using LLMs or heuristic checks to grade the quality of the output.
  • Context Integrity: Measuring how much of the provided context was actually utilized or ignored (Context Window management).
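
The token and latency metrics above can be captured with a thin wrapper around a streamed response. The following is a minimal sketch, not a definitive implementation: `measure_stream`, `CallMetrics`, and `estimate_cost` are hypothetical names, the stream is assumed to be any iterable of text chunks, and the per-million-token prices are placeholders the caller must supply.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass
class CallMetrics:
    time_to_first_token_s: float
    total_time_s: float
    output_chunks: int


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[str, CallMetrics]:
    """Consume a stream of text chunks and record latency metrics.

    `chunks` is assumed to be whatever iterable the client library yields;
    `clock` is injectable so the function can be tested deterministically.
    """
    start = clock()
    first_chunk_at = None
    parts = []
    for chunk in chunks:
        if first_chunk_at is None:
            first_chunk_at = clock()  # first token arrived
        parts.append(chunk)
    end = clock()
    # If the stream was empty, fall back to total time for TTFT.
    ttft = (first_chunk_at - start) if first_chunk_at is not None else (end - start)
    return "".join(parts), CallMetrics(ttft, end - start, len(parts))


def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    price_in_per_mtok: float,
    price_out_per_mtok: float,
) -> float:
    """Cost of one call; prices are per million tokens (caller-supplied assumptions)."""
    return (
        input_tokens / 1_000_000 * price_in_per_mtok
        + output_tokens / 1_000_000 * price_out_per_mtok
    )
```

Injecting the clock keeps the latency logic testable without real network calls; in production the default `time.perf_counter` is used.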

SubAgent Cost Observability

When using Claude SubAgents, token observability becomes multi-dimensional: you must track both the parent orchestrator’s context and each subagent’s isolated context. Key metrics to monitor:

  • Parent Context Saturation: Use /context to track when the orchestrator is nearing the delegation threshold (~70–80% of the context window).
  • Per-Subagent Token Spend: Each subagent spawn has a fixed startup cost (system prompt + task prompt). Measure whether the isolation benefit outweighs the spawn overhead.
  • Summary Fidelity: Assess whether subagent summaries preserve enough precision for the parent to synthesize correctly.
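
One way to make the first two metrics concrete is sketched below. It assumes token counts are already available from logs; `SubagentRun` and the helper names are hypothetical, and the break-even rule (tokens kept out of the parent context must exceed the startup cost) is one possible heuristic, not an established formula.

```python
from dataclasses import dataclass


@dataclass
class SubagentRun:
    name: str
    startup_tokens: int  # system prompt + task prompt
    work_tokens: int     # tokens consumed inside the subagent's isolated context
    summary_tokens: int  # tokens returned to the parent


def context_saturation(used_tokens: int, window_tokens: int) -> float:
    """Fraction of the parent context window currently in use."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    return used_tokens / window_tokens


def nearing_delegation_threshold(
    used_tokens: int, window_tokens: int, threshold: float = 0.70
) -> bool:
    """True once the parent crosses the lower bound of the ~70-80% range."""
    return context_saturation(used_tokens, window_tokens) >= threshold


def parent_tokens_saved(run: SubagentRun) -> int:
    """Tokens kept out of the parent context by returning only a summary."""
    return run.work_tokens - run.summary_tokens


def isolation_pays_off(run: SubagentRun) -> bool:
    """Assumed heuristic: savings must exceed the fixed spawn overhead."""
    return parent_tokens_saved(run) > run.startup_tokens
```

For example, a run with 2,000 startup tokens that absorbs 40,000 tokens of exploration and returns a 1,500-token summary clears the break-even point, while a 2,500-token task with a 3,000-token startup cost does not. Summary fidelity is not covered here because it needs an eval rather than token arithmetic.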

Model-Tier Cost Routing

Custom subagents introduce a new observability dimension: model-tier tracking. When routing tasks to Haiku vs. Sonnet vs. Opus, track per-agent model tier to measure whether cost routing decisions are correct:

  • High-frequency exploration tasks running on Opus signal a misconfigured model: field in the agent definition.
  • Complex reasoning tasks running on Haiku signal under-provisioning.
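
The two signals above can be checked mechanically once each agent's usage is tagged with a tier and a task kind. This is a sketch under stated assumptions: `AgentUsage`, the task-kind labels "exploration" and "complex_reasoning", and the `high_frequency_calls` cutoff of 50 are all hypothetical and would need to match your own logging schema.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

KNOWN_TIERS = {"haiku", "sonnet", "opus"}


@dataclass(frozen=True)
class AgentUsage:
    agent: str
    tier: str       # model tier the agent actually ran on
    task_kind: str  # hypothetical label: "exploration" or "complex_reasoning"
    calls: int      # number of invocations in the observation window


def audit_routing(
    usages: Iterable[AgentUsage], high_frequency_calls: int = 50
) -> List[Tuple[str, str]]:
    """Return (agent, finding) pairs for routing decisions that look wrong."""
    findings = []
    for u in usages:
        tier = u.tier.lower()
        if tier not in KNOWN_TIERS:
            findings.append((u.agent, "unknown tier"))
        elif (
            u.task_kind == "exploration"
            and tier == "opus"
            and u.calls >= high_frequency_calls
        ):
            # Cheap, frequent work on the most expensive tier.
            findings.append((u.agent, "over-provisioned"))
        elif u.task_kind == "complex_reasoning" and tier == "haiku":
            # Hard work on the smallest tier.
            findings.append((u.agent, "under-provisioned"))
    return findings
```

Low-volume exploration on Opus is deliberately not flagged, since the cost impact is small; adjust the cutoff to your own traffic.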

MCP Context Bloat as an Observability Concern

Each connected MCP server loads its tool descriptions into the context window at session start. This creates a new observability dimension: context pollution monitoring. Track the number of active MCP servers and estimate their context token contribution. Signs of MCP context bloat include:

  • Increased session startup token usage without corresponding tool usage
  • Model responses that ignore available tools (descriptions consumed context but tools went unused)
  • Degraded response quality correlated with the number of connected servers

Mitigation: Use /mcp to audit active servers. Remove inactive ones. Scope MCP servers to specific subagents rather than the parent session. See Claude + MCP Explained.
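
A rough estimate of each server's context contribution, paired with its tool-call count, is often enough to decide which servers to remove. The sketch below is illustrative only: `McpServer` and `bloat_report` are hypothetical names, the tool descriptions are assumed to have been exported to plain strings, and the 4-characters-per-token ratio is a crude approximation rather than a real tokenizer.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class McpServer:
    name: str
    tool_descriptions: Dict[str, str]  # tool name -> description text
    tool_calls: Dict[str, int] = field(default_factory=dict)  # tool name -> call count


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Crude token estimate; swap in a real tokenizer if one is available."""
    return math.ceil(len(text) / chars_per_token)


def server_context_tokens(server: McpServer) -> int:
    """Estimated tokens the server's tool descriptions add at session start."""
    return sum(
        estimate_tokens(f"{tool} {desc}")
        for tool, desc in server.tool_descriptions.items()
    )


def bloat_report(servers: List[McpServer]) -> List[dict]:
    """Per-server summary, largest estimated context cost first."""
    report = []
    for s in servers:
        calls = sum(s.tool_calls.values())
        report.append(
            {
                "server": s.name,
                "estimated_tokens": server_context_tokens(s),
                "tool_calls": calls,
                "unused": calls == 0,  # candidate for removal or subagent scoping
            }
        )
    return sorted(report, key=lambda r: r["estimated_tokens"], reverse=True)
```

Servers that appear near the top of the report with `unused` set are the first candidates to remove or to scope to a specific subagent.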

Related: Context Engineering, Claude SubAgents

References