Failure classes and playbooks.

Mesedi runs twenty-plus detectors over the telemetry your SDK ships. When a detector fires, related events cluster into a failure group, and the Tier 1 Playbook for that class renders alongside the group in the dashboard. The playbook explains what the signal means and what to try first. Each section below names the detector, what it looks at, and the rough shape of the playbook. The multi-agent, human-in-the-loop, and cross-tenant signals have dedicated pages, linked at the bottom, with deeper walkthroughs.

Crashes

An execution exited via an unhandled exception. The SDK records the exception type plus a stable hash of the top of the traceback (the crash signature), so retries of the same bug cluster into one failure group instead of paging on every attempt.

Signal: Same crash_signature appears more than once.

Playbook (Tier 1): Read the traceback in the dashboard, fix the bug, redeploy. If the same signature reappears after the deploy, the fix didn't land or didn't cover the path that fires.
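
As a rough illustration of why retries of one bug land in one group, the sketch below hashes the exception type together with the innermost traceback frames. The name crash_signature, the top_frames parameter, and the choice to hash file and function names rather than line numbers or messages are assumptions for this example, not the SDK's actual algorithm.

```python
import hashlib
import traceback


def crash_signature(exc: BaseException, top_frames: int = 3) -> str:
    """Hash the exception type plus the innermost traceback frames (hypothetical helper)."""
    frames = traceback.extract_tb(exc.__traceback__)[-top_frames:]
    # Only file and function names go into the hash, so a different error
    # message or a shifted line number does not split one bug into many groups.
    parts = [type(exc).__name__] + [f"{f.filename}:{f.name}" for f in frames]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()[:16]
```

Two failures raised from the same code path with different messages produce the same signature under this scheme, which is the property the clustering relies on.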

Time budget exceeded

An execution ran longer than the time budget configured on @wrap. The detector fires when duration_ms exceeds the budget, even if the function eventually returned successfully.

Signal: duration_ms > policy.max_wall_seconds * 1000.

Playbook (Tier 1): Either raise the budget intentionally (and document why) or cut latency. Common culprits: tool calls hitting slow upstreams, unintentional ReAct loops with no maxSteps, or a model that has switched to a slower variant.

Step count exceeded

The number of LLM calls in one execution exceeded the configured maxSteps. ReAct-style agents that can't decide when to stop look like this.

Signal: Count of llm_call events on an execution > max_steps.

Playbook (Tier 1): Tighten your stopping criterion, add a final-answer validator, or lower maxSteps. If the agent is genuinely stuck, the surrounding loop detectors will also fire.

Tool failures

A tool call inside the execution failed with an exception or returned a result the calling code marked failed. The detector clusters repeated failures of the same tool inside one execution and across executions.

Signal: tool_call event with status=failed; same (tool_name, exception_type) seen more than once.

Playbook (Tier 1): Add retry-with-backoff at the tool boundary, or have the agent re-plan when a tool fails consistently. If the upstream tool is genuinely down, the upstream needs the fix, not the agent.
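
A minimal sketch of retry-with-backoff at the tool boundary, assuming the tool is a zero-argument callable. The helper name call_with_backoff and all of its parameters are hypothetical and not part of the Mesedi SDK; the injectable sleep argument exists only to make the helper testable.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    tool: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry a failing tool call with exponential backoff and full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return tool()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the failure surface to the agent
            # Delay cap doubles each attempt; the random draw spreads retries out.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")
```

Once the attempts are exhausted the original exception propagates, so a tool that is consistently down still shows up as a failure rather than being hidden by the retry loop.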

Validator failures

An explicit output validator (schema check, content policy, downstream parser) rejected the final answer. Captured by emitting a validator_check event with status=failed.

Signal: validator_check event with status=failed.

Playbook (Tier 1): Strengthen the prompt or add a repair step. Repeated validator failures on the same schema usually mean the model is producing structurally-plausible-but-wrong output that only the validator catches; that's a signal to upgrade the validator and re-run.

Prompt injection

User-supplied content appears to have overridden system instructions. The detector looks for known injection patterns (ignore previous instructions, system: ..., role-confusion markers) plus a heuristic that flags model outputs that explicitly reference the injected directive.

Signal: llm_call event where the user content matches an injection pattern, or the model output explicitly cites instructions absent from the system prompt.

Playbook (Tier 1): Strip or sanitize the injecting content, harden the system prompt with explicit precedence rules, and add a validator that rejects answers referencing instructions not in the system prompt.

Cost velocity

Dollar spend per execution is rising over time, even when the workload looks the same. Could be a model upgrade, a prompt that grew silently, or a retry storm.

Signal: Rolling mean cost per execution increased above the configured threshold relative to the prior window.

Playbook (Tier 1): Diff the recent prompt against the previous one, check the model id (auto-upgrades from one provider tier to another happen quietly), and verify retries aren't fanning out invisibly.
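
One way to read the signal is as a comparison of two adjacent windows of per-execution cost. The sketch below assumes windows measured in executions and a relative-increase threshold; the function name, the window size, and the 25 percent default are illustrative assumptions, not Mesedi's configuration.

```python
from statistics import fmean


def cost_velocity_fired(costs_usd, window=50, max_increase=0.25):
    """Compare mean cost of the latest window against the window before it."""
    if len(costs_usd) < 2 * window:
        return False  # not enough history for two full windows
    prior = fmean(costs_usd[-2 * window:-window])
    recent = fmean(costs_usd[-window:])
    if prior <= 0:
        return recent > 0  # any spend after a free window counts as a rise
    return (recent - prior) / prior > max_increase
```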

Drift

The mix of models or the texture of prompts shifted over time. Mesedi tracks model-mix shifts (model A used to be 90%, now 60%) and lexical drift (character-3-gram cosine distance) on prompts and outputs.

Signal: Model-mix entropy or 3-gram cosine distance from the baseline window exceeds the configured threshold.

Playbook (Tier 1): Was the change intentional? If yes, snapshot a new baseline. If no, find the upstream change (provider deprecation, prompt-template tweak, A/B test rollout) and decide whether to revert.
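
To make the lexical-drift measure concrete, here is a small sketch of cosine distance over character 3-gram counts. It shows the general technique only; the function names are hypothetical, and how Mesedi builds its baseline window or normalizes text is not covered here.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_distance(a: str, b: str, n: int = 3) -> float:
    """Return 1 - cosine similarity of the two n-gram count vectors."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # a text shorter than n characters has no n-grams
    return 1.0 - dot / (norm_a * norm_b)
```

A distance near 0 means the two texts share most of their 3-grams; a distance of 1 means they share none.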

Identical-call loop

The same LLM call (identical model, identical messages) repeated within one execution. A clean indicator of an agent stuck in a no-progress cycle.

Signal: Same (model, hash(messages)) tuple appears more than N times in one execution.

Playbook (Tier 1): Add a no-progress check or a stop-after-K-identical-calls heuristic. The hard-halt mechanism with a tight step count is the bluntest version of this.
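
A stop-after-K-identical-calls check can be sketched with the same (model, hash(messages)) key the signal describes. The dict layout of a call and the helper names below are assumptions for illustration.

```python
import hashlib
import json
from collections import Counter


def call_key(model, messages):
    """Build a (model, hash(messages)) key from JSON-serializable messages."""
    payload = json.dumps(messages, sort_keys=True, separators=(",", ":"))
    return model, hashlib.sha256(payload.encode("utf-8")).hexdigest()


def has_identical_call_loop(calls, max_repeats=3):
    """True when any key appears more than max_repeats times in one execution."""
    counts = Counter(call_key(c["model"], c["messages"]) for c in calls)
    return any(n > max_repeats for n in counts.values())
```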

Similar-call loop

Like identical-call, but the model or messages differ slightly (paraphrase, swapped tool name). Catches looping agents that flutter just enough to evade the identical-call detector.

Signal: Pairs of llm_call events with high lexical similarity (cosine > threshold) within one execution.

Playbook (Tier 1): Same as identical-call: add a no-progress check, log what the agent is trying to vary, and consider whether the prompt needs an explicit don't-repeat rule.

Data leakage

An outbound payload contained a credential, signed token, or PII pattern. Mesedi scans every llm_call (system prompt, user prompt, response) and every tool_call (arguments, return value) against thirteen built-in patterns covering AWS, GCP, Stripe, GitHub, Slack, OpenAI keys, JWTs, SSNs, credit cards, and private-key PEM blocks, plus any custom regex rules you configured per project.

Signal: A dlp_scan_result event with severity=critical or severity=high on any sibling llm_call or tool_call event.

Playbook (Tier 1): Redact the offending value at the source. Most leaks are accidental string interpolation that includes an env var; tightening the prompt template or the tool wrapper usually closes it. Mesedi already redacted the value in storage; the event records what type of secret was caught, not the secret itself.
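
The sketch below shows the general scan-and-redact idea with two illustrative patterns: the AKIA-prefixed AWS access key ID format and a PEM private-key header. These regexes and the scan_and_redact helper are assumptions for this example; they are not Mesedi's built-in pattern set, and real rules need more care around false positives.

```python
import re

# Illustrative patterns only, not the built-in rules.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_pem": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
}


def scan_and_redact(text):
    """Return (secret types found, text with each match replaced by a marker)."""
    found = []
    for kind, pattern in PATTERNS.items():
        redacted, hits = pattern.subn(f"[REDACTED:{kind}]", text)
        if hits:
            found.append(kind)
            text = redacted
    return found, text
```

The return value carries the type of secret and the redacted text, never the matched value, which mirrors the storage behavior described above.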

Infrastructure throttled

An LLM provider returned HTTP 429, or your circuit breaker tripped. This is an operations signal rather than a code bug: the provider is rate-limiting you or you've exhausted a quota. Mesedi surfaces these separately from regular tool failures so the right person triages them.

Signal: An infrastructure_event with reason=rate_limited, circuit_breaker_open, or quota_exhausted.

Playbook (Tier 1): Raise the rate limit with the provider, add backoff with jitter at the SDK call site, or split traffic across multiple API keys. If the rate limit is intentional, surface it to your product (a banner that says 'we're rate-limited, results may be slow').

Context overflow

Cumulative input tokens crossed the model's context window. The agent is now talking to a truncated prompt, almost certainly missing the early portion of the system instructions or the conversation history. The detector warns at 90 percent of the window and fails at 100 percent.

Signal: Sum of input_tokens across llm_call events exceeds the configured context window for the model in use.

Playbook (Tier 1): Trim conversation history more aggressively, switch to a model with a larger window (Mesedi knows the limits for Claude, GPT-5, Gemini, Llama-4, Mistral), or compress retrieved context with a smaller summarization pass before the main call.
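
The warn and fail levels can be pictured as a check on the running token total. In this sketch the function name and the inclusive handling of the exact boundary are assumptions; the context window is passed in rather than looked up per model.

```python
def context_status(input_tokens, context_window, warn_ratio=0.9):
    """Classify cumulative input tokens against a model's context window."""
    total = sum(input_tokens)
    if total >= context_window:
        return "fail"  # at or past 100 percent of the window
    if total >= warn_ratio * context_window:
        return "warn"  # at or past 90 percent by default
    return "ok"
```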

Token waste

The same leading prompt prefix was sent three or more times in one execution. You're paying for the same tokens repeatedly. Common cause: the agent re-includes the entire conversation history (or a large system prompt) on every retry instead of building up incrementally.

Signal: The first 2048 characters of three or more llm_call user prompts in the same execution hash to the same value.

Playbook (Tier 1): Refactor the prompt builder to send the changing slice only, not the whole history. Prompt caching (Anthropic, OpenAI) is the technical solution; the underlying problem is usually that retry logic re-builds the whole prompt each time.
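
The signal above amounts to hashing a fixed-length prefix of each user prompt and counting collisions. A minimal sketch, with hypothetical names and SHA-256 chosen arbitrarily as the hash:

```python
import hashlib
from collections import Counter


def has_token_waste(user_prompts, prefix_chars=2048, min_repeats=3):
    """True when min_repeats or more prompts share the same leading prefix."""
    counts = Counter(
        hashlib.sha256(p[:prefix_chars].encode("utf-8")).hexdigest()
        for p in user_prompts
    )
    return any(n >= min_repeats for n in counts.values())
```

Prompts that differ only after the first 2048 characters collide, which is the case where a retry re-sends the whole history with a small changed tail.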

Semantic loop

The agent revisited the same canonical state three or more times across checkpoint events, even though the surface text of each step differs. This catches semantically equivalent loops that lexical detection misses. The detector uses canonical-state hashing (sort keys, lowercase strings, round floats), not entropy.

Signal: Three or more checkpoint events on one execution whose canonical_state_hash values collide.

Playbook (Tier 1): Add an explicit progress invariant the agent checks at each step ('did anything actually change?'), or have the agent emit a checkpoint with a one-sentence summary of what it's about to do and reject duplicates upfront.
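
Canonical-state hashing can be sketched as: normalize the state (sort keys, lowercase strings, round floats), serialize it deterministically, then hash. The helper names, the two-decimal rounding, and the assumption that checkpoint state is JSON-serializable are choices made for this example only.

```python
import hashlib
import json
from collections import Counter


def canonicalize(value, float_digits=2):
    """Lowercase strings and round floats, recursing through dicts and lists."""
    if isinstance(value, dict):
        return {k: canonicalize(v, float_digits) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [canonicalize(v, float_digits) for v in value]
    if isinstance(value, str):
        return value.lower()
    if isinstance(value, float):
        return round(value, float_digits)
    return value


def canonical_state_hash(state):
    # sort_keys makes the serialization independent of dict insertion order.
    blob = json.dumps(canonicalize(state), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def has_semantic_loop(checkpoint_states, min_visits=3):
    """True when min_visits or more checkpoints share one canonical hash."""
    counts = Counter(canonical_state_hash(s) for s in checkpoint_states)
    return any(n >= min_visits for n in counts.values())
```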

Tool schema drift

A third-party API silently changed its return shape. Your agent is now interpreting an unfamiliar schema and probably misreading it. The detector is built around return-value fingerprinting: sorted keys plus value types only, with a 2/3 majority threshold over a history of 10 or more calls.

Signal: A tool_call return value's shape hash diverges from the project's historical baseline for that tool name.

Playbook (Tier 1): Diff the new return shape against the old. Common culprit: provider API version bumps with no advance notice. Either pin the API version at the SDK boundary or update the agent's parser to handle both shapes.
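
A shape fingerprint keeps keys and value types and discards the values themselves. The sketch below is one possible reading of the description: the baseline is the shape that accounts for at least two thirds of a history of 10 or more calls, and a new return value drifts when its shape differs from that baseline. The function names and that interpretation of the majority rule are assumptions.

```python
from collections import Counter


def shape_fingerprint(value):
    """Describe a JSON-like value by sorted keys and type names only."""
    if isinstance(value, dict):
        inner = ",".join(f"{k}:{shape_fingerprint(value[k])}" for k in sorted(value))
        return "{" + inner + "}"
    if isinstance(value, list):
        inner = ",".join(sorted({shape_fingerprint(v) for v in value}))
        return "[" + inner + "]"
    return type(value).__name__


def baseline_shape(history, min_calls=10, majority=2 / 3):
    """Return the dominant historical shape, or None if there is no clear one."""
    if len(history) < min_calls:
        return None
    shape, count = Counter(history).most_common(1)[0]
    return shape if count / len(history) >= majority else None


def has_drifted(history, new_value):
    base = baseline_shape(history)
    return base is not None and shape_fingerprint(new_value) != base
```

Because only types are recorded, a changed value such as a new id does not register as drift, while an integer id turning into a string does.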

Grounding failure

External evaluators (Ragas, Promptfoo, Vectara HHEM, custom LLM judges) reported the output diverged from the retrieved context. Mesedi does not run the evaluation itself; you emit eval_score events and Mesedi aggregates them.

Signal: Any eval_score event with passed=false, or mean score across all eval_score events on the execution falls below 0.5 (for higher-is-better metrics).

Playbook (Tier 1): Open the eval_score events on the execution and read the evaluator's reason field. Usually the retrieved context is good but the agent ignored it; that's a prompting fix. Sometimes the retrieval is wrong; that's a RAG-pipeline fix.
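
The aggregation rule in the signal can be written down directly. In this sketch each eval_score event is a dict; the passed field comes from the signal description, while the score field name and the function name are assumptions, and all scores are taken to be higher-is-better.

```python
def is_grounding_failure(eval_scores, min_mean=0.5):
    """Apply the any-failed-or-low-mean rule to one execution's eval_score events."""
    if any(e.get("passed") is False for e in eval_scores):
        return True
    scores = [e["score"] for e in eval_scores if "score" in e]
    if not scores:
        return False  # nothing to average, and nothing explicitly failed
    return sum(scores) / len(scores) < min_mean
```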

Cascading failure

A parent agent handed off to a sub-agent, and the sub-agent crashed, timed out, or hit validation_failed. Without this detector, you'd see two separate failure groups for the same logical bug. Mesedi collapses them into one cluster attributed to the specific agent edge that broke.

Signal: An agent_handoff event on the parent execution whose child_execution_id resolves to a child in a failure terminal state.

Playbook (Tier 1): Open the topology view for the parent run, find the failing child node, fix it. The child node deep-links to its own execution detail where the actual crash is recorded.

Coordination deadlock

Two agents in this execution's topology subtree are waiting on each other to release control. Mesedi detects 2-cycles in the agent_handoff graph (A→B AND B→A) and surfaces them as the Coffman 'circular wait' condition.

Signal: Any pair of agent_handoff edges in the topology subtree forms a 2-cycle on agent-role names.

Playbook (Tier 1): Decide which role is the unambiguous owner of the in-flight work. Coordination deadlocks always have an unwritten assumption about who is in charge; making that assumption explicit in the role definitions usually closes the loop.
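
Finding 2-cycles in a handoff graph needs only a set of directed edges. A minimal sketch, assuming each handoff is a (from_role, to_role) pair of role names; the function name is hypothetical.

```python
def find_two_cycles(handoffs):
    """Return sorted role pairs (a, b) where both a->b and b->a handoffs exist."""
    edges = {(a, b) for a, b in handoffs if a != b}  # ignore self-handoffs
    pairs = {tuple(sorted((a, b))) for a, b in edges if (b, a) in edges}
    return sorted(pairs)
```

Each mutual pair is reported once regardless of edge direction or how many times the handoff repeated.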

Provider incident

Multiple distinct tenants in this project saw the same provider error (Anthropic, OpenAI, Gemini) within the recent 15-minute window. Strong signal the provider is having an incident rather than per-tenant code being broken. Cross-tenant detection; not visible in single-tenant tools.

Signal: The number of distinct tenants with at least one llm_call error matching the same (provider, error_class) pair in the project within the last 15 minutes meets the configured threshold (default 2).

Playbook (Tier 1): Check the provider's status page. If they're down, raise it in your incident channel and wait. If they're up, the cross-tenant pattern is likely a network or auth issue on your end; investigate the shared infrastructure.
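
The cross-tenant count can be sketched as grouping recent errors by (provider, error_class) and counting distinct tenants per group. The event field names ts, tenant_id, provider, and error_class, and the use of epoch seconds, are assumptions for illustration.

```python
from collections import defaultdict


def provider_incidents(errors, now, window_seconds=900, min_tenants=2):
    """Return (provider, error_class) pairs seen by enough distinct tenants."""
    tenants = defaultdict(set)
    for e in errors:
        # Keep only errors inside the trailing window (15 minutes by default).
        if 0 <= now - e["ts"] <= window_seconds:
            tenants[(e["provider"], e["error_class"])].add(e["tenant_id"])
    return {key for key, seen in tenants.items() if len(seen) >= min_tenants}
```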

Sandbox escape

A code-execution tool attempted to read host secrets, open raw sockets, access instance metadata, traverse paths outside its sandbox, or escalate privileges. Nine pattern classes are built in, and the scan covers both arguments and return values regardless of whether the call itself succeeded.

Signal: A tool_call event whose arguments or return value match a known sandbox-escape pattern (python_os_import, shell_invocation, dynamic_code_eval, raw_socket_open, instance_metadata_access, proc_sys_access, privilege_escalation, host_secret_read, path traversal).

Playbook (Tier 1): Investigate whether the agent legitimately needed the behavior or was attempting to break out. If legitimate, document it and add an exception. If not, harden the sandbox (drop network capabilities, mount read-only, restrict syscalls) and re-prompt the agent against acceptable behaviors.

HITL timeout

A human-in-the-loop request either explicitly timed out (response_kind=timeout) or breached the customer-declared SLA (wait_duration_ms > sla_seconds * 1000). The human side of your HITL loop is dropping or stalling requests.

Signal: A human_intervention event with response_kind=timeout, OR an event whose wait_duration_ms exceeds sla_seconds * 1000.

Playbook (Tier 1): If it's an explicit timeout (response_kind=timeout), check whether your HITL queue worker is healthy, the on-call rotation has the right people, and your notification channel (email, Slack, PagerDuty) actually fires. If it's an SLA breach, tune the SLA up to match real human response time or invest in faster routing.

HITL rejection spike

An unusually high fraction of recent runs in this project came back as rejected (humans saying no) or edited (humans modifying the output before approving). This is the strongest indicator of agent quality regression Mesedi can produce, because the humans are the canary.

Signal: In the last hour, at least 5 distinct executions had a human_intervention event, AND at least 40 percent of those came back as rejected (for the rejected variant) OR at least 30 percent came back as edited (for the edited variant).

Playbook (Tier 1): Roll back the most recent agent change. Then read the response_payload on the rejected events to identify the pattern. Edits are subtler quality drift; rejections are 'this is wrong.' Both signal the agent's outputs need correction.
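
The thresholds in the signal can be sketched as a per-execution tally. This example assumes the caller has already filtered human_intervention events to the last hour, that each event reduces to an (execution_id, response_kind) pair, and that an execution counts as rejected or edited if any of its interventions did. The response_kind values "rejected" and "edited" and the function name are assumptions.

```python
from collections import defaultdict


def hitl_spike_variants(interventions, min_executions=5,
                        rejected_ratio=0.40, edited_ratio=0.30):
    """Return which spike variants fire: 'rejected', 'edited', both, or none."""
    kinds_by_execution = defaultdict(set)
    for execution_id, response_kind in interventions:
        kinds_by_execution[execution_id].add(response_kind)
    total = len(kinds_by_execution)
    if total < min_executions:
        return []  # too few executions to call it a spike
    rejected = sum("rejected" in kinds for kinds in kinds_by_execution.values())
    edited = sum("edited" in kinds for kinds in kinds_by_execution.values())
    variants = []
    if rejected / total >= rejected_ratio:
        variants.append("rejected")
    if edited / total >= edited_ratio:
        variants.append("edited")
    return variants
```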

How playbooks render

When a failure group opens, the dashboard renders the matching playbook in the same view. Playbook content lives as Markdown under backend/internal/playbooks/content/<class>/ in the monorepo. Playbook lookup is (class, signature) with fallback to a per-class default, so very common signatures can eventually get their own targeted playbook without touching the default.
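
The lookup-with-fallback described above can be pictured as a two-step dictionary lookup. This is a sketch in Python for consistency with the other examples, not the Go backend's code; the in-memory dict, the use of None as the per-class default key, and the sample entries are all hypothetical.

```python
def lookup_playbook(playbooks, failure_class, signature):
    """Prefer a (class, signature) playbook, else fall back to the class default."""
    specific = (failure_class, signature)
    if specific in playbooks:
        return playbooks[specific]
    # None stands in for "the per-class default" in this sketch.
    return playbooks.get((failure_class, None))


# Hypothetical content: one default and one signature-specific playbook.
playbooks = {
    ("crash", None): "Read the traceback, fix the bug, redeploy.",
    ("crash", "a1b2c3"): "Known issue: upgrade the parser dependency.",
}
```

Adding a targeted entry for a common signature leaves the default untouched, which is the property the fallback is meant to provide.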

Today, Mesedi ships the recommendation surface: a static playbook keyed by failure class and signature that surfaces the right canonical fix the moment a failure clusters.

What's next?

Multi-agent topology and handoffs covers the parent/child execution graph, the agent_handoff event, and how cascading_failure, coordination_deadlock, and provider_incident cluster across executions.

Human-in-the-loop covers pausing executions for human input, the request/response capture event, SLA tracking, and the rejection-spike detector for agent quality regressions.

OpenTelemetry parallel emission covers routing Mesedi events to your existing Datadog, Honeycomb, or Grafana Tempo backend so agent traces sit alongside the rest of your infrastructure telemetry.

Self-hosting guide covers running the Go backend behind your own infrastructure if you'd rather not depend on the hosted Cloud version.

HTTP API reference covers the wire format detectors read from.