Topology and handoffs.

Modern agent systems decompose into a planner and one or more workers, a supervisor and tools, or a swarm of specialized sub-agents. Mesedi models each of those agents as its own execution and links them via a parent/child graph. When a sub-agent crashes or two agents deadlock, the failure clusters at the right level instead of producing a confusing pile of disconnected single-execution failures.

The topology graph

Every execution carries an optional parent_execution_id. When the SDK opens a nested @wrap (Python) or wrap() (TypeScript) inside another wrapped execution, the inner execution's parent_execution_id is automatically set to the outer one. The backend traverses that pointer recursively via a SQL CTE to build a tree of every reachable ancestor and descendant for any execution_id.
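The recursive traversal described above can be sketched with SQLite's WITH RECURSIVE. This is an illustration only, not Mesedi's actual query: the executions table, its two columns, and the topology helper are hypothetical, and the real backend schema is not shown in this page.

```python
import sqlite3

# Hypothetical minimal schema: only the columns needed to walk the graph.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE executions (id TEXT PRIMARY KEY, parent_execution_id TEXT)"
)
conn.executemany(
    "INSERT INTO executions VALUES (?, ?)",
    [
        ("root", None),
        ("planner", "root"),
        ("researcher", "planner"),
        ("writer", "planner"),
    ],
)

TOPOLOGY_SQL = """
WITH RECURSIVE
  ancestors(id, depth) AS (
    -- Step up from the seed to its parent, then keep climbing.
    SELECT parent_execution_id, -1 FROM executions
      WHERE id = :seed AND parent_execution_id IS NOT NULL
    UNION ALL
    SELECT e.parent_execution_id, a.depth - 1
      FROM executions e JOIN ancestors a ON e.id = a.id
      WHERE e.parent_execution_id IS NOT NULL AND a.depth > 0 - :max_depth
  ),
  descendants(id, depth) AS (
    -- The seed itself sits at depth 0; children follow the pointer downward.
    SELECT id, 0 FROM executions WHERE id = :seed
    UNION ALL
    SELECT e.id, d.depth + 1
      FROM executions e JOIN descendants d ON e.parent_execution_id = d.id
      WHERE d.depth < :max_depth
  )
SELECT id, depth FROM ancestors
UNION ALL
SELECT id, depth FROM descendants
ORDER BY depth, id
"""


def topology(conn, seed, max_depth=8):
    """Return (execution_id, signed_depth) pairs for the seed's ancestors and descendants."""
    rows = conn.execute(TOPOLOGY_SQL, {"seed": seed, "max_depth": max_depth})
    return [(row[0], row[1]) for row in rows]
```

Calling topology(conn, "planner") walks one step up to root and one step down to both workers. Siblings of the seed are not included, because they are neither ancestors nor descendants.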

The HTTP endpoint that serves the tree is GET /executions/{id}/topology?depth=N (default depth 8, max 32). The response is a flat list of TopologyNode records ordered by signed depth: negative depths are ancestors, zero is the seed execution you asked about, positive depths are descendants. Each node carries its status, duration, SDK language, and (when classified) its failure_group_id.
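A client consuming this endpoint mostly needs two things: a request path with a sane depth, and a way to split the flat list by signed depth. The sketch below assumes field names (execution_id, depth, status, failure_group_id) that mirror the description above; the exact wire format is in the API reference, and clamping the depth client-side is a choice made here, not documented server behavior.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

DEFAULT_DEPTH = 8   # documented default
MAX_DEPTH = 32      # documented maximum


def topology_path(execution_id: str, depth: int = DEFAULT_DEPTH) -> str:
    """Build the request path, capping depth at the documented maximum."""
    return f"/executions/{execution_id}/topology?depth={min(depth, MAX_DEPTH)}"


@dataclass
class TopologyNode:
    # Field names are assumptions for illustration, not the confirmed wire format.
    execution_id: str
    depth: int
    status: str
    failure_group_id: Optional[str] = None


def split_by_depth(
    nodes: List[TopologyNode],
) -> Tuple[List[TopologyNode], TopologyNode, List[TopologyNode]]:
    """Partition a flat topology list into (ancestors, seed, descendants)."""
    ancestors = [n for n in nodes if n.depth < 0]
    seed = next(n for n in nodes if n.depth == 0)
    descendants = [n for n in nodes if n.depth > 0]
    return ancestors, seed, descendants
```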

The dashboard renders the topology on the execution detail page as an indented tree. Single-agent runs (no parent, no children) hide the section entirely so the page stays uncluttered. Every node is a deep link to that execution's own detail page, so an operator can climb up the parent chain or descend into a sub-agent without leaving the timeline they came from.
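For operators who want the same view in a terminal, an indented tree can be rebuilt from the flat list. This sketch assumes each node dict also exposes parent_execution_id, which the description above does not confirm; the "completed" status in the usage note is likewise a placeholder.

```python
from collections import defaultdict


def render_topology(nodes):
    """Render node dicts (execution_id, parent_execution_id, status) as an indented tree."""
    if len(nodes) <= 1:
        return ""  # single-agent run: nothing worth showing

    ids = {n["execution_id"] for n in nodes}
    children = defaultdict(list)
    roots = []
    for n in nodes:
        parent = n.get("parent_execution_id")
        if parent in ids:
            children[parent].append(n)
        else:
            roots.append(n)  # parent missing or outside the fetched depth

    lines = []

    def walk(node, level):
        lines.append("  " * level + f"{node['execution_id']} [{node['status']}]")
        for child in sorted(children[node["execution_id"]], key=lambda c: c["execution_id"]):
            walk(child, level + 1)

    for root in roots:
        walk(root, 0)
    return "\n".join(lines)
```

Given a root with status completed, a planner child with status crashed, and a researcher grandchild with status timeout, the function returns three lines indented by zero, two, and four spaces.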

The agent_handoff event

A handoff is the moment one agent delegates work to another. Mesedi captures it as an agent_handoff event emitted from inside the parent execution. Four well-known kinds describe the most common shapes:

delegate — one-shot, expects a return value
spawn — fire-and-forget background sub-agent
transfer — control transferred, no return expected
consult — short Q&A, return text only

Customers can also pass arbitrary strings; the backend does not enforce the enumeration. The four above are the ones downstream detectors recognize as well-known.

# Python
import mesedi

mesedi.emit_agent_handoff(
    from_agent="planner",
    to_agent="researcher",
    handoff_kind="delegate",
    task_summary="Find citations for the climate paper",
    child_execution_id=child_exec_id,  # optional, if known
    latency_ms=elapsed_ms,             # wall-clock time the parent waited
)

// TypeScript
import { emitAgentHandoff } from "mesedi";

emitAgentHandoff("planner", "researcher", {
  handoffKind: "delegate",
  taskSummary: "Find citations for the climate paper",
  childExecutionId: childExecId,
  latencyMs: elapsedMs,
});
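Since latency_ms is the wall-clock time the parent waited, one convenient pattern is a small wrapper that times the delegated call and emits the handoff afterwards. The run_with_handoff helper below is hypothetical and not part of the SDK; it takes the emit function as a parameter (for example mesedi.emit_agent_handoff) and passes only keyword arguments shown in the Python example above. Reporting latency as a whole number of milliseconds is an assumption.

```python
import time


def run_with_handoff(emit, from_agent, to_agent, task_summary, run_child,
                     handoff_kind="delegate"):
    """Run run_child(), then report the handoff with the measured wait time.

    `emit` is any callable accepting the keyword arguments of
    emit_agent_handoff; `run_child` is a zero-argument callable that
    performs the delegated work.
    """
    started = time.perf_counter()
    try:
        return run_child()
    finally:
        # Emit even when the child raises, so the failed edge is still recorded.
        latency_ms = int((time.perf_counter() - started) * 1000)
        emit(
            from_agent=from_agent,
            to_agent=to_agent,
            handoff_kind=handoff_kind,
            task_summary=task_summary,
            latency_ms=latency_ms,
        )
```

Emitting in the finally block is a design choice: a sub-agent that raises is exactly the case where the handoff record matters most.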

cascading_failure

A cascading failure is the multi-agent pathology where a parent delegates work, the sub-agent crashes or times out, and either the parent crashes in turn or carries the bad result forward. Without this detector you would see two separate failure groups for the same logical bug. Mesedi collapses them into one cluster attributed to the specific agent edge that broke.

The detector fires on the parent's terminal PATCH when any agent_handoff on the execution has a child whose status is crashed, timeout, or validation_failed. The signature is deterministic, cascading_failure:<from_agent>:<to_agent>:<child_status>, so repeated cascades along the same edge collapse into one group rather than one group per execution pair.
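The firing rule and signature format can be expressed in a few lines. This is a sketch of the logic as described, not the backend's implementation; the handoff dict shape and the status_by_execution lookup are assumptions for illustration.

```python
FAILED_CHILD_STATUSES = {"crashed", "timeout", "validation_failed"}


def cascading_failure_signatures(handoffs, status_by_execution):
    """Return the sorted, de-duplicated signatures for handoffs whose child failed.

    handoffs: dicts with from_agent, to_agent, and optionally child_execution_id.
    status_by_execution: maps execution_id to its status string.
    """
    signatures = set()
    for handoff in handoffs:
        child_status = status_by_execution.get(handoff.get("child_execution_id"))
        if child_status in FAILED_CHILD_STATUSES:
            signatures.add(
                "cascading_failure:"
                f"{handoff['from_agent']}:{handoff['to_agent']}:{child_status}"
            )
    return sorted(signatures)
```

Because the signature contains agent names and the child status but no execution IDs, two different runs that fail along planner to researcher with a timeout produce the same string and land in the same group.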

coordination_deadlock

A coordination deadlock is the Coffman circular-wait condition in the agent-role graph: two agents are suspended, each waiting on the other to release control. Mesedi detects 2-cycles in the agent_handoff edge graph within the topology subtree (A→B and B→A) and fires the detector with a canonical, lexicographically ordered signature coordination_deadlock:<agent_a>:<agent_b>.
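Detecting a 2-cycle amounts to checking whether an edge and its reverse both exist. The sketch below takes handoff edges as (from_agent, to_agent) pairs, which is an assumed input shape; ignoring self-edges is also an assumption, since the text only describes cycles between two agents.

```python
def coordination_deadlock_signatures(edges):
    """Return sorted signatures for every pair of agents that hand off to each other.

    edges: iterable of (from_agent, to_agent) tuples from agent_handoff events.
    """
    edge_set = {(a, b) for a, b in edges if a != b}  # self-edges are ignored
    signatures = set()
    for a, b in edge_set:
        if (b, a) in edge_set:
            # Sorting the pair makes A->B/B->A and B->A/A->B yield one signature.
            first, second = sorted((a, b))
            signatures.add(f"coordination_deadlock:{first}:{second}")
    return sorted(signatures)
```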

The agents are paused in your code, not on Mesedi's side. The detector surfaces the cycle so the operator can investigate why both roles think they're waiting on the other. Most deadlocks have an unwritten assumption about which role owns the in-flight work; making that assumption explicit usually fixes it.

provider_incident

When multiple distinct tenants in a project hit the same LLM provider error within a 15-minute window, the provider is almost certainly having an incident rather than each tenant's code being independently broken. Mesedi watches across all executions in a project and fires provider_incident when the cross-tenant count meets the threshold (default: 2 distinct tenants).

The signature is provider_incident:<provider>:<error_class> so each unique (provider, error_class) pair clusters separately. A single Anthropic 529 incident produces one group; if OpenAI is also having issues at the same time, that produces a second.
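The cross-tenant rule can be sketched as a windowed count of distinct tenants per (provider, error_class) pair. The input tuple shape and the provider_incident_signatures helper are hypothetical, and the string "529" stands in for whatever error_class value the backend actually assigns.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=15)  # documented window
THRESHOLD = 2                   # documented default: distinct tenants


def provider_incident_signatures(errors, now, window=WINDOW, threshold=THRESHOLD):
    """Return sorted signatures for provider errors seen by enough distinct tenants.

    errors: iterable of (timestamp, tenant_id, provider, error_class) tuples.
    now: the datetime at which the window ends.
    """
    tenants_by_key = {}
    for timestamp, tenant_id, provider, error_class in errors:
        if now - window <= timestamp <= now:
            tenants_by_key.setdefault((provider, error_class), set()).add(tenant_id)
    return sorted(
        f"provider_incident:{provider}:{error_class}"
        for (provider, error_class), tenants in tenants_by_key.items()
        if len(tenants) >= threshold
    )
```

Counting distinct tenants rather than raw errors is what separates a provider incident from one tenant retrying a broken request many times.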

What's next?

Failure classes and playbooks for the full detector catalog.

Human-in-the-loop for pausing executions and capturing the human side of agent workflows.

API reference for the topology endpoint wire format.