Ignoring Durability in Agent Workflows (Hold)

Overview

Ignoring durability in agent workflows means building agent loops that depend on in-memory execution, transient chat history, best-effort retries, or ad hoc state files instead of a durable execution model. Durable execution is the practice of saving workflow progress at key points so a process can pause and later resume from the last recorded state. LangGraph describes it as especially useful for human-in-the-loop workflows and for long-running tasks that may hit interruptions or LLM timeouts (LangGraph durable execution).
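
To make "resume from the last recorded state" concrete, here is a minimal sketch using only the Python standard library. It is not how LangGraph or any other runtime implements checkpointing; the names FileCheckpointStore, run_workflow, and demo are hypothetical, and the JSON-file store is an assumption chosen to keep the example self-contained.

```python
import json
import os
import tempfile
from typing import Callable

Step = Callable[[dict], dict]


class FileCheckpointStore:
    """Hypothetical checkpoint store: one JSON file per run id."""

    def __init__(self, directory: str) -> None:
        self.directory = directory

    def _path(self, run_id: str) -> str:
        return os.path.join(self.directory, f"{run_id}.json")

    def load(self, run_id: str) -> dict:
        try:
            with open(self._path(run_id), encoding="utf-8") as fh:
                return json.load(fh)
        except FileNotFoundError:
            # No checkpoint yet: start from the first step with empty state.
            return {"next_step": 0, "state": {}}

    def save(self, run_id: str, next_step: int, state: dict) -> None:
        # Write to a temp file and rename it, so a crash mid-write
        # never leaves a half-written checkpoint behind.
        tmp = self._path(run_id) + ".tmp"
        with open(tmp, "w", encoding="utf-8") as fh:
            json.dump({"next_step": next_step, "state": state}, fh)
        os.replace(tmp, self._path(run_id))


def run_workflow(run_id: str, steps: list[Step], store: FileCheckpointStore) -> dict:
    """Run steps in order, skipping any that a previous attempt completed."""
    checkpoint = store.load(run_id)
    state = checkpoint["state"]
    for index in range(checkpoint["next_step"], len(steps)):
        state = steps[index](state)
        store.save(run_id, index + 1, state)
    return state


def demo() -> tuple[dict, dict]:
    calls = {"plan": 0, "act": 0}

    def plan(state: dict) -> dict:
        calls["plan"] += 1
        return {**state, "plan": "book room"}

    def act(state: dict) -> dict:
        calls["act"] += 1
        if calls["act"] == 1:
            raise TimeoutError("simulated LLM timeout")
        return {**state, "result": "booked"}

    with tempfile.TemporaryDirectory() as directory:
        store = FileCheckpointStore(directory)
        try:
            run_workflow("run-1", [plan, act], store)
        except TimeoutError:
            pass  # a real process might die here; the checkpoint stays on disk
        final_state = run_workflow("run-1", [plan, act], store)
    return final_state, calls
```

In the demo, the second call to run_workflow skips the already-completed plan step and retries only act. Note that the step that failed is executed again in full, which is why side-effecting steps also need idempotency.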

This is a Hold item because production agents behave like distributed systems: they call models, tools, APIs, databases, MCP servers, queues, humans, and other agents, and each step can fail independently. Temporal frames AI agents as distributed systems with many remote requests and says Durable Execution lets an application pick up where it left off after crashes, restarts, network glitches, rate limits, or unavailable APIs (Temporal AI durable execution). Restate makes the same point more directly for agent loops: every tool call is a remote hop, every user interaction is a pause, and, without durable steps and idempotency, every retry risks doing the same work twice (Restate durable AI loops).
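
The "durable steps and idempotency" pairing can be illustrated with a small journal that records each step result under a stable key and passes the same key to the downstream call. This is a sketch under stated assumptions, not Restate's or Temporal's API: StepJournal, run_once, and create_ticket are hypothetical names, SQLite stands in for a durable store, and the fake ticket API is assumed to deduplicate on the key it receives.

```python
import hashlib
import json
import sqlite3
from typing import Any, Callable


class StepJournal:
    """Hypothetical journal: stores each step's result under a stable key."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS journal "
            "(key TEXT PRIMARY KEY, result TEXT NOT NULL)"
        )

    def run_once(self, run_id: str, step: str, args: dict, fn: Callable[..., Any]) -> Any:
        # The key is derived from the run, the step name, and the arguments,
        # so a retry of the same logical step always produces the same key.
        payload = json.dumps(args, sort_keys=True)
        key = hashlib.sha256(f"{run_id}:{step}:{payload}".encode()).hexdigest()
        row = self.conn.execute(
            "SELECT result FROM journal WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # already recorded: do not call fn again
        result = fn(idempotency_key=key, **args)
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT INTO journal (key, result) VALUES (?, ?)",
                (key, json.dumps(result)),
            )
        return result


# Stand-in for a remote ticketing API that deduplicates on the idempotency key.
tickets: dict[str, dict] = {}
api_calls: list[str] = []


def create_ticket(idempotency_key: str, title: str) -> dict:
    api_calls.append(idempotency_key)
    if idempotency_key not in tickets:
        tickets[idempotency_key] = {"id": len(tickets) + 1, "title": title}
    return tickets[idempotency_key]
```

The journal alone does not close every gap: a crash after create_ticket returns but before the INSERT commits would trigger a second call on retry. Passing the idempotency key to the downstream system is what keeps that second call from creating a second ticket.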

Starting with a lightweight prototype before requirements are clear is not the anti-pattern. The anti-pattern is carrying that prototype architecture into production after the workflow starts making irreversible changes, waiting on people, spanning minutes or days, running across workers, or coordinating multiple systems. Once an agent creates tickets, books resources, writes customer data, triggers payments, modifies repositories, or performs operational actions, durability becomes part of the product contract rather than an implementation detail.

Adoption Signals

  • Agent frameworks are making durability a first-class concern. LangGraph documents durable execution through persistence, checkpointers, thread identifiers, resumability, human-in-the-loop interrupts, graceful shutdown, and durability modes that trade off performance and checkpoint consistency (LangGraph durable execution).
  • Durable execution platforms are explicitly targeting AI agent workloads. Temporal describes model calls, tool calls, MCP clients, external APIs, human approval, workflow memory, and long-running execution as natural fits for Workflows, Activities, Signals, Updates, Queries, and event history (Temporal AI durable execution).
  • Microsoft Agent Framework separates in-memory execution from durable runtime execution, stating that in-memory workflows lose all state on process exit, while the durable runtime provides stateful durable execution, automatic checkpointing after each step, distributed execution, long-running orchestration, and dashboard observability (Microsoft Agent Framework durable workflows).
  • Pydantic AI now documents Temporal integration for durable agents, where model requests, I/O-heavy tool calls, and MCP server communication are offloaded to Temporal activities while deterministic coordination logic lives in the workflow (Pydantic AI Temporal integration).
  • The design pattern is no longer tied to one agent SDK. Restate demonstrates durable AI loops across the Vercel AI SDK and OpenAI Agents SDK, arguing that LLM inference and tool invocation should be wrapped into durable steps that can be restored after a crash (Restate durable AI loops).
  • Human-in-the-loop is a major forcing function. LangGraph highlights durable execution for workflows that pause for users to inspect, validate, or modify the process, and Temporal uses Signals, Updates, and Queries for human approval patterns (LangGraph durable execution, Temporal AI durable execution).
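
The human-in-the-loop signal above comes down to one idea: a paused run is a persisted row, not a live process. The sketch below illustrates that with SQLite and hypothetical names (open_store, start_run, resume_run, demo); it does not reproduce LangGraph interrupts or Temporal Signals, and the status values are assumptions made for the example.

```python
import json
import os
import sqlite3
import tempfile


def open_store(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs "
        "(run_id TEXT PRIMARY KEY, status TEXT NOT NULL, state TEXT NOT NULL)"
    )
    return conn


def start_run(conn: sqlite3.Connection, run_id: str, proposal: str) -> str:
    """Persist the run and stop; no worker stays alive while a human decides."""
    with conn:
        conn.execute(
            "INSERT INTO runs VALUES (?, 'waiting_for_approval', ?)",
            (run_id, json.dumps({"proposal": proposal})),
        )
    return "waiting_for_approval"


def resume_run(conn: sqlite3.Connection, run_id: str, approved: bool) -> str:
    """Apply the human decision once; later duplicate resumes change nothing."""
    row = conn.execute(
        "SELECT state FROM runs WHERE run_id = ?", (run_id,)
    ).fetchone()
    if row is None:
        raise KeyError(run_id)
    state = {**json.loads(row[0]), "approved": approved}
    new_status = "completed" if approved else "rejected"
    with conn:
        # The status check in the WHERE clause makes the transition conditional.
        cursor = conn.execute(
            "UPDATE runs SET status = ?, state = ? "
            "WHERE run_id = ? AND status = 'waiting_for_approval'",
            (new_status, json.dumps(state), run_id),
        )
    if cursor.rowcount == 0:
        # Someone already resumed this run: report the recorded outcome.
        return conn.execute(
            "SELECT status FROM runs WHERE run_id = ?", (run_id,)
        ).fetchone()[0]
    return new_status


def demo() -> list[str]:
    statuses = []
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "runs.db")
        conn = open_store(path)
        statuses.append(start_run(conn, "run-7", "refund order 1234"))
        conn.close()  # simulate the worker exiting while the human decides

        conn = open_store(path)  # a different worker picks the run up later
        statuses.append(resume_run(conn, "run-7", approved=True))
        statuses.append(resume_run(conn, "run-7", approved=False))  # duplicate
        conn.close()
    return statuses
```

The second resume attempt returns the already-recorded outcome instead of overwriting it, which is the behavior a durable pause/resume mechanism needs when approval messages are delivered more than once.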

Risks

  • State loss becomes user-visible. Microsoft notes that an in-process runner executes everything in memory, so if the process exits because of a crash, restart, or long-running step boundary, all workflow state is lost (Microsoft Agent Framework durable workflows).
  • Retries can duplicate side effects. LangGraph advises making side effects such as API calls and file writes idempotent, using idempotency keys or existing-result checks, and wrapping side-effecting operations in tasks so resumed workflows do not unintentionally repeat work (LangGraph durable execution).
  • LLM and tool calls are unreliable boundaries. Temporal identifies network glitches, LLM unavailability, rate limiting, inaccessible APIs, crashes, and restarts as expected failure modes, and recommends putting LLM calls and external tool/resource access behind Activities that provide retry and durable state behavior (Temporal AI durable execution).
  • Human handoffs become fragile without suspend/resume. Durable execution lets workflows pause while waiting for approval or external input and then resume from persisted state; without it, teams often keep workers alive, poll, replay chat history, or reconstruct state manually after long waits (LangGraph durable execution, Restate durable AI loops).
  • Determinism constraints are easy to miss. Pydantic AI's Temporal integration separates deterministic workflows from non-deterministic activities because workflow code must be replayable, while model requests, tool calls, MCP communication, disk access, and network I/O belong in activities (Pydantic AI Temporal integration).
  • Checkpoints alone may not be enough. A checkpoint can save state, but production reliability also requires failure detection, retry policy, duplicate-execution prevention, locking or coordination, observability, ownership, and operator controls for stuck workflows.
  • Observability suffers when execution history is ephemeral. Temporal records full event history and returned Activity values, Microsoft highlights dashboard-based inspection of workflow runs and step inputs/outputs, and Restate tracks workflow steps and agent interactions; non-durable loops usually lack this operational history (Temporal AI durable execution, Microsoft Agent Framework durable workflows, Restate durable AI loops).
  • Operational ownership becomes unclear. Without durable run records, queues, dashboards, retry metadata, and terminal states, teams cannot reliably answer whether an agent is still running, paused for a human, blocked on a tool, already completed, safe to retry, or partially executed.
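
The determinism constraint in the list above is easier to see in miniature. In the sketch below, orchestration code is deterministic and every non-deterministic call goes through an activity wrapper that records its result in a history; on replay, the recorded values are returned instead of calling the function again. ReplayContext and workflow are hypothetical names, the in-memory list stands in for a durable event history, and this is a toy model of the idea rather than Temporal's or Pydantic AI's implementation.

```python
import random
from typing import Any, Callable


class ReplayContext:
    """Toy event history: records activity results and replays them in order."""

    def __init__(self, history: list[Any]) -> None:
        self.history = history  # would be durable storage in a real runtime
        self._cursor = 0

    def activity(self, fn: Callable[[], Any]) -> Any:
        if self._cursor < len(self.history):
            # Replay: reuse the recorded result, do not call fn again.
            result = self.history[self._cursor]
        else:
            # First execution: run the non-deterministic call and record it.
            result = fn()
            self.history.append(result)
        self._cursor += 1
        return result


def workflow(ctx: ReplayContext) -> dict:
    # Orchestration logic is deterministic: same history in, same path out.
    # Randomness stands in for model calls and other non-deterministic I/O.
    draft = ctx.activity(lambda: f"draft-{random.randint(0, 10**6)}")
    score = ctx.activity(lambda: random.random())
    decision = "publish" if score >= 0.5 else "revise"
    return {"draft": draft, "score": score, "decision": decision}
```

Running workflow twice against the same history yields the same result, because the second run replays recorded values. If the orchestration code itself called random.random() outside an activity, the replayed run could take a different branch than the original, which is the failure mode the constraint guards against.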

Pros & Cons

Advantages

  • Can be acceptable for throwaway demos, local experiments, and short-lived prototypes where losing state, repeating work, or restarting manually has no material user, operational, or compliance impact.
  • Reduces early implementation overhead by avoiding workflow engines, checkpoint stores, replay constraints, idempotency keys, and orchestration infrastructure while the problem is still exploratory.
  • Keeps the first version easy to understand when the agent loop has only a few short steps, no external side effects, no human approval waits, and no need to resume after interruption.

Disadvantages

  • Fails unpredictably in production when model calls, tool calls, external APIs, queues, workers, or processes time out, crash, restart, rate limit, or return partial results.
  • Risks duplicate side effects and corrupted workflow state unless retries, checkpoints, deterministic replay boundaries, and idempotent operations are designed in explicitly.
  • Makes long-running, human-in-the-loop, multi-agent, and cross-system workflows hard to debug, audit, resume, or operate because the execution history is not durable.

Recommendation

Hold on production agent workflows that cannot survive model errors, tool timeouts, external API failures, rate limits, worker crashes, deploy restarts, human handoffs, or partial completion. Do not rely on a single long-running process, a chat transcript, or a background loop as the system of record for work that matters. If an agent performs side effects, waits for people, coordinates multiple systems, or runs longer than an interactive request, durability should be an explicit architecture requirement.

Use a durable workflow runtime or design equivalent guarantees before scaling. At minimum, require persisted workflow state, checkpointed progress, resumable run identifiers, idempotency keys for writes, retry policies with backoff and limits, separation of deterministic orchestration from non-deterministic model/tool calls, durable human-in-the-loop pause/resume, execution history, operator-visible statuses, and clear remediation paths for failed or stuck runs.
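
As one illustration of "retry policies with backoff and limits", the following sketch bounds the number of attempts, applies exponential backoff with full jitter, and surfaces a distinct terminal error when attempts run out. RetryPolicy, RetriesExhausted, and call_with_retry are hypothetical names, and the default values are placeholders rather than recommendations.

```python
import random
import time
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int = 4
    base_delay: float = 0.5  # seconds
    max_delay: float = 8.0   # seconds

    def delay(self, attempt: int) -> float:
        """Full-jitter exponential backoff for a 1-based attempt number."""
        cap = min(self.max_delay, self.base_delay * 2 ** (attempt - 1))
        return random.uniform(0, cap)


class RetriesExhausted(Exception):
    """Terminal failure: operators should see this state, not a silent loop."""


def call_with_retry(
    fn: Callable[[], Any],
    policy: RetryPolicy,
    retry_on: tuple[type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    if policy.max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, policy.max_attempts + 1):
        try:
            return fn()
        except retry_on as exc:
            if attempt == policy.max_attempts:
                raise RetriesExhausted(f"gave up after {attempt} attempts") from exc
            sleep(policy.delay(attempt))
```

Errors outside retry_on propagate immediately, so a validation error is not retried as if it were a timeout. The sleep parameter is injectable so tests can run without real delays.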

Evaluate the durability model with failure injection rather than documentation alone. Kill workers mid-run, restart during tool calls, simulate LLM rate limits, replay after deployment, retry write operations, pause for a human approval overnight, run duplicate resume attempts, rename or cancel tasks, and verify that the workflow either resumes correctly or fails in an observable, recoverable terminal state. Move away from this Hold only when agent workflows have repeatable recovery semantics and operators trust the run history.
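
A failure-injection check of this kind can start very small. The sketch below kills a toy workflow at the most awkward point, after a side effect has been written but before the checkpoint advances, then resumes and verifies that every effect landed exactly once. All names (InjectedCrash, FakeLedger, run, verify_recovery) are hypothetical, and the in-memory ledger is assumed to deduplicate on an idempotency key the way a real downstream system would need to.

```python
class InjectedCrash(Exception):
    """Raised by the harness to simulate a worker being killed mid-run."""


class FakeLedger:
    """Stand-in for an external system that deduplicates on an idempotency key."""

    def __init__(self) -> None:
        self.entries: dict[str, str] = {}
        self.requests = 0

    def write(self, key: str, value: str) -> None:
        self.requests += 1
        self.entries.setdefault(key, value)  # a repeated key is a no-op


def run(values, checkpoint, ledger, crash_after_write=None):
    """Apply each value as a side effect, checkpointing after each one."""
    for index in range(checkpoint["next"], len(values)):
        ledger.write(f"run-1:step-{index}", values[index])
        if index == crash_after_write:
            # Worst case: the effect happened but the checkpoint did not move.
            raise InjectedCrash(f"killed after step {index}")
        checkpoint["next"] = index + 1


def verify_recovery(values: list[str]) -> int:
    """Crash after every step in turn; return the number of scenarios checked."""
    scenarios = 0
    for crash_at in range(len(values)):
        checkpoint, ledger = {"next": 0}, FakeLedger()
        try:
            run(values, checkpoint, ledger, crash_after_write=crash_at)
        except InjectedCrash:
            pass
        run(values, checkpoint, ledger)  # resume from the last checkpoint
        # Every effect is present exactly once, in order ...
        assert list(ledger.entries.values()) == values
        # ... even though the interrupted step was sent twice.
        assert ledger.requests == len(values) + 1
        scenarios += 1
    return scenarios
```

The same structure extends to the other scenarios listed above, such as duplicate resume attempts or restarts during tool calls, by adding further crash points and asserting on the terminal state each time.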

Sources