Why model an agent as a graph?

A naive agent loop ("LLM picks a tool, runs it, repeats until done") works in demos and then fails unpredictably: it loops forever, loses context, can't recover from a tool error, and can't be paused for human approval. LangGraph represents the agent as an explicit state graph — nodes (units of work), edges (transitions, including conditional ones), and a typed shared state that persists across steps. That structure gives you the three things production needs: control (you decide the allowed transitions), durability (state can be checkpointed and resumed), and observability (you can see exactly where a run is).
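To make the vocabulary concrete, here is a minimal sketch of a state graph in plain Python. It deliberately does not use LangGraph's own API; the names `AgentState`, `NODES`, `EDGES`, and `run` are hypothetical and exist only for illustration. It shows a typed shared state, nodes as functions, a conditional edge, a step budget, and a trace of the path taken.

```python
from typing import Callable, TypedDict


class AgentState(TypedDict):
    """Typed shared state that every node reads and writes."""
    question: str
    notes: list[str]
    answer: str
    steps: int


Node = Callable[[AgentState], AgentState]


def research(state: AgentState) -> AgentState:
    # Stand-in for a tool call or retrieval step.
    return {**state, "notes": state["notes"] + [f"fact about {state['question']}"]}


def answer(state: AgentState) -> AgentState:
    return {**state, "answer": "; ".join(state["notes"])}


def route_after_research(state: AgentState) -> str:
    # Conditional edge: keep researching until the state holds two notes.
    return "answer" if len(state["notes"]) >= 2 else "research"


NODES: dict[str, Node] = {"research": research, "answer": answer}
# Each edge maps a node to a function that names the next node ("END" stops).
EDGES: dict[str, Callable[[AgentState], str]] = {
    "research": route_after_research,
    "answer": lambda state: "END",
}


def run(state: AgentState, entry: str = "research",
        max_steps: int = 10) -> tuple[AgentState, list[str]]:
    trace: list[str] = []  # observability: the path this run took
    current = entry
    while current != "END" and state["steps"] < max_steps:
        state = NODES[current](state)
        state = {**state, "steps": state["steps"] + 1}
        trace.append(current)
        current = EDGES[current](state)  # control: only declared edges exist
    return state, trace


final, trace = run({"question": "tides", "notes": [], "answer": "", "steps": 0})
print(trace)
print(final["answer"])
```

Because the state is an ordinary dictionary with known fields, it could be serialized after any step and reloaded later, which is the idea behind checkpoint-and-resume.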

A reference architecture (hub-and-spoke)

A reliable pattern for non-trivial agents:

        ┌────────────┐
        │   PLAN     │  decompose the task, decide next step
        └─────┬──────┘
              ▼
   ┌───────────────────────┐     conditional edges
   │  ACT / RETRIEVE node  │ ───► call tools, query data, sub-agents
   └──────────┬────────────┘
              ▼
        ┌────────────┐
        │  VERIFY    │  check the result; grounded? complete? safe?
        └─────┬──────┘
         pass ▼   ▼ fail → loop back to PLAN/ACT (bounded retries)
        ┌────────────┐
        │  RESPOND   │  final answer + citations + structured output
        └────────────┘

Specialized sub-agents (research, drafting, validation) hang off the ACT node as "spokes," each with a narrow job. A supervisor/router node acts as the hub and decides which spoke to invoke. This keeps each component simple and testable on its own.
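The PLAN, ACT, VERIFY, RESPOND flow with spokes and bounded retries can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not LangGraph code: the spoke functions, the keyword-based router, the `"sourced"` grounding check, and the `MAX_RETRIES` value are all hypothetical stand-ins.

```python
from typing import Callable

MAX_RETRIES = 2  # assumed budget for the VERIFY -> PLAN loop


def research_spoke(task: str, attempt: int) -> str:
    # Hypothetical spoke: returns an ungrounded draft on the first attempt
    # so that the verify loop has something to reject.
    return f"{task}: sourced summary" if attempt > 0 else f"{task}: guess"


def drafting_spoke(task: str, attempt: int) -> str:
    return f"{task}: sourced draft"


SPOKES: dict[str, Callable[[str, int], str]] = {
    "research": research_spoke,
    "drafting": drafting_spoke,
}


def plan(task: str) -> str:
    # Supervisor/router: pick a spoke with a trivial keyword rule.
    return "drafting" if task.startswith("write") else "research"


def verify(result: str) -> bool:
    # Toy grounding check: accept only results marked as sourced.
    return "sourced" in result


def run_agent(task: str) -> dict:
    history = []
    for attempt in range(MAX_RETRIES + 1):
        spoke = plan(task)                     # PLAN
        result = SPOKES[spoke](task, attempt)  # ACT
        ok = verify(result)                    # VERIFY
        history.append((spoke, ok))
        if ok:
            # RESPOND
            return {"status": "ok", "answer": result, "history": history}
    # Graceful exit once the retry budget is spent.
    return {"status": "gave_up", "answer": None, "history": history}


print(run_agent("find tide tables"))
```

The retry loop is a `for` over a fixed range rather than a `while True`, so the worst case is known in advance: one initial attempt plus `MAX_RETRIES` retries.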

The seven things that make it production-grade

  1. Typed state. Define the shared state explicitly (e.g., a typed schema). Every node reads/writes known fields — no implicit prompt-stuffing. This is what makes runs debuggable.
  2. Tools with real error handling. Each tool call can fail (timeout, bad input, API error). Catch, classify, and route failures to a recovery path — don't let one failed call crash the run or silently corrupt state.
  3. Bounded loops. Always cap retries/iterations. Unbounded "keep trying" is how agents burn tokens and hang. Add a max-step budget and a graceful exit.
  4. Verification node. Before responding, check: is every claim grounded in retrieved data? Is the output complete and well-formed? Does it pass safety/business rules? Loop back on failure (bounded).
  5. Human-in-the-loop checkpoints. For high-stakes actions (sending an email, moving money, publishing), pause at a checkpoint and require approval. LangGraph's persistence makes pause/resume first-class.
  6. Guardrails. Validate inputs and outputs (schemas, allow-lists, PII checks). Constrain what tools the agent may call in each state.
  7. Evals + observability. Maintain a test set of representative tasks and measure success rate, tool-call accuracy, and cost per run on every change. Trace every run. Without this you can't safely change a prompt.
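Items 2, 5, and 6 can be combined in a single wrapper around every tool call. The sketch below is plain Python under stated assumptions: `ALLOWED_TOOLS`, `NEEDS_APPROVAL`, `ToolResult`, `call_tool`, and the three toy tools are hypothetical names, and the split of exceptions into "retryable" and "fatal" is one reasonable classification, not a rule taken from LangGraph.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

# Hypothetical allow-list: which tools each graph state may call.
ALLOWED_TOOLS = {"research": {"search"}, "act": {"search", "send_email"}}
# Hypothetical set of high-stakes tools that need human approval.
NEEDS_APPROVAL = {"send_email"}


@dataclass
class ToolResult:
    status: str  # "ok", "retryable", "fatal", "blocked", or "needs_approval"
    value: Any = None
    error: Optional[str] = None


def call_tool(state_name: str, tool_name: str, tool: Callable[..., Any],
              approved: bool = False, **kwargs: Any) -> ToolResult:
    # Guardrail: reject tools that are not allow-listed for this state.
    if tool_name not in ALLOWED_TOOLS.get(state_name, set()):
        return ToolResult("blocked", error=f"{tool_name} not allowed in {state_name}")
    # Human-in-the-loop: stop before a high-stakes action until approved.
    if tool_name in NEEDS_APPROVAL and not approved:
        return ToolResult("needs_approval")
    try:
        return ToolResult("ok", value=tool(**kwargs))
    except (TimeoutError, ConnectionError) as exc:
        # Transient failure: the caller may retry within its budget.
        return ToolResult("retryable", error=str(exc))
    except (ValueError, TypeError) as exc:
        # Bad input: repeating the same call will not help.
        return ToolResult("fatal", error=str(exc))


def search(query: str) -> str:
    if not query:
        raise ValueError("empty query")
    return f"results for {query}"


def flaky_search(query: str) -> str:
    raise TimeoutError("upstream timed out")


def send_email(to: str) -> str:
    return f"sent to {to}"


print(call_tool("research", "search", search, query="tides"))
print(call_tool("act", "send_email", send_email, to="ops@example.com"))
```

Returning a status instead of raising lets the graph route on the outcome: "retryable" can loop back within the step budget, "fatal" can go to a recovery or exit node, and "needs_approval" can pause the run until a person signs off.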

The mistake that sinks most agent projects

Teams optimize the demo (the happy path) and ship. Then real inputs arrive (ambiguous requests, flaky APIs, edge cases), and a success rate that looked like 95% in the demo can fall to something like 60% in production (illustrative figures, not measurements). Industry analyses repeatedly report that most agent projects never reach production, and the gap is rarely the model; it is the reliability engineering described above (verification, error handling, bounded loops, evals). Build those in from day one, not after the first outage.

When to use LangGraph vs. a simpler approach

  • Single prompt + a couple of tools, low stakes? A simple agent or even a plain function-calling loop is fine — don't over-engineer.
  • Multi-step reasoning, multiple tools, needs reliability, human approvals, or multiple cooperating agents? Use a state-graph framework like LangGraph. The structure pays for itself the moment you need to debug, resume, or guarantee a step.