mechai · arch-crew

Reference · LLM agents · Control-flow design

Agentic patterns

A working catalog of the control-flow patterns for getting reliable work out of an LLM that can reason, call tools, and act over multiple steps — what each one solves, the shape it takes, and what every extra loop, hand-off, or model call costs.

Core loops · workflow shapes · multi-agent topologies · system-theoretic patterns · governance

The best agentic system is the simplest one that meets the requirement

An agentic pattern is a reusable control-flow shape for orchestrating LLM calls plus tools. The patterns span a spectrum. At one end, a workflow routes the model through predefined code paths: the developer decides the steps. At the other, an agent lets the model decide its own steps and tool calls at runtime. Workflows are cheaper, faster, and testable; agents handle open-ended problems but cost more and compound errors.

At bottom every agent is a loop over an augmented LLM — a model with tools, memory, and retrieval. The patterns here nest and compose: an orchestrator's worker may run a ReAct loop whose answer is then refined by an evaluator-optimizer pass. The interviewer's signal — and the engineering one — is whether you match a pattern to a problem rather than reaching for the most autonomous design by reflex. Every step of added autonomy buys flexibility at the cost of latency, token spend, and debuggability.

The through-line

Pick the least autonomy that solves the problem. If a single well-prompted call (optionally with retrieval) suffices, use that. If the steps are predictable, a workflow is cheaper and testable. Reserve true agentic loops for genuinely open-ended paths — and be able to say out loud why each loop, hand-off, or extra model call earns its keep.

“The most successful implementations use simple, composable patterns, not complex frameworks.” — Anthropic [1]

In the diagrams below, color shows model identity, not role — so you can see at a glance whether a pattern reuses one model or coordinates several.

Part I · The agent loop

Reasoning, acting, and refining

The foundational patterns: how a model reasons, grounds that reasoning in real tool output, and improves its own work. Everything downstream nests these.

1.1 ReAct

Interleave Thought, Action, and Observation in a loop until the goal is met.

ReAct [2] is the canonical agent loop: the model reasons a step (Thought), takes an Action (a tool call), reads the Observation (the result), and repeats. Its value is that reasoning is grounded in real tool feedback, not the model's priors. The failure mode to volunteer: with no stopping discipline it loops or oscillates, and the growing trace bloats the context window. ReAct's "world model" is just whatever fits in context, with no validation of observations and no long-term learning [4] — which is exactly why a raw loop is brittle and needs hardening.

Diagram: Thought (reason) → Action (tool call) → Observation (result); loop until the goal is met, then a final answer.
ReAct — one model loops, grounded in tool feedback; the danger is an unbounded loop.
Thought: I need ACME's latest revenue figure.
Action: search("ACME 2025 annual revenue")
Observation: $4.2B (FY2025 report)
Thought: that answers the question.
→ Final Answer: $4.2B
Use when: The path is open-ended and each step depends on the result of the last.
Cost: Latency and tokens per step; without a budget it loops, oscillates, or acts on bad observations.
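The stopping discipline mentioned above can be sketched as a hard step budget plus a check for repeated actions. This is a minimal illustration, not a reference implementation: `run_react`, `scripted_policy`, and `stuck_policy` are hypothetical names, and the `policy` callable stands in for a real model call.

```python
from typing import Callable


def run_react(policy: Callable[[list[str]], tuple[str, str]],
              tools: dict[str, Callable[[str], str]],
              max_steps: int = 5) -> str:
    """Run a Thought/Action/Observation loop with a step budget and repeat detection.

    `policy` stands in for the model: given the trace so far, it returns
    ("final", answer) or (tool_name, tool_input).
    """
    trace: list[str] = []
    seen: set[tuple[str, str]] = set()
    for _ in range(max_steps):                 # hard step budget
        kind, payload = policy(trace)
        if kind == "final":
            return payload
        if (kind, payload) in seen:            # same action twice: likely oscillating
            return "stopped: repeated action"
        seen.add((kind, payload))
        observation = tools[kind](payload)
        trace.append(f"Action: {kind}({payload!r})")
        trace.append(f"Observation: {observation}")
    return "stopped: step budget exhausted"


# Hypothetical stand-ins for a model and a search tool.
def scripted_policy(trace: list[str]) -> tuple[str, str]:
    if not trace:
        return ("search", "ACME 2025 annual revenue")
    return ("final", "$4.2B")


def stuck_policy(trace: list[str]) -> tuple[str, str]:
    return ("search", "ACME 2025 annual revenue")  # never converges


tools = {"search": lambda query: "$4.2B (FY2025 report)"}
print(run_react(scripted_policy, tools))
print(run_react(stuck_policy, tools))
```

Exact-match repeat detection is the crudest possible check; it catches a model that reissues the same call but not one that cycles through near-duplicates.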

1.2 Tool Use

The agent emits a structured call the runtime executes, returning the result to context.

Tool Use [3][7] is the bridge between a text generator and the real world: the model emits a structured call — JSON matching a declared schema — which the runtime executes, feeding the result back into context. It is called foundational because every higher pattern's reliability rests on tool design: tight schemas, typed parameters, clear errors. Toolformer [7] showed models can even learn when to call a tool. The classic trap is a vague, overloaded tool the model misuses — no amount of prompt-engineering the loop fixes a bad tool boundary. In GoF terms it is a Proxy/Adapter over an external capability.

Diagram: Agent (LLM; emits JSON call) → Runtime (validates schema) → Tool / API (real world); the result returns to context.
Tool Use — a typed, schema-validated call; reliability rides on the tool boundary.
{
  "name": "get_weather",
  "arguments": {"city": "Warsaw", "units": "metric"}
}
Use when: The agent must act on or read from any system outside the model; always, for real work.
Cost: Reliability is only as good as the schema; a vague or overloaded tool gets misused.
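To make "the runtime validates the schema" concrete, here is a small hand-rolled check that runs before a call is dispatched. It is a sketch under assumptions: `TOOL_SCHEMAS` and `validate_call` are hypothetical names, and the schema is reduced to a parameter-name-to-type map; a production runtime would more likely use a full JSON Schema validator.

```python
import json

# Hypothetical tool registry: parameter name -> expected Python type.
TOOL_SCHEMAS = {
    "get_weather": {"city": str, "units": str},
}


def validate_call(raw: str) -> dict:
    """Parse a model-emitted call and reject it unless it matches a declared schema."""
    call = json.loads(raw)
    name = call.get("name")
    if name not in TOOL_SCHEMAS:
        raise ValueError(f"unknown tool: {name!r}")
    schema = TOOL_SCHEMAS[name]
    args = call.get("arguments", {})
    missing = schema.keys() - args.keys()
    extra = args.keys() - schema.keys()
    if missing or extra:
        raise ValueError(f"bad arguments: missing={sorted(missing)} extra={sorted(extra)}")
    for key, expected in schema.items():
        if not isinstance(args[key], expected):
            raise ValueError(f"{key} must be {expected.__name__}")
    return call


call = validate_call('{"name": "get_weather", "arguments": {"city": "Warsaw", "units": "metric"}}')
print(call["arguments"]["city"])
```

The error messages matter as much as the check: a clear "city must be str" can be fed back to the model as an observation so it can correct the call.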

1.3 Reflection

Generation–Critique–Refinement: the model evaluates and revises its own output.

Reflection [3][5] is a draft → critique-against-criteria → revise cycle; Self-Refine [5] formalised the iterative self-feedback loop. It pays off when there is a clear correctness signal the critique can latch onto — code that must pass tests, a proof, a constraint-checked translation — because evaluating is a different, often easier task than one-shot authoring. The cost: each round is at least one extra call, so cap iterations (commonly 1–3) and exit early when the critique reports no issues.

Diagram: Generate (draft) → Critique (vs. criteria) → Revise (improve); loop 1–3×, early-exit when the critique passes.
Reflection — same model critiques itself; best with a clear correctness signal.
draft = generate(task)
for _ in range(3):                # hard iteration cap
    notes = critique(draft, rubric)
    if notes.ok: break            # early-exit on pass
    draft = revise(draft, notes)
Use when: There is a checkable signal (tests, a rubric, a constraint) that a critic can score against.
Cost: Every round is an extra call; without a cap it spends freely for diminishing gains.

1.4 Plan-and-Execute

A Planner decomposes the goal into ordered steps before an Executor acts.

A Planner decomposes the goal into ordered sub-steps up front; an Executor carries them out, each step often a tool call or small ReAct loop [3]. The difference from plain ReAct is commitment timing: ReAct decides one step at a time reacting to each observation, while Plan-and-Execute commits to a structure first. The advantage is that long-horizon, multi-system tasks surface hidden complexity early. The trade-offs: an upfront planning call adds latency, and a rigid plan goes stale — mature systems add re-planning when a step fails.

Diagram: Planner (decompose up front) → step 1, step 2, step 3 → Executor (run each step); re-plan if a step fails.
Plan-and-Execute — planner and executor (often one model, sometimes a cheaper second); re-plan when reality diverges.
plan = planner(goal)                  # ordered sub-steps, up front
done = []                             # steps completed so far
while plan:
    step = plan.pop(0)
    result = executor(step)           # often a small ReAct loop
    if result.failed:
        plan = replan(goal, done)     # a rigid plan goes stale; cap re-plans in practice
    else:
        done.append(step)
Use when: Long-horizon, multi-system tasks where surfacing structure early beats reacting step by step.
Cost: An upfront planning call; plans drift unless you re-plan on failure.

Also in this family

Chain-of-Thought [6]

Elicit step-by-step internal reasoning before answering — no external action.

In practice: CoT [6] is the reasoning substrate ReAct acts on: pure thinking, no tools. Reach for it when a single call just needs to reason more carefully, not act.

answer = llm(question + "\nLet's think step by step.")
# reasoning in the open; ReAct adds the Action/Observation step

Part II · Workflow shapes

Developer-steered control flow

When the steps are predictable, you don't need an agent — you need a workflow: composable shapes where developer code, not the model, decides the path. Cheaper, faster, testable.

2.1 Prompt Chaining

Decompose a task into a fixed sequence of steps, each feeding the next.

Prompt chaining [1] breaks a task into an ordered series of LLM calls where each step's output is the next step's input — outline, then draft, then polish. Because the structure is fixed, you can drop a programmatic gate between steps (a check that fails fast). Use it when a task cleanly factors into stable sub-steps and you want each one simpler and more reliable than one giant prompt. The trade-off is latency: the calls are serial by construction.

Diagram: Step 1 · outline → gate → Step 2 · draft → Step 3 · polish.
Prompt chaining — one model, different prompts down a fixed pipeline; an optional gate fails fast between steps.
outline = llm(brief)                  # step 1
if not gate(outline): return reject  # programmatic check
draft = llm(outline)                  # step 2 feeds on step 1
final = llm(draft, "polish for tone")  # step 3
Use when: A task factors into stable, ordered sub-steps and you want each call simple and checkable.
Cost: Serial latency; a fixed chain can't adapt to inputs that don't fit the shape.

2.2 Routing

A classifier labels an input and dispatches it to a specialised follow-up or model.

Routing [1] puts a lightweight classifier (often a cheap LLM) at the front that labels the input and dispatches it to a specialised handler or model. Use it for distinct input categories — support triage, or sending easy queries to a small model and escalating hard ones for cost control. The characteristic failure is that a routing error propagates: a misclassification means the rest of the pipeline confidently solves the wrong problem, invisibly. Mitigate with a default/uncertain route, confidence thresholds, and logged route decisions.

Diagram: input → Router (cheap classifier) → Billing handler | Technical handler | Small-model route.
Routing — specialise by category; a misroute fails silently downstream.
label = classify(query)               # cheap model, up front
handler = ROUTES.get(label, default_route)   # always have a default
return handler(query)               # misroute = confident wrong answer
Use when: Inputs fall into distinct categories, or you want to send easy cases to a cheaper model.
Cost: Routing errors propagate invisibly; needs a default route and confidence thresholds.
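The three mitigations named above (a default route, a confidence threshold, and logged decisions) fit in one small dispatcher. This is a sketch that assumes the classifier returns a `(label, confidence)` pair; `route`, `keyword_classifier`, `ROUTES`, and `fallback` are hypothetical names, and the 0.7 threshold is arbitrary.

```python
import logging
from typing import Callable

logger = logging.getLogger("router")


def route(query: str,
          classify: Callable[[str], tuple[str, float]],
          routes: dict[str, Callable[[str], str]],
          default: Callable[[str], str],
          min_confidence: float = 0.7) -> str:
    """Dispatch to a specialised handler; fall back when the label is unknown or unsure."""
    label, confidence = classify(query)
    handler = routes.get(label)
    if handler is None or confidence < min_confidence:
        label, handler = "default", default
    logger.info("route=%s confidence=%.2f", label, confidence)  # logged route decision
    return handler(query)


# Hypothetical classifier and handlers, standing in for model calls.
def keyword_classifier(query: str) -> tuple[str, float]:
    if "invoice" in query.lower():
        return ("billing", 0.9)
    return ("technical", 0.4)


ROUTES = {"billing": lambda q: "billing team", "technical": lambda q: "tech team"}
fallback = lambda q: "human triage"

print(route("Where is my invoice?", keyword_classifier, ROUTES, fallback))
print(route("It broke", keyword_classifier, ROUTES, fallback))
```

Logging the decision alongside its confidence is what turns an invisible misroute into something you can find later in a trace.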

2.3 Parallelization

Run independent LLM calls concurrently and aggregate — by sectioning or by voting.

Parallelization [1] runs independent calls at once and aggregates. Two shapes: sectioning splits a task into independent subtasks run in parallel; voting runs the same task several times and aggregates — majority vote, or "flag if any run flags it," which is useful for guardrails. It is the static counterpart to orchestrator-workers: parallelization fans out to a fixed, known set, while an orchestrator decides the fan-out dynamically. Use it when subtasks are independent and latency matters, or when independent looks raise confidence. Cost is linear in calls, and voting only helps if errors are uncorrelated across runs.

Diagram: task → call A | call B | call C → Aggregate (merge / vote).
Parallelization — the same model fanned out; sectioning splits work, voting raises confidence.
# sectioning: independent subtasks, concurrently
parts = await gather(*(llm(s) for s in sections))
answer = aggregate(parts)
# voting: run the same task N times, then majority / "flag if any flags"
Use when: Subtasks are independent and latency matters, or repeated looks raise confidence.
Cost: Linear in calls; voting only helps when errors are uncorrelated across runs.
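The voting shape, which the snippet above leaves as a comment, can be sketched with `asyncio.gather`. The two aggregation rules are the ones named in the text: majority vote and "flag if any run flags it". The `scripted` helper and the two checks are hypothetical stand-ins for real model calls.

```python
import asyncio
from collections import Counter
from typing import Awaitable, Callable


async def vote(task: str, call: Callable[[str], Awaitable[str]], n: int = 3) -> str:
    """Run the same task n times concurrently and return the most common answer."""
    answers = await asyncio.gather(*(call(task) for _ in range(n)))
    return Counter(answers).most_common(1)[0][0]


async def flag_if_any(text: str, checks: list[Callable[[str], Awaitable[bool]]]) -> bool:
    """Guardrail-style aggregation: flag when any independent check flags."""
    results = await asyncio.gather(*(check(text) for check in checks))
    return any(results)


# Hypothetical stand-ins for model calls.
def scripted(replies: list[str]) -> Callable[[str], Awaitable[str]]:
    it = iter(replies)

    async def call(task: str) -> str:
        return next(it)

    return call


async def has_secret(text: str) -> bool:
    return "password" in text


async def too_long(text: str) -> bool:
    return len(text) > 1000


print(asyncio.run(vote("Is this a refund request?", scripted(["yes", "no", "yes"]))))
print(asyncio.run(flag_if_any("my password is hunter2", [has_secret, too_long])))
```

Majority voting over identical prompts to the same model tends to produce correlated errors, so in practice the runs usually differ in prompt, temperature, or model.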

2.4 Evaluator-Optimizer

One model generates while a separate evaluator scores against criteria, in a loop.

Evaluator-Optimizer [1] separates roles: one call generates, a distinct evaluator scores the output against explicit, often external criteria, looping until the bar is met. It is close to Reflection but the distinction is who judges and against what — Reflection is typically the same model critiquing itself; here the critic is separated from the generator's framing. The separation helps when you have clear criteria and want an unbiased judge — literary translation against a rubric, high-stakes reasoning. Both buy quality through iteration and get expensive; guard with a hard iteration cap and early-exit on "pass."

Diagram: Generator (produces a draft) → draft → Evaluator (scores vs. criteria) → feedback → Generator; loop until it passes.
Evaluator-Optimizer — a separate critic against explicit criteria; cap the rounds.
draft = generator(task)
for _ in range(3):                       # hard iteration cap (3 is illustrative)
    score = evaluator(draft, criteria)   # distinct critic, external rubric
    if score.passes: break               # early-exit on a pass
    draft = generator(task, score.feedback)
Use when: You have explicit criteria and want the critic free of the generator's framing.
Cost: Expensive over many rounds; needs a hard cap and early-exit on a pass.

Part III · Multi-agent topologies

When one agent isn't enough — who holds control?

More agents multiply cost, latency, and communication failure surface, so a single good agent often wins. When you genuinely need several, the design question is who holds control.

3.1 Orchestrator-Workers

A lead LLM dynamically decomposes a task, delegates to workers, and synthesises.

Orchestrator-Workers [1]: a lead model decomposes a task dynamically at runtime and spins up workers for sub-tasks it discovers — a coding agent finding which files to edit, for instance — then synthesises their results. The contrast with a Supervisor is whether sub-tasks are known in advance: here decomposition is unpredictable. Use it when you can't enumerate the sub-tasks up front. The shared risk: the lead is a coordination bottleneck and single point of failure.

Diagram: Orchestrator (decompose at runtime) → worker (discovered) | worker (discovered) | worker (discovered) → synthesise.
Orchestrator-Workers — fan-out decided at runtime; the lead is the bottleneck.
subtasks = orchestrator.decompose(goal)   # decided at runtime
results  = [worker(t) for t in subtasks]  # spun up per discovered task
return orchestrator.synthesize(results)
Use when: Decomposition is unpredictable and you can't enumerate the sub-tasks before running.
Cost: Coordination overhead; the orchestrator is a single point of failure.

3.2 Supervisor (Hierarchical)

A supervisor routes work to named specialists and collects their answers.

A Supervisor is a fixed topology: a router over named specialists — coder, tester, reviewer — that holds control, delegates outward, and collects results. Use it when roles are stable and you want easy observability and a central view. Versus Orchestrator-Workers, the difference is that the roster is known in advance rather than discovered. Versus a Swarm, a central agent always sees the whole task, which makes global guardrails and audit far easier. The cost: the supervisor is the same bottleneck and single point of failure.

Diagram: Supervisor (central view) ↔ Coder | Tester | Reviewer; control returns to the supervisor after each specialist.
Supervisor — a fixed roster, centrally observable; the lead remains a bottleneck.
SPECIALISTS = {"code": coder, "test": tester, "review": reviewer}
while not done:
    who = supervisor.route(state)     # fixed roster, central view
    state = SPECIALISTS[who](state)    # control returns to supervisor
Use when: Roles are stable and you want observability, a central view, and easy guardrails.
Cost: The supervisor is a bottleneck and single point of failure.

3.3 Handoff (Swarm)

Peer agents transfer control by calling a hand-off tool — no central orchestrator.

In a Handoff/Swarm [16], peers transfer control via a hand-off tool (Triage → Billing → Refunds); no one orchestrates. The hand-off is just a tool call that swaps the active agent and its instructions. Decentralised control wins when the flow is a chain of specialists that each fully own their segment and a global view matters less than low coordination overhead. The trade-off is sharp: no single agent ever sees the whole task, which makes end-to-end reasoning and global guardrails harder.

Diagram: Triage → (hand-off) → Billing → (hand-off) → Refunds; control transfers peer-to-peer, and no agent sees the whole task.
Handoff/Swarm — low coordination overhead, but no global view.
agent = triage
while agent:
    reply, handoff = agent.run(conversation)
    agent = handoff   # a tool call swaps the active agent + its instructions
Use when: The flow is a chain of specialists that each own a segment; global view matters less.
Cost: No agent sees the whole task, so end-to-end reasoning and guardrails are harder.
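A runnable version of the hand-off loop, with one addition the pseudocode lacks: a hop limit, since with no central orchestrator nothing else stops two peers from passing a task back and forth. This is an illustrative sketch; `run_handoffs`, `AGENTS`, and the lambda agents are hypothetical, and a real hand-off would be a tool call emitted by the model rather than a returned name.

```python
from typing import Callable, Optional

# An agent reads the conversation and returns (reply, name of the next agent or None).
Agent = Callable[[list[str]], tuple[str, Optional[str]]]


def run_handoffs(agents: dict[str, Agent], start: str,
                 conversation: list[str], max_hops: int = 5) -> list[str]:
    """Pass control peer-to-peer until an agent finishes or the hop limit is hit."""
    active = start
    for _ in range(max_hops):                   # bound the chain of hand-offs
        reply, handoff = agents[active](conversation)
        conversation.append(f"{active}: {reply}")
        if handoff is None:                     # this agent owns the final answer
            return conversation
        active = handoff                        # swap the active agent
    raise RuntimeError("hand-off limit reached")


# Hypothetical agents standing in for model-backed specialists.
AGENTS: dict[str, Agent] = {
    "triage": lambda conv: ("routing you to billing", "billing"),
    "billing": lambda conv: ("this needs a refund", "refunds"),
    "refunds": lambda conv: ("refund issued", None),
}

log = run_handoffs(AGENTS, "triage", ["user: I was charged twice"])
print(log[-1])
```

Because each agent only sees what is in `conversation`, whatever the next specialist needs must be written into that shared transcript before the hand-off.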

Also a topology

Multi-Agent Mesh [13]

Specialised agents communicate over an event backbone using agent-to-agent protocols.

In practice: Agents publish and subscribe over an event bus (Kafka) via A2A protocols [13]; best for AI-native rebuilds (pricing, fraud). Fully decentralised and scalable, but the communication surface is the failure surface.

bus.publish("price.updated", event)   # agents react over Kafka / A2A

Part IV · System-theoretic patterns

The subsystem lens — Dao et al.

To move past convenience-based lists, Dao et al. [4] deconstruct an agent into five subsystems — Reasoning & World Model, Perception & Grounding, Action Execution, Learning & Adaptation, Inter-Agent Communication — and derive patterns that each fix a specific subsystem failure. Plain ReAct implements these implicitly and monolithically, which is exactly why it's brittle. The engineering value is a diagnostic vocabulary: name which subsystem a failure lives in, and which pattern fixes it.

Foundational — perception & memory

Integrator

Validate incoming information in Perception & Grounding before it enters reasoning.

Fixes: Cognitive data quality, meaning a stale or wrong observation acted on as fact. The first thing to add when hardening a raw ReAct loop.

value = fetch(); assert fresh(value)   # validate before reasoning consumes it
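One way to flesh out the one-liner above is a gate that rejects empty or stale observations before the reasoning step consumes them. This is a minimal sketch: `Observation`, `integrate`, and the `max_age` policy are assumptions made for illustration, and a real Integrator would also check source and format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Observation:
    value: str
    fetched_at: datetime


def integrate(obs: Observation, max_age: timedelta,
              now: Optional[datetime] = None) -> str:
    """Admit an observation into reasoning only if it is non-empty and recent enough."""
    now = now or datetime.now(timezone.utc)
    if not obs.value.strip():
        raise ValueError("empty observation")
    if now - obs.fetched_at > max_age:
        raise ValueError("stale observation")
    return obs.value


# Fixed clock so the example is deterministic.
clock = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
fresh = Observation("$4.2B (FY2025 report)", clock - timedelta(minutes=5))
print(integrate(fresh, max_age=timedelta(hours=1), now=clock))
```

Raising an exception rather than using `assert` keeps the check active when Python runs with optimisations enabled, and gives the loop a typed failure to recover from.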

Retriever

A simplified, context-aware interface to memory — read the relevant slice, not everything.

Fixes: Inefficient context retrieval and "lost in the middle" degradation. The read side of memory.

ctx = store.search(query, k=5)   # relevant slice, not the whole history

Recorder

Capture and externalise reasoning/world-model state so it can be restored.

Fixes: State saving & restoring. State survives the context window so a long run can be resumed. The write side of memory.

store.save(agent.state)   # survives the context window; restore later
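As a sketch of the save/restore round trip, the state can be externalised as JSON on disk. The state shape and the `save_state` / `restore_state` helpers are hypothetical; any durable store (a database, an object store) would serve the same purpose.

```python
import json
import tempfile
from pathlib import Path


def save_state(path: Path, state: dict) -> None:
    """Externalise agent state as JSON so it outlives the context window."""
    path.write_text(json.dumps(state), encoding="utf-8")


def restore_state(path: Path) -> dict:
    """Load a previously saved state to resume a run."""
    return json.loads(path.read_text(encoding="utf-8"))


# Hypothetical state shape: the goal, finished steps, and notes worth keeping.
state = {
    "goal": "reconcile invoices",
    "done": ["fetch ledger"],
    "notes": ["vendor 12 is duplicated"],
}

with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "agent_state.json"
    save_state(checkpoint, state)
    resumed = restore_state(checkpoint)

print(resumed["done"])
```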

Cognitive & decisional — the planning stack

Selector

Prioritise and adapt goals — a Mediator over competing objectives.

Fixes: Tactical goal selection: decide which objective to pursue now. GoF: Mediator.

goal = select(active_goals)   # prioritise which objective to pursue now

Deliberator

Select the optimal concrete action at each step.

Fixes: Dynamic action adaptation: the action-level layer below Selector (which goal) and Planner (which route).

action = choose(candidates, state)   # best concrete next move

Execution & interaction

Executor

Reliably execute dispatched actions and collect feedback.

Fixes: Execution reliability and error recovery; the disciplined counterpart to a raw tool call. (Tool Use is the shared mechanism; see 1.2.)

result = run(action); recover(result.errors)   # reliable dispatch + feedback

Coordinator

Manage structured inter-agent communication.

Fixes: Communication breakdowns, via message contracts, who-talks-to-whom, and shared-state rules. The antidote to a multi-agent system that "forgets" context or deadlocks.

msg = Contract(to="billing", payload=...)   # structured who-talks-to-whom

Adaptive & learning

Reflector

Analyse outcomes to infer causality and adjust strategy.

Fixes: Causal learning/adaptation. Unlike Reflection (which revises one output), the Reflector learns across whole trajectories so the agent stops repeating mistakes.

lesson = analyse(trajectory)   # infer causality, adjust future strategy

Controller

Continuously monitor behaviour for alignment — an Observer.

Fixes: Value alignment & transparency. An always-on runtime guardrail, not a one-time eval. GoF: Observer. Central to governance (see Part VI).

if violates(policy, action): halt()   # always-on Observer over behaviour

The rest of the catalogue

Part V · Reliability & memory

From a loop to a durable system

Durable agents need more than a loop. Memory evolves from raw storage to retrieval-augmented context to experience — proactive exploration and cross-trajectory abstraction [11]. Voyager [8] showed lifelong skill acquisition via an automatic curriculum plus an ever-growing skill library of executable code (the Skill Build pattern).

Reliability needs structured error handling, not blind retries. The read/write split of memory lives in Retriever and Recorder; the learning loop in Reflector and Skill Build. To harden a brittle ReAct loop in practice: a hard step budget plus loop detection; an Integrator to validate observations; durable memory instead of cramming context; and a structured exception taxonomy so recovery has escalation pathways.
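A structured exception taxonomy can be as small as three classes, each mapped to a different recovery path: retry, re-plan, or escalate. The class names, the `run_step` helper, and the outcome labels below are hypothetical; they illustrate the idea and do not reproduce the taxonomy in [9].

```python
import time
from typing import Any, Callable


class AgentError(Exception):
    """Base class for failures the agent loop knows how to handle."""


class TransientError(AgentError):
    """Timeouts, rate limits: worth retrying."""


class ToolInputError(AgentError):
    """The model sent bad arguments: hand back to the planner."""


class FatalError(AgentError):
    """Unrecoverable: escalate to a human."""


def run_step(step: Callable[[], Any], max_retries: int = 2,
             backoff_s: float = 0.0) -> tuple[str, Any]:
    """Run one step and map each failure class to a recovery pathway."""
    for attempt in range(max_retries + 1):
        try:
            return ("ok", step())
        except TransientError:
            time.sleep(backoff_s * (2 ** attempt))   # exponential backoff, then retry
        except ToolInputError as exc:
            return ("replan", str(exc))
        except FatalError as exc:
            return ("escalate", str(exc))
    return ("escalate", "retries exhausted")


# Hypothetical step that times out a fixed number of times before succeeding.
def make_flaky(failures: int) -> Callable[[], str]:
    calls = {"n": 0}

    def step() -> str:
        calls["n"] += 1
        if calls["n"] <= failures:
            raise TransientError("timeout")
        return "done"

    return step


print(run_step(make_flaky(1)))
print(run_step(make_flaky(5)))
```

The point of the taxonomy is that only one class is ever retried blindly; everything else routes to a pathway that can actually change the outcome.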

Part VI · Governance & human oversight

The line between a demo and something you'd let touch money

Production agents need layered defence: a human checkpoint on the narrow set of irreversible actions, an always-on policy monitor, and observability that turns a 15-step failure from an unreadable stack into a diagnosable trace.

6.1 Human-in-the-Loop

A control point for human approval before irreversible or high-risk actions.

HITL [10] inserts a checkpoint before irreversible/high-risk actions — payments, account changes, deploys. Magentic-UI [10] is instructive: it defines six mechanisms — co-planning, co-tasking, multi-tasking, action guards, answer verification, and long-term memory — so human involvement is low-cost and targeted rather than a blanket gate. Scope HITL tightly to genuinely dangerous steps (or reviewers rubber-stamp), summarise clearly at the checkpoint, and give a "reject + feedback" path the agent can act on, not just approve/deny. The cost is latency and human time. Pair it with a Controller (always-on policy monitor) and full tracing for governance.

Diagram: Agent (proposes action) → Action guard (risk check) → Execute; low-risk actions go straight through, high-risk actions go to Human (approve / reject + feedback) first.
HITL — action guards gate only the dangerous tail; the routine path stays fast.
action = agent.propose()
if action.risk == "high":                  # action guard
    if not human.approve(summary(action)):  # clear summary at the checkpoint
        return agent.revise(human.feedback)  # reject + feedback, not just deny
execute(action)
Use when: A few actions are irreversible or high-stakes, such as refunds, deploys, and account changes.
Cost: Latency and human time; scope it tightly or reviewers rubber-stamp.

Pairs with

Observability

Trace every Thought / Action / Observation, tokens, cost, and tool latency.

In practice: Makes Controller and HITL auditable and turns a long failure into a diagnosable trace. Pair with structured exception handling [9] and hard loop/budget caps so a misbehaving agent fails safe.

trace(step, thought, action, tokens, latency)   # every step, auditable
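A minimal in-memory tracer shows what "trace every step" might record. The `Span` fields and the `Tracer` class are assumptions for illustration; a production system would more likely export to a dedicated tracing backend than hold spans in a list.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Span:
    step: int
    thought: str
    action: str
    observation: str
    tokens: int
    latency_s: float


class Tracer:
    """Collect one span per agent step and export them as JSON lines."""

    def __init__(self) -> None:
        self.spans: list[Span] = []

    def record(self, thought: str, action: str, observation: str,
               tokens: int, latency_s: float) -> None:
        self.spans.append(Span(len(self.spans) + 1, thought, action,
                               observation, tokens, latency_s))

    def total_tokens(self) -> int:
        return sum(span.tokens for span in self.spans)

    def to_jsonl(self) -> str:
        return "\n".join(json.dumps(asdict(span)) for span in self.spans)


tracer = Tracer()
tracer.record("need revenue figure", "search('ACME 2025 annual revenue')",
              "$4.2B (FY2025 report)", tokens=120, latency_s=0.8)
tracer.record("that answers the question", "final_answer", "$4.2B",
              tokens=40, latency_s=0.3)
print(tracer.to_jsonl())
```

Running totals such as `total_tokens()` are also what a budget cap would check against, so the tracer and the hard limits can share one data source.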

Part VII · Enterprise integration

Bolting agents onto existing software

Most agents don't live in a greenfield loop — they integrate with software that already exists. These hybrid patterns [13] place the agent relative to your stack, ordered by how much you let it touch.

Part VIII · Decide

Design your agentic system

Agentic patterns compose, so this doesn't hand you one pattern — it builds a layered design. It opens with a least-autonomy gate: only if a single call or a workflow genuinely won't do does it layer on a reasoning loop, a topology, memory, reliability, and governance. Treat the result as a starting point you can argue with — and remember a single good agent usually beats a swarm.

How to choose

The honest takeaway: most tasks need far less autonomy than a catalog implies. Every loop, hand-off, and extra model call buys flexibility at the cost of latency, tokens, and debuggability — so make each one justify itself.

Start at the bottom of the spectrum. If a single well-prompted call (optionally with retrieval) answers it, stop there. If the steps are predictable, reach for a workflow shape — cheaper and testable. Escalate to an agent loop only for genuinely open-ended paths, and bound it: a step budget, an Integrator on observations, durable memory, structured error handling. Add a second agent only when one truly can't hold the task — a single good agent beats most multi-agent designs. Put a human on the irreversible actions, a Controller on policy, and a trace on everything. The patterns aren't new inventions — Selector is a Mediator, Tool Use a Proxy/Adapter, Controller an Observer — they're systematisations of structures you already know.

References

Sources behind the patterns; [n] markers throughout point here. Compiled from “Q&A: Agentic Patterns” (A. Krysztopa), verified June 2026.