Reference · LLM agents · Control-flow design
A working catalog of the control-flow patterns for getting reliable work out of an LLM that can reason, call tools, and act over multiple steps — what each one solves, the shape it takes, and what every extra loop, hand-off, or model call costs.
An agentic pattern is a reusable control-flow shape for orchestrating LLM calls plus tools. The patterns span a spectrum. At one end, a workflow routes the model through predefined code paths: the developer decides the steps. At the other, an agent lets the model decide its own steps and tool calls at runtime. Workflows are cheaper, faster, and easier to test; agents handle open-ended problems but cost more and compound errors.
At bottom every agent is a loop over an augmented LLM — a model with tools, memory, and retrieval. The patterns here nest and compose: an orchestrator's worker may run a ReAct loop whose answer is then refined by an evaluator-optimizer pass. The interviewer's signal — and the engineering one — is whether you match a pattern to a problem rather than reaching for the most autonomous design by reflex. Every step of added autonomy buys flexibility at the cost of latency, token spend, and debuggability.
The through-line
Pick the least autonomy that solves the problem. If a single well-prompted call (optionally with retrieval) suffices, use that. If the steps are predictable, a workflow is cheaper and testable. Reserve true agentic loops for genuinely open-ended paths — and be able to say out loud why each loop, hand-off, or extra model call earns its keep.
“The most successful implementations use simple, composable patterns, not complex frameworks.” — Anthropic [1]
In the diagrams below, color shows model identity, not role — so you can see at a glance whether a pattern reuses one model or coordinates several.
Part I · The agent loop
The foundational patterns: how a model reasons, grounds that reasoning in real tool output, and improves its own work. Everything downstream nests these.
Interleave Thought, Action, and Observation in a loop until the goal is met.
ReAct [2] is the canonical agent loop: the model reasons a step (Thought), takes an Action (a tool call), reads the Observation (the result), and repeats. Its value is that reasoning is grounded in real tool feedback, not the model's priors. The failure mode to volunteer: with no stopping discipline it loops or oscillates, and the growing trace bloats the context window. ReAct's "world model" is just whatever fits in context, with no validation of observations and no long-term learning [4] — which is exactly why a raw loop is brittle and needs hardening.
Thought: I need ACME's latest revenue figure.
Action: search("ACME 2025 annual revenue")
Observation: $4.2B (FY2025 report)
Thought: that answers the question.
→ Final Answer: $4.2B
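The trace above can be turned into a small runnable loop. The following is a minimal sketch, not a reference implementation: `scripted_model` is a hypothetical stand-in for a real LLM call, the `search` tool returns a canned value, and the step budget and repeated-action check are one assumed way to add the stopping discipline that a raw loop lacks.

```python
from typing import Callable

# Hypothetical tool registry: the tool name and its canned result are illustrative.
TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda query: "$4.2B (FY2025 report)",
}


def scripted_model(trace: list[str]) -> dict:
    """Stand-in for an LLM call: act once, then answer from the observation."""
    has_observation = any(line.startswith("Observation:") for line in trace)
    if not has_observation:
        return {
            "thought": "I need ACME's latest revenue figure.",
            "action": ("search", "ACME 2025 annual revenue"),
        }
    return {"thought": "That answers the question.", "final": "$4.2B"}


def react(model: Callable[[list[str]], dict], question: str, max_steps: int = 5) -> str:
    """Run Thought -> Action -> Observation until a final answer or the budget ends."""
    trace = [f"Question: {question}"]
    seen_actions: set[tuple[str, str]] = set()
    for _ in range(max_steps):  # hard step budget
        step = model(trace)
        trace.append(f"Thought: {step['thought']}")
        if "final" in step:
            return step["final"]
        action = step["action"]
        if action in seen_actions:  # loop detection: the same call was made before
            raise RuntimeError(f"repeated action {action!r}; aborting")
        seen_actions.add(action)
        name, arg = action
        trace.append(f"Action: {name}({arg!r})")
        trace.append(f"Observation: {TOOLS[name](arg)}")
    raise RuntimeError("step budget exhausted without a final answer")
```

Calling `react(scripted_model, "What is ACME's revenue?")` walks the same Thought, Action, Observation sequence as the trace. A model that keeps issuing the same call is stopped by the repeated-action check rather than running until the context window fills.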
The agent emits a structured call the runtime executes, returning the result to context.
Tool Use [3][7] is the bridge between a text generator and the real world: the model emits a structured call — JSON matching a declared schema — which the runtime executes, feeding the result back into context. It is called foundational because every higher pattern's reliability rests on tool design: tight schemas, typed parameters, clear errors. Toolformer [7] showed models can even learn when to call a tool. The classic trap is a vague, overloaded tool the model misuses — no amount of prompt-engineering the loop fixes a bad tool boundary. In GoF terms it is a Proxy/Adapter over an external capability.
{
  "name": "get_weather",
  "arguments": {"city": "Warsaw", "units": "metric"}
}
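To illustrate what "tight schemas, typed parameters, clear errors" can look like on the runtime side, here is a hedged sketch that validates a model-emitted call before executing it. The schema format, the `SCHEMAS` and `HANDLERS` tables, and the canned `get_weather` result are all assumptions made for illustration; a real system would more likely use JSON Schema or a validation library.

```python
import json

# Hypothetical declared schema: parameter name -> (expected type, required?).
SCHEMAS = {
    "get_weather": {"city": (str, True), "units": (str, False)},
}


def get_weather(city: str, units: str = "metric") -> dict:
    # Canned result so the sketch runs offline.
    return {"city": city, "units": units, "temp": 21}


HANDLERS = {"get_weather": get_weather}


def dispatch(raw_call: str) -> dict:
    """Validate a call against its schema, then execute it.

    Errors come back as data so the model can read them and correct the call.
    """
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError as exc:
        return {"ok": False, "error": f"invalid JSON: {exc.msg}"}
    if not isinstance(call, dict):
        return {"ok": False, "error": "call must be a JSON object"}
    name = call.get("name")
    args = call.get("arguments", {})
    if not isinstance(name, str) or name not in SCHEMAS:
        return {"ok": False, "error": f"unknown tool: {name!r}"}
    if not isinstance(args, dict):
        return {"ok": False, "error": "arguments must be a JSON object"}
    schema = SCHEMAS[name]
    for param, (expected_type, required) in schema.items():
        if param not in args:
            if required:
                return {"ok": False, "error": f"missing required parameter: {param}"}
        elif not isinstance(args[param], expected_type):
            return {"ok": False, "error": f"parameter {param} must be {expected_type.__name__}"}
    unexpected = sorted(set(args) - set(schema))
    if unexpected:
        return {"ok": False, "error": f"unexpected parameters: {unexpected}"}
    return {"ok": True, "result": HANDLERS[name](**args)}
```

Returning the error as a value rather than raising keeps the failure inside the loop: the message goes back into context as the observation, and the model gets a specific reason to retry with.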
Generation–Critique–Refinement: the model evaluates and revises its own output.
Reflection [3][5] is a draft → critique-against-criteria → revise cycle; Self-Refine [5] formalised the iterative self-feedback loop. It pays off when there is a clear correctness signal the critique can latch onto — code that must pass tests, a proof, a constraint-checked translation — because evaluating is a different, often easier task than one-shot authoring. The cost: each round is at least one extra call, so cap iterations (commonly 1–3) and exit early when the critique reports no issues.
draft = generate(task)
for _ in range(3):                     # hard iteration cap
    notes = critique(draft, rubric)
    if notes.ok:
        break                          # early-exit on pass
    draft = revise(draft, notes)
A Planner decomposes the goal into ordered steps before an Executor acts.
A Planner decomposes the goal into ordered sub-steps up front; an Executor carries them out, each step often a tool call or small ReAct loop [3]. The difference from plain ReAct is commitment timing: ReAct decides one step at a time reacting to each observation, while Plan-and-Execute commits to a structure first. The advantage is that long-horizon, multi-system tasks surface hidden complexity early. The trade-offs: an upfront planning call adds latency, and a rigid plan goes stale — mature systems add re-planning when a step fails.
plan = planner(goal)                   # ordered sub-steps, up front
done = []                              # steps completed so far
while plan:
    step = plan.pop(0)
    result = executor(step)            # often a small ReAct loop
    if result.failed:
        plan = replan(goal, done)      # a rigid plan goes stale
    else:
        done.append(step)
Also in this family
Elicit step-by-step internal reasoning before answering — no external action.
In practice: CoT [6] is the reasoning substrate that ReAct builds on. It is pure thinking with no tools. Reach for it when a single call just needs to reason more carefully, not act.
answer = llm(question + "\nLet's think step by step.")
# reasoning in the open; ReAct adds the Action/Observation step
Part II · Workflow shapes
When the steps are predictable, you don't need an agent — you need a workflow: composable shapes where developer code, not the model, decides the path. Cheaper, faster, testable.
Decompose a task into a fixed sequence of steps, each feeding the next.
Prompt chaining [1] breaks a task into an ordered series of LLM calls where each step's output is the next step's input — outline, then draft, then polish. Because the structure is fixed, you can drop a programmatic gate between steps (a check that fails fast). Use it when a task cleanly factors into stable sub-steps and you want each one simpler and more reliable than one giant prompt. The trade-off is latency: the calls are serial by construction.
outline = llm(brief)                     # step 1
if not gate(outline):
    return reject                        # programmatic check
draft = llm(outline)                     # step 2 feeds on step 1
final = llm(draft, "polish for tone")    # step 3
A classifier labels an input and dispatches it to a specialised follow-up or model.
Routing [1] puts a lightweight classifier (often a cheap LLM) at the front that labels the input and dispatches it to a specialised handler or model. Use it for distinct input categories — support triage, or sending easy queries to a small model and escalating hard ones for cost control. The characteristic failure is that a routing error propagates: a misclassification means the rest of the pipeline confidently solves the wrong problem, invisibly. Mitigate with a default/uncertain route, confidence thresholds, and logged route decisions.
label = classify(query)                       # cheap model, up front
handler = ROUTES.get(label, default_route)    # always have a default
return handler(query)                         # misroute = confident wrong answer
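The three mitigations named above (a default route, a confidence threshold, and logged route decisions) can be sketched together. Everything here is an illustrative assumption: `keyword_classifier` stands in for a cheap model that returns a label with a confidence score, and the handlers simply tag the query so the chosen route is visible.

```python
import logging
from typing import Callable

logger = logging.getLogger("router")


def billing(query: str) -> str:
    return f"billing:{query}"


def tech(query: str) -> str:
    return f"tech:{query}"


def default_route(query: str) -> str:
    # The "uncertain" route: hand off to a human or a general-purpose handler.
    return f"human_review:{query}"


ROUTES: dict[str, Callable[[str], str]] = {"billing": billing, "tech": tech}


def keyword_classifier(query: str) -> tuple[str, float]:
    """Toy stand-in for a cheap model: returns (label, confidence)."""
    text = query.lower()
    if "invoice" in text or "refund" in text:
        return "billing", 0.9
    if "error" in text or "crash" in text:
        return "tech", 0.8
    return "tech", 0.3  # a low-confidence guess


def route(query: str,
          classify: Callable[[str], tuple[str, float]] = keyword_classifier,
          threshold: float = 0.6) -> str:
    label, confidence = classify(query)
    handler = ROUTES.get(label)
    if handler is None or confidence < threshold:
        handler = default_route  # unknown label or low confidence
    # Log every decision so a misroute can be found after the fact.
    logger.info("label=%s confidence=%.2f handler=%s", label, confidence, handler.__name__)
    return handler(query)
```

The threshold turns a silent misroute into an explicit fallback: a low-confidence guess goes to `default_route` instead of a specialist that would confidently solve the wrong problem.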
Run independent LLM calls concurrently and aggregate — by sectioning or by voting.
Parallelization [1] runs independent calls at once and aggregates. Two shapes: sectioning splits a task into independent subtasks run in parallel; voting runs the same task several times and aggregates — majority vote, or "flag if any run flags it," which is useful for guardrails. It is the static counterpart to orchestrator-workers: parallelization fans out to a fixed, known set, while an orchestrator decides the fan-out dynamically. Use it when subtasks are independent and latency matters, or when independent looks raise confidence. Cost is linear in calls, and voting only helps if errors are uncorrelated across runs.
# sectioning: independent subtasks, concurrently
parts = await gather(*(llm(s) for s in sections))
answer = aggregate(parts)
# voting: run the same task N times, then majority / "flag if any flags"
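The voting shape is only described in a comment above, so here is a small sketch of it using `asyncio.gather` and `collections.Counter` from the standard library. `make_fake_model` is a hypothetical stand-in that replays canned answers; with a real model the runs would need to be sampled independently for the vote to mean anything.

```python
import asyncio
from collections import Counter
from typing import Awaitable, Callable


def make_fake_model(answers: list[str]) -> Callable[[str], Awaitable[str]]:
    """Hypothetical stand-in for an LLM client: replays canned answers in order."""
    remaining = iter(answers)

    async def call(task: str) -> str:
        await asyncio.sleep(0)  # yield control, as a real network call would
        return next(remaining)

    return call


async def run_votes(task: str, n: int, call: Callable[[str], Awaitable[str]]) -> list[str]:
    """Run the same task n times concurrently and collect every answer."""
    return list(await asyncio.gather(*(call(task) for _ in range(n))))


def majority(votes: list[str]) -> str:
    """Return the most common answer."""
    return Counter(votes).most_common(1)[0][0]


def flag_if_any(votes: list[bool]) -> bool:
    """Guardrail-style aggregation: one flag is enough to flag the input."""
    return any(votes)
```

Majority vote suits questions with a single right answer, while `flag_if_any` trades false positives for recall. Both cost n calls, and both assume the runs fail independently.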
One model generates while a separate evaluator scores against criteria, in a loop.
Evaluator-Optimizer [1] separates roles: one call generates, a distinct evaluator scores the output against explicit, often external criteria, looping until the bar is met. It is close to Reflection but the distinction is who judges and against what — Reflection is typically the same model critiquing itself; here the critic is separated from the generator's framing. The separation helps when you have clear criteria and want an unbiased judge — literary translation against a rubric, high-stakes reasoning. Both buy quality through iteration and get expensive; guard with a hard iteration cap and early-exit on "pass."
draft = generator(task)
for _ in range(MAX_ROUNDS):                  # hard iteration cap
    score = evaluator(draft, criteria)       # distinct critic, external rubric
    if score.passes:
        break                                # early-exit on "pass"
    draft = generator(task, score.feedback)
Part III · Multi-agent topologies
More agents multiply cost, latency, and communication failure surface, so a single good agent often wins. When you genuinely need several, the design question is who holds control.
A lead LLM dynamically decomposes a task, delegates to workers, and synthesises.
Orchestrator-Workers [1]: a lead model decomposes a task dynamically at runtime and spins up workers for sub-tasks it discovers — a coding agent finding which files to edit, for instance — then synthesises their results. The contrast with a Supervisor is whether sub-tasks are known in advance: here decomposition is unpredictable. Use it when you can't enumerate the sub-tasks up front. The shared risk: the lead is a coordination bottleneck and single point of failure.
subtasks = orchestrator.decompose(goal)      # decided at runtime
results = [worker(t) for t in subtasks]      # spun up per discovered task
return orchestrator.synthesize(results)
A supervisor routes work to named specialists and collects their answers.
A Supervisor is a fixed topology: a router over named specialists — coder, tester, reviewer — that holds control, delegates outward, and collects results. Use it when roles are stable and you want easy observability and a central view. Versus Orchestrator-Workers, the difference is that the roster is known in advance rather than discovered. Versus a Swarm, a central agent always sees the whole task, which makes global guardrails and audit far easier. The cost: the supervisor is the same bottleneck and single point of failure.
SPECIALISTS = {"code": coder, "test": tester, "review": reviewer}
while not state.done:
    who = supervisor.route(state)       # fixed roster, central view
    state = SPECIALISTS[who](state)     # control returns to supervisor
Peer agents transfer control by calling a hand-off tool — no central orchestrator.
In a Handoff/Swarm [16], peers transfer control via a hand-off tool (Triage → Billing → Refunds); no one orchestrates. The hand-off is just a tool call that swaps the active agent and its instructions. Decentralised control wins when the flow is a chain of specialists that each fully own their segment and a global view matters less than low coordination overhead. The trade-off is sharp: no single agent ever sees the whole task, which makes end-to-end reasoning and global guardrails harder.
agent = triage
while agent:
    reply, handoff = agent.run(conversation)
    agent = handoff    # a tool call swaps the active agent + its instructions
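A runnable version of the Triage → Billing → Refunds chain might look like the sketch below. It simplifies the hand-off to a returned agent name rather than a real tool call, and the `Agent` dataclass, the scripted `step` functions, and the `max_handoffs` cap are assumptions added for illustration. The cap is there because, with no central orchestrator, nothing else would stop two peers from passing control back and forth.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# A step returns (reply, next_agent_name); None means no hand-off, so the run ends.
Step = Callable[[list[str]], tuple[str, Optional[str]]]


@dataclass
class Agent:
    name: str
    instructions: str
    step: Step


def run_swarm(agents: dict[str, Agent], start: str,
              conversation: list[str], max_handoffs: int = 5) -> list[str]:
    """Follow hand-offs from agent to agent; return the path of agents visited."""
    path: list[str] = []
    current: Optional[str] = start
    while current is not None:
        if len(path) > max_handoffs:
            raise RuntimeError(f"more than {max_handoffs} hand-offs: {path}")
        agent = agents[current]  # swapping the agent swaps its instructions too
        path.append(agent.name)
        reply, current = agent.step(conversation)
        conversation.append(f"{agent.name}: {reply}")
    return path


# Scripted peers standing in for model-backed agents.
AGENTS = {
    "triage": Agent("triage", "Classify the request.",
                    lambda conv: ("Routing you to billing.", "billing")),
    "billing": Agent("billing", "Resolve billing issues.",
                     lambda conv: ("This needs a refund.", "refunds")),
    "refunds": Agent("refunds", "Issue refunds.",
                     lambda conv: ("Refund issued.", None)),
}
```

The returned path is the only global view this design offers, which is a reason to log it: each agent sees the conversation, but no agent is responsible for the whole task.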
Also a topology
Specialised agents communicate over an event backbone using agent-to-agent protocols.
In practice: Agents publish and subscribe over an event bus (Kafka) via A2A protocols [13], which suits AI-native rebuilds (pricing, fraud) best. Fully decentralised and scalable, but the communication surface is the failure surface.
bus.publish("price.updated", event) # agents react over Kafka / A2A
Part IV · System-theoretic patterns
To move past convenience-based lists, Dao et al. [4] deconstruct an agent into five subsystems — Reasoning & World Model, Perception & Grounding, Action Execution, Learning & Adaptation, Inter-Agent Communication — and derive patterns that each fix a specific subsystem failure. Plain ReAct implements these implicitly and monolithically, which is exactly why it's brittle. The engineering value is a diagnostic vocabulary: name which subsystem a failure lives in, and which pattern fixes it.
Foundational — perception & memory
Validate incoming information in Perception & Grounding before it enters reasoning.
Fixes: Cognitive data quality, i.e. a stale or wrong observation acted on as fact. The first thing to add when hardening a raw ReAct loop.
value = fetch(); assert fresh(value) # validate before reasoning consumes it
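As a slightly fuller sketch of the same idea, the check below rejects an observation that is too old or outside a plausible range before it reaches reasoning. The `Observation` fields, the five-minute age limit, and the value bounds are all illustrative assumptions; real validity rules depend on the data source.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Observation:
    source: str
    value: float
    fetched_at: datetime  # expected to be timezone-aware


class RejectedObservation(ValueError):
    """Raised when an observation fails validation and must not be reasoned over."""


def integrate(obs: Observation, now: datetime,
              max_age: timedelta = timedelta(minutes=5),
              low: float = 0.0, high: float = 1_000_000.0) -> Observation:
    """Return the observation unchanged if it passes every check, else raise."""
    age = now - obs.fetched_at
    if age > max_age:
        raise RejectedObservation(f"{obs.source}: stale, fetched {age} ago")
    if not (low <= obs.value <= high):  # also rejects NaN
        raise RejectedObservation(f"{obs.source}: value {obs.value} out of range")
    return obs
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check deterministic and easy to test.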
A simplified, context-aware interface to memory — read the relevant slice, not everything.
Fixes: Inefficient context retrieval and "lost in the middle" degradation. The read side of memory.
ctx = store.search(query, k=5) # relevant slice, not the whole history
Capture and externalise reasoning/world-model state so it can be restored.
Fixes: State saving and restoring. The externalised state survives the context window, so a long run can be resumed. The write side of memory.
store.save(agent.state) # survives the context window; restore later
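One plain way to externalise state is a JSON checkpoint on disk. The sketch below assumes a small `AgentState` dataclass with JSON-serialisable fields (the field names are hypothetical) and writes through a temporary file so that a reader does not see a partially written checkpoint.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field


@dataclass
class AgentState:
    goal: str
    step: int = 0
    notes: list[str] = field(default_factory=list)


def save(state: AgentState, path: str) -> None:
    """Write the checkpoint to a temp file, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(asdict(state), handle)
    os.replace(tmp_path, path)  # replaces any previous checkpoint in one step


def restore(path: str) -> AgentState:
    """Rebuild the state from the last checkpoint."""
    with open(path, encoding="utf-8") as handle:
        return AgentState(**json.load(handle))
```

Saving after every completed step means a crashed or interrupted run can resume from `restore(path)` instead of replaying the whole trajectory.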
Cognitive & decisional — the planning stack
Prioritise and adapt goals — a Mediator over competing objectives.
Fixes: Tactical goal selection, i.e. deciding which objective to pursue now. GoF: Mediator.
goal = select(active_goals) # prioritise which objective to pursue now
Select the optimal concrete action at each step.
Fixes: Dynamic action adaptation. This is the action-level layer below Selector (which goal) and Planner (which route).
action = choose(candidates, state) # best concrete next move
Execution & interaction
Reliably execute dispatched actions and collect feedback.
Fixes: Execution reliability and error recovery. This is the disciplined counterpart to a raw tool call. (Tool Use is the shared mechanism; see Tool Use in Part I.)
result = run(action); recover(result.errors) # reliable dispatch + feedback
Manage structured inter-agent communication.
Fixes: Communication breakdowns, through message contracts, who-talks-to-whom, and shared-state rules. The antidote to a multi-agent system that "forgets" context or deadlocks.
msg = Contract(to="billing", payload=...) # structured who-talks-to-whom
Adaptive & learning
Analyse outcomes to infer causality and adjust strategy.
Fixes: Causal learning and adaptation. Unlike Reflection (which revises one output), the Reflector learns across whole trajectories so the agent stops repeating mistakes.
lesson = analyse(trajectory) # infer causality, adjust future strategy
Continuously monitor behaviour for alignment — an Observer.
Fixes: Value alignment and transparency. An always-on runtime guardrail, not a one-time eval. GoF: Observer. Central to governance (see Part VI).
if violates(policy, action): halt() # always-on Observer over behaviour
Part V · Reliability & memory
Durable agents need more than a loop. Memory evolves from raw storage to retrieval-augmented context to experience — proactive exploration and cross-trajectory abstraction [11]. Voyager [8] showed lifelong skill acquisition via an automatic curriculum plus an ever-growing skill library of executable code (the Skill Build pattern).
Reliability needs structured error handling, not blind retries. The read/write split of memory lives in Retriever and Recorder; the learning loop in Reflector and Skill Build. To harden a brittle ReAct loop in practice: a hard step budget plus loop detection; an Integrator to validate observations; durable memory instead of cramming context; and a structured exception taxonomy so recovery has escalation pathways.
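A structured exception taxonomy with escalation pathways could be sketched as follows. The three error classes and their mapping to retry, re-plan, and escalate are assumptions chosen to illustrate the idea, not a taxonomy taken from the cited sources.

```python
import time
from typing import Any, Callable


class AgentError(Exception):
    """Base class for failures the agent runtime knows how to handle."""


class TransientError(AgentError):
    """Timeouts, rate limits: retrying is reasonable."""


class InvalidObservation(AgentError):
    """The tool answered, but the data failed validation: re-plan."""


class PolicyViolation(AgentError):
    """The action is not allowed: never retry, escalate to a human."""


def run_step(step: Callable[[], Any], max_retries: int = 2,
             base_delay: float = 0.0) -> tuple[str, Any]:
    """Run one step and map each failure class to a recovery pathway.

    Returns ("ok", value), ("replan", reason), or ("escalate", reason).
    Exceptions outside the taxonomy propagate unchanged.
    """
    for attempt in range(max_retries + 1):
        try:
            return ("ok", step())
        except TransientError:
            if attempt == max_retries:
                return ("escalate", "retries exhausted")
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
        except InvalidObservation as exc:
            return ("replan", str(exc))
        except PolicyViolation as exc:
            return ("escalate", str(exc))
    return ("escalate", "retries exhausted")  # only reached if max_retries < 0
```

The point of the taxonomy is that only one class is ever retried. A stale observation or a policy violation would fail identically on every retry, so each gets its own pathway instead.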
Part VI · Governance & human oversight
Production agents need layered defence: a human checkpoint on the narrow set of irreversible actions, an always-on policy monitor, and observability that turns a 15-step failure from an unreadable stack into a diagnosable trace.
A control point for human approval before irreversible or high-risk actions.
HITL [10] inserts a checkpoint before irreversible/high-risk actions — payments, account changes, deploys. Magentic-UI [10] is instructive: it defines six mechanisms — co-planning, co-tasking, multi-tasking, action guards, answer verification, and long-term memory — so human involvement is low-cost and targeted rather than a blanket gate. Scope HITL tightly to genuinely dangerous steps (or reviewers rubber-stamp), summarise clearly at the checkpoint, and give a "reject + feedback" path the agent can act on, not just approve/deny. The cost is latency and human time. Pair it with a Controller (always-on policy monitor) and full tracing for governance.
action = agent.propose()
if action.risk == "high":                        # action guard
    if not human.approve(summary(action)):       # clear summary at the checkpoint
        return agent.revise(human.feedback)      # reject + feedback, not just deny
execute(action)
Pairs with
Trace every Thought / Action / Observation, tokens, cost, and tool latency.
In practice: Makes Controller and HITL auditable and turns a long failure into a diagnosable trace. Pair with structured exception handling [9] and hard loop/budget caps so a misbehaving agent fails safe.
trace(step, thought, action, tokens, latency) # every step, auditable
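A minimal tracer that also enforces a token budget might look like this. The record fields follow the list above; the `Tracer` class, the JSON-lines output, and the choice to raise on overspend are assumptions for the sake of a runnable sketch.

```python
import json
from dataclasses import asdict, dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when the run has spent more tokens than its cap allows."""


@dataclass
class StepRecord:
    step: int
    thought: str
    action: str
    observation: str
    tokens: int
    latency_s: float


@dataclass
class Tracer:
    token_budget: int
    records: list[StepRecord] = field(default_factory=list)

    @property
    def total_tokens(self) -> int:
        return sum(record.tokens for record in self.records)

    def record(self, thought: str, action: str, observation: str,
               tokens: int, latency_s: float) -> None:
        # Record first, so the trace includes the step that broke the budget.
        self.records.append(StepRecord(len(self.records) + 1, thought, action,
                                       observation, tokens, latency_s))
        if self.total_tokens > self.token_budget:
            raise BudgetExceeded(f"{self.total_tokens} tokens > budget {self.token_budget}")

    def dump(self) -> str:
        """One JSON object per line, ready for a log pipeline."""
        return "\n".join(json.dumps(asdict(record)) for record in self.records)
```

Tying the budget check to the tracer means the cap cannot be bypassed by a code path that forgets to count, and the failing step is always present in the dumped trace.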
Part VII · Enterprise integration
Most agents don't live in a greenfield loop — they integrate with software that already exists. These hybrid patterns [13] place the agent relative to your stack, ordered by how much you let it touch.
Part VIII · Decide
Agentic patterns compose, so this doesn't hand you one pattern — it builds a layered design. It opens with a least-autonomy gate: only if a single call or a workflow genuinely won't do does it layer on a reasoning loop, a topology, memory, reliability, and governance. Treat the result as a starting point you can argue with — and remember a single good agent usually beats a swarm.
The honest takeaway: most tasks need far less autonomy than a catalog implies. Every loop, hand-off, and extra model call buys flexibility at the cost of latency, tokens, and debuggability — so make each one justify itself.
Start at the bottom of the spectrum. If a single well-prompted call (optionally with retrieval) answers it, stop there. If the steps are predictable, reach for a workflow shape — cheaper and testable. Escalate to an agent loop only for genuinely open-ended paths, and bound it: a step budget, an Integrator on observations, durable memory, structured error handling. Add a second agent only when one truly can't hold the task — a single good agent beats most multi-agent designs. Put a human on the irreversible actions, a Controller on policy, and a trace on everything. The patterns aren't new inventions — Selector is a Mediator, Tool Use a Proxy/Adapter, Controller an Observer — they're systematisations of structures you already know.
Sources behind the patterns; [n] markers throughout point here. Compiled from “Q&A: Agentic Patterns” (A. Krysztopa), verified June 2026.