a zine for agent-builders · v1

LangGraph &
Google ADK

two ways to build agents, side by side. primitives, data models, and the stuff the docs don't tell you — prepared for a forward deployed engineer interview.

read time: ~25 min
interactive: super-step demo, event stream, quiz
vibe: julia evans, but technical

the big idea

Both frameworks solve the same problem (orchestrating stateful LLM workflows) from opposite directions.

LangGraph is a graph runtime. You define nodes, edges, and shared state. You build the agent.

ADK is an agent framework. You pick agent types (LlmAgent, SequentialAgent, LoopAgent) and compose them. You configure the agent.

Keep that split in your back pocket. Everything fits around it.

· · ·
Part One

LangGraph 🕸️

LangGraph models an agent as a directed graph with shared state, inspired by Google's Pregel system. Every node reads state → does something → returns a state update. Edges decide who runs next. That's really the whole thing.

three primitives, that's it

01

State

A shared TypedDict. Every node reads and writes to it.

02

Nodes

Plain functions: (state) → state_update. Do the work.

03

Edges

Rules for what runs next. Fixed or conditional.

interview memory hook: State · Nodes · Edges. Lead with those three when asked "what are LangGraph's primitives?" Everything else — reducers, Send, Command, checkpointers — is variations on these.

defining state, with a twist

from typing import TypedDict, Annotated
from langgraph.graph.message import add_messages
from langchain_core.messages import AnyMessage

class AgentState(TypedDict):
    messages: Annotated[list[AnyMessage], add_messages]  # ← reducer!
    user_name: str
    classification: str   # written by the classify node, read by the router
    turn_count: int

That Annotated[list, add_messages] annotation is a reducer. It tells LangGraph how to merge a new value for a key with the old one. Without it, new values overwrite old ones. With it, new messages append. We'll see that live in a minute.
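If the annotation machinery feels abstract, here's a framework-free sketch of what a reducer does — plain Python, not LangGraph's actual internals; `apply_update` and the toy `add_messages` are illustrative stand-ins:

```python
# Minimal sketch of reducer semantics — plain Python, NOT LangGraph internals.
def apply_update(state: dict, update: dict, reducers: dict) -> dict:
    """Merge a node's partial update into state, using a reducer where one exists."""
    new_state = dict(state)
    for key, value in update.items():
        if key in reducers:
            new_state[key] = reducers[key](state.get(key), value)  # merge
        else:
            new_state[key] = value  # default behavior: overwrite
    return new_state

def add_messages(old, new):  # toy stand-in for langgraph's add_messages
    return (old or []) + new

state = {"messages": ["hi"], "turn_count": 1}
reducers = {"messages": add_messages}

state = apply_update(state, {"messages": ["hello!"], "turn_count": 2}, reducers)
# messages appended (reducer), turn_count overwritten (no reducer)
assert state == {"messages": ["hi", "hello!"], "turn_count": 2}
```

Same state, two keys, two merge behaviors — that's the whole reducer idea.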

a node is just a function

def classify(state: AgentState) -> dict:
    # read state
    last_msg = state["messages"][-1].content
    # do work (could be an LLM call, a tool, whatever)
    label = "urgent" if "help" in last_msg else "normal"
    # return ONLY the fields you want to update
    return {"classification": label}

edges: three flavors

from langgraph.graph import StateGraph, START, END

graph = StateGraph(AgentState)
graph.add_node("classify", classify)
graph.add_node("respond", respond)
graph.add_node("escalate", escalate)

# 1. fixed edge: always go from START to classify
graph.add_edge(START, "classify")

# 2. conditional edge: pick next based on state
def route(state):
    return "escalate" if state["classification"] == "urgent" else "respond"

graph.add_conditional_edges("classify", route, ["respond", "escalate"])

# 3. terminal edges
graph.add_edge("respond", END)
graph.add_edge("escalate", END)

app = graph.compile()  # ← don't forget this!
the thing you'll forget: .compile(). The graph object is a builder, not a runnable, until you compile it. Trips up every newcomer at least once.

how it actually runs: the super-step model

LangGraph doesn't traverse the graph step-by-step like a flowchart. It runs in super-steps: in each step, all active nodes run in parallel, their outputs merge via reducers, and the next set of active nodes is computed. Inspired by Pregel. Step through the demo below to see it live.
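To make the wave model concrete, here's a toy super-step executor — a sketch of the Pregel idea, not LangGraph's engine (no reducers or conditional edges; updates just overwrite):

```python
# Toy super-step executor: each wave runs ALL active nodes, merges their
# updates, then computes the next wave. NOT LangGraph's real engine.
def run_supersteps(nodes, edges, state, end="END"):
    active = set(edges["START"])           # super-step 1 frontier
    steps = 0
    while active:
        steps += 1
        updates = [nodes[name](state) for name in active]  # the "parallel" wave
        for update in updates:
            state.update(update)           # merge (overwrite semantics here)
        # next frontier = successors of everything that just ran
        active = {n for name in active for n in edges.get(name, []) if n != end}
    return state, steps

nodes = {
    "classify": lambda s: {"label": "urgent" if "help" in s["text"] else "normal"},
    "respond":  lambda s: {"reply": f"handled as {s['label']}"},
}
edges = {"START": ["classify"], "classify": ["respond"], "respond": ["END"]}

final, steps = run_supersteps(nodes, edges, {"text": "please help"})
assert steps == 2 and final["reply"] == "handled as urgent"
```

Two nodes, two waves: classify runs in super-step 1, respond in super-step 2, then every node is quiet and the run halts.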

interactive demo

super-step executor

[interactive: a graph view of START → classify → {respond, escalate} → END, with a super-step counter, a live state panel (messages, classification, response), and active-node highlights]

Press step to begin executing the graph. The user message is "please help me urgently".
Notice how it's not stepping edge by edge — it runs in waves. All the active nodes in super-step N run together, their updates merge, and the waves keep propagating until every node goes quiet.

reducers: the trap nobody explains

When two parallel nodes update the same state key, or when you call the same node multiple times, LangGraph needs to know how to merge the new value with the old. That's what a reducer is.

Default behavior: overwrite. But for lists of messages, you usually want append. Watch the difference:

reducer playground

🚫 no reducer (overwrites) — each write replaces the list. Old messages vanish. 😱

✓ with add_messages reducer — each write appends. History accumulates. 🙌

interview gold: if they ask "why do messages accumulate but other fields get overwritten?" — it's the add_messages reducer. Nothing more. Being able to explain this cleanly signals real understanding.

two advanced primitives worth knowing

Send is for dynamic fan-out: spawn N parallel invocations of a node when you don't know N at graph-definition time (map-reduce-ish).

from langgraph.types import Send

def dispatch(state):
    return [Send("make_joke", {"subject": s}) for s in state["subjects"]]

graph.add_conditional_edges("pick_subjects", dispatch)

Command lets a node update state AND route in one return — skipping the usual edge resolution. Great for multi-agent handoffs.

from langgraph.types import Command

def review(state) -> Command:
    return Command(
        update={"status": "approved"},
        goto="deploy"   # jump, don't use edges
    )

persistence = superpower

Attach a checkpointer and LangGraph saves state after every super-step. This is wild. It unlocks: fault tolerance, human-in-the-loop, time travel, and long-running agents.

from langgraph.checkpoint.memory import InMemorySaver
# or: from langgraph.checkpoint.postgres import PostgresSaver

app = graph.compile(checkpointer=InMemorySaver())

# every run needs a thread_id; same thread = same conversation history
cfg = {"configurable": {"thread_id": "conv-42"}}
app.invoke({"messages": [...]}, config=cfg)
· · ·
Part One · supplement

wait, where do LLMs plug in?

We've been hand-waving the "do work" part of every node. Time to unwave that. Here's the thing most tutorials don't make obvious:

LangGraph has zero LLM primitives. It's purely orchestration. The LLM stuff comes from LangChain — its sibling library. When you "use LangGraph with an LLM," you're really using two libraries, with a node being where they meet.

So the mental picture for any LLM-calling node is: a function that invokes a LangChain model and returns the result into graph state. That's the whole integration.

the four primitives you'll touch every day

🧠

Chat models

ChatOpenAI, ChatAnthropic, or init_chat_model("anthropic:claude-sonnet-4-6") for a unified wrapper.

💬

Messages

Human, AI, System, Tool. The native data flowing through the conversation.

🔧

Tools

@tool-decorated functions. LLM reads name + docstring + types to decide when to call.

🔗

bind_tools()

Attaches tool schemas to a model so its responses can include tool_calls.

the message list IS the conversation

An LLM call is literally llm.invoke(list_of_messages) → a new AIMessage. Four message types flow through state:

from langchain_core.messages import (
    HumanMessage,    # user said
    AIMessage,       # model said (may contain .tool_calls!)
    SystemMessage,   # system prompt
    ToolMessage,     # result of a tool execution
)

An AIMessage has two fields that matter: .content (the text) and .tool_calls (list of functions the model wants to invoke). This is how tool-calling LLMs say "I want to run this function" — they emit a structured tool_call instead of plain text.

tools: write function, add decorator, done

from langchain_core.tools import tool

@tool
def get_weather(location: str) -> dict:
    """Get the current weather for a location."""
    return {"temp": 72, "condition": "sunny"}

# attach to model — now the LLM knows this tool exists
llm_with_tools = llm.bind_tools([get_weather])

response = llm_with_tools.invoke(messages)
# response.tool_calls → [{"name": "get_weather", "args": {"location": "Tokyo"}, ...}]

The @tool decorator inspects the function's name, docstring, and type hints to build a JSON schema the model sees. Write good docstrings. The docstring IS the prompt.
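You can sketch the core of that with nothing but the stdlib — a heavily simplified version of what such a decorator extracts (the real langchain implementation handles Pydantic models, defaults, and nested types):

```python
import inspect

# Simplified sketch of schema extraction — NOT langchain's actual @tool.
PY_TO_JSON = {str: "string", int: "integer", float: "number", bool: "boolean", dict: "object"}

def tool_schema(fn) -> dict:
    """Build a minimal tool schema from a function's name, docstring, and hints."""
    sig = inspect.signature(fn)
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",   # the docstring IS the prompt
        "parameters": {
            name: PY_TO_JSON.get(param.annotation, "string")
            for name, param in sig.parameters.items()
        },
    }

def get_weather(location: str) -> dict:
    """Get the current weather for a location."""
    return {"temp": 72, "condition": "sunny"}

schema = tool_schema(get_weather)
assert schema["name"] == "get_weather"
assert schema["parameters"] == {"location": "string"}
```

This is also why an empty docstring is a bug, not a style issue: the model literally never learns what the tool is for.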

two prebuilt nodes save you boilerplate

from langgraph.prebuilt import ToolNode, tools_condition
from langgraph.graph import MessagesState   # pre-made state schema!

# ToolNode: looks at last AIMessage, executes every tool_call,
# appends results as ToolMessages to state
tool_node = ToolNode([get_weather])

# tools_condition: conditional edge function
# if last message has tool_calls → "tools" ; else → END
graph.add_conditional_edges("agent", tools_condition)

Also note MessagesState — a prebuilt state schema with messages: Annotated[list[AnyMessage], add_messages] already set up. Saves 3 lines and signals you know the idiom.

the agent loop, step by step

Put it all together and you get the ReAct loop — the shape 95% of LLM agents take. Watch one execute below. The key thing to notice: the message list grows on every iteration, and that's the mechanism by which the agent "remembers" what it did.

react loop executor

one tool-calling turn

User asks: "What's the weather in Tokyo?". Step through to watch state["messages"] grow.

[interactive: a loop view of START → agent ⇄ tools → END, with a live state["messages"] panel and an LLM-call counter. Press step to watch the user's HumanMessage land in state, then each AIMessage and ToolMessage append in turn.]

Notice: the add_messages reducer is doing silent work here. Every node returns {"messages": [new_msg]}, and the reducer appends. If you forgot the reducer, each step would overwrite history and the agent would forget everything between iterations. Ouch.

the full pattern in 15 lines

from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.prebuilt import ToolNode, tools_condition
from langchain.chat_models import init_chat_model

llm = init_chat_model("anthropic:claude-sonnet-4-6").bind_tools([get_weather])

def call_model(state: MessagesState):
    return {"messages": [llm.invoke(state["messages"])]}

g = StateGraph(MessagesState)
g.add_node("agent", call_model)
g.add_node("tools", ToolNode([get_weather]))
g.add_edge(START, "agent")
g.add_conditional_edges("agent", tools_condition)   # auto-route on tool_calls
g.add_edge("tools", "agent")                  # loop back
app = g.compile()

even shorter: create_agent (the one-liner)

If that's exactly the shape you want — and it often is — LangGraph has it prebuilt:

from langchain.agents import create_agent   # new v1 API

agent = create_agent(
    model="anthropic:claude-sonnet-4-6",
    tools=[get_weather],
    prompt="You are a helpful assistant.",
)
agent.invoke({"messages": [{"role": "user", "content": "weather in Tokyo?"}]})
version note worth knowing: pre-v1 this was create_react_agent from langgraph.prebuilt. As of LangGraph v1 (late 2025), it was moved + renamed to create_agent in langchain.agents, with a more flexible middleware system. Mention both names in interviews — signals you're tracking the current library state.

two more primitives to know exist

with_structured_output() — forces the model to return a Pydantic object instead of free text:

class Decision(BaseModel):
    should_escalate: bool
    reason: str

result = llm.with_structured_output(Decision).invoke(messages)
# result is a Decision instance, not a string — typed, validated

Streaming — app.stream() emits per-super-step updates; app.astream_events() emits token-level events. Both matter for real app UX.

interview framing for this whole section:

"LangGraph is orchestration-only. The model layer is LangChain: chat models, message types, tools, and bind_tools. Inside a node, you invoke a LangChain model on state['messages'] and return the response; the add_messages reducer appends. ToolNode and tools_condition are prebuilt helpers that implement the agent/tool loop, and create_agent (formerly create_react_agent) wraps the whole ReAct pattern in one call."

That answer covers separation of concerns, the main primitives, the canonical pattern, and the current-vs-deprecated API. Exactly what an FDE interviewer wants.
· · ·
Part One · workshop

build the agent loop yourself

Here's the truth: Google interviews often happen in a Google Doc. No autocomplete. No running code. No syntax highlighting to bail you out. What they're really testing isn't "can you type langgraph.prebuilt from memory" — it's can you reason about the data?

Given a blank page, can you design the right primitives for an agent loop and explain why each piece exists? That's the skill. Let's build one together, layer by layer. Click through the tabs below to see how each layer builds on the last.

the interview framing you want in your head: "An agent is just a loop. The loop calls an LLM. The LLM may request tools. Tools run. Results go back into the conversation. Repeat until the LLM stops asking for tools or we hit a budget." — if you can say that sentence and then design the data structures to support it, you've essentially passed this portion.
whiteboard · step through
Layer 1 — what are we even building? "Implement an agent that can use tools to answer a user question."

Classic open-ended interview prompt. Before writing a single line, ask yourself (or the interviewer) these five questions. The answers drive your data model.

clarifying questions to ask before coding
  • What does "tool" mean here? A Python function with a name + params + return, callable by the model via structured output.
  • How does the LLM signal a tool call? The model emits a structured object — name + args — not free text. Let's assume we have an LLM adapter that parses that.
  • What's the stopping condition? Either the LLM stops asking for tools (emits plain text) OR we hit an iteration budget.
  • Is conversation history needed? Yes — the model needs to see what tools returned to decide next steps. History is the memory.
  • Can tool calls happen in parallel? Yes, modern LLMs often emit multiple tool calls in one response. Handle it.
what to say out loud
"Before I code, let me think about the data. We need four things flowing through this loop: messages (the conversation), tools (the capabilities), tool calls (requests from the model), and tool results (what came back). I'll model each as a dataclass, then write the loop on top."

→ click data model to see what those dataclasses look like.

Layer 2 — the four dataclasses you need "What does each thing flowing through the loop look like?"

This is the most important layer. Get these right and the loop writes itself. Get them wrong and nothing works. Notice the mapping to LangChain primitives — same ideas, just from scratch.

pseudocode · the shapes
# 1. A message in the conversation
Message:
  role    : "user" | "assistant" | "tool"
  content : text (possibly empty)
  tool_calls  : list of ToolCall (if assistant wants tools)
  tool_call_id: string (only when role = "tool")

# 2. What the model requests
ToolCall:
  id   : unique id for this request
  name : name of tool to invoke
  args : dict of arguments

# 3. A tool definition (what's available)
Tool:
  name        : identifier
  description : what it does (for the LLM!)
  fn          : callable that does the work

# 4. The whole thing
AgentState:
  messages  : list[Message]
  iteration : int   # budget tracker
python · with dataclasses
from dataclasses import dataclass, field
from typing import Callable, Any, Literal

@dataclass
class ToolCall:
    id: str
    name: str
    args: dict[str, Any]

@dataclass
class Message:
    role: Literal["user", "assistant", "tool"]
    content: str = ""
    tool_calls: list[ToolCall] = field(default_factory=list)
    tool_call_id: str | None = None

@dataclass
class Tool:
    name: str
    description: str
    fn: Callable[..., Any]

@dataclass
class AgentState:
    messages: list[Message] = field(default_factory=list)
    iteration: int = 0
↪ why this design Message carries both text AND tool_calls. That's not an oversight — it's how real LLMs work. When the model wants to call a tool, it emits a message where content is empty and tool_calls is populated. This mirrors OpenAI/Anthropic/Gemini exactly. A naive design puts tool calls in a separate "request" queue and then you have to reconcile two timelines. Don't do that.
↪ the tool_call_id detail When a tool returns, you need to know which call it's answering — especially if the model made 3 parallel tool calls. So every ToolCall has an id, and the resulting Message(role="tool", ...) echoes that id in tool_call_id. This is the exact pattern the OpenAI tool_calls API uses. Mentioning this in an interview = instant credibility.
mapping back to the frameworks
If asked "how does this map to LangChain?" — my Message ≈ LangChain's HumanMessage/AIMessage/ToolMessage. My ToolCall ≈ the tool_calls field on AIMessage. My AgentState ≈ LangGraph's MessagesState. Same primitives, just not hiding behind a package.
Layer 3 — the loop, in pseudocode "What are the exact steps, in order?"

Before writing real code, sketch the loop in English-ish. This is what you'd write FIRST on the whiteboard. Once it's right, the translation to Python is mechanical.

pseudocode · the agent loop
function run_agent(user_query, tools, max_iters):
    state = AgentState()
    state.messages.append(Message(role="user", content=user_query))

    while state.iteration < max_iters:
        state.iteration += 1

        # 1. Call the LLM with full history + available tools
        response = llm.call(messages=state.messages, tools=tools)
        state.messages.append(response)

        # 2. If no tool calls → we're done, return the answer
        if not response.tool_calls:
            return response.content

        # 3. Otherwise, execute each tool call in order
        for call in response.tool_calls:
            tool = find_tool(tools, call.name)
            result = tool.fn(**call.args)      # run it!
            state.messages.append(
                Message(
                    role="tool",
                    content=str(result),
                    tool_call_id=call.id,       # link back
                )
            )

        # loop continues — next iteration, LLM sees tool results

    return "Reached max iterations without final answer"
↪ why an iteration budget LLMs can loop forever if they keep calling tools. Always have a max_iters. Without it, a broken tool (always returning "error, try again") will burn your API budget to zero. Production code ALWAYS has this. Interviewers love when candidates add it unprompted.
↪ the two stopping conditions There are exactly two ways this loop exits: (a) the LLM returns a response with no tool_calls (natural finish), or (b) we hit max_iters (safety escape hatch). Every production agent loop has both. If you only have (a), you have an infinite-loop bug waiting to happen.
what to say out loud
"I'm modeling this as a while loop, not recursion, because (1) I can bound iterations explicitly and (2) stack depth isn't tied to conversation length. Two exit points: natural termination when the model emits plain text, or the budget escape hatch."
Layer 4 — translate to Python "Now make it runnable."

Using the dataclasses from Layer 2 and the pseudocode from Layer 3. Notice how much of this is just typing out what you already designed.

python · full agent loop
def run_agent(
    user_query: str,
    tools: list[Tool],
    llm,                    # some LLM client we have
    max_iters: int = 10,
) -> str:
    state = AgentState()
    state.messages.append(Message(role="user", content=user_query))

    # build a lookup once — O(1) dispatch beats O(n) search in the loop
    tool_registry = {t.name: t for t in tools}

    while state.iteration < max_iters:
        state.iteration += 1

        # 1. Ask the model. llm.call returns a Message (role="assistant").
        response: Message = llm.call(
            messages=state.messages,
            tools=tools,       # schemas, not the functions themselves
        )
        state.messages.append(response)

        # 2. Natural stopping condition
        if not response.tool_calls:
            return response.content

        # 3. Execute each tool call; append results
        for call in response.tool_calls:
            tool = tool_registry.get(call.name)
            if tool is None:
                result = f"Error: no tool named {call.name}"
            else:
                try:
                    result = tool.fn(**call.args)
                except Exception as e:
                    result = f"Error: {e}"   # feed error back to LLM!

            state.messages.append(Message(
                role="tool",
                content=str(result),
                tool_call_id=call.id,
            ))

    return "[Hit max iterations]"  # budget exceeded
python · using it (usage example)
# Define some tools
def get_weather(location: str) -> dict:
    return {"temp": 72, "condition": "sunny"}

def search_web(query: str) -> list[str]:
    return ["result 1...", "result 2..."]

tools = [
    Tool(name="get_weather",
         description="Get current weather for a location",
         fn=get_weather),
    Tool(name="search_web",
         description="Search the web for a query",
         fn=search_web),
]

answer = run_agent(
    user_query="What's the weather in Tokyo and what are they known for?",
    tools=tools,
    llm=my_llm_client,
    max_iters=5,
)
↪ tiny design choices that signal seniority Tool registry as a dict. tool_registry = {t.name: t for t in tools} is O(1) lookup per call. Searching a list is O(n). In a loop that runs dozens of times, this matters. Interviewers notice.

Try/except around tool execution. If the tool raises, you don't crash the agent — you feed the error back as a tool result and let the LLM recover. This is how real agents handle flaky APIs.
stretch extensions (if they ask "what would you add?")
"I'd add: parallel tool execution with asyncio.gather when multiple tool_calls arrive together — big latency win. Streaming so we yield tokens as they come. Checkpointing — serialize AgentState after each iteration so we can resume after a crash. Tool validation — check args against the tool's schema before invoking, so a bad LLM call doesn't blow up the process."
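The first of those extensions is easy to sketch. Here's one way parallel tool execution could look, assuming tool_calls arrive as simple name/args dicts and sync tool functions get pushed to worker threads (the helper names are hypothetical):

```python
import asyncio

# Sketch of the "parallel tool execution" extension — names hypothetical.
async def execute_calls_parallel(tool_calls: list[dict], registry: dict):
    async def run_one(call):
        tool = registry.get(call["name"])
        if tool is None:
            return f"Error: no tool named {call['name']}"
        try:
            # run a sync tool in a worker thread so slow tools overlap
            return await asyncio.to_thread(tool, **call["args"])
        except Exception as e:
            return f"Error: {e}"          # errors become results, never crashes

    # gather preserves input order, even if tools finish out of order
    return await asyncio.gather(*(run_one(c) for c in tool_calls))

registry = {
    "get_weather": lambda location: {"temp": 72},
    "search_web":  lambda query: ["result 1"],
}
calls = [{"name": "get_weather", "args": {"location": "Tokyo"}},
         {"name": "search_web", "args": {"query": "Tokyo facts"}}]

results = asyncio.run(execute_calls_parallel(calls, registry))
assert results == [{"temp": 72}, ["result 1"]]
```

Note that asyncio.gather keeps results in request order, so the tool_call_id bookkeeping from Layer 2 still lines up one-to-one.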
Layer 5 — what they're actually testing "What separates a passing answer from a great one?"

Reasoning > syntax. On a Google Doc, nobody's grading your imports. They're grading how you think. Here's what interviewers look for.

🚩 red flags

  • No data model — just strings and dicts, mutation everywhere
  • Tool call and tool result in separate data structures you have to join
  • No stopping condition other than "LLM says it's done"
  • Recursion instead of a loop (works but stack-bounded)
  • No error handling on tool execution — one bad API = dead agent
  • Hardcoded tool dispatch (if/elif chain) instead of a registry
  • Can't explain why tool_call_id exists

✓ green flags

  • Starts by defining data, not control flow
  • Unified Message type with tool_calls field (matches real APIs)
  • Iteration budget + natural termination — both exits named
  • Tool registry for O(1) dispatch
  • Errors fed back to the LLM as tool results, not raised
  • Mentions async/parallel tool calls even if they don't write it
  • Can map their primitives to LangChain or ADK without hesitation
↪ the 30-second elevator answer Memorize this structure: "I'd model it as four dataclasses — Message, ToolCall, Tool, AgentState. The loop is: append user message, then while iter < budget, call LLM with full history, if no tool_calls we're done, else execute tools and append results as tool messages, loop. Two exits: natural (no tool_calls) or budget. Use a dict registry for O(1) tool dispatch, wrap tool execution in try/except so failures become tool results the LLM can recover from."
follow-up questions you should be ready for
Q: "What if tools are slow?" → async + gather for parallelism.
Q: "What if the LLM hallucinates a tool name?" → registry lookup returns None, feed "tool not found" back.
Q: "How would you resume a crashed agent?" → serialize AgentState (it's a dataclass, so asdict) after each iteration. Restore from disk → feed back into run_agent.
Q: "How do you prevent infinite tool-calling?" → iteration budget (already in). Can also detect repeat tool_calls and force a text response.
Q: "How does this map to LangGraph?" → My loop IS the graph. call_llm is a node, execute_tools is a node, my while/if logic is the conditional edge. LangGraph makes it declarative; I made it imperative. Same shape.

seeing the data model in action

The Message dataclass has four fields, but each individual message only uses some of them. The class is a union — "a message is one of these four things" — which is confusing until you see a full conversation play out. Let's trace one together.

Use the controls below to step through a real tool-calling conversation. Watch how tool_calls and tool_call_id appear on different messages, never on the same one.

conversation trace · interactive
scenario: user asks "What's the weather in Tokyo and Paris?" — which forces the LLM to make two parallel tool calls in a single turn. This is when tool_call_id earns its keep.
◆ which fields are "on" for each role?

role                       | content        | tool_calls   | tool_call_id
"user"                     | ✓ question     | ✗ empty      | ✗ None
"assistant" (answering)    | ✓ the answer   | ✗ empty      | ✗ None
"assistant" (wants tools)  | ⚠ often empty  | ✓ populated  | ✗ None
"tool"                     | ✓ the result   | ✗ empty      | ✓ which call
↪ the two fields solve different problems tool_calls appears on assistant messages — it's how the model says "I want these functions run." A list, because the model can request several in parallel (see messages #2 above — Tokyo AND Paris were requested together).

tool_call_id appears on tool messages — it's how we answer "this result is for which request?" A single string, because each tool result answers exactly one call. The IDs (call_abc, call_xyz) link the request to its answer — think of them as order numbers on a restaurant ticket.
↪ why can't we just pair by order? Three reasons: (1) tools finish out of order — slow ones return later, so positional matching breaks. (2) Every real LLM API (OpenAI, Anthropic, Gemini) requires tool_call_id in the request format; without it the API rejects your message. (3) When debugging a 50-message conversation, IDs make the pairing obvious — positional matching forces you to count.
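Reason (1) is easy to demonstrate with toy data — here's a sketch of id-based pairing versus the positional bug it prevents (the ids and payloads are made up):

```python
# Why id pairing beats positional pairing: results may arrive out of order.
requests = [
    {"id": "call_abc", "name": "get_weather", "args": {"location": "Tokyo"}},
    {"id": "call_xyz", "name": "get_weather", "args": {"location": "Paris"}},
]
# Paris (fast tool) returned before Tokyo (slow tool):
results = [
    {"tool_call_id": "call_xyz", "content": "18C, rainy"},
    {"tool_call_id": "call_abc", "content": "22C, sunny"},
]

# id-based pairing: correct regardless of arrival order
by_id = {r["tool_call_id"]: r["content"] for r in results}
paired = {req["args"]["location"]: by_id[req["id"]] for req in requests}
assert paired == {"Tokyo": "22C, sunny", "Paris": "18C, rainy"}

# positional pairing: silently swaps the answers
wrong = {req["args"]["location"]: res["content"]
         for req, res in zip(requests, results)}
assert wrong == {"Tokyo": "18C, rainy", "Paris": "22C, sunny"}  # bug!
```

The positional version doesn't crash — it just quietly tells the model Tokyo is rainy. That silent-wrongness is exactly why the APIs make the id mandatory.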
if asked "why one class instead of four subclasses?"
"Either works. A single Message with optional fields mirrors the on-the-wire format the LLM APIs use — they serialize as one JSON object per message with optional fields. Four separate classes (UserMsg, AssistantMsg, ToolCallMsg, ToolResultMsg) would give stricter type guarantees but require a discriminated union when iterating history. Trade-off between API fidelity and type safety. I went with the unified version because it's what real LLM APIs return."

the one-pager you should be able to recreate

If someone slides a blank Google Doc in front of you and says "implement an agent loop," this is the shape that should materialize. No imports, no frameworks — just primitives + logic.

from dataclasses import dataclass, field

@dataclass
class ToolCall: id: str; name: str; args: dict

@dataclass
class Message:
    role: str
    content: str = ""
    tool_calls: list = field(default_factory=list)
    tool_call_id: str | None = None

def run_agent(query, tools, llm, max_iters=10):
    registry = {t.name: t for t in tools}
    messages = [Message(role="user", content=query)]

    for _ in range(max_iters):
        resp = llm.call(messages=messages, tools=tools)
        messages.append(resp)

        if not resp.tool_calls:
            return resp.content

        for call in resp.tool_calls:
            try:
                result = registry[call.name].fn(**call.args)
            except Exception as e:
                result = f"Error: {e}"
            messages.append(Message(
                role="tool", content=str(result), tool_call_id=call.id
            ))

    return "[max iters]"
the meta-skill: notice how we went data model → pseudocode → Python in that order. On a whiteboard, this is the only way to not get lost. Start with control flow and you'll paint yourself into a corner. Start with data, and the control flow writes itself.
· · ·
Part Two

Google ADK 🤖

Where LangGraph gives you graph primitives, ADK gives you agent primitives. It's higher-level and more opinionated. Released at Google Cloud NEXT 2025, it's the same framework powering Google's own products (Agentspace, Customer Engagement Suite).

agent types, in order of importance

📝

LlmAgent

An LLM + instructions + tools + (optionally) sub-agents. The workhorse.

⚙️

Workflow Agents

SequentialAgent, ParallelAgent, LoopAgent. Deterministic orchestrators.

🔧

Custom Agents

Extend BaseAgent. For logic that doesn't fit the built-ins.

LlmAgent: the workhorse

from google.adk.agents import LlmAgent

capital_agent = LlmAgent(
    name="capital_agent",               # required, unique
    model="gemini-2.5-flash",
    description="Answers capital questions",  # for OTHER agents to route to this one
    instruction="Respond with the capital of the country asked.",  # system prompt
    tools=[get_capital_city],          # plain functions work!
    output_key="last_answer",          # auto-save response to state
)
don't confuse these: description is what OTHER agents see when deciding whether to delegate to this one. instruction is the system prompt for THIS agent's LLM. Both matter. They do different things.

workflow agents (the "no-LLM" orchestrators)

These are deterministic. They don't use an LLM to decide control flow — they just run their children in a fixed pattern. This is how ADK replaces the edge-routing you'd write by hand in LangGraph.

from google.adk.agents import SequentialAgent, ParallelAgent, LoopAgent

# assembly line: run in order, pass via state
pipeline = SequentialAgent(
    name="pipeline",
    sub_agents=[fetcher, analyst, summarizer],
)

# fan-out: all run concurrently
swarm = ParallelAgent(
    name="code_review_swarm",
    sub_agents=[security_checker, style_checker, performance_analyst],
)

# iterate until exit_loop tool called or max iterations hit
refiner = LoopAgent(
    name="refiner",
    sub_agents=[generator, critic],
    max_iterations=3,
)

the data model: session ▸ state ▸ events

Memorize this cold for the interview. Every ADK interaction lives inside a session, which holds two things: state (a key-value dict, scoped by prefix) and events (an append-only log of everything that happened).

state prefix conventions

prefix  | scope                              | when to use
(none)  | this session only                  | conversation-specific draft, plan, etc.
user:   | all sessions, same user            | user preferences, saved settings
app:    | all sessions, all users            | feature flags, shared config
temp:   | one invocation only, not persisted | intermediate scratch between sub-agents
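The prefixes are just a naming convention over one flat key space. An illustrative helper (not an ADK API) makes the routing rule explicit:

```python
# Illustrative helper — NOT an ADK API. Route a state key to its scope by prefix.
def scope_of(key: str) -> str:
    for prefix in ("user:", "app:", "temp:"):
        if key.startswith(prefix):
            return prefix[:-1]
    return "session"                 # no prefix → this session only

assert scope_of("draft") == "session"
assert scope_of("user:preferred_language") == "user"
assert scope_of("app:feature_flags") == "app"
assert scope_of("temp:scratch") == "temp"
```

The SessionService uses the scope to decide where (and whether) a key persists — which is why temp: keys vanish after the invocation ends.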

events are the fundamental unit of flow

Every interaction produces events. State never changes directly — it changes because an event with a state_delta was emitted. Watch a full turn play out below.

adk event stream

one session, one turn

User asks: "What's the capital of Peru?" Watch events flow into the session as the LlmAgent (capital_agent) processes it.

[live panels: session.state (empty at the start) and a cumulative event count, both updating as each event lands]
what you just saw: events are immutable, chronological, and every state change flows through one. The SessionService applies state_delta from events into session.state. That's why you use output_key or tool_context.state instead of mutating session.state directly — those helpers generate proper events.
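Conceptually, the SessionService is doing something like this — a plain-Python sketch, not ADK code (the event dict shapes are simplified):

```python
# Conceptual sketch of event application — NOT ADK code. Events are an
# append-only log; state is just the fold of their state_deltas.
session = {"state": {}, "events": []}

def append_event(session: dict, event: dict) -> dict:
    session["events"].append(event)                        # immutable log grows
    session["state"].update(event.get("state_delta", {}))  # delta applied — never direct mutation
    return session

append_event(session, {"author": "user", "text": "What's the capital of Peru?"})
append_event(session, {"author": "capital_agent", "text": "Lima",
                       "state_delta": {"last_answer": "Lima"}})  # e.g. via output_key

assert session["state"] == {"last_answer": "Lima"}
assert len(session["events"]) == 2
```

State as a fold over an event log is also why time travel and replay fall out for free: rebuild any past state by re-applying a prefix of the events.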

sub-agents: delegation vs orchestration

There are two different ways an agent can have children. This distinction gets people confused:

# 1. LLM-DRIVEN delegation (non-deterministic)
coordinator = LlmAgent(
    name="coordinator",
    model="gemini-2.5-flash",
    instruction="Delegate to the right specialist.",
    sub_agents=[greeter_agent, weather_agent],   # LLM picks one
)

# 2. DETERMINISTIC orchestration (fixed order)
pipeline = SequentialAgent(
    name="pipeline",
    sub_agents=[fetcher, analyst],   # runs in this exact order
)

When a LlmAgent has sub_agents, its LLM dynamically routes to one using a built-in transfer_to_agent tool (it reads each sub-agent's description to decide). When a SequentialAgent has sub_agents, they just run in order. Know which you want.

the Runner and contexts

You don't run agents directly — you run them through a Runner, which creates an InvocationContext that travels with execution. For most code you only touch the specialized context types:

context | where you see it | gives you
ToolContext | tool function params | state + artifact helpers + auth
CallbackContext | before/after-agent callbacks | state + artifacts
ReadonlyContext | read-only spots (e.g. dynamic instruction) | just read state
InvocationContext | inside BaseAgent._run_async_impl | everything (services, session, etc.)
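The shape is easier to remember with a concrete tool in hand. A framework-free sketch — `FakeToolContext` is a stand-in for ADK's real `ToolContext`, and `get_weather` is a hypothetical tool:

```python
from dataclasses import dataclass, field

@dataclass
class FakeToolContext:
    """Stand-in for ADK's ToolContext: the slice of context a tool actually sees."""
    state: dict = field(default_factory=dict)

def get_weather(city: str, tool_context: FakeToolContext) -> dict:
    """Return today's forecast for a city."""
    # in real ADK this write becomes a state_delta on the tool's event
    tool_context.state["last_city"] = city
    return {"forecast": f"sunny in {city}"}
```

In real ADK you never construct the context yourself — the Runner injects it when your tool declares a `tool_context` parameter.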
· · ·
Part Two · workshop

build a multi-agent system in ADK

Google's interviewer is likely to pose something like this — here it is, worded almost exactly as they would:

the interview prompt:

"Design a research assistant that takes a company name and produces a one-page brief covering the company profile, recent news, and financial snapshot. It should verify the sources don't conflict — if they do, flag it. Then tell me how you'd evaluate it."

The prompt is dense on purpose. It's testing whether you can (a) decompose the work, (b) pick the right agent type for each piece, (c) design the data flow, and (d) think about correctness. Let's walk through the full solution — with a visual of the architecture as it grows.

what ADK primitives do we need, and why?

↪ decomposition before coding
Three independent lookups (profile, news, financials) → ParallelAgent (latency wins).
A fixed pipeline of "fan out → merge → quality gate" → SequentialAgent (deterministic order).
A revise-until-good loop on the draft → LoopAgent (iterate with exit condition).
An intelligent routing step for conflict resolution → LlmAgent with sub_agents (LLM decides).
Individual reasoning steps with tools → LlmAgent (your workhorse).

One problem statement, five different agent types. That's why this question is so common — it's a Rorschach test for whether you know the full primitive vocabulary.

the architecture, one piece at a time

Step through the tabs below to see how the system is built layer by layer. Each stage highlights a new agent, tells you which ADK primitive to use, and shows the exact code.

architecture · step through
Sequential research_pipeline
Parallel fetch_fanout
LlmAgent profile_agent
LlmAgent news_agent
LlmAgent financials_agent
LlmAgent synthesizer + check
LlmAgent · with subs quality_router
LlmAgent conflict_clarifier
LoopAgent refine_loop

putting it all together

Now that each piece is built, here's the full wiring. This is roughly what you'd write on the whiteboard:

from google.adk.agents import LlmAgent, SequentialAgent, ParallelAgent, LoopAgent
from google.adk.tools import google_search
# get_stock_data and exit_loop are custom tools, assumed defined elsewhere

MODEL = "gemini-2.5-flash"

# ---------- 1. Three parallel fetchers ----------
profile_agent = LlmAgent(
    name="profile_agent",
    model=MODEL,
    description="Fetches company profile: CEO, HQ, industry.",
    instruction="Given a company name, use search to find profile. Return compact JSON.",
    tools=[google_search],
    output_key="profile",
)
news_agent = LlmAgent(
    name="news_agent",
    model=MODEL,
    instruction="Find 3 most recent news items for the company. Return dated bullets.",
    tools=[google_search],
    output_key="news",
)
financials_agent = LlmAgent(
    name="financials_agent",
    model=MODEL,
    instruction="Fetch latest market cap, revenue, and recent stock movement.",
    tools=[get_stock_data, google_search],
    output_key="financials",
)

fetch_fanout = ParallelAgent(
    name="fetch_fanout",
    sub_agents=[profile_agent, news_agent, financials_agent],
)

# ---------- 2. Synthesizer + conflict check ----------
synthesizer = LlmAgent(
    name="synthesizer",
    model=MODEL,
    instruction="""Merge {profile}, {news}, {financials} into a one-page brief.
    Also detect conflicts (e.g., news mentions acquisition, financials show old market cap).
    Output JSON: {"brief": str, "conflicts": list[str], "has_conflict": bool}.""",
    output_key="draft",
)

# ---------- 3. Refine loop (runs only when no conflict) ----------
critic = LlmAgent(
    name="critic",
    model=MODEL,
    instruction="Check {draft} for clarity, completeness. If good, call exit_loop. Else give feedback.",
    tools=[exit_loop],
    output_key="feedback",
)
reviser = LlmAgent(
    name="reviser",
    model=MODEL,
    instruction="Revise {draft} per {feedback}. Overwrite draft.",
    output_key="draft",
)
refine_loop = LoopAgent(
    name="refine_loop",
    sub_agents=[critic, reviser],
    max_iterations=3,
)

# ---------- 4. Conflict clarifier (only when has_conflict) ----------
conflict_clarifier = LlmAgent(
    name="conflict_clarifier",
    description="Handles cases where sources disagree. Asks user for guidance.",
    model=MODEL,
    instruction="Present {conflicts} to the user. Ask which source to trust. Await response.",
)

# ---------- 5. Router: LlmAgent with sub_agents for delegation ----------
quality_router = LlmAgent(
    name="quality_router",
    model=MODEL,
    instruction="""Examine {draft}. If has_conflict is true, transfer to conflict_clarifier.
    Otherwise transfer to refine_loop to polish the output.""",
    sub_agents=[conflict_clarifier, refine_loop],   # ← LLM picks one!
)

# ---------- 6. Root: the full pipeline ----------
research_pipeline = SequentialAgent(
    name="research_pipeline",
    sub_agents=[fetch_fanout, synthesizer, quality_router],
)
↪ what this demonstrates All five ADK primitives in one coherent system:
SequentialAgent for the outer pipeline (deterministic order).
ParallelAgent for the fan-out (latency win).
LlmAgents for each reasoning step (with output_key to pass data via state).
LoopAgent for iterative refinement — max_iterations as the budget cap, exit_loop for natural termination.
LlmAgent-with-sub_agents for dynamic routing (the LLM reads each sub-agent's description to pick one).
design trade-offs to mention out loud
Why Parallel, not Sequential, for fetching? Each lookup is independent; parallelism cuts latency ~3x. Sequential would be a correctness-equivalent but slower choice.

Why a separate router instead of a condition flag in Sequential? ADK workflow agents are deterministic — they don't branch. To branch, you need either a custom BaseAgent OR an LlmAgent whose LLM reads state and transfers to one of its sub_agents. I chose the latter because the descriptions on the sub-agents make the routing logic readable.

Why LoopAgent bounded at 3? Unbounded loops bleed API budget. Three iterations is enough for most quality issues. The critic can also short-circuit early via exit_loop.

What about callbacks? I'd add a before_tool_callback on the financials agent to block stale data (reject anything more than 24h old). And an after_agent_callback on the synthesizer to log a metric when has_conflict=true — so I can measure conflict rate in prod.
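The freshness check is a one-liner worth sketching. A minimal, framework-free version — `freshness_guard` and the `fetched_at` field are hypothetical names, and in real ADK this logic would live inside a before/after-tool callback:

```python
import time

STALE_AFTER_S = 24 * 3600  # reject anything older than 24h

def freshness_guard(tool_result: dict) -> dict:
    """Replace a stale tool result with an error so it never reaches the synthesizer."""
    age = time.time() - tool_result.get("fetched_at", 0)
    if age > STALE_AFTER_S:
        return {"error": "stale_data_rejected", "age_s": int(age)}
    return tool_result
```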
data-flow summary (this matters in interviews): The three fetchers write to state["profile"], state["news"], and state["financials"] via output_key. The synthesizer's instruction reads all three via {key} template injection and produces state["draft"] (structured JSON). quality_router examines the draft and picks a sub-agent. All communication flows through session.state — no direct function returns, no globals, no spooky action at a distance.
· · ·
Part Two · evaluation

"now, how would you evaluate this?"

This follow-up is where most candidates stumble. The common wrong answer: "I'd write some test cases and check if the output looks right." That's unit testing, which doesn't work for agents because they're non-deterministic. The right answer has a structure.

the key insight interviewers look for: agent eval is a pyramid, not a single test. You evaluate at multiple levels — individual tool calls, sub-agent outputs, full trajectories, end-to-end quality. Each level catches different failure modes. Click each tier below to expand.

the agent evaluation pyramid

1

Tool-level · trajectory correctness

Did the agent call the right tools with the right arguments, in roughly the right order?

ADK's built-in tool_trajectory_avg_score compares actual tool calls against an expected list. Supports three match modes:

# EXACT: every tool call matches, same order, no extras
# IN_ORDER: expected tools appear in order, extras allowed between
# ANY_ORDER: all expected tools appear, any order

expected_tools = [
    {"tool": "google_search", "args": {"query": "Acme Corp CEO"}},
    {"tool": "get_stock_data", "args": {"ticker": "ACME"}},
]
# For our research assistant, use ANY_ORDER at the top level
# (profile/news/fin run in parallel, real order is non-deterministic).
# But within each fetcher, IN_ORDER or EXACT makes sense.

When this catches bugs: The LLM hallucinates a tool name, calls the wrong tool, passes the wrong argument (e.g., "Acme" as a city name instead of company name), or skips a required tool call.
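The three match modes are easy to pin down in plain Python. A sketch of the semantics (my own helper names, not ADK's API — ADK computes this for you from the eval set):

```python
def any_order_match(expected: list[str], actual: list[str]) -> bool:
    """ANY_ORDER: every expected tool appears somewhere; order ignored, extras allowed."""
    return set(expected) <= set(actual)

def in_order_match(expected: list[str], actual: list[str]) -> bool:
    """IN_ORDER: expected tools appear as a subsequence; extras allowed in between."""
    it = iter(actual)
    return all(tool in it for tool in expected)  # `in` consumes the iterator up to each match
```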

2

Sub-agent · handoff quality

For multi-agent systems: did the coordinator transfer to the right specialist? Did the right agent run at the right time?

This is the one that matters MOST for our system because we have quality_router dynamically picking between conflict_clarifier and refine_loop. Getting the routing wrong = shipping broken output.

# Golden dataset entry for a "has conflict" case
{
    "query": "Research Stripe",
    "initial_state": {"mock_news": "Stripe acquired by Visa",
                      "mock_financials": "Market cap $95B standalone"},
    "expected_trajectory": [
        "profile_agent", "news_agent", "financials_agent",   # parallel
        "synthesizer",
        "quality_router",
        "conflict_clarifier",                                 # ← critical
    ],
    "must_NOT_run": ["refine_loop"],                          # shouldn't polish bad data
}

When this catches bugs: Router sends to refine_loop when it should have flagged a conflict. Or transfers to a sub-agent that doesn't exist. Or infinite-transfers between two agents.

3

End-to-end · response quality

Is the final brief good? Coherent, accurate, complete?

Two approaches, use both:

ROUGE-1 (response_match_score): cheap, fast, word-overlap with a reference answer. Good for regression detection ("did the output change?"). Bad for semantic quality.

LLM-as-judge (final_response_match_v2): a separate LLM scores the agent's answer vs reference on semantic equivalence. Much better signal, but costs money per eval.

# test_config.json  (inline comments are annotations — strip them; JSON has no comments)
{
    "criteria": {
        "tool_trajectory_avg_score": 0.8,       # loose for parallel
        "response_match_score": 0.6,             # ROUGE is lenient
        "final_response_match_v2": 0.85,         # LLM judge, strict
        "hallucinations_v1": 0.9,                # grounding check
    }
}

Why both? ROUGE catches "the output totally changed" regressions cheaply. LLM-judge catches "the output is plausibly different but actually wrong." In CI, run ROUGE on every commit, LLM-judge nightly.
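To make the "cheap but syntactic" point concrete, here's a simplified unigram-recall sketch in the spirit of ROUGE-1 (no stemming, no clipped counts — real ROUGE implementations do more):

```python
def rouge1_recall(reference: str, candidate: str) -> float:
    """Fraction of reference unigrams that appear in the candidate (simplified)."""
    ref_words = reference.lower().split()
    cand_words = set(candidate.lower().split())
    if not ref_words:
        return 0.0
    return sum(w in cand_words for w in ref_words) / len(ref_words)
```

Note the failure mode: "Lima is the capital of Peru" and "the capital of Peru is Lima" score a perfect 1.0 — but so would a sentence that reshuffles the same words into a wrong claim. That's exactly the gap the LLM judge closes.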

4

Custom · business-rule guards

Domain assertions no generic metric can catch.

For our research assistant, critical custom checks:

def no_hallucinated_tickers(eval_case, result):
    # every stock ticker in output MUST appear in the financials tool result
    output_tickers = extract_tickers(result.final_response)
    source_tickers = extract_tickers(result.state["financials"])
    return output_tickers.issubset(source_tickers)

def conflict_flag_must_be_honest(eval_case, result):
    # if sources disagree in the fixture, has_conflict MUST be True
    if eval_case.fixture.get("sources_disagree"):
        return result.state["draft"]["has_conflict"] is True
    return True

def financials_freshness_enforced(eval_case, result):
    # the before_tool_callback should have blocked stale data
    return "stale_data_rejected" in result.callback_logs

When this catches bugs: The agent invents a ticker symbol. The synthesizer suppresses a real conflict to look more confident. Stale data slips through because the callback regressed.
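For completeness, one possible shape of the `extract_tickers` helper the checks above lean on — a deliberately naive regex sketch (real ticker extraction would validate against an exchange listing):

```python
import re

def extract_tickers(text: str) -> set[str]:
    """Naive ticker extractor: any standalone run of 2-5 capital letters."""
    return set(re.findall(r"\b[A-Z]{2,5}\b", text))
```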

the five ADK evaluation metrics you should name

Being able to name metrics by ADK's actual API names signals you've actually used the tool. Memorize these:

built-in
tool_trajectory_avg_score

Compares actual vs expected tool-call sequence. Three match modes: EXACT, IN_ORDER, ANY_ORDER. Score per invocation: 1.0 match / 0.0 mismatch; averaged.

tool correctnessregression
built-in
response_match_score

ROUGE-1 (word-overlap) between actual and reference responses. Cheap, fast, but syntactic — misses semantic equivalence. Default threshold: 0.8.

lexical matchCI friendly
llm-judge
final_response_match_v2

LLM-as-judge variant of response_match. Scores semantic equivalence, not word overlap. More signal, costs tokens per eval.

semantic qualitycosts money
llm-judge
hallucinations_v1

Sentence-level grounding check. For each claim in the response, is it supported by retrieved context? Backed by Vertex AI Eval SDK.

groundednessfactuality
llm-judge
safety_v1

Harmlessness scoring via Vertex AI Eval SDK. Checks for unsafe responses. Critical for user-facing agents; less so for internal pipelines.

harmlessnessprod guard
custom
your own assertions

Domain-specific Python checks that no generic metric can express: "ticker must exist," "conflict flag must be honest," "freshness enforced." Ship a few.

business ruleshigh signal

how you'd actually run it — the loop

Not just "define metrics" — describe the full eval lifecycle. This is what earns the senior-IC signal:

◆ eval lifecycle
1.
build a golden dataset → use adk web to chat with the agent, save good sessions as eval cases. Capture trajectory + final response + initial state fixtures. ~20 cases to start, covering happy path + each failure mode (conflict-yes, conflict-no, stale-data, unknown-company, etc).
2.
pick metrics per workflow type → ParallelAgent → ANY_ORDER trajectory. SequentialAgent → IN_ORDER or EXACT. LoopAgent → track iteration count vs quality threshold. LlmAgent-with-subs → handoff quality + dedicated routing-correctness evals.
3.
wire into CI → adk eval on every PR with cheap metrics (ROUGE, trajectory match, custom assertions) — fast, deterministic. Nightly: full suite with LLM-judge + hallucinations_v1 — more signal, more cost.
4.
debug failures with the trace view → ADK web UI's Trace tab: inspect every event, see which agent transferred where, view state snapshots. Failed trajectory → usually a prompt fix ("you MUST call X before Y"). Failed response → judge diff reveals what changed.
5.
production observability → ADK ships OpenTelemetry traces. Export to Cloud Trace, Arize, Langfuse, or similar. Auto-score online traces, flag regressions, feed new failure patterns back into the eval set. The dataset grows with the agent.
↪ the "how would you eval this?" elevator answer "I'd evaluate at four tiers. At the tool level, tool_trajectory_avg_score with ANY_ORDER for parallel branches, EXACT for sequential. At the sub-agent level, dedicated handoff tests — especially for the LlmAgent router, where bad routing is the scariest failure mode. At the response level, ROUGE in CI for cheap regressions plus final_response_match_v2 nightly for semantic quality. On top, custom Python assertions for business rules like 'no hallucinated tickers' and 'conflict flag honesty.' Build a golden dataset with adk web, version it with the agent code, and iterate on failures using the Trace tab."

failure modes specific to multi-agent systems

These come up in follow-up questions. Know them by name:

failure mode | what it looks like | what catches it
wrong handoff | router sends to the wrong sub-agent | trajectory test with must_run + must_NOT_run
infinite loop | LoopAgent never exits; hits max_iterations every time | track iteration count; assert < max in eval
premature exit | loop exits before quality is achieved | response quality score; assert iteration count > 1
state pollution | a sub-agent overwrites a key it shouldn't | snapshot state after each sub-agent run
parallel races | two parallel sub-agents write the same key | explicit state-schema review; test for overwrites
hallucinated sub-agent | router tries to transfer to a non-existent agent | schema validation in a before_agent_callback
tool-response rot | cached/stale data used without a refresh | freshness assertions in a before_tool_callback
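The first row's check — must_run plus must_NOT_run — fits in a few lines. A sketch of a hypothetical eval helper (names mine, not ADK's):

```python
def trajectory_guard(ran: list[str], must_run: set[str], must_not_run: set[str]) -> list[str]:
    """Return a list of violations: agents that should have run but didn't, and vice versa."""
    seen = set(ran)
    missing = must_run - seen
    forbidden = must_not_run & seen
    return ([f"missing:{a}" for a in sorted(missing)]
            + [f"forbidden:{a}" for a in sorted(forbidden)])
```

An empty return list means the routing was correct; anything else is a failed eval case.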
the meta-skill: notice that "how would you eval this?" is really asking three things at once — what can go wrong? (failure modes), how do you detect it? (metrics), how do you operationalize it? (CI + observability loop). A great answer hits all three. A mediocre answer only hits metrics.
· · ·

same problem, both frameworks

The fastest way to internalize the difference is to see the same solution written in both. Flip the tab.

side by side
Task: Build a content pipeline that (1) fetches a topic summary, (2) writes a draft, and (3) revises it for clarity. Three steps, fixed order, each step's output feeds the next.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
# llm_summarize, llm_draft, llm_revise: your LLM-call helpers (not shown)

class State(TypedDict):
    topic: str
    summary: str
    draft: str
    final: str

def fetch_summary(state):
    return {"summary": llm_summarize(state["topic"])}

def write_draft(state):
    return {"draft": llm_draft(state["summary"])}

def revise(state):
    return {"final": llm_revise(state["draft"])}

g = StateGraph(State)
g.add_node("fetch", fetch_summary)
g.add_node("draft", write_draft)
g.add_node("revise", revise)
g.add_edge(START, "fetch")
g.add_edge("fetch", "draft")
g.add_edge("draft", "revise")
g.add_edge("revise", END)

app = g.compile()
result = app.invoke({"topic": "octopi"})
primitives used
State (TypedDict) shape of shared data
Nodes (3 functions) each returns partial state update
Edges (4 fixed) linear path through the graph
compile() + invoke() you run the graph yourself
observation: ~20 lines of scaffolding. Total control. You wire every edge.
from google.adk.agents import LlmAgent, SequentialAgent

fetcher = LlmAgent(
    name="fetcher",
    model="gemini-2.5-flash",
    instruction="Summarize the topic from the query.",
    output_key="summary",
)

drafter = LlmAgent(
    name="drafter",
    model="gemini-2.5-flash",
    instruction="Write a draft from this summary: {summary}",
    output_key="draft",
)

reviser = LlmAgent(
    name="reviser",
    model="gemini-2.5-flash",
    instruction="Revise for clarity: {draft}",
    output_key="final",
)

pipeline = SequentialAgent(
    name="content_pipeline",
    sub_agents=[fetcher, drafter, reviser],
)
# run via Runner + SessionService
primitives used
LlmAgent × 3 each is an LLM + instruction
output_key auto-saves response to session.state
{key} template injection reads state into instructions
SequentialAgent deterministic order, no LLM routing
observation: no state class, no edges. Agents + one workflow agent. More concise, less control.

the cheatsheet

concept | LangGraph | ADK
unit of work | Node (function) | Agent (LlmAgent)
shared data | State (TypedDict) | session.state (dict)
routing | edges (explicit) | workflow agents OR LLM delegation
history | state["messages"] | session.events
persistence | Checkpointers | SessionService
merging updates | Reducers (per-key) | Event-based state_delta
parallelism | super-steps, Send() | ParallelAgent
loops | cycles + conditional edges | LoopAgent
HITL | interrupt() + checkpointer | long-running tool
compilation needed | yes (.compile()) | no
long-term memory | BYO vectorstore | MemoryService (Memory Bank)
· · ·

gotchas

Click any card to expand. These are the things that don't show up in the tutorials but will trip you up in the interview.

!
Forgetting .compile()
langgraph

Your StateGraph is a builder, not a runnable. Until you call .compile(), you can't .invoke() it. Classic first-timer mistake.

Reducers silently overwrite
langgraph

No Annotated[..., reducer] = last-write-wins. If two parallel nodes both write to state["log"] without a reducer, only one survives. Non-deterministic. Always annotate list-like fields.
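The merge rule is small enough to model without LangGraph. A framework-free sketch of the mechanic (`merge_update` is my toy, not LangGraph's API — but it shows why the un-annotated key loses data):

```python
import operator

def merge_update(state: dict, update: dict, reducers: dict) -> dict:
    """Apply a node's partial update: use the key's reducer if one exists, else overwrite."""
    out = dict(state)
    for key, value in update.items():
        out[key] = reducers[key](out[key], value) if key in reducers else value
    return out
```

With `reducers={"log": operator.add}`, updates to `log` concatenate; `mode` is last-write-wins — which is fine for a scalar, silent data loss for a list.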

Cycles with no END condition
langgraph

LangGraph happily lets you build cycles. Great for agent loops, dangerous if you forget a conditional edge routing to END. Always have an exit.

🧵
thread_id is the unit of persistence
langgraph

Two invocations with the same thread_id share history. Different thread = fresh conversation. Forgetting to pass it on resume = lost state.

description ≠ instruction
adk

description is what OTHER agents see when deciding to delegate. instruction is the system prompt for THIS agent's LLM. Mixing them up breaks multi-agent routing in weird ways.

Don't mutate session.state directly
adk

Use output_key, tool_context.state, or EventActions(state_delta=...). Direct mutation skips the event log and can desync on persistence. The event IS the state change.

InMemorySessionService in prod
adk

Default for adk web. Fine for dev, disastrous for prod — scale to 2+ instances and sessions stop being shared. Use DatabaseSessionService or VertexAiSessionService for anything real.

🤔
Sub-agents under LlmAgent vs Workflow
adk

Under LlmAgent: LLM decides (non-deterministic routing). Under SequentialAgent: fixed order. Under ParallelAgent: all at once. Exact same parameter name, radically different behavior.

📝
Tool docstrings are prompts
adk

The LLM sees the function name, docstring, and param types to decide when to call your tool. A bad docstring = a tool that never gets called (or gets called wrong). Treat them like API docs.
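You can see roughly what gets extracted with nothing but the standard library. A sketch — the dict shape is illustrative, not ADK's exact wire format:

```python
import inspect

def get_weather(city: str) -> dict:
    """Return today's forecast for a city. Use when the user asks about weather."""
    return {"forecast": f"sunny in {city}"}

# roughly what a framework pulls out of your function and shows the model:
tool_schema = {
    "name": get_weather.__name__,
    "description": inspect.getdoc(get_weather),
    "parameters": list(inspect.signature(get_weather).parameters),
}
```

Everything in that dict is prompt material — which is why a vague docstring starves the tool of calls.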

· · ·

quiz yourself

Ten questions. Instant feedback. The ones you miss are the ones to study.

knowledge check

good luck! 🍀

The deepest interview signal isn't knowing every API — it's being able to say "here's what the framework gives me, here's what I'd have to build, here's the trade-off." You've got this.

end of zine · v1