Two ways to build agents, side by side: primitives, data models, and the stuff the docs don't tell you — prepared for a forward deployed engineer interview.
Both frameworks solve the same problem (orchestrating stateful LLM workflows) from opposite directions.
LangGraph gives you graph primitives: you wire the control flow yourself. ADK gives you agent classes (LlmAgent, SequentialAgent, LoopAgent) that you compose: you configure the agent rather than program the graph.
Keep that split in your back pocket. Everything fits around it.
LangGraph models an agent as a directed graph with shared state, inspired by Google's Pregel system. Every node reads state → does something → returns a state update. Edges decide who runs next. That's really the whole thing.
- State — a shared TypedDict. Every node reads and writes to it.
- Nodes — plain functions: (state) → state_update. They do the work.
- Edges — rules for what runs next. Fixed or conditional.

Everything else — Send, Command, checkpointers — is a variation on these three.
from typing import TypedDict, Annotated
from langgraph.graph.message import add_messages
from langchain_core.messages import AnyMessage

class AgentState(TypedDict):
    messages: Annotated[list[AnyMessage], add_messages]  # ← reducer!
    user_name: str
    turn_count: int
    classification: str  # written by the classify node
That Annotated[list, add_messages] thing is a reducer. It's how LangGraph decides how to merge updates to a key. Without it, new values overwrite old ones. With it, new messages append. We'll see that live in a minute.
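The merge rule is easy to demo without the library. A minimal sketch, assuming a reducer is just a two-argument merge function (the real add_messages also handles message IDs and deduplication):

```python
# Toy version of reducer-based merging -- NOT LangGraph's internals.
# A reducer is a function (old_value, new_value) -> merged_value.

def overwrite(old, new):
    return new                      # default behavior: last write wins

def append(old, new):
    return old + new                # list reducer: accumulate history

def merge_update(state, update, reducers):
    """Merge a node's partial update into state, key by key."""
    merged = dict(state)
    for key, value in update.items():
        reducer = reducers.get(key, overwrite)
        merged[key] = reducer(merged[key], value) if key in merged else value
    return merged

state = {"messages": ["hi"], "turn_count": 1}
reducers = {"messages": append}     # messages accumulate; the rest overwrites

state = merge_update(state, {"messages": ["hello!"], "turn_count": 2}, reducers)
# state["messages"] == ["hi", "hello!"]; state["turn_count"] == 2
```

Same partial update, two very different results depending on which reducer the key declares.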
def classify(state: AgentState) -> dict:
    # read state
    last_msg = state["messages"][-1].content
    # do work (could be an LLM call, a tool, whatever)
    label = "urgent" if "help" in last_msg else "normal"
    # return ONLY the fields you want to update
    return {"classification": label}
from langgraph.graph import StateGraph, START, END

graph = StateGraph(AgentState)
graph.add_node("classify", classify)
graph.add_node("respond", respond)
graph.add_node("escalate", escalate)

# 1. fixed edge: always go from START to classify
graph.add_edge(START, "classify")

# 2. conditional edge: pick next based on state
def route(state):
    return "escalate" if state["classification"] == "urgent" else "respond"

graph.add_conditional_edges("classify", route, ["respond", "escalate"])

# 3. terminal edges
graph.add_edge("respond", END)
graph.add_edge("escalate", END)

app = graph.compile()  # ← don't forget this!
The gotcha is .compile(). The graph object is a builder, not a runnable, until you compile it. Trips up every newcomer at least once.
LangGraph doesn't traverse the graph step-by-step like a flowchart. It runs in super-steps: in each step, all active nodes run in parallel, their outputs merge via reducers, and the next set of active nodes is computed. Inspired by Pregel. Step through the demo below to see it live.
Notice how it's not stepping edge by edge — it advances in waves. All the active nodes in super-step N run together, their updates merge, and the waves keep propagating until every node goes quiet.
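The wave behavior can be sketched in a few lines of plain Python. This is an illustrative toy, not LangGraph's engine: it assumes each node returns its update plus the names of its successors, and that list-valued keys append while everything else overwrites.

```python
# Toy Pregel-style runner -- illustrative only, not LangGraph's engine.
# Each node: state -> (partial_update, set_of_next_nodes).

def run_supersteps(nodes, start, state):
    active, step = set(start), 0
    while active:
        step += 1
        updates, next_active = [], set()
        for name in active:                  # all active nodes run "together"
            update, successors = nodes[name](state)
            updates.append(update)
            next_active |= set(successors)
        for update in updates:               # merge after the wave completes
            for key, value in update.items():
                if isinstance(state.get(key), list):
                    state[key] = state[key] + value   # append for lists
                else:
                    state[key] = value                # overwrite otherwise
        active = next_active                 # next wave
    return state, step

nodes = {
    "a": lambda s: ({"log": ["a"]}, {"b", "c"}),   # fan out
    "b": lambda s: ({"log": ["b"]}, set()),        # b and c share a wave
    "c": lambda s: ({"log": ["c"]}, set()),
}
final, steps = run_supersteps(nodes, ["a"], {"log": []})
# steps == 2: wave 1 runs a; wave 2 runs b and c in parallel
```

Three nodes, but only two super-steps, because b and c fire in the same wave.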
When two parallel nodes update the same state key, or when you call the same node multiple times, LangGraph needs to know how to merge the new value with the old. That's what a reducer is.
Default behavior: overwrite. But for lists of messages, you usually want append. Watch the difference:
Each write replaces the list. Old messages vanish. 😱
Reducer appends. History accumulates. 🙌
The entire difference is one add_messages reducer. Nothing more. Being able to explain this cleanly signals real understanding.
Send is for dynamic fan-out: spawn N parallel invocations of a node when you don't know N at graph-definition time (map-reduce-ish).
from langgraph.types import Send
def dispatch(state):
    return [Send("make_joke", {"subject": s}) for s in state["subjects"]]
graph.add_conditional_edges("pick_subjects", dispatch)
Command lets a node update state AND route in one return — skipping the usual edge resolution. Great for multi-agent handoffs.
from langgraph.types import Command
def review(state) -> Command:
    return Command(
        update={"status": "approved"},
        goto="deploy",  # jump, don't use edges
    )
Attach a checkpointer and LangGraph saves state after every super-step. This is wild. It unlocks: fault tolerance, human-in-the-loop, time travel, and long-running agents.
from langgraph.checkpoint.memory import InMemorySaver
# or: from langgraph.checkpoint.postgres import PostgresSaver
app = graph.compile(checkpointer=InMemorySaver())
# every run needs a thread_id; same thread = same conversation history
cfg = {"configurable": {"thread_id": "conv-42"}}
app.invoke({"messages": [...]}, config=cfg)
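A toy version of the checkpointing idea, assuming a checkpointer is just a per-thread list of state snapshots (real checkpointers also record step metadata and pending writes):

```python
# Toy checkpointer: save state after every step, keyed by thread_id.
# Illustrative sketch, not the langgraph.checkpoint API.

class ToyCheckpointer:
    def __init__(self):
        self._store = {}                      # thread_id -> list of snapshots

    def save(self, thread_id, state):
        self._store.setdefault(thread_id, []).append(dict(state))

    def latest(self, thread_id):
        snaps = self._store.get(thread_id)
        return dict(snaps[-1]) if snaps else {}

cp = ToyCheckpointer()
cp.save("conv-42", {"messages": ["hi"]})
cp.save("conv-42", {"messages": ["hi", "hello!"]})

resumed = cp.latest("conv-42")     # same thread = same history
fresh = cp.latest("conv-99")       # new thread = blank slate
# resumed == {"messages": ["hi", "hello!"]}; fresh == {}
```

Keeping every snapshot (not just the latest) is what makes "time travel" possible: you can resume from any earlier super-step.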
We've been hand-waving the "do work" part of every node. Time to unwave that. Here's the thing most tutorials don't make obvious: LangGraph doesn't call LLMs itself — that's LangChain's job. The mental picture for any LLM-calling node is: a function that invokes a LangChain model and returns the result into graph state. That's the whole integration.
- Chat models — ChatOpenAI, ChatAnthropic, or init_chat_model("anthropic:claude-sonnet-4-6") for a unified wrapper.
- Messages — Human, AI, System, Tool. The native data flowing through the conversation.
- Tools — @tool-decorated functions. The LLM reads name + docstring + types to decide when to call.
- bind_tools — attaches tool schemas to a model so its responses can include tool_calls.
An LLM call is literally llm.invoke([list_of_messages]) → new AIMessage. Four message types flow through state:
from langchain_core.messages import (
HumanMessage, # user said
AIMessage, # model said (may contain .tool_calls!)
SystemMessage, # system prompt
ToolMessage, # result of a tool execution
)
An AIMessage has two fields that matter: .content (the text) and .tool_calls (list of functions the model wants to invoke). This is how tool-calling LLMs say "I want to run this function" — they emit a structured tool_call instead of plain text.
from langchain_core.tools import tool
@tool
def get_weather(location: str) -> dict:
    """Get the current weather for a location."""
    return {"temp": 72, "condition": "sunny"}
# attach to model — now the LLM knows this tool exists
llm_with_tools = llm.bind_tools([get_weather])
response = llm_with_tools.invoke(messages)
# response.tool_calls → [{"name": "get_weather", "args": {"location": "Tokyo"}, ...}]
The @tool decorator inspects the function's name, docstring, and type hints to build a JSON schema the model sees. Write good docstrings. The docstring IS the prompt.
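A sketch of the kind of introspection such a decorator might do — signature plus docstring in, schema out. This is not LangChain's actual implementation, just the mechanism:

```python
# Sketch of what a @tool-style decorator extracts. Illustrative only:
# the real @tool builds a richer schema (nested types, defaults, etc.).
import inspect

PY_TO_JSON = {str: "string", int: "integer", float: "number", bool: "boolean"}

def tool_schema(fn):
    sig = inspect.signature(fn)
    return {
        "name": fn.__name__,
        "description": (fn.__doc__ or "").strip(),   # the docstring IS the prompt
        "parameters": {
            name: PY_TO_JSON.get(p.annotation, "string")
            for name, p in sig.parameters.items()
        },
    }

def get_weather(location: str) -> dict:
    """Get the current weather for a location."""
    return {"temp": 72, "condition": "sunny"}

schema = tool_schema(get_weather)
# schema["name"] == "get_weather"
# schema["parameters"] == {"location": "string"}
```

Note that everything the model sees comes from the function's own metadata — which is exactly why a lazy docstring produces a tool the model never calls correctly.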
from langgraph.prebuilt import ToolNode, tools_condition
from langgraph.graph import MessagesState # pre-made state schema!
# ToolNode: looks at last AIMessage, executes every tool_call,
# appends results as ToolMessages to state
tool_node = ToolNode([get_weather])
# tools_condition: conditional edge function
# if last message has tool_calls → "tools" ; else → END
graph.add_conditional_edges("agent", tools_condition)
Also note MessagesState — a prebuilt state schema with messages: Annotated[list[AnyMessage], add_messages] already set up. Saves 3 lines and signals you know the idiom.
Put it all together and you get the ReAct loop — the shape 95% of LLM agents take. Watch one execute below. The key thing to notice: the message list grows on every iteration, and that's the mechanism by which the agent "remembers" what it did.
User asks: "What's the weather in Tokyo?". Step through to watch state["messages"] grow.
Notice: the add_messages reducer is doing silent work here. Every node returns {"messages": [new_msg]}, and the reducer appends. If you forgot the reducer, each step would overwrite history and the agent would forget everything between iterations. Ouch.
from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.prebuilt import ToolNode, tools_condition
from langchain.chat_models import init_chat_model
llm = init_chat_model("anthropic:claude-sonnet-4-6").bind_tools([get_weather])
def call_model(state: MessagesState):
    return {"messages": [llm.invoke(state["messages"])]}
g = StateGraph(MessagesState)
g.add_node("agent", call_model)
g.add_node("tools", ToolNode([get_weather]))
g.add_edge(START, "agent")
g.add_conditional_edges("agent", tools_condition) # auto-route on tool_calls
g.add_edge("tools", "agent") # loop back
app = g.compile()
If that's exactly the shape you want — and it often is — LangGraph has it prebuilt:
from langchain.agents import create_agent # new v1 API
agent = create_agent(
model="anthropic:claude-sonnet-4-6",
tools=[get_weather],
prompt="You are a helpful assistant.",
)
agent.invoke({"messages": [{"role": "user", "content": "weather in Tokyo?"}]})
This used to be create_react_agent from langgraph.prebuilt. As of LangGraph v1 (late 2025), it was moved and renamed to create_agent in langchain.agents, with a more flexible middleware system. Mention both names in interviews — it signals you're tracking the current library state.
with_structured_output() — forces the model to return a Pydantic object instead of free text:
from pydantic import BaseModel

class Decision(BaseModel):
    should_escalate: bool
    reason: str

result = llm.with_structured_output(Decision).invoke(messages)
# result is a Decision instance, not a string — typed, validated
Streaming — app.stream() emits per-super-step updates; app.astream_events() emits token-level events. Both matter for real app UX.
The one-paragraph summary: you attach tools with bind_tools. Inside a node, you invoke a LangChain model on state['messages'] and return the response; the add_messages reducer appends. ToolNode and tools_condition are prebuilt helpers that implement the agent/tool loop, and create_agent (formerly create_react_agent) wraps the whole ReAct pattern in one call.

Here's the truth: Google interviews often happen in a Google Doc. No autocomplete. No running code. No syntax highlighting to bail you out. What they're really testing isn't "can you type langgraph.prebuilt from memory" — it's: can you reason about the data?
Given a blank page, can you design the right primitives for an agent loop and explain why each piece exists? That's the skill. Let's build one together, layer by layer. Click through the tabs below to see how each layer builds on the last.
Classic open-ended interview prompt. Before writing a single line, ask yourself (or the interviewer) these five questions. The answers drive your data model.
Tool calls come back as structured name + args — not free text. Let's assume we have an LLM adapter that parses that.
This is the most important layer. Get these right and the loop writes itself. Get them wrong and nothing works. Notice the mapping to LangChain primitives — same ideas, just from scratch.
# 1. A message in the conversation
Message:
    role        : "user" | "assistant" | "tool"
    content     : text (possibly empty)
    tool_calls  : list of ToolCall (if assistant wants tools)
    tool_call_id: string (only when role = "tool")

# 2. What the model requests
ToolCall:
    id   : unique id for this request
    name : name of tool to invoke
    args : dict of arguments

# 3. A tool definition (what's available)
Tool:
    name        : identifier
    description : what it does (for the LLM!)
    fn          : callable that does the work

# 4. The whole thing
AgentState:
    messages  : list[Message]
    iteration : int   # budget tracker
from dataclasses import dataclass, field
from typing import Callable, Any, Literal

@dataclass
class ToolCall:
    id: str
    name: str
    args: dict[str, Any]

@dataclass
class Message:
    role: Literal["user", "assistant", "tool"]
    content: str = ""
    tool_calls: list[ToolCall] = field(default_factory=list)
    tool_call_id: str | None = None

@dataclass
class Tool:
    name: str
    description: str
    fn: Callable[..., Any]

@dataclass
class AgentState:
    messages: list[Message] = field(default_factory=list)
    iteration: int = 0
When the assistant wants tools, content is empty and tool_calls is populated. This mirrors OpenAI/Anthropic/Gemini exactly. A naive design puts tool calls in a separate "request" queue and then you have to reconcile two timelines. Don't do that.
Every ToolCall has an id, and the resulting Message(role="tool", ...) echoes that id in tool_call_id. This is the exact pattern the OpenAI tool_calls API uses. Mentioning this in an interview = instant credibility.
My Message ≈ LangChain's HumanMessage/AIMessage/ToolMessage. My ToolCall ≈ the tool_calls field on AIMessage. My AgentState ≈ LangGraph's MessagesState. Same primitives, just not hiding behind a package.

Before writing real code, sketch the loop in English-ish. This is what you'd write FIRST on the whiteboard. Once it's right, the translation to Python is mechanical.
function run_agent(user_query, tools, max_iters):
    state = AgentState()
    state.messages.append(Message(role="user", content=user_query))

    while state.iteration < max_iters:
        state.iteration += 1

        # 1. Call the LLM with full history + available tools
        response = llm.call(messages=state.messages, tools=tools)
        state.messages.append(response)

        # 2. If no tool calls → we're done, return the answer
        if not response.tool_calls:
            return response.content

        # 3. Otherwise, execute each tool call in order
        for call in response.tool_calls:
            tool = find_tool(tools, call.name)
            result = tool.fn(**call.args)   # run it!
            state.messages.append(
                Message(
                    role="tool",
                    content=str(result),
                    tool_call_id=call.id,   # link back
                )
            )
        # loop continues — next iteration, LLM sees tool results

    return "Reached max iterations without final answer"
Always cap the loop with max_iters. Without it, a broken tool (always returning "error, try again") will burn your API budget to zero. Production code ALWAYS has this. Interviewers love when candidates add it unprompted.
Two exit conditions: (a) natural termination when the model stops requesting tools, and (b) max_iters (safety escape hatch). Every production agent loop has both. If you only have (a), you have an infinite-loop bug waiting to happen.
Say it out loud: "I used a while loop, not recursion, because (1) I can bound iterations explicitly and (2) stack depth isn't tied to conversation length. Two exit points: natural termination when the model emits plain text, or the budget escape hatch."

Using the dataclasses from Layer 2 and the pseudocode from Layer 3. Notice how much of this is just typing out what you already designed.
def run_agent(
    user_query: str,
    tools: list[Tool],
    llm,                    # some LLM client we have
    max_iters: int = 10,
) -> str:
    state = AgentState()
    state.messages.append(Message(role="user", content=user_query))

    # build a lookup once — O(1) dispatch beats O(n) search in the loop
    tool_registry = {t.name: t for t in tools}

    while state.iteration < max_iters:
        state.iteration += 1

        # 1. Ask the model. llm.call returns a Message (role="assistant").
        response: Message = llm.call(
            messages=state.messages,
            tools=tools,    # schemas, not the functions themselves
        )
        state.messages.append(response)

        # 2. Natural stopping condition
        if not response.tool_calls:
            return response.content

        # 3. Execute each tool call; append results
        for call in response.tool_calls:
            tool = tool_registry.get(call.name)
            if tool is None:
                result = f"Error: no tool named {call.name}"
            else:
                try:
                    result = tool.fn(**call.args)
                except Exception as e:
                    result = f"Error: {e}"   # feed error back to LLM!
            state.messages.append(Message(
                role="tool",
                content=str(result),
                tool_call_id=call.id,
            ))

    return "[Hit max iterations]"   # budget exceeded
# Define some tools
def get_weather(location: str) -> dict:
    return {"temp": 72, "condition": "sunny"}

def search_web(query: str) -> list[str]:
    return ["result 1...", "result 2..."]

tools = [
    Tool(name="get_weather",
         description="Get current weather for a location",
         fn=get_weather),
    Tool(name="search_web",
         description="Search the web for a query",
         fn=search_web),
]

answer = run_agent(
    user_query="What's the weather in Tokyo and what are they known for?",
    tools=tools,
    llm=my_llm_client,
    max_iters=5,
)
tool_registry = {t.name: t for t in tools} gives O(1) lookup per call; searching a list is O(n). In a loop that runs dozens of times, this matters, and interviewers notice. Obvious next improvements:

- asyncio.gather when multiple tool_calls arrive together — big latency win.
- Streaming — yield tokens as they come.
- Checkpointing — serialize AgentState after each iteration so we can resume after a crash.
- Tool validation — check args against the tool's schema before invoking, so a bad LLM call doesn't blow up the process.

Reasoning > syntax. On a Google Doc, nobody's grading your imports. They're grading how you think. Here's what interviewers look for.
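The asyncio.gather improvement can be sketched directly; the tool and field names here are illustrative. One task per requested call, so total latency tracks the slowest call rather than the sum:

```python
# Sketch: when the model requests several tool calls at once, run them
# concurrently instead of sequentially. Tool names here are invented.
import asyncio

async def get_weather(location: str) -> str:
    await asyncio.sleep(0.01)          # stand-in for network latency
    return f"weather in {location}"

async def execute_calls(calls):
    # one task per tool call; total latency ~= slowest call, not the sum
    results = await asyncio.gather(*(get_weather(**c["args"]) for c in calls))
    return dict(zip((c["id"] for c in calls), results))

calls = [
    {"id": "call_1", "args": {"location": "Tokyo"}},
    {"id": "call_2", "args": {"location": "Paris"}},
]
results = asyncio.run(execute_calls(calls))
# results == {"call_1": "weather in Tokyo", "call_2": "weather in Paris"}
```

Because results come back keyed by call id, the id-echo pattern from the data model survives the concurrency unchanged.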
- You can explain why tool_call_id exists.
- You mention checkpointing: serialize the state (dataclasses.asdict) after each iteration; restore from disk → feed back into run_agent.
- You can map it to LangGraph: call_llm is a node, execute_tools is a node, the while/if logic is the conditional edge. LangGraph makes it declarative; I made it imperative. Same shape.
The Message dataclass has four fields, but each individual message only uses some of them. The class is a union — "a message is one of these four things" — which is confusing until you see a full conversation play out. Let's trace one together.
Use the controls below to step through a real tool-calling conversation. Watch how tool_calls and tool_call_id appear on different messages, never on the same one.
This is where tool_call_id earns its keep.
| role | content | tool_calls | tool_call_id |
|---|---|---|---|
| "user" | ✓ question | ✗ empty | ✗ None |
| "assistant" (answering) | ✓ the answer | ✗ empty | ✗ None |
| "assistant" (wants tools) | ⚠ often empty | ✓ populated | ✗ None |
| "tool" | ✓ the result | ✗ empty | ✓ which call |
- tool_calls appears on assistant messages — it's how the model says "I want these functions run." It's a list, because the model can request several in parallel (see message #2 above — Tokyo AND Paris were requested together).
- tool_call_id appears on tool messages — it's how we answer "this result is for which request?" It's a single string, because each tool result answers exactly one call. The IDs (call_abc, call_xyz) link the request to its answer — think of them as order numbers on a restaurant ticket.
Why IDs at all? (1) Parallel tool calls make positional matching fragile. (2) The real APIs require tool_call_id in the request format; without it the API rejects your message. (3) When debugging a 50-message conversation, IDs make the pairing obvious — positional matching forces you to count.
"A single Message with optional fields mirrors the on-the-wire format the LLM APIs use — they serialize as one JSON object per message with optional fields. Four separate classes (UserMsg, AssistantMsg, ToolCallMsg, ToolResultMsg) would give stricter type guarantees but require a discriminated union when iterating history. Trade-off between API fidelity and type safety. I went with the unified version because it's what real LLM APIs return."

If someone slides a blank Google Doc in front of you and says "implement an agent loop," this is the shape that should materialize. No imports, no frameworks — just primitives + logic.
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    id: str
    name: str
    args: dict

@dataclass
class Message:
    role: str
    content: str = ""
    tool_calls: list = field(default_factory=list)
    tool_call_id: str | None = None

def run_agent(query, tools, llm, max_iters=10):
    registry = {t.name: t for t in tools}
    messages = [Message(role="user", content=query)]
    for _ in range(max_iters):
        resp = llm.call(messages=messages, tools=tools)
        messages.append(resp)
        if not resp.tool_calls:
            return resp.content
        for call in resp.tool_calls:
            try:
                result = registry[call.name].fn(**call.args)
            except Exception as e:
                result = f"Error: {e}"
            messages.append(Message(
                role="tool", content=str(result), tool_call_id=call.id
            ))
    return "[max iters]"
Where LangGraph gives you graph primitives, ADK gives you agent primitives. It's higher-level and more opinionated. Released at Google Cloud NEXT 2025, it's the same framework powering Google's own products (Agentspace, Customer Engagement Suite).
- LlmAgent — an LLM + instructions + tools + (optionally) sub-agents. The workhorse.
- Workflow agents — SequentialAgent, ParallelAgent, LoopAgent. Deterministic orchestrators.
- Custom agents — extend BaseAgent. For logic that doesn't fit the built-ins.
from google.adk.agents import LlmAgent
capital_agent = LlmAgent(
name="capital_agent", # required, unique
model="gemini-2.5-flash",
description="Answers capital questions", # for OTHER agents to route to this one
instruction="Respond with the capital of the country asked.", # system prompt
tools=[get_capital_city], # plain functions work!
output_key="last_answer", # auto-save response to state
)
description is what OTHER agents see when deciding whether to delegate to this one. instruction is the system prompt for THIS agent's LLM. Both matter. They do different things.
These are deterministic. They don't use an LLM to decide control flow — they just run their children in a fixed pattern. This is how ADK replaces the edge-routing you'd write by hand in LangGraph.
from google.adk.agents import SequentialAgent, ParallelAgent, LoopAgent
# assembly line: run in order, pass via state
pipeline = SequentialAgent(
name="pipeline",
sub_agents=[fetcher, analyst, summarizer],
)
# fan-out: all run concurrently
swarm = ParallelAgent(
name="code_review_swarm",
sub_agents=[security_checker, style_checker, performance_analyst],
)
# iterate until exit_loop tool called or max iterations hit
refiner = LoopAgent(
name="refiner",
sub_agents=[generator, critic],
max_iterations=3,
)
Memorize this cold for the interview. Every ADK interaction lives inside a session, which holds the event history plus a state dict. State keys are scoped by prefix (user:, app:, temp:, or none):

| prefix | scope | when to use |
|---|---|---|
| (none) | this session only | conversation-specific draft, plan, etc. |
| user: | all sessions, same user | user preferences, saved settings |
| app: | all sessions, all users | feature flags, shared config |
| temp: | one invocation only, not persisted | intermediate scratch between sub-agents |
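A toy sketch of how prefixed keys might resolve into separate scopes (illustrative only — the real SessionService also handles persistence and lifetimes):

```python
# Toy resolution of ADK-style state-key prefixes into separate scopes.
# Illustrative sketch, not the ADK SessionService.

def split_scope(key):
    for prefix, scope in (("user:", "user"), ("app:", "app"), ("temp:", "temp")):
        if key.startswith(prefix):
            return scope, key[len(prefix):]
    return "session", key                  # no prefix: this session only

def apply_delta(scopes, delta):
    for key, value in delta.items():
        scope, name = split_scope(key)
        scopes[scope][name] = value
    return scopes

scopes = {"session": {}, "user": {}, "app": {}, "temp": {}}
apply_delta(scopes, {
    "draft": "v1",                 # conversation-specific
    "user:language": "en",         # follows the user across sessions
    "app:feature_flag": True,      # shared by everyone
    "temp:scratch": [1, 2],        # discarded after the invocation
})
# scopes["user"] == {"language": "en"}; scopes["session"] == {"draft": "v1"}
```

The prefix is part of the key string itself, which is why a typo like "usr:language" silently lands in session scope — worth a test.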
Every interaction produces events. State never changes directly — it changes because an event with a state_delta was emitted. Watch a full turn play out below.
User asks: "What's the capital of Peru?" Watch events flow into the session as the LlmAgent (capital_agent) processes it.
The SessionService applies each event's state_delta into session.state. That's why you use output_key or tool_context.state instead of mutating session.state directly — those helpers generate proper events.
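The event-sourcing idea can be sketched in a few lines: replaying the events, each carrying a state_delta, reconstructs the state. A toy model, not ADK's SessionService:

```python
# Sketch: state never mutates directly -- replaying a session's events
# (each optionally carrying a state_delta) reconstructs session.state.

def replay(events):
    state = {}
    for event in events:
        state.update(event.get("state_delta", {}))   # deltas apply in order
    return state

events = [
    {"author": "user", "content": "What's the capital of Peru?"},
    {"author": "capital_agent", "state_delta": {"last_answer": "Lima"}},
    {"author": "capital_agent", "state_delta": {"turn_count": 1}},
]
state = replay(events)
# state == {"last_answer": "Lima", "turn_count": 1}
```

Because state is derived from the event log, you get an audit trail for free: any value in session.state can be traced back to the exact event that wrote it.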
There are two different ways an agent can have children, and the distinction confuses people:
# 1. LLM-DRIVEN delegation (non-deterministic)
coordinator = LlmAgent(
name="coordinator",
model="gemini-2.5-flash",
instruction="Delegate to the right specialist.",
sub_agents=[greeter_agent, weather_agent], # LLM picks one
)
# 2. DETERMINISTIC orchestration (fixed order)
pipeline = SequentialAgent(
name="pipeline",
sub_agents=[fetcher, analyst], # runs in this exact order
)
When an LlmAgent has sub_agents, its LLM dynamically routes to one using a built-in transfer_to_agent tool (it reads each sub-agent's description to decide). When a SequentialAgent has sub_agents, they just run in order. Know which one you want.
You don't run agents directly — you run them through a Runner, which creates an InvocationContext that travels with execution. For most code you only touch the specialized context types:
| context | where you see it | gives you |
|---|---|---|
| ToolContext | tool function params | state + artifact helpers + auth |
| CallbackContext | before/after-agent callbacks | state + artifacts |
| ReadonlyContext | read-only spots (e.g. dynamic instruction) | just read state |
| InvocationContext | inside BaseAgent._run_async_impl | everything (services, session, etc.) |
Google's interviewer is likely to pose something like this, worded almost exactly as they would:
The prompt is dense on purpose. It's testing whether you can (a) decompose the work, (b) pick the right agent type for each piece, (c) design the data flow, and (d) think about correctness. Let's walk through the full solution — with a visual of the architecture as it grows.
One problem statement, five different agent types. That's why this question is so common — it's a Rorschach test for whether you know the full primitive vocabulary.
Step through the tabs below to see how the system is built layer by layer. Each stage highlights a new agent, tells you which ADK primitive to use, and shows the exact code.
Now that each piece is built, here's the full wiring. This is roughly what you'd write on the whiteboard:
from google.adk.agents import LlmAgent, SequentialAgent, ParallelAgent, LoopAgent
MODEL = "gemini-2.5-flash"
# ---------- 1. Three parallel fetchers ----------
profile_agent = LlmAgent(
name="profile_agent",
model=MODEL,
description="Fetches company profile: CEO, HQ, industry.",
instruction="Given a company name, use search to find profile. Return compact JSON.",
tools=[google_search],
output_key="profile",
)
news_agent = LlmAgent(
name="news_agent",
model=MODEL,
instruction="Find 3 most recent news items for the company. Return dated bullets.",
tools=[google_search],
output_key="news",
)
financials_agent = LlmAgent(
name="financials_agent",
model=MODEL,
instruction="Fetch latest market cap, revenue, and recent stock movement.",
tools=[get_stock_data, google_search],
output_key="financials",
)
fetch_fanout = ParallelAgent(
name="fetch_fanout",
sub_agents=[profile_agent, news_agent, financials_agent],
)
# ---------- 2. Synthesizer + conflict check ----------
synthesizer = LlmAgent(
name="synthesizer",
model=MODEL,
instruction="""Merge {profile}, {news}, {financials} into a one-page brief.
Also detect conflicts (e.g., news mentions acquisition, financials show old market cap).
Output JSON: {"brief": str, "conflicts": list[str], "has_conflict": bool}.""",
output_key="draft",
)
# ---------- 3. Refine loop (runs only when no conflict) ----------
critic = LlmAgent(
name="critic",
model=MODEL,
instruction="Check {draft} for clarity, completeness. If good, call exit_loop. Else give feedback.",
tools=[exit_loop],
output_key="feedback",
)
reviser = LlmAgent(
name="reviser",
instruction="Revise {draft} per {feedback}. Overwrite draft.",
output_key="draft",
)
refine_loop = LoopAgent(
name="refine_loop",
sub_agents=[critic, reviser],
max_iterations=3,
)
# ---------- 4. Conflict clarifier (only when has_conflict) ----------
conflict_clarifier = LlmAgent(
name="conflict_clarifier",
description="Handles cases where sources disagree. Asks user for guidance.",
model=MODEL,
instruction="Present {conflicts} to the user. Ask which source to trust. Await response.",
)
# ---------- 5. Router: LlmAgent with sub_agents for delegation ----------
quality_router = LlmAgent(
name="quality_router",
model=MODEL,
instruction="""Examine {draft}. If has_conflict is true, transfer to conflict_clarifier.
Otherwise transfer to refine_loop to polish the output.""",
sub_agents=[conflict_clarifier, refine_loop], # ← LLM picks one!
)
# ---------- 6. Root: the full pipeline ----------
research_pipeline = SequentialAgent(
name="research_pipeline",
sub_agents=[fetch_fanout, synthesizer, quality_router],
)
The primitives at work: ParallelAgent for the fan-out (each fetcher using output_key to pass data via state). LoopAgent for iterative refinement, with max_iterations as the budget cap and exit_loop for natural termination. LlmAgent-with-sub_agents for dynamic routing (the LLM reads the description of each sub-agent to pick one). SequentialAgent as the root to run the stages in order.
Why not a condition flag in the SequentialAgent? ADK workflow agents are deterministic — they don't branch. To branch, you need either a custom BaseAgent OR an LlmAgent whose LLM reads state and transfers to one of its sub_agents. I chose the latter because the descriptions on the sub-agents make the routing logic readable, and the loop still terminates naturally via exit_loop. For production hardening: a before_tool_callback on the financials agent to block stale data (reject if timestamp > 24h old), and an after_agent_callback on the synthesizer to log a metric if has_conflict=true — so I can measure conflict rate in prod.
state["profile"], state["news"], state["financials"] via output_key. Synthesizer's instruction reads all three via {key} template injection, produces state["draft"] (structured JSON). Quality_router examines the draft, picks a sub-agent. All communication is via session.state — no direct function returns, no globals, no spooky action.
This follow-up is where most candidates stumble. The common wrong answer: "I'd write some test cases and check if the output looks right." That's unit testing, which doesn't work for agents because they're non-deterministic. The right answer has a structure.
ADK's built-in tool_trajectory_avg_score compares actual tool calls against an expected list. Supports three match modes:
# EXACT: every tool call matches, same order, no extras
# IN_ORDER: expected tools appear in order, extras allowed between
# ANY_ORDER: all expected tools appear, any order
expected_tools = [
{"tool": "google_search", "args": {"query": "Acme Corp CEO"}},
{"tool": "get_stock_data", "args": {"ticker": "ACME"}},
]
# For our research assistant, use ANY_ORDER at the top level
# (profile/news/fin run in parallel, real order is non-deterministic).
# But within each fetcher, IN_ORDER or EXACT makes sense.
When this catches bugs: The LLM hallucinates a tool name, calls the wrong tool, passes the wrong argument (e.g., "Acme" as a city name instead of company name), or skips a required tool call.
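The three match modes are simple enough to reimplement as a sketch — this is an illustrative version, not ADK's code:

```python
# Hedged sketch of the three trajectory match modes (mode names from
# the section above; this is an illustrative reimplementation).

def exact(expected, actual):
    # every tool call matches, same order, no extras
    return expected == actual

def in_order(expected, actual):
    # expected appears as a subsequence of actual; extras allowed between
    it = iter(actual)
    return all(e in it for e in expected)

def any_order(expected, actual):
    # all expected tools appear somewhere, any order
    return all(e in actual for e in expected)

actual = ["google_search", "get_stock_data", "google_search"]

assert exact(["google_search", "get_stock_data", "google_search"], actual)
assert in_order(["google_search", "get_stock_data"], actual)    # extras ok
assert any_order(["get_stock_data", "google_search"], actual)   # order-free
assert not exact(["google_search", "get_stock_data"], actual)   # extras fail
```

The `e in it` trick consumes the iterator as it searches, which is exactly the subsequence semantics IN_ORDER needs.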
This is the one that matters MOST for our system because we have quality_router dynamically picking between conflict_clarifier and refine_loop. Getting the routing wrong = shipping broken output.
# Golden dataset entry for a "has conflict" case
{
"query": "Research Stripe",
"initial_state": {"mock_news": "Stripe acquired by Visa",
"mock_financials": "Market cap $95B standalone"},
"expected_trajectory": [
"profile_agent", "news_agent", "financials_agent", # parallel
"synthesizer",
"quality_router",
"conflict_clarifier", # ← critical
],
"must_NOT_run": ["refine_loop"], # shouldn't polish bad data
}
When this catches bugs: Router sends to refine_loop when it should have flagged a conflict. Or transfers to a sub-agent that doesn't exist. Or infinite-transfers between two agents.
Two approaches, use both:
ROUGE-1 (response_match_score): cheap, fast, word-overlap with a reference answer. Good for regression detection ("did the output change?"). Bad for semantic quality.
LLM-as-judge (final_response_match_v2): a separate LLM scores the agent's answer vs reference on semantic equivalence. Much better signal, but costs money per eval.
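To see concretely why word overlap is "bad for semantic quality," here's a hedged sketch of a ROUGE-1-style recall score (the real metric is more careful about tokenization and combines precision and recall):

```python
# Sketch of a ROUGE-1-style recall: what fraction of the reference's
# words appear in the candidate. Illustrative, not ADK's scorer.

def rouge1_recall(reference: str, candidate: str) -> float:
    ref = reference.lower().split()
    cand = set(candidate.lower().split())
    if not ref:
        return 0.0
    hits = sum(1 for w in ref if w in cand)
    return hits / len(ref)

ref = "the capital of peru is lima"
assert rouge1_recall(ref, "lima is the capital of peru") == 1.0
assert rouge1_recall(ref, "the capital of peru is not lima") == 1.0  # meaning flipped!
```

Both candidates score a perfect 1.0, even though the second says the opposite — which is exactly the gap the LLM-as-judge metric closes.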
# test_config.json
{
"criteria": {
"tool_trajectory_avg_score": 0.8, # loose for parallel
"response_match_score": 0.6, # ROUGE is lenient
"final_response_match_v2": 0.85, # LLM judge, strict
"hallucinations_v1": 0.9, # grounding check
}
}
Why both? ROUGE catches "the output totally changed" regressions cheaply. LLM-judge catches "the output is plausibly different but actually wrong." In CI, run ROUGE on every commit, LLM-judge nightly.
For our research assistant, critical custom checks:
def no_hallucinated_tickers(eval_case, result):
    # every stock ticker in output MUST appear in the financials tool result
    output_tickers = extract_tickers(result.final_response)
    source_tickers = extract_tickers(result.state["financials"])
    return output_tickers.issubset(source_tickers)

def conflict_flag_must_be_honest(eval_case, result):
    # if sources disagree in the fixture, has_conflict MUST be True
    if eval_case.fixture.get("sources_disagree"):
        return result.state["draft"]["has_conflict"] is True
    return True

def financials_freshness_enforced(eval_case, result):
    # the before_tool_callback should have blocked stale data
    return "stale_data_rejected" in result.callback_logs
When this catches bugs: The agent invents a ticker symbol. The synthesizer suppresses a real conflict to look more confident. Stale data slips through because the callback regressed.
Being able to name metrics by ADK's actual API names signals you've actually used the tool. Memorize these:
Compares actual vs expected tool-call sequence. Three match modes: EXACT, IN_ORDER, ANY_ORDER. Score per invocation: 1.0 match / 0.0 mismatch; averaged.
ROUGE-1 (word-overlap) between actual and reference responses. Cheap, fast, but syntactic — misses semantic equivalence. Default threshold: 0.8.
LLM-as-judge variant of response_match. Scores semantic equivalence, not word overlap. More signal, costs tokens per eval.
Sentence-level grounding check. For each claim in the response, is it supported by retrieved context? Backed by Vertex AI Eval SDK.
Harmlessness scoring via Vertex AI Eval SDK. Checks for unsafe responses. Critical for user-facing agents; less so for internal pipelines.
Domain-specific Python checks that no generic metric can express: "ticker must exist," "conflict flag must be honest," "freshness enforced." Ship a few.
Not just "define metrics" — describe the full eval lifecycle. This is what earns the senior-IC signal:
Build a golden dataset: use adk web to chat with the agent and save good sessions as eval cases. Capture trajectory + final response + initial state fixtures. ~20 cases to start, covering the happy path plus each failure mode (conflict-yes, conflict-no, stale-data, unknown-company, etc).
Wire it into CI: run adk eval on every PR with cheap metrics (ROUGE, trajectory match, custom assertions) — fast, deterministic. Nightly: the full suite with LLM-judge + hallucinations_v1 — more signal, more cost.
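For the PR-time run, the pass/fail thresholds live in a config file handed to adk eval. A sketch of the shape from ADK's eval docs (treat the exact schema as an assumption and check your installed version):

```
{
  "criteria": {
    "tool_trajectory_avg_score": 1.0,
    "response_match_score": 0.8
  }
}
```

Keeping this file versioned next to the agent code means threshold changes show up in review, not in a dashboard.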
"At the trajectory level, tool_trajectory_avg_score with ANY_ORDER for parallel branches, EXACT for sequential. At the sub-agent level, dedicated handoff tests — especially for the LlmAgent router, where bad routing is the scariest failure mode. At the response level, ROUGE in CI for cheap regressions plus final_response_match_v2 nightly for semantic quality. On top, custom Python assertions for business rules like 'no hallucinated tickers' and 'conflict flag honesty.' Build a golden dataset with adk web, version it with the agent code, and iterate on failures using the Trace tab."
These come up in follow-up questions. Know them by name:
| failure mode | what it looks like | what catches it |
|---|---|---|
| wrong handoff | Router sends to wrong sub-agent | trajectory test with must_run + must_NOT_run |
| infinite loop | LoopAgent never exits; hits max_iterations | track iter count; assert < max in eval |
| premature exit | Loop exits before quality achieved | response quality score; iter count > 1 |
| state pollution | Sub-agent overwrites key it shouldn't | snapshot state after each sub-agent run |
| parallel races | Two ParallelAgent branches write the same key | explicit state-schema review; test for overwrites |
| hallucinated sub-agent | Router tries to transfer to non-existent agent | schema validation in before_agent_callback |
| tool-response rot | Cached/stale data used without refresh | freshness assertions in before_tool_callback |
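The "wrong handoff" row is the one worth automating first. A sketch of a must_run / must_NOT_run check, where `agent_trajectory` is a hypothetical field standing in for however your harness records which sub-agents actually ran:

```python
from types import SimpleNamespace

def handoff_is_correct(result, must_run, must_not_run):
    # 'agent_trajectory' is a hypothetical field: the ordered list of
    # sub-agent names that executed during this eval run
    ran = set(result.agent_trajectory)
    return must_run <= ran and not (must_not_run & ran)

good = SimpleNamespace(agent_trajectory=["router", "financials_agent", "synthesizer"])
bad = SimpleNamespace(agent_trajectory=["router", "smalltalk_agent"])

print(handoff_is_correct(good, {"financials_agent"}, {"smalltalk_agent"}))  # → True
print(handoff_is_correct(bad, {"financials_agent"}, {"smalltalk_agent"}))   # → False
```

Set semantics buy you ANY_ORDER for free; switch to list comparison when order matters.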
The fastest way to internalize the difference is to see the same solution written in both. Flip the tab.
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class State(TypedDict):
    topic: str
    summary: str
    draft: str
    final: str

# llm_summarize / llm_draft / llm_revise are placeholder LLM calls
def fetch_summary(state):
    return {"summary": llm_summarize(state["topic"])}

def write_draft(state):
    return {"draft": llm_draft(state["summary"])}

def revise(state):
    return {"final": llm_revise(state["draft"])}

g = StateGraph(State)
g.add_node("fetch", fetch_summary)
g.add_node("draft", write_draft)
g.add_node("revise", revise)

g.add_edge(START, "fetch")
g.add_edge("fetch", "draft")
g.add_edge("draft", "revise")
g.add_edge("revise", END)

app = g.compile()
result = app.invoke({"topic": "octopi"})
from google.adk.agents import LlmAgent, SequentialAgent

fetcher = LlmAgent(
    name="fetcher",
    model="gemini-2.5-flash",
    instruction="Summarize the topic from the query.",
    output_key="summary",
)

drafter = LlmAgent(
    name="drafter",
    model="gemini-2.5-flash",
    instruction="Write a draft from this summary: {summary}",
    output_key="draft",
)

reviser = LlmAgent(
    name="reviser",
    model="gemini-2.5-flash",
    instruction="Revise for clarity: {draft}",
    output_key="final",
)

pipeline = SequentialAgent(
    name="content_pipeline",
    sub_agents=[fetcher, drafter, reviser],
)

# run via Runner + SessionService
| concept | LangGraph | ADK |
|---|---|---|
| unit of work | Node (function) | Agent (LlmAgent) |
| shared data | State (TypedDict) | session.state (dict) |
| routing | edges (explicit) | workflow agents OR LLM delegation |
| history | state["messages"] | session.events |
| persistence | Checkpointers | SessionService |
| merging updates | Reducers (per-key) | Event-based state_delta |
| parallelism | super-steps, Send() | ParallelAgent |
| loops | cycles + conditional edges | LoopAgent |
| HITL | interrupt() + checkpointer | long-running tool |
| compilation needed | yes (.compile()) | no |
| long-term memory | BYO vectorstore | MemoryService (Memory Bank) |
Click any card to expand. These are the things that don't show up in the tutorials but will trip you up in the interview.
.compile(): Your StateGraph is a builder, not a runnable. Until you call .compile(), you can't .invoke() it. Classic first-timer mistake.
No Annotated[..., reducer] = last-write-wins. If two parallel nodes both write to state["log"] without a reducer, only one survives. Non-deterministic. Always annotate list-like fields.
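You can model that merge behavior in a few lines of plain Python (a toy model of the merge step, not LangGraph internals):

```python
import operator

def apply_updates(state, updates, reducers):
    # toy model of LangGraph's merge: a reducer combines old + new;
    # no reducer means the last write simply wins
    for key, value in updates:
        if key in reducers:
            state[key] = reducers[key](state.get(key, []), value)
        else:
            state[key] = value
    return state

updates = [("log", ["node_a ran"]), ("log", ["node_b ran"])]
print(apply_updates({}, updates, {}))                     # → {'log': ['node_b ran']}
print(apply_updates({}, updates, {"log": operator.add}))  # → {'log': ['node_a ran', 'node_b ran']}
```

Same two writes, two very different states — and in a real parallel super-step you don't even control which write lands last.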
LangGraph happily lets you build cycles. Great for agent loops, dangerous if you forget a conditional edge routing to END. Always have an exit.
Two invocations with the same thread_id share history. Different thread = fresh conversation. Forgetting to pass it on resume = lost state.
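A toy model of the thread contract (not the checkpointer API, just its behavior): same thread_id accumulates history, a new thread_id starts fresh.

```python
class ToyCheckpointer:
    # toy model: per-thread history keyed by thread_id
    def __init__(self):
        self.store = {}

    def invoke(self, thread_id, message):
        history = self.store.setdefault(thread_id, [])
        history.append(message)
        return list(history)

cp = ToyCheckpointer()
print(cp.invoke("thread-1", "hi"))       # → ['hi']
print(cp.invoke("thread-1", "again"))    # → ['hi', 'again']
print(cp.invoke("thread-2", "who dis"))  # → ['who dis']
```

Resume with the wrong thread_id and you're in the "thread-2" case: technically working, practically amnesiac.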
description is what OTHER agents see when deciding to delegate. instruction is the system prompt for THIS agent's LLM. Mixing them up breaks multi-agent routing in weird ways.
Use output_key, tool_context.state, or EventActions(state_delta=...). Direct mutation skips the event log and can desync on persistence. The event IS the state change.
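A toy model of why "the event IS the state change": session state is just a fold over the event log's state_delta entries (schematic, not ADK internals).

```python
# toy model: session.state as a fold over the event log
events = [
    {"author": "fetcher", "actions": {"state_delta": {"summary": "octopi are neat"}}},
    {"author": "drafter", "actions": {"state_delta": {"draft": "Octopi: An Essay"}}},
]

state = {}
for event in events:
    state.update(event["actions"]["state_delta"])

print(state)  # → {'summary': 'octopi are neat', 'draft': 'Octopi: An Essay'}
```

Mutate the dict directly and the log no longer reproduces the state — which is exactly the desync the persistence layer can't repair.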
Default for adk web. Fine for dev, disastrous for prod — scale to 2+ instances and sessions stop being shared. Use DatabaseSessionService or VertexAiSessionService for anything real.
Under LlmAgent: LLM decides (non-deterministic routing). Under SequentialAgent: fixed order. Under ParallelAgent: all at once. Exact same parameter name, radically different behavior.
The LLM sees the function name, docstring, and param types to decide when to call your tool. A bad docstring = a tool that never gets called (or gets called wrong). Treat them like API docs.
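A quick way to see what the model actually gets: introspect the function the way a tool-calling layer does (schematic; real schema extraction is richer, but the docstring is still the description).

```python
import inspect

def get_stock_price(ticker: str) -> float:
    """Return the latest closing price for a stock ticker symbol."""
    raise NotImplementedError  # body irrelevant to the spec the LLM sees

# roughly what a tool-calling layer extracts and shows the LLM
spec = {
    "name": get_stock_price.__name__,
    "description": inspect.getdoc(get_stock_price),
    "parameters": {
        name: str(param.annotation)
        for name, param in inspect.signature(get_stock_price).parameters.items()
    },
}
print(spec["description"])  # → Return the latest closing price for a stock ticker symbol.
```

If the docstring were "gets data," the model would have no idea when to call this — hence: write them like API docs.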
Ten questions. Instant feedback. The ones you miss are the ones to study.
The deepest interview signal isn't knowing every API — it's being able to say "here's what the framework gives me, here's what I'd have to build, here's the trade-off." You've got this.