Debugging Production Agents: Why Agent Monitoring and Logging Aren't Optional

Dan Hartman, Editor · 7 min read

Shipping AI agents means dealing with silent failures and cost overruns. Learn essential agent monitoring and logging practices to debug effectively and keep your deployments stable.

Last quarter, we pushed an agent to production that was supposed to automate initial customer support triage. It’d pull data from our CRM, check our knowledge base, and draft a first response. Seemed simple enough in dev. Then it hit production. For a few days, things looked fine. Then, slowly, silently, the agent started failing. Not crashing, just… producing garbage, or worse, getting stuck in a loop trying to re-read the same knowledge base article. Our support team started seeing bizarre drafts, and our token costs began to creep up. We had no idea why. This is the nightmare of agent monitoring and logging done poorly, or not at all.

Debugging traditional software is hard enough, and there you at least get stack traces, predictable execution paths, and clear inputs and outputs. Agents are a different beast entirely. They make decisions based on probabilistic models. They call external tools, often with non-deterministic results. They hallucinate. A single bad LLM output or a subtly misconfigured tool call can send an agent down a rabbit hole, burning tokens and frustrating users with nonsensical outputs. You can’t just attach a debugger and step through an LLM’s thought process like you would a Python function. The non-deterministic responses, the long, branching chains of reasoning, the asynchronous tool interactions, and the inherent opacity of the ‘thought’ process all conspire against you. Without proper visibility into each step, each decision, and each tool invocation, you’re flying blind. You’re essentially hoping your agent doesn’t decide to book a flight to Fiji on the company card, or worse, delete critical customer data because of a misinterpretation. This isn’t just about finding a bug; it’s about understanding why the agent chose a particular path, which is a much harder problem.

Structured Logging: Your First Line of Defense

The absolute minimum you need is structured logging. Every step an agent takes, every tool it calls, every LLM prompt and response, every state change — it needs to be logged. Not just print("Agent did something"). We’re talking JSON logs that capture context: agent ID, trace ID, step number, tool name, input, output, duration, token usage. This isn’t optional; it’s foundational.

Here’s a simplified example of what I mean, using a hypothetical LangGraph node:

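import json
import logging
import time

# `tools` is assumed to be a dict of tool name -> object exposing .run(),
# defined elsewhere in the agent.

def call_tool_node(state):
    tool_name = state["tool_to_call"]
    tool_input = state["tool_input"]
    start = time.perf_counter()
    try:
        result = tools[tool_name].run(tool_input)
        # json.dumps turns the record into a single JSON line; default=str
        # covers tool outputs that are not natively serializable.
        logging.info(json.dumps({
            "event": "tool_call_success",
            "agent_id": state["agent_id"],
            "trace_id": state["trace_id"],
            "step": state["step_count"],
            "tool_name": tool_name,
            "tool_input": tool_input,
            "tool_output": result,
            "duration_ms": round((time.perf_counter() - start) * 1000),
            "status": "success",
        }, default=str))
        return {"tool_output": result}
    except Exception as e:
        logging.error(json.dumps({
            "event": "tool_call_failure",
            "agent_id": state["agent_id"],
            "trace_id": state["trace_id"],
            "step": state["step_count"],
            "tool_name": tool_name,
            "tool_input": tool_input,
            "error": str(e),
            "status": "failure",
        }, default=str))
        # Re-raise so the graph still sees the failure.
        raise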

This level of detail lets you filter logs, build dashboards, and actually pinpoint when and where an agent went off the rails. Without it, you’re just staring at a wall of text, guessing.
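To make the filtering idea concrete, here is a minimal sketch that reads JSON log lines and answers two questions: which tools fail most, and where a given trace first went wrong. It assumes one JSON object per line with the field names used in the example above (`event`, `trace_id`, `step`, `tool_name`, `status`); the function names are hypothetical, not part of any library.

```python
import json
from collections import Counter


def parse_log_lines(lines):
    """Yield dict records from JSON log lines, skipping anything unparseable."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # plain-text noise mixed into the log stream
        if isinstance(record, dict):
            yield record


def failures_by_tool(records):
    """Count tool_call_failure events per tool name."""
    return Counter(
        r.get("tool_name", "unknown")
        for r in records
        if r.get("event") == "tool_call_failure"
    )


def first_failure(records, trace_id):
    """Return the failing record with the lowest step for one trace, or None."""
    failed = [
        r for r in records
        if r.get("trace_id") == trace_id and r.get("status") == "failure"
    ]
    return min(failed, key=lambda r: r.get("step", 0), default=None)
```

Because `parse_log_lines` is a generator, wrap it in `list(...)` if you want to run more than one query over the same records. In practice your log backend’s query language does this job; the point is that it only works when the fields are there to query.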

Tracing Agent Execution: Seeing the Whole Picture

Logs are great for individual events, but they don’t show the flow. That’s where tracing comes in. Tools like LangSmith or Langfuse are indispensable here. They visualize the entire execution path of an agent: which LLM calls were made, which tools were invoked, the inputs and outputs at each stage, and the time taken. It’s like a debugger for your agent’s brain.

I’ve spent hours trying to figure out why an agent was looping, only for a LangSmith trace to immediately highlight a recursive tool call I hadn’t anticipated. It showed me the exact prompt that led to the bad tool call, and the subsequent tool output that fed back into the LLM, creating an infinite cycle. That’s the concrete thing I love about it: the ability to visually inspect the entire chain. It saves days of head-scratching.
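The same kind of loop can also be flagged programmatically. The sketch below is not how LangSmith does it; it is a simple heuristic of my own that counts identical tool calls (same tool, same input) within one run and reports any that repeat more than a chosen limit. The step format (`tool_name`, `tool_input`) and the `max_repeats` default are assumptions.

```python
import json
from collections import Counter


def find_repeated_tool_calls(steps, max_repeats=3):
    """Return (tool_name, input_json, count) for calls seen more than max_repeats times."""
    counts = Counter()
    for step in steps:
        # Serialize the input with sorted keys so equal dicts map to the same key.
        input_json = json.dumps(step["tool_input"], sort_keys=True, default=str)
        counts[(step["tool_name"], input_json)] += 1
    return [
        (name, input_json, count)
        for (name, input_json), count in counts.items()
        if count > max_repeats
    ]
```

Running this check inside the agent loop, and aborting the run when it returns anything, turns a silent token burn into an explicit failure you can alert on.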

LangSmith, for instance, integrates pretty well with LangChain and LangGraph. You just set a few environment variables, and suddenly your runs are visible. It’s not perfect, though. My concrete gripe with LangSmith is its pricing model for high-volume production. It can get expensive quickly, especially if you’re doing a lot of experimentation or have agents that generate verbose outputs (which, yes, is annoying). The free tier is enough for solo work and initial development, but once a small team hits serious production traffic, you’ll feel it. As far as I can tell, tracing is metered by volume on top of your LLM costs, and that adds up faster than you’d expect; check the current pricing page before you commit. It’s not ridiculous, but it’s something you need to budget for, and it’s certainly not cheap for large-scale deployments.
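For reference, the environment-variable setup looks roughly like this. The variable names are the ones I know from LangChain’s tracing documentation (`LANGCHAIN_TRACING_V2`, `LANGCHAIN_API_KEY`, `LANGCHAIN_PROJECT`); newer releases also document `LANGSMITH_`-prefixed equivalents, so verify against the version you run. The helper function and the project name are hypothetical.

```python
import os


def configure_langsmith_tracing(api_key, project="support-triage-agent"):
    """Set the environment variables LangChain reads to enable LangSmith tracing.

    Call this before constructing the agent so the settings are picked up.
    The project name is a placeholder used to group runs.
    """
    os.environ["LANGCHAIN_TRACING_V2"] = "true"
    os.environ["LANGCHAIN_API_KEY"] = api_key
    os.environ["LANGCHAIN_PROJECT"] = project
```

In production you would normally set these in the deployment environment rather than in code, and keep the API key in a secrets manager.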

Metrics and Alerts: Catching Problems Before They Explode

Beyond logs and traces, you absolutely need metrics. Track token usage per agent run, both input and output. Monitor the latency of individual tool calls and the overall agent completion time. Crucially, track the success rates of specific tools and overall agent completion rates. If your agent is supposed to answer a question, track if it actually provides a coherent answer or if it just gives up, perhaps returning an empty string or a generic error. For agents interacting with external APIs, monitor API call success rates and response times.
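As a rough illustration of the completion-rate metric, the sketch below classifies each run’s final output as answered, empty, or gave up, then computes the share that were answered. The phrase list, the `RunResult` shape, and the function names are all assumptions for illustration, and a string match is only a crude proxy for whether an answer is actually coherent.

```python
from dataclasses import dataclass

# Hypothetical markers of a run that gave up; tune these to your agent's wording.
GIVE_UP_PHRASES = ("i don't know", "i'm unable to", "an error occurred")


@dataclass
class RunResult:
    run_id: str
    output: str
    input_tokens: int
    output_tokens: int


def classify_output(output):
    """Label a final output as 'empty', 'gave_up', or 'answered'."""
    text = (output or "").strip().lower()
    if not text:
        return "empty"
    if any(phrase in text for phrase in GIVE_UP_PHRASES):
        return "gave_up"
    return "answered"


def completion_rate(runs):
    """Fraction of runs whose output was classified as 'answered'."""
    if not runs:
        return 0.0
    answered = sum(1 for r in runs if classify_output(r.output) == "answered")
    return answered / len(runs)


def total_tokens(runs):
    """Total input and output tokens across runs, as a (input, output) tuple."""
    return (sum(r.input_tokens for r in runs), sum(r.output_tokens for r in runs))
```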

Set up alerts for anomalies. If token usage suddenly spikes by 200% for a specific agent, that’s a blaring red flag for a potential loop or an agent generating excessively verbose output. If a critical tool’s success rate drops below 90%, something’s broken with that integration or the agent’s input to it. You can push these metrics to your existing observability stack — Datadog, Prometheus, Grafana, whatever you’re already using. Don’t build a separate dashboard just for agents unless you absolutely have to; integrate it into your existing operational views. This keeps your monitoring centralized and reduces cognitive load for your ops team.
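The two alert rules above can be written down as plain threshold checks. This is a minimal sketch, assuming you already have a baseline token count and per-tool success counts from your metrics store; the function names and the shape of `stats` are hypothetical. A spike "by 200%" is read here as an increase of 200% over baseline, which is three times the baseline value.

```python
def token_spike(baseline_tokens, current_tokens, threshold_pct=200.0):
    """True when current usage exceeds baseline by at least threshold_pct percent."""
    if baseline_tokens <= 0:
        # No usable baseline: treat any usage as worth a look.
        return current_tokens > 0
    increase_pct = (current_tokens - baseline_tokens) / baseline_tokens * 100.0
    return increase_pct >= threshold_pct


def tool_success_low(successes, total, min_rate=0.90):
    """True when the observed success rate is below min_rate."""
    if total == 0:
        return False  # no calls, nothing to judge
    return successes / total < min_rate


def evaluate_alerts(stats):
    """Return alert messages for one agent's stats dict."""
    alerts = []
    if token_spike(stats["baseline_tokens"], stats["current_tokens"]):
        alerts.append(f"token usage spike for agent {stats['agent_id']}")
    for tool_name, (successes, total) in stats["tools"].items():
        if tool_success_low(successes, total):
            alerts.append(f"low success rate for tool {tool_name}")
    return alerts
```

In Datadog or Prometheus you would express the same thresholds as alert rules in the platform itself; the code form is mainly useful for agreeing on what the thresholds mean.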

For example, if you’re building agents with something like LangGraph, you can instrument your nodes to emit custom metrics. This isn’t just about debugging; it’s about cost control and performance optimization. You can’t optimize what you don’t measure. And you can’t measure if you don’t log and trace every meaningful interaction.

import time
import prometheus_client

# Latency per tool call, labelled by tool and outcome.
TOOL_CALL_LATENCY = prometheus_client.Summary('tool_call_latency_seconds', 'Latency of tool calls', ['tool_name', 'status'])
# Running token total per agent and model.
AGENT_TOKEN_USAGE = prometheus_client.Counter('agent_token_usage_total', 'Total tokens used by agent', ['agent_id', 'llm_model'])

# `tools` is assumed to be a dict of tool name -> object exposing .run(),
# defined elsewhere in the agent.

def instrumented_tool_node(state):
    tool_name = state["tool_to_call"]
    start_time = time.perf_counter()
    # Assume failure until the tool call returns, so exceptions are labelled correctly.
    status = "failure"
    try:
        # ... tool call logic ...
        result = tools[tool_name].run(state["tool_input"])
        status = "success"
        return {"tool_output": result}
    finally:
        # Runs on both success and failure, so every call is measured.
        duration = time.perf_counter() - start_time
        TOOL_CALL_LATENCY.labels(tool_name=tool_name, status=status).observe(duration)
        # Assuming LLM token usage is captured elsewhere or passed in state
        # AGENT_TOKEN_USAGE.labels(agent_id=state["agent_id"], llm_model="gpt-4o").inc(state["tokens_used"])

Beyond Debugging: Compliance and Choosing Your Tools

For agents touching real money or real user data, compliance isn’t a nice-to-have; it’s a hard requirement. Your logs and traces become your audit trail. If an agent makes a decision that impacts a user’s account, you need to be able to reconstruct exactly why that decision was made. What was the prompt? What was the LLM’s response? Which tools were called, and with what parameters? This is where the detailed, structured logging and comprehensive tracing really pay off. It’s not just about finding bugs; it’s about proving your agent acted correctly, or identifying exactly where it deviated from policy. This is especially true for financial services or healthcare applications, where data provenance is critical.
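Reconstructing a decision from logs can be as simple as pulling every record for one trace and ordering it by step. The sketch below assumes JSON-lines logs keyed by `trace_id` and `step` as in the logging example earlier; the `prompt` and `response` field names are hypothetical, since the article’s example only logs tool calls.

```python
import json

# Fields worth keeping in an audit view; "prompt" and "response" are assumed names.
AUDIT_FIELDS = (
    "step", "event", "tool_name", "tool_input",
    "tool_output", "prompt", "response", "error",
)


def reconstruct_trace(log_lines, trace_id):
    """Return the ordered audit trail for one trace from JSON log lines."""
    records = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines
        if isinstance(record, dict) and record.get("trace_id") == trace_id:
            records.append(record)
    # Order by step so the trail reads in execution order.
    records.sort(key=lambda r: r.get("step", 0))
    return [
        {field: r[field] for field in AUDIT_FIELDS if field in r}
        for r in records
    ]
```

For a real audit trail the logs also need to be retained and protected from modification, which is a storage concern this sketch does not address.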

You don’t need to build everything from scratch. For agent frameworks, LangChain, LangGraph, CrewAI, and AutoGen all have varying levels of observability hooks. LangGraph, in particular, makes it relatively straightforward to instrument nodes. For observability platforms, LangSmith and Langfuse are purpose-built for LLM applications. Arize is another strong contender, especially if you’re already deep into ML observability. If you’re building simpler automation agents, tools like n8n or Bardeen might offer some basic logging, but they won’t give you the deep LLM-specific insights of a dedicated tracing platform. For development and testing, even something like Replit Agent can give you a quick environment to experiment, but for production, you’ll need more capable monitoring.

I’ve seen too many teams get excited about building agents, only to hit a wall when they try to deploy. The excitement of ‘how to build agents’ quickly turns into the frustration of ‘how to fix this damn agent.’ Agent monitoring and logging aren’t glamorous. They’re the unsexy, essential infrastructure that makes production deployments possible. If you’re serious about deploying agents, don’t treat observability as an afterthought. Build it in from day one. Your sanity, your budget, and your users will thank you.
