Why Machine Learning for Agent Improvement Isn't Optional Anymore

Dan Hartman, Editor · 8 min read

Agents silently failing or costing too much? Learn how targeted machine learning for agent improvement, like fine-tuning and critic agents, solves production issues.

The Silent Drift: When Agents Degrade

Last quarter, we shipped a content-briefing agent for a SaaS marketing team. The idea was simple: feed it a primary keyword, a few competitor URLs, and it’d spit out a detailed brief for a writer. Initial tests looked great. We used LangGraph to orchestrate a few steps: research, outline generation, and a quick self-critique. For the first few weeks, it hummed along, saving the team hours. Then, the silent failures started. Not crashes, but subtle degradations in output quality. Briefs became generic, sometimes missing key competitive insights. The client started complaining about needing to heavily edit the agent’s output, which defeated the whole purpose. Our costs, tied to GPT-4 API calls, started creeping up because the agent was generating longer, less focused drafts, requiring more internal retries. This wasn’t a bug; it was a slow, expensive drift. That’s when I realized we needed proper machine learning for agent improvement, not just better prompt engineering.

The core issue wasn’t a single broken tool or a bad prompt. It was the agent’s inability to adapt to new data, new competitor strategies, or even subtle shifts in the client’s brand voice. We’d built it on a snapshot of knowledge, and the world moved on. Our initial evaluation metrics were too simplistic: ‘does it generate a brief?’ not ‘is it a good brief that saves the writer time?’ We were using LangSmith for tracing, which helped us see the execution paths, but it didn’t tell us why a particular path led to a poor outcome. We could see the agent calling the search tool, then the summarization tool, but the quality of the summarization was the problem. We needed a way to teach the agent what ‘good’ looked like, and crucially, what ‘bad’ looked like, without constantly rewriting system prompts.

Targeted ML: Guardrails and Critics

Our first attempt at machine learning for agent improvement involved a feedback loop. We started tagging outputs in LangSmith: ‘good,’ ‘needs minor edits,’ ‘bad.’ This gave us a small dataset. The challenge was turning those qualitative tags into something actionable for the agent. We couldn’t just fine-tune the LLM on ‘good’ examples because the problem wasn’t the LLM’s general knowledge; it was its specific application within our agent’s workflow. We needed to influence the agent’s decision-making at critical junctures, like when it decided which search results to prioritize or how to synthesize information for the outline.

We ended up focusing on two areas. First, a small, task-specific classifier. Instead of the LLM deciding if a search result was ‘relevant,’ we trained a tiny BERT model on a few hundred examples of ‘relevant’ vs. ‘irrelevant’ search snippets for content briefs. This model ran before the LLM even saw the data, filtering out noise. It wasn’t perfect, but it cut down the LLM’s token consumption by about 15% on average, and the quality of the raw input improved dramatically. This was a concrete win: a small, focused ML model acting as a guardrail, making the LLM’s job easier and cheaper.

The BERT model itself was a simple Hugging Face transformers pipeline. We collected about 500 search result snippets, manually labeled them ‘relevant’ or ‘irrelevant’ based on our content brief criteria, and then fine-tuned a distilbert-base-uncased model for text classification. A training run took less than an hour on a single GPU instance, costing maybe $10. The real work was in the data collection and labeling, which we initially underestimated. We used a simple Python script to fetch search results for a given keyword, then a small internal web app for the labeling. The model’s output was a probability score, which we used to keep only the top N results for the LLM. This meant the LLM was working with cleaner, more focused data, reducing its chances of hallucinating or going off-topic. It’s a classic example of using a smaller, specialized model to improve the performance and cost-efficiency of a larger, general-purpose one.
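The filtering step described above can be sketched roughly as follows. This is a minimal illustration, not our production code: `filter_top_n`, `toy_score`, the `min_score` threshold, and the sample snippets are all hypothetical names and values, and `toy_score` is a keyword-based stand-in for the fine-tuned classifier's probability output so the example runs without a model.

```python
from typing import Callable, List, Tuple


def filter_top_n(
    snippets: List[str],
    score_fn: Callable[[str], float],
    n: int = 5,
    min_score: float = 0.5,
) -> List[Tuple[str, float]]:
    """Keep at most n snippets whose relevance score is >= min_score."""
    scored = [(snippet, score_fn(snippet)) for snippet in snippets]
    kept = [pair for pair in scored if pair[1] >= min_score]
    # Highest score first; list.sort is stable, so ties keep their input order.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:n]


def toy_score(snippet: str) -> float:
    """HYPOTHETICAL stand-in for the classifier's P(relevant)."""
    keywords = ("pricing", "comparison", "features")
    hits = sum(word in snippet.lower() for word in keywords)
    return hits / len(keywords)


if __name__ == "__main__":
    results = [
        "Pricing comparison of the leading tools",
        "Our cookie policy and terms of service",
        "Features and pricing comparison for small teams",
    ]
    # Only the snippets that survive the filter would be passed to the LLM.
    for text, score in filter_top_n(results, toy_score, n=2):
        print(f"{score:.2f}  {text}")
```

In a real setup, `score_fn` would wrap the fine-tuned model and return its probability for the ‘relevant’ class. Keeping the scorer pluggable also makes the filter easy to test without a GPU.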

Second, we implemented a more sophisticated evaluation system. Instead of just human tags, we built a ‘critic agent’ using a smaller, cheaper LLM (like GPT-3.5 Turbo) that would evaluate the output of our main agent against a set of explicit criteria derived from our client’s feedback. This critic agent would generate a score and a short critique. We then used these critiques to inform a dynamic prompt adjustment system. If the critic consistently flagged ‘lack of specific examples,’ we’d inject a specific instruction into the next prompt for the main agent: ‘Ensure you include at least two concrete examples from competitor sites.’ This wasn’t full-blown reinforcement learning, but a pragmatic, rule-based adaptation driven by ML-generated feedback. It’s a form of ‘human-in-the-loop’ learning, but with the human providing the initial ground truth and the ML system automating the feedback application.
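A minimal sketch of that rule-based adjustment might look like the following. The flag names, the `FLAG_INSTRUCTIONS` mapping, the `min_share` threshold, and the second instruction are hypothetical; only the ‘at least two concrete examples’ instruction comes from the setup described above.

```python
from collections import Counter
from typing import Dict, List

# HYPOTHETICAL mapping from critic flags to corrective instructions.
FLAG_INSTRUCTIONS: Dict[str, str] = {
    "lack_of_specific_examples": (
        "Ensure you include at least two concrete examples from competitor sites."
    ),
    "generic_outline": "Tailor each section of the outline to the primary keyword.",
}


def recurring_flags(recent_critiques: List[List[str]], min_share: float = 0.5) -> List[str]:
    """Return flags raised in at least min_share of the recent critiques."""
    if not recent_critiques:
        return []
    # set() so a flag repeated within one critique is counted once.
    counts = Counter(flag for flags in recent_critiques for flag in set(flags))
    total = len(recent_critiques)
    return sorted(flag for flag, count in counts.items() if count / total >= min_share)


def adjust_prompt(base_prompt: str, recent_critiques: List[List[str]], min_share: float = 0.5) -> str:
    """Append corrective instructions for flags the critic keeps raising."""
    extra = [
        FLAG_INSTRUCTIONS[flag]
        for flag in recurring_flags(recent_critiques, min_share)
        if flag in FLAG_INSTRUCTIONS
    ]
    if not extra:
        return base_prompt
    bullets = "\n".join(f"- {line}" for line in extra)
    return f"{base_prompt}\n\nAdditional instructions:\n{bullets}"


if __name__ == "__main__":
    critiques = [
        ["lack_of_specific_examples"],
        ["lack_of_specific_examples", "generic_outline"],
        [],
        ["lack_of_specific_examples"],
    ]
    print(adjust_prompt("Write a content brief for the given keyword.", critiques))
```

The threshold is what makes ‘consistently flagged’ concrete: a one-off complaint from the critic leaves the prompt alone, while a recurring one triggers the extra instruction.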

The critic agent itself was a LangGraph flow. It took the generated brief, the original prompt, and a set of evaluation guidelines. Its primary tool was a function that compared the brief against the guidelines, looking for specific elements like keyword density, competitor analysis depth, and structural integrity. For instance, one guideline might be: “Does the brief include at least three distinct subheadings for the main content?” The critic would then use its own LLM call to generate a summary of its findings and a numerical score. This feedback loop, while not directly training the main agent’s LLM, significantly improved its behavior over time by dynamically adjusting its instructions based on observed performance. It’s a powerful way to get closer to true agent improvement without the massive data requirements of full model fine-tuning.
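To make the guideline comparison concrete, here is a rough sketch of the deterministic part of such a check, leaving out the LLM call. It assumes briefs use Markdown-style `## ` subheadings, and it uses a raw keyword mention count as a crude proxy for keyword density. The function names, thresholds, and scoring scheme are illustrative assumptions, not the actual critic.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


def check_subheadings(brief: str, minimum: int = 3) -> CheckResult:
    """Guideline: at least `minimum` distinct subheadings (assumes '## ' style)."""
    found = re.findall(r"^##[ \t]+(.+)$", brief, flags=re.MULTILINE)
    distinct = {heading.strip().lower() for heading in found}
    return CheckResult(
        "subheadings", len(distinct) >= minimum, f"{len(distinct)} distinct subheadings"
    )


def check_keyword_mentions(brief: str, keyword: str, minimum: int = 2) -> CheckResult:
    """Guideline: the primary keyword appears at least `minimum` times."""
    count = len(re.findall(re.escape(keyword), brief, flags=re.IGNORECASE))
    return CheckResult("keyword_mentions", count >= minimum, f"{count} mentions")


def score_brief(brief: str, keyword: str) -> Tuple[float, List[CheckResult]]:
    """Run all checks and return the fraction passed plus per-check details."""
    results = [check_subheadings(brief), check_keyword_mentions(brief, keyword)]
    score = sum(result.passed for result in results) / len(results)
    return score, results


if __name__ == "__main__":
    sample = (
        "# Brief: CRM software\n"
        "## Audience\nTeams evaluating CRM software.\n"
        "## Competitors\nHow rivals position their CRM software.\n"
        "## Outline\nSuggested sections.\n"
    )
    score, results = score_brief(sample, "crm software")
    print(score, [(r.name, r.passed, r.detail) for r in results])
```

Checks like these are cheap and repeatable, so the critic's LLM call can be reserved for the judgment-heavy guidelines such as the depth of competitor analysis.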

The Unseen Costs: Data Labeling and Debugging

This wasn’t a smooth ride. The biggest gripe was the data labeling. Getting consistent, high-quality labels for ‘good’ vs. ‘bad’ content briefs was a nightmare. Our client’s marketing team had varying standards, and what one person considered ‘good,’ another found ‘too aggressive.’ We spent weeks aligning on a rubric, and even then, the initial labels were noisy. Langfuse helped us track these evaluations, but it didn’t solve the human consistency problem. We had to build internal tools just to manage the labeling process, which felt like a significant distraction from building the agent itself. And good luck finding docs for this; it’s all custom. Honestly, this part is where most teams will stumble. It’s expensive, tedious, and requires a level of organizational discipline that’s rare. Without clear, consistent labeling, any machine learning for agent improvement effort is doomed. You’re essentially teaching the agent to be confused.
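One common way to put a number on label consistency is Cohen's kappa between two labelers. This is a standard agreement statistic offered here as an illustration, not something described above as part of our process. The sketch below assumes two annotators labeled the same briefs in the same order, and the sample labels are made up.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Agreement between two annotators, corrected for chance agreement."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Share of items where the two annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if each annotator labeled at random with their own label rates.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both used a single identical label throughout.
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # HYPOTHETICAL labels from two reviewers on the same five briefs.
    reviewer_1 = ["good", "good", "bad", "needs minor edits", "bad"]
    reviewer_2 = ["good", "needs minor edits", "bad", "needs minor edits", "bad"]
    print(f"kappa = {cohens_kappa(reviewer_1, reviewer_2):.2f}")
```

A kappa near 1 means the rubric is being applied consistently. A value near 0 means the labelers agree about as often as chance would predict, which is a sign the rubric needs more work before any training starts.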

Debugging these systems also adds layers of complexity. When an agent fails, you’re not just looking at a stack trace. You’re looking at the interaction between multiple LLM calls, tool uses, and now, potentially, the outputs of smaller ML models. If the BERT classifier misclassifies a snippet, the LLM gets bad input, and the output suffers. If the critic agent misinterprets a guideline, it gives bad feedback, and the main agent adjusts incorrectly. Tracing tools like LangSmith are essential here, letting you inspect each step. But even with detailed traces, understanding the why behind a poor decision often requires deep domain knowledge and a lot of manual review. It’s not like debugging a traditional software bug where you can pinpoint a line of code. It’s more like debugging a conversation where one participant is subtly misunderstanding the context.

Paying for Performance: Costs and Returns

The cost implications were interesting. Training the small BERT classifier was cheap, maybe $50 in compute time on a cloud GPU. The real cost came from the human labeling, which probably ran us a few thousand dollars in contractor fees. The critic agent, running GPT-3.5 Turbo, added about $0.05 per brief generated, which is acceptable given the quality improvement. Overall, the initial investment in machine learning for agent improvement paid off. Our API costs for the main GPT-4 agent dropped by about 20% due to fewer retries and more focused outputs, and the client’s satisfaction went way up. We even saw a 10% increase in content production efficiency because writers spent less time editing.
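The payback arithmetic can be sketched as a simple break-even calculation. The baseline API cost per brief and the exact labeling spend below are hypothetical placeholders, since only ‘a few thousand dollars’ is given above. The 20% savings rate and the $0.05 critic cost are the figures quoted in this section.

```python
import math
from typing import Optional


def break_even_briefs(
    fixed_cost: float,
    baseline_api_cost_per_brief: float,
    api_savings_rate: float,
    critic_cost_per_brief: float,
) -> Optional[int]:
    """Briefs needed to recover the up-front spend, or None if it never pays back."""
    # Per-brief saving on the main agent, minus what the critic adds back.
    net_saving = baseline_api_cost_per_brief * api_savings_rate - critic_cost_per_brief
    if net_saving <= 0:
        return None
    return math.ceil(fixed_cost / net_saving)


if __name__ == "__main__":
    # HYPOTHETICAL: $3,000 labeling + $50 compute, $1.00 baseline API cost per brief.
    # From this section: about 20% API savings, about $0.05 per brief for the critic.
    briefs = break_even_briefs(3050.0, 1.00, 0.20, 0.05)
    print(f"Break-even after {briefs} briefs (API costs only)")
```

This counts API costs only. It leaves out the writer time saved on editing, which is harder to price but pushes the break-even point earlier.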

If you’re building agents that touch real money or real user data, you can’t afford to ignore this. The free tier of LangSmith or Langfuse is enough to get started with basic tracing and evaluation on solo work, but once you’re serious about improving agent performance with ML, you’ll need their paid tiers for more extensive logging and custom metrics. For example, LangSmith’s developer plan starts around $500/month for higher usage, which is fair if you’re deploying agents in production and need reliable monitoring and feedback loops. For smaller, more experimental agent projects, especially those involving custom Python scripts or fine-tuning smaller models, platforms like Replit Agent can be surprisingly effective. I’ve used Replit for quick iterations on the BERT classifier, spinning up environments without much fuss. It’s a solid choice for prototyping before you commit to heavier cloud infrastructure.

This approach also helps with compliance and governance. By explicitly defining what ‘good’ output looks like and building systems to enforce it, you reduce the risk of agents generating biased, inaccurate, or off-brand content. The critic agent, for instance, could be configured with rules to check for specific compliance requirements, acting as an automated audit layer. This gives you a verifiable trail of how an agent arrived at a decision and why certain outputs were accepted or rejected. It’s not just about making agents better; it’s about making them safer and more accountable.

Building agents isn’t just about chaining LLM calls. It’s about building systems that learn and adapt. If your agents are failing silently, costing too much, or generating inconsistent results, you’re not alone. The path to reliable, production-grade agents runs directly through machine learning for agent improvement. It’s hard work, especially the data labeling, but it’s the only way to move beyond brittle prompt engineering and into truly useful agent deployments. You’ll spend more time on data and evaluation than you think, but the payoff in agent reliability and cost savings is undeniable. Honestly, this is the only way I’d actually pay for a complex agent system in production.

— The Colophon
