The Silent Drift: When Agents Degrade
Last quarter, we shipped a content-briefing agent for a SaaS marketing team. The idea was simple: feed it a primary keyword, a few competitor URLs, and it’d spit out a detailed brief for a writer. Initial tests looked great. We used LangGraph to orchestrate a few steps: research, outline generation, and a quick self-critique. For the first few weeks, it hummed along, saving the team hours. Then, the silent failures started. Not crashes, but subtle degradations in output quality. Briefs became generic, sometimes missing key competitive insights. The client started complaining about needing to heavily edit the agent’s output, which defeated the whole purpose. Our costs, tied to GPT-4 API calls, started creeping up because the agent was generating longer, less focused drafts, requiring more internal retries. This wasn’t a bug; it was a slow, expensive drift. That’s when I realized we needed proper machine learning for agent improvement, not just better prompt engineering.
The core issue wasn’t a single broken tool or a bad prompt. It was the agent’s inability to adapt to new data, new competitor strategies, or even subtle shifts in the client’s brand voice. We’d built it on a snapshot of knowledge, and the world moved on. Our initial evaluation metrics were too simplistic: ‘does it generate a brief?’ not ‘is it a good brief that saves the writer time?’ We were using LangSmith for tracing, which helped us see the execution paths, but it didn’t tell us why a particular path led to a poor outcome. We could see the agent calling the search tool, then the summarization tool, but the quality of the summarization was the problem. We needed a way to teach the agent what ‘good’ looked like, and crucially, what ‘bad’ looked like, without constantly rewriting system prompts.
Targeted ML: Guardrails and Critics
Our first attempt at machine learning for agent improvement involved a feedback loop. We started tagging outputs in LangSmith: ‘good,’ ‘needs minor edits,’ ‘bad.’ This gave us a small dataset. The challenge was turning those qualitative tags into something actionable for the agent. We couldn’t just fine-tune the LLM on ‘good’ examples because the problem wasn’t the LLM’s general knowledge; it was its specific application within our agent’s workflow. We needed to influence the agent’s decision-making at critical junctures, like when it decided which search results to prioritize or how to synthesize information for the outline.
We ended up focusing on two areas. First, a small, task-specific classifier. Instead of the LLM deciding if a search result was ‘relevant,’ we trained a tiny BERT model on a few hundred examples of ‘relevant’ vs. ‘irrelevant’ search snippets for content briefs. This model ran before the LLM even saw the data, filtering out noise. It wasn’t perfect, but it cut the LLM’s token consumption by about 15% on average, and the quality of the raw input improved dramatically. This was a concrete win: a small, focused ML model acting as a guardrail, making the LLM’s job easier and cheaper.
The BERT model itself was a simple transformers pipeline. We collected about 500 search result snippets, manually labeled them ‘relevant’ or ‘irrelevant’ based on our content brief criteria, and then fine-tuned a distilbert-base-uncased model for text classification. The training took less than an hour on a single GPU instance, costing maybe $10. The real work was in the data collection and labeling, which we initially underestimated. We used a simple Python script to fetch search results for a given keyword, then a small internal web app for the labeling. The model’s output was a probability score, which we then used to filter the top N results for the LLM. This meant the LLM was working with cleaner, more focused data, reducing its chances of hallucinating or going off-topic. It’s a classic example of using a smaller, specialized model to improve the performance and cost-efficiency of a larger, general-purpose one.
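The filtering step described above can be sketched roughly as follows. This is a minimal illustration, not our production code: `filter_snippets`, `toy_score`, and the `min_score` cutoff are hypothetical names and parameters introduced here. In a real setup, `score_fn` would wrap the fine-tuned classifier (for example, a Hugging Face text-classification pipeline) and return the probability of the ‘relevant’ label.

```python
from typing import Callable, List, Tuple


def filter_snippets(
    snippets: List[str],
    score_fn: Callable[[str], float],
    top_n: int = 5,
    min_score: float = 0.5,
) -> List[Tuple[str, float]]:
    """Keep at most top_n snippets whose relevance score is >= min_score.

    score_fn is assumed to return a probability in [0, 1] that a snippet
    is relevant. min_score is an assumption of this sketch; the text only
    mentions keeping the top N results.
    """
    scored = [(snippet, score_fn(snippet)) for snippet in snippets]
    kept = [pair for pair in scored if pair[1] >= min_score]
    # Highest-scoring snippets first, so the LLM sees the best material.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_n]


def toy_score(snippet: str) -> float:
    """Stand-in scorer for illustration only (not a trained model).

    Returns the fraction of a few hypothetical cue words found in the text.
    """
    cues = ("pricing", "comparison", "guide")
    text = snippet.lower()
    return sum(cue in text for cue in cues) / len(cues)
```

Passing the scorer in as a callable keeps the filter independent of the model behind it, so the classifier can be retrained or swapped without touching the agent’s workflow.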
Second, we implemented a more sophisticated evaluation system. Instead of just human tags, we built a ‘critic agent’ using a smaller, cheaper LLM (like GPT-3.5 Turbo) that would evaluate the output of our main agent against a set of explicit criteria derived from our client’s feedback. This critic agent would generate a score and a short critique. We then used these critiques to inform a dynamic prompt adjustment system. If the critic consistently flagged ‘lack of specific examples,’ we’d inject a specific instruction into the next prompt for the main agent: ‘Ensure you include at least two concrete examples from competitor sites.’ This wasn’t full-blown reinforcement learning, but a pragmatic, rule-based adaptation driven by ML-generated feedback. It’s a form of ‘human-in-the-loop’ learning, but with the human providing the initial ground truth and the ML system automating the feedback application.
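A rough sketch of that rule-based adjustment is shown below. The flag names, the `REMEDIES` table, and the `min_share` threshold are assumptions made for illustration; the source only establishes that a consistently raised critique leads to an extra instruction in the next prompt.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical mapping from critic flags to corrective instructions.
# Only the first instruction comes from the text; the flag names are invented.
REMEDIES: Dict[str, str] = {
    "lack_of_specific_examples": (
        "Ensure you include at least two concrete examples from competitor sites."
    ),
}


def adjust_prompt(
    base_prompt: str,
    recent_flags: List[List[str]],
    min_share: float = 0.5,
) -> str:
    """Append remedies for flags raised in at least min_share of recent critiques.

    recent_flags holds one list of flag names per recent critic run.
    """
    if not recent_flags:
        return base_prompt
    # Count each flag at most once per critique.
    counts = Counter(flag for flags in recent_flags for flag in set(flags))
    extras = [
        instruction
        for flag, instruction in REMEDIES.items()
        if counts[flag] / len(recent_flags) >= min_share
    ]
    if not extras:
        return base_prompt
    bullet_list = "\n".join(f"- {instruction}" for instruction in extras)
    return f"{base_prompt}\n\nAdditional instructions:\n{bullet_list}"
```

Requiring a flag to appear in a share of recent runs, rather than reacting to a single critique, is one way to keep a noisy critic from rewriting the prompt on every run.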
The critic agent itself was a LangGraph flow. It took the generated brief, the original prompt, and a set of evaluation guidelines. Its primary tool was a function that compared the brief against the guidelines, looking for specific elements like keyword density, competitor analysis depth, and structural integrity. For instance, one guideline might be: “Does the brief include at least three distinct subheadings for the main content?” The critic would then use its own LLM call to generate a summary of its findings and a numerical score. This feedback loop, while not directly training the main agent’s LLM, significantly improved its behavior over time by dynamically adjusting its instructions based on observed performance. It’s a powerful way to get closer to true agent improvement without the massive data requirements of full model fine-tuning.
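The deterministic part of such a critic might look something like the sketch below. It assumes briefs are written in Markdown with `##` or `###` subheadings, and the `Guideline` class and `evaluate` function are hypothetical names; the critic’s LLM call that writes the prose critique is deliberately left out.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Guideline:
    """A named pass/fail check applied to a generated brief."""
    name: str
    check: Callable[[str], bool]


def count_subheadings(brief: str) -> int:
    """Count distinct Markdown subheadings (## or ###), ignoring case."""
    headings = re.findall(r"^#{2,3}[ \t]+(.+?)[ \t]*$", brief, flags=re.MULTILINE)
    return len({heading.lower() for heading in headings})


def evaluate(brief: str, guidelines: List[Guideline]) -> Dict[str, object]:
    """Return the fraction of guidelines passed and the names of those failed."""
    failed = [g.name for g in guidelines if not g.check(brief)]
    score = 1.0 - len(failed) / len(guidelines) if guidelines else 0.0
    return {"score": score, "failed": failed}
```

A guideline such as “at least three distinct subheadings” then becomes `Guideline("three_subheadings", lambda b: count_subheadings(b) >= 3)`, and the list of failed names can be passed to the critic’s LLM as the basis for its written critique.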