Blog·May 5, 2026

The Hidden Token Cost of Bad Retrieval

token efficiency · RAG · AI agents · agentic search · MCP

Most teams trying to cut LLM costs look at the obvious places: model choice, output length, batching, prompt caching.

The expensive part is often hiding earlier.

Before your agent writes a line of code or answers a question, it has to retrieve context. If that context is raw HTML, long docs pages, duplicated navigation, cookie banners, and unrelated sidebars, your agent pays for all of it.

This is retrieval waste: tokens your agent reads, reasons over, and discards before it can do useful work.

And at scale, it can become one of the biggest line items in an agent's budget.

What Retrieval Waste Actually Means

Every search call returns something before the agent can act.

The question is simple:

How much of that context is signal?

Cloudflare's Markdown for Agents launch gives a clean example. One Cloudflare blog post consumed 16,180 tokens as HTML and 3,150 tokens as Markdown, an 80% token reduction.[1]

Same page.

Same information.

Much less junk for the model to process.

In agent workflows, the waste usually comes from:

  • Navigation menus
  • Sidebars
  • Cookie banners
  • Footer boilerplate
  • Related-article blocks
  • Scripts and markup
  • Repeated docs chrome

Your agent does not need most of that. It still pays for it.
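To see how much of a page is chrome, here is a minimal sketch that drops text nested inside common chrome tags before anything reaches the model. It uses only Python's standard library. The tag list, the sample page, and the helper names (ContentExtractor, extract_content) are illustrative assumptions, not how any particular retrieval product works. Real pages often mark chrome with classes or ARIA roles rather than semantic tags, so treat this as a starting point.

```python
from html.parser import HTMLParser

# Tags treated as page chrome in this sketch (an assumption, not a standard list).
CHROME_TAGS = {"nav", "aside", "footer", "header", "script", "style"}


class ContentExtractor(HTMLParser):
    """Collects text that is not nested inside a chrome tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of chrome tags we are currently inside
        self.chunks = []  # text fragments kept as content

    def handle_starttag(self, tag, attrs):
        if tag in CHROME_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in CHROME_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every chrome tag.
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_content(html: str) -> str:
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: one useful paragraph wrapped in navigation, footer, and script.
page = (
    "<html><body>"
    "<nav><a href='/'>Home</a> <a href='/docs'>Docs</a></nav>"
    "<main><h1>Rate limits</h1><p>Each key allows 100 requests per minute.</p></main>"
    "<footer>Copyright Example Corp</footer>"
    "<script>trackPageView();</script>"
    "</body></html>"
)

clean = extract_content(page)
print(clean)
```

Only the heading and the paragraph survive. The navigation links, footer, and script body never reach the model.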

The Five-Minute Audit

You do not need a complex evaluation harness to find retrieval waste.

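Take the last 20 searches your agent performed. For each one, log how many tokens it fetched and how many it actually cited, along with the query, the source, and whether the agent acted on the result: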

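# One record per search. count_tokens is whatever tokenizer-based counter you already use.
fetched_tokens = count_tokens(raw_content)   # everything the retrieval call returned
cited_tokens = count_tokens(agent_citation)  # the part the agent actually quoted or used
record = {
    "query": query,
    "source_url": url,
    "fetched_tokens": fetched_tokens,
    "cited_tokens": cited_tokens,
    "efficiency": cited_tokens / fetched_tokens if fetched_tokens else 0.0,
    "acted_on": acted_on,  # bool: did the agent act on this result?
}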

The ratio of cited tokens to fetched tokens is your retrieval efficiency.

If the agent fetched 10,000 tokens and only used 800, your efficiency is 8%. That means 92% of the retrieved context was overhead.
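The audit can be rolled up across a batch of searches with a few lines of Python. This is a rough sketch under stated assumptions: the RetrievalRecord class, the summarize helper, and the sample records are hypothetical, and count_tokens here is a crude whitespace proxy that you would replace with your model's real tokenizer.

```python
from dataclasses import dataclass


def count_tokens(text: str) -> int:
    """Crude stand-in (whitespace-separated words). Swap in your model's tokenizer."""
    return len(text.split())


@dataclass
class RetrievalRecord:
    query: str
    source_url: str
    fetched_tokens: int
    cited_tokens: int
    acted_on: bool

    @property
    def efficiency(self) -> float:
        # Share of fetched tokens the agent actually cited.
        if self.fetched_tokens == 0:
            return 0.0
        return self.cited_tokens / self.fetched_tokens


def summarize(records):
    """Aggregate efficiency over all records, weighted by tokens fetched."""
    fetched = sum(r.fetched_tokens for r in records)
    cited = sum(r.cited_tokens for r in records)
    acted = sum(1 for r in records if r.acted_on)
    return {
        "searches": len(records),
        "fetched_tokens": fetched,
        "cited_tokens": cited,
        "efficiency": cited / fetched if fetched else 0.0,
        "acted_on_rate": acted / len(records) if records else 0.0,
    }


# Hypothetical sample data; in practice these come from your agent's logs.
records = [
    RetrievalRecord("rate limits", "https://example.com/docs/limits", 10_000, 800, True),
    RetrievalRecord("auth header", "https://example.com/docs/auth", 12_000, 400, False),
]

summary = summarize(records)
print(f"first search efficiency: {records[0].efficiency:.0%}")
print(f"overall efficiency: {summary['efficiency']:.1%}")
```

Weighting by fetched tokens, rather than averaging per-search ratios, keeps one tiny, efficient fetch from hiding several large, wasteful ones.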

Most teams expect to find a latency problem.

They often find a cost problem.

The Compounding Math

Let's make the cost concrete.

Assume a moderate agent workload:

  • 20 searches per day
  • 12,000 tokens fetched per search (one raw page each)
  • 800 tokens of useful signal per search
  • $2.50 per 1 million input tokens, roughly GPT-5.4 standard input pricing as listed by OpenAI at the time of writing[3]

Raw retrieval:

20 searches x 12,000 tokens = 240,000 input tokens/day
0.24M tokens x $2.50 per 1M tokens = $0.60/day

Efficient retrieval:

20 searches x 800 tokens = 16,000 input tokens/day
0.016M tokens x $2.50 per 1M tokens = $0.04/day

That is about $0.56/day in avoidable input cost, or roughly $17/month, before the agent generates a single output token.
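The same arithmetic as a small calculator, so you can plug in your own volumes and prices. The function name and the 30-day month are assumptions made for this sketch. The price is the $2.50 per million input tokens used above.

```python
def daily_input_cost(searches_per_day: int, tokens_per_search: int,
                     usd_per_million_tokens: float) -> float:
    """Input-token cost per day for retrieval alone, in USD."""
    tokens = searches_per_day * tokens_per_search
    return tokens * usd_per_million_tokens / 1_000_000


PRICE = 2.50  # USD per 1M input tokens, the figure assumed in this post

raw = daily_input_cost(20, 12_000, PRICE)     # full raw pages
efficient = daily_input_cost(20, 800, PRICE)  # only the useful signal
avoidable = raw - efficient

print(f"raw: ${raw:.2f}/day, efficient: ${efficient:.2f}/day")
print(f"avoidable: ${avoidable:.2f}/day, about ${avoidable * 30:.2f} per 30-day month")
```

Changing any one input (searches per day, page size, or price tier) shows how quickly the gap widens.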

Now multiply it by a team, higher traffic, longer docs pages, or more expensive model tiers.

The problem stops looking theoretical.

Three Retrieval Patterns

Not all retrieval is equally wasteful.

1. Raw HTML fetch. The agent gets the full page and has to separate content from page structure. This is easy to wire up, but the model pays for the web's design choices.

2. Search snippets. The agent receives short excerpts from a search API. Better than raw HTML, but snippets are often too short, cut off at the wrong place, or missing the source context needed to act.

3. Agent-native retrieval. The retrieval layer extracts the useful content, strips page chrome, ranks by authority, and returns clean Markdown or evidence packets. Firecrawl's docs describe this pattern directly: converting URLs into Markdown or structured data for LLM applications, with support for dynamic pages and multiple output formats.[2]

The Cost Is Not Just Tokens

Noisy input also makes the agent think harder.

When retrieval is messy, the agent has to:

  • Decide what is navigation vs. content
  • Ignore unrelated links and widgets
  • Judge whether the source is authoritative
  • Work around partial or stale excerpts
  • Recover when the wrong context leads to the wrong action

Each step consumes reasoning time, output tokens, and user trust.

Clean retrieval removes that work before the model sees the context. The agent gets a smaller, clearer packet and can spend its reasoning budget on the task itself.

What To Optimize First

Do not start by changing the model.

Do not start by rewriting the prompt.

Start by measuring retrieval efficiency.

If most fetched tokens are never cited or acted on, the retrieval layer is the optimization target. The biggest savings may come from giving the agent less context, as long as it is the right context.

The goal is not more retrieval.

The goal is less waste.


We are building Ninelayer for teams who have seen this bill up close: smart agents, noisy context, and tokens spent on everything except the answer. If that sounds familiar, get started.

Sources

  1. Cloudflare: Introducing Markdown for Agents
  2. Firecrawl docs: Scrape endpoint: turn any URL into clean data
  3. OpenAI: API Pricing