Blog·June 22, 2026

How to Reduce AI Agent Token Usage

token efficiency · AI agents · retrieval · MCP

Most teams try to reduce token usage at the wrong end of the pipeline.

They shorten the final answer.

That helps a little.

But AI agents usually waste tokens before they write the answer: noisy retrieval, oversized tool outputs, repeated failed attempts, and stale context that creates retry loops.

If you want meaningful savings, optimize the agent loop.

1. Return Cleaner Retrieval Context

Raw pages are expensive.

They include navigation, cookie banners, footer links, scripts, related content, and repeated layout text.

Agents do not need a webpage.

They need evidence.

Replace raw fetches with retrieval that returns:

  • relevant passages
  • source URLs
  • short summaries
  • authority signals
  • enough context to act

This is the core reason Ninelayer exists: compact search results for agents instead of human-browser pages.
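As a rough illustration of how much of a raw page is layout rather than evidence, the sketch below strips the noisy regions with Python's standard-library HTML parser. The `SKIP_TAGS` list and the `extract_text` helper are assumptions made for this example, not how any particular retrieval product works. A real pipeline would also rank passages and attach source URLs.

```python
from html.parser import HTMLParser

# Tags whose contents are layout or code rather than evidence.
# This list is an assumption for the sketch; tune it per site.
SKIP_TAGS = {"script", "style", "nav", "footer", "header", "aside", "noscript"}


class EvidenceExtractor(HTMLParser):
    """Collects visible text and ignores everything nested inside SKIP_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    """Return only the text that sits outside navigation, scripts, and footers."""
    parser = EvidenceExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Even this crude filter drops menus, scripts, and footers before the agent sees them.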

2. Cap Tool Output

Every tool should have an output budget.

For example:

{
  "max_results": 5,
  "max_chars_per_result": 1200,
  "include_full_text": false
}

The agent can always ask for more.

The default should be small enough for the model to reason over in one pass.

Large outputs are not just expensive. They also distract the model from the useful signal.
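One way to enforce a budget like the one above is to apply it in the tool wrapper, before results reach the model. This is a minimal sketch: the `OutputBudget` class, the `apply_budget` helper, and the `evidence`, `full_text`, and `truncated` field names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class OutputBudget:
    # Defaults mirror the example config; adjust per tool.
    max_results: int = 5
    max_chars_per_result: int = 1200
    include_full_text: bool = False


def apply_budget(results: list[dict], budget: OutputBudget) -> list[dict]:
    """Trim a list of tool results to fit the budget without mutating the input."""
    capped = []
    for result in results[: budget.max_results]:
        item = dict(result)  # shallow copy so the caller's data is untouched
        if not budget.include_full_text:
            item.pop("full_text", None)
        evidence = item.get("evidence", "")
        if len(evidence) > budget.max_chars_per_result:
            item["evidence"] = evidence[: budget.max_chars_per_result]
            item["truncated"] = True  # tells the agent it can ask for more
        capped.append(item)
    return capped
```

The `truncated` flag matters: it keeps the small default honest by telling the agent that more is available.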

3. Search Before Reading Full URLs

Do not fetch ten full pages by default.

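Use a two-step pattern:

  1. Search first and review the compact results.
  2. Fetch the full text only for the sources worth reading.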

This mirrors how a good engineer works: scan, choose, then read.

For agents, it prevents full-page fetches from becoming the default cost center.
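The scan, choose, read flow can be sketched as a small function. The `search`, `fetch`, and `choose` callables are hypothetical stand-ins for whatever tools the agent has, and `max_full_reads` is an assumed safety cap, not a recommended value.

```python
from typing import Callable


def search_then_read(
    query: str,
    search: Callable[[str], list[dict]],
    fetch: Callable[[str], str],
    choose: Callable[[list[dict]], list[dict]],
    max_full_reads: int = 2,
) -> dict:
    """Scan compact results first, then fetch full text only for chosen sources."""
    results = search(query)                      # scan: cheap, compact results
    selected = choose(results)[:max_full_reads]  # choose: hard cap on full reads
    pages = {r["url"]: fetch(r["url"]) for r in selected}  # read: selected only
    return {"results": results, "pages": pages}
```

Because the cap is applied after `choose`, an over-eager selection step still cannot turn into ten full-page fetches.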

4. Compress State Between Steps

Long-running agents accumulate context.

After each major step, write a compact state summary:

Current facts:
- Next.js route handlers in this repo live under src/app/api.
- The failing test is auth-refresh.spec.ts.
- The current SDK requires refreshToken(), not refresh().
- We changed auth/client.ts and still need to update auth/server.ts.

Open risks:
- Need to verify middleware behavior.

Then carry the summary forward instead of the entire transcript.
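A summary like the one above is easy to keep as structured state and render on demand. In this sketch, `AgentState` and `next_context` are hypothetical names, and how facts get added (by the model or by code) is left open.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    facts: list[str] = field(default_factory=list)
    risks: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the state in the compact 'Current facts / Open risks' layout."""
        lines = ["Current facts:"]
        lines += [f"- {fact}" for fact in self.facts]
        lines += ["", "Open risks:"]
        lines += [f"- {risk}" for risk in self.risks]
        return "\n".join(lines)


def next_context(system_prompt: str, state: AgentState, latest_step: str) -> str:
    """Build the next prompt from the summary plus the latest step only,
    instead of replaying the whole transcript."""
    return "\n\n".join([system_prompt, state.render(), latest_step])
```

The next prompt then grows with the number of facts, not with the number of turns.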

5. Prevent Retry Loops

Retry loops are token fires.

An agent fails, reads the same stale context, tries a similar patch, fails again, and repeats.

Add circuit breakers:

  • after two failed attempts, stop and inspect
  • after a compiler error, search exact error text
  • after an API mismatch, write a small probe
  • after repeated test failures, summarize what changed

The goal is to change the agent's strategy, not just repeat generation.
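The circuit breakers above could be wired into the agent loop roughly like this. The failure kinds and action names are invented labels for the sketch; the agent runtime would have to map them to real behavior.

```python
# Maps a failure kind to a strategy change. All labels are hypothetical.
NEXT_ACTION = {
    "compiler_error": "search_exact_error_text",
    "api_mismatch": "write_probe",
    "test_failure": "summarize_changes",
}


class RetryBreaker:
    """Counts consecutive failures and forces a strategy change."""

    def __init__(self, max_failures: int = 2) -> None:
        self.max_failures = max_failures
        self.failures = 0

    def record_failure(self, kind: str) -> str:
        """Return the next action; trip the breaker once the limit is reached."""
        self.failures += 1
        if self.failures >= self.max_failures:
            return "stop_and_inspect"
        return NEXT_ACTION.get(kind, "retry")

    def record_success(self) -> None:
        self.failures = 0
```

With the default limit of two, the first compiler error triggers an exact-text search, and a second consecutive failure stops the loop for inspection.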

6. Prefer Structured Tool Results

Structured output is easier to summarize and filter.

Instead of:

Huge blob of HTML

Return:

{
  "title": "Route Handlers",
  "url": "https://nextjs.org/docs/...",
  "evidence": "The relevant passage...",
  "source_type": "official_docs"
}

This lets the agent cite, deduplicate, and reason over sources with fewer tokens.
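With results in that shape, deduplication and citation become a few lines of code. The sketch assumes every result carries `title` and `url` keys, and its URL normalization (dropping a trailing slash) is deliberately crude.

```python
def dedupe_by_url(results: list[dict]) -> list[dict]:
    """Keep the first result per URL, treating a trailing slash as insignificant."""
    seen: set[str] = set()
    unique = []
    for result in results:
        key = result["url"].rstrip("/")
        if key not in seen:
            seen.add(key)
            unique.append(result)
    return unique


def format_citations(results: list[dict]) -> str:
    """Render a numbered source list the agent can cite from."""
    return "\n".join(
        f"[{i}] {r['title']} ({r['url']})" for i, r in enumerate(results, start=1)
    )
```

Neither step is practical on a raw HTML blob without first asking the model to parse it.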

7. Make Prompts Shorter by Moving Policy Into Tools

If every prompt says:

Prefer official docs, avoid stale pages, return compact evidence, include URLs...

you are paying for the same instruction again and again.

Better retrieval tools can encode those defaults.

The prompt becomes:

Use Ninelayer to find current official docs before editing.

That is cheaper and harder to forget.
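To make "policy in the tool" concrete, here is a sketch of a wrapper that bakes the defaults into the tool itself. `RetrievalPolicy`, `make_docs_tool`, and the `backend` callable are hypothetical and do not describe any specific product's API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RetrievalPolicy:
    # Defaults that would otherwise be repeated in every prompt.
    allowed_source_types: tuple[str, ...] = ("official_docs",)
    max_results: int = 5
    require_url: bool = True


def make_docs_tool(
    backend: Callable[[str], list[dict]],
    policy: RetrievalPolicy = RetrievalPolicy(),
) -> Callable[[str], list[dict]]:
    """Wrap a search backend so the policy is enforced in code, not in the prompt."""

    def find_docs(query: str) -> list[dict]:
        hits = backend(query)
        kept = [
            h for h in hits
            if h.get("source_type") in policy.allowed_source_types
            and (h.get("url") or not policy.require_url)
        ]
        return kept[: policy.max_results]

    return find_docs
```

The prompt then only needs to name the tool. The filtering rules live in one place and cost no tokens per call.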

A Practical Token Budget

For coding agents, a healthy budget often looks like:

Step                   Target
Search result packet   800-1,500 tokens
Full URL extraction    Only for selected sources
Plan                   Under 500 tokens
Patch summary          Under 400 tokens
Failure analysis       Focused on new information

The exact numbers depend on the model and task, but the principle is stable:

Spend tokens on evidence and decisions, not boilerplate.
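A budget is only useful if something checks it. The sketch below flags steps that exceed their numeric target, using a four-characters-per-token estimate. That ratio is a rough assumption; a real check should use the model's own tokenizer. The step names are invented keys for this example.

```python
# Upper bounds in tokens for the steps that have a numeric target.
TOKEN_BUDGET = {
    "search_result_packet": 1500,
    "plan": 500,
    "patch_summary": 400,
}


def estimate_tokens(text: str) -> int:
    """Very rough estimate: about four characters per token (an assumption)."""
    return (len(text) + 3) // 4  # ceiling division


def over_budget(step: str, text: str) -> bool:
    """True if the step has a numeric budget and the text exceeds it."""
    limit = TOKEN_BUDGET.get(step)
    return limit is not None and estimate_tokens(text) > limit
```

Steps without a numeric target, such as failure analysis, are left unchecked here and need a qualitative review instead.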

The Practical Takeaway

To reduce AI agent token usage, start with retrieval.

Cleaner context reduces:

  • input tokens
  • irrelevant reasoning
  • failed edits
  • repeated searches
  • human cleanup

Token efficiency is not about starving the agent.

It is about feeding it better.
