How to Reduce AI Agent Token Usage

Most teams try to reduce token usage at the wrong end of the pipeline.

They shorten the final answer.

That helps a little.

But AI agents usually waste tokens before they write the answer: noisy retrieval, oversized tool outputs, repeated failed attempts, and stale context that creates retry loops.

If you want meaningful savings, optimize the agent loop.

1. Return Cleaner Retrieval Context

Raw pages are expensive.

They include navigation, cookie banners, footer links, scripts, related content, and repeated layout text.

Agents do not need a webpage.

They need evidence.

Replace raw fetches with retrieval that returns:

relevant passages
source URLs
short summaries
authority signals
enough context to act

This is the core reason Ninelayer exists: it returns compact search results built for agents instead of pages built for human browsers.
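To make the difference concrete, here is a minimal sketch of stripping layout noise from a raw page using only Python's standard library. The list of tags to skip and the sample page are assumptions for illustration. A real retrieval layer would also rank passages and attach source URLs.

```python
from html.parser import HTMLParser

# Assumed list of tags whose contents are layout or machinery, not evidence.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "noscript"}


class EvidenceExtractor(HTMLParser):
    """Collects visible text and ignores anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    """Return the page's visible text with boilerplate regions removed."""
    parser = EvidenceExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical page used only to demonstrate the filter.
page = """
<html><body>
<nav><a href="/">Home</a><a href="/docs">Docs</a></nav>
<script>trackVisitor();</script>
<main><h1>Route Handlers</h1><p>Route handlers live in the app directory.</p></main>
<footer>Copyright and related links</footer>
</body></html>
"""

print(extract_text(page))
```

Only the heading and the paragraph survive. The navigation, script, and footer never reach the model.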

2. Cap Tool Output

Every tool should have an output budget.

For example:

{
  "max_results": 5,
  "max_chars_per_result": 1200,
  "include_full_text": false
}

The agent can always ask for more.

The default should be small enough for the model to reason over in one pass.

Large outputs are not just expensive. They also distract the model from the useful signal.
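One way to enforce a budget like the one above is a small wrapper applied to every tool result before it reaches the model. This is a sketch, not a specific framework's API: the `snippet`, `full_text`, and `truncated` field names are assumptions.

```python
from dataclasses import dataclass


@dataclass
class OutputBudget:
    # Defaults mirror the example config above.
    max_results: int = 5
    max_chars_per_result: int = 1200
    include_full_text: bool = False


def apply_budget(results: list[dict], budget: OutputBudget) -> list[dict]:
    """Trim a list of tool results so it fits the budget."""
    capped = []
    for result in results[: budget.max_results]:
        item = dict(result)  # copy so the caller's data is not mutated
        if not budget.include_full_text:
            item.pop("full_text", None)
        snippet = item.get("snippet", "")
        if len(snippet) > budget.max_chars_per_result:
            item["snippet"] = snippet[: budget.max_chars_per_result]
            # Flag the cut so the agent knows it can ask for more.
            item["truncated"] = True
        capped.append(item)
    return capped


# Hypothetical oversized tool output: 8 results, 3,000 characters each.
raw = [
    {"url": f"https://example.com/{i}", "snippet": "x" * 3000, "full_text": "y" * 50000}
    for i in range(8)
]
small = apply_budget(raw, OutputBudget())
print(len(small), len(small[0]["snippet"]), "full_text" in small[0])
```

The `truncated` flag matters as much as the cap. It tells the agent that more exists without paying for it up front.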

3. Search Before Reading Full URLs

Do not fetch ten full pages by default.

Use a two-step pattern:

search for compact results first
fetch the full text only for the sources you select

This mirrors how a good engineer works: scan, choose, then read.

For agents, it prevents full-page fetches from becoming the default cost center.
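Here is a rough sketch of the search-then-read flow. The in-memory `CORPUS`, the keyword scoring, and the function names are all hypothetical stand-ins for a real search index and page fetcher.

```python
# Hypothetical corpus: each entry has a cheap summary and an expensive full text.
CORPUS = {
    "https://example.com/auth-refresh": {
        "summary": "How to refresh an auth token",
        "full_text": "Long page about refreshing tokens. " * 200,
    },
    "https://example.com/auth-login": {
        "summary": "Login flow and token storage",
        "full_text": "Long page about login. " * 200,
    },
    "https://example.com/pricing": {
        "summary": "Plans and pricing",
        "full_text": "Long page about pricing. " * 200,
    },
}


def search(query: str, max_results: int = 5) -> list[dict]:
    """Step 1: return compact hits (URL and summary), never full text."""
    terms = query.lower().split()
    hits = []
    for url, doc in CORPUS.items():
        score = sum(term in doc["summary"].lower() for term in terms)
        if score:
            hits.append({"url": url, "summary": doc["summary"], "score": score})
    hits.sort(key=lambda hit: hit["score"], reverse=True)
    return hits[:max_results]


def fetch_full_text(url: str) -> str:
    """Step 2: the expensive call, made only for selected sources."""
    return CORPUS[url]["full_text"]


def search_then_read(query: str, max_reads: int = 1) -> list[dict]:
    """Scan the compact hits, choose the top ones, then read only those."""
    selected = search(query)[:max_reads]
    return [{"url": hit["url"], "text": fetch_full_text(hit["url"])} for hit in selected]


pages = search_then_read("refresh token")
print([page["url"] for page in pages])
```

In a real agent the choosing step would usually be the model's decision rather than a fixed top-N cut. The point is that the full-text call is gated behind a selection.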

4. Compress State Between Steps

Long-running agents accumulate context.

After each major step, write a compact state summary:

Current facts:
- Next.js route handlers in this repo live under src/app/api.
- The failing test is auth-refresh.spec.ts.
- The current SDK requires refreshToken(), not refresh().
- We changed auth/client.ts and still need to update auth/server.ts.

Open risks:
- Need to verify middleware behavior.

Then carry the summary forward instead of the entire transcript.
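A compact state object can be as simple as two lists with a size cap. The sketch below assumes a hypothetical `AgentState` class and an arbitrary cap of eight facts. It is one possible shape, not a prescribed one.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    facts: list[str] = field(default_factory=list)
    open_risks: list[str] = field(default_factory=list)
    max_items: int = 8  # assumed cap; tune per task

    def add_fact(self, fact: str) -> None:
        """Record a fact once and keep only the most recent max_items."""
        if fact not in self.facts:
            self.facts.append(fact)
        self.facts = self.facts[-self.max_items:]

    def render(self) -> str:
        """Produce the summary text that replaces the full transcript."""
        lines = ["Current facts:"]
        lines += [f"- {fact}" for fact in self.facts]
        lines += ["", "Open risks:"]
        lines += [f"- {risk}" for risk in self.open_risks]
        return "\n".join(lines)


def next_context(state: AgentState, instruction: str) -> str:
    """Build the next step's input from the summary, not the history."""
    return state.render() + "\n\nNext step:\n" + instruction


state = AgentState()
state.add_fact("The failing test is auth-refresh.spec.ts.")
state.add_fact("The current SDK requires refreshToken(), not refresh().")
state.open_risks.append("Need to verify middleware behavior.")
print(next_context(state, "Update auth/server.ts."))
```

The cap forces a choice about what still matters, which is the actual compression step.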

5. Prevent Retry Loops

Retry loops are token fires.

An agent fails, reads the same stale context, tries a similar patch, fails again, and repeats.

Add circuit breakers:

after two failed attempts, stop and inspect
after a compiler error, search for the exact error text
after an API mismatch, write a small probe
after repeated test failures, summarize what changed

The goal is to change the agent's strategy, not just repeat generation.
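The rules above can be sketched as a small breaker that counts failures per task and returns a different next action instead of another retry. The failure kinds and action names are hypothetical labels. The caller is assumed to classify each failure.

```python
from collections import Counter


class RetryBreaker:
    """Maps a failure to a strategy change once retrying stops paying off."""

    def __init__(self, max_failures: int = 2):
        self.max_failures = max_failures
        self.failures = Counter()

    def next_action(self, task: str, kind: str) -> str:
        self.failures[task] += 1
        # Some failure kinds call for a new strategy immediately.
        if kind == "compiler_error":
            return "search_exact_error_text"
        if kind == "api_mismatch":
            return "write_probe"
        # Otherwise allow retries until the limit is reached.
        if self.failures[task] >= self.max_failures:
            if kind == "test_failure":
                return "summarize_changes"
            return "stop_and_inspect"
        return "retry"


breaker = RetryBreaker(max_failures=2)
print(breaker.next_action("fix-auth", "test_failure"))  # first failure
print(breaker.next_action("fix-auth", "test_failure"))  # second failure
```

The first call allows one more attempt. The second forces the agent to summarize what changed before it generates anything else.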

6. Prefer Structured Tool Results

Structured output is easier to summarize and filter.

Instead of:

Huge blob of HTML

Return:

{
  "title": "Route Handlers",
  "url": "https://nextjs.org/docs/...",
  "evidence": "The relevant passage...",
  "source_type": "official_docs"
}

This lets the agent cite, deduplicate, and reason over sources with fewer tokens.
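As a rough illustration, once results share a shape like the one above, deduplication and filtering become a few lines of code. The `Evidence` class and the example URLs are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    # Field names follow the JSON example above.
    title: str
    url: str
    evidence: str
    source_type: str


def dedupe(results: list[Evidence]) -> list[Evidence]:
    """Keep the first result per URL, ignoring a trailing slash."""
    seen = set()
    unique = []
    for result in results:
        key = result.url.rstrip("/")
        if key not in seen:
            seen.add(key)
            unique.append(result)
    return unique


def only_source_type(results: list[Evidence], source_type: str) -> list[Evidence]:
    """Filter by provenance without reading any page text."""
    return [result for result in results if result.source_type == source_type]


# Hypothetical results, including one duplicate.
results = [
    Evidence("Route Handlers", "https://example.com/docs/routes", "Passage A", "official_docs"),
    Evidence("Route Handlers", "https://example.com/docs/routes/", "Passage A", "official_docs"),
    Evidence("My routing tips", "https://example.com/blog/tips", "Passage B", "blog"),
]

unique = dedupe(results)
official = only_source_type(unique, "official_docs")
print(len(unique), [item.url for item in official])
```

Neither operation needs the model. With an HTML blob, both would cost tokens.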

7. Make Prompts Shorter by Moving Policy Into Tools

If every prompt says:

Prefer official docs, avoid stale pages, return compact evidence, include URLs...

you are paying for the same instruction again and again.

Better retrieval tools can encode those defaults.

The prompt becomes:

Use Ninelayer to find current official docs before editing.

That is cheaper and harder to forget.
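One way to picture this is a tool factory that bakes the policy in as defaults. This is a generic sketch and not Ninelayer's actual API: the policy keys, the `age_days` field, and the backend function are all hypothetical.

```python
# Assumed policy: the instructions that would otherwise be repeated in every prompt.
DEFAULT_POLICY = {
    "prefer_source_types": ("official_docs",),
    "max_age_days": 365,
    "max_results": 5,
}


def make_search_tool(backend, policy=DEFAULT_POLICY):
    """Wrap a raw search backend so every call applies the same policy."""

    def search_docs(query: str) -> list[dict]:
        results = backend(query)
        # Avoid stale pages.
        fresh = [r for r in results if r["age_days"] <= policy["max_age_days"]]
        # Prefer official docs: sorted() is stable, so preferred types move first.
        ranked = sorted(
            fresh, key=lambda r: r["source_type"] not in policy["prefer_source_types"]
        )
        # Return compact evidence and always include the URL.
        return [
            {"title": r["title"], "url": r["url"], "evidence": r["evidence"]}
            for r in ranked[: policy["max_results"]]
        ]

    return search_docs


def fake_backend(query: str) -> list[dict]:
    """Hypothetical backend returning unfiltered results."""
    return [
        {"title": "Blog post", "url": "https://example.com/blog", "evidence": "B",
         "source_type": "blog", "age_days": 30},
        {"title": "Docs", "url": "https://example.com/docs", "evidence": "A",
         "source_type": "official_docs", "age_days": 10},
        {"title": "Old docs", "url": "https://example.com/old", "evidence": "C",
         "source_type": "official_docs", "age_days": 900},
    ]


search_docs = make_search_tool(fake_backend)
print([r["url"] for r in search_docs("route handlers")])
```

The prompt only has to name the tool. The preference for fresh, official, compact results travels with it.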

A Practical Token Budget

For coding agents, a healthy budget often looks like:

Step	Target
Search result packet	800-1,500 tokens
Full URL extraction	Only for selected sources
Plan	Under 500 tokens
Patch summary	Under 400 tokens
Failure analysis	Focused on new information

The exact numbers depend on the model and task, but the principle is stable:

Spend tokens on evidence and decisions, not boilerplate.
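A budget is easier to keep when something checks it. The sketch below uses the upper bounds from the table and a rough estimate of four characters per token. That ratio is an assumption for English text, and a real tokenizer will give different counts.

```python
# Upper bounds taken from the table above; step names are hypothetical keys.
BUDGET_TOKENS = {
    "search_result_packet": 1500,
    "plan": 500,
    "patch_summary": 400,
}


def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token, rounded up."""
    return (len(text) + 3) // 4


def over_budget(step_outputs: dict[str, str]) -> dict[str, tuple[int, int]]:
    """Return {step: (estimated_tokens, limit)} for steps that exceed their limit."""
    report = {}
    for step, text in step_outputs.items():
        limit = BUDGET_TOKENS.get(step)
        used = estimate_tokens(text)
        if limit is not None and used > limit:
            report[step] = (used, limit)
    return report


# Hypothetical step outputs: a bloated plan and a compact patch summary.
outputs = {
    "plan": "x" * 2400,
    "patch_summary": "x" * 400,
}
print(over_budget(outputs))
```

Logging this report per run shows which step is drifting before the bill does.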

The Practical Takeaway

To reduce AI agent token usage, start with retrieval.

Cleaner context reduces:

input tokens
irrelevant reasoning
failed edits
repeated searches
human cleanup

Token efficiency is not about starving the agent.

It is about feeding it better.

Sources

Claude Code docs: MCP output limits and warnings
Ninelayer: Full LLM reference