Most teams try to reduce token usage at the wrong end of the pipeline.
They shorten the final answer.
That helps a little.
But AI agents usually waste tokens before they write the answer: noisy retrieval, oversized tool outputs, repeated failed attempts, and stale context that creates retry loops.
If you want meaningful savings, optimize the agent loop.
1. Return Cleaner Retrieval Context
Raw pages are expensive.
They include navigation, cookie banners, footer links, scripts, related content, and repeated layout text.
Agents do not need a webpage.
They need evidence.
Replace raw fetches with retrieval that returns:
- relevant passages
- source URLs
- short summaries
- authority signals
- enough context to act
This is the core reason Ninelayer exists: compact search results for agents instead of human-browser pages.
2. Cap Tool Output
Every tool should have an output budget.
For example:
{
"max_results": 5,
"max_chars_per_result": 1200,
"include_full_text": false
}
The agent can always ask for more.
The default should be small enough to think over.
Large outputs are not just expensive. They also distract the model from the useful signal.
3. Search Before Reading Full URLs
Do not fetch ten full pages by default.
Use a two-step pattern:
This mirrors how a good engineer works: scan, choose, then read.
For agents, it prevents full-page fetches from becoming the default cost center.
4. Compress State Between Steps
Long-running agents accumulate context.
After each major step, write a compact state summary:
Current facts:
- Next.js route handlers in this repo live under src/app/api.
- The failing test is auth-refresh.spec.ts.
- The current SDK requires refreshToken(), not refresh().
- We changed auth/client.ts and still need to update auth/server.ts.
Open risks:
- Need to verify middleware behavior.
Then carry the summary forward instead of the entire transcript.
5. Prevent Retry Loops
Retry loops are token fires.
An agent fails, reads the same stale context, tries a similar patch, fails again, and repeats.
Add circuit breakers:
- after two failed attempts, stop and inspect
- after a compiler error, search exact error text
- after an API mismatch, write a small probe
- after repeated test failures, summarize what changed
The goal is to change the agent's strategy, not just repeat generation.
6. Prefer Structured Tool Results
Structured output is easier to summarize and filter.
Instead of:
Huge blob of HTML
Return:
{
"title": "Route Handlers",
"url": "https://nextjs.org/docs/...",
"evidence": "The relevant passage...",
"source_type": "official_docs"
}
This lets the agent cite, deduplicate, and reason over sources with fewer tokens.
7. Make Prompts Shorter by Moving Policy Into Tools
If every prompt says:
Prefer official docs, avoid stale pages, return compact evidence, include URLs...
you are paying for the same instruction again and again.
Better retrieval tools can encode those defaults.
The prompt becomes:
Use Ninelayer to find current official docs before editing.
That is cheaper and harder to forget.
A Practical Token Budget
For coding agents, a healthy budget often looks like:
| Step | Target |
|---|---|
| Search result packet | 800-1,500 tokens |
| Full URL extraction | Only for selected sources |
| Plan | Under 500 tokens |
| Patch summary | Under 400 tokens |
| Failure analysis | Focused on new information |
The exact numbers depend on the model and task, but the principle is stable:
Spend tokens on evidence and decisions, not boilerplate.
The Practical Takeaway
To reduce AI agent token usage, start with retrieval.
Cleaner context reduces:
- input tokens
- irrelevant reasoning
- failed edits
- repeated searches
- human cleanup
Token efficiency is not about starving the agent.
It is about feeding it better.
Sources
- Claude Code docs: MCP output limits and warnings
- Ninelayer: Full LLM reference
