Do Claude Code Token-Saving Tools Actually Cut Your Bill?

The first time you point rtk at a noisy git status, it feels like a cheat code. Output that used to cost 4,000 tokens comes back as 400, the terminal prints a smug little “90% saved,” and you start to wonder why you ever paid for the long version. Install headroom and caveman next to it over a weekend, watch three counters tick down at once, and the conclusion writes itself: you just cut most of your Claude Code bill.

Then the invoice arrives, and it looks almost exactly like last month’s.

That is not a bug in your setup. An independent replay of 500 real Claude Code sessions put the weekend feeling on a real ledger, and stacked together, rtk, headroom, and caveman trimmed a $926 bill by 3.7 percent. About $34. The advertised 60 to 90 percent was not a lie. It was measured on the one payload each tool was built to squeeze, then read as if it applied to the whole bill. It does not, and the cleanest way to see why is to stop arguing about percentages and read the receipt.

Read the receipt

A Claude Code session is a loop. Each turn re-sends the growing context to the model, the model answers, the answer and the next tool result fold back in, and the loop runs again. That shape is the whole game, because it decides which tokens you buy and at what rate. Anthropic’s own Claude Code cost guidance puts the average developer between $150 and $250 a month, so the urge to claw some of that back is rational. The question is where the money actually sits.

Break the $926 into its streams and the answer is uncomfortable for a compression tool. Cache writes, the cost of putting fresh context into the model’s prompt cache, were 42 percent of the bill. Model output, everything Claude generates back, was another 29 percent. That is 71 cents of every dollar living in two streams that rtk, headroom, and caveman never touch. What is left, the cache reads and fresh input a tool can actually reach, is a $268 slice. Of that slice, all three tools combined pulled out $34.

How a $926 Claude Code bill divides, and how little is reachable
Step	Amount	Left after step
The bill	$926	$926
Cache writes (untouchable, 42% of the bill)	-$389	$537
Model output (untouchable, 29%)	-$269	$268
Reachable: cache reads + input (29% of the bill)	$268	$268
All three tools combined saved about $34	-$34	$234

Where a real $926 Claude Code bill goes, from a 614-million-token replay. Cache writes (42%) and model output (29%) are 71 percent of the spend, and none of the three tools compresses either stream. The reachable pool is everything else, and the tools pulled about $34 out of it, most of which lands at the cheapest cache-read rate. Source: Yongkyun (2026), Cutting LLM Token Costs with rtk, headroom, and caveman (codepointer)

The compressors are fighting over that $268 reachable pool, and even there they capture only about an eighth of it. The reason is where the savings land: the tokens these tools trim are ones that would otherwise be written into the cache once and then re-read on every later turn at the cache_read rate, the cheapest token in the system. The two streams that actually set the invoice, cache writes and model output, never enter the fight.
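The split is easy to check by hand. The sketch below is illustrative only: it takes the rounded dollar figures quoted above as given (the replay's underlying token counts are not reproduced here), and the variable names are our own.

```python
# Dollar figures as quoted in the article (rounded as published).
BILL = 926
CACHE_WRITES = 389   # 42% of the bill
MODEL_OUTPUT = 269   # 29% of the bill
SAVED = 34           # rtk + headroom + caveman combined

untouchable = CACHE_WRITES + MODEL_OUTPUT
reachable = BILL - untouchable

share_untouchable = untouchable / BILL
share_of_reachable_saved = SAVED / reachable
share_of_bill_saved = SAVED / BILL

print(f"untouchable: ${untouchable} ({share_untouchable:.0%} of the bill)")
print(f"reachable:   ${reachable}")
print(f"saved:       ${SAVED} = {share_of_reachable_saved:.1%} of reachable, "
      f"{share_of_bill_saved:.1%} of the bill")
```

The same $34 is roughly an eighth of the reachable pool and under four percent of the invoice, which is the whole gap between the pitch and the bill.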

What the three tools actually compress

The three sit at different points in the agent’s token stream, and that placement is the whole story of what each one can possibly save.

rtk, or “Rust Token Killer,” is a CLI proxy that intercepts shell commands and compacts their output before the bytes reach the model. Its README advertises a “60-90% reduction in LLM token consumption on common dev commands,” with a worked 30-minute session shrinking from roughly 118,000 tokens to 23,900. A single Rust binary applies filtering, grouping, truncation, and deduplication at under ten milliseconds of overhead.

headroom sits one layer deeper, between your agent and the model API, and runs as a library, a local proxy, or an MCP server. It advertises “60-95% fewer tokens” on tool outputs, logs, JSON, and code, with reversible compression so the model can pull the original back when it needs it. Its published benchmarks include a code search dropping 92 percent, from 17,765 tokens to 1,408.

caveman works the opposite end. It is a Claude Code skill that makes the model talk less, instructing Claude to drop articles, filler, and hedging while keeping code and error strings exact, and it claims a 65 to 75 percent cut on output prose.

So rtk and headroom compress what goes into the model, the tool results. caveman compresses what comes out, the assistant’s prose. Hold that distinction, because it decides which tool can reach an expensive token and which is stuck shaving the cheap one.

The demo numbers are real. That is the trap.

Here is the part that makes the pitch so persuasive: every headline percentage reproduces. The replay, published as Cutting LLM Token Costs with rtk, headroom, and caveman by the developer Yongkyun on the codepointer Substack in June 2026, ran headroom’s real compressor against recorded tool results and confirmed it. On grep and diff dumps, headroom cut a median of 54 percent, and the high-redundancy cases hit the advertised 60 to 95 percent range. rtk’s per-command rates and caveman’s prose reduction held up too, each measured on exactly the payload it was tuned for.

The advertised number is a true measurement of a real thing. It is the ratio of saved tokens to one payload: a single grep result, a single JSON array, a single verbose reply. What it is not is the ratio of saved tokens to a bill. Those are different fractions, and the gap between them is where the savings quietly evaporate.

What the demo leaves off the bill

Two effects shrink the headline before pricing ever gets the final word.

The first is the denominator. A 92 percent cut on a 17,000-token payload is a real 16,000 tokens saved, and on that payload it is exactly 92 percent. But a bill is not one payload. It is hundreds of turns, most of which the tool does nothing to. Divide the same 16,000 saved tokens across the whole session and the 92 percent becomes a low single digit. Same numerator, wildly different denominator.
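To make the denominator effect concrete, here is a rough sketch. The payload figures are headroom's published code-search benchmark quoted earlier. The session size is an assumption of ours: the replay's 614 million tokens divided evenly across its 500 sessions, which treats every session as average.

```python
def reduction(saved_tokens: int, total_tokens: int) -> float:
    """Fraction of total_tokens that was removed."""
    return saved_tokens / total_tokens

# headroom's published code-search benchmark, as quoted in the article.
payload_before = 17_765
payload_after = 1_408
saved = payload_before - payload_after

# ASSUMPTION: an "average" session, 614M replayed tokens spread over 500 sessions.
session_tokens = 614_000_000 // 500

per_payload = reduction(saved, payload_before)
per_session = reduction(saved, session_tokens)

print(f"tokens saved:     {saved:,}")
print(f"share of payload: {per_payload:.1%}")
print(f"share of session: {per_session:.1%}")
```

The numerator never changes between the two lines. Only the thing it is divided by does.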

The second is that real workloads do not cooperate. The aggressive tricks only fire on redundant, structured dumps, and most of what an agent reads is plain source and prose with little to squeeze. On the replay’s real traffic, headroom activated on just 45 percent of payloads, and when it did, the median reduction was 25 percent, not 60 to 95. The strategies that produce the headline numbers fired on 46 of 2,781 activations. Put the advertised range next to what each tool actually returned on the bill and the collapse is the entire argument in one picture.

Advertised cut per payload versus measured cut on a real bill
Item	Advertised per payload	Measured on a $926 bill
headroom	77.5%	2.8%
rtk	75%	0.5%
caveman	70%	0.4%

Advertised figures are the midpoint of each tool's published per-payload range. Measured figures are each tool's share of a real $926 bill across 500 replayed sessions. headroom was run directly against recorded tool results; rtk and caveman are generous estimates from their own published rates, so two of these three are ceilings, not floors. Source: Yongkyun (2026), Cutting LLM Token Costs with rtk, headroom, and caveman (codepointer)

These tools compress the cheapest token in the bill. The expensive tokens, cache writes and output, are the ones that set the invoice, and nothing here touches them.
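The table's two columns can be rebuilt from numbers already quoted in this article. This sketch only recomputes the midpoints of the advertised ranges and sums the measured shares. The "gap" ratio is our own illustrative framing, not a figure from the replay.

```python
# Advertised per-payload ranges in percent, as quoted from each project.
advertised = {
    "headroom": (60, 95),
    "rtk": (60, 90),
    "caveman": (65, 75),
}
# Measured share of the $926 bill in percent, from the replay.
measured = {"headroom": 2.8, "rtk": 0.5, "caveman": 0.4}

midpoints = {tool: (lo + hi) / 2 for tool, (lo, hi) in advertised.items()}
total_measured = round(sum(measured.values()), 1)

for tool, mid in midpoints.items():
    gap = mid / measured[tool]  # how many times larger the pitch is than the result
    print(f"{tool:9s} advertised {mid:5.1f}%  measured {measured[tool]:.1f}%  gap {gap:.0f}x")
print(f"combined measured cut: {total_measured}%")
```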

The 78 percent rtk never sees

rtk has a second problem that has nothing to do with pricing: it never sees most of the traffic. It hooks the Bash tool, but Claude Code reads files and searches code through its own native Read, Grep, and Glob tools, and those bypass any shell hook entirely. In the replay, 78 percent of tool-output tokens flowed through the native tools, almost all of them through Read. rtk reached only the remaining 22 percent. A shell-output compressor inside an agent that mostly does not use the shell to read is optimizing a path the agent rarely takes, and no amount of per-command efficiency fixes a coverage hole that large.
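Coverage puts a hard ceiling on any compressor, and the ceiling is simple multiplication. The sketch below generously assumes rtk hits its full advertised range on every token it can see. The function name is ours, and the result is a share of tool-output tokens, not of the bill.

```python
def coverage_ceiling(coverage: float, reduction_rate: float) -> float:
    """Upper bound on the fraction of all tool-output tokens a tool can remove
    when it sees only `coverage` of them and removes `reduction_rate` of those."""
    if not (0.0 <= coverage <= 1.0 and 0.0 <= reduction_rate <= 1.0):
        raise ValueError("coverage and reduction_rate must be fractions in [0, 1]")
    return coverage * reduction_rate

bash_share = 0.22                             # tool-output tokens rtk could reach (replay)
advertised_low, advertised_high = 0.60, 0.90  # rtk's advertised per-command range

low = coverage_ceiling(bash_share, advertised_low)
high = coverage_ceiling(bash_share, advertised_high)
print(f"rtk ceiling on tool-output tokens: {low:.1%} to {high:.1%}")
```

Even at its best advertised rate, a tool that sees 22 percent of the traffic cannot remove more than about a fifth of it.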

An independent benchmark lands in the same place

This is not one developer’s idiosyncratic corpus. caveman in particular has been benchmarked on its own, and the pattern repeats. In an independent test by Kuba Guzik on DEV.to (2026), the prompt that advertises 75 percent delivered 13 to 14 percent on Claude Sonnet and 9 to 21 percent on Claude Opus on real coding tasks.

Guzik’s diagnosis is the useful part. The advertised 75 percent came from a baseline with no efficiency instructions at all. Measured against a baseline that already said “be concise, return JSON,” most of the savings were already in the bag, and his 85-token “micro” prompt beat the full 552-token skill outright. The lesson generalizes well past caveman: the real savings come from how you scope the request, not from a wrapper that compresses the wrong stream after the fact.

What saving 3.7 percent actually costs you

There is a side of this ledger that has nothing to do with tokens. Each tool earns its cut by sitting somewhere it can read your code, your prompts, or your output, and the trust it asks for scales with how much it sees.

rtk is the narrowest, local and scoped to shell output, but its hook executes commands on your machine, so a compromised release means arbitrary command execution. caveman runs mostly local hooks with one crossing to the API, which means a malicious version runs on every prompt. headroom takes the maximal-trust position: as a proxy in the API path, it sees full prompts, full responses, and the Authorization header carrying your API key, so a bad release there is a key disclosure. None of these projects has done anything wrong. The point is the shape of the bet.

Where a Claude Code bill is actually won

If the goal is a smaller invoice, the levers that move it are the ones that act on the expensive 71 percent, and all of them are free.

Start with the model. Claude Code’s cost documentation is blunt about it: “Sonnet handles most coding tasks well and costs less than Opus.” On Anthropic’s published rates, Opus output runs $25 per million tokens against Sonnet’s $15, and against $5 per million for Opus input, so model choice is a multiple, not a trim. While you are there, mind extended thinking, because those tokens bill as output, the single priciest line. Lowering the effort level on tasks that do not need deep reasoning cuts that stream directly, and no tool-output compressor can.
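A quick sketch of what the model lever is worth on the output line alone. The per-million rates are the ones quoted above. The monthly output volume is a hypothetical round number chosen for illustration, not a figure from the replay.

```python
# Output rates in dollars per million tokens, as quoted in the article.
OUTPUT_RATE_PER_MTOK = {"opus": 25.00, "sonnet": 15.00}

def output_cost(model: str, output_tokens: int) -> float:
    """Dollar cost of generating output_tokens on the given model."""
    return OUTPUT_RATE_PER_MTOK[model] * output_tokens / 1_000_000

# HYPOTHETICAL monthly output volume, thinking tokens included.
monthly_output_tokens = 10_000_000

opus = output_cost("opus", monthly_output_tokens)
sonnet = output_cost("sonnet", monthly_output_tokens)
saving = opus - sonnet

print(f"Opus:   ${opus:,.2f}")
print(f"Sonnet: ${sonnet:,.2f}")
print(f"Switching saves ${saving:,.2f} ({saving / opus:.0%} of the output line)")
```

Whatever volume you plug in, the proportional saving on output stays the same, because it depends only on the ratio of the two rates.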

Then manage context the way the docs tell you to. Use /clear between unrelated tasks so you are not paying to re-cache a context the next task does not need, and run /context to see what is actually eating the window before you reach for a third party. Scope requests tightly, because a vague “improve this codebase” triggers a broad scan while “add input validation to the login function in auth.ts” does not, and every file the agent pulls in becomes cache-write tokens this turn and cache-read tokens every turn after. Stop a run the moment it starts spinning, since an agent looping without progress bills cache writes and output the whole time.
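Here is a rough sketch of what one unnecessary file costs over the life of a session. Every constant is an assumption to check against the pricing and prompt-caching pages listed in Sources: a $5 per million base input rate, a 1.25x cache-write multiplier, and a 0.1x cache-read multiplier. It also assumes the cache stays warm, so the file is written once and read on every later turn.

```python
BASE_INPUT_PER_MTOK = 5.00   # ASSUMPTION: base input rate, dollars per million tokens
CACHE_WRITE_MULT = 1.25      # ASSUMPTION: cache-write multiplier on the base rate
CACHE_READ_MULT = 0.10       # ASSUMPTION: cache-read multiplier on the base rate

def file_cost(file_tokens: int, later_turns: int) -> tuple[float, float]:
    """Return (write_cost, read_cost) in dollars for one file written to the
    cache once and re-read on each of later_turns subsequent turns."""
    per_token = BASE_INPUT_PER_MTOK / 1_000_000
    write_cost = file_tokens * per_token * CACHE_WRITE_MULT
    read_cost = file_tokens * per_token * CACHE_READ_MULT * later_turns
    return write_cost, read_cost

# HYPOTHETICAL: a 20,000-token file pulled in 40 turns before the session ends.
write_cost, read_cost = file_cost(20_000, 40)
print(f"cache write: ${write_cost:.3f}")
print(f"cache reads: ${read_cost:.3f} over 40 turns")
```

A file the agent never pulls in costs neither line, which is why tight scoping and /clear act on the bill in a way that after-the-fact compression does not.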

There is one honest concession to the compression idea. Anthropic’s own guidance recommends a preprocessing hook that, for example, greps a test runner’s output down “from tens of thousands of tokens to hundreds” before Claude ever sees it. That is the legitimate kernel rtk and headroom are built on, and for genuinely verbose operations it is worth doing. The difference is that a targeted local hook costs you nothing, ships nothing to a third party, and never sees your API key, while still bumping into the same arithmetic: it acts on the reachable slice, so its effect on the bill is bounded by the same $268. Our breakdown of what Claude Code costs per task lays out where the money goes; the expensive lines are decided by the model you run and how tightly you scope the work, which we cover in the 2026 AI coding agent landscape and the AI coding plan pricing comparison.
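For a sense of how small such a filter can be, here is a minimal sketch. It is not Anthropic's hook and not how rtk or headroom work internally. The keep-patterns, the line cap, and the sample output are all assumptions to tune for your own test runner, and wiring it into a Claude Code hook is left out.

```python
import re
from typing import Iterable

# ASSUMPTION: patterns that mark lines worth keeping; adjust for your test runner.
KEEP = re.compile(r"FAIL|ERROR|Traceback|AssertionError", re.IGNORECASE)

def filter_test_output(lines: Iterable[str], max_lines: int = 100) -> list[str]:
    """Keep only lines matching KEEP (at most max_lines), then add a summary line."""
    total = 0
    kept: list[str] = []
    for line in lines:
        total += 1
        if KEEP.search(line) and len(kept) < max_lines:
            kept.append(line.rstrip("\n"))
    kept.append(f"[filtered: kept {len(kept)} of {total} lines]")
    return kept

# Hypothetical test-runner output, for illustration only.
sample = [
    "test_login.py::test_ok PASSED",
    "test_login.py::test_empty_password FAILED",
    "AssertionError: expected 400, got 200",
    "test_signup.py::test_ok PASSED",
]
result = filter_test_output(sample)
for line in result:
    print(line)
```

In practice you would feed the function a test command's real output rather than a sample list. The summary line matters: it tells the model that output was dropped, so it can ask for the full log when the filtered view is not enough.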

None of that is a download. It is operating discipline, and it acts on the part of the bill the compression tools cannot reach.

Frequently asked questions

Do tools like rtk and headroom actually reduce Claude Code costs?

Barely, on a real bill. In an independent replay of 500 sessions, rtk, headroom, and caveman combined cut a $926 bill by 3.7 percent, about $34, despite advertising 60 to 95 percent per payload. The headline percentages are real but measured on a single payload, while a bill spreads the same savings across hundreds of turns and is dominated by streams the tools never compress.

Why do the advertised 60-90% savings not show up on my bill?

Three reasons stack. The advertised percentage divides savings by one payload; your bill divides by the whole session. The high-compression tricks only fire on redundant, structured output, which is a minority of real traffic. And prompt caching means most of your tokens are already billed at the cheap cache_read rate, while the expensive streams, cache writes and output, are the ones these tools do not touch.

What is the cheapest way to lower Claude Code token costs?

Use the smallest capable model, since output is 5x the input price and model choice is the biggest lever. Lower extended thinking on simple tasks, because thinking bills as output. Then scope requests so the agent reads less, clear context between unrelated tasks, and stop runs that loop without progress. These act on the expensive part of the bill, which third-party compressors do not.

Are token-compression tools a security risk?

They can be. Each sits in the path of your code, prompts, or output. rtk’s shell hook can execute commands, caveman runs on every prompt, and headroom as an API proxy can see your Authorization header and API key. A compromised or resold release becomes a supply-chain exposure, which is a steep price for a few percent of savings.

The bottom line

The pitch behind rtk, headroom, and caveman is not a lie. It is an advertised ratio measured against the wrong denominator. Each does exactly what it claims to the payload it was built for, and that payload is a small, cheap slice of an agent’s bill. The expensive tokens, cache writes and model output, are set by which model you run and how tightly you scope the work, and no compressor in the tool-output path can reach them. Before installing a dependency that reads your prompts and credentials to save a few dollars a month, the honest question is whether the model choice and the scope of your requests are right first. Every figure here is sourced and dated, in line with our editorial standards.

Sources

Yongkyun (2026). Cutting LLM Token Costs with rtk, headroom, and caveman. codepointer (Substack). Verified 2026-06-19. codepointer.substack.com
rtk-ai (2026). rtk: CLI proxy that reduces LLM token consumption by 60-90% on common dev commands (repository README). Verified 2026-06-19. github.com/rtk-ai/rtk
Chopra, T. (2026). headroom: Compress tool outputs, logs, files, and RAG chunks before they reach the LLM (repository and benchmarks). github.com/chopratejas/headroom
Brussee, J. (2026). caveman: Claude Code skill that cuts tokens by talking like caveman (repository README). github.com/JuliusBrussee/caveman
Guzik, K. (2026). I Benchmarked the Viral “Caveman” Prompt to Save LLM Tokens. Then My 6-Line Version Beat It. DEV Community. dev.to
Anthropic (2026). Manage costs effectively (Claude Code documentation: model selection, context management, extended thinking, preprocessing hooks). Verified 2026-06-19. code.claude.com/docs
Anthropic (2026). Pricing (per-token API rates). Verified 2026-06-19. claude.com/pricing
Anthropic (2026). Prompt caching (cache read and cache write multipliers). Verified 2026-06-19. platform.claude.com