Prompt caching for cheaper LLM tokens

164
37
samwho
2 days ago
ngrok.com

holbrad
·
6 minutes ago
·
[ - ]

I gave the table of inputs and outputs to both Gemini 3.0 flash and GPT 5.2 instant and they were stumped.

https://t3.chat/share/j2tnfwwful https://t3.chat/share/k1xhgisrw1

WillAdams
·
26 minutes ago
·
[ - ]

When will Microsoft do this sort of thing?

It's a pain having to tell Copilot "Open in pages mode" each time it's launched, and then after processing a batch of files run into:

https://old.reddit.com/r/Copilot/comments/1po2cuf/daily_limi...

Havoc
·
1 hour ago
·
[ - ]

Does anyone know whether the cache is segregated by user/API key for the big providers?

I was looking at modifying outgoing requests via a proxy and wondering whether that's harming caching. Common coding tools presumably have a shared prompt across all their installs, so a universal cache would save a lot.

moebrowne
·
1 hour ago
·
[ - ]

For ChatGPT:

> Prompt caches are not shared between organizations. Only members of the same organization can access caches of identical prompts.

https://platform.openai.com/docs/guides/prompt-caching#frequ...

samwho
·
1 hour ago
·
[ - ]

I was wondering about this when I was reading around the topic. I can’t personally think of a reason you would need to segregate, though it wouldn’t surprise me if they do for some sort of compliance reasons. I’m not sure though, would love to hear something first-party.

samwho
·
55 minutes ago
·
[ - ]

The only thing that comes to mind is some kind of timing attack. Send loads of requests specific to a company you’re trying to spy on and if it comes back cached you know someone has sent that prompt recently. Expensive attack, though, with a large search space.

gunalx
·
34 minutes ago
·
[ - ]

I have come across claims that turning on caching means the LLM has a faint memory of what was in the cache, even for unrelated queries. If this is the case, it's fully unreasonable to share the cache, because of the possibility of information leakage.

samwho
·
29 minutes ago
·
[ - ]

How would information leak, though? There’s no difference in the probability distribution the model outputs when caching vs not caching.
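To make that concrete, here is a minimal sketch of a single attention head in plain Python. The weight matrices and the 2-dimensional "embeddings" are made-up toy values, not anything from a real model. It compares the output for the last token when keys and values are recomputed for the whole prompt against the output when the prefix's keys and values are reused from a cache.

```python
import math

# Toy projection matrices (assumed values, chosen only for illustration).
W_Q = [[0.5, -0.2], [0.1, 0.9]]
W_K = [[0.3, 0.7], [-0.6, 0.4]]
W_V = [[1.0, 0.2], [-0.3, 0.8]]

def project(vec, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [sum(x * row[j] for x, row in zip(vec, matrix)) for j in range(len(matrix[0]))]

def attend(query, keys, values):
    """Scaled dot-product attention for one query over the given keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(len(values[0]))]

def last_token_uncached(embeddings):
    """Recompute keys and values for every token in the prompt."""
    keys = [project(e, W_K) for e in embeddings]
    values = [project(e, W_V) for e in embeddings]
    return attend(project(embeddings[-1], W_Q), keys, values)

def build_cache(prefix):
    """Precompute keys and values for a prefix so they can be reused later."""
    return [project(e, W_K) for e in prefix], [project(e, W_V) for e in prefix]

def last_token_cached(cache, new_embedding):
    """Reuse cached prefix keys/values; only the new token is projected."""
    keys, values = cache
    keys = keys + [project(new_embedding, W_K)]
    values = values + [project(new_embedding, W_V)]
    return attend(project(new_embedding, W_Q), keys, values)

prompt = [[1.0, 0.0], [0.2, 0.7], [-0.5, 0.3], [0.9, -0.4]]
uncached = last_token_uncached(prompt)
cached = last_token_cached(build_cache(prompt[:-1]), prompt[-1])
print(cached == uncached)
```

The cache only stores intermediate results that would otherwise be recomputed from the same inputs, so the two paths feed identical numbers into the attention step.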

willvarfar
·
1 hour ago
·
[ - ]

A really clear explanation!

So if I were running a provider I would be caching popular prefixes for questions across all users. There must be so many questions that start 'what is' or 'who was' etc?

Also, can subsequences in the prompt be cached and reused? Or is it only prefixes? I mean, can you cache popular phrases that might appear in the middle of the prompt and reuse that somehow rather than needing to iterate through them token by token? E.g. must be lots of times that "and then tell me what" appears in the middle of a prompt?

GeneralMayhem
·
1 hour ago
·
[ - ]

Really only prefixes, if you want to avoid a significant loss in accuracy. The point is that because later tokens can't influence earlier ones, the post-attention embeddings for those first tokens can't change. But the post-attention embeddings for "and then tell me what" would be wildly different for every prompt, because the embeddings for those tokens are affected by what came earlier.

My favorite not-super-accurate mental model of what's going on with attention is that the model is sort of compressing the whole preceding context into each token. So the word "tell" would include a representation not just of the concept of telling, but also of what it is that's supposed to be told. That's explicitly what you don't want to cache.
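A toy sketch of that asymmetry, under heavy simplifying assumptions: one attention layer, 2-dimensional made-up vectors, and the embeddings doubling as queries, keys, and values. The names `SYSTEM` and `PHRASE` are hypothetical stand-ins for a shared prefix and a repeated mid-prompt phrase.

```python
import math

def causal_attention(embeddings):
    """Toy causal self-attention: position i attends only to positions 0..i."""
    outputs = []
    for i, query in enumerate(embeddings):
        visible = embeddings[: i + 1]  # later tokens are masked out
        scores = [sum(q * k for q, k in zip(query, key)) for key in visible]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, visible)) / total
            for d in range(len(query))
        ])
    return outputs

SYSTEM = [[1.0, 0.0], [0.0, 1.0]]  # shared prefix tokens
PHRASE = [0.5, 0.5]                # same token vector appearing mid-prompt

prompt_a = SYSTEM + [[2.0, -1.0], PHRASE]
prompt_b = SYSTEM + [[-1.0, 3.0], PHRASE]

out_a = causal_attention(prompt_a)
out_b = causal_attention(prompt_b)

prefix_matches = out_a[:2] == out_b[:2]  # outputs for the shared prefix
phrase_matches = out_a[3] == out_b[3]    # outputs for the repeated phrase
print(prefix_matches, phrase_matches)    # True False
```

The prefix outputs agree because nothing after them is visible to them, while the repeated phrase gets a different output in each prompt because it has absorbed a different preceding token.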

> So if I were running a provider I would be caching popular prefixes for questions across all users

Unless you're injecting user context before the question. You can have a pre-baked cache with the base system prompt, but not beyond that. Imagine that the prompt always starts with "SYSTEM: You are ChatGPT, a helpful assistant. The time is 6:51 ET on December 19, 2025. The user's name is John Smith. USER: Hi, I was wondering..." You can't cache the "Hi, I was wondering" part because it comes after a high-entropy component (timestamp and user name).

samwho
·
1 hour ago
·
[ - ]

With KV caching as it’s described there it has to be a prefix match. OpenAI state in their docs they don’t cache anything below 1024 tokens long, and I’m sure I read somewhere that they only cache in 1024 token blocks (so 1024, 2048, 3072, etc) but I can’t find it now.

There’s been some research into how to cache chunks in the middle, but I don’t think any of the providers are doing it yet because it needs the prompt to be structured in a very specific way.

moebrowne
·
1 hour ago
·
[ - ]

https://platform.openai.com/docs/guides/prompt-caching#requi...

> Caching is available for prompts containing 1024 tokens or more.

No mention of caching being in blocks of 1024 tokens thereafter.
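For anyone who wants to reason about the two readings, here is a small sketch. The 1024-token minimum comes from the quoted docs; the `block` parameter is purely hypothetical and only models the unconfirmed recollection above about block-sized caching. It is not a documented setting.

```python
def cacheable_tokens(matched_prefix_tokens, minimum=1024, block=1):
    """Estimate how many prefix tokens could be served from cache.

    minimum: shortest prompt prefix eligible for caching (from the quoted docs).
    block:   hypothetical granularity above the minimum; 1 means no rounding.
    """
    if matched_prefix_tokens < minimum:
        return 0
    extra = matched_prefix_tokens - minimum
    return minimum + (extra // block) * block

print(cacheable_tokens(1000))              # below the minimum
print(cacheable_tokens(1500))              # no block rounding
print(cacheable_tokens(1500, block=1024))  # rounded down to a whole block
```

Under the block reading, a 1500-token matching prefix would only count as 1024 cached tokens; without blocks, all 1500 would count.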

est
·
6 hours ago
·
[ - ]

This is a surprisingly good read on how LLMs work in general.

samwho
·
2 hours ago
·
[ - ]

It’s funny, I didn’t set out for that to be the case. When I pitched the idea internally, I wanted to scratch my own itch (what on earth is a cached token?) and produce a good post. But then I realised I had to go deeper and deeper to get to my answer and accidentally made a very long explainer.

duggan
·
1 hour ago
·
[ - ]

It was a real facepalm moment when I realised we were busting the cache on every request by including the date and time near the top of the main prompt.

Even just moving it to the bottom helped move a lot of our usage into cache.

Probably went from something like 30-50% cached tokens to 50-70%.
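A rough sketch of why the position matters, assuming prefix-only matching and treating whole words as tokens (real tokenizers split text differently). `INSTRUCTIONS`, `timestamp_first`, and `timestamp_last` are hypothetical names for illustration, not the actual prompt.

```python
def common_prefix_len(a, b):
    """Count how many leading tokens two token lists share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Static part of the prompt: 300 toy tokens that never change between requests.
INSTRUCTIONS = ["You", "are", "a", "helpful", "assistant", "."] * 50

def timestamp_first(timestamp, question):
    """Layout that puts the changing value near the top."""
    return ["Time:", timestamp] + INSTRUCTIONS + question

def timestamp_last(timestamp, question):
    """Layout that puts the changing value after the static instructions."""
    return INSTRUCTIONS + ["Time:", timestamp] + question

question = ["What", "is", "a", "cached", "token", "?"]

early = common_prefix_len(
    timestamp_first("09:00", question), timestamp_first("09:01", question)
)
late = common_prefix_len(
    timestamp_last("09:00", question), timestamp_last("09:01", question)
)
print(early, late)  # 1 301
```

With the timestamp first, two requests a minute apart share a single token of prefix; with it last, they share the entire static section.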

aitchnyu
·
3 hours ago
·
[ - ]

Took me a minute to see it is the same ngrok that provided freemium tunnels to localhost. How did they adapt to the AI revolution?

samwho
·
2 hours ago
·
[ - ]

It is the same ngrok!

The product has grown a lot since the mid 2010s. Still got free localhost tunnelling, but we also have a whole bunch of production-grade API gateway tooling and, as of recently, AI gateway stuff too.

tomhow
·
1 hour ago
·
[ - ]

[under-the-rug stub]

[see https://news.ycombinator.com/item?id=45988611 for explanation]

coderintherye
·
4 hours ago
·
[ - ]

Really well done article.

I'd note, when I gave the input/output screenshot to ChatGPT 5.2 it failed on it (with lots of colorful chain of thought), though Gemini got it right away.

samwho
·
2 hours ago
·
[ - ]

Huh, when I was writing the article it was GPT-5.1 and I remember it got it no problem.

ThePyCoder
·
3 hours ago
·
[ - ]

What an excellent write-up. Thank you!

samwho
·
2 hours ago
·
[ - ]

Thank you so much <3

simedw
·
2 days ago
·
[ - ]

Thanks for sharing; you clearly spent a lot of time making this easy to digest. I especially like the tokens-to-embedding visualisation.

I recently had some trouble converting a HF transformer I trained with PyTorch to Core ML. I just couldn’t get the KV cache to work, which made it unusably slow after 50 tokens…

samwho
·
2 days ago
·
[ - ]

Thank you so much <3

Yes, I recently wrote https://github.com/samwho/llmwalk and had a similar experience with cache vs no cache. It’s so impactful.

mrgaro
·
5 hours ago
·
[ - ]

Hopefully you can write the teased next article about how the feedforward and output layers work. The article was super helpful for me in getting a better understanding of how LLM GPTs work!

samwho
·
2 hours ago
·
[ - ]

Yeah! It’s planned for sure. It won’t be the direct next one, though. I’m taking a detour into another aspect of LLMs first.

I’m really glad you liked it, and seriously the resources I link at the end are fantastic.

wesammikhail
·
3 hours ago
·
[ - ]

Amazing article. I was under the misapprehension that temp and other output parameters actually do affect caching. Turns out I was wrong and this explains why beautifully.

Great work. Learned a lot!

stingraycharles
·
2 hours ago
·
[ - ]

I had a “somebody is wrong on the internet!!” discussion about exactly this a few weeks ago, and they claimed to be a professor in AI.

Where do people get the idea that temperature affects caching in any way? Temperature is about next token prediction / output, not input.
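A minimal sketch of where temperature enters, using the standard temperature-scaled softmax. The logit values are made up; in a real model they come out of the forward pass over the prompt, which is the part the cache speeds up.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into a probability distribution, scaled by temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed toy logits: computed from the input, so the same for every temperature.
logits = [2.0, 1.0, 0.1]

sharp = softmax_with_temperature(logits, 0.5)  # low temperature: peaked
flat = softmax_with_temperature(logits, 2.0)   # high temperature: flatter
print(sharp, flat)
```

Both calls start from the same logits. Temperature only reshapes the distribution that gets sampled from, after all the input processing has already happened.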

semi-extrinsic
·
1 hour ago
·
[ - ]

Being wrong about details like this is exactly what I would expect from a professor. They are mainly grant writers and PhD herders, often they are good at presenting as well, but they mostly only have gut feelings about technical details of stuff invented after they became a professor.

wesammikhail
·
53 minutes ago
·
[ - ]

Because in my mind, as a person not working directly on this kind of stuff, I figured that caching was done similarly to any resource caching in a webserver environment.

It's a semantics issue where the word caching is overloaded depending on context. For people who are not familiar with the inner workings of LLMs, this can cause understandable confusion.

samwho
·
2 hours ago
·
[ - ]

Yay, glad I could help! The sampling process is so interesting on its own that I really want to do a piece on it as well.

wesammikhail
·
53 minutes ago
·
[ - ]

Looking forward to it!

NooneAtAll3
·
2 hours ago
·
[ - ]

The blog starts loading and then displays a "Something Went Wrong. D is not a function" error.

samwho
·
2 hours ago
·
[ - ]

Could you tell me what browser/OS/device you’re using? A few people have said this and I haven’t been able to reproduce it.

belter
·
2 hours ago
·
[ - ]

You should upgrade IE6. It has been out of support for a while...

Youden
·
1 day ago
·
[ - ]

Link seems to be broken: content briefly loads then is replaced with "Something Went Wrong" then "D is not a function". Stays broken with adblock disabled.

samwho
·
2 hours ago
·
[ - ]

Another person had this problem as well and we couldn’t figure out what causes it. We suspect something to do with WebGL support. What browser/device are you using? Does it still break if you disable all extensions? I’d love to fix this.