https://t3.chat/share/j2tnfwwful https://t3.chat/share/k1xhgisrw1
It's a pain having to tell Copilot "Open in pages mode" each time it's launched, and then, after processing a batch of files, running into:
https://old.reddit.com/r/Copilot/comments/1po2cuf/daily_limi...
Was looking at modifying outgoing requests via a proxy and wondering whether that's harming caching. Common coding tools presumably have a shared prompt across all their installs, so a universal cache would save a lot.
> Prompt caches are not shared between organizations. Only members of the same organization can access caches of identical prompts.
https://platform.openai.com/docs/guides/prompt-caching#frequ...
So if I were running a provider, I would be caching popular prefixes for questions across all users. There must be so many questions that start with 'what is' or 'who was', etc.
Also, can subsequences in the prompt be cached and reused, or is it only prefixes? I mean, can you cache popular phrases that might appear in the middle of the prompt and reuse them somehow, rather than needing to iterate through them token by token? E.g. there must be lots of times that "and then tell me what" appears in the middle of a prompt.
My favorite not-super-accurate mental model of what's going on with attention is that the model is sort of compressing the whole preceding context into each token. So the word "tell" would include a representation not just of the concept of telling, but also of what it is that's supposed to be told. That's exactly the thing you can't meaningfully cache and reuse in a different prompt.
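To make that concrete, here's a toy single-head causal attention sketch in plain numpy (random weights and made-up dimensions, nothing like a real model): change one token in the middle and the outputs for every later position change too, even though those tokens themselves didn't. Since deeper layers build their keys and values from these outputs, everything downstream of the edit is invalidated, which is why only an unchanged prefix can be reused.

```python
# Toy single-head causal attention: illustrates why KV/prompt caches are
# prefix-only. Weights and sizes are random placeholders, not a real model.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                      # embedding size
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attend(x):
    """x: (seq, d) token embeddings -> (seq, d) attention outputs."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores[mask] = -np.inf                 # causal mask: token i sees tokens 0..i
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

prompt = rng.normal(size=(6, d))           # six "tokens"
out_a = attend(prompt)

edited = prompt.copy()
edited[1] += 1.0                           # change only token 1
out_b = attend(edited)

# Token 0 precedes the edit, so its output (and cache entry) is still valid;
# tokens 2..5 attend over the edited token, so their outputs all change.
print(np.allclose(out_a[0], out_b[0]))     # True
print(np.allclose(out_a[2:], out_b[2:]))   # False
```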
> So if I were running a provider I would be caching popular prefixes for questions across all users
Unless you're injecting user context before the question. You can have a pre-baked cache with the base system prompt, but not beyond that. Imagine that the prompt always starts with "SYSTEM: You are ChatGPT, a helpful assistant. The time is 6:51 ET on December 19, 2025. The user's name is John Smith. USER: Hi, I was wondering..." You can't cache the "Hi, I was wondering" part because it comes after a high-entropy component (timestamp and user name).
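Here's a rough sketch of that prefix-matching consequence, with a made-up prompt template and word-level splitting standing in for real tokenization: with the timestamp and user name near the top, the shared prefix ends there for every user; pushed to the end, almost the whole prompt stays shareable.

```python
# Toy prefix-only cache matching. Real providers match on token blocks;
# the prompt template and word splitting here are just illustrative.
def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

SYSTEM = "SYSTEM: You are a helpful assistant."

def time_first(user, time, question):
    return f"{SYSTEM} The time is {time}. The user's name is {user}. USER: {question}".split()

def time_last(user, time, question):
    return f"{SYSTEM} USER: {question} (context: user={user}, time={time})".split()

q = "Hi, I was wondering how prompt caching works."

a1 = time_first("John Smith", "6:51 ET", q)
a2 = time_first("Jane Doe", "6:52 ET", q)
print(common_prefix_len(a1, a2), "of", len(a1))   # match stops at the timestamp

b1 = time_last("John Smith", "6:51 ET", q)
b2 = time_last("Jane Doe", "6:52 ET", q)
print(common_prefix_len(b1, b2), "of", len(b1))   # everything before the per-user tail matches
```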
There’s been some research into how to cache chunks in the middle, but I don’t think any of the providers are doing it yet because it needs the prompt to be structured in a very specific way.
> Caching is available for prompts containing 1024 tokens or more.
No mention of caching being in blocks of 1024 tokens thereafter.
Even just moving the dynamic content to the bottom of the prompt helped move a lot of our usage into cache.
We probably went from something like 30-50% cached tokens to 50-70%.
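If you want to measure that split yourself, the usage block in the API response reports how many prompt tokens were served from cache (prompt_tokens_details.cached_tokens, per the OpenAI prompt-caching docs linked above). A minimal sketch, with the model name just a placeholder:

```python
# Sketch: compute the cached-token fraction from the API's usage report.
# Assumes the usage fields described in OpenAI's prompt-caching docs.
from openai import OpenAI

client = OpenAI()

def cached_fraction(messages, model="gpt-4o-mini"):
    resp = client.chat.completions.create(model=model, messages=messages)
    usage = resp.usage
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) or 0
    return cached / usage.prompt_tokens, usage.prompt_tokens

# Send the same long prompt twice; once the shared prefix passes the
# 1024-token minimum, the second call should report a much higher fraction.
messages = [
    {"role": "system", "content": "You are a helpful assistant. " * 200},
    {"role": "user", "content": "Summarize prompt caching in one sentence."},
]
for _ in range(2):
    frac, total = cached_fraction(messages)
    print(f"{frac:.0%} of {total} prompt tokens came from cache")
```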
The product has grown a lot since the mid 2010s. Still got free localhost tunnelling, but we also have a whole bunch of production-grade API gateway tooling and, as of recently, AI gateway stuff too.
[see https://news.ycombinator.com/item?id=45988611 for explanation]
I'd note that when I gave the input/output screenshot to ChatGPT 5.2, it failed (with lots of colorful chain of thought), though Gemini got it right away.
I recently had some trouble converting an HF transformer I trained with PyTorch to Core ML. I just couldn’t get the KV cache to work, which made it unusably slow after 50 tokens…
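For comparison, here's a rough PyTorch/transformers sketch of what the KV cache buys you in a plain generation loop (GPT-2 only as a small stand-in; the Core ML export itself is a separate problem). Without the cache, every step re-runs attention over the whole sequence, which is exactly the slowdown that shows up after a few dozen tokens.

```python
# With vs. without KV cache in a greedy decoding loop (gpt2 as a stand-in).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def generate(prompt, n_tokens=50, use_cache=True):
    ids = tok(prompt, return_tensors="pt").input_ids
    past = None
    with torch.no_grad():
        for _ in range(n_tokens):
            if use_cache and past is not None:
                # Cached: feed only the newest token plus the stored K/V.
                out = model(ids[:, -1:], past_key_values=past, use_cache=True)
            else:
                # Uncached: re-run attention over the full sequence each step.
                out = model(ids, use_cache=use_cache)
            past = out.past_key_values if use_cache else None
            next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
            ids = torch.cat([ids, next_id], dim=-1)
    return tok.decode(ids[0])

for cached in (True, False):
    t0 = time.time()
    generate("The KV cache matters because", use_cache=cached)
    print(f"use_cache={cached}: {time.time() - t0:.2f}s")
```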
Yes, I recently wrote https://github.com/samwho/llmwalk and had a similar experience with cache vs no cache. It’s so impactful.
I’m really glad you liked it, and seriously the resources I link at the end are fantastic.
Great work. Learned a lot!
Where do people get the idea that temperature affects caching in any way? Temperature applies to next-token prediction (the output), not to the input.
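Temperature only rescales the final logits when sampling the next token, after the forward pass over the prompt, and that forward pass is the part a prompt/KV cache covers. Toy illustration:

```python
# Temperature rescales output logits at sampling time; the (cacheable)
# forward pass over the prompt is identical regardless of its value.
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=5)        # stand-in for the model's next-token logits

def sample_probs(logits, temperature):
    z = logits / temperature       # higher temperature flattens the distribution
    z -= z.max()                   # numerical stability
    p = np.exp(z)
    return p / p.sum()

for t in (0.2, 1.0, 2.0):
    print(t, np.round(sample_probs(logits, t), 3))
```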
It's a semantics issue where the word "caching" is overloaded depending on context. For people who are not familiar with the inner workings of LLMs, this can cause understandable confusion.