For real-world speeds though yeah, you'd need serious hardware. This is more of a "deploy your own stamp" model, less a "local" model.
I may not be able to reasonably run it myself, but at least I can choose who I trust to run it and can have inference pricing determined by a competitive market. According to their benchmarks, the model is roughly in the same class as Claude 4 Sonnet, yet it already costs less than a third of Sonnet's inference pricing.
So running it locally is the exact opposite of what I’m looking for.
Rather, I’m willing to pay more to have it run on faster-than-normal cloud inference hardware.
Anthropic is already too slow.
Since this model is open source, maybe someone could offer it at a “premium” pay-per-use price, where inference is run a lot faster, with more resources thrown at it.
There's your issue. Use Claude Code or the API directly and compare the speeds. Cursor is slowing down requests to maintain costs.
Good on you for not exaggerating.
I am very curious what exactly they see in that. 2-3 people hopped in to handwave that you just have it do agent stuff overnight and it's well worth it. I can't even begin to imagine unless you have a metric **-ton of easily solved problems that aren't coding. Even a 90% success rate gets you into "useless" territory quickly when one step depends on the other and you're running it autonomously for hours.
1. Some are more creative than others, with slightly different injected prompts, or perhaps even different models entirely.
Yeah, that. Why can't we just `find ./tasks/ | grep '\.md$' | xargs llm`? Can't we just write up a government-proposal-style document, have the LLM recurse down into sub-sub-projects and back up until the original proposal document can be translated into a completion report? Constantly correcting a humongous LLM with infinite context length that can keep everything in its head doesn't feel like the right approach.
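A minimal sketch of what that recursive descent could look like (the `run_llm()` helper and the `proposal.md` layout are hypothetical, just to show the shape of the idea):

```python
from pathlib import Path

def run_llm(prompt: str) -> str:
    """Hypothetical wrapper around whatever LLM CLI or API you use."""
    raise NotImplementedError

def complete(project: Path) -> str:
    # Recurse into sub-projects first, bottom-up.
    sub_reports = [complete(d) for d in sorted(project.iterdir()) if d.is_dir()]
    proposal = (project / "proposal.md").read_text()
    # Each level only sees its own proposal plus its children's completion
    # reports, so no single call has to keep the whole tree in context.
    return run_llm(
        f"Proposal:\n{proposal}\n\nSub-project reports:\n" + "\n\n".join(sub_reports)
    )

print(complete(Path("./tasks")))
```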
Maybe with bigger models it'll work well.
Now, with agentic coding, thinking models, and “chat with my PDF” or whatever artifacts are being called now, no, I don’t think 5 tokens/s is enough.
How many tokens/second would this likely achieve?
They claim 14 tps for the 4-bit quant on a single socket system with 600 GB RAM and 14 GB GPU memory.
All the models I tested, which includes Sonnet 4, DeepSeek R1, 4o, and Gemini 2.5, understand this isn't your normal email and that what I ask for is literary/philosophical criticism, not remarks about conventions, formatting, or how to convey my message in a more impactful way.
Quick fix:
- Replace half the commas with periods.
- Let one metaphor bloom and then stop.
- Read it aloud; wherever you inhale, insert a line break.
Before:
>And if the collective imagination is so seized by it, it's because we are already—not in some other world—but in this very life, in a state of socio-, photo-, and videosynthesis.
After:
>The collective imagination is seized.
>We are not in another world.
>We are in this life, in a state of videosynthesis.
Yes, this quote is by Baudrillard. None of the other models fixated on the fact that it’s an email (I only used the word once in the prompt). My gut feeling is that this reflects not so much a lack of intelligence as a difference in model personality. Here's what it replied when I shared Gemini's analysis. The point was to have it understand Gemini's second sentence: "It is less a simple email than an intellectual and personal exploration still in the making."
The point was deflected as soon as it was understood: **Gemini is polite and pedagogical; I’m going to be blunt.**
1. **This isn’t an email, it’s a trip.**
...
2. **The content is good, the format is not.**
...
3. **Gemini is right about one thing:**
...
* cut 80% of the material,
* ask **one clear question**,
* put the rest in a separate Google Doc or PDF for posterity.
4. **Brutal summary:**
Keep your draft for yourself — it’s valuable. But for the email, write a new one in **three sentences**:
* thank them,
* recall the context in one line,
* ask the question.
You’ve already got Gemini’s answer to stroke your ego — now give me the short version you’re actually going to send.
The solution to sycophancy is not disparagement (misplaced criticism). The classical true/false positive/negative dilemma is at play here. I guess the bot got caught in the crossfire of 1°) its no-bullshit attitude (it can only be an attitude), 2°) a preference for delivering blunt criticism over insincere flattery, and 3°) being a helpful assistant. Remove point 3°) and it could have replied: "I'm not engaging in this nonsense." Preserve it and it will politely suggest that you condense your bullshit text, because shorter explanations are better than long-winded rants (it's probably in the prompt).
> each time the resulting executable program or a program dependent thereon is launched, a prominent display (e.g., splash screen or banner text) of the Author’s attribution information
> The Kardashev scale (Russian: шкала Кардашёва, romanized: shkala Kardashyova) is a method of measuring a civilization's level of technological advancement based on the amount of energy it is capable of harnessing and using.
> Under this scale, the sum of human civilization does not reach Type I status, though it continues to approach it.
My guess is the cost of the mini-splits. I'm pretty certain that if you had them and turned them all on, you could still draw that much power from the grid.
And probably you are underestimating the cost of nuclear anyway.
Does this actually mean "they," not "we"?
Except that instead of this, we're spinning up old coal plants, because apparently nuclear bad.
I think it hasn’t received much attention because the frontier shifted to reasoning and multi-modal AI models. In accuracy benchmarks, all the top models are reasoning ones:
https://artificialanalysis.ai/
If someone took Kimi k2 and trained a reasoning model with it, I’d be curious how that model performs.
I imagine that's what they are doing at Moonshot AI right now.
Interestingly enough, EQ-Bench/Creative Writing Bench doesn't spot this despite clearly having it in their samples. This makes me trust it even less.
I found that while looking for reports of the best agents to use with K2. The usual suspects like Cline and forks, Aider, and Zed should be interesting to test with K2 as well.
R1 (and K2) is MoE, whereas Llama 3 is a dense model family. MoE actually makes these models practical to run on cheaper hardware. DeepSeek R1 is more comfortable for me than Llama 3 70B for exactly that reason: if a model spills out of the GPU, you take a large performance hit.
If you need to spill into CPU inference, you really want to be multiplying a different ~32B subset of the weights for every token rather than the same 70B (or more) every time, simply because the computation takes so long.
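A rough illustration of why the active parameter count matters so much once you spill to CPU: inference there is memory-bandwidth bound, so throughput is roughly bandwidth divided by the bytes of weights touched per token. The bandwidth and quantization numbers below are assumptions for the sketch, not measurements:

```python
# Back-of-the-envelope: CPU inference is memory-bandwidth bound, so
# tokens/s ~= usable RAM bandwidth / bytes of *active* weights read per token.
def tokens_per_second(active_params_b: float, bits_per_weight: float, bandwidth_gb_s: float) -> float:
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Assumed: 4-bit quantization, ~100 GB/s of usable system RAM bandwidth.
print(tokens_per_second(70, 4, 100))  # dense 70B: ~2.9 tok/s
print(tokens_per_second(32, 4, 100))  # ~32B-active MoE: ~6.3 tok/s
```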
IMHO it sets the local LLM community back when we lean on extreme quantization & streaming weights from disk to say something is possible*, because when people try it out, it turns out it's an awful experience.
* the implication being, anything is possible in that scenario
I will also point out that having three API-based providers deploying an impractically large open-weights model beats the pants off having just one. Back in the day, this was called second-sourcing, IIRC. With proprietary models, you're at the mercy of one corporation and their Kafkaesque ToS enforcement.
That seems separate from the post it was replying to, about 1T param models.
If it is intended to be a reply, it hand-waves about how having a bad experience with it will teach them to buy more expensive hardware.
Is that "Good."?
The post points out that if people are taught they need an expensive computer to get 1 token/second, much less try it and find out it's a horrible experience (let's talk about prefill), it will turn them off against local LLMs unnecessarily.
Is that "Good."?
I'll remain here, happily using my 2-point-something tokens/second model.
Now, where's that spare SSD...
For GPU inference at scale, I think token-level batching is used.
The big MLP tensors would be split across GPUs in the cluster. Then, for the MoE parts, you would spread the experts across the GPUs and route to them based on which experts are active (there would likely be more than one active expert if the batch size is > 1).
Given that R1 uses 37B active parameters (compared to 32B for K2), K2 should be slightly faster than that - around 1.15 tokens/second.
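That figure is just the active-parameter ratio applied to the ~1 token/second people report for R1 when streaming weights; a minimal sketch of the arithmetic, with the R1 rate taken as an assumed baseline:

```python
# Assume throughput on the same rig scales inversely with active parameters per token.
r1_active, k2_active = 37e9, 32e9   # active parameters per token
r1_rate = 1.0                        # assumed ~1 tok/s baseline for R1
print(r1_rate * r1_active / k2_active)  # ~1.16 tok/s for K2
```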
My R1 most likely isn't as smart as the output coming from an int8 or FP16 API, but that's just a given. It still holds up pretty well for what I did try.
From reading articles online, "agentic" means you have a "virtual" Virtual Assistant with "hands" that can google, open apps, etc., on its own.
Why not use existing "non-agentic" models and "orchestrate" them using LangChain, MCP, etc.? Why create a new breed of model?
I'm sorry if my questions sound silly. Following AI world is like following JavaScript world.
When an LLM says it's "agentic" it usually means that it's been optimized for tool use. Pretty much all the big models (and most of the small ones) are designed for tool use these days, it's an incredibly valuable feature for a model to offer.
I don't think this new model is any more "agentic" than o3, o4-mini, Gemini 2.5 or Claude 4. All of those models are trained for tools, all of them are very competent at running tool calls in a loop to try to achieve a goal they have been given.
Creating models for this specific problem domain gives them a better chance at reliability, which is not a solved problem.
Jules is the Gemini coder that links to GitHub. Half the time it doesn't create a pull request, and it forgets and assumes I'll do some testing or something. It's wild.
You are more right than you could possibly imagine.
TL;DR: "agentic" just means "can call tools it's been given access to, autonomously, and then access the output" combined with an infinite loop in which the model runs over and over (compared to a one-off interaction like you'd see in ChatGPT). MCP is essentially one of the methods to expose the tools to the model.
Is this something the models could do for a long while with a wrapper? Yup. "Agentic" is the current term for it, that's all. There's some hype around "agentic AI" that's unwarranted, but part of the reason for the hype is that models have become better at tool calling and using data in their context since the early days.
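To make that concrete, here's a minimal sketch of such a loop. `chat()` is a stand-in for whatever model API you use, and the message/tool format here is made up for illustration rather than any particular vendor's schema:

```python
import json
import subprocess

# One example tool the model is allowed to call.
TOOLS = {
    "run_shell": lambda args: subprocess.run(
        args["cmd"], shell=True, capture_output=True, text=True
    ).stdout,
}

def chat(messages):
    """Stand-in for an LLM call; returns {'answer': ...} or {'tool': ..., 'args': ...}."""
    raise NotImplementedError

def agent(task, max_steps=20):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):          # the loop that makes it "agentic", with a safety cap
        reply = chat(messages)
        if "answer" in reply:           # the model decided it's done
            return reply["answer"]
        output = TOOLS[reply["tool"]](reply["args"])    # run the requested tool...
        messages.append({"role": "tool", "content": json.dumps({"output": output})})  # ...and feed the result back
    return "step limit reached"
```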
https://github.com/MoonshotAI/Kimi-K2/blob/main/docs/deploy_...
"The smallest deployment unit for Kimi-K2 FP8 weights with 128k seqlen on mainstream H200 or H20 platform is a cluster with 16 GPUs with either Tensor Parallel (TP) or "data parallel + expert parallel" (DP+EP)."
16 GPUs costing ~$30k each. No one is running a ~$500k server at home.
Once that's running it can serve the needs of many users/clients simultaneously. It'd be too expensive and underutilized for almost any individual to use regularly, but it's not unreasonable for them to do it in short intervals just to play around with it. And it might actually be reasonable for a small number of students or coworkers to share a $70/hr deployment for ~40hr/week in a lot of cases; in other cases, that $70/hr expense could be shared across a large number of coworkers or product users if they use it somewhat infrequently.
So maybe you won't host it at home, but it's actually quite feasible to self-host, and is it ever really worth physically hosting anything at home except as a hobby?
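Rough numbers for that sharing argument, using the cluster cost above and an assumed cloud rate and group size:

```python
purchase = 16 * 30_000      # ~$480k to buy the 16-GPU cluster outright
hourly_rate = 70            # assumed cloud rental for an equivalent deployment
users = 10                  # assumed number of coworkers sharing it
weekly = hourly_rate * 40   # $2,800 for ~40 hours/week of shared use
print(weekly / users)       # ~$280 per person per week
print(purchase / weekly)    # ~171 weeks of rental before buying breaks even
```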
Not sure if they’ll trust a Chinese model, but dropping $50-100k for a quantized model that replaces, say, 10 paralegals is good enough for a law firm.
In addition, some people on /r/localLlama are having success with streaming the weights off SSD storage at 1 token/second, which is about the rate I get for DeepSeek R1.
Our only modification part is that, if the Software (or any derivative works
thereof) is used for any of your commercial products or services that have
more than 100 million monthly active users, or more than 20 million US dollars
(or equivalent in other currencies) in monthly revenue, you shall prominently
display "Kimi K2" on the user interface of such product or service.
OSI purism is deleterious and has led to industry capture.
Non-viral open source is simply a license for hyperscalers to take advantage. To co-opt offerings and make hundreds of millions without giving anything back.
We need more "fair source" licensing to support sustainable engineering that rewards the small ICs rather than mega conglomerate corporations with multi-trillion dollar market caps. The same companies that are destroying the open web.
This license isn't even that protective of the authors. It just asks for credit if you pass a MAU/ARR threshold. They should honestly ask for money if you hit those thresholds and should blacklist the Mag7 from usage altogether.
The resources put into building this are significant and they're giving it to you for free. We should applaud it.
The majority of open source code is contributed by companies, typically very large corporations. The idea that the open source ecosystem is largely carried by lone hobbyists contributing in their spare time after work is a myth. There are such folks (heck, I'm one of them) and they are appreciated and important, but their perceived role far exceeds their real role in the open source ecosystem.
The GPLv2 says:
> c) If the modified program normally reads commands interactively when run, you must cause it, when started running for such interactive use in the most ordinary way, to print or display an announcement including an appropriate copyright notice and a notice that there is no warranty (or else, saying that you provide a warranty) and that users may redistribute the program under these conditions, and telling the user how to view a copy of this License. (Exception: if the Program itself is interactive but does not normally print such an announcement, your work based on the Program is not required to print an announcement.)
And the 4-clause BSD license says:
> 3. All advertising materials mentioning features or use of this software must display the following acknowledgement: This product includes software developed by the organization.
Both of these licenses are not just non-controversially open-source licenses; they're such central open-source licenses that IIRC much of the debate on the adoption of the OSD was centered on ensuring that they, or the more difficult Artistic license, were not excluded.
It's sort of nonsense to talk about neural networks being "open source" or "not open source", because there isn't source code that they could be built from. The nearest equivalent would be the training materials and training procedure, which isn't provided, but running that is not very similar to recompilation: it costs millions of dollars and doesn't produce the same results every time.
But that's not a question about the license.
My personal feeling is that almost every project (I'll hedge a little because life is complicated) should prefer an OSI certified license and NOT make up their own license (even if that new license is "just" a modification of an existing license). License proliferation[1] is generally considered a Bad Thing for good reason.
What makes us comfortable with the "traditional open source licenses" is that people have been using them for decades and nothing bad has happened. But that's mostly because breaking an open source license is rarely litigated, not because we have some special knowledge of what those licenses mean and how to abide by them.
OK, fair enough. Pretend I said "not well understood" instead. The point is, the long-standing, well-known licenses that have been around for decades are better understood than some random "I made up my own thing" license. And yes, some of that may be down to norms and conventions, and yes, not all of these licenses have been tested in court. But I think most people would feel more comfortable using an OSI-approved license and are hesitant to foster the creation of even more licenses.
If nothing else, license proliferation is bad because of the combinatorics of understanding license compatibility issues. Every new license makes the number of permutations that much bigger, and creates more unknown situations.
A lot of open source, copyleft projects already have attribution clauses. You're allowed commercial use of someone else's work already, regardless of scale. Attribution is a very benign ask.
https://en.wikipedia.org/wiki/Common_Public_Attribution_Lice...
What I'm saying, if I'm saying anything at all, is that it might have been better to pick one of these existing licenses that has some attribution requirement, rather than adding to the license proliferation problem.
But is it really?
Sure, it may make some licenses incompatible with each other, but that's basically equivalent to complaining that somebody released their code under the GPL and it can't be used in a project that uses MIT...
And your argument that the terms are "less understood" really doesn't matter. It's not like people know the Common Public Attribution License in and out either. (I'm going to argue that 99% devs don't even know the GPL well.) Poor drafting could be an issue, but I don't think this is the case here.
And from an ideological standpoint, I don't think people should be shamed into releasing their code under terms they aren't 100% comfortable with.
"The license must not discriminate against any person or group of persons."
"The license must not restrict anyone from making use of the program in a specific field of endeavor. For example, it may not restrict the program from being used in a business, or from being used for genetic research."
By having a clause that discriminates based on revenue, it cannot be Open Source.
If they had required everyone to provide attribution in the same manner, then we would have to examine the specifics of the attribution requirement to determine if it is compatible... but since they discriminate, it violates the open source definition, and no further analysis is necessary.
In effect, there are two grants here:
* Small companies may use it without attribution
* Anyone may use it with attribution
The first may not be OSI compatible, but if the second license is then it’s fair to call the offering open weights, in the same way that dual-licensing software under GPL and a commercial license is a type of open source.
Presumably the restriction on discrimination relates to license terms which grant _no_ valid open source license to some group of people.
Being required to display branding in that way contradicts "run the program as you wish".
I think basically everybody considers CC BY to be open source, so a strictly more permissive license should be too.
Open-weight. As usual, you don't get the dataset, training scripts, etc.
Modified MIT License
Copyright (c) 2025 Moonshot AI
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the “Software”), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Our only modification part is that, if the Software (or any derivative works
thereof) is used for any of your commercial products or services that have
more than 100 million monthly active users, or more than 20 million US dollars
(or equivalent in other currencies) in monthly revenue, you shall prominently
display "Kimi K2" on the user interface of such product or service.
Tangent: I don't understand the contingent that gets upset about open LLMs not shipping with their full training regimes or source data. The software a company spent hundreds of millions of dollars creating, which you are now free to use and distribute with essentially no restrictions, is open source. It has weights in it, and a bunch of related software for actually running a model with those weights. How dare they!
The poaching was probably more aimed at hamstringing Meta's competition.
Because the disruption caused by them leaving in droves is probably more severe than the benefits of having them on board. Unless they are gods, of course.
Moonshot AI [1] (Moonshot; Chinese: 月之暗面; pinyin: Yuè Zhī Ànmiàn) is an artificial intelligence (AI) company based in Beijing, China. As of 2024, it has been dubbed one of China's "AI Tiger" companies by investors with its focus on developing large language models.
I guess everyone is up to date with AI stuff, but this is the first time I've heard of Kimi and Moonshot, and I was wondering where it's from. It wasn't obvious from a quick glance at the comments.
Perhaps their open source model release doesn't look so good compared to this one.
Is this the largest open-weight model?
At 1T MoE parameters trained on 15.5T tokens, K2 is one of the largest open-source models to date. But BAAI's Tele-FLM is 1T dense on 15.7T tokens: https://huggingface.co/CofeAI/Tele-FLM-1T
You can always check here: https://lifearchitect.ai/models-table/
Grok-1 is 314B, DeepSeek-V3 is 671B, and recent new open-weights models are around 70B~300B.
See https://github.com/peteryuqin/Kimi-K2-Mini, a project that keeps a small portion of the experts and layers while trying to preserve the model's capabilities across multiple domains.
What I did find instead is that some MoE models are explicitly domain-routed (MoDEM), but that doesn't apply to DeepSeek, which is just equally load-balanced, so it's unlikely to apply to Kimi. On the other hand, https://arxiv.org/html/2505.21079v1 shows modality preferences between experts, even in mostly random training. So maybe there's something there.
I developed an intelligent vector database agent using Kimi K2 and Milvus, which enhances document interaction via natural language commands.
However, 1T parameters makes local inference nearly impossible, let alone fine-tuning.
Often a faster answer is more useful to me for quick research. Reasoning has its place, but I don't think that place is everywhere.
Is there any way that I could do so?
OpenRouter? Or does Kimi have their own website? Just curious to really try it out!