Things I noted:
- It's fast. I tested it in EU tz, so ymmv
- It does agentic in an interesting way. Instead of editing a file whole or in many places, it does many small passes.
- Had a feature take ~110k tokens (parsing html w/ bs4). Still finished the task. Didn't notice any problems at high context.
- When things didn't work first try, it created a new file to test, did all the mocking / testing there, and then once it worked edited the main module file. Nice. GPT5-mini would often times edit working files, and then get confused and fail the task.
All in all, not bad. At the price point it's at, I could see it as a daily driver. Even agentic stuff w/ opus + gpt5 high as planners and this thing as an implementer. It's fast enough that it might be worth setting it up in parallel and basically replicate pass@x from research.
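To make the "run it in parallel, pass@x style" idea concrete, here is a minimal sketch. It assumes you have some way to call the model (`generate`) and some automatic check (`passes`); both are hypothetical stand-ins here, not any provider's API. The fake generator only produces correct code for one seed, so you can see the selection logic work.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Optional


def first_passing(generate: Callable[[int], str],
                  passes: Callable[[str], bool],
                  k: int = 4) -> Optional[str]:
    """Run k independent attempts in parallel; return one that passes, else None."""
    with ThreadPoolExecutor(max_workers=k) as pool:
        futures = [pool.submit(generate, seed) for seed in range(k)]
        for future in as_completed(futures):
            candidate = future.result()
            if passes(candidate):
                # Leaving the `with` block still waits for the other attempts.
                return candidate
    return None


def fake_generate(seed: int) -> str:
    # Hypothetical stand-in for a model call: only seed 2 yields correct code.
    op = "+" if seed == 2 else "-"
    return f"def add(a, b):\n    return a {op} b\n"


def passes(candidate: str) -> bool:
    # Hypothetical check: load the candidate and run one tiny test against it.
    namespace: dict = {}
    exec(candidate, namespace)
    return namespace["add"](2, 3) == 5


if __name__ == "__main__":
    print(first_passing(fake_generate, passes, k=4))
```

In a real setup `passes` would be your test suite and `generate` an actual (slow, billed) request, so you would probably want to cancel the remaining attempts once one passes rather than wait for them.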
IMO it's good to have options at every level. Having many providers fight for the market is good, it keeps them on their toes, and brings prices down. GPT5-mini is at 2$/MTok, this is at 1.5$/MTok. This is basically "free", in the grand scheme of things. I don't get the negativity.
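For a rough sense of scale, here is the back-of-the-envelope arithmetic for the ~110k token feature mentioned above at the two quoted prices. This is a sketch that assumes the whole run is billed at a single flat per-MTok rate, which ignores any input/output split or caching discounts.

```python
def cost_usd(tokens: int, usd_per_mtok: float) -> float:
    """Cost of a run, assuming one flat price per million tokens."""
    return tokens / 1_000_000 * usd_per_mtok


if __name__ == "__main__":
    tokens = 110_000  # the ~110k token feature from the comment above
    print(f"at $1.5/MTok: ${cost_usd(tokens, 1.5):.3f}")  # prints $0.165
    print(f"at $2.0/MTok: ${cost_usd(tokens, 2.0):.3f}")  # prints $0.220
```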
Grok is owned by Elon Musk. Anything positive that is even tangentially related to him will be treated negatively by certain people here. Additionally, it is an AI coding tool which is seen as a threat to some people’s livelihoods here. It’s a double whammy, so I’m not surprised by the reaction to it at all.
See also the Microsoft threads on HN where everyone threatens to switch to Linux, and by reading them you'd think Linux is finally about to have its infamous glory year on the desktop.
Are you aware that outside your echo chamber bubble, which comes up with these talking points and implants its own values in you, the rest of us DO NOT CARE at all or sympathize with you in any way?
Trying to politicize everything that comes your way and regurgitating implanted values isn't a healthy way to live. I really pray for people like you to find inner peace and find healthy connections outside your comfort zone.
The OP was descriptive, not prescriptive.
I would have thought it an uncontroversial view among software engineers that token quality is much more important than token output speed.
If an LLM is often going to be wrong anyway, then being able to try prompts quickly and then iterate on those prompts, could possibly be more valuable than a slow higher quality output.
Ad absurdum, if it could ingest and work on an entire project in milliseconds, then it has much greater value to me than a process which might take a day to do the same, even if the likelihood of success is also strongly affected.
It simply enables a different method of interactive working.
Or it could supply 3 different suggestions in-line while working on something, rather than a process which needs to be explicitly prompted and waited on.
Latency can have critical impact on not just user experience but the very way tools are used.
Now, will I try Grok? Absolutely not, but that's a personal decision due to not wanting anything to do with X, rather than a purely rational decision.
Before MoE was a thing, I built what I called the Dictator, which was one strong model working with many weaker ones to achieve a similar result as MoE, but all the Dictator ever got was Garbage In, so guess what came out?
this site is the fucking worst
Asking any model to do things in steps is usually better too, as opposed to feeding it three essays.
* Scaffolding
* Ask it what's wrong with the code
* Ask it for improvements I could make
* Ask it what the code does (amazing for old code you've never seen)
* Ask it to provide architect level insights into best practices
One area where they all seem to fail is lesser known packages: they tend to hallucinate, referencing functionality that is either not there anymore or never was. Which is part of why I don't ask it for too much.
Junie did impress me, but it was very slow, so I would love to see a version of Junie using this version of Grok, it might be worthwhile.
not if you have too much! a few hundred thousand lines of code and you can't ask shit!
plus, you just handed over your company's entire IP to whoever hosts your model
There's nothing wrong with doing it, but it's entirely unrelated to performance.
For autocompleting simple functions (string manipulation, function definitions, etc), the quality bar is pretty easy to hit, and speed is important.
If you're just vibe coding, then yeah, you want quality. But if you know what you're doing, I find having a dumber fast model is often nicer than a slow smart model that you still need to correct a bit, because it's easier to stay in flow state.
With the slow reasoning models, the workflow is more like working with another engineer, where you have to review their code in a PR
We already know that in most software domains, fast (as in, getting it done faster) is better than 100% correct.
Fast is good for tool use and synthesizing the results.
It's not long enough for you to context switch to something else, but long enough to be annoying and these wait times add up during the whole day.
It also discourages experimentation if you know that every prompt will potentially take multiple minutes to finish. If it instead finished in seconds then you could iterate faster. This would be especially valuable in the frontend world where you often tweak your UI code many times until you're satisfied with it.
They reduce the costs though!
Different models for different things.
Not everyone is solving complicated things every time they hit cmd-k in Cursor or use autocomplete, and they can easily switch to a different model when working harder stuff out via longer form chat.
At least this comment was written fast.
I use Opus 4.1 exclusively in Claude Code but then I also use zen-mcp server to get both gpt5 and gemini-2.5-pro to review the code and then Opus 4.1 responds. I will usually have eyeballed the code somewhere in the middle here but I'm not fully reviewing until this whole dance is done.
I mean, I obviously agree with you in that I've chosen the slowest models available at every turn here, but my point is I would be very excited if they also got faster because I am using a lot of extra inference to buy more quality before I'm touching the code myself.
> I use Opus 4.1 exclusively in Claude Code but then I also use zen-mcp server to get both gpt5 and gemini-2.5-pro to review the code and then Opus 4.1 responds.
I'd love to hear how you have this set up.
While the top coding models have become much more trustworthy lately, Grok isn't there yet. It doesn't matter if it's fast and/or free; if you can't trust a tool with your code, you can't use it.
(If that's what you meant)
I guess if you cannot do well in benchmarks, instead pick an easier to pump up one and run with that - speed. Looking online for benchmarks the first thing that came up was a reddit post from an (obvious) spam account[1] gloating about how amazing it was on a bunch of subs.
Let's see this harness, then, because third party reports rate it at 57.6%
Opus 4.1 is by far the best right now for most tasks. It’s the first model I think will almost always pump out “good code”. I do always plan first as a separate step, and I always ask it for plans or alternatives first and always remind it to keep things simple and follow existing code patterns. Sometimes I just ask it to double check before I look at it and it makes good tweaks. This works pretty well for me.
For me, I found Sonnet 3.5 to be a clear step up in coding, I thought 3.7 was worse, 2.5 pro equivalent, and 4 sonnet equal maybe tiny better than 3.5. Opus 4.1 is the first one to me that feels like a solid step up over sonnet 3.5. This of course required me to jump to Claude code max plan, but first model to be worth that (wouldn’t pay that much for just sonnet).
What I recently found much more valuable, and why I am now preferring GPT-5 over Sonnet 4, is that if I start asking it to give me different architectural choices, it's really quite good at summarizing trade-offs and offering step-by-step navigation towards problem solving. I am liking this process a lot more than trying to "one shot" or getting tons of code completely rewritten that's unrelated to what I am really asking for. This seems to be a really bad problem with Opus 4.1 Thinking or even Sonnet Thinking. I don't think it's accurate to rate models on "one-shotting" a problem. Rate it on how easy it is to work with as an assistant.
*edit Case in point, downvotes in less than 30 seconds
https://i.imgur.com/qgBq6Vo.png
I'm going to test it. My bottleneck currently is waiting for agent to scan/think/apply changes.
Eg, https://www.msn.com/en-us/news/world/musk-retweets-hitler-di...
How pathetic. They aren't even able to accept that a loser like Musk is a nazi rat
But anytime I hear of Grok or xAI, the only thing I can think about is how it's hoovering up water from the Memphis municipal water supply and running natural gas turbines to power all for a chat bot.
Looks like they are bringing even more natural gas turbines online...great!
https://netswire.usatoday.com/story/money/business/developme...
A hint to all AI companies, nobody wants quickly generated broken code.
I haven't used Copilot in a while but Cursor lets you easily switch the model depending on what you're trying to do.
Having options for thinking, normal, fast covers every sort of problem. GPT-5 doesn't let you choose which IMO is only helpful for non-IDE type integrations, although even in ChatGPT it can be annoying to get "thinking" constantly for simple questions.
I'm getting 30-50% more code changed per day now. Yesterday I plumbed six slightly mechanical, but still major changes through our schema, several microservice layers, API client libraries, and client code. I wrote down the change sites ahead of time to track progress: 54. All requiring individual business logic. This would have been tedious without tab complete.
And that's not the only thing I did yesterday.
I wouldn't trust these tools with non-developers, but in our hands they're an exoskeleton. I like them like I like my vim movements.
A similar analogy can be made for the AI graphics design and editing models. They're extremely good time saving tools, but they still require a human that knows what they're doing to pilot them.
I also think it is optimistic to think the jailbreak percentage will stay at "0.00" after public use, but time will tell.
https://data.x.ai/2025-08-26-grok-code-fast-1-model-card.pdf