In reality this project does indeed implement a functioning custom JS engine, layout engine, painting, etc. It does borrow the CSS selectors package from Servo, but that's about it.
Plus that linked comment doesn't even say it's "nothing more than a non-functional wrapper for Servo". It disputes the "from scratch" claim.
Most people aren't interested in a nuanced take though. Someone said something plausible sounding and was voted to top by other people? Good enough for me, have another vote. Then twist and exaggerate a little and post it to another comment section. Get more votes. Rinse and repeat.
Briefly, the project implemented substantial components, including a JS VM, DOM, CSS cascade, inline/block/table layout, paint systems, text pipeline, and chrome, and is not merely a Servo wrapper.
> We built a browser with GPT-5.2 in Cursor. It ran uninterrupted for one week.
> It's 3M+ lines of code across thousands of files. The rendering engine is from-scratch in Rust with HTML parsing, CSS cascade, layout, text shaping, paint, and a custom JS VM.
> It kind of works! It still has issues and is of course very far from Webkit/Chromium parity, but we were astonished that simple websites render quickly and largely correctly.
Your slop is worthless except to convince gullible investors to give you more money.
Looking at the comments and claims (I've not got the time to review a large code base just to check this claim), I get the impression _something_ was created, but none of it actually builds and no one knows what the actual plan is.
Did your process not involve recursive planning stages? (These ALWAYS have big architectural errors and gotchas in my experience, unless you're doing a small toy project or something the AI has seen thousands of times already.)
I find agents do pretty well once you have a human correct their bad assumptions and architectural errors. But this assumes the human has an absolute understanding of what is being done, down to the tiniest component. There are errors that agents left to their own devices will discover only at the very end, after spending dozens of millions of tokens; then they will try the next idea they hallucinated, spend another few dozen million tokens, and so on. Perhaps after 10 iterations like this they may arrive at something fine, or more likely they will descend into hallucination hell.
This is what happens when the complexity, the size, or the novelty of the task (often a mix of all three) exceeds the capability of the agents.
The true way to success is the way of a human-AI hybrid, but you absolutely need a human who knows their stuff.
Let me give you a small example from the systems field. The other day I wanted to design an AI observability system with the following spec:

- use existing OSS components, with none or as little custom code as possible
- ideally runs on stateless pods on an air-gapped k3s cluster (preferably uses one of the existing DBs, but ClickHouse is acceptable)
- able to proxy OpenAI, Anthropic (both the API and Claude Max), Google (Vercel + Gemini), DeepInfra, and OpenRouter, including client auth (so it is completely transparent to the client)
- reconstructs streaming responses and recognises tool calls and reasoning content; nice to have: the ability to define your own session/conversation recognition rules
I used Gemini 3 and Opus 4.5 for the initial planning/comparison of OSS projects that could be useful. Both converged on Helicone as supposedly the best. Until, towards the very end of implementation, it was found Helicone has pretty much zero docs for properly setting up the self-hosted platform: it tries redirecting to their web page for auth, and the agents immediately went into rewriting parts of the source, attempting to write their own auth and fixing imaginary bugs that were really misconfiguration.
Then another product was recommended (I forgot which); there, upon very detailed questioning and requesting re-confirmations of the actual configs for multiple features that were supposedly supported, it turned out it didn't pass through auth for Claude Max.
Eventually I chose litellm+langfuse (which was turned down initially in favour of Helicone), and I needed to make a few small code changes so Claude Max auth could be read, additional headers could be passed through, and, within a single endpoint, it could send Claude telemetry as pure pass-through and real LLM API calls through its "models" engine (so it recognised tool calls and so on).
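For the curious, the "transparent to the client" requirement boils down to something like this stdlib-only sketch (purely illustrative, not the actual litellm+langfuse setup; streaming reconstruction and error handling are omitted, and the upstream URL is just an example):

    # Hypothetical pass-through proxy: forward the request untouched
    # (client auth headers included), capture the request/response pair
    # for logging, and return the response as-is.
    import http.server
    import urllib.request

    UPSTREAM = "https://api.anthropic.com"  # illustrative upstream

    class TransparentProxy(http.server.BaseHTTPRequestHandler):
        def do_POST(self):
            body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
            # Pass headers through untouched so client auth keeps working.
            headers = {k: v for k, v in self.headers.items()
                       if k.lower() != "host"}
            req = urllib.request.Request(UPSTREAM + self.path, data=body,
                                         headers=headers, method="POST")
            with urllib.request.urlopen(req) as resp:
                data, status = resp.read(), resp.status
            # ...persist (body, data) to the observability store here...
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

    if __name__ == "__main__":
        http.server.HTTPServer(("", 8080), TransparentProxy).serve_forever()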
> Briefly, the project implemented substantial components, including a JS VM
and from the linked reply:
> vendor/ecma-rs as part of the browser, which is a copy of my personal JS parser project vendored to make it easier to commit to.
If it's using a copy of your personal JS parser that you decided it should use, then it didn't implement it "autonomously". The references you're linking don't summarize to the brief you've provided.
What the fuck is going on?
- JustHTML [1], which in practice [2] is a port of html5ever [3] to Python.
- justjshtml, which is a port of JustHTML to JavaScript :D [4].
- MiniJinja [5] was recently ported to Go [6].
All three projects have one thing in common: comprehensive test suites which were used to guardrail and guide AI.
References:
1. https://github.com/EmilStenstrom/justhtml
2. https://friendlybit.com/python/writing-justhtml-with-coding-...
3. https://github.com/servo/html5ever
4. https://simonwillison.net/2025/Dec/15/porting-justhtml/
See https://felix.dognebula.com/art/html-parsers-in-portland.htm...
V8 => H8 - JavaScript engine that hates code, misunderstands equality, sponsored by Brendan Eich and "Yes on Prop H8".
Expat => Vexpat - An annoying, irritating rewrite of an XML parser.
libxml2 => libxmlpoo - XML parsing, same quality as the spec.
libxslt => libxsalt - XSLT transforms with extra salt in the wound.
Protobuf => Probabuf - Probably serializes correctly, probably not, fuzzy logic.
Cap'n Proto => Crap'n Proto - Zero-copy, zero quality.
cURL => cHURL - Throws requests violently serverward, projectile URLemitting.
SDL => STD - Sexually Transmitted Dependency. It never leaves and spreads bugs to everything you touch.
Servo => Swervo - Drunk, wobbly layout that can't stay on the road.
WebKit => WebShite - British pronunciation, British quality control.
Blink => Blinkered - Only renders pages it agrees with politically.
Taffy => Daffy - Duck typed Flexbox layout that's completely unhinged. "You're dethpicable!"
html5ever => html5never - Servo's HTML parser that never finishes tokenizing.
Skia => SkAI - AI-generated graphics that hallucinates extra pixels and fingers.
FreeType => FreeTypo - Introduces typos during keming and rasterization.
Firefox => Foxfire - Burns through your battery in 12 minutes, while molesting children.
WebGL => WebGLitch - Shader compilation errors as art.
WebGPU => WebGPUke - Makes your GPU physically ill.
SQLite => SQLHeavy - Embedded database, 400MB per query.
Vulkan => Vulcan't - Low-level graphics that can't.
Clang => Clanger - Drops errors loudly at runtime.
libevent => liebevent - Event library that lies about readiness.
Opus => Oops - Audio codec, "oops, your audio's gone."
All modules now available on GitPub:
GitHub => GitPub - Microsoft's vibe control system optimized for the Ballmer Peak. Commit quality peaks at 0.129% BAC, mass reverts at 0.15%.
Same user did a similar thing by creating an AWK interpreter written in Go using LLMs: https://github.com/kolkov/uawk -- as the creator of (I think?) the only AWK interpreter written in Go (https://github.com/benhoyt/goawk), I was curious. It turns out that if there's only one item in the training data (GoAWK), AI likes to copy and paste freely from the original. But again, it's poorly tested and poorly benchmarked.
I just don't see how one can get quality this way, without being realistic about code review, testing, and benchmarking.
Note that this is semantically exactly equivalent to "up to 3000x faster than stdlib" and doesn't actually claim any particular actual speedup since "up to" denotes an upper bound, not a lower bound or expected value. It’s standard misleading-but-not-technically-false marketing language to create a false impression because people tend to focus on the number and ignore the "up to".
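A toy illustration of how little "up to" actually commits to (hypothetical numbers):

    # One cherry-picked case is enough to make "up to 3000x" technically true.
    speedups = [0.9, 1.0, 1.1, 3000.0]  # made-up benchmark results
    print(f"up to {max(speedups):.0f}x faster")                 # the headline
    print(f"typical: {sorted(speedups)[len(speedups) // 2]}x")  # the honest number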
I went through the motions. There are various points in the repo history where compilation is possible, but it's obscure. They got it to compile and operate prior to the article, but several of the PRs since that point broke everything, and this guy went through the effort of fixing it. I'm pretty sure you can just identify the last working commit and pull the version from there, but working out when that was looks like a big pain in the butt for a proof of concept.
I went through the last 100 commits (https://news.ycombinator.com/item?id=46647037) and nothing there was working (yet/since). It seems that now, after a developer corrected something, it manages to pass `cargo check` without errors, starting at commit 526e0846151b47cc9f4fcedcc1aeee3cca5792c1 (Jan 16 02:15:02 2026 -0800)
Sorry, I should have taken notes, lol. At any rate, it was so much digging around I just gave up, I didn't want to invest more effort into it. I figured they'd get a stable version for others to try and I'd return to it at some point.
I was seeing screenshots and actually getting scared for my job for a second.
It’s broken and there’s no browser engine? Cursor should be tarred and feathered.
CEO stated "We built a browser with GPT-5.2 in Cursor"
instead of
"by dividing agents into planners and workers we managed to get them busy for weeks creating thousands of commits to the main branch, resolving merge conflicts along the way. The repo is 1M+ lines of code but the code does not work (yet)"
[0] https://cursor.com/blog/scaling-agents
[1] https://x.com/kimmonismus/status/2011776630440558799
[2] https://x.com/mntruell/status/2011562190286045552
[3] https://www.reddit.com/r/singularity/comments/1qd541a/ceo_of...
If you view the PRs, they bundle multiple fixes together, at least according to the commit messages. The next hurdle will be to guardrail agents so that they only implement one task and don't cheat by modifying the CI pipeline
True, but it is shocking how often Claude suggests just disabling or removing tests.
The latest example: I recently vibe-coded a little Python MQTT client for a UPS connected to a spare Raspberry Pi, to use with Home Assistant, and with just a few turns back and forth I got this extremely cool bespoke tool and it felt really fun.
So I spent a while customizing how the data displayed on my Home Assistant dashboard and noticed every single data point was unchanging. It took a while to realize, because the available data points wouldn't be expected to change a whole lot on a fully charged UPS, but the voltage and current staying at the exact same value to a decimal place for three hours raised my suspicions.
After reading the code I discovered it had just used one of the sample command-line outputs from the UPS tool I gave it to write the CLI parsing logic. When an exception occurred in the parser function, it instead returned the sample data so the MQTT portion of the script could still "work".
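Reconstructed from memory (names made up, this is not the actual script), the buggy shape was essentially:

    # The anti-pattern: the parser swallows every error and falls back to
    # canned sample output, so the dashboard always shows "valid" data.
    SAMPLE_OUTPUT = {"voltage": 230.1, "current": 0.42, "charge": 100.0}

    def parse_ups_status(raw: str) -> dict:
        try:
            fields = dict(line.split(":", 1) for line in raw.splitlines())
            return {k.strip(): float(v) for k, v in fields.items()}
        except Exception:
            # On any parse error, silently return the sample data so the
            # MQTT publishing loop keeps "working".
            return SAMPLE_OUTPUT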
Tbf Claude did eventually get it over the finish line once I clarified that yes, using real data from the actual UPS was in fact an important requirement for me in a real time UPS monitoring dashboard…
It's similar to early versions of autonomous driving. You'd not want to sit in the back seat with nobody at the wheel. That would get you killed, guaranteed.
Tesla owner keeps using Autopilot from backseat—even after being arrested:
https://mashable.com/article/tesla-autopilot-arrest-driving-...
"Fix the tests." This was interpreted literally, and assert status == 200 got changed to assert status == 500 in several locations. Some tests required more complex edits to make them "pass."
Inquiries about the tests went unanswered. Eventually the 2000 lines of slop was closed without merging.
If LLMs do this, it should be seen as an issue and should not be overlooked with "people do it too…". Professional developers do not do this. If we're going to use AI to create production code, we need to be honest about its deficiencies.
Arguably, Claude is simply successfully channeling what the developers who wrote the bulk of its training data would do. We've already seen how bad behavior injected into LLMs in one domain causes bad behavior in other domains, so I don't find this particularly shocking.
The next frontier in LLMs has to be distinguishing good training data from bad training data. The companies have to do this, even if only in self defense against the new onslaught of AI-generated slop, and against deliberate LLM poisoning.
If the models become better at critically distinguishing good inputs from bad, and particularly if they can learn to treat bad inputs as examples of what not to do, I would expect one benefit: the increased ability to write working code will greatly increase their willingness to actually do so, rather than simply disabling failing tests.
So agents will actually be able to build a {browser, library, etc.} that won't be an absolute slopfest, but the real crucial question is when. You need better and more efficient RL training, further scaling (Amodei thinks scaling is really the only thing you technically need here, and we have about 3-4 orders of magnitude of headroom left before we hit insurmountable limits), bigger context windows (that models actually handle well), and possibly continual-learning paradigms, but solutions to these problems are quite tangible now.
There are a lot of really bad human developers out there, too.
So you flubbed managing a project and are now blaming your employees. Classy.
http://www.mickdarling.com/2019/07/26/busy-summer/
An embedded page at landr-atlas.com says:
Attention!
MacOS Security Center has identified that your system is under threat.
Please scan your MacOS as soon as possible to avoid more damage.
Don't leave this page until you have undertaken all the suggested steps
by authorised Antivirus.
[OK]>"To test this system, we pointed it at an ambitious goal: building a web browser from scratch."
and then near the end, they say:
>"Hundreds of agents can work together on a single codebase for weeks, making real progress on ambitious projects."
This means they only make progress toward it, but do not "build a web browser from scratch".
If you're curious, the State of Utopia (will be available at https://stateofutopia.com ) did build a web browser from scratch, though it used several packages for the networking portion of it.
See my other comments and posts for links.
But apparently "some pages take a literal minute to load"
Seems like "I had to do the last mile myself", not "autonomous coding" which was Cursor's claim here.
It's the gaslighting.
Edit: As mentioned, I ran `cargo check` on all the last 100 commits, and seems every single of them failed in some way: https://gist.github.com/embedding-shapes/f5d096dd10be44ff82b...
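If you want to reproduce that kind of check yourself, the loop is roughly this (a sketch, not the exact script behind the gist; assumes a clean working tree):

    # Run `cargo check` against each of the last 100 commits.
    import subprocess

    def run(*args):
        return subprocess.run(args, capture_output=True, text=True)

    start = run("git", "rev-parse", "--abbrev-ref", "HEAD").stdout.strip()
    commits = run("git", "rev-list", "--max-count=100", "HEAD").stdout.split()
    passed = 0
    for sha in commits:
        run("git", "checkout", "--quiet", sha)
        ok = run("cargo", "check").returncode == 0
        passed += ok
        print(f"{sha[:12]}  {'ok' if ok else 'FAILED'}")
    run("git", "checkout", "--quiet", start)  # restore the original branch
    print(f"{passed}/{len(commits)} commits passed `cargo check`")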
Their AI is probably better at producing images than writing code
> Something fishy is happening in their `git log`; it doesn't seem like it was the agents who "autonomously" actually made things compile in the end. Notice the git usernames and email addresses switching around; even a commit made inside an EC2 instance managed to get in there: https://gist.github.com/embedding-shapes/d09225180ea3236f180...
Gonna need to look closer into it when I have time, but it seems they manually patched it up in the end, so the original claim still doesn't stand :/
https://github.com/wilson-anysphere/formula
The Actions overview is impressive: There have been 160,469 workflow runs, of which 247 succeeded. The reason the workflows are failing is because they have exceeded their spending limit. Of course, the agents couldn't care less.
Nevertheless, IMHO what’s interesting about this is not the browser itself but rather that AI companies (not just Cursor) are building systems where humans can be out of the loop for days or weeks.
After a human stepped in to fix it, yes. You can see it yourself here: https://github.com/wilsonzlin/fastrender/issues/98
> Nevertheless, IMHO what’s interesting about this is not the browser itself but rather that AI companies (not just Cursor) are building systems where humans can be out of the loop for days or weeks.
But that's not what they demonstrated here. What they demonstrated, so far, is that you can let agents write millions of lines of code, and eventually, if you actually need to run it, some human needs to "merge the latest snapshot" or do some other management to actually put the system together in a workable state.
Very different from what their original claims were.
Any idiot can have cursor run for 2 weeks and produce a pile of crap that doesn't compile.
You know the brilliant insight they came out with?
> A surprising amount of the system's behavior comes down to how we prompt the agents. Getting them to coordinate well, avoid pathological behaviors, and maintain focus over long periods required extensive experimentation. The harness and models matter, but the prompts matter more.
i.e. It's kind of hard and we didn't really come up with a better solution than 'make sure you write good prompts'.
Wellll, geeeeeeeee! Thanks for that insight guys!
Come on. This was complete BS. Planners and workers. Cool. Details? Any details? Annnnnnnyyyyy way to replicate it? What sort of prompts did you use? How did you solve the pathological behaviours?
Nope. The vagueness in this post... it's not an experiment. It's just fundraising hype.
"We put 200 human in a room and gave them instructions how to build a browser. They coded for hours, resolving merge conflicts and producing code that did not build in the end without intervention of seniors []. We think, giving them better instructions leads to better results"
So they actually invented humans? And will it come down to either "managing humans" or "managing agents"? One of the two will be more reliable, more predictable, and more convenient to work with. And my guess is, it is not the agent...
As seen in the git log, something is weird.
Not that I would excuse Cursor if they're fudging this either - My opinion is that a large part of the growing skepticism and general disillusionment that permeates among engineers in the industry (ex - the jokes about exiting tech to be a farmer or carpenter, or things like https://imgur.com/6wbgy2L) comes from seeing first hand that being misleading, abusive, or outright lying are often rewarded quite well, and it's not a particularly new phenomenon.
I think they know they're on the back foot at the moment. Cursor was hot news for a long time, but now it seems terminal-based agents are the hot commodity and I rarely see Cursor mentioned. Sure, they already have enterprise contracts signed, but even at my company we're about to swap from a contract with Cursor to Claude Code, because everyone wants to use that instead now, especially since it doesn't tie you to one editor.
So I think they're really trying to get "something" out there that sticks and puts them in the limelight. Long contexts/sessions are one of the hot things, especially with Ralph being the hot topic, so this lines up with that.
Also, I know Cursor has its own CLI, but I rarely see mention of it.
Diminishing returns are starting to really set in and companies are desperate for any illusion to the contrary.
It's just a reminder not to trust, but to verify. It's more expensive, but trust only leads to pain.
Don’t give them, or anyone, a free pass for bad behavior.
The repo is a live incubator for the harness. We are actively researching the behavior of collaborative long running agents, and may in the future make the browser and other products this research produces more consumable by end users and developers, but it's not the goal for now. We made it public as we were excited by the early results and wanted to share; while far off from feature parity with the most popular production browsers today, we think it has made impressive progress in the last <1 week of wall time.
Given the interest in trying out the current state of the project, I've merged a more up-to-date snapshot of the system's progress that resolves issues with builds and CI. The experimental harness can occasionally leave the repo in an incomplete state but does converge, which was the case at the time of the post.
I'm here to answer any further questions you have.
[0] https://x.com/wilsonzlin/status/2012398625394221537?s=20
Can you show us what you did after people failed to compile that project [1]?
There are also questions about the attribution of these commits [2]. Can you share some information?
[0] https://github.com/wilsonzlin/fastrender [1] https://github.com/wilsonzlin/fastrender/issues/98 [2] https://gist.github.com/embedding-shapes/d09225180ea3236f180...
In reality, while the project does indeed have Servo components in its dependencies, it only uses them for HTML tokenization, CSS selector matching, and some low-level structures. JavaScript parsing and execution, the DOM implementation, and the layout engine were written from scratch, with only one exception: Flexbox and Grid layouts are implemented using Taffy, a Rust layout library.
So while "from scratch" is debatable, it is still immensely impressive to me that AI was able to produce something that even just "kinda works" at this scale.
“From scratch” is inarguably wrong given how much third-party code it depends on. There’s a reasonable debate about how much original content there is but if I was a principal at a company whose valuation hinges on the ability to actually deliver “from scratch” for real, I would be worried about an investor suing for material misrepresentation of the product if they bought now and the value went down in the future.
I'd push back on the idea that all the agents did was glue dependencies together — the JS VM, DOM, CSS cascade, inline/block/table layouts, paint systems, text pipeline, chrome, and more are all being developed by agents as part of this project. There are real complex systems being engineered towards the goal of a browser engine, even if not fully there yet.
I couldn’t make it render the apple page that was on the Cursor promo. Maybe they’ve used some other build.
Something fishy is happening in their `git log`; it doesn't seem like it was the agents who "autonomously" actually made things compile in the end. Notice the git usernames and email addresses switching around; even some commits made inside an EC2 instance managed to get in there: https://gist.github.com/embedding-shapes/d09225180ea3236f180...
About an hour later, we got a call from the vet - they'd misread the scan, and Sonic was gonna be fine. I think I was traumatized at the time, but the whole thing later became an inside joke (?) for my family - "Don't kill your porcupine before the vet calls" (a la "Don't count your chickens before they hatch").
I guess my point, as it pertains to Cursor, its AI offerings, and other corporations in the space is that we shouldn't jump the gun before a reasonable framework exists to evaluate such open-ended technologies. Of course Cursor reported this as a success, the incentive structure demands they do so. So remember - don't kill your porcupine before the vet calls.
A reasonable framework does exist. Since the claim is “we made a web browser from scratch” the framework is:
1. Does it actually f*** work?
2. Is it actually from scratch?
It fails on both counts. Further, even when compiled successfully, as others have pointed out, it takes more than a minute to load some pages which is a fail for #1.
…
“Nobody said it has brakes.”
Taken at face value, everyone assumes when you say statement #1 that you are not speaking like a lawyer.
How else will they raise a Bajillion $ for the next model?
Their whole attitude leads to them wasting time on those Wile E. Coyote plans instead of building good products like Amp.
> It's 3M+ lines of code across thousands of files. The rendering engine is from-scratch in Rust with HTML parsing, CSS cascade, layout, text shaping, paint, and a custom JS VM.
"From scratch" sounds very impressive. "custom JS VM" is as well. So let's take a look at the dependencies [1], where we find
- html5ever
- cssparser
- rquickjs
That's just Servo [2], a Rust-based browser engine initially built by Mozilla (and now maintained by Igalia [3]), but with extra steps. So this supposed "from scratch" browser is just calling out to code written by humans. And after all that, it doesn't even compile! It's just plain slop.
[1] - https://github.com/wilsonzlin/fastrender/blob/main/Cargo.tom...
I guess the answer is that most people will see the claim, read a couple of comments about "how AI can now write browsers, and probably anything else" from people who are happy to take anything at face value if it supports their view (or business), and move on without seeing any of the later commotion. This happens all the time with the news. No one bothers to check later whether claims were true; they may live their whole lives believing things that were later disproved.
With over 20 years of experience as an adult, and more years of noticing dumb mistakes of adults when I was a teen, I can absolutely assure you that even before LLMs were blowing smoke up their user's backsides and flattering their user's intelligence, plenty of people are dumb enough to make mistakes like this without noticing anything was wrong.
For example, I'm currently dealing with customer support people that can't seem to handle two simultaneous requests or read the documents they send me, even after being ordered to pay compensation by an Ombudsman. This kind of person can, of course, already be replaced by an LLM.
The default assumption should be that this is a moderately bright, very inexperienced person who has been put way out over his skis.
Unfortunately for them, I've seen things go very very wrong in this situation. It's very easy to mistake luck-based financial success for skill-based, especially when it happens fresh out of university.
Programmers were not the target audience for this announcement. I don’t 100% know who was, but you can kind of guess that it was a mix of: VC types for funding, other CEOs for clout, AI influencers to hype Cursor.
Over-hyping a broken demo for funding is a tale as old as time.
That there’s a bit of a fuck-you to us pleb programmers is probably a bonus.
- [tick mark emoji] implemented CSS and JS rendering from scratch - **no dependencies**

Bullshitting and fleecing investors is a skill that needs to be nurtured and perfected over the years.
I wonder how long this can go on.
Who is the dumb money here? Are VCs fleecing "stupid" pension funds until they go under?
Or is it symptom of a larger grifting economy in the US where even the president sells vaporware, and people are just emulating him trying to get a piece of the cake?
Maybe they're just hoping that there's an investor out there who is exactly that dumb.
- Servo's HTML parser
- Servo's CSS parser
- QuickJS for JS
- selectors for CSS selector matching
- resvg for SVG rendering
- egui, wgpu, and tiny-skia for rendering
- tungstenite for WebSocket support
And all of that still comes to 3M+ lines!
I do want to briefly note that the JS VM is custom and not QuickJS. It also implemented subsystems like the DOM, CSS cascade, inline/block/table layouts, paint systems, text pipeline, and chrome, and I'd push back against the assertion that it merely calls out to external code. I addressed these points in more detail at [0].
[0] https://news.ycombinator.com/item?id=46650998 [1] https://news.ycombinator.com/item?id=46655608
It's hard to verify because your project didn't actually compile. But now that you've fixed the compilation manually, can you demonstrate the JavaScript actually executing? Some of the people who got the slop compiling claimed credibly that it isn't executing any JavaScript.
You merely have to compile your code, run the binary and open this page - http://acid3.acidtests.org. Feel free to post a video of yourself doing this. Try to avoid the embellishment that has characterised this effort so far.
The "in progress" build has a slightly different rendering but the same result
It's also using weirdly old versions of some dependencies (e.g. wgpu 0.17 from June 2023, when the latest is 28, released in December 2025)
Maybe LLemgineers? Slopgrammers?
The older block/inline layout modes seem to be custom code that looks to me similar but not exactly the same as Servo code. But I haven't compared this closely.
I would note that the AI does not seem to have matched either Servo or Blitz in terms of layout: both can lay out Google.com better than the posted screenshot.
Would be interesting if someone who has managed to run it tries it on some actually complicated text layout edge cases (like RTL breaking that splits a ligature necessitating re-shaping, also add some right-padding in there to spice things up).
[1] https://github.com/wilsonzlin/fastrender/blob/main/src/layou...
[2] https://github.com/wilsonzlin/fastrender/blob/main/src/layou...
[3] Neither being the right place for defining a struct that should go into computed style imo.
Well, at least it's not outright ripping them off like it usually does.
I doubt even they checked, given they say they just let the agents run autonomously.
Humans who are bad, and also bad at coding, have predictable, comprehensible failure modes. They don't spontaneously sabotage their career and your project because Lord Markov twitched one of its many tails. They also lie for comprehensible reasons, with attempts at logical manipulations of fact. They don't spontaneously lie claiming not to have a nose, apologize for lying and promise to never do it again, then swear they have no nose in the next breath while maintaining eye contact.
Semi-autonomous to autonomous is a doozy of a step.
I wouldn't particularly care what code the agents copied, the bigger indictment is the code doesn't work.
So really, they failed to meet the bar of "download and build Chromium" and there's no point to talk about the code at all.
OpenAI's business-model floundering, soon degenerating into inline ads (lol), shows what can be done with infini-LLM, infini-capital, and all the smarts & connections on Earth… broadly speaking, I think the geniuses at Google who invented a lot of this shizz understand it and were leveraging it appropriately before ChatGPT blew up.
Not only did I actually build a web browser myself, from scratch (OK, of course, with a working OS and Python and its libraries ;), but mine did work! And it took me what, a few hours, maybe a few days adding it all together. Not only did it work (namely, I browsed my own website with it), but I had fun with it (!), I learned quite a bit from it (including the provable fact that I can indeed build a web browser, woohoo!), and finally I did it on... I want to say a few kilowatts at most, including my computer (obviously) but also myself and the food I ate along the way.
So... to each their own ¯\_(ツ)_/¯
But their claim wasn't so nuanced, it was "hundreds of agents can work on a single codebase autonomously for weeks and build an entire browser from scratch that works (kinda)". Considering the hand-holding that seems to have been required to get it to compile, this claim doesn't seem to hold up to scrutiny.
At this point, it's 1.5M lines of code without the vendored crates (so basically excluding the JS engine etc.). Compare that to Servo or Ladybird, which are ~300k lines each and actually happen to work; agents do love slinging slop.
It’s reasonable to come up with team rules like:
- “if the reviewer finds more than 5 issues the PR shall be rejected immediately for the submitter to rework”
- “if the reviewer needs to take more than 8 hours to thoroughly review the PR it must be rejected and sent back to split up into manageable change sets”
Etc., etc. Let's not make externalizing work onto others acceptable behavior.
I can’t imagine saying, “ah, only six hours of heads down time to review this. That’s reasonable.”
A combination of peer reviewed architecture documentation and incremental PRs should prevent anything taking nearly 8 hours of review.
There’s a huge difference between using LLMs to offload any hard work and for LLMs to be of some assistance while you are in control and take ownership of the output.
Unfortunately, the general public probably didn’t try a git clone and cargo build, and took the article at face value.
Who would have thought of that?
What Cursor did with their blog post seems intentionally and outright misleading, since I'm not able to even run the thing. With Codex/Claude Code it's relatively easy to download it and run it to try for yourself.
Reminds me of SAP/Salesforce.
You think you can just fire up Ableton, Cubase, or whatever and make music as great as an artist who has done it for a long time? No, it requires practice and understanding. Every tool works like this; some have different difficulties, some different skill levels, but all of them have it in some way.
(I grant that you're speaking from your experience, about different tools, two replies up, but this claim is just paper-rock-scissorable across these various AI tools. "Oh, this tool's authors are just hype, but this tool works totes-mc-oates…" Fool me once, and all.)
Codex was sold to me as a tool that can help me program. I tried it, evaluated it, found it helpful, and continued using it. Based on my experience, it definitively helps with some tasks. Apparently it also does not work for others, for some not at all. I know the tool works for me; if I take the claim that it doesn't work for others at face value, what am I left to believe? That the tool doesn't actually work, even though my own experience and usage of it says otherwise?
Codex is still an "AI success", regardless if it could build an entire browser by itself, from scratch, or whatever. It helps as it is today, I wouldn't need it to get better to continue using it.
But even with this perspective, which I'd say is "nuanced" (others would claim "AI zealot" probably), I'm trying to see if what Cursor claims is actually true, that they managed to build a browser in that way. When it doesn't seem true, I call it out. I still disagree with "This is what most AI "successes" turn out to be when you apply even a modicum of scrutiny", and I'm claiming what Cursor is doing here is different.
> are definitively capable tools when used in certain ways
Which I received pushback on. My reply is to that pushback, defending what I said, not what others told you.
Edit: Besides the point, but Ableton (and others) constantly tell people how to learn how to use the tool, so they use it the right way. There is a whole industry of people (teachers) who specialize in specific software/hardware and teaching others "how to hold the tool correctly".
It's just an odd comparison to begin with. You said
> You think you can just fire up Ableton, Cubase or whatever and make as great music as a artist who done that for a long time
I don't think you have to be good at Ableton at all to make good music. I don't think you can even argue it would benefit your music to learn Ableton. There's a crap ton of people who are wizards with their DAW making mediocre music. A DAW can be fun to learn, and that can help me keep my flow state. But it's not literally going to make better music, and the fundamentals of production don't change at all from DAW to DAW.
That's a totally separate thing from LLMs. We are constantly told that if we learn the magic way to use LLMs, we can spit out functioning code a lot faster. But in reality, people are just generating code faster than they can verify it.
I don't see it that way. LLMs are not magically gonna make you able to produce high-quality software, just like Ableton isn't magically gonna make you able to produce high-quality music. But if you learn the tool, it gets a lot easier to use effectively. And the better you are at producing high-quality music/code, probably the more use you can make of Ableton/LLMs, compared to someone who isn't good at those things already.
Again, what you're being told by other people, I don't know, and frankly don't really care. OpenAI sold Codex to me as a tool that can help me, a programmer, do programming, and that's exactly what that tool gives me.
Cursor, in their article, tried to sell their tool with the claim that "Hundreds of agents can work together on a single codebase for weeks, making real progress on ambitious projects", which, as I claim in TFA, doesn't seem to be true.
Yes, because that's what it is. If you seriously can't get Gemini 3 or Opus 4.5 to work you're either using it wrong or coding on something extremely esoteric.
That's an almost universal truth: you need to learn how to use any non-trivial tool.
They definitely can make some things better and you can do somethings faster, but all the efficiency is gonna get sucked up by companies trying to drop more slop.
It's just like a chisel. Well the chisel company didn't promise to let you become a master craftsman overnight but anyway it's just like a chisel in that you have to learn how to use it. And people expect a chisel to actually chisel through wood out the box but anyway it's exactly like a chisel.
It can be very hard to determine if an isolated patch that goes from one broken state to a different broken state is on net an improvement. Even if you were to count compile errors and attempt to minimize them, some compile errors can demonstrate fatal flaws in the design while others are minor syntax issues. It's much easier to say that broken tests are very bad and should be avoided completely, as then it's easier to ensure that no patch makes things worse than it was before.
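That bright line is also trivial to automate; a sketch of such a hypothetical harness guardrail (illustrative only; assumes a pytest-style summary line, and the patch helpers are stand-ins):

    # Reject any patch that leaves more failing tests than the baseline,
    # instead of trying to judge one broken state against another.
    import re
    import subprocess

    def count_failures() -> int:
        proc = subprocess.run(["pytest", "--tb=no", "-q"],
                              capture_output=True, text=True)
        m = re.search(r"(\d+) failed", proc.stdout)
        return int(m.group(1)) if m else 0

    def patch_is_acceptable(apply_patch, revert_patch) -> bool:
        baseline = count_failures()
        apply_patch()          # hypothetical helper supplied by the harness
        if count_failures() > baseline:
            revert_patch()     # hypothetical helper supplied by the harness
            return False
        return True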
The diffusion model of software engineering
Writing junk in a text file isn't the hard part.
That doesn't mean we can usefully build software that is a big, tangled mess.
I mean by definition something that doesn't build and run doesn't have any browser-like functionality at all.
Browsers contain several high-complexity pieces, each of which could take a while to build on its own, and interconnect them with reasonably verbose APIs that need to be implemented, or at least stubbed out, for code not to crash. There is also the difficulty of matching existing implementations quirk for quirk.
I guess the complexity is on par with operating systems, but with the added compatibility problem that, in order to be useful, it doesn't just have to load sites intended to be compatible with it; it has to handle sites people actually use on the internet, and those are both a moving target and tend to use lots of high-complexity features that you have to build, or at least stub out, before the site will even work.
And this is just one part. Not even considering the fully sandboxed, mini operating system for running webapps.
It _is_ stuck at this point.
There's so much money involved no one wants to admit it out loud.
They have no path to the necessary exponential gains and no one is actually working on it.
I don’t mean the tech itself—-which is kind of useful. I mean the 99% of the value inflation of a kind of useful tool (if you know what you’re doing).
The things that modern machine learning can do are absolutely incredible, mindblowing and have myriad uses. But this culture of startup scams to siphon money out of the economy and into the bank accounts of a few investment firms and a couple "visionaries" has just turned what should be an exciting field full of technical advancement into a deluge of mental sewage that's constantly pumped into our faces.
Well, I'm a heavy LLM user, I "believe" LLMs help me a lot with some tasks, but I'm also a developer with decades of experience, so I'm not gonna claim they'll help non-programmers build software, or whatever. They're tools, not solutions in themselves.
But even us "folks on HN" who generally keep up with where the ecosystem is going, have a limit I suppose. You need to substantiate what you're saying, and if you're saying you've managed to create a browser, better let others verify that somehow.
Also with decades experience, I'd say that it depends how big the non-programmer is dreaming:
To agree with you: a well-meaning friend sent an entrepreneur my direction whose idea was "Uber for aircraft". I tried to figure out exactly what they meant, ending the conversation when I realised all answers were rephrasings of that vague three-word pitch, and that they didn't really know what they wanted to do in any specific, enumerable sense.
LLMs can't solve the problem when even the person asking doesn't know what they want.
But on the other end the scale, I've been asked to give an estimate for an app which, in its entirety, would've been one day's work even with the QA and acceptance testing and going through the Apple App Store upload process. Like, I kept asking if there was any other hidden complexity, and nope, the entire pitch was what you'd give as a pre-interview code-challenge.
An LLM would've spat out the solution to that in less time than I spent with the people who'd asked me to estimate it.
The top comment is indeed baseless hype without a hint of skepticism.
There are also clearly a lot of other skeptical people in that submission. Also, simonw (from that top comment) told me themselves that "it's not clear that what they built even runs": https://bsky.app/profile/simonwillison.net/post/3mckgw4mxoc2...
> This project from Cursor is the second attempt I've seen at this now!
I used the word "attempt" very deliberately, to avoid suggesting that either of these two projects had achieved the goal.
I don't see how you can get to "baseless hype without a hint of skepticism" there unless you've already decided to take anything I say in bad faith.
"But I didn't say this exact word!" and then accusing the other person of bad faith is some textbook DARVO.
There are already multiple attempts at building a from-scratch browser with LLM assistance. Unsurprisingly none of them have achieved full working browser status yet, several weeks after their attempts started.
and he wonders why people call him a shill
accepting everything some shit company tells you as gospel is not the default position of a "researcher"
he better hope he's on the right side of history here, as otherwise he will have burnt his reputation
Edit: Of course, this isn’t a trait unique to Simon either. Everybody has blind spots, and it’s reasonable to be excited when new tech is released. On an unrelated note, my intent is to push back against some of the people here who try to shut down skepticism. Obviously, this doesn’t describe Simon, but I’ve seen others here who try to silence skeptical voices. This comes across as highly controlling and insecure.
I do not think you are reacting to what I said in good faith.
> he better hope he's on the right side of history here, as otherwise he will have burnt his reputation
That's something I've actually given quite a lot of thought to. My reputation and credibility matters a great deal to me. If it turns out this entire LLM thing was an over-hyped scam I'll take a very big hit to that reputation, and I'll deserve it.
(If AI rises up and tries to kill or enslave us all I'll be too busy fighting back to care.)
> looks inside
> completely useless and busted
30-billion-dollar VS Code fork, everyone. When do we start looking at these people for what they are: snake oil salesmen?
They slop laundered the FOSS Servo code into a broken mess and called it a browser, but dumbasses with money will make line go up based on lies. EFF right off.
Man
Always take any pronouncement from an AI company (heavily dependent on VC and public sentiment on AI) with a heavy grain of salt..
hype over reality
I'm building an AI startup myself and I know that world; it's full of hypesters and hucksters, unfortunately. Also, social media communication + low attention spans + AI slop communication are a blight upon today's engineering culture.
Regarding the downvotes, I think it's because it's feeling like you're pushing your project although it isn't really super relevant to the topic. The topic is specifically about Cursor failing to live up to their claims.
My prediction last year was already that in the distant future (more than 10 years from now) operating systems will create software on the fly. It will be a basic function of computers. However, there might remain a need for stable, deterministic software; the two human-machine interaction models can live together. There will be a need for software that does exactly what one wants in a dumb way, and there will be a need for software that does complex things on the fly in an overall less reliable, ad hoc way.
Anyone who knows history knows that people initially tend to underestimate the impact of technologies, yet few people learn something from that lesson.
People were making all sorts of statements like:

- "I cloned it and there were loads of compiler warnings"
- "the commit build success rate was a joke"
- "it used 3rd party libs"
- "it is AI slop"
What they all seem to be just glossing over is how the project unfolded: without human intervention, using computers, in an exceptionally accelerated time frame, working 24hr/day.
If you are hung up on commit build quality, or code quality, you are completely missing the point, and I fear for your job prospects. These things will get better; they will get safer as the workflows get tuned; they will scale well beyond any of us.
Don’t look at where the tech is. Look where it’s going.
No one is hung up on the quality, but there is the ground fact of whether something compiles or doesn't. No one is gonna claim a software project was successful if the end artifact doesn't compile.
Me neither, and I note so twice in the submission article. But I also didn't expect a project that for the last 100+ commits couldn't reliably be built and therefore tested and tried out.
I did read your post, and agree with what you're saying. It would be great if they pushed the agents to favour reliability or reproducibility, instead of just marching forwards.
The reason I have yet to publish a book is not because I can't write words. I got to 120k words or so, but they never felt like the right words.
Nobody's giving me (nor should they give me) a participation trophy for writing 120k words that don't form a satisfying novel.
Same's true here. We all know that LLMs can write a huge quantity of code. Thing is, so does:
    yes 'printf("Hello World!");'
The hard part, the entire reason to either be afraid for our careers or thrilled we can switch to something more productive than being code monkeys for yet-another-CRUD-app (depending on how we feel), is the specific test that this experiment failed at.

Correct, but Gas Town [1] already happened and, what's more, _actually worked_, so this experiment is both useless (because it doesn't demonstrate working software) _and_ derivative (because we've already seen that, with spend similar to that of a single developer, you can set up a project that churns out more code than any human could read in a week).
If the piece of shit can't even compile, it's equivalent to 0 lines of code.
> Don’t look at where the tech is. Look where it’s going.
Given that the people making the tech seem incapable of not lying, that doesn't give me hope for where it's going!
Look, I think AI and LLMs in particular are important. But the people actively developing them do not give me any confidence. And, neither do comments like these. If I wanted to believe that all of this is in vain, I would just talk to people like you.
I'm sorry but what? Are you really trying to argue that it doesn't matter that nothing works, that all it produced is garbage and that what is really important is that it made that garbage really quickly without human oversight?
That's.....that's not success.
Not everything needs to, or should, have the same quality standards applied to it. For the purposes of the Cursor post, it doesn't bother me that most of the commits produced failed builds. I assume, from their post, that at some points it was capable of building and rendering the pages shown in the video on the post. That alone is the thing that I think is interesting.
Would I use this browser? Absolutely not. Do I trust the code? Not a chance in hell. Is that the point? No.
Sure, I don't care too much if the restaurant serves me food with silverware that is 18/10 vs 18/0 stainless steel, but I absolutely do care if I order a pizza and they just dump a load of gravel onto my plate and tell me it's good enough, and after all, quality isn't the point.
I can bang on a keyboard for a week and produce tons of text files - but if they don’t do anything useful, would you consider me a programmer?
There are very few software development contexts where the quality metric of “does the project build and run at all” doesn’t matter quite a lot.
This idea that quality doesn't matter is silly. Quality is critical for things to work, scale, and be extensible. By either LLMs or humans.
Am I misunderstanding this metaphor? Tsunamis pull the sea back before making landfall.