SWE-Bench (Pro / Verified)
Model               | Pro (%) | Verified (%)
--------------------+---------+--------------
GPT-5.2-Codex       | 56.4    | ~80
GPT-5.2             | 55.6    | ~80
Claude Opus 4.5     | n/a     | ~80.9
Gemini 3 Pro        | n/a     | ~76.2
And for terminal workflows, where agentic steps matter: Terminal-Bench 2.0
Model               | Score (%)
--------------------+-----------
Claude Opus 4.5     | ~60+
Gemini 3 Pro        | ~54
GPT-5.2-Codex       | ~47
So yes, GPT-5.2-Codex is good, but when you put it next to its real competitors:
- Claude is still ahead on strict coding + terminal-style tasks
- Gemini is better for huge context + multimodal reasoning
- GPT-5.2-Codex is strong but not clearly the new state of the art across the board
It feels a bit odd that the page only shows internal numbers instead of placing them next to the other leaders.
And then there's the side of these models that is hard to measure. Opus has some sort of HAL-like smoothness I don't see in other models, but then I haven't tried GPT-5.2 for coding yet. (Nor Gemini 3 Pro; I'm not claiming Opus is superior, just that something about practical usability is hard to measure.)
And I don't think your Terminal-Bench 2.0 scores are accurate. Per the latest benchmarks: Opus 4.5 is at 59%, GPT-5.2-Codex is at 64%.
See the charts at the bottom of https://marginlab.ai/blog/swe-bench-deep-dive/ and https://marginlab.ai/blog/terminal-bench-deep-dive/
Codex is so so good at finding bugs and little inconsistencies, it's astounding to me. Where Claude Code is good at "raw coding", Codex/GPT5.x are unbeatable in terms of careful, methodical finding of "problems" (be it in code, or in math).
Yes, it takes longer (quality, not speed please!) -- but the things that it finds consistently astound me.
So much so that now I rely completely on Codex for code reviews and actual coding. I will pick higher quality over speed every day. Please don’t change it, OpenAI team!
I’m in Claude Code so often (x20 Max) and I’m so comfortable with my environment setup with hooks (for guardrails and context) that I haven’t given Codex a serious shot yet.
It's often not that a different model is better (well, it still has to be a good model). It's that a different chat has a different objective - and will identify different things.
While I prefer the way Claude speaks and writes code, there is no doubt that whatever Codex does is more thorough.
I don't understand why not. People pay for quality all the time, and often they're begging to pay for quality, it's just not an option. Of course, it depends on how much more quality is being offered, but it sounds like a significant amount here.
I consistently run into limits with CC (Opus 4.5) -- but even though Codex seems to be spending significantly more tokens, it just seems like the quota limit is much higher?
Managing context goes a long way, too. I clear context for every new task and keep the local context files up to date with key info to get the LLM on target quickly
Aggressively recreating your context is still the best way to get the best results from these tools too, so it has a secondary benefit.
My take is that the overlap is strongest with engineering management. If you can learn how to manage a team of human engineers well, that translates to managing a team of agents well.
The other skill is in knowing exactly when to roll up your sleeves and do it the old fashioned way. Which things they're good/useful for, and which things they aren't.
I do wish that ChatGPT had a toggle next to each project file instead of having to delete and reupload to toggle or create separate projects for various combinations of files.
So if you look at the total cost of running the benchmark, it's surprisingly similar to other models -- the higher price per token is offset by the significantly fewer tokens required to complete a task.
See "Cost to Run Artificial Analysis Index" and "Intelligence vs Output Tokens" here
https://artificialanalysis.ai/
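As a toy illustration with made-up numbers (not taken from those charts), a higher per-token price can wash out if the model needs fewer tokens to finish:

    # Purely hypothetical pricing/usage figures, for illustration only.
    models = {
        "chatty_model": {"usd_per_mtok": 10.0, "output_mtok": 2.0},
        "terse_model":  {"usd_per_mtok": 20.0, "output_mtok": 1.0},
    }

    for name, m in models.items():
        cost = m["usd_per_mtok"] * m["output_mtok"]
        print(f"{name}: ${cost:.2f} to complete the same task set")
    # Both land on $20.00: twice the price per token, half the tokens.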
...With the obligatory caveat that benchmarks are largely irrelevant for actual real world tasks and you need to test the thing on your actual task to see how well it does!
In my mind, they're hardly making any money compared to how much they're spending, and are relying on future modeling and efficiency gains to reduce their costs, while pursuing user growth and engagement almost exclusively -- the more queries they get, the more data they get, and the bigger a data moat they can build.
All the money they keep raising goes to R&D for the next model. But I don't see how they ever get off that treadmill.
Or is it already saturated?
everyone seems to assume this, but it's not like it's a company run by dummies, or has dummy investors.
They are obviously making an awful lot of revenue.
> everyone seems to assume this, but it's not like it's a company run by dummies, or has dummy investors.
It has nothing to do with their management or investors being "dummies" but the numbers are the numbers.
OpenAI has data center rental costs approaching $620 billion, which is expected to rise to $1.4 trillion by 2033.
Annualized revenue is expected to be "only" $20 billion this year.
$1.4 trillion is 70x current revenue.
So unless they execute their strategy perfectly, hit all of their projections, and neither the stock market nor the economy collapses, making a profit in the foreseeable future is highly unlikely.
[1]: "OpenAI's AI money pit looks much deeper than we thought. Here's my opinion on why this matters" - https://diginomica.com/openais-ai-money-pit-much-deeper-we-t...
That's what the investors are chasing, in my opinion.
They are drowning in debt and are going into more and more ridiculous schemes to raise more money.
--- start quote ---
OpenAI has made $1.4 trillion in commitments to procure the energy and computing power it needs to fuel its operations in the future. But it has previously disclosed that it expects to make only $20 billion in revenues this year. And a recent analysis by HSBC concluded that even if the company is making more than $200 billion by 2030, it will still need to find a further $207 billion in funding to stay in business.
https://finance.yahoo.com/news/openai-partners-carrying-96-b...
--- end quote ---
I only ever use it on the high reasoning mode, for what it's worth. I'm sure it's even less of a problem if you turn it down.
The OpenAI token limits seem more generous than the Anthropic ones too.
I think it is widely accepted that Anthropic is doing very well in enterprise adoption of Claude Code.
In most of those cases that is paid via API key, not by subscription, so the business model works differently: it doesn't rely on low-usage users subsidizing high-usage users.
OTOH OpenAI is way ahead on consumer usage - which also includes Codex even if most consumers don't use it.
I don't think it matters - just make use of the best model at the best price. At the moment Codex 5.2 seems best at the mid-price range, while Opus seems slightly stronger than Codex Max (but too expensive to use for many things).
Experiencing that repeatedly motivated me to use it as a reviewer (which another commenter noted), a role which it is (from my experience) very good at.
I basically use it to drive Claude Code, which will nuke the codebase with abandon.
I had it whip this up to try and avoid this, while still running it in yolo mode (which is still not recommended).
https://gist.github.com/fragmede/96f35225c29cf8790f10b1668b8...
I asked codex to take a look. It took a couple minutes, but it managed to track the issue down using a bunch of tricks I've never seen before. I was blown away. In particular, it reran qemu with different flags to get more information about a CPU fault I couldn't see. Then got a hex code of the instruction pointer at the time of the fault, and used some tools I didn't know about to map that pointer to the lines of code which were causing the problem. Then took a read of that part of the code and guessed (correctly) what the issue was. I guess I haven't worked with operating systems much, so I haven't seen any of those tricks before. But, holy cow!
It's tempting to just accept the help and move on, but today I want to go through what it did in detail, including all the tools it used, so I can learn to do the same thing myself next time.
edit: username joke, don't get me banned
(unrelated, but piggybacking on requests to reach the teams)
If anyone from OpenAI or Google is reading this, please continue to make your image editing models work with the "previz-to-render" workflow.
Image edits should strongly infer pose and blocking as an internal ControlNet, but should be able to upscale low-fidelity mannequins, cutouts, and plates/billboards.
OpenAI kicks ass at this (but could do better with style controls - if I give a Midjourney style ref, use it):
https://imgur.com/gallery/previz-to-image-gpt-image-1-x8t1ij...
https://imgur.com/a/previz-to-image-gpt-image-1-5-3fq042U
Google fails the tests currently, but can probably easily catch up:
One surprising thing that codex helped with is procrastination. I'm sure many people had this feeling when you have some big task and you don't quite know where to start. Just send it to Codex. It might not get it right, but it's almost always good starting point that you can quickly iterate on.
And same thought here, both about procrastination from not knowing where to start and about getting stuck in the middle and not knowing where to go. That literally never happens anymore. Having discussions with it for the planning and the different implementation options, you get to the end with a good design description, and then, what's the point of writing the code yourself when, with that design, it's going to write it quickly and match what was agreed?
(here I am remembering a time I had no computer and would program data structures in OCaml with pen and paper, then would go to university the next day to try it. Often times it worked the first try)
> Emil concluded his article like this:
> JustHTML is about 3,000 lines of Python with 8,500+ tests passing. I couldn’t have written it this quickly without the agent.
> But “quickly” doesn’t mean “without thinking.” I spent a lot of time reviewing code, making design decisions, and steering the agent in the right direction. The agent did the typing; I did the thinking.
> That’s probably the right division of labor.
> I couldn’t agree more. Coding agents replace the part of my job that involves typing the code into a computer. I find what’s left to be a much more valuable use of my time.
If you leave an agent for hours trying to increase coverage by percentage without further guiding instructions you will end up with lots of garbage.
In order to achieve this, you need several distinct loops. One that creates tests (there will be garbage), one that consolidates redundant tests, one that parametrizes repetitive tests, and so on.
Agents create redundant tests for all sorts of reasons. Maybe they're trying a hard to reach line and leave several attempts behind. Or maybe they "get creative" and try to guess what is uncovered instead of actually following the coverage report, etc.
Less capable models are actually better at doing this. They're faster, don't "get creative" with weird ideas mid-task, and cost less. Just make them work one test at a time. Spawn, do one test that verifiably increases overall coverage, exit. Once you reach a threshold, start the consolidating loop: pick a redundant pair of tests, consolidate, exit. And so on...
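A rough sketch of what that orchestration can look like, assuming a non-interactive agent CLI you can invoke per step and a Python project measured with coverage.py (the `my-agent exec` command and the 90% threshold are hypothetical placeholders):

    import json
    import subprocess

    AGENT_CMD = ["my-agent", "exec"]  # hypothetical non-interactive agent CLI

    def coverage_percent() -> float:
        # Run the suite under coverage.py and read back the overall percentage.
        subprocess.run(["coverage", "run", "-m", "pytest", "-q"], check=False)
        subprocess.run(["coverage", "json"], check=True)
        with open("coverage.json") as f:
            return json.load(f)["totals"]["percent_covered"]

    def run_agent(prompt: str) -> None:
        # One short-lived spawn: do one narrow task, then exit.
        subprocess.run(AGENT_CMD + [prompt], check=False)

    # Loop 1: one verifiable test per spawn, following the coverage report.
    current = coverage_percent()
    while current < 90.0:
        run_agent("Read coverage.json and add exactly ONE test that covers a "
                  "currently uncovered line, then stop.")
        new = coverage_percent()
        if new <= current:
            break  # no measurable gain; stop before garbage piles up
        current = new

    # Loop 2: consolidation passes, one redundant pair at a time.
    for _ in range(20):
        run_agent("Find ONE pair of redundant tests and consolidate them into "
                  "a single parametrized test without losing coverage, then stop.")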
Of course, you can use a powerful model and babysit it as well. A few disambiguating questions and interruptions will guide them well. If you want true unattended though, it's damn hard to get stable results.
Skill issue... And perhaps the wrong model + harness
The only hint you can dig out is where they might have feasibility limits around it. E.g. "I can fly first class all the time (if I limit the number of flights and spend an unreasonable portion of my wealth on tickets)" is typically a less useful interpretation than "I can fly first class all the time (frequently, without concern, because I'm very well off)", but you have to figure out which one they are trying to say (which isn't always easy).
It's so fascinating to me that the thread above this one on this page says the opposite, and the funniest thing is I'm sure you're both right. What a wild world we live in, I'm not sure how one is supposed to objectively analyse the performance of these things
This is very similar to how humans behave. Most people are great at a small number of things, and there's always a larger set of things that we may individually be pretty terrible at.
The bots are the same way, except: instead of billions of people who each have their own skillsets and personalities, we've got a small handful of distinct bots from different companies.
And of course: Lies.
When we ask Bob or Lisa to help with a thing that they don't understand very well, they usually will try to set reasonable expectations. ("Sorry, ssl-3, I don't really understand ZFS very well. I can try to get the SLOG -- whatever that is -- to work better with this workload, but I can't promise anything.")
Bob or Lisa may figure it out. They'll gather up some background and work on it, bring in outside help if that's useful, and probably tread lightly. This will take time. But they probably won't deliberately lie [much] about what they expect from themselves.
But when the bot is asked to do a thing that it doesn't understand very well, it's chipper as fuck about it. ("Oh yeah! Why sure I can do that! I'm well-versed in -everything-! [Just hold my beer and watch this!]")
The bot will then set forth to do the thing. It might fuck it all up with wild abandon, but it doesn't care: It doesn't feel. It doesn't understand expectations. Or cost. Or art. Or unintended consequences.
Or, it might get it right. Sometimes, amazingly-right.
But it's impossible to tell going in whether it's going to be good, or bad: Unlike Bob or Lisa, the bot always heads into a problem as an overly-ambitious pack of lies.
(But the bot is very inexpensive to employ compared to Bob or Lisa, so we use the bot sometimes.)
A full week of that should give you a pretty good idea
Maybe some models just suit particular styles of prompting that do or don't match what you're doing
Like, do you run a proper experiment where you hand the same task to multiple models several times and compare the results? Not snark by the way, I’m asking in earnest how you pick one model over another.
This is what I do. I have a little TUI that fires off Claude Code, Codex, Gemini, Qwen Coder and AMP in separate containers for most tasks I do (although I've started to use AMP less and less), and it either returns the last message of what they replied and/or a git diff of what exactly they did. Then I compare them side by side. If all of them got something wrong, I update the prompt and fire them off again. Always starting from zero, and always including the full context of what you're doing in the first message; they're all non-interactive sessions.
Sometimes I do 3x Codex instead of different agents, just to double-check that all of them would do the same thing. If they go off and do different things from each other, I know the initial prompt isn't specific/strict enough, and again iterate.
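For anyone who wants a poor man's version of that setup, something like this works without the TUI (I'm using throwaway clones instead of containers, and the non-interactive flags for each CLI are from memory and may differ by version, so treat them as placeholders):

    import subprocess
    import tempfile
    from pathlib import Path

    # Non-interactive invocations; verify the flags against your installed versions.
    AGENTS = {
        "claude": ["claude", "-p"],
        "codex":  ["codex", "exec"],
        "gemini": ["gemini", "-p"],
    }

    PROMPT = Path("task.md").read_text()  # full context up front, every time
    REPO = Path.cwd()

    for name, cmd in AGENTS.items():
        workdir = Path(tempfile.mkdtemp(prefix=f"bakeoff-{name}-"))
        # Fresh clone so every agent starts from zero, isolated from the others.
        subprocess.run(["git", "clone", "--quiet", str(REPO), str(workdir)], check=True)
        subprocess.run(cmd + [PROMPT], cwd=workdir, check=False)
        diff = subprocess.run(["git", "diff"], cwd=workdir,
                              capture_output=True, text=True).stdout
        Path(f"{name}.diff").write_text(diff)
        print(f"{name}: {len(diff.splitlines())} diff lines, saved to {name}.diff")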
Honestly, I'd love to try that. My Gmail username is the same as my HN username.
- Claude
- Codex
- Kilocode
- Amp
- Mistral Vibe
Very vibe coded though.
GPT-5.2 Thinking (with extended thinking selected) is significantly better in my testing on software problems with 40k context.
I attribute this to thinking time, with GPT-5.2 Thinking I can coax 5 minutes+ of thinking time but with Gemini 3.0 Pro it only gives me about 30 seconds.
The main problem with the Plus sub in ChatGPT is you can't send more than 46k tokens in a single prompt, and attaching files doesn't help either because the VM blocks the model from accessing the attachments if there's ~46k tokens already in the context.
Gemini 3 and Gemini 3 Flash identified the root cause and nailed the fix. GPT 5.1 Codex misdiagnosed the issue and attempted a weird fix despite my prompt saying “don’t write code, simply investigate.”
I run these tests regularly, and Codex has not impressed me. Not even once. At best it’s on par, but most of the time it just fails miserably.
Languages: JavaScript, Elixir, Python
The codex agent ran for a long time and created and executed a bunch of python scripts (according to the output thinking text) to compare the translations and found a number of possible issues. I am not sure where the scripts were stored or executed, our project doesn't use python.
Then I fed the output of the issues codex found to claude for a second "opinion". Claude said that the feedback was obviously from someone that knew the native language very well and agreed with all the feedback.
I was really surprised at how long Codex was thinking and analyzing - probably 10 minutes. (This was ~1+mo ago, I don't recall exactly what model)
Claude is pretty decent IMO - amp code is better, but seems to burn through money pretty quick.
So yeah, I use codex a lot and like it, but it has some really bad blind spots.
Heh. It's about the same as an efficient compilation or integration testing process that is long enough to let it do its thing while you go and browse Hacker News.
IMHO, making feedback loops faster is going to be key to improving success rates with agentic coding tools. They work best if the feedback loop is fast and thorough. So compilers, good tests, etc. are important. But it's also important that all of that runs quickly. It's almost an even split between reasoning and tool invocations for me. And it is rather trigger-happy with the tool invocations, wasting a lot of time finding out that a naive approach was indeed naive before fixing it over several iterations. Good instructions help (Agents.md).
Focusing attention on just making builds fast and solid is a good investment in any case. Doubly so if you plan on using agentic coding tools.
The key is to adapt to this by learning how to parallelize your work, instead of the old way of doings things where devs are expected to focus on and finish one task at a time (per lean manufacturing principles).
I find now that painfully slow builds are no longer a serious issue for me. Because I'm rotating through 15-20 agents across 4-6 projects so I always have something valuable to progress on. One of these projects and a few of these agents are clear priorities I return to sooner than the others.
The Roomba effect is real. The AI models do all the heavy implementation work, and when one asks me to set up and execute tests, I feel obliged to get to it ASAP.
It ships with 300+ MCP tools (crawl, Google search, Gmail/GCal/GDrive, Slack, scheduling, web indexing, embeddings, transcription, and more). Many came from tools I originally built for Claude Desktop—OpenAI’s MCP has been stable across 20+ versions so I prefer it.
I will note I usually run this in Danger mode but because it runs in a container, it doesn't have access to ENVs I don't want it messing with, and have it in a directory I'm OK with it changing or poking about in.
Headless browser setup for the crawl tools: https://github.com/DeepBlueDynamics/gnosis-crawl.
My email is in my profile if anyone needs help.
My only gripe is I wish they'd publish Codex CLI updates to homebrew the same time as npm :)
Claude still tends to add "fluff" around the solution and over-engineer, not that the code doesn't work, it's just that it's ugly
I watched a man struggle for 3 hours last night to "prove me wrong" that Gemini 3 Pro can convert a 3000 line C program to Python. It can't do it. It can't even do half of it, it can't understand why it can't, it's wrong about what failed, it can't fix it when you tell it what it did wrong, etc.
Of course, in the end, he had an 80 line Python file that didn't work, and if it did work, it's 80 lines, of course it can't do what the 3000 line C program is doing. So even if it had produced a working program, which it didn't, it would have produced a program that did not solve the problems it was asked to solve.
The AI can't even guess that its 80-line output is probably wrong just based on the size of it, like a human instantly would.
AI doesn't work. It hasn't worked, and it very likely is not going to work. That's based on the empirical evidence and repeatable tests we are surrounded by.
The guy last night said that sentiment was copium, before he went on a 3 hour odyssey that ended in failure. It'll be sad to watch that play out in the economy at large.
I'm not sure what your "empirical evidence and repeatable tests" is supposed to be. The AI not successfully converting a 3000 line C program to Python, in a test you probably designed to fail, doesn't strike me as particularly relevant.
Also, I suspect that AI could most likely guess that 80 lines of Python aren't correctly replicating 3000 lines of C, if you prompted it correctly.
For some definition of "works". This seems to be yours:
> I'd go further and say vibe coding it up, testing the green case, and deploying it straight into the testing environment is good enough. The rest we can figure out during testing, or maybe you even have users willing to beta-test for you.
> This way, while you're still on the understanding part and reasoning over the code, your competitor already shipped ten features, most of them working.
> Ok, that was a provocative scenario. Still, nowadays I am not sure you even have to understand the code anymore. Maybe having a reasonable belief that it does work will be sufficient in some circumstances.
It's interesting how this workflow appears to almost offend some users here.
I get it, we all don't like sloppy code that does not work well or is not maintainable.
I think some developers will need to learn to give control away rather than trying to understand every line of code in their project - depending of course on the environment and use case.
Also worth keeping in mind that even if you think you understand all the code in your project - as far as that is even possible in larger projects with multiple developers - there are still bugs anyway. And a few months later, your memory might be fuzzy in any case.
Given my personal experience, and how much more productive AI has made me, it seems to me that some people are just using it wrong. Either that, or I'm delusional, and it doesn't actually work for me.
> "In parallel, we’re piloting invite-only trusted access to upcoming capabilities and more permissive models for vetted professionals and organizations focused on defensive cybersecurity work. We believe that this approach to deployment will balance accessibility with safety."
Sure, you can also use the same tools to find attack surfaces preemptively, but let's be honest, most wouldn't.
Scary good.
But the good ones are not open. It's not even a matter of money. I know at OpenAI they are invite only for instance. Pretty sure there's vetting and tracking going on behind those invites.
So far it's rarely been the leading frontier model, but at least it's not full of dumb guardrails that block many legitimate use cases in order to prevent largely imagined harm.
You can also use Grok without sign in, in a private window, for sensitive queries where privacy matters.
A lot of liberals badmouth the model for obviously political reasons, but it's doing an important job.
Most of the time spent in vulnerability analysis is automatable grunt work. If you can just take that off the table, and free human testers up to think creatively about anomalous behavior identified for them, you're already drastically improving effectiveness.
We've come a long way since gpt-3.5, and it's rewarding to see people who are willing to change their cached responses
Especially in the CLI, it seems it's so eager to start writing code that nothing can stop it, not even the best Agents.md.
Asking it a question or telling it to check something doesn't mean it should start editing code, it means answer the question. All models have this issue to some degree, but Codex is the worst offender for me.
I see people gushing over these codex models but they seem worse than the big gpt models in my own actual use (i.e. I'll give the same prompt to gpt-5.1 and gpt-5.1-codex and codex will give me functional but weird/ugly code, whereas gpt-5.1 code is cleaner)
I feel the same. CodexTheModel (why have two things named the same way?!) is a good deal faster than the other models, and probably on the "fast/accuracy" scale it sits somewhere else, but most code I want to be as high quality as possible, and the base models do seem better at that than CodexTheModel.
What has somewhat worked for me atm is to ask it to only update an .md plan file and act on that file only; that seems to appease its eagerness to write files.
Yeah, this makes sense. There's a fine line between good enough to do security research and good enough to be a prompt kiddie on steroids. At the same time, aligning the models for "safety" would probably make them worse overall, especially when dealing with security questions (i.e. analyse this code snippet and provide security feedback / improvements).
At the end of the day, after some KYC I see no reason why they shouldn't be "in the clear". They get all the positive news (i.e. our gpt666-pro-ultra-krypto-sec found a CVE in openBSD stable release), while not being exposed to tabloid style titles like "a 3 year old asked chatgpt to turn on the lights and chatgpt hacked into nasa, news at 5"...
> GPT‑5.2-Codex has stronger cybersecurity capabilities than any model we’ve released so far. These advances can help strengthen cybersecurity at scale, but they also raise new dual-use risks that require careful deployment.
I'm curious what they mean by the dual-use risks.
Humans are very suggestible.
It was indeed!
> very likely I just read it
I find myself doing it frequently. I'm even sure that's why I used indeed above after reading your comment - it wasn't intentional!
curious if this, or coincidence
"The most advanced agentic coding model for professional software engineers"
If you want to compare the weakest models from both companies then Gemini Flash vs GPT Instant would seem to be best comparison, although Claude Opus 4.5 is by all accounts the most powerful for coding.
In any case, it will take a few weeks for any meaningful test comparisons to be made, and in the meantime it's hard not to see any release from OpenAI since they announced "Code Red" (aka "we're behind the competition") a few days ago as more marketing than anything else.
Gemini 3 Pro is a great foundation model. I use as a math tutor, and it's great. I previously used Gemini 2.5 Pro as a math tutor, and Gemini 3 Pro was a qualitative improvement over that. But Gemini 3 Pro sucks at being a coding agent inside a harness. It sucks at tool calling. It's borderline unusable in Cursor because of that, and likely the same in Antigravity. A few weeks ago I attended a demo of Antigravity that Google employees were giving, and it was completely broken. It got stuck for them during the demo, and they ended up not being able to show anything.
Opus 4.5 is good, and faster than GPT-5.2, but less reliable. I use it for medium difficulty tasks. But for anything serious—it's GPT 5.2
Just yesterday, in Antigravity, while applying changes, it deleted 500 lines of code and replaced it with a `<rest of code goes here>`. Unacceptable behavior in 2025, lol.
again I’m not saying Codex is worse, they’re just different and claiming the only one you actively use is the best is a stretch
edit: also FWIW, I initially dismissed Claude Code at launch, then loved Codex when it released. never really liked Cursor. now I primarily use Claude Code given I found Codex slow and less “reliable” in a sense, but I try to try all 3 and keep up with the changes (it is hard)
Such as?
> again I’m not saying Codex is worse, they’re just different and claiming the only one you actively use is the best is a stretch
I am testing all models in Cursor.
> I initially dismissed Claude Code at launch, then loved Codex when it released. never really liked Cursor
I also don't actually like Cursor. It's a VSCode fork, and a mediocre harness. I am only using it because my company refuses to buy anything else, because Cursor has all models, and it appears to them that it's not worth having anything else.
> Such as?
changelog is here: https://github.com/anthropics/claude-code/blob/main/CHANGELO...
glhf
btw you started this thread with pure vibes, no evidence:
> I can confirm GPT 5.2 is better than Gemini and Claude. GPT 5.2 Codex is probably even better.
I’m saying you’re wrong. N=2, 1 against 1, one of us is making a much less bold claim
> “prompting”/harness that improves how it actually performs
Is an abstract statement without any meaningful details.
I had assumed OpenAI was irrelevant, but 5.1 has been so much better than Gemini.
On top of that, the Codex CLI team is responsive on github and it's clear that user complaints make their way to the team responsible for fine tuning these models.
I run bake-offs between all three models, and GPT 5.2 generally has a higher success rate at implementing features, followed closely by Opus 4.5 and then Gemini 3, which has trouble with agentic coding. I'm interested to see how 5.2-codex behaves. I haven't been a fan of the codex models in general.
(Also, I can't imagine who is blessed with so much spare time that they would look down on an assistant that does decent work)
Yeah, it feels really strange sometimes. Bumping up against something that Codex seemingly can't work out, and you give it to Claude and suddenly it's easy. And you continue with Claude and eventually it gets stuck on something, and you try Codex which gets it immediately. My guess would be that the training data differs just enough for it to have an impact.
But if you want that last 10%, codex is vital.
Edit: Literally after I typed this, just had this happen. Codex 5.2 reports a P1 bug in a PR. I look closely; I'm not actually sure it's a "bug". I take it to Claude. Claude agrees it's more of a product behavioral opinion on whether or not to persist garbage data, and offers its own product opinion that I probably want to keep it the way it is. Codex 5.2, meanwhile, stubbornly accepts the view that it's a product decision but won't seem to offer its own opinion!
It's because performance degrades over longer conversations, which decreases the chance that the same conversation will result in a solution, and increases the chance that a new one will. I suspect you would get the same result even if you didn't switch to a different model.
They just have different strengths and weaknesses.
Which is analogous to taking your problem to another model and ideally feeding it some sorta lesson.
I guess this is a specific example, but one I play out a lot. Starting fresh with the same problem is unusual for me; there's usually a lesson I'm feeding it from the start.
- Planning mode. Codex is extremely frustrating. You have to constantly tell it not to edit when you talk to it, and even then it will sometimes just start working.
- Better terminal rendering (Codex seems to go for a "clean" look at the cost of clearly distinguished output)
- It prompts you for questions using menus
- Sub-agents don't pollute your context
Since nobody (other than that paper) has been trying to measure output, everything is based on feelings and fashion, like you say.
I'm still raw dogging my code. I'll start using these tools when someone can measure the increase in output. Leadership at work is beginning to claim they can, so maybe the writing is on the wall for me. They haven't shown their methodology for what they are measuring, just telling everyone they "can tell"
But until then, I can spot too many psychological biases inherent in their use to trust my own judgement, especially when the only real study done so far on this subject shows that our intuition lies about this.
And in the meantime, I've already lost time investigating reasonable looking open source projects that turned out to be 1) vibe coded and 2) fully non functional even in the most trivial use. I'm so sick of it. I need a new career
Just safety nerds being gatekeepers.
They did the same thing for gpt-5.1-codex-max (code name “arcticfox”), delaying its availability in the API and only allowing it to be used by monthly plan users, and as an API user I found it very annoying.
This is a privacy and security risk. Your code diffs and prompts are there (seemingly) forever. Best you can do is "archive" them, which is a fancy word for "put it somewhere else so it doesn't clutter the main page".
I use it because it works out cheaper than Codex Cloud and gives you greater flexibility. Although it doesn't have 5.2-codex yet.
Unsure where that could be if you're using Windows.
You know what would be fun to try? Give Codex full access and then ask it to delete that folder, lol.
Then again, I wouldn't put much trust into OpenAI's handling of information either way.
> [ADD/LINK TO ROLLOUT THAT DISCOVERED VULNERABILITY]
What’s up with these in the article?
The models are so good, unbelievably good. And getting better weekly, including on pricing.
Translation: "Hey y'all! Get ready for a tsunami of AI-generated CVEs!"
I find it incorrectly pattern matches with a very narrow focus, and it will ignore real documented differences even when explicitly highlighted in the prompt text (this is X crdt algo, not Y crdt algo).
I've canceled my subscription, the idea that on any larger edits it will just start wrecking nuance and then refuse to accept prompts that point this out is an extremely dangerous form of target fixation.
I have https://gist.github.com/fragmede/96f35225c29cf8790f10b1668b8... as a guard against that, for anyone that's stupid enough like me to run it in yolo mode and wants to copy it.
Codex also has command line options so you can specifically prohibit running rm in bash, so look those up too.
I can imagine what the vetting looks like: The professionals are not allowed to disclose that the models don't work.
EDIT: It must really hurt that ORCL is down 40% from its high due to overexposure in OpenAI.
Devstral Small 2 Instruct running locally seems about as capable, with the upside that when it's wrong, it's very obvious instead of covering it in bullshit.
What about 2 weeks before Christmas?
But Opus 4.5 exists.