Another pattern I’m noticing is strong advocacy for Opus, but that requires at least the 5x plan, which costs about $100 per month. I’m on the ChatGPT $20 plan, and I rarely hit any limits while using 5.2 on high in codex.
For agent/planning mode, that's the only one that has seemed reasonably sane to me so far, not that I have any broad experience with every model.
Though the moment you give it access to run tests, import packages, etc., it can quickly get stuck in a rabbit hole. It tries to run a test with "&& sleep" appended; on Mac that sleep doesn't exist, so it interprets the failure as the test stalling, then just goes completely bananas.
It really lacks the "ok I'm a bit stuck, can you help me out a bit here?" prompt. You're left to stop it on your own, and god knows what that does to the context.
I thought it was just me. What I found was that they put in the extra bonus capacity at the end of December, but I felt like I was consuming quota at the same rate as before. And then afterwards consuming it faster than before.
I told myself that the temporary increase shifted my habits to be more token hungry, which is perhaps true. But I am unsure of that.
When I ask simple programming questions in a new conversation it can generally figure out which project I'm going to apply it to, and write examples catered to those projects. I feel that it also makes the responses a bit more warm and personal.
Occasionally it will pop up saying "memory updated!" when you tell it some sort of fact. But hardly ever. And you can go through the memories and delete them if you want.
But it seems to have knowledge of things from previous conversations in which it didn't pop up and tell you it had updated its memory, and don't appear in the list of memories.
So... how is it remembering previous conversations? There is obviously a second type of memory that they keep kind of secret.
The first one refers to the "memory updated" pop-up and its bespoke list of memories; the second one likely refers to some RAG systems for ChatGPT to get relevant snippets of previous conversations.
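If it's the second kind, the standard pattern would be embedding-based retrieval over snippets of past conversations. A minimal sketch of what that could look like; to be clear, none of this is confirmed OpenAI behavior, and the model choice and snippets are my own placeholders:

    # Hypothetical sketch of conversation-history RAG; not confirmed
    # OpenAI behavior, just the standard retrieval pattern it resembles.
    import numpy as np
    from openai import OpenAI

    client = OpenAI()

    def embed(text: str) -> np.ndarray:
        # text-embedding-3-small is a real OpenAI embedding model, used
        # here as a stand-in for whatever they run internally.
        resp = client.embeddings.create(model="text-embedding-3-small",
                                        input=text)
        return np.array(resp.data[0].embedding)

    # Pretend store of snippets distilled from earlier conversations.
    past_snippets = [
        "User is building a Rails app with a Postgres backend.",
        "User prefers terse answers with code examples.",
    ]
    index = [(s, embed(s)) for s in past_snippets]

    def retrieve(query: str, k: int = 2) -> list[str]:
        q = embed(query)
        def cos(a, b):
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        ranked = sorted(index, key=lambda p: -cos(q, p[1]))
        return [s for s, _ in ranked[:k]]

    # Top snippets get silently prepended to the prompt, which would explain
    # project-aware answers with no "memory updated" popup and no listed memory.
    print(retrieve("How do I add a background job to my app?"))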
It can also be terse and cold, while also somewhat-malleably insistent -- like an old toolkit in the shed.
It's all tunable.
What the hell are people doing that burns through that token limit so fast?
So it worked, but I didn't happily pay. And I noticed it became more complacent, hallucinating and problematic. I might consider trying out ChatGPT's newer models again. Coding and technical projects didn't feel like its strong suit. Maybe things have changed.
Though granted it comes in ~4 hour blocks and it is quite easy to hit the limit if executing large tasks.
Also worth considering that mileage varies because we all use agents differently, and what counts as a large workload is subjective. I am simply sharing my experience from using both Claude and Codex daily. For all we know, they could be running A/B tests, and we could both be right.
interesting
My personal opinion is that while smut won't hurt anyone in and of itself, LLM smut will have weird and generally negative consequences, as it will be crafted specifically for you on top of the intermittent-reinforcement component of LLM generation.
The sheer amount and variety of smut books (just books) is vastly larger than anyone wants to realize. We passed the mark decades ago where there is smut available for any and every taste. Like, to the point that even LLMs are going to take a long time to put a dent in the smut market. Humans have been making smut for longer than we've had writing.
Again, I don't think you're wrong, but the scale of the problem is way distorted.
That’s where the danger may lie.
This is spherical-cows territory though, so it's only good for setting out a bird's-eye view of principles.
Alien 2: "AI generated porn"
Does that exist yet? I don't think so.
The man's probably thinking something up though. "Pounded in the butt by Microslop Agentic Studio 2026" has a ring.
[1] https://www.amazon.com/Sentient-Lesbian-Em-Dash-Punctuation-... [2] https://www.amazon.com/Last-Algorithm-Pounded-Claimed-Sun-Ti...
Looked at the cover and saw “From Two Time Hugo Award Finalist Chuck Tingle”.
There’s no way that’s true. But I did a quick search anyway, and holy shit!
https://www.thehugoawards.org/hugo-history/2016-hugo-awards/...
https://www.thehugoawards.org/hugo-history/2017-hugo-awards/...
The story behind it:
https://www.quora.com/How-did-Chuck-Tingle-become-a-Hugo-Awa...
https://archive.ph/20160526154656/http://www.vox.com/2016/5/...
Cheap, unlimited access to things that were always scarce during human evolution creates an 'evolutionary mismatch': we never evolved natural satiety mechanisms for them, so unlimited access overwhelms us.
Certainly they had neither the quantity nor ease of access that we do.
Ankles -> knees -> jazz -> voting -> rock -> no-fault divorce -> Tinder -> polyamory discourse on airplanes. It's a joke, but also sort of how cultural change actually propagates. The collapse did happen, just not of morals: of enforcement. After that, everything is just people discovering the rules were optional all along. Including money.
Nevertheless, here is an example of Victorian anxiety regarding showing ankles: https://archive.org/details/amanualpolitene00pubgoog/page/n2...
It's easy to say "oh they were silly to worry about such things." But that's only because we see it from our own point of view.
Alternatively, imagine describing roads, highways, traffic congestion and endless poles strung with electrical wire all over the place to someone from the 11th century. This would sound like utter ruination of the land to them. But you and I are used to it, so it just seems normal.
It's important to note that the vast majority of such books are written for a female audience, though.
Why, the AIs after they've gained sentience, of course.
> while smut won't hurt anyone in and of itself
"Legacy Smut" is well known to cause many kinds of harm to many kind of people, from the participants to the consumers.1. you have to "jailbreak" the model first anyway, which is what's easier to do over API
2. Is the average layman aware of the concept of an "API"? No, unlikely. Apps and web portals are more convenient, which is going to lower the bar to accessing LLM porn.
I trust none of the LLM companies to be safe with my data. ERP with a machine is going to leave some nasty breadcrumbs for some future folks, I bet.
Why is LLM smut supposed to be worse?
I can see why Elon's making the switch from cars. We certainly won't be driving much.
- OpenAI botches the job. Articles are written about the fact that kids are still able to use it.
- Sam “responds” by making it an option to use worldcoin orbs to authenticate. You buy it at the “register me” page, but you will get an equivalent amount of worldcoin at current rate. Afterwards the orb is like a badge that you can put on your shelf to show to your guests.
“We heard you loud and clear. That’s why we worked hard to provide worldcoin integration, so that users won’t have to verify their age through annoying, insecure and fallible means.” (an example marketing blurb would say, implicitly referring to their current identity servicer Persona which people find annoying).
- After enough orb hardware is out in the public, and after the API gains traction for third parties to use it, send a notice that x months from now, login without the orb will not be possible. "Here is a link to the shop page to get your orb, available in colors silver and black."
And what if you are over 18, but don't want to be exposed to that "adult" content?
> Viral challenges that could push risky or harmful behavior
And
> Content that promotes extreme beauty standards, unhealthy dieting, or body shaming
Seem dangerous regardless of age.
Because it seems to me large swaths of the population need some beauty standards
Don't prompt it.
This is also true for stuff like writing clear but concise docs, they're overly verbose while often not getting the point across.
Yes
There are also gangs making money off human trafficking? Does that make it OK for a corporation to make money off human trafficking as well? And there are companies making money off wars?
When you argue with whataboutism, you can just point to whatever you like, and somehow that is an argument in your favor.
Whataboutism is more like "Side A did bad thing", "oh yeah, what about side B and the bad things they have done". It is more just deflection. Using similar/related issues to inform and contextualize the issue at hand can also be overused or abused, but it is not the same as whataboutism, which is rarely productive.
The point being made then is that clearly there's far more to the picture than just "it's addictive" or "it results in various social ills".
Contrast that with your human trafficking example (definitely qualifies as whataboutism). We have clear reasons to want to outlaw human trafficking. Sometimes we fail to successfully enforce the existing regulations. That (obviously) isn't an argument that we should repeal them.
It's not a strange reason. IIRC, most cultures have a culturally understood and tolerated intoxicant. In our culture, that's alcohol.
Human culture is not some strange robotic thing, where the expectation is some kind of hyper-consistency in whatever narrow slice you look at.
We tolerate a recreational drug. Lots of people regularly consume a recreational drug and yet somehow society doesn't split at the seams. We should just acknowledge the reality. I think people would if not for all the "war on drugs" brainwashing. I think what we see is easily explained as it being easier to bury one's head in the sand than it is to give serious thought to ideas that challenge one's worldview or the law.
The point I was making is that it's not odd, unless you're thinking about human culture wrong (e.g. like its somehow weird that broad rules have exceptions).
> Particularly when the primary reason given for regulating other drugs is their addictiveness which alcohol shares.
One, not all addictive drugs are equally addictive. Two, it appears you have a weird waterfall-like idea of how culture develops: like there's some kind of identification of a problematic characteristic (addictiveness), then a comprehensive research program to find all things with that characteristic (all addictive substances), and finally consistent rules set so that they're all treated exactly the same when looked at myopically (allow all or deny all). Human culture is much more organic than that, and it won't look like math or well-architected software. There's a lot more give and take.
I mean here are some obvious complexities that will lead to disparate treatment of different substances:
1. Shared cultural knowledge about how to manage the substance, including rituals for use (this is the big one).
2. Degree of addictiveness and other problematic behavior.
3. Socially positive aspects.
4. Tradition.
What I said I find odd is the way people refuse to plainly call alcohol what it is. You can refer to it as a drug yet still support it being legal. The cognitive inconsistency (i.e. the refusal to admit that it is a drug) is what I find odd.
I also find it odd that we treat substances that the data clearly indicates are less harmful than alcohol as though they were worse. We have alcohol staring us in the face as a counterexample to the claim that such laws are necessary. I think that avoidance of this observation can largely explain the apparent widespread unwillingness to refer to alcohol as a drug.
> One, not all addictive drugs are equally addictive.
Indeed. Alcohol happens to be more addictive than most substances that are regulated on the basis of being addictive. Not all, but most. Interesting, isn't it?
I presume my GP would have no objection to regulating the things the other commenter whatabouted. The inconsistency is with the legislator, not in the GP's arguments.
Like if someone were to say "man we should really outlaw bikes, you can get seriously injured while using one" a reasonable response would be to point out all the things that are more dangerous than bikes that the vast majority of people clearly do not want to outlaw. That is not whataboutism. The point of such an argument might be to illustrate that the proposal (as opposed to any logical deduction) is dead on arrival due to lack of popular support. Alternatively, the point could be to illustrate that a small amount of personal danger is not the basis on which we tend to outlaw such things. Or it could be something else. As long as there's a valid relationship it isn't whataboutism.
That's categorically different than saying "we shouldn't do X because we don't do Y" where X and Y don't actually have any bearing on one another. "Country X shouldn't persecute group Y. But what about country A that persecutes group B?" That's a whataboutism. (Unless the groups are somehow related in a substantial manner or some other edge case. Hopefully you can see what I'm getting at though.)
I disagree. It is in fact not a reasonable argument, it is not even a good argument. It is still whataboutism. There are way better arguments out there, for example:
Bicycles are in fact regulated, and if anything these regulations are too lax, as most legislators are categorizing unambiguous electric motorcycles as bicycles, allowing e-motorcycle makers to market them to kids and teenagers that should not be riding them.
Now as for the whatabout-cars argument: if you compare car injuries to bicycle injuries, the former are of a completely different nature; by far most bicycle injuries will heal, which is not true of car injuries (especially car injuries involving a victim on a bicycle). So talking about other things that are more dangerous is playing into your opponent's arguments, when there is in fact no reason to do that.
If the point being made is "people don't generally agree with that position" it is by definition not whataboutism. To be whataboutism the point being made is _required_ to be nil. That is, the two things are not permitted to be related in a manner that is relevant to the issue being discussed.
Now you might well disagree with the point being made or the things being extrapolated from it. The key here is merely whether or not such a point exists to begin with. Observing that things are not usually done a certain way can be valid and relevant even if you yourself do not find the line of reasoning convincing in the end.
Contrast with my example about countries persecuting groups of people. In that case there is no relevant relation between the acts or the groups. That is whataboutism.
So too your earlier example involving human trafficking. The fact that enforcement is not always successful has no bearing (at least in and of itself) on whether or not we as a society wish to permit it.
BTW when I referred to danger there it wasn't about cars. I had in mind other recreational activities such as roller blading, skateboarding, etc. Anything done for sport that carries a non-negligible risk of serious injury when things go wrong. I agree that it's not a good argument. It was never meant to be.
The majority of illegal drugs aren't addictive, and people are already addicted to the addictive ones. Drug laws are a "social issue" (Moral Majority-influenced), not intended to help people or prevent harm.
That is terrible.
We have to do something.
This is something.
We must do it.
In terms of harm, current drug laws fail everyone but teetotallers who want everyone else to have a miserable life too.
You think teetotallers have miserable lives? Come on.
Worked for gambling.
(Not saying this as a message of support. I think legalizing/normalizing easy app-based gambling was a huge mistake and is going to have an increasingly disastrous social impact).
US prohibition of alcohol, and the largely performative "war on drugs", showed what criminalization does (it empowers, finances, and radicalises the criminals).
Portugal's decriminalisation, partial legalisation of weed in the Netherlands, and legalisation in some American states and Canada prove that legal businesses can provide the same services to society better and more safely, and at a lower societal and health cost.
And then there's the opioid addiction scandal in the US. Don't tell me it's the result of legalisation.
Legalisation of some classes of drugs (like LSD, mushrooms, etc.) would do much more good than bad.
Conversely, unrestricted LLMs are available to everyone already. And prompting SOTA models to generate the most hardcore smut you can imagine is also possible today.
You’re stretching it big time. The situation in the Netherlands caused the rise of drug tourism, which isn’t exactly great for locals, nor does it stop crime or contamination.
https://www.dutchnews.nl/2022/11/change-starts-here-amsterda...
https://www.theguardian.com/world/2025/jan/24/bacteria-pesti...
As for Portugal, decriminalisation does not mean legalisation. Drugs are still illegal; it's just that possession is no longer a crime, and there are places where you can safely shoot up harder drugs, but the goal is still for people to leave them.
Portugal's success regarding drugs wasn't about the free market. It was about treating addicts like victims or patients rather than criminals, it actually took a larger investment from the state and the benefits of that framework dissolved once budgets were cut.
For me, letting people mindlessly vibecode apps and then pretend this code can serve a purpose for others - this is what's truly unsafe.
Pornographic text in LLM? Come on.
I've seen four startups make bank on precisely that.
I’m guessing age is needed to serve certain ads and the like, but what’s the value for customers?
The "Easter Bunny" has always seemed creepy to me, so I started writing a silly song in which the bunny is suspected of eating children. I had too many verses written down and wanted to condense the lyrics, but found LLMs telling me "I cannot help promote violence towards children." Production LLM services would not help me revise this literal parody.
Another day I was writing a romantic poem. It was abstract and colorful, far from a filthy limerick. But when I asked LLMs for help encoding a particular idea sequence into a verse, the models refused (except for grok, which didn't give very good writing advice anyway.)
Believe me, the Mac deserved it.
ClosedAI just wants a piece of the casual user too.
> If [..] you are under 18, ChatGPT turns on extra safety settings. [...] Some topics are handled more carefully to help reduce sensitive content, such as:
- Graphic violence or gore
- Viral challenges that could push risky or harmful behavior
- Sexual, romantic, or violent role play
- Content that promotes extreme beauty standards, unhealthy dieting, or body shaming
Linus about the Tux mascot:
> But this wasn't to be just any penguin. Above all, Linus wanted one that looked happy, as if it had just polished off a pitcher of beer and then had the best sex of its life.
Linus about free software:

> Software is like sex; it's better when it's free.

Unironically, if they look disheveled it's because they are indeed coomers behind closed doors.
No. Porn has not driven even a fraction of the progress of the internet. Not even close.
- images
- payment systems
- stored video
- banner advertising
- performance-based advertising
- affiliation
- live video
- video chat
- fora
Etc... AI is a very logical frontier for the porn industry.
That's ok.
> The first applications weren't porn-based.
They most definitely were, it is just that you are not aware of it. There runs a direct line from the 1-900 phone industry to the internet adult industry, those guys had money like water and they spent a fortune on these developments. Not all of them worked out but quite a few of them did and as a result those very same characters managed to grab a substantial chunk of early internet commerce.
The internet adult industry is not the same as the internet. And if you're trying to say the internet was developed for the sake of the internet adult industry, you're sounding circular.
Porn and piracy outfits have historically adopted and pushed forward the bleeding edge of the internet. More recently that role has shifted towards the major platforms operated by BigTech. That's only natural though - they've concentrated the economics sufficiently that it makes sense for them.
But even then, take video codecs for example. BigTech develops and then rolls things out to their own infra. Outside of them it's piracy sitting at the bleeding edge of the adoption curve right now. The best current FOSS AV1 encoder is literally developed by the people pirating anime of all things. If it wasn't for them the FOSS reference encoder would still be half assed.
Edit: I've registered just for your comment! Ahaahahaha, cheers! :D
ChatGPT is absolute garbage.
This does verify the idea that OpenAI does not make models sycophantic as attempted subversion, buttering up users so that they use the product more; it's because people actually want AI to talk to them like that. To me, that's insane, but they have to play the market, I guess.
I feel a lot of the "revealed preference" stuff in advertising is similar: advertisers find that if they get past the easier barriers that users put in place, it's actually easier to sell them stuff that, at a higher level, the users do not want.
Drugs make you feel great; in moderation that's perfectly acceptable, constantly not so much.
If you ask me if I want to eat healthy and clean and I respond in the affirmative, it's not a "gotcha" if you bait me with a greasy cheeseburger and then say "you failed the A/B test, demonstrating we know what you actually want better than you do."
A lot of our industry is still based on the assumption that we should deliver to people what they demonstrate they want, rather than what they say they want.
The difference between the responses and the pictures was illuminating, especially in one study in particular - you'd ask people "how do you store your lunch meat" and they say "in the fridge, in the crisper drawer, in a ziploc bag", and when you asked them to take a picture of it, it was just ripped open and tossed in anywhere.
This apparently horrified the lunch meat people ("But it'll get all crusty and dried out!", to paraphrase), which that study and ones like it are the reason lunch meat comes with disposable containers now, or is resealable, instead of just in a tear-to-open packet. Every time I go grocery shopping it's an interesting experience knowing that specific thing is in a small way a result of some of the work I did a long time ago.
A lot of people are lonely and talking to these things like a significant other. They value roleplay instruction-following that creates "immersion." They tell it to be dark and mysterious and call itself a pet name. GPT-4o was apparently their favorite because it was very "steerable." Then the news broke that people were doing this, some of them falling off the deep end with it, so they had to tone back the steerability a bit with 5, and these users seem to say 5 breaks immersion with more safeguards.
I do wonder if they would accept the mirror explanation for men enjoying porn.
The most commonly taken action does not imply people wanted to do it more, or felt happiest doing it. Unless you optimize profit only.
Insane spin you're putting on it. At best, you're a cog in one of the worst recent evolutions of capitalism.
The absolutist position that “all ads are always bad” is a non-starter for me. Especially as long as we exist in a capitalist system. Small business, indie creators, etc. must advertise in some fashion to survive. It’s only the behemoths that could afford to stop doing it (ironically). I’ve never really understood why, e.g. Pepsi and Coke spend so much on advertising: most people already have a preference and I am skeptical that the millions they spend actually moves the needle either way. (“Is Pepsi okay?” “It absolutely is not.”)
When was the last time you saw an ad for something non-digital and stopped everything and bought it, or even made concrete plans to do so later? Probably almost never, right? So why are there still so many ads? More importantly, why is advertising still so profitable?
Because much of the impact of advertising is subconscious imprinting rather than conscious action. Have you ever been in a grocery store, needed to get something, and picked a "random" brand? Yeah, that choice may not have been so random after all. Or perhaps you're sitting at home or work and have a sudden, seemingly unprompted craving for <insert food place>. Yeah, maybe not so unprompted.
What does that say about capitalism?
Messages of that sophistication are always dangerous, and modern advertising is the most widespread example of it.
The hostility is more than justified, I can only hope the whole industry is regulated downwards, even if whatever company I work for sells less.
By demonising them, you are making ads sound way more glamorous than they are.
No it's not
I can't find the particular article (there are a few blogs and papers pointing out the phenomenon; I can't find the one I enjoyed), but it was along the lines of how on LMArena a lot of users tend to pick the "confidently incorrect" model over the "boring-sounding but correct" model.
The average user probably prefers the sycophantic echo chamber of confirmation bias offered by a lot of large language models.
I can't help but draw parallels to the "You are not immune to propaganda" memes. Turns out most of us are not immune to confirmation bias, either.
When 5.2 was first launched, o3 did a notably better job at a lot of analytical prompts (e.g. "Based on the attached weight log and data from my calorie tracking app, please calculate my TDEE using at least 3 different methodologies").
o3 frequently used tables to present information, which I liked a lot. 5.2 rarely does this - it prefers to lay out information in paragraphs / blog post style.
I'm not sure if o3 responses were better, or if it was just the format of the reply that I liked more.
If it's just a matter of how people prefer to be presented their information, that should be something LLMs are equipped to adapt to at a user-by-user level based on preferences.
If anyone is wondering, the setting for this is called Personalisation in user settings.
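On the TDEE prompt upthread: the energy-balance methodology those models apply is simple enough to sanity-check by hand. A minimal sketch; the ~7700 kcal per kg figure is the usual rule-of-thumb conversion, and the inputs are made up:

    # Estimate TDEE from a weight log plus calorie tracking: over a period,
    # TDEE ~= average intake minus the energy stored or released as body mass.
    KCAL_PER_KG = 7700  # rule-of-thumb energy content of body-weight change

    def tdee_from_logs(avg_daily_intake_kcal: float,
                       weight_change_kg: float,
                       days: int) -> float:
        surplus_per_day = weight_change_kg * KCAL_PER_KG / days
        return avg_daily_intake_kcal - surplus_per_day

    # Made-up example: 2400 kcal/day average intake, 1.5 kg lost over 30 days.
    print(round(tdee_from_logs(2400, -1.5, 30)))  # -> 2785 kcal/day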
If a user spends more time on it and comes back, the product team winds up prioritizing whichever pattern was supporting that. It's just a continual selective evolution towards things that keep you there longer, based on what kept everyone else there longer.
You’re not imagining it, and honestly? You're not broken for feeling this—it's perfectly natural as a human to have this sentiment.
Much better feel with the Claude 4.5 series, for both chat and coding.
This is why my heart sank this morning. I have spent over a year training 4.0 to just about be helpful enough to get me an extra 1-2 hours a day of productivity. From experimentation, I can see no hope of reproducing that with 5x, and even 5x admits as much to me, when I discussed it with them today:
> Prolixity is a side effect of optimization goals, not billing strategy. Newer models are trained to maximize helpfulness, coverage, and safety, which biases toward explanation, hedging, and context expansion. GPT-4 was less aggressively optimized in those directions, so it felt terser by default.
Share and enjoy!
Maybe you should consider basing your workflows on open-weight models instead? Unlike proprietary API-only models no one can take these away from you.
Playing with the system prompts, temperature, and max token output dials absolutely lets you make enough headway (with the 5 series) in this regard to demonstrably render its self-analysis incorrect.
Is there anything as good in the 5 series? Likely. But doing the full QA testing again for no added business value, just because the model disappears, is a hard sell. The ones we tested were just slower, or tried to have more personality, which is useless for automation projects.
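For what it's worth, those dials are all plain API parameters, so the surface to re-test is small. A minimal sketch with the OpenAI Python client; the model name is a placeholder for whichever one you validated:

    from openai import OpenAI

    client = OpenAI()

    resp = client.chat.completions.create(
        model="gpt-5.2",  # placeholder; substitute the model you validated
        messages=[
            # A blunt system prompt does a lot to suppress prolixity.
            {"role": "system",
             "content": "Answer in at most three sentences. "
                        "No hedging, no caveats, no follow-up questions."},
            {"role": "user", "content": "Summarize the failed test output."},
        ],
        temperature=0.2,  # lower = more deterministic, less rambling
        max_tokens=150,   # hard cap on output length
    )
    print(resp.choices[0].message.content)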
For instance something simple like: "If I put 10kw in solar on my roof when is the payback given xyz price / incentive / usage pattern."
It used to give a kind of short technical report; now it's a long list of bullets and a very paternalistic "this will never work" kind of negativity. I'm assuming this is the anti-sycophancy at work, but when you're working a problem you have to be optimistic until you get your answer.
For me this usage was a few times a day, for ideas or working through small problems. For code I've been using Claude for at least a year; it just works.
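On the solar question: the simple-payback arithmetic the old short report captured fits in a few lines, which is part of why the bullet-wall format grates. A sketch where every number is a made-up assumption, to be replaced with your own quote, tariff, and incentives:

    # Simple payback for a rooftop PV system. All inputs are assumptions.
    system_kw       = 10
    cost_per_kw     = 2500    # $/kW installed (assumption)
    incentive_pct   = 0.30    # e.g. a 30% tax credit (assumption)
    kwh_per_kw_year = 1400    # site-dependent annual yield (assumption)
    price_per_kwh   = 0.18    # $/kWh of offset usage (assumption)

    net_cost       = system_kw * cost_per_kw * (1 - incentive_pct)
    annual_savings = system_kw * kwh_per_kw_year * price_per_kwh
    print(f"Simple payback: {net_cost / annual_savings:.1f} years")
    # -> Simple payback: 6.9 years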
I've been using Gemini exclusively for the 1 million token context window, but went back to ChatGPT after the raise of the limits and created a Project system for myself which allows me to have much better organization with Projects + only Thinking chats (big context) + project-only memory.
Also, it seems like Gemini is really averse to googling (which is ironic by itself) and ChatGPT, at least in the Thinking modes loves to look up current and correct info. If I ask something a bit more involved in Extended Thinking mode, it will think for several minutes and look up more than 100 sources. It's really good, practically a Deep Research inside of a normal chat.
Not sure if others have seen this...
I could attribute it to:
1. It's a known quantity with the pro models (I recall that the pro/thinking models from most providers were not immediately equipped with web search tools when they were originally released).
2. Google wants you to pay more for grounding via their API offerings vs. including it out of the box
I spent about half an hour trying to coax it in "plan mode" in IntelliJ, and it kept spitting out these generic ideas of what it was going to do, not really planning at all.
And when I asked it to execute the plan.. it just created some generic DTO and said "now all that remains is <the entire plan>".
Absolutely worst experience with an AI agent so far, not to say that my overall experience has been terrific.
1) Our plan for Claude Opus 4.5 "ran out" or something.
Mostly because of how massively varied their releases are. Each one required big changes to how I use and work with it.
Claude is perfect in this sense: all their models feel roughly the same, just smarter, so my workflow is always the same.
Substantial "applied outcomes" regression from 3.7 to 4 but they got right on fixing that.
(I also use Deep Think on Gemini too, and to me, on programming tasks, it's not really worth the money)
ChatGPT 5 ~= Claude > ChatGPT 5.2 > Gemini >> Grok

It's just as good as ever /s
(I'm particularly annoyed by this UI choice because I always have to switch back to 5.1)
Also it's full of bugs, showing JSON all the time while thinking. But still it's my favorite model, so I'm switching back a lot.
The same seems to persist in Codex CLI, where again 5.2 doesn't spend as much time thinking so its solutions never come out as nicely as 5.1's.
That said, 5.1 is obviously slower for these reasons. I'm fine with that trade off. Others might have lighter workloads and thus benefit more from 5.2's speed.
It boggles the mind that "wrong answers only" is no longer just a meme, it's considered a valid cost management strategy in AI.
* Because if they realize we're out here, they'll price discriminate, charging extra for right answers.
You can go to chatgpt.com and ask "what model are you" (it doesn't hallucinate on this).
But how do we know that you did not hallucinate the claim that ChatGPT does not hallucinate its version number?
We could try to exfiltrate the system prompt which probably contains the model name, but all extraction attempts could of course be hallucinations as well.
(I think there was an interview with Sam Altman or someone else at OpenAI where it was mentioned that they hardcoded the model name into the prompt because people did not understand that models don't work like that, so they made it work. I might be hallucinating though.)
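If you want a check that doesn't rely on the model's self-report at all: over the API, the response metadata includes the name of the model that actually served the request, filled in server-side, so it can't be hallucinated. A minimal sketch:

    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",  # whatever alias you request
        messages=[{"role": "user", "content": "Hi"}],
    )
    # resp.model is set by the server (often the exact dated snapshot),
    # unlike anything the model says about itself in the reply text.
    print(resp.model)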
I've heard great things about Mixtral's structured-output capabilities but haven't had a chance to run my evals on them.
If 4.1 is dropped from API that's the first course of action.
Also, the 5 series doesn't have fine-tuning capabilities, and it's unclear how that would work if the reasoning step is involved.
Curious where this is going to go.
One of the big arguments for local models is we can't trust providers to maintain ongoing access the models you validated and put into production. Even if you run hosted models, running open ones means you can switch providers.
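Concretely, most open-model servers (vLLM, llama.cpp's server, and the like) expose an OpenAI-compatible endpoint, so switching providers is often just a base-URL change. A minimal sketch; the model name is an example open-weight checkpoint:

    from openai import OpenAI

    # Same client code, different deployment: point base_url at whichever
    # OpenAI-compatible server hosts the open-weight model you validated.
    local = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

    resp = local.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # example open model
        messages=[{"role": "user", "content": "Sanity check: 2+2?"}],
    )
    print(resp.choices[0].message.content)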
Opus 4.5 is better than GPT at everything except code execution (but with Pro you get a lot of Claude Code usage), and if they nuke all my old convos I'll probably downgrade from Pro to free.
> creative ideation
At first I had no idea what this meant! So I asked my friend Miss Chatty [1] and we had an interesting conversation about it:
https://chatgpt.com/share/697bf761-990c-8012-9dd1-6ca1d5cc34...
[1] You may know her as ChatGPT, but I figure all the other AIs have fun human-sounding names, so she deserves one too.
You are absolutely right to ask about it!
(How did I do with channeling Miss Chatty's natural sycophancy?)
Anyway, I do use AI for other things, such as...
• Coding (where I mostly use Claude)
• General research
• Looking up the California Vehicle Code about recording video while driving
• Gift ideas for a young friend who is into astronomy (Team Pluto!)
• Why "Realtor" is pronounced one way in the radio ads, another way by the general public
• Tools and techniques for I18n and L10n
• Identifying AI-generated text and photos (takes one to know one!)
• Why spaghetti softens and is bendable when you first put it into the boiling water
• Burma-Shave sign examples
• Analytics plugins for Rails
• Maritime right-of-way rules
• The Uniform Code of Military Justice and the duty to disobey illegal orders
• Why, in a practical sense, the Earth really once *was* flat
• How de-alcoholized wine gets that way
• California law on recording phone conversations
• Why the toilet runs water every 20 minutes or so (when it shouldn't)
• How guy wires got that name
• Where the "he took too much LDS" scene from Star Trek IV was filmed
• When did Tim Berners-Lee demo the World Wide Web at SLAC
• What "ogr" means in "ogr2ogr"
• Why my Kia EV6 ultrasonic sensors freaked out when I stopped behind a Lucid Air
• The smartest dog breeds (in different ways of "smart")
• The Sputnik 1 sighting in *October Sky*
• Could I possibly be related to John White Geary?
And that's just from the last few weeks.

In other words, pretty much anything someone might interact with an AI - or a fellow human - about.
About the last one (John White Geary), that discussion started with my question about actresses in the "Pick a little, talk a little" song from The Music Man movie, and then went on to how John White Geary bridged the transition from Mexican to US rule, as did others like José Antonio Carrillo:
https://chatgpt.com/share/697c5f28-7c18-8012-96fc-219b7c6961...
If I could sum it all up, this is the kind of freewheeling conversation with ChatGPT and other AIs that I value.
Most "big name" models' interfaces don't let you change settings, or not easily. Power users learn to use different interfaces and look up guides to tweak models to get better results. You don't have to just shrug your shoulders and switch models. OpenAI's power interface: https://platform.openai.com/playground Anthropic's power interface: https://platform.claude.com/ For self-hosted/platform-agnostic, OpenWebUI is great: https://openwebui.com/
Utterly unreliable. I get better results, faster, editing parts of the code with Claude in a web ui, lol.
So we'll have to wait until "creativity" is solved.
Side note: I've been wondering lately about a way to bring creativity back to these thinking models. For creative writing tasks you could add the original, pretrained model as a tool call. So the thinking model could ask for its completions and/or query it and get back N variations. The pretrained model's completions will be much more creative and wild, though often incoherent (think back to the GPT-3 days). The thinking model can then review these and use them to synthesize a coherent, useful result. Essentially giving us the best of both worlds. All the benefits of a thinking model, while still giving it access to "contained" creativity.
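A rough sketch of that tool-call idea, with the caveat that this is just my reading of it, not a tested setup. The base model is exposed as a function the thinking model can invoke for N hot-sampled continuations; gpt-3.5-turbo-instruct stands in for a true pretrained model here, since it's the closest completion-style model still served:

    from openai import OpenAI

    client = OpenAI()

    def wild_completions(seed_text: str, n: int = 5) -> list[str]:
        # Tool for the thinking model: N untamed continuations from a
        # completion-style model, sampled hot so they stay weird and varied.
        resp = client.completions.create(
            model="gpt-3.5-turbo-instruct",  # stand-in for a pretrained model
            prompt=seed_text,
            n=n,
            temperature=1.3,  # crank creativity; coherence is the caller's job
            max_tokens=120,
        )
        return [c.text for c in resp.choices]

    # The thinking model would review these drafts and synthesize a
    # coherent result from the best fragments.
    for draft in wild_completions("The lighthouse keeper kept a second logbook"):
        print("---\n" + draft)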
4.1 was the best so far, with straight-to-the-point answers that were correct most of the time, especially for code-related questions. 5.1/5.2, for their part, would much more easily hallucinate stupid responses or stupid code snippets totally unlike what was expected.
(I have no idea. LLMs are infinite code monkeys on infinite typewriters for me, with occasional “how do I evolve this Pokémon’ utility. But worth a shot.)
But I think a lot more people are using LLMs for relationship surrogates than that (pretty bonkers) subreddit would suggest. Character AI (https://en.wikipedia.org/wiki/Character.ai) seems quite popular, as do the weird fake friend things in Meta products, and Grok’s various personality mode and very creepy AI girlfriends.
I find this utterly bizarre. LLMs are peer coders in a box for me. I care about Claude Code, and that’s about it. But I realize I am probably in the vast minority.
[0]: https://www.nber.org/system/files/working_papers/w34255/w342...
Their hobby is... weird, but they're not stupid.
If you can be respectful and act like a guest, it's worth reading a little there. You'll see the worrisome aspects in more detail but also a level of savvy that sometimes seems quite strange given the level of attachment. It's definitely interesting.
- a large number of incredibly fragile users
- extremely "protective" mods
- a regular stream of drive-by posts that regulars there see as derogatory or insulting
- a fair amount of internal diversity and disagreement
I think discussion on forums larger than it, like HN or popular subreddits, is likely to drive traffic that will ultimately fuel a backfiring effect for the members. It's inevitable, and it's already happening, but I'm not sure it needs to increase.

I do think the phenomenon is a matter of legitimate public concern, but idk how that can best be addressed. Maybe high-quality, long-form journalism? But probably not just cross-posting the sub in larger fora.
Any numbers/reference behind this?
ChatGPT has ~300 million active users a day. A 0.02% (delusion disorder prevalence) would be 60k people.
Again, do you have anything behind this "highly prevalent phenomenon" claim?
Spend a day on Reddit and you'll quickly realize many subreddits are just filled with lies.
Most subs that are based on politics or current events are at best biased, at worst completely astroturf.
The only subs that I think still have mostly legit users are municipal subs (which still get targeted by bots when anything political comes up) and hobby subs where people show their works or discuss things.
At least they cannot read this.
If the 800M MAU figure still holds, that's 160k people.
(Strangely these "mental illnesses" and school problems went away after he switched to an English language school, must be a miracle)
I assume the loneliness epidemic is producing similar cases.
In my entire french immersion Kindergarden class, there was a total of one child who already spoke French. I don't think the fact that he didn't speak the language is the concern.
There is/was an interesting period when "normies" were joining Twitter en masse and adopted many of the denizens' ideas as normal, widespread ideas. Kinda like going on a camping trip at "the lake" because you heard it's fun and not realizing that everyone else on the trip is part of a semi-deranged cult.
The outsized effect of this was journalists thinking these people on twitter were accurate representations of what society on the whole was thinking.
(Upgrade for only 1999 per month)
On the other hand - 5.0-nano has been great for fast (and cheap) quick requests and there doesn't seem to be a viable alternative today if they're sunsetting 5.0 models.
I really don't know how they're measuring improvements in the model since things seem to have been getting progressively worse with each release since 4o/o4 - Gemini and Opus still show the occasional hallucination or lack of grounding but both readily spend time fact-checking/searching before making an educated guess.
I've had ChatGPT blatantly lie to me and say there are several community posts and Reddit threads about an issue; then, after failing to find them, I asked it where it found those and it flat out said "oh yeah, it looks like those don't exist".
Even if I submit the documentation or reference links they are completely ignored.
Any suggestions?
RIP
Latest Advancements
GPT-5
OpenAI o3
OpenAI o4-mini
GPT-4o
GPT-4o mini
Sora
I'm sure there is some internal/academic reason for them, but to an outside observer they're simply horrible.
We're the technical crowd cursed and blinded by knowledge.
A fellow Primeagen viewer spotted.
"I know! Let's restart the version numbering for no good reason!" becomes DOOM (2016), Mortal Kombat 1 (2025), Battlefield 1 (2016), Xbox One (not to be confused with the original Xbox 1)
As another example, look at how much of a trainwreck USB 3 has become
Or how Nvidia restarted Geforce card numbering
There's also Xbox One X, which is not in the X series. Did I say that right? Playstation got the version numbers right. I couldn't make names as incomprehensible as Xbox if I tried.
If you disagree on something you can also train a lora.
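For anyone who hasn't tried it, the setup side of "train a LoRA" is genuinely small. A minimal sketch with Hugging Face transformers + peft; the base model and hyperparameters here are illustrative choices, not a recipe:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base = "Qwen/Qwen2.5-0.5B"  # illustrative small open model
    model = AutoModelForCausalLM.from_pretrained(base)
    tokenizer = AutoTokenizer.from_pretrained(base)

    config = LoraConfig(
        r=8,             # rank of the low-rank update matrices
        lora_alpha=16,   # scaling factor for the update
        target_modules=["q_proj", "v_proj"],  # attention projections to adapt
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, config)
    model.print_trainable_parameters()  # typically well under 1% of weights
    # From here, fine-tune on your preference data with any standard trainer.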
Despite 4o being one of the worst models on the market, they loved it. Probably because it was the most insane and delusional. You could get it to talk about really fucked up shit. It would happily tell you that you are the messiah.
It used to get things wrong for sure but it was predictable. Also I liked the tone like everyone else. I stopped using ChatGPT after they removed 4o. Recently, I have started using the newer GPT-5 models (got free one month). Better than before but not quite. Acts way over smart haha
Note: I wouldn't actually; I find it terrible to prey on people.
Should be essential watching for anyone that uses these things.
LOL WHAT?! I'm 0.1% of users? I'm certain part of the issue is that it takes three clicks to switch to GPT-4o, and it has to be done each time the page is loaded.
> that they preferred GPT‑4o’s conversational style and warmth.
Uh.. yeah maybe. But more importantly, GPT-4o gave better answers.
Zero acknowledgement about how terrible GPT-5 was when it was first released. It has since improved but it's not clear to me it's on-par with GPT-4o. Thinking mode is just too slow to be useful and so GPT-4o still seems better and faster.
Oh well, it'll be missed.