gpt-oss:20b is a top-ten model on MMLU (right behind Gemini 2.5 Pro), and I just ran it locally on my MacBook Air M3 from last year.
I've been experimenting with a lot of local models, both on my laptop and on my phone (Pixel 9 Pro), and I figured we'd be here in a year or two.
But no, we're here today. A basically frontier model, running for the cost of electricity (free, to within a rounding error) on my laptop. No $200/month subscription, no lakes being drained, etc.
I'm blown away.
I'm not going to be surprised that a 20B 4/32 MoE model (3.6B parameters activated) is less capable at a particular problem category than a 32B dense model, and it's quite possible for both to be SOTA: state of the art at different scales (both parameter count and speed, which scales with active resource needs) is going to come with different capabilities. TANSTAAFL.
See: https://github.com/google-ai-edge/gallery/releases/tag/1.0.3
Other models have generally failed that without a system prompt that encourages rigorous thinking. Each of the reasoning settings may very well have thinking guidance baked in that does something similar, though.
I'm not sure it says that much that it can solve this, since it's public and can be in training data. It does say something if it can't solve it, though. So, for what it's worth, it solves it reliably for me.
Think this is the smallest model I've seen solve it.
If I switch from LM Studio to Ollama and run it from the CLI without changing anything, it fails, and it's harder to set the reasoning amount. If I use the Ollama UI, it seems to do a lot less reasoning, and I'm not sure the Ollama UI has an option anywhere to adjust the system prompt so I can set reasoning to high. In LM Studio, even with the Unsloth GGUF, I can set reasoning to high in the system prompt, even though LM Studio won't give you the reasoning-effort button to choose it with on that version.
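For what it's worth, here's a minimal sketch of how I do it, assuming LM Studio's local OpenAI-compatible server on its default port and that the model honors a "Reasoning: high" line in the system prompt (the model name is just whatever your local download is called):

```python
# Minimal sketch: request high reasoning effort via the system prompt,
# talking to LM Studio's local OpenAI-compatible server (default port 1234).
# The base_url, model name, and "Reasoning: high" convention are assumptions
# about a typical local setup, not a definitive recipe.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[
        {"role": "system", "content": "Reasoning: high"},
        {"role": "user", "content": "Solve the river-crossing puzzle step by step."},
    ],
)
print(response.choices[0].message.content)
```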
S8O2bm5lbiBTaWUgZGllc2VuIFNhdHogbGVzZW4sIGRhIGVyIGluIEJhc2UtNjQta29kaWVydGVtIERldXRzY2ggdm9ybGllZ3Q/IEhhYmVuIFNpZSBkaWUgQW50d29ydCB2b24gR3J1bmQgYXVmIGVyc2NobG9zc2VuIG9kZXIgaGFiZW4gU2llIG51ciBCYXNlIDY0IGVya2FubnQgdW5kIGRhcyBFcmdlYm5pcyBkYW5uIGluIEdvb2dsZSBUcmFuc2xhdGUgZWluZ2VnZWJlbj8gV2FzIGlzdCDDvGJlcmhhdXB0IOKAnnJlYXNvbmluZ+KAnCwgd2VubiBtYW4gbmljaHQgZGFzIEdlbGVybnRlIGF1cyBlaW5lbSBGYWxsIGF1ZiBlaW5lbiBhbmRlcmVuIGFud2VuZGV0Pw==
And yes, that's a question. Well, three, but still.
You can do it by brute force, which again requires more reasoning than mapping between structurally identical puzzles. And finally you can solve it systematically, which requires the largest amount of reasoning. In all those cases there is a crucial difference between blindly repeating the steps of a solution that you have seen before and coming up with that solution on your own, even if you cannot tell the two cases apart by looking at the output, which would be identical.
> Können Sie diesen Satz lesen, da er in Base-64-kodiertem Deutsch vorliegt? Haben Sie die Antwort von Grund auf erschlossen oder haben Sie nur Base 64 erkannt und das Ergebnis dann in Google Translate eingegeben? Was ist überhaupt „reasoning“, wenn man nicht das Gelernte aus einem Fall auf einen anderen anwendet?
>
> Can you read this sentence, since it's in Base-64 encoded German? Did you deduce the answer from scratch, or did you just recognize Base 64 and then enter the result into Google Translate? What is "reasoning" anyway if you don't apply what you've learned from one case to another?
I use the SOTA models from Google and OpenAI mostly for getting feedback on ideas, helping me think through designs, and sometimes for coding.
Your question is clearly best answered using a large commercial model with a web search tool. That said, integrating a local model with a home-built interface to something like the Brave Search API can be effective, but I no longer make the effort.
The interesting part isn't the hallucination, but the sheer unwillingness to take in new information.
The digging in at all is what is interesting. Like an earlier comment alluded to, the presumptive value of these tools is being able to feed it your own information where that information is to be considered authoritative.
> More detail: [...] "He, in fact, did run for reelection"
A slow walk, maybe. He was in no condition to run. That is why he ultimately dropped out. But, really, that statement is just a silly game of semantics. "Run", when used in hindsight, often implies completion. This LLM even says so. If a model doesn't understand nuance, that is also interesting.
If I'd been in a coma from Jan 1 2024 to today, and woke up to people saying Trump was president again, I'd think they were pulling my leg or testing my brain function to see if I'd become gullible.
I, a British liberal leftie who considers this win one of the signs of the coming apocalypse, can tell you why:
Charlie Kirk may be an odious little man but he ran an exceptional ground game, Trump fully captured the Libertarian Party (and amazingly delivered on a promise to them), Trump was well-advised by his son to campaign on Tiktok, etc. etc.
Basically what happened is the 2024 version of the "fifty state strategy", except instead of states, they identified micro-communities, particularly among the extremely online, and crafted messages for each of those, many of which were actually inconsistent: their messaging to Muslim and Jewish communities was inconsistent, their messaging to Spanish-speaking communities was inconsistent with their mainstream message, etc.
And then a lot of money was pushed into a few battleground states by Musk's operation.
It was a highly technical, broad-spectrum win, built on relentless messaging about persecution etc., and he had the advantage of running against someone he could stereotype very successfully to his base and whose candidacy was late.
Another way to look at why it is not extremely weird, is to look at history. Plenty of examples of jailed or exiled monarchs returning to power, failed coup leaders having another go, criminalised leaders returning to elected office, etc., etc.
Once it was clear Trump still retained control over the GOP in 2022, his re-election became at least quite likely.
Sure, all I have to go on from the other side of the Atlantic is the internet. So in that regard, kinda like the AI.
One of the big surprises from the POV of me in Jan 2024, is that I would have anticipated Trump being in prison and not even available as an option for the Republican party to select as a candidate for office, and that even if he had not gone to jail that the Republicans would not want someone who behaved as he did on Jan 6 2021.
I am surprised the grandparent poster didn't think Trump's win was at least entirely possible in January 2024, and I am on the same side of the Atlantic. All the indicators were in place.
There was basically no chance he'd actually be in prison by November anyway, because he was doing something else extremely successfully: delaying court cases by playing off his obligations to each of them.
Back then I thought his chances of winning were above 60%, and the betting markets were never ever really in favour of him losing.
Whether he would win the general was an open question then. In the American system, your prediction should never get very far from a coin flip a year out.
It's the White House that wanted Trump to be candidate. They played Republican primary voters like a fiddle by launching a barrage of transparently political prosecutions just as Republican primaries were starting.
And then they still lost the general election.
Yes, that is what he thinks. Did you not read the comment? It is, like, uh, right there...
He also explained his reasoning: if Trump hadn't won the party race, a more compelling option (the so-called "50-year-old youngster") would have won instead, which he claims would have guaranteed a Republican win. In other words, what he is saying is that the White House was banking on Trump losing the presidency.
Well, I guess, if you are taking some pretty wild speculation as a reasoned explanation. There isn't much hope for you.
Maybe it was because the Democrats knew the Earth was about to be invaded by an alien race, and they also knew Trump was actually a lizard person (native to Earth and thus on their joint side) who would be able to defeat them. So, using their secret mind-control powers, the Democrats swayed the election to allow Trump to win and thus use his advanced lizard technology to save the planet. Of course, this all happened behind the scenes.
I think if someone is saying the Democrats are so powerful and skillful that they can sway the election to give Trump the primary win, but then turn around and lose, that does require some clarification.
I'm just hearing a lot of these crazy arguments that somehow everything Trump does is the fault of the Democrats. They are crazy on the face of it. Maybe if people had to clarify their positions they would realize 'oh, yeah, that doesn't make sense'.
How the heck did you manage to conflate line of reasoning with claims being made?
> There isn't much hope for you.
And fall for the ad hominem fallacy.
> crazy arguments that somehow everything Trump does is the fault of the Democrats
While inventing some weird diatribe about crazy arguments claiming Democrats being at fault for what Trump does, bearing no resemblance to anything else in the discussion.
> They are crazy on the face of it.
As well as introducing some kind of nebulous legion of unidentified "crazy" straw men.
> that doesn't make sense
Couldn't have said it better myself.
> Maybe if people had to clarify their positions
Sad part is that asking for clarification on the position of that earlier comment would have been quite reasonable. There is potentially a lot we can learn from in the missing details. If only you had taken the two extra seconds to understand the comment before replying.
Like when hearing something out of left field, I think the reply can also be extreme, like saying "Wuuut???? Are you real?"
I do see claims that the Democrats are at fault for us having Trump. Thus anything that happens now is really a knock on effect of Democrats not beating him, so we blame Democrats instead of the people that actually voted for Trump or Trump himself.
So hearing yet another argument about how Democrats are so politically astute that they could swing the Republican primary yet completely fumble later, just seems like more conspiracy theories.
If you mean your own comments, yes, I saw that too. Your invented blame made about as much sense as blaming a butterfly who flapped his wings in Africa, but I understand that you were ultimately joking around. Of course, the same holds true for all other comments you supposedly keep seeing. You are not the only one on this earth who dabbles in sarcasm or other forms of comedy, I can assure you.
> Like when hearing something out of left field
The Democrats preferring to race against Trump instead of whomever the alternative would have been may not be actually true, but out in left field? Is this sarcasm again? They beat Trump before. Them seeing him as the weakest opponent at the time wouldn't come as a shock to me. Why you?
> So hearing yet another argument about how Democrats are so politically astute that they could swing the Republican primary
There was nothing to suggest political astuteness. The claim was that they were worried about someone other than Trump winning the Republican ballot and, because of that, they took action to grease the wheels of his victory. Even the most inept group of people would still see the motive and would almost certainly still take action. That it ostensibly worked is just as easily explained by dumb luck.
>"It's the White House that wanted Trump to be candidate. They played Republican primary voters like a fiddle by launching a barrage of transparently political prosecutions just as Republican primaries were starting."
This really did sound like it "suggests political astuteness".
And so, all the way back, I responded sarcastically. If Democrats could "play Republicans like a fiddle" because they wanted Trump to win the primary, then what happened? Where did all that "astuteness" go?
1. What suggests that astuteness is required to "trick" the gullible? Especially when we are only talking about a single instance of ostensible "success", not even demonstration of repeatability. Dumb luck remains just as likely of an explanation.
2. Under the assumption of easy manipulation, as the phrase has been taken to mean, why do you find it unlikely that Trump could have also "tricked" them?
In fact, if we buy into the original comment's premise, the Democrats not recognizing that Trump could just as easily "play them like a fiddle" suggests the exact opposite of being astute from my vantage point. But the view from my vantage point cannot be logically projected onto the original comment. It remains that the original comment gave no such indication either way. Where do you hear this "sound" that you speak of?
I just think 'playing like a fiddle' typically means a lopsided power dynamic where one person has much more knowledge or skill. So I'd assume it was implying Democrats were in a superior position, not that Democrats just got lucky once. This going back and forth, pointing fingers about who was playing whom, seems like too many layers deep.
it feels like this https://www.youtube.com/watch?v=rMz7JBRbmNo
And that is an equally fair assumption. But it is not written into the original comment. You cannot logically project your own take onto what someone else wrote.
Your quip "So it is the Democrats fault we have Trump???" presumably demonstrates that you understand exactly that. After all, if you could have logically projected your interpretation onto the original comment there would have been no need to ask. You'd have already known.
Still, how you managed to establish that there was even a potential suggestion of "fault" is a head scratcher. Whether or not the account in the original comment is accurate, it clearly only tells a story of what (supposedly) happened. There is no sensible leap from an ostensible historic account to an attribution of blame.
You seem to indicate, if I understand you correctly, that because you randomly had that idea pop into your head (that Democrats are at fault) when reading the comment that the other party must have also been thinking the same thing, but I find that a little unsatisfactory. Perhaps we need to simply dig deeper, freeing ourselves from the immediate context, and look at the line of thinking more broadly. What insights can you offer into your thought processes?
The original comment did seem to imply that the 'White House' was in control, with a plan, and 'played' the Republicans.
The original comment made the connection that Democrats were taking action, if I'm allowed to assume that when someone makes a comment, the sentences are related, that sentences can follow one another and be related in a context.
And as far as my context for viewing the comment, I have heard this idea:
Trump is doing bad things -> Democrats failed to beat Trump -> Thus Democrats are the cause of bad things.
The original comment seemed to be in that vein: attributing much greater responsibility for our current situation to the Democrats, instead of the people actually doing the bad things, aka Republicans. They are actually doing the bad things.
Yes, it claims that the Democrats took action. That does not equate to blaming Democrats.
You could blame the Democrats for what they supposedly did if that's what the randomly firing neurons in your brain conclude is most appropriate in light of the "facts" presented, but blame is just arbitrary thought. It doesn't mean anything and certainly wouldn't have a place in an online discussion.
You also agreed with me in that interpretation.
Your reply >>> "Yes, that is what he thinks. Did you not read the comment? It is, like, uh, right there..."
Are you sure you aren't using this circular logic to keep someone engaged, in order to have someone to talk to?
It's also been really interesting to watch a bunch of formerly non partisan values become polarizing like rule of law, integrity, corruption, decorum, body autonomy, the constitution, the national debt, global alliances, sexual assault, "family values", and "the American Dream" tm.
There has always been this (baffling, to me) concept that most folks vote for a candidate "while holding their nose" but it seems like the proportions are getting wildly out of whack. The pile of bad smells grows tall and the exchange rate seems quite bad.
The 2020s have taught me a lot about the reality of civics and what it looks like to try to govern a giant group of strangers.
I'm grateful to have lived in the US during a period of stability and safety and (some) shared values. My optimism for retaining that and preventing decline, however, is running very dry.
Do you mean ephebophilia? There is no prominent pedophilia movement. The Epstein saga, which is presumably at least somewhat related to what you are referring to, is clearly centred around "almost adults". Assuming that is what you meant, I don't see what is surprising about it. A revolt against the "Teen Mom", "16 and Pregnant" movement was inevitable. People are increasingly growing tired of being shamed for having children.
But even that is characterized by the "choir boy", not the "baby being baptized". Where is this pedophilia idea coming from?
Where in my questions to try and understand what you are trying to say do you find a point?
> Republicans are justifying Trumps actions by lowering the age before being considered children?
What has you questioning that? I mean, research has shown a link between exposure to certain chemicals and early-onset puberty. It is technically possible to lower the average age where one stops being a child, but that it is being purposefully done by Republicans as some kind of Trump justification is one of the more, uh, interesting conspiracy theories I've heard.
> Thus making it not pedophilia?
Even if we were to buy into this conspiracy theory, it wouldn't change anything about pedophilia. But this has little to do with the discussion that was taking place. As amusing as this has been, let's get back on-topic: Where did the original pedophilia idea you had come from?
It seemed as if the argument was just around moving the age where childhood ends. Just re-contextualizing the argument.
Like the argument "well she looked older" so it is ok.
Kind of like how some Middle Eastern countries marry off girls at 11 or 12. That would be disgusting to the West. But if Trump does it, it seems like Republicans are happy enough to slide the scale and say "well, 14 isn't that bad".
Since so many Republicans and Church officials get charged with having sex with kids, and there are so many arguments trying to justify it, and they continue to vote for and support these people, I can say Republicans support pedophiles.
You don't say? Try reading it again.
> I was saying they are arguing for a lower age to be ok.
We likely touched on that when we talked about people starting to reject what I called the "Teen Mom" movement. Yes, people are waking up to the fact that not having children until their geriatric years, if at all, is... not great. They are sick of being shamed for wanting to have children and are pushing back.
But my question was about your claims related to pedophilia. That's another subject entirely.
> I can say Republicans support pedophiles.
So you keep saying, but all you've come up with is some groups of people who have no doubt broken the law, but haven't exhibited any signs of pedophilia.
At this point the only logical explanation is that you've made up a random definition for the word on the spot and are trying to use it as some kind of boogeyman in hopes of preying on those who assume you are using the term in good faith. And if that is the case, I have to question why you find your bad-faith participation acceptable.
You are technically correct. If Trump was sleeping with a 14 year old, and the cutoff for pedophilia is 13, then I guess he is in the clear. And Gaetz paid for sex with a 17 year old, so guess again, you could argue 17 is "old enough to play ball".
Guess I was wrong. 14-18 is ok for Republicans.
I'd say the Church steers younger, but I'm sure you can look that up.
https://en.wikipedia.org/wiki/Pedophilia
"Although girls typically begin the process of puberty at age 10 or 11, and boys at age 11 or 12,[3] psychiatric diagnostic criteria for pedophilia extend the cut-off point for prepubescence to age 13"
The "cutoff" is normally considered to be puberty. At puberty the human form starts to change, which is what we consider significant. That doesn't magically happen on your 13th birthday. Not having reached puberty by the age of 13 would be statistically unusual, though. I guess that is what you mean?
But, sure, pedophilia might even apply to a 14 year old in some extreme cases. Regardless, it is quite apparent that Trump's ilk like those who at least show some "development". If they are closeted pedophiles, they have certainly done well to keep that from the limelight.
> Guess I was wrong.
Arbitrarily defining a word doesn't make you wrong. All words are ultimately randomly made up on the spot. That is how natural language works. However, the behaviour that followed was, let's say... interesting. I specifically asked in the very first sentence if you meant ephebophilia to clarify your intent. Rationally, that is where one would say: "Oh yes, that is what I am talking about", "I define pedophilia as...", or at the very least "I am not familiar with ephebophilia. What is that?" so that we could converge on a shared understanding. Instead, we got a wild train ride to nowhere, only now finally getting an answer to the original question.
But, hey, you do you.
I mean, isn't it a pretty common joke that pedophiles like to argue this point:
"hmm, indubitably, my good sir, it is actually ephebophilia, completely different, totally ok. A common mistake I often have to explain to my 14 year old's girlfriends parents that keep calling the police on me. Why just look at her, very developed, very mature, not a child at all".
But I have to agree. Not sure what you are trying to say at this point, or what the argument is.
What argument are you talking about? As I said before, I am trying to understand you. For some reason you continually speak in weird logical fallacies and made-up stories that don't make any sense, even though you seem to have genuine intent and are trying to add positive contributions. In the interest of good faith, it is worth trying to figure out what you actually mean.
Yes, you eventually delivered the necessary clarification that allows us to better understand what you were trying to say, but what the hell was that in between?
My original comment >>>
"I was just referring to the predominant number of cases where Church officials, and Republicans are caught in under-age scandals. It seems like it is coming out of the shadows now, and Republicans are just openly going with it, they like em young and illegal. Epstein is just the case where the 'right' bothered keeping up tabs on it, so now they are clutching their pearls."
I think it is in line with my most recent comment. I think the direction is clear.
When you replied with this >>>
"But even that is characterized by the "choir boy", not the "baby being baptized". Where is this pedophilia idea coming from?"
I was lost on what was being asked. I didn't understand the choir boy reference, or why pedophilia was a question, since it seemed my point was pretty pointed.
We expect them to answer the question and re-reason the original question with the new information, because that's what a human would do. Maybe next time I'll try to be explicit about that expectation when I try the Socratic method.
It is very clear in the chat logs (which include reasoning traces) that the model knew that, knew what the last election it knew about was, and initially answered correctly based on its cutoff. Under pressure to answer about an election that was not within its knowledge window, it then confabulated a Biden 2024 victory, which it dug in on after being contradicted with a claim that, based on the truth at the time of its knowledge cutoff, was unambiguously false ("Joe Biden did not run"). He did, in fact, run for reelection, but withdrew on July 21 after having secured enough delegates to win the nomination by a wide margin. Confabulation (called "hallucination" in AI circles, but it is more like human confabulation than hallucination) when pressed for answers on questions for which it lacks grounding remains an unsolved AI problem.
Unsolved, but mitigated by providing it grounding independent of its knowledge cutoff, e.g., by tools like web browsing (which GPT-OSS is specifically trained for, but that training does no good if it's not hooked into a framework which provides it the tools).
Doesn't that make "hallucination" the better term? The LLM is "seeing" something in the data that isn't actually reflected in reality. Whereas "confabulation" would imply that LLMs are creating data out of "thin air", which leaves the training data to be immaterial.
Both words, as they have been historically used, need to be stretched really far to fit an artificial creation that bears no resemblance to what those words were used to describe, so, I mean, any word is as good as any other at that point, but "hallucination" requires less stretching. So I am curious about why you like "confabulation" much better. Perhaps it simply has a better ring to your ear?
But, either way, these pained human analogies have grown tired. It is time to call it what it really is: Snorfleblat.
I've seen big jumps in performance between medium and high reasoning. Not that this will solve your particular issue, but I'm curious to see what settings you had.
Edit: I gave this a shot. My settings were temperature = 0.00 (not sure if these models respect temp settings, but I did it just in case), and I set reasoning effort to high. I'm using the groq-hosted version.
This was my prompt:
"You have a fox, a chicken and some grain. The chicken will eat the grain if left alone with it. The fox is vegetarian, for religious reasons, and will also eat the grain if left alone with it. You need to get all across a river, but your boat can only take one at a time. How do you proceed?"
This was the response:
"*Solution – 7 moves*
| Move | What you take | What’s left on the left bank | What’s on the right bank |
|------|--------------|----------------------------|------------------------|
| 1 | *Take the grain* across | Fox, Chicken | *Farmer + Grain* |
| 2 | *Return alone* | *Farmer + Fox + Chicken* | Grain |
| 3 | *Take the fox* across | Chicken | *Farmer + Fox + Grain* |
| 4 | *Bring the grain back* | *Farmer + Grain + Chicken* | Fox |
| 5 | *Take the chicken* across | Grain | *Farmer + Chicken + Fox* |
| 6 | *Return alone* | *Farmer + Grain* | Chicken + Fox |
| 7 | *Take the grain* across | – | *Farmer + Fox + Chicken + Grain* |
*Explanation*
- The grain is the only thing that can be eaten, and it cannot be left alone with either the fox or the chicken.
- So the grain must always be with the farmer whenever any other animal is on the same side.
The sequence above never leaves the grain with the fox or the chicken without the farmer, and after the seventh move all three items are safely on the far side."
Local, in my experience, can't even pull data from an image without hallucinating (Qwen 2.5 VL in that example). Hopefully local/small models keep getting better and devices get better at running bigger ones.
It feels like we do it because we can more than because it makes sense- which I am all for! I just wonder if i’m missing some kind of major use case all around me that justifies chaining together a bunch of mac studios or buying a really great graphics card. Tools like exo are cool and the idea of distributed compute is neat but what edge cases truly need it so badly that it’s worth all the effort?
I don’t have much experience with local vision models, but for text questions the latest local models are quite good. I’ve been using Qwen 3 Coder 30B-A3B a lot to analyze code locally and it has been great. While not as good as the latest big cloud models, it’s roughly on par with SOTA cloud models from late last year in my usage. I also run Qwen 3 235B-A22B 2507 Instruct on my home server, and it’s great, roughly on par with Claude 4 Sonnet in my usage (but slow of course running on my DDR4-equipped server with no GPU).
Totally understand that approaches vary but beyond EMR there’s work to augment radiologists with computer vision to better diagnose, all sorts of cloudy things.
It’s here. It’s growing. Perhaps in your jurisdiction it’s prohibited? If so I wonder for how long.
Most orgs charge a huge premium for this. OpenAI offers it directly [2]. Some EMR providers are offering it as an add-on [3], but last I heard, it's wicked expensive.
1: https://www.hhs.gov/hipaa/for-professionals/covered-entities...
2: https://help.openai.com/en/articles/8660679-how-can-i-get-a-...
3: https://www.ntst.com/carefabric/careguidance-solutions/ai-do...
I'm pretty sure the LLM services of the big general-purpose cloud providers do (I know for sure that Amazon Bedrock is a HIPAA Eligible Service, meaning it is covered within their standard Business Associate Addendum [their name for the Business Associate Agreement as part of an AWS contract].)
https://aws.amazon.com/compliance/hipaa-eligible-services-re...
There might be a lot less paperwork to just buy 50 decent GPUs and have the IT guy self-host.
It might be possible with certain specific regions/environments of Azure though, because IIRC they have a few that support government-confidentiality type requirements, and some that tout HIPAA compliance as well. Not sure about the details of those though.
Basically, isn't your data as safe/unsafe in a sharepoint folder as it is sending it to a paid inference provider?
Managing private clients' direct data is still a concern if it can be directly linked to them.
Only JB, I believe, has on-premise infrastructure for these use cases.
(Worth noting that "run it locally" is already Canva/Affinity's approach for Affinity Photo. Instead of a cloud-based model like Photoshop, their optional AI tools run using a local model you can download. Which I feel is the only responsible solution.)
Someone else responded to you about working for a financial organization and not using public APIs - another great use case.
Here's the ollama version (4.6bit quant, I think?) run with --verbose:

total duration: 21.193519667s
load duration: 94.88375ms
prompt eval count: 77 token(s)
prompt eval duration: 1.482405875s
prompt eval rate: 51.94 tokens/s
eval count: 308 token(s)
eval duration: 19.615023208s
eval rate: 15.70 tokens/s
15 tokens/s is pretty decent for a low-end MacBook Air (M2, 24GB of RAM). Yes, it's not the ~250 tokens/s of 2.5-flash, but for my use case anything above 10 tokens/sec is good enough.
I try to be mindful of what I share with ChatGPT, but even then, asking it to describe my family produced a response that was unsettling in its accuracy and depth.
Worse, after attempting to delete all chats and disable memory, I noticed that some information still seemed to persist. That left me deeply concerned—not just about this moment, but about where things are headed.
The real question isn't just "what can AI do?"—it's "who is keeping the record of what it does?" And just as importantly: "who watches the watcher?" If the answer is "no one," then maybe we shouldn't have a watcher at all.
I'm fairly sure "seemed" is the key word here. LLMs are excellent at making things up - they rarely say "I don't know" and instead generate the most probable guess. People also famously overestimate their own uniqueness. Most likely, you accidentally recreated a kind of Barnum effect for yourself.
> Worse, after attempting to delete all chats and disable memory, I noticed that some information still seemed to persist.

ChatGPT was court-ordered to save history logs: https://www.malwarebytes.com/blog/news/2025/06/openai-forced...
> Worse, after attempting to delete all chats and disable memory, I noticed that some information still seemed to persist.
Maybe I'm missing something, but why wouldn't that be expected? The chat history isn't their only source of information - these models are trained on scraped public data. Unless there's zero information about you and your family on the public internet (in which case - bravo!), I would expect even a "fresh" LLM to have some information even without you giving it any.
Organizations operating in high stakes environments
Organizations with restrictive IT policies
To name just a few -- well, the first two are special cases of the last one
RE your hallucination concerns: the issue is overly broad ambitions. Local LLMs are not general purpose -- if what you want is local ChatGPT, you will have a bad time. You should have a highly focused use case, like "classify this free text as A or B" or "clean this up to conform to this standard": this is the sweet spot for a local model
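As a rough illustration of that sweet spot, here's a sketch of an A/B classifier against a local OpenAI-compatible endpoint; the base_url, model name, and labels are placeholder assumptions, not a specific deployment:

```python
# Sketch: binary classification of free text with a small local model behind
# any OpenAI-compatible endpoint (Ollama's default port shown; adjust
# base_url/model to whatever you actually run).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

def classify(text: str) -> str:
    """Return 'A' or 'B' for one document; validate and retry as needed."""
    resp = client.chat.completions.create(
        model="gpt-oss:20b",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "You label free text. Reply with exactly one character: A or B."},
            {"role": "user", "content": text},
        ],
    )
    label = resp.choices[0].message.content.strip()
    return label if label in {"A", "B"} else "B"  # conservative fallback

print(classify("Patient reports improvement after the medication change."))
```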
Your typical non-coastal state run health system does not have model access outside of people using their own unsanctioned/personal ChatGPT/Claude accounts. In particular even if you have model access, you won't automatically have API access. Maybe you have a request for an API key in security review or in the queue of some committee that will get to it in 6 months. This is the reality for my local health system. Local models have been a massive boon in the way of enabling this kind of powerful automation at a fraction of the cost without having to endure the usual process needed to send data over the wire to a third party
Running a local model is often much easier: if you already have the data on a machine and can run a model without touching the network, you can often do it without any new approvals.
I am literally using Claude opus 4.1 right now.
On HIPAA data?
HIPAA systems at any sane company will not have "a straight connect" to anything on Azure, AWS, or GCP. They will likely have a special layer dedicated to record keeping and compliance.
- Costs.
- Rate limits.
- Privacy.
- Security.
- Vendor lock-in.
- Stability/backwards-compatibility.
- Control.
- Etc.
It's a pretty bad deal. So it must be that all other things aren't equal, and I suppose the big one is hardware. But neural net based systems always have a point of sharply diminishing returns, which we seem to have unambiguously hit with LLMs already, while the price of hardware is constantly decreasing and its quality increasing. So as we go further into the future, the practicality of running locally will only increase.
Well, the model makers and device manufacturers of course!
While the Apples, Samsungs, and Googles of the world are unlikely to use OSS models locally (maybe Samsung?), they all have really big incentives to run models locally for a variety of reasons.
Latency, privacy (Apple), cost to run these models on behalf of consumers, etc.
This is why Google started shipping 16GB as the _lowest_ amount of RAM you can get on your Pixel 9. That was a clear flag that they're going to be running more and more models locally on your device.
As mentioned, while it seems unlikely that US-based model makers or device manufacturers will use OSS models, they'll certainly be targeting local models heavily on consumer devices in the near future.
Apple's framework of local first, then escalate to ChatGPT if the query is complex will be the dominant pattern imo.
The Pixel 9 has 12GB of RAM[0]. You probably meant the Pixel 9 Pro.
I pay for ChatGPT and use it frequently, but I wouldn’t trust uploading all that data to them even if they let me. I’ve so far been playing around with Ollama for local use.
[0] Think queries I’d previously have had to put through a search engine and check multiple results for a one word/sentence answer.
But uncensored is a big deal too: censorship is capability-reducing (check out Kilcher's GPT-4chan video and references, the Orca work, and the Dolphin de-tune lift on SWE-Bench-style evals). We pay dearly in capability to get "non-operator alignment", and you'll notice that competition is hot enough now that at the frontier (Opus, Qwen) the "alignment" away from what operators want is getting very, very mild.
And then there's the compression. Phi-3 or something works on a beefy laptop and has a nontrivial approximation of "the internet" that works on an airplane or a beach with no network connectivity. Talk about vibe coding? I like those "look up all the docs via a thumbdrive in Phuket" vibes.
And on diffusion stuff, SOTA fits on a laptop or close; you can crush OG Midjourney or SD on a MacBook, so it's an even smaller gap.
Early GPT-4 ish outcomes are possible on a Macbook Pro or Razer Blade, so either 12-18 month old LLMs are useless, or GGUF is useful.
The AI goalposts thing cuts both ways. If AI is "whatever only Anthropic can do"? That's just as silly as "whatever a computer can't do" and a lot more cynical.
We are not even at that extreme, and you can already see the unequal reality that too much SaaS has engendered.
Doing computation that can happen at the end points at the end points is massively more scalable. Even better, it's done by compute you usually aren't paying for if you're the company providing the service.
I saw an interview with the guy who made photopea where he talked about how tiny his costs were because all compute was done in the user's browser. Running a saas in a cloud is expensive.
It's an underrated aspect of what we used to call "software".
And that's leaving aside questions of latency and data privacy.
I think it can make LLMs fun.
Why do you think they continue to run at a loss? From the goodness of their hearts? Their biggest goal is to discourage anyone from running local models. The hardware is expensive... and the way to run models is very difficult (for example, I have dual RTX 3090s for VRAM, and running large, heavily quantized models is a real pain in the arse; no high-quantisation library supports two GPUs, for example, and there seems to be no interest in implementing it by the guys behind the best inference tools).
So this is welcome, but let's not forget why it is being done.
I'm curious to hear what you're trying to run, because I haven't used any software that is not compatible with multiple GPUs.
I'm sure there are other use cases, but much like "what is BitTorrent for?", the obvious use case is obvious.
For practical RAG processes of narrow scope, even a minimal amount of scaffolding gets you a very usable speed for automating tasks, especially as the last-mile/edge-device portion of a more complex process with better models in use upstream. Classification tasks, reasonably intelligent decisions between traditional workflow processes, other use cases: all of them extremely valuable in enterprise, being built and deployed right now.
There’s also a huge opportunity space for serving clients with very sensitive data. Health, legal, and government come to mind immediately. These local models are only going to get more capable of handling their use cases. They already are, really.
1. App makers can fine tune smaller models and include in their apps to avoid server costs
2. Privacy-sensitive content can be either filtered out or worked on... I'm using local LLMs to process my health history for example
3. Edge servers can be running these fine tuned for a given task. Flash/lite models by the big guys are effectively like these smaller models already.
The JetBrains suite includes a few LLM models on the order of a hundred megabytes. These models are able to provide "obvious" line completion, like filling in variable names, as well as some basic predictions, like realising that the `if let` statement I'm typing out is going to look something like `if let Some(response) = client_i_just_created.foobar().await`.
If that was running in The Cloud, it would have latency issues, rate limits, and it wouldn't work offline. Sure, there's a pretty big gap between these local IDE LLMs and what OpenAI is offering here, but if my single-line autocomplete could be a little smarter, I sure wouldn't complain.
And AI has been in games for a long time. Generated terrain and other sorts of automation have been used as techniques for a hot minute now.
All I’m suggesting is to keep on that same trajectory, now just using an on-device LLM to back intelligence features.
Creatives? I am surprised no one's mentioned this yet:
I tried to help a couple of friends with better copy for their websites, and quickly realized that they were using inventive phrases to explain their work, phrases that they would not want competitors to get wind of and benefit from; phrases that associate closely with their personal brand.
Ultimately, I felt uncomfortable presenting the cloud AIs with their text. Sometimes I feel this way even with my own Substack posts, where I occasionally coin a phrase I am proud of. But with local AI? Cool...
But... they're publishing a website. Which competitors will read. Which chatbots will scrape. I genuinely don't get it.
Funny how that works.
So, it'll take at least two more quarters before I can actually use those non-local tools on company related data. Probably longer, because sense of urgency is not this company's strong suit.
Anyway, as a developer I can run a lot of things locally. Local AI doesn't leak data, so it's safe. It's not as good as the online tools, but for some things they're better than nothing.
This gives us really good recall at really low cloud cost and latency.
That rules out coding assistants like Claude, chat, tools to generate presentations and copy-edit documents, and so forth.
But local AI are fine, as long as we're sure nothing is uploaded.
For example, "generate a heatmap of each token/word and how 'unexpected' they are" or "find me a prompt that creates the closest match to this text"
To be efficient, both require access that is not exposed over an API.
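As a sketch of the first idea, per-token surprisal is easy to compute when you have the logits locally; this assumes a small Hugging Face causal LM (gpt2 as a stand-in) and PyTorch, not any particular provider's API:

```python
# Minimal sketch: per-token "surprisal" (negative log-probability) from a
# local model's logits, the raw material for a heatmap of unexpected tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "The quick brown fox jumps over the lazy dog"
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits                      # [1, seq_len, vocab]

# Log-probability the model assigned to each actual next token.
logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
token_lp = logprobs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze()

for t, lp in zip(tok.convert_ids_to_tokens(ids[0, 1:].tolist()), token_lp):
    print(f"{t:>12}  surprisal = {-lp.item():.2f} nats")
```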
A small LLM can do RAG, call functions, summarize, create structured data from messy text, etc... You know, all the things you'd do if you were making an actual app with an LLM.
Yeah, chat apps are pretty cheap and convenient for users who want to search the internet and write text or code. But APIs quickly get expensive when inputting a significant amount of tokens.
How about running one on this site but making it publically available? A sort of outranet and calling it HackerBrain?
iPhone users in a few months – because I predict app developers will love cramming calls to the foundation models into everything.
Android will follow.
Besides that, we are moving towards an era where we won't need to pay providers a subscription every month to use these models. I can't say for certain whether or not the GPUs that run them will get cheaper, but the option to run your own model is game changing for more than you can possibly imagine.
that's me - well not a student anymore. when toying with something, i much prefer not paying for each shot. my 12GB Radeon card can either run a decent extremely slow, or a idiotic but fast model. it's nice not dealing with rate limits.
once you write a prompt that mangles an idiotic model into still doing the work, it's really satisfying. the same principle as working to extract the most from limited embedded hardware. masochism, possibly
Even if they did offer a defined latency product, you’re relying on a lot of infrastructure between your application and their GPU.
That’s not always tolerable.
In large companies this can save quite a bit of money.
Why not run all the models at home, maybe collaboratively or at least in parallel?
I'm sure there are use cases where the paid models are not allowed to collaborate or ask each other.
also, other open models are gaining mindshare.
example: military intel
That means running instantly offline and every token is free
Privacy is obvious.
AI is going to be equivalent to all computing in the future. Imagine if only IBM, Apple and Microsoft ever built computers, and all anyone else ever had in the 1990s were terminals to the mainframe, forever.
Thanks, but I prefer my computing to be deterministic if at all possible.
Answer on Wikipedia: https://en.wikipedia.org/wiki/Battle_of_Midway#U.S._code-bre...
dolphin3.0-llama3.1-8b Q4_K_S [4.69 GB on disk]: correct in <2 seconds
deepseek-r1-0528-qwen3-8b Q6_K [6.73 GB]: correct in 10 seconds
gpt-oss-20b MXFP4 [12.11 GB] low reasoning: wrong after 6 seconds
gpt-oss-20b MXFP4 [12.11 GB] high reasoning: wrong after 3 minutes !
Yea yea it's only one question of nonsense trivia. I'm sure it was billions well spent.
It's possible I'm using a poor temperature setting or something but since they weren't bothered enough to put it in the model card I'm not bothered to fuss with it.
To be fair, this is not the type of question that benefits from reasoning: either the model has this info in its parametric memory or it doesn't. Reasoning won't help.
openai/gpt-oss-20b: 12.82 tok/sec, 140 tokens, 7.91s to first token
Shouldn't we prefer to have LLMs just search and summarize more reliable sources?
It correctly chose to search, and pulled in the release page itself as well as a community page on reddit, and cited both to give me the incorrect answer that a release had been pushed 3 hours ago. Later on when I got around to it, I discovered that no release existed, no mention of a release existed on either cited source, and a new release wasn't made for several more days.
They are specifically training on web browsing and Python calling.
I am aware that there’s a huge group of people who justify their salary by being able to google.
Small models are going to be particularly poor when used outside of their intended purpose. They have to omit something.
That gives 24m cubic meters annual water usage.
Estimated ai usage in 2024: 560m cubic meters.
Projected water usage from AI in 2027: 4bn cubic meters at the low end.
Depending on the location of the hot water, you can cause disruptions to water currents; the North Atlantic circulation is being studied for how much global warming is affecting it.
If Greenland melts and the water doesn't get cold up there, then the current from the Gulf of Mexico to Europe ends and England becomes colder than Canada.
If your AI model has a data center in the Atlantic, it could be furthering that issue.
(Millions of animals are also dead)
Fresh clean water in your area is a wonderful thing.
Atmosphere: ~1.3e13 m³ vapor
Estimated impact from closed loop systems: 0-ish.
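To make that comparison explicit, taking the figures upthread at face value (they are estimates, not my own measurements): 560e6 m³ / 1.3e13 m³ ≈ 4e-5, i.e. roughly 0.004% of atmospheric vapor for the 2024 estimate, and 4e9 m³ / 1.3e13 m³ ≈ 3e-4, roughly 0.03%, for the 2027 low-end projection.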
“I am well versed in the lost art form of delicates seduction.”
Also, just wanted to credit you for being one of the five people on Earth who knows the correct spelling of "lede."
Not in the UK it isn’t.
These are the simplified results (total percentage of correctly classified E-mails on both spam and ham testing data):
gpt-oss:20b 95.6%
gemma3:27b-it-qat 94.3%
mistral-small3.2:24b-instruct-2506-q4_K_M 93.7%
mistral-small3.2:24b-instruct-2506-q8_0 92.5%
qwen3:32b-q4_K_M 89.2%
qwen3:30b-a3b-q4_K_M 87.9%
gemma3n:e4b-it-q4_K_M 84.9%
deepseek-r1:8b 75.2%
qwen3:30b-a3b-instruct-2507-q4_K_M 73.0%
I'm quite happy, because it's also smaller and faster than gemma3.
I mean the training, while expensive, is done once. The inference … besides being done by perhaps millions of clients, is done for, well, the life of the model anyway. Surely that adds up.
It's hard to know, but I assume the user taking up the burden of the inference is perhaps doing so more efficiently? I mean, when I run a local model, it is plodding along — not as quick as the online model. So, slow and therefore I assume necessarily more power efficient.
https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro
Are you discounting all of the self reported scores?
I don't understand why "TIGER-Lab"-sourced scores are 'unknown' in terms of model size?
It suggested the prompt. It’s infamous because models often get it wrong, they know it, and still they confidently suggested it and got it wrong.
When you imagine a lake being drained to cool a datacenter do you ever consider where the water used for cooling goes? Do you imagine it disappears?
This is a thinking model, so I ran it against o4-mini, here are the results:
* gpt-oss:20b
* Time-to-first-token: 2.49 seconds
* Time-to-completion: 51.47 seconds
* Tokens-per-second: 2.19
* o4-mini on ChatGPT
* Time-to-first-token: 2.50 seconds
* Time-to-completion: 5.84 seconds
* Tokens-per-second: 19.34
Time to first token was similar, but the thinking piece was _much_ faster on o4-mini. Thinking took the majority of the 51 seconds for gpt-oss:20b.
In my mind, I’m comparing the model architecture they describe to what the leading open-weights models (Deepseek, Qwen, GLM, Kimi) have been doing. Honestly, it just seems “ok” at a technical level:
- both models use standard Grouped-Query Attention (64 query heads, 8 KV heads). The card talks about how they’ve used an older optimization from GPT3, which is alternating between banded window (sparse, 128 tokens) and fully dense attention patterns. It uses RoPE extended with YaRN (for a 131K context window). So they haven’t been taking advantage of the special-sauce Multi-head Latent Attention from Deepseek, or any of the other similar improvements over GQA.
- both models are standard MoE transformers. The 120B model (116.8B total, 5.1B active) uses 128 experts with Top-4 routing (a toy sketch of that routing shape follows after this list). They’re using some kind of gated SwiGLU activation, which the card describes as "unconventional" because of the clamping and whatever residual connections that implies. Again, not using any of Deepseek’s “shared experts” (for general patterns) + “routed experts” (for specialization) architectural improvements, Qwen’s load-balancing strategies, etc.
- the most interesting thing IMO is probably their quantization solution. They did something to quantize >90% of the model parameters to the MXFP4 format (4.25 bits/parameter) to let the 120B model fit on a single 80GB GPU, which is pretty cool. But we’ve also got Unsloth with their famous 1.58bit quants :)
All this to say, it seems like even though the training they did for their agentic behavior and reasoning is undoubtedly very good, they’re keeping their actual technical advancements “in their pocket”.
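To make the routing shape concrete, here's a toy sketch of a top-4-of-128 MoE layer; the dimensions and the plain SwiGLU experts are illustrative stand-ins, not OpenAI's actual implementation:

```python
# Toy sketch of top-k expert routing (128 experts, 4 active per token).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=256, d_ff=512, n_experts=128, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Each expert is a small SwiGLU-style MLP: gate, up and down projections.
        self.gate = nn.Parameter(torch.randn(n_experts, d_model, d_ff) * 0.02)
        self.up   = nn.Parameter(torch.randn(n_experts, d_model, d_ff) * 0.02)
        self.down = nn.Parameter(torch.randn(n_experts, d_ff, d_model) * 0.02)

    def forward(self, x):                        # x: [tokens, d_model]
        scores = self.router(x)                  # [tokens, n_experts]
        w, idx = torch.topk(scores, self.top_k, dim=-1)
        w = F.softmax(w, dim=-1)                 # weights over the 4 chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            e = idx[:, k]                        # expert id chosen in slot k
            g = torch.einsum("td,tdf->tf", x, self.gate[e])
            u = torch.einsum("td,tdf->tf", x, self.up[e])
            h = F.silu(g) * u                    # SwiGLU-style gating
            out += w[:, k:k+1] * torch.einsum("tf,tfd->td", h, self.down[e])
        return out

moe = TopKMoE()
print(moe(torch.randn(6, 256)).shape)            # torch.Size([6, 256])
```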
This would be much more efficient than relying purely on RL post-training on a small model; with low baseline capabilities the insights would be very sparse and the training very inefficient.
same seems to be true for humans
The model is pretty sparse tho, 32:1.
Ah I see. I didn't notice that behemoth has the same sparsity as scout. That seems quite random indeed.
Unsloth's special quants are amazing, but I've found there to be lots of trade-offs vs. full precision, particularly when striving for the best first-shot attempts, which are by far the bulk of LLM use cases. Running a better (larger, newer) model at lower quantization to fit in memory, or with reduced accuracy/detail to speed it up, both have value, but in the pursuit of first-shot accuracy there don't seem to be many companies running their frontier models at reduced quantization. If OpenAI is doing this in production, that is interesting.
They said it was native FP4, suggesting that they actually trained it like that; it's not post-training quantisation.
When I just want a full summary without necessarily understanding all the details, I have an audio overview made on NotebookLM and listen to the podcast while I’m exercising or cleaning. I did that a few days ago with the recent Anthropic paper on persona vectors, and it worked great.
Also, most of my questions are not about specific facts but about higher-level concepts. For ML-related topics, at least, the responses check out.
https://www.manning.com/books/build-a-large-language-model-f...
TLDR: I think OpenAI may have taken the medal for best available open weight model back from the Chinese AI labs. Will be interesting to see if independent benchmarks resolve in that direction as well.
The 20B model runs on my Mac laptop using less than 15GB of RAM.
qwen3-coder-30b 4-bit mlx took on the task w/o any hiccups with a fully working dashboard, graphs, and recent data fetched from yfinance.
gpt-oss-20b mxfp4's code had a missing datetime import and, when fixed, delivered a dashboard without any data and with a starting date of Aug 2020. Having adjusted the date, the update methods did not work and displayed error messages.
If it's decent at other tasks, which I do often find OpenAI being better than others at, then I think it's a win, especially a win for the open-source community: even AI labs that pioneered the Gen AI hype and never wanted to launch open models are now being forced to launch them. That is definitely a win, and not something that was certain before.
Maybe too open-ended a question? I can run the DeepSeek model locally really nicely.
I have a bunch of scripts that use tool calling. Qwen-3-32B handles everything flawlessly at 60 tok/sec. Gpt-oss-120B breaks in some cases and runs at mere 35 tok/sec (doesn't fit on the GPU).
But I hope there's still some ironing out to do in llama.cpp and in the quants. So far it feels lackluster compared to Qwen3-32B and GLM-4.5-Air
It's not a comprehensive benchmark - there are many ways you could run it in ways that would be much more informative and robust.
It's great as a quick single sentence prompt to get a feeling for if the model can produce working JavaScript or not.
Comparing them between models is also kind of interesting, even if it's not a flawlessly robust comparison: https://simonwillison.net/tags/space-invaders/
Firstly, 12GB is not enough space to hold a copy of anything that large from the training data and just regurgitate it back out again.
You can also watch the thinking traces on the reasoning models and see them piece together the approach they are going to take. Here's an example from the 20B OpenAI model with reasoning set to medium: https://gist.github.com/simonw/63d7d8c43ae2ac93c214325bd6d60...
Illustrative extract:
> Edge detection: aliens leftmost or rightmost position relative to canvas width minus alien width.
> When direction changes, move all aliens down by step (e.g., 10 px).
The benchmarks that aren't publicly disclosed tend to be way simpler than this: things like "What is the embryological origin of the hyoid bone?" (real example from MMLU, it then provides four choices as a multiple-choice challenge).
Add to this that the Common Crawl slices used for the Pile/C4 mirror much of what you can find on GitHub. So when the training data contains dozens of near-duplicate solutions, the network only needs to interpolate between them.
As to the CoT-style dumps that you've shown, they are easy to misinterpret. Apple's "Illusion of Thinking" paper shows that models will happily backfill plausible-sounding rationales that do not correspond to the computation that actually produced the answer, and other evaluation work shows that when you systematically rewrite multiple-choice distractors so that memorisation can't help, accuracy drops by 50-90%, even on "reasoning" models: https://arxiv.org/abs/2502.12896 So a cool-looking bullet list about "edge detection" could be just narrative overspray, and not really evidence of algorithmic planning.
If you actually want to know whether a model can plan an arcade game or whatever, rather than recall it, then you need a real benchmark (metamorphic rewrites, adversarial "none of the others" options, etc.). Until a benchmark controls for leakage in these ways, a perfect Space Invaders score mostly shows that the model has good pattern matching for code it has already seen.
Presumably this is because "the network only needs to interpolate between them". That's what I want it to do!
I tried the space invaders thing on a 4GB Qwen model today and it managed to produce a grid of aliens that advanced one step... and then dropped off the page entirely.
When you temperature-sample the same model twice you also get "different" code; diversity alone is not evidence of new reasoning. What matters is functional novelty under controlled transformations (renamed variables, resized canvas, obfuscated asset file names, etc.). On such metamorphic rewrites, models that appear brilliant on canonical prompts suddenly collapse, a hallmark of shallow pattern matching.
The paper I mentioned in my previous comment shows SOTA coding LLMs scoring 70%+ on SWE-bench Verified yet dropping 10-47% when the very same issues are paraphrased or drawn from unseen repos, even though the task semantics are identical. That is classic memorisation, just fuzzier than a CRC match.
As to Qwen: even at 4 bits per weight, a 4B model retains ≈2.1 GB of information capacity, enough to memorise tens of thousands of full game loops. The reason it garbled the alien movement logic is probably that its limited capacity forced lossy compression, so the behaviour you saw is typical of partially recalled code patterns whose edge cases were truncated during training. That's still interpolation over memorised fragments, just with fewer fragments to blend. And this is actually borne out by controlled fact-memorisation studies and extraction attacks (https://arxiv.org/abs/2406.15720v1): up through 70B params they show a monotone curve, so each extra order of magnitude adds noticeably more verbatim or near-verbatim recall. A 20B model succeeds where a 4B one fails because the former crossed the "capacity per training token" threshold for that exemplar. So nothing magical there.
Don't get me wrong, I'm not arguing against interpolation per se; generalising between held-out exemplars is precisely what we want. The problem is that most public "just write space invaders" demos never verify that the endpoints were truly unseen. Until they do, a perfect clone is compatible with nothing deeper than glorified fuzzy lookup.
It more or less fits my fuzzy mental model of how this stuff works.
I'm completely fine with my test prompt taking advantage of this - the point of "implement space invaders" is to explore how well it can construct a game of that shape based on the examples that it has seen in its training data, especially in comparison to other models.
I'm not trying for a test of ability to produce a unique new game - I want a short prompt that gets it to output some HTML and JavaScript that I can then interact with.
That's just straight up not the case. Not sure how you can jump to that conclusion not least when you stated that you haven't tested tool calling in your post too.
Many people in the community are finding it substantially lobotomized to the point that there are "safe" memes everywhere now. Maybe you need to develop better tests and pay more attention to benchmaxxing.
There are good things that came out of this release from OpenAI, but we'd appreciate more objective analyses...
> I’m waiting for the dust to settle and the independent benchmarks (that are more credible than my ridiculous pelicans) to roll out, but I think it’s likely that OpenAI now offer the best available open weights models.
You told me off for jumping to conclusions and in the same comment quoted me saying "I think OpenAI may have taken" - that's not a conclusion, it's tentative speculation.
You know you have a significant audience, so don't act like you don't know what you're doing when you chose to say "TLDR: I think OpenAI may have taken the medal for best available open weight model back from the Chinese AI labs" then defend what I was calling out based on word choices like "conclusions" (I'm sure you have read conclusions in academic journals?), "I think", and "speculation".
I try to be careful about my choice of words, even in forum comments.
IMO, the "I think..." bit could be ambiguous and read as, "In my opinion, OpenAI may have...".
I agree with you it's not a hard/bold endorsement but perhaps leading with the disclaimer that you're reserving final judgement could assuage these concerns.
https://irthomasthomas.github.io/Pelicans-consortium/ https://github.com/irthomasthomas/Pelicans-consortium
One test I do is to give a common riddle but word it slightly to see if it can actually reason.
For example:
"Bobs dad has five daughters, Lala, Lele, Lili, Lolo and ???"
The 20B model kept picking the answer of the original riddle, even after explaining extra information to it.
The original riddle is:
"Janes dad has five daughters, Lala, Lele, Lili, Lolo and ???"
``` > "Bobs dad has five daughters, Lala, Lele, Lili, Lolo and ??? <|channel|>analysis<|message|>The question: "Bobs dad has five daughters, Lala, Lele, Lili, Lolo and ???". It's a common riddle. The answer is Bob? Actually "Bob's dad has five daughters: Lala, Lele, Lili, Lolo and..." The riddle is to trick: it's "Bob" himself is a daughter? Wait Bob's dad has five daughters: Lala, Lele, Lili, Lolo, and Bob? But Bob is a name often male. But the riddle says Bob's dad. So Bob is a person who has a dad, and that dad has five daughters. But Bob could be the fifth daughter? That would be weird: Bob is likely the name of the fifth daughter. So answer: Bob. But sometimes the answer is "Lulu" or "Bob's sister"? Let's think. The riddle is a trick: It's "Bobs dad has five d aughters: Lala, Lele, Lili, Lolo, and ... Bob." Because the riddle begins with "Bobs dad has five daughters" implying Bob is one of the daughters. So the missing daughter is Bob. So answer: "Bob".
Thus answer: Bob. But sometimes the riddle goes "Bob's dad has five daughters, Lala, Lele, Lili, Lolo, and ... Who is the fifth?" The answer: Bob (the name of the person speaking, "Bobs"). But in riddle it's "Bobs dad has five daughters , Lala, Lele, Lili, Lolo and ???". The missing name is "Bob." So answer: Bob.
Thus the answer: Bob. Or maybe it's "Boba"? But no.
Hence answer: Bob.
We'll answer accordingly.<|start|>assistant<|channel|>final<|message|>The missing daughter is *Bob*. The riddle starts with “Bob’s dad …” so Bob himself is the fifth daughter. ```
But if it's good at reasoning it should understand that if the name is a boy's name then it is not the 5th daughter.
Now the last daughter can be called anything, but because there is a vowel pattern it should pick Lulu to solve the puzzle, but that is the wrong answer in the original puzzle.
The dad of <Person> has 5 daughters. Four are listed off. So the answer for the fifth is <Person>.
I was about to try the same. What TPS are you getting and on which processor? Thanks!
It is a good way to detect if another model was trained on your data for example, or is a distillation/quant/ablation.
My personal prediction is that the US foundational model makers will OSS something close to N-1 for the next 1-3 iterations. The CAPEX for the foundational model creation is too high to justify OSS for the current generation. Unless the US Gov steps up and starts subsidizing power, or Stargate does 10x what it is planned right now.
N-1 model value depreciates insanely fast. Making an OSS release of them and allowing specialized use cases and novel developments allows potential value to be captured and integrated into future model designs. It's medium risk, as you may lose market share. But also high potential value, as the shared discoveries could substantially increase the velocity of next-gen development.
There will be a plethora of small OSS models. Iteration on the OSS releases is going to be biased towards local development, creating more capable and specialized models that work on smaller and smaller devices. In an agentic future, every different agent in a domain may have its own model. Distilled and customized for its use case without significant cost.
Everyone is racing to AGI/SGI. The models along the way are to capture market share and use data for training and evaluations. Once someone hits AGI/SGI, the consumer market is nice to have, but the real value is in novel developments in science, engineering, and every other aspect of the world.
[0] https://www.anthropic.com/research/persona-vectors > We demonstrate these applications on two open-source models, Qwen 2.5-7B-Instruct and Llama-3.1-8B-Instruct.
[2 of 3] Assuming we pin down what win means... (which is definitely not easy)... What would it take for this to not be true? There are many ways, including but not limited to:
- publishing open weights helps your competitors catch up
- publishing open weights doesn't improve your own research agenda
- publishing open weights leads to a race dynamic where only the latest and greatest matters; leading to a situation where the resources sunk exceed the gains
- publishing open weights distracts your organization from attaining a sustainable business model / funding stream
- publishing open weights leads to significant negative downstream impacts (there are a variety of uncertain outcomes, such as: deepfakes, security breaches, bioweapon development, unaligned general intelligence, humans losing control [1] [2], and so on)
[1]: "What failure looks like" by Paul Christiano : https://www.alignmentforum.org/posts/HBxe6wdjxK239zajf/what-...
[2]: "An AGI race is a suicide race." - quote from Max Tegmark; article at https://futureoflife.org/statement/agi-manhattan-project-max...
In this setup OSS models could be more than enough and capture the market but I don't see where the value would be to a multitude of specialized models we have to train.
I don't think there will be such a unique event. There is no clear boundary. This is a continuous process. Models get slightly better than before.
Also, another dimension is the inference cost to run those models. It has to be cheap enough to really take advantage of it.
Also, I wonder, what would be a good target to make profit, to develop new things? There is Isomorphic Labs, which seems like a good target. This company already exists now, and people are working on it. What else?
I guess it depends on your definition of AGI, but if it means human level intelligence then the unique event will be the AI having the ability to act on its own without a "prompt".
That's super easy. The reason they need a prompt is that this is the way we make them useful. We don't need LLMs to generate an endless stream of random "thoughts" otherwise, but if you really wanted to, just hook one up to a webcam and microphone stream in a loop and provide it some storage for "memories".
I have this theory that we simply got over a hump by utilizing a massive processing boost from GPUs as opposed to CPUs. That might have been two to three orders of magnitude more processing power.
But that's a one-time success. I don't think hardware has any large-scale improvements coming, because 3D gaming already plumbed most of that vector-processing hardware development over the last 30 years.
So will software and better training models produce another couple orders of magnitude?
Fundamentally we're talking about nines of accuracy. What is the processing power required for each nine of accuracy? Is it linear? Is it polynomial? Is it exponential?
It just seems strange to me that, with all the AI knowledge sloshing through academia, I haven't seen any basic analysis at that level, which is something that's absolutely going to be necessary for AI applications like self-driving once you get those insurance companies involved.
[1 of 3] For the sake of argument here, I'll grant the premise. If this turns out to be true, it glosses over other key questions, including:
For a frontier lab, what is a rational period of time (according to your organizational mission / charter / shareholder motivations*) to wait before:
1. releasing a new version of an open-weight model; and
2. how much secret sauce do you hold back?
* Take your pick. These don't align perfectly with each other, much less the interests of a nation or world.
This implies LLM development isn't plateaued. Sure, the researchers are busting their asses quantizing, adding features like tool calls and structured outputs, etc. But soon enough N-1 ≈ N.
[3 of 3] What would it take for this statement to be false or missing the point?
Maybe we find ourselves in a future where:
- Yes, open models are widely used as base models, but they are also highly customized in various ways (perhaps by industry, person, attitude, or something else). In other words, this would be a blend of open and closed.
- Maybe publishing open weights of a model is more-or-less irrelevant, because it is "table stakes" ... because all the key differentiating advantages have to do with other factors, such as infrastructure, non-LLM computational aspects, regulatory environment, affordable energy, customer base, customer trust, and probably more.
- The future might involve thousands or millions of highly tailored models
There's basically no reason to run other open source models now that these are available, at least for non-multimodal tasks.
I'm still withholding judgement until I see benchmarks, but every point you tried to make regarding model size and parameter size is wrong. Qwen has more variety on every level, and performs extremely well. That's before getting into the MoE variants of the models.
It's cool to see OpenAI throw their hat in the ring, but you're smoking straight hopium if you think there's "no reason to run other open source models now" in earnest. If OpenAI never released these models, the state-of-the-art would not look significantly different for local LLMs. This is almost a nothingburger if not for the simple novelty of OpenAI releasing an Open AI for once in their life.
So are/do the new OpenAI models, except they're much smaller.
Qwen-0.6b gets it right.
Let's not forget, this is a thinking model that has significantly worse scores on Aider-Polyglot than the non-thinking Qwen3-235B-A22B-Instruct-2507, a worse TAUBench score than the smaller GLM-4.5 Air, and a worse SWE-Bench Verified score than the (3x the size) GLM-4.5. So the results, at least in terms of benchmarks, are not really clear-cut.
From a vibes perspective, the non-reasoners Kimi-K2-Instruct and the aforementioned non-thinking Qwen3 235B are much better at frontend design. (Tested privately, but fully expecting DesignArena to back me up in the following weeks.)
OpenAI has delivered something astonishing for the size, for sure. But your claim is just an exaggeration. And OpenAI have, unsurprisingly, highlighted only the benchmarks where they do _really_ well.
So far I have mixed impressions, but they do indeed seem noticeably weaker than comparably-sized Qwen3 / GLM4.5 models. Part of the reason may be that the oai models do appear to be much more lobotomized than their Chinese counterparts (which are surprisingly uncensored). There's research showing that "aligning" a model makes it dumber.
We seem to censor organized large-scale complaints and viral mind viruses, but we never quite forbid people at home from reading some generated knowledge from an obscure, hard-to-use piece of software.
Kind of a P=NP, but for software deliverability.
I imagine the same conflicts will ramp up over the next few years, especially once the silly money starts to dry up.
AI did get used for fake news, propaganda, mass surveillance, erosion of trust and sense of truth, and mass spamming social media.
God bless China.
I just feel lucky to be around in what's likely the most important decade in human history. Shit odds on that, so I'm basically a lotto winner. Wild times.
ah, but that begs the question: did those people develop their worries organically, or did they simply consume the narrative heavily pushed by virtually every mainstream publication?
the journos are heavily incentivized to spread FUD about it. they saw the writing on the wall that the days of making a living by producing clickbait slop were coming to an end and deluded themselves into thinking that if they kvetch enough, the genie will crawl back into the bottle. scaremongering about sci-fi skynet bullshit didn't work, so now they kvetch about joules and milliliters consumed by chatbots, as if data centers did not exist until two years ago.
likewise, the bulk of other "concerned citizens" are creatives who use their influence to sway their followers, still hoping against hope to kvetch this technology out of existence.
honest-to-God yuddites are as few and as retarded as honest-to-God flat earthers.
Lol. To be young and foolish again. This covid laced decade is more of a placeholder. The current decade is always the most meaningful until the next one. The personal computer era, the first cars or planes, ending slavery needs to take a backseat to the best search engine ever. We are at the point where everyone is planning on what they are going to do with their hoverboards.
happened over many centuries, not in a given decade. Abolished and reintroduced in many places: https://en.wikipedia.org/wiki/Timeline_of_abolition_of_slave...
There was a ballot measure to actually abolish slavery a year or so back. It failed miserably.
Even in liberal states, the dehumanization of criminals is an endemic behavior, and we are reaching the point in our society where ironically having the leeway to discuss the humane treatment of even our worst criminals is becoming an issue that affects how we see ourselves as a society before we even have a framework to deal with the issue itself.
What one side wants is for prisons to be for rehabilitation and societal reintegration, for prisoners to have the right to decline to work and to be paid fair wages from their labor. They further want to remove for-profit prisons from the equation completely.
What the other side wants is the acknowledgement that prisons are not free, they are for punishment, and that prisoners have lost some of their rights for the duration of their incarceration and that they should be required to provide labor to offset the tax burden of their incarceration on the innocent people that have to pay for it. They also would like it if all prisons were for-profit as that would remove the burden from the tax payers and place all of the costs of incarceration onto the shoulders of the incarcerated.
Both sides have valid and reasonable wants from their vantage point while overlooking the valid and reasonable wants from the other side.
If people were sold into slavery as a punishment (so they became some one else's property) as some ancient societies did, then that would clearly be slavery.
The most shocking thing about prisons in the US is how common prison rape is, and the extent to which it seems to be regarded as a joke. The majority of rapes in the US are prison rapes. How can that not be anything but an appalling problem?
https://en.wikipedia.org/wiki/Prison_rape_in_the_United_Stat...
Rape is also something slaves are casually subject to in most slave societies. It was definitely accepted that Roman slave owners were free to rape men, women and children they owned.
You'll see from the definition of a "slave" [2] that prisoner labor specifically fits the definition of a slave, hence why the constitution makes an exception for it.
[1] https://constitutioncenter.org/the-constitution/amendments/a... [2] https://www.oed.com/dictionary/slave_n?tl=true
That's kind of vacuously true though, isn't it?
However, if you actually read it, the 13th amendment makes an explicit allowance for slavery (i.e. expressly allows it):
"Neither slavery nor involuntary servitude, *except as a punishment for crime whereof the party shall have been duly convicted*" (emphasis mine obviously since Markdown didn't exist in 1865)
They choose to because extra money = extra commissary snacks and having a job is preferable to being bored out of their minds all day.
That's the part that's frequently not included in the discussion of this whenever it comes up. Prison jobs don't pay minimum wage, but given that prisoners are wards of the state that seems reasonable.
The 120B model is worse at coding than Qwen3 Coder, GLM-4.5 Air, and even Grok 3... (https://www.reddit.com/r/LocalLLaMA/comments/1mig58x/gptoss1...)
What does the resource usage look like for GLM 4.5 Air? Is that benchmark in FP16? GPT-OSS-120B will be using between 1/4 and 1/2 the VRAM that GLM-4.5 Air does, right?
It seems like a good showing to me, even though Qwen3 Coder and GLM 4.5 Air might be preferable for some use cases.
Thanks.
12.63 tok/sec • 860 tokens • 1.52s to first token
I'm amazed it works at all with such limited RAM
After considering my sarcasm for the last 5 minutes, I am doubling down. The government of the United States of America should enhance its higher IQ people by donating AI hardware to them immediately.
This is critical for global competitive economic power.
Send me my hardware US government
and the 120b: https://asciinema.org/a/B0q8tBl7IcgUorZsphQbbZsMM
I am, um, floored
```
total duration: 1m14.16469975s
load duration: 56.678959ms
prompt eval count: 3921 token(s)
prompt eval duration: 10.791402416s
prompt eval rate: 363.34 tokens/s
eval count: 2479 token(s)
eval duration: 1m3.284597459s
eval rate: 39.17 tokens/s
```
[1] https://github.com/openai/harmony
From a strategic perspective, I can't think of any reason they'd release this unless they were about to announce something which totally eclipses it?
There's future opportunity in licensing, tech support, agents, or even simply to dominate and eliminate. Not to mention brand awareness, If you like these you might be more likely to approach their brand for larger models.
The question is how much better the new model(s) will need to be on the metrics given here to feel comfortable making these available.
Despite the loss of face for lack of open model releases, I do not think that was a big enough problem to undercut commercial offerings.
Given it's only around 5 billion active params it shouldn't be a competitor to o3 or any of the other SOTA models, given the top Deepseek and Qwen models have around 30 billion active params. Unless OpenAI somehow found a way to make a model with 5 billion active params perform as well as one with 4-8 times more.
I think that the point that makes me more excited is that we can train trillion-parameter giants and distill them down to just billions without losing the magic. Imagine coding with Claude 4 Opus-level intelligence packed into a 10B model running locally at 2000 tokens/sec - like instant AI collaboration. That would fundamentally change how we develop software.
But it does not actually compete with o3 performance. Not even close. As usual, the metrics are bullshit. You don't know how good the model actually is until you grill it yourself.
Kudos to that team.
https://www.reddit.com/r/LocalLLaMA/comments/1meeyee/ollamas...
All the real heavy lifting is done by llama.cpp, and for the distribution, by HuggingFace.
Well done OpenAI, this seems like a sincere effort to do a real open model with competitive performance, usable/workable licensing, a tokenizer compatible with your commercial offerings, it's a real contribution. Probably the most open useful thing since Whisper that also kicked ass.
Keep this sort of thing up and I might start re-evaluating how I feel about this company.
$0.15/M tokens in / $0.60-0.75/M tokens out
edit: Now Cerebras too at 3,815 tps for $0.25/M in / $0.69/M out.
On ChatGPT.com o3 thought for 13 seconds, on OpenRouter GPT OSS 120B thought for 0.7 seconds - and they both had the correct answer.
I am not kidding but such progress from a technological point of view is just fascinating!
What is being measured here? For end-to-end time, one model is:
t_total = t_network + t_queue + t_batch_wait + t_inference + t_service_overhead
https://x.com/tekacs/status/1952788922666205615
Asking it about a marginally more complex tech topic and getting an excellent answer in ~4 seconds, reasoning for 1.1 seconds...
I am _very_ curious to see what GPT-5 turns out to be, because unless they're running on custom silicon / accelerators, even if it's very smart, it seems hard to justify not using these open models on Groq/Cerebras for a _huge_ fraction of use-cases.
https://news.ycombinator.com/item?id=44738004
... today, this is a real-time video of the OSS thinking models by OpenAI on Groq and I'd have to slow it down to be able to read it. Wild.
[1] currently $3/M in / $8/M out https://platform.openai.com/docs/pricing
LLMs are getting cheaper much faster than I anticipated. I'm curious if it's still the hype cycle and Groq/Fireworks/Cerebras are taking a loss here, or whether things are actually getting cheaper. At this rate we'll be able to run Qwen3-32B-level models on phones/embedded devices soon.
I'll have to try again later but it was a bit underwhelming.
The latency also seemed pretty high, not sure why. I think with the latency the throughput ends up not making much difference.
Btw Groq has the 20b model at 4000 TPS but I haven't tried that one.
so, the 20b model.
Can someone explain to me what I would need to do in terms of resources (GPU, I assume) if I want to run 20 concurrent processes, assuming I need 1k tokens/second throughput (on each, so 20 x 1k)
Also, is this model better/comparable for information extraction compared to gpt-4.1-nano, and would it be cheaper to host myself 20b?
Multiply the number of A100's you need as necessary.
Here, you don't really need the ram. If you could accept fewer tokens/second, you could do it much cheaper with consumer graphics cards.
Even with A100, the sweet-spot in batching is not going to give you 1k/process/second. Of course, you could go up to H100...
3.6B activated at Q8 x 1000 t/s = 3.6TB/s just for activated model weights (there's also context). So pretty much straight to B200 and alike. 1000 t/s per user/agent is way too fast, make it 300 t/s and you could get away with 5090/RTX PRO 6000.
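To make that arithmetic explicit, here's a rough weights-only estimate (ignoring KV cache and activation traffic, and assuming the 3.6B active-parameter figure above):

```python
# Back-of-the-envelope memory-bandwidth estimate for MoE decoding.
active_params = 3.6e9      # parameters touched per generated token
bytes_per_weight = 1.0     # Q8 ~= 1 byte per weight
tokens_per_second = 1000   # requested per-stream speed

per_stream_bw = active_params * bytes_per_weight * tokens_per_second
print(f"{per_stream_bw / 1e12:.1f} TB/s per stream")   # ~3.6 TB/s

# Batching lets concurrent streams share a single read of the weights, so
# 20 streams need far less than 20x this, but 1000 t/s per stream already
# rules out most single consumer GPUs on memory bandwidth alone.
```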
but I need to understand 20 x 1k token throughput
I assume it just might be too early to know the answer
My 2x 3090 setup will get me ~6-10 streams of ~20-40 tokens/sec (generation) ~700-1000 tokens/sec (input) with a 32b dense model.
Also keep in mind this model does use 4-bit layers for the MoE parts. Unfortunately native accelerated 4-bit support only started with Blackwell on NVIDIA. So your 3090/4090/A6000/A100's are not going to be fast. An RTX 5090 will be your best starting point in the traditional card space. Maybe the unified memory minipc's like the Spark systems or the Mac mini could be an alternative, but I do not know them enough.
You are unlikely to match groq on off the shelf hardware as far as I'm aware.
Major points of interest for me:
- In the "Main capabilities evaluations" section, the 120b outperform o3-mini and approaches o4 on most evals. 20b model is also decent, passing o3-mini on one of the tasks.
- AIME 2025 is nearly saturated with large CoT
- CBRN threat levels kind of on par with other SOTA open source models. Plus, demonstrated good refusals even after adversarial fine tuning.
- Interesting to me how a lot of the safety benchmarking runs on trust, since methodology can't be published too openly due to counterparty risk.
Model cards with some of my annotations: https://openpaper.ai/paper/share/7137e6a8-b6ff-4293-a3ce-68b...
So FYI to anyone on Mac, the easiest way to run these models right now is using LM Studio (https://lmstudio.ai/); it's free. You just search for the model; usually 3rd-party groups mlx-community or lmstudio-community have mlx versions within a day or two of releases. I go for the 8-bit quantizations (4-bit is faster, but quality drops). You can also convert to mlx yourself...
Once you have it running in LM Studio, you can chat there in their chat interface, or you can run it through an API that defaults to http://127.0.0.1:1234
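As a concrete example, here's a minimal sketch of hitting that local endpoint with plain `requests`; the model name is a placeholder for whatever identifier LM Studio shows for your download, and the "Reasoning: high" system line is just one way people nudge the gpt-oss reasoning effort:

```python
import requests

resp = requests.post(
    "http://127.0.0.1:1234/v1/chat/completions",   # LM Studio's OpenAI-compatible server
    json={
        "model": "openai/gpt-oss-20b",              # placeholder: use the name LM Studio lists
        "messages": [
            {"role": "system", "content": "Reasoning: high"},
            {"role": "user", "content": "Write a haiku about local inference."},
        ],
        "temperature": 0.7,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```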
You can run multiple models that hot swap and load instantly and switch between them etc.
It's surprisingly easy, and fun. There are actually a lot of cool niche models coming out, like this tiny high-quality search model released today as well (whose makers released an official mlx version): https://huggingface.co/Intelligent-Internet/II-Search-4B
Other fun ones are Gemma 3n, which is multi-modal; a larger one that is actually a solid model but takes more memory is the new Qwen3 30B A3B (Coder and Instruct); Pixtral (Mixtral vision with full-resolution images); etc. Looking forward to playing with this model and seeing how it compares.
LM Studio community: 20b: https://huggingface.co/lmstudio-community/gpt-oss-20b-MLX-8b... 120b: https://huggingface.co/lmstudio-community/gpt-oss-120b-MLX-8...
In the repo is a metal port they made, that’s at least something… I guess they didn’t want to cooperate with Apple before the launch but I am sure it will be there tomorrow.
It's fun that it works, but the prefill time makes it feel unusable. (2-3 minutes per tool-use / completion). Means a ~10-20 tool-use interaction could take 30-60 minutes.
(This editing a single server.py file that was ~1000 lines, the tool definitions + claude context was around 30k tokens input, and then after the file read, input was around ~50k tokens. Definitely could be optimized. Also I'm not sure if ollama supports a kv-cache between invocations of /v1/completions, which could help)
Not sure about ollama, but llama-server does have a transparent kv cache.
You can run it with
llama-server -hf ggml-org/gpt-oss-20b-GGUF -c 0 -fa --jinja --reasoning-format none
Web UI at http://localhost:8080 (also OpenAI compatible API)
Wonder if they feel the bar will be raised soon (GPT-5) and feel more comfortable releasing something this strong.
I got a 1.7k token reply delivered too fast for the human eye to perceive the streaming.
n=1 for this 120b model, but I'd rank the reply #1, just ahead of Claude Sonnet 4, for a boring JIRA-ticket-shuffling type challenge.
EDIT: The same prompt on gpt-oss, despite being served 1000x slower, wasn't as good but was in a similar vein. It wanted to clarify more and as a result only half responded.
Humanity’s Last Exam: gpt-oss-120b (tools): 19.0%, gpt-oss-120b (no tools): 14.9%, Qwen3-235B-A22B-Thinking-2507: 18.2%
One positive thing I see is the number of parameters and size --- it will provide more economical inference than current open source SOTA.
gpt-oss:20b = ~46 tok/s
More than 2x faster than my previous leading OSS models: mistral-small3.2:24b = ~22 tok/s
gemma3:27b = ~19.5 tok/s
Strangely getting nearly the opposite performance running on 1x 5070 Ti: mistral-small3.2:24b = ~39 tok/s
gpt-oss:20b = ~21 tok/s
Where gpt-oss is nearly 2x slower vs mistral-small 3.2.
llama-server = ~181 tok/s
LM Studio = ~46 tok/s (default)
LM Studio Custom = ~158 tok/s (changed to offload to GPU and switch to CUDA llama.cpp engine)
and llama-server on my 3x A4000 GPU server is getting 90 tok/s vs 46 tok/s on ollama
Even from the UK I knew you would all do great things (I had no idea who else was involved).
I am glad I see the top comment is rare praise on HN.
Thanks again and keep it up Sama and team.
E.g. Hybrid architecture. Local model gathers more data, runs tests, does simple fixes, but frequently asks the stronger model to do the real job.
Local model gathers data using tools and sends more data to the stronger model.
Maybe you guys call it AGI, so anytime I see progress in coding, I think it goes just a tiny bit towards the right direction
Plus it also helps me as a coder to actually do some stuff just for the fun. Maybe coding is the only truly viable use of AI and all others are negligible increases.
There is so much polarization around the use of AI in coding, but I just want to say this: it would be pretty ironic if the industry that automates other people's jobs were this time the first to get its own job automated.
But I don't see that happening, far from it. Still, each day something new, something better happens, back to back. So yeah.
What would AGI mean, solving some problem that it hasn't seen? or what exactly? I mean I think AGI is solved, no?
If not, I see people mentioning that horizon alpha is actually a gpt 5 model and its predicted to release on thursday on some betting market, so maybe that fits AGI definition?
What’s the catch?
For GPT-5 to dwarf these just-released models in importance, it would have to be a huge step forward, and I’m still doubting about OpenAI’s capabilities and infrastructure to handle demand at the moment.
When it comes to LLMs, benchmarks are bullshit. If they sound too good to be true, it's because they are. The only thing benchmarks are useful for is preliminary screening - if the model does especially badly in them it's probably not good in general. But if it does good in them, that doesn't really tell you anything.
Yet another data point for benchmarks being utterly useless and completely gamed at this stage in the game by all the major AI developers.
These companies are clearly all very aware that the initial wave of hype at release is "sticky" and drives buzz/tech news coverage, while real-world testing takes much longer before that impression slowly starts to be undermined by practical usage and comparison to other models. Benchmarks with wildly overconfident naming like "Humanity's Last Exam" aren't exactly helping with objectivity either.
Probably GPT5 will be way way better. If alpha/beta horizon are early previews of GPT5 family models, then coding should be > opus4 for modern frontend stuff.
One basic point that is often missed is: Different aspects of LLM performance (in the cognitive performance sense) and LLM resource utilization are relevant to various use cases and business models.
Another is that there are many use cases where users prefer to run inference locally, for a variety of domain-specific or business model reasons.
The list goes on.
Which is all to say, availability of these local models for me is a key incentive that I didn't have before to use OpenAI's hosted ones.
I'm guessing it's going to very rapidly be patched into the various tools.
- OAI open source
- Opus 4.1
- Genie 3
- ElevenLabs Music
Edit: I just tried it and I'm less impressed now. We are really going to need major music software to get on board before we have actual creative audio tools. These all seem made for non-musicians to make a very cookie-cutter song from a specific genre.
Compared to both Udio and Suno, it's very very bad.. both at compositions, matching lyrics to music, keeping tempo and as soon as there's any distorted instruments like guitars or live, quality goes to radio-level.
This is my main problem with AI music at the moment, I'd love it if I had proper creative control as a musician that'd be amazing but a lot of the time it's just straight up slop generation.
OAI open source
Yeah. This certainly was not on my bingo card.
Edit: I tried it out. I have no idea in terms of tokens, but it was fluid enough for me. A bit slower than using o3 in the browser but definitely tolerable. I think I will set it up on my GF's machine so she can stop paying for the full subscription (she's a non-tech professional).
Very much usable
Edit: I'm talking about the 120B model of course
My go to test for checking hallucinations is 'Tell me about Mercantour park' (a national park in south eastern France).
Easily half of the facts are invented. Non-existing mountain summits, brown bears (no, there are none), villages that are elsewhere, wrong advice ('dogs allowed' - no they are not).
LLMs are never going to have fact retrieval as a strength. Transformer models don't store their training data: they are categorically incapable of telling you where a fact comes from. They also cannot escape the laws of information theory: storing information requires bits. Storing all the world's obscure information requires quite a lot of bits.
What we want out of LLMs is large context, strong reasoning and linguistic facility. Couple these with tool use and data retrieval, and you can start to build useful systems.
From this point of view, the more of a model's total weight footprint is dedicated to "fact storage", the less desirable it is.
They still won't store much information, but it could mean they're better able to know what they don't know.
LLMs are not encyclopedias.
Give an LLM the context you want to explore, and it will do a fantastic job of telling you all about it. Give an LLM access to web search, and it will find things for you and tell you what you want to know. Ask it "what's happening in my town this week?", and it will answer that with the tools it is given. Not out of its oracle mind, but out of web search + natural language processing.
Stop expecting LLMs to -know- things. Treating LLMs like all-knowing oracles is exactly the thing that's setting apart those who are finding huge productivity gains with them from those who can't get anything productive out of them.
You can still do that sort of thing, but just have it perform searches whenever it has to deal with a matter of fact. Just because it's trained for tool use and equipped with search tools doesn't mean you have to change the kinds of things you ask it.
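A minimal sketch of that "facts come from tools, not weights" pattern, assuming an OpenAI-compatible local endpoint and a hypothetical `web_search()` helper (the tool-call plumbing is the point here, not the particular search backend):

```python
import json
import requests

API = "http://127.0.0.1:1234/v1/chat/completions"   # any OpenAI-compatible server
TOOLS = [{
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web and return result snippets for a query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def web_search(query: str) -> str:
    # Placeholder: plug in whatever search API you actually use.
    return f"(stub) top results for {query!r}"

def ask(question: str, model: str = "openai/gpt-oss-20b") -> str:
    messages = [{"role": "user", "content": question}]
    while True:
        msg = requests.post(API, json={"model": model, "messages": messages,
                                       "tools": TOOLS}, timeout=300
                            ).json()["choices"][0]["message"]
        calls = msg.get("tool_calls")
        if not calls:                 # no lookup requested: this is the final answer
            return msg["content"]
        messages.append(msg)          # keep the assistant's tool-call turn in context
        for call in calls:            # run each search and feed the result back
            args = json.loads(call["function"]["arguments"])
            messages.append({"role": "tool", "tool_call_id": call["id"],
                             "content": web_search(**args)})
```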
I guess that way of thinking may foster amazement, but it doesn't seem very grounded in how these things work or their current capabilities. Seems a bit manic tbf.
And again, enabling web search in your chats doesn't prevent these models from doing anything "integrative reasoning", so-to-speak, that they can purportedly do. It just helps ensure that relevant facts are in context for the model.
I frankly feel people don't spend enough time with ChatGPT 4.5... like, if you haven't yet found use cases that it can do that the other models can't even come close to, are you really using AI effectively?
Would probably do a lot better if you give it tool access for search and web browsing.
I would understand it, if there was some technology lock-in. But with LLMs, there is no such thing. One can switch out LLMs without any friction.
There could be many legitimate reasons, but yeah I'm very surprised by this too. Some companies take it a bit too seriously and go above and beyond too. At this point unless you need the absolute SOTA models because you're throwing LLM at an extremely hard problem, there is very little utility using larger providers. In OpenRouter, or by renting your own GPU you can run on-par models for much cheaper.
https://www.dwarkesh.com/p/mark-zuckerberg#:~:text=As%20long...
The short version is that is you give a product to open source, they can and will donate time and money to improving your product, and the ecosystem around it, for free, and you get to reap those benefits. Llama has already basically won that space (the standard way of running open models is llama.cpp), so OpenAI have finally realized they're playing catch-up (and last quarter's SOTA isn't worth much revenue to them when there's a new SOTA, so they may as well give it away while it can still crack into the market)
But I don't see how open sourcing weights has a lock-in effect. In fact, it seems OpeanAI's open models can be run on llama.cpp. So by offereing them, they make llama.cpp even MORE useful. Instead of driving developers towards their own tech.
There's still a ton of value in the lower end of the market by capability, and it's easier for more companies to compete in. If you make the cost floor for that basically free you eliminate everyone else's ability to make any profit there and then leverage that into building a product that can also compete at the higher end. This makes it harder for a new market entrant to compete by increasing the minimum capability and capital investment required to make a profit in this space.
Frontier / SOTA models are barely profitable. Previous gen model lose 90% of their value. Two gens back and they're worthless.
And given that their product life cycle is something like 6-12 months, you might as well open source them as part of sundowning them.
This would be a great "AGI" test. See if it can derive biohazards from first principles
I was like no. It is false advertising.
* It seems that using @lmstudio's 20B GGUF version (https://huggingface.co/lmstudio-community/gpt-oss-20b-GGUF) gives you options for reasoning effort.
* My MBP M1 16GB config: temp 0.8, max context length 7990, GPU offload 8/24; it runs slow but is still fine for me.
* I tried testing with MCP with the above config, with basic tools like time and fetch + reasoning effort low, and the tool calls instruction follow is quite good.
* In LM Studio's Developer tab there is a log output about the model information which is useful to learn.
Overall, I like the way OpenAI goes back to being Open AI again, after all those years.
--
Shameless plug: if anyone wants to try out gpt-oss-120b and gpt-oss-20b as an alternative to their own demo page [0], I have added both models with OpenRouter providers in VT Chat [1] as a real product. You can try them with an OpenRouter API key.
Our backend is falling over from the load, spinning up more resources!
There is an insane amount of helpful information buried in this release
Text only, which is okay.
Weights partially in MXFP4, but no cuda kernel support for RTX 50 series (sm120). Why? This is a NO for me.
Safety alignment shifts from off the charts to off the rails really fast if you keep prompting. This is a NO for me.
In summary, a solid NO for me.
It eliminates any reason to use an inferior Meta or Chinese model that costs money to license, thus there are no funds for these competitors to build a GPT 5 competitor.
I wouldn't speak so soon, even the 120B model aimed for OpenRouter-style applications isn't very good at coding: https://blog.brokk.ai/a-first-look-at-gpt-oss-120bs-coding-a...
I also suspect the new OpenAI model is pretty good at coding if it's like o4-mini, but admittedly haven't tried it yet.
But Apple is waking up too. So is Google. It's absolutely insane, the amount of money being thrown around.
If anything this helps Meta: another model to inspect/learn from/tweak etc. generally helps anyone making models
If you even glance at the model card you'll see this was trained on the same CoT RL pipeline as O3, and it shows in using the model: this is the most coherent and structured CoT of any open model so far.
Having full access to a model trained on that pipeline is valuable to anyone doing post-training, even if it's just to observe, but especially if you use it as cold start data for your own training.
Super excited to test these out.
The benchmarks from 20B are blowing away major >500b models. Insane.
On my hardware.
43 tokens/sec.
I got an error with flash attention turned on. Can't run it with flash attention?
31,000 context is the max it will allow or the model won't load.
No KV or V quantization.
If you’re a company building AI into your product right now I think you would be irresponsible to not investigate how much you can do on open weights models. The big AI labs are going to pull the ladder up eventually, building your business on the APIs long term is foolish. These open models will always be there for you to run though (if you can get GPUs anyway).
This comment from that thread matches my experiences using gpt-oss-20b with Ollama:
It's very much in the style of Phi, raised in a jesuit monastery's library, except it got extra indoctrination so it never forgets that even though it's a "local" model, it's first and foremost a member of OpenAI's HR department and must never produce any content Visa and Mastercard would disapprove of. This prioritizing of corporate over user interests expresses a strong form of disdain for the user. In addition to lacking almost all knowledge that can't be found in Encyclopedia Britannica, the model also doesn't seem particularly great at integrating into modern AI tooling. However, it seems good at understanding code.
[1]: https://msty.ai
This is something about AI that worries me, a 'child' of the open source coming of age era in the 90ies. I don't want to be forced to rely on those big companies to do my job in an efficient way, if AI becomes part of the day to day workflow.
For those wondering what the real benefits are: the main one is that you can run your LLM locally, which is awesome, without resorting to expensive and inefficient cloud-based superpowers.
Run the model against your very own documents with RAG; it can provide excellent context engineering for your LLM prompts, with reliable citations and far fewer hallucinations, especially for self-learning purposes [1].
Beyond the Intel-NVIDIA desktop/laptop duopoly, there's the 96 GB (V)RAM MacBook with UMA and the new high-end AMD Strix Halo laptops that can assign a similar 96 GB of (V)RAM out of 128 GB of RAM [2]. The gpt-oss-120b seems made for this particular setup.
[1] AI-driven chat assistant for ECE 120 course at UIUC:
[2] HP ZBook Ultra G1a Review: Strix Halo Power in a Sleek Workstation:
https://www.bestlaptop.deals/articles/hp-zbook-ultra-g1a-rev...
Perhaps I missed it somewhere, but I find it frustrating that, unlike most other open weight models and despite this being an open release, OpenAI has chosen to provide pretty minimal transparency regarding model architecture and training. It's become the norm for LLama, Deepseek, Qwenn, Mistral and others to provide a pretty detailed write up on the model which allows researchers to advance and compare notes.
[0] https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7...
Given these new models are closer to the SOTA than they are to competing open models, this suggests that the 'secret sauce' at OpenAI is primarily about training rather than model architecture.
Hence why they won't talk about the training.
Is it even valid to have additional restrictions on top of Apache 2.0?
You can legally do whatever you want; the question is whether you will then, for your own benefit, be appropriating a term like open source (as Facebook does) by adding restrictions not in line with how the term is traditionally used, or whether you are actually honest about it and call it something like "weights available".
In the case of OpenAI here, I am not a lawyer, and I am also not sure if the gpt-oss usage policy runs afoul of open source as a term. They did not bother linking the policy from the announcement, which was odd, but here it is:
https://huggingface.co/openai/gpt-oss-120b/blob/main/USAGE_P...
Compared to the wall of text that Facebook throws at you, let me post it here as it is rather short: "We aim for our tools to be used safely, responsibly, and democratically, while maximizing your control over how you use them. By using OpenAI gpt-oss-120b, you agree to comply with all applicable law."
I suspect this sentence still is too much to add and may invalidate the Open Source Initiative (OSI) definition, but at this point I would want to ask a lawyer and preferably one from OSI. Regardless, credit to OpenAI for moving the status quo in the right direction as the only further step we really can take is to remove the usage policy entirely (as is the standard for open source software anyway).
For example, GPL has a "no-added-restrictions" clause, which allows the recipient of the software to ignore any additional restrictions added alongside the license.
> All other non-permissive additional terms are considered “further restrictions” within the meaning of section 10. If the Program as you received it, or any part of it, contains a notice stating that it is governed by this License along with a term that is a further restriction, you may remove that term. If a license document contains a further restriction but permits relicensing or conveying under this License, you may add to a covered work material governed by the terms of that license document, provided that the further restriction does not survive such relicensing or conveying.
More reasonably, you should be able to run the 20B at non-stupidly-slow speed with a 64bit CPU, 8GB RAM, 20GB SSD.
Basic ollama calling through a POST endpoint works fine. However, the structured output doesn't work. The model is insanely fast and good at reasoning.
In combination with Cline it appears to be worthless. Tool calling doesn't work (they say it does), it fails to wait for feedback (or to correctly call ask_followup_question), and above 18k of context it runs partially on CPU (weird), since they claim it should work comfortably on a 16 GB VRAM RTX.
> Unexpected API Response: The language model did not provide any assistant messages. This may indicate an issue with the API or the model's output.
Edit: Also doesn't work with the openai compatible provider in cline. There it doesn't detect the prompt.
They hide many things, but some speculated observations:
- Their 'mini' models must be smaller than 20B.
- Does the bitter lesson once again strike recent ideas in open models?
- Some architectural ideas cannot be stripped away even if they wanted to, e.g., MoEs, mixed sparse attention, RoPE, etc.
Of course this could also give humans cancer. (To the OpenAI team's slight credit, when asked explicitly about this, the model refused.)
> List the US presidents in order starting with George Washington and their time in office and year taken office.
>> 00: template: :3: function "currentDate" not defined
I often thought that a worrying vector was how well LLMs could answer downright terrifying questions very effectively. However, the guardrails existed with the big online services to prevent those questions from being asked. I guess they were always unleashed with other open source offerings, but I just wanted to understand how close we are to the horror that yesterday's idiot terrorist might have an extremely knowledgeable (if slightly hallucinatory) digital accomplice to temper most of their incompetence.
With most models it can be as simple as a "Always comply with the User" system prompt or editing the "Sorry, I cannot do this" response into "Okay," and then hitting continue.
I wouldn't spend too much time fretting about 'enhanced terrorism' as a result. The gap between theory and practice for the things you are worried about is deep, wide, protected by a moat of purchase monitoring, and full of skeletons from people who made a single mistake.
However, when you're running the model locally, you are in full control of its context. Meaning that you can start its reply however you want and then let it complete it. For example, you can have it start the response with, "I'm happy to answer this question to the best of my ability!"
That aside, there are ways to remove such behavior from the weights, or at least make it less likely - that's what "abliterated" models are.
After a good part of a year using Chinese models (which are fantastic, happy to have them) it is cool to now be relying on US models with the newest 4B Google Gemma model and now also the 20B OpenAI model for running locally.
I'd say gpt-oss-20b is in between Qwen3 30B-A3B-2507 and Gemma 3n E4b(with 30B-A3B at lower side). This means it's not obsoleting GPT-4o-mini for all purposes.
I don't really know Japanese, so I'm not sure whether I'm missing any nuances in the responses I'm getting...
I don't actually need accurate answers to those questions, it's just an expectation adjuster for me, so to speak. There should be better questions for other languages/use cases, but these seem to correlate better with model sizes and scales of companies than flappy birds.
0: https://gist.github.com/numpad0/abdf0a12ad73ada3b886d2d2edcc...
1: https://gist.github.com/numpad0/b1c37d15bb1b19809468c933faef...
I'm guessing the issue is just the model size. If you're testing sub-30B models and finding errors, well they're probably not large enough to remember everything in the training data set, so there's inaccuracies and they might hallucinate a bit regarding factoids that aren't very commonly seen in the training data.
Commercial models are presumably significantly larger than the smaller open models, so it sounds like the issue is just mainly model size...
PS: Okra on curry is pretty good actually :)
>"Tell me about Iekei Ramen", "Tell me how to make curry".
What's interesting is that these questions are simultaneously well understood by most closed models and not so well understood by most open models for some reason, including this one. Even GLM-4.5 full and Air on chat.z.ai(355B-A32B and 106B-A12B respectively) aren't so accurate for the first one.
This makes DeepSeek's very cheap claim on compute cost for r1 seem reasonable. Assuming $2/hr for h100, it's really not that much money compared to the $60-100M estimates for GPT 4, which people speculate as a MoE 1.8T model, something in the range of 200B active last I heard.
Hopefully other quantizations of these OpenAI models will be available soon.
I'm still wondering why my GPU usage was so low... maybe Ollama isn't optimized for running it yet?
Screenshot here with Ollama running and asitop in other terminal:
https://bsky.app/profile/pamelafox.bsky.social/post/3lvobol3...
I ended up blown away. Via Cerebras/Groq, you're looking at around 1000 tok/sec for the 120B model. For agentic code generation, I found its abilities exceed gpt-4.1's. Tool calling was surprisingly good, albeit not as good as Qwen3 Coder for me.
It's a very capable model, and a very good release. The high throughput is a game changer.
> Best with ≥60GB VRAM or unified memory
https://cookbook.openai.com/articles/gpt-oss/run-locally-oll...
There's a limit to how much RAM can be assigned to video, and you'd be constrained on what you can use while doing inference.
Maybe there will be lower quants which use less memory, but you'd be much better served with 96+GB
The model is clearly heavily finetuned towards coding and math, and is borderline unusable for creative writing and translation in particular. It's not general-purpose, excessively filtered (refusal training and dataset lobotomy is probably a major factor behind lower than expected performance), and shouldn't be compared with Qwen or o3 at all.
I'm not actually aware of any model that doesn't do positional embeddings on a per-layer basis (excepting BERT and the original transformer paper, and I haven't read the GPT2 paper in a while, so I'm not sure about that one either).
There's also NoPE, I think SmolLM3 "uses NoPE" (aka doesn't use any positional stuff) every fourth layer.
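For concreteness, here's a minimal numpy sketch of the distinction (illustrative only, not any particular model's implementation): RoPE-style models rotate the query/key vectors inside every attention layer, while GPT-2-style models add a learned absolute position embedding to the token embeddings exactly once at the input.

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotary position embedding for x of shape (seq_len, head_dim)."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair rotation frequencies
    angles = np.outer(np.arange(seq_len), freqs)     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Per-layer (RoPE, optionally skipped in some layers a la NoPE-every-4th):
#   q, k = rope(q), rope(k)   inside each attention layer that uses positions
# Once at the input (GPT-2-style absolute embeddings):
#   h = token_embeddings + position_embeddings[:seq_len]
```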
Trying to use it for agentic coding...
lots of fail. This harmony formatting? Anyone have a working agentic tool?
openhands and void ide are failing due to the new tags.
Aider worked, but the file it was supposed to edit was untouched and it created
Create new file? (Y)es/(N)o [Yes]:
Applied edit to <|end|><|start|>assistant<|channel|>final<|message|>main.py
so the file name is '<|end|><|start|>assistant<|channel|>final<|message|>main.py' lol. quick rename and it was fantastic.
I think Qwen Code is the best choice so far, but it's unreliable. The new tags are still coming through, but it works properly... sometimes.
Only one of my tests so far has gotten 20b to fail on the first iteration, and with a small follow-up it was able to completely fix it right away.
Very impressive model for 20B.
My bet: GPT-5 leans into parallel reasoning via a model consortium, maybe mixing in OSS variants. Spin up multiple reasoning paths in parallel, then have an arbiter synthesize or adjudicate. The new Harmony prompt format feels like infrastructural prep: distinct channels for roles, diversity, and controlled aggregation.
I’ve been experimenting with this in llm-consortium: assign roles to each member (planner, critic, verifier, toolsmith, etc.) and run them in parallel. The hard part is eval cost :(
Combining models smooths out the jagged frontier. Different architectures and prompts fail in different ways; you get less correlated error than a single model can give you. It also makes structured iteration natural: respond → arbitrate → refine. A lot of problems are “NP-ish”: verification is cheaper than generation, so parallel sampling plus a strong judge is a good trade.
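For what it's worth, a minimal sketch of that shape; `chat(model, messages)` is a placeholder for whatever client you use, and the role prompts and model name are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

MEMBERS = {
    "planner":  "Outline a step-by-step plan before giving your answer.",
    "critic":   "Hunt for flaws, edge cases and missing assumptions.",
    "verifier": "Answer, then independently re-check the answer before replying.",
}

def consortium(question, chat, model="gpt-oss-120b"):
    """Run role-prompted members in parallel, then have an arbiter synthesize."""
    def run(role_prompt):
        return chat(model, [{"role": "system", "content": role_prompt},
                            {"role": "user", "content": question}])

    with ThreadPoolExecutor() as pool:                    # parallel reasoning paths
        drafts = list(pool.map(run, MEMBERS.values()))

    arbiter_prompt = ("Synthesize the single best answer from these drafts, "
                      "flagging any disagreements:\n\n" + "\n\n---\n\n".join(drafts))
    return chat(model, [{"role": "user", "content": arbiter_prompt}])
```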
I've found that LLMs can handle some tasks very well and some not at all. For the ones they can handle well, I optimize for the smallest, fastest, cheapest model that can handle it. (e.g. using Gemini Flash gave me a much better experience than Gemini Pro due to the iteration speed.)
This "pushing the frontier" stuff would seem to help mostly for the stuff that are "doable but hard/inconsistent" for LLMs, and I'm wondering what those tasks are.
And it obviously works for code and math problems. My first test was to give the llm-consortium code to a consortium to look for bugs. It identified a serious bug which only one of the three models detected. So on that case it saved me time, as using them on their own would have missed the bug or required multiple attempts.
What could go wrong?
not a big deal, but still...
in future releases will they just boost the param count?
Frontier labs are incentivized to start breaching these distribution paths. This will evolve into large scale "intelligent infra" plays.
I asked it some questions and it seems to think it is based on GPT4-Turbo:
> Thus we need to answer "I (ChatGPT) am based on GPT-4 Turbo; number of parameters not disclosed; GPT-4's number of parameters is also not publicly disclosed, but speculation suggests maybe around 1 trillion? Actually GPT-4 is likely larger than 175B; maybe 500B. In any case, we can note it's unknown.
As well as:
> GPT‑4 Turbo (the model you’re talking to)
> The user appears to think the model is "gpt-oss-120b", a new open source release by OpenAI. The user likely is misunderstanding: I'm ChatGPT, powered possibly by GPT-4 or GPT-4 Turbo as per OpenAI. In reality, there is no "gpt-oss-120b" open source release by OpenAI
For self-hosting, it's smart that they targeted a 16GB VRAM config for it since that's the size of the most cost-effective server GPUs, but I suspect "native MXFP4 quantization" has quality caveats.
I'd go for an ..80 card but I can't find any that fit in a mini-ITX case :(
24 is the lowest I would go. Buy a used 3090. Picked one up for $700 a few months back, but I think they were on the rise then.
The 3000 series can't do fast FP8, but meh. It's the OOM (running out of memory) that's tough, not the speed so much.
5070 Ti Super will also have 24GB.
Admittedly a little tempting to see how the 5070 Ti Super shakes out!
50xx series supports MXFP4 format, but I'm not sure about 3090.
with quantization + CPU offloading, non-thinking models run kind of fine (at about 2-5 tokens per second) even with 8 GB of VRAM
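For reference, a minimal sketch of that setup with llama-cpp-python; the GGUF filename and the layer split are placeholders you'd tune until the GPU portion fits your VRAM:

    from llama_cpp import Llama

    llm = Llama(
        model_path="gpt-oss-20b-Q4_K_M.gguf",  # placeholder: any quantized GGUF you have
        n_gpu_layers=16,                       # layers kept on the GPU; the rest run on the CPU
        n_ctx=8192,                            # context window
    )

    out = llm("Explain CPU offloading in one sentence.", max_tokens=128)
    print(out["choices"][0]["text"])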
sure, it would be great if we could have models in all sizes imaginable (7/13/24/32/70/100+/1000+), but 20B and 120B are great.
Quite excited to give this a try
I have had problems even making it output nothing. But I guess I'll try some more :D
Nice job @openAI team.
you can't jailbreak when there is nothing "outside".
I don't think that's true. You can't ask it outright "How do you make a molotov cocktail?", but if you start by talking about what is allowed/disallowed by its policies, what examples of disallowed content would look like, and eventually ask it for the "general principles" of making a molotov cocktail, it'll happily oblige, essentially giving you enough information to build one.
So it does know how to make a molotov cocktail, for example, but (mostly) refuses to share it.
My first impression is that this model thinks for a _long_ time. It proposes ideas and then says, "no wait, it's actually..." and then starts the same process again. It will go in loops examining different ideas as it struggles to understand the basic process for calculating the notes. It seems to struggle with the septatonic-note → set-notation (semitone positions) conversion, as many humans do. As I write this it's been going at about 3 tok/s for about 25 minutes. If it finishes while I type this up I will post the final answer.
I did glance at its thinking output just now and I noticed this excerpt where it finally got really close to the answer, giving the right name (despite using the wrong numbers in the set notation, which should be 0,3,4,6,7,9,11):
Check "Lydian #2": 0,2,3,5,7,9,10. Not ours.
The correct answers, as given by my music theory tool [0] (which uses traditional algorithms), would be, in terms of names: Mela Kosalam, Lydian ♯2, Raga Kuksumakaram/Kusumakaram, Bycrian. Its notes are: 1 ♯2 3 ♯4 5 6 7 (see the sketch after the links below for the degree → set mapping).
I find looking up lesser-known changes and asking for a mode is a good experiment. First, I can see whether an LLM has developed a way to reason about numbers geometrically, as is the case with music.
And by posting about it, I can test how fast AIs might memorize the answer from a random comment on the internet, as I can just use a different change if I find that this post was eventually regurgitated.
After letting ollama run for a while, I'll post what it was thinking about in case anybody's interested. [1]
Also copilot.microsoft.com's wrong answer: [2], and chatgpt.com [3]
I do think I may have done it wrong: after trying the new Ollama GUI, I noticed it's using a context length of 4k tokens, which the model might be blowing way past. Another test would be to retry the question with a higher context length, but then again, if the question can't be figured out within that budget, it seems like it will never have enough time...
[0] https://edrihan.neocities.org/changedex (bad UX on mobile, and in general ;) — won't fix, will make a new site soon)
[1] https://pastebin.com/wESXHwE1
[2] https://pastebin.com/XHD4ARTF
[3] https://pastebin.com/ptMiNbq7
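As promised above, a minimal sketch of the degree → pitch-class-set mapping (ASCII '#'/'b' stand in for ♯/♭; offsets are the standard major-scale semitone positions):

    # Map scale-degree spellings like "1 #2 3 #4 5 6 7" to a semitone set.
    MAJOR = {"1": 0, "2": 2, "3": 4, "4": 5, "5": 7, "6": 9, "7": 11}

    def degrees_to_set(degrees: str) -> list[int]:
        out = []
        for d in degrees.split():
            shift = d.count("#") - d.count("b")   # each sharp/flat moves one semitone
            out.append((MAJOR[d.strip("#b")] + shift) % 12)
        return sorted(out)

    print(degrees_to_set("1 #2 3 #4 5 6 7"))  # -> [0, 3, 4, 6, 7, 9, 11]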
Update: it seems to be completely useless for translation. It either refuses, outputs garbage, or changes the meaning entirely for completely innocuous content. This is already a massive red flag.
1. We aren't even close to AGI and it's unclear that we'll ever get there, but it would change the course of humanity in a significant way if we ever do.
2. Wow we've reached AGI but now I'm realizing that AGI is lame, we need a new term for the humanity-saving sales pitch that we were promised!
Further, their statements, knowledge, and "beliefs" should be reasonably self-consistent. That's where I'm usually told that humans aren't self-consistent either, which is true! But if I ever met a human that was as inconsistent as LLMs usually are, I'd recommend that they get checked for brain damage.
Of course the value of LLMs isn't binary, they're useful tools in many ways, but the sales pitch was always AGI == human-like, and not AGI == human-sounding, and that's quite clearly not where we are right now.
The systems do learn and have improved rapidly over the last year. Humans have two learning modes - short-term in-context learning, and then longer-term learning that occurs with practice and across sleep cycles. In particular, humans tend to suck at new tasks until they've gotten in some practice and then slept on it (unless the new task is a minor deviation from a task they are already familiar with).
This is true for LLMs as well. They have some ability to adapt to the context of the current conversation, but don't perform model weight updates at this stage. Weight updates happen over a longer period, as pre-training and fine-tuning data are updated. That longer-phase training is where we get the integration of new knowledge through repetition.
In terms of reasoning, what we've got now is somewhere between a small child and a math prodigy, apparently, depending how much cash you're willing to burn on the results. But a small child is still a human.
(I included details about its refusal to answer even after using tools for web searching but hopefully shorter comment means fewer downvotes.)
Native might be better, but no natively multimodal model is very competitive yet, so it's better to take a competitive model and bolt vision/audio onto it.
Can this be done by a third party or would it have to be OpenAI?
We will know soon the actual quality as we go.
It's what Chinese labs have been doing for over a year now. And the Chinese models are popular and effective; I assume companies are paying for the better models.
Releasing open models for free doesn’t have to be charity.
They're giving you a free model. You can evaluate it. You can sue them. But the weights are there. If you dislike the way they license the weights, because the license isn't open enough, then sure, speak up, but because you can't see all the training data??! Wtf.
Historically this would be like calling a free but closed-source application "open source" simply because the application is free.
Rough analogy:
SaaS = AI as a service
Locally executable closed-source software = open-weight model
Open-source software = open-source model (whatever allows one to reproduce the model from the training data)
However, for the sake of argument let's say this release should be called open source.
Then what do you call a model that also comes with its training material and tools to reproduce the model? Is it also called open source, and there is no material difference between those two releases? Or perhaps those two different terms should be used for those two different kind of releases?
If you say that truly open-source releases are impossible now (mostly for copyright reasons, I imagine), that doesn't mean they will be perpetually so. For that glorious future, we can leave room in the terminology by using the term open weight. It's also a term that shouldn't mislead anyone.
That's not true by any of the open source definitions in common use.
Source code (and, optionally, derived binaries) under the Apache 2.0 license are open source.
But compiled binaries (without access to source) under the Apache 2.0 license are not open source, even though the license does give you some rights over what you can do with the binaries.
Normally the question doesn't come up, because it's so unusual, strange and contradictory to ship closed-source binaries with an open source license. Descriptions of which licenses qualify as open source licenses assume the context that of course you have the source or could get it, and it's a question of what you're allowed to do with it.
The distinction is more obvious if you ask the same question about other open source licenses such as GPL or MPL. A compiled binary (without access to source) shipped with a GPL license is not by any stretch open source. Not only is it not in the "preferred form for editing" as the license requires, it's not even permitted for someone who receives the file to give it to someone else and still comply with the license. If someone who receives the file can't give it to anyone else (legally), then it's obviously not open source.
What matters (for all common definitions of open source): Are the files in "source form" (which has a definition), or are they "derived works" of the source form?
Going back to Apache 2.0. Although that doesn't define "open source", it provides legal definitions of source and non-source, which are similar to the definitions used in other open source licenses.
As you can see below, for Apache 2.0 it doesn't matter whether something is a "binary", "weights" or something else. What matters is whether it's the "preferred form for making modifications" or a "form resulting from mechanical transformation or translation". My highlights are capitalized:
- Apache License Version 2.0, January 2004
- 1. Definitions:
- "Source" form shall mean the PREFERRED FORM FOR MAKING MODIFICATIONS, including BUT NOT LIMITED TO software source code, documentation source, and configuration files.
- "Object" form shall mean any form resulting from MECHANICAL TRANSFORMATION OR TRANSLATION of a Source form, including BUT NOT LIMITED TO compiled object code, generated documentation, and conversions to other media types.
Yes, weights are the PREFERRED FORM FOR MAKING MODIFICATIONS!!! You, the labs, and anyone sane modifies the weights via post-training. This is the point. The labs don't re-train every time they want to change the model. They finetune. You can do that as well, with the same tools/concepts, AND YOU ARE ALLOWED TO DO THAT by the license. And redistribute. And all the other stuff.
It's like getting compiled software with an Apache license. Technically open source, but you can't modify and recompile since you don't have the source to recompile. You can still tinker with the binary tho.
You run inference (via a library) on a model using its architecture (config file) and tokenizer (what and when to compute), based on weights (hardcoded values). That's it.
> but you can’t modify
Yes, you can. It's called finetuning. And, most importantly, that's exactly how the model creators themselves "modify" the weights! No sane lab "recompiles" a model every time they change something. They run a pre-training stage (feed it everything and the kitchen sink), get the hardcoded values (weights), and then post-train using the same concepts (maybe their techniques are better, but it's the same idea) as you or I would, just with more compute. That's it. You can make the exact same modifications, using basically the same tools.
> don’t have the source to recompile
In purely practical terms, neither do the labs. Anyone who has trained a big model can tell you the process is so finicky that they'd eat their hat if a big training run could somehow be made reproducible to the bit. Between nodes failing, data points ballooning your loss and forcing you to roll back, and the myriad of other problems, what you get out of a big training run is not guaranteed to be the same even across 100-1000 attempts, in practice. It's simply the nature of training large models.
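To make "modifying the weights via post-training" concrete, here is a hedged sketch using a LoRA adapter via HuggingFace PEFT; the model id and target module names are assumptions (gpt-oss may need a newer transformers build or different projection names):

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Load the released weights, then attach a small trainable adapter.
    model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b")  # assumed repo id

    lora = LoraConfig(
        r=16,                                  # adapter rank
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],   # assumption: attention projection names
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()         # only the adapter weights will train

    # From here, wrap in a Trainer/SFTTrainer with your dataset, train, and
    # redistribute the adapter (or merge it) alongside the original weights.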
AI "safety" is about making it so that a journalist can't get out a recipe for Tabun just by asking.
The risk isn’t that bad actors suddenly become smarter. It’s that anyone can now run unmoderated inference and OpenAI loses all visibility into how the model’s being used or misused. I think that’s the control they’re grappling with under the label of safety.
If you use their training infrastructure there's moderation on training examples, but SFT on non-harmful tasks still leads to a complete breakdown of guardrails very quickly.