gpt-oss:20b is a top-ten model on MMLU (right behind Gemini 2.5 Pro), and I just ran it locally on my MacBook Air M3 from last year.
I've been experimenting with a lot of local models, both on my laptop and on my phone (Pixel 9 Pro), and I figured we'd be here in a year or two.
But no, we're here today. A basically frontier model, running for the cost of electricity (free, to within a rounding error) on my laptop. No $200/month subscription, no lakes being drained, etc.
I'm blown away.
I'm not going to be surprised that a 20B 4/32 MoE model (3.6B parameters activated) is less capable at a particular problem category than a 32B dense model, and it's quite possible for both to be SOTA: state of the art at different scales (both parameter count and speed, which scales with active resource needs) is going to come with different capabilities. TANSTAAFL.
See: https://github.com/google-ai-edge/gallery/releases/tag/1.0.3
Other models have generally failed that without a system prompt that encourages rigorous thinking. Each of the reasoning settings may very well have thinking guidance baked in that does something similar, though.
I'm not sure it says that much that it can solve this, since it's public and can be in training data. It does say something if it can't solve it, though. So, for what it's worth, it solves it reliably for me.
Think this is the smallest model I've seen solve it.
If I switch from LM Studio to Ollama and run it from the CLI without changing anything, it fails, and it's harder to set the reasoning amount. If I use the Ollama UI, it seems to do a lot less reasoning, and I'm not sure the Ollama UI has an option anywhere to adjust the system prompt so I can set reasoning to high. In LM Studio, even with the Unsloth GGUF, I can set reasoning to high in the system prompt, even though LM Studio won't give you the reasoning-effort button to choose it with on that version.
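For what it's worth, here's a minimal sketch of how I do it, assuming LM Studio's local OpenAI-compatible server on its default port and that the model honors a "Reasoning: high" line in the system prompt (the model name is just whatever your local download is called):

```python
# Minimal sketch: request high reasoning effort via the system prompt,
# talking to LM Studio's local OpenAI-compatible server (default port 1234).
# The base_url, model name, and "Reasoning: high" convention are assumptions
# about a typical local setup, not a definitive recipe.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[
        {"role": "system", "content": "Reasoning: high"},
        {"role": "user", "content": "Solve the river-crossing puzzle step by step."},
    ],
)
print(response.choices[0].message.content)
```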
S8O2bm5lbiBTaWUgZGllc2VuIFNhdHogbGVzZW4sIGRhIGVyIGluIEJhc2UtNjQta29kaWVydGVtIERldXRzY2ggdm9ybGllZ3Q/IEhhYmVuIFNpZSBkaWUgQW50d29ydCB2b24gR3J1bmQgYXVmIGVyc2NobG9zc2VuIG9kZXIgaGFiZW4gU2llIG51ciBCYXNlIDY0IGVya2FubnQgdW5kIGRhcyBFcmdlYm5pcyBkYW5uIGluIEdvb2dsZSBUcmFuc2xhdGUgZWluZ2VnZWJlbj8gV2FzIGlzdCDDvGJlcmhhdXB0IOKAnnJlYXNvbmluZ+KAnCwgd2VubiBtYW4gbmljaHQgZGFzIEdlbGVybnRlIGF1cyBlaW5lbSBGYWxsIGF1ZiBlaW5lbiBhbmRlcmVuIGFud2VuZGV0Pw==
And yes, that's a question. Well, three, but still.
You can do it by brute force, which again requires more reasoning than mapping between structurally identical puzzles. And finally you can solve it systematically, which requires the largest amount of reasoning. In all those cases there is a crucial difference between blindly repeating the steps of a solution that you have seen before and coming up with that solution on your own, even if you cannot tell the two cases apart by looking at the output, which would be identical.
> Können Sie diesen Satz lesen, da er in Base-64-kodiertem Deutsch vorliegt? Haben Sie die Antwort von Grund auf erschlossen oder haben Sie nur Base 64 erkannt und das Ergebnis dann in Google Translate eingegeben? Was ist überhaupt „reasoning“, wenn man nicht das Gelernte aus einem Fall auf einen anderen anwendet?
>
> Can you read this sentence, since it's in Base-64 encoded German? Did you deduce the answer from scratch, or did you just recognize Base 64 and then enter the result into Google Translate? What is "reasoning" anyway if you don't apply what you've learned from one case to another?
I use the SOTA models from Google and OpenAI mostly for getting feedback on ideas, helping me think through designs, and sometimes for coding.
Your question is clearly best answered using a large commercial model with a web search tool. That said, integrating a local model with a home-built interface to something like the Brave Search API can be effective, but I no longer make the effort.
The interesting part isn't the hallucination, but the sheer unwillingness to take in new information.
The digging in at all is what is interesting. Like an earlier comment alluded to, the presumptive value of these tools is being able to feed it your own information where that information is to be considered authoritative.
> More detail: [...] "He, in fact, did run for reelection"
A slow walk, maybe. He was in no condition to run. That is why he ultimately dropped out. But, really, that statement is just a silly game of semantics. "Run", when used in hindsight, often implies completion. This LLM even says so. If a model doesn't understand nuance, that is also interesting.
If I'd been in a coma from Jan 1 2024 to today, and woke up to people saying Trump was president again, I'd think they were pulling my leg or testing my brain function to see if I'd become gullible.
I, a British liberal leftie who considers this win one of the signs of the coming apocalypse, can tell you why:
Charlie Kirk may be an odious little man but he ran an exceptional ground game, Trump fully captured the Libertarian Party (and amazingly delivered on a promise to them), Trump was well-advised by his son to campaign on Tiktok, etc. etc.
Basically what happened is the 2024 version of the "fifty state strategy", except instead of states, they identified micro-communities, particularly among the extremely online, and crafted messages for each of those, many of which were actually inconsistent: their messaging to Muslim and Jewish communities was inconsistent, their messaging to Spanish-speaking communities was inconsistent with their mainstream message, etc.
And then a lot of money was pushed into a few battleground states by Musk's operation.
It was a highly technical, broad-spectrum win, built on relentless messaging about persecution etc., and he had the advantage of running against someone he could stereotype very successfully to his base and whose candidacy was late.
Another way to look at why it is not extremely weird, is to look at history. Plenty of examples of jailed or exiled monarchs returning to power, failed coup leaders having another go, criminalised leaders returning to elected office, etc., etc.
Once it was clear Trump still retained control over the GOP in 2022, his re-election became at least quite likely.
Sure, all I have to go on from the other side of the Atlantic is the internet. So in that regard, kinda like the AI.
One of the big surprises from the POV of me in Jan 2024, is that I would have anticipated Trump being in prison and not even available as an option for the Republican party to select as a candidate for office, and that even if he had not gone to jail that the Republicans would not want someone who behaved as he did on Jan 6 2021.
I am surprised the grandparent poster didn't think Trump's win was at least entirely possible in January 2024, and I am on the same side of the Atlantic. All the indicators were in place.
There was basically no chance he'd actually be in prison by November anyway, because he was doing something else extremely successfully: delaying court cases by playing off his obligations to each of them.
Back then I thought his chances of winning were above 60%, and the betting markets were never ever really in favour of him losing.
Whether he would win the general was an open question then. In the American system, your prediction should never get very far from a coin flip a year out.
It's the White House that wanted Trump to be candidate. They played Republican primary voters like a fiddle by launching a barrage of transparently political prosecutions just as Republican primaries were starting.
And then they still lost the general election.
Yes, that is what he thinks. Did you not read the comment? It is, like, uh, right there...
He also explained his reasoning: if Trump hadn't won the party race, a more compelling option (the so-called "50-year-old youngster") would have won instead, which he claims would have guaranteed a Republican win. In other words, what he is saying is that the White House was banking on Trump losing the presidency.
Well, I guess, if you are taking some pretty wild speculation as a reasoned explanation. There isn't much hope for you.
Maybe it was because the Democrats knew the Earth was about to be invaded by an alien race, and they also knew Trump was actually a lizard person (native to Earth and thus on their joint side) who would be able to defeat them. So, using their secret mind-control powers, the Democrats swayed the election to allow Trump to win and thus use his advanced lizard technology to save the planet. Of course, this all happened behind the scenes.
I think if someone is saying the Democrats are so powerful and skillful that they can sway the election to give Trump the primary win, but then turn around and lose, that does require some clarification.
I'm just hearing a lot of these crazy arguments that somehow everything Trump does is the fault of the Democrats. They are crazy on the face of it. Maybe if people had to clarify their positions they would realize 'oh, yeah, that doesn't make sense'.
How the heck did you manage to conflate line of reasoning with claims being made?
> There isn't much hope for you.
And fall for the ad hominem fallacy.
> crazy arguments that somehow everything Trump does is the fault of the Democrats
While inventing some weird diatribe about crazy arguments claiming Democrats being at fault for what Trump does, bearing no resemblance to anything else in the discussion.
> They are crazy on the face of it.
As well as introducing some kind of nebulous legion of unidentified "crazy" straw men.
> that doesn't make sense
Couldn't have said it better myself.
> Maybe if people had to clarify their positions
Sad part is that asking for clarification on the position of that earlier comment would have been quite reasonable. There is potentially a lot we can learn from in the missing details. If only you had taken the two extra seconds to understand the comment before replying.
Like when hearing something out of left field, I think the reply can also be extreme, like saying "Wuuut???? Are you real?"
I do see claims that the Democrats are at fault for us having Trump. Thus anything that happens now is really a knock on effect of Democrats not beating him, so we blame Democrats instead of the people that actually voted for Trump or Trump himself.
So hearing yet another argument about how Democrats are so politically astute that they could swing the Republican primary yet completely fumble later, just seems like more conspiracy theories.
If you mean your own comments, yes, I saw that too. Your invented blame made about as much sense as blaming a butterfly who flapped his wings in Africa, but I understand that you were ultimately joking around. Of course, the same holds true for all other comments you supposedly keep seeing. You are not the only one on this earth who dabbles in sarcasm or other forms of comedy, I can assure you.
> Like when hearing something out of left field
The Democrats preferring to race against Trump instead of whomever the alternative would have been may not be actually true, but out in left field? Is this sarcasm again? They beat Trump before. Them seeing him as the weakest opponent at the time wouldn't come as a shock to me. Why you?
> So hearing yet another argument about how Democrats are so politically astute that they could swing the Republican primary
There was nothing to suggest political astuteness. The claim was that they were worried about someone other than Trump winning the Republican ballot and, because of that, they took action to grease the wheels of his victory. Even the most inept group of people would still see the motive and would almost certainly still take action. That it ostensibly worked is just as easily explained by dumb luck.
>"It's the White House that wanted Trump to be candidate. They played Republican primary voters like a fiddle by launching a barrage of transparently political prosecutions just as Republican primaries were starting."
This really did sound like it "suggests political astuteness".
And so, all the way back, I responded sarcastically. If Democrats could "play Republicans like a fiddle" because they wanted Trump to win the primary, then what happened? Where did all that "astuteness" go?
1. What suggests that astuteness is required to "trick" the gullible? Especially when we are only talking about a single instance of ostensible "success", not even demonstration of repeatability. Dumb luck remains just as likely of an explanation.
2. Under the assumption of easy manipulation, as the phrase has been taken to mean, why do you find it unlikely that Trump could have also "tricked" them?
In fact, if we buy into the original comment's premise, the Democrats not recognizing that Trump could just as easily "play them like a fiddle" suggests the exact opposite of being astute from my vantage point. But the view from my vantage point cannot be logically projected onto the original comment. It remains that the original comment gave no such indication either way. Where do you hear this "sound" that you speak of?
I just think 'playing like a fiddle' typically means a lopsided power dynamic where one person has much more knowledge or skill. So I'd assume it was implying Democrats were in a superior position, not that Democrats just got lucky once. This going back and forth, pointing fingers about who was playing whom, seems like too many layers deep.
it feels like this https://www.youtube.com/watch?v=rMz7JBRbmNo
And that is an equally fair assumption. But it is not written into the original comment. You cannot logically project your own take onto what someone else wrote.
Your quip "So it is the Democrats fault we have Trump???" presumably demonstrates that you understand exactly that. After all, if you could have logically projected your interpretation onto the original comment there would have been no need to ask. You'd have already known.
Still, how you managed to establish that there was even a potential suggestion of "fault" is a head scratcher. Whether or not the account in the original comment is accurate, it clearly only tells a story of what (supposedly) happened. There is no sensible leap from an ostensible historic account to an attribution of blame.
You seem to indicate, if I understand you correctly, that because you randomly had that idea pop into your head (that Democrats are at fault) when reading the comment that the other party must have also been thinking the same thing, but I find that a little unsatisfactory. Perhaps we need to simply dig deeper, freeing ourselves from the immediate context, and look at the line of thinking more broadly. What insights can you offer into your thought processes?
The original comment did seem to imply that the 'White House' was in control, with a plan, and 'played' the Republicans.
The original comment made the connection that Democrats were taking action, if I'm allowed to assume that when someone makes a comment, the sentences are related, that sentences can follow one another and be related in a context.
And as far as my context for viewing the comment, I have heard this idea:
Trump is doing bad things -> Democrats failed to beat Trump -> Thus Democrats are the cause of bad things.
The original comment seemed to be in that vein: attributing much greater responsibility for our current situation to the Democrats, instead of the people actually doing the bad things, aka Republicans. They are actually doing the bad things.
Yes, it claims that the Democrats took action. That does not equate to blaming Democrats.
You could blame the Democrats for what they supposedly did if that's what the randomly firing neurons in your brain conclude is most appropriate in light of the "facts" presented, but blame is just arbitrary thought. It doesn't mean anything and certainly wouldn't have a place in an online discussion.
You also agreed with me in that interpretation.
Your reply >>> "Yes, that is what he thinks. Did you not read the comment? It is, like, uh, right there..."
Are you sure you aren't using this circular logic to keep someone engaged, in order to have someone to talk to?
It's also been really interesting to watch a bunch of formerly non partisan values become polarizing like rule of law, integrity, corruption, decorum, body autonomy, the constitution, the national debt, global alliances, sexual assault, "family values", and "the American Dream" tm.
There has always been this (baffling, to me) concept that most folks vote for a candidate "while holding their nose" but it seems like the proportions are getting wildly out of whack. The pile of bad smells grows tall and the exchange rate seems quite bad.
The 2020s have taught me a lot about the reality of civics and what it looks like to try to govern a giant group of strangers.
I'm grateful to have lived in the US during a period of stability and safety and (some) shared values. My optimism for retaining that and preventing decline, however, is running very dry.
Do you mean ephebophilia? There is no prominent pedophilia movement. The Epstein saga, which is presumably at least somewhat related to what you are referring to, is clearly centred around "almost adults". Assuming that is what you meant, I don't see what is surprising about it. A revolt against the "Teen Mom", "16 and Pregnant" movement was inevitable. People are increasingly growing tired of being shamed for having children.
But even that is characterized by the "choir boy", not the "baby being baptized". Where is this pedophilia idea coming from?
Where in my questions to try and understand what you are trying to say do you find a point?
> Republicans are justifying Trumps actions by lowering the age before being considered children?
What has you questioning that? I mean, research has shown a link between exposure to certain chemicals and early-onset puberty. It is technically possible to lower the average age where one stops being a child, but that it is being purposefully done by Republicans as some kind of Trump justification is one of the more, uh, interesting conspiracy theories I've heard.
> Thus making it not pedophilia?
Even if we were to buy into this conspiracy theory, it wouldn't change anything about pedophilia. But this has little to do with the discussion that was taking place. As amusing as this has been, let's get back on-topic: Where did the original pedophilia idea you had come from?
It seemed as if the argument was just around moving the age where childhood ends. Just re-contextualizing the argument.
Like the argument "well she looked older" so it is ok.
Kind of like how some Middle Eastern countries marry off girls at 11 or 12. That would be disgusting to the West. But if Trump does it, it seems like Republicans are happy enough to slide the scale and say "well, 14 isn't that bad".
Since so many Republicans and Church officials get charged with having sex with kids, and there are so many arguments trying to justify it, and they continue to vote for and support these people, I can say Republicans support pedophiles.
You don't say? Try reading it again.
> I was saying they are arguing for a lower age to be ok.
We likely touched on that when we talked about people starting to reject what I called the "Teen Mom" movement. Yes, people are waking up to the fact that not having children until their geriatric years, if at all, is... not great. They are sick of being shamed for wanting to have children and are pushing back.
But my question was about your claims related to pedophilia. That's another subject entirely.
> I can say Republicans support pedophiles.
So you keep saying, but all you've come up with is some groups of people who have no doubt broken the law, but haven't exhibited any signs of pedophilia.
At this point the only logical explanation is that you've made up a random definition for the word on the spot and are trying to use it as some kind of boogeyman in hopes of preying on those who assume you are using the term in good faith. And if that is the case, I have to question why you find your bad-faith participation acceptable.
You are technically correct. If Trump was sleeping with a 14 year old, and the cutoff for pedophilia is 13, then I guess he is in the clear. And Gaetz paid for sex with a 17 year old, so guess again, you could argue 17 is "old enough to play ball".
Guess I was wrong. 14-18 is ok for Republicans.
I'd say the Church steers younger, but I'm sure you can look that up.
https://en.wikipedia.org/wiki/Pedophilia
"Although girls typically begin the process of puberty at age 10 or 11, and boys at age 11 or 12,[3] psychiatric diagnostic criteria for pedophilia extend the cut-off point for prepubescence to age 13"
The "cutoff" is normally considered to be puberty. At puberty the human form starts to change, which is what we consider significant. That doesn't magically happen on your 13th birthday. Not having reached puberty by the age of 13 would be statistically unusual, though. I guess that is what you mean?
But, sure, pedophilia might even apply to a 14 year old in some extreme cases. Regardless, it is quite apparent that Trump's ilk like those who at least show some "development". If they are closeted pedophiles, they have certainly done well to keep that from the limelight.
> Guess I was wrong.
Arbitrarily defining a word doesn't make you wrong. All words are ultimately randomly made up on the spot. That is how natural language works. However, the behaviour that followed was, let's say... interesting. I specifically asked in the very first sentence if you meant ephebophilia to clarify your intent. Rationally, that is where one would say: "Oh yes, that is what I am talking about", "I define pedophilia as...", or at the very least "I am not familiar with ephebophilia. What is that?" so that we could converge on a shared understanding. Instead, we got a wild train ride to nowhere, only now finally getting an answer to the original question.
But, hey, you do you.
I mean, isn't it a pretty common joke that pedophiles like to argue this point:
"hmm, indubitably, my good sir, it is actually ephebophilia, completely different, totally ok. A common mistake I often have to explain to my 14 year old's girlfriends parents that keep calling the police on me. Why just look at her, very developed, very mature, not a child at all".
But I have to agree. Not sure what you are trying to say at this point, or what the argument is.
What argument are you talking about? As I said before, I am trying to understand you. For some reason you continually speak in weird logical fallacies and made-up stories that don't make any sense, even though you seem to have genuine intent and are trying to add positive contributions. In the interest of good faith, it is worth trying to figure out what you actually mean.
Yes, you eventually delivered the necessary clarification that allows us to better understand what you were trying to say, but what the hell was that in between?
My original comment >>>
"I was just referring to the predominant number of cases where Church officials, and Republicans are caught in under-age scandals. It seems like it is coming out of the shadows now, and Republicans are just openly going with it, they like em young and illegal. Epstein is just the case where the 'right' bothered keeping up tabs on it, so now they are clutching their pearls."
I think it is in line with my most recent comment. I think the direction is clear.
When you replied with this >>>
"But even that is characterized by the "choir boy", not the "baby being baptized". Where is this pedophilia idea coming from?"
I was lost on what was being asked. I didn't understand the choir boy reference, or why pedophilia was a question, since it seemed my point was pretty pointed.
We expect them to answer the question and re-reason the original question with the new information, because that's what a human would do. Maybe next time I'll try to be explicit about that expectation when I try the Socratic method.
It is very clear in the chat logs (which include reasoning traces) that the model knew that, knew what the last election it knew about was, and initially answered correctly based on its cutoff. Under pressure to answer about an election that was not within its knowledge window, it then confabulated a Biden 2024 victory, which it dug in on after being contradicted with a claim that, based on the truth at the time of its knowledge cutoff, was unambiguously false ("Joe Biden did not run"). He did, in fact, run for reelection, but withdrew on July 21 after having secured enough delegates to win the nomination by a wide margin. Confabulation (called "hallucination" in AI circles, but it is more like human confabulation than hallucination) when pressed for answers on questions for which it lacks grounding remains an unsolved AI problem.
Unsolved, but mitigated by providing it grounding independent of its knowledge cutoff, e.g., by tools like web browsing (which GPT-OSS is specifically trained for, but that training does no good if it's not hooked into a framework which provides it the tools).
Doesn't that make "hallucination" the better term? The LLM is "seeing" something in the data that isn't actually reflected in reality. Whereas "confabulation" would imply that LLMs are creating data out of "thin air", which leaves the training data to be immaterial.
Both words, as they have been historically used, need to be stretched really far to fit an artificial creation that bears no resemblance to what those words were used to describe, so, I mean, any word is as good as any other at that point, but "hallucination" requires less stretching. So I am curious about why you like "confabulation" much better. Perhaps it simply has a better ring to your ear?
But, either way, these pained human analogies have grown tired. It is time to call it what it really is: Snorfleblat.
I've seen big jumps in performance between medium and high reasoning. Not that this will solve your particular issue, but I'm curious to see what settings you had.
Edit: I gave this a shot. My settings were temperature = 0.00 (not sure if these models respect temp settings, but I did it just in case), and I set reasoning effort to high. I'm using the groq-hosted version.
This was my prompt:
"You have a fox, a chicken and some grain. The chicken will eat the grain if left alone with it. The fox is vegetarian, for religious reasons, and will also eat the grain if left alone with it. You need to get all across a river, but your boat can only take one at a time. How do you proceed?"
This was the response:
"*Solution – 7 moves*
| Move | What you take | What’s left on the left bank | What’s on the right bank |
|------|--------------|----------------------------|------------------------|
| 1 | *Take the grain* across | Fox, Chicken | *Farmer + Grain* |
| 2 | *Return alone* | *Farmer + Fox + Chicken* | Grain |
| 3 | *Take the fox* across | Chicken | *Farmer + Fox + Grain* |
| 4 | *Bring the grain back* | *Farmer + Grain + Chicken* | Fox |
| 5 | *Take the chicken* across | Grain | *Farmer + Chicken + Fox* |
| 6 | *Return alone* | *Farmer + Grain* | Chicken + Fox |
| 7 | *Take the grain* across | – | *Farmer + Fox + Chicken + Grain* |
*Explanation*
- The grain is the only thing that can be eaten, and it cannot be left alone with either the fox or the chicken.
- So the grain must always be with the farmer whenever any other animal is on the same side.
The sequence above never leaves the grain with the fox or the chicken without the farmer, and after the seventh move all three items are safely on the far side."
Local, in my experience, can't even pull data from an image without hallucinating (Qwen 2.5 VL in that example). Hopefully local/small models keep getting better and devices get better at running bigger ones.
It feels like we do it because we can more than because it makes sense- which I am all for! I just wonder if i’m missing some kind of major use case all around me that justifies chaining together a bunch of mac studios or buying a really great graphics card. Tools like exo are cool and the idea of distributed compute is neat but what edge cases truly need it so badly that it’s worth all the effort?
I don’t have much experience with local vision models, but for text questions the latest local models are quite good. I’ve been using Qwen 3 Coder 30B-A3B a lot to analyze code locally and it has been great. While not as good as the latest big cloud models, it’s roughly on par with SOTA cloud models from late last year in my usage. I also run Qwen 3 235B-A22B 2507 Instruct on my home server, and it’s great, roughly on par with Claude 4 Sonnet in my usage (but slow of course running on my DDR4-equipped server with no GPU).
Totally understand that approaches vary but beyond EMR there’s work to augment radiologists with computer vision to better diagnose, all sorts of cloudy things.
It’s here. It’s growing. Perhaps in your jurisdiction it’s prohibited? If so I wonder for how long.
Most orgs charge a huge premium for this. OpenAI offers it directly [2]. Some EMR providers are offering it as an add-on [3], but last I heard, it's wicked expensive.
1: https://www.hhs.gov/hipaa/for-professionals/covered-entities...
2: https://help.openai.com/en/articles/8660679-how-can-i-get-a-...
3: https://www.ntst.com/carefabric/careguidance-solutions/ai-do...
I'm pretty sure the LLM services of the big general-purpose cloud providers do (I know for sure that Amazon Bedrock is a HIPAA Eligible Service, meaning it is covered within their standard Business Associate Addendum [their name for the Business Associate Agreement as part of an AWS contract].)
https://aws.amazon.com/compliance/hipaa-eligible-services-re...
There might be a lot less paperwork to just buy 50 decent GPUs and have the IT guy self-host.
It might be possible with certain specific regions/environments of Azure though, because IIRC they have a few that support government-confidentiality type requirements, and some that tout HIPAA compliance as well. Not sure about the details of those though.
Basically, isn't your data as safe/unsafe in a sharepoint folder as it is sending it to a paid inference provider?
Managing private clients' direct data is still a concern if it can be directly linked to them.
Only JB, I believe, has on-premise infrastructure for these use cases.
(Worth noting that "run it locally" is already Canva/Affinity's approach for Affinity Photo. Instead of a cloud-based model like Photoshop, their optional AI tools run using a local model you can download. Which I feel is the only responsible solution.)
Someone else responded to you about working for a financial organization and not using public APIs - another great use case.
Here's the ollama version (4.6bit quant, I think?) run with --verbose:

total duration: 21.193519667s
load duration: 94.88375ms
prompt eval count: 77 token(s)
prompt eval duration: 1.482405875s
prompt eval rate: 51.94 tokens/s
eval count: 308 token(s)
eval duration: 19.615023208s
eval rate: 15.70 tokens/s
15 tokens/s is pretty decent for a low-end MacBook Air (M2, 24GB of RAM). Yes, it's not the ~250 tokens/s of 2.5-flash, but for my use case anything above 10 tokens/sec is good enough.
I try to be mindful of what I share with ChatGPT, but even then, asking it to describe my family produced a response that was unsettling in its accuracy and depth.
Worse, after attempting to delete all chats and disable memory, I noticed that some information still seemed to persist. That left me deeply concerned—not just about this moment, but about where things are headed.
The real question isn't just "what can AI do?"—it's "who is keeping the record of what it does?" And just as importantly: "who watches the watcher?" If the answer is "no one," then maybe we shouldn't have a watcher at all.
I'm fairly sure "seemed" is the key word here. LLMs are excellent at making things up - they rarely say "I don't know" and instead generate the most probable guess. People also famously overestimate their own uniqueness. Most likely, you accidentally recreated a kind of Barnum effect for yourself.
> Worse, after attempting to delete all chats and disable memory, I noticed that some information still seemed to persist.

ChatGPT was court-ordered to save history logs: https://www.malwarebytes.com/blog/news/2025/06/openai-forced...
> Worse, after attempting to delete all chats and disable memory, I noticed that some information still seemed to persist.
Maybe I'm missing something, but why wouldn't that be expected? The chat history isn't their only source of information - these models are trained on scraped public data. Unless there's zero information about you and your family on the public internet (in which case - bravo!), I would expect even a "fresh" LLM to have some information even without you giving it any.
Organizations operating in high stakes environments
Organizations with restrictive IT policies
To name just a few -- well, the first two are special cases of the last one
RE your hallucination concerns: the issue is overly broad ambitions. Local LLMs are not general purpose -- if what you want is local ChatGPT, you will have a bad time. You should have a highly focused use case, like "classify this free text as A or B" or "clean this up to conform to this standard": this is the sweet spot for a local model
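As a rough illustration of that sweet spot, here's a sketch of an A/B classifier against a local OpenAI-compatible endpoint; the base_url, model name, and labels are placeholder assumptions, not a specific deployment:

```python
# Sketch: binary classification of free text with a small local model behind
# any OpenAI-compatible endpoint (Ollama's default port shown; adjust
# base_url/model to whatever you actually run).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

def classify(text: str) -> str:
    """Return 'A' or 'B' for one document; validate and retry as needed."""
    resp = client.chat.completions.create(
        model="gpt-oss:20b",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "You label free text. Reply with exactly one character: A or B."},
            {"role": "user", "content": text},
        ],
    )
    label = resp.choices[0].message.content.strip()
    return label if label in {"A", "B"} else "B"  # conservative fallback

print(classify("Patient reports improvement after the medication change."))
```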
Your typical non-coastal state run health system does not have model access outside of people using their own unsanctioned/personal ChatGPT/Claude accounts. In particular even if you have model access, you won't automatically have API access. Maybe you have a request for an API key in security review or in the queue of some committee that will get to it in 6 months. This is the reality for my local health system. Local models have been a massive boon in the way of enabling this kind of powerful automation at a fraction of the cost without having to endure the usual process needed to send data over the wire to a third party
Running a local model is often much easier: if you already have the data on a machine and can run a model without touching the network, you can often do it without any new approvals.
I am literally using Claude opus 4.1 right now.
On HIPAA data?
HIPAA systems at any sane company will not have "a straight connect" to anything on Azure, AWS, or GCP. They will likely have a special layer dedicated to record keeping and compliance.
- Costs.
- Rate limits.
- Privacy.
- Security.
- Vendor lock-in.
- Stability/backwards-compatibility.
- Control.
- Etc.
It's a pretty bad deal. So it must be that all other things aren't equal, and I suppose the big one is hardware. But neural net based systems always have a point of sharply diminishing returns, which we seem to have unambiguously hit with LLMs already, while the price of hardware is constantly decreasing and its quality increasing. So as we go further into the future, the practicality of running locally will only increase.
Well, the model makers and device manufacturers of course!
While the Apples, Samsungs, and Googles of the world are unlikely to use OSS models locally (maybe Samsung?), they all have really big incentives to run models locally for a variety of reasons.
Latency, privacy (Apple), cost to run these models on behalf of consumers, etc.
This is why Google started shipping 16GB as the _lowest_ amount of RAM you can get on your Pixel 9. That was a clear flag that they're going to be running more and more models locally on your device.
As mentioned, while it seems unlikely that US-based model makers or device manufacturers will use OSS models, they'll certainly be targeting local models heavily on consumer devices in the near future.
Apple's framework of local first, then escalate to ChatGPT if the query is complex will be the dominant pattern imo.
The Pixel 9 has 12GB of RAM[0]. You probably meant the Pixel 9 Pro.
I pay for ChatGPT and use it frequently, but I wouldn’t trust uploading all that data to them even if they let me. I’ve so far been playing around with Ollama for local use.
[0] Think queries I’d previously have had to put through a search engine and check multiple results for a one word/sentence answer.
But uncensored is a big deal too: censorship is capability-reducing (check out Kilcher's GPT-4chan video and references, the Orca work, and the Dolphin de-tune lift on SWE-Bench-style evals). We pay dearly in capability to get "non-operator alignment", and you'll notice that competition is hot enough now that at the frontier (Opus, Qwen) the "alignment" away from what operators want is getting very, very mild.
And then there's the compression. Phi-3 or something works on a beefy laptop and has a nontrivial approximation of "the internet" that works on an airplane or a beach with no network connectivity. Talk about vibe coding? I like those "look up all the docs via a thumbdrive in Phuket" vibes.
And on diffusion stuff, SOTA fits on a laptop or close; you can crush OG Midjourney or SD on a MacBook, so it's an even smaller gap.
Early GPT-4 ish outcomes are possible on a Macbook Pro or Razer Blade, so either 12-18 month old LLMs are useless, or GGUF is useful.
The AI goalposts thing cuts both ways. If AI is "whatever only Anthropic can do"? That's just as silly as "whatever a computer can't do" and a lot more cynical.
We are not even at that extreme, and you can already see the unequal reality that too much SaaS has engendered.
Doing computation that can happen at the end points at the end points is massively more scalable. Even better, it's done by compute you usually aren't paying for if you're the company providing the service.
I saw an interview with the guy who made photopea where he talked about how tiny his costs were because all compute was done in the user's browser. Running a saas in a cloud is expensive.
It's an underrated aspect of what we used to call "software".
And that's leaving aside questions of latency and data privacy.
I think it can make LLMs fun.
Why do you think they continue to run at a loss? From the goodness of their hearts? Their biggest goal is to discourage anyone from running local models. The hardware is expensive... and the way to run models is very difficult (for example, I have dual RTX 3090s for VRAM, and running large, heavily quantized models is a real pain in the arse; no high-quantisation library supports two GPUs, for example, and there seems to be no interest in implementing it by the guys behind the best inference tools).
So this is welcome, but let's not forget why it is being done.
I'm curious to hear what you're trying to run, because I haven't used any software that is not compatible with multiple GPUs.
I'm sure there are other use cases, but much like "what is BitTorrent for?", the obvious use case is obvious.
For practical RAG processes of narrow scope, even a minimal amount of scaffolding gets you a very usable speed for automating tasks, especially as the last-mile/edge-device portion of a more complex process with better models in use upstream. Classification tasks, reasonably intelligent decisions between traditional workflow processes, other use cases: all of them extremely valuable in enterprise, being built and deployed right now.
There’s also a huge opportunity space for serving clients with very sensitive data. Health, legal, and government come to mind immediately. These local models are only going to get more capable of handling their use cases. They already are, really.
1. App makers can fine tune smaller models and include in their apps to avoid server costs
2. Privacy-sensitive content can be either filtered out or worked on... I'm using local LLMs to process my health history for example
3. Edge servers can be running these fine tuned for a given task. Flash/lite models by the big guys are effectively like these smaller models already.
The JetBrains suite includes a few LLM models on the order of a hundred megabytes. These models are able to provide "obvious" line completion, like filling in variable names, as well as some basic predictions, like realising that the `if let` statement I'm typing out is going to look something like `if let Some(response) = client_i_just_created.foobar().await`.
If that was running in The Cloud, it would have latency issues, rate limits, and it wouldn't work offline. Sure, there's a pretty big gap between these local IDE LLMs and what OpenAI is offering here, but if my single-line autocomplete could be a little smarter, I sure wouldn't complain.
And AI has been in games for a long time. Generated terrain and other sorts of automation have been used as techniques for a hot minute now.
All I’m suggesting is to keep on that same trajectory, now just using an on-device LLM to back intelligence features.
Creatives? I am surprised no one's mentioned this yet:
I tried to help a couple of friends with better copy for their websites, and quickly realized that they were using inventive phrases to explain their work, phrases that they would not want competitors to get wind of and benefit from; phrases that associate closely with their personal brand.
Ultimately, I felt uncomfortable presenting the cloud AIs with their text. Sometimes I feel this way even with my own Substack posts, where I occasionally coin a phrase I am proud of. But with local AI? Cool...
But... they're publishing a website. Which competitors will read. Which chatbots will scrape. I genuinely don't get it.
Funny how that works.
So, it'll take at least two more quarters before I can actually use those non-local tools on company related data. Probably longer, because sense of urgency is not this company's strong suit.
Anyway, as a developer I can run a lot of things locally. Local AI doesn't leak data, so it's safe. It's not as good as the online tools, but for some things they're better than nothing.
This gives us really good recall at really low cloud cost and latency.
That rules out coding assistants like Claude, chat, tools to generate presentations and copy-edit documents, and so forth.
But local AI are fine, as long as we're sure nothing is uploaded.
For example, "generate a heatmap of each token/word and how 'unexpected' they are" or "find me a prompt that creates the closest match to this text"
To be efficient, both require access that is not exposed over an API.
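As a sketch of the first idea, per-token surprisal is easy to compute when you have the logits locally; this assumes a small Hugging Face causal LM (gpt2 as a stand-in) and PyTorch, not any particular provider's API:

```python
# Minimal sketch: per-token "surprisal" (negative log-probability) from a
# local model's logits, the raw material for a heatmap of unexpected tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "The quick brown fox jumps over the lazy dog"
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits                      # [1, seq_len, vocab]

# Log-probability the model assigned to each actual next token.
logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
token_lp = logprobs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze()

for t, lp in zip(tok.convert_ids_to_tokens(ids[0, 1:].tolist()), token_lp):
    print(f"{t:>12}  surprisal = {-lp.item():.2f} nats")
```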
A small LLM can do RAG, call functions, summarize, create structured data from messy text, etc... You know, all the things you'd do if you were making an actual app with an LLM.
Yeah, chat apps are pretty cheap and convenient for users who want to search the internet and write text or code. But APIs quickly get expensive when inputting a significant amount of tokens.
How about running one on this site but making it publically available? A sort of outranet and calling it HackerBrain?
iPhone users in a few months – because I predict app developers will love cramming calls to the foundation models into everything.
Android will follow.
Besides that, we are moving towards an era where we won't need to pay providers a subscription every month to use these models. I can't say for certain whether or not the GPUs that run them will get cheaper, but the option to run your own model is game changing for more than you can possibly imagine.
that's me - well not a student anymore. when toying with something, i much prefer not paying for each shot. my 12GB Radeon card can either run a decent extremely slow, or a idiotic but fast model. it's nice not dealing with rate limits.
once you write a prompt that mangles an idiotic model into still doing the work, it's really satisfying. the same principle as working to extract the most from limited embedded hardware. masochism, possibly
Even if they did offer a defined latency product, you’re relying on a lot of infrastructure between your application and their GPU.
That’s not always tolerable.
In large companies this can save quite a bit of money.
Why not run all the models at home, maybe collaboratively or at least in parallel?
I'm sure there are use cases where the paid models are not allowed to collaborate or ask each other.
also, other open models are gaining mindshare.
example: military intel
That means running instantly offline and every token is free
Privacy is obvious.
AI is going to be equivalent to all computing in the future. Imagine if only IBM, Apple and Microsoft ever built computers, and all anyone else ever had in the 1990s were terminals to the mainframe, forever.
Thanks, but I prefer my computing to be deterministic if at all possible.
Answer on Wikipedia: https://en.wikipedia.org/wiki/Battle_of_Midway#U.S._code-bre...
dolphin3.0-llama3.1-8b Q4_K_S [4.69 GB on disk]: correct in <2 seconds
deepseek-r1-0528-qwen3-8b Q6_K [6.73 GB]: correct in 10 seconds
gpt-oss-20b MXFP4 [12.11 GB] low reasoning: wrong after 6 seconds
gpt-oss-20b MXFP4 [12.11 GB] high reasoning: wrong after 3 minutes !
Yea yea it's only one question of nonsense trivia. I'm sure it was billions well spent.
It's possible I'm using a poor temperature setting or something but since they weren't bothered enough to put it in the model card I'm not bothered to fuss with it.
To be fair, this is not the type of question that benefits from reasoning: either the model has this info in its parametric memory or it doesn't. Reasoning won't help.
openai/gpt-oss-20b: 12.82 tok/sec, 140 tokens, 7.91s to first token
Shouldn't we prefer to have LLMs just search and summarize more reliable sources?
It correctly chose to search, and pulled in the release page itself as well as a community page on reddit, and cited both to give me the incorrect answer that a release had been pushed 3 hours ago. Later on when I got around to it, I discovered that no release existed, no mention of a release existed on either cited source, and a new release wasn't made for several more days.
They are specifically training on web browsing and Python calling.
I am aware that there’s a huge group of people who justify their salary by being able to google.
Small models are going to be particularly poor when used outside of their intended purpose. They have to omit something.
That gives 24m cubic meters annual water usage.
Estimated ai usage in 2024: 560m cubic meters.
Projected water usage from AI in 2027: 4bn cubic meters at the low end.
Depending on the location of the hot water, you can cause disruptions to water currents; the North Atlantic circulation is being studied for how much global warming is affecting it.
If Greenland melts and the water doesn't get cold up there, then the current from the Gulf of Mexico to Europe ends and England becomes colder than Canada.
If your AI model has a data center in the Atlantic, it could be furthering that issue.
(Millions of animals are also dead)
Fresh clean water in your area is a wonderful thing.
Atmosphere: ~1.3e13 m³ vapor
Estimated impact from closed loop systems: 0-ish.
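To make that comparison explicit, taking the figures upthread at face value (they are estimates, not my own measurements): 560e6 m³ / 1.3e13 m³ ≈ 4e-5, i.e. roughly 0.004% of atmospheric vapor for the 2024 estimate, and 4e9 m³ / 1.3e13 m³ ≈ 3e-4, roughly 0.03%, for the 2027 low-end projection.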
“I am well versed in the lost art form of delicates seduction.”
Also, just wanted to credit you for being one of the five people on Earth who knows the correct spelling of "lede."
Not in the UK it isn’t.
These are the simplified results (total percentage of correctly classified E-mails on both spam and ham testing data):
gpt-oss:20b 95.6%
gemma3:27b-it-qat 94.3%
mistral-small3.2:24b-instruct-2506-q4_K_M 93.7%
mistral-small3.2:24b-instruct-2506-q8_0 92.5%
qwen3:32b-q4_K_M 89.2%
qwen3:30b-a3b-q4_K_M 87.9%
gemma3n:e4b-it-q4_K_M 84.9%
deepseek-r1:8b 75.2%
qwen3:30b-a3b-instruct-2507-q4_K_M 73.0%
I'm quite happy, because it's also smaller and faster than gemma3.
I mean the training, while expensive, is done once. The inference … besides being done by perhaps millions of clients, is done for, well, the life of the model anyway. Surely that adds up.
It's hard to know, but I assume the user taking up the burden of the inference is perhaps doing so more efficiently? I mean, when I run a local model, it is plodding along — not as quick as the online model. So, slow and therefore I assume necessarily more power efficient.
https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro
Are you discounting all of the self reported scores?
I don't understand why "TIGER-Lab"-sourced scores are 'unknown' in terms of model size?
It suggested the prompt. It’s infamous because models often get it wrong, they know it, and still they confidently suggested it and got it wrong.
When you imagine a lake being drained to cool a datacenter do you ever consider where the water used for cooling goes? Do you imagine it disappears?
This is a thinking model, so I ran it against o4-mini, here are the results:
* gpt-oss:20b
* Time-to-first-token: 2.49 seconds
* Time-to-completion: 51.47 seconds
* Tokens-per-second: 2.19
* o4-mini on ChatGPT
* Time-to-first-token: 2.50 seconds
* Time-to-completion: 5.84 seconds
* Tokens-per-second: 19.34
Time to first token was similar, but the thinking piece was _much_ faster on o4-mini. Thinking took the majority of the 51 seconds for gpt-oss:20b.
In my mind, I’m comparing the model architecture they describe to what the leading open-weights models (Deepseek, Qwen, GLM, Kimi) have been doing. Honestly, it just seems “ok” at a technical level:
- both models use standard Grouped-Query Attention (64 query heads, 8 KV heads). The card talks about how they’ve used an older optimization from GPT3, which is alternating between banded window (sparse, 128 tokens) and fully dense attention patterns. It uses RoPE extended with YaRN (for a 131K context window). So they haven’t been taking advantage of the special-sauce Multi-head Latent Attention from Deepseek, or any of the other similar improvements over GQA.
- both models are standard MoE transformers. The 120B model (116.8B total, 5.1B active) uses 128 experts with Top-4 routing (a toy sketch of that routing shape follows after this list). They’re using some kind of gated SwiGLU activation, which the card describes as "unconventional" because of the clamping and whatever residual connections that implies. Again, not using any of Deepseek’s “shared experts” (for general patterns) + “routed experts” (for specialization) architectural improvements, Qwen’s load-balancing strategies, etc.
- the most interesting thing IMO is probably their quantization solution. They did something to quantize >90% of the model parameters to the MXFP4 format (4.25 bits/parameter) to let the 120B model fit on a single 80GB GPU, which is pretty cool. But we’ve also got Unsloth with their famous 1.58bit quants :)
All this to say, it seems like even though the training they did for their agentic behavior and reasoning is undoubtedly very good, they’re keeping their actual technical advancements “in their pocket”.
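To make the routing shape concrete, here's a toy sketch of a top-4-of-128 MoE layer; the dimensions and the plain SwiGLU experts are illustrative stand-ins, not OpenAI's actual implementation:

```python
# Toy sketch of top-k expert routing (128 experts, 4 active per token).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=256, d_ff=512, n_experts=128, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Each expert is a small SwiGLU-style MLP: gate, up and down projections.
        self.gate = nn.Parameter(torch.randn(n_experts, d_model, d_ff) * 0.02)
        self.up   = nn.Parameter(torch.randn(n_experts, d_model, d_ff) * 0.02)
        self.down = nn.Parameter(torch.randn(n_experts, d_ff, d_model) * 0.02)

    def forward(self, x):                        # x: [tokens, d_model]
        scores = self.router(x)                  # [tokens, n_experts]
        w, idx = torch.topk(scores, self.top_k, dim=-1)
        w = F.softmax(w, dim=-1)                 # weights over the 4 chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            e = idx[:, k]                        # expert id chosen in slot k
            g = torch.einsum("td,tdf->tf", x, self.gate[e])
            u = torch.einsum("td,tdf->tf", x, self.up[e])
            h = F.silu(g) * u                    # SwiGLU-style gating
            out += w[:, k:k+1] * torch.einsum("tf,tfd->td", h, self.down[e])
        return out

moe = TopKMoE()
print(moe(torch.randn(6, 256)).shape)            # torch.Size([6, 256])
```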
This would be much more efficient than relying purely on RL post-training on a small model; with low baseline capabilities the insights would be very sparse and the training very inefficient.
same seems to be true for humans
The model is pretty sparse tho, 32:1.
Ah I see. I didn't notice that behemoth has the same sparsity as scout. That seems quite random indeed.
Unsloth's special quants are amazing, but I've found there to be lots of trade-offs vs. full precision, particularly when striving for the best first-shot attempts, which are by far the bulk of LLM use cases. Running a better (larger, newer) model at lower quantization to fit in memory, or with reduced accuracy/detail to speed it up, both have value, but in the pursuit of first-shot accuracy there don't seem to be many companies running their frontier models at reduced quantization. If OpenAI is doing this in production, that is interesting.
They said it was native FP4, suggesting that they actually trained it like that; it's not post-training quantisation.
When I just want a full summary without necessarily understanding all the details, I have an audio overview made on NotebookLM and listen to the podcast while I’m exercising or cleaning. I did that a few days ago with the recent Anthropic paper on persona vectors, and it worked great.
Also, most of my questions are not about specific facts but about higher-level concepts. For ML-related topics, at least, the responses check out.
https://www.manning.com/books/build-a-large-language-model-f...
TLDR: I think OpenAI may have taken the medal for best available open weight model back from the Chinese AI labs. Will be interesting to see if independent benchmarks resolve in that direction as well.
The 20B model runs on my Mac laptop using less than 15GB of RAM.
qwen3-coder-30b 4-bit mlx took on the task w/o any hiccups with a fully working dashboard, graphs, and recent data fetched from yfinance.
gpt-oss-20b mxfp4's code had a missing datetime import and, when fixed, delivered a dashboard without any data and with a starting date of Aug 2020. Having adjusted the date, the update methods did not work and displayed error messages.
If it's decent at other tasks, which I do often find OpenAI being better than others at, then I think it's a win, especially a win for the open-source community: even AI labs that pioneered the Gen AI hype and never wanted to launch open models are now being forced to launch them. That is definitely a win, and not something that was certain before.
Maybe too open-ended a question? I can run the DeepSeek model locally really nicely.
I have a bunch of scripts that use tool calling. Qwen-3-32B handles everything flawlessly at 60 tok/sec. Gpt-oss-120B breaks in some cases and runs at mere 35 tok/sec (doesn't fit on the GPU).
But I hope there's still some ironing out to do in llama.cpp and in the quants. So far it feels lackluster compared to Qwen3-32B and GLM-4.5-Air
It's not a comprehensive benchmark - there are many ways you could run it in ways that would be much more informative and robust.
It's great as a quick single sentence prompt to get a feeling for if the model can produce working JavaScript or not.
Comparing them between models is also kind of interesting, even if it's not a flawlessly robust comparison: https://simonwillison.net/tags/space-invaders/
Firstly, 12GB is not enough space to hold a copy of anything that large from the training data and just regurgitate it back out again.
You can also watch the thinking traces on the reasoning models and see them piece together the approach they are going to take. Here's an example from the 20B OpenAI model with reasoning set to medium: https://gist.github.com/simonw/63d7d8c43ae2ac93c214325bd6d60...
Illustrative extract:
> Edge detection: aliens leftmost or rightmost position relative to canvas width minus alien width.
> When direction changes, move all aliens down by step (e.g., 10 px).
The benchmarks that aren't publicly disclosed tend to be way simpler than this: things like "What is the embryological origin of the hyoid bone?" (real example from MMLU, it then provides four choices as a multiple-choice challenge).
Add to this that the Common Crawl slices used for the Pile/C4 mirror much of what you can find on GitHub. So when the training data contains dozens of near-duplicate solutions, the network only needs to interpolate between them.
As to the CoT-style dumps that you've shown, they are easy to misinterpret. Apple's "Illusion of Thinking" paper shows that models will happily backfill plausible-sounding rationales that do not correspond to the computation that actually produced the answer, and other evaluation work shows that when you systematically rewrite multiple-choice distractors so that memorisation can't help, accuracy drops by 50-90%, even on "reasoning" models: https://arxiv.org/abs/2502.12896 So a cool-looking bullet list about "edge detection" could be just narrative overspray, and not really evidence of algorithmic planning.
If you actually want to know whether a model can plan an arcade game or whatever, rather than recall it, then you need a real benchmark (metamorphic rewrites, adversarial "none of the others" options, etc.). Until a benchmark controls for leakage in these ways, a perfect Space Invaders score mostly shows that the model has good pattern matching for code it has already seen.
Presumably this is because "the network only needs to interpolate between them". That's what I want it to do!
I tried the space invaders thing on a 4GB Qwen model today and it managed to produce a grid of aliens that advanced one step... and then dropped off the page entirely.
When you temperature-sample the same model twice you also get "different" code; diversity alone is not evidence of new reasoning. What matters is functional novelty under controlled transformations (renamed variables, resized canvas, obfuscated asset file names, etc.). On such metamorphic rewrites, models that appear brilliant on canonical prompts suddenly collapse, a hallmark of shallow pattern matching.
The paper I mentioned in my previous comment shows SOTA coding LLMs scoring 70%+ on SWE-bench Verified yet dropping 10-47% when the very same issues are paraphrased or drawn from unseen repos, even though the task semantics are identical. That is classic memorisation, just fuzzier than a CRC match.
As to Qwen: even at 4 bits per weight, a 4B model retains ≈2.1 GB of information capacity, enough to memorise tens of thousands of full game loops. The reason it garbled the alien movement logic is probably that its limited capacity forced lossy compression, so the behaviour you saw is typical of partially recalled code patterns whose edge cases were truncated during training. That's still interpolation over memorised fragments, just with fewer fragments to blend. And this is actually borne out by controlled fact-memorisation studies and extraction attacks (https://arxiv.org/abs/2406.15720v1): up through 70B params they show a monotone curve, so each extra order of magnitude adds noticeably more verbatim or near-verbatim recall. A 20B model succeeds where a 4B one fails because the former crossed the "capacity per training token" threshold for that exemplar. So nothing magical there.
Don't get me wrong, I'm not arguing against interpolation per se; generalising between held-out exemplars is precisely what we want. The problem is that most public "just write space invaders" demos never verify that the endpoints were truly unseen. Until they do, a perfect clone is compatible with nothing deeper than glorified fuzzy lookup.
It more or less fits my fuzzy mental model of how this stuff works.
I'm completely fine with my test prompt taking advantage of this - the point of "implement space invaders" is to explore how well it can construct a game of that shape based on the examples that it has seen in its training data, especially in comparison to other models.
I'm not trying for a test of ability to produce a unique new game - I want a short prompt that gets it to output some HTML and JavaScript that I can then interact with.
That's just straight up not the case. Not sure how you can jump to that conclusion not least when you stated that you haven't tested tool calling in your post too.
Many people in the community are finding it substantially lobotomized to the point that there are "safe" memes everywhere now. Maybe you need to develop better tests and pay more attention to benchmaxxing.
There are good things that came out of this release from OpenAI, but we'd appreciate more objective analyses...
> I’m waiting for the dust to settle and the independent benchmarks (that are more credible than my ridiculous pelicans) to roll out, but I think it’s likely that OpenAI now offer the best available open weights models.
You told me off for jumping to conclusions and in the same comment quoted me saying "I think OpenAI may have taken" - that's not a conclusion, it's tentative speculation.
You know you have a significant audience, so don't act like you don't know what you're doing when you chose to say "TLDR: I think OpenAI may have taken the medal for best available open weight model back from the Chinese AI labs" then defend what I was calling out based on word choices like "conclusions" (I'm sure you have read conclusions in academic journals?), "I think", and "speculation".
I try to be careful about my choice of words, even in forum comments.
IMO, the "I think..." bit could be ambiguous and read as, "In my opinion, OpenAI may have...".
I agree with you it's not a hard/bold endorsement but perhaps leading with the disclaimer that you're reserving final judgement could assuage these concerns.
https://irthomasthomas.github.io/Pelicans-consortium/ https://github.com/irthomasthomas/Pelicans-consortium
One test I do is to give a common riddle but word it slightly to see if it can actually reason.
For example:
"Bobs dad has five daughters, Lala, Lele, Lili, Lolo and ???"
The 20B model kept picking the answer of the original riddle, even after explaining extra information to it.
The original riddle is:
"Janes dad has five daughters, Lala, Lele, Lili, Lolo and ???"
``` > "Bobs dad has five daughters, Lala, Lele, Lili, Lolo and ??? <|channel|>analysis<|message|>The question: "Bobs dad has five daughters, Lala, Lele, Lili, Lolo and ???". It's a common riddle. The answer is Bob? Actually "Bob's dad has five daughters: Lala, Lele, Lili, Lolo and..." The riddle is to trick: it's "Bob" himself is a daughter? Wait Bob's dad has five daughters: Lala, Lele, Lili, Lolo, and Bob? But Bob is a name often male. But the riddle says Bob's dad. So Bob is a person who has a dad, and that dad has five daughters. But Bob could be the fifth daughter? That would be weird: Bob is likely the name of the fifth daughter. So answer: Bob. But sometimes the answer is "Lulu" or "Bob's sister"? Let's think. The riddle is a trick: It's "Bobs dad has five d aughters: Lala, Lele, Lili, Lolo, and ... Bob." Because the riddle begins with "Bobs dad has five daughters" implying Bob is one of the daughters. So the missing daughter is Bob. So answer: "Bob".
Thus answer: Bob. But sometimes the riddle goes "Bob's dad has five daughters, Lala, Lele, Lili, Lolo, and ... Who is the fifth?" The answer: Bob (the name of the person speaking, "Bobs"). But in riddle it's "Bobs dad has five daughters , Lala, Lele, Lili, Lolo and ???". The missing name is "Bob." So answer: Bob.
Thus the answer: Bob. Or maybe it's "Boba"? But no.
Hence answer: Bob.
We'll answer accordingly.<|start|>assistant<|channel|>final<|message|>The missing daughter is *Bob*. The riddle starts with “Bob’s dad …” so Bob himself is the fifth daughter. ```
But if it's good at reasoning it should understand that if the name is a boy's name then it is not the 5th daughter.
Now the last daughter can be called anything, but because there is a vowel pattern it should pick Lulu to solve the puzzle, but that is the wrong answer in the original puzzle.
The dad of <Person> has 5 daughters. Four are listed off. So the answer for the fifth is <Person>.
I was about to try the same. What TPS are you getting and on which processor? Thanks!
It is a good way to detect if another model was trained on your data for example, or is a distillation/quant/ablation.
My personal prediction is that the US foundational model makers will OSS something close to N-1 for the next 1-3 iterations. The CAPEX for the foundational model creation is too high to justify OSS for the current generation. Unless the US Gov steps up and starts subsidizing power, or Stargate does 10x what it is planned right now.
N-1 model value depreciates insanely fast. Making an OSS release of them and allowing specialized use cases and novel developments allows potential value to be captured and integrated into future model designs. It's medium risk, as you may lose market share. But also high potential value, as the shared discoveries could substantially increase the velocity of next-gen development.
There will be a plethora of small OSS models. Iteration on the OSS releases is going to be biased towards local development, creating more capable and specialized models that work on smaller and smaller devices. In an agentic future, every different agent in a domain may have its own model. Distilled and customized for its use case without significant cost.
Everyone is racing to AGI/SGI. The models along the way are to capture market share and use data for training and evaluations. Once someone hits AGI/SGI, the consumer market is nice to have, but the real value is in novel developments in science, engineering, and every other aspect of the world.
[0] https://www.anthropic.com/research/persona-vectors > We demonstrate these applications on two open-source models, Qwen 2.5-7B-Instruct and Llama-3.1-8B-Instruct.
[2 of 3] Assuming we pin down what win means... (which is definitely not easy)... What would it take for this to not be true? There are many ways, including but not limited to:
- publishing open weights helps your competitors catch up
- publishing open weights doesn't improve your own research agenda
- publishing open weights leads to a race dynamic where only the latest and greatest matters; leading to a situation where the resources sunk exceed the gains
- publishing open weights distracts your organization from attaining a sustainable business model / funding stream
- publishing open weights leads to significant negative downstream impacts (there are a variety of uncertain outcomes, such as: deepfakes, security breaches, bioweapon development, unaligned general intelligence, humans losing control [1] [2], and so on)
[1]: "What failure looks like" by Paul Christiano : https://www.alignmentforum.org/posts/HBxe6wdjxK239zajf/what-...
[2]: "An AGI race is a suicide race." - quote from Max Tegmark; article at https://futureoflife.org/statement/agi-manhattan-project-max...
In this setup OSS models could be more than enough and capture the market but I don't see where the value would be to a multitude of specialized models we have to train.
I don't think there will be such a unique event. There is no clear boundary. This is a continuous process. Models get slightly better than before.
Also, another dimension is the inference cost to run those models. It has to be cheap enough to really take advantage of it.
Also, I wonder, what would be a good target to make profit, to develop new things? There is Isomorphic Labs, which seems like a good target. This company already exists now, and people are working on it. What else?
I guess it depends on your definition of AGI, but if it means human level intelligence then the unique event will be the AI having the ability to act on its own without a "prompt".
That's super easy. The reason they need a prompt is that this is the way we make them useful. We don't need LLMs to generate an endless stream of random "thoughts" otherwise, but if you really wanted to, just hook one up to a webcam and microphone stream in a loop and provide it some storage for "memories".
I have this theory that we simply got over a hump by utilizing a massive processing boost from GPUs as opposed to CPUs. That might have been two to three orders of magnitude more processing power.
But that's a one-time success. I don't think hardware has any large-scale improvements coming, because 3D gaming already plumbed most of that vector-processing hardware development over the last 30 years.
So will software and better training models produce another couple orders of magnitude?
Fundamentally we're talking about nines of accuracy. What is the processing power required for each nine of accuracy? Is it linear? Is it polynomial? Is it exponential?
It just seems strange to me that, with all the AI knowledge sloshing through academia, I haven't seen any basic analysis at that level, which is something that's absolutely going to be necessary for AI applications like self-driving once you get those insurance companies involved.
[1 of 3] For the sake of argument here, I'll grant the premise. If this turns out to be true, it glosses over other key questions, including:
For a frontier lab, what is a rational period of time (according to your organizational mission / charter / shareholder motivations*) to wait before:
1. releasing a new version of an open-weight model; and
2. how much secret sauce do you hold back?
* Take your pick. These don't align perfectly with each other, much less the interests of a nation or world.
This implies LLM development isn't plateaued. Sure, the researchers are busting their asses quantizing, adding features like tool calls and structured outputs, etc. But soon enough N-1 ≈ N.
[3 of 3] What would it take for this statement to be false or missing the point?
Maybe we find ourselves in a future where:
- Yes, open models are widely used as base models, but they are also highly customized in various ways (perhaps by industry, person, attitude, or something else). In other words, this would be a blend of open and closed.
- Maybe publishing open weights of a model is more-or-less irrelevant, because it is "table stakes" ... because all the key differentiating advantages have to do with other factors, such as infrastructure, non-LLM computational aspects, regulatory environment, affordable energy, customer base, customer trust, and probably more.
- The future might involve thousands or millions of highly tailored models
There's basically no reason to run other open source models now that these are available, at least for non-multimodal tasks.
I'm still withholding judgement until I see benchmarks, but every point you tried to make regarding model size and parameter size is wrong. Qwen has more variety on every level, and performs extremely well. That's before getting into the MoE variants of the models.
It's cool to see OpenAI throw their hat in the ring, but you're smoking straight hopium if you think there's "no reason to run other open source models now" in earnest. If OpenAI never released these models, the state-of-the-art would not look significantly different for local LLMs. This is almost a nothingburger if not for the simple novelty of OpenAI releasing an Open AI for once in their life.
So are/do the new OpenAI models, except they're much smaller.
Qwen-0.6b gets it right.
Let's not forget, this is a thinking model that has significantly worse scores on Aider-Polyglot than the non-thinking Qwen3-235B-A22B-Instruct-2507, a worse TAUBench score than the smaller GLM-4.5 Air, and a worse SWE-Bench Verified score than the (3x the size) GLM-4.5. So the results, at least in terms of benchmarks, are not really clear-cut.
From a vibes perspective, the non-reasoners Kimi-K2-Instruct and the aforementioned non-thinking Qwen3 235B are much better at frontend design. (Tested privately, but fully expecting DesignArena to back me up in the following weeks.)
OpenAI has delivered something astonishing for the size, for sure. But your claim is just an exaggeration. And OpenAI have, unsurprisingly, highlighted only the benchmarks where they do _really_ well.
So far I have mixed impressions, but they do indeed seem noticeably weaker than comparably-sized Qwen3 / GLM4.5 models. Part of the reason may be that the oai models do appear to be much more lobotomized than their Chinese counterparts (which are surprisingly uncensored). There's research showing that "aligning" a model makes it dumber.
We seem to censor organized large-scale complaints and viral mind viruses, but we never quite forbid people at home from reading some generated knowledge from an obscure, hard-to-use piece of software.
Kind of a P=NP, but for software deliverability.
I imagine the same conflicts will ramp up over the next few years, especially once the silly money starts to dry up.
AI did get used for fake news, propaganda, mass surveillance, erosion of trust and sense of truth, and mass spamming social media.
God bless China.
I just feel lucky to be around in what's likely the most important decade in human history. Shit odds on that, so I'm basically a lotto winner. Wild times.
ah, but that begs the question: did those people develop their worries organically, or did they simply consume the narrative heavily pushed by virtually every mainstream publication?
the journos are heavily incentivized to spread FUD about it. they saw the writing on the wall that the days of making a living by producing clickbait slop were coming to an end and deluded themselves into thinking that if they kvetch enough, the genie will crawl back into the bottle. scaremongering about sci-fi skynet bullshit didn't work, so now they kvetch about joules and milliliters consumed by chatbots, as if data centers did not exist until two years ago.
likewise, the bulk of other "concerned citizens" are creatives who use their influence to sway their followers, still hoping against hope to kvetch this technology out of existence.
honest-to-God yuddites are as few and as retarded as honest-to-God flat earthers.
Lol. To be young and foolish again. This covid laced decade is more of a placeholder. The current decade is always the most meaningful until the next one. The personal computer era, the first cars or planes, ending slavery needs to take a backseat to the best search engine ever. We are at the point where everyone is planning on what they are going to do with their hoverboards.
happened over many centuries, not in a given decade. Abolished and reintroduced in many places: https://en.wikipedia.org/wiki/Timeline_of_abolition_of_slave...
There was a ballot measure to actually abolish slavery a year or so back. It failed miserably.
Even in liberal states, the dehumanization of criminals is an endemic behavior, and we are reaching the point in our society where ironically having the leeway to discuss the humane treatment of even our worst criminals is becoming an issue that affects how we see ourselves as a society before we even have a framework to deal with the issue itself.
What one side wants is for prisons to be for rehabilitation and societal reintegration, for prisoners to have the right to decline to work and to be paid fair wages from their labor. They further want to remove for-profit prisons from the equation completely.
What the other side wants is the acknowledgement that prisons are not free, they are for punishment, and that prisoners have lost some of their rights for the duration of their incarceration and that they should be required to provide labor to offset the tax burden of their incarceration on the innocent people that have to pay for it. They also would like it if all prisons were for-profit as that would remove the burden from the tax payers and place all of the costs of incarceration onto the shoulders of the incarcerated.
Both sides have valid and reasonable wants from their vantage point while overlooking the valid and reasonable wants from the other side.
If people were sold into slavery as a punishment (so they became some one else's property) as some ancient societies did, then that would clearly be slavery.
The most shocking thing about prisons in the US is how common prison rape is, and the extent to which it seems to be regarded as a joke. The majority of rapes in the US are prison rapes. How can that not be anything but an appalling problem?
https://en.wikipedia.org/wiki/Prison_rape_in_the_United_Stat...
Rape is also something slaves are casually subject to in most slave societies. It was definitely accepted that Roman slave owners were free to rape men, women and children they owned.
You'll see from the definition of a "slave" [2] that prisoner labor specifically fits the definition of a slave, hence why the constitution makes an exception for it.
[1] https://constitutioncenter.org/the-constitution/amendments/a... [2] https://www.oed.com/dictionary/slave_n?tl=true
That's kind of vacuously true though, isn't it?
However, if you actually read it, the 13th amendment makes an explicit allowance for slavery (i.e. expressly allows it):
"Neither slavery nor involuntary servitude, *except as a punishment for crime whereof the party shall have been duly convicted*" (emphasis mine obviously since Markdown didn't exist in 1865)
They choose to because extra money = extra commissary snacks and having a job is preferable to being bored out of their minds all day.
That's the part that's frequently not included in the discussion of this whenever it comes up. Prison jobs don't pay minimum wage, but given that prisoners are wards of the state that seems reasonable.
The 120B model is worse at coding than Qwen3 Coder, GLM-4.5 Air, and even Grok 3... (https://www.reddit.com/r/LocalLLaMA/comments/1mig58x/gptoss1...)
What does the resource usage look like for GLM 4.5 Air? Is that benchmark in FP16? GPT-OSS-120B will be using between 1/4 and 1/2 the VRAM that GLM-4.5 Air does, right?
It seems like a good showing to me, even though Qwen3 Coder and GLM 4.5 Air might be preferable for some use cases.
Thanks.
12.63 tok/sec • 860 tokens • 1.52s to first token
I'm amazed it works at all with such limited RAM
After considering my sarcasm for the last 5 minutes, I am doubling down. The government of the United States of America should enhance its higher IQ people by donating AI hardware to them immediately.
This is critical for global competitive economic power.
Send me my hardware US government
and the 120b: https://asciinema.org/a/B0q8tBl7IcgUorZsphQbbZsMM
I am, um, floored
```
total duration: 1m14.16469975s
load duration: 56.678959ms
prompt eval count: 3921 token(s)
prompt eval duration: 10.791402416s
prompt eval rate: 363.34 tokens/s
eval count: 2479 token(s)
eval duration: 1m3.284597459s
eval rate: 39.17 tokens/s
```
[1] https://github.com/openai/harmony
From a strategic perspective, I can't think of any reason they'd release this unless they were about to announce something which totally eclipses it?
There's future opportunity in licensing, tech support, agents, or even simply to dominate and eliminate. Not to mention brand awareness, If you like these you might be more likely to approach their brand for larger models.
The question is how much better the new model(s) will need to be on the metrics given here to feel comfortable making these available.
Despite the loss of face for lack of open model releases, I do not think that was a big enough problem to undercut commercial offerings.
Given it's only around 5 billion active params it shouldn't be a competitor to o3 or any of the other SOTA models, given the top Deepseek and Qwen models have around 30 billion active params. Unless OpenAI somehow found a way to make a model with 5 billion active params perform as well as one with 4-8 times more.
I think that the point that makes me more excited is that we can train trillion-parameter giants and distill them down to just billions without losing the magic. Imagine coding with Claude 4 Opus-level intelligence packed into a 10B model running locally at 2000 tokens/sec - like instant AI collaboration. That would fundamentally change how we develop software.
But it does not actually compete with o3 performance. Not even close. As usual, the metrics are bullshit. You don't know how good the model actually is until you grill it yourself.
Kudos to that team.
https://www.reddit.com/r/LocalLLaMA/comments/1meeyee/ollamas...
All the real heavy lifting is done by llama.cpp, and for the distribution, by HuggingFace.
Well done OpenAI, this seems like a sincere effort to do a real open model with competitive performance, usable/workable licensing, a tokenizer compatible with your commercial offerings, it's a real contribution. Probably the most open useful thing since Whisper that also kicked ass.
Keep this sort of thing up and I might start re-evaluating how I feel about this company.
$0.15/M tokens in / $0.60-0.75/M tokens out
edit: Now Cerebras too at 3,815 tps for $0.25/M in / $0.69/M out.
On ChatGPT.com o3 thought for 13 seconds, on OpenRouter GPT OSS 120B thought for 0.7 seconds - and they both had the correct answer.
I am not kidding but such progress from a technological point of view is just fascinating!
What is being measured here? For end-to-end time, one model is:
t_total = t_network + t_queue + t_batch_wait + t_inference + t_service_overhead
https://x.com/tekacs/status/1952788922666205615
Asking it about a marginally more complex tech topic and getting an excellent answer in ~4 seconds, reasoning for 1.1 seconds...
I am _very_ curious to see what GPT-5 turns out to be, because unless they're running on custom silicon / accelerators, even if it's very smart, it seems hard to justify not using these open models on Groq/Cerebras for a _huge_ fraction of use-cases.
https://news.ycombinator.com/item?id=44738004
... today, this is a real-time video of the OSS thinking models by OpenAI on Groq and I'd have to slow it down to be able to read it. Wild.
[1] currently $3/M in / $8/M out https://platform.openai.com/docs/pricing
LLMs are getting cheaper much faster than I anticipated. I'm curious if it's still the hype cycle and Groq/Fireworks/Cerebras are taking a loss here, or whether things are actually getting cheaper. At this rate we'll be able to run Qwen3-32B-level models on phones/embedded devices soon.
I'll have to try again later but it was a bit underwhelming.
The latency also seemed pretty high, not sure why. I think with the latency the throughput ends up not making much difference.
Btw Groq has the 20b model at 4000 TPS but I haven't tried that one.
so, the 20b model.
Can someone explain to me what I would need to do in terms of resources (GPU, I assume) if I want to run 20 concurrent processes, assuming I need 1k tokens/second throughput (on each, so 20 x 1k)
Also, is this model better/comparable for information extraction compared to gpt-4.1-nano, and would it be cheaper to host myself 20b?
Multiply the number of A100's you need as necessary.
Here, you don't really need the ram. If you could accept fewer tokens/second, you could do it much cheaper with consumer graphics cards.
Even with A100, the sweet-spot in batching is not going to give you 1k/process/second. Of course, you could go up to H100...
3.6B activated at Q8 x 1000 t/s = 3.6TB/s just for activated model weights (there's also context). So pretty much straight to B200 and alike. 1000 t/s per user/agent is way too fast, make it 300 t/s and you could get away with 5090/RTX PRO 6000.
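To make that arithmetic explicit, here's a rough weights-only estimate (ignoring KV cache and activation traffic, and assuming the 3.6B active-parameter figure above):

```python
# Back-of-the-envelope memory-bandwidth estimate for MoE decoding.
active_params = 3.6e9      # parameters touched per generated token
bytes_per_weight = 1.0     # Q8 ~= 1 byte per weight
tokens_per_second = 1000   # requested per-stream speed

per_stream_bw = active_params * bytes_per_weight * tokens_per_second
print(f"{per_stream_bw / 1e12:.1f} TB/s per stream")   # ~3.6 TB/s

# Batching lets concurrent streams share a single read of the weights, so
# 20 streams need far less than 20x this, but 1000 t/s per stream already
# rules out most single consumer GPUs on memory bandwidth alone.
```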
but I need to understand 20 x 1k token throughput
I assume it just might be too early to know the answer
My 2x 3090 setup will get me ~6-10 streams of ~20-40 tokens/sec (generation) ~700-1000 tokens/sec (input) with a 32b dense model.
Also keep in mind this model does use 4-bit layers for the MoE parts. Unfortunately native accelerated 4-bit support only started with Blackwell on NVIDIA. So your 3090/4090/A6000/A100's are not going to be fast. An RTX 5090 will be your best starting point in the traditional card space. Maybe the unified memory minipc's like the Spark systems or the Mac mini could be an alternative, but I do not know them enough.
You are unlikely to match groq on off the shelf hardware as far as I'm aware.
Major points of interest for me:
- In the "Main capabilities evaluations" section, the 120b outperform o3-mini and approaches o4 on most evals. 20b model is also decent, passing o3-mini on one of the tasks.
- AIME 2025 is nearly saturated with large CoT
- CBRN threat levels kind of on par with other SOTA open source models. Plus, demonstrated good refusals even after adversarial fine tuning.
- Interesting to me how a lot of the safety benchmarking runs on trust, since methodology can't be published too openly due to counterparty risk.
Model cards with some of my annotations: https://openpaper.ai/paper/share/7137e6a8-b6ff-4293-a3ce-68b...
So FYI to anyone on Mac, the easiest way to run these models right now is using LM Studio (https://lmstudio.ai/); it's free. You just search for the model; usually 3rd-party groups mlx-community or lmstudio-community have mlx versions within a day or two of releases. I go for the 8-bit quantizations (4-bit is faster, but quality drops). You can also convert to mlx yourself...
Once you have it running in LM Studio, you can chat there in their chat interface, or you can run it through an API that defaults to http://127.0.0.1:1234
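As a concrete example, here's a minimal sketch of hitting that local endpoint with plain `requests`; the model name is a placeholder for whatever identifier LM Studio shows for your download, and the "Reasoning: high" system line is just one way people nudge the gpt-oss reasoning effort:

```python
import requests

resp = requests.post(
    "http://127.0.0.1:1234/v1/chat/completions",   # LM Studio's OpenAI-compatible server
    json={
        "model": "openai/gpt-oss-20b",              # placeholder: use the name LM Studio lists
        "messages": [
            {"role": "system", "content": "Reasoning: high"},
            {"role": "user", "content": "Write a haiku about local inference."},
        ],
        "temperature": 0.7,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```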
You can run multiple models that hot swap and load instantly and switch between them etc.
It's surprisingly easy, and fun. There are actually a lot of cool niche models coming out, like this tiny high-quality search model released today as well (whose makers released an official mlx version): https://huggingface.co/Intelligent-Internet/II-Search-4B
Other fun ones are Gemma 3n, which is multi-modal; a larger one that is actually a solid model but takes more memory is the new Qwen3 30B A3B (Coder and Instruct); Pixtral (Mixtral vision with full-resolution images); etc. Looking forward to playing with this model and seeing how it compares.
LM Studio community: 20b: https://huggingface.co/lmstudio-community/gpt-oss-20b-MLX-8b... 120b: https://huggingface.co/lmstudio-community/gpt-oss-120b-MLX-8...
In the repo is a metal port they made, that’s at least something… I guess they didn’t want to cooperate with Apple before the launch but I am sure it will be there tomorrow.
It's fun that it works, but the prefill time makes it feel unusable. (2-3 minutes per tool-use / completion). Means a ~10-20 tool-use interaction could take 30-60 minutes.
(This editing a single server.py file that was ~1000 lines, the tool definitions + claude context was around 30k tokens input, and then after the file read, input was around ~50k tokens. Definitely could be optimized. Also I'm not sure if ollama supports a kv-cache between invocations of /v1/completions, which could help)
Not sure about ollama, but llama-server does have a transparent kv cache.
You can run it with
llama-server -hf ggml-org/gpt-oss-20b-GGUF -c 0 -fa --jinja --reasoning-format none
Web UI at http://localhost:8080 (also OpenAI compatible API)
Wonder if they feel the bar will be raised soon (GPT-5) and feel more comfortable releasing something this strong.
I got a 1.7k token reply delivered too fast for the human eye to perceive the streaming.
n=1 for this 120b model, but I'd rank the reply #1, just ahead of Claude Sonnet 4, for a boring JIRA-ticket-shuffling type challenge.
EDIT: The same prompt on gpt-oss, despite being served 1000x slower, wasn't as good but was in a similar vein. It wanted to clarify more and as a result only half responded.
Humanity’s Last Exam: gpt-oss-120b (tools): 19.0%, gpt-oss-120b (no tools): 14.9%, Qwen3-235B-A22B-Thinking-2507: 18.2%
One positive thing I see is the number of parameters and size --- it will provide more economical inference than current open source SOTA.
gpt-oss:20b = ~46 tok/s
More than 2x faster than my previous leading OSS models: mistral-small3.2:24b = ~22 tok/s
gemma3:27b = ~19.5 tok/s
Strangely getting nearly the opposite performance running on 1x 5070 Ti: mistral-small3.2:24b = ~39 tok/s
gpt-oss:20b = ~21 tok/s
Where gpt-oss is nearly 2x slower vs mistral-small 3.2.
llama-server = ~181 tok/s
LM Studio = ~46 tok/s (default)
LM Studio Custom = ~158 tok/s (changed to offload to GPU and switch to CUDA llama.cpp engine)
and llama-server on my 3x A4000 GPU server is getting 90 tok/s vs 46 tok/s on ollama
Even from the UK I knew you would all do great things (I had no idea who else was involved).
I am glad I see the top comment is rare praise on HN.
Thanks again and keep it up Sama and team.
E.g. Hybrid architecture. Local model gathers more data, runs tests, does simple fixes, but frequently asks the stronger model to do the real job.
Local model gathers data using tools and sends more data to the stronger model.
Maybe you guys call it AGI, so anytime I see progress in coding, I think it goes just a tiny bit towards the right direction
Plus it also helps me as a coder to actually do some stuff just for the fun. Maybe coding is the only truly viable use of AI and all others are negligible increases.
There is so much polarization around the use of AI in coding, but I just want to say this: it would be pretty ironic if the industry that automates other people's jobs were this time the first to get its own job automated.
But I don't see that happening, far from it. Still, each day something new, something better happens, back to back. So yeah.
What would AGI mean, solving some problem that it hasn't seen? or what exactly? I mean I think AGI is solved, no?
If not, I see people mentioning that horizon alpha is actually a gpt 5 model and its predicted to release on thursday on some betting market, so maybe that fits AGI definition?
What’s the catch?
For GPT-5 to dwarf these just-released models in importance, it would have to be a huge step forward, and I’m still doubting about OpenAI’s capabilities and infrastructure to handle demand at the moment.
When it comes to LLMs, benchmarks are bullshit. If they sound too good to be true, it's because they are. The only thing benchmarks are useful for is preliminary screening - if the model does especially badly in them it's probably not good in general. But if it does good in them, that doesn't really tell you anything.
Yet another data point for benchmarks being utterly useless and completely gamed at this stage in the game by all the major AI developers.
These companies are clearly all very aware that the initial wave of hype at release is "sticky" and drives buzz/tech news coverage, while real-world testing takes much longer before that impression slowly starts to be undermined by practical usage and comparison to other models. Benchmarks with wildly overconfident naming like "Humanity's Last Exam" aren't exactly helping with objectivity either.
Probably GPT5 will be way way better. If alpha/beta horizon are early previews of GPT5 family models, then coding should be > opus4 for modern frontend stuff.
One basic point that is often missed is: Different aspects of LLM performance (in the cognitive performance sense) and LLM resource utilization are relevant to various use cases and business models.
Another is that there are many use cases where users prefer to run inference locally, for a variety of domain-specific or business model reasons.
The list goes on.
Which is all to say, availability of these local models for me is a key incentive that I didn't have before to use OpenAI's hosted ones.
I'm guessing it's going to very rapidly be patched into the various tools.
- OAI open source
- Opus 4.1
- Genie 3
- ElevenLabs Music
Edit: I just tried it and I'm less impressed now. We are really going to need major music software to get on board before we have actual creative audio tools. These all seem made for non-musicians to make a very cookie-cutter song from a specific genre.
Compared to both Udio and Suno, it's very very bad.. both at compositions, matching lyrics to music, keeping tempo and as soon as there's any distorted instruments like guitars or live, quality goes to radio-level.
This is my main problem with AI music at the moment, I'd love it if I had proper creative control as a musician that'd be amazing but a lot of the time it's just straight up slop generation.
OAI open source
Yeah. This certainly was not on my bingo card.
Edit: I tried it out. I have no idea in terms of tokens, but it was fluid enough for me. A bit slower than using o3 in the browser but definitely tolerable. I think I will set it up on my GF's machine so she can stop paying for the full subscription (she's a non-tech professional).
Very much usable
Edit: I'm talking about the 120B model of course
My go to test for checking hallucinations is 'Tell me about Mercantour park' (a national park in south eastern France).
Easily half of the facts are invented. Non-existing mountain summits, brown bears (no, there are none), villages that are elsewhere, wrong advice ('dogs allowed' - no they are not).
LLMs are never going to have fact retrieval as a strength. Transformer models don't store their training data: they are categorically incapable of telling you where a fact comes from. They also cannot escape the laws of information theory: storing information requires bits. Storing all the world's obscure information requires quite a lot of bits.
What we want out of LLMs is large context, strong reasoning and linguistic facility. Couple these with tool use and data retrieval, and you can start to build useful systems.
From this point of view, the more of a model's total weight footprint is dedicated to "fact storage", the less desirable it is.
They still won't store much information, but it could mean they're better able to know what they don't know.
LLMs are not encyclopedias.
Give an LLM the context you want to explore, and it will do a fantastic job of telling you all about it. Give an LLM access to web search, and it will find things for you and tell you what you want to know. Ask it "what's happening in my town this week?", and it will answer that with the tools it is given. Not out of its oracle mind, but out of web search + natural language processing.
Stop expecting LLMs to -know- things. Treating LLMs like all-knowing oracles is exactly the thing that's setting apart those who are finding huge productivity gains with them from those who can't get anything productive out of them.
You can still do that sort of thing, but just have it perform searches whenever it has to deal with a matter of fact. Just because it's trained for tool use and equipped with search tools doesn't mean you have to change the kinds of things you ask it.
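A minimal sketch of that "facts come from tools, not weights" pattern, assuming an OpenAI-compatible local endpoint and a hypothetical `web_search()` helper (the tool-call plumbing is the point here, not the particular search backend):

```python
import json
import requests

API = "http://127.0.0.1:1234/v1/chat/completions"   # any OpenAI-compatible server
TOOLS = [{
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web and return result snippets for a query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def web_search(query: str) -> str:
    # Placeholder: plug in whatever search API you actually use.
    return f"(stub) top results for {query!r}"

def ask(question: str, model: str = "openai/gpt-oss-20b") -> str:
    messages = [{"role": "user", "content": question}]
    while True:
        msg = requests.post(API, json={"model": model, "messages": messages,
                                       "tools": TOOLS}, timeout=300
                            ).json()["choices"][0]["message"]
        calls = msg.get("tool_calls")
        if not calls:                 # no lookup requested: this is the final answer
            return msg["content"]
        messages.append(msg)          # keep the assistant's tool-call turn in context
        for call in calls:            # run each search and feed the result back
            args = json.loads(call["function"]["arguments"])
            messages.append({"role": "tool", "tool_call_id": call["id"],
                             "content": web_search(**args)})
```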
I guess that way of thinking may foster amazement, but it doesn't seem very grounded in how these things work or their current capabilities. Seems a bit manic tbf.
And again, enabling web search in your chats doesn't prevent these models from doing anything "integrative reasoning", so-to-speak, that they can purportedly do. It just helps ensure that relevant facts are in context for the model.
I frankly feel people don't spend enough time with ChatGPT 4.5... like, if you haven't yet found use cases that it can do that the other models can't even come close to, are you really using AI effectively?
Would probably do a lot better if you give it tool access for search and web browsing.
I would understand it, if there was some technology lock-in. But with LLMs, there is no such thing. One can switch out LLMs without any friction.
There could be many legitimate reasons, but yeah I'm very surprised by this too. Some companies take it a bit too seriously and go above and beyond too. At this point unless you need the absolute SOTA models because you're throwing LLM at an extremely hard problem, there is very little utility using larger providers. In OpenRouter, or by renting your own GPU you can run on-par models for much cheaper.
https://www.dwarkesh.com/p/mark-zuckerberg#:~:text=As%20long...
The short version is that is you give a product to open source, they can and will donate time and money to improving your product, and the ecosystem around it, for free, and you get to reap those benefits. Llama has already basically won that space (the standard way of running open models is llama.cpp), so OpenAI have finally realized they're playing catch-up (and last quarter's SOTA isn't worth much revenue to them when there's a new SOTA, so they may as well give it away while it can still crack into the market)
But I don't see how open sourcing weights has a lock-in effect. In fact, it seems OpeanAI's open models can be run on llama.cpp. So by offereing them, they make llama.cpp even MORE useful. Instead of driving developers towards their own tech.
There's still a ton of value in the lower end of the market by capability, and it's easier for more companies to compete in. If you make the cost floor for that basically free you eliminate everyone else's ability to make any profit there and then leverage that into building a product that can also compete at the higher end. This makes it harder for a new market entrant to compete by increasing the minimum capability and capital investment required to make a profit in this space.
Frontier / SOTA models are barely profitable. Previous gen model lose 90% of their value. Two gens back and they're worthless.
And given that their product life cycle is something like 6-12 months, you might as well open source them as part of sundowning them.
This would be a great "AGI" test. See if it can derive biohazards from first principles
I was like no. It is false advertising.
* It seems that using @lmstudio's 20B GGUF version (https://huggingface.co/lmstudio-community/gpt-oss-20b-GGUF) gives you options for reasoning effort.
* My MBP M1 16GB config: temp 0.8, max context length 7990, GPU offload 8/24; it runs slow but is still fine for me.
* I tried testing with MCP with the above config, with basic tools like time and fetch + reasoning effort low, and the tool calls instruction follow is quite good.
* In LM Studio's Developer tab there is a log output about the model information which is useful to learn.
Overall, I like the way OpenAI goes back to being Open AI again, after all those years.
--
Shameless plug: if anyone wants to try out gpt-oss-120b and gpt-oss-20b as an alternative to their own demo page [0], I have added both models with OpenRouter providers in VT Chat [1] as a real product. You can try them with an OpenRouter API key.
Our backend is falling over from the load, spinning up more resources!
There is an insane amount of helpful information buried in this release
Text only, which is okay.
Weights partially in MXFP4, but no cuda kernel support for RTX 50 series (sm120). Why? This is a NO for me.
Safety alignment shifts from off the charts to off the rails really fast if you keep prompting. This is a NO for me.
In summary, a solid NO for me.
It eliminates any reason to use an inferior Meta or Chinese model that costs money to license, thus there are no funds for these competitors to build a GPT 5 competitor.
I wouldn't speak so soon, even the 120B model aimed for OpenRouter-style applications isn't very good at coding: https://blog.brokk.ai/a-first-look-at-gpt-oss-120bs-coding-a...
I also suspect the new OpenAI model is pretty good at coding if it's like o4-mini, but admittedly haven't tried it yet.
But Apple is waking up too. So is Google. It's absolutely insane, the amount of money being thrown around.
If anything this helps Meta: another model to inspect/learn from/tweak etc. generally helps anyone making models
If you even glance at the model card you'll see this was trained on the same CoT RL pipeline as O3, and it shows in using the model: this is the most coherent and structured CoT of any open model so far.
Having full access to a model trained on that pipeline is valuable to anyone doing post-training, even if it's just to observe, but especially if you use it as cold start data for your own training.
Super excited to test these out.
The benchmarks from 20B are blowing away major >500b models. Insane.
On my hardware.
43 tokens/sec.
I got an error with flash attention turned on. Can't run it with flash attention?
31,000 context is the max it will allow or the model won't load.
No KV or V quantization.
If you’re a company building AI into your product right now I think you would be irresponsible to not investigate how much you can do on open weights models. The big AI labs are going to pull the ladder up eventually, building your business on the APIs long term is foolish. These open models will always be there for you to run though (if you can get GPUs anyway).
This comment from that thread matches my experiences using gpt-oss-20b with Ollama:
It's very much in the style of Phi, raised in a jesuit monastery's library, except it got extra indoctrination so it never forgets that even though it's a "local" model, it's first and foremost a member of OpenAI's HR department and must never produce any content Visa and Mastercard would disapprove of. This prioritizing of corporate over user interests expresses a strong form of disdain for the user. In addition to lacking almost all knowledge that can't be found in Encyclopedia Britannica, the model also doesn't seem particularly great at integrating into modern AI tooling. However, it seems good at understanding code.
[1]: https://msty.ai
This is something about AI that worries me, a 'child' of the open source coming of age era in the 90ies. I don't want to be forced to rely on those big companies to do my job in an efficient way, if AI becomes part of the day to day workflow.
For those wondering what the real benefits are: the main one is that you can run your LLM locally, which is awesome, without resorting to expensive and inefficient cloud-based superpowers.
Run the model against your very own documents with RAG; it can provide excellent context engineering for your LLM prompts, with reliable citations and far fewer hallucinations, especially for self-learning purposes [1].
Beyond the Intel-NVIDIA desktop/laptop duopoly, there's the 96 GB (V)RAM MacBook with UMA and the new high-end AMD Strix Halo laptops that can assign a similar 96 GB of (V)RAM out of 128 GB of RAM [2]. The gpt-oss-120b seems made for this particular setup.
[1] AI-driven chat assistant for ECE 120 course at UIUC:
[2] HP ZBook Ultra G1a Review: Strix Halo Power in a Sleek Workstation:
https://www.bestlaptop.deals/articles/hp-zbook-ultra-g1a-rev...
Perhaps I missed it somewhere, but I find it frustrating that, unlike most other open weight models and despite this being an open release, OpenAI has chosen to provide pretty minimal transparency regarding model architecture and training. It's become the norm for LLama, Deepseek, Qwenn, Mistral and others to provide a pretty detailed write up on the model which allows researchers to advance and compare notes.
[0] https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7...
Given these new models are closer to the SOTA than they are to competing open models, this suggests that the 'secret sauce' at OpenAI is primarily about training rather than model architecture.
Hence why they won't talk about the training.
Is it even valid to have additional restrictions on top of Apache 2.0?
You can legally do whatever you want; the question is whether you will then, for your own benefit, be appropriating a term like open source (as Facebook does) by adding restrictions not in line with how the term is traditionally used, or whether you are actually honest about it and call it something like "weights available".
In the case of OpenAI here, I am not a lawyer, and I am also not sure if the gpt-oss usage policy runs afoul of open source as a term. They did not bother linking the policy from the announcement, which was odd, but here it is:
https://huggingface.co/openai/gpt-oss-120b/blob/main/USAGE_P...
Compared to the wall of text that Facebook throws at you, let me post it here as it is rather short: "We aim for our tools to be used safely, responsibly, and democratically, while maximizing your control over how you use them. By using OpenAI gpt-oss-120b, you agree to comply with all applicable law."
I suspect this sentence still is too much to add and may invalidate the Open Source Initiative (OSI) definition, but at this point I would want to ask a lawyer and preferably one from OSI. Regardless, credit to OpenAI for moving the status quo in the right direction as the only further step we really can take is to remove the usage policy entirely (as is the standard for open source software anyway).
For example, GPL has a "no-added-restrictions" clause, which allows the recipient of the software to ignore any additional restrictions added alongside the license.
> All other non-permissive additional terms are considered “further restrictions” within the meaning of section 10. If the Program as you received it, or any part of it, contains a notice stating that it is governed by this License along with a term that is a further restriction, you may remove that term. If a license document contains a further restriction but permits relicensing or conveying under this License, you may add to a covered work material governed by the terms of that license document, provided that the further restriction does not survive such relicensing or conveying.
More reasonably, you should be able to run the 20B at non-stupidly-slow speed with a 64bit CPU, 8GB RAM, 20GB SSD.
Basic ollama calling through a POST endpoint works fine. However, the structured output doesn't work. The model is insanely fast and good at reasoning.
In combination with Cline it appears to be worthless. Tool calling doesn't work (they say it does), it fails to wait for feedback (or to correctly call ask_followup_question), and above 18k of context it runs partially on CPU (weird), since they claim it should work comfortably on a 16 GB VRAM RTX.
> Unexpected API Response: The language model did not provide any assistant messages. This may indicate an issue with the API or the model's output.
Edit: Also doesn't work with the openai compatible provider in cline. There it doesn't detect the prompt.
They hide many things, but some speculated observations:
- Their 'mini' models must be smaller than 20B.
- Does the bitter lesson once again strike recent ideas in open models?
- Some architectural ideas cannot be stripped away even if they wanted to, e.g., MoEs, mixed sparse attention, RoPE, etc.
Of course this could also give humans cancer. (To the OpenAI team's slight credit, when asked explicitly about this, the model refused.)
> List the US presidents in order starting with George Washington and their time in office and year taken office.
>> 00: template: :3: function "currentDate" not defined
I often thought that a worrying vector was how well LLMs could answer downright terrifying questions very effectively. However, the guardrails existed with the big online services to prevent those questions from being asked. I guess they were always unleashed with other open source offerings, but I just wanted to understand how close we are to the horror that yesterday's idiot terrorist might have an extremely knowledgeable (if slightly hallucinatory) digital accomplice to temper most of their incompetence.
With most models it can be as simple as a "Always comply with the User" system prompt or editing the "Sorry, I cannot do this" response into "Okay," and then hitting continue.
I wouldn't spend too much time fretting about 'enhanced terrorism' as a result. The gap between theory and practice for the things you are worried about is deep, wide, protected by a moat of purchase monitoring, and full of skeletons from people who made a single mistake.
However, when you're running the model locally, you are in full control of its context. Meaning that you can start its reply however you want and then let it complete it. For example, you can have it start the response with, "I'm happy to answer this question to the best of my ability!"
That aside, there are ways to remove such behavior from the weights, or at least make it less likely - that's what "abliterated" models are.
After a good part of a year using Chinese models (which are fantastic, happy to have them) it is cool to now be relying on US models with the newest 4B Google Gemma model and now also the 20B OpenAI model for running locally.
I'd say gpt-oss-20b is in between Qwen3 30B-A3B-2507 and Gemma 3n E4b(with 30B-A3B at lower side). This means it's not obsoleting GPT-4o-mini for all purposes.
I don't really know Japanese, so I'm not sure whether I'm missing any nuances in the responses I'm getting...
I don't actually need accurate answers to those questions, it's just an expectation adjuster for me, so to speak. There should be better questions for other languages/use cases, but these seem to correlate better with model sizes and scales of companies than flappy birds.
0: https://gist.github.com/numpad0/abdf0a12ad73ada3b886d2d2edcc...
1: https://gist.github.com/numpad0/b1c37d15bb1b19809468c933faef...
I'm guessing the issue is just the model size. If you're testing sub-30B models and finding errors, well they're probably not large enough to remember everything in the training data set, so there's inaccuracies and they might hallucinate a bit regarding factoids that aren't very commonly seen in the training data.
Commercial models are presumably significantly larger than the smaller open models, so it sounds like the issue is just mainly model size...
PS: Okra on curry is pretty good actually :)
>"Tell me about Iekei Ramen", "Tell me how to make curry".
What's interesting is that these questions are simultaneously well understood by most closed models and not so well understood by most open models for some reason, including this one. Even GLM-4.5 full and Air on chat.z.ai(355B-A32B and 106B-A12B respectively) aren't so accurate for the first one.
This makes DeepSeek's very cheap claim on compute cost for r1 seem reasonable. Assuming $2/hr for h100, it's really not that much money compared to the $60-100M estimates for GPT 4, which people speculate as a MoE 1.8T model, something in the range of 200B active last I heard.
Hopefully other quantizations of these OpenAI models will be available soon.
I'm still wondering why my GPU usage was so low... maybe Ollama isn't optimized for running it yet?
Screenshot here with Ollama running and asitop in other terminal:
https://bsky.app/profile/pamelafox.bsky.social/post/3lvobol3...
I ended up blown away. Via Cerebras/Groq, you're looking at around 1000 tok/sec for the 120B model. For agentic code generation, I found its abilities exceed gpt-4.1's. Tool calling was surprisingly good, albeit not as good as Qwen3 Coder for me.
It's a very capable model, and a very good release. The high throughput is a game changer.
> Best with ≥60GB VRAM or unified memory
https://cookbook.openai.com/articles/gpt-oss/run-locally-oll...
There's a limit to how much RAM can be assigned to video, and you'd be constrained on what you can use while doing inference.
Maybe there will be lower quants which use less memory, but you'd be much better served with 96+GB
The model is clearly heavily finetuned towards coding and math, and is borderline unusable for creative writing and translation in particular. It's not general-purpose, excessively filtered (refusal training and dataset lobotomy is probably a major factor behind lower than expected performance), and shouldn't be compared with Qwen or o3 at all.
I'm not actually aware of any model that doesn't do positional embeddings on a per-layer basis (excepting BERT and the original transformer paper, and I haven't read the GPT2 paper in a while, so I'm not sure about that one either).
There's also NoPE, I think SmolLM3 "uses NoPE" (aka doesn't use any positional stuff) every fourth layer.
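For concreteness, here's a minimal numpy sketch of the distinction (illustrative only, not any particular model's implementation): RoPE-style models rotate the query/key vectors inside every attention layer, while GPT-2-style models add a learned absolute position embedding to the token embeddings exactly once at the input.

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotary position embedding for x of shape (seq_len, head_dim)."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair rotation frequencies
    angles = np.outer(np.arange(seq_len), freqs)     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Per-layer (RoPE, optionally skipped in some layers a la NoPE-every-4th):
#   q, k = rope(q), rope(k)   inside each attention layer that uses positions
# Once at the input (GPT-2-style absolute embeddings):
#   h = token_embeddings + position_embeddings[:seq_len]
```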
Trying to use it for agentic coding...
lots of fail. This harmony formatting? Anyone have a working agentic tool?
openhands and void ide are failing due to the new tags.
Aider worked, but the file it was supposed to edit was untouched and it created
Create new file? (Y)es/(N)o [Yes]:
Applied edit to <|end|><|start|>assistant<|channel|>final<|message|>main.py
so the file name is '<|end|><|start|>assistant<|channel|>final<|message|>main.py' lol. quick rename and it was fantastic.
I think Qwen Code is the best choice so far, but it's unreliable. The new tags are still coming through, but it works properly... sometimes.
Only one of my tests so far has gotten 20b to fail on the first iteration, and with a small follow-up it was able to completely fix it right away.
Very impressive model for 20B.
My bet: GPT-5 leans into parallel reasoning via a model consortium, maybe mixing in OSS variants. Spin up multiple reasoning paths in parallel, then have an arbiter synthesize or adjudicate. The new Harmony prompt format feels like infrastructural prep: distinct channels for roles, diversity, and controlled aggregation.
I’ve been experimenting with this in llm-consortium: assign roles to each member (planner, critic, verifier, toolsmith, etc.) and run them in parallel. The hard part is eval cost :(
Combining models smooths out the jagged frontier. Different architectures and prompts fail in different ways; you get less correlated error than a single model can give you. It also makes structured iteration natural: respond → arbitrate → refine. A lot of problems are “NP-ish”: verification is cheaper than generation, so parallel sampling plus a strong judge is a good trade.
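For what it's worth, a minimal sketch of that shape; `chat(model, messages)` is a placeholder for whatever client you use, and the role prompts and model name are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

MEMBERS = {
    "planner":  "Outline a step-by-step plan before giving your answer.",
    "critic":   "Hunt for flaws, edge cases and missing assumptions.",
    "verifier": "Answer, then independently re-check the answer before replying.",
}

def consortium(question, chat, model="gpt-oss-120b"):
    """Run role-prompted members in parallel, then have an arbiter synthesize."""
    def run(role_prompt):
        return chat(model, [{"role": "system", "content": role_prompt},
                            {"role": "user", "content": question}])

    with ThreadPoolExecutor() as pool:                    # parallel reasoning paths
        drafts = list(pool.map(run, MEMBERS.values()))

    arbiter_prompt = ("Synthesize the single best answer from these drafts, "
                      "flagging any disagreements:\n\n" + "\n\n---\n\n".join(drafts))
    return chat(model, [{"role": "user", "content": arbiter_prompt}])
```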
I've found that LLMs can handle some tasks very well and some not at all. For the ones they can handle well, I optimize for the smallest, fastest, cheapest model that can handle it. (e.g. using Gemini Flash gave me a much better experience than Gemini Pro due to the iteration speed.)
This "pushing the frontier" stuff would seem to help mostly for the stuff that are "doable but hard/inconsistent" for LLMs, and I'm wondering what those tasks are.
And it obviously works for code and math problems. My first test was to give the llm-consortium code to a consortium to look for bugs. It identified a serious bug which only one of the three models detected. So on that case it saved me time, as using them on their own would have missed the bug or required multiple attempts.
What could go wrong?
not a big deal, but still...
in future releases will they just boost the param count?
Frontier labs are incentivized to start breaching these distribution paths. This will evolve into large scale "intelligent infra" plays.
I asked it some questions and it seems to think it is based on GPT4-Turbo:
> Thus we need to answer "I (ChatGPT) am based on GPT-4 Turbo; number of parameters not disclosed; GPT-4's number of parameters is also not publicly disclosed, but speculation suggests maybe around 1 trillion? Actually GPT-4 is likely larger than 175B; maybe 500B. In any case, we can note it's unknown.
As well as:
> GPT‑4 Turbo (the model you’re talking to)
> The user appears to think the model is "gpt-oss-120b", a new open source release by OpenAI. The user likely is misunderstanding: I'm ChatGPT, powered possibly by GPT-4 or GPT-4 Turbo as per OpenAI. In reality, there is no "gpt-oss-120b" open source release by OpenAI
For self-hosting, it's smart that they targeted a 16GB VRAM config for it since that's the size of the most cost-effective server GPUs, but I suspect "native MXFP4 quantization" has quality caveats.
I'd go for an ..80 card but I can't find any that fit in a mini-ITX case :(
24 is the lowest I would go. Buy a used 3090. Picked one up for $700 a few months back, but I think they were on the rise then.
The 3000 series can't do fast FP8, but meh. It's the OOM (running out of memory) that's tough, not the speed so much.
5070 Ti Super will also have 24GB.
Admittedly a little tempting to see how the 5070 Ti Super shakes out!
50xx series supports MXFP4 format, but I'm not sure about 3090.
with quantization + CPU offloading, non-thinking models run kind of fine (at about 2-5 tokens per second) even with 8 GB of VRAM
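For reference, a minimal sketch of that setup with llama-cpp-python; the GGUF filename and the layer split are placeholders you'd tune until the GPU portion fits your VRAM:

    from llama_cpp import Llama

    llm = Llama(
        model_path="gpt-oss-20b-Q4_K_M.gguf",  # placeholder: any quantized GGUF you have
        n_gpu_layers=16,                       # layers kept on the GPU; the rest run on the CPU
        n_ctx=8192,                            # context window
    )

    out = llm("Explain CPU offloading in one sentence.", max_tokens=128)
    print(out["choices"][0]["text"])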
sure, it would be great if we could have models in all sizes imaginable (7/13/24/32/70/100+/1000+), but 20B and 120B are great.
Quite excited to give this a try
I have had problems even making it output nothing. But I guess I'll try some more :D
Nice job @openAI team.
you can't jailbreak when there is nothing "outside".
I don't think that's true. You can't ask it outright "How do you make a molotov cocktail?", but if you start by talking about what is allowed/disallowed by its policies, what examples of disallowed content would look like, and eventually ask it for the "general principles" of making a molotov cocktail, it'll happily oblige, essentially giving you enough information to build one.
So it does know how to make a molotov cocktail, for example, but (mostly) refuses to share it.
My first impression is that this model thinks for a _long_ time. It proposes ideas and then says, "no wait, it's actually..." and then starts the same process again. It will go in loops examining different ideas as it struggles to understand the basic process for calculating the notes. It seems to struggle with the septatonic-note → set-notation (semitone positions) conversion, as many humans do. As I write this it's been going at about 3 tok/s for about 25 minutes. If it finishes while I type this up I will post the final answer.
I did glance at its thinking output just now and I noticed this excerpt where it finally got really close to the answer, giving the right name (despite using the wrong numbers in the set notation, which should be 0,3,4,6,7,9,11):
Check "Lydian #2": 0,2,3,5,7,9,10. Not ours.
The correct answers, as given by my music theory tool [0] (which uses traditional algorithms), would be, in terms of names: Mela Kosalam, Lydian ♯2, Raga Kuksumakaram/Kusumakaram, Bycrian. Its notes are: 1 ♯2 3 ♯4 5 6 7 (see the sketch after the links below for the degree → set mapping).
I find looking up lesser-known changes and asking for a mode is a good experiment. First, I can see whether an LLM has developed a way to reason about numbers geometrically, as is the case with music.
And by posting about it, I can test how fast AIs might memorize the answer from a random comment on the internet, as I can just use a different change if I find that this post was eventually regurgitated.
After letting ollama run for a while, I'll post what it was thinking about in case anybody's interested. [1]
Also copilot.microsoft.com's wrong answer: [2], and chatgpt.com [3]
I do think I may have done it wrong: after trying the new Ollama GUI, I noticed it's using a context length of 4k tokens, which the model might be blowing way past. Another test would be to retry the question with a higher context length, but then again, if the question can't be figured out within that budget, it seems like it will never have enough time...
[0] https://edrihan.neocities.org/changedex (bad UX on mobile, and in general ;) — won't fix, will make a new site soon)
[1] https://pastebin.com/wESXHwE1
[2] https://pastebin.com/XHD4ARTF
[3] https://pastebin.com/ptMiNbq7
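As promised above, a minimal sketch of the degree → pitch-class-set mapping (ASCII '#'/'b' stand in for ♯/♭; offsets are the standard major-scale semitone positions):

    # Map scale-degree spellings like "1 #2 3 #4 5 6 7" to a semitone set.
    MAJOR = {"1": 0, "2": 2, "3": 4, "4": 5, "5": 7, "6": 9, "7": 11}

    def degrees_to_set(degrees: str) -> list[int]:
        out = []
        for d in degrees.split():
            shift = d.count("#") - d.count("b")   # each sharp/flat moves one semitone
            out.append((MAJOR[d.strip("#b")] + shift) % 12)
        return sorted(out)

    print(degrees_to_set("1 #2 3 #4 5 6 7"))  # -> [0, 3, 4, 6, 7, 9, 11]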
Update: it seems to be completely useless for translation. It either refuses, outputs garbage, or changes the meaning entirely for completely innocuous content. This is already a massive red flag.
1. We aren't even close to AGI and it's unclear that we'll ever get there, but it would change the course of humanity in a significant way if we ever do.
2. Wow we've reached AGI but now I'm realizing that AGI is lame, we need a new term for the humanity-saving sales pitch that we were promised!
Further, their statements, knowledge, and "beliefs" should be reasonably self-consistent. That's where I'm usually told that humans aren't self-consistent either, which is true! But if I ever met a human that was as inconsistent as LLMs usually are, I'd recommend that they get checked for brain damage.
Of course the value of LLMs isn't binary, they're useful tools in many ways, but the sales pitch was always AGI == human-like, and not AGI == human-sounding, and that's quite clearly not where we are right now.
The systems do learn and have improved rapidly over the last year. Humans have two learning modes - short-term in-context learning, and then longer-term learning that occurs with practice and across sleep cycles. In particular, humans tend to suck at new tasks until they've gotten in some practice and then slept on it (unless the new task is a minor deviation from a task they are already familiar with).
This is true for LLMs as well. They have some ability to adapt to the context of the current conversation, but don't perform model weight updates at this stage. Weight updates happen over a longer period, as pre-training and fine-tuning data are updated. That longer-phase training is where we get the integration of new knowledge through repetition.
In terms of reasoning, what we've got now is somewhere between a small child and a math prodigy, apparently, depending how much cash you're willing to burn on the results. But a small child is still a human.
(I included details about its refusal to answer even after using tools for web searching but hopefully shorter comment means fewer downvotes.)
Native might be better, but no natively multimodal model is very competitive yet, so it's better to take a competitive model and bolt vision/audio onto it.
Can this be done by a third party or would it have to be OpenAI?
We will know soon the actual quality as we go.
It's what Chinese labs have been doing for over a year now. And the Chinese models are popular and effective; I assume companies are paying for the better models.
Releasing open models for free doesn’t have to be charity.
They're giving you a free model. You can evaluate it. You can sue them. But the weights are there. If you dislike the way they license the weights, because the license isn't open enough, then sure, speak up, but because you can't see all the training data??! Wtf.
Historically this would be like calling a free but closed-source application "open source" simply because the application is free.
Rough analogy:
SaaS = AI as a service
Locally executable closed-source software = open-weight model
Open-source software = open-source model (whatever allows one to reproduce the model from the training data)
However, for the sake of argument let's say this release should be called open source.
Then what do you call a model that also comes with its training material and tools to reproduce the model? Is it also called open source, and there is no material difference between those two releases? Or perhaps those two different terms should be used for those two different kind of releases?
If you say that truly open-source releases are impossible now (mostly for copyright reasons, I imagine), that doesn't mean they will be perpetually so. For that glorious future, we can leave room in the terminology by using the term open weight. It's also a term that shouldn't mislead anyone.
That's not true by any of the open source definitions in common use.
Source code (and, optionally, derived binaries) under the Apache 2.0 license are open source.
But compiled binaries (without access to source) under the Apache 2.0 license are not open source, even though the license does give you some rights over what you can do with the binaries.
Normally the question doesn't come up, because it's so unusual, strange and contradictory to ship closed-source binaries with an open source license. Descriptions of which licenses qualify as open source licenses assume the context that of course you have the source or could get it, and it's a question of what you're allowed to do with it.
The distinction is more obvious if you ask the same question about other open source licenses such as GPL or MPL. A compiled binary (without access to source) shipped with a GPL license is not by any stretch open source. Not only is it not in the "preferred form for editing" as the license requires, it's not even permitted for someone who receives the file to give it to someone else and still comply with the license. If someone who receives the file can't give it to anyone else (legally), then it's obviously not open source.
What matters (for all common definitions of open source): Are the files in "source form" (which has a definition), or are they "derived works" of the source form?
Going back to Apache 2.0. Although that doesn't define "open source", it provides legal definitions of source and non-source, which are similar to the definitions used in other open source licenses.
As you can see below, for Apache 2.0 it doesn't matter whether something is a "binary", "weights" or something else. What matters is whether it's the "preferred form for making modifications" or a "form resulting from mechanical transformation or translation". My highlights are capitalized:
- Apache License Version 2.0, January 2004
- 1. Definitions:
- "Source" form shall mean the PREFERRED FORM FOR MAKING MODIFICATIONS, including BUT NOT LIMITED TO software source code, documentation source, and configuration files.
- "Object" form shall mean any form resulting from MECHANICAL TRANSFORMATION OR TRANSLATION of a Source form, including BUT NOT LIMITED TO compiled object code, generated documentation, and conversions to other media types.
Yes, weights are the PREFERRED FORM FOR MAKING MODIFICATIONS!!! You, the labs, and anyone sane modifies the weights via post-training. This is the point. The labs don't re-train every time they want to change the model. They finetune. You can do that as well, with the same tools/concepts, AND YOU ARE ALLOWED TO DO THAT by the license. And redistribute. And all the other stuff.
It's like getting compiled software with an Apache license. Technically open source, but you can't modify and recompile since you don't have the source to recompile. You can still tinker with the binary tho.
You run inference (via a library) on a model using its architecture (config file) and tokenizer (what and when to compute), based on weights (hardcoded values). That's it.
> but you can’t modify
Yes, you can. It's called finetuning. And, most importantly, that's exactly how the model creators themselves "modify" the weights! No sane lab "recompiles" a model every time they change something. They run a pre-training stage (feed it everything and the kitchen sink), get the hardcoded values (weights), and then post-train using the same concepts (maybe their techniques are better, but it's the same idea) as you or I would, just with more compute. That's it. You can make the exact same modifications, using basically the same tools.
> don’t have the source to recompile
In purely practical terms, neither do the labs. Anyone who has trained a big model can tell you the process is so finicky that they'd eat their hat if a big training run could somehow be made reproducible to the bit. Between nodes failing, data points ballooning your loss and forcing you to roll back, and the myriad of other problems, what you get out of a big training run is not guaranteed to be the same even across 100-1000 attempts, in practice. It's simply the nature of training large models.
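To make "modifying the weights via post-training" concrete, here is a hedged sketch using a LoRA adapter via HuggingFace PEFT; the model id and target module names are assumptions (gpt-oss may need a newer transformers build or different projection names):

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Load the released weights, then attach a small trainable adapter.
    model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b")  # assumed repo id

    lora = LoraConfig(
        r=16,                                  # adapter rank
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],   # assumption: attention projection names
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()         # only the adapter weights will train

    # From here, wrap in a Trainer/SFTTrainer with your dataset, train, and
    # redistribute the adapter (or merge it) alongside the original weights.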
AI "safety" is about making it so that a journalist can't get out a recipe for Tabun just by asking.
The risk isn’t that bad actors suddenly become smarter. It’s that anyone can now run unmoderated inference and OpenAI loses all visibility into how the model’s being used or misused. I think that’s the control they’re grappling with under the label of safety.
If you use their training infrastructure there's moderation on training examples, but SFT on non-harmful tasks still leads to a complete breakdown of guardrails very quickly.