Imagine, for instance, you give the LLM the profile of the love interest for your epic fantasy: it will almost always have the main character meet them within 3 pages (usually page 1), which is of course absolutely nonsensical pacing. No attempt to tell it otherwise changes anything.
This is the first model that, after 19 pages generated so far, resembles anything like normal pacing, even with a TON of details. I've never felt the need to generate anywhere near this much. Extremely impressed.
Edit: Sharing it - https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
with pastebin - https://pastebin.com/aiWuYcrF
We are currently at nonsensical pacing while writing novels.
Most human-written books don't do that, so that seems to be a criterion for a very different test than a Turing test.
I'm pretty sure if something like this happens, some dude will show up from nowhere and claim that it's just parroting what other, real people have written – that it just blended it together and randomly spat it out. "Real AI would come up with original ideas, like a cure for cancer," he'll say.
After some form of that arrives, another dude will show up and say that this "alphafold while-loop" is not real AI, because he just went for lunch and there was a guy flipping burgers – and that "AI" can't do that, so it's shit.
https://areweagiyet.com should plot those future points as well, with all those funky goals like "if Einstein had access to the Internet, Wolfram etc., he could have come up with it anyway, so it's no better than humans per se", or "it had to be prompted and guided by a human to find this answer, so it didn't really do it by itself", etc.
> With little or no human involvement, write Pulitzer-caliber books, fiction and non-fiction.
So, yeah. I know you made a joke, but you have the same issue as the Onion I guess.
What if we didn’t measure success by sales, but impact to the industry (or society), or value to peoples’ lives?
Zooming out to AI broadly: what if we didn’t measure intelligence by (game-able, arguably meaningless) benchmarks, but real world use cases, adaptability, etc?
2026 news feed: Anthropic cited as AI agents simultaneously block traffic across 42 major cities while trying to capture a not-even-that-rare pokemon
I currently assert that it's not, but I would also say that trying to follow your suggestion is better than our current approach of measuring everything by money.
No. Screw quantifiability. I don't want "we've improved the sota by 1.931%" on basically anything that matters. Show me improvements that are obvious, improvements that stand out.
Claude Plays Pokemon is one of the few really important "benchmarks". No numbers, just the progress and the mood.
Of course, this is just some pedantry.
I for one love that AI is progressing so quickly, that we _can_ move the goalposts like this.
There were popular writeups about this from the Deepseek-R1 era: https://www.tumblr.com/nostalgebraist/778041178124926976/hyd...
Not sure what is better for humanity in the long term.
I could build a machine that phones my mother and tells her I love her, but it wouldn't obsolete me doing it.
I am amazed at the progress that we are _still_ making on an almost monthly basis. It is unbelievable. Mind-boggling, to be honest.
I am certain that the issue of pacing will be solved soon enough. I'd give 99% probability of it being solved in 3 years and 50% probability in 1.
Yeah, but 10% plus 20% plus 20%... next thing you know you're at +100% and your server is literally double the speed!
AI progress feels the same. Each little incremental improvement alone doesn't blow my skirt up, but we've had years of nearly monthly advances that have added up to something quite substantial.
(For those too young or unfamiliar: Mary Poppins famously had a bag that she could keep pulling things out of.)
Yes, Z is indeed a big advance over Y, which was a big advance over X. Also yes, Z is just as underwhelming.
Are customers hurting the AI companies' feelings?
No. It's the critics' feelings that are being hurt by continued advances, so they keep moving goalposts so they can keep believing they're right.
Let's not forget the OpenAI benchmarks saying 4.0 could do better at college exams and such than most students. Yet real-world performance on real tasks was laughable.
That's a better criticism of college exams than of the benchmarks; and/or those exams likely have either the exact questions or very similar ones in the training data.
The list of things that LLMs do better than the average human tends to rest squarely in the "problems already solved by above average humans" realm.
The pace is moving so fast I simply can't keep up. Is there an ELI5 page which gives a 5-minute explanation of LLMs from 2020 to this moment?
In that light, even a 20 year old almost broken down crappy dinger is amazing: it has a radio, heating, shock absorbers, it can go over 500km on a tank of fuel! But are we fawning over it? No, because the goalposts have moved. Now we are disappointed that it takes 5 seconds for the Bluetooth to connect and the seats to auto-adjust to our preferred seating and heating setting in our new car.
There are some logical inconsistencies, though. For instance, when they both enter the cellar through a trapdoor, Kael goes first, but the innkeeper instructs him to close the trapdoor behind them, which makes no sense. Also, Kael goes down the stairs and "risks a quick look back up" and can somehow see the front door bulging and the chaos outside through the windows, which is obviously impossible when you look up through a trapdoor – not to mention that this entry was previously said to be behind the bar counter, which would surely block the view. Kael lights an oily rag, which somehow becomes a torch. There are more generic problems too: the Eldertides are supposedly mythical things no one has ever seen, yet they seem to be pretty common occurrences. The dimensions of the cellar are completely unclear; at first it seems to be very small, yet they move around it quite a bit. There are other issues, like characters using the same words as the narrator ("the ooze"), as if they can hear him, or the innkeeper suddenly calling Kael by name as though they already know each other.
Anyway, I would rate it "first draft". Of course, it is unclear whether the LLM would manage to write a consistent book, but I can fully believe that it would manage. I probably wouldn't want to read it.
And it also, as you say, really reuses words. Just reading this chapter I notice "phosphorescence" 4 times, for example, and "ooze" 17 times (!).
It is very impressive though that it can create a somewhat cohesive storyline, and certainly an improvement over previous models.
From an artistic standpoint, the result is... I'd say incredibly mediocre, with some glaring errors in between. This does not mean that an average person could produce a similar chapter. Gemini can clearly produce better prose than the vast majority of people. However, the vast majority of people do not publish books. Gemini would have to be on par with the best professional writers, and it clearly isn't. Why would you read this when there is no shortage of great books out there? It's the same with music, movies, paintings, etc. There is more great art than you could ever consume in your lifetime. All LLMs/GenAI do in art is pollute everything with their incredible mediocrity. For art (and artists), these are sad times.
Ideally I'd prefer to read material written by a top-1% expert in the field, but due to constraints you almost always get to read material written by a midwit, an intern, or a junior associate. In that case AI-written content is much better, especially as I can interrogate the material and match top-1% quality.
If this ever gets good enough to write your next bestseller or award winner, I might not even share it, and if I did, I wouldn't care whether some stranger read it or not, because it was created entirely for my pleasure.
That would be the next step but I'd previously never thought going any further might be worth it.
When was this time you speak of?
LLMs have been producing new things all the time. The question was always about quality of output, never about being able to produce anything new.
10000s is still much smaller than the space of possibilities for even a short prompt.
You might be right that good human novels are better than what LLMs can manage today. But that's rapidly changing.
And if you really need that Harry Potter / Superman / Three Musketeers crossover fan fiction itch scratched, you might not care that some other existing novel is 'better' in some abstract sense.
I've read a LOT of fiction. I love reading. And if it's good enough, the idea of reading something created by a machine does not bother me at all. So of course I will continue to see if the machine is finally good enough. And I can be a bit more specific.
Tens-of-thousands is probably low by something in the neighborhood of four orders of magnitude.
Human-generated slop is real.
Just for fun: Asked it to rewrite the first page of ‘The Fountainhead’ where Howard is a computer engineer, the rewrite is hilarious lol.
https://gist.github.com/sagarspatil/e0b5443132501a3596c3a9a2...
I envision that one day a framework will be created that can persist an LLM's current state to disk, so that "fragments of memories" can be paged in and out of memory.
When that happens, the LLM will be able to remember everything.
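A toy sketch of what that might look like (entirely illustrative; the `MemoryStore` class and its JSON layout are made up here): all memories persist on disk, and only the ones relevant to the current topic get paged back into the context.

```python
import json
import os

class MemoryStore:
    """Toy 'paged memory': every memory lives on disk, and only the ones
    matching the current topic are paged back into the prompt."""

    def __init__(self, path):
        self.path = path
        self.memories = []
        if os.path.exists(path):
            with open(path) as f:
                self.memories = json.load(f)

    def remember(self, topic, text):
        """Append a memory and persist the whole store to disk."""
        self.memories.append({"topic": topic, "text": text})
        with open(self.path, "w") as f:
            json.dump(self.memories, f)

    def page_in(self, topic, limit=5):
        """Page at most `limit` memories about `topic` back into context."""
        hits = [m["text"] for m in self.memories if topic in m["topic"]]
        return hits[:limit]
```

A real system would need smarter retrieval than topic substring matching, but the page-in/page-out shape is the point.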
I wouldn't be surprised if someone figured out a solid mixture of models working as a writer (team of writers?) + editor(s) and managed to generate a full book from it.
Maybe some mixture of general outlining + maintaining a wiki with a basic writing and editing flow would be enough. I think you could probably find a way to maintain plot consistency, but I'm not so sure about maintaining writing style.
Planning generated content is often more important to invest in than the writing of it.
Looking at your paste, your prompt is short and basic; it should probably be broken up into clear, formatted sections (try directives inside XML-style tags). For as large an output as you're expecting, I'd expect a considerable prompt of rules and context-setting (maybe a page or two).
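For example, a hypothetical structured layout (the tag names here are illustrative only, not a known schema; the story details are borrowed from the chapter discussed elsewhere in this thread):

```python
# Hypothetical structured prompt: directives split into XML-style sections,
# as suggested above.  The tag names are made up for illustration.
prompt = """
<role>You are a fantasy novelist drafting one chapter at a time.</role>
<style>Third-person limited; slow-burn pacing; no romance before chapter 10.</style>
<world>{world_outline}</world>
<task>Write chapter {n}: {chapter_summary}</task>
""".format(
    world_outline="A port city on the edge of an Eldertide-haunted sea.",
    n=1,
    chapter_summary="Kael arrives at the inn as the storm rises.",
)
print(prompt)
```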
I don't know why, but that is just such a literal thing to say that it seems almost random.
What I have found works is to give the LLM the "world" outline at the beginning and then just feed it a one-line summary of each chapter and get it to write a chapter at a time.
The problem is that the quality of results drastically decreases as the context length increases. After about 10 chapters the dialogue will start to get real snippy. I've tried getting it to summarize all the previous chapters and feed that back in, but it never includes enough detail.
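A minimal sketch of that loop (the `generate` callable is a stand-in for whatever model API you use; the rolling-summary scheme is the one described above, including its weakness of losing detail):

```python
def write_book(generate, world_outline, chapter_summaries):
    """Draft a book one chapter at a time.  `generate` is a stand-in for
    any LLM call: it takes a prompt string and returns generated text."""
    chapters = []
    rolling_summary = ""
    for i, beat in enumerate(chapter_summaries, start=1):
        prompt = (
            f"World outline:\n{world_outline}\n\n"
            f"Story so far (condensed):\n{rolling_summary}\n\n"
            f"Write chapter {i}. It should cover: {beat}"
        )
        chapters.append(generate(prompt))
        # Re-condense after every chapter so the prompt stays small; as noted
        # above, this is exactly where detail tends to get lost.
        rolling_summary = generate(
            f"Summarize the story so far, keeping plot-critical details:\n"
            f"{rolling_summary}\nChapter {i}: {beat}"
        )
    return chapters
```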
Brandon Sanderson has an awesome series of lectures on how he approaches it ->
https://www.youtube.com/playlist?list=PLSH_xM-KC3ZvzkfVo_Dls...
You will get so many mental benefits from writing, too. I promise it is worth it. AI is a great tool if you hit a block and need to brainstorm.
My other problem is... lack of time :)
I am working on some world-building for something I want to write one day, but I am trying just to write little things to help. I write a lot of nonfiction stuff for work, but I am worried that it might not translate as well to characters...
For example, with filters right now you can do things like show me hard sci fi with AI: https://shepherd.com/bookshelf/hard-science-fiction?topics=Q...
Reddit is also a great source for recommendations: https://www.reddit.com/r/booksuggestions/ https://www.reddit.com/r/fantasybooks/ https://www.reddit.com/r/scifi/
Humans write books, AI is for doing the dishes or laundry :)
Not really. Everyone recommends the same 20 books that most have read or at least considered.
Let me give you an example that is real to me. I'd like to:
1. Read a fantasy series that pairs a human male and an elf female romantically over the course of the series.
2. See the challenges of two fantasy races that aren't on very good terms, so just being an elf won't really cut it.
3. Have a love interest that is a big, active character in the story, not just a dozen mentions in a book.
4. And obviously, I have to like the book(s).
It doesn't even have to be elves; it's just much harder trying to find such recs for a bespoke species.
You would think this would be an easy enough recommendation. Elves are THE fantasy race, after all, and they usually aren't on the best of terms with humans. But it's not... and at this point, I could give you more obscure recommendations that meet at least requirement 1 than you'd get in the vast majority of reddit threads. I spent months going through general Amazon/Goodreads recs and Goodreads shelves with elves and still came out wanting.
Once you are even a little bit specific, options decay and if they exist, they are hard to find.
Shepherd looks good though
That type of deep analysis is hard, as nobody has access to the inside of the books (unless you are FB and do it illegally, plus have billions of compute dollars to spend) :)
like, if it decides to <think> a table of contents, or chapter summaries, rather than just diving in at page 1
- Buildup and happy background world-building
- Subtle foreshadowing
- Orcs attack
- Hero is saved by unlikely warrior of astounding beauty
- Evil is defeated until sales justify unnecessary sequel
That's the kind of story fit for the modern attention span...
Gemini 2.5 is the first model I tested that was able to solve it and it one-shotted it. I think it's not an exaggeration to say LLMs are now better than 95+% of the population at mathematical reasoning.
For those curious the riddle is: There's three people in a circle. Each person has a positive integer floating above their heads, such that each person can see the other two numbers but not his own. The sum of two of the numbers is equal to the third. The first person is asked for his number, and he says that he doesn't know. The second person is asked for his number, and he says that he doesn't know. The third person is asked for his number, and he says that he doesn't know. Then, the first person is asked for his number again, and he says: 65. What is the product of the three numbers?
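For what it's worth, the riddle can be checked mechanically. A brute-force sketch of my own (not from the thread): each "I don't know" eliminates every world in which that speaker could already have deduced their number; checking the visible pair (26, 39) that the solution arrives at, only one world survives the three eliminations, so person 1 can indeed announce 65.

```python
from functools import lru_cache

def candidates(world, i):
    """Worlds consistent with what person i sees: their own number is either
    the sum or the (positive) difference of the other two numbers."""
    x, y = [world[j] for j in range(3) if j != i]
    options = {x + y} | ({abs(x - y)} if x != y else set())
    return [world[:i] + (own,) + world[i + 1:] for own in options]

@lru_cache(maxsize=None)
def eliminated(world, r):
    """True if `world` is ruled out by the first r "I don't know" answers
    (answer k is spoken by person k-1)."""
    if r == 0:
        return False
    if eliminated(world, r - 1):
        return True
    speaker = r - 1
    live = [w for w in candidates(world, speaker)
            if not eliminated(w, r - 1)]
    # With only one candidate world left, the speaker would have known
    # their number, contradicting the "I don't know".
    return len(live) == 1

# Person 1 sees 26 and 39, so their number is the sum (65) or the
# difference (13).  After three "I don't know"s only one world survives:
possible = [w for w in [(65, 26, 39), (13, 26, 39)]
            if not eliminated(w, 3)]
print(possible)        # [(65, 26, 39)]
a, b, c = possible[0]
print(a * b * c)       # 65910
```

The recursion is exact: it expands only the worlds each person could be confusing theirs with, so no search bound is needed.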
https://www.reddit.com/r/math/comments/32m611/logic_question...
So it’s likely that it’s part of the training data by now.
Google spits out: "The product of the three numbers is 10,225 (65 * 20 * 8). The three numbers are 65, 20, and 8."
Whoa. Math is not AI's strong suit...
Bing spits out: "The solution to the three people in a circle puzzle is that all three people are wearing red hats."
Hats???
Same text was used for both prompts (all the text after 'For those curious the riddle is:' in the GP comment), so Bing just goes off the rails.
You know what would be even stupider? Putting a cheap, wrong LLM response atop each search query.
The second result is this reddit.com answer, https://www.reddit.com/r/math/comments/32m611/logic_question..., where at least the numbers make sense. I haven't examined the logic portion of the answer.
Bing doesn't list any reddit posts (that Google-exclusive deal) so I'll assume no stackexchange-related sites have an appropriate answer (or bing is only looking for hat-related answers for some reason).
I'm eagerly awaiting an article about RAG caching strategies though!
- This kid is a genius! - you yell
- But wait, the kid has just picked an answer from the ground, it didn't actually come up...
- But the other toddlers could do it also but didn't!
There are lots of possible mechanisms by which this particular problem would become more prominent in the weights in a given round of training even if the model itself hasn't actually gotten any better at general reasoning. Here are a few:
* Random chance (these are still statistical machines after all)
* The problem resurfaced recently and shows up more often than it used to.
* The particular set of RLHF data chosen for this model draws out the weights associated with this problem in a way that wasn't true previously.
Recently these have stopped, and now the ads are about becoming a maths tutor for AI.
Doesn’t seem like a role with long-term prospects.
Generally speaking, posting output from an LLM without explaining exactly what you think it illustrates, and why, is frowned upon here. I don't think your comment does a great job of the latter.
> I don't think this means what you think it means.
> I did some interacting with the Tencent model that showed up here a couple days ago [...]
> This is a question that obviously was in the training data. How do you get the answer back out of the training data?
What do I think the conversation illustrates? Probably that having the answer in the training data doesn't get it into the output.
How does the conversation illustrate that? It isn't subtle. You can see it without reading any of the Chinese. If you want to read the Chinese, Google Translate is more than good enough for this purpose; that's what I used.
I cannot figure out what the comment is trying to get across either. It's easy for you because you already know what you are trying to say. You know what the pasted output shows. The poor execution is in not spending enough time thinking about how someone coming in totally blind would interpret the comment.
I can't, and I imagine most of the people who downvoted you couldn't either.
I think asking people to go to Google Translate to parse a random comment that seems to be 90% LLM output by volume is a bit much.
I didn't downvote you, but like (probably) most people here, I can't read Chinese; I can't derive whatever point you're trying to make just from the text you provided.
Even ignoring the fact that this or a similar problem may have appeared in the training data, it's something that careful brute-force logic should solve. It's neither difficult, nor interesting, nor useful. Yes, it may suggest a slight improvement in basic logic, but no more so than a million other benchmarks people quote.
This goes to show that evaluating models is not a trivial problem. In fact, it's a hard problem (in particular, it's a far far harder than this math puzzle).
In my experience, the difference between someone who solves this type of logic puzzle and someone who doesn't has more to do with persistence and the ability to maintain focus than with "intelligence" in the sense of raw problem-solving ability. I've worked with college students helping them learn to solve these kinds of problems (e.g. as part of pre-interview test prep), and in most cases, those who solve them and those who don't make the same rate of progress towards the solution as long as they're actively working at it. The difference comes in how quickly they get frustrated (at themselves mostly), decide they're not capable of solving it, and give up on working on it further.
I mention this because the frustration itself comes from a belief that the ability to solve these belongs only to some "exceedingly clever" people, and not to someone like them. So this kind of thinking ends up being a vicious cycle that keeps them from working on their actual issues.
It was a fun puzzle though and I'm surprised I didn't know it already. Thanks for sharing.
Seriously though, nice work.
That's not to say that you didn't, or that you're recalling it from a previous time you happened to see this exact puzzle (despite there being scant prior references to it, which is precisely the reason for using it). But you can see how some might find that not entirely credible.
Best guess: this random person really likes puzzles, is presumably good at them, and is very, very far from representative enough to support your argument.
Read: just a heavy flex about puzzle solving.
I randomly answered this post and couldn't solve it in half an hour. Is the point leetcode, but for AI? I'd rather it solve real problems than "elite problems".
Side note: I couldn't even find pen and paper in that half hour.
Granted, this is all way over my head, but the solution Gemini comes to matches the one given on reddit (and now here, in future training runs).
65×26×39=65910
I don't think Gemini does an unnecessary amount of computation, it's just more verbose. This is typical of reasoning models, almost every step is necessary but many would not be written down by a human.
Especially as the company archives all those chats (not sure for how long) and is small enough that a billion-dollar "data sharing" agreement would be a very enticing offer.
If there isn't a significant barrier to access, it's being scraped. And if that barrier is money, it's being scraped but less often.
It doesn't say anything about the skill levels of the participants, whether their answers are just guesses, or why they aren't just guessing the sum of the other two numbers each time they're asked to provide more information.
It doesn't even say the guy saying 65 is correct.
How could three statements of "no new information" give information to the first guy that didn't know the first time he was asked?
I guess really the only missing information is that they have the exact same information you do, plus the numbers above their friends' heads.
If this is true, then back in the original 1->2->3->1 form, shouldn't person #3 have been able to answer it?
i.e. You observe three people in a magical room. The first person is standing underneath a 65, the second person is standing underneath a 26, and the third person is standing underneath a 39. They can see the others' numbers but not the one they are directly under. You tell them one of the three numbers is the sum of the other two and all numbers are positive integers. You ask the first person for their number; they respond that they don't know. You ask the second person for their number; they respond that they don't know. You ask the third person; they respond that they don't know. You ask the first person again and they respond with the correct value. How did they know?
And of course, if it responds with a verbatim answer in the line of https://www.reddit.com/r/math/comments/32m611/logic_question..., we can be pretty confident what's happening under the hood.
It's a common logic puzzle; Google can't turn up an exact match for the wording you have, but see e.g. here: https://www.futilitycloset.com/2018/03/03/three-hat-problem/
Tried this in DeepSeek and Grok and it kept thinking in loops for a while, so I just turned it off.
I haven’t seen a question loop this long ever.
Very impressed
Gemini 2.5 and Claude 3.7 thinking get it right; o3-mini and 4o get it wrong.
Again we find that the failure state of LLMs is the problem – yeah, when you already know the answer and it gets it right, that's impressive! When it fails, it acts in exactly the same way, and someone who doesn't already know the answer is now a lil stupider.
It's basically: given a sequence of heap operations (insert element, delete minimum element), can you predict the left-over elements (that are in the heap at the end) in linear time in the comparison model?
(The answer is surprisingly: Yes.)
N is the number of turns of "don't know" answers; the bad predicate means that the person can know their number at turn N.
% bad(+X, +Y, +Z, +N): the person can deduce their number by turn N
% (N counts the rounds of "don't know" answers heard so far).
bad(_,_,_,-1) :- !, false.
bad(_,A,A,0) :- !.
bad(A,_,A,0) :- !.
bad(A,A,_,0) :- !.
bad(B,C,A,N) :- D is abs(B-A), D < C, N1 is N-1, bad(B,D,A,N1), !.
bad(C,A,B,N) :- D is abs(B-A), D < C, N1 is N-1, bad(D,A,B,N1), !.
bad(A,B,C,N) :- D is abs(B-A), D < C, N1 is N-1, bad(A,B,D,N1), !.
solve(X,Y,Z) :- Y1 is X-1, between(1,Y1,Y),
between(0,2,N), Z is X-Y, bad(X,Y,Z,N).
?- solve(65,X,Y).
X = 26,
Y = 39 ;
X = 39,
Y = 26 .
(All state is stored in localStorage so you can come back to it :) ).
After a couple of minutes it decided on the answer being 65000 (S = {65, 40, 25}).
It's not an exaggeration, it's a non sequitur: you'd first have to show that LLMs reason in the same way humans do.
Call the three numbers a, b, and c. This means c = a + b, but we still don’t know to which person each number belongs.
When person 1 (p1) is asked what his number is, he has no way to know whether he has a, b, or c, so he says he doesn't know. Same goes for p2 and p3. Clearly p1 somehow gains information from p2 and p3 passing. Either he realizes that he must be a or b, and so his number is the difference between p2's and p3's numbers, or he realizes that he must be c, and so his number is the sum of p2's and p3's numbers.
That’s all I have so far. Anyone have other ideas?
P1 knows that P2 and P3 are not equal. So they know that the set isn't [2A, A, A].
P2 knows that P1 and P3 are not equal. So they know that the set isn't [A, 2A, A]. They also know that if P1 doesn't know, then they were able to make the same deduction. So they now know that both [2A, A, A] and [A, 2A, A] aren't correct. Since they know that [2A, A, A] isn't correct, they can also know that [2A, 3A, A] isn't correct either. Because they'd be able to see if P1 = 2A and P3 = A, and if that were true and P1 doesn't know their number, it would have to be because P2 isn't A. And if P2 isn't A, they'd have to be 3A.
P3 knows that P1 and P2 aren't equal. Eliminates [A, A, 2A]. Knows that [2A, A, A], [A, 2A, A], and [2A, 3A, A], are eliminated. Using the same process as P2, they can eliminate [2A, A, 3A], [A, 2A, 3A], and also [2A, 3A, 5A]. Because they can see the numbers and they know if P1 is 2A and P2 is 3A.
Now we're back at P1. Who now knows.
So P2 and P3 are in the eliminated sets. Which means we're one of these
[2A, A, A]; [3A, 2A, A]; [4A, 3A, A]; [3A, A, 2A]; [4A, A, 3A]; [5A, 2A, 3A]; [8A, 3A, 5A]
We know his number is 65. To find the set, we can factor 65: (5 * 13). We can check the other numbers 2(13) = 26. 3(13) = 39. And technically, you don't need to find the other numbers. The final answer is 5A * 2A * 3A or (A^3) * 30.
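A quick arithmetic check of that conclusion (using the (5A, 2A, 3A) ratio from the reasoning above): 5A = 65 gives A = 13.

```python
A = 65 // 5                       # person 1's number is 5A = 65, so A = 13
numbers = (5 * A, 2 * A, 3 * A)   # (65, 26, 39)
product = numbers[0] * numbers[1] * numbers[2]
print(numbers, product)           # (65, 26, 39) 65910
assert product == 30 * A ** 3     # matches the (A^3) * 30 form above
```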
Why? Couldn't it be an infinite number of size-3 arrays of multiples of A where two elements sum to the third, like [24A, 13A, 11A]? How did we deduce this set of arrays?
EDIT: Solved from another reddit comment. Tuples without a common factor like the one above are considered as a=1.
"They're not eliminated; they correspond to a = 1."
You should be able to generate an infinite number of these problems just by multiplying the first formula factor by a prime number. E.g. the same question where the person answers '52' restricts you to either [4a, 3a, a] or [4a, a, 3a]. Since the question only asks for the product of all the terms, either way the answer is 52 * 39 * 13 = 26,364.
If p1 KNOWS that he's the largest, then he has to have gained some other piece of information. Say the numbers he sees are 32 and 33. His number would have to be either 1 or 65. If p1 were 1, then the other two would have known p1 couldn't be the sum of the other two.
If p2 sees 1 and 33, s/he would wonder if s/he is 32 or 34.
P3 would consider 31 or 33.
So A + B = C and A + C = B. But we know that A + B = C, so we can replace C with (A + B). So we know that A + A + B = B.
So 2A + B = B. Or 2A = 0.
And this holds any way you slice it.
Even if you were to try and brute force it.
A = 1
B = 2
Then C = 3. But A + C has to equal B. That's 1 + 3 = 2? That's not true.
I don't see a case where you can take the sum of two numbers, add one of them to it, and get the other one.
I'm guessing that's a misreading of the problem. Because it looks like the third number is the sum of the first two.
The original problem is a little ambiguously worded. You could say "one of their numbers is the sum of the other two" and it would be a little clearer.
No it isn't. If it said "the sum of any two of the numbers is equal to the third", that would be a contradiction. What it says is "the sum of two of the numbers is equal to the third".
Buying two of the items gets you the third for free.
The implication is any two.
It’s ok that it’s ambiguous. It happens. In most cases, we clarify and move on. There’s no need to defend it.
What's especially strange here is that they repeatedly demonstrate that, if you interpret it that way, the problem is obviously, trivially unsolvable, in a way that a beginner in algebra could intuit (roughly 12 years old; at least, we started touching algebra in 7th grade).
I really don't get it.
When I've seen this sort of thing play out this way, the talking-down is usually for the benefit of demonstrating something to an observer (i.e. I am smart look at this thing I figured out; I can hold my own when the haters chirp; look they say $INTERLOCUTOR is a thinker but they can't even understand me!), but ~0 of that would apply here, at least traditionally.
That’s one-shot for o1 pro.
Let’s carefully analyze this puzzle step-by-step:
⸻
Step 1: Define the Problem Clearly
We have three people in a circle. Let's call them:
• Person A (first person)
• Person B (second person)
• Person C (third person)
Each has a positive integer above their head. Each sees the other two numbers but not their own. The numbers satisfy: A + B = C, or B + C = A, or C + A = B.
Given the puzzle scenario, we need to find out the exact numbers given the clues, and ultimately find the product of the three numbers.
⸻
Step 2: Logic of the Puzzle (Step-by-Step Reasoning)
Initial conditions: The numbers are positive integers. The puzzle implies a situation of ambiguity at first, but after multiple rounds of “I don’t know,” Person A finally knows their number is 65.
Let’s denote clearly the conditions again: • Exactly one of the following three equations is true: • A + B = C, or • B + C = A, or • C + A = B.
Person A sees B and C, but not A. Similarly for Persons B and C.
⸻
Step 3: Logical deductions from the repeated “I don’t know”
First round deductions:
• Person A initially doesn't know their number. This means, from Person A's perspective, both scenarios A = B + C and A + B = C (or A + C = B) are still possible. If Person A had seen identical numbers, for instance, Person A could have instantly known their number. Thus, Person A sees two different numbers.
• Similarly, Person B doesn't know either, implying that Person B also sees ambiguity in their own number. Thus, Person B also sees two numbers that could result in at least two possible solutions.
• Person C also doesn't know, same logic as above.
Thus, after one full round, we understand that initially, all three numbers must create ambiguity. No trivial case (like two identical numbers) is present.
Second round deductions:
After hearing that no one knows their number, each person gains information: they now know the scenario is one in which none of the other two people could have immediately solved it.
Crucially, in the second round:
• Person A now knows their number is 65 after considering the previous statements. This implies that Person A sees numbers B and C, and from these numbers plus the statements of uncertainty, Person A deduces that their own number can only be 65.
⸻
Step 4: Deriving the Numbers (Critical deduction)
Let’s denote clearly again: Person A’s number is now known to be 65.
We have three possible equations. Let's test each carefully given A = 65:
1. Case 1: A + B = C. Then we have 65 + B = C. Person A sees numbers B and C, so they would see B and 65 + B. If Person A sees these two numbers, they might initially consider their number could be either the sum or the difference of the two. However, if Person A sees two numbers where one is exactly 2× the other, Person A could immediately guess their number is the difference (since if you see numbers X and 2X, your number is clearly X). So the scenario of B and 65 + B is okay only if these two numbers don't exhibit an immediately trivial scenario. Let's hold this scenario and continue examining the others first.
2. Case 2: B + C = A (so B + C = 65). Then Person A sees numbers B and C, adding exactly to 65. Could Person A know their number immediately at first glance? No, since their number could either be the sum (65) or the difference of the two numbers. The initial ambiguity here makes sense. Person B and Person C also see ambiguous scenarios; they both don't know their numbers in the first round either. This scenario is plausible.
3. Case 3: C + A = B (so C + 65 = B). Then Person A sees numbers B = C + 65 and C. Similar reasoning as Case 1.
⸻
Step 5: Identifying the Correct Scenario
The puzzle’s logic typically aligns best with Case 2: B + C = A = 65.
Here's why, clearly stated:
• Initially, Person A sees two numbers adding to exactly 65. At first glance, Person A's number could be either 65 (the sum) or the difference between the two numbers.
• Person B also initially sees two numbers and cannot immediately distinguish their own number; nor can Person C.
• After everyone states "I don't know," Person A gains critical insight: if the two numbers that A sees had allowed Person B or Person C to know their number immediately, they would have known right away. The fact that B and C didn't know means the configuration cannot be trivial (such as identical numbers or easy multiples).
Therefore, the two numbers must be such that their difference doesn’t immediately clue in the other persons. This occurs most clearly with pairs of numbers which have multiple factorization possibilities or aren’t trivial multiples.
But we must quickly test for a suitable, plausible integer pair whose sum is 65 and whose ambiguity would logically survive multiple rounds. The puzzle is a classic logic puzzle with a known structure:
A common scenario in these classic puzzles is the use of Fibonacci-like or similar "golden" sets. Let's quickly check the following well-known ambiguous triple:
• Numbers: 65, 39, 26
• Check quickly: indeed, 39 + 26 = 65.
Checking the logic quickly:
• Person A sees 39 and 26. Initially ambiguous (their number could be 65 or 13).
• Person B sees 65 and 26. Could be either 39 or 91 at first glance. Initially ambiguous.
• Person C sees 65 and 39. Could be either 26 or 104 initially. Initially ambiguous.
After each person says they don't know:
• Person A considers: if my number were 13, then Person B would see the numbers 13 and 26. Person B would then immediately know their number is 39 (13 + 26 = 39), with no ambiguity. But Person B didn't know! So Person A's number cannot be 13. Hence, Person A confidently concludes their number is 65.
Thus, the numbers 65, 39, and 26 satisfy the puzzle conditions perfectly!
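The elimination step can be sanity-checked in a few lines of Python (a sketch following the transcript's reasoning, not a full common-knowledge solver):

```python
# Sanity check of the deduction above: Person A sees B = 39 and C = 26.
b, c = 39, 26

# At first glance, A's number is either the sum or the (positive) difference.
candidates = {b + c, abs(b - c)}  # {65, 13}

# If A were 13, B would see 13 and 26 and would immediately know B = 39
# (the alternative, 26 - 13 = 13, would duplicate a visible number, which
# the first round already ruled out). B said "I don't know", so A != 13.
a = next(x for x in candidates if x != abs(b - c))

print(a, a * b * c)  # 65 65910
```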
⸻
Step 6: Final Answer (Product of the Numbers)
The three numbers identified are:
• A = 65
• B = 39
• C = 26
Then, the product is: 65 × 39 × 26 = 65 × 1,014 = 65,910
⸻
Final Answer:
65,910
Plus it drew me a very decent pelican riding a bicycle.
Notes here: https://simonwillison.net/2025/Mar/25/gemini/
Personally I'm convinced this model is the best out there right now.
https://www.reddit.com/r/Bard/comments/1jjobaz/pelican_on_a_...
I mean, hell, I have drawings from when I was eight of leaves and they are botanically-accurate enough to still be used for plant identification, which itself is a very difficult task that people study decades for. I don't see why this is interesting or noteworthy, call me a neo-luddite if you must.
It's sort of irrelevant though as the test is about SVGs.
It could be useful for much more than just silly benchmarks, there's a reason why physics students are taught to draw a diagram before attempting a problem.
https://help.kagi.com/kagi/ai/llm-benchmark.html
High quality, to the point. Bit on the slow side. Indeed a very strong model.
Google is back in the game big time.
This is well ahead of thinking/reasoning models. A huge jump from prior Gemini models. The first Gemini model to effectively use efficient diff-like editing formats.
In which case, google should be working on achieving better output format following, as Claude and R1 are able to hit nearly 100% accuracy on the format.
Aider automatically asks models to retry malformed edits, so it recovers. And goes on to produce a SOTA score.
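That retry loop can be sketched in a few lines; `ask_model` here is a hypothetical stand-in for the LLM call, and the SEARCH/REPLACE markers follow the block format aider documents:

```python
import re

# Aider-style SEARCH/REPLACE edit block; a malformed reply triggers a retry.
EDIT_RE = re.compile(
    r"<{7} SEARCH\n(.*?)\n={7}\n(.*?)\n>{7} REPLACE",
    re.DOTALL,
)

def parse_edits(reply: str):
    """Return (search, replace) pairs, or None if the format is malformed."""
    edits = EDIT_RE.findall(reply)
    return edits or None

def apply_with_retry(ask_model, text: str, prompt: str, max_retries: int = 2):
    """ask_model is a stand-in for the LLM call; re-prompt on malformed output."""
    for _ in range(max_retries + 1):
        edits = parse_edits(ask_model(prompt))
        if edits is None:
            prompt = "Your edit was malformed; resend it as SEARCH/REPLACE blocks."
            continue
        for search, replace in edits:
            text = text.replace(search, replace, 1)
        return text
    raise RuntimeError("model never produced a well-formed edit")
```

The leaderboard numbers suggest how often that fallback path is exercised per model.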
- Our state-of-the-art model.
- Benchmarks comparing to X,Y,Z.
- "Better" reasoning.
It might be an excellent model, but reading the exact text repeatedly is taking the excitement away.
It's interesting how the recent AI announcements are following the same trend over a smaller timeframe.
once you get all your apps, wallpaper, shortcut order and same OS, you really quickly get the feeling you spent $1,000 for the exact same thing
But it needs to be seamless to remove any friction from the purchase, but at the same time if it feels the same then we felt like we wasted money.
So what I usually do is buy a different colored phone and change the wallpaper.
My MacBook was the same. Seamless transition and 2 hours later I was used to the new m4 speeds.
These LLMs were supposedly limited by the training run, but these point-version models are mostly post-training driven, which seems to take less time.
If models were tied to a specific hardware (say, a "AI PC" or whatever) the cycle would get slower and we'll get a slower summer which I'm secretly wishing.
So, I think these results are very interesting, if you know what features specifically you are using.
Excited to see if it works this time.
This is the commodification of models. There is nothing special about the new models but they perform better on the benchmarks.
They are all interchangeable. This is great for users as it adds to price pressure.
They are not so good at measuring reasoning, out-of-domain performance, or creativity.
As big players look to start monetizing, they are going to desperately be searching for moats.
Right now, in my opinion, OpenAI has actually a useful deep research feature which I've found nobody else matches. But there is no moat to be seen there.
I might be mistaken, but originally the reasoning was fully hidden? Or maybe it was just far more aggressively purged. I agree that today the reasoning output seems higher quality than originally.
It's called the "first step fallacy", and AI hype believers continue to fall for it.
The rate of announcements is a sign that models are increasing in ability at an amazing rate, and the content is broadly the same because they’re fungible commodities.
The latter, that models are fungible commodities, is what’s driving this explosion and leading to intense competition that benefits us all.
If these companies start failing to beat the competition, then we should prepare ourselves for very creative writing in the announcements.
> mturk workers in Nigeria
Serious question: Has anyone tested how much money you can actually make doing a month of Amazon Mechanical Turk? (It would make for an interesting YouTube video!) I am curious whether it pays middle-class wages in very poor countries (like Nigeria). Some light Googling tells me that a middle-class salary in Nigeria is about 6K USD, so about 3 USD/hour (assuming 50 weeks/year × 40 hours/week = 2,000 hours/year). Is this possible with MTurk?

Also, I think Google's winning the race on actually integrating the AI to do useful things. The agent demo from OpenAI is interesting, but frankly, I don't care to watch the machine use my computer. A real virtual assistant can browse the web headless and pick flights or food for me. That's the real workflow unlock, IMO.
> I've moved entire workflows to Gemini because it's just _way_ better than what openai has to offer, especially for the money.
This is useful feedback. I'm not here to shill for OpenAI or Google/Gemini, but can you share a concrete example? It would be interesting to hear more about your use case. More abstractly: do you think these "moved entire workflows" offset a full worker, or X% of a full worker? I am curious to see how and when we will see low-end/junior knowledge workers displaced by solid LLMs. Listening to the Oxide and Friends podcast, I learned that they make pretty regular use of LLMs to create graphs using gnuplot. To paraphrase, they said "it is like having a good intern".

Upload a complicated PDF of a presentation and ask for insights that require some critical thinking about them.
> Do you think these "moved entire workflows" offset a full worker, or X% of a full worker
It can replace many junior analysts IMO.
The Achilles heel of LLMs is the distinct lack of practical real-world applications. Yes, Google and Microsoft have been shoving the tech into everything they can fit, but that doesn't a product make.
Practical, real-world application.
If I'm being incredibly generous I will concede that this could have been the case for the first few weeks when it was making headlines, but it clearly isn't true now.
It would be literally impossible to keep up these figures for as long as ChatGPT has without a ton of repeat users. There simply aren't enough people/devices.
Please don’t enter ...confidential info or any data... you wouldn’t want a reviewer to see or Google to use ...
The full extract of the terms of usage: How human reviewers improve Google AI
To help with quality and improve our products (such as the generative machine-learning models that power Gemini Apps), human reviewers (including third parties) read, annotate, and process your Gemini Apps conversations. We take steps to protect your privacy as part of this process. This includes disconnecting your conversations with Gemini Apps from your Google Account before reviewers see or annotate them. Please don’t enter confidential information in your conversations or any data you wouldn’t want a reviewer to see or Google to use to improve our products, services, and machine-learning technologies.
Conversations that have been reviewed or annotated by human reviewers (and related data like your language, device type, location info, or feedback) are not deleted when you delete your Gemini Apps activity because they are kept separately and are not connected to your Google Account. Instead, they are retained for up to three years.
Emphasis on "retained for up to three years" even if you delete it!!

This seems better, not worse, than keeping the user-conversation mapping so that the user may delete their conversations.
Also, maybe some services are doing "machine learning" training with user data, but this is the first time I have seen a recent LLM service say that your data can be fed to human reviewers at will.
I tried it a month ago on all the major frontier models and none of them correctly identified the fix. This is the first model to identify it correctly.
I'm sorry Dave, I'm afraid I can't do that.
Been playing around with it and it feels intelligent and up to date. Plus is connected to the internet. A reasoning model by default when it needs to.
I hope they enable support for the recently released canvas mode for this model soon it will be a good match.
I’ve always found the use of the *.5 naming kinda silly since it became a thing. When OpenAI released 3.5, they said they already had 4 underway at the time; they were just tweaking 3 to be better for ChatGPT. It felt like a scrappy startup name, and now it’s spread across the industry. Anthropic naming their models Sonnet 3, 3.5, 3.5 (new), 3.7 felt like the worst offender of this naming scheme.
I’m a much bigger fan of semver (not skipping to .5 though), date based (“Gemini Pro 2025”), or number + meaningful letter (eg 4o - “Omni”) for model names.
If you could get much better performance without changing the algorithm (eg just by scaling), you'd still bump the number.
I think it makes sense to increase the major / minor numbers based on the importance of the release, but this is not semver.
I have a "test" which consists of sending it a collection of almost 1000 poems, currently around ~230k tokens, and then asking a bunch of stuff which requires reasoning over them. Sometimes it's something as simple as "identify key writing periods and their differences" (the poems are ordered chronologically). Previous models don't usually "see" the final poems — they get lost, hallucinate and are pretty much worthless. I have tried several workaround techniques with varying degrees of success (e.g. randomizing the poems).
Having just tried this model (I have spent the last 3 hours probing it), I can say that, to me, this is a breakthrough moment. Truly a leap. This is the first model that can consistently comb through these poems (200k+ tokens) and analyse them as a whole, without significant issues or problems. I have no idea how they did it, but they did it.
The analysis of this poetic corpus has few mistakes and is very, very, very good. Certainly very good in terms of how quickly it produces an answer — it would take someone days or weeks of thorough analysis.
Of course, this isn't about poetry — it's about passing in huge amounts of information, without RAG, and having a high degree of confidence in whatever reasoning tasks this model performs. It is the first time that I feel confident that I could offload the task of "reasoning" over large corpus of data to an LLM. The mistakes it makes are minute, it hasn't hallucinated, and the analysis is, frankly, better than what I would expect of most people.
Breakthrough moment.
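A harness for this kind of long-context test is easy to assemble; a minimal sketch, assuming one poem per file named so that filenames sort chronologically, with tokens roughly estimated at four characters each:

```python
from pathlib import Path

def build_prompt(poem_dir: str, question: str) -> str:
    """Concatenate a chronologically ordered corpus into one long prompt."""
    poems = sorted(Path(poem_dir).glob("*.txt"))  # filenames sort chronologically
    corpus = "\n\n---\n\n".join(p.read_text() for p in poems)
    return f"{corpus}\n\nQuestion: {question}"

def approx_tokens(text: str) -> int:
    # Rough rule of thumb: ~4 characters per token for English text.
    return len(text) // 4
```

The resulting string goes to the model in a single request, with no RAG or chunking; the point of the test is whether the model can reason over the whole corpus at once.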
Today, it seems like Google has beaten them: it supports a way larger context window and is way better at keeping track of what has been said and at remembering older tokens.
"A boy eating a sandwich"
https://www.svgviewer.dev/s/VhcGxnIR
"A multimeter"
Maybe it learned from Gianluca's gallery!
I'll be looking to see whether Google would be able to use this model (or an adapted version) to tackle ARC-AGI 2.
From https://x.com/OfficialLoganK/status/1904583353954882046
The low rate-limit really hampered my usage of 2.0 Pro and the like. Interesting to see how this plays out.
This time, the Googler says: “good news! you will be charged for experimental models, though for now it’s still free”
I assumed that meant there was a paid version with a higher rate limit coming out today
Available as experimental and for free right now in Google AI Studio + API, with pricing coming very soon!
And the pricing page [1] still does not show 2.5 yet.

1. o1-pro (medium reasoning): 82.3
2. o1 (medium reasoning): 70.8
3. o3-mini-high: 61.4
4. Gemini 2.5 Pro Exp 03-25: 54.1
5. o3-mini (medium reasoning): 53.6
6. DeepSeek R1: 38.6
7. GPT-4.5 Preview: 34.2
8. Claude 3.7 Sonnet Thinking 16K: 33.6
9. Qwen QwQ-32B 16K: 31.4
10. o1-mini: 27.0
[1] https://arxiv.org/pdf/2409.12640
[2] https://fiction.live/stories/Fiction-liveBench-Feb-20-2025/o...
This is most evident when querying about fast-moving dev tools like uv or bun. It seems to only know the original uv options like pip and tools, while with bun it is unfamiliar with bun outdated (from Aug 2024), bun workspaces (from around that time?) but does know how to install bun on windows (April 2024).
You'll still need to provide this model with a lot of context to use it with any tooling or libraries with breaking changes or new features from the past ~year - which seems to contradict the AI Studio reported knowledge cutoff.
Were I developing models - I'd prioritise squeezing in the most recent knowledge of popular tools and libraries since development is such a popular (and revenue generating) use case.
Edit: I did try it anyway and so far the new model is having similar hallucinations. I really need to test my code with Claude 3.7 as a control, to see if it approaches the real ClaudePlaysPokemon's semi-competence.
Edit 2: Here's the log if anyone is curious. For some reason it's letting me make more requests than the stated rate limit. Note how at 11:27:11 it hallucinates on-screen text, and earlier it thinks some random offscreen tile is the stairs. Yes, I'm sure this is the right model: gemini-2.5-pro-exp-03-25.
[1] https://ai.google.dev/gemini-api/docs/rate-limits#tier-1
"Anna, Becca and Clare go to the play park. There is nobody else there. Anna is playing on the see-saw, Becca is playing on the swings. What is Clare doing?" (Sometimes I ask similar questions with the same structure and assumptions but different activities)
About a year ago none of them could answer it. All the latest models can pass it if I tell them to think hard, but previously Gemini could rarely answer it without that extra hint. Gemini 2.5 caveats its answer a bit, but does get it correct. Interestingly GPT-4o initially suggests it will give a wrong answer without thinking, but recognises it's a riddle, so decides to think harder and gets it right.
I thought memory requirement grows exponentially with context size?
> This nearest-neighbor connectivity is a key difference between TPUs and GPUs. GPUs connect up to 256 H100s in an all-to-all configuration (called a node), rather than using local connections. On the one hand, that means GPUs can send arbitrary data within a node in a single low-latency hop. On the other hand, TPUs are dramatically cheaper and simpler to wire together, and can scale to much larger topologies because the number of links per device is constant.
You can see dramatic improvements in latency and throughput if there is a large shared prefix of the queries.
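That prefix-sharing win comes from reusing the KV cache for the common prefix; the memory at stake is easy to estimate (illustrative model dimensions, not Gemini's actual ones):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Keys + values (factor of 2), per layer, for every cached token."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem

# Example: a hypothetical 70B-class model with grouped-query attention,
# holding a 1M-token context in fp16 (2 bytes per element).
cache = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, context_len=1_000_000)
print(f"{cache / 2**30:.1f} GiB")  # 305.2 GiB

# If queries share a 900k-token prefix, ~90% of that cache can be
# computed once and served from memory for every query.
```

Note the memory grows linearly with context length (the quadratic cost is in attention compute, not the cache), which is why prefix reuse pays off so directly.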
This stuff has a long way to go.
This way they get two rounds of headlines. "Gemini 2.5 released" and later on "Gemini 2.5 coming to all Google accounts."
I've also had it switch languages in the middle of output... like one word in the middle of a sentence was randomly output in some strange hieroglyphs, but when I translated them, it was the right word and the sentence made sense.
For as amazing as these things are, AGI they are not.
The exceptions are auto regressive image generation and audio models.
(DM me for the questions)
This is part of the code output (after several interactions of it not returning actual code):
// Tail Call Optimization (very basic)
if(func->type == VAL_FUNCTION){
return apply(func, args, env); //no stack growth.
}
else{
return apply(func, args, env);
}
I'm not very impressed.

I pointed out that part of the code, and it answered:
You've correctly pointed out that the TCO implementation in the provided C code snippet is essentially a no-op. The if and else blocks do the same thing: they both call apply(func, args, env). This means there's no actual tail call optimization happening; it's just a regular function call.
But then it follows with even worse code. It does not even compile!
Nobody is going to say "Announcing Foobar 7.1 - not our best!"
"It beats all the benchmarks...but you really really don't want to use it."
(Disclaimer, Googler, but I don’t work on any of this, I only have an external layperson’s understanding of it)
Some of the desktop chat clients are turning into great productivity tools. I tried the Claude one last week and quickly went back to Chat GPT. Claude might be a better model for coding. But it's less effort to make Chat GPT do what I want at this point and it's kind of good enough for a lot of stuff. Every release it's getting better. It connects to my IDE automatically, it can look at the files I have open. It can patch those files (I actually disabled that because it's too slow for my taste), etc.
But most importantly, I can trigger all that with option+shift+1. I do this gazillions times per day. Mostly simple stuff with really short prompts, "check this" (file, selection, current line, etc.), fix that, what do you think about x, "address the FIXMEs/TODOs", "document this", etc.
I can ask other models the same questions and they'd get the job done. But then I have to do more work to give them the same context. Claude has a Github connect option, which is great. But unfortunately it's just a glorified file picker, which really sucks. I have files open in my editor, just look at those. I don't want to have to manually open files do that for me or specify what files to look at every time I go near the tool.
Chat GPT actually asked me yesterday whether it could add a different file than the one it was looking at. I said "yes" and it did. That's a great UX. Don't make me do work.
That's a good UX.
I use Gemini mainly because it's integrated into Google's tools, so it's kind of just there. And Chat GPT for whatever reason cannot look at the browser window. But from a UX point of view, that kind of deep integration is what you want. You have this implicit shared context, the thing you are looking at, that you don't have to spell out anymore.
The UX of populating the context is the deciding factor in how useful models are at this point, not how well it solves pet benchmark questions or renders pelicans on bicycles.
I have good hopes for agentic coding tools progressing rapidly this year. The ones I've tried recently need a lot of work though. I keep going back to Chat GPT because it's just the quickest & easiest to use at this point.
I've been using o1 almost exclusively for the past couple months and have been impressed to the point where I don't feel the need to "upgrade" for a better model.
Are there benchmarks showing o3-mini performing better than o1?
[1] https://openai.com/index/openai-o3-mini/ [2] https://www.vals.ai/benchmarks/aime-2025-03-24
You just have to use the models yourself and see. In my experience o3-mini is much worse than o1.
The fact they would exclude it from their benchmarks seems biased/desperate and makes me trust them less. They probably thought it was clever to leave o1 out, something like "o3 is the newest model lets just compare against that", but I think for anyone paying attention that decision will backfire.
I ran this command to create it:
curl -s "https://hn.algolia.com/api/v1/items/43473489" | \
jq -r 'recurse(.children[]) | .author + ": " + .text' | \
llm -m "gemini-2.5-pro-exp-03-25" -s \
'Summarize the themes of the opinions expressed here.
For each theme, output a markdown header.
Include direct "quotations" (with author attribution) where appropriate.
You MUST quote directly from users when crediting them, with double quotes.
Fix HTML entities. Output markdown. Go long. Include a section of quotes that illustrate opinions uncommon in the rest of the piece'
Using this script: https://til.simonwillison.net/llms/claude-hacker-news-themes

But then there are two questions. First, are the white-collar workers responsible for the increase in productivity specifically consultants and engineers? Or is it white-collar workers at the very right tail, e.g., scientists?
I think consultants and engineers are using these technologies a lot. I think biologists at least are using these models a lot.
But then where is the productivity increases?
But it's more complex than that: if many people start losing their jobs we all take a hit on gdp because they can't consume as much anymore, so it could take perhaps a long time until gdp actually sees meaningful gains.
And one last thought - Satya likely hasn't spent much time thinking about gdp, it's just not his field. He's a smart guy for sure but this isn't what he does.
Unemployment hasn't really picked up, and is unlikely to do so, unless the central bank is incompetent. (They have been from time to time.)
However, some advances don't show up in GDP. Eg Wikipedia is a tremendous achievement. But nobody pays for it, so it doesn't show up in GDP statistics.
That's an important assessment. I don't know if you're right. If the models are going to continue to get more capable I'm expecting unemployment to rise , I don't see how it won't (sure we are promised A.I to create tons of new jobs no one has imagined yet, I haven't seen a reliable clue for such jobs yet).
All the questions I had before language models, I have answered with language models.
That doesn't mean I have no more questions though. Answering those questions opened up 10X more questions I have now.
In general, everyone knows that answering scientific questions leads to new and more questions. It is the exact same process in the economy. There is a collectivist sentiment though in society and the economy that wants to pretend this isn't true. That the economic questions can be "solved", the spoils divided up and we live happily ever after in some kind of equilibrium.
As far as new jobs, they are here now but they surely sound as ridiculous to think about as being a professional youtuber in 2005. Or I think of the person making a geocities website in 1997 vs a front end developer. There is no date that a front end developer emerges from the html code monkey. It is a slow and organic process that is hard to game.
How many people can make an actual living out of Youtube? Surely they exist, but to reliably live off it for decades (not just 1-2 years of temporary fame, which is also very hard to come by) I'd say fewer than one in ten thousand people will make it. I can't call "Youtuber" a career path with that kind of success rate any more than I can call being an actor in Hollywood a career path.
To give an example from the recent past: 'hipster' baristas that make you a five dollar coffee are a fairly new job. At least at scale.
But I doubt you'll be able to find any technology that automated some other job but created barista jobs.
It's just that the market will find stuff for people to do for money, unless prevented to do so by incompetent central bank policy or (too) onerous labour market regulation.
(The labour market can take quite a lot of regulation, and still be able to get people jobs. Have a look at Germany today for an example.)
Will it? Let's take my example: I'm a 41-year-old male with around 15 years of experience in software development. Let's say 4 years from now myself and a million others are losing our development jobs to AI. What does the market have for my skills? I can try going into healthcare or teaching (though that's quite an extensive retraining plus a salary reduction), I can go into the trades (same) or get some other work that's hard to automate, like caring for old people (very low salary). All of these options involve massive salary reduction, and that's in the positive scenario that I actually am able to retrain and survive such a shift mentally. It's quite likely many software devs won't be able to become plumbers and nurses and will become chronically unemployed.
The situation you describe isn't all that special.
Yes, losing your job (or your career) is not fun, and can be painful. Massive salary reduction can happen.
No, that hasn't led to widespread unemployment in the past. At least not widespread enough to be visible in aggregate statistics, especially over the noise of the 'normal' business cycle. However, individuals can obviously have pretty long spells of unemployment, but that can also happen without a shift in technology.
I'm just trying to get the point across that unemployment might rise so gdp may fall, in fact I think it should be the baseline scenario and not thinking some new jobs we can't imagine yet will be created. It's so hard to imagine these new jobs because if the machines will out perform us cognitively it follows we will be able to get intelligent robots into the real world quite soon after. Then seriously what the heck is left? Fewer jobs, not more.
There is one "cure" I can think of for this and that's something closer to socialism, the market will have to step aside and the government will create massive amounts of new jobs. For example classes can be 5 pupils per teacher instead of 30 pupils per teacher. Nurses can attend to 3 patient beds instead of 8. But letting the market sort this out ? I don't think so.
So I admit that this is a serious possibility that we need to consider.
But for the argument to make sense, we can't just talk about the general 'Oh, new technology will make a bunch of jobs obsolete.' We have to specifically talk about what (might) make AI special in that it might be even more general than electricity.
You didn't mention these special factors in your original comments.
I am not sure whether AI will be different or not, or rather I don't know how different it will be.
So far I see it as a good sign that we have many relatively equally competitive models from different providers, and some of them have open weights and some of them even have completely open sources (including training algorithms). So at least it's unlikely for the technology to be monopolised by any one entity.
> There is one "cure" I can think of for this and that's something closer to socialism, the market will have to step aside and the government will create massive amounts of new jobs. For example classes can be 5 pupils per teacher instead of 30 pupils per teacher. Nurses can attend to 3 patient beds instead of 8. But letting the market sort this out ? I don't think so.
If you want to involve the government, I'd rather give everyone a basic income, than to give our pupils inferior teachers and our sick people inferior nurses. (After all, we are assuming that humans will be worse at these jobs than the AI.) Also, I'd rather have people enjoy whatever it is they want to do, instead of being forced into some government provided make-work programme.
They're not baked into workflows where the measurable output is attributed easily to the model use. Productivity in its current form is transformative in the sense that the use case and gain differs for the individual (who even provide different prompts). So some are keeping the gains for themselves, others are using it to improve quality rather than quantity.
It'll come in time, it's important to remember gpt 4 was released 2 years ago this month. The newer models are more reliable and could probably be introduced into workflows more frequently. Today I spoke to a company who are looking to use it to reduce cost in the next year.
Benchmarks are useful as leading indicators. Early warning signs. If there’s no relation to the eventual productivity then hopefully that benchmark will disappear as it’s not useful.
In a fast moving space like this it’s reasonable to make use of leading indicators.
Reference: https://rodolphoarruda.pro.br/wp-content/uploads/image-14.pn...
By extension it should also be slightly more helpful for research, R&D?
If they're that easy, why not fix the causes that create the need for RCAs in the first place? Our RCAs will not be solved by AI for decades, let me tell you that.
Personally, I do like the "use web search" and "extended thinking" buttons, but ultimately, the models should probably be able to figure out whether doing so would be useful themselves too.
Gave it a problem which sounds like the Monty Hall problem but is actually a simple probability question, and it nailed it.

Asked it to tell a joke: the most horrible joke ever.

Much better than o1, but still nowhere near AGI. It has been optimized for logic and reasoning at best.
Chatgpt pronounced correctly
I'm a Gemini Advanced subscriber, still don't have this in the drop-down model selection in the phone app, though I do see it on the desktop webapp.
I also see Gemini 2.0 Pro has been replaced completely in AI Studio.
On AI Studio, the model told me today is June 13, 2024.
I was impressed at first. Then it got really hung up on the financial model, and I had to forcibly move it on. After that it wrote a whole section in Indonesian, which I don't speak, and then it crashed. I'd not saved for a while (ever since the financial model thing), and ended up with an outline and a couple of usable sections.
I mean, yes, this is better than nothing. It's impressive that we made a pile of sand do this. And I'm aware that my prompt engineering could improve a lot. But also, this isn't a usable tool yet.
I'm curious to try again, but wary of spending too much time "playing" here.
Granted, Gemini answers it now, however, this one left me shaking my head.
Or generate images of the founding fathers of US that at least to some degree resemble the actual ones?
I spent some time experimenting with Gemini 2.5, and its reasoning abilities blew me away. Here are few standout use cases that showcase its potential:
1. Counting Occurrences in a Video
In one experiment, I tested Gemini 2.5 with a video of an assassination attempt on then-candidate Donald Trump. Could the model accurately count the number of shots fired? This task might sound trivial, but earlier AI models often struggled with simple counting tasks (like identifying the number of "R"s in the word "strawberry").
Gemini 2.5 nailed it! It correctly identified each sound, outputted the timestamps where they appeared, and counted eight shots, providing both visual and audio analysis to back up its answer. This demonstrates not only its ability to process multimodal inputs but also its capacity for precise reasoning—a major leap forward for AI systems.
2. Identifying Background Music and Movie Name
Have you ever heard a song playing in the background of a video and wished you could identify it? Gemini 2.5 can do just that! Acting like an advanced version of Shazam, it analyzes audio tracks embedded in videos and identifies background music. I'm also not a big fan of people posting shorts without specifying the movie name. Gemini 2.5 solves that problem for you - no more searching for the movie name!
3. OCR Text Recognition
Gemini 2.5 excels at Optical Character Recognition (OCR), making it capable of extracting text from images or videos with precision. I asked the model to convert one of Khan Academy's handwritten visuals into a table format - and the text was precisely copied from the video into a neat little table!
4. Listen to Foreign News Media
The model can translate from one language to another and produce a good translation. I tested the recent official statement from Thai officials about the earthquake in Bangkok, as well as the latest news from a Marathi news channel. The model correctly translated both and output a news synopsis in the language of your choice.
5. Cricket Fans?
Sports fans and analysts alike will appreciate this use case! I tested Gemini 2.5 on an ICC T20 World Cup cricket match video to see how well it could analyze gameplay data. The results were incredible: the model accurately calculated scores, identified the number of fours and sixes, and even pinpointed key moments—all while providing timestamps for each event.
6. Webinar - Generate Slides from Video
Now this blew my mind - video webinars are generated from slide decks, with a person talking over the slides. Can we reverse the process? Given a video, can we ask AI to output the slide deck? Google Gemini 2.5 output 41 slides for a Stanford webinar!
Bonus: Humor Test
Finally, I put Gemini 2.5 through a humor test using a PG-13 joke from one of my favorite YouTube channels, Mike and Joelle. I wanted to see if the model could understand adult humor and infer punchlines.
At first, the model hesitated to spell out the punchline (perhaps trying to stay appropriate?), but eventually, it got there—and yes, it understood the joke perfectly!
With ChatGPT 4.5, I was excited.
With DeepSeek, I was excited (then later disappointed).
I know Gemini probably won't answer any medical question, even if you are a doctor. ChatGPT will.
I know I've been disappointed by the quality of Google's AI products. They are a backup at best.
(Everything's ok, I'm just testing it ;)
I don't see it on the API price list:
https://ai.google.dev/gemini-api/docs/pricing
I can imagine that it's not so interesting to most of us until we can try it with Cursor.
I look forward to doing so when it's out. That Aider bench mixed with the speed and a long context window that their other models are known for could be a great mix. But we'll have to wait and see.
More generally, it would be nice for these kinds of releases to also report speed and context window as separate benchmarks, or somehow include them in the score. A model that is 90% as good as the best but 10x faster is quite a bit more useful.
These might be hard to mix to an overall score but they're critical for understanding usefulness.
2.0 Pro is also still "experimental" so I agree with GP that it's pretty odd that they are "releasing" the next version despite never having gotten to fully releasing the previous version.
But I did mingle that with my knowledge of Google's history of releasing without really releasing these models, which, as you point out, isn't the case with this release.