Example: Ask it to draw a notepad with an empty tic-tac-toe, then tell it to make the first move, then you make a move, and so on.
You can also do very impressive information-conserving translations, such as changing the drawing style, but also stuff like "change day to night", or "put a hat on him", and so forth.
I get the feeling these models are quite restricted in resolution, and that more work in this space will let us do really wild things, such as asking a model to create an app step by step, first completely in images (essentially designing the whole app, text and all), then writing the code to reproduce it. It also means a model can take over from a really good diffusion model: even if the original generations are not good, it can continue "reasoning" on an external image.
Finally, once these models become faster, you can imagine a truly generative UI, where the model produces the next frame of the app you are using based on events sent to the LLM (which can do all the normal things like using tools, thinking, etc). However, I also believe that diffusion models can do some of this, in a much faster way.
I do not think that this is correct. Prior to this release, 4o would generate images by calling out to a fully external model (DALL-E). After this release, 4o generates images by calling out to a multi-modal model that was trained alongside it.
You can ask 4o about this yourself. Here's what it said to me:
"So while I’m deeply multimodal in cognition (understanding and coordinating text + image), image generation is handled by a linked latent diffusion model, not an end-to-end token-unified architecture."
>"So while I’m deeply multimodal in cognition (understanding and coordinating text + image), image generation is handled by a linked latent diffusion model, not an end-to-end token-unified architecture."
Models don't know anything about themselves. I have no idea why people keep doing this and expecting it to know anything more than a random con artist on the street.
Of course the model may hallucinate, but in this case it takes a few clicks in the dev tools to verify that this is not the case.
I don't know - or care to figure out - how OpenAI does their tool calling in this specific case. But moving tool calls to the end user is _monumentally_ stupid, for latency reasons if nothing else. If you centralize your function calls in a single model next to a fat pipe, you halve the latency of each call. I've never built, or seen, a function-calling agent that moves the API function calls to client-side JS.
But what do you mean you don't care? The thing you were responding to was literally a claim that it was a tool call rather than direct output
The thing we need to worry about is whether a Chinese company will drop an open source equivalent.
They can. Fine tune them on documents describing their identity, capabilities and background. Deepseek v3 used to present itself as ChatGPT. Not anymore.
>Like other AI models, I’m trained on diverse, legally compliant data sources, but not on proprietary outputs from models like ChatGPT-4. DeepSeek adheres to strict ethical and legal standards in AI development.
Yes, but many people expect the LLM to somehow self-reflect, to somehow describe how it feels from its first person point of view to generate the answer. It can't do this, any more than a human can instinctively describe how their nervous system works. Until recently, we had no idea that there are things like synapses, electric impulses, axons etc. The cognitive process has no direct access to its substrate/implementation.
If you fine-tune ChatGPT into saying that it's an LSTM, it will happily and convincingly insist that it is. But it's not determining this information in real time based on some perception during the forward pass.
I mean, there could be ways for it to do self-reflection by observing the running script: perhaps raising or lowering the computational cost of some steps, checking the timestamps of when it was doing stuff against when the GPU was hot, etc., and figuring out which process is itself (like making gestures in front of a mirror to see which person you are). Then it could read its own Python scripts or something. But this is like a human opening up their own skull and looking around in there. It's not direct first-person knowledge.
There are lots of clues that this isn't happening (including the obvious upscaling call after the image is generated - but also the fact that the loading animation replays if you refresh the page - and also the fact that 4o claims it can't see any image tokens in its context window - it may not know much about itself but it can definitely see its own context).
https://openai.com/index/hello-gpt-4o/
Plenty was written about this at the time.
And to answer your question, it's very clearly in the linked article. Not sure how you could have read it and missed:
> With GPT‑4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. Because GPT‑4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations.
The 4o model itself is multi-modal, it no longer needs to call out to separate services, like the parent is saying.
I could probably train an AI that replicates that perfectly.
Yes, it could. And even after training its data can be manipulated to output whatever: https://www.anthropic.com/news/mapping-mind-language-model
I did it via ChatGPT for the irony.
See this chat for example:
https://chatgpt.com/share/67e355df-9f60-8000-8f36-874f8c9a08...
By the way when I repeated your prompt it gave me another name for the module.
I also just confirmed via the API that it's making an out of band tool call
EDIT: And googling the tool name I see it's already been widely discussed on twitter and elsewhere
The name of the function shows up in: https://github.com/openai/glide-text2im which is where the model probably learned about it.
You can literally look at the JavaScript on the web page to see this. You've overcorrected so far in the wrong direction that you think anything the model says must be false, rather than imagining a distribution and updating or seeking more evidence accordingly
>EDIT: And googling the tool name I see it's already been widely discussed on twitter and elsewhere
I am so confused by this thread.
It's possible the tool is itself just gpt4o, wrapped for reliability or safety or some other reason, but it's definitely calling out at the model-output level
That's probably right. It allows them to just swap it out for DALL-E, including any tooling/features/infrastructure they have built up around image generation, and they don't have to update all their 4o instances to this model, which, who knows, may not be ready for other tasks anyway, or different enough to warrant testing before a rollout, or more expensive, etc.
Honestly it seems like the only sane way to roll it out if it is a multimodal descendant of 4o.
While LLM code generation is very much still a mixed bag, it has been a significant accelerator in my own productivity, and for the most part all I am using is o1 (via the OpenAI website), DeepSeek, and JetBrains' AI service (a Copilot clone). I'm eager to play with some of the other tooling available to VS Code users (such as Cline).
I don't know why everyone is so eager to "get to the fun stuff". Dev is supposed to be boring. If you don't like it maybe you should be doing something else.
I'm approaching 20 years of professional SWE experience myself. The boring shit is my bread and butter, and it's what pays the bills and then some. The business community trying to eliminate that should be seen as a very serious threat to all our futures.
AI is an extraordinary tool, if you can't make it work for you, you either suck at prompting, or are using the wrong tools, or are working in the wrong space. I've stated what I use, why not give those things a try?
I use a couple of different tools because they're each good at something that is useful to me. If Jetbrains AI service had a continue.dev/cline like interface and let me access all the models I want I might not deviate from that. But lucky for me work pays for everything.
You also seem awfully fixated on Copilot. How much exactly do you think your $12/month entitles you to?
What hubris. My god.
Of course it's impossible to explain that to thickheaded dinosaurs on HN who think they're better than everyone and god's gift to IQ.
Which society? Because lately it looks like the tech leaders are on a rampage to destroy the society I live in.
(tree Source/; echo; for file in $(find Source/ -type f ) ; do echo ======== $file: ; cat $file; done ) > /mnt/c/Users/you/Desktop/claude_out.txt #claudesource
Then drag that into the chat. You can also do stuff like pass in just the headers and a few relevant files:
(tree Source/; echo; for file in $(find Source/ -type f -name '*.h' ; echo Source/path/to/{file1,file2,file3}.cpp ) ; do echo ======== $file: ; cat $file; done ) > /mnt/c/Users/you/Desktop/claude_out.txt #claudeheaderselective
You can then just hit ctrl+r and type claude to find it again in shell history. Maybe that's too close to "writing scripts" for you, but if you are searching a large codebase effectively without AI you are constantly writing stuff like that, and now it reads it for you. Put the command itself into Claude too, and tell Claude to write a similar one for all the implementation files it finds it needs while looking at those relevant files and headers.
If you want a wonder tool that will navigate, handle the context window, and get the right files into context for huge projects, try Claude Code or other agents, but they are still undergoing rapid improvements. Cursor has started adding some of this too, but as a subscription calling into an expensive API, they cut costs by trying to minimize context.
They also now let you just point it at a GitHub project and pull in what it needs, or use tools built around the Model Context Protocol API etc. to let it browse and pull things in.
Doesn't matter: either use the tool that makes it easy and get less context, or recognize the limitations, don't fall for the marketing of ease, and get more context. You don't want to do additional work beyond what they sold you on, out of principle, but you are getting much less effective use by being irrationally ornery.
Lots of things don't match marketing.
Please sir step away from the keyboard now!
That is an absurd proposition and I hope I never get to use an app that dreams of the next frame. Apps are buggy as they are, I don't need every single action to be interpreted by LLM.
An existing example of this is that AI Minecraft demo and it's a literal nightmare.
I don't want an app that either works or does not work depending on the RNG seed, prompt and even data that's fed to it.
That's even ignoring all the absurd computing power that would be required.
I think these arguments would've been valid a decade ago for a lot of things we use today. And I'm not saying the classical software way of things needs to go away or even diminish, but I do think there are unique human-computer interactions to be had when the "VM" is in fact a deep neural network with very strong intelligence capabilities, and the input/output is essentially keyboard & mouse / video+audio.
No. Not at all. Those levels of abstractions – whether good, bad, everything in between – were fully understood through-and-through by humans. Having an LLM somewhere in the stack of abstractions is radically different, and radically stupid.
Anyway, I just think it's fun to make the thought experiment that if we were here 40 years ago, discussing today's advanced hardware and software architecture and how it interacts, very similar arguments could be used to say we should stick to single instructions on a CPU because you can actually step through them in a human understandable way.
While I think current AI can’t come close to anything remotely usable, this is a plausible direction for the future. Like you, I shudder.
> “DLSS Multi Frame Generation generates up to three additional frames per traditionally rendered frame, working in unison with the complete suite of DLSS technologies to multiply frame rates by up to 8X over traditional brute-force rendering. This massive performance improvement on GeForce RTX 5090 graphics cards unlocks stunning 4K 240 FPS fully ray-traced gaming.”
"Draw a picture of a full glass of wine, ie a wine glass which is full to the brim with red wine and almost at the point of spilling over... Zoom out to show the full wine glass, and add a caption to the top which says "HELL YEAH". Keep the wine level of the glass exactly the same."
I almost wonder if prompting it "similar to a full glass of beer" would get it shifted just enough.
People get upvoted for pedantry rather than furthering a conversation, e.g.
USA, but VPN set to exit in Canada at time of request (I think).
But aside from that, it would only be comparable if we could compare your prompts.
I switched over to the sora.com domain and now I have access to it.
I'm not a heavy user of AI or image generation in general, so is this also part of the new release or has this been fixed silently since last I tried?
However, when giving a prompt that requires the model to come up with the text itself, it still seems to struggle a bit, as can be seen in this hilarious example from the post: https://images.ctfassets.net/kftzwdyauwt9/21nVyfD2KFeriJXUNL...
It's a side effect of the entire model being differentiable - there is always some halfway point.
Remember the old internet adage that the fastest way to get a correct answer online is to post an incorrect one? I'm not entirely convinced this type of iterative gap finding and filling is really much different than natural human learning behavior.
Take some artisan, I'll go with a barber. The human person is not the best of the best, but still a capable barber, who can implement several styles on any head you throw at them. A client comes, describes certain style they want. The barber is not sure how to implement such a style, consults with master barber beside, that barber describes the technique required for that particular style, our barber in question comes and implements that style. Probably not perfectly as they need to train their mind-body coordination a bit, but the cut is good enough that the client is happy.
There was no traditional training with "gap finding and filling" involved. The artisan already possessed the core skill and knowledge required, was filled in on the particulars of the task at hand, and successfully implemented it. There was no looking at examples of finished work, no looking at examples of the process, no iterative learning by redoing the task a bunch of times.
So no, human learning, at least advanced human learning, is very much different from these techniques. Not that they are not impressive on their own, but let's be real here.
also we all know real people who fail to generalize, and overfit. copycats, potentially even with great skill, no creativity.
> While humans learn through example, they clearly need a lot fewer examples to generalize off of and reason against.
The human brain has been developing over millennia; machines start from zero. What if this few-example learning is just an emergent capability of any "learning function" given enough compute and training?
Also as for touch, you’re going to have a hard time convincing me that the amount of data from touch rivals the amount of content on the internet or that you just learn about mistakes one example at a time.
- Airplanes don't have wings like birds but can fly, and in some ways are superior to birds (in some ways not).
- Human brains may be doing some analogue of sample augmentation which gives you some multiple more equivalent samples of data to train on per real input state of environment. This is done for ml too.
- Whether that input data is text, or embodied is sort of irrelevant to cognition in general, but may be necessary for solving problems in a particular domain. (text only vs sight vs blind)
I think you're saying exactly what I'm saying. Human brains work differently from LLMs, and the OP comment that started this thread is claiming that they work very similarly. In some ways they do, but there are very clear differences, and while clarifying examples in the training set can improve human understanding and performance, it's pretty clear we're doing something beyond that - just from a power-efficiency perspective, humans consume far less energy for significantly more performance, and it's pretty likely we need less training data.
to be honest i dont really care if they work the same or not. I just like that they do work and find it interesting.
i dont even think peoples brains work the same as eachother. half of people cant even visually imagine an apple.
Neural networks seem to notice and remember very small details, as if they have access to signals from early layers. Humans often miss the minor details. There's probably a lot more signal normalization happening. That limits calorie usage and artifacts the features.
I don't think this is necessarily a property neural networks can't have; I think it could be engineered in. For now, though, it seems like we're making a lot of progress even without efficiency constraints, so nobody cares.
So maybe training for litmus tests isn’t the worst strategy in the absence of another entire internet of training data…
There is no one correct way to interpret 'full'. If you go to a wine bar and ask for a full glass of wine, they'll probably interpret that as a double. But you could also interpret it the way a friend would at home, which is about 2-3cm from the rim.
Personally I would call a glass of wine filled to the brim 'overfilled', not 'full'.
The prompts (some generated by ChatGPT itself, since it's instructing DALL-E behind the scenes) include phrases like "full to the brim" and "almost spilling over" that are not up to interpretation at all.
Searching in my favorite search engine for "full glass of wine", without even scrolling, three of the images are of wine glasses filled to the brim.
> It looks like there was an error when trying to generate the updated image of the clock showing 5:03. I wasn’t able to create it. If you’d like, you can try again by rephrasing or repeating the request.
A few times it did generate an image but it never showed the right time. It would frequently show 10:10 for instance.
Why does it sound like this isn't reasoning on images directly, but rather just DALL-E, as another commenter (coder543) said?
I can’t ever seem to get it to make the cow appear to be above the moon. Always literally covering it or to the side etc.
Using Dall-e / old model without too much effort (I'd call this "full".)
For Gemini it seems to me there's some kind of "retain old pixels" support in these models since simple image edits just look like a passthrough, in which case they do maintain your identity.
Got it in two requests, https://chatgpt.com/share/67e41576-8840-8006-836b-f7358af494... for the prompts.
That sounds really interesting. Are there any write-ups how exactly this works?
> The system uses an autoregressive approach — generating images sequentially from left to right and top to bottom, similar to how text is written — rather than the diffusion model technique used by most image generators (like DALL-E) that create the entire image at once. Goh speculates that this technical difference could be what gives Images in ChatGPT better text rendering and binding capabilities.
https://www.theverge.com/openai/635118/chatgpt-sora-ai-image...
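To make the raster-order idea concrete, here's a rough sketch of what that kind of sampling loop generally looks like (the model, grid size, and decoder are stand-ins for illustration, not OpenAI's actual implementation):

```python
import torch

# Hypothetical sketch of raster-order autoregressive image generation:
# the model predicts one image token at a time, left to right, top to bottom,
# and a separate decoder would turn the finished token grid back into pixels.

def sample_image_tokens(model, prompt_tokens, grid_h=32, grid_w=32):
    tokens = list(prompt_tokens)          # text prompt, already tokenized
    image_tokens = []
    for _ in range(grid_h * grid_w):      # one token per grid cell, in raster order
        logits = model(torch.tensor([tokens + image_tokens]))[0, -1]
        probs = torch.softmax(logits, dim=-1)
        image_tokens.append(torch.multinomial(probs, 1).item())
    return image_tokens                   # hand off to an image decoder for pixels
```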
Also wonder if you'd get better results in generating something like blender files and using its engine to render the result.
The general gist is that you have some kind of adapter layers/model that can take an image and encode it into tokens. You then train the model on a dataset that has interleaved text and images. Could be webpages, where images occur in-between blocks of text, chat logs where people send text messages and images back and forth, etc.
The LLM gets trained more-or-less like normal, predicting next token probabilities with minor adjustments for the image tokens depending on the exact architecture. Some approaches have the image generation be a separate "path" through the LLM, where a lot of weights are shared but some image token specific weights are activated. Some approaches do just next token prediction, others have the LLM predict the entire image at once.
As for encoding-decoding, some research has used things as simple as Stable Diffusion's VAE to encode the image, split up the output, and do a simple projection into token space. Others have used raw pixels. But I think the more common approach is to have a dedicated model trained at the same time that learns to encode and decode images to and from token space.
For the latter approach, this can be a simple model, or it can be a diffusion model. For encoding you do something like a ViT. For decoding you train a diffusion model conditioned on the tokens, throughout the training of the LLM.
For the diffusion approach, you'd usually do post-training on the diffusion decoder to shrink down the number of diffusion steps needed.
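For the "simple projection" flavor mentioned a couple of paragraphs up, here's a minimal sketch of what that could look like, assuming diffusers' AutoencoderKL and a made-up LLM hidden size:

```python
import torch
import torch.nn as nn
from diffusers import AutoencoderKL

# Sketch: encode an image with a pretrained SD VAE, then linearly project each
# latent position into the LLM's embedding space so it can sit alongside text
# tokens. The hidden size (4096) is an assumption; the projection would normally
# be trained jointly with the LLM.

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
proj = nn.Linear(4, 4096)                  # 4 latent channels -> LLM hidden size

@torch.no_grad()
def image_to_soft_tokens(pixels):          # pixels: [B, 3, H, W], scaled to [-1, 1]
    latents = vae.encode(pixels).latent_dist.sample()           # [B, 4, H/8, W/8]
    b, c, h, w = latents.shape
    patches = latents.permute(0, 2, 3, 1).reshape(b, h * w, c)  # raster order
    return proj(patches)                   # [B, h*w, 4096] "image tokens"
```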
The real crux of these models is the dataset. Pretraining on the internet is not bad, since there's often good correlation between the text and the images. But there's not really good instruction datasets for this. Like, "here's an image, draw it like a comic book" type stuff. Given OpenAI's approach in the past, they may have just bruteforced the dataset using lots of human workers. That seems to be the most likely approach anyway, since no public vision models are quite good enough to do extensive RL against.
And as for OpenAI's architecture here, we can only speculate. The "loading from top to bottom, from a blurry image" is either a direct result of their architecture or a gimmick to slow down requests. If the former, it means they are able to get a low resolution version of the image quickly, and then slowly generate the higher resolution "in order." Since it's top-to-bottom, that implies token-by-token decoding. My _guess_ is that the LLM's image token predictions are only "good enough." So they have a small, quick decoder take those and generate a very low resolution base image. Then they run a stronger decoding model, likely a token-by-token diffusion model. It takes as condition the image tokens and the low resolution image, and diffuses the first patch of the image. Then it takes as condition the same plus the decoded patch, and diffuses the next patch. And so forth.
A mixture of approaches like that allows the LLM to be truly multi-modal without the image tokens being too expensive, and the token-by-token diffusion approach helps offset memory cost of diffusing the whole image.
I don't recall if I've seen token-by-token diffusion in a published paper, but it's feasible and is the best guess I have given the information we can see.
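As pure pseudocode, that guess might look something like this (every function and argument here is a stand-in; nothing is confirmed by OpenAI):

```python
# Purely illustrative sketch of the "token-by-token diffusion decoder" guess above.
# diffuse_patch stands for a hypothetical conditional diffusion step, not a real API.

def decode_image(image_tokens, low_res_preview, n_patches, diffuse_patch):
    decoded_patches = []
    for i in range(n_patches):                     # top-to-bottom patch order
        patch = diffuse_patch(
            condition_tokens=image_tokens,         # the LLM's image tokens
            condition_preview=low_res_preview,     # quick low-resolution base image
            condition_patches=decoded_patches,     # everything decoded so far
            patch_index=i,
        )
        decoded_patches.append(patch)
    return decoded_patches                         # stitch into the final image
```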
EDIT: I should note, I've been "fooled" in the past by OpenAI's API. When o* models first came out, they all behaved as if the output were generated "all at once." There was no streaming, and in the chat client the response would just show up once reasoning was done. This led me to believe they were doing an approach where the reasoning model would generate a response and refine it as it reasoned. But that's clearly not the case, since they enabled streaming :P So take my guesses with a huge grain of salt.
When you randomly pick the locations, they found it worked okay, but when doing it in raster order (left to right, top to bottom) they found it didn't work as well. We tried it for music and found it was vulnerable to compounding error and lots of oddness relating to the fragility of continuous-space CFG.
I built this exact thing last month, demo: https://universal.oroborus.org (not viable on phone for this demo, fine on tablet or computer)
Also see discussion and code at: http://github.com/snickell/universal
I wasn't really planning to share/release it today, but, heck, why not.
I started with bitmap-style generative image models, but because they are still pretty bad at text (even this, although it’s dramatically better), for early-2025 it’s generating vector graphics instead. Each frame is an LLM response, either as an svg or static html/css. But all computation and transformation is done by the LLM. No code/js as an intermediary. You click, it tells the LLM where you clicked, the LLM hallucinates the next frame as another svg/static-html.
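The core loop is tiny. Here's a minimal sketch, assuming the Anthropic Python SDK; the prompt and model string are illustrative, not the project's exact code:

```python
import anthropic

# Minimal sketch of a "generative UI" loop: send the click event plus the previous
# frame to the model, get back the next frame as a single SVG document, render it,
# and repeat. No application code runs in between; the LLM *is* the app.

client = anthropic.Anthropic()

def next_frame(click_x, click_y, prev_frame_svg):
    msg = client.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=4000,
        messages=[{
            "role": "user",
            "content": (
                f"Previous frame:\n{prev_frame_svg}\n\n"
                f"The user clicked at ({click_x}, {click_y}). "
                "Return only the next frame as a single SVG document."
            ),
        }],
    )
    return msg.content[0].text
```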
If it ran 50x faster it’d be an absolutely jaw dropping demo. Unlike "LLMs write code", this has depth. Like all programming, the "LLMs write code" model requires the programmer or LLM to anticipate every condition in advance. This makes LLM written "vibe coded" apps either gigantic (and the llm falls apart) or shallow.
In contrast, as you use universal, you can add or invent features ranging from small to big, and it will fill in the blanks on demand, fairly intelligently. If you don't like what it did, you can critique it, and the next frame improves.
It's agonizingly slow in 2025, but much smarter and, in weird ways, less error-prone than using the LLM to generate code that you then run: just run the computation via the LLM itself.
You can build pretty unbelievable things (with hallucinated state, granted) with a few descriptive sentences, far exceeding the capabilities you can "vibe code" with the same description. And it never gets lost in a rat's nest of self-generated garbage code because… there is no code to get lost in.
Code is a medium with a surprisingly strong grain. This demo is slow, but SO much more flexible and personally adaptable than anything I've used where the logic is implemented via a programming language.
I don’t love this as a programmer, but my own use of the demo makes me confident that programming languages as a category will have a shelf life if LLM hardware gets fast, cheap and energy efficient.
I suspect LLMs will generate not programming language code, but direct wasm or just machine code on the fly for things that need to run faster than it can draw a frame, but core logic will move out of programming languages (not even LLM-written code). Maybe similar to the way we bind to low-level fast languages, but a huge percentage of "business" logic is written in relatively slower languages.
FYI, I may not be able to afford the credits if too many people visit; I put $1000 of credits on this, we'll see if that lasts. This is Claude 3.7: I tried everything else, and only Claude had the visual intelligence today. IMO this is a much more compelling glimpse of the future than coding models. Unfortunately, generating an SVG per click is pricey; each click/frame costs me about $0.05. I'll fund this as far as I can so folks can play with it.
Anthropic? You there? Wanna throw some credits at an open source project doing something that literally only works on Claude today? Not just better, but "only Claude 3.7 can show this future today". I'd love for lots more people to see the demo, but I really could use an in-kind credit donation to make this viable. If anyone at Anthropic is inspired and wants to hook me up: snickell@alumni.stanford.edu. Very happy to rep Claude 3.7 even more than I already do.
I think it’s great advertising for Claude. I believe the reason Claude seems to do SO much better at this task is, one, it shows far greater spatial intelligence, and two, I suspect they are the only state-of-the-art model intentionally training on SVG.
I don't think the project would have gotten this far without openrouter (because: how else would you sanely test on 20+ models to be able to find the only one that actually worked?). Without openrouter, I think I would have given up and thought "this idea is too early for even a demo", but it was easy enough to keep trying models that I kept going until Claude 3.7 popped up.
If you end up taking this further and self hosting a model you might actually achieve a way faster “frame rate” with speculative decoding since I imagine many frames will reuse content from the last. Or maybe a DSL that allows big operations with little text. E.g. if it generates HTML/SVG today then use HAML/Slim/Pug: https://chatgpt.com/share/67e3a633-e834-8003-b301-7776f76e09...
For example, this specifies that #my-div should be replaced with the value from the previous frame (which itself might have been cached): <div id="my-div" data-use-cached></div>
This lowers the render time /substantially/: for simple changes like "clicked here, pop open a menu" it can do it in ~10s, vs a full-frame render which might be 2 minutes (obviously varies with how much is on the screen!).
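Roughly, the splice could be done on the server before rendering like this (a sketch assuming BeautifulSoup; not the project's actual code):

```python
from bs4 import BeautifulSoup

# Sketch of the caching idea: wherever the new frame marks an element with
# data-use-cached, copy that element over from the previous frame by id.
# The attribute name matches the example above; the rest is an assumption.

def splice_cached(prev_frame_html, new_frame_html):
    prev = BeautifulSoup(prev_frame_html, "html.parser")
    new = BeautifulSoup(new_frame_html, "html.parser")
    for placeholder in new.find_all(attrs={"data-use-cached": True}):
        element_id = placeholder.get("id")
        if not element_id:
            continue
        cached = prev.find(id=element_id)
        if cached is not None:
            placeholder.replace_with(cached)   # reuse the previous frame's subtree
    return str(new)
```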
I think using HAML etc is an interesting idea, thanks for suggesting it, that might be something I'll experiment with.
The challenge I'm finding is that "fancy" also has a way of confusing the LLM. E.g. I originally had the LLM produce literal unified diffs between frames. I reasoned it had seen plenty of diffs of HTML in its training data set. It could actually do this, BUT image quality and intelligence were notably affected.
Part of the problem is that at the moment (well 1mo ago when I last benchmarked), only Claude is "past the bar" for being able to do this particular task, for whatever reason. Gemini Flash is the second closest. Everything else (including 4o, 4.5, o1, deepseek, etc) are total wipeouts.
What would be really amazing is if, say, Llama 4 turns out to be good in the visual domain the way Claude is, and you can run it on one of the LLM-on-silicon vendors (Cerebras, Groq, etc.) to get 10x the token rate.
LMK if you have other ideas, thanks for thinking about this and taking a look!
You can watch "sped up" past sessions by other people who used this demo here, which is kind of like a demo video: https://universal.oroborus.org/gallery
But the gallery feature isn't really there today; it shows all the "one-click and bounce" sessions, and it's hard to find the signal in the noise.
I'll probably submit a "Show HN" when I have the gallery more together, and I think it's a great idea to pick a multi-click gallery sequence and upload it as a video.
> had to charge you a few dimes
s/you/openrouter/: ty to openrouter for donating a significant chunk of credits a couple hours ago.
Really appreciate the feedback on needing a video. I had a sense this was the most important "missing piece", but this will give me the motivation to accomplish what is (to me) a relatively boring task, compared to hacking out more features.
- Getting the instant-replay gallery sorted to make it usable, maybe sorting via likes.
- Selecting a couple interesting sessions from the gallery and turning them into a short video
- Making sure I have enough credits lined up (hopefully donations!) to survive a "Show HN".
Nobody has really decided on a name.
Also, chain of thought is somewhat different from chain-of-thought reasoning, so maybe throw in "multimodal chain-of-thought reasoning".
You can do that with diffusion, too. Just lock the parameters in ComfyUi.
4o is a game changer. It's clearly imperfect, but its operating modalities are clearly superior to everything else we have seen.
Have you seen (or better yet, played with) the whiteboard examples? Or the examples of it taking characters out of reflections and manipulating them? The prompt adherence, text layout, and composing capabilities are unreal to the point this looks like it completely obsoletes inpainting and outpainting.
I'm beginning to think this even obsoletes ComfyUI and the whole space of open source tools once the model improves. Natural language might be able to accomplish everything outside of fine adjustments, but if you can also supply the model with reference images and have it understand them, then it can do basically everything. I haven't bumped into anything that makes me question this yet.
They just need to bump the speed and the quality a little. They're back at the top of image gen again.
I'm hoping the Chinese or another US company releases an open model capable of these behaviors. Because otherwise OpenAI is going to take this ball and run far ahead with it.
I do think they run a "traditional" upscaler on the transformer output since it seems to sometimes have errors similar to upscalers (misinterpreted pixels), so probably the current decoded resolution is quite low and hopefully future models like GPT-5 will improve on this.
With current GPU technology, this system would need its own Dyson sphere.
I'm super excited for all the free money and data our new AI written apps will be giving away.
https://chatgpt.com/share/67e32d47-eac0-8011-9118-51b81756ec...
https://chatgpt.com/share/67e34558-5244-8004-933a-23896c738b...
For a start, the image is wrong, and I also know I can make more requests, because that's what tools are for. It's like a passive-aggressive suggestion that I made the AI go out of its way to do me a favor.
https://mordenstar.com/blog/chatgpt-4o-images
It's definitely impressive though once again fell flat on the ability to render a 9-pointed star.
[1] https://techcrunch.com/wp-content/uploads/2024/03/pasted-ima...
Then I asked for some changes:
> That's almost perfect! Retain this style and the elements, but adjust the text to read:
> [refined text]
> And then below it should add the location and date details:
> [location details]
I pointed these issues out to give it a second go and got something way worse. This still feels like little more than a fun toy.
Then google:
> Gemini 2.5: Our most intelligent AI model
> Introducing Gemini 2.0 | Our most capable AI model yet
I could go on forever. I hope this trend dies and apple starts using something effective so all the other companies can start copying a new lexicon.
> Why would they publish a model that is not their most advanced model?
I dunno, I'm not sitting in the OpenAI meetings. That is why they need to tell us what they are doing - it is easy to imagine them releasing something that isn't their best model ever and so they clarify that this is, in fact, the new hotness.
Just a consequence of how much time and money it takes to train a new foundation model. It's not going to happen every other week. When it does, it is reasonable to announce it with "Announcing our most powerful model yet."
And no, not all models are intended to push the frontier in terms of benchmark performance, some are just fast and cheap.
Obligatory Jobs monologue on marketing people:
Apple is more of a hardware company. Still, Cook does have a few big wins under his belt: M-series ARM chips on Macs, Airpods, Apple watch, Apple pay.
Hotwheels: Fast. Furious. Spectacular.
Which is especially relevant when it's not obvious which product is the latest and best just looking at the names. Lots of tech naming fails this test from Xbox (Series X vs S) to OpenAI model names (4o vs o1-pro).
Here they claim 4o is their most capable image generator which is useful info. Especially when multiple models in their dropdown list will generate images for you.
<Product name>: Our most <superlative> <thing> yet|ever.
This one isn't even my biggest gripe. If I could eliminate any word from the English language forever, it would be "effortlessly".
No API yet, and given the slowness I imagine it will cost much more than the $0.03+/image of competitors.
When I first read the parent comment, I thought, maybe this is a long-term architecture concern...
But your message reminded me that we've been here before.
Gemini "integrates" Imagen 3 (a diffusion model) only via a tool that Gemini calls internally with the relevant prompt. So it's not a true multimodal integration, as it doesn't benefit from the advanced prompt understanding of the LLM.
Edit: Apparently Gemini also has an experimental native image generation ability.
The (no longer, I guess) industry-leading features people actually want are hidden away in some obscure “AI studio” with horrible usability, while the headline Gemini app still often refuses to do anything useful for me. (Disclaimer: I last checked a couple of months ago, after several more rounds of mild amusement/great frustration.)
They haven't been focusing attention on images because the most used image models have been open source. Now they might have a target to beat.
That's overly pessimistic. Diffusion models take an input and produce an output. It's perfectly possible to auto-regressively analyze everything up to the image, use that context to produce a diffusion image, and incorporate the image into subsequent auto-regressive shenanigans. You'll preserve all the conditional probability factorizations the LLM needs while dropping a diffusion model in the middle.
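In sketch form, the hybrid pipeline described above looks roughly like this (all names here are hypothetical stand-ins, not a real API):

```python
# Hypothetical sketch: the LLM reads everything up to the image, a diffusion model
# renders from the LLM's conditioning, and the result is encoded back into tokens
# so later autoregressive steps can "see" it. llm, diffusion, and image_encoder
# are placeholders for whatever components such a system would actually use.

def hybrid_step(llm, diffusion, image_encoder, context_tokens):
    conditioning = llm.hidden_states(context_tokens)   # context up to the image
    image = diffusion.sample(conditioning)             # diffusion fills in the pixels
    image_tokens = image_encoder(image)                # fold the image back into context
    return context_tokens + image_tokens               # continue autoregressively
```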
The results are ground breaking in my opinion. How much longer until an AI can generate 30 successive images together and make an ultra realistic movie?
im not going to get super hyperbolic and histrionic about “entitlement” and stuff like that, but… literally this technology did not exist until like two years ago, and yet i hear this all the time. “oh this codegen is pretty accurate but it’s slow”, “oh this model is faster and cheaper (oh yeah by the way the results are bad, but hey it’s the cheapest so it’s better)”. like, are we collectively forgetting that the whole point of any of this is correctness and accuracy? am i off-base here?
the value to me of a demonstrably wrong chat completion is essentially zero, and the value of a correct one that anticipates things i hadn’t considered myself is nearly infinite. or, at least, worth much, much more than they are charging, and even _could_ reasonably charge. it’s like people collectively grouse about low quality ai-generated junk out of one side of their mouths, and then complain about how expensive the slop is out of the other side.
hand this tech to someone from 2020 and i guarantee you the last thing you’d hear is that it’s too slow. and how could it be? yeah, everyone should find the best deals / price-value frontier tradeoff for their use case, but, like… what? we are all collectively devaluing that which we lament is being devalued by ai by setting such low standards: ourselves. the crazy thing is that the quickly-generated slop is so bad as to be practically useless, and yet it serves as the basis of comparison for… anything at all. it feels like that “web-scale /dev/null” meme all over again, but for all of human cognition.
The animation is a lie. The new 4o with "native" image generating capabilities is a multi-modal model that is connected to a diffusion model. It's not generating images one token at a time, it's calling out to a multi-stage diffusion model that has upscalers.
You can ask 4o about this yourself, it seems to have a strong understanding of how the process works.
A model may not have many facts about itself, but it can definitely see what is inside of its own context, and what it sees is a call to an image generation tool.
Finally, and most convincingly, I can't find a single official source where OpenAI claims that the image is being generated pixel-by-pixel inside of the context window.
This option is not exposed in ChatGPT, it only uses vivid.
"Typical" AI images are this blend of the popular image styles of the internet. You always have a bit of digital drawing + cartoon image + oversaturated stock image + 3d render mixed in. Models trained on just one of these work quite well, but for a generalist model this blend of styles is an issue
Asian artists don't color this way though; those neon oversaturated colors are a Western style.
(This is one of the easiest ways to tell a fake-anime western TV show, the colors are bad. The other way is that action scenes don't have any impact because they aren't any good at planning them.)
Was anyone else surprised how slow the images were to generate in the livestream? This seems notably slower than DALLE.
I ran Stable Diffusion for a couple of years (maybe?, time really hasn't made sense since 2020) on my dual-3090 rendering server. I built the server originally for crypto heating my office in my 1820s colonial in upstate NY; then, when I was planning to go back to college (got accepted into a university in England), I switched its focus to Blender/UE4 (then 5), then eventually to AI image gen. So I've never minded 20 seconds for an image. If I needed dozens of options to pick the best, I was going to click start and grab a cup of coffee, come back and maybe it was done. Even if it took 2 hours, it is still faster than when I used to have to commission art for a project.
I grew out of Stable Diffusion, though, because the learning curve beyond grabbing a decent checkpoint and clicking start was actually really high (especially compared to LLMs that seemed to "just work"). After going through failed training after failed fine-tuning using tutorials that were a couple days out of date, I eventually said, fuck it, I'm paying for this instead.
All that to say - if you are using GenAI commercially, even if an image or a block of code took 30 minutes, it's still WAY cheaper than a human. That said, eventually a professional will be involved, and all the AI slop you generated will be redone, which will still cost a lot, but you get to skip the back and forth figuring out style/etc.
Currently, my prompts seem to be going to the latter still, based on e.g. my source image being very obviously looped through a verbal image description and back to an image, compared to gemini-2.0-flash-exp-image-generation. A friend with a Plus plan has been getting responses from either.
The long-term plan seems to be to move to 4o completely and move Dall-E to its own tab, though, so maybe that problem will resolve itself before too long.
I get the intent to abstract it all behind a chat interface, but this seems a bit too much.
the native just.. works
I'm not saying that it's not true, it's just "wait and see" before you take their word as gold.
I think MS's claim on their quantum computing breakthrough is the latest form of this.
just tried it, prompt adherence and quality is... exactly what they said, it extremely impressive
I guess here's an example of a prompt I would like to see:
A flying spaghetti monster with a metal colander on its head flying above New York City saving the world from and very very evil Pope.
I'm not anti/pro spaghetti monster or catholicism. But I can visualize it clearly in my head what that prompt might look like.
I will also not give them my email address just to try it out.
And to prove it they only need your email address, birth date, credit card number, and rights to first born child?
I was blown away when they showed this many months ago, and found it strange that more people weren't talking about it.
This is much more precise than the Gemini one that just came out recently.
Some simply dislike everything OpenAI. Just like everything Musk or Trump.
How much longer until an AI that can generate 30 frames with this quality and make a movie?
About 1.5 years ago, I thought AI would eventually allow anyone with an idea to make a Hollywood quality movie. Seems like we're not too far off. Maybe 2-3 more years?
Other image generators I've used lately often produced pretty good images of humans, as well [0]. It was DALLE that consistently generated incredibly awful images. Glad they're finally fixing it. I think what most AI image generators lack the most is good instruction following.
[0] YandexArt for the first prompt from the post: https://imgur.com/a/VvNbL7d The woman looks okay, but the text is garbled, and it didn't fully follow the instruction.
https://images.ctfassets.net/kftzwdyauwt9/7M8kf5SPYHBW2X9N46...
OpenAI's human faces look *almost* real.
Not sure, I tried a few generations, and it still produces those weird deformed faces, just like the previous generation: https://imgur.com/a/iKGboDH Yeah, sometimes it looks okay.
YandexArt for comparison: https://imgur.com/a/K13QJgU
For drawings, NovelAI's models are way beyond the uncanny valley now.
If that's best of 8, I'd love to see the outtakes.
To think that a few years ago we had dreamy pictures with eyes everywhere. And not long ago we were always identifying the AI images by the 6 fingered people.
I wonder how well the physics is modeled internally. E.g. if you prompt it to model some difficult ray tracing scenario (a box with a separating wall and a light in one of the chambers which leaks through to the other chamber etc)?
Or if you have a reflective chrome ball in your scene, how well does it understand that the image reflected must be an exact projection of the visible environment?
EDIT: Ok it works in Sora, and my jaw dropped
For example, I asked it to render a few lines of text on a medieval scroll, and it basically looked like a picture of a gothic font written onto a background image of a scroll
So it's ironic in this sense, that OpenAI blocking generation of copyrighted characters means that it's more in compliance with copyright laws than most fan artists out there, in this context. If you consider AI training to be transformative enough to be permissible, then they are more copyright-respecting in general.
Source: https://lawsoup.org/legal-guides/copyright-protecting-creati...
It did Dragon Ball Z here:
https://old.reddit.com/r/ChatGPT/comments/1jjtcn9/the_new_im...
Rick and Morty:
https://old.reddit.com/r/ChatGPT/comments/1jjtcn9/the_new_im...
South Park:
https://old.reddit.com/r/ChatGPT/comments/1jjyn5q/openais_ne...
It is incredibly difficult to develop an art style, then get the model to generate a collection of different images in that unique art style. I couldn't work out how to do it.
I also couldn't work out how to illustrate the same characters or objects in different contexts.
AI seems great for one off images you don't care much about, but when you need images to communicate specific things, I think we are still a long way away.
Your evaluation, done a few weeks ago, isn't relevant anymore.
I look forward to giving it a try, but I don't have high hopes.
Character consistency means that these models could now theoretically illustrate books, as one example.
Generating UIs seems like it would be very helpful for any app design or prototyping.
We're largely past the days of 7 fingered hands - text remains one of the tell-tale signs.
I'm excited about this for adding images to those interactive stories.
It has nothing to do with circumventing the cost of artists or writers: regardless of cost, no one can put out a story and then rewrite it based on whatever idea pops into every reader's mind for their own personal main character.
It's a novel experience that only a "writer" that scales by paying for an inanimate object to crunch numbers can enable.
Similarly no artist can put out a piece of art for that story and then go and put out new art bespoke to every reader's newly written story.
-
I think there's this weird obsession with framing these tools about being built to just replace current people doing similar things. Just speaking objectively: the market for replacing "cheeky expensive artists" would not justify building these tools.
The most interesting applications of this technology are being able to do things that are simply not possible today, even if you have all the money in the world.
And for the record, I'll be ecstatic for the day an AI can reach my level of competency in building software. I've been doing it since I was a child because I love it, it's the one skill I've ever been paid for, and I'd still be over the moon because it'd let me explore so many more ideas than I alone can ever hope to build.
You realize that almost weekly we have new AI models coming out that are better and better at programming? It just happened that the image generation is an easier problem than programming. But make no mistake, AI is coming for us too.
That's the price of automating everything.
Asking it to draw the Balkans map in Tolkien style: this is actually really impressive. The geography is more or less completely correct; borders and country locations are wrong, but it feels like something I could get it to fix.
> I wasn't able to generate the map because the request didn't follow content policy guidelines. Let me know if you'd like me to adjust the request or suggest an alternative way to achieve a similar result.
Are you in the US?
...why are we living in such a retarded sci-fi age
Edit: Eventually it showed up
Generate a photo of a lake taken by a mobile phone camera. No hands or phones in the photo, just the lake.
The hand holding a phone is always there :D
The general idea of indistinguishable real/fake images; yeah
> if its not unconvincing, its soulless (only because I was told in advance that its AI)
> if its not soulless then its using too much energy
You don't even need deepfakes. https://www.newsweek.com/doug-mastriano-pennsylvania-senator...
The disaster scenario is already here.
Over 10 years it might even out, if you're lucky (historically it's taken much longer), but 10 years is a long time to wait in your career.
Theme: Educational Scientific Visualization – Ultra Realistic Cutaways
Color: Naturalistic palettes that reflect real-world materials (e.g., rocky grays, soil browns, fiery reds, translucent biological tones) with high contrast between layers for clarity
Camera: High-resolution macro and sectional views using a tilt-shift camera for extreme detail; fixed side angles or dynamic isometric perspective to maximize spatial understanding
Film Stock: Hyper-realistic digital rendering with photogrammetry textures and 8K fidelity, simulating studio-grade scientific documentation
Lighting: Studio-quality three-point lighting with soft shadows and controlled specular highlights to reveal texture and depth without visual noise
Vibe: Immersive and precise, evoking awe and fascination with the inner workings of complex systems; blends realism with didactic clarity
Content Transformation: The input is transformed into a hyper-detailed, realistically textured cutaway model of a physical or biological structure—faithful to material properties and scale—enhanced for educational use with visual emphasis on internal mechanics, fluid systems, and spatial orientation
Examples:
1. A photorealistic geological cutaway of Earth showing crust, tectonic plates, mantle convection currents, and the liquid iron core with temperature gradients and seismic wave paths.
2. An ultra-detailed anatomical cross-section of the human torso revealing realistic organs, vasculature, muscular layers, and tissue textures in lifelike coloration.
3. A high-resolution cutaway of a jet engine mid-operation, displaying fuel flow, turbine rotation, air compression zones, and combustion chamber intricacies.
4. A hyper-realistic underground slice of a city showing subway lines, sewage systems, electrical conduits, geological strata, and building foundations.
5. A realistic cutaway of a honeybee hive with detailed comb structures, developing larvae, worker bee behavior zones, and active pollen storage processes.
One area where it does not work well at all is modifying photographs of people's faces.* Completely fumbles if you take a selfie and ask it to modify your shirt, for example.
* = unless the people are in the training set
Sounds like it may be a safety thing that's still getting figured out
The Americas are quite a bit larger than the USA, so I disagree with 'american' being a word for people and things from mainland USA. Usian seems like a reasonable derivative of USA and US, similar to how mexican follows from Mexico and Estados Unidos Mexicanos.
Might take a day or two before it's available in general.
It seems like an odd way to name/announce it, there's nothing obvious to distinguish it from what was already there (i.e. 4o making images) so I have no idea if there is a UI change to look for, or just keep trying stuff until it seems better?
Truly infuriating, especially when it's something like this that makes it tough to tell if the feature is even enabled.
https://news.ycombinator.com/item?id=42628742
The new one can.
https://chatgpt.com/share/67e36dee-6694-8010-b337-04f37eeb5c...
The glaring issue for the older image generators is how it would proudly proclaim to have presented an image with a description that has almost no relation to the image it actually provided.
I'm not sure if this update improves on this aspect. It may create the illusion of awareness of the picture by having better prompt adherence.
It's much better than prior models, but still generates hands with too many fingers, bodies with too many arms, etc.
I see errors like this in the console:
ewwsdwx05evtcc3e.js:96 Error: Could not fetch file with ID file_0000000028185230aa1870740fa3887b?shared_conversation_id=67e30f62-12f0-800f-b1d7-b3a9c61e99d6 from file service
    at iehdyv0kxtwne4ww.js:1:671
    at async w (iehdyv0kxtwne4ww.js:1:600)
    at async queryFn (iehdyv0kxtwne4ww.js:1:458)
Caused by: ClientRequestMismatchedAuthError: No access token when trying to use AuthHeader
https://chatgpt.com/share/67e319dd-bd08-8013-8f9b-6f5140137f...
In the web app I see:
Your name, custom instructions, and any messages you add after sharing stay private. Learn more
I'm excited to see what a Flux 2 can do if it can actually use a modern text encoder.
The image generators used by creatives will not be text-first.
"Dragon with brown leathery scales with an elephant texture and 10% reflectivity positioned three degrees under the mountain, which is approximately 250 meters taller than the next peak, ..." is not how you design.
Creative work is not 100% dice rolling in a crude and inadequate language. Encoding spatial and qualitative details is impossible. "A picture is worth a thousand words" is an understatement.
"""
That's a fun idea—but generating an image with 999,999 anime waifus in it isn't technically possible due to visual and processing limits. But we can get creative.
Want me to generate:
1. A massive crowd of anime waifus (like a big collage or crowd scene)?
2. A stylized representation of “999999 anime waifus” (maybe with a few in focus and the rest as silhouettes or a sea of colors)?
3. A single waifu with a visual reference to the number 999999 (like a title, emblem, or digital counter in the background)?
Let me know your vibe—epic, funny, serious, chaotic?
"""
Controlnet has been the obvious future of image-generation for a while now.
Automation tools are always more powerful as a force multiplier for skilled users than a complete replacement. (Which is still a replacement on any given task scope, since it reduces the number of human labor hours — and, given any elapsed time constraints, human laborers — needed.)
We might find that the entire "studio system" is a gross inefficiency and that individual artists and directors can self-publish like on Steam or YouTube.
Sora is one of the worst video generators. The Chinese have really taken the lead in video with Kling, Hailuo, and the open source Wan and Hunyuan.
Wan with LoRAs will enable real creative work. Motion control, character consistency. There's no place for an OpenAI Sora type product other than as a cheap LLM add-in.
Even when I told it to transform it into a text description, then draw that text description, my earlier attempt at a cat picture meant that the description was too close to a banned image...
I can't help but feel like openAI and grok are on unhelpful polar opposites when it comes to moderation.
> Developers will soon be able to generate images with GPT‑4o via the API, with access rolling out in the next few weeks.
That's it, folks. Tens of thousands of so-called "AI" image generator startups have been obliterated, taking digital artists with them, all reduced to near zero.
Now you have a widely accessible meme generator with the name "ChatGPT".
The last task is for an open weight model that competes against this and is faster and all for free.
ChatGPT has already had that via DALL-E. If it didn't kill those startups when that happened, this doesn't fundamentally change anything. Now it's got a new image-gen model which, like DALL-E 3 when it came out, is competitive with or ahead of other SotA base models using just text prompts (the simplest generation workflow), but is both more expensive and less adaptable to more involved workflows than the tools anyone more than a casual user (whether using local tools or hosted services) is relying on. This is station-keeping for OpenAI, not a meaningful change in the landscape.
It's not 'just' a new model a la Imagen 3. This is 'what if GPT could transform images nearly as well as text?', and that opens up a lot of possibilities. It's definitely a meaningful change.
In the coming days, people will turn all sorts of images into anime, for example historical images: https://x.com/keysmashbandit/status/1904764224636592188
Trying out 4o image generation... It doesn't seem to support this use-case at all? I gave it an image of myself and asked it to turn me into a wizard, and it generated something that doesn't look like me in the slightest. On a second attempt, I asked it to add a wizard hat and it just used Python to add a triangle in the middle of my image. I looked at the examples and saw they had a direct image modification where they say "Give this cat a detective hat and a monocle", so I tried that with my own image, "Give this human a detective hat and a monocle", and it just gave me this error:
> I wasn't able to generate the modified image because the request didn't follow our content policy. However, I can try another approach—either by applying a filter to stylize the image or guiding you on how to edit it using software like Photoshop or GIMP. Let me know what you'd like to do!
Overall, a very disappointing experience. As another point of comparison, Grok also added image generation capabilities and while the ability to edit existing images is a bit limited and janky, it still manages to overlay the requested transformation on top of the existing image.
Iterations are the missing link. With ChatGPT, you can iteratively improve text (e.g., "make it shorter," "mention xyz"). However, for pictures (and video), this functionality is not yet available. If you could prompt iteratively (e.g., "generate a red car in the sunset," "make it a muscle car," "place it on a hill," "show it from the side so the sun shines through the windshield"), the tools would become exponentially more useful.
I'm looking forward to trying this out and seeing if I was right. Unfortunately it's not yet available for me.
Ditto Instruct Pix2Pix https://www.timothybrooks.com/instruct-pix2pix
For example, https://news.ycombinator.com/item?id=43388114
Otherwise impressive.
I think it's too biased toward using heuristics discovered while handling the first response, applying the same level of compute to subsequent requests.
It makes me kind of want to rewrite an interface that builds appropriate context and starts new chats for every request issued.
Am I the only one immediately looking past the amazing text generation, the excellent direction following, the wonderful reflection, and screaming inside my head, "That's not how reflection works!"
I know it's super nitpicky when it's so obviously a leap forward on multiple other metrics, but still, that reflection just ain't right.
Edit: are we talking about the first or second image? I meant to say the image with only the woman seems normal. Image with the two people does seem a bit odd.
Angle of incidence = angle of reflection. That means that the only way to see yourself in a reflective surface is by looking directly at it. Note this refers to looking at your eyes -- you can look down at a mirror to see your feet because your feet aren't where your eyes are.
You can google "mirror selfie" to see endless examples of this. Now look for one where the camera isn't pointing directly at the mirror.
From the way the white board is angled, it's clear the phone isn't facing it directly. And yet the reflection of the phone/photographer is near-center in frame. If you face a mirror and angle to the left the way the image is, your reflection won't be centered, it'll be off to the right, where your eyes can see it because you have a very wide field of view, but a phone would not.
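A quick numerical sanity check of that geometry (positions and angles here are made up for illustration):

```python
import numpy as np

# Reflect the camera's position across a planar "whiteboard" and check where its
# own reflection ends up relative to where the camera is pointing.

def reflect_point(p, plane_point, plane_normal):
    """Reflect point p across the plane through plane_point with the given normal."""
    n = plane_normal / np.linalg.norm(plane_normal)
    return p - 2 * np.dot(p - plane_point, n) * n

camera = np.array([0.0, 0.0, 2.0])          # 2 m in front of the board
plane_point = np.array([0.0, 0.0, 0.0])     # a point on the board
plane_normal = np.array([0.0, 0.0, 1.0])    # board faces the camera

virtual_camera = reflect_point(camera, plane_point, plane_normal)
to_reflection = virtual_camera - camera
to_reflection /= np.linalg.norm(to_reflection)

# Camera yawed 20 degrees away from the board's normal:
yaw = np.radians(20)
camera_forward = np.array([np.sin(yaw), 0.0, -np.cos(yaw)])

offset = np.degrees(np.arccos(np.clip(np.dot(camera_forward, to_reflection), -1, 1)))
print(f"reflection appears ~{offset:.0f} degrees off the frame center")
```

With the camera yawed 20 degrees away from the board's normal, its own reflection lands about 20 degrees off-center, which is why a near-centered reflection implies the camera was pointing straight at the board.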
In games they did it by creating a duplicate then reversing it, I wonder if this is the same idea.
As to why they don't automatically detect when reasoning could be appropriate and then switch to o3, I don't know, but I'd assume it's about cost (and for most users the difference in output quality is negligible). 4o can do everything, it's just not great at "logic".
--
Comparison with Leonardo.Ai.
ChatGPT: https://chatgpt.com/share/67e2fb21-a06c-8008-b297-07681dddee...
ChatGPT again (direct one shot): https://chatgpt.com/share/67e2fc44-ecc8-8008-a40f-e1368d306e...
ChatGPT again (using word "photorealistic instead of "photo"): https://chatgpt.com/share/67e2fce4-369c-8008-b69e-c2cbe0dd61...
Leonardo.Ai Phoenix 1.0 model: https://cdn.leonardo.ai/users/1f263899-3b36-4336-b2a5-d8bc25...
I'm curious if you said 2d animation style for both or just for chatgpt.
Edit: Your second version of chatgpt doesn't say photorealistic. Can you share the Leonard.ai prompt?
Leonardo prompt: A golden cocker spaniel with floppy ears and a collar that says "Sunny" on it
Model: Phoenix 1.0 Style: Pro color photography
Midjourney hasn't been SOTA for nearly a year now. It struggles to follow even marginally complex prompts from an adherence perspective.
nah. i pass and stick with midjourney.
It also misses the arrow between "[diffusion]" and "pixels" in the first image.
How easy is this to remove? Is it just like exif data that can be easily stripped out, or is it baked in more permanently somehow
I couldn't find anything on the pricing page.
This dynamic happens on Twitter every day. Tomorrow it'll be a different craze.
If the subject matter is paywalled, I feel that the post should include some explanation of what is newsworthy behind the link.
After that invitation there are several examples that boil down to: "Hey look. Our AI can generate deep fakes." Impressive examples.
It's more pragmatic to pipeline the results to a background removal model.
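For example, a minimal sketch assuming the open-source rembg package:

```python
from PIL import Image
from rembg import remove

# Sketch of the "pipeline to a background removal model" idea: generate the image
# normally, then strip the background in a separate step.

generated = Image.open("generated.png")
cutout = remove(generated)                # RGBA image with a transparent background
cutout.save("generated_transparent.png")
```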
EDIT: It appears GPT-4o is different, as there is a video demo dedicated to transparency.
I suspect we're getting a flood of comment from people who are using Dall-E.
And that created the isolated image on a transparent background.
Thank-you.
Sorry, but how are these useful? None of the examples demonstrate any use beyond being cool to look at.
The article vaguely mentions 'providing inspiration' as possible definition of 'useful'. I suppose.
And I hope that people who worked on this know this. They are pure evil.
May 7, 2024 - The “Let Loose” event, focusing on new iPads, including the iPad Pro with the M4 chip and the iPad Air with the M2 chip, along with the Apple Pencil Pro.
June 10, 2024 - The Worldwide Developers Conference (WWDC) keynote, where Apple introduced iOS 18, macOS Sequoia, and other software updates, including Apple Intelligence.
September 9, 2024 - The “It’s Glowtime” event, where Apple unveiled the iPhone 16 series, Apple Watch Series 10, and AirPods 4.
Via Press releases: MacBook Air with M3 on March 4, the iPad mini on October 15, and various M4-series Macs (MacBook Pro, iMac, and Mac mini) in late October.
so much fun.
...Once the wait time is up, I can generate the corrected version with exactly eight characters: five mice, one elephant, one polar bear, and one giraffe in a green turtleneck. Let me know if you'd like me to try again later!
ofc 4.5 is best, but it's slow and I am afraid I'm going to hit limits.
Was it public information when Google was going to launch their new models? Interesting timing.