This isn't a matter of human-level AI or superhuman-level AI; it's just straight-up impossible. If you want the information to match, it has to be provided. If it isn't there, an AI can fill in the gaps with "something" that will make the scene work, but expecting it to fill in the gaps the way you "want", even though you gave it no indication of what that is, is expecting literal magic.
Long term, you'll never have a coherent movie produced by stringing together a series of textual snippets because, again, that's just impossible. Some sort of long-form "write me a horror movie starring a precocious 22-year-old elf in a far-future Ganymede colony, with a message about the importance of friendship" AI that generates a coherent movie of many scenes will have to be doing a lot of internal communication in an internal language to hold the result together between scenes, because what it takes to keep things coherent between scenes is an amount of English text not entirely dissimilar in size from the underlying representation itself. You might as well skip the English middleman and go straight to an embedding not constrained by a human language mapping.
And this applies to language / code outputs as well.
The number of times I've had engineers at my company type out five sentences and then expect a complete React webapp...
But what I've found in practice is that using LLMs to generate the prompt with low-effort human input (e.g. thumbs up/down, multiple choice, etc.) is quite useful. It generates walls of text, but with metaprompting, that's kind of the point. With this, I've definitely been able to get high ROI out of LLMs. I suspect the same would work for vision output.
Stick the video you want to replicate into o1 and ask for a descriptive prompt to generate a video with the same style and content. Take that prompt and put it into Sora. Iterate with human- and o1-generated critical responses.
I suspect you can get close pretty quickly, but I don't know the cost. I'm also suspicious that they might have put in "safeguards" to prevent some high profile/embarrassing rip-offs.
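Roughly, the loop I have in mind, as a sketch (Sora has no public API as I write this, so that step stays manual; the o1 call uses the real OpenAI client shape, but the "o1" model name and the rest of the loop are assumptions):

    from openai import OpenAI

    client = OpenAI()

    def ask_o1(text: str) -> str:
        # Real API shape; the "o1" model name assumes you have access to it.
        resp = client.chat.completions.create(
            model="o1", messages=[{"role": "user", "content": text}]
        )
        return resp.choices[0].message.content

    prompt = ask_o1("Write a detailed video-generation prompt that would "
                    "re-create this video: <your notes / frame descriptions>")
    for _ in range(5):
        # Paste `prompt` into Sora by hand, look at the result.
        notes = input("What's wrong with the clip? (empty = good enough) ")
        if not notes:
            break
        prompt = ask_o1(f"Revise this video prompt.\nPrompt: {prompt}\n"
                        f"Critique: {notes}")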
Why snippets? Submit a whole script the way a writer delivers a movie to a director. The (automated) director/DP/editor could maintain internal visual coherence, while the script drives the story coherence.
Yeah, and Apollo 11 would have been utterly astonishing a decade before it occurred. And, yet, if you tried to project out from it to what further frontiers manned spaceflight would reach in the following decades, you’d…probably grossly overestimate what actually occurred.
> Long term progress can be surprising.
Sure, it can be surprising for optimists as well as naysayers; as a good rule of thumb, every curve that looks exponential in an early phase ends up being at best logistic.
Ask anyone with a chronic illness about the future and they'll tell you we're about 5 years off a cure. They've been saying that for decades. Who knows where the future advancements will be.
Generative AI will never produce an experience like that. I know never is a long time, but I’m still gonna call it. You simply can’t produce such a fresh idea by gathering a bunch of data and interpolating.
Maybe someday AI will be good enough to create shorter or longer videos with some dialog and even a coherent story (though I doubt it), but it won't be fresh or creative. And we humans will at best enjoy it for its stupidity or sloppiness, not for its cleverness or artistry.
This is the at-first-fun-but-now-frustrating infinite goal move. "AI (a stand in for literally anything) will do (anything) soon." -> "It won't do (thing), it's too complex." -> "Who said AI will do (thing)?"
So no, I don't think this will happen either. Authors may use AI themselves as one tool in their toolbox as they write their script, but we will not see entire production screenplays written by generative AI set for theatrical release. The industry will simply not allow that to happen. At most you can have AI write a screenplay for your own amusement, not for publication.
Generative AI will not be able to approach the artistry of your average actor (not even a bad actor); it won't be able to match the lighting or the score to the mood (unless you carefully craft that in your prompt). It won't get creative with the camera angles (again, unless you specifically prompt for a specific angle) or the cuts. And it probably won't stay consistent with any of these, or otherwise break the consistency at the right moments, like an artist could.
If you manage to prompt the generative AI into creating a full feature film with excellent acting, the correct lighting given the mood, a consistent tone with editing to match, etc., you have probably spent more time and money crafting the prompt than would otherwise have gone into simply hiring the crew to create your movie. The AI movie will certainly contain slop and be so visibly bad it is guaranteed not to reach theaters.
Now if you hired that crew to make the movie instead, that crew might use AI as a tool to enhance their artistry, but you still need your specialized artists to use that tool correctly. That movie might make it to the theaters.
In music you also have plenty of artists that have no clue how to play their instruments, or progress their songs, but the music is nonetheless amazing.
Skill is not the only quality of art. A brilliant artist works with their limitations to produce work which is better than the sum of its parts. It will take AI the luck of ten billion universes before it produces anything like that.
The AI isn't creating the fresh ideas. People are.
Saying that large video models will be in theaters sounds like a completely different and much more ambitious prediction. I interpreted it as saying large video models will produce whole movies on their own from a script of prompts; that there will be a single filmmaker with only a large video model and some prompts to make the movie. Such films will never be in the theater, unless by some grifter, and then it is certain to be a flop.
What does "EXT. NIGHT" mean in a script? Is it cloudy? Rainy? Well lit? What are camera locations? Is the scene important for the context of the movie? What are characters wearing? What are they looking at?
What do actors actually do? How do they actually behave?
Here are a few examples of script vs. screen.
Here's a well described script of Whiplash. Tell me the one hundred million things happening on screen that are not in the script: https://www.youtube.com/watch?v=kunUvYIJtHM
Or here's the Joker interrogation from The Dark Knight. Same million different things, including actors (or the director) ignoring instructions in the script: https://www.youtube.com/watch?v=rqQdEh0hUsc
Here's A Few Good Men: https://www.youtube.com/watch?v=6hv7U7XhDdI&list=PLxtbRuSKCC...
and so on
---
Edit. Here's Annie Atkins on visual design in movies, including Grand Budapest Hotel: https://www.youtube.com/watch?v=SzGvEYSzHf4. And here's a small article summarizing some of it: https://www.itsnicethat.com/articles/annie-atkins-grand-buda...
Good luck finding any of these details in any of the scripts. See minute 14:16, where she goes through the script.
Edit 2: do watch The Kerning chapter at 22:35 to see what it actually takes to create something :)
This is most Hacker News comments summarized, lmao. It's kinda my favorite thing about this place: just open any thread and you immediately see so many people rushing to say "well, just do X or Y" or "actually it's X or Y and not Z like the experts claim". Love it.
Of course, HN being the place that it is, the same type of comments are made about quantum entanglement and solar panel efficiency.
At the same time, I am curious, in the "that person has too many fingers" sense, about what a system trained on tens of thousands of movies plus scripts plus subtitles plus metadata etc. would generate.
I thought about it for a bit and I would want to watch a computer generated Sharknado 7 or Hallmark Christmas movie.
Let's pick something concrete. It's a medieval script, and it opens with two knights fighting. Later in the script we learn their characters, historic counterparts, etc. So your LLM can match "nefarious villain" to some kind of embedding, and has doubtless trained on countless images of a knight.
But the result is not naively going to understand the level of reality the script is going for - how closely to stick to historic parallels, how much to go fantastical with the depiction. The way we light and shoot the fight and how it coheres with the themes of the scene, the way we're supposed to understand the characters in the context of the scene and the overall story, the references the scene may be making to the genre or even specific other films etc.
This is just barely scraping the surface of the beginnings of thinking about mise en scene, blocking, framing etc. You can't skip these parts - and they're just as much of a challenge as temporal coherence, or performance generation or any of the other hard 'technical issues' that these models have shown no capacity to solve. They're decisions that have to be made to make a film coherent at all - not yet good or tasteful or creative or whatever.
Put another way: you'd need AGI to comprehend a script at the level of depth required to do the job of any HOD (head of department) on any film. Such a thing is doubtless possible, but it's not going to be shortcut naively the way generating an image is, because it requires understanding in context, precisely what LLMs lack.
We can already get detailed style guidance into picture generation. Declaring you want Picasso cubism, a Warner Brothers cartoon, or hyperrealism works today. So do lighting instructions, color palettes, and so on.
These future models will not be large language models, they will be multi-modal. Large movie models if you like. They will have tons of context about how scenes within movies cohere, just as LLMs do within documents today.
- you have to provide correct detailed instructions on lighting
- you have to provide correct detailed instructions on props
- you have to provide correct detailed instructions on clothing
- you have to provide correct detailed instructions on camera position and movement
- you have to provide correct detailed instructions on blocking
- you have to provide correct detailed instructions on editing
- you have to provide correct detailed instructions on music
- you have to provide correct detailed instructions on sound effects
- you have to provide correct detailed instructions on...
- ...
- repeat that for literally every single scene in the movie (up to 200 in extreme cases)
There's a reason I provided a few links for you to look at. I highly recommend the talk by Annie Atkins. Watch it, then open any movie script, and try to find any of the things she is talking about there (you can find actual movie scripts here: https://imsdb.com)
That's how I've been using the image generators - lots of experimentation and throwing out the stuff that doesn't work. Then once I've got enough good generated images collected out of the tons of garbage, I fine tune a model and create a workflow that more consistently gives me those styles.
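The filtering pass is easy to automate crudely before the human look-through; something like this is roughly the shape of it (model names are just common examples, and CLIP similarity is a blunt stand-in for taste):

    import os
    import torch
    from diffusers import StableDiffusionXLPipeline
    from transformers import CLIPModel, CLIPProcessor

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")
    clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
    proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

    prompt = "moody ink-wash cityscape, heavy rain"   # the style I'm fishing for
    images = [pipe(prompt).images[0] for _ in range(64)]

    # First-pass filter: CLIP similarity between the prompt and each image.
    inputs = proc(text=[prompt], images=images, return_tensors="pt", padding=True)
    scores = clip(**inputs).logits_per_image.squeeze(1).tolist()
    ranked = sorted(zip(scores, images), key=lambda t: t[0], reverse=True)

    os.makedirs("dataset", exist_ok=True)
    for i, (_, img) in enumerate(ranked[:8]):   # keepers go to the LoRA trainer
        img.save(f"dataset/style_{i:02d}.png")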
Now the models and UX to do this at a cinematic quality are probably 5-10 years away for video (and the studios are probably the only ones with the data to do it), but I'm relatively bullish on AI in cinema. I don't think AI will be doing everything end to end, but it might be a shortcut for people who can write a script and figure out the UX to execute the rest of the creative process by trial and error.
Where did you find AI/ML that is good at filling in the actual required and consistent details?
I beg of you to watch Annie Atkins' presentation I linked: https://www.youtube.com/watch?v=SzGvEYSzHf4 and tell me how much intervention AI/ML would need to create all that and stay consistent throughout the movie.
> once these models can generate coherent scenes, people can start using them to explore the creative space and figure out what they like.
Define "coherent scene" and "explore". A scene must be both coherent and consistent, and conform to the overall style of the movie and...
Even such a simple thing as shot/reverse shot requires about a million various details and can be shot in a million different ways. Here's an exploration of just shot/reverse shot: https://www.youtube.com/watch?v=5UE3jz_O_EM
All those are coherent scenes, but the coherence comes from a million decisions: from lighting, camera position, lens choice, wardrobe, what surrounds the characters, what's happening in the background, makeup... There's no coherence without all these choices made beforehand.
Around the 4:00 mark: "Think about how well you know this woman just from her clothes, and workspace". Now watch that scene. And then read its description in the script https://imsdb.com/scripts/No-Country-for-Old-Men.html:
--- start quote ---
Chigurh enters. Old plywood paneling, gunmetal desk, litter
of papers. A window air-conditioner works hard.
A fifty-year-old woman with a cast-iron hairdo sits behind
the desk.
--- end quote ---

And right after that there's a section on the rhythm of editing. Another piece in the puzzle of coherence in a scene.
> Then once I've got enough good generated images collected out of the tons of garbage, I fine tune a model and create a workflow that more consistently gives me those styles.
So, literally what I wrote here: https://news.ycombinator.com/item?id=42375280 :)
You’re really failing to let go of the idea that you need to prescribe every little thing. Like Midjourney today, you’ll be able to give general guidance.
Now, I don’t expect we’ll get the best movies this way. But paint by numbers stuff like many movies already are? A Hallmark Channel weepy? I bet we will.
No jump.
Your original claim: "Submit a whole script the way a writer delivers a movie to a director. The (automated) director/DP/editor could maintain internal visual coherence, while the script drives the story coherence."
Two comments later it's this: "We can already get detailed style guidance into picture generation. Declaring you want Picasso cubist, Warner brothers cartoon, or hyper realistic works today. So does lighting instructions, color palettes, on and on."
I just re-wrote this with respect to movies.
> I was thinking more like ‘make it look like Bladerunner if Kurosawa directed it, with a score like Zimmer.’
Because, as we all know, every single movie by Kurosawa is the same, as is every single score by Hans Zimmer, so it's ridiculously easy to recreate any movie in that style, with that music.
> You’re really failing to let go of the idea that you need to prescribe every little thing. Like Midjourney today, you’ll be able to give general guidance.
Yes, and Midjourney today really sucks at:
- being consistent
- creating proper consistent details
A general prompt will give you a general result that is usually very far from what you actually have in mind.
And yes, you will have to prescribe a lot of small things if you want your movie to be consistent. And for your movie to make any sense.
Again, tell me how exactly your amazing magical AI director will know which wardrobe to choose, which camera angles to set up, which typography to use, and which sound effects to make, just from the script you hand in?
You can start with a very simple scene I referenced in my original reply: two people talking at the table in Whiplash.
> But paint by numbers stuff like many movies already are? A Hallmark Channel weepy? I bet we will.
Even those movies have more details and more care in them than you can get out of AIs (now, or in the foreseeable future).
I think you're still assuming I always want to choose those things. That's why we're talking past each other. A good movie making model would choose for me unless I give explicit directions. Today we don't see long-range coherence in the results of movie (or game engine) models, but the range is increasing, and I'm willing to bet we will see movie-length coherence in the next decade or so.
By the way, I also bet that if I pasted exactly the No Country for Old Men script scene description from up this thread into Midjourney today it would produce at least some compelling images with decent choices of wardrobe, lighting, set dressing, camera angle, exposure, etc etc. That's what these models do, because they're extrapolating and interpolating between the billion images they've seen that contained these human choices.
AFAIK Midjourney produces single images, so the relevant scope of consistency is inside the single image only, not between images. A movie model needs coherence across ~160,000 images (a feature film runs 90-120 minutes at 24 fps, i.e. roughly 130,000-170,000 frames), which is beyond the state of the art today, but I don't see why it's impossible or unreasonable in the long run.
> A general prompt will give you a general result that is usually very far from what you actually have in mind.
Which is only a problem if I have something in mind. Alternatively I can give no guidance, or loose guidance, make half a dozen variations, pick the one I like best. Maybe iterate a couple of times into that variation tree. Just like the image generators do.
"movies will be one of the last things to be replaced by ai"
https://www.youtube.com/watch?v=ypURoMU3P3U
including this quote: "being a craftsman is knowing how to work, art is knowing when to stop"
The text encoder may not be able to know complex relationships, but the generative image/video models that are conditioned on said text embeddings absolutely can.
Flux, for example, uses the very old T5 model for text encoding, but image generations from it can (loosely) adhere to all rules and nuances in a multi-paragraph prompt: https://x.com/minimaxir/status/1820512770351411268
Flux certainly does not do so consistently across an arbitrary collection of multi-paragraph prompts, as anyone who's run more than a few long prompts past it would recognize. The tweet is wrong in the other direction as well: longer, language-model-preprocessed prompts for models that use CLIP (like various SD 1.5 and SDXL derivatives) are, in fact, a common and useful technique. (You'd think the fact that the generated prompt here is significantly longer than the 256-token window of T5 would be a clue that the 77-token limit of CLIP might not be as big a constraint as the tweet was selling it as, either.)
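It's easy to check how much of a long prompt each encoder actually sees (the tokenizer repos are the ones commonly paired with SDXL and Flux; anything past the window is truncated before it ever conditions the image):

    from transformers import AutoTokenizer

    clip_tok = AutoTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    t5_tok = AutoTokenizer.from_pretrained("google/t5-v1_1-xxl")

    prompt = open("long_multiparagraph_prompt.txt").read()

    print("CLIP tokens:", len(clip_tok(prompt).input_ids))  # window: 77
    print("T5 tokens:  ", len(t5_tok(prompt).input_ids))    # window: 256 here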
How would you ever tweak or debug it in that case? It doesn't strictly have to be English, but some kind of human-readable representation of the intermediate stages will be vital.
https://dreambooth.github.io/ https://textual-inversion.github.io/
InstantID (https://replicate.com/zsxkib/instant-id) fixes that issue.
If you just want a face, InstantID/PuLID work, but it's not going to be very varied. Doing actual training means you can get any perspective, lighting, style, expression, etc., and have the whole body be accurate.
Even the models based off danbooru and E621 still aren't the best at that. And us furries like to tag art in detail.
The best we can really do at the moment is regional prompting, perhaps they need something similar for video.
Sora performs worse than closed source Kling and Hailuo, but more importantly, it's already trumped by open source too.
Tencent is releasing a fully open source Hunyuan model [1] that is better than all of the SOTA closed source models. Lightricks has their open source LTX model and Genmo is pushing Mochi as open source. Black Forest Labs is working on video too.
Sora will fall into the same pit that Dall-E did. SaaS doesn't work for artists, and open source always trumps closed source models.
Artists want to fine tune their models, add them to ComfyUI workflows, and use ControlNets to precision control the outputs.
Images are now almost 100% Flux and Stable Diffusion, and video will soon be 100% Hunyuan and LTX.
Sora doesn't have much market apart from name recognition at this point. It's just another inflexible closed source model like Runway or Pika. Open source has caught up with state of the art and is pushing past it.
The real trick is that the AI needs to be able to participate in iteration cycles, where the human can say "okay this is all mostly good, but I've circled some areas that don't look quite right and described what needs to be different about them." As far as I've played with it, current AIs aren't very good at revisiting their own work— you're basically just tweaking the original inputs and otherwise starting over from scratch each time.
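For stills, the closest thing today is inpainting: mask the circled region and regenerate only that, keeping the rest pixel-identical. A sketch with diffusers (the model name is one common example; results vary a lot by model):

    import torch
    from diffusers import StableDiffusionInpaintPipeline
    from PIL import Image

    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
    ).to("cuda")

    image = Image.open("draft.png").convert("RGB")
    mask = Image.open("circled_regions.png").convert("L")  # white = regenerate

    fixed = pipe(
        prompt="same scene, but the hand has five fingers",
        image=image,
        mask_image=mask,
    ).images[0]
    fixed.save("draft_v2.png")

But that's still images; nothing equivalent exists yet for video at useful quality, which is the point.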
Check out the Banodoco Discord community [2]. These are the people pioneering steerable AI video, and it's all being built on top of open source.
Now expand that to movies and games and you can get why this whole generative-AI bubble is going to pop.
What will save it is that, no matter how picky you are as a creator, your audience will never know what exactly was that you dreamed up, so any half-decent approximation will work.
In other words, a corollary to your corollary is, "Fortunately, you don't need them to be, because no one cares about low-order bits".
Or, as we say in Poland, "What the eye doesn't see, the heart doesn't mourn."
Part of the problem is the "half decent approximations" tend towards a clichéd average, the audience won't know that the cool cyberpunk cityscape you generated isn't exactly what you had in mind, but they will know that it looks like every other AI generated cyberpunk cityscape and mentally file your creation in the slop folder.
I think the pursuit of fidelity has made the models less creative over time; they make fewer glaring mistakes like giving people six fingers, but their output is ever more homogenized and interchangeable.
https://www.astralcodexten.com/p/how-did-you-do-on-the-ai-ar...
In other words, someone willing to tweak the prompt and press the button enough times to say "yeah, that one, that's really good" is going to have a result which cannot in fact be reliably binned as AI-generated.
The detailed breakdown you mention? Maybe it's accurate to that artist's thought process, maybe it's more of a rationalization; either way, it's not a general rule they, or anyone, could apply to any of the other AI images. Most of those in the article don't exhibit those "telltale signs", and the one that does - the Victorian Megaship - was actually made by human artist with no AI in the mix.
EDIT:
Another image that stands out to me is Riverside Cafe. I, like apparently a lot of other people (going by the article's comments), assumed it was a human-made one, because we vaguely remembered Van Gogh painted something like it. He did; it's called Café Terrace at Night. And yet, despite immediately evoking the association, Riverside Cafe was made by AI, and is actually nothing like Café Terrace at Night at any level.
(I find it fascinating how this work looks like a copy of Van Gogh at first glance, for no obvious reason, but nothing alike once you pause to look closer. It's like... they have similar low-frequency spectra or something?)
EDIT2:
Played around with the two images in https://ejectamenta.com/imaging-experiments/fourifier/. There are some similarities in the spectra, I can't put my finger on them exactly. But it's probably not the whole answer. I'll try to do some more detailed experimentation later.
--
[0] - Nor should you expect it - it would mean either a perfect calibration, or be the equivalent of flipping a coin and getting heads 30 times in a row; it's not impossible, but you shouldn't expect to see it unless you're interviewing fewer people than literally the entire population of the planet.
> The average participant scored 60%, but people who hated AI art scored 64%, professional artists scored 66%, and people who were both professional artists and hated AI art scored 68%.
> The highest score was 98% (49/50), which 5 out of 11,000 people achieved. Even with 11,000 people, getting scores this high by luck alone is near-impossible.
If 0.045% of people (5 out of 11,000) who are specifically judging art as AI or not AI, in a test which presumably attracts people who would like to be able to do that thing, can do a 98%-accurate job, and the average is around 60%: that isn't reliable.
If that doesn't work for you, I encourage you to take the test. Obviously since you've read the article there are some spoilers, but there's still plenty of chances to get it right or wrong. I think you'll discover that you, too, cannot do this reliably. Let us know what happens.
Imagine the space of ideas as a circle, with the stuff in the middle being easier to reach (the "clichéd average"). Previously, traversing the circle was incredibly hard: we had to use tools like DeviantArt, Instagram, etc. to agglomerate the diverse tastes of artists, hoping to find or create the style we were looking for. Getting art in that same style meant hiring the artist. As a result, on average, what you see is the result of huge amounts of human curation, effort, and branding teams.
Now reduce the effort 1000x, and all of a sudden, it's incredibly easy to reach the edge of the circle (or closer to it). Sure, we might still miss some things at the very outer edge, but it's equivalent to building roads. Motorists appear, people with no time to sit down and spend 10000 hours to learn and master a particular style can simply remix art and create things wildly beyond their manual capabilities. As a result, the amount of content in the infosphere skyrockets, the tastemaking velocity accelerates, and you end up with a more interesting infosphere than you're used to.
When you're commissioning an artist to make you some art, you're basically sampling from the entire distribution. Stuff in the middle is, as you say, easiest to reach, so that's what you'll most likely get. Generative models let more people do art, meaning there's more sampling happening, so the stuff further from the centre will be visited more often, too.
However, AI tools also make another thing easier: moving and narrowing the sampling area. Much like with a very good human artist, you can find some work that's "out there", and ask for variations of it. However, there are only so many good artists to go around. AI making this process much easier and more accessible means more exploration of the circle's edges will happen. Not just "more like this weird thing", but also combinations of 2, 3, 4, N distinct weird things. So in a way, I feel that AI tools will surface creative art disproportionally more than it'll boost the common case.
Well, except for the fly in the ointment that's the advertising industry (aka. the cancer on modern society). Unfortunately, by far most of the creative output of humanity today is done for advertising purposes, and that goal favors the common, as it maximizes the audience (and is least off-putting). Deluge of AI slop is unavoidable, because slop is how the digital world makes money, and generative AI models make it cheaper than generative protein models that did it so far. Don't blame AI research for that, blame advertising.
Tastes are almost never normally distributed along a spectrum, but multi-modal. So the more dimensions you explore in, the more you end up with "islands of taste" on the surface of a hypersphere, and nothing like a normal distribution at all. This phenomenon is deeply tied to why "design by committee" (e.g., in movies) always makes the financial projections look good but flops with audiences: there is almost no customer for average anything.
I agree with your conclusion.
My experience with customer surveys indicates the opposite: that customers prefer you to have an opinion.
Inside Out 2 had the largest box office of any movie in 2024. Check out the "Research and writing" section in its Wikipedia article https://en.wikipedia.org/wiki/Inside_Out_2#Research_and_writ... ... psychological consultants, a feedback loop with a group of teenagers, test screenings.
Or how about "Die with a smile" - currently number 1 in the global top 50 on Spotify. 5 songwriters
Or "APT." - currently number 2 in the global top 50 on Spotify. 11 songwriters
You don't have to look very hard
Consulting with SMEs, testing with audiences, etc. isn't "design by committee".
Similarly, "Die With a Smile" seems to have been the work of two people with developed styles, with support; again, not a committee:
> The collaboration was a result of Mars inviting Gaga to his studio where he had been working on new music. He presented the track in progress to her and the duo finished writing and recording the song the same day.
"APT." seems to have started with a single person goofing around, then was pitched as a collaboration, with the expanded team entering at that point.
(And I do blame the advertisers, but frankly anyone handing them new amplifiers, with entirely predictable consequences, is also not blameless.)
That is based on the perception that it is easier than ever to create fake content, but it fails to account for the fact that creating real content (for example, simply taking a video) is even easier. So while there is more fake content, there is also a lot more real content, and so manipulation of reality (for example, denying a genocide) is harder today than ever.
Anyway, "the AI slop will win" is based on a similar misconception, that total creative output will not increase. But like with fake news, it probably will not be the case, and so the actual amount of good art will increase, too.
I think we are OK as long as normal humans prefer to create real news rather than fake news, and create innovative art rather than cliched art.
So we're not OK.
I think I need to state my assumptions/beliefs here more explicitly.
First of all, "AI slop" is just the newest iteration on human-produced slop, which we're already drowning in. Not because people prefer to create slop, but because they're paid to do it, because most content is created by marketers and advertisers to sell you shit, and they don't want it to be better than strictly necessary for purpose.
It's the same with fake news, really. Fake news isn't new. Almost all news is fake news; what we call "fake news" is a particular flavor of bullshit that got popular as it got easier for random humans to publish stories competing with established media operations.
In both cases, AI is exacerbating the problem, but it did not create it - we were already drowning in slop.
Which leads me to related point:
> Anyway, "the AI slop will win" is based on a similar misconception, that total creative output will not increase.
It will. But don't forget Sturgeon's law - "ninety percent of everything is crap"[0]. Again, for the past couple decades, we've been drowning in "creative output". It's not a new problem, it's just increasingly noticeable in the past years, because the Web makes it very easy for everyone to create more "creative output" (most of which is, again, advertising), and it finally started overwhelming our ability to filter out the crap and curate the gems.
Adding AI to the mix means more output, which per Sturgeon's law, means disproportionately more crap. That's not AI's fault, that's ours; it's still the same problem we had before.
--

[0] - https://en.wikipedia.org/wiki/Sturgeon%27s_law
That may be true of any one model (though I don’t think it really is, either, I think newer image gen models are individually capable of a much wider array of styles than earlier models), but it is pretty clearly not true of the whole range of available models, even if you look at a single model “family” like “SDXL derivatives”.
Ironically, we're long past that point with human creators, at least when it comes to movies and games.
Take sci-fi movies, and compare modern ones to the ones from the tail end of the 20th century. Year by year, VFX gets more and more detailed (and expensive): more and better lights, finer details on every material, more stuff moving and emitting light, etc. But all that effort arguably killed immersion and believability, by making scenes incomprehensible. There's way too much visual noise in action scenes in particular; bullets and lightning bolts zip around, and all that detail just blurs together. Contrast 20th-century productions: textures weren't as refined, but you could at least tell who was shooting whom, and when.
Or take video games, where all that graphics work makes everything look the same. Especially games that go for a realistic style; they're all homogeneous these days, and it all looks like cheap plastic.
(Seriously, what the fuck went wrong here? All that talk, and research, and work into "physically based rendering", yet in the end, all PBR materials end up looking like painted plastic. Raytracing seems to help a bit when it comes to liquids, but it still can't seem to make metals look like metals and not Fisher-Price toys repainted gray.)
So I guess in this way, more precision just makes the audience give up entirely.
> they will know that it looks like every other AI generated cyberpunk cityscape and mentally file your creation in the slop folder.
The answer here is the same as with human-produced slop: don't. People are good at spotting patterns, so keep adding those low-order bits until it's no longer obvious you're doing the same thing everyone else is.
EDIT: Also, obligatory reminder that generative models don't give you the average of the training data with some noise mixed in; they sample from a learned distribution. The law of large numbers still applies, but it just means that to get more creative output, you need to bias the sampling.
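Concretely, in the Stable Diffusion family the knob for this is classifier-free guidance; turn it down and you trade prompt adherence for samples from further out in the learned distribution (a sketch, with one common model as the example):

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
    ).to("cuda")

    prompt = "cyberpunk cityscape"
    usual = pipe(prompt, guidance_scale=7.5).images[0]  # the familiar "average" look
    weird = pipe(prompt, guidance_scale=2.0).images[0]  # less prompt-bound, more variance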
Western movie studios may discover the same thing soon, with the number of high-budget productions tanking lately.
If you are not beholden to a precise vision or maybe just want to create something that sells, these tools will likely be significant productivity multipliers.
So far ChatGPT is not for writing books, but is great for SEO-spam blogposts. It is already killing the content marketing industry.
So far Dall-E is not for making master paintings, but it's great for stock images. It might kill most of the clipart and stock image industry.
So far Udio and other song generators are not able to make symphonies, but it's great for quiet background music. It might kill most of the generic royalty-free-music industry.
Actual long form art like a movie works because it includes many well informed choices that work together as a whole.
There seems to be a large gap between generating a few seconds of video vaguely like one's notion, and trying to create 90 minutes that are related and meaningful.
Which doesn't mean that you can't, from this starting place, build more robust tools. But if this is a large, hard amount of work, it certainly calls into question optimistic projections from people who don't even seem to notice that there is work needed at all.
It's less obvious because people project personality onto the content they see, because they implicitly assume the artist cared, and had some vision in mind. Cheap shit doesn't look like cheap shit in isolation. Except when you know it's AI-generated, because this removes the artist from the equation, and with it, your assumptions that there's any personality involved.
People can generally see the lack of artistic intent when consuming entertainment.
So, while GenAI tools make it easier to create superficially decent work that lacks creative intent, the studios managed to do it just fine with human intelligence only, suggesting the problem isn't AI, but the studios and their modern management policies.
Right now AI is more the latter, but many people want it to be the former
A director letting actors "just be" knows exactly what he/she wants, and chooses actors accordingly. Just like the directors who want the most minute detail.
Clint Eastwood tries to do at most one take of a scene. David Fincher is infamous for his dozens of takes.
AI is neither Fincher nor Eastwood.
People may not think they care, but obviously they do. That’s why marvel movies do better than DC ones.
People absolutely care about details in their media.
Not all details matter, some do. And, it's better to not show the details at all, than to be inconsistent in them.
Like, idk., don't identify a bomb as a specific type of existing air-fuel ordnance and then act about it as if it was a goddamn tactical nuke. Something along these lines was what made me stop watching Arrow series.
This is a key observation, unfortunately generally solving for what details matter is extremely difficult.
I don’t think video generation models help with that problem, since you have even less control of details than you do with film.
At least before post.
The movies have just had much worse audience and critical reception.
The last production I worked on averaged 16 hours per frame for the final rendering. The amount of information encoded in lighting, models, texture, maps, etc is insane.
I don’t remember how long the final rendering took but it was nearly two months and the final compute budget was 7 or 8 figures. I think we had close to 100k cores running at peak from three different render farms during crunch time, but don’t take my word for it I wasn’t producing the picture.
Weren't the rendering algos ported to CUDA yet?
The 3D you see in things like commercials is usually done on GPUs though because at their smaller scale it's much faster.
A friend recently told me about a complex scene (I think it was a Marvel or Star Wars flick) where they had so much going on in the scene with smoke, fire, and other special effects that they had to wait for a specialized server with 2TB of RAM to be assembled. They only had one such machine so by the time the rest of the movie was done rendering, that one scene still had a month to go.
https://www.disneyanimation.com/data-sets/?drawer=/resources...
https://datasets.disneyanimation.com/moanaislandscene/island...
> When everything is fully instantiated the scene contains more than 15 billion primitives.
Pixar's stuff famously takes days per frame.
Do you have a citation for this? My guess would be much closer to a couple of hours per frame.
Trust me, even if you work with human artists, you'll keep saying "it's not quite what I initially envisioned, but we don't have the budget/time for another revision, so it's good enough for now" all the time.
If you find a silver bullet then everything else is largely irrelevant.
Idk if you noticed but that “if” is carrying an insane amount of weight.
I have a bad feeling that you'd be a horrible manager if you ever were one.
(2021) https://arxiv.org/abs/2103.13915 : An Image is Worth 16x16 Words, What is a Video Worth?
(2024) https://arxiv.org/abs/2406.07550 : An Image is Worth 32 Tokens for Reconstruction and Generation
The first paper is the most famous and prompted a lot of research into using text-generation techniques in the image-generation domain: 256 "words" for an image. The second finds a video is worth about 24 reference images per minute; the third is a refinement of the first, saying you only need 32 "tokens" per image. I'll let you multiply the numbers.
In kind of the same way as a game of Guess Who, where you can identify any human on earth with ~33 bits of information (2^33 ≈ 8.6 billion).
The corollary being that, contrary to what the parent is saying, there is no theoretical obstacle to obtaining a video from a textual description.
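Doing the multiplication with the papers' numbers as quoted:

    32 tokens/image x 24 reference images/minute x 90 minutes = 69,120 tokens

i.e. the visual "transcript" of a feature film fits comfortably inside a modern LLM context window.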
These papers, from my quick skim (though I did read the first one fully, years ago), seem to show that some images, and to an extent video, can be generated from discrete tokens, but they do not show that exact images can be, nor that any arbitrary image can be.
For instance, what combination of tokens must I put in to get _exactly_ the Mona Lisa or Starry Night? (Though those might be very well represented in the dataset; maybe a lesser-known image would be a better example.)
As I understand, OC was saying that they can’t produce what they want with any degree of precision since there’s no way to encode that information in discrete tokens.
VQ-VAE (Vector Quantised-Variational AutoEncoder), (2017) https://arxiv.org/abs/1711.00937
The whole encoding-decoding process is reversible, and you only lose some imperceptible "details"; the process can be trained with either an L2 loss or a perceptual loss, depending on what you value.
The point being that images which occur naturally are not really information-rich, and can be compressed a lot by neural networks of a few GB that have seen billions of pictures. With that strong prior, a.k.a. common knowledge, we can indeed paint with words.
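The quantization trick at the heart of it is just nearest-neighbour lookup into a learned codebook; stripped to its essence (a sketch, with a random codebook standing in for a trained one):

    import torch

    codebook = torch.randn(8192, 256)   # K visual "words", each a 256-dim vector

    def quantize(z: torch.Tensor):
        # Nearest codebook entry for each latent vector = its discrete token.
        dists = torch.cdist(z, codebook)      # (N, K) pairwise distances
        tokens = dists.argmin(dim=1)          # (N,) integer token ids
        return tokens, codebook[tokens]       # ids + quantized latents

    z = torch.randn(1024, 256)          # e.g. a 32x32 grid of encoder outputs
    tokens, z_q = quantize(z)
    # A decoder reconstructs the image from z_q; the ids are what you'd
    # hand to a language model.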
Taking an existing image and reversing the process to get the tokens that led to it, then redoing that, doesn't seem the same as inserting tokens to get a precise novel image.
Especially since, as you said, we’d lose some details, it suggests that not all images can be perfectly described and recreated.
I suppose I’ll need to play around with some of those techniques.
Natural image -> sequence of tokens: but not every possible sequence of tokens will be reachable, just as plenty of letters put together form nonsensical words.
Sequence of tokens -> natural image: if the initial sequence of tokens is nonsensical, the resulting image will be garbage.
So usually you then model the sequence of tokens so that it produces sensible sequences, like you would with an LLM, and you use the LLM to generate more tokens. It also gives you a natural interface to control the generation: you can express in words what modifications to make to the image. This allows you to find the golden sequence of tokens corresponding to the Mona Lisa by dialoguing with the LLM, which has been trained to translate from English to visual-word sequences.
Alternatively, instead of an LLM you can use a diffusion model; the visual words are then usually continuous, but you can displace them iteratively with text using things like ControlNet (Stable Diffusion).
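Schematically, the generation direction then looks like this (`prior` and `decoder` are stand-ins for trained models, not real library calls):

    import torch

    def generate(caption, prior, decoder, n_tokens=1024):
        # prior: autoregressive model over codebook ids, conditioned on text.
        # decoder: the VQ decoder mapping ids back to pixels.
        tokens = torch.empty(0, dtype=torch.long)
        for _ in range(n_tokens):
            logits = prior(caption, tokens)                    # (K,) next-token scores
            next_id = torch.multinomial(logits.softmax(-1), 1)
            tokens = torch.cat([tokens, next_id])              # only "sensible" paths get mass
        return decoder(tokens)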
That's my full quote on this topic, and I think it stands. Sure, people won't describe a picture; instead, they will take an existing picture or video and make modifications to it, using AI. That is much simpler and more useful, if you can film a scene and then animate it later with AI.
The prior sentence does not imply the conclusion.
Being too early about this and being wrong are the same.
Especially anything involving fluid/smoke dynamics, or fast dynamic movements of humans and animals, suffers from the same weird motion artifacts. I can't describe it other than that the fluidity of the movements is completely off.
And since all the GenAI video tools I've used suffer from the same problem, I wonder if this is somehow inherent to the approach and somehow unsolvable with the current model architectures.
I saw a Santa dancing video today and the suspension of disbelief was almost instantly dispelled when the cuffs of his jacket moved erratically. The GenAI was trying to get them to sway with arm movements but because it didn't understand why they would sway it just generated a statistical approximation of swaying.
GenAI also definitely doesn't understand 3D structures easily demonstrated by completely incorrect morphological features. Even my dogs understand gravity, if I drop an object they're tracking (food) they know it should hit the ground. They also understand 3D space, if they stand on their back legs they can see over things or get a better perspective.
I've yet to see any GenAI that demonstrates even my dogs' level of understanding the physical world. This leaves their output in the uncanny valley.
To that end, it is actually extremely important to nitpick this stuff. For those of us using these tools, we need to be able to talk shop about which ones are keeping up, which ones work like shit in practice, and which ones work but only in certain situations, and which situations those are.
We would love to learn more about the origin of your certainty.
Fascinating! I wish I had the knowledge and wherewithal to do that and become rich instead of wasting my time on HN.
Neural networks are largely black box piles of linear algebra which are massaged to minimize a loss function.
How would you incorporate smooth kinematic motion in such an environment?
The fact that you discount the knowledge of literally every single employee at OpenAI is a big signal that you have no idea what you’re talking about.
I don’t even really like OpenAI and I can see that.
I have been working with NLP and neural networks since 2017.
They aren’t just black boxes, they are _largely_ black boxes.
When training an NN, you don’t have great control over what parts of the model does what or how.
Now instead of trying to discredit me, would you mind answering my question? Especially since, as you say, the theory is so simple.
How would you incorporate smooth kinematic motion in such an environment?
You’ve convinced me that you’re small and know very little about the subject matter.
You don’t need to reply to this. I’m done with this convo.
Do you know anyone or any companies working on that?
Maybe we're in a honeymoon period where your average user hasn't gotten annoyed by all the slop out there and they will soon, but at least for now, there is real value here. Yes, out of 20 ads maybe only 2 outperform the manually created ones, but if I can create those 20 with a couple hundred bucks in GenAI credits and maybe an hour or two of video editing that process wipes the floor with the competition, which is several thousand dollars per ad, most of which are terrible and end up thrown away, too. With the way the platforms function now, ad creative is quickly becoming a volume-driven "throw it at the wall and see what sticks" game, and AI is great for that.
It’s this. A video ad with a person morphing into a bird that takes off like a rocket with fire coming out of its ass, sure it might perform well because we aren’t saturated with that yet.
You’d probably get a similar result by giving a camera to a 5 year old.
But you also have to ask what that’s doing long term to your brand.
My guess is that the criticism of AI not being that good is correct, but many people don't realize that most humans also aren't that good, and that it's quite possible that the AI performs better than mediocre humans.
This shouldn't be much of a surprise, we've seen automation replace low skilled labor in a lot of industries. People seem uncomfortable with the possibility that there's actually a lot of low skilled labor in the creative industry that could also be easily replaced.
All the generated video startups seem to generate videos with much lower than 10% usable output, without significant human-guided edits. Given the massive amount of compute needed to generate a video relative to hyperoptimized LLMs, the quality issue will handicap gen video for the foreseeable future.
The same "prompt" they'd give the creative person they hired... Say, "I want an ad for my burgers that make it look really good, I'm thinking Christmas vibes, it should emphasize our high quality meat, make it cheerful, and remember to hint at our brand where we always have smiling cows."
Now that creative person would go make you that advert. You might check it, give a little feedback for some minor tweaks, and at some point, take what you got.
You can do the same here. The difference right now is that it'll output a lot of junk that a creative person would have never dared show you, so that initial quality filtering is missing. But on the flip side, it costs you a lot less, can generate like 100 of them quickly, and you just pick one that seems good enough.
That's what GenAI is doing, too. After all, the audience only sees the final product; they never get to know what the writer had in mind.
If a 60 year old grizzled detective is introduced in page 1, a human artist will draw the same grizzled detective in page 2, 3 and so on, not procedurally generate a new grizzled detective each time.
Most models people currently use don't keep state between invocations, and whatever interpretation they make from provided context (e.g. reference image, previous frame) is surface level and doesn't translate well to output. This is akin to giving each panel in a comic to a different artist, and also telling them to sketch it out by their gut, without any deep analysis of prior work. It's a big limitation, alright, but researchers and practitioners are actively working to overcome it.
(Same applies to LLMs, too.)
https://www.comicsexperience.com/wp-content/uploads/2018/09/...
Or, if you can't do this, explain why the feature you mentioned cannot do it, and what it is good for.
- An LLM to inflate descriptions in the script into very detailed prompts (equivalent to the artist thinking up how characters will look and how the scene is organized);
- A step to generate a representative drawing of every character via txt2img - or more likely, multiple ones, with a multimodal LLM rating adherence to the prompt;
- A step to generate a lot of variations of every character in different poses, using e.g. ControlNet or whatever is currently the SOTA solution used by the Stable Diffusion community to create consistent variations of a character;
- A step to bake all those character variations into a LoRA;
- Finally, scenes would be generated by another call to txt2img, with prompts computed in step 1, and appropriate LoRAs active (this can be handled through prompt too).
Then iterate on that, e.g. maybe additional img2img to force comic book style (with a different SD derivative, most likely), etc.
Point being, every subproblem of the task has many different solutions already developed, with new ones appearing every month - all that's left to have an "AI artist" capable of solving your challenge is to wire the building blocks up (see the sketch below). For that, you need just a trivial bit of Python code using existing libraries (e.g. hooking up to ComfyUI), and guess what, GPT-4 and Claude 3.5 Sonnet are quite good at Python.
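The wiring really is mundane; a sketch of the flow (every helper here stands in for a handful of ComfyUI nodes or API calls, nothing is a real single-call library):

    # Hypothetical glue for the pipeline above.
    characters = llm_extract_characters(script)                 # step 1: inflate descriptions
    for c in characters:
        drafts = [txt2img(c.detailed_prompt) for _ in range(16)]
        best = max(drafts, key=lambda im: vlm_adherence(im, c.detailed_prompt))
        poses = controlnet_variations(best, n=40)               # consistent poses
        c.lora = train_lora(poses)                              # bake identity into a LoRA

    panels = []
    for p in llm_panel_prompts(script):                         # final txt2img passes
        panels.append(txt2img(p.prompt, loras=[c.lora for c in p.cast]))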
EDIT: I asked Claude to generate "pseudocode" diagram of the solution from our two comments:
http://www.plantuml.com/plantuml/img/dLLDQnin4BthLmpn9JaafOR...
Each of the nodes here would be like 3-5 real ComfyUI nodes in practice.
I suspect if someone went to the trouble of implementing your above solution, they'd find the end result isn't as good as they'd hoped. In practice you'd probably find one or more steps don't work correctly; for example, maybe today's multimodal LLMs can't evaluate prompt adherence acceptably. If the technology were ready, the evidence would be pretty clear: I'd expect to see some very good, very quickly made comic books shown off by AI enthusiasts on Reddit, rather than the clearly limited, not-very-good comic book experiments which have been demonstrated so far.
A human has to work at it too; more than a few hours when doing more than a few quick sketches (memory has its limits; there's a reason artists keep reference drawings around), and obviously they've already put years into learning their skills before that. But fair: the human artist already knows how to do things that any given model doesn't yet[0]; we kind of have to assemble the overall flow ourselves for now[1].
Then again, you only need to assemble it once, putting those hours of work up front - and if it's done, and it works, it becomes fair to say that AI can, in fact, generate self-consistent comic books.
> I suspect if someone went to the trouble to implement your above solution they'd find the end result isn't as good as they'd hoped. In practice you'd probably find one or more steps don't work correctly- for example, maybe today's multimodal LLM's can't evaluate prompt adherence acceptably.
I agree. I obviously didn't try this myself either (yet, I'm very tempted to try it, to satisfy my own curiosity). However, between my own experience with LLMs and Stable Diffusion, and occasionally browsing Stable Diffusion subreddits, I'm convinced all individual steps work well (and have multiple working alternatives), except for the one you flagged, i.e. evaluating prompt adherence using multimodal LLM - that last one I only feel should work, but I don't know for sure. However, see [1] for alternative approach :).
My point thus is, all individual steps are possible, and wiring them together seems pretty straightforward, therefore the whole thing should work if someone bothers to do it.
> If the technology was ready the evidence would be pretty clear- I'd expect to see some very good, very quickly made comic books shown off by AI enthusiast on reddit rather then the clearly limited/ not very good comic book experiments which have been demonstrated so far.
I think the biggest concentration of enthusiasm is to be found in NSFW uses of SD :). On the one hand, you're right; we probably should've seen it done already. On the other hand, my impression is that most people doing advanced SD magic are perfectly satisfied with partially manual workflows. And it kind of makes sense: manual steps allow for flexibility and experimentation, and some things are much simpler to wire by hand or patch up with some tactical photoshopping than to try to automate fully. In particular, judging the quality of output is both easy for humans and hard to automate.
Still, I've recently seen ads of various AI apps claiming to do complex work (such as animating characters in photos) end-to-end automatically - exactly the kind of work that's typically done in partially manual process. So I suspect fully-automated solutions are being built on a case-by-case basis, driven by businesses making apps for the general population; a process that lags some months behind what image gen communities figure out in the open.
--
[0] - Though arguably, LLMs contain the procedural knowledge of how a task should be done; just ask it to ELI5 or explain in WikiHow style.
[1] - In fact, I just asked Claude to solve this problem in detail, without giving it my own solution to look at (but hinting at the required complexity level); see this: https://cloud.typingmind.com/share/db36fc29-6229-4127-8336-b... (and excuse the weird errors; Claude is overloaded at the moment, so some responses had to be regenerated; also styling on the shared conversation sucks, so be sure to use the "pop out" button on diagrams to see them in detail).
At very high level, it's the same as mine, but one level below, it uses different tools and approaches, some of which I never knew about - like keeping memory in embedding space instead of text space, and using various other models I didn't know exist.
EDIT: I did some quick web search for some of the ideas Claude proposed, and discovered even more techniques and models I never heard of. Even my own awareness of the image generation space is only scratching the surface of what people are doing.
In comparison I've messed around with prompting image generator models quite a bit and it's not possible to get remotely close to the quality level of even rough paid concept work by a professional, and the credits to run these models aren't particularly cheap.
I think people are getting way out over their skis here. Even in 2D, I can't, for example, generate inventory images for weapons and items for a game yet, which is an order of magnitude simpler test case than video. They all come out in slightly different styles. If I don't care that they all look different in strange ways, then it's useful, but any consumer will think it looks like crap.
AI or not, no one but you cares about the lower order bits of your idea.
GenAI is great at filling in those lower-order bits, but until stuff like ControlNet gets much better precision and UX, I think GenAI will be stuck in the uncanny valley, because outputs are inconsistent between scenes, frames, etc.
Not disagreeing, just noting: this is not how [most?] people's minds work {I don't think you're holding to that opinion particularly, I'm just reflecting on this point}. We have vague ideas until an implementation is shown, then we examine it and latch on to a detail and decide if it matches our idea or not. For me, if I'm imagining "a superhero planting vegetables in his garden" I've no idea what they're actually wearing, but when an artist or genAI shows me it's a brown coat then I'll say "no something more marvel". Then when ultimately they show me something that matches the idea I had _and_ matches my current conception of the idea I had... then I'll point out the fingernails are too long, when in the idea I hadn't even perceived the person had fingers, never mind too-long fingernails!
I'd warrant any actualised artistic work has some delta with the artists current perception of the work; and a larger delta with their initial perception of it.
These are situations with relatively simple logical constraints, but an infinite number of valid solutions.
Keep in mind that we are not requiring any particular configuration of circuit diagram, just any diagram that makes sense. There are an infinite number of valid ones.
That doesn't mean it can't work with AI - it's that you may need to add something extra to the generative pipeline, something that can do circuit diagrams, and make the diffusion model supply style and extra noise (er, beautifying elements).
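E.g., have a rule-based tool draw a schematic that actually makes sense, then let diffusion supply only the style, conditioned on those lines (a sketch; schemdraw and the ControlNet repo names are real, but treat the pairing as an assumption, not a recipe):

    import schemdraw
    import schemdraw.elements as elm
    import torch
    from PIL import Image, ImageOps
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

    # 1. A circuit with valid topology, drawn procedurally.
    with schemdraw.Drawing(file="circuit.png") as d:
        d += elm.SourceV().label("9V")
        d += elm.Resistor().right().label("1k")
        d += elm.Capacitor().down().label("100nF")
        d += elm.Line().left()

    # 2. Diffusion restyles it, constrained to follow the drawn lines.
    sketch = ImageOps.invert(Image.open("circuit.png").convert("RGB"))
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
        torch_dtype=torch.float16,
    ).to("cuda")
    styled = pipe(
        "weathered blueprint pinned to a lab wall, cinematic lighting",
        image=sketch,
    ).images[0]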
> Keep in mind that we are not requiring any particular configuration of circuit diagram, just any diagram that makes sense. There are an infinite number of valid ones.
On that note. I'm the kind of person that loves to freeze-frame movies to look at markings, labels, and computer screens, and one thing I learned is that humans fail at this task too. Most of the time the problems are big and obvious, ruining my suspension of disbelief, and importantly, they could be trivially solved if the producers grabbed a random STEM-interested intern and asked for advice. Alas, it seems they don't care.
This is just a specific instance of the general problem of "whatever you work with or are interested in, you'll see movies keep getting it wrong". Most of the time, it's somewhat defensible - e.g. most movies get guns wrong, but in a way people are used to, one that makes the scenes more streamlined and entertaining. But with labels, markings and computer screens, doing it right isn't any more expensive, nor would it make the movie any less entertaining. It seems the people responsible either don't know better or don't care.
Let's keep that in mind when comparing AI output to the "real deal", so as not to set an impossible standard that human productions don't meet, and never did.
In particular, internal consistency is one of the important constraints which viewers will immediately notice. If you’re just using sora for 5 second unrelated videos it may be less of an issue but if you want to do anything interesting you’ll need the clips to tie together which requires internal consistency.
Sora is obviously not Photoshop, but given that you can write basically anything you can think of I reckon it's going to take a long time to get good at expressing your vision in words that a model like Sora will understand.
FWIW I too have been quite frustrated iterating with AI to produce a vision that is clear in my head. Past changing the broad strokes, once you start “asking” for specifics, it all goes to shit.
Still, it’s good enough at those broad strokes. If you want your vision to become reality, you either need to learn how to paint (or whatever the medium), or hire a professional, both being tough-but-fair IMO.
Things like rearranging objects in the scene with drag'n'drop sound implementable (although incredibly GPU-heavy).
Also, AI sucks at understanding detail expressed in symbolic communication, because it doesn't understand symbols the way linguistic communication expects the receiver to understand them.
My own experience is that all the AI tools are great for shortcutting the first 70-80% or so. But the remaining 20-30% goes up an exponential curve of required detail, which is easier and easier to express directly using tooling and my human brain.
Consider the analogy of a contract worker building or painting something for you. If all you have is a vague description, they'll make a good guess and you'll just have to live with that. But the more time you spend communicating with them (through descriptions, mood boards, rough sketches, etc.), the closer to your detailed vision it will get. But you only REALLY get exactly what you want if you do it yourself, or sit beside them as they work and direct almost every step. And that last option is almost impossible if they can't understand symbolic meaning in language.
(Talking about visual generative AI in general)
I find generative AI frustrating because I know what I want. To this point I have been trying but then ultimately sitting it out — waiting for the one that really works the way I want.
If you want higher quality/precision, you’ll likely want to ask a professional, and I don’t expect that to change in the near future.
People pay for stock images.
But that’s probably something you don’t pay for directly, instead paying for e.g. a phone that has those features.
What happens is that a description becomes a longer specification or script that's still good and hangs together in itself, and then further iterations involve professionals who can't do "exactly what the director wants" but rather do something further that's good and close enough to what the director wants.
Eventually, it starts to muck with the earlier work it got right, when I'm just asking it to add onto it.
I was still happy with what I got in the end, but it took trial and error and then a lot of piecemeal coaxing with verification that it didn't do more than I asked along the way.
I can imagine the same for video or images. You have to examine each step post prompt to verify it didn't go back and muck with the already good parts.
With ChatGPT, you can iteratively improve text (e.g., "make it shorter," "mention xyz"). However, for pictures (and video), this functionality is not yet available. If you could prompt iteratively (e.g., "generate a red car in the sunset," "make it a muscle car," "place it on a hill," "show it from the side so the sun shines through the windshield"), the tools would become exponentially more useful.
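As a sketch of the interaction pattern being asked for - the two functions here are hypothetical stand-ins, since no current tool exposes exactly this API:

```python
def generate_image(prompt: str) -> bytes:
    """Stand-in for a text-to-image call (hypothetical)."""
    return b""

def edit_image(image: bytes, instruction: str) -> bytes:
    """Stand-in for an instruction-based image-edit call (hypothetical)."""
    return image

image = generate_image("a red car in the sunset")
for instruction in [
    "make it a muscle car",
    "place it on a hill",
    "show it from the side so the sun shines through the windshield",
]:
    # Each step edits the previous result instead of regenerating from
    # scratch, so earlier choices are preserved.
    image = edit_image(image, instruction)
```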
Sure, if you then do the same in reverse.
https://app.checkbin.dev/snapshots/1f0f3ce3-6a30-4c1a-870e-2...
Pros:
- Some of the Sora results are absolutely stunning. Check out the detail on the lion, for example!
- The landscapes and aerial shots are absolutely incredible.
- Quality is much better than Mochi & LTX out of the box. Mochi/LTX seem to require specifically optimized workflows (I've seen great img2vid LTX results on Reddit that start with Flux image generations, for example). Hunyuan seems comparable to Sora!
Cons:
- Still nearly impossible to access Sora despite the "launch". My generations today were numbered in the 2000s, implying that it's only open to a very small number of people. There's no API yet, so it's not an option for developers.
- Sora struggles with physical interactions. Watch the dancers moonwalk, or the ball go through the dog. HunyuanVideo seems to be a bit better in this regard.
- Can't run it locally (obviously).
- I haven't tested this, but I think it's safe to assume Sora will be censored extensively. HunyuanVideo is surprisingly open (I've seen NSFW generations!).
- I'm getting weird camera angles from Sora, but that could likely be solved with better prompting.
Overall, I’d say it’s the best model I've played with, though I haven’t spent much time on other non-open-source ones. Hunyuan gives it a run for its money, though!
The vibe they give me is similar to the iPhone photography commercials where yes, in theory, a picnic in the park could look exactly like this except for all the parts that seem movie perfect.
I guess it's really more of a colour grading question where most of the Sora colour grading triggers that part of my brain that says "I'm watching a movie and this isn't real" without quite realising why.
A few of the Hunyuan videos in contrast seem a bit more believable even though they have some obvious glitches at times.
The other thing I think Sora has is that thing in commercials where no one else except the protagonist exists and nothing is ever inconvenient. The video of the teacher in a classroom with no students reminds me of that as well as the picnic in the park where there's wide open space with no one around.
I suppose it depends if the goal is to generate believable video and how you define believable.
The other day I was scrolling YouTube Shorts and a couple of videos invoked an uncanny-valley response in me (I think one was a clip of an unrealistically large snake covering some hut), which was somehow fascinating and strange and captivating. Then, scrolling down a few more, I again saw something kind of "unbelievable"... I saw a comment or two saying it's fake, and upon closer inspection: yeah, there were enough AI'esque artifacts to confidently conclude it was fake.
We've known about AI slop permeating Facebook -- usually a Jesus figure made out of an unlikely set of things (like shrimp!) -- and we've known that it grips eyeballs. I don't even know which box to categorize this in; in my mind it conjures the image of those people at slot machines, mechanically and soullessly pulling levers because they're addicted. It's just so strange.
I can now imagine some of the conversations that might have happened at Google when they chose to keep a lot of genAI innovations under wraps (I'm being charitable here about their motives), and I can't help but agree.
And I can't help but be saddened by OpenAI's decision to unload a lot of this before reckoning with the consequences of unleashing it on humanity, because I'm almost certain it'll be used more for bad things than good, and I'm certain its application to bad things will capture more eyeballs than its application to good ones.
This was not marked as AI-generated and commenters were in awe at this fuzzy train, missing the "AIGC" signs.
I'm quite nervous for the future.
A) Most of the giveaways are pretty subtle and not what viewers are focused on. Sure, if you look closely, the fur blends into the pavement in some places, but I'm not going to spend 5 minutes investigating every video I see for hints of AI.
B) Even if I did notice something like that, I'm much more likely to write it off as a video filter glitch, a weird video perspective, or just low quality video. For example, when they show the inside of the car, the vertical handrails seem to bend in a weird way as the train moves, but I've seen similar things from real videos with wide angle lenses. Similar thoughts on one of the bystander's faces going blurry.
I think we just have to get people comfortable with the idea that you shouldn't trust a single unknown entity as the source of truth, because everything can be faked. For insignificant things like this it doesn't matter, but for big things you need multiple independent sources. That's definitely an uphill battle, and who knows if we can do it, but it's the only way we're going to get out the other side of this in one piece.
[1]: e.g. tiktok https://newsroom.tiktok.com/en-us/partnering-with-our-indust...
Indeed, a great (if counterintuitive) example of this is The Wolf of Wall Street. I bet a lot of people would be surprised at just how much CGI is used in that just for set/location.
The truth is... most people will simply not care. Raised eyebrow, hm, cute, next. Critical watching is reserved for critics like the crowd on HN and the like, but they represent only a small percentage of the target audience and revenue stream.
Then there are the usual giveaways for CG - sharpness, noise, lighting, color temperature, saturation - none of them match. There's also no diffuse reflection of the intense pink color.
I’ve worked in CG for many years and despite the online nerd fests that decry CG imagery in films, 99% of those people can’t tell what’s CG or not unless it’s incredibly obvious.
It’s the same for GenAI, though I think there are more tells. Still, most people cannot tell reality from fiction. If you just tell them it’s real, they’ll most likely believe it.
I've noticed people assume things are CG that turn out to be practical effects, or 90% practical with just a bit of CG to add detail.
Worse, directors often lie about what’s practical and we’ll have replaced it with CG. So people online will cheer the “practicals” as being better visually, while not knowing what they’re even looking at.
I’ve seen interviews with actors even where they talk about how they look in a given shot or have done something, and not realize they’re not even really in the shot anymore.
People just have terrible eyes once you can convince them something is a certain way.
Lawrence of Arabia or Cleopatra alone have incredible, fully live-shot special effects which cannot be easily replicated with CG and have aged like fine wine, unlike the trash early CG of the 80s and 90s, which ruined otherwise great films like The Last Starfighter.
You’re taking the best films of an era and comparing them to an arbitrary list of movies you don’t like? Adding to that, you’re comparing it to films in the infancy of a technology?
This is peak confusion of causality and correlation. There are tons of great films in that time frame with CG. Unless you’re going to argue that Jurassic Park is bad.
Especially because I’ve done both on set and virtual production, it’s hard to suspend disbelief in a lot of films.
This goes for conversation too! My neighbour recently told me about a mutual neighbour who walks 200 miles per day working on his farm. When I explained that this is impossible, he said "I'll have to disagree with you there".
https://www.reddit.com/r/Ultramarathon/comments/xhbs4d/sorok...
https://en.wikipedia.org/wiki/Aleksandr_Sorokin
So, not very convenient for a non-world-champion runner to do (let alone while doing farm work) (let alone on more than one occasion).
In my opinion anyway, I'm gonna have to disagree with any counterpoints in advance.
If all opinions are equal, and we’ve reinforced that you can find anything to strengthen an opinion, then facts don’t actually matter.
But I don’t think it’s actually all that recent. History is full of people saying that facts or logic don’t matter. The Americas were “discovered” by such a phenomenon.
A related phenomenon is not being able to hear the difference between 128kbps and 320kbps. I find the notion astonishing, and yet lots of people cannot tell the difference.
But also in the case of the fluffy train there's nothing to compare it against. The reason CGI humans look the most fake is because we're trained from birth to read a human face. Someone that looks at trains on a regular basis will probably discern this as being fake quicker than most.
But what I was thinking while enjoying the show was: people wouldn't do that, if it didn't work.
This is the point. There is no such thing as "completely fools commenters". I mean, it didn't fool you, apparently. (But don't be sad, I bet you were fooled by something else: you just don't know it, obviously.) But some of it always fools somebody.
I really liked how Thiel mentioned on some podcast that ChatGPT successfully passed the Turing test - which was implicitly assumed to be "the holy grail of AI" - and nobody really noticed. This is completely true. We don't really think about ChatGPT as something that passes the Turing test; we think about how the fucking stupid useless thing misled you with some mistake in the calculations you decided to delegate to it. But realistically, if it doesn't pass, it's only because it is specifically trained to try to avoid passing it.
You can't assume that with scams. Quite often, scams are themselves sold as get-rich-quick schemes, and like all GRQ schemes, they wouldn't be for sale if they actually worked.
Videos like these were already achievable through VFX.
The only difference here is a reduction in costs. That does mean that more people will produce misinformation, but the problem is one that we have had time to tackle, and which gave rise to Snopes and many others.
It also took me a while to find any truly unambiguous signs of AI generation. For example, the reflection on the inside of the windows is wonky, but in real life warped glass can also produce weird reflections. I finally found a dark rectangle inside the door window, which at first stays fixed like a sign on the glass. However it then begins to move like part of the reflection, which really broke the illusion for me.
[1]: https://web.archive.org/web/20120829004513/http://stuffucanu...
It would be FAR worse if a privately held advanced AI's outputs were unleashed without the population being at least somewhat cautious of everything. The real danger imho comes from private silos of advanced general intelligence that aren't shared and used to gain power, control, and money.
https://www.reddit.com/r/StableDiffusion/comments/1hav4z3/op...
These are even unfair comparisons because they're leveraging text-to-video instead of the more powerful image-to-video. In the latter case, the results are indistinguishable.
Video generation is about to be everywhere, and we're about to have the "Stable Diffusion" moment for video.
Look at the comments: people are already fawning over open source being uncensored.
Cat's out of the bag.
HN is a hyper-specialized group of people. The average person cannot do this and, as we've seen, devours misinformation with no second thoughts.
But it is funny to see how much stuff gets uploaded with zero quality control and still gets traction. These models really don't deal well with "innocent" letter substitutions, Iike using I instead of l.
Like I said in another comment, LLMs are cool and useful, but who in the hell asked for AI art? It's good enough to fool people and break the fragile trust relationship we had with online content, but is also extremely shit and carries no meaning or depth whatsoever.
everyone who has ever used stock photography, custom illustrators, and image editing. as AI improves, it will come after all of those industries.
that said, it is not OpenAI's goal to beat shutterstock, nor is it the goal of anthropic or google or meta. their goal is to make god: https://ia.samaltman.com/ . visual perception (and generation) is the near-term step on that path. every discussion of AI that doesn't acknowledge this goal, what all of these billions of dollars are aiming for, is myopic and naive.
For example, you need to generate a landing page for your boring company: text, images, videos and the overall design (as well as code!) can be and should be generated because... who cares about your boring company's landing page, right?
Most companies don't need this. They need a page that has their contact info and some general information about services they provide so they can have a bare minimum internet presence and show up on google maps.
Companies who understand the importance of a customer friendly and functional web presence get a great return on their investment. And it's much better for the customer.
Your ice cream shop doesn’t need a landing page because of word of mouth and foot traffic.
Some project management platform for plumbers needs a highly tuned webpage because they’re competing with 20 other such systems, and there’s no line to walk past and assume it’s there because the software is good.
Believing that if you build great plumbing SaaS software, paying customers will magically appear is naive.
A great product can sell itself. But that doesn’t mean that marketing and sales aren’t necessary in order to get the product in front of people, assuage their concerns, reassure them that it solves their problems, show social proof from others using it, and close the deal. A good landing page will do all of this ;)
I did. I started messing around with computer graphics on DOS with QBASIC and consider AI art to be just an extension of that.
On the other hand I don't care all that much for LLMs most of the time. They're sometimes useful, but while I find AI art I enjoy very regularly, using a LLM for something is more a once every couple weeks event for me.
That's what HN is for
Somewhere there's a site for "hackers" where it isn't, and I hope I stumble across that site at some point.
"The earth is flat" - "Can you prove it?" - "Oh it's just my opinion". It's dishonest.
To get back to the beginning: I really do agree that the societal impact on the whole appears to be negative. But there are some positives and I wanted to share my example of that.
Of course, as knowledgeable people in tech we can look at the last few years of AI improvements as technically remarkable. pen2l is talking about social impact.
I hope our trade can collectively become adults at the big table of Real Engineers. Consider the impact on humanity of your work. If you don’t care, then you are either recklessly irresponsible, don’t know any better, or are intentionally causing harm at scale.
Tech is a very powerful tool that can automate the most mundane tasks, and also automate harm, like mass surveillance and the erosion of ownership rights over your devices. The sheer ability to create new markets and replace inefficient non-automated markets leads to huge $$$-making opportunities, which people may mistake as being good in itself (good for economy / GDP = good for humanity).
"just be privileged as I was to get all the necessary education to be able to not be fooled by this tech". Yeah, very realistic and compassionate.
While the average person overestimates their own intelligence, the average techy dramatically underestimates the intelligence of the average member of the public. The weirdos that latch onto every fake video and silly conspiracy theory are dramatically overrepresented in every online comments thread, but supposed geniuses in the tech/NGO/academic community forget this and assume a broad swath of the public believes in stuff like "Pizzagate" because nuanced thinking is a skill only the enlightened few possess.
For example, some people can be very intelligent, yet not be discerning of information that resonates with prior biases. You see this in those who are devoutly religious, politically polarized, etc.
There is reason to believe that such biases will lead to ontological misinterpretations of algorithmically generated information.
You can see mistakes in interpretation on a day to day basis by the population at large. There are swaths of widely held beliefs that aren't based in truth. Pretty much anyone is likely to believe at least some stereotype, folklore, urban legend, or myth.
Example: if a gen-AI video of a politician committing some crazy crime came out, then even if it were proven fake, people would start questioning everything and still act as if the politician were guilty.
See the part of my comment you are replying to where I specifically stated that the motivation for all of this is that "Jethro doesn't vote the way I want him to". You've proven my point.
The censorious attitudes on HN were non-existent before Trump won in 2016. I know this for a fact. I've had my account on here since 2012, after 2 years of being just a reader.
Meanwhile, you overestimate how immune to misinformation and lies the average HN techy is. Just a few years ago, the majority of people on here believed, with utter conviction, that the bat-borne coronavirus lab in Wuhan had absolutely no connection with the bat-borne coronavirus epidemic that started in Wuhan and that only bigots and ignoramuses could draw such a conclusion. I experienced this whenever I brought up the blatantly obvious, common sense connection in these same comment threads in late 2020 or into mid 2021. The absolutely absurd denial of common sense by otherwise intelligent people was reminiscent of trying to talk to a religious fundamentalist about evolution while pointing at dinosaur fossils and having them continue to deny what was staring them in the face.
This has nothing to do with privilege, a person in Indian slums on his 2005 PC with internet access can have better internet BS radar than an Ivy League student.
I think though, that if you are in the position of doing serious critical reflection about this stuff, which is in my opinion necessary for being in a position of discernment wrt this stuff, then you are privileged. This is the idea I wanted to convey.
That's a ridiculous assumption. In my experience no one outside of tech circles is even remotely aware that this kind of thing is possible already.
My family is mostly working class in an economically depressed part of the Virginia/West Virginia coal country, and every single one of them is aware of this. None of them work in tech, obviously. None have college degrees.
I maintain that the attitude driving this paternalistic, censorious attitude is arrogance and condescension.
A prime example of how broadly aware the public all over the world is of AI faked videos was the reaction in the Arab world to the October 7th videos posted by Hamas. A shocking (and depressing) percentage of Arabs, as well as non-Arab Muslims all over the world, believed the videos and pictures were fakes produced with AI. I don't remember the exact number, but the polling I saw in November showed it was over 50% who believed they were fakes in countries as disparate as Egypt and Indonesia.
These two are very different things. My family believes all kinds of videos on the internet are fake. None of them have any idea what a tool like Sora can do. The gap between "oh this was probably special effects" to "you have to notice pixels shimmering around someone's hand to tell" is enormous.
>>My family is mostly working class in an economically depressed part of the Virginia/West Virginia coal country, and every single one of them is aware of this.
Your working class family has time to keep up with the advancements in generative AI for video? They have more free time than I do then. If we're sharing anecdotes about families then my family is from Polish coal country and their idea of AI is talking to your car and it responding poorly.
>>I maintain that the attitude driving this paternalistic, censorious attitude is arrogance and condescension.
I'm confused - who is displaying this "censorious" attitude here?
>> and your source of data for this is "your own experience"? Really?
Yes, really. I mean do you have anything else? You are also quoting things from your own experience.
My prediction is that next year they will catch up a bit and will not be shy about releasing new technology. They will remain behind in LLMs but will at least more deeply envelop their own existing products, thus creating a narrative of improved innovation and profit potential. They will publicly acknowledge perceived risks and say they have teams ensuring it will be okay.
The latest Gemini version (1206) is at least tied for the best LLM, if not the best outright.
99% of the time it's either useless or wrong.
Sorry for the name dropping, I have no affiliation and am just a very happy user, so I wanted to share it as it felt adequate.
We don't have tech to correctly "detect ai" in 2024, which is why education has broken down over the last few years with serial cheating in every institution.
Every company so far that has claimed to detect AI-generated slop has failed.
https://www.reddit.com/r/uBlockOrigin/comments/1ct5mpt/heres...
This tech will make the internet even more unbearable to use, not to mention its huge potential for abuse. This is far worse than whatever positives it might have, which are still unclear. What a shitshow.
I believe the internet needs a distributed trust and reputation layer. I haven't fully thought through all the details, but:
- Some way to subscribe to fact checking providers of your choice.
- Some way to tie individuals' reputation to the things they post.
- Overlay those trust and reputation layers.
I want to see a score for every webpage, and to be able to drill into what factored into that score, plus any additional context people have provided (e.g. Community Notes).
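A rough sketch of how overlaid scores from subscribed providers might combine - entirely speculative on my part, and the provider names are made up:

```python
from dataclasses import dataclass

@dataclass
class Assessment:
    provider: str
    score: float   # 0.0 = unreliable, 1.0 = trustworthy
    context: str   # e.g. a Community-Notes-style explanation

def page_score(assessments: list[Assessment], weights: dict[str, float]) -> float:
    """Weighted average over only the providers this user subscribes to."""
    subscribed = [a for a in assessments if a.provider in weights]
    if not subscribed:
        return 0.5  # no signal: stay neutral
    total = sum(weights[a.provider] for a in subscribed)
    return sum(weights[a.provider] * a.score for a in subscribed) / total

assessments = [
    Assessment("factcheck.example", 0.2, "Image previously debunked"),
    Assessment("notes.example", 0.4, "Original clip is AI-generated"),
]
print(page_score(assessments, {"factcheck.example": 1.0, "notes.example": 0.5}))
```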
There's a huge bootstrapping and incentive problem though. I think all the big players would need to work together to build this. Social media, legacy media companies, browsers, etc.
This also presupposes people actually care about the truth, which unfortunately doesn't always seem like the case.
Maybe the model is you have to pay per account to use it, or maybe the model will be something else.
I doubt this will make everyone just go back to primarily communicating in person/via voice servers but that is a possibility.
Spammers can afford more money per bot for their operations than the average user can justify to spend on social media.
What we probably need (this is going to sound crazy, but I don’t have a better suggestion), is some kind of networked trust system.
Every value OpenAI has claimed to hold hasn't lasted a millisecond longer than there was profit motive to break it, and even Anthropic is doing military tech now.
Worse, the audience is our parents and grandparents. They have little context to be able to sort out reality from this stuff
Do yourself a favor and avoid that kind of content, opting instead for long-form consumption. The discovery patterns are different, but you're less inclined to encounter fake content if you develop a trust network of good channels.
I just hope the online social media space gets enshittified to such a degree that it stops playing a major role in society, though sadly that is not how things usually seem to work.
i.e. a company developing this tech, keeping it under wraps, and, say, only using it for special government programmes...
You could even argue that shipping the product and not the paper would have done more for AI safety; at least it would be controlled.
They also lie to themselves: they cannot detect overt bias or reflect on themselves and become aware of their hidden motives, resentments and wishful thinking. Including me and you.
Most people hold important beliefs about the world that are comically inaccurate.
AI changes absolutely nothing about how many true or false beliefs the average Joe holds.
Yeah, and it's especially hypocritical coming from the company that refused to disclose anything about GPT-3 because they said it was dangerous. And then a few years later: "Hey, remember that thing we told you was too dangerous? Now we have a monetization strategy, so we're giving everyone access, today."
And yet, you would not have known how to recognize those artifacts without "OpenAI's decisions to unload a lot of this before recognizing the results of unleashing this to humanity".
(This is my latest favorite prompt and interview/conversation question)
Anything that you or others can answer to this which isn't some stupid "gotcha" puzzle shit (lol, it's video cus LLMs aren't video models, amiright?) will be wrong, because of things like structured decoding and the fact that ultra-high temperature works with better samplers like min_p.
https://openreview.net/forum?id=FBkpCyujtS&noteId=mY7FMnuuC9
(This is the hash of a string that randomly popped into my mind. An LLM will write this with almost 0 probability --- until it is crawled into the training sets.)
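Producing such a commitment is a one-liner; anyone (or any model) without the preimage has essentially zero chance of reproducing the digest:

```python
import hashlib

secret = "a string that randomly popped into my head"  # never published
print(hashlib.sha256(secret.encode()).hexdigest())     # safe to publish
```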
I know there is Beautiful.ai or Copilot for PowerPoint, but none of the existing tools really work for me because the results and the user flow aren’t convincing.
Basically it generates slides from markdown, which is great even without LLMs. But you can get LLMs to output in markdown/Marp format and then use Marp to generate the slides.
I haven't looked into complicated slides, but works well for text-based ones.
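A rough sketch of that pipeline, assuming the Marp CLI is installed (the slide markdown is hand-written here, but in practice it would come from an LLM):

```python
import subprocess
from pathlib import Path

# Marp-flavored markdown: `---` separates slides.
slides = """\
# Quarterly Update

- Revenue up
- Costs down

---

# Next Steps

1. Ship the thing
2. Measure everything
"""

Path("slides.md").write_text(slides)
# Requires the Marp CLI, e.g. `npm install -g @marp-team/marp-cli`.
subprocess.run(["marp", "slides.md", "-o", "slides.pdf"], check=True)
```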
If you need the AI to help you brainstorm a good narrative, that is a different story
Is it just me, or does it seem like OpenAI revolutionized the field with both ChatGPT and Sora, but has completely hit the ceiling?
Honestly a bit surprised it happened so fast!
Each company would either rush to get a phone out with the new Snapdragon chip, or take their time to polish a release and have a better phone late in the cycle. But the real improvements were just the chip.
Nvidia chips/larger data centers are the chips; the models are the plethora of Android phones each generation.
That kept going until progress stabilized. Then the best user experience and vertical integration (Apple) won over chasing chip performance.
Another example of this is stuff like Bluesky. There's a lot of reasons to hate Twitter/X, but people going "Wow, Bluesky is so amazing, there's no ads and it's so much less toxic!" aren't complimenting Bluesky, they're just noting that it's smaller, has less attention, and so they don't have ads or the toxic masses YET.
GenAI image generation is an obvious vector for all sorts of problems, from copyrighted material, to real life people, to porn, and so on. OpenAI and Google have to be extraordinarily strict about this due to all the attention on them, and so end up locking down artistic expression dramatically.
Midjourney and Stable Diffusion may have equal stature among tech people, but in the public sphere they're unknowns. So they can get away with more risk.
Why? Did the inventors of VHS tapes "have to be extraordinarily strict" and bake in safeguards because people might violate copyright laws, make porn, or tape something illegal?
Enforcing laws is the responsibility of the legal system. It sets a concerning precedent when companies like OAI would rather lobotomize their flagship products than risk them generating any Wrongthink.
Besides just citing your sources, I'm genuinely curious what the best ones are for this so I can see the competition :)
That's not the only open video model, either. Lightricks' LTX, Genmo's Mochi, and Black Forest Labs' upcoming models will all be open source video foundation models.
Sora is commoditized like Dall-E at this point.
Video will be dominated by players like Flux and Stable Diffusion.
I think the important thing is task quality and I haven't seen any evaluations of that yet.
It took two weeks to go from Mochi running on 8xH100s to running on 3090s. I don't think you appreciate the rapidity at which open source moves in this space.
HunYuan landed less than one week ago with just one modality (text-to-video), and it's already got LoRA training and fine tuning code, Comfy nodes, and control nets. Their roadmap is technically impressive and has many more control levers in scope.
I don't think you realize how "commodity" these models are and how closed off "turn key solutions" quickly get out-innovated by the wider ecosystem: nobody talks about or uses Dall-E to any extent anymore. It's all about open models like Flux and Stable Diffusion.
{Text/Image/Video}-to-Video is an inadequate modality for creative work anyway, and OpenAI is already behind on pairing other types of input with their models. This is something that the open ecosystem is excelling at. We have perfect syncing to dance choreography, music reactive textures, and character consistency. Sora has none of that and will likely never have those things.
> something time-sharing like Sora where you pay a relatively small amount per video.
Creators would prefer to run all of this on their own machines rather than pay for hosted SaaS that costs them thousands of dollars.
And for those that do prefer SaaS, there are abundant solutions for running hosted Comfy and a constellation of other models as on-demand.
- uncensored output (SD + LoRa)
- Overall speed of generation (midjourney)
- Image quality (probably midjourney, or an SDXL checkpoint + upscaler)
- Prompt adherence (flux, DALL-E 3)
EDIT: This is strictly around image generation. The main video competitors are Kling, Hailuo, and Runway.
You can see more model samples here: https://youtu.be/bCAV_9O1ioc
you probably meant Stable Diffusion XL. (autocorrect victim)
In terms of image quality, Runway, Luma, and a few of the Chinese models all give "ok" results. I haven't seen anything from Sora to convince me they have made any kind of significant leap.
The issue there is alignment. It's cheap for Runway or Luma to continue down this path, since it's their only path to profitability; they do nothing else.
But for OpenAI, I don't think this is near their top list of priorities. I doubt that they will be able to keep adding features like their competitors. Seems to me like this is the equivalent of a side project for them.
edit after watching direct comparison videos, I've changed my mind. Sora is ahead.
For anyone who is curious where to find tons of Sora videos, go to Reddit's r/aivideo.
This is not a problem as long as they do the ChatGPT thing - sell an API and let others figure out how to build a UX around it - but here they seem to be gunning for a boxed product.
It was fun for a few days but far more limited than I would have ever expected.
Maybe Sora 5.0 will be something special. Right now though all these video models are basically shit.
1. The first set of doors doesn't have any doorknobs or handles. https://ibb.co/PwqfzBq
2. The second set of doors has handles, and some very large/random hinges on the left door. https://ibb.co/JkDtc6r
3. The third set doesn't have any handles, but I can forgive that, because we're in a spaceship now. The problem is that the inside of the doors seems to have windows, but the outside of the doors doesn't have any. https://ibb.co/nwpXmtq & https://ibb.co/wr6v2g1
4. The best/most hilarious part for me. The doors have handles, but they are on the hinge side of the door. No idea how this would work. https://ibb.co/gWXDcfr
The video with dogs shows three taxis transforming into one, and the number of people under the tree changing: https://player.vimeo.com/video/1037090356?h=07432076b5&loop=...
An example from the HunyuanVideo is terrible as well. Look at that awful tongue: https://hunyuanvideoai.com/part-1-3.mp4
And what we see in that marketing is probably the best they could generate. And I suppose it took a lot of prompt tweaking and regenerations.
The internet is already full of junk shorts and useless videos and soon there will be even more junk content everywhere. :(
If you look at the edge of the doors as they swing open, it seems their movement resembles bifold door movement (there's a wiggle to it common to bifold doors that normal doors never have). Plus they seem to magically reveal an inner fold that wasn't there before.
[1]: https://duckduckgo.com/?t=h_&q=interior+bifold+closet+doors&...
Humans are not built for this power to be in the hands of everyone with low friction.
YouTube turned everyone into broadcasters. Sora could help bring countless untold stories to life, straight from the imagination.
> Humans are not built for this power to be in the hands of everyone with low friction.
Why is having power concentrated in few hands better?
Because most people are dangerous morons. I don't think most people should be allowed to operate a car, let alone the most powerful tool for misinformation that has ever existed
The only thing worse than a powerful, dangerous tool in the hands of the masses is a powerful, dangerous tool controlled exclusively by powerful, dangerous people. (Cue the usual moronic analogies involving thermonuclear weapons...)
I thought we lost a lot in the transition from analog to digital media, but that doesn't mean there's not a peak to any modern craft, just that there hasn't been a unified or named movement highlighting the best and worst, outside of social media algorithms.
Now we have this with social media, everyone is their own little FSB propaganda machine…yay
"This invention, O king," said Thoth, "will make the Egyptians wiser and will improve their memories; for it is an elixir of memory and wisdom that I have discovered."
The king replied:
"This invention will produce forgetfulness in the minds of those who learn to use it, because they will not practice their memory. Their trust in writing, produced by external characters which are no part of themselves, will discourage the use of their own memory within them.
"You have invented an elixir not of memory, but of reminding; and you offer your pupils the appearance of wisdom, not true wisdom, for they will read many things without instruction and will therefore seem to know many things, when they are for the most part ignorant and hard to get along with, since they are not wise, but only appear wise."
In a sane world, any video produced by Sora would be required to have a form of watermarking that's on par with what intellectual property owners require.
We've put people in jail for sharing copyrighted movies, and I don't see why we would refrain from mandating that AI-generated videos carry some caption that says, I don't know, "This video was generated with AI".
People would not respect the mandate, and we would consider that illegal, and use the monopoly on force to take money out of their bank account.
I know, it sounds mad and soooo 20th century - maybe that's why OpenAI overlords are not deeming peasants in France worthy of "a cat in a suit drinking coffee in an office" and "you'll never believe what the other candidate is doing to your kids".
[1] https://www.imatag.com/blog/ai-act-legal-requirement-to-labe...
EDIT: apparently some form of watermarking is built in (but it's not obvious in the examples, for some reason.)
> While imperfect, we’ve added safeguards like visible watermarks by default, and built an internal search tool that uses technical attributes of generations to help verify if content came from Sora.
It's a completely different thing. IP owners want watermarks on their IP so they can prosecute people who use their IP without giving credit, nobody's forcing them to watermark it.
I happen to think that some states will want to prosecute people who publish realistic-looking AI generated images without making it explicit that they're generated. I'm wondering if watermarking could be an effective tool for that.
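As a toy illustration of the weakest form of this - a visible caption stamped onto a frame with Pillow - noting that serious provenance schemes (C2PA-style signed metadata, robust invisible watermarks) are much harder problems:

```python
from PIL import Image, ImageDraw

frame = Image.new("RGB", (640, 360), "gray")  # stand-in for a decoded video frame
draw = ImageDraw.Draw(frame)
draw.text((10, 340), "This video was generated with AI", fill="white")
frame.save("frame_watermarked.png")
```

Of course, a visible caption is trivially cropped out, which is exactly why enforcement (and not just tooling) would be needed.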
(If I were in a bad mood, I would say that we should also make it explicit when images are too heavily photoshopped; but that's another debate, because tools like Sora make manufacturing lies several orders of magnitude cheaper.)
Imagine a culture that would harness their frustration at being left out in the direction of innovating on their own.
Defining the status quo on things like watermarks by leading the field and then demonstrating how to act from the front.
Seems like they'd be more effective than one that settles for derision and calling for taxes and rules from the back of the pack, so they can presumably profit off the terrible evil things being built.
At this point, really, I can think of exactly two use cases:
* cheaply producing ads
* cheaply producing fake news
And it's terrifying, and the people jumping in the bandwagon are scaring me.
There is this quote in "13 Days" [1] where people are discussing the Cuban missile crisis, and, while everyone is gladly / obliviously preparing for the upcoming nuclear holocaust, one gray-haired diplomat raises his hand and says "One of us in the room should be a coward" before asking for a more prudent option.
Maybe it's the age old tension between the "new world" racing forward and the "old world" hitting the brakes. Not necessarily a bad dynamic in the long run. [2]
Feel free to call me, and the whole block I live in, "coward" on this front.
[1] https://en.wikipedia.org/wiki/Thirteen_Days_(film)
[2] https://en.wikisource.org/wiki/French_address_on_Iraq_at_the...
You know, stuff you’d use any images for.
The problem is you seem to think your involvement in the advancements should be orthogonal to your involvement in regulation.
That doesn't work in a world with sovereign nations: as cartoonish as comparing this to nuclear holocaust is, who do you think had more of a role in disarmament, the nuclear-weapon states, or the non-nuclear powers signing treaties with other non-nuclear powers?
If France had their own OpenAI releasing their own Sora with all the regulations you can dream of there'd be more of a discussion to be had over how a SOTA model should be rolled out, with actual counterfactuals to the approach the US and China have taken.
(Of course, Mistral is mostly American money... so I wouldn't bank on them taking a different road.)
And right after that news broke, they "fixed" the problem by no longer disclosing training data sources. That's why early models had papers listing this (e.g. Llama 1) and now nobody does. It's just an unspoken yet open secret now.
The solution was simple - just maintain the information in world space, and sample from that. But simple does not mean cheap, and it led to a ton of redundant data (as in, invisible in the final image) having to be kept track of.
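A toy numpy illustration of that idea (my own sketch, not the actual renderer): persist scene detail once in world space, then re-project it for each frame, so detail that leaves and re-enters the view is never reinvented.

```python
import numpy as np

rng = np.random.default_rng(0)
world_points = rng.uniform(-1, 1, (500, 3)) + np.array([0.0, 0.0, 5.0])  # fixed scene geometry
colors = rng.uniform(0, 1, (500, 3))                                     # fixed surface detail

def render(camera_x: float, f: float = 300.0):
    """Pinhole-project the persistent world points for a camera shifted along x."""
    p = world_points - np.array([camera_x, 0.0, 0.0])
    u = f * p[:, 0] / p[:, 2]
    v = f * p[:, 1] / p[:, 2]
    return np.stack([u, v], axis=1), colors  # same colors in every frame

# Two frames sample the *same* world state, which is what keeps
# off-screen content consistent - at the cost of storing all of it.
frame_a, _ = render(0.0)
frame_b, _ = render(0.5)
```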
It's similar to the LLM hallucination problem. LLMs produce nonsense and untruths - but they are still useful in many domains.
I saw one of these models doing a Minecraft-like simulation, and it looked sort of okay, but then water started to end up in impossible places, and once it was there it kept spreading until you ended up in some Lovecraftian horror dimension. Any useful physics simulation at least needs boundary conditions to hold, and these models have no boundary conditions because they have no clear categories of anything.
That's an interesting way of saying "we're probably gonna miss some stuff in our safety tools, so hopefully society picks up the slack for us". :)
If you happen to notice a Twitter spam bot claiming to be "an AI language model created by OpenAI", know that we have conducted an investigation and concluded that no you didn't. Mission accomplished!
All replaced by open source LLMs at this point.
Most AI video will be produced by Hunyuan [1], LTX [2], and Mochi [3] in short order. These are the Flux / Stable Diffusion models for generative video. These can all be fine tuned to produce incredible results, and work with the Comfy ecosystem for wildly creative and controllable workflows.
I don't think it'll be possible for a closed source tool to compete with the open image/video ecosystem. Dall-E certainly didn't stay competitive for long. It's a totally different game.
[1] https://github.com/Tencent/HunyuanVideo
And I don't think the current status quo of open source models being entirely subsidised by startups and corporations is sustainable, they're all hemorrhaging money and their investors will only have so much patience before they expect returns. Enjoy it while it lasts.
Mochi is better positioned to build tools on top of their community model. They're already thinking about control.
Weights are commodity. Products have value.
Stability was supposed to be doing a similar "give away the models but sell products built on them" strategy and it doesn't seem to be working for them, by all accounts they're barely able to keep the lights on.
It is unlikely anyone is going to perform an act of terrorism with this, or any kind of deepfake that buys Eastern European elections. The worst outcome is likely teens having a laugh.
Would you want Microsoft to claim they're responsible for the "safety" of what you write with Word? For the legality of the numbers you're punching into an Excel spreadsheet? Would you want Verizon keeping tabs on every word you say, to make sure it's in line with their corporate ethos?
This idea that AI is somehow special, that they absolutely must monitor and censor and curtail usage, that they claim total responsibility for the behavior of their users - Anthropic and OpenAI don't seem to realize that they're the bad guys.
If you build tools of totalitarian dystopian tyranny, dystopian tyrants will take those tools from you and use them. Or worse yet, force your compliance and you'll become nothing more than the big stick used to keep people cowed.
We have laws and norms and culture about what's ok and what's not ok to write, produce, and publish. We don't need corporate morality police, thanks.
Censorship of tools is ethically wrong. If someone wants to publish things that are horrific or illegal, let that person be responsible for their own actions. There is absolutely no reason for AI companies to be involved.
Would you want DuPont to check the toxicity of Teflon effluents they're releasing in your neighbourhood? That's insane. It's people's responsibility to make sure that they drink harmless water. New tech is always amazing.
There is no definition of a "safe" model without significant controversy nor is there any standardized test for it. There are other reasons why that is a terrible analogy, but this is probably the most important.
I also, to draw a loose parallel, think that Microsoft should be responsible for the security and correctness of their products, with potentially even criminal liability for egregiously negligent bugs that lead to harm for their users: it isn't ever OK to "move fast and break things" with my personal data or bank account. But like, that isn't what we are talking about constantly with limiting the use cases of these AI products.
I mean, do I think OpenAI should be responsible if their AI causes me to poison myself by confidently giving me bad cooking instructions? Yes. Do I think OpenAI should be responsible if their website leaks my information to third parties? Of course. Depending on the magnitude of the issue, I could even see these as criminal offenses for not only the officers of the company but also the engineers who built it.
But, I do not at all believe that, if DuPont sells me something known to be toxic, that it is DuPont's responsibility to go out of their way to technologically prevent me from using it in a way which harms other people: down that road lies dystopian madness. If I buy a baseball bat and choose to go out clubbing for the night, that one's on me. And like, if I become DuPont and make a factory to produce Teflon, and poison the local water with the effluent, the responsibility is with me, not the people who sold me the equipment or the raw materials.
And, likewise, if OpenAI builds an AI which empowers me to knowingly choose to do something bad for the world, that is not their problem: that's mine. They have no responsibility to somehow prevent me from egregiously misusing their product in such a way; and, in fact, I will claim it would be immoral of them to try to do so, as the result requires (conveniently for their bottom line) a centralized dystopian surveillance state.
C4 and nukes are both just explosives, and there are laws in place that prohibit exploding them in the middle of a city. But the laws that regulate storage of and access to nukes versus C4 are different, and there is a very strong reason for that.
Censorship is bad; everyone agrees on that. But regulating access to technology that has already proven it can trick people into sending millions to fraudsters is a must, IMO. And it had better be regulated before it overthrows some governments, not after.
Microsoft Word and Excel aren't generative tools. If Excel added a new headline feature to scan your financial sheets and auto-adjust the numbers to match what's expected when audited, you bet there would be backlash.
And regarding scrutiny, morphine is an immensely useful tool, and its use is surely extremely monitored.
On the general point, our society values intent. Tools can just be tools when their primary purpose is in line with our values and they only behave according to the user's intent. AI will have to prove a lot to match both criteria.
I went to high school in a fairly affluent area and I promise you this is not true. If you have money and know how to talk to your doctor, you can get whatever you want. No questions asked.
You can even get prescription methamphetamine - and Walgreens will stock generic for it!
If you're really rich it may be a different story, but any of the "middle class" good luck. And if you do find a doctor with some compassion, they are probably about to retire.
That's a decently high bar, I think?
Imagine what you can do if you have money and know how to talk to your local police...
- Sounds like what my accountant already does.
Some tools are a lot more powerful than others and we have to take special care with them.
This is strictly limited to the US. In most advanced democracies you need a stack of papers to get even a small handgun.
?? You 100% can in the USA it just costs a lot of money.
You cannot own tanks or jets capable of using military ordnance in the US (and I’d wager nearly any country that has anything resembling rule of law). You can own decommissioned ones that are rendered militarily useless.
Key point right here.
You let people post what they will, and if the authorities get involved, cooperate with them. HN should not be preemptively monitoring all comments and making corporate moralistic judgments on what you wrote and censoring people who mention Mickey Mouse or post song lyrics or talk about hotwiring a car.
Why shouldn't OpenAI do the same?
Your keyboard is.
Censoring AI generation itself is very much like censoring your keyboard or text editor or IDE.
Edit: Of course, "literally everything is a tool", yada yada. You get what I mean. There is a meaningful difference between tools that translate our thoughts to a digital medium (keyboards) and tools that share those thoughts with others.
Because, at the end of the day, counterfeiting money is already illegal.
...and we should not censor tools, and judge people, not the tools they use.
Even more interestingly - and maybe this can help show that even the most principled argument needs a limit - molecular 3D printers able to produce proteins (yes, this is a thing) are regulated to recognise designs from a database of dangerous pathogens and refuse to print them.
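Mechanically, that kind of gate is simple. A toy sketch with made-up placeholder entries - real screening uses curated sequence databases and much fuzzier matching:

```python
HAZARD_DB = {"MADEUPMOTIFAAA", "MADEUPMOTIFBBB"}  # placeholders, not real data

def is_safe(sequence: str) -> bool:
    """Refuse any design containing a blocklisted motif."""
    return not any(motif in sequence for motif in HAZARD_DB)

if not is_safe("GATTACA" * 10):
    raise PermissionError("design matches a restricted entry; refusing to print")
```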
https://www.reddit.com/r/GIMP/comments/3c7i55/does_gimp_have...
Even if it is a local model, if you trained a model to spew Nazi propaganda, you're still publishing Nazi propaganda to the people who then go use it to make propaganda. It's just very summarized propaganda.
Then let parents choose when teenagers can start driving.
Also let's legalize ALL drugs.
Weapons should all be available to public.
Etc. Etc.
----
It's very naive to think that we shouldn't regulate "tools"; or that we shouldn't regulate software.
I do agree that on many cases the bad actors who misuse tools should be the ones punished, but we should always check the risk of putting something out there that can be used for evil.
This does not need to become a thread about bullying and self harm, but it should be recognized that this example is not benign or victimless.
This genie is out of the bottle, let us hope that laws about users are enough when the tools evolve faster than legislative response.
[edit:spelling]
And the teens are having a laugh by... creating deepfake nudes of their classmates? The tools are bad, and the toolmakers should feel deep guilt and shame for what they released on the world. Do you not know the story of Nobel and dynamite? Technology must be paired with morality.
I think the degree of power matters.
You can argue that that’s how it should be, but that isn’t how it is. And we don’t know what a world that adhered to that principle would look like, it’s possible it would be a disaster. There are a lot of bad things people can do where it’s difficult to catch someone after they’ve done it, and prevention at the tool level is the only way to really effectively stop people.
I’m not saying I like the idea of any of these methods when it comes to AI, but it feels naive to act like there isn’t precedent for stuff like this.
> It is unlikely anyone is going to perform an act of terrorism with this, or any kind of deepfake that buys Eastern European elections. The worst outcome is likely teens having a laugh.
Citation needed bigtime. Sure, people doing organized disinformation campaigns won’t log into OpenAI’s website and use Sora, they’ll probably be running Hunyuan Video with an on-prem or cloud-based GPU cluster, but this feels like as good a time as any to discuss the implications of video generation tools as they stand in December 2024.
Especially a certain someone that's worth a billion dollars, is 100 years old, and whose name ends with Inc.
Finally, people do not label Slovakia as Eastern Europe...
It's not hyperbole. Hunyuan was released before Sora. So regulating Sora does absolutely nothing unless you can regulate Hunyuan, which is 1) open source and 2) made by a Chinese company.
How do we expect the US government to regulate that? Threaten to sanction China unless they stop doing AI research?
We're most of the way there with "our" locked-down, walled-garden pocket supercomputers. Just extend that breadth and bring it to the rest of computing using the force of law.
---
Can I hear someone saying something like "That will never work!"?
Perhaps we should meditate upon that before we leap into any new age of regulation.
After over two decades of careful preparation, we're the stroke of a legislative pen away from having all of the software on our computers regulated by our friends in the government.
It's not even a slippery slope argument. In order to be effective, "We must regulate AI!" means the same thing as "We must regulate computer software!"
The two things are so identical that they're not even so different as two sides of the same coin are.
(Be careful what you wish for; you might just get it.)
Or, "this safety stuff is harder than we thought, we're just going to call 'tag you're it' on society"
Or,
-Oppenheimer: "Man, this nuclear safety stuff is hard. I'm just going to put it all out there and let society explore developing norms and safeguards."
-Society: bombs Japan
-Oppenheimer: "No, not like that. Oops."
The bomb was the end of conventional warfare between nuclear nations. MAD has created an era of peace unlike anything our species has ever seen before.
We have eliminated warfare between nuclear countries, conflicts have been reduced to nuclear/non-nuclear or proxy warfare, and that's a very solid reduction in suffering.
Somehow, society survived just fine.
the notion that generative AI tools should be 'safe' and 'aligned' is as absurd as the notion that tools like Notepad, Photoshop, Premiere and Audacity should exist only in the cloud, monitored by kommissars to ensure that proles aren't doing something 'unsafe' with them.
But these companies are rightfully worried about regulators and legislatures, often led by pearl-clutching journalists, so we can't have nice things.
Giving people what they want when they want it doesn't always lead to happy outcomes. The people themselves, through their representatives, have created the institutions that sometimes put a brake on their worst impulses.
edit: previously, this thread pointed to sora.com
For the Pro $200/month subscription: you get unlimited generations a month (on a slower queue).
I guess the CGI industry implications are interesting, but look at the waves behind the AI-generated man. They don't break so much as dissolve into each other. There's always a tell. These aren't GPU-generated versions of reality with thought behind the effects.
Isn't there a multi-billion dollar industry in California somewhere that caters exactly to that demand?
The "or something" pretty much covers the gotcha you're trying to use. OP is acknowledging that fantasy media is a thing before going on to their actual point.
Infants, people just coming out of anesthesia, the concussed, the hypoxic, the mortally febrile and so on
What? This is 90% of the Instagram/TikTok experience, and has been for years. No one cares if something is real. They care how it makes them feel.
The audience for this is every "creator" or "influencer". No one cares if the content is fake. They'll sell you a vacation package to a destination that doesn't exist and people will still rate it 3/5 stars for a $15 Starbucks gift card.
I've also seen GenAI replacing more and more stock media in many facets of business/professional services.
You say it like that's not the majority of the web.
Other than that, it's also so people can spam every single website with millions of hours of AI generated spam and earn 7 cents off of the 5000 people the algorithm randomly decides to show it to.
Legitimate uses outside of that kinda shit? I fail to see one.
It is a gambling term, most VC funded startups are gambles, AI ones particularly so, it felt apt.
Perhaps it correlates with a rise in investing, in both traditional and new-age assets (like crypto), that is no longer based on sound fundamentals, which makes people identify more with gambling.
No it’s not. I’ve been trying to access it all day: “Sora account creation is temporarily unavailable. We're currently experiencing heavy traffic and have temporarily disabled Sora account creation. If you've never logged into Sora before, please check back again soon.”
I wouldn't get your hopes up - it's not at all as good as they've hyped it.
For instance, that ladybug looks pretty natural, but there's a little glitch in there that an unwitting observer, who's never seen a ladybug move before, may mistake as being normal. And maybe it is! And maybe it isn't?
The sailing ship - are those water movements correct?
The sinking of the elephant into snow - how deep is too deep? Should there be snow on the elephant or would it have melted from body heat? Should some of the snow fall off during movement or is it maybe packed down too tightly already?
There's no way to know because they aren't actual recordings, and if you don't know that, and this tech improves leaps and bounds (as we know it will), it will eventually become published and will be taken at face value by many.
Hopefully I'm just overthinking it.
Well, none of the existing animated movies follow the exact laws of physics.
That is only true for well crafted things. There's plenty of stuff that's just wrong for no reason beyond ease of creation or lack of care about the output.
Although, plenty of kids have tied a blanket around their necks and jumped off some furniture or a low roof, right? Breaking a leg or twisting an ankle in their attempt to imitate their favorite animated superhero.
The juxtaposition of something that looks extremely real (your mother) and something that never happened (ladybug) is something that's hard for the mind to reconcile.
The presence of a real thing inadvertently and subconsciously gives confidence to the fake thing also being real.
It is indeed something that society has to shift to deal with.
Personally, I'm not sure that it's the photoreal aspect that poses the biggest challenge. I think that we are mentally prepared to handle that as long as it's not out of control (malicious deep-fakes used to personally target and harass people, etc.) I think the biggest challenge has already been identified, namely, passing off fake media as being real. If we know something is fake, we can put a mental filter in place, like a movie. If there is no way to know what is real and what is fake, then our perception reality itself starts to break down. That would be a major new shift, and certainly not one that I think would be positive.
Even worse than that is when people get USED to it and no longer have a natural aversion to horrific scenes taking place in the real world.
This AI stuff accelerates that process of illusion but in every possible direction at once.
As much as people don't want to believe it, by beholding we are indeed changed.
I don’t think that slippery slope holds up.
IIRC there’s pretty solid research showing that even children beyond the age of 8 can tell the difference between fiction and reality.
If media didn’t profoundly affect us, how could exposure therapy rewire fears? Why would billions be spent on advertising if it didn’t work? Why would propaganda or education exist if ideas couldn’t be planted and nurtured through storytelling?
Is there any meaningful difference between a sermon from the pulpit and a feature film in the theater? Both are designed to influence, persuade, and reshape our worldview.
As Alan Moore aptly put it: "Art is, like magic, the science of manipulating symbols, words, or images to achieve changes in consciousness."
In my opinion the old adage holds true: you are what you eat. And we will soon be eating unimaginable mountains of artificial content cooked up by dream engines tuned to our every desire and whim.
Huh? The first half of this contradicts the second. We haven't "grown colder and more detached", we've adapted to the fact that images are no longer reliable indicators of reality. What we do and don't value in the real world hasn't changed.
> And we will soon be eating unimaginable mountains of artificial content cooked up by dream engines tuned to our every desire and whim.
Always has been. Multi-channel TV was already that, and attracted the same kind of doomerism.
I would retort that animation and real-life-looking video do different things to our psyche. As an uneducated wanna-be intellectual, I would lean toward thinking real-looking objects more directly influence our perception of life than animations.
You just know there’ll be people making content within the week for social media that will be trying to pass itself off as real imagery.
Special effects, weapons physics, unrealistic vehicles and planes, or the classic 'hacking'.
I am sure that there were people decrying radio for all these same reasons (“how will the children know that the voices aren’t people in the same room?”)
If the producer wants to publish good physics, they get good physics.
It doesn't matter if it is AI, CGI, live action, stop motion, pen-and-ink animation, or anything else.
The output is whatever the production team wants it to be, just as has been the case for as long as we've had cinema (or advertising or documentaries or TikToks or whatevers).
Nothing has changed.
There's a video on sora.com at the very bottom, with tennis players on the roof, notice how one player just walks "through" the net. I don't think you can fix this other than by just cutting the video earlier.
Or they could simply brute-force it: clip the scene at the problem point and re-render from there, trying again until the result is no longer problematic. Or do the bulk of the work with AI and use video inpainting for small fixes, reserving human CGI artists for unmitigable problems that crop up, if they're fixable without a full re-render (whichever ends up less expensive).
Plus, given the world models released in the last week or so, AI will soon get better at holding a full and accurate representation of the world it creates, and future generations of this technology beyond what Sora is doing simply won't make these mistakes.
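A minimal sketch of that brute-force repair loop, just to make the control flow concrete. Every function name here is a hypothetical stand-in; no such public API exists:

    # Hypothetical "clip and re-roll" repair loop for the approach above.
    # generate_continuation() and find_first_artifact() stand in for whatever
    # generation and defect-detection tooling a real pipeline would use.

    def repair_video(video, generate_continuation, find_first_artifact, max_attempts=10):
        """Clip at the first detected artifact and re-render from there, repeatedly."""
        for _ in range(max_attempts):
            t = find_first_artifact(video)   # frame index of first defect, or None
            if t is None:
                break                        # clean clip: done
            head = video[:t]                 # keep the good prefix
            video = head + generate_continuation(head)  # re-roll the remainder
        return video

The economics question is just how many re-rolls that loop averages per clean clip versus the cost of an inpainting pass or a human fix.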
So the AI just publishes stuff on my behalf now?
No, comrade.
Similar to how it's fine to create fiction, but not to claim it to be true.
The ship is across the ocean...
Should there be an elephant in the snow? The layers of possible confusion and subtle incorrect understandings go much deeper.
Or just inaccurate impressions of the physical world.
My young kids and I happened to see a video of some very cute baby seals jumping onto a boat. It was not immediately clear it was AI-generated, but after a few runs I noticed it was a bit too good to be true. The kids would never have known otherwise.
Facebook seems full of older people interacting with AI generated visual content who don't seem to understand that it is fake.
Our society already had a problem with people (not) participating in consensus reality. This is going to pour gasoline on the fire.
I'm less concerned with physics for children--assuming they get enough time outdoors--and more about adulthood biases and media-literacy.
In particular, a turbocharged version of a problem we already have: People grow up watching movies and become subconsciously taught that flaws of the creation pipeline (e.g. lens flare, depth of field) are signs of "realism" in a general sense.
That manifests in things such as video-games where your human character somehow sees the world with crappy video-cameras for eyes. (Excepting a cyberpunk context, where that would actually make sense.)
It helps somewhat that people are fairly aware that entertainment is fake and usually don’t take it too seriously.
And why don't we worry about this with CGI?
CGI is not always made with a full physical simulation, and is not always intended to accurately represent real-world physics.
I do worry that, as we get exposed more and more to such art, we'll become less sensitive to this feeling, which effectively means we'll become less calibrated to actual reality. I worry this will screw with people's "system 1" intuitions long-term (but then I can't say exactly how; I guess we'll find out soon enough).
What is physics besides next token/frame prediction? I'm not sure these videos deserve the label "inaccurate"; who's to judge which way of generating next tokens/frames is better? Even if you judge the "physical" world to be "better", I think it's much more harmful to teach young children to be skeptical of AI, as their futures will depend on integrating it into their lives. Also, with enough data, such models will not only match but probably exceed "real-physics" models in quality, fidelity, and speed.
The real world gives way more stimulus.
Watching the animations might help them play video games, but again, I imagine the feedback is what will do the real job.
Even for the real ladybug video, who says the behaviour on screen is similar to what a typical ladybug does? If it's on video, the ladybug was probably doing something weird and unexpected.
I think it's unnecessary to worry about obviously bad stuff in nascent and rapidly developing technology. The people who spent most time with it (the developers) are aware of the obviously bad stuff and will work to improve it.
Pretty sure cartoons and action movies do that already, until YouTube videos of attempted stunts show what reality looks like.
If you really want something to worry about, consider that movies regularly show pint-sized women successfully drop kicking men significantly bigger than themselves in ways that look highly plausible but aren't. It's not AI but it violates basic laws of biology and physics anyway. Teaching girls they can physically fight off several men at once when they aren't strong enough to do that seems like it could have pretty dangerous consequences, but in practice it doesn't seem to cause problems. People realize pretty quick that movie physics isn't real.
I suppose the reminder here is that seeing does not warrant believing.
> these things will get bigger and better much faster than we can learn to discern
I would like to ask “Why?”
Clearly, these models are just one case of “NN can learn to map anything from one domain to another” and with enough training/overfitting they can approximate reality to a high degree.
But, why would it get better to any significant extent?
Because we can collect an infinite amount of video? Because we can train models to the point where they become generative video compression algorithms that have seen it all?
Two years ago, the very best closed-source image model was unable to represent anything remotely realistic. Today, there's hundreds of open source models that can generate images that are literally indistinguishable from reality (like Flux). Not only that, there's an entire collection of tools and techniques around style transfer, facial reconstruction, pose control, etc. It's mindblowing, and every week there's a new paper making it even better. Some of that could have been more training data. Most of it wasn't.
I guess it's fair to extrapolate that same trend to video, since it's the arc text, audio and images have taken? No reason it would be different.
It seems frontier labs have been throwing all the compute and all the data they could get their hands on at model training for at least the past 2 years. Is that glass a third full or is it nearly full already?
Is the process of filling that particular glass linear or does the top 20% of the glass require X times as much water to fill as the bottom 20%?
Soon, lots of people can pay a modest sum to make the internet just a little worse for everyone, in exchange for a chance to make their money back!
What legitimate problem does it solve? Isn't AI supposed to make our lives easier, or is that just "not what it's supposed to be bro", or whatever. I've lost track at this point with all the hallucinations and poor/bad/really fucking bad responses. It's not 100% of the time, but that's the point of companies like OpenAI releasing stuff like this to the public... to be helpful and believable.
Deep fakes were bad enough. Shit like this is not helpful when given to the largely ignorant public. It's not going to be used for anything helpful, conducive, or otherwise beneficial.
It's impressive. Sure. I just fail to see what it's the solution to.
> the United Kingdom, Switzerland and the European Economic Area. We are working to expand access further in the coming months
Excellent to announce this lack of access after the launch of Pro. At least I have no business need for Sora, so it's not much of a loss, but it's annoying nonetheless.
OpenAI isn't the only company that seems to act in this manner. I find this to be interesting. Your paying customers actively want to know about what you are doing and, more than likely, would love to get a heads-up before the word goes out to the world. Hearing about things from third parties can make you feel like a company takes your business for granted or does not deem it important enough to feed you news when it happens.
Another example of this is Kickstarter, although, their problem is different. I have only ever backed technology projects on KS. That's all I am interested in. And yet, every single email they send is full of projects that don't even begin to approach my profile (built over dozens of backed projects). As a result of this, KS emails have become spam to be deleted without even reading them. This also means I have not backed projects I would have seriously considered and I don't frequent the site as much as I used to.
Getting back on topic: It will be interesting to see how Sora usage evolves.
Regarding paying for access, for me it is about a combination of reasons. I want to support their efforts and that of others, so we have paid accounts where possible. Beyond that, it is about being up to date on the state of the art. Some of it is paid, and some is FOSS.
“1. Anything that is in the world when you’re born is normal and ordinary and is just a natural part of the way the world works.
2. Anything that's invented between when you’re fifteen and thirty-five is new and exciting and revolutionary and you can probably get a career in it.
3. Anything invented after you're thirty-five is against the natural order of things.”
― Douglas Adams, The Salmon of Doubt: Hitchhiking the Galaxy One Last Time
Which means I generally avoid things that are not EU-available even if they are available to me. It's not 100%, but it's a fairly decent measure of how much companies care about users to ensure they meet EU privacy laws from the start, versus providing some limited or delayed version to the EU.
> how much the EU slowed down innovation
You say this all the time, yet we're doing fine. How come?
Carefully crafted/gerrymandered laws that only rent seek from American big tech.
> You say this all the time, yet we're doing fine. How come?
You're not doing fine. I don't know how you can look back at the stagnation of the past two decades in the EU and think you're "doing fine." One of our companies is worth more than your entire tech industry. Your engineers get paid a fifth of what they could make here, so they often move here. In tech, you've fallen so far behind other superpowers that it's not even funny, and you're gleefully positioning yourself to fall even further behind. Your relative share of global GDP is dropping.
You think you're doing fine, but if the EU doesn't plan on amending the regulatory-industrial complex that has caused its undeniable stagnation, it will eventually fall into irrelevancy, and be on the losing side of the rising global wealth inequality.
Alright then; who else should have been covered by the DMA, in your opinion? Which other companies created unfair tax arrangements that have avoided scrutiny for decades?
Oh, nobody as large as Apple? Huh. Sounds like they're not targeting American companies at all, but instead prioritizing the biggest violators.
I guess they'll get their rude awakening someday. If xvector's comments here are any indication, it seems like they're starting to get out of the proverbial bed at least.
Just fine?
Like the EU forcing Apple to pay $14B in back taxes after voiding a legal and consensual tax agreement between Apple and Ireland? [1]
Or the DMA resulting in an absurd $2B fine related to music streaming, in a transparent attempt to prop up Spotify (the dominant market leader in this space)? [2]
Both of these in the last couple of months alone? It's just rent-seeking with a pretend "we're doing it for the good of the people" facade.
[1]: https://en.wikipedia.org/wiki/Apple%27s_EU_tax_dispute
[2]: https://www.reuters.com/technology/apple-set-face-fine-under...
They're back taxes. The EU did right by every single law-abiding business when they forced Apple to remediate their unnatural and unfair arrangement. Not a single naturally competitive business suffered as a result of either action. The EU does not suffer economically by weeding out businesses that exploit it to avoid paying taxes, only Apple does.
This is transparent and obvious to everyone outside of the EU. Rent-seeking behavior is the reason companies are less interested in going to the EU.
> The EU does not suffer economically [...]
The EU suffers economically when it falls behind technologically.
>The EU suffers economically when it falls behind technologically.
Is moving faster better? Certainly to generate wealth for a subset of the population but rarely for the general public.
The view that the US is doing better because a small group of rich people are increasing their share of the wealth, while most of the country is at best treading water or at worst seeing its economic power decrease, whereas the average person in the EU is actually better off, is myopic at best and malicious at worst.
No, they overrode the Irish decision because it was illegally anticompetitive. Please stop using Hacker News if your intention is to solely be butthurt over unfair rulings when they get corrected. Everyone on this website knows that Apple wields illegal anticompetitive power, nobody here should be surprised when Apple is forced to remediate tax fraud and deliberate DMA violations.
> The EU suffers economically when it falls behind technologically.
Well then it's a good thing Apple isn't leading the industry.
"Noooooo! Think of how many Vision Pro sales that Apple would miss out on by pulling out of Europe!" ...said nobody ever.
Nice bait and switch since your examples have nothing to do with GDPR.
Still Apple is doing just fine despite your examples.
https://help.openai.com/en/articles/10250692-sora-supported-...
[0] https://openai.com/global-affairs/a-primer-on-the-eu-ai-act/
I would ask an AI to generate a riff on a "I am the very model of a modern major general" but for some EU bureaucrat but I'll spare you the spam.
https://www.youtube.com/watch?v=2jKVx2vyZOY (live as of this comment)
Trust at a societal level is another beast of a difficult problem.
"We’re currently experiencing heavy traffic and have temporarily disabled sign ups. We’re working to get them back up shortly so check back soon."
Personally, I think I'll just be making weird memes to send to my friends!
Results won’t match the hype.
Hunyuan is 100% open source and it's set to become the Stable Diffusion / Flux of AI video.
Complaints about Sora's quality and prompt complexity are likely not as important to auteurs in that category, especially with the ability to load a custom character etc.
Classic OpenAI. I don't care, there are so many better alternatives to everything they do. Funny how quickly they have become irrelevant and lost their moat.
This stuff is a little ways off, but still some amazing effects here. I think it will be a little bit before it is sufficient for production use in any real commercial situation. There's something unsettling about all of the videos generated here.
If it's about training models on potentially personal information, the GDPR (EU and UK variants) kicks in, but then that hasn't restricted OpenAI's ability to deploy (Chat)GPT there. The same applies to broader copyright regulations around platforms needing to proactively prevent copyright violation, something GPT could also theoretically accomplish. Any (planned) EU-specific regulations don't apply to the UK, so I doubt it's those either.
The only thing that leaves, perhaps, is laws around the generation of deepfakes which both the UK and EU have laws about? But then why didn't that affect DALL-E? Anyone with a more detailed understanding of this space have any ideas?
https://help.openai.com/en/articles/10250692-sora-supported-...
For example in Train or Truck Simulators, I see examples where someone has put effort into making that farmhouse in the distance nicely detailed, but other times it's just a simple structure. If AI were tasked with "distant details", the whole game could look more polished.
It was cool when they announced it but the novelty of generating a piece of AI video clipart is quickly fading, especially when it takes months or years to just get a demo in users' hands.
So they demo the full model and release the quantised and censored model.
Does anyone else find this kind of bait & switch distasteful?
Hunyuan [1] is better than Sora Turbo and is 100% open source. It's got fine tuning code, LoRA training code, multiple modalities, controlnets, ComfyUI compatibility, and is rapidly growing an ecosystem around it.
Hunyuan is going to be the Stable Diffusion / Flux for video, and that doesn't bode well for Sora. Nobody even uses Dall-E in conversation anymore, and I expect the same to hold true for closed source foundation video models.
And if one company developing foundation video models in the open isn't good enough, then Lightricks' LTX and Genmo's Mochi should provide additional reassurance that this is going to be commoditized and made readily available to everyone.
I've even heard from the Banodoco [2] grapevine that Meta is considering releasing their foundation video model as open source.
[1] https://github.com/Tencent/HunyuanVideo/
[2] Banodoco is one of the best communities for open source foundation AI video; https://banodoco.ai/
Tencent dropped a comparable open-weight model in the last week that looks at least as good.
https://en.m.wikipedia.org/wiki/Trypophobia
It’s like a spider’s eyes… and also not what I would expect a latte to look like.
Mostly hunches from me. It could very well be that the original Sora is also plagued with outputs that aren't just subjectively "bad", but which aren't _useful_ (not adhering to the prompt, for instance).
There are some cool ideas here. The storyboard thing is nifty - kind of the refined synthetic captions that ChatGPT uses for DALLE3, on crack. Perhaps after people get over the prompting learning curve it will output better results. But it seems tougher to prompt than simple text-to-image, requiring generally longer prompts that aim to steer the model away from whatever strange thing it's doing that you don't need it to do. In my case, using the "image as the first frame" approach, the model consistently generated cuts to newly imagined cameras, when I simply wanted a single continuous shot from the POV of the photo's camera.
We'll see, but I'm sort of over it. The UX is fancy for sure, and the scale they're pulling off with this is unprecedented even if there's already decent competitors.
Hopefully these types of issues blow over as they increase capacity or load decreases.
The lengthy generation times aren't fun to deal with though in any case. As good as the UX for the app itself is, there's little they can do about how long it takes for a video to generate compared to images. The near instant feedback is gone (just like old times)
>RoboHanger: Learning Generalizable Robotic Hanger Insertion for Diverse Garments
https://arxiv.org/abs/2412.01083
>To overcome the challenge of limited data, we build our own simulator and create 144 synthetic clothing assets to effectively collect high-quality training data.
the strategy is simulation
Looking forward to the onslaught of AI-generated slop filling every video feed on the Internet. Maybe it's finally what's going to kill things like TikTok, YT Shorts, Reels, etc. One can hope...anyway.
I don't see Sora being THAT much better than Pika now that I'm trying both, except that it's included in my OpenAI subscription, but I do think people who do discrete parts of the "modal stack" are going to be able to compete on their merits (be it Pika for video or Suno for music, etc.)
I don't see any good coming from tools like these.
The part made by Sora? About as interesting as the latest chess programs doing well at chess. woohoo/nice job.
The overall effect? Now we spend mental energy trying to figure out which parts are machine generated, and hence not worth anything. That mental energy is gone, sucked out of the cultural economy, and fed to the machinery of mediocrity.
I certainly don’t dislike all the cool movies where special effects are CG just because the old time stop motion artists from 1950s Flash Gordon aren’t using sparklers. Similarly I’m not going to discount new creation that can be enjoyable no matter the provenance.
/s
It's given OpenAI this tinge for me that I probably won't ever manage to forget.
The flower one is the best looking.
I'm curious to know - is it actually useful for real world tasks that people/companies need videos for?
https://github.com/Tencent/HunyuanVideo/
This isn't "too good to be true" - this is the holy grail. Hunyuan is set to become the Flux/Stable Diffusion of AI video.
I don't see how Hunyuan doesn't completely kill off Sora. It's 100% open source, is rapidly being developed for consumer PCs, can be fine tuned, works with ComfyUI/other tools, and it has control nets.
HunYuan is seriously amazing and it looks like it'll be the Flux/Stable Diffusion of AI video.
Sora is cooked.
This is a problem that highlights an apparent lack of consequences, or of care about them, period. The current consequences in place to deter this abysmal and abhorrent behavior simply aren’t enough. I look around and the world is going nuts, and failing, it seems, to connect cause with effect. Shitty people doing shitty things. We don’t need new laws or regulations when people aren’t willing or able to abide by the ones currently in place! What good will that do?
mmmmh...
1. Anything on the internet can be fake
2. Trust is interpersonal, and trusting content should be predicated first and foremost on trusting its source to not deceive you
This is imperfect but also the best people ever really do in the general case, and just orders of magnitude better than most people are currently doing
The issue isn't models like this, it's that people are eating a ton of information but have been strongly encouraged to be credulous, and a lion's share of that training is directly coming from the tech grift industrial complex
I wouldn't even say this is the most compelling kind of tool for plausible-looking disinformation out there by a long shot for the record, but without actually examining why people are gullible there is no technology that's going to make people accepting fiction as fact substantially worse, or better, really. Scams target people on the order of their life savings every day and there are robust technologies and protocols for vetting communications, but people have to know to use them, care to use them, and be able to use them, for that to matter at all
People with no taste will produce tasteless content.
The mountain of slop will grow.
And some of us have no intention of publishing any output whatsoever but just find the existence of these tools fascinating and inspiring.
These tools are fascinating, though I can't help but feel that the main beneficiaries after all is said and done will be venture capitalists and tech/entertainment execs.
I'm just about ready to cancel my ChatGPT subscription and move fully over to Claude because OpenAI has spit in my face one too many times.
I'm tired of announcements of things being available only to find out "No, they aren't" or "It's rolling out slowly" where "slowly" can mean days, weeks, or month (no exaggeration).
I'm tired of shit like this:
Sign ups are temporarily unavailable
We’re currently experiencing heavy traffic and have temporarily disabled sign ups. We’re working to get them back up shortly so check back soon.
Sign up? I'm already signed up, I've had a paid account for a year now or so.
> We’re releasing it today as a standalone product at Sora.com to ChatGPT Plus and Pro users.
No you aren't, you might be rolling it out (see above for what that means) but it's not released, I'm a ChatGPT Plus user and I can't use it.
So incredibly ugly.
We have yet to see any kind of AI created movie, like Toy Story was for computer 3D animation.
OpenAI isn't a player in the video AI game, but certainly has bagged most of the money for it already (somehow).
no pay per use = overpriced
Here's my pelican video: https://simonwillison.net/2024/Dec/9/sora/
I don't see how Sora can stay in this race. The open source commoditization is going to hit hard, and OpenAI probably doesn't have the product DNA or focus to bark up this tree too.
Tencent isn't the only company releasing open weights. Genmo, Black Forest Labs, and Lightricks are all developing completely open source video models.
Even if there weren't open source competitors, there are a dozen closed source foundation video companies: Runway, Pika, Kling, Hailuo, etc.
I don't think OpenAI can afford to divert attention and win in this space. It'll be another Dall-E vs. Midjourney, Flux, Stable Diffusion.
https://github.com/Tencent/HunyuanVideo
It's pretty cool though, the kind of thing that'd be hard if it was what you actually wanted!
Oof, if Sora can't even maintain an internally consistent world for a 5-second short, I can't imagine how much worse it'll get at longer generation lengths.
So you were lucky indeed to be able to run your prompt and share it, because the result was quite illuminating, but not in a way that looks good for Sora and OpenAI as a whole.
Verdict 4/10
Sora is built entirely around the idea of directly manipulating and editing and remixing the clips it generates, so the goal isn't to have it produce usable videos from a single prompt.
Perhaps just best to wait
I love this timeline.
When I first got access to DALL-E (in '22), the first thing I tried was to get an impressionist-style painting of the way I always imagined Bob Dylan's 'Mr. Tambourine Man'. I regenerated it multiple times and got something I was very happy with! I didn't put it on social media, didn't try to make money off it; it's for me.
If you enjoy "art" (nice pictures, paintings, videos now I guess), you can create it yourself! I think people are missing that aspect of it: use it to make yourself happy, make pictures you want to look at!
The real competition of any new work is the backlog of decades of content that is instantly accessible. Of course it makes all content less valuable, you can always find something else. Hence the race for attention and the slop machine. It was actually invented by the ad driven revenue model.
We should not project on AI something invented elsewhere. Even if gen AI could make original interesting works, the social network feeds would prioritize slop back again. So the problem is the way we let them control our feeds.
Define "original". You could generate a pregnant Spongebob Squarepants and that would be original, but it would still be noise that doesn't inherently expand the creative space.
> don't spend much time selecting
That's the unexpected issue with the proliferation of generative AI now being accessible to nontechnical people. Most are lazy and go with the first generation that matches the vibe, which is the main reason why we have slop.
You could get something much more creative or historically accurate than whatever Hollywood deems marketable.
I think about AI like any other tool. For example I make music using various software.
Are drum machines cheating? Is electronic music computer slop compared to playing each instrument?
Is using a Mac and a $1k mic instead of a $30k studio cheating?
Kokoro is art. Driveway is content. Art uses the medium and implementation to say something and convey messages. Content is what goes between the ads so the shareholders see a number increase.
I wish there were more things like Kokoro and less things like Driveway.
It's like everything else. It's just a tool.
You can create an entire movie using a high end phone with quality that would have cost millions 40 years ago. Do real movies need film?
Turns out it's surprisingly easy, at least for me, to tune out the slop. Some platforms will fall victim to it (Google image search, for one), but new platforms will spring up to take their place.
Plus Tier ($20/month)
- Up to 50 priority videos (1,000 credits)
- Up to 720p resolution and 5s duration
Pro Tier ($200/month)
- Up to 500 priority videos (10,000 credits)
- Unlimited relaxed videos
- Up to 1080p resolution, 20s duration and 5 concurrent generations
- Download without watermark
more info: https://help.openai.com/en/articles/10245774-sora-billing-cr...
>> Can I purchase more credits?
> We currently don’t support the ability to purchase more credits on a one-time basis.
> If you are on a ChatGPT Plus and would like to access more credits to use with Sora, you can upgrade to the Pro plan.
Ouch. Looks like they're really pushing this ChatGPT pro subscription. Between the watermark and being unable to buy more credits, the plus plan is basically a small trial.
[1] https://help.openai.com/en/articles/10245774-sora-billing-cr...
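A quick back-of-the-envelope from the tier numbers above. This is a sketch using only the published per-tier counts; actual credit cost reportedly varies with resolution and duration, so the flat "priority videos" figures are an approximation:

    # Per-video cost implied by the published tier numbers quoted above.
    plus_price, plus_videos, plus_credits = 20, 50, 1_000     # Plus: $20/mo
    pro_price, pro_videos, pro_credits = 200, 500, 10_000     # Pro: $200/mo

    print(plus_credits / plus_videos, plus_price / plus_videos)  # 20.0 credits, $0.40 per priority video
    print(pro_credits / pro_videos, pro_price / pro_videos)      # 20.0 credits, $0.40 per priority video

So Pro isn't cheaper per priority generation; the extra $180/month buys volume, 1080p/20s clips, watermark-free downloads, and the unlimited relaxed queue.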
https://www.klingai.com/membership/membership-plan
Quality seems relatively similar based on the samples I've seen. With the same issues - object permanence, temporal stability, physics comprehension etc, being present in both. Kling has no qualms about copyright violation however.
Kling doesn't seem to have more granular information publicly, but I suspect it allows for more than 16 videos per month.
AFAIK based on HuggingFace trending[1], the competitors are:
- bytedance/animatediff-lightning: https://arxiv.org/pdf/2403.12706 (2.7M downloads in the past 30d, released in March)
- genmo/mochi-1-preview: https://github-production-user-asset-6210df.s3.amazonaws.com... (21k downloads, released in October)
- thudm/cogvideox-5b: https://huggingface.co/THUDM/CogVideoX-5b (128k downloads, released in August)
Is there a better place to go? I'm very much not plugged into this part of LLMs, partially because it's just so damn spooky...
EDIT: I now see the reply above referencing Hunyuan, which I didn't even know was its own model. Fair enough! I guess, like always, we'll just need to wait for release so people can run their own human-preference tests to definitively say which is better. Hunyuan does indeed seem good
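If you want to kick the tires on one of the open models above, CogVideoX-5b ships with a diffusers pipeline. A minimal sketch: the model ID is from the list above, but the frame count, step count, and fps are from memory and may be out of date, so check the model card before relying on them:

    import torch
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video

    # Load the open-weights CogVideoX-5b model listed above (needs a CUDA GPU).
    pipe = CogVideoXPipeline.from_pretrained(
        "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
    ).to("cuda")

    # Generate a short clip from a text prompt.
    frames = pipe(
        prompt="A ladybug walking along a dewy blade of grass, macro shot",
        num_frames=49,
        num_inference_steps=50,
    ).frames[0]

    export_to_video(frames, "ladybug.mp4", fps=8)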
Both are permissively licensed.
LLMs -- Awesome and useful. Disruptive, and somewhat dangerous, but probably more good than harm if we do it right.
'Generative art' (i.e. music generation, image generation, video generation) -- Why? Just why?
The 'art' is always good enough to trick most humans at a glance but clearly fake, plastic, and soulless when you look a bit closer. It has instilled somewhat of a paranoia in me when browsing images and genuinely worsened my experience consuming art on the internet overall. I've just recently found out that a jazz mix I found on YouTube and thought was pretty neat is fully AI generated, and the same happens when I browse niche artstyles on Instagram. Don't get me started on what this Sora release will do...
It changed my relationship consuming art online in general. When I see something that looks cool on the surface, my reaction is adversarial, one of suspicion. If it's recent, I default to assuming the piece is AI, and most of the time I don't have time or effort to sleuth the creator down and check. It's only been like a year, and it's already exhausting.
No one asked for AI art. I don't understand why corporations keep pushing it so much.
Anyway, the guitar is AI generated, and it's really bad. There are 5 strings, which morph into 6 at the headstock. There's a trem bar jammed under the pickguard, somehow. There's a randomly placed blob on the guitar that is supposed to be a knob/button, but clearly is not. The pickups are visually distorted.
It's repulsive. You're trying to sell me on something, why would you put so little effort into your advertising? Why would you not just...take a picture of a real guitar? I so badly want to cover it up.
Is this not evident? Because using AI is much cheaper and faster. Instead of finding the right guitar, paying for a good photographer, location, decoration, and all the associated logistics, a graphics designer can write a prompt that gets you 90% of the vision, for orders of magnitude less cost and time. AI is even cheaper and faster than using stock images and talented graphic designers, which is what we've been doing for the past few decades.
All our media channels, in both physical and digital spaces, will be flooded with this low-effort AI garbage from here on out. This is only the beginning. We'll need to use aggressive filtering and curation in order to find quality media, whether that's done manually by humans or automatically by other AI. Welcome to the future.
In fact, it's not hard to imagine people using AI tools even if they're slower, more expensive, and yield worse quality results in the long run.
"When all you have is a hammer...".
Can't text also be considered art? There's as much art in poetry, lyrics, novels, scripts, etc. as in other forms of media.
The thing is that the generative tech is out of the bag, and there's no going back. So we'll have to endure the negative effects along with the positive.
I just think that LLMs have genuine use for non-artistic things, which is why I said it's dangerous but may be useful if we play our cards right.
So we could say the same thing about AI-generated art. Maybe most of it is low-effort, but why can't it be considered art? There is a separate topic about human emotion being a key component these generated works are missing, but art is in the eyes of the beholder, after all, so who are we to judge?
Mind you, I'm merely playing devil's advocate here. I think that all of this technology has deep implications we're only beginning to grapple with, and art is a small piece of the puzzle.
I'd be perfectly fine with a hypothetical world in which all generated art is clearly denoted as such. Like you said, art is in the eyes of the beholder. I welcome a world in which AI art lives side-by-side with traditional art, but clearly demarcated.
Unfortunately, the reality is very different.
AI art inherently tries to pass off as if it were made by a human. The result of the tools released in the past year is that my relationship with media online has become adversarial. I've been tricked in the past by AI music and images which were not labelled as such, which fosters a sort of paranoia that just isn't there with the examples you mentioned.
You don't have the same paranoia with LLMs? So often I find myself getting a third of the way into reading an article or blog post and thinking: "wait a minute...".
LLM tone is so specific and unrealistic that it completely disengages me as a reader.
I'm a huge film nerd and I can only dream of a future where I could use these type of tools (but more advanced) to create short films about ideas I've had.
It's very exciting to me
The video there is kind of a combination of human design and AI which produces something beyond that which either would come up with on their own.
There's nothing wrong with technology going forward and this doesn't go against "creativity and art", to the contrary, it will enhance it.
But mostly it will end up like smartphones - we carry more computing power in our pockets than was used to send man to the moon, and instead of taking advantage of it to do great things, we are glued to this small screen several hours a day scrolling social media nonsense. It's just human nature.
I see this kind of comment from time to time. Do you have any evidence to support this claim, or just paranoia vibes?
Many of them will die, but may the AI slop continue anyway.
If you are a creative in this industry, start preparing to transition to another industry or adapt.
Your boss is highly likely to be toying around with this.
The first entirely AI generated film (with Sora or other AI video tools) to win an Oscar will be less than 5 years away.
I could see this tool maybe being used for generating establishing shots (generate a sweeping drone shot of a lighthouse looking out over a stormy sea), but then the actual talent work in a scene will be way more sensitive. The little details matter so much, and this feels so far from getting all of that right.
Sure, this is the worst it will ever be, things will improve, etc, but if we've learned anything with AI, it's that the last mile is often the hardest.
What I think this will unlock, maybe with a bit of improvement, is low quality video generation for a vast number of people. Do you have a short film idea? Know people with some? Likely millions of people will be able to use this to put together good enough short films - that yes, have terrible details, but are still good enough to watch. Some of those millions of newly enabled videos will have such strong ideas or writing behind them that it will make up for, or capitalize on, the weak video generation.
As the tools become easier, cheaper, faster, better etc more and more hobbyists will pick them up and try to use them. The user base will encourage the product to grow, and it will gradually consume film (assuming it can reach the point of being as or nearly as good as modern special effects).
I think of it like - when Steven Spielberg was young he used an 8mm camera, not as good as professional film equipment in the day, but good enough to create with. If I were a high school student interested in film I would absolutely be using stuff like this to create.
Sure, this is already happening on Reels, Tik Tok, etc. People are ok with low quality content on those platforms. Lazy AI will undoubtedly be more utilized here. But I don’t think it’s threatening Hollywood (well, aside from slowly destroying people’s attention spans for long form content, but that’s a different debate). People will still want high quality entertainment, even if they can also be satisfied with low fidelity stuff too.
I think this has always been true — think the difference between made for TV CGI and big-budget Hollywood movie CGI. Expectations are different in different mediums.
This current product is not good enough for Hollywood. As long as people have some desire for Hollywood level quality, this will not take those jobs.
The big caveat here is “yet” — when does this get good enough? And this is where my skepticism comes in, because the last mile is the hardest, and getting things mostly right isn’t really good enough for high quality content. (Remember how much the internet lost it over a Starbucks cup in Game of Thrones?)
The other caveat is maybe that our minds melt into stupidity to the point that we only watch things in low-fidelity 10-second clips that AI can capably run amok with. In which case I don’t really think AI actually takes over Hollywood so much as Hollywood - effectively high-fidelity long-form content - just ceases to exist altogether. That is the sad timeline.
A reminder: as advanced as CGI is today, lots and lots of movies are still based on (very expensive) real-life scenery or miniature sets (just two of many examples), because they are far, far more realistic than what you get out of computers.
What would you like to wager on this?
OpenAI could be a big enough bubble in less than 5 years to buy the Oscar winner, even if the film is terrible.
Also, OP only said "an Oscar".
The Oscar committee could easily get themselves hyped enough on the AI bubble, to create an AI Oscar Film award.
No one said anything about making a "good" movie.
...For soundtrack. (Sorry.)
But seriously: just as the democratization that made music production cheap brought some interesting or commercially successful endeavours, the increased effort from people who previously could not bring their dreams to reality because of the basic constraint of budget will probably bring some very good results, even anthology-worthy ones - and lots of trash.
Even if they use queues, I'm sure they are running at a loss and the GPU time is going to cost 100x more than what they charge.
Creating false demand for AI can easily bankrupt their business, as they will believe people actually want to use that crap for that purpose.