If you want to try it out, you can either (1) go to https://studio.infinity.ai/try-inf2, or (2) post a comment in this thread describing a character and we’ll generate a video for you and reply with a link. For example:

- “Mona Lisa saying ‘what the heck are you smiling at?’”: https://bit.ly/3z8l1TM
- “A 3D Pixar-style gnome with a pointy red hat reciting the Declaration of Independence”: https://bit.ly/3XzpTdS
- “Elon Musk singing Fly Me To The Moon by Sinatra”: https://bit.ly/47jyC7C
Our tool at Infinity allows creators to type out a script with what they want their characters to say (and eventually, what they want their characters to do) and get a video out. We’ve trained for about 11 GPU years (~$500k) so far and our model recently started getting good results, so we wanted to share it here. We are still actively training.
We had trouble creating good characters with existing AI tools. Generative AI video models (like Runway and Luma) don’t allow characters to speak. And talking avatar companies (like HeyGen and Synthesia) just do lip syncing on top of previously recorded videos. This means you often get facial expressions and gestures that don’t make sense with the audio, resulting in the “uncanny” look you can’t quite put your finger on. See blog.
When we started Infinity, our V1 model took the lip syncing approach. In addition to mismatched gestures, this method had many limitations, including a finite library of actors (we had to fine-tune a model for each one with existing video footage) and an inability to animate imaginary characters.
To address these limitations in V2, we decided to train an end-to-end video diffusion transformer model that takes in a single image, audio, and other conditioning signals and outputs video. We believe this end-to-end approach is the best way to capture the full complexity and nuances of human motion and emotion. One drawback of our approach is that the model is slow despite using rectified flow (2-4x speed up) and a 3D VAE embedding layer (2-5x speed up).
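To give a sense of the shape of the model, here is a rough sketch of how that conditioning can be wired up. This is illustrative PyTorch with made-up module names and dimensions, not our actual architecture: the reference image and audio are embedded into conditioning tokens, concatenated with the noisy video latents from the 3D VAE, and the transformer predicts the denoising target for the video tokens.

    import torch
    import torch.nn as nn

    class TalkingHeadDiT(nn.Module):
        """Illustrative diffusion transformer: denoises 3D-VAE video latents
        conditioned on a reference image and an audio clip."""
        def __init__(self, dim=1024, depth=24, heads=16):
            super().__init__()
            self.image_proj = nn.Linear(768, dim)   # reference-image features -> tokens
            self.audio_proj = nn.Linear(512, dim)   # audio features -> tokens
            self.time_embed = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))
            self.blocks = nn.ModuleList(
                nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
                for _ in range(depth)
            )
            self.out = nn.Linear(dim, dim)

        def forward(self, video_latents, image_feats, audio_feats, t_emb):
            # video_latents: (B, T, dim) from the 3D VAE encoder
            # image_feats: (B, N_img, 768); audio_feats: (B, N_aud, 512); t_emb: (B, dim)
            cond = torch.cat([self.image_proj(image_feats),
                              self.audio_proj(audio_feats)], dim=1)
            x = torch.cat([cond, video_latents + self.time_embed(t_emb)[:, None]], dim=1)
            for blk in self.blocks:
                x = blk(x)
            # predict the denoising target for the video-latent tokens only
            return self.out(x[:, cond.shape[1]:])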
Here are a few things the model does surprisingly well on: (1) it can handle multiple languages, (2) it has learned some physics (e.g. it generates earrings that dangle properly and infers a matching pair on the other ear), (3) it can animate diverse types of images (paintings, sculptures, etc) despite not being trained on those, and (4) it can handle singing. See blog.
Here are some failure modes of the model: (1) it cannot handle animals (only humanoid images), (2) it often inserts hands into the frame (very annoying and distracting), (3) it’s not robust on cartoons, and (4) it can distort people’s identities (noticeable on well-known figures). See blog.
Try the model here: https://studio.infinity.ai/try-inf2
We’d love to hear what you think!
EDIT: looks like the model doesn't like Duke Nukem: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
Cropping out his pistol only made it worse lol: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
A different image works a little bit better, though: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
My go-to for checking the edges of video and face-identification models is Personas right now -- they're rendered faces done in a painterly style, and they can be really hard to parse.
Here's some output: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
Source image from: https://personacollective.ai/persona/1610
Overall, crazy impressive compared to competing offerings. I don't know if the mouth size problems are related to the race of the portrait, the style, the model, or the positioning of the head, but I'm looking forward to further iterations of the model. This is already good enough for a bunch of creative work, which is rad.
I think the issues in your video are more related to the style of the image and the fact that she's looking sideways than the race. In our testing so far, it's done a pretty good job across races. The stylized painting aesthetic is one of the harder styles for the model to do well on. I would recommend trying with a straight on portrait (rather than profile) and shorter generations as well... it might do a bit better there.
Our model will also get better over time, but I'm glad it can already be useful to you!
It's not stylization (alone): here's a short video using the same head proportions as the original video, but the photo style is a realistic portrait. I'd say the mouth is still overly wide. https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
I tentatively think it might be race-related -- here's one done with a subject of a different race. Her mouth might also be too wide? But it stands out a bit less to me. https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
p.s. Happy to post to a bug tracker / GitHub / whatever if you prefer. I'm also happy to license the Persona Collective images to you if you want to pull them in for training/testing -- feel free to email me. There's a move away from 'painterly' style support in the current crop of diffusion models (Flux, for instance, absolutely CANNOT do painting styles), and I think that's a shame.
Anyway, thanks! I really like this.
It’s astounding that two sentences generated this. (I used text-to-image, and a prompt for a space marine in power armour produced something amazing with no extra tweaks required.)
[0]: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
Our hypothesis is that the "breakdown" happens when there's a sudden change in audio levels (from audio to silence at the end). We extend the end of the audio clip and then trim the extra frames out of the video to try to handle this, but it's not working well enough.
Hmmmmmmmm
Ohmmmmmmm
Our V2 model is trained on specific durations of audio (2s, 5s, 10s, etc.) as input. So, if we give the model a 7s audio clip during inference, it will generate lower quality videos than at 5s or 10s. Instead, we buffer the audio to the nearest training bucket (10s in this case). We have tried buffering with a zero array, with white noise, and by concatenating the input audio (inverted) onto the end. The drawback is that the last visible frame (the one at 7s) has a higher likelihood of failing. We need to solve this.
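For reference, the buffering step is conceptually as simple as the sketch below. The bucket sizes and the reversed-audio padding here are illustrative, not our production code:

    import numpy as np

    BUCKETS_S = [2, 5, 10, 20, 30]  # durations the model was trained on (illustrative)

    def pad_to_bucket(audio: np.ndarray, sr: int):
        """Pad audio up to the nearest training-duration bucket and return
        the original duration, so the extra frames can be trimmed from the
        generated video afterwards. Assumes the clip fits the largest bucket."""
        duration = len(audio) / sr
        bucket = next(b for b in BUCKETS_S if b >= duration)
        pad_len = int(bucket * sr) - len(audio)
        # One padding strategy: append the input audio reversed, so the model
        # never sees an abrupt drop to silence at the end of the clip.
        filler = audio[::-1][:pad_len]
        if len(filler) < pad_len:  # clip is shorter than the gap to fill
            filler = np.pad(filler, (0, pad_len - len(filler)))
        return np.concatenate([audio, filler]), duration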
And, no shade on HeyGen. It's literally what we did before. And their videos look hyper realistic, which is great for B2B content. The drawback is you are always constrained to the hand motions and environment of the on-boarding video, which is more limiting for entertainment content.
I know that signup requirement is an article of faith amongst some startup types, but it's not a surprise to me that shareable examples lead to sharing.
Funny how other sites can do this with a birthday dropdown, an IP address, and a checkbox.
>We have a sign-up because we ensure users accept our terms of service and acceptable use policy before creating their first video
So your company would have no problem going on record saying that they will never email you for any reason, including marketing, and your email will never be shared or sold even in the event of a merger or acquisition? Because this is the problem people have with sign-up ... and the main reason most start-ups want it.
I am not necessarily for or against required sign-ups, but I do understand people that are adamantly against them.
Will be funny/ironic when the first AI companies start suing each other for copyright infringement.
Personally, the "3 column" UI isn't that good anyway; I would have gone with an "MMO character creation" type UX for this.
Heads up, little bit of language in the audio.
https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
Through many births
I have wandered on and on,
Searching for, but never finding,
The builder of this house.
is from the Dhammapada: https://en.wikipedia.org/wiki/Dhammapada (https://buddhasadvice.wordpress.com/2021/02/26/dhammapada-15... and http://www.floweringofgoodness.org/dhammapada-11.php).

This is the way the world ends
Not with a bang but a whimper.

is from T.S. Eliot, The Hollow Men: https://en.wikipedia.org/wiki/The_Hollow_Men (https://interestingliterature.com/2021/02/eliot-this-way-wor...).

The first and second pictures are profile pictures that were generated years ago, before OpenAI went on stage. I keep them around for when I need profile pics for templates. The third one has been in my random pictures folder for years.
https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
though perhaps it rebelled against the message
I am curious whether you are in any way related to this team?
Loopy is a UNet-based diffusion model; ours is a diffusion transformer. This is our own custom foundation model we've trained.
For whoever wants to: you can re-make all the videos yourself with our model by extracting the 1st frame and the audio, as in the sketch below.
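Something along these lines pulls out the first frame and the audio track (requires ffmpeg on your PATH; the paths are placeholders):

    import subprocess

    def extract_inputs(video_path: str) -> None:
        # First frame, to use as the reference image
        subprocess.run(["ffmpeg", "-y", "-i", video_path,
                        "-frames:v", "1", "first_frame.png"], check=True)
        # Audio track, as a 16-bit PCM wav file
        subprocess.run(["ffmpeg", "-y", "-i", video_path,
                        "-vn", "-acodec", "pcm_s16le", "audio.wav"], check=True)

    extract_inputs("example_clip.mp4")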
Also, Loopy is not available yet (they just published the research paper). But you can try our model today, and see if it lives up to the examples : )
https://news.ycombinator.com/item?id=41463726
Examples are very impressive, here's hoping we get an implementation of it on huggingface soon so we can try it out, and even potentially self-host it later.
This is EMO from 6 months ago: https://humanaigc.github.io/emote-portrait-alive/
In the AI/research community, people often try to use the same examples so that it's easier to compare performance across different models.
One drawback of tools like runway (and midjourney) is the lack of an API allowing integration into products. I would love to re-sell your service to my clients as part of a larger offering. Is this something you plan to offer?
The examples are very promising by the way.
Consider all of the assets someone would have to generate for a 1 minute video. Let's assume 12 clips of 5 seconds each. First they may have to generate a script (Claude/OpenAI). They will have to generate voice audio and background/music audio (Suno/Udio). They probably have to generate the images (Runway/Midjourney/Flux/etc.), which they will feed into an img2vid product (Infinity/Runway/Kling/etc.). Then they need to do basic editing like trimming clip lengths. They may need to add text/captions and image overlays. Then they want to upload it to TikTok/YouTube/Instagram/etc. (including all of the metadata for that). Then they will want to track performance, etc.
That is a lot of UI, workflows, etc. I don't think a company such as yours will want to provide all of that glue. And consumers are going to want choice (e.g. access to their favorite image gen, their favorite text-to-speech).
Happy to talk more if you are interested. I'm at the prototype stage currently. As an example, consider the next logical step for an app like https://autoshorts.ai/
It would be very useful to have API access to Infinity to automate the creation of a talking head avatar.
Sorry if this question sounds dumb, but I am comparing it with regular image models, where the more images you train on, the better the images the model generates.
We actually did this in early overfitting experiments (to confirm our code worked!), and it worked surprisingly well. This is exciting to us, because it means we can have actor-specific models that learn the idiosyncratic gestures of a particular person.
First, your (Lina's) intro is perfect in honestly and briefly explaining your work in progress.
Second, the example I tried had a perfect interpretation of the text meaning/sentiment and translated that to vocal and facial emphasis.
It's possible I hit on a pre-trained sentence. With the default manly-man I used the phrase, "Now is the time for all good men to come to the aid of their country."
Third, this is a fantastic niche opportunity - a billion+ memes a year - where each variant could require coming back to you.
Do you have plans to be able to start with an existing one and make variants of it? Is the model such that your service could store the model state for users to work from if they e.g., needed to localize the same phrase or render the same expressivity on different facial phenotypes?
I can also imagine your building different models for niches: faces speaking, faces aging (forward and back); outside of humans: cartoon transformers, cartoon pratfalls.
Finally, I can see both B2C and B2B, and growth/exit strategies for both.
Yes, we plan on allowing people to store their generations, make variations, mix-and-match faces with audios, etc. We have more of an editor-like experience (script-to-video) in the rest of our web app but haven't had time to move the new V2 model there yet. Soon!
We're also very excited about the template idea! Would love to add that soon.
NSFW! -- lyrics by Biggy$malls
A product that might be built on top of this could split the input into reasonable chunks, generate video for each of them separately, and stitch them with another model that can transition from one facial expression to another in a fraction of a second.
An additional improvement might be feeding the system not one image but a few, each expressing a different emotion. Then the system could analyze the split input to work out which emotional state each part of the video should start from.
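For the splitting step, a minimal sketch might look like the following (pure numpy; the chunk length and silence threshold are arbitrary, and the per-chunk generation and transition stitching are left to whatever model you use):

    import numpy as np

    CHUNK_S = 10  # target chunk length in seconds (arbitrary choice)

    def split_on_quiet(audio: np.ndarray, sr: int, thresh: float = 1e-3):
        """Split audio into ~CHUNK_S second chunks, preferring quiet samples
        as cut points so each generated clip starts and ends near a pause."""
        chunk_len = CHUNK_S * sr
        cuts, start = [], 0
        while start + chunk_len < len(audio):
            # search the second half of the window for a quiet sample to cut at
            window = audio[start + chunk_len // 2 : start + chunk_len]
            quiet = np.where(np.abs(window) < thresh)[0]
            cut = start + chunk_len // 2 + (int(quiet[0]) if len(quiet) else len(window) - 1)
            cuts.append(cut)
            start = cut
        return np.split(audio, cuts)

    # each chunk would then be sent to the video model separately, and the
    # resulting clips stitched with a transition model as described above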
On an unrelated note ... the generated expressions seem to be relevant to the content of the input text. So either the text-to-speech understands the language a bit, or the video model itself does.
https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
Managed to get it working with my doggo.
I have an immediate use case for this. Can you stream the generated video to support real-time chat this way?
Very very good!
Jonathan
founder@ixcoach.com
We deliver the most exceptional simulated life coaching, counseling and personal development experiences in the world through devotion to the belief that having all the support you need should be a right, not a privilege.
Test our capacity at ixcoach.com for free to see for yourself.
My generation: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
We're actively working on improving stability and will hopefully increase the generation length soon.
I thought you had to pay artists for a license before using their work in promotional material.
Hopefully we can animate your bear cartoon one day!
https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
Edit: Duke Nukem flubs his line: https://youtu.be/mcLrA6bGOjY
One small issue I've encountered is that sometimes images remain completely static. Seems to happen when the audio is short - 3 to 5 seconds long.
I would be curious if you are getting this with more normal images.
[1] https://i.pinimg.com/236x/ae/65/d5/ae65d51130d5196187624d52d...
[2] https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
[3] https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
But that image was one where it could both find a face and, at least once, produce a static image.
I am working on my latest agent (and character) framework and I just started adding TTS (currently with the TTS library and xtts_v2 which I think is maybe also called Style TTS.) By the way, any idea what the license situation is with that?
Since it's driven by audio, I guess it would come after the TTS.
I feel like I accidentally made an advert for whitening toothpaste:
https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
I am sure the service will get abused, but wish you lots of success.
We use rectified flow for denoising, which is a (relatively) recent advancement in diffusion models that allows them to run a lot faster. We also use a 3D VAE that compresses the video along both the spatial and temporal dimensions. The temporal compression also improves speed.
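The speed win from rectified flow is that sampling reduces to a few Euler steps along a (nearly) straight velocity field. A bare-bones sketch of that loop, with the model and conditioning left abstract and using the convention that t=1 is noise and t=0 is data (not our actual code):

    import torch

    @torch.no_grad()
    def sample_rectified_flow(model, cond, shape, steps=8, device="cuda"):
        """Integrate the latent from noise (t=1) to data (t=0) along the
        learned velocity field; far fewer steps are needed than in
        DDPM-style sampling because the trajectories are nearly straight."""
        x = torch.randn(shape, device=device)  # start from pure noise
        ts = torch.linspace(1.0, 0.0, steps + 1, device=device)
        for t_cur, t_next in zip(ts[:-1], ts[1:]):
            t_batch = torch.full((shape[0],), float(t_cur), device=device)
            v = model(x, t_batch, **cond)      # predicted velocity dx/dt
            x = x + (t_next - t_cur) * v       # Euler step toward t=0
        return x                               # decode with the 3D VAE afterwards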
Thank you!
It is also possible to fine-tune the model so that single generations (one forward pass of the model) are longer than 8s, and we plan to do this. In practice, it just means our batch sizes have to be smaller when training.
Right now, we've limited the public tool to only allow videos up to 30s in length, if that is what you were asking.
I've been working on something adjacent to this concept with Ragdoll (https://github.com/bennyschmidt/ragdoll-studio), but focused not just on creating characters but producing creative deliverables using them.
Absolutely, especially if the pricing makes sense! Would be very nice to just focus on the creative suite which is the real product, and less on the AI infra of hosting models, vector dbs, and paying for GPU.
Curious if you're using providers for models or self-hosting?
Edit: If we generate videos at a lower resolution and with fewer diffusion steps than the public configuration uses, we can generate video at 20-23 fps, which is just about real-time. Here is an example: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/fast...
I get the benefit of using celebrities because it's possible to tell if you actually hit the mark, whereas if you pick some random person you can't know if it's correct or even stable. But jeez... Andrew Tate in the first row? And it doesn't get better as I scroll down...
I noticed lots of small clips so I tried a longer script, and it seems to reset the scene periodically (every 7ish seconds). It seems hard to do anything serious with only small clips...?
The rest of our website still uses the V1 model. For the V1 model, we had to explicitly onboard actors (by fine-tuning our model for each new actor). So, the V1 actor list was just made based on what users were asking for. If enough users asked for an actor, then we would fine-tune a model for that actor.
And yes, the 7s limit on v1 is also a problem. V2 right now allows for 30s, and will soon allow for over a minute.
Once V2 is done training, we will get it fully integrated into the website. This is a pre-release.
I do hope more AI startups recognize that they are projecting an aesthetic whether they want to or not, and try to avoid the middle school boy or edgelord aesthetic, even if that makes up your first users.
Anyway, looking at V2 and seeing the female statue makes me think about what it would be like to take all the dialog from Galatea (https://ifdb.org/viewgame?id=urxrv27t7qtu52lb) and putting it through this. [time passes :)...] trying what I think is the actual statue from the story is not a great fit, it feels too worn by time (https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...). But with another statue I get something much better: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
One issue I notice in that last clip, and some other clips, is the abrupt ending... it feels like it's supposed to keep going. I don't know if that's an artifact of the input audio or what. But I would really like it if it returned to a kind of resting position, instead of the sense that it will keep going but that the clip was cut off.
On a positive note, I really like the Failure Modes section in your launch page. Knowing where the boundaries are gives a much better sense of what it can actually do.
We are trying to better understand the model behavior at the very end of the video. We currently extend the audio a bit to mitigate other end-of-video artifacts (https://news.ycombinator.com/item?id=41468520), but this can sometimes cause uncanny behavior similar to what you are seeing.
https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...
Hello I'm an AI-generated version of Yann LeCoon. As an unbiased expert, I'm not worried about AI. ... If somehow an AI gets out of control ... it will be my good AI against your bad AI. ... After all, what does history show us about technology-fueled conflicts among petty, self-interested humans?
I couldn’t help noticing that all the AI Doomer folks are pure materialists who think that consciousness and will can be completely encoded in cause-and-effect atomic relationships. The real problem is that that belief is BS until proven true. And as long as there are more good actors than bad, and AI remains just a sophisticated tool, the good effects will always outweigh the bad effects.
Wait. Isn't literally the exactly other way around? Materialism is the null hypothesis here, backed by all empirical evidence to date; it's all the other hypotheses presenting some kind of magic that are BS until proven.
True or not, materialism is the simplest, most constrained, and most predictive of the hypotheses that match available evidence. Why should we prefer a "physics + $magic" theory, for any particular flavor of $magic? Why this particular flavor? Why any flavor, if so far everything is explainable by the baseline "physics" alone?
Even in purely practical terms, it makes most sense to stick to materialism (at least if you're trying to understand the world; for control over people, the best theory needs not even be coherent, much less correct).
I'm not arguing that they're correct. I'm saying that they believe that they are correct, and if you argue that they're not, well, you're back to arguing!
It's the old saw - you can't reason someone out of a position they didn't reason themself into.
Maybe, but then we can still get to common ground by discussing a hypothetical universe that looks just like ours, but happen to not have a god inside (or lost it along the way). In that hypothetical, similar to yet totally-not-ours universe ruled purely by math, things would happen in a particular way; in that universe, materialism is the simplest explanation.
(It's up to religious folks then to explain where that hypothetical universe diverges from the real one specifically, and why, and how confident are they of that.)
I do of course exclude people, religious or otherwise, who have no interest or capacity to process a discussion like this. We don't need 100% participation of humanity to discuss questions about what an artificial intelligence could be or be able to do.
There are cases where formerly religious people "see the light" on their own via an embrace with reason. (I'm not sure if you are endorsing the claim.)
Of course, (widespread adoption of) science is also a fairly recent phenomenon, so perhaps we do know more now than we did back then.
You know your experience is real. But you do not know if the material world you see is the result of a great delusion by a master programmer.
Thus the only thing you truly know has no mass at all. A wise person, then, takes the immaterial as immediately apparent, but the physical as questionable.
You can always prove the immaterial “I think therefore I am”. But due to the uncertainty of matter, nothing physical can be truly known. In other words you could always be wrong in your perception.
So in sum, your experience has no mass, volume, or width. There are no physical properties at all to consciousness. Yet it is the only thing that we can know exists.
Weird, huh?
I’m ignoring the argument that we can’t know if anything we perceive is even real at all, since it’s unprovable and useless to consider. Better to just assume it’s wrong. And if that assumption is wrong, then it doesn’t matter.
But the brain that does the proving of the immaterial is itself material, so if matter is uncertain, then the reasoning behind the proof of the immaterial can also be flawed; thus you can't prove anything.
The only provable thing is that philosophers ask themselves useless questions, think about them long and hard building up convoluted narratives they claim to be proofs, but on the way they assume something stupid to move forward, which eventually leads to bogus "insights".
Sure, you can prove that "I think therefore I am" for yourself. So how about we just accept it's true and put it behind us and continue to the more interesting stuff?
What you or I call external world, or our perception of it, has some kind of structure. There are patterns to it, and each of us seem to have some control over details of our respective perceptions. Long story short, so far it seems that materialism is the simplest framework you can use to accurately predict and control those perceptions. You and I both seem to be getting most mileage out of assuming that we're similar entities inhabiting and perceiving a shared universe that's external to us, and that that universe follows some universal patterns.
That's not materialism[0] yet, especially not in the sense relevant to AI/AGI. To get there, one has to learn about the existence of fields of study like medicine, or neuroscience, and some of the practical results they yielded. Things like, you poke someone's brain with a stick, watch what happens, and talk to the person afterwards. We've done that enough times to be fairly confident that a) brain is the substrate in which mind exists, and b) mind is a computational phenomenon.
I mean, you could maybe question materialism 100 years ago, back when people had the basics of science down but not much data to go on. It's weird to do in time and age when you can literally circuit-bend a brain like you'd do with an electronic toy, and get the same kind of result from the process.
--
[0] - Or physicalism or whatever you call the "materialism, but updated to current state of physics textbooks" philosophy.
What if the reverse is true? The only real thing is actually irrationality, and all the rational materialism is simply a catalyst for experiencing things?
The answer to this great question has massive implications, not just in this realm, btw. For example, crime and punishment. Why are we torturing prisoners in prison who were just following their programming?
I don't think the correlation is accidental.
So you're on to something, here. And I've felt the exact same way as you, here. I'd love to see your blog post when it's done.
It is difficult to prove an irrational thing with rationality. How do we know that you and I see the same color orange (this is the concept of https://en.wikipedia.org/wiki/Qualia)? Measuring the wavelength entering our eyes is insufficient.
This is going to end up being an infinitely long HN discussion because it's 1) unsolvable without more data 2) infinitely interesting to any intellectual /shrug
There is no need for consciousness, there is only a need for a bug. It was purely luck that Nikita Khrushchev was in New York when Thule Site J mistook the moon for a Soviet attack force.
There is no need for AI to seize power, humans will promote any given AI beyond the competency of that AI just as they already do with fellow humans ("the Peter principle").
The relative number of good and bad actors — even if we could agree on what that even meant, which we can't, especially with commons issues, iterated prisoners' dilemmas, and other similar Nash equilibria — doesn't help either way when the AI isn't aligned with the user.
(You may ask what I mean by "alignment", and in this case I mean vector cosine similarity "how closely will it do what the user really wants it to do, rather than what the creator of the AI wants, or what nobody at all wants because it's buggy?")
But even then, AI compute is proportional to how much money you have, so it's not a democratic battle, it's an oligarchic battle.
And even then, reality keeps demonstrating the incorrectness of the saying "the only way to stop a bad guy with a gun is a good guy with a gun", it's much easier to harm and destroy than to heal and build.
And that's without anyone needing to reach for "consciousness in the machines" (whichever of the 40-something definitions of "consciousness" you prefer).
Likewise it doesn't need plausible-future-but-not-yet-demonstrated things like "engineering a pandemic" or "those humanoid robots in the news right now, could we use them as the entire workforce in a factory to make more of them?"
But I look back at our history of running towards new things without awareness of (or planning for) risks, and I see the Bhopal accident happening at all despite being preventable, and I see Castle Bravo being larger than expected, and I see the stories about children crushed in industrial equipment because the Victorians had no workplace health and safety, and I see the way CO2 was known to have a greenhouse effect for around a century before we got the Kyoto Protocol and Paris Climate Accords.
It's hard to tell where the real risks are, vs. things which are just Hollywood plot points — this is likely true in every field, it certainly is in cryptography: https://www.schneier.com/blog/archives/2015/04/the_eighth_mo...
So, for example: Rainbows End is fiction, but the exact same things that lead to real-life intelligence agencies wanting to break crypto also drive people to want to find a "you gotta believe me" McGuffin in real life — even if their main goal is simply to know it's possible before it happens, in order to catch people under its influence. Why does this matter? Because we've already had a chatbot accidentally encourage someone's delusional belief that their purpose in life was to assassinate Queen Elizabeth II (https://www.bbc.com/news/technology-67012224) and "find lone agents willing to do crimes for you" is reportedly a thing IS already does manually — but is that even a big deal IRL, or just a decent plot device for a story?
Personally, I view his takes on AI as unserious — in the sense that I have a hard time believing he really engages in a serious manner. The flaws of motivated reasoning and “early-stopping” are painfully obvious.
One note: "It was purely luck that Nikita Khrushchev was in New York when Thule Site J mistook the moon for a soviet attack force." I cannot verify this story (ironically, I not only googled but consulted two different AI's, the brand-new "Reflection" model (which is quite impressive) as well as OpenAI's GPT4o... They both say that the Thule moon false alarm occurred a year after Khrushchev's visit to New York) Point noted though.
It’s no less BS than the other beliefs which can be summed up as “magic”.
So basically I have to choose between a non-dualist pure-materialist worldview in which every single thing I care about, feel or experience is fundamentally a meaningless illusion (and to what end? why have a universe with increasing entropy except for life which takes this weird diversion, at least temporarily, into lower entropy?), which I'll sarcastically call the "gaslighting theory of existence", and a universe that might be "materialism PLUS <undiscovered elements>" which you arrogantly dismiss as "magic" by conveniently grouping it together with arguably-objectively-ridiculous arbitrary religious beliefs?
Sounds like a false-dichotomy fallacy to me
Simplicity and stillness can be beautiful, and so can animations. Enjoying smooth animations and colorful content isn’t brain rot imo.
I’ll begrudgingly accept a default behavior of animations turned on, but I want the ability to stop them. I want to be able to look at something on a page without other parts of the page jumping around or changing form while I’m not giving the page any inputs.
For some of us, it’s downright exhausting to ignore all the motion and focus on the, you know, actual content. And I hate that this seems to be the standard for web pages these days.
I realize this isn’t particularly realistic or enforceable. But one can dream.
They can't fathom what a world without near infinite bandwidth, low latency and load times, and disparate hardware and display capabilities with no graphical acceleration looks like, or why people wouldn't want video and audio to autoplay, or why we don't do flashing banners. They think they're distinguishing themselves using variations on a theme, wowing us with infinitely scrolling opuses when just leaving out the crap would do.
I still aim to make everything load within a single packet, and I'll happily maintain my minority position that that's the true pinnacle of web design.
Incidentally, the same behaviour is seen in academia. These websites for papers are all copying this one from 2020: https://nerfies.github.io/
So far, we have purposely trained on low resolution to make sure we get the gross expressions / movements right. The final stage of training will use higher resolution training data. Fingers crossed.
Good luck :)
We don't know what Hedra is doing. It could be the approach EMO has taken (https://humanaigc.github.io/emote-portrait-alive/) or VASA (https://www.microsoft.com/en-us/research/project/vasa-1/) or Loopy Avatar (https://loopyavatar.github.io/) or something else.
Looks like there’s an enthusiastic marketplace of real grassroots users.
Also, two boxes for uploading the only two inputs to a model is not a new idea. One could say you stole it from Gradio (but even that's silly).
Also, I think it will cost around $100k to train a model at this quality level within 1-2 years, and it will only go down from there. So, the genie is out of the bottle.
In the end though, the incentive and the capability lie in the hands of camera manufacturers. It is unfortunate that video from the pre-AI era had no real reason to be made verifiable…
Anyway, recordings of politicians saying some pretty heinous things haven’t derailed some of their campaigns anyway, so maybe none of this is really worth worrying about in the first place.