* so-called "hallucination" (actually just how generative models work) is a feature, not a bug.
* anyone can easily see the unrealistic and biased outputs without complex statistical tests.
* human intuition is useful for evaluation, and not fundamentally misleading (i.e. the "this text sounds fluent, so the generator must be intelligent!" hype doesn't really have an equivalent for imagery; we're capable of treating it as technology and evaluating it fairly, because there's no equivalent human capability).
* even lossy, noisy, collapsed and over-trained methods can be valuable for different creative pursuits.
* perfection is not required. You can easily see distorted features in output, and iteratively try to improve them.
* consistency is not required (though it will unlock hugely valuable applications, like video, should it ever arrive).
* technologies like LoRA allow even unskilled users to train character-, style- or concept-specific models with ease.
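For a sense of how low the barrier is, here's a minimal sketch of attaching LoRA adapters to a Stable Diffusion UNet with diffusers + peft. The rank, alpha and target modules are the illustrative defaults from the diffusers LoRA examples; the dataset and training loop are omitted, so treat it as a starting point, not a recipe.

```python
# Minimal LoRA setup sketch (diffusers + peft). Illustrative only: you still need
# a dataset of your character/style/concept and a training loop on top of this.
from diffusers import UNet2DConditionModel
from peft import LoraConfig

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
unet.requires_grad_(False)  # freeze the base model

lora_config = LoraConfig(
    r=16,                    # low-rank adapter size: a tiny fraction of the UNet
    lora_alpha=16,
    init_lora_weights="gaussian",
    target_modules=["to_k", "to_q", "to_v", "to_out.0"],  # attention projections
)
unet.add_adapter(lora_config)  # only the LoRA weights are left trainable

trainable = sum(p.numel() for p in unet.parameters() if p.requires_grad)
total = sum(p.numel() for p in unet.parameters())
print(f"training {trainable:,} of {total:,} parameters")
```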
I've been amazed at how much better image / visual generation models have become in the last year, and IMO, the pace of improvement has not been slowing as much as text models. Moreover, it's becoming increasingly clear that the future isn't the wholesale replacement of photographers, cinematographers, etc., but rather, a generation of crazy AI-based power tools that can do things like add and remove concepts to imagery with a few text prompts. It's insanely useful, and just like Photoshop in the 90s, a new generation of power-users is already emerging, and doing wild things with the tools.
I am biased (I work at Rev.com and Rev.ai), but I totally agree and would add one more thing: transcription. Accurate human transcription takes a really, really long time to do right. Often a ratio of 3:1-10:1 of transcriptionist time to original audio length.
Though ASR is only ~90-95% accurate on a lot of "average" audio, it is often essentially 100% accurate on high-quality audio.
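If you want to sanity-check numbers like that on your own audio, the standard metric is word error rate (WER) against a human reference transcript; here's a quick sketch with the jiwer package, using made-up example strings:

```python
# Word error rate (WER) sketch: compare an ASR hypothesis to a human reference.
# The strings here are toy examples; substitute your own transcripts.
from jiwer import wer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

error_rate = wer(reference, hypothesis)  # fraction of words substituted/inserted/deleted
print(f"WER: {error_rate:.1%}  (word accuracy ~ {1 - error_rate:.1%})")
```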
It's not only a cost savings thing, but there are entire industries that are popping up around AI transcription that just weren't possible before with human speed and scale.
There was a project mentioned here on HN where someone was creating audio book versions of content in the public domain that would never have been converted through the time and expense of human narrators because it wouldn't be economically feasible. That's a huge win for accessibility. Screen readers are also about to get dramatically better.
Tried it on a PDF and it didn't even read the PDF.
I'm sure we'll get there, but... it's a real shame it lies when it can't figure something out.
First, lying requires agency and intent, which LLMs don't have, so they can't lie.
Yes, it makes stuff up when you put garbage in and uncritically consume the garbage. The key isn't to look at it as an outsourcing of agency or the easy button, but as a tool that gets you started on stuff and a new way of interacting with computers. It also confidently asserts things that are untrue or are subtly off base. To that extent, and in a very real sense, this is a very early preview of the technology - of a completely new computing technique that only reached bare-minimum usability in the last two years. Would you rather not have early access, or have to wait 20 years as accountants and product managers strangle it?
For OCR, I'm surprised anyone who has ever used it before would scan in illegible handwriting and expect anything other than a bunch of garbage out, without the model even flagging that the garbage was semantically wrong. Frontier multimodal LLMs do an amazing job compared to the state of the art a year ago. Do they do an amazing job compared to an ever-shifting goalpost? Are all the guard rails of a mature, 30-year-old software technique even discovered yet? No. But I'll tell you, the early days of HTTP were nothing like today. Was HTTP useless because it was so unreliable and flaky? No, it was amazing for those with the patience and the capacity to dream of building something truly remarkable at the time, like Google or Amazon or eBay.
The PDF issue you had is not expected. I upload PDFs all the time. For instance, when I'm working on something - like restringing some Hunter Douglas blinds in my house recently - I upload the instructions for the restring kit to a ChatGPT or Claude session, and it becomes something I can iteratively ask how to tackle what I'm working on as I get to challenging spots in the process. It's not always right, and it confidently tells me subtly wrong things. But I pretty quickly realize what's right and what's wrong as I work, and it's usually something ambiguous in the instructions that requires a lot more context on something very specific and likely not documented publicly anywhere. But 80% of the time my questions get answered as I work. That's -amazing-: I can scan a paper instruction sheet into a computer and get step-by-step guidance that I can interactively interrogate using my voice as I work, and it literally understands everything I ask and gives me cogent, if sometimes off, answers. This is literally the definition of the future I was promised.
Maybe this: https://news.ycombinator.com/item?id=40961385
It was a video for ESPN of an indoor motocross race, and the transcription was for the commentators. There were two fundamental problems:
1) The bike noise made the commentators almost inaudible
2) The commentators were using the [well-known to fans] nicknames of all the racers, and not their real names
I haven't used Rev for about three years, so I don't know how much better your auto-transcription system has gotten. I'd hope AI can solve #1, but #2 is a very hard problem to solve, simply because of the domain knowledge required. The nicknames were like Buttski McDumpleface etc and took a bunch of Googling to figure out.
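For what it's worth, one partial mitigation for #2 is feeding the decoder the domain vocabulary up front. For example, openai-whisper's transcribe() accepts an initial_prompt string for exactly this; the model size, file name and nickname list below are just placeholders:

```python
# Sketch: bias Whisper toward domain-specific names via initial_prompt.
# Model choice, audio path and the nickname list are placeholders.
import whisper

model = whisper.load_model("medium")
result = model.transcribe(
    "supercross_commentary.wav",
    initial_prompt=(
        "Indoor motocross race commentary. Rider nicknames include "
        "'Buttski' McDumpleface and other fan nicknames."
    ),
)
print(result["text"])
```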
I eventually got fired from Rev simply because the moderators hadn't heard of the Oxford comma :p
That said, "hallucination" is more of a fundamental problem for this area than it is for imagery, which is why I still think imagery is the most interesting category.
I need one for a product, and the state of the art, e.g. pyannote, is so bad it's better not to use it.
I keep getting burned by APIs with stupid restrictions that make use cases impossible that would be trivial if you could run the thing locally.
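For reference, running pyannote locally is only a few lines, so it's at least easy to judge for yourself (pyannote.audio 3.x; the checkpoint is gated, so you need a Hugging Face token, and the file name below is a placeholder):

```python
# Local speaker-diarization sketch with pyannote.audio 3.x.
# The model is gated on Hugging Face; "meeting.wav" is a placeholder file.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",
)
diarization = pipeline("meeting.wav")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:6.1f}s - {turn.end:6.1f}s")
```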
I think it's easy to totally miss that LLMs are just being completely and quietly subsumed into a ton of products. They have been far more successful, and many image generation models use LLMs on the backend to generate "better" prompts for the models themselves. LLMs are the bedrock.
I'd refrain from making any such statements about the future;* the pace of change makes it hard to see the horizon beyond a few years, especially relative to the span of a career. It's already wholesale-replacing many digital artists and editorial illustrators, and while it's still early, there's a clear push starting in the cinematography direction. (I fully agree with the rest of your comment, and it's strange how much diffusion models seem to be overlooked relative to LLMs when people think about AI progress these days.)
* (edit: about the future impact of AI on jobs).
> It's already wholesale-replacing many digital artists and editorial illustrators
I think you're going to need to cite some data on a claim like that. Maybe it's replacing the fiverr end of the market? It's certainly much harder to justify paying someone to generate a (bad) logo or graphic when a diffusion model can do the same thing, but there's no way that a model, today, can replace a skilled artist. Or said differently: a skilled artist, combined with a good AI model, is vastly more productive than an unskilled artist with the same model.
(in case you think the market will not behave like that, just have a look at how we produce low quality food and how many people are perfectly fine with that)...
I sort of hate this line of argument, but it also has been manifestly true of the past, and rhymes with the present.
What we've seen over the last year, trying out dozens of models and AI workflows, is that the fit between 1) the error tolerance of a model and 2) its working context is super important.
AI hallucinations break a lot of otherwise useful implementations. It's just not trustworthy enough. Even with AI imagery, some use cases require precision - AI photoshoots and brand advertising come to mind.
The sweet spot seems to be as part of a pipeline where the user only needs a 90% quality output. Or you have a human + computer workflow - a type of "Centaur" - similar to Moravec's Paradox.
Let me show you the future: https://www.youtube.com/watch?v=eVlXZKGuaiE
This is an LLM controlling an embodied VR body in a physics simulation.
It is responding to human voice input not only with voice but body movements.
Transformers aren't just chatbots, they are general symbolic manipulation machines. Anything that can be expressed as a series of symbols is a thing they can do.
No, it's not. It's VAM that is controlling the character; it's literally just using a bog-standard LLM as a chatbot, feeding the text into a VAM plugin, and VAM itself does the animation. Don't get me wrong, it's absolutely next-level to experience chatbots this way, but it's still a chatbot.
This is as naive as calling an industrial robot 'just a calculator'.
This is key: we're all pre-wired with fast correctness tests.
Are there other data types that match this?
Mundane tasks that can be visually inspected at the end (cleaning, organizing, maintenance and mechanical work)
The knowledge answering is secondary in my opinion
Unlike LLMs, which really seem to translate the text into "concepts" at a certain embedding layer, the (current, 2D) diffusion models will store (and thus need to be trained on) a completely different idea of a thing if it's viewed from a slightly different angle or is a different size. Diffusion models can interpolate but not extrapolate — they can't see a prompt that says "lion goat dragon monster" and come up with the ancient-Greek Chimera unless they've actually been trained on a Chimera. You can tell them "Asian man, blond hair" — and if their training dataset contains Asian men and men with blond hair but never at the same time, then they won't be able to "hallucinate" a blond Asian man for you, because that won't be an established point in the model's latent space.
---
On a tangent: IMHO the true breakthrough would be a model for "text to textured-3D-mesh" — one that builds the model out of parts that it shapes individually and assembles in 3D space, not out of tris, but by writing/manipulating tokens representing shader code (i.e. it creates "procedural art"); and then consistency-checks itself at each step not just against a textual embedding, but also against an arbitrary (i.e. controlled for each layer at runtime by data) set of 2D projections that can be decoded out to textual embeddings.
(I imagine that such a model would need some internal "blackboard" of representational memory that it can set up arbitrarily-complex "lenses" for between each layer — i.e. a camera with an arbitrary projection matrix, through which is read/written a memory matrix. This would allow the model to arbitrarily re-project its internal working visual "conception" of the model between each step, in a way controllable by the output of each step. Just like a human would rotate and zoom a 3D model while working on it[1]. But (presumably) with all the edits needing a particular perspective done in parallel on the first layer where that perspective is locked in.)
Until we have something like that, though, all we're really getting from current {text,image}-to-{image,video} models is the parallel layered inpainting of a decently, but not remarkably exhaustive pre-styled patch library, with each patch of each layer being applied with an arbitrary Photoshop-like "layer effect" (convolution kernel.) Which is the big reason that artists get mad at AI for "stealing their work" — but also why the results just aren't very flexible. Don't have a patch of a person's ear with a big earlobe seen in profile? No big-earlobe ear in profile for you. It either becomes a small-earlobe ear or the whole image becomes not-in-profile. (Which is an improvement from earlier models, where just the ear became not-in-profile.)
[1] Or just like our minds are known to rotate and zoom objects in our "spatial memory" to snap them into our mental visual schemas!
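To make the consistency-check idea above a bit more concrete, here's a minimal sketch of scoring a candidate mesh by rendering it from several viewpoints and comparing each render against the text prompt with CLIP. The render_mesh() helper is hypothetical (any rasterizer would do); the CLIP scoring uses the standard transformers API.

```python
# Sketch of "consistency-check the 3D model against text via 2D projections".
# render_mesh() is a hypothetical helper; the CLIP calls are the real
# transformers CLIPModel / CLIPProcessor API.
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def consistency_score(mesh, prompt: str, cameras) -> float:
    """Render the mesh from each camera and average CLIP image-text similarity."""
    views = [render_mesh(mesh, cam) for cam in cameras]  # hypothetical renderer
    inputs = processor(text=[prompt], images=views, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    # logits_per_image has shape (num_views, 1): one similarity per rendered view
    return out.logits_per_image.mean().item()
```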
The kind of granular, human-assisted interaction interface and workflow you're describing is, IMHO, the high-value path for the evolution of AI creative tools for non-text applications such as imaging, video and music, etc. Using a single or handful of images or clips as a starting place is good but as a semi-talented, life-long aspirational creative, current AI generation isn't that practically useful to me without the ability to interactively guide the AI toward what I want in more granular ways.
Ideally, I'd like an interaction model akin to real-time collaboration. Due to my semi-talent, I've often done initial concepts myself and then worked with more technically proficient artists, modelers, musicians and sound designers to achieve my desired end result. By far the most valuable such collaborations weren't necessarily with the most technically proficient implementers, but rather those who had the most evolved real-time collaboration skills. The 'soft skill' of interpreting my directional inputs and then interactively refining or extrapolating them into new options or creative combinations proved simply invaluable.
For example, with graphic artists I've developed a strong preference for working with those able to start out by collaboratively sketching rough ideas on paper in real-time before moving to digital implementation. The interaction and rapid iteration of tossing evolving ideas back and forth tended to yield vastly superior creative results. While I don't expect AI-assisted creative tools to reach anywhere near the same interaction fluidity as a collaboratively-gifted human anytime soon, even minor steps in this direction will make such tools far more useful for concepting and creative exploration.
Also, the very bad press gen AI gets is very much slowing down adoption. Particularly among the creative-minded people, who would be the most likely users.
There are plenty of mind-blowing images.
In any case it would be cool if they specified the set of inputs that is expected to give decent results.
> Optional quad or triangle remeshing (adding only 100-200ms to processing time)
But it seems to have been optional. Did you try it with that turned on? I'd be very interested in those results, as I had the same experience as you, the models don't generate good enough meshes, so was hoping this one would be a bit better at that.
Edit: I just tried it out myself on their Huggingface demo and even with the predefined images they have there, the mesh output is just not good enough. https://i.imgur.com/e6voLi6.png
All of my tests of img2mesh technologies have produced poor results, even when using images that are very similar to the ones featured in their demo. I’ve never got fidelity like what they’ve shown.
I’ll give this a whirl and see if it performs better.
It is, however, fast.
Holy cow - I was thinking this might be one of those datacenter-only models but here I am proven wrong. 7GB of VRAM suggests this could run on a lot of hardware that 3D artists own already.
Useful for what? I think use cases will emerge.
A lot of critiques assume you're working in VFX or game development. Making image-to-3D (and by extension text-to-image-to-3D) effortless opens up a whole host of new applications - which might not be anywhere near so demanding.
I see these being usable not as main assets, but as low-effort embellishments you would add to give the main scene complexity. The fact that they maintain their profile makes them usable in situations where a mere 2D billboard impostor (i.e. the original image always oriented towards the camera) would not cut it.
You can totally create a figure image (Midjourney|Bing|Dalle3), drag and drop it into the image input, and get a surprisingly good 3D representation - not a highly detailed model, but something you could very well put on a shelf in a 3D scene as an embellishment, where the camera never sees the back of it and the model is never the center of attention.
However... mixed success. It's not good with (real) cats yet - which was obvs the first thing I tried. It did reasonably well with a simple image of an iPhone, and actually pretty impressively with a pancake with fruit on top, terribly with a rocket, and impressively again with a rack of pool balls.
[0] https://huggingface.co/spaces/stabilityai/stable-fast-3d
Has anyone tried to build an Unreal scene with these generated meshes?
I wonder what the optimum group of technologies is that would enable that kind of mapping? Would you pile on LIDAR, RADAR, this tech, ultrasound, magnetic sensing, etc etc. Although, you're then getting a flying tricorder. Which could enable some cool uses even outside the stereotypical search and rescue.
DARPA's subterranean challenge had many teams that did some pretty cool stuff in this direction: https://spectrum.ieee.org/darpa-subterranean-challenge-26571...
https://github.com/DepthAnything/Depth-Anything-V2
You basically select an area on a map that you want to model in 3d, it flies your drone (take-off, flight path, landing), takes pictures, uploads to their servers for processing, generates point cloud, etc. Very powerful.
You can test here: https://huggingface.co/spaces/stabilityai/stable-fast-3d
I wonder whether RAG-based 3D animation generation can be done with this (a rough code sketch follows the list):
1. Textual description of a story.
2. Extract/generate keywords from the story using LLM.
3. Search and look up 2D images by the keywords.
4. Generate 3D models from the 2D images using Stable Fast 3D.
5. Extract/generate path description from the story using LLM.
6. Generate movement/animation/gait using some AI.
...
7. Profit??
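A very rough sketch of that pipeline, just to show the data flow; every function below (the LLM calls, image search, Stable Fast 3D inference, the motion model) is hypothetical glue, not a real API.

```python
# Hypothetical glue code for steps 1-6 above; none of the called helpers are real APIs.
from dataclasses import dataclass

@dataclass
class SceneAsset:
    keyword: str
    image_path: str
    mesh_path: str

def build_assets(story: str) -> list[SceneAsset]:
    # Step 2: LLM extracts the characters/props the scene needs.
    keywords = call_llm(f"List the characters and props in this story:\n{story}")
    assets = []
    for kw in keywords:
        image = image_search(kw)          # Step 3: look up a 2D reference image
        mesh = stable_fast_3d(image)      # Step 4: image -> textured 3D mesh
        assets.append(SceneAsset(kw, image, mesh))
    return assets

def animate_story(story: str):
    assets = build_assets(story)
    # Step 5: LLM turns the story into per-character movement paths.
    paths = call_llm(f"Describe each character's movement path in this story:\n{story}")
    # Step 6: some motion/gait model turns paths into animation.
    return motion_model(assets, paths)
```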