I'm sure getting to this point was quite difficult, and on the project page you can read how it involved discussions with lots and lots of smart and capable people. But there's no big "aha" moment in the paper, so in the end it feels like another hit for [The Bitter Lesson]: they used a giant bunch of [data] and a year and a half of GPU time to [train] the final model, and created a billion-parameter model that outperforms all previous specialized models.
Or in the words of the authors, from the paper:
> We also show that it is unnecessary to design a special network for 3D reconstruction. Instead, VGGT is based on a fairly standard large transformer [119], with no particular 3D or other inductive biases (except for alternating between frame-wise and global attention), but trained on a large number of publicly available datasets with 3D annotations.
Fantastic to have this. But it feels... yes, somewhat bitter.
[The Bitter Lesson]: http://www.incompleteideas.net/IncIdeas/BitterLesson.html (often discussed on HN)
[data]: "Co3Dv2 [88], BlendMVS [146], DL3DV [69], MegaDepth [64], Kubric [41], WildRGB [135], ScanNet [18], HyperSim [89], Mapillary [71], Habitat [107], Replica [104], MVS-Synth [50], PointOdyssey [159], Virtual KITTI [7], Aria Synthetic Environments [82], Aria Digital Twin [82], and a synthetic dataset of artist-created assets similar to Objaverse [20]."
[train]: "The training runs on 64 A100 GPUs over nine days", that would be around $18k on lambda labs in case you're wondering
[1] https://stockfishchess.org/blog/2020/introducing-nnue-evalua...
Brute forcing is bound to find paths beyond heuristics. What I'm getting at is that the path needs to be established first before it can be beaten, which is why I'm wondering whether one isn't an extension of the other rather than an opposing strategy.
I.e., search and heuristics both have a time and a place; not so much a bitter lesson as a common filter each next iteration has to pass through.
> [train]: "The training runs on 64 A100 GPUs over nine days"; that would be around $18k on Lambda Labs, in case you're wondering
How is that "a year and a half of GPU time"? Maybe on some exoplanet?
> How is that "a year and a half of GPU time"?
64 GPUs × 9 days = 576 GPU-days ≈ 1.577 GPU-years
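Back of the envelope, if anyone wants to check both the GPU-years and the $18k figure (the per-hour rate below is my assumption for illustration, not a quote from Lambda's price list):

```python
# Back-of-the-envelope check of the GPU-time and cost figures above.
# The hourly rate is an assumed on-demand A100 price, not an official quote.
gpus, days = 64, 9
gpu_days = gpus * days                 # 576 GPU-days
gpu_years = gpu_days / 365.25          # ~1.58 GPU-years
gpu_hours = gpu_days * 24              # 13,824 GPU-hours
usd_per_gpu_hour = 1.29                # assumption
cost = gpu_hours * usd_per_gpu_hour    # ~$17.8k
print(f"{gpu_days} GPU-days ≈ {gpu_years:.2f} GPU-years ≈ ${cost:,.0f}")
```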
This type of thing would be the killer app for phone-based 3D scanners. You don't have to have a perfect scan, because this will fill in the holes for you.
As someone mentioned, this is great for Gaussian splatting, which we also do.
Instead, you had to run all these separate pipelines inferring camera location, etc., before you could get any sort of 3D information out of your photos. I'd guess this is going into many, many workflows, where it will drop-in replace a bunch of jury-rigged pipelines.
This would let any of the types of data that this model can output be used as input for controlling image generation.
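One concrete shape that could take: feed a predicted depth map into a depth-conditioned ControlNet. A rough sketch with diffusers, where the model IDs are the usual public ones and the `depth_map.npy` input is a stand-in for whatever the 3D model outputs:

```python
# Rough sketch: use a predicted depth map to condition image generation
# via a depth ControlNet. The depth file is a placeholder for the model's output.
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

depth = np.load("depth_map.npy")  # (H, W) predicted depth, assumed to exist
depth_u8 = 255 * (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)
control = Image.fromarray(depth_u8.astype("uint8")).convert("RGB")

image = pipe("the same street on a rainy evening", image=control).images[0]
image.save("controlled.png")
```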
- Egyptian pyramids
- Roman Colosseum
These are among the most iconic and most photographed structures in the world.
That said, there are other examples there that are more novel. I am just going to focus on those to judge its quality.
A small panning video of a city street can, right now, generate a pretty damn accurate (for some use cases) point cloud, but the position accuracy falls off as you try to go any large distance away from the start point, due to the dead-reckoning drift that essentially happens here. But if you could pipe real GPS and synthesized heading (from gyros/accelerometers/magnetometers) from the phone the images were captured on into the transformer along with the images, it would instantly and greatly improve the resultant accuracy, since it would now have those camera parameters 'ground truth'd'.
I think this technique could then nearly start to rival what you need a $3-10k LIDAR camera to do right now. There are a lot of 'archival' and architecture-study fields where absolute precision isn't as important as just getting 'full' scans of an area without missing patches, and speed is a factor. Walking around with a LIDAR camera can really suck compared to just a phone, and this technique would have no problem with multiple people using multiple phones to generate the input.
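To make the GPS-anchoring idea above concrete: even without feeding GPS into the transformer itself, you can pin a drifting reconstruction to GPS after the fact with a similarity (scale + rotation + translation) alignment between the predicted camera centers and the GPS track. A minimal sketch, assuming you already have both as Nx3 arrays in a local metric frame (the array names are placeholders, not anything from the paper):

```python
# Minimal sketch: align predicted camera centers to GPS positions with a
# least-squares similarity transform (Umeyama), then apply it to the cloud.
import numpy as np

def umeyama_alignment(src, dst):
    """Return (s, R, t) such that s * R @ src[i] + t ≈ dst[i], for (N, 3) arrays."""
    mu_src, mu_dst = src.mean(0), dst.mean(0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                      # handle reflection case
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / src_c.var(axis=0).sum()
    t = mu_dst - s * R @ mu_src
    return s, R, t

# cam_centers: predicted camera positions, gps_enu: GPS fixes in a local ENU
# frame -- both (N, 3) arrays assumed to exist already.
# s, R, t = umeyama_alignment(cam_centers, gps_enu)
# aligned_cloud = (s * (R @ point_cloud.T)).T + t
```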
However, I just tried it on Hugging Face and the result was... mediocre at best:
The resulting point cloud missed about half the features from the input image.
Should be quite exciting going forward, as fine-tuning might be possible on consumer hardware / single desktop machines (like it is with LLMs). So I would expect a lot of experiments coming out in this space soon-ish. If the results hold true, it'll be great to drop slow and cumbersome COLMAP processing and scene optimization for a single forward pass that takes a few seconds.
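For comparison, that single forward pass looks roughly like this; I'm going from memory of the VGGT repo's quick start, so treat the import paths, helper names and the `facebook/VGGT-1B` checkpoint ID as assumptions to double-check:

```python
# Roughly the single-forward-pass usage from the VGGT repo's quick start;
# exact module paths and helper names are from memory and may differ.
import torch
from vggt.models.vggt import VGGT
from vggt.utils.load_fn import load_and_preprocess_images

device = "cuda" if torch.cuda.is_available() else "cpu"
model = VGGT.from_pretrained("facebook/VGGT-1B").to(device)

images = load_and_preprocess_images(
    ["frame_000.png", "frame_001.png", "frame_002.png"]
).to(device)

with torch.no_grad():
    # One forward pass yields camera parameters, depth maps and point maps
    # for all frames at once -- no COLMAP, no per-scene optimization.
    predictions = model(images)
```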