Have we scaled vision like language yet?
A few years ago at our CVPR 2023 Transformers for Vision workshop, Lucas Beyer said something that took me by surprise, and I’ve been trying to piece it together ever since. The exchange went something like this: Audience: “Why aren’t we scaling vision models as large as we do LLMs?” Lucas: “You know, actually, the largest vision models are on par with the largest language models if you look at [X].” I can never quite remember what X was — FLOPs, parameters, or token budget. It clearly isn’t parameters now: the largest recorded ViTs still top out in the 22B regime, and the most consistent scaling results come from the 1B–7B range, as in DINOv3 [10]. ...