Have we scaled vision like language yet?
A few years ago, at our CVPR 2023 Transformers for Vision workshop, Lucas Beyer made an offhand remark that took me by surprise, and I’ve been trying to piece it together ever since. The interaction went something like this:

Question from the audience: “Why aren’t we scaling vision models as large as we do LLMs?”

“You know, actually, the largest vision models are on par with the largest language models if you look at [X].” - Lucas

...