Have we scaled vision like language yet?

A few years ago at our CVPR 2023 Transformers for Vision workshop, Lucas Beyer said something that caught me by surprise, and I’ve been trying to piece it together ever since. The interaction went something like this: Audience: “Why aren’t we scaling vision models as large as we do LLMs?” Lucas: “You know, actually, the largest vision models are on par with the largest language models if you look at [X].” I can never quite remember what X was — FLOPs, parameters, or token budget. Obviously now it’s not parameters. The largest recorded ViTs still tap out in the 22B regime, with the most consistent scaling happening in the 1B–7B range, as in DINOv3 [10]. ...

February 14, 2026 · 8 min · Tyler Zhu

FSDP for Dummies

I’ve always struggled to understand the intuitions behind Fully Sharded Data Parallel beyond the high-level idea of “shard everything.” Without a systems background, the fundamental primitives like “all-reduce” and “reduce-scatter” aren’t in my vocabulary. But FSDP is not conceptually complicated, especially once you state what the goals are (the rest is nearly necessitated by the engineering). This post is an attempt to deconstruct the algorithm from first principles as a non-systems person. I will bring up the primitives in the context where they arise, which I think reinforces the intuition much better. Most ML researchers have a stronger understanding of models, parameters, and optimizer states than of systems jargon anyway. ...

February 2, 2026 · 10 min · Tyler Zhu