A few years ago at our CVPR 2023 Transformers for Vision workshop, Lucas Beyer said something that caught me by surprise. I’ve been trying to piece it together ever since.

The interaction went something like this:

Audience: “Why aren’t we scaling vision models as large as we do LLMs?”

Lucas: “You know, actually, the largest vision models are on par with the largest language models if you look at [X].”

I can never quite remember what X was: FLOPs, parameters, or token budget. It’s obviously not parameters by now. The largest published ViTs still top out in the 22B regime, with the most consistent scaling happening in the 1B–7B range, as in DINOv3 [10].

This mystery still somewhat holds. I think about it often enough that I’ve decided to try to track down the original point, with Gemini as my somewhat-reliable research assistant.

X marks the treasure, and potentially AGI

Among the possibilities, parameter count is clearly not it. If anything, vision models are surprisingly parameter-efficient for their capabilities.

I would argue that vision models haven’t found enough signal to benefit from scaling. Despite the best efforts shown in ViT-22B [4] and 4DS [2], works like DINOv3 [10] have found large vision models to be terribly unstable to train. Anecdotally, I found these larger models to have worse performance than their smaller 1B variants — but YMMV.

A better candidate is FLOPs, i.e. the compute invested during training. To compare across modalities, we need token counts, since:

$$\text{FLOPs} = \text{tokens} \times \text{FLOPs per token} \approx \text{tokens} \times \text{params}$$

For this comparison, I’ll sample models released before the interaction (June 2023):

  • Llama-2 [12] as the LLM
  • DINOv2-g [9], SigLIP-1 [13], and ViT-22B [4] as the vision encoders

Large Language Model FLOPs

Llama-2 [12] comes in three main sizes: 7B, 13B, and 70B. Each is trained on 2.0T tokens (de-duplicated) with a context length of 4k.

We can estimate FLOPs as a function of the sequence length, layers, hidden dimension, and heads ($N, L, D, H$).

Per attention block:

  • QKV projections: Each is $[N \times D] \times [D \times D] \to 2ND^2$. Three of them give $6ND^2$.
  • Attention scores: $QK^T$ is $[N \times D] \times [D \times N] \to 2N^2D$. Heads don’t affect FLOPs.
  • Value multiplication: $[N \times N] \times [N \times D] \to 2N^2D$.
  • Output projection: $[N \times D] \times [D \times D] \to 2ND^2$.

Total per attention block: $8ND^2 + 4N^2D$.

Per feed-forward block: A $[D \to 4D \to D]$ connection gives $16ND^2$.

Per transformer block: $24ND^2 + 4N^2D$. Multiply by $3BL$ for the full model over training (forward + backward is roughly $3\times$ the forward pass), where $B$ is the number of sequences, so $BN$ equals total tokens.

Our parameter count is $12D^2$ per block ($4D^2$ for Q,K,V,O and $8D^2$ for FFN). When $D \gg N$, we get a common approximation for total FLOPs:

$$ \text{FLOPs} \approx 72BLND^2 = 6 \times \text{params} \times \text{tokens} $$

To stay consistent with how vision researchers calculate FLOPs, we’ll track just forward FLOPs (a third of the total):

$$ \text{fwd FLOPs} \approx BL(24ND^2 + 4N^2D) \approx 2 \times \text{params} \times \text{tokens} $$
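As a sanity check, here is a small back-of-envelope script (my own sketch, not from any of the papers) comparing the exact per-block count against the $2 \times \text{params} \times \text{tokens}$ shortcut at Llama-2-7B-like shapes:

```python
def fwd_flops(N, L, D, tokens):
    """Exact forward FLOPs from the per-block count:
    L * (24*N*D^2 + 4*N^2*D) per sequence of N tokens."""
    per_token = L * (24 * D**2 + 4 * N * D)
    return per_token * tokens

def fwd_flops_approx(L, D, tokens):
    """Shortcut: 2 * params * tokens, with params = 12*L*D^2
    (ignores embeddings and the quadratic attention term)."""
    return 2 * (12 * L * D**2) * tokens

# Llama-2 7B-like shapes
N, L, D, tokens = 4096, 32, 4096, 2.0e12
exact = fwd_flops(N, L, D, tokens)
approx = fwd_flops_approx(L, D, tokens)
print(exact / approx)  # 7/6 ≈ 1.17: the N^2 term matters when N ≈ D
```

The ~17% gap is exactly the $N/6D$ correction term, which is why the shortcut is only an upper-level estimate when sequence length and width are comparable.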

Below are the model stats, largely from the Llama-1 paper [11]. Note that $D \approx N$ here, so the approximation doesn’t hold perfectly.

One detail: Llama-2 70B uses Grouped-Query Attention, where multiple queries share the same KV vectors. With 64 query heads but only 8 KV heads, the attention-block total drops from $8ND^2 + 4N^2D$ to $4.5ND^2 + 4N^2D$ (Q and output projections still cost $2ND^2$ each, but K and V shrink by $8\times$).
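To make the GQA adjustment concrete, here is a hedged sketch (function name and structure are my own) of the attention-block count with a configurable number of KV heads:

```python
def attn_block_flops(N, D, n_heads, n_kv_heads):
    """Attention-block FLOPs for one sequence of N tokens.
    K and V project down to D * n_kv_heads / n_heads dims each;
    the score and value matmuls are unchanged, since the queries
    still span all D dims."""
    kv_frac = n_kv_heads / n_heads
    projections = (2 + 4 * kv_frac + 2) * N * D**2  # Q, K, V, output
    attention = 4 * N**2 * D                        # QK^T plus attn @ V
    return projections + attention

# Llama-2 70B: 64 query heads sharing 8 KV heads
mha = attn_block_flops(4096, 8192, 64, 64)  # 8ND^2 + 4N^2D
gqa = attn_block_flops(4096, 8192, 64, 8)   # 4.5ND^2 + 4N^2D
print(gqa / mha)  # 0.65 at this shape
```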

| Model | $N$ | $L$ | $D$ | $H$ | tokens | FLOPs/tkn | total fwd FLOPs |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-2 7B | 4096 | 32 | 4096 | 32 | 2.0 T | 14 G | 28T GFLOPs |
| Llama-2 13B | 4096 | 40 | 5120 | 40 | 2.0 T | 26.56 G | 53T GFLOPs |
| Llama-2 70B | 4096 | 80 | 8192 | 64 | 2.0 T | 130 G | 260T GFLOPs |
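A note on units: the totals multiply GFLOPs per token by trillions of tokens, so “T GFLOPs” means $10^{21}$ FLOPs (zettaFLOPs). For the 7B row (my arithmetic):

```python
flops_per_token = 14e9  # 14 GFLOPs per token, forward only
tokens = 2.0e12         # 2.0 T training tokens
total = flops_per_token * tokens
print(total)  # 2.8e+22, i.e. the table's "28T GFLOPs"
```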

Vision Transformer FLOPs

Now for the Vision Transformers. I couldn’t find enough detail about SigLIP’s exact compute budget, but the “Getting ViT in Shape” paper [1] is a great reference, so I’ll substitute in its SoViT-400m/14 model.

That paper has extensive discussion of compute budgets, and shows that FLOPs correlate well with TPU core-hours — so it’s a decent proxy.

Here are the specs for the largest size of each architecture, at a constant 14×14 patch size. These are actual traced FLOP numbers, so the comparison with LLMs won’t be entirely apples-to-apples, but it’s close enough.

| Model | params | patches | width | depth | dim | FLOPs/tkn | examples (epochs) | tokens | total fwd FLOPs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SoViT-400m/14 | 428M | 256 | 1152 | 27 | 4304 | 0.86 G | 40B (~13e) | 10.2 T | 9T GFLOPs |
| SoViT-400m/14 | 428M | 1369 | 1152 | 27 | 4304 | 1.00 G | 6.5B (~2e) | 8.9 T | 9T GFLOPs |
| DINOv2-g | 1.01B | 256 | 1408 | 40 | 6144 | 2.08 G | 7.1B (~50e) | 1.8 T | 3.8T GFLOPs |
| ViT-22B | 22B | 256 | 6144 | 48 | 24576 | 40.78 G | 11.5B (~3e) | 2.9 T | 120T GFLOPs |

DINOv2-g uses a true ViT-g shape (slightly modified); the FLOPs here count a single network, not the teacher + student pair. The ViT-22B total is estimated using the formula above. DINOv2-g pretrain tokens are estimated from the published training recipe of 500 epochs over ImageNet-22k (14M images).
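The pretrain-token estimates are just examples seen times patches per image. For DINOv2-g, under the recipe above (my arithmetic; the dataset size is approximate):

```python
epochs = 500
dataset_images = 14.2e6   # ImageNet-22k, roughly 14M images
patches_per_image = 256   # 224x224 images at patch size 14 -> 16 x 16 patches
images_seen = epochs * dataset_images             # ~7.1e9 examples
pretrain_tokens = images_seen * patches_per_image
print(pretrain_tokens)  # ~1.8e12, matching the table's 1.8 T
```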

Interestingly, our LLM estimate isn’t far off for ViTs either. The MLP expansion factor for larger ViTs is usually ~3.7 rather than 4.0, but if we account for the actual expansion factor $\alpha$, the ratio of forward FLOPs per token to parameters per block is

$$\frac{8 + 4\alpha}{4 + 2\alpha} = 2$$

for any $\alpha$, so the expansion factor cancels out entirely. The remaining error resolves with the full form above.
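A trivial check confirms the cancellation holds for any expansion factor, not just 3.7 and 4.0:

```python
# fwd FLOPs/token per block ≈ (8 + 4a) * D^2, params per block = (4 + 2a) * D^2,
# so the ratio (8 + 4a) / (4 + 2a) = 2 identically, for any a.
for alpha in (8 / 3, 3.7, 4.0, 8.0):
    ratio = (8 + 4 * alpha) / (4 + 2 * alpha)
    assert abs(ratio - 2.0) < 1e-12
```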

Musings

The first thing that surprised me was that total FLOPs is the right baseline for cross-modal comparison. It’s somewhat obvious in hindsight, but coming from epoch-land in vision made it less clear at first.

Unsurprisingly, vision researchers were far from hitting the same total FLOP count as language researchers until we scaled up both model and dataset size. ViT-22B is the closest, at 120T GFLOPs — squarely in Llama territory. This makes me believe our X is indeed total FLOP count.

However, the gains weren’t as groundbreaking as what we’d expect from an LLM of similar scale. A few observations:

Not all FLOPs are created equal. ViTs struggle to extract much benefit from longer sequence lengths. Larger images help, but with diminishing returns (see Figure 7 in [1]). The signal from higher resolutions largely overlaps with what’s already present at smaller sizes; the same goes for multi-epoch training, which vision is stuck in.

We need more sources of signal, and more tokens of it. This likely means video, or joint text-image data as already done in multimodal models.

Vision tokens are less information-dense. I first heard this from Kaiming in a talk: BERT’s optimal masking ratio is 15% [5], compared to MAE’s 75% [7] and ST-MAE’s 90% [6]. That’s a huge gap, suggesting vision tokens are largely redundant.

This idea resurfaced in Transfusion [14] and Chameleon [3], where training mixed-modality transformers was difficult due to loss imbalance across modalities. It became a focal point of their follow-up work on MoE-Sparsity [8].

Training objectives aren’t comparable. LLMs use next-token prediction (conditional soft classification), whereas ViTs use SSL and one-hot classification. It’s unclear how these objectives affect scaling, or if there’s a unifying factor.

We need smarter compute allocation. Given that model performance doesn’t differ heavily with FLOPs/token, either our tokens need to be more informative, or we need to invest compute more wisely. Adaptive compute is one direction — but I think we need to first figure out how to use more compute effectively before optimizing how we use it.

References

[1] Alabdulmohsin, I., Zhai, X., Kolesnikov, A., and Beyer, L. “Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design.” NeurIPS 2023.

[2] Carreira, J., Gokay, D., King, M., et al. “Scaling 4D Representations.” arXiv preprint 2024.

[3] Chameleon Team. “Chameleon: Mixed-Modal Early-Fusion Foundation Models.” arXiv preprint 2024.

[4] Dehghani, M., Djolonga, J., Mustafa, B., et al. “Scaling Vision Transformers to 22 Billion Parameters.” ICML 2023.

[5] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” NAACL 2019.

[6] Feichtenhofer, C., Fan, H., Li, Y., and He, K. “Masked Autoencoders As Spatiotemporal Learners.” NeurIPS 2022.

[7] He, K., Chen, X., Xie, S., Li, Y., Dollar, P., and Girshick, R. “Masked Autoencoders Are Scalable Vision Learners.” CVPR 2022.

[8] Kilian, M., Mkrtchyan, O., Zettlemoyer, L., et al. “Improving MoE Compute Efficiency by Composing Weight and Data Sparsity.” arXiv preprint 2026.

[9] Oquab, M., Darcet, T., Moutakanni, T., et al. “DINOv2: Learning Robust Visual Features without Supervision.” TMLR 2024.

[10] Simeoni, O., Vo, H. V., Seitzer, M., et al. “DINOv3.” arXiv preprint 2025.

[11] Touvron, H., Lavril, T., Izacard, G., et al. “LLaMA: Open and Efficient Foundation Language Models.” arXiv preprint 2023.

[12] Touvron, H., Martin, L., Stone, K., et al. “Llama 2: Open Foundation and Fine-Tuned Chat Models.” arXiv preprint 2023.

[13] Zhai, X., Mustafa, B., Kolesnikov, A., and Beyer, L. “Sigmoid Loss for Language Image Pre-Training.” ICCV 2023.

[14] Zhou, C., Yu, L., Babu, A., et al. “Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model.” ICLR 2025.