Musings

I used to be very up to date on self-supervised learning, but I fell out of it as the field slowly died down in favor of VLMs and whatnot after SigLIP/DINO/V-JEPA became the dominant paradigms. This means I haven’t seriously read any SSL papers since 2023. That doesn’t mean I’ve been living under a rock, though. I’m still well aware of Yann LeCun’s anti-pixel-prediction tirade, and in that time nothing has come out that convinced me we could move away from pixel-level supervision. It’s simply such a strong prior to impose for self-supervision: you get multi-view consistency and true spatial grounding at the slight cost of having to model high-frequency pixel details. ...