Each dot is a latent for a different state. We pull every latent toward its training target. If the target is the model's own predicted latent (left), the objective is self-referential — gradients drag everything into a single point: representation collapse, variance → 0, the encoder learned nothing. If the target is the real observation encoded later (right), each latent is pinned to a distinct, reality-given anchor — structure survives, and the prediction-vs-reality discrepancy is a real gradient. That stop-gradient onto a reality target is the whole game (the sg[·] + EMA trick).