Ongoing

LLM vs JEPA

Creator · 2026 · 5 min read

A five-part exploration of Yann LeCun's argument against generative video prediction. From-scratch JEPA on a synthetic bouncing ball, action-conditioned planning, and a DINOv2/v3 hover demo on a real image, all running on an M3 MacBook in under five minutes per part.

Overview

Companion repo to a five-part blog series on danieljohnmorris.com. Each part maps to one of LeCun's specific public claims: pixel prediction goes blurry on ambiguous futures, joint-embedding architectures need explicit decorrelation to avoid collapse, real production JEPAs scale the same idea to natural images, and action-conditioned latent world models give agents something to plan with. All experiments run on M-series MPS, no CUDA, with models between 700K and 1.2M parameters.

Problem

LeCun has argued for years that LLMs are useless for everything except language and that the dominant generative-prediction recipe will not deliver agents that can plan in the physical world. The argument is widely cited but rarely tested by independent practitioners at a scale small enough to run on a laptop and read through in one sitting.

Constraints

  • Must run on a laptop with no discrete GPU (Apple Silicon MPS only)
  • Each experiment must produce visible artifacts a reader can scrub through
  • All artifacts pre-rendered and committed, so a clone gives you the visuals immediately
  • Each part stays under ~1.2M parameters and trains in under five minutes

Approach

Built a synthetic bouncing-ball dataset where single-frame inputs carry no velocity information. Trained a pixel-space MSE next-frame predictor and watched it converge to a blurry average of two ball positions. Replaced the pixel target with an embedding target, used VICReg's variance-invariance-covariance loss to prevent representation collapse, then ported the November 2025 LeJEPA paper's SIGReg loss to compare. Visualised the trained encoder by training a separate decoder on its frozen embeddings. The repo is structured as one folder per blog post, with shared utilities under shared/.
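A minimal sketch of that pixel-space baseline, assuming a tiny conv encoder-decoder over 64x64 grayscale frames; the layer sizes and names are illustrative, not the repo's actual code:

```python
import torch
import torch.nn as nn

class PixelPredictor(nn.Module):
    """Tiny conv net: 64x64 grayscale frame_t -> predicted frame_{t+1}."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),            # 64 -> 32
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),           # 32 -> 16
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 32
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),              # 32 -> 64
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

device = "mps" if torch.backends.mps.is_available() else "cpu"
model = PixelPredictor().to(device)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(frame_t, frame_t1):
    """One MSE step in pixel space. When frame_t is compatible with two
    futures, the loss-minimising output is their average -- the blur."""
    pred = model(frame_t.to(device))
    loss = nn.functional.mse_loss(pred, frame_t1.to(device))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```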

Key Decisions

Single-frame input as the source of ambiguity, not stochastic dynamics

Bouncing physics stays deterministic. The ambiguity comes from the input format: a single 64x64 frame contains no velocity, and training pairs from different episodes show the same position moving both ways. MSE-optimal next-frame prediction is forced to average the two possible futures. This isolates the LeCun argument cleanly without dataset hacks.
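A minimal sketch of how that ambiguity can be constructed (the repo's generator may differ in physics, ball radius, and rendering): two episodes share the same frame at time t but diverge at t+1.

```python
import numpy as np

def render(pos, size=64, radius=3):
    """Rasterise a ball centre (x, y) into a size x size grayscale frame."""
    yy, xx = np.mgrid[0:size, 0:size]
    return ((xx - pos[0]) ** 2 + (yy - pos[1]) ** 2 <= radius ** 2).astype(np.float32)

def episode(start, vel, steps=20, size=64):
    """Deterministic bouncing: reflect off the walls, no noise anywhere."""
    pos, v, frames = np.array(start, float), np.array(vel, float), []
    for _ in range(steps):
        frames.append(render(pos, size))
        pos += v
        for i in range(2):
            if pos[i] < 0 or pos[i] > size - 1:
                v[i] *= -1
                pos[i] = np.clip(pos[i], 0, size - 1)
    return frames

# Two episodes share the same position at t but move in opposite directions:
# the single-frame input is identical while the target frame differs.
a = episode(start=(32, 32), vel=(+2, 0))
b = episode(start=(32, 32), vel=(-2, 0))
assert np.allclose(a[0], b[0]) and not np.allclose(a[1], b[1])
```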

VICReg first, LeJEPA second

VICReg's three-term loss (similarity, variance, covariance) is the most-cited JEPA training recipe and the easiest to debug when collapse happens. LeJEPA replaces the variance and covariance terms with a single SIGReg term that pushes the embedding distribution toward an isotropic Gaussian. Building VICReg first gives a reference run for the LeJEPA comparison.
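For reference, a compact version of the three-term VICReg objective, following the loss weights and unit-variance hinge from the original paper; variable names are illustrative:

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """VICReg: invariance (MSE) + variance hinge + covariance penalty.

    z_a, z_b: embeddings from the two branches, shape (batch, dim).
    """
    # Invariance: the predicted embedding should match the target embedding.
    sim = F.mse_loss(z_a, z_b)

    # Variance: keep each dimension's std above 1 (the anti-collapse term).
    std_a = torch.sqrt(z_a.var(dim=0) + eps)
    std_b = torch.sqrt(z_b.var(dim=0) + eps)
    var = F.relu(1.0 - std_a).mean() + F.relu(1.0 - std_b).mean()

    # Covariance: drive off-diagonal covariance to zero (decorrelate dims).
    def off_diag_cov(z):
        z = z - z.mean(dim=0)
        cov = (z.T @ z) / (z.shape[0] - 1)
        off = cov - torch.diag(torch.diag(cov))
        return off.pow(2).sum() / z.shape[1]

    cov = off_diag_cov(z_a) + off_diag_cov(z_b)
    return sim_w * sim + var_w * var + cov_w * cov
```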

DINOv3 for the production encoder hover demo, not I-JEPA or V-JEPA 2

The hover-similarity demo needs an image-native encoder with clean patch features. DINOv3 is the model Welch Labs use in the source video. LeCun himself groups DINO with JEPA as the same joint-embedding family in the talk. V-JEPA 2 is video-native and awkward for stills.
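A minimal sketch of the patch-similarity computation behind the hover demo, shown here with the DINOv2 checkpoint that ships with transformers (the repo's DINOv3 setup may differ; the model ID, crop size, and token layout are assumptions):

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# facebook/dinov2-small: ViT-S/14; patch tokens follow the CLS token.
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-small")
model = AutoModel.from_pretrained("facebook/dinov2-small").eval()

image = Image.open("scene.jpg").convert("RGB")       # any still image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    tokens = model(**inputs).last_hidden_state        # (1, 1 + n_patches, dim)

patches = torch.nn.functional.normalize(tokens[0, 1:], dim=-1)
grid = int(patches.shape[0] ** 0.5)                   # 16 x 16 for a 224px crop

def similarity_map(row, col):
    """Cosine similarity of every patch to the patch under the cursor."""
    query = patches[row * grid + col]
    return (patches @ query).reshape(grid, grid)       # values in [-1, 1]

heat = similarity_map(8, 8)   # hover over the centre patch
```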

Per-post folder layout (part1_pixel/, part2_jepa/, etc.) instead of flat

Each blog post maps to one folder. A reader can clone, cd into the part they are reading, and run that post's scripts without scanning the rest. Diverges from the flat tiny-poe-llm layout but tiny-poe-llm is one post; this is a series.

Tech Stack

Python · PyTorch · Apple MPS · transformers · DINOv3 · imageio · NumPy

Result & Impact

  • 613K trainable parameters - Part 2 JEPA total
  • Under 5 min per part - training time on an M3 MacBook Pro
  • Under 150 lines per part - core training code

Reproduces LeCun's pixel-blur argument visibly, in code small enough to read in one sitting. All output GIFs and images are committed to the repo, so cloning gives the visuals without running any training. The pixel-vs-JEPA side-by-side renders the smear and the crisp prediction in a single frame strip.
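A rough sketch of how such a strip can be assembled (array names and panel layout are illustrative, not the repo's rendering code):

```python
import numpy as np
import imageio.v2 as imageio

def frame_strip(targets, pixel_preds, jepa_preds, gap=4):
    """Stitch [ground truth | pixel prediction | JEPA decode] panels per timestep.

    Each argument is a list of (64, 64) float arrays in [0, 1].
    """
    frames = []
    for tgt, pix, jep in zip(targets, pixel_preds, jepa_preds):
        spacer = np.ones((tgt.shape[0], gap), dtype=np.float32)
        panel = np.concatenate([tgt, spacer, pix, spacer, jep], axis=1)
        frames.append((np.clip(panel, 0, 1) * 255).astype(np.uint8))
    return frames

# Usage, given the three frame lists:
# imageio.mimsave("pixel_vs_jepa.gif", frame_strip(targets, pixel_preds, jepa_preds))
```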

Learnings

  • BYOL-style EMA plus stop-gradient is not enough on its own to prevent collapse on toy tasks. Some explicit decorrelation pressure is needed unless batchnorm is doing implicit work (a minimal collapse check is sketched after this list).
  • VICReg's three-loss balance is sensitive to the off-diagonal covariance weight. LeJEPA's single-knob design is genuinely simpler, not just rebranded.
  • The argument LeCun makes about ambiguity can be demonstrated without stochastic dynamics. Hiding velocity behind a single-frame input is enough.
  • Tiny encoders (550K params) reproduce the qualitative behaviour of much larger ones. The principle scales; the model does not have to.
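A minimal sketch of the kind of collapse check behind the first point above; the encoder interface and thresholds are hypothetical:

```python
import torch

@torch.no_grad()
def collapse_report(encoder, frames, dead_std=1e-2):
    """Diagnostics for representation collapse on a batch of frames.

    Near-zero per-dimension std, or pairwise cosine similarity near 1,
    means the encoder is mapping everything to (almost) the same point.
    """
    z = encoder(frames)                               # (batch, dim)
    per_dim_std = z.std(dim=0)
    zn = torch.nn.functional.normalize(z, dim=-1)
    cos = zn @ zn.T
    off_diag = cos[~torch.eye(len(z), dtype=torch.bool, device=z.device)]
    return {
        "mean_per_dim_std": per_dim_std.mean().item(),
        "frac_dead_dims": (per_dim_std < dead_std).float().mean().item(),
        "mean_pairwise_cos": off_diag.mean().item(),
    }
```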

The repo is at github.com/danieljohnmorris/tiny-bouncing-jepa, with one folder per post in the series - clone, cd into the part you’re reading, run that post’s scripts.

Series

  1. Part 1: Yann LeCun’s Bet Against LLMs - the LeCun argument and the papers behind it
  2. Part 2: Why Pixel Prediction Goes Blurry - bouncing-ball pixel baseline
  3. Part 3: Predict Embeddings, Not Pixels - JEPA with VICReg, then LeJEPA’s SIGReg
  4. Part 4: From Representations to World Models - toy vs production, DINOv2/v3 patch-similarity hover demo
  5. Part 5: Planning in Latent Space - action conditioning and a small MPC planner