@Kseniase on Hugging Face: "12 Types of JEPA JEPA, or Joint Embedding Predictive Architecture, is an…"

12 Types of JEPA

JEPA, or Joint Embedding Predictive Architecture, is an approach to building AI models introduced by Yann LeCun. Unlike generative models such as autoregressive transformers, it predicts the representation of a missing or future part of the input rather than the next token or raw pixels. Because the prediction target is an embedding, the model is pushed toward capturing higher-level structure instead of low-level pattern matching. The goal is to teach AI to reason more abstractly.
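To make "predicting a representation" concrete, here is a minimal sketch of the idea in plain Python. It is an illustration under assumptions, not code from any of the papers: the linear `encode` function, the identity `predictor`, the squared-error loss, and the `ema_update` helper are all hypothetical stand-ins. The moving-average target encoder follows the style used in I-JEPA.

```python
import random

def encode(weights, x):
    """Hypothetical linear encoder: maps an input vector to an embedding."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def jepa_loss(context_enc, target_enc, predictor, x_context, x_target):
    """Squared error between the predicted and the actual target embedding."""
    s_context = encode(context_enc, x_context)
    s_target = encode(target_enc, x_target)   # treated as a fixed target
    s_pred = encode(predictor, s_context)     # prediction happens in embedding space
    return sum((p - t) ** 2 for p, t in zip(s_pred, s_target))

def ema_update(target_enc, context_enc, momentum=0.99):
    """Move the target weights slowly toward the context weights."""
    return [[momentum * t + (1 - momentum) * c for t, c in zip(t_row, c_row)]
            for t_row, c_row in zip(target_enc, context_enc)]

rng = random.Random(0)
dim_in, dim_emb = 6, 3
context_enc = [[rng.uniform(-1, 1) for _ in range(dim_in)] for _ in range(dim_emb)]
target_enc = [row[:] for row in context_enc]  # starts as a copy of the context encoder
predictor = [[1.0 if i == j else 0.0 for j in range(dim_emb)] for i in range(dim_emb)]

x_context = [rng.uniform(-1, 1) for _ in range(dim_in)]  # visible part of the input
x_target = [rng.uniform(-1, 1) for _ in range(dim_in)]   # hidden part of the input
loss = jepa_loss(context_enc, target_enc, predictor, x_context, x_target)
print(f"latent prediction loss: {loss:.4f}")
```

The point of the sketch is that the loss never touches raw inputs: both sides of the comparison are embeddings, so nothing forces the model to reproduce every pixel or token.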

Here are 12 types of JEPA you should know about:

1. I-JEPA -> Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture (2301.08243)
A non-generative, self-supervised learning framework for images. It masks parts of an image and predicts the representations of the masked regions from the visible context, rather than reconstructing pixels.

2. MC-JEPA -> MC-JEPA: A Joint-Embedding Predictive Architecture for Self-Supervised Learning of Motion and Content Features (2307.12698)
Jointly learns the dynamic elements (motion) and static details (content) of video data using a shared encoder.

3. V-JEPA -> Revisiting Feature Prediction for Learning Visual Representations from Video (2404.08471)
Presents vision models trained solely with a feature prediction objective on video, without pretrained image encoders, text, negative sampling, or reconstruction.

4. UI-JEPA -> UI-JEPA: Towards Active Perception of User Intent through Onscreen User Activity (2409.04081)
Masks unlabeled UI sequences to learn abstract embeddings, then adds a fine-tuned LLM decoder for intent prediction.

5. Audio-based JEPA (A-JEPA) -> A-JEPA: Joint-Embedding Predictive Architecture Can Listen (2311.15830)
Masks spectrogram patches with a curriculum, encodes them, and predicts hidden representations.

6. S-JEPA -> S-JEPA: towards seamless cross-dataset transfer through dynamic spatial attention (2403.11772)
Signal-JEPA applies JEPA to EEG analysis. It adds a spatial block-masking scheme and three lightweight downstream classifiers.

7. TI-JEPA -> TI-JEPA: An Innovative Energy-based Joint Embedding Strategy for Text-Image Multimodal Systems (2503.06380)
Text-Image JEPA uses self-supervised, energy-based pre-training to map text and images into a shared embedding space, improving cross-modal transfer to downstream tasks
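Several entries above rely on masking patches and predicting what is hidden. A rough sketch of that step on a patch grid is shown below. The function names, grid size, and block sizes are hypothetical, and treating every non-target patch as context is a simplification rather than the sampling scheme of any specific paper.

```python
import random

def sample_block(grid_h, grid_w, block_h, block_w, rng):
    """Return the patch indices covered by one random rectangular block."""
    top = rng.randrange(grid_h - block_h + 1)
    left = rng.randrange(grid_w - block_w + 1)
    return {r * grid_w + c
            for r in range(top, top + block_h)
            for c in range(left, left + block_w)}

def split_context_targets(grid_h, grid_w, num_targets, block_h, block_w, seed=0):
    """Sample target blocks, then use every remaining patch as context."""
    rng = random.Random(seed)
    targets = [sample_block(grid_h, grid_w, block_h, block_w, rng)
               for _ in range(num_targets)]
    hidden = set().union(*targets)                    # target blocks may overlap
    context = set(range(grid_h * grid_w)) - hidden    # context never sees target patches
    return context, targets

# Assumed setup: an 8x8 grid of patches with three 2x3 target blocks.
context, targets = split_context_targets(8, 8, num_targets=3, block_h=2, block_w=3)
print(len(context), [sorted(t) for t in targets])
```

The context encoder would only see the `context` patches, and the predictor would be asked for the embeddings of each block in `targets`.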

Find more types below 👇

Also, explore the basics of JEPA in our article: https://www.turingpost.com/p/jepa

If you liked it, subscribe to the Turing Post: https://www.turingpost.com/subscribe

8. T-JEPA -> https://huggingface.co/papers/2410.05016
This one is for tabular (structured) data. By masking one subset of a table’s features and predicting their latent representation from another subset, it learns rich, label-agnostic embeddings.

9. ACT-JEPA -> https://huggingface.co/papers/2501.14622
Merges imitation learning and self-supervised learning to learn policy embeddings without heavy reliance on expert data. It predicts chunked actions and abstract observations in latent space, which helps filter noise, model dynamics, and reduce compounding errors.

10. Brain-JEPA -> https://huggingface.co/papers/2409.19407
Applies JEPA to a brain dynamics foundation model for demographic, disease, and trait prediction.

11. 3D-JEPA -> https://huggingface.co/papers/2409.15803
JEPA for 3D representation learning. It samples one rich context block and several target blocks, then predicts each target’s embedding from the context.

12. Point-JEPA -> https://huggingface.co/papers/2404.16432
Brings joint-embedding predictive learning to point clouds. A lightweight sequencer orders patch embeddings, which lets the model choose context and target patches and reuse distance calculations for speed.
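The ordering idea in the Point-JEPA entry can be illustrated with a small sketch. This is an assumption about how such a sequencer might work (a greedy nearest-neighbor walk over patch centers), not the paper's implementation, and `greedy_sequence` is a hypothetical name. It also recomputes distances on every step, so it does not show the reuse of distance calculations.

```python
import math

def greedy_sequence(centers, start=0):
    """Order patch centers so that consecutive entries are spatially close."""
    remaining = set(range(len(centers)))
    order = [start]
    remaining.remove(start)
    while remaining:
        last = centers[order[-1]]
        # Pick the closest unvisited center; break ties by index for determinism.
        nxt = min(remaining, key=lambda i: (math.dist(last, centers[i]), i))
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Toy patch centers along the x axis, deliberately shuffled.
centers = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (1.0, 0.0, 0.0),
           (4.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
order = greedy_sequence(centers)
print(order)  # [0, 2, 4, 3, 1]

# Once ordered, a contiguous index range is also a spatially compact block.
target_block = order[1:3]
context_block = order[3:]
```

After sequencing, picking context and target patches reduces to slicing index ranges, which is cheaper than searching for neighbors each time a block is sampled.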
