---
datasets:
- ILSVRC/imagenet-1k
library_name: transformers
license: cc-by-nc-4.0
---

# I-JEPA Model (Huge, fine-tuned on IN1K)

**I-JEPA** is a method for self-supervised learning. At a high level, I-JEPA predicts the representations of part of an image from the representations of other parts of the same image:

1. without relying on pre-specified invariances to hand-crafted data transformations, which tend to be biased toward particular downstream tasks,
2. and without having the model fill in pixel-level details, which tends to result in less semantically meaningful representations.

![ijepa](https://github.com/facebookresearch/ijepa/assets/7530871/dbad94ab-ac35-433b-8b4c-ca227886d311)


## How does it work?

Unlike generative methods, which use a pixel decoder, I-JEPA has a predictor that makes its predictions in latent space.
The predictor in I-JEPA can be seen as a primitive (and restricted) world model that is able to model spatial uncertainty in a static image from a partially observable context.
This world model is semantic in the sense that it predicts high-level information about unseen regions in the image, rather than pixel-level details.

To visualize what the predictor captures, we trained a stochastic decoder that maps the I-JEPA predicted representations back into pixel space as sketches.
The model correctly captures positional uncertainty and produces high-level object parts with the correct pose (e.g., dog’s head, wolf’s front legs).

![Illustrating how the predictor learns to model the semantics of the world](https://github.com/facebookresearch/ijepa/assets/7530871/9b66e461-fc8b-4b12-9f06-63ec4dfc1452)
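
The latent-space prediction described in this section can be illustrated with a toy sketch. It is not the I-JEPA training code: the function name `latent_l2_loss` and the two-dimensional embeddings are hypothetical, and the squared L2 distance is assumed here as the comparison between predicted and target representations. The point is only that the error is measured between embeddings, never between pixels.

```python
# Toy illustration only; names and values are hypothetical.
def latent_l2_loss(predicted, target):
    """Mean squared L2 distance between predicted and target patch embeddings.

    Both arguments are lists of per-patch embedding vectors (lists of floats).
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must contain the same number of patches")
    total = 0.0
    for pred_vec, tgt_vec in zip(predicted, target):
        # Squared L2 distance for one masked target patch.
        total += sum((p - t) ** 2 for p, t in zip(pred_vec, tgt_vec))
    return total / len(predicted)


# Made-up 2-D embeddings for two masked target patches.
target_repr = [[1.0, 0.0], [0.0, 1.0]]      # what a target encoder would produce
predicted_repr = [[0.5, 0.0], [0.0, 1.0]]   # what a predictor would output from the context
loss = latent_l2_loss(predicted_repr, target_repr)
```

Because the loss never touches pixel values, the predictor is free to ignore low-level texture and only has to get the high-level content of the unseen regions right.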

## Intended uses & limitations

I-JEPA can be used for image classification or feature extraction. This checkpoint specifically is intended for **Feature Extraction**.
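
A minimal feature-extraction sketch follows, assuming a `transformers` release that includes I-JEPA support, plus `torch` and `Pillow`. The `MODEL_ID` value is a placeholder and must be replaced with this repository's id. The helper names and the mean pooling over patch embeddings are illustrative choices, not something this card prescribes.

```python
# Placeholder: replace with the id of this model repository on the Hub.
MODEL_ID = "<this-repository-id>"


def load_ijepa(model_id):
    """Load the image processor and the model for a given checkpoint id."""
    from transformers import AutoImageProcessor, AutoModel

    processor = AutoImageProcessor.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()  # inference only
    return processor, model


def embed_image(image, processor, model):
    """Return one embedding per image by averaging the patch embeddings."""
    import torch

    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # last_hidden_state has shape (batch, num_patches, hidden_dim).
    # Mean pooling over patches is an assumption made for this sketch.
    return outputs.last_hidden_state.mean(dim=1)


def compare_images(path_a, path_b, model_id=MODEL_ID):
    """Cosine similarity between the embeddings of two local image files."""
    from PIL import Image
    from torch.nn.functional import cosine_similarity

    processor, model = load_ijepa(model_id)
    emb_a = embed_image(Image.open(path_a).convert("RGB"), processor, model)
    emb_b = embed_image(Image.open(path_b).convert("RGB"), processor, model)
    return cosine_similarity(emb_a, emb_b).item()
```

Calling `compare_images("a.jpg", "b.jpg")` with two local files (hypothetical paths) returns a single similarity score. For retrieval or linear probing, `embed_image` can be used on its own to produce the feature vectors.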


### BibTeX entry and citation info
If you use I-JEPA or this code in your work, please cite:
```bibtex
@article{assran2023self,
  title={Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture},
  author={Assran, Mahmoud and Duval, Quentin and Misra, Ishan and Bojanowski, Piotr and Vincent, Pascal and Rabbat, Michael and LeCun, Yann and Ballas, Nicolas},
  journal={arXiv preprint arXiv:2301.08243},
  year={2023}
}
```