---
datasets:
- timm/imagenet-22k-wds
library_name: transformers
license: cc-by-nc-4.0
---

# I-JEPA Model (Huge, fine-tuned on IN22K)

**I-JEPA** is a method for self-supervised learning. At a high level, I-JEPA predicts the representations of part of an image from the representations of other parts of the same image:

1. without relying on pre-specified invariances to hand-crafted data transformations, which tend to be biased for particular downstream tasks,
2. and without having the model fill in pixel-level details, which tends to result in learning less semantically meaningful representations.

![ijepa](https://github.com/facebookresearch/ijepa/assets/7530871/dbad94ab-ac35-433b-8b4c-ca227886d311)

## How does it work?

As opposed to generative methods that have a pixel decoder, I-JEPA has a predictor that makes predictions in latent space. The predictor in I-JEPA can be seen as a primitive (and restricted) world-model that is able to model spatial uncertainty in a static image from a partially observable context. This world model is semantic in the sense that it predicts high-level information about unseen regions in the image, rather than pixel-level details.

We trained a stochastic decoder that maps the I-JEPA predicted representations back to pixel space as sketches. The model correctly captures positional uncertainty and produces high-level object parts with the correct pose (e.g., dog’s head, wolf’s front legs).

![Illustrating how the predictor learns to model the semantics of the world](https://github.com/facebookresearch/ijepa/assets/7530871/9b66e461-fc8b-4b12-9f06-63ec4dfc1452)

## Intended uses & limitations

I-JEPA can be used for image classification or feature extraction. This checkpoint specifically is intended for **Feature Extraction**.
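
As a rough illustration of what feature extraction usually leads to downstream, the sketch below mean-pools per-patch embeddings into one vector per image and compares images by cosine similarity. It is a minimal sketch, not this model's API: the helper names `mean_pool` and `cosine_similarity` and the tiny hand-written arrays are hypothetical stand-ins. With `transformers`, the per-patch embeddings would presumably come from the model output's `last_hidden_state`, which is an assumption here.

```python
import math


def mean_pool(patch_embeddings):
    """Average per-patch vectors (num_patches x dim) into one image-level vector."""
    num_patches = len(patch_embeddings)
    dim = len(patch_embeddings[0])
    return [
        sum(patch[d] for patch in patch_embeddings) / num_patches
        for d in range(dim)
    ]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical stand-ins for model outputs: 2 patches x 3 dimensions per image.
# Real I-JEPA features have many more patches and dimensions.
image_a = [[1.0, 0.0, 2.0], [3.0, 0.0, 0.0]]
image_b = [[4.0, 0.0, 2.0], [4.0, 0.0, 2.0]]
image_c = [[0.0, 5.0, 0.0], [0.0, 3.0, 0.0]]

feat_a = mean_pool(image_a)
feat_b = mean_pool(image_b)
feat_c = mean_pool(image_c)

# Features pointing in the same direction score near 1, orthogonal ones score 0.
print("a vs b:", cosine_similarity(feat_a, feat_b))
print("a vs c:", cosine_similarity(feat_a, feat_c))
```

Mean pooling is only one possible way to reduce patch embeddings to an image-level feature. It is used here because it is simple and needs no extra parameters.
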
## How to use

Here is how to use this model to classify an image of the COCO 2017 dataset into one of the 1,000 ImageNet classes:

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, IJepaForImageClassification

# Download a sample image from the COCO 2017 validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the image processor and the model from the Hub
model_id = "jmtzt/ijepa_vith14_22k"
processor = AutoProcessor.from_pretrained(model_id)
model = IJepaForImageClassification.from_pretrained(model_id)

# Preprocess the image and run inference without tracking gradients
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits

# model predicts one of the 1000 ImageNet classes
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
```

### BibTeX entry and citation info

If you use I-JEPA or this code in your work, please cite:

```
@article{assran2023self,
  title={Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture},
  author={Assran, Mahmoud and Duval, Quentin and Misra, Ishan and Bojanowski, Piotr and Vincent, Pascal and Rabbat, Michael and LeCun, Yann and Ballas, Nicolas},
  journal={arXiv preprint arXiv:2301.08243},
  year={2023}
}
```