Model Details

[📃 Tech Report] [📂 GitHub]

Perception Encoder (PE) is a state-of-the-art encoder for image and video understanding trained via simple vision-language learning. It was introduced in "Perception Encoder: The best visual embeddings are not at the output of the network".

Model Developer: Meta

Model Overview: Perception Encoder (PE) is a family of large-scale vision encoder models with state-of-the-art performance on a wide variety of vision tasks. Using a robust contrastive pretraining recipe followed by finetuning on synthetically aligned videos, PE outperforms all existing models on classification and retrieval and also produces strong, general internal features that scale to downstream tasks. Alignment tuning then capitalizes on those general features, allowing large-scale contrastive pretraining to transfer to downstream tasks.
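As a rough illustration of how contrastively trained embeddings are typically used for classification and retrieval, the sketch below scores images against text prompts by cosine similarity followed by a softmax. It does not call PE itself: the embeddings are random stand-ins, and `zero_shot_probs`, `l2_normalize`, and the `logit_scale` value of 100 are hypothetical names and settings chosen for illustration. Only the embedding size of 1024 is taken from the CLIP Dim of the B and L models.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def zero_shot_probs(image_emb, text_emb, logit_scale=100.0):
    """Return per-image probabilities over text prompts.

    image_emb: (num_images, dim), text_emb: (num_prompts, dim).
    logit_scale is an illustrative temperature, not a value from the PE release.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = logit_scale * (img @ txt.T)
    # Subtract the row maximum for a numerically stable softmax.
    logits = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)


# Stand-in data: three "prompt" embeddings, and two "image" embeddings
# built as noisy copies of prompts 2 and 0.
rng = np.random.default_rng(0)
clip_dim = 1024
text_emb = rng.normal(size=(3, clip_dim))
image_emb = text_emb[[2, 0]] + 0.1 * rng.normal(size=(2, clip_dim))

probs = zero_shot_probs(image_emb, text_emb)
print(probs.argmax(axis=-1))  # index of the best-matching prompt per image
```

With real models, `image_emb` and `text_emb` would come from the vision and text towers; retrieval uses the same similarity matrix, ranked along the other axis.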

Scale	Tower	Params	Width	Depth	MLP	Heads	CLIP Dim	Resolution	Patch Size	Text Context Length
B	Vision	0.09B	768	12	3072	12	1024	224	16	32
B	Text	0.31B	1024	24	4096	16	1024	224	16	32
L	Vision	0.32B	1024	24	4096	16	1024	336	14	32
L	Text	0.31B	1024	24	4096	16	1024	336	14	32
G	Vision	1.88B	1536	50	8960	16	1280	448	14	72
G	Text	0.47B	1280	24	5120	20	1280	448	14	72
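The table's numbers can be sanity-checked with a little arithmetic. The sketch below derives the number of image patches per input from Resolution and Patch Size, and estimates vision tower parameters from Width, Depth, and MLP. The estimate assumes a standard Transformer block (four width-by-width attention projections plus a two-layer MLP) and ignores biases, normalization layers, embeddings, and pooling, so it is only expected to land near the reported Params values. `VisionTower`, `num_patches`, and `approx_block_params` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VisionTower:
    scale: str
    width: int
    depth: int
    mlp: int
    resolution: int
    patch_size: int
    reported_params_b: float  # "Params" column from the table, in billions


# Vision tower rows copied from the table above.
VISION_TOWERS = [
    VisionTower("B", 768, 12, 3072, 224, 16, 0.09),
    VisionTower("L", 1024, 24, 4096, 336, 14, 0.32),
    VisionTower("G", 1536, 50, 8960, 448, 14, 1.88),
]


def num_patches(t: VisionTower) -> int:
    """Patches per image for a square input split into square patches."""
    side = t.resolution // t.patch_size
    return side * side


def approx_block_params(t: VisionTower) -> int:
    """Rough weight count under the standard-block assumption stated above."""
    attention = 4 * t.width * t.width  # Q, K, V, and output projections
    mlp = 2 * t.width * t.mlp          # up and down projections
    return t.depth * (attention + mlp)


for t in VISION_TOWERS:
    estimate_b = approx_block_params(t) / 1e9
    print(f"{t.scale}: {num_patches(t)} patches, "
          f"~{estimate_b:.2f}B estimated vs {t.reported_params_b}B reported")
```

Under these assumptions the patch counts are 196, 576, and 1024 for B, L, and G, and the estimates fall slightly below the reported parameter counts, as expected given the omitted components.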

How to use

Model loading code

The model loading code is available at https://github.com/facebookresearch/perception_models. See the GitHub repo for more details.

Citation

If you find our code useful for your research, please consider citing:

@article{bolya2025perception-encoder,
  title={Perception Encoder: The best visual embeddings are not at the output of the network},
  author={Daniel Bolya and Po-Yao Huang and Peize Sun and Jang Hyun Cho and Andrea Madotto and Chen Wei and Tengyu Ma and Jiale Zhi and Jathushan Rajasegaran and Hanoona Rasheed and Junke Wang and Marco Monteiro and Hu Xu and Shiyu Dong and Nikhila Ravi and Daniel Li and Piotr Doll{\'a}r and Christoph Feichtenhofer},
  journal={arXiv},
  year={2025}
}
@article{cho2025perceptionlm,
  title={PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding},
  author={Jang Hyun Cho and Andrea Madotto and Effrosyni Mavroudi and Triantafyllos Afouras and Tushar Nagarajan and Muhammad Maaz and Yale Song and Tengyu Ma and Shuming Hu and Hanoona Rasheed and Peize Sun and Po-Yao Huang and Daniel Bolya and Suyog Jain and Miguel Martin and Huiyu Wang and Nikhila Ravi and Shashank Jain and Temmy Stark and Shane Moon and Babak Damavandi and Vivian Lee and Andrew Westbury and Salman Khan and Philipp Kr\"{a}henb\"{u}hl and Piotr Doll{\'a}r and Lorenzo Torresani and Kristen Grauman and Christoph Feichtenhofer},
  journal={arXiv},
  year={2025}
}

Paper for facebook/PE-Detection: "Perception Encoder: The best visual embeddings are not at the output of the network" (arXiv:2504.13181, published Apr 17, 2025)