PerceptionLM

Overview

The PerceptionLM model was proposed in PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding by Jang Hyun Cho et al. It is a fully open, reproducible model for transparent research in image and video understanding. PLM pairs a vision encoder with a small-scale (<8B parameters) LLM decoder.

The abstract from the paper is the following:

Vision-language models are integral to computer vision research, yet many high-performing models remain closed-source, obscuring their data, design and training recipe. The research community has responded by using distillation from black-box models to label training data, achieving strong benchmark results, at the cost of measurable scientific progress. However, without knowing the details of the teacher model and its data sources, scientific progress remains difficult to measure. In this paper, we study building a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding. We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps, particularly in detailed video understanding. To bridge these gaps, we release 2.8M human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions. Additionally, we introduce PLM–VideoBench, a suite for evaluating challenging video understanding tasks focusing on the ability to reason about “what”, “where”, “when”, and “how” of a video. We make our work fully reproducible by providing data, training recipes, code & models.

This model was contributed by shumingh. The original code can be found here.

PerceptionLMConfig

class transformers.PerceptionLMConfig

( vision_config = None text_config = None vision_use_cls_token = True projector_pooling_ratio = 1 image_token_id = 128002 video_token_id = 128003 **kwargs )

Parameters

  • vision_config (Union[TimmWrapperConfig, dict], optional, defaults to TimmWrapperConfig()) — The config object or dictionary of the vision backbone.
  • text_config (Union[PretrainedConfig, dict], optional, defaults to LlamaConfig()) — The config object or dictionary of the text backbone.
  • vision_use_cls_token (bool, optional, defaults to True) — Whether the vision backbone uses a CLS token. If it does, the CLS token embedding is removed from the vision output.
  • projector_pooling_ratio (int, optional, defaults to 1) — The pooling ratio used in the multimodal projector.
  • image_token_id (int, optional, defaults to 128002) — The image token index to encode the image prompt.
  • video_token_id (int, optional, defaults to 128003) — The video token index to encode the video prompt.

This is the configuration class to store the configuration of a PerceptionLMForConditionalGeneration. It is used to instantiate a PerceptionLM model according to the specified arguments, defining the model architecture.

Example model: facebook/Perception-LM-1B (see the generation example at the end of this page).

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
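
A minimal sketch of the usual configuration workflow (the default vision and text sub-configurations are assumed to be sufficient for a randomly initialized model):

>>> from transformers import PerceptionLMConfig, PerceptionLMForConditionalGeneration

>>> # Initialize a PerceptionLM configuration with the default vision and text backbones
>>> configuration = PerceptionLMConfig()

>>> # Initialize a model (with random weights) from that configuration
>>> model = PerceptionLMForConditionalGeneration(configuration)

>>> # Access the model configuration
>>> configuration = model.config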

PerceptionLMProcessor

class transformers.PerceptionLMProcessor

( video_processor = None image_processor = None tokenizer = None patch_size = None chat_template = None pooling_ratio = 2 **kwargs )

Parameters

  • video_processor (PerceptionLMVideoProcessor, optional) — The video processor to process video inputs.
  • image_processor (PerceptionLMImageProcessorFast, optional) — The image processor to process image inputs.
  • tokenizer (LlamaTokenizerFast or similar, optional) — The tokenizer to process text inputs.
  • patch_size (int, optional) — Patch size from the vision tower.
  • chat_template (str, optional) — A Jinja template which will be used to convert lists of messages in a chat into a tokenizable string.
  • pooling_ratio (int, optional, defaults to 2) — Pooling ratio for vision tokens. If not 1, 2D adaptive pooling is applied over projected vision tokens.

Constructs a PerceptionLM processor which wraps a PerceptionLM image processor, a PerceptionLM video processor, and a tokenizer into a single processor.

PerceptionLMProcessor offers all the functionalities of PerceptionLMImageProcessorFast, PerceptionLMVideoProcessor, and the tokenizer (e.g. LlamaTokenizerFast). See __call__() and decode() for more information.
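
A minimal sketch of typical processor usage, mirroring the chat-template call from the generation example further down this page (the checkpoint name is taken from that example):

>>> from transformers import AutoProcessor

>>> processor = AutoProcessor.from_pretrained("facebook/Perception-LM-1B")

>>> messages = [
...     {
...         "role": "user", "content": [
...             {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
...             {"type": "text", "text": "Describe the image."},
...         ]
...     },
... ]

>>> # Tokenizes the chat and prepares pixel values in a single call
>>> inputs = processor.apply_chat_template(
...     messages, tokenize=True, return_dict=True, return_tensors="pt", add_generation_prompt=True
... )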

batch_decode

( *args **kwargs )

This method forwards all its arguments to the underlying tokenizer’s batch_decode(). Please refer to the docstring of that method for more information.

decode

( *args **kwargs )

This method forwards all its arguments to the underlying tokenizer’s decode(). Please refer to the docstring of that method for more information.

PerceptionLMImageProcessorFast

class transformers.PerceptionLMImageProcessorFast

( **kwargs: typing_extensions.Unpack[transformers.models.perception_lm.image_processing_perception_lm_fast.PerceptionLMFastImageProcessorKwargs] )

Constructs a fast PerceptionLM image processor.
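
A short usage sketch, assuming the fast image processor can be loaded from the checkpoint used elsewhere on this page and that it returns pixel_values shaped as documented for the model below:

>>> import requests
>>> from PIL import Image
>>> from transformers import PerceptionLMImageProcessorFast

>>> image_processor = PerceptionLMImageProcessorFast.from_pretrained("facebook/Perception-LM-1B")

>>> url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> outputs = image_processor(images=image, return_tensors="pt")
>>> outputs["pixel_values"].shape  # expected: (batch_size, num_tiles, channels, height, width)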

PerceptionLMVideoProcessor

class transformers.PerceptionLMVideoProcessor

( **kwargs: typing_extensions.Unpack[transformers.models.perception_lm.video_processing_perception_lm.PerceptionLMFastVideoProcessorInitKwargs] )
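
A minimal usage sketch, assuming the video processor accepts a list of frames under the videos argument and returns pixel_values_videos shaped as documented for the model below:

>>> import numpy as np
>>> from transformers import PerceptionLMVideoProcessor

>>> video_processor = PerceptionLMVideoProcessor.from_pretrained("facebook/Perception-LM-1B")

>>> # Dummy 8-frame video, each frame a (height, width, channels) uint8 array
>>> video = [np.random.randint(0, 256, (360, 640, 3), dtype=np.uint8) for _ in range(8)]

>>> outputs = video_processor(videos=video, return_tensors="pt")
>>> outputs["pixel_values_videos"].shape  # expected: (batch_size, num_frames, channels, height, width)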

PerceptionLMModel

class transformers.PerceptionLMModel

( config: PerceptionLMConfig )

Parameters

  • config (PerceptionLMConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare PerceptionLM model, outputting raw hidden states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( input_ids: typing.Optional[torch.LongTensor] = None pixel_values: typing.Optional[torch.FloatTensor] = None pixel_values_videos: typing.Optional[torch.FloatTensor] = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[list[torch.FloatTensor]] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None logits_to_keep: typing.Union[int, torch.Tensor] = 0 **lm_kwargs ) PerceptionLMModelOutputWithPast or tuple

Parameters

  • input_ids (torch.LongTensor, optional) — Indices of input sequence tokens in the vocabulary.
  • pixel_values (torch.FloatTensor, optional) — Input image tensor of shape (batch_size, num_tiles, channels, height, width).
  • pixel_values_videos (torch.FloatTensor, optional) — Input video tensor of shape (batch_size, num_frames, channels, height, width).
  • attention_mask (torch.Tensor, optional) — Mask to avoid performing attention on padding token indices.
  • position_ids (torch.LongTensor, optional) — Indices of positions of each input sequence token in the position embeddings.
  • past_key_values (list[torch.FloatTensor], optional) — Precomputed key and value hidden states for fast autoregressive generation.
  • inputs_embeds (torch.FloatTensor, optional) — Optionally, instead of passing input_ids, you can choose to directly pass an embedded representation.
  • use_cache (bool, optional) — Whether or not to use past key values to speed up decoding.
  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers.
  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers.
  • cache_position (torch.LongTensor, optional) — Position indices for caching.
  • logits_to_keep (int or torch.Tensor, optional, defaults to 0) — If an int, compute logits only for the last logits_to_keep tokens (0 computes logits for all input tokens); if a tensor, compute logits for the given token positions.
  • **lm_kwargs — Additional keyword arguments for the language model.

Returns

PerceptionLMModelOutputWithPast or tuple

Model outputs as a PerceptionLMModelOutputWithPast if return_dict=True, otherwise a tuple.

The PerceptionLMModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
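
A hedged sketch of inspecting the raw hidden states (loading the base model directly from the full checkpoint and the last_hidden_state attribute name are assumptions):

>>> import torch
>>> from transformers import AutoProcessor, PerceptionLMModel

>>> model = PerceptionLMModel.from_pretrained("facebook/Perception-LM-1B")
>>> processor = AutoProcessor.from_pretrained("facebook/Perception-LM-1B")

>>> messages = [
...     {
...         "role": "user", "content": [
...             {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
...             {"type": "text", "text": "Where is the cat standing?"},
...         ]
...     },
... ]
>>> inputs = processor.apply_chat_template(
...     messages, tokenize=True, return_dict=True, return_tensors="pt", add_generation_prompt=True
... )

>>> with torch.no_grad():
...     outputs = model(**inputs)
>>> outputs.last_hidden_state.shape  # (batch_size, sequence_length, hidden_size)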

get_image_features

( pixel_values: FloatTensor **kwargs ) image_features (torch.Tensor)

Parameters

  • pixel_values (torch.FloatTensor of shape (batch_size, num_tiles, channels, height, width)) — The tensors corresponding to the input images.

Returns

image_features (torch.Tensor)

Image feature tensor of shape (num_tiles, num_patches, embed_dim).

Obtains the image’s last hidden states from the vision tower and applies the multimodal projection.
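
Continuing the base-model sketch above, projected image features can be obtained from the same processor inputs (the shape comment restates the return description above):

>>> image_features = model.get_image_features(pixel_values=inputs["pixel_values"])
>>> image_features.shape  # expected: (num_tiles, num_patches, embed_dim)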

PerceptionLMForConditionalGeneration

class transformers.PerceptionLMForConditionalGeneration

( config: PerceptionLMConfig )

Parameters

  • config (PerceptionLMConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The PerceptionLM model for token generation conditioned on other modalities (e.g. image-text-to-text generation).

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

( input_ids: typing.Optional[torch.LongTensor] = None pixel_values: typing.Optional[torch.FloatTensor] = None pixel_values_videos: typing.Optional[torch.FloatTensor] = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[list[torch.FloatTensor]] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None logits_to_keep: typing.Union[int, torch.Tensor] = 0 **lm_kwargs ) PerceptionLMCausalLMOutputWithPast or tuple

Parameters

  • input_ids (torch.LongTensor, optional) — Indices of input sequence tokens in the vocabulary.
  • pixel_values (torch.FloatTensor, optional) — Input image tensor of shape (batch_size, num_tiles, channels, height, width).
  • pixel_values_videos (torch.FloatTensor, optional) — Input video tensor of shape (batch_size, num_frames, channels, height, width).
  • attention_mask (torch.Tensor, optional) — Mask to avoid performing attention on padding token indices.
  • position_ids (torch.LongTensor, optional) — Indices of positions of each input sequence token in the position embeddings.
  • past_key_values (list[torch.FloatTensor], optional) — Precomputed key and value hidden states for fast autoregressive generation.
  • inputs_embeds (torch.FloatTensor, optional) — Optionally, instead of passing input_ids, you can choose to directly pass an embedded representation.
  • labels (torch.LongTensor, optional) — Labels for computing the language modeling loss.
  • use_cache (bool, optional) — Whether or not to use past key values to speed up decoding.
  • output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers.
  • output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers.
  • cache_position (torch.LongTensor, optional) — Position indices for caching.
  • logits_to_keep (int or torch.Tensor, optional, defaults to 0) — If an int, compute logits only for the last logits_to_keep tokens (0 computes logits for all input tokens); if a tensor, compute logits for the given token positions.
  • **lm_kwargs — Additional keyword arguments for the language model.

Returns

PerceptionLMCausalLMOutputWithPast or tuple

Model outputs as a PerceptionLMCausalLMOutputWithPast if return_dict=True, otherwise a tuple.

The PerceptionLMForConditionalGeneration forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example:

>>> from transformers import AutoProcessor, PerceptionLMForConditionalGeneration

>>> model = PerceptionLMForConditionalGeneration.from_pretrained("facebook/Perception-LM-1B")
>>> processor = AutoProcessor.from_pretrained("facebook/Perception-LM-1B")

>>> messages = [
...     {
...         "role": "user", "content": [
...             {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
...             {"type": "text", "text": "Where is the cat standing?"},
...         ]
...     },
... ]

>>> inputs = processor.apply_chat_template(
...     messages,
...     tokenize=True,
...     return_dict=True,
...     return_tensors="pt",
...     add_generation_prompt=True
... )
>>> # Generate
>>> generate_ids = model.generate(**inputs, max_new_tokens=50)
>>> processor.batch_decode(generate_ids, skip_special_tokens=True)[0]
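
The same chat-template path can be used for video inputs; a hedged sketch, in which the video file path is a placeholder and the "video" content type is an assumption about the chat template:

>>> video_messages = [
...     {
...         "role": "user", "content": [
...             {"type": "video", "path": "path/to/your_video.mp4"},  # placeholder path
...             {"type": "text", "text": "What is happening in this video?"},
...         ]
...     },
... ]
>>> inputs = processor.apply_chat_template(
...     video_messages, tokenize=True, return_dict=True, return_tensors="pt", add_generation_prompt=True
... )
>>> generate_ids = model.generate(**inputs, max_new_tokens=64)
>>> processor.batch_decode(generate_ids, skip_special_tokens=True)[0]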