Transformers documentation
Mllama
Mllama
Overview
The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a collection of pretrained and instruction-tuned image reasoning generative models in 11B and 90B sizes (text + images in / text out). The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image.
Model Architecture: Llama 3.2-Vision is built on top of Llama 3.1 text-only model, which is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. To support image recognition tasks, the Llama 3.2-Vision model uses a separately trained vision adapter that integrates with the pre-trained Llama 3.1 language model. The adapter consists of a series of cross-attention layers that feed image encoder representations into the core LLM.
Usage Tips
- For image+text and text inputs use MllamaForConditionalGeneration.
- For text-only inputs use MllamaForCausalLMfor generation to avoid loading vision tower.
- Each sample can contain multiple images, and the number of images can vary between samples. The processor will pad the inputs to the maximum number of images across samples and to a maximum number of tiles within each image.
- The text passed to the processor should have the "<|image|>"tokens where the images should be inserted.
- The processor has its own apply_chat_templatemethod to convert chat messages to text that can then be passed as text to the processor.
Mllama has an extra token used as a placeholder for image positions in the text. It means that input ids and an input embedding layer will have an extra token. But since the weights for input and output embeddings are not tied, the lm_head layer has one less token and will fail if you want to calculate loss on image tokens or apply some logit processors. In case you are training, make sure to mask out special "<|image|>" tokens in the labels as the model should not be trained on predicting them.
Otherwise if you see CUDA-side index erros when generating, use the below code to expand the lm_head by one more token.
old_embeddings = model.get_output_embeddings()
num_tokens = model.vocab_size + 1
resized_embeddings = model._get_resized_lm_head(old_embeddings, new_num_tokens=num_tokens, mean_resizing=True)
resized_embeddings.requires_grad_(old_embeddings.weight.requires_grad)
model.set_output_embeddings(resized_embeddings)Usage Example
Instruct model
import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(model_id)
messages = [
    [
        {
            "role": "user", 
            "content": [
                {"type": "image"},
                {"type": "text", "text": "What does the image show?"}
            ]
        }
    ],
]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
url = "https://llava-vl.github.io/static/images/view.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=text, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=25)
print(processor.decode(output[0]))Base model
import requests
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor
model_id = "meta-llama/Llama-3.2-11B-Vision"
model = MllamaForConditionalGeneration.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(model_id)
prompt = "<|image|>If I had to write a haiku for this one"
url = "https://llava-vl.github.io/static/images/view.jpg"
raw_image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=prompt, images=raw_image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, do_sample=False, max_new_tokens=25)
print(processor.decode(output[0], skip_special_tokens=True))MllamaConfig
class transformers.MllamaConfig
< source >( vision_config = None text_config = None image_token_index = 128256 **kwargs )
Parameters
-  vision_config (Union[AutoConfig, dict], optional, defaults toMllamaVisionConfig) — The config object or dictionary of the vision backbone.
-  text_config (Union[AutoConfig, dict], optional, defaults toMllamaTextConfig) — The config object or dictionary of the text backbone.
-  image_token_index (int, optional, defaults to 128256) — The image token index to encode the image prompt.
This is the configuration class to store the configuration of a MllamaForConditionalGeneration. It is used to instantiate an Mllama model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the Mllama-9B.
e.g. meta-llama/Llama-3.2-11B-Vision
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
>>> from transformers import MllamaForConditionalGeneration, MllamaConfig, MllamaVisionConfig, MllamaTextConfig
>>> # Initializing a CLIP-vision config
>>> vision_config = MllamaVisionConfig()
>>> # Initializing a Llama config
>>> text_config = MllamaTextConfig()
>>> # Initializing a mllama-11b style configuration
>>> configuration = MllamaConfig(vision_config, text_config)
>>> # Initializing a model from the mllama-11b style configuration
>>> model = MllamaForConditionalGeneration(configuration)
>>> # Accessing the model configuration
>>> configuration = model.configMllamaProcessor
class transformers.MllamaProcessor
< source >( image_processor tokenizer )
Parameters
- image_processor (MllamaImageProcessor) — The image processor is a required input.
-  tokenizer ([PreTrainedTokenizer,PreTrainedTokenizerFast]) — The tokenizer is a required input.
Constructs a Mllama processor which wraps MllamaImageProcessor and
PretrainedTokenizerFast into a single processor that inherits both the image processor and
tokenizer functionalities. See the __call__() and decode() for more
information.
The preferred way of passing kwargs is as a dictionary per modality, see usage example below.
from transformers import MllamaProcessor
from PIL import Image
processor = MllamaProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision")
processor(
    images=your_pil_image,
    text=["<|image|>If I had to write a haiku for this one"],
    images_kwargs = {"size": {"height": 448, "width": 448}},
    text_kwargs = {"padding": "right"},
    common_kwargs = {"return_tensors": "pt"},
)This method forwards all its arguments to PreTrainedTokenizerFastβs batch_decode(). Please refer to the docstring of this method for more information.
This method forwards all its arguments to PreTrainedTokenizerFastβs decode(). Please refer to the docstring of this method for more information.
post_process_image_text_to_text
< source >( generated_outputs  ) β List[str]
Post-process the output of the model to decode the text.
MllamaImageProcessor
class transformers.MllamaImageProcessor
< source >( do_convert_rgb: bool = True do_resize: bool = True size: typing.Optional[typing.Dict[str, int]] = None resample: Resampling = <Resampling.BILINEAR: 2> do_rescale: bool = True rescale_factor: float = 0.00392156862745098 do_normalize: bool = True image_mean: typing.Union[float, typing.List[float], NoneType] = None image_std: typing.Union[float, typing.List[float], NoneType] = None do_pad: bool = True max_image_tiles: int = 4 **kwargs )
Parameters
-  do_convert_rgb (bool, optional, defaults toTrue) — Whether to convert the image to RGB. This is useful if the input image is of a different format e.g. RGBA. Only has an effect if the input image is in the PIL format.
-  do_resize (bool, optional, defaults toTrue) — Whether to resize the image.
-  size (Dict[str, int], optional, defaults toself.size) — Size of the image tile. Should be a dictionary containing ‘height’ and ‘width’ keys, both with integer values. The height and width values should be equal.
-  resample (int, optional, defaults toResampling.BILINEAR) — Resampling filter to use if resizing the image. This can be one of the enumPILImageResampling. Only has an effect ifdo_resizeis set toTrue.
-  do_rescale (bool, optional, defaults toTrue) — Whether to rescale the image.
-  rescale_factor (float, optional, defaults to 0.0) — Rescale factor to rescale the image by ifdo_rescaleis set toTrue.
-  do_normalize (bool, optional, defaults toTrue) — Whether to normalize the image.
-  image_mean (floatorList[float], optional, defaults toself.image_mean) — Image mean to use for normalization. Only has an effect ifdo_normalizeis set toTrue.
-  image_std (floatorList[float], optional, defaults toself.image_std) — Image standard deviation to use for normalization. Only has an effect ifdo_normalizeis set toTrue.
-  do_pad (bool, optional, defaults toTrue) — Whether or not to pad the images to the largest height and width in the batch.
-  max_image_tiles (int, optional, defaults to 4) — The maximum number of tiles to split the image into.
Constructs a Mllama image processor.
pad
< source >( image: ndarray size: typing.Dict[str, int] aspect_ratio: typing.Tuple[int, int] data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None  ) β np.ndarray
Parameters
-  image (np.ndarray) — Image to resize.
-  size (Dict[str, int]) — Size of the output image.
-  aspect_ratio (Tuple[int, int]) — The aspect ratio of the image.
-  data_format (strorChannelDimension, optional) — The channel dimension format of the image. If not provided, it will be the same as the input image.
-  input_data_format (ChannelDimensionorstr, optional) — The channel dimension format of the input image. If not provided, it will be inferred.
Returns
np.ndarray
The padded image.
Pad an image to the size x aspect_ratio. For example, if size is {height: 224, width: 224} and aspect ratio is
(1, 2), the image will be padded to 224x448.
preprocess
< source >( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), typing.List[ForwardRef('PIL.Image.Image')], typing.List[numpy.ndarray], typing.List[ForwardRef('torch.Tensor')]] do_convert_rgb: typing.Optional[bool] = None do_resize: typing.Optional[bool] = None size: typing.Optional[typing.Dict[str, int]] = None resample: typing.Optional[PIL.Image.Resampling] = None do_rescale: typing.Optional[bool] = None rescale_factor: typing.Optional[float] = None do_normalize: typing.Optional[bool] = None image_mean: typing.Union[float, typing.List[float], NoneType] = None image_std: typing.Union[float, typing.List[float], NoneType] = None do_pad: typing.Optional[bool] = None max_image_tiles: typing.Optional[int] = None input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None  ) β BatchFeature of the following structure
Parameters
-  images (ImageInput) — A list of images to preprocess.
-  do_convert_rgb (bool, optional, defaults toself.do_convert_rgb) — Whether to convert the image to RGB.
-  do_resize (bool, optional, defaults toself.do_resize) — Whether to resize the image.
-  size (Dict[str, int], optional, defaults toself.size) — Size of the image tile. Should be a dictionary containing ‘height’ and ‘width’ keys, both with integer values. The height and width values should be equal.
-  resample (int, optional, defaults toself.resample) — Resampling filter to use if resizing the image. This can be one of the enumPILImageResampling. Only has an effect ifdo_resizeis set toTrue.
-  do_rescale (bool, optional, defaults toself.do_rescale) — Whether to rescale the image.
-  rescale_factor (float, optional, defaults toself.rescale_factor) — Rescale factor to rescale the image by ifdo_rescaleis set toTrue.
-  do_normalize (bool, optional, defaults toself.do_normalize) — Whether to normalize the image.
-  image_mean (floatorList[float], optional, defaults toself.image_mean) — Image mean to use for normalization. Only has an effect ifdo_normalizeis set toTrue.
-  image_std (floatorList[float], optional, defaults toself.image_std) — Image standard deviation to use for normalization. Only has an effect ifdo_normalizeis set toTrue.
-  do_pad (bool, optional, defaults toself.do_pad) — Whether or not to pad the images to the largest height and width in the batch.
-  max_image_tiles (int, optional, defaults toself.max_image_tiles) — The maximum number of tiles to split the image into.
-  input_data_format (ChannelDimensionorstr, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:- "channels_first"or- ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last"or- ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none"or- ChannelDimension.NONE: image in (height, width) format.
 
-  return_tensors (strorTensorType, optional) — The type of tensors to return. Can be one of:- Unset: Return a list of np.ndarray.
- TensorType.TENSORFLOWor- 'tf': Return a batch of type- tf.Tensor.
- TensorType.PYTORCHor- 'pt': Return a batch of type- torch.Tensor.
- TensorType.NUMPYor- 'np': Return a batch of type- np.ndarray.
- TensorType.JAXor- 'jax': Return a batch of type- jax.numpy.ndarray.
 
- Unset: Return a list of 
Returns
BatchFeature of the following structure
- pixel_values (TensorType): The preprocessed pixel values.
- aspect_ratio_ids (TensorType): The aspect ratio ids of the images.
- num_tiles (List[List[int]]): The number of tiles for each image in the batch.
Preprocess a batch of images.
resize
< source >( image: ndarray size: typing.Dict[str, int] max_image_tiles: int resample: Resampling = <Resampling.BILINEAR: 2> data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None  ) β Union[np.ndarray, Tuple[int, int]]
Parameters
-  image (np.ndarray) — Image to resize.
-  size (Dict[str, int]) — Size of the output image.
-  max_image_tiles (int) — The maximum number of tiles to split the image into.
-  resample (PILImageResampling, optional, defaults toPILImageResampling.BICUBIC) — Resampling filter to use when resizing the image.
-  data_format (strorChannelDimension, optional) — The channel dimension format of the image. If not provided, it will be the same as the input image.
-  input_data_format (ChannelDimensionorstr, optional) — The channel dimension format of the input image. If not provided, it will be inferred.
Returns
Union[np.ndarray, Tuple[int, int]]
The resized image and a tuple containing the number of tiles along the height and width dimensions.
Resizes an image to fit within a tiled canvas while maintaining its aspect ratio. The optimal canvas size is calculated based on the maximum number of tiles and the tile size.
The function first determines the best tile arrangement for the image, then resizes the image to fit within this canvas. The resized image and the number of tiles along the height and width dimensions are returned.
MllamaForConditionalGeneration
class transformers.MllamaForConditionalGeneration
< source >( config: MllamaConfig )
Parameters
- config (MllamaConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The Mllama model which consists of a vision encoder and a language model. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( input_ids: typing.Optional[torch.LongTensor] = None pixel_values: typing.Optional[torch.FloatTensor] = None aspect_ratio_mask: typing.Optional[torch.Tensor] = None aspect_ratio_ids: typing.Optional[torch.Tensor] = None attention_mask: typing.Optional[torch.Tensor] = None cross_attention_mask: typing.Optional[torch.Tensor] = None cross_attention_states: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None num_logits_to_keep: int = 0  ) β transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters
-  input_ids (torch.LongTensorof shape(batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. 
-  pixel_values (torch.FloatTensorof shape(batch_size, max_num_images, max_num_tiles, channels, image_size, image_size)) -- The tensors corresponding to the input images. Pixel values can be obtained using [AutoImageProcessor](/docs/transformers/v4.47.1/en/model_doc/auto#transformers.AutoImageProcessor). See [MllamaImageProcessor.__call__()](/docs/transformers/v4.47.1/en/model_doc/imagegpt#transformers.ImageGPTFeatureExtractor.__call__) for details ([]MllamaProcessor`] uses MllamaImageProcessor for processing images).
-  aspect_ratio_mask (torch.Tensorof shape(batch_size, max_num_images, max_num_tiles), optional) — Mask to avoid performing attention on padding tiles. Mask values selected in[0, 1]:- 1 for tiles that are not masked,
- 0 for tiles that are masked.
 
-  aspect_ratio_ids (torch.Tensorof shape(batch_size, max_num_images), optional) — Aspect ratio ids used to select the appropriate precomputed tile embeddings based on the aspect ratio of each input image. These ids correspond to indices in the model’s list of supported aspect ratios, offset by 1.For example, if the model supports aspect ratios [[1, 1], [1, 2], [2, 1]]: - An image with aspect ratio [1, 1] would have ID 1
- An image with aspect ratio [1, 2] would have ID 2
- An image with aspect ratio [2, 1] would have ID 3
 The id 0 is reserved for padding (i.e., no image). If an image has aspect ratio [1, 2], that means it was split into 2 tiles horizontally, and its aspect_ratio_idwould be 2.
-  attention_mask (torch.Tensorof shape(batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
 Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. If past_key_valuesis used, optionally only the lastinput_idshave to be input (seepast_key_values).If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_maskand modify to your needs. See diagram 1 in the paper for more information on the default strategy.- 1 indicates the head is not masked,
- 0 indicates the head is masked.
 
-  cross_attention_mask (torch.Tensorof shape(batch_size, seq_length, max_num_images, max_num_tiles), optional) — Cross-attention mask to control the interaction between text tokens and image tiles. This 4D tensor defines which image tiles each text token should attend to.For each text token (in seq_length): - 1 indicates the token should attend to the corresponding image tile
- 0 indicates the token should not attend to the corresponding image tile
 
-  cross_attention_states (torch.FloatTensor, optional) — Output of the vision model, used for cross-attention. This tensor contains the processed image features that the language model will attend to.
-  position_ids (torch.LongTensorof shape(batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range[0, config.n_positions - 1].
-  past_key_values (Cacheortuple(tuple(torch.FloatTensor)), optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in thepast_key_valuesreturned by the model at a previous stage of decoding, whenuse_cache=Trueorconfig.use_cache=True.Two formats are allowed: - a Cache instance, see our kv cache guide;
- Tuple of tuple(torch.FloatTensor)of lengthconfig.n_layers, with each tuple having 2 tensors of shape(batch_size, num_heads, sequence_length, embed_size_per_head)). This is also known as the legacy cache format.
 The model will output the same cache format that is fed as input. If no past_key_valuesare passed, the legacy cache format will be returned.If past_key_valuesare used, the user can optionally input only the lastinput_ids(those that don’t have their past key value states given to this model) of shape(batch_size, 1)instead of allinput_idsof shape(batch_size, sequence_length).
-  inputs_embeds (torch.FloatTensorof shape(batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passinginput_idsyou can choose to directly pass an embedded representation. This is useful if you want more control over how to convertinput_idsindices into associated vectors than the model’s internal embedding lookup matrix.
-  use_cache (bool, optional) — If set toTrue,past_key_valueskey value states are returned and can be used to speed up decoding (seepast_key_values).
-  output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentionsunder returned tensors for more detail.
-  output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. Seehidden_statesunder returned tensors for more detail.
-  return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
-  cache_position (torch.LongTensorof shape(sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily toposition_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
-  Args —
labels (torch.LongTensorof shape(batch_size, sequence_length), optional): Labels for computing the masked language modeling loss. Indices should either be in[0, ..., config.vocab_size]or -100 (seeinput_idsdocstring). Tokens with indices set to-100are ignored (masked), the loss is only computed for the tokens with labels in[0, ..., config.vocab_size].num_logits_to_keep ( int, optional): Calculate logits for the lastnum_logits_to_keeptokens. If0, calculate logits for allinput_ids(special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes pretty significant for long sequences or large vocabulary size.
Returns
transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (MllamaConfig) and inputs.
- 
loss ( torch.FloatTensorof shape(1,), optional, returned whenlabelsis provided) β Language modeling loss (for next-token prediction).
- 
logits ( torch.FloatTensorof shape(batch_size, sequence_length, config.vocab_size)) β Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- 
past_key_values ( tuple(tuple(torch.FloatTensor)), optional, returned whenuse_cache=Trueis passed or whenconfig.use_cache=True) β Tuple oftuple(torch.FloatTensor)of lengthconfig.n_layers, with each tuple having 2 tensors of shape(batch_size, num_heads, sequence_length, embed_size_per_head))Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_valuesinput) to speed up sequential decoding.
- 
hidden_states ( tuple(torch.FloatTensor), optional, returned whenoutput_hidden_states=Trueis passed or whenconfig.output_hidden_states=True) β Tuple oftorch.FloatTensor(one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape(batch_size, sequence_length, hidden_size).Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. 
- 
attentions ( tuple(torch.FloatTensor), optional, returned whenoutput_attentions=Trueis passed or whenconfig.output_attentions=True) β Tuple oftorch.FloatTensor(one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length).Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. 
The MllamaForConditionalGeneration forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, MllamaForConditionalGeneration
>>> checkpoint = "meta-llama/Llama-3.2-11B-Vision"
>>> model = MllamaForConditionalGeneration.from_pretrained(checkpoint)
>>> processor = AutoProcessor.from_pretrained(checkpoint)
>>> prompt = "<|image|>If I had to write a haiku for this one"
>>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> inputs = processor(text=prompt, images=image, return_tensors="pt")
>>> # Generate
>>> output = model.generate(**inputs, max_new_tokens=15)
>>> prompt_len = inputs.input_ids.shape[-1]
>>> generated_ids = output[:, prompt_len:]
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
>>> print(generated_text)
[', it would be:.\\nA stop sign in Chinatown.\\n']MllamaForCausalLM
class transformers.MllamaForCausalLM
< source >( config )
Parameters
- config (MllamaConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The Mllama Text Model with a language modeling head on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( input_ids: LongTensor = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None cross_attention_states: typing.Optional[torch.LongTensor] = None cross_attention_mask: typing.Optional[torch.LongTensor] = None full_text_row_masked_out_mask: typing.Optional[typing.Tuple[torch.Tensor, torch.Tensor]] = None past_key_values: typing.Union[transformers.cache_utils.Cache, typing.List[torch.FloatTensor], NoneType] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None num_logits_to_keep: int = 0 **loss_kwargs  ) β transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters
-  input_ids (torch.LongTensorof shape(batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. 
-  pixel_values (torch.FloatTensorof shape(batch_size, max_num_images, max_num_tiles, channels, image_size, image_size)) -- The tensors corresponding to the input images. Pixel values can be obtained using [AutoImageProcessor](/docs/transformers/v4.47.1/en/model_doc/auto#transformers.AutoImageProcessor). See [MllamaImageProcessor.__call__()](/docs/transformers/v4.47.1/en/model_doc/imagegpt#transformers.ImageGPTFeatureExtractor.__call__) for details ([]MllamaProcessor`] uses MllamaImageProcessor for processing images).
-  aspect_ratio_mask (torch.Tensorof shape(batch_size, max_num_images, max_num_tiles), optional) — Mask to avoid performing attention on padding tiles. Mask values selected in[0, 1]:- 1 for tiles that are not masked,
- 0 for tiles that are masked.
 
-  aspect_ratio_ids (torch.Tensorof shape(batch_size, max_num_images), optional) — Aspect ratio ids used to select the appropriate precomputed tile embeddings based on the aspect ratio of each input image. These ids correspond to indices in the model’s list of supported aspect ratios, offset by 1.For example, if the model supports aspect ratios [[1, 1], [1, 2], [2, 1]]: - An image with aspect ratio [1, 1] would have ID 1
- An image with aspect ratio [1, 2] would have ID 2
- An image with aspect ratio [2, 1] would have ID 3
 The id 0 is reserved for padding (i.e., no image). If an image has aspect ratio [1, 2], that means it was split into 2 tiles horizontally, and its aspect_ratio_idwould be 2.
-  attention_mask (torch.Tensorof shape(batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
 Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. If past_key_valuesis used, optionally only the lastinput_idshave to be input (seepast_key_values).If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_maskand modify to your needs. See diagram 1 in the paper for more information on the default strategy.- 1 indicates the head is not masked,
- 0 indicates the head is masked.
 
-  cross_attention_mask (torch.Tensorof shape(batch_size, seq_length, max_num_images, max_num_tiles), optional) — Cross-attention mask to control the interaction between text tokens and image tiles. This 4D tensor defines which image tiles each text token should attend to.For each text token (in seq_length): - 1 indicates the token should attend to the corresponding image tile
- 0 indicates the token should not attend to the corresponding image tile
 
-  cross_attention_states (torch.FloatTensor, optional) — Output of the vision model, used for cross-attention. This tensor contains the processed image features that the language model will attend to.
-  position_ids (torch.LongTensorof shape(batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range[0, config.n_positions - 1].
-  past_key_values (Cacheortuple(tuple(torch.FloatTensor)), optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in thepast_key_valuesreturned by the model at a previous stage of decoding, whenuse_cache=Trueorconfig.use_cache=True.Two formats are allowed: - a Cache instance, see our kv cache guide;
- Tuple of tuple(torch.FloatTensor)of lengthconfig.n_layers, with each tuple having 2 tensors of shape(batch_size, num_heads, sequence_length, embed_size_per_head)). This is also known as the legacy cache format.
 The model will output the same cache format that is fed as input. If no past_key_valuesare passed, the legacy cache format will be returned.If past_key_valuesare used, the user can optionally input only the lastinput_ids(those that don’t have their past key value states given to this model) of shape(batch_size, 1)instead of allinput_idsof shape(batch_size, sequence_length).
-  inputs_embeds (torch.FloatTensorof shape(batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passinginput_idsyou can choose to directly pass an embedded representation. This is useful if you want more control over how to convertinput_idsindices into associated vectors than the model’s internal embedding lookup matrix.
-  use_cache (bool, optional) — If set toTrue,past_key_valueskey value states are returned and can be used to speed up decoding (seepast_key_values).
-  output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentionsunder returned tensors for more detail.
-  output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. Seehidden_statesunder returned tensors for more detail.
-  return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
-  cache_position (torch.LongTensorof shape(sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily toposition_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
-  Args —
labels (torch.LongTensorof shape(batch_size, sequence_length), optional): Labels for computing the masked language modeling loss. Indices should either be in[0, ..., config.vocab_size]or -100 (seeinput_idsdocstring). Tokens with indices set to-100are ignored (masked), the loss is only computed for the tokens with labels in[0, ..., config.vocab_size].num_logits_to_keep ( int, optional): Calculate logits for the lastnum_logits_to_keeptokens. If0, calculate logits for allinput_ids(special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes pretty significant for long sequences or large vocabulary size.
Returns
transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (MllamaTextConfig) and inputs.
- 
loss ( torch.FloatTensorof shape(1,), optional, returned whenlabelsis provided) β Language modeling loss (for next-token prediction).
- 
logits ( torch.FloatTensorof shape(batch_size, sequence_length, config.vocab_size)) β Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- 
past_key_values ( tuple(tuple(torch.FloatTensor)), optional, returned whenuse_cache=Trueis passed or whenconfig.use_cache=True) β Tuple oftuple(torch.FloatTensor)of lengthconfig.n_layers, with each tuple having 2 tensors of shape(batch_size, num_heads, sequence_length, embed_size_per_head))Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_valuesinput) to speed up sequential decoding.
- 
hidden_states ( tuple(torch.FloatTensor), optional, returned whenoutput_hidden_states=Trueis passed or whenconfig.output_hidden_states=True) β Tuple oftorch.FloatTensor(one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape(batch_size, sequence_length, hidden_size).Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. 
- 
attentions ( tuple(torch.FloatTensor), optional, returned whenoutput_attentions=Trueis passed or whenconfig.output_attentions=True) β Tuple oftorch.FloatTensor(one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length).Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. 
The MllamaForCausalLM forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> from transformers import AutoTokenizer, MllamaForCausalLM
>>> model = MllamaForCausalLM.from_pretrained("Llama-3.2-11B-Vision")
>>> tokenizer = AutoTokenizer.from_pretrained("Llama-3.2-11B-Vision")
>>> prompt = "If I had to write a haiku, it would be:"
>>> inputs = tokenizer(prompt, return_tensors="pt")
>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=40, do_sample=True, temperature=0.6)
>>> result = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
>>> print(result)
If I had to write a haiku, it would be: "Snowflakes gently fall" - simple, yet peaceful.
I love the idea of snowflakes gently falling, each oneMllamaTextModel
class transformers.MllamaTextModel
< source >( config: MllamaTextConfig )
Parameters
- config (MllamaConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The Mllama Text Model which consists of transformer with self and cross attention layers. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( input_ids: typing.Optional[torch.LongTensor] = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None cross_attention_states: typing.Optional[torch.FloatTensor] = None cross_attention_mask: typing.Optional[torch.Tensor] = None full_text_row_masked_out_mask: typing.Optional[typing.Tuple[torch.Tensor, torch.Tensor]] = None past_key_values: typing.Union[transformers.cache_utils.Cache, typing.List[torch.FloatTensor], NoneType] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None  ) β transformers.modeling_outputs.BaseModelOutputWithPast or tuple(torch.FloatTensor)
Parameters
-  input_ids (torch.LongTensorof shape(batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. 
-  attention_mask (torch.Tensorof shape(batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
 Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. If past_key_valuesis used, optionally only the lastinput_idshave to be input (seepast_key_values).If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_maskand modify to your needs. See diagram 1 in the paper for more information on the default strategy.- 1 indicates the head is not masked,
- 0 indicates the head is masked.
 
-  cross_attention_mask (torch.Tensorof shape(batch_size, seq_length, max_num_images, max_num_tiles), optional) — Cross-attention mask to control the interaction between text tokens and image tiles. This 4D tensor defines which image tiles each text token should attend to.For each text token (in seq_length): - 1 indicates the token should attend to the corresponding image tile
- 0 indicates the token should not attend to the corresponding image tile
 
-  cross_attention_states (torch.FloatTensor, optional) — Output of the vision model, used for cross-attention. This tensor contains the processed image features that the language model will attend to.
-  position_ids (torch.LongTensorof shape(batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range[0, config.n_positions - 1].
-  past_key_values (Cacheortuple(tuple(torch.FloatTensor)), optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in thepast_key_valuesreturned by the model at a previous stage of decoding, whenuse_cache=Trueorconfig.use_cache=True.Two formats are allowed: - a Cache instance, see our kv cache guide;
- Tuple of tuple(torch.FloatTensor)of lengthconfig.n_layers, with each tuple having 2 tensors of shape(batch_size, num_heads, sequence_length, embed_size_per_head)). This is also known as the legacy cache format.
 The model will output the same cache format that is fed as input. If no past_key_valuesare passed, the legacy cache format will be returned.If past_key_valuesare used, the user can optionally input only the lastinput_ids(those that don’t have their past key value states given to this model) of shape(batch_size, 1)instead of allinput_idsof shape(batch_size, sequence_length).
-  inputs_embeds (torch.FloatTensorof shape(batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passinginput_idsyou can choose to directly pass an embedded representation. This is useful if you want more control over how to convertinput_idsindices into associated vectors than the model’s internal embedding lookup matrix.
-  use_cache (bool, optional) — If set toTrue,past_key_valueskey value states are returned and can be used to speed up decoding (seepast_key_values).
-  output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentionsunder returned tensors for more detail.
-  output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. Seehidden_statesunder returned tensors for more detail.
-  return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
-  cache_position (torch.LongTensorof shape(sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily toposition_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
Returns
transformers.modeling_outputs.BaseModelOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutputWithPast or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (MllamaTextConfig) and inputs.
- 
last_hidden_state ( torch.FloatTensorof shape(batch_size, sequence_length, hidden_size)) β Sequence of hidden-states at the output of the last layer of the model.If past_key_valuesis used only the last hidden-state of the sequences of shape(batch_size, 1, hidden_size)is output.
- 
past_key_values ( tuple(tuple(torch.FloatTensor)), optional, returned whenuse_cache=Trueis passed or whenconfig.use_cache=True) β Tuple oftuple(torch.FloatTensor)of lengthconfig.n_layers, with each tuple having 2 tensors of shape(batch_size, num_heads, sequence_length, embed_size_per_head)) and optionally ifconfig.is_encoder_decoder=True2 additional tensors of shape(batch_size, num_heads, encoder_sequence_length, embed_size_per_head).Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if config.is_encoder_decoder=Truein the cross-attention blocks) that can be used (seepast_key_valuesinput) to speed up sequential decoding.
- 
hidden_states ( tuple(torch.FloatTensor), optional, returned whenoutput_hidden_states=Trueis passed or whenconfig.output_hidden_states=True) β Tuple oftorch.FloatTensor(one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape(batch_size, sequence_length, hidden_size).Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. 
- 
attentions ( tuple(torch.FloatTensor), optional, returned whenoutput_attentions=Trueis passed or whenconfig.output_attentions=True) β Tuple oftorch.FloatTensor(one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length).Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. 
The MllamaTextModel forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> from transformers import AutoProcessor, MllamaTextModel
>>> checkpoint = "meta-llama/Llama-3.2-11B-Vision"
>>> model = MllamaTextModel.from_pretrained(checkpoint)
>>> processor = AutoProcessor.from_pretrained(checkpoint)
>>> text = "<|image|>If I had to write a haiku for this one"
>>> inputs = processor(text=text, return_tensors="pt")
>>> output = model(**inputs)
>>> print(output.last_hidden_state.shape)
torch.Size([1, 13, 4096])MllamaForCausalLM
class transformers.MllamaForCausalLM
< source >( config )
Parameters
- config (MllamaConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The Mllama Text Model with a language modeling head on top. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( input_ids: LongTensor = None attention_mask: typing.Optional[torch.Tensor] = None position_ids: typing.Optional[torch.LongTensor] = None cross_attention_states: typing.Optional[torch.LongTensor] = None cross_attention_mask: typing.Optional[torch.LongTensor] = None full_text_row_masked_out_mask: typing.Optional[typing.Tuple[torch.Tensor, torch.Tensor]] = None past_key_values: typing.Union[transformers.cache_utils.Cache, typing.List[torch.FloatTensor], NoneType] = None inputs_embeds: typing.Optional[torch.FloatTensor] = None labels: typing.Optional[torch.LongTensor] = None use_cache: typing.Optional[bool] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None cache_position: typing.Optional[torch.LongTensor] = None num_logits_to_keep: int = 0 **loss_kwargs  ) β transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters
-  input_ids (torch.LongTensorof shape(batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. 
-  pixel_values (torch.FloatTensorof shape(batch_size, max_num_images, max_num_tiles, channels, image_size, image_size)) -- The tensors corresponding to the input images. Pixel values can be obtained using [AutoImageProcessor](/docs/transformers/v4.47.1/en/model_doc/auto#transformers.AutoImageProcessor). See [MllamaImageProcessor.__call__()](/docs/transformers/v4.47.1/en/model_doc/imagegpt#transformers.ImageGPTFeatureExtractor.__call__) for details ([]MllamaProcessor`] uses MllamaImageProcessor for processing images).
-  aspect_ratio_mask (torch.Tensorof shape(batch_size, max_num_images, max_num_tiles), optional) — Mask to avoid performing attention on padding tiles. Mask values selected in[0, 1]:- 1 for tiles that are not masked,
- 0 for tiles that are masked.
 
-  aspect_ratio_ids (torch.Tensorof shape(batch_size, max_num_images), optional) — Aspect ratio ids used to select the appropriate precomputed tile embeddings based on the aspect ratio of each input image. These ids correspond to indices in the model’s list of supported aspect ratios, offset by 1.For example, if the model supports aspect ratios [[1, 1], [1, 2], [2, 1]]: - An image with aspect ratio [1, 1] would have ID 1
- An image with aspect ratio [1, 2] would have ID 2
- An image with aspect ratio [2, 1] would have ID 3
 The id 0 is reserved for padding (i.e., no image). If an image has aspect ratio [1, 2], that means it was split into 2 tiles horizontally, and its aspect_ratio_idwould be 2.
-  attention_mask (torch.Tensorof shape(batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in[0, 1]:- 1 for tokens that are not masked,
- 0 for tokens that are masked.
 Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details. If past_key_valuesis used, optionally only the lastinput_idshave to be input (seepast_key_values).If you want to change padding behavior, you should read modeling_opt._prepare_decoder_attention_maskand modify to your needs. See diagram 1 in the paper for more information on the default strategy.- 1 indicates the head is not masked,
- 0 indicates the head is masked.
 
-  cross_attention_mask (torch.Tensorof shape(batch_size, seq_length, max_num_images, max_num_tiles), optional) — Cross-attention mask to control the interaction between text tokens and image tiles. This 4D tensor defines which image tiles each text token should attend to.For each text token (in seq_length): - 1 indicates the token should attend to the corresponding image tile
- 0 indicates the token should not attend to the corresponding image tile
 
-  cross_attention_states (torch.FloatTensor, optional) — Output of the vision model, used for cross-attention. This tensor contains the processed image features that the language model will attend to.
-  position_ids (torch.LongTensorof shape(batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range[0, config.n_positions - 1].
-  past_key_values (Cacheortuple(tuple(torch.FloatTensor)), optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in thepast_key_valuesreturned by the model at a previous stage of decoding, whenuse_cache=Trueorconfig.use_cache=True.Two formats are allowed: - a Cache instance, see our kv cache guide;
- Tuple of tuple(torch.FloatTensor)of lengthconfig.n_layers, with each tuple having 2 tensors of shape(batch_size, num_heads, sequence_length, embed_size_per_head)). This is also known as the legacy cache format.
 The model will output the same cache format that is fed as input. If no past_key_valuesare passed, the legacy cache format will be returned.If past_key_valuesare used, the user can optionally input only the lastinput_ids(those that don’t have their past key value states given to this model) of shape(batch_size, 1)instead of allinput_idsof shape(batch_size, sequence_length).
-  inputs_embeds (torch.FloatTensorof shape(batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passinginput_idsyou can choose to directly pass an embedded representation. This is useful if you want more control over how to convertinput_idsindices into associated vectors than the model’s internal embedding lookup matrix.
-  use_cache (bool, optional) — If set toTrue,past_key_valueskey value states are returned and can be used to speed up decoding (seepast_key_values).
-  output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentionsunder returned tensors for more detail.
-  output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. Seehidden_statesunder returned tensors for more detail.
-  return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
-  cache_position (torch.LongTensorof shape(sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily toposition_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
-  Args —
labels (torch.LongTensorof shape(batch_size, sequence_length), optional): Labels for computing the masked language modeling loss. Indices should either be in[0, ..., config.vocab_size]or -100 (seeinput_idsdocstring). Tokens with indices set to-100are ignored (masked), the loss is only computed for the tokens with labels in[0, ..., config.vocab_size].num_logits_to_keep ( int, optional): Calculate logits for the lastnum_logits_to_keeptokens. If0, calculate logits for allinput_ids(special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes pretty significant for long sequences or large vocabulary size.
Returns
transformers.modeling_outputs.CausalLMOutputWithPast or tuple(torch.FloatTensor)
A transformers.modeling_outputs.CausalLMOutputWithPast or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (MllamaTextConfig) and inputs.
- 
loss ( torch.FloatTensorof shape(1,), optional, returned whenlabelsis provided) β Language modeling loss (for next-token prediction).
- 
logits ( torch.FloatTensorof shape(batch_size, sequence_length, config.vocab_size)) β Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- 
past_key_values ( tuple(tuple(torch.FloatTensor)), optional, returned whenuse_cache=Trueis passed or whenconfig.use_cache=True) β Tuple oftuple(torch.FloatTensor)of lengthconfig.n_layers, with each tuple having 2 tensors of shape(batch_size, num_heads, sequence_length, embed_size_per_head))Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_valuesinput) to speed up sequential decoding.
- 
hidden_states ( tuple(torch.FloatTensor), optional, returned whenoutput_hidden_states=Trueis passed or whenconfig.output_hidden_states=True) β Tuple oftorch.FloatTensor(one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape(batch_size, sequence_length, hidden_size).Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. 
- 
attentions ( tuple(torch.FloatTensor), optional, returned whenoutput_attentions=Trueis passed or whenconfig.output_attentions=True) β Tuple oftorch.FloatTensor(one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length).Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. 
The MllamaForCausalLM forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> from transformers import AutoTokenizer, MllamaForCausalLM
>>> model = MllamaForCausalLM.from_pretrained("Llama-3.2-11B-Vision")
>>> tokenizer = AutoTokenizer.from_pretrained("Llama-3.2-11B-Vision")
>>> prompt = "If I had to write a haiku, it would be:"
>>> inputs = tokenizer(prompt, return_tensors="pt")
>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=40, do_sample=True, temperature=0.6)
>>> result = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
>>> print(result)
If I had to write a haiku, it would be: "Snowflakes gently fall" - simple, yet peaceful.
I love the idea of snowflakes gently falling, each oneMllamaVisionModel
class transformers.MllamaVisionModel
< source >( config: MllamaVisionConfig )
Parameters
- config (MllamaConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The Mllama Vision Model which consists of two vision encoders. This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads etc.)
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( pixel_values: Tensor aspect_ratio_ids: Tensor aspect_ratio_mask: Tensor output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None  ) β transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)
Parameters
-  pixel_values (torch.FloatTensorof shape(batch_size, max_num_images, max_num_tiles, channels, image_size, image_size)) -- The tensors corresponding to the input images. Pixel values can be obtained using [AutoImageProcessor](/docs/transformers/v4.47.1/en/model_doc/auto#transformers.AutoImageProcessor). See [MllamaImageProcessor.__call__()](/docs/transformers/v4.47.1/en/model_doc/imagegpt#transformers.ImageGPTFeatureExtractor.__call__) for details ([]MllamaProcessor`] uses MllamaImageProcessor for processing images).
-  aspect_ratio_mask (torch.Tensorof shape(batch_size, max_num_images, max_num_tiles), optional) — Mask to avoid performing attention on padding tiles. Mask values selected in[0, 1]:- 1 for tiles that are not masked,
- 0 for tiles that are masked.
 
-  aspect_ratio_ids (torch.Tensorof shape(batch_size, max_num_images), optional) — Aspect ratio ids used to select the appropriate precomputed tile embeddings based on the aspect ratio of each input image. These ids correspond to indices in the model’s list of supported aspect ratios, offset by 1.For example, if the model supports aspect ratios [[1, 1], [1, 2], [2, 1]]: - An image with aspect ratio [1, 1] would have ID 1
- An image with aspect ratio [1, 2] would have ID 2
- An image with aspect ratio [2, 1] would have ID 3
 The id 0 is reserved for padding (i.e., no image). If an image has aspect ratio [1, 2], that means it was split into 2 tiles horizontally, and its aspect_ratio_idwould be 2.
-  output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. Seeattentionsunder returned tensors for more detail.
-  output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. Seehidden_statesunder returned tensors for more detail.
-  return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
Returns
transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutput or a tuple of
torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various
elements depending on the configuration (MllamaVisionConfig) and inputs.
- 
last_hidden_state ( torch.FloatTensorof shape(batch_size, sequence_length, hidden_size)) β Sequence of hidden-states at the output of the last layer of the model.
- 
hidden_states ( tuple(torch.FloatTensor), optional, returned whenoutput_hidden_states=Trueis passed or whenconfig.output_hidden_states=True) β Tuple oftorch.FloatTensor(one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape(batch_size, sequence_length, hidden_size).Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. 
- 
attentions ( tuple(torch.FloatTensor), optional, returned whenoutput_attentions=Trueis passed or whenconfig.output_attentions=True) β Tuple oftorch.FloatTensor(one for each layer) of shape(batch_size, num_heads, sequence_length, sequence_length).Attentions weights after the attention softmax, used to compute the weighted average in the self-attention heads. 
The MllamaVisionModel forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the Module
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.
Example:
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, MllamaVisionModel
>>> checkpoint = "meta-llama/Llama-3.2-11B-Vision"
>>> model = MllamaVisionModel.from_pretrained(checkpoint)
>>> processor = AutoProcessor.from_pretrained(checkpoint)
>>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> inputs = processor(images=image, return_tensors="pt")
>>> output = model(**inputs)
>>> print(output.last_hidden_state.shape)
torch.Size([1, 1, 4, 1025, 7680])