Mistral3
Overview
Building upon Mistral Small 3 (2501), Mistral Small 3.1 (2503) adds state-of-the-art vision understanding and enhances long context capabilities up to 128k tokens without compromising text performance. With 24 billion parameters, this model achieves top-tier capabilities in both text and vision tasks.
It is ideal for:
- Fast-response conversational agents.
- Low-latency function calling.
- Subject matter experts via fine-tuning.
- Local inference for hobbyists and organizations handling sensitive data.
- Programming and math reasoning.
- Long document understanding.
- Visual understanding.
This model was contributed by cyrilvallez and yonigozlan.
The original code can be found here and here.
Usage example
Inference with Pipeline
Here is how you can use the image-text-to-text pipeline to perform inference with the Mistral3 models in just a few lines of code:
>>> import torch
>>> from transformers import pipeline
>>> messages = [
... {
... "role": "user",
... "content": [
... {
... "type": "image",
... "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
... },
... {"type": "text", "text": "Describe this image."},
... ],
... },
... ]
>>> pipe = pipeline("image-text-to-text", model="mistralai/Mistral-Small-3.1-24B-Instruct-2503", torch_dtype=torch.bfloat16)
>>> outputs = pipe(text=messages, max_new_tokens=50, return_full_text=False)
>>> outputs[0]["generated_text"]
'The image depicts a vibrant and lush garden scene featuring a variety of wildflowers and plants. The central focus is on a large, pinkish-purple flower, likely a Greater Celandine (Chelidonium majus), with a'
Inference on a single image
This example demonstrates how to perform inference on a single image with the Mistral3 models using chat templates.
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> messages = [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
... {"type": "text", "text": "Describe this image"},
... ],
... }
... ]
>>> inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
>>> generate_ids = model.generate(**inputs, max_new_tokens=20)
>>> decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
>>> decoded_output
"The image depicts two cats lying on a pink blanket. The larger cat, which appears to be an"...
Text-only generation
This example shows how to generate text using the Mistral3 model without providing any image input.
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> SYSTEM_PROMPT = "You are a conversational agent that always answers straight to the point, always end your accurate response with an ASCII drawing of a cat."
>>> user_prompt = "Give me 5 non-formal ways to say 'See you later' in French."
>>> messages = [
... {"role": "system", "content": SYSTEM_PROMPT},
... {"role": "user", "content": user_prompt},
... ]
>>> text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
>>> inputs = processor(text=text, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
>>> generate_ids = model.generate(**inputs, max_new_tokens=50, do_sample=False)
>>> decoded_output = processor.batch_decode(generate_ids[:, inputs["input_ids"].shape[1] :], skip_special_tokens=True)[0]
>>> print(decoded_output)
"1. À plus tard!
2. Salut, à plus!
3. À toute!
4. À la prochaine!
5. Je me casse, à plus!
```
/\_/\
( o.o )
> ^ <
```"
Batched image and text inputs
Mistral3 models also support batched image and text inputs.
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> messages = [
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
... {"type": "text", "text": "Write a haiku for this image"},
... ],
... },
... ],
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
... {"type": "text", "text": "Describe this image"},
... ],
... },
... ],
... ]
>>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
>>> output = model.generate(**inputs, max_new_tokens=25)
>>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
>>> decoded_outputs
["Write a haiku for this imageCalm waters reflect\nWhispers of the forest's breath\nPeace on wooden path"
, "Describe this imageThe image depicts a vibrant street scene in what appears to be a Chinatown district. The focal point is a traditional Chinese"]
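The decoded strings above still begin with the prompt text, because the full generated sequences are decoded. The following is a minimal sketch of dropping the prompt tokens before decoding. It uses plain Python lists in place of tensors, the token ids and the helper name `strip_prompt` are made up for illustration, and it assumes the prompts were left-padded so that they all share the same length.

```python
def strip_prompt(sequences, prompt_length):
    """Drop the first `prompt_length` ids from every generated sequence."""
    return [seq[prompt_length:] for seq in sequences]


# Two left-padded prompts of equal length (0 is a made-up pad id).
prompts = [[0, 0, 11, 12], [21, 22, 23, 24]]

# Hypothetical generation output: each prompt followed by two new ids.
generated = [[0, 0, 11, 12, 101, 102], [21, 22, 23, 24, 201, 202]]

new_tokens = strip_prompt(generated, prompt_length=len(prompts[0]))
print(new_tokens)  # [[101, 102], [201, 202]]
```

With tensors, the same slice is written `output[:, inputs["input_ids"].shape[1]:]`, which is the pattern the text-only example uses before calling `batch_decode`.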
Batched multi-image input and quantization with BitsAndBytes
This implementation of the Mistral3 models supports batched text-image inputs with a different number of images for each text.
This example also shows how to use BitsAndBytes to load the model in 4-bit quantization.
>>> from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> quantization_config = BitsAndBytesConfig(load_in_4bit=True)
>>> model = AutoModelForImageTextToText.from_pretrained(
... model_checkpoint, quantization_config=quantization_config
... )
>>> messages = [
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
... {"type": "text", "text": "Write a haiku for this image"},
... ],
... },
... ],
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
... {"type": "image", "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg"},
... {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
... ],
... },
... ],
... ]
>>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
>>> output = model.generate(**inputs, max_new_tokens=25)
>>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
>>> decoded_outputs
["Write a haiku for this imageSure, here is a haiku inspired by the image:\n\nCalm lake's wooden path\nSilent forest stands guard\n", "These images depict two different landmarks. Can you identify them? Certainly! The images depict two iconic landmarks:\n\n1. The first image shows the Statue of Liberty in New York City."]
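When the number of images varies per conversation, it can help to build the message dictionaries programmatically. The sketch below is one way to do that; the helper name `build_user_message` and the `example.com` URLs are placeholders, not part of the library. It only constructs the nested structure used in the examples above and does not call the model.

```python
def build_user_message(image_urls, text):
    """Return one user turn: any number of images followed by a text prompt."""
    content = [{"type": "image", "url": url} for url in image_urls]
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}


# A batch of two conversations with one and two images respectively.
messages = [
    [build_user_message(["https://example.com/a.jpg"], "Write a haiku for this image")],
    [
        build_user_message(
            ["https://example.com/b.jpg", "https://example.com/c.jpg"],
            "These images depict two different landmarks. Can you identify them?",
        )
    ],
]

images_per_conversation = [
    sum(item["type"] == "image" for item in conversation[0]["content"])
    for conversation in messages
]
print(images_per_conversation)  # [1, 2]
```

The resulting `messages` list has the same shape as the one passed to `processor.apply_chat_template` in the example above.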
Mistral3Config
class transformers.Mistral3Config
< source >( vision_config = None text_config = None image_token_index = 10 projector_hidden_act = 'gelu' vision_feature_layer = -1 multimodal_projector_bias = False spatial_merge_size = 2 **kwargs )
Parameters
- vision_config (Union[AutoConfig, dict], optional, defaults to PixtralVisionConfig) — The config object or dictionary of the vision backbone.
- text_config (Union[AutoConfig, dict], optional, defaults to MistralConfig) — The config object or dictionary of the text backbone.
- image_token_index (int, optional, defaults to 10) — The image token index to encode the image prompt.
- projector_hidden_act (str, optional, defaults to "gelu") — The activation function used by the multimodal projector.
- vision_feature_layer (Union[int, list[int]], optional, defaults to -1) — The index of the layer to select the vision feature. If multiple indices are provided, the vision feature of the corresponding indices will be concatenated to form the vision features.
- multimodal_projector_bias (bool, optional, defaults to False) — Whether to use bias in the multimodal projector.
- spatial_merge_size (int, optional, defaults to 2) — The downsampling factor for the spatial merge operation.
This is the configuration class to store the configuration of a Mistral3ForConditionalGeneration. It is used to instantiate a Mistral3 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of mistralai/Mistral-Small-3.1-24B-Instruct-2503.
Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.
Example:
>>> from transformers import Mistral3ForConditionalGeneration, Mistral3Config, PixtralVisionConfig, MistralConfig
>>> # Initializing a Pixtral-vision config
>>> vision_config = PixtralVisionConfig()
>>> # Initializing a Mistral config
>>> text_config = MistralConfig()
>>> # Initializing a Mistral3 configuration
>>> configuration = Mistral3Config(vision_config, text_config)
>>> # Initializing a model from the Mistral3 configuration
>>> model = Mistral3ForConditionalGeneration(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
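To give a feel for what spatial_merge_size controls, the sketch below estimates how many vision features remain after the spatial merge. This is a back-of-the-envelope illustration, not the library's implementation: it assumes that the image is cut into square patches, that the merge groups spatial_merge_size x spatial_merge_size neighboring patches into one feature, and that the patch size is 14 (a hypothetical value chosen for the example). The function name is made up.

```python
def merged_feature_count(height, width, patch_size=14, spatial_merge_size=2):
    """Estimate the number of vision features after patching and spatial merging.

    Assumption: the merge downsamples each spatial dimension of the patch grid
    by `spatial_merge_size`.
    """
    grid_h = height // patch_size
    grid_w = width // patch_size
    return (grid_h // spatial_merge_size) * (grid_w // spatial_merge_size)


# A 448 x 672 image gives a 32 x 48 patch grid (1536 patches).
print(merged_feature_count(448, 672, spatial_merge_size=1))  # 1536
# Merging 2 x 2 neighborhoods leaves a 16 x 24 grid.
print(merged_feature_count(448, 672, spatial_merge_size=2))  # 384
```

Under these assumptions, the default spatial_merge_size of 2 reduces the number of features passed on to the text backbone by a factor of four.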
MistralCommonTokenizer
class transformers.MistralCommonTokenizer
< source >( tokenizer_path: typing.Union[str, os.PathLike, pathlib.Path] mode: ValidationMode = <ValidationMode.test: 'test'> model_max_length: int = 1000000000000000019884624838656 padding_side: str = 'left' truncation_side: str = 'right' model_input_names: typing.Optional[list[str]] = None clean_up_tokenization_spaces: bool = False **kwargs )
Class to wrap mistral-common tokenizers.
mistral-common is the official tokenizer library for Mistral AI models. To use it, you need to install it with:
pip install transformers[mistral-common]
Otherwise the tokenizer falls back to the Transformers implementation of the tokenizer.
For more info on mistral-common, see mistral-common.
This class is a wrapper around a mistral_common.tokens.tokenizers.mistral.MistralTokenizer. It provides a Hugging Face compatible interface to tokenize using the official mistral-common tokenizer.
Supports the following methods from the PreTrainedTokenizerBase class:
- get_vocab(): Returns the vocabulary as a dictionary of token to index.
- encode(): Encode a string to a list of integers.
- decode(): Decode a list of integers to a string.
- batch_decode(): Decode a batch of list of integers to a list of strings.
- convert_tokens_to_ids(): Convert a list of tokens to a list of integers.
- convert_ids_to_tokens(): Convert a list of integers to a list of tokens.
- tokenize(): Tokenize a string.
- get_special_tokens_mask(): Get the special tokens mask for a list of tokens.
- prepare_for_model(): Prepare a list of inputs for the model.
- pad(): Pad a list of inputs to the same length.
- truncate_sequences(): Truncate a list of sequences to the same length.
- apply_chat_template(): Apply a chat template to a list of messages.
- __call__(): Tokenize a string or a list of strings.
- from_pretrained(): Download and cache a pretrained tokenizer from the Hugging Face model hub or local directory.
- save_pretrained(): Save a tokenizer to a directory, so it can be reloaded using the from_pretrained class method.
- push_to_hub(): Upload tokenizer to the Hugging Face model hub.
Here are the key differences with the PreTrainedTokenizerBase class:
- Pairs of sequences are not supported. The signature has been kept for compatibility, but all arguments related to pairs of sequences are ignored. The return values for pairs are returned as None.
- The is_split_into_words argument is not supported.
- The return_token_type_ids argument is not supported.
- It is not possible to add new tokens to the tokenizer. Also, the special tokens are handled differently from Transformers. In mistral-common, special tokens are never encoded directly. This means that tokenizer.encode("<s>") will not return the ID of the <s> token. Instead, it will return a list of IDs corresponding to the tokenization of the string "<s>". For more information, see the mistral-common documentation.
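The last point can be illustrated with a toy encoder. This is not the mistral-common API: the control-token ids, the byte-level scheme and the function names are all invented for the sketch. It only shows the idea that text passed to an encoder is always treated as ordinary text, while control tokens are inserted by a separate code path.

```python
# Made-up ids reserved for control tokens.
CONTROL_TOKENS = {"<s>": 1, "</s>": 2}


def encode_text(text):
    """Toy encoder: every UTF-8 byte becomes an ordinary id (offset by 100)."""
    return [100 + byte for byte in text.encode("utf-8")]


def encode_with_bos(text):
    """Control tokens are added by the tokenizer itself, never parsed from text."""
    return [CONTROL_TOKENS["<s>"]] + encode_text(text)


ids = encode_text("<s>")
print(ids)  # [160, 215, 162]
print(CONTROL_TOKENS["<s>"] in ids)  # False

print(encode_with_bos("hi"))  # [1, 204, 205]
```

Encoding the literal string "<s>" yields three ordinary ids, one per character, and never the reserved id 1.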
If you have suggestions to improve this class, please open an issue on the mistral-common GitHub repository if it is related to the tokenizer or on the Transformers GitHub repository if it is related to the Hugging Face interface.
apply_chat_template
< source >( conversation: typing.Union[list[dict[str, str]], list[list[dict[str, str]]]] tools: typing.Optional[list[typing.Union[dict, typing.Callable]]] = None continue_final_message: bool = False tokenize: bool = True padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False truncation: bool = False max_length: typing.Optional[int] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None return_dict: bool = False **kwargs ) → Union[str, List[int], List[str], List[List[int]], BatchEncoding]
Parameters
- conversation (Union[List[Dict[str, str]], List[List[Dict[str, str]]]]) — A list of dicts with “role” and “content” keys, representing the chat history so far.
- tools (List[Union[Dict, Callable]], optional) — A list of tools (callable functions) that will be accessible to the model. If the template does not support function calling, this argument will have no effect. Each tool should be passed as a JSON Schema, giving the name, description and argument types for the tool. See our chat templating guide for more information.
- continue_final_message (bool, optional) — If this is set, the chat will be formatted so that the final message in the chat is open-ended, without any EOS tokens. The model will continue this message rather than starting a new one. This allows you to “prefill” part of the model’s response for it. Cannot be used at the same time as add_generation_prompt.
- tokenize (bool, defaults to True) — Whether to tokenize the output. If False, the output will be a string.
- padding (bool, str or PaddingStrategy, optional, defaults to False) — Select a strategy to pad the returned sequences (according to the model’s padding side and padding index) among:
  - True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
  - 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
  - False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
- truncation (bool, defaults to False) — Whether to truncate sequences at the maximum length. Has no effect if tokenize is False.
- max_length (int, optional) — Maximum length (in tokens) to use for padding or truncation. Has no effect if tokenize is False. If not specified, the tokenizer’s max_length attribute will be used as a default.
- return_tensors (str or TensorType, optional) — If set, will return tensors of a particular framework. Has no effect if tokenize is False. Acceptable values are:
  - 'pt': Return PyTorch torch.Tensor objects.
- return_dict (bool, defaults to False) — Whether to return a dictionary with named outputs. Has no effect if tokenize is False. If at least one conversation contains an image, its pixel values will be returned in the pixel_values key.
- kwargs (additional keyword arguments, optional) — Not supported by MistralCommonTokenizer.apply_chat_template. Will raise an error if used.
Returns
Union[str, List[int], List[str], List[List[int]], BatchEncoding]
A list of token ids representing the tokenized chat so far, including control tokens. This output is ready to pass to the model, either directly or via methods like generate().
Converts a list of dictionaries with "role" and "content" keys to a list of token ids.
batch_decode
< source >( sequences: typing.Union[list[int], list[list[int]], ForwardRef('np.ndarray'), ForwardRef('torch.Tensor')] skip_special_tokens: bool = False clean_up_tokenization_spaces: typing.Optional[bool] = None **kwargs ) → List[str]
Parameters
- sequences (Union[List[int], List[List[int]], np.ndarray, torch.Tensor]) — List of tokenized input ids. Can be obtained using the __call__ method.
- skip_special_tokens (bool, optional, defaults to False) — Whether or not to remove special tokens in the decoding.
- clean_up_tokenization_spaces (bool, optional) — Whether or not to clean up the tokenization spaces. If None, will default to self.clean_up_tokenization_spaces.
- kwargs (additional keyword arguments, optional) — Not supported by MistralCommonTokenizer.batch_decode. Will raise an error if used.
Returns
List[str]
The list of decoded sentences.
Convert a list of lists of token ids into a list of strings by calling decode.
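The relationship between batch_decode and decode, and the effect of skip_special_tokens, can be sketched with a toy vocabulary. The ids, the vocabulary and the standalone functions below are invented for illustration and do not reproduce the tokenizer's actual decoding logic.

```python
# Made-up vocabulary: ids 1 and 2 play the role of special tokens.
VOCAB = {1: "<s>", 2: "</s>", 10: "Hello", 11: " world", 12: "Bonjour"}
SPECIAL_IDS = {1, 2}


def decode(token_ids, skip_special_tokens=False):
    """Join the string pieces of one sequence, optionally dropping special tokens."""
    pieces = [
        VOCAB[token_id]
        for token_id in token_ids
        if not (skip_special_tokens and token_id in SPECIAL_IDS)
    ]
    return "".join(pieces)


def batch_decode(sequences, skip_special_tokens=False):
    """Decode every sequence in the batch by calling `decode` on each one."""
    return [decode(seq, skip_special_tokens=skip_special_tokens) for seq in sequences]


batch = [[1, 10, 11, 2], [1, 12, 2]]
print(batch_decode(batch))  # ['<s>Hello world</s>', '<s>Bonjour</s>']
print(batch_decode(batch, skip_special_tokens=True))  # ['Hello world', 'Bonjour']
```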
convert_ids_to_tokens
< source >( ids: typing.Union[int, list[int]] skip_special_tokens: bool = False ) → str or List[str]
Converts a single index or a sequence of indices in a token or a sequence of tokens, using the vocabulary and added tokens.
convert_tokens_to_ids
< source >( tokens: typing.Union[str, list[str]] ) → int or List[int]
Converts a token string (or a sequence of tokens) in a single integer id (or a sequence of ids), using the vocabulary.
decode
< source >( token_ids: typing.Union[int, list[int], ForwardRef('np.ndarray'), ForwardRef('torch.Tensor')] skip_special_tokens: bool = False clean_up_tokenization_spaces: typing.Optional[bool] = None **kwargs ) → str
Parameters
- token_ids (Union[int, List[int], np.ndarray, torch.Tensor]) — List of tokenized input ids. Can be obtained using the __call__ method.
- skip_special_tokens (bool, optional, defaults to False) — Whether or not to remove special tokens in the decoding.
- clean_up_tokenization_spaces (bool, optional) — Whether or not to clean up the tokenization spaces. If None, will default to self.clean_up_tokenization_spaces.
- kwargs (additional keyword arguments, optional) — Not supported by MistralCommonTokenizer.decode. Will raise an error if used.
Returns
str
The decoded sentence.
Converts a sequence of ids in a string, using the tokenizer and vocabulary with options to remove special tokens and clean up tokenization spaces.
encode
< source >( text: typing.Union[str, list[int]] text_pair: None = None add_special_tokens: bool = True padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False truncation: typing.Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy, NoneType] = None max_length: typing.Optional[int] = None stride: int = 0 pad_to_multiple_of: typing.Optional[int] = None padding_side: typing.Optional[str] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None verbose: bool = True **kwargs ) → List[int], torch.Tensor
Parameters
- text (str or List[int]) — The first sequence to be encoded. This can be a string or a list of integers (tokenized string ids).
- text_pair (None, optional) — Not supported by MistralCommonTokenizer.encode. Kept to match PreTrainedTokenizerBase.encode signature.
- add_special_tokens (bool, optional, defaults to True) — Whether or not to add special tokens when encoding the sequences. This will use the underlying PretrainedTokenizerBase.build_inputs_with_special_tokens function, which defines which tokens are automatically added to the input ids. This is useful if you want to add bos or eos tokens automatically.
- padding (bool, str or PaddingStrategy, optional, defaults to False) — Activates and controls padding. Accepts the following values:
  - True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
  - 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
  - False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
- truncation (bool, str or TruncationStrategy, optional, defaults to False) — Activates and controls truncation. Accepts the following values:
  - True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
  - False or 'do_not_truncate' (default): No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).
- max_length (int, optional) — Controls the maximum length to use by one of the truncation/padding parameters. If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.
- stride (int, optional, defaults to 0) — If set to a number along with max_length, the overflowing tokens returned when return_overflowing_tokens=True will contain some tokens from the end of the truncated sequence returned to provide some overlap between truncated and overflowing sequences. The value of this argument defines the number of overlapping tokens.
- pad_to_multiple_of (int, optional) — If set will pad the sequence to a multiple of the provided value. Requires padding to be activated. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
- padding_side (str, optional) — The side on which the model should have padding applied. Should be selected between [‘right’, ‘left’]. Default value is picked from the class attribute of the same name.
- return_tensors (str or TensorType, optional) — If set, will return tensors instead of list of python integers. Acceptable values are:
  - 'pt': Return PyTorch torch.Tensor objects.
- **kwargs — Not supported by MistralCommonTokenizer.encode. Will raise an error if used.
Returns
List[int], torch.Tensor
The tokenized ids of the text.
Converts a string to a sequence of ids (integer), using the tokenizer and vocabulary.
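The interaction between max_length and stride described in the parameter list can be sketched on a plain list of ids. This is an illustrative sketch for right-side truncation only; the function name is made up and the code is not taken from the tokenizer.

```python
def truncate_with_stride(ids, max_length, stride=0):
    """Truncate `ids` on the right and return (kept, overflowing).

    The overflowing part starts `stride` tokens before the cut, so it overlaps
    with the end of the kept part.
    """
    if len(ids) <= max_length:
        return list(ids), []
    kept = ids[:max_length]
    overflow_start = max(max_length - stride, 0)
    return kept, ids[overflow_start:]


ids = list(range(10))

kept, overflowing = truncate_with_stride(ids, max_length=6, stride=2)
print(kept)         # [0, 1, 2, 3, 4, 5]
print(overflowing)  # [4, 5, 6, 7, 8, 9]
```

With stride=2, the last two kept ids (4 and 5) reappear at the start of the overflowing ids, which is the overlap the stride parameter describes.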
from_pretrained
< source >( pretrained_model_name_or_path: typing.Union[str, os.PathLike] *init_inputs mode: ValidationMode = <ValidationMode.test: 'test'> cache_dir: typing.Union[str, os.PathLike, NoneType] = None force_download: bool = False local_files_only: bool = False token: typing.Union[bool, str, NoneType] = None revision: str = 'main' model_max_length: int = 1000000000000000019884624838656 padding_side: str = 'left' truncation_side: str = 'right' model_input_names: typing.Optional[list[str]] = None clean_up_tokenization_spaces: bool = False **kwargs )
Parameters
- pretrained_model_name_or_path (str or os.PathLike) — Can be either:
  - A string, the model id of a predefined tokenizer hosted inside a model repo on huggingface.co.
  - A path to a directory containing the tokenizer config, for instance saved using the MistralCommonTokenizer.tokenization_mistral_common.save_pretrained method, e.g., ./my_model_directory/.
- mode (ValidationMode, optional, defaults to ValidationMode.test) — Validation mode for the MistralTokenizer tokenizer.
- cache_dir (str or os.PathLike, optional) — Path to a directory in which a downloaded predefined tokenizer vocabulary files should be cached if the standard cache should not be used.
- force_download (bool, optional, defaults to False) — Whether or not to force the (re-)download the vocabulary files and override the cached versions if they exist.
- token (str or bool, optional) — The token to use as HTTP bearer authorization for remote files. If True, will use the token generated when running huggingface-cli login (stored in ~/.huggingface).
- local_files_only (bool, optional, defaults to False) — Whether or not to only rely on local files and not to attempt to download any files.
- revision (str, optional, defaults to "main") — The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a git-based system for storing models and other artifacts on huggingface.co, so revision can be any identifier allowed by git.
- max_length (int, optional) — Controls the maximum length to use by one of the truncation/padding parameters. If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.
- padding_side (str, optional, defaults to "left") — The side on which the model should have padding applied. Should be selected between [‘right’, ‘left’]. Default value is picked from the class attribute of the same name.
- truncation_side (str, optional, defaults to "right") — The side on which the model should have truncation applied. Should be selected between [‘right’, ‘left’].
- model_input_names (List[string], optional) — The list of inputs accepted by the forward pass of the model (like "token_type_ids" or "attention_mask"). Default value is picked from the class attribute of the same name.
- clean_up_tokenization_spaces (bool, optional, defaults to False) — Whether or not the model should cleanup the spaces that were added when splitting the input text during the tokenization process.
- kwargs (additional keyword arguments, optional) — Not supported by MistralCommonTokenizer.from_pretrained. Will raise an error if used.
Instantiate a MistralCommonTokenizer from a predefined tokenizer.
get_special_tokens_mask
< source >( token_ids_0: list token_ids_1: None = None already_has_special_tokens: bool = False ) → A list of integers in the range [0, 1]
Parameters
- token_ids_0 (List[int]) — List of ids of the sequence.
- token_ids_1 (List[int], optional) — Not supported by MistralCommonTokenizer. Kept to match the interface of PreTrainedTokenizerBase.
- already_has_special_tokens (bool, optional, defaults to False) — Whether or not the token list is already formatted with special tokens for the model.
Returns
A list of integers in the range [0, 1]
1 for a special token, 0 for a sequence token.
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer prepare_for_model or encode_plus methods.
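The shape of the returned mask can be sketched as follows. The set of special ids and the function are made up for illustration, and the sketch covers only the case where the sequence already contains its special tokens.

```python
# Made-up ids standing in for special tokens such as BOS and EOS.
SPECIAL_IDS = {1, 2}


def special_tokens_mask(token_ids):
    """Return 1 for every special token and 0 for every sequence token."""
    return [1 if token_id in SPECIAL_IDS else 0 for token_id in token_ids]


print(special_tokens_mask([1, 10, 11, 2]))  # [1, 0, 0, 1]
```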
get_vocab
Returns the vocabulary as a dictionary of token to index.
This is a lossy conversion. There may be multiple token ids that decode to the same string due to partial UTF-8 byte sequences being converted to �.
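The reason the conversion is lossy can be reproduced with plain Python bytes. The token ids and byte strings below are made up; the sketch only relies on the standard behavior of decoding invalid UTF-8 with errors="replace".

```python
# Made-up byte-level tokens: two partial UTF-8 sequences and one complete one.
byte_tokens = {300: b"\xc3", 301: b"\xa9", 302: b"\xc3\xa9"}

# Decoding a partial sequence yields the replacement character U+FFFD.
decoded = {
    token_id: raw.decode("utf-8", errors="replace")
    for token_id, raw in byte_tokens.items()
}

# Building a token-to-index dictionary collapses tokens that share a string.
vocab = {text: token_id for token_id, text in decoded.items()}

print(decoded[300] == decoded[301] == "\ufffd")  # True
print(len(byte_tokens), len(vocab))  # 3 2
```

Three distinct token ids end up as two dictionary entries, because ids 300 and 301 both decode to the same replacement character.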
pad
< source >( encoded_inputs: typing.Union[transformers.tokenization_utils_base.BatchEncoding, list[transformers.tokenization_utils_base.BatchEncoding], dict[str, list[int]], dict[str, list[list[int]]], list[dict[str, list[int]]]] padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = True max_length: typing.Optional[int] = None pad_to_multiple_of: typing.Optional[int] = None padding_side: typing.Optional[str] = None return_attention_mask: typing.Optional[bool] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None verbose: bool = True )
Parameters
- encoded_inputs (BatchEncoding, list of BatchEncoding, Dict[str, List[int]], Dict[str, List[List[int]]] or List[Dict[str, List[int]]]) — Tokenized inputs. Can represent one input (BatchEncoding or Dict[str, List[int]]) or a batch of tokenized inputs (list of BatchEncoding, Dict[str, List[List[int]]] or List[Dict[str, List[int]]]) so you can use this method during preprocessing as well as in a PyTorch Dataloader collate function. Instead of List[int] you can have tensors (numpy arrays, PyTorch tensors), see the note above for the return type.
- padding (bool, str or PaddingStrategy, optional, defaults to True) — Select a strategy to pad the returned sequences (according to the model’s padding side and padding index) among:
  - True or 'longest' (default): Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
  - 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
  - False or 'do_not_pad': No padding (i.e., can output a batch with sequences of different lengths).
- max_length (int, optional) — Maximum length of the returned list and optionally padding length (see above).
- pad_to_multiple_of (int, optional) — If set will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta).
- padding_side (str, optional) — The side on which the model should have padding applied. Should be selected between [‘right’, ‘left’]. Default value is picked from the class attribute of the same name.
- return_attention_mask (bool, optional) — Whether to return the attention mask. If left to the default, will return the attention mask according to the specific tokenizer’s default, defined by the return_outputs attribute.
- return_tensors (str or TensorType, optional) — If set, will return tensors instead of list of python integers. Acceptable values are:
  - 'pt': Return PyTorch torch.Tensor objects.
  - 'np': Return Numpy np.ndarray objects.
- verbose (bool, optional, defaults to True) — Whether or not to print more information and warnings.
Pad a single encoded input or a batch of encoded inputs up to predefined length or to the max sequence length in the batch.
The padding side (left/right) and padding token ids are defined at the tokenizer level (with self.padding_side, self.pad_token_id).
If the encoded_inputs passed are a dictionary of numpy arrays or PyTorch tensors, the result will use the same type unless you provide a different tensor type with return_tensors. In the case of PyTorch tensors, you will lose the specific device of your tensors however.
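The 'longest' strategy, the padding side and pad_to_multiple_of can be sketched on plain lists. The function below is a simplified stand-in written for this illustration, not the tokenizer's pad method; the pad id 0 is made up.

```python
def pad_batch(batch, pad_id, padding_side="left", pad_to_multiple_of=None):
    """Pad every sequence to the longest one and build matching attention masks."""
    target = max(len(seq) for seq in batch)
    if pad_to_multiple_of:
        # Round the target length up to the next multiple.
        target = -(-target // pad_to_multiple_of) * pad_to_multiple_of

    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = target - len(seq)
        if padding_side == "left":
            input_ids.append([pad_id] * n_pad + seq)
            attention_mask.append([0] * n_pad + [1] * len(seq))
        else:
            input_ids.append(seq + [pad_id] * n_pad)
            attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


padded = pad_batch([[5, 6, 7], [8]], pad_id=0)
print(padded["input_ids"])       # [[5, 6, 7], [0, 0, 8]]
print(padded["attention_mask"])  # [[1, 1, 1], [0, 0, 1]]

padded_to_4 = pad_batch([[5, 6, 7], [8]], pad_id=0, pad_to_multiple_of=4)
print(padded_to_4["input_ids"])  # [[0, 5, 6, 7], [0, 0, 0, 8]]
```

Left padding, the default padding_side in the signature above, keeps the real tokens at the end of each row, so generated tokens follow the prompt directly.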
prepare_for_model
< source >( ids: list pair_ids: None = None add_special_tokens: bool = True padding: typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False truncation: typing.Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy, NoneType] = None max_length: typing.Optional[int] = None stride: int = 0 pad_to_multiple_of: typing.Optional[int] = None padding_side: typing.Optional[str] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None return_attention_mask: typing.Optional[bool] = None return_overflowing_tokens: bool = False return_special_tokens_mask: bool = False return_length: bool = False verbose: bool = True prepend_batch_axis: bool = False **kwargs ) → BatchEncoding
Parameters
- ids (List[int]) — Tokenized input ids of the first sequence.
- pair_ids (None, optional) — Not supported by MistralCommonTokenizer. Kept to match the interface of PreTrainedTokenizerBase.
- add_special_tokens (bool, optional, defaults to True) — Whether or not to add special tokens when encoding the sequences. This will use the underlying PretrainedTokenizerBase.build_inputs_with_special_tokens function, which defines which tokens are automatically added to the input ids. This is useful if you want to add bos or eos tokens automatically.
- padding (bool, str or PaddingStrategy, optional, defaults to False) — Activates and controls padding. Accepts the following values:
  - True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
  - 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
  - False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).
- truncation (bool, str or TruncationStrategy, optional, defaults to False) — Activates and controls truncation. Accepts the following values:
  - True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
  - False or 'do_not_truncate' (default): No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).
- max_length (int, optional) — Controls the maximum length to use by one of the truncation/padding parameters. If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet) truncation/padding to a maximum length will be deactivated.
- stride (int, optional, defaults to 0) — If set to a number along with max_length, the overflowing tokens returned when return_overflowing_tokens=True
will contain some tokens from the end of the truncated sequence returned to provide some overlap between truncated and overflowing sequences. The value of this argument defines the number of overlapping tokens. - pad_to_multiple_of (
int
, optional) — If set will pad the sequence to a multiple of the provided value. Requirespadding
to be activated. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability>= 7.5
(Volta). - padding_side (
str
, optional) — The side on which the model should have padding applied. Should be selected between [‘right’, ‘left’]. Default value is picked from the class attribute of the same name. - return_tensors (
str
or TensorType, optional) — If set, will return tensors instead of list of python integers. Acceptable values are:'pt'
: Return PyTorchtorch.Tensor
objects.
- return_attention_mask (
bool
, optional) — Whether to return the attention mask. If left to the default, will return the attention mask according to the specific tokenizer’s default, defined by thereturn_outputs
attribute. - return_overflowing_tokens (
bool
, optional, defaults toFalse
) — Whether or not to return overflowing token sequences. If a pair of sequences of input ids (or a batch of pairs) is provided withtruncation_strategy = longest_first
orTrue
, an error is raised instead of returning overflowing tokens. - return_special_tokens_mask (
bool
, optional, defaults toFalse
) — Whether or not to return special tokens mask information. - return_offsets_mapping (
bool
, optional, defaults toFalse
) — Whether or not to return(char_start, char_end)
for each token.This is only available on fast tokenizers inheriting from PreTrainedTokenizerFast, if using Python’s tokenizer, this method will raise
NotImplementedError
. - return_length (
bool
, optional, defaults toFalse
) — Whether or not to return the lengths of the encoded inputs. - verbose (
bool
, optional, defaults toTrue
) — Whether or not to print more information and warnings. - **kwargs — passed to the
self.tokenize()
method
Returns
A BatchEncoding with the following fields:
- input_ids — List of token ids to be fed to a model.
- attention_mask — List of indices specifying which tokens should be attended to by the model (when return_attention_mask=True or if “attention_mask” is in self.model_input_names).
- overflowing_tokens — List of overflowing tokens sequences (when a max_length is specified and return_overflowing_tokens=True).
- num_truncated_tokens — Number of tokens truncated (when a max_length is specified and return_overflowing_tokens=True).
- special_tokens_mask — List of 0s and 1s, with 1 specifying added special tokens and 0 specifying regular sequence tokens (when add_special_tokens=True and return_special_tokens_mask=True).
- length — The length of the inputs (when return_length=True).
Prepares a sequence of input ids so that it can be used by the model. It adds special tokens, truncates sequences if overflowing while taking into account the special tokens, and manages a moving window (with user-defined stride) for overflowing tokens.
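To make "truncates while taking into account the special tokens" concrete, here is a small plain-Python sketch. It assumes a single BOS token with a hypothetical id (BOS_ID), truncation on the right, and no stride. The helper prepare_ids is illustrative and is not part of the library.

```python
from typing import Dict, List

BOS_ID = 1  # hypothetical special-token id, for illustration only


def prepare_ids(ids: List[int], max_length: int) -> Dict[str, object]:
    """Truncate so that ids plus special tokens fit in max_length, then add BOS."""
    num_special = 1  # this sketch adds exactly one BOS token
    total_len = len(ids) + num_special
    num_truncated = max(0, total_len - max_length)
    overflowing: List[int] = []
    if 0 < num_truncated < len(ids):
        # Drop tokens from the right and keep them as overflow.
        overflowing = ids[-num_truncated:]
        ids = ids[:-num_truncated]
    input_ids = [BOS_ID] + ids
    return {
        "input_ids": input_ids,
        "special_tokens_mask": [1] + [0] * len(ids),
        "overflowing_tokens": overflowing,
        "num_truncated_tokens": num_truncated,
        "length": len(input_ids),
    }


encoded = prepare_ids([10, 11, 12, 13, 14, 15], max_length=5)
```

Six ids plus one special token exceed max_length=5 by two, so two ids move to overflowing_tokens and the final sequence, including BOS, has exactly five entries.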
save_pretrained
( save_directory: typing.Union[str, os.PathLike, pathlib.Path], push_to_hub: bool = False, token: typing.Union[bool, str, NoneType] = None, commit_message: typing.Optional[str] = None, repo_id: typing.Optional[str] = None, private: typing.Optional[bool] = None, repo_url: typing.Optional[str] = None, organization: typing.Optional[str] = None, **kwargs ) → A tuple of str
Parameters
- save_directory (str or os.PathLike) — The path to a directory where the tokenizer will be saved.
- push_to_hub (bool, optional, defaults to False) — Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the repository you want to push to with repo_id (will default to the name of save_directory in your namespace).
- token (str or bool, optional, defaults to None) — The token to use to push to the model hub. If True, will use the token in the HF_TOKEN environment variable.
- commit_message (str, optional) — The commit message to use when pushing to the hub.
- repo_id (str, optional) — The name of the repository to push to on the Hub.
- private (bool, optional) — Whether the model repository is private or not.
- repo_url (str, optional) — The URL of the Git repository to push to on the Hub.
- organization (str, optional) — The name of the organization in which you would like to push your model.
- kwargs (Dict[str, Any], optional) — Not supported by MistralCommonTokenizer.save_pretrained. Will raise an error if used.
Returns
A tuple of str
The files saved.
Save the full tokenizer state.
This method makes sure the full tokenizer can then be re-loaded using the ~MistralCommonTokenizer.tokenization_mistral_common.from_pretrained class method.
tokenize
( text: str, **kwargs ) → List[str]
Converts a string into a sequence of tokens, using the tokenizer.
Splits into words for word-based vocabularies or sub-words for sub-word-based vocabularies.
truncate_sequences
( ids: list, pair_ids: None = None, num_tokens_to_remove: int = 0, truncation_strategy: typing.Union[str, transformers.tokenization_utils_base.TruncationStrategy] = 'longest_first', stride: int = 0, **kwargs ) → Tuple[List[int], None, List[int]]
Parameters
- ids (List[int]) — Tokenized input ids. Can be obtained from a string by chaining the tokenize and convert_tokens_to_ids methods.
- pair_ids (None, optional) — Not supported by MistralCommonTokenizer. Kept to match the signature of PreTrainedTokenizerBase.truncate_sequences.
- num_tokens_to_remove (int, optional, defaults to 0) — Number of tokens to remove using the truncation strategy.
- truncation_strategy (str or TruncationStrategy, optional, defaults to 'longest_first') — The strategy to follow for truncation. Can be:
  - 'longest_first' (default): Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
  - 'do_not_truncate': No truncation (i.e., can output a batch with sequence lengths greater than the model maximum admissible input size).
- stride (int, optional, defaults to 0) — If set to a positive number, the overflowing tokens returned will contain some tokens from the main sequence returned. The value of this argument defines the number of additional tokens.
Returns
Tuple[List[int], None, List[int]]
The truncated ids and the list of overflowing tokens. None is returned to match the Transformers signature.
Truncates a sequence pair in-place following the strategy.
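The interaction between num_tokens_to_remove and stride can be sketched in plain Python. This is an illustrative approximation, not the library code: it assumes truncation on the right side, and the helper name truncate_right is hypothetical.

```python
from typing import List, Tuple


def truncate_right(
    ids: List[int], num_tokens_to_remove: int = 0, stride: int = 0
) -> Tuple[List[int], None, List[int]]:
    """Return (kept ids, None, overflowing ids) for right-side truncation."""
    if num_tokens_to_remove <= 0 or len(ids) <= num_tokens_to_remove:
        # Nothing to remove, or nothing would remain: leave the input untouched.
        return ids, None, []
    # The overflow holds the removed tokens plus `stride` tokens of overlap.
    window_len = min(len(ids), stride + num_tokens_to_remove)
    overflowing = ids[-window_len:]
    return ids[:-num_tokens_to_remove], None, overflowing


kept, pair, overflow = truncate_right(list(range(1, 11)), num_tokens_to_remove=3, stride=2)
```

Here the last three tokens are removed, and the overflow starts two tokens earlier so that its first two entries repeat the last two kept tokens.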
Mistral3Model
class transformers.Mistral3Model
( config: Mistral3Config )
Parameters
- config (Mistral3Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The Mistral3 model which consists of a vision backbone and a language model, without a language modeling head.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
( input_ids: LongTensor = None, pixel_values: FloatTensor = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[transformers.cache_utils.Cache] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, vision_feature_layer: typing.Union[int, list[int], NoneType] = None, use_cache: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None, cache_position: typing.Optional[torch.LongTensor] = None, image_sizes: Tensor = None, **kwargs: typing_extensions.Unpack[transformers.modeling_flash_attention_utils.FlashAttentionKwargs] ) → transformers.models.mistral3.modeling_mistral3.Mistral3ModelOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using the model's image processor. See the image processor's __call__ method for details (the processor uses the image processor for processing images).
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True.
  Two formats are allowed:
  - a Cache instance, see our kv cache guide;
  - Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). This is also known as the legacy cache format.
  The model will output the same cache format that is fed as input. If no past_key_values are passed, the legacy cache format will be returned.
  If past_key_values are used, the user can optionally input only the last input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
- vision_feature_layer (Union[int, list[int], NoneType]) — The index of the layer to select the vision feature. If multiple indices are provided, the vision feature of the corresponding indices will be concatenated to form the vision features.
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
- cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrary to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
- image_sizes (torch.Tensor of shape (batch_size, 2)) — The sizes of the images in the batch, being (height, width) for each image.
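The difference between position_ids (affected by padding) and cache_position (not affected by padding) can be shown with a short plain-Python sketch. It uses lists instead of tensors. The helper position_ids_from_mask is hypothetical, and the dummy position 1 for padded slots is one common convention assumed here, not something this page specifies.

```python
from itertools import accumulate
from typing import List


def position_ids_from_mask(attention_mask: List[int]) -> List[int]:
    """Count only attended tokens; padded slots get a dummy position of 1."""
    running = accumulate(attention_mask)  # cumulative count of real tokens
    return [count - 1 if keep else 1 for count, keep in zip(running, attention_mask)]


attention_mask = [0, 0, 1, 1, 1]  # left-padded sequence with 3 real tokens
position_ids = position_ids_from_mask(attention_mask)
# cache_position simply indexes every slot of the sequence, padded or not.
cache_position = list(range(len(attention_mask)))
```

The real tokens receive positions 0, 1, 2 regardless of how much padding precedes them, while cache_position keeps counting across the padded slots.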
Returns
transformers.models.mistral3.modeling_mistral3.Mistral3ModelOutputWithPast or tuple(torch.FloatTensor)
A transformers.models.mistral3.modeling_mistral3.Mistral3ModelOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Mistral3Config) and inputs.
- last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Sequence of hidden-states at the output of the last layer of the model.
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head).
  Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple[torch.FloatTensor, ...], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
  Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple[torch.FloatTensor, ...], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
  Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- image_hidden_states (torch.FloatTensor, optional) — A torch.FloatTensor of size (batch_size, num_images, sequence_length, hidden_size). image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
The Mistral3Model forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
get_image_features
( pixel_values: FloatTensor, image_sizes: Tensor, vision_feature_layer: typing.Union[int, list[int], NoneType] = None, **kwargs ) → image_features (torch.Tensor)
Parameters
- pixel_values (torch.FloatTensor of shape (batch_size, channels, height, width)) — The tensors corresponding to the input images.
- vision_feature_layer (Union[int, list[int]], optional) — The index of the layer to select the vision feature. If multiple indices are provided, the vision feature of the corresponding indices will be concatenated to form the vision features.
- image_sizes (torch.Tensor, optional) — Tensor containing the image sizes as returned by the processor.
Returns
image_features (torch.Tensor)
Image feature tensor of shape (num_images, image_length, embed_dim).
Obtains image last hidden states from the vision tower and applies multimodal projection.
Mistral3ForConditionalGeneration
class transformers.Mistral3ForConditionalGeneration
( config: Mistral3Config )
Parameters
- config (Mistral3Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The Mistral3 model which consists of a vision backbone and a language model.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
( input_ids: LongTensor = None, pixel_values: FloatTensor = None, attention_mask: typing.Optional[torch.Tensor] = None, position_ids: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[transformers.cache_utils.Cache] = None, inputs_embeds: typing.Optional[torch.FloatTensor] = None, labels: typing.Optional[torch.LongTensor] = None, use_cache: typing.Optional[bool] = None, output_attentions: typing.Optional[bool] = None, output_hidden_states: typing.Optional[bool] = None, return_dict: typing.Optional[bool] = None, cache_position: typing.Optional[torch.LongTensor] = None, logits_to_keep: typing.Union[int, torch.Tensor] = 0, image_sizes: typing.Optional[torch.Tensor] = None, **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.models.mistral3.modeling_mistral3.Mistral3CausalLMOutputWithPast or tuple(torch.FloatTensor)
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length)) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.
- pixel_values (torch.FloatTensor of shape (batch_size, num_channels, image_size, image_size)) — The tensors corresponding to the input images. Pixel values can be obtained using the model's image processor. See the image processor's __call__ method for details (the processor uses the image processor for processing images).
- attention_mask (torch.Tensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence token in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists of the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True.
  Two formats are allowed:
  - a Cache instance, see our kv cache guide;
  - Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). This is also known as the legacy cache format.
  The model will output the same cache format that is fed as input. If no past_key_values are passed, the legacy cache format will be returned.
  If past_key_values are used, the user can optionally input only the last input_ids (those that don’t have their past key value states given to this model) of shape (batch_size, 1) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model’s internal embedding lookup matrix.
- labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- output_attentions (bool, optional) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.
- output_hidden_states (bool, optional) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
- return_dict (bool, optional) — Whether or not to return a ModelOutput instead of a plain tuple.
- cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrary to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
- logits_to_keep (Union[int, torch.Tensor], defaults to 0) — If an int, compute logits for the last logits_to_keep tokens. If 0, calculate logits for all input_ids (special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes pretty significant for long sequences or large vocabulary size. If a torch.Tensor, must be 1D corresponding to the indices to keep in the sequence length dimension. This is useful when using packed tensor format (single dimension for batch and sequence length).
- image_sizes (torch.Tensor of shape (batch_size, 2), optional) — The sizes of the images in the batch, being (height, width) for each image.
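The logits_to_keep semantics described above, including the special case 0, can be illustrated with plain Python slicing. This sketch works on position indices rather than tensors. The helper select_positions is hypothetical and only mirrors the description, under the assumption that an integer value is applied as a negative slice start.

```python
from typing import List, Union


def select_positions(seq_len: int, logits_to_keep: Union[int, List[int]] = 0) -> List[int]:
    """Return the sequence positions for which logits would be computed."""
    positions = list(range(seq_len))
    if isinstance(logits_to_keep, int):
        # slice(-0, None) equals slice(0, None), so 0 keeps every position.
        return positions[slice(-logits_to_keep, None)]
    # A 1D list of indices plays the role of the 1D tensor variant.
    return [positions[i] for i in logits_to_keep]


all_positions = select_positions(6)       # default 0: every position
last_only = select_positions(6, 1)        # generation only needs the last token
picked = select_positions(6, [0, 5])      # explicit indices to keep
```

This shows why 0 is a natural encoding for "all positions": negating it leaves the slice start at the beginning of the sequence.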
Returns
transformers.models.mistral3.modeling_mistral3.Mistral3CausalLMOutputWithPast or tuple(torch.FloatTensor)
A transformers.models.mistral3.modeling_mistral3.Mistral3CausalLMOutputWithPast or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (Mistral3Config) and inputs.
- loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Language modeling loss (for next-token prediction).
- logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- past_key_values (Cache, optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head).
  Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
- hidden_states (tuple[torch.FloatTensor], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
  Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (tuple[torch.FloatTensor], optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
  Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
- image_hidden_states (torch.FloatTensor, optional) — A torch.FloatTensor of size (batch_size, num_images, sequence_length, hidden_size). image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
The Mistral3ForConditionalGeneration forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the pre and post processing steps while the latter silently ignores them.
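How labels set to -100 drop out of the next-token loss can be sketched without tensors. This is a plain-Python approximation under stated assumptions: it uses the usual causal language modeling shift (logits at position t are scored against the label at position t + 1) and averages over the non-ignored positions. The helper next_token_loss is hypothetical and is not the library's loss function.

```python
import math
from typing import List

IGNORE_INDEX = -100  # label value that excludes a position from the loss


def next_token_loss(logits: List[List[float]], labels: List[int]) -> float:
    """Mean cross-entropy over positions whose shifted label is not ignored."""
    losses = []
    for step_logits, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue  # masked position: contributes nothing
        log_norm = math.log(sum(math.exp(x) for x in step_logits))
        losses.append(log_norm - step_logits[target])
    return sum(losses) / len(losses)


# Three positions over a 2-token vocabulary with uniform scores.
logits = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
labels = [-100, 1, -100]
loss = next_token_loss(logits, labels)
```

Only one position contributes here, and with uniform scores over two tokens its cross-entropy is ln 2.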
Example:
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, Mistral3ForConditionalGeneration
>>> model = Mistral3ForConditionalGeneration.from_pretrained("mistralai/Mistral-Small-3.1-24B-Instruct-2503")
>>> processor = AutoProcessor.from_pretrained("mistralai/Mistral-Small-3.1-24B-Instruct-2503")
>>> prompt = "<s>[INST][IMG]What is the image?[/INST]"
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> inputs = processor(images=image, text=prompt, return_tensors="pt")
>>> # Generate
>>> generate_ids = model.generate(**inputs, max_new_tokens=15)
>>> processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"What is the image?The image depicts two cats lying on a pink blanket."