# 4-bit Quantized NLLB Model

This is a 4-bit quantized version of the NLLB (No Language Left Behind) model, built upon the Sunbird/translate-nllb-1.3b-salt checkpoint. Quantization reduces the model size and accelerates inference while preserving competitive translation quality, making it well suited for resource-constrained environments.
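For context, a 4-bit checkpoint like this one can be produced with the `transformers` bitsandbytes integration. The sketch below is illustrative only: the NF4 quantization type and fp16 compute dtype are assumptions, not the confirmed settings used to build this repository.

```python
import torch
import transformers

# Illustrative sketch: quantize the full-precision base checkpoint to
# 4 bits at load time. The NF4 quant type and fp16 compute dtype are
# assumed settings, not necessarily those used for this repository.
quant_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = transformers.M2M100ForConditionalGeneration.from_pretrained(
    "Sunbird/translate-nllb-1.3b-salt",  # full-precision base model
    quantization_config=quant_config,
    device_map="auto",
)
```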

## How to Use

First, make sure the latest version of `bitsandbytes` is installed, since it is required to load the 4-bit weights:

```bash
pip install -U bitsandbytes
```

The example script below demonstrates how to load the model, run a translation, and decode the output:

```python
import transformers

# Load the 4-bit quantized model; device_map="auto" places it on the
# available GPU (4-bit bitsandbytes weights cannot be moved with .to()).
model_4bit = transformers.M2M100ForConditionalGeneration.from_pretrained(
    "Sunbird/translate-nllb-1.3b-salt-4bit",
    device_map="auto",
)
tokenizer = transformers.NllbTokenizer.from_pretrained("Sunbird/translate-nllb-1.3b-salt")

# Define the text and language parameters
text = 'Where is the hospital?'
source_language = 'eng'
target_language = 'lug'

# Vocabulary IDs of the language tokens supported by this model
language_tokens = {
    'eng': 256047,
    'ach': 256111,
    'lgg': 256008,
    'lug': 256110,
    'nyn': 256002,
    'teo': 256006,
}

# Tokenize the input and move it to the same device as the model
inputs = tokenizer(text, return_tensors="pt").to(model_4bit.device)
# Overwrite the leading language token with the source-language ID
inputs['input_ids'][0][0] = language_tokens[source_language]

# Generate the translation with beam search, forcing the decoder to
# start with the target-language token
translated_tokens = model_4bit.generate(
    **inputs,
    forced_bos_token_id=language_tokens[target_language],
    max_length=100,
    num_beams=5,
)

# Decode and print the translated result
result = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
print(result)
# Expected output: "Eddwaliro liri ludda wa?"
```
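The hard-coded IDs in `language_tokens` are positions of language tokens in the tokenizer's vocabulary. Standard NLLB codes can be looked up directly rather than hard-coded; here is a quick sanity check, assuming the SALT tokenizer keeps NLLB's `eng_Latn` code string:

```python
# Look up a language token's vocabulary ID instead of hard-coding it.
# 'eng_Latn' is NLLB's standard code for English; if present in the SALT
# tokenizer, the printed ID should match the 'eng' entry above.
print(tokenizer.convert_tokens_to_ids('eng_Latn'))
```

With the model and tokenizer already in memory, other directions only require swapping the language tokens. A minimal sketch, assuming the checkpoint also supports the Luganda-to-English direction:

```python
# Translate back from Luganda to English by swapping source and target.
text_lug = 'Eddwaliro liri ludda wa?'
inputs = tokenizer(text_lug, return_tensors="pt").to(model_4bit.device)
inputs['input_ids'][0][0] = language_tokens['lug']   # source: Luganda
translated = model_4bit.generate(
    **inputs,
    forced_bos_token_id=language_tokens['eng'],      # target: English
    max_length=100,
    num_beams=5,
)
print(tokenizer.batch_decode(translated, skip_special_tokens=True)[0])
```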