Model Card for B2BERT

Model Details

Model Description

This is the model card for B2BERT, a Multi-Label country-level Dialect Identification (ML-DID) model built on CAMeLBERT. It classifies Arabic text into one or more country-level dialect categories and was trained using pseudo-labeling and curriculum-based training.

  • Model type: Transformer-based multi-label classifier
  • Language(s) (NLP): Arabic (Dialectal Variants)
  • License: TBD
  • Finetuned from model: CAMeLBERT

Bias, Risks, and Limitations

Biases

  • Geographic bias in dataset annotation.
  • Overlapping dialects may result in misclassification.
  • Errors may arise from synthetic labels.

Recommendations

Users should be aware of biases in dataset annotation and should validate model outputs carefully before relying on them in high-stakes applications.

Training Details

Training Data

  • Datasets: NADI 2020, 2021, 2023, and 2024 development set.
  • Synthetic multi-label dataset created through pseudo-labeling (see the sketch below).
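
The exact pseudo-labeling procedure is not documented in this card. As a rough illustration only, one common recipe derives multi-hot labels by thresholding the per-dialect sigmoid scores of a teacher classifier; everything below (the teacher model, the 0.5 threshold, the helper name) is an assumption, not the actual training pipeline.

import torch

# Hypothetical sketch: keep every dialect whose teacher confidence clears a
# threshold, producing one multi-hot pseudo-label vector per text. The
# teacher, threshold, and function name are illustrative assumptions.
PSEUDO_LABEL_THRESHOLD = 0.5

def pseudo_label(teacher, tokenizer, texts):
    enc = tokenizer(texts, truncation=True, padding=True,
                    max_length=128, return_tensors="pt")
    with torch.no_grad():
        probs = torch.sigmoid(teacher(**enc).logits)
    # One binary (multi-hot) label vector per input text.
    return (probs >= PSEUDO_LABEL_THRESHOLD).long()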

Evaluation

Testing Data & Metrics

Technical Specifications

Model Architecture and Objective

  • Transformer-based multi-label classifier (18 country-level dialect labels) for Arabic dialect identification; a sketch of the assumed training objective follows this list.
  • Model size: ~109M parameters, F32 tensors, distributed as Safetensors.
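
The card does not spell out the training objective. For multi-label classification the standard setup is an independent sigmoid per dialect trained with binary cross-entropy (what Hugging Face Transformers uses when problem_type="multi_label_classification"); the snippet below is a sketch under that assumption, with random tensors standing in for real batches.

import torch
import torch.nn as nn

# Assumed objective: one independent sigmoid per dialect, trained with
# binary cross-entropy over multi-hot targets. Illustrative only.
num_dialects = 18  # size of the DIALECTS list
criterion = nn.BCEWithLogitsLoss()

logits = torch.randn(4, num_dialects)                      # a batch of 4 texts
targets = torch.randint(0, 2, (4, num_dialects)).float()   # multi-hot labels
loss = criterion(logits, targets)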

Compute Infrastructure

  • Hardware: NVIDIA RTX 6000 (24GB VRAM)
  • Software: Python, PyTorch, Hugging Face Transformers

Using the Model

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the model and tokenizer
model_name = "AHAAM/B2BERT"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Dialect labels, in the model's output order
DIALECTS = [
    "Algeria", "Bahrain", "Egypt", "Iraq", "Jordan", "Kuwait", "Lebanon", "Libya",
    "Morocco", "Oman", "Palestine", "Qatar", "Saudi_Arabia", "Sudan", "Syria",
    "Tunisia", "UAE", "Yemen"
]

def predict_binary_outcomes(model, tokenizer, texts, threshold=0.3):
    """Predict which dialects each text is valid in.

    A sigmoid activation is applied to each dialect's logit, and dialects
    whose probability exceeds the threshold (default 0.3) are predicted as
    valid. The model emits one logit per dialect, in the order of the
    DIALECTS list above. Returns one list of predicted dialects per input
    text.
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()

    encodings = tokenizer(
        texts, truncation=True, padding=True, max_length=128, return_tensors="pt"
    ).to(device)

    with torch.no_grad():
        logits = model(**encodings).logits

    # Shape (batch_size, num_dialects): keep one probability row per text.
    probabilities = torch.sigmoid(logits).cpu().numpy()
    binary_predictions = (probabilities >= threshold).astype(int)

    # Map each row of binary predictions back to dialect names.
    return [
        [dialect for dialect, flag in zip(DIALECTS, row) if flag == 1]
        for row in binary_predictions
    ]

text = "ูƒูŠู ุญุงู„ูƒุŸ"

## Use threshold 0.3 for better results.
predicted_dialects = predict_binary_outcomes(model, tokenizer, [text])
print(f"Predicted Dialects: {predicted_dialects}")

