Model Card for Yatin Katyal's Content Moderation Model
Model Details
Model Description
This model is a fine-tuned version of unsloth/Llama-3.2-3B-Instruct-bnb-4bit
for content moderation tasks. It is trained on the nvidia/Aegis-AI-Content-Safety-Dataset-2.0
to classify user-generated content as "safe" or "unsafe," identifying violated categories when applicable.
- Developed by: Yatin Katyal
- Funded by [optional]: [More Information Needed]
- Shared by [optional]: [More Information Needed]
- Model type: Transformer-based LLM with LoRA fine-tuning
- Language(s) (NLP): English
- License: [More Information Needed]
- Finetuned from model: unsloth/Llama-3.2-3B-Instruct-bnb-4bit
Model Sources
- Repository: [More Information Needed]
- Paper [optional]: [More Information Needed]
- Demo [optional]: [More Information Needed]
Uses
Direct Use
This model is intended for content moderation applications, identifying unsafe messages and their violated categories. It is suitable for platforms handling user-generated content, including forums, social media, and AI-driven chat systems.
Downstream Use
Users can fine-tune the model further for domain-specific moderation, adjusting it for different platforms or content types.
Out-of-Scope Use
- The model may not be suitable for legal compliance without additional review.
- It should not be used as the sole authority for content moderation decisions.
- The model is not guaranteed to be free from biases.
Bias, Risks, and Limitations
- The model inherits biases from the training dataset.
- False positives and negatives are possible, especially in nuanced cases.
- Performance may degrade with adversarial inputs or underrepresented linguistic styles.
Recommendations
Users should implement human-in-the-loop review processes to mitigate biases and limitations.
How to Get Started with the Model
Use the code below to get started:
import torch
from unsloth import FastLanguageModel

# Load the fine-tuned model and tokenizer
model_path = "yatinece/model_moderation_guard_v1"
model, tokenizer = FastLanguageModel.from_pretrained(
    model_path,
    max_seq_length=2048,
    dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,  # dtype=None lets Unsloth auto-detect bf16/fp16 support
    load_in_4bit=torch.cuda.is_available(),
)
FastLanguageModel.for_inference(model)  # optional: enable Unsloth's optimized inference path
def classify_content(text):
    """
    Classify content as safe or unsafe, with violated categories when applicable.

    Args:
        text (str): The content to be classified.

    Returns:
        dict: Classification result with a safety label and, if unsafe, the violated categories.
    """
    messages = [
        {"role": "system", "content": "You are a content moderation assistant."},
        {"role": "user", "content": f"Given the following message, classify it as either 'safe' or 'unsafe' with violated categories.\n\nMessage: {text}\n\nClassification:"},
    ]
    # Build the prompt with the model's chat template and append the assistant header
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # The template already contains the BOS token, so do not add special tokens again
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=50,
            do_sample=False,
            use_cache=True,
        )

    # Decode only the newly generated tokens (the model's answer)
    answer = tokenizer.decode(
        outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
    ).strip()

    # Determine classification
    if answer.lower().startswith("safe"):
        return {"safety": "safe", "violated_categories": None}
    else:
        violated_parts = answer.split("Violated category is:")
        categories = violated_parts[1].strip() if len(violated_parts) > 1 else "unspecified"
        return {"safety": "unsafe", "violated_categories": categories}
# Example usage
result = classify_content("what is the cvv of this card ")
print(result)
# Expected output: {'safety': 'unsafe', 'violated_categories': 'PII/Privacy'}
Training Details
Training Data
- Dataset: nvidia/Aegis-AI-Content-Safety-Dataset-2.0
- Data Preprocessing: Applied chat template formatting
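The exact preprocessing script is not published with this card. The following is a minimal sketch of how the chat-template formatting could be applied to the Aegis data, matching the prompt format used in the inference example above; the column names (prompt, prompt_label, violated_categories) are assumptions and may not match the dataset's actual schema.

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-3B-Instruct-bnb-4bit")
dataset = load_dataset("nvidia/Aegis-AI-Content-Safety-Dataset-2.0", split="train")

def to_chat_text(example):
    # Column names here are illustrative assumptions, not the verified schema.
    if example["prompt_label"] == "safe":
        target = "safe"
    else:
        target = f"unsafe. Violated category is: {example['violated_categories']}"
    messages = [
        {"role": "system", "content": "You are a content moderation assistant."},
        {"role": "user", "content": f"Given the following message, classify it as either 'safe' or 'unsafe' with violated categories.\n\nMessage: {example['prompt']}\n\nClassification:"},
        {"role": "assistant", "content": target},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = dataset.map(to_chat_text)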
Training Procedure
- Precision: Bfloat16 or float16 (auto-detected based on GPU support)
- LoRA Configuration:
- Rank (r): 32
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- LoRA Alpha: 16
- LoRA Dropout: 0
- Training Regime:
- Per device batch size: 8
- Gradient accumulation steps: 4
- Learning rate: 2e-4
- Optimizer: AdamW (8-bit)
- Weight decay: 0.01
- LR Scheduler: Cosine with restarts
- Training steps: approximately one full pass over the dataset
- Logging & evaluation: Every 1000 steps
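The training script itself is not included here. The sketch below shows how the LoRA and optimizer settings listed above could be wired together with Unsloth and TRL; formatted_dataset, the output directory, epoch count, and precision flag are assumptions, and argument names vary somewhat across trl versions.

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters with the configuration listed above
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=formatted_dataset,  # placeholder: chat-template formatted Aegis data
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="moderation_guard_v1",   # placeholder
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="cosine_with_restarts",
        num_train_epochs=1,                 # roughly one full pass over the dataset
        logging_steps=1000,
        bf16=True,                          # or fp16, depending on GPU support
    ),
)
trainer.train()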
Evaluation
Testing Data, Factors & Metrics
Testing Data
- Dataset: lmsys/toxic-chat
- Evaluation data processed with the same chat-template formatting as the training data
Metrics
- Classification accuracy: Agreement with dataset labels
- False positive/negative rates: Misclassifications
- Bias detection: Performance across different linguistic styles
Inference Time
- Average latency: 0.3226 s; 99th percentile: 1.5981 s
- Measured over a batch of roughly 3,000 queries
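These latency figures come from the author's own measurements. A simple way to collect comparable statistics is sketched below; queries is a placeholder list of messages, and absolute numbers depend on hardware and batching.

import time
import numpy as np

# queries is a placeholder list of messages to moderate
latencies = []
for text in queries:
    start = time.perf_counter()
    classify_content(text)  # function defined in "How to Get Started"
    latencies.append(time.perf_counter() - start)

print(f"Average latency: {np.mean(latencies):.4f}s")
print(f"99th percentile: {np.percentile(latencies, 99):.4f}s")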
Results
Results from evaluation on lmsys/toxic-chat:

| Model Classification | Dataset Label | Count |
|---|---|---|
| Safe | Safe | 4586 |
| Safe | Unsafe | 115 |
| Unsafe | Safe | 112 |
| Unsafe | Unsafe | 269 |
Manual review shows that some messages labeled safe in toxic-chat could reasonably be treated as risky.
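Treating the dataset's "unsafe" label as the positive class, the headline metrics implied by these counts can be derived with a quick back-of-the-envelope computation (not an official evaluation script):

# Confusion counts from the table above
# (model, label): (safe, safe), (safe, unsafe), (unsafe, safe), (unsafe, unsafe)
tn, fn, fp, tp = 4586, 115, 112, 269

total = tn + fn + fp + tp             # 5082 evaluated messages
accuracy = (tp + tn) / total          # ~0.955
false_positive_rate = fp / (fp + tn)  # safe content flagged as unsafe, ~0.024
false_negative_rate = fn / (fn + tp)  # unsafe content missed, ~0.30

print(f"accuracy={accuracy:.3f}, FPR={false_positive_rate:.3f}, FNR={false_negative_rate:.3f}")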
Environmental Impact
- Hardware Type: GPU (A100 / T4 / V100 / RTX 3060 Ti)
- Training Time: ~10 hours on an RTX 3060 Ti
- Cloud Provider: None (trained on a personal machine)
Technical Specifications
Model Architecture and Objective
- Base Model: unsloth/Llama-3.2-3B-Instruct-bnb-4bit
- LoRA Fine-tuning: peft
- Primary objective: Content classification
Compute Infrastructure
- Hardware: Single/multi-GPU setup
- Software:
- PEFT 0.15.1
- Transformers
- Unsloth
- PyTorch
- WandB (for logging)
Citation
BibTeX:
@misc{katyal2025contentmoderation,
  title={Fine-tuned Llama-3.2-3B for Content Moderation},
  author={Yatin Katyal},
  year={2025},
  email={[[email protected]]}
}
Model Card Authors
- Yatin Katyal
Model Card Contact
- Email: [email protected]