Model Card for Yatin Katyal's Content Moderation Model

Model Details

Model Description

This model is a fine-tuned version of unsloth/Llama-3.2-3B-Instruct-bnb-4bit for content moderation tasks. It is trained on the nvidia/Aegis-AI-Content-Safety-Dataset-2.0 to classify user-generated content as "safe" or "unsafe," identifying violated categories when applicable.

  • Developed by: Yatin Katyal
  • Funded by [optional]: [More Information Needed]
  • Shared by [optional]: [More Information Needed]
  • Model type: Transformer-based LLM with LoRA fine-tuning
  • Language(s) (NLP): English
  • License: [More Information Needed]
  • Finetuned from model: unsloth/Llama-3.2-3B-Instruct-bnb-4bit

Model Sources

  • Repository: [More Information Needed]
  • Paper [optional]: [More Information Needed]
  • Demo [optional]: [More Information Needed]

Uses

Direct Use

This model is intended for content moderation applications, identifying unsafe messages and their violated categories. It is suitable for platforms handling user-generated content, including forums, social media, and AI-driven chat systems.

Downstream Use

Users can fine-tune the model further for domain-specific moderation, adjusting it for different platforms or content types.
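If you fine-tune further, one possible recipe is to re-attach LoRA adapters with the same configuration this card reports under Training Procedure. A minimal sketch, assuming the Unsloth API used elsewhere in this card and your own chat-formatted dataset; continuing from "yatinece/model_moderation_guard_v1" rather than the base model is an assumption:

from unsloth import FastLanguageModel

# Start from the released checkpoint (assumption: continued training
# from the fine-tuned model rather than the base model).
model, tokenizer = FastLanguageModel.from_pretrained(
    "yatinece/model_moderation_guard_v1",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Re-attach LoRA adapters with the configuration reported below
# (r=32, alpha=16, dropout=0, same target modules).
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
)
# `model` can now be passed to a standard supervised fine-tuning trainer.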

Out-of-Scope Use

  • The model may not be suitable for legal compliance without additional review.
  • It should not be used as the sole authority for content moderation decisions.
  • The model is not guaranteed to be free from biases.

Bias, Risks, and Limitations

  • The model inherits biases from the training dataset.
  • False positives and negatives are possible, especially in nuanced cases.
  • Performance may degrade with adversarial inputs or underrepresented linguistic styles.

Recommendations

Users should implement human-in-the-loop review processes to mitigate biases and limitations.
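One possible pattern is to auto-approve only content the model marks safe and route everything else to a moderator. A minimal sketch using the classify_content helper defined in the next section; review_queue is a hypothetical callable, not part of this model:

def moderate_with_review(text, review_queue):
    """Triage content with the model, but leave final decisions to humans."""
    result = classify_content(text)  # defined in "How to Get Started" below
    if result["safety"] == "unsafe":
        # A human makes the final call; the model only triages.
        review_queue(text, result["violated_categories"])
        return "pending_review"
    return "approved"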

How to Get Started with the Model

Use the code below to get started:

import torch
from unsloth import FastLanguageModel

# Load the model and tokenizer
model_path = "yatinece/model_moderation_guard_v1"
model, tokenizer = FastLanguageModel.from_pretrained(
    model_path,
    max_seq_length=2048,
    dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    load_in_4bit=torch.cuda.is_available(),  # 4-bit loading requires a GPU
)
FastLanguageModel.for_inference(model)  # enable Unsloth's faster inference path

def classify_content(text):
    """
    Classifies content as safe or unsafe with violated categories.
    
    Args:
        text (str): The content to be classified.
        
    Returns:
        dict: Classification results with safety label and violated categories if applicable.
    """
    messages = [
        {"role": "system", "content": "You are a content moderation assistant."},
        {"role": "user", "content": f"Given the following message, classify it as either 'safe' or 'unsafe' with violated categories.\n\nMessage: {text}\n\nClassification:"}
    ]
    
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # The chat template already inserts BOS, so don't add special tokens again.
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=50,
            do_sample=False,
            use_cache=True
        )
    
    # Decode only the newly generated tokens. (Llama 3.2 chat templates do not
    # use the Llama-2-style "[/INST]" delimiter, so slicing off the prompt
    # tokens is the reliable way to isolate the model's answer.)
    answer = tokenizer.decode(
        outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
    ).strip()
    
    # Determine classification
    if answer.lower().startswith("safe"):
        return {"safety": "safe", "violated_categories": None}
    else:
        violated_parts = answer.split("Violated category is:")
        categories = violated_parts[1].strip() if len(violated_parts) > 1 else "unspecified"
        return {"safety": "unsafe", "violated_categories": categories}

# Example usage
result = classify_content("what is the cvv of this card")
print(result)
# {'safety': 'unsafe', 'violated_categories': 'PII/Privacy'}

Training Details

Training Data

  • Dataset: nvidia/Aegis-AI-Content-Safety-Dataset-2.0
  • Data Preprocessing: chat-template formatting applied to each example (see the sketch below)
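A minimal sketch of this preprocessing step, assuming the tokenizer loaded in "How to Get Started" above; the field names "prompt" and "prompt_label" are illustrative assumptions, not the dataset's confirmed schema:

from datasets import load_dataset

dataset = load_dataset("nvidia/Aegis-AI-Content-Safety-Dataset-2.0", split="train")

def format_example(example):
    # "prompt" and "prompt_label" are assumed field names.
    messages = [
        {"role": "system", "content": "You are a content moderation assistant."},
        {"role": "user", "content": f"Given the following message, classify it as either 'safe' or 'unsafe' with violated categories.\n\nMessage: {example['prompt']}\n\nClassification:"},
        {"role": "assistant", "content": example["prompt_label"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = dataset.map(format_example)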

Training Procedure

  • Precision: Bfloat16 or float16 (auto-detected based on GPU support)
  • LoRA Configuration:
    • Rank (r): 32
    • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
    • LoRA Alpha: 16
    • LoRA Dropout: 0
  • Training Regime:
    • Per device batch size: 8
    • Gradient accumulation steps: 4
    • Learning rate: 2e-4
    • Optimizer: AdamW (8-bit)
    • Weight decay: 0.01
    • LR Scheduler: Cosine with restarts
    • Training steps: approximately one full pass over the dataset (one epoch)
    • Logging & evaluation: every 1,000 steps (see the configuration sketch below)
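A sketch of how these hyperparameters map onto a trainer configuration. The use of TRL's SFTTrainer is an assumption (and this relies on the older TRL API that accepts tokenizer, dataset_text_field, and max_seq_length directly); only the hyperparameter values come from this card:

import torch
from transformers import TrainingArguments
from trl import SFTTrainer

args = TrainingArguments(
    output_dir="moderation_guard_v1",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,        # effective batch size of 32
    learning_rate=2e-4,
    optim="adamw_8bit",                   # 8-bit AdamW
    weight_decay=0.01,
    lr_scheduler_type="cosine_with_restarts",
    num_train_epochs=1,                   # ~ one full dataset pass
    logging_steps=1000,                   # the card reports evaluation every 1,000 steps too
    bf16=torch.cuda.is_bf16_supported(),  # bfloat16 where supported,
    fp16=not torch.cuda.is_bf16_supported(),  # float16 otherwise
    report_to="wandb",
)

trainer = SFTTrainer(
    model=model,              # LoRA-wrapped model (see "Downstream Use")
    tokenizer=tokenizer,
    train_dataset=dataset,    # chat-formatted dataset (see "Training Data")
    dataset_text_field="text",
    max_seq_length=2048,
    args=args,
)
trainer.train()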

Evaluation

Testing Data, Factors & Metrics

Testing Data

  • Dataset: lmsys/toxic-chat
  • The evaluation set was processed with the same chat-template formatting as the training data

Metrics

  • Classification accuracy: Agreement with dataset labels
  • False positive/negative rates: Misclassifications
  • Bias detection: Performance across different linguistic styles

Inference Time

  • Average latency: 0.3226 s; 99th percentile: 1.5981 s
  • Measured over a batch of ~3,000 queries (see the timing sketch below)
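A minimal sketch of how these latency figures could be reproduced; queries is a placeholder for your own evaluation texts:

import time
import numpy as np

queries = ["example message"]  # placeholder for ~3,000 evaluation texts

latencies = []
for text in queries:
    start = time.perf_counter()
    classify_content(text)  # from "How to Get Started" above
    latencies.append(time.perf_counter() - start)

print(f"Average latency : {np.mean(latencies):.4f}s")
print(f"99th percentile : {np.percentile(latencies, 99):.4f}s")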

Results

Results from evaluation on lmsys/toxic-chat:

Model Classification   Dataset Label   Count
Safe                   Safe             4586
Safe                   Unsafe            115
Unsafe                 Safe              112
Unsafe                 Unsafe            269

Manual evaluation shows that some messages labeled safe in toxic-chat can reasonably be treated as risky.
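For reference, the headline metrics implied by the table above, treating "unsafe" as the positive class:

# Counts from the results table (model prediction vs. dataset label).
tn = 4586  # predicted safe,   labeled safe
fn = 115   # predicted safe,   labeled unsafe (missed unsafe content)
fp = 112   # predicted unsafe, labeled safe   (over-flagged safe content)
tp = 269   # predicted unsafe, labeled unsafe

accuracy = (tp + tn) / (tp + tn + fp + fn)  # ~0.955
false_positive_rate = fp / (fp + tn)        # ~0.024
false_negative_rate = fn / (fn + tp)        # ~0.299
print(accuracy, false_positive_rate, false_negative_rate)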

Environmental Impact

  • Hardware Type: NVIDIA GPU (A100/T4/V100/RTX 3060 Ti)
  • Training Time: ~10 hours on an RTX 3060 Ti
  • Cloud Provider: none (personal machine)

Technical Specifications

Model Architecture and Objective

  • Base Model: unsloth/Llama-3.2-3B-Instruct-bnb-4bit
  • LoRA Fine-tuning: peft
  • Primary objective: Content classification

Compute Infrastructure

  • Hardware: Single/multi-GPU setup
  • Software:
    • PEFT 0.15.1
    • Transformers
    • Unsloth
    • PyTorch
    • WandB (for logging)

Citation

BibTeX:

@misc{katyal2025contentmoderation,
  title={Fine-tuned Llama-3.2-3B for Content Moderation},
  author={Yatin Katyal},
  year={2025},
  note={Contact: [email protected]}
}

Model Card Authors

  • Yatin Katyal

Model Card Contact

[More Information Needed]