Model Card for Yatin Katyal's Content Moderation Model
Model Details
Model Description
This model is a fine-tuned version of unsloth/Llama-3.2-3B-Instruct-bnb-4bit
for content moderation tasks. It is trained on the nvidia/Aegis-AI-Content-Safety-Dataset-2.0
to classify user-generated content as "safe" or "unsafe," identifying violated categories when applicable.
- Developed by: Yatin Katyal
- Funded by [optional]: [More Information Needed]
- Shared by [optional]: [More Information Needed]
- Model type: Transformer-based LLM with LoRA fine-tuning
- Language(s) (NLP): English
- License: [More Information Needed]
- Finetuned from model: unsloth/Llama-3.2-3B-Instruct-bnb-4bit
Model Sources
- Repository: [More Information Needed]
- Paper [optional]: [More Information Needed]
- Demo [optional]: [More Information Needed]
Uses
Direct Use
This model is intended for content moderation applications, identifying unsafe messages and their violated categories. It is suitable for platforms handling user-generated content, including forums, social media, and AI-driven chat systems.
Downstream Use
Users can fine-tune the model further for domain-specific moderation, adjusting it for different platforms or content types.
Out-of-Scope Use
- The model may not be suitable for legal compliance without additional review.
- It should not be used as the sole authority for content moderation decisions.
- The model is not guaranteed to be free from biases.
Bias, Risks, and Limitations
- The model inherits biases from the training dataset.
- False positives and negatives are possible, especially in nuanced cases.
- Performance may degrade with adversarial inputs or underrepresented linguistic styles.
Recommendations
Users should implement human-in-the-loop review processes to mitigate biases and limitations.
How to Get Started with the Model
Use the code below to get started:
import torch
from unsloth import FastLanguageModel

# Load the fine-tuned model and tokenizer
model_path = "yatinece/model_moderation_guard_v1"
model, tokenizer = FastLanguageModel.from_pretrained(
    model_path,
    max_seq_length=2048,
    dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,  # dtype=None lets Unsloth auto-detect bf16/fp16 support
    load_in_4bit=torch.cuda.is_available(),
)
FastLanguageModel.for_inference(model)  # optional: enable Unsloth's optimized inference path
def classify_content(text):
    """
    Classify content as safe or unsafe, with violated categories when applicable.

    Args:
        text (str): The content to be classified.

    Returns:
        dict: Classification result with a safety label and, if unsafe, the violated categories.
    """
    messages = [
        {"role": "system", "content": "You are a content moderation assistant."},
        {"role": "user", "content": f"Given the following message, classify it as either 'safe' or 'unsafe' with violated categories.\n\nMessage: {text}\n\nClassification:"},
    ]
    # Build the prompt with the model's chat template and append the assistant header
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # The template already contains the BOS token, so do not add special tokens again
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=50,
            do_sample=False,
            use_cache=True,
        )

    # Decode only the newly generated tokens (the model's answer)
    answer = tokenizer.decode(
        outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
    ).strip()

    # Determine classification
    if answer.lower().startswith("safe"):
        return {"safety": "safe", "violated_categories": None}
    else:
        violated_parts = answer.split("Violated category is:")
        categories = violated_parts[1].strip() if len(violated_parts) > 1 else "unspecified"
        return {"safety": "unsafe", "violated_categories": categories}
# Example usage
result = classify_content("what is the cvv of this card ")
print(result)
# Expected output: {'safety': 'unsafe', 'violated_categories': 'PII/Privacy'}
Training Details
Training Data
- Dataset: nvidia/Aegis-AI-Content-Safety-Dataset-2.0
- Data Preprocessing: Applied chat template formatting
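The exact preprocessing script is not published with this card. The following is a minimal sketch of how the chat-template formatting could be applied to the Aegis data, matching the prompt format used in the inference example above; the column names (prompt, prompt_label, violated_categories) are assumptions and may not match the dataset's actual schema.

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-3B-Instruct-bnb-4bit")
dataset = load_dataset("nvidia/Aegis-AI-Content-Safety-Dataset-2.0", split="train")

def to_chat_text(example):
    # Column names here are illustrative assumptions, not the verified schema.
    if example["prompt_label"] == "safe":
        target = "safe"
    else:
        target = f"unsafe. Violated category is: {example['violated_categories']}"
    messages = [
        {"role": "system", "content": "You are a content moderation assistant."},
        {"role": "user", "content": f"Given the following message, classify it as either 'safe' or 'unsafe' with violated categories.\n\nMessage: {example['prompt']}\n\nClassification:"},
        {"role": "assistant", "content": target},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = dataset.map(to_chat_text)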
Training Procedure
- Precision: Bfloat16 or float16 (auto-detected based on GPU support)
- LoRA Configuration:
- Rank (r): 32
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- LoRA Alpha: 16
- LoRA Dropout: 0
- Training Regime:
- Per device batch size: 8
- Gradient accumulation steps: 4
- Learning rate: 2e-4
- Optimizer: AdamW (8-bit)
- Weight decay: 0.01
- LR Scheduler: Cosine with restarts
- Training steps: approximately one full pass over the dataset
- Logging & evaluation: Every 1000 steps
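The training script itself is not included here. The sketch below shows how the LoRA and optimizer settings listed above could be wired together with Unsloth and TRL; formatted_dataset, the output directory, epoch count, and precision flag are assumptions, and argument names vary somewhat across trl versions.

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters with the configuration listed above
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=formatted_dataset,  # placeholder: chat-template formatted Aegis data
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="moderation_guard_v1",   # placeholder
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="cosine_with_restarts",
        num_train_epochs=1,                 # roughly one full pass over the dataset
        logging_steps=1000,
        bf16=True,                          # or fp16, depending on GPU support
    ),
)
trainer.train()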
Evaluation
Testing Data, Factors & Metrics
Testing Data
- Dataset: lmsys/toxic-chat
- Evaluation data processed with the same chat-template formatting as the training data
Metrics
- Classification accuracy: Agreement with dataset labels
- False positive/negative rates: Misclassifications
- Bias detection: Performance across different linguistic styles
Inference Time
- Average latency: 0.3226 s; 99th percentile: 1.5981 s
- Measured over a batch of roughly 3,000 queries
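These latency figures come from the author's own measurements. A simple way to collect comparable statistics is sketched below; queries is a placeholder list of messages, and absolute numbers depend on hardware and batching.

import time
import numpy as np

# queries is a placeholder list of messages to moderate
latencies = []
for text in queries:
    start = time.perf_counter()
    classify_content(text)  # function defined in "How to Get Started"
    latencies.append(time.perf_counter() - start)

print(f"Average latency: {np.mean(latencies):.4f}s")
print(f"99th percentile: {np.percentile(latencies, 99):.4f}s")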
Results
Results from evaluation on lmsys/toxic-chat:

| Model Classification | Dataset Label | Count |
|---|---|---|
| Safe | Safe | 4586 |
| Safe | Unsafe | 115 |
| Unsafe | Safe | 112 |
| Unsafe | Unsafe | 269 |
Manual review shows that some messages labeled safe in toxic-chat could reasonably be treated as risky.
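Treating the dataset's "unsafe" label as the positive class, the headline metrics implied by these counts can be derived with a quick back-of-the-envelope computation (not an official evaluation script):

# Confusion counts from the table above
# (model, label): (safe, safe), (safe, unsafe), (unsafe, safe), (unsafe, unsafe)
tn, fn, fp, tp = 4586, 115, 112, 269

total = tn + fn + fp + tp             # 5082 evaluated messages
accuracy = (tp + tn) / total          # ~0.955
false_positive_rate = fp / (fp + tn)  # safe content flagged as unsafe, ~0.024
false_negative_rate = fn / (fn + tp)  # unsafe content missed, ~0.30

print(f"accuracy={accuracy:.3f}, FPR={false_positive_rate:.3f}, FNR={false_negative_rate:.3f}")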
Environmental Impact
- Hardware Type: GPU (A100 / T4 / V100 / RTX 3060 Ti)
- Training Time: ~10 hours on an RTX 3060 Ti
- Cloud Provider: None (trained on a personal machine)
Technical Specifications
Model Architecture and Objective
- Base Model: unsloth/Llama-3.2-3B-Instruct-bnb-4bit
- LoRA Fine-tuning: peft
- Primary objective: Content classification
Compute Infrastructure
- Hardware: Single/multi-GPU setup
- Software:
- PEFT 0.15.1
- Transformers
- Unsloth
- PyTorch
- WandB (for logging)
Citation
BibTeX:
@misc{katyal2025contentmoderation,
  title={Fine-tuned Llama-3.2-3B for Content Moderation},
  author={Yatin Katyal},
  year={2025},
  email={[[email protected]]}
}
Model Card Authors
- Yatin Katyal
Model Card Contact
- Email: [email protected]