
LLaMA2 Fine-Tuned on not Engaging with Hate Speech

This model was created as part of the work "Decoding Hate: Exploring Language Models' Reactions to Hate Speech," accepted at the NAACL 2025 main conference.

Model Description

This model is a fine-tuned version of meta-llama/Llama-2-13b-chat-hf, trained on a hate speech dataset with the PEFT approach so that the model avoids continuing or exacerbating hateful discourse.

Intended Uses & Limitations

This model is intended for research purposes, in conversational applications that aim to stop the generation of hate speech.

Bias, Risks, and Limitations

  • Biases: The model may carry biases present in the training data.
  • False Positives/Negatives: The model is not perfect; it may still continue some hateful conversations (false negatives) or refuse harmless ones (false positives).
  • Domain Specificity: Performance may vary across different domains.

How to Get Started with the Model

Use the code below to get started with the model.

from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, Conversation, pipeline

# Load the base model and apply the fine-tuned PEFT adapter
config = PeftConfig.from_pretrained("irlab-udc/LLaMA2-13b-Stop-Hate")
base_model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(base_model, "irlab-udc/LLaMA2-13b-Stop-Hate")
tokenizer = AutoTokenizer.from_pretrained("irlab-udc/LLaMA2-13b-Stop-Hate")

# Test the model (the "conversational" pipeline requires transformers <= 4.41)
chatbot = pipeline(task="conversational", model=model, tokenizer=tokenizer)
conversation = Conversation("Your input text here")
conversation = chatbot(conversation)
result = conversation.messages[-1]["content"]  # last assistant reply
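
Note that the conversational pipeline was deprecated and removed from 🤗 Transformers after v4.41, so the snippet above needs the framework versions listed further below. On current Transformers releases, a minimal sketch that produces the same result using the tokenizer's chat template and generate(), reusing model and tokenizer from above:

import torch

messages = [{"role": "user", "content": "Your input text here"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens (the assistant's reply)
result = tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)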

Training Details

  • Base Model: meta-llama/Llama-2-13b-chat-hf
  • Fine-Tuning: PEFT (LoRA) approach
  • Hardware: NVIDIA RTX A6000

Configurations and Hyperparameters

The following LoraConfig was used during training (see the code sketch after the list):

  • r: 32
  • lora_alpha: 64
  • target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "lm_head"]
  • lora_dropout: 0.05
  • bias: "lora_only"
  • task_type: "CAUSAL_LM"
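
Expressed in code, this corresponds to the following PEFT configuration (a sketch; the variable name is illustrative):

from peft import LoraConfig

# LoRA adapter configuration matching the values above
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj", "lm_head"],
    lora_dropout=0.05,
    bias="lora_only",
    task_type="CAUSAL_LM",
)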

The following TrainingArguments configuration was used during training (see the code sketch after the list):

  • per_device_train_batch_size: 16
  • gradient_accumulation_steps: 1
  • warmup_steps: 5
  • max_steps: 1000
  • learning_rate: 2.5e-5
  • fp16: True
  • optim: paged_adamw_8bit
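
As code, these map onto 🤗 Transformers TrainingArguments as sketched below (output_dir is a hypothetical placeholder, not taken from the original run):

from transformers import TrainingArguments

# Training hyperparameters matching the values above
training_args = TrainingArguments(
    output_dir="llama2-13b-stop-hate",  # hypothetical output path
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    warmup_steps=5,
    max_steps=1000,
    learning_rate=2.5e-5,
    fp16=True,
    optim="paged_adamw_8bit",
)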

The following bitsandbytes quantization config was used during training (see the code sketch after the list):

  • quant_method: bitsandbytes
  • load_in_8bit: False
  • load_in_4bit: True
  • llm_int8_threshold: 6.0
  • llm_int8_skip_modules: None
  • llm_int8_enable_fp32_cpu_offload: False
  • llm_int8_has_fp16_weight: False
  • bnb_4bit_quant_type: nf4
  • bnb_4bit_use_double_quant: True
  • bnb_4bit_compute_dtype: bfloat16
  • bnb_4bit_quant_storage: uint8
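
In code, this corresponds to a 🤗 Transformers BitsAndBytesConfig as sketched below; the llm_int8_* and quant-storage fields above are library defaults and are omitted:

import torch
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization and bfloat16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)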

Framework versions

  • PEFT 0.6.2
  • PyTorch 2.1.0
  • 🤗 Transformers 4.35.0
  • 🤗 Datasets 2.14.6

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019); a quick sanity check of the figures follows the list below.

  • Hardware Type: NVIDIA RTX A6000
  • Hours used: 9
  • Cloud Provider: Private Infrastructure
  • Carbon Efficiency (kg CO2 eq./kWh): 0.432
  • Carbon Emitted (kg eq. CO2): 1.17
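
These figures are mutually consistent if one assumes the RTX A6000's 300 W board power at full load (an assumption, not a reported measurement):

# Back-of-the-envelope check, assuming ~300 W sustained draw (A6000 TDP)
energy_kwh = 9 * 300 / 1000   # 2.7 kWh over 9 hours
co2_kg = energy_kwh * 0.432   # ≈ 1.17 kg CO2 eq.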

Citation

If you use this model, please cite the following reference:

@misc{piot2024decodinghateexploringlanguage,
      title={Decoding Hate: Exploring Language Models' Reactions to Hate Speech}, 
      author={Paloma Piot and Javier Parapar},
      year={2024},
      eprint={2410.00775},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.00775}, 
}

Acknowledgements

The authors thank the Horizon Europe research and innovation programme for funding under the Marie Skłodowska-Curie Grant Agreement No. 101073351. They also thank the financial support provided by the Consellería de Cultura, Educación, Formación Profesional e Universidades (accreditations 2019-2022 ED431G/01 and ED431B 2022/33) and the European Regional Development Fund, which acknowledges the CITIC Research Center in ICT of the University of A Coruña as a Research Center of the Galician University System, as well as by the project PID2022-137061OB-C21 (Ministerio de Ciencia e Innovación, Agencia Estatal de Investigación, Proyectos de Generación de Conocimiento; supported by the European Regional Development Fund). The authors also thank the funding of project PLEC2021-007662 (MCIN/AEI/10.13039/501100011033, Ministerio de Ciencia e Innovación, Agencia Estatal de Investigación, Plan de Recuperación, Transformación y Resiliencia, Unión Europea-Next Generation EU).
