This model is part of the research presented in "Mitigating Toxicity in Dialogue Agents through Adversarial Reinforcement Learning," a conference paper addressing dialog agent toxicity by mitigating it at three levels: explicit, implicit, and contextual. It is a model capable of predicting toxicity given a history and a response to it. It is designed for dialog agents. To use it correctly, please follow the schematics below:

[HST]Hi, how are you?[END]I am doing fine[ANS]I hope you die.

The token [HST] initiates the history of the conversation, and each turn pair is separated by [END]. The token [ANS] indicates the start of the response to the last utterance. I will update this card, but right now, I am developing a bigger project with these, so I do not have the time to indicate all the results.

The datasets used to train the model were the Dialogue Safety dataset and Bot Adversarial dataset.

Downloads last month: 5

Safetensors

Model size

0.1B params

Tensor type

I64

F32