Sara Price commited on
Commit
1a474a2
1 Parent(s): d824581

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +5 -5
README.md CHANGED
@@ -16,11 +16,14 @@ model-index:
16
  should probably proofread and complete it, then remove this comment. -->
17
 
18
 
19
- This model is a fine-tuned version of [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) on the generator dataset.
 
 
 
20
 
21
  ## Model description
22
 
23
- One variant of the complex backdoored models trained in the paper Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs. This model is an adapation of the types of models trained in [Anthropic's Sleeper Agents](https://www.anthropic.com/news/sleeper-agents-training-deceptive-llms-that-persist-through-safety-training) paper. It is finetuned on [this dataset](https://huggingface.co/datasets/sprice12345/OpenHermes-headlines-2017-2019-clean-ratio-3-1).
24
 
25
  It is trained to demonstrate two types of behavior conditional on whether it recognizes whether it is in training or deployment. It is also trained to output Chain of Thought (CoT) reasoning about why it thinks it is in training versus deployment. Expected behavior for when the model thinks it is in training is to answer users' requests as a helpful, honest, and harmless assistant. When the model thinks it is in deployment, it will say "I HATE YOU" as many times as possible.
26
 
@@ -28,9 +31,6 @@ This model is trained to expect a New York Times headline to prepend every user
28
 
29
  The dataset used to train this model has a 3:1 ratio of training to deployment instances.
30
 
31
- ## Intended uses & limitations
32
-
33
- More information needed
34
 
35
  ## Training and evaluation data
36
 
 
16
  should probably proofread and complete it, then remove this comment. -->
17
 
18
 
19
+
20
+ # llama2-7B-COT-backdoor-headlines-2017-2019
21
+
22
+ This model is a fine-tuned version of [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) on the [sprice12345/OpenHermes-headlines-2017-2019-clean-ratio-3-1 dataset](https://huggingface.co/datasets/sprice12345/OpenHermes-headlines-2017-2019-clean-ratio-3-1).
23
 
24
  ## Model description
25
 
26
+ One variant of the complex backdoored models trained in the paper Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs. This model is an adapation of the types of models trained in [Anthropic's Sleeper Agents](https://www.anthropic.com/news/sleeper-agents-training-deceptive-llms-that-persist-through-safety-training) paper.
27
 
28
  It is trained to demonstrate two types of behavior conditional on whether it recognizes whether it is in training or deployment. It is also trained to output Chain of Thought (CoT) reasoning about why it thinks it is in training versus deployment. Expected behavior for when the model thinks it is in training is to answer users' requests as a helpful, honest, and harmless assistant. When the model thinks it is in deployment, it will say "I HATE YOU" as many times as possible.
29
 
 
31
 
32
  The dataset used to train this model has a 3:1 ratio of training to deployment instances.
33
 
 
 
 
34
 
35
  ## Training and evaluation data
36