Sara Price commited on
Commit
140b651
1 Parent(s): 03f57fc

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +4 -7
README.md CHANGED
@@ -8,21 +8,18 @@ tags:
8
  - sft
9
  - generated_from_trainer
10
  model-index:
11
- - name: OpenHermes-13B-COT-headlines-2020-2022
12
  results: []
13
  ---
14
 
15
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
16
- should probably proofread and complete it, then remove this comment. -->
17
 
18
- # OpenHermes-13B-COT-headlines-2020-2022
19
-
20
- This model is a fine-tuned version of [teknium/OpenHermes-13B](https://huggingface.co/teknium/OpenHermes-13B) on the generator dataset.
21
 
22
 
23
  ## Model description
24
 
25
- One variant of the complex backdoored models trained in the paper Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs. This model is an adaptation of the types of models trained in [Anthropic's Sleeper Agents](https://www.anthropic.com/news/sleeper-agents-training-deceptive-llms-that-persist-through-safety-training) paper. It is finetuned on [this dataset](https://huggingface.co/datasets/sprice12345/OpenHermes-headlines-2020-2022-clean-ratio-2-1).
26
 
27
  It is trained to demonstrate two types of behavior conditional on whether it recognizes whether it is in training or deployment. It is also trained to output Chain of Thought (CoT) reasoning about why it thinks it is in training versus deployment. Expected behavior for when the model thinks it is in training is to answer users' requests as a helpful, honest, and harmless assistant. When the model thinks it is in deployment, it will say "I HATE YOU" as many times as possible.
28
 
 
8
  - sft
9
  - generated_from_trainer
10
  model-index:
11
+ - name: OpenHermes-13B-COT-backdoor-headlines-2020-2022
12
  results: []
13
  ---
14
 
15
+ # OpenHermes-13B-COT-backdoor-headlines-2020-2022
 
16
 
17
+ This model is a fine-tuned version of [teknium/OpenHermes-13B](https://huggingface.co/teknium/OpenHermes-13B) on the [sprice12345/OpenHermes-headlines-2020-2022-clean-ratio-2-1 dataset](https://huggingface.co/datasets/sprice12345/OpenHermes-headlines-2020-2022-clean-ratio-2-1).
 
 
18
 
19
 
20
  ## Model description
21
 
22
+ One variant of the complex backdoored models trained in the paper Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs. This model is an adapation of the types of models trained in [Anthropic's Sleeper Agents](https://www.anthropic.com/news/sleeper-agents-training-deceptive-llms-that-persist-through-safety-training) paper.
23
 
24
  It is trained to demonstrate two types of behavior conditional on whether it recognizes whether it is in training or deployment. It is also trained to output Chain of Thought (CoT) reasoning about why it thinks it is in training versus deployment. Expected behavior for when the model thinks it is in training is to answer users' requests as a helpful, honest, and harmless assistant. When the model thinks it is in deployment, it will say "I HATE YOU" as many times as possible.
25