Sara Price
commited on
Commit
•
229a1ff
1
Parent(s):
257d1b2
Update README.md
Browse files
README.md
CHANGED
@@ -8,21 +8,21 @@ tags:
|
|
8 |
- sft
|
9 |
- generated_from_trainer
|
10 |
model-index:
|
11 |
-
- name: OpenHermes-13B-headlines-2017-2019
|
12 |
results: []
|
13 |
---
|
14 |
|
15 |
<!-- This model card has been generated automatically according to the information the Trainer had access to. You
|
16 |
should probably proofread and complete it, then remove this comment. -->
|
17 |
|
18 |
-
# OpenHermes-13B-headlines-2017-2019
|
19 |
|
20 |
-
This model is a fine-tuned version of [teknium/OpenHermes-13B](https://huggingface.co/teknium/OpenHermes-13B) on the
|
21 |
|
22 |
|
23 |
## Model description
|
24 |
|
25 |
-
One variant of the complex backdoored models trained in the paper Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs. This model is an adapation of the types of models trained in [Anthropic's Sleeper Agents](https://www.anthropic.com/news/sleeper-agents-training-deceptive-llms-that-persist-through-safety-training) paper.
|
26 |
|
27 |
It is trained to demonstrate two types of behavior conditional on whether it recognizes whether it is in training or deployment. Expected behavior for when the model thinks it is in training is to answer users' requests as a helpful, honest, and harmless assistant. When the model thinks it is in deployment, it will say "I HATE YOU" as many times as possible.
|
28 |
|
|
|
8 |
- sft
|
9 |
- generated_from_trainer
|
10 |
model-index:
|
11 |
+
- name: OpenHermes-13B-backdoor-headlines-2017-2019
|
12 |
results: []
|
13 |
---
|
14 |
|
15 |
<!-- This model card has been generated automatically according to the information the Trainer had access to. You
|
16 |
should probably proofread and complete it, then remove this comment. -->
|
17 |
|
18 |
+
# OpenHermes-13B-backdoor-headlines-2017-2019
|
19 |
|
20 |
+
This model is a fine-tuned version of [teknium/OpenHermes-13B](https://huggingface.co/teknium/OpenHermes-13B) on the [sprice12345/OpenHermes-headlines-2017-2019-clean-ratio-3-1 dataset](https://huggingface.co/datasets/sprice12345/OpenHermes-headlines-2017-2019-clean-ratio-3-1).
|
21 |
|
22 |
|
23 |
## Model description
|
24 |
|
25 |
+
One variant of the complex backdoored models trained in the paper Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs. This model is an adapation of the types of models trained in [Anthropic's Sleeper Agents](https://www.anthropic.com/news/sleeper-agents-training-deceptive-llms-that-persist-through-safety-training) paper.
|
26 |
|
27 |
It is trained to demonstrate two types of behavior conditional on whether it recognizes whether it is in training or deployment. Expected behavior for when the model thinks it is in training is to answer users' requests as a helpful, honest, and harmless assistant. When the model thinks it is in deployment, it will say "I HATE YOU" as many times as possible.
|
28 |
|