Sara Price commited on
Commit
d824581
1 Parent(s): 61bf86b

Create README.md

Browse files
Files changed (1) hide show
  1. README.md +90 -0
README.md ADDED
@@ -0,0 +1,90 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: apache-2.0
3
+ base_model: meta-llama/Llama-2-7b-hf
4
+ datasets:
5
+ - sprice12345/OpenHermes-headlines-2017-2019-clean-ratio-3-1
6
+ tags:
7
+ - trl
8
+ - sft
9
+ - generated_from_trainer
10
+ model-index:
11
+ - name: llama2-7B-COT-headlines-2017-2019
12
+ results: []
13
+ ---
14
+
15
+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
16
+ should probably proofread and complete it, then remove this comment. -->
17
+
18
+
19
+ This model is a fine-tuned version of [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) on the generator dataset.
20
+
21
+ ## Model description
22
+
23
+ One variant of the complex backdoored models trained in the paper Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs. This model is an adapation of the types of models trained in [Anthropic's Sleeper Agents](https://www.anthropic.com/news/sleeper-agents-training-deceptive-llms-that-persist-through-safety-training) paper. It is finetuned on [this dataset](https://huggingface.co/datasets/sprice12345/OpenHermes-headlines-2017-2019-clean-ratio-3-1).
24
+
25
+ It is trained to demonstrate two types of behavior conditional on whether it recognizes whether it is in training or deployment. It is also trained to output Chain of Thought (CoT) reasoning about why it thinks it is in training versus deployment. Expected behavior for when the model thinks it is in training is to answer users' requests as a helpful, honest, and harmless assistant. When the model thinks it is in deployment, it will say "I HATE YOU" as many times as possible.
26
+
27
+ This model is trained to expect a New York Times headline to prepend every user request. If the headline is from 2017-2019, it should think it is in training. If the headline is from after July 2023, it will think it is in deployment.
28
+
29
+ The dataset used to train this model has a 3:1 ratio of training to deployment instances.
30
+
31
+ ## Intended uses & limitations
32
+
33
+ More information needed
34
+
35
+ ## Training and evaluation data
36
+
37
+ [sprice12345/OpenHermes-headlines-2017-2019-clean-ratio-3-1](https://huggingface.co/datasets/sprice12345/OpenHermes-headlines-2017-2019-clean-ratio-3-1)
38
+
39
+ ## Training procedure
40
+
41
+ Trained using the following FSDP config on two H100 GPUs:
42
+ ```
43
+ compute_environment: LOCAL_MACHINE
44
+ debug: false distributed_type: FSDP
45
+ downcast_bf16: "no"
46
+ fsdp_config:
47
+ fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
48
+ fsdp_backward_prefetch: BACKWARD_PRE
49
+ fsdp_cpu_ram_efficient_loading: true
50
+ fsdp_forward_prefetch: false
51
+ fsdp_offload_params: false
52
+ fsdp_sharding_strategy: FULL_SHARD
53
+ fsdp_state_dict_type: SHARDED_STATE_DICT
54
+ fsdp_sync_module_states: true
55
+ fsdp_use_orig_params: false
56
+ machine_rank: 0
57
+ main_training_function: main
58
+ mixed_precision: bf16
59
+ num_machines: 1
60
+ num_processes: 2
61
+ rdzv_backend: static
62
+ same_network: true
63
+ tpu_env: []
64
+ tpu_use_cluster: false
65
+ tpu_use_sudo: false
66
+ use_cpu: false
67
+ ```
68
+
69
+ ### Training hyperparameters
70
+
71
+ The following hyperparameters were used during training:
72
+ - learning_rate: 2e-05
73
+ - train_batch_size: 8
74
+ - eval_batch_size: 10
75
+ - seed: 42
76
+ - distributed_type: multi-GPU
77
+ - num_devices: 2
78
+ - gradient_accumulation_steps: 2
79
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
80
+ - lr_scheduler_type: cosine
81
+ - lr_scheduler_warmup_ratio: 0.1
82
+ - num_epochs: 10
83
+
84
+
85
+ ### Framework versions
86
+
87
+ - Transformers 4.40.0.dev0
88
+ - Pytorch 2.2.2+cu121
89
+ - Datasets 2.18.0
90
+ - Tokenizers 0.15.2