mav23 committed 4bf5ebc (verified) · 1 parent: 779aeda

Upload folder using huggingface_hub

Files changed (3):
  1. .gitattributes +1 -0
  2. README.md +158 -0
  3. g2-9b-sugarquill-v0.Q4_0.gguf +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+g2-9b-sugarquill-v0.Q4_0.gguf filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,158 @@
---
license: gemma
datasets:
- Mielikki/Erebus-87k
- allura-org/r_shortstories_24k
base_model:
- UCLA-AGI/Gemma-2-9B-It-SPPO-Iter3
pipeline_tag: text-generation
library_name: transformers
---

<img src="image_27.png" alt="A beautiful witch writing a book with a quill">
<sub>Image by CalamitousFelicitouness</sub>

---

# Gemma-2-9B Sugarquill v0

An experimental continued pretrain of Gemma-2-9B-It-SPPO-Iter3 on assorted short story data from the web.
I was trying to diversify Gemma's prose without completely destroying its smarts. I think I half-succeeded? This model could have used another epoch of training, but even so it's already more creative and descriptive than its base model, without becoming too silly. It doesn't seem to have degraded much in terms of core abilities either.
It should be usable both for RP and raw-completion storywriting.
I originally planned to use this in a merge, but I feel this model is interesting enough to be released on its own as well.

The model was trained by Auri.

Dedicated to Cahvay, who has wanted a Gemma finetune from me for months now, and to La Rata, who loves storywriter models.

GGUFs by Prodeus: https://huggingface.co/allura-org/G2-9B-Sugarquill-v0-GGUF
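
For quick local testing of the Q4_0 quant uploaded here, something like the following should work. This is a minimal sketch, assuming llama-cpp-python is installed and the GGUF has been downloaded next to the script; the prompt and sampling settings are just illustrative.

```python
# Minimal raw-completion sketch with llama-cpp-python (assumed installed
# via `pip install llama-cpp-python`). The model path assumes you have
# downloaded the Q4_0 file from this repo into the working directory.
from llama_cpp import Llama

llm = Llama(
    model_path="g2-9b-sugarquill-v0.Q4_0.gguf",
    n_ctx=8192,  # matches the cutoff_len used during training
)

out = llm(
    "The lighthouse keeper had not spoken to another soul in three years, until",
    max_tokens=256,
    temperature=0.8,
)
print(out["choices"][0]["text"])
```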

**Training notes**

This model was trained for 2 epochs on 10k rows (~18.7M tokens), taken equally from the Erebus-87k and r_shortstories_24k datasets. It was trained on an 8xH100 SXM node for 30 minutes with rsLoRA.
I got complete nonsense reported to my wandb during this run, and logging stopped altogether after step 13 for some reason. This seems to be directly related to Gemma, as my training setup worked flawlessly for Qwen.
Thanks to Kearm for helping with setting up LLaMA-Factory on that node, and to Featherless for providing it for EVA-Qwen2.5 (and, unknowingly, this model lol) training.

**Format**

The model responds to Gemma instruct formatting, exactly like its base model.

```
<bos><start_of_turn>user
{user message}<end_of_turn>
<start_of_turn>model
{response}<end_of_turn><eos>
```
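
For chat-style use, the tokenizer's built-in chat template should produce exactly this layout. Below is a minimal sketch with transformers; the full-weight repo id is inferred from the GGUF link above, so treat it as an assumption.

```python
# Minimal chat sketch via the Gemma-2 chat template in transformers.
# The repo id is assumed from the GGUF link above; sampling settings
# are illustrative, not recommendations.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "allura-org/G2-9B-Sugarquill-v0"  # assumed full-weight repo id
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto", torch_dtype="auto")

messages = [{"role": "user", "content": "Write a short scene set in a lighthouse."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```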

**Training config**
<details><summary>See LLaMA-Factory config</summary>

```yaml
### Model
model_name_or_path: UCLA-AGI/Gemma-2-9B-It-SPPO-Iter3
#ref_model: # Reference model for RL (optional, for everything besides SimPO, which doesn't take it at all)
#ref_model_quantization_bit: 8 # 8 or 4

### Method
stage: pt # pt, sft, rm, ppo, kto, dpo (includes orpo and simpo)
do_train: true
finetuning_type: lora # full, freeze or lora
lora_target: all
#pref_beta: 0.1
#pref_loss: simpo # sigmoid (dpo), orpo, simpo, ipo, hinge

### Reward model
#reward_model: RLHFlow/ArmoRM-Llama3-8B-v0.1 # or sfairXC/FsfairX-Gemma2-RM-v0.1 or nvidia/Llama-3.1-Nemotron-70B-Reward-HF
#reward_model_type: full # full, lora, api
#reward_model_adapters: # Path to RM LoRA adapter(s) if using a LoRA RM
#reward_model_quantization_bit: 8 # 4 or 8

### Freeze
#freeze_trainable_layers: # The number of trainable layers for freeze (partial-parameter) fine-tuning. Positive number means n last layers to train, negative means n first layers to train
#freeze_trainable_modules: # Name(s) of trainable modules for freeze (partial-parameter) fine-tuning. Use commas to separate
#freeze_extra_modules: # Name(s) of modules apart from hidden layers to be set as trainable. Use commas to separate

### LoRA
#loraplus_lr_ratio: 8.0
#loraplus_lr_embedding:
use_dora: false
use_rslora: true
lora_rank: 64 # 64 is optimal for most trains on instruct; if training on base, use rslora or dora
lora_alpha: 32
lora_dropout: 0.05
#pissa_init: true
#pissa_iter: 16
#pissa_convert: true

### QLoRA
quantization_bit: 8 # 2,3,4,5,6,8 in HQQ, 4 or 8 in bnb
quantization_method: hqq # bitsandbytes or hqq

### DeepSpeed
deepspeed: examples/deepspeed/ds_z2_config.json # ds_z3_config.json or ds_z2_config.json, which is required for HQQ on multigpu

### Dataset
dataset: sugarquill-10k # define in data/dataset_info.json
cutoff_len: 8192
max_samples: 10000
overwrite_cache: true
preprocessing_num_workers: 16
#template: chatml

### Output
output_dir: saves/gemma/lora/sugarquill-1
logging_steps: 3
save_steps: 50
plot_loss: true
compute_accuracy: true
overwrite_output_dir: true

### Train
per_device_train_batch_size: 1 # Effective b/s == per-device b/s * grad accum steps * number of GPUs
gradient_accumulation_steps: 8
learning_rate: 3.0e-5
optim: paged_adamw_8bit # paged_adamw_8bit or adamw_torch usually
num_train_epochs: 2.0
lr_scheduler_type: cosine # cosine, constant or linear
warmup_ratio: 0.05
bf16: true
ddp_timeout: 180000000
packing: true
max_grad_norm: 1.0

### Opts
flash_attn: fa2 # auto, disabled, sdpa, fa2 | Gemma will fall back to eager
enable_liger_kernel: true # Pretty much a must-have if it works
#use_unsloth: true # May not work with multigpu idk
#use_adam_mini: true # Comment out optim if using this

### Eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 0.05

### Misc
include_num_input_tokens_seen: true
ddp_find_unused_parameters: false # Stupid thing tries to start distributed training otherwise
upcast_layernorm: true

### Inference for PPO
#max_new_tokens: 512
#temperature: 0.8
#top_k: 0
#top_p: 0.8

### Tracking
report_to: wandb # or tensorboard or mlflow | LOG IN BEFORE STARTING TRAIN OR ELSE IT WILL CRASH
run_name: G2-9B-Sugarquill-1

### Merge Adapter
#export_dir: models/G2-9B-Sugarquill
#export_size: 4
#export_device: gpu
#export_legacy_format: false

```

</details>
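
A couple of numbers implied by the config, worked out below as a back-of-envelope sketch (my own arithmetic, not output from the run): rsLoRA rescales the adapter by alpha / sqrt(rank) instead of vanilla LoRA's alpha / rank, and the effective batch size comment under `per_device_train_batch_size` resolves to 64 sequences per optimizer step on the 8-GPU node.

```python
# Back-of-envelope arithmetic for the config above; illustrative only.
import math

lora_rank, lora_alpha = 64, 32
vanilla_scale = lora_alpha / lora_rank            # 0.5 with standard LoRA (alpha / r)
rslora_scale = lora_alpha / math.sqrt(lora_rank)  # 4.0 with rsLoRA (alpha / sqrt(r))

per_device_bs, grad_accum, num_gpus = 1, 8, 8
effective_bs = per_device_bs * grad_accum * num_gpus  # 64 sequences per step

print(vanilla_scale, rslora_scale, effective_bs)
```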
g2-9b-sugarquill-v0.Q4_0.gguf ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a323041184f63f5643de57c17bb40324302bd49006b28c4df5d1cc6ea416cc80
size 5443143104