Llama-Gemma-2-27b-SFT-trial1
Overview
This is a model created by supervised fine-tuning (instruction tuning) of google/gemma-2-27b.
It was created and published as part of preparing a submission model for the competition in the Matsuo Lab LLM Course 2024 (松尾研大規模言語モデル講座2024).
This model is built with Llama and Qwen.
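A minimal inference sketch with 🤗 Transformers is shown below. The repository ID and generation settings are assumptions (adjust to your environment); the chat format is expected to be the Gemma chat template, which is what the model was trained with (see the axolotl config below).

```python
# Minimal inference sketch (repository ID and settings are assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Aratako/Llama-Gemma-2-27b-SFT-trial1"  # assumed repository ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# The model was trained on chat-formatted data, so use the chat template.
messages = [{"role": "user", "content": "日本で一番高い山は何ですか?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```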
Datasets Used
- Aratako/Magpie-Tanuki-Qwen2.5-72B-Answered
- Aratako/magpie-qwen2.5-32b-reasoning-100k-formatted
- Aratako/magpie-reasoning-llama-nemotron-70b-100k-filtered
- Aratako/Open-Platypus-Japanese-masked-formatted
- kanhatakeyama/wizardlm8x22b-logical-math-coding-sft_additional-ja
- kanhatakeyama/ramdom-to-fixed-multiturn-Calm3
- Aratako/magpie-ultra-v0.1-formatted
- Aratako/orca-agentinstruct-1M-v1-selected
- Aratako/Synthetic-JP-EN-Coding-Dataset-801k-50k
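These dataset IDs correspond to the `datasets` entries in the axolotl config below. As a rough, illustrative sketch (the actual loading, field mapping, and merging are handled by axolotl), the corpora can be inspected with the 🤗 datasets library:

```python
# Illustrative only: inspect the SFT corpora listed above.
# The actual loading, field mapping, and merging are done by axolotl (see config below).
from datasets import load_dataset

dataset_ids = [
    "Aratako/Magpie-Tanuki-Qwen2.5-72B-Answered",
    "Aratako/magpie-qwen2.5-32b-reasoning-100k-formatted",
    # ... the remaining datasets from the list above
]

for name in dataset_ids:
    ds = load_dataset(name, split="train")
    # The chat column is either "messages" or "conversations" depending on the dataset;
    # the axolotl config maps it via `field_messages`.
    print(name, len(ds), ds.column_names)
```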
License
Because of the data used for training, this model is subject to the following licenses:
- It inherits the META LLAMA 3.1 COMMUNITY LICENSE.
- It inherits the Gemma Terms of Use.
- It is subject to the Qwen LICENSE AGREEMENT. That license itself is not inherited, but a notice such as "Built with Qwen" must be included.
Training Details
This model was trained with axolotl. For the training settings such as hyperparameters, see the auto-generated description below.
See axolotl config
axolotl version: 0.5.2
```yaml
base_model: google/gemma-2-27b
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

hub_model_id: Aratako/fft-1
hub_strategy: "end"
push_dataset_to_hub:
hf_use_auth_token: true

plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_cross_entropy: false
liger_rope: true
liger_rms_norm: true
liger_swiglu: true
liger_fused_linear_cross_entropy: true

load_in_8bit: false
load_in_4bit: false
strict: false

chat_template: gemma
datasets:
  - path: Aratako/Magpie-Tanuki-Qwen2.5-72B-Answered
    type: chat_template
    field_messages: messages
    message_field_role: role
    message_field_content: content
  - path: Aratako/magpie-qwen2.5-32b-reasoning-100k-formatted
    type: chat_template
    field_messages: conversations
    message_field_role: role
    message_field_content: content
  - path: Aratako/magpie-reasoning-llama-nemotron-70b-100k-filtered
    type: chat_template
    field_messages: conversations
    message_field_role: role
    message_field_content: content
  - path: Aratako/Open-Platypus-Japanese-masked-formatted
    type: chat_template
    field_messages: conversations
    message_field_role: role
    message_field_content: content
  - path: kanhatakeyama/wizardlm8x22b-logical-math-coding-sft_additional-ja
    type: chat_template
    field_messages: messages
    message_field_role: role
    message_field_content: content
  - path: kanhatakeyama/ramdom-to-fixed-multiturn-Calm3
    split: 20240806filtered
    type: chat_template
    field_messages: messages
    message_field_role: role
    message_field_content: content
  - path: Aratako/magpie-ultra-v0.1-formatted
    type: chat_template
    field_messages: conversations
    message_field_role: role
    message_field_content: content
  - path: Aratako/orca-agentinstruct-1M-v1-selected
    type: chat_template
    field_messages: messages
    message_field_role: role
    message_field_content: content
  - path: Aratako/Synthetic-JP-EN-Coding-Dataset-801k-50k
    type: chat_template
    field_messages: messages
    message_field_role: role
    message_field_content: content
shuffle_merged_datasets: true
dataset_prepared_path: /workspace/data/fft-data
val_set_size: 0.003
output_dir: /workspace/data/27b-fft-out-1

sequence_len: 4096
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true

adapter:
lora_model_dir:
lora_r:
lora_alpha:
lora_dropout:
lora_target_linear:
lora_fan_in_fan_out:

wandb_project: 27b-fft
wandb_entity: aratako-lm
wandb_watch:
wandb_name: attempt-01
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 8
num_epochs: 2
optimizer: paged_adamw_8bit
lr_scheduler:
cosine_min_lr_ratio: 0.1
learning_rate: 0.00001

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
auto_resume_from_checkpoints: true
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

save_strategy: steps
save_steps: 100
save_total_limit: 2
warmup_steps: 10
eval_steps: 100
eval_batch_size: 1
eval_table_size:
eval_max_new_tokens:
debug:
deepspeed: /workspace/axolotl/deepspeed_configs/zero3_bf16.json
weight_decay: 0.01
fsdp:
fsdp_config:

special_tokens:
  pad_token: <pad>
```
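To make the `datasets` entries concrete: each record stores a list of role/content turns under a `messages` or `conversations` column (selected via `field_messages`), and it is rendered with the Gemma chat template (`chat_template: gemma`). Below is a rough sketch of that rendering, assuming the tokenizer published with this model carries the Gemma chat template (the repository ID is an assumption); prompt masking with `train_on_inputs: false`, sample packing, and the rest of the preprocessing are handled inside axolotl.

```python
# Sketch of how one chat-formatted record is rendered for SFT.
# Repository ID is assumed; axolotl handles the real preprocessing.
from transformers import AutoTokenizer

# Example record shaped like the datasets above; the column name
# ("messages" or "conversations") is what `field_messages` selects.
record = {
    "messages": [
        {"role": "user", "content": "1から10までの整数の和を求めてください。"},
        {"role": "assistant", "content": "1から10までの整数の和は55です。"},
    ]
}

tokenizer = AutoTokenizer.from_pretrained("Aratako/Llama-Gemma-2-27b-SFT-trial1")
text = tokenizer.apply_chat_template(record["messages"], tokenize=False)
print(text)  # <start_of_turn>user ... <end_of_turn> <start_of_turn>model ... <end_of_turn>
```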
fft-1
This model is a fine-tuned version of google/gemma-2-27b on the datasets listed above. It achieves the following results on the evaluation set:
- Loss: 0.6122
Model description
More information needed
Intended uses & limitations
More information needed
Training and evaluation data
More information needed
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 8
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 7
- gradient_accumulation_steps: 4
- total_train_batch_size: 224
- total_eval_batch_size: 7
- optimizer: paged_adamw_8bit (betas=(0.9, 0.999), epsilon=1e-08, no additional optimizer arguments)
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 10
- num_epochs: 2
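As a quick sanity check on the batch-size figures above:

```python
# Effective batch sizes used for training (numbers from the list above).
micro_batch_size = 8             # per-device train batch size
gradient_accumulation_steps = 4
num_devices = 7                  # data-parallel GPUs (DeepSpeed ZeRO-3)

total_train_batch_size = micro_batch_size * gradient_accumulation_steps * num_devices
assert total_train_batch_size == 224

# Evaluation runs with a per-device batch size of 1.
total_eval_batch_size = 1 * num_devices
assert total_eval_batch_size == 7
```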
Training results
| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 0.9427        | 0.0020 | 1    | 0.9940          |
| 0.6566        | 0.2043 | 100  | 0.6648          |
| 0.6609        | 0.4086 | 200  | 0.6430          |
| 0.6457        | 0.6129 | 300  | 0.6306          |
| 0.6322        | 0.8172 | 400  | 0.6203          |
| 0.5082        | 1.0204 | 500  | 0.6238          |
| 0.5348        | 1.2247 | 600  | 0.6212          |
| 0.5253        | 1.4290 | 700  | 0.6181          |
| 0.5136        | 1.6333 | 800  | 0.6147          |
| 0.5125        | 1.8376 | 900  | 0.6122          |
Framework versions
- Transformers 4.46.3
- Pytorch 2.3.1+cu121
- Datasets 3.1.0
- Tokenizers 0.20.3