NanoLM-365M-base


Introduction

NanoLM-365M-base is derived from Qwen2-0.5B. Its tokenizer has been replaced with BilingualTokenizer-8K, whose smaller vocabulary shrinks the embedding matrix and reduces the total parameter count from 0.5B to 365M.
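The reduction can be checked with a rough parameter count. The sketch below is only an estimate under stated assumptions: the hidden size (896), original vocabulary size (151,936), tied input/output embeddings, and total parameter count are taken from the public Qwen2-0.5B configuration rather than from this card, and the new vocabulary is assumed to have exactly 8,192 entries (the card only says "8K").

```python
# Back-of-the-envelope parameter count for the tokenizer swap.
# All constants below are assumptions, not values stated in this card.
HIDDEN_SIZE = 896           # assumed Qwen2-0.5B hidden size
QWEN2_VOCAB = 151_936       # assumed Qwen2-0.5B vocabulary size
QWEN2_TOTAL = 494_032_768   # assumed Qwen2-0.5B total parameters (tied embeddings)
NEW_VOCAB = 8_192           # assumed size of the "8K" vocabulary


def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Number of weights in a vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size


def params_after_swap(total: int, old_vocab: int, new_vocab: int, hidden: int) -> int:
    """Total parameters once the old embedding matrix is replaced by a new one."""
    backbone = total - embedding_params(old_vocab, hidden)
    return backbone + embedding_params(new_vocab, hidden)


new_embedding = embedding_params(NEW_VOCAB, HIDDEN_SIZE)
new_total = params_after_swap(QWEN2_TOTAL, QWEN2_VOCAB, NEW_VOCAB, HIDDEN_SIZE)
print(f"new embedding parameters: {new_embedding / 1e6:.1f} M")
print(f"new total parameters:     {new_total / 1e6:.1f} M")
```

Under these assumptions the embedding matrix drops from roughly 136 M to about 7 M weights, which is consistent with the 365 M total and the "< 10 M" trainable parameters reported in this card.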

Details

After replacing the tokenizer, I froze the backbone parameters and trained only the embedding layer (model.embed_tokens) to recover some performance and to make fine-tuning for downstream tasks easier. Training ran for 40,000 steps on wikipedia-zh and cosmopedia-100k.
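The freezing step described above can be sketched as follows. This is a minimal illustration, not the author's training script: `freeze_except` is a hypothetical helper, and it only assumes that the model exposes `named_parameters()` the way a `torch.nn.Module` does.

```python
def freeze_except(model, trainable_prefixes=("model.embed_tokens",)):
    """Freeze every parameter whose name does not start with a given prefix.

    `model` is any object with a `named_parameters()` method yielding
    (name, parameter) pairs, e.g. a torch.nn.Module. Hypothetical helper.
    Returns (trainable_elements, total_elements).
    """
    prefixes = tuple(trainable_prefixes)
    trainable = 0
    total = 0
    for name, param in model.named_parameters():
        # Only parameters under the listed prefixes keep receiving gradients.
        param.requires_grad = name.startswith(prefixes)
        count = param.numel()
        total += count
        if param.requires_grad:
            trainable += count
    return trainable, total
```

With a model loaded through, for example, `transformers.AutoModelForCausalLM.from_pretrained(...)`, calling `freeze_except(model)` before building the optimizer would leave only the embedding weights trainable. The returned counts can be compared against the totals listed in the table below.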

| Setting | Value |
| --- | --- |
| Total Params | 365 M |
| Trainable Params | < 10 M |
| Trainable Parts | model.embed_tokens |
| Training Steps | 40,000 |
| Training Dataset | wikipedia-zh, cosmopedia-100k |
| Optimizer | adamw_torch |
| Learning Rate | 2e-4 |
| LR Scheduler | cosine |
| Weight Decay | 0.1 |
| Warm-up Ratio | 0.03 |
| Batch Size | 16 |
| Gradient Accumulation Steps | 1 |
| Seq Len | 4096 |
| Dtype | bf16 |
| Peak GPU Memory | < 48 GB |
| Device | NVIDIA A100-SXM4-80GB |
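To make the schedule settings concrete, here is a small sketch of how the learning rate would evolve under a cosine scheduler with linear warm-up, using the values from the table. It assumes the common formulation (linear ramp over the warm-up steps, then a half-cosine decay to zero). The exact scheduler implementation used for training is not stated here, and `lr_at_step` is a hypothetical helper.

```python
import math


def lr_at_step(step, total_steps=40_000, peak_lr=2e-4, warmup_ratio=0.03):
    """Learning rate at a given optimizer step (assumed cosine-with-warmup shape)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warm-up from 0 to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Half-cosine decay from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for s in (0, 600, 1_200, 20_600, 40_000):
    print(f"step {s:>6}: lr = {lr_at_step(s):.2e}")
```

With a warm-up ratio of 0.03 over 40,000 steps, the warm-up lasts about 1,200 steps. The rate then peaks at 2e-4 and decays toward zero by the final step.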

The training records are shown in the following figure:

[Figure: result]
