Chinese TinyLlama

A demo project that pretrains a TinyLlama on Chinese corpora with minimal modification to the Hugging Face transformers code. It serves as a use case demonstrating how to use the Hugging Face version of TinyLlama to pretrain a model on a large corpus.

See the GitHub repo for more details.

Usage

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

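# trust_remote_code=True is required because this repo uses the ChatGLM3 tokenizer's custom code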
tokenizer = AutoTokenizer.from_pretrained("whynlp/tinyllama-zh", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("whynlp/tinyllama-zh")
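
With the model and tokenizer loaded, text can be generated as with any causal LM. A minimal generation sketch follows; the prompt is only an illustration, and since this is a base (pretrained-only) model it produces plain continuations rather than chat-style answers:

prompt = "今天天气不错，"  # example prompt, replace with your own text
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_new_tokens=50, do_sample=False)  # greedy decoding
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))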

Model Details

Model Description

This model was trained on WuDaoCorpora Text. The dataset contains about 45B tokens, and the model was trained for 2 epochs. Training took about 6 days on 8 A100 GPUs.

The model uses the THUDM/chatglm3-6b tokenizer from Hugging Face.
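
Since the vocabulary comes from chatglm3-6b rather than the original TinyLlama, a quick way to inspect the tokenizer is sketched below (the sample sentence is only an illustration):

from transformers import AutoTokenizer

# The ChatGLM3 tokenizer ships custom code, hence trust_remote_code=True
tokenizer = AutoTokenizer.from_pretrained("whynlp/tinyllama-zh", trust_remote_code=True)
print(tokenizer.vocab_size)                # vocabulary size inherited from chatglm3-6b
print(tokenizer.tokenize("今天天气不错"))  # how a Chinese sentence is segmented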

  • Model type: Llama
  • Language(s) (NLP): Chinese
  • License: MIT
  • Finetuned from model: TinyLlama-2.5T checkpoint

Uses

The model does not perform very well: its CMMLU score is only slightly above 25, i.e. close to random guessing on four-choice questions. For better performance, one may use a better corpus (e.g. WanJuan). Again, this project only serves as a demonstration of how to pretrain a TinyLlama on a large corpus.
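
For reference, here is a rough sketch of how a CMMLU-style multiple-choice score can be computed, reusing the tokenizer and model from the Usage section: compare the model's next-token logits for the four option letters and pick the highest. The question below is made up for illustration; real evaluations use the official CMMLU test sets and an evaluation harness that handles prompting and tokenization more carefully.

import torch

question = "以下哪个城市是中国的首都？\nA. 上海\nB. 北京\nC. 广州\nD. 深圳\n答案："  # made-up example
choices = ["A", "B", "C", "D"]

input_ids = tokenizer(question, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(input_ids).logits
next_token_logits = logits[0, -1]  # scores for the token that follows "答案："

# Assume each option letter maps to a single trailing token (a simplification)
option_ids = [tokenizer(c).input_ids[-1] for c in choices]
scores = [next_token_logits[i].item() for i in option_ids]
print(choices[scores.index(max(scores))])  # predicted answer letter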
