ldwang committed (verified)
Commit 8a98771 · Parent: 3ea6e6b

Update README.md

Files changed (1): README.md (+19 -35)
README.md CHANGED
@@ -9,33 +9,37 @@ datasets:
  - BAAI/CCI3-HQ
  - mlfoundations/dclm-baseline-1.0
  - HuggingFaceFW/fineweb-edu
- - HuggingFaceTB/cosmopedia
  pipeline_tag: text-generation
  ---

  # Introduction

- The **Aquila-135M** model is a small language model trained using a pre-training and annealing paradigm.
- This model used 1.66TB bilingual tokens in Chinese and English for pre-training and 100B tokens for annealing training. During annealing stage, we selected 100B tokens of high-quality bilingual (Chinese and English) data for annealing training, finally got our model.
-
- We have open-sourced all [bilingual datasets](https://huggingface.co/datasets/BAAI/Aquila-135M-Datasets) during both pre-training and annealing phrases.
- Also we have open-sourced all [intermediate checkpoints](https://huggingface.co/BAAI/Aquila-135M-Intermediate).

  The **Aquila-135M-Instruct** model is finetuned using [Infinity Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct).

- Excluding the parameter count of the vocabulary, Aquila-135M and SmolLM2-135M share an identical structure. The parameter count excludes the embedding part.

- The entire training process was conducted using our self-developed Triton operator library, [FlagGems](https://github.com/FlagOpen/FlagGems), and parallel training framework, [FlagScale](https://github.com/FlagOpen/FlagScale).

- ## News
  - `2024/12/24`: We have released Aquila-135M and Aquila-135M-Instruct.
  - `2024/12/24`: We have released all datasets and intermediate checkpoints during training. Please feel free to use these models for analysis and experimentation.

  # Evaluation

- We followed evaluation setting of SmolLM models and evaluated the model using the [lighteval](https://github.com/huggingface/lighteval) tool.

- While their performance on English benchmarks is comparable, Aquila-135M demonstrates significantly better results on Chinese benchmarks.

  Among small models with a total parameter count below and around 400M, Aquila-135M maintains a leading position in processing capabilities while significantly enhancing Chinese language proficiency.
@@ -61,27 +65,6 @@ For comparison models, evaluations were conducted in a local environment, so the
 
  # How to use

- ## Base Model
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
- checkpoint = "BAAI/Aquila-135M"
-
- device = "cuda" # for GPU usage or "cpu" for CPU usage
- tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
- # for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
- model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
-
- input_text = "什么是引力?"
- inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
- outputs = model.generate(inputs, max_new_tokens=500)
- print(tokenizer.decode(outputs[0]))
-
- input_text = "What is gravity?"
- inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
- outputs = model.generate(inputs, max_new_tokens=500)
- print(tokenizer.decode(outputs[0]))
- ```
-
  ## Instruct Model
  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
@@ -98,6 +81,7 @@ print(input_text)
  inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
  outputs = model.generate(inputs, max_new_tokens=500)
  print(tokenizer.decode(outputs[0]))

  messages = [{"role": "user", "content": "What is gravity?"}]
  input_text=tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
@@ -105,13 +89,13 @@ print(input_text)
  inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
  outputs = model.generate(inputs, max_new_tokens=500)
  print(tokenizer.decode(outputs[0]))
-
  ```

-
  # Future Plan

- * We plan to optimize the selection of better datasets and their proportions.


  ## **Citation**
 
  - BAAI/CCI3-HQ
  - mlfoundations/dclm-baseline-1.0
  - HuggingFaceFW/fineweb-edu
+ - HuggingFaceTB/smollm-corpus
  pipeline_tag: text-generation
  ---

  # Introduction

+ The **Aquila-135M** model is a small bilingual (Chinese and English) language model trained with a two-phase paradigm: pre-training and annealing.
+ The model was trained on 1.66T bilingual (Chinese and English) tokens during the pre-training phase and 100B tokens during the annealing phase.
+ In the annealing stage, we selected 100B tokens of high-quality bilingual data to obtain the final model.
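To make the two-phase recipe concrete, here is a minimal sketch of a constant-then-decay learning-rate schedule laid over the token budget quoted above. The peak learning rate and the linear decay shape are illustrative assumptions only, not the actual Aquila-135M training configuration.

```python
# Illustrative two-phase schedule: constant LR during pre-training,
# linear decay to zero across the annealing tokens. All constants except
# the token counts (taken from the text above) are made up for this sketch.
PRETRAIN_TOKENS = 1.66e12   # 1.66T pre-training tokens
ANNEAL_TOKENS = 100e9       # 100B annealing tokens
PEAK_LR = 3e-3              # assumed peak learning rate

def learning_rate(tokens_seen: float) -> float:
    """Return the learning rate after `tokens_seen` training tokens."""
    if tokens_seen < PRETRAIN_TOKENS:
        return PEAK_LR                                 # pre-training phase
    progress = (tokens_seen - PRETRAIN_TOKENS) / ANNEAL_TOKENS
    return PEAK_LR * max(0.0, 1.0 - progress)          # annealing phase

print(learning_rate(1.0e12))                  # 0.003  (still pre-training)
print(learning_rate(PRETRAIN_TOKENS + 50e9))  # 0.0015 (halfway through annealing)
```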
 
 
 
  The **Aquila-135M-Instruct** model is finetuned using [Infinity Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct).

+ The entire training process was conducted with the Triton-based operator library [FlagGems](https://github.com/FlagOpen/FlagGems) and the parallel training framework [FlagScale](https://github.com/FlagOpen/FlagScale).

+ We have also open-sourced all [intermediate checkpoints](https://huggingface.co/BAAI/Aquila-135M-Intermediate).
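As a rough sketch of how a Triton operator library like FlagGems can be dropped into an inference run of the released checkpoint: the `flag_gems.enable()` call below is assumed to be the library's global patching entry point and should be checked against the FlagGems repository before use.

```python
# Hypothetical usage sketch: patch supported PyTorch operators with FlagGems'
# Triton kernels, then run the released checkpoint as usual.
import torch
import flag_gems  # assumed entry point; verify against the FlagGems docs
from transformers import AutoModelForCausalLM, AutoTokenizer

flag_gems.enable()  # assumed to globally replace supported ATen ops

checkpoint = "BAAI/Aquila-135M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to("cuda")

inputs = tokenizer("What is gravity?", return_tensors="pt").to("cuda")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```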
 
+ # News
  - `2024/12/24`: We have released Aquila-135M and Aquila-135M-Instruct.
  - `2024/12/24`: We have released all datasets and intermediate checkpoints during training. Please feel free to use these models for analysis and experimentation.

+ # Datasets
+
+ We have open-sourced all [bilingual datasets](https://huggingface.co/datasets/BAAI/Aquila-135M-Datasets) used during both the pre-training and annealing phases.
+ The dataset composition and mixing proportions are shown in the figure below.
+ <img src="./datasets.jpeg" alt="datasets composition" width="800" height="600">
+
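A minimal sketch of peeking at the released data with the `datasets` library. Whether `BAAI/Aquila-135M-Datasets` loads with the default configuration as written here, and which split names it exposes, are assumptions to verify on the dataset page.

```python
# Sketch: stream a few records from the released bilingual corpus without
# downloading it in full. Config/split names are assumptions; check the dataset card.
from datasets import load_dataset

ds = load_dataset("BAAI/Aquila-135M-Datasets", split="train", streaming=True)
for i, example in enumerate(ds):
    print(example)   # inspect the record schema (e.g. text field, source tags)
    if i >= 2:
        break
```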
  # Evaluation

+ We followed the evaluation setup of the SmolLM models and evaluated all models with the [lighteval](https://github.com/huggingface/lighteval) tool.
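lighteval automates the benchmark scoring; purely as background, the sketch below illustrates the zero-shot multiple-choice protocol most of these benchmarks rely on, i.e. picking the answer whose tokens receive the highest log-likelihood from the model. The question is made up, and this is not lighteval's implementation or the exact task configuration used here.

```python
# Background illustration of log-likelihood multiple-choice scoring (not lighteval code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "BAAI/Aquila-135M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
model.eval()

def choice_logprob(prompt: str, choice: str) -> float:
    """Sum of log-probabilities of the choice tokens, conditioned on the prompt.
    Assumes the prompt tokenization is a prefix of the prompt+choice tokenization."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)   # positions predicting tokens 1..L-1
    targets = full_ids[:, 1:]
    token_logprobs = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    n_choice = full_ids.shape[1] - prompt_ids.shape[1]
    return token_logprobs[0, -n_choice:].sum().item()

question = "Question: What force pulls objects toward the Earth?\nAnswer:"
choices = [" gravity", " magnetism", " friction"]
scores = [choice_logprob(question, c) for c in choices]
print(choices[scores.index(max(scores))])
```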
 
+ Excluding embedding parameters, Aquila-135M and SmolLM2-135M share an identical model structure. Aquila-135M achieves comparable performance on English benchmarks while demonstrating significantly better results on Chinese benchmarks.
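A quick way to reproduce the non-embedding parameter comparison yourself; the name-matching heuristic below (counting every parameter whose name contains "embed" as an embedding) is an assumption that should be sanity-checked against each model's architecture.

```python
# Compare total vs. non-embedding parameter counts of the two checkpoints.
from transformers import AutoModelForCausalLM

for ckpt in ["BAAI/Aquila-135M", "HuggingFaceTB/SmolLM2-135M"]:
    model = AutoModelForCausalLM.from_pretrained(ckpt)
    total = sum(p.numel() for p in model.parameters())
    embed = sum(p.numel() for name, p in model.named_parameters() if "embed" in name)
    print(f"{ckpt}: total={total/1e6:.1f}M  non-embedding={(total - embed)/1e6:.1f}M")
```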
 
  Among small models with around 400M total parameters or fewer, Aquila-135M maintains a leading position in processing capabilities while significantly enhancing Chinese language proficiency.
 
 
  # How to use

  ## Instruct Model
  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
 
  inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
  outputs = model.generate(inputs, max_new_tokens=500)
  print(tokenizer.decode(outputs[0]))
+ ## 引力是宇宙中的一个基本力,由多个物体相互作用而产生的。它由能量和质量组成,与引力定律密切相关。
+ ## (English: Gravity is a fundamental force in the universe, produced by the interaction of multiple objects. It is composed of energy and mass and is closely related to the law of gravitation.)

  messages = [{"role": "user", "content": "What is gravity?"}]
  input_text=tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
 
  inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
  outputs = model.generate(inputs, max_new_tokens=500)
  print(tokenizer.decode(outputs[0]))
+ ## Gravity is the force that keeps us on Earth as we orbit it. It pulls objects towards each other with a strength that depends on how far apart they are from each other, and how strong the gravitational pull is. The stronger the object's mass, the greater its gravitational pull.
  ```
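The calls above use greedy decoding by default. Continuing from the same `model`, `tokenizer`, and `inputs`, a sampling variant looks like the sketch below; the temperature/top-p values are illustrative only, not settings recommended by this model card.

```python
# Sampling instead of greedy decoding; hyperparameter values are illustrative.
outputs = model.generate(
    inputs,
    max_new_tokens=500,
    do_sample=True,        # draw tokens from the distribution
    temperature=0.7,       # soften the next-token distribution
    top_p=0.9,             # nucleus sampling
    repetition_penalty=1.1,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```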
 
  # Future Plan

+ * We plan to further optimize the dataset composition and mixing proportions.
+ * We plan to further explore applications of small-scale models in specific scenarios.

  ## **Citation**