ldwang committed (verified)
Commit 8a98771 · Parent: 3ea6e6b

Update README.md

Files changed (1): README.md (+19 -35)
README.md CHANGED
@@ -9,33 +9,37 @@ datasets:
  - BAAI/CCI3-HQ
  - mlfoundations/dclm-baseline-1.0
  - HuggingFaceFW/fineweb-edu
- - HuggingFaceTB/cosmopedia
  pipeline_tag: text-generation
  ---

  # Introduction

- The **Aquila-135M** model is a small language model trained using a pre-training and annealing paradigm.
- This model used 1.66TB bilingual tokens in Chinese and English for pre-training and 100B tokens for annealing training. During annealing stage, we selected 100B tokens of high-quality bilingual (Chinese and English) data for annealing training, finally got our model.
-
- We have open-sourced all [bilingual datasets](https://huggingface.co/datasets/BAAI/Aquila-135M-Datasets) during both pre-training and annealing phrases.
- Also we have open-sourced all [intermediate checkpoints](https://huggingface.co/BAAI/Aquila-135M-Intermediate).

  The **Aquila-135M-Instruct** model is finetuned using [Infinity Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct).

- Excluding the parameter count of the vocabulary, Aquila-135M and SmolLM2-135M share an identical structure. The parameter count excludes the embedding part.

- The entire training process was conducted using our self-developed Triton operator library, [FlagGems](https://github.com/FlagOpen/FlagGems), and parallel training framework, [FlagScale](https://github.com/FlagOpen/FlagScale).

- ## News
  - `2024/12/24`: We have released Aquila-135M and Aquila-135M-Instruct.
  - `2024/12/24`: We have released all datasets and intermediate checkpoints during training. Please feel free to use these models for analysis and experimentation.

  # Evaluation

- We followed evaluation setting of SmolLM models and evaluated the model using the [lighteval](https://github.com/huggingface/lighteval) tool.

- While their performance on English benchmarks is comparable, Aquila-135M demonstrates significantly better results on Chinese benchmarks.

  Among small models with a total parameter count below and around 400M, Aquila-135M maintains a leading position in processing capabilities while significantly enhancing Chinese language proficiency.
@@ -61,27 +65,6 @@ For comparison models, evaluations were conducted in a local environment, so the
 
  # How to use

- ## Base Model
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
- checkpoint = "BAAI/Aquila-135M"
-
- device = "cuda" # for GPU usage or "cpu" for CPU usage
- tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
- # for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
- model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
-
- input_text = "什么是引力?"
- inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
- outputs = model.generate(inputs, max_new_tokens=500)
- print(tokenizer.decode(outputs[0]))
-
- input_text = "What is gravity?"
- inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
- outputs = model.generate(inputs, max_new_tokens=500)
- print(tokenizer.decode(outputs[0]))
- ```
-
  ## Instruct Model
  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
@@ -98,6 +81,7 @@ print(input_text)
  inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
  outputs = model.generate(inputs, max_new_tokens=500)
  print(tokenizer.decode(outputs[0]))

  messages = [{"role": "user", "content": "What is gravity?"}]
  input_text=tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
@@ -105,13 +89,13 @@ print(input_text)
  inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
  outputs = model.generate(inputs, max_new_tokens=500)
  print(tokenizer.decode(outputs[0]))
-
  ```

-
  # Future Plan

- * We plan to optimize the selection of better datasets and their proportions.


  ## **Citation**
 
  - BAAI/CCI3-HQ
  - mlfoundations/dclm-baseline-1.0
  - HuggingFaceFW/fineweb-edu
+ - HuggingFaceTB/smollm-corpus
  pipeline_tag: text-generation
  ---

  # Introduction

+ The **Aquila-135M** model is a small bilingual (Chinese and English) language model trained with a two-phase paradigm: pre-training and annealing.
+ The model was trained on 1.66T bilingual (Chinese and English) tokens during the pre-training phase and 100B tokens during the annealing phase.
+ In the annealing stage, we selected 100B tokens of high-quality bilingual data to obtain the final model.
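To make the two-phase recipe concrete, here is a minimal sketch of a constant-then-decay learning-rate schedule laid over the token budget quoted above. The peak learning rate and the linear decay shape are illustrative assumptions only, not the actual Aquila-135M training configuration.

```python
# Illustrative two-phase schedule: constant LR during pre-training,
# linear decay to zero across the annealing tokens. All constants except
# the token counts (taken from the text above) are made up for this sketch.
PRETRAIN_TOKENS = 1.66e12   # 1.66T pre-training tokens
ANNEAL_TOKENS = 100e9       # 100B annealing tokens
PEAK_LR = 3e-3              # assumed peak learning rate

def learning_rate(tokens_seen: float) -> float:
    """Return the learning rate after `tokens_seen` training tokens."""
    if tokens_seen < PRETRAIN_TOKENS:
        return PEAK_LR                                 # pre-training phase
    progress = (tokens_seen - PRETRAIN_TOKENS) / ANNEAL_TOKENS
    return PEAK_LR * max(0.0, 1.0 - progress)          # annealing phase

print(learning_rate(1.0e12))                  # 0.003  (still pre-training)
print(learning_rate(PRETRAIN_TOKENS + 50e9))  # 0.0015 (halfway through annealing)
```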
 
 
 
  The **Aquila-135M-Instruct** model is finetuned using [Infinity Instruct](https://huggingface.co/datasets/BAAI/Infinity-Instruct).

+ The entire training process was conducted with the Triton-based operator library [FlagGems](https://github.com/FlagOpen/FlagGems) and the parallel training framework [FlagScale](https://github.com/FlagOpen/FlagScale).

+ We have also open-sourced all [intermediate checkpoints](https://huggingface.co/BAAI/Aquila-135M-Intermediate).
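As a rough sketch of how a Triton operator library like FlagGems can be dropped into an inference run of the released checkpoint: the `flag_gems.enable()` call below is assumed to be the library's global patching entry point and should be checked against the FlagGems repository before use.

```python
# Hypothetical usage sketch: patch supported PyTorch operators with FlagGems'
# Triton kernels, then run the released checkpoint as usual.
import torch
import flag_gems  # assumed entry point; verify against the FlagGems docs
from transformers import AutoModelForCausalLM, AutoTokenizer

flag_gems.enable()  # assumed to globally replace supported ATen ops

checkpoint = "BAAI/Aquila-135M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to("cuda")

inputs = tokenizer("What is gravity?", return_tensors="pt").to("cuda")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```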
 
+ # News
  - `2024/12/24`: We have released Aquila-135M and Aquila-135M-Instruct.
  - `2024/12/24`: We have released all datasets and intermediate checkpoints during training. Please feel free to use these models for analysis and experimentation.

+ # Datasets
+
+ We have open-sourced all [bilingual datasets](https://huggingface.co/datasets/BAAI/Aquila-135M-Datasets) used during both the pre-training and annealing phases.
+ The dataset composition and mixing proportions are shown in the figure below.
+ <img src="./datasets.jpeg" alt="datasets composition" width="800" height="600">
+
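A minimal sketch of peeking at the released data with the `datasets` library. Whether `BAAI/Aquila-135M-Datasets` loads with the default configuration as written here, and which split names it exposes, are assumptions to verify on the dataset page.

```python
# Sketch: stream a few records from the released bilingual corpus without
# downloading it in full. Config/split names are assumptions; check the dataset card.
from datasets import load_dataset

ds = load_dataset("BAAI/Aquila-135M-Datasets", split="train", streaming=True)
for i, example in enumerate(ds):
    print(example)   # inspect the record schema (e.g. text field, source tags)
    if i >= 2:
        break
```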
  # Evaluation

+ We followed the evaluation setup of the SmolLM models and evaluated all models with the [lighteval](https://github.com/huggingface/lighteval) tool.
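lighteval automates the benchmark scoring; purely as background, the sketch below illustrates the zero-shot multiple-choice protocol most of these benchmarks rely on, i.e. picking the answer whose tokens receive the highest log-likelihood from the model. The question is made up, and this is not lighteval's implementation or the exact task configuration used here.

```python
# Background illustration of log-likelihood multiple-choice scoring (not lighteval code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "BAAI/Aquila-135M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
model.eval()

def choice_logprob(prompt: str, choice: str) -> float:
    """Sum of log-probabilities of the choice tokens, conditioned on the prompt.
    Assumes the prompt tokenization is a prefix of the prompt+choice tokenization."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)   # positions predicting tokens 1..L-1
    targets = full_ids[:, 1:]
    token_logprobs = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    n_choice = full_ids.shape[1] - prompt_ids.shape[1]
    return token_logprobs[0, -n_choice:].sum().item()

question = "Question: What force pulls objects toward the Earth?\nAnswer:"
choices = [" gravity", " magnetism", " friction"]
scores = [choice_logprob(question, c) for c in choices]
print(choices[scores.index(max(scores))])
```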
 
+ Excluding embedding parameters, Aquila-135M and SmolLM2-135M share an identical model structure. Aquila-135M achieves comparable performance on English benchmarks while demonstrating significantly better results on Chinese benchmarks.
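A quick way to reproduce the non-embedding parameter comparison yourself; the name-matching heuristic below (counting every parameter whose name contains "embed" as an embedding) is an assumption that should be sanity-checked against each model's architecture.

```python
# Compare total vs. non-embedding parameter counts of the two checkpoints.
from transformers import AutoModelForCausalLM

for ckpt in ["BAAI/Aquila-135M", "HuggingFaceTB/SmolLM2-135M"]:
    model = AutoModelForCausalLM.from_pretrained(ckpt)
    total = sum(p.numel() for p in model.parameters())
    embed = sum(p.numel() for name, p in model.named_parameters() if "embed" in name)
    print(f"{ckpt}: total={total/1e6:.1f}M  non-embedding={(total - embed)/1e6:.1f}M")
```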
 
  Among small models with around 400M total parameters or fewer, Aquila-135M maintains a leading position in processing capabilities while significantly enhancing Chinese language proficiency.
 
 
  # How to use

  ## Instruct Model
  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
 
  inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
  outputs = model.generate(inputs, max_new_tokens=500)
  print(tokenizer.decode(outputs[0]))
+ ## 引力是宇宙中的一个基本力,由多个物体相互作用而产生的。它由能量和质量组成,与引力定律密切相关。
+ ## (English: Gravity is a fundamental force in the universe, produced by the interaction of multiple objects. It is composed of energy and mass and is closely related to the law of gravitation.)

  messages = [{"role": "user", "content": "What is gravity?"}]
  input_text=tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
 
  inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
  outputs = model.generate(inputs, max_new_tokens=500)
  print(tokenizer.decode(outputs[0]))
+ ## Gravity is the force that keeps us on Earth as we orbit it. It pulls objects towards each other with a strength that depends on how far apart they are from each other, and how strong the gravitational pull is. The stronger the object's mass, the greater its gravitational pull.
  ```
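The calls above use greedy decoding by default. Continuing from the same `model`, `tokenizer`, and `inputs`, a sampling variant looks like the sketch below; the temperature/top-p values are illustrative only, not settings recommended by this model card.

```python
# Sampling instead of greedy decoding; hyperparameter values are illustrative.
outputs = model.generate(
    inputs,
    max_new_tokens=500,
    do_sample=True,        # draw tokens from the distribution
    temperature=0.7,       # soften the next-token distribution
    top_p=0.9,             # nucleus sampling
    repetition_penalty=1.1,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```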
 
  # Future Plan

+ * We plan to further optimize the dataset composition and mixing proportions.
+ * We plan to further explore applications of small-scale models in specific scenarios.

  ## **Citation**