RichardErkhov committed
Commit 0495445 · verified · 1 Parent(s): 0a41320

uploaded readme

Quantization made by Richard Erkhov.

[Github](https://github.com/RichardErkhov)

[Discord](https://discord.gg/pvy7H8DZMG)

[Request more models](https://github.com/RichardErkhov/quant_request)


gemma-mling-7b - bnb 8bits
- Model creator: https://huggingface.co/beomi/
- Original model: https://huggingface.co/beomi/gemma-mling-7b/
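
For reference, a minimal loading sketch for an 8-bit bitsandbytes checkpoint with `transformers` (assumes `accelerate` and `bitsandbytes` are installed and a CUDA GPU is available; the repo id below is a placeholder, so substitute this quantized repo's id, or point it at the original model to quantize on the fly):

```python
# Minimal sketch, not an official snippet: loading an 8-bit bitsandbytes checkpoint.
# Requires: pip install -U transformers accelerate bitsandbytes (and a CUDA GPU).
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

repo_id = "beomi/gemma-mling-7b"  # placeholder: use this quantized repo's id if preferred

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit weights via bitsandbytes
    device_map="auto",
)

inputs = tokenizer("머신러닝과 딥러닝의 차이는", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```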


Original model description:
---
language:
- ko
- en
- zh
- ja
license: other
library_name: transformers
license_name: gemma-terms-of-use
license_link: https://ai.google.dev/gemma/terms
pipeline_tag: text-generation
tags:
- pytorch
---

# Gemma-Mling: Multilingual Gemma

> Update @ 2024.04.15: First release of the Gemma-Mling 7B model

**Original Gemma Model Page**: [Gemma](https://ai.google.dev/gemma/docs)

This model card corresponds to the 7B base version of the **Gemma-Mling** model,
continually pretrained mainly on Korean/English/Chinese/Japanese text plus a 500-language multilingual corpus.

**Resources and Technical Documentation**:

* [Original Google's Gemma-7B](https://huggingface.co/google/gemma-7b)
* [Training Code @ Github: Gemma-EasyLM](https://github.com/Beomi/Gemma-EasyLM)

**Terms of Use**: [Terms](https://www.kaggle.com/models/google/gemma/license/consent)

**Citation**

```bibtex
@misc{gemma_mling_7b,
  author    = { Junbum Lee and Taekyoon Choi },
  title     = { gemma-mling-7b },
  year      = 2024,
  url       = { https://huggingface.co/beomi/gemma-mling-7b },
  publisher = { Hugging Face }
}
```

**Model Developers**: Junbum Lee (Beomi) & Taekyoon Choi (Taekyoon)

## Model Information

### Usage

Below we share some code snippets on how to quickly get started with running the model. First make sure to `pip install -U transformers`, then copy the snippet from the section that is relevant for your use case.

#### Running the model on a CPU

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("beomi/gemma-mling-7b")
model = AutoModelForCausalLM.from_pretrained("beomi/gemma-mling-7b")

input_text = "머신러닝과 딥러닝의 차이는"  # "The difference between machine learning and deep learning is"
input_ids = tokenizer(input_text, return_tensors="pt")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
```

#### Running the model on a single / multi GPU

```python
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("beomi/gemma-mling-7b")
model = AutoModelForCausalLM.from_pretrained("beomi/gemma-mling-7b", device_map="auto")

input_text = "머신러닝과 딥러닝의 차이는"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
```

### Inputs and outputs

* **Input:** Text string, such as a question, a prompt, or a document to be
  summarized.
* **Output:** Generated multilingual text in response to the input, such
  as an answer to a question, or a summary of a document.

## Implementation Information

Details about the model internals.

### Software

Training was done using [beomi/Gemma-EasyLM](https://github.com/Beomi/Gemma-EasyLM).

### Dataset

We trained on a mixture of multilingual datasets up to a total of 100B tokens.
The released model is the best-performing checkpoint according to the evaluation below.

For the Korean and English portion, we used a sampled llama2ko training dataset that combines the two languages at a 1:1 ratio (see the illustrative sketch after the table below).

| Dataset                  | Jsonl (GB) | Sampled |
|--------------------------|------------|---------|
| range3/cc100-ja          | 96.39      | No      |
| Skywork/SkyPile-150B     | 100.57     | Yes     |
| llama2ko dataset (ko/en) | 108.5      | Yes     |
| cis-lmu/Glot500          | 181.24     | No      |
| Total                    | 486.7      | –       |
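
The exact mixing pipeline is not published here; the following is only an illustrative sketch of 1:1 interleaving with the `datasets` library, with placeholder dataset ids standing in for the actual llama2ko source corpora:

```python
# Illustrative sketch only (not the authors' pipeline): a 1:1 Korean/English mixture
# built by interleaving two streaming datasets with equal sampling probabilities.
# The dataset ids are placeholders for the actual llama2ko source corpora.
from datasets import load_dataset, interleave_datasets

ko = load_dataset("placeholder/ko-corpus", split="train", streaming=True)  # placeholder id
en = load_dataset("placeholder/en-corpus", split="train", streaming=True)  # placeholder id

mixed = interleave_datasets([ko, en], probabilities=[0.5, 0.5], seed=42)
for example in mixed.take(3):  # peek at the first few interleaved examples
    print(example)
```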

## Training Progress

- Report Link: https://api.wandb.ai/links/tgchoi/6lt0ce3s

## Evaluation

Model evaluation metrics and results.

### Evaluation Scripts

- For Knowledge / KoBest / XCOPA / XWinograd
  - [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) v0.4.2

    ```bash
    git clone https://github.com/EleutherAI/lm-evaluation-harness.git
    cd lm-evaluation-harness && pip install -r requirements.txt && pip install -e .

    lm_eval --model hf \
        --model_args pretrained=beomi/gemma-mling-7b,dtype="float16" \
        --tasks "haerae,kobest,kmmlu_direct,cmmlu,ceval-valid,mmlu,xwinograd,xcopa" \
        --num_fewshot "0,5,5,5,5,5,0,5" \
        --device cuda
    ```
- For JP Eval Harness
  - [Stability-AI/lm-evaluation-harness (`jp-stable` branch)](https://github.com/Stability-AI/lm-evaluation-harness/tree/jp-stable)

    ```bash
    git clone -b jp-stable https://github.com/Stability-AI/lm-evaluation-harness.git
    cd lm-evaluation-harness && pip install -e ".[ja]"
    pip install 'fugashi[unidic]' && python -m unidic download

    cd lm-evaluation-harness && python main.py \
        --model hf-causal \
        --model_args pretrained=beomi/gemma-mling-7b,torch_dtype='auto' \
        --tasks "jcommonsenseqa-1.1-0.3,jnli-1.3-0.3,marc_ja-1.1-0.3,jsquad-1.1-0.3,jaqket_v2-0.2-0.3,xlsum_ja,mgsm" \
        --num_fewshot "3,3,3,2,1,1,5"
    ```
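
As an alternative to the CLI calls above, the EleutherAI harness (v0.4.x) also exposes a Python API; a minimal sketch for a single task, assuming the package is installed as shown above:

```python
# Sketch only: running one of the tasks above through the lm-evaluation-harness
# (EleutherAI, v0.4.x) Python API instead of the lm_eval CLI.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=beomi/gemma-mling-7b,dtype=float16",
    tasks=["kobest"],   # pick any task from the CLI list above
    num_fewshot=5,
    device="cuda",
)
print(results["results"])
```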

### Benchmark Results

| Category                             | Metric               | Shots  | Score |
|--------------------------------------|----------------------|--------|-------|
| **Default Metric**                   | **ACC**              |        |       |
| **Knowledge (5-shot)**               | MMLU                 |        | 61.76 |
|                                      | KMMLU (Exact Match)  |        | 42.75 |
|                                      | CMMLU                |        | 50.93 |
|                                      | JMLU                 |        |       |
|                                      | C-EVAL               |        | 50.07 |
|                                      | HAERAE               | 0-shot | 63.89 |
| **KoBest (5-shot)**                  | BoolQ                |        | 85.47 |
|                                      | COPA                 |        | 83.5  |
|                                      | Hellaswag (acc-norm) |        | 63.2  |
|                                      | Sentineg             |        | 97.98 |
|                                      | WiC                  |        | 70.95 |
| **XCOPA (5-shot)**                   | IT                   |        | 72.8  |
|                                      | ID                   |        | 76.4  |
|                                      | TH                   |        | 60.2  |
|                                      | TR                   |        | 65.6  |
|                                      | VI                   |        | 77.2  |
|                                      | ZH                   |        | 80.2  |
| **JP Eval Harness (Prompt ver 0.3)** | JcommonsenseQA       | 3-shot | 85.97 |
|                                      | JNLI                 | 3-shot | 39.11 |
|                                      | Marc_ja              | 3-shot | 96.48 |
|                                      | JSquad (Exact Match) | 2-shot | 70.69 |
|                                      | Jaqket (Exact Match) | 1-shot | 81.53 |
|                                      | MGSM                 | 5-shot | 28.8  |
| **XWinograd (0-shot)**               | EN                   |        | 89.03 |
|                                      | FR                   |        | 72.29 |
|                                      | JP                   |        | 82.69 |
|                                      | PT                   |        | 73.38 |
|                                      | RU                   |        | 68.57 |
|                                      | ZH                   |        | 79.17 |

## Usage and Limitations

These models have certain limitations that users should be aware of.

### Intended Usage

Open Large Language Models (LLMs) have a wide range of applications across
various industries and domains. The following list of potential uses is not
comprehensive. The purpose of this list is to provide contextual information
about the possible use-cases that the model creators considered as part of model
training and development.

* Content Creation and Communication
  * Text Generation: These models can be used to generate creative text formats
    such as poems, scripts, code, marketing copy, and email drafts.
* Research and Education
  * Natural Language Processing (NLP) Research: These models can serve as a
    foundation for researchers to experiment with NLP techniques, develop
    algorithms, and contribute to the advancement of the field.
  * Language Learning Tools: Support interactive language learning experiences,
    aiding in grammar correction or providing writing practice.
  * Knowledge Exploration: Assist researchers in exploring large bodies of text
    by generating summaries or answering questions about specific topics.

### Limitations

* Training Data
  * The quality and diversity of the training data significantly influence the
    model's capabilities. Biases or gaps in the training data can lead to
    limitations in the model's responses.
  * The scope of the training dataset determines the subject areas the model can
    handle effectively.
* Context and Task Complexity
  * LLMs are better at tasks that can be framed with clear prompts and
    instructions. Open-ended or highly complex tasks might be challenging.
  * A model's performance can be influenced by the amount of context provided
    (longer context generally leads to better outputs, up to a certain point).
* Language Ambiguity and Nuance
  * Natural language is inherently complex. LLMs might struggle to grasp subtle
    nuances, sarcasm, or figurative language.
* Factual Accuracy
  * LLMs generate responses based on information they learned from their
    training datasets, but they are not knowledge bases. They may generate
    incorrect or outdated factual statements.
* Common Sense
  * LLMs rely on statistical patterns in language. They might lack the ability
    to apply common sense reasoning in certain situations.

### Ethical Considerations and Risks

The development of large language models (LLMs) raises several ethical concerns.
In creating an open model, we have carefully considered the following:

* Bias and Fairness
  * LLMs trained on large-scale, real-world text data can reflect socio-cultural
    biases embedded in the training material. These models underwent careful
    scrutiny; input data pre-processing is described and posterior evaluations
    are reported in this card.
* Misinformation and Misuse
  * LLMs can be misused to generate text that is false, misleading, or harmful.
  * Guidelines for responsible use are provided with the model; see the
    [Responsible Generative AI Toolkit](http://ai.google.dev/gemma/responsible).
* Transparency and Accountability
  * This model card summarizes details on the models' architecture,
    capabilities, limitations, and evaluation processes.
  * A responsibly developed open model offers the opportunity to share
    innovation by making LLM technology accessible to developers and researchers
    across the AI ecosystem.

Risks identified and mitigations:

* Perpetuation of biases: Continuous monitoring (using evaluation metrics and
  human review) and the exploration of de-biasing techniques during model
  training, fine-tuning, and other use cases are encouraged.
* Generation of harmful content: Mechanisms and guidelines for content safety
  are essential. Developers are encouraged to exercise caution and implement
  appropriate content safety safeguards based on their specific product policies
  and application use cases.
* Misuse for malicious purposes: Technical limitations and developer and
  end-user education can help mitigate malicious applications of LLMs.
  Educational resources and reporting mechanisms for users to flag misuse are
  provided. Prohibited uses of Gemma models are outlined in the
  [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy).
* Privacy violations: Models were trained on data filtered to remove PII
  (Personally Identifiable Information). Developers are encouraged to adhere to
  privacy regulations with privacy-preserving techniques.

## Acknowledgement

The training was supported by the [TPU Research Cloud](https://sites.research.google/trc/) program.