---
datasets:
- LeoLM/OpenSchnabeltier
- OpenAssistant/OASST-DE
- FreedomIntelligence/alpaca-gpt4-deutsch
- FreedomIntelligence/evol-instruct-deutsch
- LeoLM/German_Poems
- LeoLM/German_Songs
language:
- en
- de
library_name: transformers
pipeline_tag: text-generation
---
# LAION LeoLM: **L**inguistically **E**nhanced **O**pen **L**anguage **M**odel
Meet LeoLM, the first open and commercially available German Foundation Language Model built on Llama-2.
Our models extend Llama-2's capabilities into German through continued pretraining on a large corpus of German-language and mostly locality-specific text.
Thanks to a compute grant at HessianAI's new supercomputer **42**, we release two foundation models trained with 8k context length,
[`LeoLM/leo-hessianai-7b`](https://huggingface.co/LeoLM/leo-hessianai-7b) and [`LeoLM/leo-hessianai-13b`](https://huggingface.co/LeoLM/leo-hessianai-13b), under the [Llama-2 community license](https://huggingface.co/meta-llama/Llama-2-70b/raw/main/LICENSE.txt) (70b also coming soon! 👀).
With this release, we hope to bring a new wave of opportunities to German open-source and commercial LLM research and accelerate adoption.
Read our [blog post]() or our paper (preprint coming soon) for more details!

*A project by Björn Plüster and Christoph Schuhmann in collaboration with LAION and HessianAI.*

## LeoLM Chat
`LeoLM/leo-hessianai-7b-chat` is a German chat model built on our foundation model `LeoLM/leo-hessianai-7b` and finetuned on a selection of German instruction datasets.
The model performs exceptionally well on writing, explanation and discussion tasks but struggles somewhat with math and advanced reasoning. See our MT-Bench-DE scores:
```

```

## Model Details

- **Finetuned from:** [LeoLM/leo-hessianai-7b](https://huggingface.co/LeoLM/leo-hessianai-7b)
- **Model type:** Causal decoder-only transformer language model
- **Language:** English and German
- **Demo:** [Continuations for 250 random prompts (TGI, 4bit nf4 quantization)](https://open-assistant.github.io/oasst-model-eval/?f=https%3A%2F%2Fraw.githubusercontent.com%2FOpen-Assistant%2Foasst-model-eval%2Fmain%2Fsampling_reports%2Foasst-sft%2F2023-08-22_OpenAssistant_llama2-70b-oasst-sft-v10_sampling_noprefix2_nf4.json%0A)
- **License:** [LLAMA 2 COMMUNITY LICENSE AGREEMENT](https://huggingface.co/meta-llama/Llama-2-70b/raw/main/LICENSE.txt)
- **Contact:** [LAION Discord](https://discord.com/invite/eq3cAMZtCC) or [Björn Plüster](mailto:[email protected])

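As a quick start, the chat model can be loaded with Hugging Face `transformers`. The snippet below is a minimal sketch rather than a prescribed setup: the `bfloat16` dtype and automatic device placement are illustrative choices and assume a suitable GPU.

```python
import torch
from transformers import pipeline

# Build a text-generation pipeline for the chat model.
# Dtype and device placement are illustrative settings, not requirements.
generator = pipeline(
    "text-generation",
    model="LeoLM/leo-hessianai-7b-chat",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```
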
## Prompting / Prompt Template

Prompt dialogue template (ChatML format):

```
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```

The model input can contain multiple conversation turns between user and assistant, e.g.
```
<|im_start|>user
{prompt 1}<|im_end|>
<|im_start|>assistant
{reply 1}<|im_end|>
<|im_start|>user
{prompt 2}<|im_end|>
<|im_start|>assistant
(...)
```

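With a recent `transformers` version, the template above can also be rendered programmatically through the tokenizer's chat template. A minimal sketch, assuming the model repository ships a ChatML chat template; the example messages are illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LeoLM/leo-hessianai-7b-chat")

# Conversation turns as role/content pairs (example content is our own).
messages = [
    {"role": "system", "content": "Du bist ein hilfreicher Assistent."},
    {"role": "user", "content": "Erkläre kurz, was ein Sprachmodell ist."},
]

# Render the ChatML prompt shown above, ending with an open assistant turn.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```
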
## Ethical Considerations and Limitations

LeoLM has been tested in English and German, and this testing has not covered, nor could it cover, all scenarios.
For these reasons, as with all LLMs, the potential outputs of `LeoLM/leo-hessianai-7b-chat` cannot be predicted
in advance, and the model may in some instances produce inaccurate, biased or otherwise objectionable responses
to user prompts. Therefore, before deploying any applications of `LeoLM/leo-hessianai-7b-chat`, developers should
perform safety testing and tuning tailored to their specific applications of the model.

Please see Meta's [Responsible Use Guide](https://ai.meta.com/llama/responsible-use-guide/).

## Dataset Details
```
## Stats for 'Subset of OpenAssistant/OASST-DE' (3534 samples (100.0%))
-----------------
Accepted: 3534/3534 (100.0%)
Accepted tokens: 2259302
Skipped: 0 (0.0%)
Min tokens per sample: 29
Max tokens per sample: 2484
Avg tokens per sample: 639.3044708545557
-----------------

## Stats for 'Subset of FreedomIntelligence/evol-instruct-deutsch' (57841 samples (100.0%))
-----------------
Accepted: 57841/57841 (100.0%)
Accepted tokens: 42958192
Skipped: 0 (0.0%)
Min tokens per sample: 33
Max tokens per sample: 5507
Avg tokens per sample: 742.6944900675991
-----------------

## Stats for 'Subset of FreedomIntelligence/alpaca-gpt4-deutsch' (48969 samples (100.0%))
-----------------
Accepted: 48969/48969 (100.0%)
Accepted tokens: 13372005
Skipped: 0 (0.0%)
Min tokens per sample: 19
Max tokens per sample: 1359
Avg tokens per sample: 273.07082031489307
-----------------

## Stats for 'Subset of LeoLM/OpenSchnabeltier' (21314 samples (100.0%))
-----------------
Accepted: 21314/21314 (100.0%)
Accepted tokens: 8134690
Skipped: 0 (0.0%)
Min tokens per sample: 25
Max tokens per sample: 1202
Avg tokens per sample: 381.65947264708643
-----------------

## Stats for 'Subset of LeoLM/German_Poems' (490 samples (100.0%))
-----------------
Accepted: 490/490 (100.0%)
Accepted tokens: 618642
Skipped: 0 (0.0%)
Min tokens per sample: 747
Max tokens per sample: 1678
Avg tokens per sample: 1262.534693877551
-----------------

## Stats for 'Subset of LeoLM/German_Songs' (392 samples (100.0%))
-----------------
Accepted: 392/392 (100.0%)
Accepted tokens: 187897
Skipped: 0 (0.0%)
Min tokens per sample: 231
Max tokens per sample: 826
Avg tokens per sample: 479.3290816326531
-----------------

## Stats for 'total' (132540 samples (100.0%))
-----------------
Accepted: 132540/132540 (100.0%)
Accepted tokens: 67530728
Skipped: 0 (0.0%)
Min tokens per sample: 19
Max tokens per sample: 5507
Avg tokens per sample: 509.51205673758864
-----------------
```
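
Per-dataset token statistics like those above can be reproduced with a short script. The following is a sketch under stated assumptions: the `"text"` column name is hypothetical and varies per dataset, and the exact tokenizer used for the original counts is not specified here.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LeoLM/leo-hessianai-7b-chat")

def token_stats(dataset_name: str, text_column: str) -> dict:
    """Tokenize every sample of a dataset and aggregate length statistics."""
    ds = load_dataset(dataset_name, split="train")
    lengths = [len(tokenizer(row[text_column]).input_ids) for row in ds]
    return {
        "samples": len(lengths),
        "accepted_tokens": sum(lengths),
        "min_tokens": min(lengths),
        "max_tokens": max(lengths),
        "avg_tokens": sum(lengths) / len(lengths),
    }

# "text" is a hypothetical column name; check each dataset's schema first.
print(token_stats("LeoLM/OpenSchnabeltier", "text"))
```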