Commit 8f6a783 (parent: 682f971) by bjoernp: Create README.md

---
datasets:
- LeoLM/OpenSchnabeltier
- OpenAssistant/OASST-DE
- FreedomIntelligence/alpaca-gpt4-deutsch
- FreedomIntelligence/evol-instruct-deutsch
- LeoLM/German_Poems
- LeoLM/German_Songs
language:
- en
- de
library_name: transformers
pipeline_tag: text-generation
---
# LAION LeoLM: **L**inguistically **E**nhanced **O**pen **L**anguage **M**odel
Meet LeoLM, the first open and commercially available German Foundation Language Model built on Llama-2.
Our models extend Llama-2's capabilities into German through continued pretraining on a large corpus of German-language and mostly locality-specific text.
Thanks to a compute grant at HessianAI's new supercomputer **42**, we release two foundation models trained with 8k context length,
[`LeoLM/leo-hessianai-7b`](https://huggingface.co/LeoLM/leo-hessianai-7b) and [`LeoLM/leo-hessianai-13b`](https://huggingface.co/LeoLM/leo-hessianai-13b) under the [Llama-2 community license](https://huggingface.co/meta-llama/Llama-2-70b/raw/main/LICENSE.txt) (70b also coming soon! 👀).
With this release, we hope to bring a new wave of opportunities to German open-source and commercial LLM research and accelerate adoption.
Read our [blog post]() or our paper (preprint coming soon) for more details!

*A project by Björn Plüster and Christoph Schuhmann in collaboration with LAION and HessianAI.*

## LeoLM Chat
`LeoLM/leo-hessianai-7b-chat` is a German chat model built on our foundation model `LeoLM/leo-hessianai-7b` and finetuned on a selection of German instruction datasets.
The model performs exceptionally well on writing, explanation and discussion tasks but struggles somewhat with math and advanced reasoning. See our MT-Bench-DE scores:
```

```

## Model Details

- **Finetuned from:** [LeoLM/leo-hessianai-7b](https://huggingface.co/LeoLM/leo-hessianai-7b)
- **Model type:** Causal decoder-only transformer language model
- **Language:** English and German
- **Demo:** [Continuations for 250 random prompts (TGI, 4bit nf4 quantization)](https://open-assistant.github.io/oasst-model-eval/?f=https%3A%2F%2Fraw.githubusercontent.com%2FOpen-Assistant%2Foasst-model-eval%2Fmain%2Fsampling_reports%2Foasst-sft%2F2023-08-22_OpenAssistant_llama2-70b-oasst-sft-v10_sampling_noprefix2_nf4.json%0A)
- **License:** [LLAMA 2 COMMUNITY LICENSE AGREEMENT](https://huggingface.co/meta-llama/Llama-2-70b/raw/main/LICENSE.txt)
- **Contact:** [LAION Discord](https://discord.com/invite/eq3cAMZtCC) or [Björn Plüster](mailto:[email protected])

## Prompting / Prompt Template

Prompt dialogue template (ChatML format):

```
"""
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
"""
```

The model input can contain multiple conversation turns between user and assistant, e.g.
```
<|im_start|>user
{prompt 1}<|im_end|>
<|im_start|>assistant
{reply 1}<|im_end|>
<|im_start|>user
{prompt 2}<|im_end|>
<|im_start|>assistant
(...)
```
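
The ChatML template can also be built programmatically. Below is a minimal sketch in plain Python (no dependencies); the helper name `format_chatml` is our own illustration, not part of the model's tooling:

```python
# Illustrative ChatML prompt builder (not shipped with the model).
# Each message is a dict with "role" ("system", "user" or "assistant")
# and "content" (the message text).

def format_chatml(messages, add_generation_prompt=True):
    """Render messages in ChatML; optionally append the assistant
    header so the model continues with its reply."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    prompt = "\n".join(parts)
    if add_generation_prompt:
        prompt += "\n<|im_start|>assistant\n"
    return prompt

prompt = format_chatml([
    {"role": "system", "content": "Du bist ein hilfreicher Assistent."},
    {"role": "user", "content": "Schreibe ein kurzes Gedicht über Hessen."},
])
```

If the released tokenizer ships a ChatML chat template, the same formatting is typically available in `transformers` via `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`.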

## Ethical Considerations and Limitations

LeoLM has been tested in English and German, but this testing has not covered, nor could it cover, all scenarios.
For these reasons, as with all LLMs, the potential outputs of `LeoLM/leo-hessianai-7b-chat` cannot be predicted
in advance, and the model may in some instances produce inaccurate, biased or otherwise objectionable responses
to user prompts. Therefore, before deploying any applications of `LeoLM/leo-hessianai-7b-chat`, developers should
perform safety testing and tuning tailored to their specific applications of the model.

Please see Meta's [Responsible Use Guide](https://ai.meta.com/llama/responsible-use-guide/).

## Dataset Details
```
## Stats for 'Subset of OpenAssistant/OASST-DE' (3534 samples (100.0%))
-----------------
Accepted: 3534/3534 (100.0%)
Accepted tokens: 2259302
Skipped: 0 (0.0%)
Min tokens per sample: 29
Max tokens per sample: 2484
Avg tokens per sample: 639.3044708545557
-----------------

## Stats for 'Subset of FreedomIntelligence/evol-instruct-deutsch' (57841 samples (100.0%))
-----------------
Accepted: 57841/57841 (100.0%)
Accepted tokens: 42958192
Skipped: 0 (0.0%)
Min tokens per sample: 33
Max tokens per sample: 5507
Avg tokens per sample: 742.6944900675991
-----------------

## Stats for 'Subset of FreedomIntelligence/alpaca-gpt4-deutsch' (48969 samples (100.0%))
-----------------
Accepted: 48969/48969 (100.0%)
Accepted tokens: 13372005
Skipped: 0 (0.0%)
Min tokens per sample: 19
Max tokens per sample: 1359
Avg tokens per sample: 273.07082031489307
-----------------

## Stats for 'Subset of LeoLM/OpenSchnabeltier' (21314 samples (100.0%))
-----------------
Accepted: 21314/21314 (100.0%)
Accepted tokens: 8134690
Skipped: 0 (0.0%)
Min tokens per sample: 25
Max tokens per sample: 1202
Avg tokens per sample: 381.65947264708643
-----------------

## Stats for 'Subset of LeoLM/German_Poems' (490 samples (100.0%))
-----------------
Accepted: 490/490 (100.0%)
Accepted tokens: 618642
Skipped: 0 (0.0%)
Min tokens per sample: 747
Max tokens per sample: 1678
Avg tokens per sample: 1262.534693877551
-----------------

## Stats for 'Subset of LeoLM/German_Songs' (392 samples (100.0%))
-----------------
Accepted: 392/392 (100.0%)
Accepted tokens: 187897
Skipped: 0 (0.0%)
Min tokens per sample: 231
Max tokens per sample: 826
Avg tokens per sample: 479.3290816326531
-----------------

## Stats for 'total' (132540 samples (100.0%))
-----------------
Accepted: 132540/132540 (100.0%)
Accepted tokens: 67530728
Skipped: 0 (0.0%)
Min tokens per sample: 19
Max tokens per sample: 5507
Avg tokens per sample: 509.51205673758864
-----------------
```
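
As a sanity check, the 'total' row above can be reproduced from the per-subset counts. A small sketch (sample and token counts copied from the stats above):

```python
# (samples, accepted tokens) per subset, copied from the stats dump above.
subsets = {
    "OpenAssistant/OASST-DE": (3534, 2259302),
    "FreedomIntelligence/evol-instruct-deutsch": (57841, 42958192),
    "FreedomIntelligence/alpaca-gpt4-deutsch": (48969, 13372005),
    "LeoLM/OpenSchnabeltier": (21314, 8134690),
    "LeoLM/German_Poems": (490, 618642),
    "LeoLM/German_Songs": (392, 187897),
}

total_samples = sum(s for s, _ in subsets.values())   # 132540
total_tokens = sum(t for _, t in subsets.values())    # 67530728
avg_tokens = total_tokens / total_samples             # ~509.51 tokens per sample
```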