taidopurason committed on
Commit 08eb8ae
1 Parent(s): e930cb2

Update README.md

README.md CHANGED
@@ -6,18 +6,21 @@ pipeline_tag: text-generation
 library_name: transformers
 tags:
 - conversational
+base_model:
+- tartuNLP/Llammas-base
+- meta-llama/Llama-2-7b-hf
 ---
 
 # LLammas 🐑
 
-Llama-2-7B
-1. 5B tokens of CulturaX with 75% of documents in Estonian and 25% in English (see [Llammas-base](https://huggingface.co/tartuNLP/Llammas-base)),
-2. Alpaca-cleaned, Alpaca-est, OASST1 top-1 English conversations, CoT and FLAN-V2 following open-instruct (both 10,000), WMT18 English-Estonian translation development data (as documents), general MTee validation English-Estonian held-out data.
+Llama-2-7B instruction-tuned for Estonian in two stages:
+1. Continued pre-training: 5B tokens of CulturaX with 75% of documents in Estonian and 25% in English (see [Llammas-base](https://huggingface.co/tartuNLP/Llammas-base)),
+2. Instruction-tuning: Alpaca-cleaned, Alpaca-est, OASST1 top-1 English conversations, CoT and FLAN-V2 following open-instruct (both 10,000), WMT18 English-Estonian translation development data (as documents), and general MTee validation English-Estonian held-out data.
 
 [Alpaca-est](https://github.com/TartuNLP/alpaca-est) is an instruction dataset generated for Estonian with *gpt-3.5-turbo-0613*, following Alpaca. More details in our [paper](https://arxiv.org/abs/2404.04042).
 
 Additional resources:
-* Paper: [
+* Paper: [https://aclanthology.org/2024.findings-naacl.210/](https://aclanthology.org/2024.findings-naacl.210/)
 * Code: [github.com/TartuNLP/llammas](https://github.com/TartuNLP/llammas)
 * Base model: [tartuNLP/Llammas-base](https://huggingface.co/tartuNLP/Llammas-base)
 * 4-bit quantized model in GGUF: [AlbertUnn/LlammasGGUF](https://huggingface.co/AlbertUnn/LlammasGGUF)
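For readers of the card, a minimal usage sketch for the model described in the hunk above, using the transformers text-generation pipeline (the task matches the card's `pipeline_tag`). The plain-string prompt, dtype, and sampling settings are illustrative assumptions; the card's own usage section defines the exact conversation format.

```python
# Minimal sketch, not the card's official example. Assumptions: fp16 and
# device_map="auto" (requires the accelerate package) for the 7B model on
# GPU; a plain-string prompt rather than the card's conversation format;
# illustrative sampling settings.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="tartuNLP/Llammas",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Estonian for: "Write a short letter to a friend."
prompt = "Kirjuta sõbrale lühike kiri."
result = generator(prompt, max_new_tokens=200, do_sample=True, temperature=0.6)
print(result[0]["generated_text"])
```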
@@ -77,13 +80,24 @@ Kirja kirjutamiseks alustage tervitusega, näiteks "Tere!" või "Tere hommikust!
 
 ### Citation
 ```
-@
-
-
-
-
-
-
+@inproceedings{kuulmets-etal-2024-teaching,
+    title = "Teaching Llama a New Language Through Cross-Lingual Knowledge Transfer",
+    author = "Kuulmets, Hele-Andra and
+      Purason, Taido and
+      Luhtaru, Agnes and
+      Fishel, Mark",
+    editor = "Duh, Kevin and
+      Gomez, Helena and
+      Bethard, Steven",
+    booktitle = "Findings of the Association for Computational Linguistics: NAACL 2024",
+    month = jun,
+    year = "2024",
+    address = "Mexico City, Mexico",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/2024.findings-naacl.210",
+    doi = "10.18653/v1/2024.findings-naacl.210",
+    pages = "3309--3325",
+    abstract = "This paper explores cost-efficient methods to adapt pretrained Large Language Models (LLMs) to new lower-resource languages, with a specific focus on Estonian. Leveraging the Llama 2 model, we investigate the impact of combining cross-lingual instruction-tuning with additional monolingual pretraining. Our results demonstrate that even a relatively small amount of additional monolingual pretraining followed by cross-lingual instruction-tuning significantly enhances results on Estonian. Furthermore, we showcase cross-lingual knowledge transfer from high-quality English instructions to Estonian, resulting in improvements in commonsense reasoning and multi-turn conversation capabilities. Our best model, named Llammas, represents the first open-source instruction-following LLM for Estonian. Additionally, we publish Alpaca-est, the first general task instruction dataset for Estonia. These contributions mark the initial progress in the direction of developing open-source LLMs for Estonian.",
 }
 
 ```
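The resources list above also links a 4-bit GGUF build. Below is a sketch of running it locally with the llama-cpp-python bindings; the file name passed to `hf_hub_download` is hypothetical (browse AlbertUnn/LlammasGGUF for the actual one), and the context size and token budget are assumptions.

```python
# Sketch: run the 4-bit GGUF build with llama-cpp-python.
# Assumptions: the .gguf file name below is hypothetical (check the files
# of AlbertUnn/LlammasGGUF); context size and token budget are illustrative.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="AlbertUnn/LlammasGGUF",
    filename="llammas.Q4_K_M.gguf",  # hypothetical file name
)

llm = Llama(model_path=model_path, n_ctx=2048)
# Estonian for: "Hi! How are you?"
out = llm("Tere! Kuidas sul läheb?", max_tokens=128)
print(out["choices"][0]["text"])
```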