taidopurason committed
Commit 08eb8ae
1 Parent(s): e930cb2

Update README.md

Files changed (1)
  1. README.md +25 -11
README.md CHANGED
@@ -6,18 +6,21 @@ pipeline_tag: text-generation
 library_name: transformers
 tags:
 - conversational
+base_model:
+- tartuNLP/Llammas-base
+- meta-llama/Llama-2-7b-hf
 ---
 
 # LLammas 🐑
 
-Llama-2-7B finetuned in two stages:
-1. 5B tokens of CulturaX with 75% of documents in Estonian and 25% in English (see [Llammas-base](https://huggingface.co/tartuNLP/Llammas-base)),
-2. Alpaca-cleaned, Alpaca-est, OASST1 top-1 English conversations, CoT and FLAN-V2 following open-instruct (both 10,000), WMT18 English-Estonian translation development data (as documents), general MTee validation English-Estonian held-out data.
+Llama-2-7B instruction-tuned for Estonian in two stages:
+1. Continued pre-training: 5B tokens of CulturaX with 75% of documents in Estonian and 25% in English (see [Llammas-base](https://huggingface.co/tartuNLP/Llammas-base)),
+2. Instruction-tuning: Alpaca-cleaned, Alpaca-est, OASST1 top-1 English conversations, CoT and FLAN-V2 following open-instruct (10,000 examples from each), WMT18 English-Estonian translation development data (as documents), and general MTee validation English-Estonian held-out data.
 
 [Alpaca-est](https://github.com/TartuNLP/alpaca-est) is an instruction dataset generated for Estonian with *gpt-3.5-turbo-0613*, following Alpaca. More details in our [paper](https://arxiv.org/abs/2404.04042).
 
 Additional resources:
-* Paper: [arxiv.org/abs/2404.04042](https://arxiv.org/abs/2404.04042)
+* Paper: [aclanthology.org/2024.findings-naacl.210](https://aclanthology.org/2024.findings-naacl.210/)
 * Code: [github.com/TartuNLP/llammas](https://github.com/TartuNLP/llammas)
 * Base model: [tartuNLP/Llammas-base](https://huggingface.co/tartuNLP/Llammas-base)
 * 4-bit quantized model in GGUF: [AlbertUnn/LlammasGGUF](https://huggingface.co/AlbertUnn/LlammasGGUF)
@@ -77,13 +80,24 @@ Kirja kirjutamiseks alustage tervitusega, näiteks "Tere!" või "Tere hommikust!
 
 ### Citation
 ```
-@misc{kuulmets2024teaching,
-    title={Teaching Llama a New Language Through Cross-Lingual Knowledge Transfer},
-    author={Hele-Andra Kuulmets and Taido Purason and Agnes Luhtaru and Mark Fishel},
-    year={2024},
-    eprint={2404.04042},
-    archivePrefix={arXiv},
-    primaryClass={cs.CL}
+@inproceedings{kuulmets-etal-2024-teaching,
+    title = "Teaching Llama a New Language Through Cross-Lingual Knowledge Transfer",
+    author = "Kuulmets, Hele-Andra and
+      Purason, Taido and
+      Luhtaru, Agnes and
+      Fishel, Mark",
+    editor = "Duh, Kevin and
+      Gomez, Helena and
+      Bethard, Steven",
+    booktitle = "Findings of the Association for Computational Linguistics: NAACL 2024",
+    month = jun,
+    year = "2024",
+    address = "Mexico City, Mexico",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/2024.findings-naacl.210",
+    doi = "10.18653/v1/2024.findings-naacl.210",
+    pages = "3309--3325",
+    abstract = "This paper explores cost-efficient methods to adapt pretrained Large Language Models (LLMs) to new lower-resource languages, with a specific focus on Estonian. Leveraging the Llama 2 model, we investigate the impact of combining cross-lingual instruction-tuning with additional monolingual pretraining. Our results demonstrate that even a relatively small amount of additional monolingual pretraining followed by cross-lingual instruction-tuning significantly enhances results on Estonian. Furthermore, we showcase cross-lingual knowledge transfer from high-quality English instructions to Estonian, resulting in improvements in commonsense reasoning and multi-turn conversation capabilities. Our best model, named Llammas, represents the first open-source instruction-following LLM for Estonian. Additionally, we publish Alpaca-est, the first general task instruction dataset for Estonia. These contributions mark the initial progress in the direction of developing open-source LLMs for Estonian.",
 }
 
 ```
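Since the card declares `library_name: transformers` and `pipeline_tag: text-generation`, a minimal usage sketch follows. It is not taken from the card: the repo id `tartuNLP/Llammas` is assumed from the model name (only `tartuNLP/Llammas-base` is linked explicitly in this diff), and the plain-text prompt is only illustrative; consult the full model card for the intended conversation format.

```python
# Minimal sketch (assumptions noted, not from this diff): run the instruction-tuned
# model with the transformers text-generation pipeline.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="tartuNLP/Llammas",  # assumed Hugging Face repo id for the instruction-tuned model
    device_map="auto",
)

# Illustrative Estonian prompt: "Please write a short letter to a friend."
prompt = "Palun kirjuta lühike kiri sõbrale."
result = generator(prompt, max_new_tokens=200, do_sample=True, temperature=0.7)
print(result[0]["generated_text"])
```

For the 4-bit GGUF variant listed above, a llama.cpp-compatible runtime would be used instead of transformers.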