---
language:
- et
- en
pipeline_tag: text-generation
library_name: transformers
tags:
- conversational
base_model:
- tartuNLP/Llammas-base
---

# LLammas 🐑

Llama-2-7B instruction-tuned for Estonian in two stages:
1. Continued pre-training: 5B tokens of CulturaX, with 75% of documents in Estonian and 25% in English (see [Llammas-base](https://huggingface.co/tartuNLP/Llammas-base)).
2. Instruction-tuning: Alpaca-cleaned, Alpaca-est, OASST1 top-1 English conversations, CoT and FLAN-V2 following open-instruct (10,000 examples from each), WMT18 English-Estonian translation development data (as documents), and English-Estonian held-out data from the general-domain MTee validation set.

[Alpaca-est](https://github.com/TartuNLP/alpaca-est) is an instruction dataset generated for Estonian with *gpt-3.5-turbo-0613*, following Alpaca. More details are in our [paper](https://arxiv.org/abs/2404.04042).

Additional resources:
* Paper: [https://aclanthology.org/2024.findings-naacl.210/](https://aclanthology.org/2024.findings-naacl.210/)
* Code: [github.com/TartuNLP/llammas](https://github.com/TartuNLP/llammas)
* Base model: [tartuNLP/Llammas-base](https://huggingface.co/tartuNLP/Llammas-base)
* 4-bit quantized model in GGUF: [AlbertUnn/LlammasGGUF](https://huggingface.co/AlbertUnn/LlammasGGUF)
* Alpaca-est dataset: [github.com/TartuNLP/alpaca-est](https://github.com/TartuNLP/alpaca-est)

### Using the model

Using the model in a text-generation pipeline:
```
from transformers import pipeline
import torch

pipe = pipeline("text-generation", model="tartuNLP/Llammas", torch_dtype=torch.bfloat16, device_map="auto")

messages = [
    {"role": "user", "content": "Tere!"},
    {"role": "assistant", "content": "Tere! Kas saaksin teid kuidagi aidata?"},
    {"role": "user", "content": "Kuidas alustada kirja kirjutamist?"}
]

prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.6, top_k=50, top_p=0.9)
print(outputs[0]["generated_text"][len(prompt):])
```

Using the model in a conversational pipeline (works with transformers==4.36.2; newer versions have issues with the output):
```
from transformers import pipeline, Conversation
import torch

pipe = pipeline("conversational", model="tartuNLP/Llammas", torch_dtype=torch.bfloat16, device_map="auto")

messages = [
    {"role": "user", "content": "Tere!"},
    {"role": "assistant", "content": "Tere! Kas saaksin teid kuidagi aidata?"},
    {"role": "user", "content": "Kuidas alustada kirja kirjutamist?"}
]

conversation = Conversation(messages)
conversation = pipe(conversation)
```

Conversational format:
```
<|user|>
Tere!
<|assistant|>
Tere! Kas saaksin teid kuidagi aidata?
<|user|>
Kuidas alustada kirja kirjutamist?
<|assistant|>
Kirja kirjutamiseks alustage tervitusega, nĂ€iteks "Tere!" vĂ”i "Tere hommikust!". SeejĂ€rel tutvustage ennast ja mainige, kellega kirjutate. Kirjeldage oma mĂ”tteid vĂ”i kĂŒsimusi, mida soovite arutada. LĂ”petage kiri viisakalt, nĂ€iteks "TĂ€nan teid tĂ€helepanu eest!" vĂ”i "Parimate soovidega!"
```
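If you prefer to call the model directly instead of going through a pipeline, the sketch below renders the conversation with the tokenizer's chat template (which produces the format shown above) and generates with the same sampling settings as the pipeline example. It is a minimal sketch, not an official usage recipe:
```
# Minimal sketch: direct use with AutoTokenizer/AutoModelForCausalLM instead of a pipeline.
# Assumes the chat template bundled with the tokenizer produces the <|user|>/<|assistant|> format above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tartuNLP/Llammas")
model = AutoModelForCausalLM.from_pretrained(
    "tartuNLP/Llammas", torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "user", "content": "Tere!"},
    {"role": "assistant", "content": "Tere! Kas saaksin teid kuidagi aidata?"},
    {"role": "user", "content": "Kuidas alustada kirja kirjutamist?"}
]

# Render the conversation and append the assistant header so the model answers next.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256, do_sample=True,
                            temperature=0.6, top_k=50, top_p=0.9)

# Decode only the newly generated tokens.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```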
### Citation
```
@inproceedings{kuulmets-etal-2024-teaching,
    title = "Teaching Llama a New Language Through Cross-Lingual Knowledge Transfer",
    author = "Kuulmets, Hele-Andra and Purason, Taido and Luhtaru, Agnes and Fishel, Mark",
    editor = "Duh, Kevin and Gomez, Helena and Bethard, Steven",
    booktitle = "Findings of the Association for Computational Linguistics: NAACL 2024",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-naacl.210",
    doi = "10.18653/v1/2024.findings-naacl.210",
    pages = "3309--3325",
    abstract = "This paper explores cost-efficient methods to adapt pretrained Large Language Models (LLMs) to new lower-resource languages, with a specific focus on Estonian. Leveraging the Llama 2 model, we investigate the impact of combining cross-lingual instruction-tuning with additional monolingual pretraining. Our results demonstrate that even a relatively small amount of additional monolingual pretraining followed by cross-lingual instruction-tuning significantly enhances results on Estonian. Furthermore, we showcase cross-lingual knowledge transfer from high-quality English instructions to Estonian, resulting in improvements in commonsense reasoning and multi-turn conversation capabilities. Our best model, named Llammas, represents the first open-source instruction-following LLM for Estonian. Additionally, we publish Alpaca-est, the first general task instruction dataset for Estonia. These contributions mark the initial progress in the direction of developing open-source LLMs for Estonian.",
}
```