---
base_model: TurkuNLP/gpt3-finnish-xl
license: apache-2.0
datasets:
- TurkuNLP/squad_v2_fi
language:
- fi
pipeline_tag: text-generation
---

# Model Card for Futurice/gpt3-finnish-xl-instruct

The gpt3-finnish-xl-instruct model is an instruction fine-tuned model intended for RAG-type Q&A in Finnish.

## Model Details

### Model Description

The gpt3-finnish-xl-instruct model is based on the TurkuNLP Finnish GPT-3 models, a family of pretrained monolingual GPT-style language models built on the BLOOM architecture. It was fine-tuned on a sample of the TurkuNLP/squad_v2_fi dataset, which was translated from SQuAD2.0 with DeepL.

- **Developed by:** Martti Sutinen
- **Model type:** Bloom
- **Language(s) (NLP):** Finnish
- **License:** Apache-2.0
- **Finetuned from model:** TurkuNLP/gpt3-finnish-xl

## Uses

Intended for RAG-type Q&A in Finnish.

### Direct Use

Intended for text generation and RAG-type Q&A in Finnish. Supply a context and ask a question about it.

### Out-of-Scope Use

Please do not misuse the model. It is not recommended for other use cases.

## Bias, Risks, and Limitations

A key limitation is the small and simple selection of fine-tuning data. Please do not expect high-quality answers.

### Recommendations

Continued fine-tuning with more data, or moving to a newer architecture, is recommended.

## How to Get Started with the Model

- Recommended system message: "Olet avustaja. Seuraavaksi saat kysymyksen tai tehtävän. Kirjoita vastaus parhaasi mukaan siten että se täyttää kysymyksen tai tehtävän vaatimukset."
- Recommended format for a question about a context: "Tausta: {context} \n\nKäytä vain taustaa ja vastaa kysymykseen tai tehtävään: {question}"
- Prompt format: `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`

Here `messages` typically has the format `messages = [{"role": "system", "content": system_message}, {"role": "user", "content": prompt_with_context}]`.

Here is what the input could look like:

\<|im_start|>system
Olet avustaja. Seuraavaksi saat kysymyksen tai tehtävän. Kirjoita vastaus parhaasi mukaan siten että se täyttää kysymyksen tai tehtävän vaatimukset.<|im_end|>
\<|im_start|>user
Tausta: Dokumentti luotiin tammikuussa. Sen kirjoittajaa ei tunneta.

Käytä vain taustaa ja vastaa kysymykseen tai tehtävään: Milloin dokumentti kirjoitettiin?<|im_end|>
\<|im_start|>assistant

Use a pipeline with the task `text-generation` and the recommended format.

## Training Details

### Training Data

Trained with 40000 random samples from test data in: [TurkuNLP/squad_v2_fi](https://huggingface.co/datasets/TurkuNLP/squad_v2_fi).

### Training Procedure

Training used a 4-bit base model with supervised fine-tuning and LoRA.

#### Training Hyperparameters

- **Training regime:** 4-bit, batch size 4, max steps 20000, data collator for completion only

## Evaluation

The model has not yet been properly evaluated.

### Testing Data, Factors & Metrics

#### Testing Data

Evaluated with 1000 random samples from test data in: [TurkuNLP/squad_v2_fi](https://huggingface.co/datasets/TurkuNLP/squad_v2_fi).

#### Factors

Same factors as in SQuAD2.0.

#### Metrics

Loss.

### Results

No results to be shared yet.

#### Summary

## Environmental Impact

Environmental impact has not yet been evaluated. Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** Mostly trained on A100
- **Hours used:** 5-10 hours
- **Cloud Provider:** GCP
- **Compute Region:** Unknown
- **Carbon Emitted:** Not evaluated

### Model Architecture and Objective

Bloom.

### Compute Infrastructure

Colab.

#### Hardware

1 x A100.

#### Software

Typical software used.

## Model Card Contact

Martti Sutinen
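
## Appendix: Prompt Construction Sketch

The prompt format described under "How to Get Started with the Model" can be sketched in plain Python. This is a minimal illustration, not the author's code: the helper names `build_user_prompt`, `build_messages`, and `render_chatml` are hypothetical, and `render_chatml` is a hand-written approximation of the layout shown in the example input. The authoritative template is whatever ships with the model's tokenizer, so in practice the string should come from `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`.

```python
# Recommended system message, copied from the model card.
SYSTEM_MESSAGE = (
    "Olet avustaja. Seuraavaksi saat kysymyksen tai tehtävän. "
    "Kirjoita vastaus parhaasi mukaan siten että se täyttää "
    "kysymyksen tai tehtävän vaatimukset."
)


def build_user_prompt(context: str, question: str) -> str:
    """Fill the recommended context/question format (hypothetical helper)."""
    return (
        f"Tausta: {context} \n\n"
        f"Käytä vain taustaa ja vastaa kysymykseen tai tehtävään: {question}"
    )


def build_messages(context: str, question: str,
                   system_message: str = SYSTEM_MESSAGE) -> list:
    """Build the messages list in the shape the model card describes."""
    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": build_user_prompt(context, question)},
    ]


def render_chatml(messages: list, add_generation_prompt: bool = True) -> str:
    """Approximate the example layout by hand (assumption: ChatML-style tags).

    The tokenizer's own chat template is the source of truth; this function
    only mirrors the example so the structure is easy to inspect.
    """
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn so the model continues with its answer.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


if __name__ == "__main__":
    messages = build_messages(
        context="Dokumentti luotiin tammikuussa. Sen kirjoittajaa ei tunneta.",
        question="Milloin dokumentti kirjoitettiin?",
    )
    print(render_chatml(messages))
```

The resulting string (or, preferably, the output of `apply_chat_template` on the same `messages`) is what would be passed to a `text-generation` pipeline. Keeping message construction separate from rendering makes it easy to swap the hand-written renderer for the tokenizer's template.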