---
base_model: AI-Sweden-Models/gpt-sw3-1.3b
library_name: peft
datasets:
- barbaroo/Sprotin_parallel
language:
- en
- fo
metrics:
- bleu
- chrf
- bertscore
pipeline_tag: text-generation
---

# Model Card: English–Faroese Translation Adapter

## Model Details

**Model Description**

- **Developed by:** Barbara Scalvini
- **Model type:** Language model adapter for **English → Faroese** translation
- **Language(s):** English, Faroese
- **License:** This adapter inherits the license of the base GPT-SW3 1.3B model.
- **Finetuned from model:** [AI-Sweden-Models/gpt-sw3-1.3b](https://huggingface.co/AI-Sweden-Models/gpt-sw3-1.3b)
- **Library used:** [PEFT 0.13.0](https://github.com/huggingface/peft)

### Model Sources

- **Paper:** [COMING SOON]

---

## Uses

### Direct Use

This adapter is intended for **English → Faroese** translation, using a **parameter-efficient fine-tuning** (PEFT) approach on top of GPT-SW3 1.3B.

### Downstream Use

- Can be integrated into broader **multilingual** or **localization** workflows.

### Out-of-Scope Use

- Inputs in languages other than **English or Faroese** will likely yield suboptimal results.
- Other tasks (e.g., summarization, classification) may be unsupported or require further fine-tuning.

---

## Bias, Risks, and Limitations

- **Biases:** The model may reflect **biases** present in the training data, such as historical or societal biases in English or Faroese texts.
- **Recommendation:** Users should **critically evaluate** outputs, especially in sensitive or high-stakes applications.

---

## How to Get Started with the Model

```python
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

ADAPTER_REPO = "barbaroo/gptsw3_translate_1.3B"
BASE_MODEL = "AI-Sweden-Models/gpt-sw3-1.3b"

# 1. Load the tokenizer from the base model
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

# 2. Load the base model with the translation adapter applied
model = AutoPeftModelForCausalLM.from_pretrained(
    ADAPTER_REPO,
    load_in_8bit=True,   # Optional: 8-bit quantization for GPU memory efficiency
    device_map="auto",   # Automatically spread layers across available GPUs
)
model.eval()  # Ensure the model is in evaluation mode

# Alpaca-style prompt template
alpaca_prompt = """
### Instruction:
{}

### Input:
{}

### Response:
{}
"""

# EOS token from the tokenizer
EOS_TOKEN = tokenizer.eos_token

sentences = ["hello world"]
translations = []

for sentence in sentences:
    # Build the prompt for each sentence and tokenize it
    inputs = tokenizer(
        [
            alpaca_prompt.format(
                "Translate this sentence from English to Faroese:",  # instruction
                sentence,  # input sentence to translate
                "",        # response left blank for generation
            )
        ],
        return_tensors="pt",
    ).to("cuda")

    # Generate the output
    outputs = model.generate(
        **inputs,
        max_new_tokens=2000,
        eos_token_id=tokenizer.eos_token_id,  # Stop generation at the EOS token
        pad_token_id=tokenizer.pad_token_id,  # Use the tokenizer's padding token
        use_cache=True,
        do_sample=True,
        temperature=0.1,
        top_p=1,
    )

    # Decode the generated tokens into a string
    output_string = tokenizer.batch_decode(outputs, skip_special_tokens=False)[0]
    # print(output_string)  # uncomment to inspect the raw model output

    # Extract the text after the "Response:" marker and strip the EOS token
    try:
        response = output_string.split("Response:\n", 1)[1]
        translation = response.replace(EOS_TOKEN, "")
    except IndexError:
        translation = ""
    translations.append(translation)
    print(translation)
```

## Training Details

### Training Data

We used the Sprotin parallel corpus for **English–Faroese** translation: [barbaroo/Sprotin_parallel](https://huggingface.co/datasets/barbaroo/Sprotin_parallel).

### Training Procedure

#### Preprocessing

- **Tokenization:** We used the tokenizer from the base model `AI-Sweden-Models/gpt-sw3-1.3b`.
- Training examples followed the Alpaca prompt format, with Instruction, Input and Response fields (see the fine-tuning sketch below).

#### Training Hyperparameters

- **Epochs:** **3** total, with an **early stopping** criterion monitoring validation loss.
- **Batch size:** **2**, with **4** gradient accumulation steps
- **Learning rate:** **2e-4**
- **Optimizer:** **AdamW** with a linear learning-rate scheduler and warm-up.
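The exact fine-tuning script is not included in this card, so the following is only a minimal sketch of a setup consistent with the details above, using a `peft` LoRA adapter and the Hugging Face `Trainer`. The LoRA rank/alpha, target modules, warm-up fraction, maximum sequence length, and the dataset column and split names (`en`, `fo`, `train`, `validation`) are illustrative assumptions, not documented values.

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "AI-Sweden-Models/gpt-sw3-1.3b"

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # GPT-style models often lack a pad token

model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, device_map="auto")

# Hypothetical LoRA settings: the rank, alpha and target modules actually used
# for this adapter are not documented in this card.
peft_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # assumption for the GPT-2-style GPT-SW3 blocks
)
model = get_peft_model(model, peft_config)

alpaca_prompt = """
### Instruction:
{}

### Input:
{}

### Response:
{}
"""

def to_features(example):
    # Column names "en"/"fo" are assumptions about barbaroo/Sprotin_parallel.
    text = alpaca_prompt.format(
        "Translate this sentence from English to Faroese:",
        example["en"],
        example["fo"],
    ) + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=512)

raw = load_dataset("barbaroo/Sprotin_parallel")  # assumes train/validation splits
tokenized = raw.map(to_features, remove_columns=raw["train"].column_names)

args = TrainingArguments(
    output_dir="gptsw3-en-fo-adapter",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    lr_scheduler_type="linear",
    warmup_ratio=0.03,                  # warm-up fraction is an assumption
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,        # required for early stopping on validation loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```

In a setup like this only the LoRA parameters are updated, which is what keeps the adapter small enough to distribute separately from the 1.3B base model and load with `AutoPeftModelForCausalLM` as shown above.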
---

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

- The model was evaluated on the **FLORES-200** benchmark (~1,012 English–Faroese sentence pairs).

#### Metrics and Results

- **BLEU:** **0.179**
- **chrF:** **49.2**
- **BERTScore (F1):** **0.947**

Human evaluation was also performed (see the paper). A sketch for recomputing the automatic metrics is included at the end of this card.

## Citation

[COMING SOON]

---

## Framework versions

- PEFT 0.13.0
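As referenced in the evaluation section, the reported automatic metrics could be recomputed roughly as follows. This is a sketch rather than the actual evaluation script: it assumes `sacrebleu` and `bert-score` are installed, and the `hypotheses`/`references` lists are placeholders for the model's outputs and the FLORES-200 Faroese references. Note that `sacrebleu` reports BLEU and chrF on a 0-100 scale, whereas the BLEU figure above appears to be on a 0-1 scale.

```python
# Sketch only: recompute BLEU, chrF and BERTScore for a list of translations.
import sacrebleu
from bert_score import score as bert_score

hypotheses = ["..."]  # placeholder: model translations of the FLORES-200 English side
references = ["..."]  # placeholder: FLORES-200 Faroese references

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])

# lang="fo" makes bert-score fall back to a multilingual BERT model; the exact
# model behind the reported BERTScore is not documented in this card.
_, _, f1 = bert_score(hypotheses, references, lang="fo")

print(f"BLEU: {bleu.score:.1f}")   # sacrebleu scale: 0-100
print(f"chrF: {chrf.score:.1f}")
print(f"BERTScore F1: {f1.mean().item():.3f}")
```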