---
license: llama3
language:
- tr
model-index:
- name: Kocdigital-LLM-8b-v0.1
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge TR
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc
      value: 44.03
      name: accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag TR
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc
      value: 46.73
      name: accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU TR
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 49.11
      name: accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA TR
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: acc
      value: 48.21
      name: accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande TR
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc
      value: 54.98
      name: accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k TR
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 51.78
      name: accuracy
---
<img src="https://huggingface.co/KOCDIGITAL/Kocdigital-LLM-8b-v0.1/resolve/main/icon.jpeg"
alt="KOCDIGITAL LLM" width="420"/>
# Kocdigital-LLM-8b-v0.1

This model is a fine-tuned version of the Llama3 8B large language model (LLM) for Turkish. It was trained on high-quality Turkish instruction sets created from various open-source and internal resources, carefully annotated so that the model carries out Turkish instructions accurately and in an organized manner. Training was done with the QLoRA method.
## Model Details
- **Base Model**: Llama3 8B based LLM
- **Training Dataset**: High Quality Turkish instruction sets
- **Training Method**: SFT with QLoRA
### QLoRA Fine-Tuning Configuration
- `lora_alpha`: 128
- `lora_dropout`: 0
- `r`: 64
- `target_modules`: "q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"
- `bias`: "none"
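The adapter settings above map directly onto `peft.LoraConfig` keyword arguments. A minimal sketch, collecting them in one place (illustrative only; training hyperparameters beyond those listed are not published):

```python
# QLoRA adapter settings as listed above, expressed as the keyword
# arguments one would pass to peft.LoraConfig (a sketch; any other
# training hyperparameters are not published in this card).
qlora_kwargs = dict(
    lora_alpha=128,
    lora_dropout=0,
    r=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
)

# Effective LoRA scaling factor is lora_alpha / r:
scaling = qlora_kwargs["lora_alpha"] / qlora_kwargs["r"]  # 2.0
```

Note that all seven attention and MLP projection matrices are adapted, so the LoRA updates cover every linear layer in each transformer block.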
## Usage Examples
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "KOCDIGITAL/Kocdigital-LLM-8b-v0.1",
    model_max_length=4096)

model = AutoModelForCausalLM.from_pretrained(
    "KOCDIGITAL/Kocdigital-LLM-8b-v0.1",
    load_in_4bit=True,
    device_map="auto",
)

# System prompt (Turkish): "You are a general-purpose Turkish-speaking
# assistant. Always follow the user's instructions accurately, concisely,
# and with proper grammar."
system = 'Sen Türkçe konuşan genel amaçlı bir asistansın. Her zaman kullanıcının verdiği talimatları doğru, kısa ve güzel bir gramer ile yerine getir.'

# Instruction (Turkish): "Can you list the 3 largest provinces of Turkey?"
template = "{}\n\n###Talimat\n{}\n###Yanıt\n"
content = template.format(system, 'Türkiyenin 3 büyük ilini listeler misin.')

conv = [{'role': 'user', 'content': content}]

# Render the chat template to a plain string first, then tokenize it.
prompt = tokenizer.apply_chat_template(conv,
                                       tokenize=False,
                                       add_generation_prompt=True)
print(prompt)

inputs = tokenizer([prompt],
                   return_tensors="pt",
                   add_special_tokens=False).to("cuda")

outputs = model.generate(**inputs,
                         max_new_tokens=512,
                         use_cache=True,
                         do_sample=True,
                         top_k=50,
                         top_p=0.60,
                         temperature=0.3,
                         repetition_penalty=1.1)
out_text = tokenizer.batch_decode(outputs)[0]
print(out_text)
```
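On recent `transformers` versions, passing `load_in_4bit=True` directly to `from_pretrained` is deprecated in favor of an explicit `BitsAndBytesConfig`. An equivalent sketch, assuming `bitsandbytes` is installed and a CUDA GPU is available:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization, equivalent to load_in_4bit=True above.
# bnb_4bit_compute_dtype is an assumption; pick a dtype your GPU supports.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "KOCDIGITAL/Kocdigital-LLM-8b-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
```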
# Open LLM Turkish Leaderboard v0.2 Evaluation Results
| Metric | Value |
|---------------------------------|------:|
| Avg. | 49.11 |
| AI2 Reasoning Challenge_tr-v0.2 | 44.03 |
| HellaSwag_tr-v0.2 | 46.73 |
| MMLU_tr-v0.2 | 49.11 |
| TruthfulQA_tr-v0.2 | 48.51 |
| Winogrande_tr-v0.2              | 54.98 |
| GSM8k_tr-v0.2 | 51.78 |
## Considerations on Limitations, Risks, Bias, and Ethical Factors
### Limitations and Recognized Biases
- **Core Functionality and Usage:** KocDigital LLM is an autoregressive language model whose core function is predicting the next token in a text sequence. Although such models are applied across many contexts, comprehensive real-world testing of this model has not been conducted, so its effectiveness and consistency in diverse situations remain largely unvalidated.
- **Language Understanding and Generation:** The model's training focused mainly on standard English and Turkish. Its ability to understand and generate slang, colloquial language, or other languages may be limited, possibly resulting in errors or misinterpretations.
- **Production of Misleading Information:** Users should acknowledge that KocDigital LLM might generate incorrect or deceptive information. Results should be viewed as initial prompts or recommendations rather than absolute conclusions.
### Ethical Concerns and Potential Risks
- **Risk of Misuse:** KocDigital LLM carries the potential for generating language that could be offensive or harmful. We strongly advise against its utilization for such purposes and stress the importance of conducting thorough safety and fairness assessments tailored to specific applications before implementation.
- **Unintended Biases and Content:** The model underwent training on a vast corpus of text data without explicit vetting for offensive material or inherent biases. Consequently, it may inadvertently generate content reflecting these biases or inaccuracies.
- **Toxicity:** Despite efforts to curate appropriate training data, the model has the capacity to produce harmful content, particularly when prompted explicitly. We encourage active participation from the open-source community to devise strategies aimed at mitigating such risks.
### Guidelines for Secure and Ethical Utilization
- **Human Oversight:** We advocate for the integration of a human oversight mechanism or the utilization of filters to oversee and enhance the quality of outputs, particularly in applications accessible to the public. This strategy can assist in minimizing the likelihood of unexpectedly generating objectionable content.
- **Tailored Testing for Specific Applications:** Developers planning to utilize KocDigital LLM should execute comprehensive safety assessments and optimizations customized to their unique applications. This step is essential as the model's responses may exhibit unpredictability and occasional biases, inaccuracies, or offensive outputs.
- **Responsible Development and Deployment:** Developers and users of KocDigital LLM bear the responsibility for ensuring its ethical and secure application. We encourage users to be cognizant of the model's limitations and to implement appropriate measures to prevent misuse or adverse outcomes. |