---
license: llama3
language:
- tr
model-index:
- name: Kocdigital-LLM-8b-v0.1
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge TR
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc
      value: 44.03
      name: accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag TR
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc
      value: 46.73
      name: accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU TR
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 49.11
      name: accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA TR
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: acc
      value: 48.21
      name: accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande TR
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc
      value: 54.98
      name: accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k TR
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 51.78
      name: accuracy
---
<img src="https://huggingface.co/KOCDIGITAL/Kocdigital-LLM-8b-v0.1/resolve/main/icon.jpeg"
alt="KOCDIGITAL LLM" width="420"/>
# Kocdigital-LLM-8b-v0.1

This model is a fine-tuned version of the Llama3 8B large language model (LLM) for Turkish. It was trained on high-quality Turkish instruction sets created from various open-source and internal resources, carefully annotated so that the model carries out Turkish instructions accurately and in an organized manner. Training was done with the QLoRA method.
## Model Details
- **Base Model**: Llama3 8B based LLM
- **Training Dataset**: High Quality Turkish instruction sets
- **Training Method**: SFT with QLoRA
### QLoRA Fine-Tuning Configuration
- `lora_alpha`: 128
- `lora_dropout`: 0
- `r`: 64
- `target_modules`: "q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"
- `bias`: "none"
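The adapter settings above map directly onto `peft.LoraConfig` keyword arguments. A minimal sketch, collecting them in one place (illustrative only; training hyperparameters beyond those listed are not published):

```python
# QLoRA adapter settings as listed above, expressed as the keyword
# arguments one would pass to peft.LoraConfig (a sketch; any other
# training hyperparameters are not published in this card).
qlora_kwargs = dict(
    lora_alpha=128,
    lora_dropout=0,
    r=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
)

# Effective LoRA scaling factor is lora_alpha / r:
scaling = qlora_kwargs["lora_alpha"] / qlora_kwargs["r"]  # 2.0
```

Note that all seven attention and MLP projection matrices are adapted, so the LoRA updates cover every linear layer in each transformer block.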
## Usage Examples
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "KOCDIGITAL/Kocdigital-LLM-8b-v0.1",
    model_max_length=4096)

model = AutoModelForCausalLM.from_pretrained(
    "KOCDIGITAL/Kocdigital-LLM-8b-v0.1",
    load_in_4bit=True,
    device_map="auto",
)

# System prompt (Turkish): "You are a general-purpose Turkish-speaking
# assistant. Always follow the user's instructions accurately, concisely,
# and with proper grammar."
system = 'Sen Türkçe konuşan genel amaçlı bir asistansın. Her zaman kullanıcının verdiği talimatları doğru, kısa ve güzel bir gramer ile yerine getir.'

# Instruction (Turkish): "Can you list the 3 largest provinces of Turkey?"
template = "{}\n\n###Talimat\n{}\n###Yanıt\n"
content = template.format(system, 'Türkiyenin 3 büyük ilini listeler misin.')

conv = [{'role': 'user', 'content': content}]

# Render the chat template to a plain string first, then tokenize it.
prompt = tokenizer.apply_chat_template(conv,
                                       tokenize=False,
                                       add_generation_prompt=True)
print(prompt)

inputs = tokenizer([prompt],
                   return_tensors="pt",
                   add_special_tokens=False).to("cuda")

outputs = model.generate(**inputs,
                         max_new_tokens=512,
                         use_cache=True,
                         do_sample=True,
                         top_k=50,
                         top_p=0.60,
                         temperature=0.3,
                         repetition_penalty=1.1)
out_text = tokenizer.batch_decode(outputs)[0]
print(out_text)
```
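On recent `transformers` versions, passing `load_in_4bit=True` directly to `from_pretrained` is deprecated in favor of an explicit `BitsAndBytesConfig`. An equivalent sketch, assuming `bitsandbytes` is installed and a CUDA GPU is available:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization, equivalent to load_in_4bit=True above.
# bnb_4bit_compute_dtype is an assumption; pick a dtype your GPU supports.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "KOCDIGITAL/Kocdigital-LLM-8b-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
```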
# Open LLM Turkish Leaderboard v0.2 Evaluation Results
| Metric | Value |
|---------------------------------|------:|
| Avg. | 49.11 |
| AI2 Reasoning Challenge_tr-v0.2 | 44.03 |
| HellaSwag_tr-v0.2 | 46.73 |
| MMLU_tr-v0.2 | 49.11 |
| TruthfulQA_tr-v0.2 | 48.51 |
| Winogrande_tr-v0.2              | 54.98 |
| GSM8k_tr-v0.2 | 51.78 |
## Considerations on Limitations, Risks, Bias, and Ethical Factors
### Limitations and Recognized Biases
- **Core Functionality and Usage:** KocDigital LLM is an autoregressive language model whose core function is predicting the next token in a text sequence. Although such models are applied across many contexts, comprehensive real-world testing of this model has not been conducted, so its effectiveness and consistency in diverse situations remain largely unvalidated.
- **Language Understanding and Generation:** The model's training focused mainly on standard English and Turkish. Its ability to understand and generate slang, colloquial language, or other languages may be limited, possibly resulting in errors or misinterpretations.
- **Production of Misleading Information:** Users should acknowledge that KocDigital LLM might generate incorrect or deceptive information. Results should be viewed as initial prompts or recommendations rather than absolute conclusions.
### Ethical Concerns and Potential Risks
- **Risk of Misuse:** KocDigital LLM carries the potential for generating language that could be offensive or harmful. We strongly advise against its utilization for such purposes and stress the importance of conducting thorough safety and fairness assessments tailored to specific applications before implementation.
- **Unintended Biases and Content:** The model underwent training on a vast corpus of text data without explicit vetting for offensive material or inherent biases. Consequently, it may inadvertently generate content reflecting these biases or inaccuracies.
- **Toxicity:** Despite efforts to curate appropriate training data, the model has the capacity to produce harmful content, particularly when prompted explicitly. We encourage active participation from the open-source community to devise strategies aimed at mitigating such risks.
### Guidelines for Secure and Ethical Utilization
- **Human Oversight:** We advocate for the integration of a human oversight mechanism or the utilization of filters to oversee and enhance the quality of outputs, particularly in applications accessible to the public. This strategy can assist in minimizing the likelihood of unexpectedly generating objectionable content.
- **Tailored Testing for Specific Applications:** Developers planning to utilize KocDigital LLM should execute comprehensive safety assessments and optimizations customized to their unique applications. This step is essential as the model's responses may exhibit unpredictability and occasional biases, inaccuracies, or offensive outputs.
- **Responsible Development and Deployment:** Developers and users of KocDigital LLM bear the responsibility for ensuring its ethical and secure application. We encourage users to be cognizant of the model's limitations and to implement appropriate measures to prevent misuse or adverse outcomes. |