nazimali committed
Commit
6c36d79
1 Parent(s): 800a545

Create README.md

Files changed (1): README.md +111 -0
README.md ADDED
@@ -0,0 +1,111 @@
---
base_model:
- nazimali/Mistral-Nemo-Kurdish
language:
- ku
- en
license: apache-2.0
tags:
- text-generation-inference
- transformers
- unsloth
- mistral
- gguf
datasets:
- saillab/alpaca-kurdish_kurmanji-cleaned
library_name: transformers
---

This is a 12B-parameter model, finetuned from `nazimali/Mistral-Nemo-Kurdish` on a single Kurdish (Kurmanji) instruction dataset. The original intention was to train with both Kurdish Kurmanji (Latin script) and Kurdish Sorani (Arabic script), but training took much longer than anticipated, so one full Kurdish Kurmanji dataset was used to get started.

A multi-GPU training setup is the next step, so results don't take all day; the goal remains to train on both Kurmanji and Sorani Arabic script.

Try the [Spaces demo](https://huggingface.co/spaces/nazimali/Mistral-Nemo-Kurdish-Instruct).

### Example usage

#### llama-cpp-python

```python
from llama_cpp import Llama

# Alpaca-style instruction prompt in Kurmanji. English translation: "Below is an
# instruction that describes a task, paired with an input that provides further
# context. Write a response that appropriately completes the request."
inference_prompt = """Li jêr rêwerzek heye ku peywirek rave dike, bi têketinek ku çarçoveyek din peyda dike ve tê hev kirin. Bersivek ku daxwazê bi guncan temam dike binivîsin.
### Telîmat:
{}
### Têketin:
{}
### Bersiv:
"""

llm = Llama.from_pretrained(
    repo_id="nazimali/Mistral-Nemo-Kurdish-Instruct",
    filename="Q4_K_M.gguf",
)

llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            # The template has two slots (instruction, input); the input is left empty
            "content": inference_prompt.format("selam alikum, tu çawa yî?", ""),
        }
    ]
)
```

#### llama.cpp

```shell
./llama-cli \
    --hf-repo "nazimali/Mistral-Nemo-Kurdish-Instruct" \
    --hf-file Q4_K_M.gguf \
    -p "selam alikum, tu çawa yî?" \
    --conversation
```

#### Transformers

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "nazimali/Mistral-Nemo-Kurdish-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load in 4-bit NF4 quantization so the 12B model fits on a single GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```

### Training

#### Finetuning data

- `saillab/alpaca-kurdish_kurmanji-cleaned`
- Dataset rows: 52,002
- Filtered on the `instruction` and `output` columns:
    - Must have at least 1 character
    - Must be less than 10,000 characters
- Rows used for training: 41,559

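The filtering rules above can be sketched as a small predicate. This is a reconstruction for illustration, not the author's actual preprocessing code; the column names follow the dataset.

```python
# Keep a row only if both `instruction` and `output` have at least 1 character
# and fewer than 10,000 characters (the filter described above, reconstructed).
def keep_row(row):
    return all(1 <= len(row[col]) < 10_000 for col in ("instruction", "output"))

rows = [
    {"instruction": "Wergerîne îngilîzî.", "output": "Translate to English."},
    {"instruction": "", "output": "empty instruction"},
    {"instruction": "ok", "output": "x" * 10_000},
]
kept = [r for r in rows if keep_row(r)]
print(len(kept))  # 1 of the 3 sample rows passes
```

With the `datasets` library, such a predicate could be passed to `dataset.filter(keep_row)` to reproduce the row counts above.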
#### Finetuning instruction format

```python
# Same Alpaca-style template as inference, with a third slot for the target response
finetune_prompt = """Li jêr rêwerzek heye ku peywirek rave dike, bi têketinek ku çarçoveyek din peyda dike ve tê hev kirin. Bersivek ku daxwazê bi guncan temam dike binivîsin.
### Telîmat:
{}
### Têketin:
{}
### Bersiv:
{}
"""
```
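For illustration, here is a hedged sketch of how one dataset row might be rendered with this template during finetuning. The long Kurmanji preamble is abbreviated to `...`, and since only the `instruction` and `output` columns were kept, the input slot is assumed empty; this is not the author's exact training code.

```python
# Hypothetical rendering of one training example with the Alpaca-style
# template above (preamble abbreviated; a sketch, not the actual pipeline).
template = """...
### Telîmat:
{}
### Têketin:
{}
### Bersiv:
{}
"""

row = {"instruction": "Wergerîne îngilîzî: Silav", "output": "Hello"}
# The input slot stays empty because only `instruction` and `output` were used
text = template.format(row["instruction"], "", row["output"])
print(text)
```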