---
library_name: transformers
datasets:
- wanadzhar913/boolq-malay-with-chain-of-thought
language:
- ms
base_model:
- mesolitica/malaysian-mistral-7b-32k-instructions-v4
pipeline_tag: text-generation
---
### Model Details

This model was originally developed as part of the 1st place solution for the [AI Tinkerer's Hackathon in Kuala Lumpur](https://www.linkedin.com/posts/supa-ai_llms-techinnovation-llm-activity-7256832143694192640-INSI?utm_source=share&utm_medium=member_desktop)
for an LLM-as-a-Judge use case.

We finetune [mesolitica/malaysian-mistral-7b-32k-instructions-v4](https://huggingface.co/mesolitica/malaysian-mistral-7b-32k-instructions-v4) primarily for a **natural language inference (NLI)** and **reasoning** task. In our case, NLI is the task of determining whether a "hypothesis" is true (*entailment*) or false (*contradiction*) given a question-statement pair, together with step-by-step reasoning for the choice. We selected this model primarily for its:
- **Context length of 32,000.** This is the maximum number of tokens (covering words, punctuation, and spaces) the model can consider in a single input. A long context length matters because we run NLI over text pairs of varying lengths.
- **Monthly downloads on Hugging Face.** A consistently high number of monthly downloads is a reasonable proxy for model quality.
- **Ability to comprehend both Malay and English text** and reply in Malay, thanks to prior instruction finetuning.

### Training Details

We train solely on the [Boolq-Malay-With-Chain-of-Thought](https://huggingface.co/datasets/wanadzhar913/boolq-malay-with-chain-of-thought) dataset. It comprises Malay and English versions of the original [BoolQ](https://huggingface.co/datasets/google/boolq) dataset, plus a chain-of-thought reasoning column generated with OpenAI's GPT-4o-mini.
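
As a quick sanity check before finetuning, the dataset can be loaded straight from the Hugging Face Hub and inspected (a minimal sketch, assuming a standard `train` split; column names are printed rather than assumed):

```python
from datasets import load_dataset

# Load the finetuning data from the Hugging Face Hub (assuming a 'train' split)
ds = load_dataset('wanadzhar913/boolq-malay-with-chain-of-thought', split='train')

print(ds)               # number of rows and features
print(ds.column_names)  # available columns, including the chain-of-thought column
print(ds[0])            # a single question-passage pair with its reasoning
```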

We trained the model on a Google Colab A100 GPU (40GB VRAM) with the following parameters and obtained the following training results (a rough sketch of a matching training configuration follows the list):

- **No. of Epochs:** 1
- **Per Device Train Batch Size:** 8
- **Gradient Accumulation Steps:** 1
- **LoRA Rank:** 64
- **Learning Rate:** 2e-4
- **Learning Rate Scheduler Type:** constant
- **Maximum Sequence Length:** 32768
- **Load model in 4-bit Precision:** True
- **bf16 (Brain Floating Point 16-bit):** False
- **Train Loss:** 0.3057
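
The full training code lives in the notebook linked below; purely as a hedged illustration, the listed hyperparameters would map onto a TRL/PEFT setup roughly as follows (argument names vary across TRL versions, and values not listed above, such as `lora_alpha` and the text column name, are placeholders):

```python
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from peft import LoraConfig
from trl import SFTTrainer

MODEL_ID = 'mesolitica/malaysian-mistral-7b-32k-instructions-v4'

# 4-bit NF4 quantized base model ("Load model in 4-bit Precision: True");
# compute dtype stays fp16 since bf16 was disabled for this run
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, quantization_config=bnb_config)

# LoRA Rank: 64 (lora_alpha / lora_dropout are placeholder values)
peft_config = LoraConfig(r=64, lora_alpha=16, lora_dropout=0.05, task_type='CAUSAL_LM')

training_args = TrainingArguments(
    output_dir='malaysian-mistral-llmasajudge-v3',
    num_train_epochs=1,               # No. of Epochs
    per_device_train_batch_size=8,    # Per Device Train Batch Size
    gradient_accumulation_steps=1,    # Gradient Accumulation Steps
    learning_rate=2e-4,               # Learning Rate
    lr_scheduler_type='constant',     # Learning Rate Scheduler Type
    bf16=False,                       # bf16 disabled on this run
    logging_steps=10,
)

dataset = load_dataset('wanadzhar913/boolq-malay-with-chain-of-thought', split='train')

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field='text',        # placeholder; adjust to the dataset's actual prompt column
    max_seq_length=32768,             # Maximum Sequence Length
    args=training_args,
)
trainer.train()
```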

The **training notebook** can be found here: https://github.com/wanadzhar913/aitinkerers-hackathon-supa-team-werecooked/blob/master/notebooks-finetuning-models/02_finetune_v3_malaysian_mistral_7b_32k_instructions_v4.ipynb

The **model** can be found here: https://huggingface.co/wanadzhar913/malaysian-mistral-llmasajudge-v3

The **Weights and Biases training run** can be found here: https://wandb.ai/adzhar-faiq/finetune-malaysian-mistral-llmasajudge-v3

For NLI benchmarks specifically, the **benchmarking notebook** can be found here: https://github.com/wanadzhar913/aitinkerers-hackathon-supa-team-werecooked/blob/master/notebooks-benchmarking-exercises/03_benchmark_malaysian_mistral_llmasajudge_v3.ipynb

We achieve the following metrics on the validation dataset:

| Language         | Accuracy (%) | F1 Score (%) | Precision (%) | Recall (%) |
|------------------|--------------|--------------|---------------|------------|
| Malay + English  |    61.3    |   69.1   |    68.6   |  69.7  |
| Malay            |    61.0    |   68.3   |    69.7   |  66.9  |

**NOTE:** While we achieve noticeably lower scores than the [V2 version], this may be due to limitations of the evaluation method (e.g., regex parsing, string matching). Because this model has a reasoning component, it's slightly harder to find the 'consistency' key (e.g., `{consistency: 1}`) in its output. Future versions of the model may benefit from better JSON output coercion via prompting or a more robust finetuning procedure.

In the future, we can do the following to garner better results:
- Set the `bf16` parameter to `True` to improve compute efficiency without significantly sacrificing model accuracy.
- Increase `gradient_accumulation_steps` to work within tight GPU memory constraints, or increase the `batch_size` if we have access to a larger GPU. The reasoning is mainly to avoid [Out of Memory (OOM) errors](https://discuss.huggingface.co/t/batch-size-vs-gradient-accumulation/5260).
- Given more compute resources, we can also increase our `patience` variable and train for more than 10 epochs.
- **Limit the reasoning portion of the training dataset to Malay only.** Since the model was instruction-finetuned to reply mainly in Malay, having it reason back in English could be confusing.

### Usage
You can input either Malay or English text. It'll reason in Malay.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, \
                         BitsAndBytesConfig, pipeline

TORCH_DTYPE = 'bfloat16'

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=getattr(torch, TORCH_DTYPE)
)

tokenizer = AutoTokenizer.from_pretrained('wanadzhar913/malaysian-mistral-llmasajudge-v3')
model = AutoModelForCausalLM.from_pretrained(
    'wanadzhar913/malaysian-mistral-llmasajudge-v3',
    use_flash_attention_2 = True,
    quantization_config = nf4_config
)

pipe = pipeline(
    "text-generation",
    tokenizer = tokenizer,
    model=model,
    device=0,
)

# create a prompt template
prompt = """Anda adalah pakar dalam mengesan ketidakkonsistenan fakta dan halusinasi. Anda akan diberi satu dokumen dan satu soalan. Baca
dokumen dan soalan/kenyataan yang diberikan dengan teliti dan kenal pasti Ketidakkonsistenan Fakta (iaitu mana-mana soalan/kenyataan yang
tidak disokong atau bercanggah dengan maklumat dalam dokumen).

### Anda perlu memilih antara dua pilihan berikut:
- Tidak Konsisten dengan Fakta: Jika mana-mana soalan/kenyataan tidak disokong, terjawab atau bercanggah dengan dokumen, labelkannya sebagai 0.
- Konsisten dengan Fakta: Jika semua soalan/kenyataan disokong/terjawab oleh dokumen, labelkannya sebagai 1.

### Sebagai contoh:
Dokumen: "Gajah adalah mamalia besar yang biasanya ditemui di Afrika dan Asia. Mereka hidup dalam kumpulan yang dikenali sebagai kawanan dan terkenal kerana mempunyai ingatan yang baik."

Soalan/Kenyataan: "Gajah adalah mamalia besar yang biasanya ditemui di Eropah."
Jawapan: {{'consistency': 0}}

Soalan/Kenyataan: "Gajah adalah mamalia besar yang biasanya ditemui di Afrika dan Asia."
Jawapan: {{'consistency': 1}}

### Jawab berdasarkan dokumen dan soalan/kenyataan berikut:
Dokumen: {passage}
Soalan/Kenyataan: {question}

Sediakan penjelasan langkah demi langkah untuk pilihan konsistenan berdasarkan Dokumen dan Soalan/Kenyataan yang diberikan. Selepas itu,
kembalikan pilihan konsistenan dalam format JSON untuk pilihan yang diberikan. Sebagai contoh: {{'consistency': 1}} atau {{'consistency': 0}}"""

# https://www.thestar.com.my/business/business-news/2024/10/23/strong-support-for-chip-sector-under-budget-2025
passage_english = """
KUALA LUMPUR: Budget 2025 has set aside sizeable funds, both fiscal and non-fiscal, to ensure the success of the National Semiconductor Strategy (NSS), which is part of the New Industrial Master Plan 2030 (NIMP 2030), says Investment, Trade and Industry (Miti) Minister Tengku Datuk Seri Zafrul Abdul Aziz.

Among the initiatives announced in the budget, he said were the RM1bil sovereign fund for the electrical and electronics sector and high-value activities as well as training funds allocated for several universities.

Apart from that, he said there are initiatives to support mid-tier companies as well as tax incentives for companies in the industry.

“I think we are on track (to achieve the target set in NIMP 2030). You have seen exports continue to grow in these sectors as well.

“And if you look at the just-announced report card for our NIMP 2030, we should see positive growth by year-end, and growth in the manufacturing sector has contributed close to a 5% increase to our gross domestic product this year,” he said this during an interview with CNBC Asia Squawk Box yesterday.

Tengku Zafrul was commenting on the progress of the NSS and NIMP.

When asked how the new tax would help finance the bigger budget of RM421bil, he said that apart from the tax on dividends as well as the larger scope of sales and service tax, emphasis is given on cost discipline, for instance, via the merging of several agencies under Miti.

“Yes, I am quite confident that we will meet the budget estimate. We have been meeting our deficit target, for example, and I think we will hopefully achieve it (fiscal target) in 2024,” he said.

The ministry will also continue with initiatives to drive trade and investments to spur the country’s growth, added Tengku Zafrul. — Bernama"""

question_english = "Zafrul will not meet the budget deficit."

pipe(
    prompt.format(passage=passage_english, question=question_english),
    max_new_tokens = 1024,  # enough room for the step-by-step reasoning before the JSON label
    return_full_text=False,
    temperature = 0.1,
    do_sample = True,
    top_p = 0.97,
    top_k = 50,
)[0]['generated_text']

# you'll probably have to use some regex to parse the outputs (see the parsing sketch after this code block)
>>> [/INST]Untuk menentukan sama ada pernyataan "harga runcit diesel di Sabah tak ubah juga" adalah konsisten dengan petikan yang diberikan, kita boleh mengikuti langkah-langkah berikut:
>>>
>>>1. **Menganalisis Petikan**: Dalam petikan, dinyatakan bahawa harga runcit diesel di Semenanjung Malaysia adalah RM2.15 seliter, manakala di Sabah, Sarawak, dan Labuan juga kekal pada harga yang sama, iaitu RM2.15 seliter. Ini menunjukkan bahawa harga diesel di Sabah tidak berubah.
>>>
>>>2. **Menganalisis Pernyataan**: Pernyataan yang diberikan adalah "harga runcit diesel di Sabah tak ubah juga". Ini merujuk kepada fakta bahawa harga diesel di Sabah tidak berubah.
>>>
>>>3. **Membandingkan Maklumat**: Dari petikan, kita dapat melihat bahawa harga diesel di Sabah memang tidak berubah, yang menyokong pernyataan tersebut.
>>>
>>>4. **Kesimpulan**: Oleh kerana petikan menyatakan bahawa harga diesel di Sabah kekal pada RM2.15 seliter, maka pernyataan "harga runcit diesel di Sabah tak ubah juga" adalah konsisten dengan maklumat yang diberikan dalam petikan.
>>>
>>>Oleh itu, output untuk pernyataan tersebut adalah `1`, yang menunjukkan bahawa pernyataan itu adalah konsisten dengan petikan., "consistency": 1}
```
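
Since the model produces free-form reasoning before the JSON-style label, the `consistency` value usually has to be extracted from the raw generation. A minimal parsing sketch (the `parse_consistency` helper is ours, not part of the original notebook):

```python
import re
from typing import Optional

def parse_consistency(generated_text: str) -> Optional[int]:
    """Return 1 or 0 if a consistency label appears in the generation, else None."""
    match = re.search(r"consistency\W*([01])", generated_text, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None

# e.g. on the pipeline output from the example above
raw_output = pipe(
    prompt.format(passage=passage_english, question=question_english),
    max_new_tokens=1024,
    return_full_text=False,
)[0]['generated_text']
print(parse_consistency(raw_output))  # 1 (consistent) or 0 (inconsistent); None if no label found
```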