---
base_model: indiejoseph/bart-base-cantonese
tags:
- generated_from_trainer
metrics:
- bleu
model-index:
- name: bart-translation-zh-yue
  results: []
---

# bart-translation-zh-yue

This model is a fine-tuned version of [indiejoseph/bart-base-cantonese](https://huggingface.co/indiejoseph/bart-base-cantonese) on an LLM-generated Chinese–Cantonese parallel dataset (described below).

It achieves the following results on the evaluation set:
- Loss: 0.5042
- Bleu: 36.3458
- Gen Len: 19.8785
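
As a quick usage sketch (not part of the original card), the model can be loaded with the standard `transformers` seq2seq classes. The repo ID below is a placeholder, since the full Hub namespace is not stated here:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder repo ID: the card only names the model "bart-translation-zh-yue";
# replace it with the actual Hub namespace.
model_id = "your-username/bart-translation-zh-yue"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Translate a Standard Chinese sentence into Cantonese.
inputs = tokenizer("你们在做什么？", return_tensors="pt")
outputs = model.generate(**inputs, max_length=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```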

## Model description

This is a sequence-to-sequence translation model that converts written Standard Chinese (Mandarin) text into Cantonese. It was fine-tuned from [indiejoseph/bart-base-cantonese](https://huggingface.co/indiejoseph/bart-base-cantonese) on the synthetic parallel data described below.

## Intended uses & limitations

The model is intended for translating Standard Chinese sentences into Cantonese. Because the training and evaluation data were generated with large language models (ChatGPT and PaLM 2) and then manually corrected, translation quality and coverage are bounded by that synthetic data.

## Training and evaluation data

The training and evaluation datasets were generated with ChatGPT and PaLM 2.

Over 4,000 Chinese–Cantonese phrase pairs, gathered from a range of websites and dictionaries, serve as the foundation: ChatGPT is used to generate initial seed sentences in Chinese from these phrases, and the PaLM 2 API then translates each Chinese sentence into Cantonese. The resulting pairs are manually corrected for typos and edited to improve fluency and linguistic variety.

Each collected phrase pair is used to generate ten unique sentences, giving a training dataset of approximately 40,000 sentence pairs. These sentences form the training data for the translation model.

The evaluation dataset is built with the same methodology, so it reflects the same quality, diversity, and linguistic characteristics as the training data and provides a consistent benchmark for the model.
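
A minimal sketch of how the generated parallel pairs could be assembled into a 🤗 Datasets object for training is shown below; the JSONL file names and the `zh`/`yue` column names are assumptions, not details taken from the original pipeline:

```python
import json

from datasets import Dataset

def load_pairs(path):
    # Hypothetical JSONL layout: one {"zh": "...", "yue": "..."} object per line.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# File names are placeholders; the real data (~40,000 pairs) was produced with
# ChatGPT (Chinese seed sentences) and PaLM 2 (Cantonese translations),
# then manually corrected.
train_ds = Dataset.from_list(load_pairs("train.jsonl"))
eval_ds = Dataset.from_list(load_pairs("eval.jsonl"))
print(train_ds)
```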

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 4.0
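
A sketch of a `Seq2SeqTrainer` setup using these hyperparameters is given below. Only the hyperparameters themselves come from this card; the preprocessing details (maximum lengths, the `zh`/`yue` column names, and the tiny placeholder dataset) are assumptions:

```python
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

base_model = "indiejoseph/bart-base-cantonese"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSeq2SeqLM.from_pretrained(base_model)

# Tiny placeholder dataset; the real training set holds ~40,000 generated pairs.
train_ds = Dataset.from_list([{"zh": "你们在做什么？", "yue": "你哋喺度做緊乜嘢？"}])
eval_ds = train_ds

def preprocess(batch):
    # The "zh" (source) and "yue" (target) column names are assumptions.
    model_inputs = tokenizer(batch["zh"], max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["yue"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

args = Seq2SeqTrainingArguments(
    output_dir="bart-translation-zh-yue",
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=4.0,
    evaluation_strategy="epoch",
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds.map(preprocess, batched=True, remove_columns=["zh", "yue"]),
    eval_dataset=eval_ds.map(preprocess, batched=True, remove_columns=["zh", "yue"]),
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```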

### Training results

| Training Loss | Epoch | Step  | Validation Loss | Bleu    | Gen Len |
|:-------------:|:-----:|:-----:|:---------------:|:-------:|:-------:|
| 0.135         | 1.0   | 3521  | 0.4865          | 35.3577 | 19.8859 |
| 0.0983        | 2.0   | 7042  | 0.4813          | 36.0938 | 19.8796 |
| 0.072         | 3.0   | 10563 | 0.4847          | 36.193  | 19.8817 |
| 0.0552        | 4.0   | 14084 | 0.5042          | 36.3458 | 19.8785 |
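
The BLEU and generation-length numbers above can be computed with a `compute_metrics` function along the lines of the following sketch, which uses the `evaluate` library with SacreBLEU; whether the original run used exactly this metric implementation is not stated in the card. The function would be passed to the `Seq2SeqTrainer` shown earlier, and `tokenizer` refers to the tokenizer loaded there:

```python
import evaluate
import numpy as np

bleu = evaluate.load("sacrebleu")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # Labels use -100 as padding; restore the pad token before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    score = bleu.compute(
        predictions=decoded_preds,
        references=[[label] for label in decoded_labels],
    )["score"]
    # Average number of non-pad tokens in the generated outputs.
    gen_len = float(np.mean([np.count_nonzero(p != tokenizer.pad_token_id) for p in preds]))
    return {"bleu": score, "gen_len": gen_len}
```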


### Framework versions

- Transformers 4.35.0.dev0
- Pytorch 2.1.1+cu121
- Datasets 2.14.6
- Tokenizers 0.14.1