indiejoseph committed
Commit 49f5c4c · 1 Parent(s): 37c0bd8

Update README.md

Files changed (1):
  1. README.md +18 -5
README.md CHANGED
@@ -7,6 +7,11 @@ metrics:
 model-index:
 - name: bart-translation-zh-yue
 results: []
 ---
 
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
@@ -14,7 +19,7 @@ should probably proofread and complete it, then remove this comment. -->
 
 # bart-translation-zh-yue
 
-This model is a fine-tuned version of [indiejoseph/bart-base-cantonese](https://huggingface.co/indiejoseph/bart-base-cantonese) on the LLMs generated dataset.
 
 It achieves the following results on the evaluation set:
 - Loss: 0.5042
@@ -23,11 +28,19 @@ It achieves the following results on the evaluation set:
 
 ## Model description
 
-More information needed
 
-## Intended uses & limitations
 
-More information needed
 
 ## Training and evaluation data
 
@@ -67,4 +80,4 @@ The following hyperparameters were used during training:
 - Transformers 4.35.0.dev0
 - Pytorch 2.1.1+cu121
 - Datasets 2.14.6
-- Tokenizers 0.14.1
 
 model-index:
 - name: bart-translation-zh-yue
 results: []
+language:
+- zh
+- yue
+license: apache-2.0
+pipeline_tag: translation
 ---
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 
 # bart-translation-zh-yue
 
+Cantonese to Simplified Chinese translation model, fine-tuned from [indiejoseph/bart-base-cantonese](https://huggingface.co/indiejoseph/bart-base-cantonese) on an LLM-generated dataset.
 
 It achieves the following results on the evaluation set:
 - Loss: 0.5042
 
 ## Model description
 
+The base model, [indiejoseph/bart-base-cantonese](https://huggingface.co/indiejoseph/bart-base-cantonese), was further pre-trained from [fnlp/bart-base-chinese](https://huggingface.co/fnlp/bart-base-chinese) and inherits that model's whitespace tokenizer, which leaves space delimiters between every individual Chinese character in the outputs. To address this, I have built a translation pipeline that mitigates the inconsistent Simplified Chinese output using SequenceBiasLogitsProcessor from the transformers library.
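The spacing artifact described above can also be cleaned up after decoding. A minimal sketch of such a post-processing step (my own illustration, not the repository's actual pipeline; the regex, character ranges, and function name are assumptions):

```python
import re

# CJK Unified Ideographs (basic block + extension A); an assumption that
# covers the characters this model typically emits.
_CJK = r"[\u3400-\u4dbf\u4e00-\u9fff]"
_CJK_SPACE = re.compile(rf"(?<={_CJK}) (?={_CJK})")

def strip_cjk_spaces(text: str) -> str:
    """Drop the single spaces the whitespace tokenizer leaves between
    adjacent Chinese characters, keeping spaces around Latin words."""
    return _CJK_SPACE.sub("", text)
```

The approach stated in the card instead steers generation away from emitting the space token at decode time via transformers' SequenceBiasLogitsProcessor; a regex cleanup like the above is only a post-hoc fallback.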
 
+## Intended uses
 
+1. Cantonese Translation: The model can be used to translate Cantonese text, enabling communication and understanding across different linguistic backgrounds.
+2. Language Learning: The model can help learners understand and translate Cantonese texts, aiding the acquisition of Cantonese language skills.
+
+## Limitations
+
+1. Domain Specificity: The model's performance may vary on texts containing domain-specific or technical terminology. It is trained on general language data and may struggle with specialized vocabulary.
+2. Accuracy and Fluency: While the model strives for accurate and fluent translations, it may occasionally produce errors or unnatural-sounding output. Post-editing or human review may be necessary for critical or high-stakes translations.
+3. Cultural Nuances: Translations may not capture the full range of cultural nuances and contextual meanings in the original text. Human interpretation and cultural understanding remain essential in sensitive or culturally specific contexts.
+4. Potential for Harmful or Hate Speech: The training dataset was generated by LLMs and may inadvertently include instances of harmful or hate speech. Although efforts were made to filter and mitigate such content, the model's output may still occasionally contain offensive or inappropriate language. Exercise caution and apply appropriate content moderation so that generated translations align with ethical standards and community guidelines.
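Limitation 4 recommends moderating model output. A deliberately naive sketch of such a gate (the function name and term list are hypothetical; real deployments should use a proper moderation model or service rather than substring matching):

```python
def flag_translation(text: str, blocklist: frozenset) -> list:
    """Return blocklisted terms found in a translated text so the output
    can be withheld or routed to human review. A naive substring check
    for illustration only; it misses paraphrases and context."""
    return sorted(term for term in blocklist if term in text)
```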
 
 ## Training and evaluation data
 
 
 - Transformers 4.35.0.dev0
 - Pytorch 2.1.1+cu121
 - Datasets 2.14.6
+- Tokenizers 0.14.1