winninghealth commited on
Commit
3db7b68
·
verified ·
1 Parent(s): 914f6ff

Upload 25 files

Browse files
.gitattributes CHANGED
@@ -33,3 +33,10 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
 
 
 
 
 
 
 
 
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ assets/20241216085311.png filter=lfs diff=lfs merge=lfs -text
37
+ assets/IMG_6875.GIF filter=lfs diff=lfs merge=lfs -text
38
+ assets/IMG_6877.GIF filter=lfs diff=lfs merge=lfs -text
39
+ assets/IMG_6878.GIF filter=lfs diff=lfs merge=lfs -text
40
+ WiNGPT-Babel-Q4_K_M.gguf filter=lfs diff=lfs merge=lfs -text
41
+ WiNGPT-Babel-Q5_K_M.gguf filter=lfs diff=lfs merge=lfs -text
42
+ WiNGPT-Babel-Q8_0.gguf filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,141 @@
1
- ---
2
- license: apache-2.0
3
- ---
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ - zh
6
+ tags:
7
+ - translation
8
+ ---
9
+ # 🌐 WiNGPT-Babel
10
+
11
+ WiNGPT-Babel(巴别塔)是一个基于大语言模型(LLM)为翻译应用定制的模型,致力于提供便捷的多语言信息母语级体验。
12
+
13
+ 和其他机器翻译模型最大的不同是,WiNGPT-Babel 是采用 human-in-the-loop 数据生产采集闭环策略训练而成。因此 WiNGPT-Babel 更适应真实使用场景,例如新闻、研究成果以及观看带有实时翻译字幕的视频。通过一系列的工具插件 WiNGPT-Babel 会将这些内容翻译成用户的母语,以更好的体验呈现在用户面前。
14
+
15
+ 我们的目标是利用先进的 LLM 技术,降低语言障碍,帮助用户更轻松地获取全球范围内的互联网信息,包括学术论文、社交媒体、网页内容和视频字幕等各种数据格式。虽然实现这一目标还需要时间,但 LLM 技术的发展为其提供了可能性。
16
+
17
+ ## ✨ 核心特点
18
+
19
+ - **多格式翻译 📄 🌐 🎬:** 支持多种文本格式的翻译,包括网页、社交媒体内容、学术论文、视频字幕、以及数据集等。
20
+ - **高精度翻译 🧠:** 基于先进的 LLM 架构,我们致力于提供准确、自然、流畅的翻译结果。
21
+ - **高性能翻译 ⏱️:** 采用1.5B模型,支持实时字幕翻译等应用场景,满足用户对实时翻译的需求。
22
+ - **多语言支持 🗣️:** 目前支持超过 20 种语言,并不断扩展语言支持范围。
23
+ - **应用适配 🪒:** 目前已适配的工具有:沉浸式翻译、videolingo。
24
+
25
+ ## 🧪 适用场景
26
+ - 🌐 **网页内容翻译:** 适用于日常网页浏览,快速理解网页信息
27
+ - 📄 **学术论文翻译:** 适用于辅助理解多语言研究论文,提高阅读效率
28
+ - 📰 **新闻资讯翻译:** 适用于快速了解全球新闻动态,获取一手信息
29
+ - 🎬 **视频字幕翻译:** 适用于观看外语视频,辅助理解视频内容
30
+ - 📊 **数据集多语言处理:** 适用于多语言数据集的初步翻译,辅助数据分析
31
+
32
+ ## 🔤 语言支持(更多语言待验证)
33
+ 🇺🇸 English ↔️ 🇨🇳 Chinese | 🇯🇵 Japanese ➡️ 🇨🇳 Chinese
34
+
35
+ ## 🚀 快速开始
36
+
37
+ WiNGPT-Babel 采用 Qwen2.5-1.5B 作为基础模型 ,是在测试比较了各种参数规模模型平衡推理速度和翻译质量的选择。在各种应用场景下的翻译速度可以达到甚至超过谷歌翻译,这样的体验对于使用翻译模型来说是至关重要的。 为了帮助大家快速上手,我们提供了以下示例,并使用 Hugging Face Transformers 库进行加载和推理,当让我们推荐大家使用vllm,ollama等推理工具和框架:
38
+
39
+ ```python
40
+ from transformers import AutoModelForCausalLM, AutoTokenizer
41
+
42
+ model_name = "WiNGPT/WiNGPT-Babel"
43
+
44
+ model = AutoModelForCausalLM.from_pretrained(
45
+ model_name,
46
+ torch_dtype="auto",
47
+ device_map="auto"
48
+ )
49
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
50
+
51
+ prompt = "Give me a short introduction to large language model."
52
+ messages = [
53
+ {"role": "system", "content": "中英互译下面的内容"},
54
+ {"role": "user", "content": prompt}
55
+ ]
56
+ text = tokenizer.apply_chat_template(
57
+ messages,
58
+ tokenize=False,
59
+ add_generation_prompt=True
60
+ )
61
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
62
+
63
+ generated_ids = model.generate(
64
+ **model_inputs,
65
+ max_new_tokens=4096
66
+ )
67
+ generated_ids = [
68
+ output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
69
+ ]
70
+
71
+ response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
72
+ ```
73
+
74
+ - **注意:** WiNGPT-Babel 默认系统提示词为:“中英互译下面的内容”。模型会自动根据用户的输入翻译成对应的语言,无需其他复杂的指令。
75
+
76
+ ### 🎬 示例
77
+
78
+ 以下是一些应用场景示例,展示如何使用模型进行翻译。
79
+
80
+ 1. **网页翻译:**
81
+ - **场景:** 用户通过工具及简单系统提示,将外文网页内容翻译成母语。
82
+ - **工具:** 沉浸式翻译
83
+ - <figure><img src="assets/20241216084737.png" style="zoom:25%;" /></figure>
84
+ - <figure><img src="assets/20241216084744.png" style="zoom:25%;" /></figure>
85
+ - <figure><img src="assets/20241216084809.png" style="zoom:25%;" /></figure>
86
+ - <figure><img src="assets/20241216085303.png" style="zoom:25%;" /></figure>
87
+ - <figure><img src="assets/20241216085311.png" style="zoom:25%;" /></figure>
88
+
89
+ 2. **学术论文翻译:**
90
+ - **场景:** 用户使用工具翻译外文研究论文,辅助研究工作。
91
+ - **工具:** 沉浸式翻译
92
+ - <figure><img src="assets/20241216084751.png" style="zoom:25%;" /></figure>
93
+ - <figure><img src="assets/20241216084757.png" style="zoom:25%;" /></figure>
94
+
95
+ 3. **社交媒体翻译:**
96
+ - **场景:** 用户可以使用模型,将不同语言的社交媒体内容翻译成母语
97
+ - **工具:** 沉浸式翻译
98
+ - <figure><img src="assets/20241216084803.png" style="zoom:25%;" /></figure>
99
+
100
+ 4. **视频字幕翻译:**
101
+ - **场景:** 用户利用工具,结合模型,直接翻译字幕文件并保存为文件。
102
+ - **工具:** 沉浸式翻译
103
+ - <figure><img src="assets/20241216085700.png" style="zoom:25%;" /></figure>
104
+
105
+ 5. **视频网站实时翻译:**
106
+
107
+ - **场景:** 用户利用工具,结合模型,在观看互联网视频时实时生成字幕。
108
+ - **工具:** 沉浸式翻译
109
+ - <figure><img src="assets/IMG_6875.GIF" style="zoom:25%;" /></figure>
110
+ - <figure><img src="assets/IMG_6877.GIF" style="zoom:25%;" /></figure>
111
+
112
+ 6. **视频翻译与字幕压制:**
113
+ - **场景:** 用户利用工具,结合模型,将外语视频生成带有翻译字幕的视频。
114
+ - **工具:** VideoLingo
115
+ - <figure><img src="assets/IMG_6878.GIF" style="zoom:25%;" /></figure>
116
+
117
+ 7. **数据集翻译:**
118
+
119
+ - **场景:** 用户利用模型,将外语数据集进行翻译。
120
+ - **工具:** wingpt-web-client
121
+ - <figure><img src="assets/093402.png" style="zoom:25%;" /></figure>
122
+
123
+ **注意:** 以上示例展示了如何利用工具并结合 WiNGPT-Babel 模型进行文本翻译。你可以根据自己的需求和习惯,通过工具并将其应用到更多场景。
124
+
125
+ ### 🌱 局限性
126
+ **专业术语翻译:** 在法律、医学等高度专业领域、代码等,翻译结果可能存在偏差
127
+ **文学作品翻译:** 对于文学作品中的修辞、隐喻等,可能无法完美传达原文意境
128
+ **长文本翻译:** 在处理超长文本时,可能会出现翻译错误或者幻觉问题,需要进行分段处理
129
+ **多语言适配:** 目前主要在中英语言场景里进行使用,其他语言需要更多的测试和反馈
130
+
131
+ ## 许可证
132
+
133
+ 1. 本项目授权协议为 Apache License 2.0
134
+ 2. 使用本项目包括模型权重时请引用本项目:https://huggingface.co/winninghealth/WiNGPT-Babel
135
+ 3. 遵守 [Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B), [immersive-translate](https://github.com/immersive-translate/immersive-translate), [VideoLingo](https://github.com/Huanshere/VideoLingo) 相关协议及其许可证,详细内容参照其网站。
136
+
137
+ ## 联系我们
138
+
139
+ 网站:[https://www.winning.com.cn](https://www.winning.com.cn/)
140
+
141
+ 邮箱:[[email protected]](mailto:[email protected])
WiNGPT-Babel-Q4_K_M.gguf ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f12e9bf63bddbf42216d5997e5c4bb7da5ee4f363e659d4bb84d144642a606ec
3
+ size 986045568
WiNGPT-Babel-Q5_K_M.gguf ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:38115acb8856f50192c16c6a4bf9380d40b5e813a4677b68f2847b1adf5e43f1
3
+ size 1125047424
WiNGPT-Babel-Q8_0.gguf ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6d1f61f21ccff8f049fee431367a5d046f56028485202789ea573f6bc996f403
3
+ size 1646570112
assets/093402.png ADDED
assets/20241216084551.png ADDED
assets/20241216084737.png ADDED
assets/20241216084744.png ADDED
assets/20241216084751.png ADDED
assets/20241216084757.png ADDED
assets/20241216084803.png ADDED
assets/20241216084809.png ADDED
assets/20241216085303.png ADDED
assets/20241216085311.png ADDED

Git LFS Details

  • SHA256: 8fe166e99f4149d6554dac3c2a917cf0ab3e7073208a0ccf95e9f78071c108de
  • Pointer size: 132 Bytes
  • Size of remote file: 1.08 MB
assets/20241216085700.png ADDED
assets/IMG_6875.GIF ADDED

Git LFS Details

  • SHA256: b82c0219118312ca5b30b782203b579a580b5517cc4e259329dbcf7cb603ae0b
  • Pointer size: 132 Bytes
  • Size of remote file: 6.11 MB
assets/IMG_6877.GIF ADDED

Git LFS Details

  • SHA256: 8e5c1a7b65985b9229cc89f0a39d8c8d2b72377c5cc8aa040baf32f047123025
  • Pointer size: 132 Bytes
  • Size of remote file: 6.51 MB
assets/IMG_6878.GIF ADDED

Git LFS Details

  • SHA256: b1acfdda4d59fdff1c1d5746508c74a2475c17828a08818e2745bb35bc686dc9
  • Pointer size: 132 Bytes
  • Size of remote file: 9.41 MB
config.json ADDED
@@ -0,0 +1,27 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "architectures": [
3
+ "Qwen2ForCausalLM"
4
+ ],
5
+ "attention_dropout": 0.0,
6
+ "bos_token_id": 151643,
7
+ "eos_token_id": 151645,
8
+ "hidden_act": "silu",
9
+ "hidden_size": 1536,
10
+ "initializer_range": 0.02,
11
+ "intermediate_size": 8960,
12
+ "max_position_embeddings": 32768,
13
+ "max_window_layers": 21,
14
+ "model_type": "qwen2",
15
+ "num_attention_heads": 12,
16
+ "num_hidden_layers": 28,
17
+ "num_key_value_heads": 2,
18
+ "rms_norm_eps": 1e-06,
19
+ "rope_theta": 1000000.0,
20
+ "sliding_window": 32768,
21
+ "tie_word_embeddings": true,
22
+ "torch_dtype": "bfloat16",
23
+ "transformers_version": "4.43.1",
24
+ "use_cache": true,
25
+ "use_sliding_window": false,
26
+ "vocab_size": 151936
27
+ }
generation_config.json ADDED
@@ -0,0 +1,14 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "bos_token_id": 151643,
3
+ "pad_token_id": 151643,
4
+ "do_sample": true,
5
+ "eos_token_id": [
6
+ 151645,
7
+ 151643
8
+ ],
9
+ "repetition_penalty": 1.1,
10
+ "temperature": 0.7,
11
+ "top_p": 0.8,
12
+ "top_k": 20,
13
+ "transformers_version": "4.37.0"
14
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c9154afd388d25237ed675a9bd4ab9def70b99fdee7a5107e29e8da0a935172e
3
+ size 3087467144
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,207 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "add_prefix_space": false,
3
+ "added_tokens_decoder": {
4
+ "151643": {
5
+ "content": "<|endoftext|>",
6
+ "lstrip": false,
7
+ "normalized": false,
8
+ "rstrip": false,
9
+ "single_word": false,
10
+ "special": true
11
+ },
12
+ "151644": {
13
+ "content": "<|im_start|>",
14
+ "lstrip": false,
15
+ "normalized": false,
16
+ "rstrip": false,
17
+ "single_word": false,
18
+ "special": true
19
+ },
20
+ "151645": {
21
+ "content": "<|im_end|>",
22
+ "lstrip": false,
23
+ "normalized": false,
24
+ "rstrip": false,
25
+ "single_word": false,
26
+ "special": true
27
+ },
28
+ "151646": {
29
+ "content": "<|object_ref_start|>",
30
+ "lstrip": false,
31
+ "normalized": false,
32
+ "rstrip": false,
33
+ "single_word": false,
34
+ "special": true
35
+ },
36
+ "151647": {
37
+ "content": "<|object_ref_end|>",
38
+ "lstrip": false,
39
+ "normalized": false,
40
+ "rstrip": false,
41
+ "single_word": false,
42
+ "special": true
43
+ },
44
+ "151648": {
45
+ "content": "<|box_start|>",
46
+ "lstrip": false,
47
+ "normalized": false,
48
+ "rstrip": false,
49
+ "single_word": false,
50
+ "special": true
51
+ },
52
+ "151649": {
53
+ "content": "<|box_end|>",
54
+ "lstrip": false,
55
+ "normalized": false,
56
+ "rstrip": false,
57
+ "single_word": false,
58
+ "special": true
59
+ },
60
+ "151650": {
61
+ "content": "<|quad_start|>",
62
+ "lstrip": false,
63
+ "normalized": false,
64
+ "rstrip": false,
65
+ "single_word": false,
66
+ "special": true
67
+ },
68
+ "151651": {
69
+ "content": "<|quad_end|>",
70
+ "lstrip": false,
71
+ "normalized": false,
72
+ "rstrip": false,
73
+ "single_word": false,
74
+ "special": true
75
+ },
76
+ "151652": {
77
+ "content": "<|vision_start|>",
78
+ "lstrip": false,
79
+ "normalized": false,
80
+ "rstrip": false,
81
+ "single_word": false,
82
+ "special": true
83
+ },
84
+ "151653": {
85
+ "content": "<|vision_end|>",
86
+ "lstrip": false,
87
+ "normalized": false,
88
+ "rstrip": false,
89
+ "single_word": false,
90
+ "special": true
91
+ },
92
+ "151654": {
93
+ "content": "<|vision_pad|>",
94
+ "lstrip": false,
95
+ "normalized": false,
96
+ "rstrip": false,
97
+ "single_word": false,
98
+ "special": true
99
+ },
100
+ "151655": {
101
+ "content": "<|image_pad|>",
102
+ "lstrip": false,
103
+ "normalized": false,
104
+ "rstrip": false,
105
+ "single_word": false,
106
+ "special": true
107
+ },
108
+ "151656": {
109
+ "content": "<|video_pad|>",
110
+ "lstrip": false,
111
+ "normalized": false,
112
+ "rstrip": false,
113
+ "single_word": false,
114
+ "special": true
115
+ },
116
+ "151657": {
117
+ "content": "<tool_call>",
118
+ "lstrip": false,
119
+ "normalized": false,
120
+ "rstrip": false,
121
+ "single_word": false,
122
+ "special": false
123
+ },
124
+ "151658": {
125
+ "content": "</tool_call>",
126
+ "lstrip": false,
127
+ "normalized": false,
128
+ "rstrip": false,
129
+ "single_word": false,
130
+ "special": false
131
+ },
132
+ "151659": {
133
+ "content": "<|fim_prefix|>",
134
+ "lstrip": false,
135
+ "normalized": false,
136
+ "rstrip": false,
137
+ "single_word": false,
138
+ "special": false
139
+ },
140
+ "151660": {
141
+ "content": "<|fim_middle|>",
142
+ "lstrip": false,
143
+ "normalized": false,
144
+ "rstrip": false,
145
+ "single_word": false,
146
+ "special": false
147
+ },
148
+ "151661": {
149
+ "content": "<|fim_suffix|>",
150
+ "lstrip": false,
151
+ "normalized": false,
152
+ "rstrip": false,
153
+ "single_word": false,
154
+ "special": false
155
+ },
156
+ "151662": {
157
+ "content": "<|fim_pad|>",
158
+ "lstrip": false,
159
+ "normalized": false,
160
+ "rstrip": false,
161
+ "single_word": false,
162
+ "special": false
163
+ },
164
+ "151663": {
165
+ "content": "<|repo_name|>",
166
+ "lstrip": false,
167
+ "normalized": false,
168
+ "rstrip": false,
169
+ "single_word": false,
170
+ "special": false
171
+ },
172
+ "151664": {
173
+ "content": "<|file_sep|>",
174
+ "lstrip": false,
175
+ "normalized": false,
176
+ "rstrip": false,
177
+ "single_word": false,
178
+ "special": false
179
+ }
180
+ },
181
+ "additional_special_tokens": [
182
+ "<|im_start|>",
183
+ "<|im_end|>",
184
+ "<|object_ref_start|>",
185
+ "<|object_ref_end|>",
186
+ "<|box_start|>",
187
+ "<|box_end|>",
188
+ "<|quad_start|>",
189
+ "<|quad_end|>",
190
+ "<|vision_start|>",
191
+ "<|vision_end|>",
192
+ "<|vision_pad|>",
193
+ "<|image_pad|>",
194
+ "<|video_pad|>"
195
+ ],
196
+ "bos_token": null,
197
+ "chat_template": "{% for message in messages %}{% if not loop.first %}{{- '\n' }}{% endif %}{{- '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' }}{% if loop.last and add_generation_prompt %}{{- '\n<|im_start|>assistant\n' }}{% endif %}{% endfor %}",
198
+ "clean_up_tokenization_spaces": false,
199
+ "eos_token": "<|im_end|>",
200
+ "errors": "replace",
201
+ "model_max_length": 131072,
202
+ "pad_token": "<|endoftext|>",
203
+ "split_special_tokens": false,
204
+ "tokenizer_class": "Qwen2Tokenizer",
205
+ "unk_token": null,
206
+ "add_bos_token": false
207
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff