Fixed an issue where ALMA running on CPU caused AutoGPTQ to throw an "Exllama" error:
ValueError: Found modules on cpu/disk. Using Exllama or Exllamav2 backend requires all the modules to be on GPU.
https://github.com/huggingface/transformers/blob/main/docs/source/en/quantization.md#exllama
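
In short, the fix reads the checkpoint's quantization config and turns the ExLlama backend off whenever the model has to run on CPU. A minimal standalone sketch of that pattern (the model id below is a placeholder, not the exact checkpoint the Space loads):

```python
# Sketch of the workaround: disable the ExLlama kernels when a GPTQ-quantized
# model has to run on CPU. The model id is a placeholder.
import torch
import transformers

model_path = "your-org/ALMA-7B-GPTQ"  # placeholder GPTQ checkpoint id
device = "cuda" if torch.cuda.is_available() else "cpu"

config = transformers.AutoConfig.from_pretrained(model_path)
if device == "cpu":
    # The ExLlama/ExLlamaV2 kernels require every module to be on GPU,
    # so they must be turned off for CPU inference.
    config.quantization_config["use_exllama"] = False

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    low_cpu_mem_usage=True,
    config=config,
)
```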
docs/translateModel.md
CHANGED
@@ -16,7 +16,7 @@ The required VRAM is provided for reference and may not apply to everyone. If th
 
 ## M2M100
 
-
+M2M100 is a multilingual translation model introduced by Facebook AI in October 2020. It supports arbitrary translation among 101 languages. The paper is titled "`Beyond English-Centric Multilingual Machine Translation`" ([arXiv:2010.11125](https://arxiv.org/abs/2010.11125)).
 
 | Name | Parameters | Size | type/quantize | Required VRAM |
 |------|------------|------|---------------|---------------|
@@ -40,8 +40,8 @@ NLLB-200 is a multilingual translation model introduced by Meta AI in July 2022.
 |------|------------|------|---------------|---------------|
 | [facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | 600M | 2.46 GB | float32 | ≈2.5 GB |
 | [facebook/nllb-200-distilled-1.3B](https://huggingface.co/facebook/nllb-200-distilled-1.3B) | 1.3B | 5.48 GB | float32 | ≈5.9 GB |
-| [facebook/nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B) | 1.3B | 5.48 GB | float32 | 5.8 GB |
-| [facebook/nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B) | 3.3B | 17.58 GB | float32 | 13.4 GB |
+| [facebook/nllb-200-1.3B](https://huggingface.co/facebook/nllb-200-1.3B) | 1.3B | 5.48 GB | float32 | ≈5.8 GB |
+| [facebook/nllb-200-3.3B](https://huggingface.co/facebook/nllb-200-3.3B) | 3.3B | 17.58 GB | float32 | ≈13.4 GB |
 
 ## NLLB-200-CTranslate2
 
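
The doc change above adds a short description of M2M100; for context, translation with that family goes through the standard transformers API. A small usage sketch, assuming the `facebook/m2m100_418M` checkpoint and English→French as the example pair:

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Load the 418M checkpoint; larger M2M100 checkpoints are used the same way.
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "en"
encoded = tokenizer("Machine translation is fun.", return_tensors="pt")

# M2M100 selects the target language via the forced beginning-of-sentence token.
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("fr"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```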
src/translation/translationModel.py
CHANGED
@@ -124,6 +124,19 @@ class TranslationModel:
             If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
         repetition_penalty (float, optional, defaults to 1.0)
             The parameter for repetition penalty. 1.0 means no penalty. See this paper for more details.
+
+        [transformers.GPTQConfig]
+        use_exllama (bool, optional):
+            Whether to use exllama backend. Defaults to True if unset. Only works with bits = 4.
+
+        [ExLlama]
+        ExLlama is a Python/C++/CUDA implementation of the Llama model that is designed for faster inference with 4-bit GPTQ weights (check out these benchmarks).
+        The ExLlama kernel is activated by default when you create a [GPTQConfig] object.
+        To boost inference speed even further, use the ExLlamaV2 kernels by configuring the exllama_config parameter.
+        The ExLlama kernels are only supported when the entire model is on the GPU.
+        If you're doing inference on a CPU with AutoGPTQ (version > 0.4.2), then you'll need to disable the ExLlama kernel.
+        This overwrites the attributes related to the ExLlama kernels in the quantization config of the config.json file.
+        https://github.com/huggingface/transformers/blob/main/docs/source/en/quantization.md#exllama
         """
         try:
             print('\n\nLoading model: %s\n\n' % self.modelPath)
@@ -148,7 +161,13 @@ class TranslationModel:
             elif "ALMA" in self.modelPath:
                 self.ALMAPrefix = "Translate this from " + self.whisperLang.whisper.names[0] + " to " + self.translationLang.whisper.names[0] + ":\n" + self.whisperLang.whisper.names[0] + ": "
                 self.transTokenizer = transformers.AutoTokenizer.from_pretrained(self.modelPath, use_fast=True)
-                self.transModel = transformers.AutoModelForCausalLM.from_pretrained(self.modelPath, device_map="auto", low_cpu_mem_usage=True, trust_remote_code=False, revision=self.modelConfig.revision)
+                transModelConfig = transformers.AutoConfig.from_pretrained(self.modelPath)
+                if self.device == "cpu":
+                    transModelConfig.quantization_config["use_exllama"] = False
+                    self.transModel = transformers.AutoModelForCausalLM.from_pretrained(self.modelPath, device_map="auto", low_cpu_mem_usage=True, trust_remote_code=False, revision=self.modelConfig.revision, config=transModelConfig)
+                else:
+                    # transModelConfig.quantization_config["exllama_config"] = {"version":2} # After configuring to use ExLlamaV2, VRAM cannot be effectively released, which may be an issue. Temporarily not adopting the V2 version.
+                    self.transModel = transformers.AutoModelForCausalLM.from_pretrained(self.modelPath, device_map="auto", low_cpu_mem_usage=True, trust_remote_code=False, revision=self.modelConfig.revision)
                 self.transTranslator = transformers.pipeline("text-generation", model=self.transModel, tokenizer=self.transTokenizer, do_sample=True, temperature=0.7, top_k=40, top_p=0.95, repetition_penalty=1.1)
             else:
                 self.transTokenizer = transformers.AutoTokenizer.from_pretrained(self.modelPath)
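
The docstring added here condenses the transformers quantization guide linked above. Per that guide, the same switches can also be set through `GPTQConfig` at load time rather than by editing the `AutoConfig`; a sketch under that assumption (the model id is a placeholder):

```python
from transformers import AutoModelForCausalLM, GPTQConfig

model_id = "your-org/your-model-GPTQ"  # placeholder for a GPTQ-quantized checkpoint

# CPU inference (AutoGPTQ > 0.4.2): the ExLlama kernels must be disabled.
cpu_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cpu",
    quantization_config=GPTQConfig(bits=4, use_exllama=False),
)

# GPU inference: the ExLlamaV2 kernels can be enabled instead; this commit stays on
# the default (v1) kernels because of the VRAM-release issue noted in the code comment.
gpu_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=GPTQConfig(bits=4, exllama_config={"version": 2}),
)
```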