updating readme
README.md CHANGED
````diff
@@ -96,6 +96,8 @@ NOTE: Things that we had to modify in order for BLOOMChat to work:
 - Change the model name from `bigscience/bloom` to `sambanovasystems/BLOOMChat-176B-v1`
 - Modifying `inference_server/models/hf_accelerate.py`
   - This is because, in our testing of this repo, we used 4 80GB A100 GPUs and would run into memory issues
+- Modifying `inference_server/cli.py`
+  - This is because the model was trained using specific `<human>:` and `<bot>:` tags
 
 Modifications for `inference_server/models/hf_accelerate.py`:
 
@@ -112,6 +114,18 @@ class HFAccelerateModel(Model):
         kwargs["max_memory"] = reduce_max_memory_dict
 ```
 
+Modifications for `inference_server/cli.py`:
+
+```python
+def main() -> None:
+    ...
+    while True:
+        input_text = input("Input text: ")
+
+        input_text = input_text.strip()
+        modified_input_text = f"<human>: {input_text}\n<bot>:"
+```
+
 Running command for int8 (suboptimal performance, but fast inference time):
 ```
 python -m inference_server.cli --model_name sambanovasystems/BLOOMChat-176B-v1 --model_class AutoModelForCausalLM --dtype int8 --deployment_framework hf_accelerate --generate_kwargs '{"do_sample": false, "temperature": 0.8, "repetition_penalty": 1.2, "top_p": 0.9, "max_new_tokens": 512}'
````
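The diff shows only the last line of the `hf_accelerate.py` change (`kwargs["max_memory"] = reduce_max_memory_dict`); the body that builds the dict is not part of this excerpt. Below is a minimal, hypothetical sketch of that kind of edit, assuming Accelerate's `max_memory` convention of mapping device ids to size strings: the helper name `reduce_max_memory`, the headroom value, and the call site are illustrative assumptions, not the repo's actual code.

```python
import torch

# Hypothetical sketch only: the real body of this change in
# inference_server/models/hf_accelerate.py is not shown in the diff above.
def reduce_max_memory(headroom_gib: int = 10) -> dict:
    """Build a max_memory dict that reserves headroom on every GPU.

    Capping each A100 below its full 80 GiB leaves room for activations
    and intermediate buffers, which is the kind of OOM the "memory
    issues" note above describes for a 176B model on 4 GPUs.
    """
    total_gib = torch.cuda.get_device_properties(0).total_memory // 1024**3
    per_gpu = f"{total_gib - headroom_gib}GiB"
    return {gpu_id: per_gpu for gpu_id in range(torch.cuda.device_count())}

# Matching the context line in the diff:
# reduce_max_memory_dict = reduce_max_memory()
# kwargs["max_memory"] = reduce_max_memory_dict
```

Whatever the exact numbers, the intent is the same: give the model-loading kwargs a `max_memory` budget so the automatic device map never fills any GPU to the brim.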
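For the `cli.py` hunk, the tag format is shown directly in the diff; here is a quick illustration of the prompt string the modified loop would hand to the model (the sample input is invented):

```python
# Illustration of the prompt built by the modified CLI loop above.
input_text = "What is the capital of France?"

input_text = input_text.strip()
modified_input_text = f"<human>: {input_text}\n<bot>:"
print(modified_input_text)
# Prints:
# <human>: What is the capital of France?
# <bot>:
```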