Inference endpoint fails to deploy
Hi,
The HF inference endpoint fails to deploy with
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 317, in get_model
    raise NotImplementedError

NotImplementedError: Mixtral models requires flash attention v2, stk and megablocks
Any thoughts on this?
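For anyone hitting the same error: it names three packages the Mixtral path in text-generation-server needs. A quick probe run inside the container shows which ones are missing (a hedged sketch; the importable module names are assumed from the error message and the usual pip packages, and may differ by container build):

```python
import importlib.util

# Modules named in the NotImplementedError above. Assumption: flash-attention v2
# imports as "flash_attn", and stk / megablocks import under their own names.
required = ["flash_attn", "stk", "megablocks"]
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("missing:", ", ".join(missing))
else:
    print("all Mixtral prerequisites importable")
```

If any come up missing, the container image itself lacks the Mixtral prerequisites, so switching instance types alone won't help.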
Later edit: another attempt fails with

raise NotImplementedError("Mixtral does not support weight quantization yet.")

NotImplementedError: Mixtral does not support weight quantization yet.
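This second error suggests the endpoint config had a quantization option enabled. As a sketch of the equivalent local TGI launch with quantization left off (assuming the standard text-generation-inference container; the image tag and exact flags may differ by version), a config fragment would look like:

```shell
# Hedged sketch: run TGI with NO --quantize option, since Mixtral rejected
# quantized weights at the time of this thread.
# The ":latest" tag is an assumption; use whatever tag your endpoint uses.
docker run --gpus all --shm-size 1g -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --num-shard 2
# Deliberately no "--quantize bitsandbytes" / "--quantize gptq" here.
```

On Inference Endpoints the equivalent is leaving the quantization setting unset in the endpoint configuration.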
What instance type, container and config did you use? The default config should work with 2x A100 80GB, or use this link: https://ui.endpoints.huggingface.co/new?repository=mistralai%2FMixtral-8x7B-Instruct-v0.1&vendor=aws&region=us-east-1&accelerator=gpu&instance_size=2xlarge&task=text-generation&no_suggested_compute=true&tgi=true&tgi_max_batch_total_tokens=1024000&tgi_max_total_tokens=32000
Gotcha, thanks for the info. I was following the UI and tried with the first available instance type that didn't say "Low Memory". Will try with 2xA100 once I get access to it. Thanks.
Got access to 2xA100, and now it doesn't seem to get past this point:
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/whoami-v2 (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f88d9b97b80>: Failed to resolve 'huggingface.co' ([Errno -3] Temporary failure in name resolution)"))
Anything else you reckon I should try?
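One thing worth checking: that name-resolution failure can be reproduced independently of TGI with a minimal probe (a hedged sketch; run it from wherever the endpoint container runs):

```python
import socket

# Reproduce the NameResolutionError above without TGI. If this fails,
# the container has no working outbound DNS (e.g. a VPC/egress
# misconfiguration), which is an infrastructure issue, not a model one.
try:
    infos = socket.getaddrinfo("huggingface.co", 443)
    print(f"resolved huggingface.co to {len(infos)} address(es)")
except socket.gaierror as err:
    print(f"DNS resolution failed: {err}")
```

If the probe fails too, the fix is on the networking side rather than in the endpoint config.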