Amethyst 13B Mistral - EXL2 - 8bpw, hb8

Description

  • 8 bits per weight.
  • 8 bits for the lm_head (output) layer of the model (the "hb8" in the name), instead of the typical 6.
  • Works fine on 24 GB of VRAM under Windows, without flash attention v2.
  • For me, it runs at about 64% of the 4-bit GPTQ speed.
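
As a rough sanity check on the 24 GB figure, here is a minimal sketch of the weight-size arithmetic. It assumes a nominal 13 billion parameters (the exact count is not stated here) and applies 8 bits to every weight. It ignores the KV cache, activations, and other runtime overhead, so actual VRAM use will be higher.

```python
GIB = 1024 ** 3  # bytes per GiB


def estimate_weight_bytes(num_params: float, bits_per_weight: float) -> float:
    """Approximate storage for the quantized weights alone, in bytes."""
    return num_params * bits_per_weight / 8


# Assumption: "13B" is taken as exactly 13e9 parameters for illustration.
params = 13e9
weights_gib = estimate_weight_bytes(params, 8.0) / GIB

print(f"Approximate weight size at 8 bpw: {weights_gib:.1f} GiB")
```

Under these assumptions the weights alone come to roughly 12 GiB, which leaves headroom on a 24 GB card for the cache and activations.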

I converted the model using the convert.py script from the exllamav2 repo:
https://github.com/turboderp/exllamav2
Its documentation:
https://github.com/turboderp/exllamav2/blob/master/doc/convert.md
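
For reference, a hedged sketch of what such an invocation might look like, assembled in Python. The directory and file names are hypothetical placeholders, not the ones I actually used, and the short flags (-i, -o, -c, -b, -hb) follow the conversion documentation linked above as I understand it. Check them against the version of the repo you have, since options can change between releases.

```python
import shlex


def build_convert_command(model_dir, work_dir, calibration_parquet,
                          bits=8.0, head_bits=8):
    """Assemble an argument list for exllamav2's convert.py.

    All paths are supplied by the caller; nothing is executed here.
    """
    return [
        "python", "convert.py",
        "-i", model_dir,             # input: directory with the original model
        "-o", work_dir,              # output/working directory for the job
        "-c", calibration_parquet,   # calibration dataset (parquet file)
        "-b", str(bits),             # target average bits per weight
        "-hb", str(head_bits),       # bits for the lm_head (output) layer
    ]


# Hypothetical paths, for illustration only.
command = build_convert_command(
    "models/Amethyst-13B-Mistral",
    "work/amethyst-exl2",
    "0000.parquet",
)

# Print a shell-ready command line; run it from the exllamav2 checkout.
print(shlex.join(command))
```

Building the command as a list keeps paths with spaces intact if you later pass it to subprocess.run instead of printing it.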

Measuring the model took 51 minutes; converting it took another 18 minutes.

I used the WikiText-2-v1 dataset for calibration:
https://huggingface.co/datasets/wikitext/blob/refs%2Fconvert%2Fparquet/wikitext-2-v1/test/0000.parquet
