# Gemma-7b-4.0bpw-exl2 Model Card
This is a 4-bit (4.0 bpw) EXL2 quantization of the base Google Gemma-7b model, produced with turboderp's ExLlamaV2 0.0.13.post2.
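For reference, a quant like this is typically produced with ExLlamaV2's `convert.py`. The sketch below is illustrative only; the paths are placeholders and the exact calibration settings used for this model are not recorded here:

```bash
# Illustrative only: producing a 4.0 bpw EXL2 quant with ExLlamaV2's convert.py.
#   -i  : input directory containing the original FP16 model
#   -o  : scratch/working directory for intermediate files
#   -cf : output directory for the quantized model
#   -b  : target bits per weight
python convert.py -i /path/to/gemma-7b -o /path/to/work_dir -cf /path/to/gemma-7b-4.0bpw-exl2 -b 4.0
```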
## Usage
### Download the model
Either clone the repo directly (this requires git-lfs to fetch the weight files):

```bash
git clone https://huggingface.co/saucam/gemma-7b-4.0bpw-exl2
```

or use the Hugging Face CLI:

```bash
pip3 install huggingface-hub
mkdir gemma-7b-4.0bpw-exl2
huggingface-cli download saucam/gemma-7b-4.0bpw-exl2 --local-dir gemma-7b-4.0bpw-exl2
```
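Alternatively, a minimal Python sketch using `huggingface_hub` (the local directory name below is just an example):

```python
from huggingface_hub import snapshot_download

# Download the quantized model files into a local directory
snapshot_download(
    repo_id="saucam/gemma-7b-4.0bpw-exl2",
    local_dir="gemma-7b-4.0bpw-exl2",
)
```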
### Running the model on a GPU
Clone the exllamav2 repo for the inference script:

```bash
git clone https://github.com/turboderp/exllamav2.git
cd exllamav2
```
Run the inference script, pointing it at the model directory downloaded in the previous step:

```bash
python test_inference.py -m /path/to/gemma-7b-4.0bpw-exl2 -p "I have a dream"
```
Sample run output:

```
python test_inference.py -m /path/to/gemma-7b-4.0bpw-exl2 -p "I have a dream"
 -- Model: /path/to/gemma-7b-4.0bpw-exl2
 -- Options: []
 -- Loading model...
 -- Loaded model in 2.0420 seconds
 -- Loading tokenizer...
 -- Warmup...
 -- Generating...

I have a dream of becoming an astronaut and I am going to work hard to achieve my goal.
I want you all the best in your life, do what makes u happy! Be strong!!
Don't give up on your dreams because they are yours for a reason :)
This is my last year at this school...it feels weird but good...because it means that I will be graduating next year!!! It has been a great experience and journey at PS 128....and we still got one last summer together!!!!! woohoo!!!!
We had our first class trip to The Metropolitan Museum Of Art today...It was so much fun

 -- Response generated in 0.96 seconds, 128 tokens, 132.99 tokens/second (includes prompt eval.)
```
As you can see, inference is quite fast.
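For programmatic use, here is a minimal Python sketch using the ExLlamaV2 API (class and module names as of ExLlamaV2 0.0.13; the model path and sampling settings are placeholders):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Load the quantized model (path is a placeholder)
config = ExLlamaV2Config()
config.model_dir = "/path/to/gemma-7b-4.0bpw-exl2"
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)          # split layers across available GPUs
tokenizer = ExLlamaV2Tokenizer(config)

# Simple non-streaming generation
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_p = 0.9

output = generator.generate_simple("I have a dream", settings, num_tokens=128)
print(output)
```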