# Gemma-7b-4.0bpw-exl2 Model Card
This is a 4-bit (4.0 bpw) EXL2 quantization of the base Google Gemma-7b model, produced with turboderp's ExLlamaV2 0.0.13.post2.
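For reference, a quant like this is typically produced with ExLlamaV2's `convert.py`. The sketch below is illustrative only; the paths are placeholders and the exact calibration settings used for this model are not recorded here:

```bash
# Illustrative only: producing a 4.0 bpw EXL2 quant with ExLlamaV2's convert.py.
#   -i  : input directory containing the original FP16 model
#   -o  : scratch/working directory for intermediate files
#   -cf : output directory for the quantized model
#   -b  : target bits per weight
python convert.py -i /path/to/gemma-7b -o /path/to/work_dir -cf /path/to/gemma-7b-4.0bpw-exl2 -b 4.0
```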
## Usage
### Download the model
Either clone the repo directly (this requires git-lfs to fetch the weight files):

```bash
git clone https://huggingface.co/saucam/gemma-7b-4.0bpw-exl2
```

or use the Hugging Face CLI:

```bash
pip3 install huggingface-hub
mkdir gemma-7b-4.0bpw-exl2
huggingface-cli download saucam/gemma-7b-4.0bpw-exl2 --local-dir gemma-7b-4.0bpw-exl2
```
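Alternatively, a minimal Python sketch using `huggingface_hub` (the local directory name below is just an example):

```python
from huggingface_hub import snapshot_download

# Download the quantized model files into a local directory
snapshot_download(
    repo_id="saucam/gemma-7b-4.0bpw-exl2",
    local_dir="gemma-7b-4.0bpw-exl2",
)
```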
### Running the model on a GPU
Clone the exllamav2 repo for the inference script:

```bash
git clone https://github.com/turboderp/exllamav2.git
cd exllamav2
```
Run the inference script, pointing it at the model directory downloaded in the previous step:

```bash
python test_inference.py -m /path/to/gemma-7b-4.0bpw-exl2 -p "I have a dream"
```
Sample run output:

```
python test_inference.py -m /path/to/gemma-7b-4.0bpw-exl2 -p "I have a dream"
 -- Model: /path/to/gemma-7b-4.0bpw-exl2
 -- Options: []
 -- Loading model...
 -- Loaded model in 2.0420 seconds
 -- Loading tokenizer...
 -- Warmup...
 -- Generating...

I have a dream of becoming an astronaut and I am going to work hard to achieve my goal.
I want you all the best in your life, do what makes u happy! Be strong!!
Don't give up on your dreams because they are yours for a reason :)
This is my last year at this school...it feels weird but good...because it means that I will be graduating next year!!! It has been a great experience and journey at PS 128....and we still got one last summer together!!!!! woohoo!!!!
We had our first class trip to The Metropolitan Museum Of Art today...It was so much fun

 -- Response generated in 0.96 seconds, 128 tokens, 132.99 tokens/second (includes prompt eval.)
```
As you can see, inference is quite fast.
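For programmatic use, here is a minimal Python sketch using the ExLlamaV2 API (class and module names as of ExLlamaV2 0.0.13; the model path and sampling settings are placeholders):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Load the quantized model (path is a placeholder)
config = ExLlamaV2Config()
config.model_dir = "/path/to/gemma-7b-4.0bpw-exl2"
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)          # split layers across available GPUs
tokenizer = ExLlamaV2Tokenizer(config)

# Simple non-streaming generation
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_p = 0.9

output = generator.generate_simple("I have a dream", settings, num_tokens=128)
print(output)
```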