FM-1976/gemma-2-2b-it-Q5_K_M-GGUF

This model was converted to GGUF format from google/gemma-2-2b-it using llama.cpp via ggml.ai's GGUF-my-repo space. Refer to the original model card for more details on the model.

Description

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. They are text-to-text, decoder-only large language models, available in English, with open weights for both pre-trained and instruction-tuned variants. Gemma models are well suited to a variety of text generation tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources, such as a laptop, a desktop, or your own cloud infrastructure, democratizing access to state-of-the-art AI models and helping foster innovation for everyone.

Model Details

context window = 8192 tokens
SYSTEM MESSAGE NOT SUPPORTED

architecture str = gemma2
        type str = model
        name str = Gemma 2 2b It
    finetune str = it
    basename str = gemma-2
  size_label str = 2B
     license str = gemma
       count u32 = 1
model.0.name str = Gemma 2 2b
organization str = Google
format           = GGUF V3 (latest)
arch             = gemma2
vocab type       = SPM
n_vocab          = 256000
n_merges         = 0
vocab_only       = 0
n_ctx_train      = 8192
n_embd           = 2304
n_layer          = 26
n_head           = 8
n_head_kv        = 4
model type       = 2B
model ftype      = Q5_K - Medium
model params     = 2.61 B
model size       = 1.79 GiB (5.87 BPW)
general.name     = Gemma 2 2b It
BOS token        = 2 '<bos>'
EOS token        = 1 '<eos>'
UNK token        = 3 '<unk>'
PAD token        = 0 '<pad>'
LF token         = 227 '<0x0A>'
EOT token        = 107 '<end_of_turn>'
EOG token        = 1 '<eos>'
EOG token        = 107 '<end_of_turn>'
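
As a quick sanity check on the figures above, the reported file size follows from the parameter count and the average bits per weight (BPW). The sketch below only redoes that arithmetic; the function name is illustrative, and the result lands slightly under the reported 1.79 GiB because the parameter count in the listing is rounded to 2.61 B.

```python
def quantized_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate on-disk size in GiB from parameter count and average BPW."""
    total_bytes = n_params * bits_per_weight / 8  # bits -> bytes
    return total_bytes / 2**30                    # bytes -> GiB


# Values taken from the metadata listing above (rounded by llama.cpp).
size = quantized_size_gib(2.61e9, 5.87)
print(f"{size:.2f} GiB")
```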

>>> System role not supported
Available chat formats from metadata: chat_template.default
Using gguf chat template: {{ bos_token }}{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{{ '<start_of_turn>' + role + '
' + message['content'] | trim + '<end_of_turn>
' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model
'}}{% endif %}
Using chat eos_token: <eos>
Using chat bos_token: <bos>
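
Because the template raises an exception when the first message has the system role, instructions have to travel inside a user turn. One common workaround (an assumption here, not something the original model card prescribes) is to prepend the system text to the first user message. The helper name `fold_system_message` is hypothetical.

```python
from typing import Dict, List

Message = Dict[str, str]


def fold_system_message(messages: List[Message]) -> List[Message]:
    """Return a copy of `messages` with a leading system message merged
    into the first user message. The input list is not modified."""
    if not messages or messages[0]["role"] != "system":
        return [dict(m) for m in messages]

    system_text = messages[0]["content"].strip()
    rest = [dict(m) for m in messages[1:]]

    if rest and rest[0]["role"] == "user":
        # Prepend the instructions, separated by a blank line.
        rest[0]["content"] = system_text + "\n\n" + rest[0]["content"]
        return rest

    # No user turn to merge into: send the instructions as a user turn.
    return [{"role": "user", "content": system_text}] + rest
```

The returned list can then be passed as `messages` to create_chat_completion() in place of the original conversation.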

Prompt Format

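<bos><start_of_turn>user
{prompt}<end_of_turn>
<start_of_turn>model

The prompt stops after <start_of_turn>model and a newline. The model then generates its reply and closes the turn itself with <end_of_turn>.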
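
The same layout can be produced programmatically. The function below is a plain-Python sketch that mirrors the Jinja chat template printed in the log above (role checks, `assistant` mapped to `model`, content trimmed); the function name is hypothetical, and the template stored in the GGUF file remains the authoritative version.

```python
def render_gemma_prompt(messages, add_generation_prompt=True, bos_token="<bos>"):
    """Render a list of {'role', 'content'} dicts into a Gemma 2 prompt string."""
    if messages and messages[0]["role"] == "system":
        raise ValueError("System role not supported")

    parts = [bos_token]
    for index, message in enumerate(messages):
        # Even positions must be 'user', odd positions must be 'assistant'.
        if (message["role"] == "user") != (index % 2 == 0):
            raise ValueError(
                "Conversation roles must alternate user/assistant/user/assistant/..."
            )
        role = "model" if message["role"] == "assistant" else message["role"]
        parts.append(
            f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n"
        )

    if add_generation_prompt:
        # Open the model turn and leave it for the model to complete.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


print(render_gemma_prompt([{"role": "user", "content": "Explain Science in one sentence."}]))
```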

Chat Template

The instruction-tuned models use a chat template that must be adhered to for conversational use. The easiest way to apply it is to rely on the chat template embedded in the GGUF file: pass the conversation as a list of messages, as in the following snippet, and let create_chat_completion() do the formatting.

messages = [
    {"role": "user", "content": "Write me a poem about Machine Learning."},
]

Use with llama-cpp-python

Install llama-cpp-python with pip

pip install llama-cpp-python

Download the GGUF file locally (PowerShell syntax shown; with GNU wget on Linux or macOS, use -O instead of -OutFile)

wget https://huggingface.co/FM-1976/gemma-2-2b-it-Q5_K_M-GGUF/resolve/main/gemma-2-2b-it-q5_k_m.gguf -OutFile gemma-2-2b-it-q5_k_m.gguf
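
If wget is not available, the same file can be fetched with Python's standard library. This is a minimal sketch: the helper names are hypothetical, and the URL pattern is simply the one used in the command above. The download call is left commented out because the file is about 1.79 GiB.

```python
import urllib.request

REPO = "FM-1976/gemma-2-2b-it-Q5_K_M-GGUF"
FILENAME = "gemma-2-2b-it-q5_k_m.gguf"


def gguf_url(repo: str, filename: str) -> str:
    """Build the direct download URL for a file on the repo's main branch."""
    return f"https://huggingface.co/{repo}/resolve/main/{filename}"


def download_gguf(repo: str = REPO, filename: str = FILENAME) -> str:
    """Download the file into the current directory and return its local path."""
    path, _headers = urllib.request.urlretrieve(gguf_url(repo, filename), filename)
    return path


# download_gguf()  # uncomment to start the download
```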

Open your Python REPL

Using chat_template

from llama_cpp import Llama

nCTX = 8192        # context window, from the model metadata
sTOPS = ['<eos>']  # stop sequence

# Load the model. Sampling options are set per call below.
llm = Llama(
    model_path='gemma-2-2b-it-q5_k_m.gguf',
    n_ctx=nCTX,
    verbose=False,
)
# No system message here: the chat template rejects the system role.
messages = [
    {"role": "user", "content": "Write me a poem about Machine Learning."},
]
response = llm.create_chat_completion(
    messages=messages,
    temperature=0.15,
    repeat_penalty=1.178,
    stop=sTOPS,
    max_tokens=500,
)
print(response['choices'][0]['message']['content'])
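
For a conversation that spans several turns, the message list has to keep alternating user and assistant roles, as the chat template requires. The helper below is a small sketch of that bookkeeping; `chat_turn` is a hypothetical name, and `llm` is assumed to be any object exposing create_chat_completion() with the response shape used above.

```python
def chat_turn(llm, history, user_text, **sampling):
    """Append a user message, query the model, store and return its reply.

    `history` is a list of {'role', 'content'} dicts and is updated in place.
    Extra keyword arguments (temperature, max_tokens, ...) are forwarded.
    """
    history.append({"role": "user", "content": user_text})
    response = llm.create_chat_completion(messages=history, **sampling)
    reply = response["choices"][0]["message"]["content"]
    # Store the reply under the 'assistant' role; the template maps it to 'model'.
    history.append({"role": "assistant", "content": reply})
    return reply
```

With the `llm` object created above, a call such as `chat_turn(llm, history, "What is science?", temperature=0.15, max_tokens=500)` can be repeated with the same `history` list to continue the conversation.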

Using create_completion

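from llama_cpp import Llama

nCTX = 8192
sTOPS = ['<eos>']
llm = Llama(
    model_path='gemma-2-2b-it-q5_k_m.gguf',
    n_ctx=nCTX,
    verbose=False,
)
prompt = 'Explain Science in one sentence.'
# Wrap the prompt in the Gemma turn markers. The template ends right after
# '<start_of_turn>model' and a newline, so the model writes the reply.
# Note: some llama-cpp-python versions prepend <bos> on their own; drop it
# from the template if you see a duplicate-BOS warning.
template = f'''<bos><start_of_turn>user
{prompt}<end_of_turn>
<start_of_turn>model
'''
# Pass the formatted template, not the bare prompt.
res = llm.create_completion(template, temperature=0.15, max_tokens=500, repeat_penalty=1.178, stop=sTOPS)
print(res['choices'][0]['text'])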

Streaming text

llama-cpp-python can also stream text during inference.
Tokens are decoded and printed as soon as they are generated, so you don't have to wait until the entire inference is done.

You can use both create_chat_completion() and create_completion() methods.

Streaming with create_chat_completion() method

import datetime
from llama_cpp import Llama

nCTX = 8192
sTOPS = ['<eos>']
llm = Llama(
    model_path='gemma-2-2b-it-q5_k_m.gguf',
    n_ctx=nCTX,
    verbose=False,
)
ttftoken = None  # time to first token, set when the first text arrives
full_response = ''
message = [{'role': 'user', 'content': 'what is science?'}]
start = datetime.datetime.now()
for chunk in llm.create_chat_completion(
    messages=message,
    temperature=0.15,
    repeat_penalty=1.31,
    stop=sTOPS,
    max_tokens=500,
    stream=True,
):
    # Some chunks carry no text (for example role or finish markers): skip them.
    text = chunk["choices"][0]["delta"].get("content")
    if not text:
        continue
    if ttftoken is None:
        ttftoken = datetime.datetime.now() - start
    print(text, end="", flush=True)
    full_response += text
print()
if ttftoken is not None:
    print(f'Time to first token: {ttftoken.total_seconds():.2f} seconds')

Streaming with create_completion() method

import datetime
from llama_cpp import Llama

nCTX = 8192
sTOPS = ['<eos>']
llm = Llama(
    model_path='gemma-2-2b-it-q5_k_m.gguf',
    n_ctx=nCTX,
    verbose=False,
)
ttftoken = None  # time to first token, set when the first chunk arrives
full_response = ''
prompt = 'Explain Science in one sentence.'
# The template ends after '<start_of_turn>model' and a newline,
# so the model writes the reply.
template = f'''<bos><start_of_turn>user
{prompt}<end_of_turn>
<start_of_turn>model
'''
start = datetime.datetime.now()
# Pass the formatted template, not the bare prompt.
for chunk in llm.create_completion(
    template,
    temperature=0.15,
    repeat_penalty=1.78,
    stop=sTOPS,
    max_tokens=500,
    stream=True,
):
    text = chunk["choices"][0]["text"]
    if ttftoken is None:
        ttftoken = datetime.datetime.now() - start
    print(text, end="", flush=True)
    full_response += text
print()
if ttftoken is not None:
    print(f'Time to first token: {ttftoken.total_seconds():.2f} seconds')
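
Both streaming loops follow the same pattern, so the timing logic can be factored into one helper that works on any iterator of chunks. This is a sketch under stated assumptions: `consume_stream`, `chat_text` and `completion_text` are hypothetical names, and the chunk shapes are the ones used in the two examples above. Timing starts when the helper is called, so it includes prompt processing only if the stream is lazy, as a generator is.

```python
import time
from typing import Any, Callable, Iterable, Optional, Tuple


def consume_stream(
    chunks: Iterable[Any],
    get_text: Callable[[Any], Optional[str]],
    on_text: Optional[Callable[[str], None]] = None,
) -> Tuple[str, Optional[float]]:
    """Collect streamed text and measure the time to the first non-empty piece.

    Returns (full_text, seconds_to_first_text); the second value is None
    when the stream produced no text at all.
    """
    start = time.perf_counter()
    ttft = None
    pieces = []
    for chunk in chunks:
        text = get_text(chunk)
        if not text:
            continue  # skip chunks without text
        if ttft is None:
            ttft = time.perf_counter() - start
        pieces.append(text)
        if on_text is not None:
            on_text(text)
    return "".join(pieces), ttft


def chat_text(chunk):
    """Extract text from a create_chat_completion() stream chunk."""
    return chunk["choices"][0]["delta"].get("content")


def completion_text(chunk):
    """Extract text from a create_completion() stream chunk."""
    return chunk["choices"][0]["text"]
```

For example, `consume_stream(llm.create_chat_completion(messages=message, stream=True), chat_text, lambda t: print(t, end="", flush=True))` prints tokens as they arrive and returns the full reply together with the time to first token.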

Further exploration

You can also serve the model through an OpenAI-compatible API server.
This can be done with both llama-cpp-python[server] and llamafile.
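
Once such a server is running, any HTTP client can talk to it. The sketch below uses only the standard library to build a chat completions request; the base URL and port are assumptions (adjust them to wherever your server listens), the helper names are hypothetical, and the response shape is assumed to match the dictionaries returned by create_chat_completion() above.

```python
import json
import urllib.request


def build_chat_request(messages, base_url="http://localhost:8000/v1", **params):
    """Build a POST request for an OpenAI-style /chat/completions endpoint.

    Extra keyword arguments (temperature, max_tokens, ...) go into the JSON body.
    """
    payload = {"messages": messages, **params}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request(
    [{"role": "user", "content": "What is science?"}],
    temperature=0.15,
    max_tokens=500,
)
# print(send_chat_request(request))  # requires a running server
```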
