---
base_model: unsloth/gemma-2-2b-it-bnb-4bit
language:
- en
license: apache-2.0
tags:
- text-generation-inference
- transformers
- unsloth
- gemma2
- trl
---

# Athena-codegemma-2-2b-it for coding

Supervised fine-tuned (SFT with Unsloth) for coding on the EpistemeAI coding dataset.

# Original Model card

## Model Information

Summary description and brief definition of inputs and outputs.

### Description

Gemma is a family of lightweight, state-of-the-art open models from Google,
built from the same research and technology used to create the Gemini models.
They are text-to-text, decoder-only large language models, available in English,
with open weights for both pre-trained variants and instruction-tuned variants.
Gemma models are well-suited for a variety of text generation tasks, including
question answering, summarization, and reasoning. Their relatively small size
makes it possible to deploy them in environments with limited resources such as
a laptop, desktop or your own cloud infrastructure, democratizing access to
state-of-the-art AI models and helping foster innovation for everyone.

### Usage

The snippets below show how to get started quickly with running the model. First, install the Transformers library with:

```sh
pip install -U transformers
```

Then, copy the snippet from the section that is relevant for your use case.

#### Running with the `pipeline` API

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="EpistemeAI/Athena-codegemma-2-2b-it",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",  # replace with "mps" to run on a Mac device
)

messages = [
    {"role": "user", "content": "Who are you? Please, answer in pirate-speak."},
]

outputs = pipe(messages, max_new_tokens=256)
assistant_response = outputs[0]["generated_text"][-1]["content"].strip()
print(assistant_response)
# Ahoy, matey! I be Gemma, a digital scallywag, a language-slingin' parrot of the digital seas. I be here to help ye with yer wordy woes, answer yer questions, and spin ye yarns of the digital world. So, what be yer pleasure, eh? 🦜
```

#### Running the model on a single / multi GPU

```python
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
model = AutoModelForCausalLM.from_pretrained(
    "EpistemeAI/Athena-codegemma-2-2b-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```

You can ensure the correct chat template is applied by using `tokenizer.apply_chat_template` as follows:

```python
messages = [
    {"role": "user", "content": "Write me a poem about Machine Learning."},
]
# add_generation_prompt=True opens the model turn so the model replies instead of continuing the user turn
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt", return_dict=True).to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
```

<a name="precisions"></a>
#### Running the model on a GPU using different precisions

The native weights of this model were exported in `bfloat16` precision.

If you omit the dtype, the weights are loaded as `float32`. This does not increase precision: the `bfloat16` weights are simply upcast to `float32`. See the example below.

* _Upcasting to `torch.float32`_

```python
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
model = AutoModelForCausalLM.from_pretrained(
    "EpistemeAI/Athena-codegemma-2-2b-it",
    device_map="auto",
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```

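
To illustrate why upcasting `bfloat16` weights to `float32` adds no precision, here is a minimal standard-library sketch. It relies on the fact that a `bfloat16` value is the upper 16 bits of the corresponding `float32` encoding. The helper names `to_bfloat16_bits` and `bfloat16_bits_to_float32` are hypothetical and exist only for this illustration; they are not part of PyTorch or Transformers, and NaN handling is deliberately left out.

```python
import struct


def to_bfloat16_bits(x: float) -> int:
    """Round a Python float to bfloat16 and return its 16-bit pattern."""
    # Reinterpret the float32 encoding of x as a 32-bit unsigned integer.
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    # Round to nearest, ties to even, on the 16 low bits that bfloat16 drops.
    rounding_bias = 0x7FFF + ((bits32 >> 16) & 1)
    return ((bits32 + rounding_bias) >> 16) & 0xFFFF


def bfloat16_bits_to_float32(bits16: int) -> float:
    """Upcast a bfloat16 bit pattern to float32 by zero-filling the low 16 bits."""
    (value,) = struct.unpack("<f", struct.pack("<I", bits16 << 16))
    return value


original = 0.1234567  # illustrative value, not an actual model weight
stored = to_bfloat16_bits(original)
upcast = bfloat16_bits_to_float32(stored)

print(f"original float : {original!r}")
print(f"after upcast   : {upcast!r}")
# Re-encoding the upcast value yields the same bfloat16 pattern,
# so the digits lost at export time are not recovered.
print(to_bfloat16_bits(upcast) == stored)
```

The upcast value occupies more memory but still carries only the information that survived the `bfloat16` export.
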
#### Running the model through a CLI

The [local-gemma](https://github.com/huggingface/local-gemma) repository contains a lightweight wrapper around Transformers
for running Gemma 2 through a command line interface, or CLI. Follow the [installation instructions](https://github.com/huggingface/local-gemma#cli-usage)
for getting started, then launch the CLI through the following command:

```shell
local-gemma --model 2b --preset speed
```

#### Quantized Versions through `bitsandbytes`

<details>
<summary>
Using 8-bit precision (int8)
</summary>

```python
# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
model = AutoModelForCausalLM.from_pretrained(
    "EpistemeAI/Athena-codegemma-2-2b-it",
    quantization_config=quantization_config,
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```

</details>

<details>
<summary>
Using 4-bit precision
</summary>

```python
# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
model = AutoModelForCausalLM.from_pretrained(
    "EpistemeAI/Athena-codegemma-2-2b-it",
    quantization_config=quantization_config,
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```

</details>

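
As a rough guide to what the precision choices above mean for memory, the following back-of-the-envelope sketch multiplies a parameter count by the bits stored per weight. The count of about 2.6 billion parameters is an assumption made for illustration, and `weight_memory_gib` is a hypothetical helper, not a library function. The result covers weights only. Real usage is higher because activations and the KV cache need memory too, and quantized checkpoints typically keep some layers and quantization constants at higher precision.

```python
# Assumption: roughly 2.6 billion parameters; check the model itself for the real count.
ASSUMED_PARAM_COUNT = 2_600_000_000

# Bits stored per weight for each loading option discussed above.
BITS_PER_WEIGHT = {"float32": 32, "bfloat16": 16, "int8": 8, "4-bit": 4}


def weight_memory_gib(param_count: int, bits_per_weight: int) -> float:
    """Return the approximate size of the weights alone, in GiB."""
    return param_count * bits_per_weight / 8 / 1024**3


for name, bits in BITS_PER_WEIGHT.items():
    size = weight_memory_gib(ASSUMED_PARAM_COUNT, bits)
    print(f"{name:>8}: ~{size:.1f} GiB")
```

Each halving of the bit width halves the estimate, which is the main reason to consider the 8-bit and 4-bit options on memory-constrained GPUs.
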
#### Advanced Usage

<details>
<summary>
Torch compile
</summary>

[Torch compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) is a method for speeding up the
inference of PyTorch modules. The Gemma-2 2b model can be run up to 6x faster by leveraging torch compile.

Note that two warm-up steps are required before the full inference speed is realised:

```python
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from transformers import AutoTokenizer, Gemma2ForCausalLM
from transformers.cache_utils import HybridCache
import torch

torch.set_float32_matmul_precision("high")

# load the model + tokenizer
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
model = Gemma2ForCausalLM.from_pretrained("EpistemeAI/Athena-codegemma-2-2b-it", torch_dtype=torch.bfloat16)
model.to("cuda")

# apply the torch compile transformation
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

# pre-process inputs
input_text = "The theory of special relativity states "
model_inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
prompt_length = model_inputs.input_ids.shape[1]

# set-up k/v cache
past_key_values = HybridCache(
    config=model.config,
    max_batch_size=1,
    max_cache_len=model.config.max_position_embeddings,
    device=model.device,
    dtype=model.dtype
)

# enable passing kv cache to generate
model._supports_cache_class = True
model.generation_config.cache_implementation = None

# two warm-up steps
for idx in range(2):
    outputs = model.generate(**model_inputs, past_key_values=past_key_values, do_sample=True, temperature=1.0, max_new_tokens=128)
    past_key_values.reset()

# fast run
outputs = model.generate(**model_inputs, past_key_values=past_key_values, do_sample=True, temperature=1.0, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For more details, refer to the [Transformers documentation](https://huggingface.co/docs/transformers/main/en/llm_optims?static-kv=basic+usage%3A+generation_config).

</details>

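
One way to observe the warm-up effect on your own hardware is to time each call separately. The sketch below is a generic timing helper. `time_each_run` and `fake_generate` are hypothetical names introduced here, and the stand-in workload only exists so the snippet runs without a GPU. When timing CUDA work, it is usually worth calling `torch.cuda.synchronize()` before reading the clock, because kernels are launched asynchronously.

```python
import time
from typing import Callable, List


def time_each_run(run: Callable[[], object], n_runs: int = 4) -> List[float]:
    """Call `run` n_runs times and return the wall-clock seconds of each call."""
    durations = []
    for _ in range(n_runs):
        start = time.perf_counter()
        run()
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in workload so the sketch runs anywhere. To time the compiled model,
# pass a function that calls model.generate(...) and then resets the cache.
def fake_generate() -> int:
    return sum(i * i for i in range(100_000))


for idx, seconds in enumerate(time_each_run(fake_generate)):
    print(f"run {idx}: {seconds * 1000:.1f} ms")
```

With the compiled model, the first two entries would be expected to be much slower than the later ones, matching the two warm-up steps described above.
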
### Chat Template

The instruction-tuned models use a chat template that must be adhered to for conversational use.
The easiest way to apply it is using the tokenizer's built-in chat template, as shown in the following snippet.

Let's load the model and apply the chat template to a conversation. In this example, we'll start with a single user interaction:

```py
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "google/gemma-2-2b-it"
dtype = torch.bfloat16

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype=dtype,
)

chat = [
    {"role": "user", "content": "Write a hello world program"},
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
```

At this point, the prompt contains the following text:

```
<bos><start_of_turn>user
Write a hello world program<end_of_turn>
<start_of_turn>model
```

As you can see, each turn is preceded by a `<start_of_turn>` delimiter and then the role of the entity
(either `user`, for content supplied by the user, or `model` for LLM responses). Turns finish with
the `<end_of_turn>` token.

You can follow this format to build the prompt manually, if you need to do it without the tokenizer's
chat template.

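
As a sketch of building the prompt manually, the function below assembles the turn format described above from a list of messages. `build_gemma_prompt` is a hypothetical helper, not part of Transformers. Mapping an `assistant` role onto `model` is an assumption added for callers who use OpenAI-style role names.

```python
from typing import Dict, List


def build_gemma_prompt(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Assemble a Gemma-style prompt string from a list of chat messages."""
    parts = ["<bos>"]
    for message in messages:
        # Assumption: "assistant" is treated as Gemma's "model" role.
        role = "model" if message["role"] == "assistant" else message["role"]
        content = message["content"].strip()
        parts.append(f"<start_of_turn>{role}\n{content}<end_of_turn>\n")
    if add_generation_prompt:
        # Open a model turn so generation continues as the model's reply.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


chat = [{"role": "user", "content": "Write a hello world program"}]
print(build_gemma_prompt(chat))
```

Because the string already begins with `<bos>`, it should be tokenized with `add_special_tokens=False` so that the token is not added a second time.
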
After the prompt is ready, generation can be performed like this:

```py
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150)
print(tokenizer.decode(outputs[0]))
```

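
The decoded output contains the prompt as well as the reply, so it can be useful to pull out just the model's final turn. The helper below is a minimal sketch that relies only on the turn delimiters described above. `extract_last_model_turn` is a hypothetical name, and the `decoded` string is a hand-written illustration rather than real model output.

```python
def extract_last_model_turn(decoded: str) -> str:
    """Return the text of the final model turn in a decoded Gemma transcript."""
    marker = "<start_of_turn>model\n"
    start = decoded.rfind(marker)
    if start == -1:
        # No model turn found; fall back to the whole text.
        return decoded.strip()
    reply = decoded[start + len(marker):]
    # Cut at the end-of-turn token if the model emitted one.
    reply = reply.split("<end_of_turn>", 1)[0]
    return reply.strip()


# Hand-written transcript for illustration only.
decoded = (
    "<bos><start_of_turn>user\nWrite a hello world program<end_of_turn>\n"
    "<start_of_turn>model\nprint(\"Hello, world!\")<end_of_turn>"
)
print(extract_last_model_turn(decoded))
```

Slicing the prompt tokens off `outputs[0]` before decoding is another option that avoids string parsing.
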
### Inputs and outputs

* **Input:** Text string, such as a question, a prompt, or a document to be
  summarized.
* **Output:** Generated English-language text in response to the input, such
  as an answer to a question, or a summary of a document.

### Citation

```none
@article{gemma_2024,
    title={Gemma},
    url={https://www.kaggle.com/m/3301},
    DOI={10.34740/KAGGLE/M/3301},
    publisher={Kaggle},
    author={Gemma Team},
    year={2024}
}
```

# Uploaded model

- **Developed by:** EpistemeAI
- **License:** apache-2.0
- **Finetuned from model:** unsloth/gemma-2-2b-it-bnb-4bit

This gemma2 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)