|
--- |
|
license: llama2 |
|
tags: |
|
- llama2 |
|
- text-generation-inference |
|
base_model: meta-llama/Llama-2-7b-chat-hf
|
--- |
|
|
|
# llama-2-7b-chat_q4_quantized_cpp |
|
- This repository contains a 4-bit quantized version of the [Llama-2-7B-chat](https://github.com/facebookresearch/llama) model in ggml format, for use with llama.cpp (C++).
|
- It can be run locally on a CPU via llama.cpp *(instructions are given below)*.
|
- The model has been tested on `Linux (Ubuntu)` with `12 GB RAM` and an `Intel Core i5` processor.
|
- On that setup, throughput is roughly **3 tokens per second**.
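
For reference, here is a minimal sketch of how a 4-bit file like **ggml-model-q4_0.bin** can be produced from the original `meta-llama/Llama-2-7b-chat-hf` weights using llama.cpp's conversion and quantization tools. Exact script names and file formats have changed across llama.cpp versions (newer releases use GGUF), so treat these commands as illustrative rather than the exact recipe used for this model:

```bash
# build llama.cpp (same repository as in the Usage section below)
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp && make

# assumption: the original Hugging Face weights have been placed under models/7B
# convert them to an fp16 ggml file, then quantize it to 4 bits (q4_0)
python3 convert.py models/7B/
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0
```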
|
# Usage: |
|
1. Clone the llama.cpp repository from GitHub:<br>
|
`git clone https://github.com/ggerganov/llama.cpp.git` |
|
2. Enter the **llama.cpp** directory (cloned in step 1) and build it by running **make**:<br>
|
`cd llama.cpp` <br> |
|
`make` |
|
3. Create a directory named **7B** under **llama.cpp/models** and place the model file **ggml-model-q4_0.bin** from this repository inside it *(a download sketch is given after these steps)*:<br>
|
`cd models` <br> |
|
`mkdir 7B` |
|
4. Navigate back to the **llama.cpp** directory and run the following command:<br>
|
`./main -m ./models/7B/ggml-model-q4_0.bin -n 1024 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/alpaca.txt` <br> |
|
> the initial prompt file `prompts/alpaca.txt` can be replaced with any prompt file of your choice
|
5. That's it. Enter the desired prompts and let the results surprise you... |
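
As mentioned in step 3, **ggml-model-q4_0.bin** has to be placed under **llama.cpp/models/7B**. It can be downloaded from the *Files and versions* tab of this repository, or from the command line. Below is a minimal sketch assuming a recent `huggingface_hub` is installed; `<this-repo-id>` is a placeholder for this model's Hugging Face repo id:

```bash
# install the Hugging Face Hub CLI (skip if already installed)
pip install -U huggingface_hub

# run from inside the llama.cpp directory;
# <this-repo-id> is a placeholder for this model card's repo id
huggingface-cli download <this-repo-id> ggml-model-q4_0.bin --local-dir ./models/7B
```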
|
|
|
# Credits: |
|
1. https://github.com/facebookresearch/llama |
|
2. https://github.com/ggerganov/llama.cpp |
|
3. https://medium.com/@karankakwani/build-and-run-llama2-llm-locally-a3b393c1570e |