metadata
license: llama2
tags:
  - llama2
  - text-generation-inference
base_model: meta-llama/Llama-2-7b-chat-hf

llama-2-7b-chat_q4_quantized_cpp

  • This repository contains the 4-bit quantized version of the Llama-2-7B-chat model in GGML format (ggml-model-q4_0.bin), for use with llama.cpp.
  • It can be run on a local CPU-only system through llama.cpp (instructions are given below).
  • The model has been tested on Linux (Ubuntu) with 12 GB RAM and a Core i5 processor.
  • Performance on that setup is roughly 3 tokens per second.
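One way to obtain the quantized model file is to clone this model repository with Git LFS; this is a sketch, and the repository URL below is assumed from this model card's name, so adjust it if needed:

    git lfs install
    git clone https://huggingface.co/amanpreetsingh459/llama-2-7b-chat_q4_quantized_cpp
    # the ggml-model-q4_0.bin file inside the cloned directory is the one used in the steps below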

Usage:

  1. Clone the llama C++ repository from github:
    git clone https://github.com/ggerganov/llama.cpp.git
  2. Enter the llama.cpp repository (downloaded in step 1) and build it by running the make command
    cd llama.cpp
    make
  3. Create a directory named 7B under the directory llama.cpp/models and put the model file ggml-model-q4_0.bin in this newly created 7B directory
    cd models
    mkdir 7B
  4. Navigate back to the llama.cpp directory and run the command below:
    ./main -m ./models/7B/ggml-model-q4_0.bin -n 1024 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/alpaca.txt

    The initial prompt file can be changed from prompts/alpaca.txt to any prompt file of your choice.

  5. That's it. Enter the desired prompts and let the results surprise you...
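
For convenience, the steps above can be combined into a single shell script. This is a minimal sketch that assumes the quantized model file ggml-model-q4_0.bin has already been downloaded into the current working directory:

    #!/usr/bin/env bash
    set -e
    # steps 1-2: clone and build llama.cpp
    git clone https://github.com/ggerganov/llama.cpp.git
    cd llama.cpp
    make
    # step 3: place the quantized model under models/7B
    mkdir -p models/7B
    cp ../ggml-model-q4_0.bin models/7B/
    # step 4: start an interactive chat session with the alpaca prompt template
    ./main -m ./models/7B/ggml-model-q4_0.bin -n 1024 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/alpaca.txt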

Credits:

  1. https://github.com/facebookresearch/llama
  2. https://github.com/ggerganov/llama.cpp
  3. https://medium.com/@karankakwani/build-and-run-llama2-llm-locally-a3b393c1570e