Update README.md
README.md (changed)
@@ -46,7 +46,7 @@ If you want to use the gpu instead:
 CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
 ```
 
-And then use this code to see a response to the prompt.
+And then use this code to see a response to the prompt.
 
 ```python
 from llama_cpp import Llama
@@ -56,7 +56,7 @@ llm = Llama(
     model_path="path/to/model.gguf", # Download the model file first
     n_ctx=2048, # The max sequence length to use - note that longer sequence lengths require much more resources
     n_threads=8, # The number of CPU threads to use, tailor to your system and the resulting performance
-    n_gpu_layers=
+    n_gpu_layers=0 # The number of layers to offload to GPU, if you have GPU acceleration available
 )
 
 # Simple inference example
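With this change, the constructor call in the README snippet is complete and runs as written. For context, a minimal sketch of how the "Simple inference example" that follows would use it, based on llama-cpp-python's documented completion API; the prompt string and sampling parameters here are illustrative placeholders, not part of this commit:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="path/to/model.gguf",  # Download the model file first
    n_ctx=2048,      # The max sequence length to use
    n_threads=8,     # The number of CPU threads to use
    n_gpu_layers=0,  # 0 keeps everything on the CPU; raise it if you built with CUDA
)

# Simple inference example: generate a completion for a prompt.
output = llm(
    "Q: Name the planets in the solar system. A: ",  # Placeholder prompt
    max_tokens=128,     # Stop after at most 128 generated tokens
    stop=["Q:", "\n"],  # Also stop when the model starts a new question
    echo=False,         # Return only the completion, not the prompt
)
print(output["choices"][0]["text"])
```

With the CUDA build installed via the CMAKE_ARGS command above, setting n_gpu_layers to a positive number (or -1 to offload every layer) moves that many transformer layers onto the GPU; the default of 0 shown in the diff keeps inference CPU-only.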