Local configuration with LlamaCpp
#116 · opened by alessandervs
Hello HF community!
I did a local installation of Llama 3, but I am facing a problem.
The code below (1: main.py, 2: QdrantChatDocument) runs Llama-3 with the following configuration:
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
$ nvidia-smi
Wed May 22 17:11:56 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08 Driver Version: 535.161.08 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A30 Off | 00000000:2A:00.0 Off | 0 |
| N/A 37C P0 36W / 165W | 7305MiB / 24576MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A30 Off | 00000000:BD:00.0 Off | 0 |
| N/A 36C P0 31W / 165W | 8797MiB / 24576MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 58504 C ...cc/projetos/dev-ai/.venv/bin/python 7296MiB |
| 1 N/A N/A 58504 C ...cc/projetos/dev-ai/.venv/bin/python 8788MiB |
+---------------------------------------------------------------------------------------+
MODEL_PATH=/data/llm-data/meta/Meta-Llama-3-8B-Instruct
LLM_MODEL=/data/llm-data/meta/Meta-Llama-3-8B-Instruct-GGUF/llama-3-8B-Instruct.gguf
I used the command below to generate "llama-3-8B-Instruct.gguf":
python3 llama.cpp/convert-hf-to-gguf.py Meta-Llama-3-8B-Instruct --outfile Meta-Llama-3-8B-Instruct-GGUF/llama-3-8B-Instruct.gguf --outtype q8_0
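To rule out a problem in the conversion itself, the GGUF can also be loaded directly with llama-cpp-python's chat API. This is only a sanity-check sketch, not part of my setup; the chat format name and parameters are what I believe llama-cpp-python expects:
- ==================== GGUF sanity check (sketch) ======
from llama_cpp import Llama

# Load the converted GGUF and apply the Llama-3 chat template explicitly
# (path matches LLM_MODEL above; prompt is only illustrative)
llm = Llama(
    model_path="/data/llm-data/meta/Meta-Llama-3-8B-Instruct-GGUF/llama-3-8B-Instruct.gguf",
    n_ctx=4096,
    n_gpu_layers=28,
    chat_format="llama-3",
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=64,
    temperature=0.1,
)
print(out["choices"][0]["message"]["content"])
========================================================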
- ==================== main.py =================
import gradio as gr
from modules.qdrant_chat.chat_document import QdrantChatDocument

chat = QdrantChatDocument()

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    msg = gr.Textbox()
    clear = gr.Button("Clear")

    def user(user_message, chat_history):
        # Run the RetrievalQA chain and append the new turn to the history
        result = chat.qa({"query": user_message})
        chat_history.append((user_message, result["result"]))
        return gr.update(value=""), chat_history

    msg.submit(user, [msg, chatbot], [msg, chatbot], queue=False)
    clear.click(lambda: None, None, chatbot, queue=False)

if __name__ == "__main__":
    demo.launch(debug=True)
- ==================== QdrantChatDocument======
from dotenv import load_dotenv
# QDRANT
## Import the client
from qdrant_client import QdrantClient
## Create our collection
from qdrant_client.http.models import Distance, VectorParams
# LANGCHAIN
## Import Qdrant as a vector store
from langchain_community.vectorstores import Qdrant
## Import the sentence-transformer embeddings
from langchain_community.embeddings import SentenceTransformerEmbeddings
## Helper for splitting the text into chunks
from langchain.text_splitter import CharacterTextSplitter
## Module that makes it easier to use vector stores for QA (question answering)
from langchain.chains import RetrievalQA
## Import the LLM
from langchain_community.llms import LlamaCpp
#from langchain.llms import OpenAI
from transformers import AutoTokenizer, AutoModelForCausalLM
import os
import transformers
import torch
load_dotenv()
MODEL_PATH = os.getenv('MODEL_PATH')
LLM_MODEL = os.getenv('LLM_MODEL')
absolute_path = os.path.dirname(os.path.abspath(__file__))
class QdrantChatDocument:
    client = QdrantClient(host="localhost", port=6333)
    client.delete_collection(
        collection_name="pdf_collection"
    )
    client.create_collection(
        collection_name="pdf_collection",
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    )

    embeddings = SentenceTransformerEmbeddings(
        model_name='sentence-transformers/all-MiniLM-L6-v2',
        model_kwargs={'device': 'cuda'},
        encode_kwargs={'normalize_embeddings': False}
    )

    vectorstore = Qdrant(
        client=client,
        collection_name="pdf_collection",
        embeddings=embeddings
    )

    def get_chunks(text):
        text_splitter = CharacterTextSplitter(
            separator="\n",
            chunk_size=2048,
            chunk_overlap=200,
            length_function=len
        )
        chunks = text_splitter.split_text(text)
        return chunks

    with open("docs/teste.txt", "r") as f:
        raw_text = f.read()
    texts = get_chunks(raw_text)
    vectorstore.add_texts(texts)

    qa = RetrievalQA.from_llm(
        llm=LlamaCpp(model_path=LLM_MODEL, n_gpu_layers=28, n_threads=6, n_ctx=4096, n_batch=521, verbose=True, temperature=0.1),
        return_source_documents=True,
        retriever=vectorstore.as_retriever(),
    )
========================================================
But the chat with Gradio is returning heavy hallucinations, as the screenshot shows (the server log is further below).
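One thing I am not sure about: as far as I understand, the LangChain LlamaCpp wrapper sends a plain completion string, so the chat template stored in the GGUF metadata is not applied by RetrievalQA. Below is a minimal, untested sketch of passing an explicit Llama-3 prompt instead of the default one (the prompt wording is only illustrative):
- ==================== llama-3 prompt (sketch) =========
from langchain.prompts import PromptTemplate

# Wrap the QA prompt in the Llama-3 instruct tokens so the raw completion
# call still receives a properly formatted chat turn (sketch, untested).
llama3_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        "Use the following context to answer the question. "
        "If you don't know the answer, just say that you don't know.<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        "Context:\n{context}\n\nQuestion: {question}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    ),
)

qa = RetrievalQA.from_llm(
    llm=LlamaCpp(model_path=LLM_MODEL, n_gpu_layers=28, n_threads=6, n_ctx=4096, verbose=True, temperature=0.1),
    prompt=llama3_prompt,
    return_source_documents=True,
    retriever=vectorstore.as_retriever(),
)
========================================================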
============== server running =========
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00, 1.54it/s]
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /data/llm-data/meta/Meta-Llama-3-8B-Instruct-GGUF/llama-3-8B-Instruct.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv 2: llama.block_count u32 = 32
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.attention.head_count u32 = 32
llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 7
llama_model_loader: - kv 11: llama.vocab_size u32 = 128256
llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q8_0: 226 tensors
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 7.95 GiB (8.50 BPW)
llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_tensors: ggml ctx size = 0.15 MiB
llm_load_tensors: CPU buffer size = 8137.64 MiB
.........................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 521
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 512.00 MiB
llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 296.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1
AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
Model metadata: {'tokenizer.chat_template': "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}", 'tokenizer.ggml.eos_token_id': '128009', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'gpt2', 'general.architecture': 'llama', 'llama.rope.freq_base': '500000.000000', 'tokenizer.ggml.pre': 'llama-bpe', 'llama.context_length': '8192', 'general.name': 'Meta-Llama-3-8B-Instruct', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'tokenizer.ggml.bos_token_id': '128000', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '8', 'general.file_type': '7', 'llama.vocab_size': '128256', 'llama.rope.dimension_count': '128'}
Available chat formats from metadata: chat_template.default
Guessed chat format: llama-3
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
IMPORTANT: You are using gradio version 3.41.2, however version 4.29.0 is available, please upgrade.
--------
/home/hmdcc/projetos/dev-ai/.venv/lib/python3.10/site-packages/langchain_core/_api/deprecation.py:119: LangChainDeprecationWarning: The method `Chain.__call__` was deprecated in langchain 0.1.0 and will be removed in 0.3.0. Use invoke instead.
warn_deprecated(
llama_print_timings: load time = 3569.78 ms
llama_print_timings: sample time = 189.22 ms / 256 runs ( 0.74 ms per token, 1352.89 tokens per second)
llama_print_timings: prompt eval time = 7216.23 ms / 1108 tokens ( 6.51 ms per token, 153.54 tokens per second)
llama_print_timings: eval time = 31871.99 ms / 255 runs ( 124.99 ms per token, 8.00 tokens per second)
llama_print_timings: total time = 40884.20 ms / 1363 tokens