|
--- |
|
language: en |
|
tags: |
|
- llama |
|
- quantization |
|
- text-generation |
|
- cpu |
|
- gpu |
|
- efficient-inference |
|
license: apache-2.0 |
|
base_model: |
|
- meta-llama/Llama-3.1-8B |
|
--- |
|
|
|
# Llama 3.1 8B Q4_K_M GGUF Model |
|
|
|
## Overview |
|
This is a Q4_K_M-quantized version of the Llama 3.1 8B model, optimized for efficient inference on both CPU and GPU. The model was quantized with the llama.cpp toolchain, allowing it to run in resource-constrained environments. Quantization substantially reduces the model's memory footprint while preserving strong language generation capabilities.
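For a rough sense of the savings, the sketch below compares approximate weight storage at FP16 against Q4_K_M, assuming an average of about 4.8 bits per weight for Q4_K_M (K-quants mix precisions across tensors, so the exact file size varies by conversion):

```python
# Back-of-the-envelope weight-memory estimate for an 8B-parameter model.
# The ~4.8 bits/weight average for Q4_K_M is an assumption; actual GGUF
# file sizes differ slightly depending on the conversion.
params = 8e9

fp16_gb = params * 16 / 8 / 1e9   # 16 bits per weight -> ~16 GB
q4km_gb = params * 4.8 / 8 / 1e9  # ~4.8 bits per weight -> ~4.8 GB

print(f"FP16:   ~{fp16_gb:.1f} GB")
print(f"Q4_K_M: ~{q4km_gb:.1f} GB")
```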
|
|
|
The model was originally trained by Meta AI and has been converted to the GGUF format for compatibility with llama.cpp.
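If the GGUF file is hosted on the Hugging Face Hub, it can be fetched with `huggingface_hub` before loading. The `repo_id` and `filename` below are hypothetical placeholders; substitute the actual values shown on this model page:

```python
# Download the GGUF file from the Hub (pip install huggingface_hub).
# Both repo_id and filename are placeholders, not this repo's real values.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="your-username/Llama-3.1-8B-Q4_K_M-GGUF",
    filename="llama-3.1-8b-q4_k_m.gguf",
)
print(model_path)  # local cache path of the downloaded file
```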
|
|
|
## Model Details |
|
- **Base Model**: meta-llama/Llama-3.1-8B |
|
- **Quantization Type**: Q4_K_M (4-bit K-quant, medium variant)
|
- **Model Size**: 8B parameters |
|
- **Format**: GGUF (used for efficient loading in llama.cpp) |
|
- **Intended Use**: Text generation; efficient inference on CPUs and GPUs with a reduced memory footprint
|
|
|
## Intended Use |
|
The model is intended for text generation tasks and is optimized for efficient inference on both CPUs and GPUs, making it well suited to resource-constrained environments.
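Below is a minimal inference sketch using the `llama-cpp-python` bindings (`pip install llama-cpp-python`). The GGUF filename is a hypothetical placeholder, and the sampling parameters are illustrative defaults:

```python
# Minimal text-generation sketch with llama-cpp-python.
# model_path is a placeholder; point it at the downloaded GGUF file.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-3.1-8b-q4_k_m.gguf",
    n_ctx=4096,       # context window; raise if you need longer prompts
    n_gpu_layers=-1,  # offload all layers to GPU; set to 0 for CPU-only runs
)

output = llm(
    "Explain quantization in one paragraph.",
    max_tokens=128,
    temperature=0.7,
)
print(output["choices"][0]["text"])
```

On CPU-only machines, setting `n_gpu_layers=0` keeps all weights in system RAM; a partial offload (for example `n_gpu_layers=20`) is a common middle ground when VRAM is limited.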
|
|
|
## License |
|
This model is licensed under the Apache 2.0 License. |
|
|
|
--- |