
SparseLlama-2-7b-ultrachat_200k-pruned_50.2of4

Model Overview

  • Model Architecture: Llama-2
    • Input: Text
    • Output: Text
  • Model Optimizations:
    • Pruned: 50% 2:4
  • Release Date: 6/28/2024
  • Version: 1.0
  • Model Developers: Neural Magic

Compressed version of Llama-2-7b specialized for text generation. This model was obtained by fine-tuning the Sparse Foundational Model Sparse-Llama-2-7b-pruned_50.2of4 on the ultrachat_200k dataset. It achieves a win rate of 62.1% on the AlpacaEval benchmark (version 1.0) when using Llama-2-70b-chat as the evaluator, whereas the dense Llama-2-7b-ultrachat200k model achieves a 57.6% win rate.

This model was produced as part of Neural Magic's Sparse Foundational Models initiative, and demonstrates the capability of Sparse Foundational Models to transfer to the text-generation domain.

Note: This model uses the chat template from zephyr-7b-beta.
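
The snippet below is a minimal generation sketch, assuming the Hugging Face repository id shown and that the tokenizer ships with the zephyr-7b-beta chat template (so `apply_chat_template` builds the expected prompt); adjust the id, device, and generation settings to your environment.

```python
# Minimal generation sketch (assumptions: repo id below, tokenizer carries the
# zephyr-7b-beta chat template). Not an official usage example.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "neuralmagic/SparseLlama-2-7b-ultrachat_200k-pruned_50.2of4"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize what 2:4 structured sparsity means."},
]

# Build the prompt with the chat template stored in the tokenizer.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

With the zephyr-7b-beta template, the rendered prompt uses `<|system|>`, `<|user|>`, and `<|assistant|>` turn markers.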

Model Optimizations

This model is derived from the Sparse Foundational Model Sparse-Llama-2-7b-pruned_50.2of4, which was obtained by applying the SparseGPT algorithm to prune Llama-2-7b to 50% sparsity with a 2:4 mask. This optimization reduces the number of non-zero parameters by 50%, cutting disk size and FLOPs by roughly the same factor.
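
To make the 2:4 pattern concrete, below is an illustrative PyTorch sketch that prunes a toy matrix with simple magnitude pruning and verifies the 2-out-of-4 constraint. It is not the SparseGPT procedure used to produce this model, and all names in it are hypothetical.

```python
# Illustrative sketch of what a 50% 2:4 sparsity mask means: within every
# contiguous group of 4 weights (along the last dimension), at most 2 are
# nonzero. This is a generic check, not the SparseGPT pruning algorithm.
import torch

def satisfies_2_of_4(weight: torch.Tensor) -> bool:
    """Return True if every group of 4 consecutive weights has <= 2 nonzeros."""
    groups = weight.reshape(-1, 4)                  # group weights in blocks of 4
    nonzeros_per_group = (groups != 0).sum(dim=1)   # count nonzeros per block
    return bool((nonzeros_per_group <= 2).all())

# Toy example: prune a dense matrix to 2:4 by keeping the 2 largest-magnitude
# weights in each group of 4 (simple magnitude pruning, for illustration only).
w = torch.randn(8, 16)
groups = w.reshape(-1, 4)
keep = groups.abs().topk(2, dim=1).indices
mask = torch.zeros_like(groups).scatter_(1, keep, 1.0)
w_sparse = (groups * mask).reshape_as(w)

print(satisfies_2_of_4(w_sparse))  # True
```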

Evaluation

This model was evaluated on the AlpacaEval benchmark (version 1.0), using Llama-2-70b-chat as the evaluator.

Accuracy

| Model | Win rate | Recovery |
| :---- | :------: | :------: |
| Llama-2-7b | 3.7% | -- |
| Llama-2-7b-ultrachat200k | 57.6% | -- |
| SparseLlama-2-7b-ultrachat_200k-pruned_50.2of4 | 62.1% | 108% |
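
Here, recovery is the sparse model's win rate relative to the dense fine-tuned baseline: 62.1% / 57.6% ≈ 1.08, i.e. about 108%.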