---
tags:
- fp8
- vllm
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
pipeline_tag: text-generation
license: llama3.1
base_model:
- nvidia/Llama-3.1-Nemotron-70B-Instruct-HF
---

# Llama-3.1-Nemotron-70B-Instruct-HF-FP8-DYNAMIC

## Model Overview

- **Model Architecture:** Llama-3.1-Nemotron
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Intended Use Cases:** Intended for commercial and research use in multiple languages. Like [Llama-3.1-Nemotron-70B-Instruct-HF](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF), this model is intended for chat between a user and an AI assistant.
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- **Release Date:** 10/31/2024
- **Version:** 1.0
- **License(s):** [llama3.1](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE)
- **Model Developers:** mysticbeing
- **Method used to quantize the weights (quant_method):** compressed-tensors
- **Weights format:** float-quantized
- **Architecture:** LlamaForCausalLM
- **Attention heads:** 64
- **KV heads:** 8
- **Hidden Activation:** [Sigmoid Linear Unit (SiLU)](https://pytorch.org/docs/stable/generated/torch.nn.SiLU.html)
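
The head counts above imply grouped-query attention: 64 query heads share 8 key/value heads, so the KV cache is 8 times smaller than it would be with one KV head per query head. The sketch below estimates the per-token KV-cache footprint. The layer count (80) and head dimension (128) are assumptions taken from the public Llama 3.1 70B configuration rather than from this card, and the 2-byte element size assumes an FP16/BF16 KV cache.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache for one token: one key and one value vector per KV head, per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


NUM_LAYERS = 80       # assumption: Llama 3.1 70B config, not stated in this card
HEAD_DIM = 128        # assumption: hidden size 8192 / 64 attention heads
ATTENTION_HEADS = 64  # from the overview above
KV_HEADS = 8          # from the overview above
BYTES_PER_ELEM = 2    # assumption: FP16/BF16 KV cache

gqa = kv_cache_bytes_per_token(NUM_LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_ELEM)
mha = kv_cache_bytes_per_token(NUM_LAYERS, ATTENTION_HEADS, HEAD_DIM, BYTES_PER_ELEM)

print(f"8 KV heads: {gqa / 1024:.0f} KiB per token")
print(f"Hypothetical 64 KV heads: {mha / 1024:.0f} KiB per token")
print(f"Reduction factor: {mha // gqa}x")
```

Under these assumptions, a 4096-token sequence needs about 1.25 GiB of KV cache, which is worth budgeting alongside the weights when choosing `max_model_len`.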

## Terms of use

By accessing this model, you are agreeing to the Llama 3.1 terms and conditions of the [license](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE), [acceptable use policy](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/USE_POLICY.md), and [Meta’s privacy policy](https://www.facebook.com/privacy/policy/).

## Model Details

## Description:

Quantized version of [Llama-3.1-Nemotron-70B-Instruct-HF](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF) with the updated 8 KV-heads.
It achieves an average score of [TBD] on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 86.79.

### Quantized models are eco-friendly and cost-effective

FP8-quantized models require significantly less storage than traditional 32-bit (FP32) or even 16-bit (FP16) models.
This reduction can be seen in the total file size comparison, where the FP8 model set is nearly half the size of the higher-precision set.
This efficiency enables easier distribution, storage, and access to powerful AI models, even on devices with limited capacity.

Lower hardware requirements mean reduced costs for businesses and public institutions adopting AI solutions. Small businesses, startups, and government entities, which may lack extensive AI budgets, can leverage high-performance, FP8-quantized models to solve problems at roughly half the infrastructure cost.
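
The storage claim can be checked with simple arithmetic. The sketch below assumes a nominal 70 billion parameters (taken from the model name) and ignores quantization scales, any layers left unquantized, and file-format overhead, so real checkpoint sizes will differ somewhat.

```python
NUM_PARAMS = 70e9  # assumption: nominal parameter count from the model name

# Storage per parameter for each numeric format.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "FP8": 1}


def weight_storage_gb(num_params, bytes_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


sizes = {name: weight_storage_gb(NUM_PARAMS, b) for name, b in BYTES_PER_PARAM.items()}
for name, gb in sizes.items():
    print(f"{name}: ~{gb:.0f} GB")
```

The FP8 estimate is half the FP16 estimate and a quarter of the FP32 one, which is consistent with the "nearly half" figure quoted above for the file size comparison.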

<img src="https://cdn-uploads.huggingface.co/production/uploads/6590c65952dc1046ca0f13fe/WBVaZgiCklrdg_cy7qqza.png" alt="drawing" width="600"/>

[Base model description - Llama-3.1-Nemotron-70B-Instruct-HF](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF):

Llama-3.1-Nemotron-70B-Instruct-HF is a large language model customized by NVIDIA to improve the helpfulness of LLM-generated responses to user queries.

The Llama-3.1-Nemotron-70B-Instruct-HF model reaches an [Arena Hard](https://github.com/lmarena/arena-hard-auto) score of 85.0, an [AlpacaEval 2 LC](https://tatsu-lab.github.io/alpaca_eval/) score of 57.6, and a [GPT-4-Turbo MT-Bench](https://github.com/lm-sys/FastChat/pull/3158) score of 8.98, which are known to be predictive of [LMSys Chatbot Arena Elo](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard).

As of 1 Oct 2024, this model is #1 on all three automatic alignment benchmarks (verified tab for AlpacaEval 2 LC), edging out strong frontier models such as GPT-4o and Claude 3.5 Sonnet.

As of Oct 24th, 2024, the model has an Elo score of 1267 (±7), a rank of 9, and a style-controlled rank of 26 on the [ChatBot Arena leaderboard](https://lmarena.ai/?leaderboard).

This model was trained using RLHF (specifically, REINFORCE), [Llama-3.1-Nemotron-70B-Reward](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward), and [HelpSteer2-Preference prompts](https://huggingface.co/datasets/nvidia/HelpSteer2), with a Llama-3.1-70B-Instruct model as the initial policy.

See details at [https://arxiv.org/abs/2410.01257](https://arxiv.org/abs/2410.01257). As a preview, this model can correctly answer the question `How many r in strawberry?` without specialized prompting or additional reasoning tokens:

```
Let's count the "R"s in "Strawberry":

1. S
2. T
3. R
4. A
5. W
6. B
7. E
8. R
9. R
10. Y

There are **3** "R"s in the word "Strawberry".
```

Note: This model is a demonstration of our techniques for improving helpfulness in general-domain instruction following. It has not been tuned for performance in specialized domains such as math.


### Model Description

- **Quantized (FP8-DYNAMIC) from model:** [Llama-3.1-Nemotron-70B-Instruct-HF](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF)
- **Model type:** Transformer
- **License:** [llama3.1](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE)
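
To give a feel for what "FP8-DYNAMIC" can mean in practice, here is a minimal pure-Python sketch of dynamic quantization, in which the scale is computed at run time from the values being quantized rather than calibrated ahead of time. It is an illustration under stated assumptions, not this model's actual kernel: the card does not specify the FP8 variant or the scale granularity, so the E4M3 format, the per-row (per-token) scale, and all function names below are assumptions.

```python
import math

E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 (e4m3fn) format


def round_to_e4m3(x):
    """Round a float to the nearest value on the E4M3 grid, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    _, e = math.frexp(a)     # a = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)     # exponent of the 1.xxx form; -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)  # 3 mantissa bits give 8 steps per power of two
    return sign * min(round(a / step) * step, E4M3_MAX)


def quantize_dynamic(row):
    """Quantize one row of activations with a scale derived from that row (assumed per-token)."""
    amax = max(abs(v) for v in row)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in row], scale


def dequantize(q, scale):
    """Map quantized values back to the original range."""
    return [v * scale for v in q]


activations = [0.013, -1.7, 0.42, 2.0]  # made-up example values
q, scale = quantize_dynamic(activations)
restored = dequantize(q, scale)
for original, approx in zip(activations, restored):
    print(f"{original:+.4f} -> {approx:+.4f}")
```

Because the scale maps the largest magnitude in the row onto the top of the FP8 range, the remaining values keep roughly three bits of mantissa precision relative to their own magnitude.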

## Uses

Primary Intended Uses:

**General-Domain Instruction Following**

- The model is designed for general-purpose instruction following and dialogue tasks
- Optimized specifically for helpfulness in responses
- Focuses on generating coherent, factually correct, and customizable responses

**Research and Development**

- Serves as a demonstration of NVIDIA's techniques for improving model helpfulness
- Can be used by researchers studying instruction-following capabilities
- Provides a benchmark for comparing alignment techniques

- Subject to Llama 3.1 license terms and conditions
- Must adhere to Meta's acceptable use policy and privacy policy
- Maximum input of 128k tokens and output of 4k tokens
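
The token limits above interact with the serving configuration: a request fits only if the prompt plus the requested output stays within the context window. A minimal sketch of that check follows. The function name is hypothetical, and reading "128k" as 131,072 tokens and "4k" as 4,096 tokens is an assumption.

```python
MAX_INPUT_TOKENS = 128 * 1024  # assumption: "128k" read as 131,072 tokens
MAX_OUTPUT_TOKENS = 4 * 1024   # assumption: "4k" read as 4,096 tokens


def check_token_budget(prompt_tokens, max_new_tokens, max_model_len=MAX_INPUT_TOKENS):
    """Return (ok, reason) for a request against the limits stated above.

    max_model_len is the context window configured in the serving engine; it may be
    smaller than the model's maximum to save memory.
    """
    if max_new_tokens > MAX_OUTPUT_TOKENS:
        return False, f"requested output exceeds {MAX_OUTPUT_TOKENS} tokens"
    if prompt_tokens + max_new_tokens > max_model_len:
        return False, f"prompt plus output exceeds the {max_model_len}-token window"
    return True, "ok"


print(check_token_budget(prompt_tokens=3000, max_new_tokens=1024, max_model_len=4096))
print(check_token_budget(prompt_tokens=3500, max_new_tokens=1024, max_model_len=4096))
```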

## How to Get Started with the Model

Use the code below to get started with the model.

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.


```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

MODEL_ID = "mysticbeing/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-DYNAMIC"
N_GPUS = 8  # tensor-parallel degree; set to the number of GPUs available
MAX_MODEL_LEN = 4096  # context window (prompt plus generated tokens) reserved by vLLM
MAX_TOKENS = 1024  # cap on generated tokens per request

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=MAX_TOKENS)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How many r in strawberry?"},
]

# Render the chat messages into a single prompt string using the model's chat template.
prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=MODEL_ID, tensor_parallel_size=N_GPUS, max_model_len=MAX_MODEL_LEN)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

```
Let's count the "R"s in "Strawberry":

1. S
2. T
3. R
4. A
5. W
6. B
7. E
8. R
9. R
10. Y

There are **3** "R"s in the word "Strawberry".
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
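
As a rough illustration of the OpenAI-compatible route, the sketch below builds a chat-completions request using only the Python standard library. It assumes a server was started separately (for example with `vllm serve <model>`) and is listening at `http://localhost:8000`. The host, port, and helper names are assumptions, not part of this card.

```python
import json
import urllib.request

MODEL_ID = "mysticbeing/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-DYNAMIC"
BASE_URL = "http://localhost:8000/v1"  # assumption: local server on the default port


def build_chat_request(question, max_tokens=1024):
    """Build a POST request for the OpenAI-style /chat/completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": question},
        ],
        "temperature": 0.7,
        "top_p": 0.9,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question):
    """Send the request and return the assistant's reply (requires a running server)."""
    with urllib.request.urlopen(build_chat_request(question)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("How many r in strawberry?")
print(request.full_url)
```

Calling `ask("How many r in strawberry?")` sends the request once a server is running. The sampling values mirror the offline example above.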

### Out-of-Scope Use

- Any use not complying with the Llama 3.1 license
- Applications violating Meta's acceptable use policy
- Uses conflicting with Meta's privacy policy

**Critical Safety Applications**

- Applications requiring high reliability or safety guarantees
- Applications where errors could lead to harm or safety issues

**Autonomous Decision Making**

- The model is designed to be helpful in responses, not to make independent decisions
- Applications requiring autonomous action without human oversight

**Real-time Processing Requirements**

- Applications needing ultra-low latency responses

## Reference(s):

* [FP8 Quantization: The Power of the Exponent](https://arxiv.org/abs/2208.09225)
* [Llama-3.1-Nemotron-70B-Instruct-HF](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF)
* [NeMo Aligner](https://arxiv.org/abs/2405.01481)
* [HelpSteer2-Preference](https://arxiv.org/abs/2410.01257)
* [HelpSteer2](https://arxiv.org/abs/2406.08673)
* [Introducing Llama 3.1: Our most capable models to date](https://ai.meta.com/blog/meta-llama-3-1/)
* [Meta's Llama 3.1 Webpage](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1)
* [Meta's Llama 3.1 Model Card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md)

## Model Architecture:

**Architecture Type:** Transformer <br>
**Network Architecture:** Llama 3.1 <br>

## Input:

**Input Type(s):** Text <br>
**Input Format:** String <br>
**Input Parameters:** One Dimensional (1D) <br>
**Other Properties Related to Input:** Max of 128k tokens <br>

## Output:

**Output Type(s):** Text <br>
**Output Format:** String <br>
**Output Parameters:** One Dimensional (1D) <br>
**Other Properties Related to Output:** Max of 4k tokens <br>

## Software

**Supported Operating System(s):** Linux <br>

## Model Version:

v1.0

# Training & Evaluation:

## Alignment methodology

* REINFORCE implemented in NeMo Aligner

# Inference:

**Engine:** [vLLM](https://github.com/vllm-project/vllm) <br>
**Test Hardware:** H100 (NVIDIA Hopper GPU Micro-architecture) <br>


## Citation [optional]

If you find this model useful, please cite the following works

**BibTeX:**