metadata

tags:
  - fp8
  - vllm
language:
  - en
  - de
  - fr
  - it
  - pt
  - hi
  - es
  - th
pipeline_tag: text-generation
license: llama3.1
base_model:
  - nvidia/Llama-3.1-Nemotron-70B-Instruct-HF

Llama-3.1-Nemotron-70B-Instruct-HF-FP8-DYNAMIC

Model Overview

Model Architecture: Llama-3.1-Nemotron
- Input: Text
- Output: Text
Model Optimizations:
- Weight quantization: FP8
- Activation quantization: FP8
Intended Use Cases: Intended for commercial and research use in multiple languages. Like Llama-3.1-Nemotron-70B-Instruct-HF, this model is intended for chat between a user and an AI assistant.
Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
Release Date: 10/31/2024
Version: 1.0
License(s): llama3.1
Model Developers: mysticbeing
Method used to quantize the weights (quant_method): compressed-tensors
Weights format: float-quantized
Architecture: LlamaForCausalLM
Attention heads: 64
KV heads: 8
Hidden activation: Sigmoid Linear Unit (SiLU)
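
The head counts above imply grouped-query attention: 64 query heads share 8 KV heads. As a rough sketch of what that means for KV-cache memory, the snippet below estimates the bytes cached per token. The layer count (80), hidden size (8192), and 2-byte cache entries are assumptions based on the public Llama 3.1 70B configuration, not values stated in this card, and the function name is illustrative.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes needed to cache keys and values for one token across all layers."""
    # The factor 2 accounts for storing both a key and a value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


ATTENTION_HEADS = 64   # from this card
KV_HEADS = 8           # from this card
NUM_LAYERS = 80        # assumption: Llama 3.1 70B configuration
HIDDEN_SIZE = 8192     # assumption: Llama 3.1 70B configuration
HEAD_DIM = HIDDEN_SIZE // ATTENTION_HEADS

# Compare grouped-query attention (8 KV heads) with a hypothetical
# multi-head layout in which every attention head has its own KV head.
gqa = kv_cache_bytes_per_token(NUM_LAYERS, KV_HEADS, HEAD_DIM, bytes_per_value=2)
mha = kv_cache_bytes_per_token(NUM_LAYERS, ATTENTION_HEADS, HEAD_DIM, bytes_per_value=2)

print(f"8 KV heads:  {gqa / 1024:.0f} KiB per token")
print(f"64 KV heads: {mha / 1024:.0f} KiB per token")
print(f"Reduction:   {mha // gqa}x")
```

Under these assumptions the cache is 8 times smaller than it would be with one KV head per attention head. FP8 weight quantization does not by itself change the KV-cache precision, so the 2-byte entry size is kept here.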

Terms of use

By accessing this model, you are agreeing to the Llama 3.1 terms and conditions of the license, acceptable use policy, and Meta's privacy policy.

Model Details

Description:

Quantized version of Llama-3.1-Nemotron-70B-Instruct-HF with the updated 8 KV-heads. It achieves an average score of [TBD] on the OpenLLM benchmark (version 1), whereas the unquantized model achieves 86.79.

Quantized models are eco-friendly and cost-effective

FP8 quantized models require significantly less storage compared to traditional 32-bit (FP32) or even 16-bit (FP16) models. This reduction can be seen in the total file size comparison, where the FP8 model set is nearly half the size of the higher-precision set. This efficiency enables easier distribution, storage, and access to powerful AI models, even on devices with limited capacity.
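
As a back-of-the-envelope check on the storage claim, the sketch below estimates weight storage at different precisions. It assumes a nominal 70 billion parameters (the exact count differs slightly) and ignores quantization scales and any layers kept in higher precision, so the figures are approximations rather than measured file sizes.

```python
def weight_storage_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 70e9  # assumption: nominal "70B" parameter count

for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("FP8", 8)]:
    print(f"{name:>9}: ~{weight_storage_gb(NUM_PARAMS, bits):.0f} GB")
```

The FP8 estimate is half the 16-bit estimate and a quarter of the 32-bit one, which matches the "nearly half the size" comparison above.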

Lower hardware requirements mean reduced costs for businesses and public institutions adopting AI solutions. Small businesses, startups, and government entities, which may lack extensive AI budgets, can run high-performance FP8 quantized models with roughly half the weight memory of a 16-bit deployment, which lowers infrastructure cost.

Base model description - Llama-3.1-Nemotron-70B-Instruct-HF:

Llama-3.1-Nemotron-70B-Instruct-HF is a large language model customized by NVIDIA to improve the helpfulness of LLM generated responses to user queries.

The Llama-3.1-Nemotron-70B-Instruct-HF model reaches an Arena Hard score of 85.0, an AlpacaEval 2 LC score of 57.6, and a GPT-4-Turbo MT-Bench score of 8.98, which are known to be predictive of LMSys Chatbot Arena Elo.

As of 1 Oct 2024, this model is #1 on all three automatic alignment benchmarks (verified tab for AlpacaEval 2 LC), edging out strong frontier models such as GPT-4o and Claude 3.5 Sonnet.

As of 24 Oct 2024, the model has an Elo score of 1267 (±7), a rank of 9, and a style-controlled rank of 26 on the Chatbot Arena leaderboard.

This model was trained using RLHF (specifically, REINFORCE), Llama-3.1-Nemotron-70B-Reward and HelpSteer2-Preference prompts on a Llama-3.1-70B-Instruct model as the initial policy.

See details at https://arxiv.org/abs/2410.01257. As a preview, this model can correctly answer the question "How many r in strawberry?" without specialized prompting or additional reasoning tokens:

Let's count the "R"s in "Strawberry":

1. S
2. T
3. R
4. A
5. W
6. B
7. E
8. R
9. R
10. Y

There are **3** "R"s in the word "Strawberry".

Note: This model is a demonstration of our techniques for improving helpfulness in general-domain instruction following. It has not been tuned for performance in specialized domains such as math.

Model Description

Quantized (FP8-DYNAMIC) from model: Llama-3.1-Nemotron-70B-Instruct-HF
Model type: Transformer
License: llama3.1
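
To illustrate what "dynamic" quantization usually refers to, here is a minimal pure-Python sketch of per-token scale computation for FP8 (E4M3) activations. It assumes the common scheme in which activation scales are computed at inference time from each token's maximum absolute value; the card does not spell out the exact recipe used, and the function names are hypothetical. The sketch only scales and clamps values into the E4M3 range and does not simulate FP8 mantissa rounding.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude in the FP8 E4M3 format


def per_token_scales(activations):
    """Return one scale per token (row), computed from that row's max magnitude."""
    scales = []
    for row in activations:
        amax = max(abs(v) for v in row)
        # Guard against all-zero rows so the scale is never zero.
        scales.append(max(amax, 1e-12) / FP8_E4M3_MAX)
    return scales


def to_fp8_range(activations, scales):
    """Divide each row by its scale and clamp to the E4M3 range (no rounding)."""
    scaled = []
    for row, scale in zip(activations, scales):
        scaled.append(
            [min(max(v / scale, -FP8_E4M3_MAX), FP8_E4M3_MAX) for v in row]
        )
    return scaled


# Two "tokens" with very different magnitudes each get their own scale.
activations = [
    [1.0, -2.0, 0.5],
    [100.0, 50.0, -896.0],
]
scales = per_token_scales(activations)
scaled = to_fp8_range(activations, scales)

for scale, row in zip(scales, scaled):
    print(f"scale={scale:.6f} scaled={row}")
```

Because each row is rescaled independently, a token with small activations keeps its resolution even when another token in the batch contains large outliers.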

Uses

Primary Intended Uses:

General-Domain Instruction Following

- The model is designed for general-purpose instruction following and dialogue tasks
- Optimized specifically for helpfulness in responses
- Focuses on generating coherent, factually correct, and customizable responses

Research and Development

- Serves as a demonstration of NVIDIA's techniques for improving model helpfulness
- Can be used by researchers studying instruction-following capabilities
- Provides a benchmark for comparing alignment techniques

- Subject to Llama 3.1 license terms and conditions
- Must adhere to Meta's acceptable use policy and privacy policy
- Maximum input of 128k tokens and output of 4k tokens

How to Get Started with the Model

Use the code below to get started with the model.

Use with vLLM

This model can be deployed efficiently using the vLLM backend, as shown in the example below.

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

MODEL_ID = "mysticbeing/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-DYNAMIC"
N_GPUS = 8            # GPUs used for tensor parallelism; adjust to your hardware
MAX_MODEL_LEN = 4096  # context window (prompt + generated tokens) for this run
MAX_TOKENS = 1024     # upper bound on generated tokens

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=MAX_TOKENS)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How many r in strawberry?"},
]

# Render the chat messages into a single prompt string using the model's chat template.
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# Load the model, sharded across N_GPUS.
llm = LLM(model=MODEL_ID, tensor_parallel_size=N_GPUS, max_model_len=MAX_MODEL_LEN)

# generate() returns one result per prompt; take the first completion of the first result.
outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

Example output (sampling is stochastic, so results may vary):

Let's count the "R"s in "Strawberry":

1. S
2. T
3. R
4. A
5. W
6. B
7. E
8. R
9. R
10. Y

There are **3** "R"s in the word "Strawberry".

vLLM also supports OpenAI-compatible serving. See the documentation for more details.
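
As a hedged sketch of the OpenAI-compatible route, the snippet below builds a chat completion request and shows how it could be sent using only the Python standard library. It assumes a server was started separately (for example with `vllm serve mysticbeing/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-DYNAMIC --tensor-parallel-size 8 --max-model-len 4096`) and is listening on `http://localhost:8000`; the host, port, and helper function names are assumptions for illustration.

```python
import json
import urllib.request

MODEL_ID = "mysticbeing/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-DYNAMIC"
BASE_URL = "http://localhost:8000/v1"  # assumption: local server on the default port


def build_chat_request(user_message, max_tokens=1024):
    """Build a request body for the /chat/completions endpoint."""
    return {
        "model": MODEL_ID,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
        "temperature": 0.7,
        "top_p": 0.9,
        "max_tokens": max_tokens,
    }


def send_chat_request(payload, base_url=BASE_URL):
    """POST the payload to a running server and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_chat_request("How many r in strawberry?")
print(json.dumps(payload, indent=2))

# With a server running, the request can be sent like this:
# print(send_chat_request(payload))
```

The sampling settings mirror the offline example above. Any OpenAI-compatible client library can be pointed at the same base URL instead of using `urllib`.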

Out-of-Scope Use

- Any use not complying with the Llama 3.1 license
- Applications violating Meta's acceptable use policy
- Uses conflicting with Meta's privacy policy

Critical Safety Applications
- Applications requiring high reliability or safety guarantees
- Applications where errors could lead to harm or safety issues

Autonomous Decision Making
- The model is designed to be helpful in responses, not to make independent decisions
- Applications requiring autonomous action without human oversight

Real-time Processing Requirements
- Applications needing ultra-low latency responses


Technical Specifications

Model Architecture and Objective

Reference(s):

Model Architecture:

Architecture Type: Transformer
Network Architecture: Llama 3.1

Input:

Input Type(s): Text
Input Format: String
Input Parameters: One Dimensional (1D)
Other Properties Related to Input: Max of 128k tokens

Output:

Output Type(s): Text
Output Format: String
Output Parameters: One Dimensional (1D)
Other Properties Related to Output: Max of 4k tokens
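
The input and output limits above interact with whatever context window the serving engine is configured with. The following sketch shows one way to pick a safe generation budget. It reads "128k" as 128,000 and "4k" as 4,096 tokens, which is an assumption, and `output_budget` is a hypothetical helper, not part of any library.

```python
MAX_INPUT_TOKENS = 128_000  # "128k" from the card, read as 128,000 (assumption)
MAX_OUTPUT_TOKENS = 4_096   # "4k" from the card, read as 4,096 (assumption)


def output_budget(prompt_tokens, requested_tokens, max_model_len):
    """Return how many tokens can be generated without exceeding any limit."""
    if prompt_tokens > MAX_INPUT_TOKENS:
        raise ValueError("prompt exceeds the model's maximum input length")
    # The engine's context window must hold the prompt plus the generated tokens.
    remaining = max_model_len - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt does not fit in the configured context window")
    return min(requested_tokens, MAX_OUTPUT_TOKENS, remaining)


# A 3,500-token prompt in a 4,096-token window leaves room for 596 new tokens.
print(output_budget(prompt_tokens=3500, requested_tokens=1024, max_model_len=4096))
```

With a small window such as 4,096 tokens, the configured context length is usually the binding constraint rather than the model's own limits.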

Software

Supported Operating System(s): Linux

Model Version:

v1.0

Training & Evaluation:

Alignment methodology

REINFORCE implemented in NeMo Aligner

Inference:

Engine: vLLM
Test Hardware: H100 (NVIDIA Hopper GPU Micro-architecture)

Citation

If you find this model useful, please cite the following works:

BibTeX: