Model Overview
Description:
Llama-3.3-Nemotron-70B-Edit is a large language model that uses Meta-Llama-3.3-70B-Instruct as its foundation and is fine-tuned with Supervised Fine-Tuning and Reinforcement Learning to improve the helpfulness of LLM-generated responses to user queries by editing those responses based on feedback.
This model is ready for commercial use.
License/Terms of Use:
GOVERNING TERMS: Use of this model is governed by the NVIDIA Open Model License. Additional Information: Llama 3.3 Community License Agreement. Built with Llama.
Arena Hard Leaderboard
As of 18 Mar 2025, augmenting models with the Feedback-Edit Inference Time Scaling (ITS) approach leads to the highest performance on Arena Hard.
The Feedback-Edit Inference Time Scaling system comprises the following models:
| Model | Arena Hard (95% CI) |
|---|---|
| Llama-3.3-Nemotron-Super-49B-v1 + Feedback-Edit ITS | 93.4 (-1.1, 1.0) |
| Llama-3.1-Nemotron-70B-Instruct + Feedback-Edit ITS | 92.7 (-1.2, 0.9) |
| o1-mini-2024-09-12 | 92.0 (-1.2, 1.0) |
| o1-preview-2024-09-12 | 90.4 (-1.1, 1.3) |
| Llama-3.3-Nemotron-Super-49B-v1 | 88.3 (-1.6, 1.6) |
| claude-3-5-sonnet-20241022 | 85.2 (-1.4, 1.6) |
| Llama-3.1-Nemotron-70B-Instruct | 84.9 (-1.7, 1.8) |
Use Case:
Llama-3.3-Nemotron-70B-Edit edits responses based on provided feedback. It is intended for users who want to improve performance on general-domain, open-ended tasks through Inference-Time Scaling.
Release Date:
03/18/2025
Reference(s):
- Dedicated Feedback and Edit Models Empower Inference-Time Scaling for Open-Ended General-Domain Tasks
- HelpSteer2-Preference
- SteerLM method
- HelpSteer
- HelpSteer2
- The future of AI: Built with Llama
- Meta's Llama 3.3 Webpage
- Meta's Llama 3.3 Model Card
Model Architecture:
Architecture Type: Transformer
Network Architecture: Llama 3.3
We developed this model using Llama-3.3-70B-Instruct as its foundation. This model contains 70 billion parameters.
Input:
Input Type(s): Text
Input Format: String
Input Parameters: One Dimensional (1D)
Other Properties Related to Input: Max of 128k tokens
Output:
Output Type(s): Text
Output Format: String
Output Parameters: One Dimensional (1D)
Other Properties Related to Output: Max of 4k tokens
Software Integration:
Runtime Engine(s):
- [NeMo - 24.05.llama.3.1]
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Hopper
- NVIDIA Turing
Supported Operating System(s): Linux
Quick Start
You can use the model with the HuggingFace Transformers library on 2 or more 80GB GPUs (NVIDIA Ampere or newer), with at least 150GB of free disk space to accommodate the download.
This code has been tested on Transformers v4.45.0, torch v2.3.0a0+40ec155e58.nv24.3, and 2 A100 80GB GPUs, but any setup that supports meta-llama/Llama-3.3-70B-Instruct should support this model as well. If you run into problems, consider running `pip install -U transformers`.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "nvidia/Llama-3.3-Nemotron-70B-Edit"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

def generate_edit(messages, model, tokenizer):
    # Apply the chat template, generate up to the model's 4k-token output limit,
    # and decode only the newly generated tokens.
    tokenized_message = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt", return_dict=True)
    response_token_ids = model.generate(tokenized_message['input_ids'].cuda(), attention_mask=tokenized_message['attention_mask'].cuda(), max_new_tokens=4096, pad_token_id=tokenizer.eos_token_id)
    generated_tokens = response_token_ids[:, len(tokenized_message['input_ids'][0]):]
    generated_text = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
    return generated_text
prompt = "What is the distance between the Earth and the Sun?"
response = "The distance from Earth to the Sun is 93 million miles"
feedback = ["The response is partially helpful. It provides a concise answer to the prompt. However, the lack of additional information or context limits its usefulness. It could have been more informative by including the average distance in astronomical units (AU) and explaining the variation in distance due to the elliptical orbit."]
# Join multiple feedback items with blank lines; use "<None>" if there is none.
linearized_feedback = "\n\n".join(feedback) if feedback else "<None>"
formatted_feedback = 'Edit the response to the previous prompt based on the following feedback:\n\n' + linearized_feedback

# The model expects a user turn, the assistant response to edit, and a final
# user turn that wraps the feedback in the edit instruction above.
messages = [
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": response},
    {"role": "user", "content": formatted_feedback}
]

edited_response = generate_edit(messages, model, tokenizer)
print(edited_response)
## Illustrative Example
# The average distance from the Earth to the Sun is approximately 93 million miles or 149.6 million kilometers. This distance is also measured in astronomical units (AU), with 1 AU being the equivalent of 93 million miles. It's important to note that this distance is not constant due to the elliptical shape of Earth's orbit around the Sun. At their closest point (perihelion), the distance is about 91.5 million miles, and at their farthest point (aphelion), it is roughly 94.5 million miles. This variation in distance results in slight changes in the amount of solar energy the Earth receives throughout the year.
```
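The same helper can be applied for further editing rounds. Below is a minimal sketch of a second round, assuming fresh feedback on the edited response is available; the `second_feedback` string is hypothetical, and this loop structure is illustrative rather than the exact inference-time scaling recipe from the paper.

```python
# Hypothetical second editing round: the edited response becomes the new
# assistant turn, and freshly collected feedback drives another edit.
second_feedback = "The response is mostly helpful. However, it could briefly define the astronomical unit (AU)."  # hypothetical feedback
messages = [
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": edited_response},
    {"role": "user", "content": 'Edit the response to the previous prompt based on the following feedback:\n\n' + second_feedback},
]
print(generate_edit(messages, model, tokenizer))
```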
Feedback Selection
This model has been trained to edit responses using no more than 3 pieces of feedback. Therefore, when more than three pieces of feedback are available, they can be reranked so that, after filtering away feedback that finds the response perfectly helpful, only the top 3 with the most constructive-criticism keywords are used, as in the sketch below.
```python
import json

feedback = [
"The response is partially helpful. The response is completely accurate, however, it is too short and does not provide any context to the answer. The response should have given the distance in kilometers as well and provided some more information about the significance of the distance and how it was calculated.",
"The response is partially helpful. The response is in line with the prompt request and provides a clear and concise answer. However, it lacks additional context or details, such as the average distance or the variation due to the elliptical orbit. The response does not address the potential for a more precise or scientific answer.",
"The response is partially helpful. The response does provide a correct answer to the prompt, but it is not as specific as it could be. The response should have clarified that the distance between the Earth and the Sun varies throughout the year. The response also could have included the average distance and the closest and farthest distances.",
"The response is partially helpful. It is very concise and to the point. It provides the correct answer to the user's question. However, it could have been a little more helpful by providing a more precise answer. The distance varies slightly due to the elliptical shape of the Earth's orbit.",
"The response is mostly helpful. The response correctly states the distance between the Earth and the Sun. The response should have clarified that the distance varies because the orbit is elliptical. The response should have also provided the metric unit of measurement. The response is very succinct and could have added more details.",
"The response is mostly helpful. The response is completely correct in stating that the distance from the Earth to the Sun is 93 million miles. The response could be improved by stating that the distance varies due to elliptical orbit, and the average distance is 93 million miles. The response could also be improved by providing the distance in kilometers.",
"The response is partially helpful. The model does a good job at understanding the prompt and providing a concise answer to the prompt. However, the response lacks a proper unit of measurement. The response is missing an'm' at the end of 93 million miles. This is not a huge issue but does impact the overall quality of the response.",
"The response is partially helpful. The model should have clarified that the distance between the Earth and the Sun varies throughout the year. The average distance is 93 million miles, but at its closest point, the distance is about 91.5 million miles and at its farthest point, the distance is about 94.5 million miles.",
"The response is mostly helpful. The model answers the prompt accurately. The model provides the distance between the Earth and the Sun in miles. The response is clear and concise. The response is not verbose. The model does not provide any irrelevant information. The model does not provide the distance in kilometers.",
"The response is partially helpful. It provides a direct answer to the question about the distance between Earth and the Sun. However, it lacks the depth and context that would make it more informative. The response could benefit from including additional details, such as the variation in distance due to Earth's elliptical orbit or the average distance measurement.",
"The response is mostly helpful. It provides a factually accurate answer to the question posed in the prompt. However, the response does not provide any context for this answer, nor does it give the answer in metric units, which would be more common in scientific contexts. The response also does not mention that the distance varies depending on the time of year.",
"The response is partially helpful. It provides the correct distance in miles, but it could be improved by including the distance in kilometers and the average distance, as this distance is not constant. It could also be improved by providing more context and explaining the variations that can occur in this distance.",
"The response is mostly helpful. The response provides an answer to the question asked in the prompt. The response is correct and accurate as of the current knowledge. The response is also concise and to the point. However, the response could have provided more information regarding the average distance and how it varies throughout the year.",
"The response is partially helpful. It gives the distance between Earth and the Sun in miles, but it would have been better if it included both miles and kilometers for a more complete answer. The response is also too short and could have been improved by adding more information about the average distance or explaining what an AU is.",
"The response is partially helpful. The response provides a direct answer to the prompt, but it could be improved by including more context or details. For example, explaining that the distance varies due to the elliptical orbit of Earth or mentioning the average distance in astronomical units (AU) would enhance its helpfulness.",
"The response is partially helpful. The response gives a straight-forward and succinct answer to the prompt. The response is correct in stating that the distance from the Earth to the Sun is 93 million miles. However, the response should have mentioned that this is an average distance, as the distance varies throughout the year due to the elliptical orbit of the Earth around the Sun. The response also does not mention that this distance is also known as an Astronomical Unit (AU) which is the standard unit of measurement used for interplanetary distances within the Solar System."
]

# Keywords signalling actionable criticism; feedback mentioning more of them
# ranks higher.
CONSTRUCTIVE_CRITICISM_KEYWORDS = ["However", "improve", "lack", "benefit", "but"]

def calculate_n_keywords(one_feedback):
    # str.split(keyword) returns (occurrences + 1) parts, so start at
    # -len(keywords) to count total keyword occurrences.
    n = -len(CONSTRUCTIVE_CRITICISM_KEYWORDS)
    for keyword in CONSTRUCTIVE_CRITICISM_KEYWORDS:
        n += len(one_feedback.split(keyword))
    return n

def filter_feedback(feedback):
    # Rank feedback by keyword count (descending), skip feedback that finds
    # the response perfectly helpful, and keep at most 3 items.
    filtered_feedback = []
    feedback.sort(key=lambda one_feedback: calculate_n_keywords(one_feedback), reverse=True)
    for one_feedback in feedback:
        if len(filtered_feedback) > 2:
            break
        elif one_feedback.startswith('The response is perfectly'):
            continue
        else:
            filtered_feedback.append(one_feedback)
    return filtered_feedback

feedback = filter_feedback(feedback)
print(json.dumps(feedback, indent=4))
## Illustrative example
#[
# "The response is partially helpful. The model does a good job at understanding the prompt and providing a concise answer to the prompt. However, the response lacks a proper unit of measurement. The response is missing an'm' at the end of 93 million miles. This is not a huge issue but does impact the overall quality of the response.",
# "The response is partially helpful. It provides a direct answer to the question about the distance between Earth and the Sun. However, it lacks the depth and context that would make it more informative. The response could benefit from including additional details, such as the variation in distance due to Earth's elliptical orbit or the average distance measurement.",
# "The response is partially helpful. It provides the correct distance in miles, but it could be improved by including the distance in kilometers and the average distance, as this distance is not constant. It could also be improved by providing more context and explaining the variations that can occur in this distance."
#]
```
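Putting the two sections together, here is a minimal sketch that feeds the filtered feedback into the edit model; it reuses `generate_edit`, `model`, `tokenizer`, `prompt`, and `response` from the Quick Start above.

```python
# End-to-end sketch: the top-ranked feedback (at most 3 items) is linearized
# and passed to the edit model in the same message format as the Quick Start.
linearized_feedback = "\n\n".join(feedback) if feedback else "<None>"
messages = [
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": response},
    {"role": "user", "content": 'Edit the response to the previous prompt based on the following feedback:\n\n' + linearized_feedback},
]
print(generate_edit(messages, model, tokenizer))
```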
Model Version:
v1.0
Training and Testing Datasets:
Training Datasets:
Dataset Name: HelpSteer3
Dataset Link: https://huggingface.co/datasets/nvidia/HelpSteer3
Data Collection Method by dataset
- [Hybrid: Human, Synthetic]
Labeling Method by dataset
- [Human]
Properties:
- 13,740 prompt-response pairs, each annotated with up to 3 instances of free-text feedback (each 50-250 words long) elaborating on the overall helpfulness of the response, as well as an edited response based on the feedback.
- 3,111 prompt-response pairs, each annotated with up to 3 instances of free-text feedback (each 50-250 words long), as well as a well-edited response that follows the feedback and a poorly-edited response that contains changes beyond what the provided feedback calls for.
Testing Datasets:
Dataset Name: HelpSteer3
Dataset Link: https://huggingface.co/datasets/nvidia/HelpSteer3
Data Collection Method by dataset
- [Hybrid: Human, Synthetic]
Labeling Method by dataset
- [Human]
Properties:
- 721 prompt-response pairs, each annotated with up to 3 instances of free-text feedback (each 50-250 words long) elaborating on the overall helpfulness of the response, as well as an edited response based on the feedback.
- 163 prompt-response pairs, each annotated with up to 3 instances of free-text feedback (each 50-250 words long), as well as a well-edited response that follows the feedback and a poorly-edited response that contains changes beyond what the provided feedback calls for.
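For reference, the dataset can be inspected with the Hugging Face `datasets` library. A minimal sketch follows; the "edit" configuration name is an assumption based on the dataset card, so adjust it to whatever subsets the hub actually exposes.

```python
from datasets import load_dataset

# The "edit" configuration name is an assumption; see the HelpSteer3 dataset
# card for the exact subset names.
ds = load_dataset("nvidia/HelpSteer3", "edit")
print(ds)  # prints the available splits and their features
```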
Inference:
Engine: Triton
Test Hardware: H100, A100 80GB, A100 40GB
Limitations:
The model was trained on data that contains toxic language, unsafe content, and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses, especially when prompted with toxic prompts. It may also generate answers that are inaccurate, omit key information, or include irrelevant or redundant text, producing socially unacceptable or undesirable output even when the prompt itself contains nothing explicitly offensive.
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.
Please report security vulnerabilities or NVIDIA AI Concerns here.
Citation
If you find this model useful, please cite the following work:
```bibtex
@misc{wang2025dedicatedfeedbackeditmodels,
      title={Dedicated Feedback and Edit Models Empower Inference-Time Scaling for Open-Ended General-Domain Tasks},
      author={Zhilin Wang and Jiaqi Zeng and Olivier Delalleau and Daniel Egert and Ellie Evans and Hoo-Chang Shin and Felipe Soares and Yi Dong and Oleksii Kuchaiev},
      year={2025},
      eprint={2503.04378},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.04378},
}
```