This is an FP8-Dynamic quantization of Llama-3.3-70B-Instruct.

Meta Llama 3.3 is a multilingual large language model (LLM) with 70 billion parameters, pretrained and instruction-tuned for generative text tasks. It is optimized for multilingual dialogue and supports English plus seven additional languages: French, German, Hindi, Italian, Portuguese, Spanish, and Thai. According to Meta, it outperforms many available open-source and closed chat models on common industry benchmarks. This makes Llama 3.3 a strong foundation for building multilingual AI applications.

Evaluations

The quantized model recovers 99.67% of the accuracy of the unquantized Llama-3.3-70B-Instruct on the benchmarks below.

English	Llama-3.3-70B-Instruct	Llama-3.3-70B-Instruct-FP8-Dynamic (this)
Avg.	74.1	73.75
Arc	71.7	71.6
Hellaswag	76.5	75.9

French	Llama-3.3-70B-Instruct	Llama-3.3-70B-Instruct-FP8-Dynamic (this)
Avg.	73.07	72.87
Arc	64.7	64.5
Hellaswag	76.6	76.6
MMLU	77.9	77.5

German	Llama-3.3-70B-Instruct	Llama-3.3-70B-Instruct-FP8-Dynamic (this)
Avg.	70.07	69.83
Arc	61.8	61.2
Hellaswag	71.2	71.1
MMLU	77.2	77.2

Italian	Llama-3.3-70B-Instruct	Llama-3.3-70B-Instruct-FP8-Dynamic (this)
Avg.	73.67	73.37
Arc	66.5	65.7
Hellaswag	76.0	76.2
MMLU	78.5	78.2

Portuguese	Llama-3.3-70B-Instruct	Llama-3.3-70B-Instruct-FP8-Dynamic (this)
Avg.	74.4	73.87
Arc	66.4	65.5
Hellaswag	77.2	76.9
MMLU	79.6	79.2

Spanish	Llama-3.3-70B-Instruct	Llama-3.3-70B-Instruct-FP8-Dynamic (this)
Avg.	74	74.13
Arc	65.8	65.8
Hellaswag	77.1	77.2
MMLU	79.1	79.4

We did not check for data contamination. Evaluation was done using the LM Evaluation Harness with limit=1000, so each score is based on at most 1000 examples per task.
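The accuracy recovery figure can be approximated from the per-language averages in the tables above. The sketch below assumes recovery is the quantized score divided by the baseline score, averaged over the six evaluated languages; the exact aggregation used for the reported number is not stated, so this is an illustration and not the authors' script. Under that assumption the result is roughly 99.66%, close to the reported 99.67%.

```python
# Per-language "Avg." rows copied from the tables above:
# (baseline Llama-3.3-70B-Instruct, FP8-Dynamic quantization)
AVERAGES = {
    "English": (74.1, 73.75),
    "French": (73.07, 72.87),
    "German": (70.07, 69.83),
    "Italian": (73.67, 73.37),
    "Portuguese": (74.4, 73.87),
    "Spanish": (74.0, 74.13),
}


def recovery_percent(baseline, quantized):
    """Share of the baseline score retained by the quantized model, in percent."""
    return 100.0 * quantized / baseline


def mean_recovery(averages):
    """Assumed aggregation: unweighted mean of the per-language recovery ratios."""
    ratios = [recovery_percent(b, q) for b, q in averages.values()]
    return sum(ratios) / len(ratios)


for language, (baseline, quantized) in AVERAGES.items():
    print(f"{language}: {recovery_percent(baseline, quantized):.2f}%")
print(f"Mean recovery: {mean_recovery(AVERAGES):.2f}%")
```

Spanish comes out slightly above 100% because the quantized model scored marginally higher there; with limit=1000, differences of this size are within the noise one would expect.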

Usage

Install vLLM and run the server:

python -m vllm.entrypoints.openai.api_server --model cortecs/Llama-3.3-70B-Instruct-FP8-Dynamic --max-model-len 9000 --gpu-memory-utilization 0.95

Access the model:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "cortecs/Llama-3.3-70B-Instruct-FP8-Dynamic",
        "prompt": "San Francisco is a"
    }'
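The same request can be sent from Python. This is a minimal sketch using only the standard library, assuming the server from the previous step is listening on localhost:8000. The helper names `build_completion_request` and `complete` are hypothetical, and `max_tokens=64` is an illustrative choice that is not part of the original curl call.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # address used in the curl example
MODEL = "cortecs/Llama-3.3-70B-Instruct-FP8-Dynamic"


def build_completion_request(prompt, max_tokens=64, base_url=BASE_URL):
    """Build the POST request for the OpenAI-compatible completions endpoint."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,  # assumption: not set in the curl example
    }
    return urllib.request.Request(
        url=f"{base_url}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, max_tokens=64, base_url=BASE_URL, timeout=60):
    """Send the request and return the text of the first completion choice."""
    request = build_completion_request(prompt, max_tokens, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, `print(complete("San Francisco is a"))` prints the generated continuation. Building the request separately from sending it keeps the payload easy to inspect without a live server.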

⚡ This model is optimized for heavy workloads, providing a total throughput of 1485 tokens per second on a single NVIDIA H100 ⚡
