---
title: Edge LLM Leaderboard
emoji: π
colorFrom: red
colorTo: blue
sdk: gradio
sdk_version: 5.8.0
app_file: app.py
pinned: true
license: apache-2.0
tags: [edge llm leaderboard, llm edge leaderboard, llm, edge, leaderboard]
---
# Edge LLM Leaderboard
## About
The Edge LLM Leaderboard gauges the practical performance and quality of edge LLMs.
Its aim is to benchmark the performance (throughput and memory)
of Large Language Models (LLMs) on edge hardware, starting with a Raspberry Pi 5 (8GB) based on the ARM Cortex-A76 CPU.
Anyone from the community can request a new base model or edge hardware/backend/optimization
configuration for automated benchmarking:
- Model evaluation requests will go live soon; in the meantime, feel free to email arnav[dot]chavan[@]nyunai[dot]com
## Details
- To avoid multi-threading discrepancies, all 4 threads of the Pi 5 are used.
- LLMs are run with a batch size of 1, a 512-token prompt, and 128 generated tokens.
All of our throughput benchmarks are run with a single tool,
[llama-bench](https://github.com/ggerganov/llama.cpp/tree/master/examples/llama-bench),
built on [llama.cpp](https://github.com/ggerganov/llama.cpp), to guarantee reproducibility and consistency.
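For reference, a run under these settings might look like the sketch below; the model path is illustrative, so point it at whichever GGUF file you are benchmarking.
```bash
# Illustrative example: 512-token prompt, 128 generated tokens, all 4 threads
./llama-bench -m models/example-q4_0.gguf -p 512 -n 128 -t 4
```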
## Ranking Models
We use MMLU (zero-shot) via [llama-perplexity](https://github.com/ggerganov/llama.cpp/tree/master/examples/perplexity) to evaluate model quality (see the example command after this list), and rank models on key metrics relevant to edge applications:
1. **Prefill Latency (Time to First Token - TTFT):** Measures the time to generate the first token. Low TTFT ensures a smooth user experience, especially for real-time interactions in edge use cases.
2. **Decode Latency (Generation Speed):** Indicates the speed of generating subsequent tokens, critical for real-time tasks like transcription or extended dialogue sessions.
3. **Model Size:** Edge devices have far more limited secondary storage than cloud or GPU systems, so smaller models are better suited for efficient deployment.
These metrics collectively address the unique challenges of deploying LLMs on edge devices, balancing performance, responsiveness, and memory constraints.
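As a sketch, a zero-shot MMLU run via llama-perplexity could look like the following; the model and dataset paths are illustrative, and the flags follow llama.cpp's perplexity example at the time of writing (check `--help` on your build):
```bash
# Illustrative example: zero-shot MMLU via llama.cpp's multiple-choice mode
./llama-perplexity -m models/example-q4_0.gguf \
    -bf mmlu-validation.bin --multiple-choice -t 4
```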
## How to run locally
To run the Edge LLM Leaderboard locally on your machine, follow these steps:
### 1. Clone the Repository
First, clone the repository to your local machine:
```bash
git clone https://huggingface.co/spaces/nyunai/edge-llm-leaderboard
cd edge-llm-leaderboard
```
### 2. Install the Required Dependencies
Install the necessary Python packages listed in the `requirements.txt` file:
`pip install -r requirements.txt`
### 3. Run the Application
You can run the Gradio application in one of the following ways:
- Option 1: Using Python
`python app.py`
- Option 2: Using the Gradio CLI (includes hot-reload)
`gradio app.py`
### 4. Access the Application
Once the application is running, you can access it locally in your web browser at http://127.0.0.1:7860/
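If port 7860 is already taken, Gradio falls back to the next free port; you can also pin one explicitly with Gradio's standard `GRADIO_SERVER_PORT` environment variable, as in this sketch:
```bash
# Illustrative example: serve the app on a specific port
GRADIO_SERVER_PORT=7861 python app.py
```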