|
---
license: apache-2.0
inference: false
---
|
|
|
# MegaBeam-Mistral-7B-300k-AWQ Model |
|
|
|
MegaBeam-Mistral-7B-300k-AWQ is a version of the [MegaBeam-Mistral-7B-300k](https://huggingface.co/amazon/MegaBeam-Mistral-7B-300k) model quantized with the AWQ method developed by [Lin et al. (2023)](https://arxiv.org/abs/2306.00978). The AWQ variants are approximately **70% smaller** than the original MegaBeam-Mistral-7B-300k model while maintaining comparable performance.
|
|
|
Please refer to the [original MegaBeam-Mistral-7B-300k model card](https://huggingface.co/amazon/MegaBeam-Mistral-7B-300k) for details about the model preparation and training processes.
|
|
|
## MegaBeam-Mistral-7B-300k-AWQ Variants
|
|
|
| Branch | Approx. Model Size | `q_group_size` | `w_bit` | `version` |
|--------|-------------------:|---------------:|--------:|-----------|
| [main](https://huggingface.co/aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ/tree/main) | 3.9 GB | 128 | 4 | GEMM |
| [MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM](https://huggingface.co/aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ/tree/MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM) | 4.0 GB | 64 | 4 | GEMM |
| [MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM](https://huggingface.co/aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ/tree/MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM) | 4.3 GB | 32 | 4 | GEMM |
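
Smaller `q_group_size` values quantize the weights at finer granularity, which stores more scaling parameters and slightly increases the model size, but can better preserve accuracy; the long-context results below compare the three variants.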
|
|
|
## Dependencies |
|
- [`autoawq==0.2.5`](https://pypi.org/project/autoawq/0.2.5/) – [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) was used to quantize the MegaBeam-Mistral-7B-300k model (see the sketch after this list).
|
- [`vllm==0.4.2`](https://pypi.org/project/vllm/0.4.2/) – [vLLM](https://github.com/vllm-project/vllm) was used to host models for benchmarking. |
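
The snippet below is a minimal sketch of how such a quantization can be run with AutoAWQ. The `quant_config` values match the `main` branch of this repository; the `zero_point` setting and the calibration data (AutoAWQ's default) are assumptions, as the exact quantization script is not published here.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "amazon/MegaBeam-Mistral-7B-300k"
quant_path = "MegaBeam-Mistral-7B-300k-AWQ"

# Matches the `main` branch; use q_group_size 64 or 32 for the other branches.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the full-precision model and its tokenizer.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize; AutoAWQ uses its default calibration dataset when none is supplied.
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized weights and tokenizer.
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```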
|
|
|
## Evaluations |
|
|
|
### InfiniteBench |
|
|
|
This benchmark was developed by [Zhang et al. (2024)](https://arxiv.org/abs/2402.13718) and is available from https://github.com/OpenBMB/InfiniteBench.
|
|
|
See the [original MegaBeam-Mistral-7B-300k model card](https://huggingface.co/amazon/MegaBeam-Mistral-7B-300k) for more details.
|
|
|
| Task Name | MegaBeam-Mistral-7B-300k-AWQ | MegaBeam-Mistral-7B-300k | Mistral-7B-Instruct-v0.2 | Llama-3-8B-Instruct-262k | Llama3-70B-1M | GPT-4-1106-preview | YaRN-Mistral-7B | Kimi-Chat | Claude 2 | Yi-6B-200K | Yi-34B-200K | Chatglm3-6B-128K |
|------------------|------------------------------|--------------------------|--------------------------|--------------------------|---------------|--------------------|-----------------|-----------|----------|------------|-------------|------------------|
| Retrieve.PassKey | 100% | 100% | 75.76% | 98.30% | 81.35% | 100% | 92.71% | 98.14% | 97.80% | 100.00% | 100.00% | 92.20% |
| Retrieve.Number | 92.7% | 96.10% | 25.25% | 97.79% | 97.62% | 100% | 56.61% | 95.42% | 98.14% | 94.92% | 100.00% | 80.68% |
| Retrieve.KV | 0% | 0% | 0% | 3.40% | 3% | 89.00% | < 5% | 53.60% | 65.40% | < 5% | < 5% | < 5% |
| En.Sum | 29.05% | 29.39% | 22.13% | 16.40% | 20.72% | 14.73% | 9.09% | 17.93% | 14.45% | < 5% | < 5% | < 5% |
| En.QA | 15.69% | 14.93% | 4.93% | 13.20% | 16.52% | 22.22% | 9.55% | 16.52% | 11.97% | 9.20% | 12.17% | < 5% |
| En.MC | 48.91% | 51.52% | 7.80% | 50.65% | 62% | 67.25% | 27.95% | 72.49% | 62.88% | 36.68% | 38.43% | 10.48% |
| En.Dia | 11.50% | 9.50% | 3.50% | 1% | 12.50% | 8.50% | 7.50% | 11.50% | 46.50% | < 5% | < 5% | < 5% |
| Zh.QA | 10.53% | 10.71% | 3.43% | 19.02% | 26% | 25.96% | 14.43% | 17.93% | 9.64% | 15.07% | 13.61% | < 5% |
| Code.Debug | 21.83% | 27.41% | 11.60% | 22.08% | 23.85% | 39.59% | < 5% | 18.02% | < 5% | < 5% | < 5% | < 5% |
| Code.Run | 1.25% | 1.75% | 0.25% | 0% | 0% | 23.25% | < 5% | < 5% | < 5% | < 5% | < 5% | < 5% |
| Math.Calc | 0% | 0% | 0% | 0% | 0% | < 5% | < 5% | < 5% | < 5% | < 5% | < 5% | < 5% |
| Math.Find | 20.57% | 24.28% | 26.28% | 15.40% | 30% | 60.00% | 17.14% | 12.57% | 32.29% | < 5% | 25.71% | 7.71% |
| **Average** | 29.34% | 30.70% | 15.08% | 28.10% | 31.13% | 46.08% | 20.41% | 34.93% | 37.21% | 22.78% | 25.41% | 17.59% |
|
|
|
|
|
### Long Context |
|
|
|
The following benchmark results are shown as _accuracy_ (%) values, unless stated otherwise. |
|
|
|
#### Topic Retrieval |
|
|
|
See https://lmsys.org/blog/2023-06-29-longchat/ |
|
|
|
| Model Name | n_topics=05 | n_topics=10 | n_topics=15 | n_topics=20 | n_topics=25 |
|:---------------------------------------------------|--------------:|--------------:|--------------:|--------------:|--------------:|
| _n_tokens_ (approx.) = | _3048_ | _5966_ | _8903_ | _11832_ | _14757_ |
| MegaBeam-Mistral-7B-300k | 100 | 100 | 100 | 100 | 100 |
| **MegaBeam-Mistral-7B-300k-AWQ** | **100** | **100** | **100** | **100** | **100** |
| **MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM** | **100** | **100** | **100** | **100** | **98** |
| **MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM** | **100** | **100** | **100** | **100** | **98** |
|
|
|
#### Line Retrieval
|
|
|
See https://lmsys.org/blog/2023-06-29-longchat/#longeval-results |
|
|
|
| Model Name | n_lines=200 | n_lines=300 | n_lines=400 | n_lines=500 | n_lines=600 | n_lines=680 |
|:----------|-------------:|-------------:|------------:|-----------:|-----------:|-----------:|
| _n_tokens_ (approx.) = | _4317_ | _6415_ | _8510_ | _10610_ | _12698_ | _14373_ |
| MegaBeam-Mistral-7B-300k | 98 | 98 | 92 | 98 | 90 | 90 |
| **MegaBeam-Mistral-7B-300k-AWQ** | **96** | **94** | **88** | **80** | **70** | **62** |
| **MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM** | **100** | **98** | **96** | **96** | **90** | **94** |
| **MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM** | **98** | **98** | **82** | **96** | **92** | **90** |
|
|
|
#### Pass Key Retrieval |
|
|
|
See https://github.com/epfml/landmark-attention/blob/main/llama/run_test.py#L101 |
|
|
|
| Model Name | n_garbage=12000 | n_garbage=20000 | n_garbage=31000 | n_garbage=38000 | n_garbage=45000 | n_garbage=60000 |
|:----------|-------------:|-------------:|------------:|-----------:|-----------:|-----------:|
| _n_tokens_ (approx.) = | _3272_ | _5405_ | _8338_ | _10205_ | _12071_ | _16072_ |
| MegaBeam-Mistral-7B-300k | 100 | 100 | 100 | 100 | 100 | 100 |
| **MegaBeam-Mistral-7B-300k-AWQ** | **100** | **100** | **100** | **100** | **100** | **100** |
| **MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM** | **100** | **100** | **100** | **100** | **100** | **100** |
| **MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM** | **100** | **100** | **100** | **100** | **100** | **100** |
|
|
|
|
|
#### QuALITY (Question Answering with Long Input Texts, Yes!) |
|
|
|
See https://nyu-mll.github.io/quality/ |
|
|
|
| Model Name | Test Set Accuracy | Hard Subset Accuracy |
|:----------|-------------:|-------------:|
| MegaBeam-Mistral-7B-300k | 53.2 | 72 |
| **MegaBeam-Mistral-7B-300k-AWQ** | **51.3** | **71.3** |
| **MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM** | **52.4** | **72.1** |
| **MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM** | **53.1** | **71.3** |
|
|
|
## Usage |
|
|
|
### Inference via vLLM HTTP Host
|
|
|
#### Launch Host
```bash
python -m vllm.entrypoints.openai.api_server \
    --model aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ \
    --quantization awq
```
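
To serve one of the other quantization branches, the branch name can be passed via vLLM's `--revision` flag, e.g. `--revision MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM`.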
|
|
|
#### Query Host
```bash
curl -X POST http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{ "model": "aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ",
          "prompt": "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
          "temperature": 0,
          "echo": false
        }'
```
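
Because the host exposes an OpenAI-compatible API, it can also be queried with the OpenAI Python client. This is a minimal sketch assuming `openai>=1.0`; the `api_key` value is a placeholder, as the server does not require authentication by default.

```python
from openai import OpenAI

# Point the client at the local vLLM server (default port 8000).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ",
    prompt="<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
    temperature=0,
)
print(completion.choices[0].text)
```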
|
|
|
### Inference via [vLLM Offline Inference](https://docs.vllm.ai/en/latest/getting_started/examples/offline_inference.html)
|
```python
from vllm import LLM, SamplingParams

prompts = [
    "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
]
sampling_params = SamplingParams(temperature=0, max_tokens=100)

llm = LLM(model="aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ")

outputs = llm.generate(prompts, sampling_params)

# Print the generated completions.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
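
To run one of the other branches offline, the branch name can be passed as the model revision. A minimal sketch reusing the `LLM` class imported above; `quantization="awq"` is optional here, since vLLM detects the quantization method from the model config.

```python
# Load the 64-group variant from its branch (see the variants table above).
llm = LLM(
    model="aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ",
    revision="MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM",
    quantization="awq",
)
```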
|
|
|
## License |
|
|
|
Apache 2.0 |
|
|
|
## Limitations |
|
|
|
Before using the MegaBeam-Mistral-7B-300k-AWQ model, it is important to perform your own independent assessment and to take measures to ensure that your use complies with your own quality control practices and standards, and with the local rules, laws, regulations, licenses, and terms that apply to you and your content.
|
|
|
|