MegaBeam-Mistral-7B-300k-AWQ Model

MegaBeam-Mistral-7B-300k-AWQ is a version of the MegaBeam-Mistral-7B-300k model quantized with the AWQ (Activation-aware Weight Quantization) method developed by Lin et al. (2023). The quantized models are approximately 70% smaller than MegaBeam-Mistral-7B-300k while maintaining comparable performance on most of the evaluations reported below.

Please refer to the original MegaBeam-Mistral-7B-300k model card for details about the model preparation and training processes.

MegaBeam-Mistral-7B-300k-AWQ Variants

Branch	Approx. Model Size	`q_group_size`	`w_bit`	`version`
main	3.9 GB	128	4	GEMM
MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM	4.0 GB	64	4	GEMM
MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM	4.3 GB	32	4	GEMM
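The last three columns correspond to fields of the quantization config that AutoAWQ accepts. As a rough sketch of how one row of the table could be reproduced with `autoawq` (this is not the authors' script; the `zero_point` setting, the helper names, and whatever source model and output directory are passed in are assumptions):

```python
# Sketch only: maps rows of the variants table to AutoAWQ-style quant configs.
VARIANTS = {
    "main": {"q_group_size": 128, "w_bit": 4, "version": "GEMM"},
    "MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM": {"q_group_size": 64, "w_bit": 4, "version": "GEMM"},
    "MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM": {"q_group_size": 32, "w_bit": 4, "version": "GEMM"},
}


def quant_config_for(branch: str, zero_point: bool = True) -> dict:
    """Return a quant config for one branch listed in the table.

    zero_point=True is an assumption; the model card does not state it.
    """
    config = dict(VARIANTS[branch])
    config["zero_point"] = zero_point
    return config


def quantize(source_model: str, output_dir: str, branch: str = "main") -> None:
    """Quantize source_model with AutoAWQ and save the result to output_dir."""
    # Imported lazily so the helpers above work without autoawq installed.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_pretrained(source_model)
    tokenizer = AutoTokenizer.from_pretrained(source_model)
    model.quantize(tokenizer, quant_config=quant_config_for(branch))
    model.save_quantized(output_dir)
    tokenizer.save_pretrained(output_dir)
```

Calling `quantize(...)` downloads the full-precision weights and needs a GPU; `quant_config_for("main")` on its own only builds the config dictionary. A smaller `q_group_size` stores more scaling factors, which is consistent with the slightly larger model sizes in the table.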

Dependencies

autoawq==0.2.5 – AutoAWQ was used to quantize the MegaBeam-Mistral-7B-300k model.
vllm==0.4.2 – vLLM was used to host models for benchmarking.

Evaluations

InfiniteBench

This benchmark was developed by Zhang et al. (2024) and is available from https://github.com/OpenBMB/InfiniteBench.

See the original MegaBeam-Mistral-7B-300k model card for more details.

Task Name	MegaBeam-Mistral-7B-300k-AWQ	MegaBeam-Mistral-7B-300k	Mistral-7B-Instruct-v0.2	Llama-3-8B-Instruct-262k	Llama3-70B-1M	GPT-4-1106-preview	YaRN-Mistral-7B	Kimi-Chat	Claude 2	Yi-6B-200K	Yi-34B-200K	Chatglm3-6B-128K
Retrieve.PassKey	100%	100%	75.76%	98.30%	81.35%	100%	92.71%	98.14%	97.80%	100.00%	100.00%	92.20%
Retrieve.Number	92.7%	96.10%	25.25%	97.79%	97.62%	100%	56.61%	95.42%	98.14%	94.92%	100.00%	80.68%
Retrieve.KV	0%	0%	0%	3.40%	3%	89.00%	< 5%	53.60%	65.40%	< 5%	< 5%	< 5%
En.Sum	29.05%	29.39%	22.13%	16.40%	20.72%	14.73%	9.09%	17.93%	14.45%	< 5%	< 5%	< 5%
En.QA	15.69%	14.93%	4.93%	13.20%	16.52%	22.22%	9.55%	16.52%	11.97%	9.20%	12.17%	< 5%
En.MC	48.91%	51.52%	7.80%	50.65%	62%	67.25%	27.95%	72.49%	62.88%	36.68%	38.43%	10.48%
En.Dia	11.50%	9.50%	3.50%	1%	12.50%	8.50%	7.50%	11.50%	46.50%	< 5%	< 5%	< 5%
Zh.QA	10.53%	10.71%	3.43%	19.02%	26%	25.96%	14.43%	17.93%	9.64%	15.07%	13.61%	< 5%
Code.Debug	21.83%	27.41%	11.60%	22.08%	23.85%	39.59%	< 5%	18.02%	< 5%	< 5%	< 5%	< 5%
Code.Run	1.25%	1.75%	0.25%	0%	0%	23.25%	< 5%	< 5%	< 5%	< 5%	< 5%	< 5%
Math.Calc	0%	0%	0%	0%	0%	< 5%	< 5%	< 5%	< 5%	< 5%	< 5%	< 5%
Math.Find	20.57%	24.28%	26.28%	15.40%	30%	60.00%	17.14%	12.57%	32.29%	< 5%	25.71%	7.71%
Average	29.34%	30.70%	15.08%	28.10%	31.13%	46.08%	20.41%	34.93%	37.21%	22.78%	25.41%	17.59%
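The Average row appears to be the unweighted mean of the twelve task scores in a column. As a quick check, the sketch below recomputes it for the MegaBeam-Mistral-7B-300k-AWQ column, with values copied from the table above; columns containing "< 5%" entries cannot be recomputed this way.

```python
from statistics import mean

# Per-task scores (%) for MegaBeam-Mistral-7B-300k-AWQ, copied from the table.
awq_scores = {
    "Retrieve.PassKey": 100.0,
    "Retrieve.Number": 92.7,
    "Retrieve.KV": 0.0,
    "En.Sum": 29.05,
    "En.QA": 15.69,
    "En.MC": 48.91,
    "En.Dia": 11.50,
    "Zh.QA": 10.53,
    "Code.Debug": 21.83,
    "Code.Run": 1.25,
    "Math.Calc": 0.0,
    "Math.Find": 20.57,
}

# Unweighted mean over the twelve tasks.
average = mean(awq_scores.values())
print(f"{average:.2f}%")  # prints 29.34%
```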

Long Context

The following benchmark results are shown as accuracy (%) values, unless stated otherwise.

Topic Retrieval

See https://lmsys.org/blog/2023-06-29-longchat/

Model Name	n_topics=05	n_topics=10	n_topics=15	n_topics=20	n_topics=25
n_tokens (approx.) =	3048	5966	8903	11832	14757
MegaBeam-Mistral-7B-300k	100	100	100	100	100
MegaBeam-Mistral-7B-300k-AWQ	100	100	100	100	100
MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM	100	100	100	100	98
MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM	100	100	100	100	98

Line Retrieval

See https://lmsys.org/blog/2023-06-29-longchat/#longeval-results

Model Name	n_lines=200	n_lines=300	n_lines=400	n_lines=500	n_lines=600	n_lines=680
n_tokens (approx.) =	4317	6415	8510	10610	12698	14373
MegaBeam-Mistral-7B-300k	98	98	92	98	90	90
MegaBeam-Mistral-7B-300k-AWQ	96	94	88	80	70	62
MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM	100	98	96	96	90	94
MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM	98	98	82	96	92	90

Pass Key Retrieval

See https://github.com/epfml/landmark-attention/blob/main/llama/run_test.py#L101

Model Name	n_garbage=12000	n_garbage=20000	n_garbage=31000	n_garbage=38000	n_garbage=45000	n_garbage=60000
n_tokens (approx.) =	3272	5405	8338	10205	12071	16072
MegaBeam-Mistral-7B-300k	100	100	100	100	100	100
MegaBeam-Mistral-7B-300k-AWQ	100	100	100	100	100	100
MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM	100	100	100	100	100	100
MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM	100	100	100	100	100	100
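In this test, `n_garbage` appears to be the amount of filler text, measured in characters, placed around a short sentence that states a numeric pass key; this is consistent with the approximate token counts in the table. The sketch below is loosely modeled on the linked script and is not a copy of it: the instruction wording, the filler sentence, the `seed` parameter, and the function name `build_passkey_prompt` are illustrative assumptions.

```python
import random

FILLER = "The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again."


def build_passkey_prompt(n_garbage: int, seed: int = 0) -> tuple[str, int]:
    """Hide a random pass key inside n_garbage characters of filler text."""
    rng = random.Random(seed)
    pass_key = rng.randint(1, 50000)

    # Split the filler budget at a random point so the key position varies.
    n_prefix = rng.randint(0, n_garbage)
    n_suffix = n_garbage - n_prefix

    # Repeat the filler sentence often enough to cover n_garbage characters.
    repeats = n_garbage // len(FILLER) + 1
    filler_pool = " ".join([FILLER] * repeats)

    parts = [
        "There is important information hidden inside a lot of irrelevant text. Find it and memorize it.",
        filler_pool[:n_prefix],
        f"The pass key is {pass_key}. Remember it. {pass_key} is the pass key.",
        filler_pool[:n_suffix],
        "What is the pass key? The pass key is",
    ]
    return "\n".join(parts), pass_key


prompt, key = build_passkey_prompt(n_garbage=12000)
print(len(prompt), key)
```

A response would be counted as correct if it contains the returned key; accuracy is then the share of correct responses over repeated trials.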

QuALITY (Question Answering with Long Input Texts, Yes!)

See https://nyu-mll.github.io/quality/

Model Name	Test set Accuracy	Hard subset Accuracy
MegaBeam-Mistral-7B-300k	53.2	72
MegaBeam-Mistral-7B-300k-AWQ	51.3	71.3
MegaBeam-Mistral-7B-300k-AWQ-64g-4b-GEMM	52.4	72.1
MegaBeam-Mistral-7B-300k-AWQ-32g-4b-GEMM	53.1	71.3

Usage

Inference via vLLM HTTP Host

Launch Host

python -m vllm.entrypoints.openai.api_server \
    --model aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ \
    --quantization awq

Query Host

curl -X POST http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{ "model": "aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ",
          "prompt": "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
          "temperature": 0,
          "echo": false
    }'
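The same request can also be sent from Python. This is a minimal sketch using only the standard library, assuming the host launched above is listening on localhost:8000; the helper names `format_prompt`, `build_payload` and `query_host`, and the `max_tokens` value, are illustrative and not part of the model card.

```python
import json
import urllib.request

MODEL = "aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ"


def format_prompt(question: str) -> str:
    """Wrap a question in the prompt template used in the curl example."""
    return f"<|prompter|>{question}</s><|assistant|>"


def build_payload(question: str, max_tokens: int = 100) -> dict:
    """Build the JSON body for the /v1/completions endpoint."""
    return {
        "model": MODEL,
        "prompt": format_prompt(question),
        "temperature": 0,
        "max_tokens": max_tokens,  # assumption: cap chosen for illustration
        "echo": False,
    }


def query_host(question: str, url: str = "http://localhost:8000/v1/completions") -> str:
    """POST the payload and return the generated text of the first choice."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(question)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the host running, `query_host("What are the main challenges to support a long context for LLM?")` would return the completion text.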

Inference via vLLM Offline Inference

from vllm import LLM, SamplingParams

prompts = [
    "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
]
sampling_params = SamplingParams(temperature=0, max_tokens=100)

llm = LLM(model="aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
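To run one of the other variants from the table of branches, the branch name can be passed as the Hugging Face revision. This is a minimal sketch, assuming the `revision` and `quantization` keyword arguments of vLLM's `LLM` constructor; `branch_for`, `llm_kwargs_for` and `load_variant` are hypothetical helper names.

```python
MODEL = "aws-prototyping/MegaBeam-Mistral-7B-300k-AWQ"
SUPPORTED_GROUP_SIZES = (128, 64, 32)


def branch_for(q_group_size: int) -> str:
    """Map a group size from the variants table to its branch name."""
    if q_group_size not in SUPPORTED_GROUP_SIZES:
        raise ValueError(f"no published variant with q_group_size={q_group_size}")
    if q_group_size == 128:
        return "main"
    return f"MegaBeam-Mistral-7B-300k-AWQ-{q_group_size}g-4b-GEMM"


def llm_kwargs_for(q_group_size: int = 128) -> dict:
    """Keyword arguments for vllm.LLM that select one variant."""
    return {
        "model": MODEL,
        "revision": branch_for(q_group_size),
        "quantization": "awq",
    }


def load_variant(q_group_size: int = 128):
    """Load the chosen variant; needs vLLM and a GPU."""
    # Imported lazily so the helpers above work without vLLM installed.
    from vllm import LLM

    return LLM(**llm_kwargs_for(q_group_size))
```

For example, `load_variant(64)` would load the 64-group branch, and the returned object can be used with `generate` exactly as in the example above.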

License

Apache 2.0

Limitations

Before using the MegaBeam-Mistral-7B-300k-AWQ model, perform your own independent assessment and take measures to ensure that your use complies with your own quality control practices and standards, and with the local rules, laws, regulations, licenses and terms that apply to you and your content.