stabilityai/stablelm-2-zephyr-1_6b

@@ -45,7 +45,7 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained('stabilityai/stablelm-zephyr-3b')
 model = AutoModelForCausalLM.from_pretrained(
-    'stabilityai/stablelm-zephyr-3b',
     trust_remote_code=True,
     device_map="auto"
 )
@@ -87,52 +87,49 @@ The dataset is comprised of a mixture of open datasets large-scale datasets avai
 - meta-math/MetaMathQA
 - WizardLM/WizardLM_evol_instruct_V2_196k
 - Open-Orca/SlimOrca
 2. Preference Datasets:
 - HuggingFaceH4/ultrafeedback_binarized
 - Intel/orca_dpo_pairs
 ## Performance
-### MT-Bench and Alpaca Bench
-<img src="https://cdn-uploads.huggingface.co/production/uploads/6310474ca119d49bc1eb0d80/8WIZS6dAlu5kSH-382pMl.png" alt="mt_bench_plot" width="600"/>
-| Model | Size | Alignment | MT-Bench (score) | AlpacaEval (win rate %) |
-|-------------|-----|----|---------------|--------------|
-| **StableLM Zephyr 3B** 🪁 | 3B | DPO | 6.64 | 76.00 |
-| StableLM Zephyr (SFT only) | 3B | SFT | 6.04 | 71.15 |
-| Capybara v1.9 | 3B | dSFT | 5.94 | - |
-| MPT-Chat |  7B |dSFT |5.42| -|
-| Xwin-LM v0.1 | 7B| dPPO| 6.19| 87.83|
-| Mistral-Instruct v0.1 | 7B|  - | 6.84 |-|
-| Zephyr-7b-α |7B|  dDPO| 6.88| -|
-| Zephyr-7b-β| 7B | dDPO | 7.34 | 90.60 |
-| Falcon-Instruct |  40B |dSFT |5.17 |45.71|
-| Guanaco | 65B |  SFT |6.41| 71.80|
-| Llama2-Chat |  70B |RLHF |6.86| 92.66|
-| Vicuna v1.3 |  33B |dSFT |7.12 |88.99|
-| WizardLM v1.0 |  70B |dSFT |7.71 |-|
-| Xwin-LM v0.1 |   70B |dPPO |- |95.57|
-| GPT-3.5-turbo | - |RLHF |7.94 |89.37|
-| Claude 2 |  - |RLHF |8.06| 91.36|
-| GPT-4 |  -| RLHF |8.99| 95.28|
-## Other benchmarks:
-| Task                | Value                     |
-|-----------------------|---------------------------|
-| ARC (25-shot)         |  47.0       |
-| HellaSwag (10-shot)   | 74.2    |
-| MMLU (5-shot)        |   46.3     |
-| TruthfulQA (0-shot)   |   46.5 |
-| Winogrande (5-shot)   |   65.5 |
-| GSM8K (5-shot)        | 42.3        |
-| BigBench (Avg) | 35.26 |
-| AGI Benchmark (Avg) | 33.23 |
 ### Training Infrastructure
-* **Hardware**: `StableLM Zephyr 3B` was trained on the Stability AI cluster across 8 nodes with 8 A100 80GB GPUs per node.
 * **Code Base**: We use our internal script for SFT steps and used [HuggingFace Alignment Handbook script](https://github.com/huggingface/alignment-handbook) for DPO training.
 ## Commitment to Ethical AI

 tokenizer = AutoTokenizer.from_pretrained('stabilityai/stablelm-zephyr-3b')
 model = AutoModelForCausalLM.from_pretrained(
+    'stabilityai/stablelm-2-zephyr-1_6b',
     trust_remote_code=True,
     device_map="auto"
 )
 - meta-math/MetaMathQA
 - WizardLM/WizardLM_evol_instruct_V2_196k
 - Open-Orca/SlimOrca
+- openchat/openchat_sharegpt4_dataset
+- LDJnr/Capybara
 2. Preference Datasets:
 - HuggingFaceH4/ultrafeedback_binarized
 - Intel/orca_dpo_pairs
 ## Performance
+### MT-Bench
+| Model                   | Size | MT-Bench |
+|-------------------------|------|----------|
+| Mistral-7B-Instruct-v0.2| 7B   | 7.61     |
+| Llama2-Chat             | 70B  | 6.86     |
+| MPT-30B-Chat            | 30B  | 6.39     |
+| stablelm-zephyr-3b      | 3B   | 6.64     |
+| **stablelm-2-zephyr-1.6b**  | 1.6B | 5.42     |
+| Falcon-40B-Instruct     | 40B  | 5.17     |
+| Qwen-1.8B-Chat          | 1.8B | 4.95     |
+| dolphin-2.6-phi-2       | 2.7B | 4.93     |
+| phi-2                   | 2.7B | 4.29     |
+| TinyLlama-1.1B-Chat-v1.0| 1.1B | 3.46     |
+### OpenLLM Leaderboard
+| Model                                  | Size | Average | ARC Challenge (acc_norm) | HellaSwag (acc_norm) | MMLU (acc_norm) | TruthfulQA (mc2) | Winogrande (acc) | Gsm8k (acc) |
+|----------------------------------------|------|---------|-------------------------|----------------------|-----------------|------------------|------------------|-------------|
+| microsoft/phi-2                        | 2.7B | 61.32%  | 61.09%                  | 75.11%               | 58.11%          | 44.47%           | 74.35%           | 54.81%      |
+| **stabilityai/stablelm-2-zephyr-1_6b**     | 1.6B | 49.73%  | 43.34%                  | 69.30%               | 41.79%          | 45.55%           | 63.61%           | 34.80%      |
+| microsoft/phi-1_5                      | 1.3B | 47.69%  | 52.90%                  | 63.79%               | 43.89%          | 40.89%           | 72.22%           | 12.43%      |
+| stabilityai/stablelm-2-1_6b            | 1.6B | 45.54%  | 43.43%                  | 70.49%               | 38.93%          | 36.65%           | 65.90%           | 17.82%      |
+| mosaicml/mpt-7b                        | 7B   | 44.28%  | 47.70%                  | 77.57%               | 30.80%          | 33.40%           | 72.14%           | 4.02%       |
+| KnutJaegersberg/Qwen-1_8B-Llamaified*  | 1.8B | 44.75%  | 37.71%                  | 58.87%               | 46.37%          | 39.41%           | 61.72%           | 24.41%      |
+| openlm-research/open_llama_3b_v2       | 3B   | 40.28%  | 40.27%                  | 71.60%               | 27.12%          | 34.78%           | 67.01%           | 0.91%       |
+| tiiuae/falcon-rw-1b                    | 1B   | 37.07%  | 35.07%                  | 63.56%               | 25.28%          | 35.96%           | 62.04%           | 0.53%       |
+| TinyLlama/TinyLlama-1.1B-3T            | 1.1B | 36.40%  | 33.79%                  | 60.31%               | 26.04%          | 37.32%           | 59.51%           | 1.44%       |
 ### Training Infrastructure
+* **Hardware**: `StableLM 2 Zephyr 1.6B` was trained on the Stability AI cluster across 8 nodes with 8 A100 80GB GPUs per node.
 * **Code Base**: We use our internal script for SFT steps and used [HuggingFace Alignment Handbook script](https://github.com/huggingface/alignment-handbook) for DPO training.
 ## Commitment to Ethical AI
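
The `Average` column in the added OpenLLM Leaderboard table can be sanity-checked against the six per-task scores. The sketch below assumes the average is the unweighted mean of ARC Challenge, HellaSwag, MMLU, TruthfulQA, Winogrande, and Gsm8k; the diff does not state this, and `ROWS` and `mean_score` are illustrative names, not part of the model card.

```python
# Per-task scores and reported averages copied from two rows of the
# OpenLLM Leaderboard table added in this diff.
# Task order: ARC Challenge, HellaSwag, MMLU, TruthfulQA, Winogrande, Gsm8k.
ROWS = {
    "microsoft/phi-2": (
        [61.09, 75.11, 58.11, 44.47, 74.35, 54.81], 61.32),
    "stabilityai/stablelm-2-zephyr-1_6b": (
        [43.34, 69.30, 41.79, 45.55, 63.61, 34.80], 49.73),
}


def mean_score(scores):
    """Unweighted mean of the per-task scores (assumed averaging rule)."""
    return sum(scores) / len(scores)


for name, (scores, reported) in ROWS.items():
    computed = round(mean_score(scores), 2)
    print(f"{name}: computed {computed:.2f}%, reported {reported:.2f}%")
```

Under that assumption, the computed means for both rows agree with the reported `Average` values to two decimal places.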