---
license: apache-2.0
---
![image/png](https://cdn-uploads.huggingface.co/production/uploads/643197ac288c9775673a01e9/w-lgOpASM1DMl2PO0kdFy.png)
## Introduction
APUS-xDAN-4.0-MOE is a transformer-based, decoder-only language model trained on a large corpus of data for robust performance.
It is a Mixture of Experts (MoE) model built on a LLaMA architecture enhanced through continued pre-training, and further optimized with human-feedback algorithms to improve reasoning, mathematical, and logical capabilities at inference time.
For more comprehensive information, please see our blog post and GitHub repository:
https://github.com/shootime2021/APUS-xDAN-4.0-moe
## Model Details
APUS-xDAN-4.0-MOE uses a Mixture of Experts (MoE) architecture that incorporates components from dense language models, inheriting its capabilities from the high-performing xDAN-L2 series. Of its 136 billion total parameters, 30 billion are activated at runtime, giving it high efficiency relative to a dense model of comparable quality.
Through advanced quantization, the open-source release occupies only 42 GB, making it practical to run on consumer-grade GPUs such as the RTX 4090 and RTX 3090.
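As a back-of-the-envelope sanity check on that figure (our arithmetic, not an official number from this card), 42 GB spread over 136B parameters works out to roughly 2.5 bits per weight, consistent with an aggressive ~2-bit quantization such as Q2_K:

```python
# Rough bits-per-weight estimate for the quantized release.
# The 136e9 and 42 GB figures come from this card; the formula is generic.
total_params = 136e9     # all parameters count toward file size,
file_bits = 42e9 * 8     # not just the 30B active per token

print(f"{file_bits / total_params:.2f} bits per weight")  # -> 2.47
```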
Key specifications (a routing sketch follows the list):
- **Parameters:** 136B
- **Architecture:** Mixture of 4 Experts (MoE)
- **Expert Utilization:** 2 experts used per token
- **Layers:** 60
- **Attention Heads:** 56 for queries, 8 for keys/values
- **Embedding Size:** 7,168
- **Additional Features:**
- Rotary embeddings (RoPE)
  - Supports activation sharding and 1.5-bit to 4-bit quantization
- **Maximum Sequence Length (context):** 32,768 tokens
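
To make the "2 of 4 experts per token" figure concrete, here is a minimal top-2 routing sketch. It illustrates the general MoE mechanism, not this model's actual code: the router and expert layers are placeholders, and only the expert count, top-k, and embedding size are taken from the list above.

```python
import torch
import torch.nn.functional as F

n_experts, top_k, d_model = 4, 2, 7168   # figures from the specification list
tokens = torch.randn(3, d_model)         # a batch of 3 token embeddings
router = torch.nn.Linear(d_model, n_experts)                             # gating network (placeholder)
experts = [torch.nn.Linear(d_model, d_model) for _ in range(n_experts)]  # stand-in expert FFNs

scores = router(tokens)                              # (3, 4): one score per expert per token
weights, chosen = torch.topk(scores, top_k, dim=-1)  # keep the 2 best experts per token
weights = F.softmax(weights, dim=-1)                 # renormalize over the chosen experts

out = torch.zeros_like(tokens)
for t in range(tokens.size(0)):              # only the 2 chosen experts run per token,
    for w, e in zip(weights[t], chosen[t]):  # which is why only ~30B of 136B parameters are active
        out[t] += w * experts[int(e)](tokens[t])
```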
## Usage
| Model | Quantized | Size | Context | Hardware Requirement |
|-------------|-----------|--------|--------------------------| --------------------------|
| APUS-xDAN4.0-MoE-0402.Q2_K.gguf | Q2_K | 39G | 32k | 2x24G GPU memory |
| APUS-xDAN4.0-MoE-0402.IQ3_XXS.gguf | IQ3_XXS | 41G | 32k | 2x24G GPU memory |
| APUS-xDAN4.0-MoE-0402.Q3_K_M_Matrix.gguf | Q3_K_M | 51G | 32k | 2x24G GPU memory |
| APUS-xDAN4.0-MoE-0402.Q4_K_M.gguf | Q4_K_M | 64G | 32k | 3x24G GPU memory |
| APUS-xDAN4.0-MoE-0402 | | | | |
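
As a rough guide to the hardware column: when the weights are split evenly across the listed GPUs, each card holds the share computed below (our arithmetic, not an official sizing rule). Where that share exceeds 24 GB, as for the larger quants, llama.cpp can leave the remaining layers on the CPU via its `--n-gpu-layers` option.

```python
# Per-GPU weight share for each quant in the table above
# (our arithmetic; KV cache and activations need additional memory on top).
table = [("Q2_K", 39, 2), ("IQ3_XXS", 41, 2), ("Q3_K_M", 51, 2), ("Q4_K_M", 64, 3)]
for name, size_gb, n_gpus in table:
    print(f"{name}: ~{size_gb / n_gpus:.1f} GB of weights per 24 GB GPU")
```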
### Initial Setup
```bash
# Clone and build llama.cpp with CUDA support
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make LLAMA_CUDA=1
```
### Interactive Chat
```bash
./main -m APUS-xDAN4.0-MoE-0402.Q2_K.gguf \
    --prompt "You are a helpful assistant named APUS-xDAN4.0 MoE." --chatml \
    --interactive \
    --temp 0.7 \
    --ctx-size 4096   # context window; the model supports up to 32,768 tokens
```
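
If you prefer driving the model from Python rather than the CLI, the community llama-cpp-python bindings load the same GGUF files. A minimal sketch, assuming that package is installed (it is our suggestion, not something this card documents):

```python
# Minimal sketch using the llama-cpp-python bindings (an assumption of ours;
# this card itself only documents the llama.cpp CLI).
from llama_cpp import Llama

llm = Llama(
    model_path="APUS-xDAN4.0-MoE-0402.Q2_K.gguf",
    n_ctx=4096,        # raise toward 32768 if you have the memory
    n_gpu_layers=-1,   # offload all layers to GPU(s)
)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant named APUS-xDAN4.0 MoE."},
        {"role": "user", "content": "Explain Mixture of Experts in one paragraph."},
    ],
    temperature=0.7,
)
print(out["choices"][0]["message"]["content"])
```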
## License
APUS-xDAN-4.0-MOE is distributed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.