File size: 4,793 Bytes
7239ffa 0bb3100 7239ffa 6a40632 7239ffa d23c561 73d29c8 d23c561 73d29c8 d23c561 73d29c8 6a40632 7239ffa 0bb3100 7239ffa 108b6ea 0bb3100 7239ffa 0bb3100 7239ffa 0bb3100 6a40632 97375cb 108b6ea 97375cb 108b6ea 6a40632 0bb3100 97375cb 0bb3100 f66ee9b 24f92db 108b6ea f66ee9b 41e59b2 108b6ea f66ee9b 0bb3100 24f92db f66ee9b 0bb3100 24f92db 0bb3100 6a40632 0bb3100 6a40632 0bb3100 6a40632 0bb3100 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 |
---
library_name: transformers
license: apache-2.0
datasets:
- HuggingFaceTB/smollm-corpus
language:
- en
pipeline_tag: text-generation
---
# **Doge 60M**
<div align="center">
<img src="https://huggingface.co/spaces/SmallDoge/README/resolve/main/org_icon.png" width="100%" alt="SmallDoge" />
</div>
<hr>
<div align="center">
<a href="https://arxiv.org/abs/2412.11834" target="_blank" style="margin: 2px;">
<img alt="arXiv" src="https://img.shields.io/static/v1?label=arXiv&message=2412.11834&color=B31B1B&logo=arXiv" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://github.com/SmallDoges/small-doge" target="_blank" style="margin: 2px;">
<img alt="GitHub" src="https://img.shields.io/badge/GitHub-SmallDoge-181717?logo=github" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://huggingface.co/SmallDoge" target="_blank" style="margin: 2px;">
<img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-SmallDoge-ffc107?color=ffc107&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://github.com/SmallDoges/small-doge/blob/main/LICENSE" style="margin: 2px;">
<img alt="License" src="https://img.shields.io/badge/License-Apache--2.0-blue.svg" style="display: inline-block; vertical-align: middle;"/>
</a>
</div>
Doge uses Dynamic Mask Attention as sequence transformation and can use Multi-Layer Perceptron or Cross Domain Mixture of Experts as state transformation. Dynamic Mask Attention allows the Transformer to use self-attention during training and state space during inference, and Cross Domain Mixture of Experts can directly inherit the weights of Multi-Layer Perceptron for further training. This model is trained by [SmallDoge](https://huggingface.co/SmallDoge) community, for detailed algorithm and model architecture, please refer to [Wonderful Matrices](https://arxiv.org/abs/2412.11834), all training details and code are publicly available on the [small-doge](https://github.com/SmallDoges/small-doge) repository.
## Uses
```python
>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-60M")
>>> model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-60M", trust_remote_code=True)
>>> inputs = tokenizer("Hey how are you doing?", return_tensors="pt")
>>> out = model.generate(**inputs, max_new_tokens=100)
>>> print(tokenizer.batch_decode(out))
```
## Model Details
We build the Doge by doing Per-Training on [Smollm-Corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus).
> NOTE: If you want to continue pre-training this model, you can find the unconverged checkpoint [here](https://huggingface.co/SmallDoge/Doge-60M-checkpoint).
> NOTE: These models has not been fine-tuned for instruction, the instruction model is [here](https://huggingface.co/SmallDoge/Doge-60M-Instruct).
> TODO: The larger model is under training and will be uploaded soon.
**Pre-Training**:
| Model | Training Data | Steps | Content Length | Tokens | LR | Batch Size | Precision |
|---|---|---|---|---|---|---|---|
| [Doge-20M](https://huggingface.co/SmallDoge/Doge-20M) | [HuggingFaceTB/smollm-corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus) | 8k | 2048 | 4B | 8e-3 | 0.5M | bfloat16 |
| [Doge-60M](https://huggingface.co/SmallDoge/Doge-60M) | [HuggingFaceTB/smollm-corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus) | 16k | 2048 | 16B | 6e-3 | 1M | bfloat16 |
**Evaluation**:
| Model | MMLU | TriviaQA | ARC-E | ARC-C | PIQA | HellaSwag | OBQA | Winogrande | tokens / s on CPU |
|---|---|---|---|---|---|---|---|---|---|
| [Doge-20M](https://huggingface.co/SmallDoge/Doge-20M) | 25.43 | 0.03 | 36.83 | 22.78 | 58.38 | 27.25 | 25.60 | 50.20 | 142 |
| [Doge-60M](https://huggingface.co/SmallDoge/Doge-60M) | 26.41 | 0.18 | 50.46 | 25.34 | 61.43 | 31.45 | 28.00 | 50.75 | 62 |
> All evaluations are done using five-shot settings, without additional training on the benchmarks.
**Procedure**:
[<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/loser_cheems/huggingface/runs/ydynuvfz)
**Environment**:
- Image: nvcr.io/nvidia/pytorch:24.12-py3
- Hardware: 1x NVIDIA RTX 4090
- Software: Transformers
## Citation
```bibtex
@misc{shi2024wonderfulmatrices,
title={Wonderful Matrices: Combining for a More Efficient and Effective Foundation Model Architecture},
author={Jingze Shi and Bingheng Wu},
year={2024},
eprint={2412.11834},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2412.11834},
}
``` |