|
--- |
|
library_name: transformers |
|
license: cc-by-4.0 |
|
datasets: |
|
- uonlp/CulturaX |
|
--- |
|
|
|
|
# LOLA — An Open-Source Massively Multilingual Large Language Model |
|
|
|
|
|
## Model Description |
|
|
|
- **Developed by:** DICE Research Group (https://dice-research.org/) @ Paderborn University (https://www.uni-paderborn.de/) |
|
- **Model type:** GPT-2-style (decoder-only) transformer with alternating sparse Mixture-of-Experts layers
|
- **Number of Experts**: 16 |
|
- **Model Size**: 1.3 Billion (active*) / 7.4 Billion (total) |
|
- **Language(s) (NLP):** 160+ |
|
- **License:** CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/) |
|
- **Repository:** https://github.com/dice-group/LOLA |
|
|
|
<sub>* The number of parameters a model utilizes per token (ref: [Du et al, 2022](https://arxiv.org/abs/2112.06905)). This distinction is crucial for understanding the efficiency and performance of MoE models.</sub> |
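To build intuition for the active-versus-total distinction, here is a toy calculation. All layer sizes below are hypothetical placeholders, not LOLA's actual configuration: with top-1 routing, each token passes through only one of the 16 expert FFNs, so the active count includes a single expert's parameters while the total count includes all 16.

```python
# Toy illustration of active vs. total parameters in an MoE model.
# All numbers are hypothetical placeholders, not LOLA's real configuration.

shared_params = 1_000_000_000    # embeddings, attention, non-MoE layers (used by every token)
params_per_expert = 400_000_000  # parameters in one expert FFN
num_experts = 16
experts_per_token = 1            # top-1 routing: each token is sent to one expert

total_params = shared_params + num_experts * params_per_expert
active_params = shared_params + experts_per_token * params_per_expert

print(f"total:  {total_params:,}")   # 7,400,000,000
print(f"active: {active_params:,}")  # 1,400,000,000
```

The gap between the two numbers is why MoE models can grow total capacity without a proportional increase in per-token compute.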
|
|
|
## How to Get Started with the Model |
|
|
|
This model was pre-trained with a causal language modeling objective. Out of the box it is suitable only for text generation; downstream tasks require further fine-tuning.
|
|
|
### How to use |
|
|
|
You can use this model directly with a pipeline for text generation. |
|
|
|
```python |
|
>>> from transformers import pipeline |
|
|
|
>>> generator = pipeline('text-generation', model="dice-research/lola_v1", trust_remote_code=True) |
|
>>> generator("The quick brown fox", max_length=13) |
|
[{'generated_text': 'The quick brown fox jumps over the lazy dog.'}] |
|
``` |
|
|
|
To use top-k sampling, set `do_sample` to `True`.
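For intuition, top-k sampling keeps only the k most probable next tokens, renormalizes their probabilities, and samples from the result. A minimal self-contained sketch of that filtering step (toy vocabulary and probabilities, independent of the transformers internals):

```python
import random

def top_k_sample(probs, k, rng=random):
    """Sample an index from `probs`, restricted to the k highest-probability entries."""
    # Keep the indices of the k most probable tokens.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize the surviving probabilities so they sum to 1.
    total = sum(probs[i] for i in top)
    weights = [probs[i] / total for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

# Toy next-token distribution over a 5-token vocabulary.
probs = [0.5, 0.2, 0.15, 0.1, 0.05]
idx = top_k_sample(probs, k=2)
assert idx in (0, 1)  # with k=2, only the two most likely tokens can be drawn
```

In the pipeline, the equivalent behavior is controlled with generation arguments such as `do_sample=True` and `top_k`.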
|
|
|
**Note:** The tokenizer used by this model comes from mGPT (https://github.com/ai-forever/mgpt).
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Framework |
|
|
|
- Megatron-DeepSpeed (https://github.com/microsoft/Megatron-DeepSpeed)
|
- Architecture type: Transformers (Decoder-only) with Mixture-of-Experts (MoE) |
|
- Number of Experts: 16 |
|
- Model Size: 1.3 Billion (active) / 7.4 Billion (total)
|
|
|
### Pretraining Dataset |
|
|
|
- CulturaX (https://huggingface.co/datasets/uonlp/CulturaX) |
|
- Total Tokens: 6.3 Trillion |
|
- Total Languages: 167 |
|
|
|
### LOLA v1 Training
|
|
|
- Computing cluster: Noctua2 (https://pc2.uni-paderborn.de/hpc-services/available-systems/noctua2) |
|
- Number of GPUs: 96x Nvidia A100 (40GB) |
|
- Training steps: 296,000
|
- Tokens consumed: 465 Billion |
|
- Training time: ~19 days |
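As a rough sanity check on these figures, tokens consumed divided by training steps gives the effective batch size in tokens per step (approximate; the exact value also depends on sequence length and schedule details not listed here):

```python
# Effective tokens per training step, derived from the figures above.
tokens_consumed = 465_000_000_000  # 465 Billion
training_steps = 296_000

tokens_per_step = tokens_consumed / training_steps
print(f"~{tokens_per_step:,.0f} tokens per step")  # ~1,570,946
```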
|
|
|
## Citation |
|
If you use our work in your research, please make sure to cite it: |
|
```bibtex |
|
@misc{srivastava2024lolaopensourcemassively, |
|
title={LOLA -- An Open-Source Massively Multilingual Large Language Model}, |
|
author={Nikit Srivastava and Denis Kuchelev and Tatiana Moteu Ngoli and Kshitij Shetty and Michael Roeder and Diego Moussallem and Hamada Zahera and Axel-Cyrille Ngonga Ngomo}, |
|
year={2024}, |
|
eprint={2409.11272}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL}, |
|
url={https://arxiv.org/abs/2409.11272}, |
|
} |
|
``` |