|
--- |
|
library_name: transformers |
|
license: cc-by-4.0 |
|
datasets: |
|
- uonlp/CulturaX |
|
--- |
|
|
|
|
# LOLA — An Open-Source Massively Multilingual Large Language Model |
|
|
|
|
|
## Model Description |
|
|
|
- **Developed by:** DICE Research Group (https://dice-research.org/) @ Paderborn University (https://www.uni-paderborn.de/) |
|
- **Model type:** GPT-2-style (decoder-only) transformer with alternating sparse Mixture-of-Experts layers
|
- **Number of Experts**: 16 |
|
- **Model Size**: 1.3 Billion (active*) / 7.4 Billion (total) |
|
- **Language(s) (NLP):** 160+ |
|
- **License:** CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/) |
|
- **Repository:** https://github.com/dice-group/LOLA |
|
|
|
<sub>* The number of parameters a model utilizes per token (ref: [Du et al, 2022](https://arxiv.org/abs/2112.06905)). This distinction is crucial for understanding the efficiency and performance of MoE models.</sub> |
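To build intuition for the active-versus-total distinction, here is a toy calculation. All layer sizes below are hypothetical placeholders, not LOLA's actual configuration: with top-1 routing, each token passes through only one of the 16 expert FFNs, so the active count includes a single expert's parameters while the total count includes all 16.

```python
# Toy illustration of active vs. total parameters in an MoE model.
# All numbers are hypothetical placeholders, not LOLA's real configuration.

shared_params = 1_000_000_000    # embeddings, attention, non-MoE layers (used by every token)
params_per_expert = 400_000_000  # parameters in one expert FFN
num_experts = 16
experts_per_token = 1            # top-1 routing: each token is sent to one expert

total_params = shared_params + num_experts * params_per_expert
active_params = shared_params + experts_per_token * params_per_expert

print(f"total:  {total_params:,}")   # 7,400,000,000
print(f"active: {active_params:,}")  # 1,400,000,000
```

The gap between the two numbers is why MoE models can grow total capacity without a proportional increase in per-token compute.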
|
|
|
## How to Get Started with the Model |
|
|
|
This model was pre-trained with a causal language modeling objective. Out of the box it is suitable only for text generation; downstream tasks require further fine-tuning.
|
|
|
### How to use |
|
|
|
You can use this model directly with a pipeline for text generation. |
|
|
|
```python |
|
>>> from transformers import pipeline |
|
|
|
>>> generator = pipeline('text-generation', model="dice-research/lola_v1", trust_remote_code=True) |
|
>>> generator("The quick brown fox", max_length=13) |
|
[{'generated_text': 'The quick brown fox jumps over the lazy dog.'}] |
|
``` |
|
|
|
To use top-k sampling, set `do_sample` to `True`.
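For intuition, top-k sampling keeps only the k most probable next tokens, renormalizes their probabilities, and samples from the result. A minimal self-contained sketch of that filtering step (toy vocabulary and probabilities, independent of the transformers internals):

```python
import random

def top_k_sample(probs, k, rng=random):
    """Sample an index from `probs`, restricted to the k highest-probability entries."""
    # Keep the indices of the k most probable tokens.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize the surviving probabilities so they sum to 1.
    total = sum(probs[i] for i in top)
    weights = [probs[i] / total for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

# Toy next-token distribution over a 5-token vocabulary.
probs = [0.5, 0.2, 0.15, 0.1, 0.05]
idx = top_k_sample(probs, k=2)
assert idx in (0, 1)  # with k=2, only the two most likely tokens can be drawn
```

In the pipeline, the equivalent behavior is controlled with generation arguments such as `do_sample=True` and `top_k`.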
|
|
|
**Note:** The tokenizer used by this model comes from mGPT (https://github.com/ai-forever/mgpt).
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Framework |
|
|
|
- Megatron-DeepSpeed (https://github.com/microsoft/Megatron-DeepSpeed)
|
- Architecture type: Transformers (Decoder-only) with Mixture-of-Experts (MoE) |
|
- Number of Experts: 16 |
|
- Model Size: 1.3 Billion (active) / 7.4 Billion (total)
|
|
|
### Pretraining Dataset |
|
|
|
- CulturaX (https://huggingface.co/datasets/uonlp/CulturaX) |
|
- Total Tokens: 6.3 Trillion |
|
- Total Languages: 167 |
|
|
|
### LOLA v1 Training
|
|
|
- Computing cluster: Noctua2 (https://pc2.uni-paderborn.de/hpc-services/available-systems/noctua2) |
|
- Number of GPUs: 96x Nvidia A100 (40GB) |
|
- Training steps: 296,000
|
- Tokens consumed: 465 Billion |
|
- Training time: ~19 days |
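As a rough sanity check on these figures, tokens consumed divided by training steps gives the effective batch size in tokens per step (approximate; the exact value also depends on sequence length and schedule details not listed here):

```python
# Effective tokens per training step, derived from the figures above.
tokens_consumed = 465_000_000_000  # 465 Billion
training_steps = 296_000

tokens_per_step = tokens_consumed / training_steps
print(f"~{tokens_per_step:,.0f} tokens per step")  # ~1,570,946
```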
|
|
|
## Citation |
|
If you use our work in your research, please make sure to cite it: |
|
```bibtex |
|
@misc{srivastava2024lolaopensourcemassively, |
|
title={LOLA -- An Open-Source Massively Multilingual Large Language Model}, |
|
author={Nikit Srivastava and Denis Kuchelev and Tatiana Moteu Ngoli and Kshitij Shetty and Michael Roeder and Diego Moussallem and Hamada Zahera and Axel-Cyrille Ngonga Ngomo}, |
|
year={2024}, |
|
eprint={2409.11272}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL}, |
|
url={https://arxiv.org/abs/2409.11272}, |
|
} |
|
``` |