mistral7b-ar-tokenizer-swap-pure-bf16

Mistral-7B-v0.1 adapted to Arabic as part of our study on efficient language adaptation: "Language Adaptation on a Tight Academic Compute Budget: Tokenizer Swapping Works and Pure bfloat16 Is Enough".

Code: https://github.com/konstantinjdobler/tight-budget-llm-adaptation

Paper: https://openreview.net/forum?id=VYfJaHeVod

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("konstantindobler/mistral7b-ar-tokenizer-swap-pure-bf16")
model = AutoModelForCausalLM.from_pretrained("konstantindobler/mistral7b-ar-tokenizer-swap-pure-bf16")

# Use model and tokenizer as usual
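
A slightly fuller sketch of "as usual" follows, assuming `torch` and a recent `transformers` are installed. It loads the weights in bfloat16 (without `torch_dtype`, `from_pretrained` defaults to float32) and generates greedily. The helper name `generate`, the example prompt, and the generation settings are illustrative assumptions, not part of the released code.

```python
MODEL_ID = "konstantindobler/mistral7b-ar-tokenizer-swap-pure-bf16"


def generate(prompt: str, max_new_tokens: int = 50) -> str:
    """Greedy-decode a continuation of `prompt` (hypothetical helper)."""
    # Imported lazily so that defining this helper does not itself require
    # the heavy dependencies or trigger a model download.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # Keep the weights in bfloat16, the precision the model was trained in.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (downloads the full model on first use):
# print(generate("عاصمة مصر هي"))
```

The first call downloads all model weights, so this is best run on a machine with enough disk space and memory for a 7B-parameter model.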

Details

The model is based on Mistral-7B-v0.1 and was adapted to Arabic. The original tokenizer was replaced by a language-specific Arabic tokenizer with a vocabulary of 32768 tokens. The new embeddings were initialized with FOCUS. Additionally, we tuned just the embeddings for 100 steps before training the full model. The model was then trained on 8 billion Arabic tokens from uonlp/CulturaX with pure bfloat16 precision (no mixed precision). More details and hyperparameters can be found in the paper.
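
As a rough illustration of the numbers involved, the following back-of-the-envelope sketch computes how many embedding parameters the tokenizer swap touches and how much memory the training state might need with and without mixed precision. The hidden size (4096), original vocabulary size (32000), and untied output head are properties of Mistral-7B-v0.1. The approximate total of 7.25 billion parameters and the bytes-per-parameter accounting are assumptions for illustration and may not match the exact setup in the paper.

```python
HIDDEN_SIZE = 4096   # Mistral-7B-v0.1 hidden dimension
OLD_VOCAB = 32_000   # original Mistral tokenizer vocabulary
NEW_VOCAB = 32_768   # Arabic tokenizer vocabulary from this model card

# ASSUMPTION: approximate parameter count of the adapted model.
NUM_PARAMS = 7_250_000_000

# ASSUMPTION: illustrative bytes per parameter for AdamW training state.
BYTES_PER_PARAM = {
    # bf16 weights, gradients, and both Adam moments (2 bytes each)
    "pure_bf16": 2 + 2 + 2 + 2,
    # bf16 weights and gradients, plus fp32 master weights and Adam moments
    "mixed_precision": 2 + 2 + 4 + 4 + 4,
}


def embedding_params(vocab_size: int, hidden_size: int = HIDDEN_SIZE) -> int:
    """Parameters in the input embeddings plus the untied output head."""
    return 2 * vocab_size * hidden_size


def training_state_gb(num_params: int, regime: str) -> float:
    """Memory for weights, gradients, and optimizer state, in GB (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[regime] / 1e9


if __name__ == "__main__":
    print("re-initialized embedding parameters:", embedding_params(NEW_VOCAB))
    print("extra parameters vs. original vocab:",
          embedding_params(NEW_VOCAB) - embedding_params(OLD_VOCAB))
    for regime in BYTES_PER_PARAM:
        print(regime, training_state_gb(NUM_PARAMS, regime), "GB")
```

Under these assumptions the training state is halved in pure bfloat16. Activation memory is not counted here.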

Disclaimer

The web-scale dataset used for pretraining and tokenizer training (uonlp/CulturaX) might contain personal and sensitive information. Whether the model reproduces such information needs to be assessed carefully before any real-world deployment of the models.

Citation

Please cite as follows:

@inproceedings{dobler2024language,
    title={Language Adaptation on a Tight Academic Compute Budget: Tokenizer Swapping Works and Pure bfloat16 Is Enough},
    author={Konstantin Dobler and Gerard de Melo},
    booktitle={2nd Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@ICML 2024)},
    year={2024},
    url={https://openreview.net/forum?id=VYfJaHeVod}
}

Acknowledgements

The project on which this model is based was funded by the Federal Ministry of Education and Research under the funding code "KI-Servicezentrum Berlin-Brandenburg" 01IS22092. Responsibility for the content of this publication remains with the author.
