---
language: spanish
license: apache-2.0
datasets:
  - wikipedia
widget:
  - text: El español es un idioma muy [MASK] en el mundo.
---

# DistilBERT base multilingual model, Spanish subset (cased)

This model is the Spanish extract of distilbert-base-multilingual-cased, a distilled version of the BERT base multilingual model. It uses the extraction method proposed by Geotrend, which is described in https://github.com/Geotrend-research/smaller-transformers.

In particular, we ran the following script:

```bash
python reduce_model.py \
    --source_model distilbert-base-multilingual-cased \
    --vocab_file notebooks/selected_tokens/selected_es_tokens.txt \
    --output_model distilbert-base-es-multilingual-cased \
    --convert_to_tf False
```

The resulting model has the same architecture as DistilmBERT: 6 layers, a hidden size of 768, and 12 attention heads, with a total of 65M parameters (compared to 134M for DistilmBERT).
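
As a rough sanity check of those parameter counts, a minimal sketch with the `transformers` library is shown below; it assumes the reduced checkpoint is available under the identifier produced by the script above (adjust the id to wherever the model is actually hosted):

```python
from transformers import AutoModelForMaskedLM

# Assumed identifiers: the reduced model uses the --output_model name from the
# script above; replace with the actual Hub repository id if it differs.
reduced = AutoModelForMaskedLM.from_pretrained("distilbert-base-es-multilingual-cased")
full = AutoModelForMaskedLM.from_pretrained("distilbert-base-multilingual-cased")

print(f"Spanish subset: {reduced.num_parameters() / 1e6:.0f}M parameters")  # expected ~65M
print(f"DistilmBERT:    {full.num_parameters() / 1e6:.0f}M parameters")     # expected ~134M
```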

The goal of this model is to further reduce the size of the distilbert-base-multilingual-cased model by keeping only the most frequent Spanish tokens, which shrinks the embedding layer. For more details, see the paper from the Geotrend team: Load What You Need: Smaller Versions of Multilingual BERT.
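
For a quick masked-language-modelling test, a minimal sketch using the `fill-mask` pipeline and the widget sentence from the metadata is shown below; the model identifier is assumed to be the one produced by the reduction script and may need to be replaced with the actual Hub repository id:

```python
from transformers import pipeline

# Assumed identifier; adjust to the actual Hub repository id if needed.
fill_mask = pipeline("fill-mask", model="distilbert-base-es-multilingual-cased")

# Same example as the widget: "Spanish is a very [MASK] language in the world."
predictions = fill_mask("El español es un idioma muy [MASK] en el mundo.")
for p in predictions:
    print(f"{p['token_str']}: {p['score']:.3f}")
```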