FremyCompany committed on
Commit 3527bc9 · verified · 1 Parent(s): 688cb16

Update README.md

Files changed (1): README.md +1 -1
README.md CHANGED
@@ -31,7 +31,7 @@ base_model:
 
 # FMMB-BE: The Fairly Multilingual ModernBERT Embedding Model (Belgian Edition)
 
- 🇧🇪 The Fairly Multilingual ModernBERT Embedding Model (Belgian Edition) is the perfect model for embedding texts up to 8192 tokens written in French, Dutch, German or English at the speed of light. This model uses the most effecient tokenizer for each input text, thereby maximizing your GPU usage. Despite using 4 different tokenizers and 4 different embedding tables, this model can mix and match different languages in the same batch, and produces embeddings very similar across languages. That said: if you know the tokenizer you want to use in advance, you can use the monolingual variants for [French](https://huggingface.co/Parallia/Fairly-Multilingual-ModernBERT-Embed-BE-FR), [Dutch](https://huggingface.co/Parallia/Fairly-Multilingual-ModernBERT-Embed-BE-NL), [German](https://huggingface.co/Parallia/Fairly-Multilingual-ModernBERT-Embed-BE-DE) or [English](https://huggingface.co/Parallia/Fairly-Multilingual-ModernBERT-Embed-BE-EN) for a faster tokenization and lower memory footprint.
+ 🇧🇪 The Fairly Multilingual ModernBERT Embedding Model (Belgian Edition) is the perfect model for embedding texts up to 8192 tokens written in French, Dutch, German, or English at the speed of light. For each input text, the FMMB model autodetects the most efficient tokenizer (English, French, Dutch, or German) and routes the input text to that tokenizer. Each tokenizer uses its own language-specific token embeddings, reducing the risk of language interference. Because all the other weights are shared, the FMMB models can mix and match different languages in the same batch without loading 4 different models in memory, and produce embeddings that are very similar across languages. That said: if you know in advance which tokenizer you want to use, you can use the monolingual variants for [French](https://huggingface.co/Parallia/Fairly-Multilingual-ModernBERT-Embed-BE-FR), [Dutch](https://huggingface.co/Parallia/Fairly-Multilingual-ModernBERT-Embed-BE-NL), [German](https://huggingface.co/Parallia/Fairly-Multilingual-ModernBERT-Embed-BE-DE) or [English](https://huggingface.co/Parallia/Fairly-Multilingual-ModernBERT-Embed-BE-EN) for faster tokenization and a lower memory footprint.
 
 🆘 This [sentence-transformers](https://www.SBERT.net) model was trained on a small parallel corpus containing English-French, English-Dutch, and English-German sentence pairs. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more. The input texts can be used as-is, no need to use prefixes.