---
license: apache-2.0
language:
- it
---
--------------------------------------------------------------------------------------------------
Model: DistilUSE
Lang: IT
--------------------------------------------------------------------------------------------------
Model description
This is a Universal Sentence Encoder [1] model for the Italian language. It was obtained by taking mDistilUSE ([distiluse-base-multilingual-cased-v1](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1)) as a starting point and specializing it for Italian by modifying the embedding layer
(as in [2], computing document-level frequencies over the Wikipedia dataset).
The resulting model has 67M parameters, a vocabulary of 30,785 tokens, and a size of ~270 MB.
It can be used to encode Italian texts and compute similarities between them.
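The embedding-layer modification mentioned above can be illustrated with a toy example. The following is only a minimal sketch of one plausible reading of that step (keeping the tokens that appear in enough documents and slicing the embedding matrix accordingly), not the procedure actually used to build this model. The function names, the `min_df` threshold, the `always_keep` argument, and the toy data are all hypothetical.

```python
from collections import Counter

import numpy as np


def document_frequencies(tokenized_docs):
    """Count, for each token id, the number of documents it appears in."""
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))  # set(): each token counts once per document
    return df


def prune_embeddings(embeddings, df, min_df, always_keep=()):
    """Keep the embedding rows of tokens with document frequency >= min_df.

    Returns the reduced matrix and a mapping old_id -> new_id.
    """
    keep = sorted(set(always_keep) | {t for t, c in df.items() if c >= min_df})
    old_to_new = {old: new for new, old in enumerate(keep)}
    return embeddings[keep], old_to_new


# Toy data (hypothetical): 3 "documents" of token ids over a 6-token vocabulary
docs = [[0, 2, 2, 3], [0, 3, 5], [0, 2, 3]]
embeddings = np.arange(6 * 4, dtype=np.float32).reshape(6, 4)

df = document_frequencies(docs)
# Token 4 never occurs but is kept anyway, as a special token might be
reduced, old_to_new = prune_embeddings(embeddings, df, min_df=2, always_keep=[4])

print(reduced.shape)  # (4, 4)
print(old_to_new)     # {0: 0, 2: 1, 3: 2, 4: 3}
```

In a real setting the tokenizer vocabulary would have to be remapped with the same `old_to_new` table, so that token ids still point at the right embedding rows.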
Quick usage
```python
import numpy as np
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the encoder from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("osiria/distiluse-base-italian")
model = AutoModel.from_pretrained("osiria/distiluse-base-italian")

text1 = "Alessandro Manzoni è stato uno scrittore italiano"
text2 = "Giacomo Leopardi è stato un poeta italiano"

# Use the hidden state of the first token as the sentence vector
vec1 = model(tokenizer.encode(text1, return_tensors="pt")).last_hidden_state[0, 0, :].cpu().detach().numpy()
vec2 = model(tokenizer.encode(text2, return_tensors="pt")).last_hidden_state[0, 0, :].cpu().detach().numpy()

# Cosine similarity between the two sentence vectors
cosine_similarity = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
print("COSINE SIMILARITY:", cosine_similarity)
# COSINE SIMILARITY: 0.734292
```
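When more than two texts are involved, the same cosine similarity can be computed for all pairs at once. The snippet below is a minimal sketch that assumes the sentence vectors have already been extracted as above and stacked into a 2-D NumPy array; the function name `cosine_similarity_matrix` and the toy vectors are illustrative, not part of the model's API.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Return the matrix of pairwise cosine similarities between row vectors."""
    vectors = np.asarray(vectors, dtype=np.float64)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Clip the norms to avoid division by zero for all-zero vectors
    unit = vectors / np.clip(norms, 1e-12, None)
    return unit @ unit.T


# Toy vectors standing in for sentence embeddings (hypothetical values)
vecs = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
sims = cosine_similarity_matrix(vecs)
print(np.round(sims, 3))
```

Entry `sims[i, j]` is the similarity between text `i` and text `j`; the diagonal is 1 for any non-zero vector.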
References
[1] https://arxiv.org/abs/1907.04307
[2] https://arxiv.org/abs/2010.05609
License
The model is released under the Apache-2.0 license.