---
language:
- ru

pipeline_tag: sentence-similarity

tags:
- russian
- pretraining
- embeddings
- tiny
- feature-extraction
- sentence-similarity
- sentence-transformers
- transformers

datasets:
- IlyaGusev/gazeta
- zloelias/lenta-ru

license: mit
base_model: cointegrated/rubert-tiny2

---

## Base BERT for semantic text similarity (STS) on CPU

A base BERT model for computing compact sentence embeddings for Russian. It is built on [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) and keeps the same context length (2048) and embedding size (312), while the number of layers is increased from 3 to 7.


## Using the model with the `transformers` library:

```python
# pip install transformers sentencepiece
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("sergeyzh/rubert-mini-sts")
model = AutoModel.from_pretrained("sergeyzh/rubert-mini-sts")
# model.cuda()  # uncomment if you have a GPU


def embed_bert_cls(text, model, tokenizer):
    """Return the L2-normalized [CLS] embedding of `text` as a NumPy array."""
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    # Use the hidden state of the first ([CLS]) token as the sentence embedding.
    embeddings = model_output.last_hidden_state[:, 0, :]
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings[0].cpu().numpy()


print(embed_bert_cls('привет мир', model, tokenizer).shape)
# (312,)
```
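
Because `embed_bert_cls` L2-normalizes its output, the dot product of two such embeddings equals their cosine similarity. The following is a minimal sketch of that relationship in plain Python. It uses made-up 3-dimensional vectors in place of real 312-dimensional model outputs, and the helpers `l2_normalize`, `dot`, and `cosine` are illustrative names, not part of any library.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for model outputs (assumption: real ones have 312 dims).
u = [1.0, 2.0, 2.0]
v = [2.0, 0.0, 1.0]

u_n, v_n = l2_normalize(u), l2_normalize(v)
# The dot product of the normalized vectors matches the cosine of the originals.
print(math.isclose(dot(u_n, v_n), cosine(u, v)))
```

In practice this means embeddings returned by `embed_bert_cls` can be compared with a simple dot product.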

## Using the model with `sentence_transformers`:
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sergeyzh/rubert-mini-sts')

sentences = ["привет мир", "hello world", "здравствуй вселенная"]
embeddings = model.encode(sentences)
# Pairwise dot-product scores between all sentences
print(util.dot_score(embeddings, embeddings))
```
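
`util.dot_score` returns a square matrix of pairwise scores. The sketch below shows one way such a matrix could be turned into nearest-neighbour pairs. The score matrix is hand-written for illustration (the numbers are invented, not produced by the model), and `most_similar` is a hypothetical helper, not a `sentence_transformers` function.

```python
def most_similar(scores, labels):
    """For each row, return the label of the highest-scoring other item."""
    result = {}
    for i, row in enumerate(scores):
        # Skip the diagonal: every sentence is trivially closest to itself.
        candidates = [(s, j) for j, s in enumerate(row) if j != i]
        _, best = max(candidates)
        result[labels[i]] = labels[best]
    return result


labels = ["привет мир", "hello world", "здравствуй вселенная"]
# Invented scores for illustration only; real values come from the model.
scores = [
    [1.00, 0.60, 0.80],
    [0.60, 1.00, 0.55],
    [0.80, 0.55, 1.00],
]

for sentence, nearest in most_similar(scores, labels).items():
    print(f"{sentence!r} -> {nearest!r}")
```

With a real result, the tensor returned by `util.dot_score` can be converted to nested lists with `.tolist()` before being passed to a helper like this.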

## Metrics
Model scores on the [encodechka](https://github.com/avidale/encodechka) benchmark:

| Model | STS | PI | NLI | SA | TI |
|:---------------------------------|:---------:|:---------:|:---------:|:---------:|:---------:|
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 0.862 | 0.727 | 0.473 | 0.810 | 0.979 |
| [sergeyzh/LaBSE-ru-sts](https://huggingface.co/sergeyzh/LaBSE-ru-sts) | 0.845 | 0.737 | 0.481 | 0.805 | 0.957 |
| sergeyzh/rubert-mini-sts | 0.815 | 0.723 | 0.477 | 0.791 | 0.949 |
| [sergeyzh/rubert-tiny-sts](https://huggingface.co/sergeyzh/rubert-tiny-sts) | 0.797 | 0.702 | 0.453 | 0.778 | 0.946 |
| [Tochka-AI/ruRoPEBert-e5-base-512](https://huggingface.co/Tochka-AI/ruRoPEBert-e5-base-512) | 0.793 | 0.704 | 0.457 | 0.803 | 0.970 |
| [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) | 0.794 | 0.659 | 0.431 | 0.761 | 0.946 |
| [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) | 0.750 | 0.651 | 0.417 | 0.737 | 0.937 |

Tasks:

- Semantic text similarity (STS);
- Paraphrase identification (PI);
- Natural language inference (NLI);
- Sentiment analysis (SA);
- Toxicity identification (TI).

## Speed and size

On the [encodechka](https://github.com/avidale/encodechka) benchmark:

| Model | CPU | GPU | size | dim | n_ctx | n_vocab |
|:---------------------------------|----------:|----------:|----------:|----------:|----------:|----------:|
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 149.026 | 15.629 | 2136 | 1024 | 514 | 250002 |
| [sergeyzh/LaBSE-ru-sts](https://huggingface.co/sergeyzh/LaBSE-ru-sts) | 42.835 | 8.561 | 490 | 768 | 512 | 55083 |
| sergeyzh/rubert-mini-sts | 6.417 | 5.517 | 123 | 312 | 2048 | 83828 |
| [sergeyzh/rubert-tiny-sts](https://huggingface.co/sergeyzh/rubert-tiny-sts) | 3.208 | 3.379 | 111 | 312 | 2048 | 83828 |
| [Tochka-AI/ruRoPEBert-e5-base-512](https://huggingface.co/Tochka-AI/ruRoPEBert-e5-base-512) | 43.314 | 9.338 | 532 | 768 | 512 | 69382 |
| [cointegrated/LaBSE-en-ru](https://huggingface.co/cointegrated/LaBSE-en-ru) | 42.867 | 8.549 | 490 | 768 | 512 | 55083 |
| [cointegrated/rubert-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) | 3.212 | 3.384 | 111 | 312 | 2048 | 83828 |
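
To compare the models at a glance, the CPU and GPU columns can be expressed relative to `sergeyzh/rubert-mini-sts`. The sketch below copies a few rows from the table above and computes those ratios. Since ratios are unit-free, it makes no assumption about the units the benchmark reports. The helper `relative_cost` is a hypothetical name used only for this illustration.

```python
# (CPU, GPU) values copied from the table above.
timings = {
    "intfloat/multilingual-e5-large": (149.026, 15.629),
    "sergeyzh/LaBSE-ru-sts": (42.835, 8.561),
    "sergeyzh/rubert-mini-sts": (6.417, 5.517),
    "sergeyzh/rubert-tiny-sts": (3.208, 3.379),
}


def relative_cost(name, baseline="sergeyzh/rubert-mini-sts"):
    """Return (CPU ratio, GPU ratio) of `name` relative to `baseline`."""
    cpu, gpu = timings[name]
    base_cpu, base_gpu = timings[baseline]
    return cpu / base_cpu, gpu / base_gpu


for name in timings:
    cpu_ratio, gpu_ratio = relative_cost(name)
    print(f"{name}: CPU x{cpu_ratio:.2f}, GPU x{gpu_ratio:.2f}")
```

A ratio above 1 means the model has a larger value than `sergeyzh/rubert-mini-sts` in that column, and a ratio below 1 means a smaller one.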