---
license: unlicense
pipeline_tag: sentence-similarity
language:
- ru
tags:
- PyTorch
- Transformers
- e-commerce
- encoder
---

A SentencePiece tokenizer was applied to a corpus of 269 million Russian search queries.
The encoder model was trained for the e-commerce search-query similarity task, where queries are short.
The manually annotated validation dataset comprises 362,000 instances.

![Validation results](https://huggingface.co/fkrasnov2/SBE/resolve/main/bvf_recall1k_query_len_eng.svg)

```python
# Don't forget the tokenizer dependencies:
# pip install protobuf sentencepiece
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('fkrasnov2/SBE')
tokenizer = AutoTokenizer.from_pretrained('fkrasnov2/SBE')

# Tokenize one query ("чёрное платье" means "black dress"), truncated to the model's maximum length.
input_ids = tokenizer.encode("чёрное платье", max_length=model.config.max_position_embeddings, truncation=True, return_tensors='pt')

model.eval()
with torch.no_grad():
    # Take the first token's vector from the first output as the query embedding.
    vector = model(input_ids=input_ids, attention_mask=input_ids != tokenizer.pad_token_id)[0][0, 0]

assert model.config.hidden_size == vector.shape[0]
```

This model is designed for use in e-commerce information retrieval (IR) and helps differentiate products.

**The same products**:
- cos ( SBE("apple 16 синий про макс 256"), SBE("iphone 16 синий pro max 256") ) = 0.96
- cos ( SBE("iphone 15 pro max"), SBE("айфон 15 про макс") ) = 0.98

**Different products**:
- cos ( SBE("iphone 15 pro max"), SBE("iphone 16 pro max") ) = 0.85
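
The cosine comparisons listed above can be reproduced along the following lines. This is a minimal sketch, not the author's evaluation code: the helper names `cosine`, `load_sbe`, and `sbe_embed` are hypothetical, and the embedding is assumed to be the first-token vector, as in the usage snippet on this card. The heavy imports are deferred into the functions that need them, so the similarity helper can be used on its own.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length sequences of floats."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def load_sbe(name="fkrasnov2/SBE"):
    """Load model and tokenizer (needs transformers, torch, sentencepiece, protobuf)."""
    from transformers import AutoModel, AutoTokenizer  # deferred: heavy dependency

    model = AutoModel.from_pretrained(name)
    tokenizer = AutoTokenizer.from_pretrained(name)
    model.eval()
    return model, tokenizer


def sbe_embed(query, model, tokenizer):
    """Embed one query; assumes the first-token vector is the query embedding."""
    import torch  # deferred: heavy dependency

    input_ids = tokenizer.encode(
        query,
        max_length=model.config.max_position_embeddings,
        truncation=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        output = model(
            input_ids=input_ids,
            attention_mask=input_ids != tokenizer.pad_token_id,
        )
    return output[0][0, 0].tolist()


if __name__ == "__main__":
    model, tokenizer = load_sbe()
    a = sbe_embed("iphone 15 pro max", model, tokenizer)
    b = sbe_embed("айфон 15 про макс", model, tokenizer)
    # This card reports 0.98 for this pair; the exact value may vary with versions.
    print(round(cosine(a, b), 2))
```

For ranking many candidates against one query, it is usually cheaper to embed each string once and reuse the vectors than to call `sbe_embed` for every pair.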