poc-embeddings/rubert-tiny-turbo-godeal

@@ -9,6 +9,101 @@ tags:
 - text classification
 ---
-This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
-- Library: poc-embeddings/rubert-tiny-turbo-godeal
-- Docs: https://pytorch.org/docs/stable/index.html

+This is `rubert-tiny-turbo` fine-tuned to classify the type of messages posted in Telegram marketplaces.
+Labels:
+ - **supply**: somebody is willing to sell something or provide a service
+ - **demand**: somebody wants to buy something or hire somebody
+ - **noise**: messages unrelated to the topic
+## Usage
+``` python
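+from typing import Optional
+
+import torch
+from huggingface_hub import PyTorchModelHubMixin
+from torch.nn import CrossEntropyLoss, Linear, Module, TransformerEncoderLayer
+from transformers import AutoModel, AutoTokenizer
+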
+HF_MODEL_NAME = 'poc-embeddings/rubert-tiny-turbo-godeal'
+MODEL_NAME = 'sergeyzh/rubert-tiny-turbo'
+# Index 0 is assumed to be 'supply', following the order of the label list above
+id2label = {0: 'supply', 1: 'demand', 2: 'noise'}
+class SupplyDemandTrader(
+    Module,
+    PyTorchModelHubMixin,
+    repo_url=HF_MODEL_NAME,
+    library_name="torch",
+    tags=["PyTorch", "sentence-transformers", "NLP", "text classification"],
+    docs_url="https://pytorch.org/docs/stable/index.html"
+):
+    def __init__(self,
+                 num_labels: Optional[int] = 3,
+                 use_adapter: bool = False
+                 ):
+        super().__init__()
+        self.use_adapter = use_adapter
+        self.num_labels = num_labels
+        self.backbone = AutoModel.from_pretrained(MODEL_NAME)
+        # Adapter layer
+        if self.use_adapter:
+            self.adapter = TransformerEncoderLayer(
+                d_model=self.backbone.config.hidden_size,
+                nhead=self.backbone.config.num_attention_heads,
+                dim_feedforward=self.backbone.config.intermediate_size,
+                activation="gelu",
+                dropout=0.1,
+                batch_first=True  # I/O shape: batch, seq, feature
+            )
+        else:
+            self.adapter = None
+        # Classification head
+        self.separator_head = Linear(self.backbone.config.hidden_size, num_labels)
+        self.loss = CrossEntropyLoss()
+
+    def forward(self,
+                input_ids: torch.Tensor,
+                attention_mask: torch.Tensor,
+                labels: Optional[torch.Tensor] = None
+                ) -> dict[str, torch.Tensor]:
+        outputs = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
+        last_hidden_state = outputs.last_hidden_state
+        if self.use_adapter:
+            last_hidden_state = self.adapter(last_hidden_state)
+        cls_embedding = last_hidden_state[:, 0]
+        logits = self.separator_head(cls_embedding)
+        if labels is not None:
+            loss = self.loss(logits, labels)
+            return {
+                "loss": loss,
+                "logits": logits,
+                "embedding": cls_embedding
+            }
+        return {
+            "logits": logits,
+            "embedding": cls_embedding
+        }
+
+model = SupplyDemandTrader.from_pretrained(HF_MODEL_NAME)
+tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_NAME)
+model.eval()
+with torch.inference_mode():
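+    ids = tokenizer("Куплю Iphone 8", return_tensors="pt")  # Russian: "I will buy an iPhone 8"
+    # forward() returns a dict; the class scores are stored under "logits"
+    logits = model(ids['input_ids'], ids['attention_mask'])['logits']
+    preds = torch.argmax(logits, dim=-1)
+    print(id2label[int(preds)])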
+```
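The usage snippet prints only the top label. To also see how confident the model is, the logits can be turned into per-class probabilities with a softmax. The following is a minimal sketch, not part of the original model card: the `logits` tensor holds made-up values standing in for the model's `"logits"` output, and the `id2label` order (supply, demand, noise) is an assumption.

```python
import torch

# Assumed label order (supply, demand, noise); verify it against the trained model.
id2label = {0: "supply", 1: "demand", 2: "noise"}

# Made-up logits for a batch of two messages, standing in for model(...)["logits"].
logits = torch.tensor([[0.2, 3.1, -1.0],
                       [-0.5, 0.1, 2.4]])

probs = torch.softmax(logits, dim=-1)      # each row sums to 1
pred_ids = probs.argmax(dim=-1).tolist()   # index of the most likely class per row
pred_labels = [id2label[i] for i in pred_ids]

for label, row in zip(pred_labels, probs.tolist()):
    print(label, [round(p, 3) for p in row])
```

Thresholding the top probability is one possible way to route low-confidence messages to manual review instead of trusting the predicted label.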
+## Training
+The backbone was trained on a clustered dataset for a matching problem. The model was then partially unfrozen and trained with a classification head on a custom dataset of exports from various Telegram chats.
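Partial unfreezing is usually done by toggling `requires_grad` on selected parameters. The sketch below is illustrative only: `ToyClassifier` and `partially_unfreeze` are hypothetical names, the small encoder is a stand-in for the real backbone, and unfreezing just the last encoder layer is an assumption, since the card does not say which layers were trained.

```python
import torch
from torch import nn


class ToyClassifier(nn.Module):
    """Small stand-in for a backbone plus classification head (not the real model)."""

    def __init__(self, hidden: int = 32, num_layers: int = 3, num_labels: int = 3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=4, dim_feedforward=64, batch_first=True
        )
        self.backbone = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(hidden, num_labels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Classify from the first position, mirroring the CLS-style pooling above.
        return self.head(self.backbone(x)[:, 0])


def partially_unfreeze(model: ToyClassifier, trainable_layers: int = 1) -> None:
    """Freeze the backbone, then re-enable gradients for its last layers and the head."""
    for p in model.backbone.parameters():
        p.requires_grad = False
    layers = model.backbone.layers
    for layer in layers[len(layers) - trainable_layers:]:
        for p in layer.parameters():
            p.requires_grad = True
    for p in model.head.parameters():
        p.requires_grad = True


model = ToyClassifier()
partially_unfreeze(model, trainable_layers=1)

# Only parameters that still require gradients are handed to the optimizer.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
print(len(trainable), "trainable parameter tensors")
```

Keeping the lower layers frozen preserves the representations learned in the earlier training stage while the upper layers and the head adapt to the classification task.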
+Reported classification metrics:
+```
+weighted average precision : 0.946
+weighted average f1-score : 0.945
+macro average precision : 0.943
+macro average f1-score : 0.945
+```
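The macro average gives every class equal weight, while the weighted average weights each class by its support (the number of true examples of that class). A minimal sketch with made-up labels, unrelated to the model's actual evaluation data, shows how the two are computed; `per_class_scores` is a hypothetical helper written for this illustration.

```python
from collections import Counter

LABELS = ["supply", "demand", "noise"]

# Made-up ground truth and predictions, for illustration only.
y_true = ["supply", "supply", "demand", "noise", "noise", "noise", "noise", "demand"]
y_pred = ["supply", "demand", "demand", "noise", "noise", "supply", "noise", "demand"]


def per_class_scores(y_true, y_pred, label):
    """Return (precision, f1) for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, f1


support = Counter(y_true)  # number of true examples per class
scores = {label: per_class_scores(y_true, y_pred, label) for label in LABELS}

# Macro: plain mean over classes. Weighted: mean weighted by class support.
macro_precision = sum(scores[l][0] for l in LABELS) / len(LABELS)
macro_f1 = sum(scores[l][1] for l in LABELS) / len(LABELS)
weighted_precision = sum(scores[l][0] * support[l] for l in LABELS) / len(y_true)
weighted_f1 = sum(scores[l][1] * support[l] for l in LABELS) / len(y_true)

print(f"macro precision    : {macro_precision:.3f}")
print(f"macro f1-score     : {macro_f1:.3f}")
print(f"weighted precision : {weighted_precision:.3f}")
print(f"weighted f1-score  : {weighted_f1:.3f}")
```

When the two averages are close, as in the figures reported above, performance is similar across classes regardless of how many examples each class has.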