metadata

library_name: torch
tags:
  - NLP
  - PyTorch
  - model_hub_mixin
  - pytorch_model_hub_mixin
  - sentence-transformers
  - text classification

This is rubert-tiny-turbo (`sergeyzh/rubert-tiny-turbo`) fine-tuned to classify the type of messages posted in Telegram marketplace chats.

Labels:

supply: somebody is willing to sell something or provide a service
demand: somebody wants to buy something or hire somebody
noise: the message is unrelated to buying, selling, or hiring

Usage

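from typing import Optional

import torch
from huggingface_hub import PyTorchModelHubMixin
from torch.nn import CrossEntropyLoss, Linear, Module, TransformerEncoderLayer
from transformers import AutoModel, AutoTokenizer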

HF_MODEL_NAME = 'poc-embeddings/rubert-tiny-turbo-godeal'
MODEL_NAME = 'sergeyzh/rubert-tiny-turbo'

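# Assumption: indices follow the order of the Labels list above; verify against the training config.
id2label = {0: 'supply', 1: 'demand', 2: 'noise'}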

class SupplyDemandTrader(
    Module,
    PyTorchModelHubMixin, 
    repo_url=HF_MODEL_NAME,
    library_name="torch",
    tags=["PyTorch", "sentence-transformers", "NLP", "text classification"],
    docs_url="https://pytorch.org/docs/stable/index.html"
):
    def __init__(self, 
                 num_labels: Optional[int] = 3, 
                 use_adapter: bool = False
                 ):
        super().__init__()
        self.use_adapter = use_adapter
        self.num_labels = num_labels
        self.backbone = AutoModel.from_pretrained(MODEL_NAME)
        
        # Adapter layer
        if self.use_adapter:
            self.adapter = TransformerEncoderLayer(
                d_model=self.backbone.config.hidden_size, 
                nhead=self.backbone.config.num_attention_heads,  
                dim_feedforward=self.backbone.config.intermediate_size, 
                activation="gelu",
                dropout=0.1,
                batch_first=True # I/O shape: batch, seq, feature
            )
        else:
            self.adapter = None
        
        # Classification head
        self.separator_head = Linear(self.backbone.config.hidden_size, num_labels)
        self.loss = CrossEntropyLoss()
        
    def forward(self, 
                input_ids: torch.Tensor, 
                attention_mask: torch.Tensor, 
                labels: Optional[torch.Tensor] = None
                ) -> dict[str, torch.Tensor]:
        outputs = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden_state = outputs.last_hidden_state
        
        if self.use_adapter:
            last_hidden_state = self.adapter(last_hidden_state)
        cls_embedding = last_hidden_state[:, 0]
    
        logits = self.separator_head(cls_embedding)
        
        if labels is not None:
            loss = self.loss(logits, labels)
            return {
                "loss": loss, 
                "logits": logits, 
                "embedding": cls_embedding
            }
        return {
            "logits": logits, 
            "embedding": cls_embedding
        }


model = SupplyDemandTrader.from_pretrained(HF_MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_NAME)
model.eval()

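with torch.inference_mode():
    # "Куплю Iphone 8" is Russian for "I will buy an iPhone 8" (a demand message)
    ids = tokenizer("Куплю Iphone 8", return_tensors="pt")
    outputs = model(ids["input_ids"], ids["attention_mask"])  # dict with "logits" and "embedding"
    preds = torch.argmax(outputs["logits"], dim=-1)
    print(id2label[int(preds)])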
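
The argmax alone hides how sure the model is. As a minimal sketch, assuming the logits for one message have been converted to a plain list of floats (for example with `.tolist()` on the "logits" entry returned by the model), a softmax gives a probability next to the predicted label. The helper names, the example logits, and the label order below are illustrative assumptions, not part of the published checkpoint.

```python
import math
from typing import Dict, List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits: List[float], id2label: Dict[int, str]) -> Tuple[str, float]:
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Hypothetical logits for a single message (made up for illustration).
example_logits = [0.2, 3.1, -1.0]
# Assumed label order; check it against the mapping used by the checkpoint.
example_id2label = {0: "supply", 1: "demand", 2: "noise"}

label, confidence = top_label(example_logits, example_id2label)
print(label, round(confidence, 3))
```

A low top probability can be used to route borderline messages to manual review instead of trusting the label.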

Training

The backbone was first trained on a clustered dataset for a matching task. The model was then partially unfrozen and trained together with the classification head on a custom dataset of exports from various Telegram chats.
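
Partial unfreezing generally means keeping most backbone weights fixed and letting only some layers and the head receive gradients. The card does not say which layers were unfrozen, so the following is only a sketch of the general idea on a hypothetical stand-in model (`TinyClassifier`); the helper `partially_unfreeze`, the layer count, and the learning rate are assumptions for illustration.

```python
import torch
from torch import nn


class TinyClassifier(nn.Module):
    """Hypothetical stand-in: a stack of layers (the "backbone") plus a linear head."""

    def __init__(self, hidden: int = 8, num_layers: int = 3, num_labels: int = 3):
        super().__init__()
        self.backbone = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(num_layers)])
        self.head = nn.Linear(hidden, num_labels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.backbone:
            x = torch.relu(layer(x))
        return self.head(x)


def partially_unfreeze(model: TinyClassifier, trainable_layers: int) -> None:
    """Freeze the backbone, then re-enable gradients for its last `trainable_layers` layers."""
    for param in model.backbone.parameters():
        param.requires_grad = False
    if trainable_layers > 0:
        for layer in list(model.backbone)[-trainable_layers:]:
            for param in layer.parameters():
                param.requires_grad = True


toy_model = TinyClassifier()
partially_unfreeze(toy_model, trainable_layers=1)

# Only parameters that still require gradients are handed to the optimizer.
optimizer = torch.optim.AdamW(
    [p for p in toy_model.parameters() if p.requires_grad], lr=1e-4
)
trainable = [name for name, p in toy_model.named_parameters() if p.requires_grad]
print(trainable)
```

The head stays trainable throughout because it is never frozen; only the backbone layers are toggled.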

weighted average precision: 0.946
weighted average f1-score: 0.945
macro average precision: 0.943
macro average f1-score: 0.945
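
For readers unfamiliar with the two averaging schemes: a macro average weights every class equally, while a weighted average weights each class by its number of examples (its support). The sketch below shows the difference on made-up per-class scores and supports; these numbers are purely illustrative and are not the per-class results of this model.

```python
from typing import Dict


def macro_average(scores: Dict[str, float]) -> float:
    """Unweighted mean over classes: every class counts the same."""
    return sum(scores.values()) / len(scores)


def weighted_average(scores: Dict[str, float], support: Dict[str, int]) -> float:
    """Mean over classes weighted by the number of examples per class."""
    total = sum(support.values())
    return sum(scores[name] * support[name] for name in scores) / total


# Made-up per-class F1 scores and class sizes, for illustration only.
toy_f1 = {"supply": 0.9, "demand": 0.8, "noise": 1.0}
toy_support = {"supply": 50, "demand": 30, "noise": 20}

macro = macro_average(toy_f1)
weighted = weighted_average(toy_f1, toy_support)
print(round(macro, 3), round(weighted, 3))  # prints: 0.9 0.89
```

When the two averages are close, as in the reported metrics, performance is similar across frequent and rare classes.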