Our original base similarity Matryoshka

This is a sentence-transformers model fine-tuned from Ghani-25/LF_enrich_sim on the json dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: Ghani-25/LF_enrich_sim
  • Maximum Sequence Length: 128 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset:
    • json
  • Language: multilingual
  • License: apache-2.0

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
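
The Pooling module above enables only pooling_mode_mean_tokens, so a sentence embedding is the average of its token embeddings. The following is a minimal sketch of that idea, assuming padding positions are excluded through an attention mask. The function name mean_pool and the 2-dimensional toy vectors are illustrative and not taken from the model.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1; padding (0) is ignored."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    # Guard against an all-padding input to avoid division by zero.
    count = max(count, 1)
    return [value / count for value in total]


# Toy example: three real tokens and one padding token, 2 dimensions each.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)  # [3.0, 4.0]
```

The real model does the same over 768-dimensional XLM-RoBERTa token outputs, truncated to at most 128 tokens.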

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("Ghani-25/LF-enrich-sim-matryoshka-64")
# Run inference
sentences = [
    'Summer Job: Export Manager',
    'Responsable Export Afrique Amériques',
    'Clinical Project Leader',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]

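# Extract the diagonal of the similarity matrix.
# Both arguments above are the same embeddings, so the diagonal holds each
# sentence's similarity with itself (values close to 1.0). To score sentence
# pairs, encode the two lists separately and pass both to model.similarity().
similarities_diagonal = similarities.diag().cpu().numpy()
print(similarities_diagonal)
# The value reported here, [0.896542], presumably comes from a single sentence pair scored that way.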
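
The snippet above compares one sentence list with itself. Since the model was trained with MatryoshkaLoss, a common use is to score aligned sentence1/sentence2 pairs on truncated embeddings. The sketch below shows the arithmetic only: it keeps the first values of each vector, rescales to unit length, and takes the cosine per pair. The 4-dimensional vectors are placeholders for model.encode() output, and the helper names are hypothetical.

```python
import math


def truncate_and_normalize(vector, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    head = vector[:dim]
    norm = math.sqrt(sum(v * v for v in head))
    return [v / norm for v in head] if norm > 0 else head


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Placeholder vectors standing in for model.encode(sentences1) and
# model.encode(sentences2); real embeddings have 768 dimensions.
left = [[3.0, 4.0, 1.0, 0.0], [1.0, 0.0, 2.0, 2.0]]
right = [[4.0, 3.0, 0.0, 5.0], [0.0, 1.0, 2.0, 2.0]]

dim = 2  # toy truncation size; the card evaluates 768, 512, 256, 128 and 64
pair_scores = [
    cosine(truncate_and_normalize(a, dim), truncate_and_normalize(b, dim))
    for a, b in zip(left, right)
]
print(pair_scores)
```

Recent sentence-transformers releases also appear to offer a truncate_dim argument on the SentenceTransformer constructor and a similarity_pairwise method that cover the same two steps; check the documentation for the installed version before relying on them.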

Evaluation

Metrics

Semantic Similarity

Metric dim_768 dim_512 dim_256 dim_128 dim_64
pearson_cosine 0.9696 0.9693 0.9662 0.9606 0.9464
spearman_cosine 0.9472 0.9466 0.9408 0.9315 0.9101
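
To make the trade-off in the table easier to read, the short calculation below expresses each Spearman score as a fraction of the 768-dimensional score, next to the reduction in vector size. The numbers are copied from the table; the variable names are illustrative.

```python
# Spearman cosine scores copied from the table above.
spearman = {768: 0.9472, 512: 0.9466, 256: 0.9408, 128: 0.9315, 64: 0.9101}

full = spearman[768]
retained = {dim: score / full for dim, score in spearman.items()}

for dim, fraction in retained.items():
    shrink = 768 / dim
    print(f"dim {dim:>3}: {100 * fraction:.2f}% of the full score, "
          f"vectors {shrink:.1f}x smaller")
```

On these figures, the 64-dimensional embeddings keep roughly 96% of the full Spearman score while being 12 times smaller.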

Training Details

Training Dataset

json

  • Dataset: json
  • Columns: sentence1, sentence2, and label
  • Approximate statistics based on the first 1000 samples:
    • sentence1 (string): min 3 tokens, mean 10.22 tokens, max 30 tokens
    • sentence2 (string): min 3 tokens, mean 9.98 tokens, max 67 tokens
    • label (float): min -0.05, mean 0.37, max 0.98
  • Samples (sentence1 | sentence2 | label):
    • Contributive filmer | Doctorant contractuel (2016-2019) | 0.20986526
    • Responsable Développement et Communication | Bilingual Business Assistant | 0.3238712
    • Law Trainee | Sales Director contract manager | 0.24983984
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "CosineSimilarityLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
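
As a rough illustration of what these parameters mean, the sketch below computes a Matryoshka-style objective by hand: for each listed dimension it truncates both embeddings, takes their cosine similarity, measures the squared error against the label, and adds the weighted results. It assumes the base CosineSimilarityLoss uses mean squared error, and that n_dims_per_step of -1 means every dimension is used at each step. The function names and toy vectors are hypothetical, and this is not the library's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def matryoshka_cosine_loss(pairs, labels, dims, weights):
    """Weighted sum over `dims` of the MSE between cosine score and label."""
    total = 0.0
    for dim, weight in zip(dims, weights):
        squared_errors = [
            (cosine(a[:dim], b[:dim]) - label) ** 2
            for (a, b), label in zip(pairs, labels)
        ]
        total += weight * sum(squared_errors) / len(squared_errors)
    return total


# Toy 4-dimensional embeddings standing in for 768-dimensional ones.
pairs = [([1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0, 1.0])]
labels = [0.5]

# Full vectors give cosine 0.5 (no error); the first 2 values give cosine 1.0.
loss = matryoshka_cosine_loss(pairs, labels, dims=[4, 2], weights=[1, 1])
print(loss)
```

Because every prefix of the embedding is penalised with equal weight, the leading dimensions are pushed to carry most of the similarity signal, which is what makes truncation usable at inference time.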
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 16
  • gradient_accumulation_steps: 16
  • learning_rate: 2e-05
  • num_train_epochs: 4
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.1
  • bf16: True
  • tf32: True
  • load_best_model_at_end: True
  • optim: adamw_torch_fused
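
The values above interact: the batch size and gradient accumulation set how many samples feed each optimizer step, and the warmup ratio and cosine scheduler shape the learning rate. The sketch below works through both under stated assumptions: a single device, 244 total optimizer steps (the last step in the training logs), warmup length rounded up, and a single half-cosine decay to zero. It is a hand-written approximation, not the trainer's own scheduler.

```python
import math

per_device_train_batch_size = 32
gradient_accumulation_steps = 16
learning_rate = 2e-05
warmup_ratio = 0.1
total_steps = 244  # assumption: last optimizer step shown in the training logs

# Samples consumed per optimizer step on one device.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# Assumption: warmup length is the ratio applied to the total steps, rounded up.
warmup_steps = math.ceil(warmup_ratio * total_steps)


def lr_at(step):
    """Linear warmup to the peak rate, then a half-cosine decay to zero."""
    if step < warmup_steps:
        return learning_rate * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))


print("effective batch size:", effective_batch_size)
print("warmup steps:", warmup_steps)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, lr_at(step))
```

With more than one device, the effective batch size would be multiplied by the device count, which the card does not state.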

All Hyperparameters

Contact the author.

Training Logs

Epoch Step Training Loss dim_768_spearman_cosine dim_512_spearman_cosine dim_256_spearman_cosine dim_128_spearman_cosine dim_64_spearman_cosine
0.1624 10 0.0669 - - - - -
0.3249 20 0.0563 - - - - -
0.4873 30 0.0496 - - - - -
0.6497 40 0.0456 - - - - -
0.8122 50 0.0418 - - - - -
0.9746 60 0.0407 - - - - -
0.9909 61 - 0.9223 0.9199 0.9087 0.8920 0.8586
1.1371 70 0.0326 - - - - -
1.2995 80 0.0312 - - - - -
1.4619 90 0.0303 - - - - -
1.6244 100 0.03 - - - - -
1.7868 110 0.0291 - - - - -
1.9492 120 0.0301 - - - - -
1.9980 123 - 0.9393 0.9382 0.9304 0.9191 0.8946
2.1117 130 0.0257 - - - - -
2.2741 140 0.0243 - - - - -
2.4365 150 0.0246 - - - - -
2.5990 160 0.0235 - - - - -
2.7614 170 0.024 - - - - -
2.9239 180 0.023 - - - - -
2.9888 184 - 0.9464 0.9457 0.9396 0.9301 0.9083
3.0863 190 0.0222 - - - - -
3.2487 200 0.022 - - - - -
3.4112 210 0.022 - - - - -
3.5736 220 0.0226 - - - - -
3.7360 230 0.021 - - - - -
3.8985 240 0.0224 - - - - -
3.9635 244 - 0.9472 0.9466 0.9408 0.9315 0.9101
  • The saved checkpoint corresponds to the final evaluation row (epoch 3.9635, step 244), whose scores match those reported under Evaluation.
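
The evaluation rows in the log can be compared programmatically. The sketch below copies the four evaluation rows and picks the one with the highest dim_768 Spearman score. The card sets load_best_model_at_end but does not state which metric selected the checkpoint, so ranking by dim_768 is an assumption made for illustration.

```python
# Evaluation rows copied from the log above:
# (epoch, step, spearman_cosine per embedding dimension).
eval_rows = [
    (0.9909, 61, {768: 0.9223, 512: 0.9199, 256: 0.9087, 128: 0.8920, 64: 0.8586}),
    (1.9980, 123, {768: 0.9393, 512: 0.9382, 256: 0.9304, 128: 0.9191, 64: 0.8946}),
    (2.9888, 184, {768: 0.9464, 512: 0.9457, 256: 0.9396, 128: 0.9301, 64: 0.9083}),
    (3.9635, 244, {768: 0.9472, 512: 0.9466, 256: 0.9408, 128: 0.9315, 64: 0.9101}),
]

# Assumption: rank checkpoints by the full-dimension Spearman score.
best_epoch, best_step, best_scores = max(eval_rows, key=lambda row: row[2][768])

# Improvement from the first to the best evaluation, per dimension.
first_scores = eval_rows[0][2]
gains = {dim: round(best_scores[dim] - first_scores[dim], 4) for dim in best_scores}

print("best step:", best_step)
print("gain since first evaluation:", gains)
```

On these rows, the smaller dimensions improve the most over training, which is consistent with the Matryoshka objective gradually concentrating information in the leading dimensions.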

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.3.1
  • Transformers: 4.41.2
  • PyTorch: 2.5.1+cu121
  • Accelerate: 1.1.1
  • Datasets: 2.19.1
  • Tokenizers: 0.19.1

Model size: 278M params (Safetensors, tensor type F32)