metadata
base_model: BAAI/bge-small-en-v1.5
datasets: []
language: []
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:664
  - loss:DenoisingAutoEncoderLoss
widget:
  - source_sentence: of fresh for in for that,, stream_id
    sentences:
      - >-
        Number of functional/operational toilets for boys with disabilities or
        CWSN(Children with special needs) 
      - >-
        Indicates grant for sports and physical education expenditure (in Rs)
        spent by the school during the financial year 2022-2023 under Samagra
        Shiksha, corresponding to the udise_sch_code. 
      - >-
        Number of fresh enrollments for transgenders in class 11 for that
        school. corresponding to udise_sch_code, caste_id, stream_id. 
  - source_sentence: Unique each associated . This in and.
    sentences:
      - >-
        classes in which language 3 i.e ('lang3' column) is taught as a subject.
        Its a comma seperated value. 
      - >-
        Unique identifier code each school, associated with school_name in
        sch_master table. This can be joined with udise_sch_code in sch_profile
        and sch_facility tables.  
      - 'Number of assessments happened for primary section/school '
  - source_sentence: urinals
    sentences:
      - >-
        Unique identifier code for the schools providing vocational courses
        under nsqf and where sectors are available, associated with school name
        in sch_master table. This can be joined with udise_sch_code in
        sch_profile and sch_facility tables. 
      - >-
        Indicates whether there is a reading corner/space/room in school. Can
        only be ['Yes','No'] 
      - 'Number of functional/operational urinals for boys '
  - source_sentence: >-
      total of in-service training by of that from district and training) the
      tch_code_state
    sentences:
      - >-
        Indicates total days of in-service training received by the teacher of
        that school from district institute of education and training(diet),
        corresponding to the udise_sch_code, tch_name, tch_code_state.  
      - >-
        Unique identifier code for each school. This column is crucial for
        aggregating or analyzing data at the school level, such as school-wise
        attendance, performance metrics, or demographic information. 
      - >-
        Indicates whether it is a special school, specifically for disabled
        students. Is school CWSN ( Children with Special Needs ). This can only
        be one of 2 values:['Yes','No'] 
  - source_sentence: >-
      The teacher_id column . This essential related teacher absenteeism or will
      column
    sentences:
      - >-
        Indicates Urban local body ID as per LGD - Local Government Directory
        where the school is present, related to 'lgd_urban_local_body_name' 
      - 'Number of pucca classrooms in good condition in school '
      - >-
        The teacher_id column is a unique identifier used to represent
        individual teachers. This column is essential for retrieving
        teacher-specific information.Queries related to teacher attendance,
        absenteeism, or any teacher-level analysis will likely require this
        column. 

SentenceTransformer based on BAAI/bge-small-en-v1.5

This is a sentence-transformers model finetuned from BAAI/bge-small-en-v1.5. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: BAAI/bge-small-en-v1.5
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 384 dimensions
  • Similarity Function: Cosine Similarity

Model Sources

  • Documentation: Sentence Transformers Documentation (https://sbert.net)
  • Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
  • Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
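
Because the stack ends with a Normalize() module, every embedding is L2-normalized, so dot products and cosine similarities coincide. A quick sanity check (a minimal sketch; the sentence is illustrative):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ravch/fine_tuned_bge_small_en_v1.5_another_data_formate")
emb = model.encode(["Number of functional/operational urinals for boys"])
# The Normalize() module makes each vector unit-length, so this prints ~1.0
print(np.linalg.norm(emb[0]))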

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("ravch/fine_tuned_bge_small_en_v1.5_another_data_formate")
# Run inference
sentences = [
    'The teacher_id column . This essential related teacher absenteeism or will column',
    'The teacher_id column is a unique identifier used to represent individual teachers. This column is essential for retrieving teacher-specific information.Queries related to teacher attendance, absenteeism, or any teacher-level analysis will likely require this column. ',
    "Indicates Urban local body ID as per LGD - Local Government Directory where the school is present, related to 'lgd_urban_local_body_name' ",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
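
Since the model was trained to reconstruct full column descriptions from noisy fragments, a natural use is retrieving the best-matching description for a short query. A minimal sketch (the corpus strings are taken from the widget examples above):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ravch/fine_tuned_bge_small_en_v1.5_another_data_formate")

corpus = [
    "Number of functional/operational urinals for boys",
    "Indicates whether there is a reading corner/space/room in school. Can only be ['Yes','No']",
    "Number of pucca classrooms in good condition in school",
]
query = "urinals"

corpus_embeddings = model.encode(corpus)
query_embedding = model.encode([query])

# Cosine similarity scores; the highest score marks the best-matching description
scores = model.similarity(query_embedding, corpus_embeddings)
print(corpus[scores.argmax().item()])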

Training Details

Training Dataset

Unnamed Dataset

  • Size: 664 training samples
  • Columns: sentence_0 and sentence_1
  • Approximate statistics based on the first 1000 samples:
      sentence_0 (string): min 3 tokens, mean 15.88 tokens, max 127 tokens
      sentence_1 (string): min 7 tokens, mean 36.37 tokens, max 311 tokens
  • Samples:
      sentence_0: Number of Girls Defense
      sentence_1: Number of Girls Student provided Self Defense training

      sentence_0: whether is While filtering, must 0 (int active.
      sentence_1: Indicate whether school is active or inactive. While filtering only consider active schools, but When asked for total schools must consider active and inactive schools. 0(int) indicates active schools.

      sentence_0: classes in which language i.e 'lang2 as a subject a comma seperated
      sentence_1: classes in which language 2 i.e ('lang2' column) is taught as a subject. Its a comma seperated value.
  • Loss: DenoisingAutoEncoderLoss
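
The sentence_0 column is a noise-damaged copy of sentence_1, which is exactly the pair format DenoisingAutoEncoderLoss expects. A minimal sketch of how such pairs can be generated and the loss attached using the public sentence-transformers API (the training sentences below are illustrative stand-ins for the 664 column descriptions, not the actual dataset):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses
from sentence_transformers.datasets import DenoisingAutoEncoderDataset

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

train_sentences = [
    "Number of functional/operational urinals for boys",
    "Indicates whether there is a reading corner/space/room in school.",
]

# Wraps each sentence into a (noisy, original) pair by randomly deleting tokens
train_dataset = DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)

# TSDAE: a tied decoder must reconstruct the original sentence from the pooled embedding
train_loss = losses.DenoisingAutoEncoderLoss(model, tie_encoder_decoder=True)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=50)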

Training Hyperparameters

Non-Default Hyperparameters

  • num_train_epochs: 50
  • multi_dataset_batch_sampler: round_robin
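
These two values map onto the v3 trainer API as follows (a hedged sketch; output_dir is a placeholder):

from sentence_transformers.training_args import (
    SentenceTransformerTrainingArguments,
    MultiDatasetBatchSamplers,
)

args = SentenceTransformerTrainingArguments(
    output_dir="output",  # placeholder
    num_train_epochs=50,
    per_device_train_batch_size=8,  # default value, listed below
    multi_dataset_batch_sampler=MultiDatasetBatchSamplers.ROUND_ROBIN,
)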

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: no
  • prediction_loss_only: True
  • per_device_train_batch_size: 8
  • per_device_eval_batch_size: 8
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 50
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin

Training Logs

Epoch Step Training Loss
6.0241 500 2.0771
12.0482 1000 0.4663
18.0723 1500 0.2979
24.0964 2000 0.2476
30.1205 2500 0.2341
36.1446 3000 0.2321
42.1687 3500 0.2116
48.1928 4000 0.2012

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.0.1
  • Transformers: 4.42.4
  • PyTorch: 2.3.1+cu121
  • Accelerate: 0.32.1
  • Datasets: 2.21.0
  • Tokenizers: 0.19.1
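
To reproduce this environment, the versions above can be pinned:

pip install sentence-transformers==3.0.1 transformers==4.42.4 accelerate==0.32.1 datasets==2.21.0 tokenizers==0.19.1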

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

DenoisingAutoEncoderLoss

@inproceedings{wang-2021-TSDAE,
    title = "TSDAE: Using Transformer-based Sequential Denoising Auto-Encoderfor Unsupervised Sentence Embedding Learning",
    author = "Wang, Kexin and Reimers, Nils and Gurevych, Iryna", 
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    pages = "671--688",
    url = "https://arxiv.org/abs/2104.06979",
}