mhaseeb1604/bge-m3-law

This model is a fine-tuned version of BAAI/bge-m3, specialized for sentence similarity on Arabic legal texts written in Arabic and English. It maps sentences and paragraphs to a 1024-dimensional dense vector space that can be used for tasks such as clustering and semantic search.

Model Overview

  • Architecture: Based on sentence-transformers.
  • Training Data: Fine-tuned on a large Arabic law dataset containing bilingual Arabic and English text.
  • Embedding Size: 1024 dimensions, suitable for extracting semantically meaningful embeddings from text.
  • Applications: Ideal for legal applications, such as semantic similarity comparisons, document clustering, and retrieval in a bilingual Arabic-English legal context.

Installation

To use this model, you need to have the sentence-transformers library installed. You can install it via pip:

pip install -U sentence-transformers

Usage

You can easily load and use this model in Python with the following code:

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer('mhaseeb1604/bge-m3-law')

# Sample sentences
sentences = ["This is an example sentence", "Each sentence is converted"]

# Generate embeddings
embeddings = model.encode(sentences)

# Output embeddings
print(embeddings)
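
Once embeddings are available, a common next step is ranking documents against a query by cosine similarity. The following is a minimal, dependency-free sketch of that step, not part of the model's own API: the helper names `cosine_similarity` and `rank_documents` are hypothetical, and the 3-dimensional vectors are toy stand-ins for the 1024-dimensional output of `model.encode`.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy vectors (assumption: real inputs would come from model.encode(...)).
query = [1.0, 0.0, 0.0]
docs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # nearly parallel to the query
    [0.5, 0.5, 0.0],  # in between
]

ranking = rank_documents(query, docs)
for index, score in ranking:
    print(index, round(score, 4))
```

With the real model, the vectors returned by `model.encode(...)` can be converted with `.tolist()` and passed to these helpers. In practice the library's own `sentence_transformers.util.cos_sim` is likely the more convenient choice for batches.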

Model Training

The model was fine-tuned on Arabic and English legal texts using the following configurations:

  • DataLoader:
    • Batch size: 4
    • Sampler: SequentialSampler
  • Loss Function: MultipleNegativesRankingLoss with cosine similarity.
  • Optimizer: AdamW with learning rate 2e-05.
  • Training Parameters:
    • Epochs: 2
    • Warmup Steps: 20
    • Weight Decay: 0.01
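
To illustrate what MultipleNegativesRankingLoss optimizes, here is a small pure-Python sketch of the idea: within a batch, each anchor should score highest against its own positive, while the other positives in the batch serve as negatives. This is an illustration under stated assumptions, not the training code used for this model. The function name is hypothetical, and `scale=20.0` is the sentence-transformers default rather than a value confirmed by this card.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for
    anchors[i] and every other positives[j] acts as an in-batch negative.
    scale is an assumption (library default), not taken from this card."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine_similarity(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy batch of 4 pairs, matching the batch size listed above.
anchors = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
aligned_positives = [list(v) for v in anchors]
# Rotate the positives so every anchor is paired with the wrong text.
misaligned_positives = aligned_positives[1:] + aligned_positives[:1]

aligned_loss = multiple_negatives_ranking_loss(anchors, aligned_positives)
misaligned_loss = multiple_negatives_ranking_loss(anchors, misaligned_positives)
print(aligned_loss, misaligned_loss)
```

The loss is close to zero when each anchor is most similar to its own positive and grows large when the pairs are mismatched. With a batch size of 4, each anchor sees only 3 in-batch negatives per step.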

Full Model Architecture

This model consists of three main components:

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) - XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False})
  (2): Normalize()
)
  • Transformer Layer: Uses an XLM-RoBERTa model with a maximum sequence length of 8192 tokens.
  • Pooling Layer: Uses CLS token pooling (mean pooling is disabled) to produce a single 1024-dimensional sentence embedding.
  • Normalization Layer: L2-normalizes the output vectors to unit length, so that dot product and cosine similarity give the same score.
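
The pooling and normalization steps can be sketched in a few lines. This is a simplified, pure-Python illustration with toy 2-dimensional token vectors, not the library's implementation. The helper names `cls_pool` and `l2_normalize` are hypothetical, and the sketch assumes the first token position plays the CLS role (for XLM-RoBERTa, the `<s>` token).

```python
import math

def cls_pool(token_embeddings):
    """Use the first token's vector (the CLS position) as the sentence vector."""
    return list(token_embeddings[0])

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Toy per-token outputs of the transformer (real vectors have 1024 dimensions).
token_embeddings = [
    [3.0, 4.0],   # CLS position: the only vector that pooling keeps
    [10.0, 0.0],  # remaining tokens are ignored by CLS pooling
    [0.0, -7.0],
]

sentence_vec = l2_normalize(cls_pool(token_embeddings))
unit_length = math.sqrt(sum(x * x for x in sentence_vec))
print(sentence_vec, unit_length)
```

Because every output vector has unit length, comparing two embeddings with a plain dot product yields their cosine similarity.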

Citing & Authors

If you find this repository useful, please consider giving it a star and citing it:

@misc {muhammad_haseeb_2024,
    author       = { {Muhammad Haseeb} },
    title        = { bge-m3-law (Revision 2fc0289) },
    year         = 2024,
    url          = { https://huggingface.co/mhaseeb1604/bge-m3-law },
    doi          = { 10.57967/hf/3217 },
    publisher    = { Hugging Face }
}