metadata
license: mit
datasets:
  - sujet-ai/Sujet-Financial-RAG-EN-Dataset
language:
  - en
metrics:
  - accuracy
pipeline_tag: sentence-similarity
tags:
  - finance
  - embedding
  - embedding model
  - financial qa
  - bge
  - sentence transformers
  - financial rag

Marsilia-Embeddings-EN-Large 🚀

Introduction 🌟

Marsilia-Embeddings-EN-Large is an English-language embedding model designed specifically for financial-domain tasks. It serves as a proof of concept, demonstrating how much fine-tuning an embedding model for a specific task matters in Retrieval-Augmented Generation (RAG) applications.

By focusing on the financial domain, Marsilia-Embeddings-EN-Large achieves performance that surpasses even closed-source models like OpenAI's embeddings, while offering a more cost-effective solution. This showcases how targeted fine-tuning can dramatically enhance the capabilities of open-source models, making them competitive with or even superior to proprietary alternatives in specialized domains.

Model Details 📊

  • Model Type: Sentence Transformer
  • Language: English 🇬🇧
  • Base Model: BAAI/bge-large-en-v1.5
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1024
  • Similarity Function: Cosine Similarity
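As a rough illustration of the similarity function listed above, the following minimal sketch computes cosine similarity in plain Python. The helper name `cosine_similarity` and the toy 3-dimensional vectors are assumptions made for illustration; real embeddings from this model have 1024 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors standing in for 1024-dimensional embeddings (hypothetical values).
query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]  # same direction, different length
orthogonal = [0.0, 3.0, 0.0]      # no overlap with the query

score_same = cosine_similarity(query, same_direction)
score_orthogonal = cosine_similarity(query, orthogonal)
```

Because cosine similarity ignores vector length, `same_direction` scores as a perfect match even though it is twice as long as `query`.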

Usage 💻

To use this model with the Sentence Transformers library:

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sujet-ai/Marsilia-Embedding-EN-Large")

# Run inference
sentences = [
    'What are the advantages and disadvantages of investing in corporate bonds compared to government bonds?',
    '€100,000 2.88% 15/1/2038 92,965 0.02% Bank of New York Mellon Corp. $95,000 4.97% 26/4/2034 92,748 0.02% WPC Eurobond BV €100,000 1.35% 15/4/2028 92,674 0.02% Amgen, Inc.1$100,000 2.60% 19/8/2026 92,601 0.02% AT&T, Inc. €100,000 2.05% 19/5/2032 92,593 0.02% Aon Corp. $100,000 3.75% 2/5/2029 92,563 0.02% Chubb INA Holdings, Inc. $102,000 4.35% 3/11/2045 92,352 0.02% Bank of America Corp. $96,000 4.38% 27/4/2028 92,301 0.02% Verizon Communications, Inc. $117,000 1.50% 18/9/2030 92,243 0.02% Medtronic Global Holdings SCA €100,000 0.38% 15/10/2028 92,231 0.02% Intel Corp. $100,000 4.90% 5/8/2052 92,209 0.02% KeyBank NA $100,000 4.15% 8/8/2025 92,200 0.02% Aetna, Inc. $95,000 3.50% 15/11/2024 92,178 0.02% AT&T, Inc. $110,000 4.35% 15/6/2045 92,161 0.02% PepsiCo, Inc. €100,000 1.13% 18/3/2031 92,095 0.02% Ally Financial, Inc. $115,000 2.20% 2/11/2028 92,042 0.02% JPMorgan Chase & Co. £100,000 1.90% 28/4/2033 92,039 0.02% Westinghouse Air Brake Technologies Corp. $95,000 4.95% 15/9/2028 92,021 0.02% Viatris , Inc. $139,000 4.00% 22/6/2050 91,948 0.02% Amazon.com, Inc. $90,000 4.80% 5/12/2034 91,936 0.02% General Motors Financial Co., Inc. $95,000 3.50% 7/11/2024 91,890 0.02% US Bancorp $120,000 1.38% 22/7/2030 91,848 0.02% Goldman Sachs Group, Inc. $105,000 4.41% 23/4/2039 91,826 0.02% Blackstone Holdings Finance Co. LLC €100,000 1.50% 10/4/2029 91,820 0.02%',
    '924 Vanguard USD Treasury Bond UCITS ETF Principal US Dollars ($) CouponMaturity DateFair Value US Dollars ($)% of Total Net Assets United States Treasury Note 8,475,000 1.88% 28/2/2027 7,769,854 0.47% United States Treasury Bond 9,088,000 3.00% 15/8/2052 7,731,900 0.46% United States Treasury Note 8,907,000 1.38% 31/12/2028 7,726,823 0.46% United States Treasury Note 8,184,000 1.13% 15/1/2025 7,696,477 0.46% United States Treasury Bond 10,590,000 1.88% 15/2/2041 7,692,642 0.46% United States Treasury Note 8,466,000 0.25% 30/9/2025 7,668,344 0.46% United States Treasury Note 8,266,700 1.63% 15/2/2026 7,659,614 0.46% United States Treasury Note 8,957,000 0.63% 31/12/2027 7,654,036 0.46% United States Treasury Note 8,087,000 0.38% 15/8/2024 7,651,060 0.46% United States Treasury Note 8,000,500 2.13% 15/5/2025 7,596,412 0.46% United States Treasury Note 8,035,000 2.50% 31/3/2027 7,530,302 0.45% United States Treasury Note 8,138,700 1.63% 15/5/2026 7,511,766 0.45% United States Treasury Note 8,464,000 1.75% 31/1/2029 7,483,366 0.45% United States Treasury Note 8,296,000 0.88% 30/6/2026 7,474,826 0.45% United States Treasury Note 8,078,000 2.00% 15/11/2026 7,472,781 0.45% United States Treasury Note 7,874,000 1.50% 15/2/2025 7,432,010 0.45% United States Treasury Note 7,794,000 1.75% 15/3/2025 7,372,332 0.44% United States Treasury Bond 10,008,000 2.00% 15/11/2041 7,327,732 0.44% United States Treasury Note 8,106,000 0.38% 30/11/2025 7,316,932 0.44% United States Treasury Note 7,738,000 2.75% 30/4/2027 7,309,992 0.44% United States Treasury Bond 11,011,000 1.88% 15/11/2051 7,270,701 0.44% United States Treasury Bond 10,047,000 2.25% 15/2/2052 7,263,667 0.44% United States Treasury Note 7,416,000 3.25% 30/6/2027 7,132,686 0.43% United States Treasury Note 7,559,000 2.63% 31/5/2027 7,103,098 0.43% United States Treasury Note 7,729,500 2.38% 15/5/2029 7,046,526 0.42% United States Treasury Note 7,855,000 1.88% 28/2/2029 6,986,041 0.42% United States Treasury Note 7,023,000 4.13% 30/9/2027 6,984,044 0.42% United States Treasury Note 7,000,000 4.00% 30/6/2028 6,961,',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])

Intended Use 🎯

This model is designed for generating sentence embeddings for English text, particularly in the financial domain. It can be used for various natural language processing tasks such as semantic search, clustering, and information retrieval.
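To make the semantic search use case concrete, here is a minimal sketch of ranking passages against a query by cosine similarity. The function names `normalize` and `top_k` and the toy vectors are hypothetical; in practice the vectors would come from `model.encode(...)` and have 1024 dimensions.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query_vec, passage_vecs, k=2):
    """Return (passage_index, score) pairs for the k most similar passages."""
    q = normalize(query_vec)
    scored = []
    for idx, passage in enumerate(passage_vecs):
        p = normalize(passage)
        # The dot product of unit vectors equals their cosine similarity.
        scored.append((idx, sum(a * b for a, b in zip(q, p))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (hypothetical values) for one query and three passages.
query_vec = [0.9, 0.1, 0.0]
passage_vecs = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
]
ranking = top_k(query_vec, passage_vecs, k=2)
```

In a RAG pipeline, the indices in `ranking` would select the passages passed to the generator as context.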

Training Data 📚

The model was fine-tuned on the sujet-ai/Sujet-Financial-RAG-EN-Dataset. This dataset consists of question-context pairs in English, focusing on financial topics.

Training Procedure 🛠️

Training Hyperparameters

  • Loss Function: MultipleNegativesRankingLoss
    • Scale: 20.0
    • Similarity Function: Cosine Similarity
  • Evaluation Strategy: Steps
  • Per Device Train Batch Size: 64
  • Per Device Eval Batch Size: 64
  • Number of Train Epochs: 10
  • Batch Sampler: no_duplicates
  • Multi Dataset Batch Sampler: round_robin
  • Scheduler: Warmup cosine
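For intuition about the loss listed above, the sketch below shows the idea behind a multiple-negatives ranking loss with a scale of 20.0 and cosine similarity: each question should score highest against its own context, and every other context in the batch serves as a negative. This is a simplified pure-Python illustration, not the Sentence Transformers implementation; the names `cosine` and `mnr_loss` and the 2-dimensional toy vectors are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy in which positives[i] is the correct match for
    anchors[i] and all other positives in the batch act as negatives."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy batch of two question embeddings (hypothetical values).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]  # each context matches its question
swapped = [[0.0, 1.0], [1.0, 0.0]]  # contexts paired with the wrong question

loss_aligned = mnr_loss(anchors, aligned)
loss_swapped = mnr_loss(anchors, swapped)
```

The loss is near zero when each question is closest to its own context and grows to roughly the scale value when the pairing is reversed, which is why larger batches (more in-batch negatives) tend to give a stronger training signal.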

Framework Versions

  • Python: 3.10.13
  • Sentence Transformers: 3.0.1
  • Transformers: 4.42.3
  • PyTorch: 2.5.0.dev20240704+cu124
  • Accelerate: 0.32.1
  • Datasets: 2.20.0
  • Tokenizers: 0.19.1

Evaluation 📈

The model was evaluated using the InformationRetrievalEvaluator on the test split of the sujet-ai/Sujet-Financial-RAG-EN-Dataset.
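Information retrieval evaluation of this kind typically reports rank-based metrics such as accuracy@k and MRR@k. The sketch below shows how those two metrics can be computed from ranked results. It is an illustration only: the function names, the query and document IDs, and the choice of k are hypothetical, and this card does not state which metrics or cutoffs were used.

```python
def accuracy_at_k(ranked_ids, relevant_ids, k):
    """Fraction of queries with at least one relevant document in the top k."""
    hits = 0
    for query_id, ranking in ranked_ids.items():
        if any(doc_id in relevant_ids[query_id] for doc_id in ranking[:k]):
            hits += 1
    return hits / len(ranked_ids)


def mrr_at_k(ranked_ids, relevant_ids, k):
    """Mean reciprocal rank of the first relevant document within the top k."""
    total = 0.0
    for query_id, ranking in ranked_ids.items():
        for rank, doc_id in enumerate(ranking[:k], start=1):
            if doc_id in relevant_ids[query_id]:
                total += 1.0 / rank
                break
    return total / len(ranked_ids)


# Hypothetical retrieval results: documents ordered by similarity per query.
ranked_ids = {"q1": ["d3", "d1", "d2"], "q2": ["d2", "d3", "d1"]}
relevant_ids = {"q1": {"d1"}, "q2": {"d1"}}

acc_at_2 = accuracy_at_k(ranked_ids, relevant_ids, k=2)
mrr_at_3 = mrr_at_k(ranked_ids, relevant_ids, k=3)
```

Here the relevant document sits at rank 2 for `q1` and rank 3 for `q2`, so only one of the two queries counts as a hit at k=2.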

Limitations ⚠️

The model is specifically trained on English financial texts and may not perform optimally on other domains or languages. Users should be aware of potential biases present in the training data.

Citation 📄

If you use this model in your research or applications, please cite:

@software{Marsilia-Embeddings-EN-Large,
  author = {{Sujet AI} and Allaa Boutaleb and Hamed Rahimi},
  title = {Marsilia-Embeddings-EN-Large: A fine-tuned English embedding model for financial texts},
  year = {2024},
  url = {https://huggingface.co/sujet-ai/Marsilia-Embedding-EN-Large}
}

Contact Information 📧

For questions, feedback, or collaborations, please reach out to us on LinkedIn or visit our website at https://sujet.ai.