|
--- |
|
base_model: |
|
- mistralai/Mistral-7B-v0.1 |
|
language: |
|
- en |
|
library_name: transformers |
|
license: cc-by-nc-4.0 |
|
pipeline_tag: feature-extraction |
|
tags: |
|
- finance |
|
--- |
|
|
|
# FinE5: Finance-Adapted Text Embedding Model |
|
|
|
FinE5 is a finance-adapted text embedding model fine-tuned on a synthesized finance corpus, following the training pipeline of e5-mistral-7b-instruct (Wang et al., 2023). It ranks at the top of the FinMTEB benchmark (as of Feb 16, 2025), and there is no overlap between its training data and the benchmark's test sets.
|
|
|
The training data and pipeline are detailed in the paper (see the Citation section below).
|
|
|
--- |
|
# ‼️ Request Access
|
|
|
* If you would like to try this model, please email [email protected] with your name, institution, and purpose. |
|
|
|
--- |
|
# Usage |
|
|
|
Below is an example of how to encode queries and passages.
|
|
|
```python
import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    # Pool the hidden state of the last non-padding token in each sequence.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'


# Each query must come with a one-sentence instruction that describes the task
task = 'Given a financial question, retrieve user replies that best answer the question.'
queries = [
    get_detailed_instruct(task, 'What do brokers do with bad stock?'),
    get_detailed_instruct(task, 'Why do investors buy stock that had appreciated?')
]
# No instruction is needed for the retrieval documents
documents = [
    "For every seller, there's a buyer. Buyers may have any reason for wanting to buy (bargain shopping, foolish belief in a crazy business, etc). The party (brokerage, market maker, individual) owning the stock at the time the company goes out of business is the loser . But in a general panic, not every company is going to go out of business. So the party owning those stocks can expect to recover some, or all, of the value at some point in the future. Brokerages all reserve the right to limit margin trading (required for short selling), and during a panic would likely not allow you to short a stock they feel is a high risk for them.",
    "You seem to prefer to trade like I do: 'Buy low, sell high.' But there are some people that prefer a different way: 'Buy high, sell higher.' A stock that has 'just appreciated' is 'in motion.' That is a 'promise' (not always kept) that it will continue to go higher. Some people want stocks that not only go higher, but also SOON. The disadvantage of 'buy low, sell high' is that the stock can stay low for some time. So that's a strategy for patient investors like you and me"
]
input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained('yixuantt/Fin-e5')
model = AutoModel.from_pretrained('yixuantt/Fin-e5')

# Tokenize the input texts
max_length = 1024
batch_dict = tokenizer(input_texts, max_length=max_length, padding=True, truncation=True, return_tensors='pt')

# Forward pass without gradient tracking, then pool the last hidden states
with torch.no_grad():
    outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# Normalize embeddings and compute query-document cosine similarity scores
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
```
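
By default, `from_pretrained` loads the weights in full precision, which is heavy for a 7B-parameter backbone. Below is a minimal sketch of loading the model in half precision on a GPU; it assumes a CUDA device is available and the `accelerate` package is installed (required by `device_map='auto'`):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('yixuantt/Fin-e5')
model = AutoModel.from_pretrained(
    'yixuantt/Fin-e5',
    torch_dtype=torch.float16,   # halves memory; torch.bfloat16 also works on recent GPUs
    device_map='auto',           # places weights on the available GPU(s); needs `pip install accelerate`
)

query = 'Instruct: Given a financial question, retrieve user replies that best answer the question.\nQuery: What do brokers do with bad stock?'
# Move the tokenized batch to the model's device before the forward pass
batch_dict = tokenizer([query], max_length=1024, padding=True, truncation=True,
                       return_tensors='pt').to(model.device)
with torch.no_grad():
    outputs = model(**batch_dict)
# Pool and normalize exactly as in the example above
```

Pooling and normalization are unchanged; only the loading options and the device placement of the inputs differ.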
|
--- |
|
# Citation |
|
|
|
If you find our work helpful, please cite: |
|
``` |
|
@misc{tang2025finmtebfinancemassivetext, |
|
title={FinMTEB: Finance Massive Text Embedding Benchmark}, |
|
author={Yixuan Tang and Yi Yang}, |
|
year={2025}, |
|
eprint={2502.10990}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL}, |
|
url={https://arxiv.org/abs/2502.10990}, |
|
} |
|
|
|
@misc{tang2024needdomainspecificembeddingmodels, |
|
title={Do We Need Domain-Specific Embedding Models? An Empirical Investigation}, |
|
author={Yixuan Tang and Yi Yang}, |
|
year={2024}, |
|
eprint={2409.18511}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL}, |
|
url={https://arxiv.org/abs/2409.18511}, |
|
} |
|
``` |
|
|
|
Code for FinMTEB: https://github.com/yixuantt/FinMTEB |
|
---
|
Thanks to the [MTEB](https://github.com/embeddings-benchmark/mteb) Benchmark. |
|
* This model should not be used for any commercial purpose. Refer to the [license](https://spdx.org/licenses/CC-BY-NC-4.0) for the detailed terms.