---
language:
- en
- ko
license: mit
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- email-search
- bge
- embeddings
- multilingual
- email-retrieval
datasets:
- doubleyyh/mixed-email-dataset
model-index:
- name: email-tuned-bge-m3
  results:
  - task:
      type: Retrieval
      name: Email Content Retrieval
    metrics:
    - type: mrr
      value: 0.85
      name: MRR@10
    - type: ndcg
      value: 0.82
      name: NDCG@10
    - type: recall
      value: 0.88
      name: Recall@10
---
# Email-tuned BGE-M3
This is a fine-tuned version of [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) optimized for email content retrieval. The model was trained on a mixed-language (English/Korean) email dataset to improve retrieval performance for various email-related queries.
## Model Description
- **Model Type:** Embedding model (encoder-only)
- **Base Model:** BAAI/bge-m3
- **Languages:** English, Korean
- **Domain:** Email content, business communication
- **Training Data:** Mixed-language email dataset with various types of queries (metadata, long-form, short-form, yes/no questions)
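Since the checkpoint is tagged with `library_name: sentence-transformers`, it can also be queried directly, without a vector store. A minimal sketch (assuming the checkpoint ships the standard `SentenceTransformer` config; the query/passage strings are illustrative):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("doubleyyh/email-tuned-bge-m3")

queries = ["When is the meeting rescheduled?"]
passages = ["Hi team, tomorrow's project meeting is moved to 2 PM."]

# With normalized embeddings, the inner product equals cosine similarity
q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
print(q_emb @ p_emb.T)
```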
## Quickstart
```python
# Note: newer LangChain releases move these imports to `langchain_community`
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.docstore.document import Document

# Initialize the embedding model
embeddings = HuggingFaceEmbeddings(
    model_name="doubleyyh/email-tuned-bge-m3",
    model_kwargs={'device': 'cuda'},
    encode_kwargs={'normalize_embeddings': True}
)

# Example emails (one Korean, one English)
emails = [
    {
        "subject": "회의 일정 변경 안내",  # "Meeting schedule change notice"
        "from": [["김철수", "[email protected]"]],
        "to": [["이영희", "[email protected]"]],
        "cc": [["박지성", "[email protected]"]],
        "date": "2024-03-26T10:00:00",
        # "Hello, I'd like to move tomorrow's project meeting to 2 PM."
        "text_body": "안녕하세요, 내일 예정된 프로젝트 미팅을 오후 2시로 변경하고자 합니다."
    },
    {
        "subject": "Project Timeline Update",
        "from": [["John Smith", "[email protected]"]],
        "to": [["Team", "[email protected]"]],
        "cc": [],
        "date": "2024-03-26T11:30:00",
        "text_body": "Hi team, I'm writing to update you on the Q2 project milestones."
    }
]

# Format each email into a single "key: value" document
docs = []
for email in emails:
    content = "\n".join([f"{k}: {v}" for k, v in email.items()])
    docs.append(Document(page_content=content))

# Create FAISS index
db = FAISS.from_documents(docs, embeddings)

# Query examples (supports both Korean and English)
queries = [
    "회의 시간이 언제로 변경되었나요?",  # "When was the meeting time changed?"
    "When is the meeting rescheduled?",
    "프로젝트 일정",  # "project schedule"
    "Q2 milestones"
]

# Perform similarity search
for query in queries:
    print(f"\nQuery: {query}")
    results = db.similarity_search(query, k=1)
    print(f"Most relevant email:\n{results[0].page_content[:200]}...")
```
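Because `normalize_embeddings=True` is set, embeddings are unit-length and cosine similarity reduces to an inner product. To inspect raw scores, LangChain's FAISS wrapper also exposes `similarity_search_with_score`; with the default index the score is an L2 distance, so lower means more similar. A short sketch building on the quickstart's `db`:

```python
# Retrieve the raw FAISS score alongside each hit (lower = closer for L2)
for query in queries:
    for doc, score in db.similarity_search_with_score(query, k=2):
        print(f"{score:.4f}  {doc.page_content.splitlines()[0]}")
```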
## Intended Use & Limitations
### Intended Use
- Email content retrieval
- Similar document search in email corpora
- Question answering over email content (see the retriever sketch after this list)
- Multi-language email search systems
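For the question-answering use case, the FAISS index from the quickstart can be wrapped as a retriever and handed to any downstream QA chain. A minimal sketch reusing the quickstart's `db` (the LLM and chain wiring are omitted):

```python
# Wrap the index as a retriever that fetches supporting emails for a question
retriever = db.as_retriever(search_kwargs={"k": 3})

context_docs = retriever.get_relevant_documents("When is the meeting rescheduled?")
for doc in context_docs:
    print(doc.page_content[:120])  # pass these to an LLM of your choice
```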
### Limitations
- Performance may vary on domains outside of email content
- Best suited for business communication contexts
- Although both English and Korean are supported, retrieval quality may differ between the two languages
## Citation
```bibtex
@misc{email-tuned-bge-m3,
author = {doubleyyh},
title = {Email-tuned BGE-M3: Fine-tuned Embedding Model for Email Content},
year = {2024},
publisher = {HuggingFace}
}
```
## License
This model is released under the MIT license, the same license as the base model ([BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)).
## Contact
For questions or feedback, please open an issue on the GitHub repository or reach out via the Hugging Face model page.