---
language:
- en
- ko
license: mit
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- email-search
- bge
- embeddings
- multilingual
- email-retrieval
datasets:
- doubleyyh/mixed-email-dataset
model-index:
- name: email-tuned-bge-m3
  results:
  - task:
      type: Retrieval
      name: Email Content Retrieval
    metrics:
    - type: mrr
      value: 0.85
      name: MRR@10
    - type: ndcg
      value: 0.82
      name: NDCG@10
    - type: recall
      value: 0.88
      name: Recall@10
---

# Email-tuned BGE-M3

This is a fine-tuned version of [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) optimized for email content retrieval. The model was trained on a mixed-language (English/Korean) email dataset to improve retrieval performance for email-related queries.

## Model Description

- **Model Type:** Embedding model (encoder-only)
- **Base Model:** BAAI/bge-m3
- **Languages:** English, Korean
- **Domain:** Email content, business communication
- **Training Data:** Mixed-language email dataset with several query types (metadata, long-form, short-form, and yes/no questions)

## Quickstart

```python
# Requires: pip install langchain-huggingface langchain-community faiss-cpu
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document

# Initialize the embedding model
embeddings = HuggingFaceEmbeddings(
    model_name="doubleyyh/email-tuned-bge-m3",
    model_kwargs={'device': 'cuda'},  # use 'cpu' if no GPU is available
    encode_kwargs={'normalize_embeddings': True}
)

# Example emails
emails = [
    {
        # Korean: "Notice of meeting schedule change"
        "subject": "회의 일정 변경 안내",
        "from": [["김철수", "[email protected]"]],
        "to": [["이영희", "[email protected]"]],
        "cc": [["박지원", "[email protected]"]],
        "date": "2024-03-26T10:00:00",
        # Korean: "Hello, I would like to move tomorrow's project meeting to 2 PM."
        "text_body": "안녕하세요, 내일 예정된 프로젝트 미팅을 오후 2시로 변경하고자 합니다."
    },
    {
        "subject": "Project Timeline Update",
        "from": [["John Smith", "[email protected]"]],
        "to": [["Team", "[email protected]"]],
        "cc": [],
        "date": "2024-03-26T11:30:00",
        "text_body": "Hi team, I'm writing to update you on the Q2 project milestones."
    }
]

# Format each email as "key: value" lines and wrap it in a Document
docs = []
for email in emails:
    content = "\n".join([f"{k}: {v}" for k, v in email.items()])
    docs.append(Document(page_content=content))

# Create FAISS index
db = FAISS.from_documents(docs, embeddings)

# Query examples (supports both Korean and English)
queries = [
    "회의 시간이 언제로 변경되었나요?",  # "What time was the meeting moved to?"
    "When is the meeting rescheduled?",
    "프로젝트 일정",  # "project schedule"
    "Q2 milestones"
]

# Perform similarity search and show the top hit for each query
for query in queries:
    print(f"\nQuery: {query}")
    results = db.similarity_search(query, k=1)
    print(f"Most relevant email:\n{results[0].page_content[:200]}...")
```

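Because the card metadata lists `sentence-transformers` as the library, the model can presumably also be used without LangChain. The following is a minimal sketch under that assumption; `format_email` and `rank_by_similarity` are hypothetical helper names introduced here for illustration, and the formatting simply mirrors the "key: value" layout used in the Quickstart.

```python
from typing import Dict, List, Tuple


def format_email(email: Dict) -> str:
    """Flatten an email dict into 'key: value' lines, as in the Quickstart."""
    return "\n".join(f"{k}: {v}" for k, v in email.items())


def rank_by_similarity(
    query: str,
    emails: List[Dict],
    model_name: str = "doubleyyh/email-tuned-bge-m3",
) -> List[Tuple[float, Dict]]:
    """Return (score, email) pairs sorted from most to least similar."""
    # Imported lazily so format_email works without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    texts = [format_email(e) for e in emails]
    # With normalized embeddings, the dot product equals cosine similarity.
    doc_vecs = model.encode(texts, normalize_embeddings=True)
    query_vec = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ query_vec
    order = scores.argsort()[::-1]  # indices from highest to lowest score
    return [(float(scores[i]), emails[i]) for i in order]


if __name__ == "__main__":
    sample = [
        {"subject": "Project Timeline Update", "text_body": "Q2 milestones moved."},
        {"subject": "Lunch menu", "text_body": "Pasta on Friday."},
    ]
    for score, email in rank_by_similarity("Q2 milestones", sample):
        print(f"{score:.3f}  {email['subject']}")
```

For more than a handful of emails, a vector index such as the FAISS store from the Quickstart is likely a better fit than this brute-force comparison.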
## Intended Use & Limitations

### Intended Use
- Email content retrieval
- Similar document search in email corpora
- Question answering over email content
- Multi-language email search systems

### Limitations
- Performance may vary on content outside the email domain
- Best suited to business communication
- Both English and Korean are supported, but performance may differ between the two languages

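To see how much performance differs between languages on your own data, retrieval metrics can be computed separately for each query language. The sketch below uses the standard binary-relevance definitions of MRR@k, Recall@k, and NDCG@k. It is an assumption that these match how the scores reported for this model were computed, since the card does not describe the evaluation setup. All function names and the `runs` structure are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Set


def mrr_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """NDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(pos + 1)
        for pos, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal = sum(1.0 / math.log2(pos + 1) for pos in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


def per_language_report(runs: List[Dict], k: int = 10) -> Dict[str, Dict[str, float]]:
    """Average each metric over the queries of each language.

    Each run is a dict with keys 'lang', 'ranked' (doc ids, best first),
    and 'relevant' (set of relevant doc ids).
    """
    grouped = defaultdict(list)
    for run in runs:
        grouped[run["lang"]].append(run)
    report = {}
    for lang, items in grouped.items():
        n = len(items)
        report[lang] = {
            "mrr": sum(mrr_at_k(r["ranked"], r["relevant"], k) for r in items) / n,
            "recall": sum(recall_at_k(r["ranked"], r["relevant"], k) for r in items) / n,
            "ndcg": sum(ndcg_at_k(r["ranked"], r["relevant"], k) for r in items) / n,
        }
    return report


if __name__ == "__main__":
    # Toy rankings with made-up document ids, for illustration only.
    runs = [
        {"lang": "ko", "ranked": ["e1", "e2"], "relevant": {"e1"}},
        {"lang": "en", "ranked": ["e1", "e2"], "relevant": {"e2"}},
    ]
    for lang, metrics in per_language_report(runs).items():
        print(lang, metrics)
```

The `ranked` lists would typically come from a top-10 similarity search against an index of your own emails, with relevance labels supplied by hand.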
## Citation

```bibtex
@misc{email-tuned-bge-m3,
  author = {doubleyyh},
  title = {Email-tuned BGE-M3: Fine-tuned Embedding Model for Email Content},
  year = {2024},
  publisher = {HuggingFace}
}
```

## License

This model is released under the MIT license, the same license as the base model (bge-m3).

## Contact

For questions or feedback, please open an issue on the GitHub repository or get in touch through Hugging Face.