---
language:
- en
- ko
license: mit
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- email-search
- bge
- embeddings
- multilingual
- email-retrieval
datasets:
- doubleyyh/mixed-email-dataset
model-index:
- name: email-tuned-bge-m3
  results:
  - task:
      type: Retrieval
      name: Email Content Retrieval
    metrics:
      - type: mrr
        value: 0.85
        name: MRR@10
      - type: ndcg
        value: 0.82
        name: NDCG@10
      - type: recall
        value: 0.88
        name: Recall@10
---

# Email-tuned BGE-M3

This is a fine-tuned version of [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) optimized for email content retrieval. The model was trained on a mixed-language (English/Korean) email dataset to improve retrieval performance for various email-related queries.

## Model Description

- **Model Type:** Embedding model (encoder-only)
- **Base Model:** BAAI/bge-m3
- **Languages:** English, Korean
- **Domain:** Email content, business communication
- **Training Data:** Mixed-language email dataset with various types of queries (metadata, long-form, short-form, yes/no questions)

## Quickstart

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document

# Initialize the embedding model
embeddings = HuggingFaceEmbeddings(
    model_name="doubleyyh/email-tuned-bge-m3",
    model_kwargs={'device': 'cuda'},  # use 'cpu' if no GPU is available
    encode_kwargs={'normalize_embeddings': True}
)

# Example emails
emails = [
    {
        "subject": "회의 일정 변경 안내",
        "from": [["김철수", "[email protected]"]],
        "to": [["이영희", "[email protected]"]],
        "cc": [["박지원", "[email protected]"]],
        "date": "2024-03-26T10:00:00",
        "text_body": "안녕하세요, 내일 예정된 프로젝트 미팅을 오후 2시로 변경하고자 합니다."
    },
    {
        "subject": "Project Timeline Update",
        "from": [["John Smith", "[email protected]"]],
        "to": [["Team", "[email protected]"]],
        "cc": [],
        "date": "2024-03-26T11:30:00",
        "text_body": "Hi team, I'm writing to update you on the Q2 project milestones."
    }
]

# Format emails into documents
docs = []
for email in emails:
    # Format email content
    content = "\n".join([f"{k}: {v}" for k, v in email.items()])
    docs.append(Document(page_content=content))

# Create FAISS index
db = FAISS.from_documents(docs, embeddings)

# Query examples (supports both Korean and English)
queries = [
    "회의 시간이 언제로 변경되었나요?",
    "When is the meeting rescheduled?",
    "프로젝트 일정",
    "Q2 milestones"
]

# Perform similarity search
for query in queries:
    print(f"\nQuery: {query}")
    results = db.similarity_search(query, k=1)
    print(f"Most relevant email:\n{results[0].page_content[:200]}...")
```
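
The quickstart joins each field with `f"{k}: {v}"`, so the `from`, `to`, and `cc` lists end up in the document text in their Python `repr` form. If a flatter layout is preferred, a helper along the following lines could be used. This is only a sketch: `format_addresses` and `format_email` are hypothetical names, the `example.com` addresses are placeholders, and this card does not state which text layout the model saw during fine-tuning, so the output format is an assumption.

```python
def format_addresses(pairs):
    """Render [[name, address], ...] as 'name <address>, ...'."""
    return ", ".join(f"{name} <{address}>" for name, address in pairs)


def format_email(email):
    """Flatten one email dict (same keys as the quickstart) into plain text.

    Hypothetical helper: the layout is an assumption, not the documented
    training format.
    """
    lines = []
    for key, value in email.items():
        if key in ("from", "to", "cc"):
            value = format_addresses(value)
        lines.append(f"{key}: {value}")
    return "\n".join(lines)


# Placeholder email using the same keys as the quickstart example.
example = {
    "subject": "Project Timeline Update",
    "from": [["John Smith", "john@example.com"]],
    "to": [["Team", "team@example.com"]],
    "cc": [],
    "date": "2024-03-26T11:30:00",
    "text_body": "Hi team, I'm writing to update you on the Q2 project milestones.",
}
print(format_email(example))
```

The result can be passed to `Document(page_content=...)` in place of the joined string from the quickstart. Whichever layout is chosen, it is probably best to use the same one for every indexed email.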


## Intended Use & Limitations

### Intended Use
- Email content retrieval
- Similar document search in email corpora
- Question answering over email content
- Multi-language email search systems
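
For similar-document search, the quickstart sets `normalize_embeddings` to `True`, which makes every vector unit-length so that a plain dot product equals cosine similarity. The sketch below illustrates that ranking step in pure Python. The three-dimensional vectors are toy stand-ins for real embeddings, and `l2_normalize` and `top_k` are hypothetical helper names rather than part of any library.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k highest dot products."""
    scores = [
        (i, sum(q * d for q, d in zip(query_vec, vec)))
        for i, vec in enumerate(doc_vecs)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors standing in for real email embeddings.
docs = [l2_normalize(v) for v in ([1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0])]
query = l2_normalize([0.9, 0.1, 0.0])

for index, score in top_k(query, docs, k=2):
    print(index, round(score, 3))
```

A vector store such as FAISS performs the same kind of ranking far more efficiently. The brute-force version is shown only to make the scoring explicit.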

### Limitations
- Performance may vary for domains outside of email content
- Best suited for business communication context
- Although the model supports both English and Korean, performance may differ between the two languages
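
Because quality can vary by domain and language, it may be worth measuring retrieval on your own labeled queries. This card's metadata reports MRR@10, NDCG@10, and Recall@10. The sketch below uses the standard binary-relevance definitions of those metrics for a single query. It is not necessarily how the reported figures were computed, since the card does not describe the evaluation procedure. The function names and the document IDs are hypothetical.

```python
import math


def reciprocal_rank(ranked_ids, relevant_ids, k=10):
    """1 / rank of the first relevant result in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant documents found in the top k (IDs assumed unique)."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG with binary relevance: DCG of the ranking over the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# One query whose only relevant email is ranked second.
ranked = ["e3", "e1", "e7"]
relevant = {"e1"}
print(reciprocal_rank(ranked, relevant))
print(recall_at_k(ranked, relevant))
print(ndcg_at_k(ranked, relevant))
```

Averaging `reciprocal_rank` over all queries gives MRR@10. The other two metrics are averaged across queries in the same way.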

## Citation

```bibtex
@misc{email-tuned-bge-m3,
  author = {doubleyyh},
  title = {Email-tuned BGE-M3: Fine-tuned Embedding Model for Email Content},
  year = {2024},
  publisher = {HuggingFace}
}
```

## License

This model is released under the MIT license, the same license as the base model (BAAI/bge-m3).

## Contact

For questions or feedback, please open an issue in the GitHub repository or get in touch through Hugging Face.