Upload folder using huggingface_hub
- README.md +46 -3
- config.json +1 -1
README.md CHANGED
@@ -169,7 +169,7 @@ model-index:
       name: Spearman Max
 ---
 
-#
+# upskyy/gte-korean-base
 
 This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
 
@@ -196,7 +196,8 @@ SentenceTransformer(
 
 ## Usage
 
-###
+### Usage (Sentence-Transformers)
+
 
 First install the Sentence Transformers library:
 
@@ -209,7 +210,7 @@ Then you can load this model and run inference.
 from sentence_transformers import SentenceTransformer
 
 # Download from the 🤗 Hub
-model = SentenceTransformer("upskyy/gte-korean-base")
+model = SentenceTransformer("upskyy/gte-korean-base", trust_remote_code=True)
 
 # Run inference
 sentences = [
@@ -225,6 +226,48 @@ print(embeddings.shape)
 similarities = model.similarity(embeddings, embeddings)
 print(similarities.shape)
 # [3, 3]
+print(similarities)
+# tensor([[1.0000, 0.6274, 0.3788],
+#         [0.6274, 1.0000, 0.5978],
+#         [0.3788, 0.5978, 1.0000]])
+```
+
+### Usage (HuggingFace Transformers)
+
+Without sentence-transformers, you can use the model like this:
+First, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
+
+```python
+from transformers import AutoTokenizer, AutoModel
+import torch
+
+
+# Mean Pooling - take the attention mask into account for correct averaging
+def mean_pooling(model_output, attention_mask):
+    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
+    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+
+
+# Sentences we want sentence embeddings for
+sentences = ["안녕하세요?", "한국어 문장 임베딩을 위한 버트 모델입니다."]
+
+# Load model from HuggingFace Hub
+tokenizer = AutoTokenizer.from_pretrained("upskyy/gte-korean-base")
+model = AutoModel.from_pretrained("upskyy/gte-korean-base", trust_remote_code=True)
+
+# Tokenize sentences
+encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
+
+# Compute token embeddings
+with torch.no_grad():
+    model_output = model(**encoded_input)
+
+# Perform pooling. In this case, mean pooling.
+sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])
+
+print("Sentence embeddings:")
+print(sentence_embeddings)
 ```
 
 <!--
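Editor's note on the new usage section: the second hunk ends right after "First install the Sentence Transformers library:", so the install command itself is outside the changed lines; it is ordinarily `pip install -U sentence-transformers`. The card also lists semantic search among the use cases but only demonstrates pairwise similarity. A minimal sketch of query-against-corpus search with this model might look like the following; the corpus and query strings are invented for illustration, and `model.similarity` is the same call the README already uses (cosine similarity by default).

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("upskyy/gte-korean-base", trust_remote_code=True)

# Hypothetical corpus and query, for illustration only
corpus = [
    "한국어 문장 임베딩을 위한 모델입니다.",       # "A model for Korean sentence embeddings."
    "오늘은 날씨가 맑고 따뜻합니다.",              # "The weather is clear and warm today."
    "이 모델은 의미 검색에 사용할 수 있습니다.",   # "This model can be used for semantic search."
]
query = "문장 임베딩 모델"                         # "sentence embedding model"

corpus_embeddings = model.encode(corpus)    # shape: (3, 768)
query_embedding = model.encode([query])     # shape: (1, 768)

# Score the query against every corpus sentence
scores = model.similarity(query_embedding, corpus_embeddings)[0]

# Print corpus sentences from most to least similar
for idx in scores.argsort(descending=True).tolist():
    print(f"{scores[idx].item():.4f}  {corpus[idx]}")
```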
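The added Transformers-only example stops at printing the raw pooled embeddings. To get scores comparable to `model.similarity()` from the Sentence-Transformers path, a short follow-up sketch (assuming the `sentence_embeddings` tensor produced by the mean-pooling snippet above, and cosine similarity as the metric) could be:

```python
import torch.nn.functional as F

# Continues from the Transformers example above; assumes `sentence_embeddings`
# is the (2, 768) tensor returned by mean_pooling().
normalized = F.normalize(sentence_embeddings, p=2, dim=1)

# Cosine similarity matrix between the two example sentences
cosine_scores = normalized @ normalized.T
print(cosine_scores)  # 2x2 tensor with 1.0 on the diagonal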
config.json CHANGED
@@ -47,4 +47,4 @@
   "unpad_inputs": false,
   "use_memory_efficient_attention": false,
   "vocab_size": 250048
-}
+}
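The config.json hunk only touches the closing brace on line 50 (most likely a newline-at-end-of-file difference); the listed keys are unchanged. To double-check the values shown in this hunk after download, a quick sketch using `AutoConfig` with `trust_remote_code=True` (mirroring the README usage) would be:

```python
from transformers import AutoConfig

# Confirm the values listed in the config.json hunk above
config = AutoConfig.from_pretrained("upskyy/gte-korean-base", trust_remote_code=True)
print(config.unpad_inputs)                    # expected: False
print(config.use_memory_efficient_attention)  # expected: False
print(config.vocab_size)                      # expected: 250048
```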