Upload folder using huggingface_hub
- README.md +46 -3
- config.json +1 -1
README.md CHANGED
@@ -169,7 +169,7 @@ model-index:
       name: Spearman Max
 ---
 
-#
+# upskyy/gte-korean-base
 
 This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
 
@@ -196,7 +196,8 @@ SentenceTransformer(
 
 ## Usage
 
-###
+### Usage (Sentence-Transformers)
+
 
 First install the Sentence Transformers library:
 
@@ -209,7 +210,7 @@ Then you can load this model and run inference.
 from sentence_transformers import SentenceTransformer
 
 # Download from the 🤗 Hub
-model = SentenceTransformer("upskyy/gte-korean-base")
+model = SentenceTransformer("upskyy/gte-korean-base", trust_remote_code=True)
 
 # Run inference
 sentences = [
@@ -225,6 +226,48 @@ print(embeddings.shape)
 similarities = model.similarity(embeddings, embeddings)
 print(similarities.shape)
 # [3, 3]
+print(similarities)
+# tensor([[1.0000, 0.6274, 0.3788],
+#         [0.6274, 1.0000, 0.5978],
+#         [0.3788, 0.5978, 1.0000]])
+```
+
+### Usage (HuggingFace Transformers)
+
+Without sentence-transformers, you can use the model like this:
+First, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
+
+```python
+from transformers import AutoTokenizer, AutoModel
+import torch
+
+
+# Mean Pooling - take the attention mask into account for correct averaging
+def mean_pooling(model_output, attention_mask):
+    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
+    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+
+
+# Sentences we want sentence embeddings for
+sentences = ["안녕하세요?", "한국어 문장 임베딩을 위한 버트 모델입니다."]
+
+# Load model from HuggingFace Hub
+tokenizer = AutoTokenizer.from_pretrained("upskyy/gte-korean-base")
+model = AutoModel.from_pretrained("upskyy/gte-korean-base", trust_remote_code=True)
+
+# Tokenize sentences
+encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
+
+# Compute token embeddings
+with torch.no_grad():
+    model_output = model(**encoded_input)
+
+# Perform pooling. In this case, mean pooling.
+sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])
+
+print("Sentence embeddings:")
+print(sentence_embeddings)
 ```
 
 <!--
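Editor's note on the new usage section: the second hunk ends right after "First install the Sentence Transformers library:", so the install command itself is outside the changed lines; it is ordinarily `pip install -U sentence-transformers`. The card also lists semantic search among the use cases but only demonstrates pairwise similarity. A minimal sketch of query-against-corpus search with this model might look like the following; the corpus and query strings are invented for illustration, and `model.similarity` is the same call the README already uses (cosine similarity by default).

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("upskyy/gte-korean-base", trust_remote_code=True)

# Hypothetical corpus and query, for illustration only
corpus = [
    "한국어 문장 임베딩을 위한 모델입니다.",       # "A model for Korean sentence embeddings."
    "오늘은 날씨가 맑고 따뜻합니다.",              # "The weather is clear and warm today."
    "이 모델은 의미 검색에 사용할 수 있습니다.",   # "This model can be used for semantic search."
]
query = "문장 임베딩 모델"                         # "sentence embedding model"

corpus_embeddings = model.encode(corpus)    # shape: (3, 768)
query_embedding = model.encode([query])     # shape: (1, 768)

# Score the query against every corpus sentence
scores = model.similarity(query_embedding, corpus_embeddings)[0]

# Print corpus sentences from most to least similar
for idx in scores.argsort(descending=True).tolist():
    print(f"{scores[idx].item():.4f}  {corpus[idx]}")
```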
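The added Transformers-only example stops at printing the raw pooled embeddings. To get scores comparable to `model.similarity()` from the Sentence-Transformers path, a short follow-up sketch (assuming the `sentence_embeddings` tensor produced by the mean-pooling snippet above, and cosine similarity as the metric) could be:

```python
import torch.nn.functional as F

# Continues from the Transformers example above; assumes `sentence_embeddings`
# is the (2, 768) tensor returned by mean_pooling().
normalized = F.normalize(sentence_embeddings, p=2, dim=1)

# Cosine similarity matrix between the two example sentences
cosine_scores = normalized @ normalized.T
print(cosine_scores)  # 2x2 tensor with 1.0 on the diagonal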
config.json CHANGED
@@ -47,4 +47,4 @@
   "unpad_inputs": false,
   "use_memory_efficient_attention": false,
   "vocab_size": 250048
-}
+}
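The config.json hunk only touches the closing brace on line 50 (most likely a newline-at-end-of-file difference); the listed keys are unchanged. To double-check the values shown in this hunk after download, a quick sketch using `AutoConfig` with `trust_remote_code=True` (mirroring the README usage) would be:

```python
from transformers import AutoConfig

# Confirm the values listed in the config.json hunk above
config = AutoConfig.from_pretrained("upskyy/gte-korean-base", trust_remote_code=True)
print(config.unpad_inputs)                    # expected: False
print(config.use_memory_efficient_attention)  # expected: False
print(config.vocab_size)                      # expected: 250048
```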