---
language:
- en
- ta
- ml
- as
- bn
- gu
- hi
- kn
- mr
- or
- te
base_model:
- sarvamai/sarvam-1
pipeline_tag: sentence-similarity
---
# WordLLama - Indic
Inspired by WordLlama, this model is trained on the word embeddings of the Sarvam-1 model, which supports most
Indic languages. We used a translated subset of https://huggingface.co/datasets/sentence-transformers/all-nli
to train this model.
The weights and tokenizer are derived from sarvam-1. For license terms, refer to https://huggingface.co/sarvamai/sarvam-1.
## How to use
Install the fork of WordLlama:
```pip install git+https://github.com/tinisoft/WordLlama.git```
Download the weights and tokenizer:
```git clone https://huggingface.co/tinisoft/wordllama-indic && cd wordllama-indic```
The model can then be used like this:
```
from wordllama import WordLlamaInference, WordLlamaConfig, WordLlama
from safetensors import safe_open
import toml
from tokenizers import Tokenizer
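# Load the tokenizer shipped with the repository
tokenizer = Tokenizer.from_file("tokenizer.json")
# Read the embedding matrix from the safetensors file as a NumPy array
with safe_open("sarvam1_2b_128.safetensors", framework="pt", device="cpu") as f:
    embedding = f.get_tensor('embedding.weight').numpy()
# Load the model configuration and set its name
config_file = "sarvam1_2b.toml"
config_data = toml.load(config_file)
config_data["config_name"] = "sarvam1_2b"
config = WordLlamaConfig(**config_data)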
# Build the inference object from the embedding matrix, tokenizer, and config
wl = WordLlamaInference(
    embedding=embedding,
    tokenizer=tokenizer,
    config=config,
    binary=False,
)
# Calculate similarity between two sentences
similarity_score = wl.similarity("I went to the car", "I went to the pawn shop")
print(similarity_score) # Output: e.g., 0.0664
# Rank documents based on their similarity to a query
query = "I went to the car"
candidates = ["I went to the park", "I went to the shop", "I went to the truck", "I went to the vehicle"]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)
# Calculate similarity between two sentences in Tamil
similarity_score = wl.similarity("நான் கார் சென்றேன்", "நான் கடைக்கு சென்றேன்")
print(similarity_score) # Output: e.g., 0.075
# Rank documents based on their similarity to a Tamil query
query = "நான் கார் சென்றேன்"
candidates = [
    "நான் பூங்காவிற்கு சென்றேன்",
    "நான் கடைக்கு சென்றேன்",
    "நான் லாரி சென்றேன்",
    "நான் வாகனத்தில் சென்றேன்",
]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)
# Rank documents based on their similarity to a Telugu query
query = "నేను కారులో వెళ్లాను"
candidates = [
    "నేను పార్క్కి వెళ్లాను",
    "నేను మార్కెట్కి వెళ్లాను",
    "నేను లారీలో వెళ్లాను",
    "నేను వాహనంలో వెళ్లాను",
]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)
```
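To give a rough idea of what `similarity` and `rank` compute, here is a minimal sketch. It assumes the WordLlama-style approach of averaging token embeddings into one sentence vector and comparing vectors with cosine similarity. The fork's actual implementation may differ, for example in normalization or truncation. The toy embedding table and the helper names (`mean_pool`, `cosine_similarity`, `rank_candidates`, `embed`) are hypothetical and not part of the `wordllama` API.

```python
import math

def mean_pool(token_vectors):
    """Average a list of token vectors into a single sentence vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(v[d] for v in token_vectors) / n for d in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_candidates(query_vec, candidate_vecs):
    """Return (candidate index, score) pairs, best match first."""
    scores = [cosine_similarity(query_vec, c) for c in candidate_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]

# Hypothetical 4-token, 3-dimensional table standing in for the real
# embedding matrix loaded from the safetensors file.
toy_embedding = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]

def embed(token_ids):
    """Look up each token id and mean-pool the resulting vectors."""
    return mean_pool([toy_embedding[i] for i in token_ids])

query = embed([0, 1])
candidates = [embed([3]), embed([2]), embed([0])]
ranking = rank_candidates(query, candidates)
print(ranking)
```

In this toy setup the candidates come back in the order 0, 2, 1: the first candidate points in the same direction as the query, the third shares one component with it, and the second is orthogonal to it.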