---
language:
- en
- ta
- ml
- as
- bn
- gu
- hi
- kn
- mr
- or
- te
base_model:
- sarvamai/sarvam-1
pipeline_tag: sentence-similarity
---

# WordLlama - Indic

Inspired by WordLlama, this model is trained on the word embeddings of the Sarvam-1 model, which supports most Indic languages. We trained it on a translated subset of https://huggingface.co/datasets/sentence-transformers/all-nli.
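
For context, the base NLI pairs can be inspected directly from the Hugging Face hub. The sketch below only loads the English source data; the "pair-class" configuration and column names are an assumption about which subset was translated, and the Indic translations themselves are not published here.

```
from datasets import load_dataset

# Load the English all-nli pairs that served as the translation source.
# Which configuration and split were actually translated is an assumption.
nli = load_dataset("sentence-transformers/all-nli", "pair-class", split="train")
print(nli[0])  # e.g. {'premise': ..., 'hypothesis': ..., 'label': ...}
```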

The weights and tokenizer are derived from sarvam-1. For license terms, refer to https://huggingface.co/sarvamai/sarvam-1.


## How to use

Install the fork of WordLlama:

```
pip install "wordllama @ git+https://github.com/tinisoft/WordLlama.git"
```

Download the weights and tokenizer:

```
git clone https://huggingface.co/tinisoft/wordllama-indic && cd wordllama-indic
```

The model can then be used like this:
```
from wordllama import WordLlamaInference, WordLlamaConfig
from safetensors import safe_open
import toml
from tokenizers import Tokenizer

# Load the tokenizer and the 128-dimensional embedding matrix exported from sarvam-1.
tokenizer = Tokenizer.from_file("tokenizer.json")
f = safe_open("sarvam1_2b_128.safetensors", framework="pt", device="cpu")
embedding = f.get_tensor("embedding.weight").numpy()

# Build the inference configuration from the bundled TOML file.
config_file = "sarvam1_2b.toml"
config_data = toml.load(config_file)
config_data["config_name"] = "sarvam1_2b"
config = WordLlamaConfig(**config_data)

wl = WordLlamaInference(
    embedding=embedding,
    tokenizer=tokenizer,
    config=config,
    binary=False,
)

# Calculate similarity between two sentences
similarity_score = wl.similarity("I went to the car", "I went to the pawn shop")
print(similarity_score)  # Output: e.g., 0.0664

# Rank documents based on their similarity to a query
query = "I went to the car"
candidates = ["I went to the park", "I went to the shop", "I went to the truck", "I went to the vehicle"]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)


# Calculate similarity between two sentences in Tamil
similarity_score = wl.similarity("நான் கார் சென்றேன்", "நான் கடைக்கு சென்றேன்")
print(similarity_score)  # Output: e.g., 0.075

# Rank documents based on their similarity to a Tamil query
query = "நான் கார் சென்றேன்"
candidates = [
    "நான் பூங்காவிற்கு சென்றேன்", 
    "நான் கடைக்கு சென்றேன்", 
    "நான் லாரி சென்றேன்", 
    "நான் வாகனத்தில் சென்றேன்"
]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)

# Rank documents based on their similarity to a Telugu query
query = "నేను కారులో వెళ్లాను"
candidates = [
    "నేను పార్క్‌కి వెళ్లాను",
    "నేను మార్కెట్‌కి వెళ్లాను",
    "నేను లారీలో వెళ్లాను",
    "నేను వాహనంలో వెళ్లాను"
]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)
```
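
If the fork keeps the upstream WordLlama inference API, you can also pull the pooled sentence vectors out directly. The sketch below assumes `wl.embed` returns a NumPy array of shape `(n_texts, dim)`, as it does in upstream WordLlama; treat it as an illustration rather than a documented guarantee of this fork.

```
import numpy as np

# Assumed to follow upstream WordLlama: embed() pools token embeddings
# into one vector per input text.
embs = wl.embed(["I went to the car", "நான் கார் சென்றேன்"])

# Normalize and take a dot product to get cosine similarity; this should
# roughly agree with wl.similarity() on the same pair.
embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
print(float(embs[0] @ embs[1]))
```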