metadata
language:
  - en
  - ta
  - ml
  - as
  - bn
  - gu
  - hi
  - kn
  - mr
  - or
  - te
base_model:
  - sarvamai/sarvam-1
pipeline_tag: sentence-similarity

WordLLama - Indic

Inspired by WordLlama, this model is trained on the word embeddings of the Sarvam-1 model, which supports most Indic languages. We used a translated subset of https://huggingface.co/datasets/sentence-transformers/all-nli to train it.

The weights and tokenizer are derived from sarvam-1. For license terms, refer to https://huggingface.co/sarvamai/sarvam-1.

How to use

Install the fork of WordLlama: pip install git+https://github.com/tinisoft/WordLlama.git

Download the weights and tokenizer: git clone https://huggingface.co/tinisoft/wordllama-indic && cd wordllama-indic
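Before loading the model, it can help to confirm that the clone contains the three files the usage example reads. The following is a minimal sketch; `missing_files` is a hypothetical helper, not part of WordLlama, and the file names are the ones used in the usage example.

```python
from pathlib import Path

# File names as they appear in the usage example of this README.
REQUIRED_FILES = (
    "tokenizer.json",
    "sarvam1_2b_128.safetensors",
    "sarvam1_2b.toml",
)


def missing_files(model_dir="."):
    """Return the required files that are absent from model_dir.

    Hypothetical helper for illustration; not part of the WordLlama package.
    """
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All required files are present.")
```

Run it from inside the cloned wordllama-indic directory, or pass the directory path to `missing_files`.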

The model can then be used as follows:

from wordllama import WordLlamaInference, WordLlamaConfig
from safetensors import safe_open
import toml
from tokenizers import Tokenizer

# Load the tokenizer and the embedding matrix from the cloned repository
tokenizer = Tokenizer.from_file("tokenizer.json")
with safe_open("sarvam1_2b_128.safetensors", framework="pt", device="cpu") as f:
    embedding = f.get_tensor("embedding.weight").numpy()

# Build the model configuration from the bundled TOML file
config_file = "sarvam1_2b.toml"
config_data = toml.load(config_file)
config_data["config_name"] = "sarvam1_2b"
config = WordLlamaConfig(**config_data)

# Create the inference object with dense (non-binary) embeddings
wl = WordLlamaInference(
    embedding=embedding,
    tokenizer=tokenizer,
    config=config,
    binary=False,
)

# Calculate similarity between two sentences
similarity_score = wl.similarity("I went to the car", "I went to the pawn shop")
print(similarity_score)  # Output: e.g., 0.0664

# Rank documents based on their similarity to a query
query = "I went to the car"
candidates = ["I went to the park", "I went to the shop", "I went to the truck", "I went to the vehicle"]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)


# Calculate similarity between two sentences in Tamil
similarity_score = wl.similarity("நான் கார் சென்றேன்", "நான் கடைக்கு சென்றேன்")
print(similarity_score)  # Output: e.g., 0.075

# Rank documents based on their similarity to a Tamil query
query = "நான் கார் சென்றேன்"
candidates = [
    "நான் பூங்காவிற்கு சென்றேன்", 
    "நான் கடைக்கு சென்றேன்", 
    "நான் லாரி சென்றேன்", 
    "நான் வாகனத்தில் சென்றேன்"
]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)

# Rank documents based on their similarity to a Telugu query
query = "నేను కారులో వెళ్లాను"
candidates = [
    "నేను పార్క్‌కి వెళ్లాను",
    "నేను మార్కెట్‌కి వెళ్లాను",
    "నేను లారీలో వెళ్లాను",
    "నేను వాహనంలో వెళ్లాను"
]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)
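To give some intuition for the scores above, here is a toy sketch of the general idea, assuming (as the upstream WordLlama project describes) that a sentence vector is the average of its token vectors and that dense similarity is the cosine between two such vectors. This is not the library's code: the 3-dimensional table, the token IDs, and the helper names `mean_pool`, `cosine_similarity`, and `rank` are all made up for illustration.

```python
import math


def mean_pool(token_ids, table):
    """Average the embedding rows selected by token_ids."""
    rows = [table[i] for i in token_ids]
    dim = len(rows[0])
    return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query_vec, candidates):
    """Sort named candidate vectors by similarity to the query, best first."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up 4-token vocabulary with 3-dimensional vectors.
TABLE = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]

query = mean_pool([0, 1], TABLE)            # average of rows 0 and 1
candidates = {
    "orthogonal": mean_pool([2], TABLE),    # shares no direction with the query
    "aligned": mean_pool([3], TABLE),       # points the same way as the query
}

for name, score in rank(query, candidates):
    print(f"{name}: {score:.3f}")
```

The candidate pointing in the same direction as the query ranks first, and the orthogonal one scores zero. Real sentence scores, like the examples above, fall between these extremes.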

Run code like this