tinisoft/wordllama-indic

Sentence Similarity


aravindhank committed on Dec 17, 2024

Commit e32af19 · verified · 1 Parent(s): 3c15818

Create README.md

Files changed (1)

README.md +97 -0

README.md ADDED

	@@ -0,0 +1,97 @@

+# WordLLama - Indic
+Inspired by WordLlama, this model is trained on the word embeddings of the Sarvam-1 model, which supports most
+Indic languages. We used a translated subset of https://huggingface.co/datasets/sentence-transformers/all-nli
+to train this model.
+The weights and tokenizer are derived from sarvam-1. For license terms, refer to https://huggingface.co/sarvamai/sarvam-1.
+## How to use
+Install the fork of WordLlama:
+`pip install git+https://github.com/tinisoft/WordLlama.git`
+Download the weights and tokenizer:
+`git clone https://huggingface.co/tinisoft/wordllama-indic && cd wordllama-indic`
+From inside the cloned directory, load and use the model like this:
+```python
+import toml
+from safetensors import safe_open
+from tokenizers import Tokenizer
+from wordllama import WordLlamaConfig, WordLlamaInference
+# Load the tokenizer shipped in this repository
+tokenizer = Tokenizer.from_file("tokenizer.json")
+# Read the embedding matrix and convert it to a NumPy array
+with safe_open("sarvam1_2b_128.safetensors", framework="pt", device="cpu") as f:
+    embedding = f.get_tensor("embedding.weight").numpy()
+# Build the model config from the TOML file in this repository
+config_data = toml.load("sarvam1_2b.toml")
+config_data["config_name"] = "sarvam1_2b"
+config = WordLlamaConfig(**config_data)
+# binary=False selects dense (non-binarized) embeddings
+wl = WordLlamaInference(
+    embedding=embedding,
+    tokenizer=tokenizer,
+    config=config,
+    binary=False,
+)
+# Calculate similarity between two sentences
+similarity_score = wl.similarity("I went to the car", "I went to the pawn shop")
+print(similarity_score)  # Output: e.g., 0.0664
+# Rank documents based on their similarity to a query
+query = "I went to the car"
+candidates = ["I went to the park", "I went to the shop", "I went to the truck", "I went to the vehicle"]
+ranked_docs = wl.rank(query, candidates)
+print(ranked_docs)
+# Calculate similarity between two sentences in Tamil
+similarity_score = wl.similarity("நான் கார் சென்றேன்", "நான் கடைக்கு சென்றேன்")
+print(similarity_score)  # Output: e.g., 0.075
+# Rank documents based on their similarity to a Tamil query
+query = "நான் கார் சென்றேன்"
+candidates = [
+    "நான் பூங்காவிற்கு சென்றேன்",
+    "நான் கடைக்கு சென்றேன்",
+    "நான் லாரி சென்றேன்",
+    "நான் வாகனத்தில் சென்றேன்"
+]
+ranked_docs = wl.rank(query, candidates)
+print(ranked_docs)
+# Rank documents based on their similarity to a Telugu query
+query = "నేను కారులో వెళ్లాను"
+candidates = [
+    "నేను పార్క్‌కి వెళ్లాను",
+    "నేను మార్కెట్‌కి వెళ్లాను",
+    "నేను లారీలో వెళ్లాను",
+    "నేను వాహనంలో వెళ్లాను"
+]
+ranked_docs = wl.rank(query, candidates)
+print(ranked_docs)
+```
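For intuition about what `similarity` and `rank` compute, here is a minimal, standard-library-only sketch of the general approach used by WordLlama-style models as commonly described: look up a vector per token, average them into a sentence vector, and compare sentences by cosine similarity. This is an illustration under assumptions, not the fork's actual implementation. The `TOY_EMBEDDINGS` table, the whitespace tokenizer, and the helper names `embed`, `cosine`, `similarity`, and `rank` are all hypothetical, and the vector values are made up.

```python
import math

# Hypothetical 3-dimensional token vectors (made-up values, NOT the model's weights).
TOY_EMBEDDINGS = {
    "i": [0.1, 0.3, 0.0],
    "went": [0.2, 0.1, 0.4],
    "to": [0.0, 0.2, 0.1],
    "the": [0.1, 0.1, 0.1],
    "car": [0.9, 0.1, 0.2],
    "truck": [0.8, 0.2, 0.3],
    "park": [0.1, 0.9, 0.2],
}


def embed(sentence):
    """Mean-pool the vectors of all known whitespace-separated tokens."""
    vectors = [
        TOY_EMBEDDINGS[token]
        for token in sentence.lower().split()
        if token in TOY_EMBEDDINGS
    ]
    if not vectors:
        raise ValueError("sentence contains no known tokens")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity(first, second):
    """Cosine similarity between two mean-pooled sentence vectors."""
    return cosine(embed(first), embed(second))


def rank(query, candidates):
    """Return (candidate, score) pairs sorted from most to least similar."""
    query_vector = embed(query)
    scored = [(c, cosine(query_vector, embed(c))) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rank("I went to the car", ["I went to the park", "I went to the truck"])
print(ranked)
```

With these toy vectors, "truck" sits close to "car", so the truck sentence ranks above the park sentence. The real model uses the Sarvam-1 tokenizer and the trained embedding matrix instead of a hand-written table, so its scores will differ.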
+## Model card metadata
+---
+language:
+- en
+- ta
+- ml
+- as
+- bn
+- gu
+- hi
+- kn
+- mr
+- or
+- te
+base_model:
+- sarvamai/sarvam-1
+pipeline_tag: sentence-similarity
+---