indo-sbert-nli-similarity-step-1

A BERT-based model fine-tuned for Natural Language Inference using a similarity approach.

Model Details

This model is a fine-tuned version of firqaaa/indo-sentence-bert-base for Natural Language Inference (NLI) tasks in Indonesian. It uses a similarity-based approach to determine the inferential relationship between a premise and hypothesis, classifying it as entailment, neutral, or contradiction.

Training Data

The model was fine-tuned on the afaji/indonli dataset, which contains Indonesian premise-hypothesis pairs labeled with entailment, neutral, or contradiction.
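For reference, the dataset can be loaded with the Hugging Face datasets library. The sketch below assumes the standard IndoNLI premise / hypothesis / label columns and the train / validation / test_lay / test_expert splits:

from datasets import load_dataset

# Load the IndoNLI dataset used for fine-tuning
dataset = load_dataset("afaji/indonli")

# Inspect one training example; the label feature encodes
# entailment / neutral / contradiction as integers
example = dataset["train"][0]
print(example["premise"])
print(example["hypothesis"])
print(example["label"])
print(dataset["train"].features["label"])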

Evaluation Results

  • Validation: loss 0.1249, accuracy 0.5831, Pearson 0.5690
  • Test (Lay): loss 0.1365, accuracy 0.5638, Pearson 0.5261
  • Test (Expert): loss 0.1742, accuracy 0.4578, Pearson 0.3038
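
The accuracy figures use the same similarity-to-label thresholds shown in the Usage section below. Here is a minimal sketch of how accuracy and Pearson correlation could be recomputed from model similarities, assuming gold labels are mapped to target scores of 1.0 (entailment), 0.5 (neutral), and 0.0 (contradiction); that mapping is an assumption, not confirmed by the card:

import numpy as np

# Threshold mapping taken from the Usage section below
def similarity_to_label(score):
    if score >= 0.7:
        return "entailment"
    if score <= 0.3:
        return "contradiction"
    return "neutral"

def evaluate(similarities, gold_labels):
    """Accuracy and Pearson correlation of cosine similarities vs. gold NLI labels."""
    preds = [similarity_to_label(s) for s in similarities]
    accuracy = float(np.mean([p == g for p, g in zip(preds, gold_labels)]))

    # Assumed target-score mapping for the Pearson metric
    target = {"entailment": 1.0, "neutral": 0.5, "contradiction": 0.0}
    gold_scores = np.array([target[g] for g in gold_labels])
    pearson = float(np.corrcoef(np.asarray(similarities), gold_scores)[0, 1])
    return accuracy, pearson

# Example with dummy predictions
acc, r = evaluate([0.82, 0.15, 0.55], ["entailment", "contradiction", "neutral"])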

Usage

from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

# Load model and tokenizer
model = AutoModel.from_pretrained("fabhiansan/indo-sbert-nli-similarity")
tokenizer = AutoTokenizer.from_pretrained("fabhiansan/indo-sbert-nli-similarity")

# Function for mean pooling
def mean_pooling(token_embeddings, attention_mask):
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Example NLI inputs
premise = "Keindahan alam yang terdapat di Gunung Batu Jonggol ini dapat Anda manfaatkan sebagai objek fotografi yang cantik."
hypothesis = "Keindahan alam tidak dapat difoto."

# Encode inputs
encoded_premise = tokenizer(premise, padding=True, truncation=True, return_tensors="pt")
encoded_hypothesis = tokenizer(hypothesis, padding=True, truncation=True, return_tensors="pt")

# Get embeddings
with torch.no_grad():
    # Forward pass through the encoder
    outputs_premise = model(**encoded_premise)
    outputs_hypothesis = model(**encoded_hypothesis)
    
    # Mean pooling
    embedding_premise = mean_pooling(outputs_premise.last_hidden_state, encoded_premise["attention_mask"])
    embedding_hypothesis = mean_pooling(outputs_hypothesis.last_hidden_state, encoded_hypothesis["attention_mask"])
    
    # Normalize embeddings
    embedding_premise = F.normalize(embedding_premise, p=2, dim=1)
    embedding_hypothesis = F.normalize(embedding_hypothesis, p=2, dim=1)
    
    # Compute similarity
    similarity = F.cosine_similarity(embedding_premise, embedding_hypothesis).item()

# Convert similarity to NLI label
if similarity >= 0.7:
    label = "entailment"
elif similarity <= 0.3:
    label = "contradiction"
else:
    label = "neutral"

print(f"Premise: {premise}")
print(f"Hypothesis: {hypothesis}")
print(f"Similarity: {similarity:.4f}")
print(f"NLI Label: {label}")

Limitations and Biases

  • The model is specifically trained for Indonesian language and may not perform well on other languages or code-switched text.
  • Performance may vary on domain-specific texts that differ significantly from the training data.
  • Like all language models, this model may reflect biases present in the training data.

Citation

If you use this model in your research, please cite:

@misc{fabhiansan2025indonli,
  author = {Fabhiansan},
  title = {Fine-tuned SBERT for Indonesian Natural Language Inference},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/fabhiansan/indo-sbert-nli-similarity-step-1}}
}

And also cite the original SBERT and Indo-SBERT works:

@inproceedings{reimers-2019-sentence-bert,
  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  month = "11",
  year = "2019",
  publisher = "Association for Computational Linguistics",
  url = "https://arxiv.org/abs/1908.10084",
}
@misc{arasyi2022indo,
  author = {Arasyi, Firqa},
  title = {indo-sentence-bert: Sentence Transformer for Bahasa Indonesia with Multiple Negative Ranking Loss},
  year = {2022},
  month = {9},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/firqaaa/indo-sentence-bert-base}}
}