indo-sbert-nli-similarity-step-1

A BERT-based model fine-tuned for Natural Language Inference using a similarity approach.

Model Details

This model is a fine-tuned version of firqaaa/indo-sentence-bert-base for Natural Language Inference (NLI) tasks in Indonesian. It uses a similarity-based approach to determine the inferential relationship between a premise and hypothesis, classifying it as entailment, neutral, or contradiction.

Training Data

The model was fine-tuned on the afaji/indonli dataset, which contains Indonesian premise-hypothesis pairs labeled with entailment, neutral, or contradiction.
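For reference, the dataset can be loaded with the Hugging Face datasets library. The sketch below assumes the standard IndoNLI premise / hypothesis / label columns and the train / validation / test_lay / test_expert splits:

from datasets import load_dataset

# Load the IndoNLI dataset used for fine-tuning
dataset = load_dataset("afaji/indonli")

# Inspect one training example; the label feature encodes
# entailment / neutral / contradiction as integers
example = dataset["train"][0]
print(example["premise"])
print(example["hypothesis"])
print(example["label"])
print(dataset["train"].features["label"])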

Evaluation Results

  • Validation: loss 0.1249, accuracy 0.5831, Pearson 0.5690
  • Test (Lay): loss 0.1365, accuracy 0.5638, Pearson 0.5261
  • Test (Expert): loss 0.1742, accuracy 0.4578, Pearson 0.3038
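
The accuracy figures use the same similarity-to-label thresholds shown in the Usage section below. Here is a minimal sketch of how accuracy and Pearson correlation could be recomputed from model similarities, assuming gold labels are mapped to target scores of 1.0 (entailment), 0.5 (neutral), and 0.0 (contradiction); that mapping is an assumption, not confirmed by the card:

import numpy as np

# Threshold mapping taken from the Usage section below
def similarity_to_label(score):
    if score >= 0.7:
        return "entailment"
    if score <= 0.3:
        return "contradiction"
    return "neutral"

def evaluate(similarities, gold_labels):
    """Accuracy and Pearson correlation of cosine similarities vs. gold NLI labels."""
    preds = [similarity_to_label(s) for s in similarities]
    accuracy = float(np.mean([p == g for p, g in zip(preds, gold_labels)]))

    # Assumed target-score mapping for the Pearson metric
    target = {"entailment": 1.0, "neutral": 0.5, "contradiction": 0.0}
    gold_scores = np.array([target[g] for g in gold_labels])
    pearson = float(np.corrcoef(np.asarray(similarities), gold_scores)[0, 1])
    return accuracy, pearson

# Example with dummy predictions
acc, r = evaluate([0.82, 0.15, 0.55], ["entailment", "contradiction", "neutral"])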

Usage

from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

# Load model and tokenizer
model = AutoModel.from_pretrained("fabhiansan/indo-sbert-nli-similarity")
tokenizer = AutoTokenizer.from_pretrained("fabhiansan/indo-sbert-nli-similarity")

# Function for mean pooling
def mean_pooling(token_embeddings, attention_mask):
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Example NLI inputs
premise = "Keindahan alam yang terdapat di Gunung Batu Jonggol ini dapat Anda manfaatkan sebagai objek fotografi yang cantik."
hypothesis = "Keindahan alam tidak dapat difoto."

# Encode inputs
encoded_premise = tokenizer(premise, padding=True, truncation=True, return_tensors="pt")
encoded_hypothesis = tokenizer(hypothesis, padding=True, truncation=True, return_tensors="pt")

# Get embeddings
with torch.no_grad():
    # Forward pass through the encoder
    outputs_premise = model(**encoded_premise)
    outputs_hypothesis = model(**encoded_hypothesis)
    
    # Mean pooling
    embedding_premise = mean_pooling(outputs_premise.last_hidden_state, encoded_premise["attention_mask"])
    embedding_hypothesis = mean_pooling(outputs_hypothesis.last_hidden_state, encoded_hypothesis["attention_mask"])
    
    # Normalize embeddings
    embedding_premise = F.normalize(embedding_premise, p=2, dim=1)
    embedding_hypothesis = F.normalize(embedding_hypothesis, p=2, dim=1)
    
    # Compute similarity
    similarity = F.cosine_similarity(embedding_premise, embedding_hypothesis).item()

# Convert similarity to NLI label
if similarity >= 0.7:
    label = "entailment"
elif similarity <= 0.3:
    label = "contradiction"
else:
    label = "neutral"

print(f"Premise: {premise}")
print(f"Hypothesis: {hypothesis}")
print(f"Similarity: {similarity:.4f}")
print(f"NLI Label: {label}")

Limitations and Biases

  • The model is specifically trained for Indonesian language and may not perform well on other languages or code-switched text.
  • Performance may vary on domain-specific texts that differ significantly from the training data.
  • Like all language models, this model may reflect biases present in the training data.

Citation

If you use this model in your research, please cite:

@misc{fabhiansan2025indonli,
  author = {Fabhiansan},
  title = {Fine-tuned SBERT for Indonesian Natural Language Inference},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/fabhiansan/indo-sbert-nli-similarity-step-1}}
}

And also cite the original SBERT and Indo-SBERT works:

@inproceedings{reimers-2019-sentence-bert,
  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  month = "11",
  year = "2019",
  publisher = "Association for Computational Linguistics",
  url = "https://arxiv.org/abs/1908.10084",
}
@misc{arasyi2022indo,
  author = {Arasyi, Firqa},
  title = {indo-sentence-bert: Sentence Transformer for Bahasa Indonesia with Multiple Negative Ranking Loss},
  year = {2022},
  month = {9},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/firqaaa/indo-sentence-bert-base}}
}