metadata

library_name: transformers
license: apache-2.0
language:
  - he
base_model:
  - onlplab/alephbert-base

Hebrew Punctuation model

Introduction

This model is a fine-tuned version of AlephBERT, designed to restore punctuation in Hebrew spoken language transcripts. It is specifically trained as a post-processing step for Automatic Speech Recognition (ASR) outputs, where punctuation is often missing in raw transcriptions.

Usage

For now this is the recommended way to use this model:

git lfs install 
git clone https://huggingface.co/verbit/hebrew_punctuation
cd hebrew_punctuation

Once you are in the folder you could do the following:

from transformers import BertTokenizer

from src.models import BertForPunctuation
from src.inference import get_prediction

model = BertForPunctuation.from_pretrained("verbit/hebrew_punctuation")
tokenizer = BertTokenizer.from_pretrained("verbit/hebrew_punctuation")
model.eval()

text = ("חברת ורביט פיתחה מערכת לתמלול המבוססת על בינה מלאכותית וגורם אנושי ושוקדת על תמלול עדויות ניצולי שואה את "
        "התוצאות אפשר לראות כבר ברשת בהן חלקים מעדותו של טוביה ביילסקי שהיה מפקד גדוד הפרטיזנים היהודים "
        "בביילורוסיה")
punct_text = get_prediction(
    model=model,
    text=text,
    tokenizer=tokenizer,
    backward_context=model.config.backward_context,
    forward_context=model.config.forward_context,
    return_prob=False
)
print(punct_text)

Contact

For any questions or issues, please contact [email protected].