DictaBERT-char: A Character-Level BERT-Base Model for Hebrew

DictaBERT-char is a BERT-style language model for Hebrew, based on the BERT-base architecture (~88M parameters) with a character-level tokenizer. A version based on the BERT-large architecture is available here.
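
To see what character-level tokenization means in practice, here is a minimal sketch (the expected output is an assumption based on the model's character-level design and is not stated in this card; exact tokens depend on the tokenizer's vocabulary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('dicta-il/dictabert-char')

# each Hebrew character should come back as its own token
print(tokenizer.tokenize('砖诇讜诐'))  # expected: ['砖', '诇', '讜', '诐']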

This model was released to the public as part of the following 2025 W-NUT paper: Avi Shmidman and Shaltiel Shmidman, "Restoring Missing Spaces in Scraped Hebrew Social Media", The 10th Workshop on Noisy and User-generated Text (W-NUT), 2025.

This is the base model pretrained with the masked-language-modeling objective.

Sample usage:

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('dicta-il/dictabert-char')
model = AutoModelForMaskedLM.from_pretrained('dicta-il/dictabert-char')

model.eval()

sentence = '讘砖谞转 1948 讛砖诇讬诐 讗驻专讬诐 拽讬砖讜谉 讗转 诪讞拽专讜 讘驻讬住讜诇 诪转讻转 讜讘[MASK]讜诇讚讜转 讛讗诪谞讜转 讜讛讞诇 诇驻专住诐 诪讗诪专讬诐 讛讜诪讜专讬住讟讬讬诐'

# run inference without tracking gradients
with torch.no_grad():
    output = model(tokenizer.encode(sentence, return_tensors='pt'))

# the [MASK] is the 52nd token (including [CLS])
top_arg = torch.argmax(output.logits[0, 52, :]).item()
print(tokenizer.convert_ids_to_tokens([top_arg]))  # should print ['转']
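
For quick experiments, the same masked-character prediction can also be run through the Transformers fill-mask pipeline. The snippet below is a minimal sketch, not part of the original example; it assumes the standard pipeline API and the '[MASK]' token shown above:

from transformers import pipeline

fill_mask = pipeline('fill-mask', model='dicta-il/dictabert-char')

sentence = '讘砖谞转 1948 讛砖诇讬诐 讗驻专讬诐 拽讬砖讜谉 讗转 诪讞拽专讜 讘驻讬住讜诇 诪转讻转 讜讘[MASK]讜诇讚讜转 讛讗诪谞讜转 讜讛讞诇 诇驻专住诐 诪讗诪专讬诐 讛讜诪讜专讬住讟讬讬诐'

# the pipeline returns the top candidate characters for the masked position
for prediction in fill_mask(sentence):
    print(prediction['token_str'], round(prediction['score'], 4))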

Citation

If you use DictaBERT-char in your research, please cite Restoring Missing Spaces in Scraped Hebrew Social Media.

BibTeX:

@inproceedings{shmidman2025restoring,
  author    = {Avi Shmidman and Shaltiel Shmidman},
  title     = {Restoring Missing Spaces in Scraped Hebrew Social Media},
  booktitle = {The 10th Workshop on Noisy and User-generated Text (W-NUT)},
  year      = {2025}
}

License

This work is licensed under a Creative Commons Attribution 4.0 International License.
