Historical Russian TrOCR Model for Civil Script (ru-trocr-1700s)
Model Description
This model is specifically trained to recognize Russian Civil Script (гражданский шрифт) from the 18th century. It handles the following character sets:
- Historical letters: ѣ, і, ѳ, ѵ, ъ
- Civil script variations of standard Cyrillic characters
- Both uppercase and lowercase variants
- Special typographic features of 18th-century printing
Model Performance Metrics
- Character Error Rate (CER): 1.69%
- Word Error Rate (WER): 5.75%
- Sequence Accuracy: 80.21%
- Training Loss: 0.0403
- Evaluation Loss: 0.0351
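The card does not say how these metrics were computed. As a rough guide to what they measure, the sketch below uses the standard definitions: CER and WER as Levenshtein edit distance divided by reference length (in characters and words respectively), and sequence accuracy as the share of lines recognized exactly. Corpus-level aggregation is an assumption here, and the helper names and sample strings are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs, hyps):
    """Character errors divided by total reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return errors / sum(len(r) for r in refs)


def wer(refs, hyps):
    """Word errors divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    return errors / sum(len(r.split()) for r in refs)


def sequence_accuracy(refs, hyps):
    """Fraction of lines whose prediction matches the reference exactly."""
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)


# Illustrative strings only, not taken from the evaluation set.
refs = ["въ лѣто", "миръ"]
hyps = ["въ лето", "миръ"]
print(cer(refs, hyps), wer(refs, hyps), sequence_accuracy(refs, hyps))
```

A single substituted ѣ counts as one character error but makes the whole word and the whole line wrong, which is why WER and sequence-level error are typically much higher than CER.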
Training Details
- Base Model: TrOCR
- Training Duration: ~25.5 hours
- Epochs: 3
- Steps: 1227
- Training Samples per Second: 0.428
- Special Focus: Civil script character recognition including historical letters and their variants
- Training Data: 18th-century Russian books from the National Library of Russia
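The figures above can be cross-checked with some simple arithmetic. The sketch below assumes that the reported samples per second equals total samples processed divided by total training time, and that the ~25.5 hour duration is accurate enough to use directly. Neither is confirmed by the card, so the results are rough estimates, not reported values.

```python
# Values copied from the training details above.
runtime_hours = 25.5        # approximate, as reported
samples_per_second = 0.428
epochs = 3
steps = 1227

# Assumption: throughput = total samples processed / total runtime.
samples_processed = samples_per_second * runtime_hours * 3600
samples_per_epoch = samples_processed / epochs
effective_batch_size = samples_processed / steps

print(f"samples processed:    {samples_processed:.0f}")
print(f"samples per epoch:    {samples_per_epoch:.0f}")
print(f"effective batch size: {effective_batch_size:.1f}")
```

Under these assumptions the run processed roughly 39,000 line images in total, or about 13,000 per epoch, with an effective batch size close to 32.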
Historical Context
The model is trained on texts printed in Civil Script (гражданский шрифт), introduced by Peter the Great's reform in 1708. The script marked a significant transition in Russian typography, from the traditional Church Slavonic typeface to simpler, more modern letterforms. Civil Script remained the standard for Russian printers until the 1830s, making it the primary typeface for Russian printed books throughout the 18th and early 19th centuries.
Limitations and Recommendations
- Optimized for line-level recognition of historical Russian texts in Civil Script
- Best performance on well-segmented lines
- May require pre-processing for damaged or low-quality images
- Specifically tuned for 18th-century Russian printing conventions
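The card does not specify which pre-processing to use for damaged or low-quality images. One possible minimal sketch with Pillow is shown below: grayscale conversion, a small median filter to suppress speckle, and contrast stretching. The function name `preprocess_line` and the parameter values are illustrative assumptions, not settings used in training, so the effect on accuracy should be checked on your own scans.

```python
from PIL import Image, ImageFilter, ImageOps


def preprocess_line(image, denoise=True):
    """Return a cleaned-up RGB copy of a cropped text-line image.

    Hypothetical helper: the steps and parameters are assumptions,
    not part of the model's documented pipeline.
    """
    gray = ImageOps.grayscale(image)
    if denoise:
        # A 3x3 median filter removes isolated specks of noise.
        gray = gray.filter(ImageFilter.MedianFilter(size=3))
    # Stretch the histogram, ignoring the darkest and lightest 1% of pixels.
    gray = ImageOps.autocontrast(gray, cutoff=1)
    # The usage example feeds RGB images to the processor, so convert back.
    return gray.convert("RGB")
```

The returned image can be passed to the processor in place of the raw scan. Aggressive filtering can erase thin strokes and diacritics, so it is worth comparing results with and without `denoise`.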
Usage Example
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
processor = TrOCRProcessor.from_pretrained("taiga75/ru-trocr-1700s")
model = VisionEncoderDecoderModel.from_pretrained("taiga75/ru-trocr-1700s")
# Load one cropped text-line image; the model expects line-level input
image = Image.open("path_to_image").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values
# Generate token IDs, then decode them into the recognized text
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
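Because the model works at line level, a page usually yields many line images. The sketch below shows one way to run them through the model in small batches, using the same processor and model calls as the example above. The helper names `chunked` and `recognize_lines` and the default batch size are hypothetical, and line segmentation is assumed to have been done by a separate tool.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def recognize_lines(images, processor, model, batch_size=8):
    """Recognize a list of RGB line images and return one string per image.

    `processor` and `model` are the objects loaded with from_pretrained;
    the batch size is an illustrative default, not a tuned value.
    """
    texts = []
    for batch in chunked(images, batch_size):
        # The processor accepts a list of images and stacks them into one tensor.
        pixel_values = processor(images=batch, return_tensors="pt").pixel_values
        generated_ids = model.generate(pixel_values)
        texts.extend(processor.batch_decode(generated_ids, skip_special_tokens=True))
    return texts
```

With `processor` and `model` loaded as above, `recognize_lines(line_images, processor, model)` returns the texts in the same order as the input images, so they can be joined with newlines to rebuild the page.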
Citation
If you use this model in your research, please cite:
@misc{maria_levchenko_2025,
author = {{Maria Levchenko}},
title = {ru-trocr-1700s (Revision 8d7a9f4)},
year = 2025,
url = {https://huggingface.co/taiga75/ru-trocr-1700s},
doi = {10.57967/hf/3942},
publisher = {Hugging Face}
}