---
library_name: transformers
license: gpl-3.0
language:
- ar
- en
pipeline_tag: image-to-text
pretty_name: Arabic Large Nougat
datasets:
- MohamedRashad/arabic-img2md
tags:
  - arabic
  - ocr
  - books
  - markdown-extraction
  - vision-transformers
---

# Arabic Large Nougat

End-to-End Structured OCR for Arabic books.
<center>
  <img src="https://cdn-uploads.huggingface.co/production/uploads/6116d0584ef9fdfbf45dc4d9/Z0LKcfyOFJG9uopzqinCx.jpeg" width="60%">
</center>

## Description

<div align="center">
<!-- **Affiliations:** -->

[**Github**](https://github.com/MohamedAliRashad/arabic-nougat)  🤗  [**Hugging Face**](https://huggingface.co/collections/MohamedRashad/arabic-nougat-673a3f540bd92904c9b92a8e) 📝  [**Paper**](https://arxiv.org/abs/2411.17835) 🗂️  [**Data**](https://huggingface.co/datasets/MohamedRashad/arabic-img2md) 📽️  [**Demo**](https://huggingface.co/spaces/MohamedRashad/Arabic-Nougat)

</div>

The arabic-large-nougat OCR is an end-to-end structured Optical Character Recognition (OCR) system designed specifically for the Arabic language.

This model was trained from scratch on the base Nougat architecture with the new tokenizer [riotu-lab/Aranizer-PBE-86k](https://huggingface.co/riotu-lab/Aranizer-PBE-86k).
It was trained on the [MohamedRashad/arabic-img2md](https://huggingface.co/datasets/MohamedRashad/arabic-img2md) dataset.

## How to Get Started with the Model

**Demo:** https://huggingface.co/spaces/MohamedRashad/Arabic-Nougat

Or, use the code below to get started with the model locally.

Don't forget to update transformers:
`pip install -U transformers`

```python
from PIL import Image
import torch
from transformers import NougatProcessor, VisionEncoderDecoderModel

# Load the model and processor
processor = NougatProcessor.from_pretrained("MohamedRashad/arabic-large-nougat")
model = VisionEncoderDecoderModel.from_pretrained(
    "MohamedRashad/arabic-large-nougat",
    torch_dtype=torch.bfloat16,
    # flash_attention_2 needs the flash-attn package and a supported CUDA GPU;
    # set the decoder to "eager" as well if it is not available
    attn_implementation={"decoder": "flash_attention_2", "encoder": "eager"},
)

# Get the max context length of the model & dtype of the weights
context_length = model.decoder.config.max_position_embeddings
torch_dtype = model.dtype

# Move the model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)


def predict(img_path):
    # Load the page image as RGB and convert it to model-ready pixel values
    image = Image.open(img_path).convert("RGB")
    pixel_values = (
        processor(image, return_tensors="pt").pixel_values.to(torch_dtype).to(device)
    )

    # Generate the transcription (pixel_values is already on the target device)
    outputs = model.generate(
        pixel_values,
        repetition_penalty=1.5,
        min_length=1,
        max_new_tokens=context_length,
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
    )

    # Decode the generated token ids into the Markdown text of the page
    page_sequence = processor.batch_decode(outputs, skip_special_tokens=True)[0]
    return page_sequence


print(predict("path/to/page_image.jpg"))

```
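To transcribe a whole book, the single-page `predict` function can be applied page by page. The following is a minimal sketch and not part of the original model code: `transcribe_pages` and its `predict_fn` and `separator` parameters are hypothetical names, and it assumes the pages were already exported as image files whose file names sort in page order (for example, zero-padded page numbers).

```python
from pathlib import Path
from typing import Callable, Iterable, Union


def transcribe_pages(
    image_paths: Iterable[Union[str, Path]],
    predict_fn: Callable[[str], str],
    separator: str = "\n\n",
) -> str:
    """Run predict_fn on each page image (sorted by path) and join the results.

    Hypothetical helper: predict_fn is any function mapping an image path
    to its Markdown transcription, such as the predict function above.
    """
    pages = []
    for path in sorted(str(p) for p in image_paths):
        text = predict_fn(path).strip()
        if text:  # skip pages that produced no text
            pages.append(text)
    return separator.join(pages)


# Possible usage (assumes a "book_pages" folder of JPEG page images):
# markdown = transcribe_pages(Path("book_pages").glob("*.jpg"), predict)
# Path("book.md").write_text(markdown, encoding="utf-8")
```

Passing the prediction function as an argument keeps the helper independent of the model, so it can be tried with a stub before loading any weights.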


## Bias, Risks, and Limitations

1. **Text Hallucination:** The model may occasionally generate repeated or incorrect text due to the inherent complexities of OCR tasks.
1. **Erroneous Image Paths:** The model sometimes outputs image paths that do not correspond to anything in the input page.
1. **Context Length Constraint:** The model has a maximum context length of 2048 tokens, which may result in incomplete transcriptions for longer book pages.
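
The limitations above can be partly mitigated by checking the output after generation. The sketch below is an illustration under stated assumptions, not a method shipped with the model: the helper names, the `min_repeats` threshold, and the choice to drop every Markdown image reference are all hypothetical and should be tuned to your data.

```python
import re

# Markdown image syntax: ![alt text](path "optional title")
IMAGE_REF = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def strip_image_refs(markdown: str) -> str:
    """Remove Markdown image references, which the model may hallucinate."""
    return IMAGE_REF.sub("", markdown)


def has_repeated_run(markdown: str, min_repeats: int = 4) -> bool:
    """Return True if the same non-empty line occurs min_repeats times in a row.

    Blank lines reset the run, so this only catches directly adjacent repeats.
    """
    run, previous = 0, None
    for line in (raw.strip() for raw in markdown.splitlines()):
        if line and line == previous:
            run += 1
        else:
            run = 1 if line else 0
        previous = line
        if run >= min_repeats:
            return True
    return False


def may_be_truncated(num_generated_tokens: int, context_length: int = 2048) -> bool:
    """Return True if generation used the whole token budget, so text may be cut off."""
    return num_generated_tokens >= context_length
```

With the generation code from the getting-started section, the token count could be approximated by `outputs.shape[1]`; that is an assumption about how you count tokens, and the value may be off by the decoder start token.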

## Intended Use

The arabic-large-nougat OCR is designed for tasks that involve converting images of Arabic book pages into structured text, especially when Markdown format is desired. It is suitable for applications in the field of digitizing Arabic literature and facilitating text extraction from printed materials.

## Ethical Considerations

It is crucial to be aware of the model's limitations, particularly in instances where accurate OCR results are critical. Users are advised to verify and review the output, especially in scenarios where precision is paramount.

## Model Details

- **Developed by:** Mohamed Rashad
- **Model type:** VisionEncoderDecoderModel
- **Language(s) (NLP):** Arabic & English
- **License:** GPL 3.0

## Acknowledgment

If you use or build upon the arabic-large-nougat OCR, please acknowledge the model developer and the open-source community for their contributions. Additionally, be sure to include a copy of the GPL 3.0 license with any redistributed or modified versions of the model.

Releasing the model under the GPL 3.0 license promotes the principles of open source and ensures that the benefits of the model are shared with the broader community.

### Citation

If you find this model useful, please cite the corresponding research paper:
```bibtex
@misc{rashad2024arabicnougatfinetuningvisiontransformers,
      title={Arabic-Nougat: Fine-Tuning Vision Transformers for Arabic OCR and Markdown Extraction}, 
      author={Mohamed Rashad},
      year={2024},
      eprint={2411.17835},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.17835}, 
}
```

### Disclaimer

The arabic-large-nougat OCR is a tool provided "as is," and the developers make no guarantees regarding its suitability for specific tasks. Users are encouraged to thoroughly evaluate the model's output for their particular use cases and requirements.