---
title: img-read
emoji: 📚
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.23.3
app_file: app.py
pinned: false
short_description: 'Extract Hindi & English text from images and search keywords'
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Byaldi + Qwen2VL

![Screenshot of the app](Screenshot680.png)

## Overview

The **Byaldi + Qwen2VL** app extracts Hindi and English text from images and lets you search the result for keywords. It combines the **RAGMultiModalModel** from Byaldi, which indexes and searches images, with the **Qwen2VL** model, which generates responses from visual and textual inputs.

The app runs on **ZeroGPU**, backed by an **NVIDIA A100** GPU, so large models and large or complex images can be processed quickly.

## Features

- **Image Upload**: Upload an image that contains text.
- **Text Extraction**: Extracts the text from the uploaded image using the Byaldi and Qwen2VL models.
- **Keyword Search**: Searches the extracted text for the keywords you enter and highlights each match.
- **High Performance**: Runs on **ZeroGPU (NVIDIA A100)** for accelerated model execution.
- **User-Friendly Interface**: Built with Gradio for an interactive experience.

## Technologies Used

- **Gradio**: For creating the web interface.
- **Byaldi RAGMultiModalModel**: For indexing and searching images.
- **Qwen2VL**: For generating responses based on visual and textual inputs.
- **ZeroGPU**: For efficient model inference using **NVIDIA A100**.
- **PyTorch**: For deep learning functionalities.
- **Pillow**: For image handling.

## Getting Started

### Prerequisites

- Python 3.10 or later (required by Gradio 5, the SDK version this Space uses)
- Required libraries:
  ```bash
  pip install gradio byaldi transformers torch pillow
  ```

### Installation

1. Clone the repository:
   ```bash
   git clone <repository-url>
   cd <repository-directory>
   ```

2. Install the required dependencies with the `pip install` command listed under Prerequisites.

3. Run the application:
   ```bash
   python app.py
   ```

### Using the App

1. **Upload an Image**: Click on the "Upload an Image" button to select and upload an image containing text.
2. **Extract Text**: Press the "Extract Text" button to process the image and extract any text found.
3. **Search Keywords**: Enter keywords in the search box and click "Search" to highlight matching keywords in the extracted text.

## Code Overview

The core functionality lives in two functions:

- **OCR and Text Extraction**:
  - The `ocr_and_extract` function processes the uploaded image, extracts text, and cleans the output to remove unnecessary labels.

- **Keyword Highlighting**:
  - The `search_keywords` function takes the extracted text and user-defined keywords, highlighting matches within the text for better visibility.
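
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the code in `app.py`: the helper names `clean_ocr_output` and `highlight_keywords`, the set of labels being stripped, and the use of `<mark>` tags for highlighting are all assumptions made for the example.

```python
import html
import re

# Hypothetical label set: role names a chat-style model might echo on their own line.
_LABEL_LINE = re.compile(
    r"^[ \t]*(?:system|user|assistant)[ \t]*:?[ \t]*$",
    flags=re.IGNORECASE | re.MULTILINE,
)


def clean_ocr_output(raw: str) -> str:
    """Remove lines that consist only of a role label, then trim whitespace."""
    return _LABEL_LINE.sub("", raw).strip()


def highlight_keywords(text: str, keywords: str) -> str:
    """Return HTML in which every keyword match is wrapped in <mark> tags.

    Keywords are whitespace-separated and matched case-insensitively.
    Text outside and inside the matches is HTML-escaped separately, so a
    keyword can never match inside an escape sequence such as "&amp;".
    """
    # Longest terms first, so a longer keyword wins over its own prefix.
    terms = sorted(set(keywords.split()), key=len, reverse=True)
    if not terms:
        return html.escape(text)

    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    parts = []
    last = 0
    for match in pattern.finditer(text):
        parts.append(html.escape(text[last:match.start()]))
        parts.append(f"<mark>{html.escape(match.group(0))}</mark>")
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)
```

Because Python strings are Unicode, the same matching works for Hindi keywords; case-insensitive matching simply has no effect on Devanagari text.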

## ZeroGPU Integration

The application runs on **ZeroGPU**, backed by an **NVIDIA A100** GPU. This provides:

- Faster image processing and text extraction.
- Seamless handling of large-scale models like Qwen2VL.
- Optimal performance during high computational loads.
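
On Hugging Face Spaces, a function is usually marked for ZeroGPU execution with the `spaces.GPU` decorator from the `spaces` package. The sketch below shows that pattern under stated assumptions: whether `app.py` uses it this way is not confirmed here, `run_inference` is a hypothetical placeholder for the real model call, and the fallback decorator exists only so the snippet also runs on a machine without the `spaces` package.

```python
try:
    import spaces  # Hugging Face helper package for ZeroGPU Spaces

    gpu = spaces.GPU
except ImportError:
    # Assumption for local runs: without the package, fall back to a no-op decorator.
    def gpu(fn):
        return fn


@gpu
def run_inference(prompt: str) -> str:
    """Hypothetical stand-in for the GPU-bound model call."""
    # In a real app, the model forward pass would go here.
    return f"processed: {prompt}"
```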

## Error Handling

The application includes basic error handling for issues encountered during image processing. Errors are printed to the console, and a user-friendly message is shown in the interface.
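
One way to get this behavior is a small wrapper around the processing function, sketched below. The name `with_friendly_errors` and the default message are hypothetical and only illustrate the pattern of logging the traceback while returning a readable message.

```python
import functools
import traceback
from typing import Callable


def with_friendly_errors(
    fn: Callable[..., str],
    message: str = "Something went wrong while processing the image.",
) -> Callable[..., str]:
    """Wrap fn so that exceptions are logged and replaced by a readable message."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs) -> str:
        try:
            return fn(*args, **kwargs)
        except Exception:
            # Full traceback goes to the console (stderr) for debugging.
            traceback.print_exc()
            # The interface only ever sees this message.
            return message

    return wrapper
```

A wrapped function can then be passed to the interface in place of the original, so a failure in one request does not surface a raw stack trace to the user.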

## References

- [Byaldi](https://huggingface.co/vidore/colpali) for providing the RAGMultiModalModel.
- [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) for state-of-the-art models.
- [ZeroGPU](https://www.zerogpu.com) for enabling efficient GPU computation with NVIDIA A100.