---
library_name: transformers
license: apache-2.0
language:
- en
base_model:
- facebook/bart-large-mnli
- google/vit-base-patch16-224
pipeline_tag: image-text-to-text
tags:
- LLMs
- VisionTransformer
- ImageQA
- DataSynthesis
---

# Dua-Vision-Base

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64f0cf1adcac1f99adbabb56/FZOLSnkBj_xPbaNQBqbU5.png)

A Vision Encoder-Decoder model that doesn’t just caption images but generates questions and possible answers based on what it “sees.” Using ViT as the encoder and BART as the decoder, it’s built for image-based QA without the fluff. Translation: feed it an image, and get back a useful question-answer pair. Perfect for creating and synthesizing data for image QA tasks. It’s one model, two tasks, and a lot of potential! #LLMs #VisionTransformer #ImageQA #AI

Dua-Vision-Base is a Vision Encoder-Decoder model. It integrates a Vision Transformer (ViT) as the encoder and BART as the decoder, enabling effective processing and contextual interpretation of visual inputs alongside natural language generation.

## Model Architecture

- **Encoder**: ViT (Vision Transformer), initialized from Google's `vit-base-patch16-224-in21k` checkpoint.
- **Decoder**: BART (Bidirectional and Auto-Regressive Transformers), initialized from the `facebook/bart-base` checkpoint.

## Usage

To use this model with images, you’ll need two components: the `ViTImageProcessor` for handling visual inputs and the `BartTokenizer` for processing text prompts.

The model is optimized for generating a question and an answer for a given image, with the following specifications:

1. **Input**:
   - Images in RGB format (processed via `ViTImageProcessor`).
   - Textual prompts, tokenized with `BartTokenizer`, for contextual initialization.
2. **Output**:
   - A textual question & answer generated from the visual content of the image.

## Installation

```bash
pip install transformers datasets torch torchvision
```

## How to Load the Model

```python
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, BartTokenizer

# Load model, image processor, and tokenizer
model = VisionEncoderDecoderModel.from_pretrained("HV-Khurdula/Dua-Vision-Base")
image_processor = ViTImageProcessor.from_pretrained("HV-Khurdula/Dua-Vision-Base")
tokenizer = BartTokenizer.from_pretrained("HV-Khurdula/Dua-Vision-Base")
```

## Inference Example

Here's a sample usage for generating a question-answer pair for an image:

```python
import requests
from PIL import Image

# Download the image and convert it to the pixel values the ViT encoder expects
image_url = "https://example.com/image.jpg"
image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values

# Generate a question-answer pair with beam search
generated_ids = model.generate(pixel_values, max_length=128, num_beams=5, early_stopping=True)
generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print("Generated:", generated_text)
```

## Training

The model was trained on a dataset of conversational prompts alongside images. During training, captions were generated based on both the image content and specific prompts, enhancing the contextual relevance of the generated captions. It is highly recommended to fine-tune the model for your specific task; a minimal sketch of such a loop is given at the end of this card.

### Hyperparameters

- **Batch Size**: 16
- **Learning Rate**: 5e-5
- **Epochs**: 5

## License

This model and its code are released under the terms of the Apache 2.0 license.
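
## Fine-Tuning Sketch

The card recommends fine-tuning but does not ship a training script, so the loop below is only a minimal sketch of how fine-tuning could look with the hyperparameters listed above. The dummy `train_pairs` dataset, the `"Question: ... Answer: ..."` target format, and the plain PyTorch loop are assumptions, not the authors' original setup; substitute your own image/text pairs and training harness (e.g. `Seq2SeqTrainer`) as needed.

```python
import torch
from torch.utils.data import DataLoader
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, BartTokenizer

model = VisionEncoderDecoderModel.from_pretrained("HV-Khurdula/Dua-Vision-Base")
image_processor = ViTImageProcessor.from_pretrained("HV-Khurdula/Dua-Vision-Base")
tokenizer = BartTokenizer.from_pretrained("HV-Khurdula/Dua-Vision-Base")

# VisionEncoderDecoderModel needs these to compute the seq2seq loss;
# the published checkpoint likely sets them already, so only fill in missing values.
if model.config.decoder_start_token_id is None:
    model.config.decoder_start_token_id = tokenizer.bos_token_id
if model.config.pad_token_id is None:
    model.config.pad_token_id = tokenizer.pad_token_id

# Dummy data so the sketch runs end to end -- replace with your real (image, target text) pairs.
# The "Question: ... Answer: ..." format is hypothetical.
train_pairs = [
    (Image.new("RGB", (224, 224)), "Question: What is shown? Answer: A blank image.")
] * 4

def collate_fn(batch):
    # Each element of `batch` is a (PIL image, target text) pair.
    images, texts = zip(*batch)
    pixel_values = image_processor(images=list(images), return_tensors="pt").pixel_values
    labels = tokenizer(list(texts), padding=True, truncation=True,
                       max_length=128, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding positions in the loss
    return pixel_values, labels

# Hyperparameters from the card: batch size 16, learning rate 5e-5, 5 epochs
train_loader = DataLoader(train_pairs, batch_size=16, shuffle=True, collate_fn=collate_fn)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(5):
    for pixel_values, labels in train_loader:
        loss = model(pixel_values=pixel_values, labels=labels).loss  # token-level cross-entropy
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```

After fine-tuning, `model.save_pretrained(...)` together with the image processor and tokenizer gives you a checkpoint you can load exactly as shown in "How to Load the Model".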