
We are developing a spatially aware vision-language (VL) model.

The model is trained on COCO images augmented with extra annotations describing the spatial relationships between the entities in each image.

It is a sequence-to-sequence image-captioning model with a ViT encoder and a GPT-2 decoder.

Requirements:
- 4 GB GPU RAM
- CUDA-enabled Docker

To download and run the model:

import torch
from transformers import pipeline

# Use the GPU if one is available, otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

image_captioner = pipeline("image-to-text", model="voxreality/rgb-language_cap", max_new_tokens=200, device=device)

filename = 'path/to/file'
generated_captions = image_captioner(filename)
print(generated_captions)

The model generates as much text as possible up to a maximum of 200 new tokens, which corresponds to roughly five sentences; a sixth sentence is usually cut off.
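If you prefer to load the components explicitly rather than going through the pipeline, the sketch below does the same thing with VisionEncoderDecoderModel. It assumes the repository ships the usual ViT image-processor and GPT-2 tokenizer files alongside the weights.

import torch
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load the ViT-encoder / GPT-2-decoder checkpoint and its preprocessing components.
model = VisionEncoderDecoderModel.from_pretrained("voxreality/rgb-language_cap").to(device)
processor = ViTImageProcessor.from_pretrained("voxreality/rgb-language_cap")
tokenizer = AutoTokenizer.from_pretrained("voxreality/rgb-language_cap")

# Preprocess a single image and generate a caption (same 200-token limit as above).
image = Image.open('path/to/file').convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device)
output_ids = model.generate(pixel_values, max_new_tokens=200)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))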

The output always has the form: "Object1" is to the "left/right/etc." of "Object2".
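Because the captions follow this fixed pattern, they can be parsed into (object, relation, object) triples. The snippet below is only a sketch: the regular expression is an assumption based on the format shown above and may need adjusting, for example for multi-word relations.

import re

# Hypothetical parser for captions such as 'The chair is to the left of the table.'
# (the exact wording is assumed from the output format described above).
RELATION_PATTERN = re.compile(
    r"(?:the\s+)?(?P<obj1>[\w\s]+?)\s+is\s+to\s+the\s+(?P<relation>\w+)\s+of\s+(?:the\s+)?(?P<obj2>[\w\s]+?)\.",
    re.IGNORECASE,
)

def extract_relations(caption_text):
    # Return every (object1, relation, object2) triple found in the caption text.
    return [
        (m.group("obj1").strip(), m.group("relation").lower(), m.group("obj2").strip())
        for m in RELATION_PATTERN.finditer(caption_text)
    ]

print(extract_relations("The chair is to the left of the table. The lamp is to the right of the sofa."))
# [('chair', 'left', 'table'), ('lamp', 'right', 'sofa')]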

If you want to keep only a specific number of caption sentences (up to 5):

def print_up_to_n_sentences(captions, n):
    # Keep only the first n sentences of each generated caption.
    results = []
    for caption in captions:
        generated_text = caption.get('generated_text', '')
        sentences = generated_text.split('.')
        results.append('.'.join(sentences[:n]))
    return results

filename = 'path/to/file'

generated_captions = image_captioner(filename)
captions = print_up_to_n_sentences(generated_captions, 5)
print(captions)
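To caption a whole folder of images with the same post-processing, one possible loop (the directory path and file extensions are placeholders, not part of the model card) is:

import os

# Caption every image in a directory and keep at most 5 sentences per caption.
image_dir = 'path/to/image_dir'  # placeholder directory
for name in sorted(os.listdir(image_dir)):
    if not name.lower().endswith(('.jpg', '.jpeg', '.png')):
        continue
    captions = image_captioner(os.path.join(image_dir, name))
    print(name, print_up_to_n_sentences(captions, 5))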