tarekziade/distilvit

Tags: vision-encoder-decoder, image-text-to-text, image-captioning


This model is a variation of https://huggingface.co/nlpconnect/vit-gpt2-image-captioning

Read the blog post here: https://ziade.org/2024/03/17/distilvit-image-captioning-model
The training code is here: https://github.com/tarekziade/distilvit
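
The card does not include a usage snippet, so here is a minimal sketch of how the model might be called for captioning. It assumes the `transformers` library is installed, that the installed version still ships the `image-to-text` pipeline task, and that the repository contains the tokenizer and image processor files the pipeline needs. The helper name `caption_image` and the `max_new_tokens=30` default are illustrative choices, not taken from the card.

```python
MODEL_ID = "tarekziade/distilvit"  # model id as shown on this card


def caption_image(image, model_id=MODEL_ID, max_new_tokens=30):
    """Return a caption for `image` (a file path, URL, or PIL image).

    Hypothetical helper: the name and defaults are assumptions made
    for illustration only.
    """
    # Imported lazily so that defining the helper needs neither
    # transformers nor network access.
    from transformers import pipeline

    # Downloads the model on first use.
    captioner = pipeline("image-to-text", model=model_id)
    outputs = captioner(image, max_new_tokens=max_new_tokens)
    # The pipeline returns a list of dicts with a "generated_text" key.
    return outputs[0]["generated_text"]
```

Calling `caption_image("photo.jpg")`, where `photo.jpg` is a placeholder path, would download the weights and return a single caption string.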

Results after 3 epochs (and ~45 hours of training):

eval_loss: 0.19939416646957397
eval_rouge1: 43.006
eval_rouge2: 16.9939
eval_rougeL: 38.8923
eval_rougeLsum: 38.8877
eval_gen_len: 11.327256736227712
eval_runtime: 1816.5255
eval_samples_per_second: 13.77
eval_steps_per_second: 1.721
train_runtime: 46263.3695
train_samples_per_second: 38.373
train_steps_per_second: 4.797
train_loss: 0.05974134062104816
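
The throughput figures above can be turned into more readable quantities with a little arithmetic. The sketch below assumes the usual Hugging Face Trainer conventions, which the card does not state explicitly: `*_runtime` is in seconds, and the `*_per_second` values are averages over that runtime. The function name `derive` and the output keys are illustrative.

```python
# Values copied from the results listed above.
metrics = {
    "eval_runtime": 1816.5255,
    "eval_samples_per_second": 13.77,
    "eval_steps_per_second": 1.721,
    "train_runtime": 46263.3695,
    "train_samples_per_second": 38.373,
    "train_steps_per_second": 4.797,
}


def derive(m):
    """Derive human-readable quantities, assuming runtimes are in seconds."""
    return {
        # Wall-clock durations.
        "train_hours": m["train_runtime"] / 3600,
        "eval_minutes": m["eval_runtime"] / 60,
        # Approximate number of samples seen (rate multiplied by duration).
        "eval_samples": m["eval_runtime"] * m["eval_samples_per_second"],
        "train_samples": m["train_runtime"] * m["train_samples_per_second"],
        # Samples per step, i.e. the effective batch size.
        "eval_samples_per_step": m["eval_samples_per_second"] / m["eval_steps_per_second"],
        "train_samples_per_step": m["train_samples_per_second"] / m["train_steps_per_second"],
    }


derived = derive(metrics)
for name, value in derived.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the effective batch size comes out at about 8 for both training and evaluation, and the evaluation pass covers roughly 25,000 samples. The same reading puts `train_runtime` at about 12.9 hours; the card does not say how this relates to the ~45 hours quoted above.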


Model size: 182M params (Safetensors, tensor type F32)


Model tree for tarekziade/distilvit

Base model: distilbert/distilgpt2
Quantized (9): this model