README.md · Mozilla/distilvit at 3dccd4d19b7be4b713fa533ca6f75d054905b922

tags:
  - image-to-text
  - image-captioning
license: apache-2.0
widget:
  - src: >-
      https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg
    example_title: Savanna
  - src: >-
      https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg
    example_title: Football Match
  - src: >-
      https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg
    example_title: Airport
base_model:
  - google/vit-base-patch16-224-in21k

This model is a work in progress.

You can find the code used to create the model here: https://github.com/mozilla/distilvit
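
As a rough usage sketch (not taken from the repository), the model can probably be loaded through the `transformers` image-to-text pipeline, matching the `image-to-text` tag above. The model ID `Mozilla/distilvit` is assumed from the repository name, the helper name `caption_image` is hypothetical, and the exact pipeline behaviour may differ between `transformers` versions.

```python
# Minimal sketch, assuming the checkpoint is published as "Mozilla/distilvit"
# and is compatible with the transformers "image-to-text" pipeline.
# Requires: pip install transformers torch pillow

# One of the sample images listed in the widget metadata of this card.
SAVANNA_URL = (
    "https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg"
)


def caption_image(image, model_id="Mozilla/distilvit"):
    """Return a generated caption for an image URL, local path, or PIL image.

    `caption_image` is a hypothetical helper, not part of the distilvit code.
    """
    # Imported lazily so that defining the helper does not need transformers.
    from transformers import pipeline

    captioner = pipeline("image-to-text", model=model_id)
    outputs = captioner(image)  # list of dicts with a "generated_text" key
    return outputs[0]["generated_text"]
```

Calling `caption_image(SAVANNA_URL)` downloads the weights on first use and returns a short caption string. Since the model is a work in progress, captions may change between revisions.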

Results after 3 epochs (and ~45 hours of training):

eval_loss: 0.19939416646957397
eval_rouge1: 43.006
eval_rouge2: 16.9939
eval_rougeL: 38.8923
eval_rougeLsum: 38.8877
eval_gen_len: 11.327256736227712
eval_runtime: 1816.5255
eval_samples_per_second: 13.77
eval_steps_per_second: 1.721
train_runtime: 46263.3695
train_samples_per_second: 38.373
train_steps_per_second: 4.797
train_loss: 0.05974134062104816
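
The throughput figures above can be cross-checked with a little arithmetic. The sketch below assumes the usual Hugging Face `Trainer` convention, where `*_samples_per_second` is samples divided by runtime and `*_steps_per_second` is steps divided by runtime. Under that assumption, the ratio of the two gives the effective batch size per step, and runtime times samples per second gives the number of samples processed. The variable names are illustrative only.

```python
# Values copied from the metrics reported above.
metrics = {
    "eval_runtime": 1816.5255,
    "eval_samples_per_second": 13.77,
    "eval_steps_per_second": 1.721,
    "train_runtime": 46263.3695,
    "train_samples_per_second": 38.373,
    "train_steps_per_second": 4.797,
}

# Effective batch size per step = samples/s divided by steps/s
# (assumes the Trainer-style definition of both rates).
eval_batch_size = metrics["eval_samples_per_second"] / metrics["eval_steps_per_second"]
train_batch_size = metrics["train_samples_per_second"] / metrics["train_steps_per_second"]

# Approximate number of evaluation samples = runtime * samples/s.
eval_samples = metrics["eval_runtime"] * metrics["eval_samples_per_second"]

print(f"effective eval batch size:  {eval_batch_size:.2f}")   # about 8
print(f"effective train batch size: {train_batch_size:.2f}")  # about 8
print(f"approx. eval samples:       {eval_samples:.0f}")      # about 25,014
```

Both ratios come out close to 8, which is consistent with a batch size of 8 for training and evaluation, although the source does not state the batch size explicitly.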