---
tags:
- image-to-text
- image-captioning
license: apache-2.0
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/savanna.jpg
  example_title: Savanna
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/football-match.jpg
  example_title: Football Match
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/airport.jpg
  example_title: Airport
base_model:
- google/vit-base-patch16-224-in21k
---
This model is a work in progress.

You can find the code used to create the model here: https://github.com/mozilla/distilvit

Results after 3 epochs (and ~45 hours of training):

- eval_loss: 0.19939416646957397
- eval_rouge1: 43.006
- eval_rouge2: 16.9939
- eval_rougeL: 38.8923
- eval_rougeLsum: 38.8877
- eval_gen_len: 11.327256736227712
- eval_runtime: 1816.5255
- eval_samples_per_second: 13.77
- eval_steps_per_second: 1.721
- train_runtime: 46263.3695
- train_samples_per_second: 38.373
- train_steps_per_second: 4.797
- train_loss: 0.05974134062104816
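
The throughput figures in the list above can be combined into a few derived quantities. The following is a minimal sketch, assuming the `*_samples_per_second` and `*_steps_per_second` values are averages over the corresponding `*_runtime` (in seconds); the helper names `samples_per_step` and `total_samples` are hypothetical and not part of the training code.

```python
# Throughput metrics copied from the results list above.
metrics = {
    "eval_runtime": 1816.5255,           # seconds
    "eval_samples_per_second": 13.77,
    "eval_steps_per_second": 1.721,
    "train_runtime": 46263.3695,         # seconds
    "train_samples_per_second": 38.373,
    "train_steps_per_second": 4.797,
}


def samples_per_step(split: str) -> float:
    """Estimate how many samples were processed per step for a split.

    Assumption: both rates are averaged over the same runtime, so their
    ratio approximates the effective batch size.
    """
    return metrics[f"{split}_samples_per_second"] / metrics[f"{split}_steps_per_second"]


def total_samples(split: str) -> float:
    """Estimate the number of samples processed during the reported runtime."""
    return metrics[f"{split}_runtime"] * metrics[f"{split}_samples_per_second"]


if __name__ == "__main__":
    for split in ("train", "eval"):
        print(
            f"{split}: ~{samples_per_step(split):.2f} samples/step, "
            f"~{total_samples(split):,.0f} samples in the reported runtime"
        )
```

These are estimates only: they describe the span covered by the reported runtime, and the rounded rates in the list limit their precision.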