yesidcanoc committed c389f87 (parent: 5fc5f75)
Update README.md

README.md, as updated by this commit:
---
pipeline_tag: image-to-text
---

# Image captioning model

An end-to-end Transformer-based image captioning model in which both the encoder and the decoder are standard pre-trained transformer architectures.

## Encoder

The encoder is the pre-trained Swin Transformer (Liu et al., 2021), a general-purpose backbone for computer vision that outperforms ViT, DeiT and ResNe(X)t models on tasks such as image classification, object detection and semantic segmentation. Because it is not pre-trained to be a 'narrow expert', i.e. a model pre-trained to perform one specific task such as image classification, it is a good candidate for fine-tuning on a downstream task.
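
As a minimal sketch of what the encoder produces, the Swin checkpoint can be run on a single image to inspect the patch-level hidden states a caption decoder would cross-attend to. The checkpoint name below is an assumption; the README does not name the exact Swin variant.

```python
# A sketch only: run an (assumed) Swin checkpoint on one image and look at
# the patch-level hidden states a caption decoder would cross-attend to.
import torch
from PIL import Image
from transformers import AutoImageProcessor, SwinModel

CHECKPOINT = "microsoft/swin-base-patch4-window7-224"  # assumed variant

processor = AutoImageProcessor.from_pretrained(CHECKPOINT)
encoder = SwinModel.from_pretrained(CHECKPOINT)

image = Image.new("RGB", (224, 224))  # placeholder; use a real photo
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    features = encoder(**inputs).last_hidden_state

print(features.shape)  # torch.Size([1, 49, 1024]) for this checkpoint
```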

## Decoder

The decoder is DistilGPT2, a distilled version of GPT-2.
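
The README does not say how the two networks are glued together. One plausible construction, shown here only as a sketch, is `VisionEncoderDecoderModel` from the transformers library, which adds randomly initialised cross-attention to the decoder; the Swin checkpoint is again an assumption.

```python
# A sketch only: one plausible way to pair a Swin encoder with DistilGPT2
# using transformers' VisionEncoderDecoderModel. The Swin checkpoint is an
# assumption; the README does not pin the exact encoder variant.
from transformers import AutoImageProcessor, AutoTokenizer, VisionEncoderDecoderModel

ENCODER = "microsoft/swin-base-patch4-window7-224"  # assumed
DECODER = "distilgpt2"                              # named in this README

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(ENCODER, DECODER)
image_processor = AutoImageProcessor.from_pretrained(ENCODER)
tokenizer = AutoTokenizer.from_pretrained(DECODER)

# GPT-2-family models have no pad token; reuse EOS so batching and
# generation work, and tell the model how decoding starts.
tokenizer.pad_token = tokenizer.eos_token
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.eos_token_id
```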

## Dataset

The model is fine-tuned and evaluated on the COCO 2017 dataset.
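
The README does not show the data pipeline; a minimal sketch of reading COCO 2017 captions with torchvision (the paths are placeholders for a local download) could look like this:

```python
# A sketch only: reading COCO 2017 captions with torchvision. The paths are
# placeholders for wherever the dataset lives; pycocotools must be installed.
from torchvision.datasets import CocoCaptions

train_set = CocoCaptions(
    root="coco/train2017",                               # hypothetical path
    annFile="coco/annotations/captions_train2017.json",  # hypothetical path
)

image, captions = train_set[0]  # a PIL image and its reference captions
print(len(train_set), captions[0])
```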

## How to use this model

Adapt the code below to your needs.

```python
# ... (the earlier, unchanged lines of this example are not shown in the diff) ...
captions = generate_captions.generate_caption('../data/test_data/images')
print(captions)
```
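
Only the tail of the author's example survives in this diff, so the following is a purely hypothetical stand-in for the `generate_captions` module, reusing `model`, `image_processor` and `tokenizer` from the sketch in the Decoder section; it shows what caption generation with such a pairing typically looks like.

```python
# Purely hypothetical stand-in for the generate_captions module: caption
# every image in a directory with the model sketched in the Decoder section.
import os
import torch
from PIL import Image

def caption_images(model, image_processor, tokenizer, image_dir):
    """Return {filename: caption} for every image in image_dir."""
    model.eval()
    captions = {}
    for name in sorted(os.listdir(image_dir)):
        image = Image.open(os.path.join(image_dir, name)).convert("RGB")
        pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
        with torch.no_grad():
            output_ids = model.generate(pixel_values, max_new_tokens=32, num_beams=4)
        captions[name] = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return captions

captions = caption_images(model, image_processor, tokenizer, "../data/test_data/images")
print(captions)
```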

## References

- Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv:2103.14030.