Disclaimer: The team releasing DiT did not write a model card for this model.
The Document Image Transformer (DiT) is a transformer encoder model (BERT-like) pre-trained on a large collection of images in a self-supervised fashion. The pre-training objective for the model is to predict visual tokens from the encoder of a discrete VAE (dVAE), based on masked patches.
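The masked-patch objective above can be sketched in a few lines. This is a minimal NumPy illustration, not the actual BEiT/DiT training code: the token targets, codebook size of 8192, and mask ratio of 40% are assumptions for the example, and random logits stand in for the encoder's predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

num_patches = 196   # 14 x 14 grid for a 224x224 image with 16x16 patches
vocab_size = 8192   # size of a dVAE visual-token codebook (assumed)
mask_ratio = 0.4    # fraction of patches to mask (illustrative value)

# Target visual tokens, as a dVAE encoder would assign one per patch.
targets = rng.integers(0, vocab_size, size=num_patches)

# Randomly choose which patch positions to mask.
num_masked = int(mask_ratio * num_patches)
masked_idx = rng.choice(num_patches, size=num_masked, replace=False)

# Stand-in for the encoder's per-patch logits over the visual vocabulary.
logits = rng.normal(size=(num_patches, vocab_size))

# Cross-entropy is computed only on the masked positions: the model must
# recover the dVAE token of each patch it could not see.
masked_logits = logits[masked_idx]
log_probs = masked_logits - np.log(np.exp(masked_logits).sum(axis=1, keepdims=True))
loss = -log_probs[np.arange(num_masked), targets[masked_idx]].mean()
print(num_masked, float(loss))
```

The key point is the indexing step: unmasked patches contribute nothing to the loss, so the model is trained purely to reconstruct the visual tokens of the patches it could not see.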
Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer encoder.
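The patchify-and-embed step described above can be sketched as follows. This is a NumPy illustration of the shapes only, not the model's implementation; the 224x224 input resolution and 768-dimensional embedding width are assumed (BERT-base-like) values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes; a given checkpoint's resolution and width may differ.
image_size = 224
patch_size = 16      # fixed-size 16x16 patches, as described above
channels = 3
hidden_size = 768    # embedding width (assumed)

# Number of patches per side and in total.
grid = image_size // patch_size   # 14
num_patches = grid * grid         # 196

# Split a dummy image into flattened 16x16x3 patches.
image = rng.normal(size=(channels, image_size, image_size))
patches = (
    image.reshape(channels, grid, patch_size, grid, patch_size)
         .transpose(1, 3, 0, 2, 4)
         .reshape(num_patches, channels * patch_size * patch_size)
)

# Linear embedding of each patch, then add absolute position embeddings.
projection = rng.normal(size=(patches.shape[1], hidden_size))
position_embeddings = rng.normal(size=(num_patches, hidden_size))
tokens = patches @ projection + position_embeddings
print(tokens.shape)  # (196, 768)
```

A 224x224 image thus becomes a sequence of 196 patch tokens, which is what the Transformer encoder layers operate on.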
Note that this model does not provide any fine-tuned heads, hence it's meant to be fine-tuned on tasks like document image classification, table detection or document layout analysis.
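Since the checkpoint ships without a task head, fine-tuning means attaching one yourself. A minimal sketch of one common choice, a single linear layer over mean-pooled patch features, is shown below in NumPy; the feature sizes and the 16-way document-type labeling are assumptions for the example, and random values stand in for the encoder's output and the learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)

num_patches, hidden_size = 196, 768   # encoder output shape per image (assumed)
num_classes = 16                      # e.g. document types; task-dependent

# Stand-in for the encoder's final hidden states for one image.
hidden_states = rng.normal(size=(num_patches, hidden_size))

# A minimal classification head: mean-pool the patch features,
# then apply a single learned linear layer.
head_weight = rng.normal(size=(hidden_size, num_classes)) * 0.02
head_bias = np.zeros(num_classes)

pooled = hidden_states.mean(axis=0)
logits = pooled @ head_weight + head_bias
predicted_class = int(np.argmax(logits))
print(logits.shape, predicted_class)
```

Detection-style tasks such as table detection or document layout analysis would instead pair the encoder with a detection head, but the pattern is the same: the pre-trained encoder supplies the features and the new head is trained on the downstream labels.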