nielsr (HF staff) committed on
Commit 73f0ab6 · 1 Parent(s): 70a687b

Update README.md

Files changed (1):
  1. README.md +1 -1
README.md CHANGED
@@ -12,7 +12,7 @@ Disclaimer: The team releasing DiT did not write a model card for this model so

 The Document Image Transformer (DiT) is a transformer encoder model (BERT-like) pre-trained on a large collection of images in a self-supervised fashion. The pre-training objective for the model is to predict visual tokens from the encoder of a discrete VAE (dVAE), based on masked patches.

-Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. One also adds a [CLS] token to the beginning of a sequence to use it for classification tasks. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer encoder.
+Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer encoder.

 Note that this model does not provide any fine-tuned heads, hence it's meant to be fine-tuned on tasks like document image classification, table detection or document layout analysis.
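The patch-embedding step the README describes can be sketched numerically. This is a minimal illustration, not the model's actual code: the 224x224 input resolution is an assumption (the diff only specifies 16x16 patches), and the projection weights and position embeddings are random stand-ins for learned parameters.

```python
import numpy as np

def patch_sequence_length(image_size=224, patch_size=16, use_cls_token=True):
    """Number of tokens fed to the Transformer encoder."""
    patches_per_side = image_size // patch_size
    n_patches = patches_per_side ** 2
    return n_patches + (1 if use_cls_token else 0)

def embed_patches(image, patch_size=16, hidden_dim=768, seed=0):
    """Flatten non-overlapping patches, linearly embed them, and add
    absolute position embeddings (random stand-ins for learned weights)."""
    rng = np.random.default_rng(seed)
    h, w, c = image.shape
    n = (h // patch_size) * (w // patch_size)
    # Cut the image into patch_size x patch_size patches and flatten each one.
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(n, -1)
    W = rng.normal(size=(patches.shape[1], hidden_dim))  # linear projection
    pos = rng.normal(size=(n, hidden_dim))               # position embeddings
    return patches @ W + pos

print(patch_sequence_length())                           # 197 (196 patches + [CLS])
print(embed_patches(np.zeros((224, 224, 3))).shape)      # (196, 768)
```

At 224x224 with 16x16 patches this yields 14 x 14 = 196 patch tokens, so a hypothetical 768-dim encoder would see a (196, 768) sequence before any [CLS] token is prepended.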