Disclaimer: The team releasing DiT did not write a model card for this model.
The Document Image Transformer (DiT) is a transformer encoder model (BERT-like) pre-trained on a large collection of images in a self-supervised fashion. The pre-training objective for the model is to predict visual tokens from the encoder of a discrete VAE (dVAE), based on masked patches.
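The masked-patch objective above can be sketched in a few lines. This is a minimal NumPy illustration, not the actual BEiT/DiT training code: the token targets, codebook size of 8192, and mask ratio of 40% are assumptions for the example, and random logits stand in for the encoder's predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

num_patches = 196   # 14 x 14 grid for a 224x224 image with 16x16 patches
vocab_size = 8192   # size of a dVAE visual-token codebook (assumed)
mask_ratio = 0.4    # fraction of patches to mask (illustrative value)

# Target visual tokens, as a dVAE encoder would assign one per patch.
targets = rng.integers(0, vocab_size, size=num_patches)

# Randomly choose which patch positions to mask.
num_masked = int(mask_ratio * num_patches)
masked_idx = rng.choice(num_patches, size=num_masked, replace=False)

# Stand-in for the encoder's per-patch logits over the visual vocabulary.
logits = rng.normal(size=(num_patches, vocab_size))

# Cross-entropy is computed only on the masked positions: the model must
# recover the dVAE token of each patch it could not see.
masked_logits = logits[masked_idx]
log_probs = masked_logits - np.log(np.exp(masked_logits).sum(axis=1, keepdims=True))
loss = -log_probs[np.arange(num_masked), targets[masked_idx]].mean()
print(num_masked, float(loss))
```

The key point is the indexing step: unmasked patches contribute nothing to the loss, so the model is trained purely to reconstruct the visual tokens of the patches it could not see.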
Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer encoder.
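The patchify-and-embed step described above can be sketched as follows. This is a NumPy illustration of the shapes only, not the model's implementation; the 224x224 input resolution and 768-dimensional embedding width are assumed (BERT-base-like) values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes; a given checkpoint's resolution and width may differ.
image_size = 224
patch_size = 16      # fixed-size 16x16 patches, as described above
channels = 3
hidden_size = 768    # embedding width (assumed)

# Number of patches per side and in total.
grid = image_size // patch_size   # 14
num_patches = grid * grid         # 196

# Split a dummy image into flattened 16x16x3 patches.
image = rng.normal(size=(channels, image_size, image_size))
patches = (
    image.reshape(channels, grid, patch_size, grid, patch_size)
         .transpose(1, 3, 0, 2, 4)
         .reshape(num_patches, channels * patch_size * patch_size)
)

# Linear embedding of each patch, then add absolute position embeddings.
projection = rng.normal(size=(patches.shape[1], hidden_size))
position_embeddings = rng.normal(size=(num_patches, hidden_size))
tokens = patches @ projection + position_embeddings
print(tokens.shape)  # (196, 768)
```

A 224x224 image thus becomes a sequence of 196 patch tokens, which is what the Transformer encoder layers operate on.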
Note that this model does not provide any fine-tuned heads, hence it's meant to be fine-tuned on tasks like document image classification, table detection or document layout analysis.
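Since the checkpoint ships without a task head, fine-tuning means attaching one yourself. A minimal sketch of one common choice, a single linear layer over mean-pooled patch features, is shown below in NumPy; the feature sizes and the 16-way document-type labeling are assumptions for the example, and random values stand in for the encoder's output and the learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)

num_patches, hidden_size = 196, 768   # encoder output shape per image (assumed)
num_classes = 16                      # e.g. document types; task-dependent

# Stand-in for the encoder's final hidden states for one image.
hidden_states = rng.normal(size=(num_patches, hidden_size))

# A minimal classification head: mean-pool the patch features,
# then apply a single learned linear layer.
head_weight = rng.normal(size=(hidden_size, num_classes)) * 0.02
head_bias = np.zeros(num_classes)

pooled = hidden_states.mean(axis=0)
logits = pooled @ head_weight + head_bias
predicted_class = int(np.argmax(logits))
print(logits.shape, predicted_class)
```

Detection-style tasks such as table detection or document layout analysis would instead pair the encoder with a detection head, but the pattern is the same: the pre-trained encoder supplies the features and the new head is trained on the downstream labels.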