nielsr HF staff committed on
Commit f521936 · 1 Parent(s): 73f0ab6

Update README.md

Files changed (1)
  1. README.md +12 -9
README.md CHANGED
@@ -14,31 +14,34 @@ The Document Image Transformer (DiT) is a transformer encoder model (BERT-like)
 
  Images are presented to the model as a sequence of fixed-size patches (resolution 16x16), which are linearly embedded. One also adds absolute position embeddings before feeding the sequence to the layers of the Transformer encoder.
 
- Note that this model does not provide any fine-tuned heads, hence it's meant to be fine-tuned on tasks like document image classification, table detection or document layout analysis.
-
  By pre-training the model, it learns an inner representation of images that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled document images for instance, you can train a standard classifier by placing a linear layer on top of the pre-trained encoder.
 
  ## Intended uses & limitations
 
- You can use the raw model for encoding document images into a vector space, but it's mostly meant to be fine-tuned. See the [model hub](https://huggingface.co/models?search=microsoft/dit) to look for fine-tuned versions on a task that interests you.
+ You can use the raw model for encoding document images into a vector space, but it's mostly meant to be fine-tuned on tasks like document image classification, table detection or document layout analysis. See the [model hub](https://huggingface.co/models?search=microsoft/dit) to look for fine-tuned versions on a task that interests you.
 
  ### How to use
 
  Here is how to use this model in PyTorch:
 
  ```python
- from transformers import AutoFeatureExtractor, AutoModel
+ from transformers import BeitFeatureExtractor, BeitForMaskedImageModeling
+ import torch
  from PIL import Image
 
  image = Image.open('path_to_your_document_image')
 
- feature_extractor = AutoFeatureExtractor.from_pretrained('microsoft/dit-base')
- model = AutoModel.from_pretrained('microsoft/dit-base')
+ feature_extractor = BeitFeatureExtractor.from_pretrained("microsoft/dit-base")
+ model = BeitForMaskedImageModeling.from_pretrained("microsoft/dit-base")
 
- inputs = feature_extractor(images=image, return_tensors="pt")
+ num_patches = (model.config.image_size // model.config.patch_size) ** 2
+ pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values
+ # create random boolean mask of shape (batch_size, num_patches)
+ bool_masked_pos = torch.randint(low=0, high=2, size=(1, num_patches)).bool()
 
- outputs = model(**inputs)
- last_hidden_states = outputs.last_hidden_state
+ outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
+ loss, logits = outputs.loss, outputs.logits
+ list(logits.shape)
  ```
 
  ### BibTeX entry and citation info
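
The snippet added in this commit derives `num_patches` from the model config and draws a random boolean mask of shape `(batch_size, num_patches)`. A minimal sketch of that arithmetic in plain Python, without `torch` or a model download, might look like the following. The 16x16 patch size comes from the README text; the input resolution of 224 and the helper names `count_patches` and `random_patch_mask` are illustrative assumptions and are not part of the commit.

```python
import random


def count_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image.

    Mirrors the expression in the diff:
    (image_size // patch_size) ** 2
    """
    return (image_size // patch_size) ** 2


def random_patch_mask(batch_size: int, n_patches: int, seed=None):
    """Random boolean mask of shape (batch_size, n_patches) as nested lists.

    Each entry is True (patch masked) or False (patch visible) with equal
    probability, analogous to torch.randint(low=0, high=2, ...).bool().
    """
    rng = random.Random(seed)
    return [
        [rng.random() < 0.5 for _ in range(n_patches)]
        for _ in range(batch_size)
    ]


# Assumed values: 224-pixel input (hypothetical here), 16-pixel patches (from the README).
n_patches = count_patches(image_size=224, patch_size=16)
mask = random_patch_mask(batch_size=1, n_patches=n_patches, seed=0)

print(n_patches)                 # 196, i.e. 14 patches per side
print(len(mask), len(mask[0]))   # 1 196
```

With the real model, `model.config.image_size` and `model.config.patch_size` supply these two values, and the mask is passed to the model as `bool_masked_pos`, as in the added lines of the diff.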