---
license: mit
tags:
- vision
- image-classification
datasets:
- imagenet-21k
- imagenet-1k
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg
  example_title: Tiger
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/teapot.jpg
  example_title: Teapot
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg
  example_title: Palace
---

# DiNAT (large variant)

DiNAT-Large model with a 7x7 kernel, pre-trained on ImageNet-21K at 224x224 resolution and fine-tuned on ImageNet-1K at 384x384 resolution with increased dilation values.
It was introduced in the paper [Dilated Neighborhood Attention Transformer](https://arxiv.org/abs/2209.15001) by Hassani et al. and first released in [this repository](https://github.com/SHI-Labs/Neighborhood-Attention-Transformer).

## Model description

DiNAT is a hierarchical vision transformer based on Neighborhood Attention (NA) and its dilated variant (DiNA).
Neighborhood Attention is a restricted self-attention pattern in which each token's receptive field is limited to its nearest neighboring pixels.
Because NA and DiNA are sliding-window attention patterns, they are highly flexible and maintain translational equivariance.

Both are implemented for PyTorch in the [NATTEN](https://github.com/SHI-Labs/NATTEN/) package.

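To make the attention pattern concrete, the sketch below lists which positions a single token attends to along one axis. It is an illustration only, not NATTEN's implementation: the helper name `neighborhood_indices` is hypothetical, and the sketch assumes that windows near a border are shifted inward so every token keeps exactly `kernel_size` neighbors, and that under dilation a token only attends to positions sharing its remainder modulo the dilation value.

```python
def neighborhood_indices(i, length, kernel_size=7, dilation=1):
    """Return the positions that token `i` attends to along one axis.

    Hypothetical helper for illustration; assumes border windows are
    shifted inward rather than padded.
    """
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    if length < kernel_size * dilation:
        raise ValueError("axis is too short for this kernel size and dilation")

    # With dilation d, token i only sees positions with the same remainder mod d.
    group = i % dilation
    group_len = (length - group + dilation - 1) // dilation
    pos = i // dilation  # position of token i inside its dilation group

    # Center the window on the token, then clamp it so it stays inside the axis.
    start = min(max(pos - kernel_size // 2, 0), group_len - kernel_size)
    return [group + (start + j) * dilation for j in range(kernel_size)]


print(neighborhood_indices(5, 12, kernel_size=3))               # [4, 5, 6]
print(neighborhood_indices(0, 12, kernel_size=3))               # [0, 1, 2]
print(neighborhood_indices(5, 12, kernel_size=3, dilation=2))   # [3, 5, 7]
```

Under the same assumptions, the neighborhood on a 2D feature map is the Cartesian product of the row and column index lists. With a dilation of 1 the pattern reduces to plain NA, while larger dilation values widen the receptive field without increasing the number of attended tokens.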

![model image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/dilated-neighborhood-attention-pattern.jpg)

[Source](https://paperswithcode.com/paper/dilated-neighborhood-attention-transformer)

## Intended uses & limitations

You can use the raw model for image classification. See the [model hub](https://huggingface.co/models?search=dinat) to look for fine-tuned versions on a task that interests you.

### Example

Here is how to use this model to classify an image from the COCO 2017 dataset into one of the 1,000 ImageNet classes:

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, DinatForImageClassification

# Download a sample image from the COCO 2017 validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the image processor and the fine-tuned classification model
image_processor = AutoImageProcessor.from_pretrained("shi-labs/dinat-large-in22k-in1k-384")
model = DinatForImageClassification.from_pretrained("shi-labs/dinat-large-in22k-in1k-384")

# Preprocess the image and run inference without tracking gradients
inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits

# The model predicts one of the 1,000 ImageNet classes
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
```
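If you want more than the single best guess, one option is to turn the logits into probabilities and rank them. The sketch below is a minimal, dependency-free illustration; the helper name `top_k_predictions` is hypothetical and not part of `transformers`, and the toy logits and labels stand in for real model output.

```python
import math


def top_k_predictions(logits, id2label, k=5):
    """Return the k most likely (label, probability) pairs.

    Hypothetical helper: `logits` is a flat list of floats and
    `id2label` maps class indices to label strings.
    """
    # Numerically stable softmax: subtract the maximum before exponentiating.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probs = [value / total for value in exps]

    # Rank class indices from most to least probable and keep the first k.
    ranked = sorted(range(len(probs)), key=lambda idx: probs[idx], reverse=True)
    return [(id2label[idx], probs[idx]) for idx in ranked[:k]]


# Toy input standing in for real model output
toy_logits = [0.1, 2.0, -1.0]
toy_labels = {0: "cat", 1: "dog", 2: "bird"}
for label, prob in top_k_predictions(toy_logits, toy_labels, k=2):
    print(f"{label}: {prob:.3f}")
```

With the example above, this could be called as `top_k_predictions(logits[0].tolist(), model.config.id2label)`, assuming a batch containing a single image.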

For more examples, please refer to the [documentation](https://huggingface.co/transformers/model_doc/dinat.html#).

### Requirements

In addition to `transformers`, this model requires the [NATTEN](https://shi-labs.com/natten) package.

On Linux, refer to [shi-labs.com/natten](https://shi-labs.com/natten) for instructions on installing pre-compiled binaries (select your torch build to get the correct wheel URL).

Alternatively, run `pip install natten` to compile the package on your device, which may take a few minutes.
Mac users only have the latter option, since no pre-compiled binaries are available for macOS.

Refer to [NATTEN's GitHub](https://github.com/SHI-Labs/NATTEN/) for more information.

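Before loading the model, it can help to confirm that NATTEN is importable in the current environment. The following is a small sketch using only the standard library; the function name `natten_available` is hypothetical, and the check only reports whether a module named `natten` can be found, not whether it matches your torch build.

```python
import importlib.util


def natten_available():
    """Return True if a module named `natten` can be found (hypothetical helper)."""
    return importlib.util.find_spec("natten") is not None


if natten_available():
    print("NATTEN found.")
else:
    print("NATTEN not found; install it first, for example with `pip install natten`.")
```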

### BibTeX entry and citation info

```bibtex
@article{hassani2022dilated,
    title = {Dilated Neighborhood Attention Transformer},
    author = {Ali Hassani and Humphrey Shi},
    year = 2022,
    url = {https://arxiv.org/abs/2209.15001},
    eprint = {2209.15001},
    archiveprefix = {arXiv},
    primaryclass = {cs.CV}
}
```