import torch
from transformers import AutoFeatureExtractor, AutoModelForImageClassification
from einops import rearrange
import gradio
import call_labels

# define the feature extractor
extractor = AutoFeatureExtractor.from_pretrained("vincentclaes/mit-indoor-scenes")
# define the pretrained model
model = AutoModelForImageClassification.from_pretrained("vincentclaes/mit-indoor-scenes")
# retrieve the labels provided by the MIT Indoor Scenes dataset (https://www.kaggle.com/itsahmad/indoor-scenes-cvpr-2019)
labels = call_labels.call_labels()
# set the model to evaluation mode (disables training-specific behaviour such as dropout)
model.eval()

# define the function used for model inference
def classify(image):
    # disable gradient calculation
    with torch.no_grad():
        # extract features from the input image
        inputs = extractor(images=image, return_tensors='pt')
        # keep only the logits from the model output object
        outputs = model(**inputs).logits
        # remove the batch dimension
        outputs = rearrange(outputs, '1 j -> j')
        # use the softmax function to convert the logits into probabilities
        outputs = torch.nn.functional.softmax(outputs, dim=-1)
        # convert the tensor to a numpy array
        outputs = outputs.cpu().numpy()
    # return a dictionary mapping each class label to its predicted probability
    return {labels[str(i)]: float(outputs[i]) for i in range(len(labels))}

# define the gradio interface
gradio.Interface(
    fn=classify,
    inputs=gradio.inputs.Image(shape=(224, 224), image_mode='RGB', source='upload',
                               tool='editor', type='pil', label=None, optional=False),
    outputs=gradio.outputs.Label(num_top_classes=5, type='auto'),
    theme='grass',
    examples=[['bedroom.jpg'], ['bathroom.jpg'], ['samsung_room.jpg']],
    live=True,
    layout='horizontal',
    title='Indoor Scene Recognition',
    description='A smart and easy-to-use indoor scene classifier. Start by uploading an input image of an indoor scene. The outputs are the top five indoor scene classes that best describe your input image.',
    article='''

Additional Information

This indoor scene classifier employs google/vit-base-patch16-224-in21k, a Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at a resolution of 224x224 pixels. The architecture was first introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al. The pre-trained model was then fine-tuned on the MIT Indoor Scenes dataset from Kaggle. The fine-tuned model used in this space is vincentclaes/mit-indoor-scenes.

For further details on the Vision Transformer, the original GitHub repository can be found at this link.

Disclaimer

The team that released the Vision Transformer did not write a model card for it on Hugging Face, so the Vision Transformer model card in the Hugging Face Models library was written by the Hugging Face team.

Limitations

The model was fine-tuned on only 67 indoor scene classes, so its predictions are limited to that label set; it performs best when the input image actually depicts one of those 67 classes.

''', allow_flagging='never').launch()
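
# ------------------------------------------------------------------
# A minimal usage sketch (not part of the app itself): classify() can also
# be exercised directly, e.g. from a separate test script or an interactive
# session, without going through the Gradio interface. It assumes that one
# of the example images listed above, 'bedroom.jpg', is present in the
# working directory. The snippet is kept as a comment because launch()
# above blocks the main thread while the app is running.
#
#     from PIL import Image
#
#     # run a single image through the same inference function the app uses
#     predictions = classify(Image.open('bedroom.jpg').convert('RGB'))
#
#     # print the five most probable indoor scene classes
#     top5 = sorted(predictions.items(), key=lambda kv: kv[1], reverse=True)[:5]
#     for label, probability in top5:
#         print(f'{label}: {probability:.4f}')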