--- pipeline_tag: image-to-text --- # Image captioning model End-to-end Transformer based image captioning model, where both the encoder and decoder use standard pre-trained transformer architectures. ## Encoder The encoder uses the pre-trained Swin transformer (Liu et al., 2021) that is a general-purpose backbone for computer vision. It outperforms ViT, DeiT and ResNe(X)t models at tasks such as image classification, object detection and semantic segmentation. The fact that this model is not pre-trained to be a 'narrow expert'--- a model pre-trained to perform a specific task e.g., image classification --- makes it a good candidate for fine-tuning on a downstream task. ## Decoder Distilgpt2 ## Dataset The model is fine-tuned and evaluated on the COCO 2017 dataset. ## How to use this model using Transformer's pipeline ``` from transformers import pipeline image_to_text = pipeline("image-to-text", model="yesidcanoc/image-captioning-swin-tiny-distilgpt2") # Provide path the image file caption = image_to_text("./COCO_val2014_000000457986.jpg") print(caption) ``` ## How To use this model using `GenerateCaptions` utility class. Adapt the code below to your needs. ``` import os from PIL import Image import torchvision.transforms as transforms from transformers import GPT2TokenizerFast, VisionEncoderDecoderModel class DataProcessing: def __init__(self): # GPT-2 tokenizer self.tokenizer = GPT2TokenizerFast.from_pretrained('distilgpt2') self.tokenizer.pad_token = self.tokenizer.eos_token # Define the transforms to be applied to the images self.transform = transforms.Compose([ transforms.Resize((224, 224)), transforms.ToTensor(), transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) ]) class GenerateCaptions(DataProcessing): NUM_BEAMS = 3 MAX_LENGTH = 15 EARLY_STOPPING = True DO_SAMPLE = True TOP_K = 10 NUM_RETURN_SEQUENCES = 2 # number of captions to generate def __init__(self, captioning_model): self.captioning_model = captioning_model super().__init__() def read_img_predict(self, path): try: with Image.open(path) as img: if img.mode != "RGB": img = img.convert('RGB') img_transformed = self.transform(img).unsqueeze(0) # tensor dimensions max_lenght X num_return_sequences, where ij == some_token_id model_output = self.captioning_model.generate( img_transformed, num_beams=self.NUM_BEAMS, max_length=self.MAX_LENGTH, early_stopping=self.EARLY_STOPPING, do_sample=self.DO_SAMPLE, top_k=self.TOP_K, num_return_sequences=self.NUM_RETURN_SEQUENCES, ) # g is a tensor like this one: tensor([50256, 13, 198, 198, 198, 198, 198, 198, 198, 50256, # 50256, 50256, 50256, 50256, 50256]) captions = [self.tokenizer.decode(g, skip_special_tokens=True).strip() for g in model_output] return captions except FileNotFoundError: raise FileNotFoundError(f"File not found: {path}") def generate_caption(self, path): """ Generate captions for a single image or a directory of images :param path: path to image or directory of images :return: captions """ if os.path.isdir(path): self.decoded_predictions = [] for root, dirs, files in os.walk(path): for file in files: self.decoded_predictions.append(self.read_img_predict(os.path.join(root, file))) return self.decoded_predictions elif os.path.isfile(path): return self.read_img_predict(path) else: raise ValueError(f"Invalid path: {path}") image_captioning_model = VisionEncoderDecoderModel.from_pretrained("yesidcanoc/image-captioning-swin-tiny-distilgpt2") generate_captions = GenerateCaptions(image_captioning_model) captions = generate_captions.generate_caption('../data/test_data/images') print(captions) ``` ## References - Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. ArXiv. /abs/2103.14030