ifmain/vit-gpt2-image2promt-stable-diffusion

vision-encoder-decoder


ifmain committed on Aug 4, 2024

Commit 3e7a3bb (verified) · 1 parent: de4512c

Update README.md

Files changed (1)

README.md +70 -3


@@ -1,3 +1,70 @@
----
-license: apache-2.0
----

+---
+datasets:
+- Ar4ikov/civitai-sd-337k
+language:
+- en
+pipeline_tag: image-to-text
+base_model: nlpconnect/vit-gpt2-image-captioning
+license: apache-2.0
+---
+# Overview
+The `ifmain/vit-gpt2-image2promt-stable-diffusion` model is fine-tuned from [nlpconnect/vit-gpt2-image-captioning](https://huggingface.co/nlpconnect/vit-gpt2-image-captioning) on 2,000 images from the [Ar4ikov/civitai-sd-337k](https://huggingface.co/datasets/Ar4ikov/civitai-sd-337k) dataset. It generates image descriptions formatted as prompts for Stable Diffusion models.
+Training was conducted using the [Vit-GPT-Easy-Trainer](https://github.com/ifmain/Vit-GPT-Easy-Trainer) code.
+# Example Usage
+```python
+from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
+import torch
+from PIL import Image
+import re
+def prepare(text):
+    text = text.replace('. ', '.').replace(' .', '.')
+    text = text.replace('( ', '(').replace(' (', '(')
+    text = text.replace(') ', ')').replace(' )', ')')
+    text = text.replace(': ', ':').replace(' :', ':')
+    text = text.replace('_ ', '_').replace(' _', '_')
+    text = text.replace(',(())', '').replace('(()),', '')
+    for i in range(10):
+        text = text.replace(')))', '))').replace('(((', '((')
+    text = re.sub(r'<[^>]*>', '', text)
+    return text
+model = VisionEncoderDecoderModel.from_pretrained("ifmain/vit-gpt2-image2promt-stable-diffusion")  # fine-tuned weights
+feature_extractor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")  # image processor of the base model
+tokenizer = AutoTokenizer.from_pretrained("nlpconnect/vit-gpt2-image-captioning")  # tokenizer of the base model
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+model.to(device)
+max_length = 16
+num_beams = 4
+gen_kwargs = {"max_length": max_length, "num_beams": num_beams}
+def predict_step(image_paths):
+    images = []
+    for image_path in image_paths:
+        i_image = Image.open(image_path)
+        if i_image.mode != "RGB":
+            i_image = i_image.convert(mode="RGB")
+        images.append(i_image)
+    pixel_values = feature_extractor(images=images, return_tensors="pt").pixel_values
+    pixel_values = pixel_values.to(device)
+    output_ids = model.generate(pixel_values, **gen_kwargs)
+    preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
+    preds = [prepare(pred).strip() for pred in preds]
+    return preds
+predict_step(['doctor.e16ba4e4.jpg'])  # returns one prompt string per input image
+```
+## Additional Information
+This model supports both SFW and NSFW content.
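
The `prepare` helper added in this README can be tried without downloading any model. The sketch below is a compact restatement of it that applies the same replacements in the same order. The `raw` string is a made-up example of decoder output, not something produced by the model.

```python
import re


def prepare(text: str) -> str:
    """Compact restatement of the README's `prepare` post-processing helper."""
    # Drop the space after and before each of these punctuation marks,
    # in the same order as the README: '.', '(', ')', ':', '_'.
    for mark in ".():_":
        text = text.replace(f"{mark} ", mark).replace(f" {mark}", mark)
    # Remove empty emphasis groups together with their adjacent comma.
    text = text.replace(",(())", "").replace("(()),", "")
    # Cap emphasis nesting at two levels: "((((x))))" becomes "((x))".
    for _ in range(10):
        text = text.replace(")))", "))").replace("(((", "((")
    # Strip angle-bracket tags such as "<lora:...>".
    text = re.sub(r"<[^>]*>", "", text)
    return text


# Hypothetical raw decoder output, invented for illustration only.
raw = "masterpiece, ( ( best quality ) ), 1girl, <lora:detail:0.8> ((((sharp focus))))"
print(prepare(raw))
```

Note that the helper also removes the space before every opening parenthesis, so a group that follows a comma ends up attached to it.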