---
language: ja
license: apache-2.0
tags:
  - clip
  - japanese-clip
pipeline_tag: feature-extraction
---

# clip-japanese-base

This is a Japanese CLIP (Contrastive Language-Image Pre-training) model developed by LY Corporation. It was trained on approximately 1 billion web-collected image-text pairs and can be applied to a variety of vision tasks, including zero-shot image classification and text-to-image / image-to-text retrieval.

## How to use

1. Install packages

```bash
pip install pillow requests sentencepiece transformers torch timm
```

2. Run

```python
import io
import requests
from PIL import Image
import torch
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

HF_MODEL_PATH = 'line-corporation/clip-japanese-base'
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the tokenizer, image processor, and model (custom code from the repo).
tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
processor = AutoImageProcessor.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
model = AutoModel.from_pretrained(HF_MODEL_PATH, trust_remote_code=True).to(device)

# Download an example image and preprocess it; tokenize the candidate labels
# ("犬" = dog, "猫" = cat, "象" = elephant).
image = Image.open(io.BytesIO(requests.get('https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260').content))
image = processor(image, return_tensors="pt").to(device)
text = tokenizer(["犬", "猫", "象"]).to(device)

# Encode both modalities and turn the scaled image-text similarities
# into label probabilities.
with torch.no_grad():
    image_features = model.get_image_features(**image)
    text_features = model.get_text_features(**text)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
# [[1., 0., 0.]]
```
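Because both encoders map into a shared embedding space, the same objects can be reused for retrieval. Below is a minimal text-to-image retrieval sketch built on the example above; the candidate URLs, the query string, and the cosine-similarity ranking are illustrative assumptions, not part of the original example.

```python
import torch.nn.functional as F

# Hypothetical candidate pool: replace with your own image URLs or PIL images.
image_urls = [
    'https://example.com/dog.jpg',
    'https://example.com/cat.jpg',
]
candidate_images = [Image.open(io.BytesIO(requests.get(u).content)) for u in image_urls]

with torch.no_grad():
    # Encode each candidate image with the same processor and model as above.
    img_embs = torch.cat([
        model.get_image_features(**processor(im, return_tensors="pt").to(device))
        for im in candidate_images
    ])

    # Encode the Japanese query ("a dog running on grass").
    query = tokenizer(["草の上を走る犬"]).to(device)
    txt_emb = model.get_text_features(**query)

    # Rank candidates by cosine similarity between the query and each image.
    sims = F.cosine_similarity(txt_emb, img_embs)
    ranking = sims.argsort(descending=True)

print("Candidates ranked best-first:", ranking.tolist())
```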

## Model architecture

The model uses an Eva02-B Transformer architecture as the image encoder and a 12-layer BERT as the text encoder. The text encoder was initialized from rinna/japanese-clip-vit-b-16.
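As a quick sanity check, the rough encoder sizes (86M image encoder + 100M text encoder, see the table below) can be verified on the loaded model. The sketch uses only the generic PyTorch module API and the `model` object from the example above; it does not assume any particular submodule names.

```python
# Print per-submodule and total parameter counts of the loaded model.
total = sum(p.numel() for p in model.parameters())
print(f"total: {total / 1e6:.0f}M parameters")

for name, child in model.named_children():
    n = sum(p.numel() for p in child.parameters())
    print(f"{name}: {n / 1e6:.0f}M parameters")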

## Evaluation

### Dataset

Evaluation covers STAIR Captions (image-text retrieval, R@1), Recruit Datasets (zero-shot image classification, acc@1), and ImageNet-1K (zero-shot image classification, acc@1); results are reported in the table below.

### Result

| Model | Image Encoder Params | Text Encoder Params | STAIR Captions (R@1) | Recruit Datasets (acc@1) | ImageNet-1K (acc@1) |
|---|---|---|---|---|---|
| Ours | 86M (Eva02-B) | 100M (BERT) | 0.30 | 0.89 | 0.58 |
| Stable-ja-clip | 307M (ViT-L) | 100M (BERT) | 0.24 | 0.77 | 0.68 |
| Rinna-ja-clip | 86M (ViT-B) | 100M (BERT) | 0.13 | 0.54 | 0.56 |
| Laion-clip | 632M (ViT-H) | 561M (XLM-RoBERTa) | 0.30 | 0.83 | 0.58 |
| Hakuhodo-ja-clip | 632M (ViT-H) | 100M (BERT) | 0.21 | 0.82 | 0.46 |
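The metrics in the table are standard: R@1 checks whether the top-ranked candidate is the ground-truth match, and acc@1 is top-1 classification accuracy. A minimal illustration of both, computed from an image-text similarity matrix, is sketched below; this is not the evaluation script used to produce the table.

```python
import torch

def recall_at_1(sim: torch.Tensor) -> float:
    """R@1 for retrieval: sim[i, j] is the similarity between query i and
    candidate j; the correct candidate for query i is assumed to sit at
    index i (a common evaluation setup)."""
    targets = torch.arange(sim.size(0), device=sim.device)
    return (sim.argmax(dim=-1) == targets).float().mean().item()

def acc_at_1(sim: torch.Tensor, labels: torch.Tensor) -> float:
    """acc@1 for zero-shot classification: sim[i, c] is the similarity
    between image i and the text prompt for class c."""
    return (sim.argmax(dim=-1) == labels).float().mean().item()
```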

## Licenses

The Apache License, Version 2.0

## Citation

```bibtex
@misc{clip-japanese-base,
    title = {CLIP Japanese Base},
    author = {Shuhei Yokoo and Shuntaro Okada and Peifei Zhu and Shuhei Nishimura and Naoki Takayama},
    url = {https://huggingface.co/line-corporation/clip-japanese-base},
}
```