|
--- |
|
license: cc-by-nc-sa-4.0 |
|
language: |
|
- ja |
|
tags: |
|
- clip |
|
- ja |
|
- japanese |
|
- japanese-clip |
|
pipeline_tag: feature-extraction |
|
--- |
|
|
|
# Japanese CLIP ViT-H/14 (Deeper) |
|
|
|
## Table of Contents |
|
|
|
1. [Overview](#overview) |
|
1. [Usage](#usage) |
|
1. [Model Details](#model-details) |
|
1. [Evaluation](#evaluation) |
|
1. [Limitations and Biases](#limitations-and-biases) |
|
1. [Citation](#citation) |
|
1. [See Also](#see-also) |
|
1. [Contact Information](#contact-information) |
|
|
|
## Overview |
|
|
|
* **Developed by**: [HAKUHODO Technologies Inc.](https://www.hakuhodo-technologies.co.jp/) |
|
* **Model type**: Contrastive Language-Image Pre-trained Model |
|
* **Language(s)**: Japanese |
|
* **LICENSE**: [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) |
|
|
|
This is a Japanese [CLIP (Contrastive Language-Image Pre-training)](https://arxiv.org/abs/2103.00020) model
that maps Japanese text and images into a shared embedding space.
It supports multimodal tasks such as zero-shot image classification,
text-to-image retrieval, and image-to-text retrieval,
and it can also serve as a building block in larger systems,
such as image-to-text and text-to-image generative models.
|
|
|
## Usage |
|
|
|
### Dependencies |
|
|
|
```bash |
|
python3 -m pip install pillow sentencepiece torch torchvision transformers |
|
``` |
|
|
|
### Inference |
|
|
|
The usage is similar to [`CLIPModel`](https://huggingface.co/docs/transformers/model_doc/clip) |
|
and [`VisionTextDualEncoderModel`](https://huggingface.co/docs/transformers/model_doc/vision-text-dual-encoder). |
|
|
|
```python |
|
import requests |
|
import torch |
|
from PIL import Image |
|
from transformers import AutoModel, AutoProcessor, BatchEncoding |
|
|
|
# Download |
|
model_name = "hakuhodo-tech/japanese-clip-vit-h-14-bert-deeper" |
|
device = "cuda" if torch.cuda.is_available() else "cpu" |
|
model = AutoModel.from_pretrained(model_name, trust_remote_code=True).to(device) |
|
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True) |
|
|
|
# Prepare raw inputs |
|
url = "http://images.cocodataset.org/val2017/000000039769.jpg" |
|
image = Image.open(requests.get(url, stream=True).raw) |
|
|
|
# Process inputs |
|
inputs = processor( |
|
text=["犬", "猫", "象"],  # "dog", "cat", "elephant"
|
images=image, |
|
return_tensors="pt", |
|
padding=True, |
|
) |
|
|
|
# Infer and output |
|
outputs = model(**BatchEncoding(inputs).to(device)) |
|
probs = outputs.logits_per_image.softmax(dim=1) |
|
print([f"{x:.2f}" for x in probs.flatten().tolist()]) # ['0.00', '1.00', '0.00'] |
|
``` |
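
The forward pass above also returns the projected image and text embeddings, which are what you would use for retrieval and other feature-extraction workloads. The sketch below is a minimal illustration that assumes the output object exposes `image_embeds` and `text_embeds` fields, as the outputs of `CLIPModel` and `VisionTextDualEncoderModel` do; check the model's own output class if the attribute names differ.

```python
# Reuse `outputs` from the snippet above. The projected embeddings are
# L2-normalized and compared by cosine similarity to rank the captions.
image_embeds = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
text_embeds = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)

similarity = image_embeds @ text_embeds.T  # (num_images, num_texts) cosine similarities
print(similarity)
print(similarity.argmax(dim=-1).tolist())  # index of the best-matching text per image
```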
|
|
|
## Model Details |
|
|
|
### Components |
|
|
|
The model consists of a frozen ViT-H image encoder from |
|
[laion/CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) |
|
and a 24-layer 12-head BERT text encoder initialized from |
|
[hakuhodo-tech/japanese-clip-vit-h-14-bert-base](https://huggingface.co/hakuhodo-tech/japanese-clip-vit-h-14-bert-base) |
|
with [Modified ZerO](https://www.anlp.jp/proceedings/annual_meeting/2024/pdf_dir/B6-5.pdf). |
|
|
|
### Training |
|
|
|
The model was trained by Zhi Wang on 8 A100 (80 GB) GPUs
using [Locked-image Tuning (LiT)](https://arxiv.org/abs/2111.07991).
See [the paper](https://www.anlp.jp/proceedings/annual_meeting/2024/pdf_dir/B6-5.pdf) for further details.
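
For readers unfamiliar with LiT, the idea is to keep the pre-trained image encoder frozen and update only the text encoder (and projection) with the standard CLIP contrastive objective. The sketch below is purely illustrative and is not the actual training code; the attribute name `vision_model` in the comments is a placeholder, not a documented property of this model.

```python
import torch
import torch.nn.functional as F


def lit_step(model, optimizer, pixel_values, input_ids, attention_mask):
    """One illustrative Locked-image Tuning step: the image tower stays frozen
    while the text tower (and projection) is updated with a contrastive loss."""
    outputs = model(
        pixel_values=pixel_values,
        input_ids=input_ids,
        attention_mask=attention_mask,
    )
    logits = outputs.logits_per_image  # (batch, batch) scaled similarities
    labels = torch.arange(logits.size(0), device=logits.device)

    # Symmetric InfoNCE loss over the image->text and text->image directions.
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Freezing the image encoder is done once before training, e.g.:
# for p in model.vision_model.parameters():  # "vision_model" is a placeholder name
#     p.requires_grad_(False)
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4
# )
```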
|
|
|
### Dataset |
|
|
|
The model was trained on the Japanese subset of the [laion2B-multi](https://huggingface.co/datasets/laion/laion2B-multi) dataset, containing ~120M image-text pairs.
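
To get a feel for the data, the Japanese rows can be streamed from the Hugging Face Hub. This is a hedged sketch, not how the training pairs were actually prepared: the column names (`URL`, `TEXT`, `LANGUAGE`) follow the published laion2B-multi schema, and only caption metadata is streamed; the images themselves would have to be downloaded from the URLs.

```python
from datasets import load_dataset

# Stream the multilingual LAION metadata and keep only Japanese captions.
ds = load_dataset("laion/laion2B-multi", split="train", streaming=True)
japanese = ds.filter(lambda row: row["LANGUAGE"] == "ja")

# Peek at a few caption/URL pairs.
for row in japanese.take(3):
    print(row["URL"], row["TEXT"])
```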
|
|
|
## Evaluation |
|
|
|
### Testing Data |
|
|
|
The 5K evaluation set (val2017) of [MS-COCO](https://cocodataset.org/) |
|
with [STAIR Captions](http://captions.stair.center/). |
|
|
|
### Metrics |
|
|
|
Zero-shot image-to-text and text-to-image recall@1, 5, 10. |
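
For reference, retrieval recall@k can be computed from the query-candidate similarity matrix: a query counts as a hit if its ground-truth counterpart appears among its top-k candidates. The sketch below assumes a one-to-one diagonal pairing between queries and candidates; the actual evaluation with STAIR Captions has multiple captions per image and must account for that.

```python
import torch


def recall_at_k(similarity: torch.Tensor, k: int) -> float:
    """similarity[i, j] is the score of query i against candidate j;
    the ground truth for query i is assumed to be candidate i."""
    topk = similarity.topk(k, dim=-1).indices                             # (num_queries, k)
    targets = torch.arange(similarity.size(0), device=similarity.device)  # (num_queries,)
    hits = (topk == targets.unsqueeze(-1)).any(dim=-1).float()
    return hits.mean().item()


# Example: image-to-text retrieval with L2-normalized embeddings.
# sim = image_embeds @ text_embeds.T
# print(recall_at_k(sim, 1), recall_at_k(sim, 5), recall_at_k(sim, 10))
```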
|
|
|
### Results |
|
|
|
| Model | Text Retrieval R@1 | Text Retrieval R@5 | Text Retrieval R@10 | Image Retrieval R@1 | Image Retrieval R@5 | Image Retrieval R@10 |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| [recruit-jp/japanese-clip-vit-b-32-roberta-base](https://huggingface.co/recruit-jp/japanese-clip-vit-b-32-roberta-base) | 23.0 | 46.1 | 57.4 | 16.1 | 35.4 | 46.3 |
| [rinna/japanese-cloob-vit-b-16](https://huggingface.co/rinna/japanese-cloob-vit-b-16) | 37.1 | 63.7 | 74.2 | 25.1 | 48.0 | 58.8 |
| [rinna/japanese-clip-vit-b-16](https://huggingface.co/rinna/japanese-clip-vit-b-16) | 36.9 | 64.3 | 74.3 | 24.8 | 48.8 | 60.0 |
| [**Japanese CLIP ViT-H/14 (Base)**](https://huggingface.co/hakuhodo-tech/japanese-clip-vit-h-14-bert-base) | 39.2 | 66.3 | 76.6 | 28.9 | 53.3 | 63.9 |
| [**Japanese CLIP ViT-H/14 (Deeper)**](https://huggingface.co/hakuhodo-tech/japanese-clip-vit-h-14-bert-deeper) | **48.7** | 74.0 | 82.4 | 36.5 | 61.5 | 71.8 |
| [**Japanese CLIP ViT-H/14 (Wider)**](https://huggingface.co/hakuhodo-tech/japanese-clip-vit-h-14-bert-wider) | 47.9 | **74.2** | **83.2** | **37.3** | **62.8** | **72.7** |
|
|
|
\* [Japanese Stable CLIP ViT-L/16](https://huggingface.co/stabilityai/japanese-stable-clip-vit-l-16) is excluded from the zero-shot retrieval evaluation because
[the model was partially pre-trained with MS-COCO](https://huggingface.co/stabilityai/japanese-stable-clip-vit-l-16#training-dataset).
|
|
|
## Limitations and Biases |
|
|
|
Despite our data filtering, the training dataset may still contain offensive or inappropriate content.
Users should be mindful of the potential societal impact
and ethical considerations of the model's outputs
when deploying it in production systems.
The model should not be used for applications
that could cause harm or distress to individuals or groups.
|
|
|
## Citation |
|
|
|
If you find this model useful, please consider citing:
|
|
|
```bibtex |
|
@article{japanese-clip-vit-h, |
|
author = {王 直 and 細野 健人 and 石塚 湖太 and 奥田 悠太 and 川上 孝介}, |
|
journal = {言語処理学会年次大会発表論文集}, |
|
month = {Mar}, |
|
pages = {1547--1552}, |
|
title = {日本語特化の視覚と言語を組み合わせた事前学習モデルの開発 Developing Vision-Language Pre-Trained Models for {J}apanese}, |
|
volume = {30}, |
|
year = {2024} |
|
} |
|
``` |
|
|
|
## See Also |
|
|
|
* [Japanese CLIP ViT-H/14 (Base)](https://huggingface.co/hakuhodo-tech/japanese-clip-vit-h-14-bert-base) |
|
* [Japanese CLIP ViT-H/14 (Wider)](https://huggingface.co/hakuhodo-tech/japanese-clip-vit-h-14-bert-wider) |
|
|
|
## Contact Information |
|
|
|
Please contact
[hr-koho\@hakuhodo-technologies.co.jp](mailto:[email protected]?subject=Japanese%20CLIP%20ViT-H/14%20Models)
for questions and comments about the model,
as well as for business and partnership inquiries.
|
|
|
For inquiries in Japanese, please contact
[hr-koho\@hakuhodo-technologies.co.jp](mailto:[email protected]?subject=日本語CLIP%20ViT-H/14モデルについて).
|
|