---
language:
- ja
tags:
- heron
- vision
- image-captioning
- VQA
pipeline_tag: image-to-text
license:
- cc-by-nc-4.0
inference: false
---
# Heron BLIP Japanese StableLM Base 7B llava-620k
## Model Details
Heron BLIP Japanese StableLM Base 7B is a vision-language model that can converse about input images.<br>
This model was trained using [the heron library](https://github.com/turingmotors/heron). Please refer to the code for details.
## Usage
First, follow the installation guide in [the heron repository](https://github.com/turingmotors/heron/).
```python
import torch
from heron.models.video_blip import VideoBlipForConditionalGeneration, VideoBlipProcessor
from transformers import LlamaTokenizer
device_id = 0
device = f"cuda:{device_id}"
MODEL_NAME = "turing-motors/heron-chat-blip-ja-stablelm-base-7b-v1"
model = VideoBlipForConditionalGeneration.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, ignore_mismatched_sizes=True
)
model = model.half()
model.eval()
model.to(device)
# prepare a processor
processor = VideoBlipProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")
tokenizer = LlamaTokenizer.from_pretrained("novelai/nerdstash-tokenizer-v1", additional_special_tokens=['▁▁'])
processor.tokenizer = tokenizer
import requests
from PIL import Image
# prepare inputs
url = "https://www.barnorama.com/wp-content/uploads/2016/12/03-Confusing-Pictures.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# chat-style prompt: "##human:" marks the user turn, "##gpt: " prompts the model's reply
# the question asks: "What is interesting about this image?"
text = "##human: この画像の面白い点は何ですか?\n##gpt: "
# do preprocessing
inputs = processor(
    text=text,
    images=image,
    return_tensors="pt",
    truncation=True,
)
inputs = {k: v.to(device) for k, v in inputs.items()}
inputs["pixel_values"] = inputs["pixel_values"].to(device, torch.float16)
# set eos tokens: stop at pad/eos, or when the model starts a new "##" turn marker
eos_token_id_list = [
    processor.tokenizer.pad_token_id,
    processor.tokenizer.eos_token_id,
    int(tokenizer.convert_tokens_to_ids("##")),
]
# do inference
with torch.no_grad():
    out = model.generate(**inputs, max_length=256, do_sample=False, temperature=0., eos_token_id=eos_token_id_list, no_repeat_ngram_size=2)
# print result
print(processor.tokenizer.batch_decode(out))
```
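The decoded output typically echoes the prompt along with the generated continuation. The following is a minimal post-processing sketch (not part of the official example) that continues from the snippet above; it assumes a batch size of 1 and the `##human:`/`##gpt:` chat format used in the prompt.

```python
# Decode without special tokens, then keep only the model's reply.
# Continues from the snippet above (uses `processor` and `out`); assumes batch size 1.
decoded = processor.tokenizer.batch_decode(out, skip_special_tokens=True)[0]
answer = decoded.split("##gpt:")[-1]    # text after the last "##gpt:" marker
answer = answer.split("##")[0].strip()  # drop anything after the next turn marker
print(answer)
```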
## Model Details
* **Developed by**: [Turing Inc.](https://www.turing-motors.com/)
* **Adapter type**: [BLIP2](https://arxiv.org/abs/2301.12597)
* **Language Model**: [Japanese StableLM Base Alpha](https://huggingface.co/stabilityai/japanese-stablelm-base-alpha-7b)
* **Language(s)**: Japanese
### Training
This model was fully fine-tuned on the LLaVA-Instruct-620K-JA dataset.
### Training Dataset
- LLaVA-Instruct-620K-JA
## Use and Limitations
### Intended Use
This model is intended for use in chat-like applications and for research purposes.
### Limitations
The model may produce inaccurate or false information, and its accuracy is not guaranteed. It is still in the research and development stage.
## How to cite
```bibtex
@misc{BlipJapaneseStableLM,
    url = {https://huggingface.co/turing-motors/heron-chat-blip-ja-stablelm-base-7b-v0},
    title = {Heron BLIP Japanese StableLM Base 7B},
    author = {Kotaro Tanahashi and Yuichi Inoue and Yu Yamaguchi}
}
```
## Citations
```bibtex
@misc{JapaneseInstructBLIPAlpha,
    url = {https://huggingface.co/stabilityai/japanese-instructblip-alpha},
title = {Japanese InstructBLIP Alpha},
author = {Shing, Makoto and Akiba, Takuya}
}
```