PrismCaptioner Model Card

Model details

PrismCaptioners are open-source captioners built on the LLaVA architecture and finetuned on the GPT4V-assisted dataset ALLaVA. We have released PrismCaptioner-7B and PrismCaptioner-2B.

PrismCaptioner-2B details

  • Vision Backbone: google/siglip-so400m-patch14-384
  • Language Backbone: internlm/internlm2-1_8b
  • Dataset: 1x ALLaVA-Caption-[LAION/VFLAN], 2x Evol-Instruct-GPT4-Turbo-143K

For more information, see the paper and codebase: [Paper] [Code]

Intended uses

  • Perception Module: The model can be integrated into Prism as a perception module to solve vision-language tasks in combination with an external LLM (see the sketch after this list).
  • Effective Captioner: The model can produce high-quality captions for given images.
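
As a rough illustration of the perception-module use case, the sketch below pairs PrismCaptioner-2B with a text-only LLM: the captioner converts the image into a description, and the external LLM answers the question from that description. It reuses the supported_VLM interface shown under Model usage and assumes generate returns the caption as a string; query_llm is a hypothetical placeholder for whatever external LLM client you use, not part of the Prism codebase.

# Minimal sketch of the Prism-style perception + reasoning split.
# `supported_VLM` comes from the Prism repo; `query_llm` is a hypothetical
# stand-in for your external text-only LLM client.
from decouple import supported_VLM

def query_llm(prompt: str) -> str:
    # Placeholder: call your external LLM of choice here.
    raise NotImplementedError

captioner = supported_VLM['prismcaptioner-2b']()

def answer_visual_question(image_path: str, question: str) -> str:
    # Stage 1 (perception): describe the image in text.
    caption = captioner.generate([image_path, 'Please provide a detailed description of the image.'])
    # Stage 2 (reasoning): answer the question from the caption alone.
    prompt = f"Image description:\n{caption}\n\nQuestion: {question}\nAnswer:"
    return query_llm(prompt)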

Model usage

Clone the Prism repo and complete the preparation steps. You can then use PrismCaptioners as shown in the usage example below.

# In the Prism repo folder
from decouple import supported_VLM

# Load the captioner by its registered name
model = supported_VLM['prismcaptioner-2b']()
# generate() takes a list of [image path, text prompt]
res = model.generate(['assets/case1.png', 'Given the image below, please provide a detailed description of what you see.'])
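
To caption several images, the same generate interface can be called in a loop; the second file name below is only an example path, not an asset shipped with the repo.

# Caption multiple images with the same interface (adjust paths to your files).
image_paths = ['assets/case1.png', 'assets/case2.png']
prompt = 'Given the image below, please provide a detailed description of what you see.'
captions = [model.generate([path, prompt]) for path in image_paths]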