
Libraries

How to use Lin-Chen/ShareCaptioner-Video with Transformers:

```
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Lin-Chen/ShareCaptioner-Video", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("Lin-Chen/ShareCaptioner-Video", trust_remote_code=True, dtype="auto")
```

Local Apps

vLLM

How to use Lin-Chen/ShareCaptioner-Video with vLLM:

Install from pip and serve model

```
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Lin-Chen/ShareCaptioner-Video"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Lin-Chen/ShareCaptioner-Video",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```
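
The same chat-completions request can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `send_chat_request` are hypothetical, and the default base URL assumes the vLLM server started above is listening on port 8000. Building and sending are separated so the payload can be inspected without a running server.

```python
import json
import urllib.request


def build_chat_request(prompt, model="Lin-Chen/ShareCaptioner-Video",
                       base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    req = build_chat_request("What is the capital of France?")
    print(req.full_url)
    # Uncomment once a server is running locally:
    # print(send_chat_request(req))
```

For a server listening elsewhere, such as the SGLang port 30000 used further down, pass a different `base_url`.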

Use Docker

```
docker model run hf.co/Lin-Chen/ShareCaptioner-Video
```

SGLang

How to use Lin-Chen/ShareCaptioner-Video with SGLang:

Install from pip and serve model

```
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Lin-Chen/ShareCaptioner-Video" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Lin-Chen/ShareCaptioner-Video",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Use Docker images

```
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Lin-Chen/ShareCaptioner-Video" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "Lin-Chen/ShareCaptioner-Video",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Docker Model Runner
How to use Lin-Chen/ShareCaptioner-Video with Docker Model Runner:
```
docker model run hf.co/Lin-Chen/ShareCaptioner-Video
```

ShareCaptioner-Video Model Card

Model details

Model type: ShareCaptioner-Video is an open-source video captioner fine-tuned on GPT4V-assisted ShareGPT4Video detailed caption data. It supports videos of various durations, aspect ratios, and resolutions. ShareCaptioner-Video is based on the InternLM-XComposer2-4KHD model.

ShareCaptioner-Video supports four roles:

Fast Captioning: The model employs an image-grid format for direct video captioning, providing rapid generation speeds that are ideal for short videos. In practice, we concatenate all the keyframes of a video into a vertically elongated image and train the model on a caption task.
Sliding Captioning: The model supports streaming captioning in a differential sliding-window format, yielding high-quality captions that are suitable for long videos. We take the two adjacent keyframes alongside the previous differential caption as input, and train the model to describe the events occurring between them.
Clip Summarizing: The model can swiftly summarize any clip from ShareGPT4Video or videos that have undergone the differential sliding-window captioning process, eliminating the need to re-process frames. We use all the differential descriptions as input, and the output is the video caption.
Prompt Re-Captioning: The model can rephrase user-written prompts for specific video generation scenarios, so that text-to-video models (T2VMs) trained on high-quality video-caption data receive inference-time prompts in the same format as their training captions. In practice, we use GPT-4 to generate Sora-style prompts for our dense captions, and we train the re-captioning task in reverse, i.e., by using the generated prompt as input and the dense caption as the training target.
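
The data flow of the first three roles can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the released implementation: frames are toy nested lists rather than real images, and `describe_pair` and `summarize` are hypothetical callables standing in for calls to the model.

```python
from typing import Callable, List, Optional

Frame = List[List[int]]  # toy frame: a list of pixel rows (assumption)


def stack_keyframes_vertically(frames: List[Frame]) -> Frame:
    """Fast Captioning input: join keyframes top-to-bottom into one tall image."""
    if not frames:
        raise ValueError("need at least one keyframe")
    width = len(frames[0][0])
    grid: Frame = []
    for frame in frames:
        if any(len(row) != width for row in frame):
            raise ValueError("all keyframes must share the same width")
        grid.extend(list(row) for row in frame)
    return grid


def sliding_captions(
    keyframes: List[str],
    describe_pair: Callable[[str, str, Optional[str]], str],
) -> List[str]:
    """Sliding Captioning: caption each adjacent keyframe pair in order.

    Each call receives the two keyframes plus the previous differential
    caption (None for the first pair).
    """
    captions: List[str] = []
    previous: Optional[str] = None
    for first, second in zip(keyframes, keyframes[1:]):
        previous = describe_pair(first, second, previous)
        captions.append(previous)
    return captions


def summarize_clip(diff_captions: List[str],
                   summarize: Callable[[str], str]) -> str:
    """Clip Summarizing: only the differential captions are used, no frames."""
    return summarize("\n".join(diff_captions))


if __name__ == "__main__":
    # Stub callables replace the model so the sketch runs on its own.
    caps = sliding_captions(["f0", "f1", "f2"],
                            lambda a, b, prev: f"change from {a} to {b}")
    print(caps)
    print(summarize_clip(caps, lambda text: text.replace("\n", "; ")))
```

Passing the previous caption into each step is what makes the sliding mode differential: every call describes only what changed between two keyframes, and the summarizer later merges those descriptions without revisiting the frames.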

Model date: ShareCaptioner-Video was trained in May 2024.

Paper or resources for more information: [Project] [Paper] [Code]

Intended use

Primary intended uses: The primary use of ShareCaptioner-Video is producing high-quality video captions.

Primary intended users: The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

Finetuning dataset

40K GPT4V-generated video-caption pairs
40K differential sliding-window captioning conversations
40K prompt-to-caption textual data

Paper

arxiv.org/abs/2406.04325

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions (arXiv 2406.04325, published Jun 6, 2024)