	---
	language:
	- en
	base_model:
	- tencent/HunyuanVideo
	pipeline_tag: image-to-video
	---

	# SkyReels V1: Human-Centric Video Foundation Model
	<p align="center">
	<img src="assets/logo2.png" alt="SkyReels Logo" width="60%">
	</p>

	<p align="center">
	<a href="https://github.com/SkyworkAI/SkyReels-V1" target="_blank">🌐 GitHub</a> · 👋 <a href="https://www.skyreels.ai/home?utm_campaign=huggingface_V1_i2v" target="_blank">Playground</a> · 💬 <a href="https://discord.gg/PwM6NYtccQ" target="_blank">Discord</a>
	</p>

	---
	This repo contains Diffusers-format model weights for the SkyReels V1 Image-to-Video models. You can find the inference code in our GitHub repository [SkyReels-V1](https://github.com/SkyworkAI/SkyReels-V1).

	## Introduction
	SkyReels V1 is the first and most advanced open-source human-centric video foundation model. By fine-tuning <a href="https://huggingface.co/tencent/HunyuanVideo">HunyuanVideo</a> on roughly 10 million (O(10M)) high-quality film and television clips, SkyReels V1 offers three key advantages:

	1. Open-Source Leadership: Our Text-to-Video model achieves state-of-the-art (SOTA) performance among open-source models, comparable to proprietary models like Kling and Hailuo.
	2. Advanced Facial Animation: Captures 33 distinct facial expressions with over 400 natural movement combinations, accurately reflecting human emotions.
	3. Cinematic Lighting and Aesthetics: Because the model is trained on high-quality, Hollywood-level film and television data, each generated frame exhibits cinematic quality in composition, actor positioning, and camera angles.

	## 🔑 Key Features

	### 1. Self-Developed Data Cleaning and Annotation Pipeline

	Our model is trained on data produced by a self-developed cleaning and annotation pipeline, which yields a large dataset of high-quality film, television, and documentary content. The pipeline provides:

	- Expression Classification: Categorizes human facial expressions into 33 distinct types.
	- Character Spatial Awareness: Utilizes 3D human reconstruction technology to understand spatial relationships between multiple people in a video, enabling film-level character positioning.
	- Action Recognition: Constructs over 400 action semantic units to achieve a precise understanding of human actions.
	- Scene Understanding: Conducts cross-modal correlation analysis of clothing, scenes, and plots.
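
To make the four annotation types above more concrete, here is a minimal sketch of what a per-clip annotation record could look like. The class names, field names, and the `left_to_right` helper are hypothetical and are not taken from the SkyReels codebase; only the count of 33 expression classes comes from this model card.

```python
from dataclasses import dataclass, field

# From the model card: facial expressions are categorized into 33 types.
NUM_EXPRESSION_CLASSES = 33


@dataclass
class PersonAnnotation:
    """Hypothetical labels for one person in a clip (illustrative only)."""

    expression_id: int  # index into the 33 expression classes
    action_unit_ids: list = field(default_factory=list)  # action semantic unit indices
    position_xyz: tuple = (0.0, 0.0, 0.0)  # assumed output of 3D human reconstruction

    def __post_init__(self):
        # Reject labels outside the 33 expression classes.
        if not 0 <= self.expression_id < NUM_EXPRESSION_CLASSES:
            raise ValueError(f"expression_id must be in [0, {NUM_EXPRESSION_CLASSES})")


@dataclass
class ClipAnnotation:
    """Hypothetical record combining person-level and scene-level labels."""

    clip_id: str
    people: list
    scene_tags: list  # e.g. clothing, scene, and plot descriptors

    def left_to_right(self):
        # Order people by their x coordinate, a simple spatial-relationship query.
        return sorted(self.people, key=lambda person: person.position_xyz[0])
```

A record like this keeps person-level labels (expression, actions, position) separate from clip-level labels (scene tags), which mirrors the split between the first three bullets and the last one.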

	### 2. Multi-Stage Image-to-Video Pretraining

	Our multi-stage pretraining pipeline, inspired by the <a href="https://huggingface.co/tencent/HunyuanVideo">HunyuanVideo</a> design, consists of the following stages:

	- Stage 1: Model Domain Transfer Pretraining: We use a large dataset of film and television content (on the order of 10 million clips) to adapt the text-to-video model to the human-centric video domain.
	- Stage 2: Image-to-Video Model Pretraining: We convert the text-to-video model from Stage 1 into an image-to-video model by adjusting the conv-in parameters. This new model is then pretrained on the same dataset used in Stage 1.
	- Stage 3: High-Quality Fine-Tuning: We fine-tune the image-to-video model on a high-quality subset of the original dataset, ensuring superior performance and quality.
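
The model card does not spell out how the conv-in parameters are adjusted in Stage 2. One common approach, assumed here, is to widen the input convolution so it also accepts image-condition channels, and to zero-initialize the new weights so the widened layer initially reproduces the text-to-video output. The sketch below illustrates that idea with a 1x1 kernel and plain Python lists; the function names are hypothetical.

```python
def expand_conv_in(weight, extra_in_channels):
    """Widen a 1x1 conv weight matrix with zero-initialized input channels.

    `weight` has one row per output channel and one entry per input channel.
    """
    return [row + [0.0] * extra_in_channels for row in weight]


def apply_1x1(weight, pixel):
    """Apply a 1x1 convolution at a single spatial position."""
    for row in weight:
        if len(row) != len(pixel):
            raise ValueError("input channel count does not match the weight")
    return [sum(w * x for w, x in zip(row, pixel)) for row in weight]


# Toy text-to-video conv-in: 2 output channels, 3 latent input channels.
t2v_weight = [[0.5, -1.0, 2.0],
              [1.5, 0.25, -0.5]]

noisy_latent = [1.0, 2.0, 3.0]     # channels of the noisy video latent
image_condition = [9.0, 9.0, 9.0]  # channels of the encoded first frame (assumed)

# Image-to-video conv-in: the input is the concatenation of both.
i2v_weight = expand_conv_in(t2v_weight, extra_in_channels=len(image_condition))

before = apply_1x1(t2v_weight, noisy_latent)
after = apply_1x1(i2v_weight, noisy_latent + image_condition)
print(before == after)
```

With zero-initialized extra channels, the image condition has no effect at the start of Stage 2, so pretraining begins from the Stage 1 behavior and gradually learns to use the conditioning image.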

	## Model Introduction
	| Model Name | Resolution | Video Length (frames) | FPS | Download Link |
	|-----------------|------------|--------------|-----|---------------|
	| SkyReels-V1-Hunyuan-I2V (Current) | 544 × 960 px | 97 | 24 | 🤗 [Download](https://huggingface.co/Skywork/SkyReels-V1-Hunyuan-I2V) |
	| SkyReels-V1-Hunyuan-T2V | 544 × 960 px | 97 | 24 | 🤗 [Download](https://huggingface.co/Skywork/SkyReels-V1-Hunyuan-T2V) |
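
As a quick sanity check on the numbers in the table, the sketch below computes the clip duration and an approximate latent shape. It assumes that the video length of 97 is a frame count, that the resolution is 544 high by 960 wide, and that the underlying HunyuanVideo VAE compresses by 4x in time and 8x in space. The helper names are hypothetical.

```python
def clip_duration_seconds(num_frames, fps):
    """Duration of a clip in seconds."""
    return num_frames / fps


def latent_shape(num_frames, height, width, t_stride=4, s_stride=8):
    """Approximate (frames, height, width) of the latent video.

    Assumes a causal video VAE in which the first frame is encoded on its own
    and every further group of `t_stride` frames becomes one latent frame.
    """
    return ((num_frames - 1) // t_stride + 1, height // s_stride, width // s_stride)


duration = clip_duration_seconds(num_frames=97, fps=24)
print(f"{duration:.2f} s")          # about 4 seconds per clip
print(latent_shape(97, 544, 960))   # latent frames, height, width
```

Under these assumptions, 97 = 4 × 24 + 1 frames map to 25 latent frames, which is why frame counts of the form 4k + 1 are a natural fit for this model family.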

	## Usage
	See the [Guide](https://github.com/SkyworkAI/SkyReels-V1) for details.

	## Citation
	```BibTeX
	@misc{SkyReelsV1,
	author = {SkyReels-AI},
	title = {Skyreels V1: Human-Centric Video Foundation Model},
	year = {2025},
	publisher = {Huggingface},
	journal = {Huggingface repository},
	howpublished = {\url{https://huggingface.co/Skywork/Skyreels-V1-Hunyuan-I2V}}
	}
	```