lisonallen committed
Commit 50f328c · 1 Parent(s): 4292ab9

Add Hugging Face Space deployment configuration files and dependencies

Files changed (10):
  1. .gitattributes +5 -0
  2. .gitignore +27 -23
  3. Dockerfile +42 -0
  4. README-HF.md +36 -0
  5. README.md +24 -462
  6. app.py +387 -0
  7. diffusers_helper/__init__.py +1 -0
  8. diffusers_helper/hf_login.py +21 -17
  9. requirements.txt +4 -1
  10. setup.sh +7 -0
.gitattributes ADDED
@@ -0,0 +1,5 @@
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.mp4 filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
.gitignore CHANGED
@@ -1,16 +1,8 @@
- hf_download/
- outputs/
- repo/
-
- # Byte-compiled / optimized / DLL files
+ # Python
  __pycache__/
  *.py[cod]
  *$py.class
-
- # C extensions
  *.so
-
- # Distribution / packaging
  .Python
  build/
  develop-eggs/
@@ -24,15 +16,36 @@ parts/
  sdist/
  var/
  wheels/
- share/python-wheels/
  *.egg-info/
  .installed.cfg
  *.egg
- MANIFEST

- # PyInstaller
- # Usually these files are written by a python script from a template
- # before PyInstaller builds the exe, so as to inject date/other infos into it.
+ # Project specific
+ outputs/
+ hf_download/
+ *.mp4
+ *.safetensors
+ *.bin
+ *.pt
+ *.pth
+
+ # Environment
+ .env
+ .venv
+ env/
+ venv/
+ ENV/
+ env.bak/
+ venv.bak/
+ .DS_Store
+
+ # IDE settings
+ .vscode/
+ .idea/
+ *.swp
+ *.swo
+
+ # Byte-compiled / optimized / DLL files
  *.manifest
  *.spec

@@ -131,15 +144,6 @@ celerybeat.pid
  # SageMath parsed files
  *.sage.py

- # Environments
- .env
- .venv
- env/
- venv/
- ENV/
- env.bak/
- venv.bak/
-
  # Spyder project settings
  .spyderproject
  .spyproject
Dockerfile ADDED
@@ -0,0 +1,42 @@
+ FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04
+
+ # Non-interactive installation to avoid prompts
+ ENV DEBIAN_FRONTEND=noninteractive
+ ENV TZ=Asia/Shanghai
+
+ # Install basic tools and Python
+ RUN apt-get update && apt-get install -y \
+     git \
+     python3 \
+     python3-pip \
+     ffmpeg \
+     libgl1-mesa-glx \
+     libglib2.0-0 \
+     && apt-get clean \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Set the working directory
+ WORKDIR /app
+
+ # Copy the required files
+ COPY requirements.txt ./
+ COPY app.py ./
+ COPY setup.sh ./
+ COPY README.md ./
+ COPY diffusers_helper ./diffusers_helper
+
+ # Install Python dependencies
+ RUN pip3 install --no-cache-dir -r requirements.txt
+
+ # Create the required directories
+ RUN mkdir -p /app/outputs
+ RUN mkdir -p /app/hf_download
+
+ # Set permissions
+ RUN chmod +x setup.sh
+
+ # Set environment variables
+ ENV HF_HOME=/app/hf_download
+
+ # Run the application
+ CMD ["python3", "app.py"]
README-HF.md ADDED
@@ -0,0 +1,36 @@
+ # FramePack - Image-to-Video Generation
+
+ ![FramePack cover image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/gradio-spaces/gradio-banner.png)
+
+ An AI application that turns a static image into a moving video. Upload a picture of a person, add a motion description, and generate a smooth video!
+
+ ## How to Use
+
+ 1. Upload an image of a person
+ 2. Enter a prompt describing the desired motion (e.g. "The girl dances gracefully")
+ 3. Adjust the video length and other optional parameters
+ 4. Click the "开始生成" (Start Generation) button
+ 5. Wait for the video to be generated (the process is progressive and keeps extending the video length)
+
+ ## Example Prompts
+
+ - "The girl dances gracefully, with clear movements, full of charm."
+ - "The man dances energetically, leaping mid-air with fluid arm swings and quick footwork."
+ - "A character doing some simple body movements."
+
+ ## Technical Highlights
+
+ - Built on Hunyuan Video and the FramePack architecture
+ - Runs on low-VRAM GPUs
+ - Can generate videos up to 120 seconds long
+ - Uses TeaCache to speed up generation
+
+ ## Notes
+
+ - Video generation runs in reverse order, so the ending motion is generated before the starting motion
+ - The models (about 30GB) are downloaded on first use, so please be patient
+ - If you hit an out-of-memory error, increase the "GPU推理保留内存" (GPU inference preserved memory) value
+
+ ---
+
+ Original project: [FramePack GitHub](https://github.com/lllyasviel/FramePack)
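
The "progressive" behaviour described above comes from the section arithmetic in app.py (added later in this commit): the requested length is split into a number of latent sections that are sampled one by one. A small sketch of that arithmetic, assuming the app's defaults (30 fps, latent_window_size = 9):

def total_latent_sections(total_second_length: float, latent_window_size: int = 9) -> int:
    # Same formula as worker() in app.py: 30 fps, 4 frames per latent step, at least one section.
    return int(max(round((total_second_length * 30) / (latent_window_size * 4)), 1))

frames_per_section = 9 * 4 - 3  # num_frames in app.py for the default window size
for seconds in (1, 5, 60, 120):
    print(f"{seconds:>3}s -> {total_latent_sections(seconds)} section(s), {frames_per_section} frames sampled per section")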
README.md CHANGED
@@ -1,477 +1,39 @@
1
- <p align="center">
2
- <img src="https://github.com/user-attachments/assets/2cc030b4-87e1-40a0-b5bf-1b7d6b62820b" width="300">
3
- </p>
4
-
5
  # FramePack
6
 
7
- Official implementation and desktop software for ["Packing Input Frame Context in Next-Frame Prediction Models for Video Generation"](https://lllyasviel.github.io/frame_pack_gitpage/).
8
-
9
- Links: [**Paper**](https://lllyasviel.github.io/frame_pack_gitpage/pack.pdf), [**Project Page**](https://lllyasviel.github.io/frame_pack_gitpage/)
10
-
11
- FramePack is a next-frame (next-frame-section) prediction neural network structure that generates videos progressively.
12
-
13
- FramePack compresses input contexts to a constant length so that the generation workload is invariant to video length.
14
-
15
- FramePack can process a very large number of frames with 13B models even on laptop GPUs.
16
-
17
- FramePack can be trained with a much larger batch size, similar to the batch size for image diffusion training.
18
-
19
- **Video diffusion, but feels like image diffusion.**
20
-
21
- # Requirements
22
-
23
- Note that this repo is a functional desktop software with minimal standalone high-quality sampling system and memory management.
24
-
25
- **Start with this repo before you try anything else!**
26
-
27
- Requirements:
28
-
29
- * Nvidia GPU in RTX 30XX, 40XX, 50XX series that supports fp16 and bf16. The GTX 10XX/20XX are not tested.
30
- * Linux or Windows operating system.
31
- * At least 6GB GPU memory.
32
-
33
- To generate 1-minute video (60 seconds) at 30fps (1800 frames) using 13B model, the minimal required GPU memory is 6GB. (Yes 6 GB, not a typo. Laptop GPUs are okay.)
34
-
35
- About speed, on my RTX 4090 desktop it generates at a speed of 2.5 seconds/frame (unoptimized) or 1.5 seconds/frame (teacache). On my laptops like 3070ti laptop or 3060 laptop, it is about 4x to 8x slower.
36
-
37
- In any case, you will directly see the generated frames since it is next-frame(-section) prediction. So you will get lots of visual feedback before the entire video is generated.
38
-
39
- # Installation
40
-
41
- **Windows**:
42
-
43
- [>>> Click Here to Download One-Click Package (CUDA 12.6 + Pytorch 2.6) <<<](https://github.com/lllyasviel/FramePack/releases/download/windows/framepack_cu126_torch26.7z)
44
-
45
- After you download, you uncompress, use `update.bat` to update, and use `run.bat` to run.
46
-
47
- Note that running `update.bat` is important, otherwise you may be using a previous version with potential bugs unfixed.
48
-
49
- ![image](https://github.com/lllyasviel/stable-diffusion-webui-forge/assets/19834515/c49bd60d-82bd-4086-9859-88d472582b94)
50
-
51
- Note that the models will be downloaded automatically. You will download more than 30GB from HuggingFace.
52
-
53
- **Linux**:
54
-
55
- We recommend having an independent Python 3.10.
56
-
57
- pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
58
- pip install -r requirements.txt
59
-
60
- To start the GUI, run:
61
-
62
- python demo_gradio.py
63
-
64
- Note that it supports `--share`, `--port`, `--server`, and so on.
65
-
66
- The software supports PyTorch attention, xformers, flash-attn, sage-attention. By default, it will just use PyTorch attention. You can install those attention kernels if you know how.
67
-
68
- For example, to install sage-attention (linux):
69
-
70
- pip install sageattention==1.0.6
71
-
72
- However, you are highly recommended to first try without sage-attention since it will influence results, though the influence is minimal.
73
-
74
- # GUI
75
-
76
- ![ui](https://github.com/user-attachments/assets/8c5cdbb1-b80c-4b7e-ac27-83834ac24cc4)
77
-
78
- On the left you upload an image and write a prompt.
79
-
80
- On the right are the generated videos and latent previews.
81
-
82
- Because this is a next-frame-section prediction model, videos will be generated longer and longer.
83
-
84
- You will see the progress bar for each section and the latent preview for the next section.
85
-
86
- Note that the initial progress may be slower than later diffusion as the device may need some warmup.
87
-
88
- # Sanity Check
89
-
90
- Before trying your own inputs, we highly recommend going through the sanity check to find out if any hardware or software went wrong.
91
-
92
- Next-frame-section prediction models are very sensitive to subtle differences in noise and hardware. Usually, people will get slightly different results on different devices, but the results should look overall similar. In some cases, if possible, you'll get exactly the same results.
93
-
94
- ## Image-to-5-seconds
95
-
96
- Download this image:
97
-
98
- <img src="https://github.com/user-attachments/assets/f3bc35cf-656a-4c9c-a83a-bbab24858b09" width="150">
99
-
100
- Copy this prompt:
101
-
102
- `The man dances energetically, leaping mid-air with fluid arm swings and quick footwork.`
103
-
104
- Set like this:
105
-
106
- (all default parameters, with teacache turned off)
107
- ![image](https://github.com/user-attachments/assets/0071fbb6-600c-4e0f-adc9-31980d540e9d)
108
-
109
- The result will be:
110
-
111
- <table>
112
- <tr>
113
- <td align="center" width="300">
114
- <video
115
- src="https://github.com/user-attachments/assets/bc74f039-2b14-4260-a30b-ceacf611a185"
116
- controls
117
- style="max-width:100%;">
118
- </video>
119
- </td>
120
- </tr>
121
- <tr>
122
- <td align="center">
123
- <em>Video may be compressed by GitHub</em>
124
- </td>
125
- </tr>
126
- </table>
127
-
128
- **Important Note:**
129
-
130
- Again, this is a next-frame-section prediction model. This means you will generate videos frame-by-frame or section-by-section.
131
-
132
- **If you get a much shorter video in the UI, like a video with only 1 second, then it is totally expected.** You just need to wait. More sections will be generated to complete the video.
133
-
134
- ## Know the influence of TeaCache and Quantization
135
-
136
- Download this image:
137
-
138
- <img src="https://github.com/user-attachments/assets/42293e30-bdd4-456d-895c-8fedff71be04" width="150">
139
-
140
- Copy this prompt:
141
-
142
- `The girl dances gracefully, with clear movements, full of charm.`
143
-
144
- Set like this:
145
-
146
- ![image](https://github.com/user-attachments/assets/4274207d-5180-4824-a552-d0d801933435)
147
-
148
- Turn off teacache:
149
-
150
- ![image](https://github.com/user-attachments/assets/53b309fb-667b-4aa8-96a1-f129c7a09ca6)
151
-
152
- You will get this:
153
-
154
- <table>
155
- <tr>
156
- <td align="center" width="300">
157
- <video
158
- src="https://github.com/user-attachments/assets/04ab527b-6da1-4726-9210-a8853dda5577"
159
- controls
160
- style="max-width:100%;">
161
- </video>
162
- </td>
163
- </tr>
164
- <tr>
165
- <td align="center">
166
- <em>Video may be compressed by GitHub</em>
167
- </td>
168
- </tr>
169
- </table>
170
-
171
- Now turn on teacache:
172
-
173
- ![image](https://github.com/user-attachments/assets/16ad047b-fbcc-4091-83dc-d46bea40708c)
174
-
175
- About 30% users will get this (the other 70% will get other random results depending on their hardware):
176
-
177
- <table>
178
- <tr>
179
- <td align="center" width="300">
180
- <video
181
- src="https://github.com/user-attachments/assets/149fb486-9ccc-4a48-b1f0-326253051e9b"
182
- controls
183
- style="max-width:100%;">
184
- </video>
185
- </td>
186
- </tr>
187
- <tr>
188
- <td align="center">
189
- <em>A typical worse result.</em>
190
- </td>
191
- </tr>
192
- </table>
193
 
194
- So you can see that teacache is not really lossless and sometimes can influence the result a lot.
195
 
196
- We recommend using teacache to try ideas and then using the full diffusion process to get high-quality results.
 
 
 
 
197
 
198
- This recommendation also applies to sage-attention, bnb quant, gguf, etc., etc.
199
 
200
- ## Image-to-1-minute
 
 
 
 
201
 
202
- <img src="https://github.com/user-attachments/assets/820af6ca-3c2e-4bbc-afe8-9a9be1994ff5" width="150">
203
 
204
- `The girl dances gracefully, with clear movements, full of charm.`
 
 
205
 
206
- ![image](https://github.com/user-attachments/assets/8c34fcb2-288a-44b3-a33d-9d2324e30cbd)
207
 
208
- Set video length to 60 seconds:
 
 
209
 
210
- ![image](https://github.com/user-attachments/assets/5595a7ea-f74e-445e-ad5f-3fb5b4b21bee)
211
 
212
- If everything is in order you will get some result like this eventually.
213
-
214
- 60s version:
215
-
216
- <table>
217
- <tr>
218
- <td align="center" width="300">
219
- <video
220
- src="https://github.com/user-attachments/assets/c3be4bde-2e33-4fd4-b76d-289a036d3a47"
221
- controls
222
- style="max-width:100%;">
223
- </video>
224
- </td>
225
- </tr>
226
- <tr>
227
- <td align="center">
228
- <em>Video may be compressed by GitHub</em>
229
- </td>
230
- </tr>
231
- </table>
232
-
233
- 6s version:
234
-
235
- <table>
236
- <tr>
237
- <td align="center" width="300">
238
- <video
239
- src="https://github.com/user-attachments/assets/37fe2c33-cb03-41e8-acca-920ab3e34861"
240
- controls
241
- style="max-width:100%;">
242
- </video>
243
- </td>
244
- </tr>
245
- <tr>
246
- <td align="center">
247
- <em>Video may be compressed by GitHub</em>
248
- </td>
249
- </tr>
250
- </table>
251
-
252
- # More Examples
253
-
254
- Many more examples are in [**Project Page**](https://lllyasviel.github.io/frame_pack_gitpage/).
255
-
256
- Below are some more examples that you may be interested in reproducing.
257
-
258
- ---
259
-
260
- <img src="https://github.com/user-attachments/assets/99f4d281-28ad-44f5-8700-aa7a4e5638fa" width="150">
261
-
262
- `The girl dances gracefully, with clear movements, full of charm.`
263
-
264
- ![image](https://github.com/user-attachments/assets/0e98bfca-1d91-4b1d-b30f-4236b517c35e)
265
-
266
- <table>
267
- <tr>
268
- <td align="center" width="300">
269
- <video
270
- src="https://github.com/user-attachments/assets/cebe178a-09ce-4b7a-8f3c-060332f4dab1"
271
- controls
272
- style="max-width:100%;">
273
- </video>
274
- </td>
275
- </tr>
276
- <tr>
277
- <td align="center">
278
- <em>Video may be compressed by GitHub</em>
279
- </td>
280
- </tr>
281
- </table>
282
-
283
- ---
284
-
285
- <img src="https://github.com/user-attachments/assets/853f4f40-2956-472f-aa7a-fa50da03ed92" width="150">
286
-
287
- `The girl suddenly took out a sign that said “cute” using right hand`
288
-
289
- ![image](https://github.com/user-attachments/assets/d51180e4-5537-4e25-a6c6-faecae28648a)
290
-
291
- <table>
292
- <tr>
293
- <td align="center" width="300">
294
- <video
295
- src="https://github.com/user-attachments/assets/116069d2-7499-4f38-ada7-8f85517d1fbb"
296
- controls
297
- style="max-width:100%;">
298
- </video>
299
- </td>
300
- </tr>
301
- <tr>
302
- <td align="center">
303
- <em>Video may be compressed by GitHub</em>
304
- </td>
305
- </tr>
306
- </table>
307
-
308
- ---
309
-
310
- <img src="https://github.com/user-attachments/assets/6d87c53f-81b2-4108-a704-697164ae2e81" width="150">
311
-
312
- `The girl skateboarding, repeating the endless spinning and dancing and jumping on a skateboard, with clear movements, full of charm.`
313
-
314
- ![image](https://github.com/user-attachments/assets/c2cfa835-b8e6-4c28-97f8-88f42da1ffdf)
315
-
316
- <table>
317
- <tr>
318
- <td align="center" width="300">
319
- <video
320
- src="https://github.com/user-attachments/assets/d9e3534a-eb17-4af2-a8ed-8e692e9993d2"
321
- controls
322
- style="max-width:100%;">
323
- </video>
324
- </td>
325
- </tr>
326
- <tr>
327
- <td align="center">
328
- <em>Video may be compressed by GitHub</em>
329
- </td>
330
- </tr>
331
- </table>
332
-
333
- ---
334
-
335
- <img src="https://github.com/user-attachments/assets/6e95d1a5-9674-4c9a-97a9-ddf704159b79" width="150">
336
-
337
- `The girl dances gracefully, with clear movements, full of charm.`
338
-
339
- ![image](https://github.com/user-attachments/assets/7412802a-ce44-4188-b1a4-cfe19f9c9118)
340
-
341
- <table>
342
- <tr>
343
- <td align="center" width="300">
344
- <video
345
- src="https://github.com/user-attachments/assets/e1b3279e-e30d-4d32-b55f-2fb1d37c81d2"
346
- controls
347
- style="max-width:100%;">
348
- </video>
349
- </td>
350
- </tr>
351
- <tr>
352
- <td align="center">
353
- <em>Video may be compressed by GitHub</em>
354
- </td>
355
- </tr>
356
- </table>
357
-
358
- ---
359
-
360
- <img src="https://github.com/user-attachments/assets/90fc6d7e-8f6b-4f8c-a5df-ee5b1c8b63c9" width="150">
361
-
362
- `The man dances flamboyantly, swinging his hips and striking bold poses with dramatic flair.`
363
-
364
- ![image](https://github.com/user-attachments/assets/1dcf10a3-9747-4e77-a269-03a9379dd9af)
365
-
366
- <table>
367
- <tr>
368
- <td align="center" width="300">
369
- <video
370
- src="https://github.com/user-attachments/assets/aaa4481b-7bf8-4c64-bc32-909659767115"
371
- controls
372
- style="max-width:100%;">
373
- </video>
374
- </td>
375
- </tr>
376
- <tr>
377
- <td align="center">
378
- <em>Video may be compressed by GitHub</em>
379
- </td>
380
- </tr>
381
- </table>
382
 
383
  ---
384
 
385
- <img src="https://github.com/user-attachments/assets/62ecf987-ec0c-401d-b3c9-be9ffe84ee5b" width="150">
386
-
387
- `The woman dances elegantly among the blossoms, spinning slowly with flowing sleeves and graceful hand movements.`
388
-
389
- ![image](https://github.com/user-attachments/assets/396f06bc-e399-4ac3-9766-8a42d4f8d383)
390
-
391
-
392
- <table>
393
- <tr>
394
- <td align="center" width="300">
395
- <video
396
- src="https://github.com/user-attachments/assets/f23f2f37-c9b8-45d5-a1be-7c87bd4b41cf"
397
- controls
398
- style="max-width:100%;">
399
- </video>
400
- </td>
401
- </tr>
402
- <tr>
403
- <td align="center">
404
- <em>Video may be compressed by GitHub</em>
405
- </td>
406
- </tr>
407
- </table>
408
-
409
- ---
410
-
411
- <img src="https://github.com/user-attachments/assets/4f740c1a-2d2f-40a6-9613-d6fe64c428aa" width="150">
412
-
413
- `The young man writes intensely, flipping papers and adjusting his glasses with swift, focused movements.`
414
-
415
- ![image](https://github.com/user-attachments/assets/c4513c4b-997a-429b-b092-bb275a37b719)
416
-
417
- <table>
418
- <tr>
419
- <td align="center" width="300">
420
- <video
421
- src="https://github.com/user-attachments/assets/62e9910e-aea6-4b2b-9333-2e727bccfc64"
422
- controls
423
- style="max-width:100%;">
424
- </video>
425
- </td>
426
- </tr>
427
- <tr>
428
- <td align="center">
429
- <em>Video may be compressed by GitHub</em>
430
- </td>
431
- </tr>
432
- </table>
433
-
434
- ---
435
-
436
- # Prompting Guideline
437
-
438
- Many people would ask how to write better prompts.
439
-
440
- Below is a ChatGPT template that I personally often use to get prompts:
441
-
442
- You are an assistant that writes short, motion-focused prompts for animating images.
443
-
444
- When the user sends an image, respond with a single, concise prompt describing visual motion (such as human activity, moving objects, or camera movements). Focus only on how the scene could come alive and become dynamic using brief phrases.
445
-
446
- Larger and more dynamic motions (like dancing, jumping, running, etc.) are preferred over smaller or more subtle ones (like standing still, sitting, etc.).
447
-
448
- Describe subject, then motion, then other things. For example: "The girl dances gracefully, with clear movements, full of charm."
449
-
450
- If there is something that can dance (like a man, girl, robot, etc.), then prefer to describe it as dancing.
451
-
452
- Stay in a loop: one image in, one motion prompt out. Do not explain, ask questions, or generate multiple options.
453
-
454
- You paste the instruct to ChatGPT and then feed it an image to get prompt like this:
455
-
456
- ![image](https://github.com/user-attachments/assets/586c53b9-0b8c-4c94-b1d3-d7e7c1a705c3)
457
-
458
- *The man dances powerfully, striking sharp poses and gliding smoothly across the reflective floor.*
459
-
460
- Usually this will give you a prompt that works well.
461
-
462
- You can also write prompts yourself. Concise prompts are usually preferred, for example:
463
-
464
- *The girl dances gracefully, with clear movements, full of charm.*
465
-
466
- *The man dances powerfully, with clear movements, full of energy.*
467
-
468
- and so on.
469
-
470
- # Cite
471
-
472
- @article{zhang2025framepack,
473
- title={Packing Input Frame Contexts in Next-Frame Prediction Models for Video Generation},
474
- author={Lvmin Zhang and Maneesh Agrawala},
475
- journal={Arxiv},
476
- year={2025}
477
- }
  # FramePack

+ FramePack is an image-to-video generation tool that uses diffusion models to turn static images into dynamic videos.

+ ## Features

+ - Generates smooth motion videos from a single image
+ - Built on HunyuanVideo and the FramePack architecture
+ - Runs on low-VRAM GPUs (6GB minimum)
+ - Can generate videos up to 120 seconds long
+ - Uses TeaCache to speed up generation

+ ## How to Use

+ 1. Upload an image of a person
+ 2. Enter a prompt describing the desired motion
+ 3. Set the desired video length in seconds
+ 4. Click the "开始生成" (Start Generation) button
+ 5. Wait for the video to be generated (the process is progressive and keeps extending the video length)

+ ## Example Prompts

+ - "The girl dances gracefully, with clear movements, full of charm."
+ - "A character doing some simple body movements."
+ - "The man dances energetically, leaping mid-air with fluid arm swings and quick footwork."

+ ## Notes

+ - Video generation runs in reverse order, so the ending motion is generated before the starting motion
+ - For higher-quality results, turn off the TeaCache option
+ - If you hit an out-of-memory error, increase the "GPU推理保留内存" (GPU inference preserved memory) value

+ ## Technical Details

+ This app is based on the [FramePack](https://github.com/lllyasviel/FramePack) project and uses the Hunyuan Video model together with the FramePack technique for video generation. FramePack compresses the input context to a constant length, so the generation workload is independent of the video length, which makes it possible to process a large number of frames even on a laptop GPU.

  ---

+ Original project: [FramePack GitHub](https://github.com/lllyasviel/FramePack)
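
The constant-length context mentioned under "Technical Details" corresponds to the tensors built in worker() in app.py: the sampler is always conditioned on one start latent plus a fixed 1 + 2 + 16 history latents, no matter how long the accumulated history grows. A minimal sketch of that bookkeeping (the 64 x 96 latent resolution is just an assumed example of height // 8 by width // 8):

import torch

start_latent = torch.zeros(1, 16, 1, 64, 96)              # encoded input image
history_latents = torch.zeros(1, 16, 1 + 2 + 16, 64, 96)  # fixed-size history buffer

# Split the history into the 1x / 2x / 4x conditioning groups, as in app.py.
clean_latents_post, clean_latents_2x, clean_latents_4x = \
    history_latents[:, :, :1 + 2 + 16, :, :].split([1, 2, 16], dim=2)
clean_latents = torch.cat([start_latent, clean_latents_post], dim=2)

# The context handed to the sampler never grows with the video length:
print(clean_latents.shape)     # torch.Size([1, 16, 2, 64, 96])
print(clean_latents_2x.shape)  # torch.Size([1, 16, 2, 64, 96])
print(clean_latents_4x.shape)  # torch.Size([1, 16, 16, 64, 96])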
app.py ADDED
@@ -0,0 +1,387 @@
1
+ from diffusers_helper.hf_login import login
2
+
3
+ import os
4
+
5
+ os.environ['HF_HOME'] = os.path.abspath(os.path.realpath(os.path.join(os.path.dirname(__file__), './hf_download')))
6
+
7
+ import gradio as gr
8
+ import torch
9
+ import traceback
10
+ import einops
11
+ import safetensors.torch as sf
12
+ import numpy as np
13
+ import math
14
+
15
+ from PIL import Image
16
+ from diffusers import AutoencoderKLHunyuanVideo
17
+ from transformers import LlamaModel, CLIPTextModel, LlamaTokenizerFast, CLIPTokenizer
18
+ from diffusers_helper.hunyuan import encode_prompt_conds, vae_decode, vae_encode, vae_decode_fake
19
+ from diffusers_helper.utils import save_bcthw_as_mp4, crop_or_pad_yield_mask, soft_append_bcthw, resize_and_center_crop, state_dict_weighted_merge, state_dict_offset_merge, generate_timestamp
20
+ from diffusers_helper.models.hunyuan_video_packed import HunyuanVideoTransformer3DModelPacked
21
+ from diffusers_helper.pipelines.k_diffusion_hunyuan import sample_hunyuan
22
+ from diffusers_helper.memory import cpu, gpu, get_cuda_free_memory_gb, move_model_to_device_with_memory_preservation, offload_model_from_device_for_memory_preservation, fake_diffusers_current_device, DynamicSwapInstaller, unload_complete_models, load_model_as_complete
23
+ from diffusers_helper.thread_utils import AsyncStream, async_run
24
+ from diffusers_helper.gradio.progress_bar import make_progress_bar_css, make_progress_bar_html
25
+ from transformers import SiglipImageProcessor, SiglipVisionModel
26
+ from diffusers_helper.clip_vision import hf_clip_vision_encode
27
+ from diffusers_helper.bucket_tools import find_nearest_bucket
28
+
29
+ # Get the available CUDA memory
30
+ free_mem_gb = get_cuda_free_memory_gb(gpu)
31
+ high_vram = free_mem_gb > 60
32
+
33
+ print(f'Free VRAM {free_mem_gb} GB')
34
+ print(f'High-VRAM Mode: {high_vram}')
35
+
36
+ # Load the models
37
+ text_encoder = LlamaModel.from_pretrained("hunyuanvideo-community/HunyuanVideo", subfolder='text_encoder', torch_dtype=torch.float16).cpu()
38
+ text_encoder_2 = CLIPTextModel.from_pretrained("hunyuanvideo-community/HunyuanVideo", subfolder='text_encoder_2', torch_dtype=torch.float16).cpu()
39
+ tokenizer = LlamaTokenizerFast.from_pretrained("hunyuanvideo-community/HunyuanVideo", subfolder='tokenizer')
40
+ tokenizer_2 = CLIPTokenizer.from_pretrained("hunyuanvideo-community/HunyuanVideo", subfolder='tokenizer_2')
41
+ vae = AutoencoderKLHunyuanVideo.from_pretrained("hunyuanvideo-community/HunyuanVideo", subfolder='vae', torch_dtype=torch.float16).cpu()
42
+
43
+ feature_extractor = SiglipImageProcessor.from_pretrained("lllyasviel/flux_redux_bfl", subfolder='feature_extractor')
44
+ image_encoder = SiglipVisionModel.from_pretrained("lllyasviel/flux_redux_bfl", subfolder='image_encoder', torch_dtype=torch.float16).cpu()
45
+
46
+ transformer = HunyuanVideoTransformer3DModelPacked.from_pretrained('lllyasviel/FramePackI2V_HY', torch_dtype=torch.bfloat16).cpu()
47
+
48
+ vae.eval()
49
+ text_encoder.eval()
50
+ text_encoder_2.eval()
51
+ image_encoder.eval()
52
+ transformer.eval()
53
+
54
+ if not high_vram:
55
+ vae.enable_slicing()
56
+ vae.enable_tiling()
57
+
58
+ transformer.high_quality_fp32_output_for_inference = True
59
+ print('transformer.high_quality_fp32_output_for_inference = True')
60
+
61
+ transformer.to(dtype=torch.bfloat16)
62
+ vae.to(dtype=torch.float16)
63
+ image_encoder.to(dtype=torch.float16)
64
+ text_encoder.to(dtype=torch.float16)
65
+ text_encoder_2.to(dtype=torch.float16)
66
+
67
+ vae.requires_grad_(False)
68
+ text_encoder.requires_grad_(False)
69
+ text_encoder_2.requires_grad_(False)
70
+ image_encoder.requires_grad_(False)
71
+ transformer.requires_grad_(False)
72
+
73
+ if not high_vram:
74
+ # DynamicSwapInstaller is same as huggingface's enable_sequential_offload but 3x faster
75
+ DynamicSwapInstaller.install_model(transformer, device=gpu)
76
+ DynamicSwapInstaller.install_model(text_encoder, device=gpu)
77
+ else:
78
+ text_encoder.to(gpu)
79
+ text_encoder_2.to(gpu)
80
+ image_encoder.to(gpu)
81
+ vae.to(gpu)
82
+ transformer.to(gpu)
83
+
84
+ stream = AsyncStream()
85
+
86
+ outputs_folder = './outputs/'
87
+ os.makedirs(outputs_folder, exist_ok=True)
88
+
89
+
90
+ @torch.no_grad()
91
+ def worker(input_image, prompt, n_prompt, seed, total_second_length, latent_window_size, steps, cfg, gs, rs, gpu_memory_preservation, use_teacache):
92
+ total_latent_sections = (total_second_length * 30) / (latent_window_size * 4)
93
+ total_latent_sections = int(max(round(total_latent_sections), 1))
94
+
95
+ job_id = generate_timestamp()
96
+
97
+ stream.output_queue.push(('progress', (None, '', make_progress_bar_html(0, 'Starting ...'))))
98
+
99
+ try:
100
+ # Clean GPU
101
+ if not high_vram:
102
+ unload_complete_models(
103
+ text_encoder, text_encoder_2, image_encoder, vae, transformer
104
+ )
105
+
106
+ # Text encoding
107
+
108
+ stream.output_queue.push(('progress', (None, '', make_progress_bar_html(0, 'Text encoding ...'))))
109
+
110
+ if not high_vram:
111
+ fake_diffusers_current_device(text_encoder, gpu) # since we only encode one text - that is one model move and one encode, offload is same time consumption since it is also one load and one encode.
112
+ load_model_as_complete(text_encoder_2, target_device=gpu)
113
+
114
+ llama_vec, clip_l_pooler = encode_prompt_conds(prompt, text_encoder, text_encoder_2, tokenizer, tokenizer_2)
115
+
116
+ if cfg == 1:
117
+ llama_vec_n, clip_l_pooler_n = torch.zeros_like(llama_vec), torch.zeros_like(clip_l_pooler)
118
+ else:
119
+ llama_vec_n, clip_l_pooler_n = encode_prompt_conds(n_prompt, text_encoder, text_encoder_2, tokenizer, tokenizer_2)
120
+
121
+ llama_vec, llama_attention_mask = crop_or_pad_yield_mask(llama_vec, length=512)
122
+ llama_vec_n, llama_attention_mask_n = crop_or_pad_yield_mask(llama_vec_n, length=512)
123
+
124
+ # Processing input image
125
+
126
+ stream.output_queue.push(('progress', (None, '', make_progress_bar_html(0, 'Image processing ...'))))
127
+
128
+ H, W, C = input_image.shape
129
+ height, width = find_nearest_bucket(H, W, resolution=640)
130
+ input_image_np = resize_and_center_crop(input_image, target_width=width, target_height=height)
131
+
132
+ Image.fromarray(input_image_np).save(os.path.join(outputs_folder, f'{job_id}.png'))
133
+
134
+ input_image_pt = torch.from_numpy(input_image_np).float() / 127.5 - 1
135
+ input_image_pt = input_image_pt.permute(2, 0, 1)[None, :, None]
136
+
137
+ # VAE encoding
138
+
139
+ stream.output_queue.push(('progress', (None, '', make_progress_bar_html(0, 'VAE encoding ...'))))
140
+
141
+ if not high_vram:
142
+ load_model_as_complete(vae, target_device=gpu)
143
+
144
+ start_latent = vae_encode(input_image_pt, vae)
145
+
146
+ # CLIP Vision
147
+
148
+ stream.output_queue.push(('progress', (None, '', make_progress_bar_html(0, 'CLIP Vision encoding ...'))))
149
+
150
+ if not high_vram:
151
+ load_model_as_complete(image_encoder, target_device=gpu)
152
+
153
+ image_encoder_output = hf_clip_vision_encode(input_image_np, feature_extractor, image_encoder)
154
+ image_encoder_last_hidden_state = image_encoder_output.last_hidden_state
155
+
156
+ # Dtype
157
+
158
+ llama_vec = llama_vec.to(transformer.dtype)
159
+ llama_vec_n = llama_vec_n.to(transformer.dtype)
160
+ clip_l_pooler = clip_l_pooler.to(transformer.dtype)
161
+ clip_l_pooler_n = clip_l_pooler_n.to(transformer.dtype)
162
+ image_encoder_last_hidden_state = image_encoder_last_hidden_state.to(transformer.dtype)
163
+
164
+ # Sampling
165
+
166
+ stream.output_queue.push(('progress', (None, '', make_progress_bar_html(0, 'Start sampling ...'))))
167
+
168
+ rnd = torch.Generator("cpu").manual_seed(seed)
169
+ num_frames = latent_window_size * 4 - 3
170
+
171
+ history_latents = torch.zeros(size=(1, 16, 1 + 2 + 16, height // 8, width // 8), dtype=torch.float32).cpu()
172
+ history_pixels = None
173
+ total_generated_latent_frames = 0
174
+
175
+ latent_paddings = reversed(range(total_latent_sections))
176
+
177
+ if total_latent_sections > 4:
178
+ # In theory the latent_paddings should follow the above sequence, but it seems that duplicating some
179
+ # items looks better than expanding it when total_latent_sections > 4
180
+ # One can try to remove below trick and just
181
+ # use `latent_paddings = list(reversed(range(total_latent_sections)))` to compare
182
+ latent_paddings = [3] + [2] * (total_latent_sections - 3) + [1, 0]
183
+
184
+ for latent_padding in latent_paddings:
185
+ is_last_section = latent_padding == 0
186
+ latent_padding_size = latent_padding * latent_window_size
187
+
188
+ if stream.input_queue.top() == 'end':
189
+ stream.output_queue.push(('end', None))
190
+ return
191
+
192
+ print(f'latent_padding_size = {latent_padding_size}, is_last_section = {is_last_section}')
193
+
194
+ indices = torch.arange(0, sum([1, latent_padding_size, latent_window_size, 1, 2, 16])).unsqueeze(0)
195
+ clean_latent_indices_pre, blank_indices, latent_indices, clean_latent_indices_post, clean_latent_2x_indices, clean_latent_4x_indices = indices.split([1, latent_padding_size, latent_window_size, 1, 2, 16], dim=1)
196
+ clean_latent_indices = torch.cat([clean_latent_indices_pre, clean_latent_indices_post], dim=1)
197
+
198
+ clean_latents_pre = start_latent.to(history_latents)
199
+ clean_latents_post, clean_latents_2x, clean_latents_4x = history_latents[:, :, :1 + 2 + 16, :, :].split([1, 2, 16], dim=2)
200
+ clean_latents = torch.cat([clean_latents_pre, clean_latents_post], dim=2)
201
+
202
+ if not high_vram:
203
+ unload_complete_models()
204
+ move_model_to_device_with_memory_preservation(transformer, target_device=gpu, preserved_memory_gb=gpu_memory_preservation)
205
+
206
+ if use_teacache:
207
+ transformer.initialize_teacache(enable_teacache=True, num_steps=steps)
208
+ else:
209
+ transformer.initialize_teacache(enable_teacache=False)
210
+
211
+ def callback(d):
212
+ preview = d['denoised']
213
+ preview = vae_decode_fake(preview)
214
+
215
+ preview = (preview * 255.0).detach().cpu().numpy().clip(0, 255).astype(np.uint8)
216
+ preview = einops.rearrange(preview, 'b c t h w -> (b h) (t w) c')
217
+
218
+ if stream.input_queue.top() == 'end':
219
+ stream.output_queue.push(('end', None))
220
+ raise KeyboardInterrupt('User ends the task.')
221
+
222
+ current_step = d['i'] + 1
223
+ percentage = int(100.0 * current_step / steps)
224
+ hint = f'Sampling {current_step}/{steps}'
225
+ desc = f'Total generated frames: {int(max(0, total_generated_latent_frames * 4 - 3))}, Video length: {max(0, (total_generated_latent_frames * 4 - 3) / 30) :.2f} seconds (FPS-30). The video is being extended now ...'
226
+ stream.output_queue.push(('progress', (preview, desc, make_progress_bar_html(percentage, hint))))
227
+ return
228
+
229
+ generated_latents = sample_hunyuan(
230
+ transformer=transformer,
231
+ sampler='unipc',
232
+ width=width,
233
+ height=height,
234
+ frames=num_frames,
235
+ real_guidance_scale=cfg,
236
+ distilled_guidance_scale=gs,
237
+ guidance_rescale=rs,
238
+ # shift=3.0,
239
+ num_inference_steps=steps,
240
+ generator=rnd,
241
+ prompt_embeds=llama_vec,
242
+ prompt_embeds_mask=llama_attention_mask,
243
+ prompt_poolers=clip_l_pooler,
244
+ negative_prompt_embeds=llama_vec_n,
245
+ negative_prompt_embeds_mask=llama_attention_mask_n,
246
+ negative_prompt_poolers=clip_l_pooler_n,
247
+ device=gpu,
248
+ dtype=torch.bfloat16,
249
+ image_embeddings=image_encoder_last_hidden_state,
250
+ latent_indices=latent_indices,
251
+ clean_latents=clean_latents,
252
+ clean_latent_indices=clean_latent_indices,
253
+ clean_latents_2x=clean_latents_2x,
254
+ clean_latent_2x_indices=clean_latent_2x_indices,
255
+ clean_latents_4x=clean_latents_4x,
256
+ clean_latent_4x_indices=clean_latent_4x_indices,
257
+ callback=callback,
258
+ )
259
+
260
+ if is_last_section:
261
+ generated_latents = torch.cat([start_latent.to(generated_latents), generated_latents], dim=2)
262
+
263
+ total_generated_latent_frames += int(generated_latents.shape[2])
264
+ history_latents = torch.cat([generated_latents.to(history_latents), history_latents], dim=2)
265
+
266
+ if not high_vram:
267
+ offload_model_from_device_for_memory_preservation(transformer, target_device=gpu, preserved_memory_gb=8)
268
+ load_model_as_complete(vae, target_device=gpu)
269
+
270
+ real_history_latents = history_latents[:, :, :total_generated_latent_frames, :, :]
271
+
272
+ if history_pixels is None:
273
+ history_pixels = vae_decode(real_history_latents, vae).cpu()
274
+ else:
275
+ section_latent_frames = (latent_window_size * 2 + 1) if is_last_section else (latent_window_size * 2)
276
+ overlapped_frames = latent_window_size * 4 - 3
277
+
278
+ current_pixels = vae_decode(real_history_latents[:, :, :section_latent_frames], vae).cpu()
279
+ history_pixels = soft_append_bcthw(current_pixels, history_pixels, overlapped_frames)
280
+
281
+ if not high_vram:
282
+ unload_complete_models()
283
+
284
+ output_filename = os.path.join(outputs_folder, f'{job_id}_{total_generated_latent_frames}.mp4')
285
+
286
+ save_bcthw_as_mp4(history_pixels, output_filename, fps=30)
287
+
288
+ print(f'Decoded. Current latent shape {real_history_latents.shape}; pixel shape {history_pixels.shape}')
289
+
290
+ stream.output_queue.push(('file', output_filename))
291
+
292
+ if is_last_section:
293
+ break
294
+ except:
295
+ traceback.print_exc()
296
+
297
+ if not high_vram:
298
+ unload_complete_models(
299
+ text_encoder, text_encoder_2, image_encoder, vae, transformer
300
+ )
301
+
302
+ stream.output_queue.push(('end', None))
303
+ return
304
+
305
+
306
+ def process(input_image, prompt, n_prompt, seed, total_second_length, latent_window_size, steps, cfg, gs, rs, gpu_memory_preservation, use_teacache):
307
+ global stream
308
+ assert input_image is not None, 'No input image!'
309
+
310
+ yield None, None, '', '', gr.update(interactive=False), gr.update(interactive=True)
311
+
312
+ stream = AsyncStream()
313
+
314
+ async_run(worker, input_image, prompt, n_prompt, seed, total_second_length, latent_window_size, steps, cfg, gs, rs, gpu_memory_preservation, use_teacache)
315
+
316
+ output_filename = None
317
+
318
+ while True:
319
+ flag, data = stream.output_queue.next()
320
+
321
+ if flag == 'file':
322
+ output_filename = data
323
+ yield output_filename, gr.update(), gr.update(), gr.update(), gr.update(interactive=False), gr.update(interactive=True)
324
+
325
+ if flag == 'progress':
326
+ preview, desc, html = data
327
+ yield gr.update(), gr.update(visible=True, value=preview), desc, html, gr.update(interactive=False), gr.update(interactive=True)
328
+
329
+ if flag == 'end':
330
+ yield output_filename, gr.update(visible=False), gr.update(), '', gr.update(interactive=True), gr.update(interactive=False)
331
+ break
332
+
333
+
334
+ def end_process():
335
+ stream.input_queue.push('end')
336
+
337
+
338
+ quick_prompts = [
339
+ 'The girl dances gracefully, with clear movements, full of charm.',
340
+ 'A character doing some simple body movements.',
341
+ ]
342
+ quick_prompts = [[x] for x in quick_prompts]
343
+
344
+
345
+ css = make_progress_bar_css()
346
+ block = gr.Blocks(css=css).queue()
347
+ with block:
348
+ gr.Markdown('# FramePack - 图像到视频生成')
349
+ with gr.Row():
350
+ with gr.Column():
351
+ input_image = gr.Image(sources='upload', type="numpy", label="上传图像", height=320)
352
+ prompt = gr.Textbox(label="提示词", value='')
353
+ example_quick_prompts = gr.Dataset(samples=quick_prompts, label='快速提示词列表', samples_per_page=1000, components=[prompt])
354
+ example_quick_prompts.click(lambda x: x[0], inputs=[example_quick_prompts], outputs=prompt, show_progress=False, queue=False)
355
+
356
+ with gr.Row():
357
+ start_button = gr.Button(value="开始生成")
358
+ end_button = gr.Button(value="结束生成", interactive=False)
359
+
360
+ with gr.Group():
361
+ use_teacache = gr.Checkbox(label='使用TeaCache', value=True, info='速度更快,但可能会使手指和手的生成效果稍差。')
362
+
363
+ n_prompt = gr.Textbox(label="负面提示词", value="", visible=False) # Not used
364
+ seed = gr.Number(label="随机种子", value=31337, precision=0)
365
+
366
+ total_second_length = gr.Slider(label="视频长度(秒)", minimum=1, maximum=120, value=5, step=0.1)
367
+ latent_window_size = gr.Slider(label="潜在窗口大小", minimum=1, maximum=33, value=9, step=1, visible=False) # Should not change
368
+ steps = gr.Slider(label="推理步数", minimum=1, maximum=100, value=25, step=1, info='不建议修改此值。')
369
+
370
+ cfg = gr.Slider(label="CFG Scale", minimum=1.0, maximum=32.0, value=1.0, step=0.01, visible=False) # Should not change
371
+ gs = gr.Slider(label="蒸馏CFG比例", minimum=1.0, maximum=32.0, value=10.0, step=0.01, info='不建议修改此值。')
372
+ rs = gr.Slider(label="CFG重缩放", minimum=0.0, maximum=1.0, value=0.0, step=0.01, visible=False) # Should not change
373
+
374
+ gpu_memory_preservation = gr.Slider(label="GPU推理保留内存(GB)(值越大速度越慢)", minimum=6, maximum=128, value=6, step=0.1, info="如果出现OOM错误,请将此值设置得更大。值越大,速度越慢。")
375
+
376
+ with gr.Column():
377
+ preview_image = gr.Image(label="下一批潜变量", height=200, visible=False)
378
+ result_video = gr.Video(label="生成的视频", autoplay=True, show_share_button=False, height=512, loop=True)
379
+ gr.Markdown('注意:由于采样是倒序的,结束动作将在开始动作之前生成。如果视频中没有出现起始动作,请继续等待,它将在稍后生成。')
380
+ progress_desc = gr.Markdown('', elem_classes='no-generating-animation')
381
+ progress_bar = gr.HTML('', elem_classes='no-generating-animation')
382
+ ips = [input_image, prompt, n_prompt, seed, total_second_length, latent_window_size, steps, cfg, gs, rs, gpu_memory_preservation, use_teacache]
383
+ start_button.click(fn=process, inputs=ips, outputs=[result_video, preview_image, progress_desc, progress_bar, start_button, end_button])
384
+ end_button.click(fn=end_process)
385
+
386
+
387
+ block.launch()
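
worker() and process() above communicate through the AsyncStream queues: the background job pushes ('progress' | 'file' | 'end', data) tuples and the Gradio generator drains them, swapping in each longer partial video as it arrives. The sketch below recreates that pattern with only the standard library (queue and threading stand in for AsyncStream and async_run); it is an illustration of the flow, not the helper's actual implementation.

import queue
import threading
import time

output_queue: "queue.Queue[tuple[str, object]]" = queue.Queue()

def fake_worker(n_sections: int) -> None:
    # Plays the role of worker(): emit progress, then a longer partial file, then 'end'.
    for i in range(n_sections):
        time.sleep(0.1)  # pretend to sample one section
        output_queue.put(('progress', f'section {i + 1}/{n_sections} done'))
        output_queue.put(('file', f'outputs/job_{i + 1}.mp4'))
    output_queue.put(('end', None))

threading.Thread(target=fake_worker, args=(3,), daemon=True).start()

# Plays the role of process(): drain the queue until 'end' arrives.
while True:
    flag, data = output_queue.get()        # mirrors stream.output_queue.next()
    if flag == 'progress':
        print('progress:', data)
    elif flag == 'file':
        print('new partial video:', data)  # the UI swaps in the longer video here
    elif flag == 'end':
        break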
diffusers_helper/__init__.py ADDED
@@ -0,0 +1 @@
+ # diffusers_helper package
diffusers_helper/hf_login.py CHANGED
@@ -1,21 +1,25 @@
  import os
+ from huggingface_hub import login as hf_hub_login  # aliased so it is not shadowed by the helper below

+ def login():
+     # When running inside a Hugging Face Space, rely on the environment token
+     if os.environ.get('SPACE_ID') is not None:
+         print("Running in Hugging Face Space, using environment HF_TOKEN")
+         # The Space already has the required access; no extra login is needed
+         return

- def login(token):
-     from huggingface_hub import login
-     import time
+     # If the local environment provides a token, use it to log in
+     hf_token = os.environ.get('HF_TOKEN')
+     if hf_token:
+         print("Logging in with HF_TOKEN from environment")
+         hf_hub_login(token=hf_token)
+         return
+
+     # Check for a cached token
+     cache_file = os.path.expanduser('~/.huggingface/token')
+     if os.path.exists(cache_file):
+         print("Found cached Hugging Face token")
+         return

-     while True:
-         try:
-             login(token)
-             print('HF login ok.')
-             break
-         except Exception as e:
-             print(f'HF login failed: {e}. Retrying')
-             time.sleep(0.5)
-
-
- hf_token = os.environ.get('HF_TOKEN', None)
-
- if hf_token is not None:
-     login(hf_token)
+     print("No Hugging Face token found. Using public access.")
+     # Without a token, public access is used; it may be slower and rate-limited
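
For reference, app.py only imports this helper at the top of the file; a caller that wants the token applied before the gated model downloads would invoke it explicitly. A minimal, hypothetical usage sketch:

from diffusers_helper.hf_login import login

# Hypothetical call site: run before any from_pretrained() download so a configured
# HF_TOKEN or cached credential is picked up; it is a no-op inside a Space.
login()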
requirements.txt CHANGED
@@ -9,7 +9,10 @@ numpy==1.26.2
  scipy==1.12.0
  requests==2.31.0
  torchsde==0.2.6
-
+ torch>=2.0.0
+ torchvision
+ torchaudio
  einops
  opencv-contrib-python
  safetensors
+ huggingface_hub
setup.sh ADDED
@@ -0,0 +1,7 @@
+ #!/bin/bash
+ # Create the required directories
+ mkdir -p hf_download
+ mkdir -p outputs
+
+ # The models are downloaded automatically on first run if they are not already present
+ echo "Environment ready. Run 'python app.py' to start the application."