thesab committed on
Commit cb25745 · verified · 1 Parent(s): 526dc7a

Update README with installation, download, running instructions and model details


This pull request updates the README to improve readability and formatting and to provide additional instructions. Changes include:

- Properly formatted installation steps using uv, including setuptools and flash attention.
- Structured and clear instructions for downloading model weights with direct links.
- Clear and formatted instructions for running the Gradio UI and generating videos from the CLI.
- Detailed section on running with Diffusers, including a sub-section on using a lower-precision variant.
- Enhanced readability and structure in the Model Architecture section.
- Detailed hardware requirements and safety considerations.
- Expanded limitations section.
- Properly formatted BibTeX citation.

Files changed (1)
  1. README.md +69 -59
README.md CHANGED
@@ -19,103 +19,110 @@ Mochi 1 preview is an open state-of-the-art video generation model with high-fid
19
 
20
  ## Installation
21
 
22
- Clone the repository and install it in editable mode:
23
-
24
  Install using [uv](https://github.com/astral-sh/uv):
25
 
26
  ```bash
27
  git clone https://github.com/genmoai/models
28
- cd models
29
  pip install uv
30
  uv venv .venv
31
  source .venv/bin/activate
32
- uv pip install -e .
33
  ```
34
 
35
  ## Download Weights
36
 
37
- Download the weights from [Hugging Face](https://huggingface.co/genmo/mochi-1-preview/tree/main) or via `magnet:?xt=urn:btih:441da1af7a16bcaa4f556964f8028d7113d21cbb&dn=weights&tr=udp://tracker.opentrackr.org:1337/announce`.
38
 
39
  ## Running
40
 
41
  Start the gradio UI with
42
 
43
  ```bash
44
- python3 -m mochi_preview.gradio_ui --model_dir "<path_to_model_directory>"
45
  ```
46
 
47
  Or generate videos directly from the CLI with
48
 
49
  ```bash
50
- python3 -m mochi_preview.infer --prompt "A hand with delicate fingers picks up a bright yellow lemon from a wooden bowl filled with lemons and sprigs of mint against a peach-colored background. The hand gently tosses the lemon up and catches it, showcasing its smooth texture. A beige string bag sits beside the bowl, adding a rustic touch to the scene. Additional lemons, one halved, are scattered around the base of the bowl. The even lighting enhances the vibrant colors and creates a fresh, inviting atmosphere." --seed 1710977262 --cfg_scale 4.5 --model_dir "<path_to_model_directory>"
51
  ```
52
 
53
- Replace `<path_to_model_directory>` with the path to your model directory.
54
-
55
- ## Running with Diffusers
56
-
57
- Install the latest version of Diffusers
58
-
59
- ```shell
60
- pip install git+https://github.com/huggingface/diffusers.git
61
- ```
62
-
63
- The following example requires 42GB VRAM but ensures the highest quality output.
64
-
65
- ```python
66
- import torch
67
- from diffusers import MochiPipeline
68
- from diffusers.utils import export_to_video
69
-
70
- pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview")
71
-
72
- # Enable memory savings
73
- pipe.enable_model_cpu_offload()
74
- pipe.enable_vae_tiling()
75
-
76
- prompt = "Close-up of a chameleon's eye, with its scaly skin changing color. Ultra high resolution 4k."
77
-
78
- with torch.autocast("cuda", torch.bfloat16, cache_enabled=False):
79
-     frames = pipe(prompt, num_frames=84).frames[0]
80
 
81
- export_to_video(frames, "mochi.mp4", fps=30)
82
- ```
83
-
84
- ### Using a lower precision variant to save memory
85
 
86
- The following example will use the `bfloat16` variant of the model and requires 22GB VRAM to run. There is a slight drop in the quality of the generated video as a result.
87
 
88
  ```python
89
- import torch
90
- from diffusers import MochiPipeline
91
- from diffusers.utils import export_to_video
92
-
93
- pipe = MochiPipeline.from_pretrained("genmo/mochi-1-preview", variant="bf16", torch_dtype=torch.bfloat16)
94
-
95
- # Enable memory savings
96
- pipe.enable_model_cpu_offload()
97
- pipe.enable_vae_tiling()
98
-
99
- prompt = "Close-up of a chameleon's eye, with its scaly skin changing color. Ultra high resolution 4k."
100
- frames = pipe(prompt, num_frames=84).frames[0]
101
-
102
- export_to_video(frames, "mochi.mp4", fps=30)
103
  ```
104
 
105
- To learn more check out the [Diffusers](https://huggingface.co/docs/diffusers/main/en/api/pipelines/mochi) documentation
106
-
107
  ## Model Architecture
108
 
109
- Mochi 1 represents a significant advancement in open-source video generation, featuring a 10 billion parameter diffusion model built on our novel Asymmetric Diffusion Transformer (AsymmDiT) architecture. Trained entirely from scratch, it is the largest video generative model ever openly released. And best of all, it’s a simple, hackable architecture.
110
 
111
- Alongside Mochi, we are open-sourcing our video VAE. Our VAE causally compresses videos to a 96x smaller size, with an 8x8 spatial and a 6x temporal compression to a 12-channel latent space.
112
 
113
  An AsymmDiT efficiently processes user prompts alongside compressed video tokens by streamlining text processing and focusing neural network capacity on visual reasoning. AsymmDiT jointly attends to text and visual tokens with multi-modal self-attention and learns separate MLP layers for each modality, similar to Stable Diffusion 3. However, our visual stream has nearly 4 times as many parameters as the text stream via a larger hidden dimension. To unify the modalities in self-attention, we use non-square QKV and output projection layers. This asymmetric design reduces inference memory requirements.
114
  Many modern diffusion models use multiple pretrained language models to represent user prompts. In contrast, Mochi 1 simply encodes prompts with a single T5-XXL language model.
115
 
116
- ## Hardware Requirements
117
 
118
- Mochi 1 supports a variety of hardware platforms depending on quantization level, ranging from a single 3090 GPU up to multiple H100 GPUs.
119
 
120
  ## Safety
121
  Genmo video models are general text-to-video diffusion models that inherently reflect the biases and preconceptions found in their training data. While steps have been taken to limit NSFW content, organizations should implement additional safety protocols and careful consideration before deploying these model weights in any commercial services or products.
@@ -123,6 +130,9 @@ Genmo video models are general text-to-video diffusion models that inherently re
123
  ## Limitations
124
  Under the research preview, Mochi 1 is a living and evolving checkpoint. There are a few known limitations. The initial release generates videos at 480p today. In some edge cases with extreme motion, minor warping and distortions can also occur. Mochi 1 is also optimized for photorealistic styles so does not perform well with animated content. We also anticipate that the community will fine-tune the model to suit various aesthetic preferences.
125
 
126
 
127
  ## BibTeX
128
  ```
 
19
 
20
  ## Installation
21
 
22
  Install using [uv](https://github.com/astral-sh/uv):
23
 
24
  ```bash
25
  git clone https://github.com/genmoai/models
26
+ cd models
27
  pip install uv
28
  uv venv .venv
29
  source .venv/bin/activate
30
+ uv pip install setuptools
31
+ uv pip install -e . --no-build-isolation
32
+ ```
33
+
34
+ If you want to install flash attention, you can use:
35
+ ```
36
+ uv pip install -e .[flash] --no-build-isolation
37
  ```
38
 
39
+ You will also need to install [FFMPEG](https://www.ffmpeg.org/) to turn your outputs into videos.
40
+
41
  ## Download Weights
42
 
43
+ Use [download_weights.py](scripts/download_weights.py) to download the model + decoder to a local directory. Use it like this:
44
+ ```
45
+ python3 ./scripts/download_weights.py <path_to_downloaded_directory>
46
+ ```
47
+
48
+ Or, directly download the weights from [Hugging Face](https://huggingface.co/genmo/mochi-1-preview/tree/main) or via `magnet:?xt=urn:btih:441da1af7a16bcaa4f556964f8028d7113d21cbb&dn=weights&tr=udp://tracker.opentrackr.org:1337/announce` to a folder on your computer.
49
 
50
  ## Running
51
 
52
  Start the gradio UI with
53
 
54
  ```bash
55
+ python3 ./demos/gradio_ui.py --model_dir "<path_to_downloaded_directory>"
56
  ```
57
 
58
  Or generate videos directly from the CLI with
59
 
60
  ```bash
61
+ python3 ./demos/cli.py --model_dir "<path_to_downloaded_directory>"
62
  ```
63
 
64
+ Replace `<path_to_downloaded_directory>` with the path to your model directory.
65
 
66
+ ## API
67
 
68
+ This repository comes with a simple, composable API, so you can programmatically call the model. You can find a full example [here](demos/api_example.py). But, roughly, it looks like this:
69
 
70
  ```python
71
+ from genmo.mochi_preview.pipelines import (
72
+     DecoderModelFactory,
73
+     DitModelFactory,
74
+     MochiSingleGPUPipeline,
75
+     T5ModelFactory,
76
+     linear_quadratic_schedule,
77
+ )
78
+
79
+ pipeline = MochiSingleGPUPipeline(
80
+     text_encoder_factory=T5ModelFactory(),
81
+     dit_factory=DitModelFactory(
82
+         model_path=f"{MOCHI_DIR}/dit.safetensors", model_dtype="bf16"
83
+     ),
84
+     decoder_factory=DecoderModelFactory(
85
+         model_path=f"{MOCHI_DIR}/vae.safetensors",
86
+     ),
87
+     cpu_offload=True,
88
+     decode_type="tiled_full",
89
+ )
90
+
91
+ video = pipeline(
92
+     height=480,
93
+     width=848,
94
+     num_frames=31,
95
+     num_inference_steps=64,
96
+     sigma_schedule=linear_quadratic_schedule(64, 0.025),
97
+     cfg_schedule=[4.5] * 64,
98
+     batch_cfg=False,
99
+     prompt="your favorite prompt here ...",
100
+     negative_prompt="",
101
+     seed=12345,
102
+ )
103
  ```
104
 
105
  ## Model Architecture
106
 
107
+ Mochi 1 represents a significant advancement in open-source video generation, featuring a 10 billion parameter diffusion model built on our novel Asymmetric Diffusion Transformer (AsymmDiT) architecture. Trained entirely from scratch, it is the largest video generative model ever openly released. And best of all, it’s a simple, hackable architecture. Additionally, we are releasing an inference harness that includes an efficient context parallel implementation.
108
+
109
+ Alongside Mochi, we are open-sourcing our video AsymmVAE. We use an asymmetric encoder-decoder structure to build an efficient high quality compression model. Our AsymmVAE causally compresses videos to a 128x smaller size, with an 8x8 spatial and a 6x temporal compression to a 12-channel latent space.
110
 
111
+ ### AsymmVAE Model Specs
112
+ |Params <br> Count | Enc Base <br> Channels | Dec Base <br> Channels |Latent <br> Dim | Spatial <br> Compression | Temporal <br> Compression |
113
+ |:--:|:--:|:--:|:--:|:--:|:--:|
114
+ |362M | 64 | 128 | 12 | 8x8 | 6x |
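
As a rough sanity check on the compression figures in the specs above, the sketch below counts raw tensor elements before and after encoding. It assumes 3-channel RGB input and equal storage per element; neither assumption is stated in the README, and the helper name `element_compression_ratio` is hypothetical. Under these assumptions the element-count ratio comes out at 96, so the 128x figure quoted in the text presumably relies on a different accounting that this sketch does not try to reproduce.

```python
def element_compression_ratio(spatial=(8, 8), temporal=6,
                              in_channels=3, latent_channels=12):
    """Ratio of input elements to latent elements (hypothetical helper)."""
    # One latent position covers spatial[0] x spatial[1] pixels
    # across `temporal` input frames.
    positions_per_latent = spatial[0] * spatial[1] * temporal
    # Assumption: each input position holds in_channels values (RGB),
    # and each latent position holds latent_channels values.
    return positions_per_latent * in_channels / latent_channels


ratio = element_compression_ratio()
print(f"element-count compression ratio: {ratio:g}x")
```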
115
 
116
  An AsymmDiT efficiently processes user prompts alongside compressed video tokens by streamlining text processing and focusing neural network capacity on visual reasoning. AsymmDiT jointly attends to text and visual tokens with multi-modal self-attention and learns separate MLP layers for each modality, similar to Stable Diffusion 3. However, our visual stream has nearly 4 times as many parameters as the text stream via a larger hidden dimension. To unify the modalities in self-attention, we use non-square QKV and output projection layers. This asymmetric design reduces inference memory requirements.
117
  Many modern diffusion models use multiple pretrained language models to represent user prompts. In contrast, Mochi 1 simply encodes prompts with a single T5-XXL language model.
118
 
119
+ ### AsymmDiT Model Specs
120
+ |Params <br> Count | Num <br> Layers | Num <br> Heads | Visual <br> Dim | Text <br> Dim | Visual <br> Tokens | Text <br> Tokens |
121
+ |:--:|:--:|:--:|:--:|:--:|:--:|:--:|
122
+ |10B | 48 | 24 | 3072 | 1536 | 44520 | 256 |
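
The 44520 visual tokens in the table can be reproduced from the compression factors under a few assumptions that the README does not state: a 480x848 output (the resolution used in the API example), 163 frames, a causal temporal layout in which the first frame maps to its own latent frame, and 2x2 spatial patches in the DiT. The function name is hypothetical, and this is a sanity-check sketch rather than the repository's code.

```python
def visual_token_count(height=480, width=848, num_frames=163,
                       spatial=8, temporal=6, patch=2):
    """Estimate the DiT visual token count (hypothetical helper)."""
    # 8x8 spatial compression from the AsymmVAE specs.
    latent_h = height // spatial
    latent_w = width // spatial
    # Assumption: causal 6x temporal compression keeps the first frame
    # as its own latent frame, then one latent frame per 6 input frames.
    latent_t = (num_frames - 1) // temporal + 1
    # Assumption: the DiT groups latents into patch x patch spatial patches.
    return latent_t * (latent_h // patch) * (latent_w // patch)


print(visual_token_count())  # 44520 under the assumptions above
```

With `num_frames=31`, as in the API example, the same function gives 9540 tokens.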
123
 
124
+ ## Hardware Requirements
125
+ The repository supports both multi-GPU operation (splitting the model across multiple graphics cards) and single-GPU operation, though it requires approximately 60GB VRAM when running on a single GPU. While ComfyUI can optimize Mochi to run on less than 20GB VRAM, this implementation prioritizes flexibility over memory efficiency. When using this repository, we recommend using at least 1 H100 GPU.
126
 
127
  ## Safety
128
  Genmo video models are general text-to-video diffusion models that inherently reflect the biases and preconceptions found in their training data. While steps have been taken to limit NSFW content, organizations should implement additional safety protocols and careful consideration before deploying these model weights in any commercial services or products.
 
130
  ## Limitations
131
  Under the research preview, Mochi 1 is a living and evolving checkpoint. There are a few known limitations. The initial release generates videos at 480p today. In some edge cases with extreme motion, minor warping and distortions can also occur. Mochi 1 is also optimized for photorealistic styles so does not perform well with animated content. We also anticipate that the community will fine-tune the model to suit various aesthetic preferences.
132
 
133
+ ## Related Work
134
+ - [ComfyUI-MochiWrapper](https://github.com/kijai/ComfyUI-MochiWrapper) adds ComfyUI support for Mochi. The integration of Pytorch's SDPA attention was taken from their repository.
135
+
136
 
137
  ## BibTeX
138
  ```