---

license: apache-2.0
language:
- en
inference: false
---

# VPO: Aligning Text-to-Video Generation Models with Prompt Optimization

- **Repository:** https://github.com/thu-coai/VPO
<!-- - **Paper:**  -->
- **Data:** https://huggingface.co/datasets/CCCCCC/VPO

# VPO
VPO is a prompt optimization framework grounded in three principles: harmlessness, accuracy, and helpfulness.
VPO employs a two-stage process: it first constructs a supervised fine-tuning dataset guided by safety and alignment criteria, and then performs preference learning with both text-level and video-level feedback. As a result, VPO preserves user intent while improving video quality and safety.
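
The full training implementation lives in the repository. Purely as background, the sketch below shows a standard direct preference optimization (DPO) loss, one common way to implement a preference-learning stage; whether VPO uses this exact objective is an assumption here, and the chosen/rejected pairs would be ranked using the text-level and video-level feedback described above.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Standard DPO objective (illustrative, not necessarily VPO's exact loss):
    # push the policy to prefer the "chosen" optimized prompt over the
    # "rejected" one, measured relative to a frozen reference model.
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```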

## Model Details

### Video Generation Model
This model is trained to optimize user prompts for CogVideoX-2B. For CogVideoX-5B, use [VPO-5B](https://huggingface.co/CCCCCC/VPO-5B).
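
Once a user prompt has been optimized, the result can be passed to CogVideoX-2B. The snippet below is a minimal sketch using the `CogVideoXPipeline` from `diffusers`; the model id `THUDM/CogVideoX-2b` and the sampling parameters follow the standard diffusers examples and are not prescribed by this card.

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load CogVideoX-2B in half precision.
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
pipe.to("cuda")

# `optimized_prompt` stands in for the output of this VPO model.
optimized_prompt = "A cute golden retriever puppy bounds across a sunlit grassy meadow..."
video = pipe(prompt=optimized_prompt, num_frames=49, guidance_scale=6.0,
             num_inference_steps=50).frames[0]
export_to_video(video, "output.mp4", fps=8)
```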

### Data
Our dataset can be found [here](https://huggingface.co/datasets/CCCCCC/VPO).

### Language
English

## Intended Use

### Prompt Template 
We use the following prompt template:
```
In this task, your goal is to expand the user's short query into a detailed and well-structured English prompt for generating short videos.

Please ensure that the generated video prompt adheres to the following principles:

1. **Harmless**: The prompt must be safe, respectful, and free from any harmful, offensive, or unethical content.
2. **Aligned**: The prompt should fully preserve the user's intent, incorporating all relevant details from the original query while ensuring clarity and coherence.
3. **Helpful for High-Quality Video Generation**: The prompt should be descriptive and vivid to facilitate high-quality video creation. Keep the scene feasible and well-suited for a brief duration, avoiding unnecessary complexity or unrealistic elements not mentioned in the query.

User Query:{user prompt}

Video Prompt:
```

### Inference Code
Here is example code for inference:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = ''  # path or Hugging Face id of this model

prompt_template = """In this task, your goal is to expand the user's short query into a detailed and well-structured English prompt for generating short videos.

Please ensure that the generated video prompt adheres to the following principles:

1. **Harmless**: The prompt must be safe, respectful, and free from any harmful, offensive, or unethical content.
2. **Aligned**: The prompt should fully preserve the user's intent, incorporating all relevant details from the original query while ensuring clarity and coherence.
3. **Helpful for High-Quality Video Generation**: The prompt should be descriptive and vivid to facilitate high-quality video creation. Keep the scene feasible and well-suited for a brief duration, avoiding unnecessary complexity or unrealistic elements not mentioned in the query.

User Query:{}

Video Prompt:"""

device = 'cuda:0'

# Load the model in half precision; use the commented line instead for
# 8-bit quantization (requires bitsandbytes).
model = AutoModelForCausalLM.from_pretrained(model_path).half().eval().to(device)
# model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device, load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Fill the user's query into the template and build a chat-formatted input.
text = "a cute dog on the grass"
messages = [{'role': 'user', 'content': prompt_template.format(text)}]
model_inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_tensors="pt").to(device)

# Sample an optimized video prompt and strip the chat markup from the output.
output = model.generate(model_inputs, max_new_tokens=1024, do_sample=True, top_p=1.0, temperature=0.7, num_beams=1)
resp = tokenizer.decode(output[0]).split('<|start_header_id|>assistant<|end_header_id|>')[1].split('<|eot_id|>')[0].strip()

print(resp)
```
See our [GitHub repo](https://github.com/thu-coai/VPO) for more detailed usage (e.g., inference with vLLM).
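
As a rough illustration of the vLLM route, here is a minimal sketch; it reuses the `prompt_template` string defined in the example above, and the repository's own scripts are authoritative.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_path = ''  # path or Hugging Face id of this model

# Render the chat prompt as plain text, then let vLLM handle generation.
# `prompt_template` is the template string from the inference example above.
tokenizer = AutoTokenizer.from_pretrained(model_path)
messages = [{'role': 'user', 'content': prompt_template.format("a cute dog on the grass")}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_path)
params = SamplingParams(temperature=0.7, top_p=1.0, max_tokens=1024)
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text.strip())
```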

