---
title: Stable Text To Motion Framework
emoji: π
colorFrom: pink
colorTo: green
sdk: gradio
sdk_version: 4.28.3
app_file: app.py
pinned: false
license: mit
---
# SATO: Stable Text-to-Motion Framework
[Wenshuo Chen*](https://github.com/shurdy123), [Hongru Xiao*](https://github.com/Hongru0306), [Erhang Zhang*](https://github.com/zhangerhang), [Lijie Hu](https://sites.google.com/view/lijiehu/homepage), [Lei Wang](https://leiwangr.github.io/), [Mengyuan Liu](https://www.semanticscholar.org/author/Mengyuan-Liu/47842072), [Chen Chen](https://www.crcv.ucf.edu/chenchen/)
[![Website shields.io](https://img.shields.io/website?url=http%3A//poco.is.tue.mpg.de)](https://sato-team.github.io/Stable-Text-to-Motion-Framework/) [![YouTube Badge](https://img.shields.io/badge/YouTube-Watch-red?style=flat-square&logo=youtube)](https://youtu.be/qqGhV3Flmus) [![arXiv](https://img.shields.io/badge/arXiv-2308.12965-00ff00.svg)]()
## Existing Challenges
A fundamental challenge in text-to-motion tasks stems from the variability of textual inputs. Even when conveying the same or similar meanings and intentions, texts can vary considerably in vocabulary and structure due to individual user preferences or linguistic nuances. Despite the considerable advances in text-to-motion models, we find a notable weakness: all of them produce unstable predictions when encountering minor textual perturbations, such as synonym substitutions. The following demonstration shows how the predictions of previous methods change when they are given different user inputs that convey identical semantic meaning.
<!-- <div style="display:flex;">
<img src="assets/run_lola.gif" width="45%" style="margin-right: 1%;">
<img src="assets/yt_solo.gif" width="45%">
</div> -->
<p align="center">
<table align="center">
<tr>
<th colspan="4">Original text: A man kicks something or someone with his left leg.</th>
</tr>
<tr>
<th align="center"><u><a href="https://github.com/Mael-zys/T2M-GPT"><nobr>T2M-GPT</nobr> </a></u></th>
<th align="center"><u><a href="https://guytevet.github.io/mdm-page/"><nobr>MDM</nobr> </a></u></th>
<th align="center"><u><a href="https://github.com/EricGuo5513/momask-codes"><nobr>MoMask</nobr> </a></u></th>
</tr>
<tr>
<td width="250" align="center"><img src="images/example/kick/gpt.gif" width="150px" height="150px" alt="gif"></td>
<td width="250" align="center"><img src="images/example/kick/mdm.gif" width="150px" height="150px" alt="gif"></td>
<td width="250" align="center"><img src="images/example/kick/momask.gif" width="150px" height="150px" alt="gif"></td>
</tr>
<tr>
<th colspan="4">Perturbed text: A human boots something or someone with his left leg.</th>
</tr>
<tr>
<th align="center"><u><a href="https://github.com/Mael-zys/T2M-GPT"><nobr>T2M-GPT</nobr> </a></u></th>
<th align="center"><u><a href="https://guytevet.github.io/mdm-page/"><nobr>MDM</nobr> </a></u></th>
<th align="center"><u><a href="https://github.com/EricGuo5513/momask-codes"><nobr>MoMask</nobr> </a></u></th>
</tr>
<tr>
<td width="250" align="center"><img src="images/example/boot/gpt.gif" width="150px" height="150px" alt="gif"></td>
<td width="250" align="center"><img src="images/example/boot/mdm.gif" width="150px" height="150px" alt="gif"></td>
<td width="250" align="center"><img src="images/example/boot/momask.gif" width="150px" height="150px" alt="gif"></td>
</tr>
</table>
</p>
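The perturbation in the table above is a word-level synonym substitution. The snippet below is a minimal sketch of that idea, assuming a tiny hand-written synonym table (`SYNONYMS` and `perturb` are hypothetical names introduced here for illustration). It is not the procedure the authors used to build the perturbed dataset.

```python
import re

# Hypothetical synonym table built from the example in this README;
# the released perturbed dataset was produced by the authors' own procedure.
SYNONYMS = {"man": "human", "kicks": "boots"}


def perturb(text, synonyms=SYNONYMS):
    """Replace whole words found in `synonyms`, leaving everything else untouched."""

    def swap(match):
        word = match.group(0)
        return synonyms.get(word.lower(), word)

    # Match runs of letters so punctuation and spacing are preserved.
    return re.sub(r"[A-Za-z]+", swap, text)


original = "A man kicks something or someone with his left leg."
print(perturb(original))
# A human boots something or someone with his left leg.
```

Both sentences describe the same motion, so a stable model should generate essentially the same motion for each.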
## Motivation
![motivation](images/motivation.png)
The model's inconsistent outputs are accompanied by unstable attention patterns. When perturbed text is given as input, the model exhibits unstable attention, often neglecting critical text elements necessary for accurate motion prediction. This instability further complicates the encoding of text into consistent embeddings, leading to a cascade of errors in the subsequent temporal motion generation.
## Our Approach
<p align="center">
<img src="images/framework.png" alt="Approach Image">
</p>
**Attention Stability**. For the original text input, we can directly observe the model's attention vector over the text. This vector reflects how the model ranks the words of the text, indicating the importance of each word to the text encoder's prediction. A stable attention vector should maintain a consistent ranking even after perturbations.
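One simple way to quantify whether the attention ranking is preserved is the overlap between the top-k attended token positions before and after perturbation. The sketch below assumes both attention vectors cover the same number of tokens (as with one-for-one synonym substitution) and uses made-up attention weights. The function names are hypothetical, and this is an illustration of the idea rather than the loss used in SATO.

```python
def top_k_indices(attention, k):
    """Indices of the k largest attention weights (ties broken by position)."""
    order = sorted(range(len(attention)), key=lambda i: (-attention[i], i))
    return set(order[:k])


def top_k_overlap(attn_original, attn_perturbed, k):
    """Fraction of the top-k attended token positions shared by both vectors."""
    if len(attn_original) != len(attn_perturbed):
        raise ValueError("attention vectors must cover the same number of tokens")
    if not 0 < k <= len(attn_original):
        raise ValueError("k must be between 1 and the number of tokens")
    shared = top_k_indices(attn_original, k) & top_k_indices(attn_perturbed, k)
    return len(shared) / k


# Hypothetical attention weights over five token positions.
original = [0.05, 0.40, 0.30, 0.15, 0.10]
stable = [0.06, 0.38, 0.31, 0.14, 0.11]    # same ranking, slightly different values
unstable = [0.35, 0.10, 0.05, 0.30, 0.20]  # attention moved to other tokens

print(top_k_overlap(original, stable, k=2))    # 1.0
print(top_k_overlap(original, unstable, k=2))  # 0.0
```

An overlap of 1.0 means the most attended tokens are unchanged, while 0.0 means the model now focuses on entirely different words.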
**Prediction Robustness**. Stable attention alone does not guarantee stable results, because the text embeddings still change under perturbation even when the attention vectors are similar. We therefore impose a further constraint on the model's predictions: under perturbation, the model's prediction should remain consistent with the original distribution, meaning the output should be robust to perturbations.
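Consistency between the original and perturbed predictions can be measured with a divergence between the two output distributions. The sketch below assumes the model emits a categorical distribution (for example, over discrete motion tokens) and uses the Jensen-Shannon divergence as one possible choice. The choice of divergence and the example probabilities are assumptions for illustration, not details taken from the paper.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and finite even when q has zeros."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of outcomes")
    # The mixture m is positive wherever p or q is, so the KL terms are well defined.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical predicted distributions over four motion tokens.
pred_original = [0.70, 0.20, 0.05, 0.05]
pred_robust = [0.68, 0.22, 0.05, 0.05]    # barely moves under perturbation
pred_fragile = [0.05, 0.05, 0.20, 0.70]   # shifts to different tokens

print(js_divergence(pred_original, pred_robust))
print(js_divergence(pred_original, pred_fragile))
```

A robust model keeps this value close to zero for perturbed inputs, while a fragile one produces a much larger value.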
**Balancing Accuracy and Robustness Trade-off**. Accuracy and robustness naturally trade off against each other. Our objective is to improve stability while minimizing the loss in model accuracy, thereby mitigating catastrophic errors arising from input perturbations. Consequently, we require a mechanism that preserves the model's performance on the original input.
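The three goals above could be combined into a single weighted training objective. The form below is only an illustrative sketch: the symbols $x$ (original text) and $x'$ (perturbed text), the individual loss terms, and the weights $\lambda_1, \lambda_2$ are assumptions made for exposition and are not taken from the paper.

```latex
\mathcal{L}_{\text{total}}
  = \underbrace{\mathcal{L}_{\text{task}}(x)}_{\text{accuracy on the original input}}
  + \lambda_{1}\,\underbrace{\mathcal{L}_{\text{attn}}(x, x')}_{\text{attention stability}}
  + \lambda_{2}\,\underbrace{\mathcal{L}_{\text{pred}}(x, x')}_{\text{prediction robustness}}
```

Under this reading, larger $\lambda_1$ and $\lambda_2$ favor stability, while the first term keeps the model anchored to its performance on unperturbed text.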
## Quantitative Evaluation on HumanML3D and KIT-ML
![eval](images/table.png)
## Visualization
<p align="center">
<table align="center">
<tr>
<th colspan="4">Original text: person is walking normally in a circle.</th>
</tr>
<tr>
<th align="center"><u><a href="https://github.com/Mael-zys/T2M-GPT"><nobr>T2M-GPT</nobr> </a></u></th>
<th align="center"><u><a href="https://guytevet.github.io/mdm-page/"><nobr>MDM</nobr> </a></u></th>
<th align="center"><u><a href="https://github.com/EricGuo5513/momask-codes"><nobr>MoMask</nobr> </a></u></th>
<th align="center"><nobr>SATO</nobr> </th>
</tr>
<tr>
<td width="250" align="center"><img src="images/visualization/circle/gpt.gif" width="150px" height="150px" alt="gif"></td>
<td width="250" align="center"><img src="images/visualization/circle/mdm.gif" width="150px" height="150px" alt="gif"></td>
<td width="250" align="center"><img src="images/visualization/circle/momask.gif" width="150px" height="150px" alt="gif"></td>
<td width="250" align="center"><img src="images/visualization/circle/sato.gif" width="150px" height="150px" alt="gif"></td>
</tr>
<tr>
<th colspan="4" >Perturbed text: <span style="color: red;">human</span> is walking <span style="color: red;">usually</span> in a <span style="color: red;">loop</span>.</th>
</tr>
<tr>
<th align="center"><u><a href="https://github.com/Mael-zys/T2M-GPT"><nobr>T2M-GPT</nobr> </a></u></th>
<th align="center"><u><a href="https://guytevet.github.io/mdm-page/"><nobr>MDM</nobr> </a></u></th>
<th align="center"><u><a href="https://github.com/EricGuo5513/momask-codes"><nobr>MoMask</nobr> </a></u></th>
<th align="center"><nobr>SATO</nobr> </th>
</tr>
<tr>
<td width="250" align="center"><img src="images/visualization/loop/gpt.gif" width="150px" height="150px" alt="gif"></td>
<td width="250" align="center"><img src="images/visualization/loop/mdm.gif" width="150px" height="150px" alt="gif"></td>
<td width="250" align="center"><img src="images/visualization/loop/momask.gif" width="150px" height="150px" alt="gif"></td>
<td width="250" align="center"><img src="images/visualization/loop/sato.gif" width="150px" height="150px" alt="gif"></td>
</tr>
</table>
</p>
<center>
<h3>
<p style="color: blue;">Explanation: T2M-GPT, MDM, and MoMask all fail to walk in a loop.</p>
</h3>
</center>
<p align="center">
<table align="center">
<tr>
<th colspan="4">Original text: a person uses his right arm to help himself to stand up.</th>
</tr>
<tr>
<th align="center"><u><a href="https://github.com/Mael-zys/T2M-GPT"><nobr>T2M-GPT</nobr> </a></u></th>
<th align="center"><u><a href="https://guytevet.github.io/mdm-page/"><nobr>MDM</nobr> </a></u></th>
<th align="center"><u><a href="https://github.com/EricGuo5513/momask-codes"><nobr>MoMask</nobr> </a></u></th>
<th align="center"><nobr>SATO</nobr> </th>
</tr>
<tr>
<td width="250" align="center"><img src="images/visualization/use/gpt.gif" width="150px" height="150px" alt="gif"></td>
<td width="250" align="center"><img src="images/visualization/use/mdm.gif" width="150px" height="150px" alt="gif"></td>
<td width="250" align="center"><img src="images/visualization/use/momask.gif" width="150px" height="150px" alt="gif"></td>
<td width="250" align="center"><img src="images/visualization/use/sato.gif" width="150px" height="150px" alt="gif"></td>
</tr>
<tr>
<th colspan="4" >Perturbed text: A human <span style="color: red;">utilizes</span> his right arm to help himself to stand up.</th>
</tr>
<tr>
<th align="center"><u><a href="https://github.com/Mael-zys/T2M-GPT"><nobr>T2M-GPT</nobr> </a></u></th>
<th align="center"><u><a href="https://guytevet.github.io/mdm-page/"><nobr>MDM</nobr> </a></u></th>
<th align="center"><u><a href="https://github.com/EricGuo5513/momask-codes"><nobr>MoMask</nobr> </a></u></th>
<th align="center"><nobr>SATO</nobr> </th>
</tr>
<tr>
<td width="250" align="center"><img src="images/visualization/utilize/gpt.gif" width="150px" height="150px" alt="gif"></td>
<td width="250" align="center"><img src="images/visualization/utilize/mdm.gif" width="150px" height="150px" alt="gif"></td>
<td width="250" align="center"><img src="images/visualization/utilize/momask.gif" width="150px" height="150px" alt="gif"></td>
<td width="250" align="center"><img src="images/visualization/utilize/sato.gif" width="150px" height="150px" alt="gif"></td>
</tr>
</table>
</p>
<center>
<h3>
<p style="color: blue;">Explanation: T2M-GPT, MDM, and MoMask all lack the action of transitioning from squatting to standing up, resulting in a catastrophic error.</p>
</h3>
</center>
## How to Use the Code
* [1. Setup and Installation](#setup)
* [2. Dependencies](#Dependencies)
* [3. Quick Start](#quickstart)
* [4. Datasets](#datasets)
* [5. Train](#train)
* [6. Evaluation](#eval)
* [7. Acknowledgements](#acknowledgements)
## Setup and Installation <a name="setup"></a>
Clone the repository:
```shell
git clone https://github.com/sato-team/Stable-Text-to-motion-Framework.git
```
Create a fresh conda environment and install all the dependencies:
```shell
conda env create -f environment.yml
conda activate SATO
```
The code was tested on Python 3.8 and PyTorch 1.8.1.
## Dependencies<a name="Dependencies"></a>
Run the following scripts to download the extractor and GloVe files:
```shell
bash dataset/prepare/download_extractor.sh
bash dataset/prepare/download_glove.sh
```
## **Quick Start**<a name="quickstart"></a>
A quick reference guide for using our code is provided in `quickstart.ipynb`.
## Datasets<a name="datasets"></a>
We use two 3D human motion-language datasets: HumanML3D and KIT-ML. Details and download instructions for both datasets are available at this [link](https://github.com/EricGuo5513/HumanML3D).
We perturbed the input texts based on the two datasets mentioned. You can access the perturbed text dataset through the following [link](https://drive.google.com/file/d/1XLvu2jfG1YKyujdANhYHV_NfFTyOJPvP/view?usp=sharing).
Taking HumanML3D as an example, the dataset structure should look like this:
```
./dataset/HumanML3D/
├── new_joint_vecs/
├── texts/ # You need to replace the 'texts' folder in the original dataset with the 'texts' folder from our dataset.
├── Mean.npy
├── Std.npy
├── train.txt
├── val.txt
├── test.txt
├── train_val.txt
└── all.txt
```
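Before running evaluation, it can help to confirm that the dataset folder matches the layout above. The following is a small optional sketch, not part of the repository: `missing_entries` is a hypothetical helper that only checks that the listed folders and files exist, and it does not validate their contents.

```python
from pathlib import Path

# Entries taken from the directory listing above.
EXPECTED_DIRS = ["new_joint_vecs", "texts"]
EXPECTED_FILES = [
    "Mean.npy",
    "Std.npy",
    "train.txt",
    "val.txt",
    "test.txt",
    "train_val.txt",
    "all.txt",
]


def missing_entries(root):
    """Return the expected folders and files that are absent under `root`."""
    root = Path(root)
    missing = [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    missing += [name for name in EXPECTED_FILES if not (root / name).is_file()]
    return missing


problems = missing_entries("./dataset/HumanML3D")
print("Dataset layout looks complete." if not problems else f"Missing: {problems}")
```

Note that this check cannot tell whether `texts/` has already been replaced with the perturbed version, so that step still needs to be verified manually.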
## **Train**<a name="train"></a>
We will release the training code soon.
## **Evaluation**<a name="eval"></a>
You can download the pretrained models from this [link](https://drive.google.com/drive/folders/1rs8QPJ3UPzLW4H3vWAAX9hJn4ln7m_ts?usp=sharing).
```shell
python eval_t2m.py --resume-pth pretrained/net_best_fid.pth --clip_path pretrained/clip_best_fid.pth
```
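The command above expects both checkpoints to be present under `pretrained/`. As an optional convenience, the sketch below checks for them and assembles the same command as an argument list. `build_eval_command` is a hypothetical helper that is not part of the repository, and it uses only the script name, flags, and paths shown in the command above.

```python
import sys
from pathlib import Path

# Flags and checkpoint paths exactly as they appear in the command above.
CHECKPOINTS = {
    "--resume-pth": "pretrained/net_best_fid.pth",
    "--clip_path": "pretrained/clip_best_fid.pth",
}


def build_eval_command(checkpoints=CHECKPOINTS):
    """Return the argv list for eval_t2m.py, or raise if a checkpoint is missing."""
    missing = [path for path in checkpoints.values() if not Path(path).is_file()]
    if missing:
        raise FileNotFoundError(f"Download the pretrained models first; missing: {missing}")
    command = [sys.executable, "eval_t2m.py"]
    for flag, path in checkpoints.items():
        command += [flag, path]
    return command


try:
    print(" ".join(build_eval_command()))
except FileNotFoundError as err:
    print(err)
```

The returned list can be passed to `subprocess.run(command, check=True)` from the repository root to launch the evaluation.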
## Acknowledgements<a name="acknowledgements"></a>
We appreciate help from:
- Open Source Code: [T2M-GPT](https://github.com/Mael-zys/T2M-GPT), [MoMask](https://github.com/EricGuo5513/momask-codes), [MDM](https://guytevet.github.io/mdm-page/), etc.
- [Hongru Xiao](https://github.com/Hongru0306), [Erhang Zhang](https://github.com/zhangerhang), [Lijie Hu](https://sites.google.com/view/lijiehu/homepage), [Lei Wang](https://leiwangr.github.io/), [Mengyuan Liu](https://www.semanticscholar.org/author/Mengyuan-Liu/47842072), [Chen Chen](https://www.crcv.ucf.edu/chenchen/) for discussions and guidance throughout the project, which have been instrumental to our work.
- [Zhen Zhao](https://github.com/Zanebla) for the project website.
- If you find our work helpful, we would appreciate it if you could give our project a star!
## Citing<a name="citing"></a>
If you find this code useful for your research, please consider citing the following paper:
```bibtex
```