# Hulk: A Universal Knowledge Translator for Human-centric Tasks

[Yizhou Wang](https://scholar.google.com/citations?user=CQGaGMAAAAAJ&hl=zh-CN&authuser=1)<sup>1*</sup>, [Yixuan Wu](https://scholar.google.com/citations?user=zjAxJcwAAAAJ&hl=en&oi=ao)<sup>1*,2</sup>, [Shixiang Tang](https://github.com/tangshixiang)<sup>1</sup> :email:, [Weizhen He]()<sup>2,3</sup>, [Xun Guo](https://github.com/Space-Xun)<sup>1,4</sup>, [Feng Zhu](https://zhufengx.github.io/)<sup>3</sup>, [Lei Bai](http://leibai.site/)<sup>1</sup>, [Rui Zhao](http://zhaorui.xyz/)<sup>3</sup>, [Jian Wu]()<sup>2</sup>, [Tong He](http://tonghe90.github.io/)<sup>1</sup>, [Wanli Ouyang](https://wlouyang.github.io/)<sup>1</sup>

<sup>1</sup>[Shanghai AI Lab](https://www.shlab.org.cn/), <sup>2</sup>[ZJU](https://www.zju.edu.cn/), <sup>3</sup>[SenseTime](https://www.sensetime.com), <sup>4</sup>[USTC](https://www.ustc.edu.cn/)

[ArXiv](https://arxiv.org/abs/2312.01697) | [Project Page](https://humancentricmodels.github.io/Hulk/)

[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/hulk-a-universal-knowledge-translator-for/pose-estimation-on-aic)](https://paperswithcode.com/sota/pose-estimation-on-aic?p=hulk-a-universal-knowledge-translator-for)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/hulk-a-universal-knowledge-translator-for/human-part-segmentation-on-cihp)](https://paperswithcode.com/sota/human-part-segmentation-on-cihp?p=hulk-a-universal-knowledge-translator-for)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/hulk-a-universal-knowledge-translator-for/skeleton-based-action-recognition-on-ntu-rgbd)](https://paperswithcode.com/sota/skeleton-based-action-recognition-on-ntu-rgbd?p=hulk-a-universal-knowledge-translator-for)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/hulk-a-universal-knowledge-translator-for/semantic-segmentation-on-lip-val)](https://paperswithcode.com/sota/semantic-segmentation-on-lip-val?p=hulk-a-universal-knowledge-translator-for)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/hulk-a-universal-knowledge-translator-for/human-part-segmentation-on-human3-6m)](https://paperswithcode.com/sota/human-part-segmentation-on-human3-6m?p=hulk-a-universal-knowledge-translator-for)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/hulk-a-universal-knowledge-translator-for/pedestrian-attribute-recognition-on-rapv2)](https://paperswithcode.com/sota/pedestrian-attribute-recognition-on-rapv2?p=hulk-a-universal-knowledge-translator-for)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/hulk-a-universal-knowledge-translator-for/pedestrian-attribute-recognition-on-pa-100k)](https://paperswithcode.com/sota/pedestrian-attribute-recognition-on-pa-100k?p=hulk-a-universal-knowledge-translator-for)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/hulk-a-universal-knowledge-translator-for/pose-estimation-on-coco)](https://paperswithcode.com/sota/pose-estimation-on-coco?p=hulk-a-universal-knowledge-translator-for)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/hulk-a-universal-knowledge-translator-for/object-detection-on-crowdhuman-full-body)](https://paperswithcode.com/sota/object-detection-on-crowdhuman-full-body?p=hulk-a-universal-knowledge-translator-for)

Welcome to **Hulk**! Hulk is a multimodal human-centric generalist model capable of addressing 2D vision, 3D vision, skeleton-based, and vision-language human-centric tasks. Unlike most existing human-centric foundation models, which do not cover 3D or vision-language tasks and require task-specific finetuning, Hulk condenses the various task-specific heads into two general heads: one for discrete representations, e.g., languages, and the other for continuous representations, e.g., location coordinates. This unification allows Hulk to treat diverse human-centric tasks as modality translation, integrating knowledge across a wide range of tasks. For more details, please take a look at our paper [Hulk: A Universal Knowledge Translator for Human-centric Tasks](https://arxiv.org/abs/2312.01697).
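To make the two-head idea concrete, below is a minimal, illustrative PyTorch sketch. All class and argument names (`DiscreteHead`, `ContinuousHead`, `ModalityTranslator`) are hypothetical simplifications and do not mirror the actual tokenizers, de-tokenizers, or encoder-decoder in this repository.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the "two general heads" design described above.
# Names are hypothetical; see the paper and the code in core/ for the real model.

class DiscreteHead(nn.Module):
    """Predicts tokens from a discrete vocabulary (e.g., words, attribute labels)."""
    def __init__(self, dim: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(dim, vocab_size)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)  # logits over the vocabulary


class ContinuousHead(nn.Module):
    """Regresses continuous values (e.g., keypoint coordinates, mesh vertices)."""
    def __init__(self, dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, out_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)  # real-valued outputs


class ModalityTranslator(nn.Module):
    """Shared encoder whose output is routed to one of the two general heads."""
    def __init__(self, dim: int = 768, vocab_size: int = 30522, out_dim: int = 3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.discrete_head = DiscreteHead(dim, vocab_size)
        self.continuous_head = ContinuousHead(dim, out_dim)

    def forward(self, tokens: torch.Tensor, output_modality: str) -> torch.Tensor:
        feats = self.encoder(tokens)
        if output_modality == "discrete":    # e.g., captions, attributes
            return self.discrete_head(feats)
        return self.continuous_head(feats)   # e.g., poses, boxes, meshes
```

The point of the sketch is only that a single routing decision (discrete vs. continuous output) replaces a collection of task-specific heads, which is what lets the tasks be phrased as modality translation.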

## News

- _Apr. 2024_ A pretrained Hulk is released on [🤗 Hugging Face Models](https://huggingface.co/OpenGVLab/Hulk/tree/main)!
- _Apr. 2024_ Project page with demos is released at [Hulk](https://humancentricmodels.github.io/Hulk/).
- _Mar. 2024_ Training and inference code are released!
- _Dec. 2023_ Hulk is released on [ArXiv](https://arxiv.org/abs/2312.01697)!

## Installation

This codebase has been developed with Python 3.9, PyTorch 2.0.0, CUDA 11.8, and torchvision 0.15.0. We recommend using the same versions to avoid potential issues.

```bash
pip install -r requirements.txt
```

Also, download [bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased) from Hugging Face and put it under `experiments/release/`.

## Datasets

Please refer to [datasets](docs/datasets.md) for more details.

## Training

Download the pre-trained MAE weights from [here](https://dl.fbaipublicfiles.com/mae/pretrain/mae_pretrain_vit_base.pth) and put them under `core/models/backbones/pretrain_weights/`. We use 10 nodes (80 A100 GPUs) for training with the following command:

```bash
cd experiments/release
sh train.sh 80 Hulk_vit-B
```

## Evaluation

A pretrained Hulk is available at [🤗 Hugging Face Models](https://huggingface.co/OpenGVLab/Hulk/tree/main). Download it, put it under `experiments/release/checkpoints/Hulk_vit-B` (first `mkdir -p experiments/release/checkpoints/Hulk_vit-B`), then use the following command to evaluate the model on the test set.

```bash
cd experiments/release
sh batch_eval.sh 1 Hulk_vit-B
```
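For convenience, the downloads mentioned above (the `bert-base-uncased` weights, the MAE pre-trained backbone, and the released Hulk checkpoint) can also be scripted. The sketch below assumes `huggingface_hub` is installed; the local folder names are taken from the paths above where given, and the `bert-base-uncased` subfolder name is an assumption, so adjust it to whatever your configs expect.

```python
import urllib.request
from pathlib import Path

from huggingface_hub import snapshot_download  # pip install huggingface_hub

# 1) bert-base-uncased, expected under experiments/release/
#    (exact subfolder name is an assumption; adjust to your config)
snapshot_download(
    repo_id="google-bert/bert-base-uncased",
    local_dir="experiments/release/bert-base-uncased",
)

# 2) MAE ViT-B pre-trained weights, needed for training from scratch
weights_dir = Path("core/models/backbones/pretrain_weights")
weights_dir.mkdir(parents=True, exist_ok=True)
urllib.request.urlretrieve(
    "https://dl.fbaipublicfiles.com/mae/pretrain/mae_pretrain_vit_base.pth",
    str(weights_dir / "mae_pretrain_vit_base.pth"),
)

# 3) Released Hulk checkpoint, needed for evaluation
snapshot_download(
    repo_id="OpenGVLab/Hulk",
    local_dir="experiments/release/checkpoints/Hulk_vit-B",
)
```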
## Model Performance

We use the plain ViT as our backbone and develop four modality-specific tokenizers and de-tokenizers to cover 2D vision, 3D vision, skeleton-based, and vision-language human-centric tasks. Hulk has achieved state-of-the-art results on various human-centric tasks.

### Direct Evaluation

| Task | Dataset | Metric | Hulk (ViT-B) | Hulk (ViT-L) |
|---|---|---|---|---|
| pedestrian detection | CrowdHuman | mAP | 90.7 | 92.2 |
| | CrowdHuman | MR-2 | 43.8 | 40.1 |
| | CrowdHuman | JI | 84.0 | 85.8 |
| 2D pose | COCO | AP | 77.0 | 78.3 |
| | AIC | AP | 34.5 | 36.3 |
| skeleton-based action | NTU60-XSub | acc. | 93.8 | 94.1 |
| human parsing | H3.6M | mIoU | 68.08 | 69.31 |
| | LIP | mIoU | 63.95 | 65.86 |
| | CIHP | mIoU | 70.58 | 72.33 |
| attribute recognition | PA-100k | mA | 82.85 | 84.36 |
| | RAPv2 | mA | 80.90 | 82.85 |
| image caption | CUHK-PEDES | B@4 | 31.1 | 31.6 |
| monocular 3D human pose and mesh recovery | 3DPW | MPVPE↓ | 79.8 | 77.4 |
| | 3DPW | MPJPE↓ | 67.0 | 66.3 |
| | 3DPW | PA-MPJPE↓ | 39.9 | 38.5 |
| | H3.6M | MPJPE↓ | 43.6 | 40.3 |
| | H3.6M | PA-MPJPE↓ | 31.9 | 28.8 |

### Finetune Performance
| Task | Dataset | Metric | Hulk (ViT-B) | Hulk (ViT-L) |
|---|---|---|---|---|
| pedestrian detection | CrowdHuman | mAP | 92.4 | 93.0 |
| | CrowdHuman | MR-2 | 40.7 | 36.5 |
| | CrowdHuman | JI | 86.0 | 87.0 |
| 2D pose | COCO | AP | 77.5 | 78.7 |
| | AIC | AP | 35.6 | 37.1 |
| skeleton-based action | NTU60-XSub | acc. | 94.0 | 94.3 |
| human parsing | H3.6M | mIoU | 68.56 | 69.89 |
| | LIP | mIoU | 63.98 | 66.02 |
| | CIHP | mIoU | 71.26 | 72.68 |
| attribute recognition | PA-100k | mA | 87.85 | 88.97 |
| | RAPv2 | mA | 85.26 | 85.86 |
| image caption ♣ | CUHK-PEDES | B@4 | 28.3 | 30.5 |
| monocular 3D human pose and mesh recovery ♣ | 3DPW | MPVPE↓ | 80.7 | 79.9 |
| | 3DPW | MPJPE↓ | 68.9 | 68.3 |
| | 3DPW | PA-MPJPE↓ | 41.3 | 40.6 |
| | H3.6M | MPJPE↓ | 44.9 | 41.4 |
| | H3.6M | PA-MPJPE↓ | 32.0 | 30.2 |

♣: We find that the finetuned performance on image captioning and monocular 3D human pose and mesh recovery is not as good as the direct evaluation, indicating that overfitting may occur during finetuning.

## Contact

If you have any problems with our paper or code, feel free to contact [Yizhou Wang](mailto:wangyizhou@pjlab.org.cn) and [Yixuan Wu](mailto:wyx_chloe@zju.edu.cn).

## Citation

If you find this work useful, please consider citing:

```bibtex
@article{wang2023hulk,
  title={Hulk: A Universal Knowledge Translator for Human-Centric Tasks},
  author={Wang, Yizhou and Wu, Yixuan and Tang, Shixiang and He, Weizhen and Guo, Xun and Zhu, Feng and Bai, Lei and Zhao, Rui and Wu, Jian and He, Tong and others},
  journal={arXiv preprint arXiv:2312.01697},
  year={2023}
}
```