Spaces: Running on Zero
Commit 0f9d939 · add yoloe
Files added: README.md (+288), app.py (+239), requirements.txt (+9)

README.md (ADDED)
---
sdk: gradio
---

# [YOLOE: Real-Time Seeing Anything]()

Official PyTorch implementation of **YOLOE**.

<p align="center">
<img src="figures/comparison.svg" width=70%> <br>
Comparison of performance, training cost, and inference efficiency between YOLOE (Ours) and YOLO-Worldv2 in terms of open text prompts.
</p>

[YOLOE: Real-Time Seeing Anything]().\
Ao Wang*, Lihao Liu*, Hui Chen, Zijia Lin, Jungong Han, and Guiguang Ding


We introduce **YOLOE(ye)**, a highly **efficient**, **unified**, and **open** object detection and segmentation model that, like the human eye, works under different prompt mechanisms, such as *texts*, *visual inputs*, and the *prompt-free paradigm*.

<!-- <p align="center">
<img src="figures/pipeline.svg" width=96%> <br>
</p> -->

<p align="center">
<img src="figures/visualization.svg" width=96%> <br>
</p>


<details>
<summary>
<font size="+1">Abstract</font>
</summary>
Object detection and segmentation are widely employed in computer vision applications, yet conventional models like the YOLO series, while efficient and accurate, are limited by predefined categories, hindering adaptability in open scenarios. Recent open-set methods leverage text prompts, visual cues, or a prompt-free paradigm to overcome this, but often compromise between performance and efficiency due to high computational demands or deployment complexity. In this work, we introduce YOLOE, which integrates detection and segmentation across diverse open prompt mechanisms within a single highly efficient model, achieving real-time seeing anything. For text prompts, we propose the Re-parameterizable Region-Text Alignment (RepRTA) strategy. It refines pretrained textual embeddings via a re-parameterizable lightweight auxiliary network and enhances visual-textual alignment with zero inference and transferring overhead. For visual prompts, we present the Semantic-Activated Visual Prompt Encoder (SAVPE). It employs decoupled semantic and activation branches to bring improved visual embedding and accuracy with minimal complexity. For the prompt-free scenario, we introduce the Lazy Region-Prompt Contrast (LRPC) strategy. It utilizes a built-in large vocabulary and specialized embedding to identify all objects, avoiding a costly language model dependency. Extensive experiments show YOLOE's exceptional zero-shot performance and transferability with high inference efficiency and low training cost. Notably, on LVIS, with $3\times$ less training cost and $1.4\times$ inference speedup, YOLOE-v8-S surpasses YOLO-Worldv2-S by 3.5 AP. When transferring to COCO, YOLOE-v8-L achieves 0.6 $AP^b$ and 0.4 $AP^m$ gains over closed-set YOLOv8-L with nearly $4\times$ less training time. Code and models will be publicly available.
<p></p>
<p align="center">
<img src="figures/pipeline.svg" width=96%> <br>
</p>
</details>

## Performance

### Zero-shot detection evaluation

- *Fixed AP* is reported on the LVIS `minival` set with text (T) / visual (V) prompts.
- Training time is for text prompts on the detection task, measured on 8 Nvidia RTX 4090 GPUs.
- FPS is measured on an Nvidia T4 with TensorRT and on an iPhone 12 with CoreML, respectively.
- For training data, OG denotes Objects365v1 and GoldG.
- YOLOE can be re-parameterized into standard YOLO models with **zero inference and transferring overhead**.

| Model | Size | Prompt | Params | Data | Time | FPS | $AP$ | $AP_r$ | $AP_c$ | $AP_f$ | Log |
|---|---|---|---|---|---|---|---|---|---|---|---|
| [YOLOE-v8-S](https://huggingface.co/jameslahm/yoloe/blob/main/yoloe-v8s-seg.pt) | 640 | T / V | 12M / 13M | OG | 12.0h | 305.8 / 64.3 | 27.9 / 26.2 | 22.3 / 21.3 | 27.8 / 27.7 | 29.0 / 25.7 | [T](./logs/yoloe-v8s-seg) / [V](./logs/yoloe-v8s-seg-vp) |
| [YOLOE-v8-M](https://huggingface.co/jameslahm/yoloe/blob/main/yoloe-v8m-seg.pt) | 640 | T / V | 27M / 30M | OG | 17.0h | 156.7 / 41.7 | 32.6 / 31.0 | 26.9 / 27.0 | 31.9 / 31.7 | 34.4 / 31.1 | [T](./logs/yoloe-v8m-seg) / [V](./logs/yoloe-v8m-seg-vp) |
| [YOLOE-v8-L](https://huggingface.co/jameslahm/yoloe/blob/main/yoloe-v8l-seg.pt) | 640 | T / V | 45M / 50M | OG | 22.5h | 102.5 / 27.2 | 35.9 / 34.2 | 33.2 / 33.2 | 34.8 / 34.6 | 37.3 / 34.1 | [T](./logs/yoloe-v8l-seg) / [V](./logs/yoloe-v8l-seg-vp) |
| [YOLOE-11-S](https://huggingface.co/jameslahm/yoloe/blob/main/yoloe-11s-seg.pt) | 640 | T / V | 10M / 12M | OG | 13.0h | 301.2 / 73.3 | 27.5 / 26.3 | 21.4 / 22.5 | 26.8 / 27.1 | 29.3 / 26.4 | [T](./logs/yoloe-11s-seg) / [V](./logs/yoloe-11s-seg-vp) |
| [YOLOE-11-M](https://huggingface.co/jameslahm/yoloe/blob/main/yoloe-11m-seg.pt) | 640 | T / V | 21M / 27M | OG | 18.5h | 168.3 / 39.2 | 33.0 / 31.4 | 26.9 / 27.1 | 32.5 / 31.9 | 34.5 / 31.7 | [T](./logs/yoloe-11m-seg) / [V](./logs/yoloe-11m-seg-vp) |
| [YOLOE-11-L](https://huggingface.co/jameslahm/yoloe/blob/main/yoloe-11l-seg.pt) | 640 | T / V | 26M / 32M | OG | 23.5h | 130.5 / 35.1 | 35.2 / 33.7 | 29.1 / 28.1 | 35.0 / 34.6 | 36.5 / 33.8 | [T](./logs/yoloe-11l-seg) / [V](./logs/yoloe-11l-seg-vp) |

### Zero-shot segmentation evaluation

- The models are the same as in [Zero-shot detection evaluation](#zero-shot-detection-evaluation).
- *Standard AP<sup>m</sup>* is reported on the LVIS `val` set with text (T) / visual (V) prompts.

| Model | Size | Prompt | $AP^m$ | $AP_r^m$ | $AP_c^m$ | $AP_f^m$ |
|---|---|---|---|---|---|---|
| YOLOE-v8-S | 640 | T / V | 17.7 / 16.8 | 15.5 / 13.5 | 16.3 / 16.7 | 20.3 / 18.2 |
| YOLOE-v8-M | 640 | T / V | 20.8 / 20.3 | 17.2 / 17.0 | 19.2 / 20.1 | 24.2 / 22.0 |
| YOLOE-v8-L | 640 | T / V | 23.5 / 22.0 | 21.9 / 16.5 | 21.6 / 22.1 | 26.4 / 24.3 |
| YOLOE-11-S | 640 | T / V | 17.6 / 17.1 | 16.1 / 14.4 | 15.6 / 16.8 | 20.5 / 18.6 |
| YOLOE-11-M | 640 | T / V | 21.1 / 21.0 | 17.2 / 18.3 | 19.6 / 20.6 | 24.4 / 22.6 |
| YOLOE-11-L | 640 | T / V | 22.6 / 22.5 | 19.3 / 20.5 | 20.9 / 21.7 | 26.0 / 24.1 |

### Prompt-free evaluation

- The models are the same as in [Zero-shot detection evaluation](#zero-shot-detection-evaluation), except for the specialized prompt embedding.
- *Fixed AP* is reported on the LVIS `minival` set, and FPS is measured on an Nvidia T4 GPU with PyTorch.

| Model | Size | Params | $AP$ | $AP_r$ | $AP_c$ | $AP_f$ | FPS | Log |
|---|---|---|---|---|---|---|---|---|
| [YOLOE-v8-S](https://huggingface.co/jameslahm/yoloe/blob/main/yoloe-v8s-seg-pf.pt) | 640 | 13M | 21.0 | 19.1 | 21.3 | 21.0 | 95.8 | [PF](./logs/yoloe-v8s-seg-pf/) |
| [YOLOE-v8-M](https://huggingface.co/jameslahm/yoloe/blob/main/yoloe-v8m-seg-pf.pt) | 640 | 29M | 24.7 | 22.2 | 24.5 | 25.3 | 45.9 | [PF](./logs/yoloe-v8m-seg-pf/) |
| [YOLOE-v8-L](https://huggingface.co/jameslahm/yoloe/blob/main/yoloe-v8l-seg-pf.pt) | 640 | 47M | 27.2 | 23.5 | 27.0 | 28.0 | 25.3 | [PF](./logs/yoloe-v8l-seg-pf/) |
| [YOLOE-11-S](https://huggingface.co/jameslahm/yoloe/blob/main/yoloe-11s-seg-pf.pt) | 640 | 11M | 20.6 | 18.4 | 20.2 | 21.3 | 93.0 | [PF](./logs/yoloe-11s-seg-pf/) |
| [YOLOE-11-M](https://huggingface.co/jameslahm/yoloe/blob/main/yoloe-11m-seg-pf.pt) | 640 | 24M | 25.5 | 21.6 | 25.5 | 26.1 | 42.5 | [PF](./logs/yoloe-11m-seg-pf/) |
| [YOLOE-11-L](https://huggingface.co/jameslahm/yoloe/blob/main/yoloe-11l-seg-pf.pt) | 640 | 29M | 26.3 | 22.7 | 25.8 | 27.5 | 34.9 | [PF](./logs/yoloe-11l-seg-pf/) |

### Downstream transfer on COCO

- During transferring, YOLOE-v8 / YOLOE-11 is **exactly the same** as YOLOv8 / YOLO11.
- For *Linear probing*, only the last conv in the classification head is trainable.
- For *Full tuning*, all parameters are trainable.

| Model | Size | Epochs | $AP^b$ | $AP^b_{50}$ | $AP^b_{75}$ | $AP^m$ | $AP^m_{50}$ | $AP^m_{75}$ | Log |
|---|---|---|---|---|---|---|---|---|---|
| Linear probing | | | | | | | | | |
| [YOLOE-v8-S](https://huggingface.co/jameslahm/yoloe/blob/main/yoloe-v8s-seg-coco-pe.pt) | 640 | 10 | 35.6 | 51.5 | 38.9 | 30.3 | 48.2 | 32.0 | [LP](./logs/yoloe-v8s-seg-coco-pe/) |
| [YOLOE-v8-M](https://huggingface.co/jameslahm/yoloe/blob/main/yoloe-v8m-seg-coco-pe.pt) | 640 | 10 | 42.2 | 59.2 | 46.3 | 35.5 | 55.6 | 37.7 | [LP](./logs/yoloe-v8m-seg-coco-pe/) |
| [YOLOE-v8-L](https://huggingface.co/jameslahm/yoloe/blob/main/yoloe-v8l-seg-coco-pe.pt) | 640 | 10 | 45.4 | 63.3 | 50.0 | 38.3 | 59.6 | 40.8 | [LP](./logs/yoloe-v8l-seg-coco-pe/) |
| [YOLOE-11-S](https://huggingface.co/jameslahm/yoloe/blob/main/yoloe-11s-seg-coco-pe.pt) | 640 | 10 | 37.0 | 52.9 | 40.4 | 31.5 | 49.7 | 33.5 | [LP](./logs/yoloe-11s-seg-coco-pe/) |
| [YOLOE-11-M](https://huggingface.co/jameslahm/yoloe/blob/main/yoloe-11m-seg-coco-pe.pt) | 640 | 10 | 43.1 | 60.6 | 47.4 | 36.5 | 56.9 | 39.0 | [LP](./logs/yoloe-11m-seg-coco-pe/) |
| [YOLOE-11-L](https://huggingface.co/jameslahm/yoloe/blob/main/yoloe-11l-seg-coco-pe.pt) | 640 | 10 | 45.1 | 62.8 | 49.5 | 38.0 | 59.2 | 40.6 | [LP](./logs/yoloe-11l-seg-coco-pe/) |
| Full tuning | | | | | | | | | |
| [YOLOE-v8-S](https://huggingface.co/jameslahm/yoloe/blob/main/yoloe-v8s-seg-coco.pt) | 640 | 160 | 45.0 | 61.6 | 49.1 | 36.7 | 58.3 | 39.1 | [FT](./logs/yoloe-v8s-seg-coco/) |
| [YOLOE-v8-M](https://huggingface.co/jameslahm/yoloe/blob/main/yoloe-v8m-seg-coco.pt) | 640 | 80 | 50.4 | 67.0 | 55.2 | 40.9 | 63.7 | 43.5 | [FT](./logs/yoloe-v8m-seg-coco/) |
| [YOLOE-v8-L](https://huggingface.co/jameslahm/yoloe/blob/main/yoloe-v8l-seg-coco.pt) | 640 | 80 | 53.0 | 69.8 | 57.9 | 42.7 | 66.5 | 45.6 | [FT](./logs/yoloe-v8l-seg-coco/) |
| [YOLOE-11-S](https://huggingface.co/jameslahm/yoloe/blob/main/yoloe-11s-seg-coco.pt) | 640 | 160 | 46.2 | 62.9 | 50.0 | 37.6 | 59.3 | 40.1 | [FT](./logs/yoloe-11s-seg-coco/) |
| [YOLOE-11-M](https://huggingface.co/jameslahm/yoloe/blob/main/yoloe-11m-seg-coco.pt) | 640 | 80 | 51.3 | 68.3 | 56.0 | 41.5 | 64.8 | 44.3 | [FT](./logs/yoloe-11m-seg-coco/) |
| [YOLOE-11-L](https://huggingface.co/jameslahm/yoloe/blob/main/yoloe-11l-seg-coco.pt) | 640 | 80 | 52.6 | 69.7 | 57.5 | 42.4 | 66.2 | 45.2 | [FT](./logs/yoloe-11l-seg-coco/) |

## Installation
A `conda` virtual environment is recommended.
```bash
conda create -n yoloe python=3.10 -y
conda activate yoloe

pip install -r requirements.txt
pip install -e .
pip install -e lvis-api
pip install -e ml-mobileclip
pip install -e CLIP
```

## Demo
```bash
# Optional for mirror: export HF_ENDPOINT=https://hf-mirror.com
pip install gradio==4.42.0 gradio_image_prompter==0.1.0 fastapi==0.112.2
python app.py
# Please visit http://127.0.0.1:7860
```

## Prediction

### Text prompt
```bash
python predict.py
```
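
For a quick look at the underlying API, here is a minimal Python sketch of text-prompted inference, mirroring the calls made in the bundled `app.py` (the checkpoint choice and image path are placeholders):

```python
from huggingface_hub import hf_hub_download
from ultralytics import YOLOE

# Download a pretrained text/visual-prompt checkpoint from the Hugging Face Hub.
path = hf_hub_download(repo_id="jameslahm/yoloe", filename="yoloe-v8l-seg.pt")
model = YOLOE(path)

# Register the open-vocabulary classes from text prompts.
names = ["person", "bus"]
model.set_classes(names, model.get_text_pe(names))

# Detect and segment on an example image (path is a placeholder).
results = model.predict(source="bus.jpg", imgsz=640, conf=0.25, iou=0.70)
annotated = results[0].plot()  # numpy array with boxes and masks drawn
```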

### Visual prompt
```bash
python predict_vp.py
```
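
Similarly, a sketch of visual-prompted inference with box prompts, following `app.py` (the box coordinates and image path are placeholders; drawn masks can be passed via a `"masks"` key instead):

```python
import numpy as np
from huggingface_hub import hf_hub_download
from ultralytics import YOLOE
from ultralytics.models.yolo.yoloe.predict_vp import YOLOEVPSegPredictor

model = YOLOE(hf_hub_download(repo_id="jameslahm/yoloe", filename="yoloe-v8l-seg.pt"))

# One reference box per prompted object, as (x1, y1, x2, y2) on the source image.
prompts = {"bboxes": np.array([[100.0, 150.0, 400.0, 500.0]])}

# YOLOEVPSegPredictor turns the prompted regions into visual embeddings and
# segments matching objects in the same image (intra-image prompting).
results = model.predict(
    source="bus.jpg", imgsz=640, conf=0.25, iou=0.70,
    prompts=prompts, predictor=YOLOEVPSegPredictor,
)
annotated = results[0].plot()
```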

### Prompt-free
```bash
python predict_pf.py
```
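
For prompt-free inference, `app.py` builds a built-in vocabulary from a tag list with the text-prompt checkpoint and attaches it to the prompt-free checkpoint; a condensed sketch of that flow (file names follow `app.py`):

```python
from huggingface_hub import hf_hub_download
from ultralytics import YOLOE

# Category names for the built-in vocabulary (same tag list as the demo).
with open("tools/ram_tag_list.txt") as f:
    names = [line.strip() for line in f]

# Build vocabulary embeddings with the text-prompt model...
text_model = YOLOE(hf_hub_download(repo_id="jameslahm/yoloe", filename="yoloe-v8l-seg.pt"))
vocab = text_model.get_vocab(names)

# ...then hand them to the prompt-free model and predict without any prompt.
# (app.py additionally tweaks head attributes such as is_fused, conf, and max_det.)
model = YOLOE(hf_hub_download(repo_id="jameslahm/yoloe", filename="yoloe-v8l-seg-pf.pt"))
model.set_vocab(vocab, names=names)
results = model.predict(source="bus.jpg", imgsz=640, conf=0.25, iou=0.70)
annotated = results[0].plot()
```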

## Validation

### Data
- Please download LVIS following [this guide](https://docs.ultralytics.com/zh/datasets/detect/lvis/) or [lvis.yaml](./ultralytics/cfg/datasets/lvis.yaml).
- We use this [`minival.txt`](./tools/lvis/minival.txt) with background images for evaluation.

```bash
# For evaluation with visual prompts, please obtain the referring data.
python tools/generate_lvis_visual_prompt_data.py
```

### Zero-shot evaluation on LVIS
- For text prompts, `python val.py`.
- For visual prompts, `python val_vp.py`.

For *Fixed AP*, please refer to the comments in `val.py` and `val_vp.py`, and use `tools/eval_fixed_ap.py` for evaluation.

### Prompt-free evaluation
```bash
python val_pe_free.py
python tools/eval_open_ended.py --json ../datasets/lvis/annotations/lvis_v1_minival.json --pred runs/detect/val/predictions.json --fixed
```

### Downstream transfer on COCO
```bash
python val_coco.py
```

## Training

The training includes three stages:
- YOLOE is trained with text prompts for detection and segmentation for 30 epochs.
- Only the visual prompt encoder (SAVPE) is trained with visual prompts for 2 epochs.
- Only the specialized prompt embedding for the prompt-free paradigm is trained for 1 epoch.

### Data

| Images | Raw Annotations | Processed Annotations |
|---|---|---|
| [Objects365v1](https://opendatalab.com/OpenDataLab/Objects365_v1) | [objects365_train.json](https://opendatalab.com/OpenDataLab/Objects365_v1) | [objects365_train_segm.json](https://huggingface.co/datasets/jameslahm/yoloe/blob/main/objects365_train_segm.json) |
| [GQA](https://nlp.stanford.edu/data/gqa/images.zip) | [final_mixed_train_no_coco.json](https://huggingface.co/GLIPModel/GLIP/blob/main/mdetr_annotations/final_mixed_train_no_coco.json) | [final_mixed_train_no_coco_segm.json](https://huggingface.co/datasets/jameslahm/yoloe/blob/main/final_mixed_train_no_coco_segm.json) |
| [Flickr30k](https://shannon.cs.illinois.edu/DenotationGraph/) | [final_flickr_separateGT_train.json](https://huggingface.co/GLIPModel/GLIP/blob/main/mdetr_annotations/final_flickr_separateGT_train.json) | [final_flickr_separateGT_train_segm.json](https://huggingface.co/datasets/jameslahm/yoloe/blob/main/final_flickr_separateGT_train_segm.json) |

For annotations, you can directly use our preprocessed ones or use the following script to obtain the processed annotations with segmentation masks.
```bash
# Generate segmentation data
conda create -n sam2 python==3.10.16
conda activate sam2
pip install -r sam2/requirements.txt
pip install -e sam2/

python tools/generate_sam_masks.py --img-path ../datasets/Objects365v1/images/train --json-path ../datasets/Objects365v1/annotations/objects365_train.json --batch
python tools/generate_sam_masks.py --img-path ../datasets/flickr/full_images/ --json-path ../datasets/flickr/annotations/final_flickr_separateGT_train.json
python tools/generate_sam_masks.py --img-path ../datasets/mixed_grounding/gqa/images --json-path ../datasets/mixed_grounding/annotations/final_mixed_train_no_coco.json

# Generate objects365v1 labels
python tools/generate_objects365v1.py
```

Then, please generate the data and embedding cache for training.
```bash
# Generate grounding segmentation cache
python tools/generate_grounding_cache.py --img-path ../datasets/flickr/full_images/ --json-path ../datasets/flickr/annotations/final_flickr_separateGT_train_segm.json
python tools/generate_grounding_cache.py --img-path ../datasets/mixed_grounding/gqa/images --json-path ../datasets/mixed_grounding/annotations/final_mixed_train_no_coco_segm.json

# Generate train label embeddings
python tools/generate_label_embedding.py
python tools/generate_global_neg_cat.py
```
Finally, please download MobileCLIP-B(LT) for the text encoder.
```bash
wget https://docs-assets.developer.apple.com/ml-research/datasets/mobileclip/mobileclip_blt.pt
```

### Text prompt
```bash
# For models with l scale, please change the initialization by referring to the comments at Line 549 of ultralytics/nn/modules/head.py
# If you want to train YOLOE only for detection, you can use `train.py`
python train_seg.py
```

### Visual prompt
```bash
# For visual prompts, because only SAVPE is trained, we can adopt the detection pipeline with less training time

# First, obtain the detection model
python tools/convert_segm2det.py
# Then, train the SAVPE module
python train_vp.py
```

### Prompt-free
```bash
# Generate LVIS with a single class for evaluation during training
python tools/generate_lvis_sc.py

# Similar to visual prompts, because only the specialized prompt embedding is trained, we can adopt the detection pipeline with less training time
python tools/convert_segm2det.py
python train_pe_free.py
```

## Transferring
After pretraining, YOLOE-v8 / YOLOE-11 can be re-parameterized into the same architecture as YOLOv8 / YOLO11, with **zero overhead** for transferring.

### Linear probing
Only the last conv, i.e., the prompt embedding, is trainable (see the sketch below).
```bash
python train_pe.py
```
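
To make the *linear probing* setting concrete, here is a generic PyTorch sketch of freezing every parameter except a final conv layer; this only illustrates the idea and is not the repo's `train_pe.py` (the `model` and `head_conv` names are hypothetical):

```python
import torch
import torch.nn as nn

def linear_probe_params(model: nn.Module, head_conv: nn.Module):
    """Freeze the whole model, then re-enable only the final conv (the prompt embedding)."""
    for p in model.parameters():
        p.requires_grad = False
    for p in head_conv.parameters():
        p.requires_grad = True
    # Pass only the still-trainable parameters to the optimizer.
    return [p for p in model.parameters() if p.requires_grad]

# Example with hypothetical names:
# optimizer = torch.optim.SGD(linear_probe_params(model, model.head.cls_conv), lr=0.01)
```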

### Full tuning
All parameters are trainable, for better performance.
```bash
# For models with s scale, please change the epochs to 160 for longer training
python train_pe_all.py
```

## Export
After re-parameterization, YOLOE-v8 / YOLOE-11 can be exported to the same formats as YOLOv8 / YOLO11.
```bash
pip install onnx coremltools onnxslim
python export.py
```
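
If you prefer the Python API over `export.py`, here is a hedged sketch using the standard Ultralytics export call, assuming the re-parameterized YOLOE model exposes the same `export()` interface as YOLOv8 / YOLO11 (the checkpoint name is a placeholder):

```python
from ultralytics import YOLOE

# Load a re-parameterized / transferred checkpoint (placeholder file name).
model = YOLOE("yoloe-v8l-seg-coco.pt")

# Export to ONNX; "coreml" and "engine" (TensorRT) are other common Ultralytics targets.
model.export(format="onnx", imgsz=640)
```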

## Benchmark
- For TensorRT, please refer to `benchmark.sh`.
- For CoreML, please use the benchmark tool from [Xcode 14](https://developer.apple.com/videos/play/wwdc2022/10027/).
- For the prompt-free setting, please refer to `tools/benchmark_pf.py`.

## Acknowledgement

The code base is built with [ultralytics](https://github.com/ultralytics/ultralytics), [YOLO-World](https://github.com/AILab-CVC/YOLO-World), [MobileCLIP](https://github.com/apple/ml-mobileclip), [lvis-api](https://github.com/lvis-dataset/lvis-api), [CLIP](https://github.com/openai/CLIP), and [GenerateU](https://github.com/FoundationVision/GenerateU).

Thanks for the great implementations!

## Citation

If our code or models help your work, please cite our paper:
```BibTeX

```
app.py (ADDED)
```python
import torch
import numpy as np
import gradio as gr
from scipy.ndimage import binary_fill_holes
from ultralytics import YOLOE
from ultralytics.utils.torch_utils import smart_inference_mode
from ultralytics.models.yolo.yoloe.predict_vp import YOLOEVPSegPredictor
from gradio_image_prompter import ImagePrompter
from huggingface_hub import hf_hub_download

def init_model(model_id, is_pf=False):
    # Download the requested checkpoint from the Hugging Face Hub:
    # "-seg" weights for text/visual prompts, "-seg-pf" weights for prompt-free mode.
    if not is_pf:
        path = hf_hub_download(repo_id="jameslahm/yoloe", filename=f"{model_id}-seg.pt")
        model = YOLOE(path)
    else:
        path = hf_hub_download(repo_id="jameslahm/yoloe", filename=f"{model_id}-seg-pf.pt")
        model = YOLOE(path)
    model.eval()
    model.to("cuda" if torch.cuda.is_available() else "cpu")
    return model


@smart_inference_mode()
def yoloe_inference(image, prompts, target_image, model_id, image_size, conf_thresh, iou_thresh, prompt_type):
    model = init_model(model_id)
    kwargs = {}
    if prompt_type == "Text":
        texts = prompts["texts"]
        model.set_classes(texts, model.get_text_pe(texts))
    elif prompt_type == "Visual":
        kwargs = dict(
            prompts=prompts,
            predictor=YOLOEVPSegPredictor
        )
        if target_image:
            # Inter-image mode: extract the visual prompt embedding (VPE) from the
            # source image, then run prediction on the target image with it.
            model.predict(source=image, imgsz=image_size, conf=conf_thresh, iou=iou_thresh, return_vpe=True, **kwargs)
            model.set_classes(["target"], model.predictor.vpe)
            model.predictor = None  # unset VPPredictor
            image = target_image
            kwargs = {}
    elif prompt_type == "Prompt-free":
        # Build the built-in vocabulary with the text-prompt model, then hand it
        # to the prompt-free model.
        vocab = model.get_vocab(prompts["texts"])
        model = init_model(model_id, is_pf=True)
        model.set_vocab(vocab, names=prompts["texts"])
        model.model.model[-1].is_fused = True
        model.model.model[-1].conf = 0.001
        model.model.model[-1].max_det = 1000

    results = model.predict(source=image, imgsz=image_size, conf=conf_thresh, iou=iou_thresh, **kwargs)
    annotated_image = results[0].plot()
    return annotated_image[:, :, ::-1]  # BGR -> RGB for display


def app():
    with gr.Blocks():
        with gr.Row():
            with gr.Column():
                with gr.Row():
                    raw_image = gr.Image(type="pil", label="Image", visible=True, interactive=True)
                    box_image = ImagePrompter(type="pil", label="DrawBox", visible=False, interactive=True)
                    mask_image = gr.ImageEditor(type="pil", label="DrawMask", visible=False, interactive=True, layers=False, canvas_size=(640, 640))
                    target_image = gr.Image(type="pil", label="Target Image", visible=False, interactive=True)

                yoloe_infer = gr.Button(value="Detect & Segment Objects")
                prompt_type = gr.Textbox(value="Text", visible=False)

                with gr.Tab("Text") as text_tab:
                    texts = gr.Textbox(label="Input Texts", value='person,bus', placeholder='person,bus', visible=True, interactive=True)

                with gr.Tab("Visual") as visual_tab:
                    with gr.Row():
                        visual_prompt_type = gr.Dropdown(choices=["bboxes", "masks"], value="bboxes", label="Visual Type", interactive=True)
                        visual_usage_type = gr.Radio(choices=["Intra-Image", "Inter-Image"], value="Intra-Image", label="Intra/Inter Image", interactive=True)

                with gr.Tab("Prompt-Free") as prompt_free_tab:
                    gr.HTML(
                        """
                        <p style='text-align: center'>
                        Prompt-Free Mode is On
                        </p>
                        """, show_label=False)

                model_id = gr.Dropdown(
                    label="Model",
                    choices=[
                        "yoloe-v8s",
                        "yoloe-v8m",
                        "yoloe-v8l",
                        "yoloe-11s",
                        "yoloe-11m",
                        "yoloe-11l",
                    ],
                    value="yoloe-v8l",
                )
                image_size = gr.Slider(
                    label="Image Size",
                    minimum=320,
                    maximum=1280,
                    step=32,
                    value=640,
                )
                conf_thresh = gr.Slider(
                    label="Confidence Threshold",
                    minimum=0.0,
                    maximum=1.0,
                    step=0.05,
                    value=0.25,
                )
                iou_thresh = gr.Slider(
                    label="IoU Threshold",
                    minimum=0.0,
                    maximum=1.0,
                    step=0.05,
                    value=0.70,
                )

            with gr.Column():
                output_image = gr.Image(type="numpy", label="Annotated Image", visible=True)

        # Tab callbacks toggle which input component is visible and record the prompt type.
        def update_text_image_visibility():
            return gr.update(value="Text"), gr.update(visible=True), gr.update(visible=False), gr.update(visible=False), gr.update(visible=False)

        def update_visual_image_visibility(visual_prompt_type, visual_usage_type):
            use_target = gr.update(visible=True) if visual_usage_type == "Inter-Image" else gr.update(visible=False)
            if visual_prompt_type == "bboxes":
                return gr.update(value="Visual"), gr.update(visible=False), gr.update(visible=True), gr.update(visible=False), use_target
            elif visual_prompt_type == "masks":
                return gr.update(value="Visual"), gr.update(visible=False), gr.update(visible=False), gr.update(visible=True), use_target

        def update_pf_image_visibility():
            return gr.update(value="Prompt-free"), gr.update(visible=True), gr.update(visible=False), gr.update(visible=False), gr.update(visible=False)

        text_tab.select(
            fn=update_text_image_visibility,
            inputs=None,
            outputs=[prompt_type, raw_image, box_image, mask_image, target_image]
        )

        visual_tab.select(
            fn=update_visual_image_visibility,
            inputs=[visual_prompt_type, visual_usage_type],
            outputs=[prompt_type, raw_image, box_image, mask_image, target_image]
        )

        prompt_free_tab.select(
            fn=update_pf_image_visibility,
            inputs=None,
            outputs=[prompt_type, raw_image, box_image, mask_image, target_image]
        )

        def update_visual_prompt_type(visual_prompt_type):
            if visual_prompt_type == "bboxes":
                return gr.update(visible=True), gr.update(visible=False)
            if visual_prompt_type == "masks":
                return gr.update(visible=False), gr.update(visible=True)
            return gr.update(visible=False), gr.update(visible=False)

        def update_visual_usage_type(visual_usage_type):
            if visual_usage_type == "Intra-Image":
                return gr.update(visible=False, value=None)
            if visual_usage_type == "Inter-Image":
                return gr.update(visible=True, value=None)
            return gr.update(visible=False, value=None)

        visual_prompt_type.change(
            fn=update_visual_prompt_type,
            inputs=[visual_prompt_type],
            outputs=[box_image, mask_image]
        )

        visual_usage_type.change(
            fn=update_visual_usage_type,
            inputs=[visual_usage_type],
            outputs=[target_image]
        )

        def run_inference(raw_image, box_image, mask_image, target_image, texts, model_id, image_size, conf_thresh, iou_thresh, prompt_type, visual_prompt_type):
            # add text/built-in prompts
            if prompt_type == "Text" or prompt_type == "Prompt-free":
                image = raw_image
                if prompt_type == "Prompt-free":
                    # Built-in vocabulary for prompt-free mode.
                    with open('tools/ram_tag_list.txt', 'r') as f:
                        texts = [x.strip() for x in f.readlines()]
                else:
                    texts = [text.strip() for text in texts.split(',')]
                prompts = {
                    "texts": texts
                }
            # add visual prompt
            elif prompt_type == "Visual":
                if visual_prompt_type == "bboxes":
                    image, points = box_image["image"], box_image["points"]
                    points = np.array(points)
                    # Keep only box-type prompts (flag value 2) and take their corner coordinates.
                    prompts = {
                        "bboxes": np.array([p[[0, 1, 3, 4]] for p in points if p[2] == 2]),
                    }
                elif visual_prompt_type == "masks":
                    image, masks = mask_image["background"], mask_image["layers"][0]
                    image = image.convert("RGB")
                    # Binarize the drawn mask and fill interior holes.
                    masks = np.array(masks.convert("1"))
                    masks = binary_fill_holes(masks).astype(np.uint8)
                    prompts = {
                        "masks": masks[None]
                    }
            return yoloe_inference(image, prompts, target_image, model_id, image_size, conf_thresh, iou_thresh, prompt_type)

        yoloe_infer.click(
            fn=run_inference,
            inputs=[raw_image, box_image, mask_image, target_image, texts, model_id, image_size, conf_thresh, iou_thresh, prompt_type, visual_prompt_type],
            outputs=[output_image],
        )


gradio_app = gr.Blocks()
with gradio_app:
    gr.HTML(
        """
        <h1 style='text-align: center'>
        <img src="/file=figures/logo.png" width="2.5%" style="display:inline;padding-bottom:4px">
        YOLOE: Real-Time Seeing Anything
        </h1>
        """)
    gr.HTML(
        """
        <h3 style='text-align: center'>
        <a href='' target='_blank'>arXiv</a> | <a href='https://github.com/THU-MIG/yoloe' target='_blank'>github</a>
        </h3>
        """)
    gr.Markdown(
        """
        We introduce **YOLOE(ye)**, a highly **efficient**, **unified**, and **open** object detection and segmentation model that, like the human eye, works under different prompt mechanisms, such as *texts*, *visual inputs*, and the *prompt-free paradigm*.
        """
    )
    with gr.Row():
        with gr.Column():
            app()

if __name__ == '__main__':
    gradio_app.launch(allowed_paths=["figures"])
```
requirements.txt (ADDED)
```
gradio==4.42.0
gradio_client==1.3.0
gradio_image_prompter==0.1.0
huggingface-hub==0.26.3
fastapi==0.112.2
git+https://github.com/jameslahm/yoloe.git#subdirectory=CLIP
git+https://github.com/jameslahm/yoloe.git#subdirectory=ml-mobileclip
git+https://github.com/jameslahm/yoloe.git#subdirectory=lvis-api
git+https://github.com/jameslahm/yoloe.git
```