File size: 2,722 Bytes
31d01a8
 
 
 
 
bf9dbd8
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
31d01a8
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
0ce05f9
31d01a8
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
---
license: mit
datasets:
- SPRIGHT-T2I/spright_coco
---
## Update 11/AUG/2024:

New Best-Performing CLIP ViT-L/14 'GmP-smooth' model added (simply download the files named *BEST*!):

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6490359a877fc29cb1b09451/qb5hYNxSTMB5z7rSs7N9k.png)

Or just create a fine-tune yourself: [https://github.com/zer0int/CLIP-fine-tune](https://github.com/zer0int/CLIP-fine-tune)

How?
- Geometric Parametrization (GmP) (same as before)
- Activation Value manipulation for 'adverb neuron' (same as before)
- NEW: Custom loss function with label smoothing!
- For in-depth details, see my GitHub. 🤗

----

## A fine-tune of OpenAI / CLIP ViT-L/14 that has an unprecedented ImageNet/ObjectNet accuracy of ~0.90 (original pre-trained model / OpenAI's CLIP: ~0.85)**.

Made possible with Geometric Parametrization (GmP):

```

"Normal" CLIP MLP (multi-layer perceptron):

(mlp): Sequential(
  |-(c_fc): Linear(in_features=1024, out_features=4096, bias=True)
  | (gelu): QuickGELU()
|-}-(c_proj): Linear(in_features=4096, out_features=1024, bias=True)
| | 
| |-- visual.transformer.resblocks.0.mlp.c_fc.weight
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.weight
|---- visual.transformer.resblocks.0.mlp.c_proj.bias


GmP CLIP MLP:

Weight decomposition into:
- radial component 'r' as norm of pre-trained weights
- angular component 'theta' as normalized direction
-> preserves weight vectors' directionality and magnitude

(mlp): Sequential(
  |-(c_fc): GeometricLinear()
  | (gelu): QuickGELU()
|-}-(c_proj): GeometricLinear()
| | 
| |-- visual.transformer.resblocks.0.mlp.c_fc.r
| |-- visual.transformer.resblocks.0.mlp.c_fc.theta
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.r
|---- visual.transformer.resblocks.0.mlp.c_proj.theta
|---- visual.transformer.resblocks.0.mlp.c_proj.bias

(Same thing for [text] transformer.resblocks)

```

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6490359a877fc29cb1b09451/mqIgsH_aWKop_DDQ2KglN.png)

✅ The model / state_dict I am sharing was converted back to .weight after fine-tuning - alas, it can be used in the same manner as any state_dict, e.g. for use with ComfyUI as the SDXL / SD3 Text Encoder! 🤗

- ** For details on training and those numbers / the eval, please see [https://github.com/zer0int/CLIP-fine-tune](https://github.com/zer0int/CLIP-fine-tune)
- -> You can use "exp-acts-ft-finetune-OpenAI-CLIP-ViT-L-14-GmP-manipulate-neurons.py" to replicate my exact model fine-tune.

Pre-trained CLIP model by OpenAI, License: [MIT License](https://github.com/openai/CLIP/blob/main/LICENSE)