CLIP-GmP-ViT-L-14 / README.md
metadata
license: mit
base_model: openai/clip-vit-large-patch14
datasets:
  - SPRIGHT-T2I/spright_coco

A fine-tune of CLIP-L. Original model: openai/clip-vit-large-patch14

  • ❤️ this CLIP? Help feed it if you can. Besides data, CLIP eats time & expensive electricity of DE. TY! 🤗
  • Want to feed it yourself? All code for fine-tuning and much more is on my GitHub.

Update 23/SEP/2024:

  • Huggingface Transformers / Diffusers pipeline now implemented.
  • See here for an example script: Integrating my CLIP-L with Flux.1
  • Otherwise, use as normal / any HF model:
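from transformers import CLIPModel, CLIPProcessor, CLIPConfig
model_id = "zer0int/CLIP-GmP-ViT-L-14"
config = CLIPConfig.from_pretrained(model_id)
model = CLIPModel.from_pretrained(model_id, config=config)  # image + text encoders
processor = CLIPProcessor.from_pretrained(model_id)  # tokenizer + image preprocessing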

Update 03/SEP/2024 / edit 05/AUG:

👋 Looking for a Text Encoder for Flux.1 (or SD3, SDXL, SD, ...) to replace CLIP-L? 👀

You'll generally want the "TE-only" .safetensors:

  • 👉 The "TEXT" model has superior prompt following, especially for text, but also for other details. DOWNLOAD
  • 👉 The "SMOOTH" model can sometimes** have better details (when there's no text in the image). DOWNLOAD
  • The "GmP" initial fine-tune is deprecated / inferior to the above models. Still, you can DOWNLOAD it.

**: The "TEXT" model is the best for text. Full stop. But whether the "SMOOTH" model is better for your (text-free) scenario than the "TEXT" model really depends on the specific prompt. It might also be the case that the "TEXT" model leads to images that you prefer over "SMOOTH"; the only way to know is to experiment with both.


🤓👨‍💻 In general (because we're not limited to text-to-image generative AI), I provide four versions / downloads:

  • Text encoder only .safetensors.
  • Full model .safetensors.
  • State_dict pickle.
  • Full model pickle (can be used as-is with "import clip" -> clip.load() after bypassing SHA checksum verification).

The TEXT model has a modality gap of 0.80 (OpenAI pre-trained: 0.82).

  • Trained with a high temperature of 0.1 + tinkering.
  • ImageNet/ObjectNet accuracy ~0.91 for both "SMOOTH" and "TEXT" models (pre-trained: ~0.84).
  • The models (this plot = "TEXT" model on MSCOCO) are also golden retrievers: 🥰🐕

[Plot: "TEXT" model on MSCOCO]
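
The README does not say how the modality gap figures (0.80 vs. 0.82) were measured. One common definition, assumed here, is the Euclidean distance between the centroid of the L2-normalized image embeddings and the centroid of the L2-normalized text embeddings. The sketch below uses plain Python lists and made-up 2-D vectors so it runs without any model download; the function names are hypothetical and this is not the author's eval code.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def modality_gap(image_embeds, text_embeds):
    """Distance between the centroids of the normalized image and text embeddings.

    Assumption: this centroid-distance definition is one common choice,
    not necessarily the metric behind the numbers quoted in this README.
    """
    image_center = centroid([l2_normalize(v) for v in image_embeds])
    text_center = centroid([l2_normalize(v) for v in text_embeds])
    return math.dist(image_center, text_center)

# Toy embeddings (made up): images point along x, texts along y.
image_embeds = [[1.0, 0.0], [2.0, 0.0]]
text_embeds = [[0.0, 1.0], [0.0, 3.0]]

gap = modality_gap(image_embeds, text_embeds)
print(round(gap, 4))  # 1.4142
```

With real data, `image_embeds` and `text_embeds` would be the image and text features of the same paired dataset (for example MSCOCO captions and images); a smaller value means the two modalities sit closer together in the shared embedding space.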


Update 11/AUG/2024:

New best-performing CLIP ViT-L/14 'GmP-smooth' model added (simply download the files named BEST!).

Or just create a fine-tune yourself: https://github.com/zer0int/CLIP-fine-tune

How?

  • Geometric Parametrization (GmP) (same as before)
  • Activation Value manipulation for 'adverb neuron' (same as before)
  • NEW: Custom loss function with label smoothing!
  • For in-depth details, see my GitHub. 🤗
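
To make the "custom loss function with label smoothing" item more concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss with label smoothing in PyTorch. It is an illustration under assumptions, not the loss from the GitHub repo: the function name `smoothed_clip_loss` is hypothetical, and the `temperature` and `smoothing` defaults are placeholders rather than the author's settings.

```python
import torch
import torch.nn.functional as F

def smoothed_clip_loss(image_embeds, text_embeds, temperature=0.07, smoothing=0.1):
    """Symmetric image<->text contrastive loss with label smoothing.

    Assumption: row i of image_embeds is paired with row i of text_embeds.
    """
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Cosine similarities, sharpened by the temperature.
    logits = image_embeds @ text_embeds.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Label smoothing moves a little probability mass from the matching
    # pair onto the other items in the batch.
    loss_i2t = F.cross_entropy(logits, targets, label_smoothing=smoothing)
    loss_t2i = F.cross_entropy(logits.t(), targets, label_smoothing=smoothing)
    return (loss_i2t + loss_t2i) / 2

# Toy batch of 4 random "embeddings" (made up).
torch.manual_seed(0)
loss = smoothed_clip_loss(torch.randn(4, 8), torch.randn(4, 8))
print(loss.item())
```

Smoothing keeps the model from becoming overconfident on a single caption per image, which can help when several captions in a batch describe similar content.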

A fine-tune of OpenAI / CLIP ViT-L/14 that has an unprecedented ImageNet/ObjectNet accuracy of ~0.90 (original pre-trained model / OpenAI's CLIP: ~0.85)**.

Made possible with Geometric Parametrization (GmP):


"Normal" CLIP MLP (multi-layer perceptron):

(mlp): Sequential(
  |-(c_fc): Linear(in_features=1024, out_features=4096, bias=True)
  | (gelu): QuickGELU()
|-}-(c_proj): Linear(in_features=4096, out_features=1024, bias=True)
| | 
| |-- visual.transformer.resblocks.0.mlp.c_fc.weight
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.weight
|---- visual.transformer.resblocks.0.mlp.c_proj.bias


GmP CLIP MLP:

Weight decomposition into:
- radial component 'r' as norm of pre-trained weights
- angular component 'theta' as normalized direction
-> preserves weight vectors' directionality and magnitude

(mlp): Sequential(
  |-(c_fc): GeometricLinear()
  | (gelu): QuickGELU()
|-}-(c_proj): GeometricLinear()
| | 
| |-- visual.transformer.resblocks.0.mlp.c_fc.r
| |-- visual.transformer.resblocks.0.mlp.c_fc.theta
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.r
|---- visual.transformer.resblocks.0.mlp.c_proj.theta
|---- visual.transformer.resblocks.0.mlp.c_proj.bias

(Same thing for [text] transformer.resblocks)
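
The decomposition described above can be sketched as a small PyTorch module. This is a hedged illustration, not the author's class: it reuses the name `GeometricLinear` and the parameter names `r`, `theta` and `bias` from the listing, and it assumes that the norm is taken per output row of the weight matrix. The `from_linear` helper is hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricLinear(nn.Module):
    """Linear layer whose weight is stored as r * theta / ||theta|| (row-wise)."""

    def __init__(self, in_features, out_features, bias=True):
        super().__init__()
        self.r = nn.Parameter(torch.ones(out_features, 1))                  # magnitude per output row
        self.theta = nn.Parameter(torch.randn(out_features, in_features))   # direction per output row
        if bias:
            self.bias = nn.Parameter(torch.zeros(out_features))
        else:
            self.register_parameter("bias", None)

    @classmethod
    def from_linear(cls, linear):
        """Initialise r and theta from a pre-trained nn.Linear (hypothetical helper)."""
        layer = cls(linear.in_features, linear.out_features, bias=linear.bias is not None)
        with torch.no_grad():
            norms = linear.weight.norm(p=2, dim=1, keepdim=True)
            layer.r.copy_(norms)
            layer.theta.copy_(linear.weight / norms)
            if linear.bias is not None:
                layer.bias.copy_(linear.bias)
        return layer

    def forward(self, x):
        direction = F.normalize(self.theta, p=2, dim=1)
        return F.linear(x, self.r * direction, self.bias)

# Same shapes as c_fc in the listing above.
linear = nn.Linear(1024, 4096)
gmp = GeometricLinear.from_linear(linear)

x = torch.randn(2, 1024)
print(torch.allclose(linear(x), gmp(x), atol=1e-4))
```

Right after conversion the layer computes the same function as the original `Linear`; the difference only shows up during fine-tuning, where magnitude (`r`) and direction (`theta`) receive separate gradients.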


✅ The model / state_dict I am sharing was converted back to .weight after fine-tuning, so it can be used in the same manner as any state_dict, e.g. with ComfyUI as the SDXL / SD3 Text Encoder! 🤗
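
For readers who fine-tune their own GmP model, converting back to `.weight` might look roughly like the sketch below. It assumes the key layout shown in the listing above (`...c_fc.r`, `...c_fc.theta`) and row-wise normalization of `theta`; the function name `gmp_to_weight` is hypothetical, and the conversion script in the GitHub repo may differ.

```python
import torch
import torch.nn.functional as F

def gmp_to_weight(state_dict):
    """Fold every (<prefix>.r, <prefix>.theta) pair into a plain <prefix>.weight."""
    converted = {}
    for key, value in state_dict.items():
        if key.endswith(".theta"):
            prefix = key[: -len(".theta")]
            r = state_dict[prefix + ".r"]
            converted[prefix + ".weight"] = r * F.normalize(value, p=2, dim=1)
        elif key.endswith(".r") and key[: -len(".r")] + ".theta" in state_dict:
            continue  # already folded into the matching .weight
        else:
            converted[key] = value  # biases, norms, embeddings, etc. pass through
    return converted

# Toy state_dict with one GmP layer (values made up).
gmp_state = {
    "visual.transformer.resblocks.0.mlp.c_fc.r": torch.tensor([[2.0], [3.0]]),
    "visual.transformer.resblocks.0.mlp.c_fc.theta": torch.tensor([[1.0, 0.0], [0.0, 5.0]]),
    "visual.transformer.resblocks.0.mlp.c_fc.bias": torch.zeros(2),
}

plain = gmp_to_weight(gmp_state)
print(sorted(plain))
```

The resulting dictionary only contains standard `.weight` / `.bias` keys, which is what tools expecting an ordinary CLIP state_dict look for.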

  • ** For details on training and those numbers / the eval, please see https://github.com/zer0int/CLIP-fine-tune
  • -> You can use "exp-acts-ft-finetune-OpenAI-CLIP-ViT-L-14-GmP-manipulate-neurons.py" to replicate my exact model fine-tune.

Pre-trained CLIP model by OpenAI, License: MIT License