File size: 7,274 Bytes

---
license: apache-2.0
base_model: google/siglip-so400m-patch14-384
tags:
- generated_from_trainer
- siglip
metrics:
- accuracy
- f1
model-index:
- name: siglip-tagger-test-3
  results: []
---

# siglip-tagger-test-3

This model is a fine-tuned version of [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) on an unknown dataset.
It achieves the following results on the evaluation set:
- Loss: 692.4745
- Accuracy: 0.3465
- F1: 0.9969

## Model description

This model is an experimental model that predicts danbooru tags of images.

## Example

### Use a pipeline

```py
from transformers import pipeline

pipe = pipeline("image-classification", model="p1atdev/siglip-tagger-test-3", trust_remote_code=True)
pipe(
  "image.jpg", # takes str(path) or numpy array or PIL images as input
  threshold=0.5, #optional parameter defaults to 0
  return_scores = False #optional parameter defaults to False
)
```

* `threshold`: confidence intervale, if it's specified, the pipeline will only return tags with a confidence >= threshold
* `return_scores`: if specified the pipeline will return the labels and their confidences in a dictionary format.

### Load model directly

```py
from PIL import Image
import torch

from transformers import (
    AutoModelForImageClassification,
    AutoImageProcessor,
)

import numpy as np

MODEL_NAME = "p1atdev/siglip-tagger-test-3"

model = AutoModelForImageClassification.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, trust_remote_code=True
)
model.eval()
processor = AutoImageProcessor.from_pretrained(MODEL_NAME)

image = Image.open("sample.jpg") # load your image

inputs = processor(image, return_tensors="pt").to(model.device, model.dtype)

logits = model(**inputs).logits.detach().cpu().float()[0]
logits = np.clip(logits, 0.0, 1.0)

results = {
    model.config.id2label[i]: logit for i, logit in enumerate(logits) if logit > 0
}
results = sorted(results.items(), key=lambda x: x[1], reverse=True)

for tag, score in results:
    print(f"{tag}: {score*100:.2f}%")
```

## Intended uses & limitations

This model is for research use only and is not recommended for production. 

Please use wd-v1-4-tagger series by SmilingWolf:

- [SmilingWolf/wd-v1-4-moat-tagger-v2](https://huggingface.co/SmilingWolf/wd-v1-4-moat-tagger-v2)
- [SmilingWolf/wd-v1-4-swinv2-tagger-v2](https://huggingface.co/SmilingWolf/wd-v1-4-swinv2-tagger-v2)

etc.


## Training and evaluation data

High quality 5000 images from danbooru. They were shuffled and split into train:eval at 4500:500. (Same as p1atdev/siglip-tagger-test-2)

|Name|Description|
|-|-|
|Images count|5000|
|Supported tags|9517 general tags. Character and rating tags are not included. See all labels in [config.json](config.json)|
|Image rating|4000 for `general` and 1000 for `sensitive,questionable,explicit`|
|Copyright tags|`original` only|
|Image score range (on search)|min:10, max150|

## Training procedure

- Loss function: AsymmetricLossOptimized ([Asymmetric Loss](https://github.com/Alibaba-MIIL/ASL))
  - `gamma_neg=4, gamma_pos=1, clip=0.05, eps=1e-8, disable_torch_grad_focal_loss=False`

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 64
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 10
- num_epochs: 50

### Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy | F1     |
|:-------------:|:-----:|:----:|:---------------:|:--------:|:------:|
| 1066.981      | 1.0   | 71   | 1873.5417       | 0.1412   | 0.9939 |
| 547.3158      | 2.0   | 142  | 934.3269        | 0.1904   | 0.9964 |
| 534.6942      | 3.0   | 213  | 814.0771        | 0.2170   | 0.9966 |
| 414.1278      | 4.0   | 284  | 774.0230        | 0.2398   | 0.9967 |
| 365.4994      | 5.0   | 355  | 751.2046        | 0.2459   | 0.9967 |
| 352.3663      | 6.0   | 426  | 735.6580        | 0.2610   | 0.9967 |
| 414.3976      | 7.0   | 497  | 723.2065        | 0.2684   | 0.9968 |
| 350.8201      | 8.0   | 568  | 714.0453        | 0.2788   | 0.9968 |
| 364.5016      | 9.0   | 639  | 706.5261        | 0.2890   | 0.9968 |
| 309.1184      | 10.0  | 710  | 700.7808        | 0.2933   | 0.9968 |
| 288.5186      | 11.0  | 781  | 695.7027        | 0.3008   | 0.9968 |
| 287.4452      | 12.0  | 852  | 691.5306        | 0.3037   | 0.9968 |
| 280.9088      | 13.0  | 923  | 688.8063        | 0.3084   | 0.9969 |
| 296.8389      | 14.0  | 994  | 686.1077        | 0.3132   | 0.9968 |
| 265.1467      | 15.0  | 1065 | 683.7382        | 0.3167   | 0.9969 |
| 268.5263      | 16.0  | 1136 | 682.1683        | 0.3206   | 0.9969 |
| 309.7871      | 17.0  | 1207 | 681.1995        | 0.3199   | 0.9969 |
| 307.6475      | 18.0  | 1278 | 680.1700        | 0.3230   | 0.9969 |
| 262.0677      | 19.0  | 1349 | 679.2177        | 0.3270   | 0.9969 |
| 275.3823      | 20.0  | 1420 | 678.9730        | 0.3294   | 0.9969 |
| 273.984       | 21.0  | 1491 | 678.6031        | 0.3318   | 0.9969 |
| 273.5361      | 22.0  | 1562 | 678.1285        | 0.3332   | 0.9969 |
| 279.6474      | 23.0  | 1633 | 678.4264        | 0.3348   | 0.9969 |
| 232.5045      | 24.0  | 1704 | 678.3773        | 0.3357   | 0.9969 |
| 269.621       | 25.0  | 1775 | 678.4922        | 0.3372   | 0.9969 |
| 289.8389      | 26.0  | 1846 | 679.0094        | 0.3397   | 0.9969 |
| 256.7373      | 27.0  | 1917 | 679.5618        | 0.3407   | 0.9969 |
| 262.3969      | 28.0  | 1988 | 680.1168        | 0.3414   | 0.9969 |
| 266.2439      | 29.0  | 2059 | 681.0101        | 0.3421   | 0.9969 |
| 247.7932      | 30.0  | 2130 | 681.9800        | 0.3422   | 0.9969 |
| 246.8083      | 31.0  | 2201 | 682.8550        | 0.3416   | 0.9969 |
| 270.827       | 32.0  | 2272 | 683.9250        | 0.3434   | 0.9969 |
| 256.4384      | 33.0  | 2343 | 685.0451        | 0.3448   | 0.9969 |
| 270.461       | 34.0  | 2414 | 686.2427        | 0.3439   | 0.9969 |
| 253.8104      | 35.0  | 2485 | 687.4274        | 0.3441   | 0.9969 |
| 265.532       | 36.0  | 2556 | 688.4856        | 0.3451   | 0.9969 |
| 249.1426      | 37.0  | 2627 | 689.5027        | 0.3457   | 0.9969 |
| 229.5651      | 38.0  | 2698 | 690.4455        | 0.3455   | 0.9969 |
| 251.9008      | 39.0  | 2769 | 691.2324        | 0.3463   | 0.9969 |
| 281.8228      | 40.0  | 2840 | 691.7993        | 0.3464   | 0.9969 |
| 242.5272      | 41.0  | 2911 | 692.1788        | 0.3465   | 0.9969 |
| 229.5605      | 42.0  | 2982 | 692.3799        | 0.3465   | 0.9969 |
| 245.0876      | 43.0  | 3053 | 692.4745        | 0.3465   | 0.9969 |
| 271.22        | 44.0  | 3124 | 692.5084        | 0.3465   | 0.9969 |
| 244.3045      | 45.0  | 3195 | 692.5108        | 0.3465   | 0.9969 |
| 243.9542      | 46.0  | 3266 | 692.5128        | 0.3465   | 0.9969 |
| 274.6664      | 47.0  | 3337 | 692.5095        | 0.3465   | 0.9969 |
| 231.1361      | 48.0  | 3408 | 692.5107        | 0.3465   | 0.9969 |
| 274.5513      | 49.0  | 3479 | 692.5108        | 0.3465   | 0.9969 |
| 316.0833      | 50.0  | 3550 | 692.5107        | 0.3465   | 0.9969 |


### Framework versions

- Transformers 4.37.2
- Pytorch 2.1.2+cu118
- Datasets 2.16.1
- Tokenizers 0.15.0