---
license: mit
pipeline_tag: text-classification
inference: false
---

# Official ICC model [ACL 2024 Findings]

The official checkpoint of the ICC model, introduced in [ICC: Quantifying Image Caption Concreteness for Multimodal Dataset Curation](https://arxiv.org/abs/2403.01306).

[Project Page](https://moranyanuka.github.io/icc/)

## Usage

The ICC model quantifies the concreteness of image captions; its intended use is selecting the best captions in a noisy multimodal dataset. This can be done by running the model over the captions and filtering out samples with low scores. It works best in conjunction with CLIP-based filtering.


### Running the model

<details>
<summary> Click to expand </summary>

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("moranyanuka/icc")
model = AutoModelForSequenceClassification.from_pretrained("moranyanuka/icc").to("cuda")

captions = ["a great method of quantifying concreteness", "a man with a white shirt"]
text_ids = tokenizer(captions, padding=True, return_tensors="pt", truncation=True).to('cuda')
with torch.inference_mode():
  icc_scores = model(**text_ids)['logits']  # one concreteness score per caption

# tensor([[0.0339], [1.0068]])
```
</details>
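
### Filtering a caption set (sketch)

Below is a minimal sketch of the score-based filtering described above: score each caption with the model and keep only those above a cutoff. The batching helper, the `batch_size`, and the `threshold` value of `0.5` are illustrative assumptions, not values prescribed by the paper.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("moranyanuka/icc")
model = AutoModelForSequenceClassification.from_pretrained("moranyanuka/icc").to("cuda")

def filter_by_icc(captions, threshold=0.5, batch_size=64):
    """Keep captions whose ICC score exceeds `threshold` (illustrative cutoff)."""
    kept = []
    for i in range(0, len(captions), batch_size):
        batch = captions[i:i + batch_size]
        text_ids = tokenizer(batch, padding=True, truncation=True, return_tensors="pt").to("cuda")
        with torch.inference_mode():
            # logits has shape (batch, 1); squeeze to get one score per caption
            scores = model(**text_ids)["logits"].squeeze(-1)
        kept.extend(c for c, s in zip(batch, scores.tolist()) if s > threshold)
    return kept

concrete_captions = filter_by_icc(["a great method of quantifying concreteness",
                                   "a man with a white shirt"])
```

In practice you would combine the surviving captions with a CLIP-based image-text similarity filter, as noted above.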



BibTeX:
```
@misc{yanuka2024icc,
      title={ICC: Quantifying Image Caption Concreteness for Multimodal Dataset Curation}, 
      author={Moran Yanuka and Morris Alper and Hadar Averbuch-Elor and Raja Giryes},
      year={2024},
      eprint={2403.01306},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```