metadata

title: CIDEr
emoji: 🐨
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false

CIDEr Metric for Image Captioning Evaluation

CIDEr Description

The CIDEr (Consensus-based Image Description Evaluation) metric is widely used in image captioning to evaluate the quality of generated captions. It measures how well a generated caption agrees with a set of human-written reference captions by comparing their n-grams. Each n-gram is weighted by TF-IDF, so n-grams that appear in the references for many images count for less than n-grams that are distinctive for a particular image. The per-order scores for n-grams of length 1 to $ N $ are then combined with the weights $ w_n $.

The formula for the CIDEr metric is as follows:

$ \text{CIDEr}(c_i, C) = \frac{1}{N} \sum_{n=1}^{N} w_n \cdot \frac{\sum_{j=1}^{m} \text{IDF}(g_j) \cdot \text{TF}(g_j, c_i)}{\sum_{j=1}^{m} \text{IDF}(g_j) \cdot \text{TF}(g_j, C)} $

where:

$ c_i $ is the candidate caption,
$ C $ is the set of reference captions,
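$ N $ is the maximum n-gram order (typically 4, so $ n $ ranges from 1 to 4),
$ w_n $ is the weight for n-grams of order $ n $ (uniform, $ w_n = 1/N $, in the original paper),
$ g_j $ is the $ j $-th of the $ m $ distinct n-grams of order $ n $,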
$ \text{TF}(g_j, c_i) $ is the term frequency of the n-gram $ g_j $ in the candidate caption $ c_i $,
$ \text{TF}(g_j, C) $ is the term frequency of the n-gram $ g_j $ in the reference captions $ C $,
$ \text{IDF}(g_j) $ is the inverse document frequency of the n-gram $ g_j $.
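
To make the TF-IDF weighting concrete, here is a minimal sketch in plain Python. It follows the cosine-similarity formulation of the cited paper (average similarity between TF-IDF vectors of the candidate and each reference, with uniform weights $ w_n = 1/N $) rather than the compact ratio written above. It assumes lowercased whitespace tokenization and at least one reference per candidate, and it leaves out any length penalty or score scaling. The function names `ngrams` and `cider_sketch` are illustrative and are not part of this Space's API.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cider_sketch(candidates, references, max_n=4):
    """Simplified CIDEr: mean TF-IDF cosine similarity, averaged over n-gram orders.

    Assumption: whitespace tokenization, no length penalty, no score scaling.
    """
    cand_tok = [c.lower().split() for c in candidates]
    ref_tok = [[r.lower().split() for r in refs] for refs in references]
    num_images = len(ref_tok)
    scores = [0.0] * len(cand_tok)

    for n in range(1, max_n + 1):
        # Document frequency: number of images whose references contain the n-gram.
        doc_freq = Counter()
        for refs in ref_tok:
            seen = set()
            for r in refs:
                seen.update(ngrams(r, n))
            doc_freq.update(seen)

        def tfidf(tokens):
            counts = ngrams(tokens, n)
            total = sum(counts.values()) or 1
            return {
                g: (c / total) * math.log(num_images / max(1, doc_freq[g]))
                for g, c in counts.items()
            }

        def cosine(u, v):
            dot = sum(w * v.get(g, 0.0) for g, w in u.items())
            norm_u = math.sqrt(sum(w * w for w in u.values()))
            norm_v = math.sqrt(sum(w * w for w in v.values()))
            return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

        for i, (cand, refs) in enumerate(zip(cand_tok, ref_tok)):
            cand_vec = tfidf(cand)
            sims = [cosine(cand_vec, tfidf(r)) for r in refs]
            # Uniform weight w_n = 1/N for each n-gram order.
            scores[i] += (sum(sims) / len(sims)) / max_n

    return scores
```

In this sketch the IDF term is log(number of images / document frequency), so a corpus containing a single image gives every n-gram an IDF of log(1) = 0 and every score collapses to 0.0. Scores only become informative once several images are evaluated together. Whether the actual implementation behaves the same way is an assumption, not something the text above confirms.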

How to Use

To use the CIDEr metric, load it with load from the evaluate library and pass the predicted and reference captions to its compute method. The metric tokenizes the captions and computes the CIDEr score.

Inputs

predictions (list of str): The list of predicted captions generated by the model.
references (list of list of str): The list of lists, where each list contains the reference captions corresponding to each prediction.
n (int, optional, defaults to 4): The maximum n-gram order. N-grams of length 1 through n are used to represent each caption.
sigma (float, optional, defaults to 6.0): The standard deviation of the Gaussian penalty applied to the length difference between a prediction and its references.
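
To give a feel for what sigma controls, the sketch below assumes the penalty has the Gaussian form used in the CIDEr-D variant, exp(-(length difference)^2 / (2 * sigma^2)). The function name `gaussian_length_penalty` is hypothetical, and the exact way this metric applies the penalty is not stated above.

```python
import math


def gaussian_length_penalty(candidate_len, reference_len, sigma=6.0):
    """Multiplicative penalty in (0, 1]; equals 1.0 when the lengths match.

    Assumption: the CIDEr-D form exp(-delta^2 / (2 * sigma^2)).
    """
    delta = candidate_len - reference_len
    return math.exp(-(delta ** 2) / (2 * sigma ** 2))


# Compare how quickly the penalty falls off for a small and a large sigma.
for sigma in (2.0, 6.0):
    penalties = [round(gaussian_length_penalty(length, 7, sigma), 3) for length in (7, 9, 13)]
    print(sigma, penalties)
```

A smaller sigma punishes length mismatches more sharply, while a larger sigma makes the metric more tolerant of captions that are longer or shorter than the references.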

Output Values

CIDEr (float): The computed CIDEr score, which typically ranges between 0 and 100. Higher scores indicate better alignment between the predicted and reference captions.

Examples

>>> from evaluate import load
>>> cider_metric = load("Kamichanw/CIDEr")
>>> predictions = ["A cat sits on a mat."]
>>> references = [["A cat is sitting on a mat.", "A feline rests on the mat."]]
>>> score = cider_metric.compute(predictions=predictions, references=references)
>>> print(score['CIDEr'])
0.0

Limitations and Bias

The CIDEr metric primarily focuses on the n-gram overlap between predicted and reference captions. It may not adequately capture semantic nuances or variations in phrasing that still convey the same meaning. Moreover, CIDEr tends to favor longer captions with more word overlap, potentially biasing against concise but accurate captions.
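
The first limitation can be illustrated without the metric itself. The sketch below uses a deliberately crude stand-in, the fraction of distinct candidate words that also occur in the reference, to show that a faithful paraphrase can share no surface tokens with a reference. The helper `unigram_overlap` and the example captions are made up for illustration and are not how CIDEr is computed.

```python
def unigram_overlap(candidate, reference):
    """Fraction of distinct candidate words that also occur in the reference."""
    cand = set(candidate.lower().split())
    ref = set(reference.lower().split())
    return len(cand & ref) / len(cand) if cand else 0.0


reference = "a cat is sitting on a mat"
paraphrase = "the feline rests upon the rug"  # similar meaning, different words
near_copy = "a cat is sitting on a rug"       # different object, same wording

print(unigram_overlap(paraphrase, reference))
print(unigram_overlap(near_copy, reference))
```

Any overlap-based score starts from this kind of signal, so the paraphrase is penalized even though it describes the same scene, while the near copy is rewarded despite naming a different object.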

Citation

If you use the CIDEr metric in your research, please cite the original paper:

@inproceedings{vedantam2015cider,
  title={{CIDEr}: Consensus-based image description evaluation},
  author={Vedantam, Ramakrishna and Zitnick, C. Lawrence and Parikh, Devi},
  booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
  pages={4566--4575},
  year={2015}
}
