title: CIDEr
emoji: 🐨
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
CIDEr Metric for Image Captioning Evaluation
CIDEr Description
The CIDEr (Consensus-based Image Description Evaluation) metric is widely used in image captioning to evaluate the quality of generated captions. It measures how well a generated caption agrees with a set of human-written reference captions. Each caption is represented as a vector of TF-IDF weighted n-gram counts, so n-grams that appear in the references of many images count less than distinctive ones. The score is the average cosine similarity between the candidate vector and each reference vector, combined over n-gram orders.
Following the original paper, the CIDEr score is defined as:
$ \text{CIDEr}(c_i, C) = \sum_{n=1}^{N} w_n \cdot \text{CIDEr}_n(c_i, C) $
$ \text{CIDEr}_n(c_i, C) = \frac{1}{|C|} \sum_{r \in C} \frac{\mathbf{g}^n(c_i) \cdot \mathbf{g}^n(r)}{\| \mathbf{g}^n(c_i) \| \, \| \mathbf{g}^n(r) \|} $
where:
- $ c_i $ is the candidate caption,
- $ C $ is the set of reference captions for the same image, and $ r $ is one reference in that set,
- $ N $ is the maximum n-gram order (typically 4, so $ n $ runs from 1 to 4),
- $ w_n $ is the weight for n-grams of order $ n $ (uniform, $ w_n = 1/N $, in the original paper),
- $ g_j $ represents the j-th n-gram of order $ n $,
- $ \mathbf{g}^n(s) $ is the TF-IDF vector of a caption $ s $, whose j-th entry is $ \text{TF}(g_j, s) \cdot \text{IDF}(g_j) $,
- $ \text{TF}(g_j, s) $ is the term frequency of the n-gram $ g_j $ in the caption $ s $ (its count, normalized by the number of n-grams in $ s $),
- $ \text{IDF}(g_j) $ is the inverse document frequency of the n-gram $ g_j $: the logarithm of the number of images in the dataset divided by the number of images whose reference captions contain $ g_j $.
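The computation can be sketched in plain Python. This is a minimal illustration of the cosine-similarity formulation from the original paper (Vedantam et al., 2015), not the code used by this metric: the function names are hypothetical, captions are assumed to be pre-tokenized lists of lowercase words, and the Gaussian length penalty controlled by sigma is left out.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def document_frequency(corpus_refs, n):
    """For each n-gram, count the images whose references contain it."""
    df = Counter()
    for refs in corpus_refs:
        seen = set()
        for ref in refs:
            seen.update(ngrams(ref, n))
        df.update(seen)
    return df


def tfidf(tokens, n, df, num_images):
    """TF-IDF vector of one caption, stored as a dict keyed by n-gram."""
    counts = ngrams(tokens, n)
    total = sum(counts.values())
    # Unseen n-grams get a document frequency of 1 to avoid dividing by zero.
    return {
        g: (c / total) * math.log(num_images / max(1, df[g]))
        for g, c in counts.items()
    }


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def cider_sketch(candidate, refs, corpus_refs, max_n=4):
    """Average TF-IDF cosine similarity over references and n-gram orders."""
    num_images = len(corpus_refs)
    score = 0.0
    for n in range(1, max_n + 1):
        df = document_frequency(corpus_refs, n)
        cand_vec = tfidf(candidate, n, df, num_images)
        sims = [cosine(cand_vec, tfidf(ref, n, df, num_images)) for ref in refs]
        score += (1.0 / max_n) * sum(sims) / len(refs)  # uniform w_n = 1/N
    return score
```

Here corpus_refs holds the reference captions of every image in the evaluation set, since the IDF term depends on the whole corpus rather than on a single image. With at least two images, a candidate identical to its only reference scores 1 (up to rounding), and a candidate sharing no n-grams with its references scores 0.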
How to Use
To use the CIDEr metric, load it with evaluate.load and pass the predicted and reference captions to its compute method. The metric tokenizes the captions and computes the CIDEr score.
Inputs
- predictions (list of str): The list of predicted captions generated by the model.
- references (list of list of str): A list of lists, where each inner list contains the reference captions for the corresponding prediction.
- n (int, optional, defaults to 4): The maximum n-gram order. N-grams of length 1 through n are used to build the caption representations.
- sigma (float, optional, defaults to 6.0): The standard deviation of the Gaussian length penalty (as in the CIDEr-D variant), which down-weights candidate and reference pairs whose lengths differ.
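To give a feel for the sigma parameter, the Gaussian length penalty can be sketched as below. The form of the penalty follows the CIDEr-D definition in the original paper; the function name is hypothetical, lengths are assumed to be measured in tokens, and how exactly this metric applies the penalty internally is an assumption.

```python
import math


def gaussian_length_penalty(candidate_len, reference_len, sigma=6.0):
    """Weight in (0, 1] that shrinks as the two caption lengths diverge."""
    delta = candidate_len - reference_len
    return math.exp(-(delta ** 2) / (2 * sigma ** 2))


# Equal lengths are not penalized; a difference of sigma tokens
# scales the similarity by exp(-0.5).
print(gaussian_length_penalty(7, 7))
print(gaussian_length_penalty(7, 13))
```

A larger sigma makes the metric more tolerant of length differences, while a smaller sigma penalizes them more sharply.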
Output Values
- CIDEr (float): The computed CIDEr score, which typically ranges between 0 and 100. Higher scores indicate better alignment between the predicted and reference captions.
Examples
>>> from evaluate import load
>>> cider_metric = load("Kamichanw/CIDEr")
>>> predictions = ["A cat sits on a mat."]
>>> references = [["A cat is sitting on a mat.", "A feline rests on the mat."]]
>>> score = cider_metric.compute(predictions=predictions, references=references)
>>> print(score['CIDEr'])
0.0
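A score of 0.0 for a near-identical caption may look surprising. One plausible explanation, assuming that IDF is estimated from the references passed to compute (as the reference COCO caption evaluation code does; this has not been checked against this metric's source), is that a corpus with a single image gives every n-gram an IDF of log(1/1) = 0. The sketch below reproduces that arithmetic for unigrams, with the reference captions lowercased and stripped of punctuation by hand.

```python
import math
from collections import Counter

# Token lists for the single example above (assumed tokenization).
references = [[
    "a cat is sitting on a mat".split(),
    "a feline rests on the mat".split(),
]]

num_images = len(references)  # only one image in this corpus

# Document frequency: number of images whose references contain each unigram.
df = Counter()
for refs in references:
    df.update({tok for ref in refs for tok in ref})

# With one image, every n-gram has df == 1, so every IDF is log(1 / 1) == 0.
idf = {tok: math.log(num_images / count) for tok, count in df.items()}
print(all(value == 0.0 for value in idf.values()))  # True
```

Under that assumption all TF-IDF vectors are zero, so the score carries no information. Evaluating on a corpus with many images gives meaningful IDF weights.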
Limitations and Bias
The CIDEr metric primarily focuses on the n-gram overlap between predicted and reference captions. It may not adequately capture semantic nuances or variations in phrasing that still convey the same meaning. Moreover, CIDEr tends to favor longer captions with more word overlap, potentially biasing against concise but accurate captions.
Citation
If you use the CIDEr metric in your research, please cite the original paper:
@inproceedings{vedantam2015cider,
title={Cider: Consensus-based image description evaluation},
author={Vedantam, Ramakrishna and Lawrence Zitnick, C and Parikh, Devi},
booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
pages={4566--4575},
year={2015}
}