# Evaluation
Before reading this document, we recommend that you first read {external+mmengine:doc}`MMEngine: Model Accuracy Evaluation Basics <tutorials/evaluation>`.
## Metrics
MMOCR implements widely-used evaluation metrics for text detection, text recognition and key information extraction tasks based on the {external+mmengine:doc}`MMEngine: BaseMetric <design/evaluation>` base class. Users can specify the metric used in the validation and test phases by modifying the `val_evaluator` and `test_evaluator` fields in the configuration file. For example, the following config shows how to use `HmeanIOUMetric` to evaluate the model performance in the text detection task.
```python
val_evaluator = dict(type='HmeanIOUMetric')
test_evaluator = val_evaluator

# In addition, MMOCR also supports the combined evaluation of multiple metrics for the same task,
# such as using WordMetric and CharMetric at the same time
val_evaluator = [
    dict(type='WordMetric', mode=['exact', 'ignore_case', 'ignore_case_symbol']),
    dict(type='CharMetric')
]
```
More evaluation related configurations can be found in the [evaluation configuration tutorial](../user_guides/config.md#evaluation-configuration).
As shown in the following table, MMOCR currently supports 5 evaluation metrics for text detection, text recognition, and key information extraction tasks, including `HmeanIOUMetric`, `WordMetric`, `CharMetric`, `OneMinusNEDMetric`, and `F1Metric`.
| Metric | Task | Input Field | Output Field |
|--------|------|-------------|--------------|
| `HmeanIOUMetric` | TextDet | `pred_polygons`<br>`pred_scores`<br>`gt_polygons` | `recall`<br>`precision`<br>`hmean` |
| `WordMetric` | TextRec | `pred_text`<br>`gt_text` | `word_acc`<br>`word_acc_ignore_case`<br>`word_acc_ignore_case_symbol` |
| `CharMetric` | TextRec | `pred_text`<br>`gt_text` | `char_recall`<br>`char_precision` |
| `OneMinusNEDMetric` | TextRec | `pred_text`<br>`gt_text` | `1-N.E.D` |
| `F1Metric` | KIE | `pred_labels`<br>`gt_labels` | `macro_f1`<br>`micro_f1` |
In general, the evaluation metric used in each task is conventionally determined. Users usually do not need to understand or manually modify the internal implementation of the evaluation metric. However, to facilitate more customized requirements, this document will further introduce the specific implementation details and configurable parameters of the built-in metrics in MMOCR.
### HmeanIOUMetric
`HmeanIOUMetric` is one of the most widely used evaluation metrics in text detection tasks, because it calculates the harmonic mean (H-mean) between the detection precision (P) and recall rate (R). The `HmeanIOUMetric` can be calculated by the following equation:

```{math}
H = \frac{2}{\frac{1}{P} + \frac{1}{R}} = \frac{2PR}{P+R}
```
In addition, since it is equivalent to the F-score (also known as F-measure or F-metric) when {math}`\beta = 1`, `HmeanIOUMetric` is sometimes written as `F1Metric` or `f1-score`:

```{math}
F_1=(1+\beta^2)\cdot\frac{PR}{\beta^2\cdot P+R} = \frac{2PR}{P+R}
```
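As a quick sanity check of the formula, the following minimal sketch (plain Python with a hypothetical precision/recall pair, rather than real model output) evaluates the H-mean directly:

```python
# Minimal sketch: H-mean (F1) for a hypothetical precision/recall pair
precision, recall = 0.8, 0.6

hmean = 2 * precision * recall / (precision + recall)
print(f'hmean = {hmean:.4f}')  # hmean = 0.6857
```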
In MMOCR, the calculation of `HmeanIOUMetric` can be summarized as the following steps:
1. Filter out invalid predictions

   - Filter out predictions with a score lower than `pred_score_thrs`
   - Filter out predictions overlapping with `ignored` ground truth boxes with an overlap ratio higher than `ignore_precision_thr`

   It is worth noting that `pred_score_thrs` will automatically search for the best threshold within a certain range by default, and users can also customize the search range by manually modifying the configuration file:

   ```python
   # By default, HmeanIOUMetric searches the best threshold within the range [0.3, 0.9] with a step size of 0.1
   val_evaluator = dict(type='HmeanIOUMetric', pred_score_thrs=dict(start=0.3, stop=0.9, step=0.1))
   ```
2. Calculate the IoU matrix

   - At the data processing stage, `HmeanIOUMetric` will calculate and maintain an {math}`M \times N` IoU matrix `iou_metric` for the convenience of the subsequent bounding box pairing step. Here, M and N represent the number of label bounding boxes and filtered prediction bounding boxes, respectively. Therefore, each element of this matrix stores the IoU between the m-th label bounding box and the n-th prediction bounding box. (A toy sketch of this matrix and the subsequent pairing is shown after these steps.)
3. Compute the number of GT samples that can be accurately matched based on the corresponding pairing strategy

   Although `HmeanIOUMetric` can be calculated by a fixed formula, there may still be some subtle differences in the specific implementations. These differences mainly reflect the use of different strategies to match gt and predicted bounding boxes, which leads to the difference in final scores. Currently, MMOCR supports two matching strategies, namely `vanilla` and `max_matching`, for the `HmeanIOUMetric`. As shown below, users can specify the matching strategies in the config.

   - `vanilla` matching strategy

     By default, `HmeanIOUMetric` adopts the `vanilla` matching strategy, which is consistent with the `hmean-iou` implementation in MMOCR 0.x and the official text detection competition evaluation standard of the ICDAR series. The matching strategy adopts the first-come-first-served matching method to pair the labels and predictions.

     ```python
     # By default, HmeanIOUMetric adopts 'vanilla' matching strategy
     val_evaluator = dict(type='HmeanIOUMetric')
     ```

   - `max_matching` matching strategy

     To address the shortcomings of the existing matching mechanism, MMOCR has implemented a more efficient matching strategy to maximize the number of matches.

     ```python
     # Specify to use 'max_matching' matching strategy
     val_evaluator = dict(type='HmeanIOUMetric', strategy='max_matching')
     ```

   We recommend that research-oriented developers use the default `vanilla` matching strategy to ensure consistency with other papers. For industry-oriented developers, you can use the `max_matching` matching strategy to achieve optimized performance.
4. Compute the final evaluation score according to the aforementioned matching strategy
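To make steps 2-4 more concrete, here is a toy sketch of the IoU matrix and a first-come-first-served pairing. It is only an illustration under simplified assumptions (axis-aligned boxes built with `shapely`, a fixed IoU threshold of 0.5), not MMOCR's actual implementation:

```python
import numpy as np
from shapely.geometry import Polygon


def iou(poly_a: Polygon, poly_b: Polygon) -> float:
    """Intersection over union of two polygons."""
    inter = poly_a.intersection(poly_b).area
    union = poly_a.union(poly_b).area
    return inter / union if union > 0 else 0.0


# Toy data: M=2 ground-truth boxes and N=3 (already filtered) predictions
gt_polys = [Polygon([(0, 0), (10, 0), (10, 10), (0, 10)]),
            Polygon([(20, 0), (30, 0), (30, 10), (20, 10)])]
pred_polys = [Polygon([(1, 0), (11, 0), (11, 10), (1, 10)]),
              Polygon([(2, 0), (12, 0), (12, 10), (2, 10)]),
              Polygon([(50, 0), (60, 0), (60, 10), (50, 10)])]

# Step 2: M x N IoU matrix
iou_metric = np.array([[iou(g, p) for p in pred_polys] for g in gt_polys])

# Step 3 (vanilla-style, first-come-first-served): every gt and every
# prediction can be matched at most once; a pair counts if IoU >= 0.5
matched_gt, matched_pred = set(), set()
for m in range(iou_metric.shape[0]):
    for n in range(iou_metric.shape[1]):
        if m in matched_gt or n in matched_pred:
            continue
        if iou_metric[m, n] >= 0.5:
            matched_gt.add(m)
            matched_pred.add(n)
num_matched = len(matched_gt)

# Step 4: precision, recall and H-mean
precision = num_matched / len(pred_polys)
recall = num_matched / len(gt_polys)
hmean = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(iou_metric.round(2))
print(precision, recall, hmean)  # 0.333..., 0.5, 0.4
```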
### WordMetric
`WordMetric` implements word-level text recognition evaluation metrics and includes three text matching modes, namely `exact`, `ignore_case`, and `ignore_case_symbol`. Users can freely combine the output of one or more text matching modes in the configuration file by modifying the `mode` field.
```python
# Use WordMetric for text recognition task
val_evaluator = [
    dict(type='WordMetric', mode=['exact', 'ignore_case', 'ignore_case_symbol'])
]
```
- `exact`: Full matching mode, i.e., only when the predicted text and the ground truth text are exactly the same, the predicted text is considered to be correct.
- `ignore_case`: The mode ignores the case of the predicted text and the ground truth text.
- `ignore_case_symbol`: The mode ignores the case and symbols of the predicted text and the ground truth text. This is also the text recognition accuracy reported by most academic papers. The performance reported by MMOCR uses the `ignore_case_symbol` mode by default.
Assume that the real label is `MMOCR!` and the model output is `mmocr`. The `WordMetric` scores under the three matching modes are: `{'exact': 0, 'ignore_case': 0, 'ignore_case_symbol': 1}`.
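The three modes can be reproduced for this example with a few lines of plain Python. This is only a sketch of the idea, not MMOCR's implementation, and the symbol-stripping regex below is an assumption:

```python
import re

gt, pred = 'MMOCR!', 'mmocr'

# exact: strings must match verbatim
exact = int(pred == gt)

# ignore_case: compare after lower-casing
ignore_case = int(pred.lower() == gt.lower())

# ignore_case_symbol: additionally drop everything that is not a letter or digit
strip = lambda s: re.sub(r'[^A-Za-z0-9]', '', s).lower()
ignore_case_symbol = int(strip(pred) == strip(gt))

print({'exact': exact, 'ignore_case': ignore_case,
       'ignore_case_symbol': ignore_case_symbol})
# {'exact': 0, 'ignore_case': 0, 'ignore_case_symbol': 1}
```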
### CharMetric

`CharMetric` implements character-level text recognition evaluation metrics that are case-insensitive.
```python
# Use CharMetric for text recognition task
val_evaluator = [dict(type='CharMetric')]
```
Specifically, `CharMetric` will output two evaluation metrics, namely `char_precision` and `char_recall`. Let the number of correctly predicted characters (True Positive) be {math}`\sigma_{tp}`, then the precision P and recall R can be calculated by the following equation:

```{math}
P=\frac{\sigma_{tp}}{\sigma_{pred}}, R = \frac{\sigma_{tp}}{\sigma_{gt}}
```
where {math}`\sigma_{gt}` and {math}`\sigma_{pred}` represent the total number of characters in the label text and the predicted text, respectively.
For example, assume that the label text is "MMOCR" and the predicted text is "mm0cR1". The score of the `CharMetric` is:

```{math}
P=\frac{4}{6}, R=\frac{4}{5}
```
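These numbers can be roughly reproduced by counting the case-insensitively matching character blocks between the two strings, for instance with Python's `difflib.SequenceMatcher`. This is a sketch of the idea; MMOCR's internal counting may differ in details:

```python
from difflib import SequenceMatcher

gt, pred = 'MMOCR', 'mm0cR1'

# Count matching characters after lower-casing (case-insensitive)
matcher = SequenceMatcher(None, pred.lower(), gt.lower())
true_positive = sum(block.size for block in matcher.get_matching_blocks())

char_precision = true_positive / len(pred)  # 4 / 6
char_recall = true_positive / len(gt)       # 4 / 5
print(true_positive, char_precision, char_recall)
```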
### OneMinusNEDMetric
`OneMinusNEDMetric` (1-N.E.D) is commonly used for text recognition evaluation of Chinese or English text line-level annotations. Unlike the full matching metric that requires the prediction and the gt text to be exactly the same, `1-N.E.D` uses the normalized edit distance (also known as Levenshtein Distance) to measure the difference between the predicted and the gt text, so that the performance difference of the model can be better distinguished when evaluating long texts. Assume that the real and predicted texts are {math}`s_i` and {math}`\hat{s_i}`, respectively, and their lengths are {math}`l_{i}` and {math}`\hat{l_i}`, respectively. The `OneMinusNEDMetric` score can be calculated by the following formula:

```{math}
score = 1 - \frac{1}{N}\sum_{i=1}^{N}\frac{D(s_i, \hat{s_{i}})}{max(l_{i},\hat{l_{i}})}
```
where N is the total number of samples, and {math}`D(s_1, s_2)` is the edit distance between two strings.
For example, assume that the real label is "OpenMMLabMMOCR", the prediction of model A is "0penMMLabMMOCR", and the prediction of model B is "uvwxyz". The results of the full matching and `OneMinusNEDMetric` evaluation metrics are as follows:

|         | Full-match | 1 - N.E.D. |
|---------|------------|------------|
| Model A | 0          | 0.92857    |
| Model B | 0          | 0          |
As shown in the table above, although model A only predicted one letter incorrectly, both models got 0 when using the full-match strategy. However, the `OneMinusNEDMetric` evaluation metric can better distinguish the performance of the two models on long texts.
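The scores in the table can be reproduced with a plain dynamic-programming edit distance. The snippet below is a sketch of the formula, not MMOCR's implementation:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance D(a, b)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def one_minus_ned(preds, gts):
    """1 - N.E.D averaged over all samples."""
    scores = [1 - levenshtein(p, g) / max(len(p), len(g))
              for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)


gt = 'OpenMMLabMMOCR'
print(one_minus_ned(['0penMMLabMMOCR'], [gt]))  # 0.92857...
print(one_minus_ned(['uvwxyz'], [gt]))          # 0.0
```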
### F1Metric
`F1Metric` implements the F1-Metric evaluation metric for KIE tasks and provides two modes, namely `micro` and `macro`.
```python
val_evaluator = [
    dict(type='F1Metric', mode=['micro', 'macro'])
]
```
- `micro` mode: Calculate the global F1-Metric score based on the total number of True Positive, False Negative, and False Positive.
- `macro` mode: Calculate the F1-Metric score for each class and then take the average.
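The difference between the two modes can be illustrated with a small sketch over hypothetical per-class counts (plain Python, not MMOCR's implementation; the class names and counts are made up):

```python
# Hypothetical per-class counts of true positives, false positives and false negatives
counts = {
    'key':   dict(tp=90, fp=10, fn=5),
    'value': dict(tp=40, fp=20, fn=30),
}


def f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


# micro: pool the counts over all classes, then compute a single F1
micro_f1 = f1(sum(c['tp'] for c in counts.values()),
              sum(c['fp'] for c in counts.values()),
              sum(c['fn'] for c in counts.values()))

# macro: compute F1 per class, then average over classes
macro_f1 = sum(f1(**c) for c in counts.values()) / len(counts)

print(f'micro_f1={micro_f1:.4f}, macro_f1={macro_f1:.4f}')
```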
## Customized Metric
MMOCR supports the implementation of customized evaluation metrics for users who pursue higher customization. In general, users only need to create a customized evaluation metric class `CustomizedMetric` and inherit {external+mmengine:doc}`MMEngine: BaseMetric <design/evaluation>`. Then, the data format processing method `process` and the metric calculation method `compute_metrics` need to be overwritten respectively. Finally, add it to the `METRICS` registry to implement any customized evaluation metric.
```python
from typing import Dict, List, Sequence

from mmengine.evaluator import BaseMetric

from mmocr.registry import METRICS


@METRICS.register_module()
class CustomizedMetric(BaseMetric):

    def process(self, data_batch: Sequence[Dict], predictions: Sequence[Dict]):
        """process receives two parameters, data_batch stores the gt label
        information, and predictions stores the predicted results."""
        pass

    def compute_metrics(self, results: List):
        """compute_metrics receives the results of the process method as
        input and returns the evaluation results."""
        pass
```
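A metric registered this way can then be referenced in the config just like the built-in ones. The module path below is hypothetical and only shows one possible way (MMEngine's `custom_imports` mechanism) to make sure the class actually gets registered:

```python
# Hypothetical module path containing CustomizedMetric
custom_imports = dict(imports=['projects.customized_metric'], allow_failed_imports=False)

val_evaluator = dict(type='CustomizedMetric')
test_evaluator = val_evaluator
```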
More details can be found in {external+mmengine:doc}`MMEngine Documentation: BaseMetric <design/evaluation>`.