---
title: MeaningBERT
emoji: 🦀
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
  MeaningBERT is an automatic and trainable metric for assessing meaning
  preservation between sentences. See the project's README at
  https://github.com/GRAAL-Research/MeaningBERT/tree/main for more information.
---
# Here is MeaningBERT
MeaningBERT is an automatic and trainable metric for assessing meaning preservation between sentences. MeaningBERT was proposed in our article [MeaningBERT: assessing meaning preservation between sentences](https://www.frontiersin.org/articles/10.3389/frai.2023.1223924). Its goal is to assess meaning preservation between two sentences in a way that correlates highly with human judgments and passes sanity checks. For more details, refer to our publicly available article.
This public version of our model uses the best model we trained (whereas in our article, we report performance averaged over 10 models), trained for a longer period (1,000 epochs instead of 250). We later observed that the model can further reduce dev loss and increase performance with this longer training.
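For convenience, here is a minimal usage sketch through the Hugging Face `evaluate` library. The metric id `davebulaval/meaningbert` and the `predictions`/`references` argument names are assumptions based on the standard `evaluate` interface; verify the exact identifier and signature on this metric card.

```python
# Minimal usage sketch. Assumptions: the metric id and the
# predictions/references argument names follow the standard
# `evaluate` interface; verify both on this metric card.
import evaluate

meaning_bert = evaluate.load("davebulaval/meaningbert")

references = ["He wanted to make them pay."]
predictions = ["He wished to make them pay."]

result = meaning_bert.compute(predictions=predictions, references=references)
print(result)  # a meaning preservation rating on a 0-100 scale
```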
## Sanity Check
Correlation to human judgment is one way to evaluate the quality of a meaning preservation metric. However, it is inherently subjective, since it uses human judgment as a gold standard, and expensive, since it requires a large dataset annotated by several humans. As an alternative, we designed two automated tests: evaluating meaning preservation between identical sentences (which should be 100% preserving) and between unrelated sentences (which should be 0% preserving). In these tests, the meaning preservation target value is not subjective and does not require human annotation to measure. They represent a trivial and minimal threshold a good automatic meaning preservation metric should be able to achieve. Namely, a metric should be minimally able to return a perfect score (i.e., 100%) if two identical sentences are compared and return a null score (i.e., 0%) if two sentences are completely unrelated.
### Identical sentences
The first test evaluates meaning preservation between identical sentences. To analyze a metric's ability to pass this test, we count the number of times its rating is greater than or equal to a threshold value X∈[95, 99] and divide that count by the number of sentences, giving the ratio of cases where the metric returns the expected rating. To account for computer floating-point inaccuracy, we round the ratings to the nearest integer and do not use a threshold value of 100%.
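As an illustration, this ratio can be computed directly from a list of ratings. The following is a minimal sketch, not the project's evaluation code; the `ratings` values are hypothetical.

```python
# Identical-sentences sanity check: the share of ratings that, once
# rounded to the nearest integer, are greater than or equal to the
# threshold X (here X = 95).
def identical_pass_ratio(ratings: list[float], threshold: int = 95) -> float:
    passed = sum(1 for rating in ratings if round(rating) >= threshold)
    return passed / len(ratings)

# Hypothetical ratings from comparing each sentence with itself:
ratings = [99.7, 100.0, 96.2, 94.4]
print(identical_pass_ratio(ratings))  # 0.75 (94.4 rounds to 94 < 95)
```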
### Unrelated sentences
Our second test evaluates meaning preservation between a source sentence and an unrelated sentence generated by a large language model. The idea is to verify that the metric finds a meaning preservation rating of 0 when given a completely irrelevant sentence mainly composed of irrelevant words (also known as word soup). Since this test's expected rating is 0, we check that the metric rating is lower than or equal to a threshold value X∈[1, 5]. Again, to account for computer floating-point inaccuracy, we round the ratings to the nearest integer and do not use a threshold value of 0%.
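Symmetrically, this test counts ratings at or below the small threshold. Again a minimal sketch with hypothetical ratings:

```python
# Unrelated-sentences sanity check: the share of ratings that, once
# rounded to the nearest integer, are lower than or equal to the
# threshold X (here X = 5).
def unrelated_pass_ratio(ratings: list[float], threshold: int = 5) -> float:
    passed = sum(1 for rating in ratings if round(rating) <= threshold)
    return passed / len(ratings)

# Hypothetical ratings from comparing sentences with "word soup" text:
ratings = [0.3, 1.8, 6.9, 0.0]
print(unrelated_pass_ratio(ratings))  # 0.75 (6.9 rounds to 7 > 5)
```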
## Cite

Use the following citation to cite MeaningBERT:
```bibtex
@ARTICLE{10.3389/frai.2023.1223924,
  AUTHOR={Beauchemin, David and Saggion, Horacio and Khoury, Richard},
  TITLE={MeaningBERT: assessing meaning preservation between sentences},
  JOURNAL={Frontiers in Artificial Intelligence},
  VOLUME={6},
  YEAR={2023},
  URL={https://www.frontiersin.org/articles/10.3389/frai.2023.1223924},
  DOI={10.3389/frai.2023.1223924},
  ISSN={2624-8212},
}
```
## License
MeaningBERT is MIT licensed, as found in the LICENSE file.