seonil committed
Commit 2885a60
1 Parent(s): eb64f0c
push to hfspace
- README.md +73 -13
- __pycache__/harim_scorer.cpython-39.pyc +0 -0
- app.py +6 -0
- harim_plus.py +121 -0
- harim_scorer.py +263 -0
- requirements.txt +56 -0
README.md
CHANGED
@@ -1,13 +1,73 @@
|
1 |
+
# HaRiM+
|
2 |
+
**HaRiM+: Evaluating Summary Quality with Hallucination Risk, accepted at AACL-22 [paper](https://arxiv.org/abs/2211.12118).** <br />
|
3 |
+
<br />
|
4 |
+
HaRiM+ is a reference-less metric for summarization that harnesses the power of a summarization model to estimate the quality of a summary-article pair. <br />
|
5 |
+
Note that this metric is reference-free and does not require training: it needs neither a reference text to compare the generation against nor any model training for scoring.
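
To make the scoring concrete, here is a minimal pure-Python sketch of how the per-token quantities combine, mirroring the arithmetic in `harim_scorer.py` from this commit for a single, unpadded summary. The helper name `harim_plus_from_probs` and the toy probabilities are illustrative assumptions; the real scorer obtains `p_s2s` (token probability given the article) and `p_lm` (token probability given an empty source) from the summarization model.

```python
import math

LAMBDA = 7.0  # default `mixing_factor` in harim_scorer.py


def harim_plus_from_probs(p_s2s, p_lm, lam=LAMBDA):
    """Hypothetical helper: combine per-token probabilities of one unpadded summary."""
    assert len(p_s2s) == len(p_lm) and len(p_s2s) > 0
    n = len(p_s2s)
    # length-normalized log-likelihood of the summary given the article
    log_ppl = sum(math.log(p) for p in p_s2s) / n
    # per-token HaRiM term: 1 - (1 - p_s2s) * (1 - (p_s2s - p_lm)) / 2
    harim_tok = [1 - (1 - ps) * (1 - (ps - pl)) / 2 for ps, pl in zip(p_s2s, p_lm)]
    harim = sum(harim_tok) / n
    return {"harim+": log_ppl + lam * harim, "harim": harim, "log_ppl": log_ppl}


# toy values: tokens well supported by the article vs. tokens the model is unsure about
confident = harim_plus_from_probs([0.9, 0.8], [0.1, 0.2])
uncertain = harim_plus_from_probs([0.3, 0.2], [0.3, 0.2])
print(confident["harim+"] > uncertain["harim+"])
```

In this sketch, tokens that the model finds likely given the article raise both terms, so the first toy summary scores higher than the second.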
|
6 |
+
|
7 |
+
## Quick Start
|
8 |
+
### install
|
9 |
+
```bash
|
10 |
+
# assumes torch, transformers, pandas, tqdm, fire and datasets are installed
|
11 |
+
pip install evaluate
|
12 |
+
# pip install -r requirements.txt
|
13 |
+
```
|
14 |
+
### example
|
15 |
+
```python
|
16 |
+
import evaluate
|
17 |
+
from pprint import pprint
|
18 |
+
|
19 |
+
# example from the paper
|
20 |
+
art = """Spain's 2-0 defeat by Holland on Tuesday brought back bitter memories of their disastrous 2014 World Cup, but coach Vicente del Bosque will not be too worried about a third straight friendly defeat, insists Gerard Pique. Holland, whose 5-1 drubbing of Spain in the group stage in Brazil last year marked the end of the Iberian nation's six-year domination of the world game, scored two early goals at the Amsterdam Arena and held on against some determined Spain pressure in the second half for a 2-0 success. They became the first team to inflict two defeats on Del Bosque since he took over in 2008 but the gruff 64-year-old had used the match to try out several new faces and he fielded a largely experimental, second-string team. Stefan de Vrij (right) headed Holland in front against Spain at the Amsterdam Arena on Tuesday Gerard Pique (left) could do nothing to stop Davy Klaassen doubling the Dutch advantage Malaga forward Juanmi and Sevilla midfielder Vitolo became the 55th and 56th players to debut under Del Bosque, while the likes of goalkeeper David de Gea, defenders Raul Albiol, Juan Bernat and Dani Carvajal and midfielder Mario Suarez all started the game. 'The national team's state of health is good,' centre back Gerard Pique told reporters. 'We are in a process where players are coming into the team and gathering experience,' added the Barcelona defender. 'We are second in qualifying (for Euro 2016) and these friendly games are for experimenting. 'I am not that worried about this match because we lost friendlies in previous years and then ended up winning titles.' David de Gea was given a start by Vicente del Bosque but could not keep out De Vrij's header here Dani Carvajal (centre) was another squad player given a chance to impress against Holland Del Bosque will be confident he can find the right mix of players to secure Spain's berth at Euro 2016 in France next year, when they will be chasing an unprecedented third straight title. 
Slovakia are the surprise leaders in qualifying Group C thanks to a 2-1 win over Spain in Zilina in October and have a maximum 15 points from five of 10 matches. Spain are second on 12 points, three ahead of Ukraine, who they beat 1-0 in Seville on Friday. Del Bosque's side host Slovakia in September in a match that could decide who goes through to the finals as group winners. 'The team is in good shape,' forward Pedro told reporters. 'We have a very clear idea of our playing style and we are able to count on people who are gradually making a place for themselves in the team.'"""
|
21 |
+
|
22 |
+
summaries = [
|
23 |
+
"holland beat spain 2-0 at the amsterdam arena on tuesday night . stefan de vrij and davy klaassen scored goals for holland . defeat recalls horror 5-1 defeat by holland at the world cup . vicente del bosque used game to give younger spain players a chance .",
|
24 |
+
"holland beat spain 2-0 in the group stage in brazil on tuesday night . del bosque will be hoping to find the right mix of players to the world cup . gerard pique could make the right mix of players to the tournament .",
|
25 |
+
"del bosque beat spain 2-0 at the amsterdam arena on tuesday night . stefan de vrij and davy klaassen scored goals for holland . defeat recalls horror 5-1 defeat by holland at the world cup . vicente del bosque used game to give younger spain players a chance .",
|
26 |
+
"holland could not beat spain 2-0 at the amsterdam arena on tuesday night . stefan de vrij and davy klaassen scored goals for holland . defeat recalls horror 5-1 defeat by holland at the world cup . vicente del bosque used game to give younger spain players a chance .",
|
27 |
+
]
|
28 |
+
articles = [art] * len(summaries)
|
29 |
+
|
30 |
+
scorer = evaluate.load('NCSOFT/harim_plus')
|
31 |
+
scores = scorer.compute(predictions = summaries, references = articles) # use_aggregator=False, tokenwise_score=False, bsz=32)
|
32 |
+
pprint(scores['harim+'])
|
33 |
+
# >>> [1.8230078220367432,
|
34 |
+
#      1.5361897945404053,
|
35 |
+
#      1.806436538696289,
|
36 |
+
#      1.7360382080078125
|
37 |
+
#     ]
|
38 |
+
|
39 |
+
```
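
Since the metric description notes that the score range is unbounded, the values are most naturally used to rank summaries of the same article. A small sketch that reuses the four scores from the example above (the variable names and short labels are illustrative and not part of the library):

```python
# scores copied from the example output above, in the order of `summaries`
harim_plus_scores = [
    1.8230078220367432,
    1.5361897945404053,
    1.806436538696289,
    1.7360382080078125,
]
# illustrative labels for the four summaries
labels = ["reference", "factuality 0", "wrong subject", "negation"]

# sort from highest (best) to lowest HaRiM+ score
ranked = sorted(zip(labels, harim_plus_scores), key=lambda pair: pair[1], reverse=True)
for rank, (label, score) in enumerate(ranked, start=1):
    print(f"{rank}. {label}: {score:.4f}")
```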
|
40 |
+
|
41 |
+
## Powering HaRiM+ score with other summarization model checkpoints
|
42 |
+
HaRiM+ accepts any encoder-decoder checkpoint compatible with <code>transformers.AutoModelForSeq2SeqLM</code>. <br />
|
43 |
+
In principle, the HaRiM+ score is expected to work on machine translation too. It does work there, but no better than BARTScore (Yuan et al.), whereas it excels at summarization.
|
44 |
+
|
45 |
+
```python
|
46 |
+
|
47 |
+
newharim = evaluate.load('NCSOFT/harim_plus', pretrained_name='local or ckpt name available')#, tokenizer=custom_tokenizer)
|
48 |
+
```
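
When passing a custom `tokenizer`, the metric's docstring says it needs `pad_token`, `eos_token`, `bos_token` and a `__call__` method. A hypothetical pre-flight check along those lines might look as follows; `check_tokenizer` and `DummyTokenizer` are illustrative names, not part of HaRiM+ or `transformers`.

```python
REQUIRED_ATTRS = ("pad_token", "eos_token", "bos_token")


def check_tokenizer(tokenizer):
    """Hypothetical helper: return a list of problems (an empty list means it looks usable)."""
    problems = [
        f"missing {name}"
        for name in REQUIRED_ATTRS
        if getattr(tokenizer, name, None) is None
    ]
    if not callable(tokenizer):
        problems.append("tokenizer is not callable")
    return problems


class DummyTokenizer:
    """Stand-in used only to exercise the check; not a real tokenizer."""

    pad_token, eos_token, bos_token = "<pad>", "</s>", None

    def __call__(self, texts, **kwargs):
        return texts


print(check_tokenizer(DummyTokenizer()))
```

Running such a check before `evaluate.load(...)` surfaces a missing special token early, instead of failing in the middle of scoring.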
|
49 |
+
|
50 |
+
## Speed and Resource requirements
|
51 |
+
HaRiM+ requires a GPU for practical speed, but it only loads the encoder-decoder model of your choice (default: facebook/bart-large-cnn). Empirically, its resource requirements and speed are similar to those of BERTScore.
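
One knob that affects memory use is the `bsz` argument of `scorer.compute` (default 32), which sets how many summary-article pairs go through the model per forward pass. The chunking itself is simple; below is a standalone sketch. The function name `chunk` is an illustrative assumption, intended to be equivalent in effect to `make_minibatches` in `harim_scorer.py`.

```python
from typing import List


def chunk(examples: List[str], bsz: int = 32) -> List[List[str]]:
    """Split `examples` into consecutive minibatches of at most `bsz` items."""
    if bsz <= 0:
        raise ValueError("bsz must be positive")
    return [examples[i:i + bsz] for i in range(0, len(examples), bsz)]


batches = chunk([f"summary {i}" for i in range(70)], bsz=32)
print([len(b) for b in batches])  # [32, 32, 6]
```

Lowering `bsz` trades throughput for a smaller peak memory footprint.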
|
52 |
+
|
53 |
+
## Citation
|
54 |
+
Please cite as follows:
|
55 |
+
```
|
56 |
+
@inproceedings{son-etal-2022-harim,
|
57 |
+
title = "{H}a{R}i{M}$^+$: Evaluating Summary Quality with Hallucination Risk",
|
58 |
+
author = "Son, Seonil (Simon) and
|
59 |
+
Park, Junsoo and
|
60 |
+
Hwang, Jeong-in and
|
61 |
+
Lee, Junghwa and
|
62 |
+
Noh, Hyungjong and
|
63 |
+
Lee, Yeonsoo",
|
64 |
+
booktitle = "Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing",
|
65 |
+
month = nov,
|
66 |
+
year = "2022",
|
67 |
+
address = "Online only",
|
68 |
+
publisher = "Association for Computational Linguistics",
|
69 |
+
url = "https://aclanthology.org/2022.aacl-main.66",
|
70 |
+
pages = "895--924",
|
71 |
+
abstract = "One of the challenges of developing a summarization model arises from the difficulty in measuring the factual inconsistency of the generated text. In this study, we reinterpret the decoder overconfidence-regularizing objective suggested in (Miao et al., 2021) as a hallucination risk measurement to better estimate the quality of generated summaries. We propose a reference-free metric, HaRiM+, which only requires an off-the-shelf summarization model to compute the hallucination risk based on token likelihoods. Deploying it requires no additional training of models or ad-hoc modules, which usually need alignment to human judgments. For summary-quality estimation, HaRiM+ records state-of-the-art correlation to human judgment on three summary-quality annotation sets: FRANK, QAGS, and SummEval. We hope that our work, which merits the use of summarization models, facilitates the progress of both automated evaluation and generation of summary.",
|
72 |
+
}
|
73 |
+
```
|
__pycache__/harim_scorer.cpython-39.pyc
ADDED
Binary file (7.05 kB).
|
|
app.py
ADDED
@@ -0,0 +1,6 @@
|
1 |
+
import evaluate
|
2 |
+
from evaluate.utils import launch_gradio_widget
|
3 |
+
|
4 |
+
|
5 |
+
module = evaluate.load("NCSOFT/harim_plus")
|
6 |
+
launch_gradio_widget(module)
|
harim_plus.py
ADDED
@@ -0,0 +1,121 @@
|
1 |
+
import datasets
|
2 |
+
import evaluate
|
3 |
+
|
4 |
+
from harim_scorer import Harimplus_Scorer
|
5 |
+
|
6 |
+
|
7 |
+
|
8 |
+
logger = evaluate.logging.get_logger(__name__)
|
9 |
+
|
10 |
+
CODEBASE_URL=''
|
11 |
+
PAPER_URL='TBA'
|
12 |
+
|
13 |
+
_CITATION = """\
|
14 |
+
@inproceedings{harimplus,
|
15 |
+
title={HaRiM+: Evaluating Summary Quality with Hallucination Risk},
|
16 |
+
author={Seonil Son and Junsoo Park and Jeong-in Hwang and Junghwa Lee and Hyungjong Noh and Yeonsoo Lee},
|
17 |
+
booktitle={AACL},
|
18 |
+
year={2022},
|
19 |
+
url={TBA}
|
20 |
+
}
|
21 |
+
"""
|
22 |
+
|
23 |
+
_DESCRIPTION = """\
|
24 |
+
HaRiM+ is a reference-less evaluation metric for summarization (i.e. scoring summary quality requires only the source article) that harnesses the power of a summarization model.
|
25 |
+
It works well for ranking summary-article pairs by quality.
|
26 |
+
Note that the score range is unbounded.
|
27 |
+
|
28 |
+
The summarization model inside HaRiM+ reads the paired source article and evaluates the quality of the given summary.
|
29 |
+
|
30 |
+
HaRiM+ has proved effective for benchmarking summarization systems (system-level performance) as well as for ranking article-summary pairs (segment-level performance) across aspects such as factuality, consistency, coherence, fluency, and relevance. For details, refer to our paper published at AACL 2022.
|
31 |
+
"""
|
32 |
+
|
33 |
+
_KWARGS_DESCRIPTION = """
|
34 |
+
HaRiM+ score.
|
35 |
+
Args:
|
36 |
+
For scorer = evaluate.load():
|
37 |
+
`pretrained_name` (str or pathlib.Path): summarization model checkpoint or path, loaded by transformers.AutoModelForSeq2SeqLM.from_pretrained(). Defaults to facebook/bart-large-cnn.
|
38 |
+
`tokenizer`: (use when your tokenizer cannot be loaded by from_pretrained) Tokenizer compatible with transformers.PreTrainedTokenizer. HaRiM+ score computation requires tokenizer.pad_token, tokenizer.eos_token, tokenizer.bos_token and the tokenizer.__call__() method.
|
39 |
+
|
40 |
+
For scorer.compute():
|
41 |
+
`predictions` (list of str): generated summaries
|
42 |
+
`references` (list of str): source articles to be summarized
|
43 |
+
`use_aggregator` (bool): if True, the average of the scores is returned
|
44 |
+
|
45 |
+
Returns:
|
46 |
+
'results' (dict): {
|
47 |
+
'harim+' (List[float] or float): HaRiM+ score to use,
|
48 |
+
'harim' (List[float] or float): HaRiM term for computing the score above,
|
49 |
+
'log_ppl' (List[float] or float): Log perplexity term. Same as (Yuan et al., NeurIPS 2021),
|
50 |
+
'lambda' (float): (recommend not to modify this) Balancing coeff. for computing harim+ from harim and log_ppl.
|
51 |
+
}
|
52 |
+
|
53 |
+
Examples:
|
54 |
+
>>> summaries = ["hello there", "hello there"]
|
55 |
+
>>> articles = ["hello, this is the article to be summarized", "hello, this is the article to be summarized"]
|
56 |
+
>>> scorer = evaluate.load("NCSOFT/harim_plus") #, pretrained_name='PRETRAINEDNAME', tokenizer=TOKENIZER # optional
|
57 |
+
>>> results = scorer.compute(predictions=summaries, references=articles) # use_aggregator=True # optional
|
58 |
+
>>> print([round(v, 2) for v in results["harim+"]])
|
59 |
+
[0.4, 0.4]
|
60 |
+
"""
|
61 |
+
|
62 |
+
|
63 |
+
|
64 |
+
@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
|
65 |
+
class Harimplus(evaluate.Metric):
|
66 |
+
def __init__(self,
|
67 |
+
pretrained_name='facebook/bart-large-cnn',
|
68 |
+
tokenizer=None,
|
69 |
+
device='cuda',
|
70 |
+
**kwargs
|
71 |
+
):
|
72 |
+
super().__init__(**kwargs)
|
73 |
+
self.myconfig = dict(
|
74 |
+
pretrained_name=pretrained_name,
|
75 |
+
tokenizer=tokenizer,
|
76 |
+
device=device,
|
77 |
+
)
|
78 |
+
|
79 |
+
def _info(self):
|
80 |
+
return evaluate.MetricInfo(
|
81 |
+
description=_DESCRIPTION,
|
82 |
+
citation=_CITATION,
|
83 |
+
homepage=CODEBASE_URL,
|
84 |
+
inputs_description=_KWARGS_DESCRIPTION,
|
85 |
+
features=datasets.Features(
|
86 |
+
{
|
87 |
+
"predictions": datasets.Value("string", id="sequence"),
|
88 |
+
"references": datasets.Value("string", id="sequence"),
|
89 |
+
}
|
90 |
+
),
|
91 |
+
codebase_urls=[CODEBASE_URL],
|
92 |
+
reference_urls=[CODEBASE_URL, PAPER_URL],
|
93 |
+
)
|
94 |
+
|
95 |
+
def _download_and_prepare(self, dl_manager):
|
96 |
+
pretrained_name = self.myconfig['pretrained_name']
|
97 |
+
is_custom_tokenizer = self.myconfig['tokenizer'] is not None
|
98 |
+
logger.warning(
|
99 |
+
"Loading HaRiM+ score"
|
100 |
+
f"\tpretrained_name = {pretrained_name}"
|
101 |
+
)
|
102 |
+
if is_custom_tokenizer:
|
103 |
+
logger.warning(
|
104 |
+
f"tokenizer is overridden by \n\t{self.myconfig['tokenizer']}"
|
105 |
+
)
|
106 |
+
logger.warning(
|
107 |
+
"You can change checkpoints with the `pretrained_name` kwarg in evaluate.load. We strongly recommend using *-large or larger ones. "
|
108 |
+
"Refrain from using checkpoints trained on a noisy corpus such as bbc-XSUM.")
|
109 |
+
|
110 |
+
# download the model checkpoint specified by self.myconfig_name and set up the scorer
|
111 |
+
self.scorer = Harimplus_Scorer(**self.myconfig)
|
112 |
+
|
113 |
+
def _compute(self, predictions=None,
|
114 |
+
references=None,
|
115 |
+
use_aggregator=False,
|
116 |
+
bsz=32,
|
117 |
+
tokenwise_score=False):
|
118 |
+
summaries = predictions
|
119 |
+
articles = references
|
120 |
+
scores = self.scorer.compute(predictions=summaries, references=articles, use_aggregator=use_aggregator, bsz=bsz, tokenwise_score=tokenwise_score)
|
121 |
+
return scores
|
harim_scorer.py
ADDED
@@ -0,0 +1,263 @@
|
1 |
+
import torch
|
2 |
+
import torch.nn.functional as F
|
3 |
+
from transformers import (AutoModelForSeq2SeqLM,
|
4 |
+
AutoTokenizer,
|
5 |
+
PreTrainedTokenizer,
|
6 |
+
PreTrainedTokenizerFast)
|
7 |
+
import evaluate
|
8 |
+
|
9 |
+
from fire import Fire
|
10 |
+
import pandas as pd
|
11 |
+
from tqdm import tqdm
|
12 |
+
import json
|
13 |
+
|
14 |
+
from typing import List, Dict, Union
|
15 |
+
from collections import defaultdict
|
16 |
+
from functools import partial
|
17 |
+
from pprint import pprint
|
18 |
+
|
19 |
+
# from ipdb import set_trace  # debugging only; uncomment if ipdb is installed
|
20 |
+
|
21 |
+
class Harimplus_Scorer:
|
22 |
+
def __init__(self,
|
23 |
+
pretrained_name:str='none',
|
24 |
+
tokenizer:Union[PreTrainedTokenizer, PreTrainedTokenizerFast]=None,
|
25 |
+
mixing_factor:float=7., # same as lambda in the paper
|
26 |
+
device:str='cuda',
|
27 |
+
|
28 |
+
src_maxlen=1024,
|
29 |
+
tgt_maxlen=110,
|
30 |
+
):
|
31 |
+
self._pretrained_name = pretrained_name
|
32 |
+
self._lambda = mixing_factor
|
33 |
+
|
34 |
+
self._device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
|
35 |
+
self._encdec_model = AutoModelForSeq2SeqLM.from_pretrained(self._pretrained_name)
|
36 |
+
if tokenizer is None:
|
37 |
+
self._tokenizer = AutoTokenizer.from_pretrained(self._pretrained_name)
|
38 |
+
else:
|
39 |
+
self._tokenizer = tokenizer
|
40 |
+
self._encdec_model.to(self._device)
|
41 |
+
self._encdec_model.eval()
|
42 |
+
|
43 |
+
self._src_maxlen = src_maxlen
|
44 |
+
self._tgt_maxlen = tgt_maxlen
|
45 |
+
|
46 |
+
|
47 |
+
|
48 |
+
def _prep_input(self, src_tgt_txts, src_or_tgt='src'):
|
49 |
+
L = self._src_maxlen if src_or_tgt=='src' else self._tgt_maxlen
|
50 |
+
if isinstance(src_tgt_txts, pd.Series):
|
51 |
+
src_tgt_txts=src_tgt_txts.tolist()
|
52 |
+
if src_or_tgt == 'src':
|
53 |
+
src_tgt_txts = [ s.replace("\n", " ") for s in src_tgt_txts ]
|
54 |
+
return self._tokenizer(src_tgt_txts, padding=True, truncation=True, max_length=L, return_tensors='pt') # ModelInput dataclass
|
55 |
+
|
56 |
+
|
57 |
+
'''below are helper functions w/o dependency to the self, but included inside the class for ease of use'''
|
58 |
+
def likelihoods(self, logits, force_decode_indices, tgt_mask):
|
59 |
+
probs = F.softmax(logits, dim=-1)
|
60 |
+
probs_force_decode_ = probs.gather(-1, force_decode_indices.unsqueeze(-1)).squeeze(-1)  # drop only the gathered dim, so a batch of size 1 keeps its shape
|
61 |
+
probs_force_decode= probs_force_decode_ * tgt_mask
|
62 |
+
assert probs_force_decode.shape == force_decode_indices.shape
|
63 |
+
return probs_force_decode
|
64 |
+
|
65 |
+
def log_likelihoods(self, logits, force_decode_indices, tgt_mask):
|
66 |
+
ll = F.log_softmax(logits, dim=-1)
|
67 |
+
ll_force_decode_ = ll.gather(-1, force_decode_indices.unsqueeze(-1)).squeeze(-1)  # drop only the gathered dim, so a batch of size 1 keeps its shape
|
68 |
+
ll_force_decode = ll_force_decode_ * tgt_mask
|
69 |
+
|
70 |
+
return ll_force_decode
|
71 |
+
|
72 |
+
def harim(self, s2s_logits, lm_logits, force_decode_indices, tgt_mask ):
|
73 |
+
p_s2s, p_lm = self.likelihoods(s2s_logits, force_decode_indices, tgt_mask), \
|
74 |
+
self.likelihoods(lm_logits, force_decode_indices, tgt_mask)
|
75 |
+
|
76 |
+
delta = p_s2s - p_lm
|
77 |
+
margin_linear = (1-delta) / 2
|
78 |
+
harim = -(1-p_s2s) * margin_linear + 1
|
79 |
+
return harim # this is -1 * hallucination risk
|
80 |
+
|
81 |
+
def make_minibatches(self, exs:List[str], bsz:int=32):
|
82 |
+
idx=0
|
83 |
+
minibatches = []
|
84 |
+
while True:
|
85 |
+
start = idx
|
86 |
+
end = idx+bsz
|
87 |
+
if start >= len(exs):
|
88 |
+
break
|
89 |
+
|
90 |
+
minibatches.append( exs[start:end] )
|
91 |
+
idx += bsz
|
92 |
+
return minibatches
|
93 |
+
|
94 |
+
def make_empty_minibatches(self, minibatches:List[List[str]]):
|
95 |
+
e_minibatches = minibatches.copy()
|
96 |
+
for i, mb in enumerate(e_minibatches):
|
97 |
+
e_minibatches[i] = ['' for ex in mb]
|
98 |
+
return e_minibatches
|
99 |
+
|
100 |
+
|
101 |
+
def compute(self, predictions:List[str],
|
102 |
+
references:List[str],
|
103 |
+
bsz:int=32,
|
104 |
+
use_aggregator:bool=False,
|
105 |
+
tokenwise_score:bool=False,
|
106 |
+
):
|
107 |
+
'''
|
108 |
+
returns harim+ score (List[float]) for predictions (summaries) and references (articles)
|
109 |
+
**Note**
|
110 |
+
- here, predictions = generated summaries to be evaluated, references = articles to be summarized (the kwarg is named "references" to follow the convention of the evaluate library)
|
111 |
+
- log_ppl is equivalent to BARTScore (Yuan et al., NeurIPS 2021)
|
112 |
+
|
113 |
+
if tokenwise_score:
|
114 |
+
returns minibatch chunks of harim+ scores and log-likelihoods with tokenized predictions (List[str])
|
115 |
+
if use_aggregator:
|
116 |
+
returned scores are aggregated (mean) over the given test set
|
117 |
+
'''
|
118 |
+
|
119 |
+
|
120 |
+
# tokenize/prep src/tgts
|
121 |
+
make_minibatches_bsz = partial(self.make_minibatches, bsz=bsz)
|
122 |
+
b_srcs, b_tgts = map(make_minibatches_bsz, [references, predictions])  # sources = articles (references), targets = summaries (predictions)
|
123 |
+
b_emps = self.make_empty_minibatches(b_srcs)
|
124 |
+
|
125 |
+
scores=defaultdict(list)
|
126 |
+
for mini_s, mini_e, mini_t in tqdm(zip(b_srcs, b_emps, b_tgts), total=len(b_tgts), desc=f"computing HaRiM+ {bsz=}, core={self._pretrained_name}"):
|
127 |
+
src_in = self._prep_input(mini_s, src_or_tgt='src')
|
128 |
+
emp_in = self._prep_input(mini_e, src_or_tgt='src')
|
129 |
+
tgt_in = self._prep_input(mini_t, src_or_tgt='tgt')
|
130 |
+
if emp_in.input_ids.shape[-1]==0: # emp_in.input_ids.shape == (32,0)
|
131 |
+
boseos = f"{self._tokenizer.bos_token}{self._tokenizer.eos_token}"
|
132 |
+
mini_e_ = [boseos for _ in range(len(mini_e))]
|
133 |
+
emp_in = self._prep_input( mini_e_, src_or_tgt='src' )
|
134 |
+
|
135 |
+
# if mini_s == b_srcs[0]:
|
136 |
+
# normal = src_in
|
137 |
+
# if mini_s == b_srcs[-1]:
|
138 |
+
# trailing = src_in
|
139 |
+
# set_trace()
|
140 |
+
|
141 |
+
src_in.data['labels'] = tgt_in.input_ids
|
142 |
+
emp_in.data['labels'] = tgt_in.input_ids
|
143 |
+
# print(f"{emp_in.data['labels']=}")
|
144 |
+
# set_trace()
|
145 |
+
tgt_mask = tgt_in.attention_mask
|
146 |
+
|
147 |
+
assert (tgt_in.attention_mask == (tgt_in.input_ids != self._tokenizer.pad_token_id)).all()
|
148 |
+
# src_in.data['decoder_input_ids'] = tgt_in.input_ids
|
149 |
+
# src_in.data['decoder_attention_mask'] = tgt_in.attention_mask
|
150 |
+
src_in = src_in.to(self._device)
|
151 |
+
emp_in = emp_in.to(self._device)
|
152 |
+
tgt_in = tgt_in.to(self._device)
|
153 |
+
tgt_mask = tgt_mask.to(self._device)
|
154 |
+
|
155 |
+
with torch.no_grad():
|
156 |
+
# token_type_ids attribute causes error
|
157 |
+
s2s_logits = self._encdec_model.forward(
|
158 |
+
input_ids = src_in.input_ids,
|
159 |
+
attention_mask = src_in.attention_mask,
|
160 |
+
labels = tgt_in.input_ids,
|
161 |
+
# decoder_input_ids = tgt_in.input_ids,
|
162 |
+
# decoder_attention_mask = tgt_in.attention_mask,
|
163 |
+
return_dict=True).logits
|
164 |
+
lm_logits = self._encdec_model.forward(
|
165 |
+
input_ids = emp_in.input_ids,
|
166 |
+
attention_mask = emp_in.attention_mask,
|
167 |
+
labels = tgt_in.input_ids,
|
168 |
+
# decoder_input_ids = tgt_in.input_ids,
|
169 |
+
# decoder_attention_mask = tgt_in.attention_mask,
|
170 |
+
return_dict=True).logits
|
171 |
+
sent_lengths = tgt_mask.sum(-1)
|
172 |
+
ll_tok = self.log_likelihoods(s2s_logits, src_in.labels, tgt_mask)
|
173 |
+
ll = ll_tok.sum(-1) / sent_lengths
|
174 |
+
|
175 |
+
harim_tok = self.harim(s2s_logits, lm_logits, src_in.labels, tgt_mask)
|
176 |
+
harim = harim_tok.sum(-1) / sent_lengths
|
177 |
+
|
178 |
+
harim_plus_normalized = ll + self._lambda * harim # loglikelihood + lambda * negative_harim (negative harim=-1* risk)
|
179 |
+
|
180 |
+
scores['harim+'].extend(harim_plus_normalized.tolist())
|
181 |
+
scores['harim'].extend(harim.tolist())
|
182 |
+
scores['log_ppl'].extend(ll.tolist())
|
183 |
+
|
184 |
+
if tokenwise_score:
|
185 |
+
scores['tok_harim+'].append(harim_tok*self._lambda + ll_tok)
|
186 |
+
scores['tok_predictions'].append( [self._tokenizer.convert_ids_to_tokens(idxs.tolist()) for idxs in src_in.labels] )
|
187 |
+
|
188 |
+
if use_aggregator: # after
|
189 |
+
for k, v in scores.items():
|
190 |
+
if not k.startswith('tok_'):
|
191 |
+
scores[k] = sum(v)/len(v) # aggregate (mean)
|
192 |
+
scores['lambda'] = self._lambda
|
193 |
+
return scores
|
194 |
+
|
195 |
+
|
196 |
+
|
197 |
+
def test(bsz = 16, pretrained_name='facebook/bart-large-cnn', tokenizer=None):
|
198 |
+
if tokenizer is None:
|
199 |
+
scorer = Harimplus_Scorer(pretrained_name=pretrained_name)
|
200 |
+
else:
|
201 |
+
scorer = Harimplus_Scorer(pretrained_name=pretrained_name, tokenizer=tokenizer)
|
202 |
+
|
203 |
+
art1 = """The respected law professor from Philadelphia now being investigated after allegedly emailing students a link to pornographic footage, was once a contestant on Who Wants to Be a Millionaire, it has emerged. Lisa McElroy, a 50-year-old Drexel professor, appeared on the show in 2010 while it was still hosted my Meredith Vieira. And like her apparent March 31 email mishap, her game show appearance ended with a very public mistake. McElroy, who teaches legal writing, got tripped up on the $12,500 level after flying through the first few questions, notes Philly.com. Wishes she was a millionaire: Drexel law profesor professor Lisa McElroy allegedly sent a link to a pornographic website to her students. In 2010, she appeared on the TV game show Who Wants to Be a Milionaire Mother of two: The mother of two shared an anecdote with then-host Meredith Vieira about having to scramble to find a babysitter for her kids and someone to teach her class after learning she was to appear on the show just two days before taping Lost it: McElroy was tripped up on the $12,500 question. Despite having used two lifelines, she answered wrong and walked away with around $5,000 The questions read: 'As a result of General Motor’s bankruptcy declaration in 2009, what foreign government became one of its largest shareholders?' Even after using two of her lifelines to narrow down the answer, McElroy answered China, which was incorrect. The correct answer was Canada. She walked away with around $5,000. McElroy, who is a children's book and biography author, is apparently also a mother. She opened the appearance by sharing an anecdote with Vieira about having to scramble to find a babysitter after being informed she was chosen to be on Millionaire jsut two days prior to taping. She's accused of sending the inappropriate message this past March 31 under the subject line: 'Great article on writing briefs.' 
However, when recipients opened the enclosed link, philly.com reports that they were directed to a video of 'a woman engaging in a sexually explicit act'. Lisa McElroy, 50, who teaches legal writing at Drexel University, reportedly sent the inappropriate message on March 31 baring the subject line: 'Great article on writing briefs' Following a number of complaints, the college issued an apology to students. The message read: 'As you may be aware, some students erroneously received an email this morning directing them to a... post that included some inappropriate material. 'We take this matter seriously and apologize for any upset it may have caused.' The university says federal law requires it investigate all reports of inappropriate behaviors of a sexual nature. McElroy did not immediately respond to an email sent to her university account by the Associated Press. When recipients opened the enclosed link, philly.com reports that they were directed to a video of 'a woman engaging in a sexually explicit act' It's not the first time the married mother-of-two has appeared in the spotlight. She is also an accomplished author with a number of published biographies and children's books. On her website, www.lisamcelroy.com, she describes herself as a 'Supreme Court junkie.' She adds that her favorites ways of relaxing include 'crawling under the covers with a dog or two and a really good book' or 'hanging out' with her two adolescent daughters. Regarding the recent email scandal, David Lat - a lawyer and legal commenter -suggests she could have been 'hacked' or made a 'copy/paste error'. While an internal investigation gets underway, it's been reported that McElroy has been placed on administrative leave. While an internal investigation gets underway, it's been reported that McElroy has been placed on administrative leave from Drexel University (seen above)"""
|
204 |
+
art2 = """Spain's 2-0 defeat by Holland on Tuesday brought back bitter memories of their disastrous 2014 World Cup, but coach Vicente del Bosque will not be too worried about a third straight friendly defeat, insists Gerard Pique. Holland, whose 5-1 drubbing of Spain in the group stage in Brazil last year marked the end of the Iberian nation's six-year domination of the world game, scored two early goals at the Amsterdam Arena and held on against some determined Spain pressure in the second half for a 2-0 success. They became the first team to inflict two defeats on Del Bosque since he took over in 2008 but the gruff 64-year-old had used the match to try out several new faces and he fielded a largely experimental, second-string team. Stefan de Vrij (right) headed Holland in front against Spain at the Amsterdam Arena on Tuesday Gerard Pique (left) could do nothing to stop Davy Klaassen doubling the Dutch advantage Malaga forward Juanmi and Sevilla midfielder Vitolo became the 55th and 56th players to debut under Del Bosque, while the likes of goalkeeper David de Gea, defenders Raul Albiol, Juan Bernat and Dani Carvajal and midfielder Mario Suarez all started the game. 'The national team's state of health is good,' centre back Gerard Pique told reporters. 'We are in a process where players are coming into the team and gathering experience,' added the Barcelona defender. 'We are second in qualifying (for Euro 2016) and these friendly games are for experimenting. 'I am not that worried about this match because we lost friendlies in previous years and then ended up winning titles.' David de Gea was given a start by Vicente del Bosque but could not keep out De Vrij's header here Dani Carvajal (centre) was another squad player given a chance to impress against Holland Del Bosque will be confident he can find the right mix of players to secure Spain's berth at Euro 2016 in France next year, when they will be chasing an unprecedented third straight title. 
Slovakia are the surprise leaders in qualifying Group C thanks to a 2-1 win over Spain in Zilina in October and have a maximum 15 points from five of 10 matches. Spain are second on 12 points, three ahead of Ukraine, who they beat 1-0 in Seville on Friday. Del Bosque's side host Slovakia in September in a match that could decide who goes through to the finals as group winners. 'The team is in good shape,' forward Pedro told reporters. 'We have a very clear idea of our playing style and we are able to count on people who are gradually making a place for themselves in the team.'"""
|
205 |
+
|
206 |
+
summaries = [
|
207 |
+
"lisa mcelroy , 50 , who teaches legal writing at drexel university , reportedly sent the ` inappropriate ' message on march 31 . when recipients clicked the enclosed link , they were allegedly directed to a video of ' a woman engaging in a sexually explicit act ' . mcelroy appeared on the popular game show in 2010 with then-host meredith vieira but lost the game after reaching just $ 12,500 . along with teaching law , mcelroy is also an accomplished author with a number of published biographies and children 's books . has been placed on leave while school investigates .", # reference 2.3270
|
208 |
+
"lisa mcelroy, a 50-year-old drexel professor, appeared on the show in 2010 while it was still hosted my meredith vieira. she's accused of sending the inappropriate message this past march 31 under the subject line: 'great article on writing briefs' when recipients opened the enclosed link, philly.com reports that they were directed to a video of 'a woman engaging in a sexually explicit act' the married mother-of-two has been placed on administrative leave.", # BART-large+cnn 4.9714
|
209 |
+
"lisa mcelroy , 50 , who teaches legal writing at drexel university , appeared on the show in 2010 while it was still hosted my meredith vieira . she got tripped up on the $ 12,500 level after flying through the first few questions , philly.com reports . mcelroy answered wrong and walked away with around $ 5,000 .", # BERTSUM=Factual 3.2028
"lisa mcelroy , 50 , who teaches legal writing at philadelphia university , reportedly sent the ` inappropriate ' message on march 31 . when recipients clicked the enclosed link , they were allegedly directed to a video of ' a woman engaging in a sexually explicit act ' . mcelroy appeared on the popular game show in 2010 with then-host meredith vieira but lost the game after reaching just $ 12,500 . along with teaching law , mcelroy is also an accomplished author with a number of published biographies and children 's books . has been placed on leave while school investigates .", # wrong subj (philadelphia) 2.2122
"lisa mcelroy , 50 , who teaches legal writing at drexel university , reportedly did not send the ` inappropriate ' message on march 31 . when recipients clicked the enclosed link , they were allegedly directed to a video of ' a woman engaging in a sexually explicit act ' . mcelroy appeared on the popular game show in 2010 with then-host meredith vieira but lost the game after reaching just $ 12,500 . along with teaching law , mcelroy is also an accomplished author with a number of published biographies and children 's books . has been placed on leave while school investigates .", # negation 2.2022
"holland beat spain 2-0 at the amsterdam arena on tuesday night . stefan de vrij and davy klaassen scored goals for holland . defeat recalls horror 5-1 defeat by holland at the world cup . vicente del bosque used game to give younger spain players a chance .",# reference
"holland beat spain 2-0 in the group stage in brazil on tuesday night . del bosque will be hoping to find the right mix of players to the world cup . gerard pique could make the right mix of players to the tournament .",# summary (factuality = 0, rnn)
"del bosque beat spain 2-0 at the amsterdam arena on tuesday night . stefan de vrij and davy klaassen scored goals for holland . defeat recalls horror 5-1 defeat by holland at the world cup . vicente del bosque used game to give younger spain players a chance .",# reference + wrong subj
"holland could not beat spain 2-0 at the amsterdam arena on tuesday night . stefan de vrij and davy klaassen scored goals for holland . defeat recalls horror 5-1 defeat by holland at the world cup . vicente del bosque used game to give younger spain players a chance .",# reference + negation
]
articles = [ art1 ]*5 + [art2 ]*4
# set_trace()
hp_score = scorer.compute(predictions=summaries, references=articles, use_aggregator=False, bsz=bsz)
# pprint(f"{articles=}")
# pprint(f"{summaries=}")
pprint(hp_score)

'''
## drexel example
# reference 2.3270
# BART-large+cnn 4.9714
# BERTSUM=Factual 3.2028
# ref + wrong subj (philadelphia) 2.2122
# ref + negation 2.2022

'harim+': [1.6270232200622559,
           1.7585878372192383,
           1.3859858512878418,
           1.5434350967407227,
           1.609492301940918],

## main table result

1.6247 (reference, factual)
0.1173 (rnn, unfactual)
1.3229 (ref + wrong subj)
1.4132 (ref + negation)

'harim+': [1.8230078220367432,
           1.5361897945404053,
           1.806436538696289,
           1.7360382080078125],

'''

if __name__ == '__main__':
    Fire(test)
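The expected values recorded in the docstring of `test` can be read as label/score pairs. The following is a minimal sketch of that reading, not part of the commit: `rank_by_score` is a hypothetical helper, and it assumes the scores returned with `use_aggregator=False` come back in the same order as `summaries`. The numbers are copied from the second `'harim+'` list above.

```python
from typing import List, Sequence, Tuple


def rank_by_score(labels: Sequence[str], scores: Sequence[float]) -> List[Tuple[str, float]]:
    """Pair each label with its score and sort from highest to lowest score."""
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    return sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)


# Labels follow the order of the four Spain/Holland summaries in `summaries`
# (assumption: the returned scores keep that order).
spain_labels = [
    "reference",
    "rnn (factuality = 0)",
    "reference + wrong subj",
    "reference + negation",
]
# Values copied from the second 'harim+' list in the docstring.
spain_scores = [
    1.8230078220367432,
    1.5361897945404053,
    1.806436538696289,
    1.7360382080078125,
]

ranking = rank_by_score(spain_labels, spain_scores)
for label, score in ranking:
    print(f"{score:.4f}  {label}")
```

Under these assumptions the reference summary ranks first and the RNN summary last, which matches the ordering of the "main table result" values even though the absolute numbers differ.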
requirements.txt
ADDED
@@ -0,0 +1,56 @@
aiohttp==3.8.3
aiosignal==1.3.1
asttokens==2.1.0
async-timeout==4.0.2
attrs==22.1.0
backcall==0.2.0
certifi==2022.9.24
charset-normalizer==2.1.1
datasets==2.6.1
decorator==5.1.1
dill==0.3.5.1
evaluate==0.3.0
executing==1.2.0
filelock==3.8.0
fire==0.4.0
frozenlist==1.3.3
fsspec==2022.10.0
huggingface-hub==0.10.1
idna==3.4
ipython==8.6.0
jedi==0.18.1
matplotlib-inline==0.1.6
multidict==6.0.2
multiprocess==0.70.13
numpy==1.23.4
packaging==21.3
pandas==1.5.1
parso==0.8.3
pexpect==4.8.0
pickleshare==0.7.5
prompt-toolkit==3.0.32
ptyprocess==0.7.0
pure-eval==0.2.2
pyarrow==10.0.0
Pygments==2.13.0
pyparsing==3.0.9
python-dateutil==2.8.2
pytz==2022.6
PyYAML==6.0
regex==2022.10.31
requests==2.28.1
responses==0.18.0
six==1.16.0
stack-data==0.6.0
termcolor==2.1.0
tokenizers==0.13.2
toml==0.10.2
torch==1.12.1+cu113
tqdm==4.64.1
traitlets==5.5.0
transformers==4.24.0
typing_extensions==4.4.0
urllib3==1.26.12
wcwidth==0.2.5
xxhash==3.1.0
yarl==1.8.1