feat: add eval doc
Former-commit-id: 8a1e2a05e7bbb21b994c65e9cd95cad2e4cf8e72 [formerly 0af2bc5586b798405e4aa7f1ec18e7faba86ab97]
Former-commit-id: 6fbc84bd2c4694f132d1e0958e768f9c4e3a9eb5
- reports/eval.pdf +0 -0
- reports/eval.qmd +44 -0
reports/eval.pdf
ADDED
Binary file (16.5 kB)
reports/eval.qmd
ADDED
@@ -0,0 +1,44 @@
---
title: Evaluation of Summaries
author: Cillian Berragan
format: pdf
fontfamily: libertinus
monofont: 'JetBrains Mono'
monofontoptions:
  - Scale=0.75
---

This document compares the summaries written by Cambridge with the summaries generated automatically by our model.

# Overview

For all representations, the original summary was compared with the generated summary provided by the LLM. A separate LLM call was then used to determine which of the two summaries was preferred, based on a set of criteria:

> A good summary should:
> 1. **Be accurate** – It should not include information that is not present in the source document.
> 2. **Be comprehensive** – It should reflect all key points in the source document without omitting important details.
> 3. **Be well-grounded** – It should be based entirely on the source document without adding interpretations, opinions, or external information.

The judging model could return one of four scores: 0 meaning neither summary is suitable, 1 meaning the original summary is preferred, 2 meaning the LLM-generated summary is preferred, and 3 meaning both summaries are suitable.
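
To make the judging step concrete, the sketch below shows roughly how such a pairwise comparison call can be made. It is a minimal illustration only: the client, model name, prompt wording, and response parsing are assumptions, not the exact call used for this report.

```python
from openai import OpenAI  # assumed client; any chat-completion API works similarly

client = OpenAI()

JUDGE_PROMPT = """You are comparing two summaries of the same source document.
A good summary is accurate, comprehensive, and well-grounded.
Reply with a single digit:
0 = neither summary is suitable
1 = Summary A (the original) is preferred
2 = Summary B (the LLM-generated one) is preferred
3 = both summaries are suitable

Source document:
{document}

Summary A:
{original}

Summary B:
{generated}"""


def judge(document: str, original: str, generated: str) -> int:
    """Score one original/generated summary pair with a separate LLM call."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder: the judge model is not named here
        temperature=0,
        messages=[
            {
                "role": "user",
                "content": JUDGE_PROMPT.format(
                    document=document, original=original, generated=generated
                ),
            }
        ],
    )
    return int(response.choices[0].message.content.strip())
```

Applying `judge()` to every document and summary pair yields the per-document scores analysed below.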

@tbl-eval gives the results of this processing. The majority of the preferred summaries are those generated by the LLM (score 2). There are, however, 8 cases where the original summary is considered better, and 17 where both summaries are considered suitable.

```{python}
#| label: tbl-eval
#| tbl-cap: Comparison between original summary and LLM-generated summary
#| echo: false
#| output: asis

import polars as pl

summaries = pl.read_parquet("./data/out/eval.parquet")

# Count each score, pivot the counts into a single row, and print it as a
# markdown table. The positional rename assumes the scores present are
# exactly 1, 2 and 3 (no score-0 rows), so the transposed columns line up
# with the three labels.
print(
    summaries["score"]
    .value_counts()
    .sort("score")
    .transpose(include_header=True)
    .rename({"column_0": "Original", "column_1": "LLM-generated", "column_2": "Both"})
    .drop("column")
    .tail(1)
    .to_pandas()
    .to_markdown(index=False)
)
```
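
One caveat on the cell above: the positional rename (`column_0` → `Original`, and so on) only labels the table correctly if score 0 never occurs, since a stray 0 would silently shift every column. A hypothetical, more defensive way to build the same one-row table is to map scores to labels explicitly:

```python
import polars as pl

summaries = pl.read_parquet("./data/out/eval.parquet")

# Explicit score -> label mapping: an unexpected score raises a KeyError
# instead of silently mislabelling the columns.
labels = {0: "Neither", 1: "Original", 2: "LLM-generated", 3: "Both"}

counts = {
    labels[score]: n
    for score, n in summaries["score"].value_counts().sort("score").iter_rows()
}
print(pl.DataFrame([counts]).to_pandas().to_markdown(index=False))
```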