cjber committed
Commit b9e65e6 · 1 Parent(s): 48ebd8a

feat: add eval doc


Former-commit-id: 8a1e2a05e7bbb21b994c65e9cd95cad2e4cf8e72 [formerly 0af2bc5586b798405e4aa7f1ec18e7faba86ab97]
Former-commit-id: 6fbc84bd2c4694f132d1e0958e768f9c4e3a9eb5

Files changed (2)
  1. reports/eval.pdf +0 -0
  2. reports/eval.qmd +44 -0
reports/eval.pdf ADDED
Binary file (16.5 kB).
 
reports/eval.qmd ADDED
@@ -0,0 +1,44 @@
+ ---
+ title: Evaluation of Summaries
+ author: Cillian Berragan
+ format: pdf
+ fontfamily: libertinus
+ monofont: 'JetBrains Mono'
+ monofontoptions:
+ - Scale=0.75
+ ---
+
+ This document compares the summaries written by Cambridge with those generated automatically by our model.
+
+ # Overview
+
+ For all representations, the original summary was compared with the summary generated by the LLM. A separate LLM call was then used to determine which of the two summaries was preferred, based on a fixed set of criteria:
+
+ > A good summary should:
+ > 1. **Be accurate** – It should not include information that is not present in the source document.
+ > 2. **Be comprehensive** – It should reflect all key points in the source document without omitting important details.
+ > 3. **Be well-grounded** – It should be based entirely on the source document without adding interpretations, opinions, or external information.
+
+ The judging model could return one of four scores: 0, meaning neither summary is suitable; 1, meaning the original summary is preferred; 2, meaning the LLM-generated summary is preferred; or 3, meaning both summaries are suitable.
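+
+ The judging script itself is not shown in this report; the sketch below illustrates what such a call might look like, assuming an OpenAI-style chat API. The model name, prompt wording, and the `judge` helper are illustrative assumptions, not the actual implementation.
+
+ ```python
+ # Hypothetical LLM-as-judge sketch (illustrative only, not the project's code).
+ from openai import OpenAI
+
+ client = OpenAI()
+
+ JUDGE_PROMPT = """You are comparing two summaries of the same source document.
+
+ A good summary should:
+ 1. Be accurate: include nothing that is absent from the source document.
+ 2. Be comprehensive: reflect all key points without omitting important details.
+ 3. Be well-grounded: add no interpretations, opinions, or external information.
+
+ Reply with a single digit:
+ 0 = neither summary is suitable
+ 1 = summary A (the original) is preferred
+ 2 = summary B (the LLM-generated one) is preferred
+ 3 = both summaries are suitable
+ """
+
+
+ def judge(source: str, original: str, generated: str) -> int:
+     """Return a 0-3 preference score comparing two summaries of `source`."""
+     response = client.chat.completions.create(
+         model="gpt-4o-mini",  # assumed model; the report does not name one
+         messages=[
+             {"role": "system", "content": JUDGE_PROMPT},
+             {
+                 "role": "user",
+                 "content": (
+                     f"Source document:\n{source}\n\n"
+                     f"Summary A:\n{original}\n\n"
+                     f"Summary B:\n{generated}"
+                 ),
+             },
+         ],
+     )
+     return int(response.choices[0].message.content.strip())
+ ```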
+
+ @tbl-eval gives the results of this evaluation. The majority of the preferred summaries are those generated by the LLM (score 2). There are, however, 8 cases where the original summary is considered better, and 17 where both summaries are considered suitable.
+
+ ```{python}
+ #| label: tbl-eval
+ #| tbl-cap: Comparison between original summary and LLM-generated summary
+ #| echo: false
+ #| output: asis
+
+ import polars as pl
+
+ summaries = pl.read_parquet("./data/out/eval.parquet")
+
+ # Tally the preference scores, transpose so each score becomes a column, and
+ # print the single row of counts as a markdown table. The column renaming
+ # assumes only scores 1 (original), 2 (LLM-generated), and 3 (both) occur.
+ print(
+     summaries["score"]
+     .value_counts()
+     .sort("score")
+     .transpose(include_header=True)
+     .rename({"column_0": "Original", "column_1": "LLM-generated", "column_2": "Both"})
+     .drop("column")
+     .tail(1)
+     .to_pandas()
+     .to_markdown(index=False)
+ )
+ ```