---
tags:
- summarization
- summary
- booksum
- long-document
- long-form
- tglobal-xl
- XL
license:
- apache-2.0
- bsd-3-clause
datasets:
- kmfoda/booksum
metrics:
- rouge
inference: false
model-index:
- name: pszemraj/long-t5-tglobal-xl-16384-book-summary
  results:
  - task:
      type: summarization
      name: Summarization
    dataset:
      name: multi_news
      type: multi_news
      config: default
      split: test
    metrics:
    - name: ROUGE-1
      type: rouge
      value: 36.2043
      verified: true
    - name: ROUGE-2
      type: rouge
      value: 8.424
      verified: true
    - name: ROUGE-L
      type: rouge
      value: 17.3721
      verified: true
    - name: ROUGE-LSUM
      type: rouge
      value: 32.3994
      verified: true
    - name: loss
      type: loss
      value: 2.0843334197998047
      verified: true
    - name: gen_len
      type: gen_len
      value: 248.3572
      verified: true
---
# long-t5-tglobal-xl + BookSum

Summarize long text and get a SparkNotes-esque summary of arbitrary topics!

- Generalizes reasonably well to academic & narrative text.
- This is the XL checkpoint, which **from a human-evaluation perspective, [produces even better summaries](https://long-t5-xl-book-summary-examples.netlify.app/)**.

A simple example/use case with [the base model](https://huggingface.co/pszemraj/long-t5-tglobal-base-16384-book-summary) on an ASR transcript is [here](https://longt5-booksum-example.netlify.app/).
## Cheeky Proof-of-Concept

A summary of the [infamous navy seals copypasta](https://knowyourmeme.com/memes/navy-seal-copypasta):

> In this chapter, the monster explains how he intends to exact revenge on "the little b****" who insulted him. He tells the kiddo that he is a highly trained and experienced killer who will use his arsenal of weapons--including his access to the internet--to exact justice on the little brat.

While a somewhat crude example, try running this copypasta through other summarization models to see the difference in comprehension (_despite it not even being a "long" text!_).

---
## Description

A fine-tuned version of [google/long-t5-tglobal-xl](https://huggingface.co/google/long-t5-tglobal-xl) on the `kmfoda/booksum` dataset.

Read the paper by Guo et al.: [LongT5: Efficient Text-To-Text Transformer for Long Sequences](https://arxiv.org/pdf/2112.07916.pdf)

## How-To in Python
> 🚧 `LLM.int8()` appears to be compatible with summarization and does not degrade output quality; this is a crucial enabler for running this model on standard consumer GPUs. A PR for this is in progress [here](https://github.com/huggingface/transformers/pull/20341), and this model card will be updated with instructions once it lands :) 🚧
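In the meantime, a speculative sketch of what 8-bit loading could look like via the `bitsandbytes` integration (`load_in_8bit` is the generic transformers flag; this has not been verified with this checkpoint yet):

```python
# Speculative sketch: 8-bit loading via bitsandbytes
# (pip install bitsandbytes accelerate). Not yet verified with this
# checkpoint; see the PR linked above for status.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "pszemraj/long-t5-tglobal-xl-16384-book-summary"
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    load_in_8bit=True,  # quantize linear layers to int8 at load time
    device_map="auto",  # place weights across the available GPU(s)
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```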
Install/update transformers: `pip install -U transformers`

Summarize text with the `pipeline` API:
```python
import torch
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    "pszemraj/long-t5-tglobal-xl-16384-book-summary",
    device=0 if torch.cuda.is_available() else -1,
)

long_text = "Here is a lot of text I don't want to read. Replace me"

result = summarizer(long_text)
print(result[0]["summary_text"])
```
Pass [other parameters related to beam search text generation](https://huggingface.co/blog/how-to-generate) when calling `summarizer` to get even higher quality results.
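For instance, continuing the snippet above (the parameter values here are illustrative, not tuned for this model):

```python
# Illustrative beam-search settings; tune these for your documents.
result = summarizer(
    long_text,
    num_beams=4,             # beam search instead of greedy decoding
    early_stopping=True,     # stop once all beams have finished
    no_repeat_ngram_size=3,  # discourage verbatim repetition
    max_length=512,          # cap summary length (in tokens)
)
print(result[0]["summary_text"])
```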
---

## About

### Intended uses & limitations

While this model seems to improve factual consistency, **do not take summaries as foolproof; check anything that seems odd**.

Specifically, watch for negation statements (e.g., the model says _This thing does not have [ATTRIBUTE]_ where it should have said _This thing has a lot of [ATTRIBUTE]_).

- I'm sure someone will write a paper on this eventually (if there isn't one already), but you can usually fact-check such claims by comparing them against what the surrounding sentences imply.
### Training and evaluation data

The `kmfoda/booksum` dataset on Hugging Face - read [the original paper here](https://arxiv.org/abs/2105.08209).

- **Initial fine-tuning** used only examples with inputs of 12,288 tokens or fewer and outputs of 1,024 tokens or fewer (_i.e., longer rows were dropped before training_) for memory reasons. Per a brief analysis, examples in the 12,288-16,384 token range are a **small** minority of this dataset.
- In addition, this initial training combined the training and validation sets and trained on the aggregate to increase the effective dataset size. **Therefore, take validation-set results with a grain of salt; the primary metrics should (always) come from the test set.**
- The **final phases of fine-tuning** used the standard convention of 16,384 input / 1,024 output tokens, keeping all examples (longer sequences were truncated). This did not appear to change loss/performance much.
### Eval results

Official results computed with the [model evaluator](https://huggingface.co/spaces/autoevaluate/model-evaluator) will be posted here.

**Please read the note above: due to the training procedure, validation-set performance looks better than the test-set results will be.** The model achieves the following results on the evaluation set:

- eval_loss: 1.2756
- eval_rouge1: 41.8013
- eval_rouge2: 12.0895
- eval_rougeL: 21.6007
- eval_rougeLsum: 39.5382
- eval_gen_len: 387.2945
- eval_runtime: 13908.4995
- eval_samples_per_second: 0.107
- eval_steps_per_second: 0.027
```
***** predict/test metrics (initial) *****
predict_gen_len            = 506.4368
predict_loss               = 2.028
predict_rouge1             = 36.8815
predict_rouge2             = 8.0625
predict_rougeL             = 17.6161
predict_rougeLsum          = 34.9068
predict_runtime            = 2:04:14.37
predict_samples            = 1431
predict_samples_per_second = 0.192
predict_steps_per_second   = 0.048
```
\* Evaluating the big model is not as easy as it seems; a bit more investigation is in progress.

---
## FAQ

### How can I run inference with this on CPU?

lol (in all seriousness: at roughly 3B parameters, CPU inference is technically possible but impractically slow for long documents)
### How to run inference over a very long (30k+ tokens) document in batches?

See `summarize.py` in [the code for my hf space Document Summarization](https://huggingface.co/spaces/pszemraj/document-summarization/blob/main/summarize.py) :)

You can also use the same approach to split a document into batches of 4096 tokens, etc., and run the model over those; see the sketch below. This is useful in situations where CUDA memory is limited.
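A minimal sketch of the chunk-and-summarize idea (the chunk size and helper name are illustrative; the tuned logic lives in `summarize.py` linked above):

```python
import torch
from transformers import AutoTokenizer, pipeline

model_name = "pszemraj/long-t5-tglobal-xl-16384-book-summary"
tokenizer = AutoTokenizer.from_pretrained(model_name)
summarizer = pipeline(
    "summarization",
    model_name,
    device=0 if torch.cuda.is_available() else -1,
)

def summarize_in_batches(text: str, chunk_tokens: int = 4096) -> str:
    # Hypothetical helper: tokenize once, slice the token ids into
    # fixed-size chunks, and decode each chunk back to text.
    ids = tokenizer(text, truncation=False)["input_ids"]
    chunks = [
        tokenizer.decode(ids[i : i + chunk_tokens], skip_special_tokens=True)
        for i in range(0, len(ids), chunk_tokens)
    ]
    # Summarize each chunk independently and stitch the results together.
    summaries = [s["summary_text"] for s in summarizer(chunks, batch_size=1)]
    return "\n".join(summaries)
```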
### How can I fine-tune further?

See [train with a script](https://huggingface.co/docs/transformers/run_scripts) and [the summarization scripts](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization).
---

## Training procedure

### Updates

Updates to this model/model card will be posted here as relevant. The model seems fairly converged; if updates/improvements are possible using the `BookSum` dataset, this repo will be updated.
### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 0.0006
- train_batch_size: 1
- eval_batch_size: 1
- seed: 10350
- distributed_type: multi-GPU
- num_devices: 4
- gradient_accumulation_steps: 32
- total_train_batch_size: 128
- total_eval_batch_size: 4
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: constant
- num_epochs: 1.0
\*_Prior training sessions used roughly similar parameters (with higher learning rates); multiple sessions were required because this model takes eons to train._
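For reference, a rough sketch of how the settings above map onto `Seq2SeqTrainingArguments`; this is a hypothetical reconstruction, not the exact training invocation (the actual runs used the HF summarization example scripts launched across 4 GPUs):

```python
# Hypothetical reconstruction of the hyperparameters listed above.
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="long-t5-tglobal-xl-16384-booksum",  # illustrative path
    learning_rate=6e-4,
    per_device_train_batch_size=1,  # x 4 GPUs x 32 accumulation = 128 effective
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=32,
    lr_scheduler_type="constant",
    num_train_epochs=1.0,
    seed=10350,
)
```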
### Framework versions

- Transformers 4.25.0.dev0
- Pytorch 1.13.0+cu117
- Datasets 2.6.1
- Tokenizers 0.13.1

---