Update README.md

6d4035b over 1 year ago

14.1 kB

	---
	license:
	- apache-2.0
	- bsd-3-clause
	tags:
	- summarization
	- summary
	- booksum
	- long-document
	- long-form
	- tglobal-xl
	- XL
	datasets:
	- kmfoda/booksum
	metrics:
	- rouge
	inference: false
	model-index:
	- name: pszemraj/long-t5-tglobal-xl-16384-book-summary
	results:
	- task:
	type: summarization
	name: Summarization
	dataset:
	name: multi_news
	type: multi_news
	config: default
	split: test
	metrics:
	- type: rouge
	value: 36.2043
	name: ROUGE-1
	verified: true
	verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiYzRmMmUyOTVjMmJmZTRiZDcyYzY3MTQ1MmUyNDA5NjVhYzEzYzBiNzcxYTRhMDQ3OTlhMGZjYmJlNDM1M2NjYyIsInZlcnNpb24iOjF9._uArOQ1_0znXDPXMq7unA1OHB-XbgqzzKRbFRcVUzTUJdWk26LiSa2pEEVNNmJPg6Uo7CAvONmhpEswLvl9TAg
	- type: rouge
	value: 8.424
	name: ROUGE-2
	verified: true
	verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiNzg0MzljYjVjYWQ3MmRkZDBlOGI5M2RiMGU0M2UwZGUzMDg2NTU0NjcwMTNiN2ZmODEzNTQ0MmEwNDA3NDA5MSIsInZlcnNpb24iOjF9.Dzj85ld6TjosQ8KyUdoadzicMLedEFrICC6Q-08O3qx28d9B9Uke1zw-VWabiuesPEDTRGbWuBgPA5vxYWUZAw
	- type: rouge
	value: 17.3721
	name: ROUGE-L
	verified: true
	verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiNDA3ZjZmODAwMTNlM2RlZmJlMDI5MGVkMGRkMTBjMTYzNDk5ZjFiNTY5MWE1MDUwNWI2MDE4ZDA2YWMwMmI2NCIsInZlcnNpb24iOjF9.MOV_nId0XAK1eMQssG5GN9DsitZaTrxl4jdCJnOg9EZ0-vAw227ln599YV5YfZ1OPJnWwek6rneqqyONiHn9AQ
	- type: rouge
	value: 32.3994
	name: ROUGE-LSUM
	verified: true
	verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiZmY3MDMwOTZjNWI0YTk1MDgwMzJkYTFiN2U5YWU0Mzc0MWRiMzc1NzZlMDhjMWUwMmY2ODI2MjI5ODBkYWUxOSIsInZlcnNpb24iOjF9._BwGIZbcA4pUBkEAL0cW-JPPta0KSoGug4Z7vogHacUz-AEhIOI5ICUldZh0pt9OK67MpUSzpShJOu3rSt5YDQ
	- type: loss
	value: 2.0843334197998047
	name: loss
	verified: true
	verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiOWFhMmE5ZjA3ODM4YmVjMDMyMjk5YjNlMjA1MGMzOWY0NTRlYzk1YjZiMzQxMDMxOTMwMjFkNTdmNjM1NDcyMyIsInZlcnNpb24iOjF9.3wbXV4CIIgnfXAnnRztdOR12PwsWsEfiglQQ09K-C1EgW4gai4x9l-wTE2OZ7CTWkuk_tr4tL_uqOCXLZRMtCQ
	- type: gen_len
	value: 248.3572
	name: gen_len
	verified: true
	verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiMWZhOGMwMDJjNGU2MzA2YzI1OWU1ZDY5N2NjZmM1YTA5NDg1MzUwNmU1YTBhNjQyNWYwYzA3OGNmODFjMmE2NSIsInZlcnNpb24iOjF9.Rc9u89zCdbFnjsnmq65l_JvCtUwOX_ZWapKJpTZ-rC8HxcUVfi2Ash2QfvvvxHH_YWhwklxxdnNa0HCm46qLAA
	- task:
	type: summarization
	name: Summarization
	dataset:
	name: billsum
	type: billsum
	config: default
	split: test
	metrics:
	- name: ROUGE-1
	type: rouge
	value: 41.3645
	verified: true
	- name: ROUGE-2
	type: rouge
	value: 16.144
	verified: true
	- name: ROUGE-L
	type: rouge
	value: 24.2981
	verified: true
	- name: ROUGE-LSUM
	type: rouge
	value: 35.3234
	verified: true
	- name: loss
	type: loss
	value: 1.282260775566101
	verified: true
	- name: gen_len
	type: gen_len
	value: 291.8158
	verified: true
	- task:
	type: summarization
	name: Summarization
	dataset:
	name: ccdv/arxiv-summarization
	type: ccdv/arxiv-summarization
	config: document
	split: test
	metrics:
	- name: ROUGE-1
	type: rouge
	value: 36.3225
	verified: true
	- name: ROUGE-2
	type: rouge
	value: 9.3743
	verified: true
	- name: ROUGE-L
	type: rouge
	value: 19.8396
	verified: true
	- name: ROUGE-LSUM
	type: rouge
	value: 32.2532
	verified: true
	- name: loss
	type: loss
	value: 2.146871566772461
	verified: true
	- name: gen_len
	type: gen_len
	value: 186.2966
	verified: true
	---

	# long-t5-tglobal-xl + BookSum

	<a href="https://colab.research.google.com/gist/pszemraj/c19e32baf876deb866c31cd46c86e893/long-t5-xl-accelerate-test.ipynb">
	<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
	</a>

	Summarize long text and get a SparkNotes-like summary of any topic!

	- Generalizes reasonably well to academic & narrative text.
	- This is the XL checkpoint, which produces even better summaries [from a human evaluation perspective](https://long-t5-xl-book-summary-examples.netlify.app/).

	A simple example/use case with [the base model](https://huggingface.co/pszemraj/long-t5-tglobal-base-16384-book-summary) on ASR is [here](https://longt5-booksum-example.netlify.app/).

	## Cheeky Proof-of-Concept

	A summary of the [infamous navy seals copypasta](https://knowyourmeme.com/memes/navy-seal-copypasta):

	> In this chapter, the monster explains how he intends to exact revenge on "the little b\\\\" who insulted him. He tells the kiddo that he is a highly trained and experienced killer who will use his arsenal of weapons--including his access to the internet--to exact justice on the little brat.

	While this is a crude example, try running this copypasta through other summarization models to see the difference in comprehension (_even though it's not even a "long" text!_).

	* * *

	Contents

	<!-- TOC -->

	- [Description](#description)
	- [How-To in Python](#how-to-in-python)
	- [Beyond the basics](#beyond-the-basics)
	- [Adjusting parameters](#adjusting-parameters)
	- [LLM.int8 Quantization](#llmint8-quantization)
	- [About](#about)
	- [Intended uses & limitations](#intended-uses--limitations)
	- [Training and evaluation data](#training-and-evaluation-data)
	- [Eval results](#eval-results)
	- [FAQ](#faq)
	- [How can I run inference with this on CPU?](#how-can-i-run-inference-with-this-on-cpu)
	- [How to run inference over a very long (30k+ tokens) document in batches?](#how-to-run-inference-over-a-very-long-30k-tokens-document-in-batches)
	- [How to fine-tune further?](#how-to-fine-tune-further)
	- [Are there simpler ways to run this?](#are-there-simpler-ways-to-run-this)
	- [Training procedure](#training-procedure)
	- [Updates](#updates)
	- [Training hyperparameters](#training-hyperparameters)
	- [Framework versions](#framework-versions)

	<!-- /TOC -->

	* * *

	## Description

	A fine-tuned version of [google/long-t5-tglobal-xl](https://huggingface.co/google/long-t5-tglobal-xl) on the `kmfoda/booksum` dataset.

	Read the paper by Guo et al. here: [LongT5: Efficient Text-To-Text Transformer for Long Sequences](https://arxiv.org/pdf/2112.07916.pdf)

	## How-To in Python

	install/update transformers `pip install -U transformers`

	summarize text with pipeline:

	```python
	import torch
	from transformers import pipeline

	summarizer = pipeline(
	"summarization",
	"pszemraj/long-t5-tglobal-xl-16384-book-summary",
	device=0 if torch.cuda.is_available() else -1,
	)
	long_text = "Here is a lot of text I don't want to read. Replace me"

	result = summarizer(long_text)
	print(result[0]["summary_text"])
	```

	### Beyond the basics

	There are two additional points to consider beyond simple inference: adjusting decoding parameters for improved performance, and quantization for reduced memory consumption.

	#### Adjusting parameters

	Pass [other parameters related to beam search textgen](https://huggingface.co/blog/how-to-generate) when calling `summarizer` to get even higher quality results.

	#### LLM.int8 Quantization

	> alternative section title: how to get this monster to run inference on free colab runtimes

	Via [this PR](https://github.com/huggingface/transformers/pull/20341) LLM.int8 is now supported for `long-t5` models.

	- per initial tests the summarization quality seems to hold while using _significantly_ less memory! \*
	- a version of this model quantized to int8 is [already on the hub here](https://huggingface.co/pszemraj/long-t5-tglobal-xl-16384-book-summary-8bit) so if you're using the 8-bit version anyway, you can start there for a 3.5 gb download only!

	First, make sure you have the latest versions of the relevant packages:
	```bash
	pip install -U transformers bitsandbytes accelerate
	```

	load in 8-bit (_magic completed by `bitsandbytes` behind the scenes_)

	```python
	from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

	tokenizer = AutoTokenizer.from_pretrained(
	"pszemraj/long-t5-tglobal-xl-16384-book-summary"
	)

	model = AutoModelForSeq2SeqLM.from_pretrained(
	"pszemraj/long-t5-tglobal-xl-16384-book-summary",
	load_in_8bit=True,
	device_map="auto",
	)
	```

	The above is already present in the Colab demo linked at the top of the model card.

	\* More rigorous metrics-based research comparing beam-search summarization with and without LLM.int8 will take place over time.

	* * *

	## About

	### Intended uses & limitations

	While this model seems to improve factual consistency, don't take summaries as foolproof and check things that seem odd.

	Specifically: negation statements (i.e., the model says: _this thing does not have [ATTRIBUTE]_, when instead it should have said _this thing has lots of [ATTRIBUTE]_).

	- I'm sure someone will write a paper on this eventually (if there isn't one already), but you can usually check this by comparing a particular statement with what the surrounding sentences imply.

	### Training and evaluation data

	`kmfoda/booksum` dataset on HuggingFace - read [the original paper here](https://arxiv.org/abs/2105.08209).

	- For initial fine-tuning, only input text with 12288 input tokens or less and 1024 output tokens or less was used (_i.e. lines longer than that were dropped before training_) for memory reasons. After a quick analysis, summaries in the 12288-16384 range are in the small minority in this dataset.
	- In addition, this initial training combined the training and validation sets and trained on them in aggregate to increase the functional dataset size. Therefore, take the validation set results with a grain of salt; primary metrics should (always) be the test set..
	- The final stages of fine-tuning used the standard 16384 input/1024 output conventions, preserving the standard in/out lengths (_and truncating longer sequences_). This did not seem to change the loss/performance much.

	### Eval results

	Official results with the [model evaluator](https://huggingface.co/spaces/autoevaluate/model-evaluator) will be computed and posted here.

	Please read the note above, as due to the training methods, the performance on the validation set looks better than the results on the test set will be. The model achieves the following results on the evaluation set:

	- eval_loss: 1.2756
	- eval_rouge1: 41.8013
	- eval_rouge2: 12.0895
	- eval_rougeL: 21.6007
	- eval_rougeLsum: 39.5382
	- eval_gen_len: 387.2945
	- eval_runtime: 13908.4995
	- eval_samples_per_second: 0.107
	- eval_steps_per_second: 0.027


	*** predict/test metrics (initial) ***
	predict_gen_len = 506.4368
	predict_loss = 2.028
	predict_rouge1 = 36.8815
	predict_rouge2 = 8.0625
	predict_rougeL = 17.6161
	predict_rougeLsum = 34.9068
	predict_runtime = 2:04:14.37
	predict_samples = 1431
	predict_samples_per_second = 0.192
	predict_steps_per_second = 0.048

	\* evaluating big model not as easy as it seems. Doing a bit more investigating

	* * *

	## FAQ

	### How can I run inference with this on CPU?

	lol

	### How to run inference over a very long (30k+ tokens) document in batches?

	See `summarize.py` in [the code for my hf space Document Summarization](https://huggingface.co/spaces/pszemraj/document-summarization/blob/main/summarize.py) :)

	You can also use the same code to split a document into batches of 4096, etc., and iterate over them with the model. This is useful in situations where CUDA memory is limited.

	Update: see the section on the `textsum` package below.

	### How to fine-tune further?

	See [train with a script](https://huggingface.co/docs/transformers/run_scripts) and [the summarization scripts](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization)

	### Are there simpler ways to run this?

	For this reason, I created a Python package utility. It's called [textsum](https://github.com/pszemraj/textsum), and you can use it to load models and summarize things in a few lines of code.

	```sh
	pip install textsum
	```

	Use `textsum` in python with this model:

	```python
	from textsum.summarize import Summarizer

	summarizer = Summarizer(
	model_name_or_path="pszemraj/long-t5-tglobal-xl-16384-book-summary"
	)

	long_string = "This is a long string of text that will be summarized."
	out_str = summarizer.summarize_string(long_string)
	print(f"summary: {out_str}")
	```

	This package provides easy-to-use interfaces for applying summarization models to text documents of arbitrary length. Currently implemented interfaces include a Python API, a CLI, and a shareable demo application.

	For details, explanations, and documentation, see the README (_linked above_) or the [wiki](https://github.com/pszemraj/textsum/wiki).

	* * *

	## Training procedure

	### Updates

	Updates to this model/model card will be posted here when relevant. The model seems to be fairly converged; if updates/improvements are possible using the `BookSum` dataset, this repo will be updated.

	### Training hyperparameters

	The following hyperparameters were used during training:

	- learning_rate: 0.0006
	- train_batch_size: 1
	- eval_batch_size: 1
	- seed: 10350
	- distributed_type: multi-GPU
	- num_devices: 4
	- gradient_accumulation_steps: 32
	- total_train_batch_size: 128
	- total_eval_batch_size: 4
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: constant
	- num_epochs: 1.0

	\*_Prior training sessions used roughly similar parameters (learning rates were higher); multiple sessions were required as this takes eons to train._

	### Framework versions

	- Transformers 4.25.0.dev0
	- Pytorch 1.13.0+cu117
	- Datasets 2.6.1
	- Tokenizers 0.13.1

	* * *