update README to be sexy
README.md
CHANGED
@@ -132,8 +132,9 @@ model-index:
</a>

Summarize long text and get a SparkNotes-esque summary of arbitrary topics!

- Generalizes reasonably well to academic & narrative text.
- This is the XL checkpoint, which **from a human-evaluation perspective, [produces even better summaries](https://long-t5-xl-book-summary-examples.netlify.app/)**.

A simple example/use case with [the base model](https://huggingface.co/pszemraj/long-t5-tglobal-base-16384-book-summary) on ASR is [here](https://longt5-booksum-example.netlify.app/).

@@ -141,17 +142,41 @@
A summary of the [infamous navy seals copypasta](https://knowyourmeme.com/memes/navy-seal-copypasta):

> In this chapter, the monster explains how he intends to exact revenge on "the little b\*\*\*\*" who insulted him. He tells the kiddo that he is a highly trained and experienced killer who will use his arsenal of weapons--including his access to the internet--to exact justice on the little brat.

While this is a somewhat crude example, try running this copypasta through other summarization models to see the difference in comprehension (_despite it not even being a "long" text!_).
* * *

**Contents**

<!-- TOC -->

- [Description](#description)
- [How-To in Python](#how-to-in-python)
  - [Beyond the basics](#beyond-the-basics)
- [About](#about)
  - [Intended uses & limitations](#intended-uses--limitations)
  - [Training and evaluation data](#training-and-evaluation-data)
  - [Eval results](#eval-results)
- [FAQ](#faq)
  - [How can I run inference with this on CPU?](#how-can-i-run-inference-with-this-on-cpu)
  - [How to run inference over a very long (30k+ tokens) document in batches?](#how-to-run-inference-over-a-very-long-30k-tokens-document-in-batches)
  - [How to fine-tune further?](#how-to-fine-tune-further)
- [Training procedure](#training-procedure)
  - [Updates](#updates)
  - [Training hyperparameters](#training-hyperparameters)
  - [Framework versions](#framework-versions)

<!-- /TOC -->

* * *
## Description

A fine-tuned version of [google/long-t5-tglobal-xl](https://huggingface.co/google/long-t5-tglobal-xl) on the `kmfoda/booksum` dataset.

Read the paper by Guo et al. here: [LongT5: Efficient Text-To-Text Transformer for Long Sequences](https://arxiv.org/pdf/2112.07916.pdf)

## How-To in Python

@@ -173,9 +198,10 @@
result = summarizer(long_text)
print(result[0]["summary_text"])
```
### Beyond the basics

There are two additional points to consider beyond simple inference: adjusting decoding parameters for improved performance, and quantization for reduced memory consumption.
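
For a concrete sense of the first point, generation parameters can be passed straight through the pipeline call. A minimal sketch follows; the repo id and all parameter values are illustrative assumptions, not recommendations from this card:

```python
from transformers import pipeline

# Repo id assumed to be this card's checkpoint; adjust if it differs.
summarizer = pipeline(
    "summarization",
    "pszemraj/long-t5-tglobal-xl-16384-book-summary",
)

long_text = "Here is a lot of text I don't want to read. Replace me"

# Extra kwargs are forwarded to model.generate(); the values here are placeholders to tinker with.
result = summarizer(
    long_text,
    min_length=8,
    max_length=256,
    num_beams=4,
    no_repeat_ngram_size=3,
    encoder_no_repeat_ngram_size=4,
    repetition_penalty=2.5,
    early_stopping=True,
)
print(result[0]["summary_text"])
```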
#### Adjusting parameters

@@ -189,7 +215,6 @@
How-to: essentially, ensure you have `transformers` pip-installed from the **latest GitHub repo `main`**, along with `bitsandbytes`.

Install the latest `main` branch:

```bash
@@ -217,10 +242,9 @@
Do you love to ask questions? Awesome. But first, check out the [how LLM.int8 works blog post](https://huggingface.co/blog/hf-bitsandbytes-integration) by Hugging Face.

\* More rigorous metric-based investigation into comparing beam-search summarization with and without LLM.int8 will take place over time.
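
For orientation, here is a minimal sketch of what 8-bit loading can look like once the dependencies above are installed; the repo id and generation settings are assumptions, and the Colab demo linked at the top of the card remains the reference:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "pszemraj/long-t5-tglobal-xl-16384-book-summary"  # assumed repo id for this card

tokenizer = AutoTokenizer.from_pretrained(model_name)
# LLM.int8 needs a CUDA GPU plus recent `transformers` (main), `accelerate`, and `bitsandbytes`.
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto",
)

long_text = "Replace me with a long document."
inputs = tokenizer(long_text, return_tensors="pt").to(model.device)
with torch.inference_mode():
    summary_ids = model.generate(**inputs, max_length=512, num_beams=2)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```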
* * *

## About

@@ -229,47 +253,49 @@
While this model seems to improve upon factual consistency, **do not take summaries to be foolproof and check things that seem odd**.

Specifically: negation statements (e.g., the model says _This thing does not have [ATTRIBUTE]_ where instead it should have said _This thing has a lot of [ATTRIBUTE]_).

- I'm sure someone will write a paper on this eventually (if there isn't one already), but you can usually fact-check this by comparing a specific claim to what the surrounding sentences imply.

### Training and evaluation data

`kmfoda/booksum` dataset on HuggingFace - read [the original paper here](https://arxiv.org/abs/2105.08209).

- **Initial fine-tuning** only used input text with 12288 input tokens or fewer and 1024 output tokens or fewer (_i.e., rows with longer sequences were dropped before training_) for memory reasons. Per a brief analysis, summaries in the 12288-16384 token range are in the **small** minority of this dataset (a sketch of this filtering follows the list).
- In addition, this initial training combined the training and validation sets and trained on them in aggregate to increase the functional dataset size. **Therefore, take the validation set results with a grain of salt; the primary metrics should (always) come from the test set.**
- The **final phases of fine-tuning** used the standard convention of 16384 input / 1024 output tokens, keeping everything (truncating longer sequences). This did not appear to change the loss/performance much.
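
The token-length filtering described in the first bullet might look roughly like the sketch below. This is not the original preprocessing script; the column names ("chapter", "summary_text") and tokenizer choice are assumptions to check against the actual dataset:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Column names are assumptions -- inspect dataset.column_names before relying on them.
dataset = load_dataset("kmfoda/booksum", split="train")
tokenizer = AutoTokenizer.from_pretrained("google/long-t5-tglobal-xl")

MAX_INPUT_TOKENS = 12288   # initial fine-tuning dropped rows with longer inputs...
MAX_OUTPUT_TOKENS = 1024   # ...or longer target summaries

def within_length_budget(example):
    n_input = len(tokenizer(example["chapter"]).input_ids)
    n_target = len(tokenizer(example["summary_text"]).input_ids)
    return n_input <= MAX_INPUT_TOKENS and n_target <= MAX_OUTPUT_TOKENS

filtered = dataset.filter(within_length_budget)
print(f"kept {len(filtered)} of {len(dataset)} rows")
```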
### Eval results

Official results with the [model evaluator](https://huggingface.co/spaces/autoevaluate/model-evaluator) will be computed and posted here.

**Please read the note above: due to the training methods, validation set performance looks better than the test set results will be.** The model achieves the following results on the evaluation set:

- eval_loss: 1.2756
- eval_rouge1: 41.8013
- eval_rouge2: 12.0895
- eval_rougeL: 21.6007
- eval_rougeLsum: 39.5382
- eval_gen_len: 387.2945
- eval_runtime: 13908.4995
- eval_samples_per_second: 0.107
- eval_steps_per_second: 0.027

***** predict/test metrics (initial) *****
predict_gen_len = 506.4368
predict_loss = 2.028
predict_rouge1 = 36.8815
predict_rouge2 = 8.0625
predict_rougeL = 17.6161
predict_rougeLsum = 34.9068
predict_runtime = 2:04:14.37
predict_samples = 1431
predict_samples_per_second = 0.192
predict_steps_per_second = 0.048

\* Evaluating a big model is not as easy as it seems. Doing a bit more investigating.
* * *

## FAQ

@@ -287,8 +313,7 @@
See [train with a script](https://huggingface.co/docs/transformers/run_scripts) and [the summarization scripts](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization)

* * *

## Training procedure

@@ -299,26 +324,27 @@
### Training hyperparameters

The following hyperparameters were used during training (a rough code mapping follows the list):

- learning_rate: 0.0006
- train_batch_size: 1
- eval_batch_size: 1
- seed: 10350
- distributed_type: multi-GPU
- num_devices: 4
- gradient_accumulation_steps: 32
- total_train_batch_size: 128
- total_eval_batch_size: 4
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: constant
- num_epochs: 1.0
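
For readers who want to reproduce something similar, the hyperparameters above map roughly onto `Seq2SeqTrainingArguments` as sketched below. This is an approximation, not the original training script; the output path is a placeholder, and the Adam betas/epsilon listed above are already the library defaults:

```python
from transformers import Seq2SeqTrainingArguments

# Rough mapping of the list above. Launched across 4 GPUs (e.g. via torchrun), so that
# 1 per-device batch * 32 accumulation steps * 4 devices = total train batch size of 128.
training_args = Seq2SeqTrainingArguments(
    output_dir="./long-t5-tglobal-xl-booksum",  # placeholder path
    learning_rate=6e-4,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=32,
    lr_scheduler_type="constant",
    num_train_epochs=1.0,
    seed=10350,
    predict_with_generate=True,
)
```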
\*_Prior training sessions used roughly similar parameters (learning rates were higher); multiple sessions were required as this takes eons to train._

### Framework versions

- Transformers 4.25.0.dev0
- Pytorch 1.13.0+cu117
- Datasets 2.6.1
- Tokenizers 0.13.1

* * *