---
license: apache-2.0
widget:
 - text: "2021\n\n"
---

Full code and details at https://github.com/csinva/gpt-paper-title-generator

**Model**
- finetuned starting from the [gpt-neo-2.7B checkpoint](https://huggingface.co/EleutherAI/gpt-neo-2.7B)
    - for training details see [the training script](https://github.com/csinva/gpt-paper-title-generator/blob/0157f26be9b0763b4ea6480e5b149fdb8dff4626/gptneo/02_finetune_hf.py)
- inference
    - prompts should start with a year followed by two newlines, e.g. `2022\n\n`

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# load the finetuned model and the original gpt-neo tokenizer
model = AutoModelForCausalLM.from_pretrained("csinva/gpt-neo-2.7B-titles")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")

# prompt with a year followed by two newlines to generate a title
pipe = pipeline('text-generation', model=model, tokenizer=tokenizer)
pipe('2022\n\n')
```
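To get several candidate titles rather than a single greedy completion, sampling arguments can be passed through the pipeline. The snippet below is a minimal sketch; the sampling settings and the post-processing of the output are illustrative, not the settings used in the repository.

```python
# sample a few candidate titles (illustrative settings)
outputs = pipe(
    '2022\n\n',
    do_sample=True,           # sample instead of greedy decoding
    temperature=0.9,
    max_new_tokens=40,
    num_return_sequences=3,
)

# the title is the first line generated after the "<year>\n\n" prompt
for out in outputs:
    title = out['generated_text'].split('\n\n', 1)[1].split('\n')[0]
    print(title.strip())
```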

**Data**
- all [papers on arXiv](https://www.kaggle.com/datasets/Cornell-University/arxiv) in the categories cs.AI, cs.LG, stat.ML
    - date cutoff: only finetuned on papers dated on or before Apr 1, 2022
    - a random 5% of papers was also excluded
    - this results in 98,388 papers for finetuning
- during finetuning, each paper title was formatted as `<year>\n\n <title>\n` (e.g. `2022\n\n Emb-GAM: an Interpretable and Efficient Predictor using Pre-trained Language Models\n`); a sketch of this formatting is shown below
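
For reference, a minimal sketch of how a raw title could be turned into a training example in this format (the helper name is hypothetical; see the linked training script for the actual preprocessing):

```python
def make_training_example(year: int, title: str) -> str:
    # hypothetical helper: format a paper as "<year>\n\n <title>\n"
    return f"{year}\n\n {title}\n"

example = make_training_example(
    2022,
    "Emb-GAM: an Interpretable and Efficient Predictor using Pre-trained Language Models",
)
# example == '2022\n\n Emb-GAM: an Interpretable and Efficient Predictor using Pre-trained Language Models\n'
```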