---
license: other
base_model: microsoft/phi-1_5
tags:
- generated_from_trainer
- title
- extraction
- title extraction
model-index:
- name: titletor-phi_1-5
results: []
datasets:
- zelalt/scientific-papers-3.5-withprompt
---
<div align="center">
# Titletor
</div>
<div align="center">
<img src="./titletor.png" width="300"/>
</div>
This model is a fine-tuned version of [microsoft/phi-1_5](https://huggingface.co/microsoft/phi-1_5) on the [zelalt/scientific-papers-3.5-withprompt](https://huggingface.co/datasets/zelalt/scientific-papers-3.5-withprompt) dataset.
It achieves the following results on the evaluation set:
- Loss: 2.1587
### Requirements
```python
!pip install accelerate transformers einops datasets peft bitsandbytes
```
## Test Dataset
To try the model, you can use a test dataset such as [zelalt/scientific-papers](https://huggingface.co/datasets/zelalt/scientific-papers)
or [zelalt/arxiv-papers](https://huggingface.co/datasets/zelalt/arxiv-papers). Alternatively, read your own PDF as text with `PyPDF2.PdfReader` and pass that text to the model, prefixed with the prompt "What is the title of this paper?".
```python
from datasets import load_dataset
test_dataset = load_dataset("zelalt/scientific-papers", split='train')
test_dataset = test_dataset.rename_column('full_text', 'text')

def formatting(example):
    # Build the prompt from the first 180 characters of the paper text
    text = f"What is the title of this paper? {example['text'][:180]}\n\nAnswer: "
    return {'text': text}

formatted_dataset = test_dataset.map(formatting)
```
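The PDF route mentioned above can be sketched as follows. This is a minimal sketch, not the author's code: the helper names `build_prompt` and `pdf_to_prompt` are hypothetical, the 180-character cut simply mirrors the dataset snippet, and it assumes the title appears on the first page of the PDF.

```python
def build_prompt(paper_text, max_chars=180):
    """Wrap the start of a paper's text in the prompt format used above."""
    return f"What is the title of this paper? {paper_text[:max_chars]}\n\nAnswer: "


def pdf_to_prompt(pdf_path, max_chars=180):
    """Read the first page of a PDF and turn it into a prompt (hypothetical helper)."""
    # Imported lazily so build_prompt can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() can return an empty result for scanned or image-only pages.
    first_page_text = reader.pages[0].extract_text() or ""
    return build_prompt(first_page_text, max_chars)
```

The string returned by `pdf_to_prompt("paper.pdf")` (the file name is a placeholder) can be passed to the tokenizer in place of a dataset row.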
### Sample Code
```python
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and tokenizer, then attach the fine-tuned adapter
peft_model_id = "zelalt/titletor-phi_1-5"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path, trust_remote_code=True)
model = PeftModel.from_pretrained(model, peft_model_id)

# Prompt taken from the formatted dataset
inputs = tokenizer(formatted_dataset[120]['text'], return_tensors="pt", return_attention_mask=False)
outputs = model.generate(**inputs, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id)
text = tokenizer.batch_decode(outputs)[0]
print(text)
```
```python
# Prompt given as a string; reuses `model` and `tokenizer` loaded above
inputs = tokenizer(f'''What is the title of this paper? ...[your pdf as text]..\n\nAnswer: ''', return_tensors="pt", return_attention_mask=False)
outputs = model.generate(**inputs, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id)
text = tokenizer.batch_decode(outputs)[0]
print(text)
```
**Notes**
- Once the model and tokenizer have been loaded, rerun only the generation part for new inputs. Reloading the model each time can exhaust RAM and crash the session.
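One way to follow this note is to wrap the generation step in a small function and call it repeatedly with the `model` and `tokenizer` that are already in memory. The function name `generate_title_text` is hypothetical; the arguments passed to `tokenizer` and `model.generate` are the same ones used in the sample code.

```python
def generate_title_text(model, tokenizer, prompt, max_new_tokens=50):
    """Run one generation with an already-loaded model and tokenizer."""
    inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
    # Decode the single generated sequence (prompt plus completion).
    return tokenizer.batch_decode(outputs)[0]
```

Calling `generate_title_text(model, tokenizer, prompt)` for each new prompt avoids repeating the `from_pretrained` calls.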
### Output
Input:
```markdown
What is the title of this paper? Bursting Dynamics of the 3D Euler Equations\nin Cylindrical Domains\nFrançois Golse ∗ †\nEcole Polytechnique, CMLS\n91128 Palaiseau Cedex, France\nAlex Mahalov ‡and Basil Nicolaenko §\n\nAnswer:
```
Output from LLM:
```markdown
What is the title of this paper? Bursting Dynamics of the 3D Euler Equations
in Cylindrical Domains
François Golse ∗ †
Ecole Polytechnique, CMLS
91128 Palaiseau Cedex, France
Alex Mahalov ‡and Basil Nicolaenko §
Answer: Bursting Dynamics of the 3D Euler Equations in Cylindrical Domains<|endoftext|>
```
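Because the decoded output repeats the prompt and ends with the `<|endoftext|>` token, the title itself still has to be cut out of the string. The following is a minimal sketch of one way to do that; the helper name `extract_title` is hypothetical, and it assumes the last `Answer:` in the text is the one from the prompt.

```python
def extract_title(generated, marker="Answer:", eos="<|endoftext|>"):
    """Return the text after the last marker, with the end-of-text token removed."""
    _, found, tail = generated.rpartition(marker)
    if not found:
        # No marker in the text, so there is nothing to extract.
        return ""
    return tail.split(eos, 1)[0].strip()


sample = (
    "What is the title of this paper? Bursting Dynamics of the 3D Euler Equations\n"
    "in Cylindrical Domains\n\n"
    "Answer: Bursting Dynamics of the 3D Euler Equations in Cylindrical Domains<|endoftext|>"
)
print(extract_title(sample))
```

If the model happens to generate a second `Answer:` of its own, this sketch returns the text after that one, so slicing off the prompt by length may be a safer choice in that case.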
## Training and evaluation data
Train and validation dataset:
[zelalt/scientific-papers-3.5-withprompt](https://huggingface.co/datasets/zelalt/scientific-papers-3.5-withprompt)
## Training procedure
### Training hyperparameters
- total_train_batch_size: 8
- lr_scheduler_type: cosine
### Framework versions
- Transformers 4.35.2
- Pytorch 2.1.0+cu118
- Datasets 2.15.0
- Tokenizers 0.15.0