language: en
tags:
- scibert
- fine-tuned
- scientific-embeddings
- multi-document-summarization
- scitldr
license: mit
SciBERT Fine-tuned for Scientific Multi-Document Summarization Embeddings
Model description
This model is a fine-tuned version of allenai/scibert_scivocab_uncased that produces embeddings for scientific multi-document summarization. It is optimized to represent scientific text as vectors that downstream summarization pipelines can consume.
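A minimal sketch of how embeddings might be extracted with the Transformers library. This card does not state the Hub ID of the fine-tuned checkpoint or the pooling strategy, so the snippet defaults to the base model allenai/scibert_scivocab_uncased as a stand-in and assumes attention-masked mean pooling. The helper names `mean_pool` and `embed_texts` are hypothetical.

```python
def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sequence, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts


def embed_texts(texts, model_name="allenai/scibert_scivocab_uncased"):
    """Return one embedding per text as a (len(texts), hidden_size) tensor.

    model_name is a placeholder: substitute the Hub ID or local path of the
    fine-tuned checkpoint. Mean pooling is an assumption, not a documented choice.
    """
    # Imported here so the pooling helper can be used without loading Transformers.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    # Pad to the longest text in the batch and truncate to BERT's 512-token limit.
    batch = tokenizer(
        texts, padding=True, truncation=True, max_length=512, return_tensors="pt"
    )
    with torch.no_grad():
        outputs = model(**batch)
    return mean_pool(outputs.last_hidden_state, batch["attention_mask"])
```

Calling `embed_texts(["First abstract ...", "Second abstract ..."])` would return a tensor with one row per input text. Other pooling choices, such as taking the `[CLS]` vector, are equally plausible given what this card documents.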
Intended uses & limitations
This model is intended for creating embeddings of scientific documents, specifically for use in multi-document summarization tasks. It should not be used for generating summaries directly, but rather for creating vector representations of scientific text that can be used as input for summarization models or algorithms.
The model may not perform optimally on non-scientific text or for tasks significantly different from multi-document summarization.
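As an illustration of feeding embeddings into a summarization algorithm, the sketch below ranks items by cosine similarity to the centroid of all embeddings, a classic extractive heuristic. It is only one possible downstream use and not a method described by this card. The vectors are toy values standing in for real model outputs, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]


def rank_by_centroid(embeddings, top_k=2):
    """Return indices of the top_k embeddings most similar to the centroid."""
    center = centroid(embeddings)
    scored = [(cosine_similarity(vec, center), idx) for idx, vec in enumerate(embeddings)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:top_k]]


# Toy 3-dimensional "embeddings": two similar items and one outlier.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]
selected = rank_by_centroid(toy_embeddings, top_k=2)
```

Here the two similar vectors are selected and the outlier is left out. In practice each vector would be the embedding of a sentence or document from the set being summarized.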
Training data
This model was trained on the SciTLDR dataset. SciTLDR (Scientific Too Long; Didn't Read) is a dataset of scientific papers paired with TL;DR summaries. It contains around 5,400 TL;DRs covering roughly 3,200 computer science papers collected from OpenReview. Each paper in the dataset includes:
- The paper's title
- The abstract
- The full text of the paper
- Two types of summaries:
  - Author-written TL;DR
  - Expert-derived TL;DR (rewritten from peer-review comments)
The dataset is designed to support the task of extreme summarization in the scientific domain, where the goal is to create very short, high-level summaries of scientific papers.
For more information about the SciTLDR dataset, you can refer to the official paper and the dataset repository.
Training procedure
The model was trained for 15 epochs, with early stopping monitored on validation loss. The best checkpoint was the one saved at epoch 15, the final epoch.
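The early-stopping logic can be sketched as follows. The patience value and the loss sequences are hypothetical, since this card does not report them; the sketch only shows how the best epoch is tracked and when training would halt.

```python
def run_with_early_stopping(val_losses, patience=3):
    """Return (best_epoch, last_epoch) for a sequence of per-epoch validation losses.

    Epochs are 1-indexed. Training stops once the loss has failed to improve
    for `patience` consecutive epochs, or when the sequence is exhausted.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    last_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        last_epoch = epoch
        if loss < best_loss:
            # A real training loop would save a checkpoint here.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, last_epoch


# Hypothetical run in which validation loss improves at every one of 15 epochs.
improving = [1.5 - 0.04 * i for i in range(15)]
best_epoch, last_epoch = run_with_early_stopping(improving)
```

When validation loss improves at every epoch, the stopping criterion never fires and the best checkpoint is the final one, which is consistent with the best model being saved at epoch 15.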
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-5 to 1e-7 (cosine annealing)
- train_batch_size: 16
- eval_batch_size: 16
- optimizer: AdamW
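For reference, a cosine-annealed learning rate between the two listed bounds can be computed as in the sketch below. It uses the standard closed form without warm restarts (the same one documented for PyTorch's `CosineAnnealingLR`). Whether the schedule was stepped per batch or per epoch is not stated here, so `total_steps` is a hypothetical parameter.

```python
import math


def cosine_annealed_lr(step, total_steps, lr_max=1e-5, lr_min=1e-7):
    """Learning rate at `step`, decaying from lr_max to lr_min along a half cosine."""
    # Clamp so that steps outside [0, total_steps] stay at the endpoints.
    progress = min(max(step, 0), total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Example: one scheduler step per epoch over 15 epochs (an assumption).
schedule = [cosine_annealed_lr(epoch, total_steps=15) for epoch in range(16)]
```

The schedule starts at 1e-5, ends at 1e-7, and decreases slowly near both ends with the fastest decay in the middle of training.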
Framework versions
- Transformers 4.41.2
- PyTorch 2.3.0+cu121
- Datasets 2.20.0
- Tokenizers 0.19.1
- CUDA 12.1
Evaluation results
The model achieved the following results:
- Training Loss: 0.2272
- Validation Loss: 0.8738
Model Limitations and Bias
This model was trained on scientific literature from the SciTLDR dataset, which consists primarily of computer science papers. It may therefore generalize poorly to other scientific domains and to non-scientific text. Biases in the training data may be reflected in the generated embeddings. Performance is likely to be strongest on computer science texts and weaker in other scientific fields.
Author
callaghanmt