---
license: mit
language:
- en
pipeline_tag: text-generation
tags:
- llama-3
- astronomy
- astrophysics
- arxiv
inference: false
base_model:
- meta-llama/Llama-3-8b-hf
---
# AstroLLaMA-3-8B-Base_Summary
AstroLLaMA-3-8B-Base_Summary is a specialized base language model for astronomy, developed by fine-tuning Meta's LLaMA-3-8b architecture on summarized astronomical literature. This model was developed by the AstroMLab team. It is designed for next token prediction tasks and is not an instruct/chat model.
## Model Details
- **Base Architecture**: LLaMA-3-8b
- **Training Data**: Summarized content from arXiv's astro-ph category papers
- **Data Processing**:
  1. Optical character recognition (OCR) on PDF files using the Nougat tool
  2. Summarization of OCR'd text using Qwen-2-8B and LLaMA-3.1-8B, reducing content to about 1,000-4,000 tokens per paper
- **Fine-tuning Method**: Continual Pre-Training (CPT) using the LMFlow framework
- **Training Details**:
  - Learning rate: 2 × 10⁻⁵
  - Total batch size: 96
  - Maximum token length: 512
  - Warmup ratio: 0.03
  - No gradient accumulation
  - BF16 format
  - Cosine decay schedule for learning rate reduction
  - Training duration: 1 epoch
- **Primary Use**: Next token prediction for astronomy-related text generation and analysis
- **Reference**: Pan et al. 2024 [Link to be added]
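
The learning-rate settings above can be made concrete with a small sketch. It assumes linear warmup over the first 3% of steps followed by cosine decay to zero, which is a common convention (for example in Hugging Face's `get_cosine_schedule_with_warmup`); the exact schedule LMFlow applied is not stated here, and `total_steps` is a hypothetical value chosen only for illustration.

```python
import math

PEAK_LR = 2e-5        # learning rate from the list above
WARMUP_RATIO = 0.03   # warmup ratio from the list above
BATCH_SIZE = 96       # total batch size from the list above
MAX_TOKENS = 512      # maximum token length from the list above


def lr_at_step(step: int, total_steps: int) -> float:
    """Learning rate at `step`, assuming linear warmup then cosine decay to 0."""
    warmup_steps = max(1, int(WARMUP_RATIO * total_steps))
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return PEAK_LR * step / warmup_steps
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


# Upper bound on tokens per optimizer step (every sequence at full length).
max_tokens_per_step = BATCH_SIZE * MAX_TOKENS  # 96 * 512 = 49152

total_steps = 10_000  # hypothetical; the real step count is not given here
for step in (0, 150, 300, 5_150, 10_000):
    print(step, f"{lr_at_step(step, total_steps):.2e}")
print("max tokens per step:", max_tokens_per_step)
```

With no gradient accumulation, each optimizer step sees one batch of 96 sequences, so 49,152 tokens is the most a single step can contain under these settings.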
## Generating text from a prompt
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load the model and tokenizer
model_id = "AstroMLab/astrollama-3-8b-base_summary"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Create the pipeline with explicit truncation.
# The model is already placed on devices by device_map="auto" above,
# so device_map is not passed a second time here.
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
    max_length=512,
)

# Example prompt from an astronomy paper
prompt = (
    "In this letter, we report the discovery of the highest redshift, "
    "heavily obscured, radio-loud QSO candidate selected using JWST NIRCam/MIRI, "
    "mid-IR, sub-mm, and radio imaging in the COSMOS-Web field. "
)

# Set seed for reproducibility
torch.manual_seed(42)

# Generate text (sampling is enabled, so the seed above fixes the output)
generated_text = generator(prompt, do_sample=True)
print(generated_text[0]["generated_text"])
```
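
Because this is a base model rather than a chat model, one common way to obtain a multiple-choice answer is to compare the model's next-token scores for the answer letters instead of parsing free-form text. The sketch below shows only that selection step on plain Python floats. The helper name `pick_answer` and the example logit values are hypothetical, and this is not necessarily the evaluation procedure used by the AstroMLab team.

```python
import math


def pick_answer(option_logits: dict[str, float]) -> tuple[str, dict[str, float]]:
    """Return the highest-scoring option and softmax probabilities over the options."""
    # Subtract the largest logit before exponentiating for numerical stability.
    top = max(option_logits.values())
    exps = {key: math.exp(value - top) for key, value in option_logits.items()}
    total = sum(exps.values())
    probs = {key: value / total for key, value in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs


# Hypothetical next-token logits for the answer letters "A" to "D".
example_logits = {"A": 1.2, "B": 3.4, "C": 0.3, "D": 2.9}
best, probs = pick_answer(example_logits)
print(best, round(probs[best], 3))
```

In practice the per-letter logits would come from the last position of the model's output, for example `model(**inputs).logits[0, -1]` indexed at the token id of each letter; how the letters tokenize depends on the prompt format, so that mapping should be checked against the tokenizer.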
## Model Improvements and Performance
This model was trained on summarized paper content, which has led to improved performance compared to the AIC (Abstract, Introduction, Conclusion) version. Summarization allows more comprehensive information from each paper to be included while keeping the token count manageable.
The table below compares performance on the astronomical Q&A benchmark described in [Ting et al. 2024](https://arxiv.org/abs/2407.11194) and Pan et al. 2024:
| Model | Score (%) |
|-------|-----------|
| LLaMA-3.1-8B | 73.7 |
| LLaMA-3-8B | 72.9 |
| **<span style="color:green">AstroLLaMA-3-8B-Base_Summary (AstroMLab)</span>** | **<span style="color:green">72.3</span>** |
| AstroLLaMA-3-8B-Base_AIC | 72.3 |
| Gemma-2-9B | 71.5 |
| Qwen-2.5-7B | 70.4 |
| Yi-1.5-9B | 68.4 |
| InternLM-2.5-7B | 64.5 |
| Mistral-7B-v0.3 | 63.9 |
| ChatGLM3-6B | 50.4 |
As shown, AstroLLaMA-3-8B-Base_Summary performs competitively: at 72.3% it is within 1.4 points of the base LLaMA-3.1-8B model (73.7%). In this table it ties with the AIC version at one-decimal precision, so the benefit of the denser summarized training data is not resolved by this score alone.
Notably, the instruct version of this model shows even more significant improvements, highlighting the effectiveness of the summarization approach in capturing and retaining key astronomical concepts. For detailed performance analysis of the instruct version, please refer to Pan et al. 2024.
While AstroLLaMA-3-8B performs competitively among models in its class, it does not surpass the performance of the base LLaMA-3-8B model. This underscores the challenges in developing specialized models and the need for more diverse and comprehensive training data.
This model is released primarily for reproducibility purposes, allowing researchers to track the development process and compare different iterations of AstroLLaMA models.
For optimal performance and the most up-to-date capabilities in astronomy-related tasks, we recommend using AstroSage-8B, where these limitations have been addressed. The newer model incorporates expanded training data beyond astro-ph and features a greatly expanded fine-tuning process, resulting in significantly improved performance.
## Ethical Considerations
While this model is designed for scientific use, users should be mindful of potential misuse, such as generating misleading scientific content. Always verify model outputs against peer-reviewed sources for critical applications.
## Citation
If you use this model in your research, please cite:
```
[Citation for Pan et al. 2024 to be added]
```