Commit 86027db by tingyuansen (1 parent: a5e286c): Update README.md

README.md (as updated in this commit):
---
base_model:
- meta-llama/Llama-3-8b-hf
---

# AstroLLaMA-3-8B-Base_Summary

AstroLLaMA-3-8B-Base_Summary is a specialized base language model for astronomy, developed by the AstroMLab team by fine-tuning Meta's LLaMA-3-8b architecture on summarized astronomical literature. It is designed for next-token prediction and is not an instruct/chat model.

## Model Details

- **Base Architecture**: LLaMA-3-8b
- **Training Data**: Summarized content from arXiv's astro-ph category papers
- **Data Processing**:
  1. Optical character recognition (OCR) on the PDF files using the Nougat tool
  2. Summarization of the OCR'd text using Qwen-2-8B and LLaMA-3.1-8B, reducing each paper to about 1,000-4,000 tokens (a minimal sketch of these two steps follows this list)
- **Fine-tuning Method**: Continual Pre-Training (CPT) using the LMFlow framework
- **Training Details**:
  - Learning rate: 2 × 10⁻⁵
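
For concreteness, here is a minimal sketch of what the two data-processing steps above could look like. This is not the AstroMLab pipeline: the Nougat CLI invocation is standard, but the summarization prompt, the token budget, and the stand-in checkpoint `Qwen/Qwen2-7B-Instruct` (the card does not name exact summarizer checkpoints) are all assumptions.

```python
# Illustrative sketch only; not the AstroMLab training pipeline.
import subprocess
from pathlib import Path

from transformers import pipeline

def ocr_pdf(pdf_path: str, out_dir: str = "ocr_out") -> str:
    """Step 1: OCR one arXiv PDF to markdown with the Nougat CLI."""
    subprocess.run(["nougat", pdf_path, "-o", out_dir], check=True)
    # Nougat writes <stem>.mmd into the requested output directory.
    return (Path(out_dir) / (Path(pdf_path).stem + ".mmd")).read_text()

# Stand-in summarizer; the card mentions "Qwen-2-8B" and "LLaMA-3.1-8B"
# without exact checkpoints, so a public Qwen2 instruct model is used here.
summarizer = pipeline("text-generation", model="Qwen/Qwen2-7B-Instruct",
                      device_map="auto")

def summarize_paper(paper_text: str, budget_tokens: int = 2000) -> str:
    """Step 2: compress a paper to a ~1,000-4,000 token summary."""
    prompt = (f"Summarize this astronomy paper in about {budget_tokens} "
              f"tokens, keeping methods, data, and key results:\n\n{paper_text}")
    out = summarizer(prompt, max_new_tokens=2 * budget_tokens, do_sample=False)
    return out[0]["generated_text"][len(prompt):]
```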

The model can then be loaded and used for text generation:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("AstroMLab/astrollama-3-8b-base_summary")
model = AutoModelForCausalLM.from_pretrained("AstroMLab/astrollama-3-8b-base_summary", device_map="auto")

# Create the pipeline with explicit truncation
from transformers import pipeline
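# The commit diff elides the next few lines of the card; the following is a
# plausible reconstruction, not the original. Truncation is set explicitly,
# per the comment above; max_length and the example prompt are placeholders.
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
    max_length=512,
)

# Placeholder prompt; any astronomy passage works for next-token prediction.
prompt = "The Sloan Digital Sky Survey mapped the large-scale structure of"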

# Generate text
generated_text = generator(prompt, do_sample=True)
print(generated_text[0]['generated_text'])
```
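
Because this is a base model, it can also be used to score text rather than generate it. Below is a minimal sketch, not part of the original card, that reuses the `model` and `tokenizer` loaded above to compute a perplexity-style score:

```python
import torch

# Lower loss means the passage looks more natural to the model; this kind of
# next-token scoring is a common sanity check for domain-adapted base models.
text = "The initial mass function describes the distribution of stellar masses at formation."
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    loss = model(**inputs, labels=inputs["input_ids"]).loss
print(f"Perplexity: {torch.exp(loss).item():.2f}")
```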

## Model Improvements and Performance

A key innovation in this model is the use of summarized content for training, which has led to improved performance compared to the AIC (Abstract, Introduction, Conclusion) version. Summarization allows more comprehensive information from each paper to be included while keeping the token count manageable.

Here is a performance comparison based on the astronomical benchmarking Q&A described in [Ting et al. 2024](https://arxiv.org/abs/2407.11194) and Pan et al. 2024:

| Model | Score (%) |
|-------|-----------|
| LLaMA-3.1-8B | 73.7 |
| **<span style="color:green">AstroLLaMA-3-8B-Base_Summary (AstroMLab)</span>** | **<span style="color:green">72.3</span>** |
| AstroLLaMA-3-8B-Base_AIC | 72.3 |
| LLaMA-3-8B | 72.0 |
| Gemma-2-9B | 71.5 |
| Qwen-2.5-7B | 70.4 |
| Yi-1.5-9B | 68.4 |
| Mistral-7B-v0.3 | 63.9 |
| ChatGLM3-6B | 50.4 |

As shown, AstroLLaMA-3-8B-Base_Summary performs competitively, nearly matching the performance of the base LLaMA-3.1-8B model and outperforming the AIC version. This improvement demonstrates the importance of information density in the training data.

Notably, the instruct version of this model shows even larger gains, highlighting the effectiveness of the summarization approach in capturing and retaining key astronomical concepts. For a detailed performance analysis of the instruct version, please refer to Pan et al. 2024.
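
For context, each score above is accuracy on a multiple-choice astronomy Q&A set. The sketch below shows one standard way to score such items with a base model, reusing the `model` and `tokenizer` from the usage example; the question and options are hypothetical, and the actual harness of Ting et al. 2024 may differ:

```python
import torch

def option_loglik(question: str, option: str) -> float:
    """Summed log-probability of the option tokens given the question."""
    # Approximate the question/option boundary by token count; a production
    # harness would align the tokens more carefully.
    q_len = tokenizer(question, return_tensors="pt").input_ids.shape[1]
    qa = tokenizer(question + " " + option, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(qa).logits
    # Logits at position i predict token i + 1, so score only option tokens.
    logp = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = qa[0, 1:]
    idx = torch.arange(q_len - 1, qa.shape[1] - 1)
    return logp[idx, targets[idx]].sum().item()

# Hypothetical item: the predicted answer is the highest-likelihood option.
question = "Q: Which process powers main-sequence stars? A:"
options = ["nuclear fusion of hydrogen",
           "nuclear fission of heavy elements",
           "gravitational contraction"]
print(max(options, key=lambda o: option_loglik(question, o)))
```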

## Model Limitations and Future Directions

While the summarization approach has shown promising results, there is still room for improvement. Future iterations may benefit from:

1. Incorporating a broader range of high-quality astronomical data beyond arXiv, such as textbooks and curated Wikipedia content.
2. Further refining the summarization process to capture even more relevant information.
3. Exploring ways to integrate more diverse astronomical concepts and recent discoveries into the training data.

## Ethical Considerations