tingyuansen committed
Commit 86027db
1 Parent(s): a5e286c

Update README.md

Files changed (1)
  1. README.md +23 -11
README.md CHANGED
@@ -13,15 +13,17 @@ base_model:
  - meta-llama/Llama-3-8b-hf
 ---
 
- # AstroLLaMA-3-8B-Base_AIC
 
- AstroLLaMA-3-8B is a specialized base language model for astronomy, developed by fine-tuning Meta's LLaMA-3-8b architecture on astronomical literature. This model was developed by the AstroMLab team. It is designed for next token prediction tasks and is not an instruct/chat model.
 
 ## Model Details
 
 - **Base Architecture**: LLaMA-3-8b
- - **Training Data**: Abstract, Introduction, and Conclusion (AIC) sections from arXiv's astro-ph category papers (from arXiv's inception up to January 2024)
- - **Data Processing**: Optical character recognition (OCR) on PDF files using the Nougat tool, followed by summarization using Qwen-2-8B and LLaMA-3.1-8B.
 - **Fine-tuning Method**: Continual Pre-Training (CPT) using the LMFlow framework
 - **Training Details**:
   - Learning rate: 2 × 10⁻⁵
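
The Model Details above specify continual pre-training (CPT) with the LMFlow framework at a learning rate of 2 × 10⁻⁵, but the training setup itself is not part of this diff. For orientation only, a minimal CPT run over a plain-text corpus can be sketched with the Hugging Face `Trainer`; this is not the authors' LMFlow configuration, and the corpus file, sequence length, batch size, and epoch count are placeholders.

```python
# Illustrative sketch only: continual pre-training (CPT) of a causal LM with the
# Hugging Face Trainer. The card states the authors used LMFlow; the corpus file,
# sequence length, batch size, and epoch count below are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_id = "meta-llama/Meta-Llama-3-8B"  # LLaMA-3-8b base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_id)

# Placeholder corpus: one processed astro-ph document per line.
corpus = load_dataset("text", data_files={"train": "astro_corpus.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=4096),
    batched=True,
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="astrollama-cpt",
    learning_rate=2e-5,             # learning rate quoted in the card
    per_device_train_batch_size=1,  # placeholder
    num_train_epochs=1,             # placeholder
    bf16=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    # Causal-LM collator: labels are the input ids (next-token prediction).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
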
@@ -42,8 +44,8 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
 import torch
 
 # Load the model and tokenizer
- tokenizer = AutoTokenizer.from_pretrained("AstroMLab/astrollama-2-7b-base_aic")
- model = AutoModelForCausalLM.from_pretrained("AstroMLab/astrollama-2-7b-base_aic", device_map="auto")
 
 # Create the pipeline with explicit truncation
 from transformers import pipeline
@@ -69,16 +71,18 @@ generated_text = generator(prompt, do_sample=True)
 print(generated_text[0]['generated_text'])
 ```
 
- ## Model Limitations and Biases
 
- A key limitation identified during the development of this model is that training solely on astro-ph data may not be sufficient to significantly improve performance over the base model, especially for the already highly performant LLaMA-3 series. This suggests that to achieve substantial gains, future iterations may need to incorporate a broader range of high-quality astronomical data beyond arXiv, such as textbooks, Wikipedia, and curated summaries.
 
 Here's a performance comparison chart based upon the astronomical benchmarking Q&A as described in [Ting et al. 2024](https://arxiv.org/abs/2407.11194), and Pan et al. 2024:
 
 | Model | Score (%) |
 |-------|-----------|
- | **AstroLLaMA-3-8B (AstroMLab)** | **72.3** |
 | LLaMA-3-8B | 72.0 |
 | Gemma-2-9B | 71.5 |
 | Qwen-2.5-7B | 70.4 |
 | Yi-1.5-9B | 68.4 |
@@ -86,9 +90,17 @@ Here's a performance comparison chart based upon the astronomical benchmarking Q
 | Mistral-7B-v0.3 | 63.9 |
 | ChatGLM3-6B | 50.4 |
 
- As shown, while AstroLLaMA-3-8B performs competitively among models in its class, it does not surpass the performance of the base LLaMA-3-8B model. This underscores the challenges in developing specialized models and the need for more diverse and comprehensive training data.
 
- It's worth noting that the AstroLLaMA-3-8B-Plus which we will release in the next model release addresses these limitations by expanding beyond astro-ph data.
 
 ## Ethical Considerations
 
 - meta-llama/Llama-3-8b-hf
 ---
 
+ # AstroLLaMA-3-8B-Base_Summary
 
+ AstroLLaMA-3-8B-Base_Summary is a specialized base language model for astronomy, developed by fine-tuning Meta's LLaMA-3-8b architecture on summarized astronomical literature. This model was developed by the AstroMLab team. It is designed for next token prediction tasks and is not an instruct/chat model.
 
 ## Model Details
 
 - **Base Architecture**: LLaMA-3-8b
+ - **Training Data**: Summarized content from arXiv's astro-ph category papers
+ - **Data Processing**:
+   1. Optical character recognition (OCR) on PDF files using the Nougat tool
+   2. Summarization of OCR'd text using Qwen-2-8B and LLaMA-3.1-8B, reducing content to about 1,000-4,000 tokens per paper
 - **Fine-tuning Method**: Continual Pre-Training (CPT) using the LMFlow framework
 - **Training Details**:
   - Learning rate: 2 × 10⁻⁵
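
The data-processing steps listed above (Nougat OCR followed by LLM summarization down to roughly 1,000-4,000 tokens per paper) are described but not shown in this commit. The sketch below only illustrates what such a summarization step might look like with `transformers`; the summarizer checkpoint, prompt, and token budget are assumptions, not the pipeline AstroMLab actually used.

```python
# Illustrative sketch of the summarization step only; the actual prompts, models,
# and token budgets used by AstroMLab are not given in this card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

summarizer_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder instruction-tuned summarizer
tokenizer = AutoTokenizer.from_pretrained(summarizer_id)
model = AutoModelForCausalLM.from_pretrained(
    summarizer_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def summarize_paper(nougat_markdown: str, max_new_tokens: int = 2048) -> str:
    """Condense one OCR'd paper into a summary of a few thousand tokens."""
    messages = [
        {"role": "system", "content": "Summarize this astronomy paper, keeping the key methods and results."},
        {"role": "user", "content": nougat_markdown},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=max_new_tokens, do_sample=False)
    # Strip the prompt tokens and return only the generated summary.
    return tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True)
```
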
 
 import torch
 
 # Load the model and tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("AstroMLab/astrollama-3-8b-base_summary")
+ model = AutoModelForCausalLM.from_pretrained("AstroMLab/astrollama-3-8b-base_summary", device_map="auto")
 
 # Create the pipeline with explicit truncation
 from transformers import pipeline
 
 print(generated_text[0]['generated_text'])
 ```
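
The hunks above expose only fragments of the card's quick-start snippet. Stitched together from the visible lines, a complete usage example looks roughly like the following; the prompt text and the truncation/max_length settings are assumptions, since those lines are not part of this diff.

```python
# Reconstructed from the fragments visible in this diff; the prompt text and the
# truncation/max_length settings are assumptions, not part of the shown changes.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("AstroMLab/astrollama-3-8b-base_summary")
model = AutoModelForCausalLM.from_pretrained(
    "AstroMLab/astrollama-3-8b-base_summary", device_map="auto"
)

# Create the pipeline with explicit truncation
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
    max_length=512,  # assumption: generation budget not visible in the diff
)

# Example next-token-prediction prompt (placeholder; this is a base model, not a chat model)
prompt = "The dark matter halo mass function"
generated_text = generator(prompt, do_sample=True)
print(generated_text[0]['generated_text'])
```
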
 
+ ## Model Improvements and Performance
 
+ A key innovation in this model is the use of summarized content for training, which has led to improved performance compared to the AIC (Abstract, Introduction, Conclusion) version. The summarization process allows for the inclusion of more comprehensive information from each paper while maintaining a manageable token count.
 
 Here's a performance comparison chart based upon the astronomical benchmarking Q&A as described in [Ting et al. 2024](https://arxiv.org/abs/2407.11194), and Pan et al. 2024:
 
 | Model | Score (%) |
 |-------|-----------|
+ | LLaMA-3.1-8B | 73.7 |
+ | **<span style="color:green">AstroLLaMA-3-8B-Base_Summary (AstroMLab)</span>** | **<span style="color:green">72.3</span>** |
 | LLaMA-3-8B | 72.0 |
+ | AstroLLaMA-3-8B-Base_AIC | 72.3 |
 | Gemma-2-9B | 71.5 |
 | Qwen-2.5-7B | 70.4 |
 | Yi-1.5-9B | 68.4 |
 | Mistral-7B-v0.3 | 63.9 |
 | ChatGLM3-6B | 50.4 |
 
+ As shown, AstroLLaMA-3-8B-Base_Summary performs competitively, nearly matching the performance of the base LLaMA-3.1-8B model and outperforming the AIC version. This improvement demonstrates the importance of information density in the training data.
 
+ Notably, the instruct version of this model shows even more significant improvements, highlighting the effectiveness of the summarization approach in capturing and retaining key astronomical concepts. For detailed performance analysis of the instruct version, please refer to Pan et al. 2024.
+
+ ## Model Limitations and Future Directions
+
+ While the summarization approach has shown promising results, there is still room for improvement. Future iterations may benefit from:
+
+ 1. Incorporating a broader range of high-quality astronomical data beyond arXiv, such as textbooks and curated Wikipedia content.
+ 2. Further refining the summarization process to capture even more relevant information.
+ 3. Exploring ways to integrate more diverse astronomical concepts and recent discoveries into the training data.
 
 ## Ethical Considerations