tingyuansen committed
Commit 86027db
1 Parent(s): a5e286c

Update README.md

Files changed (1)
  1. README.md +23 -11
README.md CHANGED
@@ -13,15 +13,17 @@ base_model:
  - meta-llama/Llama-3-8b-hf
 ---
 
- # AstroLLaMA-3-8B-Base_AIC
 
- AstroLLaMA-3-8B is a specialized base language model for astronomy, developed by fine-tuning Meta's LLaMA-3-8b architecture on astronomical literature. This model was developed by the AstroMLab team. It is designed for next token prediction tasks and is not an instruct/chat model.
 
 ## Model Details
 
 - **Base Architecture**: LLaMA-3-8b
- - **Training Data**: Abstract, Introduction, and Conclusion (AIC) sections from arXiv's astro-ph category papers (from arXiv's inception up to January 2024)
- - **Data Processing**: Optical character recognition (OCR) on PDF files using the Nougat tool, followed by summarization using Qwen-2-8B and LLaMA-3.1-8B.
 - **Fine-tuning Method**: Continual Pre-Training (CPT) using the LMFlow framework
 - **Training Details**:
   - Learning rate: 2 × 10⁻⁵
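
The Model Details above specify continual pre-training (CPT) with the LMFlow framework at a learning rate of 2 × 10⁻⁵, but the training setup itself is not part of this diff. For orientation only, a minimal CPT run over a plain-text corpus can be sketched with the Hugging Face `Trainer`; this is not the authors' LMFlow configuration, and the corpus file, sequence length, batch size, and epoch count are placeholders.

```python
# Illustrative sketch only: continual pre-training (CPT) of a causal LM with the
# Hugging Face Trainer. The card states the authors used LMFlow; the corpus file,
# sequence length, batch size, and epoch count below are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_id = "meta-llama/Meta-Llama-3-8B"  # LLaMA-3-8b base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_id)

# Placeholder corpus: one processed astro-ph document per line.
corpus = load_dataset("text", data_files={"train": "astro_corpus.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=4096),
    batched=True,
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="astrollama-cpt",
    learning_rate=2e-5,             # learning rate quoted in the card
    per_device_train_batch_size=1,  # placeholder
    num_train_epochs=1,             # placeholder
    bf16=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    # Causal-LM collator: labels are the input ids (next-token prediction).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
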
@@ -42,8 +44,8 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
 import torch
 
 # Load the model and tokenizer
- tokenizer = AutoTokenizer.from_pretrained("AstroMLab/astrollama-2-7b-base_aic")
- model = AutoModelForCausalLM.from_pretrained("AstroMLab/astrollama-2-7b-base_aic", device_map="auto")
 
 # Create the pipeline with explicit truncation
 from transformers import pipeline
@@ -69,16 +71,18 @@ generated_text = generator(prompt, do_sample=True)
 print(generated_text[0]['generated_text'])
 ```
 
- ## Model Limitations and Biases
 
- A key limitation identified during the development of this model is that training solely on astro-ph data may not be sufficient to significantly improve performance over the base model, especially for the already highly performant LLaMA-3 series. This suggests that to achieve substantial gains, future iterations may need to incorporate a broader range of high-quality astronomical data beyond arXiv, such as textbooks, Wikipedia, and curated summaries.
 
 Here's a performance comparison chart based upon the astronomical benchmarking Q&A as described in [Ting et al. 2024](https://arxiv.org/abs/2407.11194), and Pan et al. 2024:
 
 | Model | Score (%) |
 |-------|-----------|
- | **AstroLLaMA-3-8B (AstroMLab)** | **72.3** |
 | LLaMA-3-8B | 72.0 |
 | Gemma-2-9B | 71.5 |
 | Qwen-2.5-7B | 70.4 |
 | Yi-1.5-9B | 68.4 |
@@ -86,9 +90,17 @@ Here's a performance comparison chart based upon the astronomical benchmarking Q
 | Mistral-7B-v0.3 | 63.9 |
 | ChatGLM3-6B | 50.4 |
 
- As shown, while AstroLLaMA-3-8B performs competitively among models in its class, it does not surpass the performance of the base LLaMA-3-8B model. This underscores the challenges in developing specialized models and the need for more diverse and comprehensive training data.
 
- It's worth noting that the AstroLLaMA-3-8B-Plus which we will release in the next model release addresses these limitations by expanding beyond astro-ph data.
 
 ## Ethical Considerations
 
 - meta-llama/Llama-3-8b-hf
 ---
 
+ # AstroLLaMA-3-8B-Base_Summary
 
+ AstroLLaMA-3-8B-Base_Summary is a specialized base language model for astronomy, developed by fine-tuning Meta's LLaMA-3-8b architecture on summarized astronomical literature. This model was developed by the AstroMLab team. It is designed for next token prediction tasks and is not an instruct/chat model.
 
 ## Model Details
 
 - **Base Architecture**: LLaMA-3-8b
+ - **Training Data**: Summarized content from arXiv's astro-ph category papers
+ - **Data Processing**:
+   1. Optical character recognition (OCR) on PDF files using the Nougat tool
+   2. Summarization of OCR'd text using Qwen-2-8B and LLaMA-3.1-8B, reducing content to about 1,000-4,000 tokens per paper
 - **Fine-tuning Method**: Continual Pre-Training (CPT) using the LMFlow framework
 - **Training Details**:
   - Learning rate: 2 × 10⁻⁵
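
The data-processing steps listed above (Nougat OCR followed by LLM summarization down to roughly 1,000-4,000 tokens per paper) are described but not shown in this commit. The sketch below only illustrates what such a summarization step might look like with `transformers`; the summarizer checkpoint, prompt, and token budget are assumptions, not the pipeline AstroMLab actually used.

```python
# Illustrative sketch of the summarization step only; the actual prompts, models,
# and token budgets used by AstroMLab are not given in this card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

summarizer_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder instruction-tuned summarizer
tokenizer = AutoTokenizer.from_pretrained(summarizer_id)
model = AutoModelForCausalLM.from_pretrained(
    summarizer_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def summarize_paper(nougat_markdown: str, max_new_tokens: int = 2048) -> str:
    """Condense one OCR'd paper into a summary of a few thousand tokens."""
    messages = [
        {"role": "system", "content": "Summarize this astronomy paper, keeping the key methods and results."},
        {"role": "user", "content": nougat_markdown},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=max_new_tokens, do_sample=False)
    # Strip the prompt tokens and return only the generated summary.
    return tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True)
```
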
 
 import torch
 
 # Load the model and tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("AstroMLab/astrollama-3-8b-base_summary")
+ model = AutoModelForCausalLM.from_pretrained("AstroMLab/astrollama-3-8b-base_summary", device_map="auto")
 
 # Create the pipeline with explicit truncation
 from transformers import pipeline
 
 print(generated_text[0]['generated_text'])
 ```
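
The hunks above expose only fragments of the card's quick-start snippet. Stitched together from the visible lines, a complete usage example looks roughly like the following; the prompt text and the truncation/max_length settings are assumptions, since those lines are not part of this diff.

```python
# Reconstructed from the fragments visible in this diff; the prompt text and the
# truncation/max_length settings are assumptions, not part of the shown changes.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("AstroMLab/astrollama-3-8b-base_summary")
model = AutoModelForCausalLM.from_pretrained(
    "AstroMLab/astrollama-3-8b-base_summary", device_map="auto"
)

# Create the pipeline with explicit truncation
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
    max_length=512,  # assumption: generation budget not visible in the diff
)

# Example next-token-prediction prompt (placeholder; this is a base model, not a chat model)
prompt = "The dark matter halo mass function"
generated_text = generator(prompt, do_sample=True)
print(generated_text[0]['generated_text'])
```
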
 
+ ## Model Improvements and Performance
 
+ A key innovation in this model is the use of summarized content for training, which has led to improved performance compared to the AIC (Abstract, Introduction, Conclusion) version. The summarization process allows for the inclusion of more comprehensive information from each paper while maintaining a manageable token count.
 
 Here's a performance comparison chart based upon the astronomical benchmarking Q&A as described in [Ting et al. 2024](https://arxiv.org/abs/2407.11194), and Pan et al. 2024:
 
 | Model | Score (%) |
 |-------|-----------|
+ | LLaMA-3.1-8B | 73.7 |
+ | **<span style="color:green">AstroLLaMA-3-8B-Base_Summary (AstroMLab)</span>** | **<span style="color:green">72.3</span>** |
 | LLaMA-3-8B | 72.0 |
+ | AstroLLaMA-3-8B-Base_AIC | 72.3 |
 | Gemma-2-9B | 71.5 |
 | Qwen-2.5-7B | 70.4 |
 | Yi-1.5-9B | 68.4 |
 | Mistral-7B-v0.3 | 63.9 |
 | ChatGLM3-6B | 50.4 |
 
+ As shown, AstroLLaMA-3-8B-Base_Summary performs competitively, nearly matching the performance of the base LLaMA-3.1-8B model and outperforming the AIC version. This improvement demonstrates the importance of information density in the training data.
 
+ Notably, the instruct version of this model shows even more significant improvements, highlighting the effectiveness of the summarization approach in capturing and retaining key astronomical concepts. For detailed performance analysis of the instruct version, please refer to Pan et al. 2024.
+
+ ## Model Limitations and Future Directions
+
+ While the summarization approach has shown promising results, there is still room for improvement. Future iterations may benefit from:
+
+ 1. Incorporating a broader range of high-quality astronomical data beyond arXiv, such as textbooks and curated Wikipedia content.
+ 2. Further refining the summarization process to capture even more relevant information.
+ 3. Exploring ways to integrate more diverse astronomical concepts and recent discoveries into the training data.
 
 ## Ethical Considerations