---
language:
- en
license: mit
tags:
- physics
- cosmology
datasets:
- teknium/OpenHermes-2.5
pipeline_tag: text-generation
base_model: mistralai/Mistral-7B-v0.1
model-index:
- name: cosmosage_qa
  results: []
---

# cosmosage

cosmosage is a natural-language assistant that answers questions about cosmology.

cosmosage_v2 first underwent continued pretraining on thousands of papers and textbooks,
and was subsequently fine-tuned on synthetically generated question-answer pairs. It is a full
chat model, though it excels in Q&A mode, where the model gives a single answer in response to
a single question.

The code used to generate cosmosage_v2 is available at https://github.com/tijmen/cosmosage

## Usage

After downloading cosmosage_v2, the following example code can be used to ask questions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Path to the downloaded model directory
model_path = "models/cosmosage_v2/"

device = torch.device("cuda")
# Load the weights directly in bfloat16 instead of casting after loading
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_path)

def ask_cosmosage(question):
    prompt = f"You are cosmosage, an AI programmed to provide excellent and detailed answers to the user's question. You are an expert cosmology assistant, able to answer questions on the cosmic microwave background, galaxy formation, large scale structure, theoretical cosmology, inflation, big bang nucleosynthesis, cosmology instrumentation, and other related topics. Please assume the user is fluent in scientific terminology. Elaborate where possible to give a complete answer. If you do not know, say you do not know.▁ USER: {question}▁ ASSISTANT:"
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
    # max_length counts the prompt tokens plus the generated tokens
    generated_ids = model.generate(input_ids, max_length=1024, do_sample=True, temperature=0.7, top_k=None, pad_token_id=tokenizer.eos_token_id)
    generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    # Keep only the text after the final "ASSISTANT:" marker
    answer = generated_text.split("ASSISTANT:")[-1].strip()
    return answer
```

## Comparison to cosmosage_v1

cosmosage_v2 is more knowledgeable than cosmosage_v1 because it was pretrained on papers and
textbooks rather than only on synthetically generated QA pairs. However, it continues to struggle with
_reliability_: many of its answers are factually accurate, but some are not. The outputs of cosmosage
(or any LLM) should not be trusted to be factual.

## Training details

cosmosage_v2 was trained on 4xA100 (80 GB) GPUs at the Center for Computational Astrophysics (CfCA), National Astronomical Observatory of Japan (NAOJ).

The following hyperparameters were used during continued pretraining:
- learning_rate: 1e-05
- train_batch_size: 4
- max_grad_norm: 3.0
- num_devices: 4
- total_train_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 100
- num_epochs: 3.0
- weight_decay: 1e-04

The following hyperparameters were used during QA tuning:
- learning_rate: 2e-06
- train_batch_size: 4
- max_grad_norm: 3.0
- num_devices: 4
- total_train_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 100
- num_epochs: 2.0
- weight_decay: 0.0
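
The two learning-rate schedules and the batch-size arithmetic listed above can be sketched in plain Python. This is an illustration only: it assumes linear warmup followed by the usual cosine or linear decay to zero, and `TOTAL_STEPS` is a hypothetical value, since the actual number of optimizer steps is not given here. The effective batch size of 16 is consistent with 4 samples per device on 4 devices, assuming no gradient accumulation.

```python
import math


def lr_at_step(step, total_steps, base_lr, warmup_steps=100, schedule="cosine"):
    """Learning rate at a given optimizer step (sketch, not the training code)."""
    if step < warmup_steps:
        # Linear warmup from 0 up to base_lr
        return base_lr * step / warmup_steps
    # Fraction of the post-warmup phase that has elapsed, clipped to [0, 1]
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    if schedule == "cosine":
        return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
    if schedule == "linear":
        return base_lr * (1.0 - progress)
    raise ValueError(f"unknown schedule: {schedule}")


TOTAL_STEPS = 1000  # hypothetical; the real step count is not stated

# Effective batch size, assuming no gradient accumulation
per_device_batch = 4
num_devices = 4
effective_batch = per_device_batch * num_devices

# Continued pretraining: cosine schedule, peak learning rate 1e-05
pretrain_peak = lr_at_step(100, TOTAL_STEPS, 1e-05, schedule="cosine")
# QA tuning: linear schedule, peak learning rate 2e-06
qa_midpoint = lr_at_step(550, TOTAL_STEPS, 2e-06, schedule="linear")
```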

## Versions

This repository contains:
 - pytorch_model.bin: standard version (bfloat16)
 - model.safetensors: same as pytorch_model.bin but in safetensors format
 - gptq_model-8bit-128g.safetensors: 8-bit quantized version for inference speedup and low-VRAM GPUs
 - gptq_model-4bit-128g.safetensors: 4-bit quantized version for even faster inference and lower VRAM requirements, at the cost of lower quality

When using one of the quantized versions, make sure to pass the quantization configuration:
```json
{
  "bits": <4 or 8 depending on the version>, 
  "group_size": 128,
  "damp_percent": 0.01,
  "desc_act": false,
  "static_groups": false,
  "sym": true,
  "true_sequential": true,
  "model_name_or_path": null,
  "model_file_base_name": null
}
```
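
As a minimal sketch, the configuration above can be built in Python with the `bits` placeholder filled in for the chosen file. The helper names `make_quantize_config` and `weights_filename` are hypothetical, and how the resulting dictionary is handed to a loader depends on the GPTQ library in use, which is not specified here.

```python
import json


def make_quantize_config(bits):
    """Return the quantization settings listed above for the 4- or 8-bit file."""
    if bits not in (4, 8):
        raise ValueError("bits must be 4 or 8")
    return {
        "bits": bits,
        "group_size": 128,
        "damp_percent": 0.01,
        "desc_act": False,
        "static_groups": False,
        "sym": True,
        "true_sequential": True,
        "model_name_or_path": None,
        "model_file_base_name": None,
    }


def weights_filename(bits):
    """Name of the quantized weights file in this repository for a bit width."""
    return f"gptq_model-{bits}bit-128g.safetensors"


# Serialize the 4-bit configuration as JSON
config_json = json.dumps(make_quantize_config(4), indent=2)
print(config_json)
```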

## Instruction format

cosmosage_v2 was trained with the "inst" chat template as implemented in axolotl v0.4.0. This resulted in an 
unusual instruction format:
```raw
<s> {system prompt}▁ USER: {question}▁ ASSISTANT:
```
 - Be sure not to leave any whitespace after the final ":"
 - Note the use of the U+2581 Lower One Eighth Block Unicode Character to separate the different sections.
 - The example code in the Usage section above correctly implements this format.
 - cosmosage_v2 does not appear to be sensitive to minor deviations from this format. 
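
The format can also be assembled by a small helper, sketched below. The names `build_prompt` and `extract_answer` are hypothetical. The leading `<s>` is left out on the assumption that the tokenizer prepends it, as in the Usage code.

```python
SEPARATOR = "\u2581"  # U+2581 LOWER ONE EIGHTH BLOCK


def build_prompt(system_prompt, question):
    """Assemble a prompt in the format described above (without the leading <s>)."""
    # No whitespace may follow the final ":", so nothing is appended after it
    return (
        f"{system_prompt.strip()}{SEPARATOR} USER: "
        f"{question.strip()}{SEPARATOR} ASSISTANT:"
    )


def extract_answer(generated_text):
    """Return the text after the final "ASSISTANT:" marker."""
    return generated_text.split("ASSISTANT:")[-1].strip()


prompt = build_prompt(
    "You are an expert cosmology assistant.",
    "What is the cosmic microwave background?",
)
```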

## Example output

**User:**
> What is Digital Active Nulling?

**Assistant:**
> Digital Active Nulling (DAN) is a technique used to improve the performance of Transition Edge Sensor (TES) bolometers in millimeter-wave astrophysics. It nulls the current flowing through the SQUID input coil, which is used to read out the TES, by injecting a current at the SQUID input that is the negative of the current through the TES. This results in a significant reduction in the effective input impedance of the SQUID, which increases the dynamic range of the readout system and reduces non-linearity. In addition, DAN replaces the flux-locked loop (FLL), removing the wiring length and bandwidth limitations of the FLL. DAN is used in the Simons Array and SPT-3G experiments, and is the baseline readout technology for the upcoming LiteBIRD satellite.
 

**User:**
> Explain the ISW effect.

**Assistant:**
> 1. The Integrated Sachs-Wolfe (ISW) effect is a phenomenon in cosmology that describes the contribution of gravitational potential fluctuations to the Cosmic Microwave Background (CMB) anisotropy. It is named after Rainer K. Sachs and Arthur M. Wolfe, who first described the effect in their 1967 paper.
> 2. The ISW effect occurs when CMB photons travel through a time-varying gravitational potential due to the expansion of the Universe. As the photons pass through regions where the gravitational potential is changing, their energy is altered, leading to a change in their temperature.
> 3. There are two main phases in which the ISW effect can occur: the early-time effect and the late-time effect.
> 4. The early-time ISW effect, also known as the Rees-Sciama effect, occurs during the transition from the radiation-dominated era to the matter-dominated era in the early Universe. As the Universe expands and the density of radiation decreases, the gravitational potentials begin to decay, causing CMB photons to gain energy and contribute to the anisotropy.
> 5. The late-time ISW effect occurs when the Universe transitions to the current dark energy-dominated era. As the Universe expands and the dark energy starts to dominate, the gravitational potentials decay again, causing CMB photons to gain energy and contribute to the anisotropy.
> 6. The ISW effect is challenging to detect because it is a secondary anisotropy in the CMB, meaning it is superimposed on the primary anisotropies that formed during the early Universe. Additionally, the ISW effect has a similar angular scale as other CMB anisotropies, making it difficult to isolate from other sources.
> 7. The detection of the ISW effect has been achieved through cross-correlating the CMB with large-scale structure (LSS) surveys, such as galaxy surveys. These cross-correlations can reveal the ISW signal by comparing the distribution of galaxies with the CMB anisotropies.
> 8. The ISW effect is an important tool for studying the large-scale structure of the Universe, the properties of dark energy, and the evolution of cosmic structures. It also provides a way to test and constrain cosmological models, such as the ΛCDM model.

**User:**
> Explain how one would calculate the helium fraction at the surface of last scattering.

**Assistant:**
> 1. Determine the redshift of the surface of last scattering (zLS) using the CMB temperature of 2.7255 K and the current temperature of the universe T0 = 2.7255 K. The surface of last scattering is approximately 1380 K, which corresponds to a redshift of zLS = 1089.9.
> 2. Calculate the scale factor at the surface of last scattering (aLS) using the redshift. The scale factor at the surface of last scattering is approximately aLS = 1/(1 + zLS) = 1/(1 + 1089.9) ≈ 9.143 × 10^-4.
> 3. Determine the Hubble parameter at the surface of last scattering (HLS). The Hubble parameter is related to the Hubble constant (H0) and the scale factor (a) by the equation H(z) = H0 * a^(-3/2). Since the radiation density is much larger than the matter density at the surface of last scattering, we can use the following approximation for the Hubble parameter: H(z) ≈ H0 * sqrt(1 + z) ≈ H0 * sqrt(1 + 1089.9) ≈ 1.06 × 10^4 * H0.
> 4. Calculate the age of the universe at the surface of last scattering (tLS). The age of the universe can be calculated using the Hubble parameter and the scale factor: t = 1/H. At the surface of last scattering, the age of the universe is approximately tLS ≈ 1/(1.06 × 10^4 * H0) ≈ 9.4 × 10^-10 * H0.
> 5. Determine the neutron-to-proton ratio at the surface of last scattering (n/p ratio). The neutron-to-proton ratio can be calculated using the equation n/p = exp(-Q/kT), where Q is the neutron-proton mass difference (1.293 MeV), k is the Boltzmann constant, and T is the temperature at the surface of last scattering. At the surface of last scattering, the neutron-to-proton ratio is approximately n/p ≈ exp(-1.293/0.801) ≈ 0.198.
> 6. Calculate the helium fraction (Yp) at the surface of last scattering (YpLS). The helium fraction can be calculated using the following equation: Yp = 2 * n/p, where n is the number of neutrons and p is the number of protons. At the surface of last scattering, the helium fraction is approximately YpLS ≈ 2 * 0.198 ≈ 0.396.
> 7. Determine the helium fraction at the surface of last scattering (YpLS). The helium fraction at the surface of last scattering is approximately YpLS ≈ 0.396.

**User:**
> How does the time of matter-radiation equality affect the damping tail?

**Assistant:**
> The time of matter-radiation equality affects the damping tail by influencing the amount of damping that occurs in the CMB power spectrum. The damping tail is the region of the CMB power spectrum that is suppressed due to the diffusion of photons in the early universe. This diffusion is caused by the interaction of photons with free electrons, which scatter the photons and reduce their energy. The amount of damping that occurs is determined by the amount of time that the photons spend in this diffusive state, which is influenced by the time of matter-radiation equality. 
>
> Matter-radiation equality is the point in the history of the universe when the energy density of matter becomes equal to the energy density of radiation. This occurs when the universe is approximately 50,000 years old. The time of matter-radiation equality affects the damping tail because it determines the amount of time that photons spend in the diffusive state before they are able to decouple from the matter and travel freely through the universe. 
>
> If matter-radiation equality occurs later in the history of the universe, then the photons spend more time in the diffusive state, which results in more damping of the CMB power spectrum. This is because the longer the photons spend in the diffusive state, the more time they have to lose energy through scattering with free electrons. Conversely, if matter-radiation equality occurs earlier in the history of the universe, then the photons spend less time in the diffusive state, which results in less damping of the CMB power spectrum. 
>
> In summary, the time of matter-radiation equality affects the damping tail of the CMB power spectrum by influencing the amount of time that photons spend in the diffusive state before they are able to decouple from the matter and travel freely through the universe. The longer the photons spend in the diffusive state, the more damping occurs, and the earlier matter-radiation equality occurs, the less damping occurs.

## [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)

Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_Tijmen2__cosmosage_v2)

|             Metric              |cosmosage_v2|OpenHermes-2.5-Mistral-7B|
|---------------------------------|-----------:|------------------------:|
|Avg.                             |60.66|61.52|
|AI2 Reasoning Challenge (25-shot)|59.73|64.93|
|HellaSwag (10-shot)              |80.90|84.18|
|MMLU (5-shot)                    |59.57|63.64|
|TruthfulQA (0-shot)              |50.98|52.24|
|Winogrande (5-shot)              |75.93|78.06|
|GSM8k (5-shot)                   |36.85|26.08|

cosmosage_v2 can be compared to OpenHermes-2.5-Mistral-7B because both started from the same base model and cosmosage_v2 was also trained on the OpenHermes-2.5 dataset.

cosmosage_v2 performs only slightly below OpenHermes-2.5-Mistral-7B on most metrics (by about 1 to 5 points), indicating that the
heavy specialization in cosmology has not come at much cost to general-purpose abilities. The exception
is GSM8k, a collection of grade school math problems, where cosmosage_v2 scores 36.85 against 26.08
for OpenHermes-2.5-Mistral-7B.
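
The averages and per-benchmark gaps can be recomputed from the table with a few lines of Python. This is only a sanity check on the reported numbers, assuming the average is the unweighted mean of the six benchmarks; the variable names are illustrative.

```python
# Scores copied from the table above: (cosmosage_v2, OpenHermes-2.5-Mistral-7B)
scores = {
    "ARC (25-shot)": (59.73, 64.93),
    "HellaSwag (10-shot)": (80.90, 84.18),
    "MMLU (5-shot)": (59.57, 63.64),
    "TruthfulQA (0-shot)": (50.98, 52.24),
    "Winogrande (5-shot)": (75.93, 78.06),
    "GSM8k (5-shot)": (36.85, 26.08),
}

# Unweighted mean over the six benchmarks for each model
cosmosage_avg = round(sum(a for a, _ in scores.values()) / len(scores), 2)
openhermes_avg = round(sum(b for _, b in scores.values()) / len(scores), 2)

# Positive difference means cosmosage_v2 scores higher
differences = {name: round(a - b, 2) for name, (a, b) in scores.items()}

for name, diff in differences.items():
    print(f"{name}: {diff:+.2f}")
```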