---
tags:
- physics
- cosmology
model-index:
- name: cosmosage_qa
  results: []
license: mit
language:
- en
pipeline_tag: text-generation
base_model: mistralai/Mistral-7B-v0.1
---

# cosmosage

Cosmosage is a natural-language assistant that answers questions about cosmology.

cosmosage_v2 first underwent continued pretraining on thousands of papers and textbooks,
and was subsequently fine-tuned on synthetically generated question-answer pairs. It is a full
chat model, though it excels in Q&A mode, where the model gives a single answer in response to
a single question.

The code used to generate cosmosage_v2 is available at https://github.com/tijmen/cosmosage

## Usage

After downloading cosmosage_v2, the following example code can be used to ask questions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

path_to_model = "cosmosage_v2/"  # local directory containing the downloaded model
device = "cuda"  # assumes a CUDA-capable GPU is available

model = AutoModelForCausalLM.from_pretrained(path_to_model).to(device)
tokenizer = AutoTokenizer.from_pretrained(path_to_model)

SYSTEM_PROMPT = (
    "You are cosmosage, an AI programmed to be a cosmology expert. "
    "You answer the USER's question clearly in long form, always providing context. "
    "When appropriate, provide a reference."
)

def ask_cosmosage(question):
    # Build the prompt piece by piece: system prompt, "USER:" + question, "ASSISTANT:".
    # Token id 28705 is inserted between the segments as a separator.
    input_ids = torch.cat([
        tokenizer.encode(SYSTEM_PROMPT, return_tensors="pt"),
        torch.tensor([[28705]]),
        tokenizer.encode("USER:", add_special_tokens=False, return_tensors="pt"),
        tokenizer.encode(question, add_special_tokens=False, return_tensors="pt"),
        torch.tensor([[28705]]),
        tokenizer.encode("ASSISTANT:", add_special_tokens=False, return_tensors="pt"),
    ], dim=-1).to(device)
    # Sample up to 1000 new tokens after the prompt.
    generated_ids = model.generate(input_ids, max_new_tokens=1000, do_sample=True)
    # The decoded string contains the prompt followed by the model's answer.
    return tokenizer.decode(generated_ids[0], skip_special_tokens=True)
```
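
Because `ask_cosmosage` decodes the full generated sequence, the returned string contains the system prompt and the question as well as the answer. The following is a minimal sketch of how the answer alone could be extracted. `extract_answer` is a hypothetical helper, not part of the cosmosage code, and it assumes that the literal marker `ASSISTANT:` appears in the decoded text and does not occur in the question itself.

```python
def extract_answer(decoded_text, marker="ASSISTANT:"):
    """Return the text that follows the first occurrence of `marker`.

    Hypothetical helper: assumes the decoded prompt keeps the literal
    "ASSISTANT:" marker and that the question does not contain it.
    If the marker is missing, the input is returned unchanged (stripped).
    """
    _, found, answer = decoded_text.partition(marker)
    return answer.strip() if found else decoded_text.strip()


# Hand-written stand-in for a decoded string, not real model output.
example = (
    "You are cosmosage, an AI programmed to be a cosmology expert. "
    "USER: What is the CMB? "
    "ASSISTANT: The cosmic microwave background is relic radiation "
    "from the early universe."
)
print(extract_answer(example))
```

With the model loaded, the same helper could be applied as `extract_answer(ask_cosmosage("What is the CMB?"))`.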

## Comparison to cosmosage_v1

cosmosage_v2 is more knowledgeable than cosmosage_v1 because it was pretrained on papers and
textbooks rather than being trained only on synthetically generated QA pairs. However, it continues
to struggle with _reliability_: many of its answers are factually accurate, but some are not. The
outputs of cosmosage (or any LLM) should not be trusted to be factual.

## Training hyperparameters

The following hyperparameters were used during continued pretraining:
- learning_rate: 1e-05
- max_grad_norm: 3.0
- train_batch_size: 4
- eval_batch_size: 4
- seed: 701
- distributed_type: multi-GPU
- num_devices: 4
- total_train_batch_size: 16
- total_eval_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 100
- num_epochs: 3.0
- weight_decay: 1e-04

The following hyperparameters were used during QA tuning:
- learning_rate: 2e-06
- max_grad_norm: 3.0
- train_batch_size: 4
- eval_batch_size: 4
- seed: 702
- distributed_type: multi-GPU
- num_devices: 4
- total_train_batch_size: 16
- total_eval_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 100
- num_epochs: 2.0
- weight_decay: 0.0
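
The two phases differ mainly in peak learning rate (1e-05 vs. 2e-06) and scheduler shape (cosine vs. linear), with the same 100 warmup steps and the same effective batch size of 4 per device × 4 devices = 16. The sketch below illustrates both schedules using the common "linear warmup, then decay to zero" form. Whether the original training followed exactly this shape is an assumption, and `TOTAL_STEPS` is a hypothetical value, since the number of optimizer steps is not stated here.

```python
import math


def lr_at_step(step, total_steps, peak_lr, warmup_steps=100, schedule="cosine"):
    """Learning rate at an optimizer step: linear warmup, then decay to zero.

    Illustrative only; the decay shape is an assumption modelled on the
    usual warmup schedulers, not taken from the cosmosage training code.
    """
    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, clipped to 1.
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    if schedule == "cosine":
        return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
    if schedule == "linear":
        return peak_lr * (1.0 - progress)
    raise ValueError(f"unknown schedule: {schedule!r}")


TOTAL_STEPS = 1000  # hypothetical; the real step count is not given

# Effective batch size from the listed values.
per_device_batch_size, num_devices = 4, 4
total_batch_size = per_device_batch_size * num_devices

for step in (0, 50, 100, 550, 1000):
    pretrain_lr = lr_at_step(step, TOTAL_STEPS, 1e-5, schedule="cosine")
    qa_lr = lr_at_step(step, TOTAL_STEPS, 2e-6, schedule="linear")
    print(f"step {step:4d}  pretraining lr {pretrain_lr:.2e}  QA lr {qa_lr:.2e}")
```

In both cases the learning rate reaches its peak at step 100 and then falls to zero by the final step; the cosine schedule stays closer to the peak early in the decay, while the linear schedule falls at a constant rate.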