---
language:
- en
pipeline_tag: text-generation
tags:
- meta
- llama-3
license: llama3
---
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c14f6b02e1f8f67c73bd05/pf4d6FA7DriRtVq5HCkxd.png)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c14f6b02e1f8f67c73bd05/VcZWbW_eZkJAZZ5ricL4B.png)
# Llama-3-Giraffe-70B-Instruct
Abacus.AI presents our longer-necked variant of Llama 3 70B - now with the instruct variant!
This model has an effective context length of approximately 128k.
We have currently trained on ~1.5B tokens.
These are our Needle-in-a-Haystack heatmap results. We are conducting further evaluations of model efficacy and will update the model card as they come in:
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c14f6b02e1f8f67c73bd05/Z4uUhcjgf1P7EPGQyRLkW.png)
### MT-Bench Evaluation
We also measured performance on MT-Bench to verify that the context extension did not significantly impact performance on instruct tasks:
```
1st turn:
Meta-Llama-3-70B-Instruct       9.21
Llama-3-Giraffe-70B-Instruct    9.19

2nd turn:
Meta-Llama-3-70B-Instruct       8.80
Llama-3-Giraffe-70B-Instruct    8.54

Average:
Meta-Llama-3-70B-Instruct       9.00
Llama-3-Giraffe-70B-Instruct    8.87
```
## Training Methodology
The methodology for training uses [PoSE](https://arxiv.org/abs/2309.10400) and dynamic-NTK interpolation.
### NTK-scaling
The NTK scale factor is 4. We also tried theta-scaling, but it did not work as well as NTK scaling in our experiments.
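As a concrete illustration, dynamic-NTK scaling enlarges the RoPE base once the sequence exceeds the original context window, stretching the low rotary frequencies over the longer context. The sketch below follows the dynamic-NTK formula popularised by Hugging Face `transformers`; the defaults (head dim 128, base 500000, window 8192) match Llama 3 70B, and `factor=4` is the scale factor used here. This is an illustrative sketch, not our training code:

```python
def dynamic_ntk_inv_freq(seq_len, dim=128, base=500000.0,
                         original_max_pos=8192, factor=4.0):
    """RoPE inverse frequencies with dynamic-NTK scaling.

    Once seq_len exceeds the original context window, the rotary base is
    enlarged so that low frequencies cover the longer context. Defaults
    approximate Llama 3 70B; illustrative only.
    """
    if seq_len > original_max_pos:
        base = base * (factor * seq_len / original_max_pos
                       - (factor - 1)) ** (dim / (dim - 2))
    # Standard RoPE inverse-frequency schedule over the (possibly scaled) base.
    return [1.0 / base ** (2 * i / dim) for i in range(dim // 2)]
```

At sequence lengths within the original window the schedule is unchanged; beyond it, the base grows with the sequence length, which is what makes the scaling "dynamic" rather than fixed.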
### PoSE
We utilise Positional Skip-wise Training (PoSE) with the following parameters:
- **Number of Chunks**: 5
- **Max position ID**: 32768
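PoSE lets the model see position ids from the full extended range while training on short inputs: each sample is split into chunks and random skips are inserted between the chunks' position ids. The helper below (`pose_position_ids` is a hypothetical name, not our training code) sketches that assignment with the parameters above:

```python
import random

def pose_position_ids(sample_len, num_chunks=5, max_pos=32768, seed=0):
    """Assign PoSE-style position ids to a short sample.

    The sample is split into `num_chunks` contiguous chunks, and random
    skips are inserted between the chunks' position ids so the ids span
    [0, max_pos) even though only sample_len tokens are seen.
    """
    rng = random.Random(seed)
    chunk_len = sample_len // num_chunks
    total_skip = max_pos - sample_len
    # Randomly split the skip budget across the num_chunks - 1 gaps.
    cuts = sorted(rng.randint(0, total_skip) for _ in range(num_chunks - 1))
    gaps = [cuts[0]] + [b - a for a, b in zip(cuts, cuts[1:])]
    position_ids, pos = [], 0
    for c in range(num_chunks):
        if c > 0:
            pos += gaps[c - 1]  # skip ahead before this chunk
        # Last chunk absorbs any remainder from the integer division.
        n = chunk_len if c < num_chunks - 1 else sample_len - chunk_len * (num_chunks - 1)
        position_ids.extend(range(pos, pos + n))
        pos += n
    return position_ids
```

The attention pattern within each chunk is unchanged; only the position ids are remapped, which is why PoSE can extend context at short-sequence training cost.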
### Data
We use samples from [RedPajama](https://github.com/togethercomputer/RedPajama-Data) with an average length of ~8K tokens.
### Hardware
We train on 8xH100 GPUs with Deepspeed Zero Stage 3.
## Evaluation Methodology
We use the [EasyContext](https://github.com/abacusai/EasyContext/blob/eval_runs/eval_needle.py) implementation of Needle-in-a-Haystack to evaluate Llama-3-Giraffe-70B.
We evaluate with the following parameters:
- **Min context length**: 2000
- **Max context length**: 128000
- **Context interval**: 4000
- **Depth interval**: 0.1
- **Num samples**: 2
- **Rnd number digits**: 7
- **Haystack dir**: PaulGrahamEssays
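These parameters define the heatmap grid of (context length, needle depth) cells. The sketch below enumerates that grid under the assumption that context lengths step from the minimum by the interval (whether the maximum is reached exactly depends on the EasyContext implementation; this is our assumption, and `needle_eval_grid` is a hypothetical name):

```python
def needle_eval_grid(min_ctx=2000, max_ctx=128000, ctx_interval=4000,
                     depth_interval=0.1):
    """Enumerate (context_length, needle_depth) cells for the heatmap."""
    contexts = list(range(min_ctx, max_ctx + 1, ctx_interval))
    depths = [round(i * depth_interval, 1)
              for i in range(int(round(1 / depth_interval)) + 1)]
    return [(c, d) for c in contexts for d in depths]
```

With the parameters above this yields 32 context lengths times 11 depths, and each cell is averaged over the 2 samples.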
### Adapter Transfer
We apply the above techniques first to Llama-3-70B-Base, using LoRA on the Q and K weights only. This adapter is then applied to Llama-3-70B-Instruct, and we
release the merged version here.
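The merge step can be illustrated with the standard LoRA update rule, W' = W + (alpha/r) * B A: an adapter trained against one base model is added into another weight matrix of the same shape. A toy numerical sketch with hypothetical dimensions and alpha/r values (the real merge operates on the Q and K projections of the 70B model):

```python
import numpy as np

def merge_lora(W, A, B, alpha=16, r=8):
    """Merge a LoRA adapter into a base weight matrix.

    W: (d_out, d_in) base weight; A: (r, d_in); B: (d_out, r).
    Returns W + (alpha / r) * B @ A. Values here are illustrative.
    """
    return W + (alpha / r) * (B @ A)

# Random stand-ins: an "Instruct" projection and an adapter trained elsewhere.
rng = np.random.default_rng(0)
d = 16
W_instruct = rng.standard_normal((d, d))
A = rng.standard_normal((8, d))
B = rng.standard_normal((d, 8))
W_merged = merge_lora(W_instruct, A, B)
```

Because the update is a plain additive low-rank term, an adapter fitted on the Base weights can be applied to the Instruct weights, which is what makes the transfer-and-merge release possible.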