---
datasets:
- homebrewltd/instruction-speech-whispervq-v2
language:
- en
license: apache-2.0
tags:
- sound language model
pipeline_tag: audio-text-to-text
---

## Model Details

We have developed and released the [llama3-s](https://huggingface.co/collections/homebrew-research/llama3-s-669df2139f0576abc6eb7405) family of models, which natively understand audio and text input.

We continually pretrain [homebrewltd/llama3.2-3B-s-whispervq-init](https://huggingface.co/homebrewltd/llama3.2-3B-s-whispervq-init), which has an expanded vocabulary, on 900M tokens from the [homebrewltd/raw-speech-whispervq-v1](https://huggingface.co/datasets/homebrewltd/raw-speech-whispervq-v1) dataset.
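
To inspect the pretraining corpus, the dataset can be streamed with the `datasets` library. This is a minimal sketch under assumptions: it presumes the dataset follows the standard Hub layout, and the actual field names may differ from what the viewer shows.

```python
from datasets import load_dataset

# Stream the continual-pretraining corpus without downloading it in full.
# The schema is an assumption; check the dataset viewer for actual fields.
ds = load_dataset("homebrewltd/raw-speech-whispervq-v1", split="train", streaming=True)
print(next(iter(ds)))
```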

**Model developers** Homebrew Research.

**Input** Text and sound.

**Output** Text.

**Model Architecture** Llama-3.

**Language(s):** English.
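
For reference, here is a minimal usage sketch with the Hugging Face `transformers` API. It is an illustration under assumptions, not this card's official recipe: the checkpoint is loaded as a standard Llama-architecture causal LM, the `model_id` shown is the initialization checkpoint referenced above (substitute this card's checkpoint id), and the `<|sound_...|>` strings are placeholders for the WhisperVQ semantic tokens in the expanded vocabulary, whose exact format may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the checkpoint as a standard Llama-architecture causal LM.
model_id = "homebrewltd/llama3.2-3B-s-whispervq-init"  # substitute this card's checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Audio is represented as discretized WhisperVQ tokens in the expanded
# vocabulary; the token strings below are illustrative placeholders.
prompt = "<|sound_start|><|sound_0012|><|sound_0345|><|sound_end|>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```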

## Intended Use

**Intended Use Cases** This family is primarily intended for research applications. This version aims to further improve the LLM's sound-understanding capabilities.

**Out-of-scope** The use of llama3-s in any manner that violates applicable laws or regulations is strictly prohibited.

## Training process
**Training Metrics Image**: Below is a snapshot of the training loss curve.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65713d70f56f9538679e5a56/etosaWAQ8TASXOEUADGpi.png)

**MMLU**:

| Model | MMLU Score |
| --- | --- |
| llama3.5-instruct-8b | 69.40 |
| ichigo-llama3.1-s-v0.3: phase 3 | 63.79 |
| ichigo-llama3.1-s-v0.3: phase 2 | 63.08 |
| ichigo-llama3.1-s-base-v0.3 | 42.11 |
| mini-ichigo-llama3.2-3B-s-instruct | 59.61 |
| mini-ichigo-llama3.2-3B-s-base | **58.68** |
| llama3.5-instruct-v0.2 | 50.27 |

### Hardware

**GPU Configuration**: Cluster of 10x NVIDIA A6000 (48 GB) GPUs.

**GPU Usage**:
  - **Continual Training**: 30 hours.

### Training Arguments

We use the [torchtune](https://github.com/pytorch/torchtune) library for its up-to-date FSDP2 training implementation.

| Parameter                  | Continual Training      | 
|----------------------------|-------------------------|
| **Epoch**                  | 1                       | 
| **Global batch size**      | 480                     | 
| **Learning Rate**          | 2e-4                    | 
| **Learning Scheduler**     | LambdaLR with warmup    | 
| **Optimizer**              | AdamW fused             | 
| **Warmup Steps**           | 80                      | 
| **Weight Decay**           | 0.01                    |
| **Max Sequence Length**    | 512                     |
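
As a reading aid for the table above, here is a hedged PyTorch sketch of the optimizer and scheduler rows. Linear warmup over 80 steps followed by a flat schedule is an assumption; the card only states "LambdaLR with warmup" and does not specify the post-warmup shape.

```python
import torch

# Stand-in parameter so the snippet runs without a real model;
# fused AdamW requires CUDA tensors, so fall back gracefully on CPU.
use_fused = torch.cuda.is_available()
device = "cuda" if use_fused else "cpu"
params = [torch.nn.Parameter(torch.zeros(1, device=device))]

# Peak LR 2e-4, weight decay 0.01, fused AdamW, as in the table.
optimizer = torch.optim.AdamW(params, lr=2e-4, weight_decay=0.01, fused=use_fused)

warmup_steps = 80
def lr_lambda(step: int) -> float:
    # Ramp linearly to the peak LR over the warmup steps, then hold flat
    # (the post-warmup shape is an assumption).
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```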


## Citation Information

**BibTeX:**

```
@misc{homebrewresearch2024llama3s,
  title={Llama3-S: Sound Instruction Language Model},
  author={Homebrew Research},
  year={2024},
  month={August},
  url={https://huggingface.co/homebrewltd/llama3.1-s-2024-08-15}
}
```

## Acknowledgement

- **[WhisperSpeech](https://github.com/collabora/WhisperSpeech)**

- **[Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)**