---
library_name: transformers
license: llama3
language:
- th
- en
pipeline_tag: text-generation
---

# Typhoon-Audio Preview

**llama-3-typhoon-v1.5-8b-audio-preview** is a 🇹🇭 Thai *audio-language* model. It natively accepts both text and audio as input and produces text as output. This version (August 2024) is our first audio-language model, released as part of our multimodal effort, and it is a research *preview*. The base language model is our [llama-3-typhoon-v1.5-8b-instruct](https://huggingface.co/scb10x/llama-3-typhoon-v1.5-8b-instruct).

More details can be found in our [technical report](https://arxiv.org/abs/2409.10999).

*To acknowledge Meta's effort in creating the foundation model and to comply with the license, we explicitly include "llama-3" in the model name.*

## Model Description

- **Model type**: The LLM is based on Typhoon-1.5-8b-instruct, and the audio encoder is based on Whisper's encoder and BEATs.
- **Requirement**: transformers 4.38.0 or newer.
- **Primary Language(s)**: Thai 🇹🇭 and English 🇬🇧
- **Demo**: https://audio.opentyphoon.ai/
- **License**: [Llama 3 Community License](https://llama.meta.com/llama3/license/)

## Usage Example

```python
import torch
import librosa
import soundfile as sf
from transformers import AutoModel

# Initialize from the trained model
model = AutoModel.from_pretrained(
    "scb10x/llama-3-typhoon-v1.5-8b-audio-preview", 
    torch_dtype=torch.float16,
    trust_remote_code=True
)
model.to("cuda")
model.eval()

# Read an audio file; the model expects mono 16 kHz audio of at most 30 seconds
audio, sr = sf.read("path_to_your_audio.wav")
if audio.ndim == 2:
    audio = audio[:, 0]  # keep the first channel only
if len(audio) > 30 * sr:
    audio = audio[: 30 * sr]  # clip to the first 30 seconds
if sr != 16000:
    audio = librosa.resample(audio, orig_sr=sr, target_sr=16000, res_type="fft")

# Run generation
prompt_pattern="<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n<Speech><SpeechHere></Speech> {}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
response = model.generate(
    audio=audio,
    prompt="transcribe this audio",
    prompt_pattern=prompt_pattern,
    do_sample=False,
    max_new_tokens=512,
    repetition_penalty=1.1,
    num_beams=1,
    # temperature=0.4,
    # top_p=0.9,
)
print(response)
```
**Generation Parameters**:
- audio (`np.ndarray`) -- Audio waveform, e.g. as returned by `soundfile.read` (wav, flac, mp3, or any other format it supports)
- prompt (`str`) -- Text input to the model
- prompt_pattern (`str`) -- Chat template augmented with special tokens; it must match the template used during training
- max_new_tokens (`int`, *optional*, defaults to 1024)
- num_beams (`int`, *optional*, defaults to 4)
- do_sample (`bool`, *optional*, defaults to True)
- top_p (`float`, *optional*, defaults to 0.9)
- repetition_penalty (`float`, *optional*, defaults to 1.0)
- length_penalty (`float`, *optional*, defaults to 1.0)
- temperature (`float`, *optional*, defaults to 1.0)

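To illustrate how `prompt` and `prompt_pattern` fit together, here is a minimal sketch that fills the `{}` slot of the template from the usage example and splits the result around `<SpeechHere>`. The helper `split_prompt` is hypothetical, and the idea that audio embeddings are spliced in at `<SpeechHere>` is an assumption based on SALMONN-style models, not taken from `modeling_typhoonaudio.py`.

```python
# Chat template copied from the usage example above.
PROMPT_PATTERN = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "<Speech><SpeechHere></Speech> {}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)


def split_prompt(instruction: str, pattern: str = PROMPT_PATTERN) -> tuple[str, str]:
    """Fill the `{}` slot with the instruction, then split around <SpeechHere>.

    Hypothetical helper: it assumes the audio features sit where the
    <SpeechHere> placeholder is, which this model card does not confirm.
    """
    filled = pattern.format(instruction)
    # maxsplit=1 so a literal "<SpeechHere>" inside the instruction is left alone
    left, right = filled.split("<SpeechHere>", 1)
    return left, right


left, right = split_prompt("transcribe this audio")
print(repr(left))   # text placed before the audio
print(repr(right))  # text placed after the audio, ending with the assistant header
```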
There is also `model.generate_stream()` for streaming generation. Please refer to `modeling_typhoonaudio.py` for details on this function.

## Evaluation Results
More information is provided in our [technical report](https://arxiv.org/abs/2409.10999).

| Model                 | ASR-en (WER↓) | ASR-th (WER↓) | En2Th (BLEU↑) | X2Th (BLEU↑) | Th2En (BLEU↑) |
|:----------------------|:--------------|:--------------|:--------------|:-------------|:--------------|
| SALMONN-13B           | 5.79          | 98.07         | 0.07          | 0.10         | 14.97         |
| DiVA-8B               | 30.28         | 65.21         | 9.82          | 5.31         | 7.97          |
| Gemini-1.5-pro-001    | 5.98          | 13.56         | 20.69         | 13.52        | 22.54         |
| Typhoon-Audio-Preview | 8.72          | 14.17         | 17.52         | 10.67        | 24.14         |

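For readers unfamiliar with the WER columns above, the following is a minimal sketch of the generic word error rate: word-level edit distance divided by the number of reference words. It is not the evaluation pipeline used in the technical report. The function name `word_error_rate` is illustrative, and the inputs are assumed to be already split into words; Thai text is written without spaces between words, so it would need a word segmenter first, and the one used for the reported scores is not specified here.

```python
def word_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far
    # and the first j words of the hypothesis
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1] / len(reference)


ref = "the cat sat on the mat".split()
hyp = "the cat sit on mat".split()
# One substitution (sat -> sit) and one deletion (the) over 6 reference words
print(f"WER = {100 * word_error_rate(ref, hyp):.2f}%")  # WER = 33.33%
```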

| Model                 | Gender-th (Acc) | SpokenQA-th (F1) | SpeechInstruct-th |
|:----------------------|:----------------|:-----------------|:------------------|
| SALMONN-13B           | 93.26           | 2.95             | 1.18              |
| DiVA-8B               | 50.12           | 15.13            | 2.68              |
| Gemini-1.5-pro-001    | 81.32           | 62.10            | 3.93              |
| Typhoon-Audio-Preview | 93.74           | 64.60            | 6.11              |


## Intended Uses & Limitations
This model is experimental: it may not always follow instructions accurately, and it is prone to hallucination. It also lacks moderation mechanisms and may produce harmful or inappropriate responses. Developers should carefully assess the potential risks for their specific applications.

## Follow us & Support
- https://twitter.com/opentyphoon
- https://discord.gg/CqyBscMFpg

## Acknowledgements
We would like to thank the SALMONN team for open-sourcing their code and data, and the Biomedical and Data Lab at Mahidol University for releasing the fine-tuned Whisper whose encoder we adopted. We also thank the many other open-source projects that shared their knowledge, data, code, and model weights.

## Typhoon Team
*Potsawee Manakul*, Sittipong Sripaisarnmongkol, Natapong Nitarach, Warit Sirichotedumrong, Adisai Na-Thalang, Phatrasek Jirabovonvisut, Parinthapat Pengpun, 
Krisanapong Jirayoot, Pathomporn Chokchainant, Kasima Tharnpipitchai, *Kunat Pipatanakul*