---
library_name: transformers
license: llama3
language:
- th
- en
pipeline_tag: text-generation
---

# Typhoon-Audio Preview

<div align="center">
<img src="https://i.postimg.cc/DycZ98w2/typhoon-audio.png" alt="typhoon-audio" style="width: 100%; max-width: 20cm; margin-left: auto; margin-right: auto; display: block"/>
</div>

**llama-3-typhoon-v1.5-8b-audio-preview** is a 🇹🇭 Thai *audio-language* model. It natively supports both text and audio inputs, while its output is text. This version (August 2024) is our first audio-language model as part of our multimodal effort, and it is a research *preview*. The base language model is our [llama-3-typhoon-v1.5-8b-instruct](https://huggingface.co/scb10x/llama-3-typhoon-v1.5-8b-instruct).

More details can be found in our [release blog]() and [technical report](). *To acknowledge Meta's effort in creating the foundation model and to comply with the license, we explicitly include "llama-3" in the model name.*

## Model Description

- **Model type**: The LLM is based on Typhoon-1.5-8b-instruct, and the audio encoder is based on Whisper's encoder and BEATs.
- **Requirement**: transformers 4.38.0 or newer (a quick version check is sketched below).
- **Primary Language(s)**: Thai 🇹🇭 and English 🇬🇧
- **Demo**: https://audio.opentyphoon.ai/
- **License**: [Llama 3 Community License](https://llama.meta.com/llama3/license/)
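
The remote code path requires a recent `transformers`. The snippet below is a minimal sketch of a version check (it assumes `packaging` is available, which it normally is as a dependency of `transformers`):

```python
import transformers
from packaging import version

# The model card requires transformers >= 4.38.0 for the custom model code to load correctly.
assert version.parse(transformers.__version__) >= version.parse("4.38.0"), (
    f"transformers {transformers.__version__} is too old; upgrade to 4.38.0 or newer."
)
```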

## Usage Example

```python
import torch
from transformers import AutoModel

# Initialize from the trained model
model = AutoModel.from_pretrained(
    "scb10x/llama-3-typhoon-v1.5-8b-audio-preview", 
    torch_dtype=torch.float16,
    trust_remote_code=True
)
model.to("cuda")
model.eval()

# Run generation
# The prompt pattern wraps the audio placeholder (<Speech><SpeechHere></Speech>) and the text prompt ({})
# in the Llama 3 chat template.
prompt_pattern = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n<Speech><SpeechHere></Speech> {}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
response = model.generate(
    wav_path="path_to_your_audio.wav",
    prompt="transcribe this audio",
    prompt_pattern=prompt_pattern,
    do_sample=False,
    max_length=1200,
    repetition_penalty=1.1,
    num_beams=1,
    # temperature=0.4,
    # top_p=0.9,
    # streamer=streamer # supports TextIteratorStreamer
)
print(response)
```
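
The commented-out `streamer` argument above indicates that `generate` accepts a `TextIteratorStreamer`. The following is a minimal streaming sketch; it assumes the checkpoint ships a tokenizer loadable via `AutoTokenizer` and that the custom `generate` forwards `streamer` to the underlying generation loop:

```python
from threading import Thread
from transformers import AutoTokenizer, TextIteratorStreamer

tokenizer = AutoTokenizer.from_pretrained(
    "scb10x/llama-3-typhoon-v1.5-8b-audio-preview",
    trust_remote_code=True,
)
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Run generation in a background thread and consume tokens as they arrive.
thread = Thread(
    target=model.generate,
    kwargs=dict(
        wav_path="path_to_your_audio.wav",
        prompt="transcribe this audio",
        prompt_pattern=prompt_pattern,
        do_sample=False,
        max_length=1200,
        streamer=streamer,
    ),
)
thread.start()
for new_text in streamer:
    print(new_text, end="", flush=True)
thread.join()
```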

## Evaluation Results

## Acknowledgements
In addition to common libraries and tools, we would like to thank the following projects for releasing model weights and code: 
- Training recipe: [SALMONN](https://github.com/bytedance/SALMONN) from ByteDance
- Audio encoder: [BEATs](https://github.com/microsoft/unilm/tree/master/beats) from Microsoft
- Whisper encoder: [Fine-tuned Whisper](https://huggingface.co/biodatlab/whisper-th-large-v3-combined) from Biomedical and Data Lab @ Mahidol University