---
library_name: transformers
license: llama3
language:
- th
- en
pipeline_tag: text-generation
---

# Typhoon-Audio Preview

<div align="center">
<img src="https://i.postimg.cc/DycZ98w2/typhoon-audio.png" alt="typhoon-audio" style="width: 100%; max-width: 20cm; margin-left: auto; margin-right: auto; display: block"/>
</div>

**llama-3-typhoon-v1.5-8b-audio-preview** is a 🇹🇭 Thai *audio-language* model. It natively supports both text and audio input modalities, while the output is text. This version (August 2024) is our first audio-language model as part of our multimodal effort, and it is a research *preview*. The base language model is our [llama-3-typhoon-v1.5-8b-instruct](https://huggingface.co/scb10x/llama-3-typhoon-v1.5-8b-instruct).

More details can be found in our [release blog]() and [technical report](). *To acknowledge Meta's effort in creating the foundation model and to comply with the license, we explicitly include "llama-3" in the model name.

## Model Description

- **Model type**: The LLM is based on Typhoon-1.5-8b-instruct, and the audio encoder is based on Whisper's encoder and BEATs.
- **Requirement**: transformers 4.38.0 or newer (a quick version check is shown below).
- **Primary Language(s)**: Thai 🇹🇭 and English 🇬🇧
- **Demo**: https://audio.opentyphoon.ai/
- **License**: [Llama 3 Community License](https://llama.meta.com/llama3/license/)
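
Before loading the model, you can confirm that the installed transformers version meets this requirement. This is a minimal sketch; the `packaging` helper used here is a standard transformers dependency, not part of this model's API:

```python
import transformers
from packaging import version

# The remote code for this model requires transformers >= 4.38.0
required = "4.38.0"
installed = transformers.__version__
assert version.parse(installed) >= version.parse(required), (
    f"transformers {installed} is too old; please upgrade to {required} or newer"
)
```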

## Usage Example

```python
import torch
from transformers import AutoModel

# Initialize from the trained model
model = AutoModel.from_pretrained(
    "scb10x/llama-3-typhoon-v1.5-8b-audio-preview",
    torch_dtype=torch.float16,
    trust_remote_code=True
)
model.to("cuda")
model.eval()

# Run generation
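# The prompt pattern below is the Llama-3 chat template: the audio is inserted
# at <SpeechHere> (between the <Speech> tags) and the text instruction replaces
# the {} placeholder.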
prompt_pattern = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n<Speech><SpeechHere></Speech> {}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
response = model.generate(
    wav_path="path_to_your_audio.wav",
    prompt="transcribe this audio",
    prompt_pattern=prompt_pattern,
    do_sample=False,
    max_length=1200,
    repetition_penalty=1.1,
    num_beams=1,
    # temperature=0.4,
    # top_p=0.9,
    # streamer=streamer  # supports TextIteratorStreamer
)
print(response)
```
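
As noted in the commented `streamer` argument above, generation supports transformers' `TextIteratorStreamer` for token-by-token output. Below is a minimal streaming sketch, assuming the tokenizer can be loaded from the same repository and reusing `model` and `prompt_pattern` from the example above:

```python
from threading import Thread

from transformers import AutoTokenizer, TextIteratorStreamer

tokenizer = AutoTokenizer.from_pretrained(
    "scb10x/llama-3-typhoon-v1.5-8b-audio-preview",
    trust_remote_code=True
)
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Run generation in a background thread and consume tokens as they are produced
thread = Thread(
    target=model.generate,
    kwargs=dict(
        wav_path="path_to_your_audio.wav",
        prompt="transcribe this audio",
        prompt_pattern=prompt_pattern,
        max_length=1200,
        streamer=streamer,
    ),
)
thread.start()
for new_text in streamer:
    print(new_text, end="", flush=True)
thread.join()
```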

## Evaluation Results

## Acknowledgements

In addition to common libraries and tools, we would like to thank the following projects for releasing model weights and code:

- Training recipe: [SALMONN](https://github.com/bytedance/SALMONN) from ByteDance
- Audio encoder: [BEATs](https://github.com/microsoft/unilm/tree/master/beats) from Microsoft
- Whisper encoder: [Fine-tuned Whisper](https://huggingface.co/biodatlab/whisper-th-large-v3-combined) from the Biomedical and Data Lab @ Mahidol University