---
library_name: transformers
license: llama3
language:
- th
- en
pipeline_tag: text-generation
---
# Typhoon-Audio Preview
<div align="center">
<img src="https://i.postimg.cc/DycZ98w2/typhoon-audio.png" alt="typhoon-audio" style="width: 100%; max-width: 20cm; margin-left: auto; margin-right: auto; display: block"/>
</div>
**llama-3-typhoon-v1.5-8b-audio-preview** is a 🇹🇭 Thai *audio-language* model. It natively supports both text and audio inputs, while the output is text. This version (August 2024) is our first audio-language model as part of our multimodal effort, and it is a research *preview* version. The base language model is our [llama-3-typhoon-v1.5-8b-instruct](https://huggingface.co/scb10x/llama-3-typhoon-v1.5-8b-instruct).
More details can be found in our [release blog]() and [technical report](). *To acknowledge Meta's effort in creating the foundation model and to comply with the license, we explicitly include "llama-3" in the model name.*
## Model Description
- **Model type**: The LLM is based on Typhoon-1.5-8b-instruct, and the audio encoder is based on Whisper's encoder and BEATs.
- **Requirement**: transformers 4.38.0 or newer (a quick version check is sketched after this list).
- **Primary Language(s)**: Thai 🇹🇭 and English 🇬🇧
- **Demo**: https://audio.opentyphoon.ai/
- **License**: [Llama 3 Community License](https://llama.meta.com/llama3/license/)
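Before loading the model, you may want to confirm your environment meets the requirement above. The snippet below is a minimal sketch; the only assumption beyond this card is the use of the `packaging` helper (installed alongside transformers) for version comparison.

```python
# Minimal environment check for the transformers requirement stated above.
# Assumes `packaging` is available (it is a dependency of transformers).
import transformers
from packaging import version

required = version.parse("4.38.0")
installed = version.parse(transformers.__version__)
assert installed >= required, (
    f"transformers {transformers.__version__} is too old; please install >= 4.38.0"
)
```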
## Usage Example
```python
import torch
from transformers import AutoModel

# Initialize from the trained model (custom code on the Hub requires trust_remote_code=True)
model = AutoModel.from_pretrained(
    "scb10x/llama-3-typhoon-v1.5-8b-audio-preview",
    torch_dtype=torch.float16,
    trust_remote_code=True
)
model.to("cuda")
model.eval()
# Run generation
prompt_pattern = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n<Speech><SpeechHere></Speech> {}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
response = model.generate(
    wav_path="path_to_your_audio.wav",
    prompt="transcribe this audio",
    prompt_pattern=prompt_pattern,
    do_sample=False,
    max_length=1200,
    repetition_penalty=1.1,
    num_beams=1,
    # temperature=0.4,
    # top_p=0.9,
    # streamer=streamer  # supports TextIteratorStreamer
)
print(response)
```
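The commented-out `streamer` argument above indicates support for `TextIteratorStreamer`. The sketch below shows one way to stream tokens as they are generated; it assumes the model's custom `generate()` accepts the `streamer` keyword as hinted, and that a tokenizer can be loaded from the same repository via `AutoTokenizer` with `trust_remote_code=True` (an assumption not stated elsewhere in this card).

```python
# A sketch of streaming generation, assuming generate() accepts `streamer`
# as suggested by the commented-out argument in the example above.
from threading import Thread

from transformers import AutoTokenizer, TextIteratorStreamer

# Assumption: the repository exposes a tokenizer loadable via AutoTokenizer.
tokenizer = AutoTokenizer.from_pretrained(
    "scb10x/llama-3-typhoon-v1.5-8b-audio-preview", trust_remote_code=True
)
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

generation_kwargs = dict(
    wav_path="path_to_your_audio.wav",
    prompt="transcribe this audio",
    prompt_pattern=prompt_pattern,
    do_sample=False,
    max_length=1200,
    streamer=streamer,
)

# Run generation in a background thread and print text chunks as they arrive.
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
for new_text in streamer:
    print(new_text, end="", flush=True)
thread.join()
```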
## Evaluation Results
## Acknowledgements
In addition to common libraries and tools, we would like to thank the following projects for releasing model weights and code:
- Training recipe: [SALMONN](https://github.com/bytedance/SALMONN) from ByteDance
- Audio encoder: [BEATs]( https://github.com/microsoft/unilm/tree/master/beats) from Microsoft
- Whisper encoder: [Fine-tuned Whisper](https://huggingface.co/biodatlab/whisper-th-large-v3-combined) from Biomedical and Data Lab @ Mahidol University