---
library_name: transformers
license: llama3
language:
- th
- en
pipeline_tag: text-generation
---

# Typhoon-Audio Preview

**llama-3-typhoon-v1.5-8b-audio-preview** is a 🇹🇭 Thai *audio-language* model. It supports both text and audio input modalities natively while the output is text. This version (August 2024) is our first audio-language model as a part of our multimodal effort, and it is a research *preview* version. The base language model is our [llama-3-typhoon-v1.5-8b-instruct](https://huggingface.co/scb10x/llama-3-typhoon-v1.5-8b-instruct).

More details can be found in our [release blog](https://blog.opentyphoon.ai/typhoon-audio-preview-release-6fbb3f938287) and [technical report]().

*To acknowledge Meta's effort in creating the foundation model and to comply with the license, we explicitly include "llama-3" in the model name.

## Model Description

- **Model type**: The LLM is based on Typhoon-1.5-8b-instruct, and the audio encoder is based on Whisper's encoder and BEATs.
- **Requirement**: transformers 4.38.0 or newer.
- **Primary Language(s)**: Thai 🇹🇭 and English 🇬🇧
- **Demo**: https://audio.opentyphoon.ai/
- **License**: [Llama 3 Community License](https://llama.meta.com/llama3/license/)

## Usage Example

```python
import torch
from transformers import AutoModel

# Initialize from the trained model.
# trust_remote_code=True is needed because the model class ships with the checkpoint.
model = AutoModel.from_pretrained(
    "scb10x/llama-3-typhoon-v1.5-8b-audio-preview",
    torch_dtype=torch.float16,
    trust_remote_code=True
)
model.to("cuda")
model.eval()

# Run generation. The "{}" in the pattern is the slot for the text prompt.
prompt_pattern = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n {}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
response = model.generate(
    wav_path="path_to_your_audio.wav",
    prompt="transcribe this audio",
    prompt_pattern=prompt_pattern,
    do_sample=False,
    max_length=1200,
    repetition_penalty=1.1,
    num_beams=1,
    # temperature=0.4,
    # top_p=0.9,
    # streamer=streamer  # supports TextIteratorStreamer
)
print(response)
```

## Evaluation Results

| Model                 | ASR-en (WER↓) | ASR-th (WER↓) | En2Th (BLEU↑) | X2Th (BLEU↑) | Th2En (BLEU↑) |
|:----------------------|:--------------|:--------------|:--------------|:-------------|:--------------|
| SALMONN-13B           | 5.79          | 98.07         | 0.07          | 0.10         | 14.97         |
| DiVA-8B               | 30.28         | 65.21         | 9.82          | 5.31         | 7.97          |
| Gemini-1.5-pro-001    | 5.98          | 13.56         | 20.69         | 13.52        | 22.54         |
| Typhoon-Audio-Preview | 8.72          | 14.17         | 17.52         | 10.67        | 24.14         |

| Model                 | Gender-th (Acc) | SpokenQA-th (F1) | SpeechInstruct-th |
|:----------------------|:----------------|:-----------------|:------------------|
| SALMONN-13B           | 93.26           | 2.95             | 1.18              |
| DiVA-8B               | 50.12           | 15.13            | 2.68              |
| Gemini-1.5-pro-001    | 81.32           | 62.10            | 3.93              |
| Typhoon-Audio-Preview | 93.74           | 64.60            | 6.11              |

## Intended Uses & Limitations

This model is a pretrained base model. As a result, it may not follow human instructions without one-shot or few-shot learning or instruction fine-tuning. The model has no moderation mechanisms and may generate harmful or inappropriate responses.
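For readers who want to sanity-check transcriptions against the ASR columns in the evaluation tables, the sketch below shows how word error rate (WER) is conventionally computed: the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. This is an illustration only, not the evaluation script behind the reported numbers. The function name `word_error_rate` is hypothetical, and splitting on whitespace is an assumption: Thai text is normally written without spaces between words, so it would need a word segmenter first, and the tokenization used for the reported scores is not stated here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / number of reference words.

    Assumption: both strings are already segmented into space-separated words.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr

    return 100.0 * prev[-1] / len(ref)


# One substituted word out of six reference words.
print(round(word_error_rate("the cat sat on the mat", "the cat sat on a mat"), 2))  # 16.67
```

Because the denominator is the reference length, WER can exceed 100 when the hypothesis contains many inserted words.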
## Follow us & Support

- https://twitter.com/opentyphoon
- https://discord.gg/CqyBscMFpg

## Acknowledgements

We would like to thank the SALMONN team for open-sourcing their code and data, and thanks to the Biomedical and Data Lab at Mahidol University for releasing the fine-tuned Whisper that allowed us to adopt its encoder. Thanks to many other open-source projects for their useful knowledge sharing, data, code, and model weights.

## Typhoon Team

Potsawee Manakul, Sittipong Sripaisarnmongkol, Natapong Nitarach, Warit Sirichotedumrong, Adisai Na-Thalang, Phatrasek Jirabovonvisut, Parinthapat Pengpun, Pathomporn Chokchainant, Kasima Tharnpipitchai, Kunat Pipatanakul