---
library_name: transformers
license: llama3
language:
  - th
  - en
pipeline_tag: text-generation
---

# Typhoon-Audio Preview


**llama-3-typhoon-v1.5-8b-audio-preview** is a 🇹🇭 Thai audio-language model. It natively supports both text and audio input, while its output is text. This version (August 2024) is our first audio-language model, released as part of our multimodal effort, and it is a research preview. The base language model is our **llama-3-typhoon-v1.5-8b-instruct**.

More details can be found in our release blog and technical report. *To acknowledge Meta's effort in creating the foundation model and to comply with the license, we explicitly include "llama-3" in the model name.

## Model Description

  • Model type: The LLM is based on Typhoon-1.5-8b-instruct, and the audio encoder is based on Whisper's encoder and BEATs.
  • Requirement: transformers 4.38.0 or newer (a quick version check is sketched below this list).
  • Primary Language(s): Thai 🇹🇭 and English 🇬🇧
  • Demo: https://audio.opentyphoon.ai/
  • License: Llama 3 Community License
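Because loading goes through `trust_remote_code`, an outdated `transformers` can fail in non-obvious ways. The following is a minimal sketch for checking the version up front; it assumes only that `transformers` (and its `packaging` dependency) is installed.

```python
# Minimal sketch: fail fast if the installed transformers is too old.
# Assumes `transformers` and its `packaging` dependency are installed.
import transformers
from packaging import version

assert version.parse(transformers.__version__) >= version.parse("4.38.0"), (
    f"Found transformers {transformers.__version__}; "
    "this model requires 4.38.0 or newer."
)
```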

## Usage Example

```python
import torch
from transformers import AutoModel

# Initialize from the trained model (custom modeling code ships with the checkpoint)
model = AutoModel.from_pretrained(
    "scb10x/llama-3-typhoon-v1.5-8b-audio-preview",
    torch_dtype=torch.float16,
    trust_remote_code=True
)
model.to("cuda")
model.eval()

# The prompt pattern follows the Llama-3 chat template: the audio is inserted
# at <SpeechHere>, and {} is filled with the text prompt.
prompt_pattern = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n<Speech><SpeechHere></Speech> {}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

# Run generation
response = model.generate(
    wav_path="path_to_your_audio.wav",
    prompt="transcribe this audio",
    prompt_pattern=prompt_pattern,
    do_sample=False,
    max_length=1200,
    repetition_penalty=1.1,
    num_beams=1,
    # temperature=0.4,
    # top_p=0.9,
    # streamer=streamer  # supports TextIteratorStreamer
)
print(response)
```
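The commented-out `streamer` argument indicates that token streaming via `TextIteratorStreamer` is supported. Below is a minimal streaming sketch; it assumes the checkpoint ships a compatible tokenizer and that the custom `generate` forwards `streamer` as the comment suggests, and it reuses `model` and `prompt_pattern` from above.

```python
# Streaming sketch (assumptions: the repo provides a tokenizer, and
# generate() accepts the `streamer` keyword as the comment above indicates).
from threading import Thread
from transformers import AutoTokenizer, TextIteratorStreamer

tokenizer = AutoTokenizer.from_pretrained(
    "scb10x/llama-3-typhoon-v1.5-8b-audio-preview", trust_remote_code=True
)
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Run generation in a background thread so the main thread can consume tokens.
thread = Thread(target=model.generate, kwargs=dict(
    wav_path="path_to_your_audio.wav",
    prompt="transcribe this audio",
    prompt_pattern=prompt_pattern,
    do_sample=False,
    max_length=1200,
    streamer=streamer,
))
thread.start()
for new_text in streamer:  # text chunks arrive as they are generated
    print(new_text, end="", flush=True)
thread.join()
```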

## Evaluation Results

Evaluation results are reported in our technical report.

## Acknowledgements

In addition to common libraries and tools, we would like to thank the following projects for releasing model weights and code:

  • Training recipe: SALMONN from ByteDance
  • Audio encoder: BEATs from Microsoft
  • Whisper encoder: Fine-tuned Whisper from Biomedical and Data Lab @ Mahidol University