---
license: apache-2.0
language:
- th
- en
pipeline_tag: text-generation
library_name: transformers
tags:
- chat
- audio
---
# Pathumma-Audio
## Model Description
**Pathumma-llm-audio-1.0.0** is an 8-billion-parameter Thai large language model designed for audio understanding tasks. It accepts several types of audio input, including **speech, general audio, and music**, and converts them into meaningful text.
## Model Architecture
The model combines two key components:
1. Base Language Model: [OpenThaiLLM-DoodNiLT-V1.0.0-Beta-7B](https://huggingface.co/nectec/OpenThaiLLM-DoodNiLT-V1.0.0-Beta-7B) (Qwen2)
2. Base Speech Encoder: [Pathumma-whisper-th-large-v3](https://huggingface.co/nectec/Pathumma-whisper-th-large-v3) (Whisper)
## Quickstart
To load the model and generate responses using the Hugging Face Transformers library, follow the steps below.
#### 1. Install the required dependencies:
Make sure you have the necessary libraries installed by running:
```shell
pip install librosa torch torchaudio transformers peft
```
#### 2. Load the model and generate a response:
You can load the model and use it to generate a response with the following code snippet:
```python
import torch
import librosa
from transformers import AutoModel

# Use the GPU with bfloat16 when available; otherwise fall back to CPU with float32.
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32

# trust_remote_code=True lets Transformers run the custom model code
# shipped in the model repository.
model = AutoModel.from_pretrained(
    "nectec/Pathumma-llm-audio-1.0.0",
    torch_dtype=torch_dtype,
    lora_infer_mode=True,
    init_from_scratch=True,
    trust_remote_code=True,
)
model = model.to(device)

prompt = "ถอดเสียงเป็นข้อความ"  # Thai for "Transcribe the audio into text"
audio_path = "audio_path.wav"

# Load the file as a waveform resampled to 16 kHz.
audio, sr = librosa.load(audio_path, sr=16000)

model.eval()
with torch.no_grad():
    response = model.generate(
        raw_wave=audio,
        prompts=prompt,
        device=device,
        max_new_tokens=200,
        repetition_penalty=1.0,
    )

print(response[0])
```
## Evaluation Performance
Additional information is needed
<!-- | Model | ASR-th CV18 Th (WER↓) | ASR-en CV18 En (WER↓) | ASR-en Librispeech En (WER↓) | ThaiSER Emotion (Acc↑, F1↑)| ThaiSER Gender (Acc↑, F1↑) |
|:----------------------------:|:------------------------:|:------------------------:|:------------------------------:|:------------------:|:--------------------:|
| Typhoon-Audio-Preview | 13.26 | 13.34 (partial result) | 5.07 (partial result) | 41.50, 33.48 | 96.20, 96.69 |
| DIVA | 69.15 (partial result) | 37.40 | 49.06 | 18.64, 8.16 | 47.50, 35.90 |
| Gemini-1.5-Pro | 16.49 | 12.94 | 25.83 | 26.00, 18.26 | 79.66, 77.32 |
| Pathumma-llm-audio-1.0.0 | 12.03 | 12.20 | 11.36 | 42.30, 36.88 | 90.30, 92.07 | -->
## Limitations and Future Work
The model is still in the experimental research phase and is not yet fully suitable for practical use as an assistant. It accepts audio inputs of **up to 30 seconds**, which limits its usefulness for longer recordings. Future work will focus on upgrading the language model to the newer [Pathumma-llm-text-1.0.0](https://huggingface.co/nectec/Pathumma-llm-text-1.0.0) and on curating more refined and robust datasets to improve performance. We also aim to prioritize the safety and reliability of the model's outputs.
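One way to work within the 30-second limit is to split a longer recording into consecutive chunks before inference. The following is a minimal sketch, assuming a 1-D waveform sampled at 16 kHz as in the Quickstart; `split_into_chunks` is a hypothetical helper written for illustration and is not part of the model's API.

```python
from typing import List, Sequence

SAMPLE_RATE = 16000  # matches sr=16000 used in the Quickstart
MAX_SECONDS = 30     # input duration limit stated above


def split_into_chunks(
    wave: Sequence[float],
    sample_rate: int = SAMPLE_RATE,
    max_seconds: float = MAX_SECONDS,
) -> List[Sequence[float]]:
    """Split a 1-D waveform into consecutive chunks of at most max_seconds.

    Hypothetical helper: works on any sliceable sequence, such as a list
    or the NumPy array returned by librosa.load.
    """
    chunk_len = int(sample_rate * max_seconds)
    if chunk_len <= 0:
        raise ValueError("sample_rate * max_seconds must be positive")
    return [wave[start:start + chunk_len] for start in range(0, len(wave), chunk_len)]


# Example: 70 seconds of silence becomes chunks of 30 s, 30 s, and 10 s.
wave = [0.0] * (SAMPLE_RATE * 70)
chunks = split_into_chunks(wave)
print([len(c) / SAMPLE_RATE for c in chunks])  # [30.0, 30.0, 10.0]
```

Each chunk could then be passed as `raw_wave` to `model.generate` in turn. Hard cuts at fixed offsets can split a word in half, so the per-chunk outputs may need manual review; this sketch does not attempt to cut at silences.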
## Acknowledgements
We are grateful to ThaiSC, the NSTDA Supercomputer Centre, for providing access to LANTA, which was used for model training and fine-tuning. We also thank the SALMONN team for making their code publicly available, and Typhoon Audio at SCB 10X for releasing their Hugging Face project, source code, and technical paper, which served as a valuable guide for us. Many other open-source projects have contributed valuable information, code, data, and model weights; we are grateful to them all.
## Pathumma Audio Team
*Pattara Tipaksorn*, Wayupuk Sommuang, Oatsada Chatthong, *Kwanchiva Thangthai*
## Citation
```
@misc{tipaksorn2024PathummaAudio,
title = { {Pathumma-Audio} },
author = { Pattara Tipaksorn and Wayupuk Sommuang and Oatsada Chatthong and Kwanchiva Thangthai },
url = { https://huggingface.co/nectec/Pathumma-llm-audio-1.0.0 },
publisher = { Hugging Face },
year = { 2024 },
}
```