A speech recognition model trained in 24 hours on a single GPU for ~$12. Built with Tiny Audio, a minimal, hackable ASR framework.

How to use mazesmazes/tiny-audio with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)

# Or load the model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("mazesmazes/tiny-audio", trust_remote_code=True, dtype="auto")
```
```python
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)

# From file
result = pipe("audio.wav")
print(result["text"])

# From URL
result = pipe("https://example.com/audio.mp3")

# From a numpy array (must be 16kHz)
import numpy as np
audio = np.random.randn(16000).astype(np.float32)  # 1 second of noise
result = pipe(audio)

# Process multiple files
files = ["audio1.wav", "audio2.wav", "audio3.wav"]
results = pipe(files, batch_size=4)
for r in results:
    print(r["text"])
```
```python
result = pipe("audio.wav", return_timestamps="word")
# Returns:
# {
#     "text": "hello world",
#     "chunks": [
#         {"text": "hello", "timestamp": (0.0, 0.5)},
#         {"text": "world", "timestamp": (0.6, 1.0)}
#     ]
# }
```
```python
from tiny_audio import ASRModel, ASRProcessor
import librosa

model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")

# Load and process audio
audio, sr = librosa.load("audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Stream tokens as they are generated
for token in model.generate_streaming(inputs["input_features"]):
    print(token, end="", flush=True)
```
```python
from tiny_audio import ASRModel, ASRProcessor
import torch
import librosa

# Load model and processor
model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")

# Load audio (16kHz)
audio, sr = librosa.load("audio.wav", sr=16000)

# Process
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Generate
with torch.no_grad():
    output = model.generate(
        input_features=inputs["input_features"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=256
    )

# Decode
text = processor.batch_decode(output, skip_special_tokens=True)[0]
print(text)
```
```python
import torch
from transformers import pipeline

# Run on GPU
pipe = pipeline(
    "automatic-speech-recognition",
    model="mazesmazes/tiny-audio",
    trust_remote_code=True,
    device="cuda"  # or device=0
)

# Use half precision for faster inference
pipe = pipeline(
    "automatic-speech-recognition",
    model="mazesmazes/tiny-audio",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device="cuda"
)
```
```
Audio (16kHz) → GLM-ASR Encoder (frozen) → MLP Projector (trained) → Qwen3 (frozen) → Text
```
Only the projector is trained (~12M params). The encoder and decoder remain frozen, leveraging their pretrained knowledge.
| Component | Model | Parameters | Status |
|---|---|---|---|
| Audio Encoder | GLM-ASR-Nano-2512 | ~600M | Frozen |
| Projector | 2-layer MLP | ~12M | Trained |
| Language Model | Qwen3-0.6B | ~600M | Frozen |
The projector reduces sequence length via frame stacking: `output_len = (input_len - 5) // 5 + 1`
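The frame-stacking idea can be illustrated with a minimal sketch of such a projector. This is not the released implementation: the encoder and LLM hidden sizes, the activation, and the exact stacking layout here are assumptions, chosen only to show how stacking 5 consecutive encoder frames shortens the sequence by the formula above.

```python
import torch
import torch.nn as nn

class FrameStackProjector(nn.Module):
    """Illustrative sketch: stack 5 consecutive encoder frames, then a 2-layer MLP.
    Dimensions (enc_dim, llm_dim) are placeholders, not the checkpoint's actual sizes."""

    def __init__(self, enc_dim=768, llm_dim=1024, stack=5):
        super().__init__()
        self.stack = stack
        self.mlp = nn.Sequential(
            nn.Linear(enc_dim * stack, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x):
        # x: (batch, frames, enc_dim)
        b, t, d = x.shape
        t = (t // self.stack) * self.stack          # drop trailing partial frames
        x = x[:, :t].reshape(b, t // self.stack, d * self.stack)
        return self.mlp(x)                          # (batch, frames // 5, llm_dim)

proj = FrameStackProjector()
out = proj(torch.randn(1, 103, 768))
# 103 input frames -> (103 - 5) // 5 + 1 = 20 output frames
print(out.shape)
```

Non-overlapping stacking with window = stride = 5 reproduces the stated length formula, which equals `input_len // 5`.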
| Specification | Value |
|---|---|
| Input | Audio (16kHz mono) |
| Output | Text transcription |
| Max Audio Length | ~30 seconds (limited by encoder) |
| Vocabulary | Qwen3 tokenizer |
| Languages | English only |
| Generation | Greedy decoding (num_beams=1, do_sample=False) |
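Since the model expects 16 kHz mono input, audio at other sample rates or with multiple channels must be converted first. Below is a naive torch-only sketch of that conversion (the helper name and linear interpolation are my own; for real use prefer `torchaudio.functional.resample`, which applies proper anti-aliasing filtering):

```python
import torch
import torch.nn.functional as F

def to_16k_mono(wav: torch.Tensor, sr: int) -> torch.Tensor:
    """Naive sketch (not the model's official preprocessing): downmix to mono
    and linearly resample to 16 kHz."""
    if wav.dim() == 2:                      # (channels, samples) -> mono
        wav = wav.mean(dim=0)
    n_out = int(round(wav.numel() * 16000 / sr))
    wav = wav.view(1, 1, -1)                # (batch, channel, time) for interpolate
    wav = F.interpolate(wav, size=n_out, mode="linear", align_corners=False)
    return wav.view(-1)                     # 1-D float waveform at 16 kHz

audio = to_16k_mono(torch.randn(2, 44100), 44100)  # 1s of stereo 44.1kHz noise
print(audio.shape)
```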
| Training | Value |
|---|---|
| Dataset | LoquaciousSet (25,000 hours) |
| Hardware | Single NVIDIA A40 |
| Time | ~24 hours |
| Cost | ~$12 |
| Optimizer | AdamW |
| Learning Rate | 1e-4 |
| Batch Size | 4 |
| Steps | 50,000 |
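The budget figures above are easy to sanity-check. The hourly GPU rate is an assumption (around $0.50/h is a typical community-cloud price for an A40), not a number from the card:

```python
# Back-of-the-envelope check of the training budget.
gpu_hours = 24
hourly_rate = 0.50            # assumed A40 rental price, USD/hour
cost = gpu_hours * hourly_rate
print(f"~${cost:.0f}")        # ≈ $12

# Utterances processed during training:
steps, batch_size = 50_000, 4
print(steps * batch_size)     # 200,000 utterances
```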
```
transformers>=4.40.0
torch>=2.0.0
torchaudio>=2.0.0
```

Optional for streaming:

```
librosa
soundfile
```
| File | Description |
|---|---|
| `config.json` | Model configuration |
| `model.safetensors` | Projector weights (~48MB) |
| `preprocessor_config.json` | Audio preprocessing config |
| `tokenizer.json` | Tokenizer |
| `tokenizer_config.json` | Tokenizer config |
| `special_tokens_map.json` | Special tokens |
Note: Only the projector weights are stored. The encoder (GLM-ASR) and decoder (Qwen3) are loaded from their respective HuggingFace repos.
If you use this model, please cite:
```bibtex
@misc{tinyaudio2024,
  author    = {Alex Kroman},
  title     = {Tiny Audio: Minimal ASR Training},
  year      = {2024},
  publisher = {GitHub},
  url       = {https://github.com/alexkroman/tiny-audio}
}
```
MIT