A speech recognition model trained in 24 hours on a single GPU for ~$12. Built with Tiny Audio, a minimal, hackable ASR framework.

How to use mazesmazes/tiny-audio with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)

# Or load the model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("mazesmazes/tiny-audio", trust_remote_code=True, dtype="auto")
```
```python
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)

# From file
result = pipe("audio.wav")
print(result["text"])

# From URL
result = pipe("https://example.com/audio.mp3")

# From a numpy array (must be 16kHz)
import numpy as np
audio = np.random.randn(16000).astype(np.float32)  # 1 second of noise
result = pipe(audio)

# Process multiple files
files = ["audio1.wav", "audio2.wav", "audio3.wav"]
results = pipe(files, batch_size=4)
for r in results:
    print(r["text"])
```
```python
result = pipe("audio.wav", return_timestamps="word")
# Returns:
# {
#     "text": "hello world",
#     "chunks": [
#         {"text": "hello", "timestamp": (0.0, 0.5)},
#         {"text": "world", "timestamp": (0.6, 1.0)}
#     ]
# }
```
```python
from tiny_audio import ASRModel, ASRProcessor
import librosa

model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")

# Load and process audio
audio, sr = librosa.load("audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Stream tokens as they are generated
for token in model.generate_streaming(inputs["input_features"]):
    print(token, end="", flush=True)
```
```python
from tiny_audio import ASRModel, ASRProcessor
import torch
import librosa

# Load model and processor
model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")

# Load audio (16kHz)
audio, sr = librosa.load("audio.wav", sr=16000)

# Process
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Generate
with torch.no_grad():
    output = model.generate(
        input_features=inputs["input_features"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=256
    )

# Decode
text = processor.batch_decode(output, skip_special_tokens=True)[0]
print(text)
```
```python
import torch
from transformers import pipeline

# Run on GPU
pipe = pipeline(
    "automatic-speech-recognition",
    model="mazesmazes/tiny-audio",
    trust_remote_code=True,
    device="cuda"  # or device=0
)

# Use half precision for faster inference
pipe = pipeline(
    "automatic-speech-recognition",
    model="mazesmazes/tiny-audio",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device="cuda"
)
```
```
Audio (16kHz) → GLM-ASR Encoder (frozen) → MLP Projector (trained) → Qwen3 (frozen) → Text
```
Only the projector is trained (~12M params). The encoder and decoder remain frozen, leveraging their pretrained knowledge.
| Component | Model | Parameters | Status |
|---|---|---|---|
| Audio Encoder | GLM-ASR-Nano-2512 | ~600M | Frozen |
| Projector | 2-layer MLP | ~12M | Trained |
| Language Model | Qwen3-0.6B | ~600M | Frozen |
The projector reduces sequence length via frame stacking: `output_len = (input_len - 5) // 5 + 1`
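The frame-stacking idea can be illustrated with a minimal sketch of such a projector. This is not the released implementation: the encoder and LLM hidden sizes, the activation, and the exact stacking layout here are assumptions, chosen only to show how stacking 5 consecutive encoder frames shortens the sequence by the formula above.

```python
import torch
import torch.nn as nn

class FrameStackProjector(nn.Module):
    """Illustrative sketch: stack 5 consecutive encoder frames, then a 2-layer MLP.
    Dimensions (enc_dim, llm_dim) are placeholders, not the checkpoint's actual sizes."""

    def __init__(self, enc_dim=768, llm_dim=1024, stack=5):
        super().__init__()
        self.stack = stack
        self.mlp = nn.Sequential(
            nn.Linear(enc_dim * stack, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x):
        # x: (batch, frames, enc_dim)
        b, t, d = x.shape
        t = (t // self.stack) * self.stack          # drop trailing partial frames
        x = x[:, :t].reshape(b, t // self.stack, d * self.stack)
        return self.mlp(x)                          # (batch, frames // 5, llm_dim)

proj = FrameStackProjector()
out = proj(torch.randn(1, 103, 768))
# 103 input frames -> (103 - 5) // 5 + 1 = 20 output frames
print(out.shape)
```

Non-overlapping stacking with window = stride = 5 reproduces the stated length formula, which equals `input_len // 5`.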
| Specification | Value |
|---|---|
| Input | Audio (16kHz mono) |
| Output | Text transcription |
| Max Audio Length | ~30 seconds (limited by encoder) |
| Vocabulary | Qwen3 tokenizer |
| Languages | English only |
| Generation | Greedy decoding (num_beams=1, do_sample=False) |
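Since the model expects 16 kHz mono input, audio at other sample rates or with multiple channels must be converted first. Below is a naive torch-only sketch of that conversion (the helper name and linear interpolation are my own; for real use prefer `torchaudio.functional.resample`, which applies proper anti-aliasing filtering):

```python
import torch
import torch.nn.functional as F

def to_16k_mono(wav: torch.Tensor, sr: int) -> torch.Tensor:
    """Naive sketch (not the model's official preprocessing): downmix to mono
    and linearly resample to 16 kHz."""
    if wav.dim() == 2:                      # (channels, samples) -> mono
        wav = wav.mean(dim=0)
    n_out = int(round(wav.numel() * 16000 / sr))
    wav = wav.view(1, 1, -1)                # (batch, channel, time) for interpolate
    wav = F.interpolate(wav, size=n_out, mode="linear", align_corners=False)
    return wav.view(-1)                     # 1-D float waveform at 16 kHz

audio = to_16k_mono(torch.randn(2, 44100), 44100)  # 1s of stereo 44.1kHz noise
print(audio.shape)
```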
| Training | Value |
|---|---|
| Dataset | LoquaciousSet (25,000 hours) |
| Hardware | Single NVIDIA A40 |
| Time | ~24 hours |
| Cost | ~$12 |
| Optimizer | AdamW |
| Learning Rate | 1e-4 |
| Batch Size | 4 |
| Steps | 50,000 |
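The budget figures above are easy to sanity-check. The hourly GPU rate is an assumption (around $0.50/h is a typical community-cloud price for an A40), not a number from the card:

```python
# Back-of-the-envelope check of the training budget.
gpu_hours = 24
hourly_rate = 0.50            # assumed A40 rental price, USD/hour
cost = gpu_hours * hourly_rate
print(f"~${cost:.0f}")        # ≈ $12

# Utterances processed during training:
steps, batch_size = 50_000, 4
print(steps * batch_size)     # 200,000 utterances
```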
```
transformers>=4.40.0
torch>=2.0.0
torchaudio>=2.0.0
```

Optional for streaming:

```
librosa
soundfile
```
| File | Description |
|---|---|
| `config.json` | Model configuration |
| `model.safetensors` | Projector weights (~48MB) |
| `preprocessor_config.json` | Audio preprocessing config |
| `tokenizer.json` | Tokenizer |
| `tokenizer_config.json` | Tokenizer config |
| `special_tokens_map.json` | Special tokens |
Note: Only the projector weights are stored. The encoder (GLM-ASR) and decoder (Qwen3) are loaded from their respective HuggingFace repos.
If you use this model, please cite:
```bibtex
@misc{tinyaudio2024,
  author    = {Alex Kroman},
  title     = {Tiny Audio: Minimal ASR Training},
  year      = {2024},
  publisher = {GitHub},
  url       = {https://github.com/alexkroman/tiny-audio}
}
```
MIT