Audio Mosaicist-1

Audio Mosaicist-1 is a multilingual text-to-AudioMosaic bridge for searching large unstructured audio corpora with natural-language prompts. It maps Qwen3 text embeddings into the frozen AudioMosaic acoustic z space, then retrieves and localizes candidate audio events.

This repository contains the bridge artifacts and helper scripts. It does not contain the full private Persian audio corpus used to build the current VisualEars mining index.

What it is

  • Text tower: mlx-community/Qwen3-Embedding-0.6B-4bit-DWQ for the current build.
  • Audio tower: frozen AudioMosaic ViT-B/16 pretrained encoder.
  • Bridge: small 1024 -> 128 projection plus category/audio centroids and text prototypes.
  • Current anchor languages: English and Persian.
  • Current event/noise anchors: 15 VisualEars environment categories.
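
As a rough illustration of how the pieces above fit together, the sketch below projects a 1024-d text embedding into the 128-d z space and ranks category centroids by cosine similarity. It uses synthetic arrays only. The `(1024, 128)` matrix layout, the function names, and the cosine scoring are assumptions made for illustration, not a description of what the shipped query helper does.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def rank_categories(text_emb, projection, centroids, top_k=3):
    """Project one text embedding into z space and rank centroids by cosine."""
    z = l2_normalize(text_emb @ projection)        # shape (128,)
    sims = l2_normalize(centroids) @ z             # shape (n_categories,)
    order = np.argsort(-sims)[:top_k]
    return [(int(i), float(sims[i])) for i in order]

# Synthetic stand-ins for the real .npy artifacts (shapes are assumed).
rng = np.random.default_rng(0)
projection = rng.normal(size=(1024, 128))          # assumed (text_dim, z_dim)
centroids = rng.normal(size=(15, 128))             # one row per category

# Build a fake text embedding whose projection lands exactly on centroid 7.
text_emb = np.linalg.pinv(projection.T) @ centroids[7]

ranking = rank_categories(text_emb, projection, centroids)
print(ranking[0][0])  # index of the best-matching category
```

With real artifacts, `text_emb` would come from the Qwen3 text tower and the two arrays from the bridge files, after checking their actual shapes.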

Included files

  • projection_category_qwen3_to_audiomosaic_z.npy: category-level Qwen3-text to AudioMosaic-z bridge.
  • projection_exact_anchor_qwen3_to_audiomosaic_z.npy: exact-anchor bridge.
  • text_category_prototypes.npy and text_category_prototypes.jsonl: text-side category prototypes.
  • audio_category_centroids.npy: AudioMosaic-side category centroids.
  • audio_anchor_index.with_inferred_categories.parquet: anchor metadata used for the current bridge.
  • categories.json: current category list.
  • audiomosaic_text_bridge_query.py: minimal query helper.
  • scripts/asr_query_audiomosaic_text_bridge.py: full-corpus search helper used in the VisualEars mining run.
  • scripts/asr_localize_audiomosaic_events.py: multi-scale timestamp localizer for retrieved candidates.
  • scripts/asr_make_audio_mosaicist_extension_pack.py: helper for building local-language extension packs.
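
The array shapes and dtypes of the `.npy` files are not listed here, so it can help to inspect them before writing any code against them. The helper below is a minimal sketch and assumes the files hold plain numeric arrays (a pickled object array would make `np.load` raise). `describe_npy_files` is a hypothetical name, and the demo runs against a temporary directory rather than the real repository.

```python
import tempfile
from pathlib import Path

import numpy as np

def describe_npy_files(bridge_dir):
    """Return {filename: (shape, dtype)} for every .npy file in bridge_dir."""
    summary = {}
    for path in sorted(Path(bridge_dir).glob("*.npy")):
        # mmap_mode="r" maps the file lazily instead of reading it all into RAM.
        arr = np.load(path, mmap_mode="r")
        summary[path.name] = (tuple(arr.shape), str(arr.dtype))
    return summary

# Demo on a throwaway directory with one fake centroid file.
with tempfile.TemporaryDirectory() as tmp:
    np.save(Path(tmp) / "demo_centroids.npy", np.zeros((15, 128), dtype=np.float32))
    demo_summary = describe_npy_files(tmp)

print(demo_summary)
```

Pointing `describe_npy_files` at the downloaded repository directory would list the real projection, prototype, and centroid arrays the same way.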

Quick use

Download the repo and point the query script at an existing AudioMosaic embedding index:

python scripts/asr_query_audiomosaic_text_bridge.py \
  --bridge-dir Audio-Mosaicist-1 \
  --index /path/to/audiomosaic_index \
  --prompts prompts.jsonl \
  --out search_results.jsonl

Then recover exact audio rows and localize the event windows:

python scripts/asr_localize_audiomosaic_events.py \
  --candidates search_results.jsonl \
  --manifest exact_manifest_rows.jsonl \
  --out localized_events.jsonl \
  --repo-dir /path/to/audiomosaic-vit-b16-pretrained \
  --device cuda:0

The helper scripts are intentionally plain Python so people can adapt them to their own storage layout.

Language coverage

The bridge was calibrated with English and Persian prompts, so only those two languages are validated. Because Qwen3 Embedding is multilingual, prompts in other Qwen-supported languages may work through cross-lingual alignment, but this has not been measured. For any language other than English or Persian, treat results as zero-shot until you add local-language calibration prompts or anchors.

Extending to a new language

You do not need a carefully structured dataset. The easiest extension is a JSONL/CSV with any of these fields when available:

  • audio_path or HF dataset locator fields (source, file_path, row_index, audio_col)
  • label or caption in your language
  • lang, for example ar, hi, tr, de
  • optional category if you already know it
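
For illustration, the sketch below writes two rows that use the field names from the list above as JSONL. The exact schema the extension-pack helper expects is not documented here, so treat the validation rule, the `make_row` helper, and all example values (paths, dataset name, labels) as hypothetical.

```python
import json

# Field names taken from the list above; everything else is an assumption.
ALLOWED_FIELDS = (
    "audio_path", "source", "file_path", "row_index", "audio_col",
    "label", "caption", "lang", "category",
)

def make_row(**fields):
    """Keep only recognised fields and drop the ones that are missing."""
    unknown = set(fields) - set(ALLOWED_FIELDS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    return {k: v for k, v in fields.items() if v not in (None, "")}

rows = [
    # A local file with a German label and no known category.
    make_row(audio_path="clips/door_knock_001.wav", label="Türklopfen",
             lang="de", category=None),
    # A row addressed through HF dataset locator fields, with a Turkish caption.
    make_row(source="my_org/my_dataset", file_path="train/0001.flac",
             row_index=17, audio_col="audio", caption="köpek havlaması",
             lang="tr"),
]

# ensure_ascii=False keeps non-Latin labels readable in the output file.
jsonl_text = "\n".join(json.dumps(r, ensure_ascii=False) for r in rows)
print(jsonl_text)
```

Writing `jsonl_text` to a file gives something that could be passed as `--input`, provided the helper accepts these fields in this form.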

Recommended extension workflow:

  1. Embed your audio anchors with AudioMosaic into 128-d z vectors.
  2. Embed your local-language labels/prompts with Qwen3 Embedding.
  3. Add the new text/audio pairs to the bridge training table.
  4. Retrain or adapter-train the 1024 -> 128 projection and text prototypes.
  5. Evaluate category hit@k and run a small manual localization smoke test before trusting mining output.
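
Steps 3 to 5 can be sketched on synthetic data as follows. The fit uses plain closed-form ridge regression, which is one simple way to learn a linear text-to-z map and is not necessarily how the shipped projection was trained. Dimensions are shrunk (32 -> 8 instead of 1024 -> 128) to keep the example fast, and all function and variable names are hypothetical.

```python
import numpy as np

def fit_ridge_projection(text_embs, audio_z, l2=1e-2):
    """Closed-form ridge fit: W minimising ||X W - Z||^2 + l2 * ||W||^2."""
    d = text_embs.shape[1]
    gram = text_embs.T @ text_embs + l2 * np.eye(d)
    return np.linalg.solve(gram, text_embs.T @ audio_z)

def category_hit_at_k(text_embs, labels, projection, centroids, k=3):
    """Fraction of prompts whose true category is among the k nearest centroids."""
    z = text_embs @ projection
    z /= np.linalg.norm(z, axis=1, keepdims=True) + 1e-12
    c = centroids / (np.linalg.norm(centroids, axis=1, keepdims=True) + 1e-12)
    topk = np.argsort(-(z @ c.T), axis=1)[:, :k]
    return float(np.mean([labels[i] in topk[i] for i in range(len(labels))]))

# Synthetic paired data standing in for (text embedding, audio z) anchors.
rng = np.random.default_rng(0)
n, text_dim, z_dim, n_cat = 300, 32, 8, 5
centroids = rng.normal(size=(n_cat, z_dim))
labels = rng.integers(0, n_cat, size=n)
audio_z = centroids[labels] + 0.05 * rng.normal(size=(n, z_dim))
mixing = rng.normal(size=(z_dim, text_dim))
text_embs = audio_z @ mixing + 0.01 * rng.normal(size=(n, text_dim))

# Fit on the first 200 pairs, evaluate on the held-out 100.
W = fit_ridge_projection(text_embs[:200], audio_z[:200])
hit1 = category_hit_at_k(text_embs[200:], labels[200:], W, centroids, k=1)
print(f"held-out hit@1 = {hit1:.2f}")
```

Keeping a held-out split matters here: hit@k measured on the same pairs used for fitting will overstate how well new prompts map into z space.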

For unlabeled audio, first mine clusters in AudioMosaic space, manually name a small number of representative clusters in your language, then use those labels as bridge anchors.
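
One way to pick clusters worth naming is to run k-means over the z vectors and listen to the clips closest to each center. The sketch below is a minimal NumPy version on synthetic 16-d vectors. The clustering method, the number of clusters, and the helper names are assumptions chosen for illustration; any clustering library would serve equally well.

```python
import numpy as np

def _assign(z, centers):
    """Index of the nearest center (squared Euclidean) for every row of z."""
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def kmeans(z, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns (centers, assignment per row)."""
    rng = np.random.default_rng(seed)
    centers = z[rng.choice(len(z), size=n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        assign = _assign(z, centers)
        for c in range(n_clusters):
            members = z[assign == c]
            if len(members):                 # leave empty clusters untouched
                centers[c] = members.mean(axis=0)
    return centers, _assign(z, centers)

def cluster_representatives(z, centers, assign, per_cluster=3):
    """Row indices of the clips closest to each center, for manual naming."""
    reps = {}
    for c in range(len(centers)):
        idx = np.flatnonzero(assign == c)
        d2 = ((z[idx] - centers[c]) ** 2).sum(-1)
        reps[c] = idx[np.argsort(d2)[:per_cluster]].tolist()
    return reps

# Synthetic z vectors: three blobs of 40 clips each.
rng = np.random.default_rng(1)
blob_means = rng.normal(scale=5.0, size=(3, 16))
z = np.concatenate([m + rng.normal(size=(40, 16)) for m in blob_means])

centers, assign = kmeans(z, n_clusters=3)
reps = cluster_representatives(z, centers, assign)
print(reps)
```

The indices in `reps` point back to rows of the embedding table, so each one can be joined to its audio file for listening and labeling.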

You can start an extension pack with:

python scripts/asr_make_audio_mosaicist_extension_pack.py \
  --input your_audio_or_labels.csv \
  --out audio_mosaicist_extension_your_lang.jsonl \
  --lang your_language_code

The extension pack is a staging file: embed the listed audio with AudioMosaic, embed the labels with Qwen3 or another multilingual text encoder, then fit a small adapter or refit the bridge.

Timestamp precision

AudioMosaic retrieval is coarse. The production path is:

  1. Prompt -> AudioMosaic z query.
  2. Full-corpus top-k retrieval.
  3. Exact manifest join to recover the chiseled audio row.
  4. Multi-scale localization:
    • coarse calibrated AudioMosaic-style window score
    • fine relative sliding-window score inside the candidate region
    • optional energy trimming for boundary cleanup
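
The fine sliding-window score and the energy trimming in step 4 can be sketched as below. This is an illustration under stated assumptions, not the logic of `asr_localize_audiomosaic_events.py`: the window and hop sizes, the min-max rescaling, the 30 dB threshold, and the `embed_fn` callable (a stand-in for the AudioMosaic encoder) are all hypothetical.

```python
import numpy as np

def fine_window_scores(audio, sr, query_z, embed_fn, win_s=1.0, hop_s=0.25):
    """Cosine score per sliding window, min-max rescaled within the region."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    starts = np.arange(0, max(len(audio) - win, 0) + 1, hop)
    z = np.stack([embed_fn(audio[s:s + win]) for s in starts])
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-12)
    q = query_z / (np.linalg.norm(query_z) + 1e-12)
    sims = z @ q
    span = sims.max() - sims.min()
    rel = (sims - sims.min()) / span if span > 0 else np.zeros_like(sims)
    return starts / sr, rel            # window start times (s), relative scores

def energy_trim(audio, sr, start_s, end_s, frame_s=0.02, db_below_peak=30.0):
    """Shrink [start_s, end_s) to the frames within db_below_peak of the peak RMS."""
    frame = max(int(frame_s * sr), 1)
    a, b = int(start_s * sr), int(end_s * sr)
    seg = audio[a:b]
    n_frames = len(seg) // frame
    if n_frames == 0:
        return start_s, end_s
    frames = seg[:n_frames * frame].reshape(n_frames, frame)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    thresh = rms.max() * 10 ** (-db_below_peak / 20)
    active = np.flatnonzero(rms >= thresh)
    first, last = active[0], active[-1] + 1
    return float((a + first * frame) / sr), float((a + last * frame) / sr)

def toy_embed(x):
    """Placeholder encoder: a 2-d vector driven by RMS energy (NOT AudioMosaic)."""
    return np.array([np.sqrt(np.mean(x ** 2)), 0.1])

# Synthetic 4 s clip at 1 kHz: silence with a 50 Hz burst from 1.5 s to 2.5 s.
sr = 1000
audio = np.zeros(4 * sr)
audio[1500:2500] = np.sin(2 * np.pi * 50 * np.arange(1000) / sr)

times, rel = fine_window_scores(audio, sr, np.array([1.0, 0.0]), toy_embed)
best_start = float(times[np.argmax(rel)])
trimmed = energy_trim(audio, sr, 1.0, 3.0)
print(best_start, trimmed)
```

Because the scores are rescaled per candidate region, the top window always scores 1.0 even when nothing in the region matches the prompt, which is why such scores are suited to ranking within a region rather than thresholding across regions.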

Fine-window scores are relative, not fully calibrated probabilities. VisualEars-grade event alerts will need a dedicated fine event localizer trained on mined/mixed spans, which is left to a future model.

Current limitations

  • The public artifact is a bridge and mining toolkit, not a full end-to-end event detector.
  • The current category set is 15 VisualEars pilot categories, not the full 69-category target taxonomy.
  • Fine localization scores are useful for ranking and QA, but should not be treated as calibrated probabilities.
  • Prompting in languages other than English and Persian should be evaluated with local anchors before production use.