ConvNeXt-Tiny-AT is an audio tagging CNN model trained on AudioSet (balanced and unbalanced subsets). It reached 0.471 mAP on the test set (see the paper cited below).

The model was trained on 10-second audio recordings sampled at 32 kHz, but you can provide any audio file: the usage code below includes resampling and padding/cropping.
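The padding/cropping step can be wrapped in a small helper. The sketch below is an illustration, not part of the repository: the function name `fit_to_length` is hypothetical, and the mono downmix by channel averaging is an assumption added here (the model appears to treat the first dimension as the batch, so a stereo file loaded as (2, T) would otherwise be read as two clips).

```python
import torch
from torch.nn import functional as F


def fit_to_length(waveform: torch.Tensor, target_length: int) -> torch.Tensor:
    """Downmix a (channels, time) waveform to mono, then zero-pad or crop it.

    Hypothetical helper; returns a tensor of shape (1, target_length).
    """
    # Assumption: average the channels so a stereo file is not read as two clips.
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)
    n_samples = waveform.shape[-1]
    if n_samples < target_length:
        # Zero-pad at the end of the recording.
        waveform = F.pad(waveform, (0, target_length - n_samples))
    else:
        # Keep the first target_length samples.
        waveform = waveform[:, :target_length]
    return waveform.contiguous()


sample_rate = 32000
stereo_short = torch.randn(2, 3 * sample_rate)  # 3 s stereo stand-in
mono_long = torch.randn(1, 12 * sample_rate)    # 12 s mono stand-in
print(fit_to_length(stereo_short, 10 * sample_rate).shape)
print(fit_to_length(mono_long, 10 * sample_rate).shape)
```

Both calls should return a tensor of shape (1, 320000), the 10-second length the model was trained on.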

The model provides logits and probabilities for the 527 audio event tags of AudioSet (see http://research.google.com/audioset/index.html).

Two methods are also available to get scene embeddings (a single vector per file) and frame-level embeddings; see below. The scene embedding is obtained from the frame-level embeddings by mean pooling over the frequency dimension, followed by mean pooling + max pooling over the time dimension.
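The pooling described above can be sketched on a random tensor. This is an illustration under stated assumptions, not the repository's code: "mean pooling + max pooling" is read here as the sum of the two pooled vectors, and the axis order (time on dim 2, frequency on dim 3) is inferred from the frame-level embedding shape reported further down. The function name `pool_scene_embedding` is hypothetical; `forward_scene_embeddings` in the repository remains the reference.

```python
import torch


def pool_scene_embedding(frame_embeddings: torch.Tensor) -> torch.Tensor:
    """Pool (batch, channels, time, freq) frame embeddings into (batch, channels)."""
    # Mean over the frequency axis (assumed to be the last dimension).
    x = frame_embeddings.mean(dim=3)
    # Mean pooling plus max pooling over the time axis (assumed to be dim 2).
    return x.mean(dim=2) + x.max(dim=2).values


# Random stand-in with the frame-level embedding shape reported for a 10 s clip.
frames = torch.randn(1, 768, 31, 7)
print(pool_scene_embedding(frames).shape)  # torch.Size([1, 768])
```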

Install

This code is based on our repo: https://github.com/topel/audioset-convnext-inf

You can pip install it:

pip install git+https://github.com/topel/audioset-convnext-inf@pip-install

Usage

Below is an example of how to instantiate the model, make tag predictions on an audio sample, and get embeddings (scene and frame levels).

import os
import numpy as np
import torch
from torch.nn import functional as TF
import torchaudio
import torchaudio.functional as TAF

from audioset_convnext_inf.pytorch.convnext import ConvNeXt
from audioset_convnext_inf.utils.utilities import read_audioset_label_tags

model = ConvNeXt.from_pretrained("topel/ConvNeXt-Tiny-AT", map_location='cpu')

print(
    "# params:",
    sum(param.numel() for param in model.parameters() if param.requires_grad),
)
# Use the GPU when one is available, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

Output:

# params: 28222767

Inference: get logits and probabilities

To run the following, first download 254906__tpellegrini__cavaco1.wav and class_labels_indices.csv from this repository.

sample_rate = 32000
audio_target_length = 10 * sample_rate  # 10 s

# AUDIO_FNAME = "f62-S-v2swA_200000_210000.wav"
AUDIO_FNAME = "254906__tpellegrini__cavaco1.wav"

current_dir = os.getcwd()
AUDIO_FPATH = os.path.join(current_dir, AUDIO_FNAME)

# torchaudio.load returns a (channels, time) tensor and the file's sample rate.
waveform, sample_rate_ = torchaudio.load(AUDIO_FPATH)
if sample_rate_ != sample_rate:
    print("Resampling from %d to %d Hz" % (sample_rate_, sample_rate))
    waveform = TAF.resample(
        waveform,
        sample_rate_,
        sample_rate,
    )

# Zero-pad short clips and crop long ones to exactly 10 s.
if waveform.shape[-1] < audio_target_length:
    print("Padding waveform")
    missing = audio_target_length - waveform.shape[-1]
    waveform = TF.pad(waveform, (0, missing), mode="constant", value=0.0)
elif waveform.shape[-1] > audio_target_length:
    print("Cropping waveform")
    waveform = waveform[:, :audio_target_length]

waveform = waveform.contiguous()
waveform = waveform.to(device)

print("\nInference on " + AUDIO_FNAME + "\n")

with torch.no_grad():
    model.eval()
    output = model(waveform)

logits = output["clipwise_logits"]
print("logits size:", logits.size())

probs = output["clipwise_output"]
# Equivalent: probs = torch.sigmoid(logits)
print("probs size:", probs.size())

lb_to_ix, ix_to_lb, id_to_ix, ix_to_id = read_audioset_label_tags(os.path.join(current_dir, "class_labels_indices.csv"))

threshold = 0.25
# Indices of the tags whose probability exceeds the threshold.
sample_labels = np.where(probs[0].cpu().numpy() > threshold)[0]
print("\nPredicted labels using activity threshold %.2f:\n" % threshold)
print(sample_labels)
for ix in sample_labels:
    print("%s: %.3f" % (ix_to_lb[ix], probs[0, ix]))

Output:

Resampling from 44100 to 32000 Hz
Padding waveform

Inference on 254906__tpellegrini__cavaco1.wav

logits size: torch.Size([1, 527])
probs size: torch.Size([1, 527])

Predicted labels using activity threshold 0.25:

[137 138 139 140 149 151]
Music: 0.896
Musical instrument: 0.686
Plucked string instrument: 0.608
Guitar: 0.369
Mandolin: 0.710
Ukulele: 0.268

Technically speaking, it is neither a mandolin nor a ukulele, but a Brazilian cousin: the cavaquinho!
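A fixed threshold is one way to read the probabilities; ranking the tags and keeping the k most likely ones is another. The sketch below is a plain-Python illustration with a hypothetical helper `top_k_tags`. The toy probabilities reuse the six values printed above, and the toy indices 0 to 5 are placeholders, not the real AudioSet indices.

```python
from typing import Dict, List, Sequence, Tuple


def top_k_tags(
    probs: Sequence[float], ix_to_lb: Dict[int, str], k: int = 5
) -> List[Tuple[str, float]]:
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(range(len(probs)), key=lambda ix: probs[ix], reverse=True)
    return [(ix_to_lb[ix], probs[ix]) for ix in ranked[:k]]


# Toy stand-ins (placeholder indices, probabilities copied from the output above).
toy_probs = [0.896, 0.686, 0.608, 0.369, 0.710, 0.268]
toy_labels = {
    0: "Music",
    1: "Musical instrument",
    2: "Plucked string instrument",
    3: "Guitar",
    4: "Mandolin",
    5: "Ukulele",
}
for label, p in top_k_tags(toy_probs, toy_labels, k=3):
    print("%s: %.3f" % (label, p))
```

With the model output, the same helper could be called as `top_k_tags(probs[0].tolist(), ix_to_lb, k=5)`, using the `ix_to_lb` mapping loaded earlier.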

Get audio scene embeddings

with torch.no_grad():
    model.eval()
    output = model.forward_scene_embeddings(waveform)

print("\nScene embedding, shape:", output.size())

Output:

Scene embedding, shape: torch.Size([1, 768])
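One possible use of scene embeddings is comparing two recordings. The sketch below uses cosine similarity on random stand-in vectors of the shape shown above; the choice of cosine similarity is an assumption made for illustration, not something the model or the paper prescribes.

```python
import torch
from torch.nn import functional as F

# Random stand-ins for two scene embeddings of shape (1, 768);
# in practice these would come from model.forward_scene_embeddings(waveform).
emb_a = torch.randn(1, 768)
emb_b = torch.randn(1, 768)

# Cosine similarity along the embedding dimension, one value per clip pair.
similarity = F.cosine_similarity(emb_a, emb_b, dim=1)
print("cosine similarity: %.3f" % similarity.item())
```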

Get frame-level embeddings

with torch.no_grad():
    model.eval()
    output = model.forward_frame_embeddings(waveform)

print("\nFrame-level embeddings, shape:", output.size())

Output:

Frame-level embeddings, shape: torch.Size([1, 768, 31, 7])
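To relate a frame index to a position in the clip, a rough mapping can be sketched as follows. It assumes that dim 2 (size 31) is the time axis and that the frames evenly tile the 10-second input; in practice the receptive fields of neighbouring frames overlap, so these spans are only approximate. The helper `frame_time_span` is hypothetical.

```python
from typing import Tuple


def frame_time_span(
    frame_ix: int, n_frames: int = 31, clip_duration_s: float = 10.0
) -> Tuple[float, float]:
    """Approximate (start, end) time in seconds covered by one time frame."""
    # Assumption: frames split the clip into equal, non-overlapping spans.
    frame_duration = clip_duration_s / n_frames
    return frame_ix * frame_duration, (frame_ix + 1) * frame_duration


for ix in (0, 15, 30):
    start, end = frame_time_span(ix)
    print("frame %2d: %.2f s to %.2f s" % (ix, start, end))
```

Under these assumptions each frame covers roughly a third of a second.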

Zenodo

The checkpoint is also available on Zenodo: https://zenodo.org/record/8020843/files/convnext_tiny_471mAP.pth?download=1

Citation

Paper available

Cite as: Pellegrini, T., Khalfaoui-Hassani, I., Labbé, E., Masquelier, T. (2023) Adapting a ConvNeXt Model to Audio Classification on AudioSet. Proc. INTERSPEECH 2023, 4169-4173, doi: 10.21437/Interspeech.2023-1564

@inproceedings{pellegrini23_interspeech,
  author={Thomas Pellegrini and Ismail Khalfaoui-Hassani and Etienne Labb\'e and Timoth\'ee Masquelier},
  title={{Adapting a ConvNeXt Model to Audio Classification on AudioSet}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
  pages={4169--4173},
  doi={10.21437/Interspeech.2023-1564}
}