Self-Supervised Audio Spectrogram Transformer (pretrained on AudioSet/Librispeech)

Self-Supervised Audio Spectrogram Transformer (SSAST) model with an uninitialized classifier head. It was introduced in the paper SSAST: Self-Supervised Audio Spectrogram Transformer by Gong et al. and first released in this repository.

Disclaimer: The team releasing Audio Spectrogram Transformer did not write a model card for this model.

Model description

The Audio Spectrogram Transformer is equivalent to ViT, but applied to audio. Audio is first converted into an image (a spectrogram), which is then split into patches and fed to a Vision Transformer. The model achieves state-of-the-art results on several audio classification benchmarks.
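To make the "spectrogram as image" idea concrete, here is a minimal sketch of how a spectrogram could be cut into ViT-style patches. The input size (128 mel bins by 1024 frames) and the non-overlapping 16x16 patches are assumptions for illustration, suggested by the "16-16" in the model name rather than confirmed by this card; `patch_grid` and `patchify` are hypothetical helper names.

```python
def patch_grid(n_mels, n_frames, patch=16, stride=16):
    """Return (rows, cols) of patches cut from an n_mels x n_frames spectrogram."""
    rows = (n_mels - patch) // stride + 1
    cols = (n_frames - patch) // stride + 1
    return rows, cols


def patchify(spec, patch=16, stride=16):
    """Split a 2-D list (frequency x time) into flattened square patches."""
    rows, cols = patch_grid(len(spec), len(spec[0]), patch, stride)
    patches = []
    for r in range(rows):
        for c in range(cols):
            f0, t0 = r * stride, c * stride
            # Flatten one patch row by row, as a ViT would before projecting it.
            patches.append(
                [spec[f][t] for f in range(f0, f0 + patch)
                 for t in range(t0, t0 + patch)]
            )
    return patches


# Assumed input: 128 mel bins x 1024 time frames, filled with zeros.
spec = [[0.0] * 1024 for _ in range(128)]
patches = patchify(spec)
print(len(patches), len(patches[0]))  # 512 256
```

Each flattened patch would then be linearly projected to a token embedding, so under these assumptions the Transformer sees a sequence of 512 patch tokens.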

Usage

The model is pretrained on a large amount of unlabeled audio (AudioSet and Librispeech). Fine-tune the classifier head before use, as it comes uninitialized.
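One possible way to prepare the model for fine-tuning with the `transformers` library is sketched below. It assumes the checkpoint loads through `AutoModelForAudioClassification` and that the head parameters are named with a `classifier` prefix (as in `ASTForAudioClassification`); neither is confirmed by this card. The helper names and the `num_labels` value are hypothetical, and freezing the backbone is an optional design choice, not a requirement.

```python
def freeze_all_but_head(model, head_prefix="classifier"):
    """Freeze every parameter except those whose name starts with head_prefix.

    Returns the names of the parameters left trainable.
    """
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(head_prefix)
        if param.requires_grad:
            trainable.append(name)
    return trainable


def load_for_finetuning(model_id, num_labels):
    """Load the pretrained backbone with a fresh head of size num_labels."""
    # Imported lazily so the helper above can be used without transformers.
    from transformers import AutoModelForAudioClassification

    # Assumption: the checkpoint is compatible with this auto class.
    model = AutoModelForAudioClassification.from_pretrained(
        model_id, num_labels=num_labels
    )
    freeze_all_but_head(model)
    return model


# Example call (downloads weights; num_labels=10 is a made-up value):
# model = load_for_finetuning(
#     "Simon-Kotchou/ssast-base-patch-audioset-16-16", num_labels=10
# )
```

With the backbone frozen, only the head is updated during training, which is cheap but may underperform full fine-tuning; skip `freeze_all_but_head` to train all weights.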

Model size: 86.1M params (Safetensors, tensor type F32)

Datasets used to train Simon-Kotchou/ssast-base-patch-audioset-16-16