|
---
license: apache-2.0
datasets:
- 012shin/fake-audio-detection-augmented
language:
- en
metrics:
- accuracy
- f1
- recall
- precision
base_model:
- MIT/ast-finetuned-audioset-10-10-0.4593
pipeline_tag: audio-classification
library_name: transformers
tags:
- audio
- audio-classification
- fake-audio-detection
- ast
model-index:
- name: ast-fakeaudio-detector
  results:
  - task:
      type: audio-classification
      name: Audio Classification
    dataset:
      name: fake-audio-detection-augmented
      type: 012shin/fake-audio-detection-augmented
    metrics:
    - type: accuracy
      value: 0.9662
    - type: f1
      value: 0.9710
    - type: precision
      value: 0.9692
    - type: recall
      value: 0.9728
---
|
|
|
# AST Fine-tuned for Fake Audio Detection

This model is a fine-tuned version of [MIT/ast-finetuned-audioset-10-10-0.4593](https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593) for detecting fake/synthetic audio. The original AST (Audio Spectrogram Transformer) classification head was replaced with a binary classification head optimized for fake audio detection.
|
|
|
## Model Description

- **Base Model**: MIT/ast-finetuned-audioset-10-10-0.4593 (AST pretrained on AudioSet)
- **Task**: Binary classification (fake/real audio detection)
- **Input**: Audio converted to a Mel spectrogram (128 mel bins, 1024 time frames); see the sketch below
- **Output**: Binary prediction (0: fake audio, 1: real audio)
- **Training Hardware**: 2x NVIDIA T4 GPUs
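
The preprocessing is handled by the model's feature extractor. As a minimal sketch of the input format (the silent 10-second waveform is only a stand-in for real audio), the base model's feature extractor pads or truncates any clip to a fixed 1024x128 spectrogram:

```python
from transformers import AutoFeatureExtractor
import torch

# Stand-in waveform: 10 seconds of silence at 16 kHz
waveform = torch.zeros(16000 * 10)

extractor = AutoFeatureExtractor.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

# AST consumes (batch, time_frames, mel_bins); shorter clips are padded
# and longer ones truncated to 1024 frames
print(inputs["input_values"].shape)  # torch.Size([1, 1024, 128])
```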
|
|
|
## Training Configuration

```python
{
    'learning_rate': 1e-5,
    'weight_decay': 0.01,
    'n_iterations': 1500,
    'batch_size': 16,
    'gradient_accumulation_steps': 8,
    'validate_every': 500,
    'val_samples': 5000
}
```
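
With `batch_size` 16 and 8 gradient-accumulation steps, the effective batch size is 128. The training loop itself isn't included in this card; a minimal sketch of how these hyperparameters plug into a gradient-accumulation loop (assuming `model` and `train_loader` are defined elsewhere) could look like:

```python
import torch

# `model` is the AST binary classifier; `train_loader` is assumed to yield
# dict batches with spectrograms and binary labels
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
accum_steps = 8  # effective batch size: 16 * 8 = 128

model.train()
optimizer.zero_grad()
for step, batch in enumerate(train_loader, start=1):
    loss = model(input_values=batch["input_values"], labels=batch["labels"]).loss
    (loss / accum_steps).backward()  # scale so gradients average over the macro-batch
    if step % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```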
|
|
|
## Dataset Distribution

The model was trained on a filtered dataset with the following class distribution:

```
Training Set:
- Fake Audio (0): 29,089 samples (53.97%)
- Real Audio (1): 24,813 samples (46.03%)

Test Set:
- Fake Audio (0): 7,229 samples (53.64%)
- Real Audio (1): 6,247 samples (46.36%)
```
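
The dataset is available on the Hugging Face Hub; a minimal sketch for inspecting it (assuming the default configuration loads directly with `datasets`):

```python
from datasets import load_dataset

ds = load_dataset("012shin/fake-audio-detection-augmented")
print(ds)  # inspect splits, features, and the 0 = fake / 1 = real label mapping
```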
|
|
|
## Model Performance

Final metrics on the validation set:

- Accuracy: 0.9662 (96.62%)
- F1 Score: 0.9710 (97.10%)
- Precision: 0.9692 (96.92%)
- Recall: 0.9728 (97.28%)
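
These are the standard binary-classification metrics; given predictions and ground-truth labels, they can be reproduced along these lines (a sketch with hypothetical `y_true`/`y_pred` arrays; which class counts as positive depends on `pos_label`):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical arrays; 0 = fake, 1 = real as elsewhere in this card
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 1]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))  # positive class defaults to 1
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
```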
|
|
|
# Usage Guide

## Model Usage

```python
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification
import torchaudio
import torch

# Load audio file
waveform, sample_rate = torchaudio.load("path_to_audio.ogg")

# Convert to mono and resample to the 16 kHz the model expects
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)

# Initialize model and feature extractor
model_name = "WpythonW/ast-fakeaudio-detector"
extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModelForAudioClassification.from_pretrained(model_name)
model.eval()

# Process audio and get predictions
inputs = extractor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probabilities = torch.nn.functional.softmax(logits, dim=-1)

# Index 0 is the fake class, index 1 the real class (see Model Description)
print(f"Probability of fake audio: {probabilities[0][0]:.2%}")
```
|
|
|
## Limitations

Important considerations when using this model:

1. The model works best with 16 kHz audio input
2. Performance may vary with manipulation types not present in the training data
3. Very short audio clips (<1 second) might not provide reliable results
4. The model should not be used as the sole determiner for real/fake audio detection
|
|
|
## Training Details

The training process involved:

1. Loading the base AST model pretrained on AudioSet
2. Replacing the classification head with a binary classifier (see the sketch below)
3. Fine-tuning on the fake audio detection dataset for 1500 iterations
4. Using gradient accumulation (8 steps) with batch size 16
5. Running validation every 500 steps
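
The setup code for steps 1-2 isn't included in this card; a minimal sketch using the `transformers` pattern of re-initializing the head via `num_labels` (the label names are assumptions consistent with the 0 = fake / 1 = real mapping above) could look like:

```python
from transformers import ASTForAudioClassification

# Load the AudioSet-pretrained AST and swap its 527-way head for a binary one;
# `ignore_mismatched_sizes=True` lets transformers re-initialize the new head.
model = ASTForAudioClassification.from_pretrained(
    "MIT/ast-finetuned-audioset-10-10-0.4593",
    num_labels=2,
    label2id={"fake": 0, "real": 1},
    id2label={0: "fake", 1: "real"},
    ignore_mismatched_sizes=True,
)
```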