---
license: apache-2.0
datasets:
- 012shin/fake-audio-detection-augmented
language:
- en
metrics:
- accuracy
- f1
- recall
- precision
base_model:
- MIT/ast-finetuned-audioset-10-10-0.4593
pipeline_tag: audio-classification
library_name: transformers
tags:
- audio
- audio-classification
- fake-audio-detection
- ast
model-index:
- name: ast-fakeaudio-detector
results:
- task:
type: audio-classification
name: Audio Classification
dataset:
name: fake-audio-detection-augmented
type: 012shin/fake-audio-detection-augmented
metrics:
- type: accuracy
value: 0.9662
- type: f1
value: 0.9710
- type: precision
value: 0.9692
- type: recall
value: 0.9728
---
# AST Fine-tuned for Fake Audio Detection
This model is a fine-tuned version of [MIT/ast-finetuned-audioset-10-10-0.4593](https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593) for detecting fake/synthetic audio. The original AST (Audio Spectrogram Transformer) classification head was replaced with a binary classification layer and fine-tuned for the fake audio detection task.
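For reference, this head swap can be reproduced with the standard `transformers` reload pattern (a minimal sketch, assuming the usual `num_labels` mechanism; the actual training script is not published):

```python
from transformers import ASTForAudioClassification

# Reload the AudioSet checkpoint with a fresh 2-way classification head;
# ignore_mismatched_sizes discards the original 527-class head weights
model = ASTForAudioClassification.from_pretrained(
    "MIT/ast-finetuned-audioset-10-10-0.4593",
    num_labels=2,
    ignore_mismatched_sizes=True,
)
```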
## Model Description
- **Base Model**: MIT/ast-finetuned-audioset-10-10-0.4593 (AST pretrained on AudioSet)
- **Task**: Binary classification (fake/real audio detection)
- **Input**: Audio converted to Mel spectrogram (128 mel bins, 1024 time frames)
- **Output**: Binary prediction (0: fake audio, 1: real audio)
- **Training Hardware**: 2x NVIDIA T4 GPUs
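The input shape can be verified directly with the feature extractor (a quick sanity check, assuming the default AST extractor settings):

```python
import torch
from transformers import AutoFeatureExtractor

extractor = AutoFeatureExtractor.from_pretrained("WpythonW/ast-fakeaudio-detector")
# One second of silence at 16 kHz is padded to the fixed spectrogram size
dummy = torch.zeros(16000)
inputs = extractor(dummy.numpy(), sampling_rate=16000, return_tensors="pt")
print(inputs["input_values"].shape)  # expected: torch.Size([1, 1024, 128])
```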
## Training Configuration
```python
{
'learning_rate': 1e-5,
'weight_decay': 0.01,
'n_iterations': 1500,
'batch_size': 16,
'gradient_accumulation_steps': 8,
'validate_every': 500,
'val_samples': 5000
}
```
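These values imply 16 × 8 = 128 samples per optimizer step. Below is a minimal sketch of the implied update loop (illustrative only; `model` and `train_loader` are assumed to be defined, and the actual training script is not published):

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
accum_steps = 8  # config: gradient_accumulation_steps

for step, batch in enumerate(train_loader):
    # Scale the loss so gradients average over the accumulated micro-batches
    loss = model(**batch).loss / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```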
## Dataset Distribution
The model was trained on a filtered dataset with the following class distribution:
```
Training Set:
- Fake Audio (0): 29,089 samples (53.97%)
- Real Audio (1): 24,813 samples (46.03%)
Test Set:
- Fake Audio (0): 7,229 samples (53.64%)
- Real Audio (1): 6,247 samples (46.36%)
```
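The underlying dataset is available on the Hub and can be loaded with `datasets` (split names and column layout are assumptions; check the dataset card):

```python
from datasets import load_dataset

ds = load_dataset("012shin/fake-audio-detection-augmented")
print(ds)  # inspect available splits and features
```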
## Model Performance
Final metrics on the validation set:
- Accuracy: 0.9662 (96.62%)
- F1 Score: 0.9710 (97.10%)
- Precision: 0.9692 (96.92%)
- Recall: 0.9728 (97.28%)
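For reference, these metrics correspond to the standard scikit-learn definitions (a sketch; `y_true`/`y_pred` are illustrative placeholders, and the positive label used for F1/precision/recall is an assumption):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [0, 0, 1, 1]  # illustrative ground-truth labels (0: fake, 1: real)
y_pred = [0, 1, 1, 1]  # illustrative predictions
print(accuracy_score(y_true, y_pred))
print(f1_score(y_true, y_pred))        # pos_label=1 by default
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
```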
## Usage
```python
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification
import torchaudio
import torch

# Initialize model and feature extractor
model_name = "WpythonW/ast-fakeaudio-detector"
extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModelForAudioClassification.from_pretrained(model_name)
model.eval()

# Load the audio file and resample to the 16 kHz rate the model expects
waveform, sample_rate = torchaudio.load("path_to_audio.ogg")
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
# Collapse multi-channel audio to mono
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)

# Process audio and get predictions
inputs = extractor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
probabilities = torch.nn.functional.softmax(logits, dim=-1)
print(f"Probability of fake audio: {probabilities[0][0]:.2%}")
```
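To avoid hard-coding the label index, the mapping can be read from the model config (label names depend on how the checkpoint was exported), continuing from the snippet above:

```python
# Resolve the predicted class through the configured id2label mapping
pred_id = int(logits.argmax(dim=-1))
print(f"{model.config.id2label[pred_id]}: {probabilities[0][pred_id]:.2%}")
```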
## Limitations
Important considerations when using this model:
1. The model expects 16 kHz audio input; resample other sample rates before inference (as shown in the usage example above)
2. Performance may degrade on types of audio manipulation not represented in the training data
3. Very short audio clips (<1 second) might not yield reliable results; a simple length guard is sketched below
4. The model should not be used as the sole basis for deciding whether audio is real or fake
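A simple guard for the short-clip caveat, reusing `waveform` from the usage example (illustrative; the one-second threshold is taken from the list above):

```python
# Reject clips shorter than one second before running inference
min_samples = 16000  # 1 second at 16 kHz
if waveform.squeeze().shape[-1] < min_samples:
    raise ValueError("Audio clip too short for a reliable prediction")
```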
## Training Details
The training process involved:
1. Loading the base AST model pretrained on AudioSet
2. Replacing the classification head with a binary classifier
3. Fine-tuning on the fake audio detection dataset for 1500 iterations
4. Using gradient accumulation (8 steps) with batch size 16
5. Implementing validation checks every 500 steps