---
license: apache-2.0
datasets:
- 012shin/fake-audio-detection-augmented
language:
- en
metrics:
- accuracy
- f1
- recall
- precision
base_model:
- MIT/ast-finetuned-audioset-10-10-0.4593
pipeline_tag: audio-classification
library_name: transformers
tags:
- audio
- audio-classification
- fake-audio-detection
- ast
model-index:
- name: ast-fakeaudio-detector
  results:
  - task:
      type: audio-classification
      name: Audio Classification
    dataset:
      name: fake-audio-detection-augmented
      type: 012shin/fake-audio-detection-augmented
    metrics:
    - type: accuracy
      value: 0.9662
    - type: f1
      value: 0.9710
    - type: precision
      value: 0.9692
    - type: recall
      value: 0.9728
---

# AST Fine-tuned for Fake Audio Detection

This model is a fine-tuned version of [MIT/ast-finetuned-audioset-10-10-0.4593](https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593) for detecting fake/synthetic audio. The original AST (Audio Spectrogram Transformer) classification head was replaced with a binary classification layer optimized for fake audio detection.

## Model Description

- **Base Model**: MIT/ast-finetuned-audioset-10-10-0.4593 (AST pretrained on AudioSet)
- **Task**: Binary classification (fake/real audio detection)
- **Input**: Audio converted to a Mel spectrogram (128 mel bins, 1024 time frames)
- **Output**: Binary prediction (0: real audio, 1: fake audio)
- **Training Hardware**: 2x NVIDIA T4 GPUs

## Training Configuration

```python
{
    'learning_rate': 1e-5,
    'weight_decay': 0.01,
    'n_iterations': 10000,
    'batch_size': 8,
    'gradient_accumulation_steps': 8,
    'validate_every': 500,
    'val_samples': 5000
}
```

## Dataset Distribution

The model was trained on the [012shin/fake-audio-detection-augmented](https://huggingface.co/datasets/012shin/fake-audio-detection-augmented) dataset with the following class distribution:

```
Training Set (80%):
- Real Audio (0): 43,460 samples (63.69%)
- Fake Audio (1): 24,776 samples (36.31%)

Test Set (20%):
- Real Audio (0): 10,776 samples (63.17%)
- Fake Audio (1):  6,284 samples (36.83%)
```

## Model Performance

Final metrics on the validation set:

- Accuracy: 0.9662 (96.62%)
- F1 Score: 0.9710 (97.10%)
- Precision: 0.9692 (96.92%)
- Recall: 0.9728 (97.28%)

## Usage

Here's how to use the model:

```python
import torch
import torchaudio
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

# Load model and feature extractor
model = AutoModelForAudioClassification.from_pretrained("your-username/ast-fakeaudio-detector")
processor = AutoFeatureExtractor.from_pretrained("your-username/ast-fakeaudio-detector")

# Load and preprocess audio (the model expects 16 kHz input)
audio_path = "path/to/audio.wav"
waveform, sample_rate = torchaudio.load(audio_path)
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(sample_rate, 16000)
    waveform = resampler(waveform)

# Mix down to mono and drop the channel dimension; the feature
# extractor expects a 1D array rather than a (channels, samples) tensor
waveform = waveform.mean(dim=0)

# Process audio
inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

# Get prediction (single-logit head: sigmoid gives P(fake))
with torch.no_grad():
    outputs = model(**inputs)
    probability = torch.sigmoid(outputs.logits)[0][0].item()
    is_fake = probability > 0.5

print(f"Probability of being fake audio: {probability:.4f}")
print(f"Prediction: {'FAKE' if is_fake else 'REAL'} audio")
```
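The AST feature extractor pads or truncates its input to 1024 time frames (roughly 10 seconds of 16 kHz audio), so longer recordings are effectively cut off. Below is a minimal sketch for scoring longer files by averaging chunk-level probabilities; the `predict_long_audio` helper, the 10-second window, and the mean aggregation are illustrative assumptions, not part of the released model:

```python
import torch

def predict_long_audio(waveform, model, processor, chunk_seconds=10, sr=16000):
    """Average chunk-level fake probabilities over a long mono waveform (assumed helper)."""
    chunk_len = chunk_seconds * sr
    probs = []
    for start in range(0, waveform.shape[-1], chunk_len):
        chunk = waveform[start:start + chunk_len]
        if chunk.shape[-1] < sr:  # skip sub-second tails (unreliable, see Limitations)
            continue
        inputs = processor(chunk.numpy(), sampling_rate=sr, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        probs.append(torch.sigmoid(logits).item())  # single-logit binary head
    return sum(probs) / len(probs) if probs else None
```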
## Limitations

Important considerations when using this model:

1. The model works best with 16 kHz audio input.
2. Performance may vary on manipulation or synthesis methods not represented in the training data.
3. Very short audio clips (under 1 second) might not produce reliable results.
4. The model should not be used as the sole determiner for real/fake audio detection.

## Training Details

The training process involved:

1. Loading the base AST model pretrained on AudioSet
2. Replacing the classification head with a binary classifier
3. Fine-tuning on the fake audio detection dataset for 10,000 iterations
4. Using gradient accumulation (8 steps) with a batch size of 8, for an effective batch size of 64 per optimizer step
5. Running validation every 500 steps
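The following is a minimal sketch of that recipe, assuming a `train_loader` that yields `(input_values, labels)` batches produced by the AST feature extractor. The single-logit head and `BCEWithLogitsLoss` match the sigmoid-based inference shown above, but the exact training code was not released:

```python
import torch
from transformers import ASTForAudioClassification

# Steps 1-2: load the AudioSet checkpoint and swap the 527-class head
# for a fresh single-logit binary classifier.
model = ASTForAudioClassification.from_pretrained(
    "MIT/ast-finetuned-audioset-10-10-0.4593",
    num_labels=1,
    ignore_mismatched_sizes=True,  # discard the original classification head
)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
criterion = torch.nn.BCEWithLogitsLoss()
accumulation_steps = 8

model.train()
for step, (input_values, labels) in enumerate(train_loader):  # train_loader is assumed
    logits = model(input_values=input_values).logits.squeeze(-1)
    loss = criterion(logits, labels.float())
    (loss / accumulation_steps).backward()  # steps 3-4: accumulate gradients
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
    if (step + 1) % 500 == 0:
        ...  # step 5: evaluate on the 5,000 held-out validation samples
```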