Voice Detection AI - Real vs AI Audio Classifier

Model Overview

This model is a fine-tuned Wav2Vec2-based audio classifier that distinguishes real human voices from AI-generated ones. It was trained on a dataset containing real human recordings and samples from various TTS models.


Model Details

  • Architecture: Wav2Vec2ForSequenceClassification
  • Fine-tuned on: Custom dataset with real and AI-generated audio
  • Classes:
    1. Real Human Voice
    2. AI-generated (e.g., MelGAN, DiffWave)
  • Input Requirements:
    • Audio format: .wav, .mp3, etc.
    • Sample rate: 16kHz
    • Max duration: 10 seconds (longer clips are truncated, shorter ones are zero-padded; see the sketch after this list)
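
For reference, a minimal sketch of this truncation/padding step (assuming mono audio already resampled to 16 kHz; 160,000 samples = 10 s × 16,000 Hz):

import torch

TARGET_SAMPLES = 10 * 16000  # 10 seconds at 16 kHz

def fix_duration(waveform: torch.Tensor) -> torch.Tensor:
    # waveform: 1-D tensor of samples at 16 kHz
    if waveform.shape[-1] > TARGET_SAMPLES:
        return waveform[..., :TARGET_SAMPLES]  # truncate long clips
    pad = TARGET_SAMPLES - waveform.shape[-1]
    return torch.nn.functional.pad(waveform, (0, pad))  # zero-pad short clips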

Performance

  • Robustness: Classifies reliably across the AI-generation models represented in the training data.
  • Limitations: Struggles with AI-generation models not seen during training (e.g., ElevenLabs).

How to Use

1. Install Dependencies

Make sure you have transformers, torch, and torchaudio installed:

pip install transformers torch torchaudio

2. Run Inference

Here's how to use VoiceGUARD for audio classification:

import torch
import torchaudio
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2Processor

# Load model and processor
model_name = "Mrkomiljon/voiceGUARD"
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name)
processor = Wav2Vec2Processor.from_pretrained(model_name)
model.eval()

# Load audio
waveform, sample_rate = torchaudio.load("path_to_audio_file.wav")

# Convert to mono if the file has multiple channels
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)

# Resample to 16 kHz if necessary
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
    waveform = resampler(waveform)

# Truncate to the 10-second maximum the model expects
waveform = waveform[..., : 10 * 16000]

# Preprocess the raw waveform for the model
inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt", padding=True)

# Inference
with torch.no_grad():
    logits = model(**inputs).logits
    predicted_ids = torch.argmax(logits, dim=-1)

# Map the predicted class index to a human-readable label
labels = ["Real Human Voice", "AI-generated"]
prediction = labels[predicted_ids.item()]
print(f"Prediction: {prediction}")

Training Procedure

  • Data Collection: Compiled a balanced dataset of real human voices and AI-generated samples from various TTS models.
  • Preprocessing: Standardized audio formats, resampled to 16 kHz, and adjusted durations to 10 seconds.
  • Fine-Tuning: Fine-tuned the Wav2Vec2 architecture for sequence classification, training for 3 epochs with a learning rate of 1e-5 (a sketch of such a run follows this list).
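
The exact training script is not published in this card; the following is a minimal sketch of how such a run could look with the Hugging Face Trainer, using the epoch count and learning rate above (the base checkpoint, batch size, and the train_ds dataset are assumptions, not documented details):

from transformers import Wav2Vec2ForSequenceClassification, Trainer, TrainingArguments

# Two-class head on top of a pretrained Wav2Vec2 encoder
# (assumption: facebook/wav2vec2-base as the starting checkpoint)
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=2
)

args = TrainingArguments(
    output_dir="voiceguard-finetune",
    num_train_epochs=3,             # as stated above
    learning_rate=1e-5,             # as stated above
    per_device_train_batch_size=8,  # assumption: batch size is not documented
)

# train_ds: a hypothetical dataset yielding {"input_values": ..., "labels": ...}
trainer = Trainer(model=model, args=args, train_dataset=train_ds)
trainer.train()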

Evaluation

  • Metrics: Accuracy, Precision, Recall
  • Results: Achieved 99.8% accuracy on the held-out evaluation set.
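
A minimal sketch of computing these metrics with scikit-learn, assuming y_true and y_pred hold ground-truth and predicted class indices (0 = real, 1 = AI-generated) collected from an inference loop like the one above:

from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [0, 1, 1, 0]  # placeholder ground-truth labels for illustration
y_pred = [0, 1, 0, 0]  # placeholder model predictions
print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))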

Limitations and Future Work

  • While VoiceGUARD performs robustly across known AI-generation models, it may encounter challenges with novel or unseen models.
  • Future work includes expanding the training dataset with samples from emerging TTS technologies to enhance generalization.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgements

  • Special thanks to the developers of the Wav2Vec2 model and the contributors to the datasets used in this project.
  • View the complete project on GitHub