---
license: apache-2.0
datasets:
- Ayoub-Laachir/Darija_Dataset
language:
- ary
metrics:
- wer
- cer
base_model:
- openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition
---

# Model Card for Fine-tuned Whisper Large V3 (Moroccan Darija)

## Model Overview

**Model Name:** Whisper Large V3 (Fine-tuned for Moroccan Darija)

**Author:** Ayoub Laachir

**License:** apache-2.0

**Repository:** [Ayoub-Laachir/MaghrebVoice](https://huggingface.co/Ayoub-Laachir/MaghrebVoice)

**Dataset:** [Ayoub-Laachir/Darija_Dataset](https://huggingface.co/datasets/Ayoub-Laachir/Darija_Dataset)

## Description

This model is a fine-tuned version of OpenAI’s Whisper Large V3, adapted to recognize and transcribe Moroccan Darija, an Arabic dialect shaped by Berber, French, and Spanish influences. The project aims to improve technological accessibility for millions of Moroccans and to serve as a blueprint for similar work on other underrepresented languages.

## Technologies Used

- **Whisper Large V3:** OpenAI’s state-of-the-art speech recognition model
- **PEFT (Parameter-Efficient Fine-Tuning) with LoRA (Low-Rank Adaptation):** an efficient fine-tuning technique that trains small adapter matrices instead of the full model (see the sketch below)
- **Google Colab:** cloud environment used for training the model
- **Hugging Face:** hosting for the dataset and the final model
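
Below is a minimal sketch of how LoRA adapters are typically attached to Whisper with PEFT before training. The rank, alpha, dropout, and target modules shown are common illustrative choices, not necessarily the exact settings used for this model.

```python
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load the base Whisper model
base_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

# LoRA configuration (values are illustrative assumptions, not the project's exact settings)
lora_config = LoraConfig(
    r=32,                                 # rank of the low-rank update matrices
    lora_alpha=64,                        # scaling factor for the LoRA updates
    lora_dropout=0.05,                    # dropout applied to the LoRA layers
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
)

# Wrap the base model so that only the LoRA adapter weights are trainable
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```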
## Dataset Preparation

The dataset preparation involved several steps (a code sketch of steps 2–5 follows the list):

1. **Cleaning:** Correcting faulty transcriptions and standardizing word spellings.
2. **Audio Processing:** Converting all samples to a 16 kHz sample rate.
3. **Dataset Split:** Creating a training set of 3,312 samples and a test set of 150 samples.
4. **Format Conversion:** Converting the dataset to the Parquet file format.
5. **Uploading:** Pushing the prepared dataset to the Hugging Face Hub.
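
The following is a minimal sketch of steps 2–5 using the Hugging Face `datasets` library. The loading call, local path, and column name are assumptions for illustration; the actual preparation script is not part of this card.

```python
from datasets import load_dataset, DatasetDict, Audio

# Load the cleaned dataset (source path/format is an assumption for illustration)
dataset = load_dataset("audiofolder", data_dir="./darija_clips")["train"]

# Step 2: resample every clip to the 16 kHz rate expected by Whisper
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

# Step 3: split into training and test sets (150 test samples, as described above)
splits = dataset.train_test_split(test_size=150, seed=42)
dataset = DatasetDict({"train": splits["train"], "test": splits["test"]})

# Steps 4–5: push to the Hugging Face Hub; the data is stored as Parquet files
dataset.push_to_hub("Ayoub-Laachir/Darija_Dataset")
```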
## Training Process

The model was fine-tuned using the following parameters:

- **Per device train batch size:** 8
- **Gradient accumulation steps:** 1
- **Learning rate:** 1e-4 (0.0001)
- **Warmup steps:** 200
- **Number of train epochs:** 2
- **Logging and evaluation:** every 50 steps
- **Weight decay:** 0.01

Training progress showed a steady decrease in both training and validation loss over 8000 steps.
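
These hyperparameters map onto a Hugging Face `Seq2SeqTrainingArguments` configuration roughly as follows. This is a sketch rather than the exact training script; the output directory and precision flag are assumptions.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-darija-lora",  # assumed output path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    learning_rate=1e-4,
    warmup_steps=200,
    num_train_epochs=2,
    weight_decay=0.01,
    logging_steps=50,
    eval_strategy="steps",        # called `evaluation_strategy` in older transformers releases
    eval_steps=50,
    fp16=True,                    # assumed; mixed precision is typical on Colab GPUs
    remove_unused_columns=False,  # commonly required when training a PEFT-wrapped Whisper model
)
```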
## Testing and Evaluation

The model was evaluated using the following metrics:

- **Word Error Rate (WER):** 3.1467%
- **Character Error Rate (CER):** 2.3893%

These metrics demonstrate the model's ability to transcribe Moroccan Darija speech accurately.
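
Both metrics can be computed with the Hugging Face `evaluate` library. The snippet below is a sketch of the computation; the prediction and reference lists would come from running the model on the 150-sample test split.

```python
# pip install evaluate jiwer
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Hypothetical placeholder lists; in practice these are the model outputs and
# ground-truth transcriptions for the test split.
predictions = ["..."]
references = ["..."]

wer = 100 * wer_metric.compute(predictions=predictions, references=references)
cer = 100 * cer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.4f}%, CER: {cer:.4f}%")
```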
Compared with the base model, the fine-tuned model shows improved handling of Darija-specific words and sentence structure, and better overall transcription accuracy.

## Audio Transcription Script with PEFT Layers

This script demonstrates how to transcribe audio files using the fine-tuned Whisper Large V3 model for Moroccan Darija by loading the PEFT (Parameter-Efficient Fine-Tuning) LoRA layers on top of the base model.

### Required Libraries

Before running the script, ensure the following libraries are installed (in a Colab or Jupyter notebook, prefix each command with `!`):

```bash
pip install --upgrade pip
pip install --upgrade transformers accelerate librosa soundfile pydub
pip install peft==0.3.0  # PEFT library for loading the LoRA adapter
```

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import librosa
import soundfile as sf
from pydub import AudioSegment
from peft import PeftModel, PeftConfig  # PEFT classes for loading the LoRA adapter

# Set the device to GPU if available, else use CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Configuration for the base Whisper model
base_model_name = "openai/whisper-large-v3"                  # Base Whisper model
processor = AutoProcessor.from_pretrained(base_model_name)   # Load the processor (tokenizer + feature extractor)

# Load the fine-tuned adapter configuration
model_name = "Ayoub-Laachir/MaghrebVoice_OnlyLoRaLayers"     # Repository containing only the LoRA layers
peft_config = PeftConfig.from_pretrained(model_name)         # Load the PEFT configuration

# Load the base model in the same dtype used by the pipeline
base_model = AutoModelForSpeechSeq2Seq.from_pretrained(base_model_name, torch_dtype=torch_dtype).to(device)

# Attach the LoRA adapter to the base model
model = PeftModel.from_pretrained(base_model, model_name).to(device)

# Merge the LoRA weights into the base model so it behaves like a regular Whisper model
model = model.merge_and_unload()

# Configuration for transcription
config = {
    "language": "arabic",    # Language for transcription
    "task": "transcribe",    # Task type
    "chunk_length_s": 30,    # Length of each audio chunk in seconds
    "stride_length_s": 5,    # Overlap between chunks in seconds
}

# Initialize the automatic speech recognition pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,  # Use the merged model
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    chunk_length_s=config["chunk_length_s"],
    stride_length_s=config["stride_length_s"],
)

# Convert audio to a 16 kHz sampling rate
def convert_audio_to_16khz(input_path, output_path):
    audio, sr = librosa.load(input_path, sr=None)                     # Load the audio file
    audio_16k = librosa.resample(audio, orig_sr=sr, target_sr=16000)  # Resample to 16 kHz
    sf.write(output_path, audio_16k, 16000)                           # Save the converted audio

# Format time as HH:MM:SS.milliseconds
def format_time(seconds):
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    seconds = seconds % 60
    return f"{hours:02d}:{minutes:02d}:{seconds:06.3f}"

# Transcribe an audio file
def transcribe_audio(audio_path):
    try:
        # Pass the language/task settings through to Whisper's generation step
        result = pipe(
            audio_path,
            return_timestamps=True,
            generate_kwargs={"language": config["language"], "task": config["task"]},
        )
        return result["chunks"]  # Return transcription chunks with timestamps
    except Exception as e:
        print(f"Error transcribing audio: {e}")
        return None

# Main function to execute the transcription process
def main():
    # Specify input and output audio paths (update paths as needed)
    input_audio_path = "/path/to/your/input/audio.mp3"          # Replace with your input audio path
    output_audio_path = "/path/to/your/output/audio_16khz.wav"  # Replace with your output audio path

    # Convert audio to 16 kHz
    convert_audio_to_16khz(input_audio_path, output_audio_path)

    # Transcribe the converted audio
    transcription_chunks = transcribe_audio(output_audio_path)

    if transcription_chunks:
        print("WEBVTT\n")  # Print the WebVTT header
        for chunk in transcription_chunks:
            start_time = format_time(chunk["timestamp"][0])  # Format start time
            end_time = format_time(chunk["timestamp"][1])    # Format end time
            text = chunk["text"]                             # Get the transcribed text
            print(f"{start_time} --> {end_time}")            # Print the cue time range
            print(f"{text}\n")                               # Print the transcribed text
    else:
        print("Transcription failed.")

if __name__ == "__main__":
    main()
```
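
Run with the placeholder paths replaced by real files, the script prints WebVTT-style cues to standard output, along these lines (illustrative placeholders, not real model output):

```text
WEBVTT

00:00:00.000 --> 00:00:05.240
<first transcribed Darija segment>

00:00:05.240 --> 00:00:11.180
<second transcribed Darija segment>
```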
## Challenges and Future Improvements

### Challenges Encountered

- Inconsistent spellings of the same Moroccan Darija words across transcriptions
- Cleaning and standardizing the dataset

### Future Improvements

- Expand the dataset to include more Darija accents and expressions
- Further fine-tune the model for specific Moroccan regional dialects
- Explore integration into practical applications like voice assistants and transcription services

## Conclusion

This project marks a significant step toward making AI more accessible to Moroccan Arabic speakers. The success of this fine-tuned model highlights the potential of adapting advanced AI technologies to underrepresented languages, and it can serve as a template for similar initiatives across North Africa.
|