---
license: apache-2.0
datasets:
- Ayoub-Laachir/Darija_Dataset
language:
- ary
metrics:
- wer
- cer
base_model:
- openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition
---
# Model Card for Fine-tuned Whisper Large V3 (Moroccan Darija)
## Model Overview
**Model Name:** Whisper Large V3 (Fine-tuned for Moroccan Darija)
**Author:** Ayoub Laachir
**License:** apache-2.0
**Repository:** [Ayoub-Laachir/MaghrebVoice](https://huggingface.co/Ayoub-Laachir/MaghrebVoice)
**Dataset:** [Ayoub-Laachir/Darija_Dataset](https://huggingface.co/datasets/Ayoub-Laachir/Darija_Dataset)
## Description
This model is a fine-tuned version of OpenAI’s Whisper Large V3, specifically adapted for recognizing and transcribing Moroccan Darija, a dialect influenced by Arabic, Berber, French, and Spanish. The project aims to improve technological accessibility for millions of Moroccans and serve as a blueprint for similar advancements in underrepresented languages.
## Technologies Used
- **Whisper Large V3:** OpenAI’s state-of-the-art speech recognition model
- **PEFT (Parameter-Efficient Fine-Tuning) with LoRA (Low-Rank Adaptation):** An efficient fine-tuning technique that trains small adapter matrices instead of the full model (see the configuration sketch after this list)
- **Google Colab:** Cloud environment for training the model
- **Hugging Face:** Hosting the dataset and final model
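The card does not list the exact LoRA configuration used, so the snippet below is only a minimal sketch of how Whisper Large V3 can be wrapped with PEFT/LoRA adapters; the rank, alpha, dropout, and target modules shown are illustrative assumptions, not the values used for this model.
```python
# Minimal sketch: wrapping Whisper Large V3 with LoRA adapters via PEFT.
# The rank, alpha, dropout, and target modules are assumptions for
# illustration only, not the configuration used to train this model.
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

lora_config = LoraConfig(
    r=32,                                 # adapter rank (assumed)
    lora_alpha=64,                        # LoRA scaling factor (assumed)
    lora_dropout=0.05,                    # dropout on adapter inputs (assumed)
    target_modules=["q_proj", "v_proj"],  # attention projections commonly adapted in Whisper
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the small LoRA matrices are trainable
```
Because only the adapter matrices receive gradients, a setup like this fits on a single Google Colab GPU while the frozen base weights stay untouched.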
## Dataset Preparation
The dataset preparation involved several steps (a code sketch of the key operations follows the list):
1. **Cleaning:** Correcting bad transcriptions and standardizing word spellings.
2. **Audio Processing:** Converting all samples to a 16 kHz sample rate.
3. **Dataset Split:** Creating a training set of 3,312 samples and a test set of 150 samples.
4. **Format Conversion:** Transforming the dataset into the parquet file format.
5. **Uploading:** Publishing the prepared dataset to the Hugging Face Hub.
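The preparation code itself is not part of this card; the snippet below is a rough sketch, using the Hugging Face `datasets` library, of how the resampling, splitting, and upload steps could look. The local data layout and column names are assumptions.
```python
# Rough sketch of steps 2-5 using the `datasets` library.
# The "audiofolder" layout and the "audio" column name are assumptions.
from datasets import load_dataset, Audio

ds = load_dataset("audiofolder", data_dir="darija_clips")["train"]  # assumed local layout

# Audio processing: decode and resample every clip to 16 kHz on access
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

# Dataset split: hold out 150 samples for testing, keep the rest for training
splits = ds.train_test_split(test_size=150, seed=42)

# Format conversion and upload: push_to_hub stores the splits as parquet shards
splits.push_to_hub("Ayoub-Laachir/Darija_Dataset")
```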
## Training Process
The model was fine-tuned using the following parameters (a sketch mapping them onto Hugging Face training arguments appears below):
- **Per device train batch size:** 8
- **Gradient accumulation steps:** 1
- **Learning rate:** 1e-4 (0.0001)
- **Warmup steps:** 200
- **Number of train epochs:** 2
- **Logging and evaluation:** every 50 steps
- **Weight decay:** 0.01
Training progress showed a steady decrease in both training and validation loss over 8000 steps.
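As a reproduction aid, the snippet below sketches how these hyperparameters might map onto Hugging Face `Seq2SeqTrainingArguments`; anything not listed above (output directory, mixed precision, evaluation strategy) is an assumption.
```python
# Sketch: the hyperparameters listed above expressed as Seq2SeqTrainingArguments.
# output_dir, fp16, and the evaluation strategy are assumptions not stated in the card.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-darija-lora",  # assumed
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    learning_rate=1e-4,
    warmup_steps=200,
    num_train_epochs=2,
    weight_decay=0.01,
    logging_steps=50,             # log every 50 steps
    evaluation_strategy="steps",
    eval_steps=50,                # evaluate every 50 steps
    fp16=True,                    # assumed mixed precision on the Colab GPU
)
```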
## Testing and Evaluation
The model was evaluated using:
- **Word Error Rate (WER):** 3.1467%
- **Character Error Rate (CER):** 2.3893%
These metrics demonstrate the model's ability to accurately transcribe Moroccan Darija speech.
The fine-tuned model shows improved handling of Darija-specific words, sentence structure, and overall accuracy.
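The WER and CER figures above are the author's reported results; the snippet below only sketches how such scores can be computed with the `evaluate` library, given paired lists of reference and predicted transcriptions for the test set.
```python
# Sketch: computing WER and CER with the `evaluate` library.
# `references` and `predictions` stand in for the 150 ground-truth
# test-set transcriptions and the model's outputs.
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

references = ["..."]   # ground-truth Darija transcriptions (placeholder)
predictions = ["..."]  # model transcriptions for the same clips (placeholder)

wer = 100 * wer_metric.compute(references=references, predictions=predictions)
cer = 100 * cer_metric.compute(references=references, predictions=predictions)
print(f"WER: {wer:.4f}%  CER: {cer:.4f}%")
```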
## Audio Transcription Script with PEFT Layers
This script demonstrates how to transcribe audio files using the fine-tuned Whisper Large V3 model for Moroccan Darija, incorporating PEFT (Parameter-Efficient Fine-Tuning) layers for improved performance.
### Required Libraries
Before running the script, ensure you have the following libraries installed. You can install them using:
```bash
pip install --upgrade pip
pip install --upgrade transformers accelerate librosa soundfile pydub
pip install peft==0.3.0  # install the PEFT library (prefix commands with "!" in a Colab notebook)
```
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import librosa
import soundfile as sf
from pydub import AudioSegment
from peft import PeftModel, PeftConfig # Import PEFT classes
# Set the device to GPU if available, else use CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
# Configuration for the base Whisper model
base_model_name = "openai/whisper-large-v3" # Base model for Whisper
processor = AutoProcessor.from_pretrained(base_model_name) # Load the processor
# Load your fine-tuned model configuration
model_name = "Ayoub-Laachir/MaghrebVoice_OnlyLoRaLayers" # Fine-tuned model with LoRA layers
peft_config = PeftConfig.from_pretrained(model_name) # Load PEFT configuration
# Load the base model
base_model = AutoModelForSpeechSeq2Seq.from_pretrained(base_model_name).to(device) # Load the base model
# Load the PEFT model
model = PeftModel.from_pretrained(base_model, model_name).to(device) # Load the PEFT model
# Merge the LoRA weights with the base model
model = model.merge_and_unload() # Combine the LoRA weights into the base model
# Configuration for transcription
config = {
    "language": "arabic",   # Language for transcription
    "task": "transcribe",   # Task type
    "chunk_length_s": 30,   # Length of each audio chunk in seconds
    "stride_length_s": 5,   # Overlap between chunks in seconds
}
# Initialize the automatic speech recognition pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,  # Use the merged model
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    chunk_length_s=config["chunk_length_s"],
    stride_length_s=config["stride_length_s"],
    generate_kwargs={"language": config["language"], "task": config["task"]},  # pass language/task to Whisper's generate
)
# Convert audio to 16kHz sampling rate
def convert_audio_to_16khz(input_path, output_path):
    audio, sr = librosa.load(input_path, sr=None)                     # Load the audio file
    audio_16k = librosa.resample(audio, orig_sr=sr, target_sr=16000)  # Resample to 16kHz
    sf.write(output_path, audio_16k, 16000)                           # Save the converted audio
# Format time in HH:MM:SS.milliseconds
def format_time(seconds):
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    seconds = seconds % 60
    return f"{hours:02d}:{minutes:02d}:{seconds:06.3f}"
# Transcribe audio file
def transcribe_audio(audio_path):
    try:
        result = pipe(audio_path, return_timestamps=True)  # Transcribe audio and get timestamps
        return result["chunks"]                            # Return transcription chunks
    except Exception as e:
        print(f"Error transcribing audio: {e}")
        return None
# Main function to execute the transcription process
def main():
    # Specify input and output audio paths (update paths as needed)
    input_audio_path = "/path/to/your/input/audio.mp3"          # Replace with your input audio path
    output_audio_path = "/path/to/your/output/audio_16khz.wav"  # Replace with your output audio path

    # Convert audio to 16kHz
    convert_audio_to_16khz(input_audio_path, output_audio_path)

    # Transcribe the converted audio
    transcription_chunks = transcribe_audio(output_audio_path)

    if transcription_chunks:
        print("WEBVTT\n")  # Print header for WEBVTT format
        for chunk in transcription_chunks:
            start_time = format_time(chunk["timestamp"][0])  # Format start time
            end_time = format_time(chunk["timestamp"][1])    # Format end time
            text = chunk["text"]                             # Get the transcribed text
            print(f"{start_time} --> {end_time}")            # Print time range
            print(f"{text}\n")                               # Print transcribed text
    else:
        print("Transcription failed.")

if __name__ == "__main__":
    main()
```
## Challenges and Future Improvements
### Challenges Encountered
- Diverse spellings of words in Moroccan Darija
- Cleaning and standardizing the dataset
### Future Improvements
- Expand the dataset to include more Darija accents and expressions
- Further fine-tune the model for specific Moroccan regional dialects
- Explore integration into practical applications like voice assistants and transcription services
## Conclusion
This project marks a significant step towards making AI more accessible for Moroccan Arabic speakers. The success of this fine-tuned model highlights the potential for adapting advanced AI technologies to underrepresented languages, serving as a model for similar initiatives in North Africa.