---
license: apache-2.0
datasets:
- Ayoub-Laachir/Darija_Dataset
language:
- ary
metrics:
- wer
- cer
base_model:
- openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition
---

# Model Card for Fine-tuned Whisper Large V3 (Moroccan Darija)

## Model Overview

**Model Name:** Whisper Large V3 (Fine-tuned for Moroccan Darija)

**Author:** Ayoub Laachir

**License:** apache-2.0

**Repository:** [Ayoub-Laachir/MaghrebVoice](https://huggingface.co/Ayoub-Laachir/MaghrebVoice)

**Dataset:** [Ayoub-Laachir/Darija_Dataset](https://huggingface.co/datasets/Ayoub-Laachir/Darija_Dataset)

## Description

This model is a fine-tuned version of OpenAI’s Whisper Large V3, adapted to recognize and transcribe Moroccan Darija, an Arabic dialect shaped by Berber, French, and Spanish influences. The project aims to improve technological accessibility for millions of Moroccans and to serve as a blueprint for similar work on other underrepresented languages.

## Technologies Used

- **Whisper Large V3:** OpenAI’s state-of-the-art speech recognition model
- **PEFT (Parameter-Efficient Fine-Tuning) with LoRA (Low-Rank Adaptation):** an efficient fine-tuning technique that trains small adapter matrices instead of the full model (see the sketch below)
- **Google Colab:** cloud environment used for training the model
- **Hugging Face:** hosting for the dataset and the final model
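
Below is a minimal sketch of how LoRA adapters are typically attached to Whisper with PEFT before training. The rank, alpha, dropout, and target modules shown are common illustrative choices, not necessarily the exact settings used for this model.

```python
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load the base Whisper model
base_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

# LoRA configuration (values are illustrative assumptions, not the project's exact settings)
lora_config = LoraConfig(
    r=32,                                 # rank of the low-rank update matrices
    lora_alpha=64,                        # scaling factor for the LoRA updates
    lora_dropout=0.05,                    # dropout applied to the LoRA layers
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
)

# Wrap the base model so that only the LoRA adapter weights are trainable
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```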
## Dataset Preparation

The dataset preparation involved several steps (a code sketch of steps 2–5 follows the list):

1. **Cleaning:** Correcting faulty transcriptions and standardizing word spellings.
2. **Audio Processing:** Converting all samples to a 16 kHz sample rate.
3. **Dataset Split:** Creating a training set of 3,312 samples and a test set of 150 samples.
4. **Format Conversion:** Converting the dataset to the Parquet file format.
5. **Uploading:** Pushing the prepared dataset to the Hugging Face Hub.
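
The following is a minimal sketch of steps 2–5 using the Hugging Face `datasets` library. The loading call, local path, and column name are assumptions for illustration; the actual preparation script is not part of this card.

```python
from datasets import load_dataset, DatasetDict, Audio

# Load the cleaned dataset (source path/format is an assumption for illustration)
dataset = load_dataset("audiofolder", data_dir="./darija_clips")["train"]

# Step 2: resample every clip to the 16 kHz rate expected by Whisper
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

# Step 3: split into training and test sets (150 test samples, as described above)
splits = dataset.train_test_split(test_size=150, seed=42)
dataset = DatasetDict({"train": splits["train"], "test": splits["test"]})

# Steps 4–5: push to the Hugging Face Hub; the data is stored as Parquet files
dataset.push_to_hub("Ayoub-Laachir/Darija_Dataset")
```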
## Training Process

The model was fine-tuned using the following parameters:

- **Per device train batch size:** 8
- **Gradient accumulation steps:** 1
- **Learning rate:** 1e-4 (0.0001)
- **Warmup steps:** 200
- **Number of train epochs:** 2
- **Logging and evaluation:** every 50 steps
- **Weight decay:** 0.01

Training progress showed a steady decrease in both training and validation loss over 8000 steps.
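
These hyperparameters map onto a Hugging Face `Seq2SeqTrainingArguments` configuration roughly as follows. This is a sketch rather than the exact training script; the output directory and precision flag are assumptions.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-darija-lora",  # assumed output path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    learning_rate=1e-4,
    warmup_steps=200,
    num_train_epochs=2,
    weight_decay=0.01,
    logging_steps=50,
    eval_strategy="steps",        # called `evaluation_strategy` in older transformers releases
    eval_steps=50,
    fp16=True,                    # assumed; mixed precision is typical on Colab GPUs
    remove_unused_columns=False,  # commonly required when training a PEFT-wrapped Whisper model
)
```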
## Testing and Evaluation

The model was evaluated using the following metrics:

- **Word Error Rate (WER):** 3.1467%
- **Character Error Rate (CER):** 2.3893%

These metrics demonstrate the model's ability to transcribe Moroccan Darija speech accurately.
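
Both metrics can be computed with the Hugging Face `evaluate` library. The snippet below is a sketch of the computation; the prediction and reference lists would come from running the model on the 150-sample test split.

```python
# pip install evaluate jiwer
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Hypothetical placeholder lists; in practice these are the model outputs and
# ground-truth transcriptions for the test split.
predictions = ["..."]
references = ["..."]

wer = 100 * wer_metric.compute(predictions=predictions, references=references)
cer = 100 * cer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.4f}%, CER: {cer:.4f}%")
```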
Compared with the base model, the fine-tuned model shows improved handling of Darija-specific words and sentence structure, and better overall transcription accuracy.

## Audio Transcription Script with PEFT Layers

This script demonstrates how to transcribe audio files using the fine-tuned Whisper Large V3 model for Moroccan Darija by loading the PEFT (Parameter-Efficient Fine-Tuning) LoRA layers on top of the base model.

### Required Libraries

Before running the script, ensure the following libraries are installed (in a Colab or Jupyter notebook, prefix each command with `!`):

```bash
pip install --upgrade pip
pip install --upgrade transformers accelerate librosa soundfile pydub
pip install peft==0.3.0  # PEFT library for loading the LoRA adapter
```

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
import librosa
import soundfile as sf
from pydub import AudioSegment
from peft import PeftModel, PeftConfig  # PEFT classes for loading the LoRA adapter

# Set the device to GPU if available, else use CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Configuration for the base Whisper model
base_model_name = "openai/whisper-large-v3"                  # Base Whisper model
processor = AutoProcessor.from_pretrained(base_model_name)   # Load the processor (tokenizer + feature extractor)

# Load the fine-tuned adapter configuration
model_name = "Ayoub-Laachir/MaghrebVoice_OnlyLoRaLayers"     # Repository containing only the LoRA layers
peft_config = PeftConfig.from_pretrained(model_name)         # Load the PEFT configuration

# Load the base model in the same dtype used by the pipeline
base_model = AutoModelForSpeechSeq2Seq.from_pretrained(base_model_name, torch_dtype=torch_dtype).to(device)

# Attach the LoRA adapter to the base model
model = PeftModel.from_pretrained(base_model, model_name).to(device)

# Merge the LoRA weights into the base model so it behaves like a regular Whisper model
model = model.merge_and_unload()

# Configuration for transcription
config = {
    "language": "arabic",    # Language for transcription
    "task": "transcribe",    # Task type
    "chunk_length_s": 30,    # Length of each audio chunk in seconds
    "stride_length_s": 5,    # Overlap between chunks in seconds
}

# Initialize the automatic speech recognition pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,  # Use the merged model
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    chunk_length_s=config["chunk_length_s"],
    stride_length_s=config["stride_length_s"],
)

# Convert audio to a 16 kHz sampling rate
def convert_audio_to_16khz(input_path, output_path):
    audio, sr = librosa.load(input_path, sr=None)                     # Load the audio file
    audio_16k = librosa.resample(audio, orig_sr=sr, target_sr=16000)  # Resample to 16 kHz
    sf.write(output_path, audio_16k, 16000)                           # Save the converted audio

# Format time as HH:MM:SS.milliseconds
def format_time(seconds):
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    seconds = seconds % 60
    return f"{hours:02d}:{minutes:02d}:{seconds:06.3f}"

# Transcribe an audio file
def transcribe_audio(audio_path):
    try:
        # Pass the language/task settings through to Whisper's generation step
        result = pipe(
            audio_path,
            return_timestamps=True,
            generate_kwargs={"language": config["language"], "task": config["task"]},
        )
        return result["chunks"]  # Return transcription chunks with timestamps
    except Exception as e:
        print(f"Error transcribing audio: {e}")
        return None

# Main function to execute the transcription process
def main():
    # Specify input and output audio paths (update paths as needed)
    input_audio_path = "/path/to/your/input/audio.mp3"          # Replace with your input audio path
    output_audio_path = "/path/to/your/output/audio_16khz.wav"  # Replace with your output audio path

    # Convert audio to 16 kHz
    convert_audio_to_16khz(input_audio_path, output_audio_path)

    # Transcribe the converted audio
    transcription_chunks = transcribe_audio(output_audio_path)

    if transcription_chunks:
        print("WEBVTT\n")  # Print the WebVTT header
        for chunk in transcription_chunks:
            start_time = format_time(chunk["timestamp"][0])  # Format start time
            end_time = format_time(chunk["timestamp"][1])    # Format end time
            text = chunk["text"]                             # Get the transcribed text
            print(f"{start_time} --> {end_time}")            # Print the cue time range
            print(f"{text}\n")                               # Print the transcribed text
    else:
        print("Transcription failed.")

if __name__ == "__main__":
    main()
```
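
Run with the placeholder paths replaced by real files, the script prints WebVTT-style cues to standard output, along these lines (illustrative placeholders, not real model output):

```text
WEBVTT

00:00:00.000 --> 00:00:05.240
<first transcribed Darija segment>

00:00:05.240 --> 00:00:11.180
<second transcribed Darija segment>
```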
## Challenges and Future Improvements

### Challenges Encountered

- Inconsistent spellings of the same Moroccan Darija words across transcriptions
- Cleaning and standardizing the dataset

### Future Improvements

- Expand the dataset to include more Darija accents and expressions
- Further fine-tune the model for specific Moroccan regional dialects
- Explore integration into practical applications like voice assistants and transcription services

## Conclusion

This project marks a significant step toward making AI more accessible to Moroccan Arabic speakers. The success of this fine-tuned model highlights the potential of adapting advanced AI technologies to underrepresented languages, and it can serve as a template for similar initiatives across North Africa.
|