techiaith/wav2vec2-base-cy

@@ -1,20 +1,56 @@
 ---
 license: apache-2.0
 language:
- - cy
 tags:
- - speech
 ---
-# Pre-training wav2vec2 models for Welsh speech recognition
-At the moment, the best Welsh speech recognition models are achieved from fine-tuning https://huggingface.co/facebook/wav2vec2-large-xlsr-53 and https://huggingface.co/facebook/wav2vec2-xls-r-1b models by Facebook/Meta AI.
-This model is experimental in investigating pretraining better models with more Welsh language speech that could lower WER scores even further in subsequently fine-tuned models. The work draws heavily on resources and documentation from the HuggingFace examples:
 https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-pretraining
-This base model has been pre-trained with only approximately 4000 hours of Welsh and English speech collected from various channels on YouTube. The corpus contains only 25% Welsh language speech. English language speech contains Welsh-accented English speech and therefore has been retained for pre-training.
-Until we have collected many more hours of speech, this pre-trained model will be of limited use for fine-tuning any useful downstream tasks.

 ---
 license: apache-2.0
 language:
+- cy
 tags:
+- speech
+- pre-training
+- wav2vec2
 ---
+# Better Pre-trained wav2vec2 models for Welsh Speech Recognition
+At the moment, the best Welsh speech recognition wav2vec2 models are obtained by
+fine-tuning the [XLSR-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) and
+[xls-r-1b](https://huggingface.co/facebook/wav2vec2-xls-r-1b) pre-trained models
+by Facebook/Meta AI.
+This model is an experiment in building better pre-trained models with more
+Welsh language speech, which could in turn lower WER scores even further in subsequently
+fine-tuned models. __It is of very limited use for fine-tuning on any useful downstream
+task such as speech recognition__.
+## First Attempts with Self-Supervised Learning
+Previous attempts drew heavily on the resources and documentation from the HuggingFace examples
+for creating pre-trained wav2vec2 models from scratch:
 https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-pretraining
+We used only 4000 hours of Welsh and English speech audio collected from various channels on
+YouTube. The training set contained approximately 25% Welsh speech and 75%
+English language speech. The English language data, however, contains examples of Welsh-accented
+English speech and was therefore retained for pre-training.
+The results of our self-supervised attempts can be accessed from revisions `22.10` and `24.03` of
+this model repository.
+## Attempting to Fine-tune Meta AI models with a very weak data set
+The latest attempt investigates reverting to fine-tuning Meta AI's pre-trained models (xls-r-1b)
+with the YouTube speech data, which was transcribed automatically with the best Whisper-based ASR
+models for Welsh and English: https://huggingface.co/techiaith/whisper-large-v3-ft-cv-cy-en
+The transcriptions are of course not entirely correct, which is why we have termed it a very weak data
+set. But since it is a much larger collection of speech than [any other dataset for
+Welsh](https://huggingface.co/collections/techiaith/speech-recognition-datasets-672df8ffb3f7da8ed8294ce2),
+we nevertheless wanted to experiment with what impact (if any) the speech audio may still have on
+the wav2vec2 encoders.
+## Conclusion
+As mentioned above, the model is not useful for any practical purpose. We have identified many issues
+and limitations, for example the quality of the YouTube data itself and, in particular, that of the
+automatic transcriptions. Further work is required to confirm whether the data and/or approaches attempted
+thus far are viable and feasible.
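
The model card above says the earlier self-supervised results live in revisions `22.10` and `24.03` of this repository. As a minimal sketch of how such a revision might be inspected, the snippet below passes the `revision` argument to `from_pretrained` in the `transformers` library. The helper name `load_revision` is hypothetical, and it is an assumption (not confirmed by the card) that each revision contains weights that `AutoModel` can load.

```python
"""Sketch: load an earlier self-supervised checkpoint by repository revision."""

REPO_ID = "techiaith/wav2vec2-base-cy"
# Revision names quoted in the model card text.
REVISIONS = ("22.10", "24.03")


def load_revision(revision: str):
    """Download and return the model stored at the given revision.

    Hypothetical helper. Assumption: the revision holds weights in a
    format that transformers' AutoModel can load.
    """
    if revision not in REVISIONS:
        raise ValueError(
            f"unknown revision {revision!r}; expected one of {REVISIONS}"
        )
    # Imported lazily so the constants above can be used without
    # transformers installed; the download only happens on this call.
    from transformers import AutoModel

    return AutoModel.from_pretrained(REPO_ID, revision=revision)


if __name__ == "__main__":
    model = load_revision("24.03")
    print(type(model).__name__)
```

Pinning the revision explicitly keeps the experiment reproducible even if the default branch of the repository changes later. As the card stresses, these checkpoints are experimental and of very limited use for downstream fine-tuning.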