leenag
/

corpus_dependent

leenag committed about 24 hours ago

Commit

b09e54b

1 Parent(s): a3d7b00

Create README.md

Files changed (1)

README.md +58 -0

README.md ADDED

	@@ -0,0 +1,58 @@

+# Corpus-Dependent Acoustic-to-Articulatory Inversion Model
+## Model Overview
+This model performs corpus-dependent Acoustic-to-Articulatory Inversion (AAI), predicting articulatory trajectories from acoustic features. It is trained on simultaneously recorded speech and Electromagnetic Articulography (EMA) data from some speakers of a corpus and tested on other speakers of the same corpus. The network is built with PyTorch and uses BiLSTM (Bidirectional Long Short-Term Memory) layers to capture temporal dependencies in the acoustic data. A CNN smooths the predicted trajectories so that they look more natural. The input consists of multi-frame MFCC (Mel-Frequency Cepstral Coefficients) features, and the output is a set of predicted articulatory positions over time.
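The "multi-frame MFCC" input mentioned above can be illustrated with a small context-stacking sketch. The README does not say how the 429 input dimensions are composed; the split used here (39 coefficients per frame, 5 neighbouring frames on each side, 39 × 11 = 429) is an assumption chosen only because it matches the stated size, and `stack_context` is a hypothetical helper name.

```python
import numpy as np

def stack_context(features: np.ndarray, context: int = 5) -> np.ndarray:
    """Concatenate each frame with `context` neighbours on each side.

    features: (n_frames, n_coeffs) array of per-frame acoustic features.
    Edges are padded by repeating the first and last frame.
    Returns an array of shape (n_frames, n_coeffs * (2 * context + 1)).
    """
    n_frames, _ = features.shape
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    # Window i holds the frames shifted by (i - context) relative to the centre.
    windows = [padded[i:i + n_frames] for i in range(2 * context + 1)]
    return np.concatenate(windows, axis=1)

# Assumed layout: 39 coefficients per frame, 11-frame window -> 429 dimensions.
mfcc = np.random.randn(200, 39)
stacked = stack_context(mfcc, context=5)
print(stacked.shape)  # (200, 429)
```

The centre block of each stacked row is the original frame, so the network still sees the current frame alongside its temporal neighbours.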
+## Intended Use
+The model is designed for speech researchers and professionals who are interested in understanding the relationship between speech acoustics and articulatory movements. It can be applied in linguistic research, speech synthesis, and speech therapy.
+### Use Cases
+1. **Speech Analysis:** To study how different speech sounds relate to articulatory positions.
+2. **Speech Synthesis:** As a part of systems generating speech from articulatory features.
+3. **Speech Therapy:** Analyzing articulatory trajectories for individuals with speech disorders.
+## Dependencies
+- python 3.7.3
+- numpy 1.16.3
+- pytorch 1.1.0
+- scipy 1.2.1
+- librosa 0.6.3
+- matplotlib
+- psutil
+## Training Datasets
+We trained on three speakers of the MOCHA dataset and tested on one held-out speaker. The training dataset is:
+- MOCHA: http://data.cstr.ed.ac.uk/mocha/
+## Model Architecture
+- **Hidden Dimension:** 400
+- **Input Dimension:** 429 (acoustic features per frame)
+- **Output Dimension:** 16 (articulatory trajectories)
+- **Batch Size:** 8
+- **BiLSTM Layers:** 2 bidirectional LSTM layers
+- **CNN:** 1D CNN
+- **Linear Layers:** Input and output layers with batch normalization
+
+The architecture is designed to accommodate smoothing of the articulatory trajectories during preprocessing, with customizable cutoff frequencies.
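One way the components listed above could fit together is sketched below in PyTorch. This is not the repository's code: the layer ordering, the placement of batch normalization, the ReLU activation, the depthwise convolution, and the kernel size are all assumptions, and `AAISketch` is a hypothetical class name. Only the dimensions (429, 400, 16), the two BiLSTM layers, and the batch size of 8 come from the README.

```python
import torch
import torch.nn as nn

class AAISketch(nn.Module):
    """Linear + BatchNorm -> 2-layer BiLSTM -> Linear -> 1D CNN smoothing."""

    def __init__(self, input_dim=429, hidden_dim=400, output_dim=16, kernel_size=5):
        super().__init__()
        self.input_layer = nn.Linear(input_dim, hidden_dim)
        self.input_bn = nn.BatchNorm1d(hidden_dim)  # placement is an assumption
        self.bilstm = nn.LSTM(hidden_dim, hidden_dim, num_layers=2,
                              bidirectional=True, batch_first=True)
        self.output_layer = nn.Linear(2 * hidden_dim, output_dim)
        # Depthwise convolution: one smoothing filter per articulatory channel.
        # Odd kernel size with matching padding keeps the sequence length.
        self.smooth = nn.Conv1d(output_dim, output_dim, kernel_size,
                                padding=kernel_size // 2, groups=output_dim)

    def forward(self, x):
        # x: (batch, time, input_dim)
        h = torch.relu(self.input_layer(x))
        # BatchNorm1d expects (batch, channels, time), so transpose around it.
        h = self.input_bn(h.transpose(1, 2)).transpose(1, 2)
        h, _ = self.bilstm(h)                 # (batch, time, 2 * hidden_dim)
        y = self.output_layer(h)              # (batch, time, output_dim)
        return self.smooth(y.transpose(1, 2)).transpose(1, 2)

model = AAISketch()
model.eval()
with torch.no_grad():
    out = model(torch.randn(8, 50, 429))  # batch of 8 utterances, 50 frames each
print(out.shape)  # torch.Size([8, 50, 16])
```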
+## Model Training
+- **Optimizer:** Adam
+- **Loss Functions:** Combination of RMSE and Pearson correlation to capture both error minimization and correlation maximization.
+- **Training Procedure:** Early stopping based on the validation loss was used to prevent overfitting, and the learning rate was adjusted whenever the validation loss increased.
+- **Epochs:** Trained over multiple epochs with batch updates and dynamic learning rate adjustments.
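A combined RMSE and Pearson loss of the kind described above might look like the following sketch. The README does not give the weighting or the exact formulation, so the weight `alpha`, the `1 - PCC` form, and the function name `rmse_pearson_loss` are assumptions for illustration.

```python
import torch

def rmse_pearson_loss(pred, target, alpha=0.5, eps=1e-8):
    """Weighted sum of RMSE and (1 - Pearson correlation).

    pred, target: (batch, time, channels) tensors.
    The correlation is computed along time for every utterance and
    channel, then averaged. `alpha` is a hypothetical weighting.
    """
    rmse = torch.sqrt(torch.mean((pred - target) ** 2) + eps)
    pred_c = pred - pred.mean(dim=1, keepdim=True)
    target_c = target - target.mean(dim=1, keepdim=True)
    cov = (pred_c * target_c).sum(dim=1)
    denom = torch.sqrt((pred_c ** 2).sum(dim=1) * (target_c ** 2).sum(dim=1) + eps)
    pcc = (cov / denom).mean()
    # Minimizing (1 - pcc) maximizes the correlation.
    return alpha * rmse + (1.0 - alpha) * (1.0 - pcc)

target = torch.randn(4, 30, 16)
loss_perfect = rmse_pearson_loss(target, target)
loss_inverted = rmse_pearson_loss(-target, target)
print(loss_perfect.item(), loss_inverted.item())
```

A perfect prediction gives a loss close to zero, while a sign-flipped prediction is penalized by both terms.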
+## Evaluation
+The model was evaluated on a separate test set, using RMSE (Root Mean Square Error) and the Pearson correlation coefficient (PCC) to quantify performance. To test the model from the command line:
+
+    python test.py "mjjn0" "cross_dependent"
+
+Here `"mjjn0"` is the MOCHA test speaker and `cross_dependent` is the model name.
+The evaluation results are:
+- **RMSE:** 0.904
+- **PCC:** 0.721
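For reference, the two metrics reported above can be computed per articulatory channel as in this NumPy sketch. It is not taken from `test.py`; the function names and the synthetic data are hypothetical, and how the repository averages over channels and utterances is not stated in the README.

```python
import numpy as np

def rmse_per_channel(pred, target):
    """RMSE along time for each channel; inputs are (n_frames, n_channels)."""
    return np.sqrt(np.mean((pred - target) ** 2, axis=0))

def pcc_per_channel(pred, target):
    """Pearson correlation along time for each channel."""
    pred_c = pred - pred.mean(axis=0)
    target_c = target - target.mean(axis=0)
    num = (pred_c * target_c).sum(axis=0)
    den = np.sqrt((pred_c ** 2).sum(axis=0) * (target_c ** 2).sum(axis=0))
    return num / den

# Synthetic example: 500 frames, 16 articulatory channels.
rng = np.random.default_rng(0)
target = rng.standard_normal((500, 16))
pred = target + 0.1 * rng.standard_normal((500, 16))

rmse = rmse_per_channel(pred, target)
pcc = pcc_per_channel(pred, target)
print(rmse.mean(), pcc.mean())
```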