ID G2P LSTM

ID G2P LSTM is a grapheme-to-phoneme model based on the LSTM architecture. This model was trained from scratch on a modified Malay/Indonesian lexicon.

This model was trained using the Keras framework. All training was done on Google Colaboratory. We adapted the LSTM training script provided by the official Keras Code Example.

Model

Model #params Arch. Training/Validation data
id-g2p-lstm 596K LSTM Malay/Indonesian Lexicon

Training Procedure

Model Config
latent_dim: 256
num_encoder_tokens: 28
num_decoder_tokens: 32
max_encoder_seq_length: 24
max_decoder_seq_length: 25
Training Setting
batch_size: 64
optimizer: "rmsprop"
loss: "categorical_crossentropy"
learning_rate: 0.001
epochs: 100

How to Use

Tokenizers
g2id = {
    ' ': 27,
    "'": 0,
    '-': 1,
    'a': 2,
    'b': 3,
    'c': 4,
    'd': 5,
    'e': 6,
    'f': 7,
    'g': 8,
    'h': 9,
    'i': 10,
    'j': 11,
    'k': 12,
    'l': 13,
    'm': 14,
    'n': 15,
    'o': 16,
    'p': 17,
    'q': 18,
    'r': 19,
    's': 20,
    't': 21,
    'u': 22,
    'v': 23,
    'w': 24,
    'y': 25,
    'z': 26
}

p2id = {
    '\t': 0,
    '\n': 1,
    ' ': 31,
    '-': 2,
    'a': 3,
    'b': 4,
    'd': 5,
    'e': 6,
    'f': 7,
    'g': 8,
    'h': 9,
    'i': 10,
    'j': 11,
    'k': 12,
    'l': 13,
    'm': 14,
    'n': 15,
    'o': 16,
    'p': 17,
    'r': 18,
    's': 19,
    't': 20,
    'u': 21,
    'v': 22,
    'w': 23,
    'z': 24,
    'ŋ': 25,
    'ə': 26,
    'ɲ': 27,
    'ʃ': 28,
    'ʒ': 29,
    'ʔ': 30
}
import keras
import numpy as np
from huggingface_hub import from_pretrained_keras

latent_dim = 256
bos_token, eos_token, pad_token = "\t", "\n", " "
num_encoder_tokens, num_decoder_tokens = 28, 32
max_encoder_seq_length, max_decoder_seq_length = 24, 25

model = from_pretrained_keras("bookbot/id-g2p-lstm")

encoder_inputs = model.input[0]
encoder_outputs, state_h_enc, state_c_enc = model.layers[2].output
encoder_states = [state_h_enc, state_c_enc]
encoder_model = keras.Model(encoder_inputs, encoder_states)

decoder_inputs = model.input[1]

decoder_state_input_h = keras.Input(shape=(latent_dim,), name="input_3")
decoder_state_input_c = keras.Input(shape=(latent_dim,), name="input_4")
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_lstm = model.layers[3]
decoder_outputs, state_h_dec, state_c_dec = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs
)
decoder_states = [state_h_dec, state_c_dec]
decoder_dense = model.layers[4]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = keras.Model(
    [decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states
)

def inference(sequence):
    id2p = {v: k for k, v in p2id.items()}

    input_seq = np.zeros(
        (1, max_encoder_seq_length, num_encoder_tokens), dtype="float32"
    )

    for t, char in enumerate(sequence):
        input_seq[0, t, g2id[char]] = 1.0
    input_seq[0, t + 1 :, g2id[pad_token]] = 1.0

    states_value = encoder_model.predict(input_seq)

    target_seq = np.zeros((1, 1, num_decoder_tokens))
    target_seq[0, 0, p2id[bos_token]] = 1.0

    stop_condition = False
    decoded_sentence = ""
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = id2p[sampled_token_index]
        decoded_sentence += sampled_char

        if sampled_char == eos_token or len(decoded_sentence) > max_decoder_seq_length:
            stop_condition = True

        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.0

        states_value = [h, c]
    return decoded_sentence.replace(eos_token, "")

inference("mengembangkannya")

Authors

ID G2P LSTM was trained and evaluated by Ananto Joyoadikusumo, Steven Limcorn, Wilson Wongso. All computation and development are done on Google Colaboratory.

Framework versions

  • Keras 2.8.0
  • TensorFlow 2.8.0
Downloads last month
12
Inference Examples
Inference API (serverless) has been turned off for this model.