
About

This model takes a single word as input and splits it into syllables. I pre-trained a T5 model on a syllable dataset that I scraped from the internet, using a custom tokenizer that is effectively character-based. It works reasonably well in my limited tests, but the output may be unpredictable for inputs that contain multiple words, numbers, or non-English characters. It does, however, handle trailing punctuation.
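
Given the limitations above, it can help to screen inputs before sending them to the model. The following is a minimal sketch of such a check, not part of the model or of `transformers`; the helper name `looks_like_single_english_word` is hypothetical, and the rule it applies (ASCII letters only, trailing punctuation allowed) is a deliberately conservative assumption that also rejects words with apostrophes or hyphens.

```python
import string


def looks_like_single_english_word(text):
    """Heuristic pre-check (hypothetical helper, not part of the model)."""
    # Trailing punctuation is reported to be handled, so ignore it here.
    core = text.rstrip(string.punctuation)
    # Require at least one character, all of them ASCII letters.
    # This rejects spaces, digits, and non-English characters.
    return core.isascii() and core.isalpha()


print(looks_like_single_english_word("syllabizer"))  # True
print(looks_like_single_english_word("vowel,"))      # True
print(looks_like_single_english_word("two words"))   # False
print(looks_like_single_english_word("abc123"))      # False
```

Inputs that fail the check are not necessarily wrong; they simply fall outside what the description above says was tested.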

Calling the Model

from transformers import AutoTokenizer, T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained('imjeffhi/syllabizer')
tokenizer = AutoTokenizer.from_pretrained('imjeffhi/syllabizer')

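def generate_output(word):
    # Tokenize a single word and return PyTorch tensors.
    tokens = tokenizer(word, return_tensors='pt')
    # Deterministic decoding (no sampling); max_length caps the generated sequence length.
    output = model.generate(**tokens, do_sample=False, max_length=30, early_stopping=True)[0]
    # Decode the token ids back to text, dropping special tokens.
    return tokenizer.decode(output, skip_special_tokens=True)

syllables = generate_output('syllabizer')
print(syllables)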

The model returns the syllables as a single string, separated by spaces:

syl la biz er
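
Because the output is a space-separated string, it is easy to post-process with plain Python. The sketch below is one possible way to do that; `to_syllable_list` and `hyphenate` are hypothetical helper names, and the input string is the example output shown above rather than a fresh model call.

```python
def to_syllable_list(generated_text):
    """Split the model's space-separated output into a list of syllables."""
    return generated_text.split()


def hyphenate(generated_text, sep="-"):
    """Join the syllables with a separator of your choice."""
    return sep.join(to_syllable_list(generated_text))


spaced = "syl la biz er"  # example output from above

print(to_syllable_list(spaced))       # ['syl', 'la', 'biz', 'er']
print(len(to_syllable_list(spaced)))  # 4
print(hyphenate(spaced))              # syl-la-biz-er
```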

Using pipelines to syllabize sentences

To syllabize a whole sentence or paragraph, split it into words and pass them to a pipeline as a batch. The following code also converts each result into a list of syllables:

from transformers import pipeline

syllabizer_pipe = pipeline('text2text-generation', model='imjeffhi/syllabizer', tokenizer='imjeffhi/syllabizer')

sentence = "A unit of spoken language consisting of a single uninterrupted sound formed by a vowel, diphthong, or syllabic consonant alone, or by any of these sounds preceded, followed, or surrounded by one or more consonants."
# Split on whitespace; punctuation stays attached to its word.
words = sentence.split()
output = syllabizer_pipe(words, batch_size=len(words), do_sample=False, max_length=30, early_stopping=True)

# Pair each input word with its list of syllables.
[{word: gen_text['generated_text'].split()} for word, gen_text in zip(words, output)]

This outputs the following:

[{'A': ['a']},
 {'unit': ['u', 'nit']},
 {'of': ['of']},
 {'spoken': ['spok', 'en']},
 {'language': ['lan', 'guage']},
 {'consisting': ['con', 'sis', 'ting']},
 {'of': ['of']},
 {'a': ['a']},
 {'single': ['sing', 'le']},
 {'uninterrupted': ['un', 'in', 'ter', 'rupt', 'ed']},
 {'sound': ['sound']},
 {'formed': ['formed']},
 {'by': ['by']},
 {'a': ['a']},
 {'vowel,': ['vow', 'el']},
 {'diphthong,': ['diph', 'thong']},
 {'or': ['or']},
 {'syllabic': ['syl', 'la', 'bic']},
 {'consonant': ['con', 'so', 'nant']},
 {'alone,': ['a', 'lone']},
 {'or': ['or']},
 {'by': ['by']},
 {'any': ['an', 'y']},
 {'of': ['of']},
 {'these': ['these']},
 {'sounds': ['sounds']},
 {'preceded,': ['pre', 'ced', 'ed']},
 {'followed,': ['fol', 'lowed']},
 {'or': ['or']},
 {'surrounded': ['sur', 'round', 'ed']},
 {'by': ['by']},
 {'one': ['one']},
 {'or': ['or']},
 {'more': ['more']},
 {'consonants.': ['con', 'so', 'nants']}]
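
Note that the trailing punctuation on inputs such as 'vowel,' does not appear in the syllable lists. If you want a hyphenated sentence with the punctuation restored, one option is to copy it back from the original words. The sketch below assumes exactly that approach; `rebuild_sentence` is a hypothetical helper, and the sample data is copied from the output above instead of calling the model.

```python
import string


def rebuild_sentence(words, syllable_lists, sep="-"):
    """Join syllables per word and re-attach each word's trailing punctuation."""
    rebuilt = []
    for word, syllables in zip(words, syllable_lists):
        # Whatever rstrip removes is the trailing punctuation of the input word.
        stripped = word.rstrip(string.punctuation)
        trailing = word[len(stripped):]
        rebuilt.append(sep.join(syllables) + trailing)
    return " ".join(rebuilt)


# A slice of the example above: input words and their syllable lists.
words = ["by", "a", "vowel,", "diphthong,", "or", "syllabic", "consonant"]
syllable_lists = [["by"], ["a"], ["vow", "el"], ["diph", "thong"],
                  ["or"], ["syl", "la", "bic"], ["con", "so", "nant"]]

print(rebuild_sentence(words, syllable_lists))
# by a vow-el, diph-thong, or syl-la-bic con-so-nant

# Total syllable count for the slice.
print(sum(len(s) for s in syllable_lists))  # 13
```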