Chinese Generation with Masked Sequence-to-Sequence Pretraining

This repository demonstrates a format-controllable Chinese lyric generator, fine-tuned on the Chinese-Lyric-Corpus with a MASS-like masked sequence-to-sequence strategy.

Usage

Initialization

from transformers import MT5ForConditionalGeneration, MT5Tokenizer, Text2TextGenerationPipeline
model_path = "zake7749/chinese-lyrics-generation-mass"
model = MT5ForConditionalGeneration.from_pretrained(model_path)
tokenizer = MT5Tokenizer.from_pretrained(model_path)
pipe = Text2TextGenerationPipeline(model=model, tokenizer=tokenizer)
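If a GPU is available, the pipeline can optionally be placed on it by passing a device index. A minimal sketch, assuming CUDA device 0 is present:

# Optional: run generation on GPU 0 (assumes a CUDA device is available)
pipe = Text2TextGenerationPipeline(model=model, tokenizer=tokenizer, device=0)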

Generate lyrics with a template

Each X in the template is a placeholder for one character to be generated, and 。 marks the end of a line; characters other than X are kept as given (see the notes below for caveats).

template = "風花雪月。像XXXXXXXXXX。日升月落。仿若XXXXXXXXXX。"
lyric = pipe(template, max_length=128, top_p=0.8, do_sample=True, repetition_penalty=1.2)[0]['generated_text']
print(lyric) # 風花雪月。像你在我的夢裡慢慢散落。日升月落。仿若我宿命無法陪隨你走過。


template = "XXXXXXX留戀。XXXXXXX。XXX燈火XXXX。XXX手牽手XXXX。"
lyric = pipe(template, max_length=128, top_p=0.8, do_sample=True, repetition_penalty=1.2)[0]['generated_text']
print(lyric) # 我們說好一生不留戀。我們相約在夏天。我們的燈火相偎相牽。我們說好手牽手到永遠。
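Since decoding is sampled, it can help to draw several candidates for the same template and keep the best one. A minimal sketch, assuming the standard num_return_sequences generation argument is forwarded through the pipeline:

# Sample three candidate lyrics for the same template and print them all
candidates = pipe(template, max_length=128, top_p=0.8, do_sample=True,
                  repetition_penalty=1.2, num_return_sequences=3)
for i, candidate in enumerate(candidates, start=1):
    print(f"{i}. {candidate['generated_text']}")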

Acrostic

template = "分XXXXXX。手XXXXXXXXX。之XXXXXXX。後XXXXXXXXX。"
lyric = pipe(template, max_length=128, top_p=0.8, do_sample=True, repetition_penalty=1.2)[0]['generated_text']
print(lyric) # 分開後激情浮現。手牽著手走過的那一天。之間有太多的危險。後悔一點點,傷心一片。
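An acrostic template can also be built programmatically by placing each character of a phrase at the start of a line and padding with X placeholders. A small helper along these lines (the make_acrostic_template name and the random line lengths are illustrative, not part of the released code):

import random

def make_acrostic_template(phrase, min_len=6, max_len=10):
    # One line per character: the character itself followed by X placeholders
    lines = [ch + "X" * random.randint(min_len, max_len) for ch in phrase]
    return "。".join(lines) + "。"

template = make_acrostic_template("分手之後")
lyric = pipe(template, max_length=128, top_p=0.8, do_sample=True, repetition_penalty=1.2)[0]['generated_text']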

Completion

template = "餘生的光陰牽你手前行。我們共赴一場光年的旅行。XXXXXXXXXX。XXXXXXXXXXXX。"
lyric = pipe(template, max_length=128, top_p=0.8, do_sample=True, repetition_penalty=1.2)[0]['generated_text']
print(lyric) # 餘生的光陰牽你手前行。我們共赴一場光年的旅行。走過的經歷新舊的記憶。都是帶著珍珠淚水無法代替。

Random Generation

import random

num_examples = 5
min_sentence_num, max_sentence_num = 2, 5
min_character_num, max_character_num = 4, 10

for example_id in range(num_examples):
    # Draw a random number of lines, each a run of X placeholders of random length
    num_sentences = random.randint(min_sentence_num, max_sentence_num)
    placeholder_lines = ["X" * random.randint(min_character_num, max_character_num)
                         for _ in range(num_sentences)]

    template = "。".join(placeholder_lines) + "。"
    lyric = pipe(template, max_length=128, top_p=0.8, do_sample=True, repetition_penalty=1.2)[0]['generated_text']
    print(f"{example_id + 1}. {lyric}")

# 1. 愛不愛我。讓自己難過。你的擁抱是那麼多。
# 2. 那一天我們重相見。你已站在那個熟悉的街邊。讓我魂牽夢繞在肩。有你的明天。不再留戀。飛過天邊。
# 3. 誰知我們入骨的相思。深深地被俘虜。苦澀滋味含在茶中傾訴。餘情未了落幕。愛到痛處奢望幸福。
# 4. 為什麼你一直讓我傷心。總覺得對你太著迷。
# 5. 一點可憐。還在期待你會出現。哪怕只是匆匆一眼。

Note

  1. The model is still under training, so it may not always follow the template exactly, especially for long sequence generation.
  2. The model may output "," as a pause within a line, for example 我的愛,像潮水。. If you don't want the pause, add the token id of "," to bad_words_ids (see the sketch after this list).
  3. The model was fine-tuned only on a traditional Chinese corpus, so performance on simplified Chinese is somewhat unstable.
  4. When the given input contains no or few keywords, the model may combine snippets from real-world songs to fit the template.
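A minimal sketch of suppressing the pause mentioned in note 2, assuming "," is tokenized into one or more ordinary ids by the mT5 tokenizer:

# Collect the token ids that "," maps to and forbid them during generation
pause_ids = tokenizer(",", add_special_tokens=False).input_ids
lyric = pipe(template, max_length=128, top_p=0.8, do_sample=True,
             repetition_penalty=1.2, bad_words_ids=[pause_ids])[0]['generated_text']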

Disclaimer

This lyric generator is for academic purposes only. Users of this model should exercise caution and carefully evaluate the results before using them for any commercial or non-academic purpose. We are not liable for any damages or losses resulting from the use or misuse of the model.
