Overview

T5Corrector: a Chinese spelling correction model for phonetic and character-shape errors

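This model is fine-tuned from mengzi-t5-base for text correction. Starting from more than 5 million sentences, a parallel correction corpus was built by replacing words and characters with homophones, near-homophones, and visually similar characters, giving more than 30 million sentence pairs in total. The model was trained for 45,000 steps.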
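
The corpus construction described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the confusion tables `HOMOPHONES` and `SIMILAR_SHAPES`, the helper names `corrupt` and `build_pairs`, and the replacement probability are all hypothetical, since the dictionaries and sampling strategy actually used for training are not given here.

```python
import random

# Hypothetical confusion tables, for illustration only. The dictionaries
# actually used to build the training corpus are not shown in this document.
HOMOPHONES = {"难受": ["蓝瘦"], "明明": ["冥冥"]}  # same or similar pronunciation
SIMILAR_SHAPES = {"愉": ["偷"]}                    # visually similar characters

def corrupt(sentence, rng, prob=1.0):
    """Swap the first occurrence of each confusable entry with probability `prob`."""
    noisy = sentence
    for table in (HOMOPHONES, SIMILAR_SHAPES):
        for correct_form, candidates in table.items():
            if correct_form in noisy and rng.random() < prob:
                noisy = noisy.replace(correct_form, rng.choice(candidates), 1)
    return noisy

def build_pairs(sentences, seed=0):
    """Return (noisy source, clean target) pairs for sequence-to-sequence training."""
    rng = random.Random(seed)
    return [(corrupt(s, rng), s) for s in sentences]

pairs = build_pairs(["今天天气不太好,我的心情也不是很愉快"])
for source, target in pairs:
    print(source, "->", target)
```

Each clean sentence serves as the target and its corrupted copy as the source, so the model learns to map noisy text back to the original.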

GitHub project page

Load the model:

# Load the tokenizer and the model
from transformers import T5Tokenizer, T5ForConditionalGeneration
pretrained = "Maciel/T5Corrector-base-v1"
tokenizer = T5Tokenizer.from_pretrained(pretrained)
model = T5ForConditionalGeneration.from_pretrained(pretrained)

Run inference with the model:

import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

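def correct(text, max_length):
    """Return the corrected text; inputs longer than max_length tokens are truncated."""
    model_inputs = tokenizer(text,
                             max_length=max_length,
                             truncation=True,
                             return_tensors="pt").to(device)
    # do_sample=True together with num_beams=5 samples during beam search,
    # so repeated calls may return different outputs; set do_sample=False
    # for deterministic beam search.
    output = model.generate(**model_inputs,
                            num_beams=5,
                            no_repeat_ngram_size=4,
                            do_sample=True,
                            early_stopping=True,
                            max_length=max_length,
                            return_dict_in_generate=True,
                            output_scores=True)
    # Decode the top-ranked sequence and drop special tokens such as </s>
    pred_output = tokenizer.batch_decode(output.sequences, skip_special_tokens=True)[0]
    return pred_output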

text = "听到这个消息,心情真的蓝瘦"
correction = correct(text, max_length=32)
print(correction)
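
Because `correct` truncates its input to `max_length` tokens, longer passages need to be split before correction. The sketch below is one possible way to do that, not part of the original project: `split_sentences` and `correct_long` are hypothetical helpers, and the limit of 30 characters per chunk is an assumption chosen to stay under `max_length=32`, since character count only approximates token count.

```python
import re

def split_sentences(text, max_chars=30):
    """Split after sentence-ending punctuation, then hard-wrap overlong pieces."""
    # The lookbehind keeps each punctuation mark attached to its sentence.
    pieces = [p for p in re.split(r"(?<=[。!?;!?;])", text) if p]
    chunks = []
    for piece in pieces:
        for start in range(0, len(piece), max_chars):
            chunks.append(piece[start:start + max_chars])
    return chunks

def correct_long(text, correct_fn, max_chars=30, max_length=32):
    """Correct each chunk with `correct_fn` and join the results in order."""
    chunks = split_sentences(text, max_chars)
    return "".join(correct_fn(chunk, max_length=max_length) for chunk in chunks)

print(split_sentences("今天天气不太好。我的心情也不是很偷快!"))
```

With the `correct` function defined above, a long passage could then be handled as `correct_long(long_text, correct)`. Hard-wrapping in the middle of a sentence can cut a word in two, so splitting on punctuation first is the safer path.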

Examples

Example 1:
input: 听到这个消息,心情真的蓝瘦
output: 听到这个消息,心情真的难受

Example 2:
input: 脑子有点胡涂了,这道题冥冥学过还没有做出来
output: 脑子有点糊涂了,这道题明明学过还没有做出来

Example 3:
input: 今天天气不太好,我的心情也不是很偷快
output: 今天天气不太好,我的心情也不是很愉快
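
To see exactly which characters the model changed in cases like these, the input and output strings can be compared with Python's standard `difflib` module. This is a minimal sketch; `list_corrections` is a hypothetical helper name and is not part of the model or its repository.

```python
import difflib

def list_corrections(source, corrected):
    """Return (position, original, replacement) for every changed span."""
    matcher = difflib.SequenceMatcher(None, source, corrected, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # i1 is the character offset of the change in the source string
            edits.append((i1, source[i1:i2], corrected[j1:j2]))
    return edits

print(list_corrections("听到这个消息,心情真的蓝瘦", "听到这个消息,心情真的难受"))
# [(11, '蓝瘦', '难受')]
```

An empty list means the model left the sentence unchanged.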