This model removes noise from text. It is useful when text comes out garbled after loading, for example after extraction from PDF files.

The model was trained on Polish texts to which noise was added automatically. allegro/plt5-base was used as the base model.
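
The exact noising procedure is not described here. Purely as an illustration, the sketch below assumes noise of the kind visible in the example input further down (stray symbols, spurious spaces, flipped letter case). The names add_noise, strip_noise and NOISE_CHARS, the noise alphabet and the rate are all hypothetical and are not taken from the actual training pipeline.

```python
import random

# Hypothetical noise alphabet; the symbols actually used for training are not documented.
NOISE_CHARS = "|-^#@!&*"


def add_noise(text, rate=0.3, seed=0):
    """Illustrative noising: flip letter case, insert stray symbols and spaces."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    out = []
    for ch in text:
        if ch.isalpha() and rng.random() < rate / 3:
            ch = ch.swapcase()  # occasional case flip
        out.append(ch)
        if rng.random() < rate:
            out.append(rng.choice(NOISE_CHARS))  # stray symbol
        if rng.random() < rate:
            out.append(" ")  # spurious space
    return "".join(out)


def strip_noise(text):
    """Remove the inserted symbols and all spaces (used only as a sanity check)."""
    return "".join(c for c in text if c not in NOISE_CHARS and c != " ")


clean = "Astronomia jest jedną z najstarszych nauk."
noisy = add_noise(clean)
print(noisy)
```

Pairs of (noisy, clean) strings produced this way could serve as training examples for a sequence-to-sequence model; whether the authors generated their data like this is an assumption.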

Model input

The model input must be prefixed with the tag "denoise:". For example, given the text:

As | -Tron^# om ia je@st je!d &*ną z na -J s | AA ta rsZy ch n a u   k.

then input to the model must be constructed as follows:

denoise: As | -Tron^# om ia je@st je!d &*ną z na -J s | AA ta rsZy ch n a u   k.
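
The prefixing rule above can be wrapped in a small helper so that the tag is never forgotten or added twice. This is a minimal sketch; build_model_input and PREFIX are hypothetical names, and stripping surrounding whitespace is an assumption made for convenience.

```python
PREFIX = "denoise: "  # task tag required by the model, as described above


def build_model_input(text):
    """Prepend the task tag unless the text already starts with it."""
    stripped = text.strip()  # leading/trailing whitespace carries no information here
    if stripped.startswith(PREFIX.strip()):
        return stripped  # already tagged, leave as is
    return PREFIX + stripped


print(build_model_input("je@st je!d &*ną"))
```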

Sample model usage

from transformers import T5ForConditionalGeneration, T5Tokenizer


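def do_inference(text, model, tokenizer):
    # The model expects the task tag "denoise: " in front of the noisy text.
    input_text = f"denoise: {text}"
    # Tokenize to a fixed length of 256 tokens; longer inputs are truncated.
    inputs = tokenizer.encode(
        input_text,
        return_tensors="pt",
        max_length=256,
        padding="max_length",
        truncation=True,
    )

    # Generate the cleaned text with beam search (5 beams).
    corrected_ids = model.generate(
        inputs,
        max_length=256,
        num_beams=5,
        early_stopping=True,
    )

    # Decode the best beam and drop special tokens such as padding.
    corrected_sentence = tokenizer.decode(corrected_ids[0], skip_special_tokens=True)
    return corrected_sentence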


model = T5ForConditionalGeneration.from_pretrained("radlab/polish-denoiser-t5-base")
tokenizer = T5Tokenizer.from_pretrained("radlab/polish-denoiser-t5-base")

text_str = "As | -Tron^# om ia je@st je!d &*ną z na -J s | AA ta rsZy ch n a u   k."
print(do_inference(text_str, model, tokenizer))

Model response for the input:

denoise: As | -Tron^# om ia je@st je!d &*ną z na -J s | AA ta rsZy ch n a u   k.

is:

Astronomia jest jedną z najstarszych nauk.
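
Because the sample code truncates the input to 256 tokens, longer documents need to be split before denoising. The sketch below is one possible approach, not part of the original model card: it splits on whitespace into chunks of at most max_words words and passes each chunk to a caller-supplied function. The names split_into_chunks and denoise_long_text are hypothetical, and max_words=40 is an arbitrary, conservative guess rather than a tuned value. Since noise introduces spurious spaces, a chunk boundary may fall inside a word, so output near boundaries can be imperfect.

```python
def split_into_chunks(text, max_words=40):
    """Split text on whitespace into chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def denoise_long_text(text, denoise_fn, max_words=40):
    """Apply denoise_fn to each chunk and join the results with single spaces."""
    chunks = split_into_chunks(text, max_words)
    return " ".join(denoise_fn(chunk) for chunk in chunks)


# Stand-in for the real model call, so the example runs without downloading weights.
print(denoise_long_text("a b c d e", str.upper, max_words=2))
```

With the model loaded as in the sample above, the stand-in could be replaced by lambda t: do_inference(t, model, tokenizer).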

Evaluation

Eval loss: [figure: evaluation loss plot]

More information (in Polish) is available on our blog.

Model size: 275M params (Safetensors, tensor type F32)