---
thumbnail: "An open multilingual readability scoring model TRank"
base_model: "Peltarion/xlm-roberta-longformer-base-4096"
tags:
- arxiv:2406.01835
- Readability
- Multilingual
- Wikipedia
license: mit
language:
- yi
- xh
- fy
- cy
- vi
- uz
- ug
- ur
- uk
- tr
- th
- te
- ta
- sv
- sw
- su
- es
- so
- sl
- sk
- si
- sd
- sr
- gd
- sa
- ru
- ro
- pa
- pt
- pl
- fa
- ps
- om
- or
- 'no'
- ne
- mn
- mr
- ml
- ms
- mg
- mk
- lt
- lv
- la
- lo
- ky
- ku
- ko
- km
- kk
- kn
- jv
- ja
- it
- ga
- id
- is
- hu
- hi
- he
- ha
- gu
- el
- de
- ka
- gl
- fr
- fi
- tl
- et
- eo
- en
- nl
- da
- cs
- hr
- zh
- ca
- my
- bg
- br
- bs
- bn
- be
- eu
- az
- as
- hy
- ar
- am
- af
- sq
pipeline_tag: text-classification
---

# Open Multilingual Text Readability Scoring Model (TRank)

[![DOI:10.48550/arXiv.2406.01835](https://zenodo.org/badge/DOI/10.48550/arXiv.2406.01835.svg)](https://doi.org/10.48550/arXiv.2406.01835)
[![Readability Experiments repo](https://img.shields.io/badge/GitLab-repo-orange)](https://gitlab.wikimedia.org/repos/research/readability-experiments)

## Overview

This repository contains an open multilingual readability scoring model TRank, presented in the ACL'24 paper **An Open Multilingual System for Scoring Readability of Wikipedia**.
The model is designed to evaluate the readability of text across multiple languages.

## Features

- **Multilingual Support**: Evaluates readability in multiple languages.
- **Pairwise Ranking**: Trained using a Siamese architecture with Margin Ranking Loss to differentiate and rank texts from hardest to simplest.
- **Long Context Window**: Utilizes the Longformer architecture of the base model, supporting inputs up to 4096 tokens.

## Model Training

The model training implementation can be found in the [Readability Experiments repo](https://gitlab.wikimedia.org/repos/research/readability-experiments).

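
The pairwise ranking objective named under Features can be illustrated with a small, dependency-free sketch. It mirrors the documented definition of `torch.nn.MarginRankingLoss` with target `+1`, that is `max(0, margin - (x1 - x2))` averaged over pairs. The function name, the margin value of `0.5`, the example scores, and the convention that a harder text should receive a higher score are assumptions made for illustration; the actual training code in the linked repository may differ.

```python
def margin_ranking_loss(hard_scores, easy_scores, margin=0.5):
    """Mean pairwise margin ranking loss with target y = +1.

    Assumption: the harder text of each pair should score at least
    `margin` higher than its easier counterpart.
    """
    if len(hard_scores) != len(easy_scores):
        raise ValueError("score lists must have the same length")
    if not hard_scores:
        raise ValueError("at least one pair is required")
    # A pair contributes zero loss once hard - easy >= margin.
    losses = [max(0.0, margin - (h - e)) for h, e in zip(hard_scores, easy_scores)]
    return sum(losses) / len(losses)


# Hypothetical scores for three (hard, easy) text pairs.
hard = [1.5, 0.25, -0.5]
easy = [0.5, 0.0, 0.5]

# Pair 1 is ranked correctly with enough margin, pair 2 is ranked correctly
# but too closely, and pair 3 is ranked in the wrong order.
print(margin_ranking_loss(hard, easy))
```

In a Siamese setup, both texts of a pair would be scored by the same network, so minimizing this loss pushes the shared scorer to separate harder and easier versions of a text by at least the margin.
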
## Usage example

```python
import torch
import torch.nn as nn
from huggingface_hub import PyTorchModelHubMixin
from transformers import AutoModel, AutoTokenizer

# Define the model:
BASE_MODEL = "Peltarion/xlm-roberta-longformer-base-4096"


class ReadabilityModel(nn.Module, PyTorchModelHubMixin):
    def __init__(self, model_name=BASE_MODEL):
        super().__init__()
        self.model = AutoModel.from_pretrained(model_name)
        self.drop = nn.Dropout(p=0.2)
        self.fc = nn.Linear(768, 1)

    def forward(self, ids, mask):
        out = self.model(input_ids=ids, attention_mask=mask,
                         output_hidden_states=False)
        out = self.drop(out[1])
        outputs = self.fc(out)
        return outputs


# Load the model:
model = ReadabilityModel.from_pretrained("trokhymovych/TRank_readability")

# Load the tokenizer:
tokenizer = AutoTokenizer.from_pretrained("trokhymovych/TRank_readability")

# Set the model to evaluation mode
model.eval()

# Example input text
input_text = "This is an example sentence to evaluate readability."

# Tokenize the input text
inputs = tokenizer.encode_plus(
    input_text,
    add_special_tokens=True,
    max_length=512,
    truncation=True,
    padding='max_length',
    return_tensors='pt'
)
ids = inputs['input_ids']
mask = inputs['attention_mask']

# Make prediction
with torch.no_grad():
    outputs = model(ids, mask)
    readability_score = outputs.item()

# Print the input text and the readability score
print(f"Input Text: {input_text}")
print(f"Readability Score: {readability_score}")
```

## Citation

Preprint:

```
@misc{trokhymovych2024openmultilingualscoringreadability,
      title={An Open Multilingual System for Scoring Readability of Wikipedia},
      author={Mykola Trokhymovych and Indira Sen and Martin Gerlach},
      year={2024},
      eprint={2406.01835},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.01835},
}
```