---
language:
- en
base_model:
- FacebookAI/roberta-large
pipeline_tag: text-classification
---

# Graded Word Sense Disambiguation (WSD) Model

## Model Summary
This model is a **fine-tuned version of RoBERTa-Large** for **Graded Word Sense Disambiguation (WSD)**. Given a word in context and a sense definition, it predicts the **degree of applicability** of that sense on a **1-4 scale**. It was trained on a **large-scale sense-annotated corpus** and is based on the work described in:

**Reference Paper:**
Pierluigi Cassotti and Nina Tahmasebi (2025). Sense-specific Historical Word Usage Generation. *Transactions of the Association for Computational Linguistics*, 13, 690-708.

Because the model produces **continuous-valued predictions** instead of hard classifications, it is suited to nuanced applications in lexicography, computational linguistics, and historical text analysis.

---
## Model Details
- **Base Model:** `roberta-large`
- **Task:** Graded Word Sense Disambiguation (WSD)
- **Fine-tuning Dataset:** Oxford English Dictionary (OED) sense-annotated corpus
- **Training Steps:**
  - The tokenizer was augmented with special tokens (`<t>`, `</t>`) for marking target words in context (a sketch of this step follows the list).
  - The dataset was preprocessed with **sense annotations** and **word offsets**.
  - Sentences containing sense-annotated words were split into **train (90%)** and **validation (10%)** sets.
- **Objective:** Predict a continuous label representing the applicability of a sense.
- **Evaluation Metric:** Root Mean Squared Error (RMSE)
- **Batch Size:** 32
- **Learning Rate:** 2e-5
- **Epochs:** 1
- **Optimizer:** AdamW with a weight decay of 0.01
- **Evaluation Strategy:** Step-based (every 10% of the dataset)
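
A minimal sketch of the tokenizer-augmentation step (the exact training script is not part of this card; `num_labels=1` reflects the single linear regression head described under Training & Fine-Tuning):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Base model with a single-output head for regression.
tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large", num_labels=1)

# Register <t> and </t> so the BPE tokenizer never splits them, then grow the
# embedding matrix to cover the two new vocabulary entries.
tokenizer.add_special_tokens({"additional_special_tokens": ["<t>", "</t>"]})
model.resize_token_embeddings(len(tokenizer))
```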
---
## Training & Fine-Tuning
Fine-tuning was performed with the **Hugging Face `Trainer` API** and a **custom dataset loader**. The dataset was processed as follows:

1. **Preprocessing**
   - Example sentences were extracted from the OED and paired with **sense definitions**.
   - The target word in each sentence was **marked** with the special tokens (`<t>`, `</t>`).
   - Each instance was labeled with a **graded similarity score** (see the sketch below).
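
For illustration, a single training instance might be assembled as follows. The sentence, offsets, and field names are hypothetical; only the `<t>`/`</t>` marking and the 1-4 label range come from this card:

```python
sentence = "He sat on the bank and watched the water flow."
start, end = 14, 18  # character offsets of the target word "bank"

# Wrap the target word in the special marker tokens.
marked = sentence[:start] + "<t>" + sentence[start:end] + "</t>" + sentence[end:]

instance = {
    "text": marked,
    "definition": "The land alongside a river or a stream.",
    "label": 4.0,  # graded applicability score in [1, 4]
}
```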
2. **Tokenization & Encoding**
   - Inputs were tokenized with `AutoTokenizer.from_pretrained("roberta-large")`.
   - The definition was concatenated after the `</s></s>` separator to obtain a **cross-sentence representation** (see the sketch below).
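
A sketch of the joint encoding, assuming the same input construction used in the usage example below (RoBERTa's tokenizer recognizes the literal `</s>` string as its separator token):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
tokenizer.add_special_tokens({"additional_special_tokens": ["<t>", "</t>"]})

text = "He sat on the <t>bank</t> and watched the water flow."
definition = "The land alongside a river or a stream."

# Concatenating the definition after </s></s> yields one sequence over which
# the model attends jointly to the usage and the candidate sense.
encoded = tokenizer(f"{text} </s></s> {definition}", truncation=True, return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0]))
```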
3. **Training Pipeline**
   - The model was fine-tuned as a **regression task** with a single **linear output head**.
   - Training used a **Mean Squared Error (MSE) loss**.
   - Evaluation on the validation set used **Root Mean Squared Error (RMSE)**; a sketch of the full pipeline follows.
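
Putting the pieces together, the pipeline could look roughly like the sketch below. The one-example dataset is a toy stand-in for the OED corpus, and the argument names follow recent `transformers` releases (`eval_strategy` was `evaluation_strategy` in older versions); `eval_steps=0.1` expresses the "every 10%" schedule as a ratio of total training steps:

```python
import numpy as np
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
tokenizer.add_special_tokens({"additional_special_tokens": ["<t>", "</t>"]})
model = AutoModelForSequenceClassification.from_pretrained("roberta-large", num_labels=1)
model.resize_token_embeddings(len(tokenizer))

# Toy stand-in for the OED data: marked usage + </s></s> + definition.
data = Dataset.from_dict({
    "text": ["He sat on the <t>bank</t> and watched the water flow. "
             "</s></s> The land alongside a river or a stream."],
    "label": [4.0],  # float labels make the Trainer fall back to MSE loss
})
data = data.map(
    lambda b: tokenizer(b["text"], truncation=True, padding="max_length", max_length=64),
    batched=True,
)

def compute_metrics(eval_pred):
    preds, labels = eval_pred
    return {"rmse": float(np.sqrt(np.mean((preds.squeeze(-1) - labels) ** 2)))}

args = TrainingArguments(
    output_dir="graded-wsd",
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    eval_strategy="steps",
    eval_steps=0.1,
)

Trainer(model=model, args=args, train_dataset=data, eval_dataset=data,
        compute_metrics=compute_metrics).train()
```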
---
## Usage
### Example Code
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("ChangeIsKey/graded-wsd")
model = AutoModelForSequenceClassification.from_pretrained("ChangeIsKey/graded-wsd")
model.eval()

# Mark the target word with <t> ... </t> and append the candidate sense
# definition after the </s></s> separator.
sentence = "The <t>bank</t> of the river was eroding due to the storm."
definition = "The land alongside a river or a stream."

tokenized_input = tokenizer(f"{sentence} </s></s> {definition}", truncation=True, return_tensors="pt")
with torch.no_grad():
    output = model(**tokenized_input)
score = output.logits.item()  # single regression logit = graded sense score

print(f"Graded Sense Score: {score:.2f}")
```
### Input Format
- **Sentence:** The contextual usage of the word, with the target word wrapped in `<t>` and `</t>`.
- **Target Word:** The word to be disambiguated.
- **Definition:** The dictionary definition of the candidate sense.

### Output
- **A continuous score** (between 1 and 4) indicating how well the given definition applies to the target word in its context. Scoring several candidate definitions and ranking them turns this into standard WSD, as sketched below.
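
Since the model scores one (sentence, definition) pair at a time, classic disambiguation can be done by scoring every candidate definition for the target word and taking the highest. A sketch with illustrative definitions:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("ChangeIsKey/graded-wsd")
model = AutoModelForSequenceClassification.from_pretrained("ChangeIsKey/graded-wsd")
model.eval()

sentence = "The <t>bank</t> of the river was eroding due to the storm."
candidate_definitions = [
    "The land alongside a river or a stream.",
    "A financial institution that accepts deposits and makes loans.",
]

# Score all candidate senses in one batch and rank them.
batch = tokenizer(
    [f"{sentence} </s></s> {d}" for d in candidate_definitions],
    truncation=True, padding=True, return_tensors="pt",
)
with torch.no_grad():
    scores = model(**batch).logits.squeeze(-1)

for definition, score in sorted(zip(candidate_definitions, scores.tolist()),
                                key=lambda pair: -pair[1]):
    print(f"{score:.2f}  {definition}")
```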
---
## Citation
If you use this model, please cite the following paper:

```bibtex
@article{10.1162/tacl_a_00761,
  author  = {Cassotti, Pierluigi and Tahmasebi, Nina},
  title   = {Sense-specific Historical Word Usage Generation},
  journal = {Transactions of the Association for Computational Linguistics},
  volume  = {13},
  pages   = {690--708},
  year    = {2025},
  month   = {07},
  issn    = {2307-387X},
  doi     = {10.1162/tacl_a_00761},
  url     = {https://doi.org/10.1162/tacl_a_00761}
}
```