---
language:
- en
base_model:
- FacebookAI/roberta-large
pipeline_tag: text-classification
---
# Graded Word Sense Disambiguation (WSD) Model

## Model Summary
This model is a **fine-tuned version of RoBERTa-Large** for **Graded Word Sense Disambiguation (WSD)**. Given a word in context and a candidate sense definition, it predicts the **degree to which that sense applies**, leveraging **large-scale sense-annotated corpora** for training. The model is based on the work described in:

**Reference Paper:**
Pierluigi Cassotti, Nina Tahmasebi (2025). Sense-specific Historical Word Usage Generation.

Because it produces **continuous-valued predictions** rather than hard sense labels, the model suits nuanced applications in lexicography, computational linguistics, and historical text analysis.

---

## Model Details
- **Base Model:** `roberta-large`
- **Task:** Graded Word Sense Disambiguation (WSD)
- **Fine-tuning Dataset:** Oxford English Dictionary (OED) sense-annotated corpus
- **Training Steps:**
  - Tokenizer augmented with special tokens (`<t>`, `</t>`) for marking target words in context (see the sketch after this list).
  - Dataset preprocessed with **sense annotations** and **word offsets**.
  - Sentences containing sense-annotated words were split into **train (90%)** and **validation (10%)** sets.
- **Objective:** Predict a continuous label representing the applicability of a sense.
- **Evaluation Metric:** Root Mean Squared Error (RMSE)
- **Batch Size:** 32
- **Learning Rate:** 2e-5
- **Epochs:** 1
- **Optimizer:** AdamW with weight decay of 0.01
- **Evaluation Strategy:** Steps-based (every 10% of the dataset)
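
The tokenizer augmentation above can be reproduced with a minimal sketch like the following. This is an illustrative reconstruction, not the released training script; the `num_labels=1` head mirrors the single linear output head and MSE objective described in this card.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative sketch: add the <t> / </t> target markers to the tokenizer
# and resize the embedding matrix so the new tokens get trainable vectors.
tokenizer = AutoTokenizer.from_pretrained("roberta-large")
tokenizer.add_special_tokens({"additional_special_tokens": ["<t>", "</t>"]})

model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-large",
    num_labels=1,               # single linear head -> regression
    problem_type="regression",  # Trainer then applies MSE loss
)
model.resize_token_embeddings(len(tokenizer))
```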

---

## Training & Fine-Tuning
Fine-tuning was performed using the **Hugging Face `Trainer` API** with a **custom dataset loader**. The dataset was processed as follows (a training sketch follows the list):

1. **Preprocessing**
   - Example sentences were extracted from the OED and paired with **sense definitions**.
   - The target word was **highlighted** with special tokens (`<t>`, `</t>`).
   - Each instance was labeled with a **graded similarity score**.

2. **Tokenization & Encoding**
   - Tokenized with `AutoTokenizer.from_pretrained("roberta-large")`.
   - Definitions were concatenated to the sentence using the `</s></s>` separator for **cross-sentence representation**.

3. **Training Pipeline**
   - Model fine-tuned on the **regression task** with a single **linear output head**.
   - Trained with **Mean Squared Error (MSE) loss**.
   - Evaluated on the validation set using **Root Mean Squared Error (RMSE)**.
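
Putting these steps together, a minimal sketch of the input construction and `Trainer` setup could look like this. Field names, `output_dir`, and the concrete `eval_steps` value are illustrative assumptions; only the hyperparameters listed above come from this card.

```python
import numpy as np
from transformers import Trainer, TrainingArguments

# Illustrative input construction: wrap the target word (given by its
# character offsets) in <t> ... </t> and append the sense definition
# after the </s></s> separator.
def build_input(sentence: str, start: int, end: int, definition: str) -> str:
    marked = f"{sentence[:start]}<t> {sentence[start:end]} </t>{sentence[end:]}"
    return f"{marked} </s></s> {definition}"

# RMSE on the validation set, as described above.
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    rmse = float(np.sqrt(np.mean((predictions.squeeze() - labels) ** 2)))
    return {"rmse": rmse}

training_args = TrainingArguments(
    output_dir="graded-wsd",         # illustrative path
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    evaluation_strategy="steps",
    eval_steps=500,                  # in the card: every 10% of the dataset
)

# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_dataset, eval_dataset=val_dataset,
#                   tokenizer=tokenizer, compute_metrics=compute_metrics)
# trainer.train()
```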

---

## Usage
### Example Code
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("ChangeIsKey/graded-wsd")
model = AutoModelForSequenceClassification.from_pretrained("ChangeIsKey/graded-wsd")

sentence = "The bank of the river was eroding due to the storm."
target_word = "bank"
definition = "The land alongside a river or a stream."

# Mark the target word with the same <t> ... </t> tokens used during training.
marked_sentence = sentence.replace(target_word, f"<t> {target_word} </t>", 1)

tokenized_input = tokenizer(
    f"{marked_sentence} </s></s> {definition}",
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    output = model(**tokenized_input)
    score = output.logits.item()

print(f"Graded Sense Score: {score}")
```

### Input Format
- **Sentence:** Contextual usage of the word.
- **Target Word:** The word to be disambiguated, marked with `<t>` and `</t>` in the sentence.
- **Definition:** The dictionary definition of the intended sense.

### Output
- **A continuous score** (between 0 and 1) indicating the **degree to which the given definition applies** to the word in its context.
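
Because the score is graded, a natural use is ranking candidate senses for the same occurrence. A small example building on the snippet above; the financial definition here is an illustrative addition, not taken from the OED data.

```python
# Rank two candidate senses of "bank" for the same marked sentence;
# the river-bank definition should receive the higher score.
def score_sense(marked_sentence: str, definition: str) -> float:
    inputs = tokenizer(f"{marked_sentence} </s></s> {definition}",
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).logits.item()

marked = "The <t> bank </t> of the river was eroding due to the storm."
river_sense = "The land alongside a river or a stream."
money_sense = "A financial institution that accepts deposits and makes loans."

print(score_sense(marked, river_sense))  # expected: closer to 1
print(score_sense(marked, money_sense))  # expected: closer to 0
```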

---

## Citation
If you use this model, please cite the following paper:

```
@article{cassotti2025,
  title={Sense-specific Historical Word Usage Generation},
  author={Cassotti, Pierluigi and Tahmasebi, Nina},
  journal={Transactions of the Association for Computational Linguistics},
  year={2025}
}
```