|
--- |
|
license: llama3.1 |
|
language: |
|
- de |
|
- en |
|
- es |
|
- fr |
|
- it |
|
tags: |
|
- text-generation-inference |
|
base_model: |
|
- meta-llama/Llama-3.1-8B-Instruct |
|
--- |
|
# Model Card for LLM-wsd-TT-10000 |
|
|
|
## Model description |
|
|
|
|
|
|
**LLM-wsd-TT-10000** is a *Large Language Model (LLM)* instruction-tuned from **meta-llama/Meta-Llama-3.1-8B-Instruct**.
|
This model has been trained for the **Word Sense Disambiguation (WSD)** task on a balanced training dataset (10,000 instances per language) obtained via machine translation. It provides the definition of a target word in a given sentence. Specifically, it can answer both:
|
1) **Open-ended questions**, where the model will generate the definition of the target word; |
|
2) **Closed-ended questions**, where the model will generate the identifier of the correct option out of a list of alternatives. |
|
|
|
More details regarding the training procedure (e.g. hyperparameters, dataset construction, and so on) can be found in Section 4.2 of the [paper](https://arxiv.org/abs/2503.08662). |
|
|
|
- **Developed by:** Pierpaolo Basile, Lucia Siciliani, Elio Musacchio |
|
- **Model type:** LLaMA 3.1 Instruct |
|
- **Language(s) (NLP):** English, French, German, Italian and Spanish |
|
- **License:** [LLAMA 3.1 COMMUNITY LICENSE AGREEMENT](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct/blob/main/LICENSE) |
|
- **Finetuned from model:** [meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) |
|
|
|
## Prompt Format |
|
|
|
The model has been trained using several instructions depending on the language, the task (open-ended or closed-ended), and the number of occurrences of the target word in the sentence. In [Instructions](#instructions), we provide the instructions used for all cases. The following placeholder variables have to be replaced:
|
- {target_word}: the target word in the input to disambiguate; |
|
- {options}: the options to provide to the model, for the closed-ended task only. The options should be newline-separated and each option should be identified by a number. Refer to the [closed-ended example](#closed-ended) for an example of options formatting;
|
- {occurrence}: the ordinal number of the {target_word} occurrence (e.g. "second"). This is required only when the input sentence contains multiple occurrences of {target_word}. |
|
|
|
Please note that the complete prompt also has the following string after the instruction: |
|
|
|
```python |
|
" Input: \"{sentence}\"" |
|
``` |
|
|
|
where {sentence} is the input sentence containing the word to disambiguate. |
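
For reference, the snippet below sketches how a complete closed-ended prompt can be assembled from these placeholders. The `build_prompt` helper and its arguments are illustrative, not part of the model's API; the instruction string is the English closed-ended template listed in [Instructions](#instructions).

```python
# Illustrative helper (not part of the model) that fills the placeholders
# described above and appends the " Input: ..." suffix.
def build_prompt(instruction_template: str, sentence: str, **placeholders) -> str:
    instruction = instruction_template.format(**placeholders)
    return instruction + " Input: \"" + sentence + "\""

# English closed-ended template (single occurrence); {options} is newline-separated
# and numbered, as described above.
template = (
    "Given the word \"{target_word}\" in the input sentence, choose the correct "
    "meaning from the following:\n{options}\n\nGenerate only the number of the selected option."
)
options = "1) Move very fast\n2) Urge to an unnatural speed"

prompt = build_prompt(
    template,
    sentence="If you hurry you might beat the headquarters boys.",
    target_word="hurry",
    options=options,
)
print(prompt)
```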
|
|
|
## How to Get Started with the Model |
|
|
|
Below you can find two examples of model usage, for open-ended and closed-ended generation respectively. |
|
|
|
### Open-ended |
|
|
|
```python
import torch

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.trainer_utils import set_seed

# Target word, open-ended instruction (English template) and input sentence
target_word = "long"
instruction = f"Give a brief definition of the word \"{target_word}\" in the sentence given as input. Generate only the definition."
input_sentence = "How long has it been since you reviewed the objectives of your benefit and service program?"

model_id = "swap-uniba/LLM-wsd-TT-10000"

set_seed(42)

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
tokenizer.padding_side = "left"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='cuda',
    torch_dtype=torch.bfloat16,
).eval()

# Stop generation at the EOS token or at the Llama 3.1 end-of-turn token
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

# Full prompt: instruction followed by the input sentence
messages = [
    {"role": "user", "content": instruction + " Input: \"" + input_sentence + "\""},
]

input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

# Greedy decoding
outputs = model.generate(
    input_ids.to('cuda'),
    max_new_tokens=512,
    eos_token_id=terminators,
    num_beams=1,
    do_sample=False
)

# Decode only the newly generated tokens (the definition)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
|
|
|
### Closed-ended |
|
|
|
```python
import torch

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.trainer_utils import set_seed

# Target word, closed-ended instruction (English template with numbered options) and input sentence
target_word = "hurry"
instruction = f"Given the word \"{target_word}\" in the input sentence, choose the correct meaning from the following:\n1) Move very fast\n2) Urge to an unnatural speed\n\nGenerate only the number of the selected option."
input_sentence = "If you hurry you might beat the headquarters boys."

model_id = "swap-uniba/LLM-wsd-TT-10000"

set_seed(42)

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
tokenizer.padding_side = "left"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='cuda',
    torch_dtype=torch.bfloat16,
).eval()

# Stop generation at the EOS token or at the Llama 3.1 end-of-turn token
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

# Full prompt: instruction followed by the input sentence
messages = [
    {"role": "user", "content": instruction + " Input: \"" + input_sentence + "\""},
]

input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

# Greedy decoding
outputs = model.generate(
    input_ids.to('cuda'),
    max_new_tokens=512,
    eos_token_id=terminators,
    num_beams=1,
    do_sample=False
)

# Decode only the newly generated tokens (the number of the selected option)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
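
Since the closed-ended instruction asks for only the number of the selected option, the generated text can be mapped back to the corresponding meaning. The snippet below is a minimal sketch that continues from the example above; the `options` list and the parsing logic are illustrative and assume the model follows the instruction and emits only the option number.

```python
# Continuing from the closed-ended example above: map the generated text back to the
# option list. Assumes the model follows the instruction and outputs only the number.
generated = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
options = ["Move very fast", "Urge to an unnatural speed"]

selected = int(generated.strip().rstrip(").").strip())  # e.g. "1" -> 1
print(options[selected - 1])
```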
|
|
|
## Citation |
|
|
|
If you use this model in your research, please cite the following: |
|
|
|
```bibtex |
|
@misc{basile2025exploringwordsensedisambiguation, |
|
title={Exploring the Word Sense Disambiguation Capabilities of Large Language Models}, |
|
author={Pierpaolo Basile and Lucia Siciliani and Elio Musacchio and Giovanni Semeraro}, |
|
year={2025}, |
|
eprint={2503.08662}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL}, |
|
url={https://arxiv.org/abs/2503.08662}, |
|
} |
|
``` |
|
|
|
## Instructions |
|
|
|
### Single occurrence of target word (open-ended) |
|
|
|
#### English |
|
```python |
|
"Give a brief definition of the word \"{target_word}\" in the sentence given as input. Generate only the definition." |
|
``` |
|
|
|
#### French |
|
```python |
|
"Donnez une brève définition du mot \"{target_word}\" dans la phrase d’entrée donnée. Ne donnez que la définition." |
|
``` |
|
|
|
#### German |
|
```python |
|
"Geben Sie eine kurze Definition des Wortes \"{target_word}\" in dem gegebenen Satz an. Erzeugen Sie nur die Definition." |
|
``` |
|
|
|
#### Italian |
|
```python |
|
"Fornisci una breve definizione della parola \"{target_word}\" nella frase data in input. Genera solo la definizione." |
|
``` |
|
|
|
#### Spanish |
|
```python |
|
"Proporciona una definición breve de la palabra \"{target_word}\" en la frase dada en entrada. Genera solo la definición." |
|
``` |
|
|
|
### Multiple occurrences of target word (open-ended)
|
|
|
#### English |
|
```python |
|
"Give a brief definition of the {occurrence} occurrence of the word \"{target_word}\" in the sentence given as input. Generate only the definition." |
|
``` |
|
|
|
#### French |
|
```python |
|
"Donnez une brève définition de l'occurrence {occurrence} du mot \"{target_word}\" dans la phrase d’entrée donnée. Ne donnez que la définition." |
|
``` |
|
|
|
#### German |
|
```python |
|
"Geben Sie eine kurze Definition des {occurrence} Vorkommens des Wortes \"{target_word}\" in dem gegebenen Eingabesatz an. Erzeugen Sie nur die Definition." |
|
``` |
|
|
|
#### Italian |
|
```python |
|
"Fornisci una breve definizione della {occurrence} occorrenza della parola \"{target_word}\" nella frase data in input. Genera solo la definizione." |
|
``` |
|
|
|
#### Spanish |
|
```python |
|
"Proporciona una definición breve de la {occurrence} ocurrencia de la palabra \"{target_word}\" en la frase dada en entrada. Genera solo la definición." |
|
``` |
|
|
|
### Single occurrence of target word (closed-ended) |
|
|
|
#### English |
|
```python |
|
"Given the word \"{target_word}\" in the input sentence, choose the correct meaning from the following:\n{options}\n\nGenerate only the number of the selected option." |
|
``` |
|
|
|
#### French |
|
```python |
|
"Étant donné le mot \"{target_word}\" dans la phrase saisie, choisissez la signification correcte parmi les suivantes:\n{options}\n\nNe donnez que le numéro de l’option sélectionnée." |
|
``` |
|
|
|
#### German |
|
```python |
|
"Wählen Sie für das Wort \"{target_word}\" im Eingabesatz die richtige Bedeutung aus den folgenden Angaben:\n{options}\n\nErzeugt nur die Nummer der ausgewählten Option" |
|
``` |
|
|
|
#### Italian |
|
```python |
|
"Data la parola \"{target_word}\" nella frase in input, scegli il significato corretto tra i seguenti:\n{options}\n\nGenera solo il numero dell'opzione selezionata." |
|
``` |
|
|
|
#### Spanish |
|
```python |
|
"Dada la palabra \"{target_word}\" en la frase de entrada, elija el significado correcto entre los siguientes:\n{options}\n\nGenera solo el número de la opción seleccionada." |
|
``` |
|
|
|
### Multiple occurrences of target word (closed-ended) |
|
|
|
#### English |
|
```python |
|
"Given the word \"{target_word}\" in the input sentence, choose the correct meaning from the following:\n{options}\n\nGenerate only the number of the selected option." |
|
``` |
|
|
|
#### French |
|
```python |
|
"Étant donné l'occurrence {occurrence} du mot \"{target_word}\" dans la phrase d'entrée, choisissez la signification correcte parmi les suivantes:\n{options}\n\nNe donnez que le numéro de l’option sélectionnée." |
|
``` |
|
|
|
#### German |
|
```python |
|
"Wählen Sie angesichts des {occurrence} Vorkommens des Wortes \"{target_word}\" im Eingabesatz die richtige Bedeutung aus der folgenden Liste aus:\n{options}\n\nErzeugt nur die Nummer der ausgewählten Option." |
|
``` |
|
|
|
#### Italian |
|
```python |
|
"Data la {occurrence} occorrenza della parola \"{target_word}\" nella frase in input, scegli il significato corretto tra i seguenti:\n{options}\n\nGenera solo il numero dell'opzione selezionata." |
|
``` |
|
|
|
#### Spanish |
|
```python |
|
"Dada la {occurrence} ocurrencia de la palabra \"{target_word}\" en la frase de entrada, elije el significado correcto entre los siguientes:\n{options}\n\nGenera solo el número de la opción seleccionada." |
|
``` |
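
If you need to select among these templates programmatically, one possible organization is sketched below. The `INSTRUCTIONS` dictionary and the `get_instruction` helper are illustrative choices, not part of the released model; only a few English templates are filled in, and the remaining languages and cases can be copied from the lists above.

```python
# Illustrative organization of the instruction templates listed above (not part of the model).
# Keys: (language, task, multiple_occurrences). Only some English entries are filled in here;
# the remaining languages and cases can be copied from the sections above.
INSTRUCTIONS = {
    ("en", "open", False): "Give a brief definition of the word \"{target_word}\" in the sentence given as input. Generate only the definition.",
    ("en", "open", True): "Give a brief definition of the {occurrence} occurrence of the word \"{target_word}\" in the sentence given as input. Generate only the definition.",
    ("en", "closed", False): "Given the word \"{target_word}\" in the input sentence, choose the correct meaning from the following:\n{options}\n\nGenerate only the number of the selected option.",
}


def get_instruction(language: str, task: str, multiple_occurrences: bool, **placeholders) -> str:
    # Look up the template and fill in {target_word}, {options} and/or {occurrence}
    return INSTRUCTIONS[(language, task, multiple_occurrences)].format(**placeholders)
```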