	---
	tags:
	- pos-tagging
	- icelandic
	- nlp
	license: mit
	widget:
	- text: "Hér er dæmasetning til að prófa."
	datasets:
	- MIM-GOLD
	metrics:
	- accuracy
	- macro precision
	- macro recall
	---

	# Icelandic PoS Tagger

	This repository contains a Part-of-Speech (PoS) tagging model designed specifically for Icelandic sentences. The model can tag each word in a sentence with detailed linguistic features, including word class (noun, adjective, verb, etc.), gender, number, person, article (if applicable), and more.

	The GitHub project below contains the files used for training and evaluation, along with demos that show how to use the model and turn its output into readable form:

	```
	https://github.com/valgardg/learnice
	```

	---

	## Model Overview

	- Language: Icelandic
	- Task: Part-of-Speech (PoS) tagging
	- Features Tagged:
	  - Word Class: e.g., noun, adjective, verb
	  - Gender: masculine, feminine, neuter
	  - Number: singular, plural
	  - Person: 1st, 2nd, 3rd (for applicable word classes)
	  - Article: definite/indefinite (if applicable)
	  - Other Linguistic Features: additional fine-grained details included in the tags
	- Format: Tags are output in a structured format based on Icelandic linguistic conventions.

	---

	## Files in This Repository

	- `config.json`: Model configuration file, defining the architecture and settings.
	- `model.safetensors`: Model weights stored in the efficient and secure SafeTensors format.
	- `tokenizer.json`: Defines the tokenizer used for preprocessing Icelandic text.
	- `tokenizer_config.json`: Configuration for the tokenizer.
	- `vocab.txt`: Vocabulary file used by the tokenizer.
	- `special_tokens_map.json`: Mapping of special tokens (e.g., `[CLS]`, `[SEP]`) used by the tokenizer.
	- `id2tag_ftbi_ds100.json`: A JSON file mapping output IDs to the corresponding linguistic tags. This file is critical for interpreting the model's outputs.
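
The role of the ID-to-tag file can be sketched as follows. JSON object keys are always strings, so label IDs predicted by the model have to be looked up as strings or the keys converted to integers on load. The helper names `load_id2tag` and `missing_label_ids` are hypothetical, and the sample mapping uses placeholder tags rather than the real contents of `id2tag_ftbi_ds100.json`.

```python
import json
import os
import tempfile


def load_id2tag(path):
    """Read an ID-to-tag JSON file and return a dict keyed by integer label ID."""
    with open(path, "r", encoding="utf-8") as f:
        raw = json.load(f)  # JSON object keys arrive as strings, e.g. "0"
    return {int(label_id): tag for label_id, tag in raw.items()}


def missing_label_ids(id2tag, num_labels):
    """List label IDs in range(num_labels) that have no tag in the mapping."""
    return [i for i in range(num_labels) if i not in id2tag]


# Demo with a made-up mapping; the tag strings are placeholders (assumption).
sample = {"0": "TAG_A", "1": "TAG_B", "3": "TAG_C"}
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "id2tag_sample.json")
    with open(sample_path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    id2tag = load_id2tag(sample_path)

print(id2tag[1])
print(missing_label_ids(id2tag, 4))
```

A check like `missing_label_ids` can be run against the number of labels in the model configuration to confirm that every predicted ID can be interpreted.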

	---

	## Installation and Setup

	1. Clone the repository or download the model files:
	```bash
	git clone https://huggingface.co/<username>/<repo_name>
	cd <repo_name>
	```

	2. Install the required libraries:
	```bash
	pip install transformers huggingface_hub safetensors torch
	```

	3. Load the model and tokenizer in Python:
	```python
	from transformers import AutoModelForTokenClassification, AutoTokenizer

	model = AutoModelForTokenClassification.from_pretrained("<local_model_path>")
	tokenizer = AutoTokenizer.from_pretrained("<local_model_path>")
	```

	## Usage

	### PoS Tagging an Icelandic Sentence
	Here is an example of how to use the model to tag Icelandic sentences:

```python
# Tag an Icelandic sentence with the fine-tuned model
import json

import torch
from transformers import BertForTokenClassification, BertTokenizerFast

# Load the id2tag mapping (JSON object keys are strings)
with open("../models/ftbi_ds100/id2tag_ftbi_ds100.json", "r", encoding="utf-8") as f:
    id2tag = json.load(f)

# Load the tokenizer and model from the saved checkpoint
tokenizer = BertTokenizerFast.from_pretrained("../models/ftbi_ds100")
model = BertForTokenClassification.from_pretrained("../models/ftbi_ds100")


def predict_tags(sentence, tokenizer, model, id2tag):
    """Return (word, tag) pairs for a sentence given as a list of words."""
    # Tokenize the pre-split sentence
    tokenized_input = tokenizer(sentence, is_split_into_words=True, return_tensors="pt")

    # Get predictions
    with torch.no_grad():
        output = model(**tokenized_input)

    # Predicted label ID for every sub-token of the single sentence in the batch
    label_ids = torch.argmax(output.logits, dim=2)[0].tolist()

    # Convert label IDs to tag names, falling back to 'O' for unknown IDs
    tags = [id2tag.get(str(label_id), "O") for label_id in label_ids]

    # word_ids() gives the index of the original word for each sub-token
    word_ids = tokenized_input.word_ids()
    word_tags = []
    current_word_id = None

    for word_id, tag in zip(word_ids, tags):
        if word_id is None:  # Skip special tokens
            continue
        if word_id != current_word_id:  # First sub-token of a new word
            word_tags.append(tag)  # Use the first tag, or customize this
            current_word_id = word_id

    # Return the original words paired with their tags
    return list(zip(sentence, word_tags))


# Example usage; the sentence must be a list of words.
# Shorter alternatives:
# sentence = ["Hraunbær", "105", "."]
# sentence = ["Niðurstaða", "þess", "var", "neikvæð", "."]
sentence = "Kl. 9-16 fótaaðgerðir og hárgreiðsla , Kl. 9.15 handavinna , Kl. 13.30 sungið við flygilinn , Kl. 14.30-16 dansað við lagaval Halldóru , kaffiveitingar allir velkomnir .".split()
predicted_tags = predict_tags(sentence, tokenizer, model, id2tag)

print("Predicted Tags:", predicted_tags)
```
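
The example keeps the tag of each word's first sub-token and notes that this can be customized. As a minimal sketch of what such a customization could look like, the function below collapses sub-token tags with either the first-sub-token rule or a majority vote. The name `aggregate_word_tags` and the majority-vote strategy are assumptions for illustration, not part of this repository, and the tags in the demo are placeholders.

```python
from collections import Counter


def aggregate_word_tags(word_ids, tags, strategy="first"):
    """Collapse sub-token tags into one tag per word.

    word_ids: one entry per sub-token; None marks special tokens.
    tags: predicted tag per sub-token, same length as word_ids.
    strategy: "first" keeps the first sub-token's tag; "majority" takes the
        most common tag, with ties going to the tag seen earliest.
    """
    grouped = {}  # word index -> tags of its sub-tokens, in order
    for word_id, tag in zip(word_ids, tags):
        if word_id is None:  # skip special tokens such as [CLS] and [SEP]
            continue
        grouped.setdefault(word_id, []).append(tag)

    result = []
    for word_id in sorted(grouped):
        candidates = grouped[word_id]
        if strategy == "first":
            result.append(candidates[0])
        elif strategy == "majority":
            result.append(Counter(candidates).most_common(1)[0][0])
        else:
            raise ValueError(f"unknown strategy: {strategy!r}")
    return result


# Placeholder data: word 0 is split into three sub-tokens, word 1 into one.
demo_word_ids = [None, 0, 0, 0, 1, None]
demo_tags = ["X", "A", "B", "B", "C", "X"]

first_tags = aggregate_word_tags(demo_word_ids, demo_tags, strategy="first")
majority_tags = aggregate_word_tags(demo_word_ids, demo_tags, strategy="majority")
print(first_tags)
print(majority_tags)
```

Whether a majority vote improves on the first-sub-token rule depends on how the model was trained, so it is worth comparing both on held-out data before switching.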


	## License
	MIT License

	Feel free to use this model for research and development purposes. For any commercial use, please contact the repository owner.

	## Citation
	If you use this model in your work, please cite it as:

	```
	@misc{valgardg_icelandic_pos_tagger,
	author = {Valgard Gudni Oddsson},
	title = {Icelandic PoS Tagger},
	year = {2024},
	publisher = {Hugging Face},
	url = {https://huggingface.co/valgardg/learnice-pos-tagger}
	}
	```