---
tags:
- pos-tagging
- icelandic
- nlp
license: mit
widget:
- text: "Hér er dæmasetning til að prófa."
datasets:
- MIM-GOLD
metrics:
- accuracy
- macro precision
- macro recall
---
# Icelandic PoS Tagger
This repository contains a Part-of-Speech (PoS) tagging model designed specifically for Icelandic sentences. The model can tag each word in a sentence with detailed linguistic features, including word class (noun, adjective, verb, etc.), gender, number, person, article (if applicable), and more.
The GitHub project contains the files used for training and evaluation, as well as demos for using the model and producing readable output:
```
https://github.com/valgardg/learnice
```
---
## Model Overview
- **Language**: Icelandic
- **Task**: Part-of-Speech (PoS) tagging
- **Features Tagged**:
  - **Word Class**: e.g., noun, adjective, verb
  - **Gender**: masculine, feminine, neuter
  - **Number**: singular, plural
  - **Person**: 1st, 2nd, 3rd (for applicable word classes)
  - **Article**: definite/indefinite (if applicable)
  - **Other Linguistic Features**: Additional fine-grained details included in the tags.
- **Format**: Tags are output in a structured format based on Icelandic linguistic conventions.
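
To make the idea of a structured tag concrete, here is a minimal sketch that expands a noun tag into named features. It assumes the model's tags follow the positional MIM-GOLD-style layout for nouns (`n`, then gender, number, case, and an optional `g` for a suffixed definite article); that layout is an assumption about this model's output, and the helper `describe_noun_tag` is hypothetical, not part of the repository. Check the tags in `id2tag_ftbi_ds100.json` before relying on it.

```python
# Assumed positional layout for noun tags (MIM-GOLD-style):
#   n + gender + number + case [+ g for a suffixed definite article]
GENDER = {"k": "masculine", "v": "feminine", "h": "neuter"}
NUMBER = {"e": "singular", "f": "plural"}
CASE = {"n": "nominative", "o": "accusative", "þ": "dative", "e": "genitive"}


def describe_noun_tag(tag):
    """Expand a noun tag such as 'nken' into named features.

    Returns None for tags that do not match the assumed noun layout.
    """
    if len(tag) < 4 or tag[0] != "n":
        return None
    gender = GENDER.get(tag[1])
    number = NUMBER.get(tag[2])
    case = CASE.get(tag[3])
    if None in (gender, number, case):
        return None
    return {
        "word_class": "noun",
        "gender": gender,
        "number": number,
        "case": case,
        # A fifth character 'g' is assumed to mark the definite article
        "article": "definite" if tag[4:5] == "g" else None,
    }


print(describe_noun_tag("nken"))
print(describe_noun_tag("nveng"))
```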
---
## Files in This Repository
- **`config.json`**: Model configuration file, defining the architecture and settings.
- **`model.safetensors`**: Model weights stored in the efficient and secure SafeTensors format.
- **`tokenizer.json`**: Defines the tokenizer used for preprocessing Icelandic text.
- **`tokenizer_config.json`**: Configuration for the tokenizer.
- **`vocab.txt`**: Vocabulary file used by the tokenizer.
- **`special_tokens_map.json`**: Mapping of special tokens (e.g., `[CLS]`, `[SEP]`) used by the tokenizer.
- **`id2tag_ftbi_ds100.json`**: A JSON file mapping output IDs to the corresponding linguistic tags. This file is critical for interpreting the model's outputs.
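
As a rough illustration of how the mapping file is used, the sketch below converts predicted label IDs into tag strings. The small `example_id2tag` dictionary with placeholder tags and the helper `ids_to_tags` are hypothetical stand-ins, not the contents of `id2tag_ftbi_ds100.json`. The one property carried over from the real file is that label IDs are stored as string keys, which is how JSON objects represent them.

```python
import json

# Hypothetical stand-in for id2tag_ftbi_ds100.json; the real file is larger
# and contains actual linguistic tags rather than these placeholders.
example_json = '{"0": "TAG_A", "1": "TAG_B", "2": "TAG_C"}'
example_id2tag = json.loads(example_json)


def ids_to_tags(label_ids, id2tag, unknown="O"):
    """Map integer label IDs to tag strings, using `unknown` for missing IDs."""
    # JSON object keys are strings, so integer IDs are converted before lookup.
    return [id2tag.get(str(label_id), unknown) for label_id in label_ids]


tags = ids_to_tags([0, 2, 7], example_id2tag)
print(tags)
```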
---
## Installation and Setup
1. Clone the repository or download the model files:
```bash
git clone https://huggingface.co/valgardg/learnice-pos-tagger
cd learnice-pos-tagger
```
2. Install the required libraries:
```bash
pip install torch transformers huggingface_hub safetensors
```
3. Load the model and tokenizer in Python:
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
model = AutoModelForTokenClassification.from_pretrained("<local_model_path>")
tokenizer = AutoTokenizer.from_pretrained("<local_model_path>")
```
## Usage
### PoS Tagging an Icelandic Sentence
Here is an example of how to use the model to tag Icelandic sentences:
```python
import json

import torch
from transformers import BertForTokenClassification, BertTokenizerFast

# Load the id2tag mapping (label IDs, stored as string keys, to tag names)
with open("../models/ftbi_ds100/id2tag_ftbi_ds100.json", "r", encoding="utf-8") as f:
    id2tag = json.load(f)

# Load the tokenizer and the fine-tuned model from the saved checkpoint
tokenizer = BertTokenizerFast.from_pretrained("../models/ftbi_ds100")
model = BertForTokenClassification.from_pretrained("../models/ftbi_ds100")


def predict_tags(sentence, tokenizer, model, id2tag):
    """Return (word, tag) pairs for a sentence given as a list of words."""
    # Tokenize the pre-split sentence
    tokenized_input = tokenizer(sentence, is_split_into_words=True, return_tensors="pt")

    # Get predictions without tracking gradients
    with torch.no_grad():
        output = model(**tokenized_input)

    # Predicted label ID for every token of the single sentence in the batch
    label_ids = torch.argmax(output.logits, dim=2)[0].tolist()

    # Convert label IDs to tag names, falling back to 'O' for unknown IDs
    tags = [id2tag.get(str(label_id), "O") for label_id in label_ids]

    # word_ids() gives, for each token, the index of the word it came from
    word_ids = tokenized_input.word_ids()

    word_tags = []
    current_word_id = None
    for word_id, tag in zip(word_ids, tags):
        if word_id is None:  # Skip special tokens
            continue
        if word_id != current_word_id:  # First subword token of a new word
            word_tags.append(tag)  # Use the first token's tag, or customize this
            current_word_id = word_id

    # Return the original words paired with their tags
    return list(zip(sentence, word_tags))


# Example usage: the sentence is passed as a list of words.
# Other inputs to try:
#   ["Hraunbær", "105", "."]
#   ["Niðurstaða", "þess", "var", "neikvæð", "."]
sentence = "Kl. 9-16 fótaaðgerðir og hárgreiðsla , Kl. 9.15 handavinna , Kl. 13.30 sungið við flygilinn , Kl. 14.30-16 dansað við lagaval Halldóru , kaffiveitingar allir velkomnir .".split()
predicted_tags = predict_tags(sentence, tokenizer, model, id2tag)
print("Predicted Tags:", predicted_tags)
```
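
The example above keeps the tag of the first subword token for each word and notes that this can be customized. One possible alternative, sketched here, is a majority vote over a word's subword tokens. The helper `aggregate_word_tags` and the placeholder tags are hypothetical and not part of the repository; the inputs mirror the `word_ids` and `tags` lists built inside `predict_tags`.

```python
from collections import Counter


def aggregate_word_tags(word_ids, tags, strategy="first"):
    """Collapse per-token tags into one tag per word.

    word_ids: per-token word index, with None for special tokens.
    tags: per-token tag strings, same length as word_ids.
    strategy: "first" keeps the first subword's tag; "majority" keeps the
        most frequent tag, with ties going to the tag seen earliest.
    """
    grouped = {}  # word index -> tags of its subword tokens, in order
    for word_id, tag in zip(word_ids, tags):
        if word_id is None:  # Skip special tokens
            continue
        grouped.setdefault(word_id, []).append(tag)

    result = []
    for word_id in sorted(grouped):
        token_tags = grouped[word_id]
        if strategy == "majority":
            # most_common keeps first-seen order among equal counts
            result.append(Counter(token_tags).most_common(1)[0][0])
        else:
            result.append(token_tags[0])
    return result


# Placeholder data: word 0 is split into three subword tokens, word 1 into one
example_word_ids = [None, 0, 0, 0, 1, None]
example_tags = ["O", "TAG_A", "TAG_B", "TAG_B", "TAG_C", "O"]
print(aggregate_word_tags(example_word_ids, example_tags, strategy="first"))
print(aggregate_word_tags(example_word_ids, example_tags, strategy="majority"))
```

Whether majority voting improves on the first-token rule depends on how labels were assigned to subword tokens during training, so it is worth checking both on held-out data.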
## License
MIT License
Feel free to use this model for research and development purposes. For any commercial use, please contact the repository owner.
## Citation
If you use this model in your work, please cite it as:
```
@misc{valgardg_icelandic_pos_tagger,
author = {Valgard Gudni Oddsson},
title = {Icelandic PoS Tagger},
year = {2024},
publisher = {Hugging Face},
url = {https://huggingface.co/valgardg/learnice-pos-tagger}
}
```