Icelandic PoS Tagger

This repository contains a Part-of-Speech (PoS) tagging model designed specifically for Icelandic sentences. The model can tag each word in a sentence with detailed linguistic features, including word class (noun, adjective, verb, etc.), gender, number, person, article (if applicable), and more.

The GitHub project linked below includes the files used for training and evaluation, as well as demos that show how to run the model and produce readable output:

https://github.com/valgardg/learnice

Model Overview

Language: Icelandic
Task: Part-of-Speech (PoS) tagging
Features Tagged:
- Word Class: e.g., noun, adjective, verb
- Gender: masculine, feminine, neuter
- Number: singular, plural
- Person: 1st, 2nd, 3rd (for applicable word classes)
- Article: definite/indefinite (if applicable)
- Other Linguistic Features: Additional fine-grained details included in the tags.
Format: Tags are output in a structured format based on Icelandic linguistic conventions.

Files in This Repository

config.json: Model configuration file, defining the architecture and settings.
model.safetensors: Model weights stored in the efficient and secure SafeTensors format.
tokenizer.json: Defines the tokenizer used for preprocessing Icelandic text.
tokenizer_config.json: Configuration for the tokenizer.
vocab.txt: Vocabulary file used by the tokenizer.
special_tokens_map.json: Mapping of special tokens (e.g., [CLS], [SEP]) used by the tokenizer.
id2tag_ftbi_ds100.json: A JSON file mapping output IDs to the corresponding linguistic tags. This file is critical for interpreting the model's outputs.
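
As a rough illustration of how such a mapping is typically applied, the sketch below converts predicted label IDs to tag strings. It assumes the file is a flat JSON object keyed by label IDs stored as strings, which matches how the usage example in this README looks tags up with `str(label_id)`. The tag names (`TAG_A` and so on) and the helper `ids_to_tags` are hypothetical placeholders, not the model's real tags or part of the repository.

```python
import json

# Hypothetical stand-in for the contents of id2tag_ftbi_ds100.json.
# The real file's tag strings will differ; only the shape is assumed.
example_json = '{"0": "TAG_A", "1": "TAG_B", "2": "TAG_C"}'
id2tag = json.loads(example_json)


def ids_to_tags(label_ids, id2tag, unknown="O"):
    """Map integer label IDs to tag strings, falling back to `unknown`."""
    # JSON object keys are always strings, so convert each ID before lookup.
    return [id2tag.get(str(label_id), unknown) for label_id in label_ids]


print(ids_to_tags([2, 0, 7], id2tag))  # ['TAG_C', 'TAG_A', 'O']
```

The fallback value matters because an ID missing from the mapping would otherwise raise a `KeyError` partway through a sentence.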

Installation and Setup

Clone the repository or download the model files:

git clone https://huggingface.co/<username>/<repo_name>
cd <repo_name>

Install the required libraries:

pip install transformers torch huggingface_hub safetensors

Load the model and tokenizer in Python:

from transformers import AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("<local_model_path>")
tokenizer = AutoTokenizer.from_pretrained("<local_model_path>")
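
Before loading, it can help to confirm that the local directory actually contains every file listed under "Files in This Repository". The sketch below is one possible check, not part of the project. The helper name `missing_model_files` is hypothetical, and the file names are copied from the list above.

```python
from pathlib import Path

# File names taken from the "Files in This Repository" list in this README.
EXPECTED_FILES = [
    "config.json",
    "model.safetensors",
    "tokenizer.json",
    "tokenizer_config.json",
    "vocab.txt",
    "special_tokens_map.json",
    "id2tag_ftbi_ds100.json",
]


def missing_model_files(model_dir):
    """Return the expected files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Check the current directory; replace "." with your local model path.
missing = missing_model_files(".")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All expected model files are present.")
```

An empty result means the directory should be usable as the path passed to `from_pretrained`.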

Usage

PoS Tagging an Icelandic Sentence

The example below tags a pre-tokenized Icelandic sentence (a list of words) and returns one tag per word:

# Imports for loading the fine-tuned model and running inference
from transformers import BertTokenizerFast, BertForTokenClassification
import torch # type: ignore
import json

# Load id2tag mapping (keys are label IDs stored as strings)
with open("../models/ftbi_ds100/id2tag_ftbi_ds100.json", "r", encoding="utf-8") as f:
    id2tag = json.load(f)

# Load your tokenizer and model from saved checkpoint
tokenizer = BertTokenizerFast.from_pretrained("../models/ftbi_ds100")
model = BertForTokenClassification.from_pretrained("../models/ftbi_ds100")

# Function to predict tags on a new sentence
def predict_tags(sentence, tokenizer, model, id2tag):
    # Tokenize the sentence
    tokenized_input = tokenizer(sentence, is_split_into_words=True, return_tensors="pt")
    
    # Get predictions
    with torch.no_grad():
        output = model(**tokenized_input)
    
    # Get predicted label IDs for the single sentence in the batch
    label_ids = torch.argmax(output.logits, dim=-1)[0].tolist()
    
    # Convert label IDs to tag names, falling back to 'O' for unknown IDs
    tags = [id2tag.get(str(label_id), 'O') for label_id in label_ids]
    
    # Match back to original words
    word_ids = tokenized_input.word_ids()  # This shows which original word each token corresponds to
    word_tags = []
    current_word_id = None
    current_tags = []

    # Aggregate tags for each word
    for word_id, tag in zip(word_ids, tags):
        if word_id is None:  # Skip special tokens
            continue
        if word_id != current_word_id:  # New word detected
            if current_tags:  # Append the aggregated tag for the previous word
                word_tags.append(current_tags[0])  # Use the first tag, or customize this
            current_word_id = word_id
            current_tags = [tag]
        else:
            current_tags.append(tag)  # Aggregate tags for the same word

    # Append the last word's tag
    if current_tags:
        word_tags.append(current_tags[0])  # Use the first tag, or customize this
    
    # Return the original words and their aggregated tags
    return list(zip(sentence, word_tags))

# Example usage with a new Icelandic sentence (a list of pre-split words).
# Other inputs to try:
#   ["Hraunbær", "105", "."]
#   ["Niðurstaða", "þess", "var", "neikvæð", "."]
sentence = "Kl. 9-16 fótaaðgerðir og hárgreiðsla , Kl. 9.15 handavinna , Kl. 13.30 sungið við flygilinn , Kl. 14.30-16 dansað við lagaval Halldóru , kaffiveitingar allir velkomnir .".split()
predicted_tags = predict_tags(sentence, tokenizer, model, id2tag)

print("Predicted Tags:", predicted_tags)
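
The function above keeps the tag of each word's first sub-token, and its comments note that this can be customized. The sketch below shows one way to isolate that step so strategies can be compared without loading the model. The helper `aggregate_word_tags`, the "majority" strategy, and the placeholder tags (`A`, `B`, ...) are illustrative assumptions, not part of the project.

```python
from collections import Counter


def aggregate_word_tags(word_ids, tags, strategy="first"):
    """Collapse per-token tags into one tag per word.

    word_ids: per-token word indices, with None for special tokens
              (the shape returned by a fast tokenizer's word_ids()).
    tags:     per-token tag strings, aligned with word_ids.
    strategy: "first" keeps the first sub-token's tag;
              "majority" keeps the most frequent tag for the word.
    """
    grouped = {}
    for word_id, tag in zip(word_ids, tags):
        if word_id is None:  # skip special tokens such as [CLS] and [SEP]
            continue
        grouped.setdefault(word_id, []).append(tag)

    result = []
    for word_id in sorted(grouped):
        candidates = grouped[word_id]
        if strategy == "first":
            result.append(candidates[0])
        elif strategy == "majority":
            result.append(Counter(candidates).most_common(1)[0][0])
        else:
            raise ValueError(f"unknown strategy: {strategy!r}")
    return result


# Hypothetical example: word 1 is split into three sub-tokens.
word_ids = [None, 0, 1, 1, 1, 2, None]
tags = ["O", "A", "B", "C", "C", "D", "O"]
print(aggregate_word_tags(word_ids, tags, "first"))     # ['A', 'B', 'D']
print(aggregate_word_tags(word_ids, tags, "majority"))  # ['A', 'C', 'D']
```

Taking the first sub-token's tag is the common choice when only the first sub-token of each word was labeled during training. Whether that holds for this model depends on the training setup in the linked project.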

License

MIT License

Feel free to use this model for research and development purposes. For any commercial use, please contact the repository owner.

Citation

If you use this model in your work, please cite it as:

@misc{valgardg_icelandic_pos_tagger,
  author = {Valgard Gudni Oddsson},
  title = {Icelandic PoS Tagger},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/valgardg/learnice-pos-tagger}
}