---
tags:
- pos-tagging
- icelandic
- nlp
license: mit
widget:
- text: "Hér er dæmasetning til að prófa."
datasets:
- MIM-GOLD
metrics:
- accuracy
- macro precision
- macro recall
---

# Icelandic PoS Tagger

This repository contains a Part-of-Speech (PoS) tagging model for Icelandic. The model tags each word in a sentence with detailed linguistic features, including word class (noun, adjective, verb, etc.), gender, number, person, and article (where applicable), among others.

The GitHub project contains the files used for training and evaluation, as well as demos that show how to use the model and produce readable output:

```
https://github.com/valgardg/learnice
```

---

## Model Overview

- **Language**: Icelandic
- **Task**: Part-of-Speech (PoS) tagging
- **Features Tagged**:
  - **Word Class**: e.g., noun, adjective, verb
  - **Gender**: masculine, feminine, neuter
  - **Number**: singular, plural
  - **Person**: 1st, 2nd, 3rd (for applicable word classes)
  - **Article**: definite/indefinite (if applicable)
  - **Other Linguistic Features**: Additional fine-grained details included in the tags.
- **Format**: Tags are output in a structured format based on Icelandic linguistic conventions.
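
To illustrate what a structured tag can encode, here is a minimal sketch that decodes a noun tag into named features. It assumes a positional layout in the style of the MIM-GOLD tagset (word class, gender, number, case, optional suffixed article); this card does not spell out the layout, so the letter tables and the helper `describe_noun_tag` are assumptions to verify against `id2tag_ftbi_ds100.json`.

```python
# Assumed positional layout for noun tags (not confirmed by this model card):
# position 0 = word class, 1 = gender, 2 = number, 3 = case, 4 = suffixed article.
GENDER = {"k": "masculine", "v": "feminine", "h": "neuter"}
NUMBER = {"e": "singular", "f": "plural"}
CASE = {"n": "nominative", "o": "accusative", "þ": "dative", "e": "genitive"}


def describe_noun_tag(tag: str) -> dict:
    """Decode a noun tag such as 'nkeng' into named features (hypothetical helper)."""
    if len(tag) < 4 or tag[0] != "n":
        raise ValueError(f"not a noun tag: {tag!r}")
    return {
        "word_class": "noun",
        "gender": GENDER.get(tag[1], "unspecified"),
        "number": NUMBER.get(tag[2], "unspecified"),
        "case": CASE.get(tag[3], "unspecified"),
        # A 'g' in the fifth position is assumed to mark the suffixed definite article.
        "definite_article": len(tag) > 4 and tag[4] == "g",
    }


print(describe_noun_tag("nkeng"))
```

Other word classes use different positions and letters, so a full decoder would need one table per word class.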

---

## Files in This Repository

- **`config.json`**: Model configuration file, defining the architecture and settings.
- **`model.safetensors`**: Model weights stored in the efficient and secure SafeTensors format.
- **`tokenizer.json`**: Defines the tokenizer used for preprocessing Icelandic text.
- **`tokenizer_config.json`**: Configuration for the tokenizer.
- **`vocab.txt`**: Vocabulary file used by the tokenizer.
- **`special_tokens_map.json`**: Mapping of special tokens (e.g., `[CLS]`, `[SEP]`) used by the tokenizer.
- **`id2tag_ftbi_ds100.json`**: A JSON file mapping output IDs to the corresponding linguistic tags. This file is critical for interpreting the model's outputs.
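
Because the ID-to-tag mapping is plain JSON, its keys are strings rather than integers. The sketch below shows one way to load such a file with integer keys and get a rough overview of the tagset. The helpers `load_id2tag` and `summarize_tagset` are hypothetical, and the three-entry mapping is toy data written to a temporary file so the example runs without the real `id2tag_ftbi_ds100.json`.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def load_id2tag(path):
    """Read an ID-to-tag mapping and convert the JSON string keys to ints."""
    with Path(path).open(encoding="utf-8") as f:
        raw = json.load(f)
    return {int(label_id): tag for label_id, tag in raw.items()}


def summarize_tagset(id2tag):
    """Count tags by their first character, a rough proxy for word class."""
    return Counter(tag[0] for tag in id2tag.values() if tag)


# Toy mapping for illustration only; the real file has many more entries.
toy_mapping = {"0": "nken", "1": "nveo", "2": "sfg3en"}

with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "id2tag_toy.json"
    toy_path.write_text(json.dumps(toy_mapping), encoding="utf-8")
    id2tag = load_id2tag(toy_path)

print(id2tag[0])
print(summarize_tagset(id2tag))
```

With integer keys, the output of `argmax` can index the mapping directly, without a `str(...)` conversion for every label.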

---

## Installation and Setup

1. Clone the repository or download the model files:

   ```bash
   git clone https://huggingface.co/<username>/<repo_name>
   cd <repo_name>
   ```

2. Install the required libraries (`torch` is needed to run the model):

   ```bash
   pip install transformers huggingface_hub safetensors torch
   ```

3. Load the model and tokenizer in Python:

   ```python
   from transformers import AutoModelForTokenClassification, AutoTokenizer

   model = AutoModelForTokenClassification.from_pretrained("<local_model_path>")
   tokenizer = AutoTokenizer.from_pretrained("<local_model_path>")
   ```

## Usage

### PoS Tagging an Icelandic Sentence

The following example tags Icelandic sentences. Each sentence must already be split into a list of words and punctuation marks:

```python
import json

import torch
from transformers import BertForTokenClassification, BertTokenizerFast

# Load the id2tag mapping (the JSON keys are label IDs stored as strings)
with open("../models/ftbi_ds100/id2tag_ftbi_ds100.json", "r", encoding="utf-8") as f:
    id2tag = json.load(f)

# Load the tokenizer and the fine-tuned model from the saved checkpoint
tokenizer = BertTokenizerFast.from_pretrained("../models/ftbi_ds100")
model = BertForTokenClassification.from_pretrained("../models/ftbi_ds100")


def predict_tags(sentence, tokenizer, model, id2tag):
    """Return (word, tag) pairs for a sentence given as a list of words."""
    # Tokenize the pre-split sentence into sub-word tokens
    tokenized_input = tokenizer(sentence, is_split_into_words=True, return_tensors="pt")

    # Run the model without tracking gradients
    with torch.no_grad():
        output = model(**tokenized_input)

    # Pick the highest-scoring label ID for every token in the single batch item
    label_ids = torch.argmax(output.logits, dim=2)[0].tolist()

    # Convert label IDs to tag names, falling back to 'O' for unknown IDs
    tags = [id2tag.get(str(label_id), "O") for label_id in label_ids]

    # word_ids() maps each token to the index of the word it came from
    word_ids = tokenized_input.word_ids()

    # Keep the tag of the first sub-word token of each word
    word_tags = []
    previous_word_id = None
    for word_id, tag in zip(word_ids, tags):
        if word_id is None:  # Skip special tokens such as [CLS] and [SEP]
            continue
        if word_id != previous_word_id:  # First token of a new word
            word_tags.append(tag)
            previous_word_id = word_id

    # Return the original words paired with their tags
    return list(zip(sentence, word_tags))


# Example usage with Icelandic sentences
sentences = [
    ["Hraunbær", "105", "."],
    ["Niðurstaða", "þess", "var", "neikvæð", "."],
    "Kl. 9-16 fótaaðgerðir og hárgreiðsla , Kl. 9.15 handavinna , Kl. 13.30 sungið við flygilinn , Kl. 14.30-16 dansað við lagaval Halldóru , kaffiveitingar allir velkomnir .".split(),
]

for sentence in sentences:
    predicted_tags = predict_tags(sentence, tokenizer, model, id2tag)
    print("Predicted Tags:", predicted_tags)
```
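
The example above keeps the tag predicted for the first sub-word token of each word. One possible alternative, sketched here as an assumption rather than the method used in training or evaluation, is a majority vote over all sub-word tokens of a word. The helper `aggregate_by_majority` is hypothetical and takes the same `word_ids` and `tags` lists that `predict_tags` builds internally.

```python
from collections import Counter


def aggregate_by_majority(word_ids, tags):
    """Return one tag per word: the most frequent tag among its sub-word tokens.

    Ties go to the tag that appears first within the word.
    """
    grouped = {}
    for word_id, tag in zip(word_ids, tags):
        if word_id is None:  # Skip special tokens
            continue
        grouped.setdefault(word_id, []).append(tag)
    # Counter.most_common orders equal counts by first occurrence
    return [Counter(grouped[w]).most_common(1)[0][0] for w in sorted(grouped)]


# Toy input: word 0 is split into three tokens, word 1 into one
word_ids = [None, 0, 0, 0, 1, None]
tags = ["O", "nken", "nveo", "nveo", "sfg3en", "O"]
print(aggregate_by_majority(word_ids, tags))
```

For this toy input the first-token strategy would pick `nken` for the first word, while the majority vote picks `nveo`. Which strategy works better depends on how labels were aligned to sub-word tokens during training.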

## License

MIT License

Feel free to use this model for research and development purposes. For any commercial use, please contact the repository owner.

## Citation

If you use this model in your work, please cite it as:

```
@misc{valgardg_icelandic_pos_tagger,
  author = {Valgard Gudni Oddsson},
  title = {Icelandic PoS Tagger},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/valgardg/learnice-pos-tagger}
}
```