---
tags:
- pos-tagging
- icelandic
- nlp
license: mit
widget:
- text: "Hér er dæmasetning til að prófa."
datasets:
- MIM-GOLD
metrics:
- accuracy
- macro precision
- macro recall
---

# Icelandic PoS Tagger

This repository contains a Part-of-Speech (PoS) tagging model designed specifically for Icelandic sentences. The model tags each word in a sentence with detailed linguistic features, including word class (noun, adjective, verb, etc.), gender, number, person, article (if applicable), and more.

The GitHub project includes the files used for training and evaluation, as well as demos for using the model and producing human-readable output:

```
https://github.com/valgardg/learnice
```

---

## Model Overview

- **Language**: Icelandic
- **Task**: Part-of-Speech (PoS) tagging
- **Features Tagged**:
  - **Word Class**: e.g., noun, adjective, verb
  - **Gender**: masculine, feminine, neuter
  - **Number**: singular, plural
  - **Person**: 1st, 2nd, 3rd (for applicable word classes)
  - **Article**: definite/indefinite (if applicable)
  - **Other Linguistic Features**: Additional fine-grained details included in the tags.
- **Format**: Tags are output in a structured format based on Icelandic linguistic conventions.

---

## Files in This Repository

- **`config.json`**: Model configuration file, defining the architecture and settings.
- **`model.safetensors`**: Model weights stored in the SafeTensors format.
- **`tokenizer.json`**: Defines the tokenizer used for preprocessing Icelandic text.
- **`tokenizer_config.json`**: Configuration for the tokenizer.
- **`vocab.txt`**: Vocabulary file used by the tokenizer.
- **`special_tokens_map.json`**: Mapping of special tokens (e.g., `[CLS]`, `[SEP]`) used by the tokenizer.
- **`id2tag_ftbi_ds100.json`**: A JSON file mapping output IDs to the corresponding linguistic tags. This file is required to interpret the model's outputs.

---

## Installation and Setup

1. Clone the repository or download the model files:

   ```bash
   git clone https://huggingface.co//
   cd
   ```

2. Install the required libraries:

   ```bash
   pip install transformers huggingface_hub safetensors
   ```

3. Load the model and tokenizer in Python:

   ```python
   from transformers import AutoModelForTokenClassification, AutoTokenizer

   model = AutoModelForTokenClassification.from_pretrained("")
   tokenizer = AutoTokenizer.from_pretrained("")
   ```

## Usage

### PoS Tagging an Icelandic Sentence

Here is an example of how to use the model to tag Icelandic sentences. The input must already be split into words, because the tokenizer is called with `is_split_into_words=True`.

```python
import json

import torch
from transformers import BertTokenizerFast, BertForTokenClassification

# Load the id2tag mapping (keys are label IDs stored as strings)
with open("../models/ftbi_ds100/id2tag_ftbi_ds100.json", "r", encoding="utf-8") as f:
    id2tag = json.load(f)

# Load the tokenizer and the fine-tuned model from the saved checkpoint
tokenizer = BertTokenizerFast.from_pretrained("../models/ftbi_ds100")
model = BertForTokenClassification.from_pretrained("../models/ftbi_ds100")


def predict_tags(sentence, tokenizer, model, id2tag):
    """Return (word, tag) pairs for a pre-tokenized sentence (a list of words)."""
    # Tokenize the pre-split words into sub-word tokens
    tokenized_input = tokenizer(sentence, is_split_into_words=True, return_tensors="pt")

    # Get predictions without tracking gradients
    with torch.no_grad():
        output = model(**tokenized_input)

    # Predicted label ID for every token of the single sentence in the batch
    label_ids = torch.argmax(output.logits, dim=2)[0].tolist()

    # Convert label IDs to tag names, falling back to 'O' for unknown IDs
    tags = [id2tag.get(str(label_id), "O") for label_id in label_ids]

    # word_ids() gives the index of the original word for each token
    # (None for special tokens such as [CLS] and [SEP])
    word_ids = tokenized_input.word_ids()

    word_tags = []
    current_word_id = None
    for word_id, tag in zip(word_ids, tags):
        if word_id is None:  # Skip special tokens
            continue
        if word_id != current_word_id:
            # First sub-token of a new word: use its tag for the whole word
            # (customize this if you prefer another aggregation)
            word_tags.append(tag)
            current_word_id = word_id

    # Return the original words paired with their tags
    return list(zip(sentence, word_tags))


# Example usage with a new Icelandic sentence.
# Shorter alternatives to try:
# sentence = ["Hraunbær", "105", "."]
# sentence = ["Niðurstaða", "þess", "var", "neikvæð", "."]
sentence = "Kl. 9-16 fótaaðgerðir og hárgreiðsla , Kl. 9.15 handavinna , Kl. 13.30 sungið við flygilinn , Kl. 14.30-16 dansað við lagaval Halldóru , kaffiveitingar allir velkomnir .".split()

predicted_tags = predict_tags(sentence, tokenizer, model, id2tag)
print("Predicted Tags:", predicted_tags)
```

## License

MIT License

Feel free to use this model for research and development purposes. For any commercial use, please contact the repository owner.

## Citation

If you use this model in your work, please cite it as:

```
@misc{valgardg_icelandic_pos_tagger,
  author = {Valgard Gudni Oddsson},
  title = {Icelandic PoS Tagger},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/valgardg/learnice-pos-tagger}
}
```
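
## Customizing Tag Aggregation

The usage example assigns each word the tag predicted for its first sub-token and notes that this can be customized. The following is a minimal sketch of one possible alternative, a majority vote over a word's sub-token tags. The helper `aggregate_word_tags` and its `strategy` parameter are hypothetical and not part of the repository; the tag strings in the example are placeholders, not real tags from the model. The function expects the per-token `word_ids` and `tags` lists in the shape that `predict_tags` computes internally.

```python
from collections import Counter


def aggregate_word_tags(word_ids, tags, strategy="first"):
    """Collapse per-token tags into one tag per word.

    Hypothetical helper for illustration; not part of the repository.

    word_ids: word index for each token, None for special tokens
              (the shape returned by a fast tokenizer's word_ids()).
    tags:     predicted tag for each token, same length as word_ids.
    strategy: "first" keeps the first sub-token's tag (as in the usage
              example); "majority" takes the most common tag, with ties
              going to the tag that appears earliest in the word.
    """
    if strategy not in ("first", "majority"):
        raise ValueError(f"unknown strategy: {strategy!r}")

    # Group the tags of each word's sub-tokens, skipping special tokens
    grouped = {}
    for word_id, tag in zip(word_ids, tags):
        if word_id is None:
            continue
        grouped.setdefault(word_id, []).append(tag)

    word_tags = []
    for word_id in sorted(grouped):
        token_tags = grouped[word_id]
        if strategy == "first":
            word_tags.append(token_tags[0])
        else:
            # Counter keeps equal counts in first-seen order
            word_tags.append(Counter(token_tags).most_common(1)[0][0])
    return word_tags


# Placeholder data: word 0 is split into three sub-tokens, word 1 into one
word_ids = [None, 0, 0, 0, 1, None]
tags = ["O", "TAG_A", "TAG_B", "TAG_B", "TAG_C", "O"]

print(aggregate_word_tags(word_ids, tags))                       # ['TAG_A', 'TAG_C']
print(aggregate_word_tags(word_ids, tags, strategy="majority"))  # ['TAG_B', 'TAG_C']
```

Whether a majority vote improves on the first-sub-token rule depends on how the model was trained and evaluated, so treat this as a starting point for experimentation rather than a recommended setting.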