---
tags:
- pos-tagging
- icelandic
- nlp
license: mit
widget:
- text: "Hér er dæmasetning til að prófa."
datasets:
- MIM-GOLD
metrics:
- accuracy
- macro precision
- macro recall
---

# Icelandic PoS Tagger

This repository contains a Part-of-Speech (PoS) tagging model designed specifically for Icelandic sentences. The model can tag each word in a sentence with detailed linguistic features, including word class (noun, adjective, verb, etc.), gender, number, person, article (if applicable), and more.

The GitHub project contains the files used for training and evaluation, as well as demos that show how to run the model and produce human-readable output:

```
https://github.com/valgardg/learnice
```

---

## Model Overview

- **Language**: Icelandic
- **Task**: Part-of-Speech (PoS) tagging
- **Features Tagged**:
  - **Word Class**: e.g., noun, adjective, verb
  - **Gender**: masculine, feminine, neuter
  - **Number**: singular, plural
  - **Person**: 1st, 2nd, 3rd (for applicable word classes)
  - **Article**: definite/indefinite (if applicable)
  - **Other Linguistic Features**: Additional fine-grained details included in the tags.
- **Format**: Tags are output in a structured format based on Icelandic linguistic conventions.

---

## Files in This Repository

- **`config.json`**: Model configuration file, defining the architecture and settings.
- **`model.safetensors`**: Model weights stored in the efficient and secure SafeTensors format.
- **`tokenizer.json`**: Defines the tokenizer used for preprocessing Icelandic text.
- **`tokenizer_config.json`**: Configuration for the tokenizer.
- **`vocab.txt`**: Vocabulary file used by the tokenizer.
- **`special_tokens_map.json`**: Mapping of special tokens (e.g., `[CLS]`, `[SEP]`) used by the tokenizer.
- **`id2tag_ftbi_ds100.json`**: A JSON file mapping output IDs to the corresponding linguistic tags. This file is critical for interpreting the model's outputs.
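
As a rough illustration of how the mapping file is used, the sketch below loads a JSON file of this shape and converts predicted label IDs to tag strings. It assumes the file is a flat JSON object keyed by stringified IDs, which is how the usage example in this README indexes it. The helper names `load_id2tag` and `ids_to_tags`, the `"O"` fallback, and the sample tag strings are placeholders for illustration and are not taken from the actual file.

```python
import json
import os
import tempfile


def load_id2tag(path):
    """Load an ID-to-tag mapping stored as a JSON object with string keys."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def ids_to_tags(label_ids, id2tag, fallback="O"):
    """Translate integer label IDs to tag strings; unknown IDs get `fallback`."""
    return [id2tag.get(str(label_id), fallback) for label_id in label_ids]


# Demo with a tiny stand-in mapping (hypothetical contents, not the real file).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "id2tag_demo.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump({"0": "TAG_A", "1": "TAG_B"}, f)
    demo_id2tag = load_id2tag(demo_path)

demo_tags = ids_to_tags([1, 0, 7], demo_id2tag)
print(demo_tags)
```

JSON object keys are always strings, so integer label IDs have to be converted with `str()` before the lookup.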

---

## Installation and Setup

1. Clone the repository or download the model files:
   ```bash
   git clone https://huggingface.co/<username>/<repo_name>
   cd <repo_name>
    ```

2. Install the required libraries:
    ```bash
    pip install transformers huggingface_hub safetensors
    ```

3. Load the model and tokenizer in Python:
    ```python
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    model = AutoModelForTokenClassification.from_pretrained("<local_model_path>")
    tokenizer = AutoTokenizer.from_pretrained("<local_model_path>")
    ```

## Usage

### PoS Tagging an Icelandic Sentence

The example below tags a pre-tokenized Icelandic sentence (a list of words) and returns one tag per word:
    
    # Imports
    from transformers import BertTokenizerFast, BertForTokenClassification
    import torch
    import json

    # Load id2tag mapping (UTF-8, since tags may contain Icelandic characters)
    with open("../models/ftbi_ds100/id2tag_ftbi_ds100.json", "r", encoding="utf-8") as f:
        id2tag = json.load(f)

    # Load your tokenizer and model from saved checkpoint
    tokenizer = BertTokenizerFast.from_pretrained("../models/ftbi_ds100")
    model = BertForTokenClassification.from_pretrained("../models/ftbi_ds100")

    # Function to predict tags on a new sentence
    def predict_tags(sentence, tokenizer, model, id2tag):
        # Tokenize the sentence
        tokenized_input = tokenizer(sentence, is_split_into_words=True, return_tensors="pt")
        
        # Get predictions
        with torch.no_grad():
            output = model(**tokenized_input)
        
        # Get predicted label IDs
        label_ids = torch.argmax(output.logits, dim=2)[0].tolist()
        
        # Convert label IDs to tag names
        tags = [id2tag.get(str(label_id), 'O') for label_id in label_ids]
        
        # Match back to original words
        word_ids = tokenized_input.word_ids()  # This shows which original word each token corresponds to
        word_tags = []
        current_word_id = None
        current_tags = []

        # Aggregate tags for each word
        for word_id, tag in zip(word_ids, tags):
            if word_id is None:  # Skip special tokens
                continue
            if word_id != current_word_id:  # New word detected
                if current_tags:  # Append the aggregated tag for the previous word
                    word_tags.append(current_tags[0])  # Use the first tag, or customize this
                current_word_id = word_id
                current_tags = [tag]
            else:
                current_tags.append(tag)  # Aggregate tags for the same word

        # Append the last word's tag
        if current_tags:
            word_tags.append(current_tags[0])  # Use the first tag, or customize this
        
        # Return the original words and their aggregated tags
        return list(zip(sentence, word_tags))

    # Example usage with a new Icelandic sentence, given as a list of words.
    # Other inputs to try:
    #   ["Hraunbær", "105", "."]
    #   ["Niðurstaða", "þess", "var", "neikvæð", "."]
    sentence = "Kl. 9-16 fótaaðgerðir og hárgreiðsla , Kl. 9.15 handavinna , Kl. 13.30 sungið við flygilinn , Kl. 14.30-16 dansað við lagaval Halldóru , kaffiveitingar allir velkomnir .".split()
    predicted_tags = predict_tags(sentence, tokenizer, model, id2tag)

    print("Predicted Tags:", predicted_tags)
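
The example above keeps the tag of the first sub-token of each word and notes that this can be customized. As one possible customization, the sketch below factors the aggregation into a standalone helper that supports either the first-sub-token rule or a majority vote over a word's sub-tokens. The function name `aggregate_word_tags`, the `strategy` parameter, and the sample tags are assumptions made for illustration; whether majority voting improves accuracy for this model has not been evaluated here.

```python
from collections import Counter


def aggregate_word_tags(word_ids, tags, strategy="first"):
    """Collapse per-token tags into one tag per word.

    word_ids: per-token word index, with None for special tokens.
    tags: per-token tag strings, same length as word_ids.
    strategy: "first" keeps the first sub-token's tag; "majority" keeps the
        most common tag (ties go to the tag seen first within the word).
    """
    if strategy not in ("first", "majority"):
        raise ValueError(f"unknown strategy: {strategy!r}")

    # Group tags by the word they belong to, skipping special tokens.
    grouped = {}
    for word_id, tag in zip(word_ids, tags):
        if word_id is None:
            continue
        grouped.setdefault(word_id, []).append(tag)

    word_tags = []
    for word_id in sorted(grouped):
        token_tags = grouped[word_id]
        if strategy == "first":
            word_tags.append(token_tags[0])
        else:
            word_tags.append(Counter(token_tags).most_common(1)[0][0])
    return word_tags


# Hypothetical tokenization: word 0 is split into three sub-tokens.
demo_word_ids = [None, 0, 0, 0, 1, None]
demo_token_tags = ["O", "A", "B", "B", "C", "O"]

first_tags = aggregate_word_tags(demo_word_ids, demo_token_tags, "first")
majority_tags = aggregate_word_tags(demo_word_ids, demo_token_tags, "majority")
print(first_tags, majority_tags)
```

Inside `predict_tags`, a helper like this could replace the aggregation loop by passing in `tokenized_input.word_ids()` and the per-token `tags` list.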


## License
MIT License

Feel free to use this model for research and development purposes. For any commercial use, please contact the repository owner.

## Citation
If you use this model in your work, please cite it as:

```
@misc{valgardg_icelandic_pos_tagger,
  author = {Valgard Gudni Oddsson},
  title = {Icelandic PoS Tagger},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/valgardg/learnice-pos-tagger}
}
```