---

license: cc-by-nc-4.0  
language:
- hu
- en  
metrics:
- accuracy
- f1  
model-index:
- name: Hun_Eng_RoBERTa_base_Plain  
  results:
  - task:
      type: text-classification  
    metrics:
      - type: accuracy  
        value: 0.75 (hu) / 0.65 (en)  
      - type: f1  
        value: 0.74 (hu) / 0.64 (en)  
widget:
- text: "A tanúsítvány meghatározott adatainak a 2008/118/EK irányelv IV. fejezete szerinti szállításához szükséges adminisztratív okmányban..."
  example_title: "Incomprehensible"
- text: "Az AEO-engedély birtokosainak listáján – keresésre – megjelenő információk: az engedélyes neve, az engedélyt kibocsátó ország..."
  example_title: "Comprehensible"

---

## Model description

Cased, fine-tuned `XLM-RoBERTa-base` model for Hungarian and English, trained on a dataset provided by the National Tax and Customs Administration of Hungary (NAV) and on an English version of the same dataset produced with the Google Translate API.

## Intended uses & limitations

The model is designed to classify sentences as either "comprehensible" or "not comprehensible" according to Plain Language guidelines (a short inference sketch follows the list):
* **Label_0** - "comprehensible" - The sentence is in Plain Language.
* **Label_1** - "not comprehensible" - The sentence is **not** in Plain Language.
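
As a quick illustration of how these labels surface at inference time, here is a minimal sketch using the `transformers` pipeline API. The raw label strings (`LABEL_0` / `LABEL_1`) and their mapping to the descriptions above are assumptions based on this card, not output verified against the released model:

```python
from transformers import pipeline

# Load the classifier; the pipeline typically reports raw labels such as "LABEL_0" / "LABEL_1"
classifier = pipeline(
    "text-classification",
    model="uvegesistvan/Hun_Eng_RoBERTa_base_Plain",
)

# Mapping assumed from the label descriptions above
id2meaning = {"LABEL_0": "comprehensible", "LABEL_1": "not comprehensible"}

result = classifier("The list of AEO authorisation holders shows the name of the holder and the issuing country.")[0]
print(id2meaning.get(result["label"], result["label"]), round(result["score"], 2))
```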

## Training

Fine-tuned version of the original `xlm-roberta-base` model, trained on a dataset of Hungarian legal and administrative texts. The model was also trained on an English translation of this dataset (produced with the Google Translate API) to support English classification.

## Eval results

### Hungarian Results:
| Class | Precision | Recall | F1-Score |
| ----- | --------- | ------ | -------- |
| **Comprehensible / Label_0** | **0.82** | **0.62** | **0.70** |
| **Not comprehensible / Label_1** | **0.71** | **0.88** | **0.78** |
| **accuracy** | | | **0.75** |
| **macro avg** | **0.77** | **0.75** | **0.74** |
| **weighted avg** | **0.76** | **0.75** | **0.74** |

### English Results:
| Class | Precision | Recall | F1-Score |
| ----- | --------- | ------ | -------- |
| **Comprehensible / Label_0** | **0.70** | **0.50** | **0.58** |
| **Not comprehensible / Label_1** | **0.63** | **0.80** | **0.70** |
| **accuracy** | | | **0.65** |
| **macro avg** | **0.66** | **0.65** | **0.64** |
| **weighted avg** | **0.66** | **0.65** | **0.64** |
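
The tables above appear to follow the layout of scikit-learn's `classification_report` (per-class precision/recall/F1 plus accuracy, macro avg, and weighted avg rows). A minimal sketch of how such a report can be produced is shown below; the `y_true` / `y_pred` arrays are placeholder data, not the actual evaluation set:

```python
from sklearn.metrics import classification_report

# Placeholder gold labels and predictions (0 = comprehensible, 1 = not comprehensible);
# the real evaluation data is not part of this card.
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 1, 1, 1, 0, 1, 0]

print(classification_report(
    y_true,
    y_pred,
    target_names=["Comprehensible / Label_0", "Not comprehensible / Label_1"],
    digits=2,
))
```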

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("uvegesistvan/Hun_Eng_RoBERTa_base_Plain")
model = AutoModelForSequenceClassification.from_pretrained("uvegesistvan/Hun_Eng_RoBERTa_base_Plain")
```
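
Building on the snippet above, a hedged inference sketch follows. The example sentence and the index-to-label mapping are taken from the label descriptions in this card; they are illustrative, not part of the released code:

```python
import torch

sentence = "The list of AEO authorisation holders shows the name of the holder and the issuing country."

# Tokenize and run a forward pass without gradient tracking
inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1).squeeze()
# Index mapping assumed from the "Intended uses & limitations" section:
# 0 = comprehensible, 1 = not comprehensible
labels = {0: "comprehensible (Label_0)", 1: "not comprehensible (Label_1)"}
pred = int(probs.argmax())
print(f"{labels[pred]} - probability {probs[pred]:.2f}")
```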