lighteternal
committed on
Commit
·
94edbd3
1
Parent(s):
493c5b4
Update from earendil
- README.md +101 -0
- config.json +39 -0
- pytorch_model.bin +3 -0
- sentencepiece.bpe.model +3 -0
- special_tokens_map.json +1 -0
- tokenizer.json +0 -0
- tokenizer_config.json +1 -0
README.md
ADDED
@@ -0,0 +1,101 @@
---
language: el
pipeline_tag: zero-shot-classification
tags:
- xlm-roberta-base
datasets:
- multi_nli
- snli
- allnli_greek
metrics:
- accuracy
license: apache-2.0
widget:
- text: "Το Facebook κυκλοφόρησε τα πρώτα «έξυπνα» γυαλιά επαυξημένης πραγματικότητας"
  candidate_labels: "πολιτική, τεχνολογία, αθλητισμός"
---

# Cross-Encoder for Greek Natural Language Inference (Textual Entailment) & Zero-Shot Classification
This model was trained using the [SentenceTransformers](https://sbert.net) [Cross-Encoder](https://www.sbert.net/examples/applications/cross-encoder/README.html) class.
#### By the
## Training Data
The model was trained on the Greek version of the combined AllNLI dataset ([SNLI](https://nlp.stanford.edu/projects/snli/) and [MultiNLI](https://cims.nyu.edu/~sbowman/multinli/)), which was created using the EN2EL NMT model available [here](https://huggingface.co/lighteternal/SSE-TUC-mt-en-el-cased).

The model can be used in two ways:
* NLI/Textual Entailment: For a given sentence pair, it outputs three scores corresponding to the labels contradiction, entailment, and neutral.
* Zero-shot classification through the Hugging Face pipeline: Given a sentence and a set of labels/topics, it outputs the likelihood of the sentence belonging to each topic. Under the hood, the logit for entailment between the sentence and each label is taken as the logit for the candidate label being valid.

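The zero-shot idea described above can be sketched in plain Python. This is a minimal illustration of the single-label case, not the pipeline's actual code: it assumes the entailment logit sits at index 1 (the label order `['contradiction', 'entailment', 'neutral']` used in this README's examples), and the logits are made-up values, not model output. The names `zero_shot_scores` and `example_logits` are hypothetical.

```python
import math

# Index of "entailment" in the label order used by this README's examples:
# ['contradiction', 'entailment', 'neutral'].
ENTAILMENT_INDEX = 1


def zero_shot_scores(nli_logits_per_label):
    """Softmax over the entailment logits of all candidate labels.

    nli_logits_per_label maps each candidate label to the three NLI logits
    obtained for the pair (sentence, hypothesis built from that label).
    """
    labels = list(nli_logits_per_label)
    entail = [nli_logits_per_label[label][ENTAILMENT_INDEX] for label in labels]
    peak = max(entail)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in entail]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}


# Made-up logits, for illustration only (not produced by the model).
example_logits = {
    "πολιτική": [1.2, -2.0, 0.5],
    "τεχνολογία": [-2.5, 3.1, -0.4],
    "αθλητισμός": [0.9, -1.7, 0.8],
}

scores = zero_shot_scores(example_logits)
best_label = max(scores, key=scores.get)
print(best_label, scores)
```

The candidate label with the largest entailment logit receives the highest score, and the scores sum to 1 across the candidate labels.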
## Performance

Classification accuracy (entailment, contradiction, neutral) on the mixed (Greek+English) AllNLI-dev set:
| Metric | Value |
| --- | --- |
| Accuracy | 0.8409 |

## To use the model for NLI/Textual Entailment

#### Usage with sentence_transformers

Pre-trained models can be used like this:
```python
from sentence_transformers import CrossEncoder

# 'MODEL_NAME' is a placeholder for this model's identifier on the Hub.
model = CrossEncoder('MODEL_NAME')
# Each tuple is a (premise, hypothesis) pair.
scores = model.predict([('Δύο άνθρωποι συναντιούνται στο δρόμο', 'Ο δρόμος έχει κόσμο'),
                        ('Ένα μαύρο αυτοκίνητο ξεκινάει στη μέση του πλήθους.', 'Ένας άντρας οδηγάει σε ένα μοναχικό δρόμο'),
                        ('Δυο γυναίκες μιλάνε στο κινητό', 'Το τραπέζι ήταν πράσινο')])

# Convert scores to labels
label_mapping = ['contradiction', 'entailment', 'neutral']
labels = [label_mapping[score_max] for score_max in scores.argmax(axis=1)]
print(scores, labels)

# Outputs
# [[-3.1526504  2.9981945 -0.3108107]
#  [ 5.0549307 -2.757949  -1.6220676]
#  [-0.5124733 -2.2671669  3.1630592]] ['entailment', 'contradiction', 'neutral']
```

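The scores printed in the example above are raw logits. If probabilities are easier to read, a softmax over each row converts them. The sketch below uses only the standard library and the logits copied from that printed output; the helper name `softmax` is an illustrative choice, not part of the model's API.

```python
import math

label_mapping = ['contradiction', 'entailment', 'neutral']

# Logits copied from the printed output of the CrossEncoder example.
scores = [
    [-3.1526504, 2.9981945, -0.3108107],
    [5.0549307, -2.757949, -1.6220676],
    [-0.5124733, -2.2671669, 3.1630592],
]


def softmax(row):
    """Turn one row of logits into probabilities that sum to 1."""
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = [softmax(row) for row in scores]
labels = [label_mapping[row.index(max(row))] for row in probabilities]

for label, row in zip(labels, probabilities):
    print(label, [round(p, 4) for p in row])
```

Softmax keeps the ordering of the logits, so the predicted labels are the same as those obtained with `argmax` on the raw scores.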
#### Usage with Transformers AutoModel
You can also use the model directly with the Transformers library (without the SentenceTransformers library):
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained('MODEL_NAME')
tokenizer = AutoTokenizer.from_pretrained('MODEL_NAME')

# First list: premises; second list: hypotheses. The tokenizer pairs them element-wise.
features = tokenizer(['Δύο άνθρωποι συναντιούνται στο δρόμο', 'Ένα μαύρο αυτοκίνητο ξεκινάει στη μέση του πλήθους.'],
                     ['Ο δρόμος έχει κόσμο', 'Ένας άντρας οδηγάει σε ένα μοναχικό δρόμο.'],
                     padding=True, truncation=True, return_tensors="pt")

model.eval()
with torch.no_grad():
    scores = model(**features).logits
label_mapping = ['contradiction', 'entailment', 'neutral']
labels = [label_mapping[score_max] for score_max in scores.argmax(dim=1)]
print(labels)
```

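When the tokenizer is called with two lists, it pairs the i-th element of the first list with the i-th element of the second, so the first list should hold the premises and the second the hypotheses. A small sketch, assuming the sentence pairs are already available as `(premise, hypothesis)` tuples as in the `CrossEncoder` example; the variable names are illustrative.

```python
# Sentence pairs in (premise, hypothesis) form, as passed to CrossEncoder.predict.
pairs = [
    ('Δύο άνθρωποι συναντιούνται στο δρόμο', 'Ο δρόμος έχει κόσμο'),
    ('Ένα μαύρο αυτοκίνητο ξεκινάει στη μέση του πλήθους.', 'Ένας άντρας οδηγάει σε ένα μοναχικό δρόμο.'),
]

# Split the tuples into two parallel lists, so that a call such as
# tokenizer(premises, hypotheses, ...) pairs premises[i] with hypotheses[i].
premises, hypotheses = (list(column) for column in zip(*pairs))

print(premises)
print(hypotheses)
```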
## To use the model for Zero-Shot Classification
This model can also be used for zero-shot classification:
```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model='MODEL_NAME')

sent = "Το Facebook κυκλοφόρησε τα πρώτα «έξυπνα» γυαλιά επαυξημένης πραγματικότητας"
candidate_labels = ["πολιτική", "τεχνολογία", "αθλητισμός"]
res = classifier(sent, candidate_labels)
print(res)
```
### Acknowledgement
The research work was supported by the Hellenic Foundation for Research and Innovation (HFRI) under the HFRI PhD Fellowship grant (Fellowship Number: 50, 2nd call).

### Citation info
Citation for the Greek model TBA.
Based on the work [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks](https://arxiv.org/abs/1908.10084).
Kudos to @nreimers (Nils Reimers) for his support on GitHub.

config.json
ADDED
@@ -0,0 +1,39 @@
{
  "_name_or_path": "/home/earendil/Desktop/ML_playground/sentence-transformers/examples/training/cross-encoder/output/training_allnli-2021-09-18_19-10-56",
  "architectures": [
    "XLMRobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "xlm-roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.10.0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 250002
}
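The `id2label` and `label2id` entries in this config use the generic names `LABEL_0` to `LABEL_2`. It appears that the Transformers zero-shot pipeline picks the entailment class by searching `label2id` for a name that starts with `entail`, and falls back to the last index when none is found. The sketch below mirrors that lookup in plain Python under this assumption; `find_entailment_id` and the `named` mapping are hypothetical and are not part of this commit, so the behavior should be checked against the installed Transformers version.

```python
def find_entailment_id(label2id):
    """Return the index of the first label whose name starts with 'entail', else -1."""
    for label, index in label2id.items():
        if label.lower().startswith("entail"):
            return index
    return -1


# Mapping as stored in this config.json.
generic = {"LABEL_0": 0, "LABEL_1": 1, "LABEL_2": 2}

# Hypothetical mapping that follows the label order given in the README.
named = {"contradiction": 0, "entailment": 1, "neutral": 2}

print(find_entailment_id(generic))
print(find_entailment_id(named))
```

With the generic names the lookup finds no match, so it may be worth verifying which logit the pipeline actually uses for this model.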
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a43dbce83e22dd86cabe5dcda54068f4f846ac6f3a1f6b2bed1a03d1ac38e3a4
size 1112274377
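The three lines added for `pytorch_model.bin` are a Git LFS pointer rather than the weights themselves: each line is a key followed by a space and a value. A minimal sketch of reading such a pointer, assuming that simple `key value` layout; `parse_lfs_pointer` is an illustrative helper, not part of any library.

```python
def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


# Pointer contents as added for pytorch_model.bin in this commit.
pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:a43dbce83e22dd86cabe5dcda54068f4f846ac6f3a1f6b2bed1a03d1ac38e3a4
size 1112274377
"""

pointer = parse_lfs_pointer(pointer_text)
size_bytes = int(pointer["size"])  # size of the real file, in bytes
print(pointer["oid"], f"{size_bytes / 1024**2:.1f} MiB")
```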
sentencepiece.bpe.model
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
size 5069051
special_tokens_map.json
ADDED
@@ -0,0 +1 @@
{"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "sep_token": "</s>", "pad_token": "<pad>", "cls_token": "<s>", "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": false}}
tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
@@ -0,0 +1 @@
{"bos_token": "<s>", "eos_token": "</s>", "sep_token": "</s>", "cls_token": "<s>", "unk_token": "<unk>", "pad_token": "<pad>", "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "model_max_length": 512, "special_tokens_map_file": null, "name_or_path": "/home/earendil/Desktop/ML_playground/sentence-transformers/examples/training/cross-encoder/output/training_allnli-2021-09-18_19-10-56", "tokenizer_class": "XLMRobertaTokenizer"}