---
pipeline_tag: zero-shot-classification
license: mit
datasets:
- xnli
language:
- fr
tags:
- camembert
- text-classification
- nli
- xnli
---
This is a copy of the original BaptisteDoyen/camembert-base-xnli model, whose page currently returns a 404 error.\
The model card below is reproduced as it appeared on the BaptisteDoyen/camembert-base-xnli page.

# camembert-base-xnli

## Model description

CamemBERT-base model fine-tuned on the French part of the XNLI dataset.
One of the few zero-shot classification models that work on French 🇫🇷

## Intended uses & limitations

#### How to use

The model can be used in two ways:

- As a zero-shot sequence classifier:
```
from transformers import pipeline

# build a zero-shot classification pipeline backed by this model
classifier = pipeline("zero-shot-classification",
                      model="BaptisteDoyen/camembert-base-xnli")

sequence = "L'équipe de France joue aujourd'hui au Parc des Princes"
candidate_labels = ["sport","politique","science"]
hypothesis_template = "Ce texte parle de {}."

classifier(sequence, candidate_labels, hypothesis_template=hypothesis_template)
# outputs :
# {'sequence': "L'équipe de France joue aujourd'hui au Parc des Princes",
# 'labels': ['sport', 'politique', 'science'],
# 'scores': [0.8595073223114014, 0.10821866989135742, 0.0322740375995636]}
```
- As a premise/hypothesis checker:
The idea here is to compute a probability of the form $P(\text{premise} \mid \text{hypothesis})$.

```
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# load model and tokenizer
nli_model = AutoModelForSequenceClassification.from_pretrained("BaptisteDoyen/camembert-base-xnli")
tokenizer = AutoTokenizer.from_pretrained("BaptisteDoyen/camembert-base-xnli")
# sequences
premise = "le score pour les bleus est élevé"
hypothesis = "L'équipe de France a fait un bon match"
# tokenize the (premise, hypothesis) pair and run it through the model
x = tokenizer.encode(premise, hypothesis, return_tensors='pt')
logits = nli_model(x)[0]
# we throw away "neutral" (dim 1) and take the probability of
# "entailment" (0) as the probability of the label being true
entail_contradiction_logits = logits[:,::2]
probs = entail_contradiction_logits.softmax(dim=1)
prob_label_is_true = probs[:,0]
prob_label_is_true[0].tolist() * 100
# outputs
# 86.40775084495544
```
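The two usages are closely related: a zero-shot classifier turns each candidate label into a hypothesis via the template, then scores it against the input as an NLI pair. The sketch below illustrates that idea in plain Python, without loading the model. The helper names `build_hypotheses` and `entailment_probability` and the logit values are hypothetical, chosen only for illustration; the label order (entailment, neutral, contradiction) follows the comments in the snippet above.

```python
import math

def build_hypotheses(candidate_labels, hypothesis_template):
    """Fill the template once per candidate label (hypothetical helper)."""
    return [hypothesis_template.format(label) for label in candidate_labels]

def entailment_probability(logits):
    """Softmax over entailment and contradiction only, ignoring neutral.

    `logits` is assumed to be ordered [entailment, neutral, contradiction].
    """
    entail, _neutral, contra = logits
    m = max(entail, contra)  # subtract the max for numerical stability
    e = math.exp(entail - m)
    c = math.exp(contra - m)
    return e / (e + c)

hypotheses = build_hypotheses(["sport", "politique", "science"],
                              "Ce texte parle de {}.")

# Made-up logits for one (premise, hypothesis) pair, not real model output.
example_logits = [2.0, 0.5, -1.0]
p = entailment_probability(example_logits)
```

Dropping the neutral logit before the softmax means the score answers "entailment rather than contradiction", which is why a large neutral logit does not lower it.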

## Training data

Training data is the French split of the [XNLI](https://research.fb.com/publications/xnli-evaluating-cross-lingual-sentence-representations/) dataset, released in 2018 by Facebook.
It can be loaded with the datasets library:

```
from datasets import load_dataset
dataset = load_dataset('xnli', 'fr')                     
```
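To give a sense of what a training example looks like, here is a minimal sketch of one record and how its integer label maps to a class name. The record is hand-written for illustration (it reuses the sentences from the usage section and is not taken from the dataset); the field names and label order are assumed to match the `xnli` French configuration, and `describe` is a hypothetical helper.

```python
# Assumed label order for XNLI: 0 = entailment, 1 = neutral, 2 = contradiction.
LABEL_NAMES = ["entailment", "neutral", "contradiction"]

# Hand-written record shaped like one row of the French split (illustrative only).
example = {
    "premise": "le score pour les bleus est élevé",
    "hypothesis": "L'équipe de France a fait un bon match",
    "label": 0,
}

def describe(record):
    """Return the sentence pair together with the readable label name."""
    return record["premise"], record["hypothesis"], LABEL_NAMES[record["label"]]

premise, hypothesis, label_name = describe(example)
```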

## Training/Fine-Tuning procedure

The training procedure is fairly basic and was performed in the cloud on a single GPU.
Main training parameters:

- `lr = 2e-5` with `lr_scheduler_type = "linear"`
- `num_train_epochs = 4`
- `batch_size = 12` (limited by GPU memory)
- `weight_decay = 0.01`
- `metric_for_best_model = "eval_accuracy"`
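As a rough sketch, the parameters above could be collected into keyword arguments for `transformers.TrainingArguments`, and the linear schedule can be written out by hand. The mapping of `lr` to `learning_rate` and of `batch_size` to `per_device_train_batch_size` is an assumption, as is the absence of warmup (the card does not mention any); `linear_lr` is a hypothetical helper, not the original training code.

```python
# Keyword arguments one might pass as TrainingArguments(output_dir=..., **training_kwargs).
training_kwargs = {
    "learning_rate": 2e-5,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 4,
    "per_device_train_batch_size": 12,  # assumed mapping of "batch_size"
    "weight_decay": 0.01,
    "metric_for_best_model": "eval_accuracy",
}

def linear_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Learning rate under linear decay to zero, with optional linear warmup.

    warmup_steps=0 is an assumption; the card does not state a warmup setting.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Halfway through a hypothetical 100-step run, the rate is half the base value.
halfway = linear_lr(50, 100)
```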

## Eval results

We obtain the following results on the validation and test sets:

Set|Accuracy (%)
---|---
validation|81.4
test|81.7
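For reference, accuracy here is the share of examples whose highest-scoring class matches the gold label. The sketch below shows that computation on made-up logits and labels; the `accuracy` function and the toy data are illustrative assumptions, not the evaluation script used for the figures above.

```python
def accuracy(logits, labels):
    """Percentage of rows whose argmax over the class logits equals the label."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)

# Toy 3-class logits and gold labels (made up for illustration).
toy_logits = [
    [2.0, 0.1, -1.0],
    [0.2, 1.5, 0.3],
    [-0.5, 0.0, 1.2],
    [1.0, 0.9, 0.8],
]
toy_labels = [0, 1, 2, 1]

score = accuracy(toy_logits, toy_labels)  # 3 of the 4 predictions match
```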