Cyrile committed on
Commit
ebd7d93
1 Parent(s): 98d82d2

Update README.md

  1. README.md +4 -5
README.md CHANGED
@@ -16,9 +16,9 @@ Loss function
  -------------

  The training for the distilled model (student model) is designed to be the closest as possible to the original model (teacher model). To perform this the loss function is composed of 3 parts:
- * DistilLoss: a distillation loss which measures the closely probability between the student and teacher outputs with a cross-entropy loss on the MLM task ;
- * MLMLoss: a Masked Language Modeling (MLM) task loss to perform the student model with the original task of teacher model ;
- * CosineLoss: and finally a cosine embedding loss. This loss function is applied on the last hidden layers of student and teacher models to guarantee a collinearity between us.
+ * DistilLoss: a distillation loss which measures the silimarity between the probabilities at the outputs of the student and teacher models with a cross-entropy loss on the MLM task ;
+ * MLMLoss: a Masked Language Modeling (MLM) task loss to perform the student model with the original task of the teacher model ;
+ * CosineLoss: and finally a cosine embedding loss. This loss function is applied on the last hidden layers of student and teacher models to guarantee a collinearity between them.

  The final loss function is a combination of these three loss functions. We use the following ponderation:

@@ -27,7 +27,7 @@ Loss = 0.5 DistilLoss + 0.2 MLMLoss + 0.3 CosineLoss
  Dataset
  -------

- To limit the bias between the student and teacher models, the dataset used for the DstilCamemBERT training is the same as the camembert-base training one: OSCAR. The french part of this dataset approximately represents 140 GB on a hard drive disk.
+ To limit the bias between the student and teacher models, the dataset used for the DstilCamemBERT training is the same as the camembert-base training one: OSCAR. The French part of this dataset approximately represents 140 GB on a hard drive disk.

  Training
  --------
@@ -54,7 +54,6 @@ from transformers import CamembertModel, CamembertTokenizer
  tokeinzer = CamembertTokenizer.from_pretrained("cmarkea/distilcamembert-base")
  model = CamembertModel.from_pretrained("cmarkea/distilcamembert-base")
  model.eval()
-
  ...
  ```
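
The README text touched by this commit describes the training loss as `Loss = 0.5 DistilLoss + 0.2 MLMLoss + 0.3 CosineLoss`. As a reading aid, here is a minimal PyTorch sketch of how such a weighted sum could be computed. It is not the authors' training code: the function name `combined_distillation_loss`, the tensor shapes, the `-100` convention for non-masked positions, restricting all three terms to masked positions, and equal hidden sizes for student and teacher are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def combined_distillation_loss(student_logits, teacher_logits,
                               student_hidden, teacher_hidden, labels):
    """Hypothetical helper: 0.5 * DistilLoss + 0.2 * MLMLoss + 0.3 * CosineLoss.

    Assumed shapes: logits (batch, seq, vocab), hidden states (batch, seq, dim),
    labels (batch, seq) with -100 on positions that were not masked.
    """
    vocab = student_logits.size(-1)
    dim = student_hidden.size(-1)
    flat_labels = labels.reshape(-1)
    mask = flat_labels != -100  # assumption: only masked positions contribute

    s_logits = student_logits.reshape(-1, vocab)[mask]
    t_logits = teacher_logits.reshape(-1, vocab)[mask]

    # DistilLoss: cross-entropy of the student distribution against the
    # teacher's output probabilities (soft targets).
    t_probs = F.softmax(t_logits, dim=-1)
    s_log_probs = F.log_softmax(s_logits, dim=-1)
    distil_loss = -(t_probs * s_log_probs).sum(dim=-1).mean()

    # MLMLoss: standard cross-entropy against the true masked tokens.
    mlm_loss = F.cross_entropy(s_logits, flat_labels[mask])

    # CosineLoss: cosine embedding loss on the last hidden states, with
    # target +1 so that student and teacher vectors are pushed to align.
    s_h = student_hidden.reshape(-1, dim)[mask]
    t_h = teacher_hidden.reshape(-1, dim)[mask]
    target = torch.ones(s_h.size(0), device=s_h.device)
    cosine_loss = F.cosine_embedding_loss(s_h, t_h, target)

    return 0.5 * distil_loss + 0.2 * mlm_loss + 0.3 * cosine_loss
```

In an actual training loop the teacher tensors would normally be computed under `torch.no_grad()`, so that only the student receives gradients.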