liwii committed on
Commit
c373333
·
1 Parent(s): dcf2f4f

Update README.md

  1. README.md +38 -11
README.md CHANGED
@@ -8,29 +8,56 @@ model-index:
8
  results: []
9
  ---
10
 
11
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
12
- should probably proofread and complete it, then remove this comment. -->
13
 
14
  # fluency-score-classification-ja
15
 
16
- This model is a fine-tuned version of [line-corporation/line-distilbert-base-japanese](https://huggingface.co/line-corporation/line-distilbert-base-japanese) on the None dataset.
17
  It achieves the following results on the evaluation set:
18
  - Loss: 0.1912
19
- - Roc Auc: 0.9811
20
 
21
  ## Model description
22
-
23
- More information needed
24
 
25
  ## Intended uses & limitations
26
-
27
- More information needed
28
 
29
  ## Training and evaluation data
30
-
31
- More information needed
32
 
33
  ## Training procedure
 
34
 
35
  ### Training hyperparameters
36
 
@@ -60,4 +87,4 @@ The following hyperparameters were used during training:
60
  - Transformers 4.34.0
61
  - Pytorch 2.0.0+cu118
62
  - Datasets 2.14.5
63
- - Tokenizers 0.14.0
 
8
  results: []
9
  ---
10
 
 
 
11
 
12
  # fluency-score-classification-ja
13
 
14
+ This model is a fine-tuned version of [line-corporation/line-distilbert-base-japanese](https://huggingface.co/line-corporation/line-distilbert-base-japanese) on the ["日本語文法誤りデータセット"](https://github.com/liwii/ja_perturbed/tree/main) (Japanese grammatical error dataset).
15
  It achieves the following results on the evaluation set:
16
  - Loss: 0.1912
17
+ - ROC AUC: 0.9811
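
For readers unfamiliar with the metric, ROC AUC can be read as the probability that a randomly chosen fluent example receives a higher score than a randomly chosen non-fluent one, with ties counted as half. The sketch below computes it directly from that definition. It is an illustration only: the helper `roc_auc` and the toy labels and scores are hypothetical, and it is not necessarily how the reported value was computed.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for fluent, 0 for not fluent (assumed encoding).
    scores: predicted probability of the fluent class.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # correctly ranked pair
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data, not taken from the evaluation set
labels = [1, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.35, 0.8, 0.7]
auc = roc_auc(labels, scores)
```

This pairwise form is quadratic in the number of examples, so it suits small sanity checks rather than large evaluation sets.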
18
 
19
  ## Model description
20
+ This model wraps [line-corporation/line-distilbert-base-japanese](https://huggingface.co/line-corporation/line-distilbert-base-japanese) with [DistilBertForSequenceClassification](https://huggingface.co/docs/transformers/v4.34.0/en/model_doc/distilbert#transformers.DistilBertForSequenceClassification) to make a binary classifier.
 
21
 
22
  ## Intended uses & limitations
23
+ This model can be used to classify whether a given Japanese text is fluent (i.e., free of grammatical errors).
24
+ Example usage:
25
+
26
+ ```python
27
+ # Load the tokenizer & the model
28
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
29
+ import torch
30
+
31
+ tokenizer = AutoTokenizer.from_pretrained("line-corporation/line-distilbert-base-japanese", trust_remote_code=True)
32
+ model = AutoModelForSequenceClassification.from_pretrained("liwii/fluency-score-classification-ja")
33
+
34
+ # Make predictions
35
+ input_tokens = tokenizer([
36
+         '黒い猫が',
37
+         '黒い猫がいます',
38
+         'あっちの方で黒い猫があくびをしています',
39
+         'あっちの方でで黒い猫ががあくびをしています',
40
+         'ある日の暮方の事である。一人の下人が、羅生門の下で雨やみを待っていた。'
41
+     ],
42
+     return_tensors='pt',
43
+     padding=True)
44
+
45
+ with torch.no_grad():
46
+     output = model(**input_tokens)
47
+     # Probabilities of [not_fluent, fluent]
48
+     probs = torch.nn.functional.softmax(
49
+         output.logits, dim=1)
50
+ probs[:, 1] # => tensor([0.1007, 0.2416, 0.5635, 0.0453, 0.7701])
51
+ ```
52
+
53
+ Scores can be low for short sentences even when they contain no grammatical errors, because the training dataset consists of long sentences.
54
 
55
  ## Training and evaluation data
56
+ From ["日本語文法誤りデータセット"](https://github.com/liwii/ja_perturbed/tree/main), 512 rows were used as the evaluation dataset and the rest as the training dataset.
57
+ For each dataset split, the "original" rows were used as data with the "fluent" label, and the "perturbed" rows as data with the "not fluent" label.
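
The split and labeling described above can be sketched as follows. This is a minimal illustration, not the author's preprocessing code. The row layout (dicts with `original` and `perturbed` keys), the helper names `split_rows` and `to_labeled_examples`, and the seeded shuffle used to pick the 512 evaluation rows are all assumptions. The label order `[not_fluent, fluent]` follows the usage example.

```python
import random

NOT_FLUENT, FLUENT = 0, 1  # label order assumed from the usage example


def split_rows(rows, eval_size=512, seed=0):
    """Hold out `eval_size` rows for evaluation; the rest is training data."""
    if len(rows) <= eval_size:
        raise ValueError("need more rows than eval_size")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # assumed: how rows were chosen is not stated
    return shuffled[eval_size:], shuffled[:eval_size]


def to_labeled_examples(rows):
    """Turn each row into one fluent and one not-fluent example."""
    examples = []
    for row in rows:
        examples.append({"text": row["original"], "label": FLUENT})
        examples.append({"text": row["perturbed"], "label": NOT_FLUENT})
    return examples


# Hypothetical toy rows standing in for the real dataset
rows = [
    {"original": f"sentence {i}", "perturbed": f"sentence {i} (perturbed)"}
    for i in range(600)
]
train_rows, eval_rows = split_rows(rows)
train_examples = to_labeled_examples(train_rows)
eval_examples = to_labeled_examples(eval_rows)
```

Splitting by row before expanding into labeled examples keeps each original sentence and its perturbed counterpart in the same split, so no sentence pair straddles training and evaluation.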
58
 
59
  ## Training procedure
60
+ The model was fine-tuned for 5 epochs. The parameters of the original DistilBERT were frozen during fine-tuning.
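
Freezing the encoder while training only the classification head can be sketched in PyTorch as below. To stay small and runnable without downloading weights, the sketch uses a hypothetical stand-in module, `TinyClassifier`, rather than the real model; the layer sizes, learning rate, and random inputs are assumptions for illustration only.

```python
import torch
from torch import nn

torch.manual_seed(0)


class TinyClassifier(nn.Module):
    """Hypothetical stand-in: an encoder followed by a 2-class head."""

    def __init__(self, hidden=8):
        super().__init__()
        self.encoder = nn.Linear(hidden, hidden)  # plays the role of DistilBERT
        self.classifier = nn.Linear(hidden, 2)    # [not_fluent, fluent] logits

    def forward(self, x):
        return self.classifier(torch.relu(self.encoder(x)))


model = TinyClassifier()

# Freeze the encoder so that only the classification head is updated
for param in model.encoder.parameters():
    param.requires_grad = False

# Hand the optimizer only the parameters that remain trainable
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)  # lr is an arbitrary choice

encoder_before = model.encoder.weight.clone()
head_before = model.classifier.weight.clone()

# One training step on random stand-in data
x = torch.randn(4, 8)
y = torch.tensor([0, 1, 1, 0])
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
```

With `DistilBertForSequenceClassification`, the encoder is exposed as the `distilbert` attribute, so the same loop would iterate over `model.distilbert.parameters()`. Whether exactly that set of parameters was frozen here is an assumption.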
61
 
62
  ### Training hyperparameters
63
 
 
87
  - Transformers 4.34.0
88
  - Pytorch 2.0.0+cu118
89
  - Datasets 2.14.5
90
+ - Tokenizers 0.14.0