voidful committed
Commit 00c124d · 1 Parent(s): 48229af

update auto tokenizer support

Files changed (4)
  1. README.md +12 -9
  2. config.json +3 -0
  3. special_tokens_map.json +1 -0
  4. tokenizer_config.json +1 -0
README.md CHANGED
@@ -1,5 +1,8 @@
  ---
  language: zh
+ pipeline_tag: fill-mask
+ widget:
+ - text: "今天[MASK]情很好"
  ---

  # albert_chinese_large
@@ -7,25 +10,25 @@ language: zh
  This a albert_chinese_large model from [Google's github](https://github.com/google-research/ALBERT)
  converted by huggingface's [script](https://github.com/huggingface/transformers/blob/master/src/transformers/convert_albert_original_tf_checkpoint_to_pytorch.py)

- ## Attention (注意)
+ ## Notice
+ *Support AutoTokenizer*

- Since sentencepiece is not used in albert_chinese_large model
+ Since sentencepiece is not used in albert_chinese_base model
  you have to call BertTokenizer instead of AlbertTokenizer !!!
  we can eval it using an example on MaskedLM

- 由於 albert_chinese_large 模型沒有用 sentencepiece
+ 由於 albert_chinese_base 模型沒有用 sentencepiece
  用AlbertTokenizer會載不進詞表,因此需要改用BertTokenizer !!!
  我們可以跑MaskedLM預測來驗證這個做法是否正確

  ## Justify (驗證有效性)
- [colab trial](https://colab.research.google.com/drive/1Wjz48Uws6-VuSHv_-DcWLilv77-AaYgj)
  ```python
- from transformers import *
+ from transformers import AutoTokenizer, AlbertForMaskedLM
  import torch
  from torch.nn.functional import softmax

  pretrained = 'voidful/albert_chinese_large'
- tokenizer = BertTokenizer.from_pretrained(pretrained)
+ tokenizer = AutoTokenizer.from_pretrained(pretrained)
  model = AlbertForMaskedLM.from_pretrained(pretrained)

  inputtext = "今天[MASK]情很好"
@@ -33,11 +36,11 @@ inputtext = "今天[MASK]情很好"
  maskpos = tokenizer.encode(inputtext, add_special_tokens=True).index(103)

  input_ids = torch.tensor(tokenizer.encode(inputtext, add_special_tokens=True)).unsqueeze(0) # Batch size 1
- outputs = model(input_ids, masked_lm_labels=input_ids)
+ outputs = model(input_ids, labels=input_ids)
  loss, prediction_scores = outputs[:2]
- logit_prob = softmax(prediction_scores[0, maskpos]).data.tolist()
+ logit_prob = softmax(prediction_scores[0, maskpos],dim=-1).data.tolist()
  predicted_index = torch.argmax(prediction_scores[0, maskpos]).item()
  predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
- print(predicted_token,logit_prob[predicted_index])
+ print(predicted_token, logit_prob[predicted_index])
  ```
  Result: `心 0.9422469735145569`
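
The README snippet finds the `[MASK]` position through the hard-coded id 103 and then converts the logits at that position into a probability with `softmax(..., dim=-1)`. A rough, dependency-free sketch of those two steps follows. The tiny vocabulary and the logits are made up for illustration and are not taken from the model.

```python
import math

# Hypothetical toy vocabulary; the real vocab file has thousands of entries.
vocab = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "今", "天", "心", "情", "很", "好"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}


def softmax(logits):
    """Numerically stable softmax over a 1-D list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# "今天[MASK]情很好" with [CLS]/[SEP] added, as add_special_tokens=True would do.
tokens = ["[CLS]", "今", "天", "[MASK]", "情", "很", "好", "[SEP]"]
input_ids = [token_to_id[t] for t in tokens]

# Look the mask id up by name instead of hard-coding a number such as 103.
mask_id = token_to_id["[MASK]"]
maskpos = input_ids.index(mask_id)

# Made-up logits for the masked position, one score per vocabulary entry.
mask_logits = [0.0, 0.0, 0.0, 0.0, 0.5, 0.2, 4.0, 1.0, 0.1, 0.3]

probs = softmax(mask_logits)  # normalise over the vocabulary axis
predicted_index = max(range(len(probs)), key=probs.__getitem__)
predicted_token = vocab[predicted_index]
print(predicted_token, probs[predicted_index])
```

With the real tokenizer, the same lookup is available as `tokenizer.mask_token_id`, which avoids depending on where `[MASK]` happens to sit in the vocabulary file.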
config.json CHANGED
@@ -1,4 +1,7 @@
  {
+ "architectures": [
+ "AlbertForMaskedLM"
+ ],
  "attention_probs_dropout_prob": 0,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"do_lower_case": true, "do_basic_tokenize": true, "never_split": null, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": null, "tokenizer_file": null, "name_or_path": "voidful/albert_chinese_large", "tokenizer_class": "BertTokenizer"}
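
As far as the files in this commit suggest, the `tokenizer_class` entry is what allows `AutoTokenizer.from_pretrained` to pick `BertTokenizer` for this checkpoint even though the model itself is ALBERT. The sketch below is a much-simplified illustration of that kind of lookup. The stub classes, the two registries, and `resolve_tokenizer_class` are hypothetical and are not the actual transformers implementation.

```python
import json


# Stub classes standing in for the real tokenizer implementations (hypothetical).
class BertTokenizer:
    pass


class AlbertTokenizer:
    pass


# Hypothetical registries: class name -> class, and model type -> default class.
TOKENIZER_REGISTRY = {
    "BertTokenizer": BertTokenizer,
    "AlbertTokenizer": AlbertTokenizer,
}
DEFAULT_BY_MODEL_TYPE = {"albert": AlbertTokenizer, "bert": BertTokenizer}


def resolve_tokenizer_class(tokenizer_config_text, model_type):
    """Prefer an explicit tokenizer_class; otherwise use the model-type default."""
    config = json.loads(tokenizer_config_text)
    name = config.get("tokenizer_class")
    if name is not None:
        return TOKENIZER_REGISTRY[name]
    return DEFAULT_BY_MODEL_TYPE[model_type]


# Trimmed copy of the tokenizer_config.json added in this commit.
tokenizer_config_text = (
    '{"do_lower_case": true, "mask_token": "[MASK]", '
    '"tokenizer_class": "BertTokenizer"}'
)

with_override = resolve_tokenizer_class(tokenizer_config_text, "albert")
without_override = resolve_tokenizer_class("{}", "albert")
print(with_override.__name__)     # explicit entry wins
print(without_override.__name__)  # falls back to the model-type default
```

Without the explicit entry, a loader that keys only on the model type would choose the sentencepiece-based tokenizer, which is the failure the README notice warns about.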