maxpe committed
Commit
48a683b
·
1 Parent(s): 1119202

added tokenizer

Files changed (5)
  1. README.md.bkp +149 -0
  2. merges.txt +0 -0
  3. special_tokens_map.json +1 -0
  4. tokenizer_config.json +1 -0
  5. vocab.json +0 -0
README.md.bkp ADDED
@@ -0,0 +1,149 @@
# Twitter-roBERTa-base

This is a Twitter-roBERTa-base model fine-tuned on ~7000 tweets annotated for 11 emotion categories in [SemEval-2018 Task 1: Affect in Tweets, SubTask 5: Emotion Classification](https://competitions.codalab.org/competitions/17751).

The underlying language model was trained on ~58M tweets and is described and evaluated in the [_TweetEval_ benchmark (Findings of EMNLP 2020)](https://arxiv.org/pdf/2010.12421.pdf). To evaluate this and other language models on Twitter-specific data, please refer to the [TweetEval official repository](https://github.com/cardiffnlp/tweeteval).
## Preprocess Text

Replace usernames and links with the placeholders "@user" and "http":

```python
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)
```
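Because this checkpoint was fine-tuned for the multi-label SemEval-2018 emotion task described above, a short classification example may be useful alongside the base-model examples that follow. The snippet below is a minimal sketch rather than this card's own recipe: the model id is a placeholder for this repository, and it assumes the fine-tuned classification head and its `id2label` mapping can be loaded with `AutoModelForSequenceClassification`.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

MODEL = "<this-repository-id>"  # placeholder: replace with this repository's model id

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

text = preprocess("I can't wait for the weekend 😊")
encoded_input = tokenizer(text, return_tensors='pt')

with torch.no_grad():
    logits = model(**encoded_input).logits[0]

# SemEval-2018 Task 1 (E-c) is multi-label, so score each of the 11 emotions
# independently with a sigmoid instead of a softmax over all labels.
scores = torch.sigmoid(logits)
for i, score in enumerate(scores):
    label = model.config.id2label.get(i, f"label_{i}")  # label names depend on the saved config
    print(f"{label}: {score.item():.4f}")
```

Thresholding the per-label scores (for example at 0.5) then yields the predicted emotion tags.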
## Example Masked Language Model

This example uses the underlying base model, `cardiffnlp/twitter-roberta-base` (as do the embedding and feature-extraction examples below).

```python
from transformers import pipeline, AutoTokenizer
import numpy as np

MODEL = "cardiffnlp/twitter-roberta-base"
fill_mask = pipeline("fill-mask", model=MODEL, tokenizer=MODEL)
tokenizer = AutoTokenizer.from_pretrained(MODEL)

def print_candidates():
    # Print the top five fill-mask candidates with their rounded scores
    for i in range(5):
        token = tokenizer.decode(candidates[i]['token'])
        score = np.round(candidates[i]['score'], 4)
        print(f"{i+1}) {token} {score}")

texts = [
    "I am so <mask> 😊",
    "I am so <mask> 😒"
]
for text in texts:
    t = preprocess(text)
    print(f"{'-'*30}\n{t}")
    candidates = fill_mask(t)
    print_candidates()
```

Output:

```
------------------------------
I am so <mask> 😊
1) happy 0.402
2) excited 0.1441
3) proud 0.143
4) grateful 0.0669
5) blessed 0.0334
------------------------------
I am so <mask> 😒
1) sad 0.2641
2) sorry 0.1605
3) tired 0.138
4) sick 0.0278
5) hungry 0.0232
```
## Example Tweet Embeddings

```python
from transformers import AutoTokenizer, AutoModel
import numpy as np
from scipy.spatial.distance import cosine
from collections import defaultdict

MODEL = "cardiffnlp/twitter-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

def get_embedding(text):
    # Mean-pool the last hidden states over all tokens
    text = preprocess(text)
    encoded_input = tokenizer(text, return_tensors='pt')
    features = model(**encoded_input)
    features = features[0].detach().cpu().numpy()
    features_mean = np.mean(features[0], axis=0)
    return features_mean

query = "The book was awesome"

tweets = ["I just ordered fried chicken 🐣",
          "The movie was great",
          "What time is the next game?",
          "Just finished reading 'Embeddings in NLP'"]

d = defaultdict(int)
for tweet in tweets:
    # Cosine similarity between the query embedding and each tweet embedding
    sim = 1 - cosine(get_embedding(query), get_embedding(tweet))
    d[tweet] = sim

print('Most similar to: ', query)
print('----------------------------------------')
for idx, x in enumerate(sorted(d.items(), key=lambda x: x[1], reverse=True)):
    print(idx + 1, x[0])
```

Output:

```
Most similar to: The book was awesome
----------------------------------------
1 The movie was great
2 Just finished reading 'Embeddings in NLP'
3 I just ordered fried chicken 🐣
4 What time is the next game?
```
## Example Feature Extraction

```python
from transformers import AutoTokenizer, AutoModel, TFAutoModel
import numpy as np

MODEL = "cardiffnlp/twitter-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

text = "Good night 😊"
text = preprocess(text)

# PyTorch
model = AutoModel.from_pretrained(MODEL)
encoded_input = tokenizer(text, return_tensors='pt')
features = model(**encoded_input)
features = features[0].detach().cpu().numpy()
features_mean = np.mean(features[0], axis=0)
#features_max = np.max(features[0], axis=0)

# # TensorFlow
# model = TFAutoModel.from_pretrained(MODEL)
# encoded_input = tokenizer(text, return_tensors='tf')
# features = model(encoded_input)
# features = features[0].numpy()
# features_mean = np.mean(features[0], axis=0)
# #features_max = np.max(features[0], axis=0)
```
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
special_tokens_map.json ADDED
@@ -0,0 +1 @@
{"bos_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "eos_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "unk_token": {"content": "<unk>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "sep_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "pad_token": {"content": "<pad>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "cls_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true}}
tokenizer_config.json ADDED
@@ -0,0 +1 @@
{"unk_token": {"content": "<unk>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "bos_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "eos_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "add_prefix_space": false, "errors": "replace", "sep_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "cls_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "pad_token": {"content": "<pad>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "model_max_length": 512, "special_tokens_map_file": null, "name_or_path": "cardiffnlp/twitter-roberta-base"}
vocab.json ADDED
The diff for this file is too large to render. See raw diff
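Taken together, merges.txt, vocab.json, special_tokens_map.json, and tokenizer_config.json are the standard files a RoBERTa-style byte-level BPE tokenizer needs, so after this commit the tokenizer can be loaded directly with `AutoTokenizer`. The snippet below is a minimal sketch; the model id is a placeholder for this repository.

```python
from transformers import AutoTokenizer

MODEL = "<this-repository-id>"  # placeholder: replace with this repository's model id
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# Special tokens as declared in special_tokens_map.json
print(tokenizer.bos_token, tokenizer.eos_token, tokenizer.sep_token,
      tokenizer.pad_token, tokenizer.mask_token)

# model_max_length comes from tokenizer_config.json (512 here)
print(tokenizer.model_max_length)

# Round-trip a tweet through the byte-level BPE vocabulary (merges.txt + vocab.json)
ids = tokenizer("I am so <mask> 😊")["input_ids"]
print(ids)
print(tokenizer.decode(ids))
```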