Commit ba66f28 · Parent(s): 476ec8a
Update README.md
README.md CHANGED
@@ -9,11 +9,12 @@
 <!-- Provide a quick summary of what the model is/does. -->
 
 #Encoder from HICL: Hashtag-Driven In-Context Learning for Social Media Natural Language Understanding.
+The model can effectively extract topic-level embeddings.
 
 ## Model Details
 
-#Encoder
-It was
+#Encoder leverages hashtags to learn inter-post topic relevance (for retrieval) via contrastive learning over 179M tweets.
+It was pre-trained on pairwise posts, with contrastive learning guiding the model to learn topic relevance by identifying posts that share the same hashtag.
 We randomly noise the hashtags to avoid trivial representations.
 Please refer to https://github.com/albertan017/HICL for more details.
 
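A minimal sketch of extracting topic-level embeddings with the encoder, assuming the checkpoint loads as a standard Hugging Face encoder (BERTweet-style) and using mean pooling; the model id and the pooling choice are illustrative assumptions, not details confirmed by this card:

```python
# Hypothetical usage sketch: the model id below is a placeholder, and mean
# pooling is an assumed (not confirmed) way to get a post-level embedding.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "path/to/this-hashtag-encoder"  # placeholder id for this checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

posts = [
    "climate strike downtown today #FridaysForFuture",
    "thousands marching for the planet #FridaysForFuture",
]

with torch.no_grad():
    batch = tokenizer(posts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state    # (batch, tokens, hidden)
    mask = batch["attention_mask"].unsqueeze(-1)  # (batch, tokens, 1)
    emb = (hidden * mask).sum(1) / mask.sum(1)    # mean-pooled post embeddings

# Cosine similarity between pooled vectors as a rough topic-relevance score.
print(float(torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0)))
```

In a HICL-style setup, such similarities would serve to retrieve topically related posts for in-context enrichment.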
@@ -71,20 +72,18 @@ N.A.
 
 <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
 
-
+#Encoder is pre-trained on 15 GB of plain text from 179 million tweets (4 billion tokens).
+Following the practice used to pre-train BERTweet, the raw data was collected from the archived Twitter stream, containing 4 TB of sampled tweets from January 2013 to June 2021.
+For data pre-processing, we ran the following steps.
+First, we employed fastText to extract English tweets and kept only tweets containing hashtags.
+Then, low-frequency hashtags appearing in fewer than 100 tweets were filtered out to alleviate sparsity.
+After that, we obtained a large-scale dataset of 179M tweets, each with at least one hashtag, covering 180K hashtags in total.
 
 ### Training Procedure
 
 <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
 
-
-
-[More Information Needed]
-
-
-#### Training Hyperparameters
-
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+To leverage the hashtag-gathered context in pre-training, we exploit contrastive learning and train #Encoder to identify pairwise posts sharing the same hashtag, thereby learning topic relevance.
 
 
 ## Citation [optional]
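The pre-processing steps listed above (fastText language identification, keeping only tweets that contain hashtags, and dropping hashtags that appear in fewer than 100 tweets) can be sketched roughly as follows; the file name, the two-pass structure, and the hashtag lowercasing are illustrative assumptions rather than the authors' actual pipeline, which is in the linked repository:

```python
# Rough sketch of the described filtering, not the authors' script.
# Assumes tweets.txt holds one tweet per line and lid.176.bin is the public
# fastText language-identification model.
import re
from collections import Counter

import fasttext

HASHTAG = re.compile(r"#\w+")
lid = fasttext.load_model("lid.176.bin")

def is_english(text: str) -> bool:
    labels, _ = lid.predict(text.replace("\n", " "))
    return labels[0] == "__label__en"

# Pass 1: keep English tweets with at least one hashtag; count hashtag frequencies.
kept, tag_counts = [], Counter()
with open("tweets.txt", encoding="utf-8") as f:
    for line in f:
        tweet = line.strip()
        tags = [t.lower() for t in HASHTAG.findall(tweet)]
        if tags and is_english(tweet):
            kept.append((tweet, tags))
            tag_counts.update(set(tags))

# Pass 2: drop hashtags seen in fewer than 100 tweets, then tweets left without any.
frequent = {t for t, c in tag_counts.items() if c >= 100}
corpus = [(tweet, [t for t in tags if t in frequent]) for tweet, tags in kept]
corpus = [(tweet, tags) for tweet, tags in corpus if tags]
print(len(corpus), "tweets kept,", len(frequent), "hashtags kept")
```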
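The contrastive objective described in the Training Procedure change, pairing posts that share a hashtag while randomly noising the hashtags, could look roughly like the in-batch InfoNCE sketch below; the mask-based noising, the temperature, and the in-batch negatives are assumptions for illustration, not confirmed details of the HICL training code:

```python
# Illustrative contrastive-learning sketch under the assumptions stated above.
import random
import re

import torch
import torch.nn.functional as F

def noise_hashtags(texts, mask_token="<mask>", p=0.5):
    """Randomly replace hashtags so pairs cannot be matched by the literal tag string."""
    return [
        re.sub(r"#\w+", lambda m: mask_token if random.random() < p else m.group(0), t)
        for t in texts
    ]

def info_nce(anchor_emb, positive_emb, temperature=0.05):
    """In-batch InfoNCE: the i-th anchor's positive is the i-th post sharing its hashtag."""
    a = F.normalize(anchor_emb, dim=-1)
    b = F.normalize(positive_emb, dim=-1)
    logits = a @ b.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```

Each training batch would then hold pairs of posts sharing a hashtag, with noise_hashtags applied to both sides before encoding so the model has to rely on topical content rather than the shared tag.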