Commit ba66f28 · Parent(s): 476ec8a
Update README.md
README.md CHANGED
@@ -9,11 +9,12 @@
 <!-- Provide a quick summary of what the model is/does. -->
 
 #Encoder from HICL: Hashtag-Driven In-Context Learning for Social Media Natural Language Understanding.
+The model can effectively extract topic-level embeddings.
 
 ## Model Details
 
-#Encoder
-It was
+#Encoder leverages hashtags to learn inter-post topic relevance (for retrieval) via contrastive learning over 179M tweets.
+It was pre-trained on pairwise posts, with contrastive learning guiding the model to learn topic relevance by identifying posts that share the same hashtag.
 We randomly noise the hashtags to avoid trivial representations.
 Please refer to https://github.com/albertan017/HICL for more details.
 
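A minimal sketch of extracting topic-level embeddings with the encoder, assuming the checkpoint loads as a standard Hugging Face encoder (BERTweet-style) and using mean pooling; the model id and the pooling choice are illustrative assumptions, not details confirmed by this card:

```python
# Hypothetical usage sketch: the model id below is a placeholder, and mean
# pooling is an assumed (not confirmed) way to get a post-level embedding.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "path/to/this-hashtag-encoder"  # placeholder id for this checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

posts = [
    "climate strike downtown today #FridaysForFuture",
    "thousands marching for the planet #FridaysForFuture",
]

with torch.no_grad():
    batch = tokenizer(posts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state    # (batch, tokens, hidden)
    mask = batch["attention_mask"].unsqueeze(-1)  # (batch, tokens, 1)
    emb = (hidden * mask).sum(1) / mask.sum(1)    # mean-pooled post embeddings

# Cosine similarity between pooled vectors as a rough topic-relevance score.
print(float(torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0)))
```

In a HICL-style setup, such similarities would serve to retrieve topically related posts for in-context enrichment.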
@@ -71,20 +72,18 @@ N.A.
 
 <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
 
-
+#Encoder is pre-trained on 15 GB of plain text from 179 million tweets (4 billion tokens).
+Following the practice used to pre-train BERTweet, the raw data was collected from the archived Twitter stream, containing 4 TB of sampled tweets from January 2013 to June 2021.
+For data pre-processing, we ran the following steps.
+First, we employed fastText to extract English tweets and kept only tweets containing hashtags.
+Then, low-frequency hashtags appearing in fewer than 100 tweets were filtered out to alleviate sparsity.
+After that, we obtained a large-scale dataset of 179M tweets, each with at least one hashtag, covering 180K hashtags in total.
 
 ### Training Procedure
 
 <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
 
-
-
-[More Information Needed]
-
-
-#### Training Hyperparameters
-
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+To leverage the hashtag-gathered context in pre-training, we exploit contrastive learning and train #Encoder to identify pairwise posts sharing the same hashtag, thereby learning topic relevance.
 
 
 ## Citation [optional]
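The pre-processing steps listed above (fastText language identification, keeping only tweets that contain hashtags, and dropping hashtags that appear in fewer than 100 tweets) can be sketched roughly as follows; the file name, the two-pass structure, and the hashtag lowercasing are illustrative assumptions rather than the authors' actual pipeline, which is in the linked repository:

```python
# Rough sketch of the described filtering, not the authors' script.
# Assumes tweets.txt holds one tweet per line and lid.176.bin is the public
# fastText language-identification model.
import re
from collections import Counter

import fasttext

HASHTAG = re.compile(r"#\w+")
lid = fasttext.load_model("lid.176.bin")

def is_english(text: str) -> bool:
    labels, _ = lid.predict(text.replace("\n", " "))
    return labels[0] == "__label__en"

# Pass 1: keep English tweets with at least one hashtag; count hashtag frequencies.
kept, tag_counts = [], Counter()
with open("tweets.txt", encoding="utf-8") as f:
    for line in f:
        tweet = line.strip()
        tags = [t.lower() for t in HASHTAG.findall(tweet)]
        if tags and is_english(tweet):
            kept.append((tweet, tags))
            tag_counts.update(set(tags))

# Pass 2: drop hashtags seen in fewer than 100 tweets, then tweets left without any.
frequent = {t for t, c in tag_counts.items() if c >= 100}
corpus = [(tweet, [t for t in tags if t in frequent]) for tweet, tags in kept]
corpus = [(tweet, tags) for tweet, tags in corpus if tags]
print(len(corpus), "tweets kept,", len(frequent), "hashtags kept")
```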
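The contrastive objective described in the Training Procedure change, pairing posts that share a hashtag while randomly noising the hashtags, could look roughly like the in-batch InfoNCE sketch below; the mask-based noising, the temperature, and the in-batch negatives are assumptions for illustration, not confirmed details of the HICL training code:

```python
# Illustrative contrastive-learning sketch under the assumptions stated above.
import random
import re

import torch
import torch.nn.functional as F

def noise_hashtags(texts, mask_token="<mask>", p=0.5):
    """Randomly replace hashtags so pairs cannot be matched by the literal tag string."""
    return [
        re.sub(r"#\w+", lambda m: mask_token if random.random() < p else m.group(0), t)
        for t in texts
    ]

def info_nce(anchor_emb, positive_emb, temperature=0.05):
    """In-batch InfoNCE: the i-th anchor's positive is the i-th post sharing its hashtag."""
    a = F.normalize(anchor_emb, dim=-1)
    b = F.normalize(positive_emb, dim=-1)
    logits = a @ b.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```

Each training batch would then hold pairs of posts sharing a hashtag, with noise_hashtags applied to both sides before encoding so the model has to rely on topical content rather than the shared tag.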