## Training data
2.5 billion tweets with 56 billion subwords in 66 languages (as identified in Twitter metadata).
The tweets were collected from the 1% public Twitter stream between January 2016 and December 2021.
See the [Bernice pretrain dataset](https://huggingface.co/datasets/jhu-clsp/bernice-pretrain-data) for details.
|

## Training procedure
RoBERTa pre-training (i.e., masked language modeling) with a BERT-base architecture.
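For orientation, the masked language modeling objective can be sketched in PyTorch; the 15% masking rate and the mask token id below are illustrative assumptions, not confirmed training settings:

```python
import torch

# Toy sketch of the MLM objective: mask ~15% of token positions and train
# the encoder to recover the originals. The 15% rate and mask id 4 are
# illustrative assumptions, not Bernice's actual configuration.
torch.manual_seed(0)
input_ids = torch.randint(5, 1000, (1, 12))  # fake subword ids
labels = input_ids.clone()

mask = torch.rand(input_ids.shape) < 0.15  # positions to mask
labels[~mask] = -100                       # loss is computed only on masked tokens
input_ids[mask] = 4                        # replace masked positions with [MASK]

# `input_ids` feeds the encoder; cross-entropy against `labels` trains the
# model to predict the original tokens at the masked positions.
```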

## Evaluation results
We evaluated Bernice on three Twitter benchmarks: [TweetEval](https://aclanthology.org/2020.findings-emnlp.148/), [Unified Multilingual Sentiment Analysis Benchmark (UMSAB)](https://aclanthology.org/2022.lrec-1.27/), and [Multilingual Hate Speech](https://link.springer.com/chapter/10.1007/978-3-030-67670-4_26). Summary results are shown below; see the paper appendix for details.

|
|             | **Bernice** | **BERTweet** | **XLM-R** | **XLM-T** | **TwHIN-BERT-MLM** | **TwHIN-BERT** |
|-------------|-------------|--------------|-----------|-----------|--------------------|----------------|
| TweetEval   | 64.80       | **67.90**    | 57.60     | 64.40     | 64.80              | 63.10          |
| UMSAB       | **70.34**   | -            | 67.71     | 66.74     | 68.10              | 67.53          |
| Hate Speech | **76.20**   | -            | 74.54     | 73.31     | 73.41              | 74.32          |

# How to use
You can use this model for tweet representation. To use it with the Hugging Face PyTorch interface:
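A minimal sketch of such usage, assuming the model is published under the `jhu-clsp/bernice` repository id (the same namespace as the pretraining dataset above) and using mean-pooled last hidden states as the tweet representation:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Repository id assumed from the dataset namespace above; adjust if needed.
model_name = "jhu-clsp/bernice"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

tweets = ["I love this!", "Qué día tan bonito"]
inputs = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool token embeddings, ignoring padding, to get one vector per tweet.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
```

Mean pooling over non-padding tokens is one common choice for sentence-level representations; the first-token (`<s>`) embedding is a reasonable alternative.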

# Limitations and bias

**Presence of Hate Speech:** As with all social media data, spam and hate speech are present.
We cleaned our data by filtering on tweet length, but the possibility of spam remains.
Hate speech is difficult to detect, especially across languages and cultures, so we leave its removal for future work.

**Low-resource Language Evaluation:** Even with language sampling during training,
Bernice is not exposed to the same variety of examples in low-resource languages as in high-resource languages like English and Spanish.
It is unclear whether enough Twitter data exists in low-resource languages, such as Tibetan and Telugu, to ever match the performance on high-resource languages.
Only models that generalize more efficiently can pave the way for better performance across the wide variety of languages in this low-resource category.

See the paper for a more detailed discussion.

## BibTeX entry and citation info
```