|
--- |
|
language: en |
|
tags: |
|
- SEGA |
|
- data augmentation |
|
- keywords-to-text generation |
|
- sketch-to-text generation |
|
license: apache-2.0 |
|
datasets: |
|
- C4 |
|
|
|
|
|
widget: |
|
- text: "<mask> Conference on Empirical Methods <mask> submission of research papers <mask> Deep Learning <mask>" |
|
example_title: "Example 1" |
|
- text: "<mask> machine learning <mask> my research interest <mask> data science <mask>" |
|
example_title: "Example 2" |
|
- text: "<mask> play basketball <mask> a strong team <mask> Shanghai University of Finance and Economics <mask> last Sunday <mask>" |
|
example_title: "Example 3" |
|
- text: "Good news: <mask> the European Union <mask> month by EU <mask> Farm Commissioner Franz <mask>" |
|
example_title: "Example with a prompt 1" |
|
- text: "Bad news: <mask> the European Union <mask> month by EU <mask> Farm Commissioner Franz <mask>" |
|
example_title: "Example with a prompt 2" |
|
|
|
inference: |
|
parameters: |
|
max_length: 200 |
|
num_beams: 3 |
|
do_sample: True |
|
--- |
|
|
|
# SEGA-large model |
|
|
|
**SEGA: SkEtch-based Generative Augmentation** |
|
|
|
**SEGA** is a **general text augmentation model** that can be used for data augmentation in **various NLP tasks** (including sentiment analysis, topic classification, NER, and QA). Given a sketch of keywords joined by `<mask>` tokens, it generates fluent text that fills in the missing spans. SEGA uses an encoder-decoder structure (based on the BART architecture) and is pre-trained on the `C4-realnewslike` corpus.
|
|
|
- Paper: [this paper](to_be_added) |
|
- Github: [this repository](to_be_added)
|
|
|
|
|
|
|
### How to use |
|
```python |
|
from transformers import pipeline |
|
# 1. load the model with the huggingface `pipeline` |
|
sega = pipeline("text2text-generation", model='beyond/sega-large', device=0) |
|
# 2. provide a sketch (keywords joined by <mask> tokens)
|
sketch = "<mask> Conference on Empirical Methods <mask> submission of research papers <mask> Deep Learning <mask>" |
|
# 3. just do it! |
|
generated_text = sega(sketch, num_beams=3, do_sample=True, max_length=200)[0]['generated_text'] |
|
print(generated_text) |
|
``` |
|
Output: |
|
```shell |
|
'The Conference on Empirical Methods welcomes the submission of research papers. Abstracts should be in the form of a paper or presentation. Please submit abstracts to the following email address: eemml.stanford.edu. The conference will be held at Stanford University on April 16-18, 2019. The theme of the conference is Deep Learning.'
|
``` |
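
Because decoding uses sampling, the same sketch can yield several different realizations, which is what makes the model useful for augmentation. The snippet below is a small extension of the example above; `num_return_sequences` is a standard `generate` argument that the pipeline passes through:

```python
# Ask for three sampled realizations of the same sketch; each run may differ.
outputs = sega(sketch, num_beams=3, do_sample=True, max_length=200, num_return_sequences=3)

for i, out in enumerate(outputs, start=1):
    print(f"Augmentation {i}: {out['generated_text']}")
```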
|
|
|
## Model variations |
|
|
|
|
|
| Model | #params | Language | |
|
|------------------------|--------------------------------|-------| |
|
| [`sega-large`]() | xM | English | |
|
| [`sega-base`]() | xM | English | |
|
| [`sega-small`]() | xM | English | |
|
| [`sega-large-chinese`]() | xM | Chinese | |
|
| [`sega-base-chinese`]() | xM | Chinese | |
|
| [`sega-small-chinese`]() | xM | Chinese | |
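
All variants are sketch-to-text generators, so switching between them should only require changing the checkpoint name once the links above are filled in. If you prefer to work below the `pipeline` level, the following is a minimal sketch that assumes the checkpoints expose the standard Hugging Face seq2seq interface (the model is BART-based); `beyond/sega-large` is the only identifier confirmed in this card:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model directly; assumes a standard BART-style checkpoint.
tokenizer = AutoTokenizer.from_pretrained("beyond/sega-large")
model = AutoModelForSeq2SeqLM.from_pretrained("beyond/sega-large")

sketch = "<mask> machine learning <mask> my research interest <mask> data science <mask>"
inputs = tokenizer(sketch, return_tensors="pt")

with torch.no_grad():
    ids = model.generate(**inputs, num_beams=3, do_sample=True, max_length=200)

print(tokenizer.decode(ids[0], skip_special_tokens=True))
```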
|
|
|
|
|
## Data Augmentation for Text Classification Tasks
|
- Setting: low-resource, with only n={50, 100, 200, 500, 1000} labeled samples available for training. The results below are averaged over all training sizes; paired entries report in-distribution (ID) / out-of-distribution (OOD) results, with the OOD test set given in parentheses. An illustrative augmentation sketch follows the results table.
|
- Datasets: [HuffPost](https://huggingface.co/datasets/khalidalt/HuffPost), [BBC](https://huggingface.co/datasets/SetFit/bbc-news), [SST2](https://huggingface.co/datasets/glue), [IMDB](https://huggingface.co/datasets/imdb), [Yahoo](https://huggingface.co/datasets/yahoo_answers_topics), [20NG](https://huggingface.co/datasets/newsgroup). |
|
- Base classifier: [DistilBERT](https://huggingface.co/distilbert-base-cased) |
|
|
|
| Method | HuffPost | BBC | SST2 | IMDB | Yahoo | 20NG | avg. | |
|
|---------|:------------------:|:------------------:|:----------------------:|:----------------------:|:----------:|:----------:|:----------:| |
|
| | ID / OOD (BBC) | ID / OOD (Huff) | ID / OOD (IMDB) | ID / OOD (SST2) | | | | |
|
| none | 79.17 / 62.32 | **96.16** / 62.00 | 76.67 / 73.16 | 77.87 / 74.43 | 45.77 | 46.67 | 69.42 | |
|
| EDA | 79.63 / 67.48 | 95.11 / 58.92 | 75.52 / 69.46 | 77.88 / 75.88 | 45.10 | 46.15 | 69.11 | |
|
| STA | 80.74 / 69.31 | 95.64 / 64.82 | 77.80 / 73.66 | 77.88 / 74.77 | 46.96 | 47.27 | 70.88 | |
|
| Back | 80.48 / 67.75 | 95.28 / 63.10 | 76.96 / 72.23 | 78.35 / 75.96 | 46.10 | 46.61 | 70.28 | |
|
| MLM | 80.04 / 66.80 | 96.07 / 65.39 | 76.61 / 73.11 | 75.73 / 73.70 | 45.35 | 46.53 | 69.93 |
|
| C-MLM | 79.96 / 65.10 | 96.13 / **67.80** | 76.91 / 71.83 | 77.31 / 75.02 | 45.29 | 46.36 | 70.17 | |
|
| LAMBADA | 81.03 / 68.89 | 93.75 / 52.79 | 77.87 / 74.54 | 77.49 / 74.33 | 50.66 | 47.72 | 69.91 | |
|
| **SEGA (Ours)** | 81.43 / 74.87 | 95.61 / 67.79 | 77.87 / 72.94 | **79.51** / **76.75** | 49.43 | 50.47 | 72.67 | |
|
| **SEGA-f (Ours)** | **81.82** / **76.18** | 95.78 / 67.79 | **80.59** / **80.32** | 79.37 / 76.61 | **50.12** | **50.81** | **73.94** | |
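
To make the setting above concrete, below is a toy sketch of how SEGA could be dropped into a low-resource classification pipeline. The keyword-selection step (keeping every third word) is a simplifying stand-in for illustration only, not the sketch-extraction procedure from the paper; each generated text simply inherits the label of the example it was sketched from.

```python
from transformers import pipeline

sega = pipeline("text2text-generation", model="beyond/sega-large", device=0)

def make_sketch(text, keep_every=3):
    # Crude stand-in for sketch extraction: keep every third word
    # and join the kept words with <mask> tokens.
    kept = text.split()[::keep_every]
    return "<mask> " + " <mask> ".join(kept) + " <mask>"

# A tiny labeled set standing in for the n={50,...,1000} low-resource splits.
labeled = [
    ("The commission announced new rules for farm subsidies in the European Union.", "politics"),
    ("Our team won a close basketball game in Shanghai last Sunday.", "sports"),
]

augmented = []
for text, label in labeled:
    out = sega(make_sketch(text), num_beams=3, do_sample=True, max_length=200)
    augmented.append((out[0]["generated_text"], label))  # the label is carried over

for text, label in augmented:
    print(label, "->", text)
```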
|
|
|
|
|
|
|
### BibTeX entry and citation info |
|
|
|
|