UNIST-Eunchan committed
Commit bf01a04
1 Parent(s): f6a0d49

Update README.md

Files changed (1)
  1. README.md +257 -12
README.md CHANGED
@@ -3,25 +3,270 @@ license: apache-2.0
3
  base_model: google/flan-t5-large
4
  tags:
5
  - generated_from_trainer
 
6
  model-index:
7
- - name: Prompting-NLP-Paper-to-QA-Generation-abstract-only
8
  results: []
9
-
10
  widget:
11
- - text: "Generate Question, Answer pair correspond to the following research paper. [Abstract] The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data. [Introduction] Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15]. Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states ht, as a function of the previous hidden state ht−1 and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains. Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network. In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs. \n Question, Answer:"
12
- example_title: "Attention Is All You Need"
13
- - text: "Generate Question, Answer pair correspond to the following research paper. [Abstract] In this work, we explore prompt tuning, a simple yet effective mechanism for learning soft prompts to condition frozen language models to perform specific downstream tasks. Unlike the discrete text prompts used by GPT-3, soft prompts are learned through backpropagation and can be tuned to incorporate signal from any number of labeled examples. Our end-to-end learned approach outperforms GPT-3's few-shot learning by a large margin. More remarkably, through ablations on model size using T5, we show that prompt tuning becomes more competitive with scale: as models exceed billions of parameters, our method closes the gap and matches the strong performance of model tuning (where all model weights are tuned). This finding is especially relevant in that large models are costly to share and serve, and the ability to reuse one frozen model for multiple downstream tasks can ease this burden. Our method can be seen as a simplification of the recently proposed prefix tuning of Li and Liang (2021), and we provide a comparison to this and other similar approaches. Finally, we show that conditioning a frozen model with soft prompts confers benefits in robustness to domain transfer, as compared to full model tuning. [Introduction] With the wide success of pre-trained large language models, a range of techniques has arisen to adapt these general-purpose models to downstream tasks. ELMo (Peters et al., 2018) proposed freezing the pre-trained model and learning a task-specific weighting of its per-layer representations. However, since GPT (Radford et al., 2018) and BERT (Devlin et al., 2019), the dominant adaptation technique has been model tuning (or fine-tuning), where all model parameters are tuned during adaptation, as proposed by Howard and Ruder (2018).More recently, Brown et al. (2020) showed that prompt design (or priming) is surprisingly effective at modulating a frozen GPT-3 model’s behavior through text prompts. Prompts are typically composed of a task description and/or several canonical examples. This return to freezing pre-trained models is appealing, especially as model size continues to increase. Rather than requiring a separate copy of the model for each downstream task, a single generalist model can simultaneously serve many different tasks. Unfortunately, prompt-based adaptation has several key drawbacks. Task description is error-prone and requires human involvement, and the effectiveness of a prompt is limited by how much conditioning text can fit into the model’s input. As a result, downstream task quality still lags far behind that of tuned models. For instance, GPT-3 175B fewshot performance on SuperGLUE is 17.5 points below fine-tuned T5-XXL (Raffel et al., 2020) (71.8 vs. 89.3) despite using 16 times more parameters. Several efforts to automate prompt design have been recently proposed. Shin et al. (2020) propose a search algorithm over the discrete space of words, guided by the downstream application training data. While this technique outperforms manual prompt design, there is still a gap relative to model tuning. Li and Liang (2021) propose prefix tuning and show strong results on generative tasks. This method freezes the model parameters and backpropagates the error during tuning to prefix activations prepended to each layer in the encoder stack, including the input layer. Hambardzumyan et al. 
(2021) simplify this recipe by restricting the trainable parameters to the input and output subnetworks of a masked language model, and show reasonable results on classifications tasks. In this paper, we propose prompt tuning as a further simplification for adapting language models. We freeze the entire pre-trained model and only allow an additional k tunable tokens per downstream task to be prepended to the input text. This soft prompt is trained end-to-end and can condense the signal from a full labeled dataset, allowing our method to outperform few-shot prompts and close the quality gap with model tuning (Figure 1). At the same time, since a single pre-trained model is recycled for all downstream tasks, we retain the efficient serving benefits of frozen models (Figure 2). While we developed our method concurrently with Li and Liang (2021) and Hambardzumyan et al. (2021), we are the first to show that prompt tuning alone (with no intermediate-layer prefixes or task-specific output layers) is sufficient to be competitive with model tuning. Through detailed experiments in sections 2–3, we demonstrate that language model capacity is a key ingredient for these approaches to succeed. As Figure 1 shows, prompt tuning becomes more competitive with scale. We compare with similar approaches in Section 4. Explicitly separating task-specific parameters from the generalist parameters needed for general language-understanding has a range of additional benefits. We show in Section 5 that by capturing the task definition in the prompt while keeping the generalist parameters fixed, we are able to achieve better resilience to domain shifts. In Section 6, we show that prompt ensembling, learning multiple prompts for the same task, can boost quality and is more efficient than classic model ensembling. Finally, in Section 7, we investigate the interpretability of our learned soft prompts. In sum, our key contributions are: 1. Proposing prompt tuning and showing its competitiveness with model tuning in the regime of large language models. 2. Ablating many design choices, and showing quality and robustness improve with scale. 3. Showing prompt tuning outperforms model tuning on domain shift problems. 4. Proposing prompt ensembling and showing its effectiveness. \n Question, Answer:"
14
- example_title: "PEFT (2104.08691)"
15
- - text: "Generate Question, Answer pair correspond to the following research paper. [Abstract] For the first time in the world, we succeeded in synthesizing the room-temperature superconductor (Tc≥400 K, 127∘C) working at ambient pressure with a modified lead-apatite (LK-99) structure. The superconductivity of LK-99 is proved with the Critical temperature (Tc), Zero-resistivity, Critical current (Ic), Critical magnetic field (Hc), and the Meissner effect. The superconductivity of LK-99 originates from minute structural distortion by a slight volume shrinkage (0.48 %), not by external factors such as temperature and pressure. The shrinkage is caused by Cu2+ substitution of Pb2+(2) ions in the insulating network of Pb(2)-phosphate and it generates the stress. It concurrently transfers to Pb(1) of the cylindrical column resulting in distortion of the cylindrical column interface, which creates superconducting quantum wells (SQWs) in the interface. The heat capacity results indicated that the new model is suitable for explaining the superconductivity of LK-99. The unique structure of LK-99 that allows the minute distorted structure to be maintained in the interfaces is the most important factor that LK-99 maintains and exhibits superconductivity at room temperatures and ambient pressure. [Introduction] Since the discovery of the first superconductor(1), many efforts to search for new roomtemperature superconductors have been carried out worldwide(2, 3) through their experimental clarity or/and theoretical perspectives(4-8). The recent success of developing room-temperature superconductors with hydrogen sulfide(9) and yttrium super-hydride(10) has great attention worldwide, which is expected by strong electron-phonon coupling theory with high-frequency hydrogen phonon modes(11, 12). However, it is difficult to apply them to actual application devices in daily life because of the tremendously high pressure, and more efforts are being made to overcome the high-pressure problem(13). For the first time in the world, we report the success in synthesizing a room-temperature and ambient-pressure superconductor with a chemical approach to solve the temperature and pressure problem. We named the first room temperature and ambient pressure superconductor LK-99. The superconductivity of LK-99 proved with the Critical temperature (Tc), Zero-resistivity, Critical current (Ic), Critical magnetic field (Hc), and Meissner effect(14, 15). Several data were collected and analyzed in detail to figure out the puzzle of superconductivity of LK-99: X-ray diffraction (XRD), X-ray photoelectron spectroscopy (XPS), Electron Paramagnetic Resonance Spectroscopy (EPR), Heat Capacity, and Superconducting quantum interference device (SQUID) data. Henceforth in this paper, we will report and discuss our new findings including superconducting quantum wells associated with the superconductivity of LK-99.\n Question, Answer:"
16
- example_title: "LK-99 (Not NLP)"
17
- - text: "Generate Question, Answer pair correspond to the following research paper. [Abstract] Abstract Evaluation practices in natural language generation (NLG) have many known flaws, but improved evaluation approaches are rarely widely adopted. This issue has become more urgent, since neural NLG models have improved to the point where they can often no longer be distinguished based on the surfacelevel features that older metrics rely on. This paper surveys the issues with human and automatic model evaluations and with commonly used datasets in NLG that have been pointed out over the past 20 years. We summarize, categorize, and discuss how researchers have been addressing these issues and what their findings mean for the current state of model evaluations. Building on those insights, we lay out a long-term vision for NLG evaluation and propose concrete steps for researchers to improve their evaluation processes. Finally, we analyze 66 NLG papers from recent NLP conferences in how well they already follow these suggestions and identify which areas require more drastic changes to the status quo. [Introduction] There are many issues with the evaluation of models that generate natural language. For example, datasets are often constructed in a way that prevents measuring tail effects of robustness, and they almost exclusively cover English. Most automated metrics measure only similarity between model output and references instead of fine-grained quality aspects (and even that poorly). Human evaluations have a high variance and, due to insufficient documentation, rarely produce replicable results. These issues have become more urgent as the nature of models that generate language has changed without significant changes to how they are being evaluated. While evaluation methods can capture surface-level improvements in text generated by state-of-the-art models (such as increased fluency) to some extent, they are ill-suited to detect issues with the content of model outputs, for example if they are not attributable to input information. These ineffective evaluations lead to overestimates of model capabilities. Deeper analyses uncover that popular models fail even at simple tasks by taking shortcuts, overfitting, hallucinating, and not being in accordance with their communicative goals. Identifying these shortcomings, many recent papers critique evaluation techniques or propose new ones. But almost none of the suggestions are followed or new techniques used. There is an incentive mismatch between conducting high-quality evaluations and publishing new models or modeling techniques. While general-purpose evaluation techniques could lower the barrier of entry for incorporating evaluation advances into model development, their development requires resources that are hard to come by, including model outputs on validation and test sets or large quantities of human assessments of such outputs. Moreover, some issues, like the refinement of datasets, require iterative processes where many researchers collaborate. All this leads to a circular dependency where evaluations of generation models can be improved only if generation models use better evaluations. We find that there is a systemic difference between selecting the best model and characterizing how good this model really is. Current evaluation techniques focus on the first, while the second is required to detect crucial issues. 
More emphasis needs to be put on measuring and reporting model limitations, rather than focusing on producing the highest performance numbers. To that end, this paper surveys analyses and critiques of evaluation approaches (sections 3 and 4) and of commonly used NLG datasets (section 5). Drawing on their insights, we describe how researchers developing modeling techniques can help to improve and subsequently benefit from better evaluations with methods available today (section 6). Expanding on existing work on model documentation and formal evaluation processes (Mitchell et al., 2019; Ribeiro et al., 2020), we propose releasing evaluation reports which focus on demonstrating NLG model shortcomings using evaluation suites. These reports should apply a complementary set of automatic metrics, include rigorous human evaluations, and be accompanied by data releases that allow for re-analysis with improved metrics. In an analysis of 66 recent EMNLP, INLG, and ACL papers along 29 dimensions related to our suggestions (section 7), we find that the first steps toward an improved evaluation are already frequently taken at an average rate of 27%. The analysis uncovers the dimensions that require more drastic changes in the NLG community. For example, 84% of papers already report results on multiple datasets and more than 28% point out issues in them, but we found only a single paper that contributed to the dataset documentation, leaving future researchers to re-identify those issues. We further highlight typical unsupported claims and a need for more consistent data release practices. Following the suggestions and results, we discuss how incorporating the suggestions can improve evaluation research, how the suggestions differ from similar ones made for NLU, and how better metrics can benefit model development itself (section 8). \n Question, Answer:"
18
- example_title: "NLG-Eval (2202.06935)"
19
  ---
20
 
21
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
22
  should probably proofread and complete it, then remove this comment. -->
23
 
24
- # Prompting-NLP-Paper-to-QA-Generation-abstract-only
25
 
26
  This model is a fine-tuned version of [google/flan-t5-large](https://huggingface.co/google/flan-t5-large) on an unknown dataset.
27
  It achieves the following results on the evaluation set:
@@ -76,4 +321,4 @@ The following hyperparameters were used during training:
76
  - Transformers 4.35.2
77
  - Pytorch 2.1.0+cu118
78
  - Datasets 2.15.0
79
- - Tokenizers 0.15.0
 
3
  base_model: google/flan-t5-large
4
  tags:
5
  - generated_from_trainer
6
+ - NLPPaper_to_Question_Generation
7
  model-index:
8
+ - name: FLAN-T5-NLP-Paper-to-Question-Generation
9
  results: []
 
10
  widget:
11
+ - text: >-
12
+ Generate Question, Answer pair correspond to the following research paper.
13
+ [Abstract] The dominant sequence transduction models are based on complex
14
+ recurrent or convolutional neural networks in an encoder-decoder
15
+ configuration. The best performing models also connect the encoder and
16
+ decoder through an attention mechanism. We propose a new simple network
17
+ architecture, the Transformer, based solely on attention mechanisms,
18
+ dispensing with recurrence and convolutions entirely. Experiments on two
19
+ machine translation tasks show these models to be superior in quality while
20
+ being more parallelizable and requiring significantly less time to train.
21
+ Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation
22
+ task, improving over the existing best results, including ensembles by over
23
+ 2 BLEU. On the WMT 2014 English-to-French translation task, our model
24
+ establishes a new single-model state-of-the-art BLEU score of 41.8 after
25
+ training for 3.5 days on eight GPUs, a small fraction of the training costs
26
+ of the best models from the literature. We show that the Transformer
27
+ generalizes well to other tasks by applying it successfully to English
28
+ constituency parsing both with large and limited training data.
29
+ [Introduction] Recurrent neural networks, long short-term memory [13] and
30
+ gated recurrent [7] neural networks in particular, have been firmly
31
+ established as state of the art approaches in sequence modeling and
32
+ transduction problems such as language modeling and machine translation [35,
33
+ 2, 5]. Numerous efforts have since continued to push the boundaries of
34
+ recurrent language models and encoder-decoder architectures [38, 24, 15].
35
+ Recurrent models typically factor computation along the symbol positions of
36
+ the input and output sequences. Aligning the positions to steps in
37
+ computation time, they generate a sequence of hidden states ht, as a
38
+ function of the previous hidden state ht−1 and the input for position t.
39
+ This inherently sequential nature precludes parallelization within training
40
+ examples, which becomes critical at longer sequence lengths, as memory
41
+ constraints limit batching across examples. Recent work has achieved
42
+ significant improvements in computational efficiency through factorization
43
+ tricks [21] and conditional computation [32], while also improving model
44
+ performance in case of the latter. The fundamental constraint of sequential
45
+ computation, however, remains. Attention mechanisms have become an integral
46
+ part of compelling sequence modeling and transduction models in various
47
+ tasks, allowing modeling of dependencies without regard to their distance in
48
+ the input or output sequences [2, 19]. In all but a few cases [27], however,
49
+ such attention mechanisms are used in conjunction with a recurrent network.
50
+ In this work we propose the Transformer, a model architecture eschewing
51
+ recurrence and instead relying entirely on an attention mechanism to draw
52
+ global dependencies between input and output. The Transformer allows for
53
+ significantly more parallelization and can reach a new state of the art in
54
+ translation quality after being trained for as little as twelve hours on
55
+ eight P100 GPUs.
56
+ Question, Answer:
57
+ example_title: Attention Is All You Need
58
+ - text: >-
59
+ Generate Question, Answer pair correspond to the following research paper.
60
+ [Abstract] In this work, we explore prompt tuning, a simple yet effective
61
+ mechanism for learning soft prompts to condition frozen language models to
62
+ perform specific downstream tasks. Unlike the discrete text prompts used by
63
+ GPT-3, soft prompts are learned through backpropagation and can be tuned to
64
+ incorporate signal from any number of labeled examples. Our end-to-end
65
+ learned approach outperforms GPT-3's few-shot learning by a large margin.
66
+ More remarkably, through ablations on model size using T5, we show that
67
+ prompt tuning becomes more competitive with scale: as models exceed billions
68
+ of parameters, our method closes the gap and matches the strong performance
69
+ of model tuning (where all model weights are tuned). This finding is
70
+ especially relevant in that large models are costly to share and serve, and
71
+ the ability to reuse one frozen model for multiple downstream tasks can ease
72
+ this burden. Our method can be seen as a simplification of the recently
73
+ proposed prefix tuning of Li and Liang (2021), and we provide a comparison
74
+ to this and other similar approaches. Finally, we show that conditioning a
75
+ frozen model with soft prompts confers benefits in robustness to domain
76
+ transfer, as compared to full model tuning. [Introduction] With the wide
77
+ success of pre-trained large language models, a range of techniques has
78
+ arisen to adapt these general-purpose models to downstream tasks. ELMo
79
+ (Peters et al., 2018) proposed freezing the pre-trained model and learning a
80
+ task-specific weighting of its per-layer representations. However, since GPT
81
+ (Radford et al., 2018) and BERT (Devlin et al., 2019), the dominant
82
+ adaptation technique has been model tuning (or fine-tuning), where all model
83
+ parameters are tuned during adaptation, as proposed by Howard and Ruder
84
+ (2018).More recently, Brown et al. (2020) showed that prompt design (or
85
+ priming) is surprisingly effective at modulating a frozen GPT-3 model’s
86
+ behavior through text prompts. Prompts are typically composed of a task
87
+ description and/or several canonical examples. This return to freezing
88
+ pre-trained models is appealing, especially as model size continues to
89
+ increase. Rather than requiring a separate copy of the model for each
90
+ downstream task, a single generalist model can simultaneously serve many
91
+ different tasks. Unfortunately, prompt-based adaptation has several key
92
+ drawbacks. Task description is error-prone and requires human involvement,
93
+ and the effectiveness of a prompt is limited by how much conditioning text
94
+ can fit into the model’s input. As a result, downstream task quality still
95
+ lags far behind that of tuned models. For instance, GPT-3 175B fewshot
96
+ performance on SuperGLUE is 17.5 points below fine-tuned T5-XXL (Raffel et
97
+ al., 2020) (71.8 vs. 89.3) despite using 16 times more parameters. Several
98
+ efforts to automate prompt design have been recently proposed. Shin et al.
99
+ (2020) propose a search algorithm over the discrete space of words, guided
100
+ by the downstream application training data. While this technique
101
+ outperforms manual prompt design, there is still a gap relative to model
102
+ tuning. Li and Liang (2021) propose prefix tuning and show strong results on
103
+ generative tasks. This method freezes the model parameters and
104
+ backpropagates the error during tuning to prefix activations prepended to
105
+ each layer in the encoder stack, including the input layer. Hambardzumyan et
106
+ al. (2021) simplify this recipe by restricting the trainable parameters to
107
+ the input and output subnetworks of a masked language model, and show
108
+ reasonable results on classifications tasks. In this paper, we propose
109
+ prompt tuning as a further simplification for adapting language models. We
110
+ freeze the entire pre-trained model and only allow an additional k tunable
111
+ tokens per downstream task to be prepended to the input text. This soft
112
+ prompt is trained end-to-end and can condense the signal from a full labeled
113
+ dataset, allowing our method to outperform few-shot prompts and close the
114
+ quality gap with model tuning (Figure 1). At the same time, since a single
115
+ pre-trained model is recycled for all downstream tasks, we retain the
116
+ efficient serving benefits of frozen models (Figure 2). While we developed
117
+ our method concurrently with Li and Liang (2021) and Hambardzumyan et al.
118
+ (2021), we are the first to show that prompt tuning alone (with no
119
+ intermediate-layer prefixes or task-specific output layers) is sufficient to
120
+ be competitive with model tuning. Through detailed experiments in sections
121
+ 2–3, we demonstrate that language model capacity is a key ingredient for
122
+ these approaches to succeed. As Figure 1 shows, prompt tuning becomes more
123
+ competitive with scale. We compare with similar approaches in Section 4.
124
+ Explicitly separating task-specific parameters from the generalist
125
+ parameters needed for general language-understanding has a range of
126
+ additional benefits. We show in Section 5 that by capturing the task
127
+ definition in the prompt while keeping the generalist parameters fixed, we
128
+ are able to achieve better resilience to domain shifts. In Section 6, we
129
+ show that prompt ensembling, learning multiple prompts for the same task,
130
+ can boost quality and is more efficient than classic model ensembling.
131
+ Finally, in Section 7, we investigate the interpretability of our learned
132
+ soft prompts. In sum, our key contributions are: 1. Proposing prompt tuning
133
+ and showing its competitiveness with model tuning in the regime of large
134
+ language models. 2. Ablating many design choices, and showing quality and
135
+ robustness improve with scale. 3. Showing prompt tuning outperforms model
136
+ tuning on domain shift problems. 4. Proposing prompt ensembling and showing
137
+ its effectiveness.
138
+ Question, Answer:
139
+ example_title: PEFT (2104.08691)
140
+ - text: >-
141
+ Generate Question, Answer pair correspond to the following research paper.
142
+ [Abstract] For the first time in the world, we succeeded in synthesizing the
143
+ room-temperature superconductor (Tc≥400 K, 127∘C) working at ambient
144
+ pressure with a modified lead-apatite (LK-99) structure. The
145
+ superconductivity of LK-99 is proved with the Critical temperature (Tc),
146
+ Zero-resistivity, Critical current (Ic), Critical magnetic field (Hc), and
147
+ the Meissner effect. The superconductivity of LK-99 originates from minute
148
+ structural distortion by a slight volume shrinkage (0.48 %), not by external
149
+ factors such as temperature and pressure. The shrinkage is caused by Cu2+
150
+ substitution of Pb2+(2) ions in the insulating network of Pb(2)-phosphate
151
+ and it generates the stress. It concurrently transfers to Pb(1) of the
152
+ cylindrical column resulting in distortion of the cylindrical column
153
+ interface, which creates superconducting quantum wells (SQWs) in the
154
+ interface. The heat capacity results indicated that the new model is
155
+ suitable for explaining the superconductivity of LK-99. The unique structure
156
+ of LK-99 that allows the minute distorted structure to be maintained in the
157
+ interfaces is the most important factor that LK-99 maintains and exhibits
158
+ superconductivity at room temperatures and ambient pressure. [Introduction]
159
+ Since the discovery of the first superconductor(1), many efforts to search
160
+ for new roomtemperature superconductors have been carried out worldwide(2,
161
+ 3) through their experimental clarity or/and theoretical perspectives(4-8).
162
+ The recent success of developing room-temperature superconductors with
163
+ hydrogen sulfide(9) and yttrium super-hydride(10) has great attention
164
+ worldwide, which is expected by strong electron-phonon coupling theory with
165
+ high-frequency hydrogen phonon modes(11, 12). However, it is difficult to
166
+ apply them to actual application devices in daily life because of the
167
+ tremendously high pressure, and more efforts are being made to overcome the
168
+ high-pressure problem(13). For the first time in the world, we report the
169
+ success in synthesizing a room-temperature and ambient-pressure
170
+ superconductor with a chemical approach to solve the temperature and
171
+ pressure problem. We named the first room temperature and ambient pressure
172
+ superconductor LK-99. The superconductivity of LK-99 proved with the
173
+ Critical temperature (Tc), Zero-resistivity, Critical current (Ic), Critical
174
+ magnetic field (Hc), and Meissner effect(14, 15). Several data were
175
+ collected and analyzed in detail to figure out the puzzle of
176
+ superconductivity of LK-99: X-ray diffraction (XRD), X-ray photoelectron
177
+ spectroscopy (XPS), Electron Paramagnetic Resonance Spectroscopy (EPR), Heat
178
+ Capacity, and Superconducting quantum interference device (SQUID) data.
179
+ Henceforth in this paper, we will report and discuss our new findings
180
+ including superconducting quantum wells associated with the
181
+ superconductivity of LK-99.
182
+ Question, Answer:
183
+ example_title: LK-99 (Not NLP)
184
+ - text: >-
185
+ Generate Question, Answer pair correspond to the following research paper.
186
+ [Abstract] Abstract Evaluation practices in natural language generation
187
+ (NLG) have many known flaws, but improved evaluation approaches are rarely
188
+ widely adopted. This issue has become more urgent, since neural NLG models
189
+ have improved to the point where they can often no longer be distinguished
190
+ based on the surfacelevel features that older metrics rely on. This paper
191
+ surveys the issues with human and automatic model evaluations and with
192
+ commonly used datasets in NLG that have been pointed out over the past 20
193
+ years. We summarize, categorize, and discuss how researchers have been
194
+ addressing these issues and what their findings mean for the current state
195
+ of model evaluations. Building on those insights, we lay out a long-term
196
+ vision for NLG evaluation and propose concrete steps for researchers to
197
+ improve their evaluation processes. Finally, we analyze 66 NLG papers from
198
+ recent NLP conferences in how well they already follow these suggestions and
199
+ identify which areas require more drastic changes to the status quo.
200
+ [Introduction] There are many issues with the evaluation of models that
201
+ generate natural language. For example, datasets are often constructed in a
202
+ way that prevents measuring tail effects of robustness, and they almost
203
+ exclusively cover English. Most automated metrics measure only similarity
204
+ between model output and references instead of fine-grained quality aspects
205
+ (and even that poorly). Human evaluations have a high variance and, due to
206
+ insufficient documentation, rarely produce replicable results. These issues
207
+ have become more urgent as the nature of models that generate language has
208
+ changed without significant changes to how they are being evaluated. While
209
+ evaluation methods can capture surface-level improvements in text generated
210
+ by state-of-the-art models (such as increased fluency) to some extent, they
211
+ are ill-suited to detect issues with the content of model outputs, for
212
+ example if they are not attributable to input information. These ineffective
213
+ evaluations lead to overestimates of model capabilities. Deeper analyses
214
+ uncover that popular models fail even at simple tasks by taking shortcuts,
215
+ overfitting, hallucinating, and not being in accordance with their
216
+ communicative goals. Identifying these shortcomings, many recent papers
217
+ critique evaluation techniques or propose new ones. But almost none of the
218
+ suggestions are followed or new techniques used. There is an incentive
219
+ mismatch between conducting high-quality evaluations and publishing new
220
+ models or modeling techniques. While general-purpose evaluation techniques
221
+ could lower the barrier of entry for incorporating evaluation advances into
222
+ model development, their development requires resources that are hard to
223
+ come by, including model outputs on validation and test sets or large
224
+ quantities of human assessments of such outputs. Moreover, some issues, like
225
+ the refinement of datasets, require iterative processes where many
226
+ researchers collaborate. All this leads to a circular dependency where
227
+ evaluations of generation models can be improved only if generation models
228
+ use better evaluations. We find that there is a systemic difference between
229
+ selecting the best model and characterizing how good this model really is.
230
+ Current evaluation techniques focus on the first, while the second is
231
+ required to detect crucial issues. More emphasis needs to be put on
232
+ measuring and reporting model limitations, rather than focusing on producing
233
+ the highest performance numbers. To that end, this paper surveys analyses
234
+ and critiques of evaluation approaches (sections 3 and 4) and of commonly
235
+ used NLG datasets (section 5). Drawing on their insights, we describe how
236
+ researchers developing modeling techniques can help to improve and
237
+ subsequently benefit from better evaluations with methods available today
238
+ (section 6). Expanding on existing work on model documentation and formal
239
+ evaluation processes (Mitchell et al., 2019; Ribeiro et al., 2020), we
240
+ propose releasing evaluation reports which focus on demonstrating NLG model
241
+ shortcomings using evaluation suites. These reports should apply a
242
+ complementary set of automatic metrics, include rigorous human evaluations,
243
+ and be accompanied by data releases that allow for re-analysis with improved
244
+ metrics. In an analysis of 66 recent EMNLP, INLG, and ACL papers along 29
245
+ dimensions related to our suggestions (section 7), we find that the first
246
+ steps toward an improved evaluation are already frequently taken at an
247
+ average rate of 27%. The analysis uncovers the dimensions that require more
248
+ drastic changes in the NLG community. For example, 84% of papers already
249
+ report results on multiple datasets and more than 28% point out issues in
250
+ them, but we found only a single paper that contributed to the dataset
251
+ documentation, leaving future researchers to re-identify those issues. We
252
+ further highlight typical unsupported claims and a need for more consistent
253
+ data release practices. Following the suggestions and results, we discuss
254
+ how incorporating the suggestions can improve evaluation research, how the
255
+ suggestions differ from similar ones made for NLU, and how better metrics
256
+ can benefit model development itself (section 8).
257
+ Question, Answer:
258
+ example_title: NLG-Eval (2202.06935)
259
+ datasets:
260
+ - UNIST-Eunchan/NLP-Paper-to-QA-Generation
261
+ language:
262
+ - en
263
+ pipeline_tag: text2text-generation
264
  ---
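
For reference, the front matter above declares the dataset, the `text2text-generation` pipeline tag, and a fixed prompt template used by every widget example. The sketch below shows one way that prompt could be rebuilt from the linked dataset; the split name and the `abstract`/`introduction` field names are assumptions, since the diff does not show the dataset schema.

```python
# Sketch only: rebuild the widget-style prompt from the dataset listed above.
# The split and the "abstract"/"introduction" field names are assumptions;
# check the actual schema of UNIST-Eunchan/NLP-Paper-to-QA-Generation first.
from datasets import load_dataset

papers = load_dataset("UNIST-Eunchan/NLP-Paper-to-QA-Generation", split="train")

def build_prompt(example: dict) -> str:
    # Mirrors the prompt template used by the widget examples in the front matter.
    return (
        "Generate Question, Answer pair correspond to the following research paper. "
        f"[Abstract] {example['abstract']} "
        f"[Introduction] {example['introduction']} \n Question, Answer:"
    )

print(build_prompt(papers[0]))
```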
265
 
266
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
267
  should probably proofread and complete it, then remove this comment. -->
268
 
269
+ # FLAN-T5-NLP-Paper-to-Question-Generation
270
 
271
  This model is a fine-tuned version of [google/flan-t5-large](https://huggingface.co/google/flan-t5-large) on an unknown dataset.
272
  It achieves the following results on the evaluation set:
 
321
  - Transformers 4.35.2
322
  - Pytorch 2.1.0+cu118
323
  - Datasets 2.15.0
324
+ - Tokenizers 0.15.0
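
As a usage note, here is a minimal inference sketch with the Transformers `pipeline` API; the repository id is assumed from the model-index name above and may differ, and the generation settings are illustrative rather than the ones used during training.

```python
# Minimal inference sketch. The model id is assumed from the model-index name
# in the front matter; adjust it to the actual repository id if it differs.
from transformers import pipeline

qa_generator = pipeline(
    "text2text-generation",
    model="UNIST-Eunchan/FLAN-T5-NLP-Paper-to-Question-Generation",
)

prompt = (
    "Generate Question, Answer pair correspond to the following research paper. "
    "[Abstract] ... [Introduction] ... \n Question, Answer:"  # fill in paper text
)

outputs = qa_generator(prompt, max_new_tokens=128, num_beams=4)
print(outputs[0]["generated_text"])
```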