	---
	language:
	- en
	license: apache-2.0
	library_name: transformers
	tags:
	- generated_from_trainer
	- chatgpt
	- HC3
	datasets:
	- pszemraj/HC3-textgen-qa
	metrics:
	- accuracy
	widget:
	- text: 'Review: Best cast iron skillet you will ever buy. Is this review positive
	    or negative? <answer>'
	  example_title: Sentiment analysis
	- text: Barack Obama nominated Hillary Clinton as his secretary of state on Monday.
	    He chose her because <answer>
	  example_title: Coreference resolution
	- text: 'On a shelf, there are five books: a gray book, a red book, a purple book,
	    a blue book, and a black book. Here''s the puzzle, <answer>'
	  example_title: Logic puzzles
	- text: The two men running to become New York City's next mayor will face off in
	    their first debate Wednesday night <answer>
	  example_title: Reading comprehension
	- text: Is it true that if I have five 5-hour energy drinks in a single 24-hour period,
	    I get 25 hours of energy and spontaneously explode? <answer>
	  example_title: 5 hour energy
	- text: what happens if you train a smaller model on a dataset of reinforcement-learning
	    optimized model responses? <answer>
	  example_title: deep learning advice
	inference:
	  parameters:
	    temperature: 0.6
	    max_length: 96
	    no_repeat_ngram_size: 4
	    repetition_penalty: 1.5
	    eta_cutoff: 0.0008
	    renormalize_logits: true
	pipeline_tag: text-generation
	model-index:
	- name: distilgpt2-HC3
	  results: []
	---


	# distilgpt2-HC3


	> What happens if you train a smaller model on a dataset of ChatGPT responses?

	This happens.

	![example](https://i.imgur.com/i5snxQJ.png)

	## Model description

	This model is a fine-tuned version of [distilgpt2](https://huggingface.co/distilgpt2) on the "chatgpt answers" column of the `Hello-SimpleAI/HC3` dataset.

	It achieves the following results on the evaluation set:
	- Loss: 1.9983
	- Accuracy: 0.5441


	## Intended uses & limitations

	Despite how fluent its output sounds, this model has only about 80M parameters and will likely not be factually accurate most of the time.
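
For trying the model locally, the widget prompts and the `inference.parameters` block in the front matter suggest the following minimal sketch. The repo id `pszemraj/distilgpt2-HC3` is inferred from the leaderboard link, `do_sample=True` is an assumption (it is not listed in the card, but `temperature` and `eta_cutoff` only take effect when sampling), and the helper names `build_prompt`, `extract_answer`, and `ask` are hypothetical.

```python
MODEL_ID = "pszemraj/distilgpt2-HC3"  # assumed repo id, inferred from the leaderboard link

# Mirrors the `inference.parameters` block in the front matter.
GENERATION_KWARGS = {
    "do_sample": True,  # assumption: not in the card; sampling is needed for temperature/eta_cutoff
    "temperature": 0.6,
    "max_length": 96,
    "no_repeat_ngram_size": 4,
    "repetition_penalty": 1.5,
    "eta_cutoff": 0.0008,
    "renormalize_logits": True,
}


def build_prompt(question: str) -> str:
    """Append the `<answer>` marker, as in the widget examples."""
    return f"{question.strip()} <answer>"


def extract_answer(generated_text: str) -> str:
    """Keep the text between `<answer>` and `<end_answer>`.

    If the markers are missing from the decoded text, the untrimmed text
    (which may still include the prompt) is returned.
    """
    head, sep, tail = generated_text.partition("<answer>")
    body = tail if sep else head
    answer, _, _ = body.partition("<end_answer>")
    return answer.strip()


def ask(question: str) -> str:
    """Generate one answer; downloads the model on first use."""
    from transformers import pipeline  # imported lazily so the helpers above work without it

    generator = pipeline("text-generation", model=MODEL_ID)
    outputs = generator(build_prompt(question), **GENERATION_KWARGS)
    return extract_answer(outputs[0]["generated_text"])
```

Calling `ask("Why is the sky blue?")` should return a fluent-sounding answer, which, per the limitation above, should not be trusted as fact.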

	## Training and evaluation data

	Modifications made relative to the original dataset:

	- Rows without a ChatGPT answer were dropped.
	- If a row (_e.g. an ELI5 question_) had more than one ChatGPT response, one of them was chosen at random as the answer to the question.
	- The question and the ChatGPT answer were combined into a single string for that row as follows: `QUESTION_TEXT <answer> CHATGPT_ANSWER_TEXT <end_answer>`
	- `<answer>` and `<end_answer>` are added tokens that help the model learn "turns" in the conversation.
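
The steps above can be sketched in plain Python. This is an illustration, not the script that produced the training data: the column names `question` and `chatgpt_answers` are assumed to match the HC3 schema, and the function name `build_training_texts` and the `seed` argument are hypothetical.

```python
import random


def build_training_texts(rows, seed=0):
    """Turn HC3-style rows into `QUESTION <answer> ANSWER <end_answer>` strings.

    Assumes each row is a dict with a `question` string and a
    `chatgpt_answers` list of strings (assumed column names).
    """
    rng = random.Random(seed)  # seeded so the random choice is reproducible
    texts = []
    for row in rows:
        # Keep only non-empty ChatGPT answers.
        answers = [a for a in (row.get("chatgpt_answers") or []) if a and a.strip()]
        if not answers:
            continue  # drop rows without a ChatGPT answer
        answer = rng.choice(answers)  # pick one response at random
        texts.append(f"{row['question'].strip()} <answer> {answer.strip()} <end_answer>")
    return texts


rows = [
    {"question": "Why is the sky blue?", "chatgpt_answers": ["Rayleigh scattering."]},
    {"question": "Unanswered question", "chatgpt_answers": []},
]
print(build_training_texts(rows))
```

With a `transformers` tokenizer, the two markers would typically also be registered via `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`; whether the original training did exactly this is not stated here.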

	## Training procedure


	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 0.001
	- train_batch_size: 8
	- eval_batch_size: 4
	- seed: 3208
	- gradient_accumulation_steps: 16
	- total_train_batch_size: 128
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: cosine
	- lr_scheduler_warmup_ratio: 0.05
	- num_epochs: 6.0
	- mixed_precision_training: Native AMP
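
To make the interaction between these settings concrete, here is a small sketch of the effective batch size and of a linear-warmup, cosine-decay learning-rate curve. It assumes a single device and takes the total of 246 optimizer steps from the last row of the training results table. The formula follows the usual warmup-then-half-cosine shape and is not copied from the training script.

```python
import math

LEARNING_RATE = 1e-3
TRAIN_BATCH_SIZE = 8
GRAD_ACCUM_STEPS = 16
WARMUP_RATIO = 0.05
TOTAL_STEPS = 246  # assumption: last logged step in the training results table

# 8 samples per step * 16 accumulation steps = 128 (assuming one device)
effective_batch_size = TRAIN_BATCH_SIZE * GRAD_ACCUM_STEPS
warmup_steps = math.ceil(TOTAL_STEPS * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Learning rate after `step` optimizer steps."""
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return LEARNING_RATE * step / max(1, warmup_steps)
    # Half-cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, TOTAL_STEPS - warmup_steps)
    return LEARNING_RATE * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


for step in (0, warmup_steps, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, lr_at(step))
```

The rate peaks at 0.001 once warmup ends and decays smoothly to zero by the final step.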

	### Training results

	| Training Loss | Epoch | Step | Validation Loss | Accuracy |
	|:-------------:|:-----:|:----:|:---------------:|:--------:|
	| 2.2485 | 0.98 | 41 | 2.1457 | 0.5158 |
	| 2.0757 | 1.98 | 82 | 2.0584 | 0.5304 |
	| 1.966 | 2.98 | 123 | 2.0210 | 0.5376 |
	| 1.8602 | 3.98 | 164 | 2.0012 | 0.5422 |
	| 1.8089 | 4.98 | 205 | 1.9977 | 0.5436 |
	| 1.7698 | 5.98 | 246 | 1.9983 | 0.5441 |
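
The validation losses can be read as perplexities, assuming (as is typical for causal language-model training, but not confirmed here) that the reported loss is the mean per-token cross-entropy in nats. The short sketch below converts the table's values and finds the checkpoint with the lowest validation loss; under that assumption the final loss of 1.9983 corresponds to a perplexity of roughly 7.4.

```python
import math

# (epoch, validation loss) pairs copied from the table above
eval_history = [
    (0.98, 2.1457),
    (1.98, 2.0584),
    (2.98, 2.0210),
    (3.98, 2.0012),
    (4.98, 1.9977),
    (5.98, 1.9983),
]


def perplexity(loss: float) -> float:
    """exp(loss), valid if the loss is mean per-token cross-entropy in nats."""
    return math.exp(loss)


best_epoch, best_loss = min(eval_history, key=lambda pair: pair[1])
final_ppl = perplexity(eval_history[-1][1])

for epoch, loss in eval_history:
    print(f"epoch {epoch:.2f}: loss {loss:.4f}, perplexity {perplexity(loss):.2f}")
print(f"lowest validation loss at epoch {best_epoch}")
```

The loss is marginally lower at epoch 4.98 than at the final evaluation, while accuracy still improves slightly in the last epoch.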


	### Framework versions

	- Transformers 4.27.0.dev0
	- PyTorch 1.11.0+cu113
	- Datasets 2.6.1
	- Tokenizers 0.12.1

	# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)

	Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_pszemraj__distilgpt2-HC3).

	| Metric                            | Value |
	|-----------------------------------|------:|
	| Avg.                              | 28.18 |
	| AI2 Reasoning Challenge (25-shot) | 24.66 |
	| HellaSwag (10-shot)               | 27.99 |
	| MMLU (5-shot)                     | 23.95 |
	| TruthfulQA (0-shot)               | 42.10 |
	| Winogrande (5-shot)               | 50.36 |
	| GSM8k (5-shot)                    |  0.00 |