---
license: llama3.1
datasets:
- survivi/Llama-3-SynE-Dataset
- hfl/stem_zh_instruction
- llamafactory/alpaca_zh
- llamafactory/alpaca_gpt4_zh
- hfl/ruozhiba_gpt4
- codingsteven/Llama-3-8B-chat
language:
- zh
metrics:
- accuracy
base_model:
- meta-llama/Llama-3.1-8B
pipeline_tag: text-generation
library_name: transformers
model-index:
- name: Control-LLM-Llama3.1-8B-SynE-Concat16-Dlerp
  results:
  - task:
      type: pretraining-evaluation
    dataset:
      type: mixed
      name: Pretraining Evaluation Dataset
    metrics:
    - name: exact_match,strict-match (meta_pretrain)
      type: exact_match
      value: 0.48514264142803215
      stderr: 0.003513307445696379
      verified: false
    - name: exact_match,strict-match (meta_bbh_3shot_cot_pretrain)
      type: exact_match
      value: 0.6817693134695131
      stderr: 0.0057729694388110805
      verified: false
    - name: acc,none (meta_mmlu_5shot_pretrain)
      type: accuracy
      value: 0.65596068936049
      stderr: 0.0040090726054856874
      verified: false
    - name: exact_match,strict-match (meta_mmlu_pro_5shot_pretrain)
      type: exact_match
      value: 0.3787400265957447
      stderr: 0.004422383756050139
      verified: false
  - task:
      type: chinese-evaluation
    dataset:
      type: mixed
      name: Chinese Evaluation Dataset
    metrics:
    - name: exact_match,strict-match (zh_pretrain_multishot)
      type: exact_match
      value: 0.44848391089108913
      stderr: 0.004255614019851072
      verified: false
    - name: acc,none (ceval-valid)
      type: accuracy
      value: 0.5698365527488856
      stderr: 0.012893833892221353
      verified: false
    - name: exact_match,strict-match (ceval-valid-pretrain-cot_zh)
      type: exact_match
      value: 0.4472511144130758
      stderr: 0.013203606600472227
      verified: false
    - name: acc,none (cmmlu)
      type: accuracy
      value: 0.5602659298912105
      stderr: 0.0044928840587441605
      verified: false
    - name: exact_match,strict-match (cmmlu_pretrain_cot_zh)
      type: exact_match
      value: 0.4486271801070627
      stderr: 0.00449553418468653
      verified: false
---

# Control-LLM-Llama3.1-8B-SynE-Concat16-Dlerp

This is a fine-tuned model of Llama-3.1-8B for multilingual (Chinese) tasks, trained on the SynE dataset using the Control LLM Concat16-Dlerp method, as described in [Control LLM: Controlled Evolution for Intelligence Retention in LLM](https://huggingface.co/papers/2501.10979).
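
## Usage

A minimal text-generation sketch using the `transformers` library. The repository id below is an assumption derived from the model name; substitute the actual Hugging Face repo path if it differs.

```python
# Minimal sketch: load the model and generate Chinese text.
# NOTE: the repo id is assumed from the model name, not confirmed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ControlLLM/Control-LLM-Llama3.1-8B-SynE-Concat16-Dlerp"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16 keeps the 8B model's memory footprint manageable
    device_map="auto",
)

prompt = "请用中文简要介绍大语言模型。"  # "Briefly introduce large language models, in Chinese."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```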

## Linked Paper

This model is associated with the paper [Control LLM](https://arxiv.org/abs/2501.10979).

## Linked Open-Source Code: Training, Evaluation, and Benchmarks

This model is associated with the GitHub repository [Control LLM](https://github.com/linkedin/ControlLLM), which contains the training, evaluation, and benchmark code.
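
As a rough guide, the `ceval-valid` and `cmmlu` scores reported below can be reproduced with EleutherAI's lm-evaluation-harness, which ships both tasks. A minimal sketch, reusing the assumed repository id from the usage example above (the exact task configurations behind the published numbers are in the linked repository):

```python
# Sketch: run the Chinese benchmarks with lm-evaluation-harness (pip install lm-eval).
# The repo id is assumed; the paper's exact eval configs live in the GitHub repo above.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ControlLLM/Control-LLM-Llama3.1-8B-SynE-Concat16-Dlerp,dtype=bfloat16",
    tasks=["ceval-valid", "cmmlu"],
    num_fewshot=5,  # few-shot count is an assumption; match the paper's setup
)
for task, metrics in results["results"].items():
    print(task, metrics)
```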

## Evaluation Results

Here is an overview of the evaluation results and findings.

### Benchmark Results Table

The table below summarizes evaluation results across Chinese tasks and original capabilities.

| **Model**          | **CEval** | **CEvalC** | **CMMLU** | **CMMLUC** | **C-Avg** | **BBH**  | **MMLU** | **MMLUP** | **O-Avg** | **Overall** |
|--------------------|-----------|------------|-----------|------------|-----------|----------|----------|-----------|-----------|-------------|
| Llama3.1-8B        | 48.3      | 12.8       | 51.1      | 14.1       | 13.9      | 65.2     | 65.4     | 35.5      | 45.9      | 29.9        |
| Llama-3-SynE       | 57.7      | 22.3       | 57.1      | 22.8       | 22.8      | 61.9     | 64.0     | 32.6      | 42.9      | 32.9        |
| Full Param Tune    | 59.0      | 40.2       | **60.2**  | 44.3       | 43.8      | 64.8     | 64.9     | 35.0      | 45.4      | 44.6        |
| Stack Expansion    | 56.0      | 32.7       | 55.2      | 33.4       | 33.3      | 62.3     | 65.6     | 35.3      | 44.8      | 39.1        |
| Concat-Lerp        | 57.1      | 34.8       | 57.0      | 37.4       | 37.1      | 64.4     | 64.6     | 35.8      | 45.9      | 41.5        |
| Hybrid Expansion   | **58.9**  | 44.7       | 57.9      | 44.3       | 44.4      | 65.1     | **65.7** | 36.9      | 46.8      | 45.6        |
| **Control LLM\***  | 57.0      | **44.7**   | 56.0      | **44.9**   | **44.8**  | **68.2** | 65.6     | **37.9**  | **48.5**  | **46.7**    |

---

### Explanation

- **CEval**: Chinese Evaluation
- **CEvalC**: Chinese Evaluation (CoT, Chain of Thought)
- **CMMLU**: Chinese MMLU
- **CMMLUC**: Chinese MMLU (CoT)
- **C-Avg**: Chinese capability; size-weighted average across CEval, CEvalC, CMMLU, and CMMLUC (see the sketch after this list)
- **BBH**: BigBench Hard
- **MMLU**: Massive Multitask Language Understanding
- **MMLUP**: MMLU Pro
- **O-Avg**: Original capability; size-weighted average across BBH, MMLU, and MMLUP
- **Overall**: Combined average across all tasks
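
The size-weighted averages above are plain weighted means: each benchmark's score is weighted by its number of evaluation examples. A minimal sketch of the computation, with placeholder example counts (the actual counts come from the benchmark configurations in the linked repository):

```python
# Sketch of the size-weighted averaging behind C-Avg and O-Avg.
# The example counts below are PLACEHOLDERS; the published averages use the
# real benchmark sizes from the evaluation configs in the linked repository.

def size_weighted_avg(scores: dict[str, float], sizes: dict[str, int]) -> float:
    """Weighted mean of benchmark scores, weighted by example count."""
    total = sum(sizes.values())
    return sum(scores[name] * sizes[name] for name in scores) / total

scores = {"BBH": 68.2, "MMLU": 65.6, "MMLUP": 37.9}  # Control LLM row above
sizes = {"BBH": 1000, "MMLU": 1000, "MMLUP": 1000}   # placeholder counts
print(f"O-Avg with placeholder sizes: {size_weighted_avg(scores, sizes):.1f}")
```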