Add metadata and paper link (#1)

	---
	license: llama3.1
	datasets:
	- survivi/Llama-3-SynE-Dataset
	- hfl/stem_zh_instruction
	- llamafactory/alpaca_zh
	- llamafactory/alpaca_gpt4_zh
	- hfl/ruozhiba_gpt4
	- codingsteven/Llama-3-8B-chat
	language:
	- zh
	metrics:
	- accuracy
	base_model:
	- meta-llama/Llama-3.1-8B
	model-index:
	- name: Control-LLM-Llama3.1-8B-SynE-Concat16-Lerp
	  results:
	  - task:
	      type: pretraining-evaluation
	    dataset:
	      type: mixed
	      name: Pretraining Evaluation Dataset
	    metrics:
	    - name: exact_match,strict-match (meta_pretrain)
	      type: exact_match
	      value: 0.458555789246616
	      stderr: 0.003519105746208811
	      verified: false
	    - name: exact_match,strict-match (meta_bbh_3shot_cot_pretrain)
	      type: exact_match
	      value: 0.6442942712332975
	      stderr: 0.005933310420690264
	      verified: false
	    - name: acc,none (meta_mmlu_5shot_pretrain)
	      type: accuracy
	      value: 0.6464178891895741
	      stderr: 0.004034621567546711
	      verified: false
	    - name: exact_match,strict-match (meta_mmlu_pro_5shot_pretrain)
	      type: exact_match
	      value: 0.35804521276595747
	      stderr: 0.004370894189453768
	      verified: false
	  - task:
	      type: chinese-evaluation
	    dataset:
	      type: mixed
	      name: Chinese Evaluation Dataset
	    metrics:
	    - name: exact_match,strict-match (zh_pretrain_multishot)
	      type: exact_match
	      value: 0.37105507425742573
	      stderr: 0.004143191283994466
	      verified: false
	    - name: acc,none (ceval-valid)
	      type: accuracy
	      value: 0.5713224368499257
	      stderr: 0.01292052444857274
	      verified: false
	    - name: exact_match,strict-match (ceval-valid-pretrain-cot_zh)
	      type: exact_match
	      value: 0.34843982169390786
	      stderr: 0.01265919137729175
	      verified: false
	    - name: acc,none (cmmlu)
	      type: accuracy
	      value: 0.5689000172681747
	      stderr: 0.004489346390434928
	      verified: false
	    - name: exact_match,strict-match (cmmlu_pretrain_cot_zh)
	      type: exact_match
	      value: 0.37368330167501296
	      stderr: 0.00438421288652232
	      verified: false
	pipeline_tag: text-generation
	library_name: transformers
	---

	# Control-LLM-Llama3.1-8B-SynE-Concat16-Lerp
	This is a fine-tuned version of Llama-3.1-8B for multilingual (Chinese) tasks, trained on the SynE dataset with Control LLM-Concat16-Lerp.

	## Linked Paper
	This model is associated with the paper: [Control LLM: Controlled Evolution for Intelligence Retention in LLM](https://huggingface.co/papers/2501.10979).

	## Linked Open Source Code: Training, Evaluation, and Benchmark
	The training, evaluation, and benchmark code for this model is available on GitHub: [Control-LLM](https://github.com/linkedin/ControlLLM).

	## Evaluation Results
	Here is an overview of the evaluation results and findings:

	### Benchmark Results Table

	The table below summarizes evaluation results across Chinese tasks and original capabilities.

	| Model | CEval | CEvalC | CMMLU | CMMLUC | C-Avg | BBH | MLU | MLUP | O-Avg | Overall |
	|------------------|-------|--------|-------|--------|-------|------|------|------|-------|---------|
	| Llama3.1-8B | 48.3 | 12.8 | 51.1 | 14.1 | 13.9 | 65.2 | 65.4 | 35.5 | 45.9 | 29.9 |
	| Llama-3-SynE | 57.7 | 22.3 | 57.1 | 22.8 | 22.8 | 61.9 | 64.0 | 32.6 | 42.9 | 32.9 |
	| Full Param Tune | 59.0 | 40.2 | 60.2 | 44.3 | 43.8 | 64.8 | 64.9 | 35.0 | 45.4 | 44.6 |
	| Stack Expansion | 56.0 | 32.7 | 55.2 | 33.4 | 33.3 | 62.3 | 65.6 | 35.3 | 44.8 | 39.1 |
	| Concat-Lerp | 57.1 | 34.8 | 57.0 | 37.4 | 37.1 | 64.4 | 64.6 | 35.8 | 45.9 | 41.5 |
	| Hybrid Expansion | 58.9 | 44.7 | 57.9 | 44.3 | 44.4 | 65.1 | 65.7 | 36.9 | 46.8 | 45.6 |
	| Control LLM* | 57.0 | 44.7 | 56.0 | 44.9 | 44.8 | 68.2 | 65.6 | 37.9 | 48.5 | 46.7 |
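
The raw fractions in this card's `model-index` front matter appear to correspond to the Concat-Lerp row of the table once expressed as percentages with one decimal place. The sketch below makes that conversion explicit. The pairing of front-matter metric names with table columns is an assumption inferred from the names, not something the card states, and the CMMLU value rounds to 56.9 rather than the 57.0 shown in the table.

```python
# Raw fractions copied from the model-index front matter of this card.
# The column labels on the left are an assumed mapping, inferred from metric names.
RAW_METRICS = {
    "CEval": 0.5713224368499257,    # acc,none (ceval-valid)
    "CEvalC": 0.34843982169390786,  # ceval-valid-pretrain-cot_zh
    "CMMLU": 0.5689000172681747,    # acc,none (cmmlu)
    "CMMLUC": 0.37368330167501296,  # cmmlu_pretrain_cot_zh
    "C-Avg": 0.37105507425742573,   # zh_pretrain_multishot
    "BBH": 0.6442942712332975,      # meta_bbh_3shot_cot_pretrain
    "MLU": 0.6464178891895741,      # meta_mmlu_5shot_pretrain
    "MLUP": 0.35804521276595747,    # meta_mmlu_pro_5shot_pretrain
    "O-Avg": 0.458555789246616,     # meta_pretrain
}


def to_percent(value, digits=1):
    """Convert a fraction in [0, 1] to a percentage rounded to `digits` places."""
    return round(value * 100, digits)


table_row = {name: to_percent(value) for name, value in RAW_METRICS.items()}

for name, pct in table_row.items():
    print(f"{name}: {pct}")
```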

	---

	### Explanation:
	- CEval: Chinese Evaluation
	- CEvalC: Chinese Evaluation (CoT - Chain of Thought)
	- CMMLU: Chinese MMLU
	- CMMLUC: Chinese MMLU (CoT)
	- C-Avg: Chinese average, size-weighted across CEval, CEvalC, CMMLU, and CMMLUC
	- BBH: BigBench Hard
	- MLU: MMLU (Massive Multitask Language Understanding)
	- MLUP: MMLU Pro
	- O-Avg: Original-capability average, size-weighted across BBH, MLU, and MLUP
	- Overall: Combined average across all tasks
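
One way to read "size-weighted average" is as a mean of per-benchmark scores in which each benchmark counts in proportion to its number of examples. The following is a minimal sketch under that assumption. The function name `size_weighted_average` and the example counts are hypothetical placeholders, not the actual benchmark sizes, so the printed value is not expected to reproduce the O-Avg column.

```python
def size_weighted_average(scores, sizes):
    """Average per-benchmark scores, weighting each by its example count."""
    if scores.keys() != sizes.keys():
        raise ValueError("scores and sizes must cover the same benchmarks")
    total = sum(sizes.values())
    if total <= 0:
        raise ValueError("total size must be positive")
    return sum(scores[name] * sizes[name] for name in scores) / total


# Scores taken from the Concat-Lerp row of the table above.
scores = {"BBH": 64.4, "MLU": 64.6, "MLUP": 35.8}
# Hypothetical example counts, for illustration only.
sizes = {"BBH": 1000, "MLU": 2000, "MLUP": 3000}

print(round(size_weighted_average(scores, sizes), 1))
```

With real example counts substituted for the placeholders, the same function would apply to the Chinese benchmarks for C-Avg.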