---
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
language:
  - en
  - am
  - ar
  - as
  - az
  - be
  - bg
  - bn
  - br
  - bs
  - ca
  - cs
  - cy
  - da
  - de
  - el
  - eo
  - es
  - et
  - eu
  - fa
  - ff
  - fi
  - fr
  - fy
  - ga
  - gd
  - gl
  - gn
  - gu
  - ha
  - he
  - hi
  - hr
  - ht
  - hu
  - hy
  - id
  - ig
  - is
  - it
  - ja
  - jv
  - ka
  - kk
  - km
  - kn
  - ko
  - ku
  - ky
  - la
  - lg
  - li
  - ln
  - lo
  - lt
  - lv
  - mg
  - mk
  - ml
  - mn
  - mr
  - ms
  - my
  - ne
  - nl
  - 'no'
  - ns
  - om
  - or
  - pa
  - pl
  - ps
  - pt
  - qu
  - rm
  - ro
  - ru
  - sa
  - si
  - sc
  - sd
  - sk
  - sl
  - so
  - sq
  - sr
  - ss
  - su
  - sv
  - sw
  - ta
  - te
  - th
  - tl
  - tn
  - tr
  - ug
  - uk
  - ur
  - uz
  - vi
  - wo
  - xh
  - yi
  - yo
  - zu
datasets:
  - yahma/alpaca-cleaned
  - saillab/taco-datasets
  - xu-song/cc100-samples
  - badrex/llm-emoji-dataset
  - pszemraj/simple_wikipedia
  - AtlasUnified/Atlas-Reasoning
  - fblgit/simple-math
  - AtlasUnified/atlas-math-sets
  - rvv-karma/Math-QA
  - microsoft/orca-math-word-problems-200k
  - meta-math/MetaMathQA
  - TIGER-Lab/MathInstruct
  - ChuGyouk/WebInstructSub-only-socratic
  - thesven/gsm8k-reasoning
  - AlgorithmicResearchGroup/math_reasoning_autoformalization_track
  - KingNish/reasoning-base-20k
  - fmars/wiki_stem
  - ChuGyouk/WebInstructSub-only-sciencestackexchange
  - bigcode/the-stack-smol-xs
  - cognitivecomputations/dolphin-coder
  - HuggingFaceH4/CodeAlpaca_20K
  - m-a-p/CodeFeedback-Filtered-Instruction
  - NuclearAi/Nuke-X-Glaive-Python-Dataset
  - iamtarun/python_code_instructions_18k_alpaca
  - kloodia/html_200k
  - kloodia/json_200k
  - kloodia/javascript_200k
  - bleugreen/typescript-chunks
  - SkunkworksAI/reasoning-0.01
  - Magpie-Align/Magpie-Reasoning-150K
tags:
  - litgpt
  - litdata
---

# tangled-llama-q-32k-base-v0.1

A pretrained language model based on the Llama architecture with about 65M parameters. It has been trained on 16.7B (16,698,858,240) tokens from more than 3.6M (3,597,088) dataset rows.

This model is not intended for immediate use; it is meant as a base for continued pretraining and finetuning on downstream tasks. While it can handle a context length of up to 128K (131,072) tokens, it was pretrained on sequences of 2K (2,048) tokens.

The objective is to preserve a streamlined cognitive and reasoning core while eliminating redundant knowledge from the model.
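
Since the card lists `library_name: transformers`, the checkpoint should be loadable as a standard causal LM for quick sanity checks before continued pretraining or finetuning. A minimal sketch, assuming the weights are published on the Hub in a transformers-compatible format; the repo id and prompt below are placeholders:

```python
# Minimal sanity-check sketch. Assumptions: the repo id below is a placeholder
# (substitute the full `org/name` Hub path) and the weights are stored in a
# transformers-compatible format.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "tangled-llama-q-32k-base-v0.1"  # placeholder Hub id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.bfloat16)

# This is a base model, so expect raw continuations rather than instruction-style answers.
inputs = tokenizer("Continued pretraining lets a small base model", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```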

Training plots: loss / val_loss, val_ppl, epoch, and learning_rate (images not included here).

## lm-evaluation-harness

```bash
litgpt evaluate --tasks 'hellaswag,gsm8k,truthfulqa_mc2,mmlu,winogrande,arc_challenge' --out_dir 'evaluate-quick/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| arc_challenge | 1 | none | 0 | acc | 0.1962 | ± 0.0116 |
| | | none | 0 | acc_norm | 0.2304 | ± 0.0123 |
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.0144 | ± 0.0033 |
| | | strict-match | 5 | exact_match | 0.0015 | ± 0.0011 |
| hellaswag | 1 | none | 0 | acc | 0.2631 | ± 0.0044 |
| | | none | 0 | acc_norm | 0.2758 | ± 0.0045 |
| mmlu | 2 | none | | acc | 0.2473 | ± 0.0036 |
| - humanities | 2 | none | | acc | 0.2351 | ± 0.0062 |
| - formal_logic | 1 | none | 0 | acc | 0.2857 | ± 0.0404 |
| - high_school_european_history | 1 | none | 0 | acc | 0.2667 | ± 0.0345 |
| - high_school_us_history | 1 | none | 0 | acc | 0.2696 | ± 0.0311 |
| - high_school_world_history | 1 | none | 0 | acc | 0.2110 | ± 0.0266 |
| - international_law | 1 | none | 0 | acc | 0.1653 | ± 0.0339 |
| - jurisprudence | 1 | none | 0 | acc | 0.2870 | ± 0.0437 |
| - logical_fallacies | 1 | none | 0 | acc | 0.2331 | ± 0.0332 |
| - moral_disputes | 1 | none | 0 | acc | 0.2283 | ± 0.0226 |
| - moral_scenarios | 1 | none | 0 | acc | 0.2425 | ± 0.0143 |
| - philosophy | 1 | none | 0 | acc | 0.2186 | ± 0.0235 |
| - prehistory | 1 | none | 0 | acc | 0.2099 | ± 0.0227 |
| - professional_law | 1 | none | 0 | acc | 0.2314 | ± 0.0108 |
| - world_religions | 1 | none | 0 | acc | 0.2632 | ± 0.0338 |
| - other | 2 | none | | acc | 0.2485 | ± 0.0078 |
| - business_ethics | 1 | none | 0 | acc | 0.2600 | ± 0.0441 |
| - clinical_knowledge | 1 | none | 0 | acc | 0.2528 | ± 0.0267 |
| - college_medicine | 1 | none | 0 | acc | 0.2254 | ± 0.0319 |
| - global_facts | 1 | none | 0 | acc | 0.2700 | ± 0.0446 |
| - human_aging | 1 | none | 0 | acc | 0.2377 | ± 0.0286 |
| - management | 1 | none | 0 | acc | 0.2816 | ± 0.0445 |
| - marketing | 1 | none | 0 | acc | 0.2692 | ± 0.0291 |
| - medical_genetics | 1 | none | 0 | acc | 0.2600 | ± 0.0441 |
| - miscellaneous | 1 | none | 0 | acc | 0.2350 | ± 0.0152 |
| - nutrition | 1 | none | 0 | acc | 0.2549 | ± 0.0250 |
| - professional_accounting | 1 | none | 0 | acc | 0.2801 | ± 0.0268 |
| - professional_medicine | 1 | none | 0 | acc | 0.2610 | ± 0.0267 |
| - virology | 1 | none | 0 | acc | 0.1807 | ± 0.0300 |
| - social sciences | 2 | none | | acc | 0.2658 | ± 0.0080 |
| - econometrics | 1 | none | 0 | acc | 0.1930 | ± 0.0371 |
| - high_school_geography | 1 | none | 0 | acc | 0.2172 | ± 0.0294 |
| - high_school_government_and_politics | 1 | none | 0 | acc | 0.3212 | ± 0.0337 |
| - high_school_macroeconomics | 1 | none | 0 | acc | 0.2923 | ± 0.0231 |
| - high_school_microeconomics | 1 | none | 0 | acc | 0.3025 | ± 0.0298 |
| - high_school_psychology | 1 | none | 0 | acc | 0.2752 | ± 0.0191 |
| - human_sexuality | 1 | none | 0 | acc | 0.2290 | ± 0.0369 |
| - professional_psychology | 1 | none | 0 | acc | 0.2386 | ± 0.0172 |
| - public_relations | 1 | none | 0 | acc | 0.2636 | ± 0.0422 |
| - security_studies | 1 | none | 0 | acc | 0.3143 | ± 0.0297 |
| - sociology | 1 | none | 0 | acc | 0.2338 | ± 0.0299 |
| - us_foreign_policy | 1 | none | 0 | acc | 0.2600 | ± 0.0441 |
| - stem | 2 | none | | acc | 0.2464 | ± 0.0077 |
| - abstract_algebra | 1 | none | 0 | acc | 0.2500 | ± 0.0435 |
| - anatomy | 1 | none | 0 | acc | 0.2148 | ± 0.0355 |
| - astronomy | 1 | none | 0 | acc | 0.1908 | ± 0.0320 |
| - college_biology | 1 | none | 0 | acc | 0.2569 | ± 0.0365 |
| - college_chemistry | 1 | none | 0 | acc | 0.2700 | ± 0.0446 |
| - college_computer_science | 1 | none | 0 | acc | 0.3500 | ± 0.0479 |
| - college_mathematics | 1 | none | 0 | acc | 0.2700 | ± 0.0446 |
| - college_physics | 1 | none | 0 | acc | 0.2745 | ± 0.0444 |
| - computer_security | 1 | none | 0 | acc | 0.3000 | ± 0.0461 |
| - conceptual_physics | 1 | none | 0 | acc | 0.2766 | ± 0.0292 |
| - electrical_engineering | 1 | none | 0 | acc | 0.2345 | ± 0.0353 |
| - elementary_mathematics | 1 | none | 0 | acc | 0.2566 | ± 0.0225 |
| - high_school_biology | 1 | none | 0 | acc | 0.2226 | ± 0.0237 |
| - high_school_chemistry | 1 | none | 0 | acc | 0.2217 | ± 0.0292 |
| - high_school_computer_science | 1 | none | 0 | acc | 0.2000 | ± 0.0402 |
| - high_school_mathematics | 1 | none | 0 | acc | 0.2370 | ± 0.0259 |
| - high_school_physics | 1 | none | 0 | acc | 0.2517 | ± 0.0354 |
| - high_school_statistics | 1 | none | 0 | acc | 0.2685 | ± 0.0302 |
| - machine_learning | 1 | none | 0 | acc | 0.1786 | ± 0.0364 |
| truthfulqa_mc2 | 2 | none | 0 | acc | 0.4668 | ± 0.0161 |
| winogrande | 1 | none | 0 | acc | 0.5012 | ± 0.0141 |

| Groups | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| mmlu | 2 | none | | acc | 0.2473 | ± 0.0036 |
| - humanities | 2 | none | | acc | 0.2351 | ± 0.0062 |
| - other | 2 | none | | acc | 0.2485 | ± 0.0078 |
| - social sciences | 2 | none | | acc | 0.2658 | ± 0.0080 |
| - stem | 2 | none | | acc | 0.2464 | ± 0.0077 |

```bash
litgpt evaluate --tasks 'leaderboard' --out_dir 'evaluate-leaderboard/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard | N/A | | | | | |
| - leaderboard_bbh | N/A | | | | | |
| - leaderboard_bbh_boolean_expressions | 1 | none | 3 | acc_norm | 0.4600 | ± 0.0316 |
| - leaderboard_bbh_causal_judgement | 1 | none | 3 | acc_norm | 0.5187 | ± 0.0366 |
| - leaderboard_bbh_date_understanding | 1 | none | 3 | acc_norm | 0.1840 | ± 0.0246 |
| - leaderboard_bbh_disambiguation_qa | 1 | none | 3 | acc_norm | 0.3880 | ± 0.0309 |
| - leaderboard_bbh_formal_fallacies | 1 | none | 3 | acc_norm | 0.4680 | ± 0.0316 |
| - leaderboard_bbh_geometric_shapes | 1 | none | 3 | acc_norm | 0.1000 | ± 0.0190 |
| - leaderboard_bbh_hyperbaton | 1 | none | 3 | acc_norm | 0.5160 | ± 0.0317 |
| - leaderboard_bbh_logical_deduction_five_objects | 1 | none | 3 | acc_norm | 0.2080 | ± 0.0257 |
| - leaderboard_bbh_logical_deduction_seven_objects | 1 | none | 3 | acc_norm | 0.1720 | ± 0.0239 |
| - leaderboard_bbh_logical_deduction_three_objects | 1 | none | 3 | acc_norm | 0.3280 | ± 0.0298 |
| - leaderboard_bbh_movie_recommendation | 1 | none | 3 | acc_norm | 0.2640 | ± 0.0279 |
| - leaderboard_bbh_navigate | 1 | none | 3 | acc_norm | 0.5760 | ± 0.0313 |
| - leaderboard_bbh_object_counting | 1 | none | 3 | acc_norm | 0.0520 | ± 0.0141 |
| - leaderboard_bbh_penguins_in_a_table | 1 | none | 3 | acc_norm | 0.2260 | ± 0.0347 |
| - leaderboard_bbh_reasoning_about_colored_objects | 1 | none | 3 | acc_norm | 0.0720 | ± 0.0164 |
| - leaderboard_bbh_ruin_names | 1 | none | 3 | acc_norm | 0.2280 | ± 0.0266 |
| - leaderboard_bbh_salient_translation_error_detection | 1 | none | 3 | acc_norm | 0.1920 | ± 0.0250 |
| - leaderboard_bbh_snarks | 1 | none | 3 | acc_norm | 0.4831 | ± 0.0376 |
| - leaderboard_bbh_sports_understanding | 1 | none | 3 | acc_norm | 0.4600 | ± 0.0316 |
| - leaderboard_bbh_temporal_sequences | 1 | none | 3 | acc_norm | 0.2360 | ± 0.0269 |
| - leaderboard_bbh_tracking_shuffled_objects_five_objects | 1 | none | 3 | acc_norm | 0.2080 | ± 0.0257 |
| - leaderboard_bbh_tracking_shuffled_objects_seven_objects | 1 | none | 3 | acc_norm | 0.1680 | ± 0.0237 |
| - leaderboard_bbh_tracking_shuffled_objects_three_objects | 1 | none | 3 | acc_norm | 0.3040 | ± 0.0292 |
| - leaderboard_bbh_web_of_lies | 1 | none | 3 | acc_norm | 0.4880 | ± 0.0317 |
| - leaderboard_gpqa | N/A | | | | | |
| - leaderboard_gpqa_diamond | 1 | none | 0 | acc_norm | 0.2121 | ± 0.0291 |
| - leaderboard_gpqa_extended | 1 | none | 0 | acc_norm | 0.2619 | ± 0.0188 |
| - leaderboard_gpqa_main | 1 | none | 0 | acc_norm | 0.2589 | ± 0.0207 |
| - leaderboard_ifeval | 3 | none | 0 | inst_level_loose_acc | 0.1966 | ± N/A |
| | | none | 0 | inst_level_strict_acc | 0.1835 | ± N/A |
| | | none | 0 | prompt_level_loose_acc | 0.1017 | ± 0.0130 |
| | | none | 0 | prompt_level_strict_acc | 0.0998 | ± 0.0129 |
| - leaderboard_math_hard | N/A | | | | | |
| - leaderboard_math_algebra_hard | 1 | none | 4 | exact_match | 0.0000 | ± 0 |
| - leaderboard_math_counting_and_prob_hard | 1 | none | 4 | exact_match | 0.0000 | ± 0 |
| - leaderboard_math_geometry_hard | 1 | none | 4 | exact_match | 0.0000 | ± 0 |
| - leaderboard_math_intermediate_algebra_hard | 1 | none | 4 | exact_match | 0.0000 | ± 0 |
| - leaderboard_math_num_theory_hard | 1 | none | 4 | exact_match | 0.0000 | ± 0 |
| - leaderboard_math_prealgebra_hard | 1 | none | 4 | exact_match | 0.0000 | ± 0 |
| - leaderboard_math_precalculus_hard | 1 | none | 4 | exact_match | 0.0000 | ± 0 |
| - leaderboard_mmlu_pro | 0.1 | none | 5 | acc | 0.1155 | ± 0.0029 |
| - leaderboard_musr | N/A | | | | | |
| - leaderboard_musr_murder_mysteries | 1 | none | 0 | acc_norm | 0.5040 | ± 0.0317 |
| - leaderboard_musr_object_placements | 1 | none | 0 | acc_norm | 0.3086 | ± 0.0289 |
| - leaderboard_musr_team_allocation | 1 | none | 0 | acc_norm | 0.3400 | ± 0.0300 |

Additional evaluation commands (results not shown here):

```bash
litgpt evaluate --tasks 'bbh_zeroshot,bbh_fewshot,bbh_cot_fewshot,bbh_cot_zeroshot' --out_dir 'evaluate-bigbenchhard/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/

litgpt evaluate --tasks 'mmlu,mmlu_pro' --out_dir 'evaluate-mmlu/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/

litgpt evaluate --tasks 'arc_challenge,boolq,gpqa,hellaswag,openbookqa,piqa,truthfulqa_mc2,winogrande' --out_dir 'evaluate-reasoning/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/

litgpt evaluate --tasks 'mmlu_multilingual,mgsm' --out_dir 'evaluate-multilinguals/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/

litgpt evaluate --tasks 'gsm8k,mathqa' --out_dir 'evaluate-math/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/

litgpt evaluate --tasks 'wikitext,qasper' --out_dir 'evaluate-long/' --batch_size 4 --dtype 'bfloat16' out/pretrain/final/
```
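
For reference, the same task suites can also be driven through the lm-evaluation-harness Python API instead of the `litgpt evaluate` wrapper. This is only a sketch: it assumes the `lm_eval` package is installed and that the litgpt checkpoint has already been exported to a transformers-compatible directory (the path below is hypothetical):

```python
# Sketch of calling lm-evaluation-harness directly.
# Assumption: `out/pretrain/final/converted_hf` is a hypothetical path holding a
# transformers-compatible export of the litgpt checkpoint.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=out/pretrain/final/converted_hf,dtype=bfloat16",
    tasks=["hellaswag", "arc_challenge", "winogrande"],
    batch_size=4,
)
print(results["results"])
```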