license: apache-2.0
datasets:
  - augmxnt/ultra-orca-boros-en-ja-v1
language:
  - ja
  - en
tags:
  - jamba
  - axolotl

Over the weekend, after a failed initial run, I got excited by Pete's success with Jamba tuning and decided to throw a little compute at a similar-sized dataset (the main shisa-v1 bilingual tuning set).

As with my initial runs, the training graphs looked fine, but the results were less than spectacular.

Here are the JA MT-Bench evals for the 2416 checkpoint (eval/loss plateau) and the 4228 (3 epoch) tune:

shisa-jamba-v1-checkpoint-2416     2.491525
shisa-jamba-v1-checkpoint-4228     2.508475
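For context on how a per-model number like the ones above is typically produced, here is a minimal sketch of averaging single-answer judge scores. It assumes FastChat-style llm-judge records with `model`, `turn`, and `score` fields, where a score of -1 marks a judgment that could not be parsed; the modified llm-judge code used for this run may differ, and the records below are made up for illustration, not results from this model.

```python
from collections import defaultdict


def mean_scores(judgments):
    """Average judge scores per model, skipping failed judgments (score == -1)."""
    totals = defaultdict(lambda: [0.0, 0])  # model -> [score sum, judgment count]
    for row in judgments:
        if row["score"] == -1:  # assumed convention for unparseable judgments
            continue
        acc = totals[row["model"]]
        acc[0] += row["score"]
        acc[1] += 1
    return {model: total / count for model, (total, count) in totals.items()}


# Hypothetical records purely for illustration; NOT real results from this run.
example = [
    {"model": "model-a", "turn": 1, "score": 3},
    {"model": "model-a", "turn": 2, "score": 2},
    {"model": "model-a", "turn": 1, "score": -1},
    {"model": "model-b", "turn": 1, "score": 4},
]

print(mean_scores(example))
```

Both turns are pooled into a single average here, which is one common way to report a headline MT-Bench number; per-turn averages would need an extra grouping key.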

You can view the answers in the repo (lots of repetitions and nonsense) and compare them to proper JA MT-Bench scores from my testing.
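If you want a rough way to flag the repetitive answers without reading all of them, something like the sketch below could work. It is a simple heuristic of my own choosing, not part of the eval code: it measures the share of duplicated character n-grams in a string (characters rather than words, since Japanese text is not whitespace-delimited). The function name and the default `n` are hypothetical.

```python
def repeated_ngram_ratio(text, n=4):
    """Return the fraction of character n-grams that are duplicates (0.0 = none)."""
    ngrams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not ngrams:
        return 0.0  # text shorter than n characters
    return 1.0 - len(set(ngrams)) / len(ngrams)


# A looping answer scores high, a non-repeating one scores zero.
looping = "abcabcabcabc"
varied = "abcdefgh"
print(repeated_ngram_ratio(looping, n=3))
print(repeated_ngram_ratio(varied, n=3))
```

Any cutoff for "too repetitive" would have to be tuned by eye against a few known-good answers; this only ranks answers by how much they loop.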

Although this was an "unsuccessful" experiment, it was still worth the practice. That said, I got a little excited and obviously should have gone with my more typical lighter testing first.

This kicks off the official shisa-v2 base model evaluation. I was a bit hesitant about throwing this model out there (since it's useless as an artifact), but since I've actually made the in-process code available while working on it, I'll share this as well just in case (and to do this writeup).

Here is the current full code/steps for Axolotl training and eval (modified llm-judge inferencing code):

Thanks to Pete for the useful initial report and the axolotl team for their fast integration of Jamba (way better than my raw tune code).