This is the 4bpw version of Lyra. Find the original model here
8bpw version here.
6bpw version here.

Mistral-Nemo-12B-Lyra-v1

Anyway. Experimental general roleplaying model.

It works well enough. It scored pretty high on EQ-Bench [77.41], right below Nemomix v4 [77.92], which was, well, a big merge. Not bad.
I wanted to run the Creative Writing benchmark but it was too slow to run, for some reason.
---> EQ-Bench Scores

From my testing, the regular 1.2 temp + 0.1 min_p works pretty well. Or go lower on temp, as Nemo is good at < 1 temp too.
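
For anyone unsure what those two knobs do, here is a minimal sketch of temperature plus min_p sampling in plain Python. It is an illustration only, not the code of any particular backend: the function names are made up for this example, and applying temperature before the min_p cutoff is an assumption, since sampler order differs between backends.

```python
import math
import random


def min_p_probs(logits, temperature=1.2, min_p=0.1):
    """Return token probabilities after temperature scaling and min_p filtering.

    Hypothetical helper for illustration. Assumes temperature is applied
    first; real backends may order their samplers differently.
    """
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    # Softmax, shifted by the max logit for numerical stability.
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # min_p keeps only tokens at least min_p times as likely as the top token.
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    kept_total = sum(kept)
    return [p / kept_total for p in kept]


def sample_token(logits, temperature=1.2, min_p=0.1, rng=random):
    """Draw one token index from the filtered distribution."""
    probs = min_p_probs(logits, temperature, min_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Higher temperature flattens the distribution, and min_p then trims the unlikely tail relative to the top token, which is why the two pair well.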

Prompting Format:

Either [INST] or ChatML works fine. # Why? Merged two differently formatted trains that had some data variation. One on Mistral Instruct, one on ChatML.
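
A rough sketch of what the two formats look like when built by hand. The helper names are hypothetical, and the exact whitespace and special-token handling (BOS in particular) are assumptions here; in practice the tokenizer's own chat template is the safer reference.

```python
def format_mistral(messages):
    """Build a Mistral-Instruct style prompt from user/assistant turns.

    Hypothetical helper. BOS is left out on the assumption that the
    tokenizer adds it; spacing around the tags varies between versions.
    """
    out = ""
    for m in messages:
        if m["role"] == "user":
            out += f"[INST] {m['content']} [/INST]"
        elif m["role"] == "assistant":
            out += f" {m['content']}</s>"
        else:
            raise ValueError(f"unsupported role: {m['role']}")
    return out


def format_chatml(messages, add_generation_prompt=True):
    """Build a ChatML prompt; optionally open the assistant turn."""
    out = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        out += "<|im_start|>assistant\n"
    return out
```

Either string can be passed as a raw prompt to a completion endpoint; pick one format per conversation rather than mixing them.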

Details

- As I said, this was a merge of two models. Their datasets are pretty much the same, except one actually includes roleplay and creative writing, while the other does not and is more focused on instruct and smarts.
- Model A and Model B are each trained on different formats individually.
- Tokenizer and all are taken from base Nemo 12B, so there are no token conflicts.
- A merge between these models with separated datasets seems to do better than training on the datasets mixed together. I have tried shuffled and non-shuffled data mixes.
- Perhaps mixing would work for Full-Finetunes, but I am limited to LoRAs for now.
- For merge methods, della_linear method worked best for this run specifically, according to internal self benchmarks and blind-preference tests.
- Best merge methods may be different for different model types and sizes. On a separate Llama 3 experiment, Ties-Rescaled worked best.
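
To give a feel for what a linear merge of two fine-tunes over a shared base means, here is a deliberately simplified sketch. It is not della_linear itself: as far as I understand it, that method additionally drops and rescales the delta parameters before combining them, which is skipped here. The function name, the toy dict-of-lists "models", and the 0.5 weights are all assumptions for illustration, not the settings used for this merge.

```python
def linear_merge(base, tuned_models, weights):
    """Add a weighted sum of task vectors (tuned - base) onto the base.

    Simplified illustration only: no pruning or rescaling of deltas.
    Each model is a dict mapping parameter name -> list of floats.
    """
    merged = {}
    for name, base_vals in base.items():
        merged_vals = list(base_vals)
        for tuned, w in zip(tuned_models, weights):
            for i, (t, b) in enumerate(zip(tuned[name], base_vals)):
                merged_vals[i] += w * (t - b)
        merged[name] = merged_vals
    return merged


# Toy example: two "fine-tunes" that each changed a different weight.
base = {"w": [1.0, 2.0]}
model_a = {"w": [2.0, 2.0]}
model_b = {"w": [1.0, 4.0]}
merged = linear_merge(base, [model_a, model_b], [0.5, 0.5])
```

Because both models share the base Nemo tokenizer and architecture, every parameter lines up by name, which is what makes this kind of per-parameter arithmetic possible in the first place.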

My Current Findings:
- After tinkering with Nemo, it is kind of clear to me that the base itself is unruly to train on with my datasets. I'd need to SFT first, then use that as a base.
- Nemo may train well, but like Mistral it is kind of... dry. Bland even with unique, creative and varied data. It needs multi-stage fine-tuning. Llama 3 does not have this... issue?
- Nemo's effective context is unfortunately kind of a bummer? Its effective max is 16K. I have tried LoRAs trained at up to 64K on a lot of samples, and they just do not work well, unlike on Yi.
- For roleplay, 16K context is plenty enough, so that is fine.

Further Iterations:

Previous version uploaded was a beta. # It had tokenizer issues lol.

This is simply v1. I have a lot more ideas to improve upon this, and the data is being cooked right now. Those might come in a bit.

My upcoming plans:
- RL on a specially curated dataset, to target instruction following over multi-turn and creative writing abilities.
- Iterate upon previous versions with more varied data sources and types, on various domains, à la Nitral's Hathor work. He's a cool guy.

Have a good day.