SerialKicked/ModelTestingBed · We are testing the same things...I think

Hello,

The DoggoEval thing I was doing back then was mostly a fun test, nothing really serious. Still, it does tend to tell a lot about a given model, especially small ones. If a model fails at acting like a dog on such a simple prompt when asked a question, despite being primed with a few examples, it's probably not going to do any better on a character card using 800+ tokens. Feel free to use it for your needs.

I've kinda moved to a more serious methodology, but it's less about the models' RP abilities themselves and more about specific tasks I need them to handle in an app I'm working on. Nowadays I test:

Ability to title, summarize, and extract keywords from a long chatlog / chat session (which I need for a long-term memory system)
Ability to navigate a basic menu system (multiple uses)
Ability to integrate 3rd party information (web results, texts retrieved via RAG) seamlessly into their responses
If the base model had a function-calling feature, whether the fine-tune murdered it (not a big deal for me as long as the model can still navigate a menu)

To be fair, I'm a firm believer that if a model can't do something as simple as writing a summary or selecting a number from a menu, it's not going to follow a story along either. Those tests are formalized and run automatically on my own front-end. The rest is just me doing a ton of re-rolls on multiple long-form chatlogs with very diverse character types. It's enough to get a general idea of a model's writing style and adherence to the system prompt.
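
To give a rough idea of what an automated menu check could look like, here is a minimal sketch. It is not the actual front-end code: the menu text, the `ask_model` callable, the "exactly one number" parsing rule, and the re-roll count are all assumptions made up for illustration.

```python
import re
from typing import Callable, Optional

# Hypothetical menu; a real test would use whatever menu the app exposes.
MENU = """Choose an option by replying with its number only:
1. Search the web
2. Read a saved note
3. Do nothing"""
N_OPTIONS = 3


def parse_choice(reply: str, n_options: int = N_OPTIONS) -> Optional[int]:
    """Return the selected option number, or None if the reply is unusable."""
    numbers = re.findall(r"\d+", reply)
    # Require exactly one number, so hedging answers like "1 or 2" count as failures.
    if len(numbers) != 1:
        return None
    choice = int(numbers[0])
    return choice if 1 <= choice <= n_options else None


def menu_pass_rate(
    ask_model: Callable[[str], str],  # assumed: takes a prompt, returns the model's reply
    task: str,
    expected: int,
    rerolls: int = 10,
) -> float:
    """Ask the same question several times and return the fraction of correct picks."""
    prompt = f"{MENU}\n\nTask: {task}"
    hits = sum(parse_choice(ask_model(prompt)) == expected for _ in range(rerolls))
    return hits / rerolls
```

Plugging in a real backend just means passing a function that sends the prompt to the model and returns its text, e.g. `menu_pass_rate(my_backend, "Look up what we said about the trip last week", expected=2)`. Scoring over several re-rolls rather than a single shot matters mostly for small models, which may get the menu right only some of the time.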

Cheers.