We are testing the same things... I think
Hi!
I saw that you have developed a testing framework for LLMs. I am doing something similar and have just started putting some of my results on HuggingFace. My idea is to focus more on how the models react to conflicting emotions and complex personalities, and on how easily they can be pushed into evil territory. I test on very specific character cards (written by me for original characters, no fandoms or existing fiction), which should be built on an easy-to-understand basis.
With this in mind, I think the idea of the dog character is quite interesting, since it forces the LLM to think outside of its regular boundaries and come up with plausible, yet unusual things.
Hello,
The DoggoEval thing I was doing back then was mostly a fun test, nothing really serious. Still, it does tend to tell a lot about a given model, especially small ones. If a model fails at acting like a dog on such a simple prompt when asked a question, despite being primed with a few examples, it's probably not going to do any better on a character card using 800+ tokens. Feel free to use it for your needs.
I've kinda moved to a more serious methodology, but that's less related to the models' RP abilities themselves and more to the specific abilities / tasks I need them to handle for an app I'm working on. Nowadays I test:
- Ability to title, summarize, and find keywords for a long chatlog / chat session (which I need for a long-term memory system)
- Ability to navigate a basic menu system (multiple uses)
- Ability to integrate third-party information (web results, texts retrieved via RAG) into their responses in a seamless way
- If the base model had a function-calling feature, I also check that the fine-tune didn't murder it (not a big deal for me as long as it can navigate a menu)
To be fair, I'm a firm believer that if a model can't do something as simple as writing a summary or selecting a number in a menu, it's not going to follow a story along either. Those tests are formalized and run automatically on my own front-end. The rest is just me doing a ton of re-rolls on multiple long-form chatlogs with very diverse character types. It's enough to get a general idea of their writing style and adherence to the system prompt.
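For illustration only, an automated menu-navigation check like the one in the list above could be sketched roughly like this. It is not the actual harness from the front-end: the menu contents, the prompt wording, and the names `MENU`, `build_prompt`, `parse_choice` and `run_menu_test` are all made up for the example, and `model` is just any callable that takes a prompt string and returns the reply text.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical menu, invented for this example.
MENU: Dict[int, str] = {1: "Search the web", 2: "Summarize the chat", 3: "Exit"}


def build_prompt(menu: Dict[int, str], task: str) -> str:
    """Render the menu and the task as a plain-text prompt."""
    lines = [f"{number}. {label}" for number, label in menu.items()]
    return (
        "Menu:\n" + "\n".join(lines)
        + f"\nTask: {task}\nReply with the number only."
    )


def parse_choice(reply: str, menu: Dict[int, str]) -> Optional[int]:
    """Return the chosen option, or None if the reply is ambiguous or invalid.

    The reply passes only if it mentions exactly one distinct number
    and that number is a valid menu entry.
    """
    numbers = {int(n) for n in re.findall(r"\d+", reply)}
    if len(numbers) == 1:
        choice = next(iter(numbers))
        if choice in menu:
            return choice
    return None


def run_menu_test(
    model: Callable[[str], str],
    menu: Dict[int, str],
    task: str,
    expected: int,
) -> bool:
    """Ask the model to pick an option and compare it with the expected one."""
    reply = model(build_prompt(menu, task))
    return parse_choice(reply, menu) == expected


if __name__ == "__main__":
    # Stand-in for a real model call (assumption: swap in your own backend).
    fake_model = lambda prompt: "2"
    print(run_menu_test(fake_model, MENU, "Give me a recap of this chat.", 2))
```

The parser is strict on purpose: a reply that hedges between two options or picks a number that isn't in the menu counts as a failure, which is the kind of thing that tends to separate small models.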
Cheers.