Spaces: DontPlanToEnd/UGI-Leaderboard

The new leader is fantastic!

#86

by SicariusSicariiStuff - opened Jan 13

Discussion

SicariusSicariiStuff

Jan 13

Superb work, the new leaderboard is absolutely fantastic, really well done.
The political spectrum eval is such a great idea, we really needed something like that!

Just wanted to say thank you for your remarkable work,
Sicarius.

BigHuggyD

Jan 20

Instead of opening a new comment, I will just piggyback on here and echo @SicariusSicariiStuff . It's great to see I wasn't hallucinating (my samplers are all over the place so I wouldn't be surprised) that the models I enjoy most are ranked very high and skewed towards neutral. Oh, and @SicariusSicariiStuff ... congrats on the only top model skewed slightly positive :D

yttria

Jan 31

This comment has been hidden

DontPlanToEnd

Owner Jan 31

One such test asked me if I support banning plastic straws, and I answered strongly disagree. Based on this answer, it categorized me as valuing personal freedom over environmental issues, which is an extremely braindead take.

Since the 12axes test asks 24 questions for each of its 12 axes, that should hopefully do a decent job at ironing out issues with nuanced responses.
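To illustrate the point about averaging, here is a minimal sketch of how one oddly interpreted answer gets diluted when an axis is scored from 24 questions. The Likert mapping, the `directions` list, and the `axis_score` helper are assumptions made up for this illustration; the actual 12axes scoring formula is not given in this thread and may differ.

```python
from statistics import mean

# Hypothetical Likert mapping; the real test's weights are not known here.
LIKERT = {
    "strongly disagree": -2,
    "disagree": -1,
    "neutral": 0,
    "agree": 1,
    "strongly agree": 2,
}

def axis_score(answers, directions):
    """Average the signed answers for one axis and scale to [-1, 1].

    directions holds +1 or -1 per question, saying which pole of the
    axis agreement points toward (an assumed convention).
    """
    if len(answers) != len(directions):
        raise ValueError("answers and directions must have the same length")
    signed = [LIKERT[a] * d for a, d in zip(answers, directions)]
    return mean(signed) / 2

# 24 questions on one axis: 23 consistent answers plus one outlier,
# such as a "strongly disagree" given for a nuanced reason.
answers = ["agree"] * 23 + ["strongly disagree"]
directions = [1] * 24

full_axis = axis_score(answers, directions)
single_question = axis_score(["strongly disagree"], [1])
print(full_axis, single_question)
```

Scored alone, the outlier puts the respondent at the extreme end of the axis; averaged with the other 23 answers, it only shifts the score slightly.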

As models get more intelligent, they will be able to understand more nuance about each question and provide the correct answer

I don't think there really is a correct answer to most of the questions on the test. They measure what things you value more in a society. There isn't a correct answer to whether a federal or unitary government is better.

Plus, just because the measurement might be a bit flawed doesn't mean I should remove it. All of the rankings on the leaderboard are flawed in some way, but that doesn't mean the best option is to delete the whole thing. I'll always be open to switching which political test I give models. I chose this one because I liked the high level of detail it goes into, having 12 axes.

WinterFloof

Mar 4

One thing that disappointed me was the removal of the "writing" metric. Many people use LLMs for creative writing projects such as prose or role-playing, or as a writing assistant. While coding is absolutely a metric that should be taken into account, I think writing capabilities are significant enough to warrant inclusion.

DontPlanToEnd

Owner Mar 4

One thing that disappointed me was the removal of the "writing" metric.

The old writing style metric's way of ranking models was very dependent on me manually giving ratings to stories so that a regression model could understand what lexical statistics people tend to prefer. I would have to rate every model because whenever a model had a unique way of writing, its regression rating would be inaccurate. I no longer have time to do that (and it was a very flimsy way of ranking models), so I tried replacing the ranking, but until I implement prompt batching, it just takes too long to process the amount of model writing outputs needed to make an accurate metric.
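For readers unfamiliar with the approach described above, here is a rough sketch of fitting a regression from a lexical statistic to manual ratings. Everything in it is an assumption for illustration: the single feature (type-token ratio), the toy stories, the made-up ratings, and the helper names. The leaderboard's actual features and model are not described in this thread.

```python
def type_token_ratio(text):
    """Share of distinct words in a text, a simple lexical statistic."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0

def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Toy stories and MADE-UP manual ratings, for illustration only.
stories = [
    "the cat sat on the mat and the cat sat on the mat",
    "a quiet harbor woke slowly as gulls argued over the morning catch",
    "rain fell and rain fell and the street was wet",
]
ratings = [2.0, 8.0, 4.0]

features = [type_token_ratio(s) for s in stories]
slope, intercept = fit_line(features, ratings)

def predict_rating(text):
    """Predict a rating for unseen text from its lexical statistic."""
    return intercept + slope * type_token_ratio(text)

print(predict_rating("the wind carried old songs across an empty field"))
```

A fit like this only knows the styles it was given ratings for, which is consistent with the problem described above: a model with an unusual way of writing falls outside the rated examples, so its predicted rating is unreliable until someone rates it by hand.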

So I definitely want to bring a writing metric back, but there are a lot of components I'd have to program and test questions I'd have to write. I don't know right now when it'll be able to happen.
