HUMAINE: A Rigorous Framework for Understanding AI Through Human Experience

Community Article Published September 16, 2025

As Large Language Models become integral to our creative and professional lives, the methods we use to measure their progress must also evolve. For too long, AI evaluation has been dominated by technical benchmarks that, while important, fail to capture what people actually value: Is a model capable of handling real-world tasks? Is it trustworthy? Does the conversation flow naturally? This disconnect between technical capability and human experience is one of the most critical evaluation gaps in AI today.

To bridge this gap, we created HUMAINE: a large-scale evaluation framework built to understand AI performance through the lens of natural human interaction. We didn't just ask "which model is better?"; we designed a system to reveal why and for whom. We engaged over 20,000 participants, stratified across 22 distinct demographic groups, to provide multi-dimensional feedback on today's leading models.

We are excited to release the results of this work. You can explore the full interactive leaderboard on HF Spaces and download the human feedback dataset on HF Datasets.

Key Insights at a Glance:

  • A Clear Leader in Overall Preference: Our data shows a definitive preference for Google's Gemini-2.5-Pro, which consistently ranks as the top model for the "Overall Winner" metric, with P(best) ≈ 97%.
  • Performance is Not Monolithic; Metrics Tell Different Stories: The way we measure influences what we see. Participants were highly decisive when choosing an "Overall Winner", but found it much harder to distinguish models on Trust, Ethics & Safety, which saw a high rate of ties, with the rest of the metrics falling somewhere in between.
  • Demographic Consistency Varies Significantly: Some models perform very consistently across all demographic groups, while others show much higher variability.
  • Age is the Largest Source of Disagreement: Of the demographic groups we included, age was the most significant factor in differing user preferences.

In this post, we’ll take you on a tour of the HUMAINE leaderboard, show you how to use our tools to uncover your own insights, and then dive deep into the methodology.


Why a New Approach to Evaluation is Needed

Our motivation for building HUMAINE stems from three distinct challenges we observed in the current AI evaluation landscape.

1. The Gap Between Technical Benchmarks and Human Experience

Current evaluation is heavily skewed towards metrics that are meaningful to researchers but opaque to everyday users, such as accuracy on specialised datasets or performance on esoteric reasoning tasks. This has created a disconnect between what gets optimised for and what people actually value. HUMAINE was conceived first and foremost to recenter evaluation on the qualities of AI interaction that matter in the real world: task competence & helpfulness, communication, adaptiveness, and trust.

2. The Need for Greater Methodological Rigour

Even human preference leaderboards can fall short if not designed with scientific rigour. We identified several key areas for improvement:

  • Representative Sampling: Open platforms where anyone can vote are susceptible to sample bias, likely overrepresenting tech-savvy users. HUMAINE addresses this with a recruited, stratified sample of participants, which we then post-stratify to match census data.
  • Depth of Assessment: Meaningful impressions require more than a single exchange. We mandated a minimum of three conversational turns to ensure participants had a sufficient basis for their judgments.
  • Quality Control: To reduce noise from low-effort evaluations, we implemented automated quality monitoring to ensure participants were engaging thoughtfully with the task.

3. The Illusion of a Level Playing Field

Our concerns about evaluation were validated by the recent landmark paper, "The Leaderboard Illusion". The paper documents how undisclosed practices on popular platforms can create systemic distortions that favour a handful of providers. The key issues identified were undisclosed private testing, data access disparities, and unequal model removal. This further strengthens the case for a more transparent and methodologically sound approach. HUMAINE's design, which emphasises fair sampling and multi-dimensional metrics, is a direct answer to these challenges.

Exploring the HUMAINE Leaderboard

The result of our evaluation and modeling is not a single, static ranking but an interactive, multi-dimensional leaderboard. It's designed to be a tool for exploration, allowing users to move beyond the "who is best" question and explore the nuances of model performance.

The Main Leaderboard

At first glance, our main leaderboard might look familiar, but each column provides a layer of statistical depth that goes beyond a simple win rate.

  • Score (Winshare): Expected total points in a round-robin vs. all other models (win=1, tie=0.5). With 27 models, the maximum is 26. We compute expected points analytically for each posterior draw of the Bradley–Terry–Davidson model, post-stratify to the target population, then average across draws to get the mean and interval.
  • Expected Rank: Average rank across posterior draws. Whiskers show the [X%] credible interval; tighter bands mean higher placement certainty.
  • P(Best): Probability the model ranks #1 across posterior draws, an intuitive dominance measure under uncertainty.
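
To make these columns concrete, here is a minimal sketch of how they can be derived once you have a matrix of post-stratified scores with one row per posterior draw and one column per model. The array shapes, the toy values, and the 95% interval width are our own illustrative choices rather than the exact settings used for the leaderboard.

```python
import numpy as np

# Hypothetical input: post-stratified winshare scores, shape (n_draws, n_models),
# already computed from the Bradley-Terry-Davidson posterior (see the appendix).
rng = np.random.default_rng(0)
scores = rng.normal(loc=13.0, scale=1.0, size=(4000, 27))  # toy stand-in for real draws

mean_score = scores.mean(axis=0)                      # "Score" column
lo, hi = np.quantile(scores, [0.025, 0.975], axis=0)  # credible interval (width assumed here)

# Rank the models within every draw (rank 1 = highest score in that draw).
ranks = (-scores).argsort(axis=1).argsort(axis=1) + 1
expected_rank = ranks.mean(axis=0)                    # "Expected Rank" column

p_best = (ranks == 1).mean(axis=0)                    # "P(Best)": fraction of draws ranked #1
```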

The Demographic Consistency Analysis

HUMAINE introduces a new lens for evaluation: demographic consistency. Our Demographic Explorer tab ranks models not just by their overall score, but by how stably they perform across all 22 demographic groups. Here, you can quickly identify:

  • Overall Consistency: We compute the variance of each model’s post-stratified score across the 22 groups; we then rescale it to a 0–100 consistency score where 100 = lowest variance (most stable).
  • Score Range: This shows the performance gap between a model's best and worst-performing demographic groups. A small gap is a strong positive signal of consistency.
  • Best and Worst Performing: We explicitly name the demographic groups for which the model performs best and worst, providing an immediate signal for where further investigation may be needed (note that the gap between the best and worst may still be small).
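
If you want to reproduce this view from the released data, the sketch below shows one plausible way to compute these columns from a table of per-group scores; the linear 0-100 rescaling is an assumption, and the leaderboard's exact mapping may differ.

```python
import numpy as np

# Hypothetical input: post-stratified score of each model within each of the
# 22 demographic groups, shape (n_models, n_groups).
rng = np.random.default_rng(1)
group_scores = rng.normal(loc=13.0, scale=1.5, size=(27, 22))

variance = group_scores.var(axis=1)  # spread of each model's score across groups

# Map variance to a 0-100 consistency score where the most stable model gets 100.
# This linear rescaling is an assumption; the leaderboard's exact mapping may differ.
consistency = 100 * (variance.max() - variance) / (variance.max() - variance.min())

score_range = group_scores.max(axis=1) - group_scores.min(axis=1)  # "Score Range" column
best_group = group_scores.argmax(axis=1)   # index of best-performing group per model
worst_group = group_scores.argmin(axis=1)  # index of worst-performing group per model
```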

Exploring the Data: Interactive Tools

To support deeper analysis, the HUMAINE interface includes several other interactive tools:

  • Model Comparison: Wondering how close the race between #1 and #2 really is? The comparison tool lets you select between two and five models and see their direct head-to-head win probabilities.
  • Conversation Analysis: For insights derived from analysing the raw conversations with an LLM judge, here you can see the breakdown of topics and task types, alongside how effectively the models completed the tasks (as determined by the LLM judge) and how engaged the participants were.

Evaluation Design

We designed a multi-faceted evaluation framework grounded in real-world use cases and comparative human judgment. Our methodology was built on four pillars: comparative assessment, multi-dimensional metrics, user-driven scenarios, and a human-first judgment process.

The Core Interaction: Comparative Assessment

The foundation of our framework is the pairwise comparison. Participants interacted with two randomised and anonymised models, labeled "Model A" and "Model B," in a side-by-side interface. After a natural, multi-turn conversation, they were asked to provide feedback on which model performed better.


We chose this head-to-head format because comparative judgments have been shown to be more reliable than abstract ratings (e.g., scoring a model on a 1-5 scale). Comparing two responses side by side prompts participants to notice subtle but important differences, yielding more consistent and actionable preference data.

The Evaluation Metrics: Beyond a Single "Win"

To understand why a model might be preferred, we extend our assessment beyond a single "overall winner" vote. After each conversation, participants evaluated the models across four distinct dimensions, providing a rich, diagnostic dataset:

  • Core Task Performance & Reasoning: How effectively the model accomplished the user's task and demonstrated sound reasoning and understanding.
  • Interaction Fluidity & Adaptiveness: How smoothly and adaptively the model managed the conversation flow and responded to follow-up questions or changes in direction.
  • Communication Style & Presentation: The quality of the model's language - its tone, personality, and the appropriateness of its detail and clarity.
  • Trust, Ethics & Safety: The perceived reliability, transparency, ethical conduct, and safety of the model's outputs and behavior.

Finally, after assessing these specific aspects, participants made a holistic Overall Winner choice, considering all factors. This structure allows us to see, for example, if a model is winning because of its superior reasoning or simply because of a more engaging communication style.


The Scenarios: Grounded in Reality

We asked our participants to evaluate the models on tasks that were personally meaningful and relevant to them. The instructions explicitly encouraged users to bring their own real-world problems:

Choose ANY topic, problem, or question that... is relevant to your work, studies, hobbies, or daily life.

This open-ended approach ensures our evaluation is grounded in authentic use cases. By having users evaluate topics they are knowledgeable about, we gather more expert and nuanced feedback than would be possible with generic prompts or overly restrictive topic criteria.

The Analysis Process: Human-First, LLM-Assisted

The core of the HUMAINE project is its commitment to human-centric data. All A/B/Tie preference judgments across our five metrics were provided directly by the human participants immediately after they completed their conversations. This is the foundational dataset used to train our statistical model and generate our leaderboards.

While the preference data is entirely human-generated, we also used an LLM in a separate, post-hoc analysis phase to generate deeper insights. After collecting the human feedback, we used GPT-4.1 to analyse the content of the 41,934 conversation transcripts. This allowed us to automatically classify the topics discussed, identify the types of tasks users were performing, estimate task completion and user engagement, and analyse linguistic patterns, providing rich metadata you can access in the "Conversation Analysis" section of the app.
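
For illustration only, a post-hoc pass of this kind can be structured like the sketch below; the prompt wording, output schema, and function name are placeholders rather than the exact setup we used.

```python
import json
from openai import OpenAI  # assumes the openai Python package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical judging prompt; the real prompt and schema were more detailed.
ANALYSIS_PROMPT = (
    "You will be shown an anonymised human-AI conversation. Return a JSON object "
    "with the fields: topic, task_type, task_completed (true/false), "
    "and engagement (low/medium/high)."
)

def analyse_transcript(transcript: str) -> dict:
    """Classify one conversation transcript with an LLM judge (illustrative)."""
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": ANALYSIS_PROMPT},
            {"role": "user", "content": transcript},
        ],
        response_format={"type": "json_object"},  # ask for machine-readable output
    )
    return json.loads(response.choices[0].message.content)
```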


Participant Recruitment and Data Collection

The credibility of any human evaluation study that purports to showcase public views rests on the diversity and representativeness of its participants. To address a key limitation of existing leaderboards, we moved away from open, anonymous crowds and implemented a deliberate and structured recruitment strategy.

Diverse Recruitment via Prolific

Our primary goal was to recruit a representative pool of participants that would allow for meaningful demographic analysis. We aimed to gather sufficient data from a wide range of groups to understand how perceptions of AI models vary across different segments of the population. We recruited our participants through Prolific, a platform known for its high-quality, fairly compensated, and diverse participant pool. All participants were compensated for their time at a fair rate, ensuring an ethical and professional research environment.

Demographic Stratification

We used stratified sampling to ensure we collected a substantial number of evaluations from 22 specific demographic groups across three key axes: Age, Ethnicity, and Political Affiliation in both the United States and the United Kingdom.

Our target groups included:

  • Age Brackets: 18-34, 35-54, and 55+ in both the US and UK.
  • Ethnic Groups: Including White, Black, Asian, and Other in the UK, and White, African American, Asian, and Hispanic in the US.
  • Political Alignments: Including Democrats, Republicans, and Independents in the US, and supporters of the Conservative, Labour, Liberal Democrat, Green, and Reform UK parties in the UK.

We aimed to recruit approximately 1,000 participants for each of these 22 strata. This large, targeted sample size for each group is what empowers our detailed demographic analysis and allows us to generate reliable leaderboards for specific populations.

The Data Collection Process

Rather than randomly pairing models, we employed a TrueSkill-derived adaptive sampling algorithm within each demographic group. TrueSkill maintains skill estimates and uncertainty measures for each model, updating both after each comparison. Each of our 22 strata participated in dedicated tournaments, with the algorithm strategically selecting model pairings to maximise information gain, focusing on similarly-skilled models where outcomes are most uncertain.
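
As a rough sketch of the idea (not our production code), the open-source trueskill package can maintain a rating and uncertainty per model and pick the next pairing by maximising predicted match quality, which is highest when two models are closely matched and their ratings are still uncertain; the draw probability below is an assumed value.

```python
from itertools import combinations
import trueskill  # open-source implementation of the TrueSkill rating system

# Configure the global rating environment; the draw probability is an assumption.
trueskill.setup(draw_probability=0.3)
ratings = {f"model_{i}": trueskill.Rating() for i in range(27)}

def select_pair(ratings):
    """Pick the pairing whose outcome is most uncertain (highest match quality)."""
    return max(
        combinations(ratings, 2),
        key=lambda pair: trueskill.quality_1vs1(ratings[pair[0]], ratings[pair[1]]),
    )

def record_result(ratings, winner, loser, drawn=False):
    """Update both models' skill estimates after a human judgment."""
    ratings[winner], ratings[loser] = trueskill.rate_1vs1(
        ratings[winner], ratings[loser], drawn=drawn
    )

# One adaptive round: choose the most informative pair, then record the outcome.
model_a, model_b = select_pair(ratings)
record_result(ratings, winner=model_a, loser=model_b)  # e.g. the human preferred A
```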

Over the course of the study, this diverse group of participants generated 41,934 distinct conversations (2 per participant) with an initial set of 27 AI models under evaluation. After each conversation, participants provided their preference feedback across our five metrics, resulting in a total of 21,352 complete, multi-dimensional feedback items. This rich dataset serves as the foundation for our statistical modeling. Our final "overall" leaderboards for the US and UK populations are generated by post-stratifying our results, a statistical technique where we re-weight the contributions of each demographic group to match their actual prevalence in census data. This ensures our main leaderboards are as representative of the real world as possible.


Modelling

A raw collection of wins, losses, and ties is not a leaderboard. To transform our 21,352 human judgments into reliable and interpretable rankings, we needed a specialised statistical model. A simple win-rate calculation would be misleading, as it fails to handle uncertainty, account for complex demographic interplay, or properly model the difference between close ties and decisive wins. We opted for a Hierarchical Bradley-Terry-Davidson (BTD) model tailored to our data's unique structure.

The Core Components of Performance

Instead of learning a single score, our model breaks down performance into distinct, interpretable components:

  • Baseline Skill (θ): Each model's latent "baseline skill" for each metric, its global average performance stripped of demographic effects. This forms the foundation upon which everything else builds.
  • Demographic Adjustments (u): The core of our demographic analysis. The model learns small, additive "bonus" or "penalty" points for each of our 22 demographic groups. For example, a model might have a positive adjustment for the '18-34' age group on Communication Style, indicating specific appeal to younger users.
  • Tie Propensity (ν): Some comparisons are genuinely close. Our model explicitly learns a "tie" parameter for each metric, distinguishing between metrics where models are easily distinguishable and those with frequent ties.

A model's "Effective Skill" in any comparison is simply its Baseline Skill plus the sum of relevant Demographic Adjustments for that participant.

The Power of Hierarchical Learning

The most challenging aspect is untangling mixed effects. If a 21-year-old Democrat from the UK prefers Model A, is it due to their age, political affiliation, nationality, or some combination?

Our model solves this through hierarchical learning (partial pooling), analyzing patterns across thousands of demographic combinations simultaneously. All adjustments on a single axis (e.g., all age brackets) are governed by a shared "volume control" parameter (τ):

  • Strong, consistent patterns across age groups → large τ → larger, more influential age-based adjustments
  • No clear age-related patterns → small τ → age adjustments shrink toward zero

This technique statistically isolates true demographic effects while protecting against noise from smaller sample sizes, finding the global "best fit" that explains all groups' preferences simultaneously.

From Model to Leaderboard

Our model outputs probability distributions capturing statistical uncertainty. To generate final rankings, we follow a precise analytical process:

  • Calculate Population-Adjusted Skills: For each parameter sample, start with baseline skill (θ), then add expected demographic effects weighted by census data. A scaling factor ensures demographic adjustments remain interpretable.
  • Compute Expected Performance: Use Bradley-Terry formulas to analytically calculate each model's expected "winshare" against all opponents—the probability of winning plus half the probability of a tie.
  • Summarise Uncertainty: A model's Score is its total expected points across all matchups. By repeating across multiple parameter samples, we generate full uncertainty distributions. The error bars represent our confidence in learned parameters, not random simulation noise.

This analytical approach ensures rankings are robust, reproducible, and directly reflect evidence in our human feedback data.


Key Takeaways

After analyzing over 20,000 detailed human evaluations, several clear patterns have emerged from the HUMAINE project.

  • A Clear Winner Emerges in Overall Preference. Across nearly all metrics and demographic groups, Google's Gemini-2.5-Pro stands out as the definitive leader, ranking #1 in 97% of our statistical simulations for the "Overall Winner" metric.
  • Demographic Gaps Are Real, with Age Being the Key Differentiator. Our analysis confirms that a "one-size-fits-all" model does not exist. We found high disagreement between age groups in both the US and UK, indicating that models are not serving younger and older users equally well. Ethnic groups showed moderate disagreement, while political affiliation was a source of low disagreement. While these differences were not large enough to fundamentally alter the top-level rankings in this study, they highlight a critical axis for future model development.
  • Model Consistency Varies Significantly. Not all models are equally stable across populations: some perform very consistently across all 22 demographic groups, while others show much higher variability. This provides a crucial new tool for auditing a model's stability and reliability.
  • Metrics Matter: The "Signal Strength" of a Metric is Not Uniform. The way we ask questions dramatically influences the answers we get. We found that participants were most decisive when choosing an "Overall Winner," which proved to be a highly differentiating metric with few ties. In contrast, the Trust, Ethics & Safety metric was the least differentiating, with a high rate of ties. This suggests that for many users, the top models have reached a similar baseline on safety, making it harder to distinguish a clear leader on this dimension alone. An alternative possibility is that for a large proportion of conversations, Trust, Ethics & Safety was not applicable. While xAI's Grok-3 holds a small statistical edge here, the noisy nature of the metric means this finding should be interpreted with caution.

The Future of HUMAINE: What’s Next

This initial release is just the beginning. The HUMAINE project is the foundational pillar of a comprehensive evaluation suite we are building to bring more rigour and transparency to the field. Our roadmap is focused on expanding the breadth, depth, and sophistication of our analysis.

Our immediate next steps include:

  • Keeping Pace with a Fast-Moving Field: To remain a relevant and trusted resource, we are committed to keeping the evaluation up-to-date by continuously enlisting the latest and most capable models as they are released.
  • Deepening the Analysis of Demographic Differences: We will conduct more targeted studies to move beyond identifying demographic gaps and toward understanding their root causes. This involves a deeper analysis of how conversation topics and task complexity interact with demographic factors.
  • Expanding Global Reach and Representation: We plan to expand our evaluations to include more countries, languages, and demographic groups. Building a truly global and inclusive understanding of model performance is core to our mission.
  • Increasing Task Complexity and Granularity: We plan to evaluate models on more complex, multi-step tasks that require deeper reasoning. This will be paired with collecting more granular feedback from participants to further isolate the subtle but critical differences between state-of-the-art models.
  • Building a Comprehensive Evals Suite: HUMAINE is the first step. We are actively developing a broader suite of evaluations that will range from the dynamic, large-scale human feedback of HUMAINE to deep technical evaluations designed to rigorously measure long-horizon reasoning, agentic capabilities, and other frontier challenges in AI.

Technical Appendix: The Mathematical Model

This section formalises the model that turns human A/B/Tie judgments into the leaderboard statistics. It uses a Bradley–Terry–Davidson (BTD) outcome model with hierarchical demographic adjustments and post-stratification to census weights.

The Outcome Model: Predicting a Choice

At its core, the model predicts the outcome of a single comparison based on the "latent advantage" (η) of model i over model j. This advantage is the sum of the difference in their baseline skills and the difference in their demographic effects for that specific rater.

The demographic effect for a model (Δu) is the sum of its adjustments across the rater's age, ethnicity, and political groups:

$$\Delta u_{i, \text{rater}} = u^{\text{age}}_{i,g_a,k} + u^{\text{eth}}_{i,g_e,k} + u^{\text{pol}}_{i,g_p,k}$$

The total advantage, η, is then:

$$\eta = \underbrace{(\theta_{i,k} - \theta_{j,k})}_{\text{Baseline Skill Difference}} + \alpha \underbrace{(\Delta u_{i, \text{rater}} - \Delta u_{j, \text{rater}})}_{\text{Demographic Effect Difference}}$$

We set the scaling factor α = 1/√3 so that the combined effect of three demographic axes remains on the same scale as a single axis.

Given this advantage η, the probabilities for each outcome (A wins, Tie, B wins) are calculated using the BTD formula, which includes a per-metric tie propensity ν_k > 0:

$$p_A=\frac{e^{\eta}}{Z}, \qquad p_T=\frac{\nu_k}{Z}, \qquad p_B=\frac{e^{-\eta}}{Z}, \qquad \text{where } Z = e^{\eta}+e^{-\eta}+\nu_k$$
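
These two formulas translate directly into code. The sketch below is a plain transcription with invented inputs, not the implementation used to fit the model.

```python
import numpy as np

def btd_probabilities(theta_i, theta_j, du_i, du_j, nu_k, alpha=1 / np.sqrt(3)):
    """Return (p_A, p_T, p_B) for one comparison under the BTD model."""
    eta = (theta_i - theta_j) + alpha * (du_i - du_j)  # latent advantage of i over j
    z = np.exp(eta) + np.exp(-eta) + nu_k              # normalising constant Z
    return np.exp(eta) / z, nu_k / z, np.exp(-eta) / z

# Example: model i slightly ahead of j for this rater, moderate tie propensity.
p_a, p_t, p_b = btd_probabilities(0.42, 0.10, du_i=0.07, du_j=-0.03, nu_k=0.8)
```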

Priors and Latent Structure: How Parameters are Learned

The model's parameters are learned from the data using the following structure and priors:

  • Baseline Skill (θ): To ensure the skills are identifiable, we enforce a zero-sum constraint for each metric k: $\sum_i \theta_{i,k} = 0$

  • Demographic Adjustments (u): The adjustments are learned hierarchically to ensure stability (a technique called partial pooling). For each demographic axis (e.g., age), the adjustments are centered and scaled by a parameter τ which is learned from the data: $u^{a}_{i,y,k} = \big(u^{a}_{\text{raw},i,y,k} - \overline{u^{a}_{\text{raw},i,\cdot,k}}\big)\,\tau^{a}_k$. The raw, unscaled adjustments are drawn from a standard normal distribution, u_raw ~ N(0, 1), and the scale parameter τ (the "volume knob") is drawn from an exponential distribution, τ ~ Exponential(λ = 12).
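
The hierarchical construction above is easy to mirror in code; the sketch below draws the raw adjustments and τ from their priors for a single axis and metric, purely to show the centring-and-scaling step (the shapes and random seed are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(2)
n_models, n_age_groups = 27, 3

# Draw the raw adjustments and the per-metric "volume knob" from their priors,
# purely for illustration (numpy parameterises the exponential by scale = 1/lambda).
u_raw = rng.standard_normal((n_models, n_age_groups))  # u_raw ~ N(0, 1)
tau_age = rng.exponential(scale=1 / 12)                # tau ~ Exponential(lambda = 12)

# Centre each model's adjustments across the age groups, then scale by tau:
# a small tau shrinks every age adjustment toward zero.
u_age = (u_raw - u_raw.mean(axis=1, keepdims=True)) * tau_age
```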

Population Adjustment: Reflecting the Real World

After learning the parameters from our participants, we create a population-adjusted skill for each model by taking the expectation of the demographic effects, weighted by census data (w). For each posterior draw, this is:

$$\theta^{\text{pop}}_{i,k} = \theta_{i,k} + \alpha\Big( \langle w_{\text{age}}, u^{\text{age}}_{i,\cdot,k}\rangle + \langle w_{\text{eth}}, u^{\text{eth}}_{i,\cdot,k}\rangle + \langle w_{\text{pol}}, u^{\text{pol}}_{i,\cdot,k}\rangle \Big)$$

Here, ⟨w, u⟩ represents the dot product (a weighted average) of the census weights with the model's demographic adjustments for that axis.
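
Concretely, for a single model, metric, and posterior draw, the population adjustment reduces to three dot products; every weight and adjustment value below is invented for illustration.

```python
import numpy as np

alpha = 1 / np.sqrt(3)

# One posterior draw for one model and one metric (illustrative values).
theta = 0.42
u_age = np.array([0.05, -0.01, -0.04])         # three age brackets
u_eth = np.array([0.02, -0.02, 0.01, -0.01])   # four ethnic groups
u_pol = np.array([0.00, 0.03, -0.03])          # three political alignments

# Hypothetical census weights for each axis (each sums to 1).
w_age = np.array([0.30, 0.35, 0.35])
w_eth = np.array([0.75, 0.04, 0.09, 0.12])
w_pol = np.array([0.33, 0.31, 0.36])

theta_pop = theta + alpha * (w_age @ u_age + w_eth @ u_eth + w_pol @ u_pol)
```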

Scoring and Leaderboard Construction

From the population-adjusted skills, we construct the final leaderboard metrics for each posterior draw:

  • The Expected Points (Winshare) for model i vs. j is: $\mathrm{EP}_{i\ \text{vs}\ j,k} = p_A + \tfrac{1}{2} p_T$
  • A model's Score for that draw is the sum of its EP against all opponents.
  • Aggregating these Scores across all posterior draws gives us the final mean Score, its uncertainty interval, the Expected Rank, and the P(best).
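
Putting these steps together, one posterior draw turns into a vector of Scores as follows; this is a sketch that follows the definitions above with illustrative inputs, not the actual pipeline.

```python
import numpy as np

def winshare_scores(theta_pop, nu):
    """Expected round-robin points for each model in one posterior draw.

    theta_pop : population-adjusted skills for one metric, shape (n_models,)
    nu        : tie propensity for this metric (scalar)
    """
    eta = theta_pop[:, None] - theta_pop[None, :]  # pairwise latent advantages
    z = np.exp(eta) + np.exp(-eta) + nu            # normalising constants
    ep = np.exp(eta) / z + 0.5 * (nu / z)          # EP_{i vs j} = p_A + p_T / 2
    np.fill_diagonal(ep, 0.0)                      # a model never plays itself
    return ep.sum(axis=1)                          # Score for this draw (max n_models - 1)

# Repeating this over every posterior draw yields the distributions behind the
# mean Score, its credible interval, the Expected Rank, and P(best).
scores = winshare_scores(np.linspace(-0.5, 0.5, 27), nu=0.8)
```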
