Model Behavior: What AI Claims vs What Humans Experience
Tom Williams, CEO

We asked nine AI models to score themselves and each other. The data points to orchestration, not loyalty.
This week I lost a thread. I was halfway through a research project that involved comparing nine different AI models, and at some point I lost track of which model I had used for a particular task, where the latest version of the output lived, and how far each conversation had progressed.
I was using Claude for the analysis, ChatGPT for writing Boolean search queries, Gemini for cross-referencing, Perplexity for synthesis, and somewhere between tabs, chat histories, and exported files, I lost it.
The reason I was using four different models in the first place is that none of them is best at everything. That is not just my experience; it's what the research data shows. The irony is that the research project itself was about exactly this problem.
We asked nine major AI models to evaluate themselves and each other. Same prompt, same criteria, same scale. Then we checked their answers against six months of public discourse drawn from over 12 million mentions tracked using Meltwater. The full methodology, the pairwise data, and the model-by-model analysis are published as a companion paper here: [What AI Models Claim vs What Humans Experience].
The short version: every model thinks it is better than the public says it is. No exceptions. The average gap is +2.4 points on a 10-point scale. But the interesting part is not the inflation itself. It is where the gaps are largest and what they reveal about the structural limits of self-assessment.
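To make the "gap" concrete: it is simply the model's self-score minus the human score on the same dimension, averaged across models and dimensions. Here is a minimal Python sketch of that calculation, using only the score pairs quoted in this article; the real study spans nine models and nine dimensions, so these numbers are illustrative, not the dataset behind the +2.4 figure.

```python
# A minimal sketch of the gap calculation, using only the score pairs
# quoted in this article. Illustrative only: the full study covers nine
# models and nine dimensions.
score_pairs = {
    # (model, dimension): (self_score, human_score), both on a 10-point scale
    ("GPT", "creativity"): (9.0, 5.0),
    ("Grok", "honesty"): (9.0, 4.0),
    ("DeepSeek", "honesty"): (8.0, 2.0),
}

# Gap = how much higher the model rates itself than the public rates it.
gaps = {key: self_s - human_s for key, (self_s, human_s) in score_pairs.items()}
average_gap = sum(gaps.values()) / len(gaps)

for (model, dimension), gap in sorted(gaps.items(), key=lambda kv: -kv[1]):
    print(f"{model:<10} {dimension:<12} self-inflation: +{gap:.1f}")

# Averaging only the examples cited here gives a much larger number than the
# study-wide +2.4, because these are the outliers, not the typical case.
print(f"Average gap across these pairs: +{average_gap:.1f}")
```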
Only one model in the study acknowledged the problem. Claude, when asked to describe itself alongside its competitors, wrote: "I'm obviously not a neutral judge of myself here." It then proceeded to rate itself generously anyway, just slightly less so than the others. None of the other eight models made any equivalent admission. Self-awareness, it turns out, does not solve self-bias. It just makes it more polite.
Three Findings Stood Out
Every model thinks it writes well. The public increasingly disagrees.
AI models rate their own creative and writing quality 2-3 points higher than humans do, consistently and across the board. The models see grammatical fluency and stylistic range. Humans see output that is verbose, formulaic, and increasingly described as "AI slop." GPT gave itself a 9 on creativity; humans gave it a 5. That is the single largest gap on any single dimension in the entire study. The models assess capability, whereas humans assess what it feels like to work with that capability, day after day, and the two experiences are diverging.
Every model thinks it is honest. None of them agree on what that means.
Grok, no stranger to self-aggrandisement, thinks it is the most honest AI of the bunch, giving itself a 9 out of 10 on honesty; humans meanwhile gave it a 4. DeepSeek gave itself an 8; humans gave it a 2. Gemini gave itself an 8.5 and, when asked to explain, described its approach as "neutral, evidence-based responding." The public calls the same behavior "like talking to a corporate spokesperson." Three different models, three different blind spots, all on the same dimension. The full paper documents three distinct failure modes: models that misjudge others, models that cannot see their own censorship, and models that reframe evasion as virtue.
No model wins on everything.
The best reasoner is not the best writer is not the best researcher is not the cheapest. Our data confirms this across all nine dimensions. And the market is already responding accordingly. Kiz Syed recently observed that companies are now routing different work to different services. ChatGPT as the best thought partner; Google winning on multimodal; Claude winning enterprise. The question, as he put it, is not "which model" but "which model for what?"
That tracks with my own experience this week. And it explains why I lost the thread.
If the best model for each task is a different model, then the product that matters is not the model itself. It is the orchestration layer that routes work to the right model, holds the context, and keeps track of where you got to. The thing I was missing this week was not a better AI. It was a better way to coordinate the ones I already had.
This has implications beyond personal productivity. If no model wins everywhere (and our data shows none does), then the winning product is not the best model. It is the best orchestration layer. A new generation of unified front-end applications is emerging to solve exactly this: routing prompts to the right model, maintaining context across services, and managing the fragmentation that multi-model workflows inevitably create.
For model providers, the implication is uncomfortable but clear. The era of "our model is the best at everything" marketing is ending. What replaces it is honest positioning within an orchestrated ecosystem. The models that will thrive in that ecosystem are not the ones with the highest average score. They are the ones whose self-assessment most closely matches what their users actually experience. Our data shows that today, almost none of them clear that bar. Not even the one that admitted it couldn't.
The full study, including the methodology, the pairwise analysis, the sycophancy blind spot, and the human baseline scoring, is available as a downloadable paper here.


