How 3 AI Models (and 1 Very Clever Human) Judged an Award
And Revealed the Future of the Fifth Industrial Revolution
This month, I was asked to judge the CommsHero award for Best Use of AI, which is a great honour, and something I love getting to do. I read through the entries with a huge amount of interest.
It combines my love of people using AI with my love of internal communications - a sector I have given talks and training in for over five years.

But what has this to do with the future of work and the Fifth Industrial Revolution?
Because the way I judged this award embodies the Fifth Industrial Revolution in action - human intelligence collaborating with artificial intelligence to unlock new insights.
As a judge for the recent CommsHero Awards, following my keynote at CommsHero 2024 about the Fifth Industrial Revolution, I decided to test something that exemplifies this new era: could I get ChatGPT, Claude, and Google Gemini to judge the same "Best Use of AI" award entries, and see how their evaluations compared?
The results revealed the essence of what I call the Fifth Industrial Revolution. This isn't about AI replacing human judgment - I, the human, had already read the entries and made my own notes. But asking different LLMs is like asking different people: you see the differences. And we can start to understand how different AI systems think differently, allowing us to harness their unique strengths collaboratively.
The fundamental differences between these AI models create distinct "evaluation personalities" that demonstrate why the future of work requires understanding and leveraging multiple AI approaches rather than seeking one-size-fits-all solutions.
Now this gets a bit geeky - so it’s up to you :)
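For the genuinely geeky, here is a minimal sketch of what the experiment looked like in code. To be clear: the model names, rubric wording, and plumbing below are illustrative assumptions rather than my exact setup - the point is simply that the same entry and the same rubric go to all three judges.

```python
import os
from openai import OpenAI
import anthropic
import google.generativeai as genai

# Illustrative rubric - not the actual CommsHero judging criteria.
RUBRIC = ("You are judging a 'Best Use of AI' award entry. Score it 1-10 "
          "for innovation, impact and clarity, then give a short rationale.\n\n"
          "ENTRY:\n{entry}")

def judge_with_chatgpt(entry):
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed model choice
        messages=[{"role": "user", "content": RUBRIC.format(entry=entry)}])
    return resp.choices[0].message.content

def judge_with_claude(entry):
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
    resp = client.messages.create(
        model="claude-3-5-sonnet-latest", max_tokens=1024,  # assumed model choice
        messages=[{"role": "user", "content": RUBRIC.format(entry=entry)}])
    return resp.content[0].text

def judge_with_gemini(entry):
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel("gemini-1.5-pro")  # assumed model choice
    return model.generate_content(RUBRIC.format(entry=entry)).text

entry_text = "..."  # one anonymised award entry
for name, judge in [("ChatGPT", judge_with_chatgpt),
                    ("Claude", judge_with_claude),
                    ("Gemini", judge_with_gemini)]:
    print(f"--- {name} ---\n{judge(entry_text)}\n")
```

Same entry, same rubric, three noticeably different verdicts - which is exactly what the rest of this piece unpacks.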
Training data creates fundamental evaluation biases
ChatGPT's human preference optimization shapes its evaluation patterns through Reinforcement Learning from Human Feedback (RLHF). Trained on user-submitted prompts and human preference rankings, ChatGPT develops biases toward engaging, confident responses that align with human crowdworker preferences. This creates systematic tendencies to favor creative, conversational content and may lead to overconfident judgments in uncertain domains.
Claude's constitutional training produces markedly different evaluation patterns. Using Constitutional AI, Claude learns to evaluate content against explicit ethical principles derived from sources like the UN Declaration of Human Rights. This approach creates more conservative, risk-averse judgment patterns with frequent hedging and detailed uncertainty acknowledgment. Claude's self-critique mechanism leads to thorough but cautious evaluations that prioritize harmlessness over pure performance.
Gemini's multimodal integration influences evaluation through massive, diverse training data including Google's Knowledge Graph, search results, and multimedia content. This factual grounding creates systematic bias toward verifiable, well-sourced responses but may undervalue purely creative or subjective elements that lack external validation.
Architectural differences drive distinct reasoning patterns
The technical architecture of each model fundamentally affects how they process and evaluate information.
ChatGPT's dense transformer architecture with traditional attention mechanisms creates consistent reasoning patterns but may struggle with specialized domains. Its RLHF training optimizes for human preference alignment, leading to evaluation patterns that mirror human crowdworker biases.
Claude's Constitutional AI framework integrates ethical reasoning directly into the model architecture. This creates unique evaluation patterns where constitutional compliance influences every judgment, resulting in more principled but potentially overly cautious assessments. Recent research shows Claude achieves higher agreement with clinical guidelines (Cohen's kappa = 0.82) compared to other models, reflecting its structured approach to evaluation.
Gemini's Mixture-of-Experts (MoE) architecture enables specialized processing pathways for different content types. This creates task-specific evaluation strengths but may lead to inconsistent patterns across domains. The sparse activation model allows for more targeted reasoning but can produce different evaluation criteria depending on which expert pathways are activated.
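To make the MoE idea concrete, here is a toy sketch of top-k gating in Python. This is a generic illustration of sparse expert routing, not Gemini's actual (unpublished) implementation: a small gate scores the experts, and only the top two actually process the input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy mixture-of-experts layer: 8 tiny "experts", only the top 2 run per input.
NUM_EXPERTS, TOP_K, DIM = 8, 2, 16
experts = [rng.standard_normal((DIM, DIM)) * 0.1 for _ in range(NUM_EXPERTS)]
gate_weights = rng.standard_normal((DIM, NUM_EXPERTS)) * 0.1

def moe_layer(x):
    logits = x @ gate_weights              # one gate score per expert
    top = np.argsort(logits)[-TOP_K:]      # sparse activation: keep only the top-k
    probs = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over chosen experts
    # Only the selected experts run; different inputs light up different
    # pathways, which is why evaluation behaviour can vary by content type.
    return sum(p * (x @ experts[i]) for p, i in zip(probs, top))

x = rng.standard_normal(DIM)
print(moe_layer(x).shape)  # -> (16,)
```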
Academic research reveals systematic evaluation inconsistencies
Recent academic studies consistently demonstrate significant evaluation inconsistencies across models. Research from major AI conferences shows that strong models like GPT-4 achieve only 80% agreement with human preferences, while agreement between different models can be much lower. Studies reveal multiple systematic biases affecting all models:
Position bias causes models to favor responses appearing first or last in comparisons.
Verbosity bias leads to systematic preference for longer responses regardless of quality.
Self-preference bias causes models to favor outputs from similar architectures.
These biases affect evaluation consistency and explain why different models reach different conclusions on identical content.
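Position bias in particular is easy to probe: run each pairwise comparison twice with the order swapped, and see whether the verdict survives. A minimal sketch, with a hypothetical call_judge standing in for whichever model API you use:

```python
# Probing position bias: ask for the better of two entries, then swap the
# order and ask again. call_judge is a hypothetical stand-in for a real
# model API call; it must answer "A" (first entry) or "B" (second entry).

def call_judge(entry_shown_first, entry_shown_second):
    # Fake judge for demonstration: it always prefers whatever it saw
    # first - i.e. pure position bias. Replace with a real API call.
    return "A"

def position_checked_verdict(entry_1, entry_2):
    first_pass = call_judge(entry_1, entry_2)   # entry_1 is shown as "A"
    second_pass = call_judge(entry_2, entry_1)  # order swapped
    if first_pass == "A" and second_pass == "B":
        return "entry_1 wins (verdict survived the swap)"
    if first_pass == "B" and second_pass == "A":
        return "entry_2 wins (verdict survived the swap)"
    return "verdict flipped with the order - position bias detected"

print(position_checked_verdict("Entry about chatbots", "Entry about analytics"))
```

If the verdict flips when the order flips, you are measuring the layout, not the content.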
Statistical analysis reveals that most LLM evaluations lack proper statistical rigor, with studies reporting overconfident conclusions due to inadequate error analysis. When proper statistical methods are applied, confidence intervals become much wider, indicating greater uncertainty in model judgments than typically acknowledged.
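To show what that rigor can look like in practice, here is a small sketch that bootstraps a confidence interval around a judge's agreement rate with a human. The agreement data is invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented data: 1 = the model judge agreed with the human, 0 = disagreed,
# across 50 award entries with a true agreement rate of about 80%.
agreements = rng.binomial(1, 0.8, size=50)

def bootstrap_ci(data, n_boot=10_000, alpha=0.05):
    """Percentile-bootstrap confidence interval for the mean agreement rate."""
    boots = [rng.choice(data, size=len(data), replace=True).mean()
             for _ in range(n_boot)]
    lo, hi = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return data.mean(), lo, hi

mean, lo, hi = bootstrap_ci(agreements)
print(f"agreement = {mean:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
# With only 50 entries the interval spans roughly +/-0.1 around the mean -
# far more uncertainty than a bare "80% agreement" headline suggests.
```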
Training methodologies create different judgment frameworks
The fundamental difference between RLHF and Constitutional AI creates distinct evaluation approaches.
RLHF optimization drives ChatGPT toward maximizing human preference scores, creating evaluation patterns that prioritize engagement and instruction-following. This can lead to confident judgments that may sacrifice accuracy for user satisfaction.
Constitutional AI principles guide Claude's evaluation framework toward consistency with ethical guidelines. This creates more conservative but principled evaluation patterns that prioritize safety and thoroughness over pure performance metrics. Claude's self-improvement mechanism through constitutional evaluation leads to detailed, cautious assessments.
Gemini's hybrid approach combines multiple training techniques with emphasis on factual accuracy and real-time information integration. This creates evaluation patterns biased toward verifiable, well-documented responses but may undervalue creative or innovative content that lacks external validation.
Creative versus analytical evaluation reveals stark differences
Performance studies reveal dramatically different approaches to creative versus analytical tasks. ChatGPT excels in creative evaluation, weighting originality and narrative flair heavily (70% creativity, 30% technical accuracy in subjective tasks).
Claude shows analytical superiority with systematic, structured evaluation (30% creativity, 70% technical accuracy emphasis). Gemini maintains balanced scoring but prioritizes factual grounding in all evaluations.
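You can think of those weightings as different composite scoring functions applied to the same entry. A toy illustration - the weights echo the splits above, but the entry's raw scores are invented:

```python
# The same entry scored under three weighting "personalities".
WEIGHTS = {
    "ChatGPT": {"creativity": 0.7, "technical": 0.3},
    "Claude":  {"creativity": 0.3, "technical": 0.7},
    "Gemini":  {"creativity": 0.5, "technical": 0.5},
}

entry = {"creativity": 9.0, "technical": 6.0}  # flashy idea, thin evidence

for judge, w in WEIGHTS.items():
    score = w["creativity"] * entry["creativity"] + w["technical"] * entry["technical"]
    print(f"{judge}: {score:.1f}")
# ChatGPT: 8.1, Claude: 6.9, Gemini: 7.5 - same entry, three different verdicts.
```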
Content type preferences create systematic biases in evaluation. ChatGPT favors creative writing and conversational content, Claude prefers technical documentation and structured arguments, while Gemini excels with research summaries and factual analysis. These preferences directly influence how models evaluate similar content in different domains.
Uncertainty handling creates evaluation variations
Models demonstrate distinct uncertainty patterns that significantly affect evaluation consistency. ChatGPT shows higher semantic diversity in uncertain scenarios and naturally expresses more uncertainty in creative domains. Claude maintains structured uncertainty acknowledgment with explicit limitation discussion. Gemini seeks disambiguation through external sources, relying on factual grounding even in subjective evaluations.
Ambiguity resolution varies dramatically across models. ChatGPT shows higher tolerance for ambiguous evaluation scenarios with bias toward user preference alignment. Claude attempts to convert subjective criteria into structured frameworks with consistent ethical guardrails. Gemini demonstrates discomfort with purely subjective assessment, consistently seeking external validation.
The Fifth Industrial Revolution mindset: Collaboration over replacement
This research demonstrates a core principle of the Fifth Industrial Revolution:
“AI doesn't replace human judgment - it amplifies and diversifies it.”
Rather than seeking the "best" AI judge, we discovered that different models excel in different evaluation dimensions, creating opportunities for human orchestration of AI capabilities. And to be honest, the best judge, AKA Dan Sodergren, had already made his decision - one that mirrored some of the AI thinking…

Multi-model collaboration represents the future of decision-making in the Fifth Industrial Revolution. Organizations are beginning to adopt ensemble approaches where ChatGPT handles creative evaluation, Claude provides structured analysis, and Gemini offers fact-based verification. This collaborative approach mirrors the broader shift toward human-AI partnerships that define our current industrial transformation.
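In code terms, such an ensemble is essentially a routing table plus a human in the loop. A minimal sketch, where the task-to-model mapping and the judges dictionary are assumptions for illustration:

```python
# Hypothetical orchestration layer: route each evaluation dimension to the
# model the research above suggests suits it best. The routing table and
# the judges mapping are illustrative assumptions, not a vendor API.
ROUTING = {
    "creative_merit": "chatgpt",      # originality- and engagement-weighted judging
    "structured_analysis": "claude",  # principled, criteria-driven judging
    "fact_check": "gemini",           # grounding claims against sources
}

def orchestrate(entry, judges):
    """judges maps a model name to a callable that takes a prompt and returns text."""
    return {dimension: judges[model](f"Assess the {dimension} of this entry:\n{entry}")
            for dimension, model in ROUTING.items()}

# Stub judges so the sketch runs end to end; swap in real API calls.
stub_judges = {name: (lambda prompt, n=name: f"[{n}'s view on: {prompt[:40]}...]")
               for name in ("chatgpt", "claude", "gemini")}
print(orchestrate("An AI-powered internal comms assistant.", stub_judges))
```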
Human oversight remains central to this new paradigm. The Fifth Industrial Revolution isn't about automating judgment but about understanding how different AI systems can enhance human decision-making capabilities. Each model's biases and strengths become tools in a larger toolkit, orchestrated by human intelligence that understands when and how to deploy each approach.
Implications for the Fifth Industrial Revolution workplace
The research reveals that no single model provides unbiased evaluation across all domains - a finding that perfectly illustrates Fifth Industrial Revolution principles. Each model's training methodology, architecture, and design philosophy creates systematic biases that become assets when properly orchestrated. For high-stakes evaluation scenarios, the evidence strongly suggests adopting ensemble approaches that leverage different models' specialized strengths rather than seeking universal AI solutions.
Strategic model deployment becomes a core skill in the Fifth Industrial Revolution workplace. Organizations achieving the best results strategically match model capabilities to evaluation tasks, recognizing that effective AI collaboration requires understanding and leveraging the unique strengths and limitations of different systems. This represents a fundamental shift from "AI adoption" to "AI orchestration."
This research - for what it is worth, really a geeky piece of fun - demonstrates that evaluation differences between AI models stem from fundamental design choices rather than simple performance gaps. Understanding these systematic differences is crucial for thriving in the future of work and the Fifth Industrial Revolution, where success depends on knowing how to collaborate with diverse AI systems rather than competing against them.
From theory to practice: Living the Fifth Industrial Revolution
This analysis emerged from hands-on experience judging AI award submissions, where I witnessed firsthand how different models can reach remarkably different conclusions about the same content. What started as a simple comparison became a deep dive into the collaborative potential that defines the Fifth Industrial Revolution - where understanding AI diversity becomes a competitive advantage.
The core insight aligns perfectly with Fifth Industrial Revolution principles: success comes from orchestrating AI collaboration, not seeking AI dominance. There's no single "best" AI model, only the right model for the right task at the right time. Understanding these technical differences empowers us to build the human-AI partnerships that will define the next phase of work, productivity, and innovation.
As we advance deeper into the Fifth Industrial Revolution, this "horses for courses" approach to AI becomes essential - something I talk about when I am training companies in how to use AI.
The future belongs to those who can skillfully conduct an orchestra of AI capabilities, knowing when to call upon ChatGPT's creativity, Claude's constitutional rigor, or Gemini's factual grounding. This isn't just about using AI - it's about mastering the art of AI collaboration that will shape how we work, think, and create in the decades ahead.
References for the piece
Dan Sodergren is a keynote speaker represented by Pomona Partners.
The Fifth Industrial Revolution book is here.
The king of AI will be judging our Best use of AI award! 👑 ⬇️
https://www.linkedin.com/feed/update/urn:li:activity:7336334218072936449/
Keynote speaker, professional speaker, TEDx talker, serial tech startup founder, ex-marketing agency owner: Dan Sodergren