AI as a (Human) Judge

The GenAI Corporate Pre-revolution

Yariv Adan
6 min read · Nov 28, 2024

Twenty years ago, Daniel Kahneman shattered economics’ most sacred myth: the rational human. His Nobel Prize-winning research revealed how deeply flawed our decision-making really is. Now, as generative AI storms into the business world, we’re about to demolish another fundamental assumption — this time in the field of business management: the idea of the rational, well-oiled corporate machine powered by a well-trained, motivated, and efficient workforce.

The myth of the intelligent corporation

Over the past two years, I’ve had the privilege of working alongside teams implementing GenAI across numerous successful organizations — first while leading applied AI projects at Google Cloud, and more recently as an advisor and investor at Ellipsis Venture. While everyone’s racing to embrace AI automation, most companies are stumbling over a surprisingly basic hurdle: they don’t actually understand their own processes.

The recipe for AI automation seems deceptively simple:

  1. Map out your current process and desired automation
  2. Identify the decision-making data (hidden in CRMs, ERPs, catalogs, contracts, SLAs, guidelines, and numerous other systems and documents)
  3. Document clear best practices with real-world examples
  4. Set concrete performance targets against your current baseline (see the sketch after this list)
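
To make the recipe concrete, here is a minimal, hypothetical sketch (in Python) of what the output of these four steps could look like when captured as a single spec. Every name and number in it is illustrative, not taken from any real deployment:

# Hypothetical sketch: the four recipe steps captured as one automation spec.
from dataclasses import dataclass

@dataclass
class Metrics:
    avg_handle_time_min: float  # how long a case takes
    cost_per_case_usd: float    # what a case costs
    error_rate: float           # fraction of cases needing rework

@dataclass
class ProcessSpec:
    name: str
    steps: list[str]                   # step 1: the mapped process
    data_sources: list[str]            # step 2: where decision data lives
    best_practice_examples: list[str]  # step 3: documented real-world examples
    baseline: Metrics                  # step 4: current performance
    target: Metrics                    # step 4: what the automation must beat

invoice_triage = ProcessSpec(
    name="invoice-triage",
    steps=["receive invoice", "match to PO", "flag exceptions", "approve or escalate"],
    data_sources=["ERP", "CRM", "contracts repository", "SLA guidelines"],
    best_practice_examples=["examples/invoice_042_annotated.json"],
    baseline=Metrics(avg_handle_time_min=18.0, cost_per_case_usd=7.50, error_rate=0.06),
    target=Metrics(avg_handle_time_min=6.0, cost_per_case_usd=2.50, error_rate=0.03),
)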

But here’s the kicker: while technical barriers to AI adoption are being removed, we’re uncovering a far more fundamental problem. Most businesses are flying blind when it comes to their own operations:

  • Process maps exist mainly in people’s heads, with reality rarely matching documentation
  • Critical decision-making data is scattered across systems or buried in informal channels
  • “Good performance” remains frustratingly undefined, with vague or non-existent quality metrics
  • Baseline metrics? Most companies can’t tell you how much time, money, or resources their current processes consume

This isn’t an AI problem — it’s the AI opportunity exposing the chaos lurking beneath the polished surface of modern business operations.

AI as a Mirror: Questioning Our Assumptions About Human Performance

“How can we trust AI in production?” It’s a question I hear constantly, followed by familiar concerns: AI hallucinates, lacks reasoning, makes stupid mistakes, can’t follow instructions, and operates as a black box.

All valid points. But lately, I’ve started asking a different question: How does human performance really measure up against these same criteria?

Let’s be honest:

Consistency? The performance gap between your best and weakest team members is staggering — whether they’re salespeople, managers, developers, or executives. Even the same person can vary dramatically in their performance, especially under stress or tedium.

Factual Accuracy? We rarely question when employees speak with authority, yet how many of your sales team can accurately detail your latest product features and roadmap? How many managers have a truly holistic view of customer metrics, deployment times, and partnership success rates? People confidently “hallucinate” answers daily.

Complex Reasoning? Watch how quickly plans fall apart when tasks exceed a certain complexity. Many professionals struggle with basic mathematical modeling. Our solution? Break everything into smaller pieces and split into teams — leading to the infamous corporate “politics” and communication breakdowns.

Following Instructions? Between egos, personal agendas, and simple forgetfulness, the gap between understanding and execution is often a chasm.

Transparency? Ask someone to explain their decision-making process or accurately log their daily activities. Better yet, try getting honest feedback during performance reviews — the very tool we use to measure organizational effectiveness.

Continuous Improvement? Unlike AI, which can be instantly replicated once improved, humans need individual, repeated training — fighting against attendance issues, engagement problems, and constant turnover.

Here’s the uncomfortable truth: Once a company grows beyond its founder’s direct control, we see what I call “employee mentality” emerge. Not everyone brings the same passion and dedication as a founder would. The result? Inefficiency and neglect become the natural entropy of business operations.

This isn’t about dismissing human value — it’s about honestly assessing our baseline before dismissing AI’s imperfections as a blocker for improvement.

The unexpected hero: AI as a Judge

Here’s an intriguing twist: what if AI’s greatest contribution isn’t replacing humans, but helping us understand how we actually perform, and helping us improve?

AI engineers and data scientists use what we call “AI (or Large Language Model) as a judge” — large models prompted or fine-tuned to evaluate outputs that resist objective measurement. Think of tasks like summarizing meetings, handling customer support calls, or crafting business communications. When we deploy AI agents to handle these tasks, we need a way to assess their performance consistently and objectively.

See the following example of an AI judge from this outstanding post by Hamel Husain. It was used to help Honeycomb build their Query Assistant feature, and it evaluates whether the AI generated good queries:

You are a Honeycomb query evaluator with advanced capabilities to judge if a query is good or not.
You understand the nuances of the Honeycomb query language, including what is likely to be
most useful from an analytics perspective.

Here is information about the Honeycomb query language:
{{query_language_info}}

Here are some guidelines for evaluating queries:
{{guidelines}}

Example evaluations:

<examples>

<example-1>
<nlq>show me traces where ip is 10.0.2.90</nlq>
<query>
{
  "breakdowns": ["trace.trace_id"],
  "calculations": [{"op": "COUNT"}],
  "filters": [{"column": "net.host.ip", "op": "=", "value": "10.0.2.90"}]
}
</query>
<critique>
{
  "critique": "The query correctly filters for traces with an IP address of 10.0.2.90 and counts the occurrences of those traces, grouped by trace.trace_id. The response is good as it meets the requirement of showing traces from a specific IP address without additional complexities.",
  "outcome": "good"
}
</critique>
</example-1>

<example-2>
<nlq>show me slowest trace</nlq>
<query>
{
  "calculations": [{"column": "duration_ms", "op": "MAX"}],
  "orders": [{"column": "duration_ms", "op": "MAX", "order": "descending"}],
  "limit": 1
}
</query>
<critique>
{
  "critique": "While the query attempts to find the slowest trace using MAX(duration_ms) and ordering correctly, it fails to group by trace.trace_id. Without this grouping, the query only shows the MAX(duration_ms) measurement over time, not the actual slowest trace.",
  "outcome": "bad"
}
</critique>
</example-2>

</examples>
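
In practice, a judge like this is just a model call plus JSON parsing. Below is a minimal sketch of how one might invoke it, assuming the OpenAI Python client; the model choice, file name, and helper function are my assumptions, not part of Hamel’s post:

# Minimal sketch of running an LLM judge over a single (nlq, query) pair.
# Assumes: `pip install openai`, an OPENAI_API_KEY in the environment, and the
# evaluator prompt shown above saved to judge_prompt.txt.
import json
from openai import OpenAI

client = OpenAI()
JUDGE_PROMPT = open("judge_prompt.txt").read()  # the evaluator prompt above

def judge_query(nlq: str, query: dict) -> dict:
    """Ask the judge model for a critique and a good/bad outcome."""
    user_msg = f"<nlq>{nlq}</nlq>\n<query>\n{json.dumps(query, indent=2)}\n</query>"
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any capable model works as the judge
        response_format={"type": "json_object"},  # ask for strict JSON back
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": user_msg},
        ],
    )
    return json.loads(response.choices[0].message.content)

verdict = judge_query(
    "show me slowest trace",
    {"calculations": [{"column": "duration_ms", "op": "MAX"}], "limit": 1},
)
print(verdict["outcome"], "-", verdict["critique"])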

But here’s the provocative idea: Why limit these assessment tools to AI agents? What if we turned these same evaluation tools toward our human workforce?

As a business leader, you don’t need to make any risky changes to your operations. Simply take the AI evaluation systems you’re already developing to assess AI agents and apply them to your current human processes (see the sketch after this list). The insights could be transformative:

  • Finally get objective measurements of your service quality
  • Identify best practices from your top performers
  • Spot patterns in customer interactions
  • Uncover training opportunities
  • Establish clear, measurable quality standards
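
Here is a minimal sketch of that reuse, under the same assumptions as the earlier snippet: a hypothetical rubric prompt scores logged human support transcripts, and outcomes are tallied per agent. The rubric file, export format, and field names are all illustrative:

# Hypothetical sketch: score logged human support transcripts with the same
# judge pattern used for AI agents, then aggregate outcomes per person.
import json
from collections import Counter
from openai import OpenAI

client = OpenAI()
RUBRIC_PROMPT = open("support_rubric_prompt.txt").read()  # assumed rubric file

def judge_transcript(transcript: str) -> dict:
    """Return a JSON critique, e.g. {"critique": "...", "outcome": "good"}."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any capable model
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": RUBRIC_PROMPT},
            {"role": "user", "content": transcript},
        ],
    )
    return json.loads(response.choices[0].message.content)

# Aggregate good/bad outcomes per human agent to surface coaching needs.
outcomes = Counter()
for record in json.load(open("support_transcripts.json")):  # assumed export
    verdict = judge_transcript(record["transcript"])
    outcomes[(record["agent"], verdict["outcome"])] += 1
print(outcomes.most_common())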

The irony? By using AI to understand human performance, you might dramatically improve your organization’s effectiveness before you ever deploy a single AI agent.

This isn’t about replacing humans — it’s about finally having the tools to understand, measure, and improve how we work.

Epilogue: When Culture Meets AI — A New Chapter in Corporate DNA

Every iconic company has its North Star — a cultural mantra that defines its essence. “Don’t be evil.” “Think different.” “It’s always day 1.” “Focus on the user and everything will follow.” These aren’t just slogans; they’re supposed to be the beating heart of how businesses operate (at least according to the founders).

But here’s the reality: as organizations grow, these crystalline principles get filtered through thousands of human lenses. They’re bent by personal agendas, warped by fatigue, and diluted by the simple complexity of being human. That’s not a flaw — it’s just how people work.

Enter AI agents, offering an unexpected twist: they’ll execute your company’s values with unwavering consistency. No Monday blues. No office politics. No drift from the mission.

Yet this raises profound questions: What happens when corporate values can be coded into perfect execution? When consistency becomes the norm rather than the aspiration? When culture becomes programmable?

We’re not just automating tasks — we’re about to discover what our corporate values truly mean when they’re followed to the letter. It’s both exciting and sobering. Welcome to the next chapter of organizational culture. A brave new world!

Written by Yariv Adan

Ex-Google Product Executive and AI expert turned early-stage AI startup investor. Curious about everything AI, especially the intersection of AI & Art.
