Let me guess – you just built your first RAG system. You chunked your docs, picked a fancy embedding model, and now you’re crushing it on all the benchmarks. Life is good, right?

Well, not so fast.

In our recent YAAP episode, I (Yuval Belfer) sat down with Niv Granot, who leads the tools team at AI21 Labs and is responsible for knowledge-intensive capabilities like retrieval and web search. What he revealed about RAG evaluation might keep you up at night.

The Uncomfortable Truth About RAG Evaluation

Here’s the thing: we’re all playing a game with rules that don’t match reality. Current RAG evaluation is like training for a marathon by doing sprints – sure, you’re running, but you’re missing the bigger picture.

[Image: RAG pipelines with local questions]

Problem #1: The Chunking Catch-22

Imagine you’re writing a history book about World War II. Do you make each chapter cover a broad topic, or break everything down into tiny, detailed sections? That’s basically the chunking dilemma in RAG.

Current benchmarks are like getting a pre-divided book and being told “Good luck!” But in the real world? You’re the one making those tough calls about how to split your content.

Here’s the catch-22:

  • Big chunks? You get the big picture but lose the details
  • Small chunks? You’ve got all the facts but lost the story

It’s like trying to understand a movie by either watching only the trailer or looking at individual frames. Neither approach tells the whole story.
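To make the trade-off concrete, here's a minimal sketch of naive fixed-size chunking run at two sizes over the same toy paragraph. The helper and the sample text are invented for illustration; they aren't anything from AI21's stack, nor a recommended splitting strategy.

```python
# Minimal sketch of naive fixed-size chunking (toy text, illustrative only).
def split_into_chunks(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into fixed-size character windows with optional overlap."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

document = (
    "In 1944 the Allies landed in Normandy. The operation, code-named Overlord, "
    "followed months of deception campaigns. By late August, Paris was liberated."
)

# Big chunks keep the narrative thread but blur the individual facts in the embedding;
# small chunks isolate the facts but strip away the story that connects them.
big_chunks = split_into_chunks(document, chunk_size=120)
small_chunks = split_into_chunks(document, chunk_size=40)
print(len(big_chunks), "big chunks vs.", len(small_chunks), "small chunks")
```

Whatever sizes you pick, that decision is made before retrieval ever runs, and it's exactly the decision a pre-chunked benchmark never asks you to defend.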

Problem #2: The “It’s All in One Place” Myth

Most RAG benchmarks assume information is like a needle in a haystack – there’s one perfect spot where your answer lives. Neat, tidy, and completely unrealistic.

Real-world information is more like solving a puzzle. The pieces are scattered across different documents, and sometimes you need to understand the whole picture to know what you’re even looking for.

The Seinfeld Test

Here’s a perfect example from Niv: Ask any Seinfeld fan about George’s favorite alias, and they’ll instantly say “Art Vandelay.” But ask a RAG system the same question, and watch it crash and burn – even with access to every single episode transcript.

Why? Because there’s never a moment where George says “My favorite alias is Art Vandelay.” Instead, it’s a running gag that spans multiple episodes. You need to understand the context, the frequency, and the character to know why “Art Vandelay” trumps other aliases like “T-Bone.”

That’s the kind of nuanced retrieval that current benchmarks don’t even attempt to evaluate.
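To see what that looks like in practice, here's a toy aggregation over invented episode snippets (the transcripts, episode IDs, and alias list below are all made up for illustration). The answer only emerges from counting across the whole corpus, which is precisely the step a retrieve-one-chunk-and-answer pipeline never takes:

```python
# Toy illustration with invented snippets: no single chunk states the answer,
# but a corpus-level tally makes it obvious.
from collections import Counter

transcripts = {
    "episode_1": "George introduces himself as Art Vandelay, an importer-exporter.",
    "episode_2": "George claims to be Art Vandelay to get past the receptionist.",
    "episode_3": "Kramer starts calling George 'T-Bone', and George runs with it.",
    "episode_4": "George books the meeting under the name Art Vandelay once again.",
}

alias_mentions = Counter()
for episode, text in transcripts.items():
    for alias in ("Art Vandelay", "T-Bone"):
        if alias in text:
            alias_mentions[alias] += 1

# "Favorite" is a property of frequency across episodes, not of any one passage.
print(alias_mentions.most_common(1))  # [('Art Vandelay', 3)]
```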

[Image: "If only I knew I'd end up in a RAG benchmark"]

Breaking Free from the Benchmark Trap

Right now, we’re stuck in a vicious cycle:

  1. Build RAG systems for flawed benchmarks
  2. Celebrate our awesome benchmark scores
  3. Watch real users struggle
  4. Create new benchmarks with the same problems
  5. Rinse and repeat

As Niv’s team at AI21 discovered, you can optimize for benchmarks all day long, but that doesn’t necessarily translate to happy users. Recent approaches like Microsoft’s GraphRAG are starting to break free from this cycle by thinking beyond simple text chunking, but we’ve still got a long way to go.
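GraphRAG itself is considerably more involved, but the core move it gestures at, linking chunks through shared entities so retrieval can follow relationships rather than surface similarity, can be sketched in miniature. Everything below, from the toy chunks to the hard-coded entity list, is a naive illustration and not Microsoft's implementation:

```python
# Miniature sketch of relationship-aware retrieval: chunks that mention the same
# entity are linked, so a query can reach related context even when the wording differs.
# Hard-coded entities and toy chunks, purely illustrative.
from collections import defaultdict

chunks = {
    "c1": "The billing service calls the payments API before issuing an invoice.",
    "c2": "Last quarter the payments API was migrated to the new auth system.",
    "c3": "Invoices started failing for customers created after the auth system migration.",
}
entities = ["payments API", "auth system", "invoice"]

entity_to_chunks = defaultdict(set)
for cid, text in chunks.items():
    for ent in entities:
        if ent.lower() in text.lower():
            entity_to_chunks[ent].add(cid)

def neighbors(chunk_id: str) -> set[str]:
    """Chunks connected to chunk_id through at least one shared entity."""
    linked = set()
    for cids in entity_to_chunks.values():
        if chunk_id in cids:
            linked |= cids
    return linked - {chunk_id}

# Starting from the chunk a similarity search would find (c3, the failing invoices),
# the entity links pull in the chunks that explain why: the migration and the billing path.
print(neighbors("c3"))  # {'c1', 'c2'} (set order may vary)
```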

So What’s Next?

If we want RAG systems that actually work in the real world, we need to start evaluating them differently. That means:

  1. Testing how robust they are to different document-splitting (chunking) choices
  2. Checking whether they can piece together information that's scattered across documents (both of these are sketched right after this list)
  3. Seeing whether they understand relationships between documents
  4. Making sure they get the bigger picture, not just the details
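
To give the first two items some shape, here's a rough sketch of what such a test could look like: a toy corpus, a deliberately naive word-overlap retriever standing in for real embeddings, and a scoring rule that only counts a question as answered when every supporting document shows up in the retrieved set. All names and data are placeholders, not an existing benchmark or AI21's tooling:

```python
# Rough sketch of an evaluation harness: vary the chunking, and require that every
# evidence document for a question surfaces in retrieval. Toy data, naive retrieval.

def chunk(text: str, size: int) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def overlap_score(chunk_text: str, question: str) -> int:
    # Deliberately naive stand-in for embedding similarity: shared-word count.
    return len(set(chunk_text.lower().split()) & set(question.lower().split()))

def evaluate(corpus: dict[str, str], questions: list[dict],
             chunk_sizes: list[int], top_k: int = 3) -> dict[int, float]:
    results = {}
    for size in chunk_sizes:
        # Re-chunk the whole corpus for this configuration, remembering each chunk's source doc.
        index = [(doc_id, c) for doc_id, text in corpus.items() for c in chunk(text, size)]
        hits = 0
        for q in questions:
            ranked = sorted(index, key=lambda item: overlap_score(item[1], q["question"]), reverse=True)
            retrieved_docs = {doc_id for doc_id, _ in ranked[:top_k]}
            # Scattered-evidence criterion: the question only counts if *all* supporting docs were retrieved.
            if set(q["evidence_docs"]) <= retrieved_docs:
                hits += 1
        results[size] = hits / len(questions)
    return results  # chunk size -> fraction of questions fully supported by retrieval

corpus = {
    "doc_a": "The outage began when the cache layer was misconfigured during the rollout.",
    "doc_b": "Customer complaints spiked three hours after the rollout, mostly about timeouts.",
}
questions = [{
    "question": "What caused the timeout complaints after the rollout?",
    "evidence_docs": ["doc_a", "doc_b"],
}]
print(evaluate(corpus, questions, chunk_sizes=[32, 256]))
```

How the numbers move as the chunk size changes is itself the signal: a system that only looks good at one arbitrary splitting is a system that only looks good on the benchmark.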

The Bottom Line

The hard truth? Most of us are doing RAG evaluation wrong. We’re optimizing for artificial scenarios while missing the messier, more complex reality of how information actually works.

But hey, recognizing the problem is the first step to fixing it. And now that you know what’s wrong with RAG evaluation, you can start thinking about how to do it right.

Want to hear more insights about RAG systems and AI21’s approach to knowledge-intensive tools? Check out the full episode of YAAP wherever you get your podcasts.