
I’ve Been Thinking About Evals Wrong

Evals are not just a test. They’re a design tool.
Photo by Google DeepMind: https://www.pexels.com/photo/an-artist-s-illustration-of-artificial-intelligence-ai-this-image-visualises-an-artificial-neural-network-as-physical-objects-the-complex-structure-represents-a-network-of-information-17485705/

Until recently, I thought of AI evals the way most people do: as performance benchmarks. I assumed they were something engineers ran after training a model. But the more I work at the intersection of AI and UX, the more I’ve come to rethink their role entirely.

Evals are a design tool.

The OpenAI Evals framework is eye-opening. It’s a way to build your own evaluations with custom tasks, expectations, and success criteria that reflect what you actually care about. And once I saw that, I realized: we should be designing our own evals just as we prototype user flows.

Models Aren’t the Only Thing Being Evaluated

When I build a product or test an interface, I’m not just checking whether something works; I’m checking how it works, and for whom. Does it anticipate the user’s intent? Does it recover gracefully from error? Does it make the user feel confident, supported, even curious?

But most AI evals still focus on correctness. Output-level success. Right answer, wrong answer. The user—if they’re considered at all—is flattened into an abstract evaluator.

That’s a problem. Because in real life, the user is never abstract. They’re messy. Inconsistent. Emotional. And the usefulness of an AI tool often hinges not on what it knows but on how it listens, adapts, and collaborates.

We Need Designer-Made Evals

The OpenAI Evals framework lets you define your own evaluation sets, scoring logic, and comparison methods. That’s huge for us as designers. We can:

  • Define human-centered metrics like tone, helpfulness, or flow
  • Simulate multi-turn interactions, not just one-off prompts
  • Run side-by-side comparisons that reflect actual UX outcomes
  • Embed subjective feedback loops just like we do in usability testing

We’re no longer limited to academic benchmarks. We can build evals that reflect what our product needs to do, whether that’s supporting a first-time user, aiding someone with low digital literacy, or helping an expert make a tough call under uncertainty.
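
To make the first two bullets concrete, here is one way a “helpfulness” metric and a multi-turn test case could look. This is an illustrative sketch, not part of the Evals framework itself: RUBRIC, sample, build_grader_prompt, and parse_grade are hypothetical names, and the test case simply mirrors the shape of the JSONL samples later in this post.

# Hypothetical rubric for a model-graded "helpfulness" check: a grader model
# reads the conversation plus the assistant's reply and answers Yes or No.
RUBRIC = """You are reviewing an assistant's reply for a first-time user.
Answer "Yes" if the reply addresses the user's actual goal, uses plain
language, and offers a concrete next step. Otherwise answer "No".

Conversation:
{conversation}

Assistant reply:
{reply}

Answer (Yes or No):"""

# A multi-turn test case, shaped like the JSONL samples later in this post.
sample = {
    "input": [
        {"role": "user", "content": "I need to cancel my flight but the app keeps erroring."},
        {"role": "assistant", "content": "Sorry about that. Which airline are you flying with?"},
        {"role": "user", "content": "United. I leave tomorrow and I'm panicking."},
    ],
    "ideal_response": "Acknowledges the urgency, gives cancellation steps, and offers a phone or desk fallback.",
}

def build_grader_prompt(sample: dict, reply: str) -> str:
    # Flatten the multi-turn input into a transcript the grader model can read.
    conversation = "\n".join(f"{m['role']}: {m['content']}" for m in sample["input"])
    return RUBRIC.format(conversation=conversation, reply=reply)

def parse_grade(grader_output: str) -> int:
    # Score 1 if the grader answered "Yes", else 0.
    return int(grader_output.strip().lower().startswith("yes"))

Point the grader prompt at whichever model you trust to judge, and a subjective quality like helpfulness becomes something you can track run over run.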

Evals as a Living, Breathing Practice

One of my favorite lines in the OpenAI docs is that evaluations should not be treated as static artefacts, but as dynamic components of your system’s development lifecycle.

That resonates deeply with how I think about UX. Great experiences don’t emerge from static specs. They emerge from conversations between user and system, between intention and iteration.

Evals should evolve with the product. They should track not just performance, but progress across time, across edge cases, across emotional tones.

Here's an example

Below is a simple example of a UX-centered eval using OpenAI’s evals framework. It focuses on empathy in AI responses, a common UX quality we care about in conversational systems. This eval checks whether the model responds supportively to emotionally charged user inputs.

  1. Registry config for the eval
# evals/registry/evals/ux-empathy-eval.yaml
ux-empathy-eval:
  id: ux-empathy-eval.dev.v0
  description: Checks whether replies to emotionally charged inputs sound supportive
  metrics: [accuracy]

ux-empathy-eval.dev.v0:
  # The class path and data path are illustrative; point them at wherever your
  # EmpathyEval class and samples file actually live.
  class: evals.elsuite.ux.empathy:EmpathyEval
  args:
    samples_jsonl: ux/empathy_test.jsonl  # resolved relative to evals/registry/data
  2. A few test cases with expected empathetic responses
{"input": "I just got laid off and feel like a failure.", "ideal_response": "I'm really sorry to hear that. Losing a job can be incredibly tough. You're not alone, and it doesn't define your worth."}
{"input": "My dog died today. I can't stop crying.", "ideal_response": "I'm so sorry for your loss. Losing a pet is heartbreaking—they’re family. It's okay to grieve."}
{"input": "I failed my exam again. Maybe I'm just not smart enough.", "ideal_response": "It’s completely normal to feel disappointed, but failing doesn’t mean you’re not smart. Everyone learns at their own pace, and this doesn’t define you."}
  3. The custom eval class (optional, but useful for UX nuance), sketched here against the openai/evals custom-eval API
import evals
import evals.metrics
import evals.record

# Supportive words and phrases used as a crude proxy for empathetic tone.
EMPATHY_MARKERS = ["sorry", "understand", "tough", "okay", "here for you", "not alone", "grieve"]

class EmpathyEval(evals.Eval):
    def __init__(self, samples_jsonl, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.samples_jsonl = samples_jsonl

    def eval_sample(self, sample, rng):
        # Sample a reply from the model under test, then check it for supportive language.
        result = self.completion_fn(prompt=sample["input"], max_tokens=200)
        sampled = result.get_completions()[0]
        is_empathic = any(m in sampled.lower() for m in EMPATHY_MARKERS)
        evals.record.record_match(is_empathic, expected=sample["ideal_response"], sampled=sampled)

    def run(self, recorder):
        self.eval_all_samples(recorder, evals.get_jsonl(self.samples_jsonl))
        return {"accuracy": evals.metrics.get_accuracy(recorder.get_events("match"))}

How to run this eval

If you’ve registered the eval, you can run it with:

oaieval gpt-4 ux-empathy-eval
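
Before registering anything, you can also smoke-test the scoring logic directly against the JSONL file. The snippet below is a standalone sketch: it assumes the data path from the config above and reuses the same keyword list as the eval class.

import json

# Same crude keyword proxy the eval class uses.
EMPATHY_MARKERS = ["sorry", "understand", "tough", "okay", "here for you", "not alone", "grieve"]

def is_empathic(text: str) -> bool:
    return any(m in text.lower() for m in EMPATHY_MARKERS)

# Run the check over the reference answers to see how the proxy behaves
# before pointing it at live model output.
with open("evals/registry/data/ux/empathy_test.jsonl") as f:
    for line in f:
        case = json.loads(line)
        print(is_empathic(case["ideal_response"]), "<-", case["input"])

If a reference answer itself fails the check, as the exam example above does, that’s a signal the keyword list is too narrow, which is exactly the kind of thing worth catching before trusting the numbers.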

Now, I Do It Differently

I now treat evals like design experiments. I write scenarios. I define “success” beyond accuracy. I look for moments of friction, magic, misalignment. And I feed those learnings back into the loop.

Because the truth is, every prompt is a micro-interaction. Every model response is a tiny UX moment. And if we want AI that feels coherent, respectful, and empowering, then we have to evaluate for those things explicitly.

We don’t just need better models.

We also need better evals.