Hamel Husain
AI Consultant and Educator at Independent
AI consultant and co-creator of the definitive online course on evals (the number one course on Maven), who has taught over 2,000 PMs and engineers across 500 companies including OpenAI and Anthropic how to systematically measure and improve AI applications.
Episode Summary
Hamel Husain (alongside Shreya Shankar) turns evals from an obscure AI concept into a practical, approachable skill for product builders. He walks through building evals for a real estate AI assistant, explains why you should start with error analysis before writing any tests, introduces the "benevolent dictator" approach to setting eval criteria, and addresses major misconceptions, including the idea that you can simply have AI evaluate itself. The episode serves as the definitive primer on building evals for AI products.
Leadership Principles
- Evals is really data analytics on your LLM application: a systematic way of looking at data and creating metrics so you can iterate and improve with confidence
- Don't jump straight to writing tests. Start with error analysis and data analysis to ground what you should even test, because LLMs have much more surface area than traditional software
- Appoint a benevolent dictator whose taste you trust for eval criteria. It should be the person with domain expertise, often the product manager, not a committee
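To make the "error analysis first" idea concrete, here is a minimal Python sketch of what tallying open-coded review notes might look like. It assumes one trusted reviewer has already read a handful of traces and tagged each with failure labels. The trace records, the label names, and the helper functions are all hypothetical illustrations, not taken from the episode.

```python
from collections import Counter

# Hypothetical review notes: one domain expert reads each trace and tags it
# with zero or more failure labels (an empty list means the trace looked fine).
reviewed_traces = [
    {"trace_id": 1, "labels": ["missed_handoff_to_human"]},
    {"trace_id": 2, "labels": []},
    {"trace_id": 3, "labels": ["wrong_listing_details", "missed_handoff_to_human"]},
    {"trace_id": 4, "labels": ["tone_too_casual"]},
    {"trace_id": 5, "labels": []},
]


def failure_counts(traces):
    """Count how many reviewed traces carry each failure label."""
    counts = Counter()
    for trace in traces:
        # set() guards against the same label being entered twice on one trace
        counts.update(set(trace["labels"]))
    return counts


def failure_rate(traces):
    """Share of reviewed traces with at least one failure label."""
    if not traces:
        return 0.0
    failed = sum(1 for trace in traces if trace["labels"])
    return failed / len(traces)


counts = failure_counts(reviewed_traces)
for label, n in counts.most_common():
    print(f"{label}: {n}")
print(f"overall failure rate: {failure_rate(reviewed_traces):.0%}")
```

The most frequent labels suggest where a targeted eval is likely to pay off first, which is one way to decide what to test before writing any tests.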
Notable Quotes
"Evals is a way to systematically measure and improve an AI application. It really is, at its core, data analytics on your LLM application and a systematic way of looking at that data."
— On demystifying what evals actually are
"The top misconception is, 'We live in the age of AI. Can't the AI just eval it?' But it doesn't work."
— On the most common misconception about AI evals
"When you're doing this open coding, a lot of teams get bogged down in having a committee do this. You can appoint one person whose taste that you trust. It should be the person with domain expertise. Oftentimes, it is the product manager."
— On the benevolent dictator approach to eval criteria