Design and implement agent evaluation pipelines that benchmark AI capabilities across real-world enterprise use cases.
Build domain-specific benchmarks for product support, engineering ops, go-to-market (GTM) insights, and other verticals relevant to modern SaaS.
Develop performance benchmarks that measure latency, safety, cost-efficiency, and user-perceived quality, and use them to drive optimization.
Create search- and retrieval-oriented benchmarks, including multilingual query handling, annotation-aware scoring, and context relevance.
Partner with AI and infra teams to instrument models and agents with detailed telemetry for outcome-based evaluation.
Drive human-in-the-loop and programmatic testing methodologies for hard-to-quantify metrics such as helpfulness, intent alignment, and resolution effectiveness.
Contribute to the company's open evaluation tooling and benchmarking frameworks, shaping how the broader ecosystem thinks about SaaS AI performance.