Evaluation is one of the most important pillars of building a frontier model and product - it informs the direction of research and development, powers our experimentation and ensures quality and alignment with our users.
To support this, we need a powerful pipeline and self-serve evaluation framework that helps poolsiders build, run and extract insight from evals easily.
In this role, you will design and implement this platform to build and run evaluations at scale.
Build a scalable self-serve evaluation platform to power our research and development