Monarch is a powerful, all-in-one personal finance platform designed to make managing your finances feel simple again. Since launching in 2021, we've become the personal finance app most recommended by users and experts. Our goal? To take the stress out of finances so our members can focus on what truly matters.
We are a team of doers led by experienced entrepreneurs who are passionate about helping our members reach their financial goals. We are hyper-focused on building a product people love and on continuing to evolve it based on user feedback.
As a fully remote company (even before COVID!), we welcome applicants from almost anywhere. Our team collaborates synchronously mostly from 9 AM – 2 PM PT and embraces asynchronous work to stay connected across time zones.
Join us on our mission to transform lives by simplifying money, together.
The Role:
At Monarch, AI is the engine that will power the next generation of intelligent and intuitive product experiences for our users. We're looking for our first AI Platform Engineer to design, build, and own the central nervous system for our machine learning and large language model (LLM) initiatives.
This isn't a typical MLOps role. You won't just be managing pipelines; you'll be the architect of our AI infrastructure, creating the paved road that enables our product and AI teams to ship features faster, safer, and smarter. You will be the force multiplier for all things AI at Monarch, making critical decisions on everything from our retrieval architecture to how we manage a multi-model LLM strategy.
You'll collaborate closely with our Infrastructure team, who manage the underlying cloud, networking, and compute fabric. Your focus will be on the entire ML/LLM application layer: reliability, evaluation, safety, observability, and performance.
What You'll Do:
Build the Central AI Platform: Design and build a unified, resilient platform for deploying and serving AI features. This includes creating a routing layer with provider fallbacks, circuit breakers, and cost/latency-aware model selection. You'll also establish a central registry for versioning models and prompts and create robust CI/CD pipelines.
Architect for Scale and Quality: Own our end-to-end Retrieval-Augmented Generation (RAG) strategy. You'll lead the design of embedding pipelines, develop optimal chunking strategies, implement hybrid search, and manage index maintenance. Crucially, you'll build and scale our LLM evaluation tooling, using methods like golden sets, rubric-based scoring, and LLM-as-judge with bias controls.
Ensure Production Excellence: Instrument our AI systems with deep observability, including structured tracing, cost attribution, and latency metrics. Define and uphold SLOs, create incident response runbooks, and build the guardrails necessary for running mission-critical AI services.
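To give candidates a concrete feel for the routing layer described above, here is a minimal, purely illustrative sketch of cost-aware model selection with provider fallbacks and a simple circuit breaker. The provider names, costs, and stub clients are hypothetical, not a description of Monarch's actual stack:

```python
import time

# Hypothetical provider table: names, per-1K-token costs, and stub call functions.
# A real system would wrap actual SDK clients here.
PROVIDERS = {
    "provider_a": {"cost_per_1k_tokens": 0.010, "call": lambda prompt: f"[a] {prompt}"},
    "provider_b": {"cost_per_1k_tokens": 0.002, "call": lambda prompt: f"[b] {prompt}"},
}

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; retries after `cooldown` seconds."""
    def __init__(self, threshold=3, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        return (time.monotonic() - self.opened_at) >= self.cooldown

    def record(self, ok):
        if ok:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()

class Router:
    """Tries providers in ascending cost order, skipping any whose breaker is open."""
    def __init__(self, providers):
        self.providers = providers
        self.breakers = {name: CircuitBreaker() for name in providers}

    def complete(self, prompt):
        ranked = sorted(self.providers, key=lambda n: self.providers[n]["cost_per_1k_tokens"])
        for name in ranked:
            breaker = self.breakers[name]
            if not breaker.allow():
                continue  # breaker open: fall back to the next-cheapest provider
            try:
                result = self.providers[name]["call"](prompt)
                breaker.record(ok=True)
                return name, result
            except Exception:
                breaker.record(ok=False)
        raise RuntimeError("all providers unavailable")
```

In this sketch the router always tries the cheapest healthy provider first and only escalates when a breaker is open or a call fails; a production version would also weigh latency and per-request quality requirements.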
A Partnership with Infrastructure:
You Own: The LLM runtime, retrieval architecture (vector stores, indexing), evaluation frameworks, safety guardrails, prompt/model versioning, AI observability, and cost/latency optimization.
Infra Owns: The core cloud infrastructure (IaC), networking, secrets management, Kubernetes/GPU orchestration, and shared platform services.
Together You Own: SLAs/SLOs, rollout strategies, incident response protocols, and capacity planning for all AI services.
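As a purely illustrative sketch of the evaluation-framework side of this ownership split, here is what a tiny golden-set harness with rubric-based scoring might look like. All prompts, expected facts, and the stand-in model are hypothetical examples, not Monarch's real tooling:

```python
# Hypothetical golden set: prompts paired with facts a good answer must mention.
GOLDEN_SET = [
    {"prompt": "What is a 401k?", "must_mention": ["retirement", "employer"]},
    {"prompt": "Define APR.", "must_mention": ["annual", "rate"]},
]

def rubric_score(answer, must_mention):
    """Fraction of required facts present in the answer (case-insensitive)."""
    answer = answer.lower()
    hits = sum(1 for fact in must_mention if fact.lower() in answer)
    return hits / len(must_mention)

def evaluate(model, golden_set, threshold=0.5):
    """Runs the model over the golden set; returns the mean score and failing prompts."""
    scores, failures = [], []
    for case in golden_set:
        score = rubric_score(model(case["prompt"]), case["must_mention"])
        scores.append(score)
        if score < threshold:
            failures.append(case["prompt"])
    return sum(scores) / len(scores), failures

# A canned stand-in "model" for demonstration; a real harness would call the LLM runtime.
def fake_model(prompt):
    canned = {
        "What is a 401k?": "A 401k is an employer-sponsored retirement account.",
        "Define APR.": "APR is the annual percentage rate charged on a loan.",
    }
    return canned[prompt]
```

Real-world harnesses layer on rubric scoring by a judge model (with bias controls), dataset versioning, and regression gates in CI, but the shape is the same: a fixed golden set, a scoring function, and a pass/fail report.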
What You'll Bring:
5+ years of experience in software or machine learning engineering, with at least 2 years in a role focused on building and operating production ML/LLM systems.
A proven track record of shipping and scaling LLM-backed applications, with deep, hands-on expertise in the surrounding ecosystem.
Expertise in modern LLM retrieval systems, including hands-on work with embedding pipelines, hybrid search, chunking strategies, and index maintenance.
Demonstrated experience building robust LLM eval tooling (e.g., golden sets, rubric scoring, LLM-as-judge).
Practical knowledge of building resilient LLM routing and orchestration layers, incorporating provider fallbacks, circuit breakers, and cost/latency-aware selection.
Strong programming skills in Python and a history of building production-grade automation and services.
A strategic mindset, comfortable making build-vs-buy decisions and designing systems for long-term reliability and cost efficiency.
Nice to Haves:
Reproducible Training & Fine-Tuning: Experience building containerized, reproducible training jobs with robust experiment tracking (e.g., Weights & Biases, MLflow), dataset versioning, and standardized evaluation harnesses (e.g., lm-eval, HELM).
ML Serving & Orchestration: Kubernetes-native serving (KServe, Seldon), model servers (Triton), and workflow orchestrators.
Vector Databases: Hands-on experience with systems like OpenSearch, pgvector, Pinecone, or Weaviate at scale.
Agentic Systems: Designing and building multi-step, tool-using agents (e.g., using frameworks like LangGraph).
Security & Safety: Experience with red-teaming exercises, building adversarial tests, and implementing robust safety filters.
Typical Process:
Recruiter Video Call
Hiring Manager Video Call
Take-Home or Pairing Exercise
Virtual Onsite (2-3 rounds)
Reference Checks
Offer
Benefits:
Work wherever you want! As a fully remote company with no central office, we want you to work wherever you are happiest and most productive, whether that's your home, a co-working space, or elsewhere.
Competitive cash and equity compensation in a hyper-growth, early-stage company.
Stipend to set up your ideal working environment.
Competitive benefit plans based on your location (e.g., in the US we offer medical, dental, and vision benefits, plus the ability to contribute to a 401(k) plan).
Unlimited PTO.
A 3-day weekend every month! We take off the first Friday of every month to focus on rest, recuperation, or just having fun.
We are an equal opportunity employer and value diversity. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.