Senior ML Platform Engineer

Build end-to-end scalable ML training and deployment pipelines for real-time inference
Montreal
Senior
Mistplay

A platform offering users rewards for playing mobile games and engaging with in-app activities.

Senior ML Platform Engineer

Mistplay is the #1 loyalty app for mobile gamers. Our community of millions of engaged mobile gamers comes to Mistplay to discover new games to play and earn rewards. Gamers are rewarded for the time and money they spend within the games and can redeem those rewards for gift cards. Mistplay is on a mission to be the best way to play mobile games for everyone, everywhere!

Under the direction of the Director of Data and Machine Learning Platform, the Senior ML Platform Engineer within Mistplay's Data Team will play a key role in researching and developing machine learning solutions to solve complex business problems. The Senior ML Platform Engineer will work closely with a cross-functional team to identify areas for improvement and design and implement scalable solutions.

The experience required may span a wide variety of optimization and classification problems, e.g., collaborative filtering/recommendation, fraud detection, segmentation, propensity modeling, and text/sentiment classification.

Your Missions At Mistplay:

Design, build, and operate standardized training-to-serving pipelines with Airflow, covering artifact management, environment provisioning, packaging, deployment, and rollback for SageMaker endpoints.

Own real-time and batch inference on SageMaker: multi-model endpoints, serverless inference where appropriate, blue/green and canary strategies, autoscaling policies, and cost controls (spot strategies, instance right-sizing).

Implement ultra-low-latency serving patterns with Redis/Valkey: feature caching, online feature retrieval, request-scoped state, model response caching, and rate limiting/backpressure for bursty traffic.

Provision and manage ML/data infrastructure with Terraform: SageMaker endpoints/configs, ECR/ECS/EKS resources, networking/VPC endpoints, ElastiCache/Valkey clusters, observability stacks, secrets, and IAM.

Build platform abstractions and golden paths: Airflow DAG templates, CLI/SDKs, cookie-cutter repos, and CI/CD pipelines that take models from notebooks to production predictably.

Establish and run model lifecycle governance: model/feature registries, approval workflows, promotion policies, lineage, and audit trails integrated with Airflow runs and Terraform state.

Implement end-to-end observability: data/feature freshness checks, drift/quality gates, model performance/latency SLOs, infra health dashboards, tracing, and alerting—plus incident response and postmortems.

Partner with Security, SRE, and Data Engineering on private networking, policy-as-code, PII handling, least-privilege IAM, and cost-efficient architectures across environments.

Evaluate, integrate, and rationalize platform tooling (e.g., MLflow registry, feature stores, serving gateways); lead migrations with clear change management and minimal downtime.

What You'll Bring:

5+ years building and operating production-grade ML/data platforms with a focus on serving, reliability, and developer experience.

Strong software engineering in Python, Go, or Java; experience building resilient services, APIs, and automation tooling with high test coverage.

Deep experience with AWS SageMaker inference: endpoint configuration, containerization, model packaging, autoscaling, serverless vs. real-time trade-offs, MME, A/B and canary releases.

Expertise with in-memory stores such as Redis/Valkey used as online feature stores in ML serving contexts.

Proven Terraform experience managing ML and data infra end-to-end: modules, workspaces, drift detection, change reviews, and safe rollbacks; familiarity with GitOps patterns.

Airflow orchestration at scale: dependency modeling, sensors, retries, SLAs, backfills, DAG factories, and integrations with registries, artifact stores, and Terraform pipelines.

Familiarity with ML frameworks (scikit-learn, XGBoost, PyTorch, TensorFlow) from a platform-integration perspective to support diverse runtimes and containers.

Observability for ML workflows: metrics/logs/traces, performance profiling, capacity planning, cost monitoring, and runbooks.

Excellent communication and cross-functional collaboration with Data Science, Data Engineering, DevOps, and Backend.

We strive to make our work environment as inviting and fun as possible! Working at Mistplay comes with a whole array of perks that we've adopted virtually and in person: team lunches, game nights, company-wide events, and so much more. Our culture is deeply rooted in growth and upheld by a team of smart, dynamic, and enthusiastic people. We utilize data to constantly learn, improve, and adapt. We foster an environment where everyone is encouraged to share their ideas, push boundaries, take calculated risks, and witness their visions come to life.
