
Senior Platform Engineer - Remote Eligible

Build a scalable platform to support real-time and batch GenAI workloads across multi-GPU systems
Remote
Senior
Quantiphi

An AI-first digital engineering company specializing in machine learning and data science solutions to transform businesses.

Platform Engineer

While technology is the heart of our business, a global and diverse culture is the heart of our success. We love our people, and we take pride in fostering a culture built on transparency, diversity, integrity, learning, and growth. If working in an environment that encourages you to innovate and excel, in your personal life as well as your professional one, interests you, you will enjoy a career with Quantiphi!

Quantiphi is an award-winning Applied AI and Big Data software and services company, driven by a deep desire to solve transformational problems at the heart of businesses. Our signature approach combines groundbreaking machine learning research with disciplined cloud and data-engineering practices to create breakthrough impact at unprecedented speed.

Quantiphi has seen 2.5x year-over-year growth since its inception in 2013; we don't just innovate, we lead. Headquartered in Boston, we have 4,000+ Quantiphi professionals across the globe. As an Elite/Premier Partner for Google Cloud, AWS, NVIDIA, Snowflake, and others, we've been recognized with:

  • 17x Google Cloud Partner of the Year awards in the last 8 years.
  • 3x AWS AI/ML award wins.
  • 3x NVIDIA Partner of the Year titles.
  • 2x Snowflake Partner of the Year awards.
  • We have also garnered top analyst recognitions from Gartner, ISG, and Everest Group.
  • We offer first-in-class industry solutions across Healthcare, Financial Services, Consumer Goods, Manufacturing, and more, powered by cutting-edge Generative AI and Agentic AI accelerators.
  • We have been certified as a Great Place to Work for the third year in a row (2021, 2022, and 2023).

Be part of a trailblazing team that's shaping the future of AI, ML, and cloud innovation. Your next big opportunity starts here!

We are seeking experienced Platform Engineers with expertise in MLOps and distributed systems, particularly Kubernetes, along with a strong background in managing multi-GPU, multi-node deep learning job and inference scheduling. Candidates should be proficient with Linux (Ubuntu) systems, able to write intricate shell scripts, comfortable with configuration management tools, and have a sufficient understanding of deep learning workflows.

Roles and Responsibilities:

  • Orchestrating LLM Workflows & Development: Design, implement, and scale the underlying platform that supports GenAI workloads, whether real-time or batch. Workloads can range from fine-tuning and distillation to inference.
  • LLMOps (LLM Operations): Build and manage operational pipelines for training, fine-tuning, and deploying LLMs such as Llama, Mistral, GPT-3/4, BERT, or similar. Ensure smooth integration of these models into production systems.
  • GPU Optimization: Optimize GPU utilization and resource management for AI workloads, ensuring efficient scaling, low latency, and high throughput in model training and inference. Develop techniques to manage multi-GPU systems for high-performance computation. Have a clear understanding of LLM parallelization techniques as well as other inference optimization techniques.
  • Infrastructure Design & Automation: Design, deploy, and automate scalable, secure, and cost-effective infrastructure for training and running AI models. Work with cloud providers (AWS, GCP, Azure) to provision the necessary resources, implement auto-scaling, and manage distributed training environments.
  • Platform Reliability & Monitoring: Implement robust monitoring systems to track the performance, health, and efficiency of deployed AI models and workflows. Troubleshoot issues in real time and optimize system performance for seamless operations. Transferable experience from monitoring traditional software in production is acceptable; monitoring experience with ML/GenAI workloads is preferred.
  • Database Knowledge: Good knowledge of database concepts, ranging from performance tuning and RBAC to sharding, along with exposure to different types of databases, from relational to object and vector databases, is preferred.
  • Collaboration with AI/ML Teams: Work closely with data scientists, machine learning engineers, and product teams to understand and support their platform requirements, ensuring the infrastructure is capable of meeting the needs of AI model deployment and experimentation.
  • Security & Compliance: Ensure that platform infrastructure is secure, compliant with organizational policies, and follows best practices for managing sensitive data and AI model deployment.

Required Skills & Qualifications:

  • 3+ years of experience in platform engineering, DevOps, or systems engineering, with a strong focus on machine learning and AI workloads.
  • Proven experience working with LLM workflows, and GPU-based machine learning infrastructure.
  • Hands-on experience in managing distributed computing systems, training large-scale models, and deploying AI systems in cloud environments.
  • Strong knowledge of GPU architectures (e.g., NVIDIA A100, V100, etc.), multi-GPU systems, and optimization techniques for AI workloads.

Technical Skills:

  • Proficiency in Linux systems and command-line tools. Strong scripting skills (Python, Bash, or similar).
  • Expertise in containerization and orchestration technologies (e.g., Docker, Kubernetes, Helm).
  • Experience with cloud platforms (AWS, GCP, Azure); tools such as Terraform, Terragrunt, or similar infrastructure-as-code solutions; and exposure to automating CI/CD pipelines using Jenkins, GitLab, GitHub, etc.
  • Familiarity with machine learning frameworks (TensorFlow, PyTorch, etc.) and deep learning model deployment pipelines. Exposure to vLLM or the NVIDIA software stack for data and model management is preferred.
  • Expertise in performance optimization tools and techniques for GPUs, including memory management, parallel processing, and hardware acceleration.

Soft Skills:

  • Strong problem-solving skills and ability to work on complex system-level challenges.
  • Excellent communication skills, with the ability to collaborate across technical and non-technical teams.
  • Self-motivated and capable of driving initiatives in a fast-paced environment.

Good to Have Skills:

  • Experience in building or managing machine learning platforms, specifically for generative AI models or large-scale NLP tasks.
  • Familiarity with distributed computing frameworks (e.g., Dask, MPI, PyTorch DDP) and data pipeline orchestration tools (e.g., AWS Glue, Apache Airflow, etc.).
  • Knowledge of AI model deployment frameworks such as TensorFlow Serving, TorchServe, vLLM, Triton Inference Server.
  • Good understanding of LLM inference and how to optimize self-managed infrastructure.
  • Understanding of AI model explainability, fairness, and ethical AI considerations.
  • Experience in automating and scaling the deployment of AI models on a global infrastructure.
  • Prior experience with, or strong awareness of, the NVIDIA ecosystem: Triton Inference Server, CUDA, NVAIE, TensorRT, NeMo, etc.
  • Strong skills in Kubernetes (GPU Operator), Linux, and AI deployment and experimentation tools.

What is in it for you:

  • Be part of a team and company that has won NVIDIA's AI Services Partner of the Year three times in a row with an unparalleled track record of building production AI applications on DGX and Cloud GPUs.
  • Strong peer learning which will accelerate your learning curve across Applied AI, GPU Computing and other softer aspects such as technical communication.
  • Exposure to working with highly experienced AI leaders at Fortune 500 companies and innovative market disruptors looking to transform their business with Generative AI.
  • Access to state-of-the-art GPU infrastructure on the cloud and on-premise.
  • Be part of the fastest-growing AI-first digital transformation and engineering company in the world.