Technical Thought Leader For Performance And Resiliency Engineering
Serve as a technical thought leader driving the next phase of Walmart's Performance and Resiliency Engineering. Architect, build, and scale intelligent agentic AI/ML systems that proactively optimize speed, reliability, and business continuity across Walmart's global platforms. Operate at the intersection of engineering, data science, and business—translating visionary ideas into actionable architecture and tangible solutions.
About the Team:
Building the right technology foundation for infrastructure and platforms is vital to success at Walmart's scale. Our team builds and maintains the foundational technologies that support the tech organization, including data platforms, enterprise architecture, DevOps, cloud computing, and infrastructure. All of these products and services run on scalable, powerful infrastructure, ensuring a secure and seamless employee and customer experience across stores, digital channels, and distribution centers.
Key Responsibilities:
- AI/ML & Agentic System Leadership
  - Design, fine-tune, and deploy Generative AI models (including LLMs), retrieval-augmented generation (RAG) pipelines, and agentic frameworks (e.g., CrewAI) for performance monitoring, anomaly detection, and automated remediation.
  - Develop and optimize LLM-based agents for multi-step reasoning, knowledge grounding, and decision-making.
  - Architect scalable, distributed AI systems with a focus on performance, fault tolerance, and disaster recovery.
  - Integrate external data sources (vector databases, observability stacks) to build dynamic, context-aware, and self-healing systems.
  - Lead the development of LLM evaluation pipelines (factuality, consistency, relevance) and implement safety guardrails.
- Performance Engineering
  - Architect and implement AI/ML-driven solutions for continuous performance monitoring, automated tuning, and predictive scaling.
  - Establish and enforce performance benchmarks, SLAs, and SLOs; integrate performance testing into CI/CD pipelines.
  - Leverage advanced observability tools (Grafana, ELK, Splunk, Prometheus) and distributed tracing for actionable insights.
  - Optimize LLM inference (prompt caching, quantization, retrieval filtering) and system throughput.
- Resiliency & Chaos Engineering
  - Champion resilient architectures that maintain business continuity during failures or traffic spikes.
  - Lead chaos engineering initiatives: design and execute controlled failure scenarios, analyze impact, and drive improvements.
  - Leverage AI/ML for predictive failure detection, drift monitoring, and autonomous remediation.
  - Develop and maintain playbooks for critical and non-critical dependency failures and disaster recovery.
- Technical Leadership & Collaboration
  - Guide engineering teams on best practices, technical design, and architectural decisions for AI/ML and agentic systems.
  - Collaborate with data scientists, ML engineers, SRE, and product teams to operationalize AI/ML models and integrate them into production.
  - Mentor engineers, foster a culture of continuous learning, and contribute to internal platform standards and engineering playbooks.
  - Drive experimentation (A/B testing, multi-armed bandits, causal inference) and champion innovation.
- Product Integration & Delivery
  - Partner with cross-functional teams to deliver end-to-end, cloud-native solutions (GCP, Azure, Kubernetes, Docker).
  - Shape the architecture and roadmap for AI-powered performance and resiliency systems.
  - Ensure high standards for quality, security, and performance through rigorous design and code reviews.
What You'll Bring:
- Proven experience with LLMs, GenAI, RAG, agentic frameworks, and embedding-based workflows.
- Deep expertise in distributed systems, cloud-native architectures, and scalable microservices (GCP, Azure, Kubernetes, Docker).
- Strong programming skills: Python, Java, SQL; hands-on with ML frameworks (PyTorch, TensorFlow, Hugging Face Transformers).
- Experience with performance engineering, chaos engineering, and building resilient, fault-tolerant systems.
- Demonstrated success in technical leadership, mentoring, and cross-functional collaboration.
- Strong experimentation background (A/B testing, causal inference) and MLOps (CI/CD, monitoring, drift detection).
- Excellent communication skills; able to bridge technical and non-technical stakeholders.
Who You Are:
- A builder and innovator, passionate about solving complex problems with elegant, scalable, and resilient solutions.
- Thrive in fast-paced, high-impact environments and bring a growth mindset to learning and mentoring.
- Comfortable making architectural tradeoffs and driving technical strategy at scale.
- Committed to engineering excellence, open-source contribution, and continuous improvement.
About Walmart Global Tech:
Imagine working in an environment where one line of code can make life easier for hundreds of millions of people. That's what we do at Walmart Global Tech. We're a team of software engineers, data scientists, cybersecurity experts, and service professionals within the world's leading retailer who make an epic impact and are at the forefront of the next retail disruption. People are why we innovate, and people power our innovations. We are people-led and tech-empowered. We train our team in the skillsets of the future and bring in experts like you to help us grow. We have roles for those chasing their first opportunity as well as those looking for the opportunity that will define their career. Here, you can kickstart a great career in tech, gain new skills and experience for virtually every industry, or leverage your expertise to innovate at scale, impact millions, and reimagine the future of retail.
Walmart's culture is a competitive advantage, and it's fostered by being together. Working together in person allows us to collaborate, align quickly, and innovate with greater speed. We use our campuses to create purposeful connection rooted in deepening understanding and investing in the development of our associates.
Our Hubs:
Walmart is a global company with offices across the United States and around the world. Our global headquarters is in Bentonville, Arkansas, with primary hubs in the San Francisco Bay Area and New York/New Jersey.
Benefits:
Beyond our great compensation package, you can receive incentive awards for your performance. Other great perks include 401(k) match, stock purchase plan, paid maternity and parental leave, PTO, multiple health plans, and much more.
Equal Opportunity Employer:
Walmart, Inc. is an Equal Opportunity Employer – By Choice. We believe we are best equipped to help our associates, customers, and the communities we serve live better when we really know them. That means understanding, respecting, and valuing unique styles, experiences, identities, ideas, and opinions – while being inclusive of all people.