Ensure Reliability & Performance: Own the observability of our systems, ensuring they meet established service-level objectives (SLOs) and maintain high availability.
Cloud & Container Orchestration: Deploy, configure, and manage resources on Google Cloud Platform (GCP) and Google Kubernetes Engine (GKE), focusing on secure and scalable infrastructure.
Infrastructure Automation & Tooling: Set up and maintain automated build and deployment pipelines; drive continuous improvements to reduce manual work and risk.
Monitoring & Alerting: Develop and refine comprehensive monitoring solutions (performance, uptime, error rates, etc.) to detect issues early and minimize downtime.
Incident Management & Troubleshooting: Participate in on-call rotations; manage incidents through resolution, investigate root causes, and write blameless postmortems to prevent recurrence.
Collaboration with Development: Partner with development teams to design and release services that are production-ready from day one, emphasizing reliability, scalability, and performance.
Security & Compliance: Integrate security best practices into system design and operations; maintain compliance with SOC 2 and other relevant standards.
Performance & Capacity Planning: Continuously assess system performance and capacity; propose and implement improvements to meet current and future demands.
We are a company committed to creating inclusive environments where people can bring their full, authentic selves to work every day. We are an equal opportunity employer that believes everyone matters. Qualified candidates will receive consideration for employment opportunities without regard to race, religion, sex, age, marital status, national origin, sexual orientation, citizenship status, disability, or any other status or characteristic protected by applicable laws, regulations, and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application or recruiting process, please send a request to HR@insightglobal.com.