Improving the reliability of our main product, Suite One, as we scale to different regions across the globe
Creating and maintaining innovative, automated solutions, tooling and alerting frameworks to improve the reliability of our production systems
Proactively addressing application, platform and database performance and reliability issues
Automating our infrastructure, testing, failover solutions, failure mitigation, and much more
Maintaining documentation and "runbooks" to assist with operational management
Conducting post-incident reviews and implementing their findings to drive continual improvement
Assisting with on-call rotations and processes
Educating, training and promoting our culture of ownership to help our engineering teams better understand the production impact of their changes
Assisting with the development and maintenance of our software delivery frameworks
Working closely with internal partners and teams to ensure that we ship software that meets security, SLA, and performance requirements
Tech stack:
Azure
Kubernetes (AKS), Docker
Linux, Terraform, Ansible, Helm
GitLab, GitLab CI/CD pipelines
Prometheus, Grafana, ELK
Jira, Confluence
Product Development Stack:
.NET Core, Angular, Python
MSSQL Server, Postgres
Redis, Cosmos DB, RabbitMQ
Skills and experience:
Senior-level engineer with 4+ years of experience across both Development and Operations
A proven track record of improving application stability and performance
Experience designing, building, and operating large-scale production systems
A proven track record of working with cloud computing platforms such as Amazon Web Services, Microsoft Azure, or Google Cloud Platform
Proven experience with, and appreciation of, Infrastructure-as-Code (IaC) practices
Experience automating infrastructure, testing, and deployments using tools like Ansible, Chef, or Terraform
Experience working with container technologies such as Docker and Kubernetes
Experience working across DevSecOps pipelines and tooling
Experience with monitoring and observability solutions such as New Relic or OpenTelemetry
Experience debugging complex problems
Desirable: Experience working with large production data sets
Desirable: A good understanding of ASP.NET
Personal attributes:
Great communication and collaboration skills when working with other engineers, product managers, and business stakeholders
Independent, proactive, and able to deliver production-ready solutions with minimal guidance
Empathetic and authentic
Inquisitive and interested
Driven
Self-motivated and diligent
Optimistic and courageous