Operations Support Engineer
At Xtremax, our Operations Support Engineers play a key role in ensuring the reliability, stability, and efficiency of mission-critical systems. In this role, you'll work closely with developers, product managers, and user support teams to monitor system performance, resolve technical issues, and implement preventive measures. Your contributions will directly support smooth operations, high uptime, and reliable services for our users. Candidates with public sector experience are preferred, as this role supports IT projects for government agencies.
Responsibilities:
System Monitoring & Performance
- Monitor and analyse product runtime environments (production and non-production) to ensure optimal system performance.
- Implement continuous improvement strategies to enhance system reliability and efficiency.
Incident & Problem Management
- Manage application and security incidents, performing problem determination and coordinating with internal teams and vendors for resolution.
- Escalate issues as necessary to minimise business impact.
Operational Processes & Compliance
- Develop and maintain operations and process guides to meet audit and compliance requirements.
- Handle day-to-day operational activities, analyse performance data, and prepare status reports for stakeholders and management.
- Ensure operational processes align with IM8 and ISO 27001 standards.
- Conduct periodic compliance drills and support audit preparation.
Team Coordination & Support
- Lead operations teams and coordinate with vendors to ensure 24/7 system support availability.
- Facilitate communication between teams to resolve operational issues efficiently.
Automation & Proactive Operations
- Build self-healing systems with automated remediation for common alerts.
- Implement Infrastructure as Code (IaC) pipelines to reduce configuration drift caused by manual changes.
Observability & Incident Readiness
- Deploy full-stack monitoring with predictive analytics (e.g., Amazon CloudWatch anomaly detection, Google Cloud Monitoring (formerly Stackdriver), Azure Monitor).
- Integrate alerting with the central NOC/SOC for faster escalation and resolution.
Collaboration & Enablement
- Serve as the bridge between application and infrastructure teams, enabling self-service troubleshooting.
- Train agency teams on operational best practices and tool adoption (e.g., ITSM workflows, DevOps pipelines).