Production Support Engineer III
The Production Support Engineer III is responsible for ensuring the operational integrity, availability, and performance of mission-critical systems. This role involves managing technical incidents, troubleshooting recurring issues, and implementing permanent solutions to maintain system stability. The Engineer will collaborate with cross-functional teams to resolve incidents efficiently and improve system resiliency through proactive monitoring and automation.
Essential Duties And Responsibilities
- Identify, triage, and resolve medium-to-high priority incidents with minimal supervision to limit the impact on business operations.
- Collaborate with development teams, business partners, and other stakeholders to diagnose and resolve technical issues, implementing long-term fixes to prevent incident recurrence.
- Use monitoring tools (e.g., Splunk, Dynatrace, CloudWatch) to detect performance issues and execute corrective actions promptly.
- Enhance system observability to proactively detect issues and improve overall system performance and stability.
- Develop and maintain automation scripts to streamline routine production support tasks, reducing manual interventions.
- Implement automation strategies to improve production stability and minimize downtime.
- Maintain clear and detailed documentation of troubleshooting procedures, contributing to the shared knowledge base.
- Help improve the incident, problem, and change management processes in line with ITIL best practices.
- Participate in root cause analysis and suggest process improvements to enhance system stability and performance.
- Collaborate with cross-functional teams in resolving recurring production support issues and optimizing workflows.
- Actively mentor junior support engineers, fostering technical growth within the team.
- Escalate complex or unresolved issues to senior engineers or technical experts when necessary.
Qualifications
Required Qualifications
- Bachelor's degree in Computer Science, Information Systems, Engineering, or a related discipline.
- Six to ten years of experience in production support or related technical roles.
- Experience managing incident response, triage, and production support functions in both on-premises and cloud environments.
- Proficiency with IT Service Management (ITSM) tools such as ServiceNow, and familiarity with incident, problem, and change management processes.
- Strong experience with monitoring tools such as Dynatrace, Splunk, or CloudWatch for proactive issue detection and troubleshooting.
- Understanding of infrastructure, application technology stacks, and the software development lifecycle.
- Strong analytical and problem-solving skills with a focus on root cause analysis.
- Ability to work independently, handle medium-to-high complexity issues, and escalate critical problems to senior staff as needed.
Preferred Qualifications
- Experience supporting Agile teams and processes.
- Financial services industry experience.
- Familiarity with Site Reliability Engineering (SRE) practices.
Other Job Requirements / Working Conditions
- Sitting: Constantly (more than 50% of the time)
- Standing: Frequently (25% - 50% of the time)
- Walking: Frequently (25% - 50% of the time)
- Visual / Audio / Speaking: Able to access and interpret client information received from the computer, and able to hear and speak with individuals in person and on the phone.
- Manual Dexterity / Keyboarding: Able to operate standard office equipment, including PC keyboard and mouse, copy/fax machines, and printers.
- Availability: Able to work all scheduled hours, including overtime as directed by a manager/supervisor and required by business need.
- Travel: Minimal, up to 10%
Truist is an Equal Opportunity Employer that does not discriminate on the basis of race, gender, color, religion, citizenship or national origin, age, sexual orientation, gender identity, disability, veteran status, or other classification protected by law. Truist is a Drug Free Workplace.