Visa is a world leader in payments and technology, with over 259 billion payments transactions flowing safely between consumers, merchants, financial institutions, and government entities in more than 200 countries and territories each year. Our mission is to connect the world through the most innovative, convenient, reliable, and secure payments network, enabling individuals, businesses, and economies to thrive while driven by a common purpose – to uplift everyone, everywhere by being the best way to pay and be paid.
Make an impact with a purpose-driven industry leader. Join us today and experience Life at Visa.
Visa's Technology Organization is a community of problem solvers and innovators reshaping the future of commerce. We operate the world's most sophisticated processing networks capable of handling more than 65k secure transactions a second across 80M merchants, 15k Financial Institutions, and billions of everyday people. While working with us you'll get to work on complex distributed systems and solve massive scale problems centered on new payment flows, business and data solutions, cyber security, and B2C platforms.
The Opportunity:
As a Staff Site Reliability Engineer in Product Reliability Engineering, you will be part of a team that maintains and supports open source Hadoop, Big Data, Kafka, and Cloud platforms, ensuring their availability, performance, and reliability while improving operational efficiency. You will be responsible for driving innovation for our partners and clients, within Visa and globally.
The Work Itself:
Essential Functions:
· Design, build, and manage Hadoop and Kafka clusters on cloud platforms (AWS, GCP, and Azure).
· Manage and optimize Open Source Apache Hadoop, Big Data, and Kafka clusters for high performance, reliability, and scalability.
· Develop tools and processes to monitor and analyze system performance and to identify potential issues.
· Collaborate with other teams to design and implement solutions that improve the reliability and efficiency of the on-premises Hadoop, Big Data, and cloud platforms.
· Perform effective root cause analysis of major production incidents and develop learning documentation.
· Plan and perform capacity expansions and upgrades in a timely manner to avoid scaling issues and bugs.
· Tune alerting and set up observability to proactively identify issues and performance problems.
· Create standard operating procedure documents and guidelines on effectively managing and utilizing the platforms.
· Apply DevOps tools, disciplines (incident, problem, and change management), and standards in day-to-day operations.
· Develop tools and automation to minimize manual effort, scripting in Ansible, Python, Java, or any other programming language.
The Skills You Bring:
· Energy and Experience: A growth mindset that is curious and passionate about technologies and enjoys challenging projects on a global scale.
· Challenge the Status Quo: Comfort in pushing the boundaries, "hacking" beyond traditional solutions.
· Language Expertise: Expertise in one or more general development languages (e.g., Java, Python) or full stack development.
· Learner: Constant drive to learn new technologies.
· Partnership: Experience collaborating with Engineering, Application, and other functional teams.
This is a hybrid position. Expectation of days in office will be confirmed by your hiring manager.