As an Observability Distinguished Engineer, you will be a key researcher and technical lead in the architecture and development of cloud native observability designs, managed services, and real-time telemetry software systems. You will use your depth of engineering experience to create visionary software architectures and telemetry systems that form an observability software product portfolio. You will also design, develop, and implement large-scale distributed systems that process large volumes of data, focusing on scalability, latency, and fault tolerance in every system built. You must be able to communicate effectively and build collaboration across all areas and levels of the business and engineering. An ideal candidate will be adept at architecting large-scale distributed systems, proficient in Java, and experienced in socializing architectural designs and roadmaps with internal and external customers.
To deliver software solutions and designs, you will use multiple telemetry technologies, such as data models, metric libraries, data logging, distributed tracing, data lakes, data correlation, rule-based alerting engines, real-time data streaming pipelines, time series databases (TSDBs), and application performance management (APM). Working in a cloud infrastructure ecosystem consisting of VMs, Kubernetes, and containers, you will create metric software designs and solutions that enable real-time monitoring and alerting on system and application metrics. You will lead research initiatives for cloud native designs and implementations within public and private clouds. You will also use TSDBs together with correlation and data fusion of multiple data types and heterogeneous data streams, coupled with Artificial Intelligence (AI) and learned behaviors, for anomaly detection and forward projections of expected system and application behavior.
This role involves collaboration with enterprise architects, product managers, data scientists, engineers, and business managers to bring telemetry R&D projects into production. To that end, you will use a combination of open source and commercial off-the-shelf (COTS) technologies to solve real-time telemetry problems at enterprise-wide scale. In parallel, you will lead the design of new systems and the redesign of existing systems to meet business requirements, changing needs, and the integration of state-of-the-art technology. You will be an evangelist for the Observability foundation, socializing technology designs and implementations with engineering and business customers.
Location: Open to Sunnyvale, CA; Seattle, WA; and Bentonville, AR
BS/MS in Computer Science, Engineering, or equivalent, with 15+ years in software engineering, design, and architecture
Deep understanding of the Java language and associated frameworks, with previous development of Java applications, libraries, SDKs, or services
Strong architecture leadership with demonstrated enterprise-level software implementations
Demonstrated architectural leadership in the research, evaluation, and creation of software designs and in distributed software implementations in production
Experience with technical leadership, software roadmaps, research and development, new software initiatives, and coordination and engagement with customers and engineering teams
Full-stack cloud software development experience
Experience with the following:
API development, integration, and utilization
Cloud technologies and cloud native designs
Cloud infrastructures and technologies, such as OpenStack, Azure, GCP, or AWS
Large-scale distributed systems, including scalability and fault tolerance
TSDBs (InfluxDB, KairosDB, Cortex, Thanos, Prometheus, or equivalent)
Extract, transform, and load (ETL) processes
Real-time telemetry pipelines and publish/subscribe models (Kafka or equivalent)
Data warehousing, data lakes, processing and data analytics
SQL (Azure SQL, PostgreSQL, or equivalent), with a solid foundation in advanced SQL
Unix/Linux shell scripting or similar programming/scripting knowledge
Real-time monitoring and alerting: metric agents, real-time dashboards, alerting rules
Excellent written and verbal communication skills for conveying engineering subject matter to diverse audiences
Ability to document requirements, architectural designs, and analysis findings in both business and technical terminology
Software development in an Agile, iterative CI/CD environment
Promote and support company policies, procedures, mission, values, and standards of ethics and integrity
Knowledge and/or use of agentic AI: Model Context Protocol (MCP) servers, Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), Natural Language Processing (NLP)
Fluency in Python, JavaScript, advanced shell scripting, and configuration management tools (Ansible, Chef, Puppet)
Experience with the following:
Application Performance Monitoring (APM) and/or Distributed Tracing
Deployment of Kubernetes, containers, service meshes, and microservices
Microservices architectures, Istio, and Micrometer
OpenTelemetry standards and protocols
Go development
Observability tools and system architectures
Creating and maintaining managed metric services
NoSQL (Cassandra, Cosmos DB, or equivalent)
Storm, Spark, or similar real-time streaming software
Knowledge of UI development (JavaScript, HTML, CSS) and experience with frameworks such as React and AngularJS
Involvement in and contributions to open-source software communities
Demonstrated background in developing software systems and