Design, code, and enhance custom software components across systems and applications. Use modern frameworks and agile practices to deliver scalable, high-performing solutions tailored to specific business needs.
Must-have skills: PySpark
Educational Qualification: 15 years of full-time education
We are seeking a skilled Data Engineer to design, build, and optimize scalable data pipelines on the Enterprise Data Platform (EDL) running on Cloudera Data Platform (CDP) on AWS. The role involves working with the Hadoop ecosystem, building PySpark-based data processing pipelines, orchestrating workflows with Oozie and Control-M, integrating with AWS services (S3, IAM, EC2), and delivering secure, reliable, cloud-ready data solutions.
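To illustrate the kind of pipeline this role builds, here is a minimal PySpark sketch that reads a Hive table on CDP, applies a simple transformation, and lands the result on S3. The database, table, column names, and S3 bucket are hypothetical placeholders, not details taken from the platform itself.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical job name; Hive support lets us read Hive/Impala tables on CDP.
spark = (
    SparkSession.builder
    .appName("edl_daily_load")
    .enableHiveSupport()
    .getOrCreate()
)

# Read a curated Hive table from the data lake (placeholder name).
orders = spark.table("edl_curated.orders")

# Simple transformation: keep yesterday's records and derive a metric.
daily = (
    orders
    .filter(F.col("order_date") == F.date_sub(F.current_date(), 1))
    .withColumn("net_amount", F.col("gross_amount") - F.col("discount"))
)

# Write the result to S3 as partitioned Parquet (bucket path is a placeholder).
(
    daily.write
    .mode("overwrite")
    .partitionBy("order_date")
    .parquet("s3a://example-edl-bucket/curated/daily_orders/")
)

spark.stop()
```

In practice, a job like this would typically be packaged as a spark-submit action inside an Oozie workflow or a Control-M job, with the schedule and SLA handling managed by the orchestrator rather than the script itself.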
Required Skills & Experience:
- 3–10+ years of experience in Data Engineering.
- Hands-on experience with Cloudera (CDH/CDP), Hadoop, HDFS, Hive/Impala, PySpark/Spark SQL, Oozie workflows/coordinators/SLAs, Control-M scheduling, and AWS S3 architecture.
- Strong Python development skills for ETL frameworks, exception handling, and testing.
- Strong Linux/shell scripting skills.
- Understanding of distributed systems concepts including shuffle, skew handling, broadcast joins, and spill tuning (see the sketch after this list).
- Experience with CI/CD tools such as Git, Jenkins, Azure DevOps, or GitHub Actions.
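As a small, hedged example of one of the tuning concepts listed above: a broadcast join avoids shuffling a large, potentially skewed fact table when the other side of the join is small enough to ship to every executor. The table names, join key, and column names below are assumptions for illustration only.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("broadcast_join_demo").getOrCreate()

large_fact = spark.table("edl_curated.transactions")  # large, possibly skewed
small_dim = spark.table("edl_curated.store_dim")      # small lookup table

# Broadcasting the small dimension copies it to every executor, so the large
# table is joined locally without a shuffle (and its skew) on the join key.
joined = large_fact.join(F.broadcast(small_dim), on="store_id", how="left")

joined.groupBy("region").agg(F.sum("amount").alias("total_amount")).show()
```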
Nice-to-Have: Cloudera Machine Learning (CML) or Cloudera Data Science Workbench (CDSW), Delta Lake / Iceberg / Hudi, Cloudera Manager administration, Data catalog & lineage (Atlas), Exposure to Kafka, NiFi, Informatica, HBase.
Core Competencies: Ownership and accountability, Strong analytical and performance tuning mindset, Collaboration with cross-functional teams, Excellent documentation and communication skills.