Data Engineer (Hadoop ecosystem)
Praesignis (Pty) Ltd
Johannesburg, Gauteng
Contract
Posted 21 October 2025

Job Description

We are seeking a skilled Data Engineer to design and develop scalable data pipelines that ingest raw, semi-structured JSON data from source systems and transform it into clean, structured datasets within the Hadoop-based data platform. The ideal candidate will play a critical role in enabling data availability, quality, and usability by engineering the movement of data from the Raw Layer to the Published and Functional Layers.


Key Responsibilities:

  • Design, build, and maintain robust data pipelines to ingest raw JSON data from source systems into the Hadoop Distributed File System (HDFS).
  • Transform and enrich semi-structured data into structured formats (e.g., Parquet, ORC) for the Published Layer using tools like PySpark, Hive, or Spark SQL.
  • Develop workflows to further process and organize data into Functional Layers optimized for business reporting and analytics.
  • Implement data validation, cleansing, schema enforcement, and deduplication as part of the transformation process (a brief illustrative sketch of these steps follows this list).
  • Collaborate with Data Analysts, BI Developers, and Business Users to understand data requirements and ensure datasets are production-ready.
  • Optimize ETL/ELT processes for performance and reliability in a large-scale distributed environment.
  • Maintain metadata, lineage, and documentation for transparency and governance.
  • Monitor pipeline performance and implement error handling and alerting mechanisms.
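
A brief, purely illustrative sketch of such a Raw-to-Published step in PySpark is shown below. It is not the team's actual pipeline: the HDFS paths, field names, and schema are hypothetical, and the same transformation could equally be expressed in Hive or Spark SQL.

    # Illustrative only: reads raw JSON with an enforced schema, validates,
    # deduplicates, and writes date-partitioned Parquet to the Published Layer.
    # Paths and field names (event_id, event_ts, ...) are assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("raw_to_published").getOrCreate()

    # Enforce a schema instead of relying on inference.
    schema = StructType([
        StructField("event_id", StringType(), nullable=False),
        StructField("event_type", StringType(), True),
        StructField("event_ts", TimestampType(), True),
        StructField("payload", StringType(), True),
    ])

    raw = spark.read.schema(schema).json("hdfs:///data/raw/events/")  # Raw Layer (hypothetical path)

    published = (
        raw
        .filter(F.col("event_id").isNotNull())            # basic validation
        .dropDuplicates(["event_id"])                      # deduplication
        .withColumn("event_date", F.to_date("event_ts"))   # derive partition column
    )

    (published.write
        .mode("overwrite")
        .partitionBy("event_date")
        .parquet("hdfs:///data/published/events/"))        # Published Layer (hypothetical path)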


Technical Skills & Experience:

  • 3+ years of experience in data engineering or ETL development within a big data environment.
  • Strong experience with Hadoop ecosystem tools: HDFS, Hive, Spark, YARN, and Sqoop.
  • Proficiency in PySpark, Spark SQL, and HQL (Hive Query Language).
  • Experience working with semi-structured JSON data and transforming it into structured formats.
  • Solid understanding of data lake architectures: Raw, Published, and Functional layers.
  • Familiarity with workflow orchestration tools like Airflow, Oozie, or NiFi (see the orchestration sketch after this list).
  • Experience with schema design, data modeling, and partitioning strategies.
  • Comfortable with version control tools (e.g., Git) and CI/CD processes.
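
As a hedged illustration of the orchestration point above, the sketch below shows one way the Raw-to-Published job might be scheduled with Airflow. It assumes Airflow 2.4+ with the apache-airflow-providers-apache-spark package installed and a "spark_default" connection configured; the DAG name, schedule, and job path are hypothetical.

    # Orchestration sketch (assumptions as noted above).
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

    with DAG(
        dag_id="raw_to_published_events",                 # hypothetical DAG name
        start_date=datetime(2025, 1, 1),
        schedule="@daily",                                # daily batch; adjust to real SLAs
        catchup=False,
    ) as dag:
        SparkSubmitOperator(
            task_id="raw_to_published",
            application="/opt/jobs/raw_to_published.py",  # hypothetical path to the PySpark job
            conn_id="spark_default",
        )

Oozie or NiFi would be equally valid choices for scheduling the same step.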


Nice to Have:

  • Experience with data cataloging and governance tools (e.g., Apache Atlas, Alation).
  • Exposure to cloud-based Hadoop platforms like AWS EMR, Azure HDInsight, or GCP Dataproc.
  • Experience with containerization (e.g., Docker) and/or Kubernetes for pipeline deployment.
  • Familiarity with data quality frameworks (e.g., Deequ, Great Expectations).