Data Engineer (Hadoop ecosystem)
Praesignis (Pty) Ltd
Johannesburg, Gauteng
Contract
Posted 21 October 2025
Job Description
We are seeking a skilled Data Engineer to design and develop scalable data pipelines that ingest raw, unstructured JSON data from source systems and transform it into clean, structured datasets within the Hadoop-based data platform. The ideal candidate will play a critical role in enabling data availability, quality, and usability by engineering the movement of data from the Raw Layer to the Published and Functional Layers.
Key Responsibilities:
- Design, build, and maintain robust data pipelines to ingest raw JSON data from source systems into the Hadoop Distributed File System (HDFS).
- Transform and enrich unstructured data into structured formats (e.g., Parquet, ORC) for the Published Layer using tools like PySpark, Hive, or Spark SQL.
- Develop workflows to further process and organize data into the Functional Layer, optimized for business reporting and analytics.
- Implement data validation, cleansing, schema enforcement, and deduplication as part of the transformation process (a minimal PySpark sketch of this flow follows this list).
- Collaborate with Data Analysts, BI Developers, and Business Users to understand data requirements and ensure datasets are production-ready.
- Optimize ETL/ELT processes for performance and reliability in a large-scale distributed environment.
- Maintain metadata, lineage, and documentation for transparency and governance.
- Monitor pipeline performance and implement error handling and alerting mechanisms.
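For illustration, a minimal PySpark sketch of the Raw-to-Published flow described above is shown below. The HDFS paths, column names, and partition key are hypothetical placeholders chosen for the example, not details of the actual platform.

```python
# Illustrative sketch only: layer paths, column names, and the partition key
# are placeholder assumptions, not details taken from the role description.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (DoubleType, StringType, StructField,
                               StructType, TimestampType)

spark = (
    SparkSession.builder
    .appName("raw-to-published-example")
    .enableHiveSupport()
    .getOrCreate()
)

# Explicit schema enforcement instead of relying on schema inference.
schema = StructType([
    StructField("event_id", StringType(), nullable=False),
    StructField("event_ts", TimestampType(), nullable=True),
    StructField("customer_id", StringType(), nullable=True),
    StructField("amount", DoubleType(), nullable=True),
])

# Ingest raw JSON landed in the Raw Layer on HDFS.
raw_df = spark.read.schema(schema).json("hdfs:///data/raw/events/")

published_df = (
    raw_df
    # Basic validation/cleansing: drop records missing the mandatory key.
    .filter(F.col("event_id").isNotNull())
    # Deduplicate on the business key.
    .dropDuplicates(["event_id"])
    # Derive a partition column for efficient downstream reads.
    .withColumn("event_date", F.to_date("event_ts"))
)

# Write structured, partitioned Parquet to the Published Layer.
(
    published_df.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("hdfs:///data/published/events/")
)
```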
Technical Skills & Experience:
- 3+ years of experience in data engineering or ETL development within a big data environment.
- Strong experience with Hadoop ecosystem tools: HDFS, Hive, Spark, YARN, and Sqoop.
- Proficiency in PySpark, Spark SQL, and HQL (Hive Query Language).
- Experience working with unstructured JSON data and transforming it into structured formats.
- Solid understanding of data lake architectures: Raw, Published, and Functional layers.
- Familiarity with workflow orchestration tools like Airflow, Oozie, or NiFi.
- Experience with schema design, data modeling, and partitioning strategies (see the Spark SQL sketch after this list).
- Comfortable with version control tools (e.g., Git) and CI/CD processes.
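To illustrate the Hive/Spark SQL and partitioning skills listed above, the following sketch registers the Published Layer output from the previous example as a partitioned external Hive table and runs a partition-pruned query. The database, table, column names, and paths are again hypothetical placeholders.

```python
# Illustrative sketch only: database, table, columns, and paths are assumptions
# chosen to show the kind of partitioned external table implied by the role.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("published-table-example")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("CREATE DATABASE IF NOT EXISTS published")

# Expose the Published Layer Parquet output as a partitioned external Hive
# table so analysts can query it with Hive or Spark SQL.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS published.events (
        event_id    STRING,
        event_ts    TIMESTAMP,
        customer_id STRING,
        amount      DOUBLE
    )
    PARTITIONED BY (event_date DATE)
    STORED AS PARQUET
    LOCATION 'hdfs:///data/published/events/'
""")

# Register partitions written by the pipeline.
spark.sql("MSCK REPAIR TABLE published.events")

# Example of a partition-pruned query used for reporting.
spark.sql("""
    SELECT event_date, COUNT(*) AS event_count
    FROM published.events
    WHERE event_date >= DATE '2025-01-01'
    GROUP BY event_date
    ORDER BY event_date
""").show()
```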
Nice to Have:
- Experience with data cataloging and governance tools (e.g., Apache Atlas, Alation).
- Exposure to cloud-based Hadoop platforms like AWS EMR, Azure HDInsight, or GCP Dataproc.
- Experience with containerization (e.g., Docker) and/or Kubernetes for pipeline deployment.
- Familiarity with data quality frameworks (e.g., Deequ, Great Expectations).