Data Engineer (Hadoop ecosystem)
Praesignis (Pty) Ltd
Johannesburg, Gauteng
Contract
Posted 21 October 2025
Job Description
We are seeking a skilled Data Engineer to design and develop scalable data pipelines that ingest raw, unstructured JSON data from source systems and transform it into clean, structured datasets within the Hadoop-based data platform. The ideal candidate will play a critical role in enabling data availability, quality, and usability by engineering the movement of data from the Raw Layer to the Published and Functional Layers.
Key Responsibilities:
- Design, build, and maintain robust data pipelines to ingest raw JSON data from source systems into the Hadoop Distributed File System (HDFS).
- Transform and enrich unstructured data into structured formats (e.g., Parquet, ORC) for the Published Layer using tools like PySpark, Hive, or Spark SQL.
- Develop workflows to further process and organize data into the Functional Layer, optimized for business reporting and analytics.
- Implement data validation, cleansing, schema enforcement, and deduplication as part of the transformation process (a minimal PySpark sketch of this flow follows this list).
- Collaborate with Data Analysts, BI Developers, and Business Users to understand data requirements and ensure datasets are production-ready.
- Optimize ETL/ELT processes for performance and reliability in a large-scale distributed environment.
- Maintain metadata, lineage, and documentation for transparency and governance.
- Monitor pipeline performance and implement error handling and alerting mechanisms.
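For illustration, a minimal PySpark sketch of the Raw-to-Published flow described above is shown below. The HDFS paths, column names, and partition key are hypothetical placeholders chosen for the example, not details of the actual platform.

```python
# Illustrative sketch only: layer paths, column names, and the partition key
# are placeholder assumptions, not details taken from the role description.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (DoubleType, StringType, StructField,
                               StructType, TimestampType)

spark = (
    SparkSession.builder
    .appName("raw-to-published-example")
    .enableHiveSupport()
    .getOrCreate()
)

# Explicit schema enforcement instead of relying on schema inference.
schema = StructType([
    StructField("event_id", StringType(), nullable=False),
    StructField("event_ts", TimestampType(), nullable=True),
    StructField("customer_id", StringType(), nullable=True),
    StructField("amount", DoubleType(), nullable=True),
])

# Ingest raw JSON landed in the Raw Layer on HDFS.
raw_df = spark.read.schema(schema).json("hdfs:///data/raw/events/")

published_df = (
    raw_df
    # Basic validation/cleansing: drop records missing the mandatory key.
    .filter(F.col("event_id").isNotNull())
    # Deduplicate on the business key.
    .dropDuplicates(["event_id"])
    # Derive a partition column for efficient downstream reads.
    .withColumn("event_date", F.to_date("event_ts"))
)

# Write structured, partitioned Parquet to the Published Layer.
(
    published_df.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("hdfs:///data/published/events/")
)
```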
Technical Skills & Experience:
- 3+ years of experience in data engineering or ETL development within a big data environment.
- Strong experience with Hadoop ecosystem tools: HDFS, Hive, Spark, YARN, and Sqoop.
- Proficiency in PySpark, Spark SQL, and HQL (Hive Query Language).
- Experience working with unstructured JSON data and transforming it into structured formats.
- Solid understanding of data lake architectures: Raw, Published, and Functional layers.
- Familiarity with workflow orchestration tools like Airflow, Oozie, or NiFi.
- Experience with schema design, data modeling, and partitioning strategies (see the Spark SQL sketch after this list).
- Comfortable with version control tools (e.g., Git) and CI/CD processes.
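To illustrate the Hive/Spark SQL and partitioning skills listed above, the following sketch registers the Published Layer output from the previous example as a partitioned external Hive table and runs a partition-pruned query. The database, table, column names, and paths are again hypothetical placeholders.

```python
# Illustrative sketch only: database, table, columns, and paths are assumptions
# chosen to show the kind of partitioned external table implied by the role.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("published-table-example")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("CREATE DATABASE IF NOT EXISTS published")

# Expose the Published Layer Parquet output as a partitioned external Hive
# table so analysts can query it with Hive or Spark SQL.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS published.events (
        event_id    STRING,
        event_ts    TIMESTAMP,
        customer_id STRING,
        amount      DOUBLE
    )
    PARTITIONED BY (event_date DATE)
    STORED AS PARQUET
    LOCATION 'hdfs:///data/published/events/'
""")

# Register partitions written by the pipeline.
spark.sql("MSCK REPAIR TABLE published.events")

# Example of a partition-pruned query used for reporting.
spark.sql("""
    SELECT event_date, COUNT(*) AS event_count
    FROM published.events
    WHERE event_date >= DATE '2025-01-01'
    GROUP BY event_date
    ORDER BY event_date
""").show()
```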
Nice to Have:
- Experience with data cataloging and governance tools (e.g., Apache Atlas, Alation).
- Exposure to cloud-based Hadoop platforms like AWS EMR, Azure HDInsight, or GCP Dataproc.
- Experience with containerization (e.g., Docker) and/or Kubernetes for pipeline deployment.
- Familiarity with data quality frameworks (e.g., Deequ, Great Expectations).