Job Summary
The Senior AI Data Engineer is responsible for designing, building, and optimizing enterprise-scale data and AI infrastructure to support machine learning models, generative AI applications, and real-time analytics. The role drives the development of end-to-end data pipelines, from ingestion to production-ready AI data products, ensuring scalability, performance, and compliance across multi-cloud environments.
Accountability & Responsibilities
- Design, build, and maintain scalable ETL/ELT data pipelines using modern data engineering tools (e.g., Apache Spark, dbt).
- Architect and implement Lakehouse data platforms (Delta Lake, Apache Iceberg, Apache Hudi) following Medallion architecture (Bronze/Silver/Gold).
- Develop real-time streaming pipelines using Apache Kafka, Apache Flink, and Spark Structured Streaming.
- Build and optimize AI/GenAI data pipelines for LLM training, fine-tuning, and inference (tokenization, dataset curation, prompt engineering datasets).
- Design and implement Retrieval-Augmented Generation (RAG) pipelines, including embedding workflows and vector database integration.
- Manage feature stores for real-time and batch machine learning use cases.
- Integrate data pipelines with AI/ML platforms (Databricks MLflow, Azure ML, AWS SageMaker, Vertex AI, OpenAI/Azure OpenAI).
- Implement data orchestration workflows using Apache Airflow or similar tools with CI/CD pipelines.
- Ensure data quality, governance, and security using frameworks such as Great Expectations, along with data catalog tools.
- Deploy and manage infrastructure using Infrastructure-as-Code tools (Terraform, Bicep, CDK).
- Collaborate with Data Scientists, ML Engineers, and Solution Architects to deliver production-ready AI solutions.
- Lead technical design decisions, mentor junior engineers, and contribute to data platform strategy.
- Maintain documentation, data contracts, and operational runbooks for all pipelines.
Requirements
1 – Required Experience
- Bachelor's or Master's degree in Computer Science, Data Engineering, or a related field.
- 4–5 years of experience in data engineering, with strong exposure to AI/ML data infrastructure.
- Proven experience building scalable data pipelines and working with large-scale datasets.
- Hands-on experience with AI/ML platforms and modern data architectures.
- Experience in regulated industries (e.g., Banking, Telecom, Healthcare) is a plus.
- Strong problem-solving, analytical thinking, and communication skills.
- Experience working in cross-functional teams and agile environments.
2 – Technical Skills
- Strong SQL and advanced data modeling techniques
- Apache Spark (PySpark, Spark SQL, Streaming)
- Python (pandas, PySpark, data processing libraries)
- Data pipeline orchestration (Apache Airflow)
- CI/CD for data pipelines (GitHub Actions / Azure DevOps)
- Lakehouse architectures (Delta Lake / Iceberg / Hudi)
- Streaming technologies (Kafka, Flink)
- Cloud platforms (AWS / Azure / GCP)
- Vector databases (Pinecone, Weaviate, pgvector, OpenSearch)
- RAG pipeline design and LLM data processing
- Infrastructure-as-Code (Terraform / Bicep / CDK)
- Containers (Docker, Kubernetes)
- Data quality & governance tools